# Data Analytics Fall 2025 &mdash; Exercises 1

### Dinesh Bisht (last modified: Sun 31 Aug)

### Deadline: Around Tue 16 Sep (to be specified)

- Five problems
- Minor variations between users
- Theme: Python & Numpy (no Pandas allowed)
- Theory: see <tt>public/exrc_01</tt>
- Make a copy of the original notebook (e.g. <tt>File $\rightarrow$ Save Notebook As</tt>)<br/>
  and add your answers (new cells) there
- Please make both your code and your notebook readable
- Keep your folder structure up to date by running the config script:

In [None]:
import os
os.system('/usr/bin/bash /home/varpha/dan/config.sh');

## Problem 1. Documentation
- Browse through the Python and Numpy documentation
- Find a function that a) interests you, and b) has messy documentation
- Play with the function and find simple use cases
- Explain the function to your anonymous peer reviewer.

Please write a nice and clear explanation. Include some elementary examples.

## Solution 1
- Used ChatGPT to format and organise the content

# NumPy `isin` Function Guide

The **`numpy.isin`** function tests if each element of an array is present in a second array. It returns a boolean array of the same shape as the first input array, where `True` indicates that the element was found in the second array and `False` indicates it was not.


## Basic Syntax

``` python
numpy.isin(element, test_elements)
```

-   **element**: The input array whose elements you want to test. This can be an array-like object (e.g., a NumPy array or a list).
-   **test_elements**: The array of values you're searching for. This can also be an array-like object.



## Basic Examples

### 1. Membership Check

``` python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
b = [20, 40, 60]

result = np.isin(a, b)
print(result)
```

**Output:** [False  True False  True False]

**Explanation**: Only `20` and `40` from array `a` are present in `b`. The resulting boolean array can be used as a mask to filter the original array, which is a very common and powerful use case.


### 2. Filtering Elements

``` python
a = np.array([10, 20, 30, 40, 50])
b = [20, 40]

filtered = a[np.isin(a, b)]
print(filtered)
```

**Output:** [20 40]

**Explanation**: We used `isin` as a boolean mask to extract elements from `a` that are in `b`.
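
The same mask can also be flipped to remove the matching values instead of keeping them. A minimal sketch, assuming the `invert` keyword of `np.isin` and reusing the illustrative values from the examples above:

``` python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
b = [20, 40]

# invert=True flips the mask: True where the element is NOT in b
kept = a[np.isin(a, b, invert=True)]
print(kept)
```

This prints `[10 30 50]`, the same result as `a[~np.isin(a, b)]`.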


## Summary

`np.isin` is useful for data cleansing and filtering, for example removing invalid values or outliers. In short, use it whenever you need to keep or remove values based on membership.


## Problem 2. Map, Lambda, Groupby
In this problem, only plain python may be used, no numpy.<br/>
The following links may be helpful:
- [sorting howto](https://docs.python.org/3/howto/sorting.html)
- [lambda sorting](https://blogboard.io/blog/knowledge/python-sorted-lambda)
- [itertools groupby](https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby).

Using the code cell below, read a csv (real wind turbine data) into a list of dicts.<br/>
Then do the following:
- a) using map, convert the timestamps into the format <b>MM/dd/yyyy HH:mm:ss</b>, e.g. 11/04/2018 09:10:43
- b) using sorted and lambda, sort the rows according to increasing rotorspeed
- c) add a column called <b><i>WindSpeed_Group</i></b> that contains the letter A, B or C, where A = less than 5mps, B = 5-10mps, C = more than 10mps. Try to use [itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby) (although it may not be very smart).

In your handin, include the code that does a) - c) above. No need to save the modified data. Here is the code for reading the raw data:

In [None]:
from getpass import getuser
import csv
user = getuser()
csv_location = f'/home/varpha/dan/private/{user}' + \
                f'/exrc_01/data/prob2_{user}.csv'
with open(csv_location) as handle:
    mydata = list(csv.DictReader(handle))

## Solution 2
### Documents used for reference:
- [datetime — Basic date and time types](https://docs.python.org/3/library/datetime.html)
- [sorting howto](https://docs.python.org/3/howto/sorting.html)

### Problem faced
- While working on task c), I found that some rows have a blank <b><i>WindSpeed_mps</i></b> value, which made the conversion to a number fail. For those rows I set <b><i>WindSpeed_Group</i></b> to NA.

In [None]:
from datetime import datetime
from getpass import getuser
from operator import itemgetter
import csv

user = getuser()
csv_location = f'/home/varpha/dan/private/{user}' + \
                f'/exrc_01/data/prob2_{user}.csv'

with open(csv_location) as handle:
    mydata = list(csv.DictReader(handle))


## mydata = mydata[0:20] # For debugging with sample data


# Convert the TimeStamp to MM/dd/yyyy HH:mm:ss format
update_mydata = list(map(
    lambda data: {
        **data,
        "TimeStamp": datetime.strptime(data["TimeStamp"], "%Y-%m-%d %H:%M:%S.%f").strftime("%m/%d/%Y %H:%M:%S")
    },
    mydata
))

## print(update_mydata) # For debugging results with sample data

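# Sort the rows by increasing RotorSpeed_rpm. The csv values are strings, so
# convert to float for a numeric sort; blank values are placed last.
sort_mydata = sorted(update_mydata, key=lambda row: float(row["RotorSpeed_rpm"] or "inf"))
print(sort_mydata)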

# Add new column WindSpeed_Group
def assign_group(data):
    try:
        ws = float(data["WindSpeed_mps"])
        if ws < 5:
            group = "A"
        elif ws <= 10:
            group = "B"
        else:
            group = "C"
    except ValueError:
        group = "NA"   # Handle empty or bad values

    return {**data, "WindSpeed_Group": group}


add_mydata = list(map(assign_group, update_mydata))


## print(add_mydata) # For debugging results with sample data
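
The task also suggests trying `itertools.groupby`. The following is only a sketch of how that could look, not part of the submitted solution. The sample rows and the helper name `wind_group` are made up for illustration; the real rows come from `csv.DictReader`. Note that `groupby` only merges consecutive items, so the rows are sorted by the same key first.

``` python
from itertools import groupby

# Hypothetical sample rows (the real data has more columns)
rows = [
    {"WindSpeed_mps": "3.2"},
    {"WindSpeed_mps": "12.5"},
    {"WindSpeed_mps": ""},
    {"WindSpeed_mps": "7.0"},
    {"WindSpeed_mps": "4.9"},
]

def wind_group(row):
    """Return A (< 5 mps), B (5-10 mps), C (> 10 mps), or NA for bad values."""
    try:
        ws = float(row["WindSpeed_mps"])
    except ValueError:
        return "NA"
    if ws < 5:
        return "A"
    return "B" if ws <= 10 else "C"

# Sort by group label, then let groupby hand out each group in turn
grouped_rows = []
for group, members in groupby(sorted(rows, key=wind_group), key=wind_group):
    grouped_rows.extend({**row, "WindSpeed_Group": group} for row in members)

print([(r["WindSpeed_mps"], r["WindSpeed_Group"]) for r in grouped_rows])
```

Compared with the `map`-based version, this changes the row order (rows end up clustered by group), which is one reason `groupby` may not be the smartest tool for simply adding a column.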
                                

## Problem 3. Vectorization
- Some [general info](https://www.askpython.com/python-modules/numpy/vectorization-numpy)
- The code in <tt>dan/public/exrc_01/integrator.py</tt> contains rudimentary code,<br/>
  written in plain python, that numerically integrates a (math) function<br/>
  $f\colon \mathbb{R} \to \mathbb{R}$ over an interval $[a,b]$.
- Rewrite the code using numpy and vectorization.
- Introduce timings to measure the gain of vectorization.
- Use the (math) function $f(x)=10 x^{11} + 6 x^{9} - 12 x^{6} - 10$ and interval $[a,b] = [-7, 15]$ to test the code.
- Increase the number of subintervals in order to obtain a noticeable difference in the timings.

In your handin, include the rewritten code along with the timing measures.

## Solution 3

### Option 1: Without NumPy

In [None]:
import time

def create_mesh(a, b, n):
    return [a+i*(b-a)/n for i in range(n)]


def integrate(f, a, b, n):
    sum_of_rectangles = 0
    left_endpoints = create_mesh(a,b,n)
    mesh_width = (b-a)/n
    for left_endpoint in left_endpoints:
        midpoint = left_endpoint + mesh_width/2
        height = f(midpoint)
        sum_of_rectangles += height * mesh_width
    return sum_of_rectangles


def f(x):
    return 3*x**2 - 5


### main ###

start_time = time.time()
# integrate f over [-1,4], dividing the interval into 10,000,000 subintervals
myresult = integrate(f,-1,4,10000000)
print(myresult) # 39.99999999999631
end_time = time.time()
print(f'Execution Time {end_time - start_time}') # Output: Execution Time 3.9350855350494385

### Option 2: With NumPy
- The NumPy solution samples $n$ points with `np.linspace(a, b, n)`, which includes both endpoints of the interval. This is close to, but not exactly, a <b><i>left endpoint rule</i></b>, whereas Option 1 uses the <b><i>midpoint rule</i></b>, so the numerical result differs slightly. The objective here was to demonstrate NumPy vectorization and its large performance improvement over a Python loop.


In [None]:
import numpy as np
import time

def integrate(f, a, b, n):
    left_endpoints = np.linspace(a,b,n)
    mesh_width = (b-a)/n
    return np.sum(f(left_endpoints) * mesh_width)


def f(x):
    return 3*x**2 - 5


### main ###

start_time = time.time()
# integrate f over [-1,4] using 10,000,000 sample points
myresult = integrate(f,-1,4,10000000)
print(myresult) # Output: 40.000006250000624
end_time = time.time()
print(f'Execution Time {end_time - start_time}') # Output: Execution Time 0.10708189010620117
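
For a like-for-like comparison with Option 1, the midpoint rule itself can be vectorized. The sketch below is one possible way to do it, not the submitted solution; the names `integrate_midpoint` and `g` are introduced here for illustration, and `time.perf_counter` is used for the timing. No timings or results are quoted because they depend on the machine.

``` python
import time
import numpy as np

def integrate_midpoint(f, a, b, n):
    """Midpoint rule on n equal subintervals of [a, b], fully vectorized."""
    mesh_width = (b - a) / n
    # Midpoint of subinterval i is a + (i + 0.5) * mesh_width, i = 0..n-1
    midpoints = a + (np.arange(n) + 0.5) * mesh_width
    return np.sum(f(midpoints)) * mesh_width

def f(x):
    return 3 * x**2 - 5

def g(x):
    # The test function named in the problem statement
    return 10 * x**11 + 6 * x**9 - 12 * x**6 - 10

start = time.perf_counter()
result = integrate_midpoint(f, -1, 4, 1_000_000)
elapsed = time.perf_counter() - start
print(result, elapsed)

# The same routine applied to the function and interval from the task
print(integrate_midpoint(g, -7, 15, 1_000_000))
```

The exact integral of $3x^2 - 5$ over $[-1, 4]$ is $40$, which gives a simple correctness check for the first call.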

## Problem 4. Numpy arrays

- The directory <tt>dan/private/exrc_01/data</tt><br/>
  contains a csv file (<tt>prob4_ah4323.csv</tt>) with some weather data.
- a) Use [numpy.genfromtxt](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) to read the file into a 2-dimensional numpy array.<br/>
  Use dtype=str in order to not lose the headers.
- b) Use Boolean masking to drop the rows that contain <tt>nan</tt> entries.
- c) Convert the time entries (standard timestamp) into a human-readable format of your choice.
- d) Add a new row that contains the averages of the columns, except <tt>nan</tt> for the time column.

In your handin, include the code that does a) - d) above. Do not include any saved data.

## Solution 4

In [2]:
import datetime
import numpy as np
from getpass import getuser
import csv


user = getuser()
csv_location = f'/home/varpha/dan/private/{user}' + \
                f'/exrc_01/data/prob4_{user}.csv'
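# Local override: read a copy of the file from the working directory instead
csv_location = "prob4_ah4323.csv"
# Read the entire CSV as strings to create a 2D array (keeps the header row)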
data = np.genfromtxt(csv_location, delimiter=",", dtype=str)

## np.savetxt("output_data.csv", data, delimiter=",", fmt="%s") ## For debugging to see the output

# Remove rows containing "nan" (string form)
clean_data = data[~np.any(data == "nan", axis=1)]

## np.savetxt("output_clean_data.csv", clean_data, delimiter=",", fmt="%s") ## For debugging to see the output

# Keep the header and rows separately
header, rows = clean_data[0], clean_data[1:]

# Find the index of the "time" column
time_col_idx = np.where(header == "time")[0][0]

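# Convert the timestamps to MM/dd/yyyy HH:mm:ss format (UTC).
# Widen the string dtype first so the 19-character result is not truncated.
rows = rows.astype(f"U{max(rows.dtype.itemsize // 4, 19)}")
for row in rows:
    ts = float(row[time_col_idx])
    # utcfromtimestamp is deprecated since Python 3.12; use an aware UTC datetime
    dt = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
    row[time_col_idx] = dt.strftime("%m/%d/%Y %H:%M:%S")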

# Compute the averages of the columns, except nan for the time column
averages = []
for i, colname in enumerate(header):
    if i == time_col_idx:
        averages.append("nan")  # placeholder for time column
    else:
        col_vals = rows[:, i].astype(float)
        averages.append(f"{np.mean(col_vals):.6f}")

# Stack header + formatted rows back + averages
formatted_data_with_header_averages = np.vstack([header, rows, averages]) 

np.savetxt("output_formatted_data.csv", formatted_data_with_header_averages, delimiter=",", fmt="%s") ## For debugging to see the output
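
The masking and averaging steps can be checked on a small in-memory sample before running them on the real file. This is only a sketch: the CSV content and column names below are made up, and the time column is assumed to be the first column.

``` python
import io
import numpy as np

# Made-up sample in the same shape as the weather file (header + numeric rows)
sample_csv = io.StringIO(
    "time,temp,wind\n"
    "1546300800,1.5,3.0\n"
    "1546304400,nan,4.0\n"
    "1546308000,2.5,5.0\n"
)
data = np.genfromtxt(sample_csv, delimiter=",", dtype=str)

# True for every row that holds at least one literal "nan" string
has_nan = np.any(data == "nan", axis=1)
clean = data[~has_nan]

header, rows = clean[0], clean[1:]
# Column means for everything except the first (time) column
means = rows[:, 1:].astype(float).mean(axis=0)
average_row = np.concatenate([["nan"], means.astype(str)])
print(np.vstack([header, rows, average_row]))
```

Because the header row never equals `"nan"`, the mask keeps it automatically, so header and data can stay in one array until the numeric conversion.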




## Problem 5. Data download
- Start by exploring / running the code in <tt>dan/public/exrc_01/statfi.py</tt>
- Choose a topic that interests you. Then try to download a "lot" of data on that topic. Here a lot means something in the 500kB - 2MB range. (It's not really a lot, but enough that the downloaded data is hard to grasp manually.)
- Save your data in one or several json files.

In your handin, include the code that you used (no saved data).
Also, tell a few words about your experiences. What problems, if any, did you encounter?

## Solution 5

## Overall Experience
The overall experience was smooth, mainly because the topic I selected (**Railway statistics – rtie**) was not too large. As a result, I did not encounter major issues while working with it.

## Key Observation
While running the code, I noticed that the chosen language (`en` for English or `fi` for Finnish) only affects the **labels** displayed in the metadata.  

However, the actual data extraction always depends on the **variable codes**, which remain constant and are defined in **Finnish**.
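
The observation can be illustrated offline. The sketch below assumes, as observed above, that each variable in the metadata carries a stable `code` next to a language-dependent `text` label. The metadata values and the helper name `build_query` are made up for illustration; only the query layout follows the templates used in the code cell.

``` python
from copy import deepcopy

# Made-up metadata in the shape returned for a .px table:
# a stable "code" plus a language-dependent "text" label per variable
metadata_en = {"variables": [
    {"code": "Vuosi", "text": "Year", "values": ["2020", "2021"]},
    {"code": "Tiedot", "text": "Information", "values": ["lkm"]},
]}

query_item_template = {"code": "", "selection": {"filter": "item", "values": []}}

def build_query(metadata, codes):
    """Build a query body that selects every value of the given variable codes."""
    by_code = {var["code"]: var for var in metadata["variables"]}
    items = []
    for code in codes:
        item = deepcopy(query_item_template)
        item["code"] = code
        item["selection"]["values"] = by_code[code]["values"]
        items.append(item)
    return {"query": items, "response": {"format": "json"}}

query = build_query(metadata_en, ["Vuosi", "Tiedot"])
print(query["query"][0])
```

Looking variables up by `code` rather than by `text` means the same query works regardless of which language the metadata was fetched in.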



In [None]:
##### imports #####

import requests
import json
import sys

# this has to do with pass by value / reference
from copy import deepcopy

##### config #####

english = True
# english = False


##### helpers #####


# notebook replacement of sys.exit()
# call with raise StopExecution
class StopExecution(Exception):
    def _render_traceback_(self):
        pass

query_template = {
    "query": [], # list of query items
    "response": {
        "format": "json"
    }
}

query_item_template = {
    "code": "", # variable
    "selection": {
        "filter": "item",
        "values": [] # list of strings
    }
}


##### main #####


with requests.Session() as session:

    '''
    first, some browsing in order to get the correct database
    you can do this with a browser too (but translation may become an issue)
    '''

    lang_id = 'en' if english else 'fi'
    base_url = f'https://pxdata.stat.fi/PXWeb/api/v1/{lang_id}/StatFin'
    response = session.get(base_url)

    for item in response.json():
        print(item['id'], item['text'])

    # stop execution
    # raise StopExecution

    '''
    next, append the id of your thing of interest to the url
    (EDIT the id below)
    '''

    catalogue_url = f'{base_url}/rtie'
    response = session.get(catalogue_url)

    '''
    check what .px files are available in the "catalogue"
    '''
    for item in response.json():
        print(item['id'], item['text'])

    # stop execution
    # raise StopExecution

    '''
    once you decide what .px file interests you, 
    EDIT it below in order to fetch the available data headers

    '''

    headers_url = f'{base_url}/rtie/statfin_rtie_pxt_12lz.px'
    response = session.get(headers_url)

    myjson = response.json()
    print()
    print('variables:', len(myjson['variables']))
    print()
    for var in myjson['variables']:
        print(var['text'])
    print()

    if english:
        tmp_url = headers_url.replace('/en/','/fi/')
        response = session.get(tmp_url)
        myjson = response.json()
        print()
        print('the corresponding variables in finnish (may be needed in the actual query):')
        print()
        for var in myjson['variables']:
            print(var['text'])
        print()

    # stop execution
    # raise StopExecution

    '''
    okay, but then things get more serious as we build the actual query for the data

    first, fetch the maximum values that one can download
    (this is kind of hi-tech, got it from the documentation)
    (which typically sucks in free & public apis like this)
    '''
    response = session.get(f'https://statfin.stat.fi/PXWeb/api/v1/{lang_id}/?config')
    maxvalues = response.json()['maxValues']

    '''
    query building (we don't request anything yet)
    please edit only the "for myvar" line
    '''
    query = deepcopy(query_template)
    total_values = 1
    for myvar in ['Vetokalustolaji', 'Vuosi', 'Tiedot']: # EDIT this line; use the variable codes, which are in Finnish
        myvalues = []
        query_item = deepcopy(query_item_template)
        for v in myjson['variables']:
            if v['code'] == myvar:
                myvalues = v['values']
        total_values = total_values * len(myvalues)
        query_item['code'] = myvar
        query_item['selection']['values'] = myvalues
        query['query'].append(query_item)
    if total_values > maxvalues:
        print('your query is too big, try again with fewer variables')
        raise StopExecution


    '''
    obtain the actual data with a "post" request
    that's like submitting a web form
    and cannot be done by gui browsing anymore
    '''
    response = session.post(headers_url, json=query)

    '''
    finally, dump the data to a file
    '''
    myjson = response.json()
    with open('test.json', 'w') as handle:
        json.dump(myjson, handle, indent=4)

    print("file created")
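
Since the task asks for roughly 500kB - 2MB of data, it can help to check the size of the saved file. A minimal sketch using only the standard library; the payload here is made up and written to a temporary directory, so it does not touch `test.json`.

``` python
import json
import os
import tempfile

# Made-up payload standing in for the downloaded response
payload = {"data": [{"key": [str(i)], "values": [str(i * 2)]} for i in range(1000)]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.json")
    with open(path, "w") as handle:
        json.dump(payload, handle, indent=4)
    # Size on disk in kB (1 kB = 1000 bytes)
    size_kb = os.path.getsize(path) / 1000
    in_target_range = 500 <= size_kb <= 2000
    print(f"{size_kb:.1f} kB written, in target range: {in_target_range}")
```

If the file turns out to be too small, adding more variables or tables to the query is the obvious next step, as long as the total stays below the `maxValues` limit.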


## How to submit my solutions?

Open a Terminal tab (e.g. <tt>File $\rightarrow$ New $\rightarrow$ Terminal</tt>), copy-paste the following into the Terminal command prompt, and press enter:
<pre>
  /home/varpha/dan/menu.py
</pre>