# Data Analytics Fall 2025 &mdash; Exercises 1

### XXXXX XXXXX

## Problem 1. Documentation
- Browse through the Python and Numpy documentation
- Find a function that a) interests you, and b) has a messy documentation
- Play with the function and find simple use cases
- Explain the function to your anonymous peer reviewer.

Please write a nice and clear explanation. Include some elementary examples.

### numpy.bitwise_and

#### Short description of the function

The function chosen for this exercise is the (rather obscure) `numpy.bitwise_and`. Documentation is available at [https://numpy.org/doc/stable/reference/generated/numpy.bitwise_and.html](https://numpy.org/doc/stable/reference/generated/numpy.bitwise_and.html).

Brief terminology beforehand: 
- The numbers we use in our day-to-day lives are *decimal numbers*: base-10 numbers written with the digits 0-9.
- Bitwise operations are performed on *binary numbers*: base-2 numbers written with only the digits 0 and 1.
- Another relatively common format (not used in this example) is *hexadecimal numbers*: base-16 numbers written with the digits 0-9 and the letters A-F.

In short, `numpy.bitwise_and` compares its two inputs bit by bit in their binary representation, and returns a value whose bits are '1' only where both inputs have a '1'. This is easier to see in the example below:

```
Input a: 13 ---> 0000 1101  # This is the binary representation of decimal number 13
Input b: 17 ---> 0001 0001  # This is the binary representation of decimal number 17
--------------------------
bitwise_and ---> 0000 0001  # Only the least significant bit is '1' in both a and b
```

The parameters for `bitwise_and` are treated as binary values, and the bits at corresponding positions are compared. When both are '1', the resulting bit is '1'. In all other cases, the resulting bit is '0'. In the above case, the binary forms of 13 and 17 only share a '1' at the rightmost bit (the least significant bit), giving the binary value 00000001, which is the integer 1.

Another example: 
```
Decimal 21  ---> 0001 0101
Decimal 55  ---> 0011 0111
--------------------------
bitwise_and ---> 0001 0101 <--- Decimal 21
```
Taking a `bitwise_and` of decimal numbers 55 and 21 results in a decimal number 21, as all the '1' bits in the binary representation of decimal 21 also occur at the same locations in the binary representation of decimal number 55.
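The two worked examples above can be reproduced directly in NumPy. This is a minimal sketch; `np.binary_repr` is used only to print the bit patterns, and the variable names are chosen for illustration.

```python
import numpy as np

# Scalars work as well as arrays; the result is the integer whose bits are the AND-ed bits
first = np.bitwise_and(13, 17)
second = np.bitwise_and(21, 55)

# np.binary_repr shows the bit pattern, zero-padded to 8 bits
print(np.binary_repr(13, width=8), "&", np.binary_repr(17, width=8),
      "=", np.binary_repr(int(first), width=8))
print(first, second)

# With array inputs the operation is applied element by element
pairs = np.bitwise_and(np.array([13, 21]), np.array([17, 55]))
print(pairs)
```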

#### Function signature and parameter explanation

The function definition for bitwise_and is as follows: 
```python
numpy.bitwise_and(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj]) = <ufunc 'bitwise_and'>
```

The function requires two mandatory parameters, `x1` and `x2`. Both are required to be `array_like`, that is, convertible to NumPy arrays. Common objects that fit this description are NumPy arrays, lists, tuples and plain scalars. Only integer and boolean types are handled, and if the two shapes differ they must be broadcastable to a common shape. `x1` and `x2` are the two operands of the `bitwise_and` operation.

The `bitwise_and` function also takes some optional parameters. In the signature, `/` marks the parameters before it (`x1`, `x2`) as positional-only, and `*` marks the parameters after it (`where`, `casting` and so on) as keyword-only. `out` sits between the two markers, so it can be given either way. The optional parameters are as follows: 
- `out`, default value `None`: This parameter can be used to specify an existing array where the output is to be stored (its shape must match the shape that the inputs broadcast to). With the default value, a new array is returned by the function. 
- `where`, default value `True`: This can be used to set a condition on when the `bitwise_and` operation is to be performed. Where the condition evaluates to `False`, the value in the output array is not modified (and stays uninitialized if the output array was freshly created). 
- `casting`, default value `same_kind`: This parameter controls what kind of data type conversion is allowed when the data types involved differ. Note that with the default settings `bitwise_and` raises a `TypeError` for floating-point inputs, since the operation is only defined for integers and booleans. Possible values: 
    - `no`: data types are not cast at all
    - `equiv`: only byte-order changes are allowed
    - `safe`: only casts that preserve values are allowed
    - `same_kind`: safe casts, or casts within a kind (e.g. `int64` to `int32`), are allowed
    - `unsafe`: any data conversion may be done
- `order`, default value `K`: this has to do with memory layouts for the output array. We're not even going to go there, leave this as default unless you really need to play around. 
- `dtype`, default value `None`: this parameter can be used to manually set the output data type. With the default value, the data type is inferred from the data types of the inputs. 
- `subok`, default value `True`: this will only be needed if you extend `ndarray` to create subclasses - when `True`, the output will be of that extended subclass, with `False` the output array will default to the unextended `ndarray` class. If you have no idea what I'm talking about here, you will not need to touch this parameter. 
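The `out` and `where` parameters are easiest to understand together. The sketch below uses made-up values and pre-fills the output array with -1 (an arbitrary choice) so that the positions skipped by `where` are easy to spot.

```python
import numpy as np

values = np.array([0b1100, 0b1010, 0b1111, 0b0001])
mask = np.array([0b0110, 0b0110, 0b0110, 0b0110])

# Pre-fill the output so positions skipped by `where` hold a known value (-1 here)
result = np.full(values.shape, -1)

# Only compute the AND where the value is larger than 1; other slots keep -1
np.bitwise_and(values, mask, out=result, where=values > 1)
print(result)
```

Without the pre-filled `out` array, the skipped positions would contain whatever happened to be in memory.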

For NumPy arrays, the `&` operator is shorthand for this universal function (`<ufunc 'bitwise_and'>`), so it can be used directly instead of calling the function explicitly. 

Instead of:
```python
result = np.bitwise_and(array1, array2)
```
you can write:
```python
result = array1 & array2
```

#### Use cases for numpy.bitwise_and()

"All this sounds awfully complex, why should I use `bitwise_and`?" you may ask - well, here's why. 

##### Flagging

A flag is a value that is either `True` or `False`, and can be used to bind additional information to a row of data. Information about multiple flags can be embedded into a single column of data by utilizing binary values and bitwise operations. 

Consider a dataset on performed experiments, where the following observations have been made, and incorporated as flags in binary values in field `FLAGS`: 
- FLAG_A: `0001` (Measuring error)
- FLAG_B: `0010` (Test performer fell asleep mid-test)
- FLAG_C: `0100` (Technical malfunction mid-test)
- FLAG_D: `1000` (Experiment target ate the measuring device)

If you have a value of `0110` in field `FLAGS`, it would imply that during this specific experiment, a technical malfunction occurred, and the test performer fell asleep. A value of `1001` would indicate that there was a measuring error, and the experiment target ate the measuring device. A value of `0000` would indicate no issues were encountered during the experiment. The number of bits can be increased as new flags are needed, without having to increase the number of columns in the dataset. 
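To make the encoding concrete, here is a small plain-Python sketch that combines flags with `|` and decodes a stored value back into flag names with `&`. The dictionary and the helper `describe_flags` are hypothetical names introduced only for this illustration.

```python
# Flag bits as described in the list above
FLAGS = {
    "measuring error": 0b0001,
    "performer fell asleep": 0b0010,
    "technical malfunction": 0b0100,
    "device eaten": 0b1000,
}

def describe_flags(value):
    """Return the names of all flags whose bit is set in `value`."""
    return [name for name, bit in FLAGS.items() if value & bit]

# Setting two flags at once is a bitwise OR of their bits
combined = FLAGS["technical malfunction"] | FLAGS["performer fell asleep"]

print(format(combined, "04b"))
print(describe_flags(combined))
print(describe_flags(0b1001))
print(describe_flags(0b0000))
```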

`bitwise_and` can be used to find all experiments that have a specific flag set. The result is non-zero exactly where the flag's bit is set: 

```python
has_measuring_error = np.bitwise_and(all_experiments, FLAG_A) != 0
```
Example code below:

In [None]:
# Imports
import numpy as np

# Flag definitions
FLAG_A = 0b0001 # Measuring error
FLAG_B = 0b0010 # Test performer fell asleep mid-test
FLAG_C = 0b0100 # Technical malfunction
FLAG_D = 0b1000 # Experiment target ate measuring device

# Initialize np.array for experiment flags
experiment_flags = np.array([0b0000, 0b0110, 0b0001, 0b0000, 0b1000, 0b1111, 0b1100, 0b0101, 0b0100])
# Note that these print out as integers, as each binary value can be represented as an integer
experiment_flags

In [None]:
# Filter with flags
technical_malfunction = np.bitwise_and(experiment_flags, FLAG_C) == FLAG_C
measuring_device_eaten = np.bitwise_and(experiment_flags, FLAG_D) == FLAG_D

print("Technical malfunction:\t", *technical_malfunction)
print("Measuring device eaten:\t", *measuring_device_eaten)

##### Filtering

`bitwise_and` can be used in applying filters to a dataset, in a similar way that boolean masking works. `bitwise_and` can be used to create a `True`/`False` filter, which can then be applied to an array to only retain values within wanted bounds. 

The example below uses `bitwise_and` to retrieve temperatures at which water would be in a liquid state (between 0 and 100 degrees Celsius). 

In [None]:
# Set up an array of temperatures
temperatures = np.array([20.0, -15.0, 0.0, 1.2, 4.5, 85.0, 102.0, -0.2, 44.3, 24.4, -11.2, -130.1, 2034.4])

# Create True/False filter arrays for upper and lower limits of temperatures
above_freezing = temperatures > 0.0
below_boiling = temperatures < 100.0

# Current content of filters:
print("Temperatures above freezing:\t", *above_freezing)
print("Temperatures below boiling:\t", *below_boiling)

In [None]:
# Combine filter masks to one filter, where temperatures are included only if both separate filters indicate "True"
liquid_state = above_freezing & below_boiling # Operator '&' used here in place of 'bitwise_and(above_freezing, below_boiling)'
print("Combined filter:\t", *liquid_state)

In [None]:
# Apply filter to temperatures (applying rounding to keep decimals when printing)
filtered_temperatures = np.round(temperatures[liquid_state], 2)

# Display results
print("Temperatures at which water is in liquid state:\t", *filtered_temperatures)

##### Low-level data access

Low-level data access works in much the same way as handling flags: a mask picks out the bits of interest from a packed value. 
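As a sketch of what this could look like, assume (purely for illustration) that colours are stored as packed 24-bit integers in the layout `0xRRGGBB`. Shifting moves the wanted byte to the lowest position, and `bitwise_and` with `0xFF` keeps only that byte.

```python
import numpy as np

# Hypothetical packed colours, one integer per pixel, layout 0xRRGGBB
pixels = np.array([0xFF8000, 0x123456])

# Shift the wanted byte down to the lowest 8 bits, then mask everything else away
red = np.bitwise_and(pixels >> 16, 0xFF)
green = np.bitwise_and(pixels >> 8, 0xFF)
blue = np.bitwise_and(pixels, 0xFF)

print(red, green, blue)
```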

##### Other uses in different fields

`bitwise_and` can also be used for other tasks that we will most likely not encounter during this course. Bitwise operations are common in the field of encryption. Images can be processed and analyzed with bitwise operations. Network and signal processing also rely on bitwise operations to some extent. 

## Problem 2. Map, Lambda, Groupby
In this problem, only plain python may be used, no numpy.<br/>
The following links may be helpful:
- [sorting howto](https://docs.python.org/3/howto/sorting.html)
- [lambda sorting](https://blogboard.io/blog/knowledge/python-sorted-lambda)
- [itertools groupby](https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby).

Using the code cell below, read a csv (real wind turbine data) into a list of dicts.<br/>
Then do the following:
- a) using map, convert the timestamps into the format <b>MM/dd/yyyy HH:mm:ss</b>, e.g. 11/04/2018 09:10:43
- b) using sorted and lambda, sort the rows according to increasing rotorspeed
- c) add a column called <b><i>WindSpeed_Group</i></b> that contains the letter A, B or C, where A = less than 5mps, B = 5-10mps, C = more than 10mps. Try to use [itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby) (although it may not be very smart).

In your handin, include the code that does a) - c) above. No need to save the modified data. Here is the code for reading the raw data:

In [None]:
from getpass import getuser
import csv
user = getuser()
csv_location = f'/home/varpha/dan/private/{user}' + \
                f'/exrc_01/data/prob2_{user}.csv'
with open(csv_location) as handle:
    mydata = list(csv.DictReader(handle))

In [None]:
# Example data that was read in
print(mydata[0])

In [None]:
# Sanity check for part b)
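# Get the lowest and highest rotor speeds. csv.DictReader returns every field as a
# string, so convert to float before comparing (missing values are treated as 0 here).
min_rotor_speed = min(float(x["RotorSpeedAve"] or 0) for x in mydata)
max_rotor_speed = max(float(x["RotorSpeedAve"] or 0) for x in mydata)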
print(f"Minimum: {min_rotor_speed}, maximum: {max_rotor_speed}")

In [None]:
# ---------------------------------------------------------------------- #
# a) convert the timestamps into format MM/dd/yyyy HH:mm:ss with 'map'   #
# ---------------------------------------------------------------------- #

# Note: no DateTime, just string manipulation. This could also have been done as a two-liner,
# but it would not include the use of 'map' as required by the problem. 

# Split off the decimals from each timestamp, only pick the part before the decimal point.
# Make these into a list of dictionaries containing only TimeStamp key ({"TimeStamp": <new timestamp>}),
# as this makes updating the original list of dictionaries easier later on.
new_timestamps = list(map(lambda x: {"TimeStamp": x["TimeStamp"].split('.')[0]}, mydata))

# Check that timestamps look correct for the first few dicts
print(new_timestamps[0:3])

In [None]:
# Match the original dictionaries with updated timestamps,
# loop through each dictionary and update the new timestamp into the dictionary
for original, new_ts in zip(mydata, new_timestamps):
    original.update(new_ts)

In [None]:
# Sanity check to make sure that data has updated
print(mydata[0:2])

In [None]:
# ---------------------------------------------------------------------------- #
# b) using sorted and lambda, sort the rows according to increasing rotorspeed #
# ---------------------------------------------------------------------------- #

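# Save sorted list to new variable. The key converts to float so that the sort is
# numeric rather than alphabetical (missing values are treated as 0 here).
sorted_by_rotorspeed = sorted(mydata, key=lambda row: float(row['RotorSpeedAve'] or 0))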
# Print the first row:
print(f"First row (slowest):\t{sorted_by_rotorspeed[0]}\n")
# Print the last row:
print(f"Last row (fastest):\t{sorted_by_rotorspeed[-1]}")

# Note: it seems like my data was from a time period where there was no wind, or just no data in general.
# The code still works - see the min and max values from just after data load. 
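One detail worth checking when sorting CSV data: `csv.DictReader` returns every field as a string, and strings sort alphabetically rather than numerically. The short sketch below uses invented values to show the difference that a `float` key makes.

```python
# Invented rotor speeds, stored as strings the way csv.DictReader would return them
values = ["9.1", "10.5", "2.0"]

# Without a key, strings are compared character by character
alphabetical = sorted(values)

# With key=float, the numeric value decides the order
numerical = sorted(values, key=float)

print(alphabetical)
print(numerical)
```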

In [None]:
# -------------------------------------------------------------------------- #
# c) add a column called WindSpeed_Group that contains the letter A, B or C, #
# where A = less than 5mps, B = 5-10mps, C = more than 10mps. Try to use     #
# itertools.groupby (although it may not be very smart).                     #
# -------------------------------------------------------------------------- #

# Import groupby
from itertools import groupby

# Helper function for returning category based on wind speed
def get_wind_category(speed):
    # Handle missing values
    if speed == '':
        return ''
    # Return correct category according to wind speed
    # (A: below 5 mps, B: 5-10 mps inclusive, C: above 10 mps)
    speed = float(speed)
    if speed < 5:
        return 'A'
    elif speed <= 10:
        return 'B'
    else:
        return 'C'

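# This makes it a bit cleaner in code down below to cast windspeed into floats, 
# and account for missing values at the same time. Missing values map to -inf so
# that they sort strictly before every real measurement (including 0.0), which
# keeps each category in one contiguous block after sorting.
def get_numerical_windspeed(speed):
    if speed == '':
        return float('-inf')
    else: 
        return float(speed)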
    
# Sort data by windspeed
sorted_by_windspeed = sorted(mydata, key=lambda x: get_numerical_windspeed(x['WindSpeed_mps']))

# Initialize an empty dict template where grouped data can be stored
wind_category_data = {
    '': [],
    'A': [],
    'B': [],
    'C': []
}

windspeed_groups = []

# Group by wind speed category
for key, val in groupby(sorted_by_windspeed, lambda x: get_wind_category(x['WindSpeed_mps'])):
    # I would prefer to add the category here while iterating through all items,
    # but it seems to be a pain in the ass to manipulate the original list here, or even
    # individual items. 
    
    # Add rows under their correct key in Windspeed-grouped dictionary
    wind_category_data[key].extend(list(val))
    
# Sanity check: number of rows under each wind category
print("Number of rows by category:")
for key in wind_category_data.keys():
    print(key if key != '' else '-', len(wind_category_data[key]))
    

# (The question is correct, this is a pretty stupid/complex way to do this. I added the easy way after this.)

# As the sorted_by_windspeed list is in order from least to most wind, we know that
# all the categories will be in order on that list too, so we can just simply update
# the correct number of rows on the sorted list without having to even check the windspeeds. 
cat_a_start_idx = len(wind_category_data[''])
cat_b_start_idx = cat_a_start_idx + len(wind_category_data['A'])
cat_c_start_idx = cat_b_start_idx + len(wind_category_data['B'])

# Update empty groups
for row in sorted_by_windspeed[0:cat_a_start_idx]:
    row.update({"WindSpeed_Group": ''})

# Update category A groups
for row in sorted_by_windspeed[cat_a_start_idx:cat_b_start_idx]:
    row.update({"WindSpeed_Group": 'A'})

# Update category B groups
for row in sorted_by_windspeed[cat_b_start_idx:cat_c_start_idx]:
    row.update({"WindSpeed_Group": 'B'})

# Update category C groups
for row in sorted_by_windspeed[cat_c_start_idx:]:
    row.update({"WindSpeed_Group": 'C'})

# Print an example row from sorted list to show that group has been added
print('---------------------------')
print("Last row from sorted list:")
print(sorted_by_windspeed[-1])

In [None]:
#### Part c) alternative solution (without groupby) ####

# Helper function for returning category based on wind speed
def get_wind_category(speed):
    # Handle missing values
    if speed == '':
        return ''
    # Return correct category according to wind speed
    # (A: below 5 mps, B: 5-10 mps inclusive, C: above 10 mps)
    speed = float(speed)
    if speed < 5:
        return 'A'
    elif speed <= 10:
        return 'B'
    else:
        return 'C'

for row in mydata:
    row.update({"WindSpeed_Group": get_wind_category(row["WindSpeed_mps"])})

# Example row of data
print(mydata[0])

# Much shorter, much cleaner. 
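For comparison, the group label can also be written to each row while iterating over `groupby`, because the grouped items are the same dict objects as in the original list. The sketch below uses invented sample rows and a hypothetical helper `wind_group`, and it treats exactly 10 mps as group B, which is one reading of "B = 5-10mps".

```python
from itertools import groupby

def wind_group(speed):
    """Map a raw wind speed string to '', 'A', 'B' or 'C'."""
    if speed == '':
        return ''
    value = float(speed)
    if value < 5:
        return 'A'
    if value <= 10:
        return 'B'
    return 'C'

# Invented sample rows standing in for the CSV data
rows = [{"WindSpeed_mps": s} for s in ["12.3", "", "4.9", "7.0", "10.0", "0.0"]]

# groupby only merges neighbouring items, so sort by the group label first
rows_by_group = sorted(rows, key=lambda row: wind_group(row["WindSpeed_mps"]))

for label, members in groupby(rows_by_group, key=lambda row: wind_group(row["WindSpeed_mps"])):
    for row in members:
        # The dicts are shared with `rows`, so this also updates the original list
        row["WindSpeed_Group"] = label

print(rows)
```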

## Problem 3. Vectorization
- Some [general info](https://www.askpython.com/python-modules/numpy/vectorization-numpy)
- The code in <b>dan/public/exrc_01/integrator.py</b> contains rudimentary code,<br/>
  written in plain python, that numerically integrates a (math) function<br/>
  $f\colon \mathbb{R} \to \mathbb{R}$ over an interval $[a,b]$.
- Rewrite the code using numpy and vectorization.
- Introduce timings to measure the gain of vectorization.
- Use the (math) function $f(x)=- 8 x^{11} - 9 x^{10} + 9 x^{9} - 15$ and interval $[a,b] = [-17, 28]$ to test the code.
- Increase the number of subintervals in order to obtain a noticeable difference in the timings.

In your handin, include the rewritten code along with the timing measures.

In [None]:
# Original code:

def create_mesh(a, b, n):
    return [a+i*(b-a)/n for i in range(n)]


def integrate(f, a, b, n):
    sum_of_rectangles = 0
    left_endpoints = create_mesh(a,b,n)
    mesh_width = (b-a)/n
    for left_endpoint in left_endpoints:
        midpoint = left_endpoint + mesh_width/2
        height = f(midpoint)
        sum_of_rectangles += height * mesh_width
    return sum_of_rectangles


def f(x):
    return 3*x**2 - 5

### main ###

# integrate f over [-1,4], dividing the interval to 1000 subintervals
myresult = integrate(f,-1,4,1000)
print(myresult)


#### Note on refactored solution

The original implementation uses the following for create_mesh(): `[a+i*(b-a)/n for i in range(n)]`. This divides the x-axis from left to right (a to b) into *n* equal subintervals and returns the left endpoint of each subinterval. The return value is a list of x-values. 

Numpy has a function that does the same thing natively: [numpy.linspace](https://numpy.org/doc/stable/reference/generated/numpy.linspace). I have used this function here for simplicity. With `retstep=True` it also returns the step size, so `mesh_width` does not have to be calculated separately. The endpoint is excluded with `endpoint=False`, so that the function only returns the left endpoints of the subintervals. 

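The results of the `linspace` version and the original are not always identical down to the last digit. At 500 steps within interval $[-1, 4]$ the results of the original and the new code were still the same, but increasing the number of steps started to show tiny differences between results. These are most likely floating-point rounding effects (the values are computed and summed in a different order) rather than an error in either version. The mesh could also have been built as a direct vectorized translation of the original list comprehension: 

```python
return a + np.arange(n) * (b - a) / n
```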

`numpy.linspace` was kept in this solution because it is more intuitive to read. 
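The two ways of building the mesh can be compared directly. This is a small sketch using the interval and step count from the original code; it prints the largest difference between the two meshes rather than assuming a particular value.

```python
import numpy as np

a, b, n = -1, 4, 1000

# Variant 1: linspace without the endpoint, also returning the step size
left_linspace, step = np.linspace(a, b, n, endpoint=False, retstep=True)

# Variant 2: direct translation of the original list comprehension
left_manual = a + np.arange(n) * (b - a) / n

print("Step from linspace:", step, "computed by hand:", (b - a) / n)
print("Meshes agree within tolerance:", np.allclose(left_linspace, left_manual))
print("Largest absolute difference:", np.max(np.abs(left_linspace - left_manual)))
```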

In [None]:
import numpy as np

# Rewritten code (functions renamed):

def create_mesh_v(a, b, n):
    return np.linspace(a, b, n, endpoint=False, retstep=True)    

def integrate_v(f_v, a, b, n):
    left_endpoints, mesh_width = create_mesh_v(a, b, n)
    midpoints = left_endpoints + mesh_width / 2
    heights = f_v(midpoints)
    sum_of_rectangles = np.sum(heights * mesh_width)
    return sum_of_rectangles

# This could have been left out, f(x) from original code is still valid for this solution too.
def f_v(x):
    return 3*x**2 - 5

## Main

integrate_v(f_v, -1, 4, 1000)


#### Code and timing tests

$f(x)=- 8 x^{11} - 9 x^{10} + 9 x^{9} - 15$ and interval $[a,b] = [-17, 28]$

In [None]:
import time

# Overwriting function definitions for tests
def f(x): 
    return -8*x**11 - 9*x**10 + 9*x**9 -15

def f_v(x):
    return -8*x**11 - 9*x**10 + 9*x**9 -15

# ----------------------
# Test 1: Original code
# ----------------------

# time.perf_counter() is a monotonic, high-resolution clock meant for measuring durations
start_time = time.perf_counter()
result = integrate(f, -17, 28, 100000)
orig_duration = time.perf_counter() - start_time

print(f"Integration result: {result}, calculation time: {orig_duration}")

# ------------------------
# Test 2: Refactored code
# ------------------------

start_time = time.perf_counter()
result = integrate_v(f_v, -17, 28, 100000)
vect_duration = time.perf_counter() - start_time

print(f"Integration result: {result}, calculation time: {vect_duration}")

# Print performance improvement statistics
print(f"Vectorization changed computation time by approximately {round((vect_duration - orig_duration)/orig_duration * 100, 2)} %.")
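A single timed run can be noisy, so one option is to repeat the measurement with the standard `timeit` module and keep the best run. The sketch below is self-contained; `integrate_loop` and `integrate_numpy` are simplified stand-ins written for this example, not the functions above, and no particular speed-up is assumed.

```python
import timeit
import numpy as np

def f(x):
    return -8*x**11 - 9*x**10 + 9*x**9 - 15

def integrate_loop(f, a, b, n):
    # Midpoint rule with a plain Python loop
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) * width for i in range(n))

def integrate_numpy(f, a, b, n):
    # Midpoint rule evaluated on all midpoints at once
    width = (b - a) / n
    midpoints = a + (np.arange(n) + 0.5) * width
    return np.sum(f(midpoints)) * width

n = 100_000

# Run each version three times and keep the fastest run
loop_time = min(timeit.repeat(lambda: integrate_loop(f, -17, 28, n), number=1, repeat=3))
numpy_time = min(timeit.repeat(lambda: integrate_numpy(f, -17, 28, n), number=1, repeat=3))

print(f"Loop: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s")
print(f"Speed-up factor: {loop_time / numpy_time:.1f}")
```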

## Problem 4. Numpy arrays

- The directory <tt>dan/private/exrc_01/data</tt><br/>
  contains a csv file (<tt>prob4_XXXXX.csv</tt>) with some weather data.
- a) Use [numpy.genfromtxt](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) to read the file into a 2-dimensional numpy array.<br/>
  Use dtype=str in order to not lose the headers.
- b) Use Boolean masking to drop the rows that contain <tt>nan</tt> entries.
- c) Convert the time entries (standard timestamp) into a human-readable format of your choice.
- d) Add a new row that contains the averages of the columns, except <tt>nan</tt> for the time column.

In your handin, include the code that does a) - d) above. Do not include any saved data.

In [None]:
import numpy as np

# -------------------------------------------------------------------------
# a) use numpy.genfromtxt to read the file into a 2-dimensional numpy array
# -------------------------------------------------------------------------

# Read in data, specify delimiter as comma only
data = np.genfromtxt('data/prob4_XXXXX.csv', dtype=str, delimiter=',')

# Sanity check with first few rows
data[0:3]

In [None]:
# ----------------------------------------------------------------
# b) use boolean masking to drop the rows that contain nan entries
# ----------------------------------------------------------------

# Use np.any() to search if any values in the row contain 'nan', exclude if so
filtered = data[~np.any(data == 'nan', axis=1)]

# Sanity
print(f"Rows in original data: {len(data)}\nRows in filtered data: {len(filtered)}")

In [None]:
# ------------------------------------------------------------
# c) Convert time entries into human-readable format of choice
# ------------------------------------------------------------

# Get timestamp utilities
from datetime import datetime, timezone

# Create a helper function to convert timestamp from epoch string to human readable string (UTC).
# datetime.utcfromtimestamp is deprecated since Python 3.12, so a timezone-aware call is used.
def convert_timestamp(timestamp_str):
    return datetime.fromtimestamp(float(timestamp_str), tz=timezone.utc).strftime("%d-%m-%Y %H:%M:%S")

# Vectorize the helper function (so that it will apply to each item of the array separately,
# instead of trying to handle the entire np.array at once)
timestamp_converter = np.vectorize(convert_timestamp)

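# The array has a fixed-width string dtype. Widen it first if needed, so that the
# formatted timestamps (19 characters) are not silently truncated on assignment.
width = max(filtered.dtype.itemsize // 4, 19)
filtered = filtered.astype(f"<U{width}")

# Replace the timestamp formats (ignore header row)
filtered[1:, 3] = timestamp_converter(filtered[1:, 3])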

# Sanity check to make sure that correct things changed
filtered[0:3]
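One thing to watch for when writing converted strings back into a NumPy string array: the array has a fixed character width, and longer strings are cut off without any warning. Whether this affects the data above depends on the widest string in the file, which is not shown here. The sketch below uses two invented epoch strings to demonstrate the effect.

```python
import numpy as np

# Two invented epoch timestamps; the array gets dtype '<U10' (10 characters wide)
short = np.array(["1546300800", "1546304400"])
print(short.dtype)

# Assigning a 19-character string keeps only the first 10 characters
short[0] = "01-01-2019 00:00:00"
print(short[0])

# Widening the dtype first leaves room for the full string
wide = short.astype("<U19")
wide[1] = "01-01-2019 01:00:00"
print(wide[1])
```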

In [None]:
# d) Add a new row that contains averages of the columns, except nan for the time column
summary = np.concatenate([
    np.mean((filtered[1:, 0:3]).astype(float), axis=0),
    ['nan'],
    np.mean((filtered[1:, 4:]).astype(float), axis=0)
    ])

# Append summary row to the end of the filtered data (vstack = vertical stack)
filtered = np.vstack([filtered, summary])

# Sanity check with last two rows to ensure that summary row was updated
filtered[-2:]

## Problem 5. Data download
- Start by exploring / running the code in <tt>dan/public/exrc_01/statfi.py</tt>
- Choose a topic that interests you. Then try to download a "lot" of data on that topic (not adoptions please!). Here a lot means something in the 500kB - 2MB range. (It's not really a lot but enough that the downloaded data is hard to grasp manually.)
- Save your data in one or several json files.

In your handin, include the code that you used (no saved data). You also don't need to do anything to your saved data (just be able to download it).
Also, tell a few words about your experiences. What problems, if any, did you encounter?

In [None]:
# Imports

import requests
import json

# Configuration

target_data_types = ['Real estate prices']
lang = 'en'
api_url = f"https://statfin.stat.fi/PXWeb/api/v1/{lang}/StatFin"
   
# Main program

with requests.Session() as session:
    # Get some configurations out of the way first:
    response = session.get('https://statfin.stat.fi/PXWeb/api/v1/fi/?config')
    maxvalues = response.json()['maxValues'] # Maximum number of records that can be fetched at once
    
    # Fetch available data
    print(f"Fetching list of endpoints from StatFin...")
    response = session.get(api_url)
    print(f"Done.")
        
    # Determine correct API endpoints (identifiers) for the data to fetch
    print(f"Searching response for target data types: {target_data_types}")
    identifiers = [entry['id'] for entry in response.json() if entry['text'] in target_data_types]
    print(f"Found the following identifiers: {identifiers}")
    
    # Get data for all tags
    for tag in identifiers:
        # Gather filenames for each identifier
        response = session.get(f"{api_url}/{tag}")
        files = [row['id'] for row in response.json()]
        
        # Fetch the wanted data (in this case: 'statfin_kihi_pxt_13zt.px',
        # price indexes for existing real estates on a quarterly basis from 1985)        
        file = 'statfin_kihi_pxt_13zt.px' # Note: hardcoding an interesting file, instead of going dynamic with this.
        print(f"Getting contents for file '{file}'")
        response = session.get(f"{api_url}/{tag}/{file}")
        variables = [var['text'] for var in response.json()['variables']]
        print(f"File '{file}' contained the following variables: {variables}")
        
        # Craft request to get actual data with given variables
        print("Building query...")
        # Query base
        query = {
            "query": [],
            "response": {
                "format": "json"
            }
        }
        
        # Initialize values count
        total_values = 1
        
        # Add query items to base
        for var in response.json()['variables']:
            query_item = {
                "code": var['code'],
                "selection": {
                    "filter": "item",
                    "values": var['values']
                }
            }
            query['query'].append(query_item)
            
            # Update value count
            total_values *= len(var['values'])
        
        # Stop handling if values exceed limits
        if total_values > maxvalues:
            print(f"Query too big. Maximum: {maxvalues} values, query contained {total_values} values.")
            continue
        
        # Send query
        print(f"Sending query for file {file}...")
        response = session.post(f"{api_url}/{tag}/{file}", json=query)
        print(f"Received HTTP response code {response.status_code}")
        
        # Save file contents if response was OK
        if response.status_code ==  200:
            print(f"Writing response contents to data_{file}.json...")
            with open(f"data_{file}.json", "w") as handle:
                json.dump(response.json(), handle, indent=4)
                print("Write complete.")

print("Data download and write process complete for all files.")

#### Notes on exercise 5

I rewrote all the code, but used the original code as a guide - I learn a lot better by doing things from scratch myself. The main reason for rewriting the code was that I wanted to make it more dynamic, configurable with one or two parameters only. The end result is almost that - currently the filename to be downloaded is hardcoded. There are a few ways to make that dynamic: the code could list all the files and their descriptions to the user and ask which one to download (and use the identifier as a variable), it could simply loop through all files and download everything, or it could arbitrarily pick the first (or a random) file to download. It is just a matter of how to populate the `file` variable in the code. 

I also preferred to create the query item on the fly during the data fetch - there was no need to fret over value and variable referencing, and no need to bring in copy module. A new query is initialized every time data is fetched, and all is good. 
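The query construction described above can also be pulled out into a small standalone helper, which makes it testable without any network access. This is only a sketch: `build_query` is a hypothetical name, and the sample metadata is invented to mimic the shape of the `variables` list used in the code.

```python
from math import prod

def build_query(variables):
    """Build a query body that selects every value of every variable."""
    query_items = [
        {
            "code": var["code"],
            "selection": {"filter": "item", "values": var["values"]},
        }
        for var in variables
    ]
    # The number of requested cells is the product of the value counts
    total_values = prod(len(var["values"]) for var in variables)
    query = {"query": query_items, "response": {"format": "json"}}
    return query, total_values

# Invented metadata in the same shape as the 'variables' list from the API
sample_variables = [
    {"code": "Quarter", "values": ["2020Q1", "2020Q2", "2020Q3"]},
    {"code": "Region", "values": ["a", "b"]},
]

query, total = build_query(sample_variables)
print(total)
print(query)
```

Returning the value count together with the query keeps the size check next to the place where the query is built.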

I also added a check to make sure that we get data (`status_code == 200`) before writing it to disk.

And I also changed the hardcoded URLs into a variable that was reused throughout the code. 

This was probably the easiest of the problems in this exercise set, as it is the closest to what I have experience with. No major difficulties were encountered along the way, though I did debug my own code by modifying and running the original code at some points. The first POST query gave me a response code of HTTP 400, but a comparison of the query JSON produced by the original and my own code revealed that I had forgotten to put a "filter" key in each query item. With that fixed, everything worked like a charm. Overall, a fairly simple and useful exercise. 

The downloaded file is 499 KB in size, but I hope that it's close enough for the requirements of the exercise (500 KB to 2 MB of data). 