# Milestone 1

In this milestone, we use the Figshare API to download the data that we will analyze in upcoming milestones.

## Downloading the data from Figshare

In [None]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

In [None]:
# Daily rainfall over NSW, Australia
# https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681
article_id = 14096681
# Metadata for the download
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "../data"

In [None]:
# List the files in the associated dataset
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early on an HTTP error
data = response.json()
files = data["files"]
files

In [None]:
# Retrieve `data.zip`
files_to_dl = ["data.zip"]
for f in files:
    if f["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(f["download_url"], f"{output_directory}/{f['name']}")

In [None]:
# Extract `data.zip`
output_zip_file = os.path.join(output_directory, "data.zip")
with zipfile.ZipFile(output_zip_file, 'r') as f:
    f.extractall(output_directory)

## Combine CSV files

In [None]:
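# Gather the list of CSV files to merge, skipping the observed data and
# any combined file left over from an earlier run of this notebook
excluded = {"observed_daily_rainfall_SYD.csv", "combined_data.csv"}
files = [f for f in glob.glob(f'{output_directory}/*.csv')
         if os.path.basename(f) not in excluded]
# files = files[0:1]
files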

In [None]:
columns_to_merge = ["time", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]

combined_path = f'{output_directory}/combined_data.csv'

In [None]:
%%time

# files = glob.glob('dailyrainfall/*.csv')
# df = pd.concat((pd.read_csv(files, index_col=0)
#                 .assign(model=re.findall("/([^_]*)", file)[0])
#                 for file in files)
#               )

# A Pythonic way (but not the most memory-efficient way) for merging the data
df = pd.concat((pd.read_csv(f, index_col=0, usecols=columns_to_merge)
                .assign(model=f[len(output_directory)+1:-len("_daily_rainfall_NSW.csv")])
                for f in files)
              )
df.to_csv(combined_path)
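
The comment above notes that `pd.concat` is not the most memory-efficient way to merge the files, because every file is held in memory at once. A minimal sketch of a streaming alternative is shown below. It is not the approach timed in this milestone: the helper name `merge_csvs_streaming`, the chunk size, and the two tiny demo files are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def merge_csvs_streaming(paths, combined_path, usecols, suffix, chunksize=100_000):
    """Append each CSV to `combined_path` chunk by chunk (hypothetical helper)."""
    first_chunk = True
    for path in paths:
        # Model name = file name without the shared suffix
        model = os.path.basename(path)[: -len(suffix)]
        for chunk in pd.read_csv(path, usecols=usecols, chunksize=chunksize):
            chunk.assign(model=model).to_csv(
                combined_path,
                mode="w" if first_chunk else "a",  # overwrite once, then append
                header=first_chunk,                # write the header only once
                index=False,
            )
            first_chunk = False


# Demo on two small made-up files (not the real rainfall data)
suffix = "_daily_rainfall_NSW.csv"
with tempfile.TemporaryDirectory() as tmp:
    demo_files = {
        "modelA": ["1889-01-01", "1889-01-02", "1889-01-03"],
        "modelB": ["1889-01-01", "1889-01-02"],
    }
    paths = []
    for name, times in demo_files.items():
        path = os.path.join(tmp, name + suffix)
        pd.DataFrame({"time": times, "rain (mm/day)": [0.5] * len(times)}).to_csv(
            path, index=False
        )
        paths.append(path)

    out = os.path.join(tmp, "combined_data.csv")
    merge_csvs_streaming(paths, out, ["time", "rain (mm/day)"], suffix, chunksize=2)
    combined = pd.read_csv(out)

print(combined)
```

Only one chunk is in memory at any time, at the cost of more write calls. The combined data frame is also not available in memory afterwards, so it would have to be read back for later cells.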

In [None]:
%%sh
du -sh ../data/combined_data.csv

In [None]:
print(df.shape)

In [None]:
df.head()

> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken  |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-----------:|
> | Chen, Ziyi   | macOS 13.2.1     | 32GB | M1 (10 processors)| YES    |  3min 30s   |
> | Guron, Mike  | Windows 11       | 16GB | Intel i7-12700H   | YES    |  5min 56s   |
> | Raina, Roan  | macOS 13.2.1     | 16GB | M2 (8 core)       | YES    |  3min 17s   |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    |  10min 6s   |
> 
> Table 1: Time taken to combine the CSV files

#### Observations from Combining Data

Table 1 above summarizes the time each of our computers took to combine the files. Run time clearly varies with the **Operating System** and/or **Processor**. The two macOS computers with M1/M2 processors were the fastest. Interestingly, the M2 was only marginally faster than the M1 (3min 17s vs. 3min 30s), and the difference in RAM between these two computers (32GB with the M1 vs. 16GB with the M2) did not appear to have an impact either. It is possible that improvements in the M2 mask a benefit from the extra RAM, but that cannot be determined from the testing above.

Furthermore, the Windows computer with an Intel i7 processor was considerably slower than the macOS computers with M1/M2 processors, taking almost twice as long to combine the files. Finally, the Linux computer with an AMD processor was the slowest of the four: it took about 70% longer than the Windows computer and about three times as long as the macOS computers.

Note that, given the specifications of the four computers and the testing format, we cannot directly determine whether the differences in run time are due to the **Operating System** or the **Processor**, because each computer differs in both. The similar times for the M1 and M2 processors suggest that the operating system may account for the largest difference, but the difference in RAM between these two computers makes it difficult to say so with confidence.
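
The relative differences discussed above can be recomputed directly from Table 1. The sketch below is only a convenience for that arithmetic: the helper `to_seconds` and the short machine labels are our own naming, and the times are copied from the table.

```python
import re


def to_seconds(text):
    """Convert strings such as '3min 30s' or '31.6s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)min\s*)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized duration: {text!r}")
    minutes, secs = match.groups()
    return int(minutes or 0) * 60 + float(secs)


# Times copied from Table 1
table_1 = {
    "macOS / M1": "3min 30s",
    "Windows / i7": "5min 56s",
    "macOS / M2": "3min 17s",
    "Linux / Ryzen 5": "10min 6s",
}

seconds = {machine: to_seconds(t) for machine, t in table_1.items()}
fastest = min(seconds.values())

for machine, secs in seconds.items():
    print(f"{machine:<16} {secs:6.0f}s  {secs / fastest:.2f}x the fastest run")

# How much longer the Linux run took than the Windows run, as a fraction
linux_vs_windows = seconds["Linux / Ryzen 5"] / seconds["Windows / i7"] - 1
print(f"Linux vs. Windows: {linux_vs_windows:.0%} longer")
```

The same helper also parses the times in the later tables, for example `to_seconds("1min 16s")`.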

## EDA

### Baseline

This is the baseline time needed to load the CSV file as-is.

In [None]:
%%time
df = pd.read_csv(f"{output_directory}/combined_data.csv")

In [None]:
df.info()

> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   | macOS 13.2.1     | 32GB | M1 (10 processors)| YES    | 34s           | 3.3+ GB      |
> | Guron, Mike  | Windows 11       | 16GB | Intel i7-12700H   | YES    | 1min 16s      | 3.3+ GB      |
> | Raina, Roan  | macOS 13.2.1     | 16GB | M2 (8 core)       | YES    | 31.6s         | 3.3+ GB      |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 2min 45s      | 3.3+ GB      |
> 
> Table 2: Time taken to read the combined CSV (baseline)

#### Observations from Baseline

Table 2 above summarizes the time each of our computers took to load the combined file. The **Operating System** and/or **Processor** again affects the load time. The macOS computers with M1/M2 processors were again the fastest: they took a little under half as long as the Windows computer with the Intel processor, and about one fifth as long as the Linux computer with the AMD processor, to load the combined CSV file into memory.

As before, the M2 processor was not significantly faster than the M1, although the difference in RAM between these two computers may again play a role, as noted previously.

These results will be used as the baseline for comparisons of run times utilizing different approaches to reduce memory usage while performing EDA below.
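
The `+` in the memory figures reported by `df.info()` means that the number is a lower bound: for `object` columns, pandas counts only the 8-byte references and not the Python strings they point to. A minimal sketch on a small made-up frame (the values and the `modelA` label are illustrative, not taken from the dataset) shows the gap between the shallow and the deep measurement:

```python
import pandas as pd

n_rows = 1_000
demo = pd.DataFrame({
    # Force object dtype so the strings are stored as Python objects
    "time": pd.Series(["1889-01-01 12:00:00"] * n_rows, dtype=object),
    "rain (mm/day)": [0.5] * n_rows,
    "model": pd.Series(["modelA"] * n_rows, dtype=object),
})

# Shallow: 8 bytes per value in every column, including object references
shallow = demo.memory_usage(index=False, deep=False).sum()
# Deep: additionally counts the size of each Python string
deep = demo.memory_usage(index=False, deep=True).sum()

print(f"shallow: {shallow:,} bytes")
print(f"deep:    {deep:,} bytes")
```

On the real data frame, `df.info(memory_usage="deep")` reports the deep figure, but it can take noticeably longer on a frame of this size.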

### Approach 1: Change the `dtype` of the data

By default, pandas reads floating-point columns as `float64` when no `dtype` is specified. First, we check whether switching to `float32` gives a smaller memory footprint as well as a faster load time.

In [None]:
%%time
df_float32 = pd.read_csv(f"{output_directory}/combined_data.csv", dtype={
    'lat_min': 'float32',
    'lat_max': 'float32',
    'lon_min': 'float32',
    'lon_max': 'float32',
    'rain (mm/day)': 'float32'
})

In [None]:
df_float32.info()

> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   | macOS 13.2.1     | 32GB | M1 (10 processors)| YES    | 31.7s         | 2.1+ GB      |
> | Guron, Mike  | Windows 11       | 16GB | Intel i7-12700H   | YES    | 1min 11s      | 2.1+ GB      |
> | Raina, Roan  | macOS 13.2.1     | 16GB | M2 (8 core)       | YES    | 30.5s         | 2.1+ GB      |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 2min 28s      | 2.1+ GB      |
> 
> Table 3: Time taken to read the combined CSV (approach 1: use `float32` instead of `float64`)

#### Observations from Approach 1

Table 3 above summarizes the time each of our computers took to load the file after switching the data type of the numeric columns from `float64` to `float32`. The same differences in run time between the computers, due to differing **Operating Systems** and **Processors**, are still observed, so we focus on how each computer compares to its own baseline run time for loading the file.

Memory usage was successfully reduced from 3.3+ GB to 2.1+ GB on all four computers; however, the reduction in run time was small (roughly 3.5% to 10%) compared to the baseline results. The reductions roughly scale with the baseline run times: the slowest computer at baseline (Linux with the AMD processor) had the largest reduction at about 10%, while the fastest computer at baseline (macOS with the M2) had the smallest at about 3.5%.
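
The effect of the narrower dtype can be reproduced on a small synthetic frame. The sketch below uses random values as a stand-in for the real data (the value ranges and row count are assumptions) and compares the footprint of `float64` and `float32`, along with the rounding error that the conversion introduces:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000

# Synthetic stand-in for two of the numeric columns
demo = pd.DataFrame({
    "lat_min": rng.uniform(-36.0, -30.0, n_rows),
    "rain (mm/day)": rng.exponential(2.0, n_rows),
})
demo32 = demo.astype("float32")

bytes64 = demo.memory_usage(index=False).sum()
bytes32 = demo32.memory_usage(index=False).sum()

# Largest absolute difference introduced by storing the values as float32
max_abs_error = (demo - demo32.astype("float64")).abs().to_numpy().max()

print(f"float64: {bytes64:,} bytes")
print(f"float32: {bytes32:,} bytes")
print(f"largest rounding error: {max_abs_error:.2e}")
```

Each converted column needs half the memory, while values of this magnitude lose precision only far beyond the decimal places that matter for daily rainfall totals.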

### Approach 2: Load only column(s) we want

The dataset contains several columns that we may not need all at once. In this approach, we load just one column from the combined CSV file.

In [None]:
%%time
df_only_rain = pd.read_csv(f"{output_directory}/combined_data.csv", usecols=["rain (mm/day)"])

In [None]:
df_only_rain.info()

> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   | macOS 13.2.1     | 32GB | M1 (10 processors)| YES    | 16.1s         | 476.6 MB     |
> | Guron, Mike  | Windows 11       | 16GB | Intel i7-12700H   | YES    | 37.1s         | 476.6 MB     |
> | Raina, Roan  | macOS 13.2.1     | 16GB | M2 (8 core)       | YES    | 14.9s         | 476.6 MB     |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 1min 6s       | 476.6 MB     |
> 
> Table 4: Time taken to read the combined CSV (approach 2: just load `rain (mm/day)`)

#### Observations from Approach 2

Table 4 above summarizes the time each of our computers took to load just one column of the file. The same differences in run time between the computers, due to differing **Operating Systems** and **Processors**, are still observed, so we focus on how each computer compares to its own baseline run time for loading the file.

Memory usage was reduced even further, from 3.3+ GB for the whole file to just 476.6 MB, and run times dropped substantially: the time to load the file was cut by half or more on all four computers compared to the baseline. The slowest computer at baseline (Linux with the AMD processor) again had the largest reduction at about 60%, while the other three computers had reductions of about 51% to 53%.
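
The memory figures in Tables 2 to 4 are consistent with a simple back-of-the-envelope estimate. The sketch below rests on several assumptions: that `df.info()` reports sizes in multiples of 1024, that the single `float64` column in Table 4 uses 8 bytes per row, and that the `time` and `model` columns are `object` columns counted at 8 bytes per row (which is what the `+` suggests). The row count is therefore an estimate, not a measured value.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Table 4: one float64 column (8 bytes per row) takes 476.6 MB
estimated_rows = 476.6 * MIB / 8

# Baseline: 5 float64 columns + 2 object columns, 8 bytes each
baseline_bytes = estimated_rows * (5 * 8 + 2 * 8)

# Approach 1: 5 float32 columns (4 bytes each) + 2 object columns (8 bytes each)
float32_bytes = estimated_rows * (5 * 4 + 2 * 8)

print(f"estimated rows:    {estimated_rows:,.0f}")
print(f"baseline estimate: {baseline_bytes / GIB:.1f} GB")
print(f"float32 estimate:  {float32_bytes / GIB:.1f} GB")
```

Under these assumptions the estimates match the reported 3.3+ GB and 2.1+ GB, and they explain why `float32` did not halve the total: the two `object` columns keep the same shallow size.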

## EDA in R

Here, we perform the EDA in R instead of Python. We export our data frame as a Parquet file for processing in R.

In [None]:
%load_ext rpy2.ipython

In [None]:
%%time
df.to_parquet(f"{output_directory}/combined_data.parquet")

> Why did we choose Parquet?
> 
> (WIP)

In [None]:
%%R
library(dplyr)
library(arrow)

In [None]:
%%R
r_parquet <- open_dataset("../data/combined_data.parquet")
r_df <- r_parquet |> collect()

In [None]:
%%time
%%R
r_df |> str()

In [None]:
%%time
%%R
r_df |> summary()

In [None]:
%%time
%%R
r_df |> head()

In [None]:
%%time
%%R
r_df |> tail()