# Milestone 1

In this milestone, we use the Figshare API to download the data that we will analyze in upcoming milestones.

## Downloading the data from Figshare

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

In [2]:
# Daily rainfall over NSW, Australia
# https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681
article_id = 14096681
# Metadata for the download
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "../data"

In [3]:
# List files in the associated dataset
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early on an HTTP error
data = response.json()
files = data["files"]
files

[{'id': 26579150,
  'name': 'daily_rainfall_2014.png',
  'size': 58863,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e'},
 {'id': 26579171,
  'name': 'environment.yml',
  'size': 192,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34'},
 {'id': 26586554,
  'name': 'README.md',
  'size': 5422,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c'},
 {'id': 26766812,
  'name': 'data.zip',
  'size': 814041183,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26766812',
  'supplied_md5': 'b517383f76e77bd03755a63a8f

In [4]:
# Retrieve `data.zip`
files_to_dl = ["data.zip"]
for f in files:
    if f["name"] in files_to_dl:
        if os.path.isfile(f"{output_directory}/{f['name']}"):
            print(f"Skipping {f['name']}, file exists")
        else:
            os.makedirs(output_directory, exist_ok=True)
            urlretrieve(f["download_url"], f"{output_directory}/{f['name']}")

Skipping data.zip, file exists
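
The file listing returned by the API includes a `supplied_md5` value for each file, so the download can be checked before it is extracted. The following is a minimal sketch of such a check, assuming that `supplied_md5` is the hex MD5 digest of the file contents. The helper names `md5_of_file` and `is_download_intact` are hypothetical and not part of the notebook.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in fixed-size chunks so a large archive is never fully in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_download_intact(path, expected_md5):
    """Compare the local digest with the value taken from the file listing."""
    return md5_of_file(path) == expected_md5.lower()
```

For the matching entry `f` in `files`, this could be called as `is_download_intact(f"{output_directory}/{f['name']}", f["supplied_md5"])`.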


In [5]:
# Extract `data.zip`
output_zip_file = os.path.join(output_directory, "data.zip")
with zipfile.ZipFile(output_zip_file, 'r') as f:
    f.extractall(output_directory)

## Combine CSV files

In [6]:
# The path of the combined CSV
combined_path = f'{output_directory}/combined_data.csv'

In [7]:
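# Gather the list of CSV files to merge, excluding the observed data
# and any combined file left over from a previous run
excluded = {f'{output_directory}/observed_daily_rainfall_SYD.csv', combined_path}
files = [f for f in glob.glob(f'{output_directory}/*.csv') if f not in excluded]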
files

['../data/NorESM2-MM_daily_rainfall_NSW.csv',
 '../data/INM-CM4-8_daily_rainfall_NSW.csv',
 '../data/AWI-ESM-1-1-LR_daily_rainfall_NSW.csv',
 '../data/ACCESS-ESM1-5_daily_rainfall_NSW.csv',
 '../data/MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv',
 '../data/NorESM2-LM_daily_rainfall_NSW.csv',
 '../data/CMCC-CM2-HR4_daily_rainfall_NSW.csv',
 '../data/TaiESM1_daily_rainfall_NSW.csv',
 '../data/FGOALS-g3_daily_rainfall_NSW.csv',
 '../data/CMCC-ESM2_daily_rainfall_NSW.csv',
 '../data/CMCC-CM2-SR5_daily_rainfall_NSW.csv',
 '../data/CanESM5_daily_rainfall_NSW.csv',
 '../data/MPI-ESM1-2-LR_daily_rainfall_NSW.csv',
 '../data/BCC-CSM2-MR_daily_rainfall_NSW.csv',
 '../data/SAM0-UNICON_daily_rainfall_NSW.csv',
 '../data/ACCESS-CM2_daily_rainfall_NSW.csv',
 '../data/INM-CM5-0_daily_rainfall_NSW.csv',
 '../data/BCC-ESM1_daily_rainfall_NSW.csv',
 '../data/MPI-ESM1-2-HR_daily_rainfall_NSW.csv',
 '../data/NESM3_daily_rainfall_NSW.csv',
 '../data/EC-Earth3-Veg-LR_daily_rainfall_NSW.csv',
 '../data/GFDL-CM4_da

In [8]:
columns_to_merge = ["time", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]

In [9]:
%%time
# A Pythonic way (but not the most memory-efficient way) for merging the data
df = pd.concat((pd.read_csv(f, index_col=0, usecols=columns_to_merge)
                .assign(model=re.findall(r"/([^_/]*)", f)[-1])
                for f in files)
              )
df.to_csv(combined_path)

CPU times: user 7min 49s, sys: 39.9 s, total: 8min 28s
Wall time: 8min 40s
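
The comment in the cell above notes that `pd.concat` is not the most memory-efficient way to merge the data, because every file is held in memory at once. A possible alternative, sketched here under assumptions, is to append each file to the combined CSV as soon as it has been read. The function name `combine_csvs` is hypothetical. The sketch also assumes that every file name starts with the model name followed by an underscore, as in the listing above, and it uses `Path(...).name` rather than the regular expression because `glob` can return backslash separators on Windows.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(csv_paths, combined_path, columns):
    """Append each CSV to combined_path, holding one file in memory at a time."""
    first = True
    for path in csv_paths:
        # Model name is the part of the file name before the first underscore
        model = Path(path).name.split("_")[0]
        part = pd.read_csv(path, index_col=0, usecols=columns).assign(model=model)
        # Write the header once, then append the remaining files without it
        part.to_csv(combined_path, mode="w" if first else "a", header=first)
        first = False
```

With the names used in this notebook, the call would be `combine_csvs(files, combined_path, columns_to_merge)`. Unlike the cell above, this does not leave a combined `df` in memory afterwards.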


In [10]:
%%sh
du -sh ../data/combined_data.csv

5.6G	../data/combined_data.csv


In [11]:
df.shape

(62467843, 6)

In [12]:
df.head()

time                    lat_min    lat_max  lon_min  lon_max  rain (mm/day)       model
1889-01-01 12:00:00  -35.811518  -34.86911  140.625  141.875       0.513492  NorESM2-MM
1889-01-02 12:00:00  -35.811518  -34.86911  140.625  141.875       0.000923  NorESM2-MM
1889-01-03 12:00:00  -35.811518  -34.86911  140.625  141.875          9e-06  NorESM2-MM
1889-01-04 12:00:00  -35.811518  -34.86911  140.625  141.875        2.5e-05  NorESM2-MM
1889-01-05 12:00:00  -35.811518  -34.86911  140.625  141.875        1.3e-05  NorESM2-MM


> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken  |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-----------:|
> | Chen, Ziyi   | macOS 13.2.1     | 32GB | M1 (10 core)      | YES    |  3min 30s   |
> | Guron, Mike  | Windows 11       | 16GB | Intel i7-12700H   | YES    |  5min 56s   |
> | Raina, Roan  | macOS 13.2.1     | 16GB | M2 (8 core)       | YES    |  3min 17s   |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    |  6min 34s   |
> 
> Table 1: Time taken to combine the CSV files

#### Observations from Combining Data

Table 1 above summarizes the results of the time trials for combining the data on our different computers. The run time varies with the **Operating System** and/or **Processor**. The two macOS computers with M1/M2 processors were the fastest, at 3min 30s and 3min 17s. The M2 was only about 6% faster than the M1, and the difference in RAM between these two computers (32GB with the M1 vs. 16GB with the M2) did not appear to have an impact either. It is possible that the advances in the M2 are masking a benefit from the additional RAM, but this cannot be determined from the testing completed above.

Furthermore, the Windows computer with an Intel i7 processor was noticeably slower than the macOS computers, taking about 1.7 to 1.8 times as long to combine the files. Finally, the Linux computer with an AMD processor was the slowest of our four computers at this task: it took about 11% longer than the Windows computer and roughly twice as long as the macOS computers.

It should be noted that, given the specifications of the four computers and the testing format, it is not possible to determine directly whether the differences in run times are due to the **Operating System** or the **Processor**, because each operating system is paired with a different processor family. The similar times for the M1 and M2 are consistent with the operating system accounting for much of the difference, but the difference in RAM between these two computers makes it difficult to conclude this with confidence.
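
The relative comparisons for Table 1 can be reproduced with a few lines of arithmetic. This is a small sketch only: the helper `to_seconds` and the dictionary labels are hypothetical names introduced for illustration, and the timings are copied from Table 1.

```python
import re


def to_seconds(label):
    """Convert a timing label such as '5min 56s' or '31.6s' to seconds."""
    min_match = re.search(r"(\d+)\s*min", label)
    sec_match = re.search(r"([\d.]+)\s*s\b", label)
    total = 0.0
    if min_match:
        total += 60 * int(min_match.group(1))
    if sec_match:
        total += float(sec_match.group(1))
    return total


# Timings copied from Table 1
combine_times = {
    "macOS / M1": "3min 30s",
    "Windows 11 / i7-12700H": "5min 56s",
    "macOS / M2": "3min 17s",
    "Linux Mint 21 / Ryzen 5 3500U": "6min 34s",
}

seconds_by_machine = {name: to_seconds(t) for name, t in combine_times.items()}
fastest = min(seconds_by_machine.values())

# Express every run as a multiple of the fastest run
for name, secs in seconds_by_machine.items():
    print(f"{name}: {secs:.0f}s, {secs / fastest:.2f}x the fastest run")
```

The same helper works for the sub-minute timings in the later tables, for example `to_seconds("31.6s")`.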

## EDA

### Baseline

This is the baseline time needed to load the CSV file as-is.

In [13]:
%%time
df = pd.read_csv(f"{output_directory}/combined_data.csv")

CPU times: user 1min 18s, sys: 23.4 s, total: 1min 41s
Wall time: 1min 45s


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float64
 2   lat_max        float64
 3   lon_min        float64
 4   lon_max        float64
 5   rain (mm/day)  float64
 6   model          object 
dtypes: float64(5), object(2)
memory usage: 3.3+ GB


> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   | macOS 13.2.1     | 32GB | M1 (10 core)      | YES    | 34s           | 3.3+ GB      |
> | Guron, Mike  | Windows 11       | 16GB | Intel i7-12700H   | YES    | 1min 16s      | 3.3+ GB      |
> | Raina, Roan  | macOS 13.2.1     | 16GB | M2 (8 core)       | YES    | 31.6s         | 3.3+ GB      |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 1min 28s      | 3.3+ GB      |
> 
> Table 2: Time taken to read the combined CSV (baseline)

#### Observations from Baseline

Table 2 above summarizes the results of the time trials for loading the file on our different computers. The run time again varies with the **Operating System** and/or **Processor**. The macOS computers with M1/M2 processors were again the fastest: they took a little under half as long as the Windows computer with an Intel processor, and a little over a third as long as the Linux computer with an AMD processor, to load the combined CSV file into memory.

We can also see here that the M2 was only slightly faster than the M1 (31.6s vs. 34s), but this could again be related to the difference in RAM between these two computers, as noted previously.

These results will be used as the baseline for comparisons of run times utilizing different approaches to reduce memory usage while performing EDA below.

### Approach 1: Change the `dtype` of the data

By default, pandas reads the numeric columns as `float64` if we do not specify a `dtype`. First, we check whether switching to `float32` gives a smaller memory footprint as well as a faster load time.

In [15]:
%%time
df_float32 = pd.read_csv(f"{output_directory}/combined_data.csv", dtype={
    'lat_min': 'float32',
    'lat_max': 'float32',
    'lon_min': 'float32',
    'lon_max': 'float32',
    'rain (mm/day)': 'float32'
})

CPU times: user 1min 17s, sys: 16 s, total: 1min 33s
Wall time: 1min 35s


In [16]:
df_float32.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float32
 2   lat_max        float32
 3   lon_min        float32
 4   lon_max        float32
 5   rain (mm/day)  float32
 6   model          object 
dtypes: float32(5), object(2)
memory usage: 2.1+ GB


In [17]:
# Clear it from memory after the experiment
df_float32 = None

> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   | macOS 13.2.1     | 32GB | M1 (10 core)      | YES    | 31.7s         | 2.1+ GB      |
> | Guron, Mike  | Windows 11       | 16GB | Intel i7-12700H   | YES    | 1min 11s      | 2.1+ GB      |
> | Raina, Roan  | macOS 13.2.1     | 16GB | M2 (8 core)       | YES    | 30.5s         | 2.1+ GB      |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 1min 22s      | 2.1+ GB      |
> 
> Table 3: Time taken to read the combined CSV (approach 1: use `float32` instead of `float64`)

#### Observations from Approach 1

Table 3 above summarizes the results of the time trials for loading the file on our different computers after switching the data type from `float64` to `float32` for the numeric columns. The same trend in run-time differences between the computers due to differing **Operating Systems** and **Processors** is still observed, so we focus on how each computer compares to its own baseline run time for loading the file.

We can see that the memory usage has been reduced from 3.3+ GB to 2.1+ GB across all four computers; however, the reduction in run time was small for all four computers compared to the baseline results. The Linux, Windows, and M1 computers each loaded the file about 7% faster, while the fastest computer at baseline (macOS with M2) had the smallest reduction at about 3.5%. In absolute terms, the savings scale with the baseline run times: about 6s on the slowest computer (Linux OS with AMD processor) versus about 1s on the fastest.
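
The 3.3+ GB and 2.1+ GB figures reported by `df.info()` can be reproduced from the row count and the column types. The sketch below assumes that `info()` counts 8 bytes per value for `float64` columns and per pointer for `object` columns, 4 bytes per value for `float32`, and reports sizes in binary units (1 GB = 1024³ bytes); the trailing `+` signals that the string contents of the `object` columns are not included. The name `footprint_bytes` is hypothetical.

```python
N_ROWS = 62_467_843  # row count reported by df.info()
GIB = 1024 ** 3


def footprint_bytes(bytes_per_column):
    """Shallow memory estimate: rows times the summed per-value widths."""
    return N_ROWS * sum(bytes_per_column)


# Baseline: 5 float64 columns + 2 object columns (8-byte pointers)
baseline = footprint_bytes([8] * 5 + [8] * 2)

# Approach 1: 5 float32 columns + 2 object columns
as_float32 = footprint_bytes([4] * 5 + [8] * 2)

print(f"baseline: {baseline / GIB:.1f} GB")
print(f"float32:  {as_float32 / GIB:.1f} GB")
```

Because the two `object` columns are unchanged, halving the width of the five numeric columns removes a little over a third of the reported footprint rather than half of it.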

### Approach 2: Load only the column(s) we want

The dataset contains several columns that we may not all need at the same time. In this approach, we load just one column (`rain (mm/day)` here) from the combined CSV file.

In [18]:
%%time
df_only_rain = pd.read_csv(f"{output_directory}/combined_data.csv", usecols=["rain (mm/day)"])

CPU times: user 30.7 s, sys: 5.43 s, total: 36.2 s
Wall time: 37.3 s


In [19]:
df_only_rain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 1 columns):
 #   Column         Dtype  
---  ------         -----  
 0   rain (mm/day)  float64
dtypes: float64(1)
memory usage: 476.6 MB


In [20]:
# Clear it from memory after the experiment
df_only_rain = None

> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   | macOS 13.2.1     | 32GB | M1 (10 core)      | YES    | 16.1s         | 476.6 MB     |
> | Guron, Mike  | Windows 11       | 16GB | Intel i7-12700H   | YES    | 37.1s         | 476.6 MB     |
> | Raina, Roan  | macOS 13.2.1     | 16GB | M2 (8 core)       | YES    | 14.9s         | 476.6 MB     |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 36.7s         | 476.6 MB     |
> 
> Table 4: Time taken to read the combined CSV (approach 2: just load `rain (mm/day)`)

#### Observations from Approach 2

Table 4 above summarizes the results of the time trials for loading the file on our different computers when loading just one column of the file. The same trend in run-time differences between the computers due to differing **Operating Systems** and **Processors** is still observed, so we focus on how each computer compares to its own baseline run time for loading the file.

We can see that the memory usage has been reduced even further, from 3.3+ GB for the whole file to just 476.6 MB, and the run time has been cut by a little more than half on all four computers compared to the baseline. The slowest computer at baseline (Linux OS with AMD processor) had the largest reduction at about 58%, while the other three computers saw reductions of about 51% to 53%.
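
The two approaches are not mutually exclusive: `usecols` and `dtype` can be passed to `pd.read_csv` in the same call. We did not time this combination, so the sketch below only illustrates the call on a two-row in-memory sample whose values are copied from the `df.head()` output above. The names `sample_csv` and `df_small` are hypothetical.

```python
import io

import pandas as pd

# Two rows copied from the head of the combined data, held in memory
sample_csv = io.StringIO(
    "time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model\n"
    "1889-01-01 12:00:00,-35.811518,-34.86911,140.625,141.875,0.513492,NorESM2-MM\n"
    "1889-01-02 12:00:00,-35.811518,-34.86911,140.625,141.875,0.000923,NorESM2-MM\n"
)

# Load only the rainfall column and store it as float32
df_small = pd.read_csv(
    sample_csv,
    usecols=["rain (mm/day)"],
    dtype={"rain (mm/day)": "float32"},
)

print(df_small.dtypes)
print(df_small.memory_usage(index=False))
```

On the full file, the same arguments would be passed together with the path `f"{output_directory}/combined_data.csv"`.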

## EDA in R

Here, we perform the EDA in R instead of Python. We export our data frame as a Parquet file so that it can be processed in R.

In [21]:
%load_ext rpy2.ipython

In [22]:
%%time
df.to_parquet(f"{output_directory}/combined_data.parquet")

CPU times: user 21.6 s, sys: 9.08 s, total: 30.7 s
Wall time: 30.9 s


In [23]:
# Clear the Python dataset from memory
df = None

> Why did we choose Parquet?
> 
> (WIP)

In [24]:
%%R
library(dplyr)
library(arrow)

R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


R[write to console]: 
Attaching package: ‘arrow’


R[write to console]: The following object is masked from ‘package:utils’:

    timestamp




In [25]:
%%time
%%R
r_parquet <- open_dataset("../data/combined_data.parquet")
r_df <- r_parquet |> collect()

CPU times: user 10.8 s, sys: 7.43 s, total: 18.3 s
Wall time: 8.1 s


In [26]:
%%R
r_df |> dim()

[1] 62467843        7


In [27]:
%%R
class(r_df)

[1] "tbl_df"     "tbl"        "data.frame"


In [28]:
%%time
%%R
r_df |> glimpse()

Rows: 62,467,843
Columns: 7
$ time            <chr> "1889-01-01 12:00:00", "1889-01-02 12:00:00", "1889-01…
$ lat_min         <dbl> -35.81152, -35.81152, -35.81152, -35.81152, -35.81152,…
$ lat_max         <dbl> -34.86911, -34.86911, -34.86911, -34.86911, -34.86911,…
$ lon_min         <dbl> 140.625, 140.625, 140.625, 140.625, 140.625, 140.625, …
$ lon_max         <dbl> 141.875, 141.875, 141.875, 141.875, 141.875, 141.875, …
$ `rain (mm/day)` <dbl> 5.134920e-01, 9.230450e-04, 9.390591e-06, 2.520761e-05…
$ model           <chr> "NorESM2-MM", "NorESM2-MM", "NorESM2-MM", "NorESM2-MM"…
CPU times: user 16.5 s, sys: 1.33 s, total: 17.9 s
Wall time: 17.7 s


In [29]:
%%time
%%R
r_df |> head()

# A tibble: 6 × 7
  time                lat_min lat_max lon_min lon_max `rain (mm/day)` model     
  <chr>                 <dbl>   <dbl>   <dbl>   <dbl>           <dbl> <chr>     
1 1889-01-01 12:00:00   -35.8   -34.9    141.    142.      0.513      NorESM2-MM
2 1889-01-02 12:00:00   -35.8   -34.9    141.    142.      0.000923   NorESM2-MM
3 1889-01-03 12:00:00   -35.8   -34.9    141.    142.      0.00000939 NorESM2-MM
4 1889-01-04 12:00:00   -35.8   -34.9    141.    142.      0.0000252  NorESM2-MM
5 1889-01-05 12:00:00   -35.8   -34.9    141.    142.      0.0000133  NorESM2-MM
6 1889-01-06 12:00:00   -35.8   -34.9    141.    142.      0.0000129  NorESM2-MM
CPU times: user 111 ms, sys: 10.6 ms, total: 122 ms
Wall time: 125 ms


In [30]:
%%time
%%R
r_df |> tail()

# A tibble: 6 × 7
  time                lat_min lat_max lon_min lon_max `rain (mm/day)` model     
  <chr>                 <dbl>   <dbl>   <dbl>   <dbl>           <dbl> <chr>     
1 2014-12-26 12:00:00   -29.9   -29.1    153.    154.          0.196  FGOALS-f3…
2 2014-12-27 12:00:00   -29.9   -29.1    153.    154.          9.80   FGOALS-f3…
3 2014-12-28 12:00:00   -29.9   -29.1    153.    154.         11.0    FGOALS-f3…
4 2014-12-29 12:00:00   -29.9   -29.1    153.    154.          0.706  FGOALS-f3…
5 2014-12-30 12:00:00   -29.9   -29.1    153.    154.          0.840  FGOALS-f3…
6 2014-12-31 12:00:00   -29.9   -29.1    153.    154.          0.0508 FGOALS-f3…
CPU times: user 73.6 ms, sys: 13.6 ms, total: 87.1 ms
Wall time: 86.6 ms
