# Milestone 1

## Purpose:
In this notebook, we will attempt to download a data dump containing daily rainfall over NSW, Australia dataset found on [figshare](https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681). This data dump is comprised of different files contained in a 776.4 MB compressed format. This size of the uncompressed data dump is approximately 6.6 GB. 

In [44]:
# Imports
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
from memory_profiler import memory_usage

In [45]:
%load_ext rpy2.ipython
%load_ext memory_profiler

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler


## Downloading the data

### Creating the API request

We will use the [requests library](https://docs.python-requests.org/en/master/) to generate an http request call to the [figshare API](https://docs.figshare.com/). Specifically, we will make a call to the `/articles` endpoint to retrieve information about the article of interest including the url we need to download the data.

In [46]:
article_id = 14096681
base_url = 'https://api.figshare.com/v2'
headers = {"Content-Type": "application/json"}
endpoint = f'/articles/{article_id}'
output_directory = "rainfall/"

### Making the API call

In [47]:
response = requests.request("GET", base_url+endpoint, headers=headers)
data = json.loads(response.text)
data

{'defined_type_name': 'dataset',
 'embargo_date': None,
 'citation': 'Beuzen, Tomas (2021): Daily rainfall over NSW, Australia. figshare. Dataset. https://doi.org/10.6084/m9.figshare.14096681.v3',
 'url_private_api': 'https://api.figshare.com/v2/account/articles/14096681',
 'embargo_reason': '',
 'references': ['https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6',
  'https://pangeo-data.github.io/pangeo-cmip6-cloud/',
  'https://www.longpaddock.qld.gov.au/silo/'],
 'funding_list': [],
 'url_public_api': 'https://api.figshare.com/v2/articles/14096681',
 'id': 14096681,
 'custom_fields': [],
 'size': 814109773,
 'metadata_reason': '',
 'funding': None,
 'figshare_url': 'https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681',
 'embargo_type': 'file',
 'title': 'Daily rainfall over NSW, Australia',
 'defined_type': 3,
 'embargo_options': [],
 'is_embargoed': False,
 'version': 3,
 'resource_doi': None,
 'url_public_html': 'https://figshare.com/articles/dataset/Dai

The response above contains a `data` json key which is of interest of us. It lists the files corresponding to the article along with their download urls.

In [48]:
files = data["files"]           
files

[{'is_link_only': False,
  'name': 'daily_rainfall_2014.png',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'id': 26579150,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'size': 58863},
 {'is_link_only': False,
  'name': 'environment.yml',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'id': 26579171,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'size': 192},
 {'is_link_only': False,
  'name': 'README.md',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'id': 26586554,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'size': 5422},
 {'is_link_only': False,
  'name': 'data.zip',
  'supplied_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'computed_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'id': 26766812,
  'download_url': 'https://

The data dump is `data.zip` which can be retreived as follows

### Downloading the file of interest

In [49]:
%%time
files_to_dl = ["data.zip"] 
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

Wall time: 20min 18s


As seen in the output above, the single act of downloading the data dump to the local machine took around 2mins 9s.

#### Comparison

Results of the download operation across 4 different machines

|           | CPU               | RAM   | HD         | Operation time |
|-----------|-------------------|-------|------------|----------------|
| Machine 1 | i5-4460 @ 3.20Ghz | 10 GB | 1 TB SSD   |  2 min 9 sec   |
| Machine 2 | i7-7700HQ @2.80Ghz| 16 GB | 1 TB SSD   |  12 min 54 sec |
| Machine 3 |  i5 @1.60Ghz      | 4 GB  | 121 GB SSD |  5 min 9 sec   |
| Machine 4 |                   |       |            |                |

### Extracting the data

Here, we attempt to extract the data dump we downloaded `data.zip` into individual uncompressed csv files.

In [50]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

Wall time: 20.3 s


#### Comparison

Results of the extraction operation across 4 different machines

|           | CPU               | RAM   | HD         | Operation time |
|-----------|-------------------|-------|------------|----------------|
| Machine 1 | i5-4460 @ 3.20Ghz | 10 GB | 1 TB SSD | 19.5 sec   |
| Machine 2 | i7-7700HQ @ 2.80Ghz                  |     16 GB  |   1 TB SSD         |  20.3 sec              |
| Machine 3 | i5 @1.60Ghz      | 4 GB  | 121 GB SSD |  1 min 53 sec   |
| Machine 4 |                   |       |            |                |

## 4. Combining data CSVs

### Pandas Approach

Here, we attempt to combine the individuals csv files into a single pandas dataframe which we then save to a single csv file called `combined_data.csv`. We create a new column called `model` to be able to identify which dataset each record originally comes from. The names of the models are extracted using regex from the csv file names.

In [51]:
%%time
%memit
files = glob.glob('rainfall/*.csv')
df = pd.concat((pd.read_csv(file, index_col=0)
                .assign(model=re.findall(r'[^\/]+(?=\_d)', file)[0]) # use r'[^\\]+(?=\_d)' on Windows machines
                for file in files)
              )
df.to_csv("rainfall/combined_data.csv")

peak memory: 3476.37 MiB, increment: 0.02 MiB
Wall time: 7min 38s


#### Comparison

Results of the concatenation operation across 4 different machines

|           | CPU               | RAM   | HD         | Operation time |
|-----------|-------------------|-------|------------|----------------|
| Machine 1 | i5-4460 @ 3.20Ghz | 10 GB | 1 TB SSD | 5 min 34 sec  |
| Machine 2 | i7-7700HQ @ 2.80Ghz                  |     16 GB  |   1 TB SSD         |  7 min 38 sec              |
| Machine 3 | i5 @1.60Ghz      | 4 GB  | 121 GB SSD |  Out of memory error   |
| Machine 4 |                   |       |            |                |

## 5. Load the combined CSV to memory and perform a simple EDA

Check the size of the file

In [52]:
%%sh
du -sh rainfall/combined_data.csv

6.2G	rainfall/combined_data.csv


The size of the combined dataset is 5.6 GB! We attempt to load this entire file into memory using pandas in the next section.

### Pandas approach

In [53]:
%%time
import pandas as pd
df = pd.read_csv("rainfall/combined_data.csv")

Wall time: 1min 27s


In [54]:
print(df.shape)

(62513863, 7)


In [55]:
df.head()

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
0,1889-01-01 12:00:00,-36.25,-35.0,140.625,142.5,3.293256e-13,rainfall\ACCESS-CM2
1,1889-01-02 12:00:00,-36.25,-35.0,140.625,142.5,0.0,rainfall\ACCESS-CM2
2,1889-01-03 12:00:00,-36.25,-35.0,140.625,142.5,0.0,rainfall\ACCESS-CM2
3,1889-01-04 12:00:00,-36.25,-35.0,140.625,142.5,0.0,rainfall\ACCESS-CM2
4,1889-01-05 12:00:00,-36.25,-35.0,140.625,142.5,0.01047658,rainfall\ACCESS-CM2


Attempting to read the above csv file results in a dead kernel in JupyterLab. System Resources Monitor shows a steady increase in RAM usage until 100% after which the jupyterlab notebook crashes as seen in the image below.
![out-of-memory](img/machine1-1.png)

#### Comparison

Results of the loading operation across 4 different machines

|           | CPU               | RAM   | HD         | Operation time |
|-----------|-------------------|-------|------------|----------------|
| Machine 1 | i5-4460 @ 3.20Ghz | 10 GB | 1 TB SSD | OUT OF MEMORY ERROR  |
| Machine 2 | i7-7700HQ @ 2.80Ghz                  |     16 GB  |   1 TB SSD         |  1min 27s              |
| Machine 3 | i5 @1.60Ghz      | 4 GB  | 121 GB SSD |  Out of memory error   |
| Machine 4 |                   |       |            |                |

### DASK approach

Due to the limitations faced when loading the dataset using pandas, we attempt to use DASK - a python library that allows for parallel computing and works better with large datasets.

In [56]:
import dask.dataframe as dd

In [58]:
%%time
%%memit
# shows time that dask take to merge
ddf = dd.read_csv("data/combined_data.csv",assume_missing=True, dtype={'lon_min': 'object'})
ddf.to_csv("data/combined_data_dask.csv", single_file=True)

peak memory: 11087.65 MiB, increment: 5908.88 MiB
Wall time: 8min 8s


In [59]:
%%time
%%memit
## count the number of records for each model
print(ddf["model"].value_counts().compute())

MPI-ESM1-2-HR       5154240
TaiESM1             3541230
NorESM2-MM          3541230
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
SAM0-UNICON         3541153
FGOALS-f3-L         3219300
GFDL-CM4            3219300
GFDL-ESM4           3219300
EC-Earth3-Veg-LR    3037320
MRI-ESM2-0          3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
MPI-ESM1-2-LR        966420
NESM3                966420
AWI-ESM-1-1-LR       966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64
peak memory: 6295.63 MiB, increment: 1556.75 MiB
Wall time: 44.9 s


In [60]:
%%time
%%memit
## calculate the mean and std of rain amount grouping by model
print(ddf.groupby('model').agg({'rain (mm/day)': ['mean', 'std']}).compute().head())

               rain (mm/day)          
                        mean       std
model                                 
ACCESS-CM2          1.787025  5.914188
ACCESS-ESM1-5       2.217501  6.422397
AWI-ESM-1-1-LR      2.026071  5.321889
BCC-CSM2-MR         1.951832  6.200969
BCC-ESM1            1.811032  5.358361
peak memory: 6305.39 MiB, increment: 1577.31 MiB
Wall time: 42 s


### Loading in separate chunks

We also attempted to use pandas's chunksize argument to limit the number of lines that are read into local memory at a time.

In [None]:
%%time
%%memit
import numpy as np
rain_total = 0
num_entries = 0
for chunk in pd.read_csv("rainfall/combined_data.csv", chunksize=10_000_000):
    num_entries = num_entries + chunk.shape[0]
    rain_total = rain_total + np.sum(chunk['rain (mm/day)'])

In [None]:
rain_total / num_entries

By using chunksize argument of read_csv we are able to work-around the memory limitation we previously experienced. Here, for example, we were able to calculate the average rain fall per day across all days included in the data dump

### Loading particular columns

We also attempted to only load some columns that could be the more important for our analysis. 

In [None]:
%%time
%%memit
use_cols = ["time", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)", "model"]
df = pd.read_csv("rainfall/combined_data.csv", usecols=use_cols)
print(df["model"].value_counts())

In [None]:
%%time
%%memit
use_cols = ["time", "model"]
df = pd.read_csv("rainfall/combined_data.csv", usecols=use_cols)
print(df["model"].value_counts())

By only loading some columns that are directly related to the process of interest (in this case counting how many data point of each model), we are able to reduce the memory and time requirement for this process. 

## 6. Perform a simple EDA in R

### Transfer dataframe to R

In order to transfer the dataframe to R we will attempt to use `feather` - a protable language-agnostic file format for storing dataframes and sharing them between R and python projects.

In [None]:
import pyarrow.feather as feather
#feather.write_feather(df, '/rainfall/combined_data_feather')
##
##
## TO DO
##
##