# Milestone 1

In this milestone, we use the Figshare API to download a daily rainfall dataset, which we will analyze in upcoming milestones.

## Downloading the data from Figshare

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

In [2]:
# Daily rainfall over NSW, Australia
# https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681
article_id = 14096681

# Metadata for the download
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "../data"

In [13]:
# List files in the associated dataset
response = requests.get(url, headers=headers)
response.raise_for_status()  # Fail early on an HTTP error
data = response.json()
files = data["files"]
files

[{'id': 26579150,
  'name': 'daily_rainfall_2014.png',
  'size': 58863,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e'},
 {'id': 26579171,
  'name': 'environment.yml',
  'size': 192,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34'},
 {'id': 26586554,
  'name': 'README.md',
  'size': 5422,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c'},
 {'id': 26766812,
  'name': 'data.zip',
  'size': 814041183,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26766812',
  'supplied_md5': 'b517383f76e77bd03755a63a8f

In [20]:
# Retrieve `data.zip`
files_to_dl = ["data.zip"]

for f in files:
    if f["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(f["download_url"], f"{output_directory}/{f['name']}")

KeyboardInterrupt: 
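
The file listing above includes a `supplied_md5` for every file, so a download (especially one that may have been interrupted) can be checked before it is extracted. The following is a minimal sketch, not part of the original notebook; the helper names `md5_of_file` and `matches_supplied_md5` are hypothetical, and it assumes `supplied_md5` is the hex MD5 digest of the complete file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_supplied_md5(path, file_entry):
    """Compare a local file against the `supplied_md5` field of one Figshare file entry."""
    return md5_of_file(path) == file_entry["supplied_md5"]
```

In the download loop, `matches_supplied_md5(f"{output_directory}/{f['name']}", f)` could be called after `urlretrieve` returns, and the download repeated if it returns `False`.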

In [8]:
# Extract `data.zip`
# No leading slash: os.path.join discards earlier parts when a later part is absolute
output_zip_file = os.path.join(output_directory, "data.zip")
with zipfile.ZipFile(output_zip_file, 'r') as f:
    f.extractall(output_directory)
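
Since `data.zip` is listed at 814,041,183 bytes, it may be worth checking how much disk space the extracted files need before calling `extractall`. This is a small sketch using only the standard library; the function name `summarize_zip` is hypothetical.

```python
import zipfile


def summarize_zip(zip_path):
    """Return (member_count, total_uncompressed_bytes) without extracting anything."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        infos = zf.infolist()
        # file_size is the uncompressed size recorded in the archive's directory
        return len(infos), sum(info.file_size for info in infos)
```

For example, `summarize_zip(output_zip_file)` returns the number of members and their combined uncompressed size in bytes.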

## Combine CSV files

In [8]:
# Gather the list of CSV files to merge
files = glob.glob(f'{output_directory}/*.csv')
# Exclude the observed Sydney rainfall file from the merge
files.remove(f'{output_directory}/observed_daily_rainfall_SYD.csv')
# Keep only the first file for this run
files = files[0:1]
files

['../data/NorESM2-MM_daily_rainfall_NSW.csv']

In [9]:
columns_to_merge = ["time", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]

combined_path = f'{output_directory}/combined_data.csv'

In [10]:
%%time
# A concise way to merge the data (though not the most memory-efficient one)
df = pd.concat((pd.read_csv(f, index_col=0, usecols=columns_to_merge)
                .assign(model=f[len(output_directory)+1:-len("_daily_rainfall_NSW.csv")])
                for f in files)
              )
df.to_csv(combined_path)

CPU times: user 35.1 s, sys: 2.37 s, total: 37.5 s
Wall time: 37.8 s
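
The comment in the cell above notes that `pd.concat` is not the most memory-efficient way to merge. One alternative, sketched here under assumptions, is to stream each file in chunks and append the chunks to the combined CSV, so that only one chunk is in memory at a time. The function name `combine_csvs_streaming` and the `chunksize` default are hypothetical; the `suffix` default mirrors the file-name pattern used above.

```python
import os

import pandas as pd


def combine_csvs_streaming(files, combined_path, usecols,
                           suffix="_daily_rainfall_NSW.csv", chunksize=1_000_000):
    """Append each CSV to `combined_path` chunk by chunk, tagging rows with the model name."""
    write_header = True
    for path in files:
        name = os.path.basename(path)
        # Derive the model name from the file name, e.g. "NorESM2-MM"
        model = name[: -len(suffix)] if name.endswith(suffix) else os.path.splitext(name)[0]
        for chunk in pd.read_csv(path, index_col=0, usecols=usecols, chunksize=chunksize):
            # The first chunk creates the file and writes the header; later chunks append
            chunk.assign(model=model).to_csv(
                combined_path, mode="w" if write_header else "a", header=write_header
            )
            write_header = False
```

With the variables defined above, the call would be `combine_csvs_streaming(files, combined_path, columns_to_merge)`. Unlike the `pd.concat` version, this sketch does not leave a combined `df` in memory.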


In [38]:
%%sh
du -sh ../data/combined_data.csv

6.7G	../data/combined_data.csv


In [39]:
print(df.shape)

(62467843, 6)


In [11]:
df.head()

| time                | lat_min    | lat_max   | lon_min | lon_max | rain (mm/day) | model      |
|:-------------------:|:----------:|:---------:|:-------:|:-------:|:-------------:|:----------:|
| 1889-01-01 12:00:00 | -35.811518 | -34.86911 | 140.625 | 141.875 | 0.513492      | NorESM2-MM |
| 1889-01-02 12:00:00 | -35.811518 | -34.86911 | 140.625 | 141.875 | 0.000923      | NorESM2-MM |
| 1889-01-03 12:00:00 | -35.811518 | -34.86911 | 140.625 | 141.875 | 9e-06         | NorESM2-MM |
| 1889-01-04 12:00:00 | -35.811518 | -34.86911 | 140.625 | 141.875 | 2.5e-05       | NorESM2-MM |
| 1889-01-05 12:00:00 | -35.811518 | -34.86911 | 140.625 | 141.875 | 1.3e-05       | NorESM2-MM |


> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken  |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-----------:|
> | Chen, Ziyi   |                  |      |                   |        |             |
> | Guron, Mike  |                  |      |                   |        |             |
> | Raina, Roan  |                  |      |                   |        |             |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 10min 6secs |
> 
> Table 1: Time taken to combine the CSV files

## EDA

### Baseline

This is the baseline time needed to load the CSV file as-is.

In [65]:
%%time
df = pd.read_csv(f"{output_directory}/combined_data.csv")

CPU times: user 2min 3s, sys: 30.3 s, total: 2min 33s
Wall time: 2min 41s


In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float64
 2   lat_max        float64
 3   lon_min        float64
 4   lon_max        float64
 5   rain (mm/day)  float64
 6   model          object 
dtypes: float64(5), object(2)
memory usage: 3.3+ GB


> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   |                  |      |                   |        |               |              |
> | Guron, Mike  |                  |      |                   |        |               |              |
> | Raina, Roan  |                  |      |                   |        |               |              |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 2 min 45 secs | 3.3+ GB      |
> 
> Table 2: Time taken to read the combined CSV (baseline)

#### Observations from Baseline

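In the run shown above, reading the combined CSV with default settings took 2 min 41 s of wall time. `df.info()` reports 3.3+ GB of memory; the `+` indicates that the two `object` columns (`time` and `model`) are not fully counted, so the true footprint is higher.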

### Approach 1: Change the `dtype` of the data

By default, pandas reads the numeric columns as `float64`. In this approach, we test whether reading them as `float32` reduces the memory footprint and the load time.

In [60]:
%%time
df_float32 = pd.read_csv(f"{output_directory}/combined_data.csv", dtype={
    'lat_min': 'float32',
    'lat_max': 'float32',
    'lon_min': 'float32',
    'lon_max': 'float32',
    'rain (mm/day)': 'float32'
})

CPU times: user 1min 56s, sys: 27.6 s, total: 2min 23s
Wall time: 2min 28s


In [61]:
df_float32.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float32
 2   lat_max        float32
 3   lon_min        float32
 4   lon_max        float32
 5   rain (mm/day)  float32
 6   model          object 
dtypes: float32(5), object(2)
memory usage: 2.1+ GB


> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   |                  |      |                   |        |               |              |
> | Guron, Mike  |                  |      |                   |        |               |              |
> | Raina, Roan  |                  |      |                   |        |               |              |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 2 min 28 secs | 2.1+ GB      |
> 
> Table 3: Time taken to read the combined CSV (approach 1: use `float32` instead of `float64`)
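
The effect of the `dtype` change can also be checked on a small frame without loading the full CSV. The sketch below uses synthetic data (an assumption, not the real dataset) to compare per-column memory before and after downcasting, and to measure the rounding error that `float32` introduces. Converting `model` to `category` is an extra step that the notebook does not take; it is included only to show the same idea applied to an `object` column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: similar value ranges)
rng = np.random.default_rng(0)
n = 100_000
sample = pd.DataFrame({
    "lat_min": rng.uniform(-36, -30, n),
    "rain (mm/day)": rng.uniform(0, 50, n),
    "model": ["NorESM2-MM"] * n,
})

# Downcast the floats and store the repeated model name as a category
downcast = sample.astype(
    {"lat_min": "float32", "rain (mm/day)": "float32", "model": "category"}
)

# deep=True also counts the strings held by object columns
bytes_before = sample.memory_usage(deep=True)
bytes_after = downcast.memory_usage(deep=True)

# Largest absolute difference introduced by storing rainfall as float32
max_abs_error = (sample["rain (mm/day)"] - downcast["rain (mm/day)"]).abs().max()

print(bytes_before, bytes_after, max_abs_error, sep="\n")
```

Each `float32` column takes half the bytes of its `float64` counterpart, at the cost of a small rounding error that should be weighed against the precision the analysis needs.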

#### Observations from Approach 1

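Reading the five numeric columns as `float32` reduced the reported memory usage from 3.3+ GB to 2.1+ GB. The wall time changed only slightly (2 min 28 s, compared with 2 min 41 s for the baseline run shown above), which suggests that the numeric width is not what dominates the load time.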

### Approach 2: Load only column(s) we want

We may not need every column of the dataset at once. In this approach, we load only one column, `rain (mm/day)`, from the combined CSV file.

In [63]:
%%time
df_only_rain = pd.read_csv(f"{output_directory}/combined_data.csv", usecols=["rain (mm/day)"])

CPU times: user 52 s, sys: 13 s, total: 1min 5s
Wall time: 1min 6s


In [64]:
df_only_rain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 1 columns):
 #   Column         Dtype  
---  ------         -----  
 0   rain (mm/day)  float64
dtypes: float64(1)
memory usage: 476.6 MB


> | Team Member  | Operating System | RAM  | Processor         | Is SSD | Time taken    | Memory usage |
> |:------------:|:----------------:|:----:|:-----------------:|:------:|:-------------:|:------------:|
> | Chen, Ziyi   |                  |      |                   |        |               |              |
> | Guron, Mike  |                  |      |                   |        |               |              |
> | Raina, Roan  |                  |      |                   |        |               |              |
> | Wong, Kelvin | Linux Mint 21    | 16GB | AMD Ryzen 5 3500U | YES    | 1 min 6 secs  | 476.6 MB     |
> 
> Table 4: Time taken to read the combined CSV (approach 2: just load `rain (mm/day)`)
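
If a single column is needed only for a summary statistic, `usecols` can be combined with `chunksize` so that the column is never held in memory all at once. This is a minimal sketch under that assumption; the function name `streaming_mean` and the chunk size are hypothetical.

```python
import pandas as pd


def streaming_mean(csv_path, column, chunksize=1_000_000):
    """Compute the mean of one CSV column without holding the whole column in memory."""
    total = 0.0
    count = 0
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()  # Skip missing values, as Series.mean() does
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")
```

For example, `streaming_mean(f"{output_directory}/combined_data.csv", "rain (mm/day)")` would give the mean daily rainfall. The file is still scanned in full, so this mainly saves memory rather than time.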

#### Observations from Approach 2

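Loading only `rain (mm/day)` took 1 min 6 s of wall time and 476.6 MB of memory, compared with 2 min 41 s and 3.3+ GB for the baseline run shown above. The CSV file still has to be scanned in full, so the load time does not shrink in proportion to the number of columns read.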

## EDA in R

Here, we carry out the EDA in R instead of Python. To pass the data to R, we export our data frame as a Parquet file.

In [68]:
%load_ext rpy2.ipython

In [66]:
%%time
df.to_parquet(f"{output_directory}/combined_data.parquet")

CPU times: user 30.3 s, sys: 8.75 s, total: 39.1 s
Wall time: 35.8 s


> Why did we choose Parquet?
> 
> (WIP)

In [73]:
%%R
library(dplyr)
library(arrow)

R[write to console]: 
Attaching package: ‘arrow’


R[write to console]: The following object is masked from ‘package:utils’:

    timestamp




In [74]:
%%R
r_parquet <- open_dataset("../data/combined_data.parquet")
r_df <- r_parquet |> collect()

In [75]:
%%time
%%R
r_df |> str()

tibble [62,467,843 × 7] (S3: tbl_df/tbl/data.frame)
 $ time         : chr [1:62467843] "1889-01-01 12:00:00" "1889-01-02 12:00:00" "1889-01-03 12:00:00" "1889-01-04 12:00:00" ...
 $ lat_min      : num [1:62467843] -35.8 -35.8 -35.8 -35.8 -35.8 ...
 $ lat_max      : num [1:62467843] -34.9 -34.9 -34.9 -34.9 -34.9 ...
 $ lon_min      : num [1:62467843] 141 141 141 141 141 ...
 $ lon_max      : num [1:62467843] 142 142 142 142 142 ...
 $ rain (mm/day): num [1:62467843] 5.13e-01 9.23e-04 9.39e-06 2.52e-05 1.33e-05 ...
 $ model        : chr [1:62467843] "NorESM2-MM_daily_rainfall_NSW" "NorESM2-MM_daily_rainfall_NSW" "NorESM2-MM_daily_rainfall_NSW" "NorESM2-MM_daily_rainfall_NSW" ...


In [76]:
%%time
%%R
r_df |> summary()

In [None]:
%%time
%%R
r_df |> head()

In [None]:
%%time
%%R
r_df |> tail()