# DSCI 525 - Web and Cloud Computing
## Milestone 1: Tackling big data on your laptop
### Group 14
Group Members: Sasha Babicki, Cheuk Ho, Sakshi Jain, Zeliha Ural Merpez

#### 3. Download the data
1. Download the data from figshare to your local computer using the figshare API (you can use the `requests` library).
2. Extract the zip file, again programmatically, similar to how we did it in class.

#### Note: code below is modified from 525 lecture notes
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Lectures/Lecture_1_2.ipynb

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
from memory_profiler import memory_usage

In [2]:
# %load_ext rpy2.ipython  # commenting out until we find a fix
%load_ext memory_profiler

In [3]:
article_id = 14096681 # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshareairline/"

In [4]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # full metadata for the article, feel free to check it out
files = data["files"]             # metadata for the files attached to the article, which is what we want
files

[{'is_link_only': False,
  'name': 'daily_rainfall_2014.png',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'id': 26579150,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'size': 58863},
 {'is_link_only': False,
  'name': 'environment.yml',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'id': 26579171,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'size': 192},
 {'is_link_only': False,
  'name': 'README.md',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'id': 26586554,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'size': 5422},
 {'is_link_only': False,
  'name': 'data.zip',
  'supplied_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'computed_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'id': 26766812,
  'download_url': 'https://

In [5]:
%%time
files_to_dl = ["data.zip"]  # feel free to add other files here
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: user 4.85 s, sys: 6.52 s, total: 11.4 s
Wall time: 15min 48s
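
The figshare metadata above includes a `supplied_md5` for each file, so a long download like this one can be checked for corruption before it is extracted. A minimal sketch of such a check, assuming a helper named `md5_of_file` (a hypothetical name, not part of the lecture code):

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a multi-GB file is never held in memory at once
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Inside the download loop, the result of `md5_of_file(output_directory + file["name"])` could be compared with `file["supplied_md5"]`, raising an error if the two differ.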


In [6]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

CPU times: user 17.2 s, sys: 11.7 s, total: 28.8 s
Wall time: 40 s


#### 4. Combining data CSVs
1. Use one of the following options to combine the data CSVs into a single CSV: Pandas or Dask.
2. When combining the CSV files, add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON).
3. Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.
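
The file-name tip can be sketched as a small helper. This is one possible approach, assuming every file follows the `<model>_daily_rainfall_NSW.csv` pattern; the function name `model_from_path` is hypothetical. Taking the base name first avoids depending on the path separator, which differs between operating systems.

```python
import os
import re


def model_from_path(path):
    """Extract the model name from a path such as
    'figshareairline/SAM0-UNICON_daily_rainfall_NSW.csv'."""
    # basename() drops the directory part, so the pattern only sees the file name
    filename = os.path.basename(path)
    match = re.match(r"(.+)_daily_rainfall_NSW\.csv$", filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return match.group(1)


print(model_from_path("figshareairline/SAM0-UNICON_daily_rainfall_NSW.csv"))
# prints: SAM0-UNICON
```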

In [7]:
combined_file_path = output_directory + "combined_data.csv"

In [8]:
%%time
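# NOTE: `%memit` on its own line profiles an empty statement, so the memory figures
# below reflect the baseline process, not the concat; `%%memit` would profile the cell.
%memit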

files = glob.glob(output_directory + "*_daily_rainfall_NSW.csv")
df = pd.concat(
    (
        pd.read_csv(file, index_col=0).assign(
            model=re.findall(r"/(.*)_daily_rainfall", file)[0]
        )
        for file in files
    )
)
df.to_csv(combined_file_path)

peak memory: 86.82 MiB, increment: 0.26 MiB
CPU times: user 6min 14s, sys: 2min 41s, total: 8min 56s
Wall time: 9min 26s
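
`pd.concat` builds the full combined frame in memory before anything is written. If memory is the constraint, an alternative is to append each file to the output as soon as it is read. The sketch below is not the approach timed above; `combine_csvs` is a hypothetical name, and it assumes all input files share the same columns.

```python
import glob
import os
import re

import pandas as pd


def combine_csvs(input_dir, output_path):
    """Append each model's CSV to output_path, one file at a time."""
    paths = sorted(glob.glob(os.path.join(input_dir, "*_daily_rainfall_NSW.csv")))
    for i, path in enumerate(paths):
        name = os.path.basename(path)
        model = re.match(r"(.+)_daily_rainfall_NSW\.csv$", name).group(1)
        part = pd.read_csv(path, index_col=0).assign(model=model)
        # Write the header only with the first file, then append without it
        part.to_csv(output_path, mode="w" if i == 0 else "a", header=(i == 0))
    return len(paths)
```

Only one input file is held in memory at a time, at the cost of not having `df` available afterwards without reading the combined file back.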


In [9]:
%%sh
du -sh figshareairline/combined_data.csv

5.6G	figshareairline/combined_data.csv


In [10]:
%%time
df = pd.read_csv(combined_file_path, index_col=0, parse_dates=True)

CPU times: user 1min 8s, sys: 15.3 s, total: 1min 23s
Wall time: 1min 35s


In [11]:
print(df.shape)

(62467843, 6)


In [12]:
df.head()

| time | lat_min | lat_max | lon_min | lon_max | rain (mm/day) | model |
|---|---|---|---|---|---|---|
| 1889-01-01 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.244226e-13 | MPI-ESM-1-2-HAM |
| 1889-01-02 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.217326e-13 | MPI-ESM-1-2-HAM |
| 1889-01-03 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.498125e-13 | MPI-ESM-1-2-HAM |
| 1889-01-04 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.251282e-13 | MPI-ESM-1-2-HAM |
| 1889-01-05 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.270161e-13 | MPI-ESM-1-2-HAM |


In [13]:
df['model'].nunique()

27

In [14]:
df['model'].unique()

array(['MPI-ESM-1-2-HAM', 'AWI-ESM-1-1-LR', 'NorESM2-LM', 'ACCESS-CM2',
       'FGOALS-f3-L', 'CMCC-CM2-HR4', 'MRI-ESM2-0', 'GFDL-CM4',
       'BCC-CSM2-MR', 'EC-Earth3-Veg-LR', 'CMCC-ESM2', 'NESM3',
       'MPI-ESM1-2-LR', 'ACCESS-ESM1-5', 'FGOALS-g3', 'INM-CM4-8',
       'MPI-ESM1-2-HR', 'TaiESM1', 'NorESM2-MM', 'CMCC-CM2-SR5',
       'KIOST-ESM', 'INM-CM5-0', 'MIROC6', 'BCC-ESM1', 'GFDL-ESM4',
       'CanESM5', 'SAM0-UNICON'], dtype=object)

##### 4.3 Runtime Observations: 

##### Zeliha
- Combining files:
    - peak memory: 130.89 MiB, increment: 0.18 MiB
    - Wall time: 14min 21s
- Reading combined file:
    - Wall time: 5min 27s
    
##### Sasha
- Combining files:
    - peak memory: 86.07 MiB, increment: 0.27 MiB
    - CPU times: user 6min 14s, sys: 21.9 s, total: 6min 36s
    - Wall time: 7min 4s
- Reading combined file:
    - CPU times: user 59.9 s, sys: 15.4 s, total: 1min 15s
    - Wall time: 1min 28s
    
##### Chuck
- Combining files:
    - peak memory: 75.36 MiB, increment: 0.71 MiB
    - CPU times: user 5min 59s, sys: 15.5 s, total: 6min 15s
    - Wall time: 6min 22s
- Reading combined file:
    - peak memory: 3355.27 MiB, increment: 0.12 MiB
    - CPU times: user 56.9 s, sys: 14.2 s, total: 1min 11s
    - Wall time: 1min 14s

### 5. Load the combined CSV to memory and perform a simple EDA
1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).
    - Changing dtype of your data
    - Load just the columns we want
    - Loading in chunks
    - Dask
2. Discuss your observations.
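
Two of the listed approaches, loading only the needed columns and loading in chunks, can be combined for a `value_counts`-style summary. The sketch below uses a tiny made-up CSV in place of the combined file; `usecols` and `chunksize` are standard `pd.read_csv` parameters, and the chunk size of 2 is chosen only to force several chunks.

```python
import io
from collections import Counter

import pandas as pd

# Tiny stand-in for combined_data.csv (values are made up for illustration)
csv_text = """time,lat_min,rain (mm/day),model
1889-01-01,-35.4,0.0,A
1889-01-02,-35.4,1.2,A
1889-01-03,-33.5,0.4,B
1889-01-04,-33.5,0.0,B
1889-01-05,-33.5,2.1,B
"""

counts = Counter()
# usecols loads only the column being counted; chunksize bounds the rows in memory
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["model"], chunksize=2):
    # Add this chunk's counts to the running totals
    counts.update(chunk["model"].value_counts().to_dict())

print(dict(counts))
```

On the real file, a much larger `chunksize` (for example, a few million rows) would be a more sensible starting point.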

#### 5.1.1 Changing dtype of data:

In [15]:
# View original dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 62467843 entries, 1889-01-01 12:00:00 to 2014-12-31 12:00:00
Data columns (total 6 columns):
 #   Column         Dtype  
---  ------         -----  
 0   lat_min        float64
 1   lat_max        float64
 2   lon_min        float64
 3   lon_max        float64
 4   rain (mm/day)  float64
 5   model          object 
dtypes: float64(5), object(1)
memory usage: 3.3+ GB


In [16]:
float_cols = ["lat_min","lat_max","lon_min","lon_max","rain (mm/day)"]
df_32 = df.copy()
df_64 = df.copy()

df_32[float_cols] = df_32[float_cols].astype('float32', errors='ignore')
print(f"DataFrame with numeric columns as float64: {df_64.memory_usage().sum() / 1e9:.2f} GB")
print(f"DataFrame with numeric columns as float32: {df_32.memory_usage().sum() / 1e9:.2f} GB")

DataFrame with numeric columns as float64: 3.50 GB
DataFrame with numeric columns as float32: 2.25 GB
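
The two figures can be reproduced with simple arithmetic. The sketch assumes that `memory_usage()` with its default `deep=False` counts 8 bytes per row for the datetime index and 8 bytes per row for the object pointers in `model`, neither of which changes with the float dtype.

```python
n_rows = 62_467_843   # from df.shape
n_float_cols = 5      # lat_min, lat_max, lon_min, lon_max, rain (mm/day)

# Bytes per row that do not depend on the float dtype:
# 8 for the datetime64 index and 8 for the object pointer in "model"
fixed_bytes = 8 + 8

gb_64 = n_rows * (n_float_cols * 8 + fixed_bytes) / 1e9
gb_32 = n_rows * (n_float_cols * 4 + fixed_bytes) / 1e9

print(f"float64: {gb_64:.2f} GB")              # float64: 3.50 GB
print(f"float32: {gb_32:.2f} GB")              # float32: 2.25 GB
print(f"reduction: {1 - gb_32 / gb_64:.1%}")   # reduction: 35.7%
```

Halving the width of the five float columns removes 20 of the 56 bytes per row, which is why the saving is about 36% rather than 50%.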


In [17]:
%%time
%%memit
df_64["lat_min"].value_counts()

peak memory: 4754.64 MiB, increment: 137.23 MiB
CPU times: user 677 ms, sys: 189 ms, total: 865 ms
Wall time: 3.22 s


In [18]:
%%time
%%memit
df_32["lat_min"].value_counts()

peak memory: 5231.79 MiB, increment: 477.13 MiB
CPU times: user 753 ms, sys: 156 ms, total: 909 ms
Wall time: 1.56 s


#### 5.1.2 Dask:

In [19]:
import dask.dataframe as dd

df_dask = dd.read_csv(combined_file_path)

In [20]:
%%time
%%memit
df["lat_min"].value_counts()

peak memory: 4634.13 MiB, increment: 67.19 MiB
CPU times: user 631 ms, sys: 29.3 ms, total: 660 ms
Wall time: 1.27 s


In [21]:
%%time
%%memit
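# NOTE: Dask is lazy; without .compute() this only builds the task graph,
# so the figures below do not measure the actual counting.
df_dask["lat_min"].value_counts()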

peak memory: 4634.17 MiB, increment: 0.04 MiB
CPU times: user 52.3 ms, sys: 20.9 ms, total: 73.2 ms
Wall time: 1.66 s


In [22]:
%%time
%%memit
df["model"].value_counts()

peak memory: 4878.47 MiB, increment: 244.29 MiB
CPU times: user 4.98 s, sys: 178 ms, total: 5.16 s
Wall time: 5.79 s


In [23]:
%%time
%%memit
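# NOTE: Dask is lazy; without .compute() this only builds the task graph,
# so the figures below do not measure the actual counting.
df_dask["model"].value_counts()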

peak memory: 4878.48 MiB, increment: 0.02 MiB
CPU times: user 48.2 ms, sys: 21 ms, total: 69.1 ms
Wall time: 1.62 s


#### 5.2 Discussion:

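- Changing the dtype of the numeric columns from `float64` to `float32` reduced the in-memory size of the dataframe by about 36% (3.50 GB to 2.25 GB). The saving is less than half because the index and the `model` column are unchanged. However, `value_counts()` on the `float32` column had a larger memory increment (477.13 MiB vs. 137.23 MiB) and slightly higher CPU time (909 ms vs. 865 ms), although its wall time was lower (1.56 s vs. 3.22 s).
- The Dask cells show a near-zero memory increment (0.04 MiB and 0.02 MiB) and a wall time of about 1.6 s for both columns, compared with 1.27 s (`lat_min`) and 5.79 s (`model`) for pandas. Note that Dask evaluates lazily and these cells do not call `.compute()`, so the Dask figures measure only the construction of the task graph, not the actual counting. They should not be read as evidence that Dask is faster or lighter here until the comparison is rerun with `.compute()`.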

#### 6. Perform a simple EDA in R
1. Pick an approach to transfer the dataframe from Python to R.
    - Parquet file
    - Feather file
    - Pandas exchange
    - Arrow exchange
2. Discuss why you chose this approach over others.