# DSCI 525 - Web and Cloud Computing
## Milestone 1: Tackling big data on your laptop
### Group 14
Group Members: Sasha Babicki, Cheuk Ho, Sakshi Jain, Zeliha Ural Merpez

#### 3. Download the data
1. Download the data from figshare to your local computer using the figshare API (you can make use of requests library).
2. Extract the zip file, again programmatically, similar to how we did it in class.

#### Note: code below is modified from 525 lecture notes
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Lectures/Lecture_1_2.ipynb

In [2]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
from memory_profiler import memory_usage

In [3]:
%load_ext rpy2.ipython
%load_ext memory_profiler

In [4]:
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshareairline/"

In [38]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all the articles data, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'is_link_only': False,
  'name': 'daily_rainfall_2014.png',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'id': 26579150,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'size': 58863},
 {'is_link_only': False,
  'name': 'environment.yml',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'id': 26579171,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'size': 192},
 {'is_link_only': False,
  'name': 'README.md',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'id': 26586554,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'size': 5422},
 {'is_link_only': False,
  'name': 'data.zip',
  'supplied_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'computed_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'id': 26766812,
  'download_url': 'https://

In [39]:
%%time

download_file = "data.zip"
for file in files:
    if file["name"] == download_file:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: user 4.26 s, sys: 2.87 s, total: 7.13 s
Wall time: 1min 31s


In [40]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, download_file), 'r') as f:
    f.extractall(output_directory)

CPU times: user 12.5 s, sys: 1.84 s, total: 14.4 s
Wall time: 14.4 s


#### 4. Combining data CSVs
1. Use one of the following options to combine data CSVs into a single CSV. (Pandas, DASK)
2. When combining the csv files make sure to add extra column called "model" that identifies the model (tip : you can get this column populated from the file name eg: for file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON)
3. Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.

In [5]:
combined_file_path = output_directory + "combined_data.csv"

In [42]:
%%time
%memit

files = glob.glob(output_directory + "*_daily_rainfall_NSW.csv")
df = pd.concat(
    (
        pd.read_csv(file, index_col=0).assign(
            model=re.findall(r"/(.*)_daily_rainfall", file)[0]
        )
        for file in files
    )
)
df.to_csv(combined_file_path)

peak memory: 13663.16 MiB, increment: 0.01 MiB
CPU times: user 4min 16s, sys: 6.37 s, total: 4min 22s
Wall time: 4min 24s


In [43]:
%%sh
du -sh figshareairline/combined_data.csv

5.6G	figshareairline/combined_data.csv


In [6]:
%%time
df = pd.read_csv(combined_file_path, index_col=0, parse_dates=True)

CPU times: user 1min 8s, sys: 10.3 s, total: 1min 18s
Wall time: 1min 26s


In [45]:
print(df.shape)

(62467843, 6)


In [47]:
%%time
df

CPU times: user 194 µs, sys: 11 µs, total: 205 µs
Wall time: 210 µs


Unnamed: 0_level_0,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2014-12-27 12:00:00,-30.157068,-29.21466,153.125,154.375,0.554375,TaiESM1
2014-12-28 12:00:00,-30.157068,-29.21466,153.125,154.375,7.028577,TaiESM1
2014-12-29 12:00:00,-30.157068,-29.21466,153.125,154.375,0.234757,TaiESM1
2014-12-30 12:00:00,-30.157068,-29.21466,153.125,154.375,2.097459,TaiESM1
2014-12-31 12:00:00,-30.157068,-29.21466,153.125,154.375,0.548421,TaiESM1


In [48]:
df['model'].nunique()

27

In [49]:
df['model'].unique()

array(['GFDL-ESM4', 'BCC-CSM2-MR', 'AWI-ESM-1-1-LR', 'GFDL-CM4',
       'FGOALS-g3', 'CMCC-ESM2', 'NorESM2-LM', 'CanESM5', 'CMCC-CM2-HR4',
       'KIOST-ESM', 'BCC-ESM1', 'FGOALS-f3-L', 'NESM3', 'NorESM2-MM',
       'INM-CM4-8', 'MRI-ESM2-0', 'SAM0-UNICON', 'MPI-ESM1-2-LR',
       'CMCC-CM2-SR5', 'EC-Earth3-Veg-LR', 'MPI-ESM1-2-HR',
       'ACCESS-ESM1-5', 'MIROC6', 'INM-CM5-0', 'MPI-ESM-1-2-HAM',
       'ACCESS-CM2', 'TaiESM1'], dtype=object)

In [50]:
%%time
df.describe()

CPU times: user 6.92 s, sys: 1.25 s, total: 8.17 s
Wall time: 8.17 s


Unnamed: 0,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
count,59248540.0,62467840.0,59248540.0,62467840.0,59248540.0
mean,-33.10482,-31.97757,146.9059,148.215,1.90117
std,1.963549,1.992067,3.793784,3.809994,5.585735
min,-36.46739,-36.0,140.625,141.25,-3.807373e-12
25%,-34.86911,-33.66221,143.4375,145.0,3.838413e-06
50%,-33.0,-32.04188,146.875,148.125,0.06154947
75%,-31.4017,-30.15707,150.1875,151.3125,1.020918
max,-29.9,-27.90606,153.75,155.625,432.9395


##### 4.3 Runtime Observations: 

##### Zeliha
- Combining files:
    - peak memory: 13663.16 MiB, increment: 0.01 MiB
    - CPU times: user 4min 16s, sys: 6.37 s, total: 4min 22s
    - Wall time: 4min 24s
- Reading combined file:
    - CPU times: user 45.3 s, sys: 3.61 s, total: 48.9 s
    - Wall time: 49 s
    
##### Sasha
- Combining files:
    - peak memory: 86.82 MiB, increment: 0.26 MiB
    - CPU times: user 6min 14s, sys: 2min 41s, total: 8min 56s
    - Wall time: 9min 26s
- Reading combined file:
    - CPU times: user 1min 8s, sys: 15.3 s, total: 1min 23s
    - Wall time: 1min 35s
    
##### Chuck
- Combining files:
    - peak memory: 75.36 MiB, increment: 0.71 MiB
    - CPU times: user 5min 59s, sys: 15.5 s, total: 6min 15s
    - Wall time: 6min 22s
- Reading combined file:
    - Peak memory: 3355.27 MiB, increment: 0.12 MiB
    - CPU times: user 56.9 s, sys: 14.2 s, total: 1min 11s
    - Wall time: 1min 14s

### 5. Load the combined CSV to memory and perform a simple EDA
1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).
    - Changing dtype of your data
    - Load just columns what we want
    - Loading in chunks
    - Dask
2. Discuss your observations.

#### 5.1.1 Changing dtype of data:

In [51]:
# View original dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 62467843 entries, 1889-01-01 12:00:00 to 2014-12-31 12:00:00
Data columns (total 6 columns):
 #   Column         Dtype  
---  ------         -----  
 0   lat_min        float64
 1   lat_max        float64
 2   lon_min        float64
 3   lon_max        float64
 4   rain (mm/day)  float64
 5   model          object 
dtypes: float64(5), object(1)
memory usage: 3.3+ GB


In [52]:
float_cols = ["lat_min","lat_max","lon_min","lon_max","rain (mm/day)"]
df_32 = df.copy()
df_64 = df.copy()

df_32[float_cols] = df_32[float_cols].astype('float32', errors='ignore')
print(f"DataFrame with numeric columns as float64: {df_64.memory_usage().sum() / 1e9:.2f} GB")
print(f"DataFrame with numeric columns as float32: {df_32.memory_usage().sum() / 1e9:.2f} GB")

DataFrame with numeric columns as float64: 3.50 GB
DataFrame with numeric columns as float32: 2.25 GB


In [53]:
%%time
%%memit
df_64["lat_min"].value_counts()

peak memory: 17142.71 MiB, increment: 0.52 MiB
CPU times: user 554 ms, sys: 116 ms, total: 670 ms
Wall time: 823 ms


In [54]:
%%time
%%memit
df_32["lat_min"].value_counts()

peak memory: 17618.94 MiB, increment: 476.96 MiB
CPU times: user 557 ms, sys: 164 ms, total: 720 ms
Wall time: 879 ms


#### 5.1.2 Dask:

In [55]:
import dask.dataframe as dd

df_dask = dd.read_csv(combined_file_path)

In [56]:
%%time
%%memit
df["lat_min"].value_counts()

peak memory: 17142.24 MiB, increment: 0.51 MiB
CPU times: user 561 ms, sys: 164 ms, total: 726 ms
Wall time: 807 ms


In [57]:
%%time
%%memit
df_dask["lat_min"].value_counts()

peak memory: 17141.73 MiB, increment: 0.00 MiB
CPU times: user 35 ms, sys: 216 ms, total: 251 ms
Wall time: 452 ms


In [58]:
%%time
%%memit
df["model"].value_counts()

peak memory: 17141.73 MiB, increment: 0.00 MiB
CPU times: user 3.54 s, sys: 188 ms, total: 3.73 s
Wall time: 3.84 s


In [59]:
%%time
%%memit
df_dask["model"].value_counts()

peak memory: 17141.73 MiB, increment: 0.00 MiB
CPU times: user 36.4 ms, sys: 205 ms, total: 242 ms
Wall time: 449 ms


#### 5.2 Discussion:

- Changing the dtype of numeric columns from `float64` to `float32` did reduce the space the dataframe takes in memory by almost half. However, performing `value_counts()` on a column actually used more memory and was slower for `float32` columns than `float64` columns. 
- Dask seems to use marginally less memory and is slightly faster when performing `value_counts()` on a numeric column. It is significantly faster when operating on a `str` type column.

#### 6. Perform a simple EDA in R
1. Pick an approach to transfer the dataframe from python to R.
    - Parquet file
    - Feather file
    - Pandas exchange
    - Arrow exchange
2. Discuss why you chose this approach over others.

In [7]:
## Install the pyarrow packages: https://arrow.apache.org/docs/python/install.html
import pyarrow.dataset as ds
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather

## Install rpy2: https://anaconda.org/conda-forge/rpy2
import rpy2.rinterface

In [8]:
%%time
%%memit
table = pa.Table.from_pandas(df)

peak memory: 4395.28 MiB, increment: 2678.63 MiB
CPU times: user 5.05 s, sys: 1.83 s, total: 6.88 s
Wall time: 6.37 s


In [9]:
%%time

# Write to feather format 
feather.write_feather(table, "figshareairline/example.feather")

CPU times: user 2.19 s, sys: 1.25 s, total: 3.43 s
Wall time: 3.43 s


In [10]:
%%sh
du -sh figshareairline/example.feather

1.1G	figshareairline/example.feather


In [None]:
%%time
%%R
### her we are showing how much time it took to read a feather file what we wrote in python
library(arrow)
library(dplyr)

start_time <- Sys.time()
r_table <- arrow::read_feather("figshareairline/example.feather")
print(class(r_table))

result <- r_table %>% count(model)
end_time <- Sys.time()
print(result)
print(end_time - start_time)

R[write to console]: 
Attaching package: ‘arrow’


R[write to console]: The following object is masked from ‘package:utils’:

    timestamp


R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




#### 6.2 Discussion: 
- We choose to get table directly from pandas data frame and then save the file as feather. Both constructing table and saving file was quite fast.
- Using the feather file format saves a lot of space, the file only takes up ~1.2 GB in memory.