# DSCI 525 - Web and Cloud Computing
## Project: Daily Rainfall Over NSW, Australia
## Milestone 1: Tackling Big Data on Your Laptop 
#### Authors: Group 24 Huanhuan Li, Nash Makhija and Nicholas Wu

## Imports

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
from memory_profiler import memory_usage
import dask.dataframe as dd

In [2]:
import pyarrow.dataset as ds
import pyarrow as pa
import pyarrow.parquet as pq
import rpy2.rinterface
import rpy2_arrow.pyarrow_rarrow as pyra
import pyarrow.feather as feather

In [3]:
%load_ext rpy2.ipython
%load_ext memory_profiler

In [4]:
# Code for this notebook was adapted from DSCI 525 course notes

## Introduction

In this notebook, we work with a large dataset using Pandas and plain CSV files, which are typically not the best tools for data of this size. The purpose of this exercise is to gain exposure to tools better suited to big data, such as Dask, the Apache Arrow package, and the Feather and Parquet file formats.   

The dataset we will be using can be found [here](https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681). 

## 1) Downloading the data

We start by downloading the data from [figshare](https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681) to our local computer using the [figshare API](https://docs.figshare.com/) with the help of the `requests` library.   

The code below defines the endpoint and header info:

In [5]:
# Necessary metadata
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "../data/"

We start by sending a GET request to the endpoint to list the available files. 

In [6]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all of the article's data, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'is_link_only': False,
  'name': 'daily_rainfall_2014.png',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'id': 26579150,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'size': 58863},
 {'is_link_only': False,
  'name': 'environment.yml',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'id': 26579171,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'size': 192},
 {'is_link_only': False,
  'name': 'README.md',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'id': 26586554,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'size': 5422},
 {'is_link_only': False,
  'name': 'data.zip',
  'supplied_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'computed_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'id': 26766812,
  'download_url': 'https://

>   
Using the download URL from the metadata, we download the file named "data.zip" to the data folder.   
>  
Once data.zip has been downloaded, we extract the zip file programmatically.

In [7]:
%%time
files_to_dl = ["data.zip"]
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: user 6.49 s, sys: 6.44 s, total: 12.9 s
Wall time: 14min 56s
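
The metadata returned by the API includes a `supplied_md5` and a `computed_md5` for each file, so one optional check after a long download is to compare the local file's checksum against it. The sketch below assumes a hypothetical helper name, `md5_of_file`, and reads the file block by block so that a multi-gigabyte archive is never held in memory at once. It is demonstrated on a small temporary file rather than on data.zip.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demonstration on a small temporary file (a stand-in for data.zip)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    payload = b"rainfall" * 1000
    with open(demo_path, "wb") as f:
        f.write(payload)
    demo_digest = md5_of_file(demo_path, block_size=256)

print(demo_digest == hashlib.md5(payload).hexdigest())
```

In the notebook's setting, the digest of `output_directory + "data.zip"` could be compared with the `supplied_md5` field of the matching entry in `files`.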


In [8]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

CPU times: user 17.4 s, sys: 2.83 s, total: 20.3 s
Wall time: 20.7 s


>  
We can confirm by looking at the data folder that "data.zip" has been downloaded and extracted successfully. 

## 2) Combine data CSVs

From the zip file, we extracted 28 .csv files, 27 of which hold the output of individual climate models. We now merge these .csv files into one and add an extra column called "model" that identifies the model each row came from. 

Note that we extract the model name from the file name. For example, for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON. 
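
As a minimal sketch of this naming rule, the hypothetical helper below takes everything in the file name that precedes `_daily`. It works on the base name, so it does not depend on how long the directory prefix is.

```python
import os


def model_name_from_path(path):
    """Return the part of the file name that precedes '_daily'."""
    file_name = os.path.basename(path)
    return file_name.split("_daily")[0]


print(model_name_from_path("../data/SAM0-UNICON_daily_rainfall_NSW.csv"))
# prints: SAM0-UNICON
```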

The tool we chose to use for this task is `Pandas`. 

Let's start by inspecting what an individual .csv file looks like. 

In [7]:
# Load a single file to get an idea of what an individual file looks like
use_cols = ['time', 'lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']
df = pd.read_csv("../data/ACCESS-CM2_daily_rainfall_NSW.csv", usecols=use_cols)
df

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
0,1889-01-01 12:00:00,-36.25,-35.00,140.625,142.50,3.293256e-13
1,1889-01-02 12:00:00,-36.25,-35.00,140.625,142.50,0.000000e+00
2,1889-01-03 12:00:00,-36.25,-35.00,140.625,142.50,0.000000e+00
3,1889-01-04 12:00:00,-36.25,-35.00,140.625,142.50,0.000000e+00
4,1889-01-05 12:00:00,-36.25,-35.00,140.625,142.50,1.047658e-02
...,...,...,...,...,...,...
1932835,2014-12-27 12:00:00,-30.00,-28.75,151.875,153.75,2.951144e-02
1932836,2014-12-28 12:00:00,-30.00,-28.75,151.875,153.75,2.257118e-01
1932837,2014-12-29 12:00:00,-30.00,-28.75,151.875,153.75,1.204670e-01
1932838,2014-12-30 12:00:00,-30.00,-28.75,151.875,153.75,2.632404e-02


> We can see that this .csv file has more than 1.9 million rows and 6 columns. If every file were of a similar size, combining all 27 would give more than 51 million rows. To compare the run time and memory usage of `Pandas` on different machines, we use the `%%time` magic command from IPython and `%%memit` from memory_profiler to record this information. 
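
Outside a notebook, where the `%%time` and `%%memit` magics are unavailable, a rough stand-in can be built from the standard library. The sketch below swaps in `time.perf_counter` and `tracemalloc`, which is a different technique from memory_profiler: `tracemalloc` tracks only allocations made through Python's memory allocator, so its figures are not comparable to the process-level peaks reported by `%%memit`. The helper name `measure` is hypothetical.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, peak_bytes_traced)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Record the measurements and stop tracing even if func raises
        wall_seconds = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, wall_seconds, peak_bytes


# Example: build a list of one million integers and report the cost
result, wall_seconds, peak_bytes = measure(lambda: list(range(1_000_000)))
print(f"wall time: {wall_seconds:.3f} s, traced peak: {peak_bytes / 2**20:.1f} MiB")
```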

The following code extracts the model name from each .csv file name, adds it as a column, and combines all .csv files into one. 

In [8]:
%%time
%memit
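# Time how long plain Pandas takes to merge the files.
# Note: `%memit` above is a line magic given no statement, so the peak it
# reports is likely the process's current memory, not the cost of this merge.
# Read every model file, tag its rows with the model name, and concatenate.
files = glob.glob('../data/*NSW.csv')
df = pd.concat((pd.read_csv(file, index_col=0, usecols=use_cols)
                .assign(model=os.path.basename(file).split("_daily")[0])
                for file in files)
              )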
df.to_csv("../data/combined_data.csv")

peak memory: 477.45 MiB, increment: 0.07 MiB
CPU times: user 4min 31s, sys: 9.92 s, total: 4min 41s
Wall time: 4min 44s
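
The cell above builds the full concatenated dataframe in memory before writing it out. An alternative sketch, which is not what this notebook ran, appends each file to the output CSV as soon as it is read, so that only one input file is held in memory at a time. The function name `combine_csvs` and the tiny demo files are assumptions made for illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(pattern, output_path):
    """Append every CSV matching pattern to output_path, adding a model column."""
    for i, path in enumerate(sorted(glob.glob(pattern))):
        model = os.path.basename(path).split("_daily")[0]
        part = pd.read_csv(path).assign(model=model)
        # Write the header only for the first file, then append the rest
        part.to_csv(output_path, mode="w" if i == 0 else "a",
                    header=(i == 0), index=False)


# Demonstration with two tiny stand-in files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, rain in [("AAA", [0.1, 0.2]), ("BBB", [0.3])]:
        pd.DataFrame({"time": range(len(rain)), "rain (mm/day)": rain}).to_csv(
            os.path.join(tmp_dir, f"{name}_daily_rainfall_NSW.csv"), index=False)
    out_path = os.path.join(tmp_dir, "combined_data.csv")
    combine_csvs(os.path.join(tmp_dir, "*NSW.csv"), out_path)
    combined = pd.read_csv(out_path)

print(combined["model"].value_counts().sort_index().to_dict())
```

The trade-off is that the combined dataframe is never available in memory afterwards, so any later analysis has to read the output file back.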


In [9]:
%%sh
du -sh ../data/combined_data.csv

5.6G	../data/combined_data.csv


In [10]:
%%time

df_pandas = pd.read_csv("../data/combined_data.csv")

CPU times: user 45.4 s, sys: 11.5 s, total: 56.9 s
Wall time: 1min


In [11]:
df_pandas.shape

(62467843, 7)

In [12]:
df_pandas.head()

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
0,1889-01-01 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.244226e-13,MPI-ESM-1-2-HAM
1,1889-01-02 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.217326e-13,MPI-ESM-1-2-HAM
2,1889-01-03 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.498125e-13,MPI-ESM-1-2-HAM
3,1889-01-04 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.251282e-13,MPI-ESM-1-2-HAM
4,1889-01-05 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.270161e-13,MPI-ESM-1-2-HAM


#### Summary of Observation on Run Times and Memory Usage Comparison on Different Machines

|Name|Machine| Total Time Taken to Concatenate and Create .csv File | Peak Memory Usage | Time taken to Load |
|---|---| --- | --- | --- |
|Huanhuan|Windows| 5min 48s | 427 MiB | 1min 7s |
|Nash|macOS| 6min 7s | 359 MiB | 1min 15s |
|Nick|macOS| 4min 44s | 397 MiB | 50.5s |

> Summary: The run times and memory usage on our team members' machines were all similar when concatenating and creating the .csv file. 

>Nash's laptop initially ran into storage issues because its hard drive was nearly full. Nash had to free up space before he could create combined_data.csv.

<br>

## 3) Load the Combined CSV to Memory and Perform a Simple EDA


There are a number of ways to load the combined CSV file to memory. Pandas and R by default load the entire data frame to memory at once. 

Memory issues arise quickly when the data we are trying to process is bigger than our RAM. In our case, combined_data.csv is 5.6 GB. 

In this section, we are going to explore some approaches to reduce memory usage while performing an EDA. 

### Approach 1. Load the Entire Dataframe to Memory Using Pandas (Baseline for Comparison)  

In [13]:
%%time
%%memit

df_pandas = pd.read_csv("../data/combined_data.csv")
print(df_pandas["model"].value_counts())

MPI-ESM1-2-HR       5154240
NorESM2-MM          3541230
CMCC-CM2-HR4        3541230
TaiESM1             3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
SAM0-UNICON         3541153
FGOALS-f3-L         3219300
GFDL-CM4            3219300
GFDL-ESM4           3219300
EC-Earth3-Veg-LR    3037320
MRI-ESM2-0          3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
AWI-ESM-1-1-LR       966420
NESM3                966420
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-LR        966420
NorESM2-LM           919800
CanESM5              551880
BCC-ESM1             551880
Name: model, dtype: int64
peak memory: 9448.29 MiB, increment: 3986.23 MiB
CPU times: user 49.3 s, sys: 9.49 s, total: 58.8 s
Wall time: 1min


>  
Our baseline for comparison is to use Pandas to load the entire data to memory. 
>  
The above code loads the combined_data.csv to memory and performs a simple EDA to calculate counts of values in the "model" column.   
>  
We can see that the peak memory is 9,448.29 MiB and that the CPU and wall times are both close to one minute. 
>  
Let's explore some other approaches to see if we can reduce the memory usage.

### Approach 2. Changing `dtype` of the Data 


One approach to reduce memory usage is to change the `dtype` of the original data. 
  
We can see from the output below that five of the seven columns are of float64 datatype. We will convert them to float32 and check whether the memory usage is reduced. 

In [14]:
df_pandas.dtypes

time              object
lat_min          float64
lat_max          float64
lon_min          float64
lon_max          float64
rain (mm/day)    float64
model             object
dtype: object

In [15]:
print(f"Memory usage with float64: {df_pandas[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df_pandas[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with float64: 2498.71 MB
Memory usage with float32: 1249.36 MB
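
These figures can be checked by hand: the row count reported earlier by `df_pandas.shape` is 62,467,843, a float64 value takes 8 bytes, and a float32 value takes 4 bytes. The short calculation below reproduces both numbers. It ignores the index, which for a default range index adds a negligible amount.

```python
n_rows = 62_467_843      # from df_pandas.shape
n_float_cols = 5         # lat_min, lat_max, lon_min, lon_max, rain (mm/day)

bytes_float64 = n_rows * n_float_cols * 8   # 8 bytes per float64 value
bytes_float32 = n_rows * n_float_cols * 4   # 4 bytes per float32 value

print(f"float64: {bytes_float64 / 1e6:.2f} MB")   # float64: 2498.71 MB
print(f"float32: {bytes_float32 / 1e6:.2f} MB")   # float32: 1249.36 MB
```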


In [97]:
# Convert the float64 columns of df_pandas to float32
df_pandas_float32 = df_pandas.copy()
df_pandas_float32["lat_min"] = df_pandas["lat_min"].astype('float32')
df_pandas_float32["lat_max"] = df_pandas["lat_max"].astype('float32')
df_pandas_float32["lon_min"] = df_pandas["lon_min"].astype('float32')
df_pandas_float32["lon_max"] = df_pandas["lon_max"].astype('float32')
df_pandas_float32["rain (mm/day)"] = df_pandas["rain (mm/day)"].astype('float32')

#saving the dataframe of float32 to file
df_pandas_float32.to_csv("../data/combined_data_float32.csv")

In [98]:
%%time
%%memit

# Load the float32 CSV into memory and perform a simple EDA: value counts of the model column
df_pandas_float32 = pd.read_csv("../data/combined_data_float32.csv")
print(df_pandas_float32["model"].value_counts())

MPI-ESM1-2-HR       5154240
TaiESM1             3541230
CMCC-ESM2           3541230
CMCC-CM2-SR5        3541230
NorESM2-MM          3541230
CMCC-CM2-HR4        3541230
SAM0-UNICON         3541153
GFDL-CM4            3219300
GFDL-ESM4           3219300
FGOALS-f3-L         3219300
MRI-ESM2-0          3037320
EC-Earth3-Veg-LR    3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
MPI-ESM1-2-LR        966420
NESM3                966420
AWI-ESM-1-1-LR       966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64
peak memory: 9193.91 MiB, increment: 1797.70 MiB
CPU times: user 55.8 s, sys: 25.2 s, total: 1min 20s
Wall time: 1min 33s


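> Changing the `dtype` in the dataframe improved memory usage only slightly: the peak memory usage decreased to 9,193 MiB, while both CPU and wall time increased. One likely reason is that a CSV file does not record dtypes, so `pd.read_csv` parses these columns back as float64 unless a `dtype` is specified. 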
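
Because a CSV file stores only text, a dtype chosen before saving has to be requested again when the file is read. The sketch below uses a small in-memory CSV as a stand-in for combined_data_float32.csv and shows `pd.read_csv` falling back to float64 by default and keeping float32 when a `dtype` mapping is passed. The column subset and values are made up for illustration.

```python
import io

import pandas as pd

csv_text = (
    "lat_min,lat_max,rain (mm/day),model\n"
    "-36.25,-35.0,0.0104,ACCESS-CM2\n"
    "-30.0,-28.75,0.2257,ACCESS-CM2\n"
)
float_cols = ["lat_min", "lat_max", "rain (mm/day)"]

# Default parsing: the numeric columns come back as float64
df_default = pd.read_csv(io.StringIO(csv_text))

# Requesting float32 at read time keeps the columns as float32
df_float32 = pd.read_csv(io.StringIO(csv_text),
                         dtype={col: "float32" for col in float_cols})

print(df_default[float_cols].dtypes.unique())
print(df_float32[float_cols].dtypes.unique())
```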

### Approach 3. Loading in Chunks  

Another approach is to load the dataframe in chunks. 

The following code explores loading the dataframe with two different chunk sizes: 10 million rows per chunk and 1 million rows per chunk. 

#### Chunksize = 10 million:

In [19]:
%%time
%%memit
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("../data/combined_data.csv", chunksize=10_000_000): #loading with 10 million per chunk
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int))

ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
AWI-ESM-1-1-LR       966420
BCC-CSM2-MR         3035340
BCC-ESM1             551880
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
CanESM5              551880
EC-Earth3-Veg-LR    3037320
FGOALS-f3-L         3219300
FGOALS-g3           1287720
GFDL-CM4            3219300
GFDL-ESM4           3219300
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
MIROC6              2070900
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-HR       5154240
MPI-ESM1-2-LR        966420
MRI-ESM2-0          3037320
NESM3                966420
NorESM2-LM           919800
NorESM2-MM          3541230
SAM0-UNICON         3541153
TaiESM1             3541230
dtype: int64
peak memory: 6838.55 MiB, increment: 1604.71 MiB
CPU times: user 48.3 s, sys: 5.43 s, total: 53.7 s
Wall time: 54.4 s


<br>

#### Chunksize = 1 million:

In [20]:
%%time
%%memit
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("../data/combined_data.csv", chunksize=1_000_000): #loading with 1 million per chunk
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int))

ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
AWI-ESM-1-1-LR       966420
BCC-CSM2-MR         3035340
BCC-ESM1             551880
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
CanESM5              551880
EC-Earth3-Veg-LR    3037320
FGOALS-f3-L         3219300
FGOALS-g3           1287720
GFDL-CM4            3219300
GFDL-ESM4           3219300
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
MIROC6              2070900
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-HR       5154240
MPI-ESM1-2-LR        966420
MRI-ESM2-0          3037320
NESM3                966420
NorESM2-LM           919800
NorESM2-MM          3541230
SAM0-UNICON         3541153
TaiESM1             3541230
dtype: int64
peak memory: 5265.30 MiB, increment: 0.04 MiB
CPU times: user 50.2 s, sys: 5.34 s, total: 55.5 s
Wall time: 56.7 s


>   
We can see that loading 10 million rows per chunk requires 6,838 MiB in peak memory usage, which is less than loading everything at once with Pandas (9,448 MiB). 
>  
Loading 1 million rows per chunk requires only 5,265 MiB in peak memory usage, which is significantly lower than loading everything at once. However, we also notice that the CPU and wall time remain roughly the same across these approaches. 
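
The chunked counting pattern can be checked on a small scale. The sketch below wraps it in a hypothetical helper, `chunked_value_counts`, runs it on a tiny in-memory CSV, and confirms that the result matches a count over the whole file. Passing `usecols` is an addition not used in the cells above. It restricts parsing to the one column being counted.

```python
import io

import pandas as pd

models = ["A", "B", "A", "C", "B", "A", "C"]
csv_text = "model,rain\n" + "".join(
    f"{m},{i * 0.1:.1f}\n" for i, m in enumerate(models))


def chunked_value_counts(source, column, chunksize):
    """Count the values of one column while reading the CSV in chunks."""
    counts = pd.Series(dtype="float64")
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        # fill_value=0 treats a value missing from either side as a zero count
        counts = counts.add(chunk[column].value_counts(), fill_value=0)
    return counts.astype("int64").sort_index()


chunked = chunked_value_counts(io.StringIO(csv_text), "model", chunksize=3)
full = pd.read_csv(io.StringIO(csv_text))["model"].value_counts().sort_index()

print(chunked.to_dict())
print(chunked.to_dict() == full.to_dict())
```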

### Approach 4. Load using Dask  

Lastly, we will explore using [Dask](https://dask.org). 

Dask is a scalable Python library. It handles the chunking and parallel execution for us, so we do not have to split the data manually with `chunksize`. 

In [21]:
%%time
%%memit
# dask way

df_dask = dd.read_csv("../data/combined_data.csv")
print(df_dask["model"].value_counts().compute())

MPI-ESM1-2-HR       5154240
TaiESM1             3541230
NorESM2-MM          3541230
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
SAM0-UNICON         3541153
FGOALS-f3-L         3219300
GFDL-CM4            3219300
GFDL-ESM4           3219300
EC-Earth3-Veg-LR    3037320
MRI-ESM2-0          3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
MPI-ESM1-2-LR        966420
NESM3                966420
AWI-ESM-1-1-LR       966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64
peak memory: 6522.68 MiB, increment: 1465.07 MiB
CPU times: user 1min 11s, sys: 16 s, total: 1min 27s
Wall time: 34.9 s


### Discussion on Observations

- Loading the entire dataset into memory at once with `Pandas` has the highest peak memory usage (9,448 MiB), with a wall time of about one minute. 
- Converting the float64 columns to float32 halves the in-memory size of those columns, from 2,498 MB to 1,249 MB. However, reloading the converted CSV only reduced the peak memory usage from 9,448 MiB to 9,193 MiB. 
- Loading in chunks reduced peak memory usage, but the CPU and wall time for running the cell were about the same as loading everything at once with Pandas. We tried two chunk sizes, 10 million and 1 million rows. The processing time was similar, and the peak memory usage decreased to 6,838 MiB and 5,265 MiB respectively. 
- Loading with Dask reduced peak memory usage to 6,522 MiB. It also cut wall time almost in half, to about 35 s. 
- In summary, of the approaches investigated, loading in chunks of 1 million rows requires the least peak memory. However, if we want to reduce both memory usage and wall time, loading with Dask is the best option. 

<br>

## 4) Perform a Simple EDA in R  

To perform an EDA in R, we will need to transfer the dataframe from Python to R first.   

In this section, we will write our combined dataframe into some advanced file formats and compare their performance while doing a simple EDA in R. 

### 1. Store the Data in Different Formats

#### Arrow file format

In [24]:
%%R
#Loading library
library("arrow");
library("dplyr");

In [25]:
%%time
%%memit

dataset = ds.dataset("../data/combined_data.csv", format="csv")
## materialize the dataset as an in-memory Arrow table
table = dataset.to_table()

peak memory: 6157.95 MiB, increment: 962.28 MiB
CPU times: user 17.6 s, sys: 9.9 s, total: 27.5 s
Wall time: 24.8 s


#### Feather format

In [26]:
%%time
# experiment in writing in feather format 
feather.write_feather(table, '../data/combined_data.feather')

CPU times: user 5.14 s, sys: 12.6 s, total: 17.7 s
Wall time: 6.43 s


#### Parquet format

In [27]:
%%time
## writing as a single parquet 
pq.write_table(table, '../data/combined_data.parquet')

CPU times: user 8.34 s, sys: 1.06 s, total: 9.4 s
Wall time: 9.45 s


In [28]:
%%time
## writing as a partitioned parquet 
pq.write_to_dataset(table, '../data/combined_data_partitioned.parquet',partition_cols=['model'])

CPU times: user 19.3 s, sys: 12.4 s, total: 31.7 s
Wall time: 29.5 s


In [29]:
%%sh
# Check the size of different format
du -sh ../data/combined_data.csv
du -sh ../data/combined_data.feather
du -sh ../data/combined_data.parquet
du -sh ../data/combined_data_partitioned.parquet

5.6G	../data/combined_data.csv
1.0G	../data/combined_data.feather
544M	../data/combined_data.parquet
1.1G	../data/combined_data_partitioned.parquet


>  
We can see that both Feather and Parquet reduce the file size significantly compared with the 5.6 GB CSV file. 
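
The `du` command used above may not be available in every environment, for example on Windows. A portable sketch using only the Python standard library is shown below. The helper names are hypothetical, and the totals are apparent file sizes, which can differ slightly from the block-based figures that `du` reports. A directory such as the partitioned Parquet dataset is handled by summing the files beneath it.

```python
import os
import tempfile


def path_size_bytes(path):
    """Return the size of a file, or the total size of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, file_names in os.walk(path):
        for file_name in file_names:
            total += os.path.getsize(os.path.join(root, file_name))
    return total


def human_readable(n_bytes):
    """Format a byte count with binary units (KiB, MiB, GiB)."""
    size = float(n_bytes)
    for unit in ["B", "KiB", "MiB", "GiB"]:
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Demonstration on a temporary directory that mimics a partitioned dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    part_dir = os.path.join(tmp_dir, "model=AAA")
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part-0.parquet"), "wb") as f:
        f.write(b"\0" * 2048)
    with open(os.path.join(tmp_dir, "single.bin"), "wb") as f:
        f.write(b"\0" * 1024)
    dir_size = path_size_bytes(tmp_dir)
    file_size = path_size_bytes(os.path.join(tmp_dir, "single.bin"))

print(human_readable(dir_size), human_readable(file_size))
```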

### 2. Experimenting with Different Approaches

#### Approach 1. Pandas Exchange

In [49]:
%%time
%%memit
#simple pandas: read the entire dataset into memory
df = pd.read_csv("../data/combined_data.csv")

peak memory: 7457.48 MiB, increment: 4716.52 MiB
CPU times: user 43.8 s, sys: 7.81 s, total: 51.6 s
Wall time: 54 s


In [66]:
## The pandas exchange below is commented out due to memory limitations.
#%%time
#%%R -i df
### Transferring the python dataframe to R
#start_time <- Sys.time()
#library(dplyr)
#print(class(df))
#result <- df %>% count(model)
#print(result)
#end_time <- Sys.time()
#print(end_time - start_time)

>  
We were not able to run the above code due to memory limitations. We believe the cause is the time and memory spent on serialization and deserialization when transferring the dataframe from Pandas to R.

#### Approach 2. Arrow Exchange

In [30]:
%%time
%%memit
dataset = ds.dataset("../data/combined_data.csv", format="csv")
## materialize the dataset as an in-memory Arrow table
table = dataset.to_table()

peak memory: 5357.58 MiB, increment: 3950.18 MiB
CPU times: user 16.9 s, sys: 14.2 s, total: 31 s
Wall time: 28.4 s


In [31]:
%%time
%%memit
## Convert the Arrow table loaded in the previous cell into an R Arrow object
r_table = pyra.converter.py2rpy(table)

5695
rarrow.ChunkedArray: 0.027658939361572266
5695
rarrow.ChunkedArray: 0.021301984786987305
5695
rarrow.ChunkedArray: 0.027105093002319336
5695
rarrow.ChunkedArray: 0.02475905418395996
5695
rarrow.ChunkedArray: 0.029931068420410156
5695
rarrow.ChunkedArray: 0.021418094635009766
5695
rarrow.ChunkedArray: 0.02129817008972168
peak memory: 4328.71 MiB, increment: 237.41 MiB
CPU times: user 18.4 s, sys: 838 ms, total: 19.3 s
Wall time: 19.9 s


In [32]:
%%time
%%R -i r_table
start_time <- Sys.time()
print(class(r_table))
library(dplyr)
result <- r_table %>% collect() %>% count(model)
print(class(r_table %>% collect()))
end_time <- Sys.time()
print(result)
print(end_time - start_time)

[1] "Table"       "ArrowObject" "R6"         
[1] "tbl_df"     "tbl"        "data.frame"
# A tibble: 27 x 2
   model                  n
   <chr>              <int>
 1 ACCESS-CM2       1932840
 2 ACCESS-ESM1-5    1610700
 3 AWI-ESM-1-1-LR    966420
 4 BCC-CSM2-MR      3035340
 5 BCC-ESM1          551880
 6 CanESM5           551880
 7 CMCC-CM2-HR4     3541230
 8 CMCC-CM2-SR5     3541230
 9 CMCC-ESM2        3541230
10 EC-Earth3-Veg-LR 3037320
# … with 17 more rows
Time difference of 7.949494 secs
CPU times: user 9.19 s, sys: 9.8 s, total: 19 s
Wall time: 8.56 s


#### Approach 3. Feather File

In [33]:
%%time
%%R
library(arrow)
start_time <- Sys.time()
r_table <- arrow::read_feather("../data/combined_data.feather")
print(class(r_table))
library(dplyr)
result <- r_table %>% count(model)
end_time <- Sys.time()
print(result)
print(end_time - start_time)

[1] "tbl_df"     "tbl"        "data.frame"
# A tibble: 27 x 2
   model                  n
   <chr>              <int>
 1 ACCESS-CM2       1932840
 2 ACCESS-ESM1-5    1610700
 3 AWI-ESM-1-1-LR    966420
 4 BCC-CSM2-MR      3035340
 5 BCC-ESM1          551880
 6 CanESM5           551880
 7 CMCC-CM2-HR4     3541230
 8 CMCC-CM2-SR5     3541230
 9 CMCC-ESM2        3541230
10 EC-Earth3-Veg-LR 3037320
# … with 17 more rows
Time difference of 12.87811 secs
CPU times: user 10.6 s, sys: 13.3 s, total: 23.9 s
Wall time: 12.9 s


#### Approach 4. Parquet File

In [34]:
%%time
%%R
library(arrow)
start_time <- Sys.time()
r_table <- arrow::read_parquet("../data/combined_data.parquet")
print(class(r_table))
library(dplyr)
result <- r_table %>% count(model)
end_time <- Sys.time()
print(result)
print(end_time - start_time)

[1] "tbl_df"     "tbl"        "data.frame"
# A tibble: 27 x 2
   model                  n
   <chr>              <int>
 1 ACCESS-CM2       1932840
 2 ACCESS-ESM1-5    1610700
 3 AWI-ESM-1-1-LR    966420
 4 BCC-CSM2-MR      3035340
 5 BCC-ESM1          551880
 6 CanESM5           551880
 7 CMCC-CM2-HR4     3541230
 8 CMCC-CM2-SR5     3541230
 9 CMCC-ESM2        3541230
10 EC-Earth3-Veg-LR 3037320
# … with 17 more rows
Time difference of 8.777983 secs
CPU times: user 10.2 s, sys: 6.58 s, total: 16.7 s
Wall time: 8.82 s


### 3. Discussion on Observations
- Compared with the 5.6 GB CSV file, the Feather file takes 1.0 GB and the single Parquet file takes only 544 MB (1.1 GB when partitioned by model). Both formats are more space efficient than CSV.
- Exchanging data to R through Pandas failed because the machine ran out of memory.
- Exchanging data to R with Arrow took about 20 s, and the EDA in R on the resulting Arrow table took about 9 s, for a total of about 29 s. 
- Reading the Feather file in R and performing the EDA took about 13 s. 
- Reading the Parquet file in R and performing the EDA took about 9 s.       

In this case, we would choose the Parquet file because it is the fastest approach.      
For storing the data on disk, Parquet would also be the best choice because it uses the least space. Parquet is a column-oriented storage format, so a reader can load only the columns it needs. Since we count only by the `model` column, restricting the read to that column could reduce the processing time further.

## Challenges and Difficulties Faced 

- Peak Memory Seems to Vary
    - The peak memory usage varies considerably across team members' machines. We believe this is because different background applications were taking up RAM. 
- Combining and Creating the CSV File
    - Nash's laptop initially ran into storage issues because its hard drive was nearly full. Nash had to free up space before he could create combined_data.csv.
    
- Pandas Exchange
    - We were not able to exchange the combined data from Pandas to R due to memory limitations. We believe the cause is the time and memory spent on serialization and deserialization during the transfer.
    
- Arrow Exchange
    - For Arrow Exchange, we had to manually add the wall time of the conversion to R to the wall time of the EDA, because the conversion function comes from a Python package while the EDA is done in R. This differs from the Feather and Parquet approaches. We had hoped to run the Python and R code together in one cell so that we could measure their memory usage together, but the kernel restarted every time we tried. 

- Running the Notebook on Team Members' Computers
    - The notebook only runs to completion on Nick's computer. Nash's computer has only 8 GB of RAM, so it fails to load combined_data.csv, whose peak memory usage exceeds 8,000 MiB. The same issue affects every other cell whose peak memory usage exceeds 8,000 MiB. 
    
    - On Huanhuan's computer, the EDA in R using the Feather file failed to run to completion. The cell used more than 20 minutes of CPU time, and no error message was shown.  
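
Since some cells need more memory than an 8 GB machine can offer, one possible precaution is to estimate a dataframe's in-memory size from a small sample before loading the whole file. The sketch below is an illustration under stated assumptions, not something the team ran: the helper `estimate_memory_bytes` is hypothetical, and it simply scales the per-row footprint of the first rows (measured with `memory_usage(deep=True)`) by a known total row count.

```python
import io

import pandas as pd


def estimate_memory_bytes(source, total_rows, sample_rows=1000):
    """Scale the memory footprint of the first rows up to total_rows."""
    sample = pd.read_csv(source, nrows=sample_rows)
    # deep=True also counts the Python string objects held in object columns
    bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)
    return int(bytes_per_row * total_rows)


# Small in-memory stand-in for combined_data.csv (values are made up)
csv_text = "time,rain (mm/day),model\n" + "".join(
    f"1889-01-{day:02d} 12:00:00,{day * 0.01:.2f},ACCESS-CM2\n"
    for day in range(1, 11))

estimate = estimate_memory_bytes(io.StringIO(csv_text),
                                 total_rows=62_467_843, sample_rows=10)
print(f"estimated size: {estimate / 2**30:.1f} GiB")
```

The estimate covers only the resulting dataframe, so it is best read as a lower bound on what loading the file requires.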