# DSCI 525 - Web and Cloud Computing
## Milestone 1: Tackling big data on your laptop
### Group 14
Group Members: Sasha Babicki, Cheuk Ho, Sakshi Jain, Zeliha Ural Merpez

#### Note: code in this milestone is modified from 525 lecture notes
https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Lectures/Lecture_1_2.ipynb

### 3. Download the data
1. Download the data from figshare to your local computer using the figshare API (you can make use of requests library).
2. Extract the zip file, again programmatically, similar to how we did it in class.

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
from memory_profiler import memory_usage

import dask.dataframe as dd

In [2]:
%load_ext rpy2.ipython
%load_ext memory_profiler

In [3]:
output_directory = "figshareairline/"
combined_file_path = output_directory + "combined_data.csv"

In [4]:
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}

In [5]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all of the article's metadata, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'is_link_only': False,
  'name': 'daily_rainfall_2014.png',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'id': 26579150,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'size': 58863},
 {'is_link_only': False,
  'name': 'environment.yml',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'id': 26579171,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'size': 192},
 {'is_link_only': False,
  'name': 'README.md',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'id': 26586554,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'size': 5422},
 {'is_link_only': False,
  'name': 'data.zip',
  'supplied_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'computed_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'id': 26766812,
  'download_url': 'https://

In [6]:
%%time

download_file = "data.zip"
for file in files:
    if file["name"] == download_file:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: user 5.34 s, sys: 4.73 s, total: 10.1 s
Wall time: 1min 19s
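
The figshare metadata above lists a `supplied_md5` for every file, so the download can be checked before it is extracted. The following is a minimal sketch of such a check using only the standard library. The helper names `md5_of_file` and `find_file_entry` are hypothetical (they are not part of the lecture code), and a small temporary file stands in for `data.zip`.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory low."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_file_entry(files, name):
    """Return the metadata dict whose 'name' matches, or None if it is absent."""
    return next((entry for entry in files if entry["name"] == name), None)


# Self-contained demonstration: a temporary file plays the role of data.zip.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as fh:
        fh.write(b"hello")
    # Hypothetical metadata shaped like the figshare response above.
    demo_files = [{"name": "demo.bin", "supplied_md5": hashlib.md5(b"hello").hexdigest()}]
    entry = find_file_entry(demo_files, "demo.bin")
    checksum_ok = md5_of_file(demo_path) == entry["supplied_md5"]
```

With the real metadata, the same comparison would be made between `md5_of_file(output_directory + "data.zip")` and the `supplied_md5` of the `data.zip` entry.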


In [7]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, download_file), 'r') as f:
    f.extractall(output_directory)

CPU times: user 16.9 s, sys: 3.1 s, total: 20 s
Wall time: 24.1 s
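
Since only the CSV files are needed later, the archive could also be inspected first and extracted selectively. This is a sketch under that assumption, not what the cell above does (it extracts everything). The helper name `extract_csv_members` is hypothetical, and the demo archive is created on the fly.

```python
import os
import tempfile
import zipfile


def extract_csv_members(zip_path, target_dir):
    """Extract only the .csv members of an archive and return their names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        csv_members = [m for m in archive.namelist() if m.lower().endswith(".csv")]
        archive.extractall(target_dir, members=csv_members)
    return csv_members


# Self-contained demonstration with a tiny archive.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("A_daily_rainfall_NSW.csv", "time,rain\n")
        archive.writestr("notes.txt", "not a csv")
    out_dir = os.path.join(tmp, "out")
    extracted = extract_csv_members(demo_zip, out_dir)
    extracted_on_disk = sorted(os.listdir(out_dir))
```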


### 4. Combining data CSVs
1. Use one of the following options to combine data CSVs into a single CSV. (Pandas, DASK)
2. When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g. for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON).
3. Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.

In [8]:
%%time
%memit

# Combine files and save
files = glob.glob(output_directory + "*_daily_rainfall_NSW.csv")
df = pd.concat(
    (
        pd.read_csv(file, index_col=0).assign(
            model=re.findall(r"[\/|\\](.*)_daily_rainfall", file)[0]
        )
        for file in files
    )
)
df.to_csv(combined_file_path)

peak memory: 157.21 MiB, increment: 0.05 MiB
CPU times: user 6min 17s, sys: 15 s, total: 6min 32s
Wall time: 6min 46s
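
The cell above builds the whole combined dataframe in memory before writing it. As a point of comparison, here is a sketch of a streaming alternative that uses the standard `csv` module instead of `pd.concat`: each input file is appended to the output row by row, so only one row is held at a time. The function names `model_from_path` and `combine_streaming` are hypothetical, and the model name is taken from the file name with `os.path.basename` rather than the regular expression used above.

```python
import csv
import glob
import os
import tempfile


def model_from_path(path, suffix="_daily_rainfall_NSW.csv"):
    """Derive the model name from a name like SAM0-UNICON_daily_rainfall_NSW.csv."""
    name = os.path.basename(path)
    if not name.endswith(suffix):
        raise ValueError(f"unexpected file name: {name}")
    return name[: -len(suffix)]


def combine_streaming(paths, out_path):
    """Append each CSV to out_path row by row, adding a 'model' column."""
    rows_written = 0
    with open(out_path, "w", newline="") as out:
        writer = None
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                header = next(reader)
                if writer is None:
                    # Write the header once, taken from the first file.
                    writer = csv.writer(out)
                    writer.writerow(header + ["model"])
                model = model_from_path(path)
                for row in reader:
                    writer.writerow(row + [model])
                    rows_written += 1
    return rows_written


# Self-contained demonstration with two tiny stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for model, value in [("AAA", "1.0"), ("BBB-2", "2.0")]:
        demo_name = os.path.join(tmp, f"{model}_daily_rainfall_NSW.csv")
        with open(demo_name, "w", newline="") as fh:
            fh.write(f"time,rain (mm/day)\n1889-01-01 12:00:00,{value}\n")
    paths = sorted(glob.glob(os.path.join(tmp, "*_daily_rainfall_NSW.csv")))
    n_rows = combine_streaming(paths, os.path.join(tmp, "combined_data.csv"))
    with open(os.path.join(tmp, "combined_data.csv"), newline="") as fh:
        combined_rows = list(csv.reader(fh))
```

This trades the convenience of pandas for a flat memory profile. Whether it is faster than the pandas version on the real data has not been measured here.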


In [9]:
%%sh
du -sh figshareairline/combined_data.csv

5.6G	figshareairline/combined_data.csv


In [10]:
%%time

# Read file
df = pd.read_csv(combined_file_path, index_col=0, parse_dates=True)

CPU times: user 1min 8s, sys: 13.2 s, total: 1min 21s
Wall time: 1min 25s


In [11]:
%%time
print(df.shape)

(62467843, 6)
CPU times: user 204 µs, sys: 443 µs, total: 647 µs
Wall time: 887 µs


In [12]:
%%time
df

CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 3.81 µs


time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
1889-01-01 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.244226e-13,MPI-ESM-1-2-HAM
1889-01-02 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.217326e-13,MPI-ESM-1-2-HAM
1889-01-03 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.498125e-13,MPI-ESM-1-2-HAM
1889-01-04 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.251282e-13,MPI-ESM-1-2-HAM
1889-01-05 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.270161e-13,MPI-ESM-1-2-HAM
...,...,...,...,...,...,...
2014-12-27 12:00:00,-30.157068,-29.214660,153.1250,154.3750,6.689683e+00,SAM0-UNICON
2014-12-28 12:00:00,-30.157068,-29.214660,153.1250,154.3750,7.862555e+00,SAM0-UNICON
2014-12-29 12:00:00,-30.157068,-29.214660,153.1250,154.3750,1.000503e+01,SAM0-UNICON
2014-12-30 12:00:00,-30.157068,-29.214660,153.1250,154.3750,8.541592e+00,SAM0-UNICON


In [13]:
df['model'].nunique()

27

In [14]:
df['model'].unique()

array(['MPI-ESM-1-2-HAM', 'AWI-ESM-1-1-LR', 'NorESM2-LM', 'ACCESS-CM2',
       'FGOALS-f3-L', 'CMCC-CM2-HR4', 'MRI-ESM2-0', 'GFDL-CM4',
       'BCC-CSM2-MR', 'EC-Earth3-Veg-LR', 'CMCC-ESM2', 'NESM3',
       'MPI-ESM1-2-LR', 'ACCESS-ESM1-5', 'FGOALS-g3', 'INM-CM4-8',
       'MPI-ESM1-2-HR', 'TaiESM1', 'NorESM2-MM', 'CMCC-CM2-SR5',
       'KIOST-ESM', 'INM-CM5-0', 'MIROC6', 'BCC-ESM1', 'GFDL-ESM4',
       'CanESM5', 'SAM0-UNICON'], dtype=object)

#### 4.3 Runtime Observations: 

##### Summary
- The combined CSV is `5.6 GB` on disk. We used the pandas `concat` method to combine the data CSVs.
- Run times and memory usage differed across the machines in the team. Run times ranged from roughly `4 to 7.5 minutes`, which is likely a result of the different processing power and speed of our laptops. A higher reported peak memory seems to coincide with a lower run time, although three machines are too few to draw a firm conclusion.
- The runtime and memory usage observations for each laptop are listed below:


##### Zeliha
- Combining files:
    - peak memory: 13663.16 MiB, increment: 0.01 MiB
    - CPU times: user 4min 16s, sys: 6.37 s, total: 4min 22s
    - Wall time: 4min 24s
- Reading combined file:
    - CPU times: user 45.3 s, sys: 3.61 s, total: 48.9 s
    - Wall time: 49 s
    
##### Sasha
- Combining files:
    - peak memory: 157.21 MiB, increment: 0.05 MiB
    - CPU times: user 6min 17s, sys: 15 s, total: 6min 32s
    - Wall time: 6min 46s
- Reading combined file:
    - CPU times: user 1min 8s, sys: 13.2 s, total: 1min 21s
    - Wall time: 1min 25s
    
##### Chuck
- Combining files:
    - peak memory: 2584.68 MiB, increment: 0.08 MiB
    - CPU times: user 5min 32s, sys: 17.7 s, total: 5min 49s
    - Wall time: 5min 55s
- Reading combined file:
    - CPU times: user 56.9 s, sys: 14.2 s, total: 1min 11s
    - Wall time: 1min 14s
    
Note: Sakshi was not able to successfully run this, see issue here for details: https://github.com/UBC-MDS/DSCI525_Group14/issues/17
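
To make measurements like the ones above easier to repeat on each machine, the timing and memory bookkeeping can be wrapped in one helper. The sketch below uses the standard library's `time` and `tracemalloc` rather than `memory_profiler`; the helper name `measure` is hypothetical. `tracemalloc` only tracks allocations made through Python's memory allocators, so its peak is not directly comparable to the process-level `peak memory` reported by `%memit`.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, peak_bytes).

    peak_bytes covers Python-level allocations traced during the call only.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Example workload: build a list of 100,000 integers.
result, seconds, peak_bytes = measure(lambda: list(range(100_000)))
print(f"wall time: {seconds:.4f} s, traced peak: {peak_bytes / 2**20:.2f} MiB")
```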

### 5. Load the combined CSV to memory and perform a simple EDA
1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).
    - Changing dtype of your data
    - Load just the columns we want
    - Loading in chunks
    - Dask
2. Discuss your observations.
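
Two of the listed approaches, loading only the needed column and loading in chunks, are not explored further in this notebook. In pandas they correspond to the `usecols` and `chunksize` arguments of `read_csv`. The idea itself can be sketched with the standard library alone: read a fixed number of rows at a time, keep only one column, and accumulate the counts. The function name `chunked_value_counts` and the demo data are hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import islice


def chunked_value_counts(handle, column, chunk_size=1000):
    """Count the values of one column, reading chunk_size rows at a time."""
    reader = csv.DictReader(handle)
    totals = Counter()
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        # Only the requested column is kept; the rest of each row is discarded.
        totals.update(row[column] for row in chunk)
    return totals


# Tiny in-memory stand-in for the combined CSV.
demo_csv = io.StringIO(
    "time,rain (mm/day),model\n"
    "1889-01-01,0.1,AAA\n"
    "1889-01-02,0.2,AAA\n"
    "1889-01-01,0.3,BBB\n"
)
counts = chunked_value_counts(demo_csv, "model", chunk_size=2)
```

Peak memory is then bounded by the chunk size plus the size of the running counts, not by the size of the file.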

#### 5.1.1 Changing dtype of data:

In [15]:
# View original dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 62467843 entries, 1889-01-01 12:00:00 to 2014-12-31 12:00:00
Data columns (total 6 columns):
 #   Column         Dtype  
---  ------         -----  
 0   lat_min        float64
 1   lat_max        float64
 2   lon_min        float64
 3   lon_max        float64
 4   rain (mm/day)  float64
 5   model          object 
dtypes: float64(5), object(1)
memory usage: 3.3+ GB


In [16]:
float_cols = ["lat_min","lat_max","lon_min","lon_max","rain (mm/day)"]
df_32 = df.copy()
df_64 = df.copy()

df_32[float_cols] = df_32[float_cols].astype('float32', errors='ignore')
print(f"DataFrame with numeric columns as float64: {df_64.memory_usage().sum() / 1e9:.2f} GB")
print(f"DataFrame with numeric columns as float32: {df_32.memory_usage().sum() / 1e9:.2f} GB")

DataFrame with numeric columns as float64: 3.50 GB
DataFrame with numeric columns as float32: 2.25 GB
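
The two sizes printed above can be reproduced with simple arithmetic, which also shows why halving the float width does not halve the dataframe. The sketch below assumes 8 bytes per row for the datetime index and 8 bytes per row for the `model` column (the object pointer only, which is what `memory_usage()` reports without `deep=True`). The function name `shallow_size_gb` is hypothetical.

```python
N_ROWS = 62_467_843        # from df.shape
N_FLOAT_COLS = 5           # lat_min, lat_max, lon_min, lon_max, rain (mm/day)
INDEX_BYTES = 8            # assumption: datetime index, 8 bytes per row
OBJECT_POINTER_BYTES = 8   # assumption: one 64-bit pointer per `model` entry


def shallow_size_gb(float_bytes):
    """Estimated shallow DataFrame size in GB for a given float width in bytes."""
    bytes_per_row = N_FLOAT_COLS * float_bytes + INDEX_BYTES + OBJECT_POINTER_BYTES
    return N_ROWS * bytes_per_row / 1e9


size_float64 = shallow_size_gb(8)   # 56 bytes per row
size_float32 = shallow_size_gb(4)   # 36 bytes per row
reduction = 1 - size_float32 / size_float64
print(f"float64: {size_float64:.2f} GB, float32: {size_float32:.2f} GB, saved: {reduction:.0%}")
# prints: float64: 3.50 GB, float32: 2.25 GB, saved: 36%
```

Under these assumptions the index and the `model` column account for 16 of the 36 bytes per row after the conversion, so further savings would have to come from those two.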


In [17]:
%%time
%%memit
df_64["lat_min"].value_counts()

peak memory: 4448.21 MiB, increment: 394.99 MiB
CPU times: user 709 ms, sys: 325 ms, total: 1.03 s
Wall time: 3 s


In [18]:
%%time
%%memit
df_32["lat_min"].value_counts()

peak memory: 4924.55 MiB, increment: 477.18 MiB
CPU times: user 762 ms, sys: 147 ms, total: 909 ms
Wall time: 1.41 s


#### 5.1.2 Dask:

In [19]:
# Clear pandas dataframe to reload in following cell
del df

In [20]:
%%time
%%memit

# Pandas - load file
df = pd.read_csv(combined_file_path, index_col=0, parse_dates=True)

peak memory: 7893.68 MiB, increment: 3446.18 MiB
CPU times: user 1min 7s, sys: 12 s, total: 1min 19s
Wall time: 1min 23s


In [21]:
%%time
%%memit

# Dask - load file
df_dask = dd.read_csv(combined_file_path)

peak memory: 2181.13 MiB, increment: -0.11 MiB
CPU times: user 87.7 ms, sys: 123 ms, total: 211 ms
Wall time: 3.12 s


In [22]:
%%time
%memit

# Pandas - value_counts for numeric column with many unique values
df["lat_min"].value_counts()

peak memory: 2180.78 MiB, increment: 0.06 MiB
CPU times: user 769 ms, sys: 346 ms, total: 1.11 s
Wall time: 4.61 s


-32.041885    3035329
-32.984293    3035329
-31.099476    3035329
-34.869110    3035329
-30.000000    1747830
               ...   
-30.696652     183960
-36.277805     183960
-33.490981     183960
-30.700015     183960
-36.281964     183960
Name: lat_min, Length: 84, dtype: int64

In [23]:
%%time
%memit

# Dask - value_counts for numeric column with many unique values
df_dask["lat_min"].value_counts().compute()

peak memory: 2724.75 MiB, increment: 0.01 MiB
CPU times: user 1min 27s, sys: 15.2 s, total: 1min 42s
Wall time: 51.9 s


-31.099476    3035329
-32.984293    3035329
-34.869110    3035329
-32.041885    3035329
-30.000000    1747830
               ...   
-30.696652     183960
-36.277805     183960
-36.281964     183960
-30.700015     183960
-33.487232     183960
Name: lat_min, Length: 84, dtype: int64

In [24]:
%%time
%memit

# Pandas - value_counts for str column with few unique values
df["model"].value_counts()

peak memory: 1097.20 MiB, increment: 0.00 MiB
CPU times: user 4.85 s, sys: 118 ms, total: 4.97 s
Wall time: 8.9 s


MPI-ESM1-2-HR       5154240
TaiESM1             3541230
CMCC-CM2-SR5        3541230
NorESM2-MM          3541230
CMCC-ESM2           3541230
CMCC-CM2-HR4        3541230
SAM0-UNICON         3541153
FGOALS-f3-L         3219300
GFDL-ESM4           3219300
GFDL-CM4            3219300
EC-Earth3-Veg-LR    3037320
MRI-ESM2-0          3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
FGOALS-g3           1287720
KIOST-ESM           1287720
MPI-ESM1-2-LR        966420
MPI-ESM-1-2-HAM      966420
NESM3                966420
AWI-ESM-1-1-LR       966420
NorESM2-LM           919800
CanESM5              551880
BCC-ESM1             551880
Name: model, dtype: int64

In [25]:
%%time
%memit

# Dask - value_counts for str column with few unique values
df_dask["model"].value_counts().compute()

peak memory: 1216.67 MiB, increment: 0.01 MiB
CPU times: user 1min 26s, sys: 13.9 s, total: 1min 40s
Wall time: 50.4 s


MPI-ESM1-2-HR       5154240
TaiESM1             3541230
NorESM2-MM          3541230
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
SAM0-UNICON         3541153
FGOALS-f3-L         3219300
GFDL-CM4            3219300
GFDL-ESM4           3219300
EC-Earth3-Veg-LR    3037320
MRI-ESM2-0          3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
MPI-ESM1-2-LR        966420
NESM3                966420
AWI-ESM-1-1-LR       966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64

In [26]:
%%time
%memit

# Pandas - summary statistics
df.describe()

peak memory: 1097.90 MiB, increment: 0.02 MiB
CPU times: user 13.1 s, sys: 8.31 s, total: 21.4 s
Wall time: 26.1 s


,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
count,59248540.0,62467840.0,59248540.0,62467840.0,59248540.0
mean,-33.10482,-31.97757,146.9059,148.215,1.90117
std,1.963549,1.992067,3.793784,3.809994,5.585735
min,-36.46739,-36.0,140.625,141.25,-3.807373e-12
25%,-34.86911,-33.66221,143.4375,145.0,3.838413e-06
50%,-33.0,-32.04188,146.875,148.125,0.06154947
75%,-31.4017,-30.15707,150.1875,151.3125,1.020918
max,-29.9,-27.90606,153.75,155.625,432.9395


In [27]:
%%time
%memit

# Dask - summary statistics
df_dask.describe().compute()

peak memory: 3107.04 MiB, increment: 0.14 MiB
CPU times: user 1min 43s, sys: 20.4 s, total: 2min 3s
Wall time: 56.3 s


,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
count,59248540.0,62467840.0,59248540.0,62467840.0,59248540.0
mean,-33.10482,-31.97757,146.9059,148.215,1.90117
std,1.963549,1.992067,3.793784,3.809994,5.585735
min,-36.46739,-36.0,140.625,141.25,-3.807373e-12
25%,-34.375,-33.1,145.5469,146.8125,0.0050607
50%,-32.9,-31.70937,148.125,150.0,0.2565542
75%,-31.09948,-30.0,151.875,153.125,2.824743
max,-29.9,-27.90606,153.75,155.625,432.9395


In [28]:
%%time
%%memit

# Pandas - multiple operations
df["lat_min"].value_counts()
df["model"].value_counts()
df.describe()

peak memory: 6937.61 MiB, increment: 4413.84 MiB
CPU times: user 18 s, sys: 6.14 s, total: 24.1 s
Wall time: 26.4 s


In [29]:
%%time
%%memit

# Dask - multiple operations
dd.compute(
    df_dask["lat_min"].value_counts(),
    df_dask["model"].value_counts(),
    df_dask.describe()
)

peak memory: 3818.21 MiB, increment: 693.53 MiB
CPU times: user 1min 48s, sys: 20.3 s, total: 2min 8s
Wall time: 59.5 s


#### 5.2 Discussion:

- Changing the dtype of the numeric columns from `float64` to `float32` reduced the space the dataframe takes in memory by about a third (from `3.50 GB` to `2.25 GB`). The saving is less than half because the datetime index and the `model` column are unchanged. However, performing `value_counts()` on a `float32` column had a larger memory increment than on the `float64` column (about 477 MiB versus 395 MiB), and it was not meaningfully faster in CPU time (909 ms versus 1.03 s), even though its wall time was lower (1.41 s versus 3 s). The extra memory may be due to type conversions happening under the hood.
- Reading data from CSV into a local variable is much faster and takes less memory when using `dask` rather than `pandas` (about 3 s with no measurable memory increment, versus 1 min 23 s and an increment of about 3446 MiB). This makes sense because `dask` loads a representation of the structure of the dataframe rather than the data itself, whereas `pandas` loads all the data into memory. `dask` seems to use a similar amount of memory to `pandas` when performing `value_counts()` on columns, but is much slower than `pandas` when computing `value_counts()` for a single column (about 50 s versus 5 to 9 s wall time). We expect this is because the data needs to be loaded into memory when the task is performed, whereas with `pandas` this step is already complete. When performing multiple operations together, `dask` has a far lower peak memory (about 3818 MiB versus 6938 MiB), but it takes more than twice as long as `pandas` (59.5 s versus 26.4 s wall time).
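
The behaviour described above (cheap to create, expensive to compute) can be illustrated with a toy deferred computation. This is only a conceptual sketch and not how `dask` is implemented: the class `LazyColumnCounts` is hypothetical, it stores what to count and reads the source only when `compute()` is called, and every `compute()` call reads the source again.

```python
import csv
import io
from collections import Counter


class LazyColumnCounts:
    """Toy deferred computation: remembers what to count, reads only on compute()."""

    def __init__(self, open_source, column):
        self._open_source = open_source  # callable returning a fresh file-like object
        self._column = column
        self.reads = 0                   # how many times the source was actually read

    def compute(self):
        """Read the whole source and return the value counts of the column."""
        self.reads += 1
        with self._open_source() as handle:
            return Counter(row[self._column] for row in csv.DictReader(handle))


DEMO = "model,rain\nAAA,0.1\nAAA,0.2\nBBB,0.3\n"
lazy = LazyColumnCounts(lambda: io.StringIO(DEMO), "model")
reads_before = lazy.reads   # creating the object reads nothing
counts = lazy.compute()     # the source is parsed here
lazy.compute()              # a second compute() parses the source again
reads_after = lazy.reads
```

In this toy model the cost of reading is paid on every `compute()`, which is consistent with each single `dask` operation above taking roughly the same 50 to 60 s regardless of how simple the operation is.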

### 6. Perform a simple EDA in R
1. Pick an approach to transfer the dataframe from python to R.
    - Parquet file
    - Feather file
    - Pandas exchange
    - Arrow exchange
2. Discuss why you chose this approach over others.

In [30]:
## Install the pyarrow packages: https://arrow.apache.org/docs/python/install.html
import pyarrow.dataset as ds
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather

## Install rpy2: https://anaconda.org/conda-forge/rpy2
import rpy2.rinterface

import rpy2_arrow.pyarrow_rarrow as pyra

In [31]:
%%R
library(arrow)
library(dplyr)

R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




#### 6.1.1 Convert Table with Pandas and Save File in Feather Format

In [32]:
%%time
%%memit
table = pa.Table.from_pandas(df)

peak memory: 4085.97 MiB, increment: 1439.46 MiB
CPU times: user 4.67 s, sys: 940 ms, total: 5.62 s
Wall time: 5.56 s


In [33]:
%%time
%memit

# Write to feather format 
feather.write_feather(table, "figshareairline/combined.feather")

peak memory: 4086.95 MiB, increment: 0.23 MiB
CPU times: user 2.51 s, sys: 1.88 s, total: 4.39 s
Wall time: 7.76 s


In [34]:
%%sh
du -sh figshareairline/combined.feather

1.1G	figshareairline/combined.feather


In [35]:
%%time
%%R

start_time <- Sys.time()
r_table <- arrow::read_feather("figshareairline/combined.feather")
print(class(r_table))

result <- r_table %>% count(model)
end_time <- Sys.time()
print(result)
print(end_time - start_time)

[1] "tbl_df"     "tbl"        "data.frame"
# A tibble: 27 x 2
   model                  n
   <chr>              <int>
 1 ACCESS-CM2       1932840
 2 ACCESS-ESM1-5    1610700
 3 AWI-ESM-1-1-LR    966420
 4 BCC-CSM2-MR      3035340
 5 BCC-ESM1          551880
 6 CanESM5           551880
 7 CMCC-CM2-HR4     3541230
 8 CMCC-CM2-SR5     3541230
 9 CMCC-ESM2        3541230
10 EC-Earth3-Veg-LR 3037320
# … with 17 more rows
Time difference of 33.11376 secs
CPU times: user 11.1 s, sys: 13.6 s, total: 24.7 s
Wall time: 33.5 s


#### 6.1.2 Convert Table with Arrow and Save File in Feather Format

In [36]:
%%time
%memit

dataset_arrow = ds.dataset(combined_file_path, format="csv")

# arrow table format
table_arrow = dataset_arrow.to_table()

peak memory: 5460.56 MiB, increment: 0.17 MiB
CPU times: user 19.9 s, sys: 11.1 s, total: 31 s
Wall time: 33.4 s


In [37]:
table_arrow

pyarrow.Table
time: timestamp[s]
lat_min: double
lat_max: double
lon_min: double
lon_max: double
rain (mm/day): double
model: string

In [38]:
%%time
%memit

# experiment in writing in feather format 
feather.write_feather(table_arrow, 'figshareairline/arrow_combined.feather')

peak memory: 4685.31 MiB, increment: -156.72 MiB
CPU times: user 4.54 s, sys: 9.27 s, total: 13.8 s
Wall time: 14.7 s


In [39]:
%%sh
du -sh figshareairline/arrow_combined.feather

1.0G	figshareairline/arrow_combined.feather


In [40]:
%%time
%%R

start_time <- Sys.time()
r_table <- arrow::read_feather("figshareairline/arrow_combined.feather")
print(class(r_table))

result <- r_table %>% count(model)
end_time <- Sys.time()
print(result)
print(end_time - start_time)

[1] "tbl_df"     "tbl"        "data.frame"
# A tibble: 27 x 2
   model                  n
   <chr>              <int>
 1 ACCESS-CM2       1932840
 2 ACCESS-ESM1-5    1610700
 3 AWI-ESM-1-1-LR    966420
 4 BCC-CSM2-MR      3035340
 5 BCC-ESM1          551880
 6 CanESM5           551880
 7 CMCC-CM2-HR4     3541230
 8 CMCC-CM2-SR5     3541230
 9 CMCC-ESM2        3541230
10 EC-Earth3-Veg-LR 3037320
# … with 17 more rows
Time difference of 1.064979 mins
CPU times: user 13.5 s, sys: 24.6 s, total: 38.1 s
Wall time: 1min 4s


#### 6.1.3 Convert Table with Arrow and Directly Loaded into R

In [41]:
%%time
%memit

r_table = pyra.converter.py2rpy(table_arrow)

peak memory: 5443.09 MiB, increment: 0.25 MiB
5695
rarrow.ChunkedArray: 0.0340421199798584
5695
rarrow.ChunkedArray: 0.022715091705322266
5695
rarrow.ChunkedArray: 0.0239717960357666
5695
rarrow.ChunkedArray: 0.030026912689208984
5695
rarrow.ChunkedArray: 0.031365156173706055
5695
rarrow.ChunkedArray: 0.02218317985534668
5695
rarrow.ChunkedArray: 0.024814128875732422
CPU times: user 25.9 s, sys: 3.54 s, total: 29.5 s
Wall time: 34.7 s


In [42]:
%%R -i r_table

start_time <- Sys.time()
print(class(r_table %>% collect()))
result2 <- r_table %>% collect() %>% count(model)
end_time <- Sys.time()
print(result2)
print(end_time - start_time)

[1] "tbl_df"     "tbl"        "data.frame"
# A tibble: 27 x 2
   model                  n
   <chr>              <int>
 1 ACCESS-CM2       1932840
 2 ACCESS-ESM1-5    1610700
 3 AWI-ESM-1-1-LR    966420
 4 BCC-CSM2-MR      3035340
 5 BCC-ESM1          551880
 6 CanESM5           551880
 7 CMCC-CM2-HR4     3541230
 8 CMCC-CM2-SR5     3541230
 9 CMCC-ESM2        3541230
10 EC-Earth3-Veg-LR 3037320
# … with 17 more rows
Time difference of 9.672174 secs


#### 6.2 Discussion: 

##### **Observation Summary - Chuck's Laptop**

| Step                                                           | Peak Memory (MiB) | Memory Increment (MiB) | Wall Time (s) | File Size |
| -------------------------------------------------------------- | ----------------- | ---------------------- | ------------- | --------- |
| Convert to Table - Pandas                                      | 5406              | 1019.34                | 3.86          | NA        |
| Convert to Table - Arrow                                       | 9439              | \-6.78                 | 42.1          | NA        |
|                                                                |                   |                        |               |           |
| Write File to Feather - Pandas                                 | 5446              | \-32.73                | 24.6          | 1.1G      |
| Write File to Feather - Arrow                                  | 4227              | \-11.76                | 25.2          | 1.0G      |
|                                                                |                   |                        |               |           |
| Direct Convert Python table to R with Arrow (converter.py2rpy) | 8801              | \-242.04               | 70            | NA        |
|                                                                |                   |                        |               |           |
| EDA - Feather + Pandas                                         | NA                | NA                     | 18.1          | NA        |
| EDA - Feather + Arrow                                          | NA                | NA                     | 26.2          | NA        |
| EDA - Directly Loading with converter.py2rpy                   | NA                | NA                     | 8             | NA        |
|                                                                |                   |                        |               |           |
| **Overall Run Time - Feather + Pandas**                        | NA                | NA                     | **46.6**      | NA        |
| **Overall Run Time - Feather + Arrow**                         | NA                | NA                     | **93.5**      | NA        |
| **Overall Run Time - Direct Convert with converter.py2rpy**    | NA                | NA                     | **120.1**     | NA        |

- Our priority was a fast run time. We chose Feather over Parquet because we expected Parquet operations to be slower: Parquet compresses the data to save storage, which may hinder speed. We would probably use Parquet if long-term storage were the main concern. In this case, the Feather file is already sufficiently small.
- We tried building the Arrow table both from the pandas dataframe and directly with Arrow. We chose to build the table from the pandas dataframe and then save it as a Feather file, because this route was the fastest overall across `constructing the table, saving the file, reading the file into R and performing the EDA (i.e. the R equivalent of value_counts())`.
- As we expected, the Feather file format saves a lot of space: the file takes up only `~1.1 GB` on disk, compared to `5.6 GB` for the CSV.
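
The overall run times in the table appear to be the sums of the per-step wall times for each route. The short sketch below re-derives them from the table values to make the comparison explicit. The grouping of steps into routes and the dictionary keys are our reading of the table, not something the table states directly.

```python
# Wall times in seconds, copied from the observation table (Chuck's laptop).
steps = {
    "Feather + Pandas": {"to_table": 3.86, "write_feather": 24.6, "eda_in_r": 18.1},
    "Feather + Arrow": {"to_table": 42.1, "write_feather": 25.2, "eda_in_r": 26.2},
    "Direct py2rpy": {"to_table": 42.1, "convert_to_r": 70.0, "eda_in_r": 8.0},
}

# Sum the steps of each route and round to one decimal, as in the table.
totals = {name: round(sum(parts.values()), 1) for name, parts in steps.items()}
fastest = min(totals, key=totals.get)
print(totals)
print("fastest route:", fastest)
```

Although the direct conversion has the fastest EDA step, its total is dominated by building the Arrow table from CSV and by the conversion itself.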