# Milestone 1
## DSCI 525 Web and Cloud Computing
## Group 16

This notebook downloads observed and simulated rainfall data sets for New South Wales, Australia, covering the period 1889 to 2014. The data are then combined, and basic exploratory data analyses are conducted in both Python and R.

In [1]:
import re
import os
import zipfile
import requests
from urllib.request import urlretrieve
import json
import rpy2.rinterface
import dask.dataframe as dd
import pandas as pd
from memory_profiler import memory_usage
import pyarrow.dataset as ds
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather
# import rpy2.rinterface
# import rpy2_arrow.pyarrow_rarrow as pyra

In [2]:
# %load_ext rpy2.ipython
%load_ext memory_profiler

In [3]:
# %%R
# library(dplyr)
# library(arrow)

# Data Download
The following code chunk downloads the data used in the subsequent analyses.  The data are downloaded from 'figshare.com'.  The file 'data.zip' is saved to a local directory called 'data'.

In [4]:
%%time
%%memit
# Print out time and memory taken for downloading data

# This code is adapted from DSCI 525 lecture demonstration notebook (Gittu George, 2021,
# https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Lectures/Lecture_1_2.ipynb)
url = "https://api.figshare.com/v2/articles/14096681"
headers = {"Content-Type": "application/json"}
output_directory = "../data/"

response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)
files = data["files"]

for file in files:
    # Match the file name exactly; `in` would also accept any substring of "data.zip"
    if file["name"] == "data.zip":
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

peak memory: 101.90 MiB, increment: 3.66 MiB
Wall time: 20min 34s


After it has been downloaded locally, 'data.zip' is extracted and stored in the 'data' directory.

In [5]:
%%time
%%memit
# Print out time and memory taken to extract data

with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), "r") as f:
    f.extractall(output_directory)

peak memory: 107.45 MiB, increment: 5.59 MiB
Wall time: 16.1 s


Note: loading every .csv file into RAM, combining them, and then saving the result again is cumbersome. It would be much easier if the files could be stitched together directly without loading them into RAM.

# Combining Data
The following code chunk combines all of the unzipped rainfall .csv files into a single file called 'combined_data.csv'. It creates a pandas dataframe called `full_df`, then loads each .csv file one by one and concatenates it with `full_df`. As a result, the contents of every .csv file are held in RAM at once in a single pandas dataframe, which in this case means almost 7 GB of data. Computers without sufficient RAM will not be able to perform this operation, and even on systems with sufficient RAM, simple operations (such as concatenation) on a variable of this size are time consuming. To demonstrate this, screenshots of the time and memory usage for this operation are included below the code chunk. The wall time and peak memory on each system are summarized below, along with some general hardware specifications:
1. Wall time: 7min 9s; Peak memory: 6891.53 MiB
    - Processor: i7-10510U (4 cores, up to 4.90 GHz)
    - RAM: 16 GB
2. Wall time: 9min 46s; Peak memory: 3097.45 MiB
    - Processor: i5
    - RAM: 8 GB
3. Wall time: 6min 5s; Peak memory: 7265.16 MiB
    - Processor: i7-8700K (6 cores, up to 3.70 GHz)
    - RAM: 16 GB
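
The memory cost described above comes from holding every file in one dataframe. As a rough sketch of an alternative (not the approach used or timed in this notebook), the files could be appended to the output one row at a time using only the standard library. The function name `stitch_csvs` is hypothetical, and the sketch assumes that every input file shares the same header and follows the `<model>_daily...` naming pattern that the regular expression in this notebook relies on. Unlike `full_df`, it writes the `model` column last rather than first.

```python
import csv
import os
import re


def stitch_csvs(input_dir, output_path):
    """Append every '<model>_daily*.csv' file in input_dir to one CSV, row by row.

    Hypothetical helper: only one row is held in memory at a time.
    Returns the number of data rows written.
    """
    out_name = os.path.basename(output_path)
    # List the inputs before the output file is created, and never read the output itself
    file_names = sorted(
        name
        for name in os.listdir(input_dir)
        if name.endswith(".csv") and name != out_name
    )
    header_written = False
    rows_written = 0
    with open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for name in file_names:
            match = re.search(r"^.*(?=_daily)", name)
            if match is None:
                continue  # skip files that do not follow the naming pattern
            model_name = match.group(0)
            with open(os.path.join(input_dir, name), newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    # Assumption: all files share this header
                    writer.writerow(header + ["model"])
                    header_written = True
                for row in reader:
                    writer.writerow(row + [model_name])
                    rows_written += 1
    return rows_written
```

Calling `stitch_csvs("../data/", "../data/combined_data.csv")` would then produce the combined file. Pure-Python row handling may well be slower than pandas for files of this size, so the benefit is lower memory use rather than speed.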

In [6]:
%%time
%%memit
# Print out time and memory taken to merge and save csv files

file_names = os.listdir(output_directory)
file_names = [file for file in file_names if file[-4:] == ".csv"]


cols = ["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
full_df = pd.DataFrame(columns=["model"] + cols)
full_df.index.rename("time", inplace=True)

for file in file_names:
    model_name = re.search("^.*(?=_daily)", file).group(0)
    full_df = pd.concat(
        [
            full_df,
            pd.read_csv(output_directory + file, index_col=0).assign(model=model_name),
        ]
    )

full_df.to_csv(output_directory + "combined_data.csv")

peak memory: 7230.48 MiB, increment: 7127.28 MiB
Wall time: 6min 11s


1. Processor: i7-10510U (4 cores, up to 4.90 GHz); RAM: 16 GB

![](../img/i7-10510_16GB-SP.png)

2. Processor: 2.3 GHz Quad-Core Intel Core i5; RAM: 8 GB

![](../img/i5_8GB.png)

3. Processor: i7-8700K (6 cores, up to 3.70 GHz); RAM: 16 GB

![](../img/i7-8700K_16GB_CZ.png)

## Task 5. Load the combined CSV to memory and perform a simple EDA

### 1. Investigate at least 2 approaches and perform a simple EDA

In [4]:
full_df.head()

NameError: name 'full_df' is not defined

In [8]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 62513863 entries, 1889-01-01 12:00:00 to 2014-12-31 12:00:00
Data columns (total 6 columns):
 #   Column         Dtype  
---  ------         -----  
 0   model          object 
 1   lat_min        float64
 2   lat_max        float64
 3   lon_min        float64
 4   lon_max        float64
 5   rain (mm/day)  float64
dtypes: float64(5), object(1)
memory usage: 3.3+ GB


In [9]:
full_df.dtypes

model             object
lat_min          float64
lat_max          float64
lon_min          float64
lon_max          float64
rain (mm/day)    float64
dtype: object

#### Method 1: Loading in Chunks

In [10]:
%%time
%%memit
import dask.dataframe as dd

### Code adapted from DSCI 525 Lecture ipynb notebook (Gittu George, 2021)
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("../data/combined_data.csv", chunksize=10_000_000):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int))

ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
AWI-ESM-1-1-LR       966420
BCC-CSM2-MR         3035340
BCC-ESM1             551880
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
CanESM5              551880
EC-Earth3-Veg-LR    3037320
FGOALS-f3-L         3219300
FGOALS-g3           1287720
GFDL-CM4            3219300
GFDL-ESM4           3219300
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
MIROC6              2070900
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-HR       5154240
MPI-ESM1-2-LR        966420
MRI-ESM2-0          3037320
NESM3                966420
NorESM2-LM           919800
NorESM2-MM          3541230
SAM0-UNICON         3541153
TaiESM1             3541230
observed              46020
dtype: int32
peak memory: 5698.80 MiB, increment: 2152.75 MiB
Wall time: 1min 1s


#### Method 2: Using Dask

In [24]:
%%time
%%memit

### Code adapted from DSCI 525 Lecture ipynb notebook (Gittu George, 2021)

dask_df = dd.read_csv("../data/combined_data.csv")
print(dask_df["model"].value_counts().compute())

MPI-ESM1-2-HR       5154240
TaiESM1             3541230
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
NorESM2-MM          3541230
SAM0-UNICON         3541153
FGOALS-f3-L         3219300
GFDL-CM4            3219300
GFDL-ESM4           3219300
EC-Earth3-Veg-LR    3037320
MRI-ESM2-0          3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-LR        966420
NESM3                966420
AWI-ESM-1-1-LR       966420
NorESM2-LM           919800
CanESM5              551880
BCC-ESM1             551880
observed              46020
Name: model, dtype: int64
peak memory: 9445.29 MiB, increment: 2332.27 MiB
Wall time: 32.5 s


#### Method 3: Loading just the columns we want

In [12]:
%%time
%%memit

# The only column we want is the model column
model_df = pd.read_csv("../data/combined_data.csv", usecols=["model"])
print(model_df["model"].value_counts())

MPI-ESM1-2-HR       5154240
NorESM2-MM          3541230
CMCC-ESM2           3541230
CMCC-CM2-HR4        3541230
TaiESM1             3541230
CMCC-CM2-SR5        3541230
SAM0-UNICON         3541153
GFDL-ESM4           3219300
FGOALS-f3-L         3219300
GFDL-CM4            3219300
MRI-ESM2-0          3037320
EC-Earth3-Veg-LR    3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
NESM3                966420
AWI-ESM-1-1-LR       966420
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-LR        966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
observed              46020
Name: model, dtype: int64
peak memory: 4728.71 MiB, increment: 957.17 MiB
Wall time: 36.7 s


### 2. Discussion of observations

- Loading just the column we want seems to have the shortest CPU times (user 30.9 s, sys: 2.3 s, total: 33.2 s) and wall time (33.9 s). 

- Loading the combined data using Dask has a shorter wall time (40.6 s) than loading in chunks. However, it has longer CPU times (user 1min 24s, sys: 18.6 s, total: 1min 43s) than loading in chunks (user 59.6 s, sys: 7.12 s, total: 1min 6s).

- Loading just the column we want has the minimum peak memory and increment used (1166.77 MiB and 780.00 MiB), whilst loading in chunks has the maximum peak memory and increment (1873.48 MiB, increment: 1458.30 MiB). 

- It is also worth noting that `full_df.info()` reported a memory usage of 3.3+ GB for the full dataframe. Thus, each of these loading methods saved us considerable memory.

- In conclusion, loading just the column we want gives us the optimum time and space savings. 
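
The chunked and single-column approaches compared above rest on the same idea: count within a bounded slice of the file and add the result to a running total. The sketch below illustrates that idea with the standard library only. It is not one of the methods measured in this notebook, and the function name `count_models_in_chunks` and the default `chunk_rows` value are assumptions made for illustration.

```python
import csv
from collections import Counter
from itertools import islice


def count_models_in_chunks(csv_path, column="model", chunk_rows=100_000):
    """Count the values of one column while holding at most chunk_rows rows in memory."""
    totals = Counter()
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            # Pull the next block of rows from the file
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            # Only the column of interest is kept from each row
            totals.update(row[column] for row in chunk)
    return totals
```

For example, `count_models_in_chunks("../data/combined_data.csv")` would return a `Counter` keyed by model name. Parsing rows in pure Python is probably slower than the pandas reader, so this is meant to show the accumulation pattern rather than to replace the methods above.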

## Task 6. Perform a simple EDA in R

### 1. Store data in different format

Here we write the data in two additional formats to compare the running time and occupied storage across formats. The formats considered in this section are:
- csv format
- feather format
- parquet format

In [19]:
dataset = ds.dataset("../data/combined_data.csv", format="csv")
table = dataset.to_table()

**Feather format**

In [20]:
%%time
feather.write_feather(table, "../data/example.feather")

Wall time: 2.23 s


**Parquet format**

In [21]:
%%time
pq.write_table(table, "../data/example.parquet")

Wall time: 9.97 s


**Check the size of data in all different formats**

In [22]:
%%sh
du -sh ../data/combined_data.csv
du -sh ../data/example.feather
du -sh ../data/example.parquet

5.7G	../data/combined_data.csv
1.1G	../data/example.feather
542M	../data/example.parquet
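
The `%%sh` cell above relies on `du`, which may not be available on every system. A minimal Python sketch of the same check is shown here; the helper names `human_size` and `report_sizes` are hypothetical. Note that `os.path.getsize` returns the apparent file size in bytes, whereas `du` reports disk usage, so the two can differ slightly.

```python
import os


def human_size(num_bytes):
    """Format a byte count with binary units, similar in spirit to `du -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


def report_sizes(paths):
    """Return {path: size in bytes} for every path that exists."""
    return {path: os.path.getsize(path) for path in paths if os.path.exists(path)}
```

A usage example would be `for path, size in report_sizes(["../data/combined_data.csv", "../data/example.feather", "../data/example.parquet"]).items(): print(human_size(size), path)`.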


**Discussion:**

### 2. Transfer the dataframe from python to R and perform EDA

Here we experiment with three exchange approaches for transferring the loaded dataset from Python to R and performing EDA. In the end, we will pick the most appropriate approach. The exchange approaches considered in this section are:
- Arrow exchange
- feather file exchange
- parquet file exchange

**Arrow exchange and EDA**

In [None]:
%%time
r_table = pyra.converter.py2rpy(table)

In [None]:
%%time
%%R
start_time <- Sys.time()
head_df <- head(r_table)
glimpse_df <- glimpse(r_table)
model_count <- r_table %>% collect() %>% count(model)
end_time <- Sys.time()
print(class(r_table))
print(head_df)
print(glimpse_df)
print(model_count)
print(end_time - start_time)

**Feather file exchange and EDA**

In [None]:
%%time
%%R
start_time <- Sys.time()
r_table <- arrow::read_feather("../data/example.feather")
head_df <- head(r_table)
glimpse_df <- glimpse(r_table)
model_count <- r_table %>% count(model)
end_time <- Sys.time()
print(class(r_table))
print(head_df)
print(glimpse_df)
print(model_count)
print(end_time - start_time)

**Parquet file exchange and EDA**

In [None]:
%%time
%%R
start_time <- Sys.time()
r_table <- arrow::read_parquet("../data/example.parquet")
head_df <- head(r_table)
glimpse_df <- glimpse(r_table)
model_count <- r_table %>% count(model)
end_time <- Sys.time()
print(class(r_table))
print(head_df)
print(glimpse_df)
print(model_count)
print(end_time - start_time)

**Discussion:**