# Milestone 1 Notebook

#### Authors: Julien Gordon, Adam Morphy, Mukund Iyer, Shiva Shankar Jena

## Questions 1. and 2.

#### Link to Team Contract: https://docs.google.com/document/d/1uDSQLGPSfcgl3PisaC1-ngaViqJCkBiWFmDsN2FzZ9w/edit?usp=sharing
#### Link to Repo: https://github.com/UBC-MDS/DSCI_525_Group26

In [13]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
import numpy as np

## 3. Data Download

In [14]:
#%cd /Users/apple/MDS/block6/525/DSCI_525_Group26/notebooks

In [15]:
# Necessary metadata
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figsharerainfall/"

# Query the figshare API for the article metadata
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all of the article's metadata
files = data["files"]

In [18]:
%%time
files_to_dl = ["data.zip"]  
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: user 3.19 s, sys: 6.97 s, total: 10.2 s
Wall time: 2min 51s


> Data Download Comparison

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | 1m 6s|
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | 4min 18s |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | 2min 51s |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 1m 16s |

In [19]:
# Extracting files from zip
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

## 4. Combining data CSVs

In [20]:
%%time

#use_cols = ["rain", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)", "model"]
files = glob.glob('figsharerainfall/*.csv')
excluded_files = ["figsharerainfall/observed_daily_rainfall_SYD.csv"]
df = pd.concat(
    (
        pd.read_csv(file, index_col=0)
        .assign(model=re.findall(r'\/(.*?)_', file)[0])
        for file in files
        if file not in excluded_files
        
    )
)
df.to_csv("figsharerainfall/combined_data.csv")

CPU times: user 6min 35s, sys: 26.4 s, total: 7min 2s
Wall time: 7min 23s


> Combining Data Comparison

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | ~10m (DNF)|
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | 7min 23s |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | N/A (kernel crash) |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 12m 58s |

**Note:** Adam's machine could only run the analysis on a partially combined CSV, produced by stopping the combining process partway through. This let him finish the analysis, but his timings are misleading because they were not measured on the full CSV.

Julien was unable to run the notebook because the kernel kept crashing. Since his computer has 16 GB of RAM, this is rather surprising. We suspect it may be because Apache Arrow is not optimised for his Ubuntu Linux distribution, but this is mostly speculation: the installation documentation lists Linux as supported (https://anaconda.org/conda-forge/arrow), so it is unclear why the issue occurs.

Overall, Mukund's and Shiva's runtimes were consistent with what we would expect intuitively. Mukund's machine, with double the working memory, completed the combining operation in a little more than half the time (7min 23s versus 12m 58s). This suggests a roughly inversely proportional relationship between RAM and time taken for this kind of task, although two machines are too few to establish it. The comparison is useful because the two machines are similar: both have 1.4 GHz Intel Core i5 processors, although Mukund's is quad-core and Shiva's is dual-core, so core count may also contribute. Interestingly, Adam's machine, with a theoretically more powerful processor than Shiva's, was unable to complete the task, and we were unable to determine the cause.
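Since the combining step is where two machines failed, one alternative worth sketching is to append each CSV to the output file as soon as it is read, so that only one model's data is held in memory at a time. This is not the approach used in the cell above. The helper name `combine_csvs`, the demo file names, and the demo values are hypothetical; the model name is taken as the text before the first underscore of the file name, mirroring the regular expression used above.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(input_dir, output_path, excluded=()):
    """Append each CSV in input_dir to output_path, one file at a time.

    Hypothetical helper: only one input file is held in memory at once.
    """
    out_abs = os.path.abspath(output_path)
    first = True
    for path in sorted(glob.glob(os.path.join(input_dir, "*.csv"))):
        name = os.path.basename(path)
        # Skip excluded files and a combined file left over from a previous run.
        if name in excluded or os.path.abspath(path) == out_abs:
            continue
        model = name.split("_")[0]  # text before the first underscore
        df = pd.read_csv(path, index_col=0).assign(model=model)
        # Write the header once, then append the remaining files without it.
        df.to_csv(output_path, mode="w" if first else "a", header=first)
        first = False


# Tiny synthetic demo (made-up values, not the rainfall data).
with tempfile.TemporaryDirectory() as tmp:
    for model, rain in [("ModelA", [0.1, 0.2]), ("ModelB", [0.3])]:
        pd.DataFrame(
            {"rain (mm/day)": rain},
            index=pd.Index(range(len(rain)), name="time"),
        ).to_csv(os.path.join(tmp, f"{model}_daily_rainfall_NSW.csv"))
    combined_path = os.path.join(tmp, "combined_data.csv")
    combine_csvs(tmp, combined_path)
    combined = pd.read_csv(combined_path, index_col=0)

print(combined["model"].value_counts().to_dict())
```

The trade-off is more, smaller writes in exchange for a much lower peak memory footprint; whether this would have avoided the kernel crash reported above is untested.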

## 5. Load the combined CSV to memory and perform a simple EDA

### 5.1 Investigating 2 approaches to reduce memory usage while performing the EDA

#### Loading the whole data and performing EDA

In [21]:
%%time
# Loading data (Pandas)
df_combined = pd.read_csv(
    "figsharerainfall/combined_data.csv", 
    index_col=0,
    parse_dates=True 
)

CPU times: user 1min 5s, sys: 20.7 s, total: 1min 26s
Wall time: 1min 32s


> Loading Data Comparison

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | 1m 16s (limited data) |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | 1min 32s |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | N/A (kernel crash) |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 3m 53s|

Once again, Mukund's and Shiva's runtimes were consistent with what we would expect intuitively. Mukund's machine, with double the working memory, completed the data loading task in less than half the time (1min 32s versus 3m 53s). This again suggests a roughly inversely proportional relationship between RAM and time taken for this kind of task. The comparison is useful because the two machines are similar: both have 1.4 GHz Intel Core i5 processors, although Mukund's is quad-core and Shiva's is dual-core. Note once again that Adam worked with a subset of the data, so his runtime is not a fair comparison across machines.

In [23]:
%%time

df_combined.model.value_counts()

CPU times: user 3.33 s, sys: 116 ms, total: 3.44 s
Wall time: 3.51 s


MPI-ESM1-2-HR       5154240
CMCC-CM2-HR4        3541230
CMCC-ESM2           3541230
CMCC-CM2-SR5        3541230
NorESM2-MM          3541230
TaiESM1             3541230
SAM0-UNICON         3541153
GFDL-ESM4           3219300
FGOALS-f3-L         3219300
GFDL-CM4            3219300
MRI-ESM2-0          3037320
EC-Earth3-Veg-LR    3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM4-8           1609650
INM-CM5-0           1609650
FGOALS-g3           1287720
KIOST-ESM           1287720
AWI-ESM-1-1-LR       966420
MPI-ESM1-2-LR        966420
NESM3                966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64

> Performing a simple EDA

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | 1.06s (limited data)  |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | 3.51 s |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | N/A (kernel crash) |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 5.99s |

Once more, Mukund's and Shiva's runtimes were consistent with previous findings. Mukund's machine, with double the working memory, completed the EDA in about 60% of the time (3.51 s versus 5.99 s). This again suggests that more RAM shortens this kind of task, although two machines are too few to establish the form of the relationship. Because this task is less onerous, the speed differential is less pronounced than in the loading step, which suggests that smaller tasks benefit less from additional RAM than larger ones. Note once again that Adam worked with a subset of the data, so his runtime is not a fair comparison across machines.
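To make the comparison between the 4 GB and 8 GB machines concrete, the short calculation below divides Shiva's wall time by Mukund's for the three tasks timed so far. All numbers are copied from the comparison tables above; the helper `to_seconds` and the task labels are only illustrative names.

```python
def to_seconds(minutes=0, seconds=0.0):
    """Convert a (minutes, seconds) reading into seconds."""
    return 60 * minutes + seconds


# (Shiva, 4 GB RAM; Mukund, 8 GB RAM) wall times from the tables above.
timings = {
    "combine CSVs": (to_seconds(12, 58), to_seconds(7, 23)),
    "load combined CSV": (to_seconds(3, 53), to_seconds(1, 32)),
    "value_counts EDA": (to_seconds(0, 5.99), to_seconds(0, 3.51)),
}

# Ratio > 1 means the 8 GB machine was faster by that factor.
speedups = {task: shiva / mukund for task, (shiva, mukund) in timings.items()}

for task, ratio in speedups.items():
    print(f"{task}: {ratio:.2f}x faster on the 8 GB machine")
```

The ratios fall between roughly 1.7 and 2.5, so "about half the time" is a fair summary, but the factor is not constant across tasks.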

### 5.1.1 Approach 1 to reduce memory usage: Changing dtype

In [24]:
df_combined.dtypes

lat_min          float64
lat_max          float64
lon_min          float64
lon_max          float64
rain (mm/day)    float64
model             object
dtype: object

In [25]:
print(f"Memory usage with float64: {df_combined[['lat_min', 'lat_max','lon_min', 'lon_max', 'rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df_combined[['lat_min', 'lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with float64: 2998.46 MB
Memory usage with float32: 1749.10 MB


In [26]:
%%time
df_combined_float32 = df_combined[['lat_min', 'lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore')

CPU times: user 1.67 s, sys: 3.85 s, total: 5.52 s
Wall time: 7.71 s
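The float32 figure above is about 58% of the float64 figure rather than 50%. A likely explanation, assuming the parsed date index costs 8 bytes per row, is that `memory_usage()` also counts the index, which does not shrink: each row then costs 5 × 8 + 8 = 48 bytes before the cast and 5 × 4 + 8 = 28 bytes after it. The sketch below reproduces this on a small synthetic frame (random values, not the rainfall data).

```python
import numpy as np
import pandas as pd

n_rows = 1_000
cols = ["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
rng = np.random.default_rng(0)

# Synthetic stand-in for df_combined: five float64 columns and a datetime index.
demo = pd.DataFrame(
    rng.random((n_rows, len(cols))),
    columns=cols,
    index=pd.date_range("2000-01-01", periods=n_rows, freq="D"),
)
demo32 = demo.astype("float32")

# Column data alone halves exactly...
cols64 = demo.memory_usage(index=False).sum()
cols32 = demo32.memory_usage(index=False).sum()

# ...but the total shrinks less, because the index is unchanged.
total64 = demo.memory_usage(index=True).sum()
total32 = demo32.memory_usage(index=True).sum()

print(f"columns only: {cols32 / cols64:.3f}")
print(f"with index:   {total32 / total64:.3f}")
print(f"notebook:     {1749.10 / 2998.46:.3f}")
```

Casting to float32 also reduces precision to about seven significant digits, which is usually acceptable for coordinates and daily rainfall but is worth checking before relying on it.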


### 5.1.2 Approach 2 to reduce memory usage: loading in chunks

In [27]:
%%time

# Doing EDA with only chunks of data
counts=pd.Series(dtype=int)
for chunk in pd.read_csv(
    "figsharerainfall/combined_data.csv",
    parse_dates=True,
    chunksize=1_000_000
):
    counts=counts.add(chunk.model.value_counts(), fill_value=0)

print(counts.astype(int))

ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
AWI-ESM-1-1-LR       966420
BCC-CSM2-MR         3035340
BCC-ESM1             551880
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
CanESM5              551880
EC-Earth3-Veg-LR    3037320
FGOALS-f3-L         3219300
FGOALS-g3           1287720
GFDL-CM4            3219300
GFDL-ESM4           3219300
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
MIROC6              2070900
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-HR       5154240
MPI-ESM1-2-LR        966420
MRI-ESM2-0          3037320
NESM3                966420
NorESM2-LM           919800
NorESM2-MM          3541230
SAM0-UNICON         3541153
TaiESM1             3541230
dtype: int64
CPU times: user 1min, sys: 8.12 s, total: 1min 8s
Wall time: 1min 9s


> Loading data and performing a simple EDA with reduced memory usage (minimum out of 2 approaches)

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | 27s (limited data)  |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | 1min 9s |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | N/A (kernel crash) |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 1m 45s |

Our chunking strategy did reduce the time taken. The reduction was modest on Mukund's machine (1min 9s versus roughly 1min 36s to load the full CSV and then count) and larger on Shiva's (1m 45s versus roughly 3m 59s). We gather from this experience that these strategies in combination can perhaps make an otherwise impossible task possible on a given machine, but there is a size of data beyond which they may not reduce the time required enough to be feasible. Once more, the comparisons across machines are consistent with our previous observations.
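The two approaches can be combined with column selection. The sketch below is a variation on the chunked count above, not the cell that produced the timings: it passes `usecols=["model"]` so that pandas only builds the one column the count needs, and it accumulates with `collections.Counter` instead of `Series.add`. The in-memory CSV, its `time` column name, and its values are made up for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Small made-up CSV standing in for combined_data.csv.
models = ["A", "B", "A", "A", "B", "C", "A"]
csv_text = "time,lat_min,rain (mm/day),model\n" + "".join(
    f"2000-01-{day:02d},-36.0,{day * 0.1:.1f},{model}\n"
    for day, model in enumerate(models, start=1)
)

counter = Counter()
for chunk in pd.read_csv(
    io.StringIO(csv_text),
    usecols=["model"],  # skip the columns the count does not need
    chunksize=3,        # tiny chunks for the demo; the notebook uses 1_000_000
):
    # Add this chunk's per-model counts to the running totals.
    counter.update(chunk["model"].value_counts().to_dict())

counts = pd.Series(dict(counter)).sort_index()
print(counts)
```

On the real file the path to `combined_data.csv` would replace the `io.StringIO` object; how much time `usecols` saves there has not been measured.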

## 6. Perform a simple EDA in R

### 6.1 Approaches to transfer the dataframe from python to R

We tried different approaches for data transfer to compare time taken.

#### 6.1.1 Parquet file

In [28]:
%%time
# Using pandas

df_combined.to_parquet("figsharerainfall/combined_data_partition.parquet")

In [29]:
%load_ext rpy2.ipython

In [30]:
import rpy2_arrow.pyarrow_rarrow as pyra

In [31]:
%%R
suppressMessages(library(arrow, warn.conflicts = FALSE))
suppressMessages(library(dplyr, warn.conflicts = FALSE))

In [32]:
%%time
%%R
ds_rainfall <- open_dataset("figsharerainfall/combined_data_partition.parquet")

CPU times: user 9.21 ms, sys: 13.2 ms, total: 22.4 ms
Wall time: 32.8 ms


In [33]:
%%time
%%R
query <- ds_rainfall %>%
    select(model) %>%
    group_by(model) %>%
    summarise(
        count = n()
    )

CPU times: user 27.7 ms, sys: 5.66 ms, total: 33.3 ms
Wall time: 36.6 ms


In [34]:
%%time
%%R
print(query %>% collect())

# A tibble: 27 × 2
   model              count
   <chr>              <int>
 1 MPI-ESM-1-2-HAM   966420
 2 AWI-ESM-1-1-LR    966420
 3 MRI-ESM2-0       3037320
 4 GFDL-CM4         3219300
 5 EC-Earth3-Veg-LR 3037320
 6 INM-CM4-8        1609650
 7 TaiESM1          3541230
 8 CMCC-CM2-SR5     3541230
 9 KIOST-ESM        1287720
10 GFDL-ESM4        3219300
# … with 17 more rows
CPU times: user 3.95 s, sys: 803 ms, total: 4.75 s
Wall time: 2.54 s


> Comparison of Loading data and EDA time in R using parquet file

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken | Method |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | 1.74s (limited data) | Parquet file |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | 2.54s | Parquet file |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | N/A (kernel crash) | N/A (kernel crash) |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 10s | Parquet file |

We see that the Parquet file resulted in a very low time requirement, although this may be partly because the task itself was not demanding. Overall, this shows the advantage of Parquet files in reducing computation time. The reasons for this advantage are discussed below.

### 6.2 Reasons for choosing the approaches

1. Parquet file: Apart from being a file format that can be read from multiple languages, the primary advantages of the Parquet approach were a significantly smaller footprint (539.6 MB compared to the 8 GB combined CSV file) and speed, which comes from Parquet's efficient compression and encoding techniques together with the lazy evaluation that Arrow datasets offer in R. The method proved immensely efficient. We also loaded rpy2-arrow, whose implementation uses the same underlying C/C++ pointer for R and Python. Its developers highlight a large gain in performance compared to the typical ways of sharing arrays or data frames between Python and R through the conversion rules included in rpy2 (https://rpy2.github.io/rpy2-arrow/version/main/html/).

### Overall Difficulties Discussion

Overall, we found that working with such large data presented unique difficulties that were hard to overcome. The time taken to complete basic components of the exercise was frustrating because of the lag between attempting a solution and seeing the outcome. Moreover, in Adam's and Julien's cases we could not complete the exercise as intended. In Adam's case, reducing the amount of data loaded into working memory solved the problem. In Julien's case, the Python kernel kept crashing, and we tried different remedies such as using a reduced data size and making sure the environment was set up correctly. With such opaque error messages it was difficult to diagnose the problem, and ultimately we used the other group members' machines for the analysis. Another simple difficulty is that it is often useful to open a file to look at its data structure, column names, and other information. With such large datasets, opening the file is prohibitive, so we had to use other methods, such as looking at a subset or programmatically extracting the information we were looking for. We look forward to working on the cloud to reduce the difficulties we experienced in this milestone.
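As an illustration of "looking at a subset", `pd.read_csv` accepts an `nrows` argument that stops parsing after the requested number of rows, which is enough to inspect column names and inferred dtypes without loading the whole file. The sketch below uses a tiny in-memory CSV; the column names follow the `dtypes` output earlier in the notebook, while the `time` column name and all row values are made up.

```python
import io

import pandas as pd

# Made-up rows standing in for the first lines of combined_data.csv.
csv_text = (
    "time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model\n"
    "2000-01-01,-36.0,-35.0,140.0,142.0,0.0,DemoModel\n"
    "2000-01-02,-36.0,-35.0,140.0,142.0,1.5,DemoModel\n"
    "2000-01-03,-36.0,-35.0,140.0,142.0,0.2,DemoModel\n"
)

# Parse only the first two data rows.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(preview.columns.tolist())
print(preview.dtypes)
```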