# Milestone 1 Notebook

#### Authors: Julien Gordon, Adam Morphy, Mukund Iyer, Shiva Shankar Jena

## Questions 1. and 2.

#### Link to Team Contract: https://docs.google.com/document/d/1uDSQLGPSfcgl3PisaC1-ngaViqJCkBiWFmDsN2FzZ9w/edit?usp=sharing
#### Link to Repo: https://github.com/UBC-MDS/DSCI_525_Group26

In [3]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
import numpy as np

## 3. Data Download

In [None]:
%cd /Users/apple/MDS/block6/525/DSCI_525_Group26/notebooks

In [None]:
# Necessary metadata
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figsharerainfall/"

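# Query the figshare API for the article metadata
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early if the API returned an HTTP error
data = response.json()  # metadata for the article, including its file list
files = data["files"]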
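The download step relies on each entry of `files` carrying a `name` and a `download_url` key. As a minimal sketch under that assumption, the lookup could be wrapped in a helper that fails loudly when a requested file is not listed. The helper name `select_downloads` and the example metadata (including its placeholder URLs) are hypothetical and only illustrate the shape of the data.

```python
from typing import Dict, List


def select_downloads(metadata: dict, wanted: List[str]) -> Dict[str, str]:
    """Map each wanted file name to its download URL, failing if one is missing."""
    available = {f["name"]: f["download_url"] for f in metadata.get("files", [])}
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError(f"Not listed in the article metadata: {missing}")
    return {name: available[name] for name in wanted}


# Hypothetical metadata shaped like the API response (URLs are placeholders).
example_metadata = {
    "files": [
        {"name": "data.zip", "download_url": "https://example.org/files/1"},
        {"name": "readme.txt", "download_url": "https://example.org/files/2"},
    ]
}

selected = select_downloads(example_metadata, ["data.zip"])
print(selected)
```

Checking the names up front avoids a silent no-op when `files_to_dl` contains a name that the article does not provide.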

In [None]:
%%time
# Downloading file (the %%time cell magic must be the first line of the cell)
files_to_dl = ["data.zip"]  
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

> Data Download Comparison

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 1m 16s |

In [None]:
# Extracting files from zip
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)
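Before extracting a large archive it can help to check its integrity and see which CSV files it contains. The following is a sketch using only the standard library. The helper name `list_csv_members` and the member names in the in-memory archive are hypothetical, chosen only to exercise the code.

```python
import io
import zipfile


def list_csv_members(zip_source) -> list:
    """Return the sorted names of CSV members after a CRC check of the archive."""
    with zipfile.ZipFile(zip_source, "r") as archive:
        bad_member = archive.testzip()  # None when every member passes its CRC check
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member: {bad_member}")
        return sorted(name for name in archive.namelist() if name.endswith(".csv"))


# Build a tiny in-memory archive with invented member names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("MODEL-A_daily_rainfall_NSW.csv", "time,rain (mm/day)\n")
    archive.writestr("notes.txt", "not a csv")

csv_members = list_csv_members(buffer)
print(csv_members)
```

The same helper accepts a path such as `os.path.join(output_directory, "data.zip")` in place of the buffer.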

## 4. Combining data CSVs

In [None]:
%%time

use_cols = ["rain", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)", "model"]
files = glob.glob('figsharerainfall/*.csv')
excluded_files = ["figsharerainfall/observed_daily_rainfall_SYD.csv"]
df = pd.concat(
    (
        pd.read_csv(file, index_col=0)
        .assign(model=re.findall(r'\/(.*?)_', file)[0])
        for file in files
        if file not in excluded_files
        
    )
)
df.to_csv("figsharerainfall/combined_data.csv")
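The regular expression used to derive `model` assumes a single `/` separator before the file name, so it may fail on Windows, where `glob` can return backslash-separated paths, or pick up directory names when the data sits in a nested folder. A hedged alternative is to take the file name first and split on the first underscore. The helper name `model_from_path` and the example file names are hypothetical.

```python
from pathlib import PureWindowsPath


def model_from_path(path: str) -> str:
    """Return the part of the file name that precedes the first underscore."""
    # PureWindowsPath treats both "/" and "\\" as separators, so either style works.
    file_name = PureWindowsPath(path).name
    return file_name.split("_", 1)[0]


posix_style = model_from_path("figsharerainfall/MODEL-A_daily_rainfall_NSW.csv")
windows_style = model_from_path("figsharerainfall\\MODEL-A_daily_rainfall_NSW.csv")
nested = model_from_path("some_dir/figsharerainfall/MODEL-B_daily_rainfall_NSW.csv")
print(posix_style, windows_style, nested)
```

In the concatenation above, this would replace `re.findall(r'\/(.*?)_', file)[0]` with `model_from_path(file)`.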

> Combining Data Comparison

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 12m 58s |

## 5. Load the combined CSV to memory and perform a simple EDA

### 5.1 Investigating 2 approaches to reduce memory usage while performing the EDA

#### Loading the whole data and performing EDA

In [None]:
%%time
# Loading data (Pandas)

df_combined = pd.read_csv(
    "figsharerainfall/combined_data.csv", 
    index_col=0,
    parse_dates=True 
)

> Loading Data Comparison

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 3m 53s |

In [None]:
%%time
# Simple EDA (Pandas)

df_combined.model.value_counts()

> Performing a simple EDA

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 5.99s |

#### 5.1.1 Approach 1 to reduce memory usage: Changing dtype

In [None]:
df_combined.dtypes

In [None]:
print(f"Memory usage with float64: {df_combined[['lat_min', 'lat_max','lon_min', 'lon_max', 'rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df_combined[['lat_min', 'lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

In [None]:
%%time
df_combined_float32 = df_combined[['lat_min', 'lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore')
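The saving from this approach follows from the storage cost per value: a 64-bit float takes 8 bytes and a 32-bit float takes 4, so the numeric columns shrink by about half, at the price of some precision. The sketch below illustrates both effects with the standard-library `array` module rather than pandas. The value count `n_values` is an arbitrary illustrative number, not the size of the rainfall data.

```python
from array import array

n_values = 1_000_000  # arbitrary illustrative count

bytes_float64 = array("d").itemsize * n_values  # "d" is a C double
bytes_float32 = array("f").itemsize * n_values  # "f" is a C float

print(f"float64: {bytes_float64 / 1e6:.1f} MB, float32: {bytes_float32 / 1e6:.1f} MB")

# Storing 0.1 as a 32-bit float and reading it back shows the rounding error.
round_trip = array("f", [0.1])[0]
precision_loss = abs(round_trip - 0.1)
print(f"0.1 stored as float32 reads back as {round_trip!r}")
```

For daily rainfall in mm and coordinates in degrees, an error of this magnitude is likely acceptable, but that is a judgement about the analysis rather than something the code can verify.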

#### 5.1.2 Approach 2 to reduce memory usage: loading in chunks

In [None]:
%%time

# Doing EDA with only chunks of data
counts = pd.Series(dtype=int)
for chunk in pd.read_csv(
    "figsharerainfall/combined_data.csv",
    parse_dates=True,
    chunksize=1_000_000
):
    counts = counts.add(chunk.model.value_counts(), fill_value=0)

print(counts.astype(int))
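Because this EDA only needs the `model` column, the two ideas can be combined: read in chunks and load only that column with `usecols`, so each chunk holds a single column. The sketch below runs on a tiny in-memory CSV whose rows are invented for illustration. With the real file, the path to `combined_data.csv` would replace the `StringIO` object and a much larger `chunksize` would be used.

```python
import io
from collections import Counter

import pandas as pd

# Small stand-in for combined_data.csv; the rows are invented.
csv_text = (
    "time,lat_min,rain (mm/day),model\n"
    "2000-01-01,-36.0,0.1,MODEL-A\n"
    "2000-01-02,-36.0,0.0,MODEL-A\n"
    "2000-01-01,-35.0,1.2,MODEL-B\n"
    "2000-01-02,-35.0,0.4,MODEL-A\n"
    "2000-01-03,-35.0,0.0,MODEL-B\n"
)

counts = Counter()
# usecols restricts parsing to the column being counted.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["model"], chunksize=2):
    counts.update(chunk["model"].value_counts().to_dict())

model_counts = {name: int(n) for name, n in counts.items()}
print(model_counts)
```

Accumulating into a `Counter` keeps the totals as integers, which avoids the float conversion that `Series.add(..., fill_value=0)` introduces.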

> Loading data and performing a simple EDA with reduced memory usage (minimum out of 2 approaches)

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 1m 45s |

## 6. Perform a simple EDA in R

### 6.1 Approaches to transfer the dataframe from python to R

We tried different approaches for data transfer to compare time taken.

#### 6.1.1 Parquet file

In [None]:
# Using pandas
df_combined.to_parquet("figsharerainfall/combined_data_partition.parquet")

In [None]:
%load_ext rpy2.ipython

In [None]:
import rpy2_arrow.pyarrow_rarrow as pyra

In [None]:
%%R
suppressMessages(library(arrow, warn.conflicts = FALSE))
suppressMessages(library(dplyr, warn.conflicts = FALSE))

In [None]:
%%time
%%R
ds_rainfall <- open_dataset("figsharerainfall/combined_data_partition.parquet")

In [None]:
%%time
%%R
query <- ds_rainfall %>%
    select(model) %>%
    group_by(model) %>%
    summarise(
        count = n()
    )

In [None]:
%%time
%%R
print(query %>% collect())

> Comparison of Loading data and EDA time in R using parquet file

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken | Method |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|:----------:|
| Adam Morphy | MacOS Big Sur | 8GB | 1.8 GHz Dual-Core Intel Core i5 | Yes | | |
| Mukund Iyer | MacOS Monterey | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | | |
| Julien Gordon | Ubuntu 20.04.4 LTS | 16GB | AMD® Ryzen 7 5800h with radeon graphics | Yes | | |
| Shiva Shankar Jena | MacOS Catalina 10.15.7 | 4GB | 1.4 GHz Dual-Core Intel Core i5 | Yes | 10s | Parquet file |

### 6.2 Reasons for choosing the approaches

1. Parquet file: The main advantages of the Parquet approach, beyond being a file format that can be read from multiple languages, were a significantly smaller footprint (539.6 MB compared to the 8 GB combined CSV file) and faster queries. These gains come from the efficient compression and encoding used by Parquet and Arrow, and from the lazy evaluation of Arrow dataset queries in R, where nothing is computed until `collect()` is called. In our runs the method proved very efficient.
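To put the reported sizes in proportion, the short calculation below converts them to a ratio. It assumes decimal units (1 GB = 1000 MB) and treats the "8 GB" figure as exact, so the result is only approximate.

```python
parquet_mb = 539.6      # Parquet size reported above
csv_gb = 8              # combined CSV size reported above
csv_mb = csv_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB

ratio = csv_mb / parquet_mb
fraction = parquet_mb / csv_mb
print(f"Parquet is about {ratio:.1f}x smaller ({fraction:.1%} of the CSV size)")
# prints: Parquet is about 14.8x smaller (6.7% of the CSV size)
```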