# DSCI 525 - Milestone 1

### Group 10: Shaun Hutchinson, Morris Zhao, Yurui Feng

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

## 3. Downloading the data

In [2]:
# Necessary metadata
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figsharerainfall/"

#### List of files in the dataset

In [3]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all of the article's metadata, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'id': 26579150,
  'name': 'daily_rainfall_2014.png',
  'size': 58863,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e'},
 {'id': 26579171,
  'name': 'environment.yml',
  'size': 192,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34'},
 {'id': 26586554,
  'name': 'README.md',
  'size': 5422,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c'},
 {'id': 26766812,
  'name': 'data.zip',
  'size': 814041183,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26766812',
  'supplied_md5': 'b517383f76e77bd03755a63a8f

#### Save the `.zip` files locally
The files will be downloaded to a folder called `figsharerainfall/` in the same directory as this notebook.

In [4]:
%%time
files_to_dl = ["data.zip"]  # feel free to add other files here
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: user 2.32 s, sys: 4.2 s, total: 6.52 s
Wall time: 1min 43s
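
Since the figshare metadata above reports a `supplied_md5` for every file, the download can be checked against it. The following is a minimal sketch of such a check, not part of the original workflow; the helper names `md5_of_file` and `verify_download` are hypothetical and only use the standard library.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns empty bytes
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's MD5 with the `supplied_md5` value from the figshare metadata."""
    return md5_of_file(path) == expected_md5.lower()
```

Inside the download loop, this could be called as `verify_download(output_directory + file["name"], file["supplied_md5"])` right after `urlretrieve`.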


#### Unzip/extract the `.csv` files 

In [5]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

CPU times: user 7.69 s, sys: 941 ms, total: 8.63 s
Wall time: 8.95 s


## 4. Combining data CSVs
Combine the data CSVs into a single CSV using pandas.

In [8]:
%%time
# Merge the per-model CSVs with plain pandas (pandas is already imported above);
# the regex takes the part of each file name before the first underscore as the model name
files = glob.glob('figsharerainfall/*.csv')
df = pd.concat((pd.read_csv(file, index_col=0)
                .assign(model=re.findall("/([^_]*)", file)[0])
                for file in files)
              )
df.to_csv("figsharerainfall/combined_data.csv")

CPU times: user 3min 39s, sys: 11 s, total: 3min 50s
Wall time: 3min 51s
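
One caveat with the cell above: the pattern `figsharerainfall/*.csv` also matches `combined_data.csv` once it exists, so re-running the cell merges the previous output back in, and the regex labels those rows `combined`. This may explain the `combined` entry in the model counts in section 5. Below is a minimal sketch of a re-run-safe variant; the function name `combine_csvs` and its parameters are hypothetical, and it assumes the same file naming scheme (model name before the first underscore).

```python
import glob
import os
import re

import pandas as pd


def combine_csvs(directory, output_name="combined_data.csv"):
    """Concatenate every model CSV in `directory`, tagging rows with the model name."""
    output_path = os.path.join(directory, output_name)
    # Skip the output file itself so a re-run does not merge it back in.
    paths = sorted(
        p for p in glob.glob(os.path.join(directory, "*.csv"))
        if os.path.basename(p) != output_name
    )
    frames = (
        pd.read_csv(p, index_col=0)
        # Model name = part of the file name before the first underscore.
        .assign(model=re.match(r"[^_]*", os.path.basename(p)).group(0))
        for p in paths
    )
    combined = pd.concat(frames)
    combined.to_csv(output_path)
    return combined
```

Matching on `os.path.basename` rather than on the full path also avoids depending on the `/` path separator.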


#### Compare run times across team members' laptops

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Yurui Feng | macOS | 16GB | Apple M1 Pro | Yes | 3m 33s |
| Morris Zhao | macOS | 8GB | Apple M2 | Yes | 3m 14s |
| Shaun Hutchinson | macOS | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | 7m 49s |

#### Discussion:
The run times on the Apple M1 Pro and the Apple M2 were comparable, differing by 19 seconds. Although the M1 Pro machine has more RAM (16GB versus 8GB), the M2 machine was slightly faster, so it seems the newer processor more than makes up for the smaller amount of RAM. The Mac with the 1.4 GHz Quad-Core Intel Core i5 was much slower, taking more than twice as long. Apple's M1 and M2 chips are designed to perform more calculations at lower power consumption, so it is not surprising that the Intel chip is slower. We believe the better performance of the Apple processors can be attributed to two main factors: first, they use an ARM-based architecture, which is simpler than Intel's x86; second, they are built on a 5nm process rather than the i5's 14nm process, which gives denser and more efficient transistors.

## 5. Load the combined CSV to memory and perform a simple EDA

##### In this section, we investigate two approaches to reduce memory usage while counting the number of rows per `model` in `combined_data.csv`.


##### 1. Read all the columns and change the data type of the `rain (mm/day)` and `lat_min` columns to `float16`.

In [9]:
%%time
df = pd.read_csv("figsharerainfall/combined_data.csv", dtype={'rain (mm/day)': 'float16', 'lat_min': 'float16'})
print(df["model"].value_counts())

combined            7833020
MPI-ESM1-2-HR       5154240
CMCC-ESM2           3541230
NorESM2-MM          3541230
TaiESM1             3541230
CMCC-CM2-SR5        3541230
CMCC-CM2-HR4        3541230
SAM0-UNICON         3541153
GFDL-CM4            3219300
FGOALS-f3-L         3219300
GFDL-ESM4           3219300
MRI-ESM2-0          3037320
EC-Earth3-Veg-LR    3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
AWI-ESM-1-1-LR       966420
MPI-ESM1-2-LR        966420
NESM3                966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
observed              46020
Name: model, dtype: int64
CPU times: user 34.1 s, sys: 5.1 s, total: 39.2 s
Wall time: 41.1 s


#### 2. Load only `rain (mm/day)` and `model` when reading the `.csv` file

In [10]:
%%time
use_cols = ['rain (mm/day)', 'model']
df = pd.read_csv("figsharerainfall/combined_data.csv", usecols=use_cols)
print(df["model"].value_counts())

combined            7833020
MPI-ESM1-2-HR       5154240
CMCC-ESM2           3541230
NorESM2-MM          3541230
TaiESM1             3541230
CMCC-CM2-SR5        3541230
CMCC-CM2-HR4        3541230
SAM0-UNICON         3541153
GFDL-CM4            3219300
FGOALS-f3-L         3219300
GFDL-ESM4           3219300
MRI-ESM2-0          3037320
EC-Earth3-Veg-LR    3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
AWI-ESM-1-1-LR       966420
MPI-ESM1-2-LR        966420
NESM3                966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
observed              46020
Name: model, dtype: int64
CPU times: user 20.5 s, sys: 1.4 s, total: 21.9 s
Wall time: 22 s


| Team Member | Operating System | RAM | Processor | Is SSD | Time taken (change dtype) | Time taken (select columns) |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|:----------:|
| Yurui Feng | macOS | 16GB | Apple M1 Pro | Yes | 39.5s | 21.4s |
| Morris Zhao | macOS | 8GB | Apple M2 | Yes | 37.1s | 19.2s |
| Shaun Hutchinson | macOS | 8GB | 1.4 GHz Quad-Core Intel Core i5 | Yes | 1m 43s | 49.4s |

Again, the run times on the Apple M1 Pro and the Apple M2 were comparable, with a difference of around 2 seconds for both methods. On every machine, loading only the columns we need took roughly half the time of reading all the columns and downcasting two of them to `float16`. The reasons for these similarities and differences are likely the same as in the previous part: the M2 machine has less RAM but a stronger processor, and both Apple chips are stronger than the Intel chip.
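
The timings above do not show the memory footprint directly. As a rough illustration, the sketch below compares the two approaches on a small synthetic frame using `DataFrame.memory_usage(deep=True)`. The data is made up (only the column names `lat_min`, `rain (mm/day)` and `model` follow the notebook), and the helper `memory_mb` is hypothetical, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for combined_data.csv; values are random, for illustration only.
rng = np.random.default_rng(0)
n = 100_000
demo = pd.DataFrame({
    "lat_min": rng.uniform(-36.0, -30.0, n),
    "rain (mm/day)": rng.exponential(2.0, n),
    "model": rng.choice(["observed", "NESM3", "MIROC6"], n),
})


def memory_mb(df):
    """Total memory of a DataFrame in megabytes, including string contents."""
    return df.memory_usage(deep=True).sum() / 1e6


baseline = memory_mb(demo)
# Approach 1: keep every column, downcast two float columns to float16.
downcast = memory_mb(demo.astype({"rain (mm/day)": "float16", "lat_min": "float16"}))
# Approach 2: keep only the columns needed for the count.
selected = memory_mb(demo[["rain (mm/day)", "model"]])

print(f"baseline: {baseline:.2f} MB, float16: {downcast:.2f} MB, two columns: {selected:.2f} MB")
```

Note that `float16` keeps only about three significant decimal digits, so the first approach trades precision for memory, while the second keeps full precision for the columns it loads.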

## 6. Perform a simple EDA in R

### Transferring the dataframe from Python to R using `Arrow exchange`:

Here we have opted for the `Arrow exchange` method to transfer the dataframe from Python to R. We did not use `Pandas exchange` because it requires a serialization/deserialization step that can be quite time consuming, as we saw in lecture 1. While `Arrow exchange` still involves some serialization/deserialization, it spends much less time on it than `Pandas exchange` does.

We also did not use a `Parquet file`, because the goal here is to transfer the entire dataframe from Python to R rather than to store it in a compressed form. `Parquet` would be the more suitable choice if we were storing and reusing this dataframe, but for a transfer between programming languages `Arrow exchange` seems like the better method.

In [11]:
%load_ext rpy2.ipython

In [12]:
filepathcsv = "figsharerainfall/combined_data.csv"

In [13]:
# !pip install rpy2_arrow
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import csv
import rpy2_arrow.pyarrow_rarrow as pyra

In [14]:
%%time
dataset = ds.dataset(filepathcsv, format="csv")
# Converting the `pyarrow dataset` to a `pyarrow table`
table = dataset.to_table()
# Converting a `pyarrow table` to a `rarrow table`
r_table = pyra.converter.py2rpy(table)

CPU times: user 13.6 s, sys: 1.36 s, total: 15 s
Wall time: 14.1 s


#### Count the number of rows per `model` in R

In [15]:
%%time
%%R -i r_table
start_time <- Sys.time()
suppressMessages(library(dplyr))
result <- r_table %>% count(model)
end_time <- Sys.time()
print(result %>% collect())
print(end_time - start_time)

# A tibble: 29 × 2
   model                  n
   <chr>              <int>
 1 MPI-ESM-1-2-HAM   966420
 2 AWI-ESM-1-1-LR    966420
 3 NorESM2-LM        919800
 4 ACCESS-CM2       1932840
 5 FGOALS-f3-L      3219300
 6 CMCC-CM2-HR4     3541230
 7 MRI-ESM2-0       3037320
 8 GFDL-CM4         3219300
 9 BCC-CSM2-MR      3035340
10 EC-Earth3-Veg-LR 3037320
# ℹ 19 more rows
# ℹ Use `print(n = ...)` to see more rows
Time difference of 0.142597 secs
CPU times: user 1.38 s, sys: 295 ms, total: 1.67 s
Wall time: 458 ms
