# DSCI 525 Web and Cloud Computing 
## Milestone 1 Tackling big data on your laptop 
Authors: Amelia Tang, Chaoran Wang, Junrong Zhu (Group 13) 

### Import Dependencies

In [2]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
import numpy as np

### Downloading the data
1. Download the data from figshare to local computers using the figshare API and requests library.
2. Extract the zip file.

In [2]:
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshare/"

In [3]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)
files = data["files"]
files

[{'id': 26579150,
  'name': 'daily_rainfall_2014.png',
  'size': 58863,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e'},
 {'id': 26579171,
  'name': 'environment.yml',
  'size': 192,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34'},
 {'id': 26586554,
  'name': 'README.md',
  'size': 5422,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c'},
 {'id': 26766812,
  'name': 'data.zip',
  'size': 814041183,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26766812',
  'supplied_md5': 'b517383f76e77bd03755a63a8f

In [4]:
files_to_dl = ["data.zip"]  
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])
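
The figshare response above lists a `supplied_md5` for every file, so the download can be checked before it is extracted. The following is a minimal sketch of such a check; `md5_of_file` is a hypothetical helper (not part of the figshare API), and the temporary file only stands in for `figshare/data.zip`:

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Small self-check on a temporary file (stands in for figshare/data.zip).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as fh:
        fh.write(b"hello")
    demo_digest = md5_of_file(demo_path)

print(demo_digest)
```

In the notebook, the comparison would be `md5_of_file(os.path.join(output_directory, "data.zip")) == file["supplied_md5"]` for the `data.zip` entry.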

In [5]:
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

In [6]:
os.remove("figshare/observed_daily_rainfall_SYD.csv")

### Combining data CSVs
1. Combine data CSVs into a single CSV using pandas.
2. When combining the CSV files, add an extra column called "model" that identifies the model. 
3. Compare run times on different machines within our team. 
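
The `model` value in step 2 is taken from the file name: the cell below keeps the text between the directory separator and the first underscore. The sketch here shows the same idea with `os.path` instead of a regular expression on the full path, which avoids depending on `/` as the separator. The helper name and the example file name are assumptions for illustration only:

```python
import os


def model_from_path(path):
    """Return the part of the file name before the first underscore."""
    file_name = os.path.basename(path)
    return file_name.split("_", 1)[0]


# Hypothetical file name following a "<model>_<rest>.csv" pattern.
example = os.path.join("figshare", "MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv")
print(model_from_path(example))
```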

In [7]:
%%time

use_cols = ["time", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
files = glob.glob('figshare/*.csv')

df = pd.concat((pd.read_csv(file, index_col=0, usecols=use_cols)
                .assign(model=re.findall(r"/([^_]*)", file)[0])
                for file in files))

df.to_csv("figshare/combined_data.csv")

CPU times: user 5min 15s, sys: 11.7 s, total: 5min 27s
Wall time: 5min 33s


#### Time Comparison Table for Combining CSVs


| Team Member | Operating System | RAM | Processor | Is SSD | Time Taken |
| ----------- | ---------------- | --- | --------- | ------ | ---------- |
| Junrong Zhu | macOS Monterey | 8GB | Apple M1 chip 8-core | Yes | Total time 5min 57s |
| Amelia Tang | macOS Monterey | 8GB | 2.2 GHz Dual-Core Intel Core i7 | Yes | Total time 10min 1s |
| Chaoran Wang | macOS Big Sur | 16GB | 4.2 GHz Quad-Core Intel Core i7 | Yes | Total time 5min 27s |

***Our Observations***

We observed that computers with more CPU cores tended to combine the files faster, and that computers with more RAM took less time to process the files. Because all of our machines ran macOS and had SSDs, we could not observe how the operating system or the presence of an SSD affected the speed. However, based on our research, both the operating system and the specifications of the SSD do affect the speed.

Sources: https://dash.harvard.edu/bitstream/handle/1/24829608/tr-09-95.pdf
<br>https://ssdsphere.com/how-does-ssd-speed-up-a-system/

#### In order to understand our data better, we performed the following exploratory data analysis steps:

**Python:**
- observing and changing the `dtype` of the data
- loading the columns of interest

**R:**
- obtaining summary statistics of the columns
- constructing a plot of the parameters of interest

## Load CSV to memory and perform a simple EDA in Python

In [3]:
df = pd.read_csv("figshare/combined_data.csv", parse_dates=True, index_col='time')

#### Changing `dtype` of the data

We first inspected the `dtype` of each column.

In [6]:
df.dtypes

lat_min          float64
lat_max          float64
lon_min          float64
lon_max          float64
rain (mm/day)    float64
model             object
dtype: object

Then we ran `.describe()` as a simple EDA step on the combined dataset, with all the original columns loaded and their default data types. We timed this step to establish a baseline for the comparisons that follow.
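
`%%time` reports CPU time (user plus sys) and wall time separately, and both figures appear in the comparison tables of this notebook. Outside a notebook, the same two quantities can be measured with the standard library. This is a minimal sketch; `timed` is a hypothetical helper, and `time.sleep` is used only because it makes the difference between the two clocks obvious:

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, cpu_seconds, wall_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


# Sleeping takes wall-clock time but almost no CPU time.
_, cpu_s, wall_s = timed(time.sleep, 0.05)
print(f"CPU {cpu_s:.3f} s, wall {wall_s:.3f} s")
```

A large gap between wall time and CPU time usually means the process was waiting (for disk, memory paging, or other processes) rather than computing.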

In [7]:
%%time
df.describe() # baseline

CPU times: user 6.88 s, sys: 6.38 s, total: 13.3 s
Wall time: 15.5 s


Unnamed: 0,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
count,59248540.0,62467840.0,59248540.0,62467840.0,59248540.0
mean,-33.10482,-31.97757,146.9059,148.215,1.90117
std,1.963549,1.992067,3.793784,3.809994,5.585735
min,-36.46739,-36.0,140.625,141.25,-3.807373e-12
25%,-34.86911,-33.66221,143.4375,145.0,3.838413e-06
50%,-33.0,-32.04188,146.875,148.125,0.06154947
75%,-31.4017,-30.15707,150.1875,151.3125,1.020918
max,-29.9,-27.90606,153.75,155.625,432.9395


Next, we compared the memory consumption of the numeric columns under different `dtype`s, on the assumption that lower memory usage would likely lead to shorter running time.

In [8]:
print(f"Memory usage with float64: {df[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with float64: 2998.46 MB
Memory usage with float32: 1749.10 MB


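We observed that using `float32` instead of `float64` halves the memory needed for the numeric columns themselves. The totals printed above drop by about 42% (from 2998.46 MB to 1749.10 MB) because `memory_usage()` also counts the index, which is unchanged. We therefore changed the 5 numeric columns from `float64` to `float32` in the code cells below.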
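
The two memory figures printed above can be reproduced by hand, which also shows why the overall saving is a little under one half: `memory_usage()` includes the index, and the index does not shrink. This is a back-of-the-envelope sketch, assuming 8 bytes per row for the `time` index and using the row count of 62,467,843 that the R summary later in this notebook reports:

```python
ROWS = 62_467_843     # row count reported in the R summary of the combined data
NUMERIC_COLS = 5      # lat_min, lat_max, lon_min, lon_max, rain (mm/day)
INDEX_BYTES = 8       # assumption: the `time` index costs 8 bytes per row


def usage_mb(bytes_per_value):
    """Memory of the numeric columns plus the index, in MB (1 MB = 1e6 bytes)."""
    return (NUMERIC_COLS * bytes_per_value + INDEX_BYTES) * ROWS / 1e6


float64_mb = usage_mb(8)
float32_mb = usage_mb(4)
saving = 1 - float32_mb / float64_mb

print(f"float64: {float64_mb:.2f} MB")
print(f"float32: {float32_mb:.2f} MB")
print(f"saving:  {saving:.1%}")
```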

In [4]:
# converting dtype
colum_dtypes = {'lat_min': np.float32, 
                'lat_max': np.float32, 
                'lon_min': np.float32, 
                'lon_max': np.float32, 
                'rain (mm/day)': np.float32}

In [5]:
df_new = pd.read_csv("figshare/combined_data.csv", 
                     parse_dates=True, index_col='time', dtype=colum_dtypes)

In [6]:
# Check the columns' data type after converting
df_new.dtypes

lat_min          float32
lat_max          float32
lon_min          float32
lon_max          float32
rain (mm/day)    float32
model             object
dtype: object

In [7]:
%%time
df_new.describe() # time comparison

CPU times: user 5.18 s, sys: 2.17 s, total: 7.35 s
Wall time: 7.86 s


Unnamed: 0,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
count,59248540.0,62467840.0,59248540.0,62467840.0,59248540.0
mean,-33.10497,-31.97765,146.9057,148.215,1.901173
std,1.963549,1.992067,3.793784,3.809994,5.585735
min,-36.46739,-36.0,140.625,141.25,-3.807373e-12
25%,-34.86911,-33.66221,143.4375,145.0,3.838413e-06
50%,-33.0,-32.04189,146.875,148.125,0.06154947
75%,-31.4017,-30.15707,150.1875,151.3125,1.020918
max,-29.9,-27.90606,153.75,155.625,432.9395


#### Loading the columns of interest

Since we were mostly interested in the rainfall in mm/day, we loaded only the `time` and `rain (mm/day)` columns this time to perform the same EDA step.
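
The two optimizations (fewer columns and a smaller `dtype`) can also be combined in a single `read_csv` call. The sketch below does this on a tiny in-memory stand-in for `combined_data.csv`; the sample rows are made up, and this combination was not timed as part of the comparison in this notebook:

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for combined_data.csv with the same column names (values are made up).
csv_text = (
    "time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model\n"
    "1889-01-01 12:00:00,-35.4,-33.6,141.6,143.4,0.5,demo\n"
    "1889-01-02 12:00:00,-35.4,-33.6,141.6,143.4,1.5,demo\n"
)

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["time", "rain (mm/day)"],     # skip the unused columns while parsing
    dtype={"rain (mm/day)": np.float32},   # store rainfall as float32
    index_col="time",
    parse_dates=True,
)
print(subset.dtypes)
```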

In [13]:
df_subset = pd.read_csv("figshare/combined_data.csv",
                        parse_dates=True, index_col='time', 
                        usecols=['time', 'rain (mm/day)'])

df_subset.head()

Unnamed: 0_level_0,rain (mm/day)
time,Unnamed: 1_level_1
1889-01-01 12:00:00,4.244226e-13
1889-01-02 12:00:00,4.217326e-13
1889-01-03 12:00:00,4.498125e-13
1889-01-04 12:00:00,4.251282e-13
1889-01-05 12:00:00,4.270161e-13


In [14]:
%%time
df_subset.describe() # running same EDA step on the subset data

CPU times: user 2.19 s, sys: 2.45 s, total: 4.64 s
Wall time: 8.15 s


Unnamed: 0,rain (mm/day)
count,59248540.0
mean,1.90117
std,5.585735
min,-3.807373e-12
25%,3.838413e-06
50%,0.06154947
75%,1.020918
max,432.9395


### Comparison table for Python EDA timing

| Team Member | Operating System | RAM | Processor | Is SSD | Baseline time for EDA | Time after changing `dtype` | Time for fewer columns |
| ----------- | ---------------- | --- | --------- | ------ | --------------------- | --------------------------- | ---------------------- |
| Junrong Zhu | macOS Monterey | 8GB | Apple M1 chip 8-core | Yes | Total: 13.5 s, Wall time: 16 s | Total: 7.35 s, Wall time: 7.73 s | Total: 3.04 s, Wall time: 3.24 s |
| Amelia Tang | macOS Monterey | 8GB | 2.2 GHz Dual-Core Intel Core i7 | Yes | Total: 22.4 s, Wall time: 26.1 s | Total: 13.2 s, Wall time: 15 s | Total: 4.82 s, Wall time: 4.97 s |
| Chaoran Wang | macOS Big Sur | 16GB | 4.2 GHz Quad-Core Intel Core i7 | Yes | Total: 12.7 s, Wall time: 12.8 s | Total: 8 s, Wall time: 8.06 s | Total: 3.53 s, Wall time: 3.57 s |

### Summary for Python 
- Changing `dtype`
> After changing the `dtype` from `float64` to `float32`, the measured memory usage decreased by about 42% (the numeric columns themselves were halved), and the total / wall time of the simple EDA decreased clearly on every team member's computer.
- Loading only the columns needed
> Since our main focus was daily rainfall, we were most interested in the `rain (mm/day)` column. After loading only this subset of the data, the total / wall time decreased significantly compared with both the baseline and the time after converting the `dtype`, on all team members' computers.
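
One trade-off of the `dtype` change is precision: `float32` keeps roughly seven significant decimal digits. This limited precision, in the stored values and in sums over about 60 million rows, is a likely reason why the `lat_min` mean differs slightly between the two `describe()` outputs (-33.10482 versus -33.10497). A small standard-library sketch of the rounding for a single value follows; the helper name is hypothetical:

```python
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


original = -35.439865   # a lat_min value shown in this notebook
rounded = to_float32(original)
print(original, rounded, abs(original - rounded))
```

For summary statistics of rainfall in mm/day this error is negligible, but it is worth keeping in mind when exact comparisons between the two versions are needed.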

### R Section

#### Transfer the dataframe from Python to R using `Feather`

Since we only needed to store the data for the short term, we chose a `feather` file over a `Parquet` file. `feather` also lets us exchange data between Python and R quickly with a fairly simple implementation. We did not choose `Pandas Exchange` because it spends a long time on serialization and deserialization and is slower than a `feather` file. We also did not want to use `Arrow Exchange` while keeping the `csv` format, because the `csv` file is not compressed and takes longer to work with.

In [1]:
df_new = df_new.reset_index()


In [9]:
# converting df to feather file in Python
df_new.to_feather("figshare/combined_data.feather")

In [None]:
# converting the combined CSV to a parquet dataset in Python
import pyarrow.dataset as ds
# format="csv" is needed here: ds.dataset() assumes parquet input by default
df_new_parquet = ds.dataset("figshare/combined_data.csv", format="csv")
parquet_result = df_new_parquet.scanner(columns=use_cols)
ds.write_dataset(parquet_result, "figshare/combined_data.parquet", format="parquet")

In [10]:
%%sh
du -sh figshare/combined_data.feather

960M	figshare/combined_data.feather


We can see that the feather file of our data is about 960 MB.

In [None]:
%%time
df_feather = pd.read_feather('figshare/combined_data.feather')

In [None]:
df_feather.head()

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
0,1889-01-01 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.244226e-13,MPI-ESM-1-2-HAM
1,1889-01-02 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.217326e-13,MPI-ESM-1-2-HAM
2,1889-01-03 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.498125e-13,MPI-ESM-1-2-HAM
3,1889-01-04 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.251282e-13,MPI-ESM-1-2-HAM
4,1889-01-05 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.270161e-13,MPI-ESM-1-2-HAM

In [None]:
%%time
df_parquet = pd.read_parquet('figshare/combined_data.parquet')  # same path as written above

In [None]:
df_parquet

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
suppressMessages(library(arrow))
df_feather <- arrow::read_feather('figshare/combined_data.feather')
head(df_feather)

                 time   lat_min   lat_max  lon_min  lon_max rain (mm/day)
1 1889-01-01 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.244226e-13
2 1889-01-02 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.217326e-13
3 1889-01-03 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.498125e-13
4 1889-01-04 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.251282e-13
5 1889-01-05 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.270161e-13
6 1889-01-06 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.197289e-13
            model
1 MPI-ESM-1-2-HAM
2 MPI-ESM-1-2-HAM
3 MPI-ESM-1-2-HAM
4 MPI-ESM-1-2-HAM
5 MPI-ESM-1-2-HAM
6 MPI-ESM-1-2-HAM


#### Simple EDA

In [21]:
%%R
summary(df_feather)

      time                        lat_min           lat_max      
 Min.   :1888-12-31 16:00:00   Min.   :-36       Min.   :-36.00  
 1st Qu.:1920-07-02 04:00:00   1st Qu.:-35       1st Qu.:-33.66  
 Median :1952-01-01 04:00:00   Median :-33       Median :-32.04  
 Mean   :1952-01-01 08:32:08   Mean   :-33       Mean   :-31.98  
 3rd Qu.:1983-07-02 05:00:00   3rd Qu.:-31       3rd Qu.:-30.16  
 Max.   :2014-12-31 04:00:00   Max.   :-30       Max.   :-27.91  
                               NA's   :3219300                   
    lon_min           lon_max      rain (mm/day)        model          
 Min.   :141       Min.   :141.2   Min.   :  0       Length:62467843   
 1st Qu.:143       1st Qu.:145.0   1st Qu.:  0       Class :character  
 Median :147       Median :148.1   Median :  0       Mode  :character  
 Mean   :147       Mean   :148.2   Mean   :  2                         
 3rd Qu.:150       3rd Qu.:151.3   3rd Qu.:  1                         
 Max.   :154       Max.   :155.6   Max. 

In [None]:
%%R
df_feather$year <- format(df_feather$time, format = "%Y")
head(df_feather, 3)

                 time   lat_min   lat_max  lon_min  lon_max rain (mm/day)
1 1889-01-01 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.244226e-13
2 1889-01-02 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.217326e-13
3 1889-01-03 04:00:00 -35.43987 -33.57462 141.5625 143.4375  4.498125e-13
            model year
1 MPI-ESM-1-2-HAM 1889
2 MPI-ESM-1-2-HAM 1889
3 MPI-ESM-1-2-HAM 1889
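
The timestamps print as `12:00:00` in Python but as `04:00:00` in R. This is consistent with R displaying the values in a local time zone eight hours behind UTC; that is an inference from the printed output, not something the notebook states. If so, a year extracted from the displayed time can differ from the UTC year for timestamps near midnight on 1 January, which matches the `1888-12-31 16:00:00` minimum in the summary above. The sketch below illustrates the shift with a fixed UTC-8 offset, which is an assumption:

```python
from datetime import datetime, timedelta, timezone

UTC_MINUS_8 = timezone(timedelta(hours=-8))   # assumption inferred from the R output

noon_utc = datetime(1889, 1, 1, 12, 0, tzinfo=timezone.utc)
midnight_utc = datetime(1889, 1, 1, 0, 0, tzinfo=timezone.utc)

noon_local = noon_utc.astimezone(UTC_MINUS_8)
midnight_local = midnight_utc.astimezone(UTC_MINUS_8)

print(noon_local)            # same instant, shown eight hours earlier
print(midnight_local.year)   # the displayed year falls back to the previous one
```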


Based on the previous discussion, we are mostly interested in the `time` and `rain (mm/day)` columns. Therefore, we perform further EDA on this subset of the data.

In [None]:
%%R
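# rename by name: column 7 is `model`, while `rain (mm/day)` is column 6
colnames(df_feather)[colnames(df_feather) == "rain (mm/day)"] <- "rain"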

In [None]:
%%R
df_feather <- aggregate(rain ~ year, data = df_feather, mean)
head(df_feather)

In [None]:
%%R
df_feather$year <- as.numeric(df_feather$year)

In [None]:
%%R
suppressMessages(library(ggplot2))
ggplot(data = df_feather, aes(x = year, y = rain)) +
  geom_line() +
  labs(title = "Rainfall trend in Australia", x = "Year", y = "Rainfall")

## Challenges

1. One of the challenges we had with Q5 was the long running time. For example, we wanted a general overview of the dataframe using `.info()`, as we had done in other courses. However, it took a long time to output the dtype of each variable along with other information that we were not particularly interested in. As an alternative, we used `.dtypes` to get the data types of the columns, which returned the results immediately.

2. For the EDA in R, we struggled to come up with a proper analysis because of the size of the data set. We finally chose to draw a simple line plot showing the trend of rainfall over the years. Due to the large data size, we aggregated the data into yearly mean rainfall instead of plotting the original rainfall data. The aggregation took about 10 minutes for one of us, which seemed acceptable, but a teammate's laptop could not run it at all.
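
Because a mean is a sum divided by a count, the yearly aggregation described in challenge 2 does not need the whole table in memory at once: partial sums and counts can be accumulated chunk by chunk. The plain-Python sketch below shows the idea on made-up data. It is an illustration only and was not benchmarked on the full data set; with pandas, the chunks could come from `pd.read_csv(..., chunksize=...)`.

```python
from collections import defaultdict


def yearly_mean(chunks):
    """Accumulate per-year sums and counts over chunks of (year, rain) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for chunk in chunks:
        for year, rain in chunk:
            if rain is None:   # skip missing rainfall values
                continue
            totals[year] += rain
            counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


# Two small chunks standing in for pieces of the full data set (made-up values).
chunks = [
    [(1889, 1.0), (1889, 3.0), (1890, 2.0)],
    [(1890, None), (1890, 4.0), (1891, 0.5)],
]
result = yearly_mean(chunks)
print(result)
```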