# DSCI 525 Web and Cloud Computing 
## Milestone 1 Tackling big data on your laptop 
Authors: Amelia Tang, Chaoran Wang, Junrong Zhu (Group 13) 

### Import Dependencies

In [24]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
import numpy as np

### Downloading the data
1. Download the data from figshare to the local machine using the figshare API and the `requests` library.
2. Extract the zip file.

In [None]:
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshare/"

In [None]:
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early if the API call failed
data = response.json()
files = data["files"]
files

In [None]:
files_to_dl = ["data.zip"]  
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])
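The download loop above can be wrapped in a small helper. The sketch below is only an illustration: `select_downloads` is a hypothetical helper name (not part of the figshare API), and it assumes nothing beyond the metadata layout used above, namely a list of entries with `name` and `download_url` keys. It builds the local path with `os.path.join`, so it does not rely on `output_directory` ending with a slash.

```python
import os


def select_downloads(files, wanted, output_directory):
    """Return (download_url, local_path) pairs for the files we want.

    `files` is assumed to look like the figshare metadata above:
    a list of dicts with "name" and "download_url" keys.
    """
    wanted = set(wanted)
    return [
        (f["download_url"], os.path.join(output_directory, f["name"]))
        for f in files
        if f["name"] in wanted
    ]


# Made-up metadata for illustration only (not real figshare output).
example_files = [
    {"name": "data.zip", "download_url": "https://example.org/data.zip"},
    {"name": "README.md", "download_url": "https://example.org/README.md"},
]
pairs = select_downloads(example_files, ["data.zip"], "figshare")
```

Each pair could then be passed to `urlretrieve(url, path)` after calling `os.makedirs(output_directory, exist_ok=True)`.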

In [None]:
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

In [None]:
os.remove("figshare/observed_daily_rainfall_SYD.csv")
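An alternative to extracting everything and then deleting the observed-rainfall file is to skip that file at extraction time. The following is a minimal sketch of that idea, not what we ran: `extract_except` is a hypothetical helper, and the demo archive and its file names are made up so the example is self-contained.

```python
import os
import tempfile
import zipfile


def extract_except(zip_path, output_directory, skip_names):
    """Extract every archive member whose base name is not in skip_names."""
    skip = set(skip_names)
    with zipfile.ZipFile(zip_path, "r") as zf:
        members = [m for m in zf.namelist() if os.path.basename(m) not in skip]
        zf.extractall(output_directory, members=members)
    return members


# Tiny self-contained demo with a throwaway archive (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("MODEL-A_daily_rainfall_NSW.csv", "time,rain (mm/day)\n")
        zf.writestr("observed_daily_rainfall_SYD.csv", "time,rain (mm/day)\n")
    out_dir = os.path.join(tmp, "out")
    extracted = extract_except(demo_zip, out_dir, ["observed_daily_rainfall_SYD.csv"])
    extracted_on_disk = sorted(os.listdir(out_dir))
```

With the real archive, the call would presumably be `extract_except("figshare/data.zip", "figshare/", ["observed_daily_rainfall_SYD.csv"])`.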

### Combining data CSVs
1. Combine data CSVs into a single CSV using pandas.
2. When combining the CSV files, add an extra column called "model" that identifies the model. 
3. Compare run times on different machines within our team. 

In [None]:
%%time

use_cols = ["time", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
files = glob.glob('figshare/*.csv')

df = pd.concat((pd.read_csv(file, index_col=0, usecols=use_cols)
                .assign(model=re.findall(r"/([^_]*)", file)[0])
                for file in files))

df.to_csv("figshare/combined_data.csv")
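Two details of the cell above are worth guarding against: the regular expression assumes `/` as the path separator, and a rerun would pick up `combined_data.csv` itself because it matches `figshare/*.csv`. The sketch below shows one way to handle both, under the assumption that the model name is the part of the file name before the first underscore. `combine_csvs` is a hypothetical helper, and the demo files are made up.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(directory, output_name="combined_data.csv"):
    """Concatenate per-model CSVs, tagging each row with its model name."""
    output_path = os.path.join(directory, output_name)
    # Skip the output file so a rerun does not fold old results back in.
    paths = sorted(p for p in glob.glob(os.path.join(directory, "*.csv"))
                   if os.path.abspath(p) != os.path.abspath(output_path))
    frames = (
        pd.read_csv(p, index_col=0)
        # Model name = file name up to the first underscore (OS independent).
        .assign(model=os.path.basename(p).split("_")[0])
        for p in paths
    )
    combined = pd.concat(frames)
    combined.to_csv(output_path)
    return combined


# Demo on two tiny made-up files; the names only mimic the real pattern.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("MODEL-A", "MODEL-B"):
        pd.DataFrame({"time": ["1889-01-01"], "rain (mm/day)": [0.5]}).to_csv(
            os.path.join(tmp, f"{name}_daily_rainfall_NSW.csv"), index=False)
    first = combine_csvs(tmp)
    second = combine_csvs(tmp)  # rerun: combined_data.csv is ignored
```

The `usecols` argument from the original cell can be passed to `pd.read_csv` in the same way if only some columns are needed.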

### Time Comparison Table for Combining CSVs


| Team Member  | Operating System | RAM  | Processor                             | Is SSD | Time Taken           |
| ------------ | ---------------- | ---- | ------------------------------------- | ------ | -------------------- |
| Junrong Zhu  | macOS Monterey   | 8GB  | CPU - Apple M1 chip 8-core            | Yes    | Total time 5min 57s  |
| Amelia Tang  | macOS Monterey   | 8GB  | CPU - 2.2 GHz Dual-Core Intel Core i7 | Yes    | Total time 10min 1s  |
| Chaoran Wang | macOS Big Sur    | 16GB | CPU - Intel Core i7-7700k             | Yes    | Total time 5min 39s  |

***Our Observations***

We observed that machines whose CPUs had more cores tended to combine the files faster, and that the machine with the most RAM (16GB) was also the fastest. Because all three machines ran macOS and had SSDs, we could not observe how the operating system or the presence of an SSD affected the run time. Based on our research, however, both the operating system and the specifications of the SSD can affect speed.

Sources: https://dash.harvard.edu/bitstream/handle/1/24829608/tr-09-95.pdf
<br>https://ssdsphere.com/how-does-ssd-speed-up-a-system/

### Load the combined CSV to memory and perform a simple EDA

To understand our data better, we performed the following exploratory data analysis steps:

- observing and changing the `dtype` of the data
- loading the columns of interest
- loading in chunks
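As a rough illustration of the third step, loading in chunks, the sketch below computes a column mean while holding only `chunksize` rows in memory at a time. It is a minimal example under stated assumptions: `chunked_mean` is a hypothetical helper, and the small in-memory CSV stands in for `figshare/combined_data.csv`.

```python
import io

import pandas as pd


def chunked_mean(source, column, chunksize):
    """Mean of one column, reading the CSV `chunksize` rows at a time."""
    total = 0.0
    count = 0
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()  # ignore missing values, like .mean()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


# Tiny in-memory stand-in for figshare/combined_data.csv (made-up values).
demo_csv = io.StringIO("time,rain (mm/day)\n"
                       "1889-01-01,1.0\n"
                       "1889-01-02,\n"
                       "1889-01-03,3.0\n"
                       "1889-01-04,5.0\n")
mean_rain = chunked_mean(demo_csv, "rain (mm/day)", chunksize=2)
```

Statistics that can be accumulated this way (counts, sums, minima, maxima) fit the chunked approach well. Quantiles such as the median need a different strategy, because they cannot be combined from per-chunk results.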

We are going to present the EDA in `Python` and `R` respectively.

#### Python: Reading the dataset

Reading in the data took quite some time. We examined the shape of the data set. 

In [20]:
df = pd.read_csv("figshare/combined_data.csv", parse_dates=True, index_col='time')

In [21]:
df.shape

(62467843, 6)

#### Python: observing and changing the `dtype` of the data 
We observed the `dtype` for each column. 

In [22]:
df.dtypes

lat_min          float64
lat_max          float64
lon_min          float64
lon_max          float64
rain (mm/day)    float64
model             object
dtype: object

Then, we ran `.describe()` as a simple EDA on the combined dataset, with all original columns loaded and the default data types. We timed this step to establish a baseline for the comparisons that follow.

In [26]:
%%time
df.describe() # baseline

CPU times: user 13.6 s, sys: 8.82 s, total: 22.4 s
Wall time: 26.1 s


Unnamed: 0,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
count,59248540.0,62467840.0,59248540.0,62467840.0,59248540.0
mean,-33.10482,-31.97757,146.9059,148.215,1.90117
std,1.963549,1.992067,3.793784,3.809994,5.585735
min,-36.46739,-36.0,140.625,141.25,-3.807373e-12
25%,-34.86911,-33.66221,143.4375,145.0,3.838413e-06
50%,-33.0,-32.04188,146.875,148.125,0.06154947
75%,-31.4017,-30.15707,150.1875,151.3125,1.020918
max,-29.9,-27.90606,153.75,155.625,432.9395


In [7]:
#print(f"Memory usage with object for the time column: {df[['time']].memory_usage().sum() / 1e6:.2f} MB")
#print(f"Memory usage with datetime64[ns] for the time column: {df[['time']].astype('datetime64[ns]', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with object for the time column: 499.74 MB
Memory usage with datetime64[ns] for the time column: 499.74 MB


In [8]:
#print(f"Memory usage with object: {df[['model']].memory_usage().sum() / 1e6:.2f} MB")
#print(f"Memory usage with string: {df[['model']].astype('str', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with object: 499.74 MB
Memory usage with string: 499.74 MB


In [13]:
# df['time'] = pd.to_datetime(df['time'])

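We observed that storing the numeric columns as `float32` instead of `float64` halves the memory those columns need. The totals reported below drop by about 42% rather than 50% because they also count the `time` index, which is unchanged. We therefore changed the columns with the data type `float64` to `float32`.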

In [27]:
print(f"Memory usage with float64: {df[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with float64: 2998.46 MB
Memory usage with float32: 1749.10 MB
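`memory_usage()` counts the index by default, which explains why the `float32` total above is more than half of the `float64` total: with 62,467,843 rows, the 8-byte `time` index contributes about 499.74 MB to both figures. The sketch below reproduces the effect on a small made-up frame. The row count and zero values are placeholders, and only the column layout mirrors our data.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a datetime index, mirroring the layout above.
n_rows = 1_000
index = pd.date_range("1889-01-01", periods=n_rows, freq="D", name="time")
demo = pd.DataFrame(np.zeros((n_rows, 5)), index=index,
                    columns=["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"])

# Bytes used by the index alone (datetime64 values take 8 bytes each).
index_bytes = demo.index.memory_usage()
# Bytes used by the five float columns, with and without downcasting.
data64 = demo.memory_usage(index=False).sum()
data32 = demo.astype("float32").memory_usage(index=False).sum()

total64 = index_bytes + data64
total32 = index_bytes + data32
saving = 1 - total32 / total64  # fraction saved once the index is included
```

Here `data32` is exactly half of `data64`, while `saving` stays below one half because `index_bytes` appears in both totals.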


In [28]:
column_dtypes = {'lat_min': np.float32, 'lat_max': np.float32, 'lon_min': np.float32, 'lon_max': np.float32, 'rain (mm/day)': np.float32}
df_new = pd.read_csv("figshare/combined_data.csv", parse_dates=True, index_col='time', dtype=column_dtypes)
df_new.head()

time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
1889-01-01 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.244226e-13,MPI-ESM-1-2-HAM
1889-01-02 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.217326e-13,MPI-ESM-1-2-HAM
1889-01-03 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.498125e-13,MPI-ESM-1-2-HAM
1889-01-04 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.251282e-13,MPI-ESM-1-2-HAM
1889-01-05 12:00:00,-35.439865,-33.574619,141.5625,143.4375,4.270161e-13,MPI-ESM-1-2-HAM


In [29]:
# Check the new data type for each column
df_new.dtypes 

lat_min          float32
lat_max          float32
lon_min          float32
lon_max          float32
rain (mm/day)    float32
model             object
dtype: object
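Instead of listing every float column by hand, the `dtype` mapping could be derived from a small sample of the file. The sketch below is one possible way to do that, not the approach we used: `float32_dtype_map` is a hypothetical helper, and the two-row CSV is made up to keep the example self-contained.

```python
import io

import numpy as np
import pandas as pd


def float32_dtype_map(source, sample_rows=100):
    """Map every column that pandas infers as float64 to float32."""
    sample = pd.read_csv(source, nrows=sample_rows)
    return {col: np.float32 for col in sample.select_dtypes("float64").columns}


# Made-up two-row CSV standing in for figshare/combined_data.csv.
demo_text = ("time,lat_min,rain (mm/day),model\n"
             "1889-01-01 12:00:00,-35.5,0.25,MODEL-A\n"
             "1889-01-02 12:00:00,-35.5,1.5,MODEL-A\n")
dtype_map = float32_dtype_map(io.StringIO(demo_text))
demo_df = pd.read_csv(io.StringIO(demo_text), index_col="time",
                      parse_dates=True, dtype=dtype_map)
```

One caveat: inference from a sample can be misleading when the first rows are not representative, for example a column that looks like integers at the top but contains decimals or missing values later, so the resulting mapping is worth checking before use.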

In [30]:
%%time
df_new.describe() # after changing data types 

CPU times: user 9.34 s, sys: 3.85 s, total: 13.2 s
Wall time: 15 s


Unnamed: 0,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
count,59248540.0,62467840.0,59248540.0,62467840.0,59248540.0
mean,-33.10497,-31.97765,146.9057,148.215,1.901173
std,1.963549,1.992067,3.793784,3.809994,5.585735
min,-36.46739,-36.0,140.625,141.25,-3.807373e-12
25%,-34.86911,-33.66221,143.4375,145.0,3.838413e-06
50%,-33.0,-32.04189,146.875,148.125,0.06154947
75%,-31.4017,-30.15707,150.1875,151.3125,1.020918
max,-29.9,-27.90606,153.75,155.625,432.9395


#### Python: loading the columns of interest

Since we were mainly interested in the rainfall in mm/day, we loaded only the `time` and `rain (mm/day)` columns this time.

In [31]:
df_subset = pd.read_csv("figshare/combined_data.csv",parse_dates=True, index_col='time', usecols=['time', 'rain (mm/day)'])
df_subset.head()

time,rain (mm/day)
1889-01-01 12:00:00,4.244226e-13
1889-01-02 12:00:00,4.217326e-13
1889-01-03 12:00:00,4.498125e-13
1889-01-04 12:00:00,4.251282e-13
1889-01-05 12:00:00,4.270161e-13


In [32]:
%%time
df_subset.describe() # just the time and rain columns

CPU times: user 3.35 s, sys: 1.48 s, total: 4.82 s
Wall time: 4.97 s


Unnamed: 0,rain (mm/day)
count,59248540.0
mean,1.90117
std,5.585735
min,-3.807373e-12
25%,3.838413e-06
50%,0.06154947
75%,1.020918
max,432.9395


### Comparison table for Python EDA timing

| Team Member  | Operating System | RAM  | Processor                             | Is SSD | Baseline time for EDA            | Time after changing `dtype`    | Time for fewer columns            |
| ------------ | ---------------- | ---- | ------------------------------------- | ------ | -------------------------------- | ------------------------------ | --------------------------------- |
| Junrong Zhu  | macOS Monterey   | 8GB  | CPU - Apple M1 chip 8-core            | Yes    | Total time 11.1s                 |                                |                                   |
| Amelia Tang  | macOS Monterey   | 8GB  | CPU - 2.2 GHz Dual-Core Intel Core i7 | Yes    | Total: 22.4s, wall time: 26.1s   | Total: 13.2s, wall time: 15s   | Total: 4.82s, wall time: 4.97s    |
| Chaoran Wang | macOS Big Sur    | 16GB | CPU - Intel Core i7-7700k             | Yes    | Total time 5min 39s              |                                |                                   |

### Summary for Python 
- Changing `dtype`
> After changing the `dtype` from `float64` to `float32`, the reported memory usage (index included) dropped from 2998.46 MB to 1749.10 MB, a decrease of about 42%, and we observed decreases in total / wall time to perform the simple EDA across our team members' computers.
- Loading only the columns needed
> Our main focus here was daily rainfall, so we cared most about the `time` and `rain (mm/day)` columns. After leaving out the other columns, we saw further decreases in total / wall time to perform the simple EDA across our team members' computers.

### R Section

... Reasoning of the approach ...

## Challenges

1. One of the challenges we had with Q5 was the long running time. For example, we wanted a general overview of the dataframe using `.info()`, as we did in other courses. However, it took a long time to output the dtype of each variable along with other information that we were not particularly interested in. As an alternative, we used `.dtypes` to get the data types of the columns, which returned the results immediately.

2. 