# DSCI 525: Web and Cloud Computing

## Milestone 1: Tackling Big Data on Computer

### Group 13
Authors: Ivy Zhang, Mike Lynch, Selma Duric, William Xu

## Table of contents

- [Download the data](#1)
- [Combining data CSVs](#2)
- [Load the combined CSV to memory and perform a simple EDA](#3)
- [Perform a simple EDA in R](#4)
- [Reflection](#5)

### Imports

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
import numpy as np
import pyarrow.feather as feather
from memory_profiler import memory_usage

In [2]:
# %load_ext rpy2.ipython
%load_ext memory_profiler

## Download the data <a name="1"></a>

1. Download the data from figshare to the local computer using the figshare API.
2. Extract the zip file programmatically.

In [3]:
# Attribution: DSCI 525 lecture notebook
# Necessary metadata
article_id = 14096681  # unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figsharerainfall/"

In [4]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)
files = data["files"]    

In [5]:
%%time
files_to_dl = ["data.zip"]  
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: user 5.09 s, sys: 5.02 s, total: 10.1 s
Wall time: 1min 29s
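
The selection logic in the loop above can be pulled into a small helper, which makes it easy to check without touching the network. This is only a sketch: `pick_downloads` is a hypothetical name, and the metadata below is a made-up stand-in that mimics the `name` and `download_url` fields used above.

```python
import os

def pick_downloads(files_metadata, wanted_names, output_directory):
    """Return (download_url, local_path) pairs for the files we want."""
    targets = []
    for entry in files_metadata:
        if entry["name"] in wanted_names:
            local_path = os.path.join(output_directory, entry["name"])
            targets.append((entry["download_url"], local_path))
    return targets

# Made-up metadata in the same shape as data["files"]; the URLs are placeholders.
example_files = [
    {"name": "data.zip", "download_url": "https://example.com/files/1"},
    {"name": "README.md", "download_url": "https://example.com/files/2"},
]
targets = pick_downloads(example_files, ["data.zip"], "figsharerainfall/")
```

Each pair could then be passed to `urlretrieve(url, path)` after `os.makedirs(output_directory, exist_ok=True)`.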


In [6]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

CPU times: user 18.2 s, sys: 4.01 s, total: 22.2 s
Wall time: 24.1 s
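
Before extracting a large archive it can help to see what it contains and how much disk space the members need. A minimal sketch, assuming nothing beyond the standard library; `list_members` is a hypothetical helper and the in-memory archive only stands in for `data.zip` so that the example runs on its own.

```python
import io
import zipfile

def list_members(zip_source):
    """Return (name, uncompressed_size_in_bytes) for each archive member."""
    with zipfile.ZipFile(zip_source, "r") as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]

# Build a tiny in-memory archive so the sketch runs without data.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example_daily_rainfall_NSW.csv", "time,rain (mm/day)\n")
buffer.seek(0)
members = list_members(buffer)
```

On the real data, the same function could be called as `list_members(os.path.join(output_directory, "data.zip"))`.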


## Combining data CSVs <a name="2"></a>

1. Combine the data CSVs into a single CSV using either Pandas or Dask. **We used Pandas.**
2. While combining the CSV files, we added an extra column called "model" that identifies the model. The column is populated from the file name; for example, for the file "SAM0-UNICON_daily_rainfall_NSW.csv" the model name is SAM0-UNICON.
3. Compare the run times and memory usage of these options on different machines within the team, and summarize the observations.

In [7]:
%%time
%memit
# Combine all per-model CSVs with plain pandas (no Dask) and time the operation.
# Note: the regex is applied to the full path, so `model` keeps the "figsharerainfall/" prefix.
# use_cols = ["time", "lat_min", "lat_max", "lon_min","lon_max","rain (mm/day)"]
files = glob.glob('figsharerainfall/*.csv')
df = pd.concat((pd.read_csv(file, index_col=0)
                .assign(model=re.findall(r'^[^_]+(?=_)', file)[0])
                for file in files)
              )
df.to_csv("figsharerainfall/combined_data.csv")

peak memory: 91.95 MiB, increment: 0.27 MiB
CPU times: user 7min 28s, sys: 34.2 s, total: 8min 3s
Wall time: 8min 38s
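
The near-zero `increment` reported above suggests that `%memit`, placed on a line of its own, profiled an empty statement rather than the concatenation. As a rough alternative, the work can be wrapped in a function and measured directly. The sketch below uses the standard-library `tracemalloc` module instead of `memory_profiler`, so it counts only allocations that Python tracks and its numbers are not comparable with `%memit` output; `peak_memory_mib` is a hypothetical helper.

```python
import tracemalloc

def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / 2**20

# Stand-in workload: a list of one million references (about 8 MB of pointers).
result, peak_mib = peak_memory_mib(lambda n: [0] * n, 1_000_000)
print(f"peak traced memory: {peak_mib:.2f} MiB")
```

For the notebook, the function passed in would be one that performs the `pd.concat(...)` and `to_csv(...)` steps.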


In [8]:
feather.write_feather(df, "figsharerainfall/combined_data.feather")

In [9]:
%%sh
du -sh figsharerainfall/combined_data.csv

6.6G	figsharerainfall/combined_data.csv


In [10]:
%%time
df = pd.read_csv("figsharerainfall/combined_data.csv")

CPU times: user 1min 11s, sys: 30.2 s, total: 1min 41s
Wall time: 2min 1s


In [11]:
print(df.shape)

(62513863, 7)


In [12]:
df.head()

| | time | lat_min | lat_max | lon_min | lon_max | rain (mm/day) | model |
|---|---|---|---|---|---|---|---|
| 0 | 1889-01-01 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.244226e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1 | 1889-01-02 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.217326e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 2 | 1889-01-03 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.498125e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 3 | 1889-01-04 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.251282e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 4 | 1889-01-05 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.270161e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
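
The `model` values shown above keep the `figsharerainfall/` directory prefix, because the regex in the combining cell is applied to the full path returned by `glob`. A minimal sketch of one way to get only the model name, assuming every data file follows the `<model>_daily_rainfall_NSW.csv` pattern mentioned earlier (`model_from_path` is a hypothetical helper, not part of the notebook):

```python
import os
import re

def model_from_path(path):
    """Extract the model name: the part of the file name before the first underscore."""
    filename = os.path.basename(path)  # drop the directory prefix first
    match = re.match(r"[^_]+(?=_)", filename)
    if match is None:
        raise ValueError(f"no model name found in {filename!r}")
    return match.group(0)

name = model_from_path("figsharerainfall/SAM0-UNICON_daily_rainfall_NSW.csv")
print(name)  # SAM0-UNICON
```

In the combining cell this would be used as `.assign(model=model_from_path(file))`. On a re-run, `glob` would probably also pick up `combined_data.csv`, so that file would need to be excluded from `files`.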


**Summary of run times and memory usages:**

***William***
- Combining files: 
    - peak memory: 95.41 MiB, increment: 0.26 MiB
    - CPU times: user 7min 28s, sys: 31 s, total: 7min 59s
    - Wall time: 9min 17s
- Reading the combined file:
    - Wall time: 1min 51s

***Mike***
- Combining files: 
    - peak memory: 168.59 MiB, increment: 0.12 MiB
    - CPU times: user 3min 29s, sys: 5.09 s, total: 3min 34s
    - Wall time: 3min 34s
- Reading the combined file:
    - Wall time: 37.1 s
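
To put the two machines side by side, the wall times listed above can be converted to seconds and divided. This is plain arithmetic on the recorded figures; `to_seconds` is a hypothetical helper.

```python
def to_seconds(minutes=0, seconds=0.0):
    """Convert a 'Xmin Ys' wall time to seconds."""
    return 60 * minutes + seconds

combine_william = to_seconds(9, 17)   # 9min 17s
combine_mike = to_seconds(3, 34)      # 3min 34s
read_william = to_seconds(1, 51)      # 1min 51s
read_mike = to_seconds(0, 37.1)       # 37.1 s

combine_ratio = combine_william / combine_mike
read_ratio = read_william / read_mike
print(f"combine: {combine_ratio:.1f}x, read: {read_ratio:.1f}x")
# combine: 2.6x, read: 3.0x
```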


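Mike's machine was faster on both steps: combining the files took 3min 34s of wall time against 9min 17s on William's machine, and reading the combined file took 37.1 s against 1min 51s. Run times and memory usage from the remaining team members can be added to this list for comparison.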

## Load the combined CSV to memory and perform a simple EDA <a name="3"></a>

### Establish a baseline for memory usage

In [66]:
df = pd.read_csv("figsharerainfall/combined_data.csv", parse_dates=True, index_col='time')
df.head()

| time | lat_min | lat_max | lon_min | lon_max | rain (mm/day) | model |
|---|---|---|---|---|---|---|
| 1889-01-01 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.244226e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1889-01-02 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.217326e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1889-01-03 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.498125e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1889-01-04 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.251282e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1889-01-05 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.270161e-13 | figsharerainfall/MPI-ESM-1-2-HAM |


In [67]:
df.columns

Index(['lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)', 'model'], dtype='object')

In [68]:
df.dtypes

lat_min          float64
lat_max          float64
lon_min          float64
lon_max          float64
rain (mm/day)    float64
model             object
dtype: object

In [69]:
%%time
%memit
df.describe()

peak memory: 1152.00 MiB, increment: 0.16 MiB
CPU times: user 17.4 s, sys: 15.7 s, total: 33.1 s
Wall time: 48.1 s


| | lat_min | lat_max | lon_min | lon_max | rain (mm/day) |
|---|---|---|---|---|---|
| count | 59248540.0 | 62467840.0 | 59248540.0 | 62467840.0 | 59294560.0 |
| mean | -33.10482 | -31.97757 | 146.9059 | 148.215 | 1.901827 |
| std | 1.963549 | 1.992067 | 3.793784 | 3.809994 | 5.588275 |
| min | -36.46739 | -36.0 | 140.625 | 141.25 | -3.807373e-12 |
| 25% | -34.86911 | -33.66221 | 143.4375 | 145.0 | 3.876672e-06 |
| 50% | -33.0 | -32.04188 | 146.875 | 148.125 | 0.06161705 |
| 75% | -31.4017 | -30.15707 | 150.1875 | 151.3125 | 1.021314 |
| max | -29.9 | -27.90606 | 153.75 | 155.625 | 432.9395 |


Baseline memory and time data:
- peak memory: 4578.73 MiB, increment: 0.05 MiB
- CPU times: user 5.98 s, sys: 1.19 s, total: 7.17 s
- Wall time: 7.34 s

### Effects of changing dtypes on memory usage

In [70]:
print(f"Memory usage with float64: {df[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with float64: 3000.67 MB
Memory usage with float32: 1750.39 MB
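
The `float32` figure is more than half of the `float64` one because `memory_usage()` also counts the index, which does not shrink. The printed totals can be reproduced by hand, assuming the index costs 8 bytes per row (an assumption, though it is consistent with both numbers) and using the row count from `df.shape`:

```python
n_rows = 62_513_863          # from df.shape above
index_bytes = 8 * n_rows     # assumption: 8 bytes per row for the index

def frame_mb(n_columns, bytes_per_value):
    """Memory in MB (1e6 bytes) for the index plus n_columns numeric columns."""
    return (index_bytes + n_columns * bytes_per_value * n_rows) / 1e6

float64_mb = frame_mb(5, 8)
float32_mb = frame_mb(5, 4)
print(f"{float64_mb:.2f} MB, {float32_mb:.2f} MB")
# 3000.67 MB, 1750.39 MB
```

Under this accounting, the five columns themselves go from about 2500 MB to about 1250 MB, while the index stays at about 500 MB.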


In [71]:
column_dtypes = {'lat_min': np.float32, 'lat_max': np.float32, 'lon_min': np.float32, 'lon_max': np.float32, 'rain (mm/day)': np.float32, 'model': str}
df = pd.read_csv("figsharerainfall/combined_data.csv", parse_dates=True, index_col='time', dtype=column_dtypes)
df.head()

| time | lat_min | lat_max | lon_min | lon_max | rain (mm/day) | model |
|---|---|---|---|---|---|---|
| 1889-01-01 12:00:00 | -35.439865 | -33.574619 | 141.5625 | 143.4375 | 4.244226e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1889-01-02 12:00:00 | -35.439865 | -33.574619 | 141.5625 | 143.4375 | 4.217326e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1889-01-03 12:00:00 | -35.439865 | -33.574619 | 141.5625 | 143.4375 | 4.498125e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1889-01-04 12:00:00 | -35.439865 | -33.574619 | 141.5625 | 143.4375 | 4.251282e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
| 1889-01-05 12:00:00 | -35.439865 | -33.574619 | 141.5625 | 143.4375 | 4.270161e-13 | figsharerainfall/MPI-ESM-1-2-HAM |
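
Note that `lat_min` now displays as -35.439865 rather than -35.439867: `float32` keeps only about seven significant decimal digits, so the memory saving costs a little precision. A small sketch of the effect, using the displayed value as input (the stored `float64` value may carry more digits than the six shown):

```python
import numpy as np

original = -35.439867                       # lat_min as displayed in the float64 frame
as_float32 = float(np.float32(original))    # round to the nearest float32, then widen again
rounding_error = abs(as_float32 - original)

# float32 values between 32 and 64 in magnitude are spaced 2**-18 apart,
# so rounding to the nearest one moves a value by at most 2**-19.
print(f"{as_float32:.6f}, error = {rounding_error:.2e}")
```

An error of this size is far below the resolution of the grid cells, which is presumably why the summary statistics barely change.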


In [72]:
df.dtypes

lat_min          float32
lat_max          float32
lon_min          float32
lon_max          float32
rain (mm/day)    float32
model             object
dtype: object

In [73]:
%%time
%memit
df.describe()

peak memory: 1470.41 MiB, increment: 0.23 MiB
CPU times: user 11.3 s, sys: 6.5 s, total: 17.8 s
Wall time: 25.9 s


| | lat_min | lat_max | lon_min | lon_max | rain (mm/day) |
|---|---|---|---|---|---|
| count | 59248540.0 | 62467840.0 | 59248540.0 | 62467840.0 | 59294560.0 |
| mean | -33.10497 | -31.97765 | 146.9058 | 148.215 | 1.901828 |
| std | 1.963549 | 1.992067 | 3.793784 | 3.809994 | 5.588274 |
| min | -36.46739 | -36.0 | 140.625 | 141.25 | -3.807373e-12 |
| 25% | -34.86911 | -33.66221 | 143.4375 | 145.0 | 3.876672e-06 |
| 50% | -33.0 | -32.04189 | 146.875 | 148.125 | 0.06161705 |
| 75% | -31.4017 | -30.15707 | 150.1875 | 151.3125 | 1.021314 |
| max | -29.9 | -27.90606 | 153.75 | 155.625 | 432.9395 |


Time and memory data when using different dtypes:
- peak memory: 6316.88 MiB, increment: 0.18 MiB
- CPU times: user 4.81 s, sys: 761 ms, total: 5.58 s
- Wall time: 5.73 s

### Effects of loading a smaller subset of columns on memory usage

In [74]:
print(f"Memory usage with regular columns: {df[['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']].astype('float64', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with smaller subset of columns: {df[['lat_min','rain (mm/day)']].astype('float64', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with regular columns: 3000.67 MB
Memory usage with smaller subset of columns: 1500.33 MB


In [75]:
df = pd.read_csv("figsharerainfall/combined_data.csv",parse_dates=True, index_col='time', usecols=['time', 'lat_min', 'rain (mm/day)'])
df.head()

| time | lat_min | rain (mm/day) |
|---|---|---|
| 1889-01-01 12:00:00 | -35.439867 | 4.244226e-13 |
| 1889-01-02 12:00:00 | -35.439867 | 4.217326e-13 |
| 1889-01-03 12:00:00 | -35.439867 | 4.498125e-13 |
| 1889-01-04 12:00:00 | -35.439867 | 4.251282e-13 |
| 1889-01-05 12:00:00 | -35.439867 | 4.270161e-13 |


In [76]:
df.dtypes

lat_min          float64
rain (mm/day)    float64
dtype: object

In [77]:
%%time
%memit
df.describe()

peak memory: 1119.74 MiB, increment: 0.23 MiB
CPU times: user 7.14 s, sys: 5.14 s, total: 12.3 s
Wall time: 19.7 s


| | lat_min | rain (mm/day) |
|---|---|---|
| count | 59248540.0 | 59294560.0 |
| mean | -33.10482 | 1.901827 |
| std | 1.963549 | 5.588275 |
| min | -36.46739 | -3.807373e-12 |
| 25% | -34.86911 | 3.876672e-06 |
| 50% | -33.0 | 0.06161705 |
| 75% | -31.4017 | 1.021314 |
| max | -29.9 | 432.9395 |


Time and memory data when using column subset:
- peak memory: 7748.36 MiB, increment: 0.02 MiB
- CPU times: user 2.74 s, sys: 1.08 s, total: 3.82 s
- Wall time: 4 s
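
The two reductions tried above, narrower dtypes and fewer columns, are independent, so they can be requested in the same `read_csv` call. The notebook did not measure this combination; the sketch below only shows the call on a two-row in-memory sample whose rows are copied from the output above, with the `model` value shortened.

```python
import io
import numpy as np
import pandas as pd

csv_text = """time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
1889-01-01 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.244226e-13,MPI-ESM-1-2-HAM
1889-01-02 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.217326e-13,MPI-ESM-1-2-HAM
"""

small = pd.read_csv(
    io.StringIO(csv_text),                      # stand-in for combined_data.csv
    usecols=["time", "lat_min", "rain (mm/day)"],
    dtype={"lat_min": np.float32, "rain (mm/day)": np.float32},
    parse_dates=True,
    index_col="time",
)
print(small.dtypes)
```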

### Summary

#### Using float32 vs. baseline float64 dtype to perform a simple EDA:
- `float32` uses substantially less memory than `float64`: the five numeric columns plus the index take 3000.67 MB as `float64` and 1750.39 MB as `float32`.
- The time taken to run `df.describe()` decreased compared to the baseline (recorded wall time from 7.34 s to 5.73 s).

#### Using a reduced number of columns compared to the baseline to perform a simple EDA:
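- Memory usage decreased substantially compared to the baseline when performing a simple EDA on the reduced dataset: 3000.67 MB for the five numeric columns against 1500.33 MB for the two-column subset.
- The time taken to run `df.describe()` decreased substantially compared to the baseline (recorded wall time from 7.34 s to 4 s).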

## Perform a simple EDA in R <a name="4"></a>

## Reflection <a name="5"></a>