# DSCI 525 - Web and Cloud Computing

### Team members: 
- Chen Lin, Edward Yukun Zhang, Jakob Thoms, Vikram Grewal

## Milestone 1: Tackling big data on your laptop

## Overall project goal and data

During this course, you will be working on a team project involving big
data. The purpose is to get exposure to working with much larger
datasets than you have previously in MDS. You have been assigned to
teams of three or four. (See group assignment in
[Canvas](https://canvas.ubc.ca/courses/106517).) Unlike previous project
courses, in this course, all of you will be working on **the same
problem**. In particular, you will be building and deploying ensemble
machine learning models in the cloud to predict daily rainfall in
Australia on a large dataset (~6 GB), where features are outputs of
different climate models, and the target is the actual rainfall
observation.

You will be using [this dataset on
figshare](https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681).
This folder has the output of different climate models as features, and
our ultimate goal is to build an ensemble model on these outputs and
compare the results with the actual rainfall. At the end of the project,
you should have your ML model deployed in the cloud for others to use.

During this course, you will work towards this goal step by step in four
milestones.

<br><br>

## Milestone 1 checklist

Part of the purpose of this milestone is to annoy you by making you work
with large data in `Pandas` and vanilla CSV files. Typically these are
not the best for dealing with large data. Along the way, you will also
explore some useful tools for working with big data.

### 1. Team-work contract

rubric={correctness:10}

Similar to what you did in DSCI 522 and DSCI 524, create a teamwork
contract. The contract should outline how you are committed to working
together so that you are accountable to one another. Again, you may
start with your team contract document from previous project courses and
adapt it to your new team. It is a fairly personal document, so please
do not push it to your public repositories. Instead, save it somewhere
your team can easily share it, and share a link to it or a copy.

### 2. Creating a repository and project structure

rubric={mechanics:10}

1.  Similar to previous project courses, create a [public repository](https://github.com/UBC-MDS/DSCI525_Group12)
    under [UBC-MDS org](https://github.com/UBC-MDS) for your project.
2.  Write a brief introduction of the project in the `README`.
3.  Create a folder called `notebooks` in the repository and create a
    notebook for this milestone in that folder.

### 3. Downloading the data

rubric={correctness:10}

1.  Download the data from
    [figshare](https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681)
    to your local computer using the [figshare
    API](https://docs.figshare.com) (you need to use the `requests`
    library).

2.  Extract the zip file, again programmatically, similar to how we did
    it in class.

> You can download the data and unzip it manually. But we learned about
> APIs, so we can do it in a reproducible way with the `requests`
> library, similar to how we [did it in
> class](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture1.html#using-rest-api-lab-lecture).

> There are 5 files in the figshare repo. The one we want is: `data.zip`

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

In [2]:
%mkdir /Users/clin404/Documents/UBC_MDS/Block6/DSCI525/figshareexp
%cd /Users/clin404/Documents/UBC_MDS/Block6/DSCI525/figshareexp

mkdir: /Users/clin404/Documents/UBC_MDS/Block6/DSCI525/figshareexp: File exists
/Users/clin404/Documents/UBC_MDS/Block6/DSCI525/figshareexp


In [3]:
# Necessary metadata
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figsharerainfall/"

In [4]:
response = requests.get(url, headers=headers)
response.raise_for_status()       # fail early if the API request did not succeed
data = response.json()            # metadata for the whole article, feel free to check it out
files = data["files"]             # metadata for each file in the article, which is what we want
files

[{'id': 26579150,
  'name': 'daily_rainfall_2014.png',
  'size': 58863,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e'},
 {'id': 26579171,
  'name': 'environment.yml',
  'size': 192,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34'},
 {'id': 26586554,
  'name': 'README.md',
  'size': 5422,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c'},
 {'id': 26766812,
  'name': 'data.zip',
  'size': 814041183,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26766812',
  'supplied_md5': 'b517383f76e77bd03755a63a8f

In [5]:
%%time
files_to_dl = ["data.zip"]  # feel free to add other files here
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], os.path.join(output_directory, file["name"]))

CPU times: user 2.34 s, sys: 2.91 s, total: 5.26 s
Wall time: 58.2 s
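
The figshare metadata listed above includes a `supplied_md5` for each file, so one optional sanity check is to hash the downloaded archive and compare the two values. The following is a minimal sketch of that idea; the helper name `md5_of_file` and the chunk size are illustrative assumptions and not part of the course material.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Hypothetical helper: reading in chunks keeps memory use low even
    for a large archive such as data.zip.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

With the variables from the cells above, the comparison would look like `md5_of_file(os.path.join(output_directory, file["name"])) == file["supplied_md5"]` inside the download loop.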


In [6]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

CPU times: user 7.41 s, sys: 952 ms, total: 8.36 s
Wall time: 8.42 s
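
Before extracting a large archive, it can help to check what it contains. The sketch below lists the CSV members of a zip file without extracting anything; `list_csv_members` is a hypothetical helper name introduced only for illustration.

```python
import zipfile


def list_csv_members(zip_path):
    """Return the names of the CSV members in a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        return [name for name in archive.namelist() if name.endswith(".csv")]
```

For this project it would be called as `list_csv_members(os.path.join(output_directory, "data.zip"))`.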


In [7]:
%ls -ltr figsharerainfall/

total 12049264
-rw-r--r--   1 clin404  staff  814041183 28 Mar 21:47 data.zip
-rw-r--r--   1 clin404  staff   95376895 28 Mar 21:47 MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff   94960113 28 Mar 21:47 AWI-ESM-1-1-LR_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff   82474546 28 Mar 21:47 NorESM2-LM_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff  127613760 28 Mar 21:47 ACCESS-CM2_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff  232118894 28 Mar 21:47 FGOALS-f3-L_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff  330360682 28 Mar 21:47 CMCC-CM2-HR4_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff  254009247 28 Mar 21:47 MRI-ESM2-0_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff  235661418 28 Mar 21:47 GFDL-CM4_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff  294260911 28 Mar 21:47 BCC-CSM2-MR_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  staff  295768615 28 Mar 21:47 EC-Earth3-Veg-LR_daily_rainfall_NSW.csv
-rw-r--r--   1 clin404  s

In [8]:
%%time
df_mpi = pd.read_csv("figsharerainfall/MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv")
df_mpi

CPU times: user 386 ms, sys: 28 ms, total: 414 ms
Wall time: 413 ms


|  | time | lat_min | lat_max | lon_min | lon_max | rain (mm/day) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 1889-01-01 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.244226e-13 |
| 1 | 1889-01-02 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.217326e-13 |
| 2 | 1889-01-03 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.498125e-13 |
| 3 | 1889-01-04 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.251282e-13 |
| 4 | 1889-01-05 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 4.270161e-13 |
| ... | ... | ... | ... | ... | ... | ... |
| 966415 | 2014-12-27 12:00:00 | -31.709369 | -29.844118 | 152.8125 | 154.6875 | 3.218651e-04 |
| 966416 | 2014-12-28 12:00:00 | -31.709369 | -29.844118 | 152.8125 | 154.6875 | 4.609420e-13 |
| 966417 | 2014-12-29 12:00:00 | -31.709369 | -29.844118 | 152.8125 | 154.6875 | 5.685789e+00 |
| 966418 | 2014-12-30 12:00:00 | -31.709369 | -29.844118 | 152.8125 | 154.6875 | 1.231543e+01 |


In [9]:
%%time
df_awi = pd.read_csv("figsharerainfall/AWI-ESM-1-1-LR_daily_rainfall_NSW.csv")
df_awi

CPU times: user 388 ms, sys: 30.4 ms, total: 418 ms
Wall time: 418 ms


|  | time | lat_min | lat_max | lon_min | lon_max | rain (mm/day) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 1889-01-01 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 3.129635e-02 |
| 1 | 1889-01-02 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 1.083881e-13 |
| 2 | 1889-01-03 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 1.056313e-13 |
| 3 | 1889-01-04 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 1.080510e-13 |
| 4 | 1889-01-05 12:00:00 | -35.439867 | -33.574619 | 141.5625 | 143.4375 | 9.914916e-14 |
| ... | ... | ... | ... | ... | ... | ... |
| 966415 | 2014-12-27 12:00:00 | -31.709369 | -29.844118 | 152.8125 | 154.6875 | 1.088772e-13 |
| 966416 | 2014-12-28 12:00:00 | -31.709369 | -29.844118 | 152.8125 | 154.6875 | 7.857531e-02 |
| 966417 | 2014-12-29 12:00:00 | -31.709369 | -29.844118 | 152.8125 | 154.6875 | 3.825708e+00 |
| 966418 | 2014-12-30 12:00:00 | -31.709369 | -29.844118 | 152.8125 | 154.6875 | 6.477188e+00 |


### 4. Combining data CSVs

rubric={correctness:10,reasoning:10}

1.  Combine data CSVs into a single CSV using pandas.

2.  When combining the CSV files, add an extra column called "model"
    that identifies the model.
    -   Tip 1: You can populate this column from the file name. For
        example, for the file name
        "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is
        SAM0-UNICON.
    -   Tip 2: Remember how we added a "year" column when we combined
        the airline CSVs. Here the regex should capture the text before
        the first underscore, i.e., `/([^_]*)`.

> Note: There is a file called `observed_daily_rainfall_SYD.csv` in the
> data folder that you downloaded. Make sure you exclude this file
> (programmatically or just take out that file from the folder) before
> you combine CSVs. We will use this file in our next milestone.

3.  ***Compare*** run times on different machines within your team and
    summarize your observations.

> Warning: Some of you might not be able to do it on your laptop. It's
> fine if you're unable to do it. Just make sure you discuss the reasons
> why you might not have been able to run this on your laptop.
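
The regex idea from Tip 2 can be sketched as follows. The function name `model_name` is a hypothetical helper chosen for illustration; splitting the file name on `_` and taking the first piece gives the same result.

```python
import re


def model_name(filename):
    """Return the part of a file name that comes before the first underscore."""
    # [^_]* matches the longest run of non-underscore characters at the start
    match = re.match(r"([^_]*)", filename)
    return match.group(1)


print(model_name("SAM0-UNICON_daily_rainfall_NSW.csv"))  # prints SAM0-UNICON
```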

In [10]:
%cd figsharerainfall

/Users/clin404/Documents/UBC_MDS/Block6/DSCI525/figshareexp/figsharerainfall


In [11]:
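# Collect the per-model CSVs, skipping the observed data and any combined file from an earlier run
excluded = {'observed_daily_rainfall_SYD.csv', 'daily_rainfall.csv'}
files = [f for f in glob.glob('*.csv') if f not in excluded]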

In [12]:
files

['MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv',
 'AWI-ESM-1-1-LR_daily_rainfall_NSW.csv',
 'NorESM2-LM_daily_rainfall_NSW.csv',
 'ACCESS-CM2_daily_rainfall_NSW.csv',
 'FGOALS-f3-L_daily_rainfall_NSW.csv',
 'CMCC-CM2-HR4_daily_rainfall_NSW.csv',
 'MRI-ESM2-0_daily_rainfall_NSW.csv',
 'GFDL-CM4_daily_rainfall_NSW.csv',
 'BCC-CSM2-MR_daily_rainfall_NSW.csv',
 'EC-Earth3-Veg-LR_daily_rainfall_NSW.csv',
 'CMCC-ESM2_daily_rainfall_NSW.csv',
 'NESM3_daily_rainfall_NSW.csv',
 'MPI-ESM1-2-LR_daily_rainfall_NSW.csv',
 'ACCESS-ESM1-5_daily_rainfall_NSW.csv',
 'FGOALS-g3_daily_rainfall_NSW.csv',
 'INM-CM4-8_daily_rainfall_NSW.csv',
 'MPI-ESM1-2-HR_daily_rainfall_NSW.csv',
 'TaiESM1_daily_rainfall_NSW.csv',
 'NorESM2-MM_daily_rainfall_NSW.csv',
 'CMCC-CM2-SR5_daily_rainfall_NSW.csv',
 'KIOST-ESM_daily_rainfall_NSW.csv',
 'INM-CM5-0_daily_rainfall_NSW.csv',
 'MIROC6_daily_rainfall_NSW.csv',
 'BCC-ESM1_daily_rainfall_NSW.csv',
 'GFDL-ESM4_daily_rainfall_NSW.csv',
 'CanESM5_daily_rainfall_NSW.csv',
 'SAM0-

In [13]:
%%time
daily_rainfall_df = pd.DataFrame()

for file in files:
    temp_df = pd.read_csv(file)
    temp_df['model'] = file.split("_")[0]
    daily_rainfall_df = pd.concat([daily_rainfall_df, temp_df])

CPU times: user 30.4 s, sys: 7.77 s, total: 38.2 s
Wall time: 38.3 s
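
The loop above grows the dataframe with `pd.concat` on every iteration, which copies the accumulated data each time. A common alternative is to collect the per-file frames in a list and concatenate once at the end. The sketch below shows that pattern; `combine_csvs` is a hypothetical helper, and it uses `ignore_index=True`, which differs from the cell above by renumbering the rows. It was not timed on this dataset, so the timings recorded in this notebook do not apply to it.

```python
import os

import pandas as pd


def combine_csvs(paths):
    """Read each CSV, tag it with its model name, and concatenate once at the end."""
    frames = []
    for path in paths:
        frame = pd.read_csv(path)
        # The model name is the part of the file name before the first underscore
        frame["model"] = os.path.basename(path).split("_")[0]
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

With the `files` list from the earlier cell, this would be called as `daily_rainfall_df = combine_csvs(files)`.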


In [14]:
daily_rainfall_df.shape

(62467843, 7)

In [15]:
%%time
daily_rainfall_df.to_csv("daily_rainfall.csv", index=False)

CPU times: user 2min 51s, sys: 6.44 s, total: 2min 58s
Wall time: 2min 58s


In [24]:
use_cols = ['lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)', 'model']

### 5. Load the combined CSV to memory and perform a simple EDA

rubric={correctness:10,reasoning:10}

1.  Investigate at least two of the following approaches to reduce
    memory usage while performing the EDA (e.g., value_counts). Refer to
    lecture notes
    [here](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture1.html#some-tactics-to-deal-with-memory-issue).
    - Changing `dtype` of your data
    - Load just columns that we want
    - Loading in chunks
2.  ***Compare*** run times on different machines within your team and
    summarize your observations.

In [19]:
%%time
df = pd.read_csv("daily_rainfall.csv")

CPU times: user 25.1 s, sys: 2.91 s, total: 28 s
Wall time: 28.4 s


#### 5.1-a Loading raw data without changing anything

In [20]:
%%time
df.describe()

CPU times: user 6.4 s, sys: 1.95 s, total: 8.35 s
Wall time: 8.59 s


|  | lat_min | lat_max | lon_min | lon_max | rain (mm/day) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| count | 59248540.0 | 62467840.0 | 59248540.0 | 62467840.0 | 59248540.0 |
| mean | -33.10482 | -31.97757 | 146.9059 | 148.215 | 1.90117 |
| std | 1.963549 | 1.992067 | 3.793784 | 3.809994 | 5.585735 |
| min | -36.46739 | -36.0 | 140.625 | 141.25 | -3.807373e-12 |
| 25% | -34.86911 | -33.66221 | 143.4375 | 145.0 | 3.838413e-06 |
| 50% | -33.0 | -32.04188 | 146.875 | 148.125 | 0.06154947 |
| 75% | -31.4017 | -30.15707 | 150.1875 | 151.3125 | 1.020918 |
| max | -29.9 | -27.90606 | 153.75 | 155.625 | 432.9395 |


#### 5.1-b Changing `dtype`

In [21]:
%%time
df.astype('float32', errors='ignore').describe()

CPU times: user 6 s, sys: 878 ms, total: 6.88 s
Wall time: 6.89 s


|  | lat_min | lat_max | lon_min | lon_max | rain (mm/day) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| count | 59248540.0 | 62467840.0 | 59248540.0 | 62467840.0 | 59248540.0 |
| mean | -33.10497 | -31.97765 | 146.9057 | 148.215 | 1.901173 |
| std | 1.963549 | 1.992067 | 3.793784 | 3.809994 | 5.585735 |
| min | -36.46739 | -36.0 | 140.625 | 141.25 | -3.807373e-12 |
| 25% | -34.86911 | -33.66221 | 143.4375 | 145.0 | 3.838413e-06 |
| 50% | -33.0 | -32.04189 | 146.875 | 148.125 | 0.06154947 |
| 75% | -31.4017 | -30.15707 | 150.1875 | 151.3125 | 1.020918 |
| max | -29.9 | -27.90606 | 153.75 | 155.625 | 432.9395 |


In [25]:
print(f"Memory usage with float64: {df[use_cols].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df[use_cols].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with float64: 2998.46 MB
Memory usage with float32: 1749.10 MB
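
The cells above convert the dtype after the full dataframe is already in memory. Another option is to pass `dtype` to `pd.read_csv` so the smaller types are used from the start. The sketch below illustrates this on a tiny in-memory CSV, since the real file is large; the sample rows and the choice of `category` for the `model` column are assumptions for illustration. It also uses `memory_usage(deep=True)`, because without `deep=True` pandas counts only the pointers of object columns and not the strings they refer to.

```python
import io

import pandas as pd

# Tiny stand-in for daily_rainfall.csv (illustrative values only)
csv_text = (
    "lat_min,rain (mm/day),model\n"
    "-35.4,0.5,MPI-ESM-1-2-HAM\n"
    "-33.5,1.25,MPI-ESM-1-2-HAM\n"
)

# Request compact dtypes at load time instead of converting afterwards
dtypes = {"lat_min": "float32", "rain (mm/day)": "float32", "model": "category"}
small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(small.dtypes)
# deep=True also counts the memory held by string/object values
print(small.memory_usage(deep=True).sum())
```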


#### 5.1-c Selecting columns of interest

In [26]:
%%time
# Summarize only the columns of interest (selected after loading the full dataframe)
df[use_cols].describe()

CPU times: user 6.8 s, sys: 1.34 s, total: 8.15 s
Wall time: 8.15 s


|  | lat_min | lat_max | lon_min | lon_max | rain (mm/day) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| count | 59248540.0 | 62467840.0 | 59248540.0 | 62467840.0 | 59248540.0 |
| mean | -33.10482 | -31.97757 | 146.9059 | 148.215 | 1.90117 |
| std | 1.963549 | 1.992067 | 3.793784 | 3.809994 | 5.585735 |
| min | -36.46739 | -36.0 | 140.625 | 141.25 | -3.807373e-12 |
| 25% | -34.86911 | -33.66221 | 143.4375 | 145.0 | 3.838413e-06 |
| 50% | -33.0 | -32.04188 | 146.875 | 148.125 | 0.06154947 |
| 75% | -31.4017 | -30.15707 | 150.1875 | 151.3125 | 1.020918 |
| max | -29.9 | -27.90606 | 153.75 | 155.625 | 432.9395 |


In [29]:
print(f"Memory usage with original dataframe: {df.memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with selected column dataframe: {df[use_cols].memory_usage().sum() / 1e6:.2f} MB")

Memory usage with original dataframe: 3498.20 MB
Memory usage with selected column dataframe: 2998.46 MB
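
The selection above happens after the whole CSV has been loaded, so peak memory is unchanged. To avoid loading the unwanted column at all, `pd.read_csv` accepts a `usecols` argument. The sketch below shows this on a one-row in-memory CSV with the same column names; the sample row is an assumption for illustration.

```python
import io

import pandas as pd

# One-row stand-in with the same columns as daily_rainfall.csv (illustrative values only)
csv_text = (
    "time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model\n"
    "1889-01-01 12:00:00,-35.4,-33.5,141.5,143.4,0.5,MPI-ESM-1-2-HAM\n"
)
use_cols = ["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)", "model"]

# Only the listed columns are parsed; "time" is never materialized
subset = pd.read_csv(io.StringIO(csv_text), usecols=use_cols)
print(list(subset.columns))
```

On the real file this would be `pd.read_csv("daily_rainfall.csv", usecols=use_cols)`.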


#### 5.1-d Loading data in chunks
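
This subsection has no cell in the notebook yet. As a minimal sketch of the approach, `pd.read_csv` with `chunksize` returns an iterator of dataframes, so an aggregate such as `value_counts` can be accumulated one chunk at a time without holding the full dataset in memory. The example uses a tiny in-memory CSV and a `Counter` as the accumulator; both are assumptions for illustration, and the chunk size for the real file would be far larger.

```python
import io
from collections import Counter

import pandas as pd

# Tiny stand-in for daily_rainfall.csv (illustrative values only)
csv_text = "model,rain (mm/day)\nA,0.1\nB,0.2\nA,0.3\nA,0.4\nB,0.5\n"

counts = Counter()
# Each iteration yields a dataframe with at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Add this chunk's per-model row counts to the running totals
    counts.update(chunk["model"].value_counts().to_dict())

print(dict(counts))
```

On the real file the loop would read `pd.read_csv("daily_rainfall.csv", chunksize=...)` with a chunk size chosen to fit comfortably in RAM.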

### 6. Perform a simple EDA in R

rubric={correctness:15,reasoning:10}

1.  Choose one of the methods listed below for transferring the
    dataframe (i.e., the entire dataset) from Python to R, and explain
    why you opted for this approach instead of the others.
    -   [Parquet
        file](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture2.html#converting-csv-parquet)
    -   [Pandas
        exchange](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture1.html#use-r-and-python-interchangeably)
    -   [Arrow
        exchange](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture2.html#use-r-and-python-interchangeably-with-arrow)
2.  Once you have the dataframe in R, perform a simple EDA.

In [1]:
import os
os.environ['R_HOME'] = '/Users/clin404/miniconda3/envs/525/Lib/R' # Set this to your R path 

In [None]:
%load_ext rpy2.ipython

Unable to determine R library path: [Errno 2] No such file or directory: '/Users/clin404/miniconda3/envs/525/Lib/R/bin/Rscript'


In [27]:
%%R -i r_table
start_time <- Sys.time()

suppressMessages(library(dplyr))
result <- r_table %>% 
    group_by(model) %>% 
    summarize(
        max_rain = max(`rain (mm/day)`),
        median_rain = median(`rain (mm/day)`),
        mean_rain = mean(`rain (mm/day)`),
        min_rain = min(`rain (mm/day)`)
    )

end_time <- Sys.time()
print(result %>% collect())
print(end_time - start_time)

UsageError: Cell magic `%%R` not found.


## Specific expectations for this milestone

-   In this milestone, we are looking for a well-documented and
    self-explanatory notebook that explores different options to tackle
    big data on your laptop.
-   Please discuss any challenges or difficulties you faced when dealing
    with this large amount of data on your laptop. You can stop
    combining the data if it takes more than 30 minutes. Briefly explain
    your approach to overcoming the challenges or reasons why you could
    not overcome them.
-   For questions 5 and 6, you are free to choose any exploratory data
    analysis (EDA) task you want. Visualization is not necessary;
    summarizing the data is enough. However, if you want to install
    additional packages for visualization that are not included in the
    .yml file, feel free to install them at the top of your notebook. If
    you want to install packages in R, you can do so using
    `install.packages("dplyr")` in a `%%R` magic cell.
-   If someone in your team is facing issues with using R in a Python
    notebook, you can ignore it, as you will not need it for any other
    milestones. The main purpose of showing it in the lecture was to
    introduce and get a feel for the serialization and deserialization
    concept.
-   You only need to ***compare*** the time with other team members for
    questions 4 and 5. You do not need to do this for question 6. You
    can use the following table to record your results. Feel free to add
    any other relevant columns.

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken (merge / save to CSV) |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
|  Chen Lin   |    macOS 13.2    | 32 GB | Apple M2 Max |  Yes  | 38.3 s / 2 min 58 s |
|  Member 2   |                  |     |           |        |            |
|  Member 3   |                  |     |           |        |            |
|  Member 4   |                  |     |           |        |            |
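
To fill in the table consistently across machines, each team member could collect the basic details and timings programmatically. The sketch below uses only the standard library; `machine_summary` and `time_call` are hypothetical helper names, and RAM and SSD status are not covered because the standard library does not report them.

```python
import os
import platform
import time


def machine_summary():
    """Return basic machine details for the comparison table."""
    return {
        "os": f"{platform.system()} {platform.release()}",
        "processor": platform.machine(),
        "cpu_count": os.cpu_count(),
    }


def time_call(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


print(machine_summary())
```

For example, `_, seconds = time_call(pd.read_csv, "daily_rainfall.csv")` would give a wall time comparable to the `%%time` output used in this notebook.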

<br><br>

## Submission instructions

rubric={mechanics:5}

In the textbox provided on Canvas for the Milestone 1 assignment
include:

-   The GitHub URL to your notebook.

As a comment, include:

-   Repo link
-   Teamwork contract