# DSCI 525 - Web and Cloud Computing

## Milestone 1: Tackling big data on your laptop

## Overall project goal and data 

During this course, you will be working on a team project involving big data. The purpose is to get exposure to working with much larger datasets than you have previously in MDS. You have been assigned to teams of three or four. (See the group assignment in [Canvas](https://canvas.ubc.ca/courses/106517).) Unlike previous project courses, in this course all of you will be working on **the same problem**. In particular, you will be building and deploying ensemble machine learning models in the cloud to predict daily rainfall in Australia on a large dataset (~6 GB), where the features are outputs of different climate models and the target is the actual rainfall observation.  

You will be using [this dataset on figshare](https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681). This folder has the output of different climate models as features, and our ultimate goal is to build an ensemble model on these outputs and compare the results with the actual rainfall. At the end of the project, you should have your ML model deployed in the cloud for others to use. 

During this course, you will work towards this goal step by step in four milestones.  

<br><br>

## Milestone 1 checklist  



### 1. Teamwork contract
rubric={correctness:10}

Similar to what you did in DSCI 522 and DSCI 524, create a teamwork contract. The contract should outline how you are committed to working together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it to your new team. It is a fairly personal document, so please do not push it to your public repositories. Instead, save it somewhere your team can easily share it, and share a link to it or a copy with us in your Canvas submission to show that you completed it.

### 2. Creating a repository and project structure 
rubric={mechanics:10}

1. Similar to previous project courses, create a public repository under [UBC-MDS org](https://github.com/UBC-MDS) for your project. 
2. Write a brief introduction of the project in the `README`. 
3. Create a folder called `notebooks` in the repository and create a notebook for this milestone in that folder.

In [None]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
from IPython.display import display
import pyarrow.dataset as ds
import pyarrow as pa
from pyarrow import csv
import rpy2_arrow.pyarrow_rarrow as pyra

### 3. Downloading the data 
rubric={correctness:10}

1. Download the data from [figshare](https://figshare.com/articles/dataset/Daily_rainfall_over_NSW_Australia/14096681) to your local computer using the [figshare API](https://docs.figshare.com) (you need to use the `requests` library).

2. Extract the zip file, again programmatically, similar to how we did it in class. 

>  You can download the data and unzip it manually. But we learned about APIs, so we can do it in a reproducible way with the `requests` library, similar to how we [did it in class](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture1.html#using-rest-api-lab-lecture).

> There are 5 files in the figshare repo. The one we want is: `data.zip`

In [None]:
# References code from Lecture notes: https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture1.html#using-rest-api-lab-lecture
# article with daily rainfall data
article_id = 14096681  
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshare_data/"

In [None]:
# Get request to get the files available
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  
files = data["files"]          
files

In [None]:
%%time

# Download data.zip
files_to_dl = ["data.zip"] 
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])
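
The cell above fetches the archive with `urlretrieve`. Since the task asks for the `requests` library, a minimal sketch of the same step with a streamed `requests` download might look like the following. The helper names `select_files` and `download_file`, the 60-second timeout, and the 1 MiB chunk size are illustrative assumptions, not part of the lecture code.

```python
import os

import requests


def select_files(files, wanted):
    """Return the figshare file records whose name is in `wanted`."""
    return [f for f in files if f["name"] in wanted]


def download_file(url, dest_path, chunk_size=1024 * 1024):
    """Stream `url` to `dest_path` so the archive is never fully held in memory."""
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()  # fail loudly on HTTP errors
        with open(dest_path, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
    return dest_path
```

With the `files` list from the API response, calling `download_file(f["download_url"], os.path.join(output_directory, f["name"]))` for each `f` in `select_files(files, {"data.zip"})` should reproduce the download.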

In [None]:
%%time

# Extract data.zip
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)
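
If only the CSV members of the archive are needed, `zipfile` can also extract a chosen subset. The sketch below uses a hypothetical helper, `extract_csvs`, and assumes the files of interest end in `.csv`.

```python
import os
import zipfile


def extract_csvs(zip_path, out_dir):
    """Extract only the .csv members of `zip_path` into `out_dir`."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        # Keep only CSV members; other archive entries are ignored
        members = [m for m in archive.namelist() if m.lower().endswith(".csv")]
        archive.extractall(out_dir, members=members)
    return members
```

The returned list of member names can double as a quick check that the expected model files are present.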



### 4. Combining data CSVs
rubric={correctness:10,reasoning:10}

1. Combine data CSVs into a single CSV using pandas.
    
2. When combining the CSV files, add an extra column called "model" that identifies the model.
    - Tip 1: You can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON.
    - Tip 2: Remember how we added the "year" column when we combined the airline CSVs. Here the regex should capture the text before the first underscore, i.e., "/([^_]*)".

> Note: There is a file called `observed_daily_rainfall_SYD.csv` in the data folder that you downloaded. Make sure you exclude this file (programmatically or just take out that file from the folder) before you combine CSVs. We will use this file in our next milestone.

3. ***Compare*** run times on different machines within your team and summarize your observations. 

> Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you discuss the reasons why you might not have been able to run this on your laptop. 

In [None]:
# adjust based on your os
%cd "figshare_data/"

In [None]:
%%time
# Exclude the observed rainfall file and any combined_data.csv left over from an earlier run
file_pattern = re.compile(r"^(?!observed_daily_rainfall_SYD|combined_data).*\.csv$")

# Extract the model name
def extract_model_name(file_name):
    return re.match(r"([^_]*)", file_name).group(1)

combined_data = []
for file_name in os.listdir("."):
    if file_pattern.match(file_name):
        model_name = extract_model_name(file_name)
        print(model_name)
        data = pd.read_csv(file_name)
        data["model"] = model_name
        combined_data.append(data)

# Combine all the dataframes
combined_data = pd.concat(combined_data, ignore_index=True)
combined_data.to_csv("combined_data.csv", index=False)
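
`pd.concat` keeps every per-file frame and the combined copy in memory at the same time, which is one plausible reason this step can fail on a laptop with little RAM. A lower-memory sketch, assuming all model files share the same columns, appends each file to the output as soon as it is read. The function name `combine_csvs` and its parameters are illustrative.

```python
import glob
import os
import re

import pandas as pd


def combine_csvs(data_dir, out_path, exclude=("observed_daily_rainfall_SYD.csv",)):
    """Append every model CSV in `data_dir` to `out_path`, one file at a time."""
    out_name = os.path.basename(out_path)
    first = True
    for path in sorted(glob.glob(os.path.join(data_dir, "*.csv"))):
        name = os.path.basename(path)
        if name in exclude or name == out_name:
            continue  # skip the observations and any previous output
        df = pd.read_csv(path)
        # Model name is the text before the first underscore in the file name
        df["model"] = re.match(r"([^_]*)", name).group(1)
        # Write the header once, then append rows without it
        df.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False
    return out_path
```

Only one model's data is in memory at any point, at the cost of not having the combined frame available afterwards without reading it back.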

In [None]:
combined_data

In [None]:
%%sh
du -sh combined_data.csv

In [None]:
# Table for Q4
data = {
    "Team Member": ["Andy Wang", "Samson Bakos", "Raul Aguilar", "Arjun Radhakrishnan"],
    "Operating System": ["Windows 11", "MacOS Ventura 13.2", "MacOS Monterey 12.5.1", "Windows 11"],
    "RAM": ["32GB", "16GB", "8GB", "16GB"],
    "Processor": ["Intel(R) Core(TM) i7-10870H", "Apple M1", "Apple M2", "Intel(R) Core(TM) i7-12700H"],
    "Is SSD": ["Y", "Y", "Y", "Y"],
    "Time taken": ["6min 37s", "3min 55s", "3min 32s", "3min 38s"]
}
table = pd.DataFrame(data)

display(table)


### 5. Load the combined CSV to memory and perform a simple EDA
rubric={correctness:10,reasoning:10}

1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts). Refer to lecture notes [here](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture1.html#some-tactics-to-deal-with-memory-issue).
    - Changing `dtype` of your data
    - Load just columns that we want
    - Loading in chunks
    
2. ***Compare*** run times on different machines within your team and summarize your observations.
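
As an illustration of the "loading in chunks" tactic combined with "load just columns that we want", the sketch below counts values of one column without ever holding the whole file. The function name `chunked_value_counts`, the default chunk size, and the tiny in-memory sample are assumptions made for the example.

```python
import io
from collections import Counter

import pandas as pd


def chunked_value_counts(csv_source, column, chunksize=1_000_000):
    """Count values of `column` while reading the CSV in chunks."""
    counts = Counter()
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        # Add this chunk's counts to the running totals
        counts.update(chunk[column].value_counts().to_dict())
    return pd.Series(counts).sort_values(ascending=False)


# Tiny in-memory stand-in for combined_data.csv (made-up rows)
sample = io.StringIO("model,rain (mm/day)\nA,0.1\nB,0.0\nA,2.5\n")
print(chunked_value_counts(sample, "model", chunksize=2))
```

Peak memory is then bounded by the chunk size rather than by the file size, though the total run time may be similar or slightly longer.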

In [None]:
%%time

columns = ["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)", "model"] 
# dropping Time because timestamp not useful for EDA
# keeping the numeric columns and the model column for EDA

df = pd.read_csv("combined_data.csv", usecols = columns)

In [None]:
%%time

# Change float 64 to float 32
float64_cols = list(df.select_dtypes(include='float64'))
df[float64_cols] = df[float64_cols].astype('float32')

# create column subsets to simplify computations
numeric_cols= ["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
cat_cols = ["model"]


In [None]:
# deep=True also counts the strings stored in the object-typed "model" column
print(f"Loaded size is {round(df.memory_usage(deep=True).sum()*1e-9,2)} GB")
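
To see how much a `dtype` change can save, the two versions of a frame can be measured side by side. The sketch below uses made-up stand-in data rather than the rainfall file, and converting the string column to `category` is an extra option not used in the cells above. Note that `memory_usage()` counts only the pointers of object columns unless `deep=True` is passed.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data: one float column and a low-cardinality string column
df_demo = pd.DataFrame({
    "rain (mm/day)": np.random.default_rng(0).random(10_000),
    "model": ["MODEL-A", "MODEL-B"] * 5_000,
})

before = df_demo.memory_usage(deep=True).sum()

# Halve the float width and store repeated strings as integer codes
df_small = df_demo.astype({"rain (mm/day)": "float32", "model": "category"})
after = df_small.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
```

Whether `float32` precision is acceptable depends on the analysis, so the narrower type is a trade-off rather than a free saving.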

In [None]:
%%time

df.info()

In [None]:
%%time


df[cat_cols].nunique()

In [None]:
%%time

df[numeric_cols].describe()

In [None]:
%%time

df[cat_cols].value_counts()

In [None]:
%%time

df.isnull().sum()

In [None]:
%%time

df[numeric_cols].corr()

In [None]:
# table for q5
data = {
    "Team Member": ["Samson Bakos", "Raul Aguilar", "Andy Wang", "Arjun Radhakrishnan"],
    "Operating System": ["MacOS Ventura 13.2", "MacOS Monterey 12.5", "Windows 11", "Windows 11"],
    "RAM": ["16GB", "8GB", "32GB", "16GB"],
    "Processor": ["M1", "M2", "Intel(R) Core(TM) i7-10870H", "Intel(R) Core(TM) i7-12700H"],
    "Is SSD": ["Y", "Y", "Y", "Y"],
    "Time taken": ["51.9s", "49.1s", "66.95", "52.1s"]
}
table = pd.DataFrame(data)

display(table)


### 6. Perform a simple EDA in R
rubric={correctness:15,reasoning:10}

1. Choose one of the methods listed below for transferring the dataframe (i.e., the entire dataset) from Python to R, and explain why you opted for this approach instead of the others.
    - [Parquet file](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture2.html#converting-csv-parquet)
    - [Pandas exchange](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture1.html#use-r-and-python-interchangeably)
    - [Arrow exchange](https://pages.github.ubc.ca/MDS-2022-23/DSCI_525_web-cloud-comp_students/lectures/lecture2.html#use-r-and-python-interchangeably-with-arrow)
2. Once you have the dataframe in R, perform a simple EDA.


__ANSWER__:

One of the primary factors in our decision to choose `Arrow exchange` over `Parquet file` and `Pandas exchange` is its efficiency in serialization and deserialization, which makes better use of memory. Additionally, since `Arrow exchange` uses a unified memory representation, it facilitates seamless interchange between multiple programming languages. Moreover, the R `arrow` package integrates well with essential R libraries such as `dplyr`, `lubridate`, and `stringr`. This integration lets users perform comprehensive data wrangling on large data frames.

In [None]:
# Change to your local path if necessary
import os
os.environ['R_HOME'] = '/Users/raulaguilar/opt/miniconda3/envs/525_dev/lib/R'

In [None]:
%load_ext rpy2.ipython 
# Load R magic in notebook

In [None]:
# Combined data path
combinedcsv = "combined_data.csv"

In [None]:
%%time

# Build `pyarrow dataset`
dataset = ds.dataset(combinedcsv, format="csv")

# Converting the `pyarrow dataset` to a `pyarrow table`
table = dataset.to_table()

# Converting a `pyarrow table` to a `rarrow table`
r_table = pyra.converter.py2rpy(table)

In [None]:
%%time
%%R -i r_table

# Check basic data structure
suppressMessages({
  library(dplyr)
})

r_table |>
    glimpse()

In [None]:
%%time
%%R

# Records by model, in descending order
model_rows <- r_table |> 
    count(model) |> 
    arrange(desc(n)) |> 
    collect()
    
print(model_rows)

In [None]:
%%R

# Null count for target variable
target_null <- r_table |>
    filter(is.na(`rain (mm/day)`)) |> 
    count() |> 
    pull()
    
cat("There are", target_null, "missing values in the 'rain (mm/day)' column.\n")

In [None]:
%%time
%%R

# summary statistics for numeric columns
not_null <- r_table |>
    select(lat_min, lat_max, lon_min, lon_max, `rain (mm/day)`) |> 
    filter(!is.na(`rain (mm/day)`)) |> 
    collect()
        
max_values <- sapply(not_null, max)
min_values <- sapply(not_null, min)
mean_values <- sapply(not_null, mean)
sd_values <- sapply(not_null, sd)

print("Mean values:")
print(mean_values)
print("Min values:")
print(min_values)
print("Max values:")
print(max_values)
print("Sd values:")
print(sd_values)

## Specific expectations for this milestone 

- In this milestone, we are looking for a well-documented and self-explanatory notebook that explores different options to tackle big data on your laptop.
- Please discuss any challenges or difficulties you faced when dealing with this large amount of data on your laptop. You can stop combining the data if it takes more than 30 minutes. Briefly explain your approach to overcoming the challenges or reasons why you could not overcome them.
- For questions 5 and 6, you are free to choose any exploratory data analysis (EDA) task you want. Visualization is not necessary; summarizing the data is enough. However, if you want to install additional packages for visualization that are not included in the .yml file, feel free to install them at the top of your notebook. If you want to install packages in R, you can do so using `install.packages("dplyr")` in a `%%R` magic cell.
- If someone in your team is facing issues with using R in a Python notebook, you can ignore it, as you will not need it for any other milestones. The main purpose of showing it in the lecture was to introduce and get a feel for the serialization and deserialization concept.
- You only need to ***compare*** the time with other team members for questions 4 and 5. You do not need to do this for question 6. You can use the following table to record your results. Feel free to add any other relevant columns.


| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Member 1    |                  |     |           |        |            |
| Member 2    |                  |     |           |        |            |
| Member 3    |                  |     |           |        |            |
| Member 4    |                  |     |           |        |            |
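
To fill in the table consistently across team members, a small helper can time a step and collect some of the machine details. This is a sketch using only the standard library; the names `format_duration` and `timed` are hypothetical, and RAM and SSD status are left to be entered by hand because the standard library does not report them portably.

```python
import platform
import time


def format_duration(seconds):
    """Format seconds in the style used in the tables, e.g. '6min 37s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}min {secs}s" if minutes else f"{secs}s"


def timed(func, *args, **kwargs):
    """Run `func` and return (result, row), where row holds table fields."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    row = {
        "Operating System": f"{platform.system()} {platform.release()}",
        "Processor": platform.processor() or platform.machine(),
        "Time taken": format_duration(elapsed),
    }
    return result, row
```

For example, `df, row = timed(pd.read_csv, "combined_data.csv")` would return the loaded frame along with a row that can be appended to the comparison table.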

## Submission instructions
rubric={mechanics:5}

In the textbox provided on Canvas for the Milestone 1 assignment include:

- The GitHub URL to your notebook.

As a comment, include:

- Repo link
- Teamwork contract