# Milestone 3 - Task 3

Link to the GitHub repo for this task: https://github.com/UBC-MDS/DSCI525-Group7/blob/main/notebooks/Milestone3/Milestone3-Task3.ipynb

# Imports

In [1]:
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})
## Add any additional packages that you need. You are free to use any packages for visualization.

## Part 1:

Recall the final goal of this project: we want to build and deploy ensemble machine learning models in the cloud, where the features are the outputs of different climate models and the target is the actual rainfall observation. In this milestone, you'll build these ensemble machine learning models in the cloud.

**Your tasks:**

1. Read the data CSV from your S3 bucket. 
2. Drop rows with NaNs. 
3. Split the data into train (80%) and test (20%) portions with `random_state=123`. 
4. Carry out EDA of your choice on the train split. 
5. Train an ensemble machine learning model using `RandomForestRegressor` and evaluate it with a metric of your choice (e.g., `RMSE`), using `observed` as the target column. 
6. Discuss your results. Are you getting better results with ensemble models compared to the individual climate models? 

> Recall that individual columns in the data are predictions of different climate models. 

### 1. Read data from S3 bucket

In [2]:
import os

# Read the temporary AWS credentials from environment variables
# instead of hard-coding secrets in the notebook
aws_credentials = {"key": os.environ["AWS_ACCESS_KEY_ID"],
                   "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
                   "token": os.environ["AWS_SESSION_TOKEN"]}

df = pd.read_csv("s3://mds-s3-group7/output/ml_data_SYD.csv", 
                 storage_options = aws_credentials,
                 index_col=0, parse_dates=True)

In [3]:
df.head()

time,ACCESS-CM2,ACCESS-ESM1-5,AWI-ESM-1-1-LR,BCC-CSM2-MR,BCC-ESM1,CMCC-CM2-HR4,CMCC-CM2-SR5,CMCC-ESM2,CanESM5,EC-Earth3-Veg-LR,...,MPI-ESM-1-2-HAM,MPI-ESM1-2-HR,MPI-ESM1-2-LR,MRI-ESM2-0,NESM3,NorESM2-LM,NorESM2-MM,SAM0-UNICON,TaiESM1,observed
1889-01-01,0.040427,1.814552,35.579336,4.268112,0.001107466,11.410537,3.322009e-08,2.6688,1.321215,1.515293,...,4.244226e-13,1.390174e-13,6.537884e-05,3.445495e-06,15.76096,4.759651e-05,2.451075,0.221324,2.257933,0.006612
1889-01-02,0.073777,0.303965,4.59652,1.190141,0.0001015323,4.014984,1.3127,0.946211,2.788724,4.771375,...,4.409552,0.1222283,1.049131e-13,4.791993e-09,0.367551,0.4350863,0.477231,3.757179,2.287381,0.090422
1889-01-03,0.232656,0.019976,5.927467,1.003845e-09,1.760345e-05,9.660565,9.10372,0.431999,0.003672,4.23398,...,0.22693,0.3762301,9.758706e-14,0.6912302,0.1562869,9.561101,0.023083,0.253357,1.199909,1.401452
1889-01-04,0.911319,13.623777,8.029624,0.08225225,0.1808932,3.951528,13.1716,0.368693,0.013578,15.252495,...,0.02344586,0.4214019,0.007060915,0.03835721,2.472226e-07,0.5301038,0.002699,2.185454,2.106737,14.869798
1889-01-05,0.698013,0.021048,2.132686,2.496841,4.708019e-09,2.766362,18.2294,0.339267,0.002468,11.920356,...,4.270161e-13,0.1879692,4.504985,3.506923e-07,1.949792e-13,1.460928e-10,0.001026,2.766507,1.763335,0.467628


In [4]:
df.shape

(46020, 26)

### 2. Drop rows with NaNs

In [5]:
# Count the columns that contain at least one null value
df.isnull().any().sum()

15

In [6]:
df = df.dropna()
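The check above counts columns, not rows, so it does not say how many rows `dropna()` removes. The shapes reported in this notebook (46020 rows before, 36791 + 9198 = 45989 after) imply 31 rows were dropped. A minimal sketch of a summary that reports both views, using a made-up toy frame and a hypothetical helper named `nan_summary`:

```python
import numpy as np
import pandas as pd


def nan_summary(frame: pd.DataFrame) -> dict:
    """Summarise missing values by column and by row (hypothetical helper)."""
    return {
        # Number of columns holding at least one NaN
        "columns_with_nan": int(frame.isnull().any(axis=0).sum()),
        # Number of rows holding at least one NaN (these are the rows dropna removes)
        "rows_with_nan": int(frame.isnull().any(axis=1).sum()),
        # Rows remaining after dropping incomplete rows
        "rows_after_dropna": len(frame.dropna()),
    }


# Toy frame standing in for the rainfall data; the values are invented.
toy = pd.DataFrame({
    "model_a": [0.1, np.nan, 2.3, 0.0],
    "model_b": [1.0, 0.5, np.nan, 0.2],
    "observed": [0.0, 0.4, 1.9, 0.1],
})
summary = nan_summary(toy)
print(summary)
```

Calling `nan_summary(df)` before `df.dropna()` would give the row count to expect afterwards.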

### 3. Data splitting (80% train set, 20% test set, random state=123)

In [7]:
df_train, df_test = train_test_split(df.dropna(), test_size=0.2, random_state=123)

In [8]:
print(f"Train set size: {df_train.shape}")
print(f"Test set size: {df_test.shape}")

Train set size: (36791, 26)
Test set size: (9198, 26)


### 4. EDA (TODO)
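This section is still open. One possible starting point, offered as a rough sketch and not as the analysis the group ran, is to tabulate summary statistics per column and rank the climate models by their correlation with the target. The helper `eda_summary` and the synthetic frame below are assumptions for illustration; on the real data it would be called on `df_train`.

```python
import numpy as np
import pandas as pd


def eda_summary(train: pd.DataFrame, target: str = "observed"):
    """Return per-column summary statistics and each feature's correlation with the target."""
    stats = train.describe().T[["mean", "std", "min", "max"]]
    corr = train.corr()[target].drop(target).sort_values(ascending=False)
    return stats, corr


# Synthetic stand-in data: skewed, non-negative values loosely resembling daily rainfall.
rng = np.random.default_rng(0)
obs = rng.gamma(0.5, 5.0, 200)
demo = pd.DataFrame({
    "model_a": obs + rng.normal(0, 1, 200),  # tracks the target closely
    "model_b": rng.gamma(0.5, 5.0, 200),     # independent of the target
    "observed": obs,
})

stats, corr = eda_summary(demo)
print(stats)
print(corr)
```

Histograms of `observed` and of a few model columns (for example via `DataFrame.hist`) would be a natural visual complement, since rainfall data tends to be heavily right-skewed.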

### 5. ML model building and result evaluation

In [9]:
# Drop `observed` rainfall from the data

X_train = df_train.drop(columns=["observed"])
y_train = df_train["observed"]

X_test = df_test.drop(columns=["observed"])
y_test = df_test["observed"]

In [10]:
# Train random forest (RF) ensemble model
model = RandomForestRegressor().fit(X_train, y_train)

# Predict on the training data (the score below is therefore a training RMSE,
# not a held-out estimate)
y_pred = model.predict(X_train)

# Training RMSE of the RF model
rmse = mean_squared_error(y_train, y_pred, squared=False)
print(f'RMSE of random forest (RF) ensemble model: {rmse}')
rmse

RMSE of random forest (RF) ensemble model: 3.1025267029473587


3.1025267029473587
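The RMSE reported here is computed on the same rows the forest was fit on, so it is likely optimistic. A minimal sketch of scoring on both splits follows, using synthetic data and hypothetical names (`rmse_score`, `X_demo`, `y_demo`) so it runs without the S3 file. It takes `np.sqrt` of the MSE because the `squared` argument has been deprecated in recent scikit-learn releases.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_score(y_true, y_pred) -> float:
    """Root mean squared error, written to work across scikit-learn versions."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Synthetic regression problem: one informative feature plus noise.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

rf = RandomForestRegressor(n_estimators=50, random_state=123).fit(X_tr, y_tr)

train_rmse = rmse_score(y_tr, rf.predict(X_tr))
test_rmse = rmse_score(y_te, rf.predict(X_te))
print(f"Train RMSE: {train_rmse:.2f}")
print(f" Test RMSE: {test_rmse:.2f}")
```

In this notebook the equivalent held-out score would come from `model.predict(X_test)` compared against `y_test`.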

In [11]:
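# Training RMSE of a random forest fit on each individual climate model's output
# (one single-feature model per column, scored on the data it was trained on)
models = X_train.columns.to_list()
results = {}

for name in models:  # use `name` so the fitted ensemble `model` is not overwritten
    X = X_train[[name]]
    rf_model = RandomForestRegressor().fit(X, y_train)
    y_pred_ind = rf_model.predict(X)
    results[name] = mean_squared_error(y_train, y_pred_ind, squared=False)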

In [12]:
# Compare RMSE of individual climate model with the RF ensemble model
results_df = pd.DataFrame(data = results.values(),
                          index = results.keys(),
                          columns = ['RMSE'])
results_df[f'Better than RF model RMSE {rmse:0.2f}?'] = results_df['RMSE'] < rmse
print('RMSE comparison of individual climate model with random forest ensemble model:')
results_df

RMSE comparison of individual climate model with random forest ensemble model:


Model,RMSE,Better than RF model RMSE 3.10?
ACCESS-CM2,3.713084,False
ACCESS-ESM1-5,4.213257,False
AWI-ESM-1-1-LR,4.709785,False
BCC-CSM2-MR,4.721479,False
BCC-ESM1,4.781943,False
CMCC-CM2-HR4,3.812931,False
CMCC-CM2-SR5,4.157355,False
CMCC-ESM2,3.992839,False
CanESM5,3.88924,False
EC-Earth3-Veg-LR,5.806866,False


### 6. Discussion of results

Are you getting better results with the random forest ensemble model compared to the individual climate models?

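- On the training data, the random forest ensemble reached an RMSE of 3.10, lower than every single-feature random forest in the table above. The closest is ACCESS-CM2 at 3.71, a gap of about 0.6.
- Both sides of this comparison are training scores, so the size of the gap may not carry over to the test set.
- We can try tuning the hyperparameters to achieve an even better result. 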
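The per-model figures above come from a random forest refit on each single column, so they do not measure the raw error of each climate model's own predictions. An alternative baseline, sketched here under the assumption that every feature column can be read directly as a rainfall prediction on the same scale as `observed`, is to compute the RMSE between each column and the target. The helper `raw_model_rmse` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def raw_model_rmse(frame: pd.DataFrame, target: str = "observed") -> pd.Series:
    """RMSE of each feature column against the target, sorted from best to worst."""
    # Subtract the target from every feature column, row by row
    errors = frame.drop(columns=[target]).sub(frame[target], axis=0)
    return np.sqrt((errors ** 2).mean()).sort_values()


# Toy frame with invented values, standing in for the test split.
demo_frame = pd.DataFrame({
    "model_a": [1.0, 2.0, 3.0, 4.0],
    "model_b": [0.0, 0.0, 0.0, 0.0],
    "observed": [1.0, 2.0, 3.0, 6.0],
})

baseline = raw_model_rmse(demo_frame)
print(baseline)
```

Applied to `df_test`, the resulting series could be compared with the ensemble's test RMSE, which keeps both sides of the comparison on held-out data.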

## Part 2:

### Preparation for deploying model next week

***NOTE: Complete Task 4 of Milestone 3 before continuing here***

We found the best hyperparameter settings with MLlib in Task 4 of Milestone 3. Here we use the same hyperparameters to train a scikit-learn model.

In [13]:
# Hyperparameters from the MLlib search in Task 4 (values match the fitted-model repr below)
model = RandomForestRegressor(n_estimators=15, max_depth=15)
model.fit(X_train, y_train)

RandomForestRegressor(max_depth=15, n_estimators=15)

In [14]:
print(f"Train RMSE: {mean_squared_error(y_train, model.predict(X_train), squared=False):.2f}")
print(f" Test RMSE: {mean_squared_error(y_test, model.predict(X_test), squared=False):.2f}")

Train RMSE: 6.91
 Test RMSE: 8.81


In [15]:
# ready to deploy
dump(model, "model.joblib")

['model.joblib']

***Upload `model.joblib` to S3 under the `output` folder. You choose how to upload it (using the CLI, an SDK, or the web console).***
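For the SDK route, a minimal sketch with boto3 might look like the following. The bucket name is taken from the read path used earlier in this notebook, the key `output/model.joblib` is an assumption about the desired location, and `upload_model` and `main` are hypothetical helpers. Credentials are assumed to be available to boto3 through the environment or an AWS profile.

```python
def upload_model(s3_client, local_path: str, bucket: str, key: str) -> str:
    """Upload a local file with an S3 client and return the resulting S3 URI."""
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client method for file uploads
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def main() -> None:
    # Imported here so the helper above can be used without boto3 installed
    import boto3

    client = boto3.client("s3")
    uri = upload_model(
        client,
        local_path="model.joblib",
        bucket="mds-s3-group7",     # bucket used earlier in this notebook
        key="output/model.joblib",  # assumed target key under the output folder
    )
    print(f"Uploaded to {uri}")
```

Calling `main()` performs the upload. Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before touching the real bucket.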