# Task 3

This notebook is part of Milestone 3, Question 3. You can work on this notebook on your laptop to develop your machine learning model using all the learnings from the previous courses. At the end of this notebook, when you are ready to train the model, you will need to obtain the hyperparameters from the hyperparameter tuning job that you will run in Milestone 3 Question 4 (i.e., the notebook named `Milestone3-Task4.ipynb`).

PS: To speed up the process, you can test the model without the hyperparameters first. Once other team members obtain the hyperparameters, you can retrain the model using those hyperparameters and test it again.

In [1]:
# Students work on their own laptops, so these packages should already be installed from previous courses.
# %pip install joblib scikit-learn matplotlib s3fs

# Imports

In [2]:
import numpy as np
import pandas as pd
import os
from joblib import dump, load
import altair as alt
import altair_ally as aly
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})
## Add any other packages that you need. You are free to use any packages for visualization.
# Save a vega-lite spec and a PNG blob for each plot in the notebook
alt.renderers.enable('mimetype')
# Handle large data sets without embedding them in the notebook
alt.data_transformers.enable('data_server')

DataTransformerRegistry.enable('data_server')

## Part 1:

Recall the final goal of this project: we want to build and deploy ensemble machine learning models in the cloud, where the features are the outputs of different climate models and the target is the actual rainfall observation. In this milestone, you'll build these ensemble machine learning models in the cloud.

**Your tasks:**

1. Read the data CSV from your S3 bucket.
2. Drop rows with NaNs.
3. Split the data into train (80%) and test (20%) portions with `random_state=123`.
4. Carry out EDA of your choice on the train split.
5. Train an ensemble machine learning model using `RandomForestRegressor` and evaluate it with a metric of your choice (e.g., `RMSE`), using `Observed` as the target column.
6. Discuss your results. Are you getting better results with the ensemble model than with the individual climate models?

> Recall that individual columns in the data are predictions of different climate models. 

In [3]:
## By default, credentials are read from your home directory (~/.aws/credentials).
## Make sure your updated credentials are stored there,
## or pass them explicitly with storage_options=aws_credentials (not a good idea).
# aws_credentials = {"key": "", "secret": "", "token": ""}
# Replace the S3 path below with the path to your data.

### Step 1: Read the data
os.environ["AWS_SHARED_CREDENTIALS_FILE"] = "~/.aws/credentials"
df = pd.read_csv("s3://mds-s3-4-mrnabiz/output/ml_data_SYD.csv", index_col=0, parse_dates=True)
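If the shared credentials file is not available, one alternative to hard-coding keys is to build the `storage_options` dictionary from environment variables. The sketch below assumes the standard AWS variable names `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` are set; the helper name `credentials_from_env` and the `<your-bucket>` placeholder are made up for illustration.

```python
import os


def credentials_from_env(environ=os.environ):
    """Build an s3fs-style storage_options dict from AWS environment variables.

    Hypothetical helper: only entries that are actually set are returned.
    """
    mapping = {
        "key": "AWS_ACCESS_KEY_ID",
        "secret": "AWS_SECRET_ACCESS_KEY",
        "token": "AWS_SESSION_TOKEN",
    }
    return {name: environ[var] for name, var in mapping.items() if environ.get(var)}


# Usage (not run here; needs s3fs and valid credentials):
# df = pd.read_csv("s3://<your-bucket>/output/ml_data_SYD.csv",
#                  index_col=0, parse_dates=True,
#                  storage_options=credentials_from_env())
```

This keeps secrets out of the notebook itself, which is the concern behind the "not a good idea" warning above.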

In [4]:
### Step 2: Drop NAs
df = df.dropna()

In [5]:
## Step 3: Train Test Split
train_df, test_df = train_test_split(df, 
                                     test_size=0.2, 
                                     random_state=123)

X_train, y_train = train_df.drop(columns=["Observed"]), train_df["Observed"]
X_test, y_test = test_df.drop(columns=["Observed"]), test_df["Observed"]
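A quick way to convince yourself that the split behaves as intended is to check it on a tiny frame. The sketch below uses a made-up ten-row table (the column `model_a` is hypothetical) and checks the 80/20 proportions, that the same `random_state` reproduces the same split, and that no row lands in both portions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: one "climate model" column and the observed target.
toy = pd.DataFrame({
    "model_a": np.arange(10, dtype=float),
    "Observed": np.arange(10, dtype=float) * 2,
})

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=123)
toy_train_again, _ = train_test_split(toy, test_size=0.2, random_state=123)

print(len(toy_train), len(toy_test))                   # sizes of the two portions
print(toy_train.index.equals(toy_train_again.index))   # same seed, same split
print(set(toy_train.index) & set(toy_test.index))      # rows shared by both portions
```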

In [6]:
## Step 4: EDA
X_train.head()

time,ACCESS-CM2,ACCESS-ESM1-5,AWI-ESM-1-1-LR,BCC-CSM2-MR,BCC-ESM1,CMCC-CM2-HR4,CMCC-CM2-SR5,CMCC-ESM2,CanESM5,EC-Earth3-Veg-LR,...,MIROC6,MPI-ESM-1-2-HAM,MPI-ESM1-2-HR,MPI-ESM1-2-LR,MRI-ESM2-0,NESM3,NorESM2-LM,NorESM2-MM,SAM0-UNICON,TaiESM1
1953-10-26,17.906051,0.837579,9.753198e-14,0.018863,0.2878923,0.007043,0.122719,10.855838,0.022752,0.472927,...,0.456327,6.688447,2.860546,9.77933e-14,0.2980863,1.659176e-13,3.841924,2.713473,0.65944,0.129196
1921-10-22,0.515505,1.911354,1.135404,2e-06,0.4091981,0.009669,0.074208,1.239226,3.566098,0.66719,...,10.748933,0.2368273,0.652848,1.132699e-13,7.653117e-08,0.004560164,0.04178978,7.909935,0.206765,2.018346
1925-01-22,0.161412,2.666091,0.07012887,2.040689,13.38349,0.073243,0.000255,1.349633,0.075959,0.059223,...,0.001945,0.1082573,2.977031,1.320287e-13,0.0001937005,1.692996e-13,0.001290949,0.183711,1.733777,0.932259
1902-11-21,3.651607,3.117433,1.142701e-13,1.6e-05,4.658142e-09,3.913076,9.442968,0.720382,5.31468,0.122738,...,2.310382,0.1635075,0.021314,0.9901551,1.142382,0.001840662,0.04955181,6.8e-05,12.98833,0.005468
1925-02-17,0.635625,39.042773,1.084678,31.690315,6.208601e-09,0.416932,0.733783,0.004239,0.439862,0.40493,...,0.011038,4.388535e-13,0.025447,2.91817,0.1314147,0.369033,2.357034e-08,0.036247,0.298767,2.923645


In [7]:
numeric_cols = list(X_train.select_dtypes(include='number'))

numeric_cols_hist = alt.Chart(X_train).mark_bar().encode(
    alt.X(alt.repeat(), bin=alt.Bin(maxbins=50)),
    alt.Y('count()', axis=alt.Axis(title='Count'), stack=False)
).properties(
    width=250,
    height=150
).repeat(repeat=numeric_cols, columns = 3).properties(
    title='Histogram chart of the Sydney Rain dataset per each numeric feature'
)

numeric_cols_hist



<img src="img/corr_plot_1.png" >
<img src="img/corr_plot_2.png" >
<img src="img/corr_plot_3.png" >

In [7]:
X_train.corr('spearman').style.background_gradient()

Model,ACCESS-CM2,ACCESS-ESM1-5,AWI-ESM-1-1-LR,BCC-CSM2-MR,BCC-ESM1,CMCC-CM2-HR4,CMCC-CM2-SR5,CMCC-ESM2,CanESM5,EC-Earth3-Veg-LR,FGOALS-g3,GFDL-CM4,INM-CM4-8,INM-CM5-0,KIOST-ESM,MIROC6,MPI-ESM-1-2-HAM,MPI-ESM1-2-HR,MPI-ESM1-2-LR,MRI-ESM2-0,NESM3,NorESM2-LM,NorESM2-MM,SAM0-UNICON,TaiESM1
ACCESS-CM2,1.0,0.004676,0.006968,-0.004138,0.010183,0.012352,0.006954,0.002204,0.026286,0.019125,0.001868,0.011236,-0.011431,0.00088,0.022973,0.01742,0.035208,0.017736,0.020222,-0.003758,0.020519,0.004781,0.010897,0.015579,0.012426
ACCESS-ESM1-5,0.004676,1.0,0.007135,0.005796,0.016647,0.044544,0.028338,0.024672,0.030473,0.034057,0.027829,0.036675,0.019573,0.003491,0.01522,-0.009451,0.021522,0.049268,0.025143,0.008796,0.033233,0.002286,0.022748,0.041066,0.02868
AWI-ESM-1-1-LR,0.006968,0.007135,1.0,0.004103,0.031176,0.001282,0.006411,0.006237,0.038756,0.023514,0.003691,0.013895,0.004488,0.007916,0.033882,0.01387,0.039266,0.003629,0.037911,-0.009852,0.033957,-0.005617,0.012906,0.019402,0.002768
BCC-CSM2-MR,-0.004138,0.005796,0.004103,1.0,0.017557,0.021063,0.009091,0.009198,0.013417,0.019688,0.009161,0.012879,0.000174,-0.002808,0.00515,0.009024,0.003596,0.012845,0.016782,0.027284,0.022216,-0.000839,0.006679,0.005745,0.007382
BCC-ESM1,0.010183,0.016647,0.031176,0.017557,1.0,0.01131,-0.007921,0.000311,0.050346,0.02845,-0.006267,0.003052,-0.010176,-0.008535,0.055679,0.033366,0.047502,0.007088,0.045593,0.002729,0.05203,-0.000692,0.007475,-0.004538,-0.007058
CMCC-CM2-HR4,0.012352,0.044544,0.001282,0.021063,0.01131,1.0,0.056384,0.063701,0.004556,0.026427,0.039001,0.050794,0.015124,0.0085,0.006483,-0.019811,0.001558,0.06111,0.023232,0.027032,0.047412,0.010197,0.009414,0.0516,0.051981
CMCC-CM2-SR5,0.006954,0.028338,0.006411,0.009091,-0.007921,0.056384,1.0,0.048498,-0.000451,0.014198,0.019122,0.060414,0.02664,0.015918,-0.004685,-0.035737,-0.021724,0.049834,-0.006245,0.003569,0.041775,-0.000828,0.019172,0.05472,0.051049
CMCC-ESM2,0.002204,0.024672,0.006237,0.009198,0.000311,0.063701,0.048498,1.0,-0.010525,0.023211,0.021767,0.051414,0.029598,0.024765,-0.012334,-0.01936,-0.017243,0.062817,0.016923,0.010749,0.021395,0.00976,0.017046,0.050043,0.049306
CanESM5,0.026286,0.030473,0.038756,0.013417,0.050346,0.004556,-0.000451,-0.010525,1.0,0.044511,-0.01908,0.005527,-0.007654,-0.015887,0.08248,0.033272,0.060261,0.011092,0.059197,-0.001426,0.06443,0.001366,0.021166,0.010907,-0.012242
EC-Earth3-Veg-LR,0.019125,0.034057,0.023514,0.019688,0.02845,0.026427,0.014198,0.023211,0.044511,1.0,-0.001429,0.024645,0.008256,-0.00332,0.046502,0.016332,0.029976,0.037903,0.056527,0.016395,0.064652,-0.010343,0.008382,0.032123,0.030139
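`corr('spearman')` ranks each column and then computes the Pearson correlation of those ranks, so it measures monotonic association and is less sensitive to a few very large values than plain Pearson correlation. A minimal sketch on made-up, right-skewed data (the column names `model_a` and `model_b` are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "model_a": rng.gamma(shape=0.5, scale=5.0, size=200),
    "model_b": rng.gamma(shape=0.5, scale=5.0, size=200),
})

spearman = toy.corr("spearman")
# Same quantity computed by hand: rank every column, then use Pearson.
pearson_of_ranks = toy.rank().corr("pearson")

print(np.allclose(spearman, pearson_of_ranks))
```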


<img src="img/corr_plot_4.png" >

In [10]:
numeric_cols = list(X_train.select_dtypes(include='number'))
aly.corr(X_train[numeric_cols])



In [11]:
### Step 5: Model Training
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=123)
model.fit(X_train, y_train)

In [12]:
# NOTE: mean_squared_error returns the MSE by default, so the scores below are
# mean squared errors (take the square root to get the RMSE).
# All scores are computed on the training split.
model_predictions = model.predict(X_train)

mse_scores = {'Model': [],
              'MSE Scores': []}
for column in X_train.columns.to_list():
    mse_scores['Model'].append(column)
    mse_scores['MSE Scores'].append(mean_squared_error(y_train,
                                                       X_train[column]))
mse_scores['Model'].append('RFR_Preds')
mse_scores['MSE Scores'].append(mean_squared_error(y_train,
                                                   model_predictions))

mse_scores_df = pd.DataFrame(mse_scores)
mse_scores_df

,Model,MSE Scores
0,ACCESS-CM2,111.666622
1,ACCESS-ESM1-5,113.284866
2,AWI-ESM-1-1-LR,118.308166
3,BCC-CSM2-MR,105.982389
4,BCC-ESM1,101.429067
5,CMCC-CM2-HR4,107.193705
6,CMCC-CM2-SR5,119.585524
7,CMCC-ESM2,114.773506
8,CanESM5,111.752399
9,EC-Earth3-Veg-LR,98.052564


#### Step 6: Discussion
The table above reports error scores for the 25 individual climate models and for our random forest regression model. Because `mean_squared_error` is called with its default settings, these values are mean squared errors (MSE) rather than RMSE; taking the square root does not change the ranking. The random forest has the lowest score, 9.67, so it fits the observed rainfall better than any of the listed climate models. Note that every score is computed on the training split, so the random forest's score is optimistic; its error on the test split is the fairer comparison.
The model with the highest score is INM-CM5-0, at 135.134065, which makes it the least accurate predictor of rainfall in Sydney among the models presented.
The scores of the individual climate models range from 84.576195 (KIOST-ESM) to 135.134065 (INM-CM5-0), all well above the random forest's 9.67.
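The comparison in this discussion can be put on the RMSE scale by taking the square root of each mean squared error, ideally on held-out data. The sketch below is a minimal illustration on made-up numbers; the helper `rmse`, the frame `toy_test`, and its column names are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up held-out data: two "climate model" columns and the observed target.
toy_test = pd.DataFrame({
    "model_a": [1.0, 2.0, 3.0, 4.0],
    "model_b": [0.0, 0.0, 0.0, 0.0],
    "Observed": [1.0, 2.0, 3.0, 8.0],
})

# One RMSE per prediction column, sorted from best (lowest) to worst.
scores = pd.Series({
    column: rmse(toy_test["Observed"], toy_test[column])
    for column in toy_test.columns.drop("Observed")
}).sort_values()
print(scores)
```

The same pattern applies to the real data by swapping in `X_test`, `y_test`, and the fitted model's predictions as one more column.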

## Part 2:

### Preparation for deploying model next week

***NOTE: Complete Question 4 (`Milestone3-task4.ipynb`) of Milestone 3 before coming here.***

We found the best hyperparameter settings with MLlib (in Question 4 of Milestone 3). Here, we use the same hyperparameters to train a scikit-learn model.
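MLlib and scikit-learn name the same random forest hyperparameters differently (`numTrees` and `maxDepth` versus `n_estimators` and `max_depth`), so carrying the tuned values over is mostly a renaming step. A minimal sketch, assuming the tuning job returns a plain dictionary; the dictionary `mllib_best_params` and its values are placeholders to be replaced with your own results. The two libraries implement random forests differently, so identical hyperparameters will not necessarily give identical models.

```python
from sklearn.ensemble import RandomForestRegressor

# Placeholder tuning result; substitute the values from your own tuning job.
mllib_best_params = {"numTrees": 100, "maxDepth": 5}

# MLlib parameter name -> scikit-learn parameter name.
MLLIB_TO_SKLEARN = {"numTrees": "n_estimators", "maxDepth": "max_depth"}

sklearn_params = {
    MLLIB_TO_SKLEARN[name]: value for name, value in mllib_best_params.items()
}
tuned_model = RandomForestRegressor(random_state=123, **sklearn_params)
print(tuned_model.get_params()["n_estimators"], tuned_model.get_params()["max_depth"])
```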

In [13]:
# Replace the values below with the hyperparameters you found in Milestone3-task4.ipynb
model = RandomForestRegressor(n_estimators=100, 
                              max_depth=5)
model.fit(X_train, y_train)

In [14]:
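# np.sqrt of the MSE gives the RMSE and also works on newer scikit-learn
# releases, where the `squared=False` argument is deprecated.
print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, model.predict(X_train))):.2f}")
print(f" Test RMSE: {np.sqrt(mean_squared_error(y_test, model.predict(X_test))):.2f}")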

Train RMSE: 7.90
 Test RMSE: 8.64


In [15]:
# Ready to deploy.
# Where is this model saved? Make sure you understand the concept of a relative path.
dump(model, "model.joblib")

['model.joblib']
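Because `"model.joblib"` is a relative path, the file is written to the notebook's current working directory (`os.getcwd()`). Before uploading, it can be worth checking that the saved file loads back into an equivalent model. The sketch below does this round trip on a small made-up model in a temporary directory; `toy_model`, `X_toy`, and `y_toy` are illustrative names only.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Small made-up regression problem.
rng = np.random.default_rng(123)
X_toy = rng.random((50, 3))
y_toy = X_toy.sum(axis=1)

toy_model = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=123)
toy_model.fit(X_toy, y_toy)

# Save to an explicit location, then load the model back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    dump(toy_model, model_path)
    restored = load(model_path)

same_predictions = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```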

In [None]:
# !aws s3 cp model.joblib s3://mds-s3-4-mrnabiz/output/model.joblib

***Upload `model.joblib` to S3 under the output folder. You choose how you want to upload it (using the CLI, an SDK, or the web console).*** The web console is also completely fine, as it is a small file.
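If you prefer the SDK route, the upload can be sketched with `boto3` as below. This is only an outline under the assumption that `boto3` is installed and your AWS credentials are configured; the helper names `split_s3_uri` and `upload_model` and the `<your-bucket>` placeholder are made up for illustration.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def upload_model(local_path, s3_uri):
    """Upload a local file to S3 (requires boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, key = split_s3_uri(s3_uri)
    boto3.client("s3").upload_file(local_path, bucket, key)


# Usage (not run here):
# upload_model("model.joblib", "s3://<your-bucket>/output/model.joblib")
```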