# Task 3

This notebook is part of Milestone 3, Question 3. You can work on it on your laptop to develop your machine learning model, drawing on what you learned in previous courses. When you are ready to train the final model at the end of this notebook, you will need the hyperparameters from the hyperparameter tuning job that you run in Milestone 3, Question 4 (i.e., the notebook named `Milestone3-Task4.ipynb`).

PS: To speed up the process, you can test the model without the hyperparameters first. Once other team members obtain the hyperparameters, you can retrain the model using those hyperparameters and test it again.

In [7]:
# Students were asked to use their laptops, so these packages should already be installed from previous courses.
# %pip install joblib scikit-learn matplotlib s3fs

# Imports

In [8]:
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})
## Add any other packages that you need. You are free to use any package for visualization.

## Part 1:

Recall the final goal of this project: we want to build and deploy ensemble machine learning models in the cloud, where the features are the outputs of different climate models and the target is the actual rainfall observation. In this milestone, you'll build these ensemble machine learning models.

**Your tasks:**

1. Read the data CSV from your S3 bucket.
2. Drop rows with NaNs.
3. Split the data into train (80%) and test (20%) portions with `random_state=123`.
4. Carry out EDA of your choice on the train split.
5. Train an ensemble machine learning model using `RandomForestRegressor` and evaluate it with a metric of your choice (e.g., `RMSE`), using `observed` as the target column.
6. Discuss your results. Are you getting better results with the ensemble model than with the individual climate models?

> Recall that individual columns in the data are predictions of different climate models. 

In [9]:
## By default, credentials are looked up in your home directory.
## Make sure your updated credentials are stored there,
## or pass them explicitly via storage_options=aws_credentials (not a good idea).
# aws_credentials = {"key": "", "secret": "", "token": ""}
# Replace with the S3 path to your data
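
As a rough sketch of the options described in the comments above: pandas forwards `storage_options` to `s3fs` when the path starts with `s3://`. The helper name `read_rainfall_csv` and the bucket path in the comment are hypothetical; the demonstration reads a small temporary local file so that it runs without AWS access.

```python
import os
import tempfile

import pandas as pd


def read_rainfall_csv(path, storage_options=None):
    """Read the rainfall CSV, using the first column as a datetime index.

    For an s3:// path, leave storage_options as None to let s3fs pick up the
    credentials stored in the home directory, or pass a dict with "key",
    "secret" and "token" (not recommended).
    """
    return pd.read_csv(
        path, index_col=0, parse_dates=True, storage_options=storage_options
    )


# Hypothetical S3 usage (bucket and key are placeholders):
# df = read_rainfall_csv("s3://your-bucket/path/to/ml_data_SYD.csv")

# Local demonstration with a tiny made-up file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("time,model_a,observed\n1950-01-01,0.5,0.0\n1950-01-02,1.5,2.0\n")
    demo_df = read_rainfall_csv(demo_path)

print(demo_df.shape)
```

Keeping the credentials out of the notebook (the default lookup) avoids accidentally committing secrets along with the code.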

In [10]:
## Use your ML skills to get from step 1 to step 6

In [11]:
# step 1 
df = pd.read_csv("ml_data_SYD.csv", index_col=0, parse_dates=True) 

In [12]:
# step 2 
df = df.dropna()

In [13]:
# step 3 
train_df, test_df = train_test_split(df, test_size=0.20, random_state=123)
X_train, y_train = (
    train_df.drop(columns=["observed"]),
    train_df["observed"],
)
X_test, y_test = (
    test_df.drop(columns=["observed"]),
    test_df["observed"],
)
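
To illustrate what the split above produces, here is a small sketch on made-up data (the column name `model_a` is hypothetical). It shows the 80/20 sizes and that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows: one feature column and the target.
toy = pd.DataFrame({"model_a": np.arange(10.0), "observed": np.arange(10.0) * 2})

# Splitting twice with the same random_state gives the same partition.
train_a, test_a = train_test_split(toy, test_size=0.20, random_state=123)
train_b, test_b = train_test_split(toy, test_size=0.20, random_state=123)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))
```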

In [14]:
# step 4 
train_df.describe()

| | ACCESS-CM2 | ACCESS-ESM1-5 | AWI-ESM-1-1-LR | BCC-CSM2-MR | BCC-ESM1 | CMCC-CM2-HR4 | CMCC-CM2-SR5 | CMCC-ESM2 | CanESM5 | EC-Earth3-Veg-LR | ... | MPI-ESM-1-2-HAM | MPI-ESM1-2-HR | MPI-ESM1-2-LR | MRI-ESM2-0 | NESM3 | NorESM2-LM | NorESM2-MM | SAM0-UNICON | TaiESM1 | observed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | ... | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 | 36791.0 |
| mean | 2.429419 | 2.938955 | 3.716329 | 2.203086 | 2.748441 | 3.092784 | 3.575203 | 3.489756 | 2.879339 | 2.56543 | ... | 3.213535 | 1.299377 | 2.041242 | 1.533212 | 1.726792 | 2.458268 | 2.890478 | 3.383557 | 3.417809 | 2.72632 |
| std | 6.791374 | 7.048794 | 7.280859 | 6.518224 | 5.997439 | 6.459254 | 7.353451 | 7.039201 | 6.89889 | 5.732742 | ... | 6.979341 | 4.890737 | 5.347782 | 5.000287 | 4.872754 | 5.815333 | 7.129072 | 7.927354 | 7.558577 | 8.07831 |
| min | 0.0 | 0.0 | 9.161142e-14 | 4.21143e-24 | 1.091904e-24 | 0.0 | -4.503054e-17 | -3.186177e-19 | 0.0 | -9.934637e-19 | ... | 3.315622e-13 | 1.088608e-13 | 9.155419e-14 | 9.479186e-33 | 1.435053e-13 | 0.0 | 0.0 | -3.604673e-17 | -2.148475e-14 | 0.0 |
| 25% | 0.054108 | 0.021248 | 0.02961787 | 0.0005089918 | 0.002381995 | 0.138315 | 0.08899328 | 0.09271159 | 0.022493 | 0.0120163 | ... | 0.0001169275 | 1.270013e-13 | 1.358104e-13 | 5.380599e-05 | 1.866808e-13 | 0.005478 | 0.010013 | 0.03651962 | 0.04934874 | 0.008084 |
| 50% | 0.19298 | 0.492758 | 0.5923147 | 0.09644146 | 0.2986511 | 0.633548 | 0.8278889 | 0.8486242 | 0.337613 | 0.4296779 | ... | 0.2081838 | 0.001579151 | 0.1140358 | 0.03185565 | 0.04989652 | 0.169617 | 0.255937 | 0.6539921 | 0.6675421 | 0.163215 |
| 75% | 1.445456 | 2.398539 | 3.601697 | 1.31894 | 2.477893 | 3.18263 | 3.727703 | 3.629963 | 2.558854 | 2.295852 | ... | 2.699071 | 0.3465456 | 1.192421 | 0.6732646 | 0.787474 | 1.822582 | 2.45069 | 3.275132 | 3.23443 | 1.612815 |
| max | 149.967634 | 157.605713 | 89.46575 | 134.4652 | 87.13472 | 124.95239 | 140.1478 | 137.5916 | 135.569753 | 96.42382 | ... | 93.06766 | 109.5008 | 74.84368 | 101.69 | 80.45783 | 114.898109 | 163.164524 | 154.9718 | 167.3562 | 192.93303 |


In [15]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 36791 entries, 1953-10-26 to 1932-01-31
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ACCESS-CM2        36791 non-null  float64
 1   ACCESS-ESM1-5     36791 non-null  float64
 2   AWI-ESM-1-1-LR    36791 non-null  float64
 3   BCC-CSM2-MR       36791 non-null  float64
 4   BCC-ESM1          36791 non-null  float64
 5   CMCC-CM2-HR4      36791 non-null  float64
 6   CMCC-CM2-SR5      36791 non-null  float64
 7   CMCC-ESM2         36791 non-null  float64
 8   CanESM5           36791 non-null  float64
 9   EC-Earth3-Veg-LR  36791 non-null  float64
 10  FGOALS-g3         36791 non-null  float64
 11  GFDL-CM4          36791 non-null  float64
 12  INM-CM4-8         36791 non-null  float64
 13  INM-CM5-0         36791 non-null  float64
 14  KIOST-ESM         36791 non-null  float64
 15  MIROC6            36791 non-null  float64
 16  MPI-ESM-1-2-HAM   36791

In [16]:
# step 5
# RandomForestRegressor and mean_squared_error were already imported above.
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Every feature is a numeric climate-model output, so standardize all columns.
# (Tree-based models do not require scaling; it is kept here as a uniform preprocessing step.)
numeric_transformer = StandardScaler()
numeric_features = X_train.columns.tolist()
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features)
)

pipe_rf = make_pipeline(
    preprocessor,
    RandomForestRegressor(
        n_jobs=-1,
        random_state=123,
    ),
)
pipe_rf.fit(X_train, y_train)

In [17]:
y_predict = pipe_rf.predict(X_test)
# mean_squared_error returns the MSE by default, so the value is labeled as MSE.
print(f'Ensemble model MSE: {round(mean_squared_error(y_test, y_predict),3)}')

Ensemble model MSE: 78.237
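
`mean_squared_error` returns the mean squared error by default; taking the square root gives the RMSE, which is in the same units as the target. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions; the errors are -1, 0 and 3.
y_true = np.array([0.0, 2.0, 4.0])
y_pred = np.array([1.0, 2.0, 1.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = np.sqrt(mse)                       # back in the target's units

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Applied to the value printed above, the square root of 78.237 is roughly 8.845.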


In [18]:
# step 6
# Each column is itself a prediction of rainfall, so it can be scored directly against the target.
for column in numeric_features:
    print(f'Individual model {column} MSE {round(mean_squared_error(y_test, X_test[column]),3)}')

Individual model ACCESS-CM2 MSE 121.86
Individual model ACCESS-ESM1-5 MSE 114.39
Individual model AWI-ESM-1-1-LR MSE 120.926
Individual model BCC-CSM2-MR MSE 115.807
Individual model BCC-ESM1 MSE 112.69
Individual model CMCC-CM2-HR4 MSE 113.278
Individual model CMCC-CM2-SR5 MSE 131.805
Individual model CMCC-ESM2 MSE 126.484
Individual model CanESM5 MSE 124.352
Individual model EC-Earth3-Veg-LR MSE 109.278
Individual model FGOALS-g3 MSE 93.853
Individual model GFDL-CM4 MSE 110.495
Individual model INM-CM4-8 MSE 131.14
Individual model INM-CM5-0 MSE 150.068
Individual model KIOST-ESM MSE 92.169
Individual model MIROC6 MSE 128.89
Individual model MPI-ESM-1-2-HAM MSE 119.509
Individual model MPI-ESM1-2-HR MSE 99.397
Individual model MPI-ESM1-2-LR MSE 105.286
Individual model MRI-ESM2-0 MSE 98.462
Individual model NESM3 MSE 99.563
Individual model NorESM2-LM MSE 108.371
Individual model NorESM2-MM MSE 119.678
Individual model SAM0-UNICON MSE 136.393
Individual model 

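The ensemble model outperforms every individual climate model listed above: its test error of 78.237 is lower than that of the best individual model, KIOST-ESM (92.169). The values printed above are mean squared errors, since `mean_squared_error` returns the MSE by default; taking the square root to obtain the RMSE does not change the ranking.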
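
One way to make this comparison easier to read is to collect the errors into a single sorted table. The sketch below uses made-up data; the helper names `rmse` and `compare_errors` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred):
    """Root mean squared error between two array-likes."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def compare_errors(X, y, ensemble_pred):
    """Return RMSE per feature column plus the ensemble, sorted ascending."""
    scores = {col: rmse(y, X[col]) for col in X.columns}
    scores["ensemble"] = rmse(y, ensemble_pred)
    return pd.Series(scores, name="rmse").sort_values()


# Made-up predictions from two "climate models" and an ensemble.
X_demo = pd.DataFrame({"model_a": [1.0, 3.0, 5.0], "model_b": [0.0, 0.0, 0.0]})
y_demo = pd.Series([1.0, 2.0, 3.0])
ensemble_demo = [1.0, 2.0, 4.0]

table = compare_errors(X_demo, y_demo, ensemble_demo)
print(table)
```

In this notebook the call would look like `compare_errors(X_test, y_test, y_predict)`.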

## Part 2:

### Preparation for deploying model next week

***NOTE: Complete Question 4 (`Milestone3-task4.ipynb`) of Milestone 3 before coming here.***

We found the best hyperparameter settings with MLlib in Question 4 of Milestone 3. Here we use the same hyperparameters to train a scikit-learn model.
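
MLlib and scikit-learn name the same random forest settings differently. The sketch below assumes the tuning job reports MLlib-style names (`numTrees`, `maxDepth`, `bootstrap`); the dictionary contents and the helper name `to_sklearn_params` are hypothetical, so substitute whatever your Question 4 notebook actually reports.

```python
# Assumed mapping from MLlib parameter names to scikit-learn keyword arguments.
MLLIB_TO_SKLEARN = {
    "numTrees": "n_estimators",
    "maxDepth": "max_depth",
    "bootstrap": "bootstrap",
}


def to_sklearn_params(mllib_params):
    """Translate MLlib-style names to scikit-learn names, ignoring unknown keys."""
    return {
        MLLIB_TO_SKLEARN[name]: value
        for name, value in mllib_params.items()
        if name in MLLIB_TO_SKLEARN
    }


# Placeholder values for illustration only.
best_params = {"numTrees": 100, "maxDepth": 5, "bootstrap": True}
sklearn_params = to_sklearn_params(best_params)
print(sklearn_params)
```

The resulting dictionary can then be unpacked into the estimator, e.g. `RandomForestRegressor(**sklearn_params)`.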

In [19]:
# Replace the values below with the hyperparameters you found in Milestone3-task4.ipynb
model = RandomForestRegressor(n_estimators=100, max_depth=5, bootstrap=True)
model.fit(X_train, y_train)

In [20]:
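# np.sqrt of the MSE gives the RMSE and works across scikit-learn versions
# (the `squared` argument is deprecated in recent releases).
print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, model.predict(X_train))):.2f}")
print(f" Test RMSE: {np.sqrt(mean_squared_error(y_test, model.predict(X_test))):.2f}")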

Train RMSE: 7.89
 Test RMSE: 8.65


In [21]:
# ready to deploy
# Where is this model saved? Make sure you understand the concept of a relative path.
dump(model, "model.joblib")

['model.joblib']

***Upload `model.joblib` to S3 under the `output` folder. You can choose how to upload it (using the CLI, an SDK, or the web console).*** The web console is completely fine, as it is a small file.
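
Before uploading, it can be worth checking that the saved file loads back into a working model. The sketch below does this round trip on a tiny made-up dataset and also defines an SDK-based upload helper using `boto3`; the helper name `upload_to_s3` and the bucket name are hypothetical, and the upload itself is left commented out because it needs valid AWS credentials.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor


def upload_to_s3(local_path, bucket, key):
    """Upload a local file to S3 with boto3 (requires valid AWS credentials)."""
    import boto3  # imported here so the rest of the sketch runs without boto3

    boto3.client("s3").upload_file(local_path, bucket, key)


# Fit a tiny model on made-up data, save it, and load it back.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = X_demo.sum(axis=1)
demo_model = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=123)
demo_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    dump(demo_model, path)
    restored = load(path)

# The restored model should reproduce the original predictions.
same = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print(same)

# Hypothetical upload call (bucket name is a placeholder):
# upload_to_s3("model.joblib", "your-bucket", "output/model.joblib")
```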