# Task 3

This notebook is part of Milestone 3, Question 3. You can work on this notebook on your laptop to develop your machine learning model using all the learnings from the previous courses. At the end of this notebook, when you are ready to train the model, you will need to obtain the hyperparameters from the hyperparameter tuning job that you will run in Milestone 3 Question 4 (i.e., the notebook named `Milestone3-Task4.ipynb`).

PS: To speed up the process, you can test the model without the hyperparameters first. Once other team members obtain the hyperparameters, you can retrain the model using those hyperparameters and test it again.

In [1]:
# I asked them to use their laptop so they already got all these packages from previous courses.
# %pip install joblib scikit-learn matplotlib s3fs

# Imports

In [71]:
import os
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import plotly.express as px
import awswrangler as wr

plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})
## Add any additional packages that you need. You are free to use any package for visualization.

## Part 1:

Recall the final goal of this project: we want to build and deploy ensemble machine learning models in the cloud, where the features are the outputs of different climate models and the target is the actual rainfall observation. In this milestone, you'll build these ensemble machine learning models in the cloud.

**Your tasks:**

1. Read the data CSV from your S3 bucket. 
2. Drop rows with NaNs. 
3. Split the data into train (80%) and test (20%) portions with `random_state=123`. 
4. Carry out EDA of your choice on the train split. 
5. Train an ensemble machine learning model using `RandomForestRegressor` and evaluate it with a metric of your choice (e.g., `RMSE`), using `observed` as the target column. 
6. Discuss your results. Are you getting better results with ensemble models compared to the individual climate models? 

> Recall that individual columns in the data are predictions of different climate models. 
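
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the project solution: the three model columns, the synthetic `observed` values, and the helper name `run_pipeline` are made up for the example, and a real run would read the CSV from S3 instead of generating data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def run_pipeline(df, target="observed"):
    """Drop NaNs, split 80/20, fit a random forest, and return the test RMSE."""
    df = df.dropna()
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
    model = RandomForestRegressor(n_estimators=20, random_state=123)
    model.fit(train_df.drop(columns=[target]), train_df[target])
    y_pred = model.predict(test_df.drop(columns=[target]))
    rmse = float(np.sqrt(mean_squared_error(test_df[target], y_pred)))
    return rmse, len(train_df), len(test_df)


# Synthetic stand-in for the S3 CSV (hypothetical column names and values).
rng = np.random.default_rng(0)
truth = rng.gamma(shape=0.5, scale=5.0, size=500)
demo_df = pd.DataFrame(
    {
        "model_a": truth + rng.normal(0, 2, size=500),
        "model_b": truth + rng.normal(0, 3, size=500),
        "model_c": truth + rng.normal(0, 4, size=500),
        "observed": truth,
    }
)
demo_df.iloc[0, 0] = np.nan  # one missing value to exercise dropna()

rmse, n_train, n_test = run_pipeline(demo_df)
print(f"train rows: {n_train}, test rows: {n_test}, test RMSE: {rmse:.3f}")
```

Splitting before any EDA or fitting keeps the test portion untouched until the final evaluation.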

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

In [3]:
# os.environ["AWS_SHARED_CREDENTIALS_FILE"] = "/srv/keys/credentials"
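# Python does not expand "%USERPROFILE%" inside a string, so build the path explicitly
os.environ["AWS_SHARED_CREDENTIALS_FILE"] = os.path.join(os.path.expanduser("~"), ".aws", "credentials")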

train_data_url = 's3://mds-525-group-15/output/ml_data_SYD.csv'

In [4]:
## By default, credentials are read from the .aws folder in your home directory,
## so make sure your updated credentials are there. Alternatively, pass them
## explicitly with storage_options=aws_credentials (not a good idea).
# aws_credentials = {"key": "","secret": "","token":""}
# replace with the s3 path to your data
df = pd.read_csv(train_data_url, index_col=0, parse_dates=True)
df = df.dropna()  # task 2: drop rows with NaNs

In [5]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)

## EDA

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 36816 entries, 1904-07-25 to 1932-01-22
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ACCESS-CM2        36816 non-null  float64
 1   ACCESS-ESM1-5     36816 non-null  float64
 2   AWI-ESM-1-1-LR    36816 non-null  float64
 3   BCC-CSM2-MR       36816 non-null  float64
 4   BCC-ESM1          36816 non-null  float64
 5   CMCC-CM2-HR4      36816 non-null  float64
 6   CMCC-CM2-SR5      36816 non-null  float64
 7   CMCC-ESM2         36816 non-null  float64
 8   CanESM5           36816 non-null  float64
 9   EC-Earth3-Veg-LR  36816 non-null  float64
 10  FGOALS-g3         36816 non-null  float64
 11  GFDL-CM4          36816 non-null  float64
 12  INM-CM4-8         36816 non-null  float64
 13  INM-CM5-0         36816 non-null  float64
 14  KIOST-ESM         36816 non-null  float64
 15  MIROC6            36816 non-null  float64
 16  MPI-ESM-1-2-HAM   36816

In [7]:
train_df.describe()

| | ACCESS-CM2 | ACCESS-ESM1-5 | AWI-ESM-1-1-LR | BCC-CSM2-MR | BCC-ESM1 | CMCC-CM2-HR4 | CMCC-CM2-SR5 | CMCC-ESM2 | CanESM5 | EC-Earth3-Veg-LR | ... | MPI-ESM-1-2-HAM | MPI-ESM1-2-HR | MPI-ESM1-2-LR | MRI-ESM2-0 | NESM3 | NorESM2-LM | NorESM2-MM | SAM0-UNICON | TaiESM1 | observed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | ... | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 | 36816.0 |
| mean | 2.435598 | 2.911161 | 3.6851 | 2.19516 | 2.771609 | 3.116934 | 3.591418 | 3.490604 | 2.950611 | 2.559451 | ... | 3.175946 | 1.328797 | 2.048747 | 1.536491 | 1.752144 | 2.451512 | 2.909697 | 3.391212 | 3.403765 | 2.736204 |
| std | 6.876014 | 6.951689 | 7.227256 | 6.502536 | 6.051221 | 6.466975 | 7.392305 | 7.076361 | 7.074549 | 5.739063 | ... | 6.883672 | 4.955151 | 5.375858 | 4.993425 | 4.937174 | 5.796878 | 7.173033 | 7.960724 | 7.525256 | 8.108492 |
| min | 0.0 | 0.0 | 9.161142e-14 | 0.0 | 0.0 | 0.0 | -3.479596e-18 | -3.1861769999999997e-19 | 0.0 | -9.934637e-19 | ... | 3.315622e-13 | 1.089808e-13 | 9.155419e-14 | 9.479186000000001e-33 | 1.426891e-13 | 0.0 | 0.0 | -3.6046730000000005e-17 | -2.148475e-14 | 0.0 |
| 25% | 0.053584 | 0.021379 | 0.0281984 | 0.000518 | 0.00237 | 0.138181 | 0.08941694 | 0.09016145 | 0.022656 | 0.01192093 | ... | 0.0001005828 | 1.270362e-13 | 1.352331e-13 | 5.353678e-05 | 1.862711e-13 | 0.005547 | 0.010028 | 0.03754041 | 0.04883792 | 0.008082 |
| 50% | 0.191574 | 0.494985 | 0.585113 | 0.096505 | 0.295341 | 0.643671 | 0.8435672 | 0.8216741 | 0.348699 | 0.4261732 | ... | 0.2054757 | 0.001752656 | 0.114682 | 0.03193842 | 0.05167065 | 0.16797 | 0.256126 | 0.6540263 | 0.6658721 | 0.164671 |
| 75% | 1.435693 | 2.398416 | 3.571731 | 1.323656 | 2.508854 | 3.219543 | 3.724556 | 3.630505 | 2.615149 | 2.294516 | ... | 2.685723 | 0.3616506 | 1.18362 | 0.6686751 | 0.7920023 | 1.819091 | 2.502725 | 3.271716 | 3.217312 | 1.652147 |
| max | 149.967634 | 157.605713 | 89.46575 | 134.465223 | 87.134722 | 124.95239 | 140.1478 | 137.5916 | 135.569753 | 134.2262 | ... | 93.06766 | 109.5008 | 80.05998 | 101.69 | 80.45783 | 103.367212 | 163.164524 | 154.9718 | 167.3562 | 192.93303 |


In [None]:
# Plotly histogram with overlaid distributions for each model (rainfall < 15 mm only)
fig = px.histogram(
    # melt to tidy (long) format: one row per (model, rainfall) pair
    train_df.melt(var_name="model", value_name="rainfall").query("rainfall < 15"),
    barmode="overlay",
    x="rainfall",
    color="model",
    hover_data="model",
    title="Rainfall Predictions are Heavily Right Tail Skewed with Majority < 2mm",
)
fig.show()

In [None]:
fig = px.violin(
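    train_df.melt(var_name="model", value_name="rainfall"),
    # long-form data needs explicit axes; otherwise plotly treats the frame as wide-form
    x="model",
    y="rainfall",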
    box=True,
    points=False,
    color="model",
    hover_data="model",
    title="Rainfall Predictions are Heavily Right Tail Skewed with Majority < 2mm",
    labels={"rainfall": "Rainfall (mm)", "model": "Model"},
)

fig.show()

## Baseline Model

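As a baseline, we predict the mean of the training target for every observation (`DummyRegressor(strategy="mean")`) and report its cross-validated RMSE.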

In [68]:
# import cross validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

from sklearn.dummy import DummyRegressor

X_train = train_df.drop(columns=["observed"])
y_train = train_df["observed"]

X_test = test_df.drop(columns=["observed"])
y_test = test_df["observed"]

model_results_df = pd.DataFrame()

In [53]:
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series of "mean (+/- std)" strings, one per cross-validation metric
    """

    scores = cross_validate(model, X_train, y_train, return_train_score = True, **kwargs)

    scores_df = pd.DataFrame(scores)
    mean_scores = scores_df.mean()
    std_scores = scores_df.std()

    # Format each metric as "mean (+/- std)"; zip avoids positional indexing on a labelled Series
    out_col = [
        f"{mean:.3f} (+/- {std:.3f})" for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)

In [54]:
dummy = DummyRegressor(strategy="mean")

model_results_df["dummy"] = mean_std_cross_val_scores(
    dummy, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)

model_results_df

| | dummy |
|---|---|
| fit_time | 0.006 (+/- 0.001) |
| score_time | 0.000 (+/- 0.000) |
| test_score | -8.107 (+/- 0.168) |
| train_score | -8.108 (+/- 0.042) |
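
A mean-only baseline has an RMSE equal to the population standard deviation of the target it was fit on, which is why the train score above (8.108 in magnitude) is approximately the `std` of `observed` reported by `describe()`. A small sketch with made-up target values:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up target values; the features are irrelevant to a mean-only baseline.
y = np.array([0.0, 0.0, 1.0, 3.0, 10.0])
X = np.zeros((len(y), 1))

dummy = DummyRegressor(strategy="mean").fit(X, y)
baseline_rmse = float(np.sqrt(mean_squared_error(y, dummy.predict(X))))

# RMSE of the mean predictor equals the population standard deviation (ddof=0).
print(baseline_rmse, float(np.std(y)))
```

Any model worth deploying should beat this number on held-out data.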


## Random Forest Regressor

In [55]:
from sklearn.ensemble import RandomForestRegressor

In [77]:
rf_model = RandomForestRegressor(random_state=123)

model_results_df["random_forest"] = mean_std_cross_val_scores(
    rf_model, X_train, y_train, cv=3, scoring="neg_root_mean_squared_error"
)

model_results_df

| | random_forest |
|---|---|
| fit_time | 414.201 (+/- 102.582) |
| score_time | 0.676 (+/- 0.087) |
| test_score | -8.368 (+/- 0.274) |
| train_score | -3.135 (+/- 0.051) |
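
scikit-learn scorers follow a higher-is-better convention, so `neg_root_mean_squared_error` reports RMSE with the sign flipped: the -8.368 above is a cross-validated RMSE of 8.368. Compared with the dummy baseline's 8.107 (computed with a different number of folds), the untuned forest does not improve on validation data while fitting the training folds much more closely (3.135), which suggests overfitting. A minimal sketch of recovering RMSE from the negated scores, on random data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

# Random data purely for illustration.
rng = np.random.default_rng(123)
X = rng.normal(size=(60, 2))
y = rng.normal(size=60)

scores = cross_validate(
    DummyRegressor(strategy="mean"),
    X,
    y,
    cv=3,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)

# Flip the sign to recover RMSE in the original units of the target.
test_rmse = -scores["test_score"]
train_rmse = -scores["train_score"]
print(test_rmse.mean(), train_rmse.mean())
```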


## Evaluation on Test Set

In [64]:
rf_model.fit(X_train, y_train)

# get RMSE on test set
y_pred = rf_model.predict(test_df.drop(columns=["observed"]))
rmse = np.sqrt(mean_squared_error(test_df["observed"], y_pred))
print(f"RMSE on test set: {rmse:.3f}")

RMSE on test set: 8.508


In [65]:
# comparing rf model to individual model columns excluding observed
for m in train_df.drop(columns=["observed"]).columns:
    print(f"RMSE for {m}: {np.sqrt(mean_squared_error(test_df['observed'], test_df[m])):.3f}")

RMSE for ACCESS-CM2: 10.764
RMSE for ACCESS-ESM1-5: 10.847
RMSE for AWI-ESM-1-1-LR: 11.187
RMSE for BCC-CSM2-MR: 10.796
RMSE for BCC-ESM1: 10.432
RMSE for CMCC-CM2-HR4: 10.565
RMSE for CMCC-CM2-SR5: 11.285
RMSE for CMCC-ESM2: 11.129
RMSE for CanESM5: 10.638
RMSE for EC-Earth3-Veg-LR: 10.299
RMSE for FGOALS-g3: 9.565
RMSE for GFDL-CM4: 10.401
RMSE for INM-CM4-8: 11.691
RMSE for INM-CM5-0: 12.060
RMSE for KIOST-ESM: 9.410
RMSE for MIROC6: 11.498
RMSE for MPI-ESM-1-2-HAM: 11.043
RMSE for MPI-ESM1-2-HR: 9.770
RMSE for MPI-ESM1-2-LR: 10.053
RMSE for MRI-ESM2-0: 9.844
RMSE for NESM3: 9.694
RMSE for NorESM2-LM: 10.331
RMSE for NorESM2-MM: 10.660
RMSE for SAM0-UNICON: 11.527
RMSE for TaiESM1: 11.473
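
In the output above, the random forest's test RMSE (8.508) is lower than that of every individual climate model (the best is KIOST-ESM at 9.410). Collecting the numbers into one sorted table makes this comparison easier to read. The sketch below uses synthetic data; the column names and the helper `rmse_table` are hypothetical, and the averaged prediction stands in for `rf_model.predict(X_test)`.

```python
import numpy as np
import pandas as pd


def rmse_table(frame, predictions, target="observed"):
    """Return RMSE for each model column plus extra named predictions, sorted ascending."""
    y_true = frame[target].to_numpy()
    results = {
        col: np.sqrt(np.mean((frame[col].to_numpy() - y_true) ** 2))
        for col in frame.columns
        if col != target
    }
    for name, y_pred in predictions.items():
        results[name] = np.sqrt(np.mean((np.asarray(y_pred) - y_true) ** 2))
    return pd.Series(results, name="rmse").sort_values()


# Synthetic stand-in for test_df (hypothetical columns and values).
rng = np.random.default_rng(1)
truth = rng.gamma(0.5, 5.0, size=200)
demo = pd.DataFrame(
    {
        "model_a": truth + rng.normal(0, 3, 200),
        "model_b": truth + rng.normal(0, 3, 200),
        "observed": truth,
    }
)

# Stand-in for the ensemble's predictions: a simple average of the two models.
ensemble_pred = demo[["model_a", "model_b"]].mean(axis=1)
table = rmse_table(demo, {"ensemble": ensemble_pred})
print(table)
```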


## Part 2:

### Preparation for deploying model next week

***NOTE: Complete Question 4 (`Milestone3-task4.ipynb`) from Milestone 3 before coming here.***

We found the best hyperparameter settings with MLlib in Question 4 of Milestone 3. Here we use the same hyperparameters to train a scikit-learn model.
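
Spark MLlib and scikit-learn name the same random forest hyperparameters differently (`numTrees` vs `n_estimators`, `maxDepth` vs `max_depth`). A minimal sketch of carrying tuned values across, assuming only these two were tuned; the values are placeholders and `to_sklearn_params` is a hypothetical helper.

```python
from sklearn.ensemble import RandomForestRegressor

# MLlib parameter name -> scikit-learn parameter name.
MLLIB_TO_SKLEARN = {"numTrees": "n_estimators", "maxDepth": "max_depth"}


def to_sklearn_params(mllib_params):
    """Rename MLlib hyperparameters to their scikit-learn equivalents."""
    return {MLLIB_TO_SKLEARN[name]: value for name, value in mllib_params.items()}


# Placeholder values; substitute the ones from your own tuning job.
best_mllib_params = {"numTrees": 100, "maxDepth": 5}

sk_params = to_sklearn_params(best_mllib_params)
model_sketch = RandomForestRegressor(**sk_params, bootstrap=True, random_state=123)
print(sk_params)
```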

In [66]:
# Replace the values below with the hyperparameters you found in Milestone3-task4.ipynb
model = RandomForestRegressor(n_estimators=100, max_depth=5, bootstrap=True, random_state=123)
model.fit(X_train, y_train)

In [69]:
# np.sqrt(MSE) works across scikit-learn versions; the `squared` argument was removed in newer releases
print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, model.predict(X_train))):.2f}")
print(f" Test RMSE: {np.sqrt(mean_squared_error(y_test, model.predict(X_test))):.2f}")

Train RMSE: 7.93
 Test RMSE: 8.51


In [70]:
# ready to deploy
# Where is this model saved? Make sure you understand the concept of a relative path.
dump(model, "model.joblib")

['model.joblib']
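
A relative path such as `"model.joblib"` is resolved against the current working directory, and `os.path.abspath` shows where the file actually lands. The sketch below saves a tiny model to a temporary directory and reloads it to confirm the round trip; the toy data and the name `small_model` are illustrative only.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Toy data so the example runs in well under a second.
X = np.arange(20, dtype=float).reshape(10, 2)
y = X.sum(axis=1)
small_model = RandomForestRegressor(n_estimators=5, random_state=123).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    dump(small_model, model_path)
    print("saved to:", os.path.abspath(model_path))
    restored = load(model_path)

# The reloaded model should reproduce the original predictions.
same_predictions = bool(np.allclose(small_model.predict(X), restored.predict(X)))
print(same_predictions)
```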

In [75]:
# import boto3
# s3 = boto3.resource('s3')
# ### your bucket details s3://mdsfinal/output/
# s3.meta.client.upload_file('model.joblib',## your local file path
#                            'mdsfinal', ## your s3 bucket
#                            'output/model.joblib') ## the key within the S3 bucket; think of it as the file path that follows the bucket name
wr.s3.upload(local_file='model.joblib', path='s3://mds-525-group-15/output/model.joblib')

***Upload model.joblib to S3 under the output folder. You choose how to upload it (CLI, SDK, or web console).*** The web console is completely fine, as it is a small file.