# Task 3

This notebook is part of Milestone 3, Question 3. You can work on this notebook on your laptop to develop your machine learning model, using what you learned in the previous courses. At the end of this notebook, when you are ready to train the model, you will need the hyperparameters from the hyperparameter tuning job that you will run in Milestone 3, Question 4 (i.e., the notebook named `Milestone3-Task4.ipynb`).

PS: To speed up the process, you can test the model without the hyperparameters first. Once other team members obtain the hyperparameters, you can retrain the model using those hyperparameters and test it again.

In [1]:
# These packages should already be installed on your laptop from previous courses; uncomment the line below if any are missing.
# %pip install joblib scikit-learn matplotlib s3fs

# Imports

In [2]:
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})
## Add any other packages that you need. You are free to use any packages for visualization.

## Part 1:

Recall the final goal of this project: we want to build and deploy ensemble machine learning models in the cloud, where the features are the outputs of different climate models and the target is the actual rainfall observation. In this milestone, you'll build these ensemble machine learning models in the cloud.

**Your tasks:**

1. Read the data CSV from your S3 bucket. 
2. Drop rows with NaNs. 
3. Split the data into train (80%) and test (20%) portions with `random_state=123`. 
4. Carry out EDA of your choice on the train split. 
5. Train an ensemble machine learning model using `RandomForestRegressor` and evaluate it with a metric of your choice (e.g., RMSE), using `Observed` as the target column. 
6. Discuss your results. Are you getting better results with ensemble models compared to the individual climate models? 

> Recall that individual columns in the data are predictions of different climate models. 

In [3]:
## Remember that by default s3fs looks for credentials in your home directory (~/.aws/credentials).
## Make sure your updated credentials are there,
## or pass them explicitly as storage_options=aws_credentials (not a good idea).
## Never commit real keys to a notebook: the values below are placeholders.
aws_credentials = {"key": "<AWS_ACCESS_KEY_ID>",
                   "secret": "<AWS_SECRET_ACCESS_KEY>",
                   "token": "<AWS_SESSION_TOKEN>"}

# replace with s3 path to your data

df = pd.read_csv("s3://mds-s3-10-yurui/output/ml_data_SYD.csv", 
                  storage_options=aws_credentials, 
                  index_col=0, parse_dates=True)
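
One way to keep keys out of the notebook is to read them from environment variables. The sketch below assumes the standard AWS variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`) are set in your shell; the helper name `credentials_from_env` is hypothetical and is not part of pandas or s3fs.

```python
import os


def credentials_from_env(env=None):
    """Build an s3fs-style storage_options dict from environment variables.

    Illustrative helper (not a library function). `env` defaults to os.environ
    and can be any mapping, which makes the function easy to test.
    """
    env = os.environ if env is None else env
    mapping = {
        "key": "AWS_ACCESS_KEY_ID",
        "secret": "AWS_SECRET_ACCESS_KEY",
        "token": "AWS_SESSION_TOKEN",
    }
    # Fail early with a clear message instead of sending empty credentials to S3
    missing = [name for name in mapping.values() if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {option: env[name] for option, name in mapping.items()}
```

The result can then be passed as `storage_options=credentials_from_env()` in the `pd.read_csv` call.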

In [4]:
## Use your ML skills to get from step 1 to step 6

In [5]:
# Drop rows with nans. 
df.dropna(inplace=True)

In [6]:
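# Split the data: every climate-model column is a feature, `Observed` is the target
X = df.drop(columns=["Observed"])
y = df["Observed"]  # 1-D Series, so fit() does not emit a DataConversionWarning
# Note: with shuffle=False the split is chronological and random_state has no effect
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=False)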

In [7]:
# EDA
print(X_train.describe())

         ACCESS-CM2  ACCESS-ESM1-5  AWI-ESM-1-1-LR   BCC-CSM2-MR  \
count  36816.000000   36816.000000    3.681600e+04  36816.000000   
mean       2.418005       2.896691    3.663874e+00      2.210497   
std        6.779209       6.879033    7.212582e+00      6.542722   
min        0.000000       0.000000    9.161142e-14      0.000000   
25%        0.054202       0.021841    2.585922e-02      0.000387   
50%        0.190842       0.492040    5.759650e-01      0.090620   
75%        1.418798       2.389449    3.547031e+00      1.330929   
max      149.967634     157.605713    8.946575e+01    134.465223   

           BCC-ESM1  CMCC-CM2-HR4  CMCC-CM2-SR5     CMCC-ESM2       CanESM5  \
count  36816.000000  36816.000000  3.681600e+04  3.681600e+04  36816.000000   
mean       2.765911      3.099252  3.592775e+00  3.495889e+00      2.964081   
std        6.009045      6.499206  7.385247e+00  7.076512e+00      7.022035   
min        0.000000      0.000000 -4.503054e-17 -3.186177e-19      0.00

In [8]:
# Train ensemble machine learning model
model = RandomForestRegressor()
model.fit(X_train, y_train)


In [9]:
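# Discuss your results
# np.sqrt(MSE) gives the RMSE without relying on the `squared` argument, which newer scikit-learn releases deprecate
print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, model.predict(X_train))):.2f}")
print(f" Test RMSE: {np.sqrt(mean_squared_error(y_test, model.predict(X_test))):.2f}")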

Train RMSE: 3.21
 Test RMSE: 8.07
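
To answer task 6, the ensemble's test RMSE needs a baseline: the RMSE of each individual climate model column against `Observed`. The following is a minimal sketch of that comparison; the helper `individual_model_rmse` is a hypothetical name, and the toy numbers are made up for illustration and are not the Sydney rainfall data.

```python
import numpy as np
import pandas as pd


def individual_model_rmse(X, y):
    """Return the RMSE of each column of X against y, sorted from best to worst."""
    # reshape so this works for a Series or a one-column DataFrame target
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    errors = X.to_numpy(dtype=float) - y
    rmse = np.sqrt((errors ** 2).mean(axis=0))
    return pd.Series(rmse, index=X.columns, name="RMSE").sort_values()


# Tiny synthetic example (made-up values)
toy_X = pd.DataFrame({"model_a": [1.0, 2.0, 3.0], "model_b": [0.0, 0.0, 0.0]})
toy_y = pd.Series([1.0, 2.0, 5.0], name="Observed")
print(individual_model_rmse(toy_X, toy_y))
```

In the notebook, this would be called as `individual_model_rmse(X_test, y_test)` and the smallest value compared with the random forest's test RMSE.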


## Part 2:

### Preparation for deploying model next week

***NOTE: Complete Question 4 (`Milestone3-Task4.ipynb`) of Milestone 3 before coming here.***

We found the best hyperparameter settings with MLlib (in Question 4 of Milestone 3). Here we use the same hyperparameters to train a scikit-learn model. 

In [13]:
# Replace these values with the hyperparameters you found in Milestone3-Task4.ipynb
model = RandomForestRegressor(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)


In [14]:
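# np.sqrt(MSE) gives the RMSE without relying on the `squared` argument, which newer scikit-learn releases deprecate
print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, model.predict(X_train))):.2f}")
print(f" Test RMSE: {np.sqrt(mean_squared_error(y_test, model.predict(X_test))):.2f}")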

Train RMSE: 8.10
 Test RMSE: 7.81


In [15]:
# Ready to deploy.
# Where is this model saved? Make sure you understand the concept of a relative path.
dump(model, "model.joblib")

['model.joblib']
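
A relative path such as `"model.joblib"` is resolved against the current working directory of the notebook kernel. As a rough sketch, the snippet below saves an object, reports the absolute path it ended up at, and reloads it to confirm the file is usable. The helper `save_and_reload` is a hypothetical name, and a plain dict stands in for the fitted model.

```python
import os
import tempfile

from joblib import dump, load


def save_and_reload(obj, path):
    """Dump obj with joblib, load it back, and return (absolute path, reloaded object)."""
    dump(obj, path)
    return os.path.abspath(path), load(path)


# Demonstration in a temporary directory, with a dict standing in for the model
with tempfile.TemporaryDirectory() as tmp:
    saved_path, restored = save_and_reload({"max_depth": 5}, os.path.join(tmp, "demo.joblib"))
    print(saved_path)
    print(restored)
```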

***Upload `model.joblib` to S3 under the output folder. You choose how you want to upload it (using the CLI, the SDK, or the web console).*** The web console is also completely fine, as it is a small file.
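
If you choose the SDK route, the upload could look roughly like the sketch below. It assumes `boto3` is installed and that valid credentials are available in your home directory; the helper `upload_model`, the bucket name `your-bucket-name`, and the key `output/model.joblib` are placeholders to replace with your own values.

```python
def upload_model(local_path, bucket, key, client=None):
    """Upload a local file to S3 and return its s3:// URI.

    Illustrative helper. `client` is any object with a boto3-style
    `upload_file(Filename, Bucket, Key)` method; when omitted, a real
    boto3 S3 client is created.
    """
    if client is None:
        import boto3  # imported lazily so the helper can be tested without AWS access
        client = boto3.client("s3")
    client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example call (placeholder bucket name, left commented out on purpose):
# upload_model("model.joblib", "your-bucket-name", "output/model.joblib")
```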