# Task 3

# Imports

In [1]:
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})

In [2]:
import boto3
import awswrangler as wr
import altair as alt

In [3]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Part 1:

Recall the final goal of this project: we want to build and deploy ensemble machine learning models in the cloud, where the features are the outputs of different climate models and the target is the actual rainfall observation. In this milestone, you'll build these ensemble machine learning models in the cloud.

**Your tasks:**

1. Read the data CSV from your S3 bucket.
2. Drop rows with NaNs.
3. Split the data into train (80%) and test (20%) portions with `random_state=123`.
4. Carry out EDA of your choice on the train split.
5. Train an ensemble machine learning model using `RandomForestRegressor` and evaluate it with a metric of your choice (e.g., `RMSE`), using `Observed` as the target column.
6. Discuss your results. Are you getting better results with ensemble models compared to the individual climate models? 

> Recall that individual columns in the data are predictions of different climate models. 

### Step 1: Read in the CSV

In [4]:
## You could download it from your bucket, or you can use the file that I have in my bucket.
## You should be able to access it from my bucket using your key and secret.
## Credentials are read from environment variables rather than hardcoded in the notebook.
import os

session = boto3.Session(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
df = wr.s3.read_csv(path="s3://mds-s3-student9/output/", dataset=True,
                    boto3_session=session)

In [5]:
## Use your ML skills to get from step 1 to step 6

In [6]:
df.shape

(46020, 27)

### Step 2: Drop NAs

In [7]:
df = df.dropna(axis = 0)

In [8]:
df

Unnamed: 0,time,ACCESS-CM2,ACCESS-ESM1-5,AWI-ESM-1-1-LR,BCC-CSM2-MR,BCC-ESM1,CMCC-CM2-HR4,CMCC-CM2-SR5,CMCC-ESM2,CanESM5,...,MPI-ESM-1-2-HAM,MPI-ESM1-2-HR,MPI-ESM1-2-LR,MRI-ESM2-0,NESM3,NorESM2-LM,NorESM2-MM,Observed,SAM0-UNICON,TaiESM1
0,1889-01-01,0.040427,1.814552,3.557934e+01,4.268112e+00,1.107466e-03,1.141054e+01,3.322009e-08,2.668800,1.321215,...,4.244226e-13,1.390174e-13,6.537884e-05,3.445495e-06,1.576096e+01,4.759651e-05,2.451075,0.006612,0.221324,2.257933
1,1889-01-02,0.073777,0.303965,4.596520e+00,1.190141e+00,1.015323e-04,4.014984e+00,1.312700e+00,0.946211,2.788724,...,4.409552e+00,1.222283e-01,1.049131e-13,4.791993e-09,3.675510e-01,4.350863e-01,0.477231,0.090422,3.757179,2.287381
2,1889-01-03,0.232656,0.019976,5.927467e+00,1.003845e-09,1.760345e-05,9.660565e+00,9.103720e+00,0.431999,0.003672,...,2.269300e-01,3.762301e-01,9.758706e-14,6.912302e-01,1.562869e-01,9.561101e+00,0.023083,1.401452,0.253357,1.199909
3,1889-01-04,0.911319,13.623777,8.029624e+00,8.225225e-02,1.808932e-01,3.951528e+00,1.317160e+01,0.368693,0.013578,...,2.344586e-02,4.214019e-01,7.060915e-03,3.835721e-02,2.472226e-07,5.301038e-01,0.002699,14.869798,2.185454,2.106737
4,1889-01-05,0.698013,0.021048,2.132686e+00,2.496841e+00,4.708019e-09,2.766362e+00,1.822940e+01,0.339267,0.002468,...,4.270161e-13,1.879692e-01,4.504985e+00,3.506923e-07,1.949792e-13,1.460928e-10,0.001026,0.467628,2.766507,1.763335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46015,2014-12-27,0.033748,0.123476,1.451179e+00,3.852845e+01,2.061717e-03,8.179260e-09,1.171263e-02,0.090786,59.895053,...,4.726998e-13,1.326889e-01,1.827857e+00,6.912632e-03,2.171327e-03,1.620489e+00,2.084252,0.037472,0.868046,17.444923
46016,2014-12-28,0.094198,2.645496,4.249335e+01,5.833801e-01,5.939502e-09,8.146937e-01,4.938899e-01,0.000000,0.512632,...,4.609420e-13,1.644482e+00,7.242920e-01,2.836752e-03,1.344768e+01,2.391159e+00,1.644527,0.158061,0.782258,1.569647
46017,2014-12-29,0.005964,3.041667,2.898325e+00,9.359547e-02,2.000051e-08,2.532205e-01,1.306046e+00,0.000002,37.169669,...,2.016156e+01,1.506439e+00,1.049481e-01,8.137182e+00,2.547820e+01,1.987695e-12,0.205036,0.025719,2.140723,1.444630
46018,2014-12-30,0.000028,1.131412,2.516381e-01,1.715028e-01,7.191735e-05,8.169252e-02,1.722262e-01,0.788577,7.361246,...,9.420543e+00,6.242895e+00,1.245115e-01,9.305263e-03,4.192948e+00,2.150346e+00,0.000017,0.729390,29.714692,0.716019


### Step 3: Split the data

In [9]:
train_df, test_df = train_test_split(df, test_size = 0.2, random_state = 123)

### Step 4: Conduct EDA 

In [10]:
# convert time into datetime
# (work on an explicit copy so pandas does not raise SettingWithCopyWarning)
train_df = train_df.copy()
train_df["time"] = pd.to_datetime(train_df["time"])


In [11]:
train_df = train_df.set_index("time")

In [12]:
# resample the data into monthly 
resampled_df = train_df.resample("M").mean()

In [13]:
# melt the columns to make it easier to plot
melted_df = pd.melt(resampled_df.reset_index(), id_vars=['time'], value_vars=train_df.columns.tolist())
melted_df

Unnamed: 0,time,variable,value
0,1889-01-31,ACCESS-CM2,1.865877
1,1889-02-28,ACCESS-CM2,0.814492
2,1889-03-31,ACCESS-CM2,1.120327
3,1889-04-30,ACCESS-CM2,1.506457
4,1889-05-31,ACCESS-CM2,4.101108
...,...,...,...
39307,2014-08-31,TaiESM1,1.642193
39308,2014-09-30,TaiESM1,1.121861
39309,2014-10-31,TaiESM1,1.719495
39310,2014-11-30,TaiESM1,3.750043


In [14]:
# index to only have data from 2000 to 2014
sample = resampled_df.loc["2000":"2014"]

In [15]:
models_list = sample.columns.tolist()
models_list.remove("Observed")

In [23]:
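# Observed rainfall, drawn in gray on every panel for reference
observed_line = alt.Chart().mark_line(color="gray", opacity=0.5).encode(
    x='time:T',
    y='Observed:Q')

# One layered chart per climate model: the model's output over the observed series
charts = []
for model in models_list:
    model_line = alt.Chart().mark_line(opacity=0.7).encode(
        x=alt.X('time:T', title="Year"),
        y=alt.Y(model, type='quantitative', title=f'{model} Compared to Observed'))
    charts.append(alt.layer(model_line, observed_line, data=sample.reset_index()))

In [24]:
# stack the per-model charts vertically
alt.vconcat(*charts)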

In [34]:
X_train, y_train = (train_df.drop(columns=['Observed']), train_df["Observed"])
X_test, y_test = (test_df.drop(columns=["Observed", "time"]), test_df["Observed"])

### Compare individual models to observed data 

In [49]:
for col in X_train.columns:
    print(f"{col} RMSE: {mean_squared_error(y_train, X_train[col], squared=False):.2f}")

ACCESS-CM2 RMSE: 10.57
ACCESS-ESM1-5 RMSE: 10.64
AWI-ESM-1-1-LR RMSE: 10.88
BCC-CSM2-MR RMSE: 10.29
BCC-ESM1 RMSE: 10.07
CMCC-CM2-HR4 RMSE: 10.35
CMCC-CM2-SR5 RMSE: 10.94
CMCC-ESM2 RMSE: 10.71
CanESM5 RMSE: 10.57
EC-Earth3-Veg-LR RMSE: 9.90
FGOALS-g3 RMSE: 9.28
GFDL-CM4 RMSE: 9.89
INM-CM4-8 RMSE: 11.09
INM-CM5-0 RMSE: 11.62
KIOST-ESM RMSE: 9.20
MIROC6 RMSE: 11.24
MPI-ESM-1-2-HAM RMSE: 10.62
MPI-ESM1-2-HR RMSE: 9.49
MPI-ESM1-2-LR RMSE: 9.68
MRI-ESM2-0 RMSE: 9.61
NESM3 RMSE: 9.37
NorESM2-LM RMSE: 9.92
NorESM2-MM RMSE: 10.68
SAM0-UNICON RMSE: 11.32
TaiESM1 RMSE: 11.01
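
For reference, `mean_squared_error(..., squared=False)` returns the root mean squared error. Writing $y_i$ for the observed rainfall and $\hat{y}_i$ for a given climate model's prediction on row $i$ (symbols introduced here only for illustration), each score above corresponds to:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

The result is in the same units as the `Observed` column, so a lower value means the model tracks the observations more closely.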


## Part 2:

### Preparation for deploying model next week

#### Complete Task 4 from Milestone 3 before coming here

We’ve found `n_estimators=100, max_depth=5, bootstrap=True` to be the best hyperparameter settings with MLlib (from Task 4 of Milestone 3). Here we use the same hyperparameters to train a scikit-learn model.

In [36]:
model = RandomForestRegressor(n_estimators=100, max_depth=5, bootstrap=True)
model.fit(X_train, y_train)

RandomForestRegressor(max_depth=5)

In [37]:
print(f"Train RMSE: {mean_squared_error(y_train, model.predict(X_train), squared=False):.2f}")
print(f" Test RMSE: {mean_squared_error(y_test, model.predict(X_test), squared=False):.2f}")

Train RMSE: 7.88
 Test RMSE: 8.65


### Reflection:


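The individual climate models have RMSE scores between 9.20 (KIOST-ESM) and 11.62 (INM-CM5-0) on the training split, while the random forest ensemble reaches a train RMSE of 7.88 and a test RMSE of 8.65. The ensemble therefore has a lower error than every individual model, even when its test RMSE is compared against the individual models' training-split RMSE.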
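
The individual-model scores earlier in this notebook were computed on the training split, whereas the ensemble is usually judged by its test score. A like-for-like comparison would score everything on the same rows. The following is a minimal sketch of that idea, not part of the original analysis: the helper names `rmse` and `compare_to_ensemble` are hypothetical, and the numbers are toy values rather than rainfall data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def compare_to_ensemble(y_true, ensemble_pred, individual_preds):
    """Return (name, rmse) pairs sorted from lowest to highest error.

    `individual_preds` maps a climate-model name to its predictions
    on the same rows as `y_true` (hypothetical helper for illustration).
    """
    scores = {"ensemble": rmse(y_true, ensemble_pred)}
    for name, pred in individual_preds.items():
        scores[name] = rmse(y_true, pred)
    return sorted(scores.items(), key=lambda item: item[1])


# Toy values, not taken from the rainfall data.
ranking = compare_to_ensemble(
    y_true=[0.0, 2.0, 4.0],
    ensemble_pred=[0.0, 2.0, 5.0],
    individual_preds={"model_a": [1.0, 3.0, 5.0], "model_b": [0.0, 0.0, 4.0]},
)
for name, score in ranking:
    print(f"{name} RMSE: {score:.2f}")
```

In this notebook, the same helper could presumably be called as `compare_to_ensemble(y_test, model.predict(X_test), {c: X_test[c] for c in X_test.columns})` to rank the ensemble against every climate model on the test split.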

In [38]:
# ready to deploy
dump(model, "model.joblib")

['model.joblib']

***Upload model.joblib to s3. You choose how you want to upload it.***

The model was relatively small, so we uploaded it to the S3 bucket manually.
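
A manual upload works for a one-off file. If the upload had to be repeated, it could be scripted instead. The sketch below is one possible approach and was not used in this notebook: `upload_model` is a hypothetical helper, and the bucket and key in the usage comment are placeholders. It relies only on the `upload_file(Filename, Bucket, Key)` method of a boto3 S3 client.

```python
def upload_model(s3_client, local_path, bucket, key):
    """Upload a local file to S3 and return its s3:// URI.

    `s3_client` is expected to behave like `boto3.client("s3")`;
    the bucket and key are supplied by the caller.
    """
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example usage (placeholder bucket and key, requires valid AWS credentials):
# import boto3
# uri = upload_model(boto3.client("s3"), "model.joblib",
#                    "your-bucket-name", "output/model.joblib")
```

Passing the client in as an argument keeps credentials out of the helper itself and makes it easy to exercise with a stand-in object.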