# Task 3 - GROUP 15

# Imports

In [2]:
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})

## Part 1:

Recall the final goal of this project: we want to build and deploy ensemble machine learning models in the cloud, where the features are the outputs of different climate models and the target is the actual rainfall observation. In this milestone, you'll build these ensemble machine learning models in the cloud.

**Your tasks:**

1. Read the data CSV from your s3 bucket. 
2. Drop rows with nans. 
3. Split the data into train (80%) and test (20%) portions with `random_state=123`. 
4. Carry out EDA of your choice on the train split. 
5. Train ensemble machine learning model using `RandomForestRegressor` and evaluate with metric of your choice (e.g., `RMSE`) by considering `Observed` as the target column. 
6. Discuss your results. Are you getting better results with ensemble models compared to the individual climate models? 

> Recall that individual columns in the data are predictions of different climate models. 

### Step 1. Read the data CSV from your s3 bucket.

In [3]:
## You could download it from your bucket, or you can use the file that I have in my bucket. 
## You should be able to access it from my bucket using your key and secret
aws_credentials = {"key": "...", "secret": "..."}
df = pd.read_csv("s3://mds-s3-student77/output/ml_data_SYD.csv", index_col=0, parse_dates=True, storage_options=aws_credentials)


### Step 2. Drop rows with nans.

In [4]:
df = df.dropna()

### Step 3. Split the data into train (80%) and test (20%) portions with `random_state=123`. 

In [5]:
## Use your ML skills to get from step 1 to step 6
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)

X_train, y_train = train_df.drop(columns=["observed_rainfall"]), train_df["observed_rainfall"]
X_test, y_test = test_df.drop(columns=["observed_rainfall"]), test_df["observed_rainfall"]

### Step 4. Carry out EDA of your choice on the train split.

In [6]:
train_df

time,ACCESS-CM2,ACCESS-ESM1-5,AWI-ESM-1-1-LR,BCC-CSM2-MR,BCC-ESM1,CMCC-CM2-HR4,CMCC-CM2-SR5,CMCC-ESM2,CanESM5,EC-Earth3-Veg-LR,...,MPI-ESM-1-2-HAM,MPI-ESM1-2-HR,MPI-ESM1-2-LR,MRI-ESM2-0,NESM3,NorESM2-LM,NorESM2-MM,SAM0-UNICON,TaiESM1,observed_rainfall
1953-10-26,17.906051,0.837579,9.753198e-14,0.018863,2.878923e-01,0.007043,1.227189e-01,1.085584e+01,0.022752,0.472927,...,6.688447e+00,2.860546e+00,9.779330e-14,2.980863e-01,1.659176e-13,3.841924e+00,2.713473,6.594400e-01,0.129196,1.833044
1921-10-22,0.515505,1.911354,1.135404e+00,0.000002,4.091981e-01,0.009669,7.420817e-02,1.239226e+00,3.566098,0.667190,...,2.368273e-01,6.528480e-01,1.132699e-13,7.653117e-08,4.560164e-03,4.178978e-02,7.909935,2.067648e-01,2.018346,4.038183
1925-01-22,0.161412,2.666091,7.012887e-02,2.040689,1.338349e+01,0.073243,2.552343e-04,1.349633e+00,0.075959,0.059223,...,1.082573e-01,2.977031e+00,1.320287e-13,1.937005e-04,1.692996e-13,1.290949e-03,0.183711,1.733777e+00,0.932259,0.419818
1902-11-21,3.651607,3.117433,1.142701e-13,0.000016,4.658142e-09,3.913076,9.442968e+00,7.203823e-01,5.314680,0.122738,...,1.635075e-01,2.131350e-02,9.901551e-01,1.142382e+00,1.840662e-03,4.955181e-02,0.000068,1.298833e+01,0.005468,0.698486
1925-02-17,0.635625,39.042773,1.084678e+00,31.690315,6.208601e-09,0.416932,7.337828e-01,4.238512e-03,0.439862,0.404930,...,4.388535e-13,2.544746e-02,2.918170e+00,1.314147e-01,3.690330e-01,2.357034e-08,0.036247,2.987665e-01,2.923645,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1910-04-09,0.148872,0.002789,1.484854e-02,21.972655,4.106228e+00,4.366387,3.023449e+00,8.233286e-03,14.374214,0.004196,...,1.005828e-05,5.328879e-02,1.144262e-13,3.944834e-04,6.158946e+00,5.484728e-01,2.780504,2.572049e+00,0.108625,0.172773
1931-02-17,2.564109,0.000679,2.304275e+00,1.376829,2.344160e+00,0.280475,7.628959e-01,1.130374e-02,0.176314,0.168514,...,1.058082e+01,1.393032e+00,6.145611e-02,1.041555e+01,1.943845e-13,3.161930e+00,0.000017,2.685817e+00,2.015153,1.146269
1937-07-30,0.112727,0.247349,4.266945e+00,0.061475,4.903452e-04,0.518605,5.171097e-01,6.649084e+00,2.194583,3.394174,...,2.304702e+00,1.143579e-13,8.450165e-01,1.151981e-04,7.236151e-08,1.165880e+01,0.015212,3.227688e-08,0.972568,0.001468
1965-10-19,0.264141,0.000000,1.226105e-02,29.641189,1.512778e+00,0.000555,4.047204e-09,6.883775e-13,0.000000,0.183916,...,4.178639e-13,1.159168e-13,1.425951e-13,1.592603e+00,1.548193e-13,5.736621e+00,0.000016,2.926012e-02,1.126471,26.762451


In [7]:
profile = ProfileReport(train_df, title="Pandas Profiling Report", minimal=True)
profile




#### If the above EDA report does not render properly in the notebook, the rendered HTML file can be found here: https://htmlpreview.github.io/?https://raw.githubusercontent.com/UBC-MDS/grp15_rainfall_analysis/main/notebooks/milestone3/doc/pandas_profiling_report_minimal.html

### Step 4 Observations:

Our EDA shows that `train_df` has 26 numeric columns: 25 hold an individual climate model's prediction, and `observed_rainfall` is our target column. There are 36791 rows, one per observation. There are no missing cells, but some observations have an observed rainfall of 0, which is expected for rainfall data because no rain falls on some dates.
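These counts can also be checked directly with pandas, without the profiling report. The following is a minimal sketch on a small synthetic stand-in for `train_df`; the column names `model_a` and `model_b`, the helper `summarize`, and the generated values are all assumptions made for illustration, not the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df: two "climate model" columns plus the target.
rng = np.random.default_rng(0)
n_rows = 1000
toy_df = pd.DataFrame({
    "model_a": rng.gamma(0.5, 4.0, size=n_rows),
    "model_b": rng.gamma(0.5, 4.0, size=n_rows),
    # Roughly 30% of the target values are set to exactly zero (no rain).
    "observed_rainfall": np.where(
        rng.random(n_rows) < 0.3, 0.0, rng.gamma(0.5, 4.0, size=n_rows)
    ),
})


def summarize(df, target="observed_rainfall"):
    """Return the basic counts discussed in the EDA observations."""
    return {
        "n_rows": len(df),
        "n_numeric_cols": df.select_dtypes("number").shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "zero_target_share": float((df[target] == 0).mean()),
    }


summary = summarize(toy_df)
print(summary)
```

Calling `summarize(train_df)` on the real training split should report the row count, the number of numeric columns, the number of missing cells, and the share of zero-rainfall observations.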

### Step 5. Train ensemble machine learning model using `RandomForestRegressor` and evaluate with metric of your choice (e.g., `RMSE`) by considering `Observed` as the target column. 

In [8]:
model = RandomForestRegressor()
model.fit(X_train, y_train)
print(f"Train RMSE using RandomForestRegressor with default hyperparameters: {mean_squared_error(y_train, model.predict(X_train), squared=False):.2f}")

Train RMSE using RandomForestRegressor with default hyperparameters: 3.12


### Step 6. Discuss your results. Are you getting better results with ensemble models compared to the individual climate models? 

In [9]:
for col in X_train.columns:  # use `col` so the fitted `model` is not overwritten
    print(f"Train RMSE using individual model {col : <17}: {mean_squared_error(y_train, X_train[col], squared=False):.2f}")
    

Train RMSE using individual model ACCESS-CM2       : 10.57
Train RMSE using individual model ACCESS-ESM1-5    : 10.64
Train RMSE using individual model AWI-ESM-1-1-LR   : 10.88
Train RMSE using individual model BCC-CSM2-MR      : 10.29
Train RMSE using individual model BCC-ESM1         : 10.07
Train RMSE using individual model CMCC-CM2-HR4     : 10.35
Train RMSE using individual model CMCC-CM2-SR5     : 10.94
Train RMSE using individual model CMCC-ESM2        : 10.71
Train RMSE using individual model CanESM5          : 10.57
Train RMSE using individual model EC-Earth3-Veg-LR : 9.90
Train RMSE using individual model FGOALS-g3        : 9.28
Train RMSE using individual model GFDL-CM4         : 9.89
Train RMSE using individual model INM-CM4-8        : 11.09
Train RMSE using individual model INM-CM5-0        : 11.62
Train RMSE using individual model KIOST-ESM        : 9.20
Train RMSE using individual model MIROC6           : 11.24
Train RMSE using individual model MPI-ESM-1-2-HAM  : 10.62
...

### Step 6 Observations:

The ensemble method we compare against the individual climate models is a `RandomForestRegressor()` with default hyperparameters. Its train RMSE is much lower at 3.12, while the individual models have train RMSEs ranging from 9.20 to 11.62. The `RandomForestRegressor()` therefore appears to perform much better than the individual models. However, we are only comparing train scores here. A random forest with default hyperparameters can fit the training data very closely, so the comparison may differ on test scores.
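A fairer comparison scores both the ensemble and the individual models on the held-out test split. The sketch below shows the idea on synthetic data, assuming each "climate model" column is simply the observation plus independent noise; that assumption, the column names `model_0` to `model_4`, and the helper `rmse` are illustrative and do not describe the real dataset. RMSE is computed as the square root of `mean_squared_error` so the code does not depend on the `squared` argument.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Synthetic stand-in: five noisy copies of a made-up rainfall series.
rng = np.random.default_rng(123)
n = 2000
observed = rng.gamma(0.5, 6.0, size=n)
toy = pd.DataFrame(
    {f"model_{i}": observed + rng.normal(0, 5, size=n) for i in range(5)}
)
toy["observed_rainfall"] = observed

train, test = train_test_split(toy, test_size=0.2, random_state=123)
X_tr, y_tr = train.drop(columns=["observed_rainfall"]), train["observed_rainfall"]
X_te, y_te = test.drop(columns=["observed_rainfall"]), test["observed_rainfall"]

rf = RandomForestRegressor(n_estimators=50, random_state=123).fit(X_tr, y_tr)

# Test RMSE for every individual column and for the ensemble.
test_scores = {col: rmse(y_te, X_te[col]) for col in X_te.columns}
test_scores["RandomForestRegressor"] = rmse(y_te, rf.predict(X_te))
train_rf = rmse(y_tr, rf.predict(X_tr))

for name, score in test_scores.items():
    print(f"Test RMSE {name: <22}: {score:.2f}")
print(f"Train RMSE RandomForestRegressor : {train_rf:.2f}")
```

With the real data, the same pattern applies by swapping in `X_test`, `y_test`, and the fitted `model` from Step 5.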

## Part 2:

### Preparation for deploying model next week

#### Complete task 4 from milestone 3 before coming here

We've found `n_estimators=100, max_depth=5` to be the best hyperparameter settings with MLlib (from task 4 of milestone 3). Here we use the same hyperparameters to train a scikit-learn model.

In [10]:
model = RandomForestRegressor(n_estimators=100, max_depth=5, bootstrap=True)
model.fit(X_train, y_train)

RandomForestRegressor(max_depth=5)

In [11]:
print(f"Train RMSE: {mean_squared_error(y_train, model.predict(X_train), squared=False):.2f}")
print(f" Test RMSE: {mean_squared_error(y_test, model.predict(X_test), squared=False):.2f}")

Train RMSE: 7.89
 Test RMSE: 8.65


In [12]:
# ready to deploy
dump(model, "model.joblib")

['model.joblib']
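Before deploying, it can be worth confirming that the serialized file loads back into an equivalent model. This is a minimal sketch of such a round-trip check with `joblib`'s `dump` and `load`, assuming a small model trained on made-up data; `toy_model`, the file name `toy_model.joblib`, and the generated features are illustrative only.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Made-up regression data, used only to have something to fit.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X.sum(axis=1) + rng.normal(0, 0.1, size=200)

toy_model = RandomForestRegressor(
    n_estimators=10, max_depth=5, random_state=0
).fit(X, y)

# Write the model to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(toy_model, path)
    restored = load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(toy_model.predict(X), restored.predict(X))
print(same_predictions)
```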

***Upload model.joblib to s3. You choose how you want to upload it.***
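One possible way to do the upload from Python is the `upload_file` method of a boto3 S3 client. The sketch below is an illustration under stated assumptions: the helper `upload_model` is hypothetical, the bucket name and key in the usage comment are placeholders, and boto3 is assumed to find valid AWS credentials in the environment. The AWS CLI or the S3 console would work equally well.

```python
def upload_model(local_path, bucket, key, s3_client=None):
    """Upload a local file to S3 and return its s3:// URI.

    `s3_client` is optional so a stub client can be passed in for testing;
    by default a real boto3 S3 client is created.
    """
    if s3_client is None:
        import boto3  # imported lazily; needs AWS credentials configured

        s3_client = boto3.client("s3")
    # upload_file(Filename, Bucket, Key) copies the local file to the bucket.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example call (placeholder bucket and key, not run here):
# upload_model("model.joblib", "your-bucket-name", "output/model.joblib")
```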