## DSCI 525 - Web and Cloud Computing

## Project: Daily Rainfall Over NSW, Australia

## Milestone 3: Setup Spark Cluster and Develop Machine Learning 

### Authors: Group 24 Huanhuan Li, Nash Makhija and Nicholas Wu

# Task 3

# Imports

In [1]:
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})

## Part 1:

Recall the final goal of this project: we want to build and deploy ensemble machine learning models in the cloud, where the features are the outputs of different climate models and the target is the actual rainfall observation. In this milestone, you'll build these ensemble machine learning models in the cloud.

**Your tasks:**

1. Read the data CSV from your S3 bucket.
2. Drop rows with NaNs.
3. Split the data into train (80%) and test (20%) portions with `random_state=123`. 
4. Carry out EDA of your choice on the train split. 
5. Train ensemble machine learning model using `RandomForestRegressor` and evaluate with metric of your choice (e.g., `RMSE`) by considering `Observed` as the target column. 
6. Discuss your results. Are you getting better results with ensemble models compared to the individual climate models? 

> Recall that individual columns in the data are predictions of different climate models. 

### Step 1. Read the data CSV from your S3 bucket.

In [2]:
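## You can download the file from your own bucket, or use the copy in my bucket.
## You should be able to read it from my bucket using your own key and secret.

# Placeholders: fill in your own credentials and never commit real keys to a notebook.
aws_credentials = {"key": "<YOUR_AWS_ACCESS_KEY_ID>", "secret": "<YOUR_AWS_SECRET_ACCESS_KEY>"}
# storage_options hands the credentials to s3fs; without it the dict above is never used.
df = pd.read_csv("s3://mds-s3-student82/output/ml_data_SYD.csv", index_col=0, parse_dates=True,
                 storage_options=aws_credentials)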
df

time,ACCESS-CM2,ACCESS-ESM1-5,AWI-ESM-1-1-LR,BCC-CSM2-MR,BCC-ESM1,CMCC-CM2-HR4,CMCC-CM2-SR5,CMCC-ESM2,CanESM5,EC-Earth3-Veg-LR,...,MPI-ESM-1-2-HAM,MPI-ESM1-2-HR,MPI-ESM1-2-LR,MRI-ESM2-0,NESM3,NorESM2-LM,NorESM2-MM,SAM0-UNICON,TaiESM1,Observed
1889-01-01,0.040427,1.814552,3.557934e+01,4.268112e+00,1.107466e-03,1.141054e+01,3.322009e-08,2.668800,1.321215,1.515293,...,4.244226e-13,1.390174e-13,6.537884e-05,3.445495e-06,1.576096e+01,4.759651e-05,2.451075,0.221324,2.257933,0.006612
1889-01-02,0.073777,0.303965,4.596520e+00,1.190141e+00,1.015323e-04,4.014984e+00,1.312700e+00,0.946211,2.788724,4.771375,...,4.409552e+00,1.222283e-01,1.049131e-13,4.791993e-09,3.675510e-01,4.350863e-01,0.477231,3.757179,2.287381,0.090422
1889-01-03,0.232656,0.019976,5.927467e+00,1.003845e-09,1.760345e-05,9.660565e+00,9.103720e+00,0.431999,0.003672,4.233980,...,2.269300e-01,3.762301e-01,9.758706e-14,6.912302e-01,1.562869e-01,9.561101e+00,0.023083,0.253357,1.199909,1.401452
1889-01-04,0.911319,13.623777,8.029624e+00,8.225225e-02,1.808932e-01,3.951528e+00,1.317160e+01,0.368693,0.013578,15.252495,...,2.344586e-02,4.214019e-01,7.060915e-03,3.835721e-02,2.472226e-07,5.301038e-01,0.002699,2.185454,2.106737,14.869798
1889-01-05,0.698013,0.021048,2.132686e+00,2.496841e+00,4.708019e-09,2.766362e+00,1.822940e+01,0.339267,0.002468,11.920356,...,4.270161e-13,1.879692e-01,4.504985e+00,3.506923e-07,1.949792e-13,1.460928e-10,0.001026,2.766507,1.763335,0.467628
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-12-27,0.033748,0.123476,1.451179e+00,3.852845e+01,2.061717e-03,8.179260e-09,1.171263e-02,0.090786,59.895053,5.071783,...,4.726998e-13,1.326889e-01,1.827857e+00,6.912632e-03,2.171327e-03,1.620489e+00,2.084252,0.868046,17.444923,0.037472
2014-12-28,0.094198,2.645496,4.249335e+01,5.833801e-01,5.939502e-09,8.146937e-01,4.938899e-01,0.000000,0.512632,1.578188,...,4.609420e-13,1.644482e+00,7.242920e-01,2.836752e-03,1.344768e+01,2.391159e+00,1.644527,0.782258,1.569647,0.158061
2014-12-29,0.005964,3.041667,2.898325e+00,9.359547e-02,2.000051e-08,2.532205e-01,1.306046e+00,0.000002,37.169669,1.565885,...,2.016156e+01,1.506439e+00,1.049481e-01,8.137182e+00,2.547820e+01,1.987695e-12,0.205036,2.140723,1.444630,0.025719
2014-12-30,0.000028,1.131412,2.516381e-01,1.715028e-01,7.191735e-05,8.169252e-02,1.722262e-01,0.788577,7.361246,0.025749,...,9.420543e+00,6.242895e+00,1.245115e-01,9.305263e-03,4.192948e+00,2.150346e+00,0.000017,29.714692,0.716019,0.729390
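
Keeping the key and secret out of the notebook avoids leaking them when the notebook is shared. A minimal sketch, assuming the credentials are exported as the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (the names the AWS CLI uses); the helper name `s3_storage_options` is hypothetical and not part of the original notebook:

```python
import os


def s3_storage_options(env=None):
    """Build the {"key", "secret"} dict that pandas forwards to s3fs."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    # Fail early with a clear message instead of a confusing S3 access error.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {"key": env["AWS_ACCESS_KEY_ID"], "secret": env["AWS_SECRET_ACCESS_KEY"]}
```

The result could then be passed as `storage_options=s3_storage_options()` in the `pd.read_csv` call above.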


### Step 2. Drop rows with NaNs.

In [3]:
df = df.dropna()
df

time,ACCESS-CM2,ACCESS-ESM1-5,AWI-ESM-1-1-LR,BCC-CSM2-MR,BCC-ESM1,CMCC-CM2-HR4,CMCC-CM2-SR5,CMCC-ESM2,CanESM5,EC-Earth3-Veg-LR,...,MPI-ESM-1-2-HAM,MPI-ESM1-2-HR,MPI-ESM1-2-LR,MRI-ESM2-0,NESM3,NorESM2-LM,NorESM2-MM,SAM0-UNICON,TaiESM1,Observed
1889-01-01,0.040427,1.814552,3.557934e+01,4.268112e+00,1.107466e-03,1.141054e+01,3.322009e-08,2.668800,1.321215,1.515293,...,4.244226e-13,1.390174e-13,6.537884e-05,3.445495e-06,1.576096e+01,4.759651e-05,2.451075,0.221324,2.257933,0.006612
1889-01-02,0.073777,0.303965,4.596520e+00,1.190141e+00,1.015323e-04,4.014984e+00,1.312700e+00,0.946211,2.788724,4.771375,...,4.409552e+00,1.222283e-01,1.049131e-13,4.791993e-09,3.675510e-01,4.350863e-01,0.477231,3.757179,2.287381,0.090422
1889-01-03,0.232656,0.019976,5.927467e+00,1.003845e-09,1.760345e-05,9.660565e+00,9.103720e+00,0.431999,0.003672,4.233980,...,2.269300e-01,3.762301e-01,9.758706e-14,6.912302e-01,1.562869e-01,9.561101e+00,0.023083,0.253357,1.199909,1.401452
1889-01-04,0.911319,13.623777,8.029624e+00,8.225225e-02,1.808932e-01,3.951528e+00,1.317160e+01,0.368693,0.013578,15.252495,...,2.344586e-02,4.214019e-01,7.060915e-03,3.835721e-02,2.472226e-07,5.301038e-01,0.002699,2.185454,2.106737,14.869798
1889-01-05,0.698013,0.021048,2.132686e+00,2.496841e+00,4.708019e-09,2.766362e+00,1.822940e+01,0.339267,0.002468,11.920356,...,4.270161e-13,1.879692e-01,4.504985e+00,3.506923e-07,1.949792e-13,1.460928e-10,0.001026,2.766507,1.763335,0.467628
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014-12-27,0.033748,0.123476,1.451179e+00,3.852845e+01,2.061717e-03,8.179260e-09,1.171263e-02,0.090786,59.895053,5.071783,...,4.726998e-13,1.326889e-01,1.827857e+00,6.912632e-03,2.171327e-03,1.620489e+00,2.084252,0.868046,17.444923,0.037472
2014-12-28,0.094198,2.645496,4.249335e+01,5.833801e-01,5.939502e-09,8.146937e-01,4.938899e-01,0.000000,0.512632,1.578188,...,4.609420e-13,1.644482e+00,7.242920e-01,2.836752e-03,1.344768e+01,2.391159e+00,1.644527,0.782258,1.569647,0.158061
2014-12-29,0.005964,3.041667,2.898325e+00,9.359547e-02,2.000051e-08,2.532205e-01,1.306046e+00,0.000002,37.169669,1.565885,...,2.016156e+01,1.506439e+00,1.049481e-01,8.137182e+00,2.547820e+01,1.987695e-12,0.205036,2.140723,1.444630,0.025719
2014-12-30,0.000028,1.131412,2.516381e-01,1.715028e-01,7.191735e-05,8.169252e-02,1.722262e-01,0.788577,7.361246,0.025749,...,9.420543e+00,6.242895e+00,1.245115e-01,9.305263e-03,4.192948e+00,2.150346e+00,0.000017,29.714692,0.716019,0.729390


### Step 3. Split the data into train (80%) and test (20%) portions with random_state=123.

In [4]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)

### Step 4. Carry out EDA of your choice on the train split. 

In [5]:
train_df.describe(include="all")

statistic,ACCESS-CM2,ACCESS-ESM1-5,AWI-ESM-1-1-LR,BCC-CSM2-MR,BCC-ESM1,CMCC-CM2-HR4,CMCC-CM2-SR5,CMCC-ESM2,CanESM5,EC-Earth3-Veg-LR,...,MPI-ESM-1-2-HAM,MPI-ESM1-2-HR,MPI-ESM1-2-LR,MRI-ESM2-0,NESM3,NorESM2-LM,NorESM2-MM,SAM0-UNICON,TaiESM1,Observed
count,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,...,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0,36791.0
mean,2.429419,2.938955,3.716329,2.203086,2.748441,3.092784,3.575203,3.489756,2.879339,2.56543,...,3.213535,1.299377,2.041242,1.533212,1.726792,2.458268,2.890478,3.383557,3.417809,2.72632
std,6.791374,7.048794,7.280859,6.518224,5.997439,6.459254,7.353451,7.039201,6.89889,5.732742,...,6.979341,4.890737,5.347782,5.000287,4.872754,5.815333,7.129072,7.927354,7.558577,8.07831
min,0.0,0.0,9.161142e-14,4.21143e-24,1.091904e-24,0.0,-4.503054e-17,-3.186177e-19,0.0,-9.934637e-19,...,3.315622e-13,1.088608e-13,9.155419e-14,9.479186e-33,1.435053e-13,0.0,0.0,-3.604673e-17,-2.148475e-14,0.0
25%,0.054108,0.021248,0.02961787,0.0005089918,0.002381995,0.138315,0.08899328,0.09271159,0.022493,0.0120163,...,0.0001169275,1.270013e-13,1.358104e-13,5.380599e-05,1.866808e-13,0.005478,0.010013,0.03651962,0.04934874,0.008084
50%,0.19298,0.492758,0.5923147,0.09644146,0.2986511,0.633548,0.8278889,0.8486242,0.337613,0.4296779,...,0.2081838,0.001579151,0.1140358,0.03185565,0.04989652,0.169617,0.255937,0.6539921,0.6675421,0.163215
75%,1.445456,2.398539,3.601697,1.31894,2.477893,3.18263,3.727703,3.629963,2.558854,2.295852,...,2.699071,0.3465456,1.192421,0.6732646,0.787474,1.822582,2.45069,3.275132,3.23443,1.612815
max,149.967634,157.605713,89.46575,134.4652,87.13472,124.95239,140.1478,137.5916,135.569753,96.42382,...,93.06766,109.5008,74.84368,101.69,80.45783,114.898109,163.164524,154.9718,167.3562,192.93303


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 36791 entries, 1953-10-26 to 1932-01-31
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ACCESS-CM2        36791 non-null  float64
 1   ACCESS-ESM1-5     36791 non-null  float64
 2   AWI-ESM-1-1-LR    36791 non-null  float64
 3   BCC-CSM2-MR       36791 non-null  float64
 4   BCC-ESM1          36791 non-null  float64
 5   CMCC-CM2-HR4      36791 non-null  float64
 6   CMCC-CM2-SR5      36791 non-null  float64
 7   CMCC-ESM2         36791 non-null  float64
 8   CanESM5           36791 non-null  float64
 9   EC-Earth3-Veg-LR  36791 non-null  float64
 10  FGOALS-g3         36791 non-null  float64
 11  GFDL-CM4          36791 non-null  float64
 12  INM-CM4-8         36791 non-null  float64
 13  INM-CM5-0         36791 non-null  float64
 14  KIOST-ESM         36791 non-null  float64
 15  MIROC6            36791 non-null  float64
 16  MPI-ESM-1-2-HAM   36791

### Step 5. Train ensemble machine learning model using RandomForestRegressor and evaluate with RMSE by considering Observed as the target column. 

In [7]:


X_train = train_df.drop(columns=["Observed"])
y_train = train_df["Observed"]

X_test = test_df.drop(columns=["Observed"])
y_test = test_df["Observed"]


model = RandomForestRegressor(random_state=123)
model.fit(X_train, y_train)


RandomForestRegressor(random_state=123)

In [8]:
# RMSE is the square root of the MSE; np.sqrt avoids the `squared=False` argument,
# which recent scikit-learn releases deprecate.
result = {}
result["RandomForestRegressor"] = [round(np.sqrt(mean_squared_error(y_train, model.predict(X_train))), 6),
                                   round(np.sqrt(mean_squared_error(y_test, model.predict(X_test))), 6)]
pd.DataFrame(result, index=["train_error", "test_error"])

metric,RandomForestRegressor
train_error,3.114311
test_error,8.844402
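
For reference, RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same units as the target. A dependency-free sketch of the metric (the function name `rmse` is illustrative; the notebook itself relies on scikit-learn's `mean_squared_error`):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length numeric sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    # Average the squared residuals, then take the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the residuals are squared before averaging, a few days with large rainfall errors can dominate the score.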


### Step 6. Discuss results. 
### Are we getting better results with ensemble models compared to the individual climate models? 

In [9]:
# Calculate the train and test RMSE of each individual climate model.
individual_model_names = list(train_df.columns)
individual_model_names.remove("Observed")

for name in individual_model_names:
    # Each climate model's own output is treated directly as its prediction.
    result[name] = [round(np.sqrt(mean_squared_error(y_train, train_df[name])), 6),
                    round(np.sqrt(mean_squared_error(y_test, test_df[name])), 6)]

scores = pd.DataFrame(result, index=["train_error", "test_error"]).T.sort_values("test_error")
scores

model,train_error,test_error
RandomForestRegressor,3.114311,8.844402
KIOST-ESM,9.196532,9.60048
FGOALS-g3,9.284867,9.687788
MRI-ESM2-0,9.609047,9.922795
MPI-ESM1-2-HR,9.489925,9.969823
NESM3,9.371897,9.978137
MPI-ESM1-2-LR,9.681899,10.260886
NorESM2-LM,9.918216,10.410145
EC-Earth3-Veg-LR,9.902149,10.453606
GFDL-CM4,9.889638,10.511682


### Discussion
1. The ensemble Random Forest Regressor has a train RMSE of 3.11 and a test RMSE of 8.84.
2. The ensemble Random Forest Regressor performs best, with lower train and test errors than all the individual climate models; the best individual model, KIOST-ESM, has a test RMSE of 9.60. This is expected because each individual model is inaccurate in different ways, and combining them lets the ensemble outperform any single model.
3. However, the gap between train and test error is much larger for the ensemble (3.11 vs. 8.84) than for the individual models, which suggests the ensemble is overfitting the training data. The individual climate models are not fitted to this data, so their small gaps reflect only the variation between the two splits.
4. Hence, we will perform hyperparameter optimization to improve our ensemble model.
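
The overfitting observation in point 3 can be checked with simple arithmetic on the reported scores. A small sketch using the RMSE values copied from the results table for the ensemble and the best individual model (the variable names are illustrative):

```python
# (train RMSE, test RMSE) copied from the scores table above.
scores = {
    "RandomForestRegressor": (3.114311, 8.844402),
    "KIOST-ESM": (9.196532, 9.60048),
}

# Generalization gap: how much worse each model does on the held-out split.
gaps = {name: test - train for name, (train, test) in scores.items()}

# Relative test-error reduction of the ensemble over the best individual model.
best_single = scores["KIOST-ESM"][1]
improvement = (best_single - scores["RandomForestRegressor"][1]) / best_single

for name, gap in gaps.items():
    print(f"{name}: train/test gap = {gap:.2f}")
print(f"Ensemble test RMSE is {improvement:.1%} lower than KIOST-ESM")
```

The ensemble's gap (about 5.7) is far larger than KIOST-ESM's (about 0.4), while its test RMSE is only modestly lower, which is why tuning the forest is a sensible next step.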


## Part 2:

### Preparation for deploying model next week

We found `n_estimators=100, max_depth=5` to be the best hyperparameter settings with MLlib (in Task 4 of Milestone 3), so here we use the same hyperparameters to train a scikit-learn model.

In [10]:
model = RandomForestRegressor(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)

RandomForestRegressor(max_depth=5)

In [11]:
print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, model.predict(X_train))):.2f}")
print(f" Test RMSE: {np.sqrt(mean_squared_error(y_test, model.predict(X_test))):.2f}")

Train RMSE: 7.89
 Test RMSE: 8.64


In [12]:
# ready to deploy
dump(model, "model.joblib")

['model.joblib']

***Upload model.joblib to S3. You can choose how to upload it.***
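
One possible way to do the upload is with `boto3`. A hedged sketch, assuming `boto3` is installed and AWS credentials are already configured in the environment; the bucket name and object key are placeholders, and the helper `upload_model` is hypothetical rather than part of the original notebook:

```python
def upload_model(local_path, bucket, key, client=None):
    """Upload a local file to S3 and return its s3:// URI."""
    if client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3

        client = boto3.client("s3")
    # boto3 S3 client signature: upload_file(Filename, Bucket, Key)
    client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example call (placeholder bucket and key, not taken from this project):
# upload_model("model.joblib", "<your-bucket>", "output/model.joblib")
```

The AWS CLI (`aws s3 cp model.joblib s3://<your-bucket>/output/model.joblib`) or the S3 console would work equally well.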