# Milestone 3 (Task 3)

This notebook is part of Milestone 3, where we develop a machine learning model locally.

In [1]:
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [2]:
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (8,6)})

In [3]:
s3_bucket_name = 'mds-s3-5-xxx'

In [4]:
target_column = "Observed"

In [5]:
random_state = 123

## Part 1

Here, we train an ensemble machine learning model on `ml_data_SYD.csv`, which was prepared in the previous milestone.

### Read CSV from S3

In [6]:
df = pd.read_csv(f"s3://{s3_bucket_name}/output/ml_data_SYD.csv", index_col=0, parse_dates=True)

In [7]:
df.shape

(46020, 26)

### Drop NA Rows

In [8]:
df.isnull().sum()

ACCESS-CM2           0
ACCESS-ESM1-5        0
AWI-ESM-1-1-LR       0
BCC-CSM2-MR         30
BCC-ESM1            30
CMCC-CM2-HR4        30
CMCC-CM2-SR5        30
CMCC-ESM2           30
CanESM5             30
EC-Earth3-Veg-LR     0
FGOALS-g3           30
GFDL-CM4            30
INM-CM4-8           30
INM-CM5-0           30
KIOST-ESM           30
MIROC6               0
MPI-ESM-1-2-HAM      0
MPI-ESM1-2-HR        0
MPI-ESM1-2-LR        0
MRI-ESM2-0           0
NESM3                0
NorESM2-LM          30
NorESM2-MM          30
SAM0-UNICON         31
TaiESM1             30
Observed             0
dtype: int64

In [9]:
df = df.dropna(axis = 0)

In [10]:
df.isnull().sum()

ACCESS-CM2          0
ACCESS-ESM1-5       0
AWI-ESM-1-1-LR      0
BCC-CSM2-MR         0
BCC-ESM1            0
CMCC-CM2-HR4        0
CMCC-CM2-SR5        0
CMCC-ESM2           0
CanESM5             0
EC-Earth3-Veg-LR    0
FGOALS-g3           0
GFDL-CM4            0
INM-CM4-8           0
INM-CM5-0           0
KIOST-ESM           0
MIROC6              0
MPI-ESM-1-2-HAM     0
MPI-ESM1-2-HR       0
MPI-ESM1-2-LR       0
MRI-ESM2-0          0
NESM3               0
NorESM2-LM          0
NorESM2-MM          0
SAM0-UNICON         0
TaiESM1             0
Observed            0
dtype: int64
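
As a sanity check on this step, the sketch below reproduces the same pattern on a tiny hypothetical frame (the column names `model_a` and `model_b` and all values are made up for illustration). `dropna(axis=0)` removes every row that contains at least one missing value, so the number of rows removed equals the number of rows with any NA, not the sum of the per-column counts. This is consistent with the shapes reported in this notebook: 46020 rows before dropping, and 36791 + 9198 = 45989 rows across the two splits afterwards, i.e. 31 rows removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the ml_data_SYD.csv used in the notebook.
toy = pd.DataFrame({
    "model_a": [1.0, np.nan, 3.0, 4.0],
    "model_b": [0.5, np.nan, np.nan, 2.0],
    "Observed": [1.2, 0.0, 2.8, 3.9],
})

per_column_na = int(toy.isnull().sum().sum())       # NA cells summed over all columns
rows_with_na = int(toy.isnull().any(axis=1).sum())  # rows containing at least one NA
cleaned = toy.dropna(axis=0)                        # drop those rows

print(per_column_na, rows_with_na, len(toy), len(cleaned))
```

Here three cells are missing but only two rows are dropped, because one row has two missing values.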

### Split the Data into Training/Test Data

In [11]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=random_state)

In [12]:
train_df.shape

(36791, 26)

In [13]:
test_df.shape

(9198, 26)
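
The two shapes above can be checked with a little arithmetic. This is a minimal sketch, assuming `train_test_split` rounds a fractional `test_size` up to the next whole row and assigns the remaining rows to the training split (our understanding of its behaviour, not something verified in this notebook). `n_rows` is the row count after dropping NAs, taken as the sum of the two split sizes.

```python
import math

n_rows = 36791 + 9198   # rows remaining after dropna (sum of the two split sizes)
test_size = 0.2

n_test = math.ceil(test_size * n_rows)  # assumed rule: round the test split up
n_train = n_rows - n_test               # the remainder goes to the training split

print(n_rows, n_train, n_test)  # 45989 36791 9198
```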

### Perform EDA

Here, we look at summary statistics for the different prediction models and the observed values in the training split.

In [14]:
train_df.describe().round(3).T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ACCESS-CM2 | 36791.0 | 2.429 | 6.791 | 0.0 | 0.054 | 0.193 | 1.445 | 149.968 |
| ACCESS-ESM1-5 | 36791.0 | 2.939 | 7.049 | 0.0 | 0.021 | 0.493 | 2.399 | 157.606 |
| AWI-ESM-1-1-LR | 36791.0 | 3.716 | 7.281 | 0.0 | 0.03 | 0.592 | 3.602 | 89.466 |
| BCC-CSM2-MR | 36791.0 | 2.203 | 6.518 | 0.0 | 0.001 | 0.096 | 1.319 | 134.465 |
| BCC-ESM1 | 36791.0 | 2.748 | 5.997 | 0.0 | 0.002 | 0.299 | 2.478 | 87.135 |
| CMCC-CM2-HR4 | 36791.0 | 3.093 | 6.459 | 0.0 | 0.138 | 0.634 | 3.183 | 124.952 |
| CMCC-CM2-SR5 | 36791.0 | 3.575 | 7.353 | -0.0 | 0.089 | 0.828 | 3.728 | 140.148 |
| CMCC-ESM2 | 36791.0 | 3.49 | 7.039 | -0.0 | 0.093 | 0.849 | 3.63 | 137.592 |
| CanESM5 | 36791.0 | 2.879 | 6.899 | 0.0 | 0.022 | 0.338 | 2.559 | 135.57 |
| EC-Earth3-Veg-LR | 36791.0 | 2.565 | 5.733 | -0.0 | 0.012 | 0.43 | 2.296 | 96.424 |


### Train Ensemble Model

We train an ensemble model based on `RandomForestRegressor` and measure its performance using RMSE (root-mean-square error). The best possible RMSE is 0, which means the predicted values match the actual values exactly.
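
For reference, RMSE is the square root of the mean squared difference between the observed and predicted values. A minimal NumPy sketch follows; the function name `rmse_by_hand` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root-mean-square error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only.
observed = [0.0, 2.0, 4.0]
predicted = [0.0, 2.0, 7.0]

print(rmse_by_hand(observed, observed))   # 0.0, i.e. perfect predictions
print(rmse_by_hand(observed, predicted))  # sqrt(9 / 3), about 1.732
```

Because the errors are squared before averaging, a single large miss (here, 3 units on one of three points) dominates the score.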

In [15]:
X_train = train_df.drop(columns=[target_column])
X_test = test_df.drop(columns=[target_column])

y_train = train_df[target_column]
y_test = test_df[target_column]

In [16]:
rf_regressor = RandomForestRegressor(random_state=random_state)
rf_regressor.fit(X_train, y_train)

In [17]:
predictions = rf_regressor.predict(X_test)

In [18]:
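# Take the square root of the MSE explicitly; this gives the same value as
# squared=False, an argument that recent scikit-learn releases deprecate.
rmse = np.sqrt(mean_squared_error(y_test, predictions))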
rmse

8.860047622369347

### Evaluate the Results

The Random Forest regressor gives an RMSE of 8.860 on the test split. Below, we list the RMSEs of the individual constituent models, i.e. each input column scored directly against the observed values. (Note that the comparison uses only the test split for all models.)

In terms of test-split RMSE, the regressor performs better than all of the constituent models; the best of them, KIOST-ESM, scores 9.600.

In [19]:
constituent_models = X_test.columns.to_list()

# RMSE of each constituent model's raw output against the observed values
rmse_test = [np.sqrt(mean_squared_error(y_test, X_test[col])) for col in constituent_models]
rmse_train = [np.sqrt(mean_squared_error(y_train, X_train[col])) for col in constituent_models]

pd.DataFrame({
    'models': constituent_models,
    'rmse_test': rmse_test,
    'rmse_train': rmse_train
}).set_index('models').sort_values('rmse_test')

| models | rmse_test | rmse_train |
|---|---|---|
| KIOST-ESM | 9.60048 | 9.196532 |
| FGOALS-g3 | 9.687788 | 9.284867 |
| MRI-ESM2-0 | 9.922795 | 9.609047 |
| MPI-ESM1-2-HR | 9.969823 | 9.489925 |
| NESM3 | 9.978137 | 9.371897 |
| MPI-ESM1-2-LR | 10.260886 | 9.681899 |
| NorESM2-LM | 10.410145 | 9.918216 |
| EC-Earth3-Veg-LR | 10.453606 | 9.902149 |
| GFDL-CM4 | 10.511682 | 9.889638 |
| BCC-ESM1 | 10.615578 | 10.0712 |


## Part 2

Here, we take the tuned hyperparameters from [Task 4](Milestone3-task4.ipynb) and use them to train a `scikit-learn` model.

From Task 4, we obtained `n_estimators=100`, `max_depth=5` and `bootstrap=True`.

In [20]:
model = RandomForestRegressor(n_estimators=100, max_depth=5, bootstrap=True, random_state=random_state)
model.fit(X_train, y_train)

In [21]:
# np.sqrt of the MSE gives the same value as squared=False
print(f"Train RMSE: {np.sqrt(mean_squared_error(y_train, model.predict(X_train))):.2f}")
print(f" Test RMSE: {np.sqrt(mean_squared_error(y_test, model.predict(X_test))):.2f}")

Train RMSE: 7.89
 Test RMSE: 8.66


In [22]:
model_dump_file = "model.joblib"

In [23]:
dump(model, model_dump_file)

['model.joblib']
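
The `load` function imported at the top of the notebook is the counterpart of `dump`. Below is a minimal sketch of the round trip on synthetic data, checking that the restored model reproduces the original predictions. The toy arrays, the small forest and the temporary directory are assumptions for illustration; the notebook itself saves the tuned model to `model.joblib`.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the real notebook uses X_train / y_train.
rng = np.random.default_rng(123)
X_toy = rng.random((50, 3))
y_toy = X_toy.sum(axis=1)

toy_model = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=123)
toy_model.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    dump(toy_model, path)   # serialize the fitted model to disk
    restored = load(path)   # read it back into a new object

# The restored model should give the same predictions as the original.
same_predictions = bool(np.allclose(toy_model.predict(X_toy), restored.predict(X_toy)))
print(same_predictions)
```

Since joblib files are pickle-based, they should only be loaded from sources we trust.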

In [24]:
upload_to_s3 = False  # set to True to copy the saved model to the S3 bucket

if upload_to_s3:
    import logging
    import boto3
    from botocore.exceptions import ClientError

    s3_client = boto3.client('s3')
    try:
        # upload_file returns None on success, so there is no response to keep
        s3_client.upload_file(model_dump_file, s3_bucket_name, f"output/{model_dump_file}")
    except ClientError as e:
        logging.error(e)