# MLA

## Preparing data

In [1]:
import pandas as pd

In [2]:
waves_filepath = 'C:/Users/Andre/Documents/GitHub/BEDU-Data-Analysis/data/Coastal Data System - Waves (Mooloolaba) 01-2017 to 06 - 2019.csv'
waves_data = pd.read_csv(waves_filepath, index_col=False)
waves_data = waves_data.rename(columns={"Date/Time":"time",
                                        'Hs':'wave_height',
                                        'Hmax':'max_wave_height',
                                        'Tz':'zero_upcrossing_wave_period',
                                        'Tp':'peak_energy_wave_period',
                                        'Peak Direction':'peak_direction',
                                        'SST':'temperature'})

In [3]:
waves_data_clean = waves_data[(waves_data.wave_height > 0) &
                              (waves_data.max_wave_height > 0) &
                              (waves_data.zero_upcrossing_wave_period > 0) &
                              (waves_data.peak_energy_wave_period > 0) &
                              (waves_data.peak_direction > 0) &
                              (waves_data.temperature > 0)]

In [4]:
waves_data_sample = waves_data_clean.iloc[41727:]  # keep rows from position 41727 onward (1727 rows)

## Random Forests *without* max_wave_height

In [5]:
y = waves_data_sample.wave_height
# positional split (no shuffling): first 1295 rows for training, remaining rows for testing
y_tr = y.iloc[:1295]
y_te = y.iloc[1295:]

features = ['zero_upcrossing_wave_period','peak_energy_wave_period','peak_direction','temperature']
X = waves_data_sample[features]
X_tr = X.iloc[:1295]
X_te = X.iloc[1295:]

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# 1/4 select
rf_model = RandomForestRegressor(random_state=1)
# 2/4 fit
rf_model.fit(X_tr, y_tr)
# 3/4 predict
wave_height_preds = rf_model.predict(X_te)
# 4/4 validate
MAE = mean_absolute_error(y_te, wave_height_preds)

print("The Mean Absolute Error is {0:.2f} meters.".format(MAE))

The Mean Absolute Error is 0.97 meters.


## Random Forests *using* max_wave_height

In [10]:
features = ['max_wave_height','zero_upcrossing_wave_period','peak_energy_wave_period','peak_direction','temperature']
Z = waves_data_sample[features]
Z_tr = Z.iloc[:1295]
Z_te = Z.iloc[1295:]

In [11]:
# 1/4 select
rf_model_b = RandomForestRegressor(random_state=1)
# 2/4 fit
rf_model_b.fit(Z_tr, y_tr)
# 3/4 predict
wave_height_preds_b = rf_model_b.predict(Z_te)
# 4/4 validate
MAE_b = mean_absolute_error(y_te, wave_height_preds_b)

print("The Mean Absolute Error of the second model is {0:.2f} meters.".format(MAE_b))

The Mean Absolute Error of the second model is 0.25 meters.


## Conclusion

The second model (using max_wave_height) reduces the MAE from 0.97 meters to 0.25 meters, because it includes maximum wave height, the feature most strongly correlated with our variable of interest, average wave height.
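
The correlation claim above can be checked directly. The following is a minimal sketch, assuming Pearson correlation is the measure of interest; `rank_correlations` is a hypothetical helper, and the `demo` frame is synthetic stand-in data, not the Mooloolaba measurements.

```python
import numpy as np
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Absolute Pearson correlation of each numeric column with `target`, strongest first."""
    numeric = df.select_dtypes("number")      # ignore text columns such as timestamps
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
wave_height = rng.uniform(0.5, 3.0, size=200)
demo = pd.DataFrame({
    "wave_height": wave_height,
    # built to track wave_height closely, plus a little noise
    "max_wave_height": 1.7 * wave_height + rng.normal(0, 0.1, size=200),
    # drawn independently of wave_height
    "temperature": rng.uniform(18, 28, size=200),
})

ranking = rank_correlations(demo, "wave_height")
print(ranking)
```

On the real data, the same helper could be called as `rank_correlations(waves_data_sample, "wave_height")` to see where max_wave_height falls in the ranking.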

If we had to choose a model on accuracy alone, the second model does a better job of predicting wave height. The problem is that it is neither practical nor realistic: the average wave height of a set cannot be predicted from the biggest wave of that set before that wave has happened.
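
One possible way around this limitation is to give the model only readings that would already be available at prediction time. The sketch below is not part of the analysis above; `add_lagged_feature` is a hypothetical helper, and it assumes the rows are in time order and evenly spaced, which may not hold after the cleaning filter drops rows.

```python
import pandas as pd


def add_lagged_feature(df: pd.DataFrame, column: str, lag: int = 1) -> pd.DataFrame:
    """Return a copy with `<column>_lag<lag>` holding the value from `lag` rows earlier."""
    lagged_name = f"{column}_lag{lag}"
    out = df.copy()
    out[lagged_name] = out[column].shift(lag)
    # The first `lag` rows have no earlier reading, so they are dropped.
    return out.dropna(subset=[lagged_name])


# Tiny made-up example (values are illustrative, not real measurements).
demo = pd.DataFrame({
    "wave_height": [1.0, 1.2, 1.1, 1.4],
    "max_wave_height": [1.8, 2.1, 1.9, 2.5],
})

lagged = add_lagged_feature(demo, "max_wave_height")
print(lagged)
```

A model trained on `max_wave_height_lag1` instead of `max_wave_height` would only ever see the previous reading, so it could be used before the current set's biggest wave has occurred.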

On the other hand, let's not forget that a Random Forest is a non-parametric model that we can treat as a black box. This means that we don't know exactly how the model is predicting the variable of interest, average wave height. Perhaps the model is in effect taking past maximum wave heights into account to predict the next average wave height.
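
The black box can be opened a little. scikit-learn's `RandomForestRegressor` exposes a `feature_importances_` attribute after fitting, which shows how much each feature contributed to the splits. The sketch below uses synthetic data (an assumption for illustration); on the notebook's objects the same idea would apply to `rf_model_b` and `Z_tr.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: the target depends on max_wave_height only.
rng = np.random.default_rng(1)
n = 300
max_wave_height = rng.uniform(1.0, 5.0, size=n)
demo_X = pd.DataFrame({
    "max_wave_height": max_wave_height,
    "temperature": rng.uniform(18, 28, size=n),
})
demo_y = 0.6 * max_wave_height + rng.normal(0, 0.05, size=n)

model = RandomForestRegressor(n_estimators=50, random_state=1)
model.fit(demo_X, demo_y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=demo_X.columns
).sort_values(ascending=False)
print(importances)
```

These importances show which features the forest relies on, not how it combines them, so they only partly answer the question raised above.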