# Example Train on All (ToA) Training Code

Sample code that takes *all* of the provided data, applies the appropriate data processing steps, trains a model, and then evaluates it with several metrics.

The data is split three ways: a training set, a test set for evaluation during training/fitting (whether it is actually used that way depends on the model), and a holdout set for evaluation after training.
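
The internals of `toa_data` are not shown in this notebook. As a rough sketch of what a three-way split can look like, assuming a purely random split, the snippet below uses scikit-learn's `train_test_split` twice. The helper name `three_way_split`, the 20%/20% fractions, and the toy data are all illustrative assumptions, not the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, hold_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split X and y into (train, test, holdout) pairs.

    Hypothetical helper: the fractions and the seed are illustrative only.
    """
    # Carve off the holdout set first so it never influences fitting.
    rest_X, hold_X, rest_y, hold_y = train_test_split(
        X, y, test_size=hold_frac, random_state=seed)
    # test_frac is relative to the full data set, so rescale it
    # before splitting what remains into training and test sets.
    rel_test = test_frac / (1.0 - hold_frac)
    train_X, test_X, train_y, test_y = train_test_split(
        rest_X, rest_y, test_size=rel_test, random_state=seed)
    return (train_X, train_y), (test_X, test_y), (hold_X, hold_y)


# Toy data standing in for the real features and target.
demo_X = pd.DataFrame({"a": np.arange(100), "b": np.arange(100) * 2.0})
demo_y = pd.Series(np.arange(100, dtype=float))
(tr_X, tr_y), (te_X, te_y), (ho_X, ho_y) = three_way_split(demo_X, demo_y)
print(len(tr_X), len(te_X), len(ho_X))
```

The return value mirrors the nested-tuple shape that `toa_data` is unpacked into below, so the sketch could stand in for it during quick experiments.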

In [None]:
# Automatically reload external Python files
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import RobustScaler

from training import toa_data, calc_stats

## Prep the data

In [None]:
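# Dependent (target) variable to predict
dep_var = "Log(Rmax)"
# dep_var = "Log(Efficiency)"
# Unpack the (X, y) pairs for the training, test, and holdout sets
(train_X, train_y), (test_X, test_y), (hold_X, hold_y) = toa_data(
    dep_var, RobustScaler())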


## Fit the model

In [None]:
# Random forest with 1000 trees, fit on the training set only
model = RandomForestRegressor(n_estimators=1000)
model.fit(train_X, train_y)


## Measure performance

In [None]:
# Testing score
pred_y = model.predict(test_X)
calc_stats(test_y, pred_y, prefix="Testing")

print()

# Holdout score
pred_y = model.predict(hold_X)
calc_stats(hold_y, pred_y, prefix="Holdout")
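
`calc_stats` is a project helper whose implementation is not shown here. As a hedged sketch of the kind of metrics such a helper might report, the hypothetical `calc_stats_sketch` below computes R², mean absolute error, and root mean squared error with `sklearn.metrics`. The choice of metrics and the output format are assumptions, not the actual behaviour of `calc_stats`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def calc_stats_sketch(true_y, pred_y, prefix=""):
    """Print and return a few common regression metrics (hypothetical helper)."""
    stats = {
        "r2": r2_score(true_y, pred_y),
        "mae": mean_absolute_error(true_y, pred_y),
        # Take the square root manually to stay compatible across sklearn versions.
        "rmse": float(np.sqrt(mean_squared_error(true_y, pred_y))),
    }
    for name, value in stats.items():
        print(f"{prefix} {name}: {value:.4f}")
    return stats


# Tiny toy example: two exact predictions and one that is off by 1.
demo_stats = calc_stats_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], prefix="Demo")
```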


## Try cross validation for additional data

Cross validation only makes sense when using ToA, because random train/test splits that do not respect time would violate ToP.
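
To make the point about time ordering concrete, the sketch below compares a shuffled `KFold` with scikit-learn's `TimeSeriesSplit` on toy data. It assumes the rows are sorted from oldest to newest; the array `demo_X` and the fold counts are illustrative only. A shuffled fold usually trains on samples that come after some validation samples, while `TimeSeriesSplit` always validates on indices later than every training index.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Twenty samples, assumed to be sorted from oldest to newest.
demo_X = np.arange(20).reshape(-1, 1)

# Shuffled K-fold: count folds whose training set includes "future" samples,
# i.e. samples newer than the oldest validation sample.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
leaky_folds = sum(
    int(train_idx.max() > val_idx.min())
    for train_idx, val_idx in kfold.split(demo_X)
)

# Time-series split: every training index precedes every validation index.
ts_split = TimeSeriesSplit(n_splits=5)
ordered_folds = sum(
    int(train_idx.max() < val_idx.min())
    for train_idx, val_idx in ts_split.split(demo_X)
)

print(f"K-fold folds with future data in training: {leaky_folds} of 5")
print(f"TimeSeriesSplit folds in strict time order: {ordered_folds} of 5")
```

In a time-respecting setup, a splitter such as `TimeSeriesSplit(n_splits=5)` could be passed to `cross_val_score` through its `cv` argument instead of relying on the default folds.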

In [None]:
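# Combine the training and test sets; the holdout set stays out of cross validation
non_holdout_X = pd.concat([train_X, test_X], ignore_index=True)
non_holdout_y = pd.concat([train_y, test_y], ignore_index=True)
# Use the default number of folds (5)
scores: np.ndarray = cross_val_score(
    model, non_holdout_X, non_holdout_y, scoring="r2", n_jobs=5)
# Per-fold R^2 scores, followed by their mean and standard deviation
print(scores, scores.mean(), scores.std())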
