# Example Train on All (ToA) Training Code

Sample code that takes *all* of the provided data, applies the appropriate data processing steps, trains a model, and then evaluates it with several metrics.

The data is split into three parts: a training set, a test set for evaluation during training/fitting (whether it is actually used that way depends on the model), and a holdout set for evaluation after training.
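
The three-way split described above can be sketched as follows. This is only an illustration on toy data: `three_way_split`, the 10% fractions, and the seed are assumptions made for the example, not the actual behavior of `toa_data`, whose internals are not shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, test_frac=0.1, hold_frac=0.1, seed=0):
    """Randomly split (X, y) into train, test, and holdout parts.

    Hypothetical helper: the fractions and the seed are illustrative
    assumptions, not values taken from toa_data.
    """
    n = len(X)
    n_hold = int(round(hold_frac * n))
    n_test = int(round(test_frac * n))
    # Carve off the holdout set first so it never influences fitting.
    rest_X, hold_X, rest_y, hold_y = train_test_split(
        X, y, test_size=n_hold, random_state=seed
    )
    # Split what is left into the training and test sets.
    train_X, test_X, train_y, test_y = train_test_split(
        rest_X, rest_y, test_size=n_test, random_state=seed
    )
    return (train_X, train_y), (test_X, test_y), (hold_X, hold_y)


# Toy data standing in for the cleaned feature table.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = pd.Series(rng.normal(size=100), name="target")

(train_X, train_y), (test_X, test_y), (hold_X, hold_y) = three_way_split(X, y)
print(len(train_X), len(test_X), len(hold_X))
```

Splitting off the holdout set before anything else keeps it untouched by both fitting and any tuning done against the test set.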

In [55]:
# Automatically reload external Python files
%load_ext autoreload
%autoreload 2

import pickle

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

from training import toa_data, calc_stats
from data_cleaning import DataCleaner
from models import models

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Prep the data

In [56]:
# dep_var = "Log(Rmax)"
dep_var = "Log(Efficiency)"
cleaner = DataCleaner(RobustScaler(), dep_var)
(train_X, train_y), (test_X, test_y), (hold_X, hold_y) = toa_data(dep_var, cleaner)


Unknown processor: 'NEC', full name: 'NEC  3.200GHz' @ Earth Simulator, 2009
Unknown processor: 'NEC', full name: 'NEC  3.200GHz' @ Earth Simulator, 2009
Unknown processor: 'NEC', full name: 'NEC  3.200GHz' @ Earth Simulator, 2009
Unknown processor: 'NEC', full name: 'NEC  3.200GHz' @ Earth Simulator, 2009
Unknown processor: 'NEC', full name: 'NEC  3.20GHz' @ Earth Simulator, 2009
Unknown processor: 'Xeon EM64T', full name: 'Xeon EM64T  3.60GHz' @ Thunderbird, 2006
Filtered duplicates to go from 5332 rows to 1530


In [21]:
# Save preprocessor if desired
# Use a context manager so the file is flushed and closed
with open("preprocessor.pkl", "wb") as f:
    pickle.dump(cleaner, f)


## Fit the model

In [57]:
model = models["svr_2"]
model.fit(train_X, train_y)


SVR(cache_size=700, kernel='poly')

In [23]:
# Save the model if desired
# Use a context manager so the file is flushed and closed
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
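
Loading a saved object back is the mirror image of saving it. The sketch below round-trips a stand-in object through a temporary file; the fitted `RobustScaler` on toy numbers is an assumption used only to keep the example self-contained, and in practice the same pattern would apply to `preprocessor.pkl` and `model.pkl`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import RobustScaler

# Stand-in object: a fitted RobustScaler plays the role of the saved
# preprocessor or model (an assumption for this example).
scaler = RobustScaler().fit(np.array([[1.0], [2.0], [3.0], [4.0], [5.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessor.pkl"
    # Save, then load again from the same path.
    with open(path, "wb") as f:
        pickle.dump(scaler, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object carries the same fitted state as the original.
same_center = bool(np.allclose(restored.center_, scaler.center_))
print(same_center)
```

Pickle files should only be loaded from trusted sources, and scikit-learn objects are best unpickled with the same library version that saved them.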


## Measure performance

In [None]:
# Testing score
pred_y = model.predict(test_X)
calc_stats(test_y, pred_y, prefix="Testing")

print()

# Holdout score
pred_y = model.predict(hold_X)
calc_stats(hold_y, pred_y, prefix="Holdout")
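
The definition of `calc_stats` lives in the `training` module and is not shown here. As a rough idea of what such a helper might report, the sketch below prints a few common regression metrics; `calc_stats_sketch` and its choice of metrics (R², MAE, RMSE) are assumptions for illustration, not the project's actual implementation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def calc_stats_sketch(true_y, pred_y, prefix=""):
    """Print and return a few regression metrics.

    Hypothetical stand-in for calc_stats; the metric selection is an
    assumption, not the project's actual list.
    """
    stats = {
        "R^2": float(r2_score(true_y, pred_y)),
        "MAE": float(mean_absolute_error(true_y, pred_y)),
        "RMSE": float(np.sqrt(mean_squared_error(true_y, pred_y))),
    }
    for name, value in stats.items():
        print(f"{prefix} {name}: {value:.4f}")
    return stats


# Tiny hand-made example: only the last prediction is off, by 1.
stats = calc_stats_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], prefix="Testing")
```

Because the dependent variable here is a logarithm (for example `Log(Efficiency)`), error metrics computed this way are in log units rather than in the original scale.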


## Try cross validation for additional data

Cross validation only makes sense when using ToA, because the random train/test splits it relies on do not respect time and would therefore violate ToP.

In [53]:
from sklearn.metrics import r2_score

from kfold import cross_validate

non_holdout_X = pd.concat([train_X, test_X], ignore_index=True)
non_holdout_y = pd.concat([train_y, test_y], ignore_index=True)

r2, best = cross_validate(model, non_holdout_X, non_holdout_y, scoring=r2_score, cv=10)


In [54]:
r2, best

(array([0.77709742, 0.70812389, 0.87638093, 0.74226829, 0.82012622,
        0.88306558, 0.80476139, 0.77750221, 0.7868041 , 0.79835992]),
 DNN1())
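
The `cross_validate` helper comes from the project's `kfold` module, which is not shown. Judging only from the call above, it appears to return one score per fold together with a model. A minimal sketch of that idea, built on scikit-learn's `KFold`, might look like the following; `cross_validate_sketch`, the shuffling, the seed, and the "keep the best-scoring fold's model" rule are all assumptions, and the toy linear data is there only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def cross_validate_sketch(model, X, y, scoring=r2_score, cv=10, seed=0):
    """Return per-fold scores and the fitted estimator with the best score.

    Hypothetical stand-in for kfold.cross_validate.
    """
    folds = KFold(n_splits=cv, shuffle=True, random_state=seed)
    scores, best_model, best_score = [], None, -np.inf
    for train_idx, val_idx in folds.split(X):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X.iloc[train_idx], y.iloc[train_idx])
        score = scoring(y.iloc[val_idx], fold_model.predict(X.iloc[val_idx]))
        scores.append(score)
        if score > best_score:
            best_model, best_score = fold_model, score
    return np.array(scores), best_model


# Toy data: a noisy linear target, so a linear model should score well.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["a", "b"])
y = 3.0 * X["a"] - 2.0 * X["b"] + pd.Series(rng.normal(scale=0.1, size=200))

scores, best = cross_validate_sketch(LinearRegression(), X, y, cv=10)
print(scores.round(3))
```

Cloning the estimator for each fold keeps the folds independent, so a model fitted on one fold never leaks into the score of another.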