# Example Train on Past (ToP) Training Code

Sample code that preprocesses data and trains a model using only data from *before* a given date, then evaluates the model with several metrics.

The data is split into a testing set, consisting of the data reported on a given date, and a training set, consisting of data from a given number of datasets preceding that date.
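
The split described above can be sketched in plain Python. This is only an illustration of the idea, not the implementation behind `top_data`, which is not shown here: the function name `split_train_on_past`, the `datasets` mapping from date to rows, and the `n_past` parameter are all hypothetical.

```python
from datetime import date


def split_train_on_past(datasets, test_date, n_past):
    """Split dated datasets into (train, test) rows.

    `datasets` is a hypothetical mapping from a date to a list of rows.
    The test set is the dataset reported on `test_date`; the training set
    concatenates the `n_past` most recent datasets strictly before it.
    """
    if test_date not in datasets:
        raise KeyError(f"no dataset reported on {test_date}")
    # Only dates strictly before the test date may contribute training rows.
    earlier = sorted(d for d in datasets if d < test_date)
    # Guard n_past == 0: earlier[-0:] would select every earlier dataset.
    train_dates = earlier[-n_past:] if n_past > 0 else []
    train = [row for d in train_dates for row in datasets[d]]
    return train, list(datasets[test_date])


# Toy data: four dated datasets with placeholder rows.
datasets = {
    date(2020, 6, 1): ["a1", "a2"],
    date(2020, 11, 1): ["b1"],
    date(2021, 6, 1): ["c1", "c2"],
    date(2021, 11, 1): ["d1"],
}
train, test = split_train_on_past(datasets, date(2021, 11, 1), n_past=2)
print(train)  # ['b1', 'c1', 'c2']
print(test)   # ['d1']
```

Because no row dated on or after the test date reaches the training set, the evaluation cannot benefit from information that would have been unavailable at prediction time.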

In [None]:
# Automatically reload external Python files
%load_ext autoreload
%autoreload 2

import pickle

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import RobustScaler

from training import top_data, calc_stats
from data_cleaning import DataCleaner

## Prep the data

In [None]:
dep_var = "Log(Rmax)"
# dep_var = "Log(Efficiency)"
cleaner = DataCleaner(RobustScaler(), dep_var)
(train_X, train_y), (test_X, test_y) = top_data(dep_var, cleaner, 3)


In [None]:
# Save preprocessor if desired
# Use a context manager so the file is flushed and closed
with open("preprocessor.pkl", "wb") as f:
    pickle.dump(cleaner, f)


## Fit the model

In [None]:
model = RandomForestRegressor(n_estimators=1000)
model.fit(train_X, train_y)


In [None]:
# Save the model if desired
# Use a context manager so the file is flushed and closed
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
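
To reuse the saved objects later, they can be read back with `pickle.load`. The helper below is a minimal sketch; `load_pickle` is a hypothetical name and is not part of the `training` or `data_cleaning` modules.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from `path`, closing the file afterwards."""
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

For example, `cleaner = load_pickle("preprocessor.pkl")` and `model = load_pickle("model.pkl")` would restore the two objects saved in this notebook. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code, and load scikit-learn objects with the same library version that saved them.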


## Measure performance

In [None]:
# Testing score
pred_y = model.predict(test_X)
calc_stats(test_y, pred_y, prefix="Testing")
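
The implementation of `calc_stats` is not shown here, so the metrics it reports are unknown. As a rough illustration of the kind of summary such a function could produce, the sketch below computes three common regression metrics in plain Python. The name `summarize_errors`, its return format, and the choice of metrics are assumptions, not the project's actual code.

```python
import math


def summarize_errors(y_true, y_pred, prefix=""):
    """Return MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when the true values have zero variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    label = f"{prefix} " if prefix else ""
    return {f"{label}MAE": mae, f"{label}RMSE": rmse, f"{label}R2": r2}


stats = summarize_errors([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], prefix="Testing")
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

In this notebook it could be called as `summarize_errors(list(test_y), list(pred_y), prefix="Testing")`. Because the dependent variable is a logarithm, MAE and RMSE would be in log units. `sklearn.metrics` provides equivalents such as `mean_absolute_error` and `r2_score`.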
