# Example Train on Past (ToP) Training Code

Sample code that, using only data from *before* a given date, applies the appropriate data processing steps, trains a model, and then evaluates it with several metrics.

The data is split into a testing set, consisting of the data reported on a given date, and a training set, consisting of data from a given number of datasets preceding that date.
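The split described above can be sketched in plain Python. This is a minimal illustration only, not the implementation of `select_past` or `train_test_year`: the function `split_train_on_past`, the `edition` key (a `YYYYMM` integer modeled on the `TOP500_YYYYMM` file names), and the toy values are all assumptions made for this example.

```python
def split_train_on_past(rows, test_edition, n_past):
    """Split rows into (train, test) by list edition.

    rows: iterable of dicts with an integer "edition" key in YYYYMM form
          (a hypothetical layout chosen for this sketch).
    test_edition: the edition held out for testing.
    n_past: how many editions immediately before test_edition to train on.
    """
    rows = list(rows)
    # Editions strictly earlier than the test edition, oldest first
    earlier = sorted({r["edition"] for r in rows if r["edition"] < test_edition})
    # Keep only the n_past most recent of those
    train_editions = set(earlier[-n_past:]) if n_past > 0 else set()
    train = [r for r in rows if r["edition"] in train_editions]
    test = [r for r in rows if r["edition"] == test_edition]
    return train, test


# Toy rows with made-up target values, one per edition
rows = [
    {"edition": 201911, "log_rmax": 3.1},
    {"edition": 202006, "log_rmax": 3.3},
    {"edition": 202011, "log_rmax": 3.4},
    {"edition": 202106, "log_rmax": 3.6},
]
train, test = split_train_on_past(rows, test_edition=202106, n_past=2)
print([r["edition"] for r in train], [r["edition"] for r in test])
```

Because the training editions are chosen strictly before the test edition, no information from the test date can leak into the model.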

In [11]:
# Automatically reload external Python files
%load_ext autoreload
%autoreload 2

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import RobustScaler



In [12]:
from data_cleaning import prep_dataframe, preprocess_data, select_past
from read_data import read_datasets
from training import train_test_year, split_x_y, print_stats

## Prep the data

In [13]:
dep_var = "Log(Rmax)"
# dep_var = "Log(Efficiency)"
all_data = read_datasets()

data = prep_dataframe(all_data, dep_var)
data = select_past(data, 3)
data, _ = preprocess_data(data, dep_var, RobustScaler(), True)
train, test = train_test_year(data)

# Split both the training and testing sets into features (X) and target (y)
(train_X, train_y), (test_X, test_y) = split_x_y([train, test], dep_var)


TOP500_201906.xls
TOP500_202011.xlsx
TOP500_201911.xls
TOP500_202006.xlsx
TOP500_201811.xls
TOP500_201806.xls
TOP500_201206.xls
TOP500_201211.xls
TOP500_201406.xls
TOP500_202106.xlsx
TOP500_201611.xls
TOP500_201411.xls
TOP500_201606.xls
TOP500_201311.xls
TOP500_201306.xls
TOP500_201111.xls
TOP500_201511.xls
TOP500_201706.xls
TOP500_201506.xls
TOP500_201711.xls

Unknown processor: 'Vector Engine Type10AE', full name: 'Vector Engine Type10AE 8C 1.58GHz' @ Plasma Simulator, 2020
Unknown processor: 'Vector Engine Type10AE', full name: 'Vector Engine Type10AE 8C 1.58GHz' @ nan, 2020
Unknown processor: 'NEC', full name: 'NEC  3.200GHz' @ Earth Simulator, 2009
Unknown processor: 'NEC', full name: 'NEC  3.200GHz' @ Earth Simulator, 2009
Unknown processor: 'AMD EPYC 7763', full name: 'AMD EPYC 7763 64C 2.45GHz' @ Perlmutter, 2021
Unknown processor: 'Xeon Platinum 8368Q', full name: 'Xeon Platinum 8368Q 38C 2.6GHz' @ Maru, 2021
Unknown processor: 'Xeon Platinum 8368Q', full name: 'Xeon Platinum 8368Q 38C 2.6GHz' @ Guru, 2021
Unknown processor: 'Vector Engine Type20B', full name: 'Vector Engine Type20B 8C 1.6GHz' @ Earth Simulator -SX-Aurora TSUBASA, 2021
Unknown processor: 'Xeon Platinum 8360Y', full name: 'Xeon Platinum 8360Y 36C 2.4GHz' @ Raven-GPU, 2021
Unknown processor: 'Xeon Platinum 8368', full name: 'Xeon Platinum 8368 38C 2.4GHz' @ HoreKa-Gree
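
The preprocessing step above receives a `RobustScaler`. With its default settings, scikit-learn's `RobustScaler` centers each feature on its median and divides by the interquartile range, which makes it less sensitive to outliers than mean and standard deviation scaling. The sketch below reproduces that arithmetic for a single feature with NumPy; `robust_scale` is a hypothetical helper written for this illustration, and how `preprocess_data` actually applies the scaler is not shown in this notebook.

```python
import numpy as np


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q25, q75 = np.percentile(values, [25, 75])
    return (values - median) / (q75 - q25)


# One extreme value barely affects how the other points are scaled
scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 100.0])
print(scaled)
```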

## Fit the model

In [14]:
# 1,000 trees; no random_state is set, so scores vary slightly between runs
model = RandomForestRegressor(n_estimators=1000)
model.fit(train_X, train_y)


RandomForestRegressor(n_estimators=1000)

## Measure performance

In [15]:
# Testing score
pred_y = model.predict(test_X)
print_stats(test_y, pred_y, "Testing")


Testing R^2: 0.5814748366404364
Testing MAE: 0.35693911245747917
Testing MAPE: 0.0399896659057118
Testing MSE: 0.26027491144787035
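
For reference, the four reported metrics can be computed as in the sketch below. This is an assumption about what `print_stats` reports, not its actual implementation: `regression_stats` is a hypothetical helper, the input values are toy numbers, and MAPE is expressed as a fraction rather than a percentage, following the scikit-learn convention.

```python
import numpy as np


def regression_stats(true_y, pred_y):
    """Return R^2, MAE, MAPE (as a fraction), and MSE for two sequences."""
    true_y = np.asarray(true_y, dtype=float)
    pred_y = np.asarray(pred_y, dtype=float)
    err = true_y - pred_y
    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((true_y - true_y.mean()) ** 2)  # total sum of squares
    return {
        "R^2": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err) / np.abs(true_y)),
        "MSE": np.mean(err ** 2),
    }


# Toy values chosen only to exercise the formulas
stats = regression_stats([1.0, 2.0, 4.0], [1.0, 2.5, 3.5])
for name, value in stats.items():
    print(f"Testing {name}: {value}")
```

Since the dependent variable here is `Log(Rmax)`, the MAE and MSE above are in log units, so they describe relative rather than absolute errors in Rmax.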
