## Train, Validate, Test


### Recap

- We used R-squared and MAE as **metrics** that quantify how closely predicted values match true values
- We **separated validation data** from training data because we care about a model's performance on future observations, not how well it does on its own training data
- We've been using the RF model's handy **OOB (out-of-bag) samples** as a substitute for a validation set
- Accuracy metrics derived from the OOB samples are excellent estimates of the true validation scores **but only for time-insensitive data**
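
As a quick reminder of what the two metrics measure, here is a minimal sketch using scikit-learn's `r2_score` and `mean_absolute_error`. The three true and predicted values are made up for illustration and do not come from any dataset in these notes.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted values, chosen to keep the arithmetic simple
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# MAE: average absolute difference, in the units of the target
mae = mean_absolute_error(y_true, y_pred)

# R-squared: 1 - (sum of squared residuals / total sum of squares)
r2 = r2_score(y_true, y_pred)

print("MAE:", mae)
print("R-squared:", r2)
```

MAE is in the target's own units, while R-squared is unitless and equals 1 for perfect predictions.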

### Time-sensitive Datasets

- Inflation alone means that future prices far beyond the training period will be much higher. An RF bulldozer price predictor trained on data from years 2000-2005 won't make accurate predictions for bulldozers sold in 2020
- Metrics derived from OOB samples are, therefore, overly optimistic about the generality of a model and how it will perform on future predictions
- We must obtain a validation set beyond the date range of the training set in order to properly measure an RF's accuracy on time-sensitive data
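
The effect described above can be illustrated with a small synthetic experiment. This is a sketch under stated assumptions, not the bulldozer data: the `t` (sale time) and `size` features, the linear price drift, and the noise level are all invented for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 600

# Hypothetical features: sale time in years (sorted) and a machine-size measure
t = np.sort(rng.uniform(0, 20, n))
size = rng.uniform(1, 10, n)

# Synthetic prices that drift upward over time, mimicking inflation
price = 10 * size + 10 * t + rng.normal(0, 2, n)

X = np.column_stack([t, size])

# Train on the earliest 70% of records, keep the latest 30% as "the future"
cut = int(0.7 * n)
X_past, y_past = X[:cut], price[:cut]
X_future, y_future = X[cut:], price[cut:]

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_past, y_past)

oob_r2 = rf.oob_score_                    # estimated from records inside the training period
future_r2 = rf.score(X_future, y_future)  # measured on records after the training period

print("OOB R-squared:   ", oob_r2)
print("Future R-squared:", future_r2)
```

Because a random forest cannot predict prices above those it saw in training, the score on the later records should come out well below the OOB score in this setup.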

### The testing trilogy

- Training, validation, and test sets
- The model trains just on the training set and model accuracy is evaluated using the validation set during development
- We run the test set through the model to get our final measure of model accuracy and generality
- **The only true measure of model generality comes from computing metrics on a test set that has never previously been run through the model**

#### Holdout Method for splitting time-insensitive datasets

Sample Code:

**An example:**

[Diabetes Dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset)

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
df = datasets.load_diabetes(as_frame=True)['frame']

df

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 0.041708 | 0.050680 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 | 178.0 |
| 438 | -0.005515 | 0.050680 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018118 | 0.044485 | 104.0 |
| 439 | 0.041708 | 0.050680 | -0.015906 | 0.017282 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046879 | 0.015491 | 132.0 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.026560 | 0.044528 | -0.025930 | 220.0 |


In [2]:
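df = df.sample(frac=1) # shuffle rows (train_test_split also shuffles by default)
df_dev, df_test = train_test_split(df, test_size=0.15) # hold out 15% of all rows as the test set
df_train, df_valid = train_test_split(df_dev, test_size=0.15) # 15% of the remaining rows become the validation set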

In [3]:
df_train.head(1)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | -0.08543 | -0.044642 | -0.00405 | -0.009113 | -0.002945 | 0.007767 | 0.022869 | -0.039493 | -0.061177 | -0.013504 | 68.0 |


In [4]:
df_valid.head(1)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 246 | 0.041708 | -0.044642 | -0.032073 | -0.061904 | 0.079612 | 0.050982 | 0.056003 | -0.009972 | 0.045066 | -0.059067 | 78.0 |


In [5]:
X_train = df_train.drop('target', axis=1)
y_train = df_train['target']


X_valid = df_valid.drop('target', axis=1)
y_valid = df_valid['target']

In [6]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_jobs=-1, oob_score=True) 

rf.fit(X_train, y_train) 

print("OOB R-squared: ", rf.oob_score_)
print("Training R-squared: ", rf.score(X_train, y_train))
print("Validation R-squared: ", rf.score(X_valid, y_valid))

OOB R-squared:  0.42006983361357786
Training R-squared:  0.9226899552920524
Validation R-squared:  0.42586230214778553


Because we're selecting validation and test sets randomly, it's possible that a set will contain a **disproportionate number of outlier records**, such as really expensive bulldozers. Such a set is not representative of the data as a whole and can yield pessimistic accuracy metrics.

#### k-fold cross validation Method for splitting time-insensitive datasets

- Splits the dataset into k chunks of (roughly) equal size. We train the model on k-1 chunks and evaluate it on the remaining chunk, repeating the procedure k times so that every chunk gets used once as a **validation set**
- The overall validation error is the average of the k validation errors

![](kfold.png)

Sample Code:

Creating arrays from `dev` set:

In [7]:
X_dev = df_dev.drop('target', axis=1)
y_dev = df_dev['target']

In [8]:
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(n_jobs=-1, oob_score=True) 

scores = cross_val_score(rf, X_dev, y_dev, cv=5) # k=5

print(scores)
print(scores.mean())

[0.42637875 0.37750823 0.39387314 0.43333058 0.35154863]
0.3965278667039188
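
To make the procedure concrete, the loop that `cross_val_score` runs can be sketched with an explicit `KFold` split. This is a minimal sketch, not the exact internals of scikit-learn: it uses the full diabetes dataset rather than `df_dev` so that it is self-contained, and the `n_estimators=50` and `random_state=0` settings are arbitrary choices for the example. For a regressor, `cv=5` corresponds to an unshuffled 5-fold `KFold`.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Load features and target directly (self-contained; the notes use df_dev instead)
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

kf = KFold(n_splits=5)  # k=5 contiguous, unshuffled folds
fold_scores = []

for train_idx, valid_idx in kf.split(X):
    # A fresh model per fold, trained on k-1 chunks
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X.iloc[train_idx], y.iloc[train_idx])
    # R-squared on the one chunk held out for this fold
    fold_scores.append(rf.score(X.iloc[valid_idx], y.iloc[valid_idx]))

fold_scores = np.array(fold_scores)
mean_score = fold_scores.mean()  # overall validation score is the average over folds

print(fold_scores)
print(mean_score)
```

Each record is used for validation exactly once, which makes the averaged score less sensitive to one unlucky random split than a single holdout set.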


### Hyperparameter Tuning

In [9]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

# specify parameters and distributions to sample from
# (sp_randint excludes its upper bound, so max_features is drawn from 1..n_features-1)
param_dist = {"max_features": sp_randint(1, X_dev.shape[1]),
              "n_estimators": sp_randint(100, 1000)}

rf = RandomForestRegressor(n_jobs=-1, oob_score=True) 

random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=10, cv=5, random_state=42)

random_search.fit(X_dev, y_dev)

RandomizedSearchCV(cv=5,
                   estimator=RandomForestRegressor(n_jobs=-1, oob_score=True),
                   param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x000002655BB23880>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x000002655BAECB80>},
                   random_state=42)

**Best Parameters:**

In [10]:
print(random_search.best_params_)

{'max_features': 2, 'n_estimators': 443}


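Using the best parameters to make predictions on the test set (by default, `RandomizedSearchCV` refits the best estimator on the full dev set, and `predict` delegates to it):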

In [11]:
from sklearn.metrics import r2_score

X_test = df_test.drop('target', axis=1)
y_test = df_test['target']

y_preds_test = random_search.predict(X_test)

r2_score(y_test, y_preds_test)

0.45953874183530996

### Splitting time-sensitive datasets

The process for extracting training, validation, and test sets for time-sensitive data is:

- Sort the records by date, earliest to latest
- Extract the last, say, 15% of the records as `df_test`
- Extract the second to last 15% of the records as `df_valid`
- The remaining 70% of the original data is `df_train`
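
The steps above can be sketched with pandas as follows. This is a minimal illustration on synthetic data, not the bulldozer dataset: the `saledate` and `price` column names and the randomly generated values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical records with a sale date and a price
df = pd.DataFrame({
    "saledate": pd.to_datetime("2000-01-01")
                + pd.to_timedelta(rng.integers(0, 3650, n), unit="D"),
    "price": rng.uniform(10_000, 90_000, n),
})

# 1. Sort the records by date, earliest to latest
df = df.sort_values("saledate").reset_index(drop=True)

# 2. Compute the sizes of the test and validation sets (15% each)
n_test = round(0.15 * n)
n_valid = round(0.15 * n)

# 3. Slice from the end: latest records are test, the block before is validation
df_test = df.iloc[-n_test:]
df_valid = df.iloc[-(n_test + n_valid):-n_test]
df_train = df.iloc[:-(n_test + n_valid)]

print(len(df_train), len(df_valid), len(df_test))
```

Unlike `train_test_split`, nothing is shuffled here, so every validation record is dated no earlier than any training record, and every test record no earlier than any validation record.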

See an example in [Chapter 9](https://mlbook.explained.ai/bulldozer-testing.html)

Sample Code for Bulldozer data:

### Rectifying training and validation sets

Important rules for preparing separated training and test sets:

- Transformations must be applied to features consistently across data subsets.
- Transformations of validation and test sets can only use data derived from the training set.
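
As one illustration of these rules, consider filling in missing values with a median. The sketch below assumes a hypothetical numeric column named `YearMade` with made-up values. The median is computed from the training set only and then reused, unchanged, on the validation set.

```python
import numpy as np
import pandas as pd

# Hypothetical training and validation frames with missing values
df_train = pd.DataFrame({"YearMade": [1995.0, 2000.0, np.nan, 2005.0]})
df_valid = pd.DataFrame({"YearMade": [np.nan, 2010.0]})

# Derive the fill value from the training set only
train_median = df_train["YearMade"].median()

# Apply the same transformation, with the same value, to both subsets
df_train["YearMade"] = df_train["YearMade"].fillna(train_median)
df_valid["YearMade"] = df_valid["YearMade"].fillna(train_median)

print(train_median)
print(df_valid)
```

Computing a separate median from `df_valid` would leak information from the validation set into the preprocessing and would also make the two subsets inconsistent with each other.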