## Cross-Validation
- In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality.

- For example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset. In this case, we say that we have broken the data into 5 "folds".

- Then, we run one experiment for each fold:
  - In Experiment 1, we use the first fold as a validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.
  - In Experiment 2, we hold out data from the second fold (and use everything except the second fold for training the model). The holdout set is then used to get a second estimate of model quality.
  - We repeat this process, using every fold once as the holdout set. Putting this together, 100% of the data is used as holdout at some point, and we end up with a measure of model quality that is based on all of the rows in the dataset (even if we don't use all rows simultaneously).
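
The fold rotation described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper `k_fold_indices` and the 10-row toy dataset are hypothetical names introduced here, and the folds are assumed to be contiguous and unshuffled.

```python
def k_fold_indices(n_rows, n_folds=5):
    """Yield (train_idx, holdout_idx) pairs, one pair per fold.

    Hypothetical helper for illustration: folds are contiguous and unshuffled.
    """
    # Spread any remainder over the first folds so sizes differ by at most one row.
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        holdout = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, holdout
        start = stop


n_rows = 10  # toy dataset size, chosen only to keep the printout short
held_out_counts = [0] * n_rows

for experiment, (train, holdout) in enumerate(k_fold_indices(n_rows), start=1):
    print(f"Experiment {experiment}: holdout={holdout}, train={train}")
    for i in holdout:
        held_out_counts[i] += 1

# Every row is used as holdout exactly once across the 5 experiments.
print(held_out_counts)
```

Each experiment trains on 80% of the rows and validates on the remaining 20%, and the final count confirms that no row is held out twice or skipped.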

In [2]:
import pandas as pd

path = "./archive/melb_data.csv"
data = pd.read_csv(path)

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

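# Bundle imputation and the model in one pipeline so that, during cross-validation,
# the imputer is fit on each fold's training data only
my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50, random_state=42))
                             ])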

In [5]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 because sklearn reports negative MAE
# (its scorers follow a "higher is better" convention)
# cv=5 splits the data into 5 folds
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring="neg_mean_absolute_error")

print("MAE scores:\n", scores)

MAE scores:
 [297064.72238276 300239.56062646 287168.95506934 237234.84661343
 260589.85501224]


In [6]:
print("Average MAE score (across experiments):")
print(scores.mean())

Average MAE score (across experiments):
276459.5879408453
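
As a small follow-up, the per-fold scores can be summarized with their spread as well as their mean, which gives a rough sense of how much the estimate varies from fold to fold. The sketch below copies the five MAE values printed above; the names `fold_mae`, `avg`, and `spread` are illustrative and not part of the original notebook.

```python
from statistics import mean, stdev

# Per-fold MAE values copied from the printed output above
fold_mae = [
    297064.72238276,
    300239.56062646,
    287168.95506934,
    237234.84661343,
    260589.85501224,
]

avg = mean(fold_mae)      # same quantity as scores.mean()
spread = stdev(fold_mae)  # sample standard deviation across the 5 folds

print(f"Mean MAE: {avg:,.0f}")
print(f"Std. dev. across folds: {spread:,.0f}")
print(f"Best fold: {min(fold_mae):,.0f}, worst fold: {max(fold_mae):,.0f}")
```

A large spread relative to the mean suggests that a single holdout split could have given a misleading picture of model quality.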
