Cross-validation is a technique for estimating how well a model performs: the model is trained and validated on several different splits (folds) of the same dataset, and the per-fold scores are then combined, usually by averaging.

- For small datasets, where extra computational burden isn't a big deal, you should run cross-validation.
- For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.
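To make the idea of "different pairs of training and validation sets" concrete, here is a minimal sketch of how a 5-fold split divides a dataset. The 10-row array `X_demo` is a made-up toy example, not the housing data used below, and `KFold` is used here only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with one feature each
X_demo = np.arange(10).reshape(-1, 1)

# Without shuffling, each fold is a contiguous block of rows
kfold = KFold(n_splits=5)
folds = list(kfold.split(X_demo))

for i, (train_idx, valid_idx) in enumerate(folds):
    # Each row lands in the validation set of exactly one fold
    print(f"fold {i}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Every row is used for validation once and for training in the other four folds, which is why cross-validation costs roughly five times as much as a single split.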

In [2]:
import pandas as pd

data = pd.read_csv("melb_data.csv")

cols_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_use]
y = data.Price

Next, we can use a pipeline that bundles an imputer to fill in missing values and a random forest model to make predictions.

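Using pipelines is highly recommended for cross-validation. They keep the code easy to understand and use, and because the whole pipeline is refit on each training fold, the imputer never sees the validation fold it is being evaluated on.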

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer(strategy='median')),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

In [8]:
from sklearn.model_selection import cross_val_score

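# scikit-learn scorers follow a "higher is better" convention, so MAE is
# returned as a negative number; multiply by -1 to get the usual positive MAE.
scores = -1 * cross_val_score(pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')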

print(scores)
print(scores.mean())

[301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]
277707.3795913405
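For intuition, the sketch below spells out roughly what `cross_val_score` does for a regressor with `cv=5`: split the rows into five folds, fit a fresh copy of the pipeline on four of them, and score it on the fifth. It runs on synthetic data from `make_regression` (an assumption, so that it works without `melb_data.csv`), and the names `X_syn`, `y_syn`, and `demo_pipeline` are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the Melbourne housing data)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan  # blank out roughly 10% of values

demo_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer(strategy='median')),
    ('model', RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Manual 5-fold loop: fit on the training rows, score on the held-out rows
manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_syn):
    fold_model = clone(demo_pipeline)  # fresh, unfitted copy for each fold
    fold_model.fit(X_syn[train_idx], y_syn[train_idx])
    preds = fold_model.predict(X_syn[valid_idx])
    manual_scores.append(mean_absolute_error(y_syn[valid_idx], preds))
manual_scores = np.array(manual_scores)

# The one-line equivalent used above
cv_scores = -1 * cross_val_score(demo_pipeline, X_syn, y_syn, cv=5,
                                 scoring='neg_mean_absolute_error')

print(manual_scores)
print(cv_scores)
```

The two arrays should agree, since an integer `cv` on a regressor uses unshuffled K-fold splitting and the forest's `random_state` is fixed. The manual loop is mainly useful when you need per-fold predictions or custom logic; otherwise `cross_val_score` is the shorter option.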
