# Cross-Validation

### What is Cross-Validation?

Cross-validation extends the train-test split approach to measuring model quality on held-out data. It gives you a more reliable measure of your model's quality, although it takes longer to run.

### The Cross-Validation Procedure

In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality. For example, we could have 5 folds or experiments. We divide the data into 5 pieces, each being 20% of the full dataset.

<img src="img/ex1.png">

First, we run experiment 1, which uses the first fold as a holdout set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set, much as we would get from a simple train-test split. Then we run a second experiment, where we hold out the second fold and train on the rest. This gives us a second estimate of model quality. We repeat this process, using every fold once as the holdout. Putting this together, 100% of the data is used as a holdout at some point.
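
The fold rotation described above can be sketched with scikit-learn's `KFold`. This is only an illustration on a made-up array of ten rows (`X_demo` is a hypothetical stand-in, not the housing data used later), so each fold holds 2 rows instead of 20% of a real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical rows standing in for a dataset.
X_demo = np.arange(10).reshape(-1, 1)

# Five folds, no shuffling, so each fold is a contiguous block of rows.
kfold = KFold(n_splits=5)

holdout_seen = []
for experiment, (train_idx, holdout_idx) in enumerate(kfold.split(X_demo), start=1):
    print(f"Experiment {experiment}: holdout rows {holdout_idx.tolist()}, "
          f"training rows {train_idx.tolist()}")
    holdout_seen.extend(holdout_idx.tolist())

# Every row lands in a holdout set exactly once across the five experiments.
print(sorted(holdout_seen) == list(range(10)))
```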

### Advantages and Disadvantages

Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. It does not reduce overfitting by itself, but it makes overfitting easier to detect because the estimate no longer depends on one particular split. However, it can take more time to run, because it fits a separate model for each fold. So it is doing more total work.

If you are unsure which approach to use, you can run cross-validation once and check whether the scores from each experiment are close. If every experiment gives roughly the same result, a single train-test split is probably sufficient.
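
One way to make "close" concrete is to compare the spread of the per-fold scores to their mean. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_regression` rather than the housing data, and how small the relative spread must be before a single split is "good enough" is a judgment call, not a fixed rule.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical, for illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# scikit-learn returns negated MAE, so flip the sign to get positive errors.
fold_mae = -cross_val_score(model, X_demo, y_demo, cv=5,
                            scoring='neg_mean_absolute_error')

# Standard deviation relative to the mean: small values mean the folds agree.
spread = fold_mae.std() / fold_mae.mean()

print("MAE per fold:", fold_mae.round(2))
print("Relative spread: %.3f" % spread)
```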

### Trade-offs Between Cross-Validation and Train-Test Split

On small datasets, the extra computational burden of running cross-validation isn't a big deal, and a single small holdout set gives a noisy estimate anyway. On larger datasets, a simple train-test split is often sufficient. It will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

## Example

In [7]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score



In [2]:
data = pd.read_csv('melb_data.csv')
data.columns

Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',
       'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
       'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
       'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [3]:
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price

Then specify a pipeline of our modeling steps. (It can be very difficult to do cross-validation properly if you aren't using pipelines.)

In [6]:
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
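
A likely reason pipelines matter here is that preprocessing such as imputation should be learned from the training folds only, never from the holdout fold. The sketch below illustrates this on a tiny made-up array; `X_demo`, `y_demo`, the index lists, and the use of `LinearRegression` are all hypothetical choices for the illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset with missing values and one extreme holdout row.
X_demo = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 50.0])

train_idx = [0, 1, 2, 3]
holdout_idx = [4, 5]

demo_pipeline = make_pipeline(SimpleImputer(), LinearRegression())

# Fitting the pipeline fits the imputer on the training rows only.
demo_pipeline.fit(X_demo[train_idx], y_demo[train_idx])

# make_pipeline names each step after its lowercased class name.
learned_mean = demo_pipeline.named_steps['simpleimputer'].statistics_[0]
full_mean = np.nanmean(X_demo)

print("Mean learned from training rows:", learned_mean)
print("Mean over all rows (would leak holdout data):", full_mean)

# Missing holdout values are filled with the training mean at predict time.
holdout_preds = demo_pipeline.predict(X_demo[holdout_idx])
```

When `cross_val_score` receives a pipeline, it refits the whole pipeline inside each experiment, so this separation is handled for every fold.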

Finally, get the cross-validation scores:

In [11]:
scores = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
print(scores)

[-330138.72131734 -313920.30270039 -298760.99547024 -242550.80322356
 -248624.2845189 ]


In [9]:
print('Mean Absolute Error %2f' %(-1 * scores.mean()))

Mean Absolute Error 286304.017764
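
The scores are negative because scikit-learn scorers follow a "higher is better" convention, so mean absolute error is reported negated; that is why the result is multiplied by -1 above. As a rough sketch of what `cross_val_score` does internally, the loop below refits a fresh copy of a pipeline on each set of training folds and scores it on the holdout fold. It uses synthetic data and fixed `random_state` values (both assumptions made so the comparison is reproducible), not the housing data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data (hypothetical, for illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

demo_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)
kfold = KFold(n_splits=5)

# Manual version: one fresh, unfitted copy of the pipeline per experiment.
manual_mae = []
for train_idx, holdout_idx in kfold.split(X_demo):
    fold_model = clone(demo_pipeline)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[holdout_idx])
    manual_mae.append(mean_absolute_error(y_demo[holdout_idx], preds))

# Library version on the same folds; scores come back negated.
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=kfold,
                              scoring='neg_mean_absolute_error')

print("Manual MAE per fold:", np.round(manual_mae, 2))
print("Negated library scores:", (-demo_scores).round(2))
print("Match:", np.allclose(-demo_scores, manual_mae))
```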


## Conclusion

Using cross-validation gave us a more reliable measure of model quality, with the added benefit of cleaning up our code (we no longer need to keep track of separate train and test sets). So, it's a good win.