In this notebook, we're going to learn about *cross-validation* and how it works.<br>
For this, we use the [Melbourne Housing Snapshot](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot) dataset.<br>
Before that, what does the term **CROSS-VALIDATION** mean in machine learning?<br>
Cross-validation means splitting the main dataset into parts (folds) and letting each part take a turn as the validation set while the model is trained on the rest, so the model is evaluated on several different validation sets instead of just one.<br>
Choosing a validation set comes with some common problems, and different approaches deal with them differently. Some of these approaches are:
* k-Fold
* Stratified k-Fold
* Leave one out
and some others. Let's get an idea of how each of these works on a dataset.

# K-Fold
In this process, we split the dataset (optionally after shuffling) into k equally sized parts, called folds, and use one fold at a time as the validation set, e.g. k = 5, so each validation set holds 20% of the data.<br>
If we have 5000 examples and k = 5, each fold contains 1000 examples.<br>
We then train on the other four folds and predict on the held-out fold, repeating this for every fold; *imagine the folds {v1, v2, v3, v4, v5}*, each one selected as the validation set in turn.<br>
But the problem is that, if the main dataset has a ratio of 90% positive and 10% negative examples, the k-Fold technique does not guarantee that this ratio is maintained.<br>
More specifically, each fold is not necessarily balanced with the same 90% +ve and 10% -ve ratio as the main dataset. [Click here](https://scikit-learn.org/stable/modules/cross_validation.html) to learn more.
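
The imbalance problem can be seen in a small sketch. The toy labels below (18 positive, 2 negative) and the names `X_toy`, `y_toy` and `negatives_per_fold` are made up for illustration and are not part of the housing data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy labels (illustrative only): 18 positive (1) and 2 negative (0) examples.
y_toy = np.array([1] * 18 + [0] * 2)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Without shuffling, KFold cuts the data into consecutive blocks.
kfold = KFold(n_splits=5)

negatives_per_fold = []
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X_toy), start=1):
    # Count how many negative examples ended up in this validation fold.
    n_neg = int((y_toy[valid_idx] == 0).sum())
    negatives_per_fold.append(n_neg)
    print(f"fold {fold}: validation size = {len(valid_idx)}, negatives = {n_neg}")
```

Here both negative examples land in the last fold, so the other four validation sets contain no negative example at all.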

# Stratified k-Fold
This process is a variation of k-Fold: it also provides train/test indices to split the data into train/test sets, but it builds the folds using the class labels, so it applies to classification problems.<br>
So, it preserves the ratio of positive to negative examples in each fold. [Click here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) to learn more.
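
As a minimal sketch of this behaviour, the example below uses made-up toy labels with the same 90% / 10% ratio (45 positive, 5 negative). The names `X_toy`, `y_toy` and `negatives_per_fold` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (illustrative only): 45 positive (1) and 5 negative (0) examples.
y_toy = np.array([1] * 45 + [0] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5)

negatives_per_fold = []
# Unlike KFold, the split needs the labels so it can balance the classes.
for fold, (train_idx, valid_idx) in enumerate(skf.split(X_toy, y_toy), start=1):
    n_neg = int((y_toy[valid_idx] == 0).sum())
    negatives_per_fold.append(n_neg)
    print(f"fold {fold}: validation size = {len(valid_idx)}, negatives = {n_neg}")
```

Each validation fold of 10 examples receives one negative example, matching the 10% share in the full toy dataset.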

# Leave one Out
In this process, each fold contains exactly 1 validation example while the remaining samples form the training set, so a dataset with n examples gives n folds. [Click here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html) to learn more.<br>
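
A short sketch on a made-up array of four samples shows the splits this produces. `X_toy` and `splits` are illustrative names only.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Four toy samples (illustrative only).
X_toy = np.array([[10], [20], [30], [40]])

loo = LeaveOneOut()
splits = list(loo.split(X_toy))

for train_idx, valid_idx in splits:
    # Every split holds out exactly one sample for validation.
    print("train:", train_idx, "validation:", valid_idx)

print("number of folds:", loo.get_n_splits(X_toy))
```

With n samples this means n model fits, which is why leave-one-out tends to be practical only for small datasets.
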
So, now we're going to start the code for our housing data. As usual, let's load the input data into X and the output data into y.

In [1]:
import pandas as pd

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select subset of predictors
features_cols = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[features_cols]

# Select target
y = data.Price

Let's print X and y to see what they look like.

In [2]:
X.head()

   Rooms  Distance  Landsize  BuildingArea  YearBuilt
0      2       2.5     202.0           NaN        NaN
1      2       2.5     156.0          79.0     1900.0
2      3       2.5     134.0         150.0     1900.0
3      3       2.5      94.0           NaN        NaN
4      4       2.5     120.0         142.0     2014.0


In [3]:
y.head()

0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64

It is possible to do cross-validation without a pipeline, but it is not a good idea: the preprocessing would have to be refit by hand on the training part of every fold.<br>
A pipeline makes the code remarkably straightforward, so we're going to use one.<br>
We'll use an imputer for missing values as the preprocessor and a RandomForestRegressor as the model.

In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# defining pipeline 
my_pipeline = Pipeline(steps = [
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators = 50, random_state = 0))
])
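
For comparison, here is a rough sketch of what cross-validating the same two steps by hand might look like. It runs on synthetic stand-in data rather than the Melbourne dataset, and the names `X_demo`, `y_demo` and `manual_scores` are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (not the Melbourne dataset) with some missing values.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

manual_scores = []
for train_idx, valid_idx in KFold(n_splits=5).split(X_demo):
    # Fit the imputer on the training fold only, then apply it to both parts,
    # so no information from the validation fold leaks into preprocessing.
    imputer = SimpleImputer()
    X_train = imputer.fit_transform(X_demo[train_idx])
    X_valid = imputer.transform(X_demo[valid_idx])

    # Train a fresh model for every fold.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_demo[train_idx])

    predictions = model.predict(X_valid)
    manual_scores.append(mean_absolute_error(y_demo[valid_idx], predictions))

print("MAE per fold:", manual_scores)
```

The pipeline bundles the imputer and the model into one estimator, so the fit/transform bookkeeping in this loop does not have to be written by hand.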

Now, we'll use the *cross_val_score()* function ([click here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) to see the documentation) to obtain cross-validation scores. Here, we'll use *neg_mean_absolute_error* as the scoring.<br>
Since this scoring returns negative values, we multiply the result by **-1** to get the usual MAE. We pass the whole X and y to the function, as it does all the splitting into folds for us.

In [5]:
from sklearn.model_selection import cross_val_score

# cv is the number of folds (k actually)
my_scores = -1 * cross_val_score(my_pipeline,
                                X,
                                y,
                                cv = 5,
                                scoring  = 'neg_mean_absolute_error')

print('MAE scores:\n', my_scores)


MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


See! We got one mean absolute error for each validation set (k = 5).<br>
It is a little surprising that we specify negative MAE. Scikit-learn has a convention where all scoring metrics are defined so that a higher number is better. Using negatives here keeps MAE consistent with that convention, though negative MAE is almost unheard of elsewhere.<br>
Now, let's average the scores to get their mean value.

In [6]:
print("Average MAE of our scores:")
print(my_scores.mean())

Average MAE of our scores:
277707.3795913405
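
The sign convention for *neg_mean_absolute_error* can also be checked in isolation. The sketch below uses synthetic data and a plain `LinearRegression` purely for illustration. Neither is part of the notebook's housing model, and `X_demo`, `y_demo`, `raw_scores` and `mae_scores` are made-up names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=60)

# The raw scores follow the "higher is better" convention, so they are <= 0.
raw_scores = cross_val_score(LinearRegression(),
                             X_demo,
                             y_demo,
                             cv=5,
                             scoring='neg_mean_absolute_error')

# Flipping the sign gives the ordinary (non-negative) MAE per fold.
mae_scores = -1 * raw_scores

print("raw scores:", raw_scores)
print("MAE per fold:", mae_scores)
print("mean MAE:", mae_scores.mean(), "std:", mae_scores.std())
```

Printing the standard deviation next to the mean gives a rough idea of how much the error varies from fold to fold.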


**Yes! We're done with cross-validation**<br>
The good thing is that with cross-validation we no longer need to keep track of separate training and validation sets.<br>
So, especially for small datasets, it is a good improvement over a single train/validation split!

