# Task 2.4 – Cross Validation

### What is Overfitting/Underfitting a Model?
As mentioned, in statistics and machine learning we usually split our data into two subsets, training data and testing data (and sometimes into three: train, validate and test). We fit our model on the training data and then make predictions on the test data. When we do that, one of two things might go wrong: we overfit our model or we underfit our model. We want to avoid both, because they hurt the predictive power of our model: we might end up with a model that has lower accuracy and/or does not generalize (meaning its predictions don’t carry over to other data). Let’s see what underfitting and overfitting actually mean:
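The three-way split mentioned above can be sketched with two calls to train_test_split. This is only a minimal illustration on made-up data; the array contents, the 60/20/20 proportions and the `random_state` value are my assumptions, not something the dataset in this task requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 3 features each
X_toy = np.arange(300).reshape(100, 3)
y_toy = np.arange(100)

# Step 1: hold out 20% of all rows as the final test set
X_rest, X_test3, y_rest, y_test3 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80 rows into train (60) and validation (20);
# 0.25 of the remaining 80 rows equals 20% of the original data
X_train3, X_val3, y_train3, y_val3 = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(X_train3.shape, X_val3.shape, X_test3.shape)  # (60, 3) (20, 3) (20, 3)
```

The validation set is typically used to compare models or settings, while the test set is kept untouched until the very end.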

#### Overfitting
Overfitting means that the model we trained has learned the training dataset “too well” and now fits it too closely. This usually happens when the model is too complex (i.e., too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be inaccurate on unseen or new data. That is because the model does not generalize (or does not generalize AS well), meaning you can’t carry the results over and make inferences on other data, which is, ultimately, what you are trying to do. Basically, when this happens, the model learns or describes the “noise” in the training data instead of the actual relationships between variables in the data. This noise, obviously, isn’t part of any new dataset, so whatever the model learned from it cannot be applied there.

#### Underfitting
In contrast to overfitting, an underfitted model does not fit the training data well and therefore misses the trends in the data. It also means the model cannot be generalized to new data. As you probably guessed (or figured out!), this is usually the result of a model that is too simple (not enough predictors/independent variables). It can also happen when, for example, we fit a linear model (like linear regression) to data that is not linear. It almost goes without saying that this model will have poor predictive ability, both on the training data and on any other data.
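To make both failure modes concrete, here is a small sketch that fits polynomials of increasing degree to noisy synthetic data and compares training and test error. Everything in it (the cosine-shaped data, the noise level, the degrees 1, 4 and 15, the seed) is an assumption chosen for illustration, not part of the diabetes example used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)

def make_data(n):
    """Draw n noisy samples from a non-linear (cosine) relationship."""
    x = rng.uniform(0, 1, n)
    y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.1, size=n)
    return x.reshape(-1, 1), y

X_fit, y_fit = make_data(30)     # small training set
X_new, y_new = make_data(200)    # unseen data from the same process

train_mse, test_mse = {}, {}
for degree in (1, 4, 15):
    # Higher degree = more flexible (more complex) model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_fit, y_fit)
    train_mse[degree] = mean_squared_error(y_fit, model.predict(X_fit))
    test_mse[degree] = mean_squared_error(y_new, model.predict(X_new))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.4f}  "
          f"test MSE={test_mse[degree]:.4f}")
```

The straight line (degree 1) should show a high error on both sets, which is the underfitting pattern. The degree-15 model will usually reach a very low training error while doing noticeably worse on the unseen data, which is the overfitting pattern.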

![](https://cdn-images-1.medium.com/max/800/1*tBErXYVvTw2jSUYK7thU2A.png)

It is worth noting that underfitting is not as prevalent as overfitting. Nevertheless, we want to avoid both of those problems in data analysis. You might say we are trying to find the middle ground between underfitting and overfitting our model. As you will see, train/test split and cross validation help us detect and avoid overfitting more than underfitting. Let’s dive into both of them!

### Train/Test Split
As stated before, the data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data so that it can generalize to other data later on. We keep the test dataset (or subset) aside in order to test our model’s predictions on data it has not seen during training.

![](https://cdn-images-1.medium.com/max/800/1*-8_kogvwmL1H6ooN1A1tsQ.png)

Let’s see how to do this in Python. We’ll do this using the Scikit-Learn library and specifically the train_test_split function. We’ll start with importing the necessary libraries:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
```

Let’s load in the diabetes dataset:

### Task 2.4-1:  Recap Linear Regression

Using this dataframe, 
```python
df = pd.read_csv('./data/2_4_diabetes.csv', index_col=0)

```
Do the following:

1. Apply a train test split with a test size of 0.2
2. Print shapes of resulting data sets
3. Fit a linear regression model
4. Calculate the MAE

```python
df = pd.read_csv('./data/2_4_diabetes.csv', index_col=0)
df.head()

# all columns except the last are features; the last column is the target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
```

Now we can use the train_test_split function in order to make the split. The test_size=0.2 inside the function indicates the fraction of the data that should be held out for testing (here 20%). Typical train/test ratios are around 80/20 or 70/30.

```python
# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

```
(353, 10) (353,)
(89, 10) (89,)
```
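If you wonder where 353 and 89 come from: the shapes above imply 442 rows in total, and 20% of 442 is 88.4, which train_test_split rounds up to 89 test rows. The sketch below checks that arithmetic on placeholder arrays of the same shape (the zeros and the `random_state` are my own stand-ins, not the real data).

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 442                      # 353 + 89, taken from the shapes printed above
X_demo = np.zeros((n_samples, 10))   # placeholder features, same shape as the dataset
y_demo = np.zeros(n_samples)         # placeholder target

n_test = math.ceil(0.2 * n_samples)  # 88.4 is rounded up to 89
n_train = n_samples - n_test         # the remaining 353 rows

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(n_train, n_test)               # 353 89
print(X_tr.shape, X_te.shape)        # (353, 10) (89, 10)
```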


```python
regressor = LinearRegression()

# fit on the training data only
regressor.fit(X_train, y_train)

# predict the target for the held-out test data
y_pred = regressor.predict(X_test)
```

As you can see, we’re fitting the model on the training data and trying to predict the test data. Let’s see what (some of) the predictions are:

```python
y_pred[:5]
```

```
array([218.31009366, 146.06742371, 151.11325926, 170.84488206,
       242.44287695])
```

Let’s score the model:

```python
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
```

```
Mean Absolute Error: 49.6492913997945
```
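The MAE is simply the average of the absolute differences between the true and the predicted values. As a quick sanity check, here is a sketch that computes it by hand and compares it with scikit-learn; the four value pairs are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true values and predictions (not taken from the diabetes data)
y_true_demo = np.array([100.0, 150.0, 200.0, 250.0])
y_pred_demo = np.array([110.0, 140.0, 230.0, 250.0])

# Absolute errors are 10, 10, 30 and 0, so the mean is 50 / 4 = 12.5
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)

print(mae_manual, mae_sklearn)  # 12.5 12.5
```

Because the MAE is in the same units as the target, a value of about 50 means the predictions are off by roughly 50 target units on average.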


We’ve loaded in the data, split it into training and testing sets, fitted a regression model to the training data, made predictions for the test data and scored those predictions against the known test values. Seems good, right? But train/test split does have its dangers: what if the split we make isn’t random? What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women or only people at a certain age? (Imagine a file ordered by one of these.) The model is then trained on data that doesn’t represent the data it is tested on, so it may fit patterns that are specific to one subgroup, and the single test score we get can be misleading. Note that train_test_split shuffles the rows by default, but even then the result of a single split depends on which rows happen to land in the test set. This is where cross validation comes in.
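Here is a minimal sketch of the “ordered file” problem. It uses a hypothetical `age` column that is already sorted, and compares a split without shuffling (via the `shuffle=False` parameter) to the default shuffled split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical "file ordered by age": 100 rows, sorted ascending
age = np.arange(100)

# Without shuffling, the test set is simply the last 20 rows
train_ns, test_ns = train_test_split(age, test_size=0.2, shuffle=False)
print("unshuffled test ages:", test_ns.min(), "to", test_ns.max())  # 80 to 99

# The default behaviour shuffles first, so the test set is spread over all ages
train_s, test_s = train_test_split(age, test_size=0.2, random_state=0)
print("shuffled test ages:", np.sort(test_s))
```

In the unshuffled case the model would never see anyone older than 79 during training and would be tested only on people aged 80 and above.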

### Cross Validation

In the previous section, I mentioned the caveats of the train/test split method. To mitigate them, we can perform something called cross validation. It’s very similar to train/test split, but it’s applied to more subsets: we split our data into k subsets and train on k-1 of them, holding out the remaining subset for testing. We can repeat this so that each of the subsets serves as the test set once.

#### K-Folds Cross Validation
In K-Folds Cross Validation we split our data into k different subsets (or folds). We use k-1 subsets to train our model and leave the remaining subset (or fold) as test data. We repeat this k times, so that each fold is used as test data exactly once, and then average the scores across the folds before we finalize our model. If a separate test set was held out beforehand, we then test the final model against it.
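The procedure can be written out as an explicit loop. The sketch below is one possible version under a few assumptions of mine: it uses scikit-learn’s bundled `load_diabetes` data as a stand-in for the CSV file used in this task, picks k=5, and shuffles with a fixed seed.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Stand-in data (assumption): scikit-learn's bundled diabetes dataset
X_kf, y_kf = load_diabetes(return_X_y=True)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mae = []
for train_idx, test_idx in kf_demo.split(X_kf):
    # Train on k-1 folds ...
    model = LinearRegression().fit(X_kf[train_idx], y_kf[train_idx])
    # ... and score on the one fold that was held out
    preds = model.predict(X_kf[test_idx])
    fold_mae.append(mean_absolute_error(y_kf[test_idx], preds))

print("MAE per fold:", np.round(fold_mae, 2))
print("Mean MAE:", np.mean(fold_mae))
```

The spread of the per-fold values gives a rough idea of how much a single train/test split could vary just by chance.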

![](https://cdn-images-1.medium.com/max/800/1*J2B_bcbd1-s1kpWOu_FZrg.png)

Here is a very simple example from the Sklearn documentation for K-Folds:

```python
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf
```

```
KFold(n_splits=2, random_state=None, shuffle=False)
```

And let’s see the result, the folds:

```python
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

```
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
```


As you can see, the function split the original data into different subsets. Again, this is a very simple example, but I think it explains the concept pretty well.
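Notice that the folds above are consecutive blocks of rows, because KFold does not shuffle by default (`shuffle=False`). If your file might be ordered, you can ask for shuffled folds. A small sketch, with an arbitrary toy array and seed of my choosing:

```python
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(12).reshape(6, 2)   # 6 toy samples
kf_shuffled = KFold(n_splits=3, shuffle=True, random_state=42)

test_indices = []
for train_index, test_index in kf_shuffled.split(X_small):
    print("TRAIN:", train_index, "TEST:", test_index)
    test_indices.extend(int(i) for i in test_index)

# Every sample still ends up in the test fold exactly once
print(sorted(test_indices))
```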

Let’s check out the example I used before, this time using cross validation. I’ll use the cross_val_predict function to return the predicted value for each data point, made when that point is in the testing slice.
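To see what cross_val_predict does under the hood, here is a sketch that rebuilds its output with an explicit KFold loop and compares the two. It again uses the bundled `load_diabetes` data as a stand-in (an assumption on my part), and it relies on the fact that an integer `cv` means unshuffled K-Folds for a regressor.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Stand-in data (assumption): scikit-learn's bundled diabetes dataset
X_cv, y_cv = load_diabetes(return_X_y=True)

# Manual version: each row is predicted by the model that did NOT train on it
manual_pred = np.empty_like(y_cv, dtype=float)
for train_idx, test_idx in KFold(n_splits=6).split(X_cv):
    model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_pred[test_idx] = model.predict(X_cv[test_idx])

# Library version
library_pred = cross_val_predict(LinearRegression(), X_cv, y_cv, cv=6)

print(np.allclose(manual_pred, library_pred))  # True
```

So the result is one out-of-fold prediction per row, stitched together from six different fitted models.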

### Task 2.4-2:  Cross Validation

Using this dataframe, 
```python
df = pd.read_csv('./data/2_4_diabetes.csv', index_col=0)

```
Do the following:

1. Apply a 6-fold cross validation
2. Fit a linear regression model
3. Calculate the MAE

```python
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
```

```python
df = pd.read_csv('./data/2_4_diabetes.csv', index_col=0)
df.head()

# all columns except the last are features; the last column is the target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
```

```python
regressor = LinearRegression()

# each prediction comes from a model that did not see that row during training
y_pred = cross_val_predict(regressor, X, y, cv=6)
```

```python
# metrics was already imported above
print('Mean Absolute Error:', metrics.mean_absolute_error(y, y_pred))
```

```
Mean Absolute Error: 44.6447403952256
```
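The number above is a single MAE over all out-of-fold predictions. If you also want to see how the error varies from fold to fold, cross_val_score is one option. A sketch, again on the bundled `load_diabetes` data as an assumed stand-in for the CSV; note that scikit-learn reports this metric negated (`neg_mean_absolute_error`) so that higher is always better.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in data (assumption): scikit-learn's bundled diabetes dataset
X_cs, y_cs = load_diabetes(return_X_y=True)

# One score per fold; flip the sign to get the MAE back
neg_mae = cross_val_score(
    LinearRegression(), X_cs, y_cs, cv=6, scoring="neg_mean_absolute_error"
)
fold_mae_scores = -neg_mae

print("MAE per fold:", fold_mae_scores.round(2))
print("Mean:", fold_mae_scores.mean(), "Std:", fold_mae_scores.std())
```

A large standard deviation across folds is a hint that a single train/test split could have given you a quite different score.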
