## General Understanding of Overfitting
<div style="display: flex;">
  <div style="flex: 30%; text-align: center;">
    <img src="./images/overfitting.png" alt="Over-fitting" style="width: 100%;">
    <p>Over-means-worse</p>
    <p><a href="https://github.com/abhishekkrthakur/approachingalmost/blob/master/AAAMLP.pdf" target="_blank">Ref: AAAML</a></p>
  </div>
  <div style="flex: 70%; text-align: center;">
    <img src="./images/overfitting_2.png" alt="Second Image" style="width: 100%;">
    <p>Under-Optimum-Over</p>
    <p><a href="https://www.researchgate.net/publication/339680577_An_Introduction_to_Machine_Learning" target="_blank">Ref: An Introduction to Machine Learning</a></p>
  </div>
</div>

## Procedures to Prevent Overfitting (Train, Validation, and Test)
* hold-out based validation
* [k-fold cross-validation](./images/train-val-test.png)
* stratified k-fold cross-validation
* leave-one-out cross-validation

***Occam’s razor*** in simple words states that one should not try to complicate things that can be solved in a much simpler manner. In general, whenever your model does not obey Occam’s razor, it is probably overfitting.
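
As a rough illustration of this idea, the sketch below fits a simple and a much more flexible polynomial to the same noisy samples. The synthetic quadratic data, the degrees (2 and 12), and the helper names `make_data` and `mse` are assumptions chosen for this example; they are not part of the original notebook.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    # Hypothetical ground truth: a quadratic curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=n)
    return x, y


def mse(model, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((model(x) - y) ** 2))


x_train, y_train = make_data(20)    # small training set
x_val, y_val = make_data(200)       # separate validation set

results = {}
for degree in (2, 12):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))
    train_err, val_err = results[degree]
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  val MSE={val_err:.3f}")
```

The more flexible model can only match the training points as well as or better than the simpler one, so training error alone cannot reveal overfitting. The validation error is the number to compare.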

***Regarding k-fold cross-validation***, the key point is that you train k different models, one per fold. Each model is fitted on a different set of k-1 folds, so each ends up with different parameters. At inference time, the <u>***k different models***</u> each produce a prediction, and the <u>***k inference results***</u> are merged (ensembled) into the final prediction.
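
A minimal sketch of this ensembling pattern is shown below. It assumes synthetic data from `make_regression`, a `Ridge` model, and simple averaging as the merge step; all three are illustrative choices rather than something the notebook prescribes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; any tabular dataset would do).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, y_train = X[:250], y[:250]
X_test = X[250:]  # unseen rows we want predictions for

kf = KFold(n_splits=5, shuffle=True, random_state=42)

models, val_scores, test_preds = [], [], []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train)):
    # One model per fold, trained on the other k-1 folds.
    model = Ridge(alpha=1.0)
    model.fit(X_train[train_idx], y_train[train_idx])

    # Validate on the fold that this model has not seen.
    val_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

    models.append(model)
    test_preds.append(model.predict(X_test))

# Merge the k inference results by averaging them.
ensemble_pred = np.mean(test_preds, axis=0)

print(f"mean validation R^2 over {len(models)} folds: {np.mean(val_scores):.3f}")
print("ensemble prediction shape:", ensemble_pred.shape)
```

Averaging is only one way to merge the predictions; for classification, averaging predicted probabilities or majority voting are common alternatives.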

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import datasets
from sklearn import manifold

In [18]:
from sklearn.model_selection import KFold
from tabulate import tabulate
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, sep=";")

kf = KFold(n_splits=5, shuffle=True, random_state=42)

df['kfold'] = -1

# Tag each row with the fold in which it serves as validation data
for fold, (train_index, val_index) in enumerate(kf.split(df)):
    df.loc[val_index, 'kfold'] = fold

df.to_csv('../data/wine-quality-data.csv')
df

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | kfold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 2 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 | 4 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 | 2 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 | 2 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 | 4 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 | 3 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 | 1 |


In [19]:
import plotly.express as px
df = df.sort_values(by="kfold")
fig = px.histogram(df, x="quality", color="kfold", title="Histogram of KFold Column")
fig.show()

In [21]:
from sklearn.model_selection import StratifiedKFold
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, sep=";")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

df['kfold'] = -1

# Tag each row with its validation fold, keeping the 'quality' distribution similar across folds
for fold, (train_index, val_index) in enumerate(skf.split(df, df['quality'])):
    df.loc[val_index, 'kfold'] = fold

import plotly.express as px

df = df.sort_values(by="kfold")
fig = px.histogram(df, x="quality", color="kfold", title="Histogram of Stratified KFold Column")
fig.show()

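### Hold-out Based Validation
K-fold cross-validation, stratified or not, can be computationally demanding on large datasets, because one model has to be trained per fold. In such cases, only one fold is held out as the validation set and a single model is trained on the remaining folds. When the number of samples is large, such as 1 million, it is recommended to split the data into a larger number of folds (for example 10 instead of 5), so that the single hold-out fold is still large while more data stays available for training.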
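
The hold-out split described above can be sketched with NumPy alone. The dataset size, the fold count of 10, and the choice of fold 0 as the hold-out fold are assumptions made for illustration.

```python
import numpy as np

n_samples = 1_000_000  # hypothetical dataset size
n_folds = 10           # more folds means a smaller hold-out share (here 10%)

rng = np.random.default_rng(42)

# Give every row a fold id in 0..n_folds-1: visit the rows in random order
# and deal the fold ids out round-robin, so all folds have equal size.
fold_ids = np.empty(n_samples, dtype=np.int64)
fold_ids[rng.permutation(n_samples)] = np.arange(n_samples) % n_folds

holdout_fold = 0  # the single fold kept aside for validation
val_mask = fold_ids == holdout_fold
train_idx = np.flatnonzero(~val_mask)
val_idx = np.flatnonzero(val_mask)

print(len(train_idx), len(val_idx))  # 900000 100000
```

Only one model is trained on `train_idx` and evaluated on `val_idx`, which is what makes hold-out cheaper than full k-fold cross-validation.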

## Regression
- In most cases, simple k-fold cross-validation works for regression problems.
- If the distribution of the target is not consistent, you can use stratified k-fold. Because the target is continuous, first divide it into bins and stratify on the bin labels.
- If you have a lot of samples (> 10k, > 100k), then you don’t need to care about the number of bins.
- If the number of samples is low, choose the number of bins with Sturges' rule.

Sturges' rule: Number of Bins = 1 + log2(N), where N is the number of samples.
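
One way to stratify a continuous target is to bin it first and then stratify on the bin labels, using the rule above for the bin count. The sketch below assumes synthetic data from `make_regression` and equal-width bins from `pd.cut`; the column names `target`, `bins`, and `kfold` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import StratifiedKFold

# Synthetic regression data (an assumption standing in for a real dataset).
X, y = make_regression(n_samples=1500, n_features=5, noise=5.0, random_state=42)
df = pd.DataFrame(X, columns=[f"f_{i}" for i in range(X.shape[1])])
df["target"] = y
df["kfold"] = -1

# Number of bins = 1 + log2(N), rounded down to an integer.
num_bins = int(np.floor(1 + np.log2(len(df))))

# Bin the continuous target; labels=False returns integer bin ids.
df["bins"] = pd.cut(df["target"], bins=num_bins, labels=False)

# Stratify on the bin ids instead of the raw target values.
# scikit-learn may warn if a tail bin has fewer rows than n_splits.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X=df, y=df["bins"].values)):
    df.loc[val_idx, "kfold"] = fold

# The helper column is no longer needed once the folds are assigned.
df = df.drop(columns="bins")

print("bins:", num_bins)
print(df["kfold"].value_counts().sort_index())
```

Each fold then sees a similar spread of target values, which is the property plain k-fold does not guarantee when the target distribution is uneven.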

***Consider using stratified k-fold cross-validation for regression problems.*** For this, the most important thing is to get the relevant data.