# Cross Validation

### 1. Overfitting Defined

**Overfitting** occurs when a machine learning model learns the training data too well, fitting its noise and random fluctuations along with the underlying signal.

* **Result:** An overfit model achieves a **perfect or near-perfect score** (e.g., $100\%$ accuracy) on the data it was trained on.
* **Failure:** The model has memorized the answers rather than learned the underlying patterns. Consequently, it **predicts poorly** on new, previously unseen data. This is what is meant by **poor generalization**.
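
The gap between training and test performance can be made concrete with a small experiment. The sketch below is illustrative only: the synthetic dataset, the 20% label noise (`flip_y=0.2`), and the fixed seeds are assumptions chosen for the demonstration, not part of the original material.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y assigns random
# labels to about 20% of the rows, which acts as label noise.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained tree keeps splitting until it classifies every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The tree reproduces the noisy training labels exactly, so its training accuracy says little about how it will behave on the held-out rows.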

### 2. The Solution: Data Splitting

To detect overfitting and obtain an unbiased measure of a model's true predictive power, the available data must be split into at least two distinct, non-overlapping sets (a third, validation set is often added for tuning the model):

| Data Set | Purpose | Role in Experiment |
| :--- | :--- | :--- |
| **Training Set** ($\mathbf{X}_{\text{train}}, \mathbf{y}_{\text{train}}$) | Used to **learn the parameters** of the prediction function (i.e., train the model). | **Model Development** |
| **Test Set** ($\mathbf{X}_{\text{test}}, \mathbf{y}_{\text{test}}$) | Held out and used **only once** at the very end to evaluate the model's performance on unseen data. | **Final Evaluation** |

By using a separate **test set**, the researcher ensures that the model is evaluated on data it has never encountered, providing an honest assessment of its ability to **generalize** to new situations.
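
A minimal sketch of such a split, using scikit-learn's `train_test_split` on row indices so that the two sets can be checked for overlap. The 150-row size, the 80/20 ratio, and the seed are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(150)  # stand-in for the row indices of a 150-row dataset

# Shuffle the rows and hold out 20% of them as the test set.
train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=0)

overlap = set(train_ids.tolist()) & set(test_ids.tolist())
print(f"train rows: {len(train_ids)}, test rows: {len(test_ids)}")
print(f"rows in both sets: {len(overlap)}")
```

In practice the same call is made on the feature matrix and target together, as in `train_test_split(X, y, test_size=0.2)`, which returns `X_train, X_test, y_train, y_test`.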

https://scikit-learn.org/stable/modules/cross_validation.html

### Underfitting and Overfitting

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data.

If you are overfitting, that is, getting great training scores but poor test scores, the model is memorizing the training data rather than learning patterns that generalize. Try removing the weaker features.

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An underfitted model is a model where some parameters or terms that would appear in a correctly specified model are missing. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model will tend to have poor predictive performance.

If you are underfitting, that is, getting poor scores on both the training and test sets, try adding more informative features, using a more flexible model, or gathering more data.

https://en.wikipedia.org/wiki/Overfitting

For a worked illustration of underfitting and overfitting, please review the following:<br />
https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
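
The idea behind the linked example can be sketched in a few lines: fit polynomials of increasing degree to noisy samples of a cosine curve and compare the training error with the cross-validated error. This is a simplified sketch in the spirit of that example, not a copy of it; the degrees (1, 4, 15), the sample size, the noise level, and the seed are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of a cosine curve (synthetic data, assumed for illustration).
rng = np.random.RandomState(0)
x = np.sort(rng.rand(30))
y = np.cos(1.5 * np.pi * x) + rng.randn(30) * 0.1
X = x.reshape(-1, 1)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    # scikit-learn reports negated MSE so that higher is better; flip the sign.
    cv_mse = -cross_val_score(
        model, X, y, cv=10, scoring="neg_mean_squared_error"
    ).mean()
    results[degree] = (train_mse, cv_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, CV MSE {cv_mse:.4g}")
```

A degree-1 fit is too rigid for the curve and has a high error everywhere (underfitting). A degree-15 fit drives the training error down but its cross-validated error is far worse than that of the degree-4 fit (overfitting).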

### Underfitting / Overfitting Fixes

#### Underfitting
* Get more data
* Get more features

#### Overfitting
* Cross Validation
* Regularization
* Remove weak features

https://medium.com/ml-research-lab/under-fitting-over-fitting-and-its-solution-dc6191e34250
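
Regularization, one of the overfitting fixes listed above, can be illustrated by comparing an unpenalized linear fit with a ridge fit on the same high-degree polynomial features. This is a hedged sketch: the choice of `Ridge`, the penalty `alpha=1e-3`, the degree of 9, and the synthetic data are all assumptions made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of a cosine curve (synthetic data, assumed for illustration).
rng = np.random.RandomState(0)
X = rng.rand(30, 1)
y = np.cos(1.5 * np.pi * X[:, 0]) + rng.randn(30) * 0.1

# Same degree-9 features, with and without an L2 penalty on the coefficients.
plain = make_pipeline(
    PolynomialFeatures(9, include_bias=False), LinearRegression()
).fit(X, y)
ridge = make_pipeline(
    PolynomialFeatures(9, include_bias=False), Ridge(alpha=1e-3)
).fit(X, y)

# The last pipeline step is the fitted regressor; compare coefficient sizes.
plain_norm = np.linalg.norm(plain[-1].coef_)
ridge_norm = np.linalg.norm(ridge[-1].coef_)
print(f"coefficient norm without penalty: {plain_norm:.1f}")
print(f"coefficient norm with ridge:      {ridge_norm:.1f}")
```

The penalty shrinks the coefficients, which limits how sharply the fitted curve can bend to chase individual noisy points.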

### Supervised vs Unsupervised

* Supervised: the target is passed to the model, as in `model.fit(X, y)`.
* Unsupervised: no target is passed; the model is fit on the features alone, as in `model.fit(X)`.
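
The difference shows up directly in the `fit` call. The sketch below uses a decision tree as the supervised model and k-means as the unsupervised one; both choices, along with `n_clusters=3` and the seeds, are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the target y is passed to fit, and predictions are class labels.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
first_predictions = clf.predict(X[:3])

# Unsupervised: only X is passed; the model invents its own cluster ids.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)

print("tree predictions for the first 3 rows:", first_predictions)
print("k-means cluster ids for the first 3 rows:", km.labels_[:3])
```

The cluster ids from k-means are arbitrary integers and need not line up with the species codes in `y`, because the model never saw them.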

## Out of Sample Testing (Cross Validation)

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset) and a dataset of unknown, or first-seen, data against which the model is tested (the validation dataset or testing set). The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give insight into how the model will generalize to an independent dataset (for instance, an unknown dataset from a real problem).

https://en.wikipedia.org/wiki/Cross-validation_(statistics)
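
One way to use out-of-sample testing to flag overfitting is to compare in-sample and out-of-sample scores side by side. The sketch below uses scikit-learn's `cross_validate` with `return_train_score=True`; the iris data, the unconstrained tree, and `cv=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Score each fold twice: on the rows used for fitting and on the held-out rows.
result = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y, cv=5, return_train_score=True
)

train_mean = result["train_score"].mean()
test_mean = result["test_score"].mean()
print(f"mean score on rows used for fitting: {train_mean:.3f}")
print(f"mean score on held-out rows:         {test_mean:.3f}")
```

A large gap between the two means is the signal to look for; a small gap suggests the model generalizes about as well as it fits.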

## K-fold Cross Validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k is a free parameter chosen by the analyst.

https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation
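
The partitioning described above can be sketched in plain Python. This is a hand-rolled illustration of the idea, not scikit-learn's implementation; the helper name `k_fold_indices` and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, repeatable via seed
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, validation) in enumerate(folds):
    print(f"fold {i}: validate on {sorted(validation)}, train on {len(train)} rows")

# Every row index should appear in exactly one validation fold.
validated = sorted(idx for _, validation in folds for idx in validation)
```

In scikit-learn the same role is played by `KFold`, and `cross_val_score(model, X, y, cv=k)` runs the fit-and-score loop over the folds.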

## Cross Validation Using a Decision Tree Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

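In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

# Load the iris data into a DataFrame with named feature columns
iris = load_iris()
X = iris.data
y = iris.target

df = pd.DataFrame(data=X, columns=iris.feature_names)
df['species'] = y

# Hold out 20% of the rows as a test set.
# No random_state is set, so the split (and the scores) change on every run.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('species', axis=1), df['species'], test_size=0.20
)

# Fit one tree on the full training set and predict the held-out test set
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# 10-fold cross-validation on the training set (fits fresh clones of the model).
# Compute the scores once and reuse them so the mean matches the printed folds;
# a second call could score differently because the tree is not seeded.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print(cv_scores)
print(f'With Cross Validation: {cv_scores.mean()}')
print(f'Without Cross Validation: {accuracy_score(y_test, predictions)}')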

[0.91666667 1.         1.         1.         0.91666667 0.83333333
 0.91666667 0.83333333 0.91666667 0.91666667]
With Cross Validation: 0.9333333333333332
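Without Cross Validation: 1.0

The two scores measure different things. The cross-validated score averages ten fits, each validated on a different 12-row fold of the 120 training rows, while the score labelled "Without Cross Validation" comes from a single fit evaluated once on the 30 held-out test rows. A single split can land on an easy or a hard test set, so its score (here a perfect 1.0) is a noisier estimate than the 10-fold mean. Because neither the split nor the tree is seeded, the exact numbers change from run to run.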
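
How much a single hold-out score depends on the particular split can be checked by repeating the split with different seeds. The sketch below is not part of the original cell; the 20 seeds, the seeded tree, and the use of the full iris data for the 10-fold comparison are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Repeat the 80/20 hold-out evaluation with 20 different random splits.
holdout_scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed
    )
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    holdout_scores.append(tree.score(X_te, y_te))

holdout_scores = np.array(holdout_scores)
print(
    f"hold-out accuracy over 20 splits: "
    f"min {holdout_scores.min():.3f}, max {holdout_scores.max():.3f}"
)

# For comparison: one 10-fold cross-validation run on the same data.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"10-fold CV accuracy: mean {cv_scores.mean():.3f}, std {cv_scores.std():.3f}")
```

Reporting the mean together with the spread of the fold scores gives a more honest picture than quoting one hold-out number.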
