# What is Cross-Validation?
Cross-validation is a resampling technique in machine learning used to estimate how well a model generalizes to unseen data. It does not prevent overfitting by itself, but it makes overfitting visible: a model that scores well on its training data and poorly on the held-out folds has overfit.

# Why Use Cross-Validation?
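It provides a more reliable estimate of model performance than a single train/test split.

It helps in selecting the best model or hyperparameters.

It reduces the risk of choosing a model that overfits, because every score comes from data the model was not trained on.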

# Types of Cross-Validation:

# K-Fold Cross-Validation

The dataset is split into K folds of (approximately) equal size.

The model is trained on K-1 folds and tested on the remaining fold.

This process repeats K times, with each fold used as a test set once.

The final performance is the average of all K iterations.
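
The steps above can be sketched without scikit-learn. This is only an illustration: the helper name `k_fold_indices` and its `seed` parameter are made up for this example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    # array_split also handles n_samples not divisible by k
    # (fold sizes then differ by at most one)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))

# Every sample should appear in exactly one test fold
tested = np.sort(np.concatenate([test_idx for _, test_idx in splits]))
print(len(splits))
print(tested)
```

In practice `sklearn.model_selection.KFold` does this bookkeeping, as the code cells further down show.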

# Stratified K-Fold Cross-Validation

Similar to K-Fold, but each fold preserves the overall proportion of class labels.

Useful for imbalanced datasets.
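
A small sketch of why stratification matters for imbalanced labels. The 90/10 toy labels are an assumption made for illustration; the features are dummies because only the labels influence the split.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels (assumed): 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features

def positives_per_fold(splitter):
    """Count the positive labels that land in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)       # [0, 0, 0, 0, 10]
print("StratifiedKFold:", stratified)  # [2, 2, 2, 2, 2]
```

With unshuffled `KFold`, four of the five test folds contain no positive example at all, whereas `StratifiedKFold` keeps the 10% positive rate in every fold.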

# Leave-One-Out Cross-Validation (LOO-CV)

Each data point is used as a test set once, while the remaining data is used for training.

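Computationally expensive, because the model is fitted once per data point, but useful for small datasets where holding out a larger test set would waste scarce training data.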

# Leave-P-Out Cross-Validation (LPO-CV)

Similar to LOO-CV but uses P data points for testing instead of 1.
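
A brief sketch using scikit-learn's `LeavePOut` on five toy samples (the data is assumed for illustration). It shows that the number of splits equals the number of ways to choose P test points out of n, which grows quickly with n.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(5).reshape(-1, 1)  # five toy samples
lpo = LeavePOut(p=2)

splits = list(lpo.split(X))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)

# The number of splits is "n choose p"
n_splits = lpo.get_n_splits(X)
print(n_splits, comb(5, 2))
```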

# Time Series Cross-Validation (Rolling/Expanding Window)

Used for time-series data where past data is used to predict future data.

Ensures that training data comes before test data.
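
scikit-learn's `TimeSeriesSplit` offers both variants. The sketch below assumes eight time-ordered toy observations: leaving `max_train_size` unset gives an expanding window, and setting it caps the training window so that it rolls forward.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # eight time-ordered observations (toy data)

# Expanding window: the training set grows with every split
expanding = TimeSeriesSplit(n_splits=3, test_size=1)
expanding_splits = list(expanding.split(X))

# Rolling window: the training set is capped at the 4 most recent points
rolling = TimeSeriesSplit(n_splits=3, test_size=1, max_train_size=4)
rolling_splits = list(rolling.split(X))

for name, splits in [("expanding", expanding_splits), ("rolling", rolling_splits)]:
    print(name)
    for train_idx, test_idx in splits:
        print("  train:", train_idx, "test:", test_idx)
```

In both cases every training index precedes every test index, so no future information leaks into the model.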

# Which One to Use?

For large datasets: K-Fold (K=5 or 10) is a good choice.

For imbalanced data: Stratified K-Fold.

For small datasets: Leave-One-Out.

For time-series data: Rolling/Expanding window.

In [4]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, LinearRegression
import numpy as np

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define K-Fold cross-validator (K=5)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Define the model
model = LogisticRegression(max_iter=200)
model2 = LinearRegression()  # regressor fitted on the integer class labels, scored with R^2

# Perform cross-validation: accuracy for the classifier, R^2 for the regressor
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
scores2 = cross_val_score(model2, X, y, cv=kf, scoring='r2')

# Print individual fold scores and their means
print("Cross-validation scores:", scores)
print("Mean accuracy:", np.mean(scores))

print("Cross-validation R^2 scores:", scores2)
print("Mean R^2:", np.mean(scores2))


Cross-validation scores: [1.         1.         0.93333333 0.96666667 0.96666667]
Mean accuracy: 0.9733333333333334
Cross-validation R^2 scores: [0.946896   0.93157873 0.91771298 0.90265783 0.92107314]
Mean R^2: 0.9239837362538136


In [6]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

digits = load_digits()

In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.3)

In [27]:
lr = LogisticRegression()
lr.fit(X_train,y_train)
lr.score(X_test,y_test)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9537037037037037

In [28]:
svm = SVC()
svm.fit(X_train,y_train)
svm.score(X_test,y_test)

0.9851851851851852

In [29]:
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train,y_train)
rf.score(X_test,y_test)

0.9740740740740741

In [30]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [32]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


In [40]:
def get_score(model,X_train,X_test,y_train,y_test):
    model.fit(X_train,y_train)
    return model.score(X_test,y_test)

In [41]:
get_score(lr,X_train,X_test,y_train,y_test)

0.9537037037037037

In [42]:
get_score(rf,X_train,X_test,y_train,y_test)

0.975925925925926

In [43]:
get_score(SVC(),X_train,X_test,y_train,y_test)

0.9851851851851852

# StratifiedKFold

In [44]:
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)

In [50]:
scores_l = []
scores_svm = []
scores_rf = []

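# NOTE: this loop splits with `kf` (the plain 3-fold KFold defined earlier), not with `folds`.
# A stratified split would be folds.split(digits.data, digits.target).
for train_index, test_index in kf.split(digits.data):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]

    # NOTE: LinearRegression is a regressor, so its score is R^2 rather than accuracy;
    # scores_l is therefore not comparable with the two classifier scores.
    scores_l.append(get_score(LinearRegression(), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(), X_train, X_test, y_train, y_test))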


In [51]:
print(scores_l)

[0.5210503111571498, 0.5681424219493625, 0.41943985059688305]


In [52]:
print(scores_rf)

[0.9148580968280468, 0.9549248747913188, 0.9181969949916527]


In [53]:
print(scores_svm)

[0.9666110183639399, 0.9816360601001669, 0.9549248747913188]
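
The loop above iterates over `kf.split(digits.data)` and scores a `LinearRegression`, so it is neither stratified nor a comparison of three classifiers. A stratified variant might look like the sketch below. The choice of `LogisticRegression(max_iter=5000)` and the fixed `random_state` are assumptions made here, not settings taken from the notebook, and no scores are quoted because they were not run as part of it.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

digits = load_digits()
folds = StratifiedKFold(n_splits=3)

# One list of fold accuracies per classifier
models = {
    "logistic": lambda: LogisticRegression(max_iter=5000),
    "svm": lambda: SVC(),
    "random_forest": lambda: RandomForestClassifier(n_estimators=40, random_state=0),
}
scores = {name: [] for name in models}

# StratifiedKFold needs the labels to balance the classes in each fold
for train_index, test_index in folds.split(digits.data, digits.target):
    X_tr, X_te = digits.data[train_index], digits.data[test_index]
    y_tr, y_te = digits.target[train_index], digits.target[test_index]
    for name, make_model in models.items():
        model = make_model().fit(X_tr, y_tr)  # fresh model for every fold
        scores[name].append(model.score(X_te, y_te))

for name, fold_scores in scores.items():
    print(name, fold_scores)
```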


In [54]:
from sklearn.model_selection import cross_val_score

In [61]:
cross_val_score(LogisticRegression(),X_train,y_train)

array([0.88333333, 0.97083333, 0.925     , 0.94560669, 0.9748954 ])
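
The convergence warning raised by `LogisticRegression` on the digits data suggests either scaling the features or raising `max_iter`. One possible way to do both inside cross-validation is a pipeline, sketched below. The `max_iter=1000` value and the use of `StandardScaler` are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()

# The scaler is part of the estimator, so it is re-fitted on the training
# portion of every fold and never sees the held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, digits.data, digits.target, cv=5)
print(scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how much the score varies from fold to fold.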

In [62]:
cross_val_score(SVC(),X_train,y_train)

array([0.925     , 0.975     , 0.95      , 0.94560669, 0.9790795 ])

In [80]:
cross_val_score(RandomForestClassifier(n_estimators=40),X_train,y_train)

array([0.85416667, 0.9625    , 0.9125    , 0.93723849, 0.94142259])

In [81]:
cross_val_score(RandomForestClassifier(n_estimators=10),digits.data,digits.target)

array([0.90833333, 0.86666667, 0.89415042, 0.93871866, 0.89972145])

In [82]:
np.array([0.90833333, 0.86666667, 0.89415042, 0.93871866, 0.89972145]).mean()

np.float64(0.9015181059999999)

In [83]:
np.array([0.85416667, 0.9625    , 0.9125    , 0.93723849, 0.94142259]).mean()

np.float64(0.9215655500000001)

# 1. Implement LOOCV manually
# 2. Implement LOOCV using cross_val_score
# 3. Implement regression using LOOCV
# 4. Implement classification using LOOCV

In [84]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score

In [94]:
# Make a small toy dataset (10 samples, 3 clusters by default)
X,y = make_blobs(n_samples=10,random_state=1)

In [106]:
cv = LeaveOneOut()

In [107]:
X

array([[ -7.23731039,  -9.03108652],
       [ -8.16550136,  -7.00850439],
       [ -7.02266844,  -7.57041289],
       [ -8.86394306,  -5.05323981],
       [  0.08525186,   3.64528297],
       [ -0.79415228,   2.10495117],
       [ -1.34052081,   4.15711949],
       [-10.32012971,  -4.3374029 ],
       [ -2.18773166,   3.33352125],
       [ -8.53560457,  -6.01348926]])

In [108]:
y

array([2, 2, 2, 1, 0, 0, 0, 1, 0, 1])

In [109]:
for i,j in cv.split(X):
    print(i,j)

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
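
The splits printed above can be used to run LOOCV by hand. The sketch below assumes the same toy data and the same `RandomForestClassifier(random_state=1)` that the notebook uses with `cross_val_score`; it collects one prediction per left-out point and scores them together.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut

X, y = make_blobs(n_samples=10, random_state=1)
cv = LeaveOneOut()

y_true, y_pred = [], []
for train_idx, test_idx in cv.split(X):
    # Fit a fresh model on the 9 remaining points
    model = RandomForestClassifier(random_state=1)
    model.fit(X[train_idx], y[train_idx])
    # Predict the single held-out point
    y_pred.append(model.predict(X[test_idx])[0])
    y_true.append(y[test_idx][0])

loo_accuracy = accuracy_score(y_true, y_pred)
print(loo_accuracy)
```

With one test point per split, each fold's accuracy is either 0 or 1, so the overall accuracy is simply the fraction of held-out points predicted correctly.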


In [110]:
# implement LOOCV using cross_val_score

In [111]:
from sklearn.model_selection import cross_val_score

In [112]:
model1 = RandomForestClassifier(random_state=1)

In [113]:
scores = cross_val_score(model1,X,y,scoring='accuracy', cv=cv)

In [114]:
scores.mean()

np.float64(0.9)
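
For regression, LOOCV works the same way but needs an error metric. The sketch below is an assumption-laden illustration: the synthetic data from `make_regression`, the `LinearRegression` model and the MAE metric are all choices made here, not taken from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data (assumed for illustration)
X_reg, y_reg = make_regression(n_samples=20, n_features=3, noise=5.0, random_state=1)

# R^2 is not defined on a single held-out point, so each fold is scored
# with an error metric instead. scikit-learn negates errors so that
# "higher is better" holds for every scorer.
neg_mae = cross_val_score(
    LinearRegression(),
    X_reg,
    y_reg,
    scoring="neg_mean_absolute_error",
    cv=LeaveOneOut(),
)

mae = -neg_mae.mean()
print("LOOCV mean absolute error:", mae)
```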

# Time Series Cross Validation

In [115]:
import pandas as pd
import numpy as np

In [116]:
# Rolling forecast origin

def rolling_forecast_origin(train, min_train_size, horizon):
    '''
    Rolling forecast origin generator.

    The training window starts at `min_train_size` observations and grows by
    one each split; the validation window is the next `horizon` observations.
    '''
    for i in range(len(train) - min_train_size - horizon + 1):
        split_train = train[:min_train_size + i]
        split_val = train[min_train_size + i:min_train_size + i + horizon]
        yield split_train, split_val

In [118]:
full_series = [2502, 2414, 2800, 2143, 2708, 1900, 2333, 2222, 1234, 3456]

test = full_series[-2:]
train = full_series[:-2]
print('full training set :{0}'.format(train))
print('hidden test set :{0}'.format(test))

full training set :[2502, 2414, 2800, 2143, 2708, 1900, 2333, 2222]
hidden test set :[1234, 3456]


In [124]:
cv_rolling = rolling_forecast_origin(train, min_train_size=4, horizon=2)
cv_rolling

<generator object rolling_forecast_origin at 0x000001E1C3F61040>

In [125]:
i=0
for cv_train, cv_val in cv_rolling:
    print(f'CV[{i+1}]')
    print(f'Train:\t{cv_train}')
    print(f'Val:\t{cv_val}')
    print('------')
    i+=1

CV[1]
Train:	[2502, 2414, 2800, 2143]
Val:	[2708, 1900]
------
CV[2]
Train:	[2502, 2414, 2800, 2143, 2708]
Val:	[1900, 2333]
------
CV[3]
Train:	[2502, 2414, 2800, 2143, 2708, 1900]
Val:	[2333, 2222]
------
------


# Sliding Window Cross Validation

In [127]:
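def sliding_window(train, window_size, horizon):
    '''
    Sliding window generator.

    The training window has a fixed length of `window_size` and moves forward
    one step per split; the validation window is the next `horizon` observations.
    '''
    for i in range(len(train) - window_size - horizon + 1):
        split_train = train[i:window_size + i]
        split_val = train[i + window_size:window_size + i + horizon]
        yield split_train, split_val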

In [130]:
cv_sliding = sliding_window(train ,window_size=4, horizon=1)

print('full training set: {0}\n'.format(train))

i=0
for cv_train, cv_val in cv_sliding:
    print(f'CV[{i+1}]')
    print(f'Train:\t{cv_train}')
    print(f'Val:\t{cv_val}')
    print('------')
    i+=1

full training set: [2502, 2414, 2800, 2143, 2708, 1900, 2333, 2222]

CV[1]
Train:	[2502, 2414, 2800, 2143]
Val:	[2708]
------
CV[2]
Train:	[2414, 2800, 2143, 2708]
Val:	[1900]
------
CV[3]
Train:	[2800, 2143, 2708, 1900]
Val:	[2333]
------
CV[4]
Train:	[2143, 2708, 1900, 2333]
Val:	[2222]
------


# Cross validation example

In [131]:
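def cross_validation_score(model, train, cv, metric):
    '''
    Calculate cross-validation scores for a forecasting model.

    `cv` yields (train, validation) pairs; `model` must provide fit(train) and
    predict(horizon=...); `metric` is called as metric(y_true=..., y_pred=...).
    Returns one score per split as a NumPy array.
    '''
    cv_scores = []
    for cv_train, cv_test in cv:
        model.fit(cv_train)
        preds = model.predict(horizon=len(cv_test))
        score = metric(y_true=cv_test, y_pred=preds)
        cv_scores.append(score)

    return np.array(cv_scores)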

In [2]:
from forecast.baseline import SNaive, Naivel
from sklearn.metrics import mean_absolute_error

ModuleNotFoundError: No module named 'forecast.baseline'
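
Because the `forecast.baseline` import is unavailable here, `cross_validation_score` can still be exercised with a stand-in model. The sketch below assumes a hypothetical `LastValueForecaster` class, written only for this illustration and not part of any library, that exposes the `fit(train)` / `predict(horizon)` interface the function expects. The two helper functions are repeated so the block runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

class LastValueForecaster:
    """Hypothetical stand-in: forecasts every future step as the last observed value."""

    def fit(self, train):
        self._last = train[-1]
        return self

    def predict(self, horizon):
        return np.full(horizon, self._last, dtype=float)

def sliding_window(train, window_size, horizon):
    """Fixed-length training window that moves forward one step per split."""
    for i in range(len(train) - window_size - horizon + 1):
        yield train[i:window_size + i], train[i + window_size:window_size + i + horizon]

def cross_validation_score(model, train, cv, metric):
    """Fit on each training window and score the forecast for the next window."""
    cv_scores = []
    for cv_train, cv_test in cv:
        model.fit(cv_train)
        preds = model.predict(horizon=len(cv_test))
        cv_scores.append(metric(y_true=cv_test, y_pred=preds))
    return np.array(cv_scores)

train = [2502, 2414, 2800, 2143, 2708, 1900, 2333, 2222]
cv = sliding_window(train, window_size=4, horizon=1)

scores = cross_validation_score(LastValueForecaster(), train, cv, mean_absolute_error)
print(scores)         # [565. 808. 433. 111.]
print(scores.mean())  # 479.25
```

Each score is the absolute gap between the last value of a training window and the observation that follows it, for example |2708 - 2143| = 565 for the first split.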