In [2]:
import numpy as np
import pandas as pd

# Cross Validation

Cross validation is a statistical technique for estimating how well a predictive model will perform on unseen data. The basic idea is to split the available dataset into multiple subsets, or folds, train the model on some of them, and evaluate it on the rest.

[Cross Validation Explained!](https://youtu.be/a86WxNgMv7E?si=3RVFsQgAViQsCL-z)  
[Machine Learning Fundamentals: Cross Validation](https://youtu.be/fSytzGwwBVw?si=TSZXN41ZNLQPXm5U)

## Types of Cross Validation

1. K-Fold Cross Validation
2. Stratified K-Fold Cross Validation
3. Leave One Out Cross Validation
4. Repeated K-Fold Cross Validation
5. Shuffle Split Cross Validation
6. Time Series Cross Validation

[Complete Guide to Cross Validation](https://youtu.be/-8s9KuNo5SA?si=rk1Ltp9_5GpqYiX_)

In [15]:
from sklearn.datasets import load_digits

digits = load_digits()

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target,
                                                    test_size=0.3,
                                                    random_state=2002)

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [21]:
lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))

svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))

rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))

0.9648148148148148
0.4
0.9740740740740741


### K-Fold Cross Validation

The dataset is split into k folds of roughly equal size. The model is trained on k-1 folds and validated on the remaining fold. This process is repeated k times, with each fold serving as the validation set exactly once.

In [22]:
from sklearn.model_selection import KFold

In [30]:
kf = KFold(n_splits=7,shuffle=True)
kf

KFold(n_splits=7, random_state=None, shuffle=True)

In [47]:
print(' '*25+'Train Set'+' '*34+'Test Set'+' '*5)
for train_index, test_index in kf.split(np.random.randint(-50,150,25)):
    print(train_index, test_index)

                         Train Set                                  Test Set     
[ 0  1  2  4  5  6  7  8  9 11 12 13 14 15 17 18 19 20 21 22 23] [ 3 10 16 24]
[ 0  1  2  3  4  5  7  9 10 11 13 14 15 16 17 18 19 20 21 22 24] [ 6  8 12 23]
[ 1  3  5  6  7  8  9 10 11 12 13 14 16 17 18 19 20 21 22 23 24] [ 0  2  4 15]
[ 0  1  2  3  4  5  6  8  9 10 11 12 13 15 16 17 18 19 22 23 24] [ 7 14 20 21]
[ 0  1  2  3  4  5  6  7  8  9 10 12 13 14 15 16 17 19 20 21 23 24] [11 18 22]
[ 0  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 20 21 22 23 24] [ 1  5 19]
[ 0  1  2  3  4  5  6  7  8 10 11 12 14 15 16 18 19 20 21 22 23 24] [ 9 13 17]


In [68]:
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [69]:
scores_logistic = []
scores_svm = []
scores_rf = []

for train_index, test_index in kf.split(digits.data,digits.target):
    X_train, X_test, y_train, y_test = digits.data[train_index], digits.data[test_index], \
                                       digits.target[train_index], digits.target[test_index]
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))  
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

In [76]:
scores_logistic

[0.953307392996109,
 0.9610894941634242,
 0.9649805447470817,
 0.9494163424124513,
 0.9649805447470817,
 0.96875,
 0.95703125]
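
The manual loop above can usually be condensed with `cross_val_score`, which accepts a splitter object through `cv`. The following is a minimal sketch rather than the notebook's own method: it assumes a fixed `random_state` (not used in the original cells) so the folds are reproducible, and it summarizes the per-fold accuracies with their mean and standard deviation.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

digits = load_digits()

# Same splitter as above, but seeded (an assumption) so results are repeatable
kf = KFold(n_splits=7, shuffle=True, random_state=0)

# One accuracy value per fold; each fold is the validation set exactly once
scores = cross_val_score(RandomForestClassifier(n_estimators=40, random_state=0),
                         digits.data, digits.target, cv=kf)

print(scores)
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread gives a more honest picture than a single train/test split, since one lucky or unlucky split no longer decides the result.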

### Stratified K-Fold Cross Validation

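Stratified k-fold is similar to k-fold cross-validation, but it ensures that each fold has approximately the same proportion of target classes as the original dataset. It is particularly useful for imbalanced datasets where one class is much more prevalent than the others. When `cross_val_score` is given an integer `cv` and a classifier with binary or multiclass targets, as in the cell below, scikit-learn uses stratified k-fold splitting by default.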

In [48]:
from sklearn.model_selection import cross_val_score

In [53]:
cross_val_score(RandomForestClassifier(n_estimators=37),X_train,y_train,cv=5)

array([0.9484127 , 0.94444444, 0.97211155, 0.98804781, 0.98007968])
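
To make the stratification visible, here is a small sketch that uses `StratifiedKFold` directly on a toy, deliberately imbalanced label array. The labels and the placeholder feature matrix are hypothetical and exist only to show the split behaviour.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 16 samples of class 0, 4 samples of class 1
y = np.array([0] * 16 + [1] * 4)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

minority_counts = []
for train_index, test_index in skf.split(X, y):
    # Count how many samples of each class landed in this test fold
    counts = np.bincount(y[test_index], minlength=2)
    minority_counts.append(int(counts[1]))
    print(test_index, counts)
```

Each test fold keeps the 4:1 class ratio of the full label array, which plain `KFold` does not guarantee.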

### Leave One Out Cross Validation

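Leave-one-out cross-validation holds out a single observation as the validation set and trains the model on the remaining n-1 observations. This process is repeated n times, where n is the total number of observations in the dataset. It is computationally expensive, especially for large datasets, but because almost all of the data is used for training in every iteration, it provides a less biased estimate of the model's performance.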

In [54]:
from sklearn.model_selection import LeaveOneOut

In [55]:
loo = LeaveOneOut()

In [67]:
for train_index, test_index in loo.split(np.random.randint(-50,150,10)):
    print(train_index, test_index)

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
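
As a sketch of how leave-one-out might be used for scoring, `LeaveOneOut` can be passed as `cv` to `cross_val_score`. The iris dataset and the k-nearest-neighbours classifier are assumptions chosen only because they are small and fast; they are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One fit per observation: each score is the accuracy on a single held-out sample,
# so every entry is either 0.0 (wrong) or 1.0 (correct)
loo_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())

print(len(loo_scores))    # number of fits equals the number of observations
print(loo_scores.mean())  # fraction of held-out samples classified correctly
```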


### Shuffle Split Cross Validation

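Shuffle split randomly shuffles the data and splits it into a training set and a test set of chosen sizes. This process is repeated multiple times, drawing a new random split each time, so the same observation can appear in the test set of more than one iteration. It is useful when the dataset is too large to be easily divided into folds or when you want a specific number of iterations rather than a fixed number of folds. With the defaults used below (`ShuffleSplit()`), there are 10 iterations and 10% of the data is held out each time.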

In [43]:
from sklearn.model_selection import ShuffleSplit

In [44]:
ss = ShuffleSplit()

In [49]:
for train_index, test_index in ss.split(np.random.randint(-50,150,10)):
    print(train_index, test_index)

[7 8 1 9 6 4 2 0 3] [5]
[8 0 7 4 3 6 5 9 1] [2]
[8 9 1 5 6 3 4 0 2] [7]
[1 7 9 6 5 8 3 4 2] [0]
[0 5 8 9 1 3 2 7 6] [4]
[3 8 2 4 9 1 7 5 6] [0]
[1 2 0 9 6 3 8 5 4] [7]
[7 3 9 2 0 6 5 1 4] [8]
[0 1 5 4 9 6 3 2 8] [7]
[4 0 6 5 7 8 3 1 2] [9]
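
The number of iterations and the size of the held-out set can be chosen independently, which is the main difference from k-fold. A minimal sketch, assuming 20 dummy samples and arbitrary values for `n_splits`, `test_size`, and `random_state`:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20)  # dummy data; only the number of samples matters here

# 4 independent random splits, each holding out 25% of the samples
ss_custom = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)

test_sizes = []
for train_index, test_index in ss_custom.split(X):
    test_sizes.append(len(test_index))
    print(train_index, test_index)
```

Unlike k-fold, nothing forces the test sets to be disjoint across iterations, so some samples may be held out several times and others never.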


### Repeated K-Fold Cross Validation

This method involves repeating k-fold cross-validation multiple times with different random splits of the data. It helps to reduce the variability in the estimated performance of the model.

In [50]:
from sklearn.model_selection import RepeatedStratifiedKFold

In [51]:
rsk = RepeatedStratifiedKFold()

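In [52]:
# Note: this loop iterates over `ss` (the ShuffleSplit object defined earlier), not `rsk`,
# so the output below shows ShuffleSplit splits. `rsk.split` would also need class labels `y`.
for train_index, test_index in ss.split(np.random.randint(-50,150,10)):
    print(train_index, test_index)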

[8 4 9 3 1 0 2 6 7] [5]
[8 5 9 3 0 7 1 6 2] [4]
[4 7 8 5 9 3 6 0 1] [2]
[3 2 0 1 4 5 9 8 7] [6]
[1 6 8 7 2 9 5 0 4] [3]
[7 8 3 0 9 6 1 4 5] [2]
[6 7 4 5 0 9 8 3 1] [2]
[6 8 9 1 7 4 5 3 0] [2]
[4 1 7 5 9 0 2 6 3] [8]
[2 8 9 4 5 1 3 7 6] [0]
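
For comparison, here is a sketch that iterates over a `RepeatedStratifiedKFold` splitter itself. Because the splitter is stratified, `split` needs the class labels as well as the data. The toy labels, `n_splits=5`, `n_repeats=2`, and the seed are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical balanced labels: 5 samples per class
y = np.array([0] * 5 + [1] * 5)
X = np.zeros((len(y), 1))  # placeholder features

# 5-fold stratified CV, repeated twice with different shuffles
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

splits = list(rskf.split(X, y))
print(len(splits))  # n_splits * n_repeats
for train_index, test_index in splits:
    print(train_index, test_index)
```

Averaging scores over all repeats smooths out the dependence on any single random partition of the data.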


### Time Series Cross Validation

Time series data requires a different approach to cross-validation because the order of observations matters. Techniques like forward chaining or sliding window validation are commonly used for time series data, where the model is trained on past data and validated on future data.

[Time Series Cross Validation](https://www.youtube.com/watch?v=g9iO2AwTXyI)  
https://www.youtube.com/live/355u2bDqB7c?si=7XXIl24LAPfD71ox&t=4303
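
Forward chaining can be sketched with scikit-learn's `TimeSeriesSplit`, which always places the test indices after the training indices. The 12 dummy observations and `n_splits=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12)  # dummy observations, assumed to be in chronological order

tscv = TimeSeriesSplit(n_splits=3)

folds = []
for train_index, test_index in tscv.split(X):
    folds.append((train_index, test_index))
    # The training window grows; the test window always lies in the "future"
    print(train_index, test_index)
```

No shuffling takes place, so the model is never validated on observations that precede its training data.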

## Hyperparameter Tuning

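Hyperparameter tuning is the process of finding the best settings for a machine learning model, such as the learning rate or the number of trees. These settings are chosen before training rather than learned from the data, and tuning them is crucial for optimizing the model's performance on unseen data.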

[GridSearchCV | Hyperparameter Tuning | Machine Learning with Scikit-Learn Python](https://youtu.be/TvB_3jVIHhg?si=SWqHZ2z_zYonOgNb)

In [3]:
link = 'https://raw.githubusercontent.com/daaanishhh002/MachineLearning/main/Datasets/heart.csv'
df = pd.read_csv(link)

df.sample(5)

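| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 243 | 57 | 1 | 0 | 152 | 274 | 0 | 1 | 88 | 1 | 1.2 | 1 | 1 | 3 | 0 |
| 197 | 67 | 1 | 0 | 125 | 254 | 1 | 1 | 163 | 0 | 0.2 | 1 | 2 | 3 | 0 |
| 108 | 50 | 0 | 1 | 120 | 244 | 0 | 1 | 162 | 0 | 1.1 | 2 | 0 | 2 | 1 |
| 283 | 40 | 1 | 0 | 152 | 223 | 0 | 1 | 181 | 0 | 0.0 | 2 | 0 | 3 | 0 |
| 296 | 63 | 0 | 0 | 124 | 197 | 0 | 1 | 136 | 1 | 0.0 | 1 | 0 | 2 | 0 |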


In [6]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,:-1],
                                                 df.iloc[:,-1],
                                                 test_size=0.3,
                                                 random_state=2002)

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [15]:
rf = RandomForestClassifier()
lr = LogisticRegression()
knn = KNeighborsClassifier()

In [9]:
knn.fit(X_train,y_train)
knn.score(X_test,y_test)

0.6593406593406593

In [8]:
lr.fit(X_train,y_train)
lr.score(X_test,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7802197802197802
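
The convergence warning above suggests scaling the data or raising `max_iter`. One way to do both is a pipeline that standardizes the features before fitting. This is a sketch under assumptions: it uses scikit-learn's built-in breast cancer dataset as a stand-in so the block is self-contained, and the names `X_tr`, `X_te`, and `scaled_lr` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption); the same pattern applies to other tabular data
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2002)

# The scaler is fit on the training data only, then applied to the test data
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

scaled_score = scaled_lr.score(X_te, y_te)
print(scaled_score)
```

Putting the scaler inside the pipeline also keeps cross-validation honest, because the scaling statistics are recomputed on each training fold.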

In [16]:
rf.fit(X_train,y_train)
rf.score(X_test,y_test)

0.7802197802197802

In [20]:
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [14]:
rf = RandomForestClassifier(max_samples=0.75,random_state=2002)

rf.fit(X_train,y_train)
rf.score(X_test,y_test)

0.7582417582417582

In [22]:
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(RandomForestClassifier(),df.iloc[:,:-1],
                                                 df.iloc[:,-1],cv=10,scoring='accuracy'))

0.8347311827956989

### Grid Search CV

In [27]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [23]:
n_estimators = [20,60,100,120]
max_features = [0.2,0.6,1.0]
max_depth = [2,8,None]
max_samples = [0.5,0.75,1.0]

In [24]:
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
              'max_samples':max_samples
}

In [28]:
rf = RandomForestClassifier()

rf_grid = GridSearchCV(estimator = rf, 
                       param_grid = param_grid, 
                       cv = 5, 
                       verbose=2, 
                       n_jobs = -1)

In [29]:
rf_grid.fit(X_train,y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [30]:
rf_grid.best_score_

0.8726467331118494

In [31]:
rf_grid.best_params_

{'max_depth': 8, 'max_features': 0.2, 'max_samples': 0.75, 'n_estimators': 120}
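
Note that `best_score_` is a mean cross-validated score on the training data, not a held-out test score. A minimal sketch of inspecting all candidates and then scoring the refitted best model on a separate test set follows. It assumes synthetic data from `make_classification` and a deliberately small, hypothetical grid so it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Small illustrative grid: 2 x 2 = 4 candidates
small_grid = {'n_estimators': [20, 60], 'max_depth': [2, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# cv_results_ holds one row per candidate; rank 1 is the best mean CV score
results = pd.DataFrame(search.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))

# With the default refit=True, best_estimator_ is retrained on all of X_tr
test_score = search.best_estimator_.score(X_te, y_te)
print(test_score)
```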

### Random Search CV

In [32]:
n_estimators = [20,60,100,120]
max_features = [0.2,0.6,1.0]
max_depth = [2,8,None]
max_samples = [0.5,0.75,1.0]
bootstrap = [True,False]  # bootstrap=False cannot be combined with max_samples; see the failed fits below
min_samples_split = [2, 5]
min_samples_leaf = [1, 2]

In [33]:
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
              'max_samples':max_samples,
              'bootstrap':bootstrap,
              'min_samples_split':min_samples_split,
              'min_samples_leaf':min_samples_leaf
}

In [38]:
from sklearn.model_selection import RandomizedSearchCV

rf_grid = RandomizedSearchCV(estimator = rf, 
                       param_distributions = param_grid,
                       n_iter=25,
                       cv = 5, 
                       verbose=2, 
                       n_jobs = -1)

In [39]:
rf_grid.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits


40 fits failed out of a total of 125.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Danish\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Danish\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Danish\anaconda3\Lib\site-packages\sklearn\ensemble\_forest.py", line 402, in fit
    raise ValueError(
ValueError: `max_sample` cannot be set if `bootstrap=False`. Either switch to `bootstrap=True` or set

In [40]:
rf_grid.best_score_

0.8679955703211517

In [41]:
rf_grid.best_params_

{'n_estimators': 20,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_samples': 0.75,
 'max_features': 0.6,
 'max_depth': None,
 'bootstrap': True}