

## **What is Cross-Validation?**

Cross-Validation (CV) is a **model evaluation technique** used to assess how well a machine learning model generalizes to unseen data.

* Instead of testing the model on just one fixed train-test split, CV systematically splits the data into multiple parts (folds or subsets) and tests the model on each.
* This gives a more reliable estimate of the model’s performance.
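
The splitting step can be sketched in a few lines of plain Python. The helper name `k_fold_indices` is hypothetical and exists only for illustration; scikit-learn's `KFold` is the practical tool for this.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (illustrative helper)."""
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every index lands in exactly one test fold, so each sample is used for validation once and for training in the remaining rounds.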

---

## **Purpose of Cross-Validation**

1. **Model Evaluation** – Estimate how well the model will perform on unseen data.
2. **Model Selection** – Compare and choose the best among different algorithms.
3. **Hyperparameter Tuning** – Select the best parameters (e.g., in GridSearchCV).
4. **Reduce Overfitting Risk** – Averaging over several splits keeps a single lucky (or unlucky) split from driving the conclusion.
5. **Efficient Use of Data** – Every data point is used for training in some folds and for validation in another.
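
As a rough sketch of how items 1 and 3 fit together, a `GridSearchCV` object can be passed to `cross_val_score`, which gives nested cross-validation: the inner loop tunes, the outer loop evaluates. The synthetic data, the variable names (`X_demo`, `y_demo`, `inner_cv`, `outer_cv`) and the grid of `C` values are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data, used only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # used for tuning C
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # used for evaluation

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=inner_cv,
    scoring="accuracy",
)

# Each outer fold reruns the whole grid search on its own training part
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="accuracy")
print("Nested CV mean accuracy:", nested_scores.mean())
```

The outer scores come from data the grid search never saw, which is why they are less optimistic than the best score reported by the search itself.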

---

## **Summary Table of Cross-Validation Methods**

| **Method**                      | **Description**                                              | **Pros**                            | **Cons / Limitations**                   | **When to Use**                        |
| ------------------------------- | ------------------------------------------------------------ | ----------------------------------- | ---------------------------------------- | -------------------------------------- |
| **Hold-Out**                    | Single train-test split (e.g., 70%-30%).                     | Simple, fast.                       | High variance, depends on one split.     | Quick baseline, large datasets.        |
| **k-Fold CV**                   | Split into *k* folds, train/test *k* times.                  | Stable, uses all data.              | More computation.                        | General-purpose evaluation.            |
| **Stratified k-Fold**           | Like k-Fold, but preserves class proportions.                | Works well with imbalanced classes. | Slightly complex.                        | Classification with imbalance.         |
| **LOOCV**                       | Each sample is the test set once (*n* folds).                | Max use of data for training.       | Very slow for large *n*; high variance.  | Small datasets.                        |
| **Leave-p-Out**                 | Leave *p* samples out for testing.                           | Very thorough.                      | Computationally infeasible for big data. | Very small datasets only.              |
| **Repeated k-Fold**             | Run k-Fold multiple times with random splits.                | Reduces variance of estimate.       | Increases computation.                   | Medium datasets, need robust estimate. |
| **Shuffle Split (Monte Carlo)** | Random splits into train/test repeated.                      | Flexible splits, easy.              | Test sets may overlap or skip samples.   | Large datasets, need randomness.       |
| **Nested CV**                   | Inner CV for hyperparameter tuning, outer CV for evaluation. | Avoids bias from tuning.            | Very expensive.                          | Tuning plus unbiased evaluation.       |
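
To make the "computationally infeasible" entry for Leave-p-Out concrete: the number of train/test splits is the binomial coefficient C(n, p). The short calculation below assumes n = 712, the row count of the cleaned Titanic data used in the code section; the name `fits` is illustrative.

```python
from math import comb

n = 712  # rows remaining after dropping missing values in the code section

# One model fit is needed per split, and Leave-p-Out has C(n, p) splits
fits = {p: comb(n, p) for p in (1, 2, 3)}
for p, count in fits.items():
    print(f"p={p}: {count:,} model fits")
# p=1: 712 model fits
# p=2: 253,116 model fits
# p=3: 59,904,120 model fits
```

With p = 1 this is LOOCV; already at p = 3 the count reaches tens of millions of fits for a dataset of well under a thousand rows.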

---


## **Python Code**

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import LeaveOneOut, RepeatedKFold, ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression


In [2]:
# Load Titanic dataset
titanic = sns.load_dataset("titanic")
titanic

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [7]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [11]:
# Drop columns that duplicate information already held in other columns
titanic.drop(columns=["class", "who", "embark_town", "alive", "alone"],
             inplace=True)
titanic

,survived,pclass,sex,age,sibsp,parch,fare,embarked,adult_male,deck
0,0,3,male,22.0,1,0,7.2500,S,True,
1,1,1,female,38.0,1,0,71.2833,C,False,C
2,1,3,female,26.0,0,0,7.9250,S,False,
3,1,1,female,35.0,1,0,53.1000,S,False,C
4,0,3,male,35.0,0,0,8.0500,S,True,
...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,True,
887,1,1,female,19.0,0,0,30.0000,S,False,B
888,0,3,female,,1,2,23.4500,S,False,
889,1,1,male,26.0,0,0,30.0000,C,True,C


In [15]:
# Drop deck too (only 203 of 891 values are non-null)
titanic.drop(columns="deck", inplace=True)
titanic

,survived,pclass,sex,age,sibsp,parch,fare,embarked,adult_male
0,0,3,male,22.0,1,0,7.2500,S,True
1,1,1,female,38.0,1,0,71.2833,C,False
2,1,3,female,26.0,0,0,7.9250,S,False
3,1,1,female,35.0,1,0,53.1000,S,False
4,0,3,male,35.0,0,0,8.0500,S,True
...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,True
887,1,1,female,19.0,0,0,30.0000,S,False
888,0,3,female,,1,2,23.4500,S,False
889,1,1,male,26.0,0,0,30.0000,C,True


In [17]:
# Select features and target
X = titanic.drop(columns="survived")
y = titanic["survived"]

# Drop rows with missing values for simplicity (891 -> 712 rows),
# then align the target with the rows that remain
X = X.dropna()
y = y.loc[X.index]
X.shape, y.shape

((712, 8), (712,))

In [23]:
# Separate categorical and numeric columns
# (adult_male is boolean; it is passed through with the numeric columns as 0/1)
categorical_cols = ["sex", "embarked"]
numeric_cols = ["pclass", "age", "fare", "parch", "sibsp", "adult_male"]

In [27]:
# One-hot encode the categorical columns
# (the sparse_output argument requires scikit-learn >= 1.2)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop="first", sparse_output=False)
X_encoded = encoder.fit_transform(X[categorical_cols])

# Convert back to a DataFrame, keeping the original row index
encoded_cols = encoder.get_feature_names_out(categorical_cols)
X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_cols,
                            index=X.index)

# Combine with numeric columns
X_final = pd.concat([X[numeric_cols], X_encoded_df], axis=1)
X_final

,pclass,age,fare,parch,sibsp,adult_male,sex_male,embarked_Q,embarked_S
0,3,22.0,7.2500,0,1,True,1.0,0.0,1.0
1,1,38.0,71.2833,0,1,False,0.0,0.0,0.0
2,3,26.0,7.9250,0,0,False,0.0,0.0,1.0
3,1,35.0,53.1000,0,1,False,0.0,0.0,1.0
4,3,35.0,8.0500,0,0,True,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
885,3,39.0,29.1250,5,0,False,0.0,1.0,0.0
886,2,27.0,13.0000,0,0,True,1.0,0.0,1.0
887,1,19.0,30.0000,0,0,False,0.0,0.0,1.0
889,1,26.0,30.0000,0,0,True,1.0,0.0,0.0
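
The cell above fits the encoder on all rows before any split is made. A common alternative is to place the encoding inside a `Pipeline`, so that `cross_val_score` refits it on the training part of each fold. The sketch below assumes a small synthetic frame (`toy`, `toy_y`) rather than the Titanic data; those names, the random values, and the synthetic target are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data with two categorical and two numeric columns
rng = np.random.default_rng(0)
n_rows = 100
toy = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n_rows),
    "embarked": rng.choice(["S", "C", "Q"], size=n_rows),
    "age": rng.uniform(1, 80, size=n_rows),
    "fare": rng.uniform(5, 100, size=n_rows),
})
toy_y = (toy["sex"] == "female").astype(int)  # synthetic target, not real outcomes

# Encode the categorical columns; pass numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked"])],
    remainder="passthrough",
)
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=3000))])

# The encoder is refit inside every fold, using training rows only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, toy, toy_y, cv=cv, scoring="accuracy")
print("Pipeline CV mean accuracy:", pipe_scores.mean())
```

For one-hot encoding the practical difference is usually small, but the same pattern matters more for steps such as scaling or imputation, whose fitted statistics depend on the rows they see.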


In [31]:
# Logistic Regression model
model = LogisticRegression(max_iter=3000)


In [33]:

# ---------- Hold-Out Validation ----------

X_train, X_test, y_train, y_test = train_test_split(
    X_final, y, test_size=0.3, random_state=42, stratify=y
)
model.fit(X_train, y_train)
holdout_score = model.score(X_test, y_test)
print("Hold-Out Accuracy:", holdout_score)


Hold-Out Accuracy: 0.794392523364486


In [37]:
# ---------- k-Fold CV ----------

kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model, X_final, y, cv=kf,
                            scoring="accuracy")
print(kf_scores)
print("k-Fold CV Mean Accuracy:", kf_scores.mean())


[0.81818182 0.7972028  0.82394366 0.85915493 0.8028169 ]
k-Fold CV Mean Accuracy: 0.8202600216684723
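
The mean alone hides how much the folds disagree. A small sketch of reporting the spread alongside it, using the five fold accuracies printed above (copied into a list named `kf_scores_demo`, a name chosen here for illustration):

```python
from statistics import mean, stdev

# Fold accuracies copied from the k-Fold output above
kf_scores_demo = [0.81818182, 0.7972028, 0.82394366, 0.85915493, 0.8028169]

print(f"mean  = {mean(kf_scores_demo):.4f}")
print(f"std   = {stdev(kf_scores_demo):.4f}")  # sample standard deviation across folds
print(f"range = {min(kf_scores_demo):.4f} to {max(kf_scores_demo):.4f}")
```

Inside the notebook the same summary is available directly as `kf_scores.mean()` and `kf_scores.std()` on the NumPy array returned by `cross_val_score`; note that NumPy's `std` defaults to the population formula, so it will be slightly smaller than `stdev`.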


In [39]:
# Class balance in the full dataset (before rows with missing values were dropped)
titanic["survived"].value_counts()

survived
0    549
1    342
Name: count, dtype: int64

In [41]:

# ---------- Stratified k-Fold CV ----------
skf = StratifiedKFold(n_splits=5, shuffle=True,
                      random_state=42)
skf_scores = cross_val_score(model, X_final, y,
                             cv=skf, scoring="accuracy")
print(skf_scores)
print("Stratified k-Fold Mean Accuracy:", skf_scores.mean())


[0.85314685 0.81118881 0.78169014 0.81690141 0.82394366]
Stratified k-Fold Mean Accuracy: 0.817374175120654
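
What stratification changes is easiest to see on a deliberately skewed toy example. The sketch below assumes 90 negative labels followed by 10 positive ones and counts the positives that land in each test fold; `y_toy`, `X_toy` and `positives_per_fold` are illustrative names, and shuffling is left off so the result is deterministic.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 90 negatives followed by 10 positives (10% positive rate)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for splitting


def positives_per_fold(splitter):
    """Count positive labels in each test fold produced by the splitter."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]


plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))
print("KFold positives per fold:          ", plain)       # [0, 0, 0, 0, 10]
print("StratifiedKFold positives per fold:", stratified)  # [2, 2, 2, 2, 2]
```

On the Titanic data the classes are only mildly imbalanced and both splitters shuffle, so the two mean accuracies above end up close to each other.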


In [43]:
# ---------- Leave-One-Out CV ----------
# Each score is 1.0 or 0.0: the single held-out sample is either right or wrong
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X_final, y,
                             cv=loo, scoring="accuracy")
print(loo_scores)
print("LOOCV Mean Accuracy:", loo_scores.mean())


[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1.
 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1.
 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1.

In [48]:
# ---------- Repeated k-Fold CV ----------

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
rkf_scores = cross_val_score(model, X_final, y, cv=rkf, scoring="accuracy")
print("Repeated k-Fold Accuracy:", rkf_scores.mean())


Repeated k-Fold Accuracy: 0.7870251813913784


In [45]:
# ---------- Shuffle Split CV ----------
ss = ShuffleSplit(n_splits=5, test_size=0.3,
                  random_state=42)
ss_scores = cross_val_score(model, X_final, y,
                            cv=ss, scoring="accuracy")
print(ss_scores)
print("Shuffle Split Accuracy:", ss_scores.mean())


[0.81308411 0.8271028  0.79906542 0.81775701 0.79906542]
Shuffle Split Accuracy: 0.811214953271028
