**K-fold cross-validation** is a technique for evaluating machine learning models by training and validating them on *K* different partitions of the same data. Because every observation is used for validation once, it gives a more reliable estimate of model performance than a single train-test split.

### Steps:

1. **Split the Data**: Divide the data into *K* roughly equal-sized "folds" (subsets).
2. **Train & Validate in K Rounds**:
   - In each round, use one fold as the validation set and the remaining *K-1* folds as the training set.
   - Repeat this process *K* times so that each fold is used as the validation set once.
3. **Average the Results**: After *K* rounds, average the performance scores (like accuracy or F1-score) from each fold to get a reliable estimate of model performance.
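
The three steps above can be sketched by hand with scikit-learn's `KFold`. This is a minimal illustration, not the notebook's own code: the synthetic data and the names `X_demo`, `y_demo`, and `fold_scores` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic regression data (assumption: stands in for a real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Step 1: split the data into K folds.
k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    # Step 2: train on K-1 folds, validate on the held-out fold.
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    predictions = model.predict(X_demo[val_idx])
    fold_scores.append(r2_score(y_demo[val_idx], predictions))

# Step 3: average the K validation scores.
mean_score = float(np.mean(fold_scores))
print(fold_scores, mean_score)
```

`cross_val_score`, used later in this notebook, wraps this same loop in a single call.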

### Diagram Explanation

Take *K = 5* as an example. In each round, a different fold (marked `Val` below) is used as the validation set while the remaining four folds serve as training data.


```text
Round 1: [ Val ] [ Train ] [ Train ] [ Train ] [ Train ]
Round 2: [ Train ] [ Val ] [ Train ] [ Train ] [ Train ]
Round 3: [ Train ] [ Train ] [ Val ] [ Train ] [ Train ]
Round 4: [ Train ] [ Train ] [ Train ] [ Val ] [ Train ]
Round 5: [ Train ] [ Train ] [ Train ] [ Train ] [ Val ]
```

This ensures every data point is used for validation exactly once and for training *K-1* times, so the model is evaluated on the entire dataset across different splits.
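
To check that claim on a toy example, the sketch below prints the indices `KFold` assigns to each round and counts how often each sample lands in a validation fold. The ten-sample array and the name `val_counts` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 toy samples
kfold = KFold(n_splits=5)  # no shuffling, so folds are contiguous as in the diagram

# Count how many times each sample is used for validation.
val_counts = np.zeros(len(X_demo), dtype=int)
for round_no, (train_idx, val_idx) in enumerate(kfold.split(X_demo), start=1):
    val_counts[val_idx] += 1
    print(f"Round {round_no}: val={val_idx.tolist()} train={train_idx.tolist()}")

print(val_counts)
```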

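- *K = 5* or *K = 10* are the most common choices.
- For small datasets, try *K = 10* or leave-one-out cross-validation (LOOCV), so that each model trains on most of the data.
- For large datasets or models that are expensive to train, stick with *K = 5*, since the model is fitted *K* times.
- For classification with imbalanced classes, use `StratifiedKFold`, which keeps the class proportions roughly the same in every fold.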
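
As a rough illustration of the last point, the sketch below splits a deliberately imbalanced label vector with `StratifiedKFold` and counts the positives in each validation fold. The toy labels and the name `positives_per_fold` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 negatives and 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each validation fold should receive its share of the rare class.
positives_per_fold = [
    int(y_demo[val_idx].sum()) for _, val_idx in skf.split(X_demo, y_demo)
]
print(positives_per_fold)
```

A splitter object like `skf` can be passed as the `cv` argument of `cross_val_score`. For LOOCV, `sklearn.model_selection.LeaveOneOut` can be used the same way.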

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV

In [3]:
housing_df = pd.read_csv(r"..\Datasets\Boston.csv")

In [4]:
housing_df.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [5]:
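# X still includes `Unnamed: 0` (the row index saved to the CSV); drop it too if it should not be a feature.
X = housing_df.drop(['medv'], axis=1)
y = housing_df['medv']  # target: median home value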

In [11]:
linear_model = LinearRegression()

# cross_val_score returns one score per fold. With no `scoring` argument it uses the
# estimator's default scorer: R² for regressors, accuracy for classifiers.
score_results = cross_val_score(linear_model, X, y, cv=5)  # k=5
score_results

array([ 0.63919994,  0.71386698,  0.58702344,  0.07923081, -0.25294154])

In [12]:
score_results.mean()

0.353276 (rounded mean of the five fold scores above)

In [13]:
elastic_model = ElasticNet()

score_results = cross_val_score(elastic_model, X, y, cv=5)
score_results

array([0.57022044, 0.6626767 , 0.40322405, 0.45880379, 0.26833761])

In [14]:
score_results.mean()

np.float64(0.4726525191941059)
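
The fold scores above vary widely, and the linear model's last fold has a negative R². When `cv` is an integer and the estimator is a regressor, `cross_val_score` uses `KFold` without shuffling, so each fold is a contiguous block of rows. If the rows are stored in a meaningful order, contiguous folds can differ a lot from one another. Whether that applies to this CSV is an assumption, since the source does not say. The sketch below shows the effect on synthetic data that has been deliberately sorted by target, and all of its names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Sort rows by target to mimic a file stored in a meaningful order.
order = np.argsort(y_demo)
X_sorted, y_sorted = X_demo[order], y_demo[order]

model = LinearRegression()

# Contiguous folds: the default behaviour for cv=5 with a regressor.
contiguous = cross_val_score(model, X_sorted, y_sorted, cv=KFold(n_splits=5))

# Shuffled folds: rows are assigned to folds at random.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = cross_val_score(model, X_sorted, y_sorted, cv=shuffled_cv)

print("contiguous:", contiguous.round(3), contiguous.mean())
print("shuffled:  ", shuffled.round(3), shuffled.mean())
```

Passing a `KFold(shuffle=True, random_state=...)` object as `cv` is one way to test whether row order explains unstable fold scores.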

Finding the best `alpha` and `l1_ratio` by hand with `cross_val_score` (verbose and time-consuming):

In [15]:
alphas = np.linspace(0.01, 10, 20)
l1_ratios = np.linspace(0.01, 1, 10)

scores = []

for alpha in alphas:
    for l1_ratio in l1_ratios:
        elastic_model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
        score_results = cross_val_score(elastic_model, X, y, cv=5)
        scores.append({
            'alpha': alpha,
            'l1_ratio': l1_ratio,
            'score': score_results.mean()
        })
        
df_scores = pd.DataFrame(scores)
df_scores.sort_values(by='score', ascending=False, inplace=True)

print(f'Best parameters: \nalpha: {df_scores.alpha.iloc[0]}\nl1_ratio: {df_scores.l1_ratio.iloc[0]}\nscore: {df_scores.score.iloc[0]}\n\n')
        

Best parameters: 
alpha: 0.5357894736842106
l1_ratio: 0.01
score: 0.4973992456206477




Finding the best `alpha` and `l1_ratio` with `GridSearchCV`, which automatically runs the same cross-validated search over every combination in the parameter grid.

In [16]:
elastic_model = ElasticNet()
param_grid = {
    'alpha': np.linspace(0.01, 10, 20),    # 20 alpha values
    'l1_ratio': np.linspace(0.01, 1, 10),  # 10 l1_ratio values
}
# 20 x 10 = 200 candidates, each scored with 5-fold CV
gcv_el = GridSearchCV(elastic_model, param_grid, cv=5, scoring='r2')
gcv_el.fit(X, y)


In [17]:
print("Best parameters:", gcv_el.best_params_)
print("Best score:", gcv_el.best_score_)

Best parameters: {'alpha': np.float64(0.5357894736842106), 'l1_ratio': np.float64(0.01)}
Best score: 0.4973992456206477
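
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` also exposes per-candidate results in `cv_results_` and a refitted model in `best_estimator_`. The sketch below shows one way to inspect them. It uses synthetic data from `make_regression` and a small grid, both of which are assumptions for the example rather than the notebook's setup.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for the housing data).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=5, noise=10.0, random_state=0
)

# A small illustrative grid: 3 x 3 = 9 candidates.
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}
search = GridSearchCV(ElasticNet(), param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

# One row per candidate, with its mean validation score and rank.
columns = ["param_alpha", "param_l1_ratio", "mean_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values("rank_test_score").head())

# With the default refit=True, the best candidate is refitted on all the data.
best_model = search.best_estimator_
predictions = best_model.predict(X_demo[:3])
print(predictions)
```

Sorting `cv_results_` this way reproduces the ranking that the manual loop built with `df_scores`, without writing the nested loops by hand.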
