# 📌 Cross-Validation Methods & Evaluation Metrics in Classification

## 🔹 Cross-Validation Methods
Cross-validation is used to evaluate model performance by splitting the dataset into multiple parts (folds). The model is trained on some folds and tested on others, ensuring more reliable performance estimation.

### 1. **K-Fold Cross-Validation**
- Splits the dataset into `k` folds of (approximately) equal size.
- The model is trained on `k-1` folds and tested on the remaining fold.
- Repeats this process `k` times, each time with a different test fold.
- **Pros:** Simple, works for balanced datasets.  
- **Cons:** Class distribution may vary across folds.
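
The drawback above can be seen on a toy example. The sketch below is only an illustration: the arrays `X_toy` and `y_toy` are made up, and the two positive samples are deliberately placed at the end so that unshuffled folds pick them up unevenly.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, only the last 2 are positive
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

kf_toy = KFold(n_splits=5)  # no shuffling, so each test fold is a contiguous block

positives_per_fold = []
for train_idx, test_idx in kf_toy.split(X_toy):
    n_pos = int(y_toy[test_idx].sum())
    positives_per_fold.append(n_pos)
    print("test indices:", test_idx, "| positives in test fold:", n_pos)
```

Four of the five test folds contain no positive sample at all, so recall for the positive class cannot even be measured on them.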

### 2. **Stratified K-Fold Cross-Validation**
- Similar to K-Fold, but ensures each fold has approximately the same class distribution as the full dataset.
- Particularly useful for **imbalanced classification problems**.
- **Pros:** Preserves label proportions, reducing bias in performance metrics.  
- **Cons:** Slightly more complex than plain K-Fold and needs the class labels to build the folds; it is nevertheless the usual choice for classification tasks.
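
A minimal sketch of the stratification behaviour, assuming a made-up label vector with 25% positives (`y_toy`); the names are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 15 negatives followed by 5 positives (25% positive)
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 15 + [1] * 5)

skf_toy = StratifiedKFold(n_splits=5)

positive_share = []
for train_idx, test_idx in skf_toy.split(X_toy, y_toy):
    share = float(y_toy[test_idx].mean())  # fraction of positives in this test fold
    positive_share.append(share)
    print("test fold size:", len(test_idx), "| positive share:", share)
```

Every test fold keeps the 25% positive share of the full label vector, even though the positives are clustered at the end of the data.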

---

## 🔹 Performance Analysis Tool

### **Confusion Matrix**
A diagnostic tool that shows prediction results in a tabular form:
- **True Positives (TP):** Correctly predicted positives.  
- **True Negatives (TN):** Correctly predicted negatives.  
- **False Positives (FP):** Incorrectly predicted positives.  
- **False Negatives (FN):** Incorrectly predicted negatives.  

It is not a metric by itself, but forms the **basis for evaluation metrics** such as accuracy, precision, recall, and F1-score.
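
As a small illustration, the four counts can be read directly from scikit-learn's `confusion_matrix`. The label vectors below are invented for the example; for binary labels `0`/`1`, scikit-learn puts actual classes in rows and predicted classes in columns, so the layout is `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for 8 samples
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()  # row-major order: TN, FP, FN, TP

print(cm_demo)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```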

---

## 🔹 Evaluation Metrics in Classification

### 1. **Accuracy**
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]  
- Fraction of correct predictions.  
- **Limitation:** Misleading in imbalanced datasets.
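
To make the limitation concrete, here is a sketch with an invented, heavily imbalanced label vector and a "model" that always predicts the majority class. The data and names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 negatives, 5 positives
y_true_demo = np.array([0] * 95 + [1] * 5)

# A trivial classifier that always predicts the majority class
y_pred_demo = np.zeros_like(y_true_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)
rec = recall_score(y_true_demo, y_pred_demo)

print(f"accuracy = {acc:.2f}, recall for the positive class = {rec:.2f}")
```

The accuracy looks high although not a single positive case is detected, which is why accuracy should be read together with the other metrics below.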

### 2. **Precision**
\[
\text{Precision} = \frac{TP}{TP + FP}
\]  
- Of all predicted positives, how many are actually positive?  
- Important when the **cost of false positives** is high (e.g., spam detection).

### 3. **Recall (Sensitivity or True Positive Rate)**
\[
\text{Recall} = \frac{TP}{TP + FN}
\]  
- Of all actual positives, how many did the model correctly identify?  
- Important when the **cost of false negatives** is high (e.g., medical diagnosis).

### 4. **F1-Score**
\[
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]  
- Harmonic mean of Precision and Recall.  
- Useful for **imbalanced datasets** where both false positives and false negatives matter.
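
The four formulas above can be checked by hand. The sketch below plugs in the counts from the Fold 1 confusion matrix that appears in the cross-validation output later in this notebook (`[[932, 101], [186, 188]]`, with class `1` = churn treated as the positive class).

```python
# Counts read from the Fold 1 confusion matrix [[932, 101], [186, 188]]
# (rows = actual class, columns = predicted class)
tn, fp, fn, tp = 932, 101, 186, 188

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.796 precision=0.651 recall=0.503 f1=0.567
```

These values match the class `1` row and the accuracy line of the Fold 1 classification report.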

### 5. **AUC-ROC (Area Under the Receiver Operating Characteristic Curve)**
- **ROC Curve:** Plots the True Positive Rate (Recall) against the False Positive Rate at various thresholds.  
- **AUC (Area Under Curve):** Measures the ability of the model to distinguish between classes.  
  - `AUC = 1.0`: Perfect classifier.  
  - `AUC = 0.5`: No better than random guessing.  
- **Use Case:** Effective for comparing models because it summarizes performance across all classification thresholds. On heavily **imbalanced datasets** it can look optimistic, so read it alongside precision and recall.
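
A minimal sketch of computing the ROC curve and its AUC with scikit-learn. The four labels and scores are invented; in practice the scores would come from something like `predict_proba(...)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores for the positive class
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# AUC equals the fraction of (positive, negative) pairs ranked correctly:
# here 3 of the 4 pairs have the positive scored higher than the negative.
auc = roc_auc_score(y_true_demo, y_score_demo)

# Points of the ROC curve (one per distinct threshold)
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)

print(f"AUC = {auc:.2f}")
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```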


In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, StratifiedKFold, LeaveOneOut, GroupKFold
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report


# Load your dataset
data = pd.read_csv('Telco_Cusomer_Churn.csv')
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [29]:
data

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [30]:
#Check for number of Unique Customer IDs
data['customerID'].nunique()

7043

In [31]:
#Replace empty strings with nan
data['TotalCharges'] = data['TotalCharges'].replace(' ', np.nan)

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [33]:
#Change the datatype of Total Charges to Float
data['TotalCharges'] = data['TotalCharges'].astype(float)
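
The two steps above (replace blanks with `NaN`, then cast to float) can also be done in one call. This is only an alternative sketch on a made-up Series named `charges`, not the approach used in this notebook: `pd.to_numeric` with `errors="coerce"` turns any value that cannot be parsed as a number into `NaN`.

```python
import pandas as pd

# Hypothetical column with a blank entry, similar to the raw TotalCharges values
charges = pd.Series(["29.85", " ", "1889.5"])

# Unparseable entries (such as the blank string) become NaN
charges_numeric = pd.to_numeric(charges, errors="coerce")

print(charges_numeric)
print("missing values:", charges_numeric.isna().sum())
```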

In [34]:
#Get the total number of missing values in the Total Charges column

data['TotalCharges'].isnull().sum()

np.int64(11)

In [35]:
#Drop the rows with missing values in the Total Charges column
data.dropna(subset=['TotalCharges'], inplace=True)

In [36]:
#Drop the Customer ID column as it is unique and not useful for any analyses
data.drop(columns=['customerID'], inplace=True)

In [37]:
#data['PaymentMethod'].unique()
data['Contract'].unique()

array(['Month-to-month', 'One year', 'Two year'], dtype=object)

In [38]:
# Encode categorical independent features using one-hot encoding
data_encoded = data.drop(columns=['Churn'])
target = data['Churn']
df_encoded = pd.get_dummies(data_encoded, drop_first=False)

In [39]:
# Encode the Target Column using Label Encoding
le = LabelEncoder()
df_encoded['Churn'] = le.fit_transform(target)
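
It is worth confirming which label ends up as the positive class (`1`), because precision, recall, and F1 are later computed for that class. `LabelEncoder` assigns codes in sorted order of the labels; the sketch below shows this on a short made-up list of `Yes`/`No` values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values in the same Yes/No form as the Churn column
churn_demo = ["No", "Yes", "No", "Yes", "Yes"]

le_demo = LabelEncoder()
encoded_demo = le_demo.fit_transform(churn_demo)

# classes_ holds the original labels; the position of each label is its integer code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}

print(mapping)
print(encoded_demo.tolist())
```

With these labels, `No` is encoded as `0` and `Yes` as `1`, so churners form the positive class.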

In [40]:
df_encoded

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,...,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn
0,0,1,29.85,29.85,True,False,False,True,True,False,...,True,False,False,False,True,False,False,True,False,0
1,0,34,56.95,1889.50,False,True,True,False,True,False,...,False,True,False,True,False,False,False,False,True,0
2,0,2,53.85,108.15,False,True,True,False,True,False,...,True,False,False,False,True,False,False,False,True,1
3,0,45,42.30,1840.75,False,True,True,False,True,False,...,False,True,False,True,False,True,False,False,False,0
4,0,2,70.70,151.65,True,False,True,False,True,False,...,True,False,False,False,True,False,False,True,False,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0,24,84.80,1990.50,False,True,False,True,False,True,...,False,True,False,False,True,False,False,False,True,0
7039,0,72,103.20,7362.90,True,False,False,True,False,True,...,False,True,False,False,True,False,True,False,False,0
7040,0,11,29.60,346.45,True,False,False,True,False,True,...,True,False,False,False,True,False,False,True,False,0
7041,1,4,74.40,306.60,False,True,False,True,True,False,...,True,False,False,False,True,False,False,False,True,1


In [41]:
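# Cast every column to int: the boolean dummy columns become 0/1.
# Note: this also truncates the float columns MonthlyCharges and TotalCharges.
df_encoded = df_encoded.astype(int)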
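
Calling `astype(int)` on the whole frame casts the float columns as well, so values such as `29.85` lose their decimal part. If that is not intended, one option is to cast only the boolean columns. The sketch below uses a hypothetical two-row frame (`demo`) and assumes a pandas version in which `get_dummies` returns boolean columns, as in the output shown above.

```python
import pandas as pd

# Hypothetical miniature frame with one numeric and one categorical column
demo = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95],
    "Contract": ["Month-to-month", "One year"],
})
demo_encoded = pd.get_dummies(demo)

# Select only the boolean dummy columns and cast those to 0/1
bool_cols = demo_encoded.select_dtypes(include="bool").columns
demo_encoded[bool_cols] = demo_encoded[bool_cols].astype(int)

print(demo_encoded.dtypes)
print(demo_encoded)
```

Tree-based models are usually not very sensitive to this kind of truncation, but keeping the original floats avoids discarding information silently.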

In [42]:
data = df_encoded

In [43]:
#Reset the Dataframe Index
data.reset_index(drop=True, inplace=True)

In [44]:
# Separate the dependent and independent features
X = data.drop('Churn', axis=1)
y = data['Churn']

In [45]:
#Instantiate the Random Forest classifier
rf = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_depth=None,         # let trees grow fully (can tune this)
    random_state=42
)

# 📌 Stratified K-Fold Cross-Validation with Accuracy Scoring

This code sets up a **StratifiedKFold** cross-validator with:
- `n_splits=5`: Divides the dataset into 5 folds.  
- `shuffle=True`: Shuffles the data before splitting, so fold membership does not depend on the original row order.  
- `random_state=42`: Ensures reproducibility of the splits.  

It then uses **cross_val_score** to:
- Train and evaluate the Random Forest model (`rf`) across the folds.  
- Use **accuracy** as the scoring metric.  
- Return an array (`scores_skf`) containing the accuracy for each fold.  

This gives a quick measure of the model’s performance stability across different data splits.


In [46]:
# Validate the model with stratified 5-fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_skf = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')

In [47]:
scores_skf

array([0.7960199 , 0.78678038, 0.78947368, 0.79089616, 0.79587482])

In [48]:
scores_skf.mean()

np.float64(0.7918089900022343)
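
`cross_val_score` reports a single metric per call. When several metrics are wanted from the same folds, `cross_validate` accepts a list of scorer names. The sketch below is an illustration on synthetic data from `make_classification` (not the churn dataset); `X_demo`, `y_demo`, `model`, and `cv` are names made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced binary problem (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

metric_names = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metric_names)

# Each metric is stored under the key "test_<name>" as one value per fold
for name in metric_names:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```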

# 📌 K-Fold Cross-Validation with Accuracy Scoring

This code sets up a **KFold** cross-validator with:
- `n_splits=5`: Splits the dataset into 5 folds of approximately equal size.  
- `shuffle=True`: Randomly shuffles the data before splitting.  
- `random_state=42`: Ensures reproducibility of the splits.  

It then applies **cross_val_score** to:
- Train and evaluate the Random Forest model (`rf`) on each fold.  
- Use **accuracy** as the performance metric.  
- Store the accuracy of each fold in `scores_kf`.  

Unlike **StratifiedKFold**, this does not guarantee that class proportions are preserved across folds.


In [49]:
# Validate the model with plain (non-stratified) 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kf = cross_val_score(rf, X, y, cv=kf, scoring='accuracy')

In [50]:
scores_kf

array([0.78322672, 0.79317697, 0.76386913, 0.79516358, 0.79871977])

In [51]:
scores_kf.mean()

np.float64(0.7868312370276236)
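
The two sets of fold accuracies reported above can be compared by their spread as well as their mean. The sketch below simply re-enters the printed values as arrays (the names `scores_skf_reported` and `scores_kf_reported` are made up so as not to overwrite the notebook's variables).

```python
import numpy as np

# Fold accuracies copied from the outputs above
scores_skf_reported = np.array([0.7960199, 0.78678038, 0.78947368, 0.79089616, 0.79587482])
scores_kf_reported = np.array([0.78322672, 0.79317697, 0.76386913, 0.79516358, 0.79871977])

for name, scores in [("StratifiedKFold", scores_skf_reported),
                     ("KFold", scores_kf_reported)]:
    spread = scores.max() - scores.min()
    print(f"{name}: mean={scores.mean():.4f}  std={scores.std():.4f}  range={spread:.4f}")
```

In this run the stratified folds give a tighter spread of accuracies. This is a single pair of runs with one random seed, so it should be read as an observation rather than a general result.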

# 📌 Cross-Validation with Performance Metrics

This code performs stratified k-fold cross-validation using a Random Forest classifier (`rf`).  
For each fold, it:  
1. Splits the dataset into training and testing sets.  
2. Trains the model on the training set.  
3. Makes predictions on the test set.  
4. Computes and prints the **confusion matrix** and **classification report**.  
5. Stores accuracy, precision, recall, and F1-score for later aggregation.  

After all folds are completed:  
- It calculates the **average confusion matrix** across folds.  
- It reports the **mean ± standard deviation** of Accuracy, Precision, Recall, and F1-score.  

This helps evaluate model performance consistency across different folds of the dataset.
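
When only an overall confusion matrix is needed, the per-fold procedure described above can be condensed with `cross_val_predict`, which returns one out-of-fold prediction per sample. The sketch below is an alternative shown on synthetic data (an assumption, not the churn dataset), and it yields a pooled (summed) confusion matrix rather than a per-fold average.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary problem used only for illustration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted by the model trained on the folds that exclude it
y_oof = cross_val_predict(model, X_demo, y_demo, cv=cv)

pooled_cm = confusion_matrix(y_demo, y_oof)
print("Pooled confusion matrix over all folds:")
print(pooled_cm)
print(classification_report(y_demo, y_oof, digits=3))
```

The explicit loop remains the better fit when fold-by-fold reports or the standard deviation of each metric are required.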


In [52]:
fold = 1
all_conf_matrices = []
accuracies, precisions, recalls, f1s = [], [], [], []

for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx] 
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]  

    # Train model
    rf.fit(X_train, y_train)

    # Predictions
    y_pred = rf.predict(X_test)

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    all_conf_matrices.append(cm)

    print(f"\nFold {fold} Confusion Matrix:")
    print(cm)
    print("Classification Report:")
    print(classification_report(y_test, y_pred, digits=3))
    
    fold += 1
    accuracies.append(accuracy_score(y_test, y_pred))
    precisions.append(precision_score(y_test, y_pred, average="binary"))
    recalls.append(recall_score(y_test, y_pred, average="binary"))
    f1s.append(f1_score(y_test, y_pred, average="binary"))

# Average confusion matrix
mean_cm = np.mean(all_conf_matrices, axis=0)
print("\nAverage Confusion Matrix across folds:")
print(mean_cm.astype(int))

print("Average Accuracy: %.3f ± %.3f" % (np.mean(accuracies), np.std(accuracies)))
print("Average Precision: %.3f ± %.3f" % (np.mean(precisions), np.std(precisions)))
print("Average Recall: %.3f ± %.3f" % (np.mean(recalls), np.std(recalls)))
print("Average F1: %.3f ± %.3f" % (np.mean(f1s), np.std(f1s)))


Fold 1 Confusion Matrix:
[[932 101]
 [186 188]]
Classification Report:
              precision    recall  f1-score   support

           0      0.834     0.902     0.867      1033
           1      0.651     0.503     0.567       374

    accuracy                          0.796      1407
   macro avg      0.742     0.702     0.717      1407
weighted avg      0.785     0.796     0.787      1407


Fold 2 Confusion Matrix:
[[926 107]
 [193 181]]
Classification Report:
              precision    recall  f1-score   support

           0      0.828     0.896     0.861      1033
           1      0.628     0.484     0.547       374

    accuracy                          0.787      1407
   macro avg      0.728     0.690     0.704      1407
weighted avg      0.775     0.787     0.777      1407


Fold 3 Confusion Matrix:
[[925 108]
 [188 185]]
Classification Report:
              precision    recall  f1-score   support

           0      0.831     0.895     0.862      1033
           1      0.6