In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/creditcardfraud/creditcard.csv


## Train-Test Split & Cross-Validation

### 1. Introduction: Model Evaluation, Generalization, and Overfitting

* **Model Evaluation:**
   
    * The cornerstone of machine learning, model evaluation quantifies a model's performance on a given dataset.
    * It's not just about getting a score; it's about understanding how well our model is likely to perform in the real world.
* **Generalization:**
   
    * The ultimate goal of any machine learning model.
    * Generalization is the model's capacity to accurately predict outcomes on *unseen* data, data it wasn't trained on.
    * A model with good generalization is robust and reliable.
* **Overfitting:**
   
    * The nemesis of generalization.
    * Overfitting occurs when a model learns the training data *too* well, capturing noise and idiosyncrasies instead of the underlying patterns.
    * An overfit model performs brilliantly on training data but miserably on new data.
    * Evaluation techniques are crucial for detecting and mitigating overfitting.
* **Mathematical Significance:**
   
    * Model evaluation is deeply rooted in statistical learning theory, which provides a framework for quantifying the uncertainty in model predictions.
    * Concepts like bias-variance tradeoff, confidence intervals, and statistical hypothesis testing are fundamental to rigorous model evaluation.
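
To make the overfitting idea concrete, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the synthetic data (one feature, true rule `x > 0.5`, 20% of labels flipped), the 1-nearest-neighbour "memorizer", and the helper names. It is not part of the fraud analysis that follows.

```python
import random

random.seed(0)

def make_data(n, noise=0.2):
    """Return n (x, label) pairs; the true rule is x > 0.5, with labels flipped at rate `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < noise:
            label = 1 - label  # inject label noise
        data.append((x, label))
    return data

train = make_data(200)
test = make_data(200)

def predict_1nn(x):
    """Memorizing model: copy the label of the closest training point."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def predict_rule(x):
    """Simple model: a single threshold, which matches the underlying pattern."""
    return int(x > 0.5)

def accuracy(predict, data):
    """Fraction of (x, label) pairs that `predict` gets right."""
    return sum(predict(x) == label for x, label in data) / len(data)

train_acc_1nn = accuracy(predict_1nn, train)
test_acc_1nn = accuracy(predict_1nn, test)
train_acc_rule = accuracy(predict_rule, train)
test_acc_rule = accuracy(predict_rule, test)

print(f"1-NN       train={train_acc_1nn:.2f}  test={test_acc_1nn:.2f}")
print(f"threshold  train={train_acc_rule:.2f}  test={test_acc_rule:.2f}")
```

The memorizer scores perfectly on the training set because each training point is its own nearest neighbour, noise included. Only the held-out score shows how much of that was pattern and how much was noise.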

### 2. Train-Test Split: The Basic Necessity

* **Importance:**
   
    * The train-test split is the most basic yet essential technique to assess generalization.
    * It simulates the real-world scenario where a model is trained on historical data and deployed to make predictions on new, incoming data.
    * By holding out a portion of the data, we get an unbiased estimate of how the model will perform in practice.
* **Implementation:**
   
    * The data is partitioned into two mutually exclusive subsets:
        * **Training Set:** The larger subset, typically 70-80% of the data, used to train the model. The model learns the patterns and relationships within this data.
        * **Testing Set:** The smaller subset, typically 20-30% of the data, held aside and used *only* to evaluate the model's final performance. This set acts as a proxy for unseen data.
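    * The split is usually performed *randomly* so that both sets follow a similar statistical distribution. For heavily imbalanced targets, a stratified split (`stratify=y` in scikit-learn) additionally keeps the class proportions the same in both sets.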
    * Scikit-learn's `train_test_split` function simplifies this process, offering control over the split ratio and randomization.
* **Demonstrating the Impact:**
   
    * We'll train a simple model (e.g., Logistic Regression) *without* and *with* a train-test split to highlight the difference in evaluation.
* **Mathematical/Technical Significance:**
   
    * The train-test split provides an *estimate* of the generalization error.
    * If we train and evaluate on the same data, our performance metrics will be optimistically biased, reflecting the model's ability to *memorize* rather than *generalize*.
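
As a small sketch of the split itself, the snippet below calls `train_test_split` on synthetic stand-in data (an assumption for illustration, not the fraud dataset) and passes `stratify=y` so that the rare class is shared proportionally between the two sets. The 80/20 ratio and the seed values are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration): 10,000 rows, 0.5% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = np.zeros(10_000, dtype=int)
y[:50] = 1

# stratify=y asks for the same class proportions in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train size:", len(y_train), "positives:", int(y_train.sum()))
print("test size: ", len(y_test), "positives:", int(y_test.sum()))
```

Without `stratify`, the number of positives landing in the test set would vary from seed to seed, which matters when there are only a few dozen of them.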


In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the Credit Card Fraud Detection dataset (replace with your path)
data = pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")
data.drop('Time', axis=1, inplace=True) # Drop 'Time' for simplicity

In [20]:
X = data.drop('Class', axis=1)
y = data['Class']

# 1. Training and Evaluating WITHOUT Train-Test Split (Illustrating Overfitting)
model_no_split = LogisticRegression(solver='liblinear')
model_no_split.fit(X, y)
y_pred_no_split = model_no_split.predict(X) # Predict on the *same* data
print("--- Evaluation WITHOUT Train-Test Split ---")
print("Accuracy:", accuracy_score(y, y_pred_no_split))
print("Classification Report:\n", classification_report(y, y_pred_no_split))


--- Evaluation WITHOUT Train-Test Split ---
Accuracy: 0.9992029690281489
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    284315
           1       0.88      0.62      0.73       492

    accuracy                           1.00    284807
   macro avg       0.94      0.81      0.86    284807
weighted avg       1.00      1.00      1.00    284807



In [22]:
# 2. Training and Evaluating WITH Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Split data
model_with_split = LogisticRegression(solver='liblinear')
model_with_split.fit(X_train, y_train)
y_pred_with_split = model_with_split.predict(X_test) # Predict on the *test* data
print("\n--- Evaluation WITH Train-Test Split ---")
print("Accuracy:", accuracy_score(y_test, y_pred_with_split))
print("Classification Report:\n", classification_report(y_test, y_pred_with_split))


--- Evaluation WITH Train-Test Split ---
Accuracy: 0.9991046662687406
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.86      0.57      0.69        98

    accuracy                           1.00     56962
   macro avg       0.93      0.79      0.84     56962
weighted avg       1.00      1.00      1.00     56962



* **Expected Outcome:**
   
    * Evaluating on the same data used for training gives an optimistically biased picture: the model is scored on examples it has already seen, so the numbers cannot reveal overfitting. Here that evaluation reports an accuracy of 0.9992 and a fraud recall of 0.62.
    * Evaluating on the held-out test set gives a more realistic (and usually lower) estimate: an accuracy of 0.9991 and a fraud recall of 0.57. The accuracy gap is tiny, but the fraud-class F1 drops from 0.73 to 0.69.

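> Accuracy is high in both cases largely because the dataset is heavily imbalanced (492 frauds among 284,807 transactions), so accuracy alone says little. The fraud-class precision, recall, and F1 still differ between the two evaluations, as observed above.
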
* **Key Takeaway:**
   
    * The difference between the two evaluations highlights the importance of the train-test split in providing an unbiased estimate of generalization performance.
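
A quick piece of arithmetic shows why accuracy is a weak yardstick on this data. The sketch below uses only the support counts and the accuracy printed in the test-set report above; the "always predict class 0" baseline is a hypothetical comparison point, not a model trained in this notebook.

```python
# Support counts copied from the test-set classification report above.
n_legit = 56864
n_fraud = 98
n_total = n_legit + n_fraud

# A "model" that always predicts class 0 gets every legitimate row right
# and every fraud row wrong.
baseline_accuracy = n_legit / n_total
baseline_fraud_recall = 0 / n_fraud

model_accuracy = 0.9991046662687406  # reported accuracy with the split

print(f"always-0 baseline accuracy:   {baseline_accuracy:.4f}")
print(f"logistic regression accuracy: {model_accuracy:.4f}")
print(f"accuracy gain over baseline:  {model_accuracy - baseline_accuracy:.4f}")
print(f"baseline fraud recall:        {baseline_fraud_recall:.2f}")
```

The baseline catches no fraud at all yet lands within a fraction of a percentage point of the trained model's accuracy, which is why the per-class recall and F1 are the numbers to watch.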

### 3. Cross-Validation: Robust Evaluation

* **Limitations of Train-Test Split:**
   
    * While essential, a single train-test split has limitations:
        * **Variability:** The performance estimate can be sensitive to how the data is split, especially with smaller datasets. A different random split might yield a different result.
        * **Data Waste:** The test set is held out entirely from training, meaning we're not using all available data to train the model. This can be a concern when data is scarce.
* **Cross-Validation to the Rescue:**
   
    * Cross-validation is a powerful technique to overcome these limitations.
    * It provides a *more robust* estimate of model performance by averaging over multiple train-test splits, which reduces the dependence on any single split.
    * Every sample is used for both training and evaluation (in different iterations), addressing the data waste issue.
* **k-Fold Cross-Validation:**
   
    * The most common type of cross-validation.
    * The data is divided into *k* "folds" of (approximately) equal size.
    * The model is trained and evaluated *k* times.
    * In each iteration, one fold is held out as the test set, and the remaining *k-1* folds are used as the training set.
    * The final performance is the *average* of the performance across the *k* folds.
    * Common values for *k* are 5 and 10.
* **Stratified k-Fold Cross-Validation:**
   
    * A variant of k-fold, crucial for *imbalanced* datasets (like our fraud data).
    * It ensures that *each fold* has approximately the *same proportion* of samples from each class as the original dataset.
    * This prevents a scenario where some folds have very few or no examples of the minority class (e.g., fraud), which would lead to unreliable evaluation.
* **Mathematical/Technical Significance:**
   
    * Cross-validation provides a *distribution* of performance metrics (e.g., a distribution of accuracy scores across the k folds), giving us a better sense of the model's variability.
    * The average performance from cross-validation is a *more reliable* estimate of the generalization error than a single train-test split.
    * Stratified k-fold maintains the class distribution, which is essential for reliable evaluation in classification problems with imbalanced classes.
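
The fold mechanics described above can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not scikit-learn's implementation: the helper names `kfold_indices` and `stratified_kfold_indices` and the toy label list are hypothetical.

```python
import random
from collections import defaultdict

def kfold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def stratified_kfold_indices(labels, k, seed=0):
    """Deal each class's indices into the k folds separately, so every fold
    keeps roughly the original class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for position, idx in enumerate(class_indices):
            folds[position % k].append(idx)
    return folds

# Toy imbalanced labels (an assumption for illustration): 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5

plain_folds = kfold_indices(len(labels), k=5)
strat_folds = stratified_kfold_indices(labels, k=5)

for name, folds in [("plain", plain_folds), ("stratified", strat_folds)]:
    positives = [sum(labels[i] for i in fold) for fold in folds]
    print(f"{name:>10}: fold sizes={[len(f) for f in folds]} positives per fold={positives}")

# Each iteration holds out one fold for testing and trains on the other k-1.
for test_fold in strat_folds:
    held_out = set(test_fold)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    assert len(train_idx) + len(test_fold) == len(labels)
```

In the stratified version every fold receives exactly one of the five positives. In the plain version the count per fold depends on the shuffle, so a fold can end up with none.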




In [26]:
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Reload data (to be safe)
data = pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")
data.drop('Time', axis=1, inplace=True)
X = data.drop('Class', axis=1)
y = data['Class']

model = LogisticRegression(solver='liblinear') # Our model

# 1. K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42) # 5 folds, shuffle data

print("\n--- K-Fold Cross-Validation ---")
auc_scores_kf = cross_val_score(model, X, y, cv=kf, scoring='roc_auc') # AUC for each fold
print("AUC Scores (K-Fold):", auc_scores_kf)
print("Mean AUC (K-Fold):", auc_scores_kf.mean())
print("AUC Std Dev (K-Fold):", auc_scores_kf.std()) # Variability




--- K-Fold Cross-Validation ---
AUC Scores (K-Fold): [0.97630688 0.97969451 0.96133088 0.98437416 0.97225807]
Mean AUC (K-Fold): 0.9747929014598146
AUC Std Dev (K-Fold): 0.007820099053305354


In [27]:
# 2. Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("\n--- Stratified K-Fold Cross-Validation ---")
auc_scores_skf = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
print("AUC Scores (Stratified K-Fold):", auc_scores_skf)
print("Mean AUC (Stratified K-Fold):", auc_scores_skf.mean())
print("AUC Std Dev (Stratified K-Fold):", auc_scores_skf.std())


--- Stratified K-Fold Cross-Validation ---
AUC Scores (Stratified K-Fold): [0.97758177 0.98356265 0.97903428 0.96609036 0.9652525 ]
Mean AUC (Stratified K-Fold): 0.9743043141085049
AUC Std Dev (Stratified K-Fold): 0.0073244197100807255


* **Expected Outcome:**
   
    * Cross-validation gives a *more consistent* estimate of the model's performance than a single train-test split, because it averages over five different held-out folds.
    * Stratified k-fold matters most when the minority class is small enough that a plain fold could end up with very few fraud cases. On this run the two schemes give nearly identical mean AUC and standard deviation, which suggests that with 492 fraud cases and shuffling, the plain folds were already reasonably balanced.
    * The standard deviation of the cross-validation scores measures how much the estimate varies from fold to fold. A lower standard deviation indicates a more stable estimate.
* **Key Takeaways:**
   
    * Cross-validation is a powerful tool for robust model evaluation.
    * Stratification is essential for imbalanced datasets.
    * Cross-validation provides a more complete picture of model performance than a single train-test split.
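
One common way to report cross-validation results is as mean plus or minus standard deviation. The sketch below recomputes both from the fold scores printed in the outputs above, using only the standard library; the variable names are illustrative.

```python
from statistics import mean, pstdev

# Fold scores copied from the two cross-validation outputs above.
auc_kfold = [0.97630688, 0.97969451, 0.96133088, 0.98437416, 0.97225807]
auc_stratified = [0.97758177, 0.98356265, 0.97903428, 0.96609036, 0.9652525]

for name, scores in [("K-Fold", auc_kfold), ("Stratified K-Fold", auc_stratified)]:
    # pstdev is the population standard deviation, matching NumPy's default .std()
    print(f"{name}: mean AUC = {mean(scores):.4f} +/- {pstdev(scores):.4f}")

# Compare the gap between the two means with the fold-to-fold spread.
gap = abs(mean(auc_kfold) - mean(auc_stratified))
print(f"gap between the two means: {gap:.4f}")
```

The gap between the two means is far smaller than either standard deviation, so this run gives no evidence that one scheme scores the model differently from the other.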