# Task 4: Binary Classification - Cross-Validation

This notebook loads the preprocessed data saved by `1_consolidate_data.ipynb` and trains/evaluates a Logistic Regression model using K-Fold Cross-Validation.

Cross-validation gives a more robust estimate of the model's performance on unseen data than a single train-test split, because every observation is used for validation exactly once and the score is averaged over several splits.

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
import numpy as np
import utils

df = utils.load_preprocessed_data() # Default path 'data/preprocessed_data.parquet'

In [None]:
# Separate train/test and features/target using utility function.
# Note: this hold-out split is not used by the cross-validation below, which runs on the full dataset.
X_train_scaled, y_train, X_test_scaled, y_test = utils.split_data_features_target(df)
y_train = utils.convert_target_variable(y_train)
y_test = utils.convert_target_variable(y_test)

In [None]:
# Separate features (X) and target (y) from the entire dataset.
# Cross-validation handles the splitting internally, so the 'split' column is not used:
# CV runs on the full DataFrame loaded from the parquet file rather than on a pre-split training portion.
X = df.drop(['Class', 'split'], axis=1) # Assumes 'split' column exists and is not needed for features.
y_original = df['Class'] # Get the original target column

# Convert target variable using utility function
y = utils.convert_target_variable(y_original)

print(f"\nFeatures shape for CV: {X.shape}")
print(f"Target shape for CV: {y.shape}")


# Perform Cross-Validation

We will use Stratified K-Fold cross-validation to ensure that each fold maintains the same proportion of classes as the original dataset, which is important for potentially imbalanced datasets.
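To make the effect of stratification concrete, the sketch below compares the share of positive labels in each validation fold for `StratifiedKFold` and plain `KFold`. The labels are synthetic (90 negatives, 10 positives) and the names `X_demo`, `y_demo`, and `positive_rate_per_fold` are illustrative assumptions, not part of this notebook's data or `utils` module.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic, imbalanced labels (assumption for illustration): 90 negatives, 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect how folds are assigned


def positive_rate_per_fold(splitter, X, y):
    """Return the fraction of positive labels in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]


stratified_rates = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42), X_demo, y_demo
)
plain_rates = positive_rate_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=42), X_demo, y_demo
)

print("StratifiedKFold positive rate per fold:", stratified_rates)
print("KFold positive rate per fold:          ", plain_rates)
```

With these synthetic labels every stratified fold holds the same 10% share of positives as the full label vector, while the plain `KFold` rates depend on how the shuffle happens to distribute the minority class.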

In [None]:
# Instantiate the model
log_reg_cv = LogisticRegression(random_state=42, max_iter=1000)

# Define the cross-validation strategy
# Using StratifiedKFold for classification tasks, especially if the target is imbalanced
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score)
}

# Perform cross-validation for multiple scores
print(f"Performing {cv_strategy.get_n_splits()}-fold cross-validation...")
cv_scores = {}
for metric_name, scorer in scoring.items():
    scores = cross_val_score(log_reg_cv, X, y, cv=cv_strategy, scoring=scorer)
    cv_scores[metric_name] = scores
    print(f"Scores for {metric_name}: {scores}")
    print(f"Mean {metric_name}: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
    print("---")

print("Cross-validation complete.")
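
The loop above calls `cross_val_score` once per metric, so the model is refit for every metric on every fold (4 metrics × 5 folds = 20 fits). As a possible alternative, scikit-learn's `cross_validate` accepts several metrics at once and fits the model only once per fold. The sketch below runs on a synthetic dataset from `make_classification`, which stands in for `X` and `y` as an assumption; it is not this notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for X and y (assumption; the notebook uses the preprocessed parquet data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

model = LogisticRegression(random_state=42, max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Built-in scorer names cover the same four metrics.
# Note: the "precision" string uses scikit-learn's default zero_division
# handling rather than the explicit zero_division=0 of a custom scorer.
metrics = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metrics)

# Each metric's per-fold scores are stored under the key "test_<metric>"
for metric in metrics:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.4f} (+/- {fold_scores.std():.4f})")
```

Because all metrics are computed from the same fitted model in each fold, the scores are directly comparable fold by fold, and the total training time drops by roughly the number of metrics.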

# Interpretation

The results above show the performance for each of the 5 folds and the mean (+/- standard deviation) across the folds for accuracy, precision, recall, and F1-score.

- **Mean Score:** Estimates the model's expected performance on unseen data, averaged over the 5 validation folds.
- **Standard Deviation:** Measures how much the score varies from fold to fold. A lower standard deviation suggests more consistent performance across different subsets of the data.
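
As a small worked example of these two summaries, the sketch below computes them for a set of fold scores. The values in `fold_scores` are hypothetical and chosen only for illustration; they are not results from this notebook. It also shows the sample standard deviation (`ddof=1`), which some readers may prefer when there are only a few folds.

```python
import numpy as np

# Hypothetical fold scores for illustration only (not results from this notebook)
fold_scores = np.array([0.90, 0.92, 0.88, 0.91, 0.89])

mean_score = fold_scores.mean()
std_population = fold_scores.std()    # ddof=0, the default also used by np.std(scores)
std_sample = fold_scores.std(ddof=1)  # sample standard deviation, larger when folds are few

print(f"Mean: {mean_score:.4f}")
print(f"Std (ddof=0): {std_population:.4f}")
print(f"Std (ddof=1): {std_sample:.4f}")
```

With only five folds, either standard deviation is a rough indicator of stability and should not be read as a precise confidence interval.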