# Task 4: Binary Classification - Cross-Validation

This notebook loads the preprocessed data saved by `1_consolidate_data.ipynb`, then trains and evaluates a Logistic Regression model using stratified K-Fold Cross-Validation.

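Cross-validation averages the model's performance over several train/test partitions of the same data. The resulting estimate of performance on unseen data depends less on which rows happen to land in the test set than the estimate from a single train-test split.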
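As a rough illustration of that claim, the sketch below compares how much the accuracy estimate moves when only the random seed changes, once for a single 80/20 hold-out split and once for the mean of a 5-fold cross-validation. The data is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, `single_split_acc`, and `cv_mean_acc` are placeholders that do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

single_split_acc = []
cv_mean_acc = []
for seed in range(10):
    model = LogisticRegression(max_iter=1000)

    # One 80/20 hold-out split: the score depends on which rows land in the test set.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=seed
    )
    single_split_acc.append(model.fit(X_tr, y_tr).score(X_te, y_te))

    # 5-fold CV: every row is tested exactly once and the fold scores are averaged.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    cv_mean_acc.append(cross_val_score(model, X_demo, y_demo, cv=cv).mean())

print(f"Single split accuracy, std across seeds:   {np.std(single_split_acc):.4f}")
print(f"5-fold CV mean accuracy, std across seeds: {np.std(cv_mean_acc):.4f}")
```

On most datasets the second number is expected to be smaller, but the exact values depend on the data and the seeds, so treat the output as indicative only.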

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Define input path
input_parquet_path = 'data/preprocessed_data.parquet'

# Load the preprocessed data
print(f"Loading preprocessed data from {input_parquet_path}...")
try:
    df = pd.read_parquet(input_parquet_path)
    print("Data loaded successfully.")
    print("\nLoaded DataFrame Info:")
    df.info()
except FileNotFoundError:
    print(f"Error: File not found at {input_parquet_path}. Please run notebook 1 first.")
    raise  # stop here: later cells need `df` to exist
except ImportError:
    print("\nError: 'pyarrow' or 'fastparquet' package is required to read Parquet format.")
    print("Please install it using: pip install pyarrow")
    raise
except Exception as e:
    print(f"\nAn error occurred while loading the Parquet file: {e}")
    raise

In [None]:
# Separate features (X) and target (y) from the entire dataset
# Cross-validation will handle the splitting internally
X = df.drop(['Class', 'split'], axis=1)
y = df['Class']

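# Convert target variable 'Class' from object ('n'/'y') to numeric (0/1) if necessary
if y.dtype == 'object':
    print("\nConverting target variable 'Class' to numeric (n=0, y=1)...")
    y = y.map({'n': 0, 'y': 1})
    # Any label other than 'n' or 'y' maps to NaN; fail early instead of passing NaN to the model
    if y.isna().any():
        raise ValueError("Unexpected values in 'Class'; expected only 'n' or 'y'.")
    print("Target variable converted.")
    print("Target value counts:\n", y.value_counts())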

print(f"\nFeatures shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Perform Cross-Validation

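We will use Stratified K-Fold cross-validation so that each fold keeps approximately the same class proportions as the full dataset. This matters when the classes are imbalanced: an unstratified fold could contain very few examples of the minority class, which makes precision, recall, and F1 for that fold unreliable.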
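To make the effect of stratification concrete, here is a minimal sketch on hypothetical labels (90 negatives, 10 positives) that prints the share of positives in each test fold for plain `KFold` and for `StratifiedKFold`. The names `y_demo`, `X_demo`, and `positive_rate_per_fold` are illustrative and not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect how rows are split


def positive_rate_per_fold(splitter):
    """Return the fraction of positive labels in each test fold."""
    return [y_demo[test_idx].mean() for _, test_idx in splitter.split(X_demo, y_demo)]


plain = positive_rate_per_fold(KFold(n_splits=5, shuffle=True, random_state=42))
stratified = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold positive rate per fold:          ", np.round(plain, 2))
print("StratifiedKFold positive rate per fold:", np.round(stratified, 2))
```

With stratification every test fold holds 2 of the 10 positives, so each rate is 0.1. With plain `KFold` the rate can vary from fold to fold.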

In [None]:
# Instantiate the model
log_reg_cv = LogisticRegression(random_state=42, max_iter=1000)

# Define the cross-validation strategy
# Using StratifiedKFold for classification tasks, especially if the target is imbalanced
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score)
}

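# Perform cross-validation once per metric
# Note: each cross_val_score call refits the model on every fold. The folds are
# identical across metrics because cv_strategy uses a fixed integer random_state.
n_folds = cv_strategy.get_n_splits()
print(f"Performing {n_folds}-fold cross-validation...")
cv_scores = {}
for metric_name, scorer in scoring.items():
    scores = cross_val_score(log_reg_cv, X, y, cv=cv_strategy, scoring=scorer)
    cv_scores[metric_name] = scores
    print(f"Scores for {metric_name}: {scores}")
    # np.std defaults to the population standard deviation (ddof=0)
    print(f"Mean {metric_name}: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
    print("---")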

print("Cross-validation complete.")

# Interpretation

The results above show, for accuracy, precision, recall, and F1-score, the score on each of the 5 folds and the mean (+/- standard deviation) across the folds.

- **Mean Score:** The average of the fold scores, used as the estimate of the model's performance on unseen data.
- **Standard Deviation:** The spread of the fold scores around that mean. A lower standard deviation suggests that performance is consistent across folds. With only 5 folds it is a rough indicator, not a precise error bar.
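As an alternative to looping over metrics, scikit-learn's `cross_validate` can score several metrics from a single fit per fold. The sketch below shows that approach and collects the per-fold scores into a summary table. It runs on synthetic data so that it is self-contained; `X_demo`, `y_demo`, `fold_scores`, and `summary` are placeholder names. It also uses scikit-learn's built-in scorer strings, whose `precision` scorer does not set `zero_division=0` as the custom scorer above does.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (an assumption; replace with the notebook's X and y).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]

# cross_validate fits the model once per fold and scores every metric on that fit.
results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring=metrics
)

# Per-fold scores are returned under keys named "test_<metric>".
fold_scores = pd.DataFrame({m: results[f"test_{m}"] for m in metrics})

# One row per metric; note that pandas' std uses the sample formula (ddof=1).
summary = fold_scores.agg(["mean", "std"]).T
print(summary.round(4))
```

Because pandas computes the sample standard deviation by default, the `std` column will be slightly larger than the `np.std` values for the same fold scores.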