# Simple Binary Classification with Cross-Validation

This notebook loads the preprocessed data saved by `1_consolidate_data.ipynb` and trains/evaluates a Logistic Regression model using K-Fold Cross-Validation.

Cross-validation gives a more robust estimate of the model's performance on unseen data than a single train-test split, because every sample is used for testing exactly once and the score is averaged over several splits.
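
To illustrate the point above, the following is a minimal sketch on synthetic data (generated with `make_classification`, an assumption for illustration; it is not the notebook's dataset). It compares the accuracy from several single 80/20 splits, each drawn with a different seed, against the mean of a 5-fold cross-validation. All variable names here are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: not the notebook's parquet file)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Accuracy from single 80/20 splits drawn with different seeds
single_split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed, stratify=y_demo
    )
    single_split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold CV: each sample lands in a test fold exactly once
cv_demo_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print("Single-split accuracies:", np.round(single_split_scores, 3))
print("5-fold CV mean accuracy:", round(float(cv_demo_scores.mean()), 3))
```

The single-split accuracies typically vary with the seed, whereas the cross-validated mean summarizes all folds in one number.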

In [6]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

import utils


df = utils.load_preprocessed_data('data/preprocessed_data2.parquet') # Overrides the default path 'data/preprocessed_data.parquet'

Loading preprocessed data from data/preprocessed_data2.parquet...
Data loaded successfully.
<class 'pandas.core.frame.DataFrame'>
Index: 3700 entries, 0 to 739
Data columns (total 55 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   BIB           3700 non-null   float64
 1   FAN           3700 non-null   float64
 2   LUK           3700 non-null   float64
 3   NUS           3700 non-null   float64
 4   SIS           3700 non-null   float64
 5   UIN           3700 non-null   float64
 6   WET           3700 non-null   float64
 7   COD_iii       3700 non-null   float64
 8   COD_rrr       3700 non-null   float64
 9   COD_uuu       3700 non-null   float64
 10  ERG_aaa       3700 non-null   float64
 11  ERG_missing   3700 non-null   float64
 12  ERG_nnn       3700 non-null   float64
 13  ERG_www       3700 non-null   float64
 14  GJAH_ii       3700 non-null   float64
 15  GJAH_iii      3700 non-null   float64
 16  GJAH_missing  3700 non-null 

In [7]:
# Separate train/test and features/target using utility function
X_train_scaled, y_train, X_test_scaled, y_test = utils.split_data_features_target(df)

# Convert target variables using utility function
y_train = utils.convert_target_variable(y_train)
y_test = utils.convert_target_variable(y_test)

Training features shape: (2960, 53), Training target shape: (2960,)
Test features shape: (740, 53), Test target shape: (740,)

Converting target variable 'Class' to numeric (n=0, y=1)...
Target variable converted.
Value counts:
 Class
1    2739
0     221
Name: count, dtype: int64

Converting target variable 'Class' to numeric (n=0, y=1)...
Target variable converted.
Value counts:
 Class
1    685
0     55
Name: count, dtype: int64


In [8]:
# Separate features (X) and target (y) from the entire dataset.
# Cross-validation handles the splitting internally, so the 'split' column is dropped rather than used.
# Note: CV is usually run on the training portion only. Here it runs on all 3700 rows
# (2960 train + 740 test), so the test split above is not an independent hold-out for these scores.
X = df.drop(['Class', 'split'], axis=1)  # Assumes a 'split' column exists; it is not a feature.
y_original = df['Class'] # Get the original target column

# Convert target variable using utility function
y = utils.convert_target_variable(y_original)

print(f"\nFeatures shape for CV: {X.shape}")
print(f"Target shape for CV: {y.shape}")


Converting target variable 'Class' to numeric (n=0, y=1)...
Target variable converted.
Value counts:
 Class
1    3424
0     276
Name: count, dtype: int64

Features shape for CV: (3700, 53)
Target shape for CV: (3700,)


# Perform Cross-Validation

We will use Stratified K-Fold cross-validation so that each fold keeps the same class proportions as the full dataset. This matters here because the data is imbalanced: 3424 samples belong to class 1 and only 276 to class 0 (about 92.5% vs 7.5%).
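
As a rough illustration of what stratification does, the sketch below builds a synthetic label vector with the class counts reported in this notebook (3424 vs 276) and counts the minority-class samples in each test fold for `StratifiedKFold` and for plain `KFold`. The feature matrix is a placeholder and the helper `minority_counts` is hypothetical; only the labels matter for this comparison.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels with the class counts reported in the notebook (assumption: order is irrelevant)
y_demo = np.array([1] * 3424 + [0] * 276)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels matter here

def minority_counts(splitter):
    """Return the number of class-0 samples in each test fold."""
    return [
        int((y_demo[test_idx] == 0).sum())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]

stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=42))

print("StratifiedKFold minority samples per fold:", stratified)
print("KFold minority samples per fold:          ", plain)
```

With stratification the minority count per fold stays nearly constant, while plain `KFold` leaves it to chance.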

In [9]:
# Instantiate the model
log_reg_cv = LogisticRegression(random_state=42, max_iter=1000)

# Define the cross-validation strategy
# Using StratifiedKFold for classification tasks, especially if the target is imbalanced
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score)
}

# Perform cross-validation for multiple scores
print("Performing 5-fold cross-validation...")
cv_scores = {}
for metric_name, scorer in scoring.items():
    scores = cross_val_score(log_reg_cv, X, y, cv=cv_strategy, scoring=scorer)
    cv_scores[metric_name] = scores
    print(f"Scores for {metric_name}: {scores}")
    print(f"Mean {metric_name}: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
    print("---")

print("Cross-validation complete.")

Performing 5-fold cross-validation...
Scores for accuracy: [0.96081081 0.96351351 0.96216216 0.96216216 0.97027027]
Mean accuracy: 0.9638 (+/- 0.0034)
---
Scores for precision: [0.96322489 0.96866097 0.96463932 0.9713056  0.9715505 ]
Mean precision: 0.9679 (+/- 0.0034)
---
Scores for recall: [0.99561404 0.99270073 0.99562044 0.98832117 0.99708029]
Mean recall: 0.9939 (+/- 0.0031)
---
Scores for f1: [0.97915169 0.98053353 0.97988506 0.97973951 0.98414986]
Mean f1: 0.9807 (+/- 0.0018)
---
Cross-validation complete.
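
The loop above calls `cross_val_score` once per metric, so the model is refit on every fold four times. A possible alternative is scikit-learn's `cross_validate`, which accepts several metrics and fits each fold only once. The sketch below uses synthetic, imbalanced stand-in data (an assumption; not the notebook's dataset) and the built-in scorer names, which use the default `zero_division` handling rather than `zero_division=0`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (assumption: roughly 10% class 0, 90% class 1)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.1, 0.9], random_state=42
)

model = LogisticRegression(random_state=42, max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ["accuracy", "precision", "recall", "f1"]

# One pass over the folds computes all four metrics
results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metric_names)

for name in metric_names:
    scores = results[f"test_{name}"]
    print(f"Mean {name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Because the same `cv` object with a fixed `random_state` produces the same folds, the per-fold scores should match those from a per-metric loop on the same data.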


# Interpretation

The results above show the performance for each of the 5 folds and the mean (+/- standard deviation) across the folds for accuracy, precision, recall, and F1-score.

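- **Mean Score:** The mean scores are high, but the dataset is heavily imbalanced: 3424 of 3700 samples (about 92.5%) belong to class 1. A classifier that always predicts class 1 would reach about 92.5% accuracy and 100% recall, so the mean accuracy of 0.9638 is only about 3.8 percentage points above that trivial baseline. Precision, recall, and F1 are computed for class 1 (the default positive label), so they say little about how well the minority class 0 is recognized.
- **Standard Deviation:** Indicates how much performance varies across folds. A lower standard deviation means more consistent performance. Here it is low (at most 0.0034 for every metric), so the estimate is stable with respect to how the data is split. This is consistent with good generalization but does not prove it, since the scores are dominated by the majority class.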
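
One way to put scores like these in context is to compare them against a majority-class baseline and to add metrics that reflect the minority class. The sketch below does this on synthetic data with a similar imbalance (an assumption; it does not use the notebook's dataset, and its numbers are not the notebook's results). `DummyClassifier(strategy="most_frequent")` always predicts the majority class; `balanced_accuracy_score` averages recall over both classes; the `recall_class_0` scorer is a hypothetical name for recall with `pos_label=0`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    make_scorer,
    recall_score,
)
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data with roughly 7.5% class 0 and 92.5% class 1 (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.075, 0.925], random_state=0
)

scoring = {
    "accuracy": make_scorer(accuracy_score),
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    # Recall measured on the minority class instead of the default positive label 1
    "recall_class_0": make_scorer(recall_score, pos_label=0, zero_division=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
}

summary = {}
for label, estimator in models.items():
    res = cross_validate(estimator, X_demo, y_demo, cv=cv, scoring=scoring)
    summary[label] = {m: float(res[f"test_{m}"].mean()) for m in scoring}
    print(label, {m: round(v, 4) for m, v in summary[label].items()})
```

The baseline reaches high accuracy while its balanced accuracy is 0.5 and its class-0 recall is 0, which is why accuracy alone is a weak indicator on imbalanced data.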