# Step 3: Model Evaluation with LazyClassifier

## Objective
Evaluate multiple classification algorithms using LazyClassifier to identify the best-performing models for USD/BRL price direction prediction.

## Process
1. Load engineered features from `data/processed/BRL_X_features.csv`
2. Split data into training and testing sets (time series split)
3. Run LazyClassifier to evaluate 30+ classification algorithms
4. Compare model performance across multiple metrics
5. Select top 3 models based on accuracy
6. Save model names for ensemble building in notebook 04
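
Step 2 can be illustrated with a minimal sketch of a chronological holdout. The helper `chronological_split` is hypothetical and not part of the notebook, which uses scikit-learn's `train_test_split` with `shuffle=False`. The sketch assumes the test count is rounded up, which reproduces the 3282/821 split reported later in the notebook.

```python
import math


def chronological_split(n_samples: int, test_size: float) -> tuple:
    """Return (train_idx, test_idx) with the most recent rows held out.

    Hypothetical helper for illustration; assumes the test count is
    rounded up, as the notebook's reported split sizes suggest.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return range(n_train), range(n_train, n_samples)


train_idx, test_idx = chronological_split(n_samples=4103, test_size=0.2)
print(len(train_idx), len(test_idx))

# Every training row precedes every test row, so no future data
# leaks into training.
assert max(train_idx) < min(test_idx)
```

Because the rows are ordered by date, slicing by position is equivalent to splitting by time: the model is always evaluated on data that comes after everything it was trained on.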

## Output
- Performance comparison of all classification models
- Top 3 models selected based on accuracy: stored in `TOP_3_MODELS` variable
- Model evaluation metrics saved to `data/processed/metrics/` directory

## Evaluation Criteria
**Primary Metric: Accuracy**
- Measures the percentage of correct predictions (both up and down directions)
- Most intuitive metric for balanced binary classification
- Suitable when both false positives and false negatives have equal cost

**Secondary Metrics Considered:**
- Balanced Accuracy: Adjusted for class imbalance
- ROC AUC: Ability to distinguish between classes
- F1 Score: Harmonic mean of precision and recall
- Time Taken: Model training efficiency

In [1]:
# Import required libraries
import os
import pandas as pd
import numpy as np
from datetime import datetime
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

print(f"Model evaluation started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Model evaluation started at: 2025-10-25 11:06:21


In [2]:
# Define configuration parameters
FEATURES_PATH = '../data/processed/BRL_X_features.csv'  # Input file from notebook 02
METRICS_PATH = '../data/processed/metrics/'              # Output directory for evaluation metrics
MODEL_RESULTS_FILE = 'lazyclassifier_results.csv'        # Filename for saving results

# Train/test split configuration
TEST_SIZE = 0.2        # 20% of data for testing (most recent data)
RANDOM_STATE = 42      # Has no effect while SHUFFLE is False; kept in case shuffling is enabled
SHUFFLE = False        # Critical: Do not shuffle time series data

# Model selection criteria
TOP_N_MODELS = 3       # Number of best models to select for ensemble
SELECTION_METRIC = 'Accuracy'  # Primary metric for model selection

print(f"Configuration:")
print(f"  Input: {FEATURES_PATH}")
print(f"  Output: {METRICS_PATH}")
print(f"  Test Size: {TEST_SIZE * 100}%")
print(f"  Shuffle: {SHUFFLE} (preserving time series order)")
print(f"  Top Models: {TOP_N_MODELS}")
print(f"  Selection Metric: {SELECTION_METRIC}")

Configuration:
  Input: ../data/processed/BRL_X_features.csv
  Output: ../data/processed/metrics/
  Test Size: 20.0%
  Shuffle: False (preserving time series order)
  Top Models: 3
  Selection Metric: Accuracy


In [3]:
# Load engineered features from previous notebook
df = pd.read_csv(FEATURES_PATH, index_col=0)

# Convert index to datetime for proper time series handling
df.index = pd.to_datetime(df.index)
df.index.name = 'Date'

# Display data information
print(f"Loaded {len(df)} records from {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}")
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Verify target distribution
print(f"\nTarget distribution:")
print(df['target'].value_counts())
print(f"\nClass balance:")
print(df['target'].value_counts(normalize=True))

print(f"\nFirst 5 rows:")
df.head()

Loaded 4103 records from 2010-01-21 to 2025-10-24
Dataset shape: (4103, 18)
Columns: ['target', 'mm_std6', 'std6', 'mm_std12', 'std12', 'RSL_6', 'RSL_12', 'v', 'a', 'm', 'f', 'T', 'cat', 'w', 'k', 'tau', 'M', 'g']

Target distribution:
target
1    2067
0    2036
Name: count, dtype: int64

Class balance:
target
1   0.50
0   0.50
Name: proportion, dtype: float64

First 5 rows:


Date,target,mm_std6,std6,mm_std12,std12,RSL_6,RSL_12,v,a,m,f,T,cat,w,k,tau,M,g
2010-01-21,1,0.01,0.0,0.0,0.0,0.22,1.09,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2010-01-22,0,0.01,0.0,0.0,0.0,0.96,1.43,-0.0,-0.0,0.03,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0
2010-01-25,1,0.0,0.01,0.0,0.01,-11.51,-10.05,-0.0,-0.0,0.01,-0.0,0.0,0.01,-0.0,0.0,-0.0,-0.0,0.0
2010-01-26,1,0.01,0.02,0.0,0.01,3.24,4.83,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2010-01-27,1,0.01,0.02,0.01,0.01,0.24,0.56,0.0,-0.0,0.04,-0.0,0.0,0.01,-0.0,0.0,-0.0,0.0,0.0


In [4]:
# Separate features and target variable
# X: All columns except 'target' (17 engineered features)
# y: Target variable (binary: 0=price down, 1=price up)

X = df.drop(columns=['target'])
y = df['target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns ({X.shape[1]} total):")
for i, col in enumerate(X.columns, 1):
    print(f"  {i}. {col}")

Features shape: (4103, 17)
Target shape: (4103,)

Feature columns (17 total):
  1. mm_std6
  2. std6
  3. mm_std12
  4. std12
  5. RSL_6
  6. RSL_12
  7. v
  8. a
  9. m
  10. f
  11. T
  12. cat
  13. w
  14. k
  15. tau
  16. M
  17. g


In [5]:
# Split data into training and testing sets
# CRITICAL: shuffle=False to preserve time series order
# Training set: Earlier time periods (80% of data)
# Testing set: Most recent time periods (20% of data)
# This simulates real-world scenario where we predict future based on past

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=TEST_SIZE, 
    random_state=RANDOM_STATE, 
    shuffle=SHUFFLE
)

# Display split information
print(f"Train/Test Split Summary:")
print(f"  Total samples: {len(X)}")
print(f"  Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"  Testing samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

# Show date ranges for each set (time series context)
train_start = X_train.index.min().strftime('%Y-%m-%d')
train_end = X_train.index.max().strftime('%Y-%m-%d')
test_start = X_test.index.min().strftime('%Y-%m-%d')
test_end = X_test.index.max().strftime('%Y-%m-%d')

print(f"\nTraining period: {train_start} to {train_end}")
print(f"Testing period: {test_start} to {test_end}")

# Verify class distribution in both sets
print(f"\nClass distribution in training set:")
print(y_train.value_counts())
print(y_train.value_counts(normalize=True))

print(f"\nClass distribution in testing set:")
print(y_test.value_counts())
print(y_test.value_counts(normalize=True))

Train/Test Split Summary:
  Total samples: 4103
  Training samples: 3282 (80.0%)
  Testing samples: 821 (20.0%)

Training period: 2010-01-21 to 2022-08-29
Testing period: 2022-08-30 to 2025-10-24

Class distribution in training set:
target
1    1672
0    1610
Name: count, dtype: int64
target
1   0.51
0   0.49
Name: proportion, dtype: float64

Class distribution in testing set:
target
0    426
1    395
Name: count, dtype: int64
target
0   0.52
1   0.48
Name: proportion, dtype: float64


In [6]:
# Run LazyClassifier to evaluate multiple classification algorithms
# LazyClassifier automatically trains and evaluates 30+ different models
# It provides a quick comparison to identify the best performing algorithms

print("Running LazyClassifier evaluation...")
print("This may take several minutes to train and evaluate all models.\n")

clf = LazyClassifier(
    verbose=1,                      # Show progress during execution
    ignore_warnings=True,           # Suppress sklearn warnings for cleaner output
    custom_metric=accuracy_score    # Adds an 'accuracy_score' column; it duplicates the built-in Accuracy
)

# Train all models and get predictions
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

print("\nLazyClassifier evaluation completed!")
print(f"Total models evaluated: {len(models)}")

Running LazyClassifier evaluation...
This may take several minutes to train and evaluate all models.



  0%|          | 0/32 [00:00<?, ?it/s]

{'Model': 'AdaBoostClassifier', 'Accuracy': 0.5164433617539586, 'Balanced Accuracy': 0.507785107268081, 'ROC AUC': 0.507785107268081, 'F1 Score': 0.4894588483697956, 'accuracy_score': 0.5164433617539586, 'Time taken': 0.4217972755432129}
{'Model': 'BaggingClassifier', 'Accuracy': 0.5298416565164433, 'Balanced Accuracy': 0.5267754204552207, 'ROC AUC': 0.5267754204552207, 'F1 Score': 0.5268000656788605, 'accuracy_score': 0.5298416565164433, 'Time taken': 0.49164462089538574}
{'Model': 'BernoulliNB', 'Accuracy': 0.5030450669914738, 'Balanced Accuracy': 0.502243418315802, 'ROC AUC': 0.502243418315802, 'F1 Score': 0.5029963279646509, 'accuracy_score': 0.5030450669914738, 'Time taken': 0.0169677734375}
{'Model': 'CalibratedClassifierCV', 'Accuracy': 0.5334957369062119, 'Balanced Accuracy': 0.5411659832412195, 'ROC AUC': 0.5411659832412195, 'F1 Score': 0.5140562991072148, 'accuracy_score': 0.5334957369062119, 'Time taken': 0.05190086364746094}
{'Model': 'DecisionTreeClassifier', 'Accuracy': 0

In [7]:
# Display complete model evaluation results
# LazyClassifier sorts results by Balanced Accuracy (descending), so the order below is not strictly by Accuracy

print("Complete Model Evaluation Results:")
print("="*80)
models

Complete Model Evaluation Results:


Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,accuracy_score,Time Taken
LinearDiscriminantAnalysis,0.54,0.54,0.54,0.53,0.54,0.02
LinearSVC,0.54,0.54,0.54,0.53,0.54,0.02
LogisticRegression,0.54,0.54,0.54,0.53,0.54,0.02
RidgeClassifier,0.54,0.54,0.54,0.53,0.54,0.02
RidgeClassifierCV,0.54,0.54,0.54,0.53,0.54,0.0
QuadraticDiscriminantAnalysis,0.54,0.54,0.54,0.54,0.54,0.02
CalibratedClassifierCV,0.53,0.54,0.54,0.51,0.53,0.05
KNeighborsClassifier,0.54,0.54,0.54,0.54,0.54,1.82
SGDClassifier,0.53,0.53,0.53,0.52,0.53,0.02
RandomForestClassifier,0.53,0.53,0.53,0.53,0.53,1.28


## Model Selection Criteria

### Why Accuracy as Primary Metric?

**Accuracy** measures the proportion of correct predictions out of all predictions made:
```
Accuracy = (True Positives + True Negatives) / Total Predictions
```

**Reasons for choosing Accuracy:**
1. **Balanced Classes**: Our target variable has relatively balanced distribution (verified in previous cells)
2. **Equal Cost**: Both types of errors have similar consequences in price direction prediction:
   - False Positive (predicting UP when actually DOWN): Potential loss from wrong position
   - False Negative (predicting DOWN when actually UP): Missed opportunity
3. **Interpretability**: Accuracy is intuitive and easy to communicate to stakeholders
4. **Trading Context**: In forex trading, correctly predicting direction (regardless of up or down) is equally valuable
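
To make the accuracy formula concrete, the sketch below computes accuracy from confusion-matrix counts and compares the best reported test accuracy with a majority-class baseline. The function names and the toy counts are illustrative assumptions; the test-set class counts (426 down, 395 up) and the 0.5420 accuracy are taken from the notebook output.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def majority_baseline(class_counts: dict) -> float:
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Toy confusion-matrix counts, for illustration only.
toy_accuracy = accuracy(tp=3, tn=4, fp=2, fn=1)

# Test-set class counts reported by the notebook (0 = down, 1 = up).
baseline = majority_baseline({0: 426, 1: 395})
best_reported_accuracy = 0.5420  # QuadraticDiscriminantAnalysis

print(f"Toy accuracy:            {toy_accuracy:.4f}")
print(f"Majority-class baseline: {baseline:.4f}")
print(f"Edge over baseline:      {best_reported_accuracy - baseline:.4f}")
```

With nearly balanced classes the baseline sits close to 50%, which is what makes accuracy a reasonable headline metric here. The same comparison shows how much of a model's accuracy is an edge over simply guessing the majority class.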

### Secondary Metrics Considered:

- **Balanced Accuracy**: Adjusts for any remaining class imbalance, useful validation metric
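- **ROC AUC**: Measures the model's ability to distinguish between classes. In the LazyClassifier output above it equals Balanced Accuracy for every model shown, which is what happens when ROC AUC is computed from hard class predictions rather than scores, so it adds no independent information here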
- **F1 Score**: Harmonic mean of precision and recall, useful for understanding model balance
- **Time Taken**: Important for production deployment and real-time predictions
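
The secondary metrics can be computed directly with scikit-learn, as in the sketch below. The labels, predictions, and scores are toy values chosen for illustration, not notebook data. The example also shows that ROC AUC computed from hard 0/1 predictions coincides with Balanced Accuracy in the binary case, while ROC AUC computed from continuous scores generally does not.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    roc_auc_score,
)

# Toy labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
# Hypothetical predicted probabilities for class 1 (threshold 0.5 gives y_pred).
y_score = [0.1, 0.3, 0.4, 0.6, 0.9, 0.8, 0.7, 0.6, 0.45, 0.2]

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
f1 = f1_score(y_true, y_pred)                      # binary F1 for class 1
auc_from_labels = roc_auc_score(y_true, y_pred)    # hard 0/1 predictions
auc_from_scores = roc_auc_score(y_true, y_score)   # continuous scores

print(f"Accuracy:           {acc:.4f}")
print(f"Balanced Accuracy:  {bal_acc:.4f}")
print(f"F1 Score:           {f1:.4f}")
print(f"ROC AUC (labels):   {auc_from_labels:.4f}")
print(f"ROC AUC (scores):   {auc_from_scores:.4f}")
```

If threshold-independent ranking quality matters, passing predicted probabilities or decision-function values to `roc_auc_score` is the more informative choice.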

### Selection Process:
We select the **top 3 models** by Accuracy and pass them to an ensemble because:
1. Different algorithms can capture different patterns, which gives the ensemble some diversity
2. Combining several models can reduce variance, and with it overfitting risk
3. An ensemble of strong models often, though not always, outperforms the single best model
4. Three models balance performance with computational efficiency
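
The intuition behind combining models can be sketched with a simple majority vote. This is an illustration of the averaging idea only, not the stacking approach planned for notebook 04, and the predictions are hypothetical values constructed so that the three models' errors do not overlap.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0])

# Hypothetical predictions from three models whose errors do not overlap.
preds = np.array([
    [1, 0, 1, 0, 0, 1],  # model A: wrong on samples 3 and 5
    [1, 1, 1, 1, 0, 0],  # model B: wrong on sample 1
    [0, 0, 1, 1, 1, 0],  # model C: wrong on samples 0 and 4
])

# Majority vote: predict 1 when at least two of the three models do.
vote = (preds.sum(axis=0) >= 2).astype(int)

individual_acc = (preds == y_true).mean(axis=1)
ensemble_acc = (vote == y_true).mean()

print("Individual accuracies:", np.round(individual_acc, 4))
print("Majority-vote accuracy:", round(float(ensemble_acc), 4))
```

The gain shrinks or disappears when the models make the same mistakes, which is why diversity among the selected models matters as much as their individual accuracy.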

In [8]:
# Select Top 3 models based on Accuracy
# Store model names in TOP_3_MODELS variable for use in notebook 04

# Sort models by Accuracy in descending order and select top 3
top_models_df = models.sort_values(by=SELECTION_METRIC, ascending=False).head(TOP_N_MODELS)

# Extract model names (index of the dataframe)
TOP_3_MODELS = list(top_models_df.index)

print(f"Top {TOP_N_MODELS} Models Selected (Based on {SELECTION_METRIC}):")
print("="*80)
for rank, model_name in enumerate(TOP_3_MODELS, 1):
    accuracy = top_models_df.loc[model_name, 'Accuracy']
    balanced_acc = top_models_df.loc[model_name, 'Balanced Accuracy']
    roc_auc = top_models_df.loc[model_name, 'ROC AUC']
    f1 = top_models_df.loc[model_name, 'F1 Score']
    time_taken = top_models_df.loc[model_name, 'Time Taken']
    
    print(f"\n{rank}. {model_name}")
    print(f"   Accuracy:          {accuracy:.4f}")
    print(f"   Balanced Accuracy: {balanced_acc:.4f}")
    print(f"   ROC AUC:           {roc_auc:.4f}")
    print(f"   F1 Score:          {f1:.4f}")
    print(f"   Time Taken:        {time_taken:.2f}s")

print("\n" + "="*80)
print(f"\nVariable 'TOP_3_MODELS' created with selected model names:")
print(f"TOP_3_MODELS = {TOP_3_MODELS}")
print(f"\nThis variable will be used in notebook 04 for ensemble building.")

Top 3 Models Selected (Based on Accuracy):

1. QuadraticDiscriminantAnalysis
   Accuracy:          0.5420
   Balanced Accuracy: 0.5419
   ROC AUC:           0.5419
   F1 Score:          0.5422
   Time Taken:        0.02s

2. LinearDiscriminantAnalysis
   Accuracy:          0.5396
   Balanced Accuracy: 0.5449
   ROC AUC:           0.5449
   F1 Score:          0.5307
   Time Taken:        0.02s

3. LinearSVC
   Accuracy:          0.5384
   Balanced Accuracy: 0.5437
   ROC AUC:           0.5437
   F1 Score:          0.5293
   Time Taken:        0.02s


Variable 'TOP_3_MODELS' created with selected model names:
TOP_3_MODELS = ['QuadraticDiscriminantAnalysis', 'LinearDiscriminantAnalysis', 'LinearSVC']

This variable will be used in notebook 04 for ensemble building.


In [9]:
# Display detailed comparison of Top 3 models
# Compare all metrics side-by-side for better understanding

print("Detailed Comparison of Top 3 Selected Models:")
print("="*80)
top_models_df

Detailed Comparison of Top 3 Selected Models:


Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,accuracy_score,Time Taken
QuadraticDiscriminantAnalysis,0.54,0.54,0.54,0.54,0.54,0.02
LinearDiscriminantAnalysis,0.54,0.54,0.54,0.53,0.54,0.02
LinearSVC,0.54,0.54,0.54,0.53,0.54,0.02


In [10]:
# Save evaluation results for documentation and future reference
# Create metrics directory if it doesn't exist
os.makedirs(METRICS_PATH, exist_ok=True)

# Save complete results
complete_results_path = os.path.join(METRICS_PATH, MODEL_RESULTS_FILE)
models.to_csv(complete_results_path)

# Save top 3 models selection
top_models_path = os.path.join(METRICS_PATH, 'top_3_models.csv')
top_models_df.to_csv(top_models_path)

# Save model names to text file for easy reference
model_names_path = os.path.join(METRICS_PATH, 'top_3_model_names.txt')
with open(model_names_path, 'w') as f:
    f.write(f"Top {TOP_N_MODELS} Models Selected Based on {SELECTION_METRIC}\n")
    f.write(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    for rank, model_name in enumerate(TOP_3_MODELS, 1):
        f.write(f"{rank}. {model_name}\n")

print(f"Results saved successfully:")
print(f"  All models: {complete_results_path}")
print(f"  Top 3 models: {top_models_path}")
print(f"  Model names: {model_names_path}")

Results saved successfully:
  All models: ../data/processed/metrics/lazyclassifier_results.csv
  Top 3 models: ../data/processed/metrics/top_3_models.csv
  Model names: ../data/processed/metrics/top_3_model_names.txt


## Summary

Model evaluation completed successfully using LazyClassifier:
- Evaluated 30+ classification algorithms on engineered features
- Used time series split (shuffle=False) to preserve temporal order
- Training set: Earlier 80% of data for model training
- Testing set: Most recent 20% of data for validation

**Top 3 Models Selected Based on Accuracy:**
The `TOP_3_MODELS` variable contains the names of the best performing models, which will be used as base learners in the stacking ensemble (notebook 04).

**Selection Rationale:**
- Accuracy chosen as primary metric due to balanced classes and equal error costs
- Top 3 models provide diversity for ensemble methods
- Combining multiple models can reduce overfitting risk
- All results saved to `data/processed/metrics/` for documentation

**Key Findings:**
- All evaluation metrics (Accuracy, Balanced Accuracy, ROC AUC, F1) considered
- Model training times recorded for production deployment planning
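- The best models reach about 54% accuracy on the test set, against a majority-class baseline of about 52% (426 of 821 test samples are class 0); this is a small edge, and its statistical significance has not been tested here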

## Next Steps
Proceed to `04_stacking_ensemble.ipynb` to:
- Load the `TOP_3_MODELS` selected in this notebook
- Build a stacking ensemble classifier using the top models as base learners
- Add a meta-learner to combine predictions
- Evaluate ensemble performance vs individual models
- Compare ensemble to baseline strategies
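
As a rough preview of the stacking step listed above, the sketch below builds a scikit-learn `StackingClassifier` from the three selected model names. Several parts are assumptions made for illustration: the `MODEL_REGISTRY` lookup is hypothetical, `LogisticRegression` is an assumed meta-learner, and the data is synthetic rather than the engineered USD/BRL features. Notebook 04 may be organised differently.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Model names as selected in this notebook.
TOP_3_MODELS = ["QuadraticDiscriminantAnalysis", "LinearDiscriminantAnalysis", "LinearSVC"]

# Hypothetical lookup from saved model names to estimator classes.
MODEL_REGISTRY = {
    "QuadraticDiscriminantAnalysis": QuadraticDiscriminantAnalysis,
    "LinearDiscriminantAnalysis": LinearDiscriminantAnalysis,
    "LinearSVC": LinearSVC,
}

# Synthetic stand-in for the 17 engineered features (not the real data).
X, y = make_classification(
    n_samples=400, n_features=17, n_informative=5, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

base_learners = [(name, MODEL_REGISTRY[name]()) for name in TOP_3_MODELS]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # assumed meta-learner
    cv=5,  # plain K-fold for brevity; it does not respect time order
)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
print(f"Stacking accuracy on synthetic data: {accuracy_score(y_test, y_pred):.4f}")
```

For real time series data, how the meta-learner's training predictions are generated deserves care, since ordinary K-fold lets base learners train on observations that come after the ones they predict.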