# Phase 2: Model Training & Evaluation
## Feature Selection, Model Complexity, and Data Comparison

**Goal:** Compare several classifiers and analyze feature importance for facial expression recognition.

**Models Explored:**
1. Logistic Regression (with StandardScaler and class_weight='balanced')
2. Random Forest Classifier
3. Decision Tree Classifier

**Analysis Goals:**
1. Study the importance of feature subsets (e.g., facial points related to the mouth, eyes, etc.) using Random Forest
2. Plot training and test accuracy as a function of model complexity (n_estimators for Random Forest)
3. Compare models trained on 'geometric' data with 'motion' data
4. Measure training and prediction time for all models

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
from time import perf_counter as time  # high-resolution clock, better suited to millisecond timings

## 1. Load Data

Load both motion and geometric datasets for comparison.

In [None]:
# Load motion data
data_motion = pd.read_csv('data_motion.csv')
print("Motion dataset shape:", data_motion.shape)

# Load geometric data
data_geometric = pd.read_csv('data_geometric.csv')
print("Geometric dataset shape:", data_geometric.shape)

# Display first few rows
print("\nMotion data preview:")
print(data_motion.head())

## 2. Prepare Data

Split features and labels, then create train/test splits.

In [None]:
# Prepare motion data
feature_cols = [col for col in data_motion.columns if col.startswith('x') or col.startswith('y')]
X_motion = data_motion[feature_cols]
y_motion = data_motion['Emotion']

# Prepare geometric data
X_geometric = data_geometric[feature_cols]
y_geometric = data_geometric['Emotion']

# Split motion data
X_train_motion, X_test_motion, y_train_motion, y_test_motion = train_test_split(
    X_motion, y_motion, test_size=0.2, random_state=42, stratify=y_motion
)

# Split geometric data
X_train_geometric, X_test_geometric, y_train_geometric, y_test_geometric = train_test_split(
    X_geometric, y_geometric, test_size=0.2, random_state=42, stratify=y_geometric
)

print(f"Motion training set: {X_train_motion.shape}")
print(f"Motion test set: {X_test_motion.shape}")
print(f"\nClass distribution in training set:")
print(y_train_motion.value_counts().sort_index())
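
The prefix test used to build `feature_cols` would also pick up any non-landmark column whose name happens to start with `x` or `y`. A stricter selection is sketched below on two small hypothetical frames (the column names `x0`, `y0`, and `year` are made up for illustration, and the real CSV layout is an assumption). It keeps only names of the form `x<digits>` or `y<digits>` and checks that the geometric data exposes the same columns before they are reused.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the two CSV files; real column names may differ.
demo_motion = pd.DataFrame({"x0": [0.1], "y0": [0.2], "year": [2020], "Emotion": ["happy"]})
demo_geometric = pd.DataFrame({"x0": [10.0], "y0": [20.0], "Emotion": ["happy"]})

landmark_pattern = re.compile(r"^[xy]\d+$")

def landmark_columns(df):
    """Return columns named like x<digits> or y<digits>, in their original order."""
    return [col for col in df.columns if landmark_pattern.match(col)]

feature_cols_demo = landmark_columns(demo_motion)
print(feature_cols_demo)  # ['x0', 'y0']  ('year' is excluded)

# Fail early if the second dataset does not contain the same landmark columns.
missing = set(feature_cols_demo) - set(demo_geometric.columns)
if missing:
    raise ValueError(f"Geometric data lacks columns: {sorted(missing)}")
```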

## 3. Logistic Regression

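Train Logistic Regression on standardized features. StandardScaler matters here because the motion features are small decimal values and the model's regularization penalty treats all coefficients on the same scale. class_weight='balanced' reweights each class inversely to its frequency to address class imbalance.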
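
To make the effect of `class_weight='balanced'` concrete, the sketch below computes the weights scikit-learn derives for a made-up, imbalanced label vector (the labels are illustrative, not taken from this dataset). The heuristic is `n_samples / (n_classes * count_per_class)`, so rarer classes receive larger weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up, imbalanced labels purely for illustration.
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
classes = np.unique(y_demo)

# The 'balanced' heuristic written out by hand.
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))

# The same quantity as computed by scikit-learn.
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

for cls, weight in zip(classes, sklearn_weights):
    print(f"class {cls}: weight {weight:.3f}")
```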

In [None]:
# Scale the motion data (important for Logistic Regression)
scaler_motion = StandardScaler()
X_train_motion_scaled = scaler_motion.fit_transform(X_train_motion)
X_test_motion_scaled = scaler_motion.transform(X_test_motion)

# Train Logistic Regression on motion data
print("Training Logistic Regression on motion data...")
start_time = time()
lr_motion = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
lr_motion.fit(X_train_motion_scaled, y_train_motion)
train_time_lr_motion = time() - start_time

# Make predictions and measure time
start_time = time()
y_pred_lr_motion = lr_motion.predict(X_test_motion_scaled)
predict_time_lr_motion = (time() - start_time) * 1000  # Convert to milliseconds

# Calculate accuracy
accuracy_lr_motion = accuracy_score(y_test_motion, y_pred_lr_motion)

print(f"\nLogistic Regression (Motion Data):")
print(f"  Training time: {train_time_lr_motion:.2f} seconds")
print(f"  Prediction time: {predict_time_lr_motion:.2f} milliseconds")
print(f"  Accuracy: {accuracy_lr_motion:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test_motion, y_pred_lr_motion))
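
The cell above fits the scaler on the training split only, which avoids leaking test statistics into training. One way to keep that guarantee automatically, for example during cross-validation, is to wrap the scaler and the classifier in a pipeline. The following is a minimal sketch on synthetic data from `make_classification` (a stand-in, not the landmark dataset); the fold count and data shape are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the landmark features (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42,
)

# The scaler is re-fitted inside every training fold, never on held-out data.
lr_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(lr_pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```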

## 4. Random Forest Classifier

Train Random Forest and analyze feature importance.

In [None]:
# Train Random Forest on motion data
print("Training Random Forest on motion data...")
start_time = time()
rf_motion = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_motion.fit(X_train_motion, y_train_motion)
train_time_rf_motion = time() - start_time

# Make predictions and measure time
start_time = time()
y_pred_rf_motion = rf_motion.predict(X_test_motion)
predict_time_rf_motion = (time() - start_time) * 1000  # Convert to milliseconds

# Calculate accuracy
accuracy_rf_motion = accuracy_score(y_test_motion, y_pred_rf_motion)

print(f"\nRandom Forest (Motion Data):")
print(f"  Training time: {train_time_rf_motion:.2f} seconds")
print(f"  Prediction time: {predict_time_rf_motion:.2f} milliseconds")
print(f"  Accuracy: {accuracy_rf_motion:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test_motion, y_pred_rf_motion))
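
Unlike the Logistic Regression, the Random Forest above is trained without class weighting, so plain accuracy may hide poor recall on rare emotions. The sketch below uses made-up labels (not results from this notebook) to show how a classifier that always predicts the majority class reaches 0.90 accuracy but only 0.50 balanced accuracy. Reporting balanced accuracy or macro F1 alongside accuracy is one possible safeguard.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: 90 samples of class 0 and 10 samples of class 1.
y_true = np.array([0] * 90 + [1] * 10)
# A degenerate classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}, macro F1={macro_f1:.2f}")
```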

## 5. Decision Tree Classifier

Train Decision Tree for comparison with other models.

In [None]:
# Train Decision Tree on motion data
print("Training Decision Tree on motion data...")
start_time = time()
dt_motion = DecisionTreeClassifier(random_state=42)
dt_motion.fit(X_train_motion, y_train_motion)
train_time_dt_motion = time() - start_time

# Make predictions and measure time
start_time = time()
y_pred_dt_motion = dt_motion.predict(X_test_motion)
predict_time_dt_motion = (time() - start_time) * 1000  # Convert to milliseconds

# Calculate accuracy
accuracy_dt_motion = accuracy_score(y_test_motion, y_pred_dt_motion)

print(f"\nDecision Tree (Motion Data):")
print(f"  Training time: {train_time_dt_motion:.2f} seconds")
print(f"  Prediction time: {predict_time_dt_motion:.2f} milliseconds")
print(f"  Accuracy: {accuracy_dt_motion:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test_motion, y_pred_dt_motion))
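
The three training cells repeat the same fit, predict, and timing steps. A small helper can hold that pattern in one place; the sketch below is one possible shape for it. The function name `time_and_score` and the result keys are hypothetical, and the data is synthetic so the block runs on its own.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def time_and_score(model, X_train, y_train, X_test, y_test):
    """Fit `model`, then return accuracy plus fit and predict wall-clock times."""
    start = perf_counter()
    model.fit(X_train, y_train)
    train_seconds = perf_counter() - start

    start = perf_counter()
    y_pred = model.predict(X_test)
    predict_ms = (perf_counter() - start) * 1000  # seconds to milliseconds

    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "train_time_s": train_seconds,
        "predict_time_ms": predict_ms,
    }

# Synthetic data so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
result = time_and_score(DecisionTreeClassifier(random_state=42), X_tr, y_tr, X_te, y_te)
print(result)
```

A single measurement like this is noisy; repeating the call a few times and taking the median would give steadier numbers.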

## 6. Feature Importance Analysis

Use Random Forest feature importance to identify critical facial points.

Facial landmark points:
- Points 37-48: Eye region (12 points total for both eyes)
- Points 49-68: Mouth region (20 points)
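
The region boundaries above can be captured in a small lookup function, which also makes it easy to report mean importance per feature next to the regional total (the regions differ in size, so totals alone favor larger regions). This sketch assumes 1-based point numbers and feature names of the form `x<point>` / `y<point>`; the uniform importances are placeholders, not results.

```python
import pandas as pd

def landmark_region(point):
    """Map a 1-based landmark index to the regions listed above."""
    if 37 <= point <= 48:
        return "eyes"
    if 49 <= point <= 68:
        return "mouth"
    return "other"

# Hypothetical feature names with uniform placeholder importances (not real results).
features = [f"{axis}{i}" for i in range(1, 69) for axis in ("x", "y")]
demo = pd.DataFrame({"feature": features, "importance": 1.0 / len(features)})
demo["point"] = demo["feature"].str.extract(r"(\d+)", expand=False).astype(int)
demo["region"] = demo["point"].map(landmark_region)

# Total and per-feature mean importance, plus the number of features per region.
summary = demo.groupby("region")["importance"].agg(
    total="sum", mean="mean", n_features="count"
)
print(summary)
```

If the columns turn out to be 0-based (for example `x0` to `x67`), the ranges in `landmark_region` would need to shift down by one.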

In [None]:
# Get feature importances from Random Forest
feature_importance = rf_motion.feature_importances_

# Create a dataframe with feature names and importance
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': feature_importance
})

# Calculate importance by facial region
# Extract point numbers from feature names (e.g., 'x37' -> 37)
# expand=False returns a Series rather than a one-column DataFrame
importance_df['point'] = importance_df['feature'].str.extract(r'(\d+)', expand=False).astype(int)

# Group by regions
eye_points = importance_df[importance_df['point'].between(37, 48)]
mouth_points = importance_df[importance_df['point'].between(49, 68)]
other_points = importance_df[~importance_df['point'].between(37, 68)]

# Calculate total importance per region
eye_importance = eye_points['importance'].sum()
mouth_importance = mouth_points['importance'].sum()
other_importance = other_points['importance'].sum()

print("Feature Importance by Facial Region:")
print(f"  Eye region (points 37-48): {eye_importance:.4f}")
print(f"  Mouth region (points 49-68): {mouth_importance:.4f}")
print(f"  Other regions: {other_importance:.4f}")

# Plot top 20 most important features
top_features = importance_df.nlargest(20, 'importance')
plt.figure(figsize=(10, 6))
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.gca().invert_yaxis()  # put the most important feature at the top
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Top 20 Most Important Features (Random Forest)')
plt.tight_layout()
plt.show()

# Bar chart of regional importance
plt.figure(figsize=(8, 5))
regions = ['Eyes (37-48)', 'Mouth (49-68)', 'Other']
importances = [eye_importance, mouth_importance, other_importance]
plt.bar(regions, importances)
plt.ylabel('Total Importance')
plt.title('Feature Importance by Facial Region')
plt.tight_layout()
plt.show()
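
`feature_importances_` is impurity-based and computed from the training data, so it can be misleading, for instance when features are strongly correlated. Permutation importance on held-out data is a common cross-check. The sketch below applies it to a synthetic dataset (a stand-in for the landmark features); the number of repeats and the forest size are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the landmark dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and record the drop in score.
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=42)

ranked = perm.importances_mean.argsort()[::-1]
for idx in ranked[:5]:
    print(f"feature {idx}: {perm.importances_mean[idx]:.4f} "
          f"+/- {perm.importances_std[idx]:.4f}")
```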

## 7. Learning Curves: Model Complexity Study

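Study how Random Forest training and test accuracy change with n_estimators. Because the hyperparameter is varied rather than the training-set size, this is strictly a validation (model-complexity) curve rather than a learning curve.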

In [None]:
# Study impact of n_estimators on Random Forest accuracy
n_estimators_values = [10, 25, 50, 75, 100, 150, 200]
train_accuracies = []
test_accuracies = []

print("Evaluating Random Forest with different n_estimators...")
for n_est in n_estimators_values:
    rf = RandomForestClassifier(n_estimators=n_est, random_state=42, n_jobs=-1)
    rf.fit(X_train_motion, y_train_motion)
    
    train_acc = accuracy_score(y_train_motion, rf.predict(X_train_motion))
    test_acc = accuracy_score(y_test_motion, rf.predict(X_test_motion))
    
    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)
    print(f"  n_estimators={n_est}: Train={train_acc:.4f}, Test={test_acc:.4f}")

# Plot training and test accuracy against n_estimators
plt.figure(figsize=(10, 6))
plt.plot(n_estimators_values, train_accuracies, marker='o', label='Training Accuracy')
plt.plot(n_estimators_values, test_accuracies, marker='s', label='Test Accuracy')
plt.xlabel('Number of Estimators (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Random Forest: Accuracy vs Model Complexity')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
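
The loop above scores every setting on the test set, so choosing n_estimators from that plot would tune the model on test data. A cross-validated alternative is sketched below with scikit-learn's `validation_curve`, run on synthetic data with a shortened parameter range to keep it fast; both choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data, not the landmark dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=42
)

param_range = [10, 25, 50]
# Each row of the returned arrays holds the per-fold scores for one setting.
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X_demo, y_demo,
    param_name="n_estimators",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

for n_est, tr, va in zip(param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n_estimators={n_est}: train={tr:.3f}, validation={va:.3f}")
```

With this approach the test split stays untouched until the final evaluation.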

## 8. Training Models on Geometric Data

Train all three models on geometric data for comparison.

In [None]:
# Scale geometric data for Logistic Regression
scaler_geometric = StandardScaler()
X_train_geometric_scaled = scaler_geometric.fit_transform(X_train_geometric)
X_test_geometric_scaled = scaler_geometric.transform(X_test_geometric)

# Logistic Regression on geometric data
print("Training Logistic Regression on geometric data...")
start_time = time()
lr_geometric = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
lr_geometric.fit(X_train_geometric_scaled, y_train_geometric)
train_time_lr_geo = time() - start_time

start_time = time()
y_pred_lr_geo = lr_geometric.predict(X_test_geometric_scaled)
predict_time_lr_geo = (time() - start_time) * 1000
accuracy_lr_geo = accuracy_score(y_test_geometric, y_pred_lr_geo)

print(f"  Training time: {train_time_lr_geo:.2f} seconds")
print(f"  Prediction time: {predict_time_lr_geo:.2f} milliseconds")
print(f"  Accuracy: {accuracy_lr_geo:.4f}")

# Random Forest on geometric data
print("\nTraining Random Forest on geometric data...")
start_time = time()
rf_geometric = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_geometric.fit(X_train_geometric, y_train_geometric)
train_time_rf_geo = time() - start_time

start_time = time()
y_pred_rf_geo = rf_geometric.predict(X_test_geometric)
predict_time_rf_geo = (time() - start_time) * 1000
accuracy_rf_geo = accuracy_score(y_test_geometric, y_pred_rf_geo)

print(f"  Training time: {train_time_rf_geo:.2f} seconds")
print(f"  Prediction time: {predict_time_rf_geo:.2f} milliseconds")
print(f"  Accuracy: {accuracy_rf_geo:.4f}")

# Decision Tree on geometric data
print("\nTraining Decision Tree on geometric data...")
start_time = time()
dt_geometric = DecisionTreeClassifier(random_state=42)
dt_geometric.fit(X_train_geometric, y_train_geometric)
train_time_dt_geo = time() - start_time

start_time = time()
y_pred_dt_geo = dt_geometric.predict(X_test_geometric)
predict_time_dt_geo = (time() - start_time) * 1000
accuracy_dt_geo = accuracy_score(y_test_geometric, y_pred_dt_geo)

print(f"  Training time: {train_time_dt_geo:.2f} seconds")
print(f"  Prediction time: {predict_time_dt_geo:.2f} milliseconds")
print(f"  Accuracy: {accuracy_dt_geo:.4f}")

## 9. Comparison Table: Motion vs Geometric Data

Compare the performance of all models on both datasets.

In [None]:
# Create comparison dataframe
comparison_data = {
    'Model': ['Logistic Regression', 'Random Forest', 'Decision Tree'] * 2,
    'Dataset': ['Motion'] * 3 + ['Geometric'] * 3,
    'Accuracy': [
        accuracy_lr_motion, accuracy_rf_motion, accuracy_dt_motion,
        accuracy_lr_geo, accuracy_rf_geo, accuracy_dt_geo
    ],
    'Train Time (s)': [
        train_time_lr_motion, train_time_rf_motion, train_time_dt_motion,
        train_time_lr_geo, train_time_rf_geo, train_time_dt_geo
    ],
    'Predict Time (ms)': [
        predict_time_lr_motion, predict_time_rf_motion, predict_time_dt_motion,
        predict_time_lr_geo, predict_time_rf_geo, predict_time_dt_geo
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("\nComparison Table: Motion vs Geometric Data")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)

# Visualization of accuracy comparison
plt.figure(figsize=(10, 6))
motion_accuracies = [accuracy_lr_motion, accuracy_rf_motion, accuracy_dt_motion]
geometric_accuracies = [accuracy_lr_geo, accuracy_rf_geo, accuracy_dt_geo]
models = ['Logistic\nRegression', 'Random\nForest', 'Decision\nTree']

x = np.arange(len(models))
width = 0.35

plt.bar(x - width/2, motion_accuracies, width, label='Motion Data')
plt.bar(x + width/2, geometric_accuracies, width, label='Geometric Data')

plt.ylabel('Accuracy')
plt.title('Model Performance: Motion vs Geometric Data')
plt.xticks(x, models)
plt.legend()
plt.ylim([0, 1])
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
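
For a side-by-side reading, the long comparison table can be pivoted so that each model is a row and each dataset a column. The sketch below shows the reshaping on placeholder accuracies (these numbers are invented for the example and are not results from this notebook); the added difference column is an optional convenience.

```python
import pandas as pd

# Placeholder numbers only; they are NOT results from this notebook.
demo_df = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "Decision Tree"] * 2,
    "Dataset": ["Motion"] * 3 + ["Geometric"] * 3,
    "Accuracy": [0.50, 0.60, 0.55, 0.52, 0.58, 0.51],
})

# One row per model, one column per dataset.
wide = demo_df.pivot(index="Model", columns="Dataset", values="Accuracy")
wide["Motion - Geometric"] = wide["Motion"] - wide["Geometric"]
print(wide.round(3))
```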

## 10. Analysis and Interpretation

### Key Findings:

1. **Motion vs Geometric Data**:
   - Motion data represents the change in facial landmarks over time (delta values)
   - Geometric data represents the actual positions of facial landmarks
   - Accuracy differences between the two datasets in the comparison table indicate which representation, landmark displacement or absolute landmark position, captures facial expressions better for a given model

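2. **Feature Importance**:
   - The summed importances show how much of the Random Forest's total importance falls on the mouth region (points 49-68), the eye region (points 37-48), and the remaining points
   - The regions differ in size (20 mouth points versus 12 eye points), so mean importance per feature is a fairer basis than the totals for judging which facial areas are most critical for emotion classification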

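3. **Model Complexity**:
   - The accuracy-versus-n_estimators curve shows how adding trees affects Random Forest training and test accuracy
   - Training cost grows roughly linearly with the number of trees, so the point where test accuracy levels off marks a reasonable trade-off between accuracy and computational cost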

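4. **Model Comparison**:
   - Logistic Regression is trained with StandardScaler and class_weight='balanced'; the tree-based models use neither, which should be kept in mind when comparing per-class results
   - Random Forest needs no feature scaling, because tree splits depend only on the ordering of feature values
   - Decision Tree is fast to train, but with default settings it grows until its leaves are pure and may overfit without pruning (for example via max_depth or ccp_alpha)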

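5. **Time Performance**:
   - Training time per model is listed in the comparison table; each value is a single wall-clock measurement, and the Random Forest uses all CPU cores (n_jobs=-1), so small differences should be read with caution
   - Prediction time is measured over the whole test set; for real-time applications the relevant figure is the per-sample latency, i.e. that time divided by the number of test samples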
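
The remark about Decision Tree overfitting can be explored with cost-complexity pruning. The sketch below, on synthetic data (not the landmark dataset), grows an unpruned tree and a pruned one and prints their leaf counts and accuracies. Picking the middle value of the pruning path is an arbitrary choice for illustration; in practice `ccp_alpha` would be selected by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the landmark dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Candidate pruning strengths derived from the fully grown tree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
ccp_alphas = path.ccp_alphas[:-1]  # the last alpha prunes the tree down to its root

full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
mid_alpha = ccp_alphas[len(ccp_alphas) // 2]  # arbitrary pick for illustration
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=mid_alpha).fit(X_tr, y_tr)

for name, tree in [("unpruned", full_tree), ("pruned", pruned_tree)]:
    print(
        f"{name}: leaves={tree.get_n_leaves()}, "
        f"train={tree.score(X_tr, y_tr):.3f}, test={tree.score(X_te, y_te):.3f}"
    )
```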