# Week 5: Classification Algorithms

## Learning Objectives:
- Understand classification problems
- Implement various classification algorithms
- Master classification evaluation metrics
- Understand how certain parameters change the behavior of common classification models

## Topics Covered:
- Logistic regression
- Decision trees
- Random forests
- k-Nearest Neighbors (k-NN)
- Classification metrics (accuracy, precision, recall, F1)

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                           f1_score, classification_report, confusion_matrix,
                           roc_auc_score, roc_curve)
from sklearn.datasets import make_classification
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

## 1. Understanding Classification Problems

Classification is a supervised learning task where we predict discrete categories or classes. Unlike regression (predicting continuous values), classification predicts categorical outcomes.

### Types of Classification:
- **Binary Classification**: Two classes (e.g., spam vs. not spam)
- **Multi-class Classification**: More than two classes (e.g., flower species)
- **Multi-label Classification**: Multiple labels per instance (e.g., tagging an article with several topics)
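
To make the three problem types concrete, here is a minimal sketch that inspects a label collection and reports which kind of task it describes. The helper `describe_target` and the toy labels are hypothetical and exist only for illustration; they are not part of scikit-learn.

```python
def describe_target(y):
    """Classify a label collection as binary, multi-class, or multi-label.

    Hypothetical helper for illustration only.
    """
    # Multi-label: every instance carries a collection of labels (possibly empty)
    if all(isinstance(label, (list, set, tuple)) for label in y):
        return "multi-label"
    # Otherwise each instance has exactly one label; count the distinct classes
    n_classes = len(set(y))
    return "binary" if n_classes <= 2 else "multi-class"

# Toy targets (made up)
spam_labels = ["spam", "not spam", "spam", "not spam"]
species_labels = ["setosa", "versicolor", "virginica", "setosa"]
article_tags = [{"sports", "politics"}, {"tech"}, set(), {"tech", "science"}]

for name, target in [("spam", spam_labels),
                     ("species", species_labels),
                     ("tags", article_tags)]:
    print(f"{name}: {describe_target(target)}")
```

The rest of this notebook deals with the binary case only (`Will_Purchase` is 0 or 1).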

In [None]:
# Create a comprehensive classification dataset
np.random.seed(42)

# Generate synthetic customer data for predicting purchase behavior
n_samples = 1000

# Customer features
age = np.random.normal(35, 12, n_samples)
age = np.clip(age, 18, 70)  # Realistic age range
income = np.random.normal(50000, 20000, n_samples)
income = np.clip(income, 20000, 150000)  # Realistic income range
time_on_site = np.random.exponential(5, n_samples)  # Minutes on website
pages_visited = np.random.poisson(3, n_samples) + 1
previous_purchases = np.random.poisson(2, n_samples)
days_since_last_visit = np.random.exponential(10, n_samples)

# Create realistic relationships for purchase decision
# Higher income, more time on site, more pages visited = higher purchase probability
purchase_probability = (
    0.3 * (income / 50000) +
    0.2 * (time_on_site / 10) +
    0.2 * (pages_visited / 5) +
    0.15 * (previous_purchases / 3) +
    0.1 * (age / 50) +
    0.05 * (1 / (days_since_last_visit + 1))
)

# Add some noise and create binary outcome
purchase_probability += np.random.normal(0, 0.1, n_samples)
purchase_probability = np.clip(purchase_probability, 0, 1)
will_purchase = (purchase_probability > 0.5).astype(int)

# Create DataFrame
customer_data = pd.DataFrame({
    'Age': age,
    'Income': income,
    'Time_on_Site': time_on_site,
    'Pages_Visited': pages_visited,
    'Previous_Purchases': previous_purchases,
    'Days_Since_Last_Visit': days_since_last_visit,
    'Will_Purchase': will_purchase
})

print("Customer Purchase Prediction Dataset:")
print(customer_data.head())
print(f"\nDataset shape: {customer_data.shape}")
print(f"\nClass distribution:")
print(customer_data['Will_Purchase'].value_counts())
print(f"\nClass balance: {customer_data['Will_Purchase'].mean():.2%} positive class")

In [None]:
# Exploratory Data Analysis for Classification
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Feature distributions by class
features = ['Age', 'Income', 'Time_on_Site', 'Pages_Visited', 'Previous_Purchases', 'Days_Since_Last_Visit']

for i, feature in enumerate(features):
    row = i // 3
    col = i % 3
    
    # Create histogram for each class
    axes[row, col].hist(customer_data[customer_data['Will_Purchase'] == 0][feature], 
                       alpha=0.7, label='No Purchase', bins=30, color='red')
    axes[row, col].hist(customer_data[customer_data['Will_Purchase'] == 1][feature], 
                       alpha=0.7, label='Purchase', bins=30, color='green')
    axes[row, col].set_title(f'Distribution of {feature} by Class')
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = customer_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
           square=True, fmt='.2f', cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

## 2. Logistic Regression

Logistic regression is a linear classifier that uses the logistic function to model the probability of class membership.

### Mathematical Foundation:
```
P(y=1|x) = 1 / (1 + e^(-(β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ)))
```

### Key Properties:
- Outputs probabilities between 0 and 1
- Linear decision boundary
- Assumes linear relationship between features and log-odds
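- Benefits from feature scaling: scikit-learn's `LogisticRegression` applies L2 regularization by default, so unscaled features are penalized unevenly and the solver may converge slowly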
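
The formula and the log-odds property above can be checked numerically. The sketch below uses plain Python so the arithmetic stays visible; the coefficients `beta_0` and `betas` and the feature vector `x` are made-up values, not ones fitted in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for two standardized features (illustrative only)
beta_0 = -0.5
betas = [1.2, 0.8]
x = [0.5, -1.0]

# Linear score, then probability of the positive class
z = beta_0 + sum(b * xi for b, xi in zip(betas, x))
p = sigmoid(z)
print(f"linear score z = {z:.3f}, P(y=1|x) = {p:.3f}")

# Raising the first feature by one unit shifts the log-odds by exactly betas[0]
x_shifted = [x[0] + 1.0, x[1]]
z_shifted = beta_0 + sum(b * xi for b, xi in zip(betas, x_shifted))
p_shifted = sigmoid(z_shifted)
delta = log_odds(p_shifted) - log_odds(p)
print(f"change in log-odds = {delta:.3f} (beta_1 = {betas[0]})")
```

This is why coefficients are read as changes in log-odds per unit of the feature: the model is linear in log-odds even though it is non-linear in probability.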

In [None]:
# Prepare data for modeling
print("=== LOGISTIC REGRESSION ===")

# Separate features and target
X = customer_data.drop('Will_Purchase', axis=1)
y = customer_data['Will_Purchase']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_proba_log = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluate performance
accuracy_log = accuracy_score(y_test, y_pred_log)
precision_log = precision_score(y_test, y_pred_log)
recall_log = recall_score(y_test, y_pred_log)
f1_log = f1_score(y_test, y_pred_log)
auc_log = roc_auc_score(y_test, y_pred_proba_log)

print(f"Accuracy: {accuracy_log:.4f}")
print(f"Precision: {precision_log:.4f}")
print(f"Recall: {recall_log:.4f}")
print(f"F1-Score: {f1_log:.4f}")
print(f"AUC-ROC: {auc_log:.4f}")

# Display feature importance (coefficients)
print("\nFeature Importance (Coefficients):")
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': log_reg.coef_[0],
    'Abs_Coefficient': np.abs(log_reg.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

print(feature_importance)

## 3. Decision Trees

Decision trees make predictions by learning simple decision rules from the features. Each internal node tests one feature against a threshold, and each leaf assigns a class.

### Key Properties:
- Easy to understand and interpret
- Handles numerical features directly; in scikit-learn, categorical features must be encoded as numbers first
- No need for feature scaling
- Prone to overfitting
- Non-linear decision boundaries
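
To show how a single decision rule can be learned, here is a minimal sketch that scans candidate thresholds on one feature and keeps the split with the lowest weighted Gini impurity (Gini is the default criterion of scikit-learn's `DecisionTreeClassifier`). The functions `gini` and `best_threshold` and the toy data are assumptions made for illustration; a real tree repeats this search over all features and recurses into each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p1^2 - p0^2."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        # Impurity of the split: size-weighted average of the two children
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (made up): minutes on site and whether the visitor purchased
time_on_site = [1.0, 2.0, 3.0, 6.0, 7.0, 9.0]
purchased = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(time_on_site, purchased)
print(f"parent impurity: {gini(purchased):.2f}")
print(f"best rule: time_on_site <= {threshold} (weighted Gini {impurity:.2f})")
```

Because a deep tree can keep splitting until every leaf is pure, limiting `max_depth` is the usual first defense against overfitting.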

In [None]:
# Decision Tree Implementation
print("=== DECISION TREE ===")

# Train decision tree (no scaling needed)
dt_clf = DecisionTreeClassifier(random_state=42, max_depth=5)
dt_clf.fit(X_train, y_train)

# Make predictions
y_pred_dt = dt_clf.predict(X_test)
y_pred_proba_dt = dt_clf.predict_proba(X_test)[:, 1]

# Evaluate performance
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)
auc_dt = roc_auc_score(y_test, y_pred_proba_dt)

print(f"Accuracy: {accuracy_dt:.4f}")
print(f"Precision: {precision_dt:.4f}")
print(f"Recall: {recall_dt:.4f}")
print(f"F1-Score: {f1_dt:.4f}")
print(f"AUC-ROC: {auc_dt:.4f}")

# Display feature importance
print("\nFeature Importance:")
feature_importance_dt = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_clf.feature_importances_
}).sort_values('Importance', ascending=False)

print(feature_importance_dt)

In [None]:
# Visualize decision tree
plt.figure(figsize=(20, 12))
plot_tree(dt_clf, 
          feature_names=X.columns,
          class_names=['No Purchase', 'Purchase'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision Tree Visualization')
plt.tight_layout()
plt.show()

# Feature importance visualization
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_dt['Feature'], feature_importance_dt['Importance'])
plt.xlabel('Feature Importance')
plt.title('Decision Tree Feature Importance')
plt.tight_layout()
plt.show()

## 4. Random Forest

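Random Forest is an ensemble method that trains many decision trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and aggregates their predictions into a more robust classifier.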

### Key Properties:
- Reduces overfitting compared to single decision trees
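- Handling of missing values depends on the implementation and library version; impute them beforehand if in doubt
- Provides feature importance
- Trees are trained independently, so training parallelizes well on large datasets
- Often a strong baseline on tabular data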
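
The intuition behind combining trees can be sketched with a small probability calculation: if each tree were correct with probability `p` and the trees erred independently, a majority vote would be correct far more often than any single tree. The independence assumption is an idealization made for this sketch; real trees are correlated, which is why random forests decorrelate them with bootstrap samples and random feature subsets, and why actual gains are smaller.

```python
from math import comb

def majority_vote_accuracy(p, n_trees):
    """Probability that a majority of n_trees independent classifiers,
    each correct with probability p, votes for the right class.

    Assumes n_trees is odd so that ties cannot occur.
    """
    return sum(
        comb(n_trees, k) * p ** k * (1 - p) ** (n_trees - k)
        for k in range(n_trees // 2 + 1, n_trees + 1)
    )

# Each hypothetical tree is only slightly better than a coin flip
for n in (1, 11, 101):
    print(f"{n:>3} trees -> majority correct with probability "
          f"{majority_vote_accuracy(0.6, n):.3f}")
```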

In [None]:
# Random Forest Implementation
print("=== RANDOM FOREST ===")

# Train Random Forest
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
rf_clf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_clf.predict(X_test)
y_pred_proba_rf = rf_clf.predict_proba(X_test)[:, 1]

# Evaluate performance
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
auc_rf = roc_auc_score(y_test, y_pred_proba_rf)

print(f"Accuracy: {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall: {recall_rf:.4f}")
print(f"F1-Score: {f1_rf:.4f}")
print(f"AUC-ROC: {auc_rf:.4f}")

# Display feature importance
print("\nFeature Importance:")
feature_importance_rf = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_clf.feature_importances_
}).sort_values('Importance', ascending=False)

print(feature_importance_rf)

## 5. k-Nearest Neighbors (k-NN)

k-NN is a lazy learning algorithm that classifies new instances based on the majority class among the k nearest neighbors.

### Key Properties:
- Simple and intuitive
- Minimal training phase: the model stores the training data and defers the real work to prediction time (lazy learning)
- Sensitive to feature scaling
- Can be computationally expensive for large datasets
- Works well with small datasets
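
The sketch below implements the k-NN rule from scratch and shows why scaling matters. It uses Euclidean distance and made-up data in which the class depends on age only; because income is measured in much larger units, it dominates the unscaled distance. The helpers `knn_predict` and `standardize` are hypothetical names for this illustration, not scikit-learn functions.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the majority class among the k nearest training points."""
    distances = sorted(
        (math.dist(x, x_new), label) for x, label in zip(X_train, y_train)
    )
    top_k_labels = [label for _, label in distances[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

def standardize(X_fit, X_apply):
    """Scale X_apply to zero mean and unit variance using statistics of X_fit."""
    columns = list(zip(*X_fit))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in X_apply]

# Toy data (made up): [age, income]; the class depends on age only
X_train = [[20, 50000], [22, 52000], [60, 51000], [62, 49000]]
y_train = [0, 0, 1, 1]
query = [61, 51900]

# Unscaled: income differences swamp the 40-year age gap
pred_unscaled = knn_predict(X_train, y_train, query, k=3)

# Scaled with training-set statistics, as the notebook does with StandardScaler
X_train_std = standardize(X_train, X_train)
query_std = standardize(X_train, [query])[0]
pred_scaled = knn_predict(X_train_std, y_train, query_std, k=3)

print(f"unscaled prediction: {pred_unscaled}, scaled prediction: {pred_scaled}")
```

Note that every prediction sorts all training points by distance, which is where the cost on large datasets comes from.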

In [None]:
# k-NN Implementation
print("=== K-NEAREST NEIGHBORS ===")

# Train k-NN (using scaled features)
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_knn = knn_clf.predict(X_test_scaled)
y_pred_proba_knn = knn_clf.predict_proba(X_test_scaled)[:, 1]

# Evaluate performance
accuracy_knn = accuracy_score(y_test, y_pred_knn)
precision_knn = precision_score(y_test, y_pred_knn)
recall_knn = recall_score(y_test, y_pred_knn)
f1_knn = f1_score(y_test, y_pred_knn)
auc_knn = roc_auc_score(y_test, y_pred_proba_knn)

print(f"Accuracy: {accuracy_knn:.4f}")
print(f"Precision: {precision_knn:.4f}")
print(f"Recall: {recall_knn:.4f}")
print(f"F1-Score: {f1_knn:.4f}")
print(f"AUC-ROC: {auc_knn:.4f}")

# Test different k values with cross-validation on the training set
# (choosing k by test-set accuracy would leak test information into model selection)
print("\n=== TESTING DIFFERENT K VALUES ===")
k_values = range(1, 21)
k_scores = []

for k in k_values:
    knn_temp = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(knn_temp, X_train_scaled, y_train, cv=5, scoring='accuracy')
    k_scores.append(cv_scores.mean())

# Find best k
best_k = k_values[np.argmax(k_scores)]
best_accuracy = max(k_scores)
print(f"Best k: {best_k} with mean CV accuracy: {best_accuracy:.4f}")

# Plot k vs accuracy
plt.figure(figsize=(10, 6))
plt.plot(k_values, k_scores, 'bo-')
plt.axvline(best_k, color='red', linestyle='--', alpha=0.7, label=f'Best k={best_k}')
plt.xlabel('k Value')
plt.ylabel('Accuracy')
plt.title('k-NN Performance vs k Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 6. Classification Metrics Deep Dive

Understanding evaluation metrics is crucial for classification problems. Different metrics emphasize different aspects of performance.
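
As a worked example of how these metrics relate, the sketch below derives accuracy, precision, recall, and F1 from the four confusion-matrix counts. The function `classification_metrics` and the toy labels are made up for illustration; in the notebook itself the scikit-learn functions imported at the top do this work.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics from raw 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels (made up): 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Here one missed positive and one false alarm give precision and recall of 3/4 each, while accuracy is 8/10 because the true negatives also count.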

In [None]:
# Comprehensive Model Comparison
print("=== COMPREHENSIVE MODEL COMPARISON ===")

# Store all results
results = {
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'k-NN'],
    'Accuracy': [accuracy_log, accuracy_dt, accuracy_rf, accuracy_knn],
    'Precision': [precision_log, precision_dt, precision_rf, precision_knn],
    'Recall': [recall_log, recall_dt, recall_rf, recall_knn],
    'F1-Score': [f1_log, f1_dt, f1_rf, f1_knn],
    'AUC-ROC': [auc_log, auc_dt, auc_rf, auc_knn]
}

results_df = pd.DataFrame(results)
print(results_df.round(4))

# Best model
best_model_idx = results_df['F1-Score'].idxmax()
best_model = results_df.loc[best_model_idx, 'Model']  # idxmax returns an index label, so use .loc
print(f"\nBest model based on F1-Score: {best_model}")

In [None]:
# Confusion Matrices
models = {
    'Logistic Regression': y_pred_log,
    'Decision Tree': y_pred_dt,
    'Random Forest': y_pred_rf,
    'k-NN': y_pred_knn
}

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

for i, (model_name, predictions) in enumerate(models.items()):
    cm = confusion_matrix(y_test, predictions)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[i])
    axes[i].set_title(f'{model_name} Confusion Matrix')
    axes[i].set_xlabel('Predicted')
    axes[i].set_ylabel('Actual')

plt.tight_layout()
plt.show()

In [None]:
# ROC Curves
plt.figure(figsize=(10, 8))

# Calculate ROC curves for all models
probabilities = {
    'Logistic Regression': y_pred_proba_log,
    'Decision Tree': y_pred_proba_dt,
    'Random Forest': y_pred_proba_rf,
    'k-NN': y_pred_proba_knn
}

for model_name, proba in probabilities.items():
    fpr, tpr, _ = roc_curve(y_test, proba)
    auc_score = roc_auc_score(y_test, proba)
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {auc_score:.3f})')

# Plot diagonal line
plt.plot([0, 1], [0, 1], 'k--', alpha=0.7, label='Random Classifier')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
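
One way to read the AUC values in the legend: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it directly from that definition on a four-point toy example; `auc_by_ranking` is a hypothetical helper, and its pairwise loop is meant for understanding rather than for large datasets.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))

# Toy scores (made up): one positive (0.35) is ranked below one negative (0.4)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = auc_by_ranking(y_true, scores)
print(f"AUC = {auc_value}")
```

Three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. A model that scores at random would land near 0.5, which is the diagonal in the plot.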

## 7. Hyperparameter Tuning

Let's tune the Random Forest with GridSearchCV, which evaluates every combination in a parameter grid using cross-validation and keeps the best one.

In [None]:
# Hyperparameter tuning for Random Forest
print("=== HYPERPARAMETER TUNING FOR RANDOM FOREST ===")

# Define parameter grid
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid search
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), 
                             param_grid_rf, cv=5, scoring='f1', n_jobs=-1)
grid_search_rf.fit(X_train, y_train)

# Best parameters
print(f"Best parameters: {grid_search_rf.best_params_}")
print(f"Best CV F1-Score: {grid_search_rf.best_score_:.4f}")

# Test best model
best_rf = grid_search_rf.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)
f1_best_rf = f1_score(y_test, y_pred_best_rf)
print(f"Test F1-Score: {f1_best_rf:.4f}")
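
Before running a grid search it helps to estimate its cost. The sketch below enumerates the same parameter grid with `itertools.product` and counts the model fits that 5-fold cross-validation implies. The variable names are chosen for this illustration; the enumeration mirrors what an exhaustive grid search does conceptually and is not scikit-learn's internal code.

```python
from itertools import product

# Same grid as in the tuning cell above
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# Every combination of one value per parameter
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_cv_fits = len(combinations) * cv_folds

print(f"Parameter combinations: {len(combinations)}")
print(f"Cross-validation fits:  {n_cv_fits}")
print(f"First combination:      {combinations[0]}")
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best combination on the full training set. The count grows multiplicatively with every added parameter, so for larger grids a randomized search is a common alternative.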

## 8. Summary

Congratulations! You've mastered the fundamentals of classification algorithms. Here's what you learned:

### Key Concepts Mastered:
1. **Classification vs Regression**: Understanding the difference between predicting categories vs continuous values
2. **Logistic Regression**: Linear classifier using the logistic function
3. **Decision Trees**: Tree-based models that create interpretable decision rules
4. **Random Forest**: Ensemble method combining multiple decision trees
5. **k-NN**: Instance-based learning using nearest neighbors
6. **Classification Metrics**: Accuracy, precision, recall, F1-score, and AUC-ROC

### Key Skills Acquired:
- Implementing various classification algorithms
- Understanding when to use each algorithm
- Evaluating classification performance with appropriate metrics
- Interpreting confusion matrices and ROC curves
- Tuning hyperparameters for optimal performance
- Checking class balance and using stratified splits as a first step toward handling imbalanced datasets

### When to Use Each Algorithm:
- **Logistic Regression**: When you need interpretable linear relationships and probability estimates
- **Decision Trees**: When you need highly interpretable models and can control overfitting (e.g., by limiting `max_depth`)
- **Random Forest**: When you want high accuracy with built-in feature importance
- **k-NN**: When you have small datasets and local patterns are important

### Classification Metrics Guide:
- **Accuracy**: Overall correctness (good for balanced datasets)
- **Precision**: How many selected items are relevant (minimize false positives)
- **Recall**: How many relevant items are selected (minimize false negatives)
- **F1-Score**: Harmonic mean of precision and recall (balanced metric)
- **AUC-ROC**: Overall discrimination ability across all thresholds

### Best Practices:
- Always check class balance and consider stratified sampling
- Use appropriate metrics based on your problem (precision vs recall trade-off)
- Scale features for algorithms that need it (logistic regression, k-NN)
- Use cross-validation for reliable performance estimates
- Consider ensemble methods for better performance
- Visualize decision boundaries when possible
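
To illustrate the first practice, here is a minimal sketch of what stratified sampling does: it splits each class separately so that the train and test sets keep the original class proportions. The function `stratified_split` and the toy labels are hypothetical; the notebook itself gets this behavior from `train_test_split(..., stratify=y)`, and this sketch does not claim to reproduce its implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) that preserve class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): 80% positive, 20% negative
labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_split(labels)

test_negatives = sum(1 for i in test_idx if labels[i] == 0)
train_negatives = sum(1 for i in train_idx if labels[i] == 0)
print(f"test:  {len(test_idx)} samples, {test_negatives} negatives")
print(f"train: {len(train_idx)} samples, {train_negatives} negatives")
```

A purely random split of the same data could, by chance, leave very few negatives in the test set, which makes precision and recall estimates for that class unreliable.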

### Next Steps:
In the next week, we'll explore unsupervised learning including clustering algorithms and dimensionality reduction techniques like PCA.