# Week 7: General Machine Learning Techniques

## Learning Objectives:
- Learn ensemble methods
- Understand model selection and tuning
- Explore advanced evaluation techniques
- Handle imbalanced datasets

## Topics Covered:
- Ensemble methods (bagging, boosting)
- Gradient boosting (XGBoost, LightGBM)
- Hyperparameter tuning
- Grid search and random search
- Handling imbalanced data
- Feature importance and selection
- Model interpretability (SHAP, LIME)

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import warnings
warnings.filterwarnings('ignore')

# Advanced libraries (may need installation)
try:
    import xgboost as xgb
    print("XGBoost available")
except ImportError:
    print("XGBoost not available - install with: pip install xgboost")
    xgb = None

try:
    import lightgbm as lgb
    print("LightGBM available")
except ImportError:
    print("LightGBM not available - install with: pip install lightgbm")
    lgb = None

try:
    import shap
    print("SHAP available")
except ImportError:
    print("SHAP not available - install with: pip install shap")
    shap = None

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

## 1. Introduction to Ensemble Methods

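Ensemble methods combine the predictions of several models into a single predictor that is usually more accurate and more stable than any one of them. The key insight is that models whose errors are not strongly correlated tend to cancel out each other's mistakes, so many weak learners can add up to a strong one.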

### Types of Ensemble Methods:
1. **Bagging**: Bootstrap Aggregating - trains models on different bootstrap samples (drawn with replacement) of the training data
2. **Boosting**: Sequential training where each model corrects previous errors
3. **Stacking**: Uses a meta-model to combine predictions from base models
4. **Voting**: Combines predictions through majority vote or averaging

### Benefits:
- Reduces variance (bagging) or bias (boosting), which often curbs overfitting
- Improves generalization
- More robust predictions
- Often achieves better performance than individual models
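
The claim that weak learners can combine into a strong one can be made concrete with a small calculation. The sketch below assumes something real ensembles rarely satisfy: every model is correct with the same probability and the models' errors are fully independent. The function name `majority_vote_accuracy` is illustrative and not part of any library.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Odd ensemble sizes avoid ties in the vote.
for n in (1, 5, 25, 101):
    print(f"{n:>3} models: {majority_vote_accuracy(n, 0.6):.3f}")
```

When errors are correlated the gain shrinks, which is why bagging and random feature selection both try to decorrelate the base models.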

In [None]:
# Create a comprehensive dataset for classification
from sklearn.datasets import make_classification

# Generate synthetic dataset
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    class_sep=0.8,
    random_state=42
)

# Convert to DataFrame
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

print("Dataset created:")
print(f"Shape: {df.shape}")
print(f"Class distribution: {np.bincount(y)}")
print(f"Class balance: {y.mean():.2%} positive class")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nTraining set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

## 2. Bagging Methods

Bagging (Bootstrap Aggregating) trains multiple models on different bootstrap samples of the training data and combines them by averaging their predictions or taking a majority vote.

### Random Forest:
- Bagging + Random feature selection
- Each split in each tree considers only a random subset of features
- Reduces correlation between trees
- Provides feature importance
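
A bootstrap sample is drawn with replacement, so each tree sees some rows several times and misses others entirely. The following sketch uses only the standard library; `bootstrap_indices` is a hypothetical helper, and 1600 is the size of the training split created earlier in this notebook.

```python
import random


def bootstrap_indices(n_samples: int, rng: random.Random) -> list[int]:
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n_samples = 1600  # rows in the training split
sample = bootstrap_indices(n_samples, rng)

in_bag = set(sample)                         # rows this tree would train on
out_of_bag = set(range(n_samples)) - in_bag  # rows it never sees

print(f"Unique rows in bag: {len(in_bag) / n_samples:.1%}")
print(f"Out-of-bag rows:    {len(out_of_bag) / n_samples:.1%}")
```

On average roughly 63% of the rows (1 - 1/e) end up in each bag. The remaining out-of-bag rows can serve as a built-in validation set, which is what `oob_score=True` uses in scikit-learn's forest and bagging estimators.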

In [None]:
# Bagging with Random Forest
print("=== BAGGING: RANDOM FOREST ===")

# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
rf_pred = rf_model.predict(X_test)
rf_pred_proba = rf_model.predict_proba(X_test)[:, 1]

# Evaluate
rf_auc = roc_auc_score(y_test, rf_pred_proba)
print(f"Random Forest AUC: {rf_auc:.4f}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 most important features:")
print(feature_importance.head(10))

# Compare with single Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred_proba = dt_model.predict_proba(X_test)[:, 1]
dt_auc = roc_auc_score(y_test, dt_pred_proba)

print(f"\nSingle Decision Tree AUC: {dt_auc:.4f}")
print(f"Improvement from Random Forest: {rf_auc - dt_auc:.4f}")

In [None]:
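# Generic bagging with scikit-learn's BaggingClassifier
print("\n=== BAGGING CLASSIFIER ===")

# Create bagging classifier
# The base model is passed positionally because the keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2.
bagging_model = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    random_state=42
)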

bagging_model.fit(X_train, y_train)
bagging_pred_proba = bagging_model.predict_proba(X_test)[:, 1]
bagging_auc = roc_auc_score(y_test, bagging_pred_proba)

print(f"Bagging AUC: {bagging_auc:.4f}")

# Visualize feature importance
plt.figure(figsize=(12, 6))
plt.barh(range(10), feature_importance.head(10)['importance'])
plt.yticks(range(10), feature_importance.head(10)['feature'])
plt.xlabel('Feature Importance')
plt.title('Top 10 Feature Importances (Random Forest)')
plt.tight_layout()
plt.show()

## 3. Boosting Methods

Boosting trains models sequentially, where each model attempts to correct the errors of the previous ones.

### Gradient Boosting:
- Builds models sequentially
- Each model is fit to the residual errors (the negative gradient of the loss) of the current ensemble
- Combines weak learners into strong learners
- Can overfit if not properly regularized
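
The sequential idea can be sketched in plain Python. This is a toy version under stated assumptions, not how scikit-learn implements it: a single numeric feature, squared-error loss (where the negative gradient is simply the residual), and depth-1 trees. The helpers `fit_stump` and `boost` are hypothetical names.

```python
def fit_stump(x, residuals):
    """Find the threshold on x that best splits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared loss using one-feature stumps."""
    base = sum(y) / len(y)  # initial prediction: the mean of the target
    predictions = [base] * len(y)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the plain residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each correction by the learning rate before adding it.
        predictions = [pi + learning_rate * stump(xi) for xi, pi in zip(x, predictions)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))

    def model(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return model, losses


x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]  # toy regression target
model, losses = boost(x, y)
print(f"MSE after round 1:  {losses[0]:.3f}")
print(f"MSE after round 50: {losses[-1]:.3f}")
```

For classification with log loss the same loop applies, except that the quantity being fit is the difference between the label and the current predicted probability rather than a raw residual.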

In [None]:
# Gradient Boosting
print("=== BOOSTING: GRADIENT BOOSTING ===")

# Train Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)

# Make predictions
gb_pred_proba = gb_model.predict_proba(X_test)[:, 1]
gb_auc = roc_auc_score(y_test, gb_pred_proba)

print(f"Gradient Boosting AUC: {gb_auc:.4f}")

# Plot training progress
plt.figure(figsize=(12, 6))

# train_score_ holds the training loss after each boosting iteration
train_loss = gb_model.train_score_
plt.subplot(1, 2, 1)
plt.plot(train_loss, label='Training Loss')
plt.xlabel('Boosting Iterations')
plt.ylabel('Loss')
plt.title('Gradient Boosting Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)

# Feature importance
plt.subplot(1, 2, 2)
gb_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': gb_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.barh(range(10), gb_importance.head(10)['importance'])
plt.yticks(range(10), gb_importance.head(10)['feature'])
plt.xlabel('Feature Importance')
plt.title('Top 10 Features (Gradient Boosting)')

plt.tight_layout()
plt.show()

print("\nTop 10 most important features (Gradient Boosting):")
print(gb_importance.head(10))

## 4. Advanced Boosting: XGBoost and LightGBM

XGBoost and LightGBM are optimized implementations of gradient boosting with additional features:

### XGBoost:
- Regularization to prevent overfitting
- Parallel processing
- Missing value handling
- Cross-validation support

### LightGBM:
- Fast training through histogram-based splits and leaf-wise tree growth
- Lower memory usage
- Accuracy that is often comparable to XGBoost
- Native handling of categorical features

In [None]:
# XGBoost implementation
if xgb is not None:
    print("=== XGBOOST ===")
    
    # Train XGBoost
    xgb_model = xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        eval_metric='logloss'
    )
    
    xgb_model.fit(X_train, y_train)
    
    # Make predictions
    xgb_pred_proba = xgb_model.predict_proba(X_test)[:, 1]
    xgb_auc = roc_auc_score(y_test, xgb_pred_proba)
    
    print(f"XGBoost AUC: {xgb_auc:.4f}")
    
    # Feature importance
    xgb_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': xgb_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("\nTop 10 most important features (XGBoost):")
    print(xgb_importance.head(10))
else:
    print("XGBoost not available - skipping")
    xgb_auc = None

# LightGBM implementation
if lgb is not None:
    print("\n=== LIGHTGBM ===")
    
    # Train LightGBM
    lgb_model = lgb.LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        verbose=-1
    )
    
    lgb_model.fit(X_train, y_train)
    
    # Make predictions
    lgb_pred_proba = lgb_model.predict_proba(X_test)[:, 1]
    lgb_auc = roc_auc_score(y_test, lgb_pred_proba)
    
    print(f"LightGBM AUC: {lgb_auc:.4f}")
    
    # Feature importance
    lgb_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': lgb_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("\nTop 10 most important features (LightGBM):")
    print(lgb_importance.head(10))
else:
    print("LightGBM not available - skipping")
    lgb_auc = None

## 5. Voting Classifiers

Voting classifiers combine different types of algorithms and use majority voting or averaging to make predictions.

### Types:
- **Hard Voting**: Uses predicted class labels
- **Soft Voting**: Averages predicted probabilities (often better when the models' probabilities are well calibrated)
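
The two voting types can disagree on the same inputs. The sketch below uses made-up class-1 probabilities from three hypothetical models; `hard_vote` and `soft_vote` are illustrative helpers, not scikit-learn functions.

```python
def hard_vote(probabilities, threshold=0.5):
    """Each model casts one vote for class 1 if its probability exceeds the threshold."""
    votes_for_one = sum(p > threshold for p in probabilities)
    return int(votes_for_one > len(probabilities) / 2)


def soft_vote(probabilities, threshold=0.5):
    """Average the class-1 probabilities, then threshold the mean."""
    mean_probability = sum(probabilities) / len(probabilities)
    return int(mean_probability > threshold)


# Two models lean slightly towards class 0, one is very confident in class 1.
probabilities = [0.45, 0.48, 0.95]
print("Hard vote:", hard_vote(probabilities))  # 0: two of three models vote for class 0
print("Soft vote:", soft_vote(probabilities))  # 1: the mean probability is about 0.63
```

Soft voting lets a confident model outweigh two uncertain ones, which helps only if that confidence is trustworthy.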

In [None]:
# Voting Classifier
print("=== VOTING CLASSIFIER ===")

# Create individual classifiers
classifiers = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('lr', LogisticRegression(random_state=42, max_iter=1000))
]

# Add XGBoost if available
if xgb is not None:
    classifiers.append(('xgb', xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')))

# Create voting classifier
voting_model = VotingClassifier(estimators=classifiers, voting='soft')
voting_model.fit(X_train_scaled, y_train)

# Make predictions
voting_pred_proba = voting_model.predict_proba(X_test_scaled)[:, 1]
voting_auc = roc_auc_score(y_test, voting_pred_proba)

print(f"Voting Classifier AUC: {voting_auc:.4f}")

# Compare individual classifiers
print("\nIndividual classifier performance:")
for name, clf in classifiers:
    if name == 'lr':
        clf.fit(X_train_scaled, y_train)
        pred_proba = clf.predict_proba(X_test_scaled)[:, 1]
    else:
        clf.fit(X_train, y_train)
        pred_proba = clf.predict_proba(X_test)[:, 1]
    
    auc = roc_auc_score(y_test, pred_proba)
    print(f"{name.upper()}: {auc:.4f}")

print(f"\nVoting Classifier combines all: {voting_auc:.4f}")

## 6. Hyperparameter Tuning

Hyperparameter tuning is crucial for optimal model performance. We'll explore both grid search and random search.

### Grid Search vs Random Search:
- **Grid Search**: Exhaustive search over parameter grid
- **Random Search**: Samples a fixed number of combinations from parameter distributions; often more efficient in high-dimensional spaces
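
The cost difference is easy to count. This sketch enumerates the grid used for the Random Forest grid search in this section and compares it with a fixed random-search budget. It samples with replacement for simplicity, whereas `RandomizedSearchCV` samples without replacement when every parameter is given as a list.

```python
import random
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

# Grid search: every combination is evaluated on every fold.
grid_candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print("Grid candidates:", len(grid_candidates))                   # 3 * 4 * 3 * 3 = 108
print("Grid cross-validation fits:", len(grid_candidates) * cv_folds)  # 540

# Random search: a fixed budget of independently sampled candidates.
rng = random.Random(42)
n_iter = 50
random_candidates = [
    {name: rng.choice(values) for name, values in param_grid.items()}
    for _ in range(n_iter)
]
print("Random candidates:", len(random_candidates))               # 50
print("Random cross-validation fits:", n_iter * cv_folds)         # 250
```

Adding one more parameter with four values would quadruple the grid, while the random-search budget stays at whatever `n_iter` is set to.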

In [None]:
# Grid Search for Random Forest
print("=== GRID SEARCH HYPERPARAMETER TUNING ===")

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Test best model
best_rf = grid_search.best_estimator_
best_rf_pred_proba = best_rf.predict_proba(X_test)[:, 1]
best_rf_auc = roc_auc_score(y_test, best_rf_pred_proba)

print(f"Test AUC with best parameters: {best_rf_auc:.4f}")
print(f"Improvement over default: {best_rf_auc - rf_auc:.4f}")

In [None]:
# Random Search for comparison
print("\n=== RANDOM SEARCH HYPERPARAMETER TUNING ===")

# Define parameter distributions
param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 5, 7, 10, 15, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': ['sqrt', 'log2', None]  # 'auto' was removed in scikit-learn 1.3
}

# Random search
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=50,  # Number of parameter combinations to try
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")

# Test best model
best_rf_random = random_search.best_estimator_
best_rf_random_pred_proba = best_rf_random.predict_proba(X_test)[:, 1]
best_rf_random_auc = roc_auc_score(y_test, best_rf_random_pred_proba)

print(f"Test AUC with random search: {best_rf_random_auc:.4f}")

# Compare search methods
print(f"\nComparison:")
print(f"Grid Search AUC: {best_rf_auc:.4f}")
print(f"Random Search AUC: {best_rf_random_auc:.4f}")
print(f"Default RF AUC: {rf_auc:.4f}")

## 7. Handling Imbalanced Data

Real-world datasets often have imbalanced classes. We'll explore techniques to handle this:

### Techniques:
1. **Resampling**: Over-sampling minority class or under-sampling majority class
2. **SMOTE**: Synthetic Minority Over-sampling Technique
3. **Class Weights**: Penalize misclassification of minority class more
4. **Cost-sensitive Learning**: Assign a higher cost to minority-class errors, or move the decision threshold to reflect those costs
5. **Ensemble Methods**: Combine balanced models
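
Class weighting is the simplest of these to reason about. The sketch below reproduces the formula scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * class_count)`. The label counts are hypothetical round numbers, and `balanced_class_weights` is an illustrative helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced labels: 800 examples of class 1, 160 of class 0.
labels = [1] * 800 + [0] * 160
weights = balanced_class_weights(labels)
print(weights)  # {1: 0.6, 0: 3.0}
```

Each minority-class error then counts five times as much as a majority-class error, which matches the 5:1 class ratio.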

In [None]:
# Create imbalanced dataset
print("=== HANDLING IMBALANCED DATA ===")

# Create imbalanced version of our data
# Keep all positive examples and subsample the negatives, so class 0 becomes the minority
positive_indices = np.where(y_train == 1)[0]
negative_indices = np.where(y_train == 0)[0]

# Keep all positive examples and 20% of negative examples
np.random.seed(42)
selected_negative = np.random.choice(negative_indices, int(len(negative_indices) * 0.2), replace=False)
imbalanced_indices = np.concatenate([positive_indices, selected_negative])

X_train_imb = X_train[imbalanced_indices]
y_train_imb = y_train[imbalanced_indices]

print(f"Original training set: {np.bincount(y_train)}")
print(f"Imbalanced training set: {np.bincount(y_train_imb)}")
print(f"Positive class share: {y_train_imb.mean():.2%}")

# Train baseline model on imbalanced data
baseline_model = RandomForestClassifier(n_estimators=100, random_state=42)
baseline_model.fit(X_train_imb, y_train_imb)
baseline_pred_proba = baseline_model.predict_proba(X_test)[:, 1]
baseline_auc = roc_auc_score(y_test, baseline_pred_proba)

print(f"\nBaseline model (imbalanced data) AUC: {baseline_auc:.4f}")

In [None]:
# Method 1: SMOTE (Synthetic Minority Over-sampling Technique)
print("\n=== METHOD 1: SMOTE ===")

# Apply SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_imb, y_train_imb)

print(f"After SMOTE: {np.bincount(y_train_smote)}")

# Train model with SMOTE data
smote_model = RandomForestClassifier(n_estimators=100, random_state=42)
smote_model.fit(X_train_smote, y_train_smote)
smote_pred_proba = smote_model.predict_proba(X_test)[:, 1]
smote_auc = roc_auc_score(y_test, smote_pred_proba)

print(f"SMOTE model AUC: {smote_auc:.4f}")
print(f"Improvement: {smote_auc - baseline_auc:.4f}")

In [None]:
# Method 2: Class Weights
print("\n=== METHOD 2: CLASS WEIGHTS ===")

# Train model with balanced class weights
weighted_model = RandomForestClassifier(
    n_estimators=100, 
    class_weight='balanced',
    random_state=42
)
weighted_model.fit(X_train_imb, y_train_imb)
weighted_pred_proba = weighted_model.predict_proba(X_test)[:, 1]
weighted_auc = roc_auc_score(y_test, weighted_pred_proba)

print(f"Weighted model AUC: {weighted_auc:.4f}")
print(f"Improvement: {weighted_auc - baseline_auc:.4f}")

In [None]:
# Method 3: Under-sampling
print("\n=== METHOD 3: UNDER-SAMPLING ===")

# Apply random under-sampling
undersampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersampler.fit_resample(X_train_imb, y_train_imb)

print(f"After under-sampling: {np.bincount(y_train_under)}")

# Train model with under-sampled data
under_model = RandomForestClassifier(n_estimators=100, random_state=42)
under_model.fit(X_train_under, y_train_under)
under_pred_proba = under_model.predict_proba(X_test)[:, 1]
under_auc = roc_auc_score(y_test, under_pred_proba)

print(f"Under-sampled model AUC: {under_auc:.4f}")
print(f"Improvement: {under_auc - baseline_auc:.4f}")

In [None]:
# Compare all methods
print("\n=== COMPARISON OF IMBALANCED DATA METHODS ===")

methods = {
    'Baseline (Imbalanced)': baseline_auc,
    'SMOTE': smote_auc,
    'Class Weights': weighted_auc,
    'Under-sampling': under_auc
}

for method, auc in methods.items():
    print(f"{method:<22}: {auc:.4f}")

# Best method
best_method = max(methods.items(), key=lambda x: x[1])
print(f"\nBest method: {best_method[0]} (AUC: {best_method[1]:.4f})")

# Visualize ROC curves
plt.figure(figsize=(10, 8))

method_probabilities = {
    'Baseline': baseline_pred_proba,
    'SMOTE': smote_pred_proba,
    'Weighted': weighted_pred_proba,
    'Under-sampled': under_pred_proba
}

for name, proba in method_probabilities.items():
    fpr, tpr, _ = roc_curve(y_test, proba)
    auc_score = roc_auc_score(y_test, proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.3f})')

plt.plot([0, 1], [0, 1], 'k--', alpha=0.7, label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves: Imbalanced Data Methods')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 8. Feature Selection

Feature selection keeps the most relevant features and drops the rest, which can reduce overfitting, speed up training, and make models easier to interpret.

### Methods:
1. **Filter Methods**: Statistical tests (chi-square, correlation)
2. **Wrapper Methods**: Use model performance (RFE, forward/backward selection)
3. **Embedded Methods**: Feature selection during training (Lasso, Random Forest)
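
A filter method scores each feature on its own, without training a model. The sketch below ranks features by absolute Pearson correlation with the target using only the standard library. The data is synthetic, and the helpers `pearson` and `select_k_best` are hypothetical names, not the scikit-learn API.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select_k_best(columns, target, k):
    """Rank features by |correlation| with the target and return the top-k names."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


rng = random.Random(0)
target = [rng.randint(0, 1) for _ in range(500)]
columns = {
    "signal_strong": [t + rng.gauss(0, 0.5) for t in target],  # target plus a little noise
    "signal_weak": [t + rng.gauss(0, 1.0) for t in target],    # target plus more noise
    "noise": [rng.gauss(0, 1.0) for _ in target],              # unrelated to the target
}
print(select_k_best(columns, target, k=2))
```

Because each feature is scored in isolation, a filter like this is cheap but can miss features that are only useful in combination, which is where wrapper and embedded methods come in.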

In [None]:
# Feature Selection
print("=== FEATURE SELECTION ===")

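# Method 1: SelectKBest with chi-square
# Note: chi2 requires non-negative inputs and is designed for counts or frequencies.
# Shifting continuous features is only a workaround; the scores depend on the shift,
# so an ANOVA F-test (f_classif) is usually a better fit for data like this.
train_min = X_train.min(axis=0)  # per-feature minimum from the training data only
X_train_pos = X_train - train_min + 1  # Make all features positive
X_test_pos = X_test - train_min + 1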

selector = SelectKBest(score_func=chi2, k=10)
X_train_selected = selector.fit_transform(X_train_pos, y_train)
X_test_selected = selector.transform(X_test_pos)

# Get selected features
selected_features = [feature_names[i] for i in selector.get_support(indices=True)]
print(f"Selected features (chi-square): {selected_features}")

# Train model with selected features
selected_model = RandomForestClassifier(n_estimators=100, random_state=42)
selected_model.fit(X_train_selected, y_train)
selected_pred_proba = selected_model.predict_proba(X_test_selected)[:, 1]
selected_auc = roc_auc_score(y_test, selected_pred_proba)

print(f"Selected features model AUC: {selected_auc:.4f}")
print(f"Original model AUC: {rf_auc:.4f}")
print(f"Difference: {selected_auc - rf_auc:.4f}")

In [None]:
# Method 2: Recursive Feature Elimination (RFE)
print("\n=== RECURSIVE FEATURE ELIMINATION ===")

# Use RFE with Random Forest
estimator = RandomForestClassifier(n_estimators=50, random_state=42)
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe.fit(X_train, y_train)

# Get selected features
rfe_features = [feature_names[i] for i in range(len(feature_names)) if rfe.support_[i]]
print(f"Selected features (RFE): {rfe_features}")

# Transform data
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# Train model with RFE features
rfe_model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe_model.fit(X_train_rfe, y_train)
rfe_pred_proba = rfe_model.predict_proba(X_test_rfe)[:, 1]
rfe_auc = roc_auc_score(y_test, rfe_pred_proba)

print(f"RFE model AUC: {rfe_auc:.4f}")
print(f"Difference from original: {rfe_auc - rf_auc:.4f}")

## 9. Summary

Congratulations! You've mastered advanced machine learning techniques. Here's what you learned:

### Key Concepts Mastered:
1. **Ensemble Methods**: Combining models for better performance
2. **Bagging**: Random Forest and bootstrap aggregating
3. **Boosting**: Gradient Boosting, XGBoost, LightGBM
4. **Voting**: Combining different algorithm types
5. **Hyperparameter Tuning**: Grid search vs random search
6. **Imbalanced Data**: SMOTE, class weights, resampling
7. **Feature Selection**: Choosing the most relevant features

### Key Skills Acquired:
- Building and combining ensemble models
- Optimizing hyperparameters systematically
- Handling imbalanced datasets effectively
- Selecting features to improve model performance
- Understanding trade-offs between different techniques
- Implementing advanced ML algorithms

### Best Practices:
- Always validate ensemble improvements with cross-validation
- Use random search for initial exploration, grid search for fine-tuning
- Consider class imbalance early in your modeling process
- Feature selection should be done within cross-validation
- Monitor overfitting, especially with boosting algorithms
- Combine multiple techniques for optimal results
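
One way to keep feature selection inside cross-validation is to wrap the selector and the model in a scikit-learn `Pipeline`, so the selector is refit on the training portion of every fold. This is a minimal sketch on a separate synthetic dataset; the `_demo` names, `k=10`, and the choice of `f_classif` as the scoring function are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Small synthetic dataset, independent of the one used earlier in the notebook.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# The selector only ever sees the training folds, so no information
# from the held-out fold leaks into the choice of features.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same reasoning applies to resampling: steps such as SMOTE should run only on the training folds, and imbalanced-learn provides its own `imblearn.pipeline.Pipeline` for that purpose.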

### When to Use Each Technique:
- **Random Forest**: General-purpose, good baseline, feature importance
- **Gradient Boosting**: High accuracy on tabular data when carefully tuned
- **XGBoost/LightGBM**: Competitions, large datasets, native handling of missing values
- **Voting**: Combining diverse algorithms
- **SMOTE**: Moderate imbalance, sufficient data
- **Class Weights**: Severe imbalance, limited data

### Real-world Applications:
- Fraud detection (imbalanced data, ensemble methods)
- Medical diagnosis (high accuracy, interpretability)
- Marketing campaigns (customer targeting, feature selection)
- Financial modeling (risk assessment, robust predictions)
- Competition machine learning (XGBoost, stacking)

### Next Steps:
In the next week, we'll explore neural networks and deep learning, building on the foundation you've established with traditional machine learning techniques.