# 🤖 Classical Machine Learning Algorithms - Complete Guide

Master the fundamental algorithms that power modern ML! This comprehensive guide covers theory, implementation, and interview preparation.

**Learning Goals:**
- Understand algorithm internals (not just sklearn!)
- Know when to use each algorithm
- Implement algorithms from scratch
- Master hyperparameter tuning
- Compare algorithms systematically
- Prepare for technical interviews

**Algorithms Covered:**
1. **Linear Models:** Linear Regression, Ridge, Lasso, Logistic Regression
2. **Tree-Based:** Decision Trees, Random Forest, Gradient Boosting, XGBoost
3. **Instance-Based:** K-Nearest Neighbors (KNN)
4. **Support Vector Machines:** SVM/SVR with different kernels
5. **Naive Bayes:** Gaussian, Multinomial, Bernoulli
6. **Ensemble Methods:** Bagging, Boosting, Stacking
7. **Clustering:** K-Means, DBSCAN, Hierarchical
8. **Dimensionality Reduction:** PCA, t-SNE, LDA

**Interview Topics:**
- Algorithm selection criteria
- Bias-variance tradeoff
- Overfitting prevention
- Model interpretability
- Computational complexity

**Sources:**
- "The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman
- "Pattern Recognition and Machine Learning" - Bishop
- "Hands-On Machine Learning" - Géron (2019)
- "Introduction to Statistical Learning" - James et al.

In [None]:
# Import ALL necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve,
    mean_squared_error, r2_score, mean_absolute_error
)

# Datasets
from sklearn.datasets import (
    make_classification, make_regression, make_blobs,
    load_iris, load_breast_cancer, load_wine
)

# Classical ML Algorithms
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet,
    LogisticRegression, SGDClassifier
)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    AdaBoostClassifier, BaggingClassifier, VotingClassifier, StackingClassifier
)
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Plotting style
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (14, 8)
sns.set_palette('husl')
np.random.seed(42)

print("✅ All libraries loaded successfully!")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scikit-learn: {__import__('sklearn').__version__}")

## 📊 Part 1: Linear Models - The Foundation

**Interview Question:** *"Explain linear regression and its assumptions."*

**Answer:**

**Linear Regression:**
- **Model:** $y = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n + \epsilon$
- **Goal:** Minimize squared errors (OLS - Ordinary Least Squares)
- **Solution:** $\beta = (X^T X)^{-1} X^T y$

**Key Assumptions:**
1. **Linearity:** Relationship between X and y is linear
2. **Independence:** Observations are independent
3. **Homoscedasticity:** Constant variance of errors
4. **Normality:** Errors are normally distributed (matters for confidence intervals and hypothesis tests, not for the point estimates themselves)
5. **No multicollinearity:** Features aren't highly correlated
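
A quick way to probe the multicollinearity assumption is to look at pairwise correlations and at the condition number of the feature matrix. The sketch below uses made-up data (`x1`, `x2`, `x3` are hypothetical features, with `x3` built as a near-copy of `x1`); it illustrates the idea and is not a full diagnostic procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.01 * rng.normal(size=n)  # hypothetical near-duplicate of x1

X_ok = np.column_stack([x1, x2])             # roughly independent features
X_collinear = np.column_stack([x1, x2, x3])  # contains a collinear pair

# Condition number = largest / smallest singular value of the matrix.
# A very large value means X^T X is close to singular, so the normal
# equation becomes numerically unstable.
cond_ok = np.linalg.cond(X_ok)
cond_bad = np.linalg.cond(X_collinear)

# Pairwise correlations make the offending pair visible.
corr = np.corrcoef(X_collinear, rowvar=False)

print(f"Condition number (independent features): {cond_ok:.1f}")
print(f"Condition number (with near-duplicate):  {cond_bad:.1f}")
print(f"Correlation between x1 and x3:           {corr[0, 2]:.5f}")
```

When the condition number explodes like this, Ridge regularization or dropping one of the correlated features are the usual remedies.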

**Regularization Variants:**

| Model | Loss Function | Use Case |
|-------|---------------|----------|
| **Linear Regression** | MSE | No regularization |
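| **Ridge (L2)** | MSE + $\alpha \lVert\beta\rVert_2^2$ | Multicollinearity, keep all features |
| **Lasso (L1)** | MSE + $\alpha \lVert\beta\rVert_1$ | Feature selection, sparse solutions |
| **ElasticNet** | MSE + $\alpha_1 \lVert\beta\rVert_1 + \alpha_2 \lVert\beta\rVert_2^2$ | Sparsity plus stable handling of correlated features |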

**When to Use:**
- ✅ Fast training and prediction
- ✅ Interpretable coefficients
- ✅ Works well with many features (with regularization)
- ❌ Cannot capture non-linear relationships (without feature engineering)
- ❌ Sensitive to outliers

**Source:** "The Elements of Statistical Learning" Chapter 3

In [None]:
# Linear Regression from scratch
print("📐 LINEAR REGRESSION FROM SCRATCH")
print("="*70)

class LinearRegressionScratch:
    """
    Linear Regression implemented from scratch using Normal Equation
    
    Formula: β = (X^T X)^(-1) X^T y
    """
    
    def __init__(self):
        self.coefficients = None
        self.intercept = None
    
    def fit(self, X, y):
        """
        Fit linear model using Normal Equation
        """
        # Add bias term (column of ones)
        X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
        
        # Normal equation: β = (X^T X)^(-1) X^T y
        XtX = X_with_bias.T @ X_with_bias
        Xty = X_with_bias.T @ y
        
        # Solve for coefficients
        theta = np.linalg.solve(XtX, Xty)
        
        self.intercept = theta[0]
        self.coefficients = theta[1:]
        
        return self
    
    def predict(self, X):
        """
        Make predictions
        """
        return X @ self.coefficients + self.intercept
    
    def score(self, X, y):
        """
        Calculate R² score
        """
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

# Generate synthetic data
np.random.seed(42)
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

print("\n🔧 Training Custom Implementation...")
model_scratch = LinearRegressionScratch()
model_scratch.fit(X_train, y_train)

print(f"\n📊 Model Parameters:")
print(f"  Intercept (β₀): {model_scratch.intercept:.4f}")
print(f"  Coefficient (β₁): {model_scratch.coefficients[0]:.4f}")
print(f"  Equation: y = {model_scratch.intercept:.2f} + {model_scratch.coefficients[0]:.2f}x")

# Compare with sklearn
print("\n🔧 Training Sklearn Implementation...")
model_sklearn = LinearRegression()
model_sklearn.fit(X_train, y_train)

print(f"\n📊 Sklearn Parameters:")
print(f"  Intercept (β₀): {model_sklearn.intercept_:.4f}")
print(f"  Coefficient (β₁): {model_sklearn.coef_[0]:.4f}")

# Evaluate both
r2_scratch = model_scratch.score(X_test, y_test)
r2_sklearn = model_sklearn.score(X_test, y_test)

print(f"\n📈 Performance Comparison:")
print(f"  Custom R²: {r2_scratch:.4f}")
print(f"  Sklearn R²: {r2_sklearn:.4f}")
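print(f"  Difference: {abs(r2_scratch - r2_sklearn):.6f}")
# Only claim a match if the two scores actually agree
if np.isclose(r2_scratch, r2_sklearn):
    print("  ✅ Implementations match!")
else:
    print("  ⚠️ Implementations differ, check the fit above")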

In [None]:
# Visualize Linear Regression
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Plot 1: Scatter plot with regression line
axes[0, 0].scatter(X_train, y_train, alpha=0.6, label='Training data', s=50)
axes[0, 0].scatter(X_test, y_test, alpha=0.6, label='Test data', s=50, color='orange')

# Plot regression line
X_line = np.linspace(X_reg.min(), X_reg.max(), 100).reshape(-1, 1)
y_line = model_sklearn.predict(X_line)
axes[0, 0].plot(X_line, y_line, 'r-', linewidth=3, label='Regression line')

axes[0, 0].set_xlabel('X', fontsize=12)
axes[0, 0].set_ylabel('y', fontsize=12)
axes[0, 0].set_title('Linear Regression Fit', fontweight='bold', fontsize=14)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Residuals plot
y_pred_train = model_sklearn.predict(X_train)
residuals = y_train - y_pred_train

axes[0, 1].scatter(y_pred_train, residuals, alpha=0.6)
axes[0, 1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 1].set_xlabel('Predicted Values', fontsize=12)
axes[0, 1].set_ylabel('Residuals', fontsize=12)
axes[0, 1].set_title('Residuals Plot\n(Should be randomly scattered)', fontweight='bold', fontsize=14)
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Q-Q plot (check normality of residuals)
stats.probplot(residuals, dist="norm", plot=axes[0, 2])
axes[0, 2].set_title('Q-Q Plot\n(Check normality assumption)', fontweight='bold', fontsize=14)
axes[0, 2].grid(True, alpha=0.3)

# Plot 4: Actual vs Predicted
y_pred_test = model_sklearn.predict(X_test)
axes[1, 0].scatter(y_test, y_pred_test, alpha=0.6)
axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                'r--', linewidth=2, label='Perfect prediction')
axes[1, 0].set_xlabel('Actual Values', fontsize=12)
axes[1, 0].set_ylabel('Predicted Values', fontsize=12)
axes[1, 0].set_title(f'Actual vs Predicted\nR² = {r2_sklearn:.4f}', fontweight='bold', fontsize=14)
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 5: Distribution of residuals
axes[1, 1].hist(residuals, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[1, 1].axvline(residuals.mean(), color='red', linestyle='--', 
                   linewidth=2, label=f'Mean: {residuals.mean():.2f}')
axes[1, 1].set_xlabel('Residual Value', fontsize=12)
axes[1, 1].set_ylabel('Frequency', fontsize=12)
axes[1, 1].set_title('Distribution of Residuals\n(Should be normal)', fontweight='bold', fontsize=14)
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# Plot 6: Learning curve (if we had more samples)
# Generate more data for learning curve
X_large, y_large = make_regression(n_samples=1000, n_features=1, noise=20, random_state=42)
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_large, y_large, 
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='r2'
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

axes[1, 2].plot(train_sizes, train_mean, label='Training score', linewidth=2)
axes[1, 2].fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.3)
axes[1, 2].plot(train_sizes, val_mean, label='Validation score', linewidth=2)
axes[1, 2].fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.3)
axes[1, 2].set_xlabel('Training Set Size', fontsize=12)
axes[1, 2].set_ylabel('R² Score', fontsize=12)
axes[1, 2].set_title('Learning Curve', fontweight='bold', fontsize=14)
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights from Visualizations:")
print("  • Residuals plot: Random scatter = good, pattern = model is missing something")
print("  • Q-Q plot: Points on line = residuals are normal (assumption met)")
print("  • Actual vs Predicted: Points on diagonal = perfect predictions")
print("  • Residuals histogram: Should be bell-shaped (normal distribution)")
print("  • Learning curve: Gap between train/val = overfitting")

### 1.1 Regularization: Ridge, Lasso, ElasticNet

**Interview Question:** *"What's the difference between L1 and L2 regularization?"*

**Answer:**

**L2 Regularization (Ridge):**
- **Formula:** $\min ||y - X\beta||_2^2 + \alpha ||\beta||_2^2$
- **Effect:** Shrinks coefficients toward zero (but generally not exactly to zero)
- **Use:** When you want to keep all features but reduce overfitting
- **Properties:** Differentiable everywhere, unique solution
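
Because the L2 penalty is differentiable, Ridge has a closed-form solution: $\beta = (X^T X + \alpha I)^{-1} X^T y$. The sketch below is a minimal NumPy version, assuming the intercept is left unpenalized (handled here by centering `X` and `y`). The helper name `ridge_closed_form` and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - b0 - X b||^2 + alpha * ||b||^2 (intercept b0 unpenalized)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centering removes the intercept from the problem
    yc = y - y_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Hypothetical toy data: 3 features with known coefficients plus noise
rng = np.random.default_rng(0)
X_rdg = rng.normal(size=(50, 3))
y_rdg = X_rdg @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=50)

norms_demo = []
for alpha_demo in (0.0, 1.0, 100.0):
    b0, b = ridge_closed_form(X_rdg, y_rdg, alpha_demo)
    norms_demo.append(np.linalg.norm(b))
    print(f"alpha={alpha_demo:>6}: coef={np.round(b, 3)}, ||coef||={norms_demo[-1]:.3f}")
```

The coefficient norm shrinks as `alpha` grows, but every coefficient stays non-zero, which is the behavior described in the bullets above.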

**L1 Regularization (Lasso):**
- **Formula:** $\min ||y - X\beta||_2^2 + \alpha ||\beta||_1$
- **Effect:** Shrinks some coefficients to EXACTLY zero (feature selection)
- **Use:** When you suspect only a few features are important
- **Properties:** Non-differentiable at zero, sparse solutions

**Why L1 creates sparsity:**
- L1 penalty has "corners" at axes → optimization prefers axis-aligned solutions
- L2 penalty is circular → all directions treated equally
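
The sparsity argument can also be seen algebraically in the simplest possible case: a single coefficient with unpenalized estimate `z`. Under that one-dimensional assumption, the L1-penalized problem is solved by soft-thresholding, while the L2-penalized problem only rescales. The function names below are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*|b|  (1-D Lasso problem)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + 0.5*lam*b**2  (1-D Ridge problem)."""
    return z / (1.0 + lam)

z_demo = np.array([3.0, 0.8, -0.3, -2.0])  # hypothetical unpenalized estimates
lam_demo = 1.0

lasso_demo = soft_threshold(z_demo, lam_demo)
ridge_demo = ridge_shrink(z_demo, lam_demo)

print("unpenalized:", z_demo)
print("lasso      :", lasso_demo)   # entries with |z| <= lam become exactly 0
print("ridge      :", ridge_demo)   # every entry is scaled, none becomes 0
```

Any estimate whose magnitude is below the threshold is set to exactly zero by the L1 rule, whereas the L2 rule divides everything by the same factor.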

**ElasticNet:**
- **Formula:** $\min ||y - X\beta||_2^2 + \alpha_1 ||\beta||_1 + \alpha_2 ||\beta||_2^2$
- **Use:** Combines benefits of both (sparse + grouped selection)

**Interview Tip:** Draw the L1 (diamond) vs L2 (circle) constraint regions!

In [None]:
# Compare Ridge, Lasso, ElasticNet
print("🎯 REGULARIZATION COMPARISON: Ridge vs Lasso vs ElasticNet")
print("="*70)

# Generate data with correlated features
np.random.seed(42)
n_samples = 100
n_features = 20
n_informative = 5  # Only 5 features are truly important

X, y = make_regression(n_samples=n_samples, n_features=n_features, 
                       n_informative=n_informative, noise=10, random_state=42)

# Add multicollinearity (some features highly correlated)
X[:, 10] = X[:, 0] + np.random.randn(n_samples) * 0.1
X[:, 11] = X[:, 1] + np.random.randn(n_samples) * 0.1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize (important for regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train different models with same alpha
alpha = 1.0

models = {
    'Linear Regression': LinearRegression(),
    'Ridge (L2)': Ridge(alpha=alpha),
    'Lasso (L1)': Lasso(alpha=alpha),
    'ElasticNet': ElasticNet(alpha=alpha, l1_ratio=0.5)
}

results = {}

print("\n📊 Training Models...\n")
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    
    train_score = model.score(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    
    # Get coefficients
    if hasattr(model, 'coef_'):
        coef = model.coef_
        n_nonzero = np.sum(np.abs(coef) > 1e-5)
    else:
        coef = None
        n_nonzero = n_features
    
    results[name] = {
        'train_r2': train_score,
        'test_r2': test_score,
        'coef': coef,
        'n_nonzero': n_nonzero
    }
    
    print(f"{name}:")
    print(f"  Train R²: {train_score:.4f}")
    print(f"  Test R²: {test_score:.4f}")
    print(f"  Non-zero coefficients: {n_nonzero}/{n_features}")
    print(f"  Overfitting gap: {train_score - test_score:.4f}")
    print()

print("\n💡 Observations:")
print("  • Linear Regression: Likely overfits (high train, lower test)")
print("  • Ridge: Reduces overfitting, keeps all features")
print("  • Lasso: Sparse solution (many zeros), automatic feature selection")
print("  • ElasticNet: Balance between Ridge and Lasso")

In [None]:
# Visualize regularization effects
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Plot 1: Coefficient values comparison
ax = axes[0, 0]
x_pos = np.arange(n_features)
width = 0.2

for i, (name, result) in enumerate(list(results.items())[1:]):  # Skip Linear Regression
    ax.bar(x_pos + i*width, result['coef'], width, label=name, alpha=0.7)

ax.set_xlabel('Feature Index', fontsize=12)
ax.set_ylabel('Coefficient Value', fontsize=12)
ax.set_title('Coefficient Comparison\n(Notice Lasso zeros)', fontweight='bold', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Plot 2: Number of non-zero coefficients
ax = axes[0, 1]
model_names = list(results.keys())
n_nonzeros = [results[name]['n_nonzero'] for name in model_names]
colors = ['gray', 'skyblue', 'coral', 'lightgreen']

bars = ax.bar(model_names, n_nonzeros, color=colors, edgecolor='black', linewidth=2)
ax.axhline(y=n_informative, color='red', linestyle='--', linewidth=2, 
           label=f'True informative: {n_informative}')
ax.set_ylabel('Number of Non-zero Features', fontsize=12)
ax.set_title('Feature Selection\n(Lasso achieves sparsity)', fontweight='bold', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

for bar, val in zip(bars, n_nonzeros):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(val)}', ha='center', va='bottom', fontweight='bold')

# Plot 3: Train vs Test R² (overfitting detection)
ax = axes[0, 2]
train_scores = [results[name]['train_r2'] for name in model_names]
test_scores = [results[name]['test_r2'] for name in model_names]

x = np.arange(len(model_names))
width = 0.35
ax.bar(x - width/2, train_scores, width, label='Train R²', alpha=0.7, color='green')
ax.bar(x + width/2, test_scores, width, label='Test R²', alpha=0.7, color='orange')

ax.set_ylabel('R² Score', fontsize=12)
ax.set_title('Train vs Test Performance\n(Gap = Overfitting)', fontweight='bold', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(model_names, rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Plot 4: Regularization path for Lasso
ax = axes[1, 0]
alphas = np.logspace(-3, 3, 100)
coefs = []

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    coefs.append(lasso.coef_)

coefs = np.array(coefs)

for i in range(n_features):
    ax.plot(alphas, coefs[:, i], alpha=0.6)

ax.set_xscale('log')
ax.set_xlabel('Alpha (Regularization Strength)', fontsize=12)
ax.set_ylabel('Coefficient Value', fontsize=12)
ax.set_title('Lasso Regularization Path\n(Coefficients → 0 as α increases)', 
             fontweight='bold', fontsize=14)
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# Plot 5: Regularization path for Ridge
ax = axes[1, 1]
coefs_ridge = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    coefs_ridge.append(ridge.coef_)

coefs_ridge = np.array(coefs_ridge)

for i in range(n_features):
    ax.plot(alphas, coefs_ridge[:, i], alpha=0.6)

ax.set_xscale('log')
ax.set_xlabel('Alpha (Regularization Strength)', fontsize=12)
ax.set_ylabel('Coefficient Value', fontsize=12)
ax.set_title('Ridge Regularization Path\n(Shrinks but never zero)', 
             fontweight='bold', fontsize=14)
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)

# Plot 6: Geometric interpretation (2D case)
ax = axes[1, 2]

# Draw constraint regions
theta = np.linspace(0, 2*np.pi, 100)
t = 1

# L2 (circle)
l2_x = t * np.cos(theta)
l2_y = t * np.sin(theta)
ax.plot(l2_x, l2_y, 'b-', linewidth=3, label='L2 (Ridge)', alpha=0.7)
ax.fill(l2_x, l2_y, alpha=0.2, color='blue')

# L1 (diamond)
l1_x = np.array([t, 0, -t, 0, t])
l1_y = np.array([0, t, 0, -t, 0])
ax.plot(l1_x, l1_y, 'r-', linewidth=3, label='L1 (Lasso)', alpha=0.7)
ax.fill(l1_x, l1_y, alpha=0.2, color='red')

# Draw contours of loss function
x = np.linspace(-1.5, 1.5, 100)
y = np.linspace(-1.5, 1.5, 100)
X_grid, Y_grid = np.meshgrid(x, y)
# Elliptical contours centered at (1, 0.5)
Z = (X_grid - 1)**2 + 2*(Y_grid - 0.5)**2
ax.contour(X_grid, Y_grid, Z, levels=10, alpha=0.4, colors='gray')

# Mark illustrative points (hand-placed for the sketch, not computed optima)
ax.plot(1, 0.5, 'k*', markersize=20, label='OLS solution')
ax.plot(0.7, 0.3, 'bo', markersize=12, label='Ridge solution')
ax.plot(1, 0, 'ro', markersize=12, label='Lasso solution')

ax.set_xlabel('β₁', fontsize=12)
ax.set_ylabel('β₂', fontsize=12)
ax.set_title('Geometric Interpretation\n(L1 hits axes → sparsity)', 
             fontweight='bold', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Lasso creates EXACT zeros (sparse solution)")
print("  • Ridge shrinks all coefficients but keeps them all")
print("  • Geometric view: L1 diamond has corners → hits axes → zero coefficients")
print("  • Regularization path: Watch how coefficients change with α")

print("\n🎯 Interview Answer Template:")
print("  'L1 (Lasso) produces sparse solutions because its constraint region")
print("   has corners at the axes. When the loss function contours hit these")
print("   corners, some coefficients become exactly zero. L2 (Ridge) has a")
print("   circular constraint, so coefficients shrink smoothly but rarely hit")
print("   zero. Use Lasso for feature selection, Ridge when you want to keep")
print("   all features but reduce multicollinearity.'")

### 1.2 Logistic Regression - Classification

**Interview Question:** *"Explain logistic regression. Why is it called 'regression' when it does classification?"*

**Answer:**

**Logistic Regression:**
- **Model:** $P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta^T x)}}$ (sigmoid function)
- **Loss:** Binary Cross-Entropy (Log Loss)
- **Optimization:** Iterative solvers such as gradient descent, Newton's method, or L-BFGS (no closed-form solution)
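
To make the model, loss, and optimization concrete, here is a minimal from-scratch sketch using plain batch gradient descent on the log loss. The learning rate, iteration count, and toy data are arbitrary assumptions, and the function names are illustrative; library solvers are considerably more sophisticated.

```python
import numpy as np

def sigmoid(z):
    # Clip the input to avoid overflow in exp for extreme values
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def log_loss_value(y, p, eps=1e-12):
    # Binary cross-entropy, with probabilities clipped away from 0 and 1
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_gd(X, y, lr=0.1, n_iter=500):
    """Batch gradient descent on the mean log loss; returns weights, bias, loss history."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        losses.append(log_loss_value(y, p))
        error = p - y                      # gradient of the loss w.r.t. the logit
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b, losses

# Hypothetical toy data: two Gaussian blobs in 2D
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
                   rng.normal(+2.0, 1.0, size=(100, 2))])
y_toy = np.concatenate([np.zeros(100), np.ones(100)])

w_toy, b_toy, losses_toy = fit_logistic_gd(X_toy, y_toy)
acc_toy = np.mean((sigmoid(X_toy @ w_toy + b_toy) >= 0.5) == y_toy)

print(f"Initial loss: {losses_toy[0]:.4f}, final loss: {losses_toy[-1]:.4f}")
print(f"Training accuracy: {acc_toy:.3f}")
```

With zero-initialized weights every prediction starts at 0.5, so the first loss value equals $\log 2$; the loss then falls as the weights align with the direction separating the blobs.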

**Why called 'regression':**
- Models a continuous probability (between 0 and 1)
- Uses linear regression framework with sigmoid transform
- Predicts log-odds: $\log(\frac{p}{1-p}) = \beta_0 + \beta^T x$

**Key Properties:**
- Outputs probabilities that are often reasonably well calibrated (worth verifying on held-out data)
- Linear decision boundary
- Coefficients interpretable as log-odds ratios
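
As a worked example of the log-odds interpretation: a coefficient $\beta_j$ means that a one-unit increase in $x_j$ multiplies the odds by $e^{\beta_j}$. The value `0.7` below is a hypothetical coefficient chosen for illustration, not one taken from a fitted model.

```python
import math

beta_demo = 0.7                      # hypothetical coefficient
odds_ratio = math.exp(beta_demo)     # multiplicative change in odds per unit of x

def prob_from_logit(logit):
    """Convert log-odds to a probability with the sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))

# Start at log-odds 0 (probability 0.5) and increase the feature by one unit
p_before = prob_from_logit(0.0)
p_after = prob_from_logit(0.0 + beta_demo)

odds_before = p_before / (1 - p_before)
odds_after = p_after / (1 - p_after)

print(f"Odds ratio exp(beta): {odds_ratio:.4f}")
print(f"Probability: {p_before:.3f} -> {p_after:.3f}")
print(f"Odds:        {odds_before:.3f} -> {odds_after:.3f}")
```

The ratio of the two odds equals `exp(beta)` regardless of the starting point, whereas the change in probability depends on where on the sigmoid you start.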
- Fast training and prediction

**When to Use:**
- ✅ Need probability estimates
- ✅ Interpretability is important
- ✅ Classes are approximately linearly separable (with perfect separation, use regularization so the coefficients stay finite)
- ✅ High-dimensional data (with regularization)
- ❌ Complex non-linear boundaries

**Multiclass:** Use softmax (multinomial logistic regression) or one-vs-rest

**Source:** "Pattern Recognition and Machine Learning" Chapter 4

In [None]:
# Logistic Regression comprehensive example
print("🎯 LOGISTIC REGRESSION - BINARY CLASSIFICATION")
print("="*70)

# Load breast cancer dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target

print(f"\n📊 Dataset: {data.DESCR.split(chr(10))[0]}")
print(f"  Samples: {X.shape[0]}")
print(f"  Features: {X.shape[1]}")
print(f"  Classes: {np.unique(y)} ({data.target_names})")
print(f"  Class distribution: {np.bincount(y)}")

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train models with different regularization
models_logreg = {
    'No Regularization': LogisticRegression(penalty=None, max_iter=10000, random_state=42),
    'L2 (Ridge)': LogisticRegression(penalty='l2', C=1.0, max_iter=10000, random_state=42),
    'L1 (Lasso)': LogisticRegression(penalty='l1', C=1.0, solver='liblinear', 
                                      max_iter=10000, random_state=42),
}

print("\n🔧 Training Logistic Regression Models...\n")

for name, model in models_logreg.items():
    model.fit(X_train_scaled, y_train)
    
    # Predictions
    y_pred = model.predict(X_test_scaled)
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    
    # Metrics
    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)
    
    # Count non-zero coefficients
    n_nonzero = np.sum(np.abs(model.coef_) > 1e-5)
    
    print(f"{name}:")
    print(f"  Accuracy: {acc:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1-Score: {f1:.4f}")
    print(f"  ROC-AUC: {auc:.4f}")
    print(f"  Non-zero features: {n_nonzero}/{X.shape[1]}")
    print()

# Use the L2-regularized model for coefficient inspection
best_model = models_logreg['L2 (Ridge)']

# Get feature importance
feature_importance = np.abs(best_model.coef_[0])
top_features_idx = np.argsort(feature_importance)[-10:][::-1]

print("\n🔝 Top 10 Most Important Features:")
for idx in top_features_idx:
    print(f"  {data.feature_names[idx]:<30} : {best_model.coef_[0][idx]:>10.4f}")

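print("\n💡 Coefficient Interpretation (target encoding: 0 = malignant, 1 = benign):")
print("  Positive coefficient → increases probability of benign (class 1)")
print("  Negative coefficient → increases probability of malignant (class 0)")
print("  Magnitude → relative importance (comparable because features are standardized)")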

## 🌳 Part 2: Tree-Based Models - Non-Linear Power

**Interview Question:** *"Explain how a decision tree works and when to use it."*

**Answer:**

**Decision Tree Algorithm:**
1. Start with all data at root
2. Find best feature and split point to maximize information gain
3. Split data into left and right child nodes
4. Recursively repeat until stopping criterion

**Splitting Criteria:**
- **Classification:** Gini impurity or Entropy (Information Gain)
  - Gini: $G = 1 - \sum_{i=1}^{n} p_i^2$
  - Entropy: $H = -\sum_{i=1}^{n} p_i \log(p_i)$
- **Regression:** MSE or MAE
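
The impurity formulas and the split search can be sketched in a few lines of plain Python. This is a simplified illustration for a single numeric feature: it scans midpoints between sorted unique values and uses base-2 logarithms for entropy (a choice; the formula above leaves the base open). The function names and toy data are assumptions for the example.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, impurity=gini):
    """Return (threshold, impurity decrease) of the best split on one feature."""
    parent = impurity(labels)
    n = len(labels)
    uniq = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(uniq, uniq[1:]):
        thr = (lo + hi) / 2  # candidate threshold between adjacent values
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        # Weighted average impurity of the two children
        child = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        gain = parent - child
        if gain > best[1]:
            best = (thr, gain)
    return best

# Hypothetical toy feature with a clean class boundary between 3 and 10
x_split = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_split = [0, 0, 0, 1, 1, 1]

print(f"Gini at root:    {gini(y_split):.3f}")
print(f"Entropy at root: {entropy(y_split):.3f}")
print(f"Best split (threshold, gain): {best_threshold(x_split, y_split)}")
```

A real tree repeats this search over every feature at every node and recurses into the two children until a stopping criterion is met.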

**Advantages:**
- ✅ No feature scaling needed
- ✅ Handles non-linear relationships
- ✅ Captures interactions automatically
- ✅ Interpretable (can visualize tree)
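- ✅ Can handle missing values (implementation-dependent; some libraries require imputation first)
- ✅ Works with categorical and numerical features (scikit-learn trees need categoricals encoded as numbers)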

**Disadvantages:**
- ❌ Prone to overfitting (high variance)
- ❌ Unstable (small data change → different tree)
- ❌ Biased toward features with more levels
- ❌ Creates axis-parallel boundaries only

**Hyperparameters:**
- `max_depth`: Maximum tree depth
- `min_samples_split`: Minimum samples to split node
- `min_samples_leaf`: Minimum samples in leaf
- `max_features`: Features to consider for split
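
These hyperparameters are usually tuned with cross-validation rather than by hand. The sketch below shows one way to do that with `GridSearchCV`; the grid values and the synthetic dataset are arbitrary assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic dataset for the demonstration
X_gs, y_gs = make_classification(n_samples=300, n_features=5, n_informative=3,
                                 n_redundant=0, random_state=0)

# Small illustrative grid; real searches often cover more values
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X_gs, y_gs)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

Restricting `max_depth` or raising `min_samples_leaf` trades variance for bias, so the best combination depends on the dataset size and noise level.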

**Solution to overfitting:** Use ensemble methods (Random Forest, Boosting)

**Source:** "The Elements of Statistical Learning" Chapter 9

In [None]:
# Decision Tree comprehensive example
print("🌳 DECISION TREE - FROM BASICS TO ADVANCED")
print("="*70)

# Generate synthetic dataset with non-linear boundary
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=2,
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\n📊 Dataset: Synthetic 2D Classification")
print(f"  Training samples: {X_train.shape[0]}")
print(f"  Test samples: {X_test.shape[0]}")
print(f"  Features: {X_train.shape[1]}")

# Train trees with different depths
depths = [1, 2, 3, 5, 10, None]  # None = unlimited
trees = {}

print("\n🌳 Training Decision Trees with Different Depths...\n")

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    
    trees[depth] = {
        'model': tree,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'n_nodes': tree.tree_.node_count,
        'n_leaves': tree.tree_.n_leaves
    }
    
    depth_str = f"{depth}" if depth is not None else "Unlimited"
    print(f"Depth {depth_str:>9}:")
    print(f"  Train Accuracy: {train_acc:.4f}")
    print(f"  Test Accuracy:  {test_acc:.4f}")
    print(f"  Nodes: {tree.tree_.node_count}, Leaves: {tree.tree_.n_leaves}")
    print(f"  Overfitting: {train_acc - test_acc:.4f}")
    print()

print("\n💡 Observation: Deep trees overfit (perfect train, poor test)!")

In [None]:
# Visualize decision trees and decision boundaries
fig = plt.figure(figsize=(20, 12))

# Plot decision boundaries for different depths
plot_depths = [1, 2, 5, None]
h = 0.02  # mesh step size

# Create mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

for idx, depth in enumerate(plot_depths):
    # Decision boundary
    ax = plt.subplot(3, 4, idx + 1)
    
    tree = trees[depth]['model']
    Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='RdYlBu', 
               edgecolors='black', s=30, alpha=0.7)
    
    depth_str = f"{depth}" if depth is not None else "Unlimited"
    ax.set_title(f'Depth = {depth_str}\nTest Acc: {trees[depth]["test_acc"]:.3f}',
                fontweight='bold')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    
    # Visualize tree structure
    if depth in [1, 2]:  # Only for shallow trees (readable)
        ax_tree = plt.subplot(3, 4, idx + 5)
        plot_tree(tree, filled=True, ax=ax_tree, fontsize=8,
                 class_names=['Class 0', 'Class 1'],
                 feature_names=['Feature 1', 'Feature 2'])
        ax_tree.set_title(f'Tree Structure (Depth={depth})', fontweight='bold')

# Plot overfitting curve
ax = plt.subplot(3, 4, 9)
depths_plot = [d if d is not None else 20 for d in depths]
train_accs = [trees[d]['train_acc'] for d in depths]
test_accs = [trees[d]['test_acc'] for d in depths]

ax.plot(depths_plot, train_accs, 'o-', label='Training Accuracy', linewidth=2, markersize=8)
ax.plot(depths_plot, test_accs, 's-', label='Test Accuracy', linewidth=2, markersize=8)
ax.set_xlabel('Max Depth', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Overfitting vs Tree Depth\n(Gap shows overfitting)', fontweight='bold', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

# Feature importance (for full tree)
ax = plt.subplot(3, 4, 10)
full_tree = trees[None]['model']
importances = full_tree.feature_importances_
ax.barh(['Feature 1', 'Feature 2'], importances, color=['skyblue', 'coral'], edgecolor='black')
ax.set_xlabel('Importance', fontsize=12)
ax.set_title('Feature Importance\n(Gini importance)', fontweight='bold', fontsize=14)
ax.grid(True, alpha=0.3, axis='x')

# Model complexity (nodes vs depth)
ax = plt.subplot(3, 4, 11)
nodes = [trees[d]['n_nodes'] for d in depths]
leaves = [trees[d]['n_leaves'] for d in depths]

ax.plot(depths_plot, nodes, 'o-', label='Total Nodes', linewidth=2, markersize=8)
ax.plot(depths_plot, leaves, 's-', label='Leaf Nodes', linewidth=2, markersize=8)
ax.set_xlabel('Max Depth', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_title('Tree Complexity\n(More nodes = more complex)', fontweight='bold', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)

# Summary comparison table
ax = plt.subplot(3, 4, 12)
ax.axis('off')

table_data = [['Depth', 'Train Acc', 'Test Acc', 'Gap', 'Nodes']]
for depth in depths:
    d_str = str(depth) if depth is not None else 'Unl'
    train = trees[depth]['train_acc']
    test = trees[depth]['test_acc']
    gap = train - test
    n_nodes = trees[depth]['n_nodes']  # separate name so the `nodes` list above is not overwritten
    table_data.append([d_str, f'{train:.3f}', f'{test:.3f}', f'{gap:.3f}', str(n_nodes)])

table = ax.table(cellText=table_data, cellLoc='center', loc='center',
                colWidths=[0.2, 0.2, 0.2, 0.2, 0.2])
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1, 2)

for i in range(len(table_data[0])):
    table[(0, i)].set_facecolor('lightblue')
    table[(0, i)].set_text_props(weight='bold')

ax.set_title('Performance Summary', fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Shallow trees (depth 1-2): Underfit, simple boundaries")
print("  • Medium trees (depth 3-5): Good balance, generalize well")
print("  • Deep trees (unlimited): Overfit, complex boundaries")
print("  • Decision boundaries are axis-parallel (limitation of trees)")

print("\n🎯 Interview Answer Template:")
print("  'Decision trees recursively split the feature space using greedy")
print("   algorithm to maximize information gain. At each node, we choose")
print("   the feature and threshold that best separates the classes. The main")
print("   challenge is overfitting - deep trees memorize training data. We")
print("   control this with max_depth, min_samples_split, and pruning. In")
print("   practice, ensemble methods like Random Forest work better.'")

## 🎯 Quick Algorithm Comparison Matrix

This table will help you choose the right algorithm in interviews:

| Algorithm | Speed | Interpretability | Handles Non-linear | Scaling Needed | Handles Outliers | Feature Selection |
|-----------|-------|------------------|-------------------|----------------|------------------|-------------------|
| **Linear Regression** | ⚡⚡⚡ | ⭐⭐⭐ | ❌ | Only with regularization | ❌ | With L1 |
| **Logistic Regression** | ⚡⚡⚡ | ⭐⭐⭐ | ❌ | ✅ | ❌ | With L1 |
| **Decision Tree** | ⚡⚡ | ⭐⭐ | ✅ | ❌ | ✅ | ✅ |
| **Random Forest** | ⚡ | ⭐ | ✅ | ❌ | ✅ | ✅ |
| **XGBoost** | ⚡⚡ | ⭐ | ✅ | ❌ | ✅ | ✅ |
| **SVM** | ⚡ | ⭐ | ✅ (kernel) | ✅ | ❌ | ❌ |
| **KNN** | ⚡ (predict slow) | ⭐⭐ | ✅ | ✅ | ❌ | ❌ |
| **Naive Bayes** | ⚡⚡⚡ | ⭐⭐ | ❌ | ❌ | ✅ | ❌ |

**Legend:** More ⚡ = faster; more ⭐ = more interpretable; ✅ = yes; ❌ = no
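
A table like this is a starting point; on a concrete dataset it helps to compare candidates under the same cross-validation protocol. The sketch below does that for a few of the algorithms on the breast cancer dataset. The model settings are assumptions chosen for brevity, and the scale-sensitive models are wrapped in a pipeline so that scaling is fit inside each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_cmp, y_cmp = load_breast_cancer(return_X_y=True)

# Scale-sensitive models get a StandardScaler; trees and Naive Bayes do not need one
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Gaussian NB": GaussianNB(),
}

cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_cmp, y_cmp, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name:<20} accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

Comparing mean and spread across folds gives a fairer picture than a single train/test split, especially on a dataset of only a few hundred samples.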