# Logistic Regression: Complete Guide

A comprehensive guide to logistic regression covering theory, implementation, regularization, and practical applications for classification problems.

## Table of Contents
1. [Introduction](#introduction)
2. [Mathematical Foundation](#mathematical-foundation)
3. [Implementation from Scratch](#implementation-from-scratch)
4. [Regularization Techniques](#regularization-techniques)
5. [Model Evaluation](#model-evaluation)
6. [Real-World Applications](#real-world-applications)
7. [Best Practices](#best-practices)
8. [Conclusion](#conclusion)

## Introduction

Logistic regression is a fundamental classification algorithm that uses the logistic function to model the probability of binary or categorical outcomes. Despite its name, it is used for classification: the "regression" part refers to fitting a linear model to the log-odds, and the resulting probability is thresholded to pick a class.

### Key Characteristics
- **Supervised Learning**: Requires labeled training data
- **Classification Task**: Predicts discrete categories/classes
- **Probabilistic**: Outputs probabilities between 0 and 1
- **Linear Decision Boundary**: Creates linear separation between classes
- **Interpretable**: Each coefficient is the change in log-odds per unit increase in its feature (exponentiate it to get an odds ratio)

### When to Use Logistic Regression
- **Binary Classification**: Two-class problems (spam/not spam, disease/healthy)
- **Multi-class Classification**: Multiple categories (with extensions)
- **Probability Estimation**: When you need class probabilities
- **Baseline Model**: Starting point for classification problems
- **Interpretability**: When understanding feature importance is crucial

### Advantages
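- Simple and fast to train
- Few hyperparameters to tune (mainly the regularization strength, plus the learning rate when trained by gradient descent)
- Works without feature scaling in principle, though scaling helps gradient descent converge and lets regularization treat features evenly
- Less prone to overfitting with low-dimensional data
- Provides probability estimates
- Makes no assumption about how the features are distributed within each class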

### Disadvantages
- Assumes linear relationship between features and log-odds
- Sensitive to outliers
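- Needs enough samples relative to the number of features for stable coefficient estimates
- Cannot capture nonlinear relationships or feature interactions unless they are added as explicit features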

## Mathematical Foundation

### 1. The Logistic Function (Sigmoid)

The core of logistic regression is the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $z = w^T x + b$

**Properties of Sigmoid:**
- Output range: $(0, 1)$
- S-shaped curve
- Smooth and differentiable
- $\sigma(0) = 0.5$
- $\lim_{z \to -\infty} \sigma(z) = 0$, $\lim_{z \to +\infty} \sigma(z) = 1$

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Sigmoid activation function"""
    # Clip z to prevent overflow
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

# Visualize the sigmoid function
z = np.linspace(-10, 10, 100)
y = sigmoid(z)

plt.figure(figsize=(10, 6))
plt.plot(z, y, 'b-', linewidth=2, label='Sigmoid Function')
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision Boundary (0.5)')
plt.axvline(x=0, color='g', linestyle='--', alpha=0.7, label='z = 0')
plt.xlabel('z = w^T x + b')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.grid(True, alpha=0.3)
plt.legend()
plt.ylim(-0.1, 1.1)
plt.show()

print(f"σ(-5) = {sigmoid(-5):.4f}")
print(f"σ(0) = {sigmoid(0):.4f}")
print(f"σ(5) = {sigmoid(5):.4f}")

### 2. Probability Interpretation

For binary classification:
- $P(y=1|x) = \sigma(w^T x + b)$
- $P(y=0|x) = 1 - \sigma(w^T x + b)$

The decision boundary occurs when $P(y=1|x) = 0.5$, i.e., when $w^T x + b = 0$.
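
A small numeric check of this equivalence, as a sketch: the weights `w`, bias `b`, and sample points below are made-up values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and points, chosen only for illustration
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 1.0],    # z = 2 - 1 + 0.5 = 1.5
              [0.0, 0.5],    # z = 0 - 0.5 + 0.5 = 0.0 (exactly on the boundary)
              [-1.0, 2.0]])  # z = -2 - 2 + 0.5 = -3.5

z = X @ w + b
p = sigmoid(z)

# Thresholding the probability at 0.5 is the same as checking the sign of z
pred_from_proba = (p >= 0.5).astype(int)
pred_from_score = (z >= 0).astype(int)

print("z:", z)
print("from probability:", pred_from_proba)
print("from score sign: ", pred_from_score)
```

Both rules give the same labels, which is why the boundary $w^T x + b = 0$ is a straight line (or hyperplane) in feature space.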

### 3. Odds and Log-Odds

**Odds:**
$$\text{Odds} = \frac{P(y=1|x)}{P(y=0|x)} = \frac{\sigma(z)}{1-\sigma(z)} = e^z$$

**Log-Odds (Logit):**
$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = z = w^T x + b$$

This shows that logistic regression models the log-odds as a linear function of the features.
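
The identity $\text{Odds} = e^z$ used above follows in two steps from the definition of the sigmoid:

$$1 - \sigma(z) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}$$

$$\frac{\sigma(z)}{1-\sigma(z)} = \frac{1/(1 + e^{-z})}{e^{-z}/(1 + e^{-z})} = \frac{1}{e^{-z}} = e^{z}$$

Taking the natural logarithm of both sides gives $\text{logit}(p) = z$.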

In [None]:
# Demonstrate the relationship between probability, odds, and log-odds
probabilities = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
odds = probabilities / (1 - probabilities)
log_odds = np.log(odds)

print("Probability\tOdds\t\tLog-Odds")
print("-" * 40)
for p, o, lo in zip(probabilities, odds, log_odds):
    print(f"{p:.1f}\t\t{o:.2f}\t\t{lo:.2f}")

# Visualize the relationships
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Probability vs Log-Odds
z_range = np.linspace(-5, 5, 100)
prob_range = sigmoid(z_range)
axes[0].plot(z_range, prob_range, 'b-', linewidth=2)
axes[0].set_xlabel('Log-Odds (z)')
axes[0].set_ylabel('Probability')
axes[0].set_title('Probability vs Log-Odds')
axes[0].grid(True, alpha=0.3)

# Odds vs Log-Odds
odds_range = np.exp(z_range)
axes[1].plot(z_range, odds_range, 'r-', linewidth=2)
axes[1].set_xlabel('Log-Odds (z)')
axes[1].set_ylabel('Odds')
axes[1].set_title('Odds vs Log-Odds')
axes[1].set_yscale('log')
axes[1].grid(True, alpha=0.3)

# Probability vs Odds
axes[2].plot(prob_range, odds_range, 'g-', linewidth=2)
axes[2].set_xlabel('Probability')
axes[2].set_ylabel('Odds')
axes[2].set_title('Probability vs Odds')
axes[2].set_yscale('log')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 4. Cost Function (Cross-Entropy Loss)

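Unlike linear regression, mean squared error is a poor fit here: once the sigmoid is applied, MSE is no longer convex in the parameters, so gradient descent can stall in flat regions or poor local minima. Instead, we maximize the log-likelihood, which is equivalent to minimizing the cross-entropy loss: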

**For a single example:**
$$\ell(y, \hat{y}) = y \log(\hat{y}) + (1-y) \log(1-\hat{y})$$

**Cost function (negative log-likelihood):**
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)})]$$

Where $\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$

In [None]:
def compute_logistic_cost(y_true, y_pred_proba):
    """
    Compute logistic regression cost (cross-entropy)
    
    Args:
        y_true: actual labels (0 or 1)
        y_pred_proba: predicted probabilities
    
    Returns:
        cost: cross-entropy cost
    """
    # Add small epsilon to prevent log(0)
    epsilon = 1e-15
    y_pred_clipped = np.clip(y_pred_proba, epsilon, 1 - epsilon)
    
    cost = -np.mean(y_true * np.log(y_pred_clipped) + 
                   (1 - y_true) * np.log(1 - y_pred_clipped))
    return cost

# Example: Compare costs for different predictions
y_true = np.array([1, 0, 1, 0, 1])
y_pred_good = np.array([0.9, 0.1, 0.8, 0.2, 0.85])  # Good predictions
y_pred_bad = np.array([0.3, 0.7, 0.4, 0.6, 0.45])   # Bad predictions

cost_good = compute_logistic_cost(y_true, y_pred_good)
cost_bad = compute_logistic_cost(y_true, y_pred_bad)

print(f"True labels: {y_true}")
print(f"Good predictions: {y_pred_good}")
print(f"Bad predictions: {y_pred_bad}")
print(f"\nCost (good predictions): {cost_good:.4f}")
print(f"Cost (bad predictions): {cost_bad:.4f}")
print(f"\nLower cost indicates better predictions!")

### 5. Gradient Descent for Logistic Regression

**Gradients:**
$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) x_j^{(i)}$$

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})$$
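
These expressions follow from the chain rule. A short derivation sketch for a single example, writing $z = w^T x + b$ and $\hat{y} = \sigma(z)$:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z)) = \hat{y}\,(1 - \hat{y})$$

$$\frac{\partial}{\partial z}\left[-y \log(\hat{y}) - (1-y) \log(1-\hat{y})\right] = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}\,(1-\hat{y}) = -y\,(1-\hat{y}) + (1-y)\,\hat{y} = \hat{y} - y$$

Since $\partial z / \partial w_j = x_j$ and $\partial z / \partial b = 1$, averaging over the $m$ examples gives the two gradients above.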

**Parameter Updates:**
$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}$$
$$b := b - \alpha \frac{\partial J}{\partial b}$$

Note: The gradient form is identical to linear regression, but $\hat{y}$ is computed using the sigmoid function.
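
One way to gain confidence in these formulas is a finite-difference gradient check. The sketch below compares the analytic gradients with central differences on a small synthetic problem; the helper names (`cost`, `analytic_gradients`, `numerical_gradients`) and the random data are illustrative assumptions, not part of the guide's own code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Mean cross-entropy J(w, b)."""
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradients(w, b, X, y):
    """Closed-form gradients: mean of (y_hat - y) * x_j and mean of (y_hat - y)."""
    err = sigmoid(X @ w + b) - y
    return X.T @ err / len(y), np.mean(err)

def numerical_gradients(w, b, X, y, h=1e-6):
    """Central finite differences, one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        dw[j] = (cost(w + step, b, X, y) - cost(w - step, b, X, y)) / (2 * h)
    db = (cost(w, b + h, X, y) - cost(w, b - h, X, y)) / (2 * h)
    return dw, db

# Small random problem (synthetic data, illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
b = 0.2

dw_a, db_a = analytic_gradients(w, b, X, y)
dw_n, db_n = numerical_gradients(w, b, X, y)
max_diff = max(np.max(np.abs(dw_a - dw_n)), abs(db_a - db_n))
print(f"Largest analytic-vs-numerical difference: {max_diff:.2e}")
```

A difference many orders of magnitude smaller than the gradients themselves suggests the analytic formulas and the cost function agree.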

In [None]:
def compute_gradients(X, y_true, y_pred):
    """
    Compute gradients for logistic regression
    
    Args:
        X: feature matrix
        y_true: actual labels
        y_pred: predicted probabilities
    
    Returns:
        dw: gradient for weights
        db: gradient for bias
    """
    m = len(y_true)
    dw = (1/m) * np.dot(X.T, (y_pred - y_true))
    db = (1/m) * np.sum(y_pred - y_true)
    return dw, db

# Example gradient computation
X = np.array([[1, 2], [2, 3], [3, 1], [1, 3]])
y_true = np.array([1, 1, 0, 0])
weights = np.array([0.5, -0.3])
bias = 0.1

# Forward pass
z = np.dot(X, weights) + bias
y_pred = sigmoid(z)

# Compute gradients
dw, db = compute_gradients(X, y_true, y_pred)

print(f"Features: \n{X}")
print(f"True labels: {y_true}")
print(f"Predicted probabilities: {y_pred}")
print(f"\nWeight gradients: {dw}")
print(f"Bias gradient: {db:.4f}")

# Update parameters
learning_rate = 0.1
weights_new = weights - learning_rate * dw
bias_new = bias - learning_rate * db

print(f"\nUpdated weights: {weights_new}")
print(f"Updated bias: {bias_new:.4f}")

## Implementation from Scratch

Let's implement a complete Logistic Regression class from scratch:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LogisticRegression:
    def __init__(self, learning_rate=0.01, max_iterations=1000, tolerance=1e-6):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.tolerance = tolerance
        self.weights = None
        self.bias = None
        self.cost_history = []
    
    def sigmoid(self, z):
        """Sigmoid activation function with numerical stability"""
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def initialize_parameters(self, n_features):
        """Initialize weights and bias"""
        self.weights = np.zeros(n_features)
        self.bias = 0.0
    
    def predict_proba(self, X):
        """Predict class probabilities"""
        z = np.dot(X, self.weights) + self.bias
        return self.sigmoid(z)
    
    def predict(self, X, threshold=0.5):
        """Make binary predictions"""
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)
    
    def compute_cost(self, y_true, y_pred_proba):
        """Compute cross-entropy cost"""
        epsilon = 1e-15
        y_pred_clipped = np.clip(y_pred_proba, epsilon, 1 - epsilon)
        
        cost = -np.mean(y_true * np.log(y_pred_clipped) + 
                       (1 - y_true) * np.log(1 - y_pred_clipped))
        return cost
    
    def compute_gradients(self, X, y_true, y_pred_proba):
        """Compute gradients for weights and bias"""
        m = len(y_true)
        dw = (1/m) * np.dot(X.T, (y_pred_proba - y_true))
        db = (1/m) * np.sum(y_pred_proba - y_true)
        return dw, db
    
    def fit(self, X, y):
        """Train the model using gradient descent"""
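        # Initialize parameters and reset the cost history so that
        # calling fit() again does not append to a previous run
        self.initialize_parameters(X.shape[1])
        self.cost_history = []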
        
        # Training loop
        for i in range(self.max_iterations):
            # Forward pass
            y_pred_proba = self.predict_proba(X)
            
            # Compute cost
            cost = self.compute_cost(y, y_pred_proba)
            self.cost_history.append(cost)
            
            # Compute gradients
            dw, db = self.compute_gradients(X, y, y_pred_proba)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Progress reporting
            if i % 100 == 0:
                print(f"Iteration {i}: Cost = {cost:.6f}")
            
            # Early stopping
            if i > 0 and abs(self.cost_history[-2] - cost) < self.tolerance:
                print(f"Converged at iteration {i}")
                break
    
    def evaluate(self, X, y):
        """Evaluate model performance"""
        y_pred_proba = self.predict_proba(X)
        y_pred = self.predict(X)
        
        # Accuracy
        accuracy = np.mean(y_pred == y)
        
        # Log loss
        log_loss = self.compute_cost(y, y_pred_proba)
        
        return {'accuracy': accuracy, 'log_loss': log_loss}

print("✅ LogisticRegression class implemented successfully!")

### Example: Tumor Classification

Let's test our implementation with a medical diagnosis example:

In [None]:
# Generate sample tumor classification data
np.random.seed(42)
n_samples = 1000

# Features: tumor_size, patient_age (normalized)
tumor_size = np.random.normal(0, 1, n_samples)
patient_age = np.random.normal(0, 1, n_samples)

# Create feature matrix
X = np.column_stack([tumor_size, patient_age])

# Target: malignant (1) or benign (0) with some relationship to features
# Larger tumors and older patients more likely to be malignant
z_true = 0.8 * tumor_size + 0.3 * patient_age + 0.1
probabilities = sigmoid(z_true)
y = np.random.binomial(1, probabilities)

print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
print(f"Malignant rate: {y.mean():.3f}")

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\nTraining set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

In [None]:
# Train the model
model = LogisticRegression(learning_rate=0.01, max_iterations=1000)
model.fit(X_train, y_train)

# Evaluate the model
train_metrics = model.evaluate(X_train, y_train)
test_metrics = model.evaluate(X_test, y_test)

print(f"\nTraining Performance:")
print(f"Accuracy: {train_metrics['accuracy']:.4f}")
print(f"Log Loss: {train_metrics['log_loss']:.4f}")

print(f"\nTest Performance:")
print(f"Accuracy: {test_metrics['accuracy']:.4f}")
print(f"Log Loss: {test_metrics['log_loss']:.4f}")

print(f"\nLearned Parameters:")
print(f"Weights: {model.weights}")
print(f"Bias: {model.bias:.4f}")

# Interpret coefficients
print(f"\nCoefficient Interpretation:")
print(f"Tumor size coefficient: {model.weights[0]:.3f} (positive = larger tumors more likely malignant)")
print(f"Patient age coefficient: {model.weights[1]:.3f} (positive = older patients more likely malignant)")
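
Because the model is linear in the log-odds, exponentiating a coefficient gives an odds ratio: the factor by which the odds of the positive class are multiplied when that feature increases by one unit with the others held fixed. A minimal sketch, using the coefficients that generated the synthetic data above (0.8 and 0.3) as stand-ins for fitted weights:

```python
import numpy as np

# Stand-in coefficients: the values used to generate the synthetic data,
# not the weights learned by the model
coefficients = {"tumor_size": 0.8, "patient_age": 0.3}

# exp(coefficient) = multiplicative change in odds per unit increase
odds_ratios = {name: float(np.exp(w)) for name, w in coefficients.items()}

for name, ratio in odds_ratios.items():
    # Features are standardized, so one unit is one standard deviation
    print(f"{name}: odds ratio per +1 standard deviation = {ratio:.3f}")
```

An odds ratio above 1 means the feature raises the odds of the positive class; a value below 1 means it lowers them.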

In [None]:
# Visualize results
plt.figure(figsize=(15, 5))

# Plot 1: Learning curve
plt.subplot(1, 3, 1)
plt.plot(model.cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost (Log Loss)')
plt.title('Learning Curve')
plt.grid(True, alpha=0.3)

# Plot 2: Decision boundary
plt.subplot(1, 3, 2)
# Create a mesh for decision boundary
h = 0.1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Make predictions on the mesh
mesh_points = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict_proba(mesh_points)
Z = Z.reshape(xx.shape)

# Plot decision boundary and data points
plt.contourf(xx, yy, Z, levels=50, alpha=0.6, cmap='RdYlBu')
plt.contour(xx, yy, Z, levels=[0.5], colors='black', linestyles='--', linewidths=2)
scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='RdYlBu', edgecolors='black')
plt.colorbar(scatter)
plt.xlabel('Tumor Size (normalized)')
plt.ylabel('Patient Age (normalized)')
plt.title('Decision Boundary')

# Plot 3: Probability predictions
plt.subplot(1, 3, 3)
y_pred_proba = model.predict_proba(X_test)
plt.hist(y_pred_proba[y_test == 0], bins=20, alpha=0.7, label='Benign', color='blue')
plt.hist(y_pred_proba[y_test == 1], bins=20, alpha=0.7, label='Malignant', color='red')
plt.axvline(x=0.5, color='black', linestyle='--', label='Decision Threshold')
plt.xlabel('Predicted Probability')
plt.ylabel('Frequency')
plt.title('Probability Distribution')
plt.legend()

plt.tight_layout()
plt.show()

## Regularization Techniques

Just like linear regression, logistic regression can benefit from regularization to prevent overfitting.

### 1. Ridge Logistic Regression (L2 Regularization)

**Cost Function:**
$$J_{ridge}(w, b) = J(w, b) + \lambda \sum_{j=1}^{n} w_j^2$$

Where $J(w, b)$ is the standard logistic regression cost function and $\lambda$ (named `alpha` in the code) sets the penalty strength. The bias $b$ is not penalized.

In [None]:
class RidgeLogisticRegression(LogisticRegression):
    def __init__(self, alpha=1.0, learning_rate=0.01, max_iterations=1000, tolerance=1e-6):
        super().__init__(learning_rate, max_iterations, tolerance)
        self.alpha = alpha
    
    def compute_cost(self, y_true, y_pred_proba):
        """Compute cost with L2 regularization"""
        logistic_cost = super().compute_cost(y_true, y_pred_proba)
        l2_penalty = self.alpha * np.sum(self.weights ** 2)
        return logistic_cost + l2_penalty
    
    def compute_gradients(self, X, y_true, y_pred_proba):
        """Compute gradients with L2 regularization"""
        dw, db = super().compute_gradients(X, y_true, y_pred_proba)
        dw += 2 * self.alpha * self.weights
        return dw, db

print("✅ RidgeLogisticRegression class implemented!")

### 2. Lasso Logistic Regression (L1 Regularization)

**Cost Function:**
$$J_{lasso}(w, b) = J(w, b) + \lambda \sum_{j=1}^{n} |w_j|$$

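Where $J(w, b)$ is the standard logistic regression cost function and $\lambda$ (named `alpha` in the code) sets the penalty strength. The L1 penalty pushes small weights toward zero, which is why it is associated with feature selection.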

In [None]:
class LassoLogisticRegression(LogisticRegression):
    def __init__(self, alpha=1.0, learning_rate=0.01, max_iterations=1000, tolerance=1e-6):
        super().__init__(learning_rate, max_iterations, tolerance)
        self.alpha = alpha
    
    def compute_cost(self, y_true, y_pred_proba):
        """Compute cost with L1 regularization"""
        logistic_cost = super().compute_cost(y_true, y_pred_proba)
        l1_penalty = self.alpha * np.sum(np.abs(self.weights))
        return logistic_cost + l1_penalty
    
    def compute_gradients(self, X, y_true, y_pred_proba):
        """Compute gradients with L1 regularization"""
        dw, db = super().compute_gradients(X, y_true, y_pred_proba)
        # Subgradient of |w|: sign(w), taking 0 at w = 0
        l1_gradient = np.sign(self.weights)
        dw += self.alpha * l1_gradient
        return dw, db

print("✅ LassoLogisticRegression class implemented!")
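
One caveat with the subgradient update above: it shrinks weights toward zero but rarely lands on exactly zero, so it does not by itself produce sparse models. A common alternative is proximal gradient descent (ISTA), which takes a plain gradient step on the logistic cost and then applies soft-thresholding. The sketch below shows that swapped-in technique rather than the class above; `soft_threshold` and `fit_lasso_proximal` are illustrative names, and the data are synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink by t, set small values to 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_lasso_proximal(X, y, alpha=0.05, learning_rate=0.1, n_iter=2000):
    """L1-penalized logistic regression via proximal gradient descent (ISTA)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        err = sigmoid(X @ w + b) - y
        # Gradient step on the unpenalized cross-entropy
        w = w - learning_rate * (X.T @ err) / len(y)
        b = b - learning_rate * np.mean(err)
        # Proximal step handles the L1 penalty; the bias is not penalized
        w = soft_threshold(w, learning_rate * alpha)
    return w, b

# Synthetic data: only the first two of six features influence the label
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
true_w = np.array([1.5, -1.0, 0.0, 0.0, 0.0, 0.0])
y = rng.binomial(1, sigmoid(X @ true_w))

w, b = fit_lasso_proximal(X, y)
print("weights:", np.round(w, 3))
print("exact zeros:", int(np.sum(w == 0.0)))
```

The soft-thresholding step is what allows weights to become exactly zero; how many do so depends on `alpha` and on the data.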

In [None]:
# Compare regularization techniques
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features to demonstrate overfitting
poly = PolynomialFeatures(degree=3, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

print(f"Original features: {X_train.shape[1]}")
print(f"Polynomial features: {X_train_poly.shape[1]}")

# Train different models
models = {
    'Logistic': LogisticRegression(learning_rate=0.01, max_iterations=1000),
    'Ridge': RidgeLogisticRegression(alpha=0.1, learning_rate=0.01, max_iterations=1000),
    'Lasso': LassoLogisticRegression(alpha=0.01, learning_rate=0.01, max_iterations=1000)
}

results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train_poly, y_train)
    
    train_metrics = model.evaluate(X_train_poly, y_train)
    test_metrics = model.evaluate(X_test_poly, y_test)
    
    results[name] = {
        'train_acc': train_metrics['accuracy'],
        'test_acc': test_metrics['accuracy'],
        'weights_norm': np.linalg.norm(model.weights),
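        # Plain (sub)gradient descent rarely drives weights to exactly zero,
        # so count weights above a small tolerance instead
        'non_zero_weights': np.sum(np.abs(model.weights) > 1e-6)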
    }
    
    print(f"Train Acc: {train_metrics['accuracy']:.4f}, Test Acc: {test_metrics['accuracy']:.4f}")
    print(f"Weights norm: {results[name]['weights_norm']:.4f}")
    print(f"Non-zero weights: {results[name]['non_zero_weights']}/{len(model.weights)}")

## Model Evaluation

Classification requires different evaluation metrics than regression.

### Classification Metrics

**Accuracy:**
$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

**Precision:**
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}$$

**Recall (Sensitivity):**
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$$

**F1-Score:**
$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}}$$

**Specificity:**
$$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}}$$

In [None]:
def compute_classification_metrics(y_true, y_pred, y_pred_proba=None):
    """Compute comprehensive classification metrics"""
    
    # Confusion matrix components
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    
    # Basic metrics
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    metrics = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'specificity': specificity,
        'f1_score': f1,
        'confusion_matrix': np.array([[tn, fp], [fn, tp]])
    }
    
    # Add log loss if probabilities provided
    if y_pred_proba is not None:
        epsilon = 1e-15
        y_pred_clipped = np.clip(y_pred_proba, epsilon, 1 - epsilon)
        log_loss = -np.mean(y_true * np.log(y_pred_clipped) + 
                           (1 - y_true) * np.log(1 - y_pred_clipped))
        metrics['log_loss'] = log_loss
    
    return metrics

# Evaluate our best model
best_model = models['Ridge']  # Assuming Ridge performed best
y_pred = best_model.predict(X_test_poly)
y_pred_proba = best_model.predict_proba(X_test_poly)

metrics = compute_classification_metrics(y_test, y_pred, y_pred_proba)

print("📊 Comprehensive Model Evaluation:")
print(f"Accuracy: {metrics['accuracy']:.4f}")
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall: {metrics['recall']:.4f}")
print(f"Specificity: {metrics['specificity']:.4f}")
print(f"F1-Score: {metrics['f1_score']:.4f}")
print(f"Log Loss: {metrics['log_loss']:.4f}")

print(f"\nConfusion Matrix:")
print(f"[[TN={metrics['confusion_matrix'][0,0]}, FP={metrics['confusion_matrix'][0,1]}]")
print(f" [FN={metrics['confusion_matrix'][1,0]}, TP={metrics['confusion_matrix'][1,1]}]]")

### ROC Curve and AUC

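The Receiver Operating Characteristic (ROC) curve plots the true positive rate (recall) against the false positive rate (1 - specificity) as the decision threshold varies. The area under the curve (AUC) summarizes ranking quality in a single number: 0.5 corresponds to random guessing and 1.0 to perfect separation.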

In [None]:
def compute_roc_curve(y_true, y_pred_proba):
    """Compute ROC curve and AUC"""
    # Sweep thresholds from high to low so that FPR increases along the curve,
    # from (0, 0) to (1, 1); an ascending sweep would make the integral negative
    thresholds = np.linspace(1, 0, 100)
    tpr_list = []  # True Positive Rate
    fpr_list = []  # False Positive Rate
    
    for threshold in thresholds:
        y_pred = (y_pred_proba >= threshold).astype(int)
        
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    
    fpr_arr = np.array(fpr_list)
    tpr_arr = np.array(tpr_list)
    
    # Compute AUC using the trapezoidal rule (written out explicitly because
    # np.trapz is deprecated in NumPy 2.0 in favor of np.trapezoid)
    auc = np.sum(np.diff(fpr_arr) * (tpr_arr[1:] + tpr_arr[:-1]) / 2)
    
    return fpr_arr, tpr_arr, auc

# Compute and plot ROC curve
fpr, tpr, auc = compute_roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(12, 5))

# ROC Curve
plt.subplot(1, 2, 1)
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid(True, alpha=0.3)

# Confusion Matrix Heatmap
plt.subplot(1, 2, 2)
cm = metrics['confusion_matrix']
plt.imshow(cm, interpolation='nearest', cmap='Blues')
plt.title('Confusion Matrix')
plt.colorbar()
tick_marks = np.arange(2)
plt.xticks(tick_marks, ['Benign', 'Malignant'])
plt.yticks(tick_marks, ['Benign', 'Malignant'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

# Add text annotations
thresh = cm.max() / 2.
for i, j in np.ndindex(cm.shape):
    plt.text(j, i, format(cm[i, j], 'd'),
             horizontalalignment="center",
             color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()
plt.show()

print(f"AUC Score: {auc:.4f}")
print(f"AUC Interpretation:")
print(f"  0.9-1.0: Excellent")
print(f"  0.8-0.9: Good")
print(f"  0.7-0.8: Fair")
print(f"  0.6-0.7: Poor")
print(f"  0.5-0.6: Fail")
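
AUC also has a threshold-free reading: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that pairwise computation on made-up scores (the function name `auc_pairwise` is illustrative):

```python
import numpy as np

def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration
y_demo = np.array([0, 0, 1, 1, 0, 1])
s_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

auc_demo = auc_pairwise(y_demo, s_demo)
print(f"AUC = {auc_demo:.4f}")  # 8 of 9 pairs are ranked correctly
```

The pairwise matrix grows with the product of the class sizes, so this form suits small evaluation sets; the threshold sweep above scales better.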

## Real-World Applications

### 1. Medical Diagnosis
- **Binary**: Disease/No Disease
- **Features**: Symptoms, test results, demographics
- **Considerations**: High recall for serious conditions, interpretability for doctors

### 2. Marketing and Customer Analytics
- **Binary**: Purchase/No Purchase, Click/No Click
- **Features**: Demographics, behavior, history
- **Considerations**: Probability calibration, interpretability for business decisions

### 3. Fraud Detection
- **Binary**: Fraud/Legitimate
- **Features**: Transaction patterns, user behavior
- **Considerations**: Imbalanced data, real-time prediction, false positive costs

### 4. Spam Detection
- **Binary**: Spam/Not Spam
- **Features**: Text features, sender information
- **Considerations**: Feature engineering from text, concept drift

## Best Practices

### 1. Data Preparation
- **Handle Missing Values**: Imputation or removal
- **Feature Scaling**: Standardization recommended
- **Outlier Detection**: Can significantly impact results
- **Class Balance**: Address imbalanced datasets
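
For feature scaling, the usual pattern is to compute the mean and standard deviation on the training split only and reuse them for the test split, so that no test information leaks into training. A minimal NumPy sketch with synthetic data (the helper names `fit_standardizer` and `standardize` are illustrative):

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-feature mean and standard deviation from the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return mean, std

def standardize(X, mean, std):
    return (X - mean) / std

# Synthetic features on very different scales (illustration only)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50.0, 0.001], scale=[10.0, 0.0005], size=(200, 2))
X_test = rng.normal(loc=[50.0, 0.001], scale=[10.0, 0.0005], size=(50, 2))

mean, std = fit_standardizer(X_train)
X_train_scaled = standardize(X_train, mean, std)
X_test_scaled = standardize(X_test, mean, std)  # reuse the training statistics

print("train mean:", X_train_scaled.mean(axis=0).round(6))
print("train std: ", X_train_scaled.std(axis=0).round(6))
```

The scaled test set will have a mean close to, but not exactly, zero; that is expected because its statistics were never used.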

### 2. Model Development
- **Start Simple**: Basic logistic regression first
- **Feature Engineering**: Create meaningful features
- **Regularization**: Use when overfitting occurs
- **Cross-Validation**: Proper model selection
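
As one way to combine the last two points, the regularization strength can be chosen by k-fold cross-validation. The sketch below is self-contained and uses its own small L2-regularized trainer on synthetic data; the function names, candidate `alpha` values, and fold count are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_l2_logistic(X, y, alpha, learning_rate=0.1, n_iter=500):
    """Gradient descent on cross-entropy + alpha * sum(w^2)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        err = sigmoid(X @ w + b) - y
        w -= learning_rate * (X.T @ err / len(y) + 2 * alpha * w)
        b -= learning_rate * np.mean(err)
    return w, b

def log_loss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_val_log_loss(X, y, alpha, k=5, seed=0):
    """Mean validation log loss over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    losses = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, b = fit_l2_logistic(X[train], y[train], alpha)
        losses.append(log_loss(y[val], sigmoid(X[val] @ w + b)))
    return float(np.mean(losses))

# Synthetic data for illustration: two informative features, two noise features
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = rng.binomial(1, sigmoid(X @ np.array([1.0, -0.5, 0.0, 0.0])))

candidate_alphas = [0.0, 0.01, 0.1, 1.0]
cv_scores = {a: cross_val_log_loss(X, y, a) for a in candidate_alphas}
best_alpha = min(cv_scores, key=cv_scores.get)

for a, s in cv_scores.items():
    print(f"alpha={a}: mean validation log loss = {s:.4f}")
print("selected alpha:", best_alpha)
```

Selecting on validation folds keeps the test set untouched for the final performance estimate.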

### 3. Model Evaluation
- **Multiple Metrics**: Don't rely on accuracy alone
- **ROC/PR Curves**: Understand performance across thresholds
- **Confusion Matrix**: Understand error types
- **Business Context**: Align metrics with business goals
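
A quick illustration of why accuracy alone can mislead, using synthetic labels with a 95/5 class split and a degenerate classifier that always predicts the majority class:

```python
import numpy as np

# Synthetic imbalanced labels: 95 negatives, 5 positives (illustration only)
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = float(np.mean(y_true == y_pred))
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print(f"accuracy = {accuracy:.2f}")  # 0.95
print(f"recall   = {recall:.2f}")    # 0.00
```

The classifier looks strong by accuracy yet finds none of the positive cases, which is exactly the failure that recall and the confusion matrix expose.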

### 4. Production Considerations
- **Probability Calibration**: Ensure probabilities are well-calibrated
- **Monitoring**: Track model performance over time
- **Interpretability**: Maintain explainability
- **Threshold Optimization**: Optimize for business metrics
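
Threshold optimization can be as simple as sweeping candidate thresholds on a validation set and keeping the one with the lowest total misclassification cost. The sketch below assumes hypothetical costs (a false negative costs 5, a false positive costs 1) and made-up validation scores; `best_threshold` is an illustrative helper.

```python
import numpy as np

def best_threshold(y_true, proba, cost_fp=1.0, cost_fn=5.0):
    """Return (threshold, cost) minimizing total misclassification cost."""
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        y_pred = (proba >= t).astype(int)
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))  # first minimum if several thresholds tie
    return float(thresholds[best]), float(costs[best])

# Made-up validation labels and predicted probabilities
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
p_val = np.array([0.03, 0.12, 0.22, 0.33, 0.38, 0.47, 0.58, 0.72, 0.62, 0.91])

threshold, cost = best_threshold(y_val, p_val)

# Cost of the default 0.5 threshold under the same hypothetical costs
cost_at_half = float(1.0 * np.sum((y_val == 0) & (p_val >= 0.5))
                     + 5.0 * np.sum((y_val == 1) & (p_val < 0.5)))

print(f"best threshold = {threshold:.2f}, cost = {cost:.1f}")
print(f"cost at 0.5    = {cost_at_half:.1f}")
```

With false negatives weighted more heavily, the selected threshold falls below 0.5, trading extra false positives for fewer missed positives.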

## Conclusion

Logistic regression is a powerful and interpretable classification algorithm that forms the foundation of many machine learning applications. Its probabilistic nature, mathematical elegance, and interpretability make it an excellent choice for many real-world problems.

### Key Takeaways:
- **Mathematical Foundation**: Understanding sigmoid function and cross-entropy loss is crucial
- **Implementation Skills**: Building from scratch provides deep insights into the algorithm
- **Regularization**: Helps prevent overfitting; the L1 penalty can additionally act as a form of feature selection
- **Evaluation**: Comprehensive evaluation beyond accuracy is critical
- **Practical Applications**: Wide applicability across domains with proper considerations

### Next Steps:
1. Practice with different classification datasets
2. Experiment with feature engineering techniques
3. Compare with other classification algorithms
4. Study advanced topics like multinomial logistic regression
5. Apply to your own classification problems

This comprehensive guide provides the theoretical knowledge and practical skills needed to effectively apply logistic regression to real-world classification problems while understanding its assumptions, limitations, and best practices.