# Logistic Regression - Classification with Gradient Descent

## What's Different from Linear Regression?

**Linear Regression**: Predicts continuous values (e.g., house prices: \$200k, \$350k, etc.)

**Logistic Regression**: Predicts discrete classes (e.g., spam or not spam: 0 or 1)

### The Problem

Imagine we want to predict if a student will pass (1) or fail (0) based on hours studied.

If we use linear regression:
- Prediction could be 1.5 (what does that mean?)
- Or -0.3 (negative probability?)

**We need predictions between 0 and 1 (like probabilities)!**
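To make the problem concrete, here is a minimal sketch that fits an ordinary least-squares line to a small pass/fail dataset and evaluates it outside the observed range. The hours and labels are invented for illustration; they are not data from this notebook.

```python
import numpy as np

# Hypothetical toy data: hours studied and pass (1) / fail (0) labels
hours = np.arange(1, 11, dtype=float)  # 1, 2, ..., 10
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares line: passed ≈ slope * hours + intercept
slope, intercept = np.polyfit(hours, passed, deg=1)

for h in (0.0, 5.5, 12.0):
    print(f"hours = {h:4.1f} -> linear prediction = {slope * h + intercept:.3f}")

pred_low = slope * 0.0 + intercept    # prediction for 0 hours
pred_high = slope * 12.0 + intercept  # prediction for 12 hours
```

On this toy data the line gives a negative value at 0 hours and a value above 1 at 12 hours, neither of which can be read as a probability.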

## The Solution: Sigmoid Function

The **sigmoid function** squashes any real number into a value strictly between 0 and 1.

**Formula:** $$\boxed{\sigma(z) = \frac{1}{1 + e^{-z}}}$$

Where:
- $z = wx + b$ (same as linear regression)
- $\sigma(z)$ = probability of being class 1

Properties:
- When $z \to +\infty$, $\sigma(z) \to 1$
- When $z \to -\infty$, $\sigma(z) \to 0$
- When $z = 0$, $\sigma(z) = 0.5$
- S-shaped curve (smooth transition from 0 to 1)

### The Model

$$\boxed{\hat{y} = \sigma(Xw + b) = \frac{1}{1 + e^{-(Xw + b)}}}$$

**Decision rule:**
- If $\hat{y} \geq 0.5$ → predict class 1
- If $\hat{y} < 0.5$ → predict class 0
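As a quick illustration of the model and the decision rule together, the sketch below applies them to three students. The weight `w` and bias `b` are picked by hand for the example (they are assumptions, not learned values).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen by hand for illustration (not learned)
w = np.array([1.5])
b = -6.0

# Three students: 2, 4 and 7 hours studied (one feature per row)
X = np.array([[2.0], [4.0], [7.0]])

y_hat = sigmoid(X @ w + b)           # predicted P(y = 1) for each row
labels = (y_hat >= 0.5).astype(int)  # decision rule at threshold 0.5

for hrs, p, label in zip(X[:, 0], y_hat, labels):
    print(f"hours = {hrs:.0f} -> P(pass) = {p:.3f} -> class {label}")
```

The student with 4 hours lands exactly on $z = 0$, so $\hat{y} = 0.5$ and the rule assigns class 1.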

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

## Step 1: Understanding the Sigmoid Function

In [None]:
# Visualize sigmoid function
def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

# Create range of values
z = np.linspace(-10, 10, 100)
y_sigmoid = sigmoid(z)

# Plot
plt.figure(figsize=(12, 5))

# Left: Sigmoid function
plt.subplot(1, 2, 1)
plt.plot(z, y_sigmoid, linewidth=3, color='#2E86AB')
plt.axhline(y=0.5, color='red', linestyle='--', linewidth=2, alpha=0.7, label='Decision Boundary (0.5)')
plt.axhline(y=0, color='gray', linestyle='-', linewidth=1, alpha=0.3)
plt.axhline(y=1, color='gray', linestyle='-', linewidth=1, alpha=0.3)
plt.axvline(x=0, color='gray', linestyle='-', linewidth=1, alpha=0.3)
plt.xlabel('z = wx + b', fontsize=12)
plt.ylabel('σ(z)', fontsize=12)
plt.title('Sigmoid Function: σ(z) = 1/(1 + e⁻ᶻ)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend()

# Right: Derivative of sigmoid
plt.subplot(1, 2, 2)
# Derivative: σ'(z) = σ(z) * (1 - σ(z))
y_sigmoid_derivative = y_sigmoid * (1 - y_sigmoid)
plt.plot(z, y_sigmoid_derivative, linewidth=3, color='#A23B72')
plt.axvline(x=0, color='gray', linestyle='-', linewidth=1, alpha=0.3)
plt.xlabel('z = wx + b', fontsize=12)
plt.ylabel("σ'(z)", fontsize=12)
plt.title('Derivative of Sigmoid: σ\'(z) = σ(z)(1 - σ(z))', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Values:")
print(f"σ(-10) = {sigmoid(-10):.6f} ≈ 0")
print(f"σ(-5)  = {sigmoid(-5):.6f}")
print(f"σ(0)   = {sigmoid(0):.6f} = 0.5")
print(f"σ(5)   = {sigmoid(5):.6f}")
print(f"σ(10)  = {sigmoid(10):.6f} ≈ 1")

## Step 2: Binary Cross-Entropy Loss

### Why not use Mean Squared Error (MSE)?

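For classification, MSE combined with sigmoid doesn't work well:
- The loss is not convex in $w$ and $b$, so gradient descent can stall on flat regions or in local minima
- The gradient contains the factor $\hat{y}(1-\hat{y})$, which shrinks toward 0 when the sigmoid saturates, so even a confidently wrong prediction gets only a tiny update (vanishing gradients)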
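The vanishing-gradient point can be seen with a single sample. The sketch below compares the gradient with respect to $z$ of the squared error and of the cross-entropy loss at a confidently wrong prediction. The value $z = -10$ is an arbitrary choice for illustration, and the cross-entropy gradient $\hat{y} - y$ is the result derived in Step 3.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0    # true label
z = -10.0  # model is confidently wrong: sigmoid(-10) is close to 0
y_hat = sigmoid(z)

# d/dz of the squared error (y_hat - y)^2, via the chain rule
grad_mse = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

# d/dz of the cross-entropy -[y log(y_hat) + (1 - y) log(1 - y_hat)]
grad_ce = y_hat - y

print(f"y_hat                  = {y_hat:.6f}")
print(f"MSE gradient           = {grad_mse:.2e}")
print(f"cross-entropy gradient = {grad_ce:.4f}")
```

The squared-error gradient is close to zero even though the prediction is badly wrong, while the cross-entropy gradient stays near $-1$.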

### Binary Cross-Entropy (Log Loss)

**Formula:** $$\boxed{L = -\frac{1}{n} \sum_{i=1}^{n} \left[y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)\right]}$$

Where:
- $y$ = actual label (0 or 1)
- $\hat{y}$ = predicted probability

### Why does this work?

Let's break it down for a single sample:

**If $y = 1$** (actual class is 1):
- Loss $= -\log(\hat{y})$
- If $\hat{y} = 1$ (confident correct prediction) → Loss $= 0$
- If $\hat{y} = 0.5$ (uncertain) → Loss $= -\log(0.5) \approx 0.69$
- If $\hat{y} \approx 0$ (confident wrong prediction) → Loss $\to \infty$

**If $y = 0$** (actual class is 0):
- Loss $= -\log(1-\hat{y})$
- If $\hat{y} = 0$ (confident correct prediction) → Loss $= 0$
- If $\hat{y} = 0.5$ (uncertain) → Loss $= -\log(0.5) \approx 0.69$
- If $\hat{y} \approx 1$ (confident wrong prediction) → Loss $\to \infty$

**Key insight:** We heavily penalize confident wrong predictions!

In [None]:
# Visualize binary cross-entropy loss
y_pred = np.linspace(0.001, 0.999, 100)  # Avoid log(0)

# Loss when y=1
loss_y1 = -np.log(y_pred)

# Loss when y=0
loss_y0 = -np.log(1 - y_pred)

plt.figure(figsize=(12, 5))
plt.plot(y_pred, loss_y1, linewidth=3, label='Loss when y=1: -log(ŷ)', color='#2E86AB')
plt.plot(y_pred, loss_y0, linewidth=3, label='Loss when y=0: -log(1-ŷ)', color='#A23B72')
plt.xlabel('Predicted Probability (ŷ)', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Binary Cross-Entropy Loss', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim(0, 5)
plt.show()

print("Loss Examples:")
print("\nWhen actual y=1:")
print(f"  Predict 0.99 → Loss = {-np.log(0.99):.4f}")
print(f"  Predict 0.50 → Loss = {-np.log(0.50):.4f}")
print(f"  Predict 0.01 → Loss = {-np.log(0.01):.4f}")
print("\nWhen actual y=0:")
print(f"  Predict 0.01 → Loss = {-np.log(1 - 0.01):.4f}")  # loss is -log(1 - ŷ) when y=0
print(f"  Predict 0.50 → Loss = {-np.log(1 - 0.50):.4f}")
print(f"  Predict 0.99 → Loss = {-np.log(1 - 0.99):.4f}")

## Step 3: Deriving the Gradients

This is where the magic happens! Let's derive the gradients step by step.

### Setup

- $z = Xw + b$ (linear combination)
- $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$ (prediction)
- $L = -\frac{1}{n} \sum [y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$ (loss)

### Gradient with respect to $w$

We need: $\frac{\partial L}{\partial w}$

Using the chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w}$$

Let's compute each part:

#### Part 1: $\frac{\partial L}{\partial \hat{y}}$

$$L = -\frac{1}{n} \sum [y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$

For a single sample (dropping the sum and the $\frac{1}{n}$ factor):
$$\frac{\partial L}{\partial \hat{y}} = -\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right) = -\frac{y - \hat{y}}{\hat{y}(1-\hat{y})}$$

#### Part 2: $\frac{\partial \hat{y}}{\partial z}$ (Derivative of Sigmoid)

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Using the quotient rule:
- Let $u = 1$, $v = 1 + e^{-z}$
- $\hat{y} = \frac{u}{v}$
- $\frac{\partial \hat{y}}{\partial z} = \frac{v \cdot \frac{\partial u}{\partial z} - u \cdot \frac{\partial v}{\partial z}}{v^2}$
- $\frac{\partial v}{\partial z} = e^{-z} \times (-1) = -e^{-z}$
- $\frac{\partial \hat{y}}{\partial z} = \frac{0 - 1 \times (-e^{-z})}{(1 + e^{-z})^2} = \frac{e^{-z}}{(1 + e^{-z})^2}$
- Since $1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}}$, this factors as $\frac{1}{1 + e^{-z}} \times \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \times (1 - \sigma(z))$
- $= \hat{y}(1 - \hat{y})$

**Beautiful result!** The derivative of sigmoid is $\sigma(z) \times (1 - \sigma(z))$
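A quick numerical sanity check of this identity, as a minimal sketch: it compares $\sigma(z)(1 - \sigma(z))$ against a central finite difference. The grid of $z$ values and the step size `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
h = 1e-5  # finite-difference step (arbitrary small value)

analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference

max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

The two agree up to finite-difference error, and the derivative peaks at $z = 0$ with value $0.25$.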

#### Part 3: $\frac{\partial z}{\partial w}$

$$z = Xw + b$$

$$\frac{\partial z}{\partial w} = X$$

#### Combine Everything

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w}$$

$$= \left[-\frac{y - \hat{y}}{\hat{y}(1-\hat{y})}\right] \times [\hat{y}(1-\hat{y})] \times X$$

$$= -(y - \hat{y}) \times X = (\hat{y} - y) \times X$$

Averaging over all $n$ samples, as the $\frac{1}{n} \sum$ in the loss requires (vectorized):

$$\boxed{\frac{\partial L}{\partial w} = \frac{1}{n} X^T(\hat{y} - y)}$$

### Gradient with respect to $b$

Similarly, with $\frac{\partial z}{\partial b} = 1$ in place of $\frac{\partial z}{\partial w} = X$:

$$\boxed{\frac{\partial L}{\partial b} = \frac{1}{n} \sum (\hat{y} - y)}$$
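The boxed formulas can be checked numerically. The sketch below is a gradient check under stated assumptions: the data, weights, and bias are random toy values (not the notebook's dataset), and `loss` is a helper written just for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w, b):
    """Binary cross-entropy, averaged over samples."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                   # random toy features
y = rng.integers(0, 2, size=20).astype(float)  # random 0/1 labels
w = rng.normal(size=3)
b = 0.3

# Analytic gradients from the boxed formulas
errors = sigmoid(X @ w + b) - y
dw = X.T @ errors / len(y)
db = errors.mean()

# Central finite differences, one parameter at a time
h = 1e-6
dw_num = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    dw_num[j] = (loss(X, y, w + step, b) - loss(X, y, w - step, b)) / (2 * h)
db_num = (loss(X, y, w, b + h) - loss(X, y, w, b - h)) / (2 * h)

print("max |dw - dw_num| =", np.max(np.abs(dw - dw_num)))
print("    |db - db_num| =", abs(db - db_num))
```

If the analytic and numerical gradients disagree by more than finite-difference error, the derivation or the implementation has a bug, which makes this a useful test before training.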

### Simple Result!

Notice: Up to a constant factor, the gradient has the same form as in linear regression!
- Linear Regression (MSE): $\frac{\partial L}{\partial w} = \frac{2}{n} X^T(\hat{y} - y)$, where $\hat{y} = Xw + b$
- Logistic Regression: $\frac{\partial L}{\partial w} = \frac{1}{n} X^T(\hat{y} - y)$

The cross-entropy loss and sigmoid activation simplify beautifully!

## Step 4: Generate Sample Data

In [None]:
# Generate binary classification dataset
X, y = make_classification(
    n_samples=500,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    flip_y=0.1,
    random_state=42
)

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features
X_mean = X_train.mean(axis=0)
X_std = X_train.std(axis=0)
X_train_norm = (X_train - X_mean) / X_std
X_test_norm = (X_test - X_mean) / X_std

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Class distribution (train): {np.bincount(y_train)}")
print(f"Class distribution (test):  {np.bincount(y_test)}")

# Visualize data
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 
            label='Class 0', alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 
            label='Class 1', alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Training Data', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1], 
            label='Class 0', alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
plt.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1], 
            label='Class 1', alpha=0.7, s=50, edgecolors='black', linewidth=0.5)
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Test Data', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 5: Implement Logistic Regression

In [None]:
class LogisticRegression:
    """Logistic Regression using Gradient Descent"""
    
    def __init__(self, learning_rate=0.1, n_iterations=1000):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.w = None
        self.b = None
        self.cost_history = []
        
    def sigmoid(self, z):
        """Sigmoid activation function"""
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def compute_cost(self, X, y):
        """Binary cross-entropy loss"""
        n = len(y)
        z = X @ self.w + self.b
        y_pred = self.sigmoid(z)
        
        # Clip predictions to avoid log(0)
        epsilon = 1e-7
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        
        # Binary cross-entropy
        cost = -(1/n) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
        return cost
    
    def compute_gradients(self, X, y):
        """Calculate gradients"""
        n = len(y)
        z = X @ self.w + self.b
        y_pred = self.sigmoid(z)
        errors = y_pred - y
        
        # Gradients
        dw = (1/n) * (X.T @ errors)
        db = (1/n) * np.sum(errors)
        
        return dw, db
    
    def fit(self, X, y, verbose=True):
        """Train the model"""
        n_samples, n_features = X.shape
        
        # Initialize parameters (and reset history so repeated fit() calls start clean)
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.cost_history = []
        
        # Gradient descent
        for i in range(self.n_iterations):
            # Compute gradients
            dw, db = self.compute_gradients(X, y)
            
            # Update parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db
            
            # Track cost
            cost = self.compute_cost(X, y)
            self.cost_history.append(cost)
            
            if verbose and (i % 100 == 0 or i == self.n_iterations - 1):
                print(f"Iteration {i:4d} | Cost: {cost:.6f}")
    
    def predict_proba(self, X):
        """Predict probabilities"""
        z = X @ self.w + self.b
        return self.sigmoid(z)
    
    def predict(self, X, threshold=0.5):
        """Predict class labels"""
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)

## Step 6: Train the Model

In [None]:
# Train model
model = LogisticRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X_train_norm, y_train)

## Step 7: Evaluate and Visualize

In [None]:
# Make predictions
y_train_pred = model.predict(X_train_norm)
y_test_pred = model.predict(X_test_norm)
y_train_proba = model.predict_proba(X_train_norm)
y_test_proba = model.predict_proba(X_test_norm)

# Calculate accuracy
train_accuracy = np.mean(y_train_pred == y_train)
test_accuracy = np.mean(y_test_pred == y_test)

print(f"\nTraining Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy:     {test_accuracy:.4f}")

# Confusion Matrix
from sklearn.metrics import confusion_matrix, classification_report

print("\nConfusion Matrix (Test):")
cm = confusion_matrix(y_test, y_test_pred)
print(cm)

print("\nClassification Report (Test):")
print(classification_report(y_test, y_test_pred))

In [None]:
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Cost history
axes[0, 0].plot(model.cost_history, linewidth=2, color='#2E86AB')
axes[0, 0].set_xlabel('Iteration', fontsize=12)
axes[0, 0].set_ylabel('Cost (Binary Cross-Entropy)', fontsize=12)
axes[0, 0].set_title('Learning Curve', fontsize=14, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Decision boundary
# Create mesh
x1_min, x1_max = X_train_norm[:, 0].min() - 1, X_train_norm[:, 0].max() + 1
x2_min, x2_max = X_train_norm[:, 1].min() - 1, X_train_norm[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                        np.linspace(x2_min, x2_max, 200))
X_mesh = np.c_[xx1.ravel(), xx2.ravel()]
Z = model.predict_proba(X_mesh).reshape(xx1.shape)

# Plot decision boundary
contour = axes[0, 1].contourf(xx1, xx2, Z, levels=20, cmap='RdYlBu_r', alpha=0.6)
axes[0, 1].contour(xx1, xx2, Z, levels=[0.5], colors='black', linewidths=2)
axes[0, 1].scatter(X_train_norm[y_train == 0, 0], X_train_norm[y_train == 0, 1],
                   label='Class 0', alpha=0.8, s=50, edgecolors='black', linewidth=0.5)
axes[0, 1].scatter(X_train_norm[y_train == 1, 0], X_train_norm[y_train == 1, 1],
                   label='Class 1', alpha=0.8, s=50, edgecolors='black', linewidth=0.5)
axes[0, 1].set_xlabel('Feature 1 (normalized)', fontsize=12)
axes[0, 1].set_ylabel('Feature 2 (normalized)', fontsize=12)
axes[0, 1].set_title('Decision Boundary (Training Data)', fontsize=14, fontweight='bold')
axes[0, 1].legend()
plt.colorbar(contour, ax=axes[0, 1], label='P(y=1)')

# Plot 3: Confusion Matrix Heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1, 0],
            xticklabels=['Pred 0', 'Pred 1'],
            yticklabels=['True 0', 'True 1'])
axes[1, 0].set_title('Confusion Matrix (Test)', fontsize=14, fontweight='bold')
axes[1, 0].set_ylabel('True Label', fontsize=12)
axes[1, 0].set_xlabel('Predicted Label', fontsize=12)

# Plot 4: Probability distribution
axes[1, 1].hist(y_test_proba[y_test == 0], bins=20, alpha=0.7, label='Class 0', color='#2E86AB')
axes[1, 1].hist(y_test_proba[y_test == 1], bins=20, alpha=0.7, label='Class 1', color='#A23B72')
axes[1, 1].axvline(x=0.5, color='red', linestyle='--', linewidth=2, label='Threshold')
axes[1, 1].set_xlabel('Predicted Probability', fontsize=12)
axes[1, 1].set_ylabel('Frequency', fontsize=12)
axes[1, 1].set_title('Predicted Probability Distribution (Test)', fontsize=14, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Key Takeaways

### What We Learned:

1. **Logistic Regression is for Classification**
   - Linear Regression: predicts continuous values
   - Logistic Regression: predicts probabilities (0 to 1)

2. **Sigmoid Function** squashes outputs into the open interval (0, 1)
   - $\sigma(z) = \frac{1}{1 + e^{-z}}$
   - S-shaped curve
   - Derivative: $\sigma'(z) = \sigma(z) \times (1 - \sigma(z))$

3. **Binary Cross-Entropy Loss**
   - Better than MSE for classification
   - $L = -\frac{1}{n} \sum [y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$
   - Heavily penalizes confident wrong predictions

4. **Gradients Simplify Beautifully**
   - Despite the complex loss function, the gradient is simple!
   - $\frac{\partial L}{\partial w} = \frac{1}{n} X^T(\hat{y} - y)$
   - $\frac{\partial L}{\partial b} = \frac{1}{n} \sum (\hat{y} - y)$
   - Very similar to linear regression!

5. **The Math Breakdown:**

   **Chain Rule Application:**
   $$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w}$$
   
   **Each Component:**
   - $\frac{\partial L}{\partial \hat{y}} = -\frac{y - \hat{y}}{\hat{y}(1-\hat{y})}$
   - $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$
   - $\frac{\partial z}{\partial w} = X$
   
   **Combine (terms cancel!):**
   $$= \left[-\frac{y - \hat{y}}{\hat{y}(1-\hat{y})}\right] \times [\hat{y}(1-\hat{y})] \times X = (\hat{y} - y) \times X$$

6. **Decision Boundary**
   - Predict class 1 if $\sigma(wx + b) \geq 0.5$
   - Equivalently: if $wx + b \geq 0$
   - Creates a linear boundary in feature space
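For two features, the boundary $w_1 x_1 + w_2 x_2 + b = 0$ is a straight line that can be solved for $x_2$. The sketch below uses hypothetical weights chosen for illustration (not the trained model's) and confirms that points on the line get probability 0.5.

```python
import numpy as np

# Hypothetical weights for two features (for illustration only)
w = np.array([2.0, -1.0])
b = 0.5

def boundary_x2(x1):
    """x2 on the line w[0]*x1 + w[1]*x2 + b = 0 (requires w[1] != 0)."""
    return -(w[0] * x1 + b) / w[1]

x1 = np.array([-1.0, 0.0, 1.0])
x2 = boundary_x2(x1)

# Every point on the line has z = 0, hence sigmoid(z) = 0.5
z_on_line = w[0] * x1 + w[1] * x2 + b
p_on_line = 1.0 / (1.0 + np.exp(-z_on_line))

print("x2 on boundary:", x2)
print("P(y=1) on boundary:", p_on_line)
```

With a trained model, substituting `model.w` and `model.b` for the hypothetical values would give the black 0.5 contour drawn in the decision-boundary plot.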

### Why This Matters:

1. **Foundation for Neural Networks**
   - A neuron with a sigmoid activation computes the same function as logistic regression
   - Stack them → deep learning!

2. **Understanding Activation Functions**
   - Sigmoid is one type of activation
   - Others: ReLU, tanh, etc.
   - All are differentiable (ReLU everywhere except at 0), which backpropagation relies on

3. **Loss Functions**
   - Different tasks need different losses
   - Cross-entropy for classification
   - MSE for regression

4. **Gradient Descent Still Works!**
   - Same algorithm as linear regression
   - Just different loss and activation
   - This generalizes to neural networks

### Next Steps:

Now we're ready for neural networks!
- Stack multiple logistic regression units
- Add hidden layers
- Backpropagation through multiple layers