# Lab 1: Implementing Logistic Regression Components from Scratch

## 1. Objectives
In this notebook, we implement the basic mathematical building blocks of a neural network, in the form of logistic regression, using only **NumPy**. Writing them by hand shows how gradient computation (backpropagation) and optimization work under the hood.

## 2. Mathematical Background
We will focus on the following three components:
1.  **Sigmoid Activation Function**: Maps any real-valued number into the range (0, 1).
2.  **Binary Cross-Entropy Loss**: Measures how far the predicted probabilities are from the true binary labels; lower is better.
3.  **Gradient Descent**: The optimization algorithm used to minimize the loss by updating weights.

## 3. Implementation Challenge

### Task 1: The Sigmoid Function
The formula is:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Question for you:** In terms of deep learning, why do we use `Sigmoid` instead of a simple linear function for the output of a binary classifier? 

In [1]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

### Analysis of Sigmoid Implementation
- **Domain**: $(-\infty, \infty)$
- **Range**: $(0, 1)$
- **Interpretation**: Represents the probability $P(y=1|x)$.

The naive implementation has a numerical problem: when $z$ is a large negative number (e.g., $z = -1000$), `np.exp(-z)` overflows to `inf` and NumPy emits a `RuntimeWarning`. To prevent this, we use the **stable sigmoid** approach:

- If $z \ge 0$: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- If $z < 0$: $\sigma(z) = \frac{e^z}{1 + e^z}$

This ensures that we only compute $e^x$ for $x \le 0$, so the exponential term stays within $(0, 1]$.
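
The claims above can be checked numerically. The following is a minimal sketch, not part of the lab's required code: the sample values of `z` are arbitrary, and `np.errstate(over="raise")` is used only to turn the overflow warning into an exception so it can be detected.

```python
import numpy as np

z = np.array([-30.0, -2.0, 0.0, 2.0, 30.0])

# The two branches are the same function: multiplying 1 / (1 + e^{-z})
# by e^{z} / e^{z} gives e^{z} / (1 + e^{z}).
form_pos = 1 / (1 + np.exp(-z))
form_neg = np.exp(z) / (1 + np.exp(z))
branches_agree = np.allclose(form_pos, form_neg)

# With overflow reported as an error instead of a warning, the naive form
# fails for a large negative input because it evaluates exp(1000).
with np.errstate(over="raise"):
    try:
        1 / (1 + np.exp(-np.array([-1000.0])))
        naive_overflowed = False
    except FloatingPointError:
        naive_overflowed = True

# The z < 0 branch evaluates exp(-1000) instead, which underflows quietly to 0.
stable_value = np.exp(-1000.0) / (1 + np.exp(-1000.0))

print(branches_agree, naive_overflowed, stable_value)
```

For moderate inputs either form works; the branch choice only matters once `exp` would receive a large positive argument.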

In [None]:
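def stable_sigmoid(z):
    # For z >= 0: 1 / (1 + exp(-z))
    # For z < 0: exp(z) / (1 + exp(z))
    # np.where evaluates both branches on the whole array, so exp() is applied
    # to -|z|, which is non-positive for every element.
    z = np.asarray(z, dtype=float)
    exp_neg_abs = np.exp(-np.abs(z))
    return np.where(z >= 0, 1 / (1 + exp_neg_abs), exp_neg_abs / (1 + exp_neg_abs))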


### Task 2: Binary Cross-Entropy Loss
The formula for a single sample is:
$$L = -(y \log(\hat{y}) + (1-y) \log(1-\hat{y}))$$

**Question for you:** If the model predicts $\hat{y} = 0.9$ and the true label is $y = 1$, will the loss be high or low? What happens to the math if $\hat{y}$ is exactly $0$ or $1$?

In [3]:
def binary_cross_entropy(y_true, y_pred):
    """
    Calculates the Binary Cross-Entropy (BCE) between true labels and predictions.

    Args:
        y_true: Ground truth values.
        y_pred: Predicted values from the model.
    """
    epsilon = 1e-15
    # Clip predictions to [epsilon, 1 - epsilon] so np.log never receives 0
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -(y_true * np.log(y_pred) + (1-y_true) * np.log(1 - y_pred))
    # return the average loss over the batch
    return np.mean(loss)
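
To make the role of `epsilon` concrete, here is a small sketch that evaluates the single-sample formula directly for a positive example ($y = 1$), where the loss reduces to $-\log(\hat{y})$. The specific prediction values are illustrative assumptions.

```python
import numpy as np

# Loss for a single positive example (y = 1) reduces to -log(y_hat).
confident_right = -np.log(0.9)   # about 0.105
confident_wrong = -np.log(0.1)   # about 2.303

# Without clipping, a prediction of exactly 0 for a positive example gives
# log(0) = -inf, so the loss is infinite.
with np.errstate(divide="ignore"):
    unclipped = -np.log(np.float64(0.0))

# Clipping to [epsilon, 1 - epsilon] caps the per-sample loss at -log(epsilon).
epsilon = 1e-15
clipped = -np.log(np.clip(0.0, epsilon, 1 - epsilon))  # about 34.54

print(confident_right, confident_wrong, unclipped, clipped)
```

The clipped value is large but finite, so a single saturated prediction cannot turn the batch mean into `inf`.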

## Task 3: Gradient Descent and The Chain Rule Derivation

To see why the weight gradient used in the update is $dw = \frac{1}{n} X^T(\hat{y} - y)$, where $X$ stacks the $n$ samples as rows, we apply the **Chain Rule**.

### 1. The Components
Logistic Regression follows this flow:
$$w, b \xrightarrow{\text{Linear}} z = wx + b \xrightarrow{\text{Sigmoid}} \hat{y} = \sigma(z) \xrightarrow{\text{BCE}} L$$

### 2. The Chain Rule Formula
To find how the Loss changes with respect to weights ($w$):
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

### 3. Step-by-Step Differentiation
* **Step 1 (Loss to Prediction):** Differentiating $L = -(y \log\hat{y} + (1-y) \log(1-\hat{y}))$ gives:
    $$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}$$
* **Step 2 (Prediction to Linear):** Differentiating the Sigmoid function $\hat{y} = \sigma(z)$ gives:
    $$\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$$
* **Step 3 (Linear to Weights):** Differentiating $z = wx + b$ with respect to $w$ gives:
    $$\frac{\partial z}{\partial w} = x$$

### 4. The "Magic" Cancellation
Multiplying them together:
$$\frac{\partial L}{\partial w} = \left[ \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} \right] \cdot \left[ \hat{y}(1-\hat{y}) \right] \cdot x$$
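The terms $\hat{y}(1-\hat{y})$ cancel exactly, leaving:
$$\frac{\partial L}{\partial w} = (\hat{y} - y)x$$
Averaging this over $n$ samples, with the samples stacked as rows of $X$, gives $dw = \frac{1}{n} X^T(\hat{y} - y)$. The same steps with $\frac{\partial z}{\partial b} = 1$ give $db = \frac{1}{n} \sum_i (\hat{y}_i - y_i)$.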
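
One way to gain confidence in the single-sample result $(\hat{y} - y)x$ is a finite-difference gradient check. The sketch below is an illustration under assumed values: `loss_single` is a hypothetical helper, and the numbers for `w`, `x`, `b`, and `y` are arbitrary.

```python
import numpy as np

def loss_single(w, b, x, y):
    """BCE loss of one sample under z = w.x + b and y_hat = sigmoid(z)."""
    y_hat = 1 / (1 + np.exp(-(np.dot(w, x) + b)))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Arbitrary illustrative values.
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 2.0, -0.5])
b, y = 0.1, 1.0

# Analytic gradient from the derivation: (y_hat - y) * x
y_hat = 1 / (1 + np.exp(-(np.dot(w, x) + b)))
analytic = (y_hat - y) * x

# Central finite differences, perturbing one weight at a time.
h = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = h
    numeric[i] = (loss_single(w + step, b, x, y) - loss_single(w - step, b, x, y)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(analytic, numeric, max_abs_diff)
```

If the derivation is right, the two gradients should agree to within the finite-difference error, so `max_abs_diff` should be very small.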

In [4]:
import numpy as np

def update_weights(X, y_true, y_pred, weights, bias, learning_rate):
    """
    Updates the model parameters using Gradient Descent.
    
    Parameters:
    X (ndarray): Feature matrix of shape (n_samples, n_features)
    y_true (ndarray): True labels of shape (n_samples, 1)
    y_pred (ndarray): Predicted probabilities of shape (n_samples, 1)
    weights (ndarray): Current weight vector of shape (n_features, 1)
    bias (float): Current bias scalar
    learning_rate (float): The step size alpha
    
    Returns:
    tuple: (updated_weights, updated_bias)
    """
    n = len(y_true)
    
    # Calculate the error (difference between prediction and truth)
    error = y_pred - y_true
    
    # Calculate gradients
    # dw: average of (error * features)
    # X.T dot error performs the summation and multiplication efficiently
    dw = (1 / n) * np.dot(X.T, error)
    
    # db: average of the error
    db = (1 / n) * np.sum(error)
    
    # Parameter update rule: theta = theta - learning_rate * gradient
    new_weights = weights - learning_rate * dw
    new_bias = bias - learning_rate * db
    
    return new_weights, new_bias
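
The shapes listed in the docstring matter because NumPy broadcasts mismatched shapes silently. The short sketch below uses placeholder arrays of zeros (an assumption made purely to show shapes) to illustrate what happens when a 1-D array meets a column vector.

```python
import numpy as np

n = 4
y_column = np.zeros((n, 1))   # shape (n, 1), as in the docstring
y_flat = np.zeros(n)          # shape (n,)

# Matching shapes give an element-wise error of the same shape.
error_ok = y_column - y_column

# Mixing the two broadcasts silently to (n, n) instead of raising an error.
error_bad = y_flat - y_column

print(error_ok.shape, error_bad.shape)
```

With an `(n, n)` error, `np.dot(X.T, error)` and `np.sum(error)` still run but no longer compute the intended gradients, so it helps to keep `y_true` and `y_pred` in the same shape.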

In [None]:
class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []
    
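    def stable_sigmoid(self, z):
        # For z >= 0: 1 / (1 + exp(-z))
        # For z < 0: exp(z) / (1 + exp(z))
        # np.where evaluates both branches on the whole array, so exp() is applied
        # to -|z|, which is non-positive for every element.
        z = np.asarray(z, dtype=float)
        exp_neg_abs = np.exp(-np.abs(z))
        return np.where(z >= 0, 1 / (1 + exp_neg_abs), exp_neg_abs / (1 + exp_neg_abs))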

    def binary_cross_entropy(self, y_true, y_pred):
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        loss = -(y_true * np.log(y_pred) + (1-y_true) * np.log(1 - y_pred))
        return np.mean(loss)

    def fit(self, X, y):
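        n_samples, n_features = X.shape
        # Flatten labels to shape (n_samples,) so they match the 1-D predictions;
        # a column vector would broadcast the error to (n_samples, n_samples).
        y = np.asarray(y, dtype=float).reshape(-1)

        # Initialize weights and bias, and reset the loss record for this fit
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.loss_history = []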

        # Gradient descent loop
        for _ in range(self.n_iterations):
            # Step1: Calculate predicted values
            y_predicted = self.stable_sigmoid(np.dot(X, self.weights) + self.bias)

            # Step2: Calculate gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Step3: Update weights and bias
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

            # Step4: Store the loss of the predictions made before this update
            loss = self.binary_cross_entropy(y, y_predicted)
            self.loss_history.append(loss)
            
    def predict(self, X):
        y_predicted = self.stable_sigmoid(np.dot(X, self.weights) + self.bias)
        return np.where(y_predicted >= 0.5, 1, 0)
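
As a rough end-to-end check, the sketch below runs the same gradient-descent loop on a small synthetic dataset. It is a condensed, self-contained stand-in for `fit` and `predict`, not the class itself. The data, the random seed, and the learning rate of 0.1 are assumptions chosen for illustration, and the plain sigmoid is used for brevity because the inputs stay small.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable toy labels

weights = np.zeros(X.shape[1])
bias = 0.0
learning_rate = 0.1
loss_history = []

for _ in range(1000):
    # Forward pass: linear score, then sigmoid (clipped so log stays finite)
    z = X @ weights + bias
    y_hat = np.clip(1 / (1 + np.exp(-z)), 1e-15, 1 - 1e-15)
    loss_history.append(np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))))

    # Gradients from the derivation: dw = X^T (y_hat - y) / n, db = mean(y_hat - y)
    error = y_hat - y
    weights -= learning_rate * (X.T @ error) / len(y)
    bias -= learning_rate * np.mean(error)

# Threshold the final probabilities at 0.5, as predict() does
predictions = (1 / (1 + np.exp(-(X @ weights + bias))) >= 0.5).astype(int)
accuracy = np.mean(predictions == y)
print(loss_history[0], loss_history[-1], accuracy)
```

The first recorded loss equals $\log 2 \approx 0.693$, because zero weights give a prediction of 0.5 for every sample. On data like this the loss should then fall steadily, which is the behavior to look for in `loss_history` after calling `fit`.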