<a href="https://colab.research.google.com/github/aherre52/MAT422/blob/main/HW_3_4_MAT_422.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **HW 3.4: Logistic regression**

Concepts covered:


* 3.4 Logistic regression

# 3.4 Logistic regression

Logistic regression models the probability of a binary outcome (Yes/No, True/False, Obese/Not-obese, etc.). Given input data points $(\alpha_i, b_i)$, where $\alpha_i \in \mathbb{R}^d$ are feature vectors and $b_i \in \{0, 1\}$ are labels, we aim to estimate the probability that $b = 1$. The model assumes that the log-odds, given by the **logit function**, are linear in the features: $\log \frac{p(\alpha; x)}{1 - p(\alpha; x)} = \alpha^T x$, where $x \in \mathbb{R}^d$ is the parameter vector. Solving for $p$ gives $p(\alpha; x) = \sigma(\alpha^T x)$, where $\sigma$ is the sigmoid function, defined as $\sigma(t) = \frac{1}{1 + e^{-t}}$. To fit the model to data, we stack the feature vectors $\alpha_i^T$ as the rows of a matrix $A \in \mathbb{R}^{n \times d}$, collect the labels in a vector $b$, and minimize the cross-entropy loss:

$$\mathcal{L}(x; A, b) = -\frac{1}{n} \sum_{i=1}^n \left( b_i \log(\sigma(\alpha_i^T x)) + (1 - b_i) \log(1 - \sigma(\alpha_i^T x)) \right)$$
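The definitions above can be checked numerically on a small example. The sketch below is not part of the assignment code: it uses only the standard-library `math` module, and the toy data `A`, `b` and the helper names `logit` and `cross_entropy` are made up for illustration. It shows that the logit inverts the sigmoid, and that at $x = 0$ every predicted probability is $0.5$, so the loss equals $\log 2$ regardless of the labels.

```python
import math

def sigmoid(t):
    """Map a real-valued score t to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Map a probability p in (0, 1) back to its log-odds."""
    return math.log(p / (1.0 - p))

def cross_entropy(x, A, b):
    """Mean cross-entropy loss for parameters x, feature rows A, labels b."""
    total = 0.0
    for alpha_i, b_i in zip(A, b):
        p = sigmoid(sum(a * w for a, w in zip(alpha_i, x)))
        total += b_i * math.log(p) + (1 - b_i) * math.log(1.0 - p)
    return -total / len(b)

# The logit undoes the sigmoid, so each round trip returns t up to rounding.
scores = (-2.0, 0.0, 0.5, 3.0)
round_trip = [logit(sigmoid(t)) for t in scores]
print(round_trip)

# Made-up toy data: three points with two features each (illustrative only).
A = [[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2]]
b = [1, 0, 1]

loss_at_zero = cross_entropy([0.0, 0.0], A, b)
loss_at_guess = cross_entropy([1.0, 0.0], A, b)
print(loss_at_zero, math.log(2))  # both are about 0.6931
print(loss_at_guess)              # smaller than the loss at x = 0
```

The value $\log 2 \approx 0.693$ is a useful baseline: a fitted model whose loss is not clearly below it is doing no better than predicting $0.5$ for every point.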

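The loss $\mathcal{L}$ is convex in $x$, so any local minimum is also a global minimum, which makes gradient descent a natural choice: it cannot get stuck in a spurious local minimum. Starting from an initial guess, we iteratively update $x \leftarrow x - \eta \nabla_x \mathcal{L}(x; A, b)$, where $\eta > 0$ is the learning rate. The gradient is computed as $$\nabla_x \mathcal{L}(x; A, b) = -\frac{1}{n} \sum_{i=1}^n (b_i - \sigma(\alpha_i^T x)) \alpha_i$$ and each step with a sufficiently small learning rate decreases the loss, improving the model's fit to the data.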
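One way to gain confidence in the gradient formula is to compare it with a central finite-difference approximation of the loss. The following is a minimal sketch under stated assumptions, not part of the assignment code: the toy data, the test point `x`, and the helpers `dot`, `loss`, `grad`, and `numeric_grad` are all hypothetical names chosen for this check.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def loss(x, A, b):
    """Mean cross-entropy loss."""
    total = 0.0
    for a_i, b_i in zip(A, b):
        p = sigmoid(dot(a_i, x))
        total += b_i * math.log(p) + (1 - b_i) * math.log(1.0 - p)
    return -total / len(b)

def grad(x, A, b):
    """Analytic gradient: -(1/n) * sum_i (b_i - sigmoid(alpha_i^T x)) * alpha_i."""
    n = len(b)
    g = [0.0] * len(x)
    for a_i, b_i in zip(A, b):
        residual = b_i - sigmoid(dot(a_i, x))
        for j in range(len(x)):
            g[j] -= residual * a_i[j] / n
    return g

def numeric_grad(x, A, b, h=1e-6):
    """Central-difference approximation of the gradient, one coordinate at a time."""
    g = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        g.append((loss(x_plus, A, b) - loss(x_minus, A, b)) / (2 * h))
    return g

# Made-up toy data and an arbitrary test point (illustrative only).
A = [[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2], [0.8, -0.4]]
b = [1, 0, 1, 0]
x = [0.2, -0.1]

analytic = grad(x, A, b)
numeric = numeric_grad(x, A, b)
max_diff = max(abs(u - v) for u, v in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric: ", numeric)
print("largest difference:", max_diff)
```

If the formula and the implementation agree, the largest difference should be tiny compared with the gradient entries themselves. A large discrepancy usually points to a sign error or a missing $1/n$ factor.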


Code Description

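The code implements logistic regression for a binary dependent variable using gradient descent. It first generates random sample data: 100 samples with 2 features each, drawn from a standard normal distribution, and binary labels $b_i \in \{0,1\}$ drawn from a binomial distribution with a single trial and success probability 0.5 (that is, a Bernoulli distribution). Because the labels are generated independently of the features, there is no real signal to learn, so the fitted parameters should stay close to zero and the predicted probabilities close to 0.5. The cross-entropy loss function measures prediction error, and its gradient is used to update the parameters. Gradient descent iteratively adjusts the parameters $x$, starting from $x_0 = 0$ with a learning rate of 0.1, until the gradient norm falls below $10^{-5}$ or 1000 iterations are reached, and returns the resulting parameter vector `optimal_x`. Finally, the code uses these parameters to predict probabilities for 10 new data points.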


In [1]:
import numpy as np

# generate the sample data
np.random.seed(42)
n_samples = 100
# Number of features
d = 2
A = np.random.randn(n_samples, d)

# Binary labels in {0, 1}: one Bernoulli(0.5) draw per sample
# (binomial with a single trial), generated independently of A
b = np.random.binomial(1, 0.5, size=n_samples)

# Define the sigmoid function sigma(t) = 1 / (1 + exp(-t))
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Define the cross-entropy loss function
def cross_entropy_loss(x, A, b):
    epsilon = 1e-10  # To avoid log(0) error
    predictions = sigmoid(A.dot(x))
    return -np.mean(b * np.log(predictions + epsilon) + (1 - b) * np.log(1 - predictions + epsilon))

# Gradient of the cross-entropy loss
def gradient(x, A, b):
    predictions = sigmoid(A.dot(x))
    return -np.mean((b - predictions)[:, None] * A, axis=0)

# Gradient descent optimization
def gradient_descent(A, b, x0, learning_rate=0.1, max_iter=1000, tol=1e-5):
    x = x0
    for _ in range(max_iter):
        gradient_value = gradient(x, A, b)
        # Step against the gradient, scaled by the learning rate
        x = x - learning_rate * gradient_value
        # Stop once the gradient (evaluated before this step) is below tol
        if np.linalg.norm(gradient_value) < tol:
            break
    return x

# Initial guess for parameters
x0 = np.zeros(d)

# Perform gradient descent
optimal_x = gradient_descent(A, b, x0)

# Predict probabilities for new data
new_data = np.random.randn(10, d)
predicted_probabilities = sigmoid(new_data.dot(optimal_x))

# Print the fitted parameters and the predicted probabilities
print("Optimal parameters:", optimal_x)
print("Predicted probabilities:", predicted_probabilities)


Optimal parameters: [-0.07255775 -0.19126973]
Predicted probabilities: [0.47464325 0.52465884 0.52732202 0.47010067 0.52663187 0.4690272
 0.45526782 0.54172665 0.45736198 0.49476904]
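
The predicted probabilities above all sit close to 0.5, which is consistent with the labels `b` having been drawn independently of `A`. As a small illustration of how such probabilities are typically turned into class predictions, the sketch below copies the printed values and applies a cutoff. The threshold of 0.5 and the names `threshold` and `predicted_labels` are assumptions made for this example, not part of the original code.

```python
# Values copied from the printed output above
predicted_probabilities = [
    0.47464325, 0.52465884, 0.52732202, 0.47010067, 0.52663187,
    0.4690272, 0.45526782, 0.54172665, 0.45736198, 0.49476904,
]

threshold = 0.5  # assumed cutoff: predict class 1 when p >= threshold
predicted_labels = [1 if p >= threshold else 0 for p in predicted_probabilities]
print(predicted_labels)  # [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

# Largest distance of any predicted probability from 0.5
max_gap = max(abs(p - 0.5) for p in predicted_probabilities)
print(max_gap)  # below 0.05, so no prediction is confident
```

Since every probability lies within 0.05 of the cutoff, these class labels carry little information, as expected for data with no relationship between features and labels.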
