Welcome to Homework 1! Your task for this homework is to implement logistic regression from scratch. You are welcome to use existing software packages to check your work. However, implementing it once in your life will help you better understand how this algorithm works.

Recall: Logistic regression is a classification model that estimates the probability of a binary outcome $Y$ being equal to 1, given variables/features $X$. It assumes that the log odds of $Y=1$ are linear in $X$. Because it can be viewed as a generalization of linear regression, it falls under the general umbrella of methods called "generalized linear models."
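
As a quick illustration of the linearity assumption, the sketch below pushes a linear predictor through the sigmoid and then recovers it by taking log odds. The helper names `sigmoid` and `log_odds` and the three example rows are made up for illustration; the coefficient values are the ones Part 2 asks you to use.

```python
import numpy as np

def sigmoid(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(prob):
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return np.log(prob / (1.0 - prob))

# Illustrative values only (same coefficients as Part 2, Q3).
beta0, beta = 1.0, np.array([0.5, 2.0])
X = np.array([[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]])

linear_predictor = beta0 + X @ beta   # beta_0 + beta^T x_i for each row
prob = sigmoid(linear_predictor)      # Pr(Y = 1 | X = x_i)

# The log odds of the modelled probabilities equal the linear predictor.
print(np.allclose(log_odds(prob), linear_predictor))
```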

Helpful resources:
* https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Concepts you'll need from lectures:
* Maximum likelihood estimation
* Gradient descent


### Part 1

**How does Logistic Regression work?**

Q1: What is the mathematical equation that describes the probability distribution of a binary random variable? (4 points)

$p(x; q) = q^x(1-q)^{1-x}$

Q2: What probability distribution does logistic regression assume $Y|X$ follows? (5 points)


Given data $(X_i, Y_i)$, the log likelihood of a parameter $\theta$ is:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^N \log \Pr(Y_i \mid X_i; \theta)$$


First, we write the log likelihood for $N$ observations:

$$\ell(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)$$

In binary logistic regression form, this becomes:

$$\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i;\beta) + (1-y_i) \log(1-p(x_i;\beta)) \right]$$

* Here $g_i = y_i \in \{0,1\}$ is the class label of observation $i$.

* $p(x_i;\beta) = \frac{1}{1 + e^{-\beta^T x_i}}$ is the probability that $Y_i = 1$ given $X_i = x_i$.

Given data $(X_i, Y_i)$ for $i=1,\cdots, N$, substituting $p(x_i;\beta)$ simplifies the log likelihood for the binary logistic regression parameters $\beta$ to:
$$\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1+e^{\beta^T x_i}) \right]$$
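
The two forms of $\ell(\beta)$ can be checked against each other numerically. This is a minimal sketch, assuming the design matrix `X` already contains a column of ones so that $\beta^T x_i$ includes the intercept; both function names are made up for illustration.

```python
import numpy as np

def log_likelihood_bernoulli(X, y, beta):
    """sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ] with p_i = sigmoid(beta^T x_i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_likelihood_compact(X, y, beta):
    """sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    z = X @ beta
    # np.logaddexp(0, z) computes log(1 + e^z) without overflow for large z.
    return np.sum(y * z - np.logaddexp(0.0, z))

# Small synthetic check (values are arbitrary, chosen only for illustration).
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(20), rng.standard_normal((20, 2))])
y_demo = rng.integers(0, 2, size=20)
beta_demo = np.array([1.0, 0.5, 2.0])

print(np.isclose(log_likelihood_bernoulli(X_demo, y_demo, beta_demo),
                 log_likelihood_compact(X_demo, y_demo, beta_demo)))
```

The compact form is usually the better one to implement, since it avoids taking the log of probabilities that round to 0 or 1.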

* For the Newton-Raphson update, $\mathbf{W}$ is an $N \times N$ diagonal matrix of weights where:
  * The $i$th diagonal element is $p(x_i;\beta^{\text{old}})(1 - p(x_i; \beta^{\text{old}}))$
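
These weights are what a Newton-Raphson (iteratively reweighted least squares) update uses. The sketch below shows one such step, $\beta^{\text{new}} = \beta^{\text{old}} + (X^T \mathbf{W} X)^{-1} X^T (y - p)$, under the assumption that `X` already includes a column of ones. The function name `newton_step` is hypothetical, and this is an alternative to the gradient descent you are asked to implement in Part 2, not a replacement for it.

```python
import numpy as np

def newton_step(X, y, beta_old):
    """One Newton-Raphson update for the logistic regression log likelihood."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_old)))   # p(x_i; beta_old)
    w = p * (1.0 - p)                           # diagonal entries of W
    gradient = X.T @ (y - p)                    # gradient of the log likelihood
    # X^T W X, computed by scaling rows of X instead of building the N x N matrix W.
    xtwx = X.T @ (X * w[:, None])
    return beta_old + np.linalg.solve(xtwx, gradient)
```

Calling `newton_step` repeatedly from `np.zeros(X.shape[1])` typically converges in a handful of iterations on well-behaved (non-separable) data.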
  


Q3: What are the parameters of a logistic regression model? (5 points)

Q4: What is the log likelihood of the parameters given observations $(X_i, Y_i)$ for $i=1,\cdots, n$? (8 points)

Q5: What is the optimization problem that we try to solve when fitting logistic regression?  (8 points)

Q6: What procedures can be used to solve the optimization problem underlying logistic regression? (5 points)

Q7: Derive the gradient of the log likelihood with respect to the parameters of the logistic regression model step by step. (5 points)
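
Once you have derived a gradient, one way to sanity-check it is to compare it with a finite-difference approximation of the log likelihood. The sketch below assumes your derivation arrives at the standard closed form $X^T (y - p)$, with `X` including a column of ones; all function names are made up for illustration.

```python
import numpy as np

def log_likelihood(X, y, beta):
    """sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    z = X @ beta
    return np.sum(y * z - np.logaddexp(0.0, z))

def analytic_gradient(X, y, beta):
    """Assumed closed form of the gradient: X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

def numerical_gradient(X, y, beta, eps=1e-6):
    """Central finite differences, one coordinate of beta at a time."""
    grad = np.zeros_like(beta, dtype=float)
    for j in range(beta.size):
        step = np.zeros_like(beta, dtype=float)
        step[j] = eps
        grad[j] = (log_likelihood(X, y, beta + step)
                   - log_likelihood(X, y, beta - step)) / (2 * eps)
    return grad
```

If `np.allclose(analytic_gradient(X, y, beta), numerical_gradient(X, y, beta), atol=1e-4)` is false for a random `beta`, the derivation or its implementation likely has a sign or indexing error.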

### Part 2

**Implement Logistic Regression**

Q1: Write the function `generate_X(n,p)`, which returns randomly generated $X_1,\cdots, X_n$, where $X_i \in \mathbb{R}^p$. You can sample the variables using a uniform distribution or a standard normal distribution. (8 points)
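
One possible shape for this function is sketched below, assuming standard normal features. It draws from NumPy's global random state so that a call to `np.random.seed(...)` beforehand makes the data reproducible; that choice is an assumption, not a requirement of the assignment.

```python
import numpy as np

def generate_X(n, p):
    """Return an (n, p) array whose rows X_1, ..., X_n are drawn from N(0, I_p)."""
    return np.random.standard_normal((n, p))
```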

Q2: Write the function `generate_Y(X, beta, intercept)`, which generates outcomes for observations $X_1,\cdots, X_n$ per a logistic regression model with coefficients $\beta \in \mathbb{R}^{p}$ and intercept $\beta_0$. (10 points)
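
A minimal sketch of one way to do this: compute $\Pr(Y_i = 1 \mid X_i)$ with the sigmoid, then draw each $Y_i$ as a Bernoulli variable with that probability. Using `np.random.binomial` with the global random state is an assumption made here for reproducibility via `np.random.seed(...)`.

```python
import numpy as np

def generate_Y(X, beta, intercept):
    """Sample Y_i ~ Bernoulli(sigmoid(intercept + beta^T X_i)) for each row of X."""
    linear_predictor = intercept + X @ np.asarray(beta)
    probs = 1.0 / (1.0 + np.exp(-linear_predictor))
    # A binomial with a single trial is a Bernoulli draw; probs is applied elementwise.
    return np.random.binomial(1, probs)
```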

Q3: Generate some data using your functions above with $p=2$, $n=1000$, coefficients $\beta=(0.5,2)$, and intercept $\beta_0 = 1$. (7 points)

Q4: Implement a function that runs gradient descent `run_gradient_descent(X, Y, alpha, num_iterations, initial_betas)`. Make sure to vectorize your code. (Otherwise it will run really slowly.) (15 points)
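
A vectorized version might look like the sketch below. Several details are assumptions rather than requirements: `initial_betas` has length $p+1$ with the intercept first, a column of ones is prepended to `X` inside the function, the cost is the mean negative log likelihood, and the function returns both the final parameters and the cost at every iteration.

```python
import numpy as np

def run_gradient_descent(X, Y, alpha, num_iterations, initial_betas):
    """Minimize the mean negative log likelihood of logistic regression."""
    n = X.shape[0]
    X_design = np.column_stack([np.ones(n), X])   # intercept column first
    betas = np.asarray(initial_betas, dtype=float).copy()
    cost_history = []
    for _ in range(num_iterations):
        z = X_design @ betas
        # Mean negative log likelihood; logaddexp(0, z) = log(1 + e^z).
        cost_history.append(np.mean(np.logaddexp(0.0, z) - Y * z))
        probs = 1.0 / (1.0 + np.exp(-z))
        gradient = X_design.T @ (probs - Y) / n    # one matrix product, no Python loop over i
        betas = betas - alpha * gradient
    return betas, cost_history
```

For example, `run_gradient_descent(X, Y, alpha=0.5, num_iterations=2000, initial_betas=np.zeros(X.shape[1] + 1))` is a reasonable starting configuration; the step size and iteration count are illustrative and may need tuning on your data.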

Q5: Apply your implementation of gradient descent to the generated data to estimate the parameters. How close are they to the true parameters? (5 points)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Plot the cost history to check convergence
plt.figure(figsize=(10, 5))
plt.plot(cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost (Negative Log Likelihood)')
plt.title('Cost History During Gradient Descent')
plt.grid(True, alpha=0.3)
plt.show()

# Visualize the data with both true and estimated decision boundaries
plt.figure(figsize=(10, 6))
plt.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0', alpha=0.6)
plt.scatter(X[Y==1, 0], X[Y==1, 1], label='Class 1', alpha=0.6)

# Plot the true decision boundary
x1_range = np.linspace(min(X[:, 0]), max(X[:, 0]), 100)
x2_true_boundary = -(true_intercept + true_beta[0] * x1_range) / true_beta[1]
plt.plot(x1_range, x2_true_boundary, 'r--', label='True Decision Boundary')

# Plot the estimated decision boundary
x2_estimated_boundary = -(estimated_intercept + estimated_coefficients[0] * x1_range) / estimated_coefficients[1]
plt.plot(x1_range, x2_estimated_boundary, 'g-', label='Estimated Decision Boundary')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Data with True and Estimated Decision Boundaries')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()


Q6: Rerun your implementation of gradient descent with a different initialization. Are the estimated parameters the same as those in Q5? (8 points)

In [None]:
# Plot cost histories for both initializations
plt.figure(figsize=(10, 5))
plt.plot(cost_history, label='Q5 (Zero Initialization)')
plt.plot(cost_history_new, label='Q6 (Random Initialization)')
plt.xlabel('Iteration')
plt.ylabel('Cost (Negative Log Likelihood)')
plt.title('Cost History Comparison for Different Initializations')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Visualize the decision boundaries from both initializations
plt.figure(figsize=(10, 6))
plt.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0', alpha=0.6)
plt.scatter(X[Y==1, 0], X[Y==1, 1], label='Class 1', alpha=0.6)

# Plot the true decision boundary
x1_range = np.linspace(min(X[:, 0]), max(X[:, 0]), 100)
x2_true_boundary = -(true_intercept + true_beta[0] * x1_range) / true_beta[1]
plt.plot(x1_range, x2_true_boundary, 'k--', label='True Decision Boundary')

# Plot the Q5 decision boundary
x2_q5_boundary = -(estimated_betas_q5[0] + estimated_betas_q5[1] * x1_range) / estimated_betas_q5[2]
plt.plot(x1_range, x2_q5_boundary, 'r-', label='Q5 Decision Boundary')

# Plot the Q6 decision boundary
x2_q6_boundary = -(estimated_betas_new[0] + estimated_betas_new[1] * x1_range) / estimated_betas_new[2]
plt.plot(x1_range, x2_q6_boundary, 'g-', label='Q6 Decision Boundary')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Decision Boundaries with Different Initializations')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

**Comparing your solution against scikit-learn**

Q7: Apply `sklearn.linear_model.LogisticRegression` to your generated data to estimate the parameters of a logistic regression model. (7 points)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Use the same random seed as before for consistency
np.random.seed(42)

# Generate data with the specified parameters (same as before)
n = 1000
p = 2
true_beta = np.array([0.5, 2])
true_intercept = 1

# Generate features and outcomes (same as before)
X = generate_X(n, p)
Y = generate_Y(X, true_beta, true_intercept)

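# Apply scikit-learn's LogisticRegression. C is the inverse regularization strength,
# so a very large C makes the penalty negligible and the fit approximates the plain MLE.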
sklearn_model = LogisticRegression(fit_intercept=True, C=1e10, solver='liblinear', max_iter=1000)
sklearn_model.fit(X, Y)

# Extract the estimated parameters
sklearn_intercept = sklearn_model.intercept_[0]
sklearn_coefficients = sklearn_model.coef_[0]

# Compare true and sklearn-estimated parameters
print("Parameter Comparison:")
print(f"True parameters: Intercept={true_intercept:.4f}, β₁={true_beta[0]:.4f}, β₂={true_beta[1]:.4f}")
print(f"sklearn estimates: Intercept={sklearn_intercept:.4f}, β₁={sklearn_coefficients[0]:.4f}, β₂={sklearn_coefficients[1]:.4f}")

# Calculate differences between true and sklearn-estimated parameters
diff_intercept = abs(true_intercept - sklearn_intercept)
diff_beta1 = abs(true_beta[0] - sklearn_coefficients[0])
diff_beta2 = abs(true_beta[1] - sklearn_coefficients[1])

print("\nDifferences between true and sklearn-estimated parameters:")
print(f"Intercept difference: {diff_intercept:.6f}")
print(f"β₁ difference: {diff_beta1:.6f}")
print(f"β₂ difference: {diff_beta2:.6f}")

# Visualize the data with true and sklearn decision boundaries
plt.figure(figsize=(10, 6))
plt.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0', alpha=0.6)
plt.scatter(X[Y==1, 0], X[Y==1, 1], label='Class 1', alpha=0.6)

# Plot the true decision boundary
x1_range = np.linspace(min(X[:, 0]), max(X[:, 0]), 100)
x2_true_boundary = -(true_intercept + true_beta[0] * x1_range) / true_beta[1]
plt.plot(x1_range, x2_true_boundary, 'r--', label='True Decision Boundary')

# Plot the sklearn decision boundary
x2_sklearn_boundary = -(sklearn_intercept + sklearn_coefficients[0] * x1_range) / sklearn_coefficients[1]
plt.plot(x1_range, x2_sklearn_boundary, 'g-', label='sklearn Decision Boundary')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Data with True and sklearn Decision Boundaries')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Compare predictions between true model and sklearn model
true_probs = 1 / (1 + np.exp(-(true_intercept + np.dot(X, true_beta))))
sklearn_probs = sklearn_model.predict_proba(X)[:, 1]

plt.figure(figsize=(8, 8))
plt.scatter(true_probs, sklearn_probs, alpha=0.5)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('True Model Probabilities')
plt.ylabel('sklearn Model Probabilities')
plt.title('Comparison of Predicted Probabilities')
plt.grid(True, alpha=0.3)
plt.show()

Q8: Are the answers the same as those from your implementation? (1 point)