
# Task 1: Advanced Objective Function and Use Case

## Derive the objective function for Logistic Regression using Maximum Likelihood Estimation (MLE)
1. Given a binary classification dataset
$\mathscr D =\{(x_i,y_i)\}_{i=1}^N,\quad
x_i\in\mathbb R^d,\; y_i\in\{0,1\}$,
logistic regression models the conditional probability as
$$
p(y_i=1\mid x_i;w)=\sigma(w^\top x_i),
\quad
\sigma(z)=\frac{1}{1+e^{-z}}.
$$
Because $y_i$ is binary, the conditional probability for $y_i = 0$ is the complement of the probability for $y_i = 1$:
$$
p(y_i=0\mid x_i;w)=1-\sigma(w^\top x_i).
$$

2. Each label $y_i$ is assumed to follow a Bernoulli distribution. The conditional likelihood can be written compactly as
$$
p(y_i\mid x_i;w)
=\sigma(w^\top x_i)^{y_i}
\left(1-\sigma(w^\top x_i)\right)^{1-y_i}.
$$
Assuming the samples are i.i.d., the likelihood over the entire dataset is
$$
\mathscr{L}(w)
=\prod_{i=1}^N p(y_i\mid x_i;w)
=\prod_{i=1}^N
\sigma(w^\top x_i)^{y_i}
\left(1-\sigma(w^\top x_i)\right)^{1-y_i}.
$$

3. Taking the logarithm of the likelihood yields the log-likelihood:
$$
\ell(w)
=\log \mathscr{L}(w)
=\sum_{i=1}^N
\Big[
y_i \log \sigma(w^\top x_i)
+(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)
\Big].
$$

4. Maximum Likelihood Estimation seeks the parameter $w$ that maximizes the log-likelihood:
$$
w_{\text{MLE}}=\arg\max_w \ell(w).
$$


5. Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood. Therefore, the objective function of logistic regression under MLE is
$$
J_{\text{MLE}}(w)
=-\ell(w)
=\sum_{i=1}^N
\Big[
- y_i \log \sigma(w^\top x_i)
-(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)
\Big].
$$
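
The five steps above can be sketched in plain Python. This is a minimal illustration, not the assignment's implementation: the helper names (`dot`, `sigmoid`, `likelihood`, `log_likelihood`, `nll`, `nll_grad`, `fit_mle`), the toy data, the learning rate, and the step count are all assumptions. The gradient $\sum_i (\sigma(w^\top x_i)-y_i)\,x_i$ is the standard gradient of $J_{\text{MLE}}$; it is not derived in the text above.

```python
import math

def dot(w, x):
    """Inner product w^T x for plain Python lists."""
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(z):
    """Logistic function, branching on the sign of z to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def likelihood(w, X, y):
    """Step 2: product of Bernoulli terms sigma^y * (1 - sigma)^(1 - y)."""
    value = 1.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        value *= (p ** yi) * ((1.0 - p) ** (1 - yi))
    return value

def log_likelihood(w, X, y):
    """Step 3: sum of y*log(sigma) + (1 - y)*log(1 - sigma)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def nll(w, X, y):
    """Step 5: J_MLE(w) = -l(w)."""
    return -log_likelihood(w, X, y)

def nll_grad(w, X, y):
    """Gradient of J_MLE: sum_i (sigma(w^T x_i) - y_i) * x_i."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        residual = sigmoid(dot(w, xi)) - yi
        for j, xij in enumerate(xi):
            grad[j] += residual * xij
    return grad

def fit_mle(X, y, lr=0.1, steps=2000):
    """Step 4: approximate w_MLE by plain gradient descent on J_MLE."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = nll_grad(w, X, y)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy data: the first column is a constant 1 (bias term) and the
# labels are not linearly separable, so a finite maximizer exists.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1]

w0 = [0.0, 0.0]
w_hat = fit_mle(X, y)
print("likelihood at w = 0:", likelihood(w0, X, y))
print("J_MLE at w = 0:     ", nll(w0, X, y))
print("fitted w:           ", w_hat)
print("J_MLE at fitted w:  ", nll(w_hat, X, y))
```

At $w=0$ every predicted probability is $0.5$, so the likelihood is $0.5^5$ and $J_{\text{MLE}}=5\log 2$. For large $N$ the raw product underflows in floating point, which is a practical reason to work with the log-likelihood rather than the likelihood itself.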


## Research on the MAP technique for Logistic Regression

### 1) MLE vs. MAP: definitions
Given data  $
\mathscr{D}=\{(x_i,y_i)\}_{i=1}^N,\quad y_i\in\{0,1\},
$ logistic regression models $
p(y_i=1\mid x_i;w)=\sigma(w^\top x_i),\quad \sigma(z)=\frac{1}{1+e^{-z}}.
$

**MLE (Maximum Likelihood Estimation)**: $
w_{\text{MLE}}=\arg\max_w \; p(\mathscr{D}\mid w).
$

**MAP (Maximum A Posteriori)** incorporates a prior $p(w)$:
$
w_{\text{MAP}}=\arg\max_w \; p(w\mid\mathscr{D})
=\arg\max_w \; p(\mathscr{D}\mid w)\,p(w).
$
This follows from Bayes' rule:
$
p(w\mid\mathscr{D})=\frac{p(\mathscr{D}\mid w)\,p(w)}{p(\mathscr{D})}.
$

---

### 2) MAP objective form
Take logs:

$$
\log p(w\mid\mathscr{D})
=\log p(\mathscr{D}\mid w)+\log p(w)+C.
$$

Here $C=-\log p(\mathscr{D})$ does not depend on $w$, so it can be dropped from the optimization:

$$
w_{\text{MAP}}
=\arg\max_w\Big(\log p(\mathscr{D}\mid w)+\log p(w)\Big).
$$

Equivalently in minimization form:

$$
w_{\text{MAP}}
=\arg\min_w\Big(-\log p(\mathscr{D}\mid w)-\log p(w)\Big).
$$

---

### 3) Logistic regression negative log-likelihood (MLE loss)
The logistic regression negative log-likelihood is:

$$
J_{\text{MLE}}(w)
=-\log p(\mathscr{D}\mid w)
=\sum_{i=1}^N\Big[
- y_i\log\sigma(w^\top x_i)
-(1-y_i)\log(1-\sigma(w^\top x_i))
\Big].
$$

---

### 4) Key idea: MAP = MLE loss + regularization
From the MAP minimization objective:

$$
J_{\text{MAP}}(w)=J_{\text{MLE}}(w)-\log p(w).
$$

So MAP differs from MLE by adding a penalty term derived from the prior.
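
A small sketch of this decomposition, assuming a hypothetical helper `map_objective` that takes the negative log-prior as a callable. The function names, the toy data, and the two example priors are illustrative assumptions. With a constant (flat) prior the penalty vanishes and the MAP objective coincides with the MLE objective.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def softplus(z):
    """log(1 + exp(z)), computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def nll(w, X, y):
    """J_MLE(w), using -y*log(s) - (1-y)*log(1-s) = softplus(z) - y*z."""
    return sum(softplus(dot(w, xi)) - yi * dot(w, xi) for xi, yi in zip(X, y))

def map_objective(w, X, y, neg_log_prior):
    """J_MAP(w) = J_MLE(w) - log p(w); neg_log_prior(w) returns -log p(w)."""
    return nll(w, X, y) + neg_log_prior(w)

def flat_prior(w):
    # Constant prior: -log p(w) has no dependence on w, so it is taken as 0.
    return 0.0

def unit_gaussian_prior(w):
    # -log p(w) for a zero-mean, unit-variance Gaussian, up to a constant.
    return 0.5 * sum(wj * wj for wj in w)

# Hypothetical toy data and weights, for illustration only.
X = [[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]]
y = [0, 1, 1]
w = [0.3, 1.2]

print("J_MLE:              ", nll(w, X, y))
print("J_MAP, flat prior:  ", map_objective(w, X, y, flat_prior))
print("J_MAP, Gaussian:    ", map_objective(w, X, y, unit_gaussian_prior))
```

Passing the prior as a callable keeps the likelihood term fixed, so different priors can be compared by changing only the penalty.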

---

## Common priors and their MAP interpretation

### (A) Gaussian prior $\rightarrow$ L2 regularization
Assume a zero-mean Gaussian prior:

$$
p(w)=\mathcal{N}(0,\sigma^2 I).
$$

Then:

$$
-\log p(w)=\frac{1}{2\sigma^2}\|w\|_2^2 + \text{const}.
$$

So MAP yields:

$$
J_{\text{MAP}}(w)
=J_{\text{MLE}}(w)+\lambda\|w\|_2^2,
\quad \lambda=\frac{1}{2\sigma^2}.
$$

This shows:

> **L2-regularized logistic regression = MAP with Gaussian prior.**
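
As a quick numerical check of the identity above, the sketch below compares the exact negative log-density of $\mathcal{N}(0,\sigma^2 I)$ with the penalty $\lambda\|w\|_2^2$, $\lambda=1/(2\sigma^2)$. The function names and the test vectors are assumptions chosen for illustration. The difference should be the same constant for every $w$, which is why it does not affect the minimizer.

```python
import math

def neg_log_gaussian_prior(w, sigma):
    """Exact -log N(w; 0, sigma^2 I), including the normalizing constant."""
    d = len(w)
    squared_norm = sum(wj * wj for wj in w)
    return squared_norm / (2.0 * sigma ** 2) + 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)

def l2_penalty(w, sigma):
    """lambda * ||w||_2^2 with lambda = 1 / (2 * sigma^2)."""
    lam = 1.0 / (2.0 * sigma ** 2)
    return lam * sum(wj * wj for wj in w)

sigma = 2.0  # hypothetical prior standard deviation
gaps = []
for w in ([0.0, 0.0], [1.0, -3.0], [0.5, 4.0]):
    gap = neg_log_gaussian_prior(w, sigma) - l2_penalty(w, sigma)
    gaps.append(gap)
    print(w, gap)
```

A smaller $\sigma$ means a tighter prior around zero and therefore a larger $\lambda$, i.e., stronger shrinkage.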

---

### (B) Laplace prior $\rightarrow$ L1 regularization
Assume an independent zero-mean Laplace prior with scale $b$ on each weight:

$$
p(w_j)\propto \exp\left(-\frac{|w_j|}{b}\right).
$$

Then:

$$
-\log p(w)=\frac{1}{b}\|w\|_1 + \text{const}.
$$

So MAP yields:

$$
J_{\text{MAP}}(w)
=J_{\text{MLE}}(w)+\lambda\|w\|_1,
\quad \lambda=\frac{1}{b}.
$$

This shows:

> **L1-regularized logistic regression = MAP with Laplace prior.**
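
The same kind of check can be sketched for the Laplace case, assuming independent coordinates with density $\frac{1}{2b}\exp(-|w_j|/b)$. The function names and numbers below are illustrative assumptions. The second half compares the pull each penalty exerts on a weight close to zero: the L1 subgradient keeps magnitude $\lambda$, while the L2 gradient $2\lambda w_j$ vanishes, which is the usual explanation for why L1 tends to produce exact zeros.

```python
import math

def neg_log_laplace_prior(w, b):
    """Exact -log of prod_j exp(-|w_j| / b) / (2b)."""
    return sum(abs(wj) for wj in w) / b + len(w) * math.log(2.0 * b)

def l1_penalty(w, b):
    """lambda * ||w||_1 with lambda = 1 / b."""
    lam = 1.0 / b
    return lam * sum(abs(wj) for wj in w)

def l1_subgradient(wj, lam):
    """A subgradient of lam * |wj|; 0 is chosen at wj = 0."""
    if wj > 0:
        return lam
    if wj < 0:
        return -lam
    return 0.0

def l2_gradient(wj, lam):
    """Gradient of lam * wj^2."""
    return 2.0 * lam * wj

b = 0.5  # hypothetical Laplace scale
gaps = []
for w in ([0.0, 0.0], [1.0, -3.0], [0.5, 4.0]):
    gaps.append(neg_log_laplace_prior(w, b) - l1_penalty(w, b))
print("constant offsets:", gaps)

lam = 0.5
tiny = 1e-6  # a weight very close to zero
print("L1 pull near zero:", l1_subgradient(tiny, lam))
print("L2 pull near zero:", l2_gradient(tiny, lam))
```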

---

## MAP vs. MLE: differences (brief)

1. **Uses a prior**
   - MLE maximizes $p(\mathscr{D}\mid w)$
   - MAP maximizes $p(\mathscr{D}\mid w)\,p(w)$

2. **Regularization interpretation**
   - MAP adds $-\log p(w)$, i.e., a complexity penalty
   - Gaussian prior $\rightarrow$ L2 shrinkage
   - Laplace prior $\rightarrow$ L1 sparsity

3. **When MAP helps**
   MAP often improves generalization when:
   - dataset is small
   - number of features is large
   - features are noisy or highly correlated

As $N\to\infty$ with a fixed prior, the log-likelihood grows with $N$ while $\log p(w)$ does not, so the likelihood dominates and MAP approaches MLE.
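
The limiting behaviour can be illustrated on synthetic data. The sketch below is an assumption-laden toy experiment, not a result from the references: the generating weights `w_true`, the fixed penalty `lam=0.5`, the sample sizes, the seed, and the gradient-descent settings are all made up for illustration. It fits the MLE (`lam=0`) and the L2-penalized MAP estimate on a small and a larger sample and reports the distance between the two estimates.

```python
import math
import random

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(z):
    """Logistic function, branching on the sign of z to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit(X, y, lam, lr=1.0, steps=400):
    """Gradient descent on (J_MLE(w) + lam * ||w||_2^2) / N; lam = 0 gives MLE."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [2.0 * lam * wj for wj in w]  # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            residual = sigmoid(dot(w, xi)) - yi
            for j in range(d):
                grad[j] += residual * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def make_data(n, w_true, rng):
    """Draw x ~ N(0, 1) with a bias column and y ~ Bernoulli(sigmoid(w_true^T x))."""
    X, y = [], []
    for _ in range(n):
        xi = [1.0, rng.gauss(0.0, 1.0)]
        X.append(xi)
        y.append(1 if rng.random() < sigmoid(dot(w_true, xi)) else 0)
    return X, y

rng = random.Random(0)
w_true = [0.5, 2.0]  # hypothetical generating weights
gaps = {}
for n in (20, 500):
    X, y = make_data(n, w_true, rng)
    w_mle = fit(X, y, lam=0.0)
    w_map = fit(X, y, lam=0.5)
    gaps[n] = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_mle, w_map)))
    print(n, w_mle, w_map, gaps[n])
```

With the penalty held fixed, the distance between the two estimates is expected to be noticeably smaller on the larger sample, because the data term grows with $N$ while the penalty does not.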


## References (IEEE)

[1] S. Aswani, “IEOR 165 – Lecture 8: Regularization (Maximum A Posteriori Estimation),” University of California, Berkeley. [Online]. Available: https://aswani.ieor.berkeley.edu/teaching/SP16/165/lecture_notes/ieor165_lec8.pdf

[2] X. Fern, “Logistic Regression,” Oregon State University, CS534 Lecture Notes. [Online]. Available: https://web.engr.oregonstate.edu/~xfern/classes/cs534-18/Logistic-Regression-3-updated.pdf

[3] Cornell University, “CS4780 Lecture Note 06: Logistic Regression,” Dept. of Computer Science. [Online]. Available: https://www.cs.cornell.edu/courses/cs4780/2023fa/lectures/lecturenote06.html

[4] S.-I. Lee, H. Lee, P. Abbeel, and A. Y. Ng, “Efficient L1 Regularized Logistic Regression,” in *Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06)*, 2006. [Online]. Available: https://cdn.aaai.org/AAAI/2006/AAAI06-064.pdf


## Define a machine learning problem to solve using Logistic Regression

# Task 2: Dataset and Advanced EDA

In [None]:
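import pandas as pd
from google.colab import drive

# Mount Google Drive so that files under /content/drive are readable (Colab only).
drive.mount('/content/drive')

# file_path must be the full path to the CSV file; the value below is only the
# Drive root, so append the file's actual location before running read_csv.
file_path = '/content/drive/MyDrive/'
df = pd.read_csv(file_path)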