
# Task 1: Advanced Objective Function and Use Case

## Derive the objective function for Logistic Regression using Maximum Likelihood Estimation (MLE)
1. Given a binary classification dataset
$\mathscr D =\{(x_i,y_i)\}_{i=1}^N,\quad
x_i\in\mathbb R^d,\; y_i\in\{0,1\}$,
logistic regression models the conditional probability as
$$
p(y_i=1\mid x_i;w)=\sigma(w^\top x_i),
\quad
\sigma(z)=\frac{1}{1+e^{-z}}.
$$
Because $y_i$ is binary, the conditional probability for $y_i = 0$ follows directly from the probability for $y_i = 1$:
$$
p(y_i=0\mid x_i;w)=1-\sigma(w^\top x_i).
$$

2. Each label $y_i$ is assumed to follow a Bernoulli distribution. The conditional likelihood can be written compactly as
$$
p(y_i\mid x_i;w)
=\sigma(w^\top x_i)^{y_i}
\left(1-\sigma(w^\top x_i)\right)^{1-y_i}.
$$
Assuming the samples are i.i.d., the likelihood over the entire dataset is
$$
\mathscr{L}(w)
=\prod_{i=1}^N p(y_i\mid x_i;w)
=\prod_{i=1}^N
\sigma(w^\top x_i)^{y_i}
\left(1-\sigma(w^\top x_i)\right)^{1-y_i}.
$$

3. Taking the logarithm of the likelihood yields the log-likelihood:
$$
\ell(w)
=\log \mathscr{L}(w)
=\sum_{i=1}^N
\Big[
y_i \log \sigma(w^\top x_i)
+(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)
\Big].
$$

4. Maximum Likelihood Estimation seeks the parameter $w$ that maximizes the log-likelihood:
$$
w_{\text{MLE}}=\arg\max_w \ell(w).
$$


5. Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood. Therefore, the objective function of logistic regression under MLE is
$$
J_{\text{MLE}}(w)
=-\ell(w)
=\sum_{i=1}^N \Big[ - y_i \log \sigma(w^\top x_i) -(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)\Big].
$$
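
The objective above can be evaluated directly. The following is a minimal sketch in plain Python (no external libraries), assuming a tiny made-up dataset; the helper names `softplus`, `nll`, and `nll_gradient` are illustrative and do not come from the original notebook. It relies on the identity $-y\log\sigma(z)-(1-y)\log(1-\sigma(z))=\log(1+e^{z})-yz$ for numerical stability, and it also includes the standard gradient $\sum_i(\sigma(w^\top x_i)-y_i)\,x_i$, which is not derived above.

```python
import math

def softplus(z):
    """log(1 + exp(z)), computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(w, x):
    """Inner product w^T x for two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))

def nll(w, X, y):
    """J_MLE(w) = sum_i [softplus(z_i) - y_i * z_i], with z_i = w^T x_i."""
    return sum(softplus(dot(w, xi)) - yi * dot(w, xi) for xi, yi in zip(X, y))

def nll_gradient(w, X, y):
    """Gradient of J_MLE: sum_i (sigmoid(z_i) - y_i) * x_i."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(dot(w, xi)) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return grad

# Toy data (hypothetical values, for illustration only); first column is a bias term
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
w = [0.1, -0.3]

print(nll(w, X, y))
print(nll_gradient(w, X, y))
```

At $w=0$ every sample contributes $\log 2$ to the objective, which gives a quick sanity check for any implementation.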


## Research on the MAP technique for Logistic Regression

### MLE vs. MAP: definitions
Given data  $
\mathscr{D}=\{(x_i,y_i)\}_{i=1}^N,\quad y_i\in\{0,1\},
$ logistic regression models $
p(y_i=1\mid x_i;w)=\sigma(w^\top x_i),\quad \sigma(z)=\frac{1}{1+e^{-z}}.
$

**MLE (Maximum Likelihood Estimation)**: $
w_{\text{MLE}}=\arg\max_w \; p(\mathscr{D}\mid w).
$

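**MAP (Maximum A Posteriori)** incorporates a prior $p(w)$ and maximizes the posterior obtained from Bayes' rule:
$w_{\text{MAP}}
=\arg\max_w \; p(w\mid\mathscr{D})
=\arg\max_w \; \frac{p(\mathscr{D}\mid w)\,p(w)}{p(\mathscr{D})}
=\arg\max_w \; p(\mathscr{D}\mid w)\,p(w),$
where the last step drops $p(\mathscr{D})$ because it does not depend on $w$.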


### MLE vs. MAP: objective form
**MAP objective form**:
$
w_{\text{MAP}}
=\arg\min_w\Big(-\log p(\mathscr{D}\mid w)-\log p(w)\Big).
$

**Logistic regression negative log-likelihood (MLE loss)**:
$
J_{\text{MLE}}(w)
=-\log p(\mathscr{D}\mid w)
=\sum_{i=1}^N\Big[ - y_i\log\sigma(w^\top x_i) -(1-y_i)\log(1-\sigma(w^\top x_i))
\Big].
$


### MAP = MLE loss + regularization
From the MAP minimization objective:

$$
J_{\text{MAP}}(w)=J_{\text{MLE}}(w)-\log p(w).
$$

So MAP differs from MLE by adding a penalty term derived from the prior.
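
As a rough illustration of this relationship, the sketch below writes $J_{\text{MAP}}$ as the negative log-likelihood minus an arbitrary log-prior. The function names, the toy data, and the two example priors are assumptions made for the example. With a constant (flat) log-prior the MAP objective coincides with the MLE objective, while an unnormalized Gaussian log-prior adds a penalty.

```python
import math

def nll(w, X, y):
    """Negative log-likelihood J_MLE(w) of logistic regression."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        # log(1 + exp(z)) - y * z, using an overflow-safe log(1 + exp(z))
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
    return total

def map_objective(w, X, y, log_prior):
    """J_MAP(w) = J_MLE(w) - log p(w), for any callable log_prior(w)."""
    return nll(w, X, y) - log_prior(w)

def flat_log_prior(w):
    """Constant log-prior: every w is equally plausible a priori."""
    return 0.0

def gaussian_log_prior(w, sigma2=1.0):
    """Log of an unnormalized zero-mean Gaussian prior with variance sigma2."""
    return -sum(wj * wj for wj in w) / (2.0 * sigma2)

# Hypothetical toy data and weights, for illustration only
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
w = [0.4, 1.2]

print(nll(w, X, y))
print(map_objective(w, X, y, flat_log_prior))      # same value as J_MLE
print(map_objective(w, X, y, gaussian_log_prior))  # J_MLE plus a penalty
```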


### Common priors and their MAP interpretation

1. Gaussian prior -> L2 regularization
Assume a zero-mean Gaussian prior $p(w)=\mathcal{N}(0,\sigma^2 I)$, where $\sigma^2$ denotes the prior variance (not the sigmoid function). Then $-\log p(w)=\frac{1}{2\sigma^2}\|w\|_2^2 + \text{const}$, and the constant can be dropped because it does not affect the minimizer. So MAP yields:
$$
J_{\text{MAP}}(w)
=J_{\text{MLE}}(w)+\lambda\|w\|_2^2,
\quad \lambda=\frac{1}{2\sigma^2}.
$$
This shows: **L2-regularized logistic regression = MAP with Gaussian prior.**

2. Laplace prior -> L1 regularization
Assume independent zero-mean Laplace priors on the weights, $p(w_j)\propto \exp\left(-\frac{|w_j|}{b}\right)$. Then $-\log p(w)=\frac{1}{b}\|w\|_1+\text{const}$, and the constant can again be dropped. So MAP yields:
$$
J_{\text{MAP}}(w)
=J_{\text{MLE}}(w)+\lambda\|w\|_1,
\quad \lambda=\frac{1}{b}.
$$
This shows: **L1-regularized logistic regression = MAP with Laplace prior.**
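
The correspondence between priors and penalties can be checked numerically. The sketch below is an illustration under assumed prior scales (`sigma2` and `b` are arbitrary choices, and the helper names are hypothetical). It evaluates the full prior densities, takes the negative logarithm, and subtracts the matching penalty with $\lambda=\frac{1}{2\sigma^2}$ for the Gaussian and $\lambda=\frac{1}{b}$ for the Laplace prior. What remains should be the same constant for every $w$.

```python
import math

def gaussian_prior_pdf(w, sigma2):
    """Density of N(0, sigma2 * I) evaluated at w."""
    p = 1.0
    for wj in w:
        p *= math.exp(-wj * wj / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
    return p

def laplace_prior_pdf(w, b):
    """Density of independent zero-mean Laplace(b) components evaluated at w."""
    p = 1.0
    for wj in w:
        p *= math.exp(-abs(wj) / b) / (2.0 * b)
    return p

def penalty_gaps(w, sigma2, b):
    """Return (-log p_gauss(w) - lam2 * ||w||_2^2, -log p_laplace(w) - lam1 * ||w||_1)."""
    lam2 = 1.0 / (2.0 * sigma2)
    lam1 = 1.0 / b
    gap_l2 = -math.log(gaussian_prior_pdf(w, sigma2)) - lam2 * sum(wj * wj for wj in w)
    gap_l1 = -math.log(laplace_prior_pdf(w, b)) - lam1 * sum(abs(wj) for wj in w)
    return gap_l2, gap_l1

sigma2, b = 4.0, 2.0  # hypothetical prior scales
for w in ([0.0, 0.0], [1.0, -2.0], [3.0, 0.5]):
    print(w, penalty_gaps(w, sigma2, b))  # the gaps do not change with w
```

Because the leftover term does not depend on $w$, minimizing the penalized objective and maximizing the posterior select the same weights.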


### MAP vs. MLE: differences
MAP differs from MLE by incorporating a prior distribution over the model parameters. While MLE estimates parameters by maximizing only the likelihood $p(\mathscr{D}\mid w)$, MAP maximizes the posterior $p(w\mid\mathscr{D}) \propto p(\mathscr{D}\mid w)p(w)$, combining data fit with prior belief. As a result, MAP is equivalent to minimizing the negative log-likelihood plus a regularization term $-\log p(w)$. In practice, MAP often generalizes better than MLE in small-data or high-dimensional settings by preventing overly large weights and reducing overfitting.
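
The effect on weight size can be seen on a toy problem. The sketch below is a minimal illustration with assumptions of its own: a single feature with no intercept, four made-up and perfectly separable points, plain gradient descent, and an L2 penalty with $\lambda=0.5$. On separable data the unregularized (MLE) weight keeps growing as training continues, whereas the penalized (MAP) weight converges to a finite value.

```python
import math

def sigmoid(z):
    """Logistic function, stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit(x, y, lam, steps, lr=0.1):
    """Gradient descent on J_MLE(w) + lam * w**2 for a single weight w (no intercept)."""
    w = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(w * xi) - yi) * xi for xi, yi in zip(x, y)) + 2.0 * lam * w
        w -= lr * grad
    return w

# Four perfectly separable points (hypothetical data): positive x means y = 1
x = [2.0, 1.0, -1.0, -2.0]
y = [1, 1, 0, 0]

for steps in (100, 1000, 10000):
    w_mle = fit(x, y, lam=0.0, steps=steps)
    w_map = fit(x, y, lam=0.5, steps=steps)
    print(steps, w_mle, w_map)
# The MLE weight keeps growing with more steps, while the MAP weight settles.
```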




## Define a machine learning problem to solve using Logistic Regression
### Problem Definition (RPI GPA → Salary Outcome)
Given an RPI undergraduate student dataset $\mathscr{D}=\{(x_i,y_i)\}_{i=1}^N$, each feature vector $x_i\in\mathbb{R}^d$ contains student information such as:
- overall GPA  
- Math department GPA  
- major / department  
- internship experience (yes/no or count)  
- research experience  
- graduation year, etc.

The target label is defined as a binary variable:
$$
y_i=
\begin{cases}
1, & \text{if the student’s salary 10 years after graduation } \ge T,\\
0, & \text{otherwise},
\end{cases}
$$
where $T$ is a chosen salary threshold (e.g., $T=\$150{,}000$).

Logistic Regression models the probability of reaching the high-salary threshold as $p(y_i=1\mid x_i;w)=\sigma(w^\top x_i)$, with $\sigma(z)=\frac{1}{1+e^{-z}}$.

The prediction is then
$$
\hat{y}_i=
\begin{cases}
1,& \text{if } p(y_i=1\mid x_i;w)\ge 0.5,\\
0,& \text{otherwise}.
\end{cases}
$$
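
The decision rule can be sketched in a few lines. The weights and student feature values below are invented for illustration and are not fitted to any real data; `predict_proba` and `predict` are hypothetical helper names. Since $\sigma(z)\ge 0.5$ exactly when $z\ge 0$, the default rule amounts to checking the sign of $w^\top x$.

```python
import math

def predict_proba(w, x):
    """p(y=1 | x; w) = sigmoid(w^T x), computed in an overflow-safe way."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, x, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(w, x) >= threshold else 0

# Hypothetical weights for [bias, overall GPA, internship count]; not fitted to real data
w = [-8.0, 2.0, 0.5]
students = [
    [1.0, 3.9, 2.0],  # z = -8 + 7.8 + 1.0 = 0.8
    [1.0, 3.0, 1.0],  # z = -8 + 6.0 + 0.5 = -1.5
]
for x in students:
    print(round(predict_proba(w, x), 3), predict(w, x))
```

Passing a different `threshold` (for example 0.9) makes the rule stricter without refitting the model.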


### Justification: Why Logistic Regression?

Logistic Regression is an appropriate choice for this problem because it is designed for binary classification and directly models the probability of a student reaching a high-salary outcome:
$$
p(y=1\mid x;w)=\sigma(w^\top x).
$$
This probabilistic output is useful because it provides not only a class prediction but also a confidence score, which allows flexible decision thresholds (e.g., choosing a stricter threshold for identifying high-salary candidates). In addition, Logistic Regression is interpretable: the sign and magnitude of each learned weight show how a feature such as overall GPA or Math GPA raises or lowers the log-odds of reaching the salary threshold.
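
One common way to read the weights is through odds: raising feature $x_j$ by one unit, with the other features held fixed, multiplies the odds $p/(1-p)$ by $e^{w_j}$. The sketch below checks this numerically; the weights and the student profile are hypothetical values chosen only for the example.

```python
import math

def odds(w, x):
    """Odds p / (1 - p) of the positive class, which equal exp(w^T x)."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return p / (1.0 - p)

# Hypothetical weights for [bias, overall GPA, Math GPA]; not fitted to real data
w = [-6.0, 1.2, 0.7]
x = [1.0, 2.5, 3.0]
x_plus = [1.0, 3.5, 3.0]  # same student with overall GPA one point higher

ratio = odds(w, x_plus) / odds(w, x)
print(ratio, math.exp(w[1]))  # the two numbers agree up to rounding
```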

### Brief Comparison to Another Linear Model (Linear SVM)

A common alternative linear classification model is the linear Support Vector Machine (SVM). Both Logistic Regression and the linear SVM learn a linear decision boundary, but they differ in their objective functions and outputs. Logistic Regression minimizes the log-loss (cross-entropy) and directly outputs probability estimates, while the linear SVM minimizes the hinge loss and focuses on maximizing the margin between classes. As a result, an SVM often provides strong classification accuracy, but it outputs only a decision score and no probability unless an additional calibration step (such as Platt scaling) is applied. Therefore, for this salary-threshold prediction problem, where probability interpretation and threshold tuning are important, Logistic Regression is typically the better choice.
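
To make the difference between the two losses concrete, the sketch below evaluates both as functions of the margin $m = y'\,w^\top x$, where $y' = 2y-1\in\{-1,+1\}$ is the label recoded for the SVM convention (an assumption of this example; the rest of this document uses $y\in\{0,1\}$). With that recoding, the per-sample logistic loss is $\log(1+e^{-m})$.

```python
import math

def log_loss(margin):
    """Logistic loss log(1 + exp(-m)) for margin m, computed without overflow."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def hinge_loss(margin):
    """Hinge loss max(0, 1 - m) used by the linear SVM."""
    return max(0.0, 1.0 - margin)

for m in (-2.0, 0.0, 1.0, 3.0):
    print(m, round(log_loss(m), 4), hinge_loss(m))
```

The hinge loss is exactly zero once the margin reaches 1, while the logistic loss stays positive for every finite margin, so every sample keeps contributing to the logistic regression fit.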

# Task 2: Dataset and Advanced EDA

In [None]:
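import pandas as pd
from google.colab import drive

# Mount Google Drive so files stored there can be read from Colab
drive.mount('/content/drive')

# Replace 'my_folder/my_data.csv' with your file's actual path.
# file_path must point to a CSV file; the Drive folder alone cannot be loaded.
file_path = '/content/drive/MyDrive/'
df = pd.read_csv(file_path)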