<a href="https://colab.research.google.com/github/YolandaZhao10/CSCI-6170-Project-in-AI-and-ML/blob/main/hw1/homework_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 1: Advanced Objective Function and Use Case

# Task 1.1

## Model Assumption
Given a binary classification dataset
$\mathscr D =\{(x_i,y_i)\}_{i=1}^N,\quad
x_i\in\mathbb R^d,\; y_i\in\{0,1\}$,
logistic regression models the conditional probability as
$$
p(y_i=1\mid x_i;w)=\sigma(w^\top x_i),
\quad
\sigma(z)=\frac{1}{1+e^{-z}}.
$$

Based on the probability for $y_i = 1$, we also obtain the conditional probability for $y_i = 0$:

$$
p(y_i=0\mid x_i;w)=1-\sigma(w^\top x_i).
$$
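
The two class probabilities above can be evaluated as in the following minimal sketch. The helper names `sigmoid` and `class_probabilities` are illustrative, not part of any library, and the branch on the sign of `z` is only one common way to avoid overflow.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form exp(z) / (1 + exp(z))
    # so that exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def class_probabilities(w, x):
    """Return (p(y=0 | x; w), p(y=1 | x; w)) for a single example."""
    score = sum(wj * xj for wj, xj in zip(w, x))  # w^T x
    p1 = sigmoid(score)
    return 1.0 - p1, p1

# Hypothetical weights and features chosen so that w^T x = 0
p0, p1 = class_probabilities([1.0, -2.0], [2.0, 1.0])
print(p0, p1)  # 0.5 0.5
```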



## Maximum Likelihood Estimation
Each label $y_i$ is assumed to follow a Bernoulli distribution. The conditional likelihood can be written compactly as

$$
p(y_i\mid x_i;w)
=\sigma(w^\top x_i)^{y_i}
\left(1-\sigma(w^\top x_i)\right)^{1-y_i}.
$$

Assuming the samples are i.i.d., the likelihood over the entire dataset is

$$
\mathscr{L}(w)
=\prod_{i=1}^N p(y_i\mid x_i;w)
=\prod_{i=1}^N
\sigma(w^\top x_i)^{y_i}
\left(1-\sigma(w^\top x_i)\right)^{1-y_i}.
$$

Taking the logarithm of the likelihood yields the log-likelihood:

$$
\ell(w)
=\log \mathscr{L}(w)
=\sum_{i=1}^N
\Big[
y_i \log \sigma(w^\top x_i)
+(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)
\Big].
$$

Maximum Likelihood Estimation seeks the parameter $w$ that maximizes the log-likelihood:

$$
w_{\text{MLE}}=\arg\max_w \ell(w).
$$


## Get Objective Function of Logistic Regression Under MLE
Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood. Therefore, the objective function of logistic regression under MLE is

$$
J_{\text{MLE}}(w)
=-\ell(w)
=\sum_{i=1}^N
\Big[
- y_i \log \sigma(w^\top x_i)
-(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)
\Big].
$$

This objective function is known as the **logistic loss** or **binary cross-entropy loss**.
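
As a rough illustration, the objective $J_{\text{MLE}}(w)$ can be computed with NumPy as sketched below. The toy arrays are made up, the function names are illustrative, and the gradient $X^\top(\sigma(Xw)-y)$ is the standard result for this loss rather than something derived in this notebook.

```python
import numpy as np

def negative_log_likelihood(w, X, y):
    """J_MLE(w) = -sum_i [y_i log s_i + (1 - y_i) log(1 - s_i)], with s_i = sigma(w^T x_i)."""
    z = X @ w
    # log sigma(z) = -log(1 + exp(-z)) and log(1 - sigma(z)) = -log(1 + exp(z));
    # np.logaddexp(0, t) evaluates log(1 + exp(t)) in a numerically stable way.
    log_s = -np.logaddexp(0.0, -z)
    log_one_minus_s = -np.logaddexp(0.0, z)
    return -np.sum(y * log_s + (1.0 - y) * log_one_minus_s)

def nll_gradient(w, X, y):
    """Gradient of J_MLE with respect to w: X^T (sigma(Xw) - y)."""
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (s - y)

# Toy data (made up for illustration); the first column is a constant feature
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w0 = np.zeros(2)
print(negative_log_likelihood(w0, X, y))  # equals N * log(2) when w = 0
```

At $w=0$ every example gets probability $0.5$, so the loss reduces to $N\log 2$, which is a convenient sanity check before running any optimizer.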

## MLE vs. MAP for Logistic Regression


**Definition**: Given a dataset $
\mathscr{D}=\{(x_i,y_i)\}_{i=1}^N, \quad y_i\in\{0,1\},
$ logistic regression models $
p(y_i=1\mid x_i;w)=\sigma(w^\top x_i),\quad
\sigma(z)=\frac{1}{1+e^{-z}}$.

### Objective
- **MLE:** maximize likelihood $p(\mathscr{D}\mid w)$  
- **MAP:** maximize posterior $p(w\mid\mathscr{D}) \propto p(\mathscr{D}\mid w)p(w)$, so MAP can be interpreted as **MLE + regularization**.

### Regularization
- **MLE:** no explicit regularization term  
- **MAP:** adds prior-based regularization ($-\log p(w)$)

### When it helps
- **MLE:** works well with large datasets  
- **MAP:** often better for small datasets / high-dimensional features because priors reduce overfitting
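
As one concrete instance of the MAP view, assume a zero-mean isotropic Gaussian prior $w\sim\mathcal N(0,\tau^2 I)$. The prior and its scale $\tau$ are assumptions made here for illustration; the comparison above does not fix a particular prior. Then

$$
-\log p(w)=\frac{1}{2\tau^2}\lVert w\rVert_2^2+\text{const},
$$

and minimizing the negative log-posterior gives

$$
J_{\text{MAP}}(w)
=\sum_{i=1}^N
\Big[
- y_i \log \sigma(w^\top x_i)
-(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)
\Big]
+\frac{1}{2\tau^2}\lVert w\rVert_2^2 ,
$$

which is the MLE objective plus an L2 penalty with strength $\lambda = 1/(2\tau^2)$. A tighter prior (smaller $\tau$) corresponds to stronger regularization.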


### Reference
[1] S. Aswani, “IEOR 165 – Lecture 8: Regularization (Maximum A Posteriori Estimation),” University of California, Berkeley. [Online]. Available: https://aswani.ieor.berkeley.edu/teaching/SP16/165/lecture_notes/ieor165_lec8.pdf


[2] A. Ng, “CS229 Lecture Notes: Logistic Regression,” Stanford University. [Online]. Available: https://cs229.stanford.edu/notes2020spring/cs229-notes1.pdf

[3] X. Fern, “Logistic Regression,” Oregon State University, CS534 Lecture Notes. [Online]. Available: https://web.engr.oregonstate.edu/~xfern/classes/cs534-18/Logistic-Regression-3-updated.pdf


# Task 1.2

## Define a machine learning problem you wish to solve using Logistic Regression.
As someone interested in weather tasks, I chose to work with weather prediction, a domain where classification models can provide practical value for planning and risk management. In this project, we use the **Weather Dataset (Rattle Package)**, which contains historical daily weather observations collected from multiple locations in Australia. Each record includes meteorological measurements such as temperature, humidity, atmospheric pressure, wind direction/speed, rainfall, and cloud coverage.

The goal of this project is to solve a **binary classification** problem: predicting whether it will **rain tomorrow** (`RainTomorrow = Yes/No`) based on weather conditions observed today. To address this, we apply **logistic regression**, which is a suitable model because the prediction target is binary and logistic regression naturally outputs a probability value $P(y=1\mid x)$ through a sigmoid function applied to a linear combination of the input features. This probability can then be compared against a threshold to determine the predicted class. In addition to being computationally efficient, logistic regression also provides interpretable feature coefficients, allowing us to understand which weather factors are most strongly associated with rainfall on the following day.


**Goal**: Predict whether it will rain tomorrow using historical weather measurements.

**Dataset (X, y)**: Using the Kaggle dataset: **Weather Dataset (Rattle Package)**  (loaded locally as `weatherAUS.csv`). Link to dataset: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package

**Target variable (y)**: The dataset provides a binary label, `RainTomorrow ∈ {Yes, No}`. We define $y_i =
\begin{cases}
1, & \text{if RainTomorrow = Yes}\\
0, & \text{if RainTomorrow = No}
\end{cases}$ so that $y$ is the 0/1 encoding of `RainTomorrow`.
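
A minimal sketch of this label encoding with pandas is shown below. The toy frame stands in for a few rows of `weatherAUS.csv` with made-up values, and dropping rows whose label is missing is one possible choice rather than the only one.

```python
import pandas as pd

# Toy frame standing in for a few rows of weatherAUS.csv (values made up)
toy = pd.DataFrame({"RainTomorrow": ["Yes", "No", None, "No"]})

# Drop rows with a missing label, then map Yes -> 1 and No -> 0
toy = toy.dropna(subset=["RainTomorrow"])
y = toy["RainTomorrow"].map({"Yes": 1, "No": 0}).astype(int)
print(y.tolist())  # [1, 0, 0]
```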

**Input features (X)**
For each observation $i$, we define the feature vector $x_i$ using weather-related measurements from the same day.

A typical choice of $X$ is: $x_i = [\text{Location},\text{MinTemp},\text{MaxTemp},\text{Rainfall},\text{Evaporation},\text{Sunshine},
\text{WindGustDir},\text{WindGustSpeed},\text{WindDir9am},\text{WindDir3pm},
\text{WindSpeed9am},\text{WindSpeed3pm},\text{Humidity9am},\text{Humidity3pm},
\text{Pressure9am},\text{Pressure3pm},\text{Cloud9am},\text{Cloud3pm},
\text{Temp9am},\text{Temp3pm},\text{RainToday}]$ where categorical variables such as `Location`, `WindGustDir`, etc. are encoded using **one-hot encoding**. We exclude `Date` since it is not a physical measurement by itself and is typically replaced by engineered temporal features (month/season) if needed.
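
The one-hot encoding step might look like the sketch below, which uses `pandas.get_dummies` on a toy frame with made-up values and only three of the columns listed above. Each categorical column is expanded into one indicator column per observed category, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy rows with one numeric and two categorical columns (values made up)
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9],
    "WindGustDir": ["W", "WNW", "W"],
    "RainToday": ["No", "No", "Yes"],
})

# Expand each categorical column into 0/1 indicator columns
X = pd.get_dummies(toy, columns=["WindGustDir", "RainToday"], dtype=float)
print(X.shape)          # (3, 5): 1 numeric column + 2 + 2 indicator columns
print(list(X.columns))
```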

## Why Logistic Regression is the Best Choice

1. `RainTomorrow` is a two-class label (Yes/No), which directly matches Logistic Regression.

2. Logistic Regression provides interpretable weights: a positive coefficient increases the predicted rain probability, and a negative coefficient decreases it. This is valuable for weather-related decision support (e.g., agriculture, transportation).

3. This dataset contains tens of thousands of rows. Logistic Regression trains quickly, handles large datasets well, and supports regularization (L1/L2) to reduce overfitting.

4. Instead of only predicting a class label, Logistic Regression outputs:
$P(\text{RainTomorrow}=\text{Yes}\mid x)$ which is useful for risk-based decisions (e.g., predicting rain with confidence thresholds).
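
To illustrate the last point, the sketch below turns predicted probabilities into class labels at two different thresholds. The probabilities are hypothetical model outputs and `predict_label` is an illustrative helper, not a library function.

```python
def predict_label(p_rain: float, threshold: float = 0.5) -> int:
    """Return 1 (rain) when the predicted probability reaches the threshold, else 0."""
    return 1 if p_rain >= threshold else 0

probs = [0.15, 0.48, 0.70]  # hypothetical model outputs

print([predict_label(p) for p in probs])                 # [0, 0, 1]
print([predict_label(p, threshold=0.3) for p in probs])  # [0, 1, 1]
```

Lowering the threshold flags more days as rainy, which may be preferable when missing a rainy day is costlier than a false alarm.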

## Comparison to Another Linear Classification Model (Linear SVM)
### Objective functions
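Both objectives are written in margin form, with labels recoded as $\tilde y_i = 2y_i - 1 \in \{-1,+1\}$ and the bias $b$ written explicitly. Under this recoding the logistic objective equals the binary cross-entropy from Task 1.1.
- **Logistic Regression**: $\min_{w,b} \sum_{i=1}^N \log\bigl(1+\exp(-\tilde y_i(w^\top x_i+b))\bigr)$
- **Linear SVM**: $\min_{w,b} \sum_{i=1}^N \max(0, 1 - \tilde y_i(w^\top x_i + b))$ (in practice combined with an L2 penalty on $w$, which is what produces the maximum-margin behavior)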
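
The difference between the two per-example losses can be seen by evaluating them at a few margins $m = y(w^\top x + b)$ with labels in $\{-1,+1\}$, as in this minimal sketch. The function names and the chosen margins are illustrative only.

```python
import math

def log_loss(margin: float) -> float:
    """Logistic loss log(1 + exp(-m)) for margin m = y * (w^T x + b), y in {-1, +1}."""
    # Stable form: max(-m, 0) + log(1 + exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def hinge_loss(margin: float) -> float:
    """Hinge loss max(0, 1 - m) for the same margin."""
    return max(0.0, 1.0 - margin)

for m in (-2.0, 0.0, 1.0, 3.0):
    print(f"margin={m:+.1f}  log={log_loss(m):.4f}  hinge={hinge_loss(m):.4f}")
```

The hinge loss is exactly zero once the margin reaches 1, whereas the logistic loss stays positive for every finite margin, so confidently correct examples still contribute a small gradient.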

### Output
- **Logistic Regression**: Probability $P(y=1\mid x)$
- **Linear SVM**: Class score (not probability)

### Loss
- **Logistic Regression**: Log loss
- **Linear SVM**: Hinge loss

### Interpretability
- **Logistic Regression**: High
- **Linear SVM**: Medium

### Best for
- **Logistic Regression**: Probabilistic risk estimation
- **Linear SVM**: Maximum-margin separation

## Why Logistic Regression is preferred here
In weather prediction, **probability outputs** are often necessary (e.g., “70% chance of rain tomorrow”), so Logistic Regression is more suitable than SVM unless extra probability calibration is added.

## Reference
[1] X. Fern, “Logistic Regression,” Oregon State University, CS534 Lecture Notes. [Online]. Available: https://web.engr.oregonstate.edu/~xfern/classes/cs534-18/Logistic-Regression-3-updated.pdf

[2] Cornell University, “CS4780 Lecture Note 06: Logistic Regression,” Dept. of Computer Science. [Online]. Available: https://www.cs.cornell.edu/courses/cs4780/2023fa/lectures/lecturenote06.html

[3] A. Ng, “CS229 Lecture Notes: Support Vector Machines,” Stanford University. [Online]. Available: https://cs229.stanford.edu/notes2020spring/cs229-notes2.pdf

[4] D. Klein, “CS 180 Lecture Notes: Support Vector Machines,” University of California, Berkeley. [Online]. Available: https://people.eecs.berkeley.edu/~klein/cs180f13/lectures/lec14.pdf

[5] M. Collins, “Lecture Notes: Support Vector Machines,” Carnegie Mellon University. [Online]. Available: http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf


# Task 1.3

## Mapping dataset variables to equation variables

In the logistic regression formulation shown in the derivation, the binary classification dataset is written as: $$\mathscr{D} = \{(x_i, y_i)\}_{i=1}^{N}, \quad x_i \in \mathbb{R}^d,\quad y_i \in \{0,1\}.$$

In this dataset:

- **$N$** = number of valid daily observations used for training (after filtering missing labels, etc.)
- **$x_i$** = feature vector derived from the $i$-th row of `weatherAUS.csv`. After preprocessing (e.g., one-hot encoding categorical variables), each example becomes a vector: $x_i \in \mathbb{R}^d$ where $d$ is the total number of numerical features plus the number of one-hot encoded categories.
- **$y_i$** = label derived from the $i$-th row, specifically the column `RainTomorrow ∈ {Yes, No}`: $y_i =
\begin{cases}
1, & \text{if RainTomorrow = Yes}\\
0, & \text{if RainTomorrow = No}
\end{cases}$ Thus, $y_i\in\{0,1\}$ matches the Bernoulli assumption used in the MLE derivation.


## Connection to Logistic Regression Probability Model

The derivation assumes logistic regression models: $p(y_i=1 \mid x_i; w) = \sigma(w^\top x_i), \quad \sigma(z)=\frac{1}{1+e^{-z}}$.

In this dataset, this corresponds to: $p(\text{RainTomorrow=Yes} \mid \text{weather features today})$

meaning logistic regression outputs the probability that tomorrow will be rainy based on today's meteorological conditions.

The conditional probability for the negative class is: $p(y_i=0 \mid x_i;w) = 1-\sigma(w^\top x_i)$, which corresponds to the probability of `RainTomorrow = No`.

## Assumptions Highlighted in the MLE Derivation

The MLE derivation in Task 1.1 relies on several key assumptions. In this dataset, these assumptions translate into the following:

### Assumption 1: Binary labels follow a Bernoulli distribution
The derivation states each $y_i$ is Bernoulli:

$$
y_i \sim \text{Bernoulli}(p_i),\quad p_i=\sigma(w^\top x_i).
$$

This is appropriate because `RainTomorrow` is binary and can be represented as $0/1$.

### Assumption 2: Conditional independence of labels given features
Logistic regression assumes that once we condition on $x_i$, the label $y_i$ depends only on $x_i$ and parameters $w$, not on other training examples.

This supports writing the likelihood as a product:

$$
\mathscr{L}(w)=\prod_{i=1}^{N}p(y_i\mid x_i; w).
$$

### Assumption 3: i.i.d. samples (independent and identically distributed)
The derivation explicitly assumes samples are i.i.d.:

$$
\{(x_i,y_i)\}_{i=1}^N \text{ are i.i.d.}
$$

In practice, this is an approximation for weather data because observations are time-dependent (weather has temporal correlation). However, for a baseline supervised learning model, we treat each row as an independent sample after feature extraction.

### Assumption 4: Linear decision boundary in feature space
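Logistic regression uses a linear score $w^\top x_i$ (plus a bias $b$, which can be absorbed into $w$ by appending a constant feature of 1 to $x_i$), which implies a linear decision boundary:

$$
w^\top x + b = 0.
$$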

Thus, we assume that a linear combination of meteorological features is sufficient to separate rainy vs. non-rainy outcomes reasonably well.
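
To see why the boundary is linear, consider classifying as rain when the predicted probability is at least $0.5$ (an assumed threshold). Writing $z = w^\top x + b$:

$$
\sigma(z)\ge\tfrac12
\iff \frac{1}{1+e^{-z}}\ge\tfrac12
\iff e^{-z}\le 1
\iff z\ge 0 .
$$

For a general threshold $t\in(0,1)$ the same steps give $z \ge \log\frac{t}{1-t}$, so changing the threshold shifts the boundary but keeps it a hyperplane in feature space.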

### Assumption 5: Correct preprocessing makes $x_i\in\mathbb{R}^d$
The mathematical form requires all features to be numeric. Therefore we assume:

- categorical features (e.g., `Location`, wind directions) are converted via one-hot encoding
- missing values are either removed or imputed
- all features are aligned into a consistent numeric feature vector
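
The missing-value step could be handled as in the sketch below. Median imputation for numeric columns and an explicit `"Missing"` category for categorical columns are assumptions chosen for illustration, and the toy values are made up.

```python
import pandas as pd

# Toy rows with gaps in one numeric and one categorical column (values made up)
toy = pd.DataFrame({
    "Humidity3pm": [22.0, None, 30.0],
    "WindGustDir": ["W", None, "WNW"],
})

# Numeric column: fill gaps with the column median
toy["Humidity3pm"] = toy["Humidity3pm"].fillna(toy["Humidity3pm"].median())

# Categorical column: fill gaps with an explicit placeholder category
toy["WindGustDir"] = toy["WindGustDir"].fillna("Missing")

print(toy["Humidity3pm"].tolist())  # [22.0, 26.0, 30.0]
print(toy["WindGustDir"].tolist())  # ['W', 'Missing', 'WNW']
```

When a train/test split is used, the median would normally be computed on the training rows only and then reused for the test rows.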

## Summary

In summary, the WeatherAUS dataset fits the logistic regression + MLE framework because:

- each observation provides a feature vector $x_i$
- the label `RainTomorrow` is naturally binary and can be mapped to $y_i\in\{0,1\}$
- the Bernoulli likelihood and log-likelihood derivation apply directly
- we make standard ML assumptions (i.i.d. samples, conditional independence, linearity), acknowledging that real weather data may mildly violate strict independence due to temporal patterns.


# Task 2: Dataset and Advanced EDA

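In [None]:
import pandas as pd
from google.colab import drive

# Mount Google Drive so the notebook can read files stored there
drive.mount('/content/drive')

# Replace with the actual path to weatherAUS.csv on your Drive.
# As written this points at a folder, which pd.read_csv cannot read.
file_path = '/content/drive/MyDrive/'
df = pd.read_csv(file_path)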