
# Task 1: Advanced Objective Function and Use Case

## Model Assumption
Given a binary classification dataset
$\mathscr D =\{(x_i,y_i)\}_{i=1}^N,\quad
x_i\in\mathbb R^d,\; y_i\in\{0,1\}$,
logistic regression models the conditional probability as
$$
p(y_i=1\mid x_i;w)=\sigma(w^\top x_i),
\quad
\sigma(z)=\frac{1}{1+e^{-z}}.
$$

Based on the probability for $y_i = 1$, we also obtain the conditional probability for $y_i = 0$:

$$
p(y_i=0\mid x_i;w)=1-\sigma(w^\top x_i).
$$
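As a minimal sketch of the model above, the snippet below evaluates both class probabilities with NumPy. The weight vector `w` and the inputs `X` are made-up values chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and inputs (d = 2), for illustration only.
w = np.array([0.5, -1.0])
X = np.array([[2.0, 1.0],
              [0.0, 0.0],
              [-1.0, 3.0]])

p1 = sigmoid(X @ w)   # p(y = 1 | x; w) for each row of X
p0 = 1.0 - p1         # p(y = 0 | x; w)

print(p1, p0)
```

The two probabilities sum to one for every sample, and an input with $w^\top x = 0$ lands exactly on $0.5$.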



## Maximum Likelihood Estimation
Each label $y_i$ is assumed to follow a Bernoulli distribution. The conditional likelihood can be written compactly as

$$
p(y_i\mid x_i;w)
=\sigma(w^\top x_i)^{y_i}
\left(1-\sigma(w^\top x_i)\right)^{1-y_i}.
$$

Assuming the samples are i.i.d., the likelihood over the entire dataset is

$$
\mathscr{L}(w)
=\prod_{i=1}^N p(y_i\mid x_i;w)
=\prod_{i=1}^N
\sigma(w^\top x_i)^{y_i}
\left(1-\sigma(w^\top x_i)\right)^{1-y_i}.
$$

Taking the logarithm of the likelihood yields the log-likelihood:

$$
\ell(w)
=\log \mathscr{L}(w)
=\sum_{i=1}^N
\Big[
y_i \log \sigma(w^\top x_i)
+(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)
\Big].
$$
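The relation between the product form and the sum form can be checked numerically. The sketch below assumes a small toy dataset and an arbitrary weight vector, both invented for illustration, and confirms that the log of the likelihood equals the log-likelihood sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(w, X, y):
    """Product over samples of p^y * (1 - p)^(1 - y)."""
    p = sigmoid(X @ w)
    return np.prod(p**y * (1 - p)**(1 - y))

def log_likelihood(w, X, y):
    """Sum over samples of y*log(p) + (1 - y)*log(1 - p)."""
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (illustrative values only); the first column acts as a bias feature.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])
y = np.array([1, 0, 1, 0])
w = np.array([0.1, 0.8])

lik = likelihood(w, X, y)
ll = log_likelihood(w, X, y)
print(lik, ll)
```

With many samples the raw product tends to underflow toward zero in floating point, which is one practical reason to work with the sum of logs instead.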

Maximum Likelihood Estimation seeks the parameter $w$ that maximizes the log-likelihood:

$$
w_{\text{MLE}}=\arg\max_w \ell(w).
$$


## Get Objective Function of Logistic Regression Under MLE
Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood. Therefore, the objective function of logistic regression under MLE is

$$
J_{\text{MLE}}(w)
=-\ell(w)
=\sum_{i=1}^N
\Big[
- y_i \log \sigma(w^\top x_i)
-(1-y_i)\log\left(1-\sigma(w^\top x_i)\right)
\Big].
$$

This objective function is known as the **logistic loss** or **binary cross-entropy loss**.
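A minimal sketch of minimizing this objective with plain gradient descent follows. It relies on the standard gradient $\nabla J_{\text{MLE}}(w) = X^\top(\sigma(Xw) - y)$, which is not derived in the text above, and it uses synthetic data, a fixed step size, and an iteration count that are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood (binary cross-entropy summed over samples)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_grad(w, X, y):
    """Gradient of the negative log-likelihood: X^T (sigma(Xw) - y)."""
    return X.T @ (sigmoid(X @ w) - y)

# Synthetic data drawn from a known weight vector (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = (rng.random(50) < sigmoid(X @ true_w)).astype(float)

w = np.zeros(3)
initial_loss = nll(w, X, y)
step = 0.01                      # assumed step size, small enough for this toy problem
for _ in range(200):
    w -= step * nll_grad(w, X, y)
final_loss = nll(w, X, y)

print(initial_loss, final_loss)
```

At $w = 0$ every prediction is $0.5$, so the starting loss is $N \log 2$; the loss then decreases as the weights are updated.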




# Task 2: Dataset and Advanced EDA

## Define a machine learning problem you wish to solve using Logistic Regression.
As someone interested in weather tasks, I chose to work with weather prediction, a domain where classification models can provide practical value for planning and risk management. In this project, we use the **Weather Dataset (Rattle Package)**, which contains historical daily weather observations collected from multiple locations in Australia. Each record includes meteorological measurements such as temperature, humidity, atmospheric pressure, wind direction/speed, rainfall, and cloud coverage.

The goal of this project is to solve a **binary classification** problem: predicting whether it will **rain tomorrow** (`RainTomorrow = Yes/No`) based on weather conditions observed today. To address this, we apply **logistic regression**, which is a suitable model because the prediction target is binary and logistic regression naturally outputs a probability value $P(y=1\mid x)$ through a sigmoid activation function applied to a linear combination of the input features. This probability can then be compared against a threshold to determine the predicted class. In addition to being computationally efficient, logistic regression also provides interpretable feature coefficients, allowing us to understand which weather factors are most strongly associated with rainfall on the following day.


**Goal**: Predict whether it will rain tomorrow using historical weather measurements.

**Dataset (X, y)**: We use the Kaggle dataset **Weather Dataset (Rattle Package)** (loaded locally as `weatherAUS.csv`). Link to dataset: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package

**Target variable (y)**: The dataset provides a binary label, `RainTomorrow ∈ {Yes, No}`, which we encode numerically as
$$
y_i =
\begin{cases}
1, & \text{if RainTomorrow = Yes}\\
0, & \text{if RainTomorrow = No}
\end{cases}
$$
so that $y$ is the 0/1 encoding of `RainTomorrow`.
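The label encoding can be sketched with pandas as follows. The tiny frame stands in for the real `RainTomorrow` column, and dropping rows with a missing label is an assumption about preprocessing that the text above does not specify.

```python
import pandas as pd

# Tiny stand-in for the RainTomorrow column (illustrative values only).
toy = pd.DataFrame({"RainTomorrow": ["Yes", "No", "No", None, "Yes"]})

# Assumption: rows without a label cannot be used for supervised training.
toy = toy.dropna(subset=["RainTomorrow"])

# Map Yes -> 1 and No -> 0, matching the definition of y_i.
y = toy["RainTomorrow"].map({"Yes": 1, "No": 0}).astype(int)

print(y.tolist())
```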

**Input features (X)**
For each observation $i$, we define the feature vector $x_i$ using weather-related measurements from the same day.

A typical choice of $X$ is
$$
x_i = [\text{Location},\text{MinTemp},\text{MaxTemp},\text{Rainfall},\text{Evaporation},\text{Sunshine},
\text{WindGustDir},\text{WindGustSpeed},\text{WindDir9am},\text{WindDir3pm},
\text{WindSpeed9am},\text{WindSpeed3pm},\text{Humidity9am},\text{Humidity3pm},
\text{Pressure9am},\text{Pressure3pm},\text{Cloud9am},\text{Cloud3pm},
\text{Temp9am},\text{Temp3pm},\text{RainToday}]
$$
where categorical variables such as `Location`, `WindGustDir`, etc. are encoded using **one-hot encoding**. We exclude `Date` since it is not a physical measurement by itself; if temporal information is needed, it is typically replaced by engineered features such as month or season.
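A minimal sketch of this feature preparation on a toy frame is shown below. The rows are invented for illustration, only a few of the listed columns are included, and mapping `RainToday` to 0/1 rather than one-hot encoding it is an assumption made here for brevity.

```python
import pandas as pd

# Toy rows with a subset of the columns (values are illustrative only).
toy = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-02", "2008-12-03"],
    "Location": ["Albury", "Sydney", "Albury"],
    "WindGustDir": ["W", "NNW", "W"],
    "MinTemp": [13.4, 7.4, 12.9],
    "RainToday": ["No", "Yes", "No"],
})

# Exclude Date, as described above.
X = toy.drop(columns=["Date"])

# Assumption: the binary RainToday flag is mapped to 0/1.
X["RainToday"] = X["RainToday"].map({"Yes": 1, "No": 0})

# One-hot encode the multi-category columns.
X = pd.get_dummies(X, columns=["Location", "WindGustDir"], dtype=int)

print(X.columns.tolist())
```

Each categorical column becomes one indicator column per category, so the width of $x_i$ grows with the number of distinct locations and wind directions.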

## Why Logistic Regression is the Best Choice

1. **Binary target.** `RainTomorrow` is a two-class label (Yes/No), which directly matches the binary outcome that Logistic Regression models.

2. **Interpretable coefficients.** Logistic Regression provides interpretable weights: a positive coefficient increases the predicted rain probability as the feature grows, and a negative coefficient decreases it. This is valuable for weather-related decision support (e.g., agriculture, transportation).

3. **Scalability.** This dataset contains tens of thousands of rows. Logistic Regression trains quickly, handles large datasets well, and supports regularization (L1/L2) to reduce overfitting.

4. **Probability outputs.** Instead of only predicting a class label, Logistic Regression outputs
$P(\text{RainTomorrow}=\text{Yes}\mid x)$, which is useful for risk-based decisions (e.g., predicting rain only above a chosen confidence threshold).
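To illustrate the last point, the sketch below turns predicted probabilities into decisions under two different thresholds. The probabilities and both threshold values are hypothetical; they are not outputs of a model fitted to this dataset.

```python
import numpy as np

# Hypothetical predicted probabilities P(RainTomorrow = Yes | x) for four days.
p_rain = np.array([0.10, 0.35, 0.55, 0.80])

# Default rule: predict rain when the probability is at least 0.5.
default_pred = (p_rain >= 0.5).astype(int)

# A more cautious rule (assumed threshold 0.3) flags rain earlier,
# trading more false alarms for fewer missed rainy days.
cautious_pred = (p_rain >= 0.3).astype(int)

print(default_pred, cautious_pred)
```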

## Comparison to Another Linear Classification Model (Linear SVM)

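| Aspect | Logistic Regression | Linear SVM |
|--------------|--------------|--------------|
| Objective function | $\min_{w,b} \sum_{i=1}^N \log\bigl(1+\exp(-\tilde y_i(w^\top x_i+b))\bigr)$ | $\min_{w,b} \sum_{i=1}^N \max(0, 1 - \tilde y_i(w^\top x_i + b))$ |
| Output | Probability $P(y=1\mid x)$ | Class score (not probability) |
| Loss | Log loss | Hinge loss |
| Interpretability | High | Medium |
| Best for | Probabilistic risk estimation | Maximum-margin separation |

Both objectives are written with labels recoded as $\tilde y_i = 2y_i - 1 \in \{-1,+1\}$ and an explicit bias $b$. Under this recoding, the logistic objective matches the binary cross-entropy from Task 1 (with the bias absorbed into $w$ through a constant feature). Regularization terms are omitted from both objectives.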
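The two losses in the table can be compared directly as functions of the margin $m = \tilde y\,(w^\top x + b)$ with labels $\tilde y \in \{-1,+1\}$. The sketch below uses arbitrary margin and score values for illustration. It also checks that the margin form of the log loss agrees with the 0/1-label cross-entropy used in Task 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_margin(m):
    """Logistic loss log(1 + exp(-m)) as a function of the margin m."""
    return np.log1p(np.exp(-m))

def hinge_loss_margin(m):
    """Hinge loss max(0, 1 - m) as a function of the margin m."""
    return np.maximum(0.0, 1.0 - m)

# Arbitrary margins: negative means misclassified, large positive means confident.
margins = np.array([-2.0, 0.0, 1.0, 3.0])
log_losses = log_loss_margin(margins)
hinge_losses = hinge_loss_margin(margins)

# Consistency check between the {0,1} and {-1,+1} label conventions.
z = np.array([-1.5, 0.2, 2.0])            # scores w^T x + b (illustrative)
y01 = np.array([0, 1, 1])                 # labels in {0, 1}
bce = -(y01 * np.log(sigmoid(z)) + (1 - y01) * np.log(1 - sigmoid(z)))
margin_form = log_loss_margin((2 * y01 - 1) * z)

print(log_losses, hinge_losses)
```

The hinge loss is exactly zero once the margin reaches 1, whereas the log loss stays strictly positive and keeps rewarding more confident correct predictions.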

## Why Logistic Regression is preferred here
In weather prediction, **probability outputs** are often necessary (e.g., “70% chance of rain tomorrow”), so Logistic Regression is more suitable than SVM unless extra probability calibration is added.

**Citation (SVM reference):**  
Cortes, C., & Vapnik, V. (1995). *Support-vector networks*. Machine Learning, 20, 273–297.


In [None]:
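import pandas as pd
from google.colab import drive

# Mount Google Drive so that files stored there are readable from Colab.
drive.mount('/content/drive')

# Placeholder: this points at the Drive root, not at a file. Append the path to
# your copy of weatherAUS.csv before running, otherwise read_csv will fail.
file_path = '/content/drive/MyDrive/'
df = pd.read_csv(file_path)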