# Logistic Regression
Categorical data: a classification problem on discrete labels (0 or 1) rather than on unbounded continuous values
<br>Requires the use of a *sigmoid activation function*
<br>The sigmoid function maps any real number to a value between 0 and 1: $s(z) = \frac{1}{1 + e^{-z}}$, where $z = w^T x + b$

- $z$: linear combination
- $w$: weight vector
- $b$: bias
- $w^T x$: dot product of the weights and features
- $s(z)$ or $\hat{y}$: predicted probability

Decision rule:
- If $s(z) \geq 0.5$, predict 1
- Otherwise, predict 0
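
The sigmoid and the decision rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names (`sigmoid`, `predict_proba`, `predict`) and the toy values for `X`, `w`, and `b` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Map any real value (or array of values) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Predicted probability y_hat = s(w^T x + b) for each row of X."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Decision rule: class 1 if the probability reaches the threshold, else 0."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Hypothetical toy values, chosen only for illustration
X = np.array([[2.0, 1.0], [-1.0, -3.0]])  # two examples, two features
w = np.array([0.5, 0.25])                 # weight vector
b = -0.25                                 # bias

print(predict_proba(X, w, b))  # probabilities for each example
print(predict(X, w, b))        # hard 0/1 predictions
```

Here the first example has $z = 1.0$ and the second has $z = -1.5$, so the first lands above the 0.5 threshold and the second below it.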

## Cost function

Unlike linear regression, we cannot use mean squared error here: combined with the sigmoid, it yields a **non-convex cost surface** in the weights, so gradient descent can get stuck in local minima or flat regions.

Instead, logistic regression uses **binary cross-entropy loss** (also known as **log loss**):

**Loss =**  
`-(1/m) * Σ [y(i) * log(ŷ(i)) + (1 - y(i)) * log(1 - ŷ(i))]`

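Where:  
- `m` is the number of training examples, and the sum `Σ` runs over `i = 1, ..., m`  
- `y(i)` is the true label (0 or 1) of example `i`  
- `ŷ(i)` is the predicted probability from the sigmoid function for example `i`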
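
As a sketch of how this formula translates to code, the function below computes the mean binary cross-entropy with NumPy. The name `binary_cross_entropy`, the `eps` clipping constant, and the sample labels and probabilities are assumptions for illustration; clipping is a common safeguard against `log(0)`, not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy (log loss) over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from exactly 0 and 1 so log() stays finite
    # (eps is an assumed numerical safeguard, not part of the formula)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(
        y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    )

# Hypothetical labels and predicted probabilities
y = [1, 0, 1, 0]
y_hat = [0.9, 0.1, 0.8, 0.3]

print(binary_cross_entropy(y, y_hat))
```

A model that predicts 0.5 for every example scores `log(2)`, roughly 0.693, which is a useful baseline when reading loss values.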

---

### 🔍 Intuition

- The loss measures how far off the predicted probability `ŷ` is from the actual label `y`.
- If the **true label is 1**, only the term `y * log(ŷ)` matters:
  - If `ŷ` is close to 1 → **low loss**
  - If `ŷ` is close to 0 → **high loss**
- If the **true label is 0**, only the term `(1 - y) * log(1 - ŷ)` matters:
  - If `ŷ` is close to 0 → **low loss**
  - If `ŷ` is close to 1 → **high loss**
- This means **confident, wrong predictions are penalized heavily**, which encourages the model to output well-calibrated probabilities.
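
To make the asymmetry concrete, the short sketch below evaluates the loss of a single example for a few label and prediction pairs. The helper name `example_loss` and the chosen probabilities are assumptions for illustration.

```python
import math

def example_loss(y, y_hat):
    """Cross-entropy loss for one example with label y and prediction y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# (label, predicted probability) pairs, from confident-right to confident-wrong
cases = [(1, 0.99), (1, 0.6), (1, 0.01), (0, 0.01), (0, 0.99)]

for y, y_hat in cases:
    print(f"y={y}, y_hat={y_hat:.2f} -> loss={example_loss(y, y_hat):.3f}")
```

For a true label of 1, a prediction of 0.99 costs about 0.01, while a prediction of 0.01 costs about 4.6. The penalty grows without bound as the prediction approaches the wrong extreme.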

---

### 🎯 Why is this loss function used?

- It follows from **maximum likelihood estimation**: minimizing this loss is equivalent to maximizing the probability that our model assigns to the correct labels.
- It is **convex** in the parameters `w` and `b` when used with a sigmoid, so gradient descent with a suitable learning rate can reliably reach a **global minimum**.
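
The convexity claim can be seen in practice with a small batch gradient descent loop. This is a minimal sketch under stated assumptions: the function name `fit_logistic_regression`, the learning rate, the iteration count, and the toy data are all hypothetical. It relies on the standard gradients of the mean cross-entropy loss, $\nabla_w = \frac{1}{m} X^T(\hat{y} - y)$ and $\nabla_b = \frac{1}{m}\sum_i (\hat{y}^{(i)} - y^{(i)})$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, learning_rate=0.1, n_iters=2000):
    """Batch gradient descent on the mean binary cross-entropy loss."""
    m, n = X.shape
    w = np.zeros(n)  # start from all-zero weights
    b = 0.0
    losses = []
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)
        # Record the loss at the current parameters (clipped to avoid log(0))
        p = np.clip(y_hat, 1e-12, 1.0 - 1e-12)
        losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
        # Gradients of the mean cross-entropy with respect to w and b
        error = y_hat - y
        grad_w = X.T @ error / m
        grad_b = error.mean()
        # Step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses

# Hypothetical 1-D toy data; the classes overlap, so a finite minimum exists
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 1, 0, 1, 1])

w, b, losses = fit_logistic_regression(X, y)
print("weights:", w, "bias:", b)
print("first loss:", losses[0], "last loss:", losses[-1])
```

With zero initial weights, every prediction starts at 0.5, so the first recorded loss is `log(2)`. On this toy data the loss then decreases steadily, which is the behavior expected on a convex surface with a small enough step size.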



```python
# Imports
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pprint
import pickle
```