# Introduction to Regression (Probabilistic Perspective)

## Key Concepts

* **Objective:** Model the relationship between input $\mathbf{x}$ and output $y$.

* **Uncertainty:** Output $y$ has an associated uncertainty modeled by a probability distribution.

* **Example:**
  
  $y = f(\mathbf{x}; \mathbf{w}) + \epsilon$ , where $\epsilon \sim \mathcal{N}(0,\sigma^2)$

* The goal is to learn $f(\mathbf{x}; \mathbf{w})$ to predict $y$.

## Curve Fitting with Noise

* In real-world scenarios, observed output $y$ is noisy.

* **Model: True output plus noise**
  
  $y = f(\mathbf{x}; \mathbf{w}) + \epsilon$

* Noise represents unknown or unmodeled factors.

* **Example:** Predicting house prices based on features with inherent unpredictability.
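
A minimal sketch of this noisy model, assuming a hypothetical linear $f$ with made-up weights `w0`, `w1` and noise level `sigma` (none of these values come from the notes). It only illustrates that each observation is the noise-free output plus a Gaussian draw:

```python
import random

def f(x, w0=1.0, w1=2.0):
    """Hypothetical noise-free function f(x; w) = w0 + w1 * x."""
    return w0 + w1 * x

def sample_noisy(xs, sigma=0.5, seed=0):
    """Return y = f(x; w) + eps with eps ~ N(0, sigma^2) for each x."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [f(x) + rng.gauss(0.0, sigma) for x in xs]

xs = [i / 10 for i in range(50)]
ys = sample_noisy(xs)

# The residuals y - f(x) are exactly the noise draws.
residuals = [y - f(x) for x, y in zip(xs, ys)]
```

Setting `sigma=0.0` recovers the noise-free curve, which is a quick way to see that the noise term is the only source of scatter in this toy setup.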

## Expected Value of Output

* **Best Estimate:** The conditional expectation of $y$ given $\mathbf{x}$. Because the noise $\epsilon$ has zero mean, this equals the model output:
  
  $\mathbb{E}[y|\mathbf{x}] = f(\mathbf{x}; \mathbf{w})$

* **Goal:** Learn a function $f(\mathbf{x}; \mathbf{w})$ that represents the average behavior of the data.

* **Key Point:** The model captures the mean of the target variable given input $\mathbf{x}$.
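
The claim that the model captures the conditional mean can be checked empirically. The sketch below assumes the same kind of hypothetical linear $f$ and an arbitrary input `x0`; it draws many noisy outputs at that one input and averages them:

```python
import random

def f(x, w0=1.0, w1=2.0):
    """Hypothetical model output f(x; w) = w0 + w1 * x."""
    return w0 + w1 * x

rng = random.Random(42)
x0, sigma, n = 3.0, 1.0, 20000  # illustrative values, not from the notes

# Many noisy observations of y at the same input x0.
samples = [f(x0) + rng.gauss(0.0, sigma) for _ in range(n)]

# The sample average approximates E[y | x0], which should be close to f(x0).
empirical_mean = sum(samples) / n
```

With more samples, `empirical_mean` should get closer to `f(x0)`, since the zero-mean noise averages out.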

## 🔹 **What is Likelihood?**
Likelihood is a measure of how probable it is to observe some given data under a particular model or set of parameters.  

Mathematically, the **likelihood function** is written as:

$$
L(\theta) = P(D | \theta)
$$

where:
- $L(\theta)$ is the **likelihood** of the parameters $\theta$.
- $D$ is the observed **data**.
- $P(D | \theta)$ is the probability of observing $D$ **given** the parameters $\theta$.

👉 **Key idea:** **Likelihood is not the probability of parameters; it is the probability of the data given the parameters.**  

### 🔹 **Example of Likelihood**
Imagine you flip a coin 10 times and observe 7 heads and 3 tails. You want to find out how well different values of the heads probability explain this outcome, for example whether a biased coin explains it better than a fair one.  

Let's define:
- $\theta$ as the probability of heads.
- The observed data is **7 heads and 3 tails**.

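Assuming independent flips, the likelihood of one specific sequence with 7 heads and 3 tails is:

$$
L(\theta) = P(D | \theta) = \theta^7 (1 - \theta)^3
$$

Counting all orderings would add the constant factor $\binom{10}{7}$, which does not depend on $\theta$ and so does not change where the maximum lies.

This function tells us how well different values of $\theta$ (the probability of heads) explain the observed data. It is a function of $\theta$, not a probability distribution over $\theta$.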
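
As a small illustration, the coin likelihood can be evaluated on a grid of candidate values. The function name `coin_likelihood` and the grid resolution are choices made for this sketch:

```python
def coin_likelihood(theta, heads=7, tails=3):
    """Likelihood of one specific sequence with the given head/tail counts."""
    return theta ** heads * (1 - theta) ** tails

# Candidate values for theta: 0.00, 0.01, ..., 1.00
grid = [i / 100 for i in range(101)]
values = [coin_likelihood(t) for t in grid]

# The grid point with the highest likelihood.
best = max(grid, key=coin_likelihood)  # 0.7 on this grid
```

A fair coin ($\theta = 0.5$) still has non-zero likelihood here; it just explains 7 heads in 10 flips less well than $\theta = 0.7$.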

---

## 🔹 **What is Maximum Likelihood Estimation (MLE)?**
**Maximum Likelihood Estimation (MLE)** is a method used to find the **best** value for a parameter ($\theta$) by maximizing the likelihood function.

👉 **The goal of MLE:**  
Find the parameter value $\theta$ that makes the observed data **most likely**.

### 🔹 **Mathematical Definition of MLE**
The **MLE estimate** for $\theta$ is the value that **maximizes** the likelihood function:

$$
\hat{\theta} = \arg\max_{\theta} L(\theta)
$$

Since the likelihood of independent observations $D_1, \dots, D_n$ is a **product of probabilities**, it is easier to work with the **log-likelihood** (taking the natural logarithm):

$$
\log L(\theta) = \sum_{i=1}^{n} \log P(D_i | \theta)
$$

Maximizing the **log-likelihood** instead of the likelihood is common because:
- It converts products into **sums**, making calculations easier.
- The logarithm is a **strictly increasing function**, so the likelihood and the log-likelihood are maximized by the same value of $\theta$.
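
There is also a practical, numerical reason to prefer sums of logs. The sketch below uses a made-up list of 200 per-observation probabilities to show the product underflowing in floating point while the log-sum stays well behaved:

```python
import math

def likelihood_product(probs):
    """Multiply per-observation probabilities directly."""
    out = 1.0
    for p in probs:
        out *= p
    return out

def log_likelihood_sum(probs):
    """Sum the logs of per-observation probabilities."""
    return sum(math.log(p) for p in probs)

probs = [0.01] * 200  # hypothetical values chosen to trigger underflow

product = likelihood_product(probs)  # 1e-400 is below float range: 0.0
log_sum = log_likelihood_sum(probs)  # 200 * log(0.01), a finite number
```

Once the product has collapsed to `0.0`, different parameter values can no longer be compared, whereas the log-sum still can.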

---

### 🔹 **Example: MLE for a Coin Flip**
Suppose we flip a coin $n$ times and get $k$ heads. We want to estimate the probability of heads ($\theta$) using MLE.

1️⃣ **Likelihood function**:
$$
L(\theta) = \theta^k (1 - \theta)^{n-k}
$$

2️⃣ **Log-likelihood function**:
$$
\log L(\theta) = k \log \theta + (n-k) \log (1 - \theta)
$$

3️⃣ **Find $\theta$ that maximizes this function**:  
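To maximize, take the derivative with respect to $\theta$ and set it to zero:

$$
\frac{d}{d\theta} [ k \log \theta + (n-k) \log (1 - \theta) ] = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0
$$

Multiplying through by $\theta(1-\theta)$ gives $k(1-\theta) = (n-k)\theta$, so $k = n\theta$ and therefore: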

$$
\hat{\theta} = \frac{k}{n}
$$

👉 **Conclusion**: The MLE for the probability of heads in a coin flip is just the fraction of times we observed heads!
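
The closed-form result can be sanity-checked numerically. This sketch compares $k/n$ with a brute-force search over a grid; the helper names and the grid size are assumptions made for illustration:

```python
import math

def coin_log_likelihood(theta, k, n):
    """Log-likelihood k*log(theta) + (n-k)*log(1-theta), for 0 < theta < 1."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def mle_closed_form(k, n):
    """The analytic MLE: fraction of heads."""
    return k / n

def mle_grid_search(k, n, steps=10000):
    """Brute-force maximizer over the open interval (0, 1)."""
    # Endpoints are excluded because log(0) is undefined.
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: coin_log_likelihood(t, k, n))

k, n = 7, 10
theta_hat = mle_closed_form(k, n)
theta_grid = mle_grid_search(k, n)
```

The two estimates should agree up to the grid spacing, which is what the derivation predicts.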

---

## 🔹 **Why Use MLE?**
MLE is widely used in statistics and machine learning because:

✅ It provides **consistent** estimates: under standard regularity conditions, the estimate converges to the true parameter value as the amount of data grows.  
✅ It works for a wide range of probability distributions.  
✅ It is the foundation of many machine learning algorithms (e.g., logistic regression, Gaussian mixture models).  

# Maximum Likelihood Estimation (MLE)

## Basic Concepts

* **MLE**: A method that estimates parameters by choosing the values that maximize the likelihood of the observed data.

* Given data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ with independent observations, MLE maximizes:

$$
L(\mathcal{D};\mathbf{w}, \sigma^2) = \prod_{i=1}^n p(y_i|\mathbf{x}_i, \mathbf{w}, \sigma^2)
$$

* MLE finds parameters $\mathbf{w}$ and $\sigma^2$ that best explain the data.
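
To make the product concrete, here is a sketch that evaluates it for a tiny made-up dataset. It assumes Gaussian noise and a hypothetical univariate linear model $w_0 + w_1 x$; the data points and parameter values are invented for illustration:

```python
import math

def gaussian_pdf(y, mean, sigma2):
    """Density of N(mean, sigma2) evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def likelihood(data, w0, w1, sigma2):
    """Product of per-point densities p(y_i | x_i, w, sigma2)."""
    out = 1.0
    for x, y in data:
        out *= gaussian_pdf(y, w0 + w1 * x, sigma2)
    return out

# Hypothetical (x, y) pairs lying roughly on y = 1 + 2x.
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

good = likelihood(data, w0=1.0, w1=2.0, sigma2=0.25)  # parameters near the data
bad = likelihood(data, w0=0.0, w1=0.0, sigma2=0.25)   # parameters far from the data
```

Parameters that track the data give a much larger likelihood than parameters that do not, which is the sense in which MLE picks the values that "best explain" the data.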

## Log-Likelihood

* Instead of maximizing the likelihood, it is often easier to maximize the log-likelihood:

$$
\log L(\mathcal{D};\mathbf{w}, \sigma^2) = \sum_{i=1}^n \log p(y_i|\mathbf{x}_i, \mathbf{w}, \sigma^2)
$$

* This works because the logarithm is strictly increasing, so $\log L$ and $L$ are maximized by the same parameter values.
* A sum of terms is also easier to differentiate than a product.

## Univariate Linear Function Example

* Assuming Gaussian noise with mean $0$ and variance $\sigma^2$, the probability density of observing the real-valued output $y$ is:

$$
p(y|\mathbf{x}, \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-f(\mathbf{x};\mathbf{w}))^2}{2\sigma^2}\right)
$$

* For a simple linear model $f(\mathbf{x};\mathbf{w}) = w_0 + w_1x$ we have:

$$
p(y|x, \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-w_0-w_1x)^2}{2\sigma^2}\right)
$$

**Key Observation**: Points far from the fitted line will have a low likelihood value.
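
A quick numerical illustration of this observation, assuming an arbitrary line ($w_0 = 1$, $w_1 = 2$) and unit noise variance chosen only for the sketch:

```python
import math

def point_density(x, y, w0, w1, sigma2):
    """Gaussian density p(y | x, w, sigma2) for the line w0 + w1 * x."""
    mean = w0 + w1 * x
    return math.exp(-(y - mean) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

w0, w1, sigma2 = 1.0, 2.0, 1.0  # illustrative parameters; the line predicts 5.0 at x = 2

near = point_density(2.0, 5.1, w0, w1, sigma2)  # residual of about 0.1
far = point_density(2.0, 9.0, w0, w1, sigma2)   # residual of 4.0
```

Because the residual enters through $\exp(-r^2 / 2\sigma^2)$, the density falls off very quickly: `far` is smaller than `near` by several orders of magnitude.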

## Log-Likelihood and Sum of Squares

* Using log-likelihood we have:

$$
\log L(\mathcal{D};\mathbf{w}, \sigma^2) = -n\log\sigma - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y^{(i)} - f(\mathbf{x}^{(i)};\mathbf{w}))^2
$$
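
As a sketch of the intermediate step, the log of a single Gaussian term, written in the same notation, is:

$$
\log p(y^{(i)}|\mathbf{x}^{(i)}, \mathbf{w}, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y^{(i)} - f(\mathbf{x}^{(i)};\mathbf{w}))^2}{2\sigma^2}
$$

Summing over $i = 1, \dots, n$ and using $\frac{1}{2}\log(2\pi\sigma^2) = \log\sigma + \frac{1}{2}\log(2\pi)$ gives the three terms of the log-likelihood.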

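* MLE optimizes over the parameters $\mathbf{w}$, so terms that do not depend on $\mathbf{w}$ can be dropped, and the positive factor $\frac{1}{2\sigma^2}$ does not change the location of the maximum:

$$
\arg\max_{\mathbf{w}} \log L(\mathcal{D};\mathbf{w}, \sigma^2) = \arg\max_{\mathbf{w}} \left[-\sum_{i=1}^n(y^{(i)} - f(\mathbf{x}^{(i)};\mathbf{w}))^2\right]
$$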

* **Equivalence**: Under Gaussian noise, maximizing the log-likelihood with respect to $\mathbf{w}$ is equivalent to minimizing the Sum of Squared Errors (SSE):

$$
J(\mathbf{w}) = \sum_{i=1}^n(y^{(i)} - f(\mathbf{x}^{(i)};\mathbf{w}))^2
$$
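
The equivalence can be illustrated for the univariate linear model. The sketch below uses the standard closed-form least-squares solution for a line and a small made-up dataset, then checks that moving away from that solution raises the SSE and lowers the Gaussian log-likelihood together:

```python
import math

def fit_least_squares(xs, ys):
    """Closed-form least-squares fit of y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1

def sse(xs, ys, w0, w1):
    """Sum of squared errors J(w)."""
    return sum((y - w0 - w1 * x) ** 2 for x, y in zip(xs, ys))

def log_likelihood(xs, ys, w0, w1, sigma2):
    """Gaussian log-likelihood: -(n/2) log(2 pi sigma2) - SSE / (2 sigma2)."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse(xs, ys, w0, w1) / (2 * sigma2)

# Hypothetical data, roughly on a line with slope 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 2.9, 5.1, 7.2, 8.8]

w0_hat, w1_hat = fit_least_squares(xs, ys)
```

For any fixed `sigma2`, the log-likelihood is a constant minus a positive multiple of the SSE, so the parameters that minimize one maximize the other.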

## Estimating $\sigma^2$

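* Setting the derivative of the log-likelihood with respect to $\sigma^2$ to zero, with $\mathbf{w}$ fixed at $\hat{\mathbf{w}}$, gives the maximum likelihood estimate of the noise variance: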

$$
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(y^{(i)} - f(\mathbf{x}^{(i)};\hat{\mathbf{w}}))^2
$$

* **Interpretation**: Mean squared error of the predictions on the training data.
* **Note**: $\hat{\sigma}^2$ estimates the noise level in the observations.
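
A short sketch of this estimate for a univariate linear model. The toy data and the fitted weights `w0_hat`, `w1_hat` are assumptions for illustration (the weights are the least-squares fit for this particular data); the check at the end compares the log-likelihood at $\hat{\sigma}^2$ with nearby variance values:

```python
import math

def sigma2_mle(xs, ys, w0, w1):
    """Mean squared residual: the MLE of the noise variance for fixed w."""
    n = len(xs)
    return sum((y - w0 - w1 * x) ** 2 for x, y in zip(xs, ys)) / n

def log_likelihood(xs, ys, w0, w1, sigma2):
    """Gaussian log-likelihood for the line w0 + w1 * x."""
    n = len(xs)
    sse = sum((y - w0 - w1 * x) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

# Hypothetical data and its least-squares line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 2.9, 5.1, 7.2, 8.8]
w0_hat, w1_hat = 1.14, 1.95

sigma2_hat = sigma2_mle(xs, ys, w0_hat, w1_hat)

# Log-likelihood at the estimate and at half / double that variance.
ll_at_hat = log_likelihood(xs, ys, w0_hat, w1_hat, sigma2_hat)
ll_smaller = log_likelihood(xs, ys, w0_hat, w1_hat, 0.5 * sigma2_hat)
ll_larger = log_likelihood(xs, ys, w0_hat, w1_hat, 2.0 * sigma2_hat)
```

Both `ll_smaller` and `ll_larger` should come out below `ll_at_hat`, consistent with $\hat{\sigma}^2$ being the maximizer over the variance.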