# Course Overview

* **Supervised Learning**  
    Learn to predict outcomes using labeled data

* **Unsupervised Learning**  
    Find patterns in unlabeled data

* **Reinforcement Learning**  
    Learn from partial (indirect) feedback, with no explicit guidance  
    Rewards received for a sequence of moves are used to learn a policy and utility function

* **Neural Networks**  
    Mimic the human brain to solve complex tasks    

* **Computer Vision**  
    Enable machines to see and interpret images    

* **Natural Language Processing (NLP)**  
    Understand and generate human language in text or voice format

* **Contrastive Language-Image Pre-training (CLIP)**  
    Connecting text and images

# What is Machine Learning?

* A field of study that enables computers to learn from data without being explicitly programmed.


* Tom Mitchell (1998): Well-posed learning problem:  
    A computer program is said to learn from **experience E** with respect to some class of **tasks T** and **performance measure P**, if its performance at tasks in **T**, as measured by **P**, improves with experience **E**.  
    Learning Problem = (T,P,E)

* Goal:  
    Develop models that make accurate predictions based on past data.

* The essence of machine learning:
    1. A pattern exists  
    2. We do not know it mathematically  
    3. We have data on it

# Statistics & Estimation

## Statistical Models in ML

* The target model in a learning problem can be viewed as a statistical model.

* For a fixed dataset and an underlying target (statistical model), estimation methods try to recover the target from the available data.

* **Goal:** Estimating the probability density function $p(x)$, given a set of data $x^{(i)}$



* **Main approaches:**
    - **Parametric:** Assume a parameterized model for the density function.  
        - A number of parameters are optimized by fitting the model to the dataset.

    - **Non-parametric (Instance-based):** No specific parametric model is assumed.  
        - The form of the density function is determined entirely by the data.


## Parametric Approach

**Goal**: Estimate the parameters $\theta$ of a distribution from a dataset $D = \{x^{(1)}, x^{(2)}, ..., x^{(N)}\}$  

### **Maximum Likelihood Estimation (MLE)**

#### **Introduction**

*MLE* is a method of estimating the parameters of a statistical model given data.

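Likelihood is the conditional probability of the observations $D = \{x^{(1)}, x^{(2)}, ..., x^{(N)}\}$ given the value of the parameters $\theta$. Because it depends only on the data, maximizing it tends to overfit to $D$ when the dataset is small.  
Assuming *independent and identically distributed (i.i.d.)* samples, the likelihood factorizes over the samples: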

$$p(D | \theta) = \prod_{i=1}^N p(x^{(i)} | \theta) = p(x^{(1)} | \theta)p(x^{(2)} | \theta)...p(x^{(N)} | \theta)$$

Maximum Likelihood Estimation (MLE):
$$\hat{\theta}_{ML} = \argmax_{\theta}\ p(D | \theta)$$

<div style="text-align:center">
  <img src="images/MLE.png" alt="MaximumLikelihood Estimation Example">
</div>

**Key Argument**:  
Logarithms convert *products* (prone to underflow and hard to differentiate) into *sums* (stable and algebraically convenient):  

$$\log \left( \prod_{i} a_i \right) = \sum_{i} \log a_i$$  

So we have:

$$L(\theta) = \ln p(D |\theta) = \ln \prod_{i=1}^N p(x^{(i)} | \theta) = \sum_{i=1}^{N} \ln p(x^{(i)} | \theta)$$

$$\hat{\theta}_{ML} = \argmax_{\theta}\ L(\theta) = \argmax_{\theta}\ \sum_{i=1}^{N} \ln p(x^{(i)} | \theta)$$
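
The numerical side of this argument can be illustrated with a minimal sketch. The data here are hypothetical (2000 samples that each have likelihood 0.1 under some fixed $\theta$), and the function names are chosen only for this example:

```python
import math

def likelihood_product(probs):
    """Direct product of per-sample likelihoods p(x_i | theta)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_likelihood(probs):
    """Sum of per-sample log-likelihoods: L(theta) = sum_i ln p(x_i | theta)."""
    return sum(math.log(p) for p in probs)

# Hypothetical per-sample likelihoods (an assumption for illustration only).
probs = [0.1] * 2000

print(likelihood_product(probs))  # underflows to 0.0 in double precision
print(log_likelihood(probs))      # about -4605.17, still usable for an argmax
```

Since the logarithm is monotonically increasing, the $\theta$ that maximizes the sum of logs is the same one that maximizes the product.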

#### **MLE Bernoulli**

* **Bernoulli:** given $x \in \{0, 1\}$ then $p(x=1|\theta) = \theta$ and $p(x=0|\theta) = 1-\theta$

* **MLE Bernoulli:** given $D = \{x^{(1)}, x^{(2)}, ..., x^{(N)}\}$ with $m$ heads (1) and $N-m$ tails (0):
$$p(x|\theta) = \theta^x(1-\theta)^{1-x}$$

* So we have:
$$p(D | \theta) = \prod_{i=1}^N p(x^{(i)} | \theta) = \prod_{i=1}^N \theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}$$
$$\ln p(D | \theta) = \sum_{i=1}^{N} \ln p(x^{(i)} | \theta) = \sum_{i=1}^{N} \left[ x^{(i)}\ln \theta + (1-x^{(i)})\ln (1-\theta) \right]$$
$$\frac{\partial \ln p(D|\theta)}{\partial \theta} = \frac{m}{\theta} - \frac{N-m}{1-\theta} = 0 \Rightarrow \hat{\theta}_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N} = \frac{m}{N}$$

* This approach tends to overfit to $D$ when $N$ is small.  
    - E.g., $D=\{1,1,1\}$ gives $\hat \theta_{ML} = 1$, which assigns zero probability to ever observing a 0.
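
As a rough check of the closed-form result, the sketch below compares $m/N$ against a brute-force search over a grid of $\theta$ values. The coin-flip data and the grid resolution are assumptions made for this example:

```python
import math

def bernoulli_mle(data):
    """Closed-form MLE for i.i.d. Bernoulli samples: theta = m / N."""
    if not data:
        raise ValueError("need at least one sample")
    return sum(data) / len(data)

def bernoulli_log_likelihood(theta, data):
    """ln p(D | theta) = m ln(theta) + (N - m) ln(1 - theta)."""
    m = sum(data)
    n = len(data)
    return m * math.log(theta) + (n - m) * math.log(1.0 - theta)

# Hypothetical coin flips: m = 5 heads out of N = 8.
data = [1, 0, 1, 1, 0, 1, 0, 1]
theta_closed_form = bernoulli_mle(data)

# Brute-force check on a grid of theta values strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))

print(theta_closed_form, theta_grid)  # both 0.625
print(bernoulli_mle([1, 1, 1]))       # 1.0, the degenerate case with only heads
```

The grid excludes the endpoints 0 and 1 because the log-likelihood is undefined there whenever both outcomes appear in the data.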

#### **Multinomial Distribution**

$$x = [x_1, x_2, ..., x_K]$$
$$x_k \in \{0, 1\}$$
$$\sum_{k=1}^{K} x_k = 1$$
*E.g., $x = [0, 0, 0, 1, 0, 0]$ with $K=6$ and $x_4 = 1$*

<br>

$$\theta = [\theta_1, \theta_2, ..., \theta_K]$$
$$\theta_k \in [0, 1]$$
$$\sum_{k=1}^{K} \theta_k = 1$$

<br>

$$\theta_k = p(x_k = 1)$$
$$p(x|\theta) = \prod_{k=1}^K \theta_k^{x_k}$$
*(only one $x_k$ is one)*

Given:  
$$D = \{x^{(1)}, x^{(2)}, ..., x^{(N)}\}$$
$$p(D | \theta) = \prod_{i=1}^N p(x^{(i)} | \theta) = \prod_{i=1}^N \prod_{k=1}^K \theta_k^{x_k^{(i)}}$$
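Define the count of category $k$:  
$$ N_k = \sum_{i=1}^N x_k^{(i)}$$
Then:
$$p(D | \theta) = \prod_{k=1}^K \theta_k^{N_k}$$
To maximize this under the constraint $\sum_{k=1}^K \theta_k = 1$, use a Lagrange multiplier $\lambda$ *(we will prove it later)*:
$$L(\theta, \lambda) = \ln p(D|\theta) + \lambda(1-\sum_{k=1}^K \theta_k)$$
Setting $\partial L / \partial \theta_k = N_k/\theta_k - \lambda = 0$ and applying the constraint gives $\lambda = N$, so:
$$\hat \theta_k = \frac{\sum_{i=1}^{N} x_k^{(i)}}{N} = \frac{N_k}{N}$$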

**Interpretation**: The estimated probability $\hat \theta_k$ of the $k$-th category is the ratio of its occurrence count to the total number of samples.
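
The count-based estimate can be sketched in a few lines. For brevity, each one-hot vector $x$ is represented here by the 0-based index of its active entry, which is a representation chosen for this example, as are the die-roll data:

```python
from collections import Counter

def categorical_mle(labels, num_classes):
    """MLE for one-hot (categorical) data: theta_k = N_k / N.

    `labels` holds, for each sample, the 0-based index k with x_k = 1.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(labels)  # N_k for every observed category
    return [counts.get(k, 0) / n for k in range(num_classes)]

# Hypothetical outcomes of 8 die rolls, K = 6 categories.
rolls = [3, 0, 3, 5, 1, 3, 0, 5]
theta_hat = categorical_mle(rolls, 6)

print(theta_hat)       # [0.25, 0.125, 0.0, 0.375, 0.0, 0.25]
print(sum(theta_hat))  # 1.0
```

Categories that never appear in the data receive probability 0, the same overfitting behavior seen in the Bernoulli case.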

#### **Gaussian (Normal) Distribution**

Formula:
$$
g(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}
$$
where:
- $\mu$ is the mean,
- $\sigma$ is the standard deviation.

**MLE for Gaussian with Unknown $\mu$ (known $\sigma$)**
1. **Log-Likelihood of a Single Sample**:
   $$
   \ln p(x^{(i)}|\mu) = -\ln \left( \sqrt{2\pi}\sigma \right) - \frac{1}{2\sigma^2} \left( x^{(i)} - \mu \right)^2
   $$
   
$$
L(\mu) = \ln p(D |\mu) = \ln \prod_{i=1}^N p(x^{(i)} | \mu) = \sum_{i=1}^{N} \ln p(x^{(i)} | \mu) = \sum_{i=1}^{N} \left[ -\ln \left( \sqrt{2\pi}\sigma \right) - \frac{1}{2\sigma^2} \left( x^{(i)} - \mu \right)^2 \right]
$$
2. **Derivative of Log-Likelihood**:
   Set the derivative of the total log-likelihood $L(\mu)$ to zero:
   $$
   \frac{\partial L(\mu)}{\partial\mu} = 0 \implies \frac{\partial}{\partial\mu} \left( \sum_{i=1}^N \ln p(x^{(i)}|\mu) \right) = 0
   $$

3. **Solve for $\mu$**:
   $$
   \sum_{i=1}^N \frac{1}{\sigma^2} \left( x^{(i)} - \mu \right) = 0 \implies \hat{\mu}_{ML} = \frac{1}{N} \sum_{i=1}^N x^{(i)}
   $$

**Interpretation**:
The MLE estimate $\hat{\mu}_{ML}$ is the sample mean, matching classical statistical methods.
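
A minimal numerical sketch of this result, assuming a known $\sigma$ and a small made-up dataset: the log-likelihood evaluated at the sample mean should be at least as large as at nearby values of $\mu$.

```python
import math

def gaussian_log_likelihood(mu, data, sigma):
    """L(mu) = sum_i [ -ln(sqrt(2 pi) sigma) - (x_i - mu)^2 / (2 sigma^2) ]."""
    norm = -math.log(math.sqrt(2.0 * math.pi) * sigma)
    return sum(norm - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in data)

def gaussian_mean_mle(data):
    """MLE of the mean with known sigma: the sample mean."""
    return sum(data) / len(data)

# Hypothetical measurements and an assumed known standard deviation.
data = [2.1, 1.9, 2.4, 2.0, 1.6]
sigma = 0.5

mu_hat = gaussian_mean_mle(data)
print(mu_hat)  # approximately 2.0

# The log-likelihood drops when mu moves away from the sample mean.
for mu in (mu_hat - 0.1, mu_hat, mu_hat + 0.1):
    print(round(mu, 3), gaussian_log_likelihood(mu, data, sigma))
```

Note that $\sigma$ only scales the quadratic term, so the maximizing $\mu$ does not depend on its value.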

### **Maximum A Posteriori (MAP)**

#### **Introduction**

- $p(D|\theta) \rightarrow$ likelihood
- $p(\theta)\rightarrow$ prior
- $p(\theta|D) \rightarrow$ posterior

* ***Maximum Likelihood Estimation (MLE)***  
    - MLE finds the parameter values that maximize the likelihood of observing the given data:
    $$\hat{\theta}_{ML} = \argmax_{\theta}\ p(D | \theta)$$
    - Purely data-driven (ignores prior knowledge).
    - Can overfit with limited data.
<br><br>
* ***Maximum A Posteriori (MAP) Estimation***
    - MAP incorporates prior beliefs (as a probability distribution) and maximizes the posterior *(acts mathematically as adding some mock samples to the dataset.)*:
    $$\hat{\theta}_{MAP} = \argmax_{\theta}\ p(\theta | D) = \argmax_{\theta}\ p(\theta)p(D|\theta)$$
    - Adds a prior term $p(\theta)$ to regularize the estimate.
    - Balances data evidence with prior knowledge.
    - Reduces overfitting

*MAP* Estimation:
$$\hat{\theta}_{MAP} = \argmax_{\theta}\ p(\theta | D)$$
Bayes' Theorem:
$$p(\theta | D) =\frac{p(\theta)p(D|\theta)}{p(D)}$$
The denominator $p(D)$ does not depend on $\theta$, so it can be dropped from the maximization:
$$\hat{\theta}_{MAP} = \argmax_{\theta}\ p(\theta)p(D|\theta)$$

#### **MAP Estimation example for Gaussian (Unknown μ)**

* **Likelihood**:
$$p(D |\mu) = \prod_{i=1}^N p(x^{(i)} | \mu) = \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x^{(i)} - \mu}{\sigma}\right)^2}$$

* **Prior**:
$$p(\mu) = \frac{1}{\sigma_0\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2}$$

* **Posterior**:
$$p(\mu|D) \propto p(D|\mu)p(\mu)$$
$$\ln p(\mu | D) = \ln \left( p(\mu) \prod_{i=1}^N p(x^{(i)} | \mu) \right) - \ln p(D) = \ln p(\mu) + \ln \prod_{i=1}^N p(x^{(i)} | \mu) - \ln p(D)$$  
$$\ln p(\mu) = -\ln(\sigma_0\sqrt{2\pi}) -\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2$$
$$\ln \prod_{i=1}^N p(x^{(i)} | \mu) = \sum_{i=1}^{N} \left[ -\ln \left( \sqrt{2\pi}\sigma \right) - \frac{1}{2\sigma^2} \left( x^{(i)} - \mu \right)^2 \right]$$

* **Derivative and Solution for μ**  
  Step 1: Take the Derivative of the Log-Posterior  
  $$\frac{d}{d\mu} \ln p(\mu|D) = \frac{1}{\sigma^2} \sum_{i=1}^N (x^{(i)} - \mu) - \frac{1}{\sigma_0^2} (\mu - \mu_0) = 0$$  
  Step 2: Solve the Equation  
  $$\hat{\mu}_{MAP} = \frac{\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^N x^{(i)}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}} = \frac{\sigma^2 \mu_0 + \sigma_0^2 N \bar{x}}{\sigma^2 + N \sigma_0^2} = \frac{\sigma^2} {\sigma^2 + N \sigma_0^2}\mu_0 + \frac{\sigma_0^2 N} {\sigma^2 + N \sigma_0^2}\hat{\mu}_{ML}$$

* **Interpretation of Results**
1. **Weighted Average**:  
   $\hat{\mu}_{MAP}$ is a convex combination of the prior mean $\mu_0$ and the sample mean $\bar{x}$ (the two weights sum to 1).
2. **Asymptotic Behavior**:  
   - If $\sigma_0^2 \gg \sigma^2$, the prior becomes uninformative (flat Gaussian), and the MAP estimate reduces to the MLE.
   - With large $N$, $\hat{\mu}_{MAP} \approx \hat{\mu}_{ML} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x^{(i)}$.
   <div style="text-align:center">
   <img src="images/MAP.png" alt="Asymptotic Behavior">
   </div>
3. **Posterior Precision**:  
   $$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$
   More samples $\rightarrow$ Sharper Posterior $p(\mu | D)$ $\rightarrow$ Higher Confidence in Estimation
   <div style="text-align:center">
   <img src="images/MAPHigherN.png" alt="Asymptotic Behavior">
   </div>
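
The closed-form MAP estimate and the posterior precision can be sketched as follows. The data and the prior settings are hypothetical values picked so the arithmetic is easy to follow:

```python
def gaussian_mean_map(data, sigma, mu0, sigma0):
    """MAP estimate of mu for known sigma and a Gaussian prior N(mu0, sigma0^2).

    Returns (mu_map, posterior_variance), where
    1 / posterior_variance = 1 / sigma0^2 + N / sigma^2.
    """
    n = len(data)
    precision = 1.0 / sigma0 ** 2 + n / sigma ** 2
    mu_map = (mu0 / sigma0 ** 2 + sum(data) / sigma ** 2) / precision
    return mu_map, 1.0 / precision

# Hypothetical data with sample mean 2.0 (N = 4), known sigma = 1.
data = [1.5, 2.5, 2.0, 2.0]

# Prior centered at 0 with unit variance pulls the estimate toward 0.
print(gaussian_mean_map(data, sigma=1.0, mu0=0.0, sigma0=1.0))  # about (1.6, 0.2)

# A very wide prior is nearly uninformative, so MAP is close to the MLE (2.0).
print(gaussian_mean_map(data, sigma=1.0, mu0=0.0, sigma0=100.0))
```

With the unit-variance prior, the weights are $1/5$ on $\mu_0$ and $4/5$ on the sample mean, which gives $0.8 \times 2.0 = 1.6$.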

#### **Conjugate Priors**

**Definition:** A prior is called conjugate to a likelihood when it is chosen such that the posterior has the same functional form as the prior.   
*In the Gaussian case (shown earlier), both the prior and posterior are Gaussian distributions.*

$$p(\theta|D) \propto p(D|\theta) p(\theta)$$
Key Terms:
- $p(\theta|D)$: Posterior distribution
- $p(D|\theta)$: Likelihood function
- $p(\theta)$: Prior distribution *(same functional form as the posterior)*

#### **Prior for Bernoulli**

**Beta Distribution over $\theta \in [0,1]$**

1. Probability Density Function
$$\text{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_0 - 1}$$

2. Normalized Form
$$\text{Beta}(\theta|\alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)} \theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_0 - 1}$$

3. Key Properties
- *Mean*:
  $$E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$$
- *Mode (Most Probable $\theta$, for $\alpha_0, \alpha_1 > 1$)*:
  $$\hat{\theta} = \frac{\alpha_1 - 1}{\alpha_0 - 1 + \alpha_1 - 1}$$

<div style="text-align:center">
  <img src="images/BetaDistribution.png" alt="Beta Distribution">
</div>

**Conjugate Prior Derivation**
1. Bayesian Posterior Proportionality
$$p(\theta|D) \propto p(D|\theta)p(\theta)$$
2. Prior Distribution (Beta)
$$p(\theta) \propto \theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_0 - 1}$$
3. Likelihood:
$$p(D |\theta) = \prod_{i=1}^N p(x^{(i)} | \theta) = \prod_{i=1}^N \theta ^ {x^{(i)}}(1 - \theta)^{(1-x^{(i)})}$$
4. Combine Prior and Likelihood
$$p(D|\theta)p(\theta) \propto \theta ^ {(\alpha_1 + \sum_{i=1}^N x^{(i)} - 1)} (1 - \theta) ^ {(\alpha_0 + \sum_{i=1}^N (1-x^{(i)}) - 1)}$$
5. Resulting Posterior (Beta)
    The posterior is a Beta distribution with updated parameters:
    $$p(\theta|D) \propto \text{Beta}(\theta | \alpha_1', \alpha_0')$$
    Where ($m$ heads (1) and $N-m$ tails (0)):
    $$\alpha_1' = \alpha_1 + \sum_{i=1}^N x^{(i)} = \alpha_1 + m$$
    $$\alpha_0' = \alpha_0 + \sum_{i=1}^N (1-x^{(i)}) = \alpha_0 + N - m$$
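    The MAP estimate is the mode of this posterior:
    $$\hat{\theta}_{MAP} = \frac{\alpha_1' - 1}{\alpha_0' - 1 + \alpha_1' - 1} = \frac{\alpha_1 + m - 1}{\alpha_0 + \alpha_1 + N - 2}$$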
**Conjugacy:** The posterior retains the Beta form.
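
A small sketch of the conjugate update, assuming a $\text{Beta}(2, 2)$ prior (a choice made only for this example). It reuses the all-heads dataset $D=\{1,1,1\}$ to show how the prior acts like extra pseudo-observations:

```python
def beta_posterior(alpha1, alpha0, data):
    """Update Beta(alpha1, alpha0) with Bernoulli data: add heads and tails counts."""
    m = sum(data)      # number of heads (1)
    n = len(data)
    return alpha1 + m, alpha0 + n - m

def beta_mode(alpha1, alpha0):
    """Mode of Beta(alpha1, alpha0); this formula requires alpha1, alpha0 > 1."""
    if alpha1 <= 1 or alpha0 <= 1:
        raise ValueError("mode formula requires alpha1 > 1 and alpha0 > 1")
    return (alpha1 - 1) / (alpha1 + alpha0 - 2)

data = [1, 1, 1]                      # MLE would give theta = 1.0
a1_post, a0_post = beta_posterior(2, 2, data)
theta_map = beta_mode(a1_post, a0_post)

print(a1_post, a0_post)  # 5 2
print(theta_map)         # 0.8
```

Here the $\text{Beta}(2, 2)$ prior behaves like one extra head and one extra tail, so the MAP estimate is $(3+1)/(3+2) = 0.8$ instead of the MLE value of 1.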

### **Bayesian Inference**

#### **Introduction**

* **MLE (Maximum Likelihood Estimation)**: 
    - Seeks a fixed point estimate of the parameters $\theta$ by maximizing the likelihood:
    $$\hat{\theta}_{ML} = \argmax_{\theta}\ p(D | \theta)$$
    - Limitation: Ignores prior knowledge and parameter uncertainty.

* **MAP (Maximum A Posteriori)**:
    - Extends MLE by incorporating a prior, but still returns a fixed point estimate:
    $$\hat{\theta}_{MAP} = \argmax_{\theta}\ p(\theta | D) = \argmax_{\theta}\ p(\theta)p(D|\theta)$$
    - Limitation: Collapses the posterior to a single value, discarding uncertainty.

* **Fully Bayesian**:
    - Treats the parameters $\theta$ as random variables with a full posterior distribution $p(\theta | D)$, and then computes the predictive distribution $p(x | D)$.
    - Does not fix $\theta$ to a single value. Instead, it quantifies uncertainty by maintaining the entire distribution.
    - Limitation: Higher computational cost, since the predictive distribution requires evaluating an integral over $\theta$.

#### **The Predictive Distribution**

The predictive distribution $p(x | D)$ is obtained by marginalizing over the posterior distribution:
$$p(x | D) = \int p(x,\theta | D) {d}\theta = \int p(x | D,\theta)p(\theta| D) {d}\theta = \int p(x | \theta)p(\theta| D) {d}\theta$$
Where:  
- The last step uses $p(x | D,\theta) = p(x | \theta)$: given $\theta$, a new sample is independent of $D$.
- $p(x | \theta)$ is the likelihood of $x$ given a specific $\theta$.
- $p(\theta | D)$ is the posterior belief over $\theta$ after observing the data $D$.
- The integral marginalizes over $\theta$, averaging predictions across all possible $\theta$ values, weighted by their posterior probability.

Weighting Mechanism:  
- The posterior $p(\theta | D)$ acts as a *weight* for each $\theta$, reflecting how much we believe in that parameter value after seeing the data.
- Example: If $p(\theta_1 | D)$ is high, predictions using $\theta_1$ contribute more to $p(x | D)$

#### **When Fully Bayesian ≈ MAP**

Fully Bayesian prediction reduces to MAP when the posterior distribution collapses to a single point *(posterior variance $\sigma_N^2 \to 0$)*.  
Posterior Precision:  
   $$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$$
$\sigma_N^2 \to 0$ in any of these cases:
- Large Data ($N \to \infty$) *shown before*
- Strong Prior ($\sigma_0^2 \to 0$)
- Low Data Noise ($\sigma^2 \to 0$)

Mechanism:  
  - Posterior becomes a delta function $\delta(\mu - \mu_N)$ centered at the posterior mean $\mu_N$, which for a Gaussian posterior equals $\hat{\mu}_{MAP}$.  
  - Predictive integral reduces to $p(x|\mu_N)$.  

Implication:  
  - Bayesian uncertainty vanishes ⇒ MAP and Bayesian predictions align.  
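
The collapse of the posterior can be checked numerically from the precision formula for the Gaussian example. This sketch evaluates $\sigma_N^2$ as $N$ grows, for a very strong prior, and for very low data noise; all numeric settings are illustrative assumptions:

```python
def posterior_variance(n, sigma, sigma0):
    """sigma_N^2 from 1 / sigma_N^2 = 1 / sigma0^2 + N / sigma^2."""
    return 1.0 / (1.0 / sigma0 ** 2 + n / sigma ** 2)

# Large data: the variance shrinks roughly like sigma^2 / N.
for n in (1, 10, 100, 1000):
    print(n, posterior_variance(n, sigma=1.0, sigma0=1.0))

# Strong prior (sigma0 -> 0) and low data noise (sigma -> 0), each with N = 1.
print(posterior_variance(1, sigma=1.0, sigma0=1e-3))
print(posterior_variance(1, sigma=1e-3, sigma0=1.0))
```

In each of the three limits the variance approaches 0, so the posterior concentrates on a single value and the predictive integral is dominated by it.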

#### **Example: Bernoulli Likelihood: prediction**

1. Prior and Posterior Distributions
We showed in the conjugacy section:
* Prior:
    $$p(\theta) \propto \theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_0 - 1}$$
* Posterior:
    $$p(\theta|D) \propto \text{Beta}(\theta | \alpha_1', \alpha_0')$$
    With updated parameters (for $m$ heads(1) and $N-m$ tails(0)):
    $$\alpha_1' = \alpha_1 + \sum_{i=1}^N x^{(i)} = \alpha_1 + m$$
    $$\alpha_0' = \alpha_0 + \sum_{i=1}^N (1-x^{(i)}) = \alpha_0 + N - m$$
    Thus:
    $$p(\theta|D) \propto \theta ^ {(\alpha_1 + m - 1)} (1 - \theta) ^ {(\alpha_0 + (N - m) - 1)}$$

2. Predictive Distribution
    The predictive distribution for a new observation $x$ is:
    $$p(x | D) = \int p(x | \theta)p(\theta| D) {d}\theta$$
    This is the expectation of $p(x | \theta)$ under the posterior (whose form we already know). Recall the definition of expectation:
    $$E[f(x)] = \int f(x)p(x){d}x$$
    Taking $\theta$ as the random variable and $f(\theta) = p(x | \theta)$:
    $$p(x | D) = E_{p(\theta | D)}[p(x | \theta)]$$

3. Solving for $p(x=1 | D)$:
    Likelihood:
    $$p(x | \theta) = \theta^x(1 - \theta)^{(1 - x)} \rightarrow p(x=1 | \theta) = \theta$$
    Posterior expectation of $\theta$ under the Beta posterior:
    $$E[\theta] = \frac{\alpha_1'}{\alpha_0' + \alpha_1'}$$
    Result:
    $$p(x=1 | D) = E_{p(\theta | D)}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$$
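
To compare the three approaches side by side, the sketch below computes the MLE, the MAP estimate, and the fully Bayesian predictive probability for the same data. The $\text{Beta}(2, 2)$ prior and the dataset are assumptions chosen for illustration:

```python
def bernoulli_mle(data):
    """MLE: m / N."""
    return sum(data) / len(data)

def beta_bernoulli_map(alpha1, alpha0, data):
    """Posterior mode: (alpha1 + m - 1) / (alpha0 + alpha1 + N - 2)."""
    m, n = sum(data), len(data)
    return (alpha1 + m - 1) / (alpha0 + alpha1 + n - 2)

def beta_bernoulli_predictive(alpha1, alpha0, data):
    """Posterior mean: p(x=1 | D) = (alpha1 + m) / (alpha0 + alpha1 + N)."""
    m, n = sum(data), len(data)
    return (alpha1 + m) / (alpha0 + alpha1 + n)

data = [1, 1, 1]          # three heads, no tails
alpha1, alpha0 = 2, 2     # assumed prior pseudo-counts

print(bernoulli_mle(data))                              # 1.0
print(beta_bernoulli_map(alpha1, alpha0, data))         # 0.8
print(beta_bernoulli_predictive(alpha1, alpha0, data))  # about 0.714 (= 5/7)
```

The predictive probability uses the posterior mean rather than the mode, so on this small dataset it sits slightly closer to the prior mean of 0.5 than the MAP estimate does.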