# Computer Vision Recap - Pattern Recognition

## Lecture 20: Bayes Decision Theory

### Bayes Theorem

$$ P(w_j|x) = \frac{ P(x|w_j) \; P(w_j) }{ P(x) } $$ 

where the evidence $P(x)$ sums over all $c$ classes. With Gaussian class-conditional densities it is a mixture of Gaussians, which in general is no longer a normal distribution:
$$ P(x) = \sum_{k=1}^{c} P(x | w_k) P(w_k) $$
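
Bayes' theorem above can be sketched in a few lines. This is a minimal illustration, assuming univariate Gaussian class-conditional densities; the function names and the two-class parameters are hypothetical and chosen only for the example.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posteriors(x, params, priors):
    """Return P(w_j | x) for every class j via Bayes' theorem."""
    # Numerators: P(x | w_j) * P(w_j)
    joint = [gaussian_pdf(x, mu, sigma) * p
             for (mu, sigma), p in zip(params, priors)]
    # Evidence: P(x) = sum_k P(x | w_k) P(w_k)
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical two-class problem (illustrative values only).
params = [(0.0, 1.0), (2.0, 1.0)]  # (mean, standard deviation) per class
priors = [0.5, 0.5]
post = posteriors(1.0, params, priors)
print(post)
```

The point $x = 1$ lies halfway between the two means, so with equal priors and equal spreads both posteriors come out equal.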


Multivariate Gaussian PDF (dimension $m$, mean $\boldsymbol{\mu}$, covariance $\Sigma$):
$$f\left(x_{1}, \ldots, x_{m}\right)=\frac{1}{(2 \pi)^{m / 2} \sqrt{\operatorname{det} \Sigma}} e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\top} \mathbf{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}$$

Mahalanobis Distance:
$$ r = \sqrt{ (x-\mu)^\top \Sigma^{-1}(x-\mu) } $$
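
As a rough sketch, the Mahalanobis distance can be computed for the 2-D case by writing out the inverse of the $2 \times 2$ covariance matrix by hand. The function name `mahalanobis_2d` and the restriction to two dimensions are assumptions made for this example.

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance for 2-D vectors; cov is a 2x2 nested list."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        # A positive determinant is necessary (not sufficient) for a valid covariance.
        raise ValueError("covariance must have a positive determinant")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

r_identity = mahalanobis_2d([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
r_scaled = mahalanobis_2d([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
print(r_identity, r_scaled)
```

With the identity covariance the result coincides with the Euclidean distance; a larger variance along one axis shrinks the distance measured along that axis.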

Curse of dimensionality:

- If features are not independent, $\color{red}{exponentially}$ more training data is needed to compute meaningful likelihoods.
- If features are independent: $P(x|w_j)=P(x_1,\dots,x_d|w_j)=P(x_1|w_j)\dots P(x_d|w_j)$, then the effort grows linearly with the dimension.
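
To make the two growth rates concrete, one can count how many histogram cells would have to be filled to estimate the likelihood. This sketch assumes, purely for illustration, that each feature is discretized into the same number of bins; the helper names are hypothetical.

```python
def joint_cells(bins_per_feature, d):
    """Cells of a full joint histogram over d dependent features."""
    return bins_per_feature ** d

def factorized_cells(bins_per_feature, d):
    """Cells needed when the likelihood factorizes into d 1-D histograms."""
    return bins_per_feature * d

# Compare both counts for a few dimensions with 10 bins per feature.
for d in (1, 2, 5, 10):
    print(d, joint_cells(10, d), factorized_cells(10, d))
```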

<br>

### Bayesian Risk

The risk of taking action $a$ given $x$ is the expected value of a cost (loss) function $\lambda(a,w_j|x)$ under the posterior over the $c$ classes:

$$ R[a|x] = \mathbb E_{w_j \sim p(w_j|x)}[\lambda(a,w_j|x)] = \sum_{j=1}^{c} \lambda(a,w_j|x) \, p(w_j|x) $$

Examples of conditional risks:
$$ R\left(\alpha_{1} | \boldsymbol{x}\right)=\lambda_{11} P\left(\omega_{1} | \boldsymbol{x}\right)+\lambda_{12} P\left(\omega_{2} | \boldsymbol{x}\right)$$
$$ R\left(\alpha_{2} | \boldsymbol{x}\right)=\lambda_{21} P\left(\omega_{1} | \boldsymbol{x}\right)+\lambda_{22} P\left(\omega_{2} | \boldsymbol{x}\right)$$

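Here $\lambda_{ij}$ is the cost of taking action $\alpha_i$ when the true class is $\omega_j$. $\lambda_{11}$ and $\lambda_{22}$ are the costs of correct classification. They need not be zero, but we expect a correct decision to cost less than a wrong one: $\lambda_{11} < \lambda_{21}$ and $\lambda_{22} < \lambda_{12}$.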

Bayes decision rule: decide on $w_j$ if $R(a_j|x) < R(a_k|x), \;\; \forall k \neq j$
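
The conditional risks and the decision rule can be sketched as follows. The loss matrix and posterior values are made up for illustration, and the convention `loss[i][j]` = cost of action $\alpha_{i+1}$ when the true class is $\omega_{j+1}$ is an assumption of this example.

```python
def conditional_risks(loss, post):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x) for every action a_i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, post)) for row in loss]

def bayes_decision(loss, post):
    """Index of the action with the smallest conditional risk."""
    risks = conditional_risks(loss, post)
    return min(range(len(risks)), key=risks.__getitem__)

# Hypothetical costs: wrongly choosing action 1 when class 2 is true is expensive.
loss = [[0.0, 5.0],
        [1.0, 0.0]]
post = [0.7, 0.3]  # P(w_1 | x), P(w_2 | x)

risks = conditional_risks(loss, post)
action = bayes_decision(loss, post)
print(risks, action)
```

Although class 1 has the higher posterior here, the asymmetric costs make the second action (index 1) the one with lower risk.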

<br>

### Likelihood Ratio Test (LRT)

$$ \Lambda(\boldsymbol{x})=\frac{P\left(\boldsymbol{x} | \omega_{1}\right)}{P\left(\boldsymbol{x} | \omega_{2}\right)} \begin{array}{l} \omega_{1} \\ >\\<\\ \omega_{2} \end{array} \underbrace{\frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}} \cdot \frac{P\left(\omega_{2}\right)}{P\left(\omega_{1}\right)}}_{T} $$

If $\lambda_{11}=\lambda_{22}=0$ and $\lambda_{12}=\lambda_{21}=1$ (zero-one loss), the threshold reduces to $T = P(\omega_2)/P(\omega_1)$ and the LRT is called the MAP criterion. For equal priors ($T=1$) it becomes the ML criterion.
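
A minimal sketch of the likelihood ratio test, assuming univariate Gaussian likelihoods and $\lambda_{21} > \lambda_{11}$ so that the threshold is well defined. The class parameters, costs, and function names are hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def lrt_threshold(loss, prior1, prior2):
    """T = (l12 - l22) / (l21 - l11) * P(w2) / P(w1); assumes l21 > l11."""
    (l11, l12), (l21, l22) = loss
    return (l12 - l22) / (l21 - l11) * (prior2 / prior1)

def lrt_decide(x, class1, class2, loss, prior1, prior2):
    """Return 1 if the likelihood ratio exceeds the threshold, else 2."""
    ratio = gaussian_pdf(x, *class1) / gaussian_pdf(x, *class2)
    return 1 if ratio > lrt_threshold(loss, prior1, prior2) else 2

# Illustrative setup: (mean, standard deviation) per class, equal priors.
class1, class2 = (0.0, 1.0), (2.0, 1.0)
loss = [[0.0, 2.0],
        [1.0, 0.0]]
T = lrt_threshold(loss, 0.5, 0.5)
print(T, lrt_decide(0.5, class1, class2, loss, 0.5, 0.5))
```

Because wrongly deciding $\omega_1$ costs twice as much in this setup, the threshold rises to 2 and the decision boundary moves away from the midpoint between the means, toward the mean of class 1.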



### Discriminant Based Classification

Discriminant function: $g(x)$. Choose class $i$ if 
$$ g_i(x) > g_j(x) \quad \forall j \neq i $$

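The posterior from Bayes' rule can be used as a DF: $g_i(x) = P(w_i|x)$

Any monotonically increasing transformation of $g_i$ yields the same decisions, so for Gaussian class-conditional densities we can take $g_i(x) = \ln P(x|w_i) + \ln P(w_i)$, which gives: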
$$g_{i}(\boldsymbol{x})=-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}_{i}\right)^{\top} \Sigma_{i}^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_{i}\right)-\frac{d}{2} \ln (2 \pi)-\frac{1}{2} \ln \left(\left|\Sigma_{i}\right|\right)+\ln P\left(\omega_{i}\right)$$

#### Case 1: $\Sigma_i=\sigma I$

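$$g_{i}(\boldsymbol{x})=-\frac{\left\|\boldsymbol{x}-\boldsymbol{\mu}_{i}\right\|^{2}}{2 \sigma}+\ln P\left(\omega_{i}\right)=-\frac{x^{\top} x-2 \mu_i^{\top} x+\mu_i^{\top} \mu_i}{2 \sigma}+\ln P\left(\omega_{i}\right)$$

Here $\sigma$ denotes the shared variance. The term $x^\top x$ is the same for every class and can be dropped, which results in a $\color{red}{linear}$ discriminant: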
$$g_{i}(\boldsymbol{x})=\boldsymbol{w}_{i}^{\top} \boldsymbol{x}+\boldsymbol{w}_{i 0} \quad \text{where}$$
$$\boldsymbol{w}_{i}=\frac{1}{\sigma} \boldsymbol{\mu}_{i}, \quad w_{i 0}=-\frac{1}{2 \sigma} \boldsymbol{\mu}_{i}^{\top} \boldsymbol{\mu}_{i}+\ln P\left(\omega_{i}\right)$$
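
The linear discriminant for this case can be sketched directly from $\boldsymbol{w}_i$ and $w_{i0}$. In this example `variance` plays the role of $\sigma$ in the formulas above; the class means, priors, and function names are illustrative assumptions.

```python
import math

def linear_discriminant(x, mu, variance, prior):
    """g_i(x) = w_i^T x + w_i0 for a covariance of the form variance * I."""
    w = [m / variance for m in mu]
    w0 = -sum(m * m for m in mu) / (2.0 * variance) + math.log(prior)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify(x, means, variance, priors):
    """Index of the class with the largest discriminant value."""
    scores = [linear_discriminant(x, mu, variance, p)
              for mu, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical 2-D problem with two classes.
means = [[0.0, 0.0], [4.0, 0.0]]

# Equal priors: the boundary is the perpendicular bisector x_1 = 2.
print(classify([1.0, 3.0], means, 1.0, [0.5, 0.5]))
# Unequal priors shift the boundary toward the less likely class.
print(classify([2.3, 0.0], means, 1.0, [0.9, 0.1]))
```

With equal priors this reduces to a nearest-mean classifier; the prior term only shifts the boundary along the line joining the means.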

#### Case 2: $\Sigma_i=\Sigma$

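$$g_{i}(\boldsymbol{x})=-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}_{i}\right)^{\top} \Sigma^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_{i}\right)+\ln P\left(\omega_{i}\right)$$

Since the term $x^\top \Sigma^{-1} x$ is the same for every class, it can be dropped, which is also a $\color{red}{linear}$ discriminant: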
$$g_{i}(\boldsymbol{x})=\boldsymbol{w}_{i}^{\top} \boldsymbol{x}+\boldsymbol{w}_{i 0} \quad \text{where}$$
$$\boldsymbol{w}_{i}=\Sigma^{-1} \boldsymbol{\mu}_{i}, \quad w_{i 0}=-\frac{1}{2} \boldsymbol{\mu}_{i}^{\top} \Sigma^{-1} \boldsymbol{\mu}_{i}+\ln P\left(\omega_{i}\right)$$

#### Case 3: $\Sigma_i$ = Arbitrary

$$ g_i(x) = x^\top W_i x + w_i^\top x + w_{i,0} \quad \text{where}$$
$$W_{i}=-\frac{1}{2} \Sigma_{i}^{-1}, \quad \boldsymbol{w}_{i}=\Sigma_{i}^{-1} \boldsymbol{\mu}_{i}, \quad w_{i 0}=-\frac{1}{2} \boldsymbol{\mu}_{i}^{\top} \Sigma_{i}^{-1} \boldsymbol{\mu}_{i}-\frac{1}{2} \ln \left|\Sigma_{i}\right|+\ln P\left(\omega_{i}\right)$$

which is $\color{red}{quadratic}$; the resulting decision regions can be non-contiguous
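
A one-dimensional sketch shows how a quadratic discriminant can split a decision region into disjoint pieces. The two classes below (same mean, different variances, equal priors) are a hypothetical example, and the scalar form of $W_i$, $w_i$, $w_{i0}$ is used.

```python
import math

def quadratic_discriminant_1d(x, mu, var, prior):
    """Scalar version of g_i(x) = x W_i x + w_i x + w_i0."""
    W = -0.5 / var
    w = mu / var
    w0 = -0.5 * mu * mu / var - 0.5 * math.log(var) + math.log(prior)
    return W * x * x + w * x + w0

# (mean, variance, prior) per class: a narrow and a wide Gaussian at the origin.
classes = [(0.0, 1.0, 0.5), (0.0, 9.0, 0.5)]

def classify(x):
    """Index of the class with the largest discriminant value."""
    scores = [quadratic_discriminant_1d(x, *c) for c in classes]
    return max(range(len(scores)), key=scores.__getitem__)

labels = [classify(x) for x in (-3.0, 0.0, 3.0)]
print(labels)
```

The narrow class wins near the origin, while the wide class wins on both sides of it, so the wide class owns two separate intervals.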




<br><br>

## Lecture 21: Parametric Techniques, Density Estimation


### Parameter Estimation

#### Maximum Likelihood

If the samples in $D$ are drawn independently, the log-likelihood is:
$$ l(\boldsymbol{\theta})=\ln p(D | \boldsymbol{\theta})=\ln \prod_{k=1}^{n} p\left(\boldsymbol{x}_{k} | \boldsymbol{\theta}\right)=\sum_{k=1}^{n} \ln p\left(\boldsymbol{x}_{k} | \boldsymbol{\theta}\right) $$

$$\hat{\boldsymbol{\theta}}=\arg \max _{\boldsymbol{\theta}} l(\boldsymbol{\theta})$$

Maximizing the likelihood implies that:
$$\nabla_{\theta} l(\theta)=\nabla_{\theta} \sum_{k=1}^{n} \ln p\left(x_{k} | \theta\right)=\sum_{k=1}^{n}\nabla_{\theta} \ln p\left(x_{k} | \theta\right)=0$$

In the Gaussian case with unknown mean $\mu$ and unknown covariance $\Sigma$, the solution is:

$$\hat{\mu}=\frac{1}{n} \sum_{k=1}^{n} x_{k} \quad \text{and} \quad \hat{\Sigma}=\frac{1}{n} \sum_{k=1}^{n}\left(x_{k}-\hat{\mu}\right)\left(x_{k}-\hat{\mu}\right)^{\top}$$
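
The two maximum-likelihood estimates can be sketched in plain Python. The sample values are made up for illustration; note the division by $n$ (not $n-1$), matching the formulas above.

```python
def ml_mean(samples):
    """Sample mean: (1/n) * sum_k x_k, computed per dimension."""
    n = len(samples)
    d = len(samples[0])
    return [sum(x[j] for x in samples) / n for j in range(d)]

def ml_covariance(samples):
    """ML covariance: (1/n) * sum_k (x_k - mu_hat)(x_k - mu_hat)^T."""
    n = len(samples)
    mu = ml_mean(samples)
    d = len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for x in samples:
        diff = [x[j] - mu[j] for j in range(d)]
        # Accumulate the outer product diff * diff^T, scaled by 1/n.
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b] / n
    return cov

# Hypothetical 2-D samples.
samples = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
mu_hat = ml_mean(samples)
sigma_hat = ml_covariance(samples)
print(mu_hat, sigma_hat)
```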

<br>

### Density Estimation

Idea: Estimate the distribution (function) from scratch

$$P=\int_{R} p(x) d x \approx p\left(x^{*}\right) \operatorname{vol}(R) \approx \frac{k}{n}$$

where $k$ out of $n$ samples fall into the region $R$

Choose $R$ large enough that it contains sufficiently many samples (so that the variance of the estimate stays small), yet small enough that $p(x)$ is roughly constant inside it (so that the approximation above holds).

<br>

<img src="./images/densest.png" width=400>

<br><br>

To estimate the density at $x^*$:
$$p_{n}\left(x^{*}\right)=\frac{k_{n} / n}{\operatorname{vol}\left(R_{n}\right)} \approx p\left(x^{*}\right)$$


<br>

#### Bias-Variance Tradeoff

<img src="./images/biasvariance.png" width=600>

<br><br>

## Lecture 22: Non-Parametric Techniques

### Parzen Windows

**Idea:** Count the number of samples $k$ within a region $R$ of fixed size

Define a window function (the unit hypercube centered at the origin):
$$\varphi(u)=\left\{ \begin{array}{ll} 1, & \left|u_{j}\right| \leq \frac{1}{2}, j=1,2, \ldots, d \\ 0, & \text { otherwise }\end{array} \right.$$

The previous definitions for the density and the number of samples
$$\begin{align*} p(\boldsymbol{x}) \approx \frac{k_{n} / n}{\operatorname{vol}\left(R_{n}\right)} \quad \text { and } \quad k_{n}=\sum_{i=1}^{n} \varphi\left(\frac{\boldsymbol{x}-\boldsymbol{x}_{i}}{h}\right) \end{align*}$$

in a hypercube region $R$ with edge length $h$, so that $\operatorname{vol}(R)=h^{d}$, lead to the density estimate
$$ \begin{align*} \tilde{p}(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\operatorname{vol}(R)} \varphi\left(\frac{\boldsymbol{x}-\boldsymbol{x}_{i}}{h}\right)=\frac{1}{n h^{d}} \sum_{i=1}^{n} \varphi\left(\frac{\boldsymbol{x}-\boldsymbol{x}_{i}}{h}\right) \end{align*} $$
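
The hypercube estimate above can be sketched as follows. The sample points and the query point are invented for the example, and the function names are hypothetical.

```python
def hypercube_window(u):
    """phi(u) = 1 if every component satisfies |u_j| <= 1/2, else 0."""
    return 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0

def parzen_estimate(x, samples, h):
    """Density estimate 1 / (n h^d) * sum_i phi((x - x_i) / h)."""
    d = len(x)
    n = len(samples)
    total = sum(
        hypercube_window([(xj - sj) / h for xj, sj in zip(x, s)])
        for s in samples
    )
    return total / (n * h ** d)

# Hypothetical 2-D samples and a query point.
samples = [[0.0, 0.0], [0.2, 0.1], [0.9, 0.9], [2.0, 2.0]]
est_small = parzen_estimate([0.1, 0.1], samples, 1.0)
est_large = parzen_estimate([0.1, 0.1], samples, 2.0)
print(est_small, est_large)
```

Doubling the edge length captures one more sample but quadruples the volume in 2-D, so the estimate at this point drops.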

### General Parzen Windows

Replace the hypercube kernel $\varphi((x-x_i)/h)$ with a Gaussian centered at $x_i$ whose standard deviation is the window width $h$ (variance $h^2$)
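
A one-dimensional sketch of the Gaussian variant follows. It assumes a standard normal kernel evaluated at $(x - x_i)/h$, so that $h$ acts as the standard deviation of each bump; the sample values and the numerical integration grid are illustrative only.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as a smooth window function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_gaussian_1d(x, samples, h):
    """Density estimate 1 / (n h) * sum_i K((x - x_i) / h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (n * h)

# Hypothetical 1-D samples and window width.
samples = [-1.0, 0.0, 0.5, 2.0]
h = 0.5

# Sanity check: a Riemann sum over [-6, 7] should be close to 1.
step = 0.01
area = sum(parzen_gaussian_1d(-6.0 + i * step, samples, h) * step
           for i in range(1301))
print(area)
```

Unlike the hypercube window, every sample contributes to the estimate at every point, with a weight that decays smoothly with distance.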

<img src="./images/parzen.png" width=400>

### K-Nearest Neighbours (KNN)



<br><br>

## Lecture 23: Dimensionality Reduction

### Principal Component Analysis (PCA)



### Linear Discriminant Analysis (LDA)




