In [1]:
import numpy as np

# Nando de Freitas UBC ML Lectures

Lecture Videos: https://www.youtube.com/playlist?list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf

Website: http://www.cs.ubc.ca/~nando/340-2012/lectures.php

Google Group: https://groups.google.com/forum/#!forum/cpsc340-2012

Latex symbols: https://oeis.org/wiki/List_of_LaTeX_mathematical_symbols

Equations: https://en.wikibooks.org/wiki/LaTeX/Advanced_Mathematics

# Lecture Notes 10

## Question on slide 9, $P(M=1 \mid T=0)$?

To continue from **slide 5**, the probabilities are:

$P( T \mid F=1, S=0, \beta_{2} ) = \beta^1_2 (1-\beta_{2})^1$

$P( T \mid F=0, S=1, \beta_{3} ) = \beta^1_3 (1-\beta_{3})^0$

$P( T \mid F=1, S=1, \beta_{4} ) = \beta^1_4 (1-\beta_{4})^0$

With **Maximum Likelihood**, I get:

$\beta_{2,ML} = \frac{1}{2}$

$\beta_{3,ML} = \frac{1}{1} = 1$

$\beta_{4,ML} = \frac{1}{1} = 1$

Also to complete **slide 6**: 

$\gamma_{2,ML} = \frac{3}{4}$

Then we have all the numbers we need to complete the tables on **slide 4**.

Assuming all priors are Beta(1,1), we can use the **posterior mean** instead. For example:

$P(\theta) = Beta(1,1) \propto \theta^{(1-1)}(1-\theta)^{(1-1)}$

Combining this prior with the probabilities on slide 5 gives:

$$
\begin{aligned}
P(\theta \mid M) \propto P(M \mid \theta)P(\theta) \propto \theta^4(1-\theta)^1\theta^0(1-\theta)^0 \Rightarrow \mathbb{E}(\theta \mid M)= \frac{5}{5+2}=\frac{5}{7}
\end{aligned}
$$

And therefore:

$\mathbb{E}(\alpha \mid S)=\frac{3}{4}$

$\mathbb{E}(\gamma_1 \mid F,M=0)=\frac{1}{1+2}=\frac{1}{3}$

$\mathbb{E}(\gamma_2 \mid F,M=1)=\frac{4}{4+2}=\frac{2}{3}$

$\mathbb{E}(\beta_1 \mid T,F=0,S=0)=\frac{1}{1+2}=\frac{1}{3}$

$\mathbb{E}(\beta_2 \mid T,F=1,S=0)=\frac{2}{2+2}=\frac{1}{2}$

$\mathbb{E}(\beta_3 \mid T,F=0,S=1)=\frac{2}{2+1}=\frac{2}{3}$

$\mathbb{E}(\beta_4 \mid T,F=1,S=1)=\frac{2}{2+1}=\frac{2}{3}$

Now we have all parameters needed for filling the tables on **slide 4**.
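
The posterior means above all follow the same counting rule, so a small helper can reproduce them. This is a minimal sketch, assuming each parameter is a Bernoulli probability with a Beta(1,1) prior; the function name `beta_posterior_mean` and its arguments are made up for illustration. The counts are read off the exponents in the likelihood terms (for example $\theta^4(1-\theta)^1$ means four ones and one zero).

```python
from fractions import Fraction

def beta_posterior_mean(ones, zeros, a=1, b=1):
    """Posterior mean of a Bernoulli parameter under a Beta(a, b) prior.

    The posterior is Beta(a + ones, b + zeros), whose mean is
    (a + ones) / (a + b + ones + zeros).
    """
    return Fraction(a + ones, a + b + ones + zeros)

theta_mean = beta_posterior_mean(ones=4, zeros=1)   # theta^4 (1 - theta)^1
beta2_mean = beta_posterior_mean(ones=1, zeros=1)   # beta_2^1 (1 - beta_2)^1
beta3_mean = beta_posterior_mean(ones=1, zeros=0)   # beta_3^1 (1 - beta_3)^0

print(theta_mean)  # 5/7
```

Using `Fraction` keeps the results exact, which makes them easy to compare with the hand-computed values.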

# Matrix Algebra & Calculus

http://www.statpower.net/Content/312/Handout/Matrix.pdf


## Properties of Transposition

1. $(A^{'})^{'} = A$
2. $(cA)^{'} = cA^{'}$
3. $(A + B)^{'} = A^{'} + B^{'}$
4. $(AB)^{'} = B^{'}A^{'}$
5. $a^{'}b = b^{'}a$ for vectors $a$ and $b$, since the product is a scalar (this does not hold for general matrices)
6. $(X^{-1})^{'} = (X^{'})^{-1}$

## Other Properties

* If $x^{-1} == x^{T}$ then $x$ is **orthogonal**.
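
The transposition rules can be checked numerically. The sketch below uses random $3 \times 3$ matrices (an arbitrary choice, assumed invertible) and builds an orthogonal matrix from a QR factorisation to illustrate the orthogonality property; the dictionary `checks` is just a convenient way to collect the results.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
c = 2.5

checks = {
    "(A')' = A": np.allclose(A.T.T, A),
    "(cA)' = cA'": np.allclose((c * A).T, c * A.T),
    "(A + B)' = A' + B'": np.allclose((A + B).T, A.T + B.T),
    "(AB)' = B'A'": np.allclose((A @ B).T, B.T @ A.T),
    "(A^-1)' = (A')^-1": np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T)),
}

# The Q factor of a QR decomposition is orthogonal, so its inverse is its transpose.
Q, _ = np.linalg.qr(A)
checks["Q^-1 = Q'"] = np.allclose(np.linalg.inv(Q), Q.T)

for name, ok in checks.items():
    print(name, ok)
```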

## Calculus

Matrix calculus for machine learning [here](https://arxiv.org/abs/1802.01528)

Proof: http://www.atmos.washington.edu/~dennis/MatrixCalculus.pdf

| y | $\frac{\partial{y}}{\partial{x}}$ |
|:---:|:---------------:| 
| $Ax$ | $A^{T}$ |
| $x^{T}A$ | $A$ |
| $x^{T}x$ | $2x$ |
| $x^{T}Ax$ | $Ax + A^{T}x$, if $A$ is **symmetric**, $2Ax$ |

| $\alpha$ | $\frac{\partial{\alpha}}{\partial{x}}$ | $\frac{\partial{\alpha}}{\partial{y}}$ |
|:---:|:---:|:---:|
| $y^{T}Ax$ | $y^{T}A$ | $x^{T}A^{T}$ |

| $\alpha$ | $\frac{\partial{\alpha}}{\partial{z}}$ |
|:---:|:---:|
| $y^{T}x$ | $x^{T}\frac{\partial{y}}{\partial{z}} + y^{T}\frac{\partial{x}}{\partial{z}}$ |
| $y^{T}Ax$ | $x^{T}A^{T}\frac{\partial{y}}{\partial{z}} + y^{T}A\frac{\partial{x}}{\partial{z}}$ |
| $x^{T}x$ | $2x^{T}\frac{\partial{x}}{\partial{z}}$ |
| $x^{T}Ax$ | $x^{T}(A + A^{T})\frac{\partial{x}}{\partial{z}}$ |
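
A finite-difference check is a quick way to gain confidence in entries of these tables. The sketch below compares the analytic gradients of $x^{T}Ax$ and $x^{T}x$ with central differences; `numerical_gradient` is a hypothetical helper written for this note, and the matrix and vector are random.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))   # deliberately not symmetric
x = rng.normal(size=4)

quad_numeric = numerical_gradient(lambda v: v @ A @ v, x)
quad_analytic = A @ x + A.T @ x          # table entry for x^T A x

sq_numeric = numerical_gradient(lambda v: v @ v, x)
sq_analytic = 2 * x                      # table entry for x^T x

print(np.allclose(quad_numeric, quad_analytic, atol=1e-6))
print(np.allclose(sq_numeric, sq_analytic, atol=1e-6))
```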

### Trace of Matrices

Kevin Murphy's book, p101

Trace of matrix $A$ is the sum of its diagonal elements.

$$
\begin{aligned}
tr(A) =& \sum_{i} A_{ii} \\
tr(ABC) =& tr(CAB) = tr(BCA) 
\end{aligned}
$$

**Trace trick**: the last equation above is called the **cyclic permutation property** of the trace operator. Using it we derive the trace trick, which rewrites the scalar inner product $x^TAx$ as a trace:

$$x^TAx = tr(x^TAx) = tr(xx^TA) = tr(Axx^T) $$
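
A small numerical check of the trace trick, as a sketch with a random matrix and a random column vector (both arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
x = rng.normal(size=(3, 1))          # column vector

quad = (x.T @ A @ x).item()          # scalar x^T A x
trace_xxA = np.trace(x @ x.T @ A)    # tr(x x^T A)
trace_Axx = np.trace(A @ x @ x.T)    # tr(A x x^T)

print(np.isclose(quad, trace_xxA), np.isclose(quad, trace_Axx))
```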

# Lecture 19

## Linear Regression - Maximum Likelihood Matrix Maths

$$
\begin{aligned}
\ L(\theta) & = [y - x \theta]^{T} [y - x \theta ] \\
\ &= [y^{T} - (x\theta)^{T}][y - x\theta] \\
\ & = [y^{T} - \theta^{T}x^{T}][y - x\theta] \\
\ & = y^{T}y - y^{T}x\theta - \theta^{T}x^{T}y + \theta^{T}x^{T}x\theta \\
\ & = y^{T}y - 2y^{T}x\theta +\theta^{T}x^{T}x\theta \\
\end{aligned}
$$

Note that $y^{T}x\theta == \theta^{T}x^{T}y$, because both sides are scalars and each is the transpose of the other. See the numerical example below.

In [19]:
y = np.matrix([[2], [3]])
x = np.matrix([[3, 5], [6, 8]])
theta = np.matrix([[5], [6]])

In [20]:
y

matrix([[2],
        [3]])

In [21]:
x

matrix([[3, 5],
        [6, 8]])

In [22]:
theta

matrix([[5],
        [6]])

In [18]:
np.isclose(y.T @ x @ theta - theta.T @ x.T @ y, 0)

matrix([[ True]], dtype=bool)

Minimising $L(\theta)$ is equivalent to maximising the Gaussian likelihood, so setting $\frac{\partial{L(\theta)}}{\partial{\theta}} = 0$ gives the maximum likelihood estimate:

$$
\begin{aligned}
\ \frac{\partial{L(\theta)}}{\partial{\theta}} = -2x^{T}y + 2x^{T}x\theta &= 0\\
\ x^{T}x\theta &= x^{T}y \\
\ \hat{\theta} &= (x^{T}x)^{-1}x^{T}y \\
\end{aligned}
$$
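
The closed-form estimate can be compared against a library least-squares solver. This is a sketch on synthetic data (the true parameters, noise level and sample size are made up); it solves the linear system $x^{T}x\theta = x^{T}y$ rather than forming the inverse explicitly, which is usually the more stable choice.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
true_theta = np.array([1.0, -2.0, 0.5])        # assumed values for the demo
y = X @ true_theta + 0.1 * rng.normal(size=50)

# Solve X^T X theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient of the squared error should vanish at theta_hat.
gradient = -2 * X.T @ y + 2 * X.T @ X @ theta_hat

print(theta_hat)
print(np.allclose(theta_hat, theta_lstsq))
```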

# Probability Distributions

## Normal Distribution

Python: `scipy.stats.norm`

Given a normal distribution $\mathcal{N}(\mu, \sigma^2)$:

$$ 
\begin{aligned}
\ PDF(x) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigg(-\frac{(x-\mu)^2}{2\sigma^2}\bigg) \\
\ CDF(x) &= \frac{1}{2}\big[1 + erf\big(\frac{x - \mu}{\sigma\sqrt{2}}\big)\big] \\
\ erf(x) &= \frac{1}{\sqrt{\pi}}\int_{-x}^{x}e^{-t^2}dt = \frac{2}{\sqrt{\pi}}\int_{0}^{x}e^{-t^2}dt\\
\end{aligned}
$$

**Control of Precision**: when we need to evaluate the PDF many times, we set $\beta = \frac{1}{\sigma^2}$:

$$ 
\begin{aligned}
\ PDF(x) &= \sqrt{\frac{\beta}{2\pi}} \exp\bigg(-\frac{1}{2}\beta(x-\mu)^2\bigg) \\
\end{aligned}
$$
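
As a sketch, the two PDF parameterisations and the erf-based CDF can be written with the standard library only. The function names below are made up for this note; `scipy.stats.norm` provides the production versions.

```python
import math

def normal_pdf(x, mu, sigma2):
    """PDF of N(mu, sigma2) in the variance parameterisation."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_pdf_precision(x, mu, beta):
    """Same PDF written with the precision beta = 1 / sigma2."""
    return math.sqrt(beta / (2 * math.pi)) * math.exp(-0.5 * beta * (x - mu) ** 2)

def normal_cdf(x, mu, sigma2):
    """CDF of N(mu, sigma2) via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * sigma2)))

p_var = normal_pdf(0.3, mu=1.0, sigma2=4.0)
p_prec = normal_pdf_precision(0.3, mu=1.0, beta=0.25)
print(math.isclose(p_var, p_prec))
```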

## Beta Distribution

https://en.wikipedia.org/wiki/Beta_distribution

Python: `scipy.stats.beta`

In Bayesian inference, the beta distribution is the **conjugate prior** probability distribution for the **Bernoulli, binomial, negative binomial and geometric distributions**. 

Given $Beta(\alpha, \beta)$:

$$ 
\begin{aligned}
\ \Gamma(n) &= (n - 1)! \, \forall n \in \text{positive integers, i.e. }\mathbb{N}^+ \\
\ \Gamma(z) &= \int_{0}^{\infty}x^{z-1}e^{-x}dx \\
\ B(\alpha, \beta) &= \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \\
\ PDF(x) &= \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} \\
\ CDF(x) &= I_x(\alpha, \beta) = \frac{B(x; \alpha,\beta)}{B(\alpha, \beta)} = \frac{\int_{0}^{x}t^{\alpha-1}(1-t)^{\beta-1}dt}{B(\alpha, \beta)} \\
\ \mathbb{E}[Beta(\alpha, \beta)] &= \frac{\alpha}{\alpha + \beta} \\
\ var(Beta(\alpha, \beta)) &= \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} \\
\end{aligned}
$$
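
The formulas above can be evaluated directly with `math.gamma`. The sketch below is illustrative only: the CDF uses a simple midpoint rule (an assumption made for brevity, not how `scipy.stats.beta` computes it), and the parameters $\alpha = 5$, $\beta = 2$ are chosen to match the posterior from Lecture Notes 10.

```python
import math

def beta_function(a, b):
    """B(a, b) expressed through the gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_pdf(x, a, b):
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_function(a, b)

def beta_cdf(x, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of the PDF over [0, x]."""
    h = x / steps
    return sum(beta_pdf((i + 0.5) * h, a, b) for i in range(steps)) * h

a, b = 5, 2
mean = a / (a + b)
variance = a * b / ((a + b) ** 2 * (a + b + 1))

print(mean, variance)
print(beta_cdf(0.5, a, b))
```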

## Bernoulli Distribution

See **Nando's lecture notes 19** for a good discussion on its application.

Given $ 0 < p < 1, p \in \mathbb{R}$:

$$
\begin{aligned}
\ PDF(p,k) &= p^k(1-p)^{1-k}, &k \in \{0, 1\} \\
\ CDF(p,k) &= 
\begin{cases}
0 &k < 0 \\
1-p &0 \leq k < 1 \\
1 &k \geq 1 
\end{cases} \\
\ Mean = p \\
\ Variance = p(1-p)
\end{aligned}
$$

## Binomial Distribution

Given $B(n, p), n \in \mathbb{N}_0, p \in [0, 1]$:

$$
\begin{aligned}
PDF &= \binom{n}{k} p^k(1-p)^{n-k} \\
CDF &= \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i(1-p)^{n-i} = I_{1-p}(n-k, 1+k) \\
Mean &= np \\
Variance &= np(1-p)
\end{aligned}
$$
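
A short sketch of the binomial PMF and CDF using `math.comb` (Python 3.8+). The values $n = 10$ and $p = 0.3$ are arbitrary; setting $n = 1$ recovers the Bernoulli case.

```python
import math

def binom_pmf(k, n, p):
    """P(K = k) for K ~ B(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(K <= k): sum of the PMF from 0 to floor(k)."""
    return sum(binom_pmf(i, n, p) for i in range(math.floor(k) + 1))

n, p = 10, 0.3
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
variance = sum((k - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))

print(mean, variance)          # should be close to n*p and n*p*(1-p)
print(binom_cdf(3, n, p))
```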

# Lecture Notes 15

## Multivariate Normal Distribution

Let $y \in \mathbb{R}^{n\times1}$, then the PDF of an **n-dimensional** multivariate normal distribution is (only when $\Sigma$ is **positive-definite**):

$$
\begin{aligned}
pdf(y) &= (2\pi)^{-n/2}\det(\Sigma)^{-1/2}\exp\bigg[-\frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)\bigg] \\
&= \frac{1}{\sqrt{(2\pi)^{n}det(\Sigma)}} \exp\bigg[-\frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)\bigg] \\
\mu &= \mathbb{E}(y) \\
\Sigma &= \mathbb{E}[(y-\mu)(y-\mu)^T] = \text{covariance} \\
\end{aligned}
$$

**Mahalanobis distance** measures the distance between a data vector $x$ and the mean vector $\mu$. Its square, $(x-\mu)^T\Sigma^{-1}(x-\mu)$, is the quadratic form inside the exponential in the multivariate normal PDF formula.

**Control of Precision**: when we need to evaluate the PDF many times, it is cheaper to invert $\Sigma$ once and store the result as the precision matrix $\beta = \Sigma^{-1}$. See Deep Learning (Goodfellow) p63.

$$
\begin{aligned}
pdf(y) &= (2\pi)^{-n/2}\det(\beta)^{1/2}\exp\bigg[-\frac{1}{2}(y-\mu)^T\beta(y-\mu)\bigg] \\
\mu &= \mathbb{E}(y) \\
\Sigma &= \mathbb{E}[(y-\mu)(y-\mu)^T] = \beta^{-1} = \text{covariance} \\
\end{aligned}
$$
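
The covariance and precision forms of the density should agree when the precision matrix is the inverse of the covariance. The sketch below checks this on a small example; the mean, covariance and query point are arbitrary, and the function names are made up for this note.

```python
import numpy as np

def mvn_pdf(y, mu, Sigma):
    """Multivariate normal PDF in the covariance parameterisation."""
    n = mu.size
    diff = y - mu
    maha_sq = diff @ np.linalg.solve(Sigma, diff)   # squared Mahalanobis distance
    return np.exp(-0.5 * maha_sq) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

def mvn_pdf_precision(y, mu, beta):
    """Same PDF written with the precision matrix beta = inv(Sigma)."""
    n = mu.size
    diff = y - mu
    return (2 * np.pi) ** (-n / 2) * np.sqrt(np.linalg.det(beta)) * np.exp(-0.5 * diff @ beta @ diff)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])      # symmetric positive-definite
beta = np.linalg.inv(Sigma)         # computed once, reused for every evaluation
y = np.array([0.5, 0.0])

p_cov = mvn_pdf(y, mu, Sigma)
p_prec = mvn_pdf_precision(y, mu, beta)
print(np.isclose(p_cov, p_prec))
```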

## Eigenvector & Eigenvalues

Deep Learning (Goodfellow)

An **eigenvector** of a **square** matrix $A$ is a nonzero vector $v$ such that:
$$ Av = \lambda v $$
Where $\lambda$ is a constant scalar, known as the **eigenvalue** corresponding to this eigenvector.

Concatenating all the eigenvectors to form a matrix $V$ with one eigenvector per column, and eigenvalues into a diagonal matrix:

$$
\begin{aligned}
V =& \big[v^{(1)},\dots,v^{(n)}\big] \\
D =& \text{diag}([\lambda_1, \dots, \lambda_n])
\end{aligned}
$$

We have the **Eigen-decomposition**:

$$ A = V D V^{-1} $$

### Properties of Eigen-decomposition

If $A$ is **real and symmetric**, $V$ can be chosen as an **orthogonal matrix** of eigenvectors, satisfying $V^{-1} = V^T$ and $V^T V=I$, so that $A = VDV^T$

$D$ is a diagonal, square matrix of eigenvalues

Note the inverse of a diagonal matrix exists only if every diagonal entry is nonzero. In this case:
$$ diag(v)^{-1} = diag([1/v_1, \dots, 1/v_n]^T) $$

### Linear Dependence / Singular Matrix

A set of vectors is **linearly independent** if no vector in the set is a linear combination of the other vectors.

A **square** matrix with linearly dependent columns is known as **singular**.

For square matrices the left inverse and right inverse are equal.

## Matrix Fractional Power

http://math.stackexchange.com/questions/732511/fractional-power-of-matrix 

If a matrix is diagonalizable, first diagonalize it (**eigen-decomposition**), apply the power then convert back.

$$
\begin{aligned}
A &= VDV^{-1} \\
A^n &= VD^nV^{-1} \\
\end{aligned}
$$

See **power of matrices** section at https://en.wikipedia.org/wiki/Matrix_multiplication

**For a diagonal matrix $A$, $A^n$ is just raising the diagonal elements of $A$ to the power of $n$.**
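
As a sketch of the diagonalise, power, convert-back recipe, the example below takes a small symmetric positive-definite matrix (chosen so the eigenvalues are positive and the eigenvectors orthogonal) and computes both a square root and a cube. `np.linalg.eigh` is used because the matrix is symmetric; a general diagonalizable matrix would need `np.linalg.eig` and $V^{-1}$ instead of $V^T$.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                     # symmetric positive-definite example

eigvals, V = np.linalg.eigh(A)                 # columns of V are orthonormal eigenvectors

A_rebuilt = V @ np.diag(eigvals) @ V.T         # A = V D V^T
A_sqrt = V @ np.diag(eigvals ** 0.5) @ V.T     # fractional power n = 1/2
A_cubed = V @ np.diag(eigvals ** 3) @ V.T      # integer power n = 3

print(np.allclose(A_sqrt @ A_sqrt, A))
print(np.allclose(A_cubed, np.linalg.matrix_power(A, 3)))
```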

## SVD

Deep Learning (Goodfellow)

Assume matrix $A$ is $m \times n$, SVD is

$$ A = UDV^{T} $$

where:

1. $U$ is $m \times m$, orthogonal, columns are **left-singular vectors** of $A$
2. $D$ is $m \times n$, diagonal matrix of **singular values** of $A$
3. $V$ is $n \times n$, orthogonal, columns are **right-singular vectors** of $A$

### Other Properties

1. the columns of $U$ are the eigenvectors of $AA^T$
2. the columns of $V$ are the eigenvectors of $A^T A$
3. the nonzero singular values of $A$ are the square roots of eigenvalues of $A^T A$, same is true for $AA^T$

### Moore-Penrose Pseudoinverse

Solving equations: $Ax = y$, given $A$ is $n \times m$.

1. $n > m$ - possible that there is no solution
2. $n < m$ - could be multiple solutions

Pseudoinverse definition:

$$ A^+ = VD^+U^T $$

where $U$, $D$, $V$ are coming from the SVD of $A$, the pseudoinverse $D^+$ of diagonal matrix $D$ is obtained by taking the reciprocal of its nonzero elements then taking the transpose of the resulting matrix.
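
The definition can be reproduced with `np.linalg.svd` and compared against `np.linalg.pinv`. This is a sketch for a random tall matrix, assuming full column rank so that every singular value is nonzero; a rank-deficient matrix would need the zero singular values left as zeros in $D^+$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))                     # more rows than columns

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# D has the shape of A; D^+ has the transposed shape with reciprocal singular values.
D_plus = np.zeros((A.shape[1], A.shape[0]))
D_plus[:s.size, :s.size] = np.diag(1.0 / s)

A_plus = Vt.T @ D_plus @ U.T                    # A^+ = V D^+ U^T

# Least-squares solution of A x = y via the pseudoinverse.
y = rng.normal(size=5)
x_pinv = A_plus @ y

print(np.allclose(A_plus, np.linalg.pinv(A)))
```

For this overdetermined case the pseudoinverse solution coincides with the least-squares solution.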

# Lecture 20 (Notes PDF L16b)

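Ridge regression gives the posterior mean of Bayesian linear regression in the special case where the prior mean is 0 and the prior covariance is $\tau^2 I_d$.

Ridge regression essentially damps the effect of the small eigenvalues of $X^{T}X$: adding $\lambda$ to every eigenvalue before inverting keeps those directions from dominating the solution.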

## Bayesian Linear Regression

The likelihood is Gaussian, $\mathcal{N}(y\mid X\theta, \sigma^{2}I_n)$, where $n$ is the number of observations.

The conjugate prior is also Gaussian, $\mathcal{N}(\theta \mid \theta_{0}, V_{0})$

With Bayes rule, the posterior is given by:

$$
\begin{aligned}
\ p(\theta \mid X, y, \sigma^{2}) &\propto \mathcal{N}(\theta \mid \theta_{0}, V_{0}) \mathcal{N}(y \mid X\theta, \sigma^{2}I) = \mathcal{N}(\theta | \theta_{n}, V_{n}) \\
\ \theta_{n} &= V_{n}V_{0}^{-1}\theta_{0} + \frac{1}{\sigma^2}V_{n}X^{T}y \\
\ V_{n}^{-1} &= V_{0}^{-1} + \frac{1}{\sigma^2}X^{T}X 
\end{aligned}
$$


## Slide title Bayesian versus ML plugin prediction, minute 35:00

For the Bayesian, the posterior is the new prior, $\mathcal{N}(\theta \mid \theta_{n}, V_{n})$. 

In the special case where:

$$
\begin{aligned}
\ \theta_0 &= 0 \\
\ V_0 &= \tau^2 I_d \\
\ \text{define: } \lambda &= \frac{\sigma^2}{\tau^2} \\
\ \text{Posterior mean: } \theta_{n} &= (X^{T}X + \lambda I_d)^{-1}X^{T}y \\
\ \text{Posterior variance: } V_{n} &= \sigma^{2}(X^{T}X + \lambda I_d)^{-1} \\
\end{aligned}
$$
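
The special case can be verified numerically by computing the posterior from the general formulas and from the ridge form. This is a sketch on synthetic data; the noise variance, prior variance and true parameters are made-up values.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 3
X = rng.normal(size=(n, d))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.normal(size=n)

sigma2, tau2 = 0.09, 4.0          # assumed noise and prior variances
theta0 = np.zeros(d)              # zero prior mean
V0 = tau2 * np.eye(d)             # isotropic prior covariance

# General posterior: V_n^{-1} = V_0^{-1} + X^T X / sigma^2
V0_inv = np.linalg.inv(V0)
Vn = np.linalg.inv(V0_inv + X.T @ X / sigma2)
theta_n = Vn @ V0_inv @ theta0 + Vn @ X.T @ y / sigma2

# Ridge form with lambda = sigma^2 / tau^2
lam = sigma2 / tau2
ridge_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
theta_ridge = ridge_inv @ X.T @ y
Vn_ridge = sigma2 * ridge_inv

print(np.allclose(theta_n, theta_ridge), np.allclose(Vn, Vn_ridge))
```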

Bayesian linear regression produces a different predictive variance from the Maximum Likelihood method.

The implication is that Bayesian predictions have **lower** variance near inputs where data has been seen, and **higher** variance where no data has been seen. This comes from the data-dependent term $x_{*}^{T}V_{n}x_{*}$ in the Bayesian predictive variance, where $V_{n}$ is essentially the **inverse of $X^{T}X$** (plus the prior term).

Bayesian, given $D = (X, y)$ is the data matrix:
$$
\begin{aligned}
\ P(y|x_{*}, D, \sigma^{2}) &= \mathcal{N}(y \mid x_{*}^{T}\theta_{n}, \sigma^{2} + x_{*}^{T}V_{n}x_{*}) \\
\end{aligned}
$$

Also, the posterior is proper if $n > k$, where $n$ is the number of rows and $k$ the number of features, and $rank(X) == k$. Ref: https://www.youtube.com/watch?v=d1iIUtnDngg, min 4:17

Maximum Likelihood:
$$
\begin{aligned}
\ P(y|x_{*}, D, \sigma^{2}) &= \mathcal{N}(y \mid x_{*}^{T}\theta_{ML}, \sigma^{2}) \\
\end{aligned}
$$


This means that the Bayesian approach treats many values of $\theta$ as possible, each weighted by its posterior probability.

When a new **data point** $x_{*}$ comes in, it is evaluated with all possible $\theta$'s, and the predictions are weighted by those probabilities. Posterior $\propto$ Prior * Likelihood.

For Maximum Likelihood, the posterior is a delta function that places all its mass at $\theta_{ML}$, i.e. there is only one solution.
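
The difference in predictive variance can be seen with a small sketch. It assumes the zero-mean, $\tau^2 I_d$ prior from the special case above, a design matrix with a bias column and one feature drawn from $[-1, 1]$, and made-up values for $\sigma^2$ and $\tau^2$. The variance does not depend on $y$, so no targets are needed here.

```python
import numpy as np

rng = np.random.default_rng(6)
feature = rng.uniform(-1.0, 1.0, size=20)
X = np.column_stack([np.ones_like(feature), feature])   # bias + one feature

sigma2, tau2 = 0.25, 10.0                                # assumed variances
lam = sigma2 / tau2
Vn = sigma2 * np.linalg.inv(X.T @ X + lam * np.eye(2))   # posterior covariance

def bayes_predictive_variance(x_star):
    """sigma^2 + x*^T V_n x*, the Bayesian predictive variance."""
    return sigma2 + x_star @ Vn @ x_star

ml_variance = sigma2                                     # ML plug-in: constant everywhere

near = bayes_predictive_variance(np.array([1.0, 0.0]))   # inside the training range
far = bayes_predictive_variance(np.array([1.0, 10.0]))   # far outside it

print(ml_variance, near, far)
```

The Bayesian variance stays close to $\sigma^2$ inside the range of the training inputs and grows away from it, while the plug-in variance is the same everywhere.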

# Lecture 21

## Sub-differentiation

This is how we take the derivative of the absolute-value function, which is not differentiable at zero. The idea is to consider all possible slopes (derivatives) at that point, and define the solution as a set. Example below. This is also why for **Lasso** some parameters will be exactly **zero**.

$$
\begin{aligned}
\ J(\theta) &= \delta^{2}|\theta| \\
\ \frac{\partial{J}}{\partial{\theta}} &= 
\begin{cases}
-\delta^{2}, \text{if $\theta < 0$} \\
\lbrack -\delta^{2}, \delta^{2} \rbrack, \text{if $\theta = 0$} \\
\delta^{2}, \text{if $\theta > 0$}
\end{cases}
\end{aligned}
$$
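
The sub-differential can be written as an interval, and it explains the exact zeros in a one-dimensional example. The sketch below is an illustration only: the objective $\frac{1}{2}(\theta - z)^2 + \delta^2|\theta|$ and the names `subdifferential_abs` and `soft_threshold` are assumptions introduced here, not taken from the lecture. Zero is optimal whenever $z$ lies inside $[-\delta^2, \delta^2]$, because then $0$ belongs to the sub-differential of the objective at $\theta = 0$.

```python
def subdifferential_abs(theta, delta2):
    """Sub-differential of J(theta) = delta2 * |theta| as an interval (low, high)."""
    if theta < 0:
        return (-delta2, -delta2)
    if theta > 0:
        return (delta2, delta2)
    return (-delta2, delta2)

def soft_threshold(z, delta2):
    """Minimiser of 0.5 * (theta - z)**2 + delta2 * |theta| (illustrative objective)."""
    if z > delta2:
        return z - delta2
    if z < -delta2:
        return z + delta2
    return 0.0

def is_stationary(theta, z, delta2, tol=1e-12):
    """Check that 0 lies in (theta - z) + sub-differential of delta2 * |theta|."""
    low, high = subdifferential_abs(theta, delta2)
    return (theta - z) + low <= tol and (theta - z) + high >= -tol

for z in (-2.0, -0.3, 0.0, 0.3, 2.0):
    theta = soft_threshold(z, 0.5)
    print(z, theta, is_stationary(theta, z, 0.5))
```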

## Coordinate Descent Algorithm for Sparse Prediction

When deriving the derivative of the objective function, for a vector $x$, the notation $x_{i-j}$ refers to all the elements of $x$ except the $j^{th}$ element. Similarly, $x_{j}$ refers to only the $j^{th}$ element.

# Boosting