# 1. Introduction

* **Goal:**

>*What is the probability that player 1 defeats player 2?*

* **Generative Model for Game Outcomes:**

>* Players have **skills** $\rightarrow$ compute the skill difference

>$$s = w_1 - w_2$$

>* **Add noise**

>$$t = s + n \;\;\;\text{where}\;\;\; n \sim \mathcal{N}(0,1)$$

>* **Compute the game outcome**

>$$y=\text{sign}(t)= \Bigg\{ \begin{matrix} +1 \rightarrow \text{Player 1 wins}\\ -1 \rightarrow \text{Player 2 wins}\end{matrix}$$

>* **Probability that player 1 wins**

>$$p(t|w_1,w_2) = \mathcal{N}(t;w_1-w_2,1)$$

>$$p(y=1|w_1,w_2)=p(t>0|w_1,w_2)=\Phi(w_1-w_2)$$

>$$\Phi(x)=\int^x_{-\infty} \mathcal{N}(z;0,1)dz = \int^\infty_0 \mathcal{N}(z;x,1)dz$$

>* **Likelihood**

>$$p(y|w_1,w_2) = \Phi(y(w_1-w_2))$$
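
The generative model above can be checked by simulation. The following is a minimal sketch using only the Python standard library; the skill values `w1 = 1.2` and `w2 = 0.5` and the function names are illustrative assumptions, not part of the notes. It draws many games and compares the empirical win rate of player 1 with $\Phi(w_1-w_2)$.

```python
import random
from statistics import NormalDist


def simulate_outcome(w1, w2, rng):
    """Draw one game outcome y in {+1, -1} from the generative model."""
    s = w1 - w2                  # skill difference
    t = s + rng.gauss(0.0, 1.0)  # add unit-variance Gaussian noise
    return 1 if t > 0 else -1    # y = sign(t)


def empirical_win_rate(w1, w2, n_games, seed=0):
    """Fraction of simulated games won by player 1."""
    rng = random.Random(seed)
    wins = sum(simulate_outcome(w1, w2, rng) == 1 for _ in range(n_games))
    return wins / n_games


w1, w2 = 1.2, 0.5  # hypothetical skills, chosen only for illustration
empirical = empirical_win_rate(w1, w2, 100_000)
analytic = NormalDist().cdf(w1 - w2)  # Phi(w1 - w2)
print(f"empirical {empirical:.3f} vs analytic {analytic:.3f}")
```

The two numbers should agree up to Monte Carlo noise, which shrinks as the number of simulated games grows.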

* **TrueSkill: a probabilistic skill rating system:**

>* **Prior**

>$$p(w_i) = \mathcal{N}(w_i;\mu_i,\sigma^2_i)$$

>* **Likelihood**
>  * $p(s|w_1,w_2)$: delta function
>  * $p(y|t)$: step function

>$$p(y|w_1,w_2)=\iint p(y|t)p(t|s)p(s|w_1,w_2)dsdt$$



>* **Posterior:** no longer Gaussian / does not factorise / looks like a high-dim ball

>\begin{align}
p(w_1,w_2|y) &= \frac{p(w_1)p(w_2)p(y|w_1,w_2)}{\iint p(w_1)p(w_2)p(y|w_1,w_2)dw_1 dw_2} \\
&= \frac{\mathcal{N}(w_1;\mu_1,\sigma_1^2)\mathcal{N}(w_2;\mu_2,\sigma_2^2) \Phi(y(w_1-w_2))}{\iint \mathcal{N}(w_1;\mu_1,\sigma_1^2)\mathcal{N}(w_2;\mu_2,\sigma_2^2)\Phi(y(w_1-w_2)) dw_1dw_2}
\end{align}

>* **Normalising constant:** has a closed form

>$$p(y) = \Phi \left( \frac{y(\mu_1-\mu_2)}{\sqrt{1+\sigma^2_1+\sigma^2_2}} \right) \;\;\;\rightarrow\;\;\; \text{smoother version of the likelihood}$$
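
The closed form for $p(y)$ can be sanity-checked numerically. The sketch below assumes the denominator contains the noise variance plus the sum of both prior variances, $1+\sigma_1^2+\sigma_2^2$, and compares that expression with a brute-force estimate obtained by sampling skills from the priors and playing a game. All parameter values and function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def evidence_closed_form(y, mu1, mu2, var1, var2):
    """p(y) = Phi(y (mu1 - mu2) / sqrt(1 + var1 + var2))."""
    return NormalDist().cdf(y * (mu1 - mu2) / sqrt(1.0 + var1 + var2))


def evidence_monte_carlo(y, mu1, mu2, var1, var2, n_samples, seed=0):
    """Estimate p(y) by sampling skills from the priors and simulating games."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        w1 = rng.gauss(mu1, sqrt(var1))    # w1 ~ N(mu1, var1)
        w2 = rng.gauss(mu2, sqrt(var2))    # w2 ~ N(mu2, var2)
        t = w1 - w2 + rng.gauss(0.0, 1.0)  # noisy skill difference
        outcome = 1 if t > 0 else -1       # y = sign(t)
        hits += outcome == y
    return hits / n_samples


# Hypothetical prior parameters, chosen only for illustration.
mu1, var1, mu2, var2 = 1.0, 2.0, 0.0, 0.5
exact = evidence_closed_form(+1, mu1, mu2, var1, var2)
estimate = evidence_monte_carlo(+1, mu1, mu2, var1, var2, 100_000)
print(f"closed form {exact:.3f} vs Monte Carlo {estimate:.3f}")
```

Because the prior uncertainty widens the effective noise, $p(y)$ is closer to 0.5 than the likelihood evaluated at the prior means, which is the sense in which it is a smoother version of the likelihood.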

# 2. Gibbs Sampling

* **Q. How do we integrate wrt an intractable posterior?**
* **The original integral:**

>$$\mathbb{E}_{p(\mathbf{x})}[\phi(\mathbf{x})] = \bar{\phi} = \int \phi(\mathbf{x}) p(\mathbf{x}) \text{d}\mathbf{x} \;\;\; , \;\;\; \mathbf{x} \in \mathbb{R}^D$$

* **Numerical integration on a grid:** (practical only for roughly $D \leq 4$)

>$$\int{\phi(\mathbf{x})p(\mathbf{x})\text{d}\mathbf{x}} \approx \sum^T_{\tau=1} \phi(\mathbf{x}^{(\tau)}) p(\mathbf{x}^{(\tau)}) \Delta \mathbf{x}$$
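
A one-dimensional instance of the grid sum, as a rough sketch: it approximates $\mathbb{E}[x^2]$ under a standard normal (true value 1) with a midpoint grid. The integration limits, the 100 points per axis used in the cost table, and the function names are assumptions made for illustration.

```python
from math import exp, pi, sqrt


def std_normal_pdf(x):
    """Density of N(0, 1)."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def grid_expectation_1d(phi, pdf, lo, hi, n_points):
    """Approximate the integral of phi(x) pdf(x) dx with a midpoint grid."""
    dx = (hi - lo) / n_points
    total = 0.0
    for k in range(n_points):
        x = lo + (k + 0.5) * dx  # midpoint of grid cell k
        total += phi(x) * pdf(x) * dx
    return total


approx = grid_expectation_1d(lambda x: x * x, std_normal_pdf, -8.0, 8.0, 1000)
print(f"grid estimate of E[x^2]: {approx:.6f}")

# Number of grid points needed at 100 points per axis, for several dimensions.
grid_sizes = {D: 100 ** D for D in (1, 2, 4, 8)}
print(grid_sizes)
```

The grid works well in one dimension, but the number of evaluations grows as $n^D$, which is why the approach becomes impractical beyond a handful of dimensions.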

* **Monte Carlo:**

>$$\mathbb{E}_{p(\mathbf{x})} [\phi(\mathbf{x})] \approx \hat{\phi} = \frac{1}{T} \sum^T_{\tau=1} \phi(\mathbf{x}^{(\tau)}) \;\;\;,\;\;\; \mathbf{x}^{(\tau)} \sim p(\mathbf{x})$$

>* $\hat{\phi}$: unbiased estimate with

>$$\mathbb{V}[\hat{\phi}] = \frac{\mathbb{V}[\phi]}{T} \;\;\;,\;\;\; \mathbb{V}[\phi] = \int \left( \phi(\mathbf{x})-\bar{\phi} \right)^2 p(\mathbf{x}) \text{d} \mathbf{x}$$

>* **NOTE:** the variance is independent of the dimension of $\mathbf{x}$
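
The $1/T$ scaling can be observed empirically. The sketch below is an illustration under assumptions not in the notes: $p(\mathbf{x})$ is a standard normal in $D$ dimensions and $\phi(\mathbf{x}) = x_1^2$, so $\mathbb{V}[\phi] = 2$ whatever $D$ is. It repeats the Monte Carlo estimate many times and measures the spread of $\hat{\phi}$ for two different dimensions.

```python
import random
from statistics import pvariance


def phi(x):
    """Test function: depends on the first coordinate only, so V[phi] = 2."""
    return x[0] ** 2


def mc_estimate(D, T, rng):
    """Average phi over T independent draws from N(0, I_D)."""
    total = 0.0
    for _ in range(T):
        x = [rng.gauss(0.0, 1.0) for _ in range(D)]
        total += phi(x)
    return total / T


def estimator_variance(D, T, n_repeats, seed=0):
    """Empirical variance of the estimator over repeated independent runs."""
    rng = random.Random(seed)
    estimates = [mc_estimate(D, T, rng) for _ in range(n_repeats)]
    return pvariance(estimates)


T = 100
predicted = 2.0 / T  # V[phi] / T
var_low_dim = estimator_variance(D=1, T=T, n_repeats=1000)
var_high_dim = estimator_variance(D=10, T=T, n_repeats=1000)
print(f"predicted {predicted:.4f}, D=1: {var_low_dim:.4f}, D=10: {var_high_dim:.4f}")
```

Both measured variances should sit near $\mathbb{V}[\phi]/T$. The dimension enters only through $\mathbb{V}[\phi]$ itself, not through the estimator.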

* **Markov Chain Monte Carlo:**

>$$\mathbf{x} \rightarrow \mathbf{x}' \rightarrow \mathbf{x}'' \rightarrow \mathbf{x}''' \rightarrow \cdots$$

>$$x'_i \sim p(x_i|x_1,...,x_{i-1},x_{i+1},...,x_D)$$

>* This will eventually generate dependent samples from $p(\mathbf{x})$

* **Gibbs Sampling**

>* **Advantage:** Sampling from a joint distribution $\rightarrow$ sampling from a sequence of univariate conditional distributions

>* **Disadvantage:** Initial convergence / Dependence between samples
>* **Improvements:** 
>  * Initial samples are discarded
>  * Samples are thinned (e.g. every 10th or 100th)
>  * Change the *effective correlation length*
>  * Run several Gibbs samplers (different starting points) to compare results
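
One way to apply these ideas to the two-player model of Section 1 is sketched below. It is an assumption-laden illustration, not a scheme taken from the notes: the noisy difference $t$ is treated as an extra variable, and the sampler cycles through the univariate conditionals of $t$, $w_1$ and $w_2$. Given $w_1, w_2, y$, the variable $t$ is a unit-variance Gaussian centred on $w_1-w_2$ and truncated to the side matching $y$ (drawn here by simple rejection). Given $t$ and the other skill, each $w_i$ is Gaussian, obtained by multiplying its prior with $\mathcal{N}(t; w_1-w_2, 1)$. The prior parameters, burn-in length, and thinning interval are hypothetical.

```python
import random
from math import sqrt


def sample_truncated_t(mean, y, rng):
    """Draw t ~ N(mean, 1) restricted to sign(t) == y, by rejection.

    Simple but slow when the observed outcome is very unlikely under `mean`.
    """
    while True:
        t = rng.gauss(mean, 1.0)
        if y * t > 0:
            return t


def gibbs_two_player(y, mu1, var1, mu2, var2, n_iter, burn_in, thin, seed=0):
    """Gibbs sampler over (t, w1, w2) for one observed game outcome y."""
    rng = random.Random(seed)
    w1, w2 = mu1, mu2  # start at the prior means
    samples = []
    for i in range(n_iter):
        # t | w1, w2, y: truncated Gaussian around the skill difference
        t = sample_truncated_t(w1 - w2, y, rng)
        # w1 | w2, t: prior N(mu1, var1) times N(w1; t + w2, 1)
        prec1 = 1.0 / var1 + 1.0
        w1 = rng.gauss((mu1 / var1 + t + w2) / prec1, sqrt(1.0 / prec1))
        # w2 | w1, t: prior N(mu2, var2) times N(w2; w1 - t, 1)
        prec2 = 1.0 / var2 + 1.0
        w2 = rng.gauss((mu2 / var2 + w1 - t) / prec2, sqrt(1.0 / prec2))
        # discard the initial samples, then keep every `thin`-th one
        if i >= burn_in and (i - burn_in) % thin == 0:
            samples.append((w1, w2))
    return samples


samples = gibbs_two_player(y=1, mu1=0.0, var1=1.0, mu2=0.0, var2=1.0,
                           n_iter=60_000, burn_in=1_000, thin=5)
post_mean_w1 = sum(w1 for w1, _ in samples) / len(samples)
post_mean_w2 = sum(w2 for _, w2 in samples) / len(samples)
print(f"E[w1|y=1] ~ {post_mean_w1:.2f}, E[w2|y=1] ~ {post_mean_w2:.2f}")
```

With identical priors and a win for player 1, the posterior mean of $w_1$ should move above its prior mean and that of $w_2$ below it by the same amount. Running the sampler with several seeds or starting points is a cheap way to compare results, as suggested in the list above.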