In [7]:
import numpy as np
import matplotlib.pyplot as plt

from IPython.display import Video

from scipy.stats import norm, gamma, cauchy, multivariate_normal, chi2
from scipy.special import ndtr, comb

from utils import color_cycle

# Lecture 1: Sampling

  * The problem with uniform sampling
  * Importance sampling
  * Rejection sampling

MacKay, **Chapter 29**.

_Recommended readings_:

* **Chapter 27** of the book [Bayesian Reasoning and Machine Learning](http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage) by David Barber
* **Chapter 4** of MacKay for more on the typical set and source coding

# The problem

In many machine learning problems one needs to estimate high dimensional integrals:

1.  Estimate expected parameter values in the posterior of a Bayesian learning problem:
    $$p(\theta|D)=\frac{p(D|\theta) p(\theta)}{p(D)}\qquad \mathbb{E}\theta =\int d\theta \theta p(\theta|D)$$

2.  Estimate the *evidence* for model selection
    $$p(D|\mathcal{H}_i)=\int d\theta p(D|\theta,\mathcal{H}_i) p(\theta|\mathcal{H}_i)$$

3.  Computing statistics in graphical models $p(x_1,\ldots,x_n)=\prod_{i=1}^n p_i(x_i|x_{\text{Pa}(i)})$
    $$p(x_1,x_2)=\sum_{x_3,\ldots,x_n} p(x_1,\ldots,x_n)$$

Denote $p(x)=\frac{p^*(x)}{Z}$ the probability distribution of interest with $Z=\int dx p^*(x)$.
We can assume that $p^*(x)$ is easily evaluated for any $x$, but $Z$ is hard to evaluate.

# The problem with uniform sampling

Suppose we want to estimate the expectation $\Phi=\sum_x p(x)\phi(x)$ of some function $\phi$, and we're only allowed to produce $N$ points $x^r$ in the support of $p^*$. The best way to choose them is to draw them with probability $p(x)=p^*(x)/Z$:
$$\hat{\Phi}=\frac{1}{N}\sum_{r=1}^N \phi(x^r)\qquad x^r \sim p(x)$$
$\hat{\Phi}$ is a random variable (each time we generate $N$ samples we get a different outcome). If we draw the $x^r$ independently we have

$$\mathbb{E} \hat{\Phi}=\frac{1}{N}\sum_{r=1}^N\sum_{x^r} p(x^r)\phi(x^r)=\Phi$$

The estimator $\hat{\Phi}$ is **unbiased**.

The variance of $\hat{\Phi}$ decreases with the number of samples $N$. Indeed, if $X=\sum_i X_i$ is a sum of independent random variables, the variance of $X$ is the sum of the variances of the $X_i$. Therefore:

$$\mathrm{Var} \left[\hat{\Phi}\right] = N \mathrm{Var} \left[\frac{\phi}{N}\right] =\frac{1}{N}\mathrm{Var} \left[\phi\right] \qquad \mathrm{Var} \left[\phi\right] = \sum_x p(x) (\phi(x) - \Phi)^2$$

Thus,
$$\hat{\Phi}=\Phi + \mathcal{O}\left(\frac{1}{\sqrt{N}}\right)$$

The accuracy does not depend on the dimension of the problem. But sampling from $p$ is very hard.
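The $1/\sqrt{N}$ scaling can be checked numerically. The following is a minimal sketch under illustrative assumptions that are not part of the lecture: $p$ is a standard Gaussian (which we *can* sample directly), $\phi(x)=x^2$ so that $\Phi=1$ and $\mathrm{Var}[\phi]=2$, and the helper name `estimator_std` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def phi(x):
    # illustrative test function: phi(x) = x**2, so Phi = 1 for p = N(0, 1)
    return x**2

def estimator_std(n_samples, n_repeats=1000):
    # repeat the N-sample estimate n_repeats times and measure its spread
    x = rng.standard_normal((n_repeats, n_samples))
    phi_hat = phi(x).mean(axis=1)
    return phi_hat.std()

std_small = estimator_std(100)    # theory: sqrt(2 / 100)
std_large = estimator_std(2500)   # theory: sqrt(2 / 2500)

print("std of estimator, N=100: ", std_small)
print("std of estimator, N=2500:", std_large)
print("ratio (theory: sqrt(2500 / 100) = 5):", std_small / std_large)
```

Increasing $N$ by a factor of 25 should shrink the spread of $\hat{\Phi}$ by roughly a factor of 5, independently of the dimension of $x$.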

## Entropy and the typical set

#### A simple experiment with $N$ coins

In [8]:
# toss N coins with increasing N
num_n = 20
num_samples = 100

ns = np.logspace(1, 6, num_n, dtype=int)
num_heads = np.zeros((num_n, num_samples))

for i, n in enumerate(ns):
    draws = 1 * (np.random.rand(num_samples, n) <= 0.5)
    num_heads[i] = draws.sum(1) / n

The following plot shows a well-known result: the fraction of heads concentrates around $1/2$ as $N$ grows.

In [None]:
plt.errorbar(ns, y=num_heads.mean(1), yerr=num_heads.std(1));
plt.xlabel('N')
plt.ylabel('fraction of heads');

Let's try to estimate averages under a non-uniform distribution using samples drawn from a uniform proposal. The simplest non-uniform distribution for discrete variables is a set of independent Bernoulli events with $p\neq0.5$. Let us take $p=0.2$:

$$p\left(x\right)=\prod_{i}p^{x_{i}}\left(1-p\right)^{1-x_{i}}=e^{\sum_{i}\left(x_{i}\log p+\left(1-x_{i}\right)\log\left(1-p\right)\right)}$$

and use

$$
\hat{\Phi}=\frac{\sum_{r=1}^{N}\phi\left(x^{r}\right)p\left(x^{r}\right)}{\sum_{r=1}^{N}p\left(x^{r}\right)}
$$

In [None]:
# toss n coins with increasing n
p = 0.2
num_samples = 10000

ns = [5, 10, 20, 50, 100, 500, 1000]
num_n = len(ns)
num_heads_av = np.zeros(num_n)

for i, n in enumerate(ns):
    draws = 1 * (np.random.rand(num_samples, n) <= 0.5)
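    logpx = (draws * np.log2(p) + (1-draws) * np.log2(1-p)).sum(1)
    # subtract the max before exponentiating: for large n, 2**logpx underflows
    # to 0 for every sample (giving 0/0); the common factor cancels in the ratio
    px = 2**(logpx - logpx.max())
    num_heads_av[i] = (draws.sum(1) / n * px).sum() / px.sum()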

plt.plot(ns, num_heads_av, '.-');
plt.xscale('log');

Our interpretation is the following: of all the $2^N$ configurations of outcomes, those with about $N/2$ heads dominate (in fact they **exponentially dominate**) the count under a uniform distribution, and increasingly so as $N$ grows. Uniform samples therefore almost never land where the biased distribution puts its mass, i.e. around $Np$ heads. Indeed, the number of heads has a binomial distribution with mean $Np$ and variance $Np(1-p)$: the ratio of the standard deviation to the mean, $\sqrt{(1-p)/(Np)}$, shrinks like $1/\sqrt{N}$, which is the signature of concentration.
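The exponential dominance can be made concrete by exact counting. The sketch below (the function name `fraction_outside_window` and the 10% window are illustrative choices, not part of the lecture) computes the fraction of the $2^N$ configurations whose fraction of heads differs from $1/2$ by more than the window.

```python
from math import comb

def fraction_outside_window(n, width_percent=10):
    # fraction of the 2**n configurations with |k/n - 1/2| > width_percent/100,
    # where k is the number of heads; integer arithmetic keeps the test exact
    outside = sum(
        comb(n, k)
        for k in range(n + 1)
        if abs(100 * k - 50 * n) > width_percent * n
    )
    return outside / 2**n

for n in [10, 100, 1000]:
    print(n, fraction_outside_window(n))
```

The printed fraction drops rapidly with $n$: under uniform sampling, configurations far from $N/2$ heads (such as those with $0.2\,N$ heads) become exponentially unlikely to be drawn.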

#### A slightly more involved example using the Ising model

Consider the Ising model
$$p(x)=\frac{\exp\left(\beta \sum_{i>j} x_i x_j w_{ij}\right)}{Z}=\frac{p^*(x)}{Z}\qquad Z=\sum_xp^*(x)$$
with $x=(x_1,\ldots,x_n)$ and $x_i=\pm 1, i=1,\ldots,n$. Total number of states is $2^n$.

#### Visualize samples from a 32x32 2d Ising model at two different temperatures:

In [None]:
# example of an Ising model at T = 10
Video("ising_size32_32_T10.mp4")

In [None]:
# example of an Ising model at the critical temperature
Video("ising_size32_32_T2.269.mp4")

#### Sampling and entropy

Draw a long sequence of $N$ independent samples from a distribution:
$$x^1,x^2, \ldots, x^N$$
The probability of the sequence is
$$p(x^1,x^2,\ldots, x^N)=p(x^1)p(x^2)\ldots p(x^N)= \prod_x p(x)^{N(x)}$$
with $N(x)\approx p(x) N$ the number of times the value $x$ occurs in the sequence. Thus we have:
$$p(x^1,x^2,\ldots, x^N)\approx \prod_x p(x)^{p(x)N} =\left(2^{- H}\right)^N$$
with $H=-\sum_x p(x) \log_2 p(x)$ the entropy of the distribution. Equivalently
$$\prod_{r}p\left(x^{r}\right)=2^{N\frac{\sum_{r}\log_2 p\left(x^{r}\right)}{N}}\approx2^{-NH}$$

Thus, in the large $N$ limit all **typical** strings have the same probability $2^{-NH}$, while the **non-typical** strings together carry a vanishing share of the probability.

The formula suggests that *on average* for a single sample, there are typical samples with probability $2^{-H}$ and non-typical samples with probability zero.

Denote $T$ the set of typical samples. The number of typical samples is thus $|T|\approx 2^{H}$.

The **typical set** should be compared to the total number of states $2^n$. The volume fraction is
$$2^{H-n}$$
If we sample states $x$ uniformly at random, the probability to hit an element of the typical set is $2^{H-n}$.
Thus, one needs on the order of $N_\text{min}=2^{n-H}$ samples to hit the typical set once, and therefore the number of samples must satisfy
$$N\gg N_\text{min}=2^{n-H}$$

For $n$ binary spins $0\le H \le n$.
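To get a feeling for the numbers, here is a small sketch that evaluates $N_\text{min}=2^{n-H}$ for $n$ independent biased coins, for which $H$ is $n$ times the single-coin entropy. The values of `n` and `p` are illustrative, and `binary_entropy` is a helper introduced here.

```python
from math import log2

def binary_entropy(p):
    # entropy in bits of a single coin with P(heads) = p
    return -(p * log2(p) + (1 - p) * log2(1 - p))

n = 100
for p in [0.5, 0.3, 0.1, 0.01]:
    H = n * binary_entropy(p)   # entropy of n independent coins
    log2_n_min = n - H          # log2 of N_min = 2**(n - H)
    print(f"p={p}: H={H:.1f} bits, N_min ~ 2^{log2_n_min:.1f}")
```

Only the fair coin ($p=0.5$, $H=n$) gives $N_\text{min}\approx 1$; already for $p=0.1$ and $n=100$ the exponent $n-H$ is above 50.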

#### Typical set in coin flipping

Let us look at a histogram of $-\log(p(x))$ with $x$ a sequence of $N$ independent tosses of a biased coin with $p=0.1$:

In [None]:
N = 100
num_samples = 30

p = 0.1
draws = 1 * (np.random.rand(num_samples, N) <= p)
logpx = (draws * np.log2(p) + (1-draws) * np.log2(1-p)).sum(1)

H = -N * (p * np.log2(p) + (1-p) * np.log2(1-p))

logp_all1 = N * np.log2(p)
logp_all0 = N * np.log2(1-p)

plt.figure(figsize=(12,4))

plt.subplot(121)
plt.imshow(draws, cmap='gray');

plt.subplot(122)
plt.hist(-logpx);
plt.vlines(x=H, ymin=0, ymax=10, ls='--', color='black', label='H');
plt.vlines(x=-logp_all1, ymin=0, ymax=10, ls='--', color='red', label='-logp all 1');
plt.vlines(x=-logp_all0, ymin=0, ymax=10, ls='--', color='green', label='-logp all 0');
plt.xscale('log')
plt.legend();

plt.tight_layout();

#### Typical set in the Ising model

In [None]:
# Load entropy of an 8x8 spin 2d Ising model with couplings all equal to 1

load = np.loadtxt("T_H_8_8_ising.txt")
Ts, Hs = load[0], load[1]
idx_crit = np.where((np.abs(Ts-2.27)<1e-2))[0][0]

plt.plot(Ts, Hs, ls='-');
plt.hlines(y=np.log(2), xmin=Ts[0], xmax=Ts[-1], ls=':', color="gray");
plt.vlines(x=Ts[idx_crit], ymin=0, ymax=Hs[idx_crit], ls='--', color='black')
plt.xlabel('T')
plt.ylabel('H');

print("H at T critical:", Hs[idx_crit])
print("H at T = inf:", np.log(2))

For (very) high temperature (low $\beta$), $H \approx n$ and $N_\text{min}\approx 1$, so uniform sampling is feasible.

Around the critical temperature, $H\approx n/2$ and $N_\text{min}\approx 2^{n/2}$. For $n=1000$ spins this is of order $10^{150}$.

#### Conclusion: Uniform sampling only works for uniform distributions.

# Importance sampling

Let us write
$$\Phi=\int dxp\left(x\right)\phi\left(x\right)=\int dxq\left(x\right)\frac{p\left(x\right)}{q\left(x\right)}\phi\left(x\right)$$

and sample from the distribution $q(x)$ that is
* better than uniform;
* easy to sample from.

For instance, a (spherical) Gaussian:
$$Q^*(x)\propto \exp\left(-\frac{1}{2}\sum_i x_i^2\right)$$

Consider a simple 1-d sampling problem. Given $p(x)$, compute
$$\Phi=\mathrm{Prob}(x<0)=\int_{-\infty}^\infty \phi(x) p(x)dx$$
with $\phi(x)=1$ if $x<0$ and zero otherwise.

**Naive method**: generate $N$ samples $X_i\sim p$
$$\hat{\Phi}=\frac{1}{N}\sum_{i=1}^N \phi(X_i)$$

**Importance sampling**: consider another distribution $q(x)$. Then
$$\Phi=\mathrm{Prob}(x<0)=\int_{-\infty}^\infty \phi(x) \frac{p(x)}{q(x)}q(x) dx$$

Generate $N$ samples $X_i\sim q$
$$\hat{\Phi}=\frac{1}{N}\sum_{i=1}^N \phi(X_i)\frac{p(X_i)}{q(X_i)}$$

#### The best importance sampler

Any importance sampler is unbiased by construction, indeed:

$$
\mathbb{E}\hat{\Phi}=\mathbb{E}\phi\left(x\right)\frac{p\left(x\right)}{q\left(x\right)}=\sum_{x}\frac{q\left(x\right)}{q\left(x\right)}p\left(x\right)\phi\left(x\right)=\Phi
$$

As for the variance, we have

$$\text{Var}\hat{\Phi}=\frac{1}{N}\text{Var}\left(\phi\left(x\right)\frac{p\left(x\right)}{q\left(x\right)}\right)$$

For non-negative $\phi$, the minimal variance is attained by making $\phi\left(x\right)\frac{p\left(x\right)}{q\left(x\right)}$ constant, i.e. by taking $q\left(x\right)\propto\phi\left(x\right)p\left(x\right)$. In such a case, a single sample would suffice. We're left with the problem of choosing the proportionality constant in $q\left(x\right)=\frac{\phi\left(x\right)p\left(x\right)}{Z_{q}}$, i.e. normalizing $q$, by imposing

$$\sum_{x}\phi\left(x\right)p\left(x\right)=Z_{q}$$

from which we see that this approach is hopeless: $Z_q$ is exactly the quantity $\Phi$ we set out to compute. Nonetheless, we can use this idea to construct $q$ distributions that have high probability mass in those regions where $\phi$ is nonzero, as in the example below.

#### A simple example of why you would want to use importance sampling

Let's compute the probability of a normal random variable $X$ to be greater than a value `thresh` using both the original density $p$ and a broader candidate $q$.

In [None]:
thresh = 4
num_sample = 10000
num_trials = 100
pthres_1 = np.zeros(num_trials)
pthres_4 = np.zeros(num_trials)

for trial in range(num_trials):
    normal_1 = multivariate_normal(0, 1)
    normal_4 = multivariate_normal(0, 4) # try also a shifted Gaussian with (1,4) options
    sample_1 = normal_1.rvs(num_sample)
    sample_4 = normal_4.rvs(num_sample)
    pthres_1[trial] = (sample_1 > thresh).mean()
    pthres_4[trial] = (normal_1.pdf(sample_4) / normal_4.pdf(sample_4) * (sample_4 > thresh)).mean()

# visualize a single sample
xs = np.linspace(-5, 5, 100)
plt.hist(sample_1, bins=50, histtype="step", density=True, label='var=1');
plt.hist(sample_4, bins=50, histtype="step", density=True, label='var=4');
plt.plot(xs, normal_1.pdf(xs), c=color_cycle[0])
plt.plot(xs, normal_4.pdf(xs), c=color_cycle[1]);
plt.legend();

In [None]:
plt.hist(pthres_1, label=f'p(X>{thresh})');
plt.hist(pthres_4, label=f'p(X>{thresh}) Importance Sampling');
plt.xscale('log')
plt.vlines(x=(1-ndtr(thresh)), ymin=0, ymax=100, ls=':', color='black', label='true value');
plt.legend();

#### The problem with rare events

Rare events give large contributions. It is thus best to use a wide distribution $q$.

#### Importance Sampling with non-normalized densities

The importance weights $p(x)/q(x)$ assume that we can evaluate $p(x)$. However, most often $p(x)$ can only be easily computed up to a constant:
$$p(x)=\frac{p^*(x)}{Z} \qquad Z=\int dx p^*(x)$$

We can thus estimate both numerator and denominator by sampling:
$$\Phi=\int dx p(x) \phi(x) =\frac{\int dx p^*(x) \phi(x)}{\int dx p^*(x)}=\frac{\int dx q(x)\frac{p^*(x)}{q(x)} \phi(x)}{\int dx q(x) \frac{p^*(x)}{q(x)}}$$

Sample $\{x^r\}$ from $q(x)$ and compute
$$w_r=\frac{p^*(x^r)}{q(x^r)}\qquad \hat{\Phi}=\frac{\sum_r w_r \phi(x^r)}{\sum_r w_r}$$

The estimate is biased:
$$\mathbb{E} \hat{\Phi}=\mathbb{E}\left(\frac{ \sum_{r=1}^N w_r \phi(x^r)}{ \sum_{r=1}^N w_r}\right) \ne\frac{\mathbb{E} \sum_{r=1}^N w_r \phi(x^r)}{\mathbb{E} \sum_{r=1}^N w_r}=\frac{N \int dx p^*(x) \phi(x)}{N\int dx p^*(x)}=\Phi$$

However, for large $N$:
$$\sum_{r=1}^N w_r= N \mathbb{E} w_r + \mathcal{O}\left(\sqrt{N}\right)\approx N \mathbb{E} w_r$$
$$\mathbb{E} \hat{\Phi}\approx \mathbb{E}\left(\frac{ \sum_{r=1}^N w_r \phi(x^r)}{N \mathbb{E} w_r}\right)=\frac{\mathbb{E} \sum_{r=1}^N w_r \phi(x^r)}{N \int dx p^*(x)}=\frac{N \int dx p^*(x) \phi(x)}{N\int dx p^*(x)}=\Phi$$
The estimator $\hat{\Phi}$ is asymptotically unbiased.
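The self-normalized estimator can be sketched in a few lines. The setup below is an illustrative assumption, not taken from the lecture: the unnormalized target is $p^*(x)=e^{-x^2/2}$ (so $Z=\sqrt{2\pi}$), the proposal $q$ is a Gaussian with standard deviation `sigma_q = 2`, and $\phi(x)=x^2$, for which the exact answer is $\Phi=1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # unnormalized target: standard Gaussian without its 1/sqrt(2 pi) factor
    return np.exp(-0.5 * x**2)

sigma_q = 2.0  # assumed proposal width, wider than the target

def q_pdf(x):
    # normalized Gaussian proposal N(0, sigma_q^2)
    return np.exp(-0.5 * (x / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))

def phi(x):
    return x**2  # E_p[x^2] = 1 for the standard Gaussian

num_samples = 100_000
x = rng.normal(0.0, sigma_q, size=num_samples)  # x^r ~ q
w = p_star(x) / q_pdf(x)                        # w_r = p*(x^r) / q(x^r)
phi_hat = (w * phi(x)).sum() / w.sum()          # self-normalized estimate
Z_hat = w.mean()                                # the denominator estimates Z

print("Phi_hat:", phi_hat, "(exact: 1)")
print("Z_hat:  ", Z_hat, "(exact:", np.sqrt(2 * np.pi), ")")
```

Note that the average weight is itself an estimate of $Z$, so the normalization constant comes out as a by-product.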

# Rejection sampling

Choose $c$ such that $c Q^*(x) > P^*(x)$ for all $x$:

  * generate $x$ from $Q^*(x)$
  * generate $u$ uniform from $[0,c Q^*(x)]$
  * if $u > P^*(x)$ reject $x$, otherwise accept $x$

Probability of a sample $x$ is $Q^*(x) \frac{P^*(x)}{cQ^*(x)}\propto P^*(x)$.
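The three steps above can be sketched as follows. The choice of densities is an illustrative assumption: the target is an unnormalized standard Gaussian $P^*(x)=e^{-x^2/2}$ and the proposal is a standard Cauchy, $Q^*(x)=1/(1+x^2)$. For this pair the ratio $P^*/Q^*$ is maximal at $x^2=1$, so $c=2e^{-1/2}$ is the smallest valid constant (the bound is then attained with equality at $x=\pm1$, which does not affect correctness).

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # unnormalized target: standard Gaussian
    return np.exp(-0.5 * x**2)

def q_star(x):
    # unnormalized proposal: standard Cauchy
    return 1.0 / (1.0 + x**2)

# p*/q* = exp(-x^2/2) * (1 + x^2) is maximal at x^2 = 1, where it equals 2 exp(-1/2)
c = 2.0 * np.exp(-0.5)

num_proposals = 100_000
x = rng.standard_cauchy(num_proposals)                      # x ~ Q
u = rng.uniform(0.0, 1.0, num_proposals) * c * q_star(x)    # u ~ U[0, c Q*(x)]
accept = u <= p_star(x)                                     # reject if u > P*(x)
samples = x[accept]

print("acceptance rate:", accept.mean())
print("sample mean and variance:", samples.mean(), samples.var())
```

The accepted samples should have mean close to 0 and variance close to 1, and the acceptance rate should be close to $Z_P/(c\,Z_Q)=\sqrt{2\pi}/(c\,\pi)\approx 0.66$.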

#### An example of rejection sampling: Gamma distribution

Warning: in the following I used `k` for $c$ just for the sake of confusing readers.

In [None]:
a = 2
b = 0.4
d = 40 # maximal element for sampling

xs = np.arange(0, d, 0.1)
plt.plot(xs, gamma.pdf(xs, a, scale=1/b), label="Gamma");
plt.vlines(x=a/b, ymin=0, ymax=gamma.pdf(a/b, a, scale=1/b), color='gray', ls=':', label="mean");
plt.vlines(x=(a-1)/b, ymin=0, ymax=gamma.pdf((a-1)/b, a, scale=1/b), color='gray', ls='--', label="mode");

# using plain Cauchy
C = (a - 1)/b
k = 2 # the choice of k here is not easy!
B = k/(np.pi * gamma.pdf((a-1)/b, a, scale=1/b))
plt.plot(xs, k/(1 + (xs - C)**2/B**2)/(np.pi*B), label="Cauchy");
# plt.plot(xs, k*cauchy.pdf(xs, loc=C, scale=B)) # just a check!

# using renormalized Cauchy
C = (a - 1)/b
B = 4.5
Z_renorm = B * (np.arctan((d-C)/B)-np.arctan(-C/B))
k_renorm = Z_renorm * gamma.pdf((a-1)/b, a, scale=1/b)
plt.plot(xs, k_renorm/(1 + (xs - C)**2/B**2)/Z_renorm, label="Cauchy renorm");
# plt.plot(xs, k_renorm * np.pi * B * cauchy.pdf(xs, loc=C, scale=B) / Z_renorm) # just a check!

plt.legend();

In [None]:
## using a renormalized Cauchy
num_samples = 5000
t_left = np.arctan(-C/B)
t_right = np.arctan((d-C)/B)
z = C + B * np.tan(np.random.rand(num_samples) * (t_right-t_left) + t_left)

pz = gamma.pdf(z, a, scale=1/b)
qz = 1/(1 + (z - C)**2/B**2)/Z_renorm
# sample homogeneously from 0 to k q(z)
u = k_renorm * qz * np.random.rand(num_samples)
accept = u <= pz
x = z[accept]
num_accept = len(x)

plt.figure(figsize=(10,4))
plt.subplot(121)
plt.hist(x, bins=50, density=True, histtype="step", color=color_cycle[0]);
plt.plot(xs, gamma.pdf(xs, a, scale=1/b), label="p", color=color_cycle[0]);

# note that the density of total samples is q(x)
plt.hist(z, bins=50, density=True, histtype="step", color=color_cycle[1]);
plt.plot(xs, 1/(1 + (xs - C)**2/B**2)/Z_renorm, '--', label="q", color=color_cycle[1]);
plt.legend();

# k q(x) can be visualized in a scatter plot
plt.subplot(122)
plt.scatter(z[accept], u[accept], marker='.', s=2, color=color_cycle[0]);
plt.scatter(z[~accept], u[~accept], marker='.', s=2, color="red");
plt.plot(xs, gamma.pdf(xs, a, scale=1/b), label="p");
plt.plot(xs, k_renorm/(1 + (xs - C)**2/B**2)/Z_renorm, '--', label="k q");
plt.legend();
plt.tight_layout();

# Curse of dimensionality

#### Importance sampling in high dimensions

Importance sampling breaks down in high dimension because of the large variance of the sample weights $w(x)=\frac{p(x)}{q(x)}$.


The variance in $w$ depends on the difference between the distributions $q$ and $p$ [when $p=q$ one obviously has $w(x)=1$, i.e. weights have zero variance].

In high dimension, $\hat{\Phi}$ will be dominated by large contributions $w(x)\phi(x)$ induced by small $q(x)$: such contributions have a low probability of occurring in the sample, and thus fluctuate wildly across realizations of the importance sampler.

#### Geometry of spheres and Gaussians in high dimension

In [None]:
sigma = 2
num_sample = 10000

Ns = np.array([10, 20, 50, 100, 200, 300])

sqmean, sqvar = np.zeros(len(Ns)), np.zeros(len(Ns))

plt.figure(figsize=(4, 2 * len(Ns)))
count_fig = 0
for i, N in enumerate(Ns):
    sample = np.random.randn(num_sample, N) * sigma
    sqsample = (sample**2).sum(1)
    sqmean[i] = sqsample.mean()
    sqvar[i] = sqsample.var()

    count_fig += 1
    plt.subplot(len(Ns), 1, count_fig)
    plt.hist(sqsample, bins=50, density=True, alpha=0.5);
    mean = N * sigma**2
    var = 2 * N * sigma**4
    normal = multivariate_normal(mean, var)
    chisq = chi2(N, scale=sigma**2)
    xs = np.linspace(mean - 3 * np.sqrt(var), mean + 3 * np.sqrt(var))
    plt.plot(xs, chisq.pdf(xs), c=color_cycle[0], label='$χ^2$');
    plt.plot(xs, normal.pdf(xs), c=color_cycle[1], label='normal');
    plt.legend();
    
plt.tight_layout();

The previous plots show that in high dimensions the mass of a multidimensional Gaussian is concentrated in a shell of radius $\sqrt{N} \sigma$ [note that such a region is pretty far away from the mode of the distribution]. Indeed, the quantity $\rho^2(x)=\sum_i x_i^2$, with $\rho(x)$ the distance from the origin, has an approximately Gaussian distribution with mean $N\sigma^2$ and standard deviation $\sigma^2\sqrt{2N}$:

$$\rho^{2}\sim N\sigma^{2}\pm\sigma^{2}\sqrt{2N}$$

Visualize such relations in the following plot:

In [None]:
plt.plot(Ns, sqmean, '.', c=color_cycle[0]);
plt.plot(Ns, Ns * sigma**2, c=color_cycle[0], alpha=0.5, label="av $ρ^2$");
plt.plot(Ns, sqvar, '.', c=color_cycle[1]);
plt.plot(Ns, Ns * 2 * sigma**4, '-', c=color_cycle[1], alpha=0.5, label="var $ρ^2$")
plt.xscale('log')
plt.yscale('log')
plt.xlabel('N')
plt.ylabel('av $ρ^2$, var $ρ^2$');
plt.legend();

This concentration property extends to a sphere in $N$ dimensions. Since the sphere volume scales with the radius $r$ as $r^N$, the fraction of the volume lying in a thin shell of thickness $\epsilon$ just below the surface is

$$\frac{r^{N}-\left(r-\epsilon\right)^{N}}{r^{N}}=1-\left(1-\frac{\epsilon}{r}\right)^{N}$$

which tends to 1 as $N\to\infty$ for any fixed $\epsilon>0$: essentially all the volume is concentrated in the shell.
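A quick numerical evaluation of the shell formula makes the point; the relative thickness $\epsilon/r=0.01$ is an arbitrary illustrative choice and `shell_fraction` is a helper introduced here.

```python
def shell_fraction(n_dim, rel_thickness):
    # fraction of the volume of an n_dim-dimensional sphere lying within a
    # relative distance rel_thickness = epsilon / r of its surface
    return 1.0 - (1.0 - rel_thickness)**n_dim

for n_dim in [1, 10, 100, 1000]:
    print(n_dim, shell_fraction(n_dim, 0.01))
```

A shell that holds 1% of the volume in one dimension holds more than half of it in 100 dimensions and practically all of it in 1000 dimensions.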

#### Importance sampling in high dimensions (reprise)

Suppose now we want to sample from a uniform distribution $p(x)=p^*(x)/Z_P$ inside an $N$-dimensional sphere of radius $R_P$, using a factorized Gaussian with standard deviation $\sigma$ for $q$. We have:

$$p^{*}\left(x\right)=\theta\left(R_{P}-\rho\left(x\right)\right)$$

with $\theta$ the Heaviside function.

Suppose we are able to choose $\sigma$ in such a way that the shell lies inside the $R_P$ sphere. Since $\rho^2\approx N\sigma^2\pm\sigma^2\sqrt{2N}$, samples from $q$ will have an approximate probability density

$$\frac{1}{\left(2\pi\sigma^{2}\right)^{\frac{N}{2}}}\exp\left(-\frac{N}{2}\pm\sqrt{\frac{N}{2}}\right)$$

and thus a typical weight $w(x)=p^*(x)/q(x)$ equal to

$$\left(2\pi\sigma^{2}\right)^{\frac{N}{2}}\exp\left(\frac{N}{2}\pm\sqrt{\frac{N}{2}}\right)$$

We see that their fluctuations scale like $\exp(\sqrt{N})$.

In summary: even if we are able to make sure that $q$ generates samples from the typical set of $p$, the resulting weight will have a very large variance.
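The $\sqrt{N}$ growth of the log-weight fluctuations can be checked directly. This is a sketch under the same assumption as the text, namely that $\sigma$ is small enough for every sample to fall inside the sphere, so that $p^*(x)=1$ and $\log w(x)=-\log q(x)$; the values of `sigma`, `num_samples` and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
num_samples = 5000

def log_weight_std(n_dim):
    # x ~ q = N(0, sigma^2 I); assuming all samples lie inside the sphere,
    # p*(x) = 1 and log w = -log q(x)
    x = rng.standard_normal((num_samples, n_dim)) * sigma
    rho2 = (x**2).sum(axis=1)
    log_w = 0.5 * n_dim * np.log(2 * np.pi * sigma**2) + rho2 / (2 * sigma**2)
    return log_w.std()

stds = {n_dim: log_weight_std(n_dim) for n_dim in [10, 100, 400]}
for n_dim, s in stds.items():
    print(n_dim, s, np.sqrt(n_dim / 2))  # empirical std vs sqrt(N / 2)
```

The standard deviation of $\log w$ tracks $\sqrt{N/2}$, so the weights themselves vary by factors of order $e^{\sqrt{N/2}}$ from sample to sample.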

#### Rejection sampling in high dimensions

Rejection sampling is inefficient in high dimensions. Consider $p(x)$ and $q(x)$ spherical Gaussians in $n$ dimensions with mean 0 and
$\sigma_q =1.01 \sigma_p$.

Since
$$q(x)=\left(\frac{1}{\sqrt{2\pi \sigma^2_q}}\right)^ne^{-\frac{1}{2\sigma_q^2}\sum_i x_i^2}\qquad p(x)=\left(\frac{1}{\sqrt{2\pi \sigma^2_p}}\right)^n e^{-\frac{1}{2\sigma_p^2}\sum_i x_i^2}$$
then
$$c=\frac{p(0)}{q(0)}=\left(\frac{\sigma_q}{\sigma_p}\right)^n=1.01^n$$

In [None]:
sigma_p = 10
sigma_q = 1.01 * sigma_p
N = 1000
c = (sigma_q/sigma_p)**N
print("c =", c)

Thus, for $n=1000$, the volume under $c q$ is about 20,000 times the volume under $p$. Therefore, the acceptance rate is $\frac{\mbox{volume}~p}{\mbox{volume}~c q}=\frac{1}{c}\approx 5\times 10^{-5}$.
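The relation between the acceptance rate and $1/c$ can be verified empirically in moderate dimension, where enough proposals are still accepted. In this sketch `sigma_p = 1`, the number of proposals and the dimensions are illustrative choices; the acceptance probability $p(x)/(c\,q(x))$ is computed from the ratio of the two exponentials, since the normalizations cancel against $c=p(0)/q(0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_p = 1.0
sigma_q = 1.01 * sigma_p
num_proposals = 20_000

def acceptance_rate(n_dim):
    # propose x ~ q and accept with probability p(x) / (c q(x)), c = p(0) / q(0)
    x = rng.standard_normal((num_proposals, n_dim)) * sigma_q
    rho2 = (x**2).sum(axis=1)
    accept_prob = np.exp(-0.5 * rho2 * (1 / sigma_p**2 - 1 / sigma_q**2))
    return (rng.random(num_proposals) <= accept_prob).mean()

for n_dim in [1, 10, 100]:
    c = (sigma_q / sigma_p)**n_dim
    print(n_dim, acceptance_rate(n_dim), 1 / c)  # empirical rate vs 1/c
```

Even with a proposal only 1% wider than the target, the acceptance rate already drops to about 0.37 at $n=100$ and keeps decaying exponentially with $n$.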

# <center>Assignments</center>

#### Ex 1.1

Sampling from a one dimensional Gaussian can be done by sampling from a uniform distribution using the **Box-Muller transformation**. 

Consider two variables $x_1$ and $x_2$ sampled independently from a uniform distribution in $[0,1]$. Take:

$$y_{1}=\sqrt{-2\log x_{1}}\cos\left(2\pi x_{2}\right)$$
$$y_{2}=\sqrt{-2\log x_{1}}\sin\left(2\pi x_{2}\right)$$

Show that $y_1$ and $y_2$ are uncorrelated normal variables.

#### Ex 1.2 (MacKay Ex 29.13)

This exercise shows that importance sampling can fail even for Gaussian distributions in one dimension.

We wish to sample from a zero-mean one-dimensional Gaussian $p$ with variance $\sigma^2_p$, using a zero-mean Gaussian $q$ with variance $\sigma^2_q$. Compute the variance of the importance weights. What happens when $\sigma^2_q=\sigma^2_p / 2$? Reproduce Figure 29.20 in MacKay's book.