# Exercise 02: Introduction to distributions and basic sampling in CUQIpy

This notebook describes basic usage of distributions, including visualizing their PDF/CDF and generating samples. It also describes how distributions can be equipped with a geometry to represent sampling in nontrivial spaces. Finally, conditional distributions are demonstrated along with the creation of user-defined distributions.

## Learning objectives of this notebook:
- Set up random variables following uni- and multivariate distributions in CUQIpy.
- Generate samples from distributions and use CUQIpy tools to inspect them visually.
- Explain the use of Geometry in distributions and samples.
- ★ Set up conditional distributions in CUQIpy - simple and using lambda functions.
- ★ Create a user-defined distribution from a logpdf function.

## Table of contents: 
* [1. Normal distribution (univariate)](#Normal)
* [2. Multivariate distributions](#Multivariate)
* [3. Geometry in distribution and Samples](#Geometry)
* [4. Conditional distributions ★](#Conditional)
* [5. User-defined distributions ★](#Userdefined)

First we import the Python packages we need: NumPy for array computations and Matplotlib for plotting.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

We import CUQIpy. In the previous notebook we imported upfront the specific tools we needed, like `from cuqi.distribution import Gaussian` to get the Gaussian distribution from CUQIpy's distribution module. We now simply import the complete package and then specify the complete name such as `cuqi.distribution.Gaussian` when using it. Both approaches are fine, each with pros and cons.

In [None]:
import sys
sys.path.append("../../CUQIpy")

In [None]:
import cuqi

## 1. Normal distribution  (univariate)  <a class="anchor" id="Normal"></a> 

The first thing we can do is define a simple normal distribution of a single variable, e.g.,

$$ X \sim \mathcal{N}(0,1^2) $$

This is done using the following syntax:

In [None]:
X = cuqi.distribution.Normal(mean=0, std=1)

More information on the distribution can be found in the CUQIpy documentation: https://cuqi-dtu.github.io/CUQIpy/api/index.html

Once created, we can print the distribution object and its dimension:

In [None]:
print(X)
print(X.dim)

and query information such as its mean and standard deviation:

In [None]:
print(X.mean)
print(X.std)

Distributions in CUQIpy have the commonly used methods one might expect, such as *pdf*, *logpdf*, and *cdf*. For example, we can evaluate the cumulative distribution function (CDF) at 0, which should be 0.5 since the PDF is symmetric about 0:

In [None]:
X.cdf(0)
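
As a cross-check that does not rely on CUQIpy, the normal CDF can be written in closed form with the error function, $\Phi(x) = \tfrac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)$. The helper name `normal_cdf` below is our own and is not part of CUQIpy; it is only a sketch for comparison.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """Closed-form CDF of a univariate normal distribution (hypothetical helper)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# By symmetry of the PDF about the mean, half of the probability mass lies below it
print(normal_cdf(0.0))
```

This prints `0.5`, in agreement with the symmetry argument above.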

We can evaluate and plot the CDF on an interval by evaluating the CDF on a grid:

In [None]:
grid = np.linspace(-10, 10, 1001)
cdf_vals = np.zeros(grid.shape)
for k in range(len(grid)):
    cdf_vals[k] = X.cdf(grid[k])
plt.plot(grid, cdf_vals)

Alternatively, a more compact form using a Python list comprehension:

In [None]:
plt.plot(grid, [X.cdf(grid[k]) for k in range(len(grid))])

CUQIpy distributions also have a `sample` method, which returns one or more samples from the distribution. A single sample is returned as a `CUQIarray`:

In [None]:
X.sample()

By default a single sample is returned. More samples can easily be requested:

In [None]:
s = X.sample(10000)
type(s)

When more than one sample is generated, a CUQIpy `Samples` object is returned. This is essentially an array in which each column contains one sample, equipped with a number of additional methods, for example for plotting.

For example one can make a "chain plot", i.e., the sampled values of selected parameter(s) of interest. Here we have a single parameter and with Python being zero-indexed we specify this parameter as follows:

In [None]:
s.plot_chain(0)

Another possibility is a histogram of the parameter chain. Again, we specify 0 as the element whose chain we want to look at. The keyword arguments are passed directly to the underlying matplotlib `hist` function for full control:

In [None]:
s.hist_chain(0, bins=100, density=True)
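
To make the effect of `density=True` concrete, here is a small NumPy-only sketch. It assumes that matplotlib's `hist` normalizes in the same way as `np.histogram` and uses a seeded NumPy generator in place of CUQIpy samples, so it illustrates the normalization rather than CUQIpy itself.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch reproducible
draws = rng.normal(loc=0.0, scale=1.0, size=10000)

# density=True rescales the bar heights so that the total bar area equals one,
# which makes the histogram directly comparable to a PDF
heights, edges = np.histogram(draws, bins=100, density=True)
total_area = np.sum(heights * np.diff(edges))
print(total_area)
```

The printed area is 1 up to floating-point rounding, whereas without `density=True` the bar heights would be raw counts.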

CUQIpy has integrated support for common statistical plots through the [ArviZ library](https://arviz-devs.github.io/arviz/). For example, a "trace plot" combines the previous two plots, with the histogram replaced by a kernel density estimate (KDE):

In [None]:
s.plot_trace()

and a "violin plot" displays the median as a white circle, the interquartile range, along with the density/histogram on either side:

In [None]:
s.plot_violin()

#### Try yourself (optional):  
 - Create a new random variable `Y` following a normal distribution with mean 2 and standard deviation 3.
 - Generate 100 samples and display a histogram.
 - Compare with the theoretical distribution by plotting the probability density function of `Y` on top of the histogram.
 - Increase the number of samples and (hopefully) see the histogram approach the theoretical PDF.

In [None]:
# Type code here:





## 2. Multivariate distributions <a class="anchor" id="Multivariate"></a> 

CUQIpy currently implements a number of multivariate distributions in the `cuqi.distribution` module:

- Beta
- Cauchy_diff
- Gamma
- Gaussian
- GaussianCov
- GaussianPrec
- GaussianSqrtPrec
- GMRF
- InverseGamma
- Laplace
- Laplace_diff
- LMRF
- LogNormal
- Uniform

and more can easily be added when needed.


To demonstrate, we specify here a 3-element random variable `Z` following a Gaussian distribution with independent elements:

$$Z \sim \mathcal{N}(\mu,\mathrm{diag}(\sigma^2)) \quad \text{for} \quad \mu = [5, 3, 1]^T \quad \text{and} \quad \sigma = [1, 2, 3]^T$$

In [None]:
true_mu = np.array([5, 3, 1])
true_sigma = np.array([1, 2, 3])
Z = cuqi.distribution.Gaussian(mean=true_mu, std=true_sigma)

As before we can take a look at the distribution by printing it and its dimension:

In [None]:
print(Z)
print(Z.dim)

as well as its mean

In [None]:
print(Z.mean)

and covariance matrix:

In [None]:
print(Z.cov)
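
For independent elements the covariance matrix is simply $\mathrm{diag}(\sigma^2)$. The NumPy-only sketch below builds that matrix from `true_sigma` and draws one vector by scaling standard normal noise; it is meant as a reference for what to expect from the formula above, not as a description of how CUQIpy computes `Z.cov` or samples internally.

```python
import numpy as np

true_mu = np.array([5, 3, 1])
true_sigma = np.array([1, 2, 3])

# Independent components: diagonal covariance holding the variances sigma_i^2
cov = np.diag(true_sigma**2)
print(cov)

# One draw from N(mu, diag(sigma^2)) by shifting and scaling standard normal noise
rng = np.random.default_rng(0)   # seeded generator, our choice for reproducibility
z = true_mu + true_sigma * rng.standard_normal(3)
print(z.shape)
```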

We generate a single sample which produces a 3-element CUQIarray:

In [None]:
Z.sample()

If we ask for more than one sample, say 1000, we get a `Samples` object with 1000 columns each holding a 3-element sample:

In [None]:
sZ = Z.sample(1000)
print(sZ)
sZ.shape

We can plot the chains of selected elements. Here we pick elements 2 and 0:

In [None]:
sZ.plot_chain([2, 0])

We can also plot a few individual 3-element samples:

In [None]:
sZ.plot();

In [None]:
sZ.plot(plot_par=True)

By default 5 random samples are plotted, but we can also specify indices of specific samples we wish to plot, like every 100th sample:

In [None]:
sZ.plot([0, 100, 200, 300, 400, 500, 600, 700, 800, 900]);

We can also plot the sample mean and compare with the true mean of the distribution:

In [None]:
sZ.plot_mean(label="Sample mean")
plt.plot(Z.mean, 'o', label="Distribution mean")
plt.legend()

and the sample standard deviation along with the true standard deviations of the distribution, which we obtain as the square root of the diagonal of the covariance matrix:

In [None]:
sZ.plot_std(label="Sample std")
plt.plot(np.sqrt(np.diag(Z.cov)), 'o', label="Distribution std")
plt.legend()
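
The quantities behind these two plots can be sketched with plain NumPy. The example assumes the column-per-sample layout described earlier and uses its own seeded draws instead of `sZ`; the choice `ddof=1` is ours and may differ from what CUQIpy uses.

```python
import numpy as np

true_mu = np.array([5.0, 3.0, 1.0])
true_sigma = np.array([1.0, 2.0, 3.0])

rng = np.random.default_rng(0)
n_samples = 1000
# Column j holds sample j, so the array has shape (3, n_samples)
samples = true_mu[:, None] + true_sigma[:, None] * rng.standard_normal((3, n_samples))

sample_mean = samples.mean(axis=1)        # average across samples, per element
sample_std = samples.std(axis=1, ddof=1)  # unbiased estimator (ddof=1 is our choice)

# Deviations from the true values shrink roughly like 1/sqrt(n_samples)
print(sample_mean - true_mu)
print(sample_std - true_sigma)
```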

#### Try yourself (optional):  
 - Plot mean with 95% credibility interval, hint: `help(sZ.plot_ci)`.
 - Include in the credibility interval plot a comparison with the true mean using the `exact` keyword argument of `plot_ci`.
 - Reduce and increase the number of samples and study the effect on the mean and credibility interval.
 - Try also 50% and 99% credibility intervals.

In [None]:
# Type code here:


## 3. Geometry in distribution and Samples <a class="anchor" id="Geometry"></a> 

By default, no particular structure or space is assumed for the parameters. If we want to express that the parameters constitute, for example, a 2D image or a set of discrete named parameters, we can specify this by means of a CUQIpy geometry.

Unless told otherwise, distributions (and the `Samples` produced from them) carry a default (trivial) geometry:

In [None]:
print(Z.geometry)
print(sZ.geometry)

As we saw, samples are plotted as line plots by default:

In [None]:
sZ.plot([100,200,300])

But we can also plot the raw underlying parameters using the `plot_par` argument:

In [None]:
sZ.plot([100,200,300], plot_par=True)

We may equip the distribution with a different geometry, either when creating it, or afterwards. For example if the three parameters represent labelled quantities such as height, width and depth we can use a `Discrete` geometry:

In [None]:
geom = cuqi.geometry.Discrete(['height','width','depth'])

We can update the distribution's geometry and generate some new samples:

In [None]:
Z.geometry = geom

In [None]:
sZ2 = Z.sample(100)

The samples will now know about their new `Discrete` geometry and the plotting style will be changed:

In [None]:
sZ2.plot();

The credibility interval plot style is also updated to show errorbars for the `Discrete` geometry:

In [None]:
sZ2.plot_ci(95, exact=true_mu)
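
To clarify what a 95% credibility interval summarizes, the sketch below computes an equal-tailed interval from sample percentiles with NumPy. This is an assumption about one common definition; how `plot_ci` computes its interval internally is not stated here and may differ. The draws are generated directly with NumPy rather than taken from `sZ2`.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([5.0, 3.0, 1.0])
true_sigma = np.array([1.0, 2.0, 3.0])
# 100 samples stored column-wise, one row per parameter
samples = true_mu[:, None] + true_sigma[:, None] * rng.standard_normal((3, 100))

level = 95
tail = (100 - level) / 2   # 2.5% of the samples are cut off in each tail
lower, upper = np.percentile(samples, [tail, 100 - tail], axis=1)

for name, lo, hi in zip(["height", "width", "depth"], lower, upper):
    print(f"{name}: [{lo:.2f}, {hi:.2f}]")
```

Changing `level` to 50 or 99 narrows or widens the intervals accordingly.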

In the chain plot, the legend also reflects the particular labels:

In [None]:
sZ2.plot_chain([2,0])

Another use of geometry is to represent 1D or 2D versions of the same distribution (prior). To do that let us first look at two new geometries.

In CUQIpy we can represent 1D and 2D signals using the `Continuous1D` and `Continuous2D` geometries:

In [None]:
N = 100     # number of pixels
dom = 1     # 1D or 2D domain

x = np.linspace(0,1,N)

if dom == 1:
    geometry = cuqi.geometry.Continuous1D(x)
elif dom == 2:
    geometry = cuqi.geometry.Continuous2D((x, x))

In this example there will be $N$ parameters in 1D and $N^2$ parameters in 2D. We can check the number of parameters of the geometry as well as its type:

In [None]:
geometry.par_dim

In [None]:
type(geometry)
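
The parameter counts can be reproduced with NumPy alone. The sketch below counts grid points in 1D and 2D and shows how a flat parameter vector can be viewed as an image by reshaping. The row-major ordering used by `reshape` here is an assumption for illustration; the ordering CUQIpy uses for `Continuous2D` is not stated in this notebook.

```python
import numpy as np

N = 100
x = np.linspace(0, 1, N)

# 1D: one parameter per grid point
n_par_1d = x.size

# 2D: one parameter per pixel of the N-by-N grid
X1, X2 = np.meshgrid(x, x)
n_par_2d = X1.size

# A flat parameter vector can be viewed as an image by reshaping
flat = np.arange(n_par_2d)
image = flat.reshape(N, N)
print(n_par_1d, n_par_2d, image.shape)
```

This prints `100 10000 (100, 100)`.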

A Gaussian Markov Random Field (GMRF) can be used in 1 or 2 spatial dimensions. Please see the documentation for details: https://cuqi-dtu.github.io/CUQIpy/api/_autosummary/cuqi.distribution/cuqi.distribution.GMRF.html.

We can now specify a GMRF distribution (with some chosen mean, precision, boundary conditions, etc.). Exactly the same code works in 1D and 2D thanks to the geometry:

In [None]:
mean = np.zeros(geometry.par_dim)
prec = 4
pX = cuqi.distribution.GMRF(mean, prec, dom, bc_type='zero', geometry=geometry)
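
For intuition, one common way to build a first-order GMRF in 1D with zero boundary conditions is to use the precision matrix $P = \mathrm{prec}\cdot D^T D$, where $D$ is a finite-difference matrix. The sketch below assumes this construction purely for illustration; the GMRF documentation linked above is the reference for what CUQIpy actually does. The names `D`, `P` and `x_sample` are our own.

```python
import numpy as np

N = 100
prec = 4.0

# Forward differences with zero values assumed outside the domain: (N+1) x N
D = np.zeros((N + 1, N))
idx = np.arange(N)
D[idx, idx] = 1.0
D[idx + 1, idx] = -1.0

# Tridiagonal precision matrix: prec * [-1, 2, -1]
P = prec * (D.T @ D)

# Draw x ~ N(0, P^{-1}): with P = L L^T, solving L^T x = z gives Cov(x) = P^{-1}
rng = np.random.default_rng(0)
L = np.linalg.cholesky(P)
z = rng.standard_normal(N)
x_sample = np.linalg.solve(L.T, z)
print(x_sample.shape)
```

Because the precision matrix couples only neighbouring points, draws from such a model vary smoothly along the grid, and a larger `prec` gives smaller variations.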

With the distribution set up, we are ready to generate some samples:

In [None]:
# call method to sample
sampleX = pX.sample(50)

We can check that we have produced 50 samples, each of size 100 in the 1D case (in 2D, size 10000):

In [None]:
sampleX.shape

We plot a couple of samples:

In [None]:
sampleX.plot()   

#### Try yourself (optional):  
 - Go back and change `dom` to 2 to get the 2D case and rerun the subsequent cells.
 - Play with the number of pixels `N` as well as parameters of the GMRF and see the effect on the samples.

## 4. Conditional distributions ★ <a class="anchor" id="Conditional"></a> 

In CUQIpy, defining conditional distributions is simple. Assume we are interested in defining the normal distribution conditioned on the standard deviation, e.g.,

$$ X_2 \mid \mathrm{std} \sim \mathcal{N}(0,\mathrm{std}^2) $$

This can simply be achieved by *omitting* the keyword argument for the standard deviation as shown in the following code

In [None]:
X2 = cuqi.distribution.Normal(mean=0)

Printing it will tell us that the variable `std` has not been specified, i.e., it is a *conditioning variable*:

In [None]:
print(X2)

Because $X_2$ is a conditional distribution, we cannot evaluate the logpdf or sample it directly without specifying the value of the conditioning variable (the standard deviation in this case). Hence this code will fail to run:

In [None]:
# This code will give an error so we wrap it in a try/except block and print the error
try:
    X2.sample()
except Exception as e:
    print(e)

However, we can specify the conditioning variable using the "call" syntax, i.e., `X2(std=2)` to specify the value of the standard deviation in the conditional distribution as shown below.

In [None]:
X2(std=2).sample()

In fact, conditioning creates a new *unconditional* distribution. Here printing reveals that it does not have any conditioning variables:

In [None]:
X2_std2 = X2(std=2)
print(X2_std2)

We expect we can then sample it directly, which is confirmed:

In [None]:
X2_std2.sample()

In general one may need more flexibility than simply conditioning directly on the attributes of the distribution. Let us assume we want to condition on the variance, denoted $d$, rather than the standard deviation of the normal distribution, i.e.,

$$ X_3 \mid d \sim \mathcal{N}(0,d) $$

In CUQIpy this can be achieved through *lambda* functions. A lambda function is the Python equivalent of a MATLAB anonymous function, i.e., a function defined in a single line. For example, a function that simply sums its two input arguments is written as follows:

In [None]:
myfun = lambda v1, v2: v1+v2

In [None]:
myfun(5,7)

We can pass a lambda function directly as an argument to the distribution, e.g.,

In [None]:
X3 = cuqi.distribution.Normal(mean=0, std=lambda d: np.sqrt(d))
print(X3)

where we see that `d` is now the conditioning variable instead of `std` as before.

We can then pass a value for `d` to condition on, which allows us to sample from the now fully specified distribution:

In [None]:
X3(d=2).sample()

What actually happens behind the scenes is that writing `X3(d=2)` defines a new CUQIpy distribution, in which the standard deviation is obtained by evaluating the lambda function. This can be seen by storing the new distribution as follows:

In [None]:
X4 = X3(d=2)
X4.std

One can even go crazy and define lambda functions for all attributes e.g.

In [None]:
#Functions for mean and std with various (shared) inputs
mean = lambda sigma,gamma: sigma+gamma
std  = lambda delta,gamma: np.sqrt(delta+gamma)

z = cuqi.distribution.Normal(mean, std)
print(z)

The three variable names `sigma`, `gamma` and `delta` used to define the two lambda functions for the mean and standard deviation are now the conditioning variables of the conditional distribution `z`.

By providing values for all three variables we obtain a fully specified distribution

In [None]:
Z = z(sigma=3, delta=5, gamma=-2)
print(Z)

that we can sample:

In [None]:
Z.sample()
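
One way to see how the argument names of the lambda functions can become conditioning variables is to inspect the function signatures. The sketch below uses the standard-library `inspect` module and plain Python. It is an assumption about a plausible mechanism, shown for illustration only, and not a description of CUQIpy's actual implementation; `arg_names` is a hypothetical helper.

```python
import inspect
import numpy as np

mean = lambda sigma, gamma: sigma + gamma
std = lambda delta, gamma: np.sqrt(delta + gamma)

def arg_names(func):
    """Return the argument names of a function (hypothetical helper)."""
    return list(inspect.signature(func).parameters)

# Union of the argument names, keeping first-seen order
cond_vars = list(dict.fromkeys(arg_names(mean) + arg_names(std)))
print(cond_vars)

# Supplying values for all names fixes both parameters
values = {"sigma": 3, "delta": 5, "gamma": -2}
mean_val = mean(**{k: values[k] for k in arg_names(mean)})
std_val = std(**{k: values[k] for k in arg_names(std)})
print(mean_val, std_val)
```

With the values used above, the mean evaluates to $3 + (-2) = 1$ and the standard deviation to $\sqrt{5 - 2} = \sqrt{3} \approx 1.732$.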

Conditional distributions will play a major role when specifying Bayesian inverse problems including hierarchical models where some random variables depend on other random variables. We revisit this in later notebooks.

## 5. User-defined distributions ★ <a class="anchor" id="Userdefined"></a> 

In addition to the distributions provided by CUQIpy, there is also the possibility for users to specify new distributions. One option is to write their own class in the same style as existing distributions such as the Beta distribution (see code here: https://github.com/CUQI-DTU/CUQIpy/blob/main/cuqi/distribution/_beta.py).

Another option is to specify a user-defined distribution, which is convenient if one for example only wishes to evaluate the logpdf.

The example below demonstrates how to manually specify a normal distribution through a lambda function for the logpdf and compare it to the normal distribution defined in the beginning of this notebook.

We specify variables for the mean and the standard deviation and specify the lambda function for the logpdf. 

In [None]:
mu1 = 0
std1 = 1

logpdf_func = lambda xx: -np.log(std1*np.sqrt(2*np.pi))-0.5*((xx-mu1)/std1)**2

To set up the user-defined distribution we need to specify the logpdf as well as its dimension (number of variables) since that cannot be automatically inferred from the lambda function:

In [None]:
XU = cuqi.distribution.UserDefinedDistribution(dim=1, logpdf_func=logpdf_func)

We can now evaluate the logpdf, as well as the pdf:

In [None]:
print(XU.logpdf(0))
print(XU.pdf(0))
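
The expected values can be worked out by hand: at the mean the quadratic term vanishes, so the logpdf equals $-\log\sqrt{2\pi} \approx -0.9189$ and the pdf equals $1/\sqrt{2\pi} \approx 0.3989$. The NumPy-only sketch below evaluates the same lambda function directly and also checks numerically that the implied pdf integrates to approximately one, which is worth verifying for any hand-written logpdf. The simple Riemann sum is our own choice of check.

```python
import numpy as np

mu1 = 0
std1 = 1
logpdf_func = lambda xx: -np.log(std1*np.sqrt(2*np.pi)) - 0.5*((xx - mu1)/std1)**2

# At the mean: logpdf = -log(sqrt(2*pi)) and pdf = 1/sqrt(2*pi)
logp0 = logpdf_func(0)
p0 = np.exp(logp0)
print(logp0, p0)

# Riemann-sum check that the pdf is normalized on a wide grid
grid = np.linspace(-10, 10, 1001)
area = np.sum(np.exp(logpdf_func(grid))) * (grid[1] - grid[0])
print(area)
```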

We can compare this with the normal distribution from the beginning of the notebook and observe that their pdfs agree:

In [None]:
plt.plot(grid, [X.pdf(grid[k]) for k in range(len(grid))], label='CUQIpy Normal')
plt.plot(grid, [XU.pdf(grid[k]) for k in range(len(grid))], '--', label='User-defined Normal')
plt.legend()

We cannot sample the user-defined distribution because we have only provided the logpdf:

In [None]:
try:
    XU.sample()
except Exception as e:
    print(e)

We can equip the user-defined distribution with a `sample_func`, which specifies how to sample (it is up to the user to ensure consistency between the logpdf and `sample_func`):

In [None]:
XU.sample_func = lambda : np.array(mu1 + std1*np.random.randn())

In [None]:
XU.sample()
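
Since consistency between the logpdf and the sampling function is the user's responsibility, a quick numerical check can be useful. The sketch below, which uses NumPy only and a seeded generator of our own choosing, compares a normalized histogram of draws from a sampling function with the density implied by the logpdf. The bin count, range and sample size are arbitrary choices.

```python
import numpy as np

mu1 = 0
std1 = 1
logpdf_func = lambda xx: -np.log(std1*np.sqrt(2*np.pi)) - 0.5*((xx - mu1)/std1)**2

rng = np.random.default_rng(0)   # seeded generator, our choice for reproducibility
sample_func = lambda: mu1 + std1 * rng.standard_normal()

draws = np.array([sample_func() for _ in range(20000)])

# Compare the empirical density with the density implied by the logpdf
heights, edges = np.histogram(draws, bins=40, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_gap = np.max(np.abs(heights - np.exp(logpdf_func(centers))))
print(max_gap)
```

A small `max_gap` indicates that the two functions describe the same distribution; a mismatch in mean or standard deviation would show up as a clearly larger gap.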

We can compare the samples obtained from the original normal distribution and the user-defined:

In [None]:
Xs = X.sample(10000)

In [None]:
XUs = XU.sample(10000)

We plot their histograms and note that they appear similar:

In [None]:
Xs.hist_chain(0,bins=100)
XUs.hist_chain(0,bins=100)
plt.legend(['CUQIpy Normal', 'User-defined Normal'])