# Exercise 5 - Multivariate Gaussians

In this exercise, we will estimate a Gaussian from a dataset and answer inference queries using the mean- and canonical parameterizations. Runtime experiments will illustrate the importance of both parameterizations.

If you run into a persistent problem, do not hesitate to contact the course instructors at
- paul.kahlmeyer@uni-jena.de

### Submission

- Deadline of submission: 04.12.2022
- Submission on [moodle page](https://moodle.uni-jena.de/course/view.php?id=34630)

### Help
In case you cannot solve a task, you can use the saved values within the `help` directory to continue working on the other tasks:
- Load arrays with [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.load.html)
```
import numpy as np
np.load('help/array_name.npy')
```
- Load functions with [Dill](https://dill.readthedocs.io/en/latest/dill.html)
```
import dill
with open('help/some_func.pkl', 'rb') as f:
    func = dill.load(f)
```

# Dataset

In this exercise, we will use a dataset for [predicting wine quality](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009).

The dataset is stored as `dataset.csv`.

### Task 1
Read this dataset into a $1599\times 12$ matrix.

Each row represents one specific wine, each column corresponds to a measured attribute.

In [2]:
# TODO: Load dataset into matrix
import pandas as pd

df = pd.read_csv("dataset.csv")
columns = list(df)
dataset = df.to_numpy()

print(columns)


def indices(columns: list[str], *names: str) -> list[int]:
    return [columns.index(name) for name in names]


def columns_left(columns: list[str], indices_removed: list[int]) -> list[str]:
    return [column for i, column in enumerate(columns) if i not in indices_removed]


dataset.shape


['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']


(1599, 12)

# Model selection

Here we use the model assumption that the samples come from a multivariate normal distribution. 

### Task 2
Estimate the Maximum Likelihood parameters

\begin{align}
\mu_{\text{ML}} &= \frac{1}{N}\sum_{i=1}^Nx^{(i)}\\
\Sigma_{\text{ML}} &= \frac{1}{N}\sum_{i=1}^N\left(x^{(i)}-\mu_{\text{ML}}\right)\left(x^{(i)}-\mu_{\text{ML}}\right)^T
\end{align}

for a multivariate normal distribution based on this dataset. Here $N$ is the number of samples and $x^{(i)}$ is the $i$-th sample.
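
As a quick sanity check of the two formulas, the following sketch compares the explicit sums with NumPy's built-ins. The matrix `X` and the random seed are made-up illustration data, not the wine dataset.

```python
import numpy as np

# Hypothetical toy data: 50 samples with 3 attributes (not the wine dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
N = X.shape[0]

# ML mean: average over the samples (rows)
mu_ml = X.sum(axis=0) / N

# ML covariance: average outer product of the centered samples
centered = X - mu_ml
sigma_ml = centered.T @ centered / N

# np.cov divides by N - 1 by default; bias=True switches to division by N
same_as_numpy = np.allclose(sigma_ml, np.cov(X, rowvar=False, bias=True))
print(same_as_numpy)
```

Note that `np.cov` returns the unbiased estimate unless `bias=True` is passed, so the default result has to be rescaled by $(N-1)/N$ to obtain $\Sigma_{\text{ML}}$.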

In [3]:
# TODO: calculate ML estimates
import numpy as np

N = dataset.shape[0]
ml_mean = np.mean(dataset, axis=0)
ml_cov = (N - 1) / N * np.cov(dataset.T)
print(ml_mean.shape)
print(ml_cov.shape)


(12,)
(12, 12)


# Inference

Now that we have estimated the parameters of our underlying model, we want to perform inference in order to answer the query:

**"What quality and alcohol level can we expect, if we observe a wine with**
- **citric acid level of 0.6,**
- **residual sugar of 2.5,**
- **chlorides level of 0.1,**
- **density of 0.994,**
- **sulphate level of 0.5?"**

## Mean Parameterization

The mean parameterization of a Gaussian consists of the mean vector $\mu$ and the covariance matrix $\Sigma$.

**Marginalizing** a Gaussian so that only a subset $J$ of the dimensions is kept results in a Gaussian with
- Mean vector $\mu_J$
- Covariance matrix $\Sigma_{JJ}$

**Conditioning** a subset $J$ of the dimensions on values $x_J$ also gives us a Gaussian with 
- Mean vector $\mu_I+\Sigma_{IJ}\Sigma_{JJ}^{-1}(x_J-\mu_J)$ 
- Covariance matrix $S_{II} = \Sigma_{II}-\Sigma_{IJ}\Sigma_{JJ}^{-1}\Sigma_{JI}$

Here, the subscripts indicate the selected dimensions of the variables. $I$ denotes the dimensions that remain after we condition on the dimensions $J$, and $S$ denotes the Schur complement.
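
To make the conditioning formulas concrete, here is a minimal sketch on a two-dimensional Gaussian. The parameter values are illustrative assumptions, not estimates from the wine data. Dimension 1 is conditioned on the value 2.0, so dimension 0 remains.

```python
import numpy as np

# Made-up parameters of a 2-dimensional Gaussian
mu = np.array([1.0, 0.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

idx_I, idx_J = [0], [1]   # I: remaining dimensions, J: conditioned dimensions
x_J = np.array([2.0])

# np.ix_ selects the sub-block with the given rows and columns
sigma_II = sigma[np.ix_(idx_I, idx_I)]
sigma_IJ = sigma[np.ix_(idx_I, idx_J)]
sigma_JJ = sigma[np.ix_(idx_J, idx_J)]

# Solve linear systems instead of forming the inverse of sigma_JJ explicitly
cond_mu = mu[idx_I] + sigma_IJ @ np.linalg.solve(sigma_JJ, x_J - mu[idx_J])
cond_sigma = sigma_II - sigma_IJ @ np.linalg.solve(sigma_JJ, sigma_IJ.T)
print(cond_mu, cond_sigma)
```

By hand, the conditional mean is $1 + 0.8\cdot 1^{-1}\cdot(2-0) = 2.6$ and the conditional variance is $2 - 0.8\cdot 1^{-1}\cdot 0.8 = 1.36$. Using `np.linalg.solve` rather than `np.linalg.inv` is a design choice of this sketch; both follow the same formula.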

### Task 3
Implement the following class of a Gaussian with mean parameterization. Then use your implementation to answer the query.

Note: `marginalize` and `condition` should not return any parameters, but update the internal parameters.

In [49]:
class MeanGaussian():
    def __init__(self, mu, sigma):
        '''
        Mean parameterization of a gaussian

        @Params: 
            mu... vector of size ndims
            sigma... matrix of size ndims x ndims
        '''

        self.mu = mu
        self.sigma = sigma

    def marginalize(self, idx_J):
        '''
        Marginalizes a set of indices from the Gaussian.

        @Params:
            idx_J... list of indices to keep after marginalization (these indices remain)

        @Returns:
            Nothing, parameters are changed internally
        '''

        self.mu = self.mu[idx_J]
        self.sigma = self.sigma[idx_J, :][:, idx_J]

    def condition(self, idx_J, x_J):
        '''
        Conditions a set of indices on values.

        @Params:
            idx_J... list of indices that are conditioned on
            x_J... values that are conditioned on

        @Returns:
            Nothing, parameters are changed internally
        '''

        idx_I = [i for i in range(self.mu.shape[0]) if i not in idx_J]
        sigma_ii = self.sigma[idx_I, :][:, idx_I]
        sigma_ij = self.sigma[idx_I, :][:, idx_J]
        sigma_jj_inv = np.linalg.inv(self.sigma[idx_J, :][:, idx_J])

        self.mu = self.mu[idx_I] + sigma_ij @ sigma_jj_inv @ (x_J - self.mu[idx_J])
        self.sigma = sigma_ii - sigma_ij @ sigma_jj_inv @ sigma_ij.T


# TODO: answer query

mean_gaussian = MeanGaussian(ml_mean, ml_cov)
idx_J = indices(columns, "citric acid", "residual sugar", "chlorides", "density", "sulphates")
x_J = [0.6, 2.5, 0.1, 0.994, 0.5]
columns_after_marg = columns_left(columns, idx_J)
mean_gaussian.condition(idx_J, x_J)
idx_I = indices(columns_after_marg, "quality", "alcohol")
mean_gaussian.marginalize(idx_I)
print(f"Expected quality: {mean_gaussian.mu[0]}")
print(f"Expected alcohol level: {mean_gaussian.mu[1]}")


Expected quality: 6.101141437329366
Expected alcohol level: 11.787708048356173


## Canonical Parameterization

The canonical parameterization $(\nu,\Lambda)$ results from the mean parameterization through

\begin{align}
\nu &=\Sigma^{-1}\mu\\
\Lambda &= \Sigma^{-1}
\end{align}


In the canonical parameterization, **marginalizing** a Gaussian so that only a subset $J$ of the dimensions is kept results in a Gaussian with
- Vector $\nu_J-\Lambda_{JI}\Lambda_{II}^{-1}\nu_I$
- Precision matrix $S_{JJ}=\Lambda_{JJ}-\Lambda_{JI}\Lambda_{II}^{-1}\Lambda_{IJ}$

**Conditioning** a subset $J$ of the dimensions on values $x_J$ again gives us a Gaussian with 
- Vector $\nu_I-\Lambda_{IJ}x_J$ 
- Precision matrix $\Lambda_{II}$

The subscripts indicate the selected dimensions of the variables. $I$ denotes the complement of $J$: the dimensions that are marginalized out in the first case, and the dimensions that remain after conditioning in the second. $S$ denotes the Schur complement.
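
The conversion between the two parameterizations and the conditioning rule in canonical form can be sketched on a small example. The two-dimensional parameters below are made up for illustration; the sketch converts to canonical form, conditions dimension 1 on the value 2.0, and converts back to mean parameters.

```python
import numpy as np

# Made-up mean parameters of a 2-dimensional Gaussian
mu = np.array([1.0, 0.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Mean -> canonical: lamb is the precision matrix, nu = lamb @ mu
lamb = np.linalg.inv(sigma)
nu = lamb @ mu

# Condition dimension 1 on x_J in canonical form (no inverse needed here)
idx_I, idx_J = [0], [1]
x_J = np.array([2.0])
cond_nu = nu[idx_I] - lamb[np.ix_(idx_I, idx_J)] @ x_J
cond_lamb = lamb[np.ix_(idx_I, idx_I)]

# Canonical -> mean, to read off the conditional mean and covariance
cond_sigma = np.linalg.inv(cond_lamb)
cond_mu = cond_sigma @ cond_nu
print(cond_mu, cond_sigma)
```

For these numbers the conditional mean is 2.6 and the conditional variance is 1.36, which is what the mean-parameter conditioning formulas give for the same Gaussian.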

We will see later that in some cases the canonical parameterization is preferable to the mean parameterization.

### Task 4
Implement the following class of a Gaussian with canonical parameterization. Then use your implementation to answer the query.

Note: `marginalize` and `condition` should not return any parameters, but update the internal parameters.
The solution should be the same as in Task 3.

In [50]:
class CanonicalGaussian():
    def __init__(self, nu, lamb):
        '''
        Canonical representation of a Gaussian

        @Params: 
            nu... vector of size ndims
            lamb... matrix of size ndims x ndims (precision matrix)
        '''

        self.nu = nu
        self.lamb = lamb

    def marginalize(self, idx_J):
        '''
        Marginalizes a set of indices from the Gaussian.

        @Params:
            idx_J... list of indices to keep after marginalization (these indices remain)

        @Returns:
            Nothing, parameters are changed internally
        '''

        idx_I = [i for i in range(self.nu.shape[0]) if i not in idx_J]
        lamb_ji = self.lamb[idx_J, :][:, idx_I]
        lamb_ii_inv = np.linalg.inv(self.lamb[idx_I, :][:, idx_I])
        lamb_jj = self.lamb[idx_J, :][:, idx_J]

        self.nu = self.nu[idx_J] - lamb_ji @ lamb_ii_inv @ self.nu[idx_I]
        self.lamb = lamb_jj - lamb_ji @ lamb_ii_inv @ lamb_ji.T

    def condition(self, idx_J, x_J):
        '''
        Conditions a set of indices on values.

        @Params:
            idx_J... list of indices that are conditioned on
            x_J... values that are conditioned on

        @Returns:
            Nothing, parameters are changed internally
        '''

        idx_I = [i for i in range(self.nu.shape[0]) if i not in idx_J]
        lamb_ij = self.lamb[idx_I, :][:, idx_J]
        self.nu = self.nu[idx_I] - lamb_ij @ x_J
        self.lamb = self.lamb[idx_I, :][:, idx_I]



# TODO: answer query
lamb = np.linalg.inv(ml_cov)
nu = lamb @ ml_mean
canonical_gaussian = CanonicalGaussian(nu, lamb)
canonical_gaussian.condition(idx_J, x_J)
canonical_gaussian.marginalize(idx_I)
marg_sigma = np.linalg.inv(canonical_gaussian.lamb)
marg_mean = marg_sigma @ canonical_gaussian.nu
print(f"Expected quality: {marg_mean[0]}")
print(f"Expected alcohol level: {marg_mean[1]}")


Expected quality: 6.101141437329399
Expected alcohol level: 11.787708048356278
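
The two answers differ only in the last few digits, which is expected because both routes go through different floating-point operations. A minimal sketch of a tolerance-based comparison, with the four numbers copied from the printed outputs of Task 3 and Task 4 (the variable names are hypothetical):

```python
import numpy as np

# Values copied from the printed outputs of Task 3 (mean) and Task 4 (canonical)
quality_mean, quality_canonical = 6.101141437329366, 6.101141437329399
alcohol_mean, alcohol_canonical = 11.787708048356173, 11.787708048356278

# Compare with a tolerance instead of exact equality
results_agree = np.allclose(
    [quality_mean, alcohol_mean],
    [quality_canonical, alcohol_canonical],
)
print(results_agree)
```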


# Computational costs

Why do we need two different parameterizations of the same probability distribution?
What is the difference?

We cannot observe the effect of the parameterization on our dataset, as it is far too small (too few dimensions).

In the `synthetic/` directory, you find parameters for a Gaussian with 300 dimensions, as well as a value vector `x` for conditioning.
Load these arrays and calculate the parameters for the canonical parameterization.

In [51]:
# TODO: load synthetic parameters
mu = np.load("synthetic/mu.npy")
sigma = np.load("synthetic/sigma.npy")
x_J = np.load("synthetic/x.npy")
lamb = np.linalg.inv(sigma)
nu = lamb @ mu

We now want to investigate the computation times for the following inference operations:

1. Marginalize out the dimensions 200-299, then condition on the dimensions 100-199 with $x$
2. Condition on the dimensions 100-199 with $x$, then marginalize out the dimensions 200-299


<div>
<img src="images/indices.png" width="700"/>
</div>

Both operations yield the same result, $p(x_0,\dots,x_{99}\mid x_{100},\dots,x_{199})$; they differ only in the order of marginalization and conditioning.

### Task 5
Track the computational costs for both inference operations using the mean parameters and the canonical parameters.

What do you observe? Try to find an explanation for your observations.

In [60]:
# TODO: measure execution costs + explain observations
import timeit

# Marginalize first: keep dimensions 0-199 (drops 200-299) ...
idx_I_first_marg = list(range(0, 200))
# ... then condition on dimensions 100-199
idx_J_first_marg = list(range(100, 200))
# Condition first: condition on dimensions 100-199 ...
idx_J_first_cond = list(range(100, 200))
# ... then keep 0-99 (the 200 remaining dimensions are re-indexed to 0-199)
idx_I_first_cond = list(range(0, 100))

def canonical_marg_first():
    gaussian = CanonicalGaussian(nu, lamb)
    gaussian.marginalize(idx_I_first_marg)
    gaussian.condition(idx_J_first_marg, x_J)

def canonical_cond_first():
    gaussian = CanonicalGaussian(nu, lamb)
    gaussian.condition(idx_J_first_cond, x_J)
    gaussian.marginalize(idx_I_first_cond)

def mean_marg_first():
    gaussian = MeanGaussian(mu, sigma)
    gaussian.marginalize(idx_I_first_marg)
    gaussian.condition(idx_J_first_marg, x_J)

def mean_cond_first():
    gaussian = MeanGaussian(mu, sigma)
    gaussian.condition(idx_J_first_cond, x_J)
    gaussian.marginalize(idx_I_first_cond)

# Number of repetitions per measurement; a separate name avoids overwriting the sample count N
n_runs = 200
t_canonical_marg_first = timeit.timeit(canonical_marg_first, number=n_runs) / n_runs
t_canonical_cond_first = timeit.timeit(canonical_cond_first, number=n_runs) / n_runs
t_mean_marg_first = timeit.timeit(mean_marg_first, number=n_runs) / n_runs
t_mean_cond_first = timeit.timeit(mean_cond_first, number=n_runs) / n_runs

print("Canonical Parameterization times:")
print(f"marg first: {t_canonical_marg_first}")
print(f"cond first: {t_canonical_cond_first}")
print("Mean Parameterization times:")
print(f"marg first: {t_mean_marg_first}")
print(f"cond first: {t_mean_cond_first}")

Canonical Parameterization times:
marg first: 0.00699358800000482
cond first: 0.005307515499989677
Mean Parameterization times:
marg first: 0.0034045405000142637
cond first: 0.010268308999984583
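
One possible reading of these timings, based on the formulas above: in the mean parameterization, marginalization only selects sub-blocks while conditioning needs an inverse and matrix products, so marginalizing first shrinks the problem before the expensive step. In the canonical parameterization the roles are swapped, so conditioning first is the cheaper order. This matches the measurements (mean: marginalize first is faster; canonical: condition first is faster), although absolute times depend on the machine.

The sketch below checks, on a small random Gaussian in mean parameterization, that both orders give the same conditional distribution. The helper functions, the 6-dimensional example, and the seed are hypothetical and only serve as illustration.

```python
import numpy as np

def marginalize(mu, sigma, keep):
    """Keep only the dimensions in `keep` (pure indexing, no linear algebra)."""
    return mu[keep], sigma[np.ix_(keep, keep)]

def condition(mu, sigma, idx_J, x_J):
    """Condition on x_J; requires solves with the |J| x |J| block."""
    idx_I = [i for i in range(mu.shape[0]) if i not in idx_J]
    s_ij = sigma[np.ix_(idx_I, idx_J)]
    s_jj = sigma[np.ix_(idx_J, idx_J)]
    new_mu = mu[idx_I] + s_ij @ np.linalg.solve(s_jj, x_J - mu[idx_J])
    new_sigma = sigma[np.ix_(idx_I, idx_I)] - s_ij @ np.linalg.solve(s_jj, s_ij.T)
    return new_mu, new_sigma

# Hypothetical 6-dimensional Gaussian with a random positive definite covariance
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
mu, sigma = rng.normal(size=6), A @ A.T + 6 * np.eye(6)
x_J = rng.normal(size=2)

# Order 1: drop dimensions 4-5, then condition on dimensions 2-3
mu1, sigma1 = condition(*marginalize(mu, sigma, [0, 1, 2, 3]), [2, 3], x_J)
# Order 2: condition on dimensions 2-3, then keep the first two remaining dimensions
mu2, sigma2 = marginalize(*condition(mu, sigma, [2, 3], x_J), [0, 1])

orders_agree = np.allclose(mu1, mu2) and np.allclose(sigma1, sigma2)
print(orders_agree)
```

In order 2 the conditioning step works on all six dimensions and produces a 4x4 covariance of which only a 2x2 block is kept, which is the extra work that marginalizing first avoids.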
