# Data-gen notebook
This notebook was created to keep track of the methods Anders has used to generate data, and the associated sources.

## Contents:
1. Simple "<i>normal of normal</i>"
2. AR(p) processes

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal as mn

### 1. Simple "normal of normal" data generation
Based on page 51 of <b>Statistical inference of informational networks</b>.

We generate data according to
$$ X\overset{i.i.d.}{\sim}\mathcal{N}\left( 0, \sigma^2_x I \right) \\
 Y\overset{i.i.d.}{\sim}\mathcal{N}\left( X, \sigma^2_y I \right) $$

Then, with $d$ the dimension of $X$ and $Y$, the MI is given by
$$ I(X;\ Y) = \dfrac{d}{2}\log_2\left( 1 + \dfrac{\sigma^2_x}{\sigma^2_y} \right) $$
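
As a sanity check, the closed form $\frac{d}{2}\log_2\left(1 + \sigma^2_x/\sigma^2_y\right)$ that `gen_list` returns can be compared with the generic Gaussian log-determinant expression $\frac{1}{2}\log_2\left(|\Sigma_X||\Sigma_Y|/|\Sigma|\right)$. The sketch below is only a check, not part of the data generation. The helper names `mi_from_joint_cov` and `mi_closed_form` are hypothetical, and the joint covariance is built by hand under the assumption that $Y = X + \varepsilon$ with $\varepsilon$ independent of $X$.

```python
import numpy as np


def mi_from_joint_cov(var_x, var_y, dim):
    """Hypothetical helper: MI in bits computed from the joint covariance of (X, Y)."""
    eye = np.identity(dim)
    cov_x = var_x * eye              # Cov[X]
    cov_y = (var_x + var_y) * eye    # Cov[Y], since Y = X + independent noise
    cov_xy = var_x * eye             # Cov[X, Y]
    joint = np.block([[cov_x, cov_xy], [cov_xy, cov_y]])

    def logdet(m):
        # slogdet returns (sign, log|det|) using natural logarithms
        return np.linalg.slogdet(m)[1]

    # Convert from nats to bits at the end
    return 0.5 * (logdet(cov_x) + logdet(cov_y) - logdet(joint)) / np.log(2)


def mi_closed_form(var_x, var_y, dim):
    """Closed form used in gen_list."""
    return dim * np.log2(1 + var_x / var_y) / 2


print(mi_from_joint_cov(2.0, 0.5, 3), mi_closed_form(2.0, 0.5, 3))
```

The two printed values should agree up to floating-point error for any positive variances and any dimension.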

In [None]:
def gen_list(n, var_x, var_y, dim=2):
    """
    Generates a set of random points according to <statistical inference of information in networks>
    on page 51. The third variable, Z, is ignored as this is only used to create training data
    for MI, not CMI.

    The covariance matrices are both a scaled identity matrix.

    :param n: Number of points realized
    :param var_x: Variance of X
    :param var_y: Variance of Y
    :param dim: Dimension of the variables X and Y
    :returns values: A realization of the process described in the book, with shape [n, 2, dim],
                     where the second axis selects X (0) or Y (1). E.g. values[20, 0, 1] is the
                     second element of the 21st realization of X (zero-based indexing).
    :returns mi: The mutual information: I(X; Y)
    """
    values = np.zeros((n, 2, dim))  # init
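    # size=n draws n samples of dimension dim; the reshape guards against scipy squeezing the output when n or dim is 1
    values[:, 0, :] = mn.rvs(mean=np.zeros(dim), cov=var_x * np.identity(dim), size=n).reshape((n, dim))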

    for i, x in enumerate(values[:, 0, :]):
        values[i, 1, :] = mn.rvs(mean=x, cov=var_y * np.identity(dim), size=1)

    mi = dim * np.log2(1 + var_x / var_y) / 2

    return values, mi

### 2. AR(p) processes
An AR(p) process is given by
$$ x_k = \omega_k + \sum^{p}_{i=1}A_ix_{k-i}, \quad \omega_k \overset{i.i.d.}{\sim} \mathcal{N}\left( 0, \Sigma_\omega \right) $$

Since the noise has zero mean, $E[x_k]= \sum^{p}_{i=1}A_iE[x_{k-i}]$, so $E[x_k]=0$ for all $k$ when the initial values have zero mean. If we define $Var[x_k]=\sigma_k^2$ and $A_i(x)=A_ixA_i^\top$, we can see that
$
\begin{align}
Var[x_0] &= \sigma_0^2 \\
Var[x_1] &= A_1(\sigma_0^2) + \Sigma_\omega \\
Var[x_2] &= A_1(\sigma_1^2) + A_2(\sigma_0^2) + \Sigma_\omega \\
&\ \ \vdots \\
Var[x_k] &= \Sigma_\omega + \sum^{p}_{i=1} A_i(\sigma_{k-i}^2)
\end{align}
$
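This expression can be computed on the fly while we simulate our AR(p) process. Note that for $p>1$ it leaves out the cross-covariance terms $A_i\,Cov[x_{k-i}, x_{k-j}]A_j^\top$ with $i\neq j$, so it is exact for $p=1$ and in general an approximation for $p>1$. <i>NOTE: I believe there are equations that show what the variance converges to. In practice the variance converges very quickly.</i>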
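
For the special case $p=1$ the limit is known. When all eigenvalues of $A_1$ lie strictly inside the unit circle, the recursion converges to the solution of the discrete Lyapunov equation $\Sigma = A_1\Sigma A_1^\top + \Sigma_\omega$. The sketch below iterates the recursion and compares the result with `scipy.linalg.solve_discrete_lyapunov`. The matrices `A` and `noise_cov` are made-up example values and the function name `ar1_variance_recursion` is hypothetical.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov


def ar1_variance_recursion(A, noise_cov, var0, num_steps):
    """Iterate Var[x_k] = A Var[x_{k-1}] A^T + noise_cov for an AR(1) process."""
    var = var0.copy()
    for _ in range(num_steps):
        var = A @ var @ A.T + noise_cov
    return var


# Example values (assumptions, not taken from the notebook)
A = np.array([[0.5, 0.1],
              [0.0, 0.3]])
noise_cov = np.identity(2)
var0 = np.zeros((2, 2))  # deterministic start, x_0 = 0

var_k = ar1_variance_recursion(A, noise_cov, var0, num_steps=50)
var_inf = solve_discrete_lyapunov(A, noise_cov)  # solves X = A X A^T + Q

# Largest entrywise gap between the iterated and the stationary covariance
print(np.max(np.abs(var_k - var_inf)))
```

With these example values the gap shrinks geometrically with the number of steps, which matches the observation that the variance settles quickly.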

We can also easily argue that $x_k$ follows a multivariate Gaussian distribution (since it is just a linear combination of the $\omega_i$).

The marginal of a subset of $x$, $x_s$, has the distribution:
$$ x_s \sim \mathcal{N}\left( S\mu,S\Sigma S^\top \right) $$
where $\mu$ and $\Sigma$ are the mean and covariance of $x$. Here $S$ is defined by $s_{ij}=1$ if the $i$-th element of $x_s$ is the $j$-th element of $x$, and $s_{ij}=0$ otherwise.
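
A minimal sketch of how $S$ can be built and applied, assuming the subset is given as a list of zero-based indices. $S$ has one row per element of $x_s$, with a single 1 in the column of the corresponding element of $x$. The helper name `selection_matrix` is hypothetical and the numbers are made-up example values.

```python
import numpy as np


def selection_matrix(indices, n):
    """Return S with one row per selected element and a 1 in that element's column."""
    S = np.zeros((len(indices), n))
    S[np.arange(len(indices)), indices] = 1.0
    return S


# Example values (assumptions, not taken from the notebook)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.0, 0.3],
                  [0.1, 0.3, 1.5]])

S = selection_matrix([0, 2], n=3)
mu_s = S @ mu               # mean of the marginal
Sigma_s = S @ Sigma @ S.T   # covariance of the marginal
```

In NumPy the same result is obtained without forming $S$, via `mu[[0, 2]]` and `Sigma[np.ix_([0, 2], [0, 2])]`.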

If we want to examine the MI entry by entry, we can start from [this](https://www.math.nyu.edu/~kleeman/infolect7.pdf) lecture note. That said, it is probably something we can derive ourselves, or find a more solid source for later. On page 2 of the source we are given that
$$ I(X;Y)=\dfrac{1}{2}\log_2\left( \dfrac{|\Sigma_X| |\Sigma_Y|}{|\Sigma|} \right)$$
where $\Sigma$ is the covariance of the joint distribution of $X$ and $Y$.

In practice this means that if we take $x_k^{(i)}$ to mean the $i$-th entry of $x$ at index $k$, we get
$\begin{align}
I(X_k^{(i)}; X_k^{(j)}) &= \dfrac{1}{2}\log_2\left(\dfrac{ \det\left(\Sigma_{ii}\right) \det\left( \Sigma_{jj}\right)}{ \det\left( \begin{bmatrix} \Sigma_{ii} & \Sigma_{ij} \\ \Sigma_{ji} & \Sigma_{jj} \end{bmatrix}\right) }\right) \\
&= \dfrac{1}{2}\log_2\left(\dfrac{ \Sigma_{ii}\Sigma_{jj}}{\Sigma_{ii}\Sigma_{jj} -\Sigma_{ij}^2}\right)
\end{align}$
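
The determinant of the $2\times 2$ block is $\Sigma_{ii}\Sigma_{jj}-\Sigma_{ij}^2$ (using $\Sigma_{ji}=\Sigma_{ij}$), so the entrywise MI can be cross-checked numerically. The sketch below compares the determinant form with the equivalent form $-\frac{1}{2}\log_2\left(1-\rho_{ij}^2\right)$, where $\rho_{ij}$ is the correlation coefficient. That form follows by dividing numerator and denominator by $\Sigma_{ii}\Sigma_{jj}$. The function names and the covariance matrix are hypothetical example choices.

```python
import numpy as np


def mi_pair_det(cov, i, j):
    """MI in bits between entries i and j, via the 2x2 determinant."""
    sub = cov[np.ix_([i, j], [i, j])]
    return 0.5 * np.log2(cov[i, i] * cov[j, j] / np.linalg.det(sub))


def mi_pair_corr(cov, i, j):
    """Same quantity written with the correlation coefficient rho."""
    rho = cov[i, j] / np.sqrt(cov[i, i] * cov[j, j])
    return -0.5 * np.log2(1 - rho ** 2)


# Example values (assumptions, not taken from the notebook)
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])
print(mi_pair_det(cov, 0, 1), mi_pair_corr(cov, 0, 1))
```

Both expressions give zero for uncorrelated entries and grow without bound as $|\rho_{ij}|$ approaches 1.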

In [None]:
def MI(idx1, idx2, cov):
    """Calculates the mutual information between the marginals of a multivariate gaussian distribution: I(X1; X2)

    Args:
        idx1 (int): Index of X1 in the vector it came from
        idx2 (int): Index of X2 in the vector it came from
        cov (2D array float): Covariance matrix of the vector X, which X1 and X2 came from

    Returns:
        float: Returns the mutual information of X1 and X2
    """
    top = cov[idx1, idx1] * cov[idx2, idx2]
    bot = top - cov[idx1, idx2] ** 2  # determinant of the 2x2 covariance block

    return .5 * np.log2(top / bot)


def simulate_AR(coefficients, noise_cov, num_steps, x0=None):
    """Simulate the AR(p) process, with iid zero-mean gaussian additive noise.

    Args:
        coefficients (list of arrays): List of the coefficient matrices to use in the realization. Entries in list are 2D float arrays
        noise_cov (2D array float): Covariance matrix of the gaussian additive noise
        num_steps (int): Number of iterations for which the simulation should run
        x0 (1D float array, optional): The starting value of the process. If specified the derivations of the MI don't fit as of the end of september 2022. Defaults to None.
    
    Returns:
        array float: Returns the resulting array of the realization of the process
    """
    # Setup
    p = len(coefficients)
    dim = noise_cov.shape[0]
    results = np.zeros((num_steps + 1, dim))

    # initialize the initial value
    if x0 is None:
        x0 = np.zeros(dim)  # the state has the same dimension as the noise, not length p
    results[0] = x0

    # simulate the process
    for i in range(1, num_steps + 1):
        iter_sum = np.zeros(dim)

        # Only use lags that exist so far: for i < p there are just i earlier values.
        # (Negative indices wrap around in NumPy instead of raising IndexError.)
        for j in range(min(p, i)):
            iter_sum += coefficients[j] @ results[i - (j + 1)]
        
        # Save iteration to results
        iter_sum += np.random.multivariate_normal(mean=np.zeros(dim), cov=noise_cov)
        results[i] = iter_sum
    
    return results
    


# NOTES ON AR
Write a function that generates all the coefficient matrices we need, and a function that loads those coefficient matrices into RAM. Then we can write a data_gen script where we only need to specify an int to choose which data type we want to simulate.