# Internship Finn Sherry @ Sioux Mathware

---

# Bayesian grey-box system identification for thermal effects: VB using BayesPy
In this notebook, we apply the Variational Bayes (VB) implementation from [BayesPy](https://github.com/bayespy), developed by Luttinen, to a simplified model of our system to see whether VB is a viable approximation method for the rest of this project. This notebook continues on from the Variational Bayes part of my [main (Julia) notebook](sysid-thermal-AR.ipynb). 

Last update: 20-07-2022

$\renewcommand{\vec}[1]{\boldsymbol{\mathrm{#1}}}$
$\newcommand{\covec}[1]{\hat{\vec{#1}}}$
$\newcommand{\mat}[1]{\boldsymbol{\mathrm{#1}}}$
$\newcommand{\inv}[1]{#1^{-1}}$
$\newcommand{\Expectation}{\mathbb{E}}$
$\newcommand{\Variance}{\mathbb{V}}$

In [None]:
from bayespy.nodes import GaussianARD, GaussianMarkovChain, Gamma, Dot
from bayespy.inference import VB
from bayespy.utils import random
import numpy as np
from math import ceil
import matplotlib.pyplot as plt

## 2-D System: Adding in an extra block
### Physical Model
Next, we make the problem more complicated by considering a system of 2 materials with 2 temperature sensors. Ignoring input heats and assuming the influence of convection is negligible, we may write the equations governing the evolution of our system as
$$\mat{M} \dot{\vec{T}} = \mat{K} \vec{T},$$
where $\mat{M}$ is the (known) diagonal matrix of heat capacities $m_i c_{p, i}$ and $\mat{K}$ contains a single unknown conduction coefficient $k$:
$$\mat{K} = 
\begin{pmatrix}
-k & k \\
k & -k
\end{pmatrix}.$$

[comment]: # (and finally $\vec{T}_a$ is the ambient temperature at both blocks, which we will assume for the moment is in both cases constantly equal to some known $T_a$.)
Discretising using a forward difference, we now find by rearranging the terms that
$$
\vec{T}_{n + 1} = \underbrace{(\mat{I} + \Delta t \inv{\mat{M}} \mat{K})}_{\mat{A}} \vec{T}_n + \vec{q}_n,
$$
where $\vec{q}_n$ is the process noise. Finally, our measurement model is given by
$$
\vec{y}_n = \mat{I} \vec{T}_n + \vec{r}_n,
$$
where $\vec{r}_n$ is the normally distributed measurement noise.
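
As a quick sanity check of the discretisation, the sketch below builds $\mat{A}$ for the two-block system and verifies two properties that follow from the equations above: the forward-difference step conserves the total thermal energy $\vec{1}^\top \mat{M} \vec{T}$ (because the columns of $\mat{K}$ sum to zero), and one eigenvalue of $\mat{A}$ equals 1 while the other sets how fast the blocks equilibrate. The numerical values are the ones used in the simulation cells of this notebook; the helper name `forward_euler_transition` is only illustrative.

```python
import numpy as np

def forward_euler_transition(mcp, K, dt):
    """Return A = I + dt * M^{-1} K for a diagonal heat-capacity matrix M."""
    mcp = np.asarray(mcp, dtype=float)
    # M is diagonal, so M^{-1} K simply divides row i of K by mcp[i].
    return np.identity(len(mcp)) + dt * K / mcp[:, None]

k = 5.0                       # conduction coefficient (value from the simulation cell)
mcp = np.array([2e3, 1.5e3])  # heat capacities m_i * c_{p, i}
K = np.array([[-k, k], [k, -k]])
A = forward_euler_transition(mcp, K, dt=1.0)

T0 = np.array([270.0, 330.0])
T1 = A @ T0

# Columns of K sum to zero, so one step leaves the total thermal energy unchanged.
energy_before = mcp @ T0
energy_after = mcp @ T1

# One eigenvalue is 1 (the equilibrium mode); the other one sets the decay rate.
eigenvalues = np.sort(np.linalg.eigvals(A).real)
# The equilibrium temperature is the heat-capacity-weighted mean of the initial state.
T_equilibrium = energy_before / mcp.sum()
print(eigenvalues, T_equilibrium)
```

Because the second eigenvalue is very close to 1, the temperatures change slowly compared to the measurement noise, which is worth keeping in mind when judging the parameter estimates later on.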



[comment]: # (These are subject to the following dynamics:)

[comment]: # ($$M\dot{T} = K(\theta)T + B(\theta)u \, ,$$)

[comment]: # (where the material properties are described by:)

[comment]: # ($$M = \begin{bmatrix} m_1 c_{p, 1} & 0 & 0 \\
0 & m_2 c_{p, 2} & 0 \\
0 & 0 & m_3 c_{p, 3} \end{bmatrix} \, ,$$)

[comment]: # (the conductance by:)

[comment]: # ($$K(\theta) = \begin{bmatrix} -h_{12} -h_a(T_1) & h_{12} & 0 \\
h_{12} & -h_{12}-h_{23}-h_a(T_2) & h_{23} \\
0 & h_{23} & -h_{23}-h_a(T_3) \end{bmatrix} \, ,$$)

[comment]: # (and the input by:)

[comment]: # ($$B(\theta) u = \begin{bmatrix} \, h_a(T_1) \\ h_a(T_2) \\ h_a(T_3) \, \end{bmatrix} T_a \, ,$$)

[comment]: # "where $u=T_a$ represents the ambient temperature. We assume $M$ is matrix of (known) constants. Later on, we will add a convection term. "
 
[comment]: # (The parameters consist of two contact conductances $h_{12}$, $h_{23}$, and the effect of the ambient temperature $h_a$:)

[comment]: # ($$\theta = \begin{bmatrix} \, h_{12} & h_{23} & h_a(T_1) & h_a(T_2) & h_a(T_3) \, \end{bmatrix}^{\top} \, .$$)

[comment]: # (<mark>It is unclear to me what the $h_a(T_i)$ represent. Considering that they're written as functions, I assume that there is a temperature dependence $\theta(T)$. But for now, I am going to assume that they are static parameters. </mark>)

[comment]: # (If we discretize, we get:)

[comment]: # ($$\begin{align}
M\dot{T} &= K(\theta)T + B(\theta)u \\
\dot{T} &= M^{-1} K(\theta)T + M^{-1}B(\theta)u \\
\frac{T_{k+1} - T_k}{\Delta t} &= M^{-1} K(\theta)T_k + M^{-1}B(\theta)u_k \\
T_{k+1} - T_k &= \Delta t M^{-1} K(\theta)T_k + \Delta t M^{-1}B(\theta)u_k \\
T_{k+1} &= \big(\Delta t M^{-1} K(\theta) + I \big)T_k + \Delta t M^{-1}B(\theta)u_k \, .\\
\end{align}$$)

[comment]: # (We can view this as a vector autoregressive model with exogenous input.)

[comment]: # (## Analytical Solution)

[comment]: # (If we indeed assume $\eta$ to be constant, we can in this case still get a closed form analytical description of the evolution of the posterior of $\theta$.)

[comment]: # "In Mathematica, we define the prior as 
$$p(\theta \mid T_k) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(\theta - \mu)^2}{2 \sigma^2}\right),$$
and the likelihood as
$$p(T_{k + 1} \mid \theta, T_k) = \sqrt{\frac{\gamma}{2 \pi}} \exp\left(-\gamma \frac{(T_{k + 1} - \theta T_k - \eta T_a)^2}{2}\right).$$
Then we simply find the resulting posterior by computing (using Bayes' rule)
$$p(\theta \mid T_{k + 1}, T_k) = \frac{p(T_{k + 1} \mid \theta, T_k) p(\theta \mid T_k)}{\int_{-\infty}^{\infty} p(T_{k + 1} \mid \theta, T_k) p(\theta \mid T_k) d \theta}.$$
Mathematica tells us (after applying FullSimplify) that
$$p(\theta \mid T_{k + 1}, T_k) = \sqrt{\frac{1 + T_k^2 \gamma \sigma^2}{2 \pi \sigma^2}} \exp\left(-\frac{1 + T_k^2 \gamma \sigma^2}{2 \sigma^2} \left(\theta - \frac{\mu + T_k T_{k + 1} \gamma \sigma^2 + \eta T_a T_k \gamma \sigma^2}{1 + T_k^2 \gamma \sigma^2}\right)^2\right),$$
from which we can see that $\theta \mid T_{k + 1}, T_k \sim \mathcal{N}(\mu_1, \sigma_1^2)$, with
$$\mu_1 = \frac{\mu + (T_{k + 1} + \eta T_a) T_k  \gamma \sigma^2}{1 + T_k^2 \gamma \sigma^2}, \text{ and } \sigma_1^2 = \frac{\sigma^2}{1 + T_k^2 \gamma \sigma^2}.$$"


[comment]: # "From these equations, we obtain a probabilistic model of the form:
$$p(T_{n + 1} \mid \theta, T_k) = \mathcal{N}(T_{n + 1} \mid \theta T_n + (1 - \theta) T_a, \gamma^{-1}) \, .$$
We should pose a prior on $\theta$ and we apply the Autoregressive model from ReactiveMP.jl: https://biaslab.github.io/ReactiveMP.jl/stable/examples/autoregressive/. For now, let's fix $\gamma = 1e8$ and estimate it simultaneously later on."

### Simulate Data
We start by visualising the evolution of the temperatures in our system. Since we have no input heats or convection, this should be pretty boring. We generate an unreasonably large amount of data to give the inference the greatest possible chance of succeeding.

In [None]:
# System Properties
M = 2 # Observation dimension
D = 2 # Latent dimension
N = 1001 # Observations
Dt = 1. # Time step for discretisation
times = [Dt * i for i in range(N)]

In [None]:
# State properties
k = 5 # Conduction coefficient
mcp_1 = 2e3
mcp_2 = 1.5e3
Mconst = np.diag([mcp_1, mcp_2])
# a = np.linalg.inv(np.identity(D) - Dt * np.matmul(np.linalg.inv(Mconst), np.array([[-k, k], [k, -k]]))) ### Backward Euler
a = np.identity(D) + Dt * np.matmul(np.linalg.inv(Mconst), np.array([[-k, k], [k, -k]])) # Transition matrix A
x_1_0 = 270
x_2_0 = 330
std_noise_x = 0 # No process noise because unidentifiability makes that too messy for the moment
std_noise_y = 1

In [None]:
# Generate data
x = np.empty((N, D))
y = np.empty((N, M))
x[0] = np.array([x_1_0, x_2_0])
y[0] = x[0] + std_noise_y * np.random.randn(M)
for n in range(N - 1):
    x[n + 1] = np.dot(a, x[n]) + std_noise_x * np.random.randn(D)
    y[n + 1] = x[n + 1] + std_noise_y * np.random.randn(M)

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (8, 8));
ax.grid(True);
ax.set_prop_cycle('color', ["b", "r"]);
ax.plot(times, x, label = [r"$T_1$", r"$T_2$"]);
ax.scatter(times, y[:, 0], label = r"$y_1$");
ax.scatter(times, y[:, 1], label = r"$y_2$");
ax.set_xlim(times[0], ceil(times[-1]));
ax.set_ylim(260, 340);
ax.set_xlabel(r"$t~(\mathrm{s})$");
ax.set_ylabel(r"$T~(\mathrm{K})$");
ax.legend();

### Probabilistic Model
We can now define the priors of our system. $\alpha$ and $\nu$ are so-called Automatic Relevance Determination (ARD) parameters, which should automatically make less relevant elements in our matrices play a smaller role in the inference. Note that BayesPy's `Gamma` uses the shape-rate parametrisation: if we define some random variable $x \sim \Gamma(a, b)$, then 
$$\Expectation[x] = \frac{a}{b}, \text{ and } \Variance(x) = \frac{a}{b^2}.$$
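Hence, for an uninformative prior, we should make $b$ small (with $a$ of the same order, as in `Gamma(1e-5, 1e-5)`, so that the mean stays moderate while the variance becomes large).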
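
To make the shape-rate convention concrete, here is a small sketch that evaluates the two moment formulas for a few hyperparameter pairs and checks one of them against samples. It uses plain NumPy rather than BayesPy, and the helper `gamma_moments` is a name introduced here for illustration. Note that NumPy parametrises the Gamma distribution by shape and scale, so the rate enters as `scale = 1 / b`.

```python
import numpy as np

def gamma_moments(a, b):
    """Mean and variance of a Gamma(shape=a, rate=b) random variable."""
    return a / b, a / b**2

for a, b in [(1e-5, 1e-5), (1.0, 1.0), (50.0, 1.0)]:
    mean, var = gamma_moments(a, b)
    print(f"Gamma({a:g}, {b:g}): mean = {mean:g}, variance = {var:g}")

# Monte Carlo check for Gamma(50, 1); NumPy expects scale = 1 / rate.
rng = np.random.default_rng(0)
samples = rng.gamma(shape=50.0, scale=1.0 / 1.0, size=200_000)
sample_mean, sample_var = samples.mean(), samples.var()
print(sample_mean, sample_var)
```

With $a = b = 10^{-5}$, the mean is 1 but the variance is $10^5$, which is what makes that prior so vague.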

$\mat{A}$ is our state transition matrix. We define a prior per column, with a corresponding ARD parameter. This is not really a sensible thing to do, however: there is no reason to believe that elements within a column are roughly equally relevant, while those in different columns could still be equally relevant. It would be more reasonable to have a parameter for each entry. Moreover, while entries in a column can be dependent, those in different columns are treated as independent, which is clearly not true. Regardless, for the time being we work with this setup, since it somewhat works. If the results are promising, we can come back and make improvements.

We start by defining the probabilistic model...

In [None]:
def model(M, N, D, nu_a, nu_b):
    # Transition
    alpha = Gamma(1e-5, 1e-5, plates = (D,), name = "alpha") # ARD
    A = GaussianARD(0, alpha, shape = (D,), plates = (D,), name = "A")
    A.initialize_from_value(np.identity(D))
    # Process
    nu = Gamma(nu_a, nu_b, plates = (D,), name = "nu") # Innovation: diagonal precision of the process noise
    X = GaussianMarkovChain(300 * np.ones(D), np.identity(D) / 900, A, nu, n = N, name = "X")
    X.initialize_from_value(np.random.randn(N, D))
    # Observation
    tau = Gamma(1e0, 1e0, name = "tau")
    tau.initialize_from_value(1e0)
    Y = GaussianARD(X, tau, name = "Y")
    # Variational Inference
    Q = VB(X, A, alpha, nu, tau, Y)
    return Q

... and subsequently perform VB.

In [None]:
def infer(y, M, N, D, nu_a, nu_b):
    Q = model(M, N, D, nu_a, nu_b)
    Q['Y'].observe(y)
    Q.update()
    return Q

In [None]:
my_Q = infer(y, M, N, D, 50, 1)
inferred_X = my_Q['X'].get_moments()[0]
inferred_A = my_Q['A'].get_moments()[0]
inferred_tau = my_Q['tau'].get_moments()[0]
inferred_nu = my_Q['nu'].get_moments()[0]

Only a single iteration was run: as far as I can tell, `Q.update()` performs one iteration unless `repeat` is passed, so this by itself does not show that the inference has converged. Still, in less than a second, we were able to perform batch estimation using a dataset of about 2000 observations. I could imagine that a single recursive update step would then take mere milliseconds; VB indeed seems quite suitable for online estimation. However, we should also see whether the state and parameter estimates are any good.

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (8, 8));
ax.grid(True);
ax.set_prop_cycle('color', ["b", "r"])
ax.plot(times, x, label = [r"$T_1$", r"$T_2$"]);
ax.scatter(times, y[:, 0], label = r"$y_1$");
ax.scatter(times, y[:, 1], label = r"$y_2$");
ax.set_prop_cycle('color', ["g", "orange"])
ax.plot(times, inferred_X, label = [r"$T_1$ Bayes", r"$T_2$ Bayes"])
ax.set_xlim(times[0], round(times[-1]));
ax.set_ylim(260, 340);
ax.set_xlabel(r"$t~(\mathrm{s})$");
ax.set_ylabel(r"$T~(\mathrm{K})$");
ax.legend();
# fig.savefig("Results/Explore/VB/2_blocks.pdf", bbox_inches = "tight")

One thing to note is that the Bayesian MAP estimate for the state is somewhat jagged. Since we have no process noise, it would be desirable for the estimate to be smooth. In BayesPy's `GaussianMarkovChain`, the innovation $\nu$ is the (diagonal) precision of the process noise. If we make it too small, the MAP estimate goes through all the observations, seemingly ignoring the underlying dynamics; if we make it too large, the MAP estimate seems to completely ignore all observations. In general, the estimate is very sensitive to the prior of $\nu$.
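
The trade-off described above can be illustrated with a much simpler stand-in: the steady-state Kalman gain of a scalar model $x_{n + 1} = a x_n + q_n$, $y_n = x_n + r_n$, where $q_n$ has precision $\nu$ and $r_n$ has precision $\tau$. This is only a sketch under assumptions: it is a filter with known dynamics, not the smoother with unknown $\mat{A}$ that BayesPy runs, and the function `steady_state_gain` is introduced here purely for illustration. The gain is the weight given to each new observation.

```python
import numpy as np

def steady_state_gain(nu, tau, a=1.0, iterations=10_000):
    """Steady-state Kalman gain of the scalar model x' = a x + q, y = x + r.

    nu and tau are the precisions of the process noise q and measurement noise r.
    """
    Q, R = 1.0 / nu, 1.0 / tau
    P = R  # initial filtering variance; the fixed point does not depend on it
    for _ in range(iterations):
        P_pred = a**2 * P + Q          # predict: variance grows by the process noise
        gain = P_pred / (P_pred + R)   # weight given to the new observation
        P = (1.0 - gain) * P_pred      # update: variance shrinks after observing
    return gain

gains = {nu: steady_state_gain(nu, tau=1.0) for nu in [1.0, 50.0, 100.0, 1e4]}
for nu, gain in gains.items():
    print(f"nu = {nu:g}: gain = {gain:.4f}")
```

A small $\nu$ gives a gain close to 1 (the estimate chases the observations), while a large $\nu$ gives a gain close to 0 (the observations are nearly ignored), which matches the behaviour seen in the plots.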

Next, we extract information about our unknown quantity $k$. Here we run into a disadvantage of using VB in this way: we never defined $k$ as a quantity in our probabilistic model, only $\mat{A}$. Hence, we somehow have to use our posterior for $\mat{A}$ and the dependence of $\mat{A}$ on $k$ to estimate $k$. For a point estimate, I have the following straightforward idea: simply compute $k$ from each component of the MAP estimate of $\mat{A}$, and take the average.

Recall that
$$\mat{A} = \mat{I} + \Delta t \inv{\mat{M}} \mat{K}.$$
Consequently, we can conclude that
$$\mat{K} = \frac{1}{\Delta t} \mat{M} (\mat{A} - \mat{I}).$$
Then, our point estimate for $k$ will be 
$$\hat{k} = \frac{-K_{1, 1} + K_{1, 2} + K_{2, 1} - K_{2, 2}}{4}.$$
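
Before applying this to the inferred matrix, it helps to check the formula on the true transition matrix, and to see how an error in a single entry of $\mat{A}$ carries over to $\hat{k}$. The sketch below does both; `k_from_transition` is a helper name introduced here, and the perturbation size of $10^{-3}$ is an arbitrary choice for illustration.

```python
import numpy as np

def k_from_transition(A, mcp, dt):
    """Average the four entries of K = M (A - I) / dt, each of which equals +k or -k."""
    K = np.diag(mcp) @ (A - np.identity(2)) / dt
    return (-K[0, 0] + K[0, 1] + K[1, 0] - K[1, 1]) / 4

k_true, dt = 5.0, 1.0
mcp = np.array([2e3, 1.5e3])
K_true = np.array([[-k_true, k_true], [k_true, -k_true]])
A_true = np.identity(2) + dt * K_true / mcp[:, None]

# Round trip: the true A should give back the true k.
k_roundtrip = k_from_transition(A_true, mcp, dt)

# Perturb a single entry of A by 1e-3 to see how errors in A propagate to k.
A_perturbed = A_true.copy()
A_perturbed[0, 1] += 1e-3
k_perturbed = k_from_transition(A_perturbed, mcp, dt)
print(k_roundtrip, k_perturbed)
```

Since the entries of $\mat{A} - \mat{I}$ are multiplied by heat capacities of order $10^3$, an error of $10^{-3}$ in one entry of $\mat{A}$ already shifts $\hat{k}$ by 0.5, i.e. 10 % of the true value. The estimate of $k$ is therefore very sensitive to small errors in $\mat{A}$.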

In [None]:
inferred_K = np.matmul(Mconst, inferred_A - np.identity(2)) / Dt
inferred_k = (-inferred_K[0, 0] + inferred_K[0, 1] + inferred_K[1, 0] - inferred_K[1, 1]) / 4
k, inferred_k, (inferred_k - k) / k

This point estimate is pretty good, but the influence of $\nu$ is very big. Changing its prior (say to `Gamma(1, 1)`) drastically changes the quality of the estimate (with this change, the estimate is 60 % higher than the true value).

## 3-D System: Another one
### Physical Model
We now augment the problem by adding in an extra block. The dynamics of the system are still governed by
$$ \mat{M} \dot{\vec{T}} = \mat{K} \vec{T},$$
but now $\mat{K}$ contains two unknown conduction coefficients $k_{12}$ and $k_{23}$:
$$\mat{K} = 
\begin{pmatrix}
-k_{12} & k_{12} & 0 \\
k_{12} & -(k_{12} + k_{23}) & k_{23} \\
0 & k_{23} & -k_{23}
\end{pmatrix}.$$
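
The pattern in $\mat{K}$ generalises to any chain of blocks in which only neighbours exchange heat. As a sketch, the hypothetical helper `chain_conduction_matrix` below assembles $\mat{K}$ from a list of conduction coefficients, so that the 2-block and 3-block matrices in this notebook are special cases; the values $k_{12} = 5$ and $k_{23} = 4$ are the ones used in the simulation cells.

```python
import numpy as np

def chain_conduction_matrix(conductances):
    """Build K for a chain of blocks in which block i and block i + 1 exchange heat."""
    n = len(conductances) + 1
    K = np.zeros((n, n))
    for i, k in enumerate(conductances):
        # Block i gains k * (T[i + 1] - T[i]) and block i + 1 loses the same amount.
        K[i, i] -= k
        K[i, i + 1] += k
        K[i + 1, i] += k
        K[i + 1, i + 1] -= k
    return K

K3 = chain_conduction_matrix([5.0, 4.0])  # k12 = 5, k23 = 4
print(K3)
```

By construction, the result is symmetric and every row sums to zero, so a uniform temperature is an equilibrium of the dynamics.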

### Simulate Data
Let's again start by visualising the evolution of the temperatures in our system. 

In [None]:
# System Properties
M = 3 # Observation dimension
D = 3 # Latent dimension
N = 1001 # Observations
Dt = 1. # Time step for discretisation
times = [Dt * i for i in range(N)]

In [None]:
# State properties
k12 = 5 # Conduction coefficient
k23 = 4
mcp_1 = 2e3
mcp_2 = 1.5e3
mcp_3 = 2.5e3
Mconst = np.diag([mcp_1, mcp_2, mcp_3])
# a = np.linalg.inv(np.identity(D) - Dt * np.matmul(np.linalg.inv(Mconst), np.array([[-k12, k12, 0], [k12, -(k12 + k23), k23], [0, k23, -k23]]))) ### Backward Euler
a = np.identity(D) + Dt * np.matmul(np.linalg.inv(Mconst), np.array([[-k12, k12, 0], [k12, -(k12 + k23), k23], [0, k23, -k23]])) # Transition matrix A
x_1_0 = 270
x_2_0 = 330
x_3_0 = 320
std_noise_x = 0 # No process noise because unidentifiability makes that too messy for the moment
std_noise_y = 1

In [None]:
# Generate data
x = np.empty((N, D))
y = np.empty((N, M))
x[0] = np.array([x_1_0, x_2_0, x_3_0])
y[0] = x[0] + std_noise_y * np.random.randn(M)
for n in range(N - 1):
    x[n + 1] = np.dot(a, x[n]) + std_noise_x * np.random.randn(D)
    y[n + 1] = x[n + 1] + std_noise_y * np.random.randn(M)

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (8, 8));
ax.grid(True);
ax.set_prop_cycle('color', ["b", "r", "purple"]);
ax.plot(times, x, label = [r"$T_1$", r"$T_2$", r"$T_3$"]);
ax.scatter(times, y[:, 0], label = r"$y_1$");
ax.scatter(times, y[:, 1], label = r"$y_2$");
ax.scatter(times, y[:, 2], label = r"$y_3$");
ax.set_xlim(times[0], round(times[-1]));
ax.set_ylim(260, 340);
ax.set_xlabel(r"$t~(\mathrm{s})$");
ax.set_ylabel(r"$T~(\mathrm{K})$");
ax.legend();

### Probabilistic Model
Because of the way we set up the probabilistic model for the case with 2 blocks, we can easily reuse the existing functions. Hence, we can immediately perform parameter identification. First, we pose the prior $\Gamma(100, 1)$ on $\nu$.

In [None]:
my_Q_medium_precision = infer(y, M, N, D, 100, 1)
inferred_X_medium_precision = my_Q_medium_precision['X'].get_moments()[0]
inferred_A_medium_precision = my_Q_medium_precision['A'].get_moments()[0]
inferred_tau_medium_precision = my_Q_medium_precision['tau'].get_moments()[0]

Adding the extra dimensions does not seem to have slowed down inference. This is a promising sign: future model expansions might similarly influence computation time in a negligible manner.

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (8, 8));
ax.grid(True);
ax.set_prop_cycle('color', ["b", "r", "purple"]);
ax.plot(times, x, label = [r"$T_1$", r"$T_2$", r"$T_3$"]);
ax.scatter(times, y[:, 0], label = r"$y_1$");
ax.scatter(times, y[:, 1], label = r"$y_2$");
ax.scatter(times, y[:, 2], label = r"$y_3$");
ax.set_prop_cycle('color', ["g", "orange", "y"]);
ax.plot(times, inferred_X_medium_precision, label = [r"$T_1$ Bayes", r"$T_2$ Bayes", r"$T_3$ Bayes"])
ax.set_xlim(times[0], round(times[-1]));
ax.set_ylim(260, 340);
ax.set_xlabel(r"$t~(\mathrm{s})$");
ax.set_ylabel(r"$T~(\mathrm{K})$");
ax.legend();

In [None]:
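# Note: `a` has exact zeros at [0, 2] and [2, 0], so the relative error is not finite there
inferred_A_medium_precision, a, (inferred_A_medium_precision - a) / a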

In [None]:
inferred_K = np.matmul(Mconst, inferred_A_medium_precision - np.identity(3)) / Dt
inferred_k12 = (-inferred_K[0, 0] + inferred_K[1, 0] + inferred_K[0, 1]) / 3
inferred_k23 = (-inferred_K[2, 2] + inferred_K[2, 1] + inferred_K[1, 2]) / 3
k12, inferred_k12, (inferred_k12 - k12) / k12, k23, inferred_k23, (inferred_k23 - k23) / k23

Now, the estimates are not very good. They are again very sensitive to the prior of $\nu$. Maybe part of the problem is the ARD. We also consider other priors for $\nu$.

In [None]:
my_Q_low_precision = infer(y, M, N, D, 1, 1)
inferred_X_low_precision = my_Q_low_precision['X'].get_moments()[0]
inferred_A_low_precision = my_Q_low_precision['A'].get_moments()[0]
inferred_tau_low_precision = my_Q_low_precision['tau'].get_moments()[0] 

In [None]:
my_Q_high_precision = infer(y, M, N, D, 10000, 1)
inferred_X_high_precision = my_Q_high_precision['X'].get_moments()[0]
inferred_A_high_precision = my_Q_high_precision['A'].get_moments()[0]
inferred_tau_high_precision = my_Q_high_precision['tau'].get_moments()[0] 

In [None]:
fig, ax = plt.subplots(1, 3, figsize = (24, 8));
ax[0].grid(True);
ax[0].set_prop_cycle('color', ["b", "r", "purple"]);
ax[0].plot(times, x, label = [r"$T_1$", r"$T_2$", r"$T_3$"]);
ax[0].scatter(times, y[:, 0], label = r"$y_1$");
ax[0].scatter(times, y[:, 1], label = r"$y_2$");
ax[0].scatter(times, y[:, 2], label = r"$y_3$");
ax[0].set_prop_cycle('color', ["g", "orange", "y"]);
ax[0].plot(times, inferred_X_low_precision, label = [r"$T_1$ Bayes", r"$T_2$ Bayes", r"$T_3$ Bayes"])
ax[0].set_xlim(times[0], round(times[-1]));
ax[0].set_ylim(260, 340);
ax[0].set_xlabel(r"$t~(\mathrm{s})$");
ax[0].set_ylabel(r"$T~(\mathrm{K})$");
ax[0].set_title(r"$\nu \sim \mathrm{Gamma}(1, 1)$");
ax[0].legend();
ax[1].grid(True);
ax[1].set_prop_cycle('color', ["b", "r", "purple"]);
ax[1].plot(times, x, label = [r"$T_1$", r"$T_2$", r"$T_3$"]);
ax[1].scatter(times, y[:, 0], label = r"$y_1$");
ax[1].scatter(times, y[:, 1], label = r"$y_2$");
ax[1].scatter(times, y[:, 2], label = r"$y_3$");
ax[1].set_prop_cycle('color', ["g", "orange", "y"]);
ax[1].plot(times, inferred_X_medium_precision, label = [r"$T_1$ Bayes", r"$T_2$ Bayes", r"$T_3$ Bayes"])
ax[1].set_xlim(times[0], round(times[-1]));
ax[1].set_ylim(260, 340);
ax[1].set_title(r"$\nu \sim \mathrm{Gamma}(100, 1)$");
ax[1].set_xlabel(r"$t~(\mathrm{s})$");
ax[1].set_ylabel(r"$T~(\mathrm{K})$");
ax[1].legend();
ax[2].grid(True);
ax[2].set_prop_cycle('color', ["b", "r", "purple"]);
ax[2].plot(times, x, label = [r"$T_1$", r"$T_2$", r"$T_3$"]);
ax[2].scatter(times, y[:, 0], label = r"$y_1$");
ax[2].scatter(times, y[:, 1], label = r"$y_2$");
ax[2].scatter(times, y[:, 2], label = r"$y_3$");
ax[2].set_prop_cycle('color', ["g", "orange", "y"]);
ax[2].plot(times, inferred_X_high_precision, label = [r"$T_1$ Bayes", r"$T_2$ Bayes", r"$T_3$ Bayes"])
ax[2].set_xlim(times[0], round(times[-1]));
ax[2].set_ylim(260, 340);
ax[2].set_title(r"$\nu \sim \mathrm{Gamma}(10^4, 1)$");
ax[2].set_xlabel(r"$t~(\mathrm{s})$");
ax[2].set_ylabel(r"$T~(\mathrm{K})$");
ax[2].legend();
# fig.savefig("Results/Explore/VB/3_blocks_compare_priors.pdf", bbox_inches = "tight")

The other priors do not appear to perform much better. When we use $\Gamma(1, 1)$, the state estimates are very jagged, which suggests that the grey-box model is largely being ignored in favour of the data. Conversely, the state estimates resulting from the $\Gamma(10^4, 1)$ prior are nice and smooth, but seem to ignore the measurement data: especially for small $t$, the estimates are way off.

We will finally compare the parameter estimates:

In [None]:
inferred_K = np.matmul(Mconst, inferred_A_low_precision - np.identity(3)) / Dt
inferred_k12 = (-inferred_K[0, 0] + inferred_K[1, 0] + inferred_K[0, 1]) / 3
inferred_k23 = (-inferred_K[2, 2] + inferred_K[2, 1] + inferred_K[1, 2]) / 3
k12, inferred_k12, (inferred_k12 - k12) / k12, k23, inferred_k23, (inferred_k23 - k23) / k23

In [None]:
inferred_K = np.matmul(Mconst, inferred_A_medium_precision - np.identity(3)) / Dt
inferred_k12 = (-inferred_K[0, 0] + inferred_K[1, 0] + inferred_K[0, 1]) / 3
inferred_k23 = (-inferred_K[2, 2] + inferred_K[2, 1] + inferred_K[1, 2]) / 3
k12, inferred_k12, (inferred_k12 - k12) / k12, k23, inferred_k23, (inferred_k23 - k23) / k23

In [None]:
inferred_K = np.matmul(Mconst, inferred_A_high_precision - np.identity(3)) / Dt
inferred_k12 = (-inferred_K[0, 0] + inferred_K[1, 0] + inferred_K[0, 1]) / 3
inferred_k23 = (-inferred_K[2, 2] + inferred_K[2, 1] + inferred_K[1, 2]) / 3
k12, inferred_k12, (inferred_k12 - k12) / k12, k23, inferred_k23, (inferred_k23 - k23) / k23

Of the options shown here (and indeed the many other combinations of prior hyperparameters that we tried), $\Gamma(100, 1)$ performs the best. However, the parameter estimates are still pretty bad. Notably, we see that the quality of the inference for the two conduction parameters tends to differ: when a prior gives a good estimate for $k_{12}$, it often gives a bad estimate for $k_{23}$. There does not appear to be an optimal prior that gives acceptable parameter estimates for both conduction coefficients simultaneously. Moreover, we improved the inference by playing around with the prior. There is no reason to believe that the values we found will work well in general, and we have no way to derive good hyperparameters from theory. 

## Conclusion
There are numerous issues with VB that make it unsuitable for application in this project, although most of them could probably be solved by investing a sufficient amount of time:
- Even without real process noise, the inference is very sensitive to the prior of the innovation $\nu$;
- The ARD as we currently apply it in BayesPy does not make a lot of sense for our problem. We would have to implement a componentwise (instead of columnwise) ARD. That might help with the sensitivity to the prior of the innovation too;
- Our problem is somewhat more complex, involving also convection and input heats. It is not clear how these can be added using existing software;
- It is not clear how to convert knowledge of the distribution of components in matrices into posteriors for the underlying parameters.
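
Regarding the last point, one partial route is available for the averaging estimator used in the 2-block case: since $\hat{k}$ is a linear function of the entries of $\mat{A}$, a Gaussian posterior on those entries maps to a Gaussian on $\hat{k}$. The sketch below shows this under loud assumptions: the posterior covariance of $\mat{A}$ is made up (independent entries with standard deviation $10^{-3}$), the helper `k_posterior_from_A` is hypothetical, and nothing here enforces that the entries of $\mat{A}$ are consistent with a single $k$, so it does not resolve the underlying issue.

```python
import numpy as np

def k_posterior_from_A(A_mean, A_cov, mcp, dt):
    """Mean and variance of k_hat when vec(A) ~ N(vec(A_mean), A_cov), row-major."""
    signs = np.array([[-1.0, 1.0], [1.0, -1.0]])
    # k_hat = sum_ij c_ij * (A_ij - I_ij), with c_ij = signs_ij * mcp_i / (4 dt).
    c = (signs * np.asarray(mcp)[:, None] / (4 * dt)).ravel()
    mean = c @ (A_mean - np.identity(2)).ravel()
    var = c @ A_cov @ c
    return mean, var

mcp, dt, k = np.array([2e3, 1.5e3]), 1.0, 5.0
A_mean = np.identity(2) + dt * np.array([[-k, k], [k, -k]]) / mcp[:, None]
A_cov = 1e-6 * np.identity(4)  # hypothetical: independent entries with std 1e-3
k_mean, k_var = k_posterior_from_A(A_mean, A_cov, mcp, dt)
print(k_mean, np.sqrt(k_var))
```

Even in this idealised setting, an uncertainty of $10^{-3}$ per entry of $\mat{A}$ translates into a standard deviation of almost 0.9 on $\hat{k}$, which again shows how demanding the accuracy requirements on $\mat{A}$ are.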

Consequently, we will look for another approximate Bayesian inference method. We continue in the main [Julia Jupyter notebook](sysid-thermal-AR.ipynb). _try to link to correct section_