# Defining the expected free energy

The expected free energy (EFE) is a variational approximation that upper-bounds the expected future surprisal $S(\pi)$, and is normally defined as

$$S(\pi) = E_{\tilde{Q}}\left[\ln Q(\tilde{s}|\pi) -\ln Q(\tilde{s}|\tilde{o}, \pi) - \ln P(\tilde{o}) \right] 
    \leq E_{\tilde{Q}}\left[\ln Q(\tilde{s}|\pi) - \ln P(\tilde{o}, \tilde{s}) \right]  \equiv G(\pi) $$

where $\tilde{Q}(\tilde{o}, \tilde{s}|\pi) = P(\tilde{o}|\tilde{s})Q(\tilde{s}|\pi)$.

Starting from the expression above, one can either minimise the EFE

* $$ G(\pi) = D_{KL}\left(Q(\tilde{s}|\pi)||P(\tilde{s})\right) + E_{Q(\tilde{s}|\pi)}\left[H[\tilde{o}|\tilde{s}] \right]$$

or directly minimise the expected surprisal bounded by the EFE

* $$ S(\pi) = E_{\tilde{Q}}\left[ \ln Q(\tilde{s}|\pi) - \ln Q(\tilde{s}|\tilde{o}, \pi) - \ln P(\tilde{o}) \right]$$

* $$ S(\pi) =D_{KL}\left(Q(\tilde{o}|\pi)||P(\tilde{o}) \right) - E_{\tilde{Q}}\left[\ln P(\tilde{o}|\tilde{s}) \right]$$ 
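
Both decompositions follow from the factorisation of $\tilde{Q}$. As a brief sketch of the intermediate steps, using $P(\tilde{o}, \tilde{s}) = P(\tilde{o}|\tilde{s})P(\tilde{s})$, reading $H[\tilde{o}|\tilde{s}]$ as the entropy of $P(\tilde{o}|\tilde{s})$ for a fixed $\tilde{s}$ (an interpretation assumed here), and writing $Q(\tilde{o}|\pi)$ for the marginal of $\tilde{Q}$ over states:

$$\begin{aligned}
G(\pi) &= E_{\tilde{Q}}\left[\ln Q(\tilde{s}|\pi) - \ln P(\tilde{s})\right] - E_{\tilde{Q}}\left[\ln P(\tilde{o}|\tilde{s})\right] \\
       &= D_{KL}\left(Q(\tilde{s}|\pi)||P(\tilde{s})\right) + E_{Q(\tilde{s}|\pi)}\left[H[\tilde{o}|\tilde{s}]\right], \\
\ln Q(\tilde{s}|\pi) - \ln Q(\tilde{s}|\tilde{o}, \pi) &= \ln Q(\tilde{o}|\pi) - \ln P(\tilde{o}|\tilde{s})
       \quad \text{since } Q(\tilde{s}|\tilde{o}, \pi) = \frac{P(\tilde{o}|\tilde{s})Q(\tilde{s}|\pi)}{Q(\tilde{o}|\pi)}, \\
S(\pi) &= E_{\tilde{Q}}\left[\ln Q(\tilde{o}|\pi) - \ln P(\tilde{o})\right] - E_{\tilde{Q}}\left[\ln P(\tilde{o}|\tilde{s})\right] \\
       &= D_{KL}\left(Q(\tilde{o}|\pi)||P(\tilde{o})\right) - E_{\tilde{Q}}\left[\ln P(\tilde{o}|\tilde{s})\right].
\end{aligned}$$

The gap between the two is $G(\pi) - S(\pi) = E_{Q(\tilde{o}|\pi)}\left[D_{KL}\left(Q(\tilde{s}|\tilde{o}, \pi)||P(\tilde{s}|\tilde{o})\right)\right] \geq 0$, which is why $G$ bounds $S$ from above.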

In addition to the expected free energy, we can also consider minimising the free energy bound on the expected future surprisal, defined as
$$-E_{\tilde{Q}}[\ln P(\tilde{o})] \leq E_{\tilde{Q}}\left[\ln Q(\tilde{s}|\tilde{o}, \pi) - \ln P(\tilde{o}, \tilde{s}) \right] \equiv I(\pi)$$
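
For computation it helps to expand $I(\pi)$ in the same way as the other objectives. The following is a short sketch, assuming the factorisation $Q(\tilde{s}|\tilde{o}, \pi) = P(\tilde{o}|\tilde{s})Q(\tilde{s}|\pi)/Q(\tilde{o}|\pi)$ implied by $\tilde{Q}$, with $Q(\tilde{o}|\pi)$ the marginal of $\tilde{Q}$ over states and $H[Q(\tilde{o}|\pi)]$ its entropy:

$$\begin{aligned}
\ln Q(\tilde{s}|\tilde{o}, \pi) - \ln P(\tilde{o}, \tilde{s}) &= \ln Q(\tilde{s}|\pi) - \ln P(\tilde{s}) - \ln Q(\tilde{o}|\pi), \\
I(\pi) &= D_{KL}\left(Q(\tilde{s}|\pi)||P(\tilde{s})\right) + H\left[Q(\tilde{o}|\pi)\right], \\
I(\pi) + E_{\tilde{Q}}\left[\ln P(\tilde{o})\right] &= E_{Q(\tilde{o}|\pi)}\left[D_{KL}\left(Q(\tilde{s}|\tilde{o}, \pi)||P(\tilde{s}|\tilde{o})\right)\right] \geq 0.
\end{aligned}$$

The last line confirms the bound. Comparing with $G(\pi) = D_{KL}\left(Q(\tilde{s}|\pi)||P(\tilde{s})\right) + E_{Q(\tilde{s}|\pi)}\left[H[\tilde{o}|\tilde{s}]\right]$, the two objectives differ only in the entropy term, and $I(\pi) - G(\pi)$ equals the mutual information between $\tilde{o}$ and $\tilde{s}$ under $\tilde{Q}$, so $I(\pi) \geq G(\pi)$.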

Let's compare these quantities in a simple example.

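```python
import numpy as np

# Generative model: prior P(s) and likelihood P(o|s), indexed [o, s]
Ps = np.array([.8, .2])
Po_s = np.array([[.6, .2], [.4, .8]])
Pos = Po_s * Ps  # joint P(o, s), indexed [o, s]

Po = Pos.sum(-1)  # marginal P(o)
Qs_o = Pos.T/Po  # posterior P(s|o) under the generative model, indexed [s, o]

# Predictive beliefs Q(s|pi) for five policies, indexed [s, pi]
Qs_p = np.array([[.2, .8], [.5, .5], [.7, .3], [.9, .1], [.999, .001]]).T
n = Qs_p.shape[-1]  # number of policies
Qos_p = np.expand_dims(Po_s, -1) * Qs_p  # joint Q(o, s|pi), indexed [o, s, pi]
Qo_p = Qos_p.sum(-2)  # marginal Q(o|pi), indexed [o, pi]

# P(s|o) Q(o|pi), indexed [s, o, pi]; not used below
qso_p = np.expand_dims(Qs_o, -1) * Qo_p

# G = E[ln Q(s|pi) - ln P(o, s)]
G1 = np.sum(((np.log(Qs_p) - np.expand_dims(np.log(Pos), -1)) * Qos_p).reshape(-1, n), -2)
print(-G1)

# S = KL(Q(o|pi) || P(o)) - E[ln P(o|s)]
S = (Qo_p * (np.log(Qo_p) - np.expand_dims(np.log(Po), -1))).sum(-2) \
    - (Qos_p * np.log(np.expand_dims(Po_s, -1))).reshape(-1, n).sum(-2)
print(-S)

# Q(s|o, pi), indexed [s, o, pi]; not used below
Qs_o_p = (Qos_p/np.expand_dims(Qo_p, -2)).swapaxes(0, 1)

# I = KL(Q(s|pi) || P(s)) + H[Q(o|pi)], equivalent to E[ln Q(s|o, pi) - ln P(o, s)]
I = np.sum((np.log(Qs_p) - np.log(Ps)[:, None]) * Qs_p, -2) - np.sum(Qo_p * np.log(Qo_p), -2)
print(-I)
```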

```
[-1.36670089 -0.8098506  -0.64939645 -0.69244076 -0.88946165]
[-0.65352817 -0.61564747 -0.6244306  -0.6589662  -0.68564111]
[-1.42472993 -0.89615522 -0.72051452 -0.72261981 -0.88979611]
```


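The printed rows are $-G(\pi)$, $-S(\pi)$ and $-I(\pi)$, so each objective is minimised where its row is largest. Note that the minima of $G(\pi)$ (first row) and $S(\pi)$ (second row) differ: $G$ is minimised by the third policy, $S$ by the second.
Curiously, the minima of $I(\pi)$ (third row) and $G(\pi)$ match in this example, whereas $S(\pi)$ has a different minimum and hence selects a different optimal policy.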
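
The comparison can be made explicit by computing each objective from its decomposed form and reading off the minimising policy. The following is a minimal sketch, not the notebook's own code: it reuses the toy model values from the example, while the helper `kl` and the names `ambiguity`, `H_o`, `surprisal` and `best` are introduced here purely for illustration. It also checks the two bounds $S \leq G$ and $-E_{\tilde{Q}}[\ln P(\tilde{o})] \leq I$ numerically.

```python
import numpy as np

# Toy model, values copied from the example above.
Ps = np.array([0.8, 0.2])                    # prior P(s)
Po_s = np.array([[0.6, 0.2], [0.4, 0.8]])    # likelihood P(o|s), indexed [o, s]
Qs_p = np.array([[.2, .8], [.5, .5], [.7, .3], [.9, .1], [.999, .001]]).T  # Q(s|pi), [s, pi]

Po = Po_s @ Ps      # marginal P(o)
Qo_p = Po_s @ Qs_p  # marginal Q(o|pi), indexed [o, pi]


def kl(q, p):
    """KL(q || p) per policy; q is indexed [k, pi], p is indexed [k] (hypothetical helper)."""
    return np.sum(q * (np.log(q) - np.log(p)[:, None]), axis=0)


H_o_s = -np.sum(Po_s * np.log(Po_s), axis=0)  # H[o|s] for each state s
ambiguity = H_o_s @ Qs_p                      # E_{Q(s|pi)} H[o|s], equal to -E[ln P(o|s)]
H_o = -np.sum(Qo_p * np.log(Qo_p), axis=0)    # entropy of Q(o|pi)

G = kl(Qs_p, Ps) + ambiguity     # expected free energy
S = kl(Qo_p, Po) + ambiguity     # quantity bounded by G
I = kl(Qs_p, Ps) + H_o           # free energy bound on expected surprisal
surprisal = -np.log(Po) @ Qo_p   # -E[ln P(o)] under Q(o|pi)

# Numerical check of the two bounds stated in the text.
assert np.all(S <= G + 1e-12)
assert np.all(surprisal <= I + 1e-12)

# Zero-based index of the minimising policy for each objective.
best = {name: int(np.argmin(v)) for name, v in [("G", G), ("S", S), ("I", I)]}
print(best)  # {'G': 2, 'S': 1, 'I': 2}
```

Indices are zero-based, so $G$ and $I$ select the third policy, $Q(\tilde{s}|\pi) = (0.7, 0.3)$, while $S$ selects the second, $Q(\tilde{s}|\pi) = (0.5, 0.5)$.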

We will compare the objective functions $S, G, I$ across several variants of dynamic multi-armed bandits.