<h1>Mathematical Theory of Data Assimilation with Applications:<br>

<p class="fragment">Tutorial part 4 of 4 --- Bayesian DA through sampling</p></h1>


<h3>Recall from last time</h3>

<ul>
    <li class="fragment">While we found a recursive parametric form for the next prior distribution in terms of the last posterior distribution, this method had limitations.  Particularly:</li>
    <ol>
         <li class="fragment">The recursion only holds for models with linear dynamic and observational models and Gaussian error distributions.</li>
        <li class="fragment">In nonlinear systems, this suggested</li>
        <ol>
           <li class="fragment">  evolving the mean state with the fully nonlinear equations; and </li>
           <li class="fragment"> hoping that the evolution of the last posterior is well represented by the evolution under the Jacobian equations along this trajectory.</li>
        </ol>
    </ol>
        </ul>
    
 

   
 <h3>Recall from last time</h3>

<ul>   
        <li class="fragment">The linearization of the model in the extended Kalman filter is, however, expensive and is infeasible for operational models.</li>
            <li class="fragment">Moreover, the parametric form for this analysis is highly rigid, or <em>biased</em> in its learning;</li>
        <ul>
         <li class="fragment">it assumes linear-Gaussian models and when the assumptions aren't well satisfied, the parametric form for the recursion of the posterior can diverge catastrophically.</li>
            </ul>
</ul>

<h3>Sampling</h3>

<ul>
    <li class="fragment">One way to rectify the above issues is to estimate the posterior either non-parametrically (particle filters) or less-parametrically (ensemble Kalman filter).</li>
    <li class="fragment">When the goal is to obtain an accurate representation of the posterior distribution, we can consider the representation of the density as (singular) volumes with weights.</li>
    <li class="fragment">We will suppose for the moment that we have a target posterior distribution $p(\mathbf{x}\vert \mathbf{y})$ that we can somehow draw independent samples from, even if we do not know its exact functional form.</li>
    <li class="fragment">Let $\mathbf{x}^i\in \mathbb{R}^n$ denote the $i$-th sample drawn iid from this distribution.</li>
</ul>

<h3>Sampling continued</h3>

<ul>
    <li class="fragment"> If we draw $N$ independent samples from the distribution $p(\mathbf{x}\vert \mathbf{y})$, an empirical representation of the distribution is given by
    \begin{align}
        p_N(\mathbf{x}\vert \mathbf{y}) = \frac{1}{N} \sum_{i=1}^N \delta_{\mathbf{x}^i}\left(\mathrm{d}\mathbf{x}\right)
    \end{align}
    </li>
    <li class="fragment">In the above, the factor $\frac{1}{N}$ represents that all point volumes have equal mass or weight, so that the total density integrates to one.</li>
    <li class="fragment">Then for any statistic $f$ of the posterior, we can recover an estimate of its expected value directly as
        \begin{align}
        \mathbb{E}_{p(\mathbf{x}\vert \mathbf{y})} \left[f \right] \triangleq\int f(\mathbf{x}) p(\mathbf{x}\vert \mathbf{y} ) \mathrm{d}\mathbf{x} \approx \int f(\mathbf{x}) p_N(\mathbf{x}\vert \mathbf{y}) \mathrm{d}\mathbf{x} 
        &= \frac{1}{N}\sum_{i=1}^N f(\mathbf{x}^i).
        \end{align}
    </li>
</ul>
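<p>The equal-weight estimate above can be sketched in a few lines. This is a minimal illustration rather than part of the original tutorial code: the two-dimensional Gaussian standing in for the posterior, the helper name <code>empirical_expectation</code>, and the sample size are all assumptions, chosen so that the exact answers are known.</p>

```python
import numpy as np

def empirical_expectation(samples, f):
    """Equal-weight Monte Carlo estimate (1/N) * sum_i f(x^i).

    samples has shape (N, n), with one sample x^i per row.
    """
    values = np.array([f(x) for x in samples])
    return values.mean(axis=0)

# Assumed stand-in for the posterior: a 2D Gaussian that we can sample directly.
rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
samples = rng.multivariate_normal(mean, cov, size=20_000)

# f(x) = x recovers the posterior mean; f(x) = |x|^2 is a second-moment statistic.
est_mean = empirical_expectation(samples, lambda x: x)
est_sq_norm = empirical_expectation(samples, lambda x: x @ x)

print("estimated mean:", est_mean)
print("estimated E[|x|^2]:", est_sq_norm)
```

<p>With these assumed parameters the exact values are $\mathbb{E}[\mathbf{x}] = (1, -2)$ and $\mathbb{E}[\Vert\mathbf{x}\Vert^2] = \mathrm{tr}(\boldsymbol{\Sigma}) + \Vert\boldsymbol{\mu}\Vert^2 = 6.5$, which the estimates should approach as $N$ grows.</p>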

<h3>Empirical estimates</h3>

<ul>
    <li class="fragment">The empirical estimate discussed before is also an unbiased estimator of the expected value of the statistic $f$;</li>
    <li class="fragment">If the posterior variance of $f(\mathbf{x})$ satisfies,
            \begin{align}
            \sigma^2_f = \mathbb{E}_{p(\mathbf{x}\vert \mathbf{y})}\left[f^2(\mathbf{x})\right] - \left(\mathbb{E}_{p(\mathbf{x}\vert \mathbf{y})}\left[f\right]\right)^2 < \infty;
             \end{align}
             </li>
    <li class="fragment"> then the variance of the empirical estimate is \begin{align}
        \mathrm{var}\left(\mathbb{E}_{p_N(\mathbf{x}\vert \mathbf{y})}\left[f\right]\right) = \frac{\sigma_f^2}{N},\end{align}
        (where the variance is understood as taken over the possible sample outcomes).</li>
     <li class="fragment">Under the same finite-variance condition, the central limit theorem gives
         \begin{align}
         \sqrt{N}\left\{ \mathbb{E}_{p(\mathbf{x}\vert \mathbf{y})}\left[f\right] - \mathbb{E}_{p_N(\mathbf{x}\vert \mathbf{y})}\left[f\right]\right\} \xrightarrow{\;d\;} N(0, \sigma_f^2) \quad \text{as } N\rightarrow +\infty,
         \end{align}
         i.e., the empirical estimate converges to the true expected value, with an asymptotically Gaussian error of order $1/\sqrt{N}$.</li>
    
 </ul>
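<p>The $\sigma_f^2 / N$ scaling can be checked numerically by repeating the whole sampling experiment many times. The sketch below is an illustration under assumed choices: a standard normal stands in for the posterior and $f(x) = x^2$, for which $\sigma_f^2 = 2$ exactly; the number of repetitions is arbitrary.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Assumed statistic: under N(0, 1), E[f] = 1 and sigma_f^2 = E[x^4] - 1 = 2.
    return x ** 2

sigma_f2 = 2.0
n_repeats = 4000          # independent repetitions of the sampling experiment
observed_var = {}

for N in (50, 200, 800):
    draws = rng.standard_normal(size=(n_repeats, N))
    estimates = f(draws).mean(axis=1)      # one empirical expectation per repetition
    observed_var[N] = estimates.var()      # variance over the sample outcomes
    print(f"N={N:4d}  observed={observed_var[N]:.5f}  sigma_f^2/N={sigma_f2 / N:.5f}")
```

<p>Quadrupling $N$ should reduce the observed variance by roughly a factor of four, so the standard error only halves.</p>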
 

<h3>Importance sampling</h3>

<ul>
    <li class="fragment"> In practice, we often cannot sample the posterior directly, and must instead sample some other distribution that shares its support.</li>
    <li class="fragment">This idea of sampling another distribution with shared support is known as <em>importance sampling</em>.</li>
        <li class="fragment">We will suppose that we have access, perhaps not to $p(\mathbf{x}\vert \mathbf{y})$, but instead to $\pi(\mathbf{x}\vert \mathbf{y})$ such that $p \ll \pi$,</li>
    <ul>
            <li class="fragment"> i.e., $p(A\vert \mathbf{y})>0 \Rightarrow \pi(A\vert \mathbf{y})>0$.</li>
    </ul>
    <li class="fragment">This assumption allows us to take the Radon-Nikodym derivative of the true posterior $p(\mathbf{x}\vert \mathbf{y})$ with respect to the <em>proposal distribution</em> $\pi(\mathbf{x}\vert \mathbf{y})$.</li>
    <li class="fragment">The key innovation over the previous formulation is that this allows us to evaluate a statistic of the posterior by point volumes but with non-equal <em>importance weights</em>.</li>
</ul>

<h3>Importance sampling continued</h3>

<ul>
    <li class="fragment"> Let us define the importance weight function $w(\mathbf{x}) \triangleq \frac{p(\mathbf{x}\vert \mathbf{y})}{\pi(\mathbf{x}\vert \mathbf{y})}$.</li>
    <li class="fragment">Then we can re-write the expected value of some statistic $f$ of the posterior as,
        \begin{align}
        \mathbb{E}_{p(\mathbf{x}\vert \mathbf{y})}[f]&=\frac{\int f(\mathbf{x})w(\mathbf{x})\pi(\mathbf{x}\vert \mathbf{y})\mathrm{d}\mathbf{x}}{\int w(\mathbf{x})\pi(\mathbf{x}\vert \mathbf{y})\mathrm{d}\mathbf{x}} 
        \end{align}
        </li>
    <li class="fragment">The benefit for sampling techniques is therefore to take iid $\mathbf{x}^i \sim \pi(\mathbf{x}\vert \mathbf{y})$ and to write the empirically derived expected value of $f$ as, 
        \begin{align}
        \mathbb{E}_{p_N(\mathbf{x}\vert \mathbf{y})}[f] = \frac{ \frac{1}{N} \sum_{i=1}^N f(\mathbf{x}^i)w(\mathbf{x}^i)}{\frac{1}{N} \sum_{i=1}^N w(\mathbf{x}^i)} = \sum_{i=1}^N f(\mathbf{x}^i) \tilde{w}^i.
        \end{align}
        </li>
    <li class="fragment">Here, the $\tilde{w}^i\triangleq \frac{w(\mathbf{x}^i)}{\sum_{j=1}^N w(\mathbf{x}^j)}$ are defined as the <em>normalized importance weights</em>.</li>
</ul>
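<p>A minimal sketch of the self-normalized estimate follows. The scalar Gaussian target and proposal, and the helper name <code>log_gauss_unnormalized</code>, are assumptions made purely for illustration. Because the weights are normalized by their sum, the densities only need to be known up to a multiplicative constant, and working with log-weights avoids numerical underflow.</p>

```python
import numpy as np

def log_gauss_unnormalized(x, mean, std):
    """Log of a Gaussian density, dropping all terms that do not depend on x."""
    return -0.5 * ((x - mean) / std) ** 2

rng = np.random.default_rng(2)
N = 50_000

# Assumed proposal pi = N(0, 2^2) and target p = N(1, 0.5^2); pi covers the support of p.
prop_mean, prop_std = 0.0, 2.0
targ_mean, targ_std = 1.0, 0.5

# Draw iid samples from the proposal, not from the target.
x = rng.normal(prop_mean, prop_std, size=N)

# log w(x^i) = log p(x^i) - log pi(x^i), up to an additive constant.
log_w = (log_gauss_unnormalized(x, targ_mean, targ_std)
         - log_gauss_unnormalized(x, prop_mean, prop_std))
w = np.exp(log_w - log_w.max())     # shift by the maximum for numerical stability
w_tilde = w / w.sum()               # normalized importance weights

# Weighted estimates of the target mean and variance.
est_mean = np.sum(w_tilde * x)
est_var = np.sum(w_tilde * (x - est_mean) ** 2)

print("estimated mean:", est_mean, " estimated variance:", est_var)
```

<p>The estimates should be close to the assumed target mean $1$ and variance $0.25$, even though no sample was drawn from the target itself.</p>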

<h3>Importance sampling continued</h3>

<ul>
    <li class="fragment"><b>Q:</b> if we take $f(\mathbf{x}) = 1$, what is the expected value based on the empirical measure with sampling weights? Why is this important?</li>
    <li class="fragment"><b>A:</b> the expected value is one --- this means that the weighted point volumes give a probability measure.</li>
    <li class="fragment">Particularly, in this way, we will write our empirical estimate of the posterior as,
        \begin{align}
        p_N(\mathbf{x}\vert \mathbf{y}) \triangleq \sum_{i=1}^N \tilde{w}^i\delta_{\mathbf{x}^i}(\mathrm{d}\mathbf{x}).
        \end{align}</li>
    </ul>

<h3>Importance sampling continued</h3>

<ul>
    <li class="fragment">Using this formulation, we now have an extremely flexible view of the posterior as a combination of positions (samples) and weights (probabilities).</li>
    <li class="fragment">In the DA problem, we once again have a natural choice of how to find the next prior from the last posterior; </li>
    <ul>
    <li class="fragment"> this is done by evolving the points and (possibly) finding new weights or resampling positions at the moment of conditioning on new observed information.</li>
    </ul>
</ul>

<h3>Sequential importance sampling</h3>

<ul>
    <li class="fragment">Key to creating an operational algorithm is to extend importance sampling into a sequential algorithm.</li>
    <li class="fragment">Particularly, when we extend the posterior to a sequence of model forecast states conditioned on a sequence of observed states, 
        \begin{align}
        \mathbf{x}_{0:k}&\triangleq \left\{\mathbf{x}_i : i = 0, \cdots ,k\right\},\\
        \mathbf{y}_{0:k}&\triangleq \left\{\mathbf{y}_i : i = 0, \cdots ,k\right\},
        \end{align}
    </li>   
    <li class="fragment">we will wish to find a recursive formulation for the posterior, like we did in the Kalman filter.
    </li>
    <li class="fragment">In order to formulate this, we will factor the joint posterior with the chain rule of conditional probability, i.e.,
        \begin{align}
        p\left(\mathbf{x}_{0:k}\vert \mathbf{y}_{0:k}\right) = p\left(\mathbf{x}_{0:k-1}\vert \mathbf{y}_{0:k}\right) p\left(\mathbf{x}_k \vert \mathbf{x}_{0:k-1}, \mathbf{y}_{0:k}\right).
        \end{align}</li>
    <li class="fragment">Iterating on this formula recursively, we find that,
        \begin{align}
        p(\mathbf{x}_{0:k}\vert \mathbf{y}_{0:k}) = p(\mathbf{x}_0 \vert \mathbf{y}_{0:k}) \prod_{m=1}^k p(\mathbf{x}_m \vert \mathbf{x}_{0:m-1}, \mathbf{y}_{0:k}).
        \end{align}</li>
</ul>

<h3>Sequential importance sampling continued</h3>

<ul>
    <li class="fragment">Using the recursion and using the Markovian assumption ($p(\mathbf{x}_{m}\vert \mathbf{x}_{0:m-1}) = p(\mathbf{x}_m \vert \mathbf{x}_{m-1})$), we find the recursive form for the weights via Bayes' Law as,
    \begin{align}
        \tilde{w}^i_k \propto \tilde{w}^i_{k-1} \frac{p\left(\mathbf{y}_k \vert \mathbf{x}^i_k\right) p\left(\mathbf{x}^i_k \vert \mathbf{x}^i_{k-1}\right)}{\pi\left(\mathbf{x}_k^i \vert \mathbf{x}^i_{0:k-1}, \mathbf{y}_{0:k} \right)}.
    \end{align}
    </li>
 <li class="fragment">Most commonly, we will take the proposal $\pi$ to be the prior $p(\mathbf{x}_{0:k})$, so that each new sample is drawn from the transition density $p(\mathbf{x}_k \vert \mathbf{x}_{k-1})$; in this case, the weights are given recursively by $\tilde{w}^i_k \propto \tilde{w}^i_{k-1} p(\mathbf{y}_k \vert \mathbf{x}_k^i)$.</li>
    <li class="fragment">The proportionality statement says that:
        <ul>
        <li class="fragment"> after conditioning each weight on the new observation, we need only re-normalize the weights to sum to one in order to recover the empirical posterior.</li>
        </ul>
    </li>
 </ul>
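<p>The weight recursion with the prior as proposal can be sketched as follows. The scalar linear model, the noise levels, and the names <code>sis_step</code>, <code>propagate</code> and <code>log_likelihood</code> are hypothetical choices for illustration only; the update itself is the rule $\tilde{w}^i_k \propto \tilde{w}^i_{k-1} p(\mathbf{y}_k \vert \mathbf{x}_k^i)$ followed by re-normalization.</p>

```python
import numpy as np

def sis_step(particles, weights, y, propagate, log_likelihood, rng):
    """One sequential importance sampling step using the prior as the proposal."""
    # Draw x_k^i from the transition density p(x_k | x_{k-1}^i).
    particles = propagate(particles, rng)
    # Multiply the old weights by the likelihood p(y_k | x_k^i), in log space.
    with np.errstate(divide="ignore"):
        log_w = np.log(weights) + log_likelihood(y, particles)
    log_w -= log_w.max()                 # stabilize before exponentiating
    new_weights = np.exp(log_w)
    # Re-normalize so that the weights sum to one.
    return particles, new_weights / new_weights.sum()

# Hypothetical scalar model: x_k = 0.9 x_{k-1} + noise, y_k = x_k + noise.
def propagate(x, rng, q_std=0.5):
    return 0.9 * x + rng.normal(0.0, q_std, size=x.shape)

def log_likelihood(y, x, r_std=0.5):
    return -0.5 * ((y - x) / r_std) ** 2

rng = np.random.default_rng(3)
N = 500
particles = rng.normal(0.0, 1.0, size=N)   # samples of the initial prior
weights = np.full(N, 1.0 / N)              # equal initial weights

x_true = 1.0
for k in range(5):
    x_true = 0.9 * x_true + rng.normal(0.0, 0.5)   # synthetic truth
    y = x_true + rng.normal(0.0, 0.5)              # synthetic observation
    particles, weights = sis_step(particles, weights, y, propagate, log_likelihood, rng)

post_mean = np.sum(weights * particles)            # weighted posterior mean estimate
print("truth:", x_true, " weighted posterior mean:", post_mean)
```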
    

<h3>Sequential importance sampling continued</h3>

<ul>    
        <li class="fragment">Sequential importance sampling for estimating the posterior is extremely flexible and makes few assumptions on the form of the problem whatsoever;</li>
    <li class="fragment">however, the primary issue that arises is that the importance weights quickly become extremely skewed, leading to nearly all the probability mass landing on a single point after only a few iterations.</li>
    <li class="fragment">Handling the degeneracy of the weights is explicitly the motivation for the <em>bootstrap particle filter</em>, and implicitly one of the motivations for the <em>ensemble Kalman filter</em>.</li>
        <li class="fragment"> The method of the bootstrap filter essentially proposes to eliminate the degeneracy of the weights by eliminating samples with weights close to zero and resampling.</li>
</ul>

<h3>The bootstrap filter</h3>

<ul>
    <li class="fragment">At the point of the Bayesian update and re-weighting:</li>
    <ol>
        <li class="fragment">eliminate all samples with weights $\tilde{w}^i < W$ where $W\ll 1$ will be some threshold for the weights;</li> 
        <li class="fragment"> in turn, we make replicates of the higher weighted samples and reset the importance weights all equal to $\frac{1}{N}$;</li>
    <li class="fragment">the new empirical measure is then given by,
        \begin{align}
        p_N(\mathrm{d}\mathbf{x}_{0:k}\vert \mathbf{y}_{0:k}) = \frac{1}{N} \sum_{i=1}^N N^i \delta_{\mathbf{x}^i_{0:k}}(\mathrm{d}\mathbf{x}_{0:k}),
        \end{align}
        where $N^i \in \{0, 1, \cdots, N\}$ is the number of replicates of sample $\mathbf{x}^i_{0:k}$, such that $\sum_{i=1}^N N^i =N$.</li>
    </ol>
    <li class="fragment">How the number of replicates $N^i$ is chosen is the basis of several different approaches to particle filters.</li>
</ul>
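<p>One common way to choose the replicate counts $N^i$ is multinomial resampling, in which the counts are drawn from a multinomial distribution with the normalized weights as probabilities. The sketch below is one such choice among several and is not necessarily the scheme intended above; the function name <code>multinomial_resample</code> and the example weights are assumptions for illustration.</p>

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Replicate particles according to their weights and reset weights to 1/N.

    Returns the resampled particles, the new equal weights, and the counts N^i.
    """
    N = len(weights)
    counts = rng.multinomial(N, weights)               # N^i, with sum_i N^i = N
    resampled = np.repeat(particles, counts, axis=0)   # N^i copies of particle i
    return resampled, np.full(N, 1.0 / N), counts

rng = np.random.default_rng(6)
particles = np.array([-1.0, 0.0, 0.5, 1.0, 2.0, 3.0])
weights = np.array([0.01, 0.02, 0.55, 0.40, 0.01, 0.01])   # heavily skewed weights

resampled, new_weights, counts = multinomial_resample(particles, weights, rng)
print("replicate counts N^i:", counts)
print("resampled particles:", resampled)
```

<p>Particles with negligible weight will usually receive a count of zero and disappear, while the highly weighted particles are duplicated.</p>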

<h3>The bootstrap filter continued</h3>

<ul>
    <li class="fragment"> The classical bound for how well the empirical measure approximates the true posterior can be stated as follows:</li>
    <ul>
        <li class="fragment">Suppose $f$ is a bounded function (statistic) with $\parallel f\parallel\triangleq \sup_{\mathbf{x}_{0:k} } \lvert f(\mathbf{x}_{0:k})\rvert$,</li> 
        <li class="fragment"> then there exists some $C>0$, independent of $k$, such that
        \begin{align}
        \mathbb{E} \left[ \left\{ \int f(\mathbf{x}_{0:k}) p(\mathrm{d}\mathbf{x}_{0:k}\vert \mathbf{y}_{0:k}) - \int f(\mathbf{x}_{0:k}) p_N(\mathrm{d}\mathbf{x}_{0:k}\vert \mathbf{y}_{0:k}) \right\}^2 \right] \leq \frac{C \parallel f \parallel^2}{N}.
        \end{align}
    </li>
        <li class="fragment"><b>Note:</b> the constant $C$ grows exponentially in the observation dimension $d$, where $\mathbf{y}_k \in \mathbb{R}^d$.</li>
        <li class="fragment"><b>Exercise (3 minutes):</b> use Jensen's inequality to derive the rate at which the empirical measure converges to the true measure (in the number of samples N) in the expectation of the absolute difference.</li>
    </ul>
</ul>

<h3>The bootstrap filter continued</h3>

<ul>
    <li class="fragment"><b>Solution:</b> Jensen's inequality states that for a convex function $\phi$ on the real line, if $(\Omega, \mathcal{A}, \mu)$ is a probability space and $g$ is a real-valued integrable function,
        \begin{align}
        \phi\left(\int_\Omega g \mathrm{d}\mu\right) \leq \int_\Omega \phi\circ g \mathrm{d}\mu.
        \end{align}
    </li>
    <li class="fragment">We consider $\phi$ defined by $\phi(x) = x^2$;</li> 
        <li class="fragment"> let us define $g$ as,
        \begin{align}
            g(\omega)\triangleq \left\lvert \int f(\mathbf{x}_{0:k}) p(\mathrm{d}\mathbf{x}_{0:k}\vert \mathbf{y}_{0:k}) - \int f(\mathbf{x}_{0:k}) p_N(\mathrm{d}\mathbf{x}_{0:k}\vert \mathbf{y}_{0:k}) \right\rvert,
        \end{align}
            where $\omega$ is some outcome.</li>    
    </ul>

<h3>The bootstrap filter continued</h3>

<ul>

<li class="fragment">Then,
        \begin{align}
        &\phi\left(\mathbb{E} \left[ g \right]\right) \leq \mathbb{E}[\phi \circ g ] \leq \frac{C \parallel f \parallel^2}{N}\\
        \Rightarrow & \mathbb{E}\left[g\right] \leq \frac{\sqrt{C} \parallel f \parallel}{\sqrt{N}}
        \end{align}
    </li>
    </ul>

<h3>The bootstrap filter continued</h3>

<ul>
    <li class="fragment"> The constant $C$ is problematic for most operational DA in the tradeoff it represents.  Particularly as:</li>
    <ul>
        <li class="fragment">the observational dimension for operational DA is up to the order $\mathcal{O}\left(10^8\right)$, we cannot feasibly control this constant with a denominator that only grows as $\sqrt{N}$ in the number of samples;</li>
        <li class="fragment">we can reduce the number of observations to better approximate the true posterior, but the true posterior we will represent empirically will be one that is <em>extremely deprived of information</em>.</li>
    </ul>
    <li class="fragment">Indeed, the state dimension in operational models is up to the order of $\mathcal{O}\left(10^9\right)$, such that the inference problem is woefully underconstrained already;</li>
    <ul>
        <li class="fragment">if we reduce the amount of observational information, we will get a good empirical estimate of an oblivious posterior.</li>
    </ul>
</ul>

<h3>Bias versus variance of estimators</h3>

<ul>
    <li class="fragment"> We will only mention that advanced learning techniques built upon the ideas from particle filters are an active area of research.</li>
    <li class="fragment">Typically, these techniques must offer some compensation for the extremely high variance of particle filters by introducing a form of bias into their estimates.</li>
    <li class="fragment">We will discuss the bias-variance tradeoff loosely in the following.</li>
    <li class="fragment">Generically, we suppose there is some unknown relationship $f$ that describes,
        \begin{align}
        \mathbf{y} = f(\mathbf{x}) + \epsilon
        \end{align}
        where $\mathbb{E}[\epsilon] = 0$ and $var(\epsilon) = \sigma^2 <\infty$.</li>
</ul>

<h3>Bias versus variance of estimators continued</h3>

<ul> 
    <li class="fragment">Treated as supervised learning of the relationship, if we choose some approximate form $\hat{f}$ for the true relationship $f$, </li>
    <li class="fragment">it can be demonstrated that the generalization error (outside of the training data) decomposes as
        \begin{align}
        \mathbb{E}\left[\left(y -\hat{f}(x)\right)^2\right] &= \left(\mathbb{E}\left[\hat{f}(x) - f(x)\right]\right)^2 + \left(\mathbb{E}[\hat{f}(x)^2] - \mathbb{E}[\hat{f}(x)]^2\right) + \sigma^2\\
        &= \left\{\mathrm{Bias}\left[\hat{f}(x)\right]\right\}^2 + var\left[\hat{f}(x)\right] + \sigma^2
        \end{align}</li>
        <li class="fragment">where (heuristically):</li> 
        <ul>
            <li class="fragment"> the bias represents the rigidness of the assumptions (e.g., linearity/Gaussianity) of the learning scheme; and</li>
            <li class="fragment">the variance is the flexibility of the method and how far it will move from its mean when introduced to new data.</li>
    </ul>
</ul>
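<p>The decomposition can be checked numerically by refitting a model on many independent noisy training sets. The sketch below is an illustration under assumed choices that do not come from the tutorial: the true relationship is $f(x) = \sin(x)$, the training inputs are a fixed grid, the test point is $x = 1$, and the rigid and flexible models are polynomials of degree 1 and 5 respectively.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.3                              # noise standard deviation
x_train = np.linspace(-3.0, 3.0, 30)     # fixed training inputs
x0 = 1.0                                 # test input
n_repeats = 2000                         # number of independent training sets

def decompose(degree):
    """Return (bias^2, variance, mean squared test error) of a polynomial fit at x0."""
    # One noisy training set per column.
    Y = np.sin(x_train)[:, None] + rng.normal(0.0, sigma, size=(x_train.size, n_repeats))
    coeffs = np.polyfit(x_train, Y, degree)                    # shape (degree + 1, n_repeats)
    preds = (np.vander([x0], degree + 1) @ coeffs).ravel()     # f_hat(x0) for each training set
    y0 = np.sin(x0) + rng.normal(0.0, sigma, size=n_repeats)   # fresh test observations
    bias2 = (preds.mean() - np.sin(x0)) ** 2
    variance = preds.var()
    mse = np.mean((y0 - preds) ** 2)
    return bias2, variance, mse

results = {degree: decompose(degree) for degree in (1, 5)}
for degree, (bias2, variance, mse) in results.items():
    total = bias2 + variance + sigma ** 2
    print(f"degree {degree}: bias^2={bias2:.4f} var={variance:.4f} "
          f"bias^2+var+sigma^2={total:.4f} test MSE={mse:.4f}")
```

<p>The rigid linear fit should show the larger squared bias and the flexible quintic fit the larger variance, while in both cases the three terms should add up to approximately the measured test error.</p>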

<h3>Bias versus variance of estimators continued</h3>

<ul>    
    <li class="fragment">The particle filter has extremely high variance in its learning due to the flexibility of its assumptions;</li>
    <li class="fragment">the extended Kalman filter on the other hand had extremely high bias by enforcing Gaussian-linear assumptions in the entire analysis.</li>
    <li class="fragment"> This motivates one of the more successful approaches to DA: the ensemble Kalman filter (EnKF).</li>
    <li class="fragment">The EnKF can be seen in one sense as introducing variance into the extended Kalman filter by allowing the past-posterior to evolve via sampling into the next posterior nonlinearly.</li>
    <li class="fragment"> However, at the point of the update, we reintroduce bias into the sampling analysis by enforcing a Gaussian assumption (or in some versions a Gaussian mixture assumption).</li>
    <li class="fragment"> This combination is often enough to reduce the overall error by introducing an appropriate amount of bias and variance.</li>
</ul>

<h3>The Ensemble Kalman filter</h3>

<ul>    
    <li class="fragment">The setup of the ensemble Kalman filter will begin similarly to that of the particle filter.</li>
    <li class="fragment">We suppose that at time $k$ we have $N$ iid samples of a prior, $\left\{\mathbf{x}^{fi}_k\right\}_{i=1}^N$ which will form the columns of a matrix,
        \begin{align}
        \mathbf{X}^{f}_k &\triangleq \begin{pmatrix}\mathbf{x}_k^{f1} & \cdots & \mathbf{x}^{fN}_k\end{pmatrix}.
        \end{align}
    </li>
    <li class="fragment">We define the mean of the ensemble as,
        \begin{align}
        \overline{\mathbf{x}}^f_k \triangleq \frac{1}{N} \sum_{i=1}^N \mathbf{x}^{fi}_k,
        \end{align}
    </li>
    <li class="fragment">and from the above, we define the matrix of the anomalies as,
        \begin{align}
        \mathbf{A}^f_k \triangleq \begin{pmatrix} \mathbf{x}^{f1}_k - \overline{\mathbf{x}}^f_k & \cdots & \mathbf{x}^{fN}_k - \overline{\mathbf{x}}^f_k \end{pmatrix}.
        \end{align}
    </li>
</ul>

<h3>The Ensemble Kalman filter continued</h3>

<ul>    
    <li class="fragment">The anomalies are precisely given by the deviations of the samples from the mean, such that the (unbiased) sample-based prior covariance is given by,
        \begin{align}
        \mathbf{P}^f_k = \frac{1}{N-1} \mathbf{A}^f_k \left(\mathbf{A}^f_k\right)^\mathrm{T} .
        \end{align}
        </li>
    <li class="fragment">Suppose that we are provided an observation $\mathbf{y}_k \sim N(\mathbf{H}\mathbf{x}^t_k ,\mathbf{R})$.</li>
    <li class="fragment">Rather than (re)-weighting the samples, as in the particle filter, we will resample the "posterior" assuming a Gaussian prior and a Gaussian likelihood for the observations.</li>
</ul>
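<p>The ensemble mean, anomaly matrix and sample covariance defined above translate directly into array operations. The following is a small sketch with an arbitrary random ensemble; the dimensions $n = 6$ and $N = 4$ are assumptions, chosen so that the ensemble is smaller than the state dimension.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 6, 4                                   # state dimension and ensemble size
X_f = rng.standard_normal((n, N))             # ensemble matrix, one member per column

x_bar = X_f.mean(axis=1, keepdims=True)       # ensemble mean, shape (n, 1)
A_f = X_f - x_bar                             # anomalies: deviations from the mean
P_f = A_f @ A_f.T / (N - 1)                   # unbiased sample covariance, shape (n, n)

# The anomalies sum to zero, so the covariance has rank at most N - 1.
rank = np.linalg.matrix_rank(P_f, tol=1e-10)
print("rank of P_f:", rank, "out of", n)
```

<p>The result agrees with <code>np.cov(X_f)</code>, which uses the same $N-1$ normalization by default. Since the columns of the anomaly matrix sum to zero, the sample covariance cannot have rank greater than $N-1$.</p>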

<h3>The (stochastic) Ensemble Kalman filter continued</h3>

<ul>    
        <li class="fragment">We will apply the gain to each ensemble member, but for theoretical reasons, we will perturb the observation by a noise realization for each ensemble member.</li>
    <ul>
        <li class="fragment">When we use a direct implementation of the Kalman gain, this is necessary to produce the "correct" analysis covariance;</li> 
        <ul>
        <li class="fragment">however, advanced methods use a transform of the ensemble directly to avoid this step.</li>
        </ul>
    </ul>
    <li class="fragment">We suppose we draw $N$ observation perturbations $\boldsymbol{\psi}^i \sim N(0, \mathbf{R})$</li>
    <ul>
        <li class="fragment">We will enforce that $\frac{1}{N}\sum_{i=1}^N \boldsymbol{\psi}^i=0$.</li>
        <li class="fragment">Furthermore, we define $\hat{\mathbf{R}}\triangleq cov(\boldsymbol{\psi}^i)$ to be the ensemble based observation error covariance.</li>
    </ul>
    <li class="fragment">Then, we define the perturbed observations as $\mathbf{y}^i_k \triangleq \mathbf{y}_k + \boldsymbol{\psi}^i$, such that $\mathbf{y}^i_k \sim N\left(\mathbf{y}_k, \hat{\mathbf{R}}\right)$.</li>
</ul>

<h3>The (stochastic) Ensemble Kalman filter continued</h3>

<ul> 
    <li class="fragment">We can functionally define the Kalman gain as before, but with respect to the ensemble-based prior covariance $\mathbf{P}^f_k$ and ensemble-based observation error covariance $\hat{\mathbf{R}}$:
        \begin{align}
        \mathbf{K}_k \triangleq \mathbf{P}^f_k \mathbf{H}^\mathrm{T}\left(\mathbf{H}\mathbf{P}^f_k\mathbf{H}^\mathrm{T} + \hat{\mathbf{R}}\right)^{-1}
        \end{align}
    </li>
    <li class="fragment"><b>Note:</b> the Kalman gain isn't guaranteed to provide a recursion for the Bayesian posterior except in the linear-Gaussian case;</li> 
        <li class="fragment">therefore in this case, we use the ensemble-based gain as a sub-optimal, biased estimator.</li>
</ul>

<h3>The (stochastic) Ensemble Kalman filter continued</h3>

<ul> 
    <li class="fragment">Finally, we approximate the Bayesian update using the ensemble-based gain as,
    \begin{align}
        \mathbf{x}^{ai}_k = \mathbf{x}^{fi}_k + \mathbf{K}_k\left(\mathbf{y}^i_k - \mathbf{H}\mathbf{x}^{fi}_k\right).
     \end{align}
    </li>
    <li class="fragment">From the analysis samples, we can once again compute the first two moments to have an estimated "best state" and its uncertainty.</li>
    <li class="fragment">An arbitrary statistic $f$ of the posterior can be computed by,
        \begin{align}
        \mathbb{E}[f] \approx \frac{1}{N} \sum_{i=1}^N f(\mathbf{x}^{ai}_k),
        \end{align}
        such that we see that the samples are all given equal weight.</li>
    <li class="fragment">To produce the next prior, we forecast each sample in the fully nonlinear numerical model.</li>
</ul>
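<p>The perturbed-observation analysis step can be collected into a single function for a general linear observation operator $\mathbf{H}$. This is a sketch under stated assumptions rather than the tutorial's own implementation: the function name <code>stochastic_enkf_analysis</code>, the three-dimensional state and the partial observation operator are illustrative, and the gain is computed with a linear solve instead of an explicit inverse.</p>

```python
import numpy as np

def stochastic_enkf_analysis(X_f, y, H, R, rng):
    """Perturbed-observation EnKF update; X_f has one ensemble member per column."""
    n, N = X_f.shape
    A_f = X_f - X_f.mean(axis=1, keepdims=True)
    P_f = A_f @ A_f.T / (N - 1)                       # ensemble-based prior covariance

    # Draw observation perturbations and enforce that their mean is zero.
    psi = rng.multivariate_normal(np.zeros(len(y)), R, size=N).T   # shape (d, N)
    psi = psi - psi.mean(axis=1, keepdims=True)
    R_hat = psi @ psi.T / (N - 1)                     # ensemble-based observation covariance

    Y = y[:, None] + psi                              # perturbed observations, shape (d, N)
    S = H @ P_f @ H.T + R_hat
    # K = P_f H^T S^{-1}; since S and P_f are symmetric this equals solve(S, H P_f)^T.
    K = np.linalg.solve(S, H @ P_f).T
    return X_f + K @ (Y - H @ X_f), K

rng = np.random.default_rng(7)
n, d, N = 3, 2, 200
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])                       # observe the first two components only
R = 0.25 * np.eye(d)
X_f = rng.multivariate_normal(np.zeros(n), np.eye(n), size=N).T
y = np.array([1.0, -1.0])

X_a, K = stochastic_enkf_analysis(X_f, y, H, R, rng)
print("forecast mean:", X_f.mean(axis=1))
print("analysis mean:", X_a.mean(axis=1))
```

<p>Because the perturbations are centered, the analysis ensemble mean coincides with the Kalman update of the forecast mean using the unperturbed observation.</p>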

<h3>The Ensemble Kalman filter continued</h3>

<ul>    
    <li class="fragment">Even with the Gaussian assumption, the EnKF is an extremely successful learning algorithm for nonlinear systems.</li>
    <li class="fragment"> However, techniques such as inflation and localization can be seen as introducing more variance into the estimator, relaxing the bias in the EnKF.</li>
    <li class="fragment">In addition, we may introduce hyper-priors for values such as $\mathbf{Q}$ and $\mathbf{R}$ to similarly increase the variance of the estimator.</li>
    <li class="fragment"> We will explore a simple example with the EnKF in the following.</li>
</ul>

<h3>Stochastic EnKF in the Ikeda map</h3>

<ul>
    <li class="fragment"><b>Exercise (3 minutes):</b> We will now examine the performance of the EnKF, once again using the Ikeda map in a twin experiment.</li>
    <li class="fragment">In the following code, we will plot the forecast and analysis <em>ensembles</em> to demonstrate how they track the observations.</li>
    <li class="fragment">The mean of each ensemble will be plotted as a diamond, while ensemble members will be plotted as semi-transparent points.</li>
</ul>

<h3>Stochastic EnKF in the Ikeda map</h3>

<ul>
    <li class="fragment">We want to evaluate the following questions:</li>
    <ol>
        <li class="fragment">How does the performance of the EnKF change with the number of samples?</li>
        <li class="fragment">How does the performance of the EnKF change over the number of assimilation steps?</li>
        <li class="fragment">How does the performance of the EnKF change with respect to the initial prior uncertainty $B_{var}$?</li>
        <li class="fragment">How does the performance of the EnKF change with respect to the observational error variance $R_{var}$?</li>
    </ol>
</ul>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interactive
from IPython.display import display
from matplotlib.patches import Ellipse

def Ikeda(X_0, u):
    """Evolve a single state one step forward under the Ikeda map.

    The array X_0 of shape (2,) defines the initial condition and the
    parameter u controls the chaos of the map.  Returns the forward state X_1."""
    
    t_n = 0.4 - 6 / (1 + X_0.dot(X_0) )
    
    # rotate the state by the angle t_n, scale by u, and shift the first component
    x_1 = 1 + u * (X_0[0] * np.cos(t_n) - X_0[1] * np.sin(t_n))
    y_1 = u * (X_0[0] * np.sin(t_n) + X_0[1] * np.cos(t_n))
                 
    X_1 = np.array([x_1, y_1])
    
    return X_1

def Ikeda_V(X_0, u):
    """Evolve a whole ensemble one step forward under the Ikeda map.

    The array X_0 is the ensemble matrix of dimension 2 times N_ens, with one
    member per column.  Returns the forward ensemble X_1 of the same shape."""
    
    t_n = 0.4 - 6 / (1 + np.sum(X_0*X_0, axis=0) )
    
    # same update as in Ikeda, applied to every column at once
    x_1 = 1 + u * (X_0[0, :] * np.cos(t_n) - X_0[1, :] * np.sin(t_n))
    y_1 = u * (X_0[0, :] * np.sin(t_n) + X_0[1, :] * np.cos(t_n))
                 
    X_1 = np.array([x_1, y_1])
    
    return X_1

In [None]:
def animate_enkf(B_var = 0.1, R_var = 0.1, N=2, ens_n=3):

    # define the static background and observational error covariances
    P_0 = B_var * np.eye(2)
    R = R_var * np.eye(2)

    # set a random seed for the reproducibility
    np.random.seed(1)
    
    # we define the mean for the background
    x_b = np.array([0,0])
    
    # and the initial condition of the real state as a random draw from the prior
    x_t = np.random.multivariate_normal([0,0], P_0)

    y_obs = np.zeros([2,N-1])
    
    # define the Ikeda map parameter
    u = 0.9
    for i in range(N-1):
        # we forward propagate the true state
        x_t = Ikeda(x_t, u)
    
        # and generate a noisy observation
        y_obs[:, i] = x_t + np.random.multivariate_normal([0,0], R)
    
    
    # we define the ensemble as a random draw of the prior
    ens = np.random.multivariate_normal(x_b, P_0, size=ens_n).transpose()

    
    # run the forecast / analysis cycle over the sequence of observations
    for i in range(N-1):
        
        # forward propagate the last analysis
        ens_f = Ikeda_V(ens, u)
        
        # we generate observation perturbations
        obs_perts =  np.random.multivariate_normal([0,0], R, size=ens_n)
        obs_perts = obs_perts - np.mean(obs_perts, axis=0)
        
        # we generate the ensemble based observation error covariance 
        obs_cov = obs_perts.transpose() @ obs_perts / (ens_n - 1)
        
        # we perturb the observations
        perts_obs = np.squeeze(y_obs[:,i]) + obs_perts
        
        # we compute the ensemble mean and anomalies
        X_mean_f = np.mean(ens_f, axis=1)
        A_t = (ens_f.transpose() - X_mean_f) / np.sqrt(ens_n - 1)
        
        # and the ensemble covariances
        P = A_t.transpose() @ A_t

        # we compute the ensemble based gain and the analysis ensemble
        K_gain = P @ np.linalg.inv( P + obs_cov)
        ens = ens_f + K_gain @ (perts_obs.transpose() - ens_f)
        X_mean_a = np.mean(ens, axis=1)

    
    fig = plt.figure(figsize=(16,8))
    ax = fig.add_axes([.1, .1, .8, .8])
    
    l1 = ax.scatter(ens_f[0, :], ens_f[1, :], c='k', s=20, alpha=.5, marker=',')
    ax.scatter(X_mean_f[0], X_mean_f[1], c='k', s=200, marker="D")
    
    l3 = ax.scatter(ens[0, :], ens[1, :], c='b', s=20, alpha=.5, marker=',')
    ax.scatter(X_mean_a[0], X_mean_a[1], c='b', s=200, marker="D")
    
    l2 = ax.scatter(y_obs[0, -1], y_obs[1, -1], c='r', s=20)
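    # draw the one-standard-deviation circle of the observation error (Ellipse takes full widths)
    ax.add_patch(Ellipse(y_obs[:,-1], 2 * np.sqrt(R_var), 2 * np.sqrt(R_var), ec='r', fc='none'))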
    
    
    ax.set_xlim([-2, 4])
    ax.set_ylim([-4,2])
    
    labels = ['Forecast', 'Observation', 'Analysis']
    plt.legend([l1,l2,l3],labels, loc='upper right', fontsize=26)
    plt.show()
    
w = interactive(animate_enkf,B_var=(0.01,1.0,0.01), R_var=(0.01,1.0,0.01), N=(2, 50, 1), ens_n=(3,300, 3))
display(w)

<h3>Summary of the EnKF</h3>

<ul>
    <li class="fragment">The EnKF makes a vast improvement over the earlier explored methods and demonstrates its robustness as a learning scheme <em>when there are sufficiently many samples.</em></li>
    <ul>
        <li class="fragment">It should be noted that this requires vastly fewer samples than implementing an effective bootstrap particle filter, but at the cost of introducing bias in the analysis of the posterior.</li>
    </ul>
    <li class="fragment">However, in operational settings, ensemble-based forecasting is still highly expensive, and most operational EnKF systems use at most $\mathcal{O}\left(10^2\right)$ samples in the learning.</li>
    <li class="fragment">While the EnKF is highly parallelizable, numerical weather prediction models require massive computational power and this fundamentally limits the number of available samples.</li>
</ul>

<h3>Making the EnKF work in practice</h3>

<ul>
    <li class="fragment">The reality of implementing the EnKF in a numerical weather prediction setting is that the covariance estimates will also be highly biased and extremely rank deficient.</li>
    <li class="fragment">While there are reasons to believe that the true Bayesian posterior covariance is also rank deficient, in the end we especially rely on:</li>
    <ol>
        <li class="fragment">inflation; and</li>
        <li class="fragment">localization;</li>
    </ol>
    <li class="fragment">in order to relax the error estimates (introduce variance) and rectify the extreme rank deficiency.</li>
    <li class="fragment">Both of these techniques are highly active research areas for improving ensemble-based filtering, which are discussed in, e.g., the recent review article of Carrassi et al.</li>
</ul>
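<p>To make the two techniques concrete, the sketch below shows one simple form of each. Neither form is specified in this tutorial, so both are assumptions: inflation is taken to be multiplicative scaling of the anomalies by a factor $\lambda \geq 1$, and localization is taken to be a Schur (element-wise) product of the sample covariance with a Gaussian taper in index distance. Other tapers and adaptive inflation schemes are common in practice. The names <code>inflate</code> and <code>localize</code> are hypothetical.</p>

```python
import numpy as np

def inflate(X, lam):
    """Multiplicative inflation: scale the anomalies about the ensemble mean by lam."""
    x_bar = X.mean(axis=1, keepdims=True)
    return x_bar + lam * (X - x_bar)

def localize(P, length_scale):
    """Schur product of P with a Gaussian taper that decays with index distance."""
    idx = np.arange(P.shape[0])
    dist = np.abs(idx[:, None] - idx[None, :])
    taper = np.exp(-0.5 * (dist / length_scale) ** 2)
    return P * taper                                  # element-wise product

rng = np.random.default_rng(8)
n, N = 10, 4                                          # more state variables than members
X = rng.standard_normal((n, N))

P = np.cov(X)                                         # rank-deficient sample covariance
P_infl = np.cov(inflate(X, 1.1))                      # covariance scaled by 1.1 ** 2
P_loc = localize(P, length_scale=1.5)                 # distant covariances damped

print("rank of sample covariance:   ", np.linalg.matrix_rank(P, tol=1e-10))
print("rank of localized covariance:", np.linalg.matrix_rank(P_loc, tol=1e-10))
```

<p>Inflation leaves the ensemble mean unchanged and multiplies the covariance by $\lambda^2$, while the tapered covariance is no longer limited to rank $N-1$.</p>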