In [1]:
import os
import warnings
with warnings.catch_warnings():
    warnings.simplefilter('ignore')

import scipy.stats as st
import scipy.optimize

import pandas as pd
import numpy as np

import iqplot
import bokeh.io
import bokeh.models
import bokeh.plotting
bokeh.io.output_notebook()

Introduction to the EM Algorithm 
===================================
The Expectation-Maximization (EM) algorithm is a powerful iterative method used for parameter estimation in statistical models, particularly when the data is incomplete or has latent variables. It is widely used in machine learning, statistics, and signal processing to maximize the likelihood of observed data, especially when the direct computation of this likelihood is difficult.

In this exercise, we will introduce the classic EM algorithm and discuss the basic idea of why it works. There are several versions of the EM algorithm that have been described in the literature. The book [Tools for Statistical Inference](https://doi.org/10.1007/978-1-4612-4024-2) by Michael Tanner has a straightforward interpretation of the classical and variational forms of the EM algorithm as well as several applications.  

Description of the EM algorithm
================================
As mentioned above, the EM algorithm is an iterative procedure that has important applications in settings where there are *latent*, or otherwise unobserved, random variables. The procedure alternates between two steps: computing the expectation (E) of the complete-data log likelihood over the latent variables, and maximizing (M) the resulting function to update the parameter estimates. In the section below, we will show that an iteration of the algorithm never decreases the objective (the log likelihood, or the log posterior when a prior is included).

### Why does the EM algorithm work?
To see how the algorithm works, recall that the log posterior can be written as

$$
    \ln\, g(\theta \mid y) = \ln\, f(y \mid \theta) + \ln\, g(\theta) - \ln\, f(y)
$$ 
\
The conventional form of the EM algorithm works with the log likelihood, which we denote $\ell(\theta ; \mathcal{Y}) \coloneqq \ln\, f(\mathcal{Y} \mid \theta)$. We introduce an auxiliary latent variable, $\mathcal{X}$, which is unobserved. Observe that for any probability distribution $\mathcal{Q}(x)$ over the latent variable,
\
$$ 
    \begin{align}
        \ell(\theta ; \mathcal{Y}) &= \ln\, \int_{x}\, f(\mathcal{Y}, x \mid \theta) \,dx \\
        & = \ln\, \int_{x}\, \mathcal{Q}(x) \frac{f(\mathcal{Y}, x \mid \theta)}{\mathcal{Q}(x)} \,dx \\
        & \geq \int_{x}\, \mathcal{Q}(x) \ln\, \frac{f(\mathcal{Y},x \mid \theta)}{\mathcal{Q}(x)} \,dx
    \end{align}
$$
\
where the last line follows from [Jensen's inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality) for concave functions. Now, for a particular $\theta_{t}$, choose $\mathcal{Q}(x) = f(x \mid \mathcal{Y}, \theta_{t})$. The last line of the inequality above can be rewritten as

$$
    \begin{align}
        \int_{x}\, \mathcal{Q}(x) \ln\, \frac{f(\mathcal{Y},x \mid \theta)}{\mathcal{Q}(x)} \,dx &= \int_{x} f(x \mid \mathcal{Y}, \theta_{t}) \ln\, \frac{f(\mathcal{Y},x \mid \theta)}{f(x \mid \mathcal{Y}, \theta_{t})} \,dx  \\
        &= \mathbb{E}[ \ln\, ( \frac{f(\mathcal{X},\mathcal{Y} \mid \theta)}{f(\mathcal{X}\mid \mathcal{Y}, \theta_{t})} ) \mid \mathcal{Y},\theta_{t}]
    \end{align}
$$
\
Letting $g(\theta \mid \theta_{t})$ equal the right-hand side of the expression above, we have that $\ell(\theta ; \mathcal{Y}) \geq g(\theta \mid \theta_{t})$. The inequality holds with equality whenever $\theta = \theta_{t}$. Otherwise, $g(\theta \mid \theta_{t})$ is guaranteed to be a lower bound of $\ell(\theta ; \mathcal{Y})$. Separating $g(\theta \mid \theta_{t})$ into the term that depends on $\theta$ and the term that does not, we have
\
$$
    \begin{align}
        g(\theta \mid \theta_{t}) &= \mathbb{E}[\ln\, f(\mathcal{X},\mathcal{Y}\mid \theta) \mid \mathcal{Y}, \theta_{t}] - \mathbb{E}[\ln\, f(\mathcal{X}\mid\mathcal{Y}, \theta_{t}) \mid \mathcal{Y}, \theta_{t}] \\
        &= Q(\theta \mid \theta_{t}) - \mathbb{E}[\ln\, f(\mathcal{X}\mid\mathcal{Y}, \theta_{t}) \mid \mathcal{Y}, \theta_{t}] \\ 
    \end{align}
$$
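
To make the lower bound concrete, here is a minimal numerical sketch. The toy model is purely an assumption for illustration: a latent $x \in \{0, 1, 2\}$ with fixed probabilities, and a single observation $y = 1$ drawn from a Normal centered at $\theta x$. The helper names (`toy_joint`, `toy_log_lik`, `toy_lower_bound`) are hypothetical. The check is that $g(\theta \mid \theta_t)$ never exceeds $\ell(\theta; \mathcal{Y})$ on a grid of $\theta$ values and that the two coincide at $\theta = \theta_t$.

```python
import numpy as np
import scipy.stats as st

# Assumed toy model: latent x in {0, 1, 2}, one observation y = 1
x_vals = np.arange(3)
prior_x = np.array([0.5, 0.3, 0.2])
y_obs = 1.0

def toy_joint(theta):
    """f(y, x | theta) for each latent state x, at the fixed observation."""
    return prior_x * st.norm.pdf(y_obs, loc=theta * x_vals, scale=1.0)

def toy_log_lik(theta):
    """ell(theta; y): log of the joint summed over the latent states."""
    return np.log(toy_joint(theta).sum())

def toy_lower_bound(theta, theta_t):
    """g(theta | theta_t), using Q(x) = f(x | y, theta_t)."""
    post_t = toy_joint(theta_t) / toy_joint(theta_t).sum()
    return np.sum(post_t * np.log(toy_joint(theta) / post_t))

theta_t = 0.4
thetas = np.linspace(-1.0, 2.0, 61)
gaps = np.array([toy_log_lik(th) - toy_lower_bound(th, theta_t) for th in thetas])

print("smallest gap on the grid:", gaps.min())
print("gap at theta_t:", toy_log_lik(theta_t) - toy_lower_bound(theta_t, theta_t))
```

The gap is the quantity that the KL-divergence term accounts for later in this notebook.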

### Putting it all together
The so-called Q-function, $Q(\theta \mid \theta_{t})$, is computed in the **E-step** of the EM algorithm and maximized over $\theta$ in the **M-step**. More specifically, the **E-step** and **M-step** work as follows:
\
\
<u>**E-step**</u>
\
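Compute $Q(\theta \mid \theta_{t})$, the expectation of the complete-data log likelihood over the unobserved data $\mathcal{X}$, conditional on the observed data $\mathcal{Y}$ and the current estimate $\theta_t$: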

$$
    \begin{align}
       Q(\theta \mid \theta_{t}) &= \mathbb{E}[\ln\, f(\mathcal{X},\mathcal{Y}\mid \theta) \mid \mathcal{Y}, \theta_{t}] \\ 
       &= \int_{x}\, f(x \mid \mathcal{Y}, \theta_{t}) \ln\,f(x, \mathcal{Y}\mid \theta)\,dx
    \end{align}
$$


<u>**M-step**</u>
\
The **M-step** assigns $\theta_{t+1}$ to the value of $\theta$ that maximizes $Q(\theta \mid \theta_{t})$. In other words, 

$$
    \begin{align}
        \theta_{t+1} = \arg\max_{\theta}\, Q(\theta \mid \theta_{t}) 
    \end{align}
$$ 

Because $g(\theta \mid \theta_{t})$ and $Q(\theta \mid \theta_{t})$ differ only by a term that does not depend on $\theta$, maximizing $Q$ also maximizes $g$, so the **M-step** enforces that $g(\theta_{t} \mid \theta_{t})\leq g(\theta_{t+1} \mid \theta_{t})$. Therefore, we have the inequality $\ell(\theta_{t};\mathcal{Y}) = g(\theta_{t} \mid \theta_{t}) \leq g(\theta_{t+1} \mid \theta_{t}) \leq \ell(\theta_{t+1}; \mathcal{Y})$, which shows that an iteration of the algorithm never decreases the log likelihood!

We iterate between the **E-Step** and **M-Step** until the difference between $\theta_{t}$ and $\theta_{t+1}$ is negligibly small, indicating convergence. 
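
The loop described above can be sketched on a deliberately simple problem. Everything in this block is an assumption made for illustration: the observed data `y` are simulated from two unit-variance Normals, the mixing weights are fixed at 0.5, and only the two means are estimated (by maximum likelihood, with no prior). We record the log likelihood after every iteration so we can check that it never decreases.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Assumed toy data: two well-separated unit-variance components
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def toy_log_lik(mu):
    """Observed-data log likelihood with equal weights and unit variances."""
    return np.sum(np.log(0.5 * st.norm.pdf(y[:, None], mu, 1.0).sum(axis=1)))

mu = np.array([-0.5, 0.5])   # starting guess theta_0
trace = [toy_log_lik(mu)]

for _ in range(200):
    # E-step: posterior probability of each component for each observation
    w = st.norm.pdf(y[:, None], mu, 1.0)
    w /= w.sum(axis=1, keepdims=True)

    # M-step: the maximizer of Q is the probability-weighted mean
    mu_new = (w * y[:, None]).sum(axis=0) / w.sum(axis=0)
    trace.append(toy_log_lik(mu_new))

    # Stop when the parameters barely move between iterations
    converged = np.max(np.abs(mu_new - mu)) < 1e-8
    mu = mu_new
    if converged:
        break

print("estimated means:", mu)
print("log likelihood never decreased:", bool(np.all(np.diff(trace) >= -1e-8)))
```

The small negative tolerance in the last line only guards against floating-point round-off.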

Application of the EM Algorithm to Bayesian Inference
=======================================================
The primary goal of Bayesian inference is to evaluate the posterior distribution. This can be challenging when the posterior is complex, multi-modal, or otherwise difficult to evaluate directly. In these cases, statistics such as the mode of the posterior distribution can be useful for summarizing the distribution. In the section below, we will describe how the EM algorithm can be a useful tool for finding the mode of a posterior distribution.

Notice that the log likelihood can be decomposed as follows:

$$
    \begin{align}
    \ell(\theta; \mathcal{Y}) &= \int_{x}\, f(x\mid\mathcal{Y},\theta_{t})\,\ln\left(\frac{f(x,\mathcal{Y}\mid\theta)}{f(x\mid\mathcal{Y},\theta_{t})}\,\frac{f(x\mid\mathcal{Y},\theta_{t})}{f(x\mid\mathcal{Y},\theta)}\right)\,dx \\
    &= \int_{x} f(x\mid\mathcal{Y},\theta_{t})\,\ln\left(\frac{f(x,\mathcal{Y}\mid\theta)}{f(x\mid\mathcal{Y},\theta_{t})}\right)\,dx + \int_{x} f(x\mid\mathcal{Y},\theta_{t})\, \ln\left(\frac{f(x\mid\mathcal{Y},\theta_{t})}{f(x\mid\mathcal{Y},\theta)}\right)\,dx \\
    &= Q(\theta \mid \theta_{t}) - \mathbb{E}[\ln\, f(\mathcal{X}\mid\mathcal{Y}, \theta_{t}) \mid \mathcal{Y}, \theta_{t}] + D_{\mathrm{KL}}(f(\mathcal{X}\mid\mathcal{Y},\theta_{t})\| f(\mathcal{X}\mid\mathcal{Y}, \theta)) \\
    &= g(\theta\mid\theta_{t}) + D_{\mathrm{KL}}(f(\mathcal{X}\mid\mathcal{Y},\theta_{t})\| f(\mathcal{X}\mid\mathcal{Y}, \theta))
    \end{align}
$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence. For general probability distributions $p$ and $q$, it can be shown that the KL divergence satisfies $D_{\mathrm{KL}}(p\|q) \geq 0$ and equals zero if and only if $p=q$. It follows then that when $\theta = \theta_{t}$, $\ell(\theta_{t};\mathcal{Y}) = g(\theta_t\mid\theta_{t})$, which we showed above. Returning to the log posterior and substituting the log likelihood above,

$$
    \ln\,g(\theta\mid\mathcal{Y}) = g(\theta\mid\theta_{t}) +  D_{\mathrm{KL}}(f(\mathcal{X}\mid\mathcal{Y},\theta_{t})\| f(\mathcal{X}\mid\mathcal{Y}, \theta)) + \ln\,g(\theta) - \ln\,f(\mathcal{Y})
$$

If we wish to use the EM algorithm to find the mode of the posterior, the E-step is exactly the same as above, since the log prior, $\ln\,g(\theta)$, does not involve the latent variable, $\mathcal{X}$. The M-step, however, must be updated to include the log prior, so that $\theta_{t+1} = \arg\max_{\theta}\, \left[Q(\theta \mid \theta_{t}) + \ln\,g(\theta)\right]$.
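
As a rough illustration of how the prior enters only the M-step, the sketch below runs this posterior-mode version of EM on an assumed toy model: a latent $x \in \{0, 1, 2\}$ with fixed probabilities, one observation $y = 1$ drawn from a Normal centered at $\theta x$ with unit variance, and a standard Normal prior on $\theta$. All names and numbers are hypothetical. For this model, maximizing $Q(\theta \mid \theta_t) + \ln g(\theta)$ has the closed form $\theta_{t+1} = y\,\mathbb{E}[x] / (\mathbb{E}[x^2] + 1)$, where the expectations are over $f(x \mid y, \theta_t)$ and the "+ 1" comes from the prior.

```python
import numpy as np
import scipy.stats as st

# Assumed toy model: latent x in {0, 1, 2}, one observation y = 1
x_vals = np.arange(3)
prior_x = np.array([0.5, 0.3, 0.2])
y_obs = 1.0

def toy_joint(theta):
    """f(y, x | theta) for each latent state x."""
    return prior_x * st.norm.pdf(y_obs, loc=theta * x_vals, scale=1.0)

def toy_log_prior(theta):
    """Standard Normal prior on theta (an assumption for this sketch)."""
    return st.norm.logpdf(theta, loc=0.0, scale=1.0)

def toy_log_posterior(theta):
    """Log posterior up to the additive constant -ln f(y)."""
    return np.log(toy_joint(theta).sum()) + toy_log_prior(theta)

theta = 2.0
trace = [toy_log_posterior(theta)]
for _ in range(50):
    # E-step: unchanged by the prior
    post_x = toy_joint(theta) / toy_joint(theta).sum()

    # M-step: maximize Q + log prior (closed form for this toy model)
    e_x = np.sum(post_x * x_vals)
    e_x2 = np.sum(post_x * x_vals**2)
    theta = y_obs * e_x / (e_x2 + 1.0)
    trace.append(toy_log_posterior(theta))

print("final theta:", theta)
print("log posterior never decreased:", bool(np.all(np.diff(trace) >= -1e-12)))
```

Dropping the `+ 1.0` in the denominator would recover the maximum-likelihood M-step for the same toy model.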

Example: A Normal-Mixture Model
===============================
A classic application of this algorithm is assigning data points to components in a normal mixture model. You can think of this as a generalized form of the k-means clustering algorithm, which you may be familiar with. The core idea is that we assume the data points are generated from a mixture of a known number of Gaussian distributions.

In this example, we'll simulate data from three Gaussian distributions, each with distinct location and scale parameters. After defining the full model for these data, we'll explore how we can use the EM algorithm to identify the mode of the posterior distribution.  

In [2]:
# Simulate multiple gaussian RVs
g1 = st.norm.rvs(size=500, loc=30, scale=1.75)
g2 = st.norm.rvs(size=500, loc=25, scale=1)
g3 = st.norm.rvs(size=250, loc=40, scale=0.25)

df = pd.DataFrame.from_dict({"g1" : g1, "g2": g2, "g3":g3}, orient='index').stack()
df = df.reset_index(level=0, name='value').rename(columns={'level_0':'group'})

df.head()

|   | group | value |
|---|-------|-----------|
| 0 | g1 | 31.430268 |
| 1 | g1 | 31.438461 |
| 2 | g1 | 32.894517 |
| 3 | g1 | 31.280249 |
| 4 | g1 | 30.312977 |


Let's do a quick visualization of the simulated data. 

In [3]:
bokeh.io.show(
    iqplot.ecdf(
        data=df,
        q='value',
        cats='group',
        x_axis_label='Realized value'
    )
)

It is clear that the distributions for each simulated group are distinct. To see where the EM algorithm could be useful, consider a scenario where we did not know the distribution that generated each measurement. Our ECDF would instead look something like this:

In [4]:
bokeh.io.show(
    iqplot.ecdf(
        data=df,
        q='value',
        x_axis_label='Realized value'
    )
)

In this setting, computing statistics like the average for each group is no longer possible because we don't know which distribution generated each measurement. The EM algorithm offers a solution. Note that the notation here swaps roles relative to the derivation above: $\mathcal{X} = \{x_1, \ldots, x_N\}$ now denotes the observed measurements, and $\mathcal{Y} = \{y_1, \ldots, y_N\}$ is the unobserved random variable representing the assignment of each of the $N$ measurements to one of $K = 3$ Gaussians, where $P(y_i=j) = \pi_{j}$. Assuming that the measurements are generated from a Gaussian mixture distribution, the complete likelihood is given by

$$
\begin{align}
  f(\mathcal{X}, \mathcal{Y}\mid\theta) &= \prod_{i=1}^{N}\,f(x_i\mid y_i, \theta)f(y_i\mid\theta)\\
  &= \prod_{i=1}^{N}\,\pi_{y_{i}}\cdot\mathcal{N}(x_i \mid \mu_{y_i}, \sigma_{y_i}^2)
\end{align}
$$
where the parameters of the model are $\theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \sigma_1^2,\ldots,\sigma_K^2)$. With this model in hand, we can construct the Q function, $Q(\theta\mid\theta_t)$.


$$
  \begin{align}
    Q(\theta\mid\theta_t) &= \mathbb{E}\left[\ln\,f(\mathcal{X},\mathcal{Y}\mid\theta)\mid\mathcal{X},\theta_t\right]\\
    &= \sum_{i=1}^{N}\sum_{j=1}^{K}\, f(\mathcal{Y}=j\mid\mathcal{X}=x_i,\theta_t)\ln\,f(\mathcal{X}=x_i,\mathcal{Y}=j\mid\theta) \\
    &= \sum_{i=1}^{N}\sum_{j=1}^{K}\, \frac{f(\mathcal{X}=x_i \mid\mathcal{Y}=j, \theta_t)f(\mathcal{Y}=j\mid\theta_t)}{\sum_{k=1}^{K}\,f(\mathcal{X}=x_i \mid\mathcal{Y}=k, \theta_t)f(\mathcal{Y}=k\mid\theta_t)}\ln \left[f(\mathcal{X}=x_i \mid\mathcal{Y}=j, \theta)f(\mathcal{Y}=j\mid\theta)\right]
  \end{align}
$$

Define $\gamma_{ij} \coloneqq \frac{f(\mathcal{X}=x_i \mid\mathcal{Y}=j, \theta_{t})f(\mathcal{Y}=j\mid\theta_{t})}{\sum_{k=1}^{K}\,f(\mathcal{X}=x_i \mid\mathcal{Y}=k, \theta_{t})f(\mathcal{Y}=k\mid\theta_{t})}$ so that,

$$
  \begin{align}
  Q(\theta\mid\theta_t) = \sum_{i=1}^{N}\sum_{j=1}^{K}\,\gamma_{ij}\,\ln\left[\pi_j f(\mathcal{X}=x_i \mid\mathcal{Y}=j, \theta)\right]
  \end{align}
$$


In the E-step, we'll compute the weights $\gamma_{ij}$ from the current estimate $\theta_t$ and evaluate this function at $\theta = \theta_t$. Notice that $f(\mathcal{Y}=j\mid\mathcal{X}=x_i,\theta_t)$ represents the probability of assignment $j$ given the data and parameters. In the literature, this is referred to as the *responsibility* of the $j$-th component for $x_i$.
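
A small worked case may help build intuition for the responsibilities. The sketch below uses made-up values: two components with means $-1$ and $1$, unit variances, and equal mixing probabilities, evaluated at three points. The function name `toy_responsibilities` is hypothetical and separate from the helpers defined in this notebook. By symmetry, the point $x = 0$ should be split evenly between the components, and for $x = -1$ the first component's share works out to $1/(1 + e^{-2}) \approx 0.881$.

```python
import numpy as np
import scipy.stats as st

def toy_responsibilities(x, mu, var, pi):
    """Posterior probability of each component for each datum (rows sum to 1)."""
    weighted = pi * st.norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))
    return weighted / weighted.sum(axis=1, keepdims=True)

x = np.array([-1.0, 0.0, 1.0])
gamma = toy_responsibilities(
    x,
    mu=np.array([-1.0, 1.0]),
    var=np.array([1.0, 1.0]),
    pi=np.array([0.5, 0.5]),
)

# Rows are approximately (0.881, 0.119), (0.5, 0.5), (0.119, 0.881)
print(gamma.round(3))
```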


We'll define a few helper functions for this part of the algorithm.

In [5]:
def responsibility(x, params):
    """
    Compute the responsibility of each component for each datum in `x`.
    Returns an array of shape (n_data, n_components) whose rows sum to 1.
    """
    mu, sigma, pi = params
    # `sigma` stores variances, so take the square root to get the scale
    r0 = st.norm.pdf(x[:,None], mu, np.sqrt(sigma)) * pi

    return r0/np.sum(r0, axis=1)[:,None]

def log_complete_likelihood(x, params):
    """
    Compute log of the *complete* data likelihood for every
    (datum, component) pair
    """
    mu, sigma, pi = params
    return st.norm.logpdf(x[:,None], mu, np.sqrt(sigma)) + np.log(pi)

def q_function(x, params, params_old):
    """
    Compute the expected value of the complete log-likelihood.
    Returns the responsibilities and the value of the Q function.
    """
    resp = responsibility(x, params_old)
    ll = log_complete_likelihood(x, params)

    return resp, np.sum(resp * ll)

To describe the full generative model, we need to define priors for the parameters.

The mixing probabilities $\boldsymbol{\pi}$ have the constraint that $\sum_{j=1}^{K}\, \pi_{j} = 1$, so we will use a Dirichlet prior. According to Wikipedia, values of the concentration parameter $\boldsymbol{\alpha}$ below 1 prefer sparse distributions, and the case where each $\alpha_j = 1$ is equivalent to the uniform distribution over the $(K-1)$-simplex. We will set each $\alpha_j$ slightly above 1, which mildly favors balanced (non-sparse) mixing weights and treats the $K$ components symmetrically. That is,

$$
  \pi_1,\ldots,\pi_K \sim \text{Dirichlet}(1.5,1.5,1.5)
$$

Also, we place Normal priors with mean 30 and variance 5 on the means of the Normal distributions, $\mu_k$
$$
  \mu_k \sim \mathcal{N}(30, 5)
$$

and we will assign broad Gamma priors (shape 5, scale 1) to the variances, $\sigma_k^2$
$$
  \sigma_k^2 \sim \text{Gamma}(5, 1)
$$
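
To see what the Dirichlet concentration parameter does in practice, the following sketch compares the log density at a balanced weight vector and at a lopsided one for three symmetric choices of $\alpha$. The two test points and the variable names are assumptions chosen for illustration. With $\alpha = 0.5$ the lopsided point should have the higher density, with $\alpha = 1$ the two should tie (the density is flat), and with $\alpha = 1.5$ the balanced point should win.

```python
import numpy as np
import scipy.stats as st

# Two hypothetical points on the 2-simplex
balanced = np.array([1/3, 1/3, 1/3])
lopsided = np.array([0.90, 0.05, 0.05])

results = {}
for alpha in (0.5, 1.0, 1.5):
    concentration = np.full(3, alpha)
    results[alpha] = (
        st.dirichlet.logpdf(balanced, concentration),
        st.dirichlet.logpdf(lopsided, concentration),
    )
    print(f"alpha={alpha}: balanced={results[alpha][0]:.3f}, "
          f"lopsided={results[alpha][1]:.3f}")
```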

Let's code up these priors so that we can compare the results from direct numerical optimization of the posterior to those from the EM algorithm. As we will see below, we can derive the update rules for each parameter analytically, so the EM algorithm itself does not need to evaluate the prior at each step.

In [6]:
def log_prior(params):
    """
    Compute the log prior
    """
    mu, sigma, pi = params
    
    if (params <= 0).any():
        return -np.inf
    
    if (np.abs(np.sum(pi) - 1.0) > 10e-10):
        return -np.inf
    
    lp = np.sum(st.norm.logpdf(mu, 30, np.sqrt(5)))
    lp += np.sum(st.gamma.logpdf(sigma, 5, loc=0, scale=1))
    lp += st.dirichlet.logpdf(pi, np.array([1.5,1.5,1.5]))
    
    return lp

def log_likelihood(x, params):
    mu, sigma, pi = params
    
    lik = st.norm.pdf(x[:,None], mu, np.sqrt(sigma)) * pi
    lik = np.sum(lik, axis=1)
    
    return np.sum(np.log(lik))

def log_posterior(params, x):
    lp = log_prior(params)
    if lp == -np.inf:
        return lp
    
    return lp + log_likelihood(x, params)

def neg_log_posterior(params, x):
    params = params.reshape(3,-1)
    return -log_posterior(params, x)


The M-step involves finding the value of $\theta$ that maximizes $Q(\theta\mid\theta_{t}) + \ln\,g(\theta)$. We can compute this value analytically by taking partial derivatives and setting the expressions equal to 0. With a bit of calculus and algebra, we arrive at the following update rules for the parameters:

For $\sigma_{k}^2$ with a Gamma prior:  

$$
\begin{aligned}
  {\sigma_k^2}^{(t+1)} &= \frac{\sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)^2 + 2 \beta_0}{\sum_{i=1}^{N} \gamma_{ik} + 2 (\alpha_0 - 1)}
\end{aligned}
$$

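where $\alpha_0$ and $\beta_0$ are the shape and scale parameters of the Gamma prior. Note that this closed form should be read as an approximation: with a Gamma prior placed directly on $\sigma_k^2$, setting the derivative of $Q(\theta\mid\theta_t) + \ln\,g(\theta)$ to zero gives the quadratic $\frac{2}{\beta_0}(\sigma_k^2)^2 + \left(\sum_{i=1}^{N} \gamma_{ik} - 2(\alpha_0 - 1)\right)\sigma_k^2 - \sum_{i=1}^{N} \gamma_{ik}(x_i - \mu_k)^2 = 0$ rather than a linear equation. Both expressions approach the responsibility-weighted sample variance as the number of observations assigned to the component grows, so the difference is small here.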

For $\mu_k$ with a Normal prior:
$$
\begin{aligned}
  {\mu_k}^{(t+1)} &= \frac{\sum_{i=1}^{N} \gamma_{ik} x_i + \frac{\sigma_k^2}{\tau_0^2} \mu_0}{\sum_{i=1}^{N} \gamma_{ik} + \frac{\sigma_k^2}{\tau_0^2}}
\end{aligned}
$$

where $\mu_0$ and $\tau_0^2$ are the mean and variance of the Normal prior.

The mixing probabilities are updated with the following rule:
$$
\begin{aligned}
{\pi_k}^{(t+1)} = \frac{\sum_{i=1}^{N} \gamma_{ik} + (\alpha_k-1)}{\sum_{j=1}^{K} \left( \sum_{i=1}^{N} \gamma_{ij} + (\alpha_j-1) \right)}
\end{aligned}
$$


This shows that the responsibilities, $\gamma_{ij}$, determine the update rules for all of the parameters.
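
One way to gain confidence in a closed-form update is to compare it against a numerical maximizer. The sketch below does this for the $\mu_k$ rule only, using made-up data and made-up responsibilities for a single component; the variable names and values are assumptions, not quantities from this notebook. It maximizes the $\mu_k$-dependent part of $Q(\theta\mid\theta_t) + \ln g(\theta)$ with `scipy.optimize.minimize_scalar` and checks that the result agrees with the formula above.

```python
import numpy as np
import scipy.optimize

rng = np.random.default_rng(1)
x = rng.normal(28.0, 2.0, size=50)          # made-up measurements
gamma_k = rng.uniform(0.0, 1.0, size=50)    # made-up responsibilities for one component
var_k, mu0, tau0_sq = 2.0, 30.0, 5.0        # component variance, prior mean, prior variance

def neg_objective(mu):
    """Negative of the mu-dependent terms of Q + log prior."""
    q_part = -np.sum(gamma_k * (x - mu) ** 2) / (2.0 * var_k)
    prior_part = -((mu - mu0) ** 2) / (2.0 * tau0_sq)
    return -(q_part + prior_part)

# Closed-form update for mu_k with a Normal prior
closed_form = (np.sum(gamma_k * x) + (var_k / tau0_sq) * mu0) / (
    np.sum(gamma_k) + var_k / tau0_sq
)

# Numerical maximizer over a generous bracket
numeric = scipy.optimize.minimize_scalar(
    neg_objective, bounds=(0.0, 60.0), method="bounded"
).x

print("closed form:", closed_form)
print("numerical:  ", numeric)
```

The same pattern can be applied to the other parameters, replacing the objective with the terms that depend on the parameter of interest.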

In [7]:
def update_mu(x, r, sigma, *, hyper_mu=30, hyper_tau=5):
    """
    Update rule for mu. `sigma` holds the component variances and
    `hyper_tau` is the variance of the Normal prior on mu.
    """
    mu_new = np.sum(r * x[:,None], axis=0) + (sigma/hyper_tau)*hyper_mu
    mu_new /= np.sum(r, axis=0) + (sigma/hyper_tau)

    return mu_new

def update_pi(x, r, *, hyper_alpha=np.array([1.5, 1.5, 1.5])):
    """
    Update mixing probabilities
    """
    # The double sum of the responsibilities equals N
    # because each row of `r` sums to 1.
    N = x.shape[0]

    pi_new = np.sum(r, axis=0) + (hyper_alpha - 1)
    pi_new /= N + np.sum(hyper_alpha - 1)

    return pi_new

def update_sigma(x, r, mu, *, hyper_alpha=5, hyper_beta=1):
    """
    Update rule for sigma (the component variances)
    """
    sigma_new = np.sum((x[:,None] - mu)**2 * r, axis=0) + 2*hyper_beta
    sigma_new /= np.sum(r, axis=0) + 2 * (hyper_alpha-1)

    return sigma_new

Now, we'll define the E-step and M-step and compose these functions into a single function that runs the EM algorithm until convergence. 

In [8]:
def e_step(x, params):
    """
    Expectation step of the EM algorithm
    """
    return q_function(x, params, params)

def m_step(x, r, params):
    """
    Update step of the EM algorithm
    """
    mu, sigma, pi = params
    
    new_mu = update_mu(x, r, sigma)
    new_sigma = update_sigma(x, r, new_mu)
    new_pi = update_pi(x, r)
    
    return np.array([new_mu, new_sigma, new_pi])

In [9]:
def em_algorithm(x, params_init, *, max_iter=100, tolerance=1e-6):
    """
    Run the EM algorithm to fit a Gaussian mixture model.
    """
    q_old = -np.inf
    params = params_init

    for iteration in range(max_iter):
        # Use the `x` argument rather than the global `data`
        rp, q = e_step(x, params)

        if np.abs(q - q_old) < tolerance:
            print(f"Converged at iteration {iteration}")
            break

        params = m_step(x, rp, params)
        q_old = q
    else:
        # Runs only when the loop finishes without hitting `break`
        print(f'Failed to converge after {max_iter} iterations')

    return params

Now, we'll initialize the parameters of the model. For $\mu$, we'll sample values from the data vector. For $\pi$, we'll start by assuming each of the groups is equally represented, and for $\sigma$ we will use a scaled plug-in estimate of the variance of the data.

In [10]:
data = df['value'].values
n = data.shape[0]
n_components = 3

# Initialize parameters
sigma_init = np.array([np.var(data)/3] * n_components)
pi_init = np.array([1/n_components] * n_components)
mu_init = data[np.random.choice(n, n_components, replace = False)]

params_init = np.array([mu_init, sigma_init, pi_init])

Now we are ready to run the EM algorithm. 

In [11]:
mode_posterior = em_algorithm(data, params_init, max_iter=2000)

Converged at iteration 83


Let's take a look at the results. Recall that the order is arbitrary. 

In [12]:
for i, parameters in enumerate(zip(*mode_posterior)):
    print(
        """
        Estimated \u03BC for group {0}: {1:.2f}
        Estimated σ² for group {0}: {2:.2f}
        Estimated \u03C0 for group {0}: {3:.2f}
        """.format(i, *parameters)
    )


        Estimated μ for group 0: 40.00
        Estimated σ² for group 0: 0.07
        Estimated π for group 0: 0.20
        

        Estimated μ for group 1: 30.13
        Estimated σ² for group 1: 2.84
        Estimated π for group 1: 0.39
        

        Estimated μ for group 2: 25.13
        Estimated σ² for group 2: 0.94
        Estimated π for group 2: 0.41
        


Pretty close to the true values for these parameters! We can also predict which distribution generated each datum. This is easy to do once we compute the responsibilities using our estimated parameters.  

In [13]:
rsp = responsibility(data, mode_posterior)

# Predict based on greatest responsibility
df['em_prediction'] = np.argmax(rsp, axis=1).astype(str)

Since our predictions are categorical, we can construct a quick visualization to see how well we did.  

In [14]:
group_estimates = df.groupby(['group','em_prediction']).size()

df_plot = pd.DataFrame({
    "count" : group_estimates.values,
    "factors" : group_estimates.index.values
})

In [15]:
source = bokeh.plotting.ColumnDataSource(df_plot)

p = bokeh.plotting.figure(
    x_range=bokeh.models.FactorRange(*df_plot.factors.values), 
    width=450, 
    height=300,
    toolbar_location=None, 
     tooltips=[
        ("count", "@{count}"),
    ])

p.vbar(x='factors', top='count', width=0.75, source=source)


bokeh.io.show(p)

For the most part, there's good alignment between the true category and the predicted category. For fun, let's compare these to the results from direct optimization of the posterior to find the MAP.

In [16]:
args = (data,)

# Compute the MAP
res = scipy.optimize.minimize(
    neg_log_posterior, params_init.flatten(), args=args, method="powell",
    
)

map_posterior = res.x.reshape(3,-1)



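We'll print these results. Again, remember that the order may not match exactly. Note also that the mixing probabilities in the output stay at their initial value of 1/3. A likely explanation is that `log_prior` returns `-np.inf` whenever the $\pi_j$ do not sum to one, so an unconstrained optimizer that varies one coordinate at a time, as Powell's method does, cannot move them.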

In [17]:
for i, parameters in enumerate(zip(*map_posterior)):
    print(
        """
        Estimated \u03BC for group {0}: {1:.2f}
        Estimated σ² for group {0}: {2:.2f}
        Estimated \u03C0 for group {0}: {3:.2f}
        """.format(i, *parameters)
    )


        Estimated μ for group 0: 40.00
        Estimated σ² for group 0: 0.06
        Estimated π for group 0: 0.33
        

        Estimated μ for group 1: 30.10
        Estimated σ² for group 1: 2.98
        Estimated π for group 1: 0.33
        

        Estimated μ for group 2: 25.11
        Estimated σ² for group 2: 0.96
        Estimated π for group 2: 0.33
        


In [None]:
%load_ext watermark
%watermark -v -p numpy,scipy,bokeh,jupyterlab