# Hierarchical Hidden Markov Model

## Background

The seminal paper on hierarchical HMMs appears to be 
[(Fine, Singer and Tishby)](#References "Reference [1]: The Hierarchical Hidden Markov Model: Analysis and Applications"). However, this paper assumes that the hierarchy must be a tree structure.

As an alternative, the paper by 
[(Bui, Phung and Venkatesh)](#References "Reference [2]: Hierarchical Hidden Markov Models with General State Hierarchy") allows a more general hierarchical structure, represented by
the dynamic Bayesian network (DBN) shown below.
<img src="DBN_original.png" title="Dynamic Bayesian Network for Hierarchical HMM">

This notebook represents my thoughts and re-derivation of the model of
[(Bui, Phung and Venkatesh)](#References "Reference [2]: Hierarchical Hidden Markov Models with General State Hierarchy").
We shall mostly retain their notation, but slightly alter some initial assumptions
and some symbology.

## Structural Levels

### Implicit level 0

My first problem with the model of [(Bui, Phung and Venkatesh)](#References "Reference [2]: Hierarchical Hidden Markov Models with General State Hierarchy") is that they assume there is only a single state at level 1, such that $q^{1}_{t}=1$ for all $t=1,2,\ldots,T$. Next, they assume from observation of $y_1,y_2,\ldots,y_T$ that the level 1 sequence cannot finish before $t=T$, and thus $e^{1}_t=0$ for $t=1,2,\ldots,T-1$, with $e^1_T$ being arbitrary and irrelevant. These assumptions essentially mean that we only have $D-1$ useful levels.

In contrast, we shall here reserve an implicit level 0 with these same properties, such that choosing $D=1$ should recover an ordinary HMM. Thus, at level 0 we permit only the single state 
$\mathcal{Q}^0=\{\sigma^0_1\}$. The necessary parameters at level 0 are 
$\boldsymbol{\Pi}^0=(\boldsymbol{\pi}^{0,1})$ and 
$\boldsymbol{\mathcal{A}}^0=(\mathbf{A}^{0,1})$, where
the $0$ denotes level 0 and the $1$ indexes the (single, allowable) state at level 0. 
These parameters control the subprocess at level 1, as explained in the next 
[section](#Level-1 "Section: Level 1").

As discussed further in a later [section](#Temporal-Stages "Section: Temporal Stages"),
for mathematical convenience we take $e^0_t=0$ for $t=1,2,\ldots,T-1$. However, we allow
the variable $e^0_T$ to be controlled, since $e^0_T=1$ whenever it is known in advance that the observed sequence terminates after stage $T$. This special handling is discussed at [stage T](#Stage-T "Section: Stage T").

### Level 1

The allowable states at level 1 are arbitrarily specified by the finite set
$\mathcal{Q}^1=\{\sigma^1_1,\sigma^1_2,\ldots\}$.
The initial state $q^1_1\in\mathcal{Q}^1$ of level 1 at stage $t=1$ is stochastically selected
via $q^1_1\mid q^0_1\sim\mathcal{D}(\boldsymbol{\Pi}^{0})$. That is, the process chooses
$q^1_1=\sigma^1_i$ with probability 
\begin{eqnarray}
P(q^1_1=\sigma^1_i\mid q^0_1=\sigma^0_1) & = & \pi^{0,1}_i\,,
\end{eqnarray}
where $\pi^{0,1}_\cdot\doteq\sum_{i=1}^{\left|\mathcal{Q}^1\right|}\pi^{0,1}_i = 1$.
This assumption is discussed further in a later 
[section](#Stage-1 "Section: Stage 1").

Following that initial choice of state, subsequent states $q^1_t$ are mostly chosen via stochastic state transitions. However, from the 
[DBN](#Background "Section: Background")
we see that the transition from stage $t$ to stage $t+1$ is controlled by the *state completion indicator* $e^1_t\in\{0,1\}$.

To explain this indicator, note that due to the hierarchical nature of the model, the state $q^1_t$ at level 1 recursively controls the
subprocesses at subsequent levels $d=2,3,\ldots,D$ (see the next 
[section](#Level-d "Section: Level d")). At some point, we have a notion that these
subprocesses have completed, and control returns back to state $q^1_t$. Now, at this
point, at the end of stage $t$, the state $q^1_t$ has either completed its work, 
denoted by $e^1_t=1$, or else it is still in use, denoted by $e^1_t=0$.

In the former case, if the state $q^1_t$ has completed its job after stage $t$, then the process at level 1 transitions to another state $q^1_{t+1}$ via 
$q^1_{t+1}\mid q^1_t,e^1_t=1\sim\mathcal{D}(\boldsymbol{\mathcal{A}}^{0})$.
That is, state $q^1_t=\sigma^1_i$ transitions to state $q^1_{t+1}=\sigma^1_j$
with time-invariant probability 
\begin{eqnarray}
P(q^1_{t+1}=\sigma^1_j\mid q^1_t=\sigma^1_i, q^0_t=\sigma^0_1, e^1_t=1) & = & A^{0,1}_{i,j}\,,
\end{eqnarray}
where $A^{0,1}_{i,\cdot}\doteq\sum_{j=1}^{\left|\mathcal{Q}^1\right|}A^{0,1}_{i,j}=1$.

However, in the latter case, if state $q^1_t$ is not finished, then it is responsible for further
subprocesses at stage $t+1$. We denote this continuing use of state $q^1_t=\sigma^1_i$ by ensuring that 
$q^1_{t+1}=\sigma^1_i$ when $e^1_t=0$.

Consequently, at level 1, the general update from state $q^1_t$ at stage $t$ to state $q^1_{t+1}$ at stage $t+1$ is given by
\begin{eqnarray}
P(q^1_{t+1}=\sigma^1_j\mid q^1_t=\sigma^1_i,q^0_t=\sigma^0_1,e^1_t) & = &
\left\{\begin{array}{ll}
\delta_{ij} & \mbox{if}~e^1_t=0\,,\\
A^{0,1}_{i,j} & \mbox{otherwise}
\end{array}
\right.
\,,
\end{eqnarray}
for $t=1,2,\ldots,T-1$.
Note that it is possible for state $q^1_t$ to transition to the *same* state $q^1_{t+1}$
when $e^1_t=1$. Thus, merely observing that $q^1_{t+1}=q^1_t$ is **not** sufficient to determine whether or not the state has completed at stage $t$.

How can we know when state $q^1_t$ has finished after stage $t$? We observed above that if the process
at level 1 has completed (for the current stage), then all subprocesses must also have completed. At this point, there must be a stochastic decision as to whether or not the
state has finished, which clearly depends upon the state $q^1_t$.

Conversely, if any subprocess has not completed, then the superprocess(es)
cannot have completed either.
Consequently, the decision for the indicator 
$e^1_t\in\{0,1\}$ at level 1
must also depend upon the indicator $e^2_t$ at level 2.

What other information is needed? Since control passes down the hierarchy from superprocess to subprocess, and then back up the hierarchy from subprocess to superprocess, it makes sense if at some point the subprocess can signal the superprocess. In particular, the superprocess may possibly complete upon the subprocess
reaching a certain state.

These various dependencies are all included in the 
[DBN](#Background "Section: Background").
Consequently, the completion indicator $e^1_t$ is sampled from
$e^1_t\mid q^1_t,q^2_t,e^2_t\sim\mathcal{D}(\boldsymbol{\mathcal{T}}^{1})$,
for level 1 completion parameters 
$\boldsymbol{\mathcal{T}}^1=
(\boldsymbol{\tau}^{1,p})_{p=1}^{\left|\mathcal{Q}^1\right|}$.
The exact calculation is deferred to the next [section](#Level-d "Section: Level d").
The other parameters required at level 1 are
$\boldsymbol{\Pi}^{1}=(\boldsymbol{\pi}^{1,p})_{p=1}^{\left|\mathcal{Q}^{1}\right|}$ and
$\boldsymbol{\mathcal{A}}^{1}=(\mathbf{A}^{1,p})_{p=1}^{\left|\mathcal{Q}^{1}\right|}$,
which control the subprocess at level 2.

### Level d

In the previous [section](#Level-1 "Section: Level 1"), the process at level 1
depended strongly upon the implicit superprocess at level 0, and weakly on the
subprocess at level 2.
In this section, we shall generalise the process to arbitrary (explicit) level $d$ for
$d=1,2,\ldots,D-1$, and sometimes also for $d=D$.

Firstly, we let the allowable states of level $d$ be arbitrarily specified
by the finite set $\mathcal{Q}^d=\{\sigma^d_1,\sigma^d_2,\ldots\}$.
The size of these various state sets will need to be specified in advance.

Next, we note that at level $d$ the state $q^d_t$ of the process
depends upon the state $q^{d-1}_t$ of the superprocess. Thus, we permit
each particular parent state,
say $q^{d-1}_t=\sigma^{d-1}_p$, to subselect different child states from 
$\mathcal{Q}^d$.

The first stage $t=1$ of level $d$ is initialised with state
$q^d_1\mid q^{d-1}_1\sim\mathcal{D}(\boldsymbol{\Pi}^{d-1})$. That is,
state $q^d_1=\sigma^d_i$ is chosen with probability
\begin{eqnarray}
P(q^d_1=\sigma^d_i\mid q^{d-1}_1=\sigma^{d-1}_p) & = & \pi^{d-1,p}_{i}\,,
\end{eqnarray}
where $\pi^{d-1,p}_\cdot\doteq\sum_{i=1}^{\left|\mathcal{Q}^d\right|}\pi^{d-1,p}_i = 1$,
for $d=1,2,\ldots,D$.

In terms of whether or not state $q^d_t$ is completed after stage $t$
(denoted by $e^d_t=1$ or $e^d_t=0$, respectively),
the logic of the previous [section](#Level-1 "Section: Level 1") 
holds true in general. That is, state $q^d_t$ cannot finish
until state $q^{d+1}_t$ has finished, and even then state $q^d_t$ will only complete
at stage $t$ with some probability; with the complementary probability it will continue
to be active at stage $t+1$.

Consequently, state $q^d_t$ is completed (or not) according to
$e^d_t\mid q^d_t,q^{d+1}_t,e^{d+1}_t\sim\mathcal{D}(\boldsymbol{\mathcal{T}}^{d})$.
In particular,
\begin{eqnarray}
P(e^d_t=1\mid q^d_t=\sigma^d_p, q^{d+1}_t=\sigma^{d+1}_i,e^{d+1}_t) & = &
\left\{\begin{array}{ll}
0 & \mbox{if}~e^{d+1}_t=0\,,\\
\tau^{d,p}_{i} & \mbox{otherwise}\,,
\end{array}
\right.
\end{eqnarray}
for $d=1,2,\ldots,D-1$ and $t=1,2,\ldots,T$.
Note that $\tau^{d,p}_{i}$ is the time-invariant probability that the parent state $q^d_t=\sigma^d_p$ is complete once the subprocess completes after reaching child state $q^{d+1}_t=\sigma^{d+1}_i$. This does **not**
sum to unity over child states $\sigma^{d+1}_i\in\mathcal{Q}^{d+1}$.
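
As a minimal sketch of the completion rule above, the following Python function evaluates $P(e^d_t=1\mid\ldots)$ from a table of completion parameters. The nested-list layout `tau_d[p][i]` (zero-based parent index `p` and child index `i`), the function name and the numerical values are illustrative assumptions, not part of the model of Bui, Phung and Venkatesh.

```python
def completion_prob(tau_d, p, i, e_child):
    """Return P(e^d_t = 1 | q^d_t = sigma^d_p, q^{d+1}_t = sigma^{d+1}_i, e^{d+1}_t).

    tau_d[p][i] is assumed to hold tau^{d,p}_i with zero-based indices.
    """
    if e_child == 0:
        # The child has not finished, so the parent cannot finish either.
        return 0.0
    return tau_d[p][i]


# Hypothetical parameters: 2 parent states, 3 child states.
tau_d = [
    [0.0, 0.3, 1.0],  # parent state 0
    [0.5, 0.5, 0.0],  # parent state 1
]

blocked = completion_prob(tau_d, p=0, i=2, e_child=0)  # child still active
allowed = completion_prob(tau_d, p=0, i=2, e_child=1)  # child complete
row_sum = sum(tau_d[0])  # independent probabilities, so this need not be 1
```

Each entry of `tau_d` is a separate Bernoulli parameter, which is why a row such as `tau_d[0]` is free to sum to a value other than one.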

In terms of transitioning between states at level $d$, the dynamics again mostly follow from those of the previous [section](#Level-1 "Section: Level 1"), 
except that now for $d>1$ there is an additional behaviour arising from the fact that state $q^{d-1}_t$ is permitted to complete 
(unlike for level 0, where $e^0_t=0$ for $t=1,2,\ldots,T-1$).
Thus, the 'transition' from state $q^d_t$ to $q^d_{t+1}$ now depends upon whether or
not parent state $q^{d-1}_t$ is complete.
Furthermore, the 'transition' actually depends on the next parent state $q^{d-1}_{t+1}$,
not on the current state $q^{d-1}_t$. This did not matter at level 1,
since there could only ever be one parent state $\sigma^0_1$.
From the [DBN](#Background "Section: Background"), we therefore sample according to
$q^d_{t+1}\mid q^d_t,q^{d-1}_{t+1},e^d_t,e^{d-1}_t\sim
\mathcal{D}(\boldsymbol{\mathcal{A}}^{d-1},\boldsymbol{\Pi}^{d-1})$.

We can view the control flow as follows. If state $q^d_t=\sigma^d_p$ is not complete at stage $t$, then that state is still active (i.e. reused) at stage $t+1$
with $q^d_{t+1}=q^d_t$, and control remains at level $d$. At the same time, none of the superprocesses may complete, and all super-states also remain the same at stage $t+1$. 

However, if state $q^d_t$ is complete, then control passes up to the superprocess
at level $d-1$, across to the stage $t+1$ with parent state $q^{d-1}_{t+1}$,
and then back down to the subprocess at level $d$ with the selection of state
$q^d_{t+1}$. This next state depends upon whether or not the superprocess also
completed with its state $q^{d-1}_t$. If $q^{d-1}_t$ is complete, then a new
state $q^{d-1}_{t+1}$ is sampled at level $d-1$, triggering the selection of a new 'initial' state at level $d$. In other words, completion at level $d-1$ 
(and level $d$) results in the ending of the current subsequence at level $d$ and the commencement of another subsequence.

Conversely, if $q^{d-1}_t$ is not complete, then the parent state persists as 
$q^{d-1}_{t+1}=q^{d-1}_t$, and the next child state
$q^d_{t+1}$ is selected via allowable transitions from $q^d_t$. In other words,
whilst the superprocess is not complete, the subprocess will continue to generate a subsequence. In effect, each superprocess produces a contiguous partitioning of its
subprocess. Thus, although the [DBN](#Background "Section: Background") is in general not a tree,
the end result of the HHMM is a tree of hierarchical states.

Putting all the dependencies together, in general the state $q^d_t$ 'transitions' to
state $q^d_{t+1}$ with probability
\begin{eqnarray}
P(q^d_{t+1}=\sigma^d_j\mid q^d_t=\sigma^d_i,q^{d-1}_{t+1}=\sigma^{d-1}_p,
e^d_t,e^{d-1}_t) 
& = &
\left\{\begin{array}{ll}
\delta_{ij} & \mbox{if}~e^d_t=0\,,\\
A^{d-1,p}_{i,j} & \mbox{else if}~e^{d-1}_t=0\,,\\
\pi^{d-1,p}_{j} & \mbox{otherwise}
\end{array}
\right.
\,,
\end{eqnarray}
for $t=1,2,\ldots,T-1$ and $d=1,2,\ldots,D$.
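
The three-way case analysis above can be sketched directly in Python. The nested-list layouts `A_prev[p][i][j]` for $A^{d-1,p}_{i,j}$ and `pi_prev[p][j]` for $\pi^{d-1,p}_j$ (all indices zero-based), together with the numerical values, are assumptions made purely for illustration.

```python
def transition_prob(A_prev, pi_prev, p, i, j, e_d, e_parent):
    """Return P(q^d_{t+1} = sigma^d_j | q^d_t = sigma^d_i,
    q^{d-1}_{t+1} = sigma^{d-1}_p, e^d_t = e_d, e^{d-1}_t = e_parent)."""
    if e_d == 0:
        # State i is still in use, so it must persist.
        return 1.0 if i == j else 0.0
    if e_parent == 0:
        # State i completed but the parent did not: ordinary transition.
        return A_prev[p][i][j]
    # Both completed: a new subsequence starts under the next parent state.
    return pi_prev[p][j]


# Hypothetical parameters: 1 parent state, 2 child states.
pi_prev = [[0.6, 0.4]]
A_prev = [[[0.7, 0.3], [0.2, 0.8]]]

# Distribution over the next state j, starting from state i = 1.
persist = [transition_prob(A_prev, pi_prev, 0, 1, j, 0, 0) for j in range(2)]
transit = [transition_prob(A_prev, pi_prev, 0, 1, j, 1, 0) for j in range(2)]
restart = [transition_prob(A_prev, pi_prev, 0, 1, j, 1, 1) for j in range(2)]
```

The combination `e_d == 0` with `e_parent == 1` cannot occur in the model, since a parent state never completes before its child; the sketch simply treats it as persistence.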


### Level D

Like [level 0](#Level-0 "Section: Level 0") and 
[level 1](#Level-1 "Section: Level 1"), 
level $D$ has some special properties.
In particular, observations 
$\mathbf{y}_{1:T}\doteq(y_1,y_2,\ldots,y_T)\in\mathcal{Y}^T$ are only generated
at level $D$.

Next, just as we chose $e^0_t=0$ for level 0, we take $e^D_t=1$ for level $D$.
In effect, this means that every state $q^D_t$ is complete immediately after generating
observation $y_t$. Although this might seem a bit strange, if we set $e^D_t=0$, then
no state $q^{d}_t$ for $d<D$ would ever be permitted to complete.

So far, all of the state-dependent parameters in the HHMM, 
e.g. $\tau^{d,p}_i$, $A^{d-1,p}_{i,j}$ and $\pi^{d-1,p}_j$,
depend upon both the parent state $q^{d-1}_t=\sigma^{d-1}_p$ and the child state
$q^d_t=\sigma^d_i$ at stage $t$.
For the output of observation $y_t$, therefore, we could also assume this dual
dependency. Note that, as mentioned [previously](#Level-0 "Section: Level 0"),
we assume that the HHMM reduces exactly to an HMM when $D=1$. However, since there is always an implicit, constant-state level 0, an observation dependency on dual parent-child states would also reduce to the usual single child-state dependency of a typical HMM.

Despite this argument, the [DBN](#Background "Section: Background")
has been specified such that observation $y_t$ depends only upon state $q^D_t$,
for $t=1,2,\ldots,T$.

## Temporal Stages

### Stage 1

It must be noted that the [DBN](#Background "Section: Background") model
makes a special assumption for $t=1$, namely that the observed sequence
$\mathbf{y}_{1:T}$ commences at stage 1. This decision is not inconsequential, since it 
[ensures](#Level-d "Section: Level d") 
that the sequence starts with a specially chosen initial state, via
\begin{eqnarray}
P(q^d_1=\sigma^d_i\mid q^{d-1}_1=\sigma^{d-1}_p) & = & \pi^{d-1,p}_{i}\,.
\end{eqnarray}
However, this assumption is not justified for an arbitrary
sequence that has an unknown point of initialisation.

If we knew in advance that the observed sequence did not start at stage 1, then there is essentially a missing stage $t=0$ with $e^0_0=0$. We might then (in theory) infer
an initial state from
\begin{eqnarray}
P(q^d_1=\sigma^d_i\mid q^d_0=*,q^{d-1}_1=\sigma^{d-1}_p,e^d_0=*,e^{d-1}_0=*)~\doteq~
\sum_{j=1}^{\left|\mathcal{Q}^d\right|}\sum_{(k,\ell)\in\{0,1\}^2}&&
P(q^d_{1}=\sigma^d_i\mid q^d_0=\sigma^d_j,q^{d-1}_{1}=\sigma^{d-1}_p,
e^d_0=k,e^{d-1}_0=\ell)
\\&&{}\times
P(q^d_0=\sigma^d_j,e^d_0=k,e^{d-1}_0=\ell)
\,.
\end{eqnarray}
In practice, this is intractable, since we do not even know if $e^0_{-1}=1$, i.e. the true start of the sequence could be any $t\le 0$.
Hence, we shall retain the implicit assumption that $e^d_0=1$
for $d=0,1,\ldots,D$.

### Stage t

Following on from [stage 1](#Stage-1 "Section: Stage 1"), we might further suppose
that we know that the observed sequence $\mathbf{y}_{1:T}$ is actually composed of a number of complete sub-sequences. In other words, we assume prior assignments for completion variables $\mathbf{e}^0_{1:T-1}$.
However, in such a case we might more reasonably split the larger sequence into
its separate, complete sub-sequences, and model each sub-sequence separately.
Hence, we shall continue to assume that $\mathbf{e}^0_{1:T-1}=\mathbf{0}$.

### Stage T

The [DBN](#Background "Section: Background") model as presented does not properly account for stage $t=T$.
In particular, it does not account for prior knowledge about whether or not the observed sequence $\mathbf{y}_{1:T}$ is complete. This is particularly important in some areas, e.g. when modelling entire sentences from a grammar, or when predicting the final observation of a sequence given the previous observations.

As noted previously, if we know in advance that the observations
$\mathbf{y}_{1:T}$ form a complete sequence, then we know $e^0_T=1$.
It then follows [logically](#Level-d "Section: Level d") that 
$\mathbf{e}^{1:D}_T=\mathbf{1}$ also.
The probability of agreement with this condition is then
\begin{eqnarray}
P(e^0_T=1\mid\mathbf{q}^{0:D}_T,e^{D}_T=1) & = &
\prod_{d=0}^{D-1}
P(e^d_T=1\mid q^d_T, q^{d+1}_T,e^{d+1}_T=1)~=~
\prod_{d=0}^{D-1}\tau^{d,\iota(q^d_T)}_{\iota(q^{d+1}_T)}\,,
\end{eqnarray}
with *state index* function $\iota(\sigma^d_i)=i$. This requires that we also model
the level 0 parameters $\boldsymbol{\mathcal{T}}^0=(\boldsymbol{\tau}^{0,1})$, which
were previously ignored.

Conversely, if we know the sequence has not ended at stage $t=T$ then $e^0_T=0$, and
the probability of this condition is just
\begin{eqnarray}
P(e^0_T=0\mid\mathbf{q}^{0:D}_T,e^{D}_T=1) & = &
1 - P(e^0_T=1\mid\mathbf{q}^{0:D}_T,e^{D}_T=1)~=~
1 - \prod_{d=0}^{D-1}\tau^{d,\iota(q^d_T)}_{\iota(q^{d+1}_T)}\,.
\end{eqnarray}
Consequently, in the event that $e^0_T$ is known, the joint probability of each
*complete* case should be multiplied by $P(e^0_T\mid\mathbf{q}^{0:D}_T,e^{D}_T=1)$, before summing over the hidden variables.

In the special case where we do not know whether or not the sequence has terminated at stage $T$, for convenience we take $e^0_T=*$, such that
\begin{eqnarray}
P(e^0_T=*\mid\mathbf{q}^{0:D}_T,e^{D}_T=1) & \doteq & 
\sum_{e^0_T\in\{0,1\}}P(e^0_T\mid\mathbf{q}^{0:D}_T,e^{D}_T=1)
~=~1\,.
\end{eqnarray}
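
The three cases for $e^0_T$ can be collected into a single terminal factor. The sketch below assumes a nested-list layout `tau[d][p][i]` for $\tau^{d,p}_i$ with zero-based indices, represents the unknown case $e^0_T=*$ by `None`, and uses made-up parameter values for $D=2$; none of these conventions come from the original papers.

```python
def prob_complete(tau, q_T):
    """Return the product over d = 0..D-1 of tau^{d, q^d_T}_{q^{d+1}_T}.

    q_T lists the state indices [q^0_T, q^1_T, ..., q^D_T].
    """
    prob = 1.0
    for d in range(len(q_T) - 1):
        prob *= tau[d][q_T[d]][q_T[d + 1]]
    return prob


def terminal_factor(tau, q_T, e0_T):
    """Return P(e^0_T | q^{0:D}_T, e^D_T = 1), with e0_T = None meaning unknown."""
    if e0_T is None:
        return 1.0  # both outcomes are summed out
    p = prob_complete(tau, q_T)
    return p if e0_T == 1 else 1.0 - p


# Hypothetical completion parameters for D = 2.
tau = [
    [[0.5, 0.2]],               # level 0: one parent state, two children
    [[0.4, 1.0], [0.0, 0.25]],  # level 1: two parent states, two children
]
q_T = [0, 1, 1]

complete = terminal_factor(tau, q_T, 1)      # 0.2 * 0.25
incomplete = terminal_factor(tau, q_T, 0)    # 1 - 0.2 * 0.25
unknown = terminal_factor(tau, q_T, None)
```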

## The Model

The original [DBN](#Background "Section: Background") is modified using the
various suggestions for
[level 0](#Level-0 "Section: Level 0"),
[level D](#Level-D "Section: Level D"),
and [stage $T$](#Stage-T "Section: Stage T").
The new DBN has the structure:
<img src="DBN_modified.png" title="Dynamic Bayesian Network for Hierarchical HMM"
 width="50%">
where solid (black) vertical arrows represent control-flow dependencies, 
solid (grey) non-vertical arrows represent non-control-flow dependencies,
and dotted arrows represent non-dependency control-flows.

[Recall](#Level-d "Section: Level d")
that the flow of control proceeds firstly down the DBN,
sampling states $\mathbf{q}^{0:D}_t$ from parent state to child state, and then subsequently up the DBN, sampling the completion indicators $\mathbf{e}^{0:D-1}_t$
from child indicator to parent indicator.
This pattern is then repeated as the flow proceeds from left to right from stage $t$ to stage $t+1$ for $t=1,2,\ldots,T-1$.
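
To make the down-then-up control flow concrete, here is a rough sampling sketch in Python. It assumes nested-list parameters `pi[d][p][i]`, `A[d][p][i][j]` and `tau[d][p][i]` with zero-based indices, omits the observations for brevity, and stops as soon as level 0 completes or a hypothetical cap `max_len` is reached. All names and numerical values are illustrative assumptions.

```python
import random


def draw(probs, rng):
    """Draw an index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_hhmm(pi, A, tau, D, max_len, rng):
    """Sample states Q[t][d] and indicators E[t][d] for levels d = 0..D."""
    Q, E = [], []
    prev_q = [0] * (D + 1)
    prev_e = [1] * (D + 1)  # implicit e^d_0 = 1 forces initial-state sampling
    for _ in range(max_len):
        # Downward pass: from parent state to child state.
        q = [0] * (D + 1)  # q[0] = 0 is the single level-0 state
        for d in range(1, D + 1):
            p = q[d - 1]
            if prev_e[d] == 0:
                q[d] = prev_q[d]  # state still in use
            elif prev_e[d - 1] == 0:
                q[d] = draw(A[d - 1][p][prev_q[d]], rng)  # ordinary transition
            else:
                q[d] = draw(pi[d - 1][p], rng)  # start of a new subsequence
        # Upward pass: from child indicator to parent indicator.
        e = [0] * (D + 1)
        e[D] = 1  # level D always completes
        for d in range(D - 1, -1, -1):
            if e[d + 1] == 1 and rng.random() < tau[d][q[d]][q[d + 1]]:
                e[d] = 1
        Q.append(q)
        E.append(e)
        if e[0] == 1:
            break  # level 0 completed, so the sequence has terminated
        prev_q, prev_e = q, e
    return Q, E


# Hypothetical parameters for D = 2 with two states at levels 1 and 2.
pi = [[[0.5, 0.5]], [[1.0, 0.0], [0.3, 0.7]]]
A = [
    [[[0.1, 0.9], [0.6, 0.4]]],
    [[[0.5, 0.5], [0.5, 0.5]], [[0.2, 0.8], [0.9, 0.1]]],
]
tau = [[[0.2, 0.3]], [[0.5, 0.5], [0.4, 0.6]]]

Q, E = sample_hhmm(pi, A, tau, D=2, max_len=50, rng=random.Random(0))
```

Whatever the random draws, the sampled indicators satisfy `E[t][d] <= E[t][d + 1]`, and any state with `E[t][d] == 0` is carried over unchanged to stage $t+1$.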

This DBN
entirely specifies the joint probability of a *complete* case of data 
$\mathbf{v}$ specifying values of the variables 
$\mathcal{V}=(\mathbf{Q}^{0:D}_{1:T}, \mathbf{E}^{0:D}_{1:T},\mathbf{y}_{1:T})$,
where 
$\mathbf{Q}^{0:D}_{1:T}\doteq (\mathbf{q}^{0:D}_1,\mathbf{q}^{0:D}_2,\ldots,\mathbf{q}^{0:D}_T)$
and
$\mathbf{E}^{0:D}_{1:T}\doteq (\mathbf{e}^{0:D}_1,\mathbf{e}^{0:D}_2,\ldots,\mathbf{e}^{0:D}_T)$.
In practice, we may partition these variables into hidden variables $\mathcal{H}$, observed variables $\mathcal{O}$, and fixed variables (i.e. constants) $\mathcal{C}$, via 
$\mathcal{V}=\mathcal{H}\oplus\mathcal{O}\oplus\mathcal{C}$. 

The known constants are therefore 
$\mathcal{C}=(\mathbf{q}^0_{1:T}, \mathbf{e}^D_{1:T})$, where
$q^0_t=\sigma^0_1$ and $e^D_t=1$ for $t=1,2,\ldots,T$.
We take $P(\mathcal{C})=1$.
Similarly, the observational variables are 
$\mathcal{O}=(\mathbf{y}_{1:T},\mathbf{e}^0_{1:T})$, since
$\mathbf{e}^0_{1:T}$ specifies how the sequence $\mathbf{y}_{1:T}$ is to be interpreted
(i.e. complete, incomplete or multi-sequence - see the
[previous](#Temporal-Stages "Section: Temporal Stages") section).
The hidden variables are therefore
$\mathcal{H}=(\mathbf{Q}^{1:D}_{1:T},\mathbf{E}^{1:D-1}_{1:T})$.


Due to the Markov property of the network, the joint probability of an arbitrary case 
is thus given (in control flow order) by 
\begin{eqnarray}
P(\mathcal{V}) & = & 
\left[
P(q^0_1)
\prod_{d=1}^D P(q^d_1\mid q^{d-1}_1)
\right]\,P(y_1\mid q^D_1)\,
\left[
P(e^D_1)\prod_{d=D-1}^0 P(e^d_{1}\mid q^{d}_{1}, q^{d+1}_{1}, e^{d+1}_{1})
\right]
\\&&{}\!\!\!\!\!\!\!\times
\prod_{t=2}^T\left\{
 \left[
 P(q^0_t)
 \prod_{d=1}^D P(q^d_t\mid q^d_{t-1}, q^{d-1}_{t}, e^d_{t-1}, e^{d-1}_{t-1})
 \right]\,
 P(y_t\mid q^D_t)\,
 \left[
 P(e^D_t)\prod_{d=D-1}^0 P(e^d_{t}\mid q^{d}_{t}, q^{d+1}_{t}, e^{d+1}_{t})\right]
\right\}
\,.
\end{eqnarray}

Some notational simplifications are clearly in order. For example, note that
for level 0 and level $D$ we have
\begin{eqnarray}
P(\mathcal{C}) & = & \prod_{t=1}^T P(q^0_t)\,P(e^D_t)~=~1\,.
\end{eqnarray}
Also, for stage 1, we have
\begin{eqnarray}
P(\mathbf{q}_1^{1:D}\mid q^0_1) & = & \prod_{d=1}^D P(q^d_1\mid q^{d-1}_1)\,,
\end{eqnarray}
and
\begin{eqnarray}
P(\mathbf{e}_1^{0:D-1}\mid\mathbf{q}_1^{0:D},e^D_1) & = & 
\prod_{d=D-1}^0 P(e^d_{1}\mid q^{d}_{1}, q^{d+1}_{1}, e^{d+1}_{1})\,.
\end{eqnarray}
Similarly, for stage $t>1$ we have
\begin{eqnarray}
P(\mathbf{q}_t^{1:D}\mid q^0_t,\mathbf{q}_{t-1}^{1:D},\mathbf{e}_{t-1}^{0:D}) & = & 
\prod_{d=1}^D P(q^d_t\mid q^d_{t-1}, q^{d-1}_{t}, e^d_{t-1}, e^{d-1}_{t-1})\,,
\end{eqnarray}
and
\begin{eqnarray}
P(\mathbf{e}_t^{0:D-1}\mid\mathbf{q}_t^{0:D},e^D_t) & = &
\prod_{d=D-1}^0 P(e^d_{t}\mid q^{d}_{t}, q^{d+1}_{t}, e^{d+1}_{t})\,.
\end{eqnarray}
Finally, at level $D$ we have
\begin{eqnarray}
P(\mathbf{y}_{1:T}\mid\mathbf{q}^D_{1:T}) & = &
\prod_{t=1}^{T} P(y_t\mid q^D_t)\,.
\end{eqnarray}
Consequently, the joint probability of the hidden and observed variables is just
\begin{eqnarray}
P(\mathcal{H},\mathcal{O}\mid\mathcal{C}) & = &
P(\mathbf{q}_1^{1:D}\mid q^0_1)\,
P(\mathbf{e}_1^{0:D-1}\mid\mathbf{q}_1^{0:D},e^D_1)
\prod_{t=2}^T\left\{
P(\mathbf{q}_t^{1:D}\mid q^0_t,\mathbf{q}_{t-1}^{1:D},\mathbf{e}_{t-1}^{0:D})\,
P(\mathbf{e}_t^{0:D-1}\mid\mathbf{q}_t^{0:D},e^D_t)
\right\}
P(\mathbf{y}_{1:T}\mid\mathbf{q}^D_{1:T})\,.
\end{eqnarray}
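
As a sanity check on this factorisation, the joint probability of a single complete case can be evaluated numerically. The following sketch assumes the nested-list layouts `pi[d][p][i]`, `A[d][p][i][j]` and `tau[d][p][i]` (zero-based indices), plus a hypothetical discrete emission table `B[i][k]` for $P(y_t=k\mid q^D_t=\sigma^D_i)$; the emission parameterisation and all numerical values are assumptions for illustration only.

```python
def joint_prob(Q, E, y, pi, A, tau, B):
    """Evaluate P(H, O | C) for one complete case.

    Q[t][d], E[t][d]: state indices and indicators for levels d = 0..D.
    y[t]: observed symbol index; B[i][k] is an assumed emission table.
    """
    D = len(Q[0]) - 1
    prob = 1.0
    prev_q = Q[0]
    prev_e = [1] * (D + 1)  # implicit e^d_0 = 1 selects the initial-state term
    for q, e, obs in zip(Q, E, y):
        # State factors, from level 1 down to level D.
        for d in range(1, D + 1):
            p, i, j = q[d - 1], prev_q[d], q[d]
            if prev_e[d] == 0:
                prob *= 1.0 if i == j else 0.0
            elif prev_e[d - 1] == 0:
                prob *= A[d - 1][p][i][j]
            else:
                prob *= pi[d - 1][p][j]
        # Emission factor at level D.
        prob *= B[q[D]][obs]
        # Completion factors, from level D-1 up to level 0.
        for d in range(D - 1, -1, -1):
            p_done = tau[d][q[d]][q[d + 1]] if e[d + 1] == 1 else 0.0
            prob *= p_done if e[d] == 1 else 1.0 - p_done
        prev_q, prev_e = q, e
    return prob


# Hypothetical D = 1 example with two states and two output symbols.
pi = [[[0.6, 0.4]]]
A = [[[[0.7, 0.3], [0.2, 0.8]]]]
tau = [[[0.1, 0.5]]]
B = [[0.9, 0.1], [0.2, 0.8]]

Q = [[0, 0], [0, 1]]  # [q^0_t, q^1_t] for t = 1, 2
E = [[0, 1], [1, 1]]  # [e^0_t, e^1_t]; e^0_2 = 1 marks a complete sequence
y = [0, 1]

# Factors in evaluation order: 0.6 * 0.9 * (1 - 0.1) * 0.3 * 0.8 * 0.5
joint = joint_prob(Q, E, y, pi, A, tau, B)
```

Summing `joint_prob` over all assignments of the hidden variables would give the likelihood of the observed variables, although enumeration of that kind is only feasible for very small $T$ and $D$.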

## References

[1] Fine, Singer, and Tishby (1998) "*The Hierarchical Hidden Markov Model: Analysis and Applications*", Machine Learning 32. 
[(PDF)](https://link.springer.com/content/pdf/10.1023/A:1007469218079.pdf "springer.com")

[2] Bui, Phung and Venkatesh (2004) "*Hierarchical Hidden Markov Models with General State Hierarchy*", AAAI-04 (National Conference on Artificial Intelligence).
[(PDF)](https://www.aaai.org/Papers/AAAI/2004/AAAI04-052.pdf "aaai.org")