# HMM Basics


Hidden Markov Models (HMMs) are one of the fundamental tools in probabilistic modelling and are widely used in computer vision, speech processing, natural language processing (NLP), and many other fields.

In NLP, many current state-of-the-art models can be viewed as advanced versions of HMMs.

In this notebook we use a simple language model to explain the basics of HMMs.


## Table of Contents

* Simple HMM Language Model
* Parametrisation of an HMM
* Generating Sentences by Sampling from an HMM
* Rethinking the Independence Assumption

# Preparations

In [1]:
# We will only need numpy for this tutorial
import numpy as np

# A simple language example

Consider the following simple sentences that discuss cats and dogs:
* I like cats 
* I really like cats
* I like really cute cats
* Cute Sam likes dogs 
* Jack loves dogs 
* Jack really loves small dogs 
* Charming Sam likes cute dogs 

These sentences are simple in that they all follow the same structure:
* (Adjective) -> Subject -> (Adverb) -> Verb -> (Adjective) -> Object, where parentheses mark optional elements. We call this the _template_ of a sentence.
* Note the probabilistic nature: a verb can be followed directly by an object or by a modifier.

Consider the simplest sentence template:
* Subject -> Verb -> Object

To describe a language that uses this template, we can use a Markov Chain to model the template as a sequence of latent states:
* $h_1$ = Subject -> $h_2$ = Verb -> $h_3$ = Object

This template can generate the following sentences:
* I like cats 
* Jack loves dogs 
* etc.

Specifically, each latent state generates a word:
* $h_1$ = Subject -> $v_1$ = {I, Jack}
* $h_2$ = Verb -> $v_2$ = {like, loves}
* $h_3$ = Object -> $v_3$ = {cats, dogs}

We can further have an adjective to modify the subject or the object:
* Adjective -> Subject, e.g., Cute Sam
* Adjective -> Object, e.g., Cute cats

Or an adverb to modify the verb:
* Adverb -> Verb, e.g., really like

By adding these modifiers to the simple template language above and using probabilistic transition and emission distributions we construct a richer language.

We now consider modelling this language with HMMs. Formally, we have a sequence of latent states that describe the template of a sentence:

$$
p(h_{1:T}) = p(h_1)\prod_{t=2}^T p(h_t | h_{t-1})
$$

where $p(h_1)$ is the initial state distribution of the Markov chain, $p(h_t | h_{t-1})$ are the Markov chain transitions, and $T$ is the length of the sentence. Each state $h_t$ depends only on the previous state $h_{t-1}$ (not on states two or more steps back).

Then each latent state generates a word:

$$
p(v_{1:T} | h_{1:T}) = \prod_{t = 1}^T p(v_t | h_t)
$$

where $p(v_t | h_t)$ is the emission distribution and each word $v_t$ depends only on its corresponding latent state $h_t$ (not on the previous word $v_{t-1}$, nor on the previous states).

We use the terms **dependency structure** or **independence assumptions** to refer to the above two facts, namely that $h_t$ only depends on $h_{t-1}$ and $v_t$ only depends on $h_t$.
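Combining the two factorisations above gives the joint distribution over templates and sentences. The expansion for the three-state template is a worked illustration using the example words from this section, not an additional assumption of the model:

$$
p(h_{1:T}, v_{1:T}) = p(h_1)\, p(v_1 | h_1) \prod_{t=2}^T p(h_t | h_{t-1})\, p(v_t | h_t)
$$

For the template Subject -> Verb -> Object with the words "I like cats" ($T = 3$), this reads

$$
p(h_{1:3}, v_{1:3}) = p(\text{Subject})\, p(\text{I} | \text{Subject})\, p(\text{Verb} | \text{Subject})\, p(\text{like} | \text{Verb})\, p(\text{Object} | \text{Verb})\, p(\text{cats} | \text{Object})
$$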

# Why is it called Hidden Markov Model?

* **Hidden**: the template [Subject -> Verb -> Object] is hidden since typically we only observe the generated sentence, not its template.
* **Markov**: each latent state depends only on its previous state. This is also called the **Markovian** property.
* **Model**: this is a model that we use to describe human language, and it may not capture how language actually works. A model usually has gaps relative to the thing being modelled.

# Parametrisation of HMM

To model the above simple language with an HMM, we need the following parameters:

* initial distribution $p(h_1)$
* transition distribution $p(h_t|h_{t-1})$
* emission distribution $p(v_t|h_t)$

The word **parametrisation** means the specification of the parameters of a model. In the current HMM case, this corresponds to the specification of the above distributions. Because we are modelling a simplistic language, we choose to parametrise the above distributions using (conditional) probability tables.

For the initial distribution $p(h_1)$, since all sentences start with a subject or an adjective that modifies the subject, we assume the following probabilities:

| Subject | Adjective | Adverb | Verb | Object | \<EOS\> | 
| ------- | --------- | ------ | ---- | ------ | ------- |
| 0.7     | 0.3       | 0      | 0    | 0      | 0       |

where the tag `<EOS>` denotes the end-of-sentence. We represent this probability table below.

In [2]:
initial = [0.7, 0.3, 0., 0., 0., 0.]
id2state = {0: 'Subject', 1: 'Adjective', 2: 'Adverb', 3: 'Verb', 4: 'Object', 5: '<EOS>'}
state2id = {id2state[s]: s for s in id2state}

The variable `initial` is a vector of length $N = 6$, $N$ being the number of latent states.

We can look up the initial probability of a state by indexing:

In [3]:
initial[state2id['Subject']]

0.7

For the transition distribution, we assume the following conditional probability table:

| -         | Subject | Adjective | Adverb | Verb | Object    | \<EOS\> |
| -------   | ------- | --------- | ----   | ---- | --------- | ------- |
| Subject   | 0       | 0         | 0.3    | 0.7  | 0         | 0       |
| Adjective | 0.4     | 0.1       | 0      | 0    | 0.5       | 0       |
| Adverb    | 0.0     | 0.3       | 0      | 0.7  | 0         | 0       |
| Verb      | 0       | 0.3       | 0.2    | 0    | 0.5       | 0       |
| Object    | 0       | 0         | 0      | 0    | 0         | 1       |

Note that this is a conditional probability table, so each row sums to 1. The table is organised so that $p(h_t=k | h_{t-1}=k')$ is in row $k'$ and column $k$.

Notice the following:
* A subject can transition to a verb (I like), or an adverb that modifies a verb (I really like).
* An adjective can modify a subject (Handsome Jack), another adjective (cool handsome Jack), or an object (cute dog)
* An adverb modifies a verb (really like) or an adjective (really cute)
* A verb may transition to an object (like dogs), to an adjective (like cute dogs), or an adverb (like really cute dogs)
* The object is the end state and hence always transitions to \<EOS\>. The table does not contain a row for \<EOS\> since we use it to terminate the HMM and it does not transition to any other state.

We represent the described transition distribution below.

In [4]:
transition = np.array([[0., 0., 0.3, 0.7, 0., 0.], 
                       [0.4, 0.1, 0., 0., 0.5, 0.], 
                       [0., 0.3, 0., 0.7, 0., 0.],
                       [0., 0.3, 0.2, 0., 0.5, 0.],
                       [0., 0., 0., 0., 0., 1.],
                      ])

So the variable `transition` is an $(N-1) \times N$, i.e., $5 \times 6$, matrix.

`transition[i][j]` is the probability of transitioning from state i to state j.

For example, `transition[0][3]` is the probability of transitioning from subject to verb (indices start from 0).

In [5]:
assert(transition[0][3] == transition[state2id['Subject']][state2id['Verb']])
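The row-sum property of the tables can also be checked numerically. The following is a minimal sketch that restates the tables from above so it runs on its own; the helper name `is_distribution` is hypothetical and not part of the original notebook.

```python
import numpy as np

# Tables restated from the cells above so the snippet is self-contained
initial = np.array([0.7, 0.3, 0., 0., 0., 0.])
transition = np.array([
    [0., 0., 0.3, 0.7, 0., 0.],   # Subject
    [0.4, 0.1, 0., 0., 0.5, 0.],  # Adjective
    [0., 0.3, 0., 0.7, 0., 0.],   # Adverb
    [0., 0.3, 0.2, 0., 0.5, 0.],  # Verb
    [0., 0., 0., 0., 0., 1.],     # Object
])

def is_distribution(p):
    """Hypothetical helper: True if all entries are non-negative and
    the last axis sums to 1 (one check per row for a 2-D table)."""
    p = np.asarray(p, dtype=float)
    return bool(np.all(p >= 0) and np.allclose(p.sum(axis=-1), 1.0))

print(is_distribution(initial))
print(is_distribution(transition))
```

A check like this is cheap and catches typos in hand-written tables before they cause `np.random.choice` to raise an error about probabilities not summing to 1.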

With the initial and the transition distributions, we are able to generate sentence templates by sampling from the Markov chain:

In [6]:
# Sample the initial state of the Markov chain
num_states = len(initial)
h = np.random.choice(num_states, p=initial)
# Sample subsequent states using the conditional probability table
h_total = []
h_total.append(h)
while(h != state2id['<EOS>']):
    h = np.random.choice(num_states, p=transition[h])
    h_total.append(h)

In [7]:
print(h_total)
print(' '.join([id2state[h] for h in h_total]))

[1, 0, 3, 1, 0, 3, 4, 5]
Adjective Subject Verb Adjective Subject Verb Object <EOS>
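The probability of a sampled template follows directly from the Markov chain factorisation $p(h_{1:T}) = p(h_1)\prod_{t=2}^T p(h_t | h_{t-1})$. As a sketch, the snippet below evaluates it for the template shown above; the function name `template_probability` is an assumption for illustration, and the tables are restated so the snippet runs on its own.

```python
import numpy as np

initial = np.array([0.7, 0.3, 0., 0., 0., 0.])
transition = np.array([
    [0., 0., 0.3, 0.7, 0., 0.],   # Subject
    [0.4, 0.1, 0., 0., 0.5, 0.],  # Adjective
    [0., 0.3, 0., 0.7, 0., 0.],   # Adverb
    [0., 0.3, 0.2, 0., 0.5, 0.],  # Verb
    [0., 0., 0., 0., 0., 1.],     # Object
])

def template_probability(h_seq):
    """Probability of a state sequence that ends in <EOS> (hypothetical helper)."""
    p = initial[h_seq[0]]
    # Multiply in one transition probability per consecutive pair of states
    for prev, cur in zip(h_seq[:-1], h_seq[1:]):
        p *= transition[prev, cur]
    return p

# Adjective Subject Verb Adjective Subject Verb Object <EOS>
p = template_probability([1, 0, 3, 1, 0, 3, 4, 5])
print(p)  # approximately 0.003528
```

The value is the product 0.3 × 0.4 × 0.7 × 0.3 × 0.4 × 0.7 × 0.5 × 1.0, i.e. the initial probability of Adjective followed by one table entry per transition.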


For the emission distribution, we assume:

| -         | I    | He   | Jack | Mary | likes | loves | hates | really | extremely | pretty | cute | adorable | cats | dogs | . |
| -------   | ---  | ---  | ---- | ---- | ---- | ---- | ---- | ------ | --------- | ------ | ---- | -------- | ---- | ---- | ---- |
| Subject   | 0.2  | 0.2  | 0.2  | 0.2  | 0    | 0    | 0    | 0      | 0         | 0      | 0    | 0        | 0.1  | 0.1  | 0   |
| Adjective | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0      | 0         | 0.5    | 0.25 | 0.25     | 0    | 0    |   0 |
| Adverb    | 0    | 0    | 0    | 0    | 0    | 0    | 0    | 0.25   | 0.25      | 0.5    | 0    | 0        | 0    | 0    | 0 |
| Verb      | 0    | 0    | 0    | 0    | 0.3  | 0.4  | 0.3  | 0      | 0         | 0      | 0    | 0        | 0    | 0    | 0 |
| Object    | 0    | 0    | 0.2  | 0.2  | 0    | 0    | 0    | 0      | 0         | 0      | 0    | 0        | 0.3  | 0.3  | 0 |
| \<EOS\>  | 0    | 0    | 0  | 0  | 0    | 0    | 0    | 0      | 0         | 0      | 0    | 0        | 0  | 0  | 1 |

Note again that the table above is a conditional probability table, hence each row sums to 1.

Notice the shared vocabulary: different states may generate the same words.
* In the above emission table, the words "Jack" and "Mary" can be either a subject or an object.
* Similarly, the word "pretty" can be an adjective or an adverb.
* This shared vocabulary may cause difficulties in inference and learning.

In [8]:
emission = np.array([
    [0.2, 0.2, 0.2, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0.],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.25, 0.25, 0, 0, 0.],
    [0, 0, 0, 0, 0, 0, 0, 0.25, 0.25, 0.5, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0.3, 0.4, 0.3, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0.2, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0.3, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.]
    ])
word2id = {'I': 0, 'He': 1, 'Jack': 2, 'Mary': 3, 'likes': 4, 'loves': 5, 'hates': 6, 'really': 7, 'extremely': 8, 'pretty': 9, 'cute': 10, 'adorable': 11, 'cats': 12, 'dogs': 13, '.': 14}
id2word = {word2id[w]: w for w in word2id}
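The shared vocabulary can be read off the emission table programmatically. The sketch below lists, for a given word, every state with non-zero emission probability; the helper name `states_emitting` and the lists `states` and `words` are introduced here for illustration only, and the table is restated so the snippet runs on its own.

```python
import numpy as np

states = ['Subject', 'Adjective', 'Adverb', 'Verb', 'Object', '<EOS>']
words = ['I', 'He', 'Jack', 'Mary', 'likes', 'loves', 'hates', 'really',
         'extremely', 'pretty', 'cute', 'adorable', 'cats', 'dogs', '.']
emission = np.array([
    [0.2, 0.2, 0.2, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0],  # Subject
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.25, 0.25, 0, 0, 0],      # Adjective
    [0, 0, 0, 0, 0, 0, 0, 0.25, 0.25, 0.5, 0, 0, 0, 0, 0],      # Adverb
    [0, 0, 0, 0, 0.3, 0.4, 0.3, 0, 0, 0, 0, 0, 0, 0, 0],        # Verb
    [0, 0, 0.2, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0.3, 0],      # Object
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.],             # <EOS>
])

def states_emitting(word):
    """Names of all states that can emit `word` (hypothetical helper)."""
    column = emission[:, words.index(word)]
    return [states[k] for k in np.flatnonzero(column > 0)]

print(states_emitting('pretty'))  # ['Adjective', 'Adverb']
print(states_emitting('Jack'))    # ['Subject', 'Object']

# Words that more than one state can generate
shared = [w for w in words if len(states_emitting(w)) > 1]
print(shared)  # ['Jack', 'Mary', 'pretty', 'cats', 'dogs']
```

Note that "cats" and "dogs" also appear in the list, since the Subject row assigns them probability 0.1 each.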

It can be helpful to visualise the model with a state-transition diagram:

![State-transition diagram of the simple NLP model](./figures/pmr-hmm-nlp-state-diagram.png)

Here we have used solid nodes to represent the hidden states, solid arrows to represent the hidden state transitions, and tables with incoming dashed arrows to represent the emissions.

Note that a state-transition diagram is just a visualisation of the dynamics of the model; it is _not_ a probabilistic graphical model.

# Generating Sentences by Sampling from the HMM

To generate sentences, we first generate the template of a sentence (the latent states), then generate words conditioned on the states.

In [1]:
def generate(initial, transition, emission):
    """Sample a template (latent state ids) and a sentence (word ids) from the HMM."""
    num_states = len(initial)
    num_words = emission.shape[1]

    # Sample the initial state
    h = np.random.choice(num_states, p=initial)
    h_total = [h]
    v_total = []
    # Until <EOS> is reached: emit a word from the current state,
    # then sample the next state
    while h != state2id['<EOS>']:
        v = np.random.choice(num_words, p=emission[h])
        v_total.append(v)
        h = np.random.choice(num_states, p=transition[h])
        h_total.append(h)
    # Emit the final word from the <EOS> state (always '.')
    v = np.random.choice(num_words, p=emission[h])
    v_total.append(v)
    return h_total, v_total

h_total, v_total = generate(initial, transition, emission)
print(h_total)
print(' '.join([id2state[h] for h in h_total]))
print(v_total)
print(' '.join([id2word[v] for v in v_total]))
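Given a template and its sentence, the joint probability under the HMM is the product of the initial, transition, and emission terms. The following sketch evaluates it for one hand-picked pair; the function name `joint_probability` and the lists `states` and `words` are assumptions for illustration, and the tables are restated so the snippet runs on its own.

```python
import numpy as np

states = ['Subject', 'Adjective', 'Adverb', 'Verb', 'Object', '<EOS>']
words = ['I', 'He', 'Jack', 'Mary', 'likes', 'loves', 'hates', 'really',
         'extremely', 'pretty', 'cute', 'adorable', 'cats', 'dogs', '.']
initial = np.array([0.7, 0.3, 0., 0., 0., 0.])
transition = np.array([
    [0., 0., 0.3, 0.7, 0., 0.],   # Subject
    [0.4, 0.1, 0., 0., 0.5, 0.],  # Adjective
    [0., 0.3, 0., 0.7, 0., 0.],   # Adverb
    [0., 0.3, 0.2, 0., 0.5, 0.],  # Verb
    [0., 0., 0., 0., 0., 1.],     # Object
])
emission = np.array([
    [0.2, 0.2, 0.2, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0],  # Subject
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.25, 0.25, 0, 0, 0],      # Adjective
    [0, 0, 0, 0, 0, 0, 0, 0.25, 0.25, 0.5, 0, 0, 0, 0, 0],      # Adverb
    [0, 0, 0, 0, 0.3, 0.4, 0.3, 0, 0, 0, 0, 0, 0, 0, 0],        # Verb
    [0, 0, 0.2, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0.3, 0],      # Object
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.],             # <EOS>
])

def joint_probability(h_seq, v_seq):
    """p(h_{1:T}, v_{1:T}) for aligned state ids and word ids (hypothetical helper)."""
    # Initial state and its emitted word
    p = initial[h_seq[0]] * emission[h_seq[0], v_seq[0]]
    # One transition term and one emission term per remaining step
    for t in range(1, len(h_seq)):
        p *= transition[h_seq[t - 1], h_seq[t]] * emission[h_seq[t], v_seq[t]]
    return p

h_seq = [states.index(s) for s in ['Subject', 'Verb', 'Object', '<EOS>']]
v_seq = [words.index(w) for w in ['Jack', 'loves', 'dogs', '.']]
p = joint_probability(h_seq, v_seq)
print(p)  # approximately 0.00588
```

For long sentences the product of many small numbers can underflow, so summing log-probabilities is a common alternative; the plain product is kept here to mirror the formulas.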


# Rethinking the Independence Assumption

Have you noticed something wrong with the above model? 

If you run the above sampling code multiple times, you may get a result like this:

```
Adjective Object <EOS>
adorable Jack .
```

This is not a legitimate sentence since it does not contain a verb (every legitimate sentence should contain at least one verb).
The reason this sentence is generated is that the object can be directly generated by the adjective, although in this case, the adjective should really generate a subject (rather than the object).
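Verbless templates are not a rare corner case. As a rough sketch, the fraction of sampled templates without a Verb state can be estimated by Monte Carlo; the helper `sample_template`, the seed, and the sample size are arbitrary choices made for this illustration, and the tables are restated so the snippet runs on its own.

```python
import numpy as np

VERB, EOS = 3, 5  # state ids as in id2state
initial = np.array([0.7, 0.3, 0., 0., 0., 0.])
transition = np.array([
    [0., 0., 0.3, 0.7, 0., 0.],   # Subject
    [0.4, 0.1, 0., 0., 0.5, 0.],  # Adjective
    [0., 0.3, 0., 0.7, 0., 0.],   # Adverb
    [0., 0.3, 0.2, 0., 0.5, 0.],  # Verb
    [0., 0., 0., 0., 0., 1.],     # Object
])

def sample_template(rng):
    """Sample one state sequence from the Markov chain until <EOS> is reached."""
    h = rng.choice(len(initial), p=initial)
    template = [h]
    while h != EOS:
        h = rng.choice(len(initial), p=transition[h])
        template.append(h)
    return template

rng = np.random.default_rng(0)  # fixed seed so the estimate is reproducible
num_samples = 5000
verbless = sum(VERB not in sample_template(rng) for _ in range(num_samples))
fraction = verbless / num_samples
print(f"Estimated fraction of templates without a verb: {fraction:.3f}")
```

Under these tables the estimate should come out at around a fifth, so the model produces verbless sentences quite often.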

You may also get the following sample:

```
Subject Verb Object <EOS>
I hates dogs .
```

The generated sentence is grammatically incorrect since the verb is in the third-person form even though it should really be in the first-person form. There are two reasons for this: firstly, our emission distribution does not contain the first-person form of the verb; and secondly, the choice between the first- and third-person forms of a verb may depend on the previous words in the sentence.

Additionally, you may also get a sample like this:

```
Subject Verb Adverb Adjective Subject Verb Adverb Verb Adjective Object <EOS>
dogs loves extremely adorable cats loves really likes adorable dogs .
```

This is not a legitimate sentence since it contains two subjects (the second subject should really be an object).

Why do these problems occur?

They occur because each latent state depends only on its previous state and is ignorant of any other possible dependencies. For example, the transition to an object should also depend on whether there already is a verb in the sentence.

This problem is a limitation of first-order hidden Markov models, since first-order HMMs cannot model **long-term dependencies**. 

To fix this problem, one solution is to **increase the dependency order**, e.g., to make the state at step $t$ depend on all states from $1$ to $t - 1$, rather than just $t - 1$. We will stop here. If you are interested, read [autoregressive language modeling with recurrent neural networks](https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture05-rnnlm.pdf) for more details.
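As a small illustration of letting a transition depend on more of the history, the sketch below tracks a single extra bit, whether a verb has already been generated, and blocks transitions into Object until it is set. This is only one hypothetical rule chosen for illustration, not the general fix described above; the names `next_state_probs`, `sample_template`, and `seen_verb` are assumptions, and the tables are restated so the snippet runs on its own.

```python
import numpy as np

VERB, OBJECT, EOS = 3, 4, 5  # state ids as in id2state
initial = np.array([0.7, 0.3, 0., 0., 0., 0.])
transition = np.array([
    [0., 0., 0.3, 0.7, 0., 0.],   # Subject
    [0.4, 0.1, 0., 0., 0.5, 0.],  # Adjective
    [0., 0.3, 0., 0.7, 0., 0.],   # Adverb
    [0., 0.3, 0.2, 0., 0.5, 0.],  # Verb
    [0., 0., 0., 0., 0., 1.],     # Object
])

def next_state_probs(h, seen_verb):
    """Transition row for state h, conditioned on whether a verb has occurred."""
    p = transition[h].copy()
    if not seen_verb:
        # Assumed rule: no Object before the first Verb; renormalise the rest
        p[OBJECT] = 0.0
        p = p / p.sum()
    return p

def sample_template(rng):
    """Sample a template whose transitions depend on (previous state, seen_verb)."""
    h = rng.choice(len(initial), p=initial)
    seen_verb = (h == VERB)
    template = [h]
    while h != EOS:
        h = rng.choice(len(initial), p=next_state_probs(h, seen_verb))
        seen_verb = seen_verb or (h == VERB)
        template.append(h)
    return template

rng = np.random.default_rng(0)
templates = [sample_template(rng) for _ in range(500)]
print(all(VERB in t for t in templates))  # True: every template now has a verb
```

The extra bit removes verbless sentences but does nothing about repeated subjects or verb agreement, which is why richer history-dependent models are needed in practice.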