<img src='data/images/section-notebook-header.png' />

# POS Tagging with HMMs

We already briefly looked at Part-of-Speech (POS) tagging in "Natural Language Processing: Foundations" when covering text preprocessing. There we saw that lemmatization relies on a word's POS tag (i.e., its word type: noun, verb, adjective, adverb, and so on). However, in that context we used an off-the-shelf POS tagger and ignored how POS tagging can actually be implemented. In NLP, POS tagging is one of the most important sequence labeling tasks, that is, tasks where each word/token in a sequence (e.g., a sentence) is assigned a tag. Other common sequence labeling tasks include:

* **Named Entity Recognition (NER):** NER aims to identify and classify named entities in text, such as names of persons, organizations, locations, dates, and more. For instance, given the sentence "Apple Inc. is planning to open a new store in New York City next month," a NER system would label "Apple Inc." as an organization and "New York City" as a location.

* **Chunking:** Chunking involves identifying and labeling syntactic constituents, also known as chunks, in a sentence. These chunks often correspond to phrases such as noun phrases (NP), verb phrases (VP), or prepositional phrases (PP). For example, in the sentence "She saw a beautiful sunset," a chunker would label "a beautiful sunset" as an NP.

* **Semantic Role Labeling (SRL):** SRL aims to identify the roles played by different entities in a sentence with respect to a specific predicate. It assigns labels such as "agent," "patient," "theme," and others to indicate the semantic roles. For instance, in the sentence "John ate an apple," the SRL system would label "John" as the agent and "an apple" as the patient.

* **Sentiment Analysis:** Sentiment analysis involves determining the sentiment or opinion expressed in a piece of text. In some cases, sentiment analysis can be treated as a sequence labeling task, where sentiment labels (e.g., positive, negative, neutral) are assigned to each word or sentence. This can be useful in analyzing customer reviews, social media posts, or product descriptions.
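
To make the idea of sequence labeling concrete, the following minimal sketch represents one of the example sentences above together with two parallel label sequences. The tags are assigned by hand purely for illustration (the universal-style POS tags and the BIO-style chunk tags are assumptions, not the output of a tagger).

```python
# Hand-labeled example; all tags are illustrative assumptions, not model output
tokens     = ["She", "saw", "a", "beautiful", "sunset"]
pos_tags   = ["PRON", "VERB", "DET", "ADJ", "NOUN"]
chunk_tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP"]

# Sequence labeling means: exactly one label per token
assert len(tokens) == len(pos_tags) == len(chunk_tags)

for token, pos, chunk in zip(tokens, pos_tags, chunk_tags):
    print(f"{token:<10} {pos:<5} {chunk}")
```

Whatever the task, the output has the same shape: one label per input token.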

This family of sequence labeling tasks can be solved using various techniques, ranging from rule-based approaches to statistical and machine learning methods. Rule-based approaches rely on handcrafted linguistic rules and dictionaries, whereas statistical methods use probabilistic models trained on large annotated corpora. More recently, deep learning models, particularly recurrent neural networks (RNNs) and transformer models, have achieved state-of-the-art performance in sequence labeling by effectively capturing the contextual information and long-range dependencies within a sentence.

In this notebook, we look at a core technique for sequence labeling based on statistical machine learning: **Hidden Markov Models (HMMs)**. We implement our own POS tagger from scratch by training an HMM and implementing the Viterbi algorithm for decoding.


## Quick Recap: POS Tagging

Part-of-speech (POS) tagging is a fundamental task in natural language processing (NLP) that involves assigning a grammatical category or part of speech to each word in a given text. It is a crucial step in many NLP applications, such as machine translation, information retrieval, and sentiment analysis. POS tagging helps in understanding the syntactic structure of a sentence and provides valuable information for subsequent analysis and interpretation.

The goal of POS tagging is to label each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, adverb, preposition, conjunction, and so on. These labels capture the lexical and grammatical properties of words and enable a deeper understanding of the text's meaning and structure. POS tagging algorithms utilize linguistic features, contextual clues, and statistical patterns to determine the most likely part of speech for each word, taking into account the surrounding words and the overall context.

Accurate POS tagging is essential for many downstream NLP tasks, as it serves as a crucial preprocessing step. By providing a fine-grained analysis of the grammatical structure of text, POS tagging facilitates more sophisticated language understanding and enables the development of advanced NLP applications.

## Hidden Markov Models

A Hidden Markov Model (HMM) is a statistical model used to describe and analyze sequential data, particularly data with temporal dependencies. It is a type of generative model that consists of a set of states, a set of observable symbols, transition probabilities between states, and emission probabilities of symbols given a state. The key idea behind an HMM is that the underlying state of the system is hidden or unobserved, while only the emitted symbols or observations are visible. In the context of POS tagging, the states of the system are the POS tags, and the emitted symbols are the words in our sentence/sequence.

Recall the example from the slides (see the figure below), where we assumed the existence of 3 states, i.e., 3 POS tags from the Penn Treebank tag set: singular or mass noun (NN), non-3rd person singular present verb (VBP), and personal pronoun (PRP). The emitted or observed symbols are the words of the sentence *"I like NLP"*. The goal of training and using an HMM is to find the most likely sequence of tags given this sentence, which should be: PRP VBP NN.

<img src='data/images/hmm-pos-example.png' width='40%' />

In an HMM, the states form a Markov chain, meaning that the probability of transitioning from one state to the next depends only on the current state. The transitions between states are governed by transition probabilities, which represent the likelihood of moving from one state to another. Additionally, each state can emit a symbol from a set of observable symbols, and the emission probabilities determine the likelihood of observing a particular symbol from a given state.

The main goal of an HMM is to model the joint probability of the observed sequence of symbols and the underlying sequence of states. This can be useful for various tasks, such as sequence labeling, where the goal is to determine the most likely sequence of hidden states given the observed sequence of symbols. For POS tagging, the question is: how likely is a particular sequence of POS tags for a given sentence?
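
Written out, this joint probability factorizes over the time steps. The notation below is a sketch that anticipates the symbols used later in this notebook: $\pi_{q}$ is the probability of starting in state $q$, $a_{q',q}$ is the probability of transitioning from state $q'$ to state $q$, and $b_{q}(o)$ is the probability of state $q$ emitting symbol $o$. For an observed sequence $o_1, \dots, o_T$ and a state sequence $q_1, \dots, q_T$:

$$
P(o_1, \dots, o_T, q_1, \dots, q_T) = \pi_{q_1} \, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1}, q_t} \, b_{q_t}(o_t)
$$

Decoding then amounts to finding the state sequence $q_1, \dots, q_T$ that maximizes this product for a fixed observed sequence.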

HMMs are particularly useful when dealing with sequential data, where the current state depends on previous states and the observed symbols provide indirect information about the hidden states. Despite their simplicity, HMMs have been widely used, and they have been extended or complemented by related models, such as hidden semi-Markov models (HSMMs) and conditional random fields (CRFs), to capture more intricate dependencies and improve performance in various NLP and pattern recognition tasks.


## Setting up the Notebook

### Import Required Packages

In [None]:
import numpy as np
from tqdm import tqdm
from collections import defaultdict

import nltk
from nltk.corpus import brown
from nltk.corpus import treebank
from nltk.corpus import conll2000

---

## Toy Example from the Lecture

We first look at the toy example that we used in the lecture to walk through the Viterbi algorithm. In this example, we again consider only 3 states (i.e., POS tags). The image below shows a screenshot of the lecture slide describing the toy example.

<img src='data/images/hmm-toy-model.png' width='90%' />

### "Training" the HMM

Recall from the lecture that an HMM is completely defined by the following 3 components:

* **Transition matrix $A$:** The transition matrix represents the probabilities of transitioning from one hidden state to another. It is also known as the state transition matrix or the transition probability matrix. The transition matrix for an HMM is typically denoted by $A$ and has dimensions ($N\times N$), where $N$ is the number of hidden states in the model.

* **Emission matrix $B$:** The emission matrix, also known as the emission probability matrix or observation probability matrix, represents the probabilities of emitting observable symbols or observations from each hidden state. The emission matrix captures the relationship between the hidden states and the observed symbols. The emission matrix is typically denoted by $B$ and has dimensions ($N\times M$), where $N$ is the number of hidden states and $M$ is the number of possible observable symbols.

* **Start probabilities $\pi$:** The start probabilities, also known as initial state probabilities or initial state distribution, represent the probabilities of starting the sequence in each hidden state. The start probabilities describe the likelihood of the HMM's initial state being in a particular hidden state before any observations are made. The start probabilities are typically represented by a vector, denoted as $\pi$ of length $N$, where $N$ is the number of hidden states in the HMM.

In our toy example, all 3 components were given to us, so we didn't need to train anything. We will see how to compute all these values from an annotated dataset when going through our real-world example below. Here, the focus is on simplicity and on the algorithm. For now, let's define $A$, $B$, and $\pi$:

In [None]:
A = np.array([
    [0.0, 0.8, 0.2],
    [0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0]
])

# the, fans, love, show
B = np.array([
    [0.2, 0.00, 0.00, 0.00], # DT
    [0.0, 0.05, 0.30, 0.10], # NN
    [0.0, 0.25, 0.15, 0.30]  # VB
])

PI = np.array([0.8, 0.2, 0.0])
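
As a quick, optional sanity check (not part of the lecture material), the sketch below verifies that each row of $A$ and the vector $\pi$ form probability distributions. It repeats the toy values so that it runs on its own. Note that the rows of $B$ sum to less than 1 here, presumably because the toy example lists only four words of a larger vocabulary; that reading is an assumption.

```python
import numpy as np

A = np.array([[0.0, 0.8, 0.2],
              [0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0]])
B = np.array([[0.2, 0.00, 0.00, 0.00],   # DT
              [0.0, 0.05, 0.30, 0.10],   # NN
              [0.0, 0.25, 0.15, 0.30]])  # VB
PI = np.array([0.8, 0.2, 0.0])

# Each row of A is a distribution over the next state
print('Row sums of A: ', A.sum(axis=1))
# PI is a distribution over the first state
print('Sum of PI:     ', PI.sum())
# Rows of B cover only the 4 listed words, so they need not sum to 1 here
print('Row sums of B: ', B.sum(axis=1))
```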

As we can index the values in $A$, $B$ and $\pi$ only using integer indices, we need a few dictionaries to map between
* tags (which represent the hidden states) and their respective indices
* words (which represent the observed variables) and their respective indices

In [None]:
tag2index, index2tag = {'DT': 0, 'NN': 1, 'VB': 2}, {0: 'DT', 1: 'NN', 2: 'VB'}
word2index, index2word = {'the': 0, 'fans': 1, 'love': 2, 'show': 3}, {0: 'the', 1: 'fans', 2: 'love', 3: 'show'}
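
To see how $A$, $B$, $\pi$ and the index dictionaries work together, the following sketch computes the joint probability of a sentence and one specific tag sequence. The helper name `joint_probability` is hypothetical and only used for this illustration; the values are copied from the toy example so that the block is self-contained.

```python
import numpy as np

A = np.array([[0.0, 0.8, 0.2],
              [0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0]])
B = np.array([[0.2, 0.00, 0.00, 0.00],   # DT
              [0.0, 0.05, 0.30, 0.10],   # NN
              [0.0, 0.25, 0.15, 0.30]])  # VB
PI = np.array([0.8, 0.2, 0.0])

tag2index = {'DT': 0, 'NN': 1, 'VB': 2}
word2index = {'the': 0, 'fans': 1, 'love': 2, 'show': 3}

def joint_probability(tokens, tags):
    """Probability of observing `tokens` together with the tag sequence `tags`."""
    s = tag2index[tags[0]]
    # Start probability times emission probability of the first word
    prob = PI[s] * B[s, word2index[tokens[0]]]
    for t in range(1, len(tokens)):
        prev, cur = tag2index[tags[t-1]], tag2index[tags[t]]
        # Transition into the current tag times emission of the current word
        prob *= A[prev, cur] * B[cur, word2index[tokens[t]]]
    return prob

tokens = ['the', 'fans', 'love', 'the', 'show']
p_good = joint_probability(tokens, ['DT', 'NN', 'VB', 'DT', 'NN'])
p_bad = joint_probability(tokens, ['DT', 'VB', 'NN', 'DT', 'NN'])
print(p_good, p_bad)
```

The second tag sequence gets probability 0 because the toy model does not allow a transition from NN to DT.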

### Viterbi Algorithm

The Viterbi algorithm is a dynamic programming algorithm used to find the most likely sequence of hidden states in a Hidden Markov Model (HMM) given a sequence of observations or symbols. It is an efficient algorithm that provides an optimal solution to the decoding problem in HMMs. The goal of the Viterbi algorithm is to find the path of hidden states that maximizes the joint probability of the observed sequence and the corresponding hidden state sequence. This is often referred to as the Viterbi path or the most likely state sequence. The algorithm operates on a trellis with one column per time step and one row per state; for our toy example, the initial trellis looks as follows (taken from the lecture slides):

<img src='data/images/hmm-toy-model-trellis-init.png'  width='90%' />

The Viterbi algorithm works by iteratively calculating the most likely path up to each hidden state at each time step. It maintains a dynamic programming table, often called the Viterbi trellis or Viterbi matrix, which stores the partial probabilities and backpointers for each state at each time step.

The steps of the Viterbi algorithm are as follows:

* **Initialization:** Initialize the first column of the Viterbi trellis with the product of the start probabilities and the emission probabilities for the first observation.

* **Recursion:** For each subsequent time step, calculate the maximum partial probability for each state by considering the maximum probability from the previous time step multiplied by the transition probabilities and the emission probabilities for the current observation. Store the maximum probability and the corresponding backpointer in the Viterbi trellis.

* **Termination:** Once all time steps have been processed, find the final state with the highest probability in the last column of the Viterbi trellis. This represents the most likely ending state.

* **Backtracking:** Starting from the most likely ending state, follow the backpointers in the Viterbi trellis to trace back the most likely state sequence.
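
The four steps can be summarized as recurrences. The following is a sketch in the notation of the lecture figures, where $v_t(s)$ is the trellis entry for state $s$ at time step $t$; the symbol $bt_t(s)$ for the backpointer is introduced here only for illustration:

$$
\begin{aligned}
\text{Initialization:} \quad & v_1(s) = \pi_s \, b_s(o_1) \\
\text{Recursion:} \quad & v_t(s) = \max_{s'} \; v_{t-1}(s') \, a_{s',s} \, b_s(o_t), \qquad bt_t(s) = \arg\max_{s'} \; v_{t-1}(s') \, a_{s',s} \, b_s(o_t) \\
\text{Termination:} \quad & \hat{q}_T = \arg\max_{s} \; v_T(s) \\
\text{Backtracking:} \quad & \hat{q}_{t-1} = bt_t(\hat{q}_t) \quad \text{for } t = T, T-1, \dots, 2
\end{aligned}
$$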

The resulting state sequence obtained through the Viterbi algorithm represents the most probable sequence of hidden states that explains the observed sequence of symbols. The method `viterbi()` below implements the Viterbi algorithm as covered in the lecture. The implementation is annotated and should be straightforward enough to understand the code and map it to iterative steps of the Viterbi algorithm visualized on the lecture slides.

In [None]:
def viterbi(tokens, A, B, PI):
    # Note: this method relies on the global dictionaries word2index and index2tag
    N, T = A.shape[0], len(tokens)         # N = number of states; T = length of sequence
    M = np.zeros((N, T))                   # Trellis: probability of the best path ending in state s at time step t
    BT = np.zeros((N, T), dtype=np.int16)  # Backtracking pointers
    
    #####################################################################################################
    ### Handle initial state = start probabilities multiplied by the respective emission probabilities
    for s in range(N):
        M[s,0] = PI[s] * B[s, word2index[tokens[0]]]
        
    #####################################################################################################
    ### Handle all transitions
    
    # Loop over all time steps
    for t in range(1, T):
        # Loop over all states
        for s in range(N):
            # Compute the probabilities of reaching state s from ALL states of the previous time step
            trans_probs = M[:,t-1] * A[:,s] * B[s,word2index[tokens[t]]]
            # Find the index of the previous state that yields the highest probability
            max_idx = np.argmax(trans_probs)
            # Update the trellis matrix with the highest probability for the current state and time step
            M[s,t] = trans_probs[max_idx]
            # Remember the index of the best previous state in the backtracking matrix
            BT[s,t] = max_idx

    #####################################################################################################
    ### Use back pointers to follow the path that led to the max probability
    state = np.argmax(M[:,-1])
    state_sequence = []
    for i in reversed(range(T)):
        state_sequence.append(state)
        state = BT[:,i][state]
        
    # We also return matrix M, but only to print it
    return [ index2tag[idx] for idx in reversed(state_sequence) ], M

For some additional explanation of the `viterbi()` method above, have a look at the figure below; again, directly taken from the lecture slides. This figure visualizes the computation of `trans_probs` where 

* `M[:,t-1]` represents the left column (i.e., the probabilities of the best paths ending in each state at time step $t-1$) -- in the figure: $v_{t-1}(1)$, $v_{t-1}(2)$, ..., $v_{t-1}(N)$

* `A[:,s]` represents the transition probabilities from *all* states at time step $t-1$ to state `s` at time step $t$ -- in the figure: $a_{1,s}$, $a_{2,s}$, ..., $a_{N,s}$

* `B[s,word2index[tokens[t]]]` represents the emission probability of word `tokens[t]` given state $s$ -- in the figure: $b_s(o_t)$

<img src='data/images/hmm-viterbi-visualized.png' />


The blue arrow indicates the path that yields the highest probability for reaching $v_{t}(s)$. This corresponds to the line `max_idx = np.argmax(trans_probs)` in the code of the `viterbi()` method.
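
The same computation can also be carried out for all target states at once using NumPy broadcasting. The sketch below performs a single recursion step for the toy example (moving from the first word "the" to the second word "fans"); the variable names `v_prev`, `scores`, `v_cur` and `backptr` are chosen for this illustration and do not appear in the `viterbi()` method.

```python
import numpy as np

A = np.array([[0.0, 0.8, 0.2],
              [0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0]])
B = np.array([[0.2, 0.00, 0.00, 0.00],   # DT
              [0.0, 0.05, 0.30, 0.10],   # NN
              [0.0, 0.25, 0.15, 0.30]])  # VB

v_prev = np.array([0.16, 0.0, 0.0])  # trellis column for 'the' (PI * B[:, 0])
obs = 1                              # column index of the word 'fans' in B

# scores[i, s] = v_prev[i] * A[i, s] * B[s, obs]
scores = v_prev[:, None] * A * B[:, obs][None, :]

v_cur = scores.max(axis=0)       # best probability for each target state s
backptr = scores.argmax(axis=0)  # best previous state for each target state s

print(v_cur)
print(backptr)
```

Each column of `scores` corresponds to one `trans_probs` vector in the loop-based implementation.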

With our method, we can now decode any sequence of tokens, including the example sequence from the lecture slides.

In [None]:
decoded_sequence, M = viterbi(['the', 'fans', 'love', 'the', 'show'], A, B, PI)

print(decoded_sequence)

The result looks as expected, as those are arguably the correct POS tags for each of the words in the sentence/sequence. For a more detailed inspection, we can also have a look at the trellis matrix `M`.

In [None]:
print(M)

The values in matrix `M` naturally reflect the completed trellis from the lecture slides:

<img src='data/images/hmm-toy-model-trellis-final.png' width='90%' />

The implementation of the Viterbi algorithm in method `viterbi()` together with the toy example from the lecture slides should help with a better understanding of the algorithm. If some individual steps are still unclear, feel free to edit the `viterbi()` method, e.g., by inserting additional `print` statements, and see how intermediate results are reflected in the figures visualizing the Viterbi algorithm (see above).
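
Another way to build confidence in the algorithm is to compare it against a brute-force search. For the toy example this is feasible, since there are only $3^5 = 243$ possible tag sequences. The sketch below enumerates all of them and keeps the most probable one; it is meant as a check under the toy model's values, not as a practical decoding method.

```python
import itertools
import numpy as np

A = np.array([[0.0, 0.8, 0.2],
              [0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0]])
B = np.array([[0.2, 0.00, 0.00, 0.00],   # DT
              [0.0, 0.05, 0.30, 0.10],   # NN
              [0.0, 0.25, 0.15, 0.30]])  # VB
PI = np.array([0.8, 0.2, 0.0])

tags = ['DT', 'NN', 'VB']
word2index = {'the': 0, 'fans': 1, 'love': 2, 'show': 3}

tokens = ['the', 'fans', 'love', 'the', 'show']
token_ids = [word2index[w] for w in tokens]

def sequence_probability(states):
    # Joint probability of the token sequence and one candidate state sequence
    prob = PI[states[0]] * B[states[0], token_ids[0]]
    for t in range(1, len(states)):
        prob *= A[states[t-1], states[t]] * B[states[t], token_ids[t]]
    return prob

# Enumerate all 3^5 state sequences and keep the most probable one
best_states = max(itertools.product(range(len(tags)), repeat=len(tokens)),
                  key=sequence_probability)
best_prob = sequence_probability(best_states)
best_tags = [tags[s] for s in best_states]

print(best_tags)
print(best_prob)
```

The brute-force search yields `['DT', 'NN', 'VB', 'DT', 'NN']` with a probability of about 3.84e-06. The cost of this search grows exponentially with the sentence length, which is exactly what the dynamic programming approach avoids.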

---

## Real-World Example

In the toy example above, we focused on the Viterbi algorithm for decoding an input sequence (e.g., an input sentence) given a trained HMM. In the toy example, the HMM was given to us in terms of the 3 components: transition matrix $A$, emission matrix $B$, and start probabilities $\pi$. In this section, we will actually train an HMM based on an annotated dataset. For this, we combine 3 annotated datasets provided by NLTK:

* **Treebank Dataset:** A treebank is a collection of parsed sentences represented as syntactic tree structures. It is a corpus specifically designed for training and evaluating parsers and other NLP tools. The treebank corpus in NLTK is a small sample (roughly 5%) of the Penn Treebank, which is one of the most widely used treebank resources. The sample contains parsed and POS-tagged sentences from Wall Street Journal news articles.

* **Brown Dataset:** The Brown corpus is one of the first and most influential general-purpose corpora of English text. The Brown corpus consists of samples of written English from various sources, covering a wide range of genres and topics. It was compiled in the 1960s at Brown University and contains over one million words of text. The corpus is divided into categories such as news, editorial, fiction, government, religion, and more, providing a diverse representation of English language usage. Each text sample in the Brown corpus is tokenized into words and annotated with POS tags.

* **CoNLL2000 Dataset:** The CoNLL 2000 dataset is a collection of annotated English language data commonly used for training and evaluating text chunking systems. It is based on the data used in the CoNLL-2000 shared task on chunking, which was organized as part of the Conference on Computational Natural Language Learning (CoNLL) in the year 2000. The dataset consists of news articles from the Wall Street Journal (WSJ) section of the Penn Treebank corpus. Each token in the CoNLL 2000 dataset is annotated with a POS tag and a chunk tag.

While every token/word in all three datasets is annotated with a POS tag, the 3 datasets do not (by default) share a single tag set: the Brown corpus uses its own tag set, whereas the other two use Penn Treebank tags. This prevents us from simply combining the datasets into one. However, NLTK can map the tags of all 3 datasets to the **Universal Part-of-Speech (POS) tag set**. The Universal POS tag set is a standardized set of part-of-speech tags that aims to provide a cross-linguistic and language-independent representation of word categories or syntactic roles in natural language processing (NLP) tasks.

The Universal POS tag set used here was originally proposed by Petrov, Das, and McDonald (2012) and later became the basis for the extended POS tag set of the [Universal Dependencies (UD) project](https://universaldependencies.org/), which seeks to create consistent and multilingual syntactic treebanks across different languages. The goal of the Universal POS tag set is to facilitate cross-linguistic comparisons and enable the development of language-independent NLP models and tools. The Universal POS tag set consists of a small number of coarse-grained and high-level categories that are applicable to a wide range of languages. The version used by NLTK includes the following 12 labels:

* **NOUN:** Nouns (common and proper)
* **VERB:** Verbs (main and auxiliary)
* **ADJ:** Adjectives
* **ADV:** Adverbs
* **PRON:** Pronouns
* **DET:** Determiners
* **ADP:** Adpositions (prepositions and postpositions)
* **NUM:** Numerals
* **CONJ:** Conjunctions
* **PRT:** Particles or other function words
* **. :** Punctuation marks
* **X:** Other or undetermined

This tag set provides a simplified and consistent representation of word categories across different languages, making it easier to develop cross-linguistic NLP models and perform comparative analyses. The Universal POS tag set is widely used in various NLP tasks, including part-of-speech tagging, syntactic parsing, and machine translation.

### Prepare Training Dataset

We first download the 3 datasets; or at least check if the datasets are already available.

In [None]:
nltk.download('treebank')
nltk.download('brown')
nltk.download('conll2000')

We can now extract the sentences together with their universal POS tags; we need to explicitly specify this! For easier use in the following, we also concatenate all 3 datasets into a single list and treat this list as a single dataset for training our HMM.

In [None]:
treebank_corpus = treebank.tagged_sents(tagset='universal')
brown_corpus = brown.tagged_sents(tagset='universal')
conll_corpus = conll2000.tagged_sents(tagset='universal')

# Combine all 3 corpora
tagged_sentences = list(treebank_corpus + brown_corpus + conll_corpus)

It's always helpful to first have a look at the raw data. The code cell below shows the information for the first sentence. As you can see, each sentence is represented by a list of 2-tuples, where each tuple contains the token/word at position 0 and the universal POS tag at position 1.

In [None]:
# Keep track of the total number of sentences; needed later to calculate the start probabilities
num_sent = len(tagged_sentences)

print('Total number of sentences: {}\n'.format(num_sent))

print('Example -- output for the first sentence')
tagged_sentences[0]

### Compute all Required Counts

We saw that we can compute $A$, $B$, and $\pi$ using Maximum Likelihood Estimation (MLE). The figure below is taken from the lecture slides and shows how the values for $A$, $B$, and $\pi$ are calculated. 

<img src='data/images/hmm-training-mle.png' width='90%' />
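
In terms of the counts used in the remainder of this notebook, the MLE estimates can be written as follows. This is a sketch of the formulas as implemented in the code cells below; the notation on the slide may differ slightly. Here, $\text{Count}(\langle S \rangle\, s_i)$ denotes the number of training sentences that start with state $s_i$.

$$
a_{ij} = \frac{\text{Count}(s_i, s_j)}{\text{Count}(s_i)}, \qquad
b_i(v_k) = \frac{\text{Count}(v_k, s_i)}{\text{Count}(s_i)}, \qquad
\pi_i = \frac{\text{Count}(\langle S \rangle\, s_i)}{\text{number of sentences}}
$$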

Let's first define a series of dictionaries to keep track of all required counts to compute the probabilities later captured by $A$, $B$, and $\pi$.

In [None]:
# Define sets to keep track of vocabulary V and tag set S
S, V = set(), set()

initial_state_counts     = defaultdict(int)  # Count(<S>s_i): Number of times a state s_i was the first state in a training sequence
state_counts             = defaultdict(int)  # Count(s_i): Number of times a state s_i occurred
state_transition_counts  = defaultdict(int)  # Count(s_i, s_j): Number of times the transition from a state s_i to state s_j occurred
observation_counts       = defaultdict(int)  # Count(v_k, s_i): Number of times seeing a token v_k in a state s_i

With these dictionaries initialized we now only need to go through all annotated sentences to calculate these counts. This is done in the code cell below. Again, the code is annotated and should be self-explanatory. Note that the whole code essentially just goes through the dataset and increases the respective counts in the dictionaries.

In [None]:
for sent in tqdm(tagged_sentences):
    
    # Get all tokens (lowercased) and tags
    tokens = [ t[0].lower() for t in sent ]
    tags = [ t[1] for t in sent ]
    
    # Update the set of tokens (i.e., observed variables) and set of tags (i.e., states)
    V.update(set(tokens))
    S.update(set(tags))
    
    # Increase the count for the initial state (tag)
    initial_state_counts[tags[0]] += 1
    
    # Iterate over all pairs of adjacent tags
    # (note: the last tag of a sentence has no successor and is therefore not counted in state_counts)
    for pos in range(len(tags)-1):
        pred, succ = tags[pos], tags[pos+1]
        # Increase the counter for state "pred"
        state_counts[pred] += 1
        # Increase the counter for transition from state "pred" to state "succ"
        state_transition_counts[(pred, succ)] += 1

    # Iterate over all (tag, token) pairs
    for pos in range(len(tags)):
        state, obs = tags[pos], tokens[pos]
        # Increase counter for observation "obs" given state "state"
        observation_counts[(state, obs)] += 1
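
To see what these counts look like, the following sketch applies the same counting logic to two tiny hand-made sentences. The sentences and the shortened dictionary names are made up for illustration; they are not taken from the corpora.

```python
from collections import defaultdict

# Two hand-made tagged sentences (illustrative only, not from the corpora)
toy_sentences = [
    [('the', 'DET'), ('fans', 'NOUN'), ('cheer', 'VERB')],
    [('fans', 'NOUN'), ('love', 'VERB'), ('the', 'DET'), ('show', 'NOUN')],
]

initial = defaultdict(int)      # how often a tag starts a sentence
transitions = defaultdict(int)  # how often tag "pred" is followed by tag "succ"
emissions = defaultdict(int)    # how often a word occurs with a tag

for sent in toy_sentences:
    tags = [tag for _, tag in sent]
    initial[tags[0]] += 1
    # Adjacent tag pairs within the sentence
    for pred, succ in zip(tags, tags[1:]):
        transitions[(pred, succ)] += 1
    for word, tag in sent:
        emissions[(tag, word)] += 1

print(dict(initial))
print(dict(transitions))
print(dict(emissions))
```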

Like we saw for the toy example, we again need a few dictionaries to map between words, tags and their respective indices. Here, we compute these dictionaries automatically, of course:

In [None]:
tag2index, index2tag = {}, {}
word2index, index2word = {}, {}

for idx, tag in enumerate(S):
    tag2index[tag] = idx
    index2tag[idx] = tag
    
for idx, word in enumerate(V):
    word2index[word] = idx
    index2word[idx] = word 

### Training the HMM

With all required counts calculated, we can now train our HMM by computing the transition matrix $A$, emission matrix $B$, and the start probabilities $\pi$.

#### Auxiliary Methods

Let's first define a series of auxiliary methods to compute:

* the transition probability for two given states (i.e., tags)
* the emission probability for a given state (i.e., tag) and observation (i.e., words)
* the initial state probability for a given state (i.e., tag)

According to the MLE, we simply need our counts to compute all probabilities. Note that the methods return 0 for combinations that never occurred in the training data (e.g., in case of unknown words) and whenever the count in the denominator is 0. In practice, techniques such as smoothing can be applied.

These auxiliary methods are not really needed, but they make the code below easier to read, which is the focus of this notebook; we do not care about the efficiency of the implementation.

In [None]:
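def get_transition_probability(s1, s2):
    try:
        return state_transition_counts[(s1, s2)] / state_counts[s1]
    except ZeroDivisionError:
        # State s1 was never seen (as a predecessor) in the training data
        return 0.0

def get_emission_probability(s, o):
    try:
        return observation_counts[(s, o)] / state_counts[s]
    except ZeroDivisionError:
        # State s has a count of 0 in the training data
        return 0.0
    
def get_initial_probability(s):
    try:
        return initial_state_counts[s] / num_sent
    except ZeroDivisionError:
        # No training sentences available
        return 0.0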
    
# Some example outputs
print(get_transition_probability('DET', 'ADJ'))
print(get_emission_probability('DET', 'the'))
print(get_initial_probability('DET'))

We have to read the output of the code cell above as follows:

* ~23.3% of the time, a determiner is followed by an adjective

* ~52.2% of the time, the determiner is the word "the"

* ~21.5% of the time, a sentence starts with a determiner
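
The auxiliary methods return 0 for any word/tag combination that never occurred in the training data. As a sketch of the smoothing mentioned above, the block below shows add-one (Laplace) smoothing for the emission probability. This is one common choice and not what this notebook uses; the function name `get_smoothed_emission_probability`, the parameter `k`, and the toy counts are all hypothetical.

```python
from collections import defaultdict

# Toy counts (hypothetical numbers for illustration only)
observation_counts = defaultdict(int, {('DET', 'the'): 8, ('DET', 'a'): 2})
state_counts = defaultdict(int, {'DET': 10})
vocab_size = 4  # assumed size of the vocabulary, including unseen words

def get_smoothed_emission_probability(s, o, k=1.0):
    # Add-k smoothing: every (state, word) pair receives k pseudo-counts,
    # so unseen pairs get a small non-zero probability
    count = observation_counts.get((s, o), 0)
    total = state_counts.get(s, 0)
    return (count + k) / (total + k * vocab_size)

p_seen = get_smoothed_emission_probability('DET', 'the')
p_unseen = get_smoothed_emission_probability('DET', 'those')
print(p_seen, p_unseen)
```

With these toy counts, the seen pair gets $9/14$ and the unseen pair gets $1/14$ instead of 0.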

#### Computing $A$, $B$,  $\pi$

With all the required counts and the auxiliary methods in place, the code cell below finally computes the transition matrix $A$, emission matrix $B$, and the start probabilities $\pi$. To this end, the code simply loops over each entry of the two matrices $A$ and $B$, as well as of vector $\pi$, and calculates the corresponding values using the auxiliary methods (which in turn utilize the various count values).

In [None]:
%%time

# Transition matrix
A = np.zeros((len(S), len(S)))

for si in range(len(S)):
    for sj in range(len(S)):
        A[si,sj] = get_transition_probability(index2tag[si], index2tag[sj])
        
        
# Emission matrix
B = np.zeros((len(S), len(V)))

for s in range(len(S)):
    for v in range(len(V)):
        B[s,v] = get_emission_probability(index2tag[s], index2word[v])
        
        
# Start probabilities (vector)
PI = np.zeros((len(S),))

for s in range(len(S)):
    PI[s] = get_initial_probability(index2tag[s])        
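
Looping over every entry of $B$ is easy to read but touches all $|S| \times |V|$ cells, most of which are 0. As a possible alternative, the sketch below fills only the entries for (tag, word) pairs that were actually observed. It uses small toy dictionaries (hypothetical values) so that it runs on its own, and it assumes that `state_counts[tag]` is greater than 0 for every tag with observations.

```python
import numpy as np

# Toy stand-ins for the dictionaries computed above (hypothetical values)
tag2index = {'DET': 0, 'NOUN': 1}
word2index = {'the': 0, 'fans': 1, 'show': 2}
state_counts = {'DET': 2, 'NOUN': 3}
observation_counts = {('DET', 'the'): 2, ('NOUN', 'fans'): 2, ('NOUN', 'show'): 1}

B_sparse_fill = np.zeros((len(tag2index), len(word2index)))

# Only visit the (tag, word) pairs that occur in the data; all other entries stay 0
for (tag, word), count in observation_counts.items():
    B_sparse_fill[tag2index[tag], word2index[word]] = count / state_counts[tag]

print(B_sparse_fill)
```

The resulting matrix is the same as the one obtained by looping over all entries, since unobserved pairs have probability 0 under the MLE.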

We have now trained our HMM. While the matrices $A$ and $B$ are too large to print, we can have a look at the start probabilities -- particularly since we used the reduced set of universal POS tags.

In [None]:
for s in range(len(S)):
    print('Start probability of state {}: {}'.format(index2tag[s], PI[s]))

These start probabilities tell us that sentences are most likely to start with a determiner (DET: ~21.5%), followed by a noun (NOUN: ~17.1%) and a pronoun (PRON: ~14.1%). This seems intuitive since English is generally a subject-verb-object (SVO) language, with sentences typically starting with the subject, and the subject of a sentence is typically a noun phrase (which often begins with a determiner) or a pronoun.

### Decoding with Viterbi Algorithm

Now that we have trained our HMM, we can reuse the `viterbi()` method implemented above to decode a couple of example sentences. We use the ones included in the lecture slides, but feel free to come up with and try your own. Note that all tokens must be lowercased and part of the training vocabulary, since `viterbi()` looks up each token in `word2index`.

In [None]:
decoded_sequence, _ = viterbi(['the', 'fans', 'love', 'the', 'show'], A, B, PI)
print(decoded_sequence)

In [None]:
decoded_sequence, _ = viterbi(['the', 'fans', 'like', 'the', 'show'], A, B, PI)
print(decoded_sequence)

In [None]:
decoded_sequence, _ = viterbi(['funny', 'movies', 'are', 'the', 'best'], A, B, PI)
print(decoded_sequence)

In [None]:
decoded_sequence, _ = viterbi(['i', 'like', 'watching', 'comedies'], A, B, PI)
print(decoded_sequence)

As you can see, our HMM-based POS tagger is not perfect. In the second example, "like" is labeled as ADP (adposition: preposition or postposition), whereas it should be labeled as a verb (VERB).
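
Apart from tagging errors, there is also a numerical limitation to keep in mind: the `viterbi()` method multiplies many small probabilities, which can underflow to 0 for long sentences. A common remedy is to work with log probabilities and add instead of multiply. The following is a minimal sketch of such a variant, demonstrated on the toy model; the function name `viterbi_log` and its interface (token indices instead of tokens) are assumptions made for this illustration.

```python
import numpy as np

def viterbi_log(token_ids, A, B, PI):
    """Viterbi decoding in log space; token_ids are column indices into B."""
    with np.errstate(divide='ignore'):  # log(0) = -inf is intended here
        logA, logB, logPI = np.log(A), np.log(B), np.log(PI)
    N, T = A.shape[0], len(token_ids)
    V = np.full((N, T), -np.inf)        # log probabilities of the best paths
    BT = np.zeros((N, T), dtype=int)    # backtracking pointers
    V[:, 0] = logPI + logB[:, token_ids[0]]
    for t in range(1, T):
        # scores[i, s] = V[i, t-1] + log a_{i,s} + log b_s(o_t)
        scores = V[:, t-1][:, None] + logA + logB[:, token_ids[t]][None, :]
        BT[:, t] = scores.argmax(axis=0)
        V[:, t] = scores.max(axis=0)
    # Follow the back pointers from the best final state
    state = int(V[:, -1].argmax())
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(BT[state, t])
        path.append(state)
    return path[::-1]

# Toy model from the first part of the notebook
A_toy = np.array([[0.0, 0.8, 0.2], [0.0, 0.5, 0.5], [0.5, 0.5, 0.0]])
B_toy = np.array([[0.2, 0.00, 0.00, 0.00],
                  [0.0, 0.05, 0.30, 0.10],
                  [0.0, 0.25, 0.15, 0.30]])
PI_toy = np.array([0.8, 0.2, 0.0])

# Indices of: the, fans, love, the, show
path = viterbi_log([0, 1, 2, 0, 3], A_toy, B_toy, PI_toy)
print(path)  # state indices: 0 = DT, 1 = NN, 2 = VB
```

On the toy example this yields the state indices `[0, 1, 2, 0, 1]`, i.e., DT NN VB DT NN, which matches the result of the probability-based implementation.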

---

## Summary

Hidden Markov Models (HMMs) have been widely used for part-of-speech (POS) tagging, which is the process of assigning the appropriate part-of-speech tags to each word in a sentence. HMMs provide a probabilistic framework for modeling the sequence of POS tags given the corresponding sequence of words in a sentence. In HMM-based POS tagging, the POS tags are considered as hidden states, and the observed words are treated as emissions from those hidden states. The underlying assumption is that the POS tags influence the choice of words in a sentence. By utilizing the probabilistic nature of HMMs, POS tagging can be formulated as a sequence labeling problem, where the goal is to find the most likely sequence of POS tags given the observed words.

To train an HMM-based POS tagger, a labeled training corpus is used to estimate the probabilities of transitioning from one POS tag to another and the probabilities of emitting each word from a given POS tag. These probabilities are typically estimated using maximum likelihood estimation or other statistical techniques. During the tagging process, the Viterbi algorithm is commonly employed to find the most probable sequence of POS tags for a given sentence. The algorithm efficiently computes the most likely sequence by considering the transition probabilities between POS tags and the emission probabilities of words from the corresponding POS tags.

HMM-based POS taggers have been successful in various NLP applications and research areas. They are particularly useful when the context and surrounding words play a significant role in determining the correct POS tags. However, HMM-based approaches may struggle with handling ambiguous words and capturing long-range dependencies. Over time, more advanced and neural network-based approaches, such as recurrent neural networks (RNNs) and transformer models, have gained popularity for POS tagging due to their ability to capture complex patterns and dependencies in language. Nonetheless, HMMs have provided a solid foundation for POS tagging and continue to serve as a reference point in the field.