Skip to content

Assignment: POS tagging with HMMs

Alexander Koller edited this page Nov 4, 2025 · 9 revisions

Implement a part-of-speech (POS) tagger based on Hidden Markov Models from scratch. Using NLTK is disallowed, except for the modules explicitly listed below. Please keep in mind that you are not allowed to copy code from the Internet or generate it with LLMs.

You will need to develop and/or utilize the following modules:

  • Corpus reader and writer (10 points)
  • Training procedure (30 points)
  • Viterbi tagging, including unknown word handling (50 points)
  • Evaluation (10 points)

The task is mostly very straightforward, but each step requires careful design. Thus, we suggest you proceed in the following way.

Viterbi algorithm

First, implement the Viterbi algorithm for finding the optimal state (tag) sequence given the sequence of observations (words). We suggest you test your implementation on a small example for which you know the correct tag sequence, such as the Eisner's Ice Cream HMM from the lecture.

Make sure your Viterbi algorithm runs properly on the example before you proceed to the next step. Submit the best state sequence $x$ that your Viterbi implementation finds for $y = 3, 1, 3$ and its joint probability $P(x, y)$.

Training

Second, estimate the parameters of your HMM from data, i.e. the initial, transition, and emission probabilities. Implement a maximum likelihood training procedure for supervised learning of HMMs.

You can get the German GSD Universal Dependencies corpus from this repository. It contains a training set, a test set, and an evaluation set. The training set and the test set are written in the commonly used CoNLL-U format. They are text files with multiple colums; the second column contains the words, the POS tags are in the fourth column, and empty lines delimit sentences. The corpus uses the UD POS tagset.

Feel free to use the conllu library to read the corpora. This will put the UD POS tags into the upos field.

Evaluation

Once you have trained a model, predict POS tags on unseen data (the test set) using the Viterbi algorithm and evaluate it against the gold annotations.

Report the accuracy by dividing the total number of correctly classified tokens (across the whole corpus) by the total number of tokens in the corpus. You can write the output of your tagger back into a CoNLL-U file if you wish, but this is not required.

Unknown word handling. Note that your tagger will initially fail to produce output for sentences that contain words you haven't seen in training. If you have such a word $w$ appear at sentence position $t$, you will have $b_j(w) = 0$ for all states/tags $j$, and therefore $V_t(j) = 0$ for all $j.$

Extend your tagger with a mechanism for unknown word handling. One simple idea is as follows: Whenever you get $V_t(j) = 0$ for all $j$ because of an unknown word $w$ at position $t$, pretend that $b_j(w) = 1$ for all $j$. This will basically set $V_t(j) = \max_i V_{t-1}(i) \cdot a_{ij}$, and allow you to interpolate the missing POS tag based on the transition probabilities alone. You can get creative here if you wish, but we will only give full credit to solutions that work at least as well as this simple approach. Thus, we recommend implementing the simple approach first and then evaluating your own ideas in comparison to it.

Extra credit

Feel free to go further for extra credit, e.g. by doing one of the following: implement better unknown word handling, use a trigram tagger, plot a learning curve for your tagger (accuracy as a function of training data size), plot a speed vs. sentence length curve.

Submission

Please submit your code, instructions for running your tagger and tagging output(s). Document any additional data you submit.

With this, you will have implemented your first POS tagger! Well done!

Clone this wiki locally