Skip to content

Assignment: POS tagging with HMMs

Alexander Koller edited this page Oct 29, 2023 · 9 revisions

Part-of-speech tagging with HMMs

Implement a bigram part-of-speech (POS) tagger based on Hidden Markov Models from scratch. Using NLTK is disallowed, except for the modules explicitly listed below. For this, you will need to develop and/or utilize the following modules:

  • Corpus reader and writer (10 points)
  • Training procedure (30 points)
  • Viterbi tagging, including unknown word handling (50 points)
  • Evaluation (10 points)

The task is mostly very straightforward, but each step requires careful design. Thus, we suggest you proceed in the following way.

Viterbi algorithm

First, implement the Viterbi algorithm for finding the optimal state (tag) sequence given the sequence of observations (words). We suggest you test your implementation on a small example for which you know the correct tag sequence, such as the Eisner's Ice Cream HMM from the lecture.

Make sure your Viterbi algorithm runs properly on the example before you proceed to the next step. Submit the best state sequence $x$ that your Viterbi implementation finds for $y = 3, 1, 3$ and its joint probability $P(x, y)$.

There are plenty of other detailed illustrations for the Viterbi algorithm on the Web from which you can take example HMMs. Please resist the temptation to copy Python code from those websites; that would be plagiarism.

Training

Second, estimate the parameters of your HMM from data, i.e. the initial, transition, and emission probabilities. Implement a maximum likelihood training procedure for supervised learning of HMMs.

You can get the German GSD Universal Dependencies corpus from this repository. It contains a training set, a test set, and an evaluation set. The training set and the test set are written in the commonly used CoNLL-U format. They are text files with multiple colums; the second column contains the words, the POS tags are in the fourth column, and empty lines delimit sentences. The corpus uses the UD POS tagset.

Feel free to use the conllu library to read the corpora.

Evaluation

Once you have trained a model, predict POS tags on unseen data (the test set) using the Viterbi algorithm and evaluate it against the gold annotations.

TODO: evaluate it on the unseen data from the test set. Run the Viterbi algorithm with each of your models, and output a tagged corpus in the two-column CoNLL format. We have provided an evaluation script on Moodle. Run it on the output of your tagger and the evaluation set and report your results.

Note that your tagger will initially fail to produce output for sentences that contain words you haven't seen in training. If you have such a word $w$ appear at sentence position $t$, you will have $b_j(w) = 0$ for all states/tags $j$, and therefore $V_t(j) = 0$ for all $j$. Adapt your tagger by implementing the following crude approach to unknown words. Whenever you get $V_t(j) = 0$ for all $j$ because of an unknown word $w$ at position $t$, pretend that $b_j(w) = 1$ for all $j$. This will basically set $V_t(j) = \max_i V_{t-1}(i) \cdot a_{ij}$, and allow you to interpolate the missing POS tag based on the transition probabilities alone.

Extra credit

The task is challenging as it stands. However, feel free to go further for extra credit, e.g.\ by doing one of the following: implement better unknown word handling, use a trigram tagger, plot a learning curve for your tagger (accuracy as a function of training data size), plot a speed vs. sentence length curve.

Submission

Please submit your code, instructions for running your tagger and tagging output(s). Document any additional data you submit.

With this, you will have implemented your first POS tagger! Well done!

Clone this wiki locally