-
Notifications
You must be signed in to change notification settings - Fork 2
Assignment: POS tagging with HMMs
Implement a bigram part-of-speech (POS) tagger based on Hidden Markov Models from scratch. Using NLTK is disallowed, except for the modules explicitly listed below. For this, you will need to develop and/or utilize the following modules:
- Corpus reader and writer (10 points)
- Training procedure (30 points)
- Viterbi tagging, including unknown word handling (50 points)
- Evaluation (10 points)
The task is mostly very straightforward, but each step requires careful design. Thus, we suggest you proceed in the following way.
First, implement the Viterbi algorithm for finding the optimal state (tag) sequence given the sequence of observations (words). We suggest you test your implementation on a small example for which you know the correct tag sequence, such as the Eisner's Ice Cream HMM from the lecture.
Make sure your Viterbi algorithm runs properly on the example before you proceed to the next step. Submit the best state sequence
There are plenty of other detailed illustrations for the Viterbi algorithm on the Web from which you can take example HMMs. Please resist the temptation to copy Python code from those websites; that would be plagiarism.
Second, estimate the parameters of your HMM from data, i.e. the initial, transition, and emission probabilities. Implement a maximum likelihood training procedure for supervised learning of HMMs.
You can get the German GSD Universal Dependencies corpus from this repository. It contains a training set, a test set, and an evaluation set. The training set and the test set are written in the commonly used CoNLL-U format. They are text files with multiple colums; the second column contains the words, the POS tags are in the fourth column, and empty lines delimit sentences. The corpus uses the UD POS tagset.
Feel free to use the conllu library to read the corpora.
Once you have trained a model, predict POS tags on unseen data (the test set) using the Viterbi algorithm and evaluate it against the gold annotations.
TODO: evaluate it on the unseen data from the test set. Run the Viterbi algorithm with each of your models, and output a tagged corpus in the two-column CoNLL format. We have provided an evaluation script on Moodle. Run it on the output of your tagger and the evaluation set and report your results.
Note that your tagger will initially fail to produce output for sentences that contain words you haven't seen in training. If you have such a word
The task is challenging as it stands. However, feel free to go further for extra credit, e.g.\ by doing one of the following: implement better unknown word handling, use a trigram tagger, plot a learning curve for your tagger (accuracy as a function of training data size), plot a speed vs. sentence length curve.
Please submit your code, instructions for running your tagger and tagging output(s). Document any additional data you submit.
With this, you will have implemented your first POS tagger! Well done!