-
Notifications
You must be signed in to change notification settings - Fork 2
Assignment: n‐gram models
This assignment assumes that you are using Python, and that you have installed and familiarized yourself with the NLTK 3 library. For full credit, do Problem 1 and one of Problem 2 and Problem 3. Feel free to do all three, or to extend one of the problems, for extra credit.
Please do not use any NLTK classes except those that are explicitly allowed. You can use numpy in all assignments if you find it useful.
Investigate Zipf's Law empirically. Use the following freely available corpora:
For each corpus, compute a list of unique words sorted by descending frequency. Treat SETIMES as two separate corpora, one in Turkish and one in Bulgarian. You will need to read up on how to iterate over a Huggingface Dataset; this is an investment that will pay off very quickly.
Feel free to tokenize however you like, e.g. by splitting at whitespace. Use the Python library matplotlib to plot the frequency curves for the corpora, i.e. x-axis is position in the frequency list, y-axis is frequency. Make sure to provide both a plot with linear axes and one with log-log axes (see methods matplotlib.pyplot.plot and matplotlib.pyplot.loglog) for each corpus.
Provide a brief discussion of the findings, as well as the source code.
In this assignment, you will reimplement the "Dissociated Press" system that was developed by MIT students in the 1970s. The purpose of this system is to generate random text from an n-gram model over a corpus.
We have provided a Python class for training n-gram models that you're welcome to use. Train an instance of this class on a
corpus of your choice (from the problem above or elsewhere), and name it ngram. You can then use ngram[context] to determine the
probability distribution for the next word given the previous n-1
words. Given this distribution, you can use the generate method
from the NLTK class ProbDistI to generate the next random
word.
Use your system to produce a number of text samples of 100 words each. Vary n from 2 to 4. Submit a few interesting texts that your system generates for each n, and discuss how the quality (and creativity) of the generated outputs changes with n. Also submit your source code, and document any dependencies, such as links to the selected corpora.
In statistical NLP we frequently make independence assumptions about relevant events which are not actually correct in reality. We are asking you to test the independence assumptions of unigram language models.
Pointwise mutual information,
is a measure of statistical dependence of the events
Calculate the pmi for all successive pairs
Document and submit your observations and code. Discuss the validity of the independence assumption for unigram models.