Lab 3: Language Modeling
=============
In this problem set, your objective is to train a language model, evaluate it, and explore how it can be used for language generation. To that end, you will:

- Train an n-gram language model.
- Use that language model to generate representative sentences.
- Study the effect of training data size and language model complexity (n-gram size) on the modeling capacity of a language model.

- **For this assignment, submit ```lab3.py``` on Gradescope.**
- To test the lab, run ```python run_tests.py``` or ```python run_tests.py -j``` (more detailed information).
- To install the correct dependencies, run ```pip install -r requirements.txt```.

Total points: 90 points

# 0. Setup

In order to develop this assignment, you will need [python 3.6](https://www.python.org/downloads/) and the following libraries. Most if not all of these are part of [anaconda](https://www.continuum.io/downloads), so a good starting point would be to install that.

- [jupyter](http://jupyter.readthedocs.org/en/latest/install.html)
- [nosetests](https://nose.readthedocs.org/en/latest/)
- [nltk](https://www.nltk.org)

Here is some help on installing packages in Python: https://packaging.python.org/installing/. You can use ```pip install --user``` to install locally without sudo. We have also provided a requirements.txt file with the correct packages and their respective versions, so you can also run ```pip install -r requirements.txt``` to install the correct dependencies.

In [1]:
import sys
from importlib import reload
from collections import defaultdict
import lab3

In [2]:
print('My Python version')

print('python: {}'.format(sys.version))

My Python version
python: 3.9.1 (default, Dec 28 2020, 11:25:16) 
[Clang 12.0.0 (clang-1200.0.32.28)]


In [3]:
import nose
import nltk

In [4]:
print('My library versions')

print('nose: {}'.format(nose.__version__))
print('nltk: {}'.format(nltk.__version__))

My library versions
nose: 1.3.7
nltk: 3.6.1


To test whether your libraries are the right version, run:

`nosetests tests/test_environment.py`

In [5]:
! nosetests tests/test_environment.py

.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK


# 1. Training a language model

Let us first train a 3-gram language model. We need a monolingual corpus, which we will get using nltk.

Total: 40 points

Let us first extract two corpora from nltk's reuters corpus, one for each of two domains (here, subject areas): the food industry and the natural resources industry.

In [6]:
import nltk

food = ['barley', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copra-cake', 'grain', 'groundnut', 'groundnut-oil', 'potato', 'soy-meal', 'soy-oil', 'soybean', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'veg-oil', 'wheat']
natural_resources = ['alum', 'fuel', 'gas', 'gold', 'iron-steel', 'lead', 'nat-gas', 'palladium', 'propane', 'tin', 'zinc']
corpus = nltk.corpus.reuters
food_corpus = corpus.raw(categories=food)
natr_corpus = corpus.raw(categories=natural_resources)

In [8]:
print(food_corpus)

 --    2.54
   Evnsvlle    1.54  2.04  1.50  2.17  4.90  2.61
   Cinci       1.52  2.04  1.50  2.17  4.85  2.58
   Mpls        1.34  1.75  1.50  1.85  4.68  --
   Balt/Nor/
   Phil        1.70  1.80  --    --    4.98  3.12
   KC          1.49  1.56  1.64  --    4.76  2.58
   St Lo       1.54  --    1.66  --    4.90  2.91
   Amarlo/
   Lubbck      1.84  1.40  --    --    4.75  2.92
   Lou Gulf    1.73  --    --    --    5.05  3.12
   Port/
   Seattle     1.87  2.10  1.68  --    --    --
   Stockton    2.18  2.23  2.10  --    --    4.00
   LA          2.54  2.50  --    --    --    4.38
   Duluth      1.34  1.75  1.50  1.85  4.68  --
   Tex Gulf    1.73  1.48  1.73  --    5.05  3.12
  

VIETNAM'S ARMY ORDERED TO GROW MORE FOOD
  Vietnam has ordered its army to grow
  more food to ease shortages and meet economic recovery goals
  set for 1990.
      The army newspaper Quan Doi Nhan Dan, monitored here, said
  soldiers must work harder to care for rice, vegetables and
  other crops endanger

## Tokenization

Your first task is to tokenize the raw text into a list of sentences, each of which is in turn a list of words. No other preprocessing, such as lowercasing, is needed.

- **Deliverable 1.1**: Complete the function `lab3.tokenize`. (5 points)
- **Test**: `tests/test_visible.py:test_d1_1_tk`

In [9]:
reload(lab3);

In [11]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/xietian/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [12]:
food_corpus_tk = lab3.tokenize_corpus(food_corpus)
natr_corpus_tk = lab3.tokenize_corpus(natr_corpus)

## Padding

Your second task is to pad your sentences with the start-of-sentence symbol `'<s>'` and end-of-sentence symbol `'</s>'`. These symbols are necessary to model the probability of words that usually start a sentence and those that usually end a sentence.

- **Deliverable 1.2**: Complete the function `lab3.pad`. (5 points)
- **Test**: `tests/test_visible.py:test_d1_2_pad`
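
As a rough illustration of what padding does (not the reference solution), the sketch below wraps each tokenized sentence in one start symbol and one end symbol. The helper names `pad_sentence` and `pad_all` are hypothetical, and using a single symbol on each side is an assumption; higher-order models are sometimes padded with more than one start symbol.

```python
def pad_sentence(tokens, bos="<s>", eos="</s>"):
    """Return a new list with one start and one end symbol around the tokens."""
    # Build a new list so the caller's sentence is not modified in place.
    return [bos] + list(tokens) + [eos]


def pad_all(sentences):
    """Pad every sentence in a list of tokenized sentences."""
    return [pad_sentence(sentence) for sentence in sentences]


# Toy input: two already-tokenized sentences (hypothetical data).
example = [["CHINA", "DAILY"], ["It", "also", "said"]]
print(pad_all(example))
```

Returning new lists rather than mutating the input keeps the unpadded corpus available for later steps that need it.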

In [14]:
reload(lab3);

In [15]:
food_corpus_tk_pd = lab3.pad_corpus(food_corpus_tk)
natr_corpus_tk_pd = lab3.pad_corpus(natr_corpus_tk)

## Train-Test Split

Your third task is to split each corpus into a training set, used to fit the language model, and a test set, used to evaluate it. We will go with the traditional 80% (train) / 20% (test) split. The first `floor(0.8*num_of_tokens)` should constitute the training corpus, and the rest should constitute the test corpus.

- **Deliverable 1.3**: Complete the function `lab3.split_corpus`. (5 points)
- **Test**: `tests/test_visible.py:test_d1_3_spc`
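
The split rule can be sketched as a single slice at `floor(0.8 * n)`. The function name `split_80_20` is hypothetical, and whether the items being counted are sentences or tokens is left to the assignment specification; the sketch works on any list.

```python
import math


def split_80_20(items, train_fraction=0.8):
    """Split a list into a leading train part and a trailing test part."""
    # floor() keeps the cut point an integer index even when
    # train_fraction * len(items) is not a whole number.
    cut = math.floor(train_fraction * len(items))
    return items[:cut], items[cut:]


# Toy corpus of ten identical padded sentences (hypothetical data).
toy = [["<s>", "a", "</s>"]] * 10
train, test = split_80_20(toy)
print(len(train), len(test))
```

Because the split takes the first part of the list rather than a random sample, the result is deterministic, which is convenient for automated tests.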

In [18]:
reload(lab3);

In [19]:
food_corpus_tr, food_corpus_te = lab3.split_corpus(food_corpus_tk_pd)
natr_corpus_tr, natr_corpus_te = lab3.split_corpus(natr_corpus_tk_pd)

In [29]:
flat_list = [item for sublist in food_corpus_tr[0:2] for item in sublist]  # flatten the first two sentences into one token list
print(food_corpus_tr[0:2])
print("\n#####\n")
print(flat_list)

[['<s>', 'CHINA', 'DAILY', 'SAYS', 'VERMIN', 'EAT', '7-12', 'PCT', 'GRAIN', 'STOCKS', 'A', 'survey', 'of', '19', 'provinces', 'and', 'seven', 'cities', 'showed', 'vermin', 'consume', 'between', 'seven', 'and', '12', 'pct', 'of', 'China', "'s", 'grain', 'stocks', ',', 'the', 'China', 'Daily', 'said', '.', '</s>'], ['<s>', 'It', 'also', 'said', 'that', 'each', 'year', '1.575', 'mln', 'tonnes', ',', 'or', '25', 'pct', ',', 'of', 'China', "'s", 'fruit', 'output', 'are', 'left', 'to', 'rot', ',', 'and', '2.1', 'mln', 'tonnes', ',', 'or', 'up', 'to', '30', 'pct', ',', 'of', 'its', 'vegetables', '.', '</s>']]

#####

['<s>', 'CHINA', 'DAILY', 'SAYS', 'VERMIN', 'EAT', '7-12', 'PCT', 'GRAIN', 'STOCKS', 'A', 'survey', 'of', '19', 'provinces', 'and', 'seven', 'cities', 'showed', 'vermin', 'consume', 'between', 'seven', 'and', '12', 'pct', 'of', 'China', "'s", 'grain', 'stocks', ',', 'the', 'China', 'Daily', 'said', '.', '</s>', '<s>', 'It', 'also', 'said', 'that', 'each', 'year', '1.575', 'mln', 't

In [32]:
word_list = []
for i in food_corpus_tr[0:2]:
    word_list.extend(i)
print(word_list)

['<s>', 'CHINA', 'DAILY', 'SAYS', 'VERMIN', 'EAT', '7-12', 'PCT', 'GRAIN', 'STOCKS', 'A', 'survey', 'of', '19', 'provinces', 'and', 'seven', 'cities', 'showed', 'vermin', 'consume', 'between', 'seven', 'and', '12', 'pct', 'of', 'China', "'s", 'grain', 'stocks', ',', 'the', 'China', 'Daily', 'said', '.', '</s>', '<s>', 'It', 'also', 'said', 'that', 'each', 'year', '1.575', 'mln', 'tonnes', ',', 'or', '25', 'pct', ',', 'of', 'China', "'s", 'fruit', 'output', 'are', 'left', 'to', 'rot', ',', 'and', '2.1', 'mln', 'tonnes', ',', 'or', 'up', 'to', '30', 'pct', ',', 'of', 'its', 'vegetables', '.', '</s>']


## Splitting into n-grams

Your fourth task is to count n-grams in the text up to a specific order.

- **Deliverable 1.4**: Complete the function `lab3.count_ngrams`. (20 points)
- **Test**: `tests/test_visible.py:test_d1_4_cn`
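
One possible shape for the counting step is sketched below. It is an assumption-laden illustration rather than the expected solution: the name `count_ngrams_sketch` is hypothetical, n-grams are counted within each sentence only (never across sentence boundaries), and the return value is assumed to be a pair of a tuple-keyed counter and a vocabulary set.

```python
from collections import Counter


def count_ngrams_sketch(sentences, max_order):
    """Count all n-grams of order 1..max_order in padded, tokenized sentences."""
    counts = Counter()  # missing n-grams read as 0 instead of raising KeyError
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence)
        for n in range(1, max_order + 1):
            # Slide a window of width n over the sentence.
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts, vocab


# Toy corpus with two padded sentences (hypothetical data).
toy = [["<s>", "oil", "and", "gas", "</s>"], ["<s>", "oil", "prices", "</s>"]]
toy_counts, toy_vocab = count_ngrams_sketch(toy, 3)
print(toy_counts[("oil",)], toy_counts[("<s>", "oil")], toy_counts[("oil", "and", "gas")])
```

Using tuples as keys lets unigrams, bigrams, and trigrams share a single dictionary, and a `Counter` returns 0 for n-grams that never occurred.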

In [35]:
reload(lab3);

In [34]:
food_ngrams, food_vocab = lab3.count_ngrams(food_corpus_tr, 3)
natr_ngrams, natr_vocab = lab3.count_ngrams(natr_corpus_tr, 3)

## Estimating n-gram probability

Your last task in this part of the problem set is to estimate the n-gram probabilities p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1}). For the purposes of this exercise, we will use the maximum likelihood estimate and perform no smoothing.

- **Deliverable 1.5**: Complete the function `lab3.estimate`. (5 points)
- **Test**: `tests/test_visible.py:test_d1_5_es`
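
The maximum likelihood estimate is the count of the full n-gram divided by the count of its context. The sketch below shows that ratio on hand-made counts. The name `mle_estimate` is hypothetical, the argument order (counts, then the predicted word as a one-element list, then the context) is an assumption, and returning 0.0 for an unseen context is one possible convention among several.

```python
from collections import Counter


def mle_estimate(counts, word, context):
    """Unsmoothed estimate of p(word | context) from tuple-keyed n-gram counts."""
    context = tuple(context)
    ngram = context + tuple(word)
    context_count = counts[context]
    if context_count == 0:
        # Assumed convention: an unseen context yields probability 0.0
        # instead of a division-by-zero error.
        return 0.0
    return counts[ngram] / context_count


# Toy counts (hypothetical): the context occurs 4 times, once followed by 'palm'.
toy_counts = Counter({("producer", "of"): 4, ("producer", "of", "palm"): 1})
print(mle_estimate(toy_counts, ["palm"], ["producer", "of"]))
```

Without smoothing, any n-gram that is absent from the training data receives probability zero, which is why smoothed models are used later when computing perplexity.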

In [68]:
reload(lab3);

In [69]:
print(lab3.estimate(food_ngrams, ['palm'], ['producer', 'of']))
print(lab3.estimate(natr_ngrams, ['basis'], ['tested', 'the']))

0.25
0.5


Application: a speech recognition system takes human speech as input and outputs text. When two words sound alike, a language model can help decide which one to choose.

In [38]:
print(food_ngrams[('there', 'is', 'no')])
print(food_ngrams[('their', 'is', 'no')])

11
0


Given the count of 'there is no' and 'their is no', which word ('there' or 'their') is more likely to be taken as the output? 
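
A minimal sketch of that decision, using only the two counts printed above (11 and 0): a recognizer that cannot tell the words apart acoustically could prefer the candidate phrase with the higher count. The variable names are hypothetical, and a real system would combine language model probabilities with acoustic scores instead of comparing raw counts.

```python
from collections import Counter

# Counts copied from the output above; every other n-gram reads as 0.
observed_counts = Counter({("there", "is", "no"): 11, ("their", "is", "no"): 0})

# Two candidate transcriptions that sound the same.
candidates = [("there", "is", "no"), ("their", "is", "no")]

# Prefer the candidate that was seen more often in the training corpus.
best = max(candidates, key=lambda ngram: observed_counts[ngram])
print(" ".join(best))
```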

Language models are helpful not only in speech recognition but also in text generation (*e.g.*, machine translation, summarization, image captioning), spelling correction, and so on.

## Training a language model

Now we will combine everything together and train our language model! One way to see what the language model has learned is to see the sentences it can generate.

For the sake of simplicity, and for the purposes of later parts in this problem set, we use nltk's lm module to train a language model.
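
nltk's `Laplace` model applies add-one smoothing: one is added to every n-gram count, so unseen n-grams keep a small nonzero probability. The sketch below shows only that arithmetic on hand-made counts. The function name `laplace_estimate` and the toy numbers are illustrative assumptions, and nltk's own vocabulary handling (for example, its unknown-word token) is not reproduced here.

```python
from collections import Counter


def laplace_estimate(counts, word, context, vocab_size):
    """Add-one estimate of p(word | context): (c(context, word) + 1) / (c(context) + V)."""
    context = tuple(context)
    return (counts[context + (word,)] + 1) / (counts[context] + vocab_size)


# Toy counts (hypothetical): 'oil and' occurs 3 times, twice followed by 'gas'.
toy_counts = Counter({("oil", "and"): 3, ("oil", "and", "gas"): 2})
toy_vocab_size = 5

seen = laplace_estimate(toy_counts, "gas", ["oil", "and"], toy_vocab_size)     # (2 + 1) / (3 + 5)
unseen = laplace_estimate(toy_counts, "gold", ["oil", "and"], toy_vocab_size)  # (0 + 1) / (3 + 5)
print(seen, unseen)
```

With a realistic vocabulary of many thousands of words, the `vocab_size` term dominates the denominator for rare contexts, which pulls every estimate for those contexts toward a uniform distribution.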

In [39]:
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
size_ngram = 3

food_train, food_vocab = padded_everygram_pipeline(size_ngram, food_corpus_tk[:int(0.8*len(food_corpus_tk))])
natr_train, natr_vocab = padded_everygram_pipeline(size_ngram, natr_corpus_tk[:int(0.8*len(natr_corpus_tk))])

food_test = sum([['<s>'] + x + ['</s>'] for x in food_corpus_tk[int(0.8*len(food_corpus_tk)):]],[])
natr_test = sum([['<s>'] + x + ['</s>'] for x in natr_corpus_tk[int(0.8*len(natr_corpus_tk)):]],[])

food_lm = Laplace(size_ngram)
natr_lm = Laplace(size_ngram)

food_lm.fit(food_train, food_vocab)
natr_lm.fit(natr_train, natr_vocab)

In [41]:
reload(lab3);

Now let's ask our language model to generate a sentence. 

In [42]:
# This might take some time
n_words = 10
print(food_lm.generate(n_words, random_seed=3))  # random_seed makes the random sampling part of generation reproducible. 
print(natr_lm.generate(n_words, random_seed=3))

['<s>', '<s>', '<s>', 'GHANA', 'COCOA', 'PURCHASES', '1,323', 'TONNES', 'IN', '1987/88']
['<s>', '<s>', '<s>', 'HAITI', 'ANNOUNCES', 'FIND', 'OF', 'ORE-RICH', 'GOLD', 'FIELD']
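
To give an intuition for generation, the sketch below samples one word at a time from bigram counts, starting at `'<s>'` and stopping at `'</s>'` or after a word limit. It is a simplified stand-in and not how nltk implements `generate`: the function name `generate_bigram_sketch` is hypothetical, it uses raw counts without smoothing, and it conditions on a single previous word.

```python
import random
from collections import Counter


def generate_bigram_sketch(bigram_counts, max_words, seed=0):
    """Sample a word sequence from bigram counts, one word at a time."""
    rng = random.Random(seed)  # a fixed seed makes the sampling reproducible
    words = []
    previous = "<s>"
    for _ in range(max_words):
        # Collect every word seen after `previous`, with its count as weight.
        followers = {ng[1]: c for ng, c in bigram_counts.items() if ng[0] == previous}
        if not followers:
            break  # dead end: nothing was ever observed after this word
        next_word = rng.choices(list(followers), weights=list(followers.values()), k=1)[0]
        if next_word == "</s>":
            break
        words.append(next_word)
        previous = next_word
    return words


# Toy counts (hypothetical) in which every word has exactly one follower.
toy_bigrams = Counter({("<s>", "oil"): 1, ("oil", "and"): 1, ("and", "gas"): 1, ("gas", "</s>"): 1})
print(generate_bigram_sketch(toy_bigrams, max_words=10))
```

Sampling in proportion to counts, rather than always taking the most frequent follower, is what lets repeated runs with different seeds produce different sentences.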


# 2. Evaluating a language model

Next, we evaluate our language models using the perplexity measure, and draw conclusions on how a change of domain (here, subject area) can affect the performance of a language model. Perplexity measures how well a language model predicts the sentences in a test corpus; lower perplexity means better predictions.

Total: 10 points

- **Deliverable 2.1**: Complete the function `lab3.get_perplexity`. (10 points)
- **Test**: `tests/test_visible.py:test_d2_1_gp`
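
Perplexity is the exponential of the average negative log-probability that the model assigns to each test token. The sketch below computes it for any probability function. The name `perplexity_sketch` and the `prob_fn(word, context)` signature are hypothetical, the first `order - 1` tokens are skipped because they lack a full context (an assumption, since other conventions exist), and probabilities are assumed to be strictly positive, as they are under smoothing.

```python
import math


def perplexity_sketch(tokens, prob_fn, order):
    """exp of the average negative log-probability over the scored tokens."""
    log_prob_sum = 0.0
    scored = 0
    for i in range(order - 1, len(tokens)):
        context = tuple(tokens[i - order + 1:i])
        # prob_fn must return a value > 0; math.log(0) would raise ValueError.
        log_prob_sum += math.log(prob_fn(tokens[i], context))
        scored += 1
    if scored == 0:
        raise ValueError("need at least `order` tokens to compute perplexity")
    return math.exp(-log_prob_sum / scored)


def uniform(word, context):
    """Toy model (hypothetical): every word has probability 1/4 in every context."""
    return 0.25


toy_tokens = ["<s>", "oil", "and", "gas", "</s>"]
print(perplexity_sketch(toy_tokens, uniform, order=2))
```

A uniform model over four words has a perplexity of about 4, which matches the reading of perplexity as the effective number of equally likely choices the model faces at each step.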

In [43]:
reload(lab3);

In [44]:
# This might take some time
print(lab3.get_perplexity(food_lm, food_test[:5000]))
print(lab3.get_perplexity(food_lm, natr_test[:5000]))
print(lab3.get_perplexity(natr_lm, natr_test[:5000]))
print(lab3.get_perplexity(natr_lm, food_test[:5000]))

8682.570675094219
8858.706404544759
5614.398479984016
5663.540343702553


- What observations can you make on the results? Is the domain shift affecting the performance of the language model? What are possible explanations?


**Your Observation**: 

# 3. Data size and model complexity

Let us now see how the size of the training data and the complexity of the model we choose affect the quality of our language model.

Total: 40 points

For this part we'd like to see the difference between a 2-gram model and a 3-gram model. Typically, with a larger n, the n-gram model gives us more information about the word sequence and has lower perplexity. 

For testing, we'll use only 5% of the corpus instead of the full 20% test split, to keep the running time manageable.

- **Deliverable 3.1**: Complete the function `lab3.vary_ngram`. (40 points)
- **Test**: `tests/test_visible.py:test_d3_1_vary`

In [75]:
test_corpus = natr_corpus_tk[int(0.8*len(natr_corpus_tk)): int(0.85*len(natr_corpus_tk))][:2]
# sum() needs an explicit empty-list start value to concatenate lists; its default start of 0 raises a TypeError.
new_test = sum([['<s>'] + x + ['</s>'] for x in test_corpus], [])
print(test_corpus)
print("\n#####\n")
print(new_test)

In [76]:
reload(lab3);

In [77]:
n_gram_orders = [2, 3]

train_corpus = natr_corpus_tk[:int(0.8*len(natr_corpus_tk))]
test_corpus = natr_corpus_tk[int(0.8*len(natr_corpus_tk)): int(0.85*len(natr_corpus_tk))]

results = lab3.vary_ngram(train_corpus, test_corpus, n_gram_orders)

print(results)

defaultdict(None, {2: 5425.7474270051935, 3: 5455.361088899949})


However, we notice that the 3-gram language model actually performs worse than the 2-gram language model here. This is due to the small size of the training corpus: a 3-gram model is too complex for so little training data, because most of the trigrams in the test set never occur in training. With more training data, we would see the opposite. For example, 1-gram, 2-gram, and 3-gram models trained on 38 million words from the Wall Street Journal reach perplexities of 962, 170, and 109, respectively, on a test set of 1.5 million words.

In [None]:
defaultdict(None, {2: 5596.7318534048245, 3: 5625.390747181811})

Now let's look at the most frequent n-grams, starting with unigrams.

In [54]:
natr_ngrams, natr_vocab = lab3.count_ngrams(natr_corpus_tr, 3)

top_ngram = []
count = 0
for i in sorted(natr_ngrams.items(), key=lambda x: x[1], reverse=True):
    if len(i[0]) == 1:
        top_ngram.append(i[0])
        count += 1
    if count >=20:
        break
print(top_ngram)

[(',',), ('the',), ('<s>',), ('</s>',), ('.',), ('of',), ('to',), ('and',), ('said',), ('in',), ('a',), ('for',), ('The',), ('from',), ('pct',), ('mln',), ('at',), ('on',), ("'s",), ('is',)]


Do you think unigrams capture any grammatical information? How well do you think unigrams capture information about the language?

Now let's see bigram and trigram. 

In [55]:
top_ngram = []
count = 0
for i in sorted(natr_ngrams.items(), key=lambda x: x[1], reverse=True):
    if len(i[0]) == 2:
        top_ngram.append(i[0])
        count += 1
    if count >=20:
        break
print(top_ngram)

top_ngram = []
count = 0
for i in sorted(natr_ngrams.items(), key=lambda x: x[1], reverse=True):
    if len(i[0]) == 3:
        top_ngram.append(i[0])
        count += 1
    if count >=20:
        break
print(top_ngram)

[('.', '</s>'), ('said', '.'), ('<s>', 'The'), ('in', 'the'), ('of', 'the'), ('&', 'lt'), ('lt', ';'), (',', 'the'), ('said', 'it'), ('said', 'the'), ('<s>', '``'), (',', "''"), (',', 'which'), ('to', 'the'), ('for', 'the'), (',', 'a'), ('on', 'the'), (',', 'and'), ('mln', 'dlrs'), ('<s>', 'It')]
[('said', '.', '</s>'), ('&', 'lt', ';'), ('.', "''", '</s>'), ('<s>', 'The', 'company'), ('<s>', 'It', 'said'), ('he', 'said', '.'), ('ounces', 'of', 'gold'), ('year', '.', '</s>'), ('The', 'company', 'said'), ('added', '.', '</s>'), ('oil', 'and', 'gas'), (',', 'it', 'said'), ('pct', '.', '</s>'), (',', "''", 'he'), (',', 'he', 'said'), ('it', 'said', '.'), ('sources', 'said', '.'), ('is', 'expected', 'to'), ('<s>', 'He', 'said'), ('the', 'company', 'said')]
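
The three loops above differ only in the n-gram length they filter on. One way to factor them, sketched here under the assumption that the counts behave like a dictionary keyed by tuples, is a small helper built on `collections.Counter.most_common`. The name `top_ngrams` is hypothetical.

```python
from collections import Counter


def top_ngrams(counts, order, k=20):
    """Return the k most frequent n-grams of the given order."""
    # Keep only the n-grams of the requested length, then rank them by count.
    of_order = Counter({ngram: c for ngram, c in counts.items() if len(ngram) == order})
    return [ngram for ngram, _ in of_order.most_common(k)]


# Toy counts (hypothetical) mixing unigrams and bigrams.
toy_counts = Counter({("the",): 5, ("of",): 3, ("of", "the"): 2, ("in", "the"): 4})
print(top_ngrams(toy_counts, 1))
print(top_ngrams(toy_counts, 2, k=1))
```

Filtering before ranking avoids sorting the full table of counts once per order.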


Compared with unigrams, bigrams and trigrams capture more information.
A bigram language model can already capture some grammatical information, such as 'in the' and 'of the'. However, the power of bigrams is still limited.
Trigrams yield more complete short phrases, such as 'ounces of gold', 'The company said', and 'oil and gas'.

Therefore, given enough training data, an n-gram model with a larger n typically contains more information about the word sequence and thus has lower perplexity. The tradeoff is higher computational and memory cost.