# Solutions for Statistical Language Modeling with NLTK

- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

## Exercise: Counting Ngrams in Shakespeare's Hamlet

- Load Shakespeare's Hamlet from the Gutenberg corpus
    - lowercase it

- Extract padded unigrams and bigrams

- Using `NgramCounter`
    - get the total number of ngrams
    - get the count of the unigram `the`
    - get the count of the bigram `of the`

In [4]:
from nltk.corpus import gutenberg

hamlet = gutenberg.sents('shakespeare-hamlet.txt')

print(len(hamlet))
print(hamlet[0])

3106
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']


In [5]:
# lowercasing
hamlet = [[w.lower() for w in sent] for sent in hamlet]
print(hamlet[0])

['[', 'the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare', '1599', ']']


In [6]:
from nltk.lm import NgramCounter
from nltk.lm.preprocessing import padded_everygram_pipeline

In [7]:
padded_ngrams, flat_text = padded_everygram_pipeline(2, hamlet)
counter = NgramCounter(padded_ngrams)

In [8]:
# total number of ngrams (unigrams and bigrams together, padding symbols included)
counter.N()

84038

In [9]:
# count of unigram "the"
counter["the"]

993

In [10]:
# count of bigram "of the"
counter[["of"]]["the"]

59
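
To make the total above easier to interpret, here is a minimal pure-Python sketch of what counting padded unigrams and bigrams amounts to. It is not NLTK's implementation: `padded_everygrams` and `toy_sents` are hypothetical names introduced only for illustration, and the toy corpus is made up. A sentence of n tokens contributes n + 2 unigrams and n + 1 bigrams once padded.

```python
from collections import Counter

def padded_everygrams(sent, order=2):
    """Yield all n-grams of length 1..order from a padded sentence."""
    padded = ["<s>"] * (order - 1) + list(sent) + ["</s>"] * (order - 1)
    for n in range(1, order + 1):
        for i in range(len(padded) - n + 1):
            yield tuple(padded[i:i + n])

# hypothetical toy corpus, two sentences of two tokens each
toy_sents = [["a", "b"], ["a", "c"]]
counts = Counter(ng for sent in toy_sents for ng in padded_everygrams(sent))

total = sum(counts.values())
print(total)                 # 14: (4 unigrams + 3 bigrams) per sentence
print(counts[("a",)])        # 2
print(counts[("<s>", "a")])  # 2
```

Note that the sketch indexes everything by tuple, whereas `NgramCounter` indexes unigrams by string and higher orders by context first.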

## Exercise on Vocabulary
- look up in the vocabulary
    - "trento"
    - "trento is the capital city of trentino"
        - split into a list
- update the vocabulary with "trento is the capital city of trentino" (split into a list)
    - do the lookup again to see the effect
- create another vocabulary, changing the cut-off to `1`
    - do the lookup again to see the effect

In [20]:
from nltk.lm import Vocabulary
hamlet_words = gutenberg.words('shakespeare-hamlet.txt')

# lowercase
hamlet_words = [w.lower() for w in hamlet_words]

# initialize vocabulary with cut-off
vocab = Vocabulary(hamlet_words, unk_cutoff=2)

In [21]:
vocab.lookup("trento")

'<UNK>'

In [22]:
vocab.lookup("trento is the capital city of trentino".split())

('<UNK>', 'is', 'the', '<UNK>', 'city', 'of', '<UNK>')

In [23]:
# update the vocabulary and do the lookup again
# no effect: each new word now has count 1, still below the cut-off of 2
vocab.update("trento is the capital city of trentino".split())
vocab.lookup("trento is the capital city of trentino".split())

('<UNK>', 'is', 'the', '<UNK>', 'city', 'of', '<UNK>')

In [29]:
# another vocabulary built from the updated counts, with the cut-off lowered to 1
vocab1 = Vocabulary(vocab.counts, unk_cutoff=1)
vocab1.lookup("trento is the capital city of trentino".split())

('trento', 'is', 'the', 'capital', 'city', 'of', 'trentino')
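
The effect of the cut-off can be reproduced with a small pure-Python sketch. It only mimics the rule that a word belongs to the vocabulary when its count is at least the cut-off. The `lookup` helper and the toy sentence are hypothetical, and unlike `Vocabulary.lookup` the helper handles only lists of words.

```python
from collections import Counter

def lookup(counts, words, unk_cutoff, unk_label="<UNK>"):
    """Map every word whose count is below the cut-off to the unknown label."""
    return tuple(w if counts[w] >= unk_cutoff else unk_label for w in words)

# hypothetical toy data: the=3, king=2, is=1, of=1, city=1
counts = Counter("the king is the king of the city".split())
query = "trento is the city".split()

before = lookup(counts, query, unk_cutoff=2)
counts.update(query)  # every query word gains one count
after = lookup(counts, query, unk_cutoff=2)
lowered = lookup(counts, query, unk_cutoff=1)

print(before)   # ('<UNK>', '<UNK>', 'the', '<UNK>')
print(after)    # ('<UNK>', 'is', 'the', 'city')
print(lowered)  # ('trento', 'is', 'the', 'city')
```

A word seen once before the update crosses a cut-off of 2 afterwards, while a previously unseen word such as `trento` does not; lowering the cut-off to 1 admits both.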

## Exercise: Chain Rule
Implement a function that computes the score of a sequence using the chain rule.

- arguments:
    - Language Model
    - List of Tokens

- functionality
    - extracts ngrams w.r.t. LM order (`lm.order`)
    - scores each ngram w.r.t. LM (`lm.score` or `lm.logscore`)
        - note that `score` already handles OOV words by mapping them to `<UNK>`
    - computes the overall score using the chain rule
        - mind the difference: `score` values are multiplied, `logscore` values are summed

- compute the scores of the sentences below
    - compute padded and unpadded sequence scores
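
For a bigram model, the chain rule approximates the probability of a sequence as the product of P(word | previous word) over all its bigrams. The sketch below illustrates the difference between multiplying probabilities and summing base-2 log probabilities. The probability table is entirely hypothetical (it is not estimated from Hamlet) and is chosen so that the arithmetic is exact.

```python
import math

# Hypothetical bigram probabilities P(word | previous); not taken from any corpus.
probs = {
    ("<s>", "the"): 0.5,
    ("the", "king"): 0.25,
    ("king", "</s>"): 0.5,
}

bigrams = [("<s>", "the"), ("the", "king"), ("king", "</s>")]

# chain rule in probability space: multiply the conditional probabilities
product = math.prod(probs[bg] for bg in bigrams)

# chain rule in log space: sum the log2 probabilities
log_sum = sum(math.log2(probs[bg]) for bg in bigrams)

print(product)       # 0.0625
print(log_sum)       # -4.0
print(2 ** log_sum)  # 0.0625, the same value recovered from log space
```

Summing logs is usually preferred for long sequences, since a product of many small probabilities can underflow.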

In [36]:
test_sents = ["the king is dead", "the tzar is dead"]

In [60]:
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import ngrams

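# pad=True scores the padded sequence, pad=False the unpadded one
def score_seq(lm, sequence, pad=True):
    """Return the log2 score of a token list under lm, using the chain rule."""
    if pad:
        sequence = pad_both_ends(sequence, n=lm.order)
    seq_ngrams = list(ngrams(sequence, lm.order))
    # straightforward for bigrams:
    # following Jurafsky, the first word is not scored as a bare unigram, since
    # with padding the first bigram ('<s>', w1) already gives the probability
    # of w1 starting a sentence
    print(seq_ngrams)  # show the ngrams being scored
    # logscore takes the word and its context (the preceding tokens)
    seq_scores = [lm.logscore(ng[-1], ng[:-1]) for ng in seq_ngrams]
    # log probabilities are summed where probabilities would be multiplied
    return sum(seq_scores)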

In [61]:
# prepare an MLE language model
from nltk.lm.preprocessing import padded_everygram_pipeline
data = [[w.lower() for w in sent] for sent in gutenberg.sents('shakespeare-hamlet.txt')]
padded_ngrams, flat_text = padded_everygram_pipeline(2, data)

In [62]:
from nltk.lm import MLE

mle_lm = MLE(2)
mle_lm.fit(padded_ngrams, flat_text)

In [66]:
# let's score padded sentences
for sent in test_sents:
    print(score_seq(mle_lm, sent.split()))

[('<s>', 'the'), ('the', 'king'), ('king', 'is'), ('is', 'dead'), ('dead', '</s>')]
-25.988633342954444
[('<s>', 'the'), ('the', 'tzar'), ('tzar', 'is'), ('is', 'dead'), ('dead', '</s>')]
-inf
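
The `-inf` for the second sentence follows from how a maximum-likelihood estimate treats unseen events: a bigram that never occurred in training gets probability zero, and a single zero makes the whole sum of log scores `-inf`. In the NLTK model, `tzar` is first mapped to `<UNK>`, and the bigram `('the', '<UNK>')` was likewise never observed. Below is a minimal relative-frequency sketch of that behaviour. It is not NLTK's `MLE` class; `mle_logscore` and the one-sentence training data are hypothetical.

```python
import math
from collections import Counter

tokens = "<s> the king is dead </s>".split()  # hypothetical training data
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def mle_logscore(word, prev):
    """log2 of count(prev, word) / count(prev); -inf when the bigram is unseen."""
    if unigram_counts[prev] == 0 or bigram_counts[(prev, word)] == 0:
        return float("-inf")
    return math.log2(bigram_counts[(prev, word)] / unigram_counts[prev])

print(mle_logscore("king", "the"))  # 0.0, since 'the' is always followed by 'king'
print(mle_logscore("tzar", "the"))  # -inf, the bigram was never observed

# one unseen bigram is enough to drive the sequence score to -inf
total = sum([mle_logscore("the", "<s>"), mle_logscore("tzar", "the")])
print(total)  # -inf
```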
