# Language Modeling with NLTK

## Selecting a corpus

First we need to select a corpus to train the language model on. We will use *Alice's Adventures in Wonderland*, a novel by Lewis Carroll, which is included in NLTK's Gutenberg corpus.

In [1]:
from nltk.corpus import gutenberg

# Access the text of "Alice's Adventures in Wonderland"
alice_text = gutenberg.raw("carroll-alice.txt")
# Print the first 500 characters as a sample
print(alice_text[:500])  

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an


## Pre-processing the corpus

First, we convert the text to lowercase to normalize the words.

In [2]:
# Convert to lowercase (word normalization)
alice_text = alice_text.lower()
# Print the first 500 characters
print(alice_text[:500])

[alice's adventures in wonderland by lewis carroll 1865]

chapter i. down the rabbit-hole

alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought alice 'without pictures or
conversation?'

so she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy an


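Next, we segment the text into sentences. Note that in the output below the title, the chapter heading, and the first sentence end up in a single segment.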

In [3]:
from nltk import sent_tokenize

sentences = sent_tokenize(alice_text)

# Print the first 10 sentences
for sent in sentences[:10]:
    print(sent)
    print("_________________")

[alice's adventures in wonderland by lewis carroll 1865]

chapter i. down the rabbit-hole

alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought alice 'without pictures or
conversation?'
_________________
so she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a white rabbit with pink eyes ran
close by her.
_________________
there was nothing so very remarkable in that; nor did alice think it so
very much out of the way to hear the rabbit say to itself, 'oh dear!
_________________
oh dear!
_________________
i shall be late!'
_________________
(when she thought it over afterwards, it
occurred to her th

Then we split each sentence into tokens. Here we use NLTK's `TweetTokenizer`.

In [4]:
from nltk.tokenize import TweetTokenizer

tweet_tokenizer = TweetTokenizer()

sentences_tokenized = []

for sent in sentences:
    sent_tok = tweet_tokenizer.tokenize(sent)
    sentences_tokenized.append(sent_tok)

# Print the first 10 tokenized sentences
for sent_tok in sentences_tokenized[:10]:
    print(sent_tok)
    print("_________________")

['[', "alice's", 'adventures', 'in', 'wonderland', 'by', 'lewis', 'carroll', '1865', ']', 'chapter', 'i', '.', 'down', 'the', 'rabbit-hole', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do:', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', "'", 'thought', 'alice', "'", 'without', 'pictures', 'or', 'conversation', '?', "'"]
_________________
['so', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', ')', ',', 'whether', 'the', 'pleasure', 'of', 'making', 'a', 'daisy-chain', 'would', 'be', 'worth', 'the', 'trouble', 'of', 'getting', '

## Training the Language Model

See the complete documentation at https://www.nltk.org/api/nltk.lm.html

#### Compute n-grams in the training text

First we need to extract all the n-grams from the training text, along with the vocabulary of valid words. We can do that with the function `padded_everygram_pipeline`.

It takes the order of the n-grams and the list of tokenized sentences as parameters, and returns two lazy iterators: the n-grams of each sentence and the stream of words used to build the vocabulary. It also pads each sentence with special start-of-sentence and end-of-sentence tokens (`<s>` and `</s>`).

In [34]:
from nltk.lm.preprocessing import padded_everygram_pipeline

# Order of n-grams = 2. It will use unigrams and bigrams.
n = 2
train, vocab = padded_everygram_pipeline(n, sentences_tokenized)

vocab_list = list(vocab)

# Print the n-grams used for training
print("Training n-grams")
for sentence in train:
    for n_gram in sentence:
        print(n_gram)

# Print the vocabulary list and its length
print("Vocabulary words: ", vocab_list)
print("Vocabulary length: ", len(vocab_list))

Training n-grams
('<s>',)
('<s>', '[')
('[',)
('[', "alice's")
("alice's",)
("alice's", 'adventures')
('adventures',)
('adventures', 'in')
('in',)
('in', 'wonderland')
('wonderland',)
('wonderland', 'by')
('by',)
('by', 'lewis')
('lewis',)
('lewis', 'carroll')
('carroll',)
('carroll', '1865')
('1865',)
('1865', ']')
(']',)
(']', 'chapter')
('chapter',)
('chapter', 'i')
('i',)
('i', '.')
('.',)
('.', 'down')
('down',)
('down', 'the')
('the',)
('the', 'rabbit-hole')
('rabbit-hole',)
('rabbit-hole', 'alice')
('alice',)
('alice', 'was')
('was',)
('was', 'beginning')
('beginning',)
('beginning', 'to')
('to',)
('to', 'get')
('get',)
('get', 'very')
('very',)
('very', 'tired')
('tired',)
('tired', 'of')
('of',)
('of', 'sitting')
('sitting',)
('sitting', 'by')
('by',)
('by', 'her')
('her',)
('her', 'sister')
('sister',)
('sister', 'on')
('on',)
('on', 'the')
('the',)
('the', 'bank')
('bank',)
('bank', ',')
(',',)
(',', 'and')
('and',)
('and', 'of')
('of',)
('of', 'having')
('having',)
('having
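To make the padding and n-gram extraction more concrete, here is a minimal sketch that re-implements both steps in plain Python for a single toy sentence. The helper names `pad_sentence` and `all_ngrams` and the toy sentence are hypothetical, introduced only for illustration. This is not NLTK's implementation, only an approximation that yields the n-grams in the same order as the output above.

```python
def pad_sentence(tokens, n, start="<s>", end="</s>"):
    """Add n-1 start symbols and n-1 end symbols around a tokenized sentence."""
    pad = n - 1
    return [start] * pad + list(tokens) + [end] * pad


def all_ngrams(tokens, max_len):
    """Return every n-gram of length 1..max_len, grouped by start position."""
    grams = []
    for i in range(len(tokens)):
        for length in range(1, max_len + 1):
            if i + length <= len(tokens):
                grams.append(tuple(tokens[i:i + length]))
    return grams


# Hypothetical toy sentence, already lowercased and tokenized
toy_sentence = ["alice", "was", "late"]
padded = pad_sentence(toy_sentence, n=2)
toy_ngrams = all_ngrams(padded, max_len=2)

print(padded)
for gram in toy_ngrams:
    print(gram)
```

With `n = 2`, each sentence gets exactly one `<s>` and one `</s>`, so the model can also learn which words tend to open and close a sentence.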

#### Create the vocabulary

https://www.nltk.org/api/nltk.lm.vocabulary.html

We need to create an object of the class `Vocabulary` that will contain the vocabulary of valid words. 
We can filter out infrequent words with the parameter `unk_cutoff`: every word whose count is lower than this value is removed from the vocabulary and treated as unknown (`<UNK>`). With `unk_cutoff=1`, as used here, no word seen in the training text is filtered out.

In [16]:
from nltk.lm.vocabulary import Vocabulary

vocabulary_train = Vocabulary(vocab_list, unk_cutoff=1)

# Prints the vocabulary
for word in vocabulary_train:
    print(word)


<s>
[
alice's
adventures
in
wonderland
by
lewis
carroll
1865
]
chapter
i
.
down
the
rabbit-hole
alice
was
beginning
to
get
very
tired
of
sitting
her
sister
on
bank
,
and
having
nothing
do:
once
or
twice
she
had
peeped
into
book
reading
but
it
no
pictures
conversations
'
what
is
use
a
thought
without
conversation
?
</s>
so
considering
own
mind
(
as
well
could
for
hot
day
made
feel
sleepy
stupid
)
whether
pleasure
making
daisy-chain
would
be
worth
trouble
getting
up
picking
daisies
when
suddenly
white
rabbit
with
pink
eyes
ran
close
there
remarkable
that
;
nor
did
think
much
out
way
hear
say
itself
oh
dear
!
shall
late
over
afterwards
occurred
ought
have
wondered
at
this
time
all
seemed
quite
natural
);
actually
took
watch
its
waistcoat-pocket
looked
then
hurried
started
feet
flashed
across
never
before
seen
either
take
burning
curiosity
field
after
fortunately
just
see
pop
large
under
hedge
another
moment
went
how
world
again
straight
like
tunnel
some
dipped
not
about
stopping
herself
f

We can look up the count of a word and check whether the word is included in the vocabulary.

In [None]:
# Get the count of a word in the vocabulary
print(vocabulary_train['alice'])
print(vocabulary_train['car'])

# Check if a word is in the vocabulary
print('alice' in vocabulary_train)
print('car' in vocabulary_train)

# lookup() returns in-vocabulary words unchanged and maps the rest to '<UNK>'
print(vocabulary_train.lookup('alice'))
print(vocabulary_train.lookup(['alice', 'car', 'the', 'computer']))

386
0
True
False
alice
('alice', '<UNK>', 'the', '<UNK>')
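The effect of the cutoff can be sketched in plain Python with a `Counter`. The helpers `build_vocab` and `lookup_words` and the toy token list are hypothetical names used only for this illustration; they mimic the behaviour described above and are not the `Vocabulary` class itself.

```python
from collections import Counter


def build_vocab(tokens, unk_cutoff):
    """Keep only the words whose count is at least unk_cutoff."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= unk_cutoff}


def lookup_words(words, vocab, unk="<UNK>"):
    """Replace every out-of-vocabulary word with the unknown token."""
    return tuple(word if word in vocab else unk for word in words)


# Hypothetical toy corpus
toy_tokens = ["the", "cat", "saw", "the", "dog", "the", "cat"]

vocab_cutoff_1 = build_vocab(toy_tokens, unk_cutoff=1)  # keeps every seen word
vocab_cutoff_2 = build_vocab(toy_tokens, unk_cutoff=2)  # drops words seen once

print(sorted(vocab_cutoff_1))
print(sorted(vocab_cutoff_2))
print(lookup_words(["the", "dog", "car"], vocab_cutoff_2))
```

Raising the cutoff shrinks the vocabulary and sends more tokens to `<UNK>`, which is one of the parameters worth varying when comparing models.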


#### Train the model

We can use different types of models, which correspond to different ways of computing n-gram probabilities through different smoothing techniques: basic MLE, Laplace, backoff, and so on.

See the complete documentation at https://www.nltk.org/api/nltk.lm.html

In [42]:
from nltk.lm.models import MLE, Laplace, StupidBackoff

# order of n-grams = 2
n = 2
lm_basic = MLE(n, vocabulary=vocabulary_train)
lm_laplace = Laplace(n, vocabulary=vocabulary_train)
lm_backoff = StupidBackoff(alpha = 0.4, order = n, vocabulary=vocabulary_train)

# Fit the models with the training n-grams
train, vocab = padded_everygram_pipeline(n, sentences_tokenized)
lm_basic.fit(train)
train, vocab = padded_everygram_pipeline(n, sentences_tokenized)
lm_laplace.fit(train)
train, vocab = padded_everygram_pipeline(n, sentences_tokenized)
lm_backoff.fit(train)
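The cell above calls `padded_everygram_pipeline` again before each `fit`. A likely reason is that the pipeline returns lazy iterators, which can be consumed only once. The sketch below shows that behaviour with an ordinary Python generator; `make_stream` is a hypothetical stand-in for the pipeline, not an NLTK function.

```python
def make_stream():
    """Hypothetical stand-in for a lazily generated stream of n-grams."""
    return (gram for gram in [("<s>",), ("<s>", "alice"), ("alice",)])


stream = make_stream()
first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing is left to read

print(len(first_pass))
print(len(second_pass))

# Building a fresh stream gives access to the data again
fresh_pass = list(make_stream())
print(len(fresh_pass))
```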



#### Evaluate the model

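We can get the probability that the model assigns to any n-gram with `score`, and its base-2 logarithm with `logscore`. Working with log probabilities helps avoid numerical underflow when many small probabilities are combined.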

In [43]:
# Get the probability of the unigram 'alice'
print("MLE probability of 'alice': ", lm_basic.score('alice'), lm_basic.logscore('alice')) 
print("Laplace probability of 'alice': ", lm_laplace.score('alice'), lm_laplace.logscore('alice'))
print("StupidBackoff probability of 'alice': ", lm_backoff.score('alice'), lm_backoff.logscore('alice'))

# Get the probability of the bigram ('alice', 'was'), i.e. of 'was' given the context 'alice'
print("MLE probability of ('alice', 'was'): ", lm_basic.score('was', ['alice']), lm_basic.logscore('was', ['alice']))
print("Laplace probability of ('alice', 'was'): ", lm_laplace.score('was', ['alice']), lm_laplace.logscore('was', ['alice']))
print("StupidBackoff probability of ('alice', 'was'): ", lm_backoff.score('was', ['alice']), lm_backoff.logscore('was', ['alice']))

MLE probability of 'alice':  0.010460988102658608 -6.578837060610674
Laplace probability of 'alice':  0.009754990925589837 -6.67964375386906
StupidBackoff probability of 'alice':  0.010460988102658608 -6.578837060610674
MLE probability of ('alice', 'was'):  0.04404145077720207 -4.5049941960177415
Laplace probability of ('alice', 'was'):  0.005698005698005698 -7.45532722030456
StupidBackoff probability of ('alice', 'was'):  0.04404145077720207 -4.5049941960177415
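The differences between the three scores can be traced back to the estimation formulas. The sketch below writes them out by hand for the bigram ('alice', 'was'). The count of 'alice' (386) is the one printed earlier; the bigram count (17) and the vocabulary size (2773) are assumptions, chosen because they are consistent with the printed scores, and are not values stated in this notebook. The function names are hypothetical and only approximate what the NLTK models compute.

```python
def mle_score(count_ngram, count_context):
    """Maximum likelihood estimate: relative frequency, no smoothing."""
    return count_ngram / count_context


def laplace_score(count_ngram, count_context, vocab_size):
    """Add-one smoothing: one extra count for every word in the vocabulary."""
    return (count_ngram + 1) / (count_context + vocab_size)


def stupid_backoff_score(count_ngram, count_context, lower_order_score, alpha=0.4):
    """Use the relative frequency if the n-gram was seen, else a discounted
    score from the lower-order model."""
    if count_ngram > 0:
        return count_ngram / count_context
    return alpha * lower_order_score


count_alice = 386     # printed earlier by vocabulary_train['alice']
count_alice_was = 17  # ASSUMPTION: inferred from the printed MLE score
vocab_size = 2773     # ASSUMPTION: inferred from the printed Laplace score

p_mle = mle_score(count_alice_was, count_alice)
p_laplace = laplace_score(count_alice_was, count_alice, vocab_size)
p_backoff_seen = stupid_backoff_score(count_alice_was, count_alice, lower_order_score=0.01)
p_backoff_unseen = stupid_backoff_score(0, count_alice, lower_order_score=0.01)

print(p_mle, p_laplace, p_backoff_seen, p_backoff_unseen)
```

Under these assumptions, Laplace smoothing lowers the probability of a seen bigram sharply because the vocabulary size added to the denominator is much larger than the context count, while StupidBackoff matches MLE whenever the bigram was observed.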


We can also evaluate the quality of the model using perplexity (lower is better). Here it is computed on the training data itself.

In [None]:
train, vocab = padded_everygram_pipeline(n, sentences_tokenized)
test = [n_gram for sentence in train for n_gram in sentence]
print("MLE perplexity of the train data: ", lm_basic.perplexity(test))
print("Laplace perplexity of the train data: ", lm_laplace.perplexity(test))
print("StupidBackoff perplexity of the train data: ", lm_backoff.perplexity(test))

MLE perplexity of the train data:  67.88670218144063
Laplace perplexity of the train data:  281.5965005703454
StupidBackoff perplexity of the train data:  67.88670218144063
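As a reminder of what this number means, perplexity is 2 raised to the average negative base-2 log probability of the evaluated n-grams. The following minimal sketch computes it by hand; the function name `perplexity_from_probs` and the toy probabilities are hypothetical, and treating a zero probability as infinite perplexity is an assumption of this sketch.

```python
import math


def perplexity_from_probs(probabilities):
    """Perplexity = 2 ** (-(1/N) * sum(log2 p_i)) over the scored n-grams."""
    if any(p == 0 for p in probabilities):
        # A single impossible n-gram makes the whole sequence impossible
        return float("inf")
    avg_log_prob = sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** (-avg_log_prob)


# Hypothetical probabilities assigned by some model to four n-grams
toy_probs = [0.5, 0.25, 0.25, 0.5]
toy_perplexity = perplexity_from_probs(toy_probs)
print(toy_perplexity)

# An unseen n-gram under an unsmoothed model
print(perplexity_from_probs([0.5, 0.0]))
```

This is why an unsmoothed model can look good on the training data and still fail on held-out text: any n-gram it has never seen receives probability zero.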


<img src="./note.png" width = "40" height = "40" alt="note" align=top />

1. Try changing some parameters of the language model (order of n-grams, cutoff parameter, type of model, ...) and evaluate the impact of these variations using perplexity.
2. Create a function that generates new text with the language model from a previous context (always pick the most probable word given the context).
3. What is the quality of the generated text? Is there any significant difference when changing the parameters of the language model?

In [None]:
# ADD YOUR CODE HERE

We can quantitatively evaluate the quality of the generated text by comparing it to a reference text with BLEU, ROUGE and METEOR, using the library `evaluate`: https://pypi.org/project/evaluate/

In [None]:
%pip install evaluate
%pip install rouge_score


In [17]:
import evaluate
generated_sentence = "This is a generated sentence using the MLE model."
reference_sentence = "This is a new text using a language model"

bleu_metric = evaluate.load('bleu')
bleu1 = bleu_metric.compute(predictions=[generated_sentence], references=[reference_sentence], max_order=1)
bleu2 = bleu_metric.compute(predictions=[generated_sentence], references=[reference_sentence], max_order=2)
print("BLEU-1 score: ", bleu1['bleu'])
print("BLEU-2 score: ", bleu2['bleu'])

rouge_metric = evaluate.load('rouge')
rouge = rouge_metric.compute(predictions=[generated_sentence], references=[reference_sentence])
print(f"ROUGE-1 F1 Score: {rouge['rouge1']:.2f}")
print(f"ROUGE-L F1 Score: {rouge['rougeL']:.2f}")

meteor_metric = evaluate.load('meteor')
meteor = meteor_metric.compute(predictions=[generated_sentence], references=[reference_sentence])
print(f"METEOR Score: {meteor['meteor']:.2f}")

BLEU-1 score:  0.5
BLEU-2 score:  0.3333333333333333
ROUGE-1 F1 Score: 0.56
ROUGE-L F1 Score: 0.56
METEOR Score: 0.41
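To see where the BLEU numbers come from, the sketch below computes a simplified single-reference BLEU by hand: clipped n-gram precisions, their geometric mean, and a brevity penalty. The function names and the regex tokenizer are hypothetical simplifications, not the tokenizer or implementation used by `evaluate`, although for these two sentences the result should agree with the scores printed above.

```python
import math
import re
from collections import Counter


def simple_tokenize(text):
    """Split into words and separate punctuation marks (simplifying assumption)."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order):
    """Simplified BLEU for one prediction and one reference, without smoothing."""
    pred = simple_tokenize(prediction)
    ref = simple_tokenize(reference)
    log_precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference
        overlap = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: only predictions shorter than the reference are penalized
    if len(pred) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * math.exp(sum(log_precisions) / max_order)


generated = "This is a generated sentence using the MLE model."
reference = "This is a new text using a language model"

bleu_1 = simple_bleu(generated, reference, max_order=1)
bleu_2 = simple_bleu(generated, reference, max_order=2)
print(bleu_1, bleu_2)
```

Here 5 of the 10 predicted unigrams and 2 of the 9 predicted bigrams appear in the reference, which gives 0.5 for BLEU-1 and the square root of 0.5 × 2/9, roughly 0.33, for BLEU-2.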


