<a href="https://colab.research.google.com/github/acastellanos-ie/NLP-MBD-EN-BL-ENE-2021-J-1/blob/main/language_modelling/language%20modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Colab Configuration

**Execute these steps to configure the Google Colab environment before running this notebook. They are not required if you are running it locally and have configured your local environment as explained in the GitHub repository.**

The first step is to clone the repository to get access to all the data and files.

In [None]:
repository_name = "NLP-MBD-EN-BL-ENE-2021-J-1"
repository_url = 'https://github.com/acastellanos-ie/' + repository_name

In [None]:
! git clone $repository_url

Cloning into 'MBD-EN-BL-ENE-2020-J-1'...
remote: Enumerating objects: 4481, done.
remote: Counting objects: 100% (4481/4481), done.
remote: Compressing objects: 100% (4368/4368), done.
remote: Total 4481 (delta 158), reused 4387 (delta 94), pack-reused 0
Receiving objects: 100% (4481/4481), 13.41 MiB | 19.53 MiB/s, done.
Resolving deltas: 100% (158/158), done.


Install the requirements

In [None]:
! pip install -Uqqr $repository_name/requirements.txt


Now you have everything you need to execute the code in Colab

# Language Modelling

In this notebook we are going to start playing with language models. In particular, we are going to start with the simplest approach, based on n-grams. Then, in the following threads, we will move on to more advanced approaches based on LSTM and Transformer architectures.

The Natural Language Toolkit (NLTK) has data types and functions that make life easier for us when we want to count bigrams and compute their probabilities.

Let's start!

In [None]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

**Import the Brown corpus**

For the experimentation, we are going to use the well-known Brown Corpus.

The Brown University Standard Corpus of Present-Day American English, or just Brown Corpus (https://en.wikipedia.org/wiki/Brown_Corpus), is a general corpus containing 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961.



In [None]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

Let's look at the words of the Brown corpus:

In [None]:
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Let's inspect the most frequent (and therefore most likely) words in the dataset. Word probabilities are central to our language model: when we ask the LM to generate new text, it relies on them to produce words that are likely in our dataset.

We compute the word frequencies with NLTK's `FreqDist` class. An `nltk.FreqDist` behaves like a dictionary that maps each word to its count (it is a subclass of `collections.Counter`), and its `most_common(n)` method returns the `n` most frequent entries.

The following cell computes the frequencies and lists the 20 most frequent words.

In [None]:
freq_brown = nltk.FreqDist(brown.words())

# most_common(n) returns the n most frequent (word, count) pairs, sorted by count
freq_brown.most_common(20)

[('the', 62713),
 (',', 58334),
 ('.', 49346),
 ('of', 36080),
 ('and', 27915),
 ('to', 25732),
 ('a', 21881),
 ('in', 19536),
 ('that', 10237),
 ('is', 10011),
 ('was', 9777),
 ('for', 8841),
 ('``', 8837),
 ("''", 8789),
 ('The', 7258),
 ('with', 7012),
 ('it', 6723),
 ('as', 6706),
 ('he', 6566),
 ('his', 6466)]

We can see that they are mostly stopwords and punctuation marks.

**Should we remove them? Why?** 

No. Think about what we are trying to do here: we want to use the dataset to build a model of the language that, given a sequence of words, predicts the most probable next word. For this task, stopwords, punctuation, and other symbols are needed.

For the same reason, we should not stem, lemmatize, or otherwise normalize the words. We need all these variations to learn a proper language model (e.g., `the` != `The`).
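To make the point about normalization concrete, here is a minimal plain-Python sketch. The toy sentence is hypothetical, and `collections.Counter` stands in for `nltk.FreqDist` so that the snippet runs without NLTK or any corpus.

```python
from collections import Counter

# Hypothetical toy tokens; in the notebook these would come from brown.words()
tokens = "The cat sat . The dog sat on the mat .".split()

raw_counts = Counter(tokens)                        # keeps case and punctuation
lower_counts = Counter(t.lower() for t in tokens)   # what lowercasing would do

# Sentence-initial "The" and mid-sentence "the" are counted separately
print(raw_counts["The"], raw_counts["the"])
# After lowercasing, the model can no longer tell them apart
print(lower_counts["the"])
# Punctuation is a token too, so the model can learn where sentences end
print(raw_counts["."])
```

In the raw counts the model still sees that `The` tends to start a sentence, which is exactly the kind of signal a generator needs.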

## Bigram Model

We'll start small and create a language model based on bi-grams. This LM is rather simplistic: it only encodes relationships between pairs of adjacent words.

To that end, we will use the `ConditionalFreqDist` function of NLTK. `nltk.ConditionalFreqDist()` counts frequencies of pairs. When given a list of bigrams, it maps each first word of a bigram to a FreqDist over the second words of the bigram.

If you remember the theoretical session, we are applying the Markov assumption: the next element (word in our case) of a sequence can be predicted by just focusing on the previous one.
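Written as a formula (the notation $w_i$ for the $i$-th word of the sequence is introduced here only for illustration), the bigram version of the Markov assumption reads:

$$
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})
$$

so the probability of a whole sequence factorizes as:

$$
P(w_1, \dots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
$$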

The following code creates these bi-gram counts.
If we print the `conditions` we can see the antecedent of the bi-grams. (`conditions()` in a `ConditionalFreqDist` are like `keys()` in a dictionary).

In [None]:
cfreq_brown_2gram = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))
cfreq_brown_2gram.conditions()[:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']
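To see what this kind of conditional counting amounts to, here is a rough plain-Python equivalent. It is only a sketch: the toy corpus and the name `following` are hypothetical, and a `defaultdict` of `Counter` objects stands in for `nltk.ConditionalFreqDist`.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for brown.words()
tokens = "my hand and my heart and my hand".split()

# Map each first word of a bigram to a Counter over the second words
following = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    following[first][second] += 1

# The "conditions" are the keys, in order of first appearance
print(list(following))
# Counts of the words seen right after "my", most frequent first
print(following["my"].most_common())
```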

Let's see the most frequent terms after the word `my`.

In [None]:
# The cfreq_brown_2gram entry for "my" is a FreqDist (i.e., a dictionary mapping each following word to its count).
my_terms = cfreq_brown_2gram["my"]

# Sort the terms by frequency and show the 25 most common
sorted(my_terms.items(), key=lambda x: -x[1])[:25]

[('own', 52),
 ('hand', 19),
 ('life', 19),
 ('mind', 19),
 ('first', 15),
 ('wife', 14),
 ('hands', 14),
 ('eyes', 13),
 ('father', 13),
 ('mother', 12),
 ('husband', 12),
 ('way', 12),
 ('head', 11),
 ('left', 8),
 ('heart', 7),
 ('point', 7),
 ('body', 7),
 ('Uncle', 7),
 ('best', 6),
 ('family', 6),
 ('right', 6),
 ('brother', 6),
 ('friends', 6),
 ('name', 6),
 ('business', 6)]

We can do the same with the `most_common` method.

In [None]:
cfreq_brown_2gram["my"].most_common(25)

[('own', 52),
 ('hand', 19),
 ('life', 19),
 ('mind', 19),
 ('first', 15),
 ('wife', 14),
 ('hands', 14),
 ('eyes', 13),
 ('father', 13),
 ('mother', 12),
 ('husband', 12),
 ('way', 12),
 ('head', 11),
 ('left', 8),
 ('heart', 7),
 ('point', 7),
 ('body', 7),
 ('Uncle', 7),
 ('best', 6),
 ('family', 6),
 ('right', 6),
 ('brother', 6),
 ('friends', 6),
 ('name', 6),
 ('business', 6)]

`nltk.ConditionalProbDist()` maps each condition to a probability distribution over the words that follow it, turning the pair counts into probabilities.

In [None]:
cprob_brown_2gram = nltk.ConditionalProbDist(cfreq_brown_2gram, nltk.MLEProbDist) # Uses a Maximum Likelihood Estimation (MLE) estimator

This again has `conditions()`, which are like dictionary keys.

In [None]:
cprob_brown_2gram.conditions()[:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

We can also find the words that can come after `my` by using the function `samples()`

In [None]:
cprob_brown_2gram["my"].samples()

dict_keys(['political', 'client', 'fellow', 'man', 'candidacy', 'best', 'place-kicking', 'last', 'reflexes', 'jobs', 'family', 'thanks', 'firm', 'payroll', 'judgment', 'sales', 'first', 'mother', 'boys', 'share', 'daily', 'wife', 'legs', 'big', 'hands', 'biologist', 'locker', 'hand', 'right', 'neck', 'heart', 'grudge', 'neighbor', 'brother', 'house', 'good', 'life', 'native', 'charge-a-plate', "son's", 'psychiatrist', 'son', 'children', 'arms', 'daughter', 'opinion', 'husband', 'friends', 'country', 'wonderful', 'school', 'home', 'desire', 'point', 'little', 'part', 'two', 'itinerary', 'classroom', 'initial', 'induction', 'own', 'students', 'classes', 'personal', 'only', 'estimation', 'taste', 'objectivity', 'bed', 'eyes', 'principal', 'primary', 'Roman', 'experience', 'stay', 'lot', 'leave', 'learned', 'Bible', 'nearest', 'Father', 'Saviour', 'patient', 'peace', 'work', 'patients', 'professional', 'talents', 'soul', 'light', 'salvation', 'foes', 'flesh', 'fingers', 'body', 'finger', '

In addition, you can query the probability of a particular pair.

In [None]:
cprob_brown_2gram["my"].prob("own")

0.04478897502153316

In [None]:
cprob_brown_2gram["my"].prob("leg")

0.0034453057708871662
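These values are simply relative frequencies. With $c(\cdot)$ denoting a count (notation introduced here for illustration), the maximum likelihood estimate is:

$$
P_{\text{MLE}}(w_2 \mid w_1) = \frac{c(w_1, w_2)}{\sum_{w} c(w_1, w)}
$$

As a sanity check against the outputs above: `own` follows `my` 52 times, and $52 / 1161 \approx 0.04479$, which suggests that `my` appears 1161 times as the first word of a bigram. By the same reasoning, $0.003445 \times 1161 \approx 4$, so `my leg` should occur about 4 times in the corpus.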

## Compute the probability of a sentence

Create a function to compute the probability of a word from its frequency

In [None]:
def unigram_prob(word):
    # freq_brown.N() is the total number of tokens counted, i.e. len(brown.words())
    return freq_brown[word] / freq_brown.N()

unigram_prob("night")

0.0003427512418273636

We now can ask for the probability of a word sequence.

For instance: `P(how do you do) = P(how) * P(do|how) * P(you|do) * P(do | you)`

In [None]:
unigram_prob("how") * cprob_brown_2gram["how"].prob("do") * cprob_brown_2gram["do"].prob("you") * cprob_brown_2gram["you"].prob("do")

1.5639033871961e-09

Compare it with the prob of another not so common sentence: `how do you dance`

In [None]:
unigram_prob("how") * cprob_brown_2gram["how"].prob("do") * cprob_brown_2gram["do"].prob("you") * cprob_brown_2gram["you"].prob("dance")

1.0089699272232904e-10

As expected, it is roughly one order of magnitude less probable.
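For longer sentences, the product of many small probabilities quickly underflows, so it is common to sum log-probabilities instead. The following is a minimal plain-Python sketch of that idea, assuming a hypothetical toy corpus and helper name (`sentence_logprob`); it does not use NLTK, so it runs anywhere.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for brown.words()
tokens = "how do you do . how do you do . how do you dance .".split()

unigrams = Counter(tokens)
following = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    following[first][second] += 1

def sentence_logprob(words):
    """log P(w1) + sum of log P(w_i | w_{i-1}) under the MLE bigram model."""
    if unigrams[words[0]] == 0:
        return float("-inf")          # unseen first word: probability 0
    logp = math.log(unigrams[words[0]] / len(tokens))
    for prev, cur in zip(words, words[1:]):
        count = following[prev][cur]
        if count == 0:
            return float("-inf")      # unseen bigram: MLE assigns probability 0
        logp += math.log(count / sum(following[prev].values()))
    return logp

print(math.exp(sentence_logprob("how do you do".split())))
print(math.exp(sentence_logprob("how do you dance".split())))
print(sentence_logprob("how dance you".split()))   # contains an unseen bigram
```

Note that a single unseen bigram drives the whole sentence probability to zero, which is one reason smoothed estimators are usually preferred over plain MLE.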

## Generate Language

With our bi-gram language model already built, we can now use it to generate text and see what our model has learned.

In [None]:
cprob_brown_2gram["my"].generate()

'bed'

Let's see if the model creates valid text or just gibberish.

In [None]:
word = "my"
text = ""
for _ in range(20):
    text += word + " "
    word = cprob_brown_2gram[word].generate()
print(text)

my place us forever . Next to be so many listeners , and now one of the meaning greetings from 


It is not a valid sentence, but it does make some local sense.

Remember that we are just learning from bigrams!
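Conceptually, each generation step draws the next word at random, with probability proportional to how often it followed the current word. Here is a plain-Python sketch of that sampling loop; the toy corpus and the helper `sample_next` are hypothetical, and `random.choices` stands in for the `generate()` call used above.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for brown.words()
tokens = "my hand and my heart and my hand".split()

following = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    following[first][second] += 1

def sample_next(word, rng):
    """Draw the next word with probability proportional to its bigram count."""
    candidates = list(following[word].keys())
    weights = list(following[word].values())
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)   # fixed seed only to make the sketch repeatable
word, generated = "my", []
for _ in range(8):
    generated.append(word)
    word = sample_next(word, rng)
print(" ".join(generated))
```

Every pair of adjacent words in the result was seen in the training data, but nothing constrains the sequence beyond that, which is why bigram output drifts so quickly.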

**We can also train language models on other datasets.**

In particular, we are going to import the book dataset of NLTK, which includes the text of several books.


The following function takes a text (e.g., the text of a given book) from which to learn a language model, an initial word to start the generation, and the number of words to generate.

In [None]:
# Here is how to do this with NLTK books:
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')

from nltk.book import *


def generate_text(text, initialword, numwords):
    # Estimate P(next word | current word) from the bigrams of the given text
    bigrams = list(nltk.ngrams(text, 2))
    cpd = nltk.ConditionalProbDist(nltk.ConditionalFreqDist(bigrams), nltk.MLEProbDist)

    word = initialword
    generated = ""  # separate name, so the `text` argument is not overwritten
    for _ in range(numwords):
        generated += word + " "
        word = cpd[word].generate()

    print(generated)

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of

We use different books to generate text

In [None]:
# Holy Grail
generate_text(text6, "I", 25)


I am Zoot ! MINSTREL : Patsy . Come on ! It is your mother ! Go away , as soon to make them an 


In [None]:
# sense and sensibility
generate_text(text2, "I", 25)

## Trigrams

Let's try a more advanced model based on tri-grams to see if it generates better language.

We cannot use `ConditionalFreqDist` exactly as before: it expects its data as a sequence of `(condition, item)` tuples, whereas `nltk.trigrams` returns tuples of length 3. Therefore, we have to adapt the trigram output so that the first two words form the condition and the third word is the item.

In [None]:
def generate_text(text, initialword, numwords):
    trigrams = nltk.ngrams(text, 3, pad_right=True, pad_left=True)
    trigram_pairs = (((w0, w1), w2) for w0, w1, w2 in trigrams) # Adapt the format to use ConditionalFreqDist
    cpd = nltk.ConditionalProbDist(nltk.ConditionalFreqDist(trigram_pairs), nltk.MLEProbDist)

    words = list(initialword)  # copy, so the caller's list is not modified
    for _ in range(numwords):
        # Condition on the last two words generated so far
        w = cpd[tuple(words[-2:])].generate()
        if w is None:  # padding symbol: the end of the training text was reached
            break
        words.append(w)

    print(" ".join(words))

In [None]:
generate_text(text2, ["I", "am"], 25)

I am so sorry we cannot stay here long , and put on a twilight walk to the bustle , and of whose success he was affected


As expected, the tri-gram model generates more fluent text.

Can we go further with larger n-grams? Let's see.

## N-grams

We are going to update the `generate_text` function again to create a language model based on 4-grams.


In [None]:
def generate_text(text, initialword, numwords):
    ngrams = nltk.ngrams(text, 4, pad_right=True, pad_left=True)
    ngram_pairs = (((w0, w1, w2), w3) for w0, w1, w2, w3 in ngrams)  # (condition, item) pairs
    cpd = nltk.ConditionalProbDist(nltk.ConditionalFreqDist(ngram_pairs), nltk.MLEProbDist)

    words = list(initialword)  # copy, so the caller's list is not modified
    for _ in range(numwords):
        # Condition on the last three words generated so far
        w = cpd[tuple(words[-3:])].generate()
        if w is None:  # padding symbol: the end of the training text was reached
            break
        words.append(w)

    print(" ".join(words))

In [None]:
generate_text(text2, ["I", "am", "very"], 25)

I am very sure that Colonel Brandon has not the smallest chance . As soon , however , towards that unfortunate girl -- I must say it ,


As we make the n-grams larger, we get more fluent text. However, as explained in class, large n-grams run into data sparsity: most long word sequences occur only once or never in the corpus, so we never see enough examples of each n-gram to estimate reliable probabilities, and the model ends up reproducing passages of the training text.
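The sparsity effect can be observed even on a tiny example. The sketch below, in plain Python with a hypothetical toy text, measures the share of distinct n-grams that occur exactly once as `n` grows; the name `singleton_share` is introduced here for illustration.

```python
from collections import Counter

# Hypothetical toy text; a real experiment would use brown.words() or a book
tokens = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat saw the dog .").split()

singleton_share = {}
for n in range(2, 5):
    # Slide a window of size n over the tokens and count each n-gram
    counts = Counter(zip(*(tokens[i:] for i in range(n))))
    singleton_share[n] = sum(c == 1 for c in counts.values()) / len(counts)
    print(f"n={n}: {len(counts)} distinct n-grams, "
          f"{singleton_share[n]:.0%} seen only once")
```

When nearly every context has been seen a single time, the model has only one possible continuation for it, so generation degenerates into copying the training text.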

As an exercise, I leave it up to you to extend this LM to 5-grams, 6-grams, and so on.
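As a possible starting point for that exercise, here is a sketch of a generator parameterized by `n`. It is written in plain Python rather than with NLTK so that it runs without any corpus; the helper `build_model`, the `seed` parameter, and the toy text are assumptions made for illustration, not part of the notebook's code.

```python
import random
from collections import Counter, defaultdict

def build_model(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def generate_text(tokens, initial_words, numwords, n, seed=None):
    """Generate up to `numwords` words after `initial_words` with an n-gram model."""
    if n < 2 or len(initial_words) != n - 1:
        raise ValueError("need n >= 2 and exactly n - 1 initial words")
    model = build_model(tokens, n)
    rng = random.Random(seed)
    words = list(initial_words)
    for _ in range(numwords):
        followers = model.get(tuple(words[-(n - 1):]))
        if not followers:          # context never seen in training: stop early
            break
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

# Hypothetical toy text; with NLTK you could pass a book such as text2 instead
toy = "I am very sure that I am very glad that I am here".split()
print(generate_text(toy, ["I", "am", "very", "sure"], 5, n=5, seed=0))
```

On this toy text every 4-word context has a single continuation, so the 5-gram model can only replay the training sentence, which is the sparsity problem described above in miniature.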
