Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models estimate the probability of each word given the *n-1* previous words.
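In symbols, a sketch of the usual count-based (maximum-likelihood) estimate is shown below. The notation $C(\cdot)$ for the number of times a word sequence occurs in the corpus, and $N$ for the total number of tokens, is introduced here for illustration only.

$$
P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \approx \frac{C(w_{i-n+1} \ldots w_{i-1}\, w_i)}{C(w_{i-n+1} \ldots w_{i-1})}
$$

For *n = 1* there is no context, and the estimate reduces to the relative frequency $P(w_i) \approx C(w_i) / N$.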

To "train" such models, we will use the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling 1.3 million words.

In [1]:
from nltk.corpus import reuters

## Unigram model

For starters, let's build a unigram language model.

In [3]:
from collections import defaultdict

# Create a placeholder for the model
model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        model[w] += 1

If this cell raises a `LookupError` reporting that the `punkt` resource was not found, download the tokenizer data as the error message suggests (`import nltk`, then `nltk.download('punkt')`) and run the cell again. For more information see: https://www.nltk.org/data.html


Now that we have the counts, we need to transform them into probabilities:

In [None]:
total_count = float(sum(model.values()))
for w in model:
    model[w] /= total_count
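
To see what this normalization does, here is a minimal sketch on a hypothetical six-token toy corpus. The names `toy_tokens`, `toy_counts` and `toy_model` are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration
toy_tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Count each token, then divide by the total number of tokens
toy_counts = Counter(toy_tokens)
total = sum(toy_counts.values())
toy_model = {w: c / total for w, c in toy_counts.items()}

print(toy_model["the"])         # relative frequency of 'the'
print(sum(toy_model.values()))  # should be (approximately) 1.0
```

After normalization the values form a probability distribution, so they sum to 1 up to floating-point rounding.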

#### Likely words

How likely is the word 'the'?

In [None]:
# your code here


What is the most likely word in the corpus?

In [None]:
# your code here
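
As a hint, one possible way to query a unigram model is sketched below on a hypothetical toy dictionary (not the Reuters model): a direct lookup gives the probability of a word, and `max` with a `key` function gives the most likely one.

```python
# Hypothetical unigram probabilities (not derived from Reuters)
toy_model = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.2}

# Probability of a single word; .get returns 0.0 for unseen words
p_the = toy_model.get("the", 0.0)

# Most likely word: the key with the largest probability
most_likely = max(toy_model, key=toy_model.get)

print(p_the, most_likely)
```

Note that with a `defaultdict(int)`, as used in the notebook, `model[w]` for an unseen word returns 0 and also adds that key to the dictionary, whereas `.get` leaves the dictionary untouched.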


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [None]:
import random

# number of words to generate
total_words = 100
text = []

for _ in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select the first word whose cumulative probability reaches the threshold
    accumulator = 0.0
    for word in model:
        accumulator += model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))
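
The loop above walks the cumulative distribution until it reaches a random threshold. As an alternative sketch (not the notebook's method), the standard library's `random.choices` draws weighted samples directly. The toy model below is hypothetical.

```python
import random

# Hypothetical unigram probabilities (not derived from Reuters)
toy_model = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.2}

random.seed(0)  # fixed seed only to make the sketch reproducible

# Draw 10 words, each in proportion to its probability
words = list(toy_model.keys())
weights = list(toy_model.values())
sample = random.choices(words, weights=weights, k=10)

print(' '.join(sample))
```

Either way, each word is drawn independently of the others, which is why unigram text has no local coherence.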

## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and on the right, and define our own sequence-start and sequence-end symbols.

We first need to obtain the counts:

In [None]:
from nltk import bigrams

# Create a placeholder for the model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        model[w1][w2] += 1
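
To make the effect of padding concrete, the sketch below builds padded bigrams by hand for a hypothetical toy sentence. It is meant to mirror what `bigrams` yields with the padding arguments used above, under the assumption that one start symbol and one end symbol are added; it is not a replacement for the NLTK function.

```python
# Hypothetical toy sentence, already tokenized
toy_sentence = ["prices", "rose", "today"]

# Add one start symbol and one end symbol
padded = ["<s>"] + toy_sentence + ["</s>"]

# Pair each token with its successor
toy_bigrams = list(zip(padded, padded[1:]))

print(toy_bigrams)
# [('<s>', 'prices'), ('prices', 'rose'), ('rose', 'today'), ('today', '</s>')]
```

The pairs that contain `<s>` and `</s>` are what let the model learn which words tend to start and end a sentence.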

As before, we need to transform counts into probabilities. To do so, we divide each bigram count by the total count of bigrams that share the same first word.

In [None]:
# your code here
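
As a hint, the per-context normalization can be sketched on a handful of hypothetical bigram counts. The names `toy_counts` and `toy_probs` are illustrative, and this is one possible approach rather than the expected solution.

```python
from collections import defaultdict

# Hypothetical bigram counts: toy_counts[w1][w2] = times w2 followed w1
toy_counts = defaultdict(lambda: defaultdict(int))
for w1, w2 in [("<s>", "the"), ("the", "cat"), ("the", "mat"), ("the", "cat")]:
    toy_counts[w1][w2] += 1

# Divide each count by the total count of its context word w1
toy_probs = {}
for w1, followers in toy_counts.items():
    context_total = sum(followers.values())
    toy_probs[w1] = {w2: c / context_total for w2, c in followers.items()}

print(toy_probs["the"])
```

Each inner dictionary is a separate probability distribution over the words that can follow `w1`, so each one sums to 1.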


#### Likely pairs

What are the probabilities of each word following 'today'?

In [None]:
# your code here


What are the probabilities for sentence-starting words? What do most of them have in common?

In [None]:
# your code here
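
As a hint for both questions, the followers of a context word can be listed from most to least probable by sorting the inner dictionary. The probabilities below are hypothetical and were not computed from Reuters.

```python
# Hypothetical conditional probabilities P(w2 | w1)
toy_probs = {
    "today": {"that": 0.5, ",": 0.3, ".": 0.2},
    "<s>": {"The": 0.7, "It": 0.3},
}

# Followers of a context word, most probable first
ranked = sorted(toy_probs["today"].items(), key=lambda kv: kv[1], reverse=True)

for word, prob in ranked:
    print(f"{word}\t{prob:.2f}")
```

Using the start symbol `'<s>'` as the context gives the distribution over sentence-starting words.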


#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [None]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous word text[-1]
    # your code here
    

print(' '.join(t for t in text if t))
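
As a hint, a complete generation loop is sketched below on a tiny hypothetical bigram model. It uses `random.choices` instead of the threshold-and-accumulator approach suggested in the cell above, so treat it as an alternative rather than the expected solution.

```python
import random

# Hypothetical bigram model: toy_probs[w1][w2] = P(w2 | w1)
toy_probs = {
    "<s>": {"prices": 0.5, "shares": 0.5},
    "prices": {"rose": 1.0},
    "shares": {"rose": 0.5, "fell": 0.5},
    "rose": {"</s>": 1.0},
    "fell": {"</s>": 1.0},
}

random.seed(1)  # fixed seed only for reproducibility

text = ["<s>"]
while text[-1] != "</s>":
    followers = toy_probs[text[-1]]
    # Draw the next word in proportion to its conditional probability
    next_word = random.choices(list(followers), weights=list(followers.values()))[0]
    text.append(next_word)

# Drop the start and end symbols before printing
print(' '.join(text[1:-1]))
```

The loop is guaranteed to stop here because every path in the toy model reaches `</s>`; on a real corpus the same holds as long as every sentence was padded on the right.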

## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [None]:
# your code here
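
As a hint, the only structural change from the bigram model is that the context becomes a pair of words. The sketch below builds padded trigrams by hand on two hypothetical sentences, assuming two padding symbols on each side (which is what a trigram needs so that every word has a two-word context). The names are illustrative.

```python
from collections import defaultdict

# Hypothetical toy corpus of tokenized sentences
toy_sents = [["prices", "rose", "today"], ["prices", "rose", "sharply"]]

# toy_model[(w1, w2)][w3] = times w3 followed the pair (w1, w2)
toy_model = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = ["<s>", "<s>"] + sent + ["</s>", "</s>"]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        toy_model[(w1, w2)][w3] += 1

# Normalize each context's counts into probabilities
for context in toy_model:
    total = sum(toy_model[context].values())
    for w3 in toy_model[context]:
        toy_model[context][w3] /= total

print(dict(toy_model[("prices", "rose")]))
```

Keying the outer dictionary by a tuple keeps the lookup `toy_model[(w1, w2)][w3]` close to the bigram form `model[w1][w2]`.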


#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [None]:
# your code here


#### Generating text

Create your text generator based on the trigram model. Does the generated text start to read a bit more naturally?

In [None]:
# your code here


## N-gram models

For larger *n*, we can use NLTK's [ngrams](https://www.nltk.org/_modules/nltk/util.html#ngrams) function, which accepts an arbitrary *n*.

Create your own 4-gram model.

In [None]:
# your code here
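
As a hint, the counting and normalization steps generalize to any *n* once the context is stored as a tuple of *n-1* words. The helper `build_ngram_model` below is hypothetical (it is not an NLTK function) and pads by hand instead of calling `ngrams`.

```python
from collections import defaultdict

def build_ngram_model(sentences, n):
    """Hypothetical helper: map each (n-1)-word context to P(next word | context)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        # n-1 padding symbols on each side
        padded = ["<s>"] * (n - 1) + list(sent) + ["</s>"] * (n - 1)
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    # Normalize the counts of each context into probabilities
    return {
        context: {w: c / sum(followers.values()) for w, c in followers.items()}
        for context, followers in counts.items()
    }

# Hypothetical toy corpus
toy_sents = [["today", "the", "public", "voted"],
             ["today", "the", "public", "waited"]]
toy_model = build_ngram_model(toy_sents, 4)

print(toy_model[("today", "the", "public")])
```

With larger *n* the number of distinct contexts grows quickly, so many contexts will have been seen only once or twice in the corpus.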


#### Likely tuples

Check the most likely words following "today the public".

In [None]:
# your code here


#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [None]:
# your code here
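
As a hint, the generation loop is the same for any *n*: start from *n-1* start symbols and always condition on the last *n-1* generated words. The sketch below uses a hypothetical hand-written model with *n = 3* for brevity, and `random.choices` for the weighted draw.

```python
import random

# Hypothetical 3-gram model: context tuple -> {next word: probability}
toy_model = {
    ("<s>", "<s>"): {"prices": 1.0},
    ("<s>", "prices"): {"rose": 0.5, "fell": 0.5},
    ("prices", "rose"): {"</s>": 1.0},
    ("prices", "fell"): {"</s>": 1.0},
}
n = 3

random.seed(2)  # fixed seed only for reproducibility

text = ["<s>"] * (n - 1)
while text[-1] != "</s>":
    # The context is the last n-1 words generated so far
    context = tuple(text[-(n - 1):])
    followers = toy_model[context]
    text.append(random.choices(list(followers), weights=list(followers.values()))[0])

# Remove padding symbols before printing
print(' '.join(w for w in text if w not in ("<s>", "</s>")))
```

For the 4-gram exercise, the same loop applies with `n = 4` and a model keyed by three-word tuples.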
