# N-gram Language Models

This notebook builds an n-gram language model and applies it to sentence auto-completion.

N-gram language models ([Wikipedia](https://en.wikipedia.org/wiki/N-gram)) have applications in auto-complete, speech recognition, spelling correction, and augmentative communication.

An n-gram model is a probabilistic language model: the probability of a sequence of n words is estimated from how often it occurs in the training corpus.

The probability of a word following an n-word sequence, P(word | n-word sequence), is the number of times the n-word sequence followed by that word is seen in the corpus, divided by the number of times the n-word sequence itself is seen in the corpus. To make these calculations fast, hash tables keyed by every n-gram and (n+1)-gram in the corpus are built and kept in memory. This notebook shows how this can be done efficiently. To handle sequences that never appear in the training data, a trick called smoothing assigns them a low probability instead of a zero probability or a division by zero.
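
As a minimal sketch of the ratio described above, the toy example below estimates P(word | previous word) from raw counts. The three-sentence corpus and the helper name `mle_probability` are made up for illustration and are not part of this notebook's pipeline. No smoothing is applied here, so an unseen context simply returns 0.

```python
from collections import Counter

# Toy corpus, made up for illustration (already tokenized).
corpus = [
    ["i", "like", "a", "cat"],
    ["i", "like", "a", "dog"],
    ["i", "like", "dogs"],
]

# Count every unigram and every pair of adjacent tokens (bigram).
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update((token,) for token in sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))


def mle_probability(word, previous_word):
    """Hypothetical helper: count(previous_word, word) / count(previous_word)."""
    context_count = unigram_counts[(previous_word,)]
    if context_count == 0:
        return 0.0  # unseen context; smoothing is the proper fix for this case
    return bigram_counts[(previous_word, word)] / context_count


# "a" occurs 2 times and "a cat" occurs 1 time.
print(mle_probability("cat", "a"))  # 0.5
# "like" occurs 3 times and "like dog" never occurs.
print(mle_probability("dog", "like"))  # 0.0
```

The zero in the second call is exactly the situation that smoothing is meant to soften.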



## Loading necessary libraries
nltk is used for tokenizing the texts.

In [1]:
import math
import random
import numpy as np
import pandas as pd
import nltk
import re
nltk.download('punkt')

nltk.data.path.append('.')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Loading and visualizing the training dataset

In [2]:
filename = "./data/en_US.twitter.txt"

# Read the whole corpus; the context manager closes the file afterwards.
with open(filename, encoding="utf8") as file:
    data = file.read()

print("Data type:", type(data))
print("Number of letters:", len(data))
print("First 300 letters of the data")
print("-------")
display(data[0:300])
print("-------")


Data type: <class 'str'>
Number of letters: 3335477
First 300 letters of the data
-------


"How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A "

-------


## Data preprocessing

Data preprocessing includes:
- Splitting the corpus into sentences (here, one per line of the file)
- Tokenizing each sentence
- Splitting the data into train and test sets (80% train and 20% test)
- Finding the words that are frequent enough to keep (e.g. only words appearing at least twice in the training set)
- Replacing all other words with an unknown token


A function is created for each of these steps, and the corpus is passed through them in turn.

In [3]:
def split_to_sentences(data):
    sentences = data.split('\n')
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]
    return sentences  
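
A small usage sketch of the splitting step, on a made-up sample string rather than the Twitter data. The function is restated in a compact form so that the block runs on its own.

```python
def split_to_sentences(data):
    """Split raw text on newlines, dropping surrounding whitespace and empty lines."""
    sentences = [s.strip() for s in data.split("\n")]
    return [s for s in sentences if len(s) > 0]


# Made-up sample with a blank line and stray whitespace.
sample = "How are you?  \n\n  Been way, way too long.\n"
print(split_to_sentences(sample))
# ['How are you?', 'Been way, way too long.']
```

Note that a "sentence" here is a whole line of the file, so a line containing several grammatical sentences is still treated as one unit.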

In [4]:
def tokenize_sentences(sentences):
    tokenized_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()
        tokenized = nltk.word_tokenize(sentence)
        tokenized_sentences.append(tokenized)
    return tokenized_sentences

In [5]:
def get_tokenized_data(data):
    sentences = split_to_sentences(data)
    tokenized_sentences = tokenize_sentences(sentences)
    return tokenized_sentences

In [6]:
tokenized_data = get_tokenized_data(data)
random.shuffle(tokenized_data)
train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]

In [7]:
def count_words(tokenized_sentences):
    word_counts = {}
    for sentence in tokenized_sentences:
        for token in sentence: 
            if token not in word_counts: 
                word_counts[token] = 1
            else:
                word_counts[token] += 1
    return word_counts

In [8]:
def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    # Keep only the words that appear at least `count_threshold` times.
    word_counts = count_words(tokenized_sentences)
    closed_vocab = [word for word, cnt in word_counts.items() if cnt >= count_threshold]
    return closed_vocab

In [9]:
def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    vocabulary = set(vocabulary)
    replaced_tokenized_sentences = []
    for sentence in tokenized_sentences:
        replaced_sentence = []
        for token in sentence: 
            if token in vocabulary: 
                replaced_sentence.append(token)
            else:
                replaced_sentence.append(unknown_token)
        replaced_tokenized_sentences.append(replaced_sentence)
    return replaced_tokenized_sentences

In [10]:
def preprocess_data(train_data, test_data, count_threshold, unknown_token="<unk>", get_words_with_nplus_frequency=get_words_with_nplus_frequency, replace_oov_words_by_unk=replace_oov_words_by_unk):
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)
    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary, unknown_token)
    test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary, unknown_token)
    return train_data_replaced, test_data_replaced, vocabulary

In [11]:
minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, 
                                                                        test_data, 
                                                                        minimum_freq)
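
The vocabulary and unknown-token steps can be sketched on a tiny made-up dataset as follows. The data and the helper name `replace_oov` are assumptions for illustration. The point to notice is that the vocabulary is built from the training sentences only and is then applied to both the training and the test sentences.

```python
from collections import Counter

# Made-up tokenized data for illustration.
train = [
    ["sky", "is", "blue", "."],
    ["leaves", "are", "green", "."],
    ["roses", "are", "red", "."],
]
test = [["sky", "are", "purple", "."]]

# Closed vocabulary: words seen at least twice in the training data.
counts = Counter(token for sentence in train for token in sentence)
vocabulary = {word for word, cnt in counts.items() if cnt >= 2}


def replace_oov(sentences, vocabulary, unknown_token="<unk>"):
    """Hypothetical helper: replace out-of-vocabulary tokens with unknown_token."""
    return [
        [token if token in vocabulary else unknown_token for token in sentence]
        for sentence in sentences
    ]


print(sorted(vocabulary))
# ['.', 'are']
print(replace_oov(test, vocabulary))
# [['<unk>', 'are', '<unk>', '.']]
```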

## Creating n-gram language model

The language model is built on counts of how many times each n-gram and each (n+1)-gram is seen. The probability of a word following an n-gram is then calculated as the ratio described above.

A function is created that builds a dictionary of counts for all n-grams seen in the data.

Two other functions calculate the probability of a single word and of every word in the vocabulary. Add-k smoothing is used: a constant k is added to every (n+1)-gram count, and k times the vocabulary size is added to the n-gram count, so that unseen n-grams or (n+1)-grams get a small probability rather than zero or a division by zero.
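
Written out, the smoothed estimate computed by the functions in this section is the add-k form below. The notation is chosen here for illustration: $C(\cdot)$ is a count taken from the training data, $w_{t-n}, \dots, w_{t-1}$ is the previous n-gram, $w_t$ is the candidate word, $|V|$ is the vocabulary size, and $k$ is the smoothing constant.

$$
\hat{P}(w_t \mid w_{t-n}, \dots, w_{t-1}) = \frac{C(w_{t-n}, \dots, w_{t-1}, w_t) + k}{C(w_{t-n}, \dots, w_{t-1}) + k\,|V|}
$$

With $k = 0$ this reduces to the plain ratio of counts. When the previous n-gram was never seen, both counts are zero and the estimate becomes $1/|V|$, a uniform distribution over the vocabulary.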

In [12]:
def count_n_grams(data, n, start_token='<s>', end_token = '<e>'):
    n_grams = {}
    
    for sentence in data: 
        sentence = [start_token] * n + sentence + [end_token]
        sentence = tuple(sentence)
        
        for i in range(len(sentence) - n + 1): 
            n_gram = sentence[i:i + n]
            if n_gram in n_grams: 
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1
                
    return n_grams
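
To see what the count table looks like, here is a small sketch on two made-up sentences. The function is restated compactly (using `dict.get`) so that the block runs on its own; the counting logic is intended to be the same as above.

```python
def count_n_grams(data, n, start_token="<s>", end_token="<e>"):
    """Count n-grams after padding each sentence with n start tokens and one end token."""
    n_grams = {}
    for sentence in data:
        padded = tuple([start_token] * n + sentence + [end_token])
        for i in range(len(padded) - n + 1):
            n_gram = padded[i:i + n]
            n_grams[n_gram] = n_grams.get(n_gram, 0) + 1
    return n_grams


# Made-up tokenized sentences for illustration.
sentences = [
    ["i", "like", "a", "cat"],
    ["this", "dog", "is", "like", "a", "cat"],
]

bigram_counts = count_n_grams(sentences, 2)
print(bigram_counts[("like", "a")])   # 2
print(bigram_counts[("<s>", "<s>")])  # 2, one per sentence because of the padding
print(bigram_counts[("cat", "<e>")])  # 2
```

Padding with n start tokens means that even the first real word of a sentence has a full n-token context, and the end token lets the model learn where sentences tend to stop.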

In [13]:
def estimate_probability(word, previous_n_gram, 
                         n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    
    previous_n_gram = tuple(previous_n_gram)
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
    denominator = previous_n_gram_count + k * vocabulary_size
    n_plus1_gram = previous_n_gram + (word, )
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + k
    probability = numerator / denominator
    
    return probability

In [14]:
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, end_token='<e>', unknown_token="<unk>",  k=1.0):
    previous_n_gram = tuple(previous_n_gram)    
    vocabulary = vocabulary + [end_token, unknown_token]    
    vocabulary_size = len(vocabulary)    
    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram, 
                                           n_gram_counts, n_plus1_gram_counts, 
                                           vocabulary_size, k=k)
        probabilities[word] = probability
    return probabilities
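
The following sketch checks the smoothed estimate by hand on two made-up sentences. It restates compact versions of the counting and probability functions so that it runs on its own; the toy data and the choice of context are assumptions for illustration.

```python
def count_n_grams(data, n, start_token="<s>", end_token="<e>"):
    """Count n-grams after padding each sentence with n start tokens and one end token."""
    n_grams = {}
    for sentence in data:
        padded = tuple([start_token] * n + sentence + [end_token])
        for i in range(len(padded) - n + 1):
            n_gram = padded[i:i + n]
            n_grams[n_gram] = n_grams.get(n_gram, 0) + 1
    return n_grams


def estimate_probability(word, previous_n_gram, n_gram_counts,
                         n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Add-k estimate: (count(context + word) + k) / (count(context) + k * |V|)."""
    previous_n_gram = tuple(previous_n_gram)
    numerator = n_plus1_gram_counts.get(previous_n_gram + (word,), 0) + k
    denominator = n_gram_counts.get(previous_n_gram, 0) + k * vocabulary_size
    return numerator / denominator


sentences = [
    ["i", "like", "a", "cat"],
    ["this", "dog", "is", "like", "a", "cat"],
]
unique_words = sorted({token for sentence in sentences for token in sentence})  # 7 words
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

# P(cat | a) with k = 1 and |V| = 7: (2 + 1) / (2 + 1 * 7) = 3 / 9
p = estimate_probability("cat", ["a"], unigram_counts, bigram_counts,
                         len(unique_words), k=1.0)
print(round(p, 4))  # 0.3333

# Over the full candidate set (words plus end and unknown tokens), the
# smoothed probabilities for one context add up to 1.
candidates = unique_words + ["<e>", "<unk>"]
total = sum(
    estimate_probability(w, ["a"], unigram_counts, bigram_counts, len(candidates))
    for w in candidates
)
print(round(total, 6))  # 1.0
```

The second check only holds when the vocabulary size passed in matches the set of candidate words being summed over, which is why `estimate_probabilities` extends the vocabulary before computing its length.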

## Train the model

Training an n-gram model requires count tables for both n-grams and (n+1)-grams.
To test the model for different values of n, the counts are calculated for n = 1 to 5, which gives n-gram language models for n = 1 to n = 4.

In [15]:
n_gram_counts_list = []
n_max = 4
for n in range(1, n_max + 2):
    print("Computing n-gram counts with n =", n, "...")
    n_model_counts = count_n_grams(train_data_processed, n)
    n_gram_counts_list.append(n_model_counts)

Computing n-gram counts with n = 1 ...
Computing n-gram counts with n = 2 ...
Computing n-gram counts with n = 3 ...
Computing n-gram counts with n = 4 ...
Computing n-gram counts with n = 5 ...


## Auto complete

Using the calculated n-grams, we can make suggestions for the next word.

The functions below do this by iterating over all the words in the vocabulary, calculating the probability of each given the previous tokens, and returning the word with the highest probability.

The first function receives two n-gram count tables and returns the most likely word under that model. The second function passes each pair of consecutive n-gram tables to the first one and collects one suggested word per model.

In [16]:
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, end_token='<e>', unknown_token="<unk>", k=1.0, start_with=None):
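    # n is the length of the keys in the n-gram count table
    n = len(next(iter(n_gram_counts)))
    # Only the last n tokens are used as context
    previous_n_gram = previous_tokens[-n:]
    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, end_token=end_token,
                                           unknown_token=unknown_token, k=k)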
    suggestion = None
    max_prob = 0
    for word, prob in probabilities.items(): 
        if start_with is not None: 
            if not word.startswith(start_with): 
                continue
        if prob > max_prob: 
            suggestion = word
            max_prob = prob
    return suggestion, max_prob

In [17]:
def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    model_counts = len(n_gram_counts_list)
    suggestions = []
    for i in range(model_counts-1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i+1]
        
        suggestion = suggest_a_word(previous_tokens, n_gram_counts,
                                    n_plus1_gram_counts, vocabulary,
                                    k=k, start_with=start_with)
        suggestions.append(suggestion)
    return suggestions
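
The functions above return a single best word per model. If several candidates are wanted instead, one possible variation is sketched below. The helper name `top_n_suggestions` and the probability values are made up for illustration; the dictionary stands in for the output of `estimate_probabilities`.

```python
import heapq


def top_n_suggestions(probabilities, top_n=3, start_with=None):
    """Hypothetical helper: the top_n most probable words, optionally filtered by prefix."""
    candidates = (
        (word, prob)
        for word, prob in probabilities.items()
        if start_with is None or word.startswith(start_with)
    )
    # nlargest sorts by probability, highest first.
    return heapq.nlargest(top_n, candidates, key=lambda pair: pair[1])


# Made-up probabilities for illustration.
probabilities = {"doing": 0.30, "do": 0.25, "?": 0.20, "day": 0.15, "<e>": 0.10}

print(top_n_suggestions(probabilities, top_n=2))
# [('doing', 0.3), ('do', 0.25)]
print(top_n_suggestions(probabilities, top_n=2, start_with="da"))
# [('day', 0.15)]
```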

## Testing the model

Now we can use these functions to get suggestions for the next word given the previous sequence (an incomplete sentence).

Suggestions are shown for different values of n, where n is the number of previous words the model looks at.

In [18]:
my_sentence = "hey how are you"
my_sentence_tokenized = get_tokenized_data(my_sentence)[0]

suggestions = get_suggestions(my_sentence_tokenized, n_gram_counts_list, vocabulary, k=1.0)

print("The previous words are {}".format(my_sentence_tokenized))
print("The suggestions are:")
for i in range(1, n_max + 1, 1):
    print("{}-gram model: {}".format(i, suggestions[i - 1]))
    

The previous words are ['hey', 'how', 'are', 'you']
The suggestions are:
1-gram model: ("'re", 0.02392422634114477)
2-gram model: ('?', 0.0025874079479864657)
3-gram model: ('?', 0.0014236322961155175)
4-gram model: ('<e>', 0.0001360451669954425)


Here, it is interesting to see that the 1-gram model suggests 're. It looks only at the previous word, and the most frequent word after you appears to be 're, which gives you're. Models with higher n predict differently because they look at more words back in the sequence.

It is also possible to suggest only words that start with certain characters. This is useful, for example, when the user is still typing and has entered part of the next word.


In [19]:
my_sentence = "hey how are you"
my_sentence_tokenized = get_tokenized_data(my_sentence)[0]

#the user has entered 'd'
user_input = "d"
suggestions = get_suggestions(my_sentence_tokenized, n_gram_counts_list, vocabulary, k=1.0, start_with=user_input)

print("The previous words are {}".format(my_sentence_tokenized))
print("The suggestions are:")
for i in range(1, n_max + 1, 1):
    print("{}-gram model: {}".format(i, suggestions[i - 1]))

The previous words are ['hey', 'how', 'are', 'you']
The suggestions are:
1-gram model: ('do', 0.008859312484690128)
2-gram model: ('doing', 0.0013932196643004046)
3-gram model: ('doing', 0.0004067520846044336)
4-gram model: ('day', 6.802258349772124e-05)


As shown, the heart of an n-gram language model is calculating the count tables for all encountered n-grams and (n+1)-grams. Although this is a simple model that can be implemented and used quickly, memory consumption can become very large with a big training set.

Sequence-based language models can address this problem.