## TI3160TU: Natural Language Processing - Ngram models Lab

In this hands-on lab, we will explore Ngram models, a simple form of language model (LM). They assign probabilities to sequences of words based on counts from a large corpus of data. In this lab, we will focus on three tasks related to Ngram models:

1. **Extracting Ngrams from a corpus of data**
2. **Calculating Ngram probabilities**
3. **Generating text using Ngram models**

### 0. Loading Dataset

In [16]:
# we need the library json as the reddit data is stored in line-delimited json objects
# (one json object in each line, with each line representing a Reddit comment)
import json

# function to load all comment data into a list of strings
# Input: the path of the file including our data
# Output: a list of strings including the body of the Reddit comments
def load_reddit_comment_data(data_directory):

    comments_data = [] # list object that will store the loaded Reddit comments

    # we first open the file that includes our dataset
    with open(data_directory, 'r', encoding='utf-8') as f:
        # iterate the file, reading it line by line
        for line in f:
            # parse the JSON object on this line into a Python dictionary
            data = json.loads(line)

            # append the comment if not removed
            if data['body']!="[removed]":
                comments_data.append(data['body'])

    # the method returns all the loaded Reddit comments
    return comments_data

# our data is stored in this file
data_dir = './comments_TUDelft.ndjson'
# let's load our dataset into memory
reddit_data = load_reddit_comment_data(data_dir)
print("Successfully loaded Reddit comments! Our dataset includes %d Reddit comments!" %len(reddit_data))

Successfully loaded Reddit comments! Our dataset includes 2215 Reddit comments!


### 1. Extracting Ngrams from a corpus of data

We start by using the NLTK Python library to extract Ngrams from our Reddit dataset. First, we need to preprocess the dataset:
1. Convert everything to lowercase
2. Remove links
3. Remove punctuation
4. Tokenize posts into words
5. Remove stopwords

After preprocessing the dataset, we will extract the Ngrams (for n = 2, 3, and 4) and count which ones occur most often.
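
To make the idea of Ngram extraction concrete before using NLTK, here is a minimal sketch that uses only the standard library. The helper `extract_ngrams` and the toy token list are hypothetical and only for illustration; the lab itself relies on `nltk.util.ngrams`.

```python
# Illustrative only: a tiny stand-in for Ngram extraction on a made-up token list.
def extract_ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_tokens = ["tu", "delft", "first", "year", "material"]  # made-up example tokens

toy_bigrams = extract_ngrams(toy_tokens, 2)
toy_trigrams = extract_ngrams(toy_tokens, 3)
print(toy_bigrams)
print(toy_trigrams)
```

A document with fewer than n tokens simply yields no Ngrams, which is why very short comments contribute nothing to the trigram and fourgram counts.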

In [17]:
import nltk
import re
import pandas as pd
from nltk.util import ngrams
from collections import Counter, defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
stop_words = set(stopwords.words('english'))

# function to preprocess the Reddit comments
# Input: a string that includes a text corresponding to a Reddit comment
# Output: a string with the preprocessed Reddit comment
def preprocess(text):
    text = text.lower()  # convert text to lower-case
    text = re.sub('&gt;', '', text) # remove the HTML entity &gt; (it corresponds to >)
    text = re.sub('&amp;', '', text) # remove the HTML entity &amp; (it corresponds to &)
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace into a single space using regex
    text = re.sub(r'\[[^]]*\]', '', text)  # remove text in square brackets
    text = re.sub(r'http\S+', '', text)  # remove URLs
    text = re.sub(r'[^a-z0-9\s]', '', text)  # remove punctuation (keep only lowercase letters, digits, and whitespace)
    return text

# run our function to preprocess all comments
preprocessed_comments = [preprocess(comment) for comment in reddit_data]

# Tokenize the comments and remove the stopwords
all_words = [] # list of lists holding our dataset (each list corresponds to a comment and it includes the tokenized words)
for comment in preprocessed_comments:
    # tokenize the comments and remove stopwords
    all_words.append([ w for w in word_tokenize(comment) if w not in stop_words])

# calculate all ngrams for n=2, 3, and 4
all_bigrams = []
all_trigrams = []
all_fourgrams = []

# for each document
for doc in all_words:
    # calculate all ngrams with size 2 (i.e., bigrams) and then store them in our list holding all bigrams
    all_bigrams.extend(list(ngrams(doc, 2)))
    all_trigrams.extend(list(ngrams(doc, 3)))
    all_fourgrams.extend(list(ngrams(doc, 4)))

# we use the Counter class from Collections to find the top N most occurring Ngrams in our dataset
top_bigrams = Counter(all_bigrams).most_common(10)
top_trigrams = Counter(all_trigrams).most_common(10)
top_fourgrams = Counter(all_fourgrams).most_common(10)

# let's present the most frequent Ngrams in a nice table using Pandas
top_bigrams_df = pd.DataFrame(top_bigrams, columns=['Bigram', '#']) # create DataFrame for bigrams
top_trigrams_df = pd.DataFrame(top_trigrams, columns=['Trigram', '#']) # create DataFrame for trigrams
top_fourgrams_df = pd.DataFrame(top_fourgrams, columns=['Fourgram', '#']) # create DataFrame for fourgrams
ngrams_df = pd.concat([top_bigrams_df, top_trigrams_df, top_fourgrams_df], axis=1) # concatenate all into a single DataFrame

print("The top 10 most popular ngrams in our dataset are...")
ngrams_df




The top 10 most popular ngrams in our dataset are...


| | Bigram | # | Trigram | # | Fourgram | # |
|---|---|---|---|---|---|---|
| 0 | (tu, delft) | 124 | (first, year, material) | 10 | (education, student, affairs, esa) | 3 |
| 1 | (first, year) | 56 | (tu, delft, website) | 8 | (first, come, first, serve) | 3 |
| 2 | (dont, know) | 48 | (youre, gon, na) | 7 | (systematic, reasoning, logical, thinking) | 3 |
| 3 | (good, luck) | 42 | (high, school, diploma) | 6 | (see, delft, next, year) | 3 |
| 4 | (entrance, exam) | 33 | (grades, dont, matter) | 5 | (get, diploma, required, subjects) | 2 |
| 5 | (next, year) | 33 | (im, pretty, sure) | 5 | (mathematics, a2, mechanics, physics) | 2 |
| 6 | (thank, much) | 33 | (algorithms, data, structures) | 5 | (a2, mechanics, physics, a2) | 2 |
| 7 | (selection, procedure) | 26 | (hl, math, physics) | 5 | (contact, education, student, affairs) | 2 |
| 8 | (high, school) | 26 | (mathematics, a2, mechanics) | 4 | (710, decent, score, motivational) | 2 |
| 9 | (civil, engineering) | 25 | (education, student, affairs) | 4 | (decent, score, motivational, test) | 2 |


### 2. Calculating Ngram probabilities

In [18]:
bigram_model = defaultdict(lambda: defaultdict(lambda: 0))
trigram_model = defaultdict(lambda: defaultdict(lambda: 0))
fourgram_model = defaultdict(lambda: defaultdict(lambda: 0))

# Bigrams
for bigram in all_bigrams:
    w1, w2 = bigram
    bigram_model[w1][w2] += 1

# Normalize bigram counts to get probabilities
for w1 in bigram_model:
    total_count = float(sum(bigram_model[w1].values()))
    for w2 in bigram_model[w1]:
        bigram_model[w1][w2] /= total_count

# Trigrams
for trigram in all_trigrams:
    w1, w2, w3 = trigram
    trigram_model[(w1, w2)][w3] += 1

# Normalize trigram counts to get probabilities
for w1_w2 in trigram_model:
    total_count = float(sum(trigram_model[w1_w2].values()))
    for w3 in trigram_model[w1_w2]:
        trigram_model[w1_w2][w3] /= total_count

# Fourgrams
for fourgram in all_fourgrams:
    w1, w2, w3, w4 = fourgram
    fourgram_model[(w1, w2, w3)][w4] += 1

# Normalize fourgram counts to get probabilities
for w1_w2_w3 in fourgram_model:
    total_count = float(sum(fourgram_model[w1_w2_w3].values()))
    for w4 in fourgram_model[w1_w2_w3]:
        fourgram_model[w1_w2_w3][w4] /= total_count
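
The three blocks above follow the same pattern: count how often each word follows a given context, then divide by the total count for that context. As a rough sketch of how this could be generalized, the hypothetical helper `build_ngram_model` below does the same for any n on a made-up toy corpus. Note one assumed difference from the models above: the context is always a tuple here, even for bigrams.

```python
from collections import Counter, defaultdict

def build_ngram_model(tokenized_docs, n):
    """Map each (n-1)-word context tuple to a dict {next_word: probability}."""
    counts = defaultdict(Counter)
    for doc in tokenized_docs:
        for i in range(len(doc) - n + 1):
            context = tuple(doc[i:i + n - 1])  # the n-1 preceding words
            next_word = doc[i + n - 1]         # the word being predicted
            counts[context][next_word] += 1

    # Normalize the counts of each context so they sum to 1
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {word: count / total for word, count in counter.items()}
    return model

# Toy corpus, made up for illustration (not the Reddit data)
toy_docs = [["tu", "delft", "website"], ["tu", "delft", "campus"], ["tu", "eindhoven"]]
toy_bigram_model = build_ngram_model(toy_docs, 2)
toy_trigram_model = build_ngram_model(toy_docs, 3)
print(toy_bigram_model[("tu",)])
print(toy_trigram_model[("tu", "delft")])
```

In this toy corpus "tu" is followed by "delft" twice and by "eindhoven" once, so the estimated probabilities are 2/3 and 1/3.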

### 3. Generating text using Ngram models

In [19]:
import random
import numpy as np

# function that uses a bigram model to generate some text
# Input: A bigram model, a starting word, and the length of the sentence that we want to generate
# Output: A string that corresponds to a sentence that is generated based on the bigram probabilities
def generate_text_bigram(model, start_word, num_words):
    # Initialize the current word as the start word
    current_word = start_word

    # Initialize the sentence as a list containing the start word
    sentence = [current_word]

    # Loop over the desired number of words
    for _ in range(num_words):
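        # Check if the current word has any known next words in our model
        # (.get avoids inserting an empty entry into the defaultdict for unseen words)
        if model.get(current_word):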
            # Get the list of potential next words
            next_words = list(model[current_word].keys())

            # Get the list of probabilities corresponding to the next words
            next_word_probs = list(model[current_word].values())

            # Randomly choose a next word based on the probabilities
            # np.random.choice selects an item from "next_words" list taking into account their corresponding probability distribution from "next_word_probs"
            next_word = np.random.choice(next_words, p=next_word_probs)

            # Append this word to our sentence
            sentence.append(next_word)

            # Update the current word to be the word we just added to the sentence
            current_word = next_word
        else:
            # If the current word isn't in our model (this would happen if the word didn't have any following word in the training data),
            # break the loop and end the sentence
            break

    # Join the words in the sentence list with spaces in between to form a string sentence
    return " ".join(sentence)

# Generate text
print("Generated with bigram model: ", generate_text_bigram(bigram_model, "tu", 10))

Generated with bigram model:  tu delft university laptop doesnt come study programme obligatory aissce indian


In [20]:
# function to generate text using a Trigram/Fourgram model
# Input: A Trigram/Fourgram model, a tuple of starting words, and the maximum number of words that we want to generate
# Output: A string that corresponds to a sentence that is generated based on the Ngram probabilities
def generate_text_trigram_fourgram(model, start_words, num_words):
    # start_words should be a tuple with one word fewer than the model order (i.e., for a trigram model we should provide two words, for a fourgram model three words, etc.)
    current_words = start_words

    # Initialize the sentence as a list containing all the start words
    sentence = list(current_words)

    # The length of start_words is the context length of the model, i.e., n-1 (two for a trigram model, three for a fourgram model)
    ngram_order = len(start_words)

    # Loop over the desired number of words
    for _ in range(num_words):

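        # Check if the current context has any known next words in our model
        # (.get avoids inserting an empty entry into the defaultdict for unseen contexts)
        if model.get(current_words):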

            # Get the list of potential next words
            next_words = list(model[current_words].keys())

            # Get the list of probabilities corresponding to the next words
            next_word_probs = list(model[current_words].values())

            # Randomly choose a next word based on the probabilities
            # np.random.choice selects an item from "next_words" list taking into account their corresponding probability distribution from "next_word_probs"
            next_word = np.random.choice(next_words, p=next_word_probs)

            # Append this word to our sentence
            sentence.append(next_word)

            # Update the current words so that they are the last n-1 words of the generated sentence
            current_words = tuple(sentence[-ngram_order:])
        else:
            # If the current context isn't in our model (this would happen if these words never had a following word in the training data),
            # break the loop and end the sentence
            break

    # Join the words in the sentence list with spaces in between to form a string sentence
    return " ".join(sentence)

# Generate text
print("Generated with trigram model: ", generate_text_trigram_fourgram(trigram_model, ("tu", "delft"), 10))
print("Generated with fourgram model: ", generate_text_trigram_fourgram(fourgram_model, ("tu", "delft", "university"), 10))

Generated with trigram model:  tu delft pages lot apply case x etc questions would nice type
Generated with fourgram model:  tu delft university provide incredibly creative surrounding


##### What do you observe in the outputs generated by the bigram, trigram, and fourgram models? Which model do you think generates the most coherent output? Why is that?

## Calculating Sentence Probabilities

With bigram, trigram, and fourgram models trained on our Reddit corpus, we can estimate the probability of a full phrase or sentence. Under an Ngram model, this probability is the product of the conditional probabilities of each word given the n-1 words before it. E.g., for the bigram model, each factor is the conditional probability of a word appearing when considering only the previous word. Let's see how we can calculate sentence probabilities, first with the bigram model.
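
For the bigram case, the sentence probability can be written out explicitly. The formula below is a sketch that matches the bigram function in this section: it starts from the second word, so the result is conditioned on the first word $w_1$ and contains no separate term $P(w_1)$.

$$P(w_2, \dots, w_m \mid w_1) \approx \prod_{i=2}^{m} P(w_i \mid w_{i-1})$$

For example, for the three-word phrase "tu delft university" the product has two factors: $P(\text{delft} \mid \text{tu}) \cdot P(\text{university} \mid \text{delft})$.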

In [21]:
# function to calculate the sentence probability from a bigram model
# Input: a sentence and the bigram model
# Output: the sentence probability within the bigram model
def calculate_sentence_probability_bigram(sentence, model):
    # Convert input sentence to a list of words
    sentence = sentence.split(" ")

    # Initialize probability
    probability = 1

    # Loop over the sentence, starting from the second word. For each word, get its probability given the preceding word and multiply it into the running probability
    for i in range(1, len(sentence)):
        # Get the preceding word and current word
        preceding_word = sentence[i-1]
        current_word = sentence[i]

        # If the preceding word is not in the model or the current word is not in the preceding word's distribution,
        # the probability is 0
        if preceding_word not in model or current_word not in model[preceding_word]:
            return 0

        # Multiply the current probability with the current word's probability
        probability *= model[preceding_word][current_word]

    # return the calculated probability
    return probability

# let's test it with a toy sentence
sentence = "tu delft"
print("The probability for %s using bigram model = %f" %(sentence, calculate_sentence_probability_bigram(sentence, bigram_model) ))


The probability for tu delft using bigram model = 0.789809


This means that when given the word "tu", the probability of the next word being "delft" is almost 79% in our bigram model.

In [22]:
sentence = "tu delft university"
print("The probability for %s using bigram model = %f" %(sentence, calculate_sentence_probability_bigram(sentence, bigram_model) ))

The probability for tu delft university using bigram model = 0.009148


This means that when given the word "tu", the probability of the next words being "delft university" is less than 1% in our bigram model.

In [23]:
sentence = "tu eindhoven"
print("The probability for %s using bigram model = %f" %(sentence, calculate_sentence_probability_bigram(sentence, bigram_model) ))

The probability for tu eindhoven using bigram model = 0.038217


This means that when given the word "tu", the probability of the next word being "eindhoven" is less than 4% in our bigram model.

In [24]:
sentence = "tu eindhoven university"
print("The probability for %s using bigram model = %f" %(sentence, calculate_sentence_probability_bigram(sentence, bigram_model) ))

The probability for tu eindhoven university using bigram model = 0.000000


### What do you observe from these probabilities?

##### Think about the occurrences of the words in the corpus (tu delft vs tu eindhoven). Also, why is the probability of the sentence "tu eindhoven university" zero? How can we solve this issue? Also, how do the probabilities change when we add more words? Think about the smoothing techniques and the log probability calculation that we saw in the lecture.
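
As a starting point for thinking about these questions, here is a minimal sketch of add-one (Laplace) smoothing combined with log probabilities. It is one possible approach, not necessarily the one from the lecture. It works on raw counts from a made-up toy corpus rather than on the normalized models above, and the function `smoothed_log_probability` is a hypothetical helper.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, made up for illustration (not the Reddit data)
toy_docs = [["tu", "delft", "university"], ["tu", "delft", "campus"], ["tu", "eindhoven"]]

# Raw bigram counts and the vocabulary of the toy corpus
bigram_counts = defaultdict(Counter)
vocabulary = set()
for doc in toy_docs:
    vocabulary.update(doc)
    for prev, cur in zip(doc, doc[1:]):
        bigram_counts[prev][cur] += 1

def smoothed_log_probability(sentence, counts, vocab):
    """Sum of log P(w_i | w_{i-1}) with add-one (Laplace) smoothing."""
    words = sentence.split()
    log_prob = 0.0
    for prev, cur in zip(words, words[1:]):
        context_counts = counts.get(prev, {})  # .get does not insert missing keys
        numerator = context_counts.get(cur, 0) + 1
        denominator = sum(context_counts.values()) + len(vocab)
        log_prob += math.log(numerator / denominator)
    return log_prob

seen = smoothed_log_probability("tu delft university", bigram_counts, vocabulary)
unseen = smoothed_log_probability("tu eindhoven university", bigram_counts, vocabulary)
print(seen, unseen)
```

Because every bigram receives a pseudo-count of one, the unseen pair ("eindhoven", "university") gets a small nonzero probability instead of zero. Summing logs instead of multiplying probabilities also keeps long sentences from underflowing to 0.0.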

## Exercise: Write code to calculate sentence probabilities with the trigram and fourgram models. Then check how the probabilities of n-grams like "tu delft university" change. What do you observe when comparing the probability of "tu delft university" under the bigram and trigram models?


In [None]:
# Insert your code here:

## TI3160TU: Natural Language Processing - Ngram models lab -- END