<a href="https://colab.research.google.com/github/Natural-Language-Processing-YU/Exercises/blob/main/Exercise_Building_your_own_n_gram_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise: Building your own n-gram language model

Objective:
This exercise gives students hands-on experience building an n-gram language model and generating new sentences from it. Students work through five steps: preprocessing the text, generating n-grams, creating a vocabulary, constructing the language model, and generating sentences.


In [1]:
import random
import nltk
from nltk.util import ngrams
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline

# word_tokenize needs the Punkt tokenizer data (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")


## Step 1: Preprocessing

Provide a small corpus of sentences and tokenize the text using NLTK's word_tokenize function.

In [2]:
# Step 1: Preprocessing
corpus = [
    "I love to eat pizza.",
    "I love to play soccer.",
    "I love to read books.",
    "I love to create algorithms.",
]

# Tokenize the text
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]
tokenized_corpus


[['i', 'love', 'to', 'eat', 'pizza', '.'],
 ['i', 'love', 'to', 'play', 'soccer', '.'],
 ['i', 'love', 'to', 'read', 'books', '.'],
 ['i', 'love', 'to', 'create', 'algorithms', '.']]

## Step 2: N-gram Generation


In [3]:
# Step 2: N-gram Generation
n = 3  # Trigrams

# Pad the sequences
padded_corpus = [list(pad_both_ends(sentence, n=n)) for sentence in tokenized_corpus]

# Flatten the corpus into n-grams
ngrams_list = [ngrams(sentence, n) for sentence in padded_corpus]
flattened_ngrams = [ngram for sublist in ngrams_list for ngram in sublist]
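
To make the padding and trigram extraction concrete, the sketch below runs both on a single pre-tokenized sentence. The variable names `tokens`, `padded`, and `trigrams` are illustrative only and are not used elsewhere in the exercise.

```python
from nltk.util import ngrams
from nltk.lm.preprocessing import pad_both_ends

tokens = ["i", "love", "to", "eat", "pizza", "."]

# With n=3, pad_both_ends adds n-1 = 2 start markers and 2 end markers
padded = list(pad_both_ends(tokens, n=3))
trigrams = list(ngrams(padded, 3))

print(padded)
# First trigram, last trigram, and how many trigrams the sentence yields
print(trigrams[0], trigrams[-1], len(trigrams))
```

The padding lets the model learn which words tend to start and end a sentence: the first trigram is `('<s>', '<s>', 'i')` and the last is `('.', '</s>', '</s>')`.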


## Step 3: Vocabulary Creation


In [4]:
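# Step 3: Vocabulary Creation
# Distinct trigrams in the corpus; Step 5 picks its starting context from this set
vocab = set(flattened_ngrams)
# Lazy, single-use iterators: everygrams per padded sentence, and a flat stream of padded words
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_corpus)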
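
The two values returned by `padded_everygram_pipeline` are lazy iterators, which is easy to miss. The following sketch materializes them for a single sentence to show what they contain; the names `sentences`, `padded_words`, `everygrams_per_sentence`, and `words` are illustrative assumptions, not part of the exercise.

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

sentences = [["i", "love", "to", "eat", "pizza", "."]]

train_data, padded_words = padded_everygram_pipeline(3, sentences)

# Materialize the lazy iterators to look inside (this consumes them)
everygrams_per_sentence = [list(sent) for sent in train_data]
words = list(padded_words)

# Unigrams + bigrams + trigrams of the 10-token padded sentence: 10 + 9 + 8
print(len(everygrams_per_sentence[0]))
print(words)

# The generator is now exhausted, so a second pass yields nothing
leftover = list(train_data)
print(leftover)
```

Because the iterators can be consumed only once, re-running Step 4 without re-running Step 3 would fit the model on exhausted iterators.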


## Step 4: Language Model Construction


In [5]:
# Step 4: Language Model Construction
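model = MLE(n)  # maximum-likelihood estimates, no smoothing
model.fit(train_data, padded_sents)  # n-gram counts come from train_data, the word vocabulary from padded_sents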
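
A fitted MLE model scores a word as the count of the full n-gram divided by the count of its context. The sketch below rebuilds the same model from a pre-tokenized copy of the corpus (so it runs on its own) and queries a few probabilities. The names `sentences`, `demo_model`, `p_to`, `p_eat`, and `p_unseen` are illustrative assumptions.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Pre-tokenized copy of the exercise corpus
sentences = [
    "i love to eat pizza .".split(),
    "i love to play soccer .".split(),
    "i love to read books .".split(),
    "i love to create algorithms .".split(),
]

train_data, padded_words = padded_everygram_pipeline(3, sentences)
demo_model = MLE(3)
demo_model.fit(train_data, padded_words)

# "i love" is always followed by "to"
p_to = demo_model.score("to", ["i", "love"])
# "love to" is followed by four different verbs, once each
p_eat = demo_model.score("eat", ["love", "to"])
# A trigram never seen in training gets probability zero under MLE
p_unseen = demo_model.score("pizza", ["love", "to"])

print(p_to, p_eat, p_unseen)
```

The zero probability for unseen trigrams is the main limitation of unsmoothed maximum-likelihood estimation on a corpus this small.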

## Step 5: Generate Text

In [6]:

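# Step 5: Sentence Generation
max_length = 10  # Maximum number of words in the generated sentence

# Set an initial context: the first n-1 tokens of a random trigram,
# skipping trigrams whose prefix already contains the end-of-sentence marker
candidates = [ngram for ngram in vocab if "</s>" not in ngram[:n - 1]]
context = random.choice(candidates)
prefix = context[:n - 1]

# Generate a new sentence one token at a time (sampling is random, so the output varies per run)
generated_tokens = list(prefix)

while len(generated_tokens) < max_length:
    # With num_words=1, generate() returns a single token string, not a list
    token = model.generate(1, text_seed=generated_tokens)

    if token == "</s>":
        break
    generated_tokens.append(token)

generated_sentence = " ".join(generated_tokens)
print("Generated Sentence:", generated_sentence)


Generated Sentence: love to read books .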
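
For comparison, the same idea can be sketched without NLTK. This is a simplified stand-in for the library model, not a reimplementation of it: the helper names `train_ngram_counts` and `generate_sentence` are hypothetical, the code assumes n >= 2, and it pads the end of each sentence with a single `</s>` marker rather than n-1 of them.

```python
import random
from collections import Counter, defaultdict

def train_ngram_counts(sentences, n=3):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    return counts

def generate_sentence(counts, n=3, max_length=10, rng=random):
    """Sample words from the sentence-start context until </s> or max_length words."""
    tokens = ["<s>"] * (n - 1)
    while len(tokens) - (n - 1) < max_length:
        followers = counts[tuple(tokens[-(n - 1):])]
        words = sorted(followers)
        # Sample the next word in proportion to its count after this context
        word = rng.choices(words, weights=[followers[w] for w in words])[0]
        if word == "</s>":
            break
        tokens.append(word)
    return " ".join(tokens[n - 1:])

sentences = [
    "i love to eat pizza .".split(),
    "i love to play soccer .".split(),
    "i love to read books .".split(),
    "i love to create algorithms .".split(),
]
counts = train_ngram_counts(sentences)
sample = generate_sentence(counts, rng=random.Random(0))
print(sample)
```

On this corpus every object word follows exactly one verb, so a trigram model that starts from the sentence-start context can only reproduce one of the four training sentences. A larger or more varied corpus is needed before the model produces genuinely new sentences.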
