
# Exercise: Building your own n-gram language model

Objective:
This exercise gives students hands-on experience building an n-gram language model and using it to generate new sentences. Students work through five steps: preprocessing the text, generating n-grams, creating a vocabulary, fitting the language model, and generating sentences.


In [1]:
import random
import nltk
from nltk.util import ngrams
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.lm import MLE

In [2]:
import nltk

# Download the 'punkt' tokenizer models used by word_tokenize
# (newer NLTK releases may ask for the 'punkt_tab' resource instead)
nltk.download('punkt')




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Step 1: Preprocessing

Provide a small corpus of sentences and tokenize the text using NLTK's `word_tokenize` function.

In [3]:
# Step 1: Preprocessing
corpus = [
    "I love to eat pizza.",
    "I love to play soccer.",
    "I love to read books.",
    "I love to create algorithms.",
]

# Tokenize the text
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]

## Step 2: N-gram Generation


In [4]:
# Step 2: N-gram Generation
n = 3  # Trigrams

# Pad the sequences
padded_corpus = [list(pad_both_ends(sentence, n=n)) for sentence in tokenized_corpus]

# Flatten the corpus into n-grams
ngrams_list = [ngrams(sentence, n) for sentence in padded_corpus]
flattened_ngrams = [ngram for sublist in ngrams_list for ngram in sublist]

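# Build the training data for the model: padded_everygram_pipeline pads each
# sentence and yields every n-gram of order 1..n, plus a flat stream of padded
# tokens for the vocabulary. Both results are lazy iterators, so they can only
# be consumed once (by model.fit in Step 4).
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_corpus)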
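
To make the padding and windowing concrete, here is a minimal pure-Python sketch of what `pad_both_ends` and `ngrams` compute for a single sentence. The helper names `pad_tokens` and `extract_ngrams` are hypothetical stand-ins written for illustration; they are not part of NLTK.

```python
def pad_tokens(tokens, n):
    """Add n-1 start and end symbols, mirroring pad_both_ends(tokens, n=n)."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)


def extract_ngrams(tokens, n):
    """Slide a window of size n over the tokens; each window becomes a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = ["i", "love", "to", "eat", "pizza", "."]
padded = pad_tokens(sentence, 3)
trigrams = extract_ngrams(padded, 3)

print(padded)
print(trigrams[0])    # ('<s>', '<s>', 'i')
print(len(trigrams))  # 8
```

A padded sentence of length 10 yields 8 trigrams, and the start padding guarantees that the first real word always has a full two-token context.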



## Step 3: Vocabulary Creation


In [5]:
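# Step 3: Vocabulary Creation
# Note: this collects the distinct trigrams, which is useful for inspection.
# The word-level vocabulary the model relies on is built from padded_vocab
# when the model is fitted in Step 4.
vocab = set(flattened_ngrams)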
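
A language model's vocabulary is a set of word types rather than a set of n-grams. As a rough illustration, assuming the four lowercased, tokenized sentences from Step 1, a word-level vocabulary with frequencies can be built with `collections.Counter`. The variable names here are illustrative only; NLTK's own `Vocabulary` class additionally maps out-of-vocabulary words to an `<UNK>` label.

```python
from collections import Counter

# Assumed to match the tokenized corpus from Step 1
tokenized = [
    ["i", "love", "to", "eat", "pizza", "."],
    ["i", "love", "to", "play", "soccer", "."],
    ["i", "love", "to", "read", "books", "."],
    ["i", "love", "to", "create", "algorithms", "."],
]

# Count every token, then add the padding symbols as vocabulary entries
word_counts = Counter(token for sentence in tokenized for token in sentence)
word_vocab = set(word_counts) | {"<s>", "</s>"}

print(len(word_vocab))      # 14 (12 word types plus the two padding symbols)
print(word_counts["love"])  # 4
```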

## Step 4: Language Model Construction


In [6]:
# Step 4: Language Model Construction
model = MLE(n)

# Fit the model
model.fit(train_data, padded_vocab)
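
A maximum likelihood estimate is a relative frequency: the probability of a word given its context is the count of the full n-gram divided by the count of the context. The sketch below reproduces that calculation for trigrams in plain Python so the numbers can be checked by hand. The function name `trigram_prob` and the choice to return 0.0 for an unseen context are assumptions made for this illustration.

```python
from collections import Counter

# Assumed to match the tokenized corpus from Step 1
tokenized = [
    ["i", "love", "to", "eat", "pizza", "."],
    ["i", "love", "to", "play", "soccer", "."],
    ["i", "love", "to", "read", "books", "."],
    ["i", "love", "to", "create", "algorithms", "."],
]
n = 3

trigram_counts = Counter()  # (context, word) -> count
context_counts = Counter()  # context -> count
for sentence in tokenized:
    padded = ["<s>"] * (n - 1) + sentence + ["</s>"] * (n - 1)
    for i in range(len(padded) - n + 1):
        *context, word = padded[i:i + n]
        trigram_counts[(tuple(context), word)] += 1
        context_counts[tuple(context)] += 1


def trigram_prob(word, context):
    """Relative-frequency estimate of P(word | context); 0.0 if context is unseen."""
    total = context_counts[tuple(context)]
    return trigram_counts[(tuple(context), word)] / total if total else 0.0


print(trigram_prob("to", ("i", "love")))    # 1.0
print(trigram_prob("eat", ("love", "to")))  # 0.25
```

Every sentence continues "i love" with "to", so that probability is 1.0, while "love to" is followed by four different verbs, each with probability 0.25.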


## Step 5: Generate Text

In [7]:
# Generate one sentence token by token, starting from the start-of-sentence padding
max_length = 10  # Maximum number of words in the generated sentence
context = ('<s>',) * (n - 1)  # pad_both_ends adds n-1 start tokens, so seed with the same

generated_sentence = []
while len(generated_sentence) < max_length:
    # With num_words=1, generate() returns a single token (a string), not a list.
    # Indexing that string with [-1] would keep only its last character and
    # produce output such as '> > d > > > > > . >'.
    token = model.generate(num_words=1, text_seed=context)
    if token == "</s>":
        break
    generated_sentence.append(token)
    # Slide the context window: keep the last n-1 tokens, including start padding
    context = (context + (token,))[-(n - 1):]

# Join the generated tokens to form a sentence
generated_sentence = " ".join(generated_sentence)

"Generated Sentence: " + generated_sentence

In [8]:
# Step 5: Generate Text

def generate_text(model, num_words, random_seed=42):
    """
    Generate text using the n-gram model.

    :param model: Trained n-gram language model.
    :param num_words: Maximum number of words to generate after the starting words.
    :param random_seed: Seed for Python's global random module, used to pick the starting words.
    :return: Generated text as a string.
    """
    # Seeds random.choice below; pass random_seed to model.generate() as well
    # to control the model's own sampling.
    random.seed(random_seed)
    # Start with n-1 random vocabulary entries (these may include padding symbols)
    text = [random.choice(list(model.vocab)) for _ in range(n-1)]
    for _ in range(num_words):
        context = tuple(text[-(n-1):])  # get the last (n-1) elements
        next_word = model.generate(text_seed=context)  # returns a single token
        text.append(next_word)
        if next_word == '</s>':  # check for end of sentence
            break
    return ' '.join(text)

# Generate a sentence
generated_sentence = generate_text(model, num_words=10)
print("Generated Sentence: " + generated_sentence)


Generated Sentence: read i love to read books . </s>
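
Generation amounts to repeatedly sampling the next word in proportion to how often it followed the current context in the training data. The following self-contained sketch shows that loop without NLTK, using a dedicated `random.Random` instance so that a given seed makes the whole run repeatable. The names `build_successors` and `sample_sentence` are hypothetical, and unlike `generate_text` above, this version starts from the start-of-sentence padding rather than from random words.

```python
import random
from collections import Counter, defaultdict

# Assumed to match the tokenized corpus from Step 1
tokenized = [
    ["i", "love", "to", "eat", "pizza", "."],
    ["i", "love", "to", "play", "soccer", "."],
    ["i", "love", "to", "read", "books", "."],
    ["i", "love", "to", "create", "algorithms", "."],
]


def build_successors(sentences, n):
    """Map each (n-1)-token context to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] * (n - 1) + sentence + ["</s>"] * (n - 1)
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            successors[context][padded[i + n - 1]] += 1
    return successors


def sample_sentence(successors, n, max_words=10, seed=42):
    """Sample words one at a time until '</s>' or max_words is reached."""
    rng = random.Random(seed)  # private generator, so the seed controls every draw
    context = ("<s>",) * (n - 1)
    words = []
    while len(words) < max_words:
        options = successors[context]
        if not options:  # unseen context: nothing to sample from
            break
        word = rng.choices(list(options), weights=list(options.values()))[0]
        if word == "</s>":
            break
        words.append(word)
        context = (context + (word,))[1:]  # slide the window by one token
    return " ".join(words)


successors = build_successors(tokenized, 3)
print(sample_sentence(successors, 3))
```

With this corpus every sampled sentence begins with "i love to", because those contexts have a single continuation; only the verb is chosen at random.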
