# n-gram bot

## A sample project: an n-gram bot that continues a sequence from a given word

Reference: Coursera course "Gen AI Foundational Models for NLP & Language Understanding" by IBM

The user chooses the context size (the number of preceding words used for prediction) and the starting word.

This project was made on Oct 1st, 2024.

### Install necessary libraries

In [1]:
%%capture

!mamba install -y nltk
!pip install torchtext -qqq

### Import libraries

In [2]:
%%capture
import warnings
from tqdm import tqdm

warnings.simplefilter('ignore')
import time
from collections import OrderedDict

import re

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


import nltk
nltk.download('punkt')

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import string
import time

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
warnings.warn = warn

### Define a function to preprocess the input text

In [3]:
def preprocess_string(s):
    # Remove every character that is neither a word character (letter, digit, underscore) nor whitespace
    s = re.sub(r"[^\w\s]", '', s)
    # Remove all runs of whitespace
    s = re.sub(r"\s+", '', s)
    # Remove digits
    s = re.sub(r"\d", '', s)

    return s
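
As a quick check of what this cleaning step does, the sketch below repeats the function so that it runs on its own and applies it to a few strings taken from the paragraph. The `samples` list is only an illustration; tokens that become empty are the ones filtered out later in `preprocess`.

```python
import re

def preprocess_string(s):
    s = re.sub(r"[^\w\s]", '', s)   # drop punctuation and other symbols
    s = re.sub(r"\s+", '', s)       # drop whitespace
    s = re.sub(r"\d", '', s)        # drop digits
    return s

# Illustrative inputs: a hyphenated name, a year, a plain word, a comma
samples = ["Fernández-González", "2015", "Supercomputing", ","]
cleaned = [preprocess_string(s) for s in samples]
print(cleaned)
```

The hyphen is removed from the name, while the year and the comma collapse to empty strings.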

### Input sequence

In [4]:
paragraph = "The article is “Historical review and future challenges in Supercomputing and Networks of Scientific Communication”, written by Á. Fernández-González, R. Rosillo, J. Á. Miguel-Dávila, and V. Matellán. It was published in The Journal of Supercomputing in November 2015. The article reviews the history of supercomputers covering their birth, current use, and future challenges. The characteristics, improvements, and challenges of major supercomputers are covered. The authors also mention the importance of networks that support the functions of supercomputers, along with the history of how the networks have been developing. The authors have written an important and timely article on supercomputing and their networks and they take on the difficult task of revealing the historical plots and improvements of supercomputers. Unfortunately, the authors have not covered how they became more energy efficient in detail."

### Preprocess the text

In [5]:
from nltk.tokenize import word_tokenize
def preprocess(words):
    tokens=word_tokenize(words)
    tokens=[preprocess_string(w)   for w in tokens]
    # Keep only non-empty tokens that are not punctuation, lowercased
    return [w.lower() for w in tokens if len(w) != 0 and w not in string.punctuation]

tokens=preprocess(paragraph)

In [6]:
print(len(tokens))
print(tokens)

131
['the', 'article', 'is', 'historical', 'review', 'and', 'future', 'challenges', 'in', 'supercomputing', 'and', 'networks', 'of', 'scientific', 'communication', 'written', 'by', 'á', 'fernándezgonzález', 'r', 'rosillo', 'j', 'á', 'migueldávila', 'and', 'v', 'matellán', 'it', 'was', 'published', 'in', 'the', 'journal', 'of', 'supercomputing', 'in', 'november', 'the', 'article', 'reviews', 'the', 'history', 'of', 'supercomputers', 'covering', 'their', 'birth', 'current', 'use', 'and', 'future', 'challenges', 'the', 'characteristics', 'improvements', 'and', 'challenges', 'of', 'major', 'supercomputers', 'are', 'covered', 'the', 'authors', 'also', 'mention', 'the', 'importance', 'of', 'networks', 'that', 'support', 'the', 'functions', 'of', 'supercomputers', 'along', 'with', 'the', 'history', 'of', 'how', 'the', 'networks', 'have', 'been', 'developing', 'the', 'authors', 'have', 'written', 'an', 'important', 'and', 'timely', 'article', 'on', 'supercomputing', 'and', 'their', 'networks',

In [7]:
fdist = nltk.FreqDist(tokens)
print(fdist)

<FreqDist with 72 samples and 131 outcomes>


In [8]:
# Total number of tokens (sum of all word counts)
C=sum(fdist.values())
print(C)

131


In [9]:
#Unique words 
vocabulary=set(tokens)
print(len(vocabulary))
print(vocabulary)

72
{'functions', 'important', 'was', 'matellán', 'r', 'that', 'mention', 'how', 'developing', 'with', 'they', 'unfortunately', 'not', 'efficient', 'the', 'of', 'review', 'it', 'history', 'future', 'task', 'november', 'reviews', 'communication', 'detail', 'more', 'current', 'energy', 'scientific', 'support', 'on', 'an', 'revealing', 'á', 'been', 'have', 'and', 'birth', 'migueldávila', 'by', 'covered', 'v', 'published', 'plots', 'fernándezgonzález', 'characteristics', 'networks', 'use', 'in', 'authors', 'challenges', 'major', 'also', 'timely', 'supercomputing', 'journal', 'difficult', 'became', 'article', 'supercomputers', 'their', 'j', 'importance', 'is', 'covering', 'improvements', 'rosillo', 'are', 'written', 'take', 'along', 'historical'}
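
The counts above are the raw material for the simplest language model, the unigram model, where the probability of a word is its count divided by the total number of tokens. The sketch below shows that arithmetic with `collections.Counter` on a short, hypothetical token list (`sample_tokens` stands in for `tokens`, and `total` plays the role of `C`).

```python
from collections import Counter

# Hypothetical token list standing in for `tokens`
sample_tokens = ["the", "article", "reviews", "the", "history", "of", "supercomputers"]

counts = Counter(sample_tokens)
total = sum(counts.values())   # same role as C above

# Relative frequency of each word: count(w) / total
unigram_prob = {word: count / total for word, count in counts.items()}

print(total)                 # number of tokens
print(len(counts))           # number of unique words
print(unigram_prob["the"])   # 2 occurrences out of 7 tokens
```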


### n-grams model

In [10]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
tokens=tokenizer(paragraph)

# Create a vocabulary from text tokens

# Step 1: Tokenization
# Split the 'paragraph' on whitespace and apply the tokenizer to each piece.
# map() returns a lazy iterator that yields one list of tokens per piece.
tokenized_paragraph = map(tokenizer, paragraph.split())

# Step 2: Vocabulary Building
# The build_vocab_from_iterator function constructs a vocabulary from the tokenized text.
# In this case, add a special token "<unk>" (unknown token) to handle out-of-vocabulary words.
vocab = build_vocab_from_iterator(tokenized_paragraph, specials=["<unk>"])

# Step 3: Set Default Index
# Set the default index for the vocabulary to the index corresponding to the "<unk>" token.
# This ensures that any unknown tokens in the future will be mapped to this index.
vocab.set_default_index(vocab["<unk>"])

In [11]:
text_pipeline = lambda x: vocab(tokenizer(x))

In [12]:
index_to_token = vocab.get_itos()
print(index_to_token)  # list that maps each index to its token

['<unk>', 'the', '.', ',', 'and', 'of', 'in', 'networks', 'supercomputers', 'article', 'authors', 'challenges', 'have', 'supercomputing', 'covered', 'future', 'history', 'how', 'improvements', 'on', 'their', 'they', 'written', 'á', '2015', 'along', 'also', 'an', 'are', 'became', 'been', 'birth', 'by', 'characteristics', 'communication”', 'covering', 'current', 'detail', 'developing', 'difficult', 'efficient', 'energy', 'fernández-gonzález', 'functions', 'historical', 'importance', 'important', 'is', 'it', 'j', 'journal', 'major', 'matellán', 'mention', 'miguel-dávila', 'more', 'not', 'november', 'plots', 'published', 'r', 'revealing', 'review', 'reviews', 'rosillo', 'scientific', 'support', 'take', 'task', 'that', 'timely', 'unfortunately', 'use', 'v', 'was', 'with', '“historical']
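
To make the role of the vocabulary concrete, here is a minimal plain-Python sketch of the same idea: tokens are mapped to integer indices, and anything unseen falls back to the `<unk>` index. The helpers `build_vocab` and `lookup` are hypothetical names, and the ordering rule (most frequent first, ties alphabetical) is an assumption chosen to resemble the listing above, not a copy of torchtext's implementation.

```python
from collections import Counter

def build_vocab(tokens, specials=("<unk>",)):
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically (assumed ordering)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + ordered                  # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}    # token -> index
    return stoi, itos

def lookup(stoi, tokens, default="<unk>"):
    # Unknown tokens map to the index of the default token
    return [stoi.get(t, stoi[default]) for t in tokens]

sample = ["the", "article", "reviews", "the", "history", "of", "the", "article"]
stoi, itos = build_vocab(sample)
print(itos)                                          # ['<unk>', 'the', 'article', 'history', 'of', 'reviews']
print(lookup(stoi, ["the", "history", "author"]))    # [1, 3, 0]
```

The word "author" is not in the sample, so it maps to index 0, which is what `vocab.set_default_index` arranges in the notebook.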


In [13]:
#Example use

#get n-grams
CONTEXT_SIZE=2
ngrams = [
    (
        [tokens[i - j - 1] for j in range(CONTEXT_SIZE)],
        tokens[i]
    )
    for i in range(CONTEXT_SIZE, len(tokens))
]

print(ngrams)

#Embedding
embedding_dim=20
vocab_size=len(vocab)
embeddings = nn.Embedding(vocab_size, embedding_dim)

print("-----", "Index the text", "-----")
context, target=ngrams[0]
print("context",context,"target",target)
print("context index",vocab(context),"target index",vocab([target]))

print("-----", "Embed the text", "-----")
print(vocab(context))
my_embeddings=embeddings(torch.tensor(vocab(context)))
print(my_embeddings)
print(my_embeddings.shape)

print("-----", "concatenate the embedded text", "-----")
my_embeddings=my_embeddings.reshape(1,-1)
my_embeddings.shape
print(my_embeddings)


linear = nn.Linear(embedding_dim*CONTEXT_SIZE,128)
print("-----", "input to linear function and get output", "-----")
print(linear(my_embeddings))

[(['article', 'the'], 'is'), (['is', 'article'], '“historical'), (['“historical', 'is'], 'review'), (['review', '“historical'], 'and'), (['and', 'review'], 'future'), (['future', 'and'], 'challenges'), (['challenges', 'future'], 'in'), (['in', 'challenges'], 'supercomputing'), (['supercomputing', 'in'], 'and'), (['and', 'supercomputing'], 'networks'), (['networks', 'and'], 'of'), (['of', 'networks'], 'scientific'), (['scientific', 'of'], 'communication”'), (['communication”', 'scientific'], ','), ([',', 'communication”'], 'written'), (['written', ','], 'by'), (['by', 'written'], 'á'), (['á', 'by'], '.'), (['.', 'á'], 'fernández-gonzález'), (['fernández-gonzález', '.'], ','), ([',', 'fernández-gonzález'], 'r'), (['r', ','], '.'), (['.', 'r'], 'rosillo'), (['rosillo', '.'], ','), ([',', 'rosillo'], 'j'), (['j', ','], '.'), (['.', 'j'], 'á'), (['á', '.'], '.'), (['.', 'á'], 'miguel-dávila'), (['miguel-dávila', '.'], ','), ([',', 'miguel-dávila'], 'and'), (['and', ','], 'v'), (['v', 'and']
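
The list comprehension used to build `ngrams` is easier to follow on a tiny input. The sketch below wraps the same expression in a hypothetical helper, `make_ngrams`, and runs it on five words. Note that each context lists the most recent word first, which is why the first pair above reads `['article', 'the']` rather than `['the', 'article']`.

```python
def make_ngrams(tokens, context_size):
    # For each position i, the context is the previous `context_size` tokens
    # (most recent first) and the target is tokens[i].
    return [
        ([tokens[i - j - 1] for j in range(context_size)], tokens[i])
        for i in range(context_size, len(tokens))
    ]

sample = ["the", "article", "is", "a", "review"]
pairs = make_ngrams(sample, 2)
for context, target in pairs:
    print(context, "->", target)
# ['article', 'the'] -> is
# ['is', 'article'] -> a
# ['a', 'is'] -> review
```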

In [14]:
#batching

from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
CONTEXT_SIZE=3
BATCH_SIZE=10
EMBEDDING_DIM = 10

def collate_batch(batch):
    batch_size=len(batch)
    context, target=[],[]
    for i in range(CONTEXT_SIZE, batch_size):
        target.append(vocab([batch[i]]))
        context.append(vocab([batch[i-j-1] for j in range(CONTEXT_SIZE)]))

    return   torch.tensor(context).to(device),  torch.tensor(target).to(device).reshape(-1)

Padding=BATCH_SIZE-len(tokens)%BATCH_SIZE
tokens_pad=tokens+tokens[0:Padding]  #add first few tokens to the end for padding.

dataloader = DataLoader(tokens_pad, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
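
Because `collate_batch` starts at index `CONTEXT_SIZE` inside each batch, the first `CONTEXT_SIZE` tokens of every batch serve only as context and never as targets. The sketch below works through the arithmetic of the padding and the number of (context, target) pairs per epoch. The helper `pairs_per_epoch` and the token counts are hypothetical, for illustration only.

```python
def pairs_per_epoch(num_tokens, batch_size, context_size):
    # Same padding rule as above: BATCH_SIZE - len(tokens) % BATCH_SIZE
    padding = batch_size - num_tokens % batch_size
    padded = num_tokens + padding
    num_batches = padded // batch_size
    # Each full batch yields batch_size - context_size training pairs
    return padding, num_batches, num_batches * (batch_size - context_size)

print(pairs_per_epoch(25, 10, 3))   # (5, 3, 21)
print(pairs_per_epoch(30, 10, 3))   # (10, 4, 28)
```

The second call shows a side effect of this padding rule: when the token count is already a multiple of the batch size, a whole extra batch of repeated tokens is appended.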

In [15]:
# This is our n-gram model

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.context_size=context_size
        self.embedding_dim=embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        embeds=torch.reshape( embeds, (-1,self.context_size * self.embedding_dim))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)

        return out

In [20]:
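# This function writes a paragraph for monitoring progress during training.
# Each word is predicted from the preceding words of the original text (`tokens`), not from the model's own output.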
def write_paragraph(model,number_of_words=100):
    my_paragraph=""
    for i in range(number_of_words):
        with torch.no_grad():
            context=torch.tensor(vocab([tokens[i-j-1] for j in range(CONTEXT_SIZE)])).to(device)
            word_inx=torch.argmax(model(context))
            my_paragraph+=" "+index_to_token[word_inx.detach().item()]

    return my_paragraph

In [21]:
# Loss function: cross-entropy between the predicted logits and the target word index
criterion = torch.nn.CrossEntropyLoss()
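
For a single example, cross-entropy is the negative log of the softmax probability assigned to the correct word. The sketch below computes it in plain Python to show the arithmetic. The function `cross_entropy` and the logit values are illustrative; the PyTorch criterion additionally averages over the batch by default.

```python
import math

def cross_entropy(logits, target_index):
    # log(sum(exp(logits))) computed stably, minus the target's logit
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

logits = [2.0, 0.5, -1.0]          # hypothetical scores for a 3-word vocabulary
loss_likely = cross_entropy(logits, 0)     # target has the highest score
loss_unlikely = cross_entropy(logits, 2)   # target has the lowest score
print(loss_likely, loss_unlikely)

# With equal scores, the loss is log(vocabulary size)
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 1)
print(uniform, math.log(4))
```

An untrained model with roughly uniform predictions should therefore start near a loss of `log(len(vocab))`.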

In [22]:
#this function trains the n-gram model expressed as a neural network

def train(dataloader, model, number_of_epochs=100, show=10):
    """
    Args:
        dataloader (DataLoader): DataLoader containing training data.
        model (nn.Module): Neural network model to be trained.
        number_of_epochs (int, optional): Number of epochs for training. Default is 100.
        show (int, optional): Interval for displaying progress. Default is 10.

    Returns:
        list: List containing loss values for each epoch.
    """

    MY_LOSS = []  # List to store loss values for each epoch

    # Iterate over the specified number of epochs
    for epoch in tqdm(range(number_of_epochs)):
        total_loss = 0  # Initialize total loss for the current epoch
        my_paragraph = ""    # Initialize a string to store the generated paragraph

        # Iterate over batches in the dataloader
        for context, target in dataloader:
            model.zero_grad()          # Zero the gradients to avoid accumulation
            predicted = model(context)  # Forward pass through the model to get predictions
            loss = criterion(predicted, target.reshape(-1))  # Calculate the loss
            total_loss += loss.item()   # Accumulate the loss

            loss.backward()    # Backpropagation to compute gradients
            optimizer.step()   # Update model parameters using the optimizer

        # Display progress and generate a paragraph at specified intervals
        if epoch % show == 0:
            my_paragraph += write_paragraph(model)  # Generate a paragraph using the model

            print("Generated Paragraph:")
            print("\n")
            print(my_paragraph)

        MY_LOSS.append(total_loss/len(dataloader))  # Append the mean loss for the epoch to MY_LOSS

    return MY_LOSS  # Return the list of  mean loss values for each epoch

In [18]:
# Define the context size for the n-gram model
CONTEXT_SIZE = int(input("Input the preferred context size:"))

# Create an instance of the NGramLanguageModeler class with specified parameters
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE).to(device)

# Define the optimizer for training the model, using stochastic gradient descent (SGD)
optimizer = optim.SGD(model.parameters(), lr=0.01)

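# Set up a StepLR learning rate scheduler (multiplies the learning rate by gamma every step_size epochs).
# Note: train() never calls scheduler.step(), so the learning rate stays at 0.01 throughout training.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)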

Input the preferred context size: 3
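
To see what a step schedule with these settings would do if it were stepped once per epoch, the sketch below evaluates the usual closed form, `initial_lr * gamma ** (epoch // step_size)`. The helper `step_lr` is a hypothetical name used only to show the arithmetic; it is not a PyTorch call.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    # Learning rate after `epoch` scheduler steps under a step decay
    return initial_lr * gamma ** (epoch // step_size)

# Settings from the cell above: lr=0.01, gamma=0.1, step_size=1
lrs = [step_lr(0.01, 0.1, 1, epoch) for epoch in range(4)]
print(lrs)
```

With a step size of 1 the rate shrinks tenfold every epoch, so after a few epochs the updates would become negligible. A larger `step_size` would be the more typical choice if the scheduler were put to use.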


In [23]:
#Train!
y_loss=train(dataloader,model)

  2%|▏         | 2/100 [00:00<00:07, 13.66it/s]

Generated Paragraph:


 been difficult and of <unk> and that covering have of been of of more <unk> major energy and in been review difficult of review difficult difficult have the have they and supercomputing and functions covering review difficult and and of and have future of and reviews the of task the in and difficult have . of <unk> have take and . major have covering major major have are difficult of it covering and history covering of <unk> have in . the important of difficult in communication” in of <unk> the have á of in <unk> have history have major review


 14%|█▍        | 14/100 [00:01<00:06, 13.35it/s]

Generated Paragraph:


 . . and . . and . of . of . . of . . the . and . . . . . . . . . . . . . . . . . . . and . . . . . . . . the . . the . and . . . of . the . . . . . . . of . the . . . . . . . of . the . . the . . . . . . . . the . . of . . the . . of .


 24%|██▍       | 24/100 [00:01<00:05, 13.82it/s]

Generated Paragraph:


 the . and . . and . of . of . . of supercomputers . the . . . . . . of . the . . . . . . . of . of . . and . . . . . of and . the . . the . and . . . of . the . , . . . . . of . the . of . . and . of of . the . . the . of . . . of . . the . . of supercomputers . the . . of of


 32%|███▏      | 32/100 [00:02<00:05, 12.17it/s]

Generated Paragraph:


 the . and . . and , challenges . the of . of supercomputers . , and and . . the . of . the . and . and . the . and . of supercomputers . and . the and , of of and the the , . the . and . , of supercomputers . the . , . supercomputers , and future challenges . the . , . . and . of supercomputers . the the . the . of . . . of supercomputers supercomputers the , of of supercomputers . the . . of of


 44%|████▍     | 44/100 [00:03<00:04, 13.36it/s]

Generated Paragraph:


 the . and . review and , challenges . the of . challenges supercomputers and , and , , . the . the . the . and . and . the . and future of supercomputers . and . the and , of of and and the , . the . and . , of supercomputers . the of , . supercomputers , and future challenges . the . , . . and . of supercomputers . the the . the , of . the of of supercomputers supercomputers the , of of supercomputers . and . . of of


 52%|█████▏    | 52/100 [00:03<00:03, 14.02it/s]

Generated Paragraph:


 the . and “historical review and future challenges . the of future challenges supercomputers communication” , and by á . the . the . the , and . and . the . and future of supercomputers . the . , and , of of and in november , . the . and . , of supercomputers . the of , . supercomputers , and future challenges . the . , . . and . of major . the the . the , have . the of of supercomputers supercomputers the , of of supercomputers . and supercomputers . of of


 62%|██████▏   | 62/100 [00:04<00:03, 12.07it/s]

Generated Paragraph:


 the . and “historical review and future challenges . the of future challenges supercomputers communication” , and by á . the . the . the , and . and . the . and v of supercomputers . the . published and , of of supercomputing in november 2015 . the . and . history of supercomputers . their of , . supercomputers , and future challenges . the . , . . and challenges of major . the the . the , have . the of of supercomputers that the , of of supercomputers . and challenges . of of


 72%|███████▏  | 72/100 [00:05<00:02, 11.08it/s]

Generated Paragraph:


 the article have “historical review and future challenges . supercomputing of future challenges supercomputers communication” , written by á . the . have . the , and . á . the . and v of matellán . the was published and , of of supercomputing in november 2015 . the . have . history of supercomputers . their birth , and supercomputers , and future challenges . the . , . . á challenges of major supercomputers the the . the , have . the importance of supercomputers that the , of of supercomputers . and challenges the of of


 84%|████████▍ | 84/100 [00:06<00:01, 12.03it/s]

Generated Paragraph:


 the article is “historical review and future challenges in supercomputing of future challenges supercomputers communication” , written by á . the . have . rosillo , j . á . the . and v of matellán . the was published and november of of supercomputing in november 2015 . the . is . history of supercomputers . their birth , and supercomputers , and future challenges in the . , . . á challenges of major supercomputers their covered . the , have . the importance of supercomputers that the , of of supercomputers , and challenges the of of


 92%|█████████▏| 92/100 [00:07<00:00, 12.09it/s]

Generated Paragraph:


 the article is “historical review and future challenges in supercomputing of future challenges scientific communication” , written by á . the . j . rosillo , j . á . the . and v of matellán . it was published and november of of supercomputing in november 2015 . the history is . history of supercomputers . their birth , and supercomputers , and future challenges in the . , . of á challenges of major supercomputers their covered . the , have mention the importance of supercomputers that the history of of supercomputers , along with the history of


100%|██████████| 100/100 [00:08<00:00, 12.49it/s]


In [24]:
#Save the trained result
save_path = str(CONTEXT_SIZE) + '_gram.pth'
torch.save(model.state_dict(), save_path)
#my_loss_list.append(my_loss)

In [25]:
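# This function writes a paragraph that starts with a word of your choice.
# Note: the contexts are still read from the original text (`tokens`), so the chosen word is only prepended and does not influence the predictions.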
def write_paragraph_with_input(model,input_word, number_of_words):
    my_paragraph=["<unk>","<unk>","<unk>",input_word]
    for i in range(number_of_words):
        with torch.no_grad():
            context=torch.tensor(vocab([tokens[i-j-1] for j in range(CONTEXT_SIZE)])).to(device)
            word_inx=torch.argmax(model(context))
            my_paragraph.append(index_to_token[word_inx.detach().item()])

    return my_paragraph
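
For comparison, here is a minimal sketch of autoregressive generation, where each predicted word is fed back in as context so that the starting word actually steers the output. It is not the method used in this notebook. `generate`, `predict_next`, and the toy lookup table are all hypothetical; with the trained model, `predict_next` would be a small wrapper that indexes the context with `vocab`, runs `model`, and takes the `argmax`.

```python
def generate(predict_next, first_word, context_size, number_of_words, pad="<unk>"):
    # Start with padding so that a full context exists for the first prediction
    words = [pad] * context_size + [first_word]
    for _ in range(number_of_words):
        # Most recent word first, matching the context order used for training
        context = words[-1:-context_size - 1:-1]
        words.append(predict_next(context))
    return words[context_size:]   # drop the padding

# Toy predictor that looks only at the most recent word
toy_table = {"the": "article", "article": "reviews", "reviews": "the"}

def toy_predict(context):
    return toy_table.get(context[0], "the")

out = generate(toy_predict, "article", 3, 4)
print(" ".join(out))   # article reviews the article reviews
```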

In [26]:
first_word = input("What word do you want to begin the paragraph with?")
paragraph_output = write_paragraph_with_input(model, first_word, 100)

paragraph_output_str = ""
for i in paragraph_output[3:]:
    paragraph_output_str += i + " "
print(paragraph_output_str)

What word do you want to begin the paragraph with? author


author the article is “historical review and future challenges in supercomputing of future challenges scientific communication” , written by á . the . j . rosillo , j . á . it , and v of matellán . it was published and november of of supercomputing in november 2015 . the history is . history of supercomputers . their birth , and supercomputers , and future challenges in the . , . of á challenges of major supercomputers their covered . the , have mention the importance of supercomputers that support history of of supercomputers , along with the history of 
