This script installs the Natural Language Toolkit (NLTK), a Python library used for natural language processing tasks such as tokenization, tagging, and text prediction.

In [111]:
!pip install nltk



In [45]:
import nltk
from nltk.tokenize import word_tokenize
import string

# Download tokenizer models if not already present
nltk.download('punkt')

# Step 1: Read the input file
with open("cricket.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Step 2: Tokenize and filter only alphabetic words
tokens = word_tokenize(text)
words = [word.lower() for word in tokens if word.isalpha()]  # Keep only words (no punctuation/numbers)

# Step 3: Write each word to a new line
with open("cricket_words.txt", "w", encoding="utf-8") as output_file:
    for word in words:
        output_file.write(word + "\n")

print(f"{len(words)} clean words written to 'cricket_words.txt'")


3424 clean words written to 'cricket_words.txt'


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\varchasva\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


This part of the script imports the libraries needed for deep learning with PyTorch.

It also imports the NLTK tokenizer and stopword list and downloads the punkt and stopwords resources they depend on.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import DataLoader, TensorDataset
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\varchasva\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\varchasva\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

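The training data comes from cricket.txt. The cell below splits it into sentences, removes every non-alphabetic character, lowercases the text, and joins the cleaned sentences with newlines, one sentence per line.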

In [2]:
import re
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

# Step 1: Read the input file
with open("cricket.txt", "r", encoding="utf-8") as f:
    document = f.read()

# Step 2: Split into sentences
sentences = sent_tokenize(document)

# Step 3: Clean each sentence
cleaned_sentences = []
for sentence in sentences:
    # Remove non-alphabetic characters (preserve spaces)
    sentence = re.sub(r'[^a-zA-Z\s]', '', sentence)
    # Replace multiple spaces with a single space
    sentence = re.sub(r'\s+', ' ', sentence)
    # Lowercase and strip
    sentence = sentence.strip().lower()
    if sentence:  # Skip empty sentences
        cleaned_sentences.append(sentence)

# Step 4: Join cleaned sentences with newline
cleaned_document = '\n'.join(cleaned_sentences)

document= cleaned_document



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\varchasva\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
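
To see what this cleaning does to individual sentences, here is a small sketch that applies the same two regular expressions to made-up sample strings (the samples and the helper name clean_sentence are illustrative assumptions, not taken from cricket.txt). Because digits and hyphens are deleted rather than replaced with spaces, a string such as "T20I" collapses to "ti" and "runners-up" becomes "runnersup", which would explain tokens like 'ti' and 'runnersup' in the cleaned document.

```python
import re

def clean_sentence(sentence):
    """Apply the same cleaning steps as the cell above to one sentence."""
    sentence = re.sub(r'[^a-zA-Z\s]', '', sentence)  # drop digits and punctuation
    sentence = re.sub(r'\s+', ' ', sentence)          # collapse whitespace runs
    return sentence.strip().lower()

# Hypothetical sample strings, chosen only to illustrate the effect
samples = ["Test, ODI and T20I status.", "Finishing runners-up in 2021!"]
for s in samples:
    print(repr(clean_sentence(s)))
# 'test odi and ti status'
# 'finishing runnersup in'
```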


In [3]:
document

'the india mens national cricket team also known as men in blue represents india in international cricket\nit is governed by the board of control for cricket in india and is a full member nation of the international cricket council with test odi and ti status\nindia are the current holders of the t world cup the champions trophy and the asia cup\nthe team has played test matches winning losing with draws and tie\nas of may india is ranked fourth in the icc mens test team rankings with rating points\nindia have played in two of the three world test championship finals finishing runnersup in and while finishing third in\ntest rivalries include the bordergavaskar trophy with australia freedom trophy with south africa anthony de mello trophy and pataudi trophy both with england\nthe team has played odi matches winning losing tying and with ending in a noresult\nas of may india is ranked first in the icc mens odi team rankings with rating points\nindia have appeared in the world cup final f

This cell downloads the NLTK resources used in this notebook: the punkt and punkt_tab tokenizer data and the stopwords list. The stopwords list is downloaded, but none of the cells shown here actually removes stopwords.

In [4]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\varchasva\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\varchasva\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\varchasva\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

This script tokenizes the dataset for next-word prediction: the document is lowercased with Python's str.lower() and then split into word tokens with NLTK's word_tokenize.

In [5]:
tokens = word_tokenize(document.lower())

In [6]:
tokens

['the',
 'india',
 'mens',
 'national',
 'cricket',
 'team',
 'also',
 'known',
 'as',
 'men',
 'in',
 'blue',
 'represents',
 'india',
 'in',
 'international',
 'cricket',
 'it',
 'is',
 'governed',
 'by',
 'the',
 'board',
 'of',
 'control',
 'for',
 'cricket',
 'in',
 'india',
 'and',
 'is',
 'a',
 'full',
 'member',
 'nation',
 'of',
 'the',
 'international',
 'cricket',
 'council',
 'with',
 'test',
 'odi',
 'and',
 'ti',
 'status',
 'india',
 'are',
 'the',
 'current',
 'holders',
 'of',
 'the',
 't',
 'world',
 'cup',
 'the',
 'champions',
 'trophy',
 'and',
 'the',
 'asia',
 'cup',
 'the',
 'team',
 'has',
 'played',
 'test',
 'matches',
 'winning',
 'losing',
 'with',
 'draws',
 'and',
 'tie',
 'as',
 'of',
 'may',
 'india',
 'is',
 'ranked',
 'fourth',
 'in',
 'the',
 'icc',
 'mens',
 'test',
 'team',
 'rankings',
 'with',
 'rating',
 'points',
 'india',
 'have',
 'played',
 'in',
 'two',
 'of',
 'the',
 'three',
 'world',
 'test',
 'championship',
 'finals',
 'finishing',
 '

In [7]:
len(tokens)

3517

This script builds a vocabulary for the next-word prediction model by assigning unique integer indices to tokens in the dataset, with an unknown token (<UNK>) initialized in the vocabulary.

In [8]:
vocab={'<UNK>':0}
# Assign the next free index to each distinct token, in order of first appearance
for token in Counter(tokens).keys():
  if token not in vocab:
    vocab[token]=len(vocab)

vocab

{'<UNK>': 0,
 'the': 1,
 'india': 2,
 'mens': 3,
 'national': 4,
 'cricket': 5,
 'team': 6,
 'also': 7,
 'known': 8,
 'as': 9,
 'men': 10,
 'in': 11,
 'blue': 12,
 'represents': 13,
 'international': 14,
 'it': 15,
 'is': 16,
 'governed': 17,
 'by': 18,
 'board': 19,
 'of': 20,
 'control': 21,
 'for': 22,
 'and': 23,
 'a': 24,
 'full': 25,
 'member': 26,
 'nation': 27,
 'council': 28,
 'with': 29,
 'test': 30,
 'odi': 31,
 'ti': 32,
 'status': 33,
 'are': 34,
 'current': 35,
 'holders': 36,
 't': 37,
 'world': 38,
 'cup': 39,
 'champions': 40,
 'trophy': 41,
 'asia': 42,
 'has': 43,
 'played': 44,
 'matches': 45,
 'winning': 46,
 'losing': 47,
 'draws': 48,
 'tie': 49,
 'may': 50,
 'ranked': 51,
 'fourth': 52,
 'icc': 53,
 'rankings': 54,
 'rating': 55,
 'points': 56,
 'have': 57,
 'two': 58,
 'three': 59,
 'championship': 60,
 'finals': 61,
 'finishing': 62,
 'runnersup': 63,
 'while': 64,
 'third': 65,
 'rivalries': 66,
 'include': 67,
 'bordergavaskar': 68,
 'australia': 69,
 'freed

In [9]:
len(vocab)

752

This script splits the input document into individual sentences, preparing the text data for further processing in the next-word prediction model.

In [10]:
input_sentences=document.split('\n')

In [11]:
input_sentences

['the india mens national cricket team also known as men in blue represents india in international cricket',
 'it is governed by the board of control for cricket in india and is a full member nation of the international cricket council with test odi and ti status',
 'india are the current holders of the t world cup the champions trophy and the asia cup',
 'the team has played test matches winning losing with draws and tie',
 'as of may india is ranked fourth in the icc mens test team rankings with rating points',
 'india have played in two of the three world test championship finals finishing runnersup in and while finishing third in',
 'test rivalries include the bordergavaskar trophy with australia freedom trophy with south africa anthony de mello trophy and pataudi trophy both with england',
 'the team has played odi matches winning losing tying and with ending in a noresult',
 'as of may india is ranked first in the icc mens odi team rankings with rating points',
 'india have appea

This function converts a sentence into a sequence of numerical indices based on the vocabulary, using the <UNK> token for unknown words not found in the vocabulary.

In [12]:
def text_indices(sentence, vocab):
  numerical_sentence=[]
  for token in sentence:
    if token not in vocab:
      numerical_sentence.append(vocab['<UNK>'])
    else:
      numerical_sentence.append(vocab[token])
  return numerical_sentence
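
As a quick check of how unknown words are handled, the sketch below runs equivalent lookup logic on a tiny hand-made vocabulary. Both toy_vocab and the helper name lookup_indices are hypothetical and exist only for this illustration; the real vocabulary is much larger. Any token missing from the vocabulary maps to index 0.

```python
def lookup_indices(tokens, vocab):
    """Map each token to its vocabulary index, falling back to <UNK>."""
    return [vocab.get(token, vocab['<UNK>']) for token in tokens]

# Hypothetical toy vocabulary for illustration only
toy_vocab = {'<UNK>': 0, 'india': 1, 'won': 2, 'the': 3, 'match': 4}

print(lookup_indices(['india', 'won', 'the', 'match'], toy_vocab))   # [1, 2, 3, 4]
print(lookup_indices(['england', 'won', 'the', 'toss'], toy_vocab))  # [0, 2, 3, 0]
```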

This script converts the list of input sentences into sequences of numerical indices by tokenizing each sentence, converting to lowercase, and mapping tokens to their corresponding indices in the vocabulary.

In [13]:
input_numerical_sentences = []

for sentence in input_sentences:
  input_numerical_sentences.append(text_indices(word_tokenize(sentence.lower()), vocab))


In [14]:
len(input_numerical_sentences)

169

This script generates training sequences for the next-word prediction model by creating subsequences from each sentence, where each sequence includes progressively more tokens to predict the next word.

In [15]:
training_sequences = []
for sentence in input_numerical_sentences:
  for i in range(1,len(sentence)):
    training_sequences.append(sentence[:i+1])


In [16]:
len(training_sequences)

3348

In [17]:
training_sequences[:9]

[[1, 2],
 [1, 2, 3],
 [1, 2, 3, 4],
 [1, 2, 3, 4, 5],
 [1, 2, 3, 4, 5, 6],
 [1, 2, 3, 4, 5, 6, 7],
 [1, 2, 3, 4, 5, 6, 7, 8],
 [1, 2, 3, 4, 5, 6, 7, 8, 9],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
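
Each of these prefixes later becomes one training example: every token except the last is the model input, and the last token is the target. A minimal sketch on a toy sentence (the words and the helper name prefix_sequences are illustrative; the notebook itself works with integer indices):

```python
def prefix_sequences(tokens):
    """Return every prefix of length >= 2, as in the loop above."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]

toy_sentence = ['india', 'won', 'the', 'match']  # illustrative tokens
for seq in prefix_sequences(toy_sentence):
    context, target = seq[:-1], seq[-1]
    print(context, '->', target)
# ['india'] -> won
# ['india', 'won'] -> the
# ['india', 'won', 'the'] -> match
```

A sentence of n tokens yields n - 1 sequences, which agrees with the counts above: 3517 tokens spread over 169 sentences give 3517 - 169 = 3348 training sequences.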

This script calculates the lengths of all training sequences and finds the maximum sequence length, which can be useful for padding or defining input size for the model.

In [18]:
len_list=[]
for sequence in training_sequences:
  len_list.append(len(sequence))

max(len_list)

63

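This script left-pads the training sequences with zeros so that they all have the same length (the maximum sequence length, 63), preparing the data for input into the model. Note that index 0 is also the index of <UNK> in the vocabulary, so padding positions and unknown words share the same embedding.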

In [19]:
padded_training_sequence=[]
for sequence in training_sequences:
 padded_training_sequence.append([0]*(max(len_list)-len(sequence))+sequence)

In [20]:
len(padded_training_sequence[100])

63
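
The zeros are added on the left, so the real tokens always sit at the end of each row and the target word stays in the last position. A small sketch with made-up sequences (left_pad is a hypothetical helper name, not part of the notebook):

```python
def left_pad(sequences, pad_value=0):
    """Left-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + seq for seq in sequences]

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]  # illustrative indices
for row in left_pad(toy_sequences):
    print(row)
# [0, 0, 1, 2]
# [0, 1, 2, 3]
# [1, 2, 3, 4]
```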

This script converts the padded training sequences into a PyTorch tensor of type long, making it ready for training in a deep learning model.

In [21]:
padded_training_sequence=torch.tensor(padded_training_sequence, dtype=torch.long)

In [22]:
padded_training_sequence.shape

torch.Size([3348, 63])

This script splits the padded training sequences into input (x) and target (y) tensors, where x contains all tokens except the last one (input sequence), and y contains the last token (target word to predict).

In [23]:
x=padded_training_sequence[:, :-1]
y=padded_training_sequence[:, -1]

In [24]:
x.shape

torch.Size([3348, 62])

In [25]:
from torch.utils.data import Dataset, DataLoader

This script defines a custom PyTorch dataset class that stores the input (x) and target (y) sequences, enabling easy batching and access to training data during model training.

In [26]:
class CustomDataset(Dataset):
  def __init__(self, x, y):
    self.x=x
    self.y=y
  def __len__(self):
    return self.x.shape[0]
  def __getitem__(self, idx):
    return self.x[idx], self.y[idx]

This script creates a custom dataset instance, dataset, using the input (x) and target (y) tensors, ready for use in data loading and model training.

In [27]:
dataset=CustomDataset(x,y)

In [28]:
len(dataset)

3348

This script creates a DataLoader for the custom dataset, enabling efficient batch processing and shuffling of the training data with a batch size of 32.

In [29]:
dataloader=DataLoader(dataset=dataset, batch_size=32, shuffle=True)

In [30]:
dataset[1]

(tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2]),
 tensor(3))

This script defines an LSTM-based model for next-word prediction, using an embedding layer for token representation, an LSTM layer for sequence learning, and a fully connected layer for generating predictions based on the final hidden state.

In [31]:
class LSTMmodel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()
    self.emmbedding=nn.Embedding(vocab_size, 100)
    self.lstm=nn.LSTM(100, 150, batch_first=True)
    self.fc=nn.Linear(150, vocab_size)


  def forward(self,x):
    embedded=self.emmbedding(x)
    intermediate_hidden_state, (final_hidden_state, final_cell_state)=self.lstm(embedded)
    output= self.fc(final_hidden_state.squeeze(0))
    return output
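
To make the tensor shapes in forward concrete, the sketch below pushes a random batch through layers of the same kinds and sizes. The batch size of 4 and the random input are assumptions for illustration; the vocabulary size 752, sequence length 62, embedding size 100, and hidden size 150 are taken from this notebook.

```python
import torch
import torch.nn as nn

vocab_size, seq_len, batch_size = 752, 62, 4  # batch_size is an arbitrary choice

embedding = nn.Embedding(vocab_size, 100)
lstm = nn.LSTM(100, 150, batch_first=True)
fc = nn.Linear(150, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # random token indices

embedded = embedding(x)                   # (4, 62, 100)
all_hidden, (h_n, c_n) = lstm(embedded)   # all_hidden: (4, 62, 150), h_n: (1, 4, 150)
logits = fc(h_n.squeeze(0))               # (4, 752): one score per vocabulary word

print(embedded.shape, all_hidden.shape, h_n.shape, logits.shape)
```

For a single-layer, unidirectional LSTM like this one, h_n.squeeze(0) holds the same values as the last time step of all_hidden, so either could feed the linear layer.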

This script creates an instance of the LSTM model, initializing it with the vocabulary size to ensure the model can handle the input data and output the correct word predictions.

In [32]:
model=LSTMmodel(vocab_size=len(vocab))

In [33]:
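import torch
print("CUDA Available:", torch.cuda.is_available())
# Query the GPU name only when CUDA is present; get_device_name raises an error otherwise
if torch.cuda.is_available():
  print("GPU Name:", torch.cuda.get_device_name(0))
print("Device:", torch.device("cuda" if torch.cuda.is_available() else "cpu"))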


CUDA Available: True
GPU Name: NVIDIA GeForce RTX 4060 Laptop GPU
Device: cuda


This script checks if a GPU is available and sets the device to CUDA for faster training, otherwise defaults to using the CPU, and prints the selected device.

In [34]:
device= torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


This script moves the model to the selected device (GPU or CPU) for training, ensuring the model operates on the appropriate hardware.

In [35]:
model.to(device)

LSTMmodel(
  (emmbedding): Embedding(752, 100)
  (lstm): LSTM(100, 150, batch_first=True)
  (fc): Linear(in_features=150, out_features=752, bias=True)
)

This script sets the number of training epochs to 50 and the learning rate to 0.001, which controls the step size of each optimizer update.

In [36]:
epochs=50
learning_rate=0.001

This script defines the loss function as CrossEntropyLoss for multi-class classification and initializes the Adam optimizer with the model parameters and a learning rate of 0.001.

In [37]:
criterion=nn.CrossEntropyLoss()
optimizer=optim.Adam(model.parameters(), lr=learning_rate)

This script trains the LSTM model for the specified number of epochs, using the data from the DataLoader, calculating the loss, performing backpropagation, and updating the model weights using the Adam optimizer.

In [38]:
for epoch in range(epochs):
  total_loss=0
  for batch_x, batch_y in dataloader:
    batch_x, batch_y=batch_x.to(device), batch_y.to(device)
    optimizer.zero_grad()
    output=model(batch_x)
    loss=criterion(output, batch_y)
    loss.backward()
    optimizer.step()
    total_loss+=loss.item()
  print(f"Epoch: {epoch+1}, loss {total_loss:.4f}")

Epoch: 1, loss 610.3102
Epoch: 2, loss 528.2638
Epoch: 3, loss 481.8401
Epoch: 4, loss 434.8520
Epoch: 5, loss 389.8432
Epoch: 6, loss 348.5896
Epoch: 7, loss 309.8247
Epoch: 8, loss 272.7964
Epoch: 9, loss 239.1522
Epoch: 10, loss 208.9039
Epoch: 11, loss 182.0656
Epoch: 12, loss 157.2233
Epoch: 13, loss 136.0817
Epoch: 14, loss 117.6206
Epoch: 15, loss 102.0880
Epoch: 16, loss 88.9078
Epoch: 17, loss 77.6815
Epoch: 18, loss 67.8333
Epoch: 19, loss 59.7445
Epoch: 20, loss 52.6372
Epoch: 21, loss 47.1098
Epoch: 22, loss 42.0567
Epoch: 23, loss 38.1781
Epoch: 24, loss 34.7053
Epoch: 25, loss 31.8248
Epoch: 26, loss 29.3551
Epoch: 27, loss 27.2712
Epoch: 28, loss 25.9398
Epoch: 29, loss 24.1649
Epoch: 30, loss 22.9044
Epoch: 31, loss 22.1136
Epoch: 32, loss 20.9556
Epoch: 33, loss 20.0444
Epoch: 34, loss 19.3484
Epoch: 35, loss 18.7438
Epoch: 36, loss 18.1094
Epoch: 37, loss 17.7281
Epoch: 38, loss 17.2625
Epoch: 39, loss 16.8393
Epoch: 40, loss 16.5399
Epoch: 41, loss 16.4924
Epoch: 42,
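
Note that the loop accumulates loss.item() over all batches, so each printed value is a per-epoch sum rather than a mean. Below is a rough sketch of converting it to a per-batch average, using the dataset size (3348) and batch size (32) from above. The comparison with ln(vocabulary size), the cross-entropy of a uniform guess over 752 words, is added here for orientation and is not part of the original notebook.

```python
import math

num_samples, batch_size, vocab_size = 3348, 32, 752
num_batches = math.ceil(num_samples / batch_size)  # the last batch is smaller

epoch1_sum = 610.3102  # summed loss printed for epoch 1
avg_loss = epoch1_sum / num_batches

print(num_batches)                     # 105
print(round(avg_loss, 2))              # 5.81
print(round(math.log(vocab_size), 2))  # 6.62
```

Since the final batch holds fewer than 32 samples, this per-batch average is only approximately the mean loss per training example.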

This script defines a function for predicting the next word given a text input, by tokenizing, converting to numerical indices, padding the sequence, and using the trained model to generate the next word prediction based on the highest output probability.

In [39]:
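def prediction(model, vocab, text):
  # Tokenize the lowercased input and map tokens to vocabulary indices
  tokenized_text=word_tokenize(text.lower())
  numerical_text=text_indices(tokenized_text, vocab)
  # Left-pad with zeros; note that 23 is hard-coded here, while the training inputs above have length 62
  padded_text=torch.tensor([0]*(23-len(numerical_text))+numerical_text, dtype=torch.long).unsqueeze(0).to(device)
  with torch.no_grad():  # inference only, no gradients needed
    output=model(padded_text)
  value, index=torch.max(output, dim=1)
  # Look up the word whose vocabulary index has the highest score
  return text +" "+ list(vocab.keys())[index.item()]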
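
The lookup through list(vocab.keys()) works because the vocabulary dict was filled in index order, so the position of a key equals its index. An explicit reverse mapping makes that dependency visible. A sketch with a toy vocabulary (toy_vocab, index_to_word, and predicted_index are hypothetical names for illustration):

```python
toy_vocab = {'<UNK>': 0, 'india': 1, 'won': 2}  # hypothetical toy vocabulary
index_to_word = {index: word for word, index in toy_vocab.items()}

predicted_index = 2  # stands in for the argmax of the model output
print(index_to_word[predicted_index])           # won
print(list(toy_vocab.keys())[predicted_index])  # won (same result)
```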





In [42]:
torch.save(model.state_dict(), "nextword_model.pt")


This script predicts the next word after "Hello my name is" by processing the input text through the trained model and returning the input with the predicted next word from the vocabulary appended.

In [40]:
prediction(model, vocab, "Hello my name is")

'Hello my name is governed'

This script generates a sequence of 30 predicted words starting from the input text "Virat Kohli", using the model to predict the next word iteratively and appending it to the input text, with a 0.3-second delay between each prediction.

In [41]:
import time
num_token=30
input_text="Virat Kohli"
for token in range(num_token):
  output=prediction(model, vocab, input_text)
  print(output)
  input_text=output
  time.sleep(0.3)


Virat Kohli became
Virat Kohli became the
Virat Kohli became the first
Virat Kohli became the first player
Virat Kohli became the first player to
Virat Kohli became the first player to be
Virat Kohli became the first player to be player
Virat Kohli became the first player to be player of
Virat Kohli became the first player to be player of the
Virat Kohli became the first player to be player of the tournament
Virat Kohli became the first player to be player of the tournament in
Virat Kohli became the first player to be player of the tournament in back
Virat Kohli became the first player to be player of the tournament in back to
Virat Kohli became the first player to be player of the tournament in back to back
Virat Kohli became the first player to be player of the tournament in back to back editions
Virat Kohli became the first player to be player of the tournament in back to back editions of
Virat Kohli became the first player to be player of the tournament in back to back editions of 

In [43]:
import pickle
with open("vocab.pkl", "wb") as f:
    pickle.dump(vocab, f)
