## Preparing the Dataset

In [3]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

# load data
with open('sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()

tokens = word_tokenize(text)
print("Total Tokens:", len(tokens))

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lokeshdash/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/lokeshdash/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Total Tokens: 125731


#### Here, we converted the text to lowercase so that, for example, "The" and "the" map to the same token, and used word_tokenize to split the entire corpus into word-level tokens (punctuation marks become tokens of their own). This turns the raw text into a list of 125,731 tokens, the structured form that the rest of the pipeline builds on.
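
#### To make word-level tokens concrete, here is a minimal sketch on a single sentence. It uses a simple regular expression as a stand-in for word_tokenize (an assumption, so that the snippet runs without downloading NLTK data); the helper name simple_tokenize is hypothetical, and NLTK's tokenizer handles contractions and other edge cases differently.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: lowercase the text,
    # then split it into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

sample = "So, are we really at the end?"
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
print("Total Tokens:", len(sample_tokens))
```

#### Note that the comma and the question mark each count as a token, which is why the token count of a corpus is higher than its word count.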

## Creating a Vocabulary

In [53]:
from collections import Counter

# Count word frequencies and order the vocabulary from most to least frequent
word_counts = Counter(tokens)
vocab = sorted(word_counts, key=word_counts.get, reverse=True)
vocab.append("<UNK>")  # reserve an index for words not seen in training

word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}
vocab_size = len(vocab)  # computed after adding <UNK> so the model covers every index

#### Here, we counted how often each word appears using Counter, then sorted the vocabulary from most to least frequent. This assigns lower indices to more common words, a common convention, though the model itself does not depend on the ordering. We also added an <UNK> token so that words outside the training vocabulary can still be encoded. Then, we created word2idx and idx2word dictionaries to convert words to unique IDs and back. Finally, we stored the total vocabulary size, which sets the size of the model's embedding table and output layer.
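
#### The mapping is easier to see on a toy example. The sketch below uses a made-up token list and assumes the <UNK> token is appended after the frequency-sorted words; all names prefixed with toy_ are illustrative only.

```python
from collections import Counter

toy_tokens = ["the", "dog", "saw", "the", "cat", "and", "the", "dog"]

# Count occurrences and order words from most to least frequent.
toy_counts = Counter(toy_tokens)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab.append("<UNK>")  # reserve one index for unseen words

toy_word2idx = {word: idx for idx, word in enumerate(toy_vocab)}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

# The most frequent words receive the lowest indices.
print(toy_word2idx["the"], toy_word2idx["dog"])

# A word that never appeared falls back to the <UNK> index.
print(toy_word2idx.get("horse", toy_word2idx["<UNK>"]))
```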

## Building Input-Output Sequences

In [26]:
import torch

sequence_length = 4
data = []
for i in range(len(tokens) - sequence_length + 1):  # + 1 so the final window is included
    input_seq = tokens[i:i + sequence_length - 1]
    target = tokens[i + sequence_length - 1]
    data.append((input_seq, target))

def encode(seq):
    return [word2idx[word] for word in seq]

encoded_data = [(torch.tensor(encode(inp)), torch.tensor(word2idx[target])) for inp, target in data]


#### Here, we used a sliding window to generate training samples: with sequence_length = 4, each sample consists of 3 consecutive words (the input) and the word that follows them (the target). Moving the window forward one token at a time yields one sample per position in the corpus.
#### Then, we defined an encode function to convert each word in a sequence into its corresponding index using our vocabulary. Finally, we built encoded_data, a list of (input_tensor, target_tensor) pairs, where each input is a tensor of three word indices and the target is the index of the next word to be predicted.
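
#### As a small illustration of the sliding window, the sketch below applies the same idea to six made-up tokens. The loop bound is chosen so that the final window is included; the variable names are illustrative only.

```python
toy_tokens = ["to", "sherlock", "holmes", "she", "is", "always"]
window = 4  # three input words plus one target word

pairs = []
for i in range(len(toy_tokens) - window + 1):
    context = toy_tokens[i:i + window - 1]   # three consecutive words
    target = toy_tokens[i + window - 1]      # the word that follows them
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```

#### Six tokens produce three (input, target) pairs, and consecutive pairs overlap in two of their three input words.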

## Designing the Model Architecture

In [3]:
import torch.nn as nn

class PredictiveKeyboard(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        output, _ = self.lstm(x)
        output = self.fc(output[:, -1, :])  # last LSTM output
        return output

#### This class defines our neural network model. First, the Embedding layer converts word indices into dense vectors. These embeddings are then passed through an LSTM layer, which captures the sequential context of the input.

#### Finally, we take the output of the last time step and feed it through a Linear layer to get a vector of size vocab_size. This vector holds one raw score (logit) per word in the vocabulary; the scores become probabilities only after a softmax is applied. This architecture allows the model to learn patterns and dependencies in word sequences for next-word prediction.
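
#### To follow how the tensor shapes change through the three layers, here is a minimal sketch that builds the layers separately with deliberately tiny sizes. The sizes and the input indices are assumptions chosen for readability, not the values used in training.

```python
import torch
import torch.nn as nn

# Toy sizes chosen only for illustration.
toy_vocab_size, embed_dim, hidden_dim = 10, 4, 8

embedding = nn.Embedding(toy_vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, toy_vocab_size)

x = torch.tensor([[1, 5, 7], [2, 0, 9]])  # batch of 2 sequences, 3 word indices each
embedded = embedding(x)                   # (2, 3, 4): one vector per word
lstm_out, _ = lstm(embedded)              # (2, 3, 8): one hidden state per time step
logits = fc(lstm_out[:, -1, :])           # (2, 10): one score per vocabulary word

print(embedded.shape, lstm_out.shape, logits.shape)
```

#### Slicing with [:, -1, :] keeps only the hidden state after the third word, so each input sequence yields exactly one row of scores.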

## Training the Model

In [47]:
import torch
import torch.optim as optim
import random

model = PredictiveKeyboard(vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.005)

epochs = 20
for epoch in range(epochs):
    total_loss = 0
    random.shuffle(encoded_data)
    for input_seq, target in encoded_data[:10000]:  # Limit data for speed
        input_seq = input_seq.unsqueeze(0)
        output = model(input_seq)
        loss = criterion(output, target.unsqueeze(0))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

Epoch 1, Loss: 3.6774
Epoch 2, Loss: 2.1599
Epoch 3, Loss: 1.0939
Epoch 4, Loss: 0.5619
Epoch 5, Loss: 0.1868
Epoch 6, Loss: 0.0670
Epoch 7, Loss: 0.0304
Epoch 8, Loss: 0.0176
Epoch 9, Loss: 0.0093
Epoch 10, Loss: 0.0053
Epoch 11, Loss: 0.0033
Epoch 12, Loss: 0.0021
Epoch 13, Loss: 0.0016
Epoch 14, Loss: 0.0013
Epoch 15, Loss: 0.0010
Epoch 16, Loss: 0.0009
Epoch 17, Loss: 0.0008
Epoch 18, Loss: 0.0007
Epoch 19, Loss: 0.0006
Epoch 20, Loss: 0.0005


#### Here, we instantiated the model, defined a loss function (CrossEntropyLoss), and used the Adam optimizer for the gradient updates. At the start of each epoch, we shuffled the dataset and trained on the first 10,000 samples, so each epoch sees a different random subset of the data. Samples were processed one at a time (a batch size of 1): we added a batch dimension to the input, computed the output, and calculated the loss between the predicted scores and the actual next-word index.

#### Then we performed backpropagation, updated the weights, and accumulated the total loss for tracking. This loop trains the model to predict the next word based on the previous sequence.
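
#### To see what the loss measures, the sketch below computes cross-entropy by hand for a single sample: a softmax over the raw scores followed by the negative log-probability of the correct word, which is the quantity CrossEntropyLoss computes for one sample. The function name and the scores are made up for illustration.

```python
import math

def cross_entropy(logits, target_index):
    # Softmax turns raw scores into probabilities that sum to 1;
    # the loss is the negative log-probability of the correct word.
    # Subtracting the max first is a standard trick for numerical stability.
    highest = max(logits)
    exps = [math.exp(score - highest) for score in logits]
    prob_of_target = exps[target_index] / sum(exps)
    return -math.log(prob_of_target)

toy_logits = [2.0, 0.5, -1.0]  # hypothetical scores for a 3-word vocabulary

print(cross_entropy(toy_logits, 0))  # lower loss: the correct word has the highest score
print(cross_entropy(toy_logits, 2))  # higher loss: the correct word has the lowest score
```

#### The loss approaches zero only when the model assigns nearly all of the probability to the correct next word.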

## Predicting the Next Words

In [65]:
import torch.nn.functional as F

def encode(seq):
    return [word2idx.get(word, word2idx["<UNK>"]) for word in seq]

def suggest_next_words(model, text_prompt, top_k=3):
    model.eval()
    tokens = word_tokenize(text_prompt.lower())
    if len(tokens) < sequence_length - 1:
        raise ValueError(f"Input should be at least {sequence_length - 1} words long.")

    input_seq = tokens[-(sequence_length - 1):]
    input_tensor = torch.tensor(encode(input_seq)).unsqueeze(0)

    with torch.no_grad():
        output = model(input_tensor)
        probs = F.softmax(output, dim=1).squeeze()
        top_indices = torch.topk(probs, top_k).indices.tolist()

    return [idx2word[idx] for idx in top_indices]

print("Suggestions:", suggest_next_words(model, "So, are we really at"))

Suggestions: ['this', 'predict', 'going']


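#### This function takes a user input like “So, are we really at”, tokenizes it, and encodes the last three tokens (sequence_length - 1), mapping any word outside the vocabulary to <UNK>. The encoded sequence is then passed through the trained model to get one output score per vocabulary word.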

#### These scores are then converted into probabilities using softmax, and the top k predictions (like the three most probable next words) are selected using torch.topk. The function then maps these indices back to actual words using idx2word, mimicking the behaviour of a real predictive keyboard.
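
#### The softmax and top-k steps can be sketched in plain Python on a handful of made-up scores. The vocabulary, the scores, and the helper name top_k_words are hypothetical; the real function uses F.softmax and torch.topk instead.

```python
import math

def top_k_words(scores, idx_to_word, k=3):
    # Softmax: exponentiate (shifted by the max for stability) and normalise.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank indices by probability, highest first, and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(idx_to_word[i], probs[i]) for i in ranked[:k]]

toy_idx2word = {0: "the", 1: "home", 2: "last", 3: "<UNK>"}  # hypothetical vocabulary
toy_scores = [1.2, 3.1, 0.3, -2.0]                            # hypothetical model output

suggestions = top_k_words(toy_scores, toy_idx2word, k=2)
print([word for word, _ in suggestions])
```

#### Because softmax preserves the ordering of the scores, the top k words are the same whether they are chosen from the raw scores or from the probabilities; the softmax matters only when the probabilities themselves are shown or compared.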