## Next Word Prediction Model

Next word prediction models are used in applications such as messaging apps, search engines, virtual assistants, and smartphone autocorrect.


Next word prediction is a language modelling task in machine learning: given an input context, the model predicts the most probable word (or sequence of words) that follows. It learns statistical patterns and linguistic structure from text and uses them to rank candidate continuations of the context.
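To make "statistical patterns" concrete before moving to the neural model, here is a minimal sketch of the simplest possible predictor: a bigram counter that suggests whichever word most often followed the current one. The function names and the toy corpus are illustrative assumptions, not part of the notebook.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy corpus (hypothetical): "the" is followed by "cat" twice and "mat" once.
bigram_counts = train_bigram("the cat sat on the mat and the cat slept")
print(predict_next(bigram_counts, "the"))
```

A bigram model only looks one word back; the LSTM built later in this notebook conditions on the whole preceding sequence.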

Steps:

1. Start by collecting a diverse dataset of text documents.
2. Preprocess the data by cleaning and tokenizing it.
3. Prepare the data by creating input-output pairs.
4. Engineer features such as word embeddings.
5. Select an appropriate model, such as an LSTM or GPT.
6. Train the model on the dataset while adjusting hyperparameters.
7. Improve the model by experimenting with different techniques and architectures.


In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [2]:
#Read the text file
with open('book/sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [3]:
#tokenize the text and build the word index; the +1 below accounts for index 0, which is reserved for padding
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
print(total_words)

8200
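The `+ 1` in `total_words` follows from how the word index is numbered: words are ranked by frequency starting at 1, so index 0 is never assigned to a word and stays free for padding. The sketch below imitates that numbering in plain Python. It is a simplified stand-in, assuming whitespace splitting only (the Keras `Tokenizer` also strips punctuation), and `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_index = build_word_index("to be or not to be")
# Index 0 is unused by any word, so the vocabulary size is one more
# than the number of distinct words.
toy_total_words = len(toy_index) + 1
print(toy_index, toy_total_words)
```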


In [4]:
#create input-output pairs: every prefix (length >= 2) of each line becomes one training sequence
input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
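To see what the inner loop above produces, the following sketch applies the same prefix logic to a single toy token list. The token ids are made up for illustration, and `prefix_sequences` is a hypothetical helper name.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

toy_tokens = [4, 9, 2, 7]  # hypothetical token ids for one line of text
for seq in prefix_sequences(toy_tokens):
    # Everything but the last token is the context; the last token is the target.
    print(seq[:-1], "->", seq[-1])
```

A line with n tokens therefore contributes n - 1 training sequences, and a line with a single token contributes none.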

In [5]:
# input_sequences

In [6]:
#pad the input sequences on the left with zeros so they all have the same length
max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

In [7]:
input_sequences

array([[   0,    0,    0, ...,    0,    1, 1561],
       [   0,    0,    0, ...,    1, 1561,    5],
       [   0,    0,    0, ..., 1561,    5,  129],
       ...,
       [   0,    0,    0, ...,    1, 8198, 8199],
       [   0,    0,    0, ..., 8198, 8199, 3187],
       [   0,    0,    0, ..., 8199, 3187, 3186]], dtype=int32)
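The leading zeros in the array above come from `padding='pre'`. As a rough illustration of what that option does, here is a NumPy-only sketch of left-padding. `pad_pre` is a hypothetical helper written for this example, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros to length maxlen."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]          # if too long, keep the last maxlen tokens
        if seq:                      # guard: an empty slice would target the whole row
            padded[row, -len(seq):] = seq
    return padded

print(pad_pre([[4, 9], [4, 9, 2]], maxlen=3))
```

Padding on the left keeps the target word in the last column of every row, which is what makes the simple column slice in the next cell possible.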

In [8]:
#splitting the sequence into input and output
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

In [9]:
#convert output to one-hot encoding vectors (one column per vocabulary index)
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
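The split and the one-hot step can be checked by hand on a tiny array. The sketch below uses NumPy only, with `np.eye` standing in for `to_categorical`; the toy sequences and the vocabulary size of 10 are assumptions chosen for readability.

```python
import numpy as np

toy_sequences = np.array([[0, 4, 9],
                          [4, 9, 2]])   # already left-padded, hypothetical ids
toy_vocab_size = 10                     # hypothetical vocabulary size

toy_X = toy_sequences[:, :-1]   # all columns except the last: the context
toy_y = toy_sequences[:, -1]    # last column: the word to predict

# Row i of the identity matrix is the one-hot vector for class i.
toy_y_onehot = np.eye(toy_vocab_size, dtype=np.float32)[toy_y]

print(toy_X)
print(toy_y)
print(toy_y_onehot.shape)
```

With the real data, the one-hot matrix has one row per training sequence and `total_words` columns, so it can become large. Keeping `y` as integer labels and training with Keras's `sparse_categorical_crossentropy` loss is a common alternative, though the notebook does not use it.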

In [13]:
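#neural network architecture for model training
length = max_sequence_len - 1
model = Sequential()
# Keras 3 (bundled with TensorFlow 2.16) no longer accepts Embedding's input_length argument;
# the original call Embedding(total_words, 100, input_length=length) raises the
# ValueError recorded below. Declaring the input shape separately avoids it.
model.add(tf.keras.Input(shape=(length,)))
model.add(Embedding(total_words, 100))
model.add(LSTM(150))
model.add(Dense(total_words, activation='softmax'))
model.summary()  # summary() prints the table itself and returns None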

ValueError: Unrecognized keyword arguments passed to Embedding: {'input_length': 17}
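Once a model with this architecture has been built and trained, its softmax layer returns one probability per vocabulary index, and the predicted word is the one with the highest probability. The sketch below shows only that last decoding step, using NumPy and a made-up probability vector in place of a real model output. `most_probable_word` and the toy mapping are hypothetical; with the Keras `Tokenizer`, the `index_word` dictionary plays the role of the mapping.

```python
import numpy as np

def most_probable_word(probabilities, index_to_word):
    """Return the word whose index has the highest probability.

    Returns None if the best index has no word (e.g. index 0, the padding slot).
    """
    best_index = int(np.argmax(probabilities))
    return index_to_word.get(best_index)

toy_index_to_word = {1: "the", 2: "cat", 3: "sat"}   # hypothetical vocabulary
toy_probs = np.array([0.05, 0.10, 0.60, 0.25])       # stand-in for a softmax output
print(most_probable_word(toy_probs, toy_index_to_word))
```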

In [None]:
pip uninstall tensorflow


Found existing installation: tensorflow 2.16.1
Uninstalling tensorflow-2.16.1:
  Would remove:
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/bin/import_pb_to_tensorboard
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/bin/saved_model_cli
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/bin/tensorboard
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/bin/tf_upgrade_v2
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/bin/tflite_convert
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/bin/toco
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/bin/toco_from_protos
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/lib/python3.12/site-packages/tensorflow-2.16.1.dist-info/*
    /opt/homebrew/Cellar/jupyterlab/4.1.5/libexec/lib/python3.12/site-packages/tensorflow/*
Proceed (Y/n)? 