In [1]:
doc = """
Each word in the vocabulary is represented as a one-hot vector. This creates a sparse matrix but ensures that words are uniquely identifiable.
Embedding layers convert sparse one-hot vectors into dense, low-dimensional vectors capturing semantic meanings of words. This reduces dimensionality and improves training speed.
Preprocessing the text—removing punctuation, converting to lowercase, and handling stop words—ensures consistency in data.
The dataset is split into training, validation, and test sets to evaluate the model's performance and prevent overfitting.
Since LSTMs require sequences of equal length, shorter sequences are padded with zeros or truncated to fit.
Some words may occur more frequently than others. Using techniques like class weighting or sub-sampling balances the dataset.
Analyzing text statistics like word frequency, sequence lengths, and vocabulary distribution helps understand the dataset better.
Normalizing input sequences (e.g., dividing token values by max token value) ensures consistent scaling.
Use undersampling or oversampling if certain words are underrepresented or overrepresented.
The model typically uses categorical_crossentropy as the loss function for multi-class classification (predicting one word out of many).
Accuracy or perplexity can be used to measure the model's performance.
Unknown words are often replaced with a special <OOV> token to handle unseen data during training.
To predict the next word, a fixed-size context window (e.g., the last 5 words) is used as input to the LSTM.
Longer sequences capture more context but increase computational cost, requiring careful selection.
Synonym replacement, shuffling, or paraphrasing increases training data diversity.
Words with low frequency can be oversampled to improve prediction accuracy for rare words.
Dropout helps prevent overfitting in LSTMs during training by randomly deactivating some neurons.
Data is split into mini-batches for faster and more efficient training.
Shuffling sequences during training prevents the model from memorizing patterns in the order of the data.
Sequences are fed into the LSTM step-by-step, preserving temporal relationships between words.
Sequences longer than a predefined limit are truncated to fit memory constraints.
Techniques like random insertion, swapping, or deletion of words introduce variability.
Adjusting the window size allows fine-tuning the level of context captured by the model.
Using embeddings like Word2Vec or GloVe helps the model start with semantic knowledge.
Optimizers like Adam or RMSprop adjust weights efficiently during backpropagation.
Statistical analysis of token frequencies can inform vocabulary curation.
Rare word counts are scaled logarithmically to reduce their impact.
Analyzing word distributions helps identify and mitigate potential biases.
Overlapping sequences ensure every word is part of multiple training sequences.
The choice of embedding dimension (e.g., 100, 300) balances richness and computational cost.
Hidden layers capture dependencies in sequences, improving predictions for longer contexts.
Prevents exploding gradients during backpropagation.
Training is halted once validation loss stops improving, preventing overfitting.
Resampling ensures the target word's frequency matches its importance in the text.
Adjusting learning rates during training improves convergence.
Different languages require custom preprocessing (e.g., stemming or lemmatization).
Incorporating positional encodings can enhance the model's ability to capture order.
Multi-layer LSTMs capture both local and global context.
Adding noise to input data prevents overfitting and improves generalization.
Overlap percentages in sliding windows can be tuned for sequence diversity.
Compute statistics like mean, median, and variance of sequence lengths.
Adding metadata like POS tags or sentence boundaries can enhance model performance.
Pretrain the model on unsupervised tasks like masked word prediction to improve accuracy.
Metrics like perplexity, BLEU, or ROUGE assess next-word prediction performance.
Iteratively train on subsets of data to improve learning on difficult patterns.
"""

In [2]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
tokenizer = Tokenizer()

In [4]:
tokenizer.fit_on_texts([doc])

In [5]:
tokenizer.word_index

{'the': 1,
 'to': 2,
 'or': 3,
 'of': 4,
 'and': 5,
 'words': 6,
 'training': 7,
 'sequences': 8,
 'word': 9,
 'like': 10,
 'in': 11,
 'are': 12,
 'data': 13,
 'is': 14,
 'model': 15,
 'can': 16,
 'during': 17,
 'a': 18,
 'for': 19,
 'into': 20,
 'performance': 21,
 'overfitting': 22,
 'with': 23,
 'helps': 24,
 'e': 25,
 'g': 26,
 'token': 27,
 'by': 28,
 'context': 29,
 'capture': 30,
 'vocabulary': 31,
 'as': 32,
 'one': 33,
 'ensures': 34,
 'improves': 35,
 'dataset': 36,
 "model's": 37,
 'lstms': 38,
 'more': 39,
 'frequency': 40,
 'sequence': 41,
 'input': 42,
 'accuracy': 43,
 'be': 44,
 'longer': 45,
 'improve': 46,
 'prediction': 47,
 'prevents': 48,
 'on': 49,
 'hot': 50,
 'this': 51,
 'sparse': 52,
 'but': 53,
 'embedding': 54,
 'layers': 55,
 'vectors': 56,
 'low': 57,
 'semantic': 58,
 'preprocessing': 59,
 'split': 60,
 'validation': 61,
 'prevent': 62,
 'require': 63,
 'truncated': 64,
 'fit': 65,
 'some': 66,
 'than': 67,
 'using': 68,
 'techniques': 69,
 'class': 70,
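
The index above is ranked by frequency: the most common word gets index 1, and index 0 is left free for padding. The tokens 'e' and 'g' come from "e.g." being split at the periods. A minimal sketch of that ranking in plain Python follows; the helper `build_word_index` and the `FILTERS` string are illustrative assumptions meant to mirror the Tokenizer defaults, not the library's actual implementation.

```python
from collections import Counter

# Punctuation replaced by spaces before splitting (assumed to match the Keras default filters).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # most_common() orders by count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_index = build_word_index(["The model, e.g. the LSTM, predicts the next word."])
print(toy_index)
```

Note that the apostrophe is not in the filter string, which is why "model's" survives as a single token in the index above.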
 

In [12]:
input_sequences = []
for sentence in doc.split('\n'):
    # Convert the sentence to a list of integer token ids
    tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
    # Emit every prefix of length >= 2; the last id of each prefix is the word to predict
    for i in range(1, len(tokenized_sentence)):
        input_sequences.append(tokenized_sentence[:i+1])
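
The loop above turns each sentence into a set of growing prefixes, so a sentence of n tokens yields n - 1 training examples. A small standalone sketch of the same idea, using a hypothetical helper `prefix_sequences` on a short list of token ids:

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2; the last id of each prefix is the target word."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

prefixes = prefix_sequences([98, 9, 11, 1])
for p in prefixes:
    print(p)
```

A sentence with fewer than two tokens (including the empty lines produced by splitting on '\n') contributes no examples.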

In [14]:
input_sequences

[[98, 9],
 [98, 9, 11],
 [98, 9, 11, 1],
 [98, 9, 11, 1, 31],
 [98, 9, 11, 1, 31, 14],
 [98, 9, 11, 1, 31, 14, 99],
 [98, 9, 11, 1, 31, 14, 99, 32],
 [98, 9, 11, 1, 31, 14, 99, 32, 18],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50, 100],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50, 100, 51],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50, 100, 51, 101],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50, 100, 51, 101, 18],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50, 100, 51, 101, 18, 52],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50, 100, 51, 101, 18, 52, 102],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50, 100, 51, 101, 18, 52, 102, 53],
 [98, 9, 11, 1, 31, 14, 99, 32, 18, 33, 50, 100, 51, 101, 18, 52, 102, 53, 34],
 [98,
  9,
  11,
  1,
  31,
  14,
  99,
  32,
  18,
  33,
  50,
  100,
  51,
  101,
  18,
  52,
  102,
  53,
  34,
  103],
 [98,
  9,
  11,
  1,
  31,
  14,
  99,
  32,
  18,
  33,
  50,
  100,
  51,
  101,


In [15]:
max_len = max([len(x) for x in input_sequences])

In [16]:
max_len

24

In [17]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_input_sequences = pad_sequences(input_sequences, maxlen = max_len, padding='pre')

In [18]:
padded_input_sequences

array([[  0,   0,   0, ...,   0,  98,   9],
       [  0,   0,   0, ...,  98,   9,  11],
       [  0,   0,   0, ...,   9,  11,   1],
       ...,
       [  0,   0,   0, ...,  46,  95,  49],
       [  0,   0,   0, ...,  95,  49, 310],
       [  0,   0,   0, ...,  49, 310,  89]])
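
Pre-padding puts the zeros on the left, so the real tokens always sit at the end of each row. The sketch below reproduces that behaviour in plain Python for illustration; `pad_pre` is a hypothetical helper, not the Keras function, and it assumes `maxlen >= 1`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` tokens if it is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front, keeping the most recent tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[98, 9], [98, 9, 11], [1, 2, 3, 4, 5, 6]], maxlen=5)
for row in rows:
    print(row)
```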

In [19]:
X = padded_input_sequences[:,:-1]

In [20]:
y = padded_input_sequences[:,-1]

In [21]:
X

array([[  0,   0,   0, ...,   0,   0,  98],
       [  0,   0,   0, ...,   0,  98,   9],
       [  0,   0,   0, ...,  98,   9,  11],
       ...,
       [  0,   0,   0, ...,   2,  46,  95],
       [  0,   0,   0, ...,  46,  95,  49],
       [  0,   0,   0, ...,  95,  49, 310]])

In [23]:
X.shape

(541, 23)

In [26]:
y.shape

(541,)
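
Because the padding is on the left, the last column of every row is always a real word, which makes it usable as the target. A minimal sketch of the split on plain lists, with a hypothetical helper `split_inputs_targets`:

```python
def split_inputs_targets(padded_rows):
    """Use all but the last token as input and the last token as the label."""
    inputs = [row[:-1] for row in padded_rows]
    targets = [row[-1] for row in padded_rows]
    return inputs, targets

toy_rows = [[0, 0, 98, 9], [0, 98, 9, 11]]
toy_X, toy_y = split_inputs_targets(toy_rows)
print(toy_X, toy_y)
```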

In [27]:
X.max()

310

In [28]:
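from tensorflow.keras.utils import to_categorical
# 310 vocabulary words plus index 0 (reserved for padding) gives 311 classes
y = to_categorical(y, num_classes=len(tokenizer.word_index) + 1)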

In [31]:
y.shape

(541, 311)
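
Each label becomes a row of 311 values with a single 1 at the position of the target word. The sketch below shows the same transformation on plain lists; `one_hot` is a hypothetical helper written for illustration only.

```python
def one_hot(labels, num_classes):
    """Return one row per label, with 1.0 at the label's index and 0.0 elsewhere."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([9, 11, 1], num_classes=12)
print(encoded[0])
```

With 541 examples and 311 classes the matrix is small. For a larger vocabulary, one option is to keep the integer labels and train with the `sparse_categorical_crossentropy` loss instead, which avoids building the one-hot matrix.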

In [32]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [45]:
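model = Sequential()

# Map each of the 311 token ids to a 20-dimensional dense vector
model.add(Embedding(311, 20, input_length=23))

# A single LSTM layer with 200 units summarizes the 23-token context
model.add(LSTM(200))

# Softmax yields a probability distribution over the vocabulary, as categorical_crossentropy expects
model.add(Dense(311, activation='softmax'))
model.build(input_shape=(None, 23))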

In [46]:
model.summary()
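
For next-word prediction the output layer should assign one probability to each word, and those probabilities should sum to 1. A softmax does that, while independent sigmoids score each word separately and do not form a distribution. The following sketch compares the two on a toy vector of logits; the function names and values are illustrative assumptions.

```python
import math

def softmax(logits):
    """Exponentiate and normalize so the outputs sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each logit independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
gates = sigmoid(logits)
print(sum(probs), sum(gates))
```

Both functions rank the words in the same order, so `argmax` picks the same word either way; the difference matters for the loss, which assumes a proper distribution.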

In [47]:
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

In [48]:
model.fit(X,y,epochs=100)

Epoch 1/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - accuracy: 0.0194 - loss: 5.7340
Epoch 2/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 28ms/step - accuracy: 0.0541 - loss: 5.4867
Epoch 3/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step - accuracy: 0.0278 - loss: 5.4106
Epoch 4/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 28ms/step - accuracy: 0.0268 - loss: 5.3939
Epoch 5/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 27ms/step - accuracy: 0.0416 - loss: 5.2899
Epoch 6/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 27ms/step - accuracy: 0.0486 - loss: 5.2641
Epoch 7/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 28ms/step - accuracy: 0.0551 - loss: 5.2254
Epoch 8/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 27ms/step - accuracy: 0.0420 - loss: 5.2575
Epoch 9/100
17/17 ━━━━━━━━━

17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 32ms/step - accuracy: 0.9637 - loss: 0.5016
Epoch 69/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 32ms/step - accuracy: 0.9682 - loss: 0.4484
Epoch 70/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 31ms/step - accuracy: 0.9739 - loss: 0.4263
Epoch 71/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 33ms/step - accuracy: 0.9855 - loss: 0.3777
Epoch 72/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 30ms/step - accuracy: 0.9851 - loss: 0.3839
Epoch 73/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step - accuracy: 0.9815 - loss: 0.3439
Epoch 74/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 32ms/step - accuracy: 0.9833 - loss: 0.3564
Epoch 75/100
17/17 ━━━━━━━━━━━━━━━━━━━━ 1s 31ms/step - accuracy: 0.9858 - loss: 0.3355
Epoch 76/100
17/17 ━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x210d702ac90>
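
The loss printed during training is a mean cross-entropy measured with natural logarithms, so it can be converted to perplexity by exponentiating it. The epoch 1 loss of 5.7340 is close to ln(311) ≈ 5.74, which is the loss of a model that guesses uniformly over the 311 classes. A minimal sketch of the conversion, with a hypothetical helper `perplexity`:

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

uniform_loss = math.log(311)  # loss of a uniform guess over 311 classes
print(perplexity(uniform_loss))
print(perplexity(0.3355))     # training loss reported at epoch 75
```

These numbers are computed on the training data only, since no validation set was passed to `fit`, so they indicate how well the model has memorized the corpus rather than how well it generalizes.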

In [55]:
text = 'e g'
token_text = tokenizer.texts_to_sequences([text])[0]
# Pad to the model's input length (max_len - 1 = 23), matching the shape of X
padded_token_text = pad_sequences([token_text], maxlen=max_len - 1, padding='pre')

In [56]:
import numpy as np
np.argmax(model.predict(padded_token_text))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step


268

In [None]:
# Tokenize the seed text and pre-pad it to the model's input length (max_len - 1 = 23)
token_text = tokenizer.texts_to_sequences([text])[0]
padded_token_text = pad_sequences([token_text], maxlen=max_len - 1, padding='pre')
np.argmax(model.predict(padded_token_text))

In [58]:
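import numpy as np
text = "Techniques"

for i in range(10):
    # Tokenize the text generated so far
    token_text = tokenizer.texts_to_sequences([text])[0]
    # Pre-pad to the model's input length (max_len - 1 = 23)
    padded_token_text = pad_sequences([token_text], maxlen=max_len - 1, padding='pre')
    # Greedy decoding: pick the index of the most probable next word
    pos = int(np.argmax(model.predict(padded_token_text)))
    # Map the index back to its word (index 0 is padding and has no entry)
    word = tokenizer.index_word.get(pos)
    if word is not None:
        text = text + " " + word
        print(text)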

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 71ms/step
Techniques like
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 71ms/step
Techniques like random
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 72ms/step
Techniques like random insertion
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 74ms/step
Techniques like random insertion swapping
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 73ms/step
Techniques like random insertion swapping or
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 84ms/step
Techniques like random insertion swapping or deletion
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 85ms/step
Techniques like random insertion swapping or deletion of
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 63ms/step
Techniques like random insertion swapping or deletion of words
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
Techniques like ran
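
The generation loop is a form of greedy decoding: at each step the most probable word is appended and the extended text is fed back in. The sketch below isolates that logic from the model so it can be run without TensorFlow. `generate` and `toy_predict` are hypothetical stand-ins, with `toy_predict` playing the role of `model.predict` on a 5-word vocabulary.

```python
def generate(seed_ids, predict_probs, steps, context_len):
    """Greedy decoding: repeatedly append the highest-probability next token id."""
    ids = list(seed_ids)
    for _ in range(steps):
        context = ids[-context_len:]                             # keep the most recent tokens
        context = [0] * (context_len - len(context)) + context   # pre-pad with zeros
        probs = predict_probs(context)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == 0:                                         # index 0 is padding: stop
            break
        ids.append(next_id)
    return ids

def toy_predict(context):
    """Stand-in for a trained model: always predicts (last id + 1) modulo 5."""
    probs = [0.0] * 5
    probs[(context[-1] + 1) % 5] = 1.0
    return probs

generated = generate([1], toy_predict, steps=3, context_len=4)
print(generated)
```

Greedy decoding always returns the same continuation for a given seed. Sampling from the predicted distribution instead of taking the argmax is a common way to get more varied text.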