<a href="https://colab.research.google.com/github/azizhina51-svg/NLP/blob/main/RNN_Word_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# TensorFlow is used to build and train the neural network
import tensorflow as tf

# NumPy is used for numerical operations and array handling
import numpy as np


In [None]:
# A small text dataset for next-word prediction
# It holds a single paragraph as one string, which is split into shorter training sequences later
sentences = [
    """ Today, the issues of global warming and significant climate
change are extremely relevant. They are discussed not only by scientists and
politicians, but also by ordinary citizens. It must be understood that this
problem really deserves extensive attention. Numerous studies have long
confirmed that warming does have an impact on the environment, even at
the regional level. If we leave some thoughts about extrapolation in the
future, then the usually cited facts on local effects are easily verified by local
residents, to a greater extent when it comes to melting permafrost or warm
winters. In addition, biospheric effects relating to individual organisms are
observed by all of us at the household level and therefore do not raise
questions. In the modern world, the climate is changing under the influence
of natural and anthropogenic factors. It seems to us that every person should
want to preserve the natural conditions in which we live. Within the
framework of this article, it is proposed to consider in more detail how this
can be done. """
]


In [None]:
# Tokenizer converts words into integer IDs
tokenizer = tf.keras.preprocessing.text.Tokenizer()

# Learn the vocabulary from the sentences
tokenizer.fit_on_texts(sentences)

# Dictionary mapping words to their integer index
word_index = tokenizer.word_index

# Vocabulary size (+1 because index 0 is reserved for padding)
vocab_size = len(word_index) + 1

# Print results to understand what happened
print("Word Index:", word_index)
print("Vocabulary Size:", vocab_size)


Word Index: {'the': 1, 'to': 2, 'in': 3, 'of': 4, 'and': 5, 'are': 6, 'by': 7, 'it': 8, 'that': 9, 'this': 10, 'warming': 11, 'climate': 12, 'not': 13, 'be': 14, 'have': 15, 'on': 16, 'at': 17, 'level': 18, 'we': 19, 'local': 20, 'effects': 21, 'us': 22, 'is': 23, 'natural': 24, 'today': 25, 'issues': 26, 'global': 27, 'significant': 28, 'change': 29, 'extremely': 30, 'relevant': 31, 'they': 32, 'discussed': 33, 'only': 34, 'scientists': 35, 'politicians': 36, 'but': 37, 'also': 38, 'ordinary': 39, 'citizens': 40, 'must': 41, 'understood': 42, 'problem': 43, 'really': 44, 'deserves': 45, 'extensive': 46, 'attention': 47, 'numerous': 48, 'studies': 49, 'long': 50, 'confirmed': 51, 'does': 52, 'an': 53, 'impact': 54, 'environment': 55, 'even': 56, 'regional': 57, 'if': 58, 'leave': 59, 'some': 60, 'thoughts': 61, 'about': 62, 'extrapolation': 63, 'future': 64, 'then': 65, 'usually': 66, 'cited': 67, 'facts': 68, 'easily': 69, 'verified': 70, 'residents': 71, 'a': 72, 'greater': 73, 'exte
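
The word index above starts at 1 and gives the lowest IDs to the most frequent words, with ties kept in first-seen order. A minimal pure-Python sketch of that ranking follows. It assumes a simplified tokenization rule (lowercase, keep letters, digits and apostrophes), and `build_word_index` is a hypothetical helper, not part of Keras; the real Tokenizer applies its own filter list.

```python
import re
from collections import Counter


def build_word_index(texts):
    """Approximate a frequency-ranked word index (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters, digits and apostrophes as words
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # Sort by descending count; the sort is stable, so ties keep first-seen order
    ranked = sorted(counts, key=lambda word: -counts[word])
    # IDs start at 1 so that 0 stays free for padding
    return {word: rank for rank, word in enumerate(ranked, start=1)}


toy_index = build_word_index(["The cat and the dog.", "The cat!"])
print(toy_index)
```

Because 0 is never assigned to a word, the vocabulary size used by the model is the number of distinct words plus one.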

In [None]:
# This list will store all input sequences
sequences = []

# Loop through each sentence
for sentence in sentences:

    # Convert sentence into a list of word indices
    token_list = tokenizer.texts_to_sequences([sentence])[0]

    # Create sub-sequences for next-word prediction
    # Example: [i, love, machine] → input=[i, love], output=machine
    for i in range(1, len(token_list)):
        sequences.append(token_list[:i + 1])

# Find the maximum length among all sequences
max_sequence_len = max([len(seq) for seq in sequences])

# Pad the sequences so they all have the same length
# 'pre' padding adds zeros at the beginning of each sequence
sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_sequence_len, padding='pre')

# Display generated sequences
print("Sequences:\n", sequences)


Sequences:
 [[  0   0   0 ...   0  25   1]
 [  0   0   0 ...  25   1  26]
 [  0   0   0 ...   1  26   4]
 ...
 [  0   0  25 ... 117  10 118]
 [  0  25   1 ...  10 118  14]
 [ 25   1  26 ... 118  14 119]]
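
Each row printed above is a prefix of the token list, left-padded with zeros. The same two steps can be sketched in pure Python on a four-token toy list. `make_prefix_sequences` and `pad_pre` are illustrative names (not Keras functions), and the sketch assumes no sequence is longer than `maxlen`.

```python
def make_prefix_sequences(tokens):
    """Return every prefix of length >= 2; the last item of each is the word to predict."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]


def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (assumes len(seq) <= maxlen)."""
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]


tokens = [25, 1, 26, 4]  # IDs for "today the issues of" in the word index
prefixes = make_prefix_sequences(tokens)
padded = pad_pre(prefixes, maxlen=len(tokens))
for row in padded:
    print(row)
```

Padding on the left keeps the most recent words at the end of every row, right next to the position the model has to predict.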


In [None]:
# Find the length of the longest sequence
# (the sequences were already padded in the previous cell, so this cell leaves them unchanged)
max_sequence_len = max(len(seq) for seq in sequences)

# Pad sequences with zeros at the beginning
sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences,
    maxlen=max_sequence_len,
    padding='pre'
)

print("Padded Sequences:\n", sequences)


Padded Sequences:
 [[  0   0   0 ...   0  25   1]
 [  0   0   0 ...  25   1  26]
 [  0   0   0 ...   1  26   4]
 ...
 [  0   0  25 ... 117  10 118]
 [  0  25   1 ...  10 118  14]
 [ 25   1  26 ... 118  14 119]]


In [None]:
# Input features: all words except the last one
X = sequences[:, :-1]

# Target labels: the last word of each sequence
y = sequences[:, -1]

# Convert labels into one-hot encoded vectors
# categorical_crossentropy expects one-hot targets with the same width as the softmax output
y = tf.keras.utils.to_categorical(y, num_classes=vocab_size)

print("Input shape:", X.shape)
print("Output shape:", y.shape)


Input shape: (170, 170)
Output shape: (170, 120)
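
These shapes follow from the construction: a text of n tokens yields n - 1 prefixes, each padded to length n, and removing the target column leaves n - 1 input columns. The printed (170, 170) therefore corresponds to n = 171 tokens. A small sketch of the split, assuming the padded rows are plain Python lists; `split_inputs_and_targets` is a hypothetical helper:

```python
def split_inputs_and_targets(padded, vocab_size):
    """Split each padded row into (all but the last token, one-hot of the last token)."""
    inputs, targets = [], []
    for row in padded:
        inputs.append(row[:-1])
        one_hot = [0] * vocab_size
        one_hot[row[-1]] = 1  # mark the position of the target word ID
        targets.append(one_hot)
    return inputs, targets


padded = [[0, 0, 25, 1], [0, 25, 1, 26], [25, 1, 26, 4]]
inputs, targets = split_inputs_and_targets(padded, vocab_size=27)
print(len(inputs), len(inputs[0]))    # rows and columns of the input matrix
print(len(targets), len(targets[0]))  # rows and columns of the one-hot targets
```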


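At this point, data preparation is complete.

Each row of X is a zero-padded prefix of the text, and the matching row of y is the one-hot encoding of the word that follows it. The data is now ready for an RNN.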

In [None]:
# Create a Sequential model (layers stacked one after another)
model = tf.keras.Sequential()

# -------------------------------
# Embedding Layer
# -------------------------------
# This layer converts word indices into dense vectors
# Example: word "learning" → [0.12, -0.45, 0.89, ...]
# The vectors are learned during training, so words used in similar contexts can end up with similar vectors
model.add(
    tf.keras.layers.Embedding(
        input_dim=vocab_size,                 # Size of vocabulary
        output_dim=64,                         # Dimension of word vectors
        input_length=max_sequence_len - 1     # Length of input sequences (deprecated in Keras 3, which ignores it with a warning)
    )
)

# -------------------------------
# Simple RNN Layer
# -------------------------------
# This layer processes sequences step-by-step
# It remembers previous words while reading the sentence
model.add(
    tf.keras.layers.SimpleRNN(
        64                                   # Number of RNN units (size of the hidden state)
    )
)

# -------------------------------
# Output Layer
# -------------------------------
# Dense layer with softmax activation
# Outputs probability for each word in the vocabulary
model.add(
    tf.keras.layers.Dense(
        vocab_size,                          # One neuron per word
        activation='softmax'                 # Converts scores to probabilities
    )
)
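
At each time step a simple RNN layer updates its hidden state as h_t = tanh(W x_t + U h_{t-1} + b), where x_t is the embedding of the current word. A toy pure-Python sketch of that recurrence follows. The weights are hand-picked for illustration (not the trained ones), and `rnn_step` and `run_rnn` are illustrative names.

```python
import math


def rnn_step(x, h_prev, W, U, b):
    """One step of h_t = tanh(W x_t + U h_{t-1} + b) using list-based matrices."""
    h_new = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W[j][i] * x[i] for i in range(len(x)))
        total += sum(U[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, W, U, b):
    """Feed the input vectors in order and return the final hidden state."""
    h = [0.0] * len(b)  # the state starts at zero
    for x in inputs:
        h = rnn_step(x, h, W, U, b)
    return h


# Two hidden units, two-dimensional inputs, hand-picked toy weights
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.0, 0.4], [-0.3, 0.0]]
b = [0.0, 0.1]
sequence = [[1.0, 0.0], [0.0, 1.0]]
final_state = run_rnn(sequence, W, U, b)
print(final_state)
```

Feeding the same vectors in reverse order gives a different final state, which is how the layer encodes word order.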




In [None]:
# Compile the model
model.compile(
    loss='categorical_crossentropy',   # Loss for multi-class classification
    optimizer='adam',                  # Adam optimizer (adaptive learning rates)
    metrics=['accuracy']               # Track accuracy during training
)

# Display model architecture
model.summary()
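
The summary output is not shown here, but the parameter counts for this architecture can be worked out by hand. The sketch below assumes the sizes used in this notebook (vocabulary of 120, 64-dimensional embeddings, 64 RNN units) and bias terms in the RNN and Dense layers; the function names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per word ID
    return vocab_size * embed_dim


def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + one bias per unit
    return units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # One weight per input plus one bias, for each output unit
    return (input_dim + 1) * units


vocab_size, embed_dim, rnn_units = 120, 64, 64
counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "simple_rnn": simple_rnn_params(embed_dim, rnn_units),
    "dense": dense_params(rnn_units, vocab_size),
}
total_params = sum(counts.values())
print(counts)
print(total_params)
```

Under those assumptions the three layers hold 7,680, 8,256 and 7,800 parameters, 23,736 in total, for a dataset of only 170 training rows.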


In [None]:
# Train the model on the prepared data
model.fit(
    X,                 # Input sequences
    y,                 # Correct next words
    epochs=200,        # Number of full passes over the training data
    verbose=1          # Show training progress
)


Epoch 1/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 39ms/step - accuracy: 0.0052 - loss: 4.8035
Epoch 2/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 39ms/step - accuracy: 0.0411 - loss: 4.7083
Epoch 3/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step - accuracy: 0.0312 - loss: 4.6493
Epoch 4/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step - accuracy: 0.0955 - loss: 4.6125
Epoch 5/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step - accuracy: 0.1730 - loss: 4.5276
Epoch 6/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step - accuracy: 0.1551 - loss: 4.4864
Epoch 7/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step - accuracy: 0.2461 - loss: 4.4161
Epoch 8/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step - accuracy: 0.3103 - loss: 4.3431
Epoch 9/200
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[3

<keras.src.callbacks.history.History at 0x785d3b486f90>
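
Two numbers in the log above can be checked by hand. A model that spreads its probability evenly over all 120 classes has a cross-entropy of ln(120), which is close to the first-epoch loss of 4.8035, and 170 samples with the default Keras batch size of 32 give the 6 steps per epoch shown in the progress bar. A small sketch, with `categorical_crossentropy` written out as an illustrative pure-Python function rather than the Keras implementation:

```python
import math


def categorical_crossentropy(one_hot, probs):
    """Cross-entropy for a single example with a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


vocab_size = 120
uniform = [1.0 / vocab_size] * vocab_size  # an untrained model guessing evenly
target = [0] * vocab_size
target[25] = 1                             # any single target word gives the same loss
baseline_loss = categorical_crossentropy(target, uniform)
print(round(baseline_loss, 4))

num_samples, batch_size = 170, 32          # 32 is the default batch size of model.fit
steps_per_epoch = math.ceil(num_samples / batch_size)
print(steps_per_epoch)
```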

In [None]:
# Function to predict the next word given a text input
def predict_next_word(model, tokenizer, text, max_sequence_len):

    # Convert input text to sequence of integers
    token_list = tokenizer.texts_to_sequences([text])[0]

    # Pad sequence to match training input length
    token_list = tf.keras.preprocessing.sequence.pad_sequences(
        [token_list],
        maxlen=max_sequence_len - 1,
        padding='pre'
    )

    # Predict probabilities for each word
    predicted_probs = model.predict(token_list, verbose=0)

    # Get index of word with highest probability
    predicted_index = np.argmax(predicted_probs)

    # Convert index back to word using the tokenizer's reverse mapping
    # (.get returns None for the padding index 0, which has no word)
    return tokenizer.index_word.get(int(predicted_index))


In [None]:
print(predict_next_word(model, tokenizer, "Today, the issues", max_sequence_len))
print(predict_next_word(model, tokenizer, "of global warming", max_sequence_len))


of
and
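
Taking only the argmax hides how confident the model is. A sketch of listing the top few candidates instead, using a made-up probability vector and a tiny reverse mapping; `top_k_words` is a hypothetical helper, and with the real model the inputs would be `predicted_probs[0]` and `tokenizer.index_word`.

```python
def top_k_words(probs, index_word, k=3):
    """Return the k most probable (word, probability) pairs, skipping padding ID 0."""
    ranked = sorted(range(1, len(probs)), key=lambda i: probs[i], reverse=True)
    return [(index_word[i], probs[i]) for i in ranked[:k] if i in index_word]


index_word = {1: "the", 2: "to", 3: "in", 4: "of"}
probs = [0.05, 0.10, 0.05, 0.20, 0.60]  # made-up softmax output; position 0 is padding
best_two = top_k_words(probs, index_word, k=2)
print(best_two)
```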


In [None]:
# Function to generate a sequence of words (paragraph)
def generate_text(model, tokenizer, seed_text, max_sequence_len, num_words):

    # Start with the initial seed text
    output_text = seed_text

    # Loop to generate the desired number of words
    for _ in range(num_words):

        # Convert current text to a sequence of integers
        token_list = tokenizer.texts_to_sequences([output_text])[0]

        # Pad sequence to match model input length
        token_list = tf.keras.preprocessing.sequence.pad_sequences(
            [token_list],
            maxlen=max_sequence_len - 1,
            padding='pre'
        )

        # Predict probability distribution for next word
        predicted_probs = model.predict(token_list, verbose=0)

        # Select the word with the highest probability
        predicted_index = np.argmax(predicted_probs)

        # Convert predicted index back to a word using the tokenizer's reverse mapping
        # (.get returns None for the padding index 0, which has no word)
        predicted_word = tokenizer.index_word.get(int(predicted_index))

        # Stop if no valid word is found
        if predicted_word is None:
            break

        # Append the predicted word to the output text
        output_text += " " + predicted_word

    return output_text
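
The function above is a greedy decoding loop: predict, append the most probable word, and feed the longer text back in. The same control flow can be sketched without TensorFlow by swapping the network for a stand-in lookup table. The table and its probabilities are made up for illustration, and `generate_greedy` and `toy_next_word_probs` are hypothetical names.

```python
def generate_greedy(next_word_probs, seed_words, num_words):
    """Repeatedly append the most probable next word (greedy decoding)."""
    words = list(seed_words)
    for _ in range(num_words):
        probs = next_word_probs(words)  # dict: candidate word -> probability
        if not probs:
            break                       # the stand-in model has no prediction
        words.append(max(probs, key=probs.get))
    return " ".join(words)


# Stand-in "model": a lookup keyed on the last word only (made-up numbers)
TOY_TABLE = {
    "global": {"warming": 0.9, "climate": 0.1},
    "warming": {"and": 0.7, "does": 0.3},
    "and": {"significant": 0.6, "politicians": 0.4},
}


def toy_next_word_probs(words):
    return TOY_TABLE.get(words[-1], {})


toy_output = generate_greedy(toy_next_word_probs, ["global"], num_words=5)
print(toy_output)
```

Greedy decoding is deterministic: the same seed always produces the same continuation.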


In [None]:
# Generate a paragraph starting with a seed sentence
generated_paragraph = generate_text(
    model,
    tokenizer,
    seed_text="Today, the issues",
    max_sequence_len=max_sequence_len,
    num_words=20
)

print(generated_paragraph)


Today, the issues of global warming and significant climate change are extremely relevant they are discussed not only by scientists and politicians but


In [None]:
# Generate a paragraph starting with a seed sentence
generated_paragraph = generate_text(
    model,
    tokenizer,
    seed_text="It must be",
    max_sequence_len=max_sequence_len,
    num_words=50
)

print(generated_paragraph)

It must be of of this of significant climate change are extremely relevant they are discussed not only by scientists and politicians but also by ordinary citizens it must be understood that this problem really deserves extensive attention numerous studies have long confirmed that warming does have an impact on the environment even
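
Apart from a few words right after the seed, both generated outputs reproduce the training paragraph word for word, which is what one might expect from a model trained for 200 epochs on a single paragraph. One rough way to quantify this is to measure the longest run of generated words that also appears consecutively in the source. The sketch below uses a short excerpt instead of the full paragraph, and `longest_copied_run` is a hypothetical helper.

```python
import re


def normalize(text):
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_copied_run(generated, source):
    """Length of the longest run of consecutive generated words found in source."""
    gen, src = normalize(generated), normalize(source)
    best = 0
    for i in range(len(gen)):
        for j in range(len(src)):
            k = 0
            while i + k < len(gen) and j + k < len(src) and gen[i + k] == src[j + k]:
                k += 1
            best = max(best, k)
    return best


source = "It must be understood that this problem really deserves extensive attention."
generated = "It must be of of this problem really deserves extensive attention"
copied = longest_copied_run(generated, source)
print(copied)
```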
