# Paraphrasing using LSTM, Bidirectional LSTM, Stacked LSTM

In this notebook we implement a paraphrasing task: you will learn how to create a language model for paraphrasing natural language text by implementing and training LSTM networks.

In this kernel, I will be using Google’s PAWS (Paraphrase Adversaries from Word Scrambling) dataset to train a paraphrasing language model.
PAWS focuses on challenging sentence pairs generated with paraphrasing techniques.


## 1. Import the libraries

As the first step, we need to import the required libraries:

In [1]:
import pandas as pd
import numpy as np
import string, os

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Embedding, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## 2. Load the dataset

Load the dataset, using the second column (`sentence1`) as the original sentence and the third column (`sentence2`) as the paraphrase sentence.

In [3]:
# Read TSV file
#curr_dir = '/content/'
filename = 'train.tsv'  # Replace with your TSV file name
data_df = pd.read_csv(filename, delimiter='\t')

# Filter out rows with missing values
data_df = data_df.dropna()

# Create sentence pairs from 'sentence1' and 'sentence2' columns
sentences = data_df['sentence1'].tolist()
paraphrases = data_df['sentence2'].tolist()
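
PAWS also provides a `label` column, where 1 marks a true paraphrase pair and 0 marks a pair that is not a paraphrase. The cell above keeps every row; the following is a minimal sketch of how the pairs could be restricted to true paraphrases, assuming your copy of `train.tsv` includes that column. The two sample rows and the names `sample_tsv`, `true_sentences` and `true_paraphrases` are invented here for illustration only.

```python
import io
import pandas as pd

# Two made-up rows in the PAWS column layout (id, sentence1, sentence2, label).
sample_tsv = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tThe cat chased the dog.\tThe dog chased the cat.\t0\n"
    "2\tShe wrote the book in 1999.\tIn 1999, she wrote the book.\t1\n"
)

sample_df = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t").dropna()

# Keep only rows labelled as true paraphrases (label == 1).
true_pairs = sample_df[sample_df["label"] == 1]
true_sentences = true_pairs["sentence1"].tolist()
true_paraphrases = true_pairs["sentence2"].tolist()

print(true_sentences, true_paraphrases)
```

With the real file, the same filter would be applied to `data_df` before building `sentences` and `paraphrases`.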

## 3. Generating Sequence of N-gram Tokens

The model requires sequential input data: given a sequence of words/tokens, the aim is to produce a rephrased sequence of words/tokens.

The next step is tokenization. Tokenization is the process of extracting tokens (terms / words) from a corpus. Keras has a built-in `Tokenizer` class that can be used to obtain the tokens and their indices in the corpus. After this step, every sentence in the dataset is converted into a sequence of token indices.


In [4]:
# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences + paraphrases)

# Convert sentences to sequences
sentences_seq = tokenizer.texts_to_sequences(sentences)
paraphrases_seq = tokenizer.texts_to_sequences(paraphrases)

# Padding sequences
max_length = max(len(seq) for seq in sentences_seq + paraphrases_seq)
sentences_seq = pad_sequences(sentences_seq, maxlen=max_length, padding='post')
paraphrases_seq = pad_sequences(paraphrases_seq, maxlen=max_length, padding='post')


In [5]:
sentences_seq[:2]  ## to visualise the token

array([[    1,    63,     7, 10464,     2,  4980,    11,     1,   249,
          116,     2,  1446,     9,  4229,  5981, 14252, 14253,     3,
          199,   468, 14254,  2801,   558,  7641,     0,     0,     0,
            0,     0,     0,     0,     0,     0],
       [    1, 10873,  4573,  6706,    28,  1309,     1,   191,   238,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0]], dtype=int32)
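
To make the output above easier to read, here is a simplified pure-Python sketch of what the tokenization and padding steps do. It is only an approximation of the Keras `Tokenizer` (it lowercases and splits on whitespace, without the punctuation filtering Keras applies), and the helper names `build_word_index`, `to_sequences` and `pad_post` are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Assign an integer to each word, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    # Index 0 is reserved for padding, so real words start at 1.
    return {word: rank + 1 for rank, (word, _) in enumerate(ranked)}

def to_sequences(texts, word_index):
    """Replace each known word by its index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    """Append zeros so every sequence has length maxlen.
    Assumes no sequence is longer than maxlen."""
    return [seq + [0] * (maxlen - len(seq)) for seq in sequences]

texts = ["the cat sat", "the dog sat on the mat"]
word_index = build_word_index(texts)
seqs = to_sequences(texts, word_index)
padded = pad_post(seqs, maxlen=max(len(s) for s in seqs))
print(word_index)
print(padded)
```

This is why the arrays above start with small indices for frequent words and end in runs of zeros: index 1 goes to the most frequent word, and 0 is padding.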



## 4. LSTMs for paraphrasing

Unlike feed-forward neural networks, in which activation outputs are propagated in only one direction, Recurrent Neural Networks (RNNs) feed the activation outputs of their neurons back in as inputs at the next time step. This creates loops in the network architecture that act as a ‘memory state’ of the neurons. This state gives the neurons the ability to remember what has been learned so far.

The memory state gives RNNs an advantage over traditional neural networks, but a problem called the vanishing gradient is associated with them. When training over long sequences or with a large number of layers, gradients shrink as they are propagated backwards, so it becomes really hard for the network to learn and tune the parameters of the earlier steps and layers. To address this problem, a new type of RNN called the LSTM (Long Short-Term Memory) was developed.

LSTMs have an additional state called the ‘cell state’, through which the network adjusts the flow of information. The advantage of this state is that the model can remember or forget what it has learned more selectively. To learn more about LSTMs, here is a great post. Let's build an LSTM model in our code. I have added a total of three layers to the model.

1. Embedding Layer : Takes the sequence of word indices as input and maps each index to a dense vector
2. LSTM Layer : Computes the output using LSTM units. I have added 100 units in the layer, but this number can be tuned later.
3. Dense/Output Layer : Computes the output probabilities with a softmax activation (the code below uses `max_length` output units)

We will train each model for a total of 10 epochs, but you can experiment with this further.
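
The selective remember/forget behaviour of the cell state described above can be sketched as a single LSTM time step in NumPy. This is an illustrative implementation of the standard LSTM equations, not the Keras internals: the function name `lstm_step`, the gate ordering and the weight layout (`W`, `U`, `b` stacked for the four gates) are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.
    x: (input_dim,), h_prev and c_prev: (units,)
    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:units])               # input gate: how much new information to write
    f = sigmoid(z[units:2 * units])       # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:])            # output gate: how much of the cell to expose
    c = f * c_prev + i * g                # updated cell state
    h = o * np.tanh(c)                    # updated hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 50, 100                # embedding size and LSTM units used in this notebook
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x in rng.normal(size=(5, input_dim)):  # five random stand-in "word vectors"
    h, c = lstm_step(x, h, c, W, U, b)

n_params = W.size + U.size + b.size
print(h.shape, c.shape, n_params)
```

With a 50-dimensional input and 100 units, the three weight arrays hold 60400 values in total.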

LSTM model

In [6]:
def create_lstm_model():
    model = Sequential()
    model.add(Embedding(len(tokenizer.word_index) + 1, 50, input_length=max_length))
    model.add(LSTM(100))
    model.add(Dense(max_length, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Bidirectional LSTM model

In [7]:
def create_bidirectional_lstm_model():
    model = Sequential()
    model.add(Embedding(len(tokenizer.word_index) + 1, 50, input_length=max_length))
    model.add(Bidirectional(LSTM(100)))
    model.add(Dense(max_length, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


Stacked LSTM model

In [8]:
def create_stacked_lstm_model():
    model = Sequential()
    model.add(Embedding(len(tokenizer.word_index) + 1, 50, input_length=max_length))
    model.add(LSTM(50, return_sequences=True))
    model.add(LSTM(50))
    model.add(Dense(max_length, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

You can call `summary()` on each model to compare the layers of the LSTM variants. Run these cells after the models have been created in the training cells further down.

In [None]:
lstm_model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_9 (Embedding)     (None, 33, 50)            723450    
                                                                 
 lstm_11 (LSTM)              (None, 100)               60400     
                                                                 
 dense_9 (Dense)             (None, 33)                3333      
                                                                 
Total params: 787183 (3.00 MB)
Trainable params: 787183 (3.00 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
bidirectional_lstm_model.summary()

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_11 (Embedding)    (None, 33, 50)            723450    
                                                                 
 bidirectional_3 (Bidirecti  (None, 200)               120800    
 onal)                                                           
                                                                 
 dense_11 (Dense)            (None, 33)                6633      
                                                                 
Total params: 850883 (3.25 MB)
Trainable params: 850883 (3.25 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [None]:
stacked_lstm_model.summary()

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_12 (Embedding)    (None, 33, 50)            723450    
                                                                 
 lstm_14 (LSTM)              (None, 33, 50)            20200     
                                                                 
 lstm_15 (LSTM)              (None, 50)                20200     
                                                                 
 dense_12 (Dense)            (None, 33)                1683      
                                                                 
Total params: 765533 (2.92 MB)
Trainable params: 765533 (2.92 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
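
The parameter counts in the three summaries can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are hypothetical, and the vocabulary size of 14469 is inferred from the embedding row (723450 / 50) rather than stated anywhere in the notebook.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size, embed_dim, max_length = 14469, 50, 33

embedding = embedding_params(vocab_size, embed_dim)

# Plain LSTM: Embedding -> LSTM(100) -> Dense(33)
lstm_total = embedding + lstm_params(embed_dim, 100) + dense_params(100, max_length)

# Bidirectional: two LSTM(100) layers, outputs concatenated to 200 features
bi_total = embedding + 2 * lstm_params(embed_dim, 100) + dense_params(200, max_length)

# Stacked: LSTM(50) -> LSTM(50); the second layer sees 50 features per step
stacked_total = (embedding + lstm_params(embed_dim, 50)
                 + lstm_params(50, 50) + dense_params(50, max_length))

print(lstm_total, bi_total, stacked_total)
```

The bidirectional wrapper doubles the LSTM parameters and the width of the layer output, which is why its dense layer is also roughly twice as large.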


Let's train our models now.

In [9]:
lstm_model = create_lstm_model()
lstm_model.fit(sentences_seq, paraphrases_seq, epochs=10, verbose=5)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x795e004589a0>

In [10]:
bidirectional_lstm_model = create_bidirectional_lstm_model()
bidirectional_lstm_model.fit(sentences_seq, paraphrases_seq, epochs=10, verbose=5)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x795d9c187610>

In [11]:
stacked_lstm_model = create_stacked_lstm_model()
stacked_lstm_model.fit(sentences_seq, paraphrases_seq, epochs=10, verbose=5)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x795d2c59f370>

Great, our three models are now trained. Next, let's write a function that generates a paraphrase for an input sentence. We first tokenize the sentence, pad the sequence to the length used in training, and pass it to the trained model. The indices with the highest predicted probability are then mapped back to words and joined together to form the predicted sequence.

### Correct the padding in the code

In [13]:
def test_model(model, sentence):
    # Convert the input sentence to sequence
    sequence = tokenizer.texts_to_sequences([sentence])

    # Pad the sequence to the same length used during training
    sequence = pad_sequences(sequence, maxlen=max_length, padding='post')

    # Predict paraphrased sequence
    predicted_seq = model.predict(sequence)

    # Convert predicted sequence back to text
    predicted_sentence = []
    for idx in np.argmax(predicted_seq, axis=1):
        word = tokenizer.index_word.get(idx, '')
        if word:
            predicted_sentence.append(word)
        if word == '<end>':  # Assuming '<end>' is the end token
            break

    predicted_sentence = ' '.join(predicted_sentence)

    return predicted_sentence

# Test sentences
test_sentences = [
    "I enjoy coding a lot",
    "Programming gives me joy",
]

# Test each model
print("Testing LSTM model:")
for sentence in test_sentences:
    paraphrase = test_model(lstm_model, sentence)
    print(f"Original: {sentence} -> Paraphrase: {paraphrase}")

print("\nTesting Bidirectional LSTM model:")
for sentence in test_sentences:
    paraphrase = test_model(bidirectional_lstm_model, sentence)
    print(f"Original: {sentence} -> Paraphrase: {paraphrase}")

print("\nTesting Stacked LSTM model:")
for sentence in test_sentences:
    paraphrase = test_model(stacked_lstm_model, sentence)
    print(f"Original: {sentence} -> Paraphrase: {paraphrase}")


Testing LSTM model:
Original: I enjoy coding a lot -> Paraphrase: the
Original: Programming gives me joy -> Paraphrase: the

Testing Bidirectional LSTM model:
Original: I enjoy coding a lot -> Paraphrase: the
Original: Programming gives me joy -> Paraphrase: the

Testing Stacked LSTM model:
Original: I enjoy coding a lot -> Paraphrase: the
Original: Programming gives me joy -> Paraphrase: the


This code tests each model with the provided test sentences and prints the original sentence along with its paraphrased version according to each model.

Run this testing code after training the models to see the paraphrased results for the test sentences. Remember, the quality of the paraphrasing may be low or vary with the complexity of the sentence and the model's training data; in the run above, all three models return only the single word "the". Adjustments like adding more data or tuning hyperparameters could improve the results. This is an example for learning how to implement the LSTM variants.
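
One reason each model prints a single word can be seen from the array shapes alone. The sketch below uses random numbers in place of real predictions (an assumption made only to keep it self-contained). The final `Dense(max_length)` layer yields an array of shape `(1, 33)`, so `np.argmax(..., axis=1)` in `test_model` returns one index between 0 and 32, which is then looked up as a word. A hypothetical model that produced one distribution over the vocabulary per position would instead have shape `(1, 33, vocab_size)` and give one word index per position. The vocabulary size of 14469 is inferred from the embedding parameter count in the summaries.

```python
import numpy as np

max_length, vocab_size = 33, 14469
rng = np.random.default_rng(0)

# Shape produced by the models in this notebook: one softmax over 33 units.
pred_current = rng.random((1, max_length))
ids_current = np.argmax(pred_current, axis=1)
print(ids_current.shape)    # one index for the whole sentence

# Hypothetical per-position output: one distribution over the vocabulary
# for each of the 33 positions (not what the models above produce).
pred_per_token = rng.random((1, max_length, vocab_size))
ids_per_token = np.argmax(pred_per_token, axis=-1)[0]
print(ids_per_token.shape)  # one index per position
```

In Keras, a per-position output of this kind is commonly built with `return_sequences=True` on the last LSTM layer, a `Dense(vocab_size, activation='softmax')` layer on top, and the `sparse_categorical_crossentropy` loss so that the integer sequences can be used directly as targets. That would be a change to the models in this notebook, not something they currently do.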

### **EXERCISE:-** You may change the hyperparameters of the model and see the difference in the results