


# Predicting Next K Words using LSTM and BERT



The central idea of this notebook is to explore language models, specifically LSTM-based and transformer-based ones. We will look at how the size of the model affects the generated sequence, using both character-based and word-based models. The dataset used to train the models can be found here: [link](https://drive.google.com/file/d/1OxNHKbdQm03KiNFNmERPI5wt_kTmjCW0/view?usp=sharing)



# Word Based LSTM model

In [1]:
# Importing modules
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import numpy
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Apply basic preprocessing, such as lowercasing.
Inspect the dataset first and choose the preprocessing steps that suit it.

In [None]:
nltk.download('punkt') # For tokenizers
nltk.download('inaugural') # For dataset
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Load the data, preprocess it, and store the result in preprocessed_corpus
with open("/content/drive/My Drive/Colab Notebooks/corpus.txt", "rt", encoding="utf-8") as f:
    corpus = f.read()

# Keep only runs of lowercase letters and the typographic apostrophe
word_tokenizer = nltk.RegexpTokenizer(r"[a-z’]+")

# Tokenize the corpus into sentences, then each sentence into words
sentences = sent_tokenize(corpus)
preprocessed_corpus = ""
for sentence in sentences:
    tokenized_sentence = word_tokenizer.tokenize(sentence.lower())
    if len(tokenized_sentence) != 0:
        preprocessed_corpus += " ".join(tokenized_sentence) + " "

In [None]:
# Hyperparameters of the model
vocab_size = 2461 # choose based on statistics
oov_tok = '<OOV>'
embedding_dim = 100
padding_type='post'
trunc_type='post'

In [None]:
# tokenize sentences
raw_text=preprocessed_corpus
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts([raw_text])
word_index = tokenizer.word_index

In [None]:
seq_length = 50
tokens = tokenizer.texts_to_sequences([raw_text])[0]

In [None]:
dataX = []
dataY = []

for i in range(0, len(tokens) - seq_length-1 , 1):
  seq_in = tokens[i:i + seq_length]
  seq_out = tokens[i + seq_length]

  if seq_out==1: #Skip samples where target word is OOV
    continue
    
  dataX.append(seq_in)
  dataY.append(seq_out)
 
N = len(dataX)
print ("Total training data size: ", N)

Total training data size:  26494
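
The windowing loop above can be checked on a toy sequence. The following is a minimal, dependency-free sketch: `make_windows` and the toy token IDs are hypothetical names introduced only for illustration, and it assumes, as in the cell above, that ID 1 is the OOV token. Note one deliberate difference: it iterates up to `len(tokens) - seq_length`, so the last possible window is included.

```python
def make_windows(tokens, seq_length, oov_id=1):
    """Return (inputs, targets): each input holds seq_length tokens, the target is the next token."""
    inputs, targets = [], []
    for i in range(len(tokens) - seq_length):
        target = tokens[i + seq_length]
        if target == oov_id:  # skip samples whose target word is OOV
            continue
        inputs.append(tokens[i:i + seq_length])
        targets.append(target)
    return inputs, targets

# Toy data (assumption): six token IDs, with 1 standing for the OOV token
toy_tokens = [5, 3, 1, 7, 2, 9]
X_toy, y_toy = make_windows(toy_tokens, seq_length=2)
print(X_toy)  # → [[3, 1], [1, 7], [7, 2]]
print(y_toy)  # → [7, 2, 9]
```

The first window is dropped because its target is the OOV ID, while OOV tokens are still allowed inside an input window.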


In [None]:
X = numpy.array(dataX)

# one hot encode the output variable
y = np_utils.to_categorical(dataY, num_classes=vocab_size)

In [None]:
# with embedding
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=seq_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(vocab_size, activation='softmax')
])
# compile model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 100)           246100    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 128)               84480     
_________________________________________________________________
dense_2 (Dense)              (None, 2461)              317469    
Total params: 648,049
Trainable params: 648,049
Non-trainable params: 0
_________________________________________________________________
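
The parameter counts in the summary above can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the helper names and the short variable names `V`, `D`, `H` are introduced here for illustration only.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units

V, D, H = 2461, 100, 64  # vocabulary size, embedding size, LSTM units per direction
emb = embedding_params(V, D)
bilstm = 2 * lstm_params(D, H)    # forward and backward LSTM
dense = dense_params(2 * H, V)    # the bidirectional output concatenates both directions
print(emb, bilstm, dense, emb + bilstm + dense)
# → 246100 84480 317469 648049
```

Nearly half of the parameters sit in the output layer, because it scales with the vocabulary size.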


In [None]:
# Use validation split of 0.2 while training
model.fit(X, y, epochs= 100, batch_size=128, validation_split=0.2 ) 

Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fc100c51e10>

In [None]:
## Create an index-to-word map by inverting tokenizer.word_index

reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))



In [None]:
# Return the next n words, decoded greedily (argmax at each step)
def next_tokens(input_str, n): 
	print ("Seed: \n",  input_str)
	final_string = ""
	
	for i in range(n):	
		x=tokenizer.texts_to_sequences([input_str])
		#print(x)
		prediction = model.predict(x , verbose=0)
		# get next word index. Use reverse_word_map to get the word
		index = numpy.argmax(prediction)
		#print(index)
		next_word = reverse_word_map[index]+" "
		final_string += next_word
		input_str+= next_word
		#print(reverse_word_map[index])
	return final_string 

In [None]:
# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)

pattern = dataX[start]
input_str = ' '.join([reverse_word_map[value] for value in pattern])

next_tokens( input_str , 10)

Seed: 
 is of course not alice replied very readily but that’s because it stays the same year for such a long time together which is just the case with mine said the hatter alice felt dreadfully puzzled the hatter’s remark seemed to have no sort of meaning in it and yet


'you just as if you would you say what you '

In [None]:
input_str = "The boy laughed at the fright he had caused. This time, the villagers left angrily. The third day, as the boy went up\
 the small hill, he suddenly saw a wolf attacking his sheep. He cried as hard as he could, “Wolf! Wolf! Wolf!”, but not \
 a single villager came to help him. The villagers thought that he was trying to fool them again and did not come to rescue \
 him or his sheep."

# Intended: use the first 50 tokens of input_str as input (split with the tokenizer).
# As written, the full input_str is passed as the seed and the next 50 words are generated.
print(next_tokens( input_str , 50))

Seed: 
 The boy laughed at the fright he had caused. This time, the villagers left angrily. The third day, as the boy went up the small hill, he suddenly saw a wolf attacking his sheep. He cried as hard as he could, “Wolf! Wolf! Wolf!”, but not  a single villager came to help him. The villagers thought that he was trying to fool them again and did not come to rescue  him or his sheep.
said the king and alice thought to herself if you had been no use now but if you had been no time to be impertinent said alice and if you had been no one but she had wept if i must be impertinent said alice and if you had been 
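
`next_tokens` passes the whole, growing string to the model at every step, although the model was built with `input_length=seq_length`. One way to keep the input at a fixed length is a sliding window over token IDs. The sketch below is only an illustration under stated assumptions: `greedy_generate` is a hypothetical helper, and `toy_predict` is a stand-in for `model.predict` that returns one score per vocabulary entry.

```python
def greedy_generate(predict_fn, seed_ids, n, window):
    """Greedily extend the seed by n tokens, feeding only the last `window` tokens each step."""
    ids = list(seed_ids[:window])  # start from the first `window` tokens of the seed
    generated = []
    for _ in range(n):
        scores = predict_fn(ids[-window:])  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        generated.append(next_id)
        ids.append(next_id)
    return generated

def toy_predict(context):
    """Stand-in for model.predict (assumption): favours the ID after the last one, modulo 4."""
    scores = [0.0] * 4
    scores[(context[-1] + 1) % 4] = 1.0
    return scores

print(greedy_generate(toy_predict, seed_ids=[0, 1, 2, 3, 0, 1], n=5, window=3))
# → [3, 0, 1, 2, 3]
```

With the Keras model, `predict_fn` would wrap `model.predict` on a `(1, window)` array, and the generated IDs would be mapped back to words with `reverse_word_map`.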


# Character based LSTM Model 1

In [None]:
# Read the corpus and build raw_text, dropping underscores
with open("/content/drive/My Drive/Colab Notebooks/corpus.txt", "rt", encoding="utf-8") as f:
    corpus = f.read()
raw_text = corpus.replace("_", "")
# create mapping of unique characters to integers
chars = sorted(list(set(raw_text)))

char_to_int = {chars[i]:i for i in range(len(chars))}

In [None]:
# Print the total number of characters and the character vocabulary size
n_chars = len(raw_text)
n_vocab = len(chars)
print("No. of characters: ",n_chars, " Size of vocabulary: ", n_vocab)

No. of characters:  142037  Size of vocabulary:  71


In [None]:

#Prepare dataset where the input is sequence of 100 characters and target is next character.

seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    # Input: seq_length characters starting at i; target: the character that follows
    dataX.append([char_to_int[ch] for ch in raw_text[i:(i+seq_length)]])
    dataY.append(char_to_int[raw_text[i+seq_length]])



n_patterns = len(dataX)
print ("Total Patterns: ", n_patterns)

Total Patterns:  141937


In [None]:
# X has shape [samples, time steps]; the Embedding layer adds the feature dimension
X = numpy.array(dataX)

# one hot encode the output variable
dataY = numpy.array(dataY)
y = np_utils.to_categorical(dataY)

In [None]:
embedding_dim =100
max_length =100

In [None]:
from keras.layers import Embedding
model = Sequential()
model.add(Embedding(n_vocab, embedding_dim, input_length=max_length))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 100, 100)          7100      
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               365568    
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 71)                18247     
Total params: 390,915
Trainable params: 390,915
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.fit(X, y, epochs=20, batch_size=128)


Epoch 1/20
...
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fc159887f60>

In [None]:
#implement mapping of integer to character
int_to_char = {i:chars[i] for i in range(len(chars))}
#print(int_to_char)
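
The two dictionaries are inverses of each other, which a round trip on a toy string makes concrete. This is a small sketch with hypothetical `toy_` names, not the notebook's corpus.

```python
toy_text = "alice was here"
toy_chars = sorted(set(toy_text))                     # character vocabulary
toy_char_to_int = {ch: i for i, ch in enumerate(toy_chars)}
toy_int_to_char = {i: ch for ch, i in toy_char_to_int.items()}

encoded = [toy_char_to_int[ch] for ch in toy_text]    # text -> integer IDs
decoded = "".join(toy_int_to_char[i] for i in encoded)  # integer IDs -> text

print(len(toy_chars))       # → 10
print(decoded == toy_text)  # → True
```

A seed string containing a character that never occurs in the corpus would raise a `KeyError` in the encoding step, so out-of-domain seeds may need such characters filtered or replaced first.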

In [None]:


# Return the next x characters, decoded greedily (despite the function name, x sets the count)
def predict_next_100_chars(pattern, x): 	
	final_string = ""
	encoded_str=np.reshape([char_to_int[ch] for ch in pattern], (1, len(pattern)))
	for i in range(x):		
		prediction = model.predict(encoded_str , verbose=0)				
		index = numpy.argmax(prediction)			
		next_char = int_to_char[index]
		final_string += next_char		
		encoded_str = np.append(encoded_str, np.array([[index]]), axis=1)	
	return final_string 





In [None]:
# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
input_str = ''.join([int_to_char[value] for value in pattern])

print(predict_next_100_chars(input_str,200))

order,” said the King, “and the moral of that is—‘The more than the executioner everything I’ve to talk about it,” said the King, “and the moral of that is—‘The more than the executioner everything I’


In [None]:
input_str = "The boy laughed at the fright he had caused. This time, the villagers left angrily. The third day, as the boy went up\
 the small hill, he suddenly saw a wolf attacking his sheep. He cried as hard as he could, “Wolf! Wolf! Wolf!”, but not \
 a single villager came to help him. The villagers thought that he was trying to fool them again and did not come to rescue \
 him or his sheep."

# Intended: use the first 100 characters of input_str as input and generate the next 200 characters.
# As written, the full input_str is passed as the seed.

 
print(predict_next_100_chars(input_str,200))


“There’s no more than the executioner everything I’ve to say that is—‘The more than the moral of that is—‘The more than the executioner everything I’ve to talk about it,” said the King, “and the mora


# Character based LSTM Model 2


In [None]:
model1 = Sequential()
model1.add(Embedding(n_vocab, embedding_dim, input_length=max_length))
model1.add(LSTM(256, input_shape=(X.shape[1], embedding_dim),return_sequences=True))
model1.add(Dropout(0.2))
model1.add(LSTM(256))
model1.add(Dropout(0.2))
model1.add(Dense(y.shape[1], activation='softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
model1.fit(X, y, epochs=20, batch_size=64)

Epoch 1/20
...
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fc0fa077e10>

In [None]:
# Generate sequences as in the cells above, this time with model1



def predict_next_100_chars(pattern, x): 	
	final_string = ""
	encoded_str=np.reshape([char_to_int[ch] for ch in pattern], (1, len(pattern)))
	for i in range(x):		
		prediction = model1.predict(encoded_str , verbose=0)				
		index = numpy.argmax(prediction)			
		next_char = int_to_char[index]
		final_string += next_char		
		encoded_str = np.append(encoded_str, np.array([[index]]), axis=1)	
	return final_string 

	

In [None]:
# pick a random seed
start = numpy.random.randint(0, len(dataX)-1)
pattern = dataX[start]
input_str = ''.join([int_to_char[value] for value in pattern])

print(predict_next_100_chars(input_str,200))

 the sort of deal of the song, and then I must be a little sisters as the Queen of the soldiers,” said the Mock Turtle.
“I don’t know what they’re saying to herself, and then I must be a little sister


In [None]:
input_str = "The boy laughed at the fright he had caused. This time, the villagers left angrily. The third day, as the boy went up\
 the small hill, he suddenly saw a wolf attacking his sheep. He cried as hard as he could, “Wolf! Wolf! Wolf!”, but not \
 a single villager came to help him. The villagers thought that he was trying to fool them again and did not come to rescue \
 him or his sheep."

# Intended: use the first 100 characters of input_str as input and generate the next 200 characters.
# As written, the full input_str is passed as the seed.

 
print(predict_next_100_chars(input_str,200))

” “I don’t know what they’re saying to herself, and then I must be a little sisters as the Queen of the soldiers,” said the Mock Turtle.
“I don’t know what they’re saying to herself, and then I must b


# Performance of the Models

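**Question:** What do the outputs of all the models show for training data (in domain) versus unseen data (out of domain)?

**Answer:**
The models seem to predict better for seed text drawn from the corpus. With the out-of-domain seed, the generated text still reads like the training corpus rather than continuing the seed.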


**Question:** What was observed in the outputs of char LSTM model 1 versus char LSTM model 2?

**Answer:**
Judging by the repetition in the generated sequences, the character-based LSTM model 2 seems to overfit more than model 1.
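
Repetition can also be given a rough number instead of being judged by eye. The sketch below computes the fraction of distinct word trigrams in a generated string; `distinct_ngram_ratio` is a hypothetical helper introduced here, and a low ratio only hints at looping output, it is not a measure of overfitting by itself. The `looping` string is an excerpt from the word-level output shown earlier, and `varied` is an arbitrary comparison sentence.

```python
def distinct_ngram_ratio(text, n=3):
    """Fraction of word n-grams in `text` that are unique (1.0 means no n-gram repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram
    return len(set(ngrams)) / len(ngrams)

looping = "if you had been no use now but if you had been no time"
varied = "the quick brown fox jumps over the lazy dog"
print(distinct_ngram_ratio(looping))  # → 0.75
print(distinct_ngram_ratio(varied))   # → 1.0
```

Applying the same function to equally long samples from both character models would give a like-for-like comparison of how much each one loops.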

# Transformer based language model (BERT)


In [3]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [4]:
import os
import torch
import string
from transformers import BertTokenizer, BertForMaskedLM

In [5]:
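def load_model(model_name):
  # Only BERT is supported; fail loudly instead of silently returning None
  if model_name.lower() != "bert":
    raise ValueError(f"Unsupported model: {model_name}")
  bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
  # eval() switches off dropout so that predictions are deterministic
  bert_model = BertForMaskedLM.from_pretrained('bert-base-uncased').eval()
  return bert_tokenizer, bert_model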

In [7]:
def decode(tokenizer, pred_idx, top_clean):
  # Drop punctuation and padding tokens, then keep the first top_clean candidates
  ignore_tokens = set(string.punctuation) | {'[PAD]'}
  tokens = []
  for w in pred_idx:
    token = ''.join(tokenizer.decode(w).split())
    if token and token not in ignore_tokens:
      tokens.append(token.replace('##', ''))
  return '\n'.join(tokens[:top_clean])

In [8]:
def encode(tokenizer, text_sentence, add_special_tokens=True):
  text_sentence = text_sentence.replace('<mask>', tokenizer.mask_token)
  # if <mask> is the last token, append a "." so that models dont predict punctuation.
  if tokenizer.mask_token == text_sentence.split()[-1]:
    text_sentence += ' .'
  input_ids = torch.tensor([tokenizer.encode(text_sentence, add_special_tokens=add_special_tokens)])
  mask_idx = torch.where(input_ids == tokenizer.mask_token_id)[1].tolist()[0]
  return input_ids, mask_idx

In [9]:
def get_all_predictions(text_sentence, top_clean=5):
  input_ids, mask_idx = encode(bert_tokenizer, text_sentence)
  with torch.no_grad():
    predict = bert_model(input_ids)[0]
    print(predict.shape)
    
    bert = decode(bert_tokenizer, predict[0, mask_idx, :].topk(top_k).indices.tolist(), top_clean)
  return {'bert': bert}
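
The selection step in `get_all_predictions` takes the highest-scoring vocabulary entries at the masked position and then filters out punctuation and padding. The sketch below mimics that step without PyTorch, using `heapq.nlargest` in place of `topk`. The vocabulary, the scores, and the helper `top_k_indices` are all made up for illustration.

```python
import heapq
import string

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, highest first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Hypothetical vocabulary and scores for a single masked position
toy_vocab = ["wife", ".", "friend", "[PAD]", "husband"]
toy_scores = [4.2, 3.9, 3.1, 2.5, 2.0]

ignore = set(string.punctuation) | {"[PAD]"}
candidates = [toy_vocab[i] for i in top_k_indices(toy_scores, k=5)]
clean = [tok for tok in candidates if tok not in ignore][:3]
print(clean)  # → ['wife', 'friend', 'husband']
```

Because filtering happens after the top-k cut, requesting exactly as many candidates as are finally wanted can return fewer words whenever one of them is punctuation; fetching a few extra candidates before filtering is one possible way around that.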

In [10]:
def get_prediction_eos(input_text):
  # Append a mask token at the end of the text and predict it;
  # errors propagate to the caller instead of being silently swallowed
  input_text += ' <mask>'
  return get_all_predictions(input_text, top_clean=int(top_k))

In [None]:
# The code below predicts the top_k most likely candidates for the next word.
top_k = 3
print('Predict next top', top_k, ' words')
model_name = 'BERT'
bert_tokenizer, bert_model = load_model(model_name)
input_text = "Will you be my " ### GIVE YOUR INPUT STRING HERE
res = get_prediction_eos(input_text)
answer = res['bert'].split("\n")
print(answer)
answer_as_string = " ".join(answer)

print(answer_as_string)

Predict next  3  words


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([1, 8, 30522])
['wife', 'friend', 'husband']
wife friend husband


In [20]:
# Modify the code above to get the next k words: set top_k=1 and decode greedily, one word at a time.
top_k= 1

model_name = 'BERT'
bert_tokenizer, bert_model  = load_model(model_name) 
input_text = "I said you " ### GIVE YOUR INPUT STRING HERE
print("_____________________________________")
print("Input:")
print(input_text)
print("_____________________________________")
k=5  #predict next k words
print("k:",k)
print("_____________________________________")
for i in range(k):
  res = get_prediction_eos(input_text)
  input_text+=res['bert']
  input_text+=" "
print("_____________________________________")
print("Output:")
print(input_text)





_____________________________________
Input:
I said you 
_____________________________________
k: 5
_____________________________________
torch.Size([1, 7, 30522])
torch.Size([1, 8, 30522])
torch.Size([1, 9, 30522])
torch.Size([1, 10, 30522])
torch.Size([1, 11, 30522])
_____________________________________
Output:
I said you would come back tomorrow night 
