<a href="https://colab.research.google.com/github/Z4HRA-S/NLP_Course_Spring2023/blob/main/Next_word_prediction_using_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Next Word Prediction Using LSTM
In this session, we predict the next token from the sequence of tokens that precedes it. For more detail, see this [link](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/).
This task is character-level, which means our model predicts the next character rather than the next word :)

In [1]:
# Small LSTM network to generate text from the Brown corpus
import numpy as np
import nltk
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer

Here we define a function for cleaning our text: it lowercases the text and removes punctuation.

In [2]:
import string
from nltk.tokenize import word_tokenize
nltk.download("punkt")

"""def clean_text(text:str)->list:
    data = word_tokenize(text)
    data = [word.lower() for word in data]
    data = [d.translate(str.maketrans('', '', string.punctuation)) for d in data]
    data = [d for d in data if d.isalpha()]
    return data"""


def clean_text_char_level(data: str) -> str:
    # Lowercase the text and strip all ASCII punctuation characters.
    data = data.lower()
    data = data.translate(str.maketrans('', '', string.punctuation))
    return data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
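
To see what the cleaning step does, here is a small self-contained sketch that applies the same two operations (lowercasing and punctuation removal) to a made-up sentence. The sample string is an assumption chosen for illustration and is not taken from the corpus.

```python
import string


def clean_text_char_level(data: str) -> str:
    # Same logic as the notebook's cleaning function.
    data = data.lower()
    return data.translate(str.maketrans('', '', string.punctuation))


# Hypothetical sample text, used only for illustration.
sample = "Hello, World! It's 2023."
cleaned = clean_text_char_level(sample)
print(cleaned)  # hello world its 2023
```

Note that digits and spaces survive the cleaning, so they end up in the character vocabulary.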


We download the Brown corpus from NLTK, join the sentences of the news and reviews categories into one string, and clean it.

In [3]:
from nltk.corpus import brown
import nltk

nltk.download('brown')

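sentences = brown.sents(categories=['news', 'reviews'])
# Flatten the tokenized sentences into one space-separated string.
data = " ".join(word for sentence in sentences for word in sentence)
data = clean_text_char_level(data)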

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
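
The flattening step turns a list of tokenized sentences into one long string. The sketch below illustrates this on two made-up sentences (`toy_sentences` is a hypothetical stand-in for the corpus). It also checks that a nested generator gives the same result as concatenating the lists with `sum(..., [])`, which builds a new list at every step and gets slow on large corpora.

```python
# Hypothetical tokenized sentences standing in for the corpus.
toy_sentences = [["The", "jury", "said"], ["It", "was", "Friday", "."]]

# Flatten with a nested generator expression.
flat = " ".join(word for sentence in toy_sentences for word in sentence)

# Same result via list concatenation.
flat_via_sum = " ".join(sum(toy_sentences, []))

print(flat)  # The jury said It was Friday .
```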


Our vocabulary is the set of distinct characters left after cleaning: lowercase English letters, digits, and the space character.

In [4]:
vocab = set(data)
print(len(vocab))

37
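
One plausible way to account for the 37 characters is sketched below. It assumes that, after lowercasing and punctuation removal, the corpus contains only ASCII letters, digits, and spaces. The notebook does not verify this assumption, although it matches the printed vocabulary size.

```python
import string

# Assumed character classes remaining after cleaning.
expected_vocab = set(string.ascii_lowercase) | set(string.digits) | {" "}

print(len(string.ascii_lowercase), len(string.digits), 1)  # 26 10 1
print(len(expected_vocab))  # 37
```

To check the assumption on the real data, one could compare `set(data)` with `expected_vocab`.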


In [None]:
"""from collections import Counter
vocab_count = dict(Counter(data))
low_freq_words = [k for k,v in vocab_count.items() if v<2]
print("words with one occurrence in data: ", len(low_freq_words))"""


In [None]:
"""from nltk.corpus import words
nltk.download("words")

valid_low_freq_words = list(filter(lambda x: x in words.words(), low_freq_words))"""


In [None]:
"""len(valid_low_freq_words )"""


In [None]:
"""vocab=[k for k,v in vocab_count.items() if v>=3]
vocab.append("unk")
vocab=set(vocab)
#vocab.extend(valid_low_freq_words)
len(vocab)"""


We define word_to_id as a mapping from each character to an integer ID. Despite its name, it maps characters, not words.

In [5]:
"""tokenizer = Tokenizer()
tokenizer.fit_on_texts(set(vocab))
sequences = tokenizer.texts_to_sequences(data)"""
word_to_id = {k:v for v,k in enumerate(vocab)}
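
Iterating over a Python set of strings can give a different order in different sessions, so the IDs assigned above may change between runs. A minimal sketch of a reproducible variant, assuming it is acceptable to sort the vocabulary first, is shown on a toy string below (`toy_text`, `char_to_id`, and `id_to_char` are illustrative names, not the notebook's variables).

```python
# Toy text standing in for the cleaned corpus.
toy_text = "hello world"

# Sorting makes the character-to-ID assignment deterministic.
toy_vocab = sorted(set(toy_text))
char_to_id = {ch: i for i, ch in enumerate(toy_vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

encoded = [char_to_id[ch] for ch in toy_text]
decoded = "".join(id_to_char[i] for i in encoded)

print(toy_vocab)  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(encoded)    # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decoded)    # hello world
```

A stable mapping matters when a trained model is saved and reloaded later, because the model's output indices only make sense with the mapping used during training.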

In the next cell, we slide a window over the corpus one character at a time: each sequence of 120 characters is an input, and the 121st character is that sequence's label.

In [13]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 120
dataX = []
dataY = []
for i in range(0, len(data) - seq_length, 1):
    seq_in = data[i:i + seq_length]
    seq_out = data[i + seq_length]
    dataX.append([word_to_id[s] for s in seq_in])
    dataY.append(word_to_id[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  736254
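
The windowing logic is easier to follow on a tiny example. The sketch below uses a made-up 8-character string and a window of 3 (both are illustrative values, not the notebook's settings) and shows that the number of patterns equals the text length minus the window length.

```python
# Toy stand-ins for the corpus and the sequence length.
toy = "abcdefgh"
toy_seq_length = 3

# Each pair is (input window, next character).
pairs = [
    (toy[i:i + toy_seq_length], toy[i + toy_seq_length])
    for i in range(len(toy) - toy_seq_length)
]

for seq_in, seq_out in pairs:
    print(seq_in, "->", seq_out)
# abc -> d
# bcd -> e
# cde -> f
# def -> g
# efg -> h

print(len(pairs))  # 5
```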


In [14]:
print(dataX[10])
print(dataY[10])

[17, 21, 22, 4, 34, 31, 20, 17, 8, 26, 14, 34, 33, 17, 23, 4, 26, 20, 17, 3, 14, 18, 33, 17, 5, 26, 18, 33, 14, 20, 17, 14, 34, 17, 18, 34, 19, 13, 3, 31, 18, 8, 14, 31, 18, 22, 34, 17, 22, 5, 17, 14, 31, 29, 14, 34, 31, 14, 3, 17, 26, 13, 21, 13, 34, 31, 17, 28, 26, 18, 35, 14, 26, 20, 17, 13, 29, 13, 21, 31, 18, 22, 34, 17, 28, 26, 22, 33, 4, 21, 13, 33, 17, 17, 34, 22, 17, 13, 19, 18, 33, 13, 34, 21, 13, 17, 17, 31, 11, 14, 31, 17, 14, 34, 20, 17, 18, 26, 26, 13]
8


In [15]:
"""vocab_size = len(tokenizer.word_index) + 1"""
vocab_size = len(vocab)
print(vocab_size)

37


In the next cell, we scale the integer IDs into the range [0, 1) by dividing them by the vocabulary size, reshape the input to (samples, time steps, features), and turn the labels into one-hot vectors.

In [16]:
dataX=np.array(dataX)/float(len(vocab))
dataY=np.array(dataY)
dataX=np.reshape(dataX, (dataX.shape[0], seq_length, 1))
dataY = to_categorical(dataY, num_classes=vocab_size)

print(dataX.shape, dataY.shape)

(736254, 120, 1) (736254, 37)
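
The same preparation can be traced on a toy batch. In this sketch the vocabulary size, inputs, and labels are made-up values, and `np.eye` is used as a NumPy-only stand-in for `to_categorical` so that the example runs without TensorFlow.

```python
import numpy as np

toy_vocab_size = 4
X = np.array([[0, 1, 2], [1, 2, 3]])  # two sequences of three character IDs
y = np.array([3, 0])                  # the ID of the next character for each

# Scale IDs into [0, 1) and add a trailing feature dimension for the LSTM.
X_scaled = (X / float(toy_vocab_size)).reshape(X.shape[0], X.shape[1], 1)

# One-hot encode the labels: row i has a 1 in column y[i].
y_onehot = np.eye(toy_vocab_size)[y]

print(X_scaled.shape, y_onehot.shape)  # (2, 3, 1) (2, 4)
print(X_scaled.max())                  # 0.75
print(y_onehot)
```

Because the model is trained on scaled inputs, any text fed to it at prediction time needs the same division by the vocabulary size.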


In the next cell, we define a deep neural network model: three stacked LSTM layers followed by a softmax layer over the vocabulary. Don't worry if you did not take the deep learning course; just run the cell.

In [17]:
from tensorflow.keras.layers import Flatten
import tensorflow as tf

model = Sequential()
model.add(LSTM(seq_length,return_sequences=True,input_shape=(dataX.shape[1],1)))
#model.add(Dropout(0.1))
model.add(LSTM(seq_length,return_sequences=True))
#model.add(Dropout(0.1))
model.add(LSTM(seq_length))
model.add(Dense(dataY.shape[1], activation='softmax'))
optimizer = tf.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [18]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_3 (LSTM)               (None, 120, 120)          58560     
                                                                 
 lstm_4 (LSTM)               (None, 120, 120)          115680    
                                                                 
 lstm_5 (LSTM)               (None, 120)               115680    
                                                                 
 dense_1 (Dense)             (None, 37)                4477      
                                                                 
Total params: 294,397
Trainable params: 294,397
Non-trainable params: 0
_________________________________________________________________
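
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout with bias terms: four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector. The helper names `lstm_params` and `dense_params` are illustrative and are not part of Keras.

```python
def lstm_params(units: int, input_dim: int) -> int:
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * (input_dim + units) + units)


def dense_params(units: int, input_dim: int) -> int:
    # weight matrix + bias
    return units * input_dim + units


counts = [
    lstm_params(120, 1),    # first LSTM: one feature per time step
    lstm_params(120, 120),  # second LSTM: fed by 120 units
    lstm_params(120, 120),  # third LSTM
    dense_params(37, 120),  # softmax over 37 characters
]
print(counts)       # [58560, 115680, 115680, 4477]
print(sum(counts))  # 294397
```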


Training happens in the next cell, and it may take a while. Go make yourself a cup of tea 🫖 🍵

In [None]:
model.fit(dataX, dataY, epochs=100, batch_size=80)
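
To get a rough feel for the amount of work involved, the number of weight updates follows from the dataset size, batch size, and number of epochs. The sketch below only does this arithmetic with the values used in this notebook; actual wall-clock time depends on the hardware and is not estimated here.

```python
import math

n_patterns = 736254  # number of training sequences printed earlier
batch_size = 80
epochs = 100

# The last batch of each epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_patterns / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 9204
print(total_steps)      # 920400
```

If this is too slow, lowering `epochs` or raising `batch_size` reduces the number of updates.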

After training, we evaluate the model by letting it generate text. If you want to test it yourself, set text to your desired seed.

In [31]:
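id_to_char = {v: k for k, v in word_to_id.items()}

text = "words are used to encode and convey"  # next word: thoughts
text = clean_text_char_level(text)
# The model expects exactly seq_length characters, so left-pad short seeds
# with spaces (a simple workaround; longer seeds are cut to the last window).
text = text.rjust(seq_length)

model_input = np.array([word_to_id[char] for char in text])

# Predict the next 10 characters.
output = []
for _ in range(10):
    # Take the most recent window and scale it the same way as the training data.
    window = model_input[-seq_length:] / float(vocab_size)
    output_vector = model.predict(window.reshape(1, seq_length, 1), verbose=0)
    idx = np.argmax(output_vector)
    output.append(id_to_char[idx])

    # Append the prediction so that it becomes part of the next window.
    model_input = np.append(model_input, idx)

print(output)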

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
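
The mechanics of the generation loop can be checked without a trained model. In the sketch below, `fake_predict` is a hypothetical stand-in for `model.predict` that always puts all probability on the ID following the last ID in the window, cycling through a made-up four-character vocabulary. It only illustrates how the window slides and how predictions are fed back in. It says nothing about what the real model would output.

```python
import numpy as np

toy_seq_length = 5
toy_vocab = sorted(set("abc "))  # [' ', 'a', 'b', 'c']
char_to_id = {ch: i for i, ch in enumerate(toy_vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}


def fake_predict(window: np.ndarray) -> np.ndarray:
    # Hypothetical predictor: recover the last ID and pick the next one cyclically.
    last_id = int(round(window[0, -1, 0] * len(toy_vocab)))
    probs = np.zeros((1, len(toy_vocab)))
    probs[0, (last_id + 1) % len(toy_vocab)] = 1.0
    return probs


seed = "ab".rjust(toy_seq_length)  # left-pad the short seed with spaces
ids = np.array([char_to_id[ch] for ch in seed])

generated = []
for _ in range(4):
    # Most recent window, scaled like the training inputs.
    window = (ids[-toy_seq_length:] / float(len(toy_vocab))).reshape(1, toy_seq_length, 1)
    idx = int(np.argmax(fake_predict(window)))
    generated.append(id_to_char[idx])
    ids = np.append(ids, idx)

print("".join(generated))  # c ab
```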


## Homework
These homework tasks are subjective, and your way of thinking and researching is more important than your actual answers (but pay attention to your actual answers too :) ). Try to think about the problem, and if you discuss it with your friends, please include their names in your email.

1. Run the notebook and ask any questions, or discuss your ideas and thoughts.
2. Test some other text with the trained model and see the answers.
3. Think about the pre-processing and see if you can do it better. If you come up with any ideas, try them and submit them in your email.
4. Explore the data. Search and see if you can run the model with another dataset. Compare your dataset with the current dataset and include the results of the training, including accuracy and loss.
