# Autosuggestion - Predicting the next word using RNN

Adapted from: https://medium.com/towards-artificial-intelligence/sentence-prediction-using-word-level-lstm-text-generator-language-modeling-using-rnn-a80c4cda5b40

### Approach

- Test with bigger dataset (17000 sentences) vs smaller dataset (1700 sentences)
    - Goal: to identify if a larger training set improves prediction accuracy
    - Bigger dataset: a sample of Enron emails; smaller dataset: personal Acronis emails, mostly newsletters
- Test with long query (>=3 words) vs short query (1 word)
    - Goal: to identify whether the RNN model predicts the next word less well for short queries than for long ones

### Steps

1. Get sentence tokens and clean sentences
2. Remove low frequency words and generate training sequences (4 words)
3. Train model and predict
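
The first two steps can be sketched at toy scale as follows. This is a minimal illustration, not the notebook's exact code: the three sentences are made-up data, and the frequency threshold is lowered to 2 (the notebook uses 15) so the effect is visible on a tiny corpus.

```python
import re
from collections import Counter

# Toy corpus standing in for the email sentences (hypothetical data).
sentences = [
    "Here is our forecast for the meeting!",
    "Here is the plan for the meeting.",
    "zzz",
]

# Step 1: tokenize and clean (lowercase, keep letters and spaces only).
cleaned = [re.sub(r"[^a-z ]+", "", s.lower()).split() for s in sentences]

# Step 2a: drop words below a frequency threshold (2 here, 15 in the notebook).
counts = Counter(w for toks in cleaned for w in toks)
min_count = 2
filtered = [[w for w in toks if counts[w] >= min_count] for toks in cleaned]

# Step 2b: slide a 4-word window over each sentence:
# 3 context words followed by 1 target word.
window = 4
samples = [
    toks[i:i + window]
    for toks in filtered
    for i in range(len(toks) - window + 1)
]

for s in samples:
    print(s[:-1], "->", s[-1])
```

Sentences that end up with fewer than four words after filtering contribute no training samples, which is why short lines are dropped in pre-processing.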

### Setup

- Python version: Python 3.7.4
- Set up virtual environment
    - pip install --user virtualenv
    - virtualenv tensorflow-gpu
    - tensorflow-gpu\Scripts\activate
- Add Jupyter Notebook to virtual environment
    - pip install ipykernel
    - python -m ipykernel install --name=tensorflow-gpu
- Set up TensorFlow 2.1 GPU version
    - https://www.tensorflow.org/install/gpu
    - pip install tensorflow==2.1 
    - CUDA 10.1, cuDNN 7.6.5
- pip install pandas

## Pre-processing

Create txt file to save sentences

In [5]:
import pandas as pd
df = pd.read_csv('enron_emails_clean_2500.csv')
df = df.fillna('')  # replace missing values with empty strings (result must be assigned back)
print(df.shape)
df.head()

(17256, 2)


Unnamed: 0,msg_sentence,msg_clean
0,Here is our forecast,here is our forecast
1,Traveling to have a business meeting takes the...,traveling to have a business meeting takes the...
2,Especially if you have to prepare a presentation.,especially if you have to prepare a presentation
3,I would suggest holding the business plan meet...,i would suggest holding the business plan meet...
4,I would even try and get some honest opinions ...,i would even try and get some honest opinions ...


In [6]:
import numpy as np
np.savetxt(r'enron_train', df.msg_clean.values, fmt='%s')

In [None]:
# Read the saved sentences back as a single string
with open("enron_train", "r") as f:
    content = f.read()
print(content)

Data cleaning

In [8]:
import re
# Clean the data: keep only ASCII letters and spaces
clean_data = []
for text in content.splitlines():
    text = text.replace('_', ' ')
    a = re.sub(r'[^a-zA-Z ]+', '', text).strip()
    if len(a) > 0:
        clean_data.append(a)
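
A note on letter-only character classes in regular expressions: the range `A-z` covers ASCII codes 65 to 122, which includes the six characters ``[ \ ] ^ _ ` `` that sit between `Z` and `a`, whereas `A-Za-z` covers letters only. The small check below uses a made-up string to show the difference.

```python
import re

text = "budget_2001 [draft] ^final^ `ok`"

# A-z spans ASCII 65..122, so [ \ ] ^ _ ` survive the substitution.
wide = re.sub(r"[^a-zA-z ]+", "", text)
# A-Za-z keeps letters (and the space) only.
strict = re.sub(r"[^A-Za-z ]+", "", text)

print(wide)
print(strict)
```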

In [9]:
# Remove lines that are too short or too long (keep 2 to 200 words)
short_data = []
for line in clean_data:
    if 2 <= len(line.split()) <= 200:
        short_data.append(line)

In [10]:
# Count how often each word appears in the corpus and the total number of words
word2count = {}
total_words = 0
for text in short_data:
    for word in text.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
        total_words += 1
print(total_words)

352973


In [11]:
# Build a list of the words that appear at least 15 times and are longer than one character
word15 = []
threshold = 15
for word, count in word2count.items():
    if count >= threshold:
        if len(word) > 1:
            word15.append(word)
print(len(word15))

2614
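
The counting and thresholding in the two cells above could also be written with `collections.Counter`. The sketch below runs on two made-up lines with a threshold of 2; it is an alternative formulation, not the code that produced the numbers reported here.

```python
from collections import Counter

lines = ["the plan is the plan", "a plan is good"]

# Count every word across all lines in one pass.
word_counts = Counter(word for line in lines for word in line.split())
total_words = sum(word_counts.values())

# Keep words that meet the threshold and are longer than one character.
# A set makes later membership tests fast.
threshold = 2
vocab = {w for w, c in word_counts.items() if c >= threshold and len(w) > 1}

print(total_words)
print(sorted(vocab))
```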


In [12]:
# Remove words that appear fewer than 15 times from each line
keep = set(word15)  # set membership is much faster than scanning the list
data_15 = []
for line in short_data:
    data_15.append(" ".join(word for word in line.split() if word in keep))

In [13]:
# Remove lines that are too short or too long after dropping the rare words (keep 3 to 200 words)
short_data_consize = []
for line in data_15:
    if 3 <= len(line.split()) <= 200:
        short_data_consize.append(line)

In [14]:
# Save the cleaned lines to a text file, one sentence per line
def write_txt(name, data):
    with open("{0}.txt".format(name), "w") as file1:
        for line in data:
            file1.write(line + '\n')

write_txt(name = 'enron_train_wordRNN', data = short_data_consize)

## LSTM Model

In [16]:
def read_file(filepath):
    with open(filepath) as f:
        str_text = f.read()
    return str_text

text = read_file('enron_train_wordRNN.txt')
line = text.split("\n")
token_lst = [x.split() for x in line]

In [17]:
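def generate_seq(tokens):
    # Each training sequence holds 3 input words plus 1 target word
    train_len = 3+1
    text_sequences = []
    # The range stops at len(tokens) - 1, so the window ending on the last
    # token is not emitted; use len(tokens) + 1 as the bound to include it
    for i in range(train_len,len(tokens)):
        seq = tokens[i-train_len:i]
        text_sequences.append(seq)
    return text_sequences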

seq_lst = [generate_seq(x) for x in token_lst]
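
To make the window arithmetic concrete, the sketch below compares two hypothetical helpers on a five-word sentence: `notebook_windows` uses the same loop bounds as `generate_seq`, while `windows` also emits the window that ends on the last token. A sentence of n tokens has n - 4 + 1 contiguous 4-word windows in total.

```python
def notebook_windows(tokens, size=4):
    """Same loop bounds as generate_seq: i runs from size to len(tokens) - 1."""
    return [tokens[i - size:i] for i in range(size, len(tokens))]

def windows(tokens, size=4):
    """Every contiguous run of `size` tokens, including the final one."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

tokens = ["think", "it", "would", "be", "more"]

print(notebook_windows(tokens))  # the last word never appears as a target
print(windows(tokens))
```

With the notebook's bounds, the final word of each sentence is never used as a prediction target, and a sentence of exactly four words yields no samples at all.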

In [18]:
flatten = [item for sublist in token_lst for item in sublist]

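# Map each unique word to an integer id, starting at 1
# (this dict is replaced by the Tokenizer output in a later cell)
sequences = {}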
count = 1
for i in range(len(flatten)):
    if flatten[i] not in sequences:
        sequences[flatten[i]] = count
        count += 1

In [19]:
flatten_seq = [item for sublist in seq_lst for item in sublist]

In [20]:
flatten_seq[40:50]

[['meetings', 'think', 'it', 'would'],
 ['think', 'it', 'would', 'be'],
 ['it', 'would', 'be', 'more'],
 ['would', 'be', 'more', 'to'],
 ['be', 'more', 'to', 'try'],
 ['more', 'to', 'try', 'and'],
 ['to', 'try', 'and', 'discussions'],
 ['try', 'and', 'discussions', 'across'],
 ['and', 'discussions', 'across', 'the'],
 ['discussions', 'across', 'the', 'different']]

In [21]:
text_sequences = flatten_seq
train_len = 3+1

In [22]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences) 

# Vocabulary size is the number of distinct words seen by the tokenizer
vocabulary_size = len(tokenizer.word_counts)

# Every sequence has exactly train_len ids, so they stack into a 2-D array
import numpy as np
n_sequences = np.array(sequences, dtype='int32')
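
The following pure-Python sketch approximates what the tokenizer does to the word sequences: words are ranked by frequency and numbered from 1, leaving index 0 free for padding. `build_word_index` is a hypothetical helper written for illustration, not the Keras implementation, and its ordering of equally frequent words is an assumption.

```python
from collections import Counter

def build_word_index(token_lists):
    """Assign ids 1..V to words, most frequent first (0 is left for padding)."""
    counts = Counter(w for toks in token_lists for w in toks)
    ordered = sorted(counts, key=lambda w: counts[w], reverse=True)
    return {w: i for i, w in enumerate(ordered, start=1)}

seqs = [["it", "would", "be", "more"], ["would", "be", "more", "to"]]
word_index = build_word_index(seqs)
encoded = [[word_index[w] for w in s] for s in seqs]
vocabulary_size = len(word_index)

print(word_index)
print(encoded)
# The largest id equals vocabulary_size, so a layer indexed by word id
# needs vocabulary_size + 1 slots to cover ids 0..vocabulary_size.
print(vocabulary_size + 1)
```

This is why the model below is created with `vocabulary_size+1` rather than `vocabulary_size`.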

In [30]:
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

train_inputs = n_sequences[:,:-1]
train_targets = n_sequences[:,-1]

train_targets = to_categorical(train_targets, num_classes=vocabulary_size+1)
seq_len = train_inputs.shape[1]
train_inputs.shape

def create_model(vocabulary_size, seq_len):
    model = Sequential()
    model.add(Embedding(vocabulary_size, seq_len,input_length=seq_len))
    model.add(LSTM(50,return_sequences=True))
    model.add(LSTM(50))
    model.add(Dense(50,activation='relu'))
    model.add(Dense(vocabulary_size,activation='softmax'))
    # Passing optimizer='adam' to compile() would use the default learning rate (0.001);
    # an Adam instance is created here so that the learning rate can be changed.
    opt_adam = Adam(learning_rate=0.001)
    model.compile(loss='categorical_crossentropy',optimizer=opt_adam,metrics=['accuracy'])
    model.summary()
    return model
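
As a rough illustration of what `to_categorical` produces, the sketch below one-hot encodes a few made-up target ids in plain Python and then estimates the size of the full target matrix at this notebook's scale (225490 samples, 2611 classes, taken from the training log and model summary). The 4-byte float size is an assumption.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

targets = [2, 0, 3]
num_classes = 4
encoded = [one_hot(t, num_classes) for t in targets]
print(encoded)

# Approximate memory for the dense one-hot targets, assuming 4-byte floats.
n_samples, width = 225490, 2611
approx_gb = n_samples * width * 4 / 1e9
print(round(approx_gb, 2))
```

If that memory footprint becomes a problem, Keras also offers the `sparse_categorical_crossentropy` loss, which takes the integer targets directly; this notebook does not use it.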

In [24]:
model = create_model(vocabulary_size+1,seq_len)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 3, 3)              7833      
_________________________________________________________________
lstm (LSTM)                  (None, 3, 50)             10800     
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense (Dense)                (None, 50)                2550      
_________________________________________________________________
dense_1 (Dense)              (None, 2611)              133161    
Total params: 174,544
Trainable params: 174,544
Non-trainable params: 0
_________________________________________________________________
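
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the sizes shown above (vocabulary 2611, embedding width 3, 50 units); the helper names are made up for this example.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab, embed_dim, units = 2611, 3, 50
layers = [
    embedding_params(vocab, embed_dim),
    lstm_params(embed_dim, units),
    lstm_params(units, units),
    dense_params(units, units),
    dense_params(units, vocab),
]
total = sum(layers)

print(layers)
print(total)
```

Most of the parameters sit in the final softmax layer, because its width equals the vocabulary size.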


In [32]:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.models import load_model

path = 'models/enron_wordRNN_0714.h5'
checkpoint = ModelCheckpoint(path, monitor='loss', verbose=1, save_best_only=True, mode='min')
stopping = EarlyStopping(monitor='loss', mode='min', verbose=1, patience=3)
model.fit(train_inputs,train_targets,batch_size=128,epochs=500,verbose=1,callbacks=[checkpoint, stopping])

Train on 225490 samples
Epoch 1/500
Epoch 00001: loss improved from inf to 3.70814, saving model to models/enron_wordRNN_0714.h5
Epoch 2/500
Epoch 00002: loss improved from 3.70814 to 3.68798, saving model to models/enron_wordRNN_0714.h5
Epoch 3/500
Epoch 00003: loss improved from 3.68798 to 3.67821, saving model to models/enron_wordRNN_0714.h5
Epoch 4/500
Epoch 00004: loss improved from 3.67821 to 3.67147, saving model to models/enron_wordRNN_0714.h5
Epoch 5/500
Epoch 00005: loss improved from 3.67147 to 3.66437, saving model to models/enron_wordRNN_0714.h5
Epoch 6/500
Epoch 00006: loss improved from 3.66437 to 3.65889, saving model to models/enron_wordRNN_0714.h5
Epoch 7/500
Epoch 00007: loss improved from 3.65889 to 3.65064, saving model to models/enron_wordRNN_0714.h5
Epoch 8/500
Epoch 00008: loss improved from 3.65064 to 3.64485, saving model to models/enron_wordRNN_0714.h5
Epoch 9/500
Epoch 00009: loss improved from 3.64485 to 3.63999, saving model to models/enron_wordRNN_0714.h5

Epoch 00055: loss improved from 3.44181 to 3.43867, saving model to models/enron_wordRNN_0714.h5
Epoch 56/500
Epoch 00056: loss improved from 3.43867 to 3.43449, saving model to models/enron_wordRNN_0714.h5
Epoch 57/500
Epoch 00057: loss improved from 3.43449 to 3.43116, saving model to models/enron_wordRNN_0714.h5
Epoch 58/500
Epoch 00058: loss improved from 3.43116 to 3.42883, saving model to models/enron_wordRNN_0714.h5
Epoch 59/500
Epoch 00059: loss improved from 3.42883 to 3.42632, saving model to models/enron_wordRNN_0714.h5
Epoch 60/500
Epoch 00060: loss improved from 3.42632 to 3.42304, saving model to models/enron_wordRNN_0714.h5
Epoch 61/500
Epoch 00061: loss improved from 3.42304 to 3.41880, saving model to models/enron_wordRNN_0714.h5
Epoch 62/500
Epoch 00062: loss improved from 3.41880 to 3.41623, saving model to models/enron_wordRNN_0714.h5
Epoch 63/500
Epoch 00063: loss improved from 3.41623 to 3.41351, saving model to models/enron_wordRNN_0714.h5
Epoch 64/500
Epoch 0006

Epoch 00109: loss improved from 3.30587 to 3.30498, saving model to models/enron_wordRNN_0714.h5
Epoch 110/500
Epoch 00110: loss improved from 3.30498 to 3.30321, saving model to models/enron_wordRNN_0714.h5
Epoch 111/500
Epoch 00111: loss improved from 3.30321 to 3.30025, saving model to models/enron_wordRNN_0714.h5
Epoch 112/500
Epoch 00112: loss improved from 3.30025 to 3.29905, saving model to models/enron_wordRNN_0714.h5
Epoch 113/500
Epoch 00113: loss improved from 3.29905 to 3.29785, saving model to models/enron_wordRNN_0714.h5
Epoch 114/500
Epoch 00114: loss improved from 3.29785 to 3.29471, saving model to models/enron_wordRNN_0714.h5
Epoch 115/500
Epoch 00115: loss improved from 3.29471 to 3.29188, saving model to models/enron_wordRNN_0714.h5
Epoch 116/500
Epoch 00116: loss improved from 3.29188 to 3.29095, saving model to models/enron_wordRNN_0714.h5
Epoch 117/500
Epoch 00117: loss improved from 3.29095 to 3.29083, saving model to models/enron_wordRNN_0714.h5
Epoch 118/500
E

Epoch 137/500
Epoch 00137: loss did not improve from 3.25715
Epoch 138/500
Epoch 00138: loss improved from 3.25715 to 3.25485, saving model to models/enron_wordRNN_0714.h5
Epoch 139/500
Epoch 00139: loss improved from 3.25485 to 3.25445, saving model to models/enron_wordRNN_0714.h5
Epoch 140/500
Epoch 00140: loss improved from 3.25445 to 3.25145, saving model to models/enron_wordRNN_0714.h5
Epoch 141/500
Epoch 00141: loss did not improve from 3.25145
Epoch 142/500
Epoch 00142: loss improved from 3.25145 to 3.24758, saving model to models/enron_wordRNN_0714.h5
Epoch 143/500
Epoch 00143: loss improved from 3.24758 to 3.24716, saving model to models/enron_wordRNN_0714.h5
Epoch 144/500
Epoch 00144: loss improved from 3.24716 to 3.24539, saving model to models/enron_wordRNN_0714.h5
Epoch 145/500
Epoch 00145: loss improved from 3.24539 to 3.24368, saving model to models/enron_wordRNN_0714.h5
Epoch 146/500
Epoch 00146: loss improved from 3.24368 to 3.24339, saving model to models/enron_wordRN

Epoch 165/500
Epoch 00165: loss did not improve from 3.21909
Epoch 166/500
Epoch 00166: loss improved from 3.21909 to 3.21796, saving model to models/enron_wordRNN_0714.h5
Epoch 167/500
Epoch 00167: loss improved from 3.21796 to 3.21527, saving model to models/enron_wordRNN_0714.h5
Epoch 168/500
Epoch 00168: loss improved from 3.21527 to 3.21328, saving model to models/enron_wordRNN_0714.h5
Epoch 169/500
Epoch 00169: loss improved from 3.21328 to 3.21307, saving model to models/enron_wordRNN_0714.h5
Epoch 170/500
Epoch 00170: loss improved from 3.21307 to 3.21238, saving model to models/enron_wordRNN_0714.h5
Epoch 171/500
Epoch 00171: loss improved from 3.21238 to 3.21145, saving model to models/enron_wordRNN_0714.h5
Epoch 172/500
Epoch 00172: loss improved from 3.21145 to 3.21016, saving model to models/enron_wordRNN_0714.h5
Epoch 173/500
Epoch 00173: loss improved from 3.21016 to 3.20918, saving model to models/enron_wordRNN_0714.h5
Epoch 174/500
Epoch 00174: loss improved from 3.209

Epoch 222/500
Epoch 00222: loss improved from 3.15935 to 3.15810, saving model to models/enron_wordRNN_0714.h5
Epoch 223/500
Epoch 00223: loss improved from 3.15810 to 3.15751, saving model to models/enron_wordRNN_0714.h5
Epoch 224/500
Epoch 00224: loss did not improve from 3.15751
Epoch 225/500
Epoch 00225: loss improved from 3.15751 to 3.15548, saving model to models/enron_wordRNN_0714.h5
Epoch 226/500
Epoch 00226: loss improved from 3.15548 to 3.15500, saving model to models/enron_wordRNN_0714.h5
Epoch 227/500
Epoch 00227: loss improved from 3.15500 to 3.15491, saving model to models/enron_wordRNN_0714.h5
Epoch 228/500
Epoch 00228: loss did not improve from 3.15491
Epoch 229/500
Epoch 00229: loss improved from 3.15491 to 3.15317, saving model to models/enron_wordRNN_0714.h5
Epoch 230/500
Epoch 00230: loss improved from 3.15317 to 3.15152, saving model to models/enron_wordRNN_0714.h5
Epoch 231/500
Epoch 00231: loss improved from 3.15152 to 3.15040, saving model to models/enron_wordRN

Epoch 251/500
Epoch 00251: loss did not improve from 3.13473
Epoch 252/500
Epoch 00252: loss did not improve from 3.13473
Epoch 253/500
Epoch 00253: loss improved from 3.13473 to 3.13427, saving model to models/enron_wordRNN_0714.h5
Epoch 254/500
Epoch 00254: loss improved from 3.13427 to 3.13342, saving model to models/enron_wordRNN_0714.h5
Epoch 255/500
Epoch 00255: loss did not improve from 3.13342
Epoch 256/500
Epoch 00256: loss improved from 3.13342 to 3.13287, saving model to models/enron_wordRNN_0714.h5
Epoch 257/500
Epoch 00257: loss improved from 3.13287 to 3.13173, saving model to models/enron_wordRNN_0714.h5
Epoch 258/500
Epoch 00258: loss improved from 3.13173 to 3.12974, saving model to models/enron_wordRNN_0714.h5
Epoch 259/500
Epoch 00259: loss improved from 3.12974 to 3.12815, saving model to models/enron_wordRNN_0714.h5
Epoch 260/500
Epoch 00260: loss did not improve from 3.12815
Epoch 261/500
Epoch 00261: loss improved from 3.12815 to 3.12769, saving model to models/e

<tensorflow.python.keras.callbacks.History at 0x144e4868bc8>
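
The stopping rule configured above (`monitor='loss'`, `patience=3`) can be sketched in plain Python. This is a simplified model of the idea, not the Keras implementation, and the loss values are invented for the example.

```python
def stopping_epoch(losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # New best loss: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [3.70, 3.60, 3.61, 3.59, 3.60, 3.62, 3.61, 3.58]
print(stopping_epoch(losses))
```

Because the monitored quantity is the training loss rather than a validation loss, this rule detects a plateau in fitting the training data, not the onset of overfitting.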

In [33]:
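# Note: this overwrites the best checkpoint written by ModelCheckpoint
# with the weights from the final epoch
model.save(path)

from pickle import dump
with open('models/enron_tokenizer_wordRNN_0714', 'wb') as f:
    dump(tokenizer, f)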

## Prediction 

### Large dataset

Training results:

- Epoch 273/500 loss: 3.1200 - accuracy: 0.3993
- Each epoch takes about 80-100 seconds on GPU (RTX 2070 SUPER)
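
One way to read the cross-entropy loss is to convert it to perplexity, exp(loss), which can be thought of as the effective number of equally likely words the model is choosing between. The sketch below applies this to the two training losses reported in this notebook (3.1200 for the large dataset, 1.4459 for the smaller one) and to a uniform guess over the 2611 output classes of the large model. These are training-set figures, so they say nothing about held-out text.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity for a cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_nats)

large = perplexity(3.1200)
small = perplexity(1.4459)

# Loss of a model that assigns equal probability to all 2611 classes
uniform_loss = math.log(2611)

print(round(large, 1), round(small, 1), round(uniform_loss, 2))
```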

In [34]:
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from pickle import load

model = load_model('models/enron_wordRNN_0714.h5')
with open('models/enron_tokenizer_wordRNN_0714', 'rb') as f:
    tokenizer = load(f)
seq_len = 3

def gen_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    """Greedily append num_gen_words predicted words to seed_text."""
    output_text = []
    input_text = seed_text
    for i in range(num_gen_words):
        # Encode the text so far and keep only the last seq_len word ids
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        # Pick the most probable next word (greedy decoding)
        probs = model.predict(pad_encoded, verbose=0)[0]
        pred_word_ind = int(np.argmax(probs))

        pred_word = tokenizer.index_word[pred_word_ind]
        input_text += ' ' + pred_word
        output_text.append(pred_word)
    return ' '.join(output_text)
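
The effect of `pad_sequences(..., maxlen=seq_len, truncating='pre')` on queries of different lengths can be mimicked in plain Python. `pad_pre` below is a hypothetical stand-in written for illustration and the word ids are made up.

```python
def pad_pre(ids, maxlen, pad_value=0):
    """Keep the last `maxlen` ids and left-pad with `pad_value`."""
    ids = ids[-maxlen:]
    return [pad_value] * (maxlen - len(ids)) + ids

print(pad_pre([7], 3))              # one-word query
print(pad_pre([4, 9, 7], 3))        # three-word query
print(pad_pre([1, 2, 4, 9, 7], 3))  # longer query: only the last 3 ids are used
```

A one-word query reaches the model as two padding zeros followed by a single word id. The training windows built earlier always contain full 3-word contexts, so the model has likely never seen padded inputs, which may be one reason the one-word queries produce weaker continuations.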

In [36]:
print('\n\n===>Enter --exit to exit from the program')
while True:
    seed_text  = input('Enter string: ')
    if seed_text.lower() == '--exit':
        break
    else:
        out = gen_text(model, tokenizer, seq_len=seq_len, seed_text=seed_text, num_gen_words=5)
        print('Output: '+seed_text+' '+out)
        print()



===>Enter --exit to exit from the program
Enter string: discussions
Output: discussions as the arcade and the

Enter string: meetings
Output: meetings as hot for the restructuring

Enter string: presentation
Output: presentation financials to strong long and

Enter string: stock
Output: stock to the immediate value ability

Enter string: finance
Output: finance as trades and resource rental

Enter string: --exit


In [43]:
print('\n\n===>Enter --exit to exit from the program')
while True:
    seed_text  = input('Enter string: ')
    if seed_text.lower() == '--exit':
        break
    else:
        out = gen_text(model, tokenizer, seq_len=seq_len, seed_text=seed_text, num_gen_words=5)
        print('Output: '+seed_text+' '+out)
        print()



===>Enter --exit to exit from the program
Enter string: business plan is
Output: business plan is the largest term must be

Enter string: round table discussion
Output: round table discussion of the value of the

Enter string: business meetings are
Output: business meetings are expected to rise on the

Enter string: business meetings
Output: business meetings as exodus documents to manage

Enter string: business
Output: business and potentially mexico and introducing

Enter string: --exit


### Smaller dataset

Training results:

- Epoch 500/500 loss: 1.4459 - accuracy: 0.6336
- Each epoch takes about 80-100 seconds on GPU (RTX 2070 SUPER)

In [5]:
model = load_model('models/outlook_wordRNN_0713.h5')
with open('models/outlook_tokenizer_wordRNN_0713', 'rb') as f:
    tokenizer = load(f)
seq_len = 3 

print('\n\n===>Enter --exit to exit from the program')
while True:
    seed_text  = input('Enter string: ')
    if seed_text.lower() == '--exit':
        break
    else:
        out = gen_text(model, tokenizer, seq_len=seq_len, seed_text=seed_text, num_gen_words=5)
        print('Output: '+seed_text+' '+out)
        print()



===>Enter --exit to exit from the program
Enter string: meeting
Output: meeting meeting multi singapore singapore designer

Enter string: discussions
Output: discussions have to one last not

Enter string: presentation
Output: presentation have to one last not

Enter string: stock
Output: stock have to one last not

Enter string: finance
Output: finance have to one last not

Enter string: --exit


In [6]:
print('\n\n===>Enter --exit to exit from the program')
while True:
    seed_text  = input('Enter string: ')
    if seed_text.lower() == '--exit':
        break
    else:
        out = gen_text(model, tokenizer, seq_len=seq_len, seed_text=seed_text, num_gen_words=5)
        print('Output: '+seed_text+' '+out)
        print()



===>Enter --exit to exit from the program
Enter string: business plan is
Output: business plan is have until be only the

Enter string: round table discussion
Output: round table discussion have to one last not

Enter string: business meetings are
Output: business meetings are have data designer protect this

Enter string: business meetings
Output: business meetings have to one last not

Enter string: --exit


## Conclusion

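- The model trained on the larger dataset gives more varied and more plausible continuations. The model trained on the smaller dataset falls back to the same phrase ("have to one last not") for most queries, even though its training accuracy is higher (0.63 vs 0.40)
- The predictions are not very accurate: the larger model reaches only about 0.40 accuracy, and that figure is measured on the training data, with no held-out validation set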