# Autosuggestion - Predicting the next character using an RNN

Adapted from: https://medium.com/towards-artificial-intelligence/sentence-prediction-using-word-level-lstm-text-generator-language-modeling-using-rnn-a80c4cda5b40


### Approach

- Train on a bigger dataset (17000 sentences) and on a smaller dataset (1700 sentences)
    - Goal: to identify whether a larger training set improves prediction accuracy
    - Bigger dataset: a sample of Enron emails; smaller dataset: personal Acronis emails made up mostly of newsletters
    
### Steps

1. Get sentence tokens and clean sentences
2. Remove low-frequency words and generate training sequences (10 characters each)
3. Train model and predict

### Setup

- Python version: Python 3.7.4
- Set up virtual environment
    - pip install --user virtualenv
    - virtualenv tensorflow-gpu
    - tensorflow-gpu\Scripts\activate
- Add Jupyter Notebook to virtual environment
    - pip install ipykernel
    - python -m ipykernel install --name=tensorflow-gpu
- Set up TensorFlow 2.1 (GPU version)
    - https://www.tensorflow.org/install/gpu
    - pip install tensorflow==2.1 
    - CUDA 10.1, cuDNN 7.6.5
- pip install pandas

## Pre-processing

In [1]:
# Read the raw training text and print it for inspection
with open("enron_train", "r") as f:
    content = f.read()
print(content)

here is our forecast
traveling to have a business meeting takes the fun out of the trip
especially if you have to prepare a presentation
i would suggest holding the business plan meetings here then take a trip without any formal business meetings
i would even try and get some honest opinions on whether a trip is even desired or necessary as far as the business meetings i think it would be more productive to try and stimulate discussions across the different groups about what is working and what is not
too often the presenter speaks and the others are quiet just waiting for their turn
the meetings might be better if held in a round table discussion format
my suggestion for where to go is austin
play golf and rent a ski boat and jet skis
flying somewhere takes too much time
test successful
way to go
nan
randy can you send me a schedule of the salary and level of everyone in the scheduling group
plus your thoughts on any changes that need to be made
patti s for examplephillip
let us shoot

In [2]:
import re
# Cleaning the data: keep only ASCII letters and spaces
clean_data = []
for text in content.splitlines():
    text = text.replace('_', ' ')
    # A-Z, not A-z: the range A-z would also keep [ \ ] ^ and `
    a = re.sub(r'[^a-zA-Z ]+', '', text).strip()
    if len(a) > 0:
        clean_data.append(a)
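
One detail worth checking in a cleaning regex of this kind is the letter range. The range `A-z` covers ASCII 65 to 122, which also includes `[`, `\`, `]`, `^`, `_` and the backtick. A minimal sketch of the difference, using a made-up sample string rather than the Enron data:

```python
import re

sample = "re: [fwd] q3^2 `plan`"   # hypothetical input line

# 'A-z' spans ASCII 65..122, so [ \ ] ^ _ ` fall inside the range and survive
loose = re.sub(r'[^a-zA-z ]+', '', sample).strip()

# 'a-z' plus 'A-Z' keeps letters and spaces only
strict = re.sub(r'[^a-zA-Z ]+', '', sample).strip()

print(repr(loose))
print(repr(strict))
```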

In [3]:
# Removing the lines which are too short or too long
short_data = []
for line in clean_data:
    if 2 <= len(line.split()) <= 150:
        short_data.append(line)

In [4]:
# Counting the occurrences of each word in the corpus, along with the total number of words
word2count = {}
total_words = 0
for text in short_data:
    for word in text.split():
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
        total_words += 1
print(total_words)

342066


In [5]:
# creating a list that will only contain the words that appear at least 15 times and are longer than one character
word15 = []
threshold = 15
for word, count in word2count.items():
    if count >= threshold:
        if len(word) > 1:
            word15.append(word)
print(len(word15))

2526
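
The counting and thresholding in the two cells above can also be written with `collections.Counter` from the standard library. This is a sketch on a tiny made-up corpus with a threshold of 2, not the notebook's data, and `frequent_words` is a hypothetical helper name:

```python
from collections import Counter

def frequent_words(lines, threshold):
    """Return (total word count, words seen >= threshold times and longer than 1 char)."""
    counts = Counter(word for line in lines for word in line.split())
    total = sum(counts.values())
    keep = [w for w, c in counts.items() if c >= threshold and len(w) > 1]
    return total, keep

# Made-up corpus for illustration only
toy = ["send me a schedule", "send a plan", "a plan for the schedule"]
total, keep = frequent_words(toy, threshold=2)
print(total, keep)
```

The number of unique words, if needed, is `len(counts)`; `total` corresponds to `total_words` in the cell above.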


In [6]:
# Removing from each line the words that appear fewer than 15 times
data_15 = []
for line in short_data:
    str1=''
    for word in line.split():
        if word in word15:
            str1 = " ".join((str1, word))
    data_15.append(str1.lstrip())
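
Because `word15` is a list, every `word in word15` test scans the whole list. A sketch of the same filtering step with a set, which gives constant-time membership checks; the sample inputs and the `filter_lines` name are made up for illustration:

```python
def filter_lines(lines, vocabulary):
    """Drop every word that is not in `vocabulary`, keeping word order."""
    vocab = set(vocabulary)          # set lookup instead of a list scan
    return [" ".join(w for w in line.split() if w in vocab) for line in lines]

# Hypothetical inputs
kept_words = ["send", "schedule", "plan"]
lines = ["send me a schedule", "for the record"]
print(filter_lines(lines, kept_words))
```

As in the cell above, a line with no surviving words becomes an empty string and is left for the later length filter to remove.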

In [7]:
# Removing the lines which are too short or too long after removing the unnecessary words
short_data_consize = []
for line in data_15:
    if 3 <= len(line.split()) <= 150:
        short_data_consize.append(line)

In [8]:
# Defining a function to save data, one entry per line
def write_txt(name, data):
    # The with-block closes the file even if a write fails
    with open("{0}.txt".format(name), "w") as file1:
        for line in data:
            file1.write(line)
            file1.write('\n')

write_txt(name = 'enron_train_charRNN', data = short_data_consize)

## LSTM Model

In [11]:
def read_file(filepath):
    with open(filepath) as f:
        str_text = f.read()
    return str_text

text = read_file('enron_train_charRNN.txt')
line = text.split("\n")
# token_lst = [x.split() for x in line]

In [12]:
token_lst = line

In [13]:
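def generate_seq(tokens):
    # Each window holds 9 input characters plus 1 target character
    train_len = 9+1
    text_sequences = []
    # Slide a train_len-wide window over the line, one character at a time
    for i in range(train_len,len(tokens)):
        seq = tokens[i-train_len:i]
        text_sequences.append(seq)
    return text_sequences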

seq_lst = [generate_seq(x) for x in token_lst]
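
A small pure-Python sketch of the sliding-window step on a single made-up line may help when checking the sequence count. It assumes the same window length of 10; `char_windows` and its `include_last` flag are hypothetical names. With the range used in the cell above, a line of n characters yields n - 10 windows and the last character of the line never appears as a target; extending the range by one yields n - 9 windows.

```python
def char_windows(line, train_len=10, include_last=False):
    """Return every train_len-character window of `line`.

    include_last=False mirrors range(train_len, len(line));
    include_last=True also emits the window ending at the last character.
    """
    stop = len(line) + 1 if include_last else len(line)
    return [line[i - train_len:i] for i in range(train_len, stop)]

line = "takes the fun"                          # 13 characters
print(char_windows(line))                       # 3 windows
print(char_windows(line, include_last=True))    # 4 windows
```

Lines shorter than the window length produce no windows at all in either variant.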

In [14]:
flatten = [item for sublist in token_lst for item in sublist]

sequences = {}
count = 1
for i in range(len(flatten)):
    if flatten[i] not in sequences:
        sequences[flatten[i]] = count
        count += 1

In [15]:
flatten_seq = [item for sublist in seq_lst for item in sublist]

In [16]:
flatten_seq[40:50]

['ting takes',
 'ing takes ',
 'ng takes t',
 'g takes th',
 ' takes the',
 'takes the ',
 'akes the f',
 'kes the fu',
 'es the fun',
 's the fun ']

In [17]:
text_sequences = flatten_seq
train_len = 9+1

In [18]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level=True, oov_token='UNK')
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences) 

#Collecting some information   
vocabulary_size = len(tokenizer.word_counts)

import numpy as np
n_sequences = np.empty([len(sequences),train_len], dtype='int32')
for i in range(len(sequences)):
    n_sequences[i] = sequences[i]
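
The `+2` later applied to `vocabulary_size` follows from how the Keras `Tokenizer` numbers tokens: index 0 is never assigned (it is the padding value), the OOV token takes index 1, and the observed characters follow in order of decreasing frequency. The pure-Python sketch below imitates that numbering on a toy input; `build_char_index` is a hypothetical helper, not part of Keras.

```python
from collections import Counter

def build_char_index(texts, oov_token="UNK"):
    """Imitate the index layout of a char-level Tokenizer with an OOV token."""
    counts = Counter(ch for text in texts for ch in text)
    # OOV first (index 1), then characters by decreasing frequency; 0 is never assigned
    ordered = [oov_token] + [ch for ch, _ in counts.most_common()]
    return {tok: i for i, tok in enumerate(ordered, start=1)}, len(counts)

char_index, vocabulary_size = build_char_index(["aab", "abc"])
num_classes = max(char_index.values()) + 1   # largest index, plus one slot for index 0
print(char_index, vocabulary_size, num_classes)
```

Here three distinct characters give 3 + 2 = 5 classes. By the same arithmetic, 27 distinct characters (consistent with 26 letters plus the space) give 29, the width of the final Dense layer in the model summary.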

In [19]:
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

train_inputs = n_sequences[:,:-1]
train_targets = n_sequences[:,-1]

train_targets = to_categorical(train_targets, num_classes=vocabulary_size+2)
seq_len = train_inputs.shape[1]
train_inputs.shape

def create_model(vocabulary_size, seq_len):
    model = Sequential()
    model.add(Embedding(vocabulary_size, seq_len,input_length=seq_len))
    model.add(LSTM(50,return_sequences=True))
    model.add(LSTM(50))
    model.add(Dense(50,activation='relu'))
    model.add(Dense(vocabulary_size,activation='softmax'))
    opt_adam = Adam(learning_rate=0.001)
    # You can simply pass 'adam' to the optimizer argument of compile; its default learning rate is 0.001.
    # Here the Adam class is used instead so that the learning rate can be changed.
    model.compile(loss='categorical_crossentropy',optimizer=opt_adam,metrics=['accuracy'])
    model.summary()
    return model

In [20]:
model = create_model(vocabulary_size+2,seq_len)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 9, 9)              261       
_________________________________________________________________
lstm (LSTM)                  (None, 9, 50)             12000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense (Dense)                (None, 50)                2550      
_________________________________________________________________
dense_1 (Dense)              (None, 29)                1479      
Total params: 36,490
Trainable params: 36,490
Non-trainable params: 0
_________________________________________________________________
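
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers, with the sizes read off the summary (29 classes, sequence length 9, 50 units); the helper names are made up.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

layers = [
    embedding_params(29, 9),   # embedding
    lstm_params(9, 50),        # lstm
    lstm_params(50, 50),       # lstm_1
    dense_params(50, 50),      # dense
    dense_params(50, 29),      # dense_1
]
print(layers, sum(layers))
```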


In [22]:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.models import load_model

path = 'models/enron_charRNN_0714.h5'
checkpoint = ModelCheckpoint(path, monitor='loss', verbose=1, save_best_only=True, mode='min')
stopping = EarlyStopping(monitor='loss', mode='min', verbose=1, patience=3)
model.fit(train_inputs,train_targets,batch_size=128,epochs=500,verbose=1,callbacks=[checkpoint, stopping])

Train on 1383145 samples
Epoch 1/500
Epoch 00001: loss improved from inf to 1.98667, saving model to models/enron_charRNN_0714.h5
Epoch 2/500
Epoch 00002: loss improved from 1.98667 to 1.48603, saving model to models/enron_charRNN_0714.h5
Epoch 3/500
Epoch 00003: loss improved from 1.48603 to 1.33409, saving model to models/enron_charRNN_0714.h5
Epoch 4/500
Epoch 00004: loss improved from 1.33409 to 1.25818, saving model to models/enron_charRNN_0714.h5
Epoch 5/500
Epoch 00005: loss improved from 1.25818 to 1.21224, saving model to models/enron_charRNN_0714.h5
Epoch 6/500
Epoch 00006: loss improved from 1.21224 to 1.18091, saving model to models/enron_charRNN_0714.h5
Epoch 7/500
Epoch 00007: loss improved from 1.18091 to 1.15886, saving model to models/enron_charRNN_0714.h5
Epoch 8/500
Epoch 00008: loss improved from 1.15886 to 1.14199, saving model to models/enron_charRNN_0714.h5
Epoch 9/500
Epoch 00009: loss improved from 1.14199 to 1.12858, saving model to models/enron_charRNN_0714.h

Epoch 28/500
Epoch 00028: loss improved from 1.04274 to 1.04070, saving model to models/enron_charRNN_0714.h5
Epoch 29/500
Epoch 00029: loss improved from 1.04070 to 1.03872, saving model to models/enron_charRNN_0714.h5
Epoch 30/500
Epoch 00030: loss improved from 1.03872 to 1.03696, saving model to models/enron_charRNN_0714.h5
Epoch 31/500
Epoch 00031: loss improved from 1.03696 to 1.03572, saving model to models/enron_charRNN_0714.h5
Epoch 32/500
Epoch 00032: loss improved from 1.03572 to 1.03397, saving model to models/enron_charRNN_0714.h5
Epoch 33/500
Epoch 00033: loss improved from 1.03397 to 1.03254, saving model to models/enron_charRNN_0714.h5
Epoch 34/500
Epoch 00034: loss improved from 1.03254 to 1.03060, saving model to models/enron_charRNN_0714.h5
Epoch 35/500
Epoch 00035: loss improved from 1.03060 to 1.02969, saving model to models/enron_charRNN_0714.h5
Epoch 36/500
Epoch 00036: loss improved from 1.02969 to 1.02803, saving model to models/enron_charRNN_0714.h5
Epoch 37/5

Epoch 55/500
Epoch 00055: loss improved from 1.01170 to 1.01128, saving model to models/enron_charRNN_0714.h5
Epoch 56/500
Epoch 00056: loss improved from 1.01128 to 1.01081, saving model to models/enron_charRNN_0714.h5
Epoch 57/500
Epoch 00057: loss improved from 1.01081 to 1.00997, saving model to models/enron_charRNN_0714.h5
Epoch 58/500
Epoch 00058: loss improved from 1.00997 to 1.00942, saving model to models/enron_charRNN_0714.h5
Epoch 59/500
Epoch 00059: loss improved from 1.00942 to 1.00847, saving model to models/enron_charRNN_0714.h5
Epoch 60/500
Epoch 00060: loss improved from 1.00847 to 1.00832, saving model to models/enron_charRNN_0714.h5
Epoch 61/500
Epoch 00061: loss improved from 1.00832 to 1.00813, saving model to models/enron_charRNN_0714.h5
Epoch 62/500
Epoch 00062: loss improved from 1.00813 to 1.00752, saving model to models/enron_charRNN_0714.h5
Epoch 63/500
Epoch 00063: loss improved from 1.00752 to 1.00694, saving model to models/enron_charRNN_0714.h5
Epoch 64/5

Epoch 110/500
Epoch 00110: loss improved from 0.99249 to 0.99192, saving model to models/enron_charRNN_0714.h5
Epoch 111/500
Epoch 00111: loss did not improve from 0.99192
Epoch 112/500
Epoch 00112: loss did not improve from 0.99192
Epoch 113/500
Epoch 00113: loss improved from 0.99192 to 0.99139, saving model to models/enron_charRNN_0714.h5
Epoch 114/500
Epoch 00114: loss did not improve from 0.99139
Epoch 115/500
Epoch 00115: loss improved from 0.99139 to 0.99124, saving model to models/enron_charRNN_0714.h5
Epoch 116/500
Epoch 00116: loss improved from 0.99124 to 0.99084, saving model to models/enron_charRNN_0714.h5
Epoch 117/500
Epoch 00117: loss did not improve from 0.99084
Epoch 118/500
Epoch 00118: loss did not improve from 0.99084
Epoch 119/500
Epoch 00119: loss improved from 0.99084 to 0.99078, saving model to models/enron_charRNN_0714.h5
Epoch 120/500
Epoch 00120: loss improved from 0.99078 to 0.99007, saving model to models/enron_charRNN_0714.h5
Epoch 121/500
Epoch 00121: lo

<tensorflow.python.keras.callbacks.History at 0x1c35b643808>
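
The EarlyStopping callback above watches the training loss with `patience=3`, so training ends after three consecutive epochs without a new minimum. A pure-Python sketch of that patience rule, run on made-up loss values rather than the log above; `stopping_epoch` is a hypothetical helper that ignores options such as `min_delta`:

```python
def stopping_epoch(losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0       # new minimum: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up losses for illustration only
toy_losses = [1.20, 1.10, 1.05, 1.06, 1.04, 1.05, 1.045, 1.041, 1.00]
print(stopping_epoch(toy_losses))
```

Since both callbacks monitor `loss` rather than a validation metric, this rule measures fit to the training set only. It would be consistent with the run ending at epoch 123, three epochs after the improvement logged at epoch 120.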

In [23]:
# model.save(path)

from pickle import dump
dump(tokenizer,open('models/enron_tokenizer_charRNN_0714','wb'))  

## Prediction

### Larger dataset

Training results:

- Epoch 123/500 loss: 0.9901 - accuracy: 0.6850
- Each training epoch takes 2.5-3 minutes on GPU (RTX 2070 SUPER)

In [24]:
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from pickle import load

model = load_model('models/enron_charRNN_0714.h5')
tokenizer = load(open('models/enron_tokenizer_charRNN_0714','rb'))
seq_len = 9
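def gen_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    # Generates num_gen_words characters (the tokenizer is char-level), one per step
    output_text = []
    input_text = seed_text
    for i in range(num_gen_words):
        # Encode the running text and keep only the last seq_len characters
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len,truncating='pre')
        pred_word_ind = model.predict_classes(pad_encoded,verbose=0)[0]
        
        # Map the predicted index back to its character
        pred_word = tokenizer.index_word[pred_word_ind]
        # NOTE: the ' ' separators below apparently come from the word-level version;
        # a space is fed back into the model and printed between characters
        input_text += ' '+pred_word
        output_text.append(pred_word)
    return ' '.join(output_text)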
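
For a character-level model, a generation loop would usually append each predicted character directly, without a separator, and feed back only the last `seq_len` characters. The sketch below shows that loop in plain Python. `predict_next` is a hypothetical callable standing in for the tokenizer and model calls, and the toy predictor is made up so the example runs without TensorFlow.

```python
def generate_chars(predict_next, seed_text, num_chars, seq_len=9):
    """Extend seed_text by num_chars characters, one prediction at a time."""
    text = seed_text
    for _ in range(num_chars):
        context = text[-seq_len:]          # last seq_len characters only
        text += predict_next(context)      # no separator between characters
    return text[len(seed_text):]

# Toy stand-in for the model: picks the next character from the last one
toy_table = {"o": "n", "n": "s", "s": " "}

def toy_predict_next(context):
    return toy_table.get(context[-1:], " ")

print("discussio" + generate_chars(toy_predict_next, "discussio", 2))
```

With the notebook's objects, `predict_next` would wrap `tokenizer.texts_to_sequences`, `pad_sequences`, and the model's prediction, then map the predicted index back through `tokenizer.index_word`.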

In [25]:
print('\n\n===>Enter --exit to exit from the program')
while True:
    seed_text  = input('Enter string: ')
    if seed_text.lower() == '--exit':
        break
    else:
        out = gen_text(model, tokenizer, seq_len=seq_len, seed_text=seed_text, num_gen_words=5)
        print('Output: '+seed_text+' '+out)
        print()



===>Enter --exit to exit from the program
Enter string: enron
Output: enron u p h e f

Enter string: email
Output: email   s u n a

Enter string: discussions
Output: discussions   s u n a

Enter string: discuss
Output: discuss   c a n a

Enter string: discussio
Output: discussio n e m a r

Enter string: --exit


### Smaller dataset

Training results:

- Epoch 500/500 loss: 0.1652 - accuracy: 0.9321
- Each training epoch takes 70-80 seconds on GPU (RTX 2070 SUPER)

In [26]:
model = load_model('models/outlook_charRNN_0713.h5')
tokenizer = load(open('models/outlook_tokenizer_charRNN_0713','rb'))
seq_len = 9

In [27]:
print('\n\n===>Enter --exit to exit from the program')
while True:
    seed_text  = input('Enter string: ')
    if seed_text.lower() == '--exit':
        break
    else:
        out = gen_text(model, tokenizer, seq_len=seq_len, seed_text=seed_text, num_gen_words=5)
        print('Output: '+seed_text+' '+out)
        print()



===>Enter --exit to exit from the program
Enter string: acronis
Output: acronis   l l o u

Enter string: email
Output: email   l o r o

Enter string: discussions
Output: discussions y o u n e

Enter string: discuss
Output: discuss   p t o r

Enter string: discussio
Output: discussio n p u n g

Enter string: --exit


## Conclusion

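- The output comes back as individual letters separated by spaces
    - gen_text joins the predicted characters with ' ' and appends ' ' to the seed before every step, which accounts for the spaces in the printed output
    - Need to revisit the modeling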