# LSTM Word Model

This notebook builds on the ideas of the previous character model but uses a word model instead. The tweets in the data set are tokenized into words, and instead of training on character sequences, the model trains on word sequences taken from the tweets themselves, which should give a more coherent text bot. Or at least that is the goal. A few more features are added to the model, such as embeddings, which I will go into in the appropriate sections.

# Imports

In [1]:
import numpy as np
#import sys
import re
import unicodedata
import pandas as pd
import keras.utils as ku
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Embedding
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


For this notebook I decided to use Keras's tokenizer because it is more useful than the simple NLTK tokenizers, which just produce a list. The Keras tokenizer creates an object with other accessible attributes, as you will see in a few sections.

# Functions

In [2]:
def get_sequence_of_tokens(corpus):
    """Fits a Keras tokenizer on a corpus (here, the list of tweets) and
    builds the n-gram sequences used to train the model with the tokenizer's
    texts_to_sequences method. Returns the input sequences and the total
    number of words (vocabulary size plus one for the padding index 0)."""
    
    # corpus is a list of strings, so lowercase each tweet individually
    corpus = [line.lower() for line in corpus]
    t = Tokenizer()
    t.fit_on_texts(corpus)
    total_words = len(t.word_index) + 1
    
    # converts the corpus into a flat list of n-gram sequences:
    # every prefix of a tweet that holds at least two tokens
    input_sequences = []
    for line in corpus:
        token_list = t.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
            
    return input_sequences, total_words

If you compare this notebook with the previous character model you will notice several differences. Let's start with the function above. It performs roughly the same tokenization process as the previous notebook, but with one change: the tweets are tokenized into words, and instead of making character sequences, the function builds n-gram sequences out of each tweet. You can think of an n-gram as a sequence of N words. A 2-gram has two words, a 3-gram has three, and so on. N-grams are used to assign probabilities to sentences and sequences of words based on relative frequency counts.
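To make the n-gram step concrete, here is a minimal sketch of the prefix expansion that the inner loop of get_sequence_of_tokens performs on one tokenized tweet. The helper name prefix_sequences and the token ids are made up for illustration; only the slicing logic mirrors the function above.

```python
def prefix_sequences(token_ids):
    """Return every prefix of token_ids that holds at least two tokens."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]


# Hypothetical token ids for a four-word tweet.
example = [12, 7, 45, 3]
sequences = prefix_sequences(example)

for seq in sequences:
    print(seq)
```

A tweet with k tokens yields k - 1 training sequences, so a single-word tweet contributes nothing.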

In [18]:
def generate_padded_sequences(input_sequences):
    """Pads sequences to the same length. Transforms lists of integers into a
    2d Numpy array of shape (num_samples, maxlen). Creates predictors and labels
    for the sequences. Assigns the labels to categorical variables. Returns
    predictors, label, and max sequence length."""
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen = max_sequence_len, padding = 'pre'))
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes = total_words)
    
    return predictors, label, max_sequence_len

Since not every tweet or sentence is the same length, it is good practice to pad the sequences so they all have the same length. The pad_sequences function from Keras does just that. To feed this data into a learning model, we need to create predictors and labels. The predictors are every token of a sequence except the last, and the label is that last token, i.e. the word that actually comes next. The model learns to give that word the highest probability.
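The padding and the predictor/label split can be sketched in plain Python. This is an illustration under assumptions, not the Keras implementation: pad_pre and one_hot are hypothetical helpers that imitate pad_sequences with padding='pre' and to_categorical on small lists.

```python
def pad_pre(seq, maxlen, pad_value=0):
    """Left-pad seq with pad_value up to maxlen (maxlen > 0).

    Sequences longer than maxlen keep their last maxlen items.
    """
    seq = seq[-maxlen:]
    return [pad_value] * (maxlen - len(seq)) + seq


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


# Hypothetical n-gram sequences built from one tweet.
sequences = [[12, 7], [12, 7, 45], [12, 7, 45, 3]]
max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]

# Everything but the last token is the predictor; the last token is the label.
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

print(predictors)
print(labels)
```

Padding on the left keeps the label in the last column of every row, which is why a single slice can separate predictors from labels.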


In [19]:
def generate_text(seed_text, next_words, model, max_seq_len):
    """Takes a seed text as input and predicts the next words. Tokenizes the seed
    text, pads the sequence, and passes it to the trained model for prediction.
    Relies on the tokenizer t that is fitted later in the notebook."""
    for _ in range(next_words):
        token_list = t.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_seq_len-1, padding='pre')
        
        predicted = model.predict_classes(token_list, verbose=0)
        
        output_word = ''
        
        for word,index in t.word_index.items():
            if index == predicted:
                output_word = word
                break
                
        seed_text = seed_text + " " + output_word
        
    return seed_text.title()

Since this notebook doesn't have to go from characters to numbers and back for the word prediction, a simpler generate_text function can be written. It takes the trained model and a seed_text, which you can change at the end of the notebook to try out several text generation seeds. Keras makes going from token indices back to words much simpler than doing it manually.
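The word lookup inside generate_text scans word_index on every step. A possible alternative, sketched here with made-up data, is to invert the dictionary once and pick the most probable index directly. The names next_word, index_to_word and probs are hypothetical; in more recent TensorFlow releases predict_classes is no longer available on Sequential models, so taking the argmax of the model.predict output in this way is the usual substitute.

```python
def next_word(probabilities, index_to_word):
    """Pick the highest-probability index and map it back to a word."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    # Index 0 is the padding slot and has no word, so fall back to ''.
    return index_to_word.get(best_index, "")


# Hypothetical vocabulary in the shape of Tokenizer.word_index.
word_index = {"please": 1, "dm": 2, "us": 3}
index_to_word = {index: word for word, index in word_index.items()}

# Hypothetical softmax output for one step (position 0 is padding).
probs = [0.05, 0.10, 0.70, 0.15]
print(next_word(probs, index_to_word))  # dm
```

This is greedy decoding: it always takes the single most likely word, which is also what the notebook's function does.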

# Tokenizing and Cleaning the Data

Originally I had wanted to open the file just like I did in the previous notebook and read through the csv directly. However, there was an issue with the Keras tokenizer, possibly related to the encoding or something along those lines. The issue made the tokenizer split every word into characters and use those as the tokens. Even after using a function to go through the file and convert it to ASCII, it still would not work. Since this is not what I wanted, I found a hacky way to make it work. If you instead read the csv file into a pandas data frame, you can use the columns of that data frame as a list of the tweets and tokenize those, and for some reason this works.
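One plausible explanation, which the notebook itself does not confirm: fit_on_texts iterates over its argument and treats every item as one document. A single string returned by open(...).read() is therefore walked character by character, while a list of strings is walked tweet by tweet. The toy function below is a stand-in for that loop, not the Keras code, and toy_fit is a hypothetical name.

```python
def toy_fit(texts):
    """Toy stand-in for a tokenizer's fitting loop: one document per item."""
    vocab = set()
    for text in texts:
        vocab.update(text.lower().split())
    return vocab


raw = "we can help"

as_string = toy_fit(raw)    # iterates over 'w', 'e', ' ', 'c', ...
as_list = toy_fit([raw])    # iterates over one whole sentence

print(sorted(as_string))
print(sorted(as_list))
```

If this is the cause, wrapping the text in a list (or splitting the file into a list of tweets) would have the same effect as going through the data frame columns.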

In [2]:
file = open('customer_service_data.csv', encoding='utf-8').read()

In [3]:
words = pd.read_csv('customer_service_data.csv')

In [5]:
words.head()

Unnamed: 0,"Our teams are now reporting that this is resolved! Downloads should be working normally, so please feel free to giv…",Minecraft Bedrock - iOS only: some players have said they can log in if they switched to cellular data. We are continuing…,We’ve heard that some of you are having trouble downloading purchased content. Our investigative teams are working…,We just received word that users should now be able to access purchased content again. We appreciate your reports.…,We understand some of you are also having trouble accessing purchased content and in-game content on the Xbox One.…,We've received word that some of you are having trouble downloading purchased content on the Xbox One. Our team is…,Our teams have let us know that you should now be able to download previously purchased content. Thank you for your…,We understand some of you might be having trouble downloading content you've purchased in the Store. Our teams are…,"Hello. Try unplugging your console from the electrical outlet for 5 minutes. Plug it back in, power up…",We have received word that users should now be able to view products listed on Xbox Marketplace. Thank you for you…,...,You must have had one heck of a power ride—there's magic within those glands of yours.,We're keen on tracking down your package—send a DM our way so we can dig in together.,We're picking up what you're putting down—you never know what our designers have up their sleeves.,"We vow to keep you comfortable, Jordie.","We're no psychic, but we predict a killer workout in your near future, Karlee.","We understand your in-store experience has been less than ideal, Sarah—we'd love to chat more about…",We're confident McDavid would be impressed.,Reasonable.,Took the words right out of our mouth.,We totally get it—rest assured your feedback is noted.


As you can see above, this is not the most beautiful output, but it is manageable: the tweets ended up as the column names of the data frame.

In [6]:
tweets = list(words.columns)

In [7]:
print(tweets)



In [8]:
# Strips accents (combining marks) from a unicode string
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.strip()

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  #w = '<start> ' + w + ' <end>'
  return w

In [9]:
t = Tokenizer()
t.fit_on_texts(tweets)

In [11]:
# A dictionary of words and their counts.
print(t.word_counts)



In [12]:
# A dictionary of words and how many sequences each appeared in.
print(t.word_docs)



In [13]:
# An integer count of the total number of sequences that were used to fit the Tokenizer (i.e. total number of documents)
print(t.document_count)

7942


In [14]:
# A dictionary of words and their uniquely assigned integers.
print(t.word_index)



In [15]:
print('Found %s unique tokens.' % len(t.word_index))

Found 7255 unique tokens.
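The attributes printed above can be imitated on a toy corpus to show why total_words is len(t.word_index) + 1. This is a simplified sketch: build_word_index is a hypothetical helper that only splits on whitespace, whereas the real Tokenizer also filters punctuation. What it shares with Keras is that indices start at 1 and are ranked by frequency, leaving 0 free for padding.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency and number them from 1 (0 stays free for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank + 1 for rank, (word, _) in enumerate(ranked)}


# Hypothetical three-tweet corpus.
texts = ["we can help", "we can chat", "we understand"]
word_index = build_word_index(texts)

# One extra slot so that index 0 (padding) fits in the output layer.
total_words = len(word_index) + 1

print(word_index)
print(total_words)
```

With 7255 unique tokens in the notebook, the same rule gives an output layer of 7256 units.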


In [16]:
input_sequences, total_words = get_sequence_of_tokens(tweets)

In [10]:
input_sequences[:10]

[[3033, 20],
 [3033, 20, 313],
 [3033, 20, 313, 9],
 [3033, 20, 313, 9, 3],
 [3033, 20, 313, 9, 3, 2173],
 [3033, 20, 313, 9, 3, 2173, 9],
 [3033, 20, 313, 9, 3, 2173, 9, 2174],
 [3033, 20, 313, 9, 3, 2173, 9, 2174, 3034],
 [3033, 20, 313, 9, 3, 2173, 9, 2174, 3034, 2175],
 [3033, 20, 313, 9, 3, 2173, 9, 2174, 3034, 2175, 33]]

The lists of integers above are the n-gram sequences generated from the corpus. Each one extends the previous sequence by one more token.

In [20]:
#pads sequences and gets data ready for the model
predictors, label, max_sequence_len = generate_padded_sequences(input_sequences)

# The Model

In [21]:
model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_sequence_len - 1))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(total_words, activation='softmax'))

This model has a lot in common with the previous notebook's model, with one exception: the Embedding layer. An embedding layer maps each integer word index to a dense vector (10 dimensions here), which is a far smaller input representation than a one-hot vector over the whole vocabulary. One can picture the Embedding layer as a matrix multiplication between a one-hot word vector and a weight matrix, which amounts to looking up one row of that matrix. In this case I hoped that it would allow the model to train faster.
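The matrix-multiplication view of an embedding can be checked on a tiny example. The weights below are invented and the helpers one_hot and matvec are hypothetical; the point is only that multiplying a one-hot vector by the weight matrix returns the same row as indexing the matrix directly.

```python
def one_hot(index, size):
    """One-hot row vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(cols)]


# Hypothetical embedding weights: 4 vocabulary slots, 3 dimensions each.
embedding_matrix = [
    [0.0, 0.0, 0.0],   # index 0, the padding slot
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

word_id = 2
via_matmul = matvec(one_hot(word_id, 4), embedding_matrix)
via_lookup = embedding_matrix[word_id]

print(via_matmul)
print(via_lookup)
```

In the notebook's model the matrix has total_words rows and 10 columns, and its values are learned during training rather than fixed.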

In [22]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
model.fit(predictors, label, epochs=25, batch_size=256, verbose=1)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25

In [None]:
filename = "word_vec_model_weights_saved.hdf5"
model.save_weights(filename)
print("saved model weights")

In [21]:
print(generate_text("why can't i see this page", 120, model, max_sequence_len))

Why Can'T I See This Page We Can Help With Your Order Please Dm Us With Your Name And Address And 1 Conta… … 1 3 3 3 3 3 3 3 3 Confirming To Be Availab… I… I… I… Came To The Refer Confirming Confirming Have A Moment Confirming Be Able… At… Is The Top 2Nd Day… I… Is The Stream 1 2 20 Tue I… Is The Top Confirming Plea… The Store And Manned Ch… Tue Is A Branded Gt Months Is A P… 1 Affecting To The Store Management I'Ve… Is Been 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 A Scam Second The Enforcement Team… Is A Scam We Want


# Summary

Several things can be taken away from this model. For starters, it takes much less time to train, so I was able to go through more epochs and potentially get a better result. More fine-tuning could be done, but that is always the case with NLP models: I could add more layers, more neurons, and so on. In this case, though, I believe that removing numbers from the tokens would lead the model to produce more text, and hopefully better text. There is a semblance of actual speech in this output. It isn't just the same sentence over and over again, which is what occurred in the character model. Keras also really simplified the process of building this model. In the following models I would like to explore GRUs as well as attention.
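As a rough sketch of the number-removal idea, the tweets could be filtered before they reach the tokenizer. The function strip_number_tokens is hypothetical and makes one particular choice: it drops only tokens made up entirely of digits, so mixed tokens such as "2nd" are kept. Whether those should go too is an open decision.

```python
import re


def strip_number_tokens(text):
    """Drop whitespace-separated tokens that consist only of digits."""
    kept = [tok for tok in text.split() if not re.fullmatch(r"\d+", tok)]
    return " ".join(kept)


cleaned = strip_number_tokens("is the stream 1 2 20 tue 2nd day")
print(cleaned)
```

Applied to the list of tweets, this would be something like tweets = [strip_number_tokens(tweet) for tweet in tweets] before calling fit_on_texts.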