# Character Tokenized RNN Model

The purpose of this notebook is to build my first text generation model. I did not expect very good text generation from it, but I was hopeful, and I wasn't immediately disappointed with the results: they were words, after all. I also knew I would go on to build more complex models that would ideally give better results. When I searched for text generation techniques, character-by-character generation came up over and over, which is why I chose it as the starting point. Each character is made into a token, and the tokens are grouped into fixed-length sequences so the model can learn which token tends to follow each sequence. These sequences are then used to train a Sequential model built from LSTM layers, Dropout layers, and a "softmax" activation function. After the hours it takes to train the model, it is used to predict the next 140 characters, the length of a tweet.

# Imports

In [1]:
import numpy
import sys
import re
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Embedding
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


Most of this project is done using the Keras library. Keras is a very helpful tool for designing and working with NLP models, and it runs on top of TensorFlow to make your life and mine easier. NLTK is another handy tool for tokenizing text and making it usable for NLP models. I tried out two tokenizers. The first was RegexpTokenizer: you describe the tokens you want with a regular expression and it creates them. The second was TweetTokenizer. Since I was dealing with tweets, I figured I might as well use this one. It is designed for tweet text and keeps tweet-specific items such as hashtags together as single tokens. It was also easy to use, since it works without any arguments.

# Tokenizing the data

Tokenization is splitting a string of text into smaller units such as individual words or terms. These smaller units are referred to as tokens. We cannot jump into the model building part without cleaning the text first. Neural networks cannot work with raw text data; the characters must be transformed into a series of numbers the RNN can interpret.
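To make the idea concrete, here is a minimal sketch in plain Python of two ways to cut the same string into tokens. The sample sentence is invented for illustration, and `str.split` is only a simple stand-in for a real tokenizer such as the NLTK ones.

```python
# Toy illustration of tokenization; the sentence is invented for this example.
text = "thanks for getting in touch"

# Character-level tokens: every character, including spaces, is a token.
char_tokens = list(text)

# Word-level tokens: split on whitespace (a simple stand-in for a real tokenizer).
word_tokens = text.split()

print(char_tokens[:6])
print(word_tokens)
print(len(char_tokens), len(word_tokens))
```

The same text produces many more tokens at the character level, but each token comes from a much smaller set of possible values.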

In [2]:
#read the whole file into one string; the with block closes the file afterwards
with open('customer_service_data.csv', encoding='utf-8') as f:
    file = f.read()

In [3]:
def tokenize_words(text):
    """Takes in a string of text and returns a list of tokens"""
    
    #lowercase everything to standardize it
    text = text.lower()
    
    #instantiate the tokenizer
    tokenizer = TweetTokenizer()
    tokens = tokenizer.tokenize(text)

    return tokens

In [4]:
#preprocess the input data, make tokens
processed_inputs = tokenize_words(file)
print(processed_inputs)



# Data Prep

Character-level models are quicker to train, require less memory, and have faster inference than word-based models, which is why this first model is character level. The features are limited to the characters that appear in the tokens, as opposed to every word that appears. Since the model needs numbers, not text characters, we need to convert the characters to numbers. We start by sorting the unique entries of processed_inputs, then use the enumerate function to assign each one a number and store the result in a dictionary.

In [5]:
chars = sorted(list(set(processed_inputs)))
char_to_num = dict((c, i) for i, c in enumerate(chars))

In [6]:
print(char_to_num)
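
As a small illustration of what this kind of mapping looks like, here is the same sort-then-enumerate idea applied to a toy input. The names `toy_tokens`, `toy_vocab`, `toy_to_num`, and `toy_to_char` are made up for this sketch and are not part of the notebook.

```python
# Toy version of the mapping; the sample tokens are invented for illustration.
toy_tokens = list("hello")

# Unique tokens in sorted order, so the numbering is reproducible.
toy_vocab = sorted(set(toy_tokens))

# Token -> integer, plus the inverse mapping needed to decode predictions.
toy_to_num = {c: i for i, c in enumerate(toy_vocab)}
toy_to_char = {i: c for i, c in enumerate(toy_vocab)}

# Round trip: encode the tokens as integers, then decode them again.
encoded = [toy_to_num[c] for c in toy_tokens]
decoded = "".join(toy_to_char[i] for i in encoded)

print(toy_to_num)
print(encoded)
print(decoded)
```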



In [7]:
#need input and vocab len for later data prep
input_len = len(processed_inputs)
vocab_len = len(chars)
print("Total number of characters:", input_len)
print("Total vocab:", vocab_len)

Total number of characters: 184487
Total vocab: 6567


We now need to build a data set the model will understand. I have set the sequence length to 100, an arbitrary number that can be played around with.

In [8]:
seq_length = 100
x_data = []
y_data = []

In [9]:
#loop through the inputs, starting at the beginning and stopping at the
#last position that still leaves room for a full sequence plus its target
for i in range(0, input_len - seq_length, 1):
    #define the input and output sequences
    #input is the current character plus the next seq_length - 1 characters
    in_seq = processed_inputs[i:i + seq_length]
    
    #output is the single character that comes right after the input sequence
    out_seq = processed_inputs[i+seq_length]
    
    #convert the characters to integers using char_to_num and add the values
    #to our lists
    x_data.append([char_to_num[char] for char in in_seq])
    y_data.append(char_to_num[out_seq])

Now we have our input sequences of characters and our output, which is the character that should come after the sequence ends. Training data and labels are stored as x_data and y_data. 
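
The sliding window is easier to see on a tiny input. The sketch below repeats the same loop on a made-up six-token list with a window of 3; the `toy_` names are for illustration only.

```python
# Toy sliding window; the token list and window length are illustrative only.
toy_tokens = list("abcdef")
toy_seq_length = 3

toy_x, toy_y = [], []
for i in range(len(toy_tokens) - toy_seq_length):
    # Input: toy_seq_length consecutive tokens starting at position i.
    toy_x.append(toy_tokens[i:i + toy_seq_length])
    # Target: the single token that follows that window.
    toy_y.append(toy_tokens[i + toy_seq_length])

for window, target in zip(toy_x, toy_y):
    print(window, "->", target)

# The number of patterns is always len(tokens) - seq_length.
print(len(toy_x))
```

The same arithmetic applies to the real data: 184487 inputs with a sequence length of 100 give 184487 - 100 = 184387 patterns.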

In [10]:
#total number of input sequences
n_patterns = len(x_data)
print("Total Patterns:", n_patterns)

Total Patterns: 184387


In [11]:
#reshape to work in network
X = numpy.reshape(x_data, (n_patterns, seq_length, 1))
X = X/float(vocab_len)

In [12]:
#one-hot encode the label data
y = np_utils.to_categorical(y_data)
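
For readers who want to see what the scaling and one-hot steps do to the numbers, here is a rough plain-Python sketch on made-up values. The `one_hot` helper is hypothetical and written only for this illustration; it is not how Keras implements `to_categorical`.

```python
# Toy version of the scaling and one-hot steps in plain Python.
# The vocabulary size and integer sequences are made up for illustration.
toy_vocab_len = 4
toy_x = [[1, 0, 2], [0, 2, 2]]   # two input windows of length 3
toy_y = [2, 3]                   # the token that follows each window

# Scale every integer into the range [0, 1) by dividing by the vocabulary size.
scaled_x = [[value / toy_vocab_len for value in window] for window in toy_x]

def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    row = [0.0] * size
    row[index] = 1.0
    return row

# One row per label, one column per possible token.
encoded_y = [one_hot(label, toy_vocab_len) for label in toy_y]

print(scaled_x)
print(encoded_y)
```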

# The Model

I am going to assume that you have some experience with neural networks and that this is not your first rodeo. With that in mind, let's get into RNNs, or recurrent neural networks. RNNs can remember inputs from earlier steps in a sequence, while vanilla neural networks cannot. This makes RNNs useful for text processing, where one part of a series of inputs depends on what came before it. RNNs suffer from the vanishing gradient problem: their ability to preserve the context of earlier inputs degrades as the sequence gets longer, and irrelevant information accumulates and crowds out relevant information. LSTMs, or Long Short-Term Memory networks, are a kind of RNN designed to deal with this problem. An LSTM learns which information to keep and which to forget as unnecessary, so it can focus more on the data that matters.

In [13]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
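
The final Dense layer uses a softmax activation, which turns one raw score per vocabulary entry into probabilities that sum to 1. As a minimal sketch of that calculation in plain Python (the scores are invented, and this is not the Keras implementation):

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability; the result is unchanged.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for a three-entry vocabulary.
toy_scores = [2.0, 1.0, 0.1]
probs = softmax(toy_scores)

print(probs)
print(sum(probs))
```

The largest score always receives the largest probability, which is what the generation step relies on when it picks the most likely next value.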

There are more ways to build a model than I would like to think about; this is just the setup I decided to go with. You can add more layers, use fewer layers, and so on, but your model may not converge. This model already takes several hours to train, but feel free to play around with it on your own.

In [14]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [15]:
model.fit(X, y, epochs=10, batch_size=256, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x1f91d5e9e80>

In [16]:
#saves model weights so that you don't have to run the model again
filename = "model_weights_saved.hdf5"
model.save_weights(filename)
print("saved model weights")

saved model weights


Since the characters were converted to numbers earlier, the model's predictions need to be converted back to characters.

In [17]:
num_to_char = dict((i, c) for i, c in enumerate(chars))

To generate text, a random starting sequence (the seed) is chosen from the data set so the model has something to base its first prediction on.

In [18]:
start = numpy.random.randint(0, len(x_data) - 1)
pattern = list(x_data[start])  #copy so that generating text does not modify x_data
print("Random Seed:")
print("\"", ''.join([num_to_char[value] for value in pattern]), "\"")


Random Seed:
" yougomary-kate.there'salsoalinktozizzi'swebsitewhichwillshowtheirtermsaswell.th…,"hithere,thenintendo'sareinquiteshortsupplyatthemoment.ifyoucandmyourpostcode,ican…","hi,thanksforgettingintouch.weweredoingabitworkontescopay+yesterdayevening.thismayhav…","hellosandra&henry,thanksforgettingintouch.canyoudmmetheerrormessageyouaregetting… "


To finally generate the text, we ask the model to predict what comes next based on the random seed and convert the output from a number back to a character. The predicted value is then appended to the pattern and the oldest value is dropped, so the pattern always holds the most recent 100 values: first the seed alone, then the seed plus the generated characters. At each step the model picks the character it has assigned the highest probability of coming next.
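
The effect of always picking the most likely value can be sketched without a trained network. In the toy below, `toy_predict` is a hypothetical stand-in for `model.predict` that returns made-up probabilities over a three-entry vocabulary; everything in it is an assumption for illustration, not the notebook's model.

```python
# Hypothetical stand-in for model.predict: made-up probabilities over a
# 3-entry vocabulary that depend only on the last value in the window.
TOY_PROBS = {
    0: [0.1, 0.7, 0.2],  # after 0, the most likely next value is 1
    1: [0.2, 0.1, 0.7],  # after 1, the most likely next value is 2
    2: [0.6, 0.3, 0.1],  # after 2, the most likely next value is 0
}

def toy_predict(window):
    return TOY_PROBS[window[-1]]

def greedy_generate(seed, steps):
    window = list(seed)  # copy so the seed itself is not modified
    generated = []
    for _ in range(steps):
        probs = toy_predict(window)
        # argmax: index of the highest probability
        index = max(range(len(probs)), key=probs.__getitem__)
        generated.append(index)
        # slide the window: drop the oldest value, add the new one
        window = window[1:] + [index]
    return generated

print(greedy_generate([2, 0], steps=7))
```

Because the choice is deterministic, the output repeats as soon as the window returns to a state it has already seen. That is one plausible way a generator like this can end up printing the same text over and over.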

In [19]:
for i in range(140):
    #reshape and scale the pattern the same way as the training data
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_len)
    #pick the index with the highest predicted probability
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = num_to_char[index]
    
    sys.stdout.write(result)
    
    #slide the window: add the new index and drop the oldest one
    pattern.append(index)
    pattern = pattern[1:len(pattern)]

","hithere,thanksforgettingintouch.iyouliketototototo.pleasedmus","hi,thankforgettingintouch.iyouliketototototo.pleasedmus","hithere,thanksforgettingintouch.iyouliketototototo.pleasedmus","hithere,thanksforgettingintouch.iyouliketototototo.pleasedmus","hithere,thanksforgettingintouch.iyouliketototototo.pleasedmus","hithere,thanksforgettingintouch.iyouliketototototo.

# Summary 

There are many issues with this model, as can be seen from the output above. It seems to have gotten stuck in a loop and just keeps printing the same couple of words. This is probably because it is a character-based model. Even after all those hours of training, the model predicts the same few characters after the opening words. Originally I was using a data set of some 10,000 tweets that I pulled from Twitter with keywords like 'data science' or 'big data'. The model picked up that these words appear more often and only wanted to 'say' those words over and over. Now that I think I have found the issue causing this loop, I will build a new word-based model and see what output that gives. I will keep the structure of the model similar, but I do want to add new features that introduce more complexity but hopefully give a better result.