Deep neural networks can be used to classify text data just like they can be used to classify image data. The process for classifying text data is very similar to the process for classifying images. First, the data must be converted into a form that the network can learn from. Then training and testing sets must be created, and finally the classifier must be trained on the data.

This notebook examines several ways to create a text classifier, first demonstrating how to use word embeddings on a small dataset, and then adapting the process to a larger dataset. An LSTM network is also implemented for the sake of comparison. We'll be using Keras to implement these networks.

To begin with, we're going to need to load in the data and convert the words in our text data to representations that our deep learning model can work with. These representations are referred to as "word embeddings", and they are numerical vectors contained within a geometric space called "embedding space". Similar words will have similar representations in embedding space, and because of this the network is capable of learning patterns from them, including analogy-like relationships between words.
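
To make "similar representations" concrete, here is a minimal sketch that compares toy word vectors with cosine similarity. The three-dimensional vectors in `toy_embeddings` are hand-picked, hypothetical values for illustration only. Real embeddings are learned during training and usually have many more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "terrible": [-0.7, 0.9, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_terrible = cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"])
print(sim_good_great, sim_good_terrible)
```

With these made-up vectors, "good" and "great" point in nearly the same direction, while "good" and "terrible" do not.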

We'll start out by choosing a small dataset to work on, which in this case will be the Sentiment Labelled Sentences Data Set, found at the UCI Machine Learning Repository. It contains sentences taken from reviews on Yelp, IMDB, and Amazon, each labelled as positive or negative. We'll need to tokenize these sentences, and then we can form embeddings of them and pass them into our deep neural network.

Let's start out by importing all the libraries we will need.

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Dense, GlobalMaxPooling1D, LSTM
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.datasets import imdb
import numpy as np
from keras.preprocessing import sequence

The data is in the .txt format, but it will be easier to work with if we load it into a dataframe. We'll use Pandas to create the dataframe. Thankfully, we can distinguish between the text of a review and its label by splitting each line where there is a tab.

In [None]:
# Loads a tab-separated text file into a DataFrame and tags it with its source
def csv_convert(input_data, data_source):
    # separate labels and sentences on tabs
    df = pd.read_csv(input_data, names=['sentence', 'label'], sep='\t')
    df['source'] = data_source
    return df
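
To see what splitting on tabs does without needing the dataset files, here is a small sketch using only Python's standard `csv` module on two made-up lines in the same "sentence, tab, label" layout. The `sample` text is invented for illustration and is not taken from the dataset.

```python
import csv
import io

# Two made-up lines in the same "sentence<TAB>label" layout as the dataset files.
sample = "Great phone, works as described.\t1\nThe battery died after a week.\t0\n"

rows = []
# QUOTE_NONE treats quote characters inside a sentence as ordinary text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
for sentence, label in reader:
    rows.append({"sentence": sentence, "label": int(label)})

print(rows)
```

Note that the comma inside the first sentence is left alone, because only tabs are treated as separators.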

Now we just need to set the directory for the individual text files and convert them.

In [None]:
amz_data = "/sentiment labelled sentences/amazon_cells_labelled.txt"
imdb_data = "/sentiment labelled sentences/imdb_labelled.txt"
yelp_data = "/sentiment labelled sentences/yelp_labelled.txt"

amz_data = csv_convert(amz_data, "Amazon")
imdb_data = csv_convert(imdb_data, "IMDB")
yelp_data = csv_convert(yelp_data, "Yelp")

print(amz_data.head(5))

Let's join the three individual datasets into one large dataset. Then we can pull out the training features, which will be the sentences, and the labels from the completed data CSV.

In [None]:
df_complete = pd.concat([amz_data, imdb_data, yelp_data])
df_complete.to_csv("sentiment_labelled_complete.csv")
print(df_complete.head(10))

# Separate out the features and labels from the CSV

features = df_complete['sentence'].values
labels = df_complete['label'].values

We can now use the extremely useful `train_test_split` function from Scikit-learn to divide our features and labels up into training/testing features and training/testing labels.

In [None]:
# Creating training/testing features and labels
# Still need to tokenize them
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.20, random_state=108)
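
Conceptually, a seeded train/test split just shuffles the row indices reproducibly and cuts off a portion for testing. The sketch below illustrates that idea in plain Python. `simple_train_test_split` is a hypothetical helper written for this explanation and is not how Scikit-learn implements `train_test_split`.

```python
import random

def simple_train_test_split(features, labels, test_size=0.2, seed=108):
    """Shuffle indices with a fixed seed, then reserve a fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    # Features and labels are selected with the same indices, so pairs stay aligned.
    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

toy_sentences = ["s{}".format(i) for i in range(10)]
toy_sentiments = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_train_test_split(toy_sentences, toy_sentiments)
print(len(X_tr), len(X_te))
```

Fixing the seed plays the same role as `random_state` above: rerunning the split gives the same partition.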

Now that we have all the data split into training and testing sets, we need to tokenize the data. This creates numerical representations of the text data that our network can interpret. There are a variety of options for tokenizing data, but the simplest way to do this is probably to use the built-in `Tokenizer` that comes with Keras. First, we have to declare an instance of the tokenizer and tell it the maximum number of words we want to keep.

In [None]:
# Create the tokenizer, fit it on the training text only
# use texts_to_sequences to actually convert the features to tokens
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

# Need this for the creation of the network
vocab_len = len(tokenizer.word_index) + 1
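
The idea behind this kind of tokenization can be sketched in plain Python: build a word-to-integer index ordered by frequency, then replace each word with its integer. The functions `fit_word_index` and `to_sequences` below are simplified, hypothetical stand-ins written for illustration. They are not the Keras implementation, and the tie-breaking and punctuation handling are choices made here.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"[a-z0-9']+")

def fit_word_index(texts):
    """Assign integer ids by descending frequency; id 0 is left free for padding."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    # Most frequent first; ties broken alphabetically (a choice made for this sketch).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its id; unseen words are dropped."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in WORD_PATTERN.findall(text.lower()) if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]  # keep only the most frequent words
        sequences.append(ids)
    return sequences

toy_train_texts = ["The food was good", "The service was bad", "Good food"]
toy_index = fit_word_index(toy_train_texts)
print(toy_index)
print(to_sequences(["The food was awful"], toy_index))
```

The word "awful" never appeared in the text the index was fitted on, so it simply disappears from the sequence. This is why the tokenizer is fitted on the training data and then reused for the test data.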

Not every sequence will be the same length. This is a problem for the classifier, but we can get around it by "padding" the data. Padding inserts zeroes where there isn't data so that shorter sequences are brought up to a common length. We'll also set the maximum length to 100 tokens, so that any longer sequence is truncated to that length. Together, these two steps make sure all the sequences are the same length.

In [None]:
maxlen = 100
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

# Selects size/complexity of the embeddings
embedding_dim = 50
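
As a rough illustration of what post-padding with a maximum length does, here is a plain-Python sketch. `pad_post` is a hypothetical helper, not the Keras function. It assumes that over-long sequences are cut from the front, which, as far as I know, matches the default truncation behaviour of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad short sequences at the end; keep only the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append(seq + [value] * (maxlen - len(seq)))   # pad at the end
    return padded

toy_padded = pad_post([[3, 1, 4], [7, 2, 9, 5, 8, 6]], maxlen=5)
print(toy_padded)
```

Every row now has exactly five entries, so the batch can be stored as a rectangular array.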

Now we can set up our convolutional neural network, embed the data, and then train on the data. When creating the embedding layer, we set a dimension for the embeddings. The embedding dimension is the length of the vector used to represent each token. Larger vectors can capture more nuance for the classifier to learn patterns from, at the cost of more parameters to train.

After this, we insert a convolutional layer into the network. This convolutional layer is what actually analyzes our embedding data to find relevant patterns, detecting features likely to be important to the meaning of the sentence and hence the class/label. We'll then downsample its output, which is large and detailed, into a simpler summary. The global max pooling layer abstracts away less important information by keeping only the maximum value that each filter produces across the whole sequence. By taking the maximum value, the strongest evidence for each feature is maintained while the representation becomes much simpler.
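
The convolution and pooling steps described above can be sketched for a single filter in plain Python. This is a simplified illustration: each position holds one number rather than a full embedding vector, and the values in `toy_signal` and `toy_kernel` are made up.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding; apply ReLU to each window's dot product."""
    width = len(kernel)
    outputs = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]
        activation = sum(x * w for x, w in zip(window, kernel))
        outputs.append(max(0.0, activation))  # ReLU
    return outputs

def global_max_pool(feature_map):
    """Keep only the strongest response of the filter across the whole sequence."""
    return max(feature_map)

toy_signal = [0.1, 0.9, 0.8, 0.0, -0.5, 0.2, 0.1]
toy_kernel = [1.0, 1.0, -1.0]
feature_map = conv1d_valid(toy_signal, toy_kernel)
print(len(feature_map), global_max_pool(feature_map))
```

Seven input positions and a kernel of width three give five window positions, and pooling reduces those five responses to one number per filter.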

Next, we pass the pooled data into the dense, fully-connected layers in the model. These layers learn to map the patterns found by the convolutional layer to a prediction. For a problem this small, this many dense layers may actually be overkill and the model could overfit, but for this simple demonstration we can experiment with the results anyway. Finally, we just compile the model - with our chosen optimizer, loss, and metrics - and return it.

In [None]:
# Creates the text classification model
def sentiment_model(embedding_dim):
    model = Sequential()
    # include the word embedding layers
    model.add(Embedding(vocab_len, embedding_dim, input_length=maxlen))
    model.add(Conv1D(128, 5, activation='relu'))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(20, activation='relu'))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    print(model.summary())
    return model

model = sentiment_model(embedding_dim)
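
The shapes and parameter counts that `model.summary()` reports for the convolutional and dense layers can be checked by hand. The sketch below does that arithmetic, assuming the convolution uses stride 1 with no padding (the Keras default for `Conv1D`, to my knowledge). The embedding layer is left out because its size depends on the vocabulary found in the data. The names `seq_len` and `emb_dim` are local stand-ins for `maxlen` and `embedding_dim`.

```python
def conv1d_output_length(input_length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return input_length - kernel_size + 1

def dense_params(n_inputs, n_units):
    """One weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

seq_len, emb_dim = 100, 50      # same values as maxlen and embedding_dim above
filters, kernel_size = 128, 5   # same values as the Conv1D layer above

conv_steps = conv1d_output_length(seq_len, kernel_size)
conv_params = kernel_size * emb_dim * filters + filters
dense_stack_params = dense_params(filters, 20) + dense_params(20, 5) + dense_params(5, 1)

print(conv_steps, conv_params, dense_stack_params)
```

Most of the trainable weights outside the embedding sit in the convolutional layer; the dense stack is small by comparison.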

We'll now fit the model to train it, and save the training history it returns in a variable so that we can access metrics like accuracy and loss for each epoch.

In [None]:
records = model.fit(X_train, y_train, epochs=10, verbose=1, validation_data=(X_test, y_test), batch_size=10)

_, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
_, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
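
For reference, the loss and metric chosen when compiling the model can be written out in a few lines. This is a plain-Python sketch of binary cross-entropy and thresholded accuracy, not the Keras implementation. The labels and probabilities are made up, and the 0.5 threshold is an assumption stated in the code.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.1]  # made-up sigmoid outputs
toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = binary_accuracy(toy_labels, toy_probs)
print(toy_loss, toy_acc)
```

The third example is the only misclassified one, and it also contributes the largest share of the loss, since the loss penalizes confident mistakes more than accuracy does.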

We'll also make a function that can be used to visualize accuracy and loss for the training and validation sets. We can keep using this function for our other models.

In [None]:
def plot_history(history):
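    # Older Keras versions log 'acc'/'val_acc'; newer ones log 'accuracy'/'val_accuracy'
    acc = history.history.get('acc', history.history.get('accuracy'))
    val_acc = history.history.get('val_acc', history.history.get('val_accuracy'))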
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'r', label='Training acc')
    plt.plot(x, val_acc, 'b', label='Validation acc')
    plt.title('Accuracy Over Epochs')

    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'r', label='Training loss')
    plt.plot(x, val_loss, 'b', label='Validation loss')
    plt.title('Loss Over Epochs')
    plt.legend()

    plt.show()

Now we can plot the accuracy and loss.

In [None]:
plot_history(records)
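
When reading curves like these, a validation loss that starts rising while the training loss keeps falling is a common sign of overfitting. The sketch below shows one simple way to locate that turning point from a history dictionary. The numbers in `toy_history` are made up to resemble such a run and are not results from this notebook.

```python
def best_epoch(val_losses):
    """1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def epochs_since_best(val_losses):
    """How many epochs were trained after validation loss stopped improving."""
    return len(val_losses) - best_epoch(val_losses)

# Made-up loss values shaped like a typical overfitting run.
toy_history = {
    "loss":     [0.65, 0.40, 0.22, 0.10, 0.05],
    "val_loss": [0.60, 0.48, 0.46, 0.52, 0.61],
}
toy_best = best_epoch(toy_history["val_loss"])
toy_wasted = epochs_since_best(toy_history["val_loss"])
print(toy_best, toy_wasted)
```

The same functions could be applied to `records.history['val_loss']` to see after which epoch further training stopped helping.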

The rising validation loss could be indicative of overfitting. It seems likely that the dataset we are experimenting with is just too small for the complexity of our convolutional network. Now that we have a good idea of how to transform data and create an embedding network, let's create a classifier for a more complex problem. Keras has an IMDB review dataset built into it, intended for use with text classifiers. It contains reviews labeled as positive or negative, much like our other dataset.

Loading in this dataset is a bit convoluted, as the version of Keras used here contains a bug which prevents the dataset from being loaded properly: it calls NumPy's `load` function without allowing pickled data, which NumPy refuses by default. We can get around this by temporarily changing the default parameters of `np.load`. The lambda function you see below allows us to work around the bug. We can then easily load the data into training and testing sets, and if we are interested we can get an idea of how many labels and features we have.

In [None]:
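# Bypass bug in current version of Keras, allows use of imdb dataset
# temporarily makes numpy's load function default to allow_pickle=True
np_load_old = np.load
np.load = lambda *a, **k: np_load_old(*a, **dict(k, allow_pickle=True))

# num_words keeps only the 10,000 most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

# restore the original np.load now that the data is loaded
np.load = np_load_old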

# get all features and labels, check that there are the same number of each
features = np.concatenate((X_train, X_test), axis=0)
labels = np.concatenate((y_train, y_test), axis=0)
print(features.shape, labels.shape)

We need to pad the sequences of the data again. This time we'll use 500 words.

In [None]:
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

We can now define our new convolutional model. Like before, when we create the embedding layer we give it the vocabulary size (10,000 words here), the size of the embedding vectors it outputs, and the length of the input sequences.

In [None]:
def conv_model(max_words):

    model = Sequential()
    model.add(Embedding(10000, 64, input_length=max_words))
    model.add(Conv1D(128, 5, activation='relu'))
    model.add(Conv1D(64, 5, activation='relu'))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    print(model.summary())
    return model

model = conv_model(max_words)

Now like before we can just fit the model.

In [None]:
records = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=1)

Then we'll evaluate the model.

In [None]:
# model evaluation
accuracy = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (accuracy[1]*100))
plot_history(records)

Let's try one last thing. We're going to try implementing a specific type of recurrent neural network (RNN) called a Long Short-Term Memory network, or LSTM. LSTMs excel at handling sequential data, where the order of the data matters. LSTMs are able to "remember" information from earlier time steps in a sequence and take it into account at later steps.
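
To give a feel for how an LSTM "remembers", here is a sketch of a single LSTM cell working on one number per time step, using the standard gate equations. The weights in `toy_weights` are hypothetical, hand-set values chosen so the forget gate stays close to 1. A trained network learns its own weights, and real layers operate on vectors rather than scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell. `w` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = w[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical hand-set weights; a trained network learns these.
toy_weights = {
    "forget": (0.0, 0.0, 4.0),     # large bias keeps the forget gate near 1
    "input": (2.0, 0.0, -1.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h_state, c_state = 0.0, 0.0
for x_t in [1.0, 0.0, 0.0, 0.0]:   # a signal at the first step, then silence
    h_state, c_state = lstm_step(x_t, h_state, c_state, toy_weights)
print(h_state, c_state)
```

Even after three steps of zero input, the cell state still carries most of what was written at the first step, which is the sense in which the cell remembers earlier parts of the sequence.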

Word order matters when trying to make sense of the meaning of a sentence, so an LSTM could potentially be useful in this case. That said, the LSTM may not prove extremely useful here, as the presence of positive and negative words may have an outsized effect on the classification, more so than the order of the words in our training examples. Nonetheless, LSTMs are powerful tools and it's a good idea to be aware of their use cases.
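
A small sketch shows why word order can carry meaning that word counts alone miss. The two review sentences below are made up. They contain exactly the same words but read with opposite sentiment, so any representation that only counts words cannot tell them apart.

```python
from collections import Counter

def bag_of_words(sentence):
    """Word counts only; the order of the words is discarded."""
    return Counter(sentence.lower().split())

review_a = "the plot was good not bad"
review_b = "the plot was bad not good"

same_bag = bag_of_words(review_a) == bag_of_words(review_b)
same_order = review_a.split() == review_b.split()
print(same_bag, same_order)
```

A sequence model such as an LSTM sees the words in order and can, in principle, distinguish the two.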

We just define the LSTM model the same way that we define the other models, including LSTM layers instead.

In [None]:
def LSTM_model(max_words):
    model = Sequential()
    model.add(Embedding(10000, 32, input_length=max_words))
    model.add(LSTM(64, dropout=0.2, return_sequences=True))
    model.add(LSTM(128, dropout=0.2))
    model.add(Dense(20))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    print(model.summary())
    return model

model = LSTM_model(max_words)

Now we can fit the model.

In [None]:
records = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=1)

Finally, we just have to evaluate its performance.

In [None]:
# model evaluation
# evaluate() returns [loss, accuracy]; the accuracy is printed as a percentage
accuracy = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (accuracy[1]*100))

plot_history(records)