# Text Classification w/ Word Embeddings and CNN

Convolutional neural networks are effective at document classification, mainly because they can pick out salient features (e.g. tokens or sequences of tokens) regardless of where those features occur in the input sequence.
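
The position invariance described above can be illustrated with a toy example. This is only a sketch in plain Python: the scalar "documents" and the `kernel` values are made up for illustration, whereas a real model convolves over learned embedding vectors. It shows that a filter followed by global max pooling produces the same feature value wherever the pattern occurs.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence and return one activation per window."""
    width = len(kernel)
    return [
        sum(x * w for x, w in zip(sequence[i:i + width], kernel))
        for i in range(len(sequence) - width + 1)
    ]


def global_max_pool(feature_map):
    """Keep only the strongest activation, discarding where it occurred."""
    return max(feature_map)


# Toy 1-D "documents": the pattern [1, 2] appears at different positions.
kernel = [1, 2]                 # hypothetical filter that responds to the pattern
doc_a = [0, 0, 1, 2, 0, 0]      # pattern near the start
doc_b = [0, 0, 0, 0, 1, 2]      # same pattern at the end

feature_a = global_max_pool(conv1d(doc_a, kernel))
feature_b = global_max_pool(conv1d(doc_b, kernel))
print(feature_a, feature_b)     # both documents yield the same feature value
```

Because pooling keeps only the maximum activation, the downstream layers see that the pattern is present but not where it was found.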

The architecture consists of three key pieces:
    1. Word Embeddings
    2. Convolution Model
        - A feature extraction model that learns to extract salient 
          features from documents represented using a word embedding.
    3. Fully Connected Model
        - The interpretation of extracted features in terms of a 
          predictive output.

Yoav Goldberg highlights the CNN's role as a feature extractor in his book:

    ... the CNN is in essence a feature-extracting architecture. It 
    does not constitute a standalone, useful network on its own, but 
    rather is meant to be integrated into a larger network, and to be 
    trained to work in tandem with it in order to produce an end 
    result. The CNN layer's responsibility is to extract meaningful 
    sub-structures that are useful for the overall prediction task at 
    hand.
— Page 152, Neural Network Methods for Natural Language Processing, 2017.


The architecture is based on the approach used by Ronan Collobert, et al. in their paper Natural Language Processing (almost) from Scratch, 2011. In it, they develop a single end-to-end neural network model with convolutional and pooling layers for use across a range of fundamental natural language processing problems. 
    
    - Transfer function: rectified linear.
    - Kernel sizes: 2, 4, 5.
    - Number of filters: 100.
    - Dropout rate: 0.5.
    - Weight regularization (L2): 3.
    - Batch Size: 50.
    - Update Rule: Adadelta.

These configurations could be used to inspire a starting point for your own experiments.


## Dial in CNN hyperparameters 

Some hyperparameters matter more than others when tuning a convolutional neural network on your document classification problem. Ye Zhang and Byron Wallace performed a sensitivity analysis of the hyperparameters needed to configure a single-layer convolutional neural network for document classification. The study is motivated by their claim that the models are sensitive to their configuration.

The study makes a number of useful findings that could be used as a starting point for configuring shallow CNN models for text classification. The general findings were as follows:

    - The best choice between pre-trained Word2Vec and GloVe 
      embeddings differs from problem to problem, and both performed 
      better than one-hot encoded word vectors.
    - The size of the kernel is important and should be tuned for each 
      problem.
    - The number of feature maps is also important and should be 
      tuned.
    - 1-max pooling generally outperformed other types of pooling.
    - Dropout has little effect on the model performance.

They go on to provide more specific heuristics, as follows:

    - Use Word2Vec or GloVe word embeddings as a starting point and 
      tune them while fitting the model.
    - Grid search across different kernel sizes to find the optimal 
      configuration for your problem, in the range 1-10.
    - Search the number of filters from 100-600 and explore a 
      dropout of 0.0-0.5 as part of the same search.
    - Explore using tanh, relu, and linear activation functions.

The key caveat is that the findings are based on empirical results on binary text classification problems using single sentences as input.
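
The search ranges in the heuristics above can be written down as a small search space. This is a minimal sketch rather than the procedure used in the study: the step sizes, the `search_space` dictionary, and the `iter_configs` helper are assumptions made for illustration, and a full grid is enumerated only for simplicity.

```python
from itertools import product

# Search ranges taken from the heuristics above; the step sizes are assumptions.
search_space = {
    "kernel_size": list(range(1, 11)),          # 1-10
    "filters": list(range(100, 601, 100)),      # 100-600 in steps of 100
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],  # 0.0-0.5
    "activation": ["tanh", "relu", "linear"],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(search_space))
print(len(configs))   # number of candidate configurations in the full grid
print(configs[0])     # first candidate
```

Each yielded dictionary could be passed to a model-building function and scored with cross-validation. In practice a staged or random search is often cheaper than the full grid.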

## Character-Level CNNs

Text documents can be modeled at the character level using convolutional neural networks that are capable of learning the relevant hierarchical structure of words, sentences, paragraphs, and more. The promise of the approach is that the labor-intensive effort required to clean and prepare text could be avoided if a CNN can learn to abstract the salient details.

The model reads in one-hot encoded characters from a fixed-size alphabet. Encoded characters are read in blocks or sequences of 1,024 characters. A stack of 6 convolutional layers with pooling follows, with 3 fully connected layers at the output end of the network in order to make a prediction. The model achieves some success, performing better on problems that offer a larger corpus of text.
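
The input encoding described above can be sketched in plain Python. The `ALPHABET` below is hypothetical (the text only says the alphabet has a fixed size), and mapping unknown characters and padding positions to all-zero rows is an assumption of this sketch.

```python
# Hypothetical alphabet; the text above only says it has a fixed size.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?'"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
BLOCK_LENGTH = 1024  # characters read per block, as described above


def one_hot_encode(text, length=BLOCK_LENGTH):
    """Return `length` rows, each a one-hot vector over ALPHABET.

    Characters outside the alphabet and padding positions become all-zero rows.
    """
    text = text.lower()[:length]
    rows = []
    for position in range(length):
        row = [0] * len(ALPHABET)
        if position < len(text):
            index = CHAR_TO_INDEX.get(text[position])
            if index is not None:
                row[index] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("A great film!")
print(len(encoded), len(encoded[0]))   # block length, alphabet size
```

The resulting block-length by alphabet-size matrix plays the role that the padded sequence of word embeddings plays in a word-level model.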

## Deeper CNNs for Classification

Better performance can be achieved with very deep convolutional neural networks, although standard and reusable architectures have not yet been adopted for classification tasks. Alexis Conneau, et al. comment on the relatively shallow networks used for natural language processing and the success of much deeper networks used for computer vision applications.

Key to their approach is an embedding of individual characters, rather than a word embedding:
    
    We present a new architecture (VDCNN) for text processing which 
    operates directly at the character level and uses only small 
    convolutions and pooling operations. 
        — Very Deep Convolutional Networks for Text Classification, 
          2016.

Results on a suite of 8 large text classification tasks show better performance than shallower networks. Specifically, the model achieved state-of-the-art results on all but two of the datasets tested, at the time of writing.

Generally, they make some key findings from exploring the deeper architectural approach:

    - The very deep architecture worked well on small and large 
      datasets.
    - Deeper networks decrease classification error.
    - Max-pooling achieves better results than other, more 
      sophisticated types of pooling.
    - Going too deep degrades accuracy, so the shortcut 
      connections used in the architecture are important.

# CNN + Embedding Model for Sentiment Analysis


## Data Preparation

Suppose we are developing a system that predicts the sentiment of a textual movie review as either positive or negative. After the model is developed, we will need to make predictions on new textual reviews, which requires performing the same data preparation on those new reviews as on the training data.

### Clean and Tokenize Movie Reviews

In [1]:
from nltk.corpus import stopwords
import string
import re


# load doc into memory
def load_doc(filename):
    # a context manager closes the file even if reading fails
    with open(filename, 'r') as file:
        return file.read()

# turn a doc into clean tokens
def clean_doc(doc):
    tokens = doc.split()
    
    # Remove punctuation from each token
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    
    # Remove non-alphabetic tokens, stop words, and 1-letter words
    tokens = [word for word in tokens if word.isalpha()]
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [word for word in tokens if len(word) > 1]
    
    return tokens


# filename = '../data/txt_sentoken/pos/cv000_29590.txt' 
# text = load_doc(filename)
# tokens = clean_doc(text)
# print(tokens)

### Define a vocabulary

It is important to define a vocabulary of known words when using a text model. The more words, the larger the representation of documents, so it is important to constrain the words to only those believed to be predictive.

We can develop a vocabulary as a Counter, which is a dictionary mapping of words to their counts that is easy to update and query. Each document can be added to the counter (a new function called add_doc_to_vocab()) and we can step over all of the reviews in the positive directory and then the negative directory (a new function called process_docs()).

In [2]:
from os import listdir
from collections import Counter

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    doc = load_doc(filename)
    tokens = clean_doc(doc)
    vocab.update(tokens)

# add all training docs in a directory to the vocab
def process_docs(directory, vocab):
    print(directory)
    for filename in listdir(directory):
        # skip reviews reserved for the test set (file names starting with cv9)
        if filename.startswith('cv9'):
            continue
        path = directory + '/' + filename
        add_doc_to_vocab(path, vocab)
        
# Save vocab to file, one token per line
def save_list(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

# define vocab
vocab = Counter()
process_docs("../data/txt_sentoken/pos", vocab)
process_docs("../data/txt_sentoken/neg", vocab)

print(len(vocab))
print(vocab.most_common(25))

# Prune vocab
tokens = [k for k,c in vocab.items() if c >= 2]
print(len(tokens))

save_list(tokens, "../data/txt_sentoken/vocab.txt")


../data/txt_sentoken/pos
../data/txt_sentoken/neg
44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288)]
25767


## Train CNN w/ Embedding Layer
The real-valued vector representation for words can be learned while training the neural network. We can do this in the Keras deep learning library using the Embedding layer. The first step is to load the vocabulary. We will use it to filter out words from movie reviews that we are not interested in.

In [3]:
import numpy as np

# load doc into memory
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# Load vocab into memory
def load_vocab(filename):
    file = open(filename, 'r')
    txt = file.read()
    file.close()
    return txt

# tokenize doc and filter out words not in vocab
def cnn_clean_doc(doc, vocab):
    tokens = doc.split()
    
    re_punc = re.compile('[%s]' % re.escape(string.punctuation)) # regex to remove punctuation
    tokens = [re_punc.sub('', w) for w in tokens]
    
    tokens = ' '.join([w for w in tokens if w in vocab])
    return tokens

# Load all docs in a directory
def cnn_process_docs(dir_, vocab, is_train):
    docs = list()
    for fn in listdir(dir_):
        if is_train and fn.startswith('cv9'):
            continue
        if not is_train and not fn.startswith('cv9'):
            continue
        
        path = dir_ + '/' + fn
        doc = load_doc(path)
        tokens = cnn_clean_doc(doc, vocab)
        docs.append(tokens)
    return docs

# Load and clean a dataset
def load_clean_dataset(vocab, is_train):
    neg = cnn_process_docs('../data/txt_sentoken/neg', vocab, is_train)
    pos = cnn_process_docs("../data/txt_sentoken/pos", vocab, is_train)
    docs = neg + pos
    labels = np.array([0 for _ in range(len(neg))] + [1 for _ in range(len(pos))])
    return docs, labels
    

### Keras Embedding Layer Preparation
The next step is to encode each document as a sequence of integers. The Keras Embedding layer requires integer inputs where each integer maps to a single token that has a specific real-valued vector representation within the embedding. These vectors are random at the beginning of training, but become meaningful to the network as training progresses.

We can encode the training documents as sequences of integers using the Tokenizer class in the Keras API. First, we construct an instance of the class and then fit it on all documents in the training dataset. This builds a vocabulary of all tokens in the training dataset and a consistent mapping from words in the vocabulary to unique integers. We could just as easily develop this mapping ourselves using our vocabulary file. The create_tokenizer() function below will prepare a Tokenizer from the training data.
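
The remark above about developing the mapping ourselves can be sketched without Keras. The helpers `build_word_index` and `encode_text` are hypothetical names introduced here, and the toy vocabulary stands in for the contents of the vocabulary file. The Keras Tokenizer assigns indices by word frequency, whereas this sketch assigns them alphabetically, so the integers would differ even though the idea is the same.

```python
def build_word_index(vocab_words):
    """Map each word to a unique integer, starting at 1 (0 is kept for padding)."""
    return {word: i for i, word in enumerate(sorted(vocab_words), start=1)}


def encode_text(text, word_index):
    """Convert a whitespace-separated document to integers, dropping unknown words."""
    return [word_index[w] for w in text.split() if w in word_index]


# Toy vocabulary standing in for the contents of the vocabulary file
word_index = build_word_index({"film", "good", "plot", "bad"})
print(word_index)
print(encode_text("good film weak plot", word_index))
```

Words missing from the mapping, such as "weak" in the example, are simply dropped from the encoded sequence.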

#### Text Padding
We also need to ensure that all documents have the same length. We could truncate reviews to the smallest size, zero-pad (pad with the value 0) reviews to the maximum length, or use some hybrid. In this case, we will pad all reviews to the length of the longest review in the training dataset. First, we find the longest review using the max() function on the training dataset and take its length. We can then call the Keras function pad_sequences() to pad the sequences to the maximum length by adding 0 values on the end.
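
To make the effect of post-padding concrete, here is a plain-Python sketch of the behaviour described above. The function `pad_post` is a hypothetical stand-in for illustration, not the Keras implementation, and its handling of over-long sequences is an assumption chosen to mirror the Keras default of truncating from the front.

```python
def pad_post(sequences, max_length, value=0):
    """Pad each sequence with `value` at the end so all have `max_length` items.

    Sequences longer than `max_length` keep their last `max_length` items,
    an assumption chosen to mirror the Keras default of truncating='pre'.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-max_length:]
        padded.append(seq + [value] * (max_length - len(seq)))
    return padded


encoded = [[3, 2, 4], [1], [2, 3, 1, 4, 2]]
max_length = max(len(seq) for seq in encoded)
print(pad_post(encoded, max_length))
```

Truncation matters at prediction time, when a new review may be longer than the longest training review.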

In [4]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

def encode_docs(tokenizer, max_length, docs):
    encoded = tokenizer.texts_to_sequences(docs)
    return pad_sequences(encoded, maxlen=max_length, padding='post')

Using TensorFlow backend.


## CNN Model

We are now ready to define our neural network model. The model will use an Embedding layer as the first hidden layer. The Embedding layer requires the specification of the vocabulary size, the size of the real-valued vector space, and the maximum length of input documents. The vocabulary size is the total number of words in our vocabulary, plus one because the Tokenizer numbers words from 1 and index 0 is reserved for padding.

We will use a 100-dimensional vector space. The maximum document length is stored in the max_length variable used during padding.

Model architecture:
    - Embedding Layer
    - CNN layer
        - 32 filters (parallel fields for processing words), kernel size of 8 with a relu activation function. 
    - Pooling layer that reduces the output of the convolutional layer by half
    - The 2D output from the CNN part of the model is flattened to one long 1D vector that represents the  
      features extracted by the CNN. 
    - Output layer w/ sigmoid activation function (0 for negative review and 1 for positive review) 

The back-end of the model is a standard Multilayer Perceptron that interprets the CNN features.

In [16]:
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
from keras.layers.convolutional import Conv1D, MaxPooling1D

# define model
def define_model(vocab_size, max_length):
    model = Sequential()
    model.add( Embedding(vocab_size, 100, input_length = max_length) )
    model.add( Conv1D(filters = 32, kernel_size = 8, activation = 'relu') )
    model.add( MaxPooling1D(pool_size = 2) )
    model.add( Flatten() )
    model.add( Dense(10, activation = 'relu') )
    model.add( Dense(1, activation = 'sigmoid') )
    
    # compile network 
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [17]:
# Load data and run model
from os import listdir
from collections import Counter

# Load vocab into memory
vocab = load_vocab("../data/txt_sentoken/vocab.txt")
vocab = set(vocab.split())

# Load training data
train_docs, ytrain = load_clean_dataset(vocab, True)

# create the tokenizer
tokenizer = create_tokenizer(train_docs)

# Define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print("vocab size: ", vocab_size)

# Encode Training data
max_length = max([len(s.split()) for s in train_docs])
Xtrain = encode_docs(tokenizer, max_length, train_docs)

# Define model
model = define_model(vocab_size, max_length)

model.summary()
print("Model ready to fit")

vocab size:  25768
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 1317, 100)         2576800   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1310, 32)          25632     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 655, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 20960)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                209610    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
Total params: 2,812,053
Trainable params: 2,812,053
Non-trainable params: 0
_______________________________________________
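
The shapes and parameter counts in the summary above can be reproduced by hand. The sketch below is plain arithmetic, assuming a stride of 1 and no padding in the convolution (the Keras defaults). The variable names are introduced here for illustration.

```python
def conv1d_output_length(input_length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return input_length - kernel_size + 1


vocab_size, embed_dim, max_length = 25768, 100, 1317
filters, kernel_size, pool_size, hidden_units = 32, 8, 2, 10

conv_len = conv1d_output_length(max_length, kernel_size)   # 1310
pool_len = conv_len // pool_size                            # 655
flat_len = pool_len * filters                               # 20960

params = {
    "embedding": vocab_size * embed_dim,
    "conv1d": (kernel_size * embed_dim + 1) * filters,      # weights + one bias per filter
    "dense_hidden": (flat_len + 1) * hidden_units,          # weights + one bias per unit
    "dense_output": hidden_units + 1,
}
print(conv_len, pool_len, flat_len)
print(params, sum(params.values()))
```

Most of the 2,812,053 parameters sit in the embedding, which is one reason why pruning the vocabulary has such a large effect on model size.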

In [7]:
model.fit(Xtrain, ytrain, epochs=10, verbose=1)

# save the model
model.save('../data/txt_sentoken/model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Model Evaluation

In [8]:
from keras.models import load_model 

# Load Train and Test docs
train_docs, ytrain = load_clean_dataset(vocab, True)
test_docs, ytest = load_clean_dataset(vocab, False)

tokenizer = create_tokenizer(train_docs)  # Tokenizer
vocab_size = len(tokenizer.word_index) + 1
max_length = max([len(s.split()) for s in train_docs])

# Encode data
Xtrain = encode_docs(tokenizer, max_length, train_docs)
Xtest = encode_docs(tokenizer, max_length, test_docs)

# Load model
#model = load_model("../data/txt_sentoken/model.h5")
print("Model loaded. ready for evaluation")

1317
1317
Model loaded. ready for evaluation


In [9]:
# Evaluate model on training set
_, acc = model.evaluate(Xtrain, ytrain, verbose = 0)
print('Train Accuracy: %f' % (acc*100))

# evaluate model on test dataset
_, acc = model.evaluate(Xtest, ytest, verbose=0) 
print('Test Accuracy: %f' % (acc*100))

Train Accuracy: 100.000000
Test Accuracy: 85.500000


### Develop predictor for new data

In [10]:
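def predict_sentiment(review, vocab, tokenizer, max_length, model):
    # clean and encode the review exactly as the training data was prepared
    line = cnn_clean_doc(review, vocab)
    padded = encode_docs(tokenizer, max_length, [line])
    
    # the sigmoid output is the predicted probability of the positive class
    yhat = model.predict(padded, verbose = 0 )
    percent_pos = yhat[0,0]
    
    # report the probability of whichever class was predicted
    if round(percent_pos) == 0:
        return (1-percent_pos), 'NEGATIVE' 
    return percent_pos, 'POSITIVE'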

text = 'Everyone will enjoy this film. I love it, recommended!'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model) 
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))

# test negative text
text = 'This is a bad movie. Do not watch it. It sucks.'
percent, sentiment = predict_sentiment(text, vocab, tokenizer, max_length, model) 
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))

1317
Review: [Everyone will enjoy this film. I love it, recommended!]
Sentiment: NEGATIVE (57.440%)
1317
Review: [This is a bad movie. Do not watch it. It sucks.]
Sentiment: NEGATIVE (63.011%)


## n-gram CNN Model for Sentiment Analysis

A standard deep learning model for text classification and sentiment analysis uses a word embedding layer and one-dimensional convolutional neural network. The model can be expanded by using multiple parallel convolutional neural networks that read the source document using different kernel sizes. This, in effect, creates a multichannel convolutional neural network for text that reads text with different n-gram sizes (groups of words).

### Model

A multi-channel convolutional neural network for document classification involves using multiple versions of the standard model with different sized kernels. This allows the document to be processed at different resolutions or different n-grams (groups of words) at a time, whilst the model learns how to best integrate these interpretations. This approach was first described by Yoon Kim in his 2014 paper titled Convolutional Neural Networks for Sentence Classification. 

In Keras, a multiple-input model can be defined using the functional API. We will define a model with three input channels for processing 4-grams, 6-grams, and 8-grams of movie review text. Each channel is comprised of the following elements:
    
    - Input layer that defines the length of input sequences.
    - Embedding layer set to the size of the vocabulary and 100-dimensional real-valued representations.
    - Conv1D layer with 32 filters and a kernel size set to the number of words to read at once.
    - MaxPooling1D layer to consolidate the output from the convolutional layer.
    - Flatten layer to reduce the three-dimensional output to two dimensional for concatenation.

The outputs from the three channels are concatenated into a single vector and processed by a Dense layer and an output layer. The function below defines and returns the model; calling summary() on the returned model prints its layer structure.
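
As a rough check on how large the concatenated vector becomes, the per-channel feature counts can be worked out by hand. This is a sketch under stated assumptions: stride-1 convolutions without padding, a pool size of 2, 32 filters per channel, and an input length of 1317 (the longest training review reported in the earlier model summary). `channel_feature_size` is a hypothetical helper.

```python
def channel_feature_size(length, kernel_size, filters=32, pool_size=2):
    """Flattened feature count for one channel: Conv1D -> MaxPooling1D -> Flatten."""
    conv_len = length - kernel_size + 1      # stride 1, no padding
    return (conv_len // pool_size) * filters


length = 1317                     # longest training review, from the earlier summary
kernel_sizes = [4, 6, 8]          # the 4-gram, 6-gram, and 8-gram channels
sizes = [channel_feature_size(length, k) for k in kernel_sizes]
print(sizes, sum(sizes))
```

Larger kernels yield slightly shorter feature maps, so the channels contribute slightly different numbers of features to the merged vector.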

In [5]:
from keras.models import Model
from keras.layers import Input, Dense, Flatten, Dropout, Embedding
from keras.layers.convolutional import Conv1D, MaxPooling1D 
from keras.layers.merge import concatenate

def define_ngram_model(length, vocab_size):
    
    # Channel 1
    in1 = Input(shape=(length,))
    eb1 = Embedding(vocab_size, 100)(in1)
    conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(eb1)
    drop1 = Dropout(0.5)(conv1)
    pool1 = MaxPooling1D(pool_size=2)(drop1)
    flat1 = Flatten()(pool1)
    
    # Channel 2
    in2 = Input(shape=(length,))
    eb2 = Embedding(vocab_size, 100)(in2)
    conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(eb2)
    drop2 = Dropout(0.5)(conv2)
    pool2 = MaxPooling1D(pool_size=2)(drop2)
    flat2 = Flatten()(pool2)
    
    # Channel 3
    in3 = Input(shape=(length,))
    eb3 = Embedding(vocab_size, 100)(in3)
    conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(eb3)
    drop3 = Dropout(0.5)(conv3)
    pool3 = MaxPooling1D(pool_size=2)(drop3)
    flat3 = Flatten()(pool3)
    
    # Merge channels
    merged = concatenate([flat1, flat2, flat3])
    
    # interpretation
    dense1 = Dense(10, activation='relu')(merged)
    outputs = Dense(1, activation='sigmoid')(dense1)
    model = Model(inputs=[in1, in2, in3], outputs=outputs)

    # compile
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

define_ngram_model(10,10).summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 10)           0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 10)           0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 10)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 10, 100)      1000        input_1[0][0]                    
__________________________________________________________________________________________________
embedding_

In [7]:
# Load data and run model
from os import listdir
from collections import Counter

# Load vocab into memory
vocab = load_vocab("../data/txt_sentoken/vocab.txt")
vocab = set(vocab.split())

# Load training data
train_docs, ytrain = load_clean_dataset(vocab, True)

# create the tokenizer
tokenizer = create_tokenizer(train_docs)

# Define vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print("vocab size: ", vocab_size)

# Encode Training data
max_length = max([len(s.split()) for s in train_docs])
Xtrain = encode_docs(tokenizer, max_length, train_docs)

# Define model (argument order is sequence length first, then vocabulary size)
model = define_ngram_model(max_length, vocab_size)

model.summary()
print("Model ready to fit")

vocab size:  25768
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, 1317)         0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 1317)         0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, 1317)         0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 1317, 100)    2576800     input_4[0][0]                    
__________________________________________________________________________________________

In [8]:
from keras.models import load_model 

# Load Train and Test docs
train_docs, ytrain = load_clean_dataset(vocab, True)
test_docs, ytest = load_clean_dataset(vocab, False)

tokenizer = create_tokenizer(train_docs)  # Tokenizer
vocab_size = len(tokenizer.word_index) + 1
max_length = max([len(s.split()) for s in train_docs])

# Encode data
Xtrain = encode_docs(tokenizer, max_length, train_docs)
Xtest = encode_docs(tokenizer, max_length, test_docs)

# Load model
#model = load_model("../data/txt_sentoken/model.h5")
print("Model loaded. ready for evaluation")

Model loaded. ready for evaluation


In [9]:
# The model has three input channels, so the encoded data is passed once per channel.
# Fit before evaluating (epochs=10 mirrors the earlier single-channel run; it is an assumption, not a tuned value)
model.fit([Xtrain, Xtrain, Xtrain], ytrain, epochs=10, verbose=1)

# Evaluate model on training set
_, acc = model.evaluate([Xtrain, Xtrain, Xtrain], ytrain, verbose = 0)
print('Train Accuracy: %f' % (acc*100))

# evaluate model on test dataset
_, acc = model.evaluate([Xtest, Xtest, Xtest], ytest, verbose=0) 
print('Test Accuracy: %f' % (acc*100))