In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../../notebook_format')
from formats import load_style
load_style( css_style = 'custom2.css' )

In [2]:
os.chdir(path)
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 8, 6 # change default figure size

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload 
%autoreload 2

from keras.models import Sequential
from keras.callbacks import EarlyStopping
from keras.utils.np_utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D, Embedding

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,matplotlib,keras,scikit-learn

Using TensorFlow backend.


Ethen 2016-08-16 20:35:34 

CPython 3.5.2
IPython 4.2.0

numpy 1.11.1
pandas 0.18.1
matplotlib 1.5.1
keras 1.0.6
scikit-learn 0.17.1


# Text Classification using Word Embedding and ConvNets

## What are word embeddings?

"Word embeddings" (or so-called word vectors) are a family of natural language processing techniques that aim to map semantic meaning into a geometric space. This is done by associating a numeric vector with every word in a dictionary, such that the distance (e.g. L2 distance or, more commonly, cosine distance) between any two vectors captures part of the semantic relationship between the two associated words. The geometric space formed by these vectors is called an embedding space.

For instance, "coconut" and "polar bear" are words that are semantically quite different, so a reasonable embedding space would represent them as vectors that would be very far apart. But "kitchen" and "dinner" are related words, so they should be embedded close to each other.
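To make the notion of distance concrete, here is a minimal sketch of cosine similarity on hand-picked vectors. The 3-dimensional `toy_embeddings` are hypothetical values chosen for illustration; they do not come from any trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical 3-dimensional embeddings, hand-picked for illustration only
toy_embeddings = {
    'kitchen':    np.array([0.9, 0.8, 0.1]),
    'dinner':     np.array([0.8, 0.9, 0.2]),
    'coconut':    np.array([0.1, 0.2, 0.9]),
    'polar bear': np.array([-0.7, 0.1, -0.6]),
}

related = cosine_similarity(toy_embeddings['kitchen'], toy_embeddings['dinner'])
unrelated = cosine_similarity(toy_embeddings['coconut'], toy_embeddings['polar bear'])
print('kitchen vs dinner: %.3f' % related)
print('coconut vs polar bear: %.3f' % unrelated)
```

With these made-up numbers the related pair scores close to 1 while the unrelated pair scores below 0, which is the behaviour we would hope for from a real embedding space.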

Ideally, in a good embedding space, the "path" (a vector) to go from "dinner" to "kitchen" would capture precisely the semantic relationship between these two concepts. In this case the relationship is "where x occurs", so you would expect the vector kitchen - dinner (difference of the two embedding vectors, i.e. the path to go from dinner to kitchen) to capture this "where x occurs" relationship. Basically, we should have the vectorial identity: dinner + (where x occurs) = kitchen (at least approximately). If that's indeed the case, then we can use such a relationship vector to answer questions. For instance, starting from a new vector, e.g. "work", and applying this relationship vector, we should get something meaningful, e.g. work + (where x occurs) = office, answering "where does work occur?".
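The vector arithmetic described above can be sketched with a handful of 2-dimensional vectors. The values in `toy` and the helper `nearest_word` are hypothetical and constructed so that the identity holds approximately; real embeddings are far noisier.

```python
import numpy as np

# hypothetical 2-dimensional embeddings, constructed for illustration only
toy = {
    'dinner':  np.array([1.0, 0.0]),
    'kitchen': np.array([1.0, 1.0]),
    'work':    np.array([3.0, 0.0]),
    'office':  np.array([3.1, 0.9]),
    'beach':   np.array([-2.0, 2.0]),
}

def nearest_word(query, embeddings, exclude=()):
    """Return the word whose vector has the smallest L2 distance to `query`."""
    candidates = [word for word in embeddings if word not in exclude]
    return min(candidates, key=lambda word: np.linalg.norm(embeddings[word] - query))

# the "where x occurs" relationship, estimated from a single known pair
where_x_occurs = toy['kitchen'] - toy['dinner']

# apply the relationship to a new word and look up the closest known word
answer = nearest_word(toy['work'] + where_x_occurs, toy, exclude=('work',))
print(answer)
```

The query word itself is excluded from the candidates, since it would otherwise often be among the closest vectors to its own shifted copy.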

Word embeddings are computed by applying dimensionality reduction techniques to datasets of co-occurrence statistics between words in a corpus of text. This can be done via shallow neural networks, e.g. **word2vec**, or via matrix factorization, e.g. **GloVe**.
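As a rough illustration of "dimensionality reduction on co-occurrence statistics", the sketch below counts co-occurrences in a tiny made-up corpus and compresses them with a truncated SVD. This is neither word2vec nor GloVe, only the simplest member of the same family; the corpus, the window size and the number of dimensions are all arbitrary assumptions.

```python
import numpy as np

# tiny made-up corpus, for illustration only
corpus = [
    'we cook dinner in the kitchen',
    'we eat dinner in the kitchen',
    'we do work in the office',
]

# vocabulary and an integer id for each word
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_id = {word: i for i, word in enumerate(vocab)}

# symmetric co-occurrence counts within a window of 2 words on each side
window = 2
cooccurrence = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                cooccurrence[word_to_id[word], word_to_id[words[j]]] += 1

# truncated SVD: keep only the first `n_dim` singular directions
n_dim = 2
U, S, _ = np.linalg.svd(cooccurrence)
word_vectors = U[:, :n_dim] * S[:n_dim]
print(word_vectors.shape)
```

Each row of `word_vectors` is a 2-dimensional vector for one vocabulary word. Practical methods reweight the counts and optimize a different objective, but the input (co-occurrence statistics) and the output (one dense vector per word) have the same form.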

## Task

The dataset we'll use is the [20 Newsgroup dataset](http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html). The task that we will try to solve is to classify posts coming from 20 different newsgroups into their original 20 categories using the text associated with each post.

Categories (listed below) are fairly semantically distinct and thus will have quite different words associated with them.

```
alt.atheism
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
soc.religion.christian

comp.sys.ibm.pc.hardware
comp.graphics
comp.os.ms-windows.misc
comp.sys.mac.hardware
comp.windows.x

rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey

sci.crypt
sci.electronics
sci.space
sci.med

misc.forsale

```

To solve the classification problem we will:

- Convert all text samples in the dataset into sequences of word indices. A "word index" simply represents each word with an integer ID. We will only consider the 50,000 most frequently occurring words in the dataset, and we will truncate the sequences to a maximum length of 1000 words (texts that have fewer than 1000 words will be padded with 0).
- Prepare an "embedding matrix" (more detail below) which will contain at index i the embedding vector for the word of index i in our word index.
- Load this embedding matrix into a Keras `Embedding` layer, set to be frozen (its weights, the embedding vectors, will not be updated during training).
- Build a 1D convolutional neural network on top of it, ending in a softmax output over our 20 categories.
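The truncation and padding mentioned in the first step can be sketched in plain NumPy. The helper `pad_or_truncate` is a hypothetical stand-in written for this illustration; it pads on the left and keeps the last `maxlen` ids, which is what the default settings of Keras' `pad_sequences` do.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen):
    """Left-pad each sequence with 0 and keep only its last `maxlen` ids."""
    padded = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        trimmed = seq[-maxlen:]
        padded[row, -len(trimmed):] = trimmed
    return padded

# two made-up sequences of word ids: one too short, one too long
toy_sequences = [[4, 2], [7, 1, 3, 9, 5, 8]]
toy_padded = pad_or_truncate(toy_sequences, maxlen=4)
print(toy_padded)
```

The short sequence becomes `[0, 0, 4, 2]` and the long one is cut down to `[3, 9, 5, 8]`. Id 0 is reserved for padding, which is why word ids start from 1.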

> Side note: When solving classic text classification problems (e.g. spam filtering, sentiment analysis) with neural networks, a recurrent neural network (RNN) might be a more 'natural' approach, given that text is naturally sequential. However, RNNs are quite slow and fickle to train, and convnets work surprisingly well.

**GloVe word embeddings**

For the embedding matrix, we will use the pre-trained [GloVe embeddings](http://nlp.stanford.edu/projects/glove/). GloVe stands for "Global Vectors for Word Representation". It's a popular embedding technique based on factorizing a matrix of word co-occurrence statistics.

Specifically, we will use the pre-trained 100-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia combined with the Gigaword 5 corpus. You can download them [here](http://nlp.stanford.edu/data/glove.6B.zip).

<p>
<div class="alert alert-warning">
The pre-trained word embeddings are an 822MB download.
</div>

With all that being said and done, let's get started.

In [3]:
# define some global variables
GLOVE_DIR = 'glove.6B/'
TEXT_DATA_DIR = '20_newsgroup/'
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 50000  # matches the vocabulary size passed to the tokenizers below
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

### Preprocessing

Preparing the text data:

We will simply iterate over the folders in which our text samples are stored, and format them into a list of data points. We will also assign each data point a class label. (scikit-learn also ships a loader for this dataset, so we do not strictly have to do this loading ourselves.)
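For reference, a minimal sketch of the scikit-learn route. The wrapper `load_newsgroups_sklearn` is a hypothetical helper written for this note, and the copy of the dataset that scikit-learn downloads may not contain exactly the same posts as the folder used in this notebook.

```python
def load_newsgroups_sklearn(subset='all'):
    """Return (texts, labels, label_names) using scikit-learn's built-in loader."""
    # imported lazily; the first call downloads and caches the data
    from sklearn.datasets import fetch_20newsgroups

    dataset = fetch_20newsgroups(subset=subset)
    return dataset.data, dataset.target, dataset.target_names
```

Calling `load_newsgroups_sklearn()` would give a list of raw posts, an array of integer labels and the list of category names, which play the same roles as `texts`, `labels` and `labels_index` in this notebook.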

<p>
<div class="alert alert-warning">
If you get a `UnicodeDecodeError: 'utf8' codec can't decode byte` error when reading the dataset, you can work around it by passing `encoding = 'Latin-1'` to `open`.
</div>

In [4]:
print('Processing text dataset')

# 1. list of text for each news story (data point)
# 2. list of label ids (target)
# 3. dictionary mapping label name (news' category) to numeric id
texts  = []
labels = []
labels_index = {}  

for name in os.listdir(TEXT_DATA_DIR):
    
    # each news category is stored in its own folder
    path = os.path.join( TEXT_DATA_DIR, name )
    if os.path.isdir(path):
        
        # assign a distinct label id for each category
        label_id = len(labels_index)
        labels_index[name] = label_id
    
        for fname in os.listdir(path):
            fpath = os.path.join( path, fname )
            with open( fpath, encoding = 'Latin-1' ) as f:
                text = f.read()
                texts.append(text)
                labels.append(label_id)

print( 'Found %s texts.' % len(texts) )

Processing text dataset
Found 19997 texts.


Then we preprocess our text samples and labels using functionality provided by Keras so that they can be fed into a neural network. `Tokenizer` converts the text data into sequences of integers: it builds a dictionary mapping each word to an index, then replaces each word in the original text with its index. For more info refer to the [Keras Text Preprocessing Doc](https://keras.io/preprocessing/text/).
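To show what such a tokenizer does conceptually, here is a simplified pure-Python sketch. The helpers `build_word_index` and `to_sequences` are hypothetical and only split on whitespace; the real Keras `Tokenizer` also strips punctuation, and its exact cutoff convention for the vocabulary limit may differ by one.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sort by descending count, then alphabetically so that ties are deterministic
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their ids, dropping words outside the `num_words` most frequent."""
    sequences = []
    for text in texts:
        ids = [word_index[word] for word in text.lower().split() if word in word_index]
        if num_words is not None:
            ids = [i for i in ids if i <= num_words]
        sequences.append(ids)
    return sequences

toy_texts = ['the cat sat', 'the cat ran', 'the dog']
toy_index = build_word_index(toy_texts)
toy_ids = to_sequences(toy_texts, toy_index, num_words=2)
print(toy_index)
print(toy_ids)
```

Note that the word index covers every word seen, while the frequency cutoff is only applied when the texts are converted into sequences.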

In [5]:
# keras preprocessing
tokenizer = Tokenizer(nb_words = 50000)
tokenizer.fit_on_texts(texts)

# a dictionary of words to their id (all the words)
word_index = tokenizer.word_index
print( 'Found %s unique tokens.' % len(word_index) )

# `word_index` above holds every word seen, but `texts_to_sequences`
# only keeps the most frequent words, up to the `nb_words` limit
# that you've specified
sequences = tokenizer.texts_to_sequences(texts)

Found 214873 unique tokens.


In [6]:
# try a different preprocessing, using the Tokenizer from the local `text` module
from text import Tokenizer as Tokenizer2
tokenizer2 = Tokenizer2(nb_words = 50000)
tokenizer2.fit_on_texts(texts)

# a dictionary of words to their id (all the words)
word_index = tokenizer2.word_index
print( 'Found %s unique tokens.' % len(word_index) )

# `texts_to_sequences` only keeps the most frequent words,
# up to the `nb_words` limit
# that you've specified
sequences = tokenizer2.texts_to_sequences(texts)

Found 50000 unique tokens.


In [7]:
# pad the sequences so they all have the same length;
# the resulting 2D array will be our preprocessed data
X = pad_sequences( sequences, maxlen = MAX_SEQUENCE_LENGTH )

# one-hot encode the y labels
y = to_categorical(np.array(labels))
print('Shape of data:', X.shape)
print('Shape of label:', y.shape)

# split the data into a training set and a validation set
np.random.seed(1337)
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y = y[indices]
val_samples = int(VALIDATION_SPLIT * X.shape[0])

X_train = X[:-val_samples]
y_train = y[:-val_samples]
X_val = X[-val_samples:]
y_val = y[-val_samples:]
print('Shape of training data:', X_train.shape)
print('Shape of validation data:', X_val.shape)
print('Shape of training label:', y_train.shape)
print('Shape of validation label:', y_val.shape)

Shape of data: (19997, 1000)
Shape of label: (19997, 20)
Shape of training data: (15998, 1000)
Shape of validation data: (3999, 1000)
Shape of training label: (15998, 20)
Shape of validation label: (3999, 20)


### Preparing the Embedding layer

We'll read in one line of the GloVe dataset to get a feeling of what it looks like. Note the `100d` in the file name: the download contains files with different embedding dimensions, and you can try which one leads to better performance.

In [8]:
GLOVE_FILE = os.path.join( GLOVE_DIR, 'glove.6B.100d.txt' )
with open(GLOVE_FILE) as f:
    test = next(f)
test

'the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062\n'

The following code chunk reads in the pre-trained embeddings and constructs `embeddings_index`, a dictionary mapping words to their pre-trained embedding vectors.

We can then leverage our `embeddings_index` dictionary and our `word_index` to compute our embedding matrix: if a word appears in the pre-trained word embeddings, its row is filled with the pre-trained coefficients; if not, its row is left as all zeros.
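The construction can be seen on a toy scale first. The dictionaries `toy_word_index` and `toy_pretrained` below are hypothetical stand-ins for `word_index` and `embeddings_index`, with made-up 2-dimensional vectors.

```python
import numpy as np

# hypothetical stand-ins for `word_index` and `embeddings_index`
toy_word_index = {'the': 1, 'cat': 2, 'zzyzx': 3}
toy_pretrained = {
    'the': np.array([0.1, 0.2], dtype='float32'),
    'cat': np.array([0.3, 0.4], dtype='float32'),
}

dim = 2
# one extra row because word ids start from 1; row 0 stays zero (padding)
toy_matrix = np.zeros((len(toy_word_index) + 1, dim), dtype='float32')
for word, i in toy_word_index.items():
    vector = toy_pretrained.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # known word: copy its pre-trained vector
    # unknown word ('zzyzx'): its row is left as zeros

print(toy_matrix)
```

Row `i` of the matrix now holds the vector for the word with id `i`, so the matrix can be handed to an embedding layer as its initial weights.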

In [9]:
print('Indexing word vectors.')

# words stored as keys, 
# corresponding embeddings (vectors) stored as values
embeddings_index = {}
GLOVE_FILE = os.path.join( GLOVE_DIR, 'glove.6B.100d.txt' )
with open(GLOVE_FILE) as f:
    for line in f:
        value = line.split()
        word  = value[0]
        coefs = np.array( value[1:], dtype = 'float32' )
        embeddings_index[word] = coefs

print( 'Found %s word vectors.' % len(embeddings_index) )

Indexing word vectors.
Found 400000 word vectors.


In [10]:
# 1. the EMBEDDING_DIM is 100, which matches the 100d glove file name
# 2. add one row to the matrix, since word ids start from 1 instead of 0
count = 0
embedding_matrix = np.zeros( ( len(word_index) + 1, EMBEDDING_DIM ), dtype = 'float32' )
for word, i in word_index.items():
    if word in embeddings_index:
        embedding_vector = embeddings_index[word]
        embedding_matrix[i] = embedding_vector
        count += 1

# number of tokens that appeared in the pre-trained word vectors
print(count)

34297


### Training a 1D convnet

Finally we can build a toy 1D convnet to solve our classification problem. Note that for the `Embedding` layer, we set `trainable = False` to prevent the weights from being updated during training.

An `Embedding` layer should be fed sequences of integers, i.e. a 2D input of shape (samples, indices). These input sequences should be padded so that they all have the same length in a batch of input data (although an `Embedding` layer is capable of processing sequences of heterogeneous length if you don't pass an explicit `input_length` argument to the layer).

All the `Embedding` layer does is to map the integer inputs to the vectors found at the corresponding index in the embedding matrix, i.e. the sequence [1, 2] would be converted to [embeddings[1], embeddings[2]]. This means that the output of the Embedding layer will be a 3D tensor of shape (samples, sequence_length, embedding_dim).
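The lookup itself is equivalent to integer-array indexing in NumPy, which the following sketch demonstrates on made-up data (`toy_embeddings` and `toy_batch` are hypothetical and unrelated to the GloVe matrix).

```python
import numpy as np

rng = np.random.RandomState(0)
toy_embeddings = rng.rand(5, 3)         # 5 word ids (0..4), 3-dimensional vectors
toy_batch = np.array([[0, 1, 2, 4],     # 2 samples, each a padded sequence of 4 ids
                      [0, 0, 3, 1]])

# indexing with an integer array replaces every id by its row of the matrix
looked_up = toy_embeddings[toy_batch]
print(looked_up.shape)
```

The result has shape (samples, sequence_length, embedding_dim), here `(2, 4, 3)`, matching the 3D tensor described above.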

In [11]:
n_classes = y_train.shape[1]

# setting up the Embedding layer
embedding_layer = Embedding( len(word_index) + 1, EMBEDDING_DIM,
                             weights = [embedding_matrix],
                             input_length = MAX_SEQUENCE_LENGTH,
                             trainable = False )
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D( nb_filter = 128, filter_length = 5, activation = 'relu' ))
model.add(MaxPooling1D( pool_length = 5 ))
model.add(Conv1D( nb_filter = 128, filter_length = 5, activation = 'relu' ))
model.add(MaxPooling1D( pool_length = 5 ))
model.add(Conv1D( nb_filter = 128, filter_length = 5, activation = 'relu' ))
model.add(MaxPooling1D( pool_length = 35 ))
model.add(Flatten())
model.add(Dense( 128, activation = 'relu' ))
model.add(Dense( n_classes, activation = 'softmax' ))
model.compile( loss = 'categorical_crossentropy',
               optimizer = 'rmsprop', metrics = ['acc'] )

```python
# obtain the graph visualization of the network if we wish
from keras.utils.visualize_util import plot
plot( model, to_file = 'model.png', show_shapes = True, show_layer_names = True )

```
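The last pooling size of 35 is not arbitrary: it collapses whatever is left of the sequence into a single time step. The arithmetic can be checked with the sketch below, which assumes the layers use their default settings in this Keras version (no padding and stride 1 for `Conv1D`, stride equal to the pool size for `MaxPooling1D`). The two helper functions are hypothetical.

```python
def conv_output_length(length, filter_length):
    """Length after a 'valid' (no padding), stride-1 convolution."""
    return length - filter_length + 1

def pool_output_length(length, pool_length):
    """Length after non-overlapping max pooling (stride equal to the pool size)."""
    return length // pool_length

length = 1000  # MAX_SEQUENCE_LENGTH
lengths = [length]
for pool_length in (5, 5, 35):
    length = conv_output_length(length, filter_length=5)
    lengths.append(length)
    length = pool_output_length(length, pool_length)
    lengths.append(length)

print(lengths)
```

Under these assumptions the sequence length goes 1000, 996, 199, 195, 39, 35, 1, so the `Flatten` layer receives a single time step with 128 features.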

In [12]:
# if the validation set's loss does not improve
# after 2 epochs (patience), the training will be stopped
callback = [ EarlyStopping( monitor = 'val_loss', patience = 2, verbose = 1 ) ]
model.fit( X_train, y_train, shuffle = True,
           nb_epoch = 10, batch_size = 256,
           validation_data = (X_val, y_val),
           callbacks = callback )

Train on 15998 samples, validate on 3999 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10


<keras.callbacks.History at 0x12689fe48>
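The idea behind patience-based early stopping can be sketched in a few lines. This is a simplified illustration, not the exact Keras implementation, and the validation losses are made-up numbers rather than values from the run above.

```python
def epochs_until_stop(val_losses, patience):
    """Return the number of epochs run before patience-based stopping triggers."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: remember it and reset the counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)        # never triggered: all epochs were run

# made-up validation losses, for illustration only
toy_val_losses = [1.0, 0.7, 0.75, 0.72, 0.6]
stopped_at = epochs_until_stop(toy_val_losses, patience=2)
print(stopped_at)
```

Here the best loss is reached at epoch 2, the next two epochs fail to beat it, and training stops after epoch 4 even though epoch 5 would have improved again. A larger `patience` trades longer training for robustness to such temporary plateaus.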

In [13]:
# predict the class label of each sample in the validation set
y_val_pred = model.predict_classes( X_val, verbose = 1 )



In [14]:
# accuracy score for the word embedding / convnet model
y_labels = np.array(labels)[indices][-val_samples:]
accuracy = np.sum( y_labels == y_val_pred ) / X_val.shape[0]
print( 'valid accuracy: %.2f' % ( accuracy * 100 ) )

valid accuracy: 94.42


This model reaches 94% classification accuracy on the validation set after 6 epochs. 
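For readers unfamiliar with how class predictions are derived from a softmax output, the sketch below picks the most probable class per sample with `argmax` and computes the accuracy by hand. The probabilities and labels are made-up values.

```python
import numpy as np

# made-up softmax outputs for 4 samples and 3 classes
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.4, 0.5, 0.1],
                      [0.2, 0.2, 0.6]])
toy_true = np.array([0, 2, 0, 2])

toy_pred = np.argmax(toy_probs, axis=1)        # most probable class per sample
toy_accuracy = np.mean(toy_pred == toy_true)   # fraction of correct predictions
print(toy_pred, toy_accuracy)
```

Three of the four toy predictions match the true labels, giving an accuracy of 0.75.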

There are several other things we can try to get an even higher accuracy:

- Add a regularization mechanism (e.g. dropout)
- Try tuning the embedding layer (e.g. load pre-trained word vectors of a different dimension, or let the layer train with the network for more epochs)
- Try a different network architecture, or increase the number of epochs, as the early stopping criterion was not met

To test how well we would have performed by not using pre-trained word embeddings, we just need to replace our Embedding layer with the following:

```python
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length = MAX_SEQUENCE_LENGTH )
```

This will initialize our `Embedding` layer from scratch and let it learn its weights during training.

In general, using pre-trained embeddings is relevant for natural language processing tasks where little training data is available. Functionally, the embeddings act as an injection of outside information, which might prove useful for the model.

## Comparing Performance

One popular approach to tackle this type of text classification problem is to use bag-of-words or tf-idf for the feature engineering step and train a logistic (softmax) regression classifier. This combination serves as a very strong baseline, and getting other approaches to match its speed/accuracy is difficult. We'll try it here and compare the performance to see whether the extra work of using word embeddings and training a more complicated model was worth the effort.
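Before running this on the full dataset, a toy example may help show what the vectorizer settings do. The three documents are made up; the sketch uses unigrams and bigrams (`ngram_range = (1, 2)`) and drops terms that occur in fewer than two documents (`min_df = 2`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up documents, for illustration only
toy_docs = ['the cat sat', 'the cat ran', 'a dog ran']

toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
toy_features = toy_tfidf.fit_transform(toy_docs)

# terms that survived the document-frequency cutoff
toy_vocab = sorted(toy_tfidf.vocabulary_)
print(toy_vocab)
print(toy_features.shape)
```

Only `cat`, `ran`, `the` and the bigram `the cat` appear in at least two documents, so each document is represented by a sparse 4-dimensional vector. On real data the same settings produce a very wide but sparse feature matrix, which is a good match for a linear classifier.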

In [15]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
# use the shuffled indices from before
# to shuffle the original texts and labels
texts = np.array(texts)
labels = np.array(labels)
texts_shuffled  = texts[indices]
labels_shuffled = labels[indices]

# train / validation split
texts_train  = texts_shuffled[:-val_samples]
labels_train = labels_shuffled[:-val_samples]
texts_val  = texts_shuffled[-val_samples:]
labels_val = labels_shuffled[-val_samples:]

In [17]:
# remove built-in english stop words, use unigrams and bigrams, and drop
# terms that appear in fewer than two documents
tfidf = TfidfVectorizer( stop_words = 'english', ngram_range = (1, 2), min_df = 2 )
X_train_tfidf = tfidf.fit_transform(texts_train)
X_val_tfidf = tfidf.transform(texts_val)
print(X_train_tfidf.shape)

(15998, 476088)


In [18]:
logreg = LogisticRegression(n_jobs = -1)
logreg.fit( X_train_tfidf, labels_train )

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:
# accuracy score for the tfidf-logistic regression
y_val_pred = logreg.predict(X_val_tfidf)
print( accuracy_score( labels_val, y_val_pred ) )

0.951987996999


## Reference

- [Keras Blog: Using pre-trained word embeddings in a Keras model](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)
- [Source code of the blog above](https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py)
- [Keras Text Preprocessing Doc](https://keras.io/preprocessing/text/)
- [Keras Text Preprocessing Source Code](https://github.com/fchollet/keras/tree/master/keras/preprocessing)