## Resources
- [Keras example](https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py)
- [How word embeddings and image vectors are used to train VQA](http://localhost:8888/notebooks/USECASE%20-%20Visual%20Q%20%26%20A%20by%20Keras.ipynb)
- [GloVe pretrained word vectors](http://nlp.stanford.edu/projects/glove/)
- [20 Newsgroups classification](http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html)

## Structure
- Use a word-embedding layer as the first layer for downstream text tasks (e.g., classification)
- Set the word-embedding layer weights to pretrained GloVe vectors, and freeze the layer by setting `trainable = False`
- Use a 1D CNN to convert each article into a single vector (level by level); it usually performs comparably to an LSTM but trains much faster
- Do traditional classification with a dense layer

In [1]:
import keras
keras.__version__

Using Theano backend.


Couldn't import dot_parser, loading of dot files will not be possible.


Using gpu device 0: GeForce GTX 980M (CNMeM is disabled, cuDNN 5005)


'1.0.6'

In [2]:
from glob import glob

from sklearn.cross_validation import train_test_split

import numpy as np

from keras.preprocessing import text, sequence
from keras.utils import np_utils

## 1. Convert texts into index sequences
- It is the de facto standard in deep learning text applications to convert each text into a sequence of integer indices. Each index is a key into the vocabulary.
- The output should be a 2D tensor of shape (n_articles, max_words_per_article). Some models accept sequences of different lengths, but most of the time they need to be padded/truncated to a fixed length, especially for sentences or short articles.
- After word embedding, the above sequence matrix is converted to a 3D tensor of shape (n_articles, max_words_per_article, word_vec_dim).
- The word vectors can be 1-hot encodings (or tf-idf), or word embeddings, depending on which layer is used first in the model.
- `keras.preprocessing.text` and `keras.preprocessing.sequence` are implemented specifically to facilitate those tasks, and the starting point is usually a tokenizer.
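
The shape changes described above can be traced on a toy corpus without Keras. This is only a sketch: the texts, the whitespace tokenization, and all `toy_*` names are made up for illustration, and the embedding matrix is random rather than pretrained.

```python
from collections import Counter

import numpy as np

# Toy corpus standing in for the newsgroup articles (hypothetical data).
toy_texts = ["the cat sat", "the dog sat down", "the cat"]

# Rank words by frequency; index 0 is kept free for padding.
counts = Counter(word for t in toy_texts for word in t.split())
toy_word2index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# Each text becomes a variable-length list of word indices.
toy_sequences = [[toy_word2index[w] for w in t.split()] for t in toy_texts]

# Left-pad with 0 to a fixed length -> 2D array (n_articles, max_words).
max_len = 4
toy_X = np.zeros((len(toy_sequences), max_len), dtype="int32")
for row, seq in enumerate(toy_sequences):
    toy_X[row, max_len - len(seq):] = seq

# An embedding lookup is row indexing into a (vocab_size + 1, word_dim) matrix,
# which yields a 3D array (n_articles, max_words, word_dim).
word_dim = 5
toy_embedding = np.random.rand(len(toy_word2index) + 1, word_dim)
toy_embedded = toy_embedding[toy_X]

print(toy_X.shape)
print(toy_embedded.shape)
```

The real pipeline below does the same three steps with `text.Tokenizer`, `sequence.pad_sequences`, and an `Embedding` layer.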

In [97]:
## load news group - they are plain files distributed in different group folders
def read_raw_data(data_folder):
    texts, labels = [], []
    for f in glob(data_folder + "/*/*"):
        label = f.split("/")[-2]
        text= open(f).read()
        texts.append(text)
        labels.append(label)
    return texts, labels
        
texts, labels = read_raw_data("../../data/newsgroup_20/")
print len(texts), len(labels)

19997 19997


In [98]:
## parameters for preprocessing and modelling
params = {
    "vocab_size": 20000 # it is common to keep only the 20,000 most frequent words and filter out the rest
    , "max_seq_len": 1000 # fix the # of words in each article by padding or truncating
    , "word_dim": 100 # dimensionality of word space, 300 for maximum performance, 100 for small dataset
}

In [99]:
## Things start with a tokenizer for text data preprocessing
tokenizer = text.Tokenizer(nb_words=params["vocab_size"], )
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
print len(sequences), len(sequences[0])

19997 333


In [100]:
## a vocabulary is essentially two mappings:
## word2index and index2word
word2index = tokenizer.word_index
index2word = dict([(v,k) for k,v in word2index.items()])
print len(word2index), len(index2word)
print "min and max word index:", min(index2word.keys()), max(index2word.keys())
print "Notice keras needs word index built from 1 to vocab_size, 0 is reserved for PADDING"
print "Does it make more sense for Keras API to make len(tokenizer.word_index) == max_len?"

214909 214909
min and max word index: 1 214909
Notice keras needs word index built from 1 to vocab_size, 0 is reserved for PADDING
Does it make more sense for Keras API to make len(tokenizer.word_index) == max_len?


In [101]:
## we almost have everything for a valid text input to a keras model
## the only thing left is to pad/truncate the sequences to a fixed length
X = sequence.pad_sequences(sequences, maxlen=params["max_seq_len"])
print X.shape

(19997, 1000)
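
By default `pad_sequences` pads and truncates at the front of each sequence, so short articles get leading zeros and long articles keep their last `maxlen` words. The helper below is a hypothetical NumPy re-implementation of that default behaviour, written only to make the semantics visible; it is not the Keras code.

```python
import numpy as np

def pad_or_truncate(seqs, maxlen):
    """Mimic front ('pre') padding and front ('pre') truncating with zeros."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]              # keep only the last maxlen indices
        if kept:
            out[row, -len(kept):] = kept  # zeros remain on the left side
    return out

demo = pad_or_truncate([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(demo)
```

Because real indices end up right-aligned, the last position of a padded single-word sequence holds that word, which is what the embedding sanity check later in this notebook relies on.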


In [102]:
## we do the same thing to labels; notice keras needs 1-hot encoding for multi-class classification
unique_labels = list(set(labels))
index2label = dict(enumerate(unique_labels))
label2index = dict([(v,k) for k,v in index2label.items()])
encoded_labels = map(label2index.get, labels)
Y = np_utils.to_categorical(encoded_labels)
print Y.shape

(19997, 20)
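
The label encoding can be sketched with plain NumPy on a few made-up labels. One assumption differs from the cell above: the unique labels are sorted here, so the label-to-index mapping is the same on every run, whereas iterating over a bare `set` gives no such guarantee.

```python
import numpy as np

# Hypothetical labels standing in for the newsgroup folder names.
toy_labels = ["sci.space", "rec.autos", "sci.space", "talk.politics.misc"]

# Sorting makes the label -> index mapping reproducible across runs.
toy_unique = sorted(set(toy_labels))
toy_label2index = {label: i for i, label in enumerate(toy_unique)}
toy_encoded = [toy_label2index[label] for label in toy_labels]

# One-hot: row i has a single 1 in the column of its class index.
toy_Y = np.eye(len(toy_unique), dtype="float32")[toy_encoded]
print(toy_Y)
```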


In [104]:
## to make the evaluation fair, hold out 20% of the articles as a validation set that is never trained on
train_X, valid_X, train_Y, valid_Y = train_test_split(X, Y, test_size = 0.2)

print map(lambda x: x.shape, [train_X, valid_X])
print map(lambda x: x.shape, [train_Y, valid_Y])

[(15997, 1000), (4000, 1000)]
[(15997, 20), (4000, 20)]


## 2. Load the word vectors from GloVe

In [11]:
## load embedding weights from GloVe vectors
## each line of the file is a word followed by its space-separated vector values;
## the format is described at https://spacy.io/docs/tutorials/load-new-word-vectors
def load_glove_vecs(fpath):
    word2vec = {}
    for line in open(fpath).readlines():
        word, vec = line.split(" ", 1)
        word2vec[word] = np.asarray(vec.split(" "), dtype="float32")
    return word2vec
word2vec = load_glove_vecs("../../data/glove/glove.6B.100d.txt")
print len(word2vec), len(word2vec.values()[0])

400000 100
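
To make the file format concrete, the same parsing logic can be run on an in-memory stand-in for the GloVe file. The two lines and their three-dimensional values below are invented for illustration; the real file has 100 values per line.

```python
import io

import numpy as np

# Two made-up lines in the "word v1 v2 ... vn" text format.
toy_glove = io.StringIO(u"king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")

toy_word2vec = {}
for line in toy_glove:
    # Split off the word once; everything after it is the vector.
    word, vec = line.rstrip().split(" ", 1)
    toy_word2vec[word] = np.array([float(x) for x in vec.split()], dtype="float32")

print(sorted(toy_word2vec))
print(toy_word2vec["king"].shape)
```

Iterating over the file object line by line, as done here, also avoids holding the whole file in memory the way `readlines()` does.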


In [85]:
## the number of words, and thus the number of word vectors to keep,
## is the smaller of vocab_size and the number of unique words actually found in the texts
nb_words = min(params["vocab_size"], len(word2index))

In [86]:
## build embedding vector weights for vocabulary
## unknown and padding words will be represented as all-zeros

## vocab_vecs are of shape (nb_words+1, word_dim)
## +1 for PADDING index 0
vocab_vecs = np.zeros((nb_words+1, params["word_dim"]), dtype="float32")
## word indices are frequency ranks, so 1..nb_words are the most frequent words
for i in xrange(1, nb_words+1):
    word = index2word[i]
    if word in word2vec:
        vocab_vecs[i] = word2vec[word]
print vocab_vecs.shape
print "Here instead of storing the whole vector set for word_index,",
print "we only store those up to vocab_size"

(20001, 100)
Here instead of storing the whole vector set for word_index, we only store those up to vocab_size
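
A toy version of the same construction shows what happens to the padding row and to words that GloVe does not cover. The vocabulary, the two-dimensional vectors, and the `toy_*` names are hypothetical; the missing-word count is an extra check that the cell above does not perform.

```python
import numpy as np

# Hypothetical toy inputs: a 1-based vocabulary and a tiny pretrained table.
toy_index2word = {1: "the", 2: "cat", 3: "zzyzx"}
toy_pretrained = {
    "the": np.array([0.1, 0.2], dtype="float32"),
    "cat": np.array([0.3, 0.4], dtype="float32"),
}

# Row 0 is the padding row; rows 1..n follow the word indices.
toy_vecs = np.zeros((len(toy_index2word) + 1, 2), dtype="float32")
for i, word in toy_index2word.items():
    if word in toy_pretrained:
        toy_vecs[i] = toy_pretrained[word]

# Count the vocabulary words that found no pretrained vector.
n_missing = sum(1 for w in toy_index2word.values() if w not in toy_pretrained)
print(toy_vecs)
print(n_missing)
```

Rows that stay all-zero are indistinguishable from padding, so a large missing count would be a reason to revisit the tokenization or the choice of vector file.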


## 3. Construct the model

In [87]:
from keras import layers, models
from keras import backend as K

In [92]:
## index sequences from text - model input (batch_size, seq_len)
v_seqs = layers.Input(shape = (params["max_seq_len"], ), dtype = "int32", name="v_seqs")

## embedding layer and embedded seq vector - (batch_size, seq_len, word_dim)
## set trainable = False to keep the word vector representation fixed
## the weights of a layer are a list of np.arrays (one per parameter, e.g., W, b) 
layer_embedding = layers.Embedding(input_dim=nb_words+1, 
                                   input_length=params["max_seq_len"], 
                                   output_dim=params["word_dim"],
                                   weights = [vocab_vecs],
                                   trainable = False,
                                   name = "layer_embedding")
v_embedded_seqs = layer_embedding(v_seqs)

In [93]:
## sanity-check the embedding layer by looking up a few word vectors

from scipy.spatial.distance import cosine

seq_to_vec = K.function(inputs = [v_seqs], outputs=[v_embedded_seqs])

def get_word_vectors(word):
    seqs = sequence.pad_sequences( np.array([[word2index[word]]]), 
                                  maxlen=params["max_seq_len"], )
    vec = seq_to_vec([seqs])[0][0, -1, :]
    return vec

king = get_word_vectors("king")
man = get_word_vectors("man")
woman = get_word_vectors("woman")
queen = get_word_vectors("queen")

queen_to_be = king - man + woman

print "cosine similarity between king-man+woman and queen", 1 - cosine(queen, queen_to_be)

cosine similarity between king-man+woman and queen 0.783441347372
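
`scipy.spatial.distance.cosine` returns a distance, which is why the cell above subtracts it from 1 to get a similarity. The quantity itself is the dot product divided by the product of the vector norms; the function below is a minimal NumPy sketch of it on made-up vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, a))
print(cosine_similarity(a, b))
```

The value is undefined for an all-zero vector, so a word that has no GloVe entry (and therefore an all-zero embedding row) cannot be compared this way.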


In [94]:
## use a 1D cnn to reduce each sequence of word vectors to a single vector
## a filter size of 5 is a common default for both texts and images
## usually more cnn layers (conv + maxpool) give better results
## use dropout if necessary to prevent overfitting
## each layer takes the previous layer's output: 1000 -> 996 -> 199 -> 195 -> 39 -> 35 -> 1
v_seq_vectors = layers.Conv1D(128, 5, activation="relu")(v_embedded_seqs)
v_seq_vectors = layers.MaxPooling1D(5)(v_seq_vectors)
v_seq_vectors = layers.Conv1D(128, 5, activation="relu")(v_seq_vectors)
v_seq_vectors = layers.MaxPooling1D(5)(v_seq_vectors)
v_seq_vectors = layers.Conv1D(128, 5, activation="relu")(v_seq_vectors)
v_seq_vectors = layers.MaxPooling1D(35)(v_seq_vectors)
v_seq_vectors = layers.Dropout(.5)(v_seq_vectors)

## flatten and dense layer for classification
v_article_vectors = layers.Flatten()(v_seq_vectors)
v_hidden_vectors = layers.Dense(128, activation="relu")(v_article_vectors)
v_probs = layers.Dense(len(label2index), activation="softmax")(v_hidden_vectors)

model = models.Model(input = v_seqs, output=v_probs)
model.compile(loss = "categorical_crossentropy", optimizer="adam", metrics = ["accuracy"])
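
The pooling size of 35 in the last layer makes sense once the sequence length is traced through the stack. The sketch below assumes that each convolution consumes the previous pooling output, that convolutions use no padding and stride 1, and that pooling windows do not overlap; `conv_len` and `pool_len` are hypothetical helpers, not Keras functions.

```python
def conv_len(length, kernel):
    """Output length of an unpadded, stride-1 1D convolution."""
    return length - kernel + 1

def pool_len(length, pool):
    """Output length of non-overlapping 1D max pooling."""
    return length // pool

length = 1000  # max_seq_len
trace = [length]
for kernel, pool in [(5, 5), (5, 5), (5, 35)]:
    length = conv_len(length, kernel)
    trace.append(length)
    length = pool_len(length, pool)
    trace.append(length)

print(trace)
```

Under these assumptions the final pooling window covers the whole remaining sequence, so each article is reduced to a single 128-dimensional vector before `Flatten`.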

In [105]:
## should give 95% accuracy after 2 epochs
model.fit(train_X, train_Y, nb_epoch=2, validation_data=(valid_X, valid_Y))

Train on 15997 samples, validate on 4000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f2bb57226d0>