## Text classification using Neural Networks

The goal of this notebook is to learn to use Neural Networks for text classification, going beyond simple Bag of Words models.

In this notebook, we will:
- Train a shallow model which learns embeddings
- Download pre-trained embeddings from Glove
- Use these pre-trained embeddings

Keep in mind:
- Deep Learning can be better at text classification than simpler ML techniques, but only on very large datasets and with well designed/tuned models.
- We won't be using the most efficient (in terms of computing) techniques, as Keras is good for prototyping but rather inefficient for training small embedding models on text.
- The following projects can replicate similar word embedding models much more efficiently: [word2vec](https://github.com/dav/word2vec) and [gensim's word2vec](https://radimrehurek.com/gensim/models/word2vec.html) (self-supervised learning only), [fastText](https://github.com/facebookresearch/fastText) (both supervised and self-supervised learning). However, it is harder to see what happens inside them. We will use them tomorrow.
- Plain shallow sparse TF-IDF bigram features, without any embedding, fed to Logistic Regression or Multinomial Naive Bayes are often competitive on small to medium datasets.

## The IMDB movie review dataset

(same dataset as in the TfIdf exercise.)

Fetch the dataset from http://ai.stanford.edu/~amaas/data/sentiment/ and untar it to
a directory near this notebook. I placed it in `../data/`.

In [None]:
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("../data/aclImdb/train/", categories=['neg', 'pos'])

text_trainval, y_trainval = reviews_train.data, reviews_train.target

print("type of text_trainval: {}".format(type(text_trainval)))
print("length of text_trainval: {}".format(len(text_trainval)))
print("class balance: {}".format(np.bincount(y_trainval)))

Let's randomly partition the text files into a training set and a test set while recording the target category of each file as an integer:

In [None]:
from sklearn.model_selection import train_test_split

# Remove some HTML and turn `bytes` into `str`
text_trainval = [doc.replace(b"<br />", b" ").decode() for doc in text_trainval]

# Use train_test_split to split up your dataset
texts_train, texts_test, target_train, target_test = train_test_split(
    text_trainval, y_trainval, stratify=y_trainval, random_state=0)

In [None]:
# look at an example review, and some other sanity checks
# just to make sure you properly loaded the data, splitting worked, etc
print("text_trainval[42]:\n{}".format(text_trainval[42]))

## A first baseline model

You've already constructed this model a few times. Feel free to copy&paste the
code here. Or make use of this opportunity to find out how to use `make_pipeline`
to construct the model.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# create a pipeline from the TfidfVectorizer and a LogisticRegression
# fit and score the model. Make a note of the amount of CPU time.
text_classifier = make_pipeline(
# ... your code here ...
)

In [None]:
# The %%time magic (used by itself on a single line
# at the top of a cell) will measure how long the cell runs

# fit your model

In [None]:
# score your model

You should reach a score of around 88%. It's unlikely that we can significantly beat this baseline with a more complex deep learning based model. However, let's try to reach a comparable level of accuracy with an `Embedding`-based model for teaching purposes.

To create a really competitive benchmark you should tune the hyper-parameters of the `TfidfVectorizer` and `LogisticRegression`. Come back to this later if you have time.

### Preprocessing text for the (supervised) CBOW model

We will implement a simple classification model in Keras. Raw text requires (sometimes a lot of) preprocessing.

The following cells use Keras to preprocess text:
- using a tokenizer. You may use different tokenizers (from scikit-learn, spaCy, a custom Python function, etc.). This converts the texts into sequences of indices representing the `20000` most frequent words
- sequences have different lengths, so we pad them with 0s (and truncate the longest ones) until every sequence has the same length (`500` below)
- we convert the output classes to 1-hot encodings
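
To make these three steps concrete, here is a minimal sketch that mimics them with plain Python and `numpy` on a made-up corpus. The names `toy_texts`, `toy_word_index`, `toy_X` and `toy_y` are hypothetical, and this is only an illustration of the idea, not how the Keras `Tokenizer` is implemented. The zeros are placed at the end here; which side gets padded is a choice.

```python
from collections import Counter

import numpy as np

# Toy corpus standing in for the IMDB reviews (made-up data).
toy_texts = ["a great great movie", "a bad movie", "great acting"]
toy_targets = [1, 0, 1]

# 1. Build a vocabulary: most frequent words get the smallest ids.
#    Ids start at 1 because 0 is kept for padding.
counts = Counter(word for text in toy_texts for word in text.split())
toy_word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

# 2. Texts -> sequences of ids, keeping only the most frequent words
#    (here: ids strictly below max_words).
max_words = 4
toy_sequences = [
    [toy_word_index[w] for w in text.split() if toy_word_index[w] < max_words]
    for text in toy_texts
]

# 3. Pad (or truncate) every sequence to the same length with 0s.
max_len = 3
toy_X = np.zeros((len(toy_sequences), max_len), dtype="int32")
for row, seq in enumerate(toy_sequences):
    seq = seq[:max_len]            # truncate sequences that are too long
    toy_X[row, :len(seq)] = seq    # the remaining positions stay at 0

# 4. One-hot encode the integer targets: one column per class.
toy_y = np.eye(2, dtype="float32")[toy_targets]

print(toy_X)
print(toy_y)
```

Rare words simply disappear from the sequences, which is why reconstructed reviews can look like they are missing words.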

In [None]:
import keras
from keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 20000

# vectorize the text samples into a 2D integer tensor
# except for tuning parameters in the Tokenizer or
# using your own/different one this is mostly boilerplate
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, char_level=False)
tokenizer.fit_on_texts(texts_train)
sequences = tokenizer.texts_to_sequences(texts_train)
sequences_test = tokenizer.texts_to_sequences(texts_test)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Tokenized sequences are converted to lists of token ids (with an integer code):

In [None]:
sequences[0][:10]

The tokenizer object stores a mapping (vocabulary) from word strings to token ids that can be inverted to reconstruct the original message (without formatting):

In [None]:
type(tokenizer.word_index), len(tokenizer.word_index)

In [None]:
index_to_word = dict((i, w) for w, i in tokenizer.word_index.items())

In [None]:
# use `index_to_word` to turn your sequence of integers back into text
# for one or two documents

Let's have a closer look at the tokenized sequences. The next task is dealing with the fact that each review has a different length. We will have to decide a maximum length and then convert all reviews accordingly.

In [None]:
seq_lens = [len(s) for s in sequences]
print("average length: %0.1f" % np.mean(seq_lens))
print("max length: %d" % max(seq_lens))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(seq_lens, bins=50);

It almost looks like there was a hand made cut off at 1000, but some longer reviews got through.

Let's zoom in on the distribution of regular sized reviews. The vast majority of the reviews have fewer than 500 symbols:

In [None]:
plt.hist([l for l in seq_lens if l < 500], bins=50);

Let's truncate and pad all the sequences to 500 symbols to build the training set.

Could you find a more quantitative way to decide what sequence length to keep? Maybe look at the 95% quantile using `numpy`.
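
One possible way to do this is sketched below on made-up lengths (`toy_seq_lens` is a hypothetical stand-in for `seq_lens`). It uses `np.percentile` to find the length below which 95% of the sequences fall, then checks what fraction of sequences would be kept whole at that cutoff.

```python
import numpy as np

# Made-up sequence lengths; in the notebook this would be `seq_lens`.
toy_seq_lens = np.array([120, 80, 200, 150, 95, 1000, 300, 60, 180, 240])

# Length below which 95% of the sequences fall.
cutoff = int(np.percentile(toy_seq_lens, 95))

# Fraction of sequences that would not need truncating at that cutoff.
coverage = np.mean(toy_seq_lens <= cutoff)

print("cutoff:", cutoff, "coverage:", coverage)
```

A single very long review barely moves the quantile, whereas it dominates the maximum length.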

Use `pad_sequences` from `keras.preprocessing.sequence` to do the job of padding and limiting the length of our sequences

In [None]:
from keras.preprocessing.sequence import pad_sequences


MAX_SEQUENCE_LENGTH = 500

# pad sequences with 0s using the `pad_sequences` function
# ... your code here ...

print('Shape of data tensor:', X_train.shape)
print('Shape of data test tensor:', X_test.shape)

In [None]:
# we have to one hot encode our targets
from keras.utils import to_categorical

y_train = to_categorical(target_train)
print('Shape of label tensor:', y_train.shape)

### A simple supervised CBOW model in Keras

The following computes a very simple model, as described in [fastText](https://github.com/facebookresearch/fastText):

<img src="fasttext.svg" style="width: 600px;" />

- Build an embedding layer mapping each word to a vector representation
- Compute the vector representation (`Embedding`) of all words in each sequence and average them (`GlobalAveragePooling1D`)
  - start with an embedding size of 50
- Add a `Dense` layer to output 2 classes (+ softmax)
- connect everything together in a keras `Model`.
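
Before writing the Keras version, it can help to see what these layers compute. The sketch below reproduces the forward pass with `numpy` only, using random (untrained) parameters and toy sizes. All names in it are hypothetical, and it assumes the average is taken over every position, padding included.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim, n_classes = 10, 4, 2   # toy sizes

# Parameters that the real model would learn (random here).
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
dense_W = rng.normal(size=(embedding_dim, n_classes))
dense_b = np.zeros(n_classes)

# A batch of two padded sequences of token ids.
batch = np.array([[1, 4, 2, 0, 0],
                  [3, 3, 7, 9, 5]])

# Embedding: replace each id by its row in the table.
embedded = embedding_table[batch]         # shape (2, 5, embedding_dim)

# Global average pooling: average over the sequence axis.
averaged = embedded.mean(axis=1)          # shape (2, embedding_dim)

# Dense layer followed by a softmax over the classes.
logits = averaged @ dense_W + dense_b     # shape (2, n_classes)
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

print(embedded.shape, averaged.shape, probs.shape)
```

Note that word order is lost in the averaging step: shuffling the ids within a row gives the same prediction.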

Once you have a working model (debug using a small dataset of 10 samples maybe), `fit` it, and score it on the test dataset.

Some more questions to ask yourself:
How many epochs should you use (investigate `validation_split` argument to `fit()`)? How much data do you need? What happens if you switch optimizer? How big/small can you make the embedding dimension?

In [None]:
from keras.layers import Dense, Input, Flatten
from keras.layers import GlobalAveragePooling1D, Embedding
from keras.models import Model
from keras import optimizers

EMBEDDING_DIM = 50
N_CLASSES = y_train.shape[1]  # y_train is one-hot encoded: one column per class

# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

# ...create an embedding layer ...

# ... that feeds into a GlobalAveragePooling1D layer ...

# ... that feeds into a Dense layer which stores results in the variable predictions ...

# ... that your Model can wire up when you compile it ...

In [None]:
# fit your model on just a few epochs and small batch size (maybe 32) to make
# sure it is working, then fit on more epochs (ten or so)

Compute model accuracy on test set

In [None]:
# better, worse or the same as our baseline?

### Loading pre-trained embeddings

In the above example we learnt our own Embedding. What if we want to use some pre-made word vectors from somewhere else?

The file `glove100K.100d.txt` is an extract of the [Glove](http://nlp.stanford.edu/projects/glove/) vectors, which were trained on English Wikipedia 2014 + Gigaword 5 (6B tokens).

It contains a subset of the `100 000` most frequent words. They have a dimension of `100`.

A compressed version of this file is in `data/` in the top level of this repository. You need to unzip it first before you can use it.
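
As a rough guide to what the loading code below does, here is a small sketch that parses two made-up lines laid out as `word v1 v2 ...` (3 dimensions instead of 100). The layout is inferred from the parsing code in this notebook, and `fake_file`, `toy_index`, `toy_vectors` and `toy_matrix` are hypothetical names.

```python
import io

import numpy as np

# Two made-up lines: a word followed by its vector components.
fake_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -0.4\n")

toy_index = {}     # word -> row number
toy_vectors = []   # row number -> vector
for row, line in enumerate(fake_file):
    values = line.split()
    toy_index[values[0]] = row
    toy_vectors.append(np.asarray(values[1:], dtype="float32"))

# One row per word, one column per embedding dimension.
toy_matrix = np.vstack(toy_vectors)

print(toy_index, toy_matrix.shape)
```

Keeping a word-to-row dictionary next to a single matrix lets us look up one word quickly and still compare a query against all words at once.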

In [None]:
# what is the structure of the file?
# why are we constructing these data structures?

embeddings_index = {}
embeddings_vectors = []
with open('../data/glove100K.100d.txt', 'rb') as f:
    word_idx = 0
    for line in f:
        values = line.decode('utf-8').split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = word_idx
        embeddings_vectors.append(vector)
        word_idx = word_idx + 1

inv_index = {v: k for k, v in embeddings_index.items()}
print("found %d different words in the file" % word_idx)

In [None]:
# Stack all embeddings in a large numpy array
# what dimensions should that array have?
# what should be on each row?
# what should be in each column?

glove_embeddings = 0 #...your code here...


glove_norms = np.linalg.norm(glove_embeddings, axis=-1, keepdims=True)
glove_embeddings_normed = glove_embeddings / glove_norms

In [None]:
assert glove_embeddings.shape[1] == 100, "should have 100d for 100d vectors"

In [None]:
def get_emb(word):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings[idx]

    
def get_normed_emb(word):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings_normed[idx]

In [None]:
get_emb("computer")

### Finding similar words

Build a function to find most similar words, given a word as query:
- lookup the vector for the query word in the Glove index;
- compute the cosine similarity between a word embedding and all other words;
- display the top 10 most similar words.

This should be a repeat of earlier today.

- Change your function so that it takes multiple words as input (by averaging them). This lets you find words which are similar to a whole sentence.
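
If you get stuck, the sketch below shows one way to do it on a tiny made-up vocabulary. The vectors are invented, not real Glove values, and `toy_most_similar` is a hypothetical helper. It relies on the fact that for unit-length vectors the cosine similarity is just a dot product.

```python
import numpy as np

# Made-up 3d vectors for a tiny vocabulary (not real Glove values).
toy_words = ["cpu", "gpu", "processor", "river", "lake"]
toy_emb = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
], dtype="float32")
toy_lookup = {w: i for i, w in enumerate(toy_words)}

# Normalising the rows once makes cosine similarity a plain dot product.
toy_normed = toy_emb / np.linalg.norm(toy_emb, axis=1, keepdims=True)


def toy_most_similar(words, topn=3):
    """Return the `topn` (word, cosine similarity) pairs closest to the query."""
    if isinstance(words, str):
        words = [words]
    # Average the query vectors, then normalise the average.
    query = np.mean([toy_normed[toy_lookup[w]] for w in words], axis=0)
    query = query / np.linalg.norm(query)
    sims = toy_normed @ query
    best = np.argsort(sims)[::-1]   # indices sorted by decreasing similarity
    # Skip the query words themselves.
    results = [(toy_words[i], float(sims[i]))
               for i in best if toy_words[i] not in words]
    return results[:topn]


print(toy_most_similar("cpu", topn=2))
print(toy_most_similar(["river", "lake"], topn=1))
```

With the real 100 000 word matrix the same matrix-vector product scores every word in one call, which is why it pays to pre-compute the normalised embeddings.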

In [None]:
def most_similar(words, topn=10):
    pass

In [None]:
most_similar("cpu")

In [None]:
most_similar("10")

In [None]:
most_similar("june")

In [None]:
# bonus: yangtze is a Chinese river
most_similar(["river", "chinese"])

### Displaying vectors with t-SNE

100 dimensions are hard to display, let's use a popular dimensionality reduction algorithm to display them in 2D. There should be some meaning to how things are distributed.

In [None]:
from sklearn.manifold import TSNE

word_emb_tsne = TSNE(perplexity=30).fit_transform(glove_embeddings_normed[:1000])

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(40, 40))
axis = plt.gca()
np.set_printoptions(suppress=True)
plt.scatter(word_emb_tsne[:, 0], word_emb_tsne[:, 1], marker=".", s=1)

for idx in range(1000):
    plt.annotate(inv_index[idx],
                 xy=(word_emb_tsne[idx, 0], word_emb_tsne[idx, 1]),
                 xytext=(0, 0), textcoords='offset points')
plt.savefig("tsne.png")
plt.show()
# probably worth opening this in a different window so you can zoom in

### Using pre-trained embeddings in our model

We want to use these pre-trained embeddings to perform "transfer learning". This process is very similar to transfer learning in image recognition: features learned for one task are useful for other similar tasks.

The features learnt on words might help us bootstrap the learning process, and increase performance if we don't have enough training data to learn vectors ourselves.

- We initialize embedding matrix from the model with Glove embeddings:
 - take all words from our IMDB vocabulary (`MAX_NB_WORDS = 20000`), and look up their Glove embedding 
 - place the Glove embedding at the corresponding index in the matrix
 - if the word is not in the Glove vocabulary, we only place zeros in the matrix (you could experiment with setting these vectors to random values, or maybe look up the most similar word and use its embedding)
- We may fix these embeddings or fine-tune them
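
The random-values variant mentioned above could look roughly like the sketch below. It works on a toy matrix rather than the real `embedding_matrix`, and the choice of drawing from a normal distribution with the same spread as the known vectors is an assumption made for illustration, not a recommendation from the Glove authors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the embedding matrix: row 0 is the padding row, rows 2
# and 4 belong to words that were not found in the pre-trained vocabulary.
toy_matrix = rng.normal(size=(6, 4)).astype("float32")
toy_matrix[[0, 2, 4]] = 0.0

# Rows that are still all zeros are the missing words...
missing = ~toy_matrix.any(axis=1)
# ...except the padding row, which we want to keep at zero.
missing[0] = False

# Draw replacements with the same spread as the vectors we do know.
known = toy_matrix.any(axis=1)
scale = toy_matrix[known].std()
toy_matrix[missing] = rng.normal(scale=scale, size=(int(missing.sum()), 4))

print(toy_matrix.round(2))
```

Random rows only make sense if the embedding layer is fine-tuned afterwards; with frozen embeddings they just add noise to the average.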

There is an example in the Keras documentation which is similar to what we will do: https://github.com/keras-team/keras/blob/454be50414967002197cc40be9d968a16a07f6b9/examples/pretrained_word_embeddings.py#L103-L121

In [None]:
EMBEDDING_DIM = 100

# prepare embedding matrix
nb_words_in_matrix = 0
# token ids start at 1 (0 is reserved for padding), hence the + 1
nb_words = min(MAX_NB_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = get_emb(word)
    # words not found in the embedding index keep their all-zeros row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        nb_words_in_matrix = nb_words_in_matrix + 1
        
print("added %d words in the embedding matrix" % nb_words_in_matrix)

Build a layer with pre-trained embeddings. The key is the `weights` argument to `Embedding`.

In [None]:
pretrained_embedding_layer = Embedding(
#...your arguments here...
)

### A model with pre-trained Embeddings

Averaging word embeddings pre-trained with Glove / Word2Vec usually works surprisingly well. However, when averaging more than `10-15` words, the resulting vector becomes too noisy and classification performance is degraded.

In [None]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = pretrained_embedding_layer(sequence_input)

# what is the shape of `embedded_sequences`?
# Need to average the output of the embedding layer
average = GlobalAveragePooling1D()(embedded_sequences)
# what is the shape of `average` now?
predictions = Dense(N_CLASSES, activation='softmax')(average)

model = Model(sequence_input, predictions)

# We don't want to fine-tune the embeddings
# this is the key to using pre-trained vectors
model.layers[1].trainable = False

# ... compile the model ...

In [None]:
# ... fit the model for maybe 10 or 15 epochs ...

In [None]:
# ... score your model ...

**Remarks:**

- On this type of task, using pre-trained embeddings can degrade results, as we train far fewer parameters and we average a large number of pre-trained embeddings. Check out `model.summary()` to see how many trainable parameters your Keras model has. Compare the model that uses pre-trained vectors with the one that learns its own.

- Pre-trained embeddings followed by global averaging prevent overfitting but can also cause some underfitting.
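
The parameter counts from the first remark can also be worked out by hand. The sketch below assumes both embedding layers have `MAX_NB_WORDS = 20000` rows and that the model is exactly embedding, then average, then dense; `count_params` is a hypothetical helper, so compare its output with what `model.summary()` reports for your own models.

```python
MAX_NB_WORDS = 20000
N_CLASSES = 2


def count_params(embedding_dim, train_embeddings):
    """Return (trainable, frozen) counts for embedding -> average -> dense."""
    embedding = MAX_NB_WORDS * embedding_dim
    dense = embedding_dim * N_CLASSES + N_CLASSES   # weights + biases
    if train_embeddings:
        return embedding + dense, 0
    return dense, embedding


# Embeddings of size 50 learned from scratch.
learned = count_params(50, train_embeddings=True)
# Frozen pre-trained embeddings of size 100.
pretrained = count_params(100, train_embeddings=False)

print("learned embeddings (trainable, frozen):", learned)
print("frozen pre-trained (trainable, frozen):", pretrained)
```

With frozen embeddings only the small dense layer is trained, which is why that model can underfit.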

Pre-trained embeddings can be very useful when the training set is small and the individual text documents to classify are short: in this case there might be a single very important word in the test document that drives the label. If that word has never been seen in the training set but some synonyms were seen, the semantic similarity captured by the embedding will allow the model to generalize beyond the restricted training set vocabulary.

We did not observe this effect here because the documents are long enough that the topic can be guessed redundantly from many words. Shortening the documents to make the task more difficult could possibly highlight this benefit. Investigate this!

## What about other languages?

If you have time, find a dataset in another language, for example German (http://www.spinningbytes.com/resources/ , with a journal article at http://aclweb.org/anthology/W17-1106 ), and use the word vectors from https://fasttext.cc/docs/en/crawl-vectors.html to build a similar model for texts in languages other than English.

---

## Reality check

On small/medium datasets (this one is small), simpler classification methods usually perform better, and are much more efficient to compute. Here are two resources to go further:
- Naive Bayes approach, using scikit-learn http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
- Alec Radford (OpenAI) gave a very interesting presentation, showing that you need a VERY large dataset to have real gains from GRU/LSTM in text classification https://www.slideshare.net/odsc/alec-radfordodsc-presentation

However, when looking at the features used by simple linear models, one can see that classification is probably not very robust and won't generalize well to slightly different domains (e.g. forum posts => emails). Try this out by feeding movie reviews from the internet to your models, for example from [Rotten Tomatoes](https://www.rottentomatoes.com/).

Note: Keras implementations for text are very slow due to Python overhead and the lack of hashing techniques. The `fastText` implementation https://github.com/facebookresearch/fasttext is much, much faster. Use this in production!

---


## Going further

- Compare pre-trained embeddings vs specifically trained embeddings
- Check [Keras Examples](https://github.com/fchollet/keras/tree/master/examples) on `imdb` sentiment analysis

- Today, the **state-of-the-art text classification** can be achieved by **transfer learning from a language model** instead of using traditional word embeddings. See for instance: FitLaM, Fine-tuned Language Models for Text Classification https://arxiv.org/abs/1801.06146. Or even more recently: https://blog.openai.com/language-unsupervised/
- Interesting to read and try out: https://github.com/facebookresearch/InferSent , which deals with the problem of embedding whole sentences instead of averaging words.