<div style="text-align: right"><a href="http://ml-school.uni-koeln.de">Virtual Summer School "Deep Learning for
    Language Analysis"</a> <br/><strong>Text Analysis with Deep Learning</strong><br/>Aug 31 — Sep 4, 2020<br/>Nils Reiter<br/><a href="mailto:nils.reiter@uni-koeln.de">nils.reiter@uni-koeln.de</a></div>

# Exercise 2

In this exercise, we continue to use the Amazon reviews dataset and to classify the reviews according to their polarity (positive/negative).

In contrast to Exercise 1, however, we will be using embeddings and a convolutional neural network.

----

The first two cells ensure that the data set files are in the correct location (you should already have them from Exercise 1).

In [None]:
! if ! [[ -f data/amazon/train.ft.txt.bz2 ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/amazon/train.ft.txt.bz2 > data/amazon/train.ft.txt.bz2; fi

In [None]:
! if ! [[ -f data/amazon/test.ft.txt.bz2 ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/amazon/test.ft.txt.bz2 > data/amazon/test.ft.txt.bz2; fi

This block defines a function that reads the first 100,000 reviews from a data set file, and applies it to both the training and the test set. This is the same as in Exercise 1.


In [None]:
import bz2
import numpy as np

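def get_labels_and_texts(file, limit=100000):
    """Read up to `limit` reviews from a bz2 file (limit <= 0 reads all)."""
    labels = []
    texts = []
    lineNumber = 0
    for line in bz2.BZ2File(file):
        x = line.decode("utf-8")
        # Character 9 holds the class digit (1 or 2); shift it to 0/1.
        labels.append(int(x[9]) - 1)
        # The review text starts at character 10.
        texts.append(x[10:].strip())
        lineNumber = lineNumber + 1
        if lineNumber >= limit and limit > 0:
            break
    return np.array(labels), texts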

train_labels, train_texts = get_labels_and_texts('data/amazon/train.ft.txt.bz2')
test_labels, test_texts = get_labels_and_texts('data/amazon/test.ft.txt.bz2')
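
To see what the indices 9 and 10 in the function do, here is a minimal sketch on a single line. It assumes that each line of the file has the form `__label__N text` (a fastText-style layout, which the indices suggest); the sample line and the helper name `parse_line` are made up for illustration.

```python
# Hypothetical sample line; the real file is assumed to use this layout.
sample = "__label__2 Great sound: I love these headphones.\n"

PREFIX = "__label__"  # 9 characters, so the class digit sits at index 9

def parse_line(line):
    """Split one line into a 0/1 label and the review text."""
    label = int(line[len(PREFIX)]) - 1       # '1' -> 0, '2' -> 1
    text = line[len(PREFIX) + 1:].strip()    # everything after the digit
    return label, text

label, text = parse_line(sample)
print(label, text)
```

Subtracting 1 turns the class digits 1 and 2 into the labels 0 and 1, which is what a sigmoid output with binary cross-entropy expects.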

## Task 1: Tokenization

Since we are not using a bag-of-words representation and a word-document matrix, we need to keep the word order intact. Therefore, we first need to **tokenize** all the texts. The class [`Tokenizer` from Keras](https://keras.io/api/preprocessing/text/#tokenizer-class) does this, and also maintains an index that maps strings to numbers. That is, by using this class (which needs to be fitted to the training data), we get a sequence of integers for each text.

Three steps are needed for use:
1. Initialization of a `Tokenizer` object with reasonable parameters (e.g., you can restrict the number of words to distinguish with the parameter `num_words`).
2. Fit the tokenizer to the training data (to learn an actual mapping from surface forms to integers).
3. Convert the training and test texts to integer sequences. The results should be in variables called `train_texts` and `test_texts`, respectively.
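
The idea behind these three steps can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helper names and the toy texts are made up, and it only lowercases and splits on whitespace, whereas the real `Tokenizer` also filters punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index; unknown or rare words are dropped."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
        sequences.append(seq)
    return sequences

# Hypothetical toy data
toy_train = ["good product", "bad product", "good good value"]
toy_test = ["good value", "terrible product"]

word_index = fit_word_index(toy_train)            # fitted on training data only
print(word_index)
print(to_sequences(toy_test, word_index))         # "terrible" was never seen
print(to_sequences(toy_test, word_index, num_words=3))
```

Because the index is learned from the training data only, a word that occurs only in the test data ("terrible") has no number and disappears from its sequence.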

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_FEATURES = 12000
tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train_texts)
train_texts = tokenizer.texts_to_sequences(train_texts)
test_texts = tokenizer.texts_to_sequences(test_texts)

## Task 2: Padding

Because the input data needs to have the same shape for all texts, we need to "fill" shorter texts. This process is called **padding**, and there is also a ready-made function for this: [`pad_sequences()` from Keras](https://keras.io/api/preprocessing/timeseries/#padsequences-function). Apart from the sequences themselves, the function needs only one argument, `maxlen`: the length to which sequences should be padded. This can of course be a made-up number, but this is risky, because texts might be longer than you think, and sequences longer than `maxlen` are truncated. It's better to find out the length of the longest text in the training data first, and pad the data to this length.
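
What padding does can be sketched in plain Python. This is a simplified illustration with a made-up helper name (`pad_pre`), not the Keras function. It mimics the default behaviour of `pad_sequences()`, which adds zeros at the front and, for sequences that are too long, also cuts from the front.

```python
def pad_pre(sequences, maxlen, value=0):
    """Pad at the front (or cut from the front) so every sequence has length maxlen."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                          # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Hypothetical integer sequences of different lengths
toy = [[1, 2], [3, 4, 5, 6], [7]]
print(pad_pre(toy, maxlen=3))
# [[0, 1, 2], [4, 5, 6], [0, 0, 7]]
```

The second sequence loses its first element. The same happens to any test text that is longer than the longest training text.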

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LENGTH = max(len(train_ex) for train_ex in train_texts)
train_texts = pad_sequences(train_texts, maxlen=MAX_LENGTH)
test_texts = pad_sequences(test_texts, maxlen=MAX_LENGTH)

## Task 3: Model Definition

Now we can define the model. For starters, you should define a model with the following layers: 1. Embedding, 2. Convolutional, 3. Flatten, 4. Dense. For compilation, you can pick a binary cross-entropy loss.

In [None]:
from tensorflow.keras import models, layers, optimizers

model = models.Sequential()
model.add(layers.Input(shape=(MAX_LENGTH,)))
model.add(layers.Embedding(input_dim=MAX_FEATURES, output_dim=64))
model.add(layers.Conv1D(filters=10, kernel_size=5, activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', metrics=['accuracy'])
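
To get a feeling for the size of this model, the layer shapes and parameter counts can be worked out by hand. The sketch below uses the values from the cell above; `SEQ_LEN` is a made-up stand-in for `MAX_LENGTH`, whose real value depends on the data. It assumes the `Conv1D` defaults (no padding, stride 1).

```python
# Values from the model definition, except SEQ_LEN (hypothetical).
MAX_FEATURES = 12000   # vocabulary size (input_dim of the embedding)
EMBEDDING_DIM = 64     # output_dim of the embedding
FILTERS = 10
KERNEL_SIZE = 5
SEQ_LEN = 250          # made-up stand-in for MAX_LENGTH

# Embedding: one vector of EMBEDDING_DIM weights per vocabulary entry
embedding_params = MAX_FEATURES * EMBEDDING_DIM

# Conv1D: each filter sees KERNEL_SIZE positions x EMBEDDING_DIM channels, plus a bias
conv_params = KERNEL_SIZE * EMBEDDING_DIM * FILTERS + FILTERS
conv_steps = SEQ_LEN - KERNEL_SIZE + 1      # output length without padding

# Flatten has no weights; Dense(1) has one weight per input plus a bias
flat_size = conv_steps * FILTERS
dense_params = flat_size + 1

total_params = embedding_params + conv_params + dense_params
print(embedding_params, conv_params, dense_params, total_params)
```

Most parameters sit in the embedding layer. You can compare these numbers with the output of `model.summary()` for your actual `MAX_LENGTH`.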

## Task 4: Training and Evaluation

Training might take a long time, because the entire data set is quite large. If you are impatient, it might be a good idea to use only a subset of the data for training (e.g., the first 5000 reviews or so).

In [None]:
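NUM_INSTANCES = 5000  # number of training reviews to use

# Train on the first NUM_INSTANCES reviews only
history = model.fit(train_texts[:NUM_INSTANCES], train_labels[:NUM_INSTANCES],
    batch_size=128, epochs=2)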

Evaluate the model on unseen test data. `evaluate()` returns the loss, followed by the metrics given at compile time (here: accuracy).

In [None]:
model.evaluate(test_texts, test_labels)
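
The two numbers reported here can be reproduced by hand on a tiny example. The sketch below is a plain-Python illustration with made-up labels and predicted probabilities, not the Keras implementation. It assumes a decision threshold of 0.5 for accuracy and clips probabilities to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all instances."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of instances whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical gold labels and sigmoid outputs
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(round(binary_crossentropy(y_true, y_prob), 4))
print(accuracy(y_true, y_prob))
```

The loss also punishes correct but unconfident predictions, while accuracy only counts on which side of the threshold a prediction falls.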