<div style="text-align: right"><a href="http://ml-school.uni-koeln.de">Virtual Summer School "Deep Learning for
    Language Analysis"</a> <br/><strong>Text Analysis with Deep Learning</strong><br/>Aug 31 — Sep 4, 2020<br/>Nils Reiter<br/><a href="mailto:nils.reiter@uni-koeln.de">nils.reiter@uni-koeln.de</a></div>

# Exercise 1: Sentiment analysis as bag of words

This is the first exercise for you to solve on your own, working in groups of about three students. Feel free to contact us via [Teams](https://teams.microsoft.com/l/team/19%3aeefdaf656d5d48d3868e04682d159f57%40thread.tacv2/conversations?groupId=de73deca-e22a-46d1-81da-42a4e12897f9&tenantId=4982814a-1107-493e-ac2f-3356509a8687) and call us into your room if you need support.


In [1]:
import bz2
import numpy as np

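def get_labels_and_texts(file, limit=100000):
    """Read up to `limit` reviews from a bz2-compressed file.

    A non-positive `limit` reads the whole file.
    """
    labels = []
    texts = []
    # The context manager closes the file even if parsing fails.
    with bz2.BZ2File(file) as f:
        for line_number, raw_line in enumerate(f, start=1):
            x = raw_line.decode("utf-8")
            # Character 9 is the label digit (1 or 2); shift it to 0 or 1.
            labels.append(int(x[9]) - 1)
            # Everything after the label is the review text.
            texts.append(x[10:].strip())
            if limit > 0 and line_number >= limit:
                break
    return np.array(labels), texts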
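
The slicing in `get_labels_and_texts` only works if every line starts with a nine-character label prefix followed by a single digit. The sketch below walks through one such line, assuming the fastText convention `__label__1` / `__label__2`. The helper name `parse_line` is hypothetical and the sample review is made up for illustration.

```python
def parse_line(line):
    """Split one fastText-style line into (label, text).

    Assumes the line starts with "__label__1" or "__label__2" and a space.
    """
    prefix = "__label__"  # 9 characters, so the digit sits at index 9
    if not line.startswith(prefix):
        raise ValueError("unexpected line format")
    label = int(line[len(prefix)]) - 1  # digit 1 becomes 0, digit 2 becomes 1
    text = line[len(prefix) + 1:].strip()
    return label, text

# Made-up example line, not taken from the data set.
label, text = parse_line("__label__2 Great product, works as described.\n")
print(label, text)
```

Subtracting 1 turns the labels into 0 and 1, which matches the range of a sigmoid output unit.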


If `data/amazon/train.ft.txt.bz2` does not exist, we download it. (The exclamation mark `!` indicates that the command is executed not by Python, but by the underlying shell. That's why the syntax does not look like Python at all.)

In [None]:
! mkdir -p data/amazon; if ! [[ -f data/amazon/train.ft.txt.bz2 ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/amazon/train.ft.txt.bz2 > data/amazon/train.ft.txt.bz2; fi
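
If you prefer to stay in Python, the same "download only if missing" logic can be sketched with the standard library. The function name `download_if_missing` is hypothetical. Unlike a plain shell redirect, this version also creates the target directory, since writing into a directory that does not exist would fail.

```python
import os
import urllib.request

def download_if_missing(url, path):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    # Create the parent directory first; writing into a missing directory fails.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, path)
    return True

# Usage (commented out so that running this cell does not hit the network):
# download_if_missing(
#     "https://nilsreiter.de/assets/2020-08-31-deep-learning/amazon/train.ft.txt.bz2",
#     "data/amazon/train.ft.txt.bz2",
# )
```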

This line opens and parses the file we downloaded above. The function `get_labels_and_texts(...)` is defined above. Because we do not override the default value of the argument `limit`, the function only loads the first 100,000 reviews.

In [2]:
train_labels, train_texts = get_labels_and_texts('data/amazon/train.ft.txt.bz2')

Now we have imported the training data into variables. `train_labels` contains the classes, `train_texts` the corresponding reviews. Feel free to inspect those.

In [None]:
# shows the text at index 7, i.e., the eighth review (note the typo in it!)
train_texts[7]

## Task 1

In this exercise, we want the input to be a document-term matrix, i.e., each document is represented by a numeric vector. The vector contains one dimension for each token in the vocabulary, i.e., for each unique token in the entire (training) corpus (we will talk about this in more depth on Tuesday).

As an example, consider the following matrix:

| document | dog | cat | mouse | the | a | an |
| --- | --- | --- | --- | --- | --- | --- | 
| d1 | 5 | 6 | 0 | 10 | 5 | 6 |
| d2 | 0 | 1 | 10 | 3 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | 

Document `d1` contains five occurrences of the word "dog", six of the word "cat" etc.
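
To make the idea concrete, here is a minimal pure-Python sketch that builds such a matrix for two made-up sentences. It splits on whitespace, which is a simplification: the default tokenizer of `CountVectorizer` behaves differently (for instance, it ignores one-character tokens such as "a"). The function name `document_term_matrix` is hypothetical.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row of token counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of unique tokens across all documents.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for tokens that do not occur in the document.
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix

docs = ["the dog chased the cat", "a mouse saw the cat"]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, and most entries are zero as soon as the vocabulary grows beyond a handful of words.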

Creating such a matrix requires creating a vocabulary and then counting all these words. The class `CountVectorizer()` from `scikit-learn` is exactly what we need for this. You'll find the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Please use it to create a document-term matrix for the `train_texts`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000)

vectorizer.fit(train_texts)

train_texts_vec = vectorizer.transform(train_texts)

## Task 2

The vectors for each document are now stored in a sparse matrix, i.e., they are not fully realized (zeros are not stored, for instance). To make them dense, we can use the method `todense()` of SciPy sparse matrices. This is also a good opportunity to limit the number of training instances for development.


In [None]:
numInstances = 10000

x_train = train_texts_vec[:numInstances].todense()
y_train = train_labels[:numInstances]
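
A rough calculation shows why limiting the number of instances matters before calling `todense()`. The sketch assumes 8 bytes per matrix entry (the default `dtype` of `CountVectorizer` is `np.int64`) and a vocabulary of 5000 tokens. The function name `dense_size_bytes` is hypothetical.

```python
def dense_size_bytes(num_documents, vocabulary_size, bytes_per_value=8):
    """Memory needed for a dense matrix in which every cell is stored."""
    return num_documents * vocabulary_size * bytes_per_value

full = dense_size_bytes(100_000, 5_000)
subset = dense_size_bytes(10_000, 5_000)
print(f"all 100000 reviews: {full / 1e9:.1f} GB")
print(f"first 10000 reviews: {subset / 1e9:.1f} GB")
```

The sparse representation avoids this cost because it only stores the non-zero counts.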

## Task 3

We are now ready to define the neural network. Please define a neural network with
1. an input layer with an appropriate `shape` argument
2. a hidden layer with size 5 and activation function `sigmoid`
3. an output layer with activation function `sigmoid`

Once you have successfully fitted a first model, try to change the architecture to improve its accuracy!

In [None]:
from tensorflow.keras import models, layers, optimizers

ffnn = models.Sequential()
ffnn.add(layers.Input(shape=(5000,)))
ffnn.add(layers.Dense(5, activation="sigmoid"))
ffnn.add(layers.Dense(1, activation="sigmoid"))

ffnn.compile(loss="mean_squared_error", metrics=["accuracy"])
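
To get a feeling for the size of this network, we can count its trainable parameters by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_parameters` is hypothetical. Calling `ffnn.summary()` should report the same totals.

```python
def dense_parameters(num_inputs, num_units):
    """Weights (num_inputs * num_units) plus one bias per unit."""
    return num_inputs * num_units + num_units

hidden = dense_parameters(5000, 5)  # input vector -> hidden layer
output = dense_parameters(5, 1)     # hidden layer -> output unit
print(hidden, output, hidden + output)
```

Almost all parameters sit between the input and the hidden layer, so widening the hidden layer increases the model size quickly.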

This model `ffnn` can now be trained on the input data, using the function `fit()`.

In [None]:
history = ffnn.fit(x_train, y_train, epochs=10, batch_size=5, verbose=1)

## Task 4: Evaluation

Now that the model has been trained, we can test it on held-out data.
To this end, you can download a data set with the shell command below.

In [None]:
! mkdir -p data/amazon; if ! [[ -f data/amazon/test.ft.txt.bz2 ]]; then curl https://nilsreiter.de/assets/2020-08-31-deep-learning/amazon/test.ft.txt.bz2 > data/amazon/test.ft.txt.bz2; fi

For the actual evaluation, the test dataset needs to undergo the same preprocessing steps as the training data.
1. Read in the data from a file
2. Vectorize each document, using the same vectorizer (from above)
3. Create one matrix `x_test` that contains the input data and one array `y_test` that contains the labels.
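
Step 2 is the important one: the test documents must be mapped onto the columns learned from the training data, and tokens that never occurred during training are simply dropped. The following pure-Python sketch illustrates this with made-up documents. `fit_vocabulary` and `transform` are hypothetical helpers that mimic the split between `fit()` and `transform()`, not the scikit-learn implementation.

```python
def fit_vocabulary(documents):
    """Map each unique training token to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Count tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

vocabulary = fit_vocabulary(["good book", "bad book"])
test_rows = transform(["good good movie"], vocabulary)
print(vocabulary)
print(test_rows)
```

Fitting a second vectorizer on the test data would produce different columns, and the trained network could no longer interpret its input.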

Use the Keras method `evaluate()` on the model ([documentation](https://keras.io/api/models/model_training_apis/#evaluate-method)).

In [None]:
test_labels, test_texts = get_labels_and_texts('data/amazon/test.ft.txt.bz2')
test_texts_vec = vectorizer.transform(test_texts)

x_test = test_texts_vec[:numInstances].todense()
y_test = test_labels[:numInstances]

ffnn.evaluate(x_test, y_test)
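
`evaluate()` returns the loss followed by the metrics given to `compile()`, here the accuracy. The sketch below shows how an accuracy value comes about for a single sigmoid output, assuming that a prediction counts as class 1 when the output exceeds 0.5. The function name `binary_accuracy` is hypothetical and the numbers are made up.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of instances whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == gold) for pred, gold in zip(predictions, labels))
    return correct / len(labels)

probabilities = [0.9, 0.2, 0.6, 0.4]  # made-up sigmoid outputs
labels = [1, 0, 0, 0]                 # made-up gold labels
print(binary_accuracy(probabilities, labels))
```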

## Credits

This notebook is based on [this one](https://www.kaggle.com/muonneutrino/sentiment-analysis-with-amazon-reviews) by MuonNeutrino on Kaggle.