# Sentiment Analysis using LSTM and CNN

**There are 10 points in total for this homework. Send the completed notebook to beroth@cis.uni-muenchen.de. The deadline is Tuesday, December 19, 23:59. You can work in teams of 2 or 3. This is the last exercise before the projects.**

First some imports. You will have to add imports for the CNN in the second part of the exercise.

In [None]:
import collections
import random
import nltk
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Bidirectional
from keras.layers import add as addition
import sys
#TODO
random.seed(111)

The next part loads the sentiment data into the required format. We will use the movie review corpus provided by the nltk package, in which each review is labeled as positive or negative.

In [None]:
def nltk_data(n_texts_train=1500, n_texts_dev=500, vocab_size=10000):
    """
    Reads texts from the nltk movie_reviews corpus. A word2id dictionary is 
    created and the words in the texts are substituted with their numbers. Training
    and Development data is returned, together with labels and the word2id dictionary.
 
    :param n_texts_train: the number of reviews that will form the training data
    :param n_texts_dev: the number of reviews that will form the development data
    :param vocab_size: the maximum size of the vocabulary.

    :return list texts_train: A list containing lists of wordids corresponding to 
    training texts.
    :return list texts_dev: A list containing lists of wordids corresponding to 
    development texts.
    :return labels_train: A list containing the labels (0 or 1) for the corresponding
    text entry in texts_train
    :return labels_dev: A list containing the labels (0 or 1) for the corresponding
    text entry in texts_dev
    :return word2id: The dictionary obtained from the training texts that maps each
    seen word to an id.
    """
    all_ids = movie_reviews.fileids()
    if (n_texts_train+n_texts_dev>len(all_ids)):
        print("Error: There are only", len(all_ids), "texts in the movie_reviews corpus. Falling back to 1500 training and 500 development texts.")
        n_texts_train=1500
        n_texts_dev=500
    posids = movie_reviews.fileids('pos')
    random.shuffle(all_ids)

    texts_train=[]
    labels_train=[]
    texts_dev=[]
    labels_dev=[]

    for i in range(n_texts_train):
        text = movie_reviews.raw(fileids=[all_ids[i]])
        tokens = [word.lower() for word in word_tokenize(text)]
        texts_train.append(tokens)
        if all_ids[i] in posids:       
            labels_train.append(1)
        else:
            labels_train.append(0)

    for i in range(n_texts_train, n_texts_train+n_texts_dev):
        text = movie_reviews.raw(fileids=[all_ids[i]])
        tokens = [word.lower() for word in word_tokenize(text)]
        texts_dev.append(tokens)
        if all_ids[i] in posids:
            labels_dev.append(1)
        else:
            labels_dev.append(0)

    word2id=create_dictionary(texts_train, vocab_size)
    texts_train = [to_ids(s,word2id) for s in texts_train]
    texts_dev = [to_ids(s,word2id) for s in texts_dev]
    return (texts_train, labels_train, texts_dev, labels_dev, word2id)

def create_dictionary(texts, vocab_size):
    """
    Creates a dictionary that maps words to ids. More frequent words have lower ids.
    The dictionary contains at most the vocab_size-1 most frequent words (and a placeholder '<unk>' for unknown words).
    The placeholder has the id 0.
    """
    counter = collections.Counter()
    for tokens in texts:
        counter.update(tokens)
    vocab = [w for w,c in counter.most_common(vocab_size-1)]
    word_to_id = {w:(i+1) for i,w in enumerate(vocab)} 
    word_to_id[UNKNOWN_TOKEN] = 0 
    return word_to_id 

def to_ids(words, dictionary):
    """
    Takes a list of words and converts them to ids using the word2id dictionary.
    """
    ids=[]
    for word in words:
        ids.append(dictionary.get(word, dictionary[UNKNOWN_TOKEN]))
    return ids
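
To make the mapping concrete, here is a small sketch that runs the two helpers on a toy corpus. The functions are repeated (in slightly condensed form) so the snippet is self-contained; the names `toy_texts`, `toy_word2id` and `toy_ids` are made up for this illustration and are not part of the exercise.

```python
import collections

UNKNOWN_TOKEN = "<unk>"  # same placeholder string as used in the notebook

def create_dictionary(texts, vocab_size):
    # Count all tokens and keep the vocab_size-1 most frequent ones.
    counter = collections.Counter()
    for tokens in texts:
        counter.update(tokens)
    vocab = [w for w, c in counter.most_common(vocab_size - 1)]
    word_to_id = {w: (i + 1) for i, w in enumerate(vocab)}
    word_to_id[UNKNOWN_TOKEN] = 0  # id 0 is reserved for unknown words
    return word_to_id

def to_ids(words, dictionary):
    # Words that are not in the dictionary fall back to the '<unk>' id.
    return [dictionary.get(w, dictionary[UNKNOWN_TOKEN]) for w in words]

# Toy corpus (illustration only): 'good' occurs 3x, 'movie' 2x, 'bad' 1x.
toy_texts = [["good", "movie", "good"], ["bad", "movie", "good"]]
toy_word2id = create_dictionary(toy_texts, vocab_size=3)
toy_ids = to_ids(["good", "bad", "movie"], toy_word2id)

print(toy_word2id)  # {'good': 1, 'movie': 2, '<unk>': 0}
print(toy_ids)      # [1, 0, 2]
```

With `vocab_size=3` only the two most frequent words get their own id, so the rarer word 'bad' is mapped to the placeholder id 0.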


We define a few constants, which should be familiar from the last exercise, and fetch the data. You don't have to remove the download calls: nltk checks automatically whether the data has already been downloaded.

In [None]:
VOCAB_SIZE = 10000
MAX_LEN = 100
BATCH_SIZE = 32
EMBEDDING_SIZE = 20
HIDDEN_SIZE = 10
EPOCHS = 10
UNKNOWN_TOKEN = "<unk>"

nltk.download('movie_reviews')
nltk.download('punkt')
x_train, y_train, x_dev, y_dev, word2id = nltk_data(vocab_size=VOCAB_SIZE)
x_train = sequence.pad_sequences(x_train, maxlen=MAX_LEN)
x_dev = sequence.pad_sequences(x_dev, maxlen=MAX_LEN)
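
As a rough illustration of what the padding step does: with its default arguments, `pad_sequences` pads at the front with 0 and, for sequences longer than `maxlen`, keeps only the last `maxlen` entries. The helper `pad_pre` below is a hypothetical pure-Python stand-in for a single sequence, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical stand-in mimicking the default pad_sequences behaviour
    (padding='pre', truncating='pre') for one sequence. Assumes maxlen > 0."""
    truncated = seq[-maxlen:]  # keep at most the last maxlen ids
    return [value] * (maxlen - len(truncated)) + truncated

short_seq = pad_pre([5, 8, 2], maxlen=5)
long_seq = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_seq)  # [0, 0, 5, 8, 2]
print(long_seq)   # [3, 4, 5, 6, 7]
```

Note that the padding value 0 coincides with the id of `<unk>` in this notebook, so the models cannot distinguish padding from unknown words.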

The first of the three models that we will define is a bidirectional LSTM. Unlike the last exercise, where you needed an output at each timestep, this time you only need the last output.
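
The difference between "an output at each timestep" and "only the last output" can be sketched with a toy recurrence. This is not an LSTM: `toy_rnn` and its weights `w` and `u` are made up for illustration. In Keras, recurrent layers return only the last output by default (`return_sequences=False`).

```python
import math

def toy_rnn(inputs, w=0.5, u=0.8):
    """Toy scalar recurrence h_t = tanh(w * x_t + u * h_{t-1}); illustration only."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w * x + u * h)
        states.append(h)  # one output per timestep
    return states

all_outputs = toy_rnn([1.0, -2.0, 0.5])  # what the last exercise needed
last_output = all_outputs[-1]            # what a sentence classifier needs

print(len(all_outputs))  # 3
```

For classification, the final state serves as a summary of the whole input sequence, so a single vector per review is passed on to the following layers.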

**TODO: Build a model with a bidirectional LSTM, using HIDDEN_SIZE units for each direction. After the LSTM, insert one additional dense layer (HIDDEN_SIZE units and tanh non-linearity) before the label is predicted by the final layer. (1 p.)**

In [None]:
lstm_model = Sequential()
lstm_model.add(Embedding(VOCAB_SIZE, EMBEDDING_SIZE))
lstm_model.add() # TODO
lstm_model.add() # TODO
lstm_model.add() # TODO

Train and evaluate the model.

In [None]:
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(x_dev, y_dev))
score, acc = lstm_model.evaluate(x_dev, y_dev)

In [None]:
print("LSTM Accuracy: ",acc)

**TODO: Solve the same problem using a CNN with pooling over three-grams, with 2\*HIDDEN_SIZE filters and the tanh non-linearity. After CNN+pooling, insert, as before, one additional dense layer (HIDDEN_SIZE units and tanh non-linearity) before the label is predicted by the final layer. (3 p.)**

In [None]:
cnn_model = Sequential() 
cnn_model.add(Embedding(VOCAB_SIZE, EMBEDDING_SIZE))
cnn_model.add() # TODO
cnn_model.add() # TODO
cnn_model.add() # TODO
cnn_model.add() # TODO

Train and evaluate

In [None]:
cnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
cnn_model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_data=(x_dev, y_dev))
score, acc = cnn_model.evaluate(x_dev, y_dev)

In [None]:
print("CNN Accuracy: ", acc)

**TODO: How many parameters in total does the entire CNN model (all layers, including the embedding) have to learn? Show your calculation. (1.5 p.)**

**TODO: Use the Keras functional API to merge (combine) the LSTM sentence representation with the pooled CNN representation. Use vector addition instead of concatenation for merging. The same embedding should be learned and used as input to both the LSTM and the CNN. As before, insert one additional dense layer (HIDDEN_SIZE units and tanh non-linearity) before the label is predicted by the final layer. Train and evaluate as before. (5.5 p.)**

Functional API guide: https://keras.io/getting-started/functional-api-guide/

In [None]:
print("Composition Accuracy: ", acc)