# Sentiment Analysis with an LSTM
In this notebook, we will train a binary sentiment classifier that is based on an RNN. To this end, will follow these steps:

* Preprocess the data: tokenization, stopword removal, lemmatization
* Train a word embedding model for reuse in the LSTM
* Bring data into format usable for LSTM
* Build the RNN architecture
* Training the model
* Testing the model

Finally, there will be a little exercise to make you more comfortable dealing with variable length outputs, which is essential to be able to implement more advanced models later in the course.

## Preprocessing

Deep learning models are known to be able to require little feature engineering, and this is also true for SOTA models in NLP. However, sometimes, especially on low-resource tasks, it may be better to do some basic preprocessing to the text instead of passing the raw stream of text. By doing so, we hard-code linguistic knowledge that makes it easier for the neural network to train.

But first, lets load the data (a small sample so things don't take forever).

In [1]:
import warnings
warnings.filterwarnings("ignore")

def load_data(num_data=25000):
    reviewsFile = open('../data/reviews.txt','r')
    reviews = list(map(lambda x:x[:-1],reviewsFile.readlines()))
    reviewsFile.close()

    labelsFile = open('../data/labels.txt','r')
    labels = list(map(lambda x:x[:-1],labelsFile.readlines()))
    labelsFile.close()
    
    reviews, labels = reviews[:num_data], labels[:num_data]
    
    return reviews,labels

reviews,labels = load_data(num_data=25000)

### Tokenization
In theory, we could feed the neural network model the input text one character at a time, and there are in fact models that do so and work well.
However, this approach ignores any prior linguistic knowledge, even the knowledge of what constitutes a word. "Tokenization" means to split up the text into tokens, i.e., the smallest units to be considered by subsequent steps.

Here, we use regular expressions to identify words that may or may not contain a "'", e.g., "didn't".

In [2]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("\w+\'?\w+|\w+")

In [3]:
def make_token(review):
    return tokenizer.tokenize(str(review))

### Stop word removal
For some applications, certain words are not very useful for the task at hand because they provide no or only little indication of the task's semantics. These words are called stopwords. For (topical) text classification, these often include common words such as articles, conjunctions, and pronouns.

In [4]:
from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

However, for sentiment classification many of these words can be useful because they can have a large impact on the actual sentiment expressed in the sentence (e.g., not). Hence, we don't want to exclude them.

In [5]:
exceptionStopWords = {
    'again',
    'against',
    'ain',
    'almost',
    'among',
    'amongst',
    'amount',
    'anyhow',
    'anyway',
    'aren',
    "aren't",
    'below',
    'bottom',
    'but',
    'cannot',
    'couldn',
    "couldn't",
    'didn',
    "didn't",
    'doesn',
    "doesn't",
    'don',
    "don't",
    'done',
    'down',
    'except',
    'few',
    'hadn',
    "hadn't",
    'hasn',
    "hasn't",
    'haven',
    "haven't",
    'however',
    'isn',
    "isn't",
    'least',
    'mightn',
    "mightn't",
    'move',
    'much',
    'must',
    'mustn',
    "mustn't",
    'needn',
    "needn't",
    'neither',
    'never',
    'nevertheless',
    'no',
    'nobody',
    'none',
    'noone',
    'nor',
    'not',
    'nothing',
    'should',
    "should've",
    'shouldn',
    "shouldn't",
    'too',
    'top',
    'up',
    'very'
    'wasn',
    "wasn't",
    'well',
    'weren',
    "weren't",
    'won',
    "won't",
    'wouldn',
    "wouldn't",
}

In [6]:
stop_words = set(stop_words).union(STOP_WORDS)

final_stop_words = stop_words-exceptionStopWords

def remove_stopwords(review):
    return [token for token in review if token not in final_stop_words]

### Lemmatization
Finally, lemmatization is a technique to reduce the number of words considered by the model by reducing different inflections of a word to the root. The 'spacy' library provides convenient functionality for doing this.

In [7]:
import spacy
nlp = spacy.load("en",disable=['parser', 'tagger', 'ner'])

In [8]:
def lemmatization(review):
    lemma_result = []
    
    for words in review:
        doc = nlp(words)
        for token in doc:
            lemma_result.append(token.lemma_)
    return lemma_result

Finally, we can preprocess all reviews as a pipeline of tokenization, stopword removal, and lemmatization.

In [9]:
def pipeline(review):
    review = make_token(review)
    review = remove_stopwords(review)
    return lemmatization(review)

In [10]:
%%time
reviews = list(map(lambda review: pipeline(review),reviews))

CPU times: user 4min, sys: 222 ms, total: 4min
Wall time: 4min


In [11]:
reviews[:2]

[['bromwell',
  'high',
  'cartoon',
  'comedy',
  'run',
  'time',
  'program',
  'school',
  'life',
  'teacher',
  'year',
  'teach',
  'profession',
  'lead',
  'believe',
  'bromwell',
  'high',
  'satire',
  'much',
  'close',
  'reality',
  'teacher',
  'scramble',
  'survive',
  'financially',
  'insightful',
  'student',
  'right',
  'pathetic',
  'teacher',
  'pomp',
  'pettiness',
  'situation',
  'remind',
  'school',
  'know',
  'student',
  'see',
  'episode',
  'student',
  'repeatedly',
  'try',
  'burn',
  'down',
  'school',
  'immediately',
  'recall',
  'high',
  'classic',
  'line',
  'inspector',
  'sack',
  'teacher',
  'student',
  'welcome',
  'bromwell',
  'high',
  'expect',
  'adult',
  'age',
  'think',
  'bromwell',
  'high',
  'far',
  'fetch',
  'pity',
  'isn'],
 ['story',
  'man',
  'unnatural',
  'feeling',
  'pig',
  'start',
  'open',
  'scene',
  'terrific',
  'example',
  'absurd',
  'comedy',
  'formal',
  'orchestra',
  'audience',
  'turn',
  '

## Training a word2vec model
As you very well know, the neural networks we consider should not be fed with high-dimensional one-hot-vectors directly. Instead, it is better to map them into low-dimensional space first (word embeddings). It is possible to just randomly initialize word embedding tables and train them from scratch jointly with the sentiment classification task. However, it often better to initialize the word embedding tables with pretrained word embeddings and finetune them on the task afterwards. It is common to use freely available word embeddings pretrained on large corpora, but here we train one from scratch.

With the gensim library, this is as simple as calling the following:

In [12]:
from gensim.models import Word2Vec

embedding_dimension = 100
model = Word2Vec(reviews,size=embedding_dimension, window=3, min_count=3, workers=4)

In [13]:
word_vectors = model.wv
del model

len(word_vectors.vocab)

28163

When inspecting the nearest neighbors of some of the words important to sentiment analysis, we can see that the word embeddings already capture some useful semantics

In [14]:
word_vectors.similar_by_word(word="good", topn=5)

[('decent', 0.7292054891586304),
 ('alright', 0.6922928094863892),
 ('okay', 0.6738833785057068),
 ('nice', 0.6543942093849182),
 ('great', 0.6427708864212036)]

In [15]:
word_vectors.similar_by_word(word="bad", topn=5)

[('horrible', 0.7072439193725586),
 ('suck', 0.7059657573699951),
 ('terrible', 0.6890344023704529),
 ('awful', 0.6877450942993164),
 ('lame', 0.6699240803718567)]

In [16]:
word_vectors.most_similar(positive="bad",topn=4)

[('horrible', 0.7072439193725586),
 ('suck', 0.7059657573699951),
 ('terrible', 0.6890344023704529),
 ('awful', 0.6877450942993164)]

In [17]:
word_vectors.similarity("good","bad")

0.5836363

In [18]:
word_vectors.similarity("good","be")

0.328366

In [19]:
word_vectors.similar_by_word(word="school", topn=5)

[('college', 0.7752484679222107),
 ('class', 0.7588450908660889),
 ('schooler', 0.7476534843444824),
 ('bromwell', 0.7275527715682983),
 ('bidder', 0.706292450428009)]

In [20]:
word_vectors.similar_by_word(word="comedy", topn=5)

[('slapstick', 0.6493667960166931),
 ('farce', 0.6451849937438965),
 ('humor', 0.6442668437957764),
 ('satire', 0.6236644983291626),
 ('humour', 0.6191367506980896)]

In [21]:
word_vectors.similar_by_word(word="action", topn=5)

[('suspense', 0.6244109869003296),
 ('thrill', 0.6063871383666992),
 ('gory', 0.597480058670044),
 ('courtroom', 0.5860505104064941),
 ('fantasy', 0.5682624578475952)]

In [22]:
word_vectors.similar_by_word(word="sad", topn=5)

[('depress', 0.7488235831260681),
 ('cry', 0.7024545669555664),
 ('heartwarming', 0.6907929182052612),
 ('genuinely', 0.679267168045044),
 ('honest', 0.6713963747024536)]

In [23]:
word_vectors.most_similar(negative=["bad"],positive=["decent"],topn=5)

[('solid', 0.3986845016479492),
 ('fine', 0.3981825113296509),
 ('splendid', 0.360782265663147),
 ('incredible', 0.3579597771167755),
 ('fantastic', 0.35041236877441406)]

In [24]:
padding_value = len(word_vectors.index2word)

In [25]:
padding_value

28163

## Preparing data for neural networks
As noted before, we want to reuse the trained embeddings in our network later, so we reuse the word->index mapping from the word2vec model and turn them into pytorch tensors.

In [26]:
# fix seeds for pytorch for reproducibility
import torch
import torch.nn as nn
import torch.optim as optim
SEED = 2222

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [27]:
# a function to map all words in a review into a tensor of indices.
def word2idx(embedding_model,review):
    index_review = []
    for word in review:
        try:
            index_review.append(embedding_model.vocab[word].index)
        except: 
             pass
    return torch.tensor(index_review)

In [28]:
index_review = list(map(lambda review: word2idx(word_vectors,review),reviews))

As usual, lets split the data into training, validation, and test sets.

In [29]:
from sklearn.model_selection import train_test_split
labels = [0 if label == 'negative' else 1 for label in labels ]
X_train, X_test, y_train, y_test = train_test_split(index_review, labels, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)

print(len(X_train),len(X_test),len(X_val))
print(len(y_train),len(y_test),len(y_val))

16000 5000 4000
16000 5000 4000


### Dealing with batches
Text data is different insofar as that it is variable length. This makes processing data in batches relatively complicated, however, parallelization via batching is crucial to train the neural networks in reasonable time. To cope with this problem, we need to pad all sentences in a batch to equal length. Below, we pad all sequences to the length of the longest sequence, however, sometimes it is reasonable to put a hard cap on the sequence length to speed up training.

In [30]:
batch_size = 128
import numpy as np
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def iterator_func(X,y):
    size = len(X)
    permutation = np.random.permutation(size)
    iterator = []
    for i in range(0,size, batch_size):
        indices = permutation[i:i+batch_size]
        batch = {}
        batch["text"] = [X[i] for i in indices]
        batch["label"] = [y[i] for i in indices]
        
        batch["text"],batch["label"] = zip(*sorted(zip(batch["text"],batch["label"]),key=lambda x: len(x[0]),reverse=True))
        batch["length"] = [len(review) for review in batch["text"]]
        batch["length"] = torch.IntTensor(batch["length"])
        batch["text"] = torch.nn.utils.rnn.pad_sequence(batch["text"],batch_first=True).t() # pads all sequences max length
        batch["label"] = torch.Tensor(batch["label"])
        
        batch["label"]  = batch["label"].to(device)
        batch["length"] = batch["length"].to(device) 
        batch["text"]   = batch["text"].to(device) 
        
        iterator.append(batch)
        
    return iterator

train_iterator = iterator_func(X_train,y_train)
valid_iterator = iterator_func(X_val,y_val)
test_iterator = iterator_func(X_test,y_test)

## Building the RNN
For the implementation of the RNN, we provide two implementations. The first one is an explicit implementation of the procedure as it was introduced to you in the lecture: At each step, the RNN (or, LSTM in this particular case) takes the current hidden state and the element at the current timestep as input and produces an output as well as a new hidden state. This is implemented as a for loop. Because we are dealing with variable length sequences which have been padded to fit a batch, we cannot just take the output of the last hidden state as the representation of the sequence, because it will incorrectly consider the pad tokens as a valid input. Therefore, we need to extract the output at the last _valid_ output, which will be used to represent the text. This representation is input to a logistic regression classifier to get the final binary output.

In [31]:
class RNNExplicit(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, embedding_weights):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding.from_pretrained(embedding_weights)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x, text_lengths):
        #x [sent length , batch size]
        embedded = self.embedding(x) #[sentect len,batch size,embedding dim]
        max_len = embedded.size(0)
        batch_size = embedded.size(1)
        hidden_states = (torch.zeros((1, batch_size, self.hidden_dim)), torch.zeros((1, batch_size, self.hidden_dim)))
        output = []
        for t in range(max_len):
            embedded_t = embedded[t].unsqueeze(0)
            out, (hidden, cell) = self.rnn(embedded_t, hidden_states)#output[sent length,batch size,hiddendin*num of directions],[numberlayers*num of dir,batch size,hid dim]
            hidden_states = (hidden, cell)
            output.append(out)
        #[f0,b0,f1,b1,.......fn,bn]
        output = torch.cat(output, dim = 0)
        
        # get the last output
        text_lengths = torch.tensor(text_lengths, requires_grad=False).long()
        masks = (text_lengths - 1).unsqueeze(0).unsqueeze(2)
        masks = masks.expand(max_len, -1, self.hidden_dim)
        output = output.gather(0, masks)[0, :, :]
        
        return self.fc(output)

However, the above code is both slow and inconvenient and should thus not be used in practice. PyTorch provides convenient functionality that makes dealing with variable length sequences easier and faster. The below code makes use of the PackedSequence class, which can be passed directly to RNNs in one go. At the end, the 'hidden' variable will automatically contain the outputs at the last _valid_ timestep for each element of the batch.

In [32]:
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, embedding_weights):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding.from_pretrained(embedding_weights)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x, text_lengths):
        #x [sent length , batch size]
        embedded = self.embedding(x) #[sentect len,batch size,embedding dim]
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
        packed_output, (hidden, cell) = self.rnn(packed_embedded)#output[sent length,batch size,hiddendin*num of directions],[numberlayers*num of dir,batch size,hid dim]

        return self.fc(hidden.squeeze(0))

## Training
Lastly, it remains to train the model we just created. Nothing new happens here - we just need to configure the model and training parameters, set up the training loop, and run the whole thing.

In [33]:
INPUT_DIM = padding_value
EMBEDDING_DIM = 100
HIDDEN_DIM = 100
OUTPUT_DIM = 1
N_EPOCHS = 5
embedding_weights = torch.Tensor(word_vectors.vectors)

#model = RNNExplicit(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, embedding_weights)
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, embedding_weights)
optimizer = optim.Adam(model.parameters(), lr=0.001)
model

RNN(
  (embedding): Embedding(28163, 100)
  (rnn): LSTM(100, 100)
  (fc): Linear(in_features=100, out_features=1, bias=True)
)

In [34]:
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    acc = correct.sum()/len(correct)
    return acc

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch["text"],batch["length"]).squeeze(1)
        loss = criterion(predictions, batch["label"])
        acc = binary_accuracy(predictions, batch["label"])
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch["text"],batch["length"]).squeeze(1)
            loss = criterion(predictions, batch["label"])
            acc = binary_accuracy(predictions, batch["label"])

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


In [35]:

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')

| Epoch: 01 | Train Loss: 0.507 | Train Acc: 75.31% | Val. Loss: 0.418 | Val. Acc: 81.62% |
| Epoch: 02 | Train Loss: 0.422 | Train Acc: 81.47% | Val. Loss: 0.414 | Val. Acc: 83.13% |
| Epoch: 03 | Train Loss: 0.379 | Train Acc: 84.12% | Val. Loss: 0.386 | Val. Acc: 83.11% |
| Epoch: 04 | Train Loss: 0.357 | Train Acc: 84.72% | Val. Loss: 0.367 | Val. Acc: 84.30% |
| Epoch: 05 | Train Loss: 0.334 | Train Acc: 86.08% | Val. Loss: 0.341 | Val. Acc: 85.69% |


## Testing the model

In [36]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')

| Test Loss: 0.348 | Test Acc: 85.20% |


In [37]:
def predict_sentiment(sentence):
    tokenized = pipeline(sentence)
    indexed = word2idx(word_vectors,tokenized)
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    prediction = torch.sigmoid(model(tensor,torch.LongTensor([len(indexed)]).to(device)))
    return prediction.item()

In [38]:
predict_sentiment("this is an awesome movie")

0.6866046786308289

In [39]:
predict_sentiment("this is not an action movie, is a  very good movie")

0.7844770550727844

In [40]:
predict_sentiment("Despite the terrible title, the bad credits section in the end,"
                  
                  " and the low-quality sounds here"\
                  " and there, this is a movie of extraordinary quality.")

0.03949751332402229

In [41]:
predict_sentiment("this is an awful movie")

0.11702029407024384

In [42]:
predict_sentiment("this is a bad movie")

0.19338952004909515

## Exercise
In the model above, we used the hidden state at the last timestep to represent the whole sequence. But often this is not the best way to aggregate the outputs of an RNN: Average or max pooling often works better. In the following, we ask you to implement and test mean- and max pooling.

In [1]:
class OtherRNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, embedding_weights, aggregation = "max"):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding.from_pretrained(embedding_weights)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.aggregation = aggregation
        
    def forward(self, x, text_lengths):
        #x [sent length , batch size]
        embedded = self.embedding(x) #[sentect len,batch size,embedding dim]
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
        packed_output, (hidden, cell) = self.rnn(packed_embedded)#output[sent length,batch size,hiddendin],[1,batch size,hid dim]
        outputs, output_lens = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=False)

        # TODO: add your code here!
        if self.aggregation == "max":
            #sequence_embedding = ...
            pass
        elif self.aggregation == "mean":
            #sequence_embedding = ...
            pass
        
        return self.fc(sequence_embedding)

NameError: name 'nn' is not defined