# **NLP - Sentiment Analysis of Tweets using biLSTM**
A deep learning model built using PyTorch and TorchText to classify sentiments of tweets using a subset of the <a href="https://www.kaggle.com/kazanova/sentiment140">sentiment140 dataset</a>.

1. [Dataset Preparation](#section1)
2. [Preprocessing](#section2)
3. [Model](#section3)
4. [Training](#section4)
5. [Prediction](#section5)

In [1]:
import pandas as pd
import numpy as np
import time
import spacy
import random
from pathlib import Path
import torch 
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchtext import data 
import torchtext
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [2]:
# Setting device on GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

torch.backends.cudnn.deterministic = True

Using device: cuda

Tesla K80
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB


<a id='section1'></a>
# **1. Dataset Preparation**
The first column contains the sentiments and the last column contains the tweets.

In [3]:
# Read the data into a DataFrame
df = pd.read_csv("training.1600000.processed.noemoticon.csv", engine="python", header=None)

df.head(5)

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


The dataset contains two sentiment labels (0 - negative, 4 - positive).

In [4]:
# Count the number of tweets per sentiment
df[0].value_counts()

0    800000
4    800000
Name: 0, dtype: int64

In [5]:
# Model the sentiments as binary (0 - negative, 1 - positive)
df[0]=df[0].replace(to_replace=4,value=1)
df[0].value_counts()

0    800000
1    800000
Name: 0, dtype: int64

In [6]:
# Save a random sample of 100,000 tweets as a smaller dataset
df.sample(100000).to_csv("sentiment140-small.csv", header=None, index=None)

<a id='section2'></a>
# **2. Preprocessing**

In [7]:
# Declare fields for tweets and labels
# include_lengths tells the RNN how long the actual sequences are
TEXT = torchtext.legacy.data.Field(tokenize='spacy', lower=True, include_lengths= True)
LABEL = torchtext.legacy.data.LabelField(dtype=torch.float)

# Map each of the six CSV columns to a field (None means the column is ignored)
fields = [('label', LABEL), ('id',None),('date',None),('query',None),
      ('name',None), ('text', TEXT)]

# Apply field definition to create torch dataset
dataset = torchtext.legacy.data.TabularDataset(
        path="sentiment140-small.csv",
        format="CSV",
        fields=fields,
        skip_header=False)

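# Split data into train, validation, and test sets
# split_ratio is given as [train, test, valid], but the splits are returned as (train, valid, test)
train_data, valid_data, test_data = dataset.split(split_ratio=[0.8,0.1,0.1])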

print("Number of train data: {}".format(len(train_data)))
print("Number of test data: {}".format(len(test_data)))
print("Number of validation data: {}".format(len(valid_data)))



Number of train data: 80000
Number of test data: 10000
Number of validation data: 10000


In [8]:
# An example from the training set
print(vars(train_data.examples[0]))

{'label': '0', 'text': ['but', 'i', "'m", 'missing', 'everybody', 'that', 'is', 'doing', 'it']}


### **Build Vocabulary**
Build the vocabulary from the training set and attach pre-trained GloVe embeddings to it.
The `glove.6B.100d` vectors were trained on 6 billion tokens and are 100-dimensional.
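
As a rough illustration of what building a vocabulary amounts to, the sketch below maps tokens to integer ids in plain Python, most frequent first. It is not torchtext's implementation: the helper `build_vocab_sketch` and the toy tweets are made up for this example, and the special tokens `<unk>` and `<pad>` are assumed to come first, as in the vocabulary used later in the notebook.

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, max_size=None, specials=("<unk>", "<pad>")):
    """Map tokens to integer ids, most frequent first (hypothetical helper)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort alphabetically first so ties in frequency are broken deterministically
    ordered = sorted(counts.items(), key=lambda kv: kv[0])
    ordered.sort(key=lambda kv: kv[1], reverse=True)
    if max_size is not None:
        # The cap applies to regular tokens only; specials are added on top
        ordered = ordered[:max_size]
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

# Toy data, not taken from the dataset
toy_tweets = [["i", "love", "this"], ["i", "hate", "this"], ["i", "am", "tired"]]
stoi, itos = build_vocab_sketch(toy_tweets, max_size=4)
print(itos)
print(stoi["i"])
```

In the notebook itself the size cap is not binding: the embedding layer printed further down has 87790 rows, well below `MAX_VOCAB_SIZE`.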

In [9]:
MAX_VOCAB_SIZE = 287799

# unk_init initializes tokens that are in the vocab but missing from GloVe by sampling from a Gaussian distribution
TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_)

# Build the label vocab from the training set - maps label strings to integers
LABEL.build_vocab(train_data)

# Most frequent tokens
TEXT.vocab.freqs.most_common(10)

[('i', 49892),
 ('!', 44950),
 ('.', 40330),
 (' ', 29336),
 ('to', 28454),
 ('the', 25999),
 (',', 24186),
 ('a', 19128),
 ('my', 15704),
 ('and', 15301)]

### **Iterator**
Pad the tweets in each batch to the same length so they can be processed together.
The BucketIterator groups tweets of similar lengths together to minimize the padding in each batch.
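
To see why grouping by length helps, the sketch below counts how many pad tokens are needed when every batch is padded to its longest sequence, first for an arbitrary order and then for a length-sorted order. The lengths are randomly generated stand-ins, not measured from the dataset, and full sorting is only an idealized version of what a bucketing iterator does.

```python
import random

def padding_needed(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

rng = random.Random(0)
# Hypothetical tweet lengths in tokens
lengths = [rng.randint(1, 40) for _ in range(1024)]

shuffled_padding = padding_needed(lengths, batch_size=128)
bucketed_padding = padding_needed(sorted(lengths), batch_size=128)
print("pad tokens, arbitrary order:", shuffled_padding)
print("pad tokens, sorted by length:", bucketed_padding)
```

Fewer pad tokens means less wasted computation per batch, which is the motivation for passing `sort_key` to the iterator.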


In [10]:
BATCH_SIZE = 128

# sort_within_batch sorts all the tensors within a batch by their lengths
train_iterator, valid_iterator, test_iterator = torchtext.legacy.data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.text),
    sort_within_batch = True)

<a id='section3'></a>
# **3. Model**

In [11]:
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        """
        Define the layers of the module.

        vocab_size - vocabulary size
        embedding_dim - size of the dense word vectors
        hidden_dim - size of the hidden states
        output_dim - number of classes
        n_layers - number of stacked LSTM layers
        bidirectional - boolean - use both directions of LSTM
        dropout - dropout probability
        pad_idx - index of the pad token in the vocabulary
        """
        
        super().__init__()

        # 1. Feed the tweets in the embedding layer
        # padding_idx is set so the embedding for the <pad> token is not learned - it is irrelevant to determining sentiment
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)

        # 2. LSTM layer
        # returns the output and a tuple of the final hidden state and final cell state
        self.encoder = nn.LSTM(embedding_dim, 
                               hidden_dim, 
                               num_layers=n_layers,
                               bidirectional=bidirectional,
                               dropout=dropout)
        
        # 3. Fully-connected layer
        # Final hidden state has both a forward and a backward component concatenated together
        # The size of the input to the nn.Linear layer is twice that of the hidden dimension size
        self.predictor = nn.Linear(hidden_dim*2, output_dim)

        # Initialize dropout layer for regularization
        self.dropout = nn.Dropout(dropout)
      
    def forward(self, text, text_lengths):
        """
        The forward method is called when data is fed into the model.

        text - [tweet length, batch size]
        text_lengths - lengths of tweet
        """

        # embedded = [sentence len, batch size, emb dim]
        embedded = self.dropout(self.embedding(text))    

        # Pack the embeddings so the RNN only processes the non-padded elements
        # This speeds up computation
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())

        # output of encoder
        packed_output, (hidden, cell) = self.encoder(packed_embedded)

        # unpack sequence - transform packed sequence to a tensor
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        # output = [sentence len, batch size, hid dim * num directions]
        # output over padding tokens are zero tensors
        
        # hidden = [num layers * num directions, batch size, hid dim]
        # cell = [num layers * num directions, batch size, hid dim]
        
        # Get the final layer forward and backward hidden states  
        # concat the final forward and backward hidden layers and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))

        # hidden = [batch size, hid dim * num directions]

        return self.predictor(hidden)


# class SentimentLSTM(nn.Module):
#     """
#     The RNN model that will be used to perform Sentiment analysis.
#     """

#     def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
#         """
#         Initialize the model by setting up the layers.
#         """
#         super().__init__()

#         self.output_size = output_size
#         self.n_layers = n_layers
#         self.hidden_dim = hidden_dim
        
#         # embedding and LSTM layers
#         self.embedding = nn.Embedding(vocab_size, embedding_dim)
#         self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
#                             dropout=drop_prob, batch_first=True)
        
#         # dropout layer
#         self.dropout = nn.Dropout(0.3)
        
#         # linear and sigmoid layers
#         self.fc = nn.Linear(hidden_dim, output_size)
#         self.sig = nn.Sigmoid()
        

#     def forward(self, x, hidden):
#         """
#         Perform a forward pass of our model on some input and hidden state.
#         """
#         batch_size = x.size(0)

#         # embeddings and lstm_out
#         embeds = self.embedding(x)
#         lstm_out, hidden = self.lstm(embeds, hidden)
    
#         # stack up lstm outputs
#         lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
#         # dropout and fully-connected layer
#         out = self.dropout(lstm_out)
#         out = self.fc(out)
#         # sigmoid function
#         sig_out = self.sig(out)
        
#         # reshape to be batch_size first
#         sig_out = sig_out.view(batch_size, -1)
#         sig_out = sig_out[:, -1] # get last batch of labels
        
#         # return last sigmoid output and hidden state
#         return sig_out, hidden
    
    
#     def init_hidden(self, batch_size):
#         ''' Initializes hidden state '''
#         # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
#         # initialized to zero, for hidden state and cell state of LSTM
#         weight = next(self.parameters()).data
        
#         if torch.cuda.is_available():
#             hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
#                   weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
#         else:
#             hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
#                       weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
#         return hidden
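
The indexing `hidden[-2]` and `hidden[-1]` in the active `forward` method relies on how the first axis of the final hidden state is laid out: layers in order, and within each layer the forward direction before the backward one. The sketch below only enumerates that layout in plain Python; `hidden_state_layout` is a made-up helper, not a PyTorch function.

```python
def hidden_state_layout(n_layers, bidirectional):
    """Label each slice along the first axis of an LSTM's final hidden state."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [(layer, direction) for layer in range(n_layers) for direction in directions]

layout = hidden_state_layout(n_layers=2, bidirectional=True)
for idx, (layer, direction) in enumerate(layout):
    print("index", idx, "-> layer", layer, direction)

# The two slices the model concatenates before the linear layer
print(layout[-2], layout[-1])
```

With `HIDDEN_DIM = 256` the concatenation of those two slices has 512 features, which matches the `in_features=512` of the `predictor` layer. Note that the indexing assumes `bidirectional=True`; with a unidirectional LSTM, `hidden[-2]` would refer to the layer below the top one instead.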

### **Create Model**

In [12]:
INPUT_DIM = len(TEXT.vocab)
# dim must be equal to the dim of pre-trained GloVe vectors
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
# 2 layers of biLSTM
N_LAYERS = 2
BIDIRECTIONAL = True
# Dropout probability
DROPOUT = 0.5
# Get pad token index from vocab
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

# Create an instance of LSTM class
model = SentimentLSTM(INPUT_DIM,
            EMBEDDING_DIM,
            HIDDEN_DIM,
            OUTPUT_DIM,
            N_LAYERS,
            BIDIRECTIONAL,
            DROPOUT,
            PAD_IDX)


# vocab_size = len(TEXT.vocab) # +1 for the 0 padding
# output_size = 1
# embedding_dim = 100
# hidden_dim = 256
# n_layers = 2
# model = SentimentLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(model)


SentimentLSTM(
  (embedding): Embedding(87790, 100, padding_idx=1)
  (encoder): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (predictor): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


In [13]:
# Sample from the training set
print(vars(train_iterator.dataset[0]))

{'label': '0', 'text': ['but', 'i', "'m", 'missing', 'everybody', 'that', 'is', 'doing', 'it']}


In [14]:
# Copy the pre-trained word embeddings into the embedding layer
pretrained_embeddings = TEXT.vocab.vectors

# [vocab size, embedding dim]
print(pretrained_embeddings.shape)

torch.Size([87790, 100])


In [15]:
# Replace the initial weights of the embedding layer with the pre-trained embeddings
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.4429,  0.9365, -0.0912,  ..., -1.2642,  0.1385,  0.2278],
        [-1.3049, -0.2526, -0.5582,  ..., -2.9292, -1.9503,  1.0903],
        [-0.0465,  0.6197,  0.5665,  ..., -0.3762, -0.0325,  0.8062],
        ...,
        [ 0.2626,  0.5619, -2.1544,  ..., -0.3947,  1.9642,  0.3741],
        [ 1.0540, -0.6501,  0.3274,  ...,  0.8386, -1.3984, -0.0883],
        [-0.0962, -0.4969,  0.4455,  ...,  0.2586,  0.6800, -1.2188]])

In [17]:
# Initialize <unk> and <pad> both to all zeros - irrelevant for sentiment analysis
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

# Setting row in the embedding weights matrix to zero using the token index
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0465,  0.6197,  0.5665,  ..., -0.3762, -0.0325,  0.8062],
        ...,
        [ 0.2626,  0.5619, -2.1544,  ..., -0.3947,  1.9642,  0.3741],
        [ 1.0540, -0.6501,  0.3274,  ...,  0.8386, -1.3984, -0.0883],
        [-0.0962, -0.4969,  0.4455,  ...,  0.2586,  0.6800, -1.2188]])


<a id='section4'></a>
# **4. Training**

In [27]:
# The Adam optimizer is created inside the training loop, once per learning rate
# optimizer = optim.Adam(model.parameters(), lr=2e-3)

# Loss function: binary cross entropy with logits
# It maps the raw predictions (logits) to a number between 0 and 1 using the sigmoid function,
# then uses that bounded scalar to calculate the loss using binary cross entropy
criterion = nn.BCEWithLogitsLoss()

# Use GPU
model = model.to(device)
criterion = criterion.to(device) 
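
For reference, the quantity this loss computes for a single example can be written out in plain Python. The sketch below compares the textbook form (sigmoid followed by binary cross entropy) with a common numerically stable rearrangement; it is meant to show the arithmetic, not to reproduce PyTorch's internal implementation, and the logit/target pairs are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Binary cross entropy computed from the sigmoid probability."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target):
    """Algebraically equivalent form that avoids log(0) for large |logit|."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Hypothetical (logit, target) pairs
for logit, target in [(0.0, 1.0), (2.0, 1.0), (-3.0, 0.0), (1.5, 0.0)]:
    print(logit, target,
          round(bce_naive(logit, target), 6),
          round(bce_with_logits(logit, target), 6))
```

Working directly on logits is why the model's `forward` method returns the output of the linear layer without applying a sigmoid.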

In [28]:
# Helper functions

def batch_accuracy(predictions, label):
    """
    Returns accuracy per batch.

    predictions - raw model outputs (logits)
    label - 0 or 1
    """

    # Apply the sigmoid function, then round to the closest integer (0 or 1)
    preds = torch.round(torch.sigmoid(predictions))
    # If prediction is equal to label
    correct = (preds == label).float()
    # Average correct predictions
    accuracy = correct.sum() / len(correct)

    return accuracy

def timer(start_time, end_time):
    """
    Returns the minutes and seconds.
    """

    time = end_time - start_time
    mins = int(time / 60)
    secs = int(time - (mins * 60))

    return mins, secs
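
The decision rule inside `batch_accuracy` can be illustrated without tensors. In the sketch below, which uses made-up logits and labels, thresholding the sigmoid at 0.5 gives the same result as checking the sign of the logit, since the sigmoid exceeds 0.5 exactly when its input is positive.

```python
import math

def accuracy_via_sigmoid(logits, labels):
    """Fraction of examples whose thresholded sigmoid matches the 0/1 label."""
    preds = [1 if 1.0 / (1.0 + math.exp(-z)) > 0.5 else 0 for z in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def accuracy_via_sign(logits, labels):
    """Same decision rule without the sigmoid: a positive logit means class 1."""
    preds = [1 if z > 0 else 0 for z in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical logits and labels for one small batch
logits = [2.1, -0.4, 0.3, -1.7]
labels = [1, 0, 0, 0]

print(accuracy_via_sigmoid(logits, labels))
print(accuracy_via_sign(logits, labels))
```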

In [29]:
def train(model, iterator, optimizer, criterion):
    """
    Function to train the model for one epoch and return the training loss and accuracy.

    iterator - train iterator
    """
    
    # Accumulated training loss
    training_loss = 0.0
    # Accumulated training accuracy
    training_acc = 0.0
    
    # Set model to training mode
    model.train()
    
    # For each batch in the training iterator
    for batch in iterator:
        
        # 1. Zero the gradients
        optimizer.zero_grad()
        
        # batch.text is a tuple (tensor, len of seq)
        text, text_lengths = batch.text
        
        # 2. Compute the predictions
        predictions = model(text, text_lengths).squeeze(1)
        
        # 3. Compute loss
        loss = criterion(predictions, batch.label)
        
        # Compute accuracy
        accuracy = batch_accuracy(predictions, batch.label)
        
        # 4. Use loss to compute gradients
        loss.backward()
        
        # 5. Use optimizer to take gradient step
        optimizer.step()
        
        training_loss += loss.item()
        training_acc += accuracy.item()
    
    # Return the loss and accuracy, averaged over the batches in the epoch
    # len of iterator = num of batches in the iterator
    return training_loss / len(iterator), training_acc / len(iterator)

def evaluate(model, iterator, criterion):
    """
    Function to evaluate the loss and accuracy of validation and test sets.

    iterator - validation or test iterator
    """
    
    # Accumulated evaluation loss
    eval_loss = 0.0
    # Accumulated evaluation accuracy
    eval_acc = 0.0
    
    # Set model to evaluation mode
    model.eval()
    
    # Don't calculate the gradients
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            accuracy = batch_accuracy(predictions, batch.label)

            eval_loss += loss.item()
            eval_acc += accuracy.item()
        
    return eval_loss / len(iterator), eval_acc / len(iterator)
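
Both functions average per-batch metrics over the number of batches. When the last batch is smaller than the others, this differs slightly from the accuracy over all examples, because the small batch counts as much as a full one. The sketch below illustrates the effect with made-up counts for 10,000 examples and a batch size of 128 (78 full batches plus one batch of 16); the function names are hypothetical.

```python
def mean_of_batch_means(batch_correct, batch_sizes):
    """Average the per-batch accuracies, as the functions above do."""
    return sum(c / n for c, n in zip(batch_correct, batch_sizes)) / len(batch_sizes)

def example_weighted_mean(batch_correct, batch_sizes):
    """Accuracy over all examples, regardless of how they were batched."""
    return sum(batch_correct) / sum(batch_sizes)

# 10,000 examples at batch size 128: 78 full batches and one batch of 16
batch_sizes = [128] * 78 + [16]
# Made-up numbers of correct predictions per batch
batch_correct = [77] * 78 + [4]

print(mean_of_batch_means(batch_correct, batch_sizes))
print(example_weighted_mean(batch_correct, batch_sizes))
```

The gap is small here, so the reported figures remain a reasonable approximation; weighting each batch by its size would remove it entirely.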

### **Train the model**

In [31]:
# Number of epochs
NUM_EPOCHS = 10

# Lowest validation loss
best_valid_loss = float('inf')

learning_rates = [1e-3, 5e-3, 1e-2, 5e-2]

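for lr in learning_rates:
    print(f'learning rate is {lr}')
    # Note: the model is not re-initialized here, so each learning rate continues
    # training from the weights left by the previous one
    optimizer = optim.Adam(model.parameters(), lr=lr)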

    for epoch in range(NUM_EPOCHS):

        start_time = time.time()

        # Train for one epoch and get the training loss and accuracy
        train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
        # Evaluate validation loss and accuracy
        valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

        end_time = time.time()

        mins, secs = timer(start_time, end_time)

        # At each epoch, if the validation loss is the best
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            # Save the parameters of the model
            torch.save(model.state_dict(), 'model-small.pt')

        print("Epoch {}:".format(epoch+1))
        print("\t Total Time: {}m {}s".format(mins, secs))
        print("\t Train Loss {} | Train Accuracy: {}%".format(round(train_loss, 2), round(train_acc*100, 2)))
        print("\t Validation Loss {} | Validation Accuracy: {}%".format(round(valid_loss, 2), round(valid_acc*100, 2)))

learning rate is 0.001
Epoch 1:
	 Total Time: 0m 26s
	 Train Loss 0.68 | Train Accuracy: 56.71%
	 Validation Loss 0.66 | Validation Accuracy: 59.96%
Epoch 2:
	 Total Time: 0m 25s
	 Train Loss 0.68 | Train Accuracy: 56.69%
	 Validation Loss 0.66 | Validation Accuracy: 59.41%
Epoch 3:
	 Total Time: 0m 25s
	 Train Loss 0.68 | Train Accuracy: 56.72%
	 Validation Loss 0.66 | Validation Accuracy: 60.56%
Epoch 4:
	 Total Time: 0m 25s
	 Train Loss 0.68 | Train Accuracy: 56.78%
	 Validation Loss 0.66 | Validation Accuracy: 61.07%
Epoch 5:
	 Total Time: 0m 25s
	 Train Loss 0.68 | Train Accuracy: 56.79%
	 Validation Loss 0.66 | Validation Accuracy: 60.29%
Epoch 6:
	 Total Time: 0m 25s
	 Train Loss 0.68 | Train Accuracy: 56.83%
	 Validation Loss 0.66 | Validation Accuracy: 59.27%
Epoch 7:
	 Total Time: 0m 25s
	 Train Loss 0.68 | Train Accuracy: 56.78%
	 Validation Loss 0.66 | Validation Accuracy: 60.84%
Epoch 8:
	 Total Time: 0m 25s
	 Train Loss 0.68 | Train Accuracy: 56.79%
	 Validation Loss 0.66

<a id='section5'></a>
# **5. Prediction**

In [32]:
# Load the model with the best validation loss
model.load_state_dict(torch.load('model-small.pt'))

# Evaluate test loss and accuracy
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print("Test Loss: {} | Test Acc: {}%".format(round(test_loss, 2), round(test_acc*100, 2)))

Test Loss: 0.66 | Test Acc: 59.23%


In [36]:
# nlp = spacy.load('en')
nlp = spacy.blank("en")


def predict(model, text, tokenized=True):
    """
    Given a tweet, predict the sentiment.

    text - a string or a list of tokens
    tokenized - True if text is a list of tokens, False if passing in a string
    """

    # Sets the model to evaluation mode
    model.eval()

    if not tokenized:
        # Tokenizes the sentence
        tokens = [token.text for token in nlp.tokenizer(text)]
    else:
        tokens = text

    # Index the tokens by converting to the integer representation from the vocabulary
    indexed_tokens = [TEXT.vocab.stoi[t] for t in tokens]
    # Get the length of the text
    length = [len(indexed_tokens)]
    # Convert the indices to a tensor
    tensor = torch.LongTensor(indexed_tokens).to(device)
    # Add a batch dimension by unsqueezing
    tensor = tensor.unsqueeze(1)
    # Converts the length into a tensor
    length_tensor = torch.LongTensor(length)
    # Convert prediction to be between 0 and 1 with the sigmoid function
    prediction = torch.sigmoid(model(tensor, length_tensor))

    # Return a single value from the prediction
    return prediction.item()
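
One detail worth keeping in mind when calling `predict` with a raw string: `TEXT` was declared with `lower=True`, so the vocabulary was built from lowercased tokens, whereas `predict` does not lowercase its input. The sketch below uses a toy vocabulary (not the real `TEXT.vocab`) in which unknown tokens fall back to an assumed `<unk>` index of 0, to show how capitalized tokens would all end up as `<unk>` unless they are lowercased first. The helper `to_indices` is hypothetical.

```python
from collections import defaultdict

UNK_IDX = 0  # assumed index of the <unk> token in this toy vocabulary

# Toy vocabulary; any token that is not listed maps to UNK_IDX
toy_stoi = defaultdict(lambda: UNK_IDX,
                       {"<unk>": 0, "<pad>": 1, "i": 2, "love": 3, "mondays": 4})

def to_indices(tokens, stoi, lower=True):
    """Convert tokens to vocabulary indices, optionally lowercasing first."""
    if lower:
        tokens = [t.lower() for t in tokens]
    return [stoi[t] for t in tokens]

tokens = ["I", "LOVE", "Mondays"]
print(to_indices(tokens, toy_stoi, lower=False))
print(to_indices(tokens, toy_stoi, lower=True))
```

Tokens taken from `test_data` are already lowercased, so the examples in this notebook are not affected; the issue would only arise for new, untokenized text.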

In [41]:
# Single example prediction from the test set
print("Tweet: {}".format(TreebankWordDetokenizer().detokenize(test_data[100].text)))

print("Prediction: {}".format(round(predict(model, test_data[100].text), 2)))

# Use the same index as the tweet and prediction above
print("True Label: {}".format(test_data[100].label))

Tweet: all the people i really want to have lunch with live in other states . or countries . big
Prediction: 0.59
True Label: 0


In [38]:
# Example prediction from the test set

# List to append data to
d = []


for idx in range(10):

    # Detokenize the tweets from the test set
    tweet = TreebankWordDetokenizer().detokenize(test_data[idx].text)
                                                 
    # Append tweet, prediction, and true label
    d.append({'Tweet': tweet, 'Prediction': predict(model, test_data[idx].text), 'True Label': test_data[idx].label})

# Convert list to dataframe
pd.DataFrame(d)

| | Tweet | Prediction | True Label |
|---|---|---|---|
| 0 | just got back from the city . broke as now s... | 0.565977 | 0 |
| 1 | @tinyalice really? that sucks. | 0.595713 | 0 |
| 2 | what a day! everyone swing by neocon booth #7 ... | 0.351752 | 1 |
| 3 | i wish i was in sheffield | 0.596101 | 0 |
| 4 | too hot and i m stuck inside | 0.579621 | 0 |
| 5 | @akitty13 lol i stand corrected (tweet tweet) | 0.48437 | 1 |
| 6 | woohooo .... 2 - 0 lakers!!!! our time!!! | 0.410894 | 1 |
| 7 | in the car boreeedddddd, rain rain qo away | 0.529292 | 0 |
| 8 | lego harry potter game coming out next year ... | 0.60679 | 1 |
| 9 | wants this before coming to nuq's campus htt... | 0.617302 | 0 |
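
When the ten predictions above are thresholded at 0.5, most of them disagree with the raw labels. One possible explanation, not verified in this notebook, is that the label vocabulary assigns indices by frequency rather than by value: if label `'1'` happens to be slightly more frequent in the training split, it would receive index 0, and the model's output would then refer to that index rather than to the raw label. The sketch below illustrates this assumed ordering with a hypothetical helper and then measures agreement on the ten rows under both readings; inspecting `LABEL.vocab.stoi` would settle which one applies.

```python
from collections import Counter

def label_index_sketch(labels):
    """Assign ids by descending frequency, ties broken alphabetically (assumed behaviour)."""
    counts = Counter(labels)
    ordered = sorted(counts.items(), key=lambda kv: kv[0])
    ordered.sort(key=lambda kv: kv[1], reverse=True)
    return {label: idx for idx, (label, _) in enumerate(ordered)}

# Hypothetical training labels in which '1' is slightly more frequent than '0'
mapping = label_index_sketch(["1"] * 51 + ["0"] * 49)
print(mapping)

# Predictions and raw labels copied from the ten rows above
predictions = [0.565977, 0.595713, 0.351752, 0.596101, 0.579621,
               0.48437, 0.410894, 0.529292, 0.60679, 0.617302]
raw_labels = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]

def agreement(preds, labels, flip=False):
    """Share of predictions matching the labels, optionally with flipped labels."""
    hits = 0
    for p, y in zip(preds, labels):
        predicted = 1 if p > 0.5 else 0
        target = 1 - y if flip else y
        hits += predicted == target
    return hits / len(labels)

print("raw labels:", agreement(predictions, raw_labels))
print("flipped labels:", agreement(predictions, raw_labels, flip=True))
```

Ten rows are far too few to draw conclusions about the model, but the pattern is a reason to check the label mapping before interpreting a prediction near 1 as positive.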
