<a href="https://colab.research.google.com/github/akashe/NLP/blob/main/assignment/SST_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we will do sentiment analysis on the Stanford Sentiment Treebank (SST) dataset.
The core things:
1) Phrase-level sentiment with 5 classes
2) Data augmentation
3) Model architectures

Original paper: https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf
Dataset location: https://nlp.stanford.edu/sentiment/
We won't reproduce the architecture from the paper. We will use a simple LSTM instead.

# Preparing data

In [1]:
# No packages to install: zipfile and io are part of the Python standard library

In [2]:
# Download dataset
from urllib.request import urlopen
from zipfile import ZipFile 
from io import BytesIO
import random

dataset_location = "http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip"
with urlopen(dataset_location) as zipresp:
  with ZipFile(BytesIO(zipresp.read())) as zfile:
    zfile.extractall("./SST_dataset")


In [3]:
# Loading data
# phrases are present in dictionary.txt

phrases = {}
with open("/content/SST_dataset/stanfordSentimentTreebank/dictionary.txt") as f:
  for line in f:
    # each line is "<phrase>|<phrase id>"
    phrase, phrase_id = line.rstrip("\n").rsplit("|", 1)
    phrases[phrase_id] = {"phrase": phrase}

with open("/content/SST_dataset/stanfordSentimentTreebank/sentiment_labels.txt") as f:
  next(f)  # skip the header line
  for line in f:
    # each line is "<phrase id>|<sentiment score>"
    phrase_id, sentiment = line.rstrip("\n").split("|")
    if phrase_id not in phrases:
      raise KeyError(phrase_id)
    phrases[phrase_id]["sentiment"] = sentiment

print(phrases['0'],phrases['25'])  


{'phrase': '!', 'sentiment': '0.5'} {'phrase': "'s a visual delight and a decent popcorn adventure ,", 'sentiment': '0.77778'}
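
The sentiment values printed above are continuous scores in [0, 1], while the task uses 5 classes. A minimal sketch of one way to bucket them, assuming five equal-width bins. `score_to_class` is a hypothetical helper, not part of the dataset tooling, and its treatment of scores that fall exactly on a bin edge may differ from the dataset's official cutoffs.

```python
def score_to_class(score, n_classes=5):
    """Map a sentiment score in [0, 1] to a class index in [0, n_classes - 1].

    Assumption: equal-width bins. The min() keeps a score of exactly 1.0
    in the top class instead of creating an extra class.
    """
    return min(int(float(score) * n_classes), n_classes - 1)


# The two phrases printed above: '!' (0.5) and the "visual delight" phrase (0.77778)
neutral_class = score_to_class("0.5")
positive_class = score_to_class("0.77778")
top_class = score_to_class("1.0")
```

Under this bucketing, class 0 is the most negative and class 4 the most positive.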


# Data Augmentation

In [4]:
# #Back Translate
# # Install package for google translate
# # using google_trans_new for unlimited translations
# !pip install google_trans_new > /dev/null 2>&1
# !pip install googletrans > /dev/null 2>&1


# from google_trans_new import google_translator
# import random
# import googletrans

# translator = google_translator()
# back_translated_phrases = {}

# for i in phrases:
#   sentence = phrases[i]['phrase']
#   available_langs = list(googletrans.LANGUAGES.keys())
#   trans_lang = random.choice(available_langs) 
#   translation = translator.translate(sentence,lang_tgt=trans_lang)
#   assert type(translation) is str 
#   back_translation = translator.translate(translation, lang_tgt="en")
#   if sentence == back_translation:
#     continue
#   else:
#     new_id = str(len(back_translated_phrases))
#     back_translated_phrases[new_id] = {}
#     back_translated_phrases[new_id]['phrase']= back_translation
#     back_translated_phrases[new_id]['sentiment'] = phrases[i]['sentiment']

# print("Number of datapoints added after back translation {}".format(len(back_translated_phrases)))




In [5]:
# Dump back translated sentence for future use:
import pickle  # part of the standard library, no install needed

# with open("/content/SST_dataset/back_translated.pickle","wb") as f:
#   pickle.dump(back_translated_phrases,f)

# Also dump existing data for later use
with open("/content/SST_dataset/phrases.pickle","wb") as f:
  pickle.dump(phrases,f)

In [6]:
# random_swap
# Swap two randomly chosen words within a phrase, repeated n times
n = 5

def random_swap(sentence):
  sentence = sentence.split(" ")
  if len(sentence)<3:
    return " ".join(sentence)
  length = range(len(sentence))
  for _ in range(n):
    idx1, idx2 = random.sample(length,2)
    sentence[idx1],sentence[idx2] = sentence[idx2],sentence[idx1]
  return " ".join(sentence)

random_swapped_phrases = {}

for i in phrases:
  sentence = phrases[i]['phrase']
  swapped_sentence = random_swap(sentence)
  if sentence != swapped_sentence:
    new_idx = str(len(random_swapped_phrases))
    random_swapped_phrases[new_idx] = {}
    random_swapped_phrases[new_idx]['phrase'] = swapped_sentence
    random_swapped_phrases[new_idx]['sentiment'] = phrases[i]['sentiment']

print(" Number of data points added with random swap = {}".format(len(random_swapped_phrases)))

with open("/content/SST_dataset/swapped_phrases.pickle","wb") as f:
  pickle.dump(random_swapped_phrases,f)


 Number of data points added with random swap = 179306
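
The swap above draws from the global `random` state, so the augmented set changes between runs. A minimal sketch of the same idea with an explicit random generator, assuming reproducibility is wanted. The `n_swaps` and `rng` parameters are additions for illustration and are not in the cell above.

```python
import random


def random_swap(sentence, n_swaps=5, rng=random):
    """Swap two randomly chosen word positions, repeated n_swaps times."""
    words = sentence.split(" ")
    if len(words) < 3:
        # too short to reorder in a meaningful way; return unchanged
        return sentence
    for _ in range(n_swaps):
        idx1, idx2 = rng.sample(range(len(words)), 2)
        words[idx1], words[idx2] = words[idx2], words[idx1]
    return " ".join(words)


original = "'s a visual delight and a decent popcorn adventure ,"
swapped = random_swap(original, rng=random.Random(7))
```

Swapping only reorders words, so the augmented phrase always contains exactly the same tokens as the original.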


In [7]:
# random_delete
# Delete each word independently with probability p (a word is kept only when its random draw exceeds p)

p = 0.8

def random_deletion(sentence):
  sentence = sentence.split(" ")
  if len(sentence) == 1:
    return " ".join(sentence)
  pruned_sentence = list(filter(lambda x: random.uniform(0,1)>p,sentence))
  if len(pruned_sentence)==0:
    return random.choice(sentence)
  else:
    return " ".join(pruned_sentence)

random_deleted_phrases = {}

for i in phrases:
  sentence = phrases[i]['phrase']
  deleted_sentence = random_deletion(sentence)
  if sentence != deleted_sentence:
    new_idx = str(len(random_deleted_phrases))
    random_deleted_phrases[new_idx] = {}
    random_deleted_phrases[new_idx]['phrase'] = deleted_sentence
    random_deleted_phrases[new_idx]['sentiment'] = phrases[i]['sentiment']

print(" Number of data points added with random deletion = {}".format(len(random_deleted_phrases)))

with open("/content/SST_dataset/deleted_phrases.pickle","wb") as f:
  pickle.dump(random_deleted_phrases,f)


 Number of data points added with random deletion = 215095
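
With `p = 0.8`, each word survives with probability 0.2, so a 10-word phrase keeps about 2 words on average. A minimal sketch that makes the deletion probability an explicit argument, so milder settings can be compared. The `p` and `rng` parameters are additions for illustration.

```python
import random


def random_deletion(sentence, p=0.8, rng=random):
    """Drop each word independently with probability p; never return an empty phrase."""
    words = sentence.split(" ")
    if len(words) == 1:
        return sentence
    kept = [w for w in words if rng.uniform(0, 1) > p]
    if not kept:
        # everything was deleted; fall back to one random word
        return rng.choice(words)
    return " ".join(kept)


original = "'s a visual delight and a decent popcorn adventure ,"
aggressive = random_deletion(original, p=0.8, rng=random.Random(7))
mild = random_deletion(original, p=0.1, rng=random.Random(7))
expected_kept = len(original.split(" ")) * (1 - 0.8)  # about 2 of 10 words
```

The surviving words keep their original order, so the result is always a subsequence of the input phrase.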


# Defining Fields, Datasets and (train,valid) splits

In [8]:
!pip install torch > /dev/null 2>&1

import torch, torchtext
from torchtext import data

seed = 7
torch.manual_seed(seed)

phrase = data.Field(sequential = True, tokenize = 'spacy', batch_first = True, include_lengths = True)
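# min(..., 4) keeps a score of exactly 1.0 in the top class, so there are 5 classes rather than 6
sentiment = data.LabelField(tokenize= 'spacy', is_target = True, preprocessing = lambda x: min(int(float(x)/0.2), 4), batch_first = True, sequential = False)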

fields = [('phrase',phrase),('sentiment',sentiment)]

def dict_custom_add(*args):
  # A simpler way would be c = {**a,**b} but that would override same key values
  new_dict = {}
  for i in args:
    assert type(i) is dict
    for j in i:
      if j not in new_dict:
        new_dict[j] = i[j]
      else:
        new_dict[str(len(new_dict))] = i[j]

  return new_dict

# For ablation studies, we will make 5 separate datasets

# TODO: build the train and valid splits for these datasets in a loop

# 1. Original data:
original_examples = [data.Example.fromlist([phrases[i]['phrase'],phrases[i]['sentiment']],fields) for i in phrases]
original_dataset = data.Dataset(original_examples,fields)

original_train, original_valid = original_dataset.split(split_ratio=[0.85,0.15], random_state= random.seed(seed))

# # 2. Back translated data:
# custom_dict = dict_custom_add(phrases,back_translated_phrases)
# back_translated_examples = [data.Example.fromlist([custom_dict[i]['phrase'],custom_dict[i]['sentiment']],fields) for i in custom_dict]
# back_translated_dataset = data.Dataset(back_translated_examples,fields)

# back_translate_train, back_translated_valid = back_translated_dataset.split(split_ratio=[0.85,0.15], random_state= random.seed(seed))

# 3. Random swapped data:
custom_dict = dict_custom_add(phrases,random_swapped_phrases)
random_swapped_examples = [data.Example.fromlist([custom_dict[i]['phrase'],custom_dict[i]['sentiment']],fields) for i in custom_dict]
random_swapped_dataset = data.Dataset(random_swapped_examples,fields)

random_swapped_train, random_swapped_valid = random_swapped_dataset.split(split_ratio=[0.85,0.15], random_state= random.seed(seed))

# 4. Random deletion data:
custom_dict = dict_custom_add(phrases,random_deleted_phrases)
random_deletion_examples = [data.Example.fromlist([custom_dict[i]['phrase'],custom_dict[i]['sentiment']],fields) for i in custom_dict]
random_deletion_dataset = data.Dataset(random_deletion_examples,fields)

random_deletion_train, random_deletion_valid = random_deletion_dataset.split(split_ratio=[0.85,0.15], random_state= random.seed(seed))

# 5. All the above combined:
custom_dict = dict_custom_add(phrases,random_swapped_phrases, random_deleted_phrases)
full_examples = [data.Example.fromlist([custom_dict[i]['phrase'],custom_dict[i]['sentiment']],fields) for i in custom_dict]
full_dataset = data.Dataset(full_examples,fields)

full_train, full_valid = full_dataset.split(split_ratio=[0.85,0.15], random_state= random.seed(seed))
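
The TODO in this cell asks for the dataset construction to be turned into a loop. A minimal sketch of that structure in plain Python, leaving torchtext out so it runs on its own. `merge_with_fresh_ids` and `train_valid_split` are hypothetical stand-ins for `dict_custom_add` and `Dataset.split`, and the toy dictionaries are illustrative data only.

```python
import random


def merge_with_fresh_ids(*dicts):
    """Concatenate several {id: example} dicts, re-keying every example sequentially."""
    merged = {}
    for d in dicts:
        for value in d.values():
            merged[str(len(merged))] = value
    return merged


def train_valid_split(items, train_ratio=0.85, seed=7):
    """Shuffle with a fixed seed and cut into (train, valid)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-ins for phrases, random_swapped_phrases and random_deleted_phrases
toy_phrases = {str(i): {"phrase": f"phrase {i}", "sentiment": "0.5"} for i in range(20)}
toy_swapped = {str(i): {"phrase": f"{i} phrase", "sentiment": "0.5"} for i in range(10)}
toy_deleted = {str(i): {"phrase": "phrase", "sentiment": "0.5"} for i in range(10)}

variants = {
    "original": [toy_phrases],
    "swap": [toy_phrases, toy_swapped],
    "deletion": [toy_phrases, toy_deleted],
    "full": [toy_phrases, toy_swapped, toy_deleted],
}

splits = {}
for name, parts in variants.items():
    merged = merge_with_fresh_ids(*parts)
    splits[name] = train_valid_split(merged.values())
```

In the notebook itself, the loop body would build the `data.Example` list and call `Dataset.split` in place of these helpers.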

# Defining model

In [9]:
import torch.nn as nn
import torch.nn.functional as F

class classifier(nn.Module):
    
    # Define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout):
        
        super().__init__()          
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # LSTM layer
        self.encoder = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           dropout=dropout,
                           batch_first=True)
        # try using nn.GRU or nn.RNN here and compare their performances
        # try bidirectional and compare their performances
        
        # Dense layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, text_lengths):
        
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        # embedded = [batch size, sent_len, emb dim]
      
        # packed sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)
        
        packed_output, (hidden, cell) = self.encoder(packed_embedded)
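        # hidden = [num layers * num directions, batch size, hid dim]
        # cell = [num layers * num directions, batch size, hid dim]
        # (batch_first=True only affects the input and output tensors, not hidden/cell)
    
        # dense_outputs = [num layers * num directions, batch size, output dim]
        dense_outputs = self.fc(hidden)   
        
        # Softmax over the classes, using the first layer's final hidden state (index 0)
        output = F.softmax(dense_outputs[0], dim=1)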
            
        return output
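
The model returns softmax probabilities, and the training loop later passes them to `nn.CrossEntropyLoss`, which applies a log-softmax of its own. A minimal sketch in plain Python of what that does to the loss, assuming 5 classes. The `cross_entropy` and `softmax` helpers are illustrative re-implementations, not PyTorch calls.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """-log_softmax(scores)[target], the per-example cross-entropy on raw scores."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


# A very confident, correct prediction for class 0
logits = [10.0, 0.0, 0.0, 0.0, 0.0]

loss_from_logits = cross_entropy(logits, 0)          # close to 0
loss_from_probs = cross_entropy(softmax(logits), 0)  # softmax applied twice

# Best case when the inputs are probabilities: a perfect one-hot vector
loss_floor = cross_entropy([1.0, 0.0, 0.0, 0.0, 0.0], 0)  # log(e + 4) - 1, about 0.905
```

Because probabilities are confined to [0, 1], the second softmax can never become sharp, so the loss cannot drop below roughly 0.905 with 5 classes. Returning the raw `dense_outputs` and letting the loss apply the softmax is the usual alternative.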


# Train Eval Loop

In [10]:
# Set device
import torch.optim as optim
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [11]:
def runner(train_dataset,valid_dataset,epochs=10,embedding_dim=300,num_hidden_nodes=100,num_layers=2,dropout=0.2,lr=2e-4):
  # build vocab
  phrase.build_vocab(train_dataset)
  sentiment.build_vocab(train_dataset)

  # build iterators
  train_iterator,valid_iterator = data.BucketIterator.splits((train_dataset, valid_dataset), batch_size = 32, 
                                                            sort_key = lambda x: len(x.phrase),
                                                            sort_within_batch=True, device = device)
  
  # initialize the model
  size_of_vocab = len(phrase.vocab)
  num_output_nodes = len(sentiment.vocab)
  model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes, num_output_nodes, num_layers, dropout = dropout)
  
  # optimizer
  optimizer = optim.Adam(model.parameters(), lr=lr)
  criterion = nn.CrossEntropyLoss()
  
  # accuracy (multi-class, despite the function name)
  def binary_accuracy(preds, y):
    # take the class with the highest score as the prediction
    _, predictions = torch.max(preds, 1)
    
    correct = (predictions == y).float() 
    acc = correct.sum() / len(correct)
    return acc
  
  # move model and criterion to gpu if available
  model = model.to(device)
  criterion = criterion.to(device)

  # train loop
  def train():
    # initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    # set the model in training phase
    model.train()  
    
    for batch in train_iterator:
        
        # resets the gradients after every batch
        optimizer.zero_grad()   
        
        # retrieve text and no. of words
        phrase, phrase_lengths = batch.phrase   
        
        # forward pass; predictions = [batch size, num classes]
        predictions = model(phrase, phrase_lengths).squeeze()  
        
        # compute the loss
        loss = criterion(predictions, batch.sentiment)        
        
        # compute the accuracy
        acc = binary_accuracy(predictions, batch.sentiment)   
        
        # backpropagate the loss and compute the gradients
        loss.backward()       
        
        # update the weights
        optimizer.step()      
        
        # loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()    
        
    return epoch_loss / len(train_iterator), epoch_acc / len(train_iterator)

  # eval loop
  def evaluate():
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # deactivating dropout layers
    model.eval()
    
    # deactivates autograd
    with torch.no_grad():
    
        for batch in valid_iterator:
        
            # retrieve text and no. of words
            phrase, phrase_lengths = batch.phrase
            
            # forward pass; predictions = [batch size, num classes]
            predictions = model(phrase, phrase_lengths).squeeze()
            
            # compute loss and accuracy
            loss = criterion(predictions, batch.sentiment)
            acc = binary_accuracy(predictions, batch.sentiment)
            
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(valid_iterator), epoch_acc / len(valid_iterator)
  
  # Running for epochs
  best_valid_accuracy = float('-inf')

  for epoch in range(epochs):
    
    # train the model
    train_loss, train_acc = train()

    # evaluate the model
    valid_loss, valid_acc = evaluate()

    if valid_acc > best_valid_accuracy:
      best_valid_accuracy = valid_acc

    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')
  # return best accuracy
  return best_valid_accuracy
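
`train()` and `evaluate()` average the per-batch accuracies, so a smaller final batch counts as much as a full one. A minimal sketch of the difference between that and an example-weighted accuracy, using made-up batch counts. Both function names are hypothetical.

```python
def mean_of_batch_means(batches):
    """Average of per-batch accuracies, as the runner computes it."""
    return sum(correct / size for correct, size in batches) / len(batches)


def example_weighted_accuracy(batches):
    """Total correct predictions divided by total examples."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size


# (correct, batch size): two full batches of 32 and a final batch of 2
toy_batches = [(24, 32), (24, 32), (1, 2)]

per_batch = mean_of_batch_means(toy_batches)
per_example = example_weighted_accuracy(toy_batches)
```

With thousands of batches per epoch the gap is small, but the example-weighted figure is the one that matches the fraction of correctly classified phrases.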


# Training models for all datasets

In [12]:
best_accuracies = []
print("Training with Original data")
best_accuracy = runner(original_train,original_valid)
print("Best accuracy with Original data {}".format(best_accuracy))
best_accuracies.append(("Orginal data -->",best_accuracy))

# print("Training with Original data + Back Translation data")
# best_accuracy = runner(back_translate_train,back_translated_valid)
# print("Best accuracy with Original data + Back Translation data {}".format(best_accuracy))
# best_accuracies.append(("Original data + Back Translation data -->",best_accuracy))

print("\nTraining with Original data + Random Swapped data")
best_accuracy = runner(random_swapped_train,random_swapped_valid)
print("Best accuracy with Original data + Random Swapped data {}".format(best_accuracy))
best_accuracies.append(("Original data + Random Swapped data -->",best_accuracy))

print("\nTraining with Original data + Random Deleted data")
best_accuracy = runner(random_deletion_train,random_deletion_valid)
print("Best accuracy with Original data + Random Deleted data {}".format(best_accuracy))
best_accuracies.append(("Original data + Random Deleted data -->",best_accuracy))

print("\nTraining with Original data + Random Swapped data + Random Deleted data")
best_accuracy = runner(full_train,full_valid)
print("Best accuracy with Original data + Random Swapped data + Random Deleted data {}".format(best_accuracy))
best_accuracies.append(("Original data + Random Swapped data + Random Deleted data -->",best_accuracy))

Training with Original data
	Train Loss: 1.533 | Train Acc: 51.59%
	 Val. Loss: 1.492 |  Val. Acc: 54.72% 

	Train Loss: 1.462 | Train Acc: 57.95%
	 Val. Loss: 1.453 |  Val. Acc: 58.93% 

	Train Loss: 1.426 | Train Acc: 61.69%
	 Val. Loss: 1.435 |  Val. Acc: 60.69% 

	Train Loss: 1.402 | Train Acc: 64.11%
	 Val. Loss: 1.427 |  Val. Acc: 61.48% 

	Train Loss: 1.385 | Train Acc: 65.92%
	 Val. Loss: 1.420 |  Val. Acc: 62.20% 

	Train Loss: 1.372 | Train Acc: 67.19%
	 Val. Loss: 1.415 |  Val. Acc: 62.78% 

	Train Loss: 1.362 | Train Acc: 68.27%
	 Val. Loss: 1.413 |  Val. Acc: 63.00% 

	Train Loss: 1.353 | Train Acc: 69.11%
	 Val. Loss: 1.410 |  Val. Acc: 63.23% 

	Train Loss: 1.347 | Train Acc: 69.83%
	 Val. Loss: 1.409 |  Val. Acc: 63.20% 

	Train Loss: 1.340 | Train Acc: 70.46%
	 Val. Loss: 1.409 |  Val. Acc: 63.35% 

Best accuracy with Original data 0.6335441519551098

Training with Original data + Random Swapped data
	Train Loss: 1.529 | Train Acc: 51.55%
	 Val. Loss: 1.475 |  Val. Acc

# Result

In [14]:
print(" Best Accuracies for different augmentation methods")
for i,j in best_accuracies:
  print(str(j*100)+"\t"+i)

 Best Accuracies for different augmentation methods
63.35441519551098	Orginal data -->
64.35084493335233	Original data + Random Swapped data -->
58.256064163687086	Original data + Random Deleted data -->
60.2231992596089	Original data + Random Swapped data + Random Deleted data -->
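
The recorded numbers are easier to compare as differences from the original-data run. A minimal sketch using the accuracies printed above, with shortened labels chosen here for illustration. Each run is scored on its own validation split, which for the augmented runs also contains augmented phrases, so the differences are indicative rather than a strict like-for-like comparison.

```python
# Best validation accuracies (in percent) copied from the output above
reported = {
    "original": 63.35441519551098,
    "original + swap": 64.35084493335233,
    "original + deletion": 58.256064163687086,
    "original + swap + deletion": 60.2231992596089,
}

baseline = reported["original"]

# Difference from the baseline in percentage points
deltas = {name: acc - baseline for name, acc in reported.items()}

for name, acc in reported.items():
    print(f"{acc:6.2f}%  {deltas[name]:+6.2f} pp  {name}")
```

Random swap is the only augmentation that ends above the baseline here, by about one percentage point. Both runs that include random deletion end below it.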
