<a href="https://colab.research.google.com/github/YeAnbang/cs447-hw2-rnn/blob/master/Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2

In this part of assignment 2 we'll be building a machine learning model to detect sentiment of movie reviews using the Stanford Sentiment Treebank([SST])(http://ai.stanford.edu/~amaas/data/sentiment/) dataset. First we will import all the required libraries. We highly recommend that you finish the PyTorch Tutorials [ 1 ](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html),[ 2 ](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html),[ 3 ](https://github.com/yunjey/pytorch-tutorial). before starting this assignment. After finishing this assignment we will able to answer the following questions-


* How to write Dataloaders in Pytorch?
* How to build dictionaries and vocabularies for Deep Nets?
* How to use Embedding Layers in Pytorch?
* How to build various recurrent models (LSTMs and GRUs) for sentiment analysis?
* How to use packed_padded_sequences for sequential models?




# Import Libraries

In [0]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
from collections import defaultdict
from torchtext import datasets
from torchtext import data
device = torch.device('cpu')#'cuda' if torch.cuda.is_available() else 
from torch.nn.utils.rnn import pack_sequence, pad_sequence, pack_padded_sequence,pad_packed_sequence


## Download dataset
First we will download the dataset using [torchtext](https://torchtext.readthedocs.io/en/latest/index.html), which is a package that supports NLP for PyTorch. The following command will get you 3 objects `train_data`, `val_data` and `test_data`. To access the data:

*   To access list of textual tokens - `train_data[0].text`
*   To access label - `train_data[0].label`



In [0]:
if(__name__=='__main__'):
  train_data, val_data, test_data = datasets.SST.splits(data.Field(tokenize = 'spacy'), data.LabelField(dtype = torch.float), filter_pred=lambda ex: ex.label != 'neutral')

In [3]:
if(__name__=='__main__'):
  print(train_data[0].text)
  print(train_data[0].label)

['The', 'Rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'Century', "'s", 'new', '`', '`', 'Conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'Arnold', 'Schwarzenegger', ',', 'Jean', '-', 'Claud', 'Van', 'Damme', 'or', 'Steven', 'Segal', '.']
positive


## Define the Dataset Class

In the following cell, we will define the dataset class. You need to implement the following functions: 


*   ` build_dictionary() ` - creates the dictionaries `ixtoword` and `wordtoix`. Converts all the text of all examples, in the form of text ids and stores them in `textual_ids`. If a word is not present in your dictionary, it should use `<unk>`. Use the hyperparameter `THRESHOLD` to control the words to be in the dictionary based on their occurrence. Note the occurrences should be `>=THRESHOLD` to be included in the dictionary.
*   ` get_label() ` - It should return the value `0` if the label in the dataset is `positive`, and should return `1` if it is `negative`. 
*   ` get_text() ` - This function should pad the review with `<end>` character uptil a length of `MAX_LEN` if the length of the text is less than the `MAX_LEN`.
*   ` __len__() ` - This function should return the total length of the dataset.
*   ` __getitem__() ` - This function should return the padded text, the length of the text (without the padding) and the label.


In [0]:
THRESHOLD = 10
MAX_LEN = 60
END = "<end>"
UNK = "<unk>"

class TextDataset(data.Dataset):
  def __init__(self, examples, split, ixtoword=None, wordtoix=None, THRESHOLD=THRESHOLD):
    self.examples = examples
    self.split = split
    self.THRESHOLD = THRESHOLD
    self.count = defaultdict(int)
    self.ixtoword = None
    self.wordtoix = None
    self.textual_ids = None
    self.labeltoix = []
    self.build_dictionary()
    ### TO-DO
  
  def build_dictionary(self):
    ### TO-DO
    ### <end> should be at idx 0
    ### <unk> should be at idx 1 
    print("build dictionary")
    print("size example: ",len(self.examples))
    ixtoword = {}
    wordtoix = {}
    total = len(self.examples)//100
    print("#"*total, end="")
    print("\b"*total)
    for i in range(len(self.examples)):
        if i%100==0:
          print("=",end="")
        text = self.examples[i].text
        label = self.examples[i].label
        if label not in self.labeltoix:
            self.labeltoix.append(label)
        for t in range(len(text)):
            self.count[text[t]]+=1
    textual_ids = []
    for key,value in self.count.items():
        if value>=THRESHOLD:
            textual_ids.append(key)
        else:
            wordtoix[key]=0
    textual_ids.sort(reverse=True)
    textual_ids.insert(0,UNK)
    textual_ids.insert(0,END)
    print(textual_ids[0],textual_ids[1])
    for i in range(len(textual_ids)):
        ixtoword[i] = textual_ids[i]
        wordtoix[textual_ids[i]] = i
    self.textual_ids = textual_ids
    self.ixtoword = ixtoword
    self.wordtoix = wordtoix
    print("finish building dictionary: ",len(textual_ids))
    return textual_ids, ixtoword, wordtoix
  
  def get_label(self, index):
    ### TO-DO
    return self.examples[index].label
   
  def get_text(self, index):
    ### TO-DO
    temp = self.examples[index].text
    for i in range(len(temp)):
        if i==MAX_LEN-1:
            break
        if self.count[temp[i]]>=THRESHOLD:
            pass
        else:
            temp[i]=UNK
    temp.append(END)   #end
    return temp 
    
  
  def __len__(self):
    ### TO-DO
    return len(self.examples)
  
  def __getitem__(self, index):
    ### TO-DO
    text = self.examples[index].text
    #print("get_item: ",text,type(self.wordtoix),type(self.count))
    for i in range(len(text)):
        if i==MAX_LEN-1:
            break
        if self.count[self.wordtoix[text[i]]]>=THRESHOLD:
            text[i]=self.textual_ids.index(text[i])
        else:
            text[i]=1  #unk
    while len(text)<MAX_LEN:
        text.append(0)   #end
    text_len = len(text)-1
    lbl = self.labeltoix.index(self.examples[index].label)
    #print("getitem length: ",len(text))
    text = torch.tensor(text)
    lbl = torch.tensor(lbl)
    
    
    return text, text_len, lbl

## Initialize the Dataloader
We initialize the training and testing dataloaders using the Dataset classes we create for both training and testing. Make sure you use the same vocabulary for both the datasets.

In [5]:
if(__name__=='__main__'):
  Ds = TextDataset(train_data, 'train')
  batch_size = 32
  train_loader = torch.utils.data.DataLoader(Ds, batch_size=batch_size, shuffle=True, num_workers=4, drop_last=True)
  test_Ds = TextDataset(test_data, 'test')
  test_loader = torch.utils.data.DataLoader(test_Ds, batch_size=1, shuffle=False, num_workers=4, drop_last=True)

build dictionary
size example:  6920
#####################################################################
finish building dictionary:  1462
build dictionary
size example:  1821
##################
finish building dictionary:  384


## Build your Sequential Model
In the following we provide you the class to build your model. We provide some parameters, we expect you to use in the initialization of your sequential model.

In [0]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        ## To-Do
        # - Create an embedding layer - refer to nn.Embedding
        # - Use a sequential network - nn.LSTM or nn.GRU
        # Have an output layer for outputting a single output value
        num_direction = 2 if bidirectional else 1
        self.embeds = nn.Embedding(vocab_size, embedding_dim)
        self.h0 = Variable(torch.Tensor(torch.randn(n_layers*num_direction, 32, hidden_dim)))
        self.c0 = Variable(torch.Tensor(torch.randn(n_layers*num_direction, 32, hidden_dim)))
        self.LSTM = nn.LSTM(embedding_dim,hidden_dim,n_layers,dropout=dropout,bidirectional=bidirectional)
        self.W = Variable(torch.Tensor(torch.randn(output_dim,hidden_dim*num_direction)))
        self.b = Variable(torch.Tensor(torch.randn(output_dim)))
        
    def forward(self, text, text_lengths):
        #print(text.shape)
        
        embed_text = self.embeds(Variable(torch.LongTensor(text)))
        #print(embed_text.shape)
        packed = pack_padded_sequence(embed_text, text_lengths, batch_first=False)
        #print(packed)
        
        # Forward propagate RNN
        out, _ = self.LSTM(packed,(self.h0,self.c0))
        out_data = pad_packed_sequence(out)[0][-1]
        #print(out_data.shape)
        #print(out)
        #print(out.data[-1].dot(self.W).shape)
        self.output = nn.functional.linear(out_data,self.W,self.b)
        ## TO - DO 
        ## Hint(s):  Refer to nn.utils.rnn.pack_padded_sequence for padded tensors
        ## You do not need to apply a sigmoid to the final output - we do that for you when we call it in evaluation
        
        #text = [MAX LEN, batch size]
        #text_lengths = [batch size]
        return self.output

In [0]:

# Hyperparameters for your model
# Feel Free to play around with these
# for getting optimal performance
# TO-DO
INPUT_DIM = len(test_Ds) #this should be your vocab size
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = 0

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

In [8]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
if(__name__=='__main__'):
  print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,492,244 trainable parameters


### Define your loss function and optimizer

In [0]:
import torch.optim as optim
# TO-DO
# Feel Free to play around with different optimizers and loss functions
# for getting optimal performance
# For optimizers : https://pytorch.org/docs/stable/optim.html
# For loss functions : https://pytorch.org/docs/stable/nn.html#loss-functions
if(__name__=='__main__'):
  optimizer = optim.SGD(model.parameters(), lr=1e-3)
  criterion = nn.BCEWithLogitsLoss() 

### Put your model on the GPU

In [0]:
if(__name__=='__main__'):
  model = model.to(device)
  criterion = criterion.to(device)

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

## Train your Model

In [0]:
def train_model(model, num_epochs, data_loader):
  model.train()
  for epoch in range(10):
    epoch_loss = 0
    epoch_acc = 0
    for idx, (text, text_lens, label) in enumerate(data_loader):
        if(idx%100==0):
          print('Executed Step {} of Epoch {}'.format(idx, epoch))
        text = text.to(device)
        # text - [batch_len, MAX_LEN]
        text_lens = text_lens.to(device)
        # text - [batch_len]
        label = label.float()
        label = label.to(device)
        optimizer.zero_grad()
        text = text.permute(1, 0) # permute for sentence_len first for embedding
        predictions = model(text, text_lens).squeeze(1)
        loss = criterion(predictions, label)

        acc = binary_accuracy(predictions, label)

        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
    print('Training Loss Value of Epoch {} = {}'.format(epoch ,epoch_loss/len(train_loader)))
    print('Training Accuracy of Epoch {} = {}'.format(epoch ,epoch_acc/len(train_loader)))

## Evaluate your Model

In [0]:
def evaluate(model, data_loader):
  model.eval()
  epoch_loss = 0
  epoch_acc = 0
  all_predictions = []
  for idx, (text, text_lens, label) in enumerate(data_loader):
      if(idx%100==0):
        print('Executed Step {}'.format(idx))
      text = text.to(device)
      text_lens = text_lens.to(device)
      label = label.float()
      label = label.to(device)
      optimizer.zero_grad()
      
      text = text.permute(1, 0)
      predictions = model(text, text_lens).squeeze(1)
      all_predictions.append(torch.round(torch.sigmoid(predictions)))
      loss = criterion(predictions, label)
      acc = binary_accuracy(predictions, label)
      epoch_loss += loss.item()
      epoch_acc += acc.item()
  print(epoch_loss/len(data_loader))
  print(epoch_acc/len(data_loader))
  predictions = torch.cat(all_predictions)
  return predictions

## Training and Evaluation

We first train your model using the training data. Feel free to play around with the number of epochs. We recommend **you write code to save your model** [(save/load model tutorial)](https://pytorch.org/tutorials/beginner/saving_loading_models.html) as colab connections are not permanent and it can get messy if you'll have to train your model again and again.

In [0]:
if(__name__=='__main__'):
  train_model(model, 10, train_loader)

Executed Step 0 of Epoch 0
Executed Step 100 of Epoch 0


Now we will evaluate your model on the test set.

In [1]:
if(__name__=='__main__'):
  predictions = evaluate(model, test_loader)
  predictions = predictions.cpu().data.detach().numpy()
  assert(len(predictions)==len(test_data))

NameError: ignored

## Saving results for Submission
Saving your test results for submission. You will save the `result.txt` with your test data results. Make sure you do not **shuffle** the order of the `test_data` or the autograder will give you a bad score.

You will submit the following files to the autograder on the gradescope :


1.   Your `result.txt` of test data results
2.   Your code of this notebook. You can do it by clicking `File`-> `Download .py` - make sure the name of the downloaded file is `assignment2.py`



In [0]:
if(__name__=='__main__'):
  try:
    from google.colab import drive
    drive.mount('/content/drive')
  except:
    pass
  np.savetxt('drive/My Drive/result.txt', predictions, delimiter=',')