In [1]:
# Check if a gpu is available
!nvidia-smi


Sun Mar  1 02:27:52 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P8    18W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

# Sentiment Classification

In this assignment we will:
1. Use PyTorch to load the IMDb movie dataset and do preprocessing;
2. Develop a Recurrent Neural Network (RNN) classifier for the same dataset;
3. Convert the RNN to a bidirectional Long Short-Term Memory (LSTM) model.







## 1. Loading dataset

In [0]:
import torch
from torchtext import data

SEED = 12138

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Torchtext lets us load the text and labels separately.
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

In [3]:
# follow the steps to authorize colab to get access to your google drive data
from google.colab import drive
drive.mount('/content/gdrive')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [4]:
# make sure that you can see the ipynb files and IMDB.gz
!ls  gdrive/My\ Drive/Colab\ Notebooks/nlp_hw2/

IMDB.gz  PartA.ipynb  PartB.ipynb  PartC.ipynb


## Data loading
Read more: https://pytorchnlp.readthedocs.io/en/latest/_modules/index.html


In [5]:
from torchtext import datasets
import os

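# set up the path
# (shell-style backslash escapes and the leading space do not belong inside a Python string)
ROOT_DIR = "gdrive/My Drive/Colab Notebooks/nlp_hw2/"
DATA_DIR = os.path.join(ROOT_DIR, 'IMDB.gz')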

# load data, this may take a while
all_data = datasets.IMDB(DATA_DIR,TEXT, LABEL)
train_data, test_data = all_data.splits(TEXT, LABEL)

print ('Loading finished!')

Loading finished!


In [6]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


In [7]:
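import random
# split into train and validation set
# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())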

print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
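
The counts above correspond to a 70/30 split of the 25,000 training examples (17,500 / 25,000 = 0.7). As a rough illustration of what a seeded split does, here is a minimal standard-library sketch. `split_examples` is a hypothetical helper written for this note, not torchtext's implementation, and it stands in integers for the reviews.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=12138):
    """Shuffle a copy of `examples` reproducibly and cut it at `split_ratio`."""
    rng = random.Random(seed)      # private generator, global random state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

# Integers stand in for the 25,000 training reviews.
train_part, valid_part = split_examples(range(25000))
print(len(train_part), len(valid_part))
```

Calling the helper twice with the same seed returns the same partition, which is the property the notebook's `SEED` is meant to provide.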


In [0]:
# set vocab
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)
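
To make the effect of `max_size` concrete, the following is a small standard-library sketch of frequency-based vocabulary building. The helpers `build_vocab` and `numericalize` and the toy corpus are hypothetical, and reserving `<unk>` and `<pad>` in addition to the `max_size` most frequent tokens is an assumption about how the torchtext field behaves, so the real vocabulary may be slightly larger than `MAX_VOCAB_SIZE`.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the `max_size` most frequent tokens, after the special tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids; anything outside the vocabulary becomes <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

# Toy corpus, purely illustrative.
corpus = [["a", "great", "movie"], ["a", "dull", "movie"], ["a", "great", "cast"]]
itos, stoi = build_vocab(corpus, max_size=3)
print(itos)
print(numericalize(["a", "great", "plot"], stoi))
```

Rare words such as "dull" and "cast" fall outside the top three and are mapped to the `<unk>` id at lookup time, just like the unseen word "plot".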

##### Define iterator

Define an iterator that batches examples of similar lengths together. 
There are other options. For more: https://torchtext.readthedocs.io/en/latest/data.html
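
The idea of batching examples of similar lengths can be sketched without any library: sort by length, cut into batches, and pad each batch only to its own longest sequence. `bucket_batches` is a hypothetical helper for illustration; the real iterator also shuffles and is more elaborate.

```python
def bucket_batches(examples, batch_size, pad_token="<pad>"):
    """Sort by length, cut into batches, pad each batch to its own max length."""
    ordered = sorted(examples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_token] * (width - len(seq)) for seq in chunk])
    return batches

# Toy tokenized reviews of different lengths.
reviews = [["good"], ["not", "bad", "at", "all"], ["boring"], ["loved", "every", "minute"]]
batches = bucket_batches(reviews, batch_size=2)
for batch in batches:
    print(batch)
```

Grouping by length keeps the amount of padding small: here only one `<pad>` token is needed, whereas batching the reviews in their original order would need four.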



In [0]:
BATCH_SIZE = 64

# If there is a GPU available, we will set to use it; otherwise we will use cpu.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

## 2. Recurrent Neural Network

This part of the assignment will involve building your own Recurrent Neural Network model for the sentiment analysis task.

1. The first thing you’ll want to do is fill out the code in the initialization of the RNN class. You’ll need to define three layers: self.embedding, self.rnn, and self.fc. Use the built-in modules in torch.nn to accomplish this (remember that a fully-connected layer is also a linear layer!) and pay attention to the input and output dimensions each layer should have.
2. The next step (still in the RNN class) is to implement the forward pass. Make use of the layers you defined above to create embedded, hidden, and output vectors for a given input x.

Hint to start our model:
The RNN model should have the following structure:
1. start by an embedding layer; shape:  (input_dim, embedding_dim)
2. then we put the RNN layer; shape: (embedding_dim, hidden_dim)
3. last, we add a linear layer; shape: (hidden_dim, output_dim)
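
To see how the three shapes in the hint chain together, here is a dependency-free sketch of a single-sequence forward pass: embedding lookup, a tanh recurrence, then a linear layer on the final hidden state. It is an illustration only, not the torch.nn solution: biases are omitted, the weights are random, and the toy sizes and helper names are assumptions made for this example.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random (rows x cols) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_forward(token_ids, embedding, w_ih, w_hh, w_fc):
    hidden = [0.0] * len(w_hh)                       # initial hidden state: hidden_dim zeros
    for token_id in token_ids:
        x = embedding[token_id]                      # lookup: one embedding_dim row per token id
        pre = [a + b for a, b in zip(matvec(w_ih, x), matvec(w_hh, hidden))]
        hidden = [math.tanh(p) for p in pre]         # new hidden state, length hidden_dim
    return matvec(w_fc, hidden)                      # output_dim logits from the final state

rng = random.Random(0)
INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM = 10, 4, 3, 1   # toy sizes
embedding = make_matrix(INPUT_DIM, EMBEDDING_DIM, rng)   # (input_dim, embedding_dim)
w_ih = make_matrix(HIDDEN_DIM, EMBEDDING_DIM, rng)       # embedding_dim -> hidden_dim
w_hh = make_matrix(HIDDEN_DIM, HIDDEN_DIM, rng)          # hidden_dim -> hidden_dim
w_fc = make_matrix(OUTPUT_DIM, HIDDEN_DIM, rng)          # hidden_dim -> output_dim

logits = rnn_forward([1, 5, 2, 7], embedding, w_ih, w_hh, w_fc)
print(len(logits))
```

Only the final hidden state is passed to the linear layer, which matches the instruction to classify the whole review from one vector.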

In [0]:
import torch.nn as nn

## TODO: define the RNN class
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        
        ## TODO starts
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        ## TODO ends
        
    def forward(self, text):

        ## TODO starts
        embedded = self.embedding(text)
        output, hidden = self.rnn(embedded)
        result = self.fc(hidden.squeeze(0))
        ## TODO ends

        return  result

## Model Training



In [0]:
# define some hyperparameters
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1



In [0]:
import torch.optim as optim
# apply our RNN model here
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)


## TODO: define optimizer

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

## TODO ends
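
For reading the loss values, it helps to know what binary cross-entropy on logits computes for a single example. The sketch below uses a standard numerically stable rewriting of the formula; `bce_with_logits` is a hypothetical scalar helper, not the PyTorch implementation, which additionally averages over the batch by default.

```python
import math

def bce_with_logits(logit, target):
    """Stable form of -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))]."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

undecided = bce_with_logits(0.0, 1.0)     # logit 0 means probability 0.5
confident_right = bce_with_logits(4.0, 1.0)
confident_wrong = bce_with_logits(4.0, 0.0)
print(round(undecided, 3), round(confident_right, 3), round(confident_wrong, 3))
```

A model that outputs a logit of 0 for every review pays ln 2 ≈ 0.693 per example regardless of the label, so a loss that stays near 0.693 indicates predictions that are close to chance.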

In [0]:

## setup device
model = model.to(device)
criterion = criterion.to(device)

### Calculate accuracy

In [0]:
## TODO: return the accuracy given the predictions (preds) and true values (y); acc should be a float number
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer

    r_preds = torch.round(torch.sigmoid(preds))
    correct = (r_preds == y).float() 
    acc = correct.sum()/len(correct)

    return acc
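
The same computation on plain Python numbers, as a small worked example. `binary_accuracy_py` is a hypothetical stand-in for the tensor version above, and the logits and labels are made up.

```python
import math

def binary_accuracy_py(logits, labels):
    """Sigmoid, round to 0 or 1, then report the fraction that matches the labels."""
    preds = [round(1.0 / (1.0 + math.exp(-z))) for z in logits]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

logits = [2.0, -1.5, 0.3, -0.2, 4.0]   # raw model outputs, before the sigmoid
labels = [1, 0, 0, 0, 1]
acc = binary_accuracy_py(logits, labels)
print(acc)
```

Because the sigmoid crosses 0.5 at a logit of 0, rounding the sigmoid is equivalent to predicting the positive class whenever the logit is positive. Here the third example (logit 0.3, label 0) is the only miss, giving 4 correct out of 5.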

## Training function

The next function is the train function. Most of the code is handled for you; you only need to get a set of predictions for the current batch and then calculate the current loss and accuracy. For the latter two calculations, make sure to use the criterion and the binary_accuracy function you are given. To calculate the batch predictions, extract the text of the current batch and run it through the model, which is passed in as a parameter.


In [0]:
## TODO: finish the training function
## iterator contains batches of the training data; 
## hint: use batch.text and batch.label to get access to the training data and labels
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        # TODO starts
        optimizer.zero_grad()
        batch_pred = model(batch.text).squeeze()
        batch_loss = criterion(batch_pred, batch.label)
        batch_acc = binary_accuracy(batch_pred, batch.label)

        ## Back
        batch_loss.backward()
        optimizer.step()
        epoch_loss += batch_loss.item()
        epoch_acc += batch_acc.item()

        ## TODO ends
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Evaluation function

This step is to copy and paste what you did in the training function into the evaluate function. This time, there’s no additional optimization after the predictions, loss, and accuracy are calculated.

In [0]:
## TODO: finish the evaluation function
## iterator contains batches of the evaluation data (validation or test); 
## hint: this function is very similar to the training function
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            
            ## TODO starts
            batch_pred = model(batch.text).squeeze()
            batch_loss = criterion(batch_pred, batch.label)
            batch_acc = binary_accuracy(batch_pred, batch.label)

            epoch_loss += batch_loss.item()
            epoch_acc += batch_acc.item()

            ## TODO ends
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Start training
It may take a few minutes in total. The validation accuracy is around 50-51%.



In [17]:
N_EPOCHS = 5

best_valid_loss = float('inf')
# let's train 5 epochs
for epoch in range(N_EPOCHS):
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
      
    # we keep track of the best model, and save it
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best_model.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

	Train Loss: 0.698 | Train Acc: 49.54%
	 Val. Loss: 0.702 |  Val. Acc: 49.65%
	Train Loss: 0.698 | Train Acc: 49.88%
	 Val. Loss: 0.702 |  Val. Acc: 49.19%
	Train Loss: 0.700 | Train Acc: 50.22%
	 Val. Loss: 0.696 |  Val. Acc: 48.90%
	Train Loss: 0.699 | Train Acc: 49.63%
	 Val. Loss: 0.694 |  Val. Acc: 51.24%
	Train Loss: 0.697 | Train Acc: 49.74%
	 Val. Loss: 0.697 |  Val. Acc: 51.19%
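
The checkpoint rule in the loop can be replayed on the validation losses printed above to see which epochs overwrite `best_model.pt`. This is a plain-Python replay of the comparison only; nothing is trained or saved.

```python
# "Val. Loss" values copied from the five epochs in the log above.
valid_losses = [0.702, 0.702, 0.696, 0.694, 0.697]

best_valid_loss = float("inf")
saved_epochs = []
for epoch, valid_loss in enumerate(valid_losses):
    if valid_loss < best_valid_loss:     # strict '<': a tie does not overwrite the file
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)

print(saved_epochs, best_valid_loss)
```

With zero-based epoch numbers, the checkpoint is written at epochs 0, 2, and 3, so the restored model is the one from the fourth epoch (validation loss 0.694), not the last one.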


### Restore the best model and evaluate

The test accuracy is expected to be around 47%; the run below reached 49.51%.



In [18]:
model.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)


print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.694 | Test Acc: 49.51%


## 3. LSTM
This step of the assignment is to modify your RNN into a bidirectional LSTM network, a kind of model that is expected to perform much better than the previous one.

1. You’ll be making changes to your model in the RNN class. In the `__init__` method, for the rnn layer, use nn.LSTM and make sure you pass in the bidirectional argument. Also note that the fully connected layer now has to map from two hidden states (forward and backward).
2. In the forward pass, not much changes from before, besides the addition of the cell state. Also note that you’ll have to concatenate the final forward hidden state and the final backward hidden state. If any of this is unclear, look up an example of how nn.LSTM works for clarification.
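
The concatenation in step 2 can be pictured with nested lists standing in for the hidden tensor of shape [2, batch size, hidden_dim]. The numbers and the helper `concat_directions` are made up for illustration; in the model itself the same step is done on tensors.

```python
# hidden[direction][batch_index] is a list of hidden_dim values (toy numbers).
hidden = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],   # final state of the forward direction
    [[1.1, 1.2, 1.3], [1.4, 1.5, 1.6]],   # final state of the backward direction
]

def concat_directions(hidden):
    """Join forward and backward states per example, along the feature axis."""
    forward, backward = hidden[-2], hidden[-1]
    return [f + b for f, b in zip(forward, backward)]

features = concat_directions(hidden)
print(len(features), len(features[0]))
```

Each example ends up with a vector of length hidden_dim*2, which is why the fully connected layer must accept twice the hidden size.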


In [0]:

class RNN(nn.Module):
    # TODO: IMPLEMENT THIS FUNCTION
    # Initialize the three layers in the RNN, self.embedding, self.rnn, and self.fc
    # Each one has a corresponding function in nn
    # embedding maps from input_dim->embedding_dim
    # rnn maps from embedding_dim->hidden_dim
    # fc maps from hidden_dim*2->output_dim
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, bidirectional):
        super().__init__()
        
        ## CHANGE THESE DEFINITIONS
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, bidirectional=bidirectional)
        self.fc = nn.Linear(2 * hidden_dim, output_dim)
       
    # TODO: IMPLEMENT THIS FUNCTION
    # x has dimensions [sentence length, batch size]
    # embedded has dimensions [sentence length, batch size, embedding_dim]
    # output has dimensions [sentence length, batch size, hidden_dim*2] (since bidirectional)
    # hidden has dimensions [2, batch size, hidden_dim]
    # cell has dimensions [2, batch size, hidden_dim]
    # Need to concatenate the final forward and backward hidden layers
    def forward(self, x):
        embedded = self.embedding(x)
        output, (hidden, cell) = self.rnn(embedded)
        # after concatenation, hidden has dimensions [batch size, hidden_dim*2]
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), 1)
        
        # no squeeze needed here: the leading dimension is already the batch
        return self.fc(hidden)

In [0]:
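# apply our RNN model here
BIDIRECTIONAL = True
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, BIDIRECTIONAL)
## setup device
model = model.to(device)
criterion = criterion.to(device)
# the optimizer created earlier still holds the parameters of the first model,
# so build a new one for the LSTM; otherwise its weights are never updated
optimizer = optim.Adam(model.parameters())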

It may take a few minutes in total. The validation accuracy is around 50%.

In [21]:
# train again!
best_valid_loss = float('inf')
# let's train 5 epochs
for epoch in range(N_EPOCHS):
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
      
    # we keep track of the best model, and save it
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best_model_LSTM.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

	Train Loss: 0.694 | Train Acc: 49.90%
	 Val. Loss: 0.693 |  Val. Acc: 49.79%
	Train Loss: 0.694 | Train Acc: 49.75%
	 Val. Loss: 0.693 |  Val. Acc: 49.79%
	Train Loss: 0.694 | Train Acc: 49.87%
	 Val. Loss: 0.693 |  Val. Acc: 49.79%
	Train Loss: 0.694 | Train Acc: 49.80%
	 Val. Loss: 0.693 |  Val. Acc: 49.79%
	Train Loss: 0.694 | Train Acc: 49.91%
	 Val. Loss: 0.693 |  Val. Acc: 49.79%


In [22]:
model.load_state_dict(torch.load('best_model_LSTM.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)


print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.696 | Test Acc: 50.15%


**Question**: Do you think the LSTM is working better than the RNN? Why or why not? How do the LSTM and the RNN compare (model complexity, etc.)?
Please answer in the next Text cell.

**Answer**: LSTM should work better than RNN. 

An RNN is a class of neural network that feeds its previous hidden state back in as input at each step, so its internal state (memory) lets it process variable-length input sequences and capture information about what has been computed so far. In other words, the output at each step depends on the computations at earlier steps. This design has two main issues: 1) vanishing and exploding gradients; 2) the current state cannot take any future input into account. An LSTM, on the other hand, is a modified recurrent network that makes it easier to retain past information in memory. It mitigates the vanishing gradient problem by using a memory cell controlled by three gates (input gate, output gate, and forget gate).
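
The vanishing and exploding gradient issue can be illustrated with a deliberately simplified scalar model: if backpropagating through each time step scales the gradient by a roughly constant factor, the total scale after many steps is that factor raised to the number of steps. Real networks multiply Jacobian matrices rather than a single number, so this is only a rough intuition, and `gradient_scale` is a hypothetical helper.

```python
def gradient_scale(per_step_factor, num_steps):
    """Overall scaling after `num_steps` identical per-step multiplications."""
    return per_step_factor ** num_steps

# A review of 100 tokens means roughly 100 recurrent steps to backpropagate through.
shrinking = gradient_scale(0.9, 100)
stable = gradient_scale(1.0, 100)
growing = gradient_scale(1.1, 100)
print(shrinking, stable, growing)
```

A factor slightly below 1 leaves almost no gradient for the earliest tokens, and a factor slightly above 1 blows up. The additive cell-state update of an LSTM is designed to keep this effective factor closer to 1.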

Furthermore, I implemented the bidirectional LSTM above. A unidirectional LSTM only preserves information from past states because the only inputs it sees are from the past. A bidirectional LSTM processes the input in two directions, one from past to future and the other from future to past, so it can preserve information from both the past and the future.
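
On model complexity, the size of the recurrent layers can be estimated by hand from the hyperparameters used above (EMBEDDING_DIM = 100, HIDDEN_DIM = 256). The sketch assumes the usual single-layer parameter layout, with an input weight matrix, a hidden weight matrix, and two bias vectors per gate-sized block; the helper names are hypothetical, and the embedding and linear layers are left out.

```python
def rnn_params(input_size, hidden_size):
    """Input weights + hidden weights + two bias vectors for one simple RNN layer."""
    return hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size

def lstm_params(input_size, hidden_size, bidirectional=False):
    """An LSTM holds four gate-sized blocks per direction."""
    per_direction = 4 * rnn_params(input_size, hidden_size)
    return per_direction * (2 if bidirectional else 1)

EMBEDDING_DIM, HIDDEN_DIM = 100, 256
rnn_total = rnn_params(EMBEDDING_DIM, HIDDEN_DIM)
bilstm_total = lstm_params(EMBEDDING_DIM, HIDDEN_DIM, bidirectional=True)
print(rnn_total, bilstm_total, bilstm_total // rnn_total)
```

Under these assumptions the bidirectional LSTM layer has eight times as many recurrent parameters as the simple RNN layer (a factor of 4 from the gates and a factor of 2 from the two directions), so it costs more time and memory per step.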


## Submission

Now that you have completed the assignment, follow the steps below to submit your assignment:
1. Click __Runtime__  > __Run all__ to generate the output for all cells in the notebook.
2. Save the notebook with the output from all the cells in the notebook by clicking __File__ > __Download .ipynb__.
3. **Keep the output cells**, and the answers to the question in the Text cell. 
4. Put the .ipynb file under your hidden directory on the Zoo server `~/hidden/<YOUR_PIN>/Homework2/`.
5. As a final step, run a script that will set up the permissions to your homework files, so we can access and run your code to grade it. Make sure the command below runs without errors, and do not make any changes or run the code again afterwards. If you do run the code again or make any changes, you need to run the permissions script again. Submissions without the correct permissions may incur some grading penalty.
`/home/classes/cs477/bash_files/hw2_set_permissions.sh <YOUR_PIN>`