# Welcome to the PyTorch Sentiment Analysis Exercise

This exercise will cover some key concepts in natural language machine learning, including:

* The preparation of text datasets
* The construction of neural net models for natural language tasks
* The use of *recurrent neural network layers,* which are commonly used in problems involving sequences, including natural language tasks.

## Notes on Using This Notebook

* Code will be provided for boilerplate tasks; in other places, you will need to fill in code to complete the exercise. Cells you need to fill in will be flagged with the **Exercise** heading.
* The code cells are, in general, meant to be run in order. If you think a code cell should be working, but it isn't, verify that all previous cells were run - the cell you're having trouble with may depend on a variable or file that is created in a previous cell.
* Class names and other text normally meant for consumption by a computer will be rendered in a `monospace font`. This will hopefully reduce confusion between, e.g., the word "dataset" referring to the concept of a cohesive body of data, and the class name `Dataset` referring to the related PyTorch class.

### Do This Now:

The cell below downloads and unzips the dataset we'll be using for this exercise. The dataset isn't huge, but it may take a minute to download, so **please uncomment and execute the following code cell now** to get the process started. (The commented lines are there to prevent the download triggering accidentally, so you may wish to replace them afterward.)

In [0]:
# !curl -0 https://s3-us-west-1.amazonaws.com/pytorch-course-datasets/sentiment-analysis-on-movie-reviews.zip > reviews.zip
# !unzip reviews.zip

## Introduction

This exercise is based on the Kaggle competition, [Sentiment Analysis on Movie Reviews](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/overview). The goal is to accurately classify movie reviews based on how positively they speak of the movie being reviewed.

### The Training Dataset

Let's have a look at the input data. In `train.tsv`, you should see a file with four fields: A phrase ID, a sentence ID, a phrase, and a sentiment score. (The score runs from 0 to 4, with higher numbers indicating more positive sentiment.) You should see the same content in `test.tsv`, but without the scores.

### The Approach

There are many approaches to Natural Language Processing (NLP) using deep learning, and they will often include a variety of common neural network constructs, including fully-connected layers, recurrent neural networks (RNNs), convnets, embeddings, and dropouts. For this exercise, we'll be guiding you through constructing a simple model using word embeddings and RNNs, and pointing out directions for further enhancement and exploration.

### The Final Step

The test dataset is a separate, unlabeled collection of phrases from movie reviews. The final step in today's exercise will be to use your model to classify the unlabeled phrases. You will export your predictions to a file and upload them to the Kaggle site to receive a final accuracy score.

## Setting Up Your Training Dataset

We'll have to build a dataset that can take in our TSV file and transform it to input tensors and labels that PyTorch can consume. For this first pass, we'll set aside the first two columns, and focus on the phrase and the label.

Let's look at the data:

In [0]:
!wc -l train.tsv
!head -n 10 train.tsv
!wc -l test.tsv
!head -n 10 test.tsv

Fortunately, the `torchtext` module provides us with a `TabularDataset` class that wraps the consumption of CSV, TSV, and JSON-formatted files.

In order to use the `torchtext` dataset facilities, we'll need to specify some `Field` objects that describe the data we expect from the dataset.

In [0]:
import torch
from torchtext import data

In [0]:
phrases = data.Field(include_lengths=True, tokenize='spacy')
labels = data.LabelField(dtype=torch.int64, sequential=False)

fields = [
    ('SKIP_phrase_id', None),
    ('SKIP_sentence_id', None),
    ('phrases', phrases),
    ('labels', labels)
]


train_data = data.TabularDataset(
    'train.tsv', # path to file
    'TSV', # file format
    fields,
    skip_header = True # we have a header row
)

In [3]:
# check_iter = iter(train_data)
# x = check_iter.__next__()
# print('---')
# print(x.phrases)
# print(x.labels) # str wtf

check_iter = iter(train_data)
maxlen = 0
for x in check_iter:
    maxlen = max(maxlen, len(x.phrases))
print(maxlen)

53


Below, we'll define some constants that we'll use for setting up our datasets and training loop:

In [0]:
# dataset constants
VOCAB_SIZE = 15000 # max size of vocabulary
VOCAB_VECTORS = "glove.6B.100d" # Stanford NLP GloVe (global vectors) for word rep

# model constants
EMBEDDING_SIZE = 100 # must match dimensions in vocab vectors above
HIDDEN_SIZE = 100
OUTPUT_SIZE = 5 # 0 to 4

# training loop constants
BATCH_SIZE = 64
EPOCHS = 20

An important step in any NLP process is *building the vocabulary.*

In [0]:
import random

random.seed()

train_data, eval_data = train_data.split()

phrases.build_vocab(train_data, max_size=VOCAB_SIZE, vectors=VOCAB_VECTORS)
labels.build_vocab(train_data)

PAD_INDEX = phrases.vocab.stoi[phrases.pad_token]
UNKNOWN_INDEX = phrases.vocab.stoi[phrases.unk_token]


CPU -> GPU

In [6]:
if not torch.cuda.is_available():
    device = torch.device('cpu')
    print('*** GPU not available - running on CPU. ***')
else:
    device = torch.device('cuda')
    print('GPU ready to go!')

GPU ready to go!


In [0]:
train_iter, eval_iter = data.BucketIterator.splits(
    (train_data, eval_data), 
    batch_size=BATCH_SIZE,
    device=device, sort=False)

## The Model

For this simple model, we'll be using an LSTM - a Long Short-Term Memory layer. This is a type of recurrent neural network that keeps an internal memory that has information added to and removed from it during training, which helps it deal with patterned sequential data.

Explanation follows the code below:

In [0]:
import torch.nn as nn
import torch.nn.functional as F
# class SentimentAnalyzer(nn.Module):
    
#     def __init__(self, vocab_size, embedding_dim, hidden_dim, pad_index):
#         super(SentimentAnalyzer, self).__init__()
        
#         self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
#         self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=0.5, num_layers=2, bidirectional=True)
#         self.scorer = nn.Linear(hidden_dim, 5)
    
#     def forward(self, phrases, lengths):
#         x = self.embed(phrases)
#         x = nn.utils.rnn.pack_padded_sequence(x, lengths, enforce_sorted=False)
#         _, (hidden_weights, cell_weights) = self.lstm(x)
#         # x, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
#         x = hidden_weights[-1, :, :].squeeze(0)
#         x = self.scorer(x)
#         return F.log_softmax(x)

class SentimentAnalyzer(nn.Module):
    
    def __init__(self, input_size, embedding_size, hidden_size, output_size):
        super(SentimentAnalyzer, self).__init__()
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(input_size, embedding_size)
        
        # TODO check whether we need batch_first=True
        self.rnn = nn.RNN(input_size=embedding_size, hidden_size=hidden_size)
        self.fc = nn.Linear(hidden_size, output_size)
        
    def forward(self, phrases, hidden):
#         print('PHRASES')
#         print(phrases.shape)
        if hidden is None:
            hidden = torch.zeros(1, BATCH_SIZE, self.hidden_size, dtype=torch.float).to(device)
#         print('HIDDEN')
#         print(hidden.shape)
        x = self.embedding(phrases)
#         print('EMBEDDING')
#         print(x.shape)
        x, hidden = self.rnn(x, hidden)
#         print('RNN')
#         print(x.shape)
#         print(hidden.shape)
        x = self.fc(hidden)
#         print('CLASSIFIER')
#         print(x.shape)
        return x.squeeze(0), hidden

In [60]:
# sa = SentimentAnalyzer(len(phrases.vocab), EMBEDDING_DIM, HIDDEN_DIM, PAD_INDEX)
sa = SentimentAnalyzer(len(phrases.vocab), EMBEDDING_SIZE, HIDDEN_SIZE, OUTPUT_SIZE)
def count_model_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print('The model has {} trainable parameters'.format(count_model_params(sa)))
print(sa)

The model has 1520905 trainable parameters
SentimentAnalyzer(
  (embedding): Embedding(15002, 100)
  (rnn): RNN(100, 100)
  (fc): Linear(in_features=100, out_features=5, bias=True)
)


In [61]:
pretrained_embeddings = phrases.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([15002, 100])


In [62]:
sa.embedding.weight.data.copy_(pretrained_embeddings)
# sa.embedding.weight.requires_grad = False
# print(sa.embedding.weight.data.shape)
# print(pretrained_embeddings.shape)
# print(len(phrases.vocab))

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0869,  0.1346,  0.0688,  ..., -0.8253, -0.1474,  0.2279],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0527, -0.0618,  0.0736,  ..., -0.1710,  0.1995,  0.3998]])

In [63]:
sa.embedding.weight.data[UNKNOWN_INDEX] = torch.zeros(EMBEDDING_SIZE)
sa.embedding.weight.data[PAD_INDEX] = torch.zeros(EMBEDDING_SIZE)

print(sa.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0869,  0.1346,  0.0688,  ..., -0.8253, -0.1474,  0.2279],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0527, -0.0618,  0.0736,  ..., -0.1710,  0.1995,  0.3998]])


In [0]:
import time

def tlog(msg):
    print('{}   {}'.format(time.asctime(), msg))

In [0]:
def count_correct(guesses, labels):
    # g, l = unscale_sentiment_score(guesses), unscale_sentiment_score(labels)
    return (guesses == labels).float().sum()    

In [0]:
def train(model, iterator, loss_fn, optimizer): # one epoch
    curr_loss = 0.
    curr_correct = 0.
    hidden = None
    # model.train() # makes sure that training-only fns, like dropout, are active
    
    for batch in iterator:
        # get the data
        phrases, lengths = batch.phrases
        
        # predict and learn
        optimizer.zero_grad()
        guesses, hidden = model(phrases, hidden)

#         print('***')
#         print(guesses.shape)
        # guesses = guesses.squeeze(0)
#         print('GUESSES')
#         print(guesses.shape)
#         print('LABELS')
#         print(batch.labels.shape)
        loss = loss_fn(guesses, batch.labels)
        loss.backward()
        optimizer.step()
        
        # measure
#         curr_loss += loss.item()
#         curr_correct += count_correct(torch.argmax(guesses, 1), batch.labels)
        
    return curr_loss / len(iterator), curr_correct / (len(iterator) * BATCH_SIZE)        

In [0]:
def evaluate(model, iterator, loss_fn):
    curr_loss = 0.
    curr_correct = 0.
    hidden = None
    model.eval() # makes sure that training-only fns, like dropout, are inactive
    
    with torch.no_grad(): # not training
        for batch in iterator:
            # get the data
            phrases, lengths = batch.phrases
            
            # predict
            guesses, hidden = model(phrases, hidden) # .squeeze(1)
            loss = loss_fn(guesses, batch.labels)
            
            # measure
            curr_loss += loss.item()
            curr_correct += count_correct(guesses, batch.labels)

    
    return curr_loss / len(iterator), curr_correct / (len(iterator) * BATCH_SIZE)        

In [0]:
def learn(model):
    model = model.to(device)

    loss_fn = torch.nn.CrossEntropyLoss()
    # loss_fn = loss_fn.to(device)

    optimizer = torch.optim.Adam(model.parameters())
    
    for epoch in range(EPOCHS):
        tlog('EPOCH {}'.format(epoch))
        
        train_loss, train_acc = train(model, train_iter, loss_fn, optimizer)
        tlog('  Training loss {}'.format(train_loss))
        tlog('  Training accuracy {}'.format(train_acc))
        
        eval_loss, eval_acc = evaluate(model, eval_iter, loss_fn)
        tlog('  Validation loss {}'.format(eval_loss))
        tlog('  Validation accuracy {}'.format(eval_acc))
    
    tlog('DONE')

In [71]:
# print(type(sa))

learn(sa)
# sa.embedding.weight.data.copy_(pretrained_embeddings)
# print(type(sa.embedding.weight.data))
# print(type(sa.embedding.weight))
# print(type(sa.embedding))
# sa.embedding.weight.requires_grad = False
# print(sa.embedding.weight.requires_grad)
# print(pretrained_embeddings.shape)
# print(len(phrases.vocab))

Thu May 23 17:56:17 2019   EPOCH 0


RuntimeError: ignored