<a href="https://colab.research.google.com/github/ajfisch/deeplearning_bootcamp_2020/blob/master/deeplearning_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Task: Beer Review Sentiment Analysis (Deep Learning)

In this tutorial, we'll build on the lab1 tutorial and implement neural networks that learn to analyze beer reviews.

Let's get started! First, run the following cells to import PyTorch and download the data again.

In [0]:
# We use the CountVectorizer again to create the vocab.
from sklearn.feature_extraction.text import CountVectorizer

# Torch modules.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data  # Dataset and DataLoader

# Gives a progress bar
from tqdm import tqdm

# Utilities for plotting.
import matplotlib.pyplot as plt
import numpy as np
import pickle


In [0]:
!apt-get install wget
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_train.p
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_dev.p
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_test.p

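# Load each split, closing the file handle once it has been read.
with open("overall_train.p", "rb") as f:
    train_set = pickle.load(f)
with open("overall_dev.p", "rb") as f:
    dev_set = pickle.load(f)
with open("overall_test.p", "rb") as f:
    test_set = pickle.load(f)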

# Lowercase and strip each review, and turn each sample into a (text, label) tuple
def preprocess_data(data):
    for indx, sample in enumerate(data):
        text, label = sample['text'], sample['y']
        text = text.lower().strip()
        data[indx] = text, label
    return data

# Preprocess all the data splits.
train_set = preprocess_data(train_set)
dev_set = preprocess_data(dev_set)
test_set =  preprocess_data(test_set)

# Separate components into X and Y lists.
trainText = [t[0] for t in train_set]
trainY = [t[1] for t in train_set]

devText = [t[0] for t in dev_set]
devY = [t[1] for t in dev_set]

testText = [t[0] for t in test_set]
testY = [t[1] for t in test_set]

# A word has to appear in at least 5 training reviews to be in the vocab,
# and we keep only the 1000 most frequent words.
min_df = 5
max_features = 1000
countVec = CountVectorizer(min_df=min_df, max_features=max_features)

# Learn vocabulary from train set
countVec.fit(trainText)

# Transform each list of reviews into a matrix of bag-of-words vectors
trainX = countVec.transform(trainText)
devX = countVec.transform(devText)
testX = countVec.transform(testText)
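
To see what a bag-of-words matrix looks like, here is a minimal sketch on a made-up three-review corpus. The `toy_reviews`, `toy_vec`, and `toy_X` names are hypothetical and are not part of the beer review data; the vectorizer is used with its default settings rather than the `min_df` and `max_features` values above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not taken from the beer review data.
toy_reviews = [
    "great beer great taste",
    "bad beer",
    "great aroma",
]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_reviews)

# Recover the vocabulary in column order (columns are sorted alphabetically).
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

# Each row is one review, each column counts one vocabulary word.
counts = toy_X.toarray()

print(vocab)   # ['aroma', 'bad', 'beer', 'great', 'taste']
print(counts)  # first row is [0, 0, 1, 2, 1]: "great" appears twice
```

The matrix returned by `transform` is sparse, which is why the dataset class in Step 1 converts each row to a dense array before handing it to PyTorch.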

# Step 1: PyTorch Dataset

Datasets are abstractions that hold data for you. As long as you define `__len__` and `__getitem__`, they can be used to pipe data into your training routine.

In [0]:
# Define a Beer review dataset
class BeerReviewDataset(torch.utils.data.Dataset):
    
    def __init__(self, X, Y):
        self.dataset = (X, Y)
        assert X.shape[0] == len(Y)

    def __len__(self):
        # Returns the number of points in the dataset.
        return self.dataset[0].shape[0]

    def __getitem__(self, i):
        # Returns the count vector (shape 1 x vocab size) as x and the label as y.
        return np.array(self.dataset[0][i].todense()[0]), self.dataset[1][i]

train = BeerReviewDataset(trainX, trainY)
dev = BeerReviewDataset(devX, devY)
test = BeerReviewDataset(testX, testY)

# Step 2: Define the Model

In [0]:
class Model(nn.Module):
   
    def __init__(self):
        super(Model, self).__init__()
        # 1000 input features (the vocab size, max_features) -> 3 class scores.
        self.fully_connected = nn.Linear(1000, 3)

    def forward(self, x):
        return self.fully_connected(x)


# Exercise 1:

The model above is just a linear model: a single `nn.Linear` layer that maps the 1000 word counts directly to 3 class scores.

Add a non-linearity (`F.relu`) and an extra layer.
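
If you get stuck, here is one possible sketch of the exercise. The class name `TwoLayerModel` and the hidden size of 128 are assumptions chosen for illustration, not values from the lab; try your own version first.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLayerModel(nn.Module):
    """Hypothetical solution sketch: Linear -> ReLU -> Linear."""

    def __init__(self, input_dim=1000, hidden_dim=128, num_classes=3):
        super().__init__()
        # hidden_dim=128 is an arbitrary choice for this sketch.
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # The ReLU between the two layers is what makes the model non-linear.
        h = F.relu(self.hidden(x))
        return self.output(h)


# Quick shape check on a fake batch of 4 all-zero count vectors.
mlp = TwoLayerModel()
fake_batch = torch.zeros(4, 1000)
scores = mlp(fake_batch)
print(scores.shape)  # one row of 3 class scores per example
```

Without the ReLU, two stacked linear layers would collapse into a single linear map, so the extra layer alone would not add any modeling power.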

# Step 3: Training




In [0]:
# Training settings
batch_size = 64
epochs = 10
lr = .01
momentum = 0.5

model = Model()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

# Shuffle only the training data; evaluation order does not affect the metrics.
train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
dev_loader = torch.utils.data.DataLoader(dev, batch_size=batch_size, shuffle=False)
test_loader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=False)


In [0]:

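# Peek at a single batch to check the tensor shapes.
for batch in train_loader:
    print(batch[0].shape)  # inputs: (batch_size, 1, vocab_size)
    print(batch[1].shape)  # labels: (batch_size,)
    break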


To train our model, we repeat the following steps:

1) randomly sample a batch from our train loader and run it through the model

2) compute our loss (using the standard `cross_entropy`)

3) compute our gradients (by calling `backward()` on our loss)

4) update our neural network with an `optimizer.step()`, and go back to 1)

The code below also logs our accuracy and average loss for the epoch.
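
The `cross_entropy` loss in step 2 takes the raw scores from the model, not probabilities. The sketch below, using made-up scores for two reviews, checks that it matches a log-softmax followed by the negative log-probability of the correct class. The `logits`, `targets`, and `manual_loss` names are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical scores for a batch of 2 reviews and 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# cross_entropy applies log_softmax to the raw scores, then takes the
# negative log-probability of the correct class, averaged over the batch.
loss = F.cross_entropy(logits, targets)

# The same computation written out by hand.
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(2), targets].mean()

print(loss.item(), manual_loss.item())  # the two values agree
```

Because the default reduction is a mean over the batch, the value returned for each batch is already a per-sample average.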


In [0]:
def train_epoch(model, train_loader, optimizer, epoch):
    model.train() # Set the nn.Module to train mode. 
    total_loss = 0
    correct = 0
    num_samples = len(train_loader.dataset)
    
    # Iterate over batches of data.
    for batch_idx, (x, target) in enumerate(train_loader):
        x = x.float().squeeze(1)
        
        # Reset gradient data to 0
        optimizer.zero_grad()
        
        # 1) Get the prediction for batch
        output = model(x)
        
        # 2) Compute loss
        loss = F.cross_entropy(output, target)
        
        # 3) Do backprop
        loss.backward()
        
        # 4) Update model
        optimizer.step()
        
        # Do book-keeping to track accuracy and avg loss
        pred = output.max(1, keepdim=True)[1] # get the index of the highest class score
        correct += pred.eq(target.view_as(pred)).sum().item()
        # The loss is a batch mean, so weight it by the batch size
        # to get a true per-sample average at the end of the epoch.
        total_loss += loss.item() * x.size(0)

    print('Train Epoch: {} \tLoss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(
            epoch, total_loss / num_samples, 
            correct, 
            num_samples,
            100. * correct / num_samples))


# Step 4: Evaluation
Similar to above, we also loop through our dev or test set and compute our loss and accuracy, this time without updating the model.
This lets us see how well our model generalizes to data it was not trained on.
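
The accuracy bookkeeping used in both the training and evaluation loops can be checked by hand on a tiny batch. The scores and labels below are made up for illustration.

```python
import torch

# Hypothetical model outputs for 4 reviews and 3 classes.
output = torch.tensor([[0.2, 1.5, -0.3],
                       [2.0, 0.1, 0.4],
                       [-1.0, 0.0, 0.5],
                       [0.3, 0.9, 0.1]])
target = torch.tensor([1, 0, 1, 1])

# Index of the highest score in each row, kept as a column of shape (4, 1).
pred = output.max(1, keepdim=True)[1]

# Reshape the targets to match pred, then count the matches.
correct = pred.eq(target.view_as(pred)).sum().item()
accuracy = correct / len(target)

print(pred.squeeze(1).tolist(), correct, accuracy)  # → [1, 0, 2, 1] 3 0.75
```

Only the third review is misclassified (predicted class 2, labeled 1), so 3 of 4 predictions are correct.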

In [0]:
def eval_epoch(model, test_loader, name):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # No gradients are needed for evaluation.
        for data, target in test_loader:
            data = data.float().squeeze(1)
            target = target.long()
            output = model(data)
            test_loss += F.cross_entropy(output, target, reduction='sum').item() # sum up per-sample losses
            pred = output.max(1, keepdim=True)[1] # get the index of the highest class score
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\n{} set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        name,
        test_loss, 
        correct, 
        len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


# Step 5: Everything Together

In [0]:

for epoch in range(1, epochs + 1):
    train_epoch(model, train_loader, optimizer, epoch)
    eval_epoch(model, dev_loader, "Dev")
    print("---")

In [0]:
eval_epoch(model, test_loader, "Test")

# Exercise 2:

1. What is the training accuracy?
2. Try changing the learning rate or the batch size!
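
One way to organize the experiments in item 2 is a small sweep that trains a fresh model for each setting. The sketch below is self-contained and runs on synthetic stand-in data rather than the beer reviews; `run_experiment`, `toy_data`, and the grid of values are all hypothetical choices for illustration. For the real task, you would swap in the `Model`, `train`, and `dev` objects defined earlier and compare dev accuracy instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (an assumption, NOT the beer reviews):
# 256 samples, 20 features, 3 classes derived from the first 3 features.
X = torch.randn(256, 20)
Y = X[:, :3].argmax(dim=1)
toy_data = TensorDataset(X, Y)


def run_experiment(lr, batch_size, epochs=5):
    """Train a fresh linear model and return its final training accuracy."""
    model = nn.Linear(20, 3)
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.5)
    loader = DataLoader(toy_data, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, target in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), target)
            loss.backward()
            optimizer.step()
    # Measure accuracy on the full dataset without tracking gradients.
    with torch.no_grad():
        pred = model(X).argmax(dim=1)
    return (pred == Y).float().mean().item()


# Try every combination of learning rate and batch size.
results = {}
for lr in [0.001, 0.01, 0.1]:
    for batch_size in [16, 64]:
        results[(lr, batch_size)] = run_experiment(lr, batch_size)
        print(f"lr={lr}, batch_size={batch_size}: "
              f"accuracy={results[(lr, batch_size)]:.2f}")
```

Creating a new model and optimizer inside `run_experiment` matters: reusing an already trained model would make later settings look better than they are.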