# Classifying Sentiment of Restaurant Reviews

## The Yelp Review Dataset
In 2015, Yelp held a contest in which it asked participants to predict the rating of a restaurant given its review. Zhang, Zhao, and LeCun (2015) simplified the dataset by converting the 1- and 2-star ratings into a “negative” sentiment class and the 3- and 4-star ratings into a “positive” sentiment class, and split it into 560,000 training samples and 38,000 testing samples. In this example we use the simplified Yelp dataset, with two minor differences. In the remainder of this section, we describe how we minimally clean the data and derive our final dataset. Then, we outline the implementation, which uses PyTorch’s `Dataset` class.

First, let's load the dataset. Rather than reading the CSV with `pandas` directly, we load it through `ReviewDataset`, which wraps the `pandas` functionality.

In [3]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
from text import ReviewDataset, ReviewVectorizer, Vocabulary
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


dataset = ReviewDataset.load_dataset_and_make_vectorizer(
    'data/yelp/reviews_with_splits_lite.csv',
)
vectorizer = dataset.get_vectorizer()
print("Length of dataset: {}".format(len(dataset)))

Length of dataset: 39200


`ReviewDataset` inherits from `Dataset`, an abstract class that declares the methods its concrete subclasses must implement.

In [4]:
Dataset??

Init signature: Dataset()
Source:
class Dataset(object):
    """An abstract class representing a Dataset.

    All other datasets should subclass it. All subclasses should override
    ``__len__``, that provides the size of the dataset, and ``__getitem__``,
    supporting integer indexing in range from 0 to len(self) exclusive.
    """

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

In [5]:
dataset[0]

{'x_data': array([1., 1., 1., ..., 0., 0., 0.], dtype=float32), 'y_target': 0}
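
To make the `__len__` / `__getitem__` contract concrete, here is a minimal sketch of a `Dataset` subclass. `ToyReviewDataset` and its hard-coded rows are hypothetical and are not the `ReviewDataset` imported from `text`; the only thing borrowed from this notebook is the item layout, a dict with `x_data` and `y_target` keys.

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader


class ToyReviewDataset(Dataset):
    """Hypothetical stand-in for ReviewDataset: four tiny, made-up examples."""

    def __init__(self):
        # each row is a collapsed one-hot vector over a 3-word toy vocabulary
        self._x = np.array([[1, 0, 1],
                            [0, 1, 0],
                            [1, 1, 0],
                            [0, 0, 1]], dtype=np.float32)
        self._y = np.array([1, 0, 1, 0], dtype=np.int64)

    def __len__(self):
        # number of examples in the dataset
        return len(self._x)

    def __getitem__(self, index):
        # same dict layout as the dataset[0] output shown above
        return {'x_data': self._x[index], 'y_target': self._y[index]}


toy_dataset = ToyReviewDataset()
# DataLoader only relies on __len__ and __getitem__ to build batches
toy_batch = next(iter(DataLoader(toy_dataset, batch_size=2)))
print(len(toy_dataset), toy_batch['x_data'].shape, toy_batch['y_target'].shape)
```

Because every item is a dict, the default batching in `DataLoader` returns a dict of stacked tensors with the same keys.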

In [6]:
vectorizer.vectorize("hello I love this")

array([1., 0., 0., ..., 0., 0., 0.], dtype=float32)

We will use a very basic perceptron classifier:

## Perceptron classifier

In [7]:
import torch.nn as nn
import torch.nn.functional as F

class ReviewClassifier(nn.Module):
    """ a simple perceptron-based classifier """
    def __init__(self, num_features):
        """
        Args:
            num_features (int): the size of the input feature vector
        """
        super(ReviewClassifier, self).__init__()
        self.fc1 = nn.Linear(in_features=num_features, 
                             out_features=1)

    def forward(self, x_in, apply_sigmoid=False):
        """The forward pass of the classifier
        
        Args:
            x_in (torch.Tensor): an input data tensor 
                x_in.shape should be (batch, num_features)
            apply_sigmoid (bool): a flag for the sigmoid activation
                should be false if used with the cross-entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch,).
        """
        y_out = self.fc1(x_in).squeeze()
        if apply_sigmoid:
            y_out = torch.sigmoid(y_out)
        return y_out


In [8]:
# model
classifier = ReviewClassifier(num_features=len(vectorizer.review_vocab))

## Loss function and Optimizer

Here, we define the loss function and the optimizer.

`BCEWithLogitsLoss` is a binary cross-entropy loss. PyTorch offers two versions: `BCELoss` and `BCEWithLogitsLoss`. `BCELoss` expects the output of a sigmoid function (probabilities), whereas `BCEWithLogitsLoss` expects the raw logits and applies the sigmoid internally. Working through the math shows that computing the cross-entropy directly from the logits is more numerically stable than applying a sigmoid followed by `BCELoss`.
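
The stability argument can be checked by hand. The sketch below uses plain Python rather than PyTorch, so `bce_naive` and `bce_with_logits` are illustrative names, not library functions. The second function uses the standard rearrangement max(z, 0) - z*y + log(1 + exp(-|z|)), which is algebraically the same loss but never evaluates log(0).

```python
import math


def bce_naive(z, y):
    """Apply the sigmoid first, then take logs (sigmoid followed by BCE)."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(z, y):
    """Work on the logit z directly, without forming the probability."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# for a moderate logit the two routes agree
print(bce_naive(2.0, 1), bce_with_logits(2.0, 1))

# for a large logit the sigmoid rounds to exactly 1.0, so log(1 - p) is log(0)
try:
    bce_naive(40.0, 0)
except ValueError as err:
    print("naive route failed:", err)
print(bce_with_logits(40.0, 0))
```

PyTorch's `BCELoss` guards against this case differently (its documentation describes clamping the log outputs), so the exception here is specific to this naive sketch; the point is only that the logit form needs no such guard.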

For the optimizer, we will use `Adam`, a fairly standard option nowadays.

In [9]:
from torch import optim

lr = 0.001

# loss and optimizer
loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=lr)

## Training Loop

Now, let's train our model.

Since the model returns logits, we apply the sigmoid before thresholding at 0.5 when computing the accuracy.

In [10]:

def compute_accuracy(y_pred, y_target):
    y_pred_indices = (torch.sigmoid(y_pred)>0.5).cpu().long()
    
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100
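
As a quick sanity check of the thresholding, here is a small sketch on made-up logits and targets (the values are illustrative, not taken from the model). It also shows that thresholding the sigmoid at 0.5 gives the same predictions as checking whether the logit is positive.

```python
import torch

# made-up logits and labels for four reviews
logits = torch.tensor([-2.0, -0.1, 0.3, 4.0])
targets = torch.tensor([0, 1, 1, 1])

# route used by compute_accuracy: sigmoid, then threshold at 0.5
preds_from_probs = (torch.sigmoid(logits) > 0.5).long()
# equivalent route: sigmoid(z) > 0.5 exactly when z > 0
preds_from_logits = (logits > 0).long()

n_correct = torch.eq(preds_from_probs, targets).sum().item()
accuracy = n_correct / len(targets) * 100
print(preds_from_probs.tolist(), preds_from_logits.tolist(), accuracy)
```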

Training consists of *epochs*, each of which is a complete pass through all the elements of the training set.

In mini-batch stochastic gradient descent, we do not use the whole dataset for each update; instead, we compute the gradient on smaller *batches*.

In [11]:
loader = DataLoader(dataset=dataset, batch_size=1000)

batch_iter = iter(loader)

batch = next(batch_iter)

batch.keys()

dict_keys(['x_data', 'y_target'])

Now we can run the full training loop.

**Note**: `running_loss` and `running_acc` are updated incrementally. After each batch, they hold the mean over all batches seen so far in the current split.
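
The incremental update follows from writing the mean of the first k values as m_k = m_(k-1) + (x_k - m_(k-1)) / k. A minimal sketch with made-up per-batch losses shows that it ends at the ordinary arithmetic mean:

```python
batch_losses = [0.9, 0.7, 0.4, 0.6]  # made-up per-batch losses

running_loss = 0.0
for batch_index, loss_batch in enumerate(batch_losses):
    # same update rule as in the training loop
    running_loss += (loss_batch - running_loss) / (batch_index + 1)

plain_mean = sum(batch_losses) / len(batch_losses)
print(running_loss, plain_mean)
```

The advantage of this form is that the per-batch values never need to be stored.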

In [12]:
epochs = 20
batch_size = 128
# Let's save the info here

train_state = {
    'epoch_index': 0,
    'train_loss': [],
    'train_acc': [],
    'val_loss': [],
    'val_acc': [],
    'test_loss': -1,
    'test_acc': -1
}



for epoch_index in range(epochs):
    if epoch_index % 5 == 0 and epoch_index > 0:
        print("{:<3} epoch".format(epoch_index))
    train_state['epoch_index'] = epoch_index

    # Iterate over training dataset

    # setup: batch generator, set loss and acc to 0, set train mode on
    dataset.set_split('train')
    
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                        shuffle=True, drop_last=True)

    running_loss = 0.0
    running_acc = 0.0
    classifier.train()
    
    for batch_index, batch_dict in enumerate(dataloader):
        # the training routine is 5 steps:

        # step 1. zero the gradients
        optimizer.zero_grad()

        # step 2. compute the output
        y_pred = classifier(x_in=batch_dict['x_data'].float())

        # step 3. compute the loss
        loss = loss_func(y_pred, batch_dict['y_target'].float())
        loss_batch = loss.item()
        
        # update the running mean of the loss over the batches seen so far
        running_loss += (loss_batch - running_loss) / (batch_index + 1)

        # step 4. use loss to produce gradients
        loss.backward()
        # step 5. use optimizer to take gradient step
        optimizer.step()

        # -----------------------------------------
        # compute the accuracy
        acc_batch = compute_accuracy(y_pred, batch_dict['y_target'])
        running_acc += (acc_batch - running_acc) / (batch_index + 1)

    train_state['train_loss'].append(running_loss)
    train_state['train_acc'].append(running_acc)

    # Iterate over val dataset

    # setup: batch generator, set loss and acc to 0, set eval mode on
    dataset.set_split('val')
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                        shuffle=True, drop_last=True)
    running_loss = 0.
    running_acc = 0.
    classifier.eval()

    for batch_index, batch_dict in enumerate(dataloader):

        # step 1. compute the output
        y_pred = classifier(x_in=batch_dict['x_data'].float())

        # step 2. compute the loss
        loss = loss_func(y_pred, batch_dict['y_target'].float())
        loss_batch = loss.item()
        running_loss += (loss_batch - running_loss) / (batch_index + 1)

        # step 3. compute the accuracy
        acc_batch = compute_accuracy(y_pred, batch_dict['y_target'])
        running_acc += (acc_batch - running_acc) / (batch_index + 1)

    train_state['val_loss'].append(running_loss)
    train_state['val_acc'].append(running_acc)        

5   epoch
10  epoch
15  epoch


## Testing on Held-Out Data

In [13]:
dataset.set_split('test')
dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                    shuffle=True, drop_last=True)

running_loss = 0.
running_acc = 0.
classifier.eval()

for batch_index, batch_dict in enumerate(dataloader):
    # compute the output
    y_pred = classifier(x_in=batch_dict['x_data'].float())

    # compute the loss
    loss = loss_func(y_pred, batch_dict['y_target'].float())
    loss_batch = loss.item()
    running_loss += (loss_batch - running_loss) / (batch_index + 1)

    # compute the accuracy
    acc_batch = compute_accuracy(y_pred, batch_dict['y_target'])
    running_acc += (acc_batch - running_acc) / (batch_index + 1)

train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc

In [14]:
print("Test loss: {:.3f}".format(train_state['test_loss']))
print("Test Accuracy: {:.2f}".format(train_state['test_acc']))

Test loss: 0.214
Test Accuracy: 91.85


## Inspecting Model Weights

Which features are the most predictive of positive reviews? Let's look at those with the highest positive weights.

In [15]:
fc1_weights = classifier.fc1.weight.detach()[0]
sorted_weights, indices = torch.sort(fc1_weights, dim=0, descending=True)
indices = indices.numpy().tolist()

# Top 20 words
print("Influential words in Positive Reviews:")
print("--------------------------------------")
for i in range(20):
    word = vectorizer.review_vocab.lookup_index(indices[i])
    weight = sorted_weights[i]
    print("{} ({:.3f})".format(word, weight))

Influential words in Positive Reviews:
--------------------------------------
delicious (1.611)
fantastic (1.460)
pleasantly (1.413)
amazing (1.401)
great (1.304)
vegas (1.291)
excellent (1.249)
yum (1.241)
ngreat (1.220)
perfect (1.217)
awesome (1.212)
yummy (1.186)
love (1.143)
bomb (1.117)
solid (1.083)
wonderful (1.045)
pleased (1.044)
notch (1.039)
chinatown (1.025)
perfection (1.011)


In [16]:
# Top 20 negative words
print("Influential words in Negative Reviews:")
print("--------------------------------------")
indices.reverse()
for idx in indices[:20]:
    word = vectorizer.review_vocab.lookup_index(idx)
    # idx is a vocabulary index, so read the weight from the unsorted vector;
    # sorted_weights is ordered by rank, not by vocabulary index
    weight = fc1_weights[idx]
    print("{} ({:.3f})".format(word, weight))
    

Influential words in Negative Reviews:
--------------------------------------
worst
mediocre
bland
horrible
meh
awful
rude
terrible
tasteless
overpriced
disgusting
unacceptable
slowest
poorly
unfriendly
nmaybe
disappointing
disappointment
inconsistent
underwhelmed

Only the ranking is listed here. The recorded run looked weights up in `sorted_weights` by vocabulary index, so the values it printed did not belong to these words; rerunning the cell prints the weight of each word.
