The first three code cells of this notebook set up Google Colab. The first cell mounts my Google Drive so that Colab can access its contents.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


This cell changes the working directory to the folder in my Google Drive that contains this notebook.

In [2]:
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks')

Colab doesn't have the transformers package installed, so this cell installs it.

In [3]:
!pip install transformers



In [4]:
from nltk.tree import Tree
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torch.utils.data import Dataset, DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Here is a link to the data. Unzip it into the same directory as this notebook, so that the files read below are available as trees/train.txt, trees/dev.txt and trees/test.txt.

https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip

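The data is in a tree format with 5 categories of sentiment (labels 0 to 4). The following function reads in the data and flattens each tree into a string. It also reduces the 5 categories to just 2: labels 0 and 1 become negative (0), labels 3 and 4 become positive (1), and neutral sentences (label 2) are dropped.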

In [5]:
def sentiment_treebank_reader(filename):
    with open(filename, encoding='utf8') as f:
        X, y = [], []
        for line in f:
            tree = Tree.fromstring(line)
            label = int(tree.label())
            string = " ".join(tree.leaves())

            if label == 0 or label == 1: 
                y.append(0)
                X.append(string)
                
            elif label == 3 or label == 4:
                y.append(1)
                X.append(string)
    return X, y
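
To make the flattening and label reduction concrete, here is a minimal standalone sketch. It is not the reader above: it swaps nltk's Tree for a small regular expression so that it runs with the standard library only, and both the helper name parse_sst_line and the example line are made up for illustration.

```python
import re

def parse_sst_line(line):
    """Return (sentence, binary_label), or None for a neutral root label.

    Stdlib-only stand-in for Tree.fromstring, for illustration only.
    """
    # The root label is the digit right after the first opening parenthesis.
    root_label = int(re.match(r"\((\d)", line.strip()).group(1))
    # A leaf is a token that is directly followed by a closing parenthesis.
    leaves = re.findall(r"([^\s()]+)\)", line)
    sentence = " ".join(leaves)
    if root_label in (0, 1):
        return sentence, 0   # negative
    if root_label in (3, 4):
        return sentence, 1   # positive
    return None              # label 2 (neutral) is dropped

# Hypothetical line in the treebank's bracketed format.
example = "(3 (2 It) (4 (2 's) (3 good)))"
result = parse_sst_line(example)
print(result)
```

Only the root label decides the class; the labels on the inner nodes are ignored, just as in the reader above.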

In [6]:
X_str_dev, y_dev = sentiment_treebank_reader('trees/dev.txt')
X_str_train, y_train = sentiment_treebank_reader('trees/train.txt')
X_str_test, y_test = sentiment_treebank_reader('trees/test.txt')

We now define a dataset class whose items are BERT-tokenized versions of the strings we just read in. Each item is a tuple of input ids, attention mask and label.

In [7]:
class SentimentDataset(Dataset):
    def __init__(self, strings, labels):
        self.strings = strings
        self.labels = labels
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
        
    def __getitem__(self, index):
        string = self.strings[index]
        label = self.labels[index]
        
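        # With padding='max_length' and no max_length argument, each sequence is
        # padded to the tokenizer's model maximum; truncation=True guards
        # against inputs that would exceed it.
        encoding = self.tokenizer.encode_plus(string,
                    add_special_tokens=True, return_attention_mask=True,
                    padding='max_length', truncation=True)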
        return (
                torch.tensor(encoding['input_ids']).to(device), 
                torch.tensor(encoding['attention_mask']).to(device), 
                torch.tensor(label, dtype=torch.long).to(device)
        )
    
    def __len__(self):
        return len(self.strings)
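
As a rough illustration of what padding='max_length' and return_attention_mask=True produce, here is a toy sketch in plain Python. It does not call the tokenizer: pad_with_mask is a hypothetical helper, the token ids are made up, and max_length=8 is chosen only to keep the output short (as far as I know, the real tokenizer pads to its model maximum when no max_length is given, which should be 512 for bert-base-cased).

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Pad token_ids to max_length and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence longer than max_length")
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

# Made-up ids standing in for [CLS] w1 w2 [SEP]; not real vocabulary entries.
ids, mask = pad_with_mask([11, 12, 13, 14], max_length=8)
print(ids)
print(mask)
```

Because every item is padded to the same length, the default DataLoader collation can stack the items of a batch into one tensor.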

In [8]:
train_set = SentimentDataset(X_str_train, y_train)
dev_set = SentimentDataset(X_str_dev, y_dev)
test_set = SentimentDataset(X_str_test, y_test)

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)  # reshuffle the training data every epoch
dev_loader = DataLoader(dev_set, batch_size=16)
test_loader = DataLoader(test_set, batch_size=16)

As in the paper, the model is BERT with a linear classifier on the [CLS] output (here the pooler_output of BertModel, which is derived from the final [CLS] hidden state). We will then fine-tune the entire model on the training data.

In [9]:
class SentimentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.classifier_layer = nn.Linear(768, 2)
    
    def forward(self, indices, mask):
        cls_output = self.bert(indices, attention_mask=mask)['pooler_output']
        return self.classifier_layer(cls_output)

In [10]:
model = SentimentClassifier()
model.to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
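
The classifier outputs two raw logits per sentence, and nn.CrossEntropyLoss applies a log-softmax to them before taking the negative log-probability of the true class. The following plain-Python sketch shows that computation for a single example; cross_entropy is a hypothetical helper with made-up logits, not part of the notebook's pipeline.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one example."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Made-up logits that favour class 0.
loss_if_label_0 = cross_entropy([2.0, 0.0], label=0)
loss_if_label_1 = cross_entropy([2.0, 0.0], label=1)
print(loss_if_label_0, loss_if_label_1)
```

The loss is small when the larger logit belongs to the true class and grows when it does not. With its default settings, nn.CrossEntropyLoss averages this quantity over the examples in a batch.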

This helper function evaluates the model on a given dataset. It returns the loss summed over all batches and the accuracy; length is the number of examples in the dataset.

In [11]:
def evaluate_model(model, data_loader, criterion, length):
    model.eval()
    with torch.no_grad():
        total_loss = 0
        correct_predictions = 0
        for inputs, masks, labels in data_loader:
            outputs = model(inputs, masks)
            _, preds = torch.max(outputs, dim=1)
            loss = criterion(outputs, labels)
            correct_predictions += torch.sum(preds == labels)
            total_loss += loss.item()
    return total_loss, correct_predictions.item() / length
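
The two returned quantities can be illustrated with plain Python lists. This is only a sketch with made-up numbers: accuracy_from_logits is a hypothetical helper that mirrors the argmax-and-compare step, and the last lines show how the summed loss could be turned into a per-batch mean if a value that is comparable across datasets of different sizes is wanted.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the true label's index."""
    preds = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Made-up logits for four sentences and their made-up gold labels.
logits = [[0.2, 1.5], [3.0, -1.0], [0.1, 0.4], [-0.5, -0.2]]
labels = [1, 0, 0, 1]
acc = accuracy_from_logits(logits, labels)

# Made-up per-batch losses: their sum plays the role of total_loss.
batch_losses = [0.5, 0.25, 0.75]
total_loss = sum(batch_losses)
mean_batch_loss = total_loss / len(batch_losses)
print(acc, total_loss, mean_batch_loss)
```

Since the train set has more batches than the dev set, the summed losses printed by the training loop are not directly comparable between the two.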

Here is the training loop.

In [None]:
# best_accuracy = 0
for epoch in range(2):
    print('Epoch', epoch + 1)
    model.train()

    for inputs, masks, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs, masks)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    train_loss, train_acc = evaluate_model(model, train_loader, criterion, len(train_set))
    dev_loss, dev_acc = evaluate_model(model, dev_loader, criterion, len(dev_set))
    print('Train loss', train_loss, '  accuracy', train_acc)
    print('Dev   loss', dev_loss, '  accuracy', dev_acc)

    # if dev_acc > best_accuracy:
    #     torch.save(model.state_dict(), 'best_model.bin')
    #     best_accuracy = dev_acc

Epoch 1
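
The commented-out lines in the training loop keep a checkpoint only when the dev accuracy improves. The following sketch isolates that rule in plain Python; epochs_that_save is a hypothetical helper and the accuracies are made up, not results from this notebook.

```python
def epochs_that_save(dev_accuracies):
    """Return the 1-based epochs at which a new best checkpoint would be saved."""
    best_accuracy = 0
    saved = []
    for epoch, dev_acc in enumerate(dev_accuracies, start=1):
        if dev_acc > best_accuracy:
            saved.append(epoch)       # stands in for torch.save(...)
            best_accuracy = dev_acc
    return saved

# Made-up dev accuracies for three epochs.
saves = epochs_that_save([0.90, 0.92, 0.91])
print(saves)
```

With this rule, the saved file always holds the weights from the epoch with the highest dev accuracy so far, which is what the later commented-out cell would reload before testing.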


In [None]:
# model = SentimentClassifier()
# model.load_state_dict(torch.load('best_model.bin'))
# model.to(device)

In [None]:
_, test_acc = evaluate_model(model, test_loader, criterion, len(test_set))
print('Test accuracy', test_acc)