# Using BERT for Classification

The Hugging Face `transformers` library offers the `AutoModelForSequenceClassification` class, which automatically adds a classification head on top of BERT. It also provides the `TrainingArguments` and `Trainer` classes, which make training straightforward. However, since your data may look quite different from the "toy" datasets used in many examples, I find it best to write custom data loaders and training loops.

In [None]:
from google.colab import drive
import os

mount='/content/gdrive'
drive.mount(mount)

# Switch to the directory on the Google Drive that you want to use
drive_root = mount + "/My Drive/large-language-models-main"
%cd $drive_root

In [1]:
from datasets import load_dataset

imdb = load_dataset('imdb')


It is always worth looking at an example first:

In [2]:
imdb['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

Sounds like a negative review to me, so 0 must be negative and 1 positive. We write two helper functions to translate between the numbers and the sentiment names:

In [3]:
def sentiment2label(sentiment):
    return 'positive' if sentiment == 1 else 'negative'

def label2sentiment(label):
    return 1 if label == 'positive' else 0

Next, we write functions that strip the HTML line breaks from each review and collect the texts and labels:

In [4]:
def clean_text(text):
    return text.replace('<br />', ' ')

def get_text_and_labels(data):
    texts = [clean_text(sample['text']) for sample in data]
    labels = [sample['label'] for sample in data]
    return texts, labels

In [5]:
texts, labels = get_text_and_labels(imdb['train'])
test_texts, test_labels = get_text_and_labels(imdb['test'])

In [6]:
texts[0], labels[0]

('I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.  The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.  What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not s

## Building the Dataset

Now we create a custom dataset class that tokenizes each review and returns the tensors BERT expects:

In [7]:
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModel, AutoTokenizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [8]:
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # calling the tokenizer directly replaces the older encode_plus method
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_tensors='pt',
            truncation=True
        )
        return {
            'input_ids': inputs['input_ids'].flatten(),
            'attention_mask': inputs['attention_mask'].flatten(),
            'label': torch.tensor(label)
        }
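
To make the `padding='max_length'` and `truncation=True` arguments concrete, here is a minimal sketch in plain Python of what they do to a list of token ids. The helper `pad_or_truncate` and the toy ids are hypothetical and only for illustration. The real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]                      # truncation: drop anything past max_len
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad             # pad on the right up to max_len
    return input_ids, attention_mask


# toy ids, not produced by a real tokenizer
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sample comes out with the same length, the default `DataLoader` collation can stack them into a batch without a custom `collate_fn`.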

In [9]:
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=0)

tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')
max_len = 128
batch_size = 16

train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer, max_len)
val_dataset = TextClassificationDataset(val_texts, val_labels, tokenizer, max_len)
test_dataset = TextClassificationDataset(test_texts, test_labels, tokenizer, max_len)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [10]:
# check if the data is loaded correctly
batch = next(iter(train_loader))

## Building the classifier

Now we build our classifier. We want the freedom to swap in other embedding models, not just BERT (the code below uses DistilBERT), so we use the `AutoTokenizer` and `AutoModel` classes, which load the right architecture from the checkpoint name.

Typically, the [CLS] token embedding is passed to the classifier, but we can also take an average of the tokens of the last hidden state and pass this instead. Here, I have done the former.
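
For comparison, the averaging alternative could look roughly like the sketch below. The function name `mean_pool` is my own, and the tiny tensors are made up so the result can be checked by hand. The important detail is that padding positions are masked out before averaging.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over the sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# made-up example: batch of 1, sequence of 3 tokens, hidden size 2; last token is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled)
```

To try it in the classifier, one would replace `last_hidden_state[:, 0]` in `forward` with `mean_pool(last_hidden_state, attention_mask)`. Which of the two works better depends on the task.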

In [11]:
embedding_model = AutoModel.from_pretrained('distilbert/distilbert-base-uncased')

In [12]:
# pass a single batch through the model (no gradients needed for this check)
input_ids = batch['input_ids']
attention_mask = batch['attention_mask']
with torch.no_grad():
    output = embedding_model(input_ids=input_ids, attention_mask=attention_mask)

In [13]:
output.last_hidden_state.shape

torch.Size([16, 128, 768])

In [14]:
class Classifier(nn.Module):
    def __init__(self, embedding_model, n_classes, dropout_p=0.1, train_embedder=True):
        super().__init__()
        self.embedding_model = AutoModel.from_pretrained(embedding_model)
        self.dropout = nn.Dropout(dropout_p)
        self.linear = nn.Linear(self.embedding_model.config.hidden_size, n_classes)

        if not train_embedder:
            for param in self.embedding_model.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        outputs = self.embedding_model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state
        pooled_output = last_hidden_state[:, 0]
        pooled_output = self.dropout(pooled_output)
        logits = self.linear(pooled_output)
        return logits



In [25]:
def get_device():
    if torch.cuda.is_available():
        return torch.device('cuda')
    # Apple silicon Macs (MPS backend)
    elif torch.backends.mps.is_available():
        return torch.device('mps')
    else:
        return torch.device('cpu')

device = get_device()

In [26]:
# pass a single example through the model (unsqueeze adds the batch dimension)
model = Classifier('distilbert/distilbert-base-uncased', 2, train_embedder=True).to(device)

input_ids = train_loader.dataset[0]['input_ids'].unsqueeze(0).to(device)
attention_mask = train_loader.dataset[0]['attention_mask'].unsqueeze(0).to(device)
with torch.no_grad():
    output = model(input_ids, attention_mask)
output

tensor([[-0.1947,  0.0810]], device='mps:0')

In [27]:
from transformers import get_linear_schedule_with_warmup
from torch.optim import AdamW

loss_fn = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=2e-5)
epochs = 3
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
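
To see what this schedule does to the learning rate, here is a plain-Python sketch of the multiplier as I understand it: a linear ramp up during warmup, then a linear decay to zero. The function `linear_schedule_lr` is a hypothetical re-implementation for illustration, not the library's code. The step counts match this run: 1250 batches per epoch over 3 epochs gives 3750 steps.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


steps_per_epoch = 1250            # 20,000 training reviews / batch size 16
total = steps_per_epoch * 3       # 3 epochs

lr_start = linear_schedule_lr(0, 2e-5, 0, total)
lr_mid = linear_schedule_lr(total // 2, 2e-5, 0, total)
lr_end = linear_schedule_lr(total, 2e-5, 0, total)
print(lr_start, lr_mid, lr_end)
```

With `num_warmup_steps=0` the rate starts at its full value and falls steadily, so the last epoch takes much smaller steps than the first.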


In [28]:
from tqdm import tqdm

In [29]:
for epoch in range(epochs):
    model.train()
    losses = []
    for batch in tqdm(train_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        model.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        losses.append(loss.item())

    print(f'Epoch {epoch + 1}, train loss: {sum(losses) / len(losses)}')

    model.eval()
    losses = []
    predictions = []
    true_labels = []
    with torch.no_grad():
        for batch in tqdm(val_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask)
            loss = loss_fn(outputs, labels)
            losses.append(loss.item())

            _, predicted = torch.max(outputs, dim=1)
            predictions.extend(predicted.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    print(f'Epoch {epoch + 1}, validation loss: {sum(losses) / len(losses)}, accuracy: {accuracy_score(true_labels, predictions)}')

100%|██████████| 1250/1250 [06:03<00:00,  3.44it/s]


Epoch 1, train loss: 0.3531162521749735
Epoch 1, validation loss: 0.318233408533727, accuracy: 0.8644


100%|██████████| 1250/1250 [06:11<00:00,  3.36it/s]


Epoch 2, train loss: 0.20439695144742728
Epoch 2, validation loss: 0.30492765495714286, accuracy: 0.8798


100%|██████████| 1250/1250 [06:17<00:00,  3.31it/s]


Epoch 3, train loss: 0.10851184164509177
Epoch 3, validation loss: 0.370008900847893, accuracy: 0.879


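After 3 epochs, accuracy on the validation set is pretty decent at about 0.88. The model is starting to overfit, though: the training loss keeps falling (from 0.35 to 0.11) while the validation loss rises from 0.30 to 0.37 in the last epoch. That's OK for our purposes. Next, we make this easier to manage by putting everything into a `.py` file, where we can specify a config.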
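
If the overfitting did matter, one common remedy is to keep the checkpoint with the lowest validation loss, or to stop once that loss stops improving. The sketch below only shows the bookkeeping, using the three validation losses printed above (rounded). The helpers `best_epoch` and `should_stop` are hypothetical names and not part of the notebook's code.

```python
val_losses = [0.318, 0.305, 0.370]  # validation loss after epochs 1, 2, 3 (rounded)


def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1


def should_stop(losses, patience=1):
    """True once the loss has gone `patience` epochs without a new minimum."""
    best_idx = min(range(len(losses)), key=losses.__getitem__)
    return len(losses) - 1 - best_idx >= patience


chosen = best_epoch(val_losses)
stop_after_epoch_2 = should_stop(val_losses[:2], patience=1)
stop_after_epoch_3 = should_stop(val_losses, patience=1)
print(chosen, stop_after_epoch_2, stop_after_epoch_3)
```

In a real loop one would call `torch.save(model.state_dict(), path)` whenever a new minimum is reached and reload that file before evaluating on the test set.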

I like to use `SimpleNamespace` for my config and experiment results objects, because they enable accessing attributes using the dot notation.
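
As a quick illustration of why this is convenient, the sketch below builds a small config, adds an attribute afterwards, and converts it to and from a plain dict. The attribute names are just examples.

```python
from types import SimpleNamespace

config = SimpleNamespace(lr=2e-5, batch_size=16)
config.max_len = 128                 # attributes can be added after creation

as_dict = vars(config)               # plain dict view, handy for logging or saving
same = SimpleNamespace(**as_dict)    # rebuild a namespace from a dict

print(config.lr, config.batch_size, config.max_len)
print(as_dict)
```

Compared with a plain dict, a typo such as `config.batchsize` raises an `AttributeError` on read, which is a little easier to spot than a silent `.get()` default.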

Note that this model can also be used for regression tasks, for example regressing text onto continuous values. Just set `n_classes` to 1, use `MSELoss()` instead of cross entropy, and make sure the targets are floats with the same shape as the model output.
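
A minimal sketch of the regression variant is shown below. To keep it self-contained, random tensors stand in for the pooled embeddings, and the targets are made up. The hidden size of 768 matches the DistilBERT output shape seen earlier.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size = 768
pooled = torch.randn(4, hidden_size)           # stand-in for the pooled embeddings of 4 texts
targets = torch.tensor([1.0, 3.5, 2.0, 4.0])   # made-up continuous targets (floats, not class ids)

head = nn.Linear(hidden_size, 1)               # same layer as before, with n_classes = 1
preds = head(pooled).squeeze(-1)               # (4, 1) -> (4,) so it matches the targets' shape
loss = nn.MSELoss()(preds, targets)
loss.backward()

print(preds.shape, loss.item())
```

The `squeeze(-1)` matters: without it `MSELoss` would broadcast a `(4, 1)` prediction against `(4,)` targets and quietly compute the wrong loss.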

Now we just import everything and define our model parameters. The main code for this can be found in the folder `/embedder`, and can be easily adapted to similar tasks or other datasets, or extended for additional functionality.

In [1]:
from types import SimpleNamespace
from embedder.imdb_classification import main


In [2]:
args = SimpleNamespace(
    embedding_model='distilbert-base-uncased',
    batch_size=16,
    max_len=128,
    epochs=3,
    lr=2e-5,
    train_embedder=True,
    dropout_p=0.1,
    n_epochs=3
)

main(args)

Getting data...
Building model...
Training model...


100%|██████████| 5/5 [00:01<00:00,  2.62it/s]


Epoch 1/3, Train Loss: 0.3581, Val Loss: 0.1024, Val Accuracy: 1.0000


100%|██████████| 5/5 [00:01<00:00,  3.52it/s]


Epoch 2/3, Train Loss: 0.0485, Val Loss: 0.0114, Val Accuracy: 1.0000


100%|██████████| 5/5 [00:01<00:00,  3.52it/s]


Epoch 3/3, Train Loss: 0.0075, Val Loss: 0.0034, Val Accuracy: 1.0000
Evaluating model...
Test Loss: 0.0036, Test Accuracy: 1.0000
