This notebook shows, step by step, how to perform text classification by fine-tuning a BERT-based model.

Here we install the transformers package from Hugging Face. We use one of its pre-trained BERT models, specifically a compact model trained through model distillation. We will use the package to:

Tokenize the text according to the BERT model specification, using its DistilBertTokenizer class
Instantiate a pre-trained BERT model, modified for the text classification task, using its DistilBertForSequenceClassification class, which we will then fine-tune on our specific dataset.

For a comprehensive tutorial about using this package to fine-tune BERT for text classification, please see [here](https://mccormickml.com/2019/07/22/BERT-fine-tuning/).

Installing needed Python libraries

In [1]:
import pandas as pd

df = pd.read_csv('./data/consumer_complaint_data_sample_prepared.csv')

In [2]:
df.shape

(52442, 3)

In [3]:
df.head(10)

Unnamed: 0,Product,Complaint,Product_Label
0,Credit Reporting,i first report on 2019 i asked for a master pr...,2
1,Credit Reporting,please be advised that this is my third writte...,2
2,Credit Reporting,is falsely reporting 9 hard inquiries and want...,2
3,Banking Services,open account on 2018 through the citi bank off...,0
4,Card Services,i was never sent a credit card bill for my pur...,1
5,Debt Collection,hello i am writing to dispute a collection ref...,3
6,Credit Reporting,i was made aware of four accounts today 19 on ...,2
7,Credit Reporting,the credit bureaus are reporting inaccurate ou...,2
8,Credit Reporting,transunion ss dob dear sir or madam i am a vic...,2
9,Card Services,this is my second complaint my first ended up ...,1


In [4]:
label_counts = pd.DataFrame(df['Product'].value_counts())
label_counts

Unnamed: 0,Product
Credit Reporting,19100
Debt Collection,11266
Mortgage,6414
Card Services,5637
Loans,5343
Banking Services,4682


Here we create an array with the label names in the order they were numerically encoded. We use them later when plotting model performance data.

In [5]:
label_values = list(label_counts.index)
order = list(pd.DataFrame(df['Product_Label'].value_counts()).index)
label_values = [l for _,l in sorted(zip(order, label_values))]

label_values

['Banking Services',
 'Card Services',
 'Credit Reporting',
 'Debt Collection',
 'Loans',
 'Mortgage']
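
The cell above pairs names with codes by assuming that both `value_counts()` calls list the categories in the same frequency order. As a minimal sketch of an alternative, the pairing can be read directly from the rows. The helper name `label_names_in_order` is hypothetical, and the toy rows are taken from the `df.head(10)` output shown earlier.

```python
def label_names_in_order(names, codes):
    """Return the label names sorted by their numeric code.

    `names` and `codes` are parallel sequences, one entry per row
    (for example df['Product'] and df['Product_Label']).
    """
    code_to_name = {}
    for name, code in zip(names, codes):
        # Fail loudly if one code is paired with two different names.
        if code_to_name.setdefault(code, name) != name:
            raise ValueError(f"code {code} maps to more than one name")
    return [code_to_name[code] for code in sorted(code_to_name)]


# Toy rows taken from the df.head(10) output shown earlier.
names = ['Credit Reporting', 'Banking Services', 'Card Services', 'Debt Collection']
codes = [2, 0, 1, 3]
print(label_names_in_order(names, codes))
```

With the notebook's data, the call would presumably be `label_names_in_order(df['Product'], df['Product_Label'])`.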

We need to create 2 arrays: one with the textual data, which is our feature data, and one with the numerically encoded labels, representing our target data.

In [6]:
texts = df['Complaint'].values
labels = df['Product_Label'].values

BERT is a "heavy-weight" model. This makes training very resource-intensive, especially when we fine-tune all model layers. To mitigate this, we can control the sequence length of our input text, which is given by the number of tokens in the text plus 2 special tokens that mark the beginning and end of the sequence.

In [7]:
text_lengths = [len(texts[i].split()) for i in range(len(texts))]
print(min(text_lengths))
print(max(text_lengths))

5
5367


In [8]:
sum([1 for i in range(len(text_lengths)) if text_lengths[i] >= 300])

8737
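
Note that the counts above are whitespace-separated words, not WordPiece tokens, so they typically underestimate the tokenized length. As a rough check of how many complaints a cut-off of 300 would touch, the two numbers printed above can be combined as follows (a back-of-the-envelope sketch, not part of the original notebook):

```python
num_texts = 52442   # rows in the dataset, from df.shape above
num_long = 8737     # texts with 300 or more whitespace-separated words

share_long = num_long / num_texts
print(f"{share_long:.1%} of the complaints have 300 or more words")
```

Roughly one complaint in six is therefore at or above the cut-off before any sub-word splitting is taken into account.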

In [9]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

print('Original Text: ', texts[0], '\n')
print('Tokenized Text: ', tokenizer.tokenize(texts[0]), '\n')
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(texts[0])))

Original Text:  i first report on 2019 i asked for a master promissory note proving the loans that they had under my name for there company the cfpb dismissed my complaint because navient sent a promissory note that had nothing to do with the loans they claim they have on file for me 

i have attached to this complaint is the promissory note that navient attached with my first complaint i have also attached what navient reporting to the creditors 

why are the original grantors not the same isnt the information suppose to match 

i also asked my private loan carrier for a master promissory note and i was shocked to find out that they have the same promissory note for my private loans that navient have for federal loans for different schools 

how is that possible 

 has told me on multiple occasions and by multiple employees when i have asked navient who my original grantor is there answer has changed atleast twice and now that i see that there reporting that is the original grantor th

We then tokenize and encode the entire dataset. In this process, we perform the following:

. tokenize the text as shown above
. encode it to the corresponding numeric values for each token.
. truncate it to the maximum sequence length of 300.
. pad sequences shorter than 300 tokens up to that length.
. include the special token IDs to mark the beginning and end of each sequence
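
The steps above can be sketched in plain Python. This is only an illustration of what the tokenizer call is expected to do, not the library's implementation. The helper name `to_fixed_length` is hypothetical, and the special token IDs (101 for [CLS], 102 for [SEP], 0 for [PAD]) are assumed to be those of the uncased BERT vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special token IDs (uncased BERT vocabulary)


def to_fixed_length(token_ids, max_length):
    """Truncate, add [CLS]/[SEP], and right-pad to exactly `max_length` IDs."""
    body = token_ids[:max_length - 2]          # leave room for the 2 special tokens
    sequence = [CLS_ID] + body + [SEP_ID]
    padding = [PAD_ID] * (max_length - len(sequence))
    return sequence + padding


short = to_fixed_length([1045, 2034, 3189], max_length=8)
long = to_fixed_length(list(range(1000, 1020)), max_length=8)
print(short)  # [101, 1045, 2034, 3189, 102, 0, 0, 0]
print(long)   # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

Short inputs are padded on the right, and long inputs lose everything after the first `max_length - 2` tokens.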

In [10]:
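# Newer transformers releases deprecate pad_to_max_length in favor of
# padding='max_length' together with truncation=True.
text_ids = [tokenizer.encode(text, max_length=300, pad_to_max_length=True) for text in texts]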

text_ids[0]

[101,
 1045,
 2034,
 3189,
 2006,
 10476,
 1045,
 2356,
 2005,
 1037,
 3040,
 20877,
 14643,
 10253,
 3602,
 13946,
 1996,
 10940,
 2008,
 2027,
 2018,
 2104,
 2026,
 2171,
 2005,
 2045,
 2194,
 1996,
 12935,
 2361,
 2497,
 7219,
 2026,
 12087,
 2138,
 6583,
 13469,
 3372,
 2741,
 1037,
 20877,
 14643,
 10253,
 3602,
 2008,
 2018,
 2498,
 2000,
 2079,
 2007,
 1996,
 10940,
 2027,
 4366,
 2027,
 2031,
 2006,
 5371,
 2005,
 2033,
 1045,
 2031,
 4987,
 2000,
 2023,
 12087,
 2003,
 1996,
 20877,
 14643,
 10253,
 3602,
 2008,
 6583,
 13469,
 3372,
 4987,
 2007,
 2026,
 2034,
 12087,
 1045,
 2031,
 2036,
 4987,
 2054,
 6583,
 13469,
 3372,
 7316,
 2000,
 1996,
 23112,
 2339,
 2024,
 1996,
 2434,
 3946,
 5668,
 2025,
 1996,
 2168,
 3475,
 2102,
 1996,
 2592,
 6814,
 2000,
 2674,
 1045,
 2036,
 2356,
 2026,
 2797,
 5414,
 6839,
 2005,
 1037,
 3040,
 20877,
 14643,
 10253,
 3602,
 1998,
 1045,
 2001,
 7135,
 2000,
 2424,
 2041,
 2008,
 2027,
 2031,
 1996,
 2168,
 20877,
 14643,
 10253,
 3602,
 

In [11]:
text_ids_lengths = [len(text_ids[i]) for i in range(len(text_ids))]
print(min(text_ids_lengths))
print(max(text_ids_lengths))

300
300


To fine-tune our model, we need two inputs: one array of token IDs (created above) and one array with a corresponding binary mask, called the attention mask in the BERT model specification. Each attention mask has the same length as the corresponding input sequence and holds a 0 where the token is a pad token and a 1 otherwise.

In [12]:
att_masks = []
for ids in text_ids:
    # The pad token has ID 0, so any positive ID is a real token.
    masks = [int(token_id > 0) for token_id in ids]
    att_masks.append(masks)
    
att_masks[0]

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
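
The first complaint is long enough to fill the visible part of its mask with 1s. A toy sequence with padding makes the 0 entries visible. This is a minimal sketch; the helper name `attention_mask` is hypothetical, and the pad ID of 0 matches the `padding_idx=0` shown in the model's embedding layer.

```python
PAD_ID = 0  # padding token ID


def attention_mask(token_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [int(token_id != pad_id) for token_id in token_ids]


padded = [101, 1045, 2034, 3189, 102, 0, 0, 0]
print(attention_mask(padded))  # [1, 1, 1, 1, 1, 0, 0, 0]
```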


Here we split the input and output arrays created before into train, validation, and test sets. We use 80% of the data for training, 10% for validation, and 10% for final testing.

In [13]:
from sklearn.model_selection import train_test_split

train_x, test_val_x, train_y, test_val_y = train_test_split(text_ids, labels, random_state=111, test_size=0.2)
train_m, test_val_m = train_test_split(att_masks, random_state=111, test_size=0.2)

test_x, val_x, test_y, val_y = train_test_split(test_val_x, test_val_y, random_state=111, test_size=0.5)
test_m, val_m = train_test_split(test_val_m, random_state=111, test_size=0.5)
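
The masks are split in separate calls, which stays aligned with the token IDs only because the same `random_state` is applied to arrays of the same length. The resulting set sizes can be worked out by hand. The sketch below assumes the held-out part is rounded up to a whole sample, which reproduces the tensor shapes printed further below; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(num_samples, test_fraction):
    """Sizes of the two parts when the held-out part is rounded up."""
    num_held_out = math.ceil(num_samples * test_fraction)
    return num_samples - num_held_out, num_held_out


num_train, num_test_val = split_sizes(52442, 0.2)
num_test, num_val = split_sizes(num_test_val, 0.5)
print(num_train, num_test, num_val)  # 41953 5244 5245
```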

We are working with the PyTorch artifacts in the transformers library, so we need our model input and output data as PyTorch tensors.

In [14]:
import torch

train_x = torch.tensor(train_x)
test_x = torch.tensor(test_x)
val_x = torch.tensor(val_x)
train_y = torch.tensor(train_y)
test_y = torch.tensor(test_y)
val_y = torch.tensor(val_y)
train_m = torch.tensor(train_m)
test_m = torch.tensor(test_m)
val_m = torch.tensor(val_m)

print(train_x.shape)
print(test_x.shape)
print(val_x.shape)
print(train_y.shape)
print(test_y.shape)
print(val_y.shape)
print(train_m.shape)
print(test_m.shape)
print(val_m.shape)

torch.Size([41953, 300])
torch.Size([5244, 300])
torch.Size([5245, 300])
torch.Size([41953])
torch.Size([5244])
torch.Size([5245])
torch.Size([41953, 300])
torch.Size([5244, 300])
torch.Size([5245, 300])


To feed data into the model for training, we use PyTorch's Dataset, DataLoader, and Sampler. For feeding training data, which drives model weight updates, we use the RandomSampler. For feeding the validation data we can use the SequentialSampler.

In [15]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 32

train_data = TensorDataset(train_x, train_m, train_y)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

val_data = TensorDataset(val_x, val_m, val_y)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
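
To see what the two samplers mean for batching, the sketch below groups sample indices into mini-batches in plain Python. It only imitates the effect (shuffled versus sequential order, with the last partial batch kept, as DataLoader does by default); `make_batches` is a hypothetical helper, not a PyTorch function.

```python
import random


def make_batches(num_samples, batch_size, shuffle, seed=111):
    """Group sample indices into mini-batches, shuffling first if requested."""
    indices = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)   # stands in for RandomSampler
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


train_batches = make_batches(41953, 32, shuffle=True)
val_batches = make_batches(5245, 32, shuffle=False)    # stands in for SequentialSampler

print(len(train_batches), len(val_batches))            # 1312 164
print(len(train_batches[-1]), len(val_batches[-1]))    # 1 29
```

With a batch size of 32, one epoch therefore consists of 1312 training steps.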

Here we instantiate our model class. We use a compact version that is trained through model distillation from a base BERT model and modified to include a classification layer at the output. This compact version has 6 transformer layers instead of the 12 in the original BERT model. Please see here for more details.

In [16]:
from transformers import DistilBertForSequenceClassification, AdamW, DistilBertConfig

num_labels = len(set(labels))

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=num_labels,
                                                            output_attentions=False, output_hidden_states=False)

BERT is a very large model. Unless you are freezing model weights in all layers but the classification layer, it is recommended to train it on a GPU.

In [17]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

model = model.to(device)

cpu
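
Since the device printed above is the CPU, freezing everything except the classification head is one way to keep training tractable. The following is a minimal sketch of that idea. The helper `freeze_all_but` is hypothetical, and the prefixes `pre_classifier` and `classifier` are assumed names for the classification head; check them against the output of `model.named_parameters()` before relying on them. A small stand-in class is used so the sketch runs without loading a model.

```python
from types import SimpleNamespace


def freeze_all_but(model, trainable_prefixes):
    """Disable gradients for every parameter whose name lacks the given prefixes."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(tuple(trainable_prefixes))


# Stand-in objects so the sketch runs without downloading a model.
class ToyModel:
    def __init__(self, names):
        self._params = [(n, SimpleNamespace(requires_grad=True)) for n in names]

    def named_parameters(self):
        return iter(self._params)


toy = ToyModel(['distilbert.embeddings.word_embeddings.weight',
                'pre_classifier.weight', 'classifier.weight', 'classifier.bias'])
freeze_all_but(toy, ['pre_classifier', 'classifier'])
trainable = [name for name, p in toy.named_parameters() if p.requires_grad]
print(trainable)
```

The same function should work on a real PyTorch model, because it relies only on `named_parameters()` and the `requires_grad` attribute.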


Here we print the model architecture and all model learnable parameters.

In [18]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print('Number of trainable parameters:', count_parameters(model), '\n', model)

Number of trainable parameters: 66958086 
 DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (drop

In [19]:
[n for n, p in model.named_parameters()]

['distilbert.embeddings.word_embeddings.weight',
 'distilbert.embeddings.position_embeddings.weight',
 'distilbert.embeddings.LayerNorm.weight',
 'distilbert.embeddings.LayerNorm.bias',
 'distilbert.transformer.layer.0.attention.q_lin.weight',
 'distilbert.transformer.layer.0.attention.q_lin.bias',
 'distilbert.transformer.layer.0.attention.k_lin.weight',
 'distilbert.transformer.layer.0.attention.k_lin.bias',
 'distilbert.transformer.layer.0.attention.v_lin.weight',
 'distilbert.transformer.layer.0.attention.v_lin.bias',
 'distilbert.transformer.layer.0.attention.out_lin.weight',
 'distilbert.transformer.layer.0.attention.out_lin.bias',
 'distilbert.transformer.layer.0.sa_layer_norm.weight',
 'distilbert.transformer.layer.0.sa_layer_norm.bias',
 'distilbert.transformer.layer.0.ffn.lin1.weight',
 'distilbert.transformer.layer.0.ffn.lin1.bias',
 'distilbert.transformer.layer.0.ffn.lin2.weight',
 'distilbert.transformer.layer.0.ffn.lin2.bias',
 'distilbert.transformer.layer.0.output_laye

In the following 5 cells we define our PyTorch optimizer and corresponding parameters, the learning rate scheduler, and the training loop for the fine-tuning procedure. We train for 1 epoch.

In [20]:
learning_rate = 1e-5
adam_epsilon = 1e-8

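# DistilBERT names its layer norms both 'LayerNorm' and '*_layer_norm',
# so both spellings are listed to keep them out of weight decay.
no_decay = ['bias', 'LayerNorm.weight', 'layer_norm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.2},  # AdamW reads the key 'weight_decay'
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]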

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
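
The parameter groups are built by case-sensitive substring matching on parameter names. The sketch below checks what the two substrings `'bias'` and `'LayerNorm.weight'` catch on their own, using a few names copied from the `named_parameters()` listing above.

```python
no_decay = ['bias', 'LayerNorm.weight']

# A few parameter names copied from the named_parameters() listing above.
names = [
    'distilbert.embeddings.word_embeddings.weight',
    'distilbert.embeddings.LayerNorm.weight',
    'distilbert.embeddings.LayerNorm.bias',
    'distilbert.transformer.layer.0.attention.q_lin.weight',
    'distilbert.transformer.layer.0.attention.q_lin.bias',
    'distilbert.transformer.layer.0.sa_layer_norm.weight',
    'distilbert.transformer.layer.0.sa_layer_norm.bias',
]

decayed = [n for n in names if not any(nd in n for nd in no_decay)]
exempt = [n for n in names if any(nd in n for nd in no_decay)]

for n in decayed:
    print('decay   ', n)
for n in exempt:
    print('no decay', n)
```

With only these two substrings, `sa_layer_norm.weight` ends up in the decayed group, because `'LayerNorm.weight'` does not match the lower-case, underscore-separated name.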

In [21]:
from transformers import get_linear_schedule_with_warmup

num_epochs = 1
total_steps = len(train_dataloader) * num_epochs

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
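
A linear schedule with warmup scales the base learning rate by a factor that ramps up during warmup and then decays linearly to zero. The sketch below reproduces that shape in plain Python under the settings used here (no warmup, 1 epoch). It is an illustration of the idea, not the library's code, and `linear_schedule_factor` is a hypothetical name. The value 1312 is the number of mini-batches per epoch, i.e. 41953 training samples in batches of 32, rounded up.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)           # linear ramp-up
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 1312   # mini-batches per epoch * 1 epoch

for step in (0, 656, 1312):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(step, lr)
```

With zero warmup steps, training starts at the full learning rate of 1e-5, reaches half of it midway through the epoch, and ends at zero.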

In [22]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [23]:
import numpy as np
import random

seed_val = 111

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [24]:
train_losses = []
val_losses = []
num_mb_train = len(train_dataloader)
num_mb_val = len(val_dataloader)

if num_mb_val == 0:
    num_mb_val = 1

for n in range(num_epochs):
    train_loss = 0
    val_loss = 0
    start_time = time.time()
    
    for k, (mb_x, mb_m, mb_y) in enumerate(train_dataloader):
        optimizer.zero_grad()
        model.train()
        
        mb_x = mb_x.to(device)
        mb_m = mb_m.to(device)
        mb_y = mb_y.to(device)
        
        outputs = model(mb_x, attention_mask=mb_m, labels=mb_y)
        
        loss = outputs[0]
        #loss = model_loss(outputs[1], mb_y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        
        train_loss += loss.data / num_mb_train
    
    print("\nTrain loss after epoch %i: %f" % (n+1, train_loss))
    train_losses.append(train_loss.cpu())
    
    with torch.no_grad():
        model.eval()
        
        for k, (mb_x, mb_m, mb_y) in enumerate(val_dataloader):
            mb_x = mb_x.to(device)
            mb_m = mb_m.to(device)
            mb_y = mb_y.to(device)
        
            outputs = model(mb_x, attention_mask=mb_m, labels=mb_y)
            
            loss = outputs[0]
            #loss = model_loss(outputs[1], mb_y)
            
            val_loss += loss.data / num_mb_val
            
        print("Validation loss after epoch %i: %f" % (n+1, val_loss))
        val_losses.append(val_loss.cpu())
    
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    print(f'Time: {epoch_mins}m {epoch_secs}s')

KeyboardInterrupt: 

After training, we can save the model and necessary configuration parameters, to recreate it later and use it to score the test data. Here we also save the losses computed from both training and validation data.

In [None]:
import pickle
import os

out_dir = './model'

if not os.path.exists(out_dir):
    os.makedirs(out_dir)
    
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(out_dir)
tokenizer.save_pretrained(out_dir)

with open(out_dir + '/train_losses.pkl', 'wb') as f:
    pickle.dump(train_losses, f)
    
with open(out_dir + '/val_losses.pkl', 'wb') as f:
    pickle.dump(val_losses, f)

In [None]:
out_dir = './model'

model = DistilBertForSequenceClassification.from_pretrained(out_dir)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

with open(out_dir + '/train_losses.pkl', 'rb') as f:
    train_losses = pickle.load(f)
    
with open(out_dir + '/val_losses.pkl', 'rb') as f:
    val_losses = pickle.load(f)

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

plt.figure()
plt.plot(train_losses)

In [None]:
plt.figure()
plt.plot(val_losses)

After instantiating a trained model, we can then score the test data and compute its accuracy. We then print the classification report and plot a confusion matrix.

The trained model gives us good results on the test data, correctly classifying 80% or more of the examples in each of the 6 distinct categories.
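
The scoring step described above can be sketched without any library: take the arg-max of each row of model logits as the predicted class, then tabulate a confusion matrix, overall accuracy, and the share of each true class that was classified correctly. The logits and labels below are made up purely for illustration, and all helper names are hypothetical.

```python
def argmax(values):
    """Index of the largest value in a list of class scores."""
    return max(range(len(values)), key=values.__getitem__)


def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Share of each true class that was classified correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Made-up logits for 5 samples and 3 classes, purely for illustration.
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.3, 0.2, 0.9], [1.1, 0.4, 0.2], [0.1, 0.2, 0.3]]
y_true = [0, 1, 2, 1, 2]
y_pred = [argmax(row) for row in logits]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                    # [[1, 0, 0], [1, 1, 0], [0, 0, 2]]
print(accuracy(cm))          # 0.8
print(per_class_recall(cm))  # [1.0, 0.5, 1.0]
```

The per-class values correspond to the "80% or more" figure quoted above, which is a statement about each category separately rather than about overall accuracy.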