# Training BERT

This notebook uses the data from `pre_process_data.py` and trains BERT models to classify toxicity.

## Potential Extensions

The addition this method really needs is to use the extra columns, such as identity attack. If we did this way, we could build BERT models for each attack and then use some form of logistic regression on top.

## Configuration

#### Model

In [1]:
model_type = 'bert-base-cased'
# model_type = 'bert-base-uncased'
# model_type = 'bert-large-cased'
# model_type = 'bert-large-uncased'

In [2]:
dataset_size = 1000 # set to None for full dataset

#### Learning Parameters

In [3]:
epochs = 10
learning_rate = 2e-5
warmup = 0.05
batch_size = 32
accumulation_steps=2
seed = 0

## Variables to Not Change

In [4]:
max_sentence_length = 512
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']

In [5]:
if dataset_size == None:
    output_model_file = f'{model_type}.bin'
else:
    output_model_file = f'{dataset_size}_{model_type}.bin'

## Check Configuration

This is kept pretty naive. Mainly want to make sure that a model isn't overwritten.

In [6]:
import os

if not os.path.isdir('model'):
    os.mkdir('model')

model_output_path = os.path.join('model', output_model_file)
assert os.path.exists(model_output_path) == False

## Getting Data for BERT

In [7]:
from torch.utils.data import TensorDataset

import numpy as np
import pickle
import torch

In [8]:
if dataset_size == None:
    data_path = model_type
else:
    data_path = f'{model_type}_{dataset_size}'

In [9]:
f = open(os.path.join('data', f'{data_path}_training_data.pkl'), 'rb')
train = pickle.load(f)
f.close()

In [10]:
f = open(os.path.join('data', f'{data_path}_testing_data.pkl'), 'rb')
test = pickle.load(f)
f.close()

In [11]:
def spread_data(dataset):
    x = []
    y = []
    split = []
    
    for val in dataset:
        x.append(val[0])
        split.append(val[1])
        y.append(val[2])
        
    y = torch.tensor(y, dtype=torch.float)
    split = torch.tensor(split)
    
    return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.float), torch.tensor(split)

In [12]:
train_x, train_y, split_indexes_train = spread_data(train)
test_x, test_y, split_indexes_test = spread_data(test)

  


In [13]:
train = TensorDataset(train_x, train_y)
test = TensorDataset(test_x, test_y)

## Loading Bert

In [14]:
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import numpy as np

In [15]:
torch.backends.cudnn.deterministic = True
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)

In [16]:
tokenizer = BertTokenizer.from_pretrained(model_type)

The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.


## Load Pre-Trained BERT Model

In [17]:
from pytorch_pretrained_bert import BertForSequenceClassification,BertAdam

In [18]:
%%time

# num_labels can be updates so we could extend this to predict more than just the toxicity.
model = BertForSequenceClassification.from_pretrained(model_type,cache_dir=None,num_labels=1)

CPU times: user 3.88 s, sys: 657 ms, total: 4.54 s
Wall time: 4.69 s


In [19]:
# This may not be necessary, but just in case

# class LinearRegressionModel(torch.nn.Module):
#     '''
#     https://hackernoon.com/linear-regression-in-x-minutes-using-pytorch-8eec49f6a0e2
#     '''
#     def __init__(self, input_dim, output_dim):

#         super(LinearRegressionModel, self).__init__() 
#         self.linear = torch.nn.Linear(input_dim, output_dim)

#     def forward(self, x):
#         out = self.linear(x)
#         return out
    
# model.classifier = LinearRegressionModel(model.classifier.in_features, 1)

Set up model for back propogation

In [20]:
model.zero_grad()

In [21]:
# model.eval()

## Test

We are going to run a test on BERT before and after training, so we can see the results of training.

In [22]:
from torch.utils.data import DataLoader, RandomSampler
from tqdm import tqdm_notebook

In [23]:
def test_model(model, test):
    print(len(test))
    loader = DataLoader(test, batch_size=batch_size)
    mse = torch.nn.MSELoss()
    mses = []
    
    for step, (x, y) in tqdm_notebook(enumerate(loader), desc='Testing Data'):
        predictions = model(x)
        mses.append(mse(predictions, y))
        print(f'batch MSE: {mses[-1]}')
        
    print(f'MSE: {sum(mses) / float(len(mses))}')


In [24]:
test_model(model, test)

200


HBox(children=(IntProgress(value=1, bar_style='info', description='Testing Data', max=1, style=ProgressStyle(d…

  return F.mse_loss(input, target, reduction=self.reduction)


batch MSE: 0.2544938623905182
batch MSE: 0.31998899579048157
batch MSE: 0.23781757056713104
batch MSE: 0.3128339648246765
batch MSE: 0.16856695711612701
batch MSE: 0.2577625513076782


  return F.mse_loss(input, target, reduction=self.reduction)


batch MSE: 0.14601393043994904

MSE: 0.24249684810638428


## Fine-Tune BERT

In [25]:
from torch.nn import functional as F

In [26]:
train_optimization_steps = int(epochs*len(train)/batch_size/accumulation_steps)

In [27]:
param_optimizer = list(model.named_parameters())

optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

In [28]:
optimizer = BertAdam(
    optimizer_grouped_parameters,
    lr=learning_rate,
    warmup=warmup,
    t_total=train_optimization_steps)

In [29]:
%%time

criterion = torch.nn.MSELoss()  

for _ in tqdm_notebook(range(epochs), desc='epoch'):
    train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
    
    optimizer.zero_grad()
    model = model.train()

    for step, (x, y) in tqdm_notebook(enumerate(train_loader), desc='batch'):
        predictions = model(x)
        
        loss = criterion(predictions, y)
        
        optimizer.zero_grad()        
        loss.backward()
        optimizer.step()

HBox(children=(IntProgress(value=0, description='epoch', max=10, style=ProgressStyle(description_width='initia…

HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…

Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_to




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…

Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_to




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…

Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_to




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…

Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_to




HBox(children=(IntProgress(value=1, bar_style='info', description='batch', max=1, style=ProgressStyle(descript…

Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.
Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_to



CPU times: user 7h 36min 4s, sys: 19min 3s, total: 7h 55min 8s
Wall time: 1h 7min 18s


In [30]:
torch.save(model.state_dict(), model_output_path)

## Test Fine-Tuned Model

In [31]:
test_model(model, test)

200


HBox(children=(IntProgress(value=1, bar_style='info', description='Testing Data', max=1, style=ProgressStyle(d…

batch MSE: 0.03213440254330635
batch MSE: 0.04888967424631119
batch MSE: 0.03305823355913162
batch MSE: 0.04150481894612312
batch MSE: 0.027392998337745667
batch MSE: 0.03093765303492546
batch MSE: 0.015334195457398891

MSE: 0.03275028243660927
