# Training BERT

This notebook uses the data from `pre_process_data.py` and trains BERT models to classify toxicity.

## Potential Extensions

The most valuable addition would be to use the extra label columns, such as identity attack. With those, we could train a separate BERT model for each attack type and then combine their outputs with some form of logistic regression on top.
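To make the stacking idea concrete, here is a minimal sketch in plain Python. It assumes each per-attack BERT model has already produced a score between 0 and 1 for every comment. The column names other than identity attack, the scores, and the labels are all made up for illustration, and the tiny gradient-descent fit stands in for whatever logistic regression implementation we would really use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, steps=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias) - label
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(weights, bias, row):
    """Probability that a comment is toxic, given its per-model scores."""
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)

# Hypothetical scores from three per-attack models:
# [identity_attack, insult, threat] (only identity attack is named in this notebook).
stacked_scores = [
    [0.9, 0.8, 0.1],
    [0.1, 0.2, 0.0],
    [0.2, 0.9, 0.7],
    [0.0, 0.1, 0.1],
]
toxic = [1, 0, 1, 0]  # hypothetical overall toxicity labels

weights, bias = fit_logistic(stacked_scores, toxic)
print([round(predict(weights, bias, row), 2) for row in stacked_scores])
```

In practice the stacked model should be fit on scores from held-out data, otherwise it learns from predictions the BERT models have effectively memorised.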

## Configuration

#### Model

In [1]:
# model_type = 'bert-base-cased'
model_type = 'bert-base-uncased'
# model_type = 'bert-large-cased'
# model_type = 'bert-large-uncased'

output_model_file = "model_output.bin"

dataset_size = 1000 # set to None for full dataset

#### Learning Parameters

In [24]:
epochs = 1
learning_rate = 2e-5
warmup = 0.05
batch_size = 32
accumulation_steps = 2
seed = 0
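These parameters interact, so it helps to work out what they imply before training. The sketch below assumes that `batch_size` is the size of each batch passed through the model, that gradients are accumulated over `accumulation_steps` batches before every optimizer update, and that all `dataset_size` examples are used for training (the real training split may be smaller).

```python
# Values copied from the configuration cells above.
epochs = 1
batch_size = 32
accumulation_steps = 2
warmup = 0.05
dataset_size = 1000  # assumption: every example ends up in the training split

# Examples that contribute to each optimizer update.
effective_batch_size = batch_size * accumulation_steps

# Number of optimizer updates over the whole run (partial batches dropped).
optimizer_steps = epochs * dataset_size // effective_batch_size

# `warmup` is a fraction of the run, so this is the warmup length in updates.
warmup_steps = warmup * optimizer_steps

print(effective_batch_size, optimizer_steps, warmup_steps)
```

Under these assumptions the warmup phase is shorter than a single update, so the warmup setting only really matters once `dataset_size` is increased.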

## Check Configuration

This check is deliberately simple. Its main purpose is to make sure an existing model file isn't overwritten.

In [3]:
import os

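# Create the output directory if it does not exist yet.
os.makedirs('model', exist_ok=True)

model_output_path = os.path.join('model', output_model_file)
# Stop here if a saved model already exists at this path.
assert not os.path.exists(model_output_path), f'{model_output_path} already exists'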
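If stopping the notebook is too strict, one alternative is to pick a fresh file name automatically. This is only a sketch: `unused_path` is a hypothetical helper, not something this notebook defines, and the `_1`, `_2` suffix scheme is an arbitrary choice.

```python
import os

def unused_path(directory, filename):
    """Return a path inside `directory` that does not exist yet.

    If `filename` is taken, try `name_1.ext`, `name_2.ext`, and so on.
    """
    os.makedirs(directory, exist_ok=True)
    stem, ext = os.path.splitext(filename)
    candidate = os.path.join(directory, filename)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f"{stem}_{counter}{ext}")
        counter += 1
    return candidate
```

Calling `unused_path('model', output_model_file)` would then give `model/model_output.bin` on the first run and `model/model_output_1.bin` on the next.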

## Getting Data for BERT

In [10]:
import pickle
import torch

In [11]:
if dataset_size is None:
    data_path = model_type
else:
    data_path = f'{model_type}_{dataset_size}'

In [12]:
with open(os.path.join('data', f'{data_path}_training_data.pkl'), 'rb') as f:
    train = pickle.load(f)

In [13]:
with open(os.path.join('data', f'{data_path}_testing_data.pkl'), 'rb') as f:
    test = pickle.load(f)
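The path logic and the two loads follow the same pattern, so they could be folded into a pair of small helpers. The names `data_file` and `load_pickle` are hypothetical, and the file naming simply mirrors the cells above.

```python
import os
import pickle

def data_file(data_dir, model_type, dataset_size, split):
    """Build the pickle path for a split ('training' or 'testing')."""
    if dataset_size is None:
        prefix = model_type
    else:
        prefix = f'{model_type}_{dataset_size}'
    return os.path.join(data_dir, f'{prefix}_{split}_data.pkl')

def load_pickle(path):
    """Load one pickled object; the file is closed even if loading fails."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```

With these, `train = load_pickle(data_file('data', model_type, dataset_size, 'training'))` replaces the cells above. As always with pickle, only load files we produced ourselves.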

In [14]:
def spread_data(dataset):
    """Unpack (x, split_index, y) records into inputs, a label tensor, and split indexes."""
    x = []
    y = []
    split = []

    for val in dataset:
        x.append(val[0])
        split.append(val[1])
        y.append(val[2])

    return x, torch.tensor(y), split

In [15]:
train_x, train_y, split_indexes_train = spread_data(train)
test_x, test_y, split_indexes_test = spread_data(test)
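One detail that is easy to miss: each record is stored as `(x, split, y)`, but `spread_data` returns `(x, y, split)`. The sketch below shows the same unpacking on two made-up records. It returns plain lists instead of a tensor so that it runs without torch, and the record contents are placeholders, since the real ones come from `pre_process_data.py`.

```python
def spread_records(dataset):
    """Same unpacking as `spread_data`, but with labels as a plain list."""
    x, y, split = [], [], []
    for features, split_index, label in dataset:
        x.append(features)
        split.append(split_index)
        y.append(label)
    # Note the order: labels come second here, although they are third in each record.
    return x, y, split

# Hypothetical records: (made-up token ids, split index, toxicity label).
toy = [
    ([11, 42, 7], 0, 0.0),
    ([11, 93, 58, 7], 1, 1.0),
]

toy_x, toy_y, toy_split = spread_records(toy)
print(toy_x, toy_y, toy_split)
```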

## Loading BERT

In [16]:
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import numpy as np

In [17]:
torch.backends.cudnn.deterministic = True
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)

In [18]:
tokenizer = BertTokenizer.from_pretrained(model_type)
# Note: `model` is reassigned to a sequence classification model in the next section.
model = BertForMaskedLM.from_pretrained(model_type)

## Load Pre-Trained BERT Model

In [19]:
from pytorch_pretrained_bert import BertForSequenceClassification, BertAdam

In [20]:
%%time

# num_labels can be updated, so we could extend this to predict more than just toxicity.
model = BertForSequenceClassification.from_pretrained(model_type, cache_dir=None, num_labels=1)

CPU times: user 3.57 s, sys: 700 ms, total: 4.27 s
Wall time: 4.41 s


## Fine-Tune BERT

In [21]:
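# Assumption: the usual BERT fine-tuning setup, with no weight decay on biases or LayerNorm parameters.
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
named_params = list(model.named_parameters())
optimizer_grouped_parameters = [
    {'params': [p for n, p in named_params if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in named_params if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}]
# Assumption: one optimizer step per `accumulation_steps` batches of size `batch_size`.
num_train_optimization_steps = epochs * len(train_x) // (batch_size * accumulation_steps)
optimizer = BertAdam(optimizer_grouped_parameters, lr=learning_rate,
                     warmup=warmup, t_total=num_train_optimization_steps)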
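The `warmup` and `t_total` arguments together shape the learning rate over the run: `warmup` is the fraction of `t_total` spent ramping up. The sketch below shows a linear warmup followed by a linear decay, which is our understanding of the default behaviour. The exact formula inside `BertAdam` may differ between library versions, so treat this as an illustration, not the library's code. The step count of 15 is an assumed value.

```python
def warmup_linear(progress, warmup=0.05):
    """Learning-rate multiplier for training progress in [0, 1].

    Rises linearly from 0 to 1 over the first `warmup` fraction of training,
    then falls linearly back to 0 by the end.
    """
    if progress < warmup:
        return progress / warmup
    return max((1.0 - progress) / (1.0 - warmup), 0.0)

learning_rate = 2e-5
total_steps = 15  # assumption: stands in for `t_total`

# Learning rate at every optimizer step, from the first to the last.
schedule = [
    learning_rate * warmup_linear(step / total_steps)
    for step in range(total_steps + 1)
]

for step, lr in enumerate(schedule):
    print(step, f'{lr:.2e}')
```

With so few steps the warmup phase ends before the first update, so the schedule is almost entirely decay. A larger dataset or more epochs gives the ramp room to matter.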