# Building a RoBERTa-Classifier 


In this notebook we build a model to classify sentences. The classifier consists of a fully connected layer on top of the pooled [CLS] vector (for RoBERTa, the `<s>` token) of a RoBERTa model.

The implementation is based on:
- https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/
- https://huggingface.co/transformers/model_doc/bert.html
- https://huggingface.co/transformers/model_doc/roberta.html


First, mount the drive in which the data is stored.


In [None]:
#from google.colab import drive
#drive.mount('/content/gdrive')

## Loading all needed Libraries

In [None]:
! pip install transformers==3
! pip install tokenizers

In [None]:
import transformers
from transformers import RobertaModel, RobertaTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import re
from sklearn.utils import resample
import random
import pickle

- Get GPU

In [None]:
# Get Device 
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Used device is {device}')

## Loading and Combining Data into Dataset

In this section the data is loaded. We expect two lists of sentences, one per corpus. These are assigned labels, then combined and shuffled. Make sure the data is at least to some extent balanced.

Moreover, all sentences are lower-cased. Punctuation behaviour should also be similar in both corpora. It is recommended to run both corpora through the same preprocessing so that no purely formal give-aways (such as casing or punctuation) occur in the data.
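
One way to follow this recommendation is to pass every sentence of both corpora through a single normalization function. The sketch below is only an illustration: `normalize_sentence` and its rules (collapsing whitespace, removing the space before punctuation) are assumptions, not the preprocessing that produced the pickled files used in this notebook.

```python
import re

def normalize_sentence(sent):
    """Hypothetical shared preprocessing applied to both corpora."""
    sent = sent.lower()                           # same casing everywhere
    sent = re.sub(r"\s+", " ", sent).strip()      # collapse runs of whitespace
    sent = re.sub(r"\s+([.,;:!?])", r"\1", sent)  # no space before punctuation
    return sent

# Toy sentences that differ only in casing, spacing and punctuation style
corpus_a = ["This is  it .", "Deep Learning works !"]
corpus_b = ["this is it.", "deep learning works!"]

normalized_a = [normalize_sentence(s) for s in corpus_a]
normalized_b = [normalize_sentence(s) for s in corpus_b]
print(normalized_a == normalized_b)
```

After normalization the two toy corpora are indistinguishable by surface form, which is the property the classifier's training data should have.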

In [None]:
# Paths to data in drive // will be replaced at some point as data is put into Arthurs Drive 
datapath = '/evaluation/datasets/'
datapath_scientific = datapath + 'scientific_arxiv/'
datapath_almost_scientific = datapath + 'almost_scientific_medium/'



In [None]:
# Load Almost Scientific data
with open(datapath_almost_scientific + 'medium_processed_sentences.pkl', "rb") as f:  #os.path.join(args.data_dir, "template_%d.pickle" % i)
    data1 = pickle.load(f)

# lower-casing
data1 = [sent.lower() for sent in data1]

# non-scientific is assigned the label zero
labels1 = [0 for _ in range(len(data1))]

# looking at examples
for i in range(100):
  print(data1[i])

print()
print(f'Number of Medium Sentences: {len(data1)}')

In [None]:
# Load Scientific data
with open(datapath_scientific + 'ds_sentence_list_full.pkl', "rb") as f:  #os.path.join(args.data_dir, "template_%d.pickle" % i)
    data2 = pickle.load(f)

# Just using the same amount of data to be balanced
data2 = data2[:len(data1)]

# lower-casing, as for the Medium sentences (no effect if already lower-cased)
data2 = [sent.lower() for sent in data2]

# scientific sentences are given the label one
labels2 = [1 for _ in range(len(data2))]

# looking at examples
for i in range(5,20):
  print(data2[i])


print()
print(f'Number of DS Arxiv Sentences: {len(data2)}')  

**Sentences from both corpora should have the same appearance w.r.t. casing and punctuation. If this is not the case, data needs to be reworked.**

Next, the two corpora and their labels are combined into one dataset.

In [None]:
# Combine both datasets
combined_data = list(zip(data1, labels1)) + list(zip(data2, labels2))

# Shuffle with a fixed seed for reproducibility
random.Random(42).shuffle(combined_data)

# get list of sentences and corresponding labels
data, labels = zip(*combined_data)

# looking at examples
for i in range(10):
  print(data[i], '\t', labels[i])

### Data Overview 

As RoBERTa models need a maximum number of tokens to consider, the length distribution of the data set is analyzed here. Then, the data may be filtered given some threshold on the length of sentences.

In [None]:
len_data_words = [len(sent.split()) for sent in data]
len_data_chars = [len(sent) for sent in data]

print(f'Min words: {min(len_data_words)} | Max words: {max(len_data_words)}')
print(f'Min chars: {min(len_data_chars)} | Max chars: {max(len_data_chars)}')

fig, ax = plt.subplots(1,2, figsize = (15,6))

ax[0].hist(len_data_words, bins = 100)
ax[0].set_title('#words per sentence ', fontsize = 16)

ax[1].hist(len_data_chars, bins = 100)
ax[1].set_title('#chars per sentence ', fontsize = 16)
plt.show()

#### Filtering

Thresholds on the number of characters and words in a sentence should be set according to the histograms above.
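
Besides reading the thresholds off the histograms, a percentile can serve as a starting point. The following is a minimal sketch under that assumption; `suggest_threshold` is a hypothetical helper (nearest-rank percentile, standard library only) and the toy counts are made up for illustration.

```python
def suggest_threshold(lengths, percent=99):
    """Smallest length that covers at least `percent` % of the sentences."""
    ordered = sorted(lengths)
    # integer ceiling of percent * n / 100, at least 1
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Made-up word counts with one extreme outlier
toy_word_counts = [5, 8, 12, 15, 20, 22, 30, 45, 400]
print(suggest_threshold(toy_word_counts, percent=80))
```

In the notebook the same helper could be applied to `len_data_words` and `len_data_chars`; the value it returns is a suggestion to check against the plots, not a replacement for looking at them.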

In [None]:
# Filtering, as there seem to be sentences of unfeasible sizes
threshold_words = 100
threshold_chars =  500

data = [sent for i, sent in enumerate(data) if len_data_words[i] <= threshold_words and len_data_chars[i] <= threshold_chars]
labels = [label for i, label in enumerate(labels) if len_data_words[i] <= threshold_words and len_data_chars[i] <= threshold_chars]

assert len(data) == len(labels)

In [None]:
num_sentences = len(labels)
num_sci_sentences = sum(labels)
num_medium_sentences = num_sentences - num_sci_sentences

print(f' Total: {num_sentences} sentences')
print(f' Scientific: {num_sci_sentences} sentences ({(100*(num_sci_sentences/num_sentences)):.2f}%)')
print(f' Non-Scientific: {num_medium_sentences} sentences ({(100*(num_medium_sentences/num_sentences)):.2f}%)')

## Building RoBERTa Classifier Model
Here we use the classical RoBERTa model 'roberta-base'. This might be replaced by more specific pretrained models. 

### RoBERTa specific processing

In [None]:
# Pretrained RoBERTa model to be used. Its tokenizer is case-sensitive; as all sentences are lower-cased, slight information might be lost
PRE_TRAINED_MODEL_NAME = 'roberta-base'
tokenizer = RobertaTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

Notes on the tokenizer:
  - distinguishes between single whitespace and multiple
  - distinguishes punctuation
  - distinguishes lower and upper cases -> we use only lower cases

In [None]:
# Example
print(tokenizer.encode('This is  it .'))
print(tokenizer.encode('this is it.'))

- Get an understanding of the distribution of token counts per sentence and set MAX_LEN (the maximal sequence length used in RoBERTa) accordingly
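
To judge a candidate MAX_LEN, it can help to compute which share of sentences would be truncated. This is a small sketch, assuming two positions are reserved for the special tokens added around each sentence; `truncated_fraction` is a hypothetical helper and the token counts are made up.

```python
def truncated_fraction(token_lengths, max_len, n_special=2):
    """Share of sentences whose tokens do not fit into max_len positions."""
    budget = max_len - n_special  # positions left for the actual tokens
    n_truncated = sum(1 for n in token_lengths if n > budget)
    return n_truncated / len(token_lengths)

# Made-up token counts per sentence
toy_token_lengths = [10, 50, 98, 99, 120]
print(truncated_fraction(toy_token_lengths, max_len=100))
```

In the notebook, the list `len_data_tokens` would take the place of the toy list.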

In [None]:
len_data_tokens = [len(tokenizer.tokenize(sent)) for sent in data]

fig, ax = plt.subplots(1,1, figsize = (15,6))

ax.hist(len_data_tokens, bins = 100)
ax.set_title('# Token per sentence', fontsize = 16)
plt.show()

### Build PyTorch Dataset and DataLoader

This cell builds the basic functionality for the RoBERTa classifier: the Dataset, the DataLoaders, etc.
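
What truncation and padding to a fixed length do to a single sentence can be sketched in plain Python. `pad_and_mask` is a hypothetical stand-in for the tokenizer call, not its implementation. The default ids (0 for `<s>`, 2 for `</s>`, 1 for padding) are assumptions that match the usual `roberta-base` vocabulary; the real values should be read from the tokenizer.

```python
def pad_and_mask(token_ids, max_len, bos_id=0, eos_id=2, pad_id=1):
    """Truncate, add special tokens, pad to max_len and build the attention mask."""
    content = token_ids[: max_len - 2]          # leave room for <s> and </s>
    input_ids = [bos_id] + content + [eos_id]
    attention_mask = [1] * len(input_ids)       # 1 = real token
    n_pad = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * n_pad
    attention_mask = attention_mask + [0] * n_pad  # 0 = padding, ignored by the model
    return input_ids, attention_mask

short_ids, short_mask = pad_and_mask([10, 11, 12], max_len=6)
long_ids, long_mask = pad_and_mask([10, 11, 12, 13, 14, 15], max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item therefore has the same length, which is what allows the DataLoader to stack items into a batch tensor.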

In [None]:
# Data Structure
class SentenceDataset(Dataset):
    def __init__(self, sents, labels, tokenizer, max_len):
        self.sents = sents
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __len__(self):
        return len(self.sents)
    
    def __getitem__(self, item):
        sent = str(self.sents[item])
        label = self.labels[item]

        # use the tokenizer passed to the constructor, not the global one
        encoding = self.tokenizer(sent,
                                  truncation=True,
                                  add_special_tokens=True, # Add '<s>' and '</s>' (RoBERTa's counterparts of '[CLS]' and '[SEP]')
                                  return_token_type_ids=False,
                                  padding='max_length',
                                  max_length=self.max_len,
                                  return_attention_mask=True,
                                  return_tensors='pt')

        return {
            'sent': sent,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Data Loader
def create_data_loader(sentences, labels, tokenizer, max_len, batch_size):
    ds = SentenceDataset(
        sents=sentences, #.to_numpy()
        labels=labels, #.to_numpy()
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=2
    )

### Actual Model

In [None]:
class StyleClassifier(nn.Module):
    def __init__(self, n_classes, drop = 0.3):
        
        super(StyleClassifier, self).__init__()
        self.roberta = RobertaModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=drop)
        self.out = nn.Linear(self.roberta.config.hidden_size, n_classes)
        
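    def forward(self, input_ids, attention_mask):
        # index 1 is the pooled output; indexing works for the tuple returned by transformers 3
        # and also if the model returns an output object (newer transformers versions)
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[1]
        output = self.drop(pooled_output)
        return self.out(output)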

## Training the model

### Define Helper functions

Here some functions for training and evaluation are built, that will be used later on.
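
The helpers in this section turn model outputs (one score per class) into predictions by taking the position of the largest score, and accuracy is the share of predictions that match the labels. The sketch below shows that logic on plain Python lists; `accuracy_from_logits` is a hypothetical name and the numbers are made up.

```python
def accuracy_from_logits(logits, labels):
    """Predict the class with the highest score and compare with the labels."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Made-up outputs for three sentences and two classes
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.0]]
toy_labels = [0, 1, 1]
print(accuracy_from_logits(toy_logits, toy_labels))
```

The third sentence is misclassified (class 0 has the higher score, the label is 1), so two of three predictions are correct.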

In [None]:
# This provides just a way to illustrate our confusion matrices in a nice and labeled way
def show_confusion_matrix(cm, names, save_path = None):
  # cm: confusion matrix (2D array), names: class names used as row and column labels
  confusion_df = pd.DataFrame(cm, index=names,columns=names)
  plt.figure(figsize=(5,5))
  sns.heatmap(confusion_df,annot=True,annot_kws={"size": 12},cbar=False, square=True,fmt='.2f')
  plt.ylabel(r'True categories',fontsize=14)
  plt.xlabel(r'Predicted categories',fontsize=14)
  plt.tick_params(labelsize=12)
  if save_path:
    plt.savefig(save_path)
  plt.show()

In [None]:
def get_predictions(model, data_loader):
  # put to eval mode to disable dropout 
  model = model.eval()

  predictions = []
  prediction_probs = []
  real_values = []
  with torch.no_grad():
    for d in data_loader:

      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      labels = d["label"].to(device)

      outputs = model(input_ids=input_ids, attention_mask=attention_mask) # dim BATCH_SIZE x n_classes
      # torch.max(outputs, dim=1) returns (vals, positions) of maxima -> positions are kept and correspond to class labels
      _, preds = torch.max(outputs, dim=1) # dim BATCH_SIZE

      predictions.extend(preds)
      #prediction_probs.extend(outputs)
      real_values.extend(labels)
  predictions = torch.stack(predictions).cpu()
  #prediction_probs = torch.stack(prediction_probs).cpu()
  real_values = torch.stack(real_values).cpu()
  return predictions, real_values #prediction_probs

In [None]:
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    
    model = model.train()
    losses = []
    correct_predictions = 0
    
    show_every = np.floor(len(data_loader)/5)
    
    for i,d in enumerate(data_loader):
        
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        labels = d["label"].to(device)
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # torch.max(outputs, dim=1) returns (vals, positions) of maxima -> positions are kept and correspond to class labels
        _, preds = torch.max(outputs, dim=1)
        
        loss = loss_fn(outputs, labels) # outputs are logits (BATCH_SIZE x n_classes), labels are class indices in {0,1}
        correct_predictions += torch.sum(preds == labels)
        
        losses.append(loss.item())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
                
    return correct_predictions.double() / n_examples, np.mean(losses)

In [None]:
def eval_model(model, data_loader, loss_fn, device, n_examples):
    
    model = model.eval()
    losses = []
    correct_predictions = 0
    
    with torch.no_grad():
        for d in data_loader:
            
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            labels = d["label"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, labels)
            correct_predictions += torch.sum(preds == labels)
            losses.append(loss.item())
            
    return correct_predictions.double() / n_examples, np.mean(losses)

### Parameter choices

Here the parameters for the model need to be set (e.g. MAX_LEN from the token histogram above). The model is stored under the day it is trained on, and the hyperparameter choices are written into the name of the model directory so that the run is replicable.
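
The naming scheme can also be expressed with `pathlib` instead of shell commands. This is a sketch only: `make_run_dir` is a hypothetical helper, and a temporary directory stands in for the real base path, which has to be chosen to match the mounted drive.

```python
import datetime
import tempfile
from pathlib import Path

def make_run_dir(base_dir, epochs, batch_size, lr, dropout, day=None):
    """Create <base_dir>/<day>/<run name> and return its path."""
    day = day or datetime.date.today()
    run_name = f"RoBERTA_Epochs{epochs}_Bs{batch_size}_lr{lr}_drop{dropout}"
    run_dir = Path(base_dir) / str(day) / run_name
    run_dir.mkdir(parents=True, exist_ok=True)  # also creates missing parent directories
    return run_dir

# A temporary directory stands in for the real models folder
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = make_run_dir(tmp, epochs=3, batch_size=32, lr=1e-5, dropout=0.3)
    demo_name = demo_dir.name
    demo_exists = demo_dir.is_dir()
print(demo_name)
```

Because `exist_ok=True` is set, rerunning the cell on the same day with the same hyperparameters does not raise an error.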

In [None]:
import datetime
print(datetime.date.today())
today = datetime.date.today()
! mkdir -p /content/gdrive/MyDrive/StyleClassifier/models/{today}

In [None]:
MAX_LEN = 100 # chosen according to hist above
BATCH_SIZE = 32 # tunable hyper parameter
EPOCHS = 3
lr = 1e-5
dropout = 0.3

# Choose adequate paths to google drive
# -p also creates missing parent directories and does not fail if the directory exists
! mkdir -p /models/{today}/RoBERTA_Epochs{EPOCHS}_Bs{BATCH_SIZE}_lr{lr}_drop{dropout}

model_save_path = f'/models/{today}/RoBERTA_Epochs{EPOCHS}_Bs{BATCH_SIZE}_lr{lr}_drop{dropout}/RoBERTa.bin'# save model parameters, should include BERT and hyperparameters in name
img_save_path = f'/models/{today}/RoBERTA_Epochs{EPOCHS}_Bs{BATCH_SIZE}_lr{lr}_drop{dropout}/cm.png' # save confusion matrix, should include BERT and hyperparameters in name
report_save_path = f'/models/{today}/RoBERTA_Epochs{EPOCHS}_Bs{BATCH_SIZE}_lr{lr}_drop{dropout}/report.csv' # save classification report, should include BERT and hyperparameters in name

### Optimization


- Prepare data: split into training, validation and test sets. This partition of the data is also stored, so that one can later work with the exact same test data.
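
Storing the partition only pays off if it can be read back later. The sketch below shows such a round trip with `pickle`; the helper names, the toy sentences and the temporary directory are assumptions for illustration, while the dictionary layout (`train`, `val`, `test` holding sentence and label pairs) follows this notebook.

```python
import pickle
import tempfile
from pathlib import Path

def save_partition(partition, path):
    with open(path, "wb") as f:
        pickle.dump(partition, f)

def load_partition(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy partition with the same layout as model_specific_data
toy_partition = {
    "train": [("a first sentence.", 0), ("a second sentence.", 1)],
    "val": [("a third sentence.", 1)],
    "test": [("a fourth sentence.", 0)],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_specific_data.pkl"
    save_partition(toy_partition, path)
    restored = load_partition(path)

# Recover separate lists of sentences and labels for the test set
test_sentences, test_labels = zip(*restored["test"])
print(test_sentences, test_labels)
```

The recovered sentences and labels can then be passed to `create_data_loader` to rebuild the test DataLoader for a stored model.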

In [None]:
# Splitting data (50% training, 25% validation, 25% test) -> adapt split if necessary
data_train, data_valtest, labels_train, labels_valtest = train_test_split(data, labels, test_size=0.5, random_state = 42)
data_val, data_test, labels_val, labels_test = train_test_split(data_valtest, labels_valtest, test_size=0.5,random_state = 42)

assert len(data) == len(data_test)+len(data_train)+len(data_val)

print(f'Number of training sentences: \t {len(data_train)}')
print(f'Number of validation sentences:  {len(data_val)}')
print(f'Number of test sentences: \t {len(data_test)}')

# Create DataLoaders
train_data_loader = create_data_loader(data_train, labels_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(data_val, labels_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(data_test, labels_test, tokenizer, MAX_LEN, BATCH_SIZE)

model_specific_data = {}
model_specific_data['train'] = list(zip(data_train, labels_train))
model_specific_data['val'] = list(zip(data_val, labels_val))
model_specific_data['test'] = list(zip(data_test, labels_test))

In [None]:
# Save partition of data
data_save_path = f'/models/{today}/RoBERTA_Epochs{EPOCHS}_Bs{BATCH_SIZE}_lr{lr}_drop{dropout}/model_specific_data.pkl'

with open(data_save_path, 'wb') as f:
  pickle.dump(model_specific_data, f)

- Initialize model


In [None]:
# Initialize model
model = StyleClassifier(2, drop = dropout).to(device)

- Set everything optimization related
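
The loss used in this step, cross-entropy, applies a log-softmax to the model outputs and then takes the negative log-probability of the true class. The following sketch reproduces that computation for a single example in plain Python. It illustrates the formula under these assumptions and is not PyTorch's implementation; `cross_entropy` is a hypothetical name.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one example."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]  # log-softmax
    return -log_probs[label]                        # negative log-likelihood

print(cross_entropy([0.0, 0.0], 0))  # undecided model: loss equals log(2)
print(cross_entropy([2.0, 0.0], 0))  # confident and correct: small loss
print(cross_entropy([2.0, 0.0], 1))  # confident and wrong: large loss
```

An untrained binary classifier that assigns equal scores to both classes therefore starts with a loss of about log(2).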



In [None]:
#Optimizer & Scheduler
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)

scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps= len(train_data_loader) * EPOCHS
)


# CrossEntropyLoss is chosen because 'This criterion combines LogSoftmax and NLLLoss in one single class'.
# Given the input vector of logits, softmax is applied and then the NLL loss is computed
loss_fn = nn.CrossEntropyLoss().to(device)

- Actual optimization procedure

In [None]:
# Test Accuracy before training: should be roughly 50%
test_acc, _ = eval_model(model, test_data_loader, loss_fn, device, len(data_test))
print(f'Test Accuracy: {test_acc.item()}')

In [None]:
%%time
# Training the model for the desired amount of epochs (the cell magic %%time has to be the first line of the cell)
history = defaultdict(list)
best_accuracy = 0

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    train_acc, train_loss = train_epoch(model, train_data_loader, loss_fn, optimizer, device, scheduler, len(data_train))
      
    print(f'Train loss {train_loss} accuracy {train_acc}')
    val_acc, val_loss = eval_model(model, val_data_loader, loss_fn, device, len(data_val))
      
    print(f'Val   loss {val_loss} accuracy {val_acc}')
    print()
    
    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)
    
    if val_acc > best_accuracy:
        print(f'Model updated after epoch: {epoch+1} \n')
        if model_save_path:
          torch.save(model.state_dict(), model_save_path)
        best_accuracy = val_acc


In [None]:
# Test Accuracy after training: should be significantly better than previous 50%
test_acc, _ = eval_model(model, test_data_loader, loss_fn, device, len(data_test))
print(f'Test Accuracy: {test_acc.item()}')

## Model Evaluation

In [None]:
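# Trajectory of training and validation accuracy during training
# accuracies are stored as 0-dim tensors (possibly on the GPU), so they are converted to floats before plotting
plt.plot([float(acc) for acc in history['train_acc']], label='train accuracy')
plt.plot([float(acc) for acc in history['val_acc']], label='validation accuracy')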
plt.title('Training history')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
#plt.ylim([0, 1])
plt.yscale('log')
plt.show()

- Next we load our best model, as the current state might have overfitted
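
Which checkpoint ends up on disk follows from the rule used during training: the model is saved whenever the validation accuracy exceeds the best value seen so far. A minimal sketch of that rule, with made-up accuracies and a hypothetical helper name:

```python
def epochs_that_save(val_accuracies):
    """Return the (1-based) epochs after which a checkpoint would be written."""
    best = 0.0
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:  # strictly better than everything before
            saved.append(epoch)
            best = acc
    return saved

# Made-up validation accuracies: the last epoch is slightly worse
toy_val_acc = [0.90, 0.93, 0.92]
print(epochs_that_save(toy_val_acc))
```

In this toy run the file would hold the weights from epoch 2, while the model in memory is in the state after epoch 3, which is why reloading matters.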

In [None]:
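# load best model from optimization
if model_save_path:
  model = StyleClassifier(2, drop = dropout)
  # map_location makes the checkpoint loadable even if it was saved on a different device
  model.load_state_dict(torch.load(model_save_path, map_location=device))
  model = model.to(device)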

In [None]:
# Test Accuracy
test_acc, _ = eval_model(model, test_data_loader, loss_fn, device, len(data_test))
print(f'Test Accuracy: {test_acc.item()}')

- Calculate Scores 
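
The main per-class entries of the classification report are precision, recall and F1. For reference, the sketch below computes them by hand for one class from lists of true and predicted labels; `binary_scores` is a hypothetical helper and the labels are made up.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 1 = scientific, 0 = non-scientific
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
print(binary_scores(toy_true, toy_pred))
```

Here two scientific sentences are found, one is missed and one non-scientific sentence is wrongly flagged, so precision, recall and F1 all equal 2/3.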

In [None]:
# Calculation of relevant scores
labels_pred, labels_test = get_predictions(model, test_data_loader)
report = classification_report(labels_test, labels_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
if report_save_path:
  report_df.to_csv(report_save_path) 

report_df

In [None]:
cm = confusion_matrix(labels_test, labels_pred, normalize = 'true')
show_confusion_matrix(cm, names = ['0', '1'], save_path=img_save_path)