Automatic classification of stigmatizing mental illness articles in online news journals - April 2022
Author: Alina Yanchuk - alinayanchuk@ua.pt


### Table of contents:

* [4. Classification](#chapter4)
    * [4.1 Requirements](#section_4_1)
    * [4.2 Imports](#section_4_2)
    * [4.3 Get data](#section_4_3)
    * [4.4 Train and Test dataset](#section_4_4)
    * [4.5 Model training/fine-tuning](#section_4_5)
    * [4.6 Evaluation analysis](#section_4_6)

# 4. Classification with BERT (BERTimbau) <a class="anchor" id="chapter4"></a>

Most labeled text datasets are too small to train a deep neural network from scratch and reach good accuracy. Pre-trained models address this through transfer learning, a technique in which a deep learning model trained on a large dataset is reused for a similar task on another dataset. The model is already pre-trained and only needs to be fine-tuned for the specific task. BERT is one example of such a model.

- BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks;
- BERT uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after;
- BERTimbau is a BERT model trained on the Portuguese language. Its BERT-Base and BERT-Large Cased variants were trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps, using whole-word masking.
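
Whole-word masking, mentioned in the last bullet, can be illustrated with a small sketch. It assumes the usual WordPiece convention in which continuation pieces start with `##`. The token split, the function names and the 15% masking probability are illustrative assumptions, and the sketch leaves out other details of BERT pre-training (such as sometimes keeping or randomly replacing the chosen tokens).

```python
import random


def group_whole_words(tokens):
    """Group token indices into words; pieces starting with '##' continue the previous word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words


def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Mask every piece of a word together, so a word is never only partly hidden."""
    rng = random.Random(seed)
    masked = list(tokens)
    for word in group_whole_words(tokens):
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"
    return masked


# Hypothetical WordPiece split, for illustration only
tokens = ["o", "matem", "##ático", "morre", "##u"]
print(group_whole_words(tokens))  # [[0], [1, 2], [3, 4]]
print(whole_word_mask(tokens, mask_prob=0.5, seed=1))
```

The point of the grouping step is that `matem` and `##ático` are either both masked or both visible, which makes the prediction task harder than masking single pieces.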



References:
    1. https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification
    2. https://github.com/neuralmind-ai/portuguese-bert
    3. https://medium.com/swlh/a-simple-guide-on-using-bert-for-text-classification-bbf041ac8d04


## 4.1 Requirements <a class="anchor" id="section_4_1"></a>

In [None]:
#pip install transformers

In [None]:
#pip install torch

## 4.2 Imports <a class="anchor" id="section_4_2"></a>

In [2]:
import pandas as pd

from sklearn.model_selection import train_test_split

from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup
import torch
from torch import optim, cuda
from torch.utils.data import DataLoader

from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import precision_recall_fscore_support as scores

2022-03-31 19:27:46.234929: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-31 19:27:46.235068: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


## 4.3 Get data <a class="anchor" id="section_4_3"></a>

In [3]:
data = pd.read_pickle('data_preprocessed.pkl')
data.head()

Unnamed: 0,label,content
0,0,prisão perpétua homem tentou assassinar senado...
1,0,john nash matemático mente brilhante morre aci...
2,1,mito reeleição mínima garantida cavaco sairá d...
3,0,morreu rita levintalcini grande dama ciência i...
4,0,trás porta amarela homem problemas psicológico...


## 4.4 Train and Test dataset <a class="anchor" id="section_4_4"></a>

In [4]:
# Split the dataset into 80% train and 20% test
# Then split the train portion into 80% train and 20% validation

data = data.loc[:,['content', 'label']]

data_train, data_test = train_test_split(data, train_size=0.8, random_state=55, stratify=data.label.values)
data_train, data_val = train_test_split(data_train, random_state=55, train_size=0.8, stratify=data_train.label.values)

train = [{'X': content, 'y': label} for (content, label) in zip(data_train.content, data_train.label)]
test = [{'X': content, 'y': label} for (content, label) in zip(data_test.content, data_test.label)]
val = [{'X': content, 'y': label} for (content, label) in zip(data_val.content, data_val.label)]

print("Number of news in train dataset: " + str(len(train)))
print("Number of news in validation dataset: " + str(len(val)))
print("Number of news in test dataset: " + str(len(test)))

Number of news in train dataset: 412
Number of news in validation dataset: 104
Number of news in test dataset: 129
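
The sizes printed above follow from applying the 80/20 split twice, which gives roughly 64% train, 16% validation and 20% test overall. The arithmetic can be checked with the sketch below. `split_sizes` is a hypothetical helper, and the rounding rule (floor on the training side, remainder on the other side) reflects my understanding of how `train_test_split` handles a fractional `train_size`, so treat it as an assumption.

```python
import math


def split_sizes(n_samples, train_size):
    """Return (n_train, n_rest) for a fractional train_size, flooring the train side."""
    n_train = math.floor(train_size * n_samples)
    return n_train, n_samples - n_train


n_total = 412 + 104 + 129  # 645 articles, from the counts printed above
n_trainval, n_test = split_sizes(n_total, 0.8)
n_train, n_val = split_sizes(n_trainval, 0.8)

print(n_train, n_val, n_test)  # 412 104 129
print(round(n_train / n_total, 2), round(n_val / n_total, 2), round(n_test / n_total, 2))
```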


## 4.5 Model training/fine-tuning <a class="anchor" id="section_4_5"></a>

In [5]:
# Set relevant parameters

pretrained_model_name = 'neuralmind/bert-base-portuguese-cased'
n_classes = 2 # Binary problem
n_epochs = 4 
batch_size = 8
batch_status = 32
learning_rate = 1e-5
early_stop = 2 
max_length = 480 # Maximum number of tokens per text; longer texts are truncated
device = 'cuda' if cuda.is_available() else 'cpu'  # GPU or CPU
print(f"Using: {device}")

Using: cpu
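
The tokenizer calls later in this section use `padding=True, truncation=True, max_length=max_length`. As far as I know, that combination truncates each text to at most `max_length` tokens and then pads the batch to its longest remaining sequence, rather than always padding to `max_length`. The sketch below mimics that behaviour on plain lists of token ids. `pad_and_truncate` and `pad_id=0` are illustrative assumptions, and the real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_and_truncate(batch_token_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest sequence in the batch."""
    truncated = [ids[:max_length] for ids in batch_token_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Hypothetical token ids for a batch of two texts
input_ids, attention_mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(input_ids)       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```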


In [7]:
# Parse data into batches of tensors

traindata = DataLoader(train, batch_size=batch_size, shuffle=True)
valdata = DataLoader(val, batch_size=batch_size, shuffle=True)
testdata = DataLoader(test, batch_size=batch_size, shuffle=True)

In [8]:
# Get the pre-trained model, its tokenizer, an optimizer and a scheduler

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)
pretrained_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, num_labels=n_classes).to(device) # Bert Model transformer with a sequence classification head on top (a linear layer on top of the pooled output) 

optimizer = optim.AdamW(pretrained_model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = n_epochs*len(traindata))

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the
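
To make the scheduler settings above concrete, the sketch below reproduces the multiplier that a linear schedule with warmup applies to the base learning rate: it rises linearly during warmup (none here, since `num_warmup_steps = 0`) and then decays linearly to zero at `num_training_steps`. `linear_schedule_factor` is a hypothetical re-implementation written for illustration, not the library function itself. The step count assumes 412 training articles in batches of 8 with the last partial batch kept.

```python
import math


def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


n_epochs, batch_size, n_train = 4, 8, 412
steps_per_epoch = math.ceil(n_train / batch_size)  # 52 batches per epoch
total_steps = n_epochs * steps_per_epoch           # 208 optimizer steps
base_lr = 1e-5

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

With these settings the learning rate starts at `1e-5`, is halved midway through training and reaches zero after the last batch of the fourth epoch.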

In [9]:
# Function to evaluate the validation dataset

def evaluate_val(model, valdata):
  y_real, y_pred = [], []
  losses = []

  model.eval()
  
  for batch_idx, inp in enumerate(valdata):
    texts, labels = inp['X'], inp['y']

    with torch.no_grad():
      # Classifying
      inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(device)
      output = model(**inputs, labels=labels.to(device))
                  
      pred_labels = torch.argmax(output.logits, 1)

      loss = output.loss
      losses.append(float(loss.item()))
      
      y_real.extend(labels.tolist())
      y_pred.extend(pred_labels.tolist())

    if (batch_idx+1) % batch_status == 0:
      print('Progress:', round(batch_idx / len(valdata), 2), batch_idx)

  avg_loss = round(sum(losses) / len(losses), 5)
  print(classification_report(y_real, y_pred, labels=[0, 1], target_names=['Literal', 'Stigma']))

  return avg_loss

In [None]:
# Function to store evaluation metrics for the final model testing

def evaluate_test(model, testdata):
  evaluation_metrics_list = []
  y_real, y_pred = [], []

  model.eval()
  
  for batch_idx, inp in enumerate(testdata):
    texts, labels = inp['X'], inp['y']
    
    with torch.no_grad():
      # Classifying
      inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(device)
      output = model(**inputs)
                  
      pred_labels = torch.argmax(output.logits, 1)
      
      y_real.extend(labels.tolist())
      y_pred.extend(pred_labels.tolist())

  # Performance metrics
  accuracy = accuracy_score(y_real, y_pred)*100

  # Precision, recall and F1 scores, micro-averaged over both classes
  precision, recall, f1score, support = scores(y_real, y_pred, average='micro')

  # Add metrics to evaluation list
  evaluation_metrics_list.append(dict([
      ('Model', 'BERTimbau'),
      ('Accuracy (%)', round(accuracy, 2)),
      ('Precision', round(precision, 2)),
      ('Recall', round(recall, 2)),
      ('F1', round(f1score, 2))
  ]))

  return evaluation_metrics_list
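
One consequence of `average='micro'` in the function above is worth spelling out: when every article has exactly one label, micro-averaged precision, recall and F1 all equal the accuracy. The sketch below shows this with hand-written counts. `micro_scores` and the toy labels are illustrative assumptions, not results from this dataset.

```python
def micro_scores(y_true, y_pred, labels):
    """Micro-average: pool true/false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 0 = Literal, 1 = Stigma
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = micro_scores(y_true, y_pred, labels=[0, 1])
print(accuracy, precision, recall, f1)
```

If the goal is to see how well the minority "Stigma" class is detected, `average='binary'` or `average='macro'` in `precision_recall_fscore_support` would give class-sensitive values instead.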

In [None]:
# Model training

best_loss = float('inf')
avg_val_loss = 0
all_losses = {'train_loss':[], 'val_loss':[]}

for epoch in range(n_epochs):

  losses = []

  pretrained_model.train()
  
  for batch_idx, inp in enumerate(traindata):
    texts, labels = inp['X'], inp['y']

    # Classifying
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(device) # Tokenize
    output = pretrained_model(**inputs, labels=labels.to(device))

    pretrained_model.zero_grad() # Clear any previously calculated gradients

    # Calculate loss
    loss = output.loss
    losses.append(float(loss)) # Accumulate losses over all batches

    # Backpropagation
    loss.backward()
    optimizer.step()

    # Update the learning rate
    scheduler.step() 

  print('Train Epoch: {}'.format(epoch))
  avg_loss = round(sum(losses) / len(losses), 5)
  avg_val_loss = evaluate_val(pretrained_model, valdata) # Evaluation validation dataset for this epoch
  print(f'\nTraining Loss: {avg_loss:.3f}')
  print(f'\nValidation Loss: {avg_val_loss:.3f}')
  all_losses['train_loss'].append(avg_loss)
  all_losses['val_loss'].append(avg_val_loss)

  if avg_val_loss < best_loss:
    torch.save(pretrained_model.state_dict(), 'saved_weights.pt') # Save best weights
    best_loss = avg_val_loss
    print('Saving best model...')
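
The parameter `early_stop = 2` is set earlier, but the loop above runs for all `n_epochs` and only saves the best weights. If the intent was patience-based early stopping, it could look like the sketch below. `epochs_until_stop` is a hypothetical helper and the validation losses are made-up numbers for illustration.

```python
def epochs_until_stop(val_losses, patience):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0  # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch
print(epochs_until_stop([0.60, 0.45, 0.47, 0.50, 0.40], patience=2))  # 3
print(epochs_until_stop([0.50, 0.40, 0.30], patience=2))              # None
```

In the training loop, the same counter would be updated next to the `avg_val_loss < best_loss` check, with a `break` once it reaches `early_stop`.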

## 4.6 Evaluation analysis <a class="anchor" id="section_4_6"></a>

In [3]:
# Evaluate on the test dataset

# Load the best model weights
pretrained_model.load_state_dict(torch.load('saved_weights.pt'))

evaluation_metrics = evaluate_test(pretrained_model, testdata)
evaluation = pd.DataFrame(data=evaluation_metrics)
evaluation.columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1']
evaluation = evaluation.sort_values(by='Accuracy', ascending=False)
evaluation