# **Multi Class Text Classification using BERT**
This notebook implements a multi-class classifier using RoBERTa, a BERT variant, to predict the severity of a Reddit user's depression level.

In order to create this code, I followed this tutorial https://towardsdatascience.com/multi-class-text-classification-with-deep-learning-using-bert-b59ca2f5c613.

This model is in response to, and uses the dataset from, task 3 of CLEF's eRisk 2021 workshop: https://early.irlab.org.

# **Environment Set-up**

### **Importing and Installing Required Libraries**

In [None]:
!pip install transformers
import torch
import pandas as pd
import numpy as np
import random
import csv

from tqdm.notebook import tqdm
from transformers import RobertaTokenizer
from torch.utils.data import TensorDataset
from transformers import RobertaForSequenceClassification
from sklearn.model_selection import train_test_split

#change to the correct directory
%cd /content/drive/MyDrive/CS408/

/content/drive/MyDrive/CS408


# **Preparing Data**
The dataset has been preprocessed here: https://colab.research.google.com/drive/1J8XAD7JPShZQuBRJ6zfFjV1dUtH4y858?usp=sharing

In [None]:
bdi_questions_answers = pd.read_csv("BDI_csv.csv")

#list of all 21 question names in BDI i.e. ['Sadness', 'Pessimism', ...]
questions = bdi_questions_answers['Question'].unique()

#list of lists where each element contains all possible answers for one question
answers = []

#look up in bdi_questions_answers df for all possible answers per question and add to answers
for question_name in questions:
  answers.append(bdi_questions_answers.loc[((bdi_questions_answers['Question'] == question_name)), 'Answer'].values)

## **Create Dataframes**

Convert the wrangled dataset CSVs into dataframes for ease of use later on. Creating variables and dataframes dynamically is bad practice, so I created the CSVs first and convert them here into their respective dataframes.
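
One alternative to a separate variable per question is a dictionary keyed by question name. The sketch below is illustrative only: the helper name `load_question_frames` is hypothetical, and it assumes the files follow the `answer_classes_posts_<Question>.csv` naming pattern used in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_question_frames(folder, question_names):
    """Return {question name: DataFrame}, skipping questions that have no CSV."""
    frames = {}
    for name in question_names:
        # Assumed naming pattern: answer_classes_posts_<Question>.csv
        path = Path(folder) / f"answer_classes_posts_{name}.csv"
        if path.exists():
            frames[name] = pd.read_csv(path)
    return frames
```

With such a dictionary, a single question's data would be looked up by name, for example `frames["Sadness"]`, and the BDI ordering could be kept as a plain list of question names.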

In [None]:
%cd question_csvs

/content/drive/MyDrive/CS408/question_csvs


In [None]:
#Creating data frames for each question 
agitation_df = pd.read_csv("answer_classes_posts_Agitation.csv") 
appetite_df = pd.read_csv("answer_classes_posts_Changes in Appetite.csv")
sleep_df = pd.read_csv("answer_classes_posts_Changes in Sleeping Pattern.csv") 
concentration_df = pd.read_csv("answer_classes_posts_Concentration Difficulty.csv") 
crying_df = pd.read_csv("answer_classes_posts_Crying.csv") 
guilty_df = pd.read_csv("answer_classes_posts_Guilty Feelings.csv")
indecisive_df = pd.read_csv("answer_classes_posts_Indecisiveness.csv") 
irritability_df = pd.read_csv("answer_classes_posts_Irritability.csv") 
energy_df = pd.read_csv("answer_classes_posts_Loss of Energy.csv") 
sexinterest_df = pd.read_csv("answer_classes_posts_Loss of Interest in Sex.csv") 
interest_df = pd.read_csv("answer_classes_posts_Loss of Interest.csv") 
pleasure_df = pd.read_csv("answer_classes_posts_Loss of Pleasure.csv") 
pastfailure_df = pd.read_csv("answer_classes_posts_Past Failure.csv") 
pessimism_df = pd.read_csv("answer_classes_posts_Pessimism.csv") 
punishment_df = pd.read_csv("answer_classes_posts_Punishment Feelings.csv") 
sadness_df = pd.read_csv("answer_classes_posts_Sadness.csv") 
selfcritcalness_df = pd.read_csv("answer_classes_posts_Self-Criticalness.csv") 
selfdislike_df = pd.read_csv("answer_classes_posts_Self-Dislike.csv") 
suicidal_df = pd.read_csv("answer_classes_posts_Suicidal Thoughts or Wishes.csv") 
fatigue_df = pd.read_csv("answer_classes_posts_Tiredness or Fatigue.csv") 
worthlessness_df = pd.read_csv("answer_classes_posts_Worthlessness.csv") 

#Create a list of the subject names as strings 
# (any df could be used for this)
subjects = agitation_df['Subject'].unique()

#Add all dataframes in order of BDI questions 
BDI_df = [sadness_df, pessimism_df, pastfailure_df, pleasure_df, guilty_df, punishment_df, selfdislike_df, selfcritcalness_df, suicidal_df, crying_df, agitation_df, interest_df, indecisive_df, worthlessness_df, energy_df, sleep_df, irritability_df, appetite_df, concentration_df, fatigue_df, sexinterest_df]

# **Create Classifier**

In [None]:
def evaluate(dataloader_test, model):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_test:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_test) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
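
The `predictions` array returned by `evaluate` holds raw logits, one row per post and one column per class. A minimal sketch of how they could be turned into predicted classes and a per-class tally follows; the helper name `accuracy_per_class` and the toy numbers are made up for illustration.

```python
import numpy as np


def accuracy_per_class(logits, true_labels):
    """Return {class id: (correct, total)} from raw logits and true labels."""
    # The predicted class is the column with the largest logit in each row
    predicted = np.argmax(logits, axis=1)
    summary = {}
    for label in np.unique(true_labels):
        mask = true_labels == label
        correct = int((predicted[mask] == label).sum())
        summary[int(label)] = (correct, int(mask.sum()))
    return summary


# Toy logits for 4 posts and 3 classes (made-up numbers)
toy_logits = np.array([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.8, 0.1],
                       [0.1, 0.2, 3.0]])
toy_labels = np.array([0, 1, 1, 2])

print(accuracy_per_class(toy_logits, toy_labels))
# {0: (1, 1), 1: (1, 2), 2: (1, 1)}
```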

In [None]:
%cd BERT_Classifier/

/content/drive/My Drive/CS408/BERT_Classifier


In [None]:
for i, df in enumerate(BDI_df):
  
  label_dict = pd.Series(df.Class.values,index=df.Answer).to_dict()
  df['label'] = df.Answer.replace(label_dict)

  #set data and labels 
  X = df.index.values
  y = df.label

  #test and train split
  X_train, X_test, y_train, y_test = train_test_split(df.index.values, 
                                                    df.label.values, 
                                                    test_size=0.30, 
                                                    random_state=42, 
                                                    stratify=df.label.values)

  df['data_type'] = ['not_set']*df.shape[0]

  df.loc[X_train, 'data_type'] = 'train'
  df.loc[X_test, 'data_type'] = 'test'

  df.groupby(['Answer', 'label', 'data_type']).count()

  dataset_train, dataset_test = build_and_encode()
  model = create_model()
  dataloader_train, dataloader_test = data_loaders(dataset_train, dataset_test)
  optimizer, scheduler = optimiser_and_scheduler(model)

  seed_val = 17
  random.seed(seed_val)
  np.random.seed(seed_val)
  torch.manual_seed(seed_val)
  torch.cuda.manual_seed_all(seed_val)

  device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
  model.to(device)

  val_loss, predictions, true_vals = evaluate(dataloader_test, model)

  #Training loop
  epochs = 5
  for epoch in tqdm(range(1, epochs+1)):
  
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:
        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})

    torch.save(model.state_dict(), f'data_volume/finetuned_BERT_epoch_{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    

  #after training, reload the checkpoint saved at the final epoch
  model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                    num_labels=len(label_dict),
                                                    output_attentions=False,
                                                    output_hidden_states=False)

  model.to(device)

  model.load_state_dict(torch.load(f'data_volume/finetuned_BERT_epoch_{epochs}.model', map_location=torch.device('cpu')))

  #load and evaluate model
  _, predictions, true_vals = evaluate(dataloader_test, model)

  #create dataframe storing the predicted answers, indexed by the test rows of df
  # (the test dataset is built from df[df.data_type=='test'], so predictions follow that order)
  test_indexes = df[df.data_type=='test'].index
  pred_flat = np.argmax(predictions, axis=1).flatten()
  resultsdf = pd.DataFrame({'Class': true_vals, 'Predicted Class': pred_flat}, index=test_indexes)

  #sort by index,
  # in order to match predicted answers with true class, subject and posts
  resultsdf.sort_index(inplace=True)
  sorted_indexes = resultsdf.index.to_list()

  #create csv per question, making it easy to calculate evaluation metrics
  filename = questions[i] + '_results.csv'
  with open(filename, mode='w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(['Subject', 'Post', 'Class', 'Predicted Class'])
    count = 0
    for j, row in df.iterrows():
      if count == len(sorted_indexes):
        break
      if (j == sorted_indexes[count]):
        subject = row['Subject']
        post = row['Post']
        real_class = resultsdf.loc[sorted_indexes[count], 'Class']
        pred_class = resultsdf.loc[sorted_indexes[count], 'Predicted Class']
        csv_writer.writerow([subject, post, real_class, pred_class])
        count += 1


### **Constructs a RoBERTa tokenizer and instantiates a pre-trained RoBERTa model configuration to encode our data.**

In [None]:
def build_and_encode():

  #create tokenizer
  tokenizer = RobertaTokenizer.from_pretrained('roberta-base', 
                                            do_lower_case=True)

  #encode training data                                         
  encoded_data_train = tokenizer.batch_encode_plus(
      df[df.data_type=='train'].Post.values, 
      add_special_tokens=True, 
      return_attention_mask=True, 
      padding=True, 
      max_length=128, 
      return_tensors='pt',
      truncation=True
  )

  #encode test data
  encoded_data_test = tokenizer.batch_encode_plus(
      df[df.data_type=='test'].Post.values, 
      add_special_tokens=True, 
      return_attention_mask=True, 
      padding = True,
      max_length=128, 
      return_tensors='pt',
      truncation=True
  )

  #train
  input_ids_train = encoded_data_train['input_ids']
  attention_masks_train = encoded_data_train['attention_mask']
  labels_train = torch.tensor(df[df.data_type=='train'].label.values)

  #test
  input_ids_test = encoded_data_test['input_ids']
  attention_masks_test = encoded_data_test['attention_mask']
  labels_test = torch.tensor(df[df.data_type=='test'].label.values)

  train_indexes = df[df.data_type=='train'].index.values.astype(int)
  test_indexes = df[df.data_type=='test'].index.values.astype(int)

  dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
  dataset_test = TensorDataset(input_ids_test, attention_masks_test, labels_test)

  return dataset_train, dataset_test
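
To make the effect of `padding=True`, `truncation=True` and `max_length` concrete, here is a simplified sketch that works on hypothetical integer token ids rather than calling the real tokenizer. The helper name `pad_and_truncate` is made up, the pad id of 1 is an assumption about RoBERTa's vocabulary, and the plain slice ignores the tokenizer's handling of the closing special token.

```python
def pad_and_truncate(sequences, max_length, pad_id=1):
    """Truncate each sequence to max_length, then right-pad to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Hypothetical token ids for three posts of different lengths
toy_posts = [[0, 10, 11, 2],
             [0, 10, 11, 12, 13, 14, 2],
             [0, 2]]

ids, mask = pad_and_truncate(toy_posts, max_length=5)
print(ids)   # [[0, 10, 11, 2, 1], [0, 10, 11, 12, 13], [0, 2, 1, 1, 1]]
print(mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
```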

### **Creates the RoBERTa pretrained model for classification**

In [None]:
def create_model():
  model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)
  return model

### **Create dataloaders that will combine a dataset and sampler, providing an iterable over the data.**

In [None]:
def data_loaders(dataset_train, dataset_test):
  
  from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

  batch_size = 3

  dataloader_train = DataLoader(dataset_train, 
                                sampler=RandomSampler(dataset_train), 
                                batch_size=batch_size)

  dataloader_test = DataLoader(dataset_test, 
                                    sampler=SequentialSampler(dataset_test), 
                                    batch_size=batch_size)
  
  return dataloader_train, dataloader_test
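
The difference between the two samplers can be sketched without PyTorch. The function below is a hypothetical stand-in, not the `DataLoader` implementation: it only shows how item indices would be grouped into batches, in original order for the test set and in shuffled order for the training set.

```python
import math
import random


def batch_indices(n_items, batch_size, shuffle, seed=None):
    """Group item indices into batches, optionally shuffling the order first."""
    order = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n_items, batch_size)]


# Sequential batching keeps the dataset order, so predictions line up with rows
sequential = batch_indices(7, batch_size=3, shuffle=False)
print(sequential)  # [[0, 1, 2], [3, 4, 5], [6]]

# Shuffled batching visits every item once per epoch, in a random order
shuffled = batch_indices(7, batch_size=3, shuffle=True, seed=17)
print(len(shuffled) == math.ceil(7 / 3))  # True
```

Because the sequential order is preserved, the k-th prediction belongs to the k-th row of the test dataset, which matters when predictions are later matched back to posts.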

### **Constructs an optimiser and a scheduler**

In [None]:
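def optimiser_and_scheduler(model):
  from torch.optim import AdamW
  from transformers import get_linear_schedule_with_warmup

  #create optimiser (torch's AdamW; the AdamW that shipped with transformers is deprecated)
  # eps and weight_decay are set to match the defaults of the deprecated transformers AdamW
  optimizer = AdamW(model.parameters(),
                    lr=2e-5,
                    eps=1e-6,
                    weight_decay=0.0)
                    
  #must match the number of epochs used in the training loop
  epochs = 5

  #build scheduler
  scheduler = get_linear_schedule_with_warmup(optimizer, 
                                              num_warmup_steps=0,
                                              num_training_steps=len(dataloader_train)*epochs)
  
  return optimizer, scheduler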
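
The shape of the linear schedule can be sketched in plain Python. The function below is my own approximation of the multiplier that `get_linear_schedule_with_warmup` applies to the base learning rate, not the library's code; the name `linear_lr_factor` is hypothetical.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimiser step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    # Linear decay from 1 to 0, clamped at 0 once the planned steps are used up
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
for step in (0, 50, 100, 120):
    factor = linear_lr_factor(step, num_warmup_steps=0, num_training_steps=100)
    print(step, base_lr * factor)
```

If the training loop runs for more steps than `num_training_steps`, the factor stays at zero, so the extra steps no longer update the weights. This is why the epoch count passed to the scheduler should cover the whole training loop.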

### **Results**
Unfortunately, this model took too long to run and evaluate, so I was only able to gather predicted answers to posts for the first 3 questions. The question_hr function only calculates the percentage of posts that were labeled correctly. Given more time, the answer would have been predicted per subject by counting the predicted labels across that subject's posts and choosing the one predicted most often.
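
The per-subject prediction described above was not implemented, but a minimal sketch of it could look like the following. It assumes a results dataframe with the `Subject` and `Predicted Class` columns written to the results CSVs; the helper name `predict_per_subject` and the toy rows are hypothetical.

```python
import pandas as pd


def predict_per_subject(results):
    """Return one predicted class per subject: the most frequent post-level prediction."""
    # mode() returns the most common values in ascending order,
    # so ties are broken in favour of the lowest class
    return (results.groupby('Subject')['Predicted Class']
                   .agg(lambda s: s.mode().iloc[0]))


# Toy post-level predictions for two made-up subjects
toy_results = pd.DataFrame({
    'Subject': ['subject1', 'subject1', 'subject1', 'subject2', 'subject2', 'subject2'],
    'Predicted Class': [0, 2, 2, 1, 1, 3],
})

per_subject = predict_per_subject(toy_results)
print(per_subject.to_dict())
```

Breaking ties toward the lowest class is an arbitrary choice made here for simplicity; another rule, such as the highest mean logit, could be substituted.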

In [None]:
question_hr(pd.read_csv('Sadness_results.csv'))

49.65

In [None]:
question_hr(pd.read_csv('Pessimism_results.csv'))

35.14

In [None]:
question_hr(pd.read_csv('Past Failure_results.csv'))

33.83

In [None]:
def question_hr(df):
  hr = 0
  for i, row in df.iterrows():
    if row['Class'] == row['Predicted Class']:
      hr += 1
    
  return round(hr/len(df) * 100, 2)