### Training BERT on Labeled Endometriosis Dataset
This notebook fine-tunes, evaluates, and saves a BERT-style model (DistilBERT for now) on our labeled paragraphs (or posts) from the endometriosis dataset.

Additional resources for this code:


*   HuggingFace's docs on [fine-tuning a pre-trained model](https://huggingface.co/docs/transformers/training)
*   BERT for Humanists' [Fine-Tuning for Classification](https://colab.research.google.com/drive/19jDqa5D5XfxPU6NQef17BC07xQdRnaKU?usp=sharing) tutorial



In [None]:
import os

# Set label_class_annotations, label_type, and chronic_conditions_dir before running
label_class_annotations = 'combined_negligence.csv'
# Note: label_type must match the CSV column name exactly. For the endo support
# community data the column may use spaces instead of dashes; if so, change it here.
label_type = "PERCEIVED-NEGLIGENCE"

# point to your chronic conditions dir on drive
chronic_conditions_dir = '/content/drive/MyDrive/chronic_conditions/'
annotations_file_path = os.path.join(chronic_conditions_dir, 'labeling', 'annotated-data', 'formatted_csvs', label_class_annotations)
model_output_path = os.path.join(chronic_conditions_dir, 'code', 'output', label_type)

In [None]:
# Basic Python modules
from collections import defaultdict
import random
import pickle

# For data manipulation and analysis
import pandas as pd
import numpy as np

# For machine learning tools and evaluation
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, cross_val_score, train_test_split

# For deep learning
# https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
import torch

Install the HuggingFace 🤗 transformers library

In [None]:
!pip3 install transformers

Collecting transformers
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting 

In [None]:
# using DistilBERT for testing --> can switch to BERT once set up
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

In [None]:
# Choose the BERT model that we want to use (make sure to keep the cased/uncased consistent)
model_name = 'distilbert-base-uncased'  

# Use the GPU if one is available; otherwise fall back to the CPU
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'

# Maximum number of tokens in any document sent to BERT (longer documents are truncated)
max_length = 512

In [None]:
# Mount the Google drive for access to files
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Set up classification task

In [None]:
# Read in annotated data that will be used for training/testing
annotations_df = pd.read_csv(annotations_file_path)

In [None]:
# Set up training and testing sets
X = annotations_df["text"].to_list()
# For perceived negligence, the column name may use a space instead of a dash; adjust label_type if this lookup fails
y = annotations_df[label_type].to_list()

In [None]:
train_texts, test_texts, train_labels, test_labels = train_test_split(X, y, test_size = 0.25)
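
The labels in this dataset are imbalanced (see the counts below), so a plain random split can leave the two sets with slightly different class ratios, and without a fixed seed the split changes on every run. A minimal sketch of passing `stratify` and `random_state` to `train_test_split`, using toy stand-ins for `X` and `y` (the `toy_texts` and `toy_labels` names and the seed value are hypothetical):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 negative and 20 positive examples
toy_texts = [f"post {i}" for i in range(100)]
toy_labels = [0] * 80 + [1] * 20

# stratify keeps the 80/20 class ratio in both splits;
# random_state makes the split reproducible across runs
tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

train_counts = Counter(tr_labels)
test_counts = Counter(te_labels)
print(train_counts, test_counts)
```

Note that switching the real split to this form would change which posts land in the test set, so the numbers reported later in this notebook would no longer match exactly.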

In [None]:
from collections import Counter
Counter(train_labels)

Counter({0: 1193, 1: 179})
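
The counts above show roughly one positive label for every 6.7 negatives. One common response to this kind of imbalance is to weight each class inversely to its frequency. The sketch below computes such weights from the counts printed above using the "balanced" heuristic `n_samples / (n_classes * count)`. It is an illustration only: this notebook trains without class weights, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter

def balanced_class_weights(label_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in label_counts.items()}

# Training label counts taken from the output above
train_label_counts = Counter({0: 1193, 1: 179})
weights = balanced_class_weights(train_label_counts)
for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.3f}")
```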

### BERT Encoding 

In [None]:
# load the encoder/tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

In [None]:
test_texts[1:10]

['1. Is it possible that switching from the BC pill (Loestrin) to my IUD (Skyla, and now Kyleena) allowed my symptoms to develop, or is this just a coincidence in timing? Admittedly I’m still learning about the hormone side of things.',
 "I had nailed down my diet after my diagnosis in January 2020. No sugar, no gluten, no caffeine. But then, by October 2020, I digressed. Now, I don't exercise any diet control at all. This, of course, has resulted in tremendous weight gain (combined with no workouts because of how bad the pandemic situation in India is), bloating, occasional cramps, and an overall shitty feeling. ",
 "Last month I had an operation to have one of my ovaries removed due to complications with endometriosis and my ovary, which is apparently supposed to be the size of an almond, expanded to the size of a large cantelope and nearly ruptured. My doctors were incredible and I made a speedy recovery. At my follow up appointment my doctor gave me the diagnosis of endometriosis, 

In [None]:
# Tokenize the training/testing texts: truncate anything longer than max_length tokens
# and pad shorter texts with PAD tokens up to the longest text in each set
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=max_length)
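
To make the two tokenizer options concrete, here is a toy sketch of what truncation and padding do to lists of token ids. It is not the HuggingFace implementation: `pad_and_truncate` and the ids are hypothetical, `0` is assumed as the padding id, and the real tokenizer also keeps the closing special token when truncating, which this toy version does not.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # The attention mask marks real tokens with 1 and padding with 0
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask

toy_batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 102]]
input_ids, attention_mask = pad_and_truncate(toy_batch, max_length=5)
print(input_ids)
print(attention_mask)
```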

## Convert into a Torch Dataset
Combine encoded text and labels into a torch dataset object.

In [None]:
class SCDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [None]:
train_dataset = SCDataset(train_encodings, train_labels)
test_dataset = SCDataset(test_encodings, test_labels)

## Set up the training task

Choose the arguments for the HuggingFace `TrainingArguments` object, which is then passed to the HuggingFace `Trainer` object.

In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    learning_rate=5e-5,              # initial learning rate for Adam optimizer
    warmup_steps=50,                 # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
    evaluation_strategy='steps',     # evaluate every eval_steps steps (defaults to logging_steps)
)
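
As a rough check on these settings, the sketch below works out how many optimization steps the run takes and what learning rate a linear warmup followed by linear decay would give at a few steps. It assumes the 1,372 training examples counted earlier, a single device, and a scheduler left at the Trainer's default linear type. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
import math

num_train_examples = 1372   # 1193 + 179 from the label counts above
batch_size = 16
num_epochs = 3
warmup_steps = 50
peak_lr = 5e-5

# The last, smaller batch of each epoch still counts as a step
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_warmup_lr(step):
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for step in (0, 25, 50, 154, total_steps):
    print(f"step {step:3d}: lr = {linear_warmup_lr(step):.2e}")
```

The total of 258 steps agrees with the "Total optimization steps" line in the training log further down.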

Load the pretrained model and move it to the selected device (`device_name`). This model was pretrained on a range of English-language texts, such as Wikipedia entries and books. Fine-tuning adapts it to our corpus (in this case, Reddit posts about endometriosis) and to our classification labels.

In [None]:
model = DistilBertForSequenceClassification.from_pretrained(model_name).to(device_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_clas

In [None]:
# Define a custom evaluation function (this could be changed to return additional metrics)
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
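
Accuracy alone can look good on imbalanced labels, so one option is to also report precision, recall, and F1 for the positive class. A sketch of such a variant follows, using the `precision_recall_fscore_support` function already imported above. The name `compute_metrics_with_f1` and the toy logits are hypothetical, and `SimpleNamespace` stands in for the prediction object the Trainer passes in.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics_with_f1(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # average='binary' scores the positive class (label 1) only
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary', zero_division=0
    )
    return {
        'accuracy': accuracy_score(labels, preds),
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

# Toy logits for four examples; argmax gives predictions [0, 1, 1, 0]
toy_pred = SimpleNamespace(
    predictions=np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.1, 0.4]]),
    label_ids=np.array([0, 1, 0, 1]),
)
metrics = compute_metrics_with_f1(toy_pred)
print(metrics)
```

A function like this could be passed to the `Trainer` through the same `compute_metrics` argument, in which case the extra values would appear alongside accuracy in the evaluation output.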

Create the trainer object based on what we've set up prior to this point! This combines our `model`, `training_args`, `train_dataset` and `test_dataset`, and custom evaluation function `compute_metrics`. 

In [None]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,            # evaluation dataset
    compute_metrics=compute_metrics      # custom evaluation function
)

Fine-tune the model on our dataset/labels. The trainer object will periodically output the state of the model.

In [None]:
trainer.train()

***** Running training *****
  Num examples = 1372
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 258


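| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 10   | 0.7033        | 0.64512         | 0.866812 |
| 20   | 0.5586        | 0.429336        | 0.866812 |
| 30   | 0.3945        | 0.391151        | 0.866812 |
| 40   | 0.3534        | 0.399153        | 0.866812 |
| 50   | 0.3553        | 0.374018        | 0.866812 |
| 60   | 0.2979        | 0.390544        | 0.866812 |
| 70   | 0.4153        | 0.379014        | 0.866812 |
| 80   | 0.4212        | 0.35679         | 0.866812 |
| 90   | 0.3146        | 0.382486        | 0.844978 |
| 100  | 0.3269        | 0.389938        | 0.818777 |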


***** Running Evaluation *****
  Num examples = 458
  Batch size = 20

TrainOutput(global_step=258, training_loss=0.2748474917670553, metrics={'train_runtime': 779.5499, 'train_samples_per_second': 5.28, 'train_steps_per_second': 0.331, 'total_flos': 545235812868096.0, 'train_loss': 0.2748474917670553, 'epoch': 3.0})

In [None]:
# Built-in evaluation function (runs compute_metrics on the evaluation dataset)
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 458
  Batch size = 20


{'epoch': 3.0,
 'eval_accuracy': 0.851528384279476,
 'eval_loss': 0.48460227251052856,
 'eval_runtime': 16.0167,
 'eval_samples_per_second': 28.595,
 'eval_steps_per_second': 1.436}

### Save the model

In [None]:
model_output_path

'/content/drive/MyDrive/chronic_conditions/code/output/PERCEIVED-NEGLIGENCE'

In [None]:
model.save_pretrained(model_output_path)

Configuration saved in /content/drive/MyDrive/chronic_conditions/code/output/PERCEIVED-NEGLIGENCE/config.json
Model weights saved in /content/drive/MyDrive/chronic_conditions/code/output/PERCEIVED-NEGLIGENCE/pytorch_model.bin


## Assess performance

In [None]:
Counter(test_labels)

Counter({0: 397, 1: 61})
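
These counts give a useful baseline: a classifier that always predicts the majority class (0) would already be right on 397 of the 458 test examples. A quick check of that arithmetic (the variable names are illustrative):

```python
from collections import Counter

test_label_counts = Counter({0: 397, 1: 61})  # counts from the output above

majority_label, majority_count = test_label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(test_label_counts.values())
print(f"always predicting {majority_label}: accuracy = {baseline_accuracy:.6f}")
```

This is the same 0.866812 accuracy logged for the first 80 training steps, which suggests the model was predicting class 0 for every example at that stage. The final `eval_accuracy` of about 0.85 sits slightly below this baseline, so accuracy on its own says little about how well the positive class is detected.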

In [None]:
# trainer.predict returns the raw logits (.predictions) and the true labels (.label_ids)
predicted_labels = trainer.predict(test_dataset)

***** Running Prediction *****
  Num examples = 458
  Batch size = 20


In [None]:
# Convert the logits into predicted class labels by taking the highest-scoring class
actual_predicted_labels = predicted_labels.predictions.argmax(-1)
Counter(actual_predicted_labels)

Counter({0: 397, 1: 61})
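
The `predictions` array holds raw logits, one column per class, and `argmax(-1)` picks the larger one. When a probability or a custom decision threshold is more useful than a hard label, the logits can be passed through a softmax first. A minimal NumPy sketch with made-up logits (the `softmax` helper, the toy values, and the 0.3 threshold are all assumptions for illustration):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

toy_logits = np.array([[2.0, -1.0], [0.5, 0.0], [-0.5, 1.5]])
probs = softmax(toy_logits)

hard_labels = toy_logits.argmax(-1)             # same rule as the cell above
thresholded = (probs[:, 1] >= 0.3).astype(int)  # flag class 1 at a lower bar
print(probs.round(3))
print(hard_labels, thresholded)
```

Lowering the threshold for class 1 trades precision for recall, which may be worth exploring given how rare the positive label is here.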

In [None]:
Counter(predicted_labels.label_ids.flatten())

Counter({0: 397, 1: 61})

In [None]:
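from sklearn.metrics import classification_report
# Compare true labels (first argument) against predicted labels (second argument);
# keep a dict copy of the report for reuse and print the text version
class_report = classification_report(predicted_labels.label_ids.flatten(), actual_predicted_labels.flatten(), output_dict=True)
print(classification_report(predicted_labels.label_ids.flatten(), actual_predicted_labels.flatten()))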

              precision    recall  f1-score   support

           0       0.91      0.91      0.91       397
           1       0.44      0.44      0.44        61

    accuracy                           0.85       458
   macro avg       0.68      0.68      0.68       458
weighted avg       0.85      0.85      0.85       458

