# Predicting Clinical Trial Terminations
### Notebook 4.1: Advanced Modelling - Word Embedding

**Author: Clement Chan**

---
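Notes on the notebook:

- This notebook uses the pre-trained Bio_ClinicalBERT model: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
- Study titles are tokenized and used to fine-tune a binary classifier for `Study Status`.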

In [26]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# sklearn metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# import PyTorch
import torch

# huggingface/transformers
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer, TextClassificationPipeline

# load_dataset
from datasets import load_dataset

# suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Bio_ClinicalBert Setup
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

In [3]:
text_df = pd.read_csv('clean_ctg.csv', index_col=0)
text_df.head()

Unnamed: 0,Study Title,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,Secondary Outcome Measures,Sponsor,Collaborators,Sex,Age,Phases,Enrollment,Funder Type,Study Type,Study Design,Locations,study_duration
0,Effectiveness of a Problem-solving Interventio...,0,We will conduct a two-arm individually randomi...,0,"Mental Health Issue (E.G., Depression, Psychos...",behavioral: pride 'step 1' problem-solving int...,"Mental health symptoms, The Strengths and Diff...","Mental health symptoms, The adolescent-reporte...",Sangath,Harvard Medical School (HMS and HSDM)|London S...,ALL,"CHILD, ADULT",NO PHASE,210-490,OTHER,1,Allocation: RANDOMIZED|Intervention Model: PAR...,"Sangath, New Delhi, Delhi, 110016, India",123-244
1,Investigating the Effect of a Prenatal Family ...,0,The purpose of this study is to measure the di...,0,Focus: Contraceptive Counseling|Focus: Postpar...,other: family planning counseling let by commu...,"Self-reported contraceptive use, 6 months post...","Intent to use contraception in the future, 6 m...",Planned Parenthood League of Massachusetts,Society for Family Planning Research Fund,FEMALE,"CHILD, ADULT, OLDER_ADULT",NO PHASE,120-209,OTHER,1,Allocation: NON_RANDOMIZED|Intervention Model:...,Palestinian Ministry of Health Maternal Child ...,366-515
2,Pre-exposure Prophylaxis (PrEP) for People Who...,1,People who inject drugs (PWID) experience high...,0,Intravenous Drug Abuse,behavioral: prep uptake/adherence intervention...,"PrEP uptake by self-report, measured using 1 i...",Participant satisfaction with intervention con...,Boston University,National Institute on Drug Abuse (NIDA),ALL,"ADULT, OLDER_ADULT",NO PHASE,0-8,OTHER,1,Allocation: RANDOMIZED|Intervention Model: PAR...,unknown,245-365
3,Tailored Inhibitory Control Training to Revers...,0,Insufficient inhibitory control is one pathway...,0,Smoking|Alcohol Drinking|Prescription Drug Abu...,behavioral: person-centered inhibitory control...,"Inhibitory control performance, Task 1, Perfor...",Far transfer to a task related to inhibitory c...,University of Oregon,none,ALL,ADULT,NO PHASE,80-119,OTHER,1,Allocation: RANDOMIZED|Intervention Model: PAR...,"University of Oregon, Social and Affective Neu...",516-671
4,Neuromodulation of Trauma Memories in PTSD & A...,0,The purpose of this study is to examine the ef...,1,Alcohol Dependence|PTSD,drug: propranolol|drug: placebo,"Retrieval Session Distress Scores (Session 1),...","Proportion of Drinking Days, Proportion of dri...",Medical University of South Carolina,National Institute on Alcohol Abuse and Alcoho...,ALL,"ADULT, OLDER_ADULT",PHASE2,42-59,OTHER,1,Allocation: RANDOMIZED|Intervention Model: PAR...,"MUSC, Charleston, South Carolina, 294258908, U...",862-1097


In [4]:
sample = text_df['Study Title'][:2].tolist()
sample

['Effectiveness of a Problem-solving Intervention for Common Adolescent Mental Health Problems in India',
 'Investigating the Effect of a Prenatal Family Planning Counseling Intervention Led by Community Health Workers on Postpartum Contraceptive Use Among Women in the West Bank']

In [5]:
inputs = tokenizer(sample, return_tensors="pt", padding=True)
inputs

{'input_ids': tensor([[  101, 12949,  1104,   170,  2463,   118, 15097,  9108,  1111,  1887,
         25506,  4910,  2332,  2645,  1107,  1107,  7168,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [  101, 11950,  1103,  2629,  1104,   170,  3073, 24226,  1348,  1266,
          3693, 22138,  9108,  1521,  1118,  1661,  2332,  3239,  1113,  2112,
         17482,  8928, 14255,  4487, 17046,  1329,  1621,  1535,  1107,  1103,
          1745,  3085,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [6]:
inputs["input_ids"].shape

torch.Size([2, 33])

In [7]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

['[CLS]',
 'effectiveness',
 'of',
 'a',
 'problem',
 '-',
 'solving',
 'intervention',
 'for',
 'common',
 'adolescent',
 'mental',
 'health',
 'problems',
 'in',
 'in',
 '##dia',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']
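The token list above shows the shorter title being right-padded with `[PAD]` (id 0) up to the length of the longer one, while the attention mask marks which positions are real tokens. As a rough illustration only, the same bookkeeping can be sketched in plain Python. `pad_batch` is a hypothetical helper written for this note, not part of the tokenizer API, and the token ids are a shortened toy batch.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Toy batch: 101 = [CLS], 102 = [SEP], 0 = [PAD], matching the ids printed above
batch = [[101, 12949, 1104, 102], [101, 11950, 1103, 2629, 1104, 102]]
ids, mask = pad_batch(batch)
print(ids[0])
print(mask[0])
```

The real tokenizer performs this step internally when `padding=True` is passed.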

In [8]:
outputs = model(**inputs)

In [9]:
outputs.last_hidden_state.shape

torch.Size([2, 33, 768])
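The output has one 768-dimensional vector per token, so each title is a `33 x 768` matrix rather than a single embedding. One common way to reduce it to one vector per title is to average over the non-padding positions. The notebook itself does not do this, so the following is only a sketch, using small random NumPy arrays as stand-ins for `last_hidden_state` and `attention_mask`; `masked_mean_pool` is a name chosen here for illustration.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors over real (non-padding) positions only."""
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)                     # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)           # (batch, 1), avoids divide-by-zero
    return summed / counts

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))        # stand-in for last_hidden_state
attention_mask = np.array([[1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1]])
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)
```

With the real model output, the same function would take `outputs.last_hidden_state.detach().numpy()` and `inputs["attention_mask"].numpy()`.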

In [10]:
model.config

BertConfig {
  "_name_or_path": "emilyalsentzer/Bio_ClinicalBERT",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.32.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}


Here we load the pre-trained model again, this time with a sequence-classification head configured for our two output labels.

In [51]:
# model setup for text classification
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    num_labels=2,)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
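The warning is expected: the checkpoint contains only the BERT encoder, so the classification layer (`classifier.weight`, `classifier.bias`) starts from random values and has to be learned during fine-tuning. As a simplified picture, that head is a single linear layer from the 768-dimensional pooled vector to 2 logits. The NumPy sketch below uses random stand-in inputs and assumes BERT-style initialisation (normal weights with standard deviation 0.02, matching `initializer_range` in the config, and zero bias).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 2

# Freshly initialised head: small random weights, zero bias (untrained)
W = rng.normal(scale=0.02, size=(num_labels, hidden_size))
b = np.zeros(num_labels)

pooled = rng.normal(size=(1, hidden_size))  # stand-in for a pooled sentence vector
logits = pooled @ W.T + b                   # shape (1, 2)

# Softmax turns the logits into class probabilities
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
print(probs)
```

Until the head is trained, these probabilities carry no information about the study status.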


In [52]:
# Our target categories need to be encoded as integers, but we'll want to reverse this
# encoding back to the original categories later, so we need forward and reverse lookups.
# sorted() keeps the id assignment deterministic across runs.
id2label = {i:cat for i,cat in enumerate(sorted(set(text_df["Study Status"])))}
label2id = {v:k for k,v in id2label.items()}

In [53]:
# pull out columns of interest and do a manual train/test split:
# each row gets a random integer 0-9, and rows above 4 (about half) go to the test pool
simplified = text_df[["Study Status","Study Title"]].copy()
simplified.columns = ["label","text"]
simplified.loc[:,"label"] = list(label2id[lab] for lab in simplified["label"])
test_flag = np.random.randint(0,high=10,size=text_df.shape[0])
simplified.loc[:,'test'] = test_flag > 4
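Because `np.random.randint` is called without a seed, the split above changes on every run. A possible variation, shown here on a stand-in row count rather than the real dataframe, is to draw the same flags from a seeded generator so the split is reproducible. The seed value 42 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)                # fixed seed (arbitrary choice)
n_rows = 1000                                  # stand-in for text_df.shape[0]
test_flag = rng.integers(0, 10, size=n_rows)   # integers 0..9, upper bound excluded
is_test = test_flag > 4                        # values 5..9 -> roughly half the rows

test_fraction = is_test.mean()
print(round(test_fraction, 3))
```

Re-creating the generator with the same seed yields the same `is_test` array, which makes later results comparable between runs.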


In [54]:
# select 5000 of each category for a class-balanced training set
train = simplified[~simplified.test].groupby('label',group_keys=False).apply(lambda x: x.sample(5000))

In [55]:
train.shape

(10000, 3)

In [56]:
train[train['label'] == 1].count()

label    5000
text     5000
test     5000
dtype: int64

In [57]:
# select 10000 of each category for test evaluation during tuning
test = simplified[simplified.test].groupby('label',group_keys=False).apply(lambda x: x.sample(10000))

In [58]:
test.shape

(20000, 3)

In [59]:
test['label'].value_counts()

label
0    10000
1    10000
Name: count, dtype: int64

In [60]:
# save as CSV files to be processed by the hugging face data pipelines
train.reset_index(drop=True).to_csv("data/bert_train.csv")
test.reset_index(drop=True).to_csv("data/bert_test.csv")

In [61]:
# load as hugging face dataset
dataset = load_dataset('csv', data_files={'train': 'data/bert_train.csv', 'test': 'data/bert_test.csv'})

Downloading and preparing dataset csv/default to C:/Users/tingh/.cache/huggingface/datasets/csv/default-ee2b8e5d3f365f51/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Dataset csv downloaded and prepared to C:/Users/tingh/.cache/huggingface/datasets/csv/default-ee2b8e5d3f365f51/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


In [62]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

# pro forma hugging face text processing setup
tokenized_data = dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [63]:
# leverage sklearn metrics for runtime training evaluation
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
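    # NOTE: average='micro' pools counts over both classes; for single-label data this
    # makes precision, recall and F1 all equal to accuracy (see the identical columns below).
    recall = recall_score(y_true=labels, y_pred=pred, average='micro')
    precision = precision_score(y_true=labels, y_pred=pred, average='micro')
    f1 = f1_score(y_true=labels, y_pred=pred, average='micro')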

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
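The metric columns in the training logs are identical because of the `average='micro'` setting: when every example has exactly one label, each wrong prediction is both a false positive for the predicted class and a false negative for the true class. A small pure-Python check on toy labels (not the notebook's data) illustrates this; `micro_precision_recall` is a helper written for this note.

```python
def micro_precision_recall(y_true, y_pred):
    """Pool true/false positives and false negatives over all classes."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (p == c and t == c)
            fp += (p == c and t != c)
            fn += (p != c and t == c)
    return tp / (tp + fp), tp / (tp + fn)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = micro_precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

To report precision and recall for the positive class only, scikit-learn's `average='binary'` (the default) is the usual alternative.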

We set up the complete transfer-learning pipeline by creating a `TrainingArguments` instance with our chosen hyperparameters and a `Trainer` that brings all the components together: the model, the training arguments, the tokenized datasets, the data collator, and the evaluation function.

In [64]:
# training parameter setup goes in a specific class instance
training_args = TrainingArguments(
    output_dir="clinical-study-classifier",
    learning_rate=2e-5,
    optim="adamw_torch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False)

# the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)

In [65]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.672,0.663162,0.60375,0.60375,0.60375,0.60375
2,0.6267,0.671792,0.6195,0.6195,0.6195,0.6195
3,0.5031,0.727091,0.6082,0.6082,0.6082,0.6082
4,0.3589,1.230018,0.599,0.599,0.599,0.599
5,0.2694,1.999553,0.59845,0.59845,0.59845,0.59845


KeyboardInterrupt: 
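In the table above the validation loss is lowest after epoch 1 and rises in every later epoch while the training loss keeps falling, which is the usual signature of overfitting. A patience-based stopping rule would have ended this run early. The sketch below replays the logged validation losses through such a rule. `stop_epoch` and the patience value of 2 are illustrative assumptions, not something the notebook uses.

```python
# Validation losses for epochs 1-5, copied from the table above
val_loss = [0.663162, 0.671792, 0.727091, 1.230018, 1.999553]

def stop_epoch(losses, patience=2):
    """Return (best_epoch, stopped_at_epoch), both 1-indexed."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0  # new best: reset the counter
        else:
            waited += 1                                # no improvement this epoch
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stopped_at = stop_epoch(val_loss, patience=2)
print(best_epoch, stopped_at)
```

In the `transformers` library the same idea is available as `EarlyStoppingCallback`, which can be passed to `Trainer` through its `callbacks` argument.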

In [77]:
# build an inference pipeline and classify a single example title
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False, device='cuda')
pipe(["Developing a Diabetic Foot Ulcer Protocol"])

[{'label': 'LABEL_1', 'score': 0.9991926550865173}]

**Logs**

In [25]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.828775,0.587333,0.587333,0.587333,0.587333
2,0.497600,0.888766,0.5765,0.5765,0.5765,0.5765
3,0.497600,1.258348,0.585,0.585,0.585,0.585
4,0.230100,2.080483,0.573667,0.573667,0.573667,0.573667
5,0.230100,2.603837,0.564833,0.564833,0.564833,0.564833
6,0.073100,2.892967,0.571333,0.571333,0.571333,0.571333
7,0.073100,3.041043,0.566667,0.566667,0.566667,0.566667
8,0.037400,3.135675,0.574167,0.574167,0.574167,0.574167
9,0.037400,3.208559,0.5715,0.5715,0.5715,0.5715
10,0.007200,3.244623,0.573,0.573,0.573,0.573


TrainOutput(global_step=2500, training_loss=0.16911778354644774, metrics={'train_runtime': 1118.3021, 'train_samples_per_second': 17.884, 'train_steps_per_second': 2.236, 'total_flos': 904423697248800.0, 'train_loss': 0.16911778354644774, 'epoch': 10.0})
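The `TrainOutput` numbers allow a quick sanity check on how much data this logged run saw. The arithmetic below uses only values printed above. It assumes a single device and no gradient accumulation, neither of which the log states.

```python
# Values copied from the TrainOutput line above
global_step = 2500
epochs = 10
train_runtime = 1118.3021
samples_per_second = 17.884
steps_per_second = 2.236

steps_per_epoch = global_step // epochs                     # optimizer steps in one epoch
implied_batch_size = samples_per_second / steps_per_second  # samples per optimizer step
examples_per_epoch = steps_per_epoch * round(implied_batch_size)
samples_processed = samples_per_second * train_runtime      # total over all epochs

print(steps_per_epoch, round(implied_batch_size), examples_per_epoch)
```

Under those assumptions the run used about 2,000 training examples per epoch, far fewer than the 10,000-row training set built earlier in the notebook.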

In [40]:
# build an inference pipeline and classify a single example title
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False, device='cuda')
pipe(["Developing a Diabetic Foot Ulcer Protocol"])

[{'label': 'LABEL_0', 'score': 0.7159504294395447}]

In [38]:
text_df[text_df['Study Status'] == 1]['Study Title']

2         Pre-exposure Prophylaxis (PrEP) for People Who...
15                Developing a Diabetic Foot Ulcer Protocol
17        Docetaxel and St. John's Wort in Treating Pati...
48        Sunitinib and Capecitabine for First Line Colo...
49        Diabetic Retinopathy and Subclinical Signs of ...
                                ...                        
305075    Evaluation and Long-Term Follow-Up of Patients...
305078    Effectiveness of Supplementary Feeding During ...
305081    Pharmacokinetic Study of Forodesine in Childre...
305085    Study Comparing DTP-HB-Hib by Disposable-Syrin...
305092    REPRISE IV: LOTUS Edge Valve System in Interme...
Name: Study Title, Length: 41730, dtype: object

In [39]:
text_df[text_df['Study Status'] == 1]['Study Title'][15]

'Developing a Diabetic Foot Ulcer Protocol'

In [67]:
text_df[text_df['Study Status'] == 0]['Study Title']

0         Effectiveness of a Problem-solving Interventio...
1         Investigating the Effect of a Prenatal Family ...
3         Tailored Inhibitory Control Training to Revers...
4         Neuromodulation of Trauma Memories in PTSD & A...
5         High Intensity Interval Training Versus Circui...
                                ...                        
305105    Evaluating the Pharmacokinetics, Safety, and T...
305106    A Follow-up Study to Assess Safety and Tolerab...
305107    Lipid Formulation to Increase the Bioavailabil...
305108    Reliability of Topography Measurements in Kera...
305109    The Accuracy of Human Endoscopic Detection of ...
Name: Study Title, Length: 263380, dtype: object
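The two listings give the class sizes in the full dataset: 41,730 titles with `Study Status` 1 and 263,380 with `Study Status` 0. A short calculation makes the imbalance explicit and gives context for the accuracy figures, which were measured on class-balanced samples where always guessing one class scores 0.5.

```python
n_status_1 = 41730    # Length reported for Study Status == 1
n_status_0 = 263380   # Length reported for Study Status == 0
total = n_status_0 + n_status_1

positive_share = n_status_1 / total       # fraction of studies with status 1
majority_baseline = n_status_0 / total    # accuracy of always predicting status 0
balanced_baseline = 0.5                   # chance level on a 50/50 sample

print(total, round(positive_share, 4), round(majority_baseline, 4))
```

On the unbalanced data, always predicting status 0 would look strong on accuracy alone, so the balanced evaluation with a 0.5 baseline is the more informative reference point for the validation accuracy of about 0.60.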

In [71]:
text_df[text_df['Study Status'] == 0]['Study Title'][3]

'Tailored Inhibitory Control Training to Reverse EA-linked Deficits in Mid-life'