# Predicting Clinical Trial Terminations
### Notebook 4.1: Advanced Modelling - Word Embedding

**Author: Clement Chan**

---
Notes on the notebook:

https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT

**Bio_ClinicalBERT** was trained on the [MIMIC III](https://physionet.org/content/mimiciii/1.4/) database, which contains records for over 40,000 patients collected over 11 years.

- Using the `Study Title`, we will perform word embedding by utilising the Tokenizer from Bio_ClinicalBERT
- Then we perform transfer learning by fine-tuning the pretrained model Bio_ClinicalBERT and evaluate metrics like accuracy, f1_score, and roc_auc_score.

### Data Dictionary for this notebook that is based on clinicaltrials.gov:

---
| Column | Description                                  |Data Type|
|-------|--------------------------------------------|-------|
| Study Status (**Dependent Variable**)| Binary column, 0 for Completed Trials and 1 for Terminated Trials | int |
| Study Title | Title of the Clinical Trial           | object |
| Brief Summary | Short description of the clinical study (Includes study hypothesis) | object |
| Study Results | Whether the results are posted (yes = 1 or no = 0) | int|
| Conditions | Primary Disease or Condition being studied     | object |
| Primary Outcome Measures | Description of specific primary outcome | object |
| Sponsor | The corporation or agency that initiates the study | object |
| Collaborators | Other organizations that provide support | object |
| Sex | All: No limit on eligibility based on sex, Male: Only male participants, Female: Only female participants | object |
| Age | Age group of participants: ADULT, OLDER_ADULT, CHILD  | object |
| Phases | Clinical trial phase of the study | object |
| Enrollment | Total number of participants in a study (binned into ranges) | object |
| Funder Type | Government, Industry, or Other | object |
| Study Type | Interventional = 1, Observational = 0 | int |
| Study Design | Study design based on study type | object |
| Study Duration | Length of the entire study in categories | object |
| Locations | Country of where the clinical study was held | object |

**Importing Libraries**

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# sklearn metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

# import PyTorch
import torch

# huggingface/evaluate (metrics)
import evaluate
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
# NOTE: this rebinds roc_auc_score and shadows the sklearn function imported above
roc_auc_score = evaluate.load("roc_auc")

# huggingface/transformers
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer, TextClassificationPipeline

# load_dataset
from datasets import load_dataset

# ignores the filter warnings
import warnings
warnings.filterwarnings('ignore')

  torch.utils._pytree._register_pytree_node(
Using the latest cached version of the module from C:\Users\tingh\.cache\huggingface\modules\evaluate_modules\metrics\evaluate-metric--accuracy\f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Sat Apr 13 23:27:15 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from C:\Users\tingh\.cache\huggingface\modules\evaluate_modules\metrics\evaluate-metric--f1\0ca73f6cf92ef5a268320c697f7b940d1030f8471714bffdb6856c641b818974 (last modified on Sat Apr 13 23:27:20 2024) since it couldn't be found locally at evaluate-metric--f1, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from C:\Users\tingh\.cache\huggingface\modules\evaluate_modules\metrics\evaluate-metric--roc_auc\693acedc576861c3f672089d0699a69bbb74f5105e7075204284e52a12f99098 (last modified on Sat Apr 13 23:27:22 2024) since

<a id = 'table'></a>
## Table of Contents

---
    
1. [Evaluating Tokenizer](#token)
2. [Text Preprocessing](#text)
3. [Transfer Learning](#transfer)
    - a) [TransferModel_V2](#V2)
4. [Test Sample](#test)
5. [Summary](#sum)

[Word embeddings](https://www.tensorflow.org/text/guide/word_embeddings) represent words as dense vectors in which related words have similar encodings. For example, the words "cancer", "tumor", and "malignancy" are often used as synonyms, so when they are vectorized, their vectors should have a high cosine similarity (that is, a small cosine distance). In other words, embeddings learned from a very large corpus capture meaning and tend to give better scores on classification tasks.
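
To make the similarity idea concrete, the sketch below computes cosine similarity with NumPy. The three-dimensional vectors are made-up toy values for illustration only (the model in this notebook uses 768-dimensional hidden states), and `cosine_similarity` is a helper defined here, not a library call.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, NOT taken from Bio_ClinicalBERT
cancer = [0.90, 0.80, 0.10]
tumor = [0.85, 0.75, 0.20]
invoice = [0.10, 0.20, 0.95]

sim_related = cosine_similarity(cancer, tumor)
sim_unrelated = cosine_similarity(cancer, invoice)
print(f"cancer vs tumor:   {sim_related:.3f}")
print(f"cancer vs invoice: {sim_unrelated:.3f}")
```

With these toy values the related pair scores close to 1, while the unrelated pair scores much lower.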

In this notebook, we will be utilising a pretrained model from HuggingFace called Bio_ClinicalBERT on our `Study Title` column to evaluate how well it classifies terminated trials.

**Initial Setup**
- Using both the tokenizer and model from Bio_ClinicalBERT

In [2]:
# Bio_ClinicalBert Setup
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

**Load dataset**

In [3]:
text_df = pd.read_csv('clean_ctg.csv', index_col=0)
text_df.head()

Unnamed: 0,Study Title,Study Status,Brief Summary,Study Results,Conditions,Interventions,Primary Outcome Measures,Secondary Outcome Measures,Sponsor,Collaborators,Sex,Age,Phases,Enrollment,Funder Type,Study Type,Study Design,Locations,study_duration
0,Effectiveness of a Problem-solving Interventio...,0,We will conduct a two-arm individually randomi...,0,"Mental Health Issue (E.G., Depression, Psychos...",behavioral: pride 'step 1' problem-solving int...,"Mental health symptoms, The Strengths and Diff...","Mental health symptoms, The adolescent-reporte...",Sangath,Harvard Medical School (HMS and HSDM)|London S...,ALL,"CHILD, ADULT",NO PHASE,210-490,OTHER,1,Allocation: RANDOMIZED|Intervention Model: PAR...,"Sangath, New Delhi, Delhi, 110016, India",123-244
1,Investigating the Effect of a Prenatal Family ...,0,The purpose of this study is to measure the di...,0,Focus: Contraceptive Counseling|Focus: Postpar...,other: family planning counseling let by commu...,"Self-reported contraceptive use, 6 months post...","Intent to use contraception in the future, 6 m...",Planned Parenthood League of Massachusetts,Society for Family Planning Research Fund,FEMALE,"CHILD, ADULT, OLDER_ADULT",NO PHASE,120-209,OTHER,1,Allocation: NON_RANDOMIZED|Intervention Model:...,Palestinian Ministry of Health Maternal Child ...,366-515
2,Pre-exposure Prophylaxis (PrEP) for People Who...,1,People who inject drugs (PWID) experience high...,0,Intravenous Drug Abuse,behavioral: prep uptake/adherence intervention...,"PrEP uptake by self-report, measured using 1 i...",Participant satisfaction with intervention con...,Boston University,National Institute on Drug Abuse (NIDA),ALL,"ADULT, OLDER_ADULT",NO PHASE,0-8,OTHER,1,Allocation: RANDOMIZED|Intervention Model: PAR...,unknown,245-365
3,Tailored Inhibitory Control Training to Revers...,0,Insufficient inhibitory control is one pathway...,0,Smoking|Alcohol Drinking|Prescription Drug Abu...,behavioral: person-centered inhibitory control...,"Inhibitory control performance, Task 1, Perfor...",Far transfer to a task related to inhibitory c...,University of Oregon,none,ALL,ADULT,NO PHASE,80-119,OTHER,1,Allocation: RANDOMIZED|Intervention Model: PAR...,"University of Oregon, Social and Affective Neu...",516-671
4,Neuromodulation of Trauma Memories in PTSD & A...,0,The purpose of this study is to examine the ef...,1,Alcohol Dependence|PTSD,drug: propranolol|drug: placebo,"Retrieval Session Distress Scores (Session 1),...","Proportion of Drinking Days, Proportion of dri...",Medical University of South Carolina,National Institute on Alcohol Abuse and Alcoho...,ALL,"ADULT, OLDER_ADULT",PHASE2,42-59,OTHER,1,Allocation: RANDOMIZED|Intervention Model: PAR...,"MUSC, Charleston, South Carolina, 294258908, U...",862-1097


<a id = 'token'></a>

## 1. Evaluating Tokenizer

First let's evaluate the tokenizer to see how it preprocesses the text.
- We will select the first 3 Study titles.

In [39]:
sample = text_df['Study Title'][:3].tolist()
sample

['Effectiveness of a Problem-solving Intervention for Common Adolescent Mental Health Problems in India',
 'Investigating the Effect of a Prenatal Family Planning Counseling Intervention Led by Community Health Workers on Postpartum Contraceptive Use Among Women in the West Bank',
 'Pre-exposure Prophylaxis (PrEP) for People Who Inject Drugs (PWID)']

The input to the model must be a fixed-size tensor. Since the titles in `Study Title` produce different numbers of tokens, we need to enable padding and truncation in the tokenizer.


**Padding** adds a special [PAD] token to the end of the word sequence to make sure all tensors are the same size.

**Truncation** works in the other direction: it cuts sequences that are longer than the maximum length (the `max_length` parameter) down to that length.

source: (https://huggingface.co/docs/transformers/en/pad_truncation)
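
The two operations can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the two id lists are shortened examples, and the ids 101, 102 and 0 are assumed to stand for [CLS], [SEP] and [PAD].

```python
PAD_ID = 0    # assumed id of the [PAD] token
SEP_ID = 102  # assumed id of the closing [SEP] token

def pad_and_truncate(batch, max_length=None):
    """Truncate each sequence to max_length (keeping the final [SEP]),
    then pad every sequence to the longest one left in the batch."""
    if max_length is not None:
        batch = [ids if len(ids) <= max_length else ids[:max_length - 1] + [SEP_ID]
                 for ids in batch]
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Shortened, hypothetical token id sequences of different lengths
batch = [
    [101, 3073, 118, 7401, 102],
    [101, 11950, 1103, 2629, 1104, 170, 3073, 102],
]
input_ids, attention_mask = pad_and_truncate(batch, max_length=6)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The attention mask is what lets the model tell real tokens apart from the padding added at the end.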

In [40]:
# tokenizer takes in the samples, preprocesses it, and returns "pytorch" tensors
inputs = tokenizer(sample, return_tensors="pt", padding=True, truncation=True)
inputs

{'input_ids': tensor([[  101, 12949,  1104,   170,  2463,   118, 15097,  9108,  1111,  1887,
         25506,  4910,  2332,  2645,  1107,  1107,  7168,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [  101, 11950,  1103,  2629,  1104,   170,  3073, 24226,  1348,  1266,
          3693, 22138,  9108,  1521,  1118,  1661,  2332,  3239,  1113,  2112,
         17482,  8928, 14255,  4487, 17046,  1329,  1621,  1535,  1107,  1103,
          1745,  3085,   102],
        [  101,  3073,   118,  7401, 21146, 18873,  7897,  1548,   113,  3073,
          1643,   114,  1111,  1234,  1150,  1107, 16811,  5557,   113,   185,
         10073,  1181,   114,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Now we need to convert the input ids back into their corresponding words.

In [41]:
# take the third study title and convert the input_IDs back into words
tokenizer.convert_ids_to_tokens(inputs["input_ids"][2])

['[CLS]',
 'pre',
 '-',
 'exposure',
 'prop',
 '##hyl',
 '##ax',
 '##is',
 '(',
 'pre',
 '##p',
 ')',
 'for',
 'people',
 'who',
 'in',
 '##ject',
 'drugs',
 '(',
 'p',
 '##wi',
 '##d',
 ')',
 '[SEP]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]',
 '[PAD]']

Notice how the tokenizer adds [CLS] at the beginning of the sequence and [SEP] at the end. [CLS] is a special classification token that marks the start of the sequence, and [SEP] marks where it ends. [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) stands for Bidirectional Encoder Representations from Transformers. Essentially, BERT encodes each token using context from both directions, which is why the input needs to be preprocessed in this specific way.

Additionally, we can see the [PAD] added at the end of the sequence so the tensors are the same size!

source (https://huggingface.co/docs/transformers/en/model_doc/bert)
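
The `##` prefixes in the token list above mark subword pieces that continue a word. BERT tokenizers use WordPiece, which splits a word greedily into the longest vocabulary entries it can find. Below is a minimal sketch of that greedy matching, assuming a tiny hypothetical vocabulary (the real one has 28,996 entries) and input that is already lowercased; `wordpiece` is an illustrative helper, not the library function.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:   # no vocabulary entry matches: the whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical mini-vocabulary built from the pieces seen in the output above
vocab = {"pre", "##p", "prop", "##hyl", "##ax", "##is",
         "in", "##ject", "p", "##wi", "##d", "drugs"}

for word in ["prophylaxis", "prep", "inject", "pwid", "drugs"]:
    print(word, "->", wordpiece(word, vocab))
```

Rare clinical terms are therefore never dropped; they are broken into smaller pieces that the model has seen during pretraining.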

Let's look at the model configuration by running `model.config`

In [86]:
model.config

BertConfig {
  "_name_or_path": "emilyalsentzer/Bio_ClinicalBERT",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.32.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

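Much of this looks like a classic neural network configuration: dropout probabilities to reduce overfitting, the activation function (`gelu`), the hidden size, and the number of hidden layers and attention heads. Note that the learning rate is not part of the model config; it is set later in the training arguments. We can definitely tune these parameters further in the future!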
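
As a rough guide to what these numbers mean in practice, the sketch below estimates the number of trainable parameters implied by the config. It assumes the standard BERT encoder layout (token, position and segment embeddings, 12 identical layers, and a pooler) and leaves out the small classification head; `count_bert_parameters` is a helper written for this illustration.

```python
# Values copied from the model.config output above
cfg = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
    "vocab_size": 28996,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
}

def count_bert_parameters(cfg):
    """Parameter count for a standard BERT encoder (weights + biases + LayerNorms)."""
    h, i = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and shift vectors
    embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]
                  + cfg["type_vocab_size"]) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm           # Q, K, V and output projections
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward) + pooler

total_params = count_bert_parameters(cfg)
print(f"{total_params:,} parameters (about {total_params / 1e6:.0f} million)")
```

Most of the parameters sit in the 12 encoder layers, which is part of why fine-tuning takes so long on limited hardware.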

<a id = 'text'></a>
    
## 2. Text Preprocessing
    
---

We need to set our labels or the dependent variables.

In [4]:
# Create expected IDs and their labels
id2label = {0: "COMPLETED", 1: "TERMINATED"}
label2id = {"COMPLETED": 0, "TERMINATED": 1}

In [5]:
# model setup for text classification
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    num_labels=2,
    id2label=id2label,
    label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Then we need to slice out the `Study Status` and the `Study Title` and perform a train_test_split.

In [42]:
# Slice out label and text columns
simplified = text_df[["Study Status","Study Title"]].copy()

# Rename the columns to "label" and "text"
simplified.columns = ["label","text"]

# Manual train_test_split
test_flag = np.random.randint(0,high=10,size=text_df.shape[0])
simplified.loc[:,'test'] = test_flag > 4 # This will split train and test ~50:50

In [43]:
# check the split proportion
simplified['test'].value_counts(normalize=True)

test
False    0.50021
True     0.49979
Name: proportion, dtype: float64

In [44]:
# can also do value_counts
simplified['test'].value_counts()

test
False    152619
True     152491
Name: count, dtype: int64
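
The split above is drawn without a random seed, so it changes every time the cell is rerun. A minimal sketch of a repeatable alternative, assuming NumPy's `default_rng` generator; `make_test_flag` and the seed value are illustrative choices, not part of the original notebook.

```python
import numpy as np

def make_test_flag(n_rows, test_fraction=0.5, seed=42):
    """Boolean mask that marks roughly `test_fraction` of the rows as test rows."""
    rng = np.random.default_rng(seed)  # seeded generator gives a repeatable split
    return rng.random(n_rows) < test_fraction

flag_a = make_test_flag(100_000)
flag_b = make_test_flag(100_000)

print(flag_a.mean())                    # share of test rows, close to 0.5
print(np.array_equal(flag_a, flag_b))   # the same seed reproduces the same split
```

The resulting mask could be assigned to the `test` column in the same way as `test_flag > 4`.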

Great! Let's do a final check on what the dataframe looks like and then we need to select a sample size to be fed into the model.

In [45]:
# final check
simplified.head()

Unnamed: 0,label,text,test
0,0,Effectiveness of a Problem-solving Interventio...,True
1,0,Investigating the Effect of a Prenatal Family ...,False
2,1,Pre-exposure Prophylaxis (PrEP) for People Who...,True
3,0,Tailored Inhibitory Control Training to Revers...,False
4,0,Neuromodulation of Trauma Memories in PTSD & A...,True


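We will start small by selecting 5,000 random samples per class for the train set (10,000 rows in total) and 1,500 random samples per class for the test set (3,000 rows in total), which also balances the two classes. Then we need to export them as CSVs so they can be processed by the model.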

In [9]:
# Select random train sample
train = simplified[~simplified.test].groupby('label',group_keys=False).apply(lambda x: x.sample(5000))

In [10]:
# Select random test sample
test = simplified[simplified.test].groupby('label',group_keys=False).apply(lambda x: x.sample(1500))

In [11]:
# save as CSV files to be processed by the hugging face data pipelines
train.reset_index(drop=True).to_csv("data/bert_train.csv")
test.reset_index(drop=True).to_csv("data/bert_test.csv")

In [12]:
# load as hugging face dataset
dataset = load_dataset('csv', data_files={'train': 'data/bert_train.csv', 'test': 'data/bert_test.csv'})

Downloading and preparing dataset csv/default to C:/Users/tingh/.cache/huggingface/datasets/csv/default-6aa0f91d2967186f/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to C:/Users/tingh/.cache/huggingface/datasets/csv/default-6aa0f91d2967186f/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [13]:
dataset.shape

{'train': (10000, 4), 'test': (3000, 4)}

Great, this looks good. Now let's move on to fine-tuning the pretrained model!

<a id = 'transfer'></a>
    
## 3. Transfer Learning - Fine-Tuning Pretrained Model (Bio_ClinicalBERT)

---

We need to define a preprocess function that uses the tokenizer and parameters we mentioned before.
- `dataset.map` applies the preprocess function to every row of the `text` column (the renamed `Study Title`)
- **DataCollatorWithPadding** prepares batches by dynamically padding the sequences in each batch to the length of the longest one

source (https://huggingface.co/docs/transformers/en/main_classes/data_collator)
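
To see why padding per batch helps, the sketch below compares how many token slots are needed when every title is padded to the longest title in the dataset versus the longest title in its own batch. The token lengths are hypothetical and both helper functions are written for this illustration only.

```python
def padded_tokens_global(lengths):
    """Token slots when every sequence is padded to the dataset-wide maximum."""
    return len(lengths) * max(lengths)

def padded_tokens_per_batch(lengths, batch_size):
    """Token slots when each batch is padded only to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total

# Hypothetical token counts for eight study titles
lengths = [5, 7, 6, 30, 8, 9, 5, 12]

global_slots = padded_tokens_global(lengths)
dynamic_slots = padded_tokens_per_batch(lengths, batch_size=4)
print(global_slots, dynamic_slots)
```

A single very long title only inflates the batch it lands in, so fewer padding tokens pass through the model overall.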

In [14]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

# From hugging face text processing setup
tokenized_data = dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

We also need to define our metrics!
- HuggingFace has their own evaluate library that consists of many evaluation metrics for their models.
- We will be using accuracy, f1_score, and auc_score to evaluate our model's performance.

Source (https://huggingface.co/docs/evaluate/en/index)

**Important Notes**

We need to run `np.argmax()` because the model outputs two scores per sample, one for each class. The argmax picks the class with the larger score.

In [15]:
# use the Hugging Face evaluate metrics loaded above for evaluation during training
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy_score = accuracy.compute(predictions=pred, references=labels)
    f1_score = f1.compute(predictions=pred, references=labels)
    auc_score = roc_auc_score.compute(prediction_scores=pred, references=labels)

    return {"accuracy": accuracy_score, "f1": f1_score, "AUC": auc_score}
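
One thing to keep in mind with `compute_metrics` as written: the AUC metric receives the hard argmax labels rather than class probabilities, and AUC computed on hard labels is just the average of the true positive and true negative rates. The sketch below illustrates the difference with NumPy only; the logits are hypothetical and `pairwise_auc` is a small helper written for this example, not the `evaluate` implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def pairwise_auc(scores, labels):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical model outputs for six titles (columns: COMPLETED, TERMINATED)
logits = np.array([[2.0, -1.0], [0.5, 0.3], [-0.2, 0.1],
                   [-1.0, 1.5], [0.4, 0.6], [0.2, 0.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

hard_pred = np.argmax(logits, axis=1)      # what compute_metrics passes on
prob_terminated = softmax(logits)[:, 1]    # probability of the TERMINATED class

acc = float((hard_pred == labels).mean())
auc_hard = pairwise_auc(hard_pred.astype(float), labels)
auc_prob = pairwise_auc(prob_terminated, labels)
print(f"accuracy={acc:.3f}  AUC(hard labels)={auc_hard:.3f}  AUC(probabilities)={auc_prob:.3f}")
```

In this balanced toy example the AUC from hard labels matches the accuracy, while the AUC from probabilities also rewards getting the ranking right. Passing the TERMINATED probability as `prediction_scores` would be one way to obtain a threshold-independent AUC.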

Next we can set up our complete transfer learning pipeline by creating a `TrainingArguments` instance with specific parameters and a new `Trainer` that brings all the components together: the model, the training arguments, the preprocessing pipeline, and the evaluation function.

**Training_args**
- output_dir: the directory where the model checkpoints are saved
- weight_decay: the model's regularization, which helps prevent overfitting (higher number = stronger regularization)
- evaluation_strategy: set to "epoch" so the metrics are computed after every epoch
- save_strategy: set to "epoch" so a checkpoint is saved after every epoch
- load_best_model_at_end: reloads the best checkpoint (judged by validation loss by default) when training finishes

source (https://huggingface.co/docs/transformers/en/main_classes/trainer)

In [16]:
# training parameter setup goes in a specific class instance
training_args = TrainingArguments(
    output_dir="clinical-study-classifier",
    learning_rate=2e-5,
    # optim="adamw_torch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False)

# the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)

Finally! We can now use trainer.train() to start fine-tuning the model.

In [17]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Auc
1,0.6697,0.654717,{'accuracy': 0.6116666666666667},{'f1': 0.5881937080240367},{'roc_auc': 0.6116666666666667}
2,0.6304,0.677115,{'accuracy': 0.608},{'f1': 0.6072144288577155},{'roc_auc': 0.6079999999999999}
3,0.4792,0.833124,{'accuracy': 0.6033333333333334},{'f1': 0.62882096069869},{'roc_auc': 0.6033333333333334}
4,0.3239,1.412217,{'accuracy': 0.5953333333333334},{'f1': 0.5879158180583842},{'roc_auc': 0.5953333333333333}
5,0.2332,1.956136,{'accuracy': 0.5983333333333334},{'f1': 0.6192733017377566},{'roc_auc': 0.5983333333333333}
6,0.1735,2.654079,{'accuracy': 0.591},{'f1': 0.6027840725153771},{'roc_auc': 0.591}
7,0.1011,3.079926,{'accuracy': 0.5873333333333334},{'f1': 0.5780504430811179},{'roc_auc': 0.5873333333333333}
8,0.0636,3.38544,{'accuracy': 0.5786666666666667},{'f1': 0.5552427867698804},{'roc_auc': 0.5786666666666667}
9,0.0444,3.506462,{'accuracy': 0.5886666666666667},{'f1': 0.5951443569553806},{'roc_auc': 0.5886666666666667}
10,0.0304,3.573226,{'accuracy': 0.5903333333333334},{'f1': 0.5843760568143388},{'roc_auc': 0.5903333333333334}


TrainOutput(global_step=12500, training_loss=0.27236922760009763, metrics={'train_runtime': 3712.528, 'train_samples_per_second': 26.936, 'train_steps_per_second': 3.367, 'total_flos': 5052624373834080.0, 'train_loss': 0.27236922760009763, 'epoch': 10.0})

The model is overfitting to the training data: the training loss keeps going down while the validation loss goes up after the first epoch. Our highest accuracy is ~61%, which is not that good; the f1_score ranges from 56% to 63% and the AUC score ranges from 58% to 61%. Note that the AUC equals the accuracy in every epoch because `compute_metrics` passes the hard argmax predictions (not probabilities) to the AUC metric, and on our class-balanced test set that reduces to the accuracy.

This is by no means a good model. This is probably due to the small dataset that is fed into the model. To reduce overfitting, we can increase the `weight_decay` in the next attempt and add more rows to the train and test sets.
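
The validation losses logged above also show which checkpoint `load_best_model_at_end` would keep, and how early training could have stopped. A minimal sketch in plain Python, using the logged values; the patience of 2 epochs is an illustrative assumption and both helpers are hypothetical.

```python
# Validation loss per epoch, copied from the training log above
val_loss = [0.654717, 0.677115, 0.833124, 1.412217, 1.956136,
            2.654079, 3.079926, 3.385440, 3.506462, 3.573226]

def best_epoch(losses):
    """1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

def early_stop_epoch(losses, patience=2):
    """Epoch at which training would stop after `patience` epochs without improvement."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(losses)

best = best_epoch(val_loss)
stop = early_stop_epoch(val_loss, patience=2)
print(f"best epoch: {best}, early stop after epoch: {stop}")
```

Since the validation loss never improves after the first epoch, most of the 10 epochs add training time without improving the saved model. The `transformers` library ships an `EarlyStoppingCallback` for this purpose; its setup is not shown here.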

In [72]:
# saving the model
trainer.save_model("models/TransferModel")

<a id = 'V2'></a>

### 3. a) TransferModel_V2
    
---

Using the same code, we will sample 7,500 random rows per class for train (15,000 in total) and 2,500 random rows per class for test (5,000 in total). Unfortunately, due to limited time and processing power, we cannot train on a larger sample.

In [7]:
# Select random train sample
train = simplified[~simplified.test].groupby('label',group_keys=False).apply(lambda x: x.sample(7500))

# Select random test sample
test = simplified[simplified.test].groupby('label',group_keys=False).apply(lambda x: x.sample(2500))

# save as CSV files to be processed by the hugging face data pipelines
train.reset_index(drop=True).to_csv("data/bert_train.csv")
test.reset_index(drop=True).to_csv("data/bert_test.csv")

In [8]:
# load as hugging face dataset
dataset = load_dataset('csv', data_files={'train': 'data/bert_train.csv', 'test': 'data/bert_test.csv'})
dataset.shape

Downloading and preparing dataset csv/default to C:/Users/tingh/.cache/huggingface/datasets/csv/default-f2b70ceaad06f926/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to C:/Users/tingh/.cache/huggingface/datasets/csv/default-f2b70ceaad06f926/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

{'train': (15000, 4), 'test': (5000, 4)}

We need to map the preprocessing function over the new dataset again!

In [46]:
# From hugging face text processing setup
tokenized_data = dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/15000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

We keep all parameters the same, except that `weight_decay` is increased to 0.1 to reduce overfitting.

In [11]:
# training parameter setup goes in a specific class instance
training_args = TrainingArguments(
    output_dir="clinical-study-classifier",
    learning_rate=2e-5,
    # optim="adamw_torch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.1, # changed from 0.01 to prevent overfit
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False)

# the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)

In [12]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Auc
1,0.6691,0.653622,{'accuracy': 0.6236},{'f1': 0.6818796484110886},{'roc_auc': 0.6235999999999999}
2,0.6208,0.666562,{'accuracy': 0.6312},{'f1': 0.6285253827558421},{'roc_auc': 0.6312000000000001}
3,0.4979,0.76786,{'accuracy': 0.6244},{'f1': 0.6370313103981446},{'roc_auc': 0.6244000000000001}
4,0.3483,1.30188,{'accuracy': 0.6158},{'f1': 0.5903177649818725},{'roc_auc': 0.6157999999999999}
5,0.2615,1.776627,{'accuracy': 0.6162},{'f1': 0.6488563586459287},{'roc_auc': 0.6162000000000001}
6,0.192,2.269892,{'accuracy': 0.6056},{'f1': 0.6038569706709521},{'roc_auc': 0.6055999999999999}
7,0.1273,2.623639,{'accuracy': 0.6112},{'f1': 0.637853949329359},{'roc_auc': 0.6112000000000001}
8,0.0956,2.835454,{'accuracy': 0.6052},{'f1': 0.6228505922812381},{'roc_auc': 0.6052000000000001}
9,0.0551,3.152989,{'accuracy': 0.6002},{'f1': 0.6276774073384244},{'roc_auc': 0.6002}
10,0.0286,3.290262,{'accuracy': 0.606},{'f1': 0.6247619047619047},{'roc_auc': 0.6060000000000001}


TrainOutput(global_step=18750, training_loss=0.29077237213134766, metrics={'train_runtime': 5032.6505, 'train_samples_per_second': 29.805, 'train_steps_per_second': 3.726, 'total_flos': 7099685940078240.0, 'train_loss': 0.29077237213134766, 'epoch': 10.0})

The model is still overfitting to the training set, but it performs slightly better than the previous model. At the epoch with the lowest validation loss (epoch 1), this model has an accuracy of 62%, an F1_score of 68% and an AUC_score of 62%, which is not optimal but an improvement.
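
The `global_step` values reported in the two `TrainOutput` results can be sanity-checked from the dataset size, batch size and number of epochs. The sketch assumes a single device and no gradient accumulation; `expected_steps` is a helper written for this check.

```python
import math

def expected_steps(n_train, batch_size, epochs):
    """Optimizer steps for a full run: batches per epoch times number of epochs."""
    return math.ceil(n_train / batch_size) * epochs

steps_v1 = expected_steps(10_000, batch_size=8, epochs=10)  # first model
steps_v2 = expected_steps(15_000, batch_size=8, epochs=10)  # TransferModel_V2
print(steps_v1, steps_v2)
```

Both values agree with the logs, which confirms that each run saw its full training set in every epoch.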

In [37]:
# saving the model
trainer.save_model("models/TransferModel_v2")

<a id = 'test'></a>

## 4. Test Sample

---

We can perform some sample tests to see what the model would predict given a `Study Title`

In [47]:
# list out terminated trial indices
text_df[text_df['Study Status'] == 1]['Study Title'].index

Index([     2,     15,     17,     48,     49,     60,     64,     65,     79,
           94,
       ...
       305044, 305045, 305052, 305057, 305065, 305075, 305078, 305081, 305085,
       305092],
      dtype='int64', length=41730)

In [50]:
# Pick a random index to plug into pipeline
term_sample = text_df[text_df['Study Status'] == 1]['Study Title'][48]
term_sample

'Sunitinib and Capecitabine for First Line Colon Cancer'

The TextClassificationPipeline is a simple pipe that classifies text based on the fine-tuned model and the tokenizer.

In [51]:
# build the classification pipeline and predict a single study title
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False, device='cuda') # cuda allows the function to run on your GPU
pipe([term_sample])

[{'label': 'TERMINATED', 'score': 0.780062735080719}]

Great! It's predicting the `Study Title` as Terminated with a confidence of 78%.

Let's try it with just the word `COVID`, since in the previous notebook we found that COVID-related trials had a higher coefficient for terminated trials.

In [55]:
# build the classification pipeline and predict a single word
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False, device='cuda') # cuda allows the function to run on your GPU
pipe(["COVID"])

[{'label': 'TERMINATED', 'score': 0.5402730703353882}]

The model leans towards Terminated, although a confidence of 54% is only slightly above chance. Now we just need to test out some of the completed data.

In [52]:
# list out completed trial indices
text_df[text_df['Study Status'] == 0]['Study Title'].index

Index([     0,      1,      3,      4,      5,      6,      7,      8,      9,
           10,
       ...
       305100, 305101, 305102, 305103, 305104, 305105, 305106, 305107, 305108,
       305109],
      dtype='int64', length=263380)

In [53]:
# Pick a random index to plug into pipeline
comp_sample = text_df[text_df['Study Status'] == 0]['Study Title'][7]
comp_sample

'A Pilot Study to Assess the Effectiveness of BehaviouRal ActiVation Group Program in Patients With dEpression: BRAVE'

In [54]:
# build the classification pipeline and predict a single study title
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False, device='cuda')
pipe([comp_sample])

[{'label': 'COMPLETED', 'score': 0.597963809967041}]

Now it managed to predict a completed trial based on the `Study Title`, with a confidence of 60%.

<a id = 'sum'></a>

## 5. Summary

---
    
We created word embeddings from the `Study Title` column and then performed transfer learning by fine-tuning a pretrained model (Bio_ClinicalBERT) on them. Our final model in this notebook predicts trial terminations with an accuracy of 62%, an f1_score of 68% and an AUC_score of 62%. To improve this model further, we can train on more rows and tokenize the `Brief Summary` column as well. Additionally, it would be useful to learn how to optimize the BERT configuration and training arguments further to obtain better results.
    
**Limitations:**
Text preprocessing and fine-tuning a pretrained model require a lot of time, iterations, and computing power. It will be useful to learn how to train more efficiently to get better results in a shorter time.
    
**Product Demo**
We will include this model in our product demo, since it will be useful to find out which words in clinical study titles are often associated with trial terminations.

---
    
### Predicting Clinical Trial Termination Conclusion
    
Our initial goal was to find out the main factors associated with trial terminations and then create a predictive machine learning model to classify terminated trials.

**Some factors include**
- Low enrollments (0-8)
- Trials associated with COVID
- Chronic conditions such as cancer, Parkinson's disease, Crohn's disease, etc.
- Specific Sponsors and Collaborators
- And many more...
    
We created 2 predictive models: a fine-tuned RandomForest Classifier and a fine-tuned pretrained language model (Bio_ClinicalBERT).
    
**Next Steps:** We can refine each of the models further based on their limitations, and incorporate further hyperparameter tuning and different feature selection strategies.