# Our Plan
We want to further pre-train the bert-based-uncased model using a clinical notes dataset. Here is our plan:

1. Import necessary packages.
2. Download the dataset from Kaggle.
  * We will use the akashadesai/clinical-notes dataset from Kaggle.
  * Save the dataset on your machine.
3. Preprocess the dataset.
  * Create a pandas dataframe with each sentence in a new row.
  * Ensure consecutive sentences are in consecutive rows (e.g., Sentence A in row i and Sentence B in row i+1).
3. Create a custom Dataset class.
  * For BERT training, each item should be in the format: `Sentence A + [SEP] + Sentence B`.
  * The __getitem__ method should return the tokenization of (`Sentence A + [SEP] + Sentence B`).
4. Create a DataCollatorForPreTraining class. This will be passed as the collate_fn in the DataLoader.
  * The class should inherit from DataCollatorForLanguageModeling.
  * Mask a few tokens from Sentence A.
5. Create a DataLoader.
6. Declare the model, loss, and optimizer.
7. Prepare the accelerator for GPU and distributed training.
8. Perform further training of the bert-based-uncased model.
The Clinical BERT model is now ready.



# 1. Import Necessary Packages

In [30]:
import torch
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import DataCollatorForSeq2Seq
from accelerate import Accelerator

accelerator = Accelerator()


# 2. Download the Dataset from Kaggle
Please Download dataset from [https://www.kaggle.com/datasets/akashadesai/clinical-notes](https://www.kaggle.com/datasets/akashadesai/clinical-notes)

#3. Pre-process the Dataset
The following function takes a dataframe as an argument. The dataframe contains a column named 'TEXT' which consists of clinical notes from different patients. In this function:

* We remove special characters from the clinical notes.
* We split the clinical notes into individual sentences.
* We create a new dataframe, where each row holds a single sentence.

In [16]:
import pandas as pd
import nltk
import re
from nltk.tokenize import sent_tokenize

def create_sentence_dataframe(df):
    # Initialize an empty list to store sentences
    sentences = []

    # Define a pattern to match special characters
    special_chars_pattern = re.compile(r'[^a-zA-Z0-9\s.,?!]+|\n')
    # Loop through each row in the DataFrame
    for text in df['TEXT']:
        # Remove special characters from the text
        clean_text = special_chars_pattern.sub('', text)

        # Tokenize the cleaned text into sentences
        tokenized_sentences = sent_tokenize(clean_text)

        # Add the tokenized sentences to the list
        sentences.extend(tokenized_sentences)

    # Create a new DataFrame with the sentences
    sentence_df = pd.DataFrame(sentences, columns=['text'])

    return sentence_df

In [17]:
data_txt = pd.read_csv("/Users/premtimsina/Documents/bpbbook/chapter5/dataset/medical_data.csv")

In [18]:
pd.options.display.max_colwidth = 100
data=create_sentence_dataframe(data_txt)

In [19]:
data.head()

Unnamed: 0,text
0,Admission Date 216233 Discharge Date 2162325Date of Birth 208014 Se...
1,Known lastname 1829 was seen at Hospital1 18 after a mechanical fall froma height of 10 feet.
2,CT scan noted unstable fracture of C67 posterior elements.Major Surgical or Invasive Procedure1.
3,"Anterior cervical osteotomy, C6C7, with decompression andexcision of ossification of the posteri..."
4,Anterior cervical deformity correction.3.


#4. Create a custom Dataset class.
* For BERT training, each item should be in the format: Sentence A + [SEP] + Sentence B.
* The getitem method should return the tokenization of (Sentence A + [SEP] + Sentence B).

In [22]:
from transformers import BertTokenizer

class ClinicalDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        news = self.data.loc[idx, "text"]
        if idx + 1 < len(self.data):
            next_news = self.data.loc[idx + 1, "text"]
        else:
            next_news = self.data.loc[0, "text"]

        combined_news = news + " [SEP] " + next_news
        tokenized = self.tokenizer(combined_news, truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt")
        return {"input_ids": tokenized["input_ids"].squeeze(0), "attention_mask": tokenized["attention_mask"].squeeze(0), "text": combined_news}


In [23]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

dataset=ClinicalDataset(data,tokenizer)
dataset[0]

{'input_ids': tensor([  101,  9634,  3058, 20294, 21926,  2509, 11889,  3058, 20294, 21926,
         17788, 13701,  1997,  4182, 18512, 24096,  2549,  3348,  5796,  2121,
          7903,  2063,  4200, 24164, 10623,  3111, 24343,  2680,  2004,  2383,
          2053,  2124,  2035,  2121, 17252,  2000,  5850, 19321, 18537,  8873,
         12096,  2171,  2509,  1048,  2546, 11517,  5428, 12879, 12087,  2213,
          2099,  1012,   102,  2124,  2197, 18442, 11523,  2001,  2464,  2012,
          2902,  2487,  2324,  2044,  1037,  6228,  2991,  2013,  2050,  4578,
          1997,  2184,  2519,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

In [25]:
tokenizer.sep_token_id

102

# 5. Create a DataCollatorForPreTraining class
This is what we are performing in the `DataCollatorForPreTraining` class:
1. We inherit from the DataCollatorForLanguageModeling class provided by the Hugging Face Transformers library. We will use `DataCollatorForLanguageModeling` for the MLM.
2. We override the __call__ method to process the input examples for the pre-training task.
  * We initialize lists to store NSP labels, input IDs, attention masks, and labels for each example.
  * We aim to create 50% sentence pairs as NSP (Next Sentence Prediction) True and 50% sentence pairs as NSP False.
  * In the following function, if random.random() > 0.5, we consider the sentence pair as a True NSP pair. Since the data coming from the Dataset class is already a True NSP pair, we don't need to modify it.
  * On the other hand, if random.random() < 0.5, we consider the sentence pair as a False NSP pair. To achieve this, we shuffle the tokens in the next sentence. As a result, the sentence after [SEP] is not the true next sentence, making it an NSP False pair.
3. We use the parent class's __call__ method to handle the MLM task for the examples.
4. We add NSP labels to the batch and return the final batch for further processing in the pre-training loop.

In [31]:
from transformers import DataCollatorForLanguageModeling
import random


class DataCollatorForPreTraining(DataCollatorForLanguageModeling):
    def __init__(self, tokenizer, mlm=True, mlm_probability=0.15, nsp_probability=0.5):
        super().__init__(tokenizer=tokenizer, mlm=mlm, mlm_probability=mlm_probability)
        self.nsp_probability = nsp_probability

    def __call__(self, examples):
        # NSP labels
        nsp_labels = []

        input_ids_list = []
        attention_masks_list = []
        labels_list = []

        # Create NSP input
        for example in examples:
            input_ids = example["input_ids"]
            attention_mask = example["attention_mask"]

            if random.random() > self.nsp_probability:
                # Is Next Sentence
                nsp_labels.append(1)
            else:
                # Not Next Sentence
                nsp_labels.append(0)

                # Shuffle second sentence
                sep_idx = (input_ids == self.tokenizer.sep_token_id).nonzero(as_tuple=True)[0][0].item()
                second_sentence = input_ids[sep_idx + 1:]
                second_sentence = second_sentence[torch.randperm(second_sentence.size()[0])]

                # Concatenate first sentence and shuffled second sentence
                input_ids = torch.cat((input_ids[:sep_idx + 1], second_sentence), dim=0)

            input_ids_list.append(input_ids)
            attention_masks_list.append(attention_mask)

            # Mask only the first sentence
            sep_idx = (input_ids == self.tokenizer.sep_token_id).nonzero(as_tuple=True)[0][0].item()
            labels = input_ids.clone()
            labels[sep_idx:] = -100
            labels_list.append(labels)

        # Create a list of dictionaries for the parent class
        example_dicts = [{"input_ids": ids, "attention_mask": mask, "labels": lbl} for ids, mask, lbl in zip(input_ids_list, attention_masks_list, labels_list)]
        
        # Handle MLM using the parent class
        batch = super().__call__(example_dicts)

        # Add NSP labels to batch
        batch["next_sentence_label"] = torch.tensor(nsp_labels, dtype=torch.long)

        return batch


# 6. Create DataLoader
Let's Discuss each item of dataloader.
1. input_ids: It is token number of each token. The padded token has the token number of 0. 
2. attention_mask: 1 signifies true token, 0 signifies padded token
3. labels: It is the label for Masked Language Modeling task.
 * `-100` means do not use that token to calculate loss function; or, the correspoding token will not be masked
 * non `-100` signifies that the corresponding token will be masked and use for MLM pre-training objective.



In [32]:
from torch.utils.data import DataLoader

# Instantiate the tokenizer, dataset, and data collator
data_collator = DataCollatorForPreTraining(tokenizer)

# Create the DataLoader
train_dataloader = DataLoader(
    dataset, shuffle=True, collate_fn=data_collator, batch_size=16
)

## 6.1 We are just reviewing what dataloader looks like.

In [35]:
item=next(iter(train_dataloader))





In [37]:
print(len(train_dataloader))
print('ids', item['input_ids'][0])
print('mask', item['attention_mask'][0])
print('labels', item['labels'][0])
print('next_sentence_label',item['next_sentence_label'][0])

3474
ids tensor([  101,   103,   103,  3298,  2138,  1997,  2010,  3532, 17084,  1012,
          102,  2196, 20482,  1012,   102,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0

#7. Pre-training
1. `model = BertForPreTraining(config)`: Here is the reason why we are creating instance of `BertForPreTraining`
  * It is specifically designed for pre-training the BERT model architecture. The class encapsulates the BERT architecture along with additional pre-training tasks: NSP and MLM
  * This is a crucial consideration: if you cannot find a module that satisfies the pre-training objective of a particular model, you will need to create the module yourself. In our case, Hugging Face's BertForPreTraining module already met both NSP and MLM pre-training objectives, so we didn't need to write a custom module.
  * At the time of writing this book, I could not find a BartForPretraining module that satisfied BART's pre-training objectives. Therefore, if we want to pre-train BART, we would need to create a custom module for further pre-training BART.

2. In the following code, we are using just one epoch. However, to achieve optimal results, you should consider using multiple epochs.

In [38]:
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
from transformers import BertForPreTraining, BertConfig
from accelerate import Accelerator

# Initialize accelerator
accelerator = Accelerator()

# Load BERT model
config = BertConfig.from_pretrained("bert-base-uncased")
model = BertForPreTraining(config)

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Prepare the model and optimizer for acceleration
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

# Set training parameters
num_epochs = 1
print_every = 10

# Training loop
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    model.train()
    running_loss = 0.0

    for step, batch in enumerate(train_dataloader):
        input_ids = batch["input_ids"].to(accelerator.device)
        attention_mask = batch["attention_mask"].to(accelerator.device)
        labels = batch["labels"].to(accelerator.device)
        next_sentence_label = batch["next_sentence_label"].to(accelerator.device)

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels, next_sentence_label=next_sentence_label)
        loss = outputs.loss

        # Backward pass
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

        running_loss += loss.item()

        if (step + 1) % print_every == 0:
            print(f"Step {step + 1}: Loss = {running_loss / print_every:.4f}")
            running_loss = 0.0

print("Training complete!")



Epoch 1/1
Step 10: Loss = 1.0446
Step 20: Loss = 0.9181
Step 30: Loss = 0.9136
Step 40: Loss = 0.9404
Step 50: Loss = 0.8646
Step 60: Loss = 0.8865
Step 70: Loss = 0.9166
Step 80: Loss = 0.8685
Step 90: Loss = 0.8547
Step 100: Loss = 0.8541
Step 110: Loss = 0.8169
Step 120: Loss = 0.8246
Step 130: Loss = 0.8463
Step 140: Loss = 0.8528
Step 150: Loss = 0.7851
Step 160: Loss = 0.7743
Step 170: Loss = 0.6331
Step 180: Loss = 0.4951
Step 190: Loss = 0.7283
Step 200: Loss = 1.0740
Step 210: Loss = 0.9100
Step 220: Loss = 0.8492
Step 230: Loss = 0.8410
Step 240: Loss = 0.8498
Step 250: Loss = 0.8160
Step 260: Loss = 0.8486
Step 270: Loss = 0.8349
Step 280: Loss = 0.8328
Step 290: Loss = 0.8725
Step 300: Loss = 0.8514
Step 310: Loss = 0.8503
Step 320: Loss = 0.8676
Step 330: Loss = 0.8211
Step 340: Loss = 0.8593
Step 350: Loss = 0.8815
Step 360: Loss = 0.8553
Step 370: Loss = 0.8196
Step 380: Loss = 0.8281
Step 390: Loss = 0.8457
Step 400: Loss = 0.8193
Step 410: Loss = 0.8452
Step 420: Loss 

In [39]:
save_directory = "/Users/premtimsina/Documents/bpbbook/chapter5/pretrained_bert/"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)


('/Users/premtimsina/Documents/bpbbook/chapter5/pretrained_bert/tokenizer_config.json',
 '/Users/premtimsina/Documents/bpbbook/chapter5/pretrained_bert/special_tokens_map.json',
 '/Users/premtimsina/Documents/bpbbook/chapter5/pretrained_bert/vocab.txt',
 '/Users/premtimsina/Documents/bpbbook/chapter5/pretrained_bert/added_tokens.json')

We conducted further pre-training of bert-based-uncased. Some of the areas for optimization are:

1. We used a very simple approach to clean the data, like removing
numbers and special characters. It's essential to invest more time in this process and employ more sophisticated techniques.
2. We used the simple nltk module to split sentences. While nltk works well for general sentence splitting, clinical notes are written in a more informal manner and often include numbers and stats. As a result, nltk is not the optimal solution. We should use advanced sentence detectors to split sentences.
3. The data items we prepared are not entirely accurate. For example, when merging all clinical notes together, the last sentence of clinical note A and the first sentence of clinical note B become sentence A and sentence B, which is not entirely correct.
When creating an LLM for your organization, it's crucial to invest a significant amount of time in cleaning the data; otherwise, you'll end up with a suboptimal model despite having a robust model architecture.