# Named Entity Recognition for Healthcare

## What is Named Entity Recognition?
Named Entity Recognition (NER) is a subtask of information extraction that locates named entities in text and classifies them into predefined categories such as persons, organizations, locations, and medical terms. In healthcare, NER can be used to extract medical entities, such as diseases and drugs, from clinical notes, research papers, and other medical documents.

## Aim and Objectives
The aim of this project is to develop a Named Entity Recognition (NER) system specifically tailored for healthcare applications. The objectives include:
- Developing a robust NER model that can accurately identify and classify medical entities in clinical texts.
- Evaluating the model's performance using standard metrics such as precision, recall, and F1-score.
- Exploring the use of pre-trained language models and transfer learning techniques to improve NER performance.

## Related Work
In healthcare, NER has been applied to tasks such as extracting drug names, medical conditions, and treatment plans from clinical notes. Previous studies have shown that domain-specific language models can improve NER performance in healthcare contexts. For instance, BioBERT and ClinicalBERT start from BERT and continue training on biomedical or clinical text so that they better capture medical terminology and context.
Examples of these models include:
- BioBERT: A biomedical language representation model based on BERT, further pre-trained on PubMed abstracts and PubMed Central full-text articles.
- ClinicalBERT: A variant of BERT further trained on clinical notes to improve performance on healthcare-related tasks.
- Med7: A clinical NER model built on spaCy that recognizes seven medication-related categories (drug, dosage, strength, form, route, frequency, and duration) in electronic health records.

## Datasets

The datasets used for training and evaluating the NER model include:
- BC5CDR: The BioCreative V Chemical Disease Relation corpus, a set of 1,500 PubMed abstracts annotated with chemical and disease mentions and with chemical-induced disease relations.
- NCBI Disease Corpus: A collection of PubMed abstracts annotated with disease mentions, providing a rich source of medical terminology and context.
- MedMentions: A dataset of over 4,000 PubMed abstracts in which mentions of biomedical concepts are annotated with their corresponding UMLS (Unified Medical Language System) concepts.

> Other datasets suitable for NER in healthcare will be listed here in the future.

## Methodology

## Data Preprocessing

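The dataset used has the following labels and integer IDs. The labels follow the BIO scheme: `B-` marks the first token of an entity, `I-` marks a continuation token, and `O` marks a token outside any entity.
- `O`: 0
- `B-Chemical`: 1
- `B-Disease`: 2
- `I-Disease`: 3
- `I-Chemical`: 4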
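
To make the `B-`/`I-`/`O` labels concrete, here is a minimal sketch of how a tagged token sequence can be grouped into entity spans. The helper `extract_entities` and the example sentence are hypothetical and are not part of the notebook; only the label IDs come from the list above.

```python
# Label IDs as listed above for the BC5CDR dataset.
ID2LABEL = {0: "O", 1: "B-Chemical", 2: "B-Disease", 3: "I-Disease", 4: "I-Chemical"}

def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        label = ID2LABEL[tag_id]
        if label.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

# Hypothetical sentence, not taken from the dataset.
tokens = ["Patients", "with", "renal", "failure", "received", "cisplatin", "."]
tag_ids = [0, 0, 2, 3, 0, 1, 0]
print(extract_entities(tokens, tag_ids))
```

Here `renal failure` is recovered as one Disease span because the `I-Disease` tag attaches `failure` to the preceding `B-Disease` token.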



> We plan to add other datasets in the future; for now we use only the BC5CDR dataset.

In [15]:
from datasets import load_dataset

# Load the BC5CDR dataset
dataset = load_dataset('tner/bc5cdr')
dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5228
    })
    validation: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5330
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5865
    })
})

> The dataset has three splits: `train`, `validation`, and `test`. Each example has a `tokens` column (the sentence, already split into words) and a `tags` column (one integer label ID per token).

In [2]:
len(dataset['train']), len(dataset['validation']), len(dataset['test'])

(5228, 5330, 5865)

In [3]:
import random
random.seed(0)  # Set a random seed for reproducibility
# randrange excludes the upper bound; randint(0, len(...)) could return an out-of-range index
dataset['train'][random.randrange(len(dataset['train']))]  # Show a random example from the training set

{'tokens': ['RESULTS',
  ':',
  'All',
  'the',
  'patients',
  'were',
  'examined',
  'for',
  'toxicity',
  ';',
  '34',
  'were',
  'examinable',
  'for',
  'response',
  '.'],
 'tags': [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]}

## Model Selection and Training

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer

id2label = {0: "O", 1: "B-Chemical", 2: "B-Disease", 3: "I-Disease", 4: "I-Chemical"}
label2id = {v: k for k, v in id2label.items()}
device = "mps" if torch.backends.mps.is_available() else "cpu"
# Load a pre-trained tokenizer and model for token classification
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(id2label),  # 5 labels: O plus B-/I- tags for Chemical and Disease
    id2label=id2label,
    label2id=label2id,
)
model.to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [5]:
def tokenize_and_align_labels(examples):
    # Tokenize pre-split words into WordPiece sub-tokens
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        padding="max_length",  # pad every example to max_length
        max_length=128,        # longer sequences are truncated
        is_split_into_words=True,
        return_offsets_mapping=True
    )
    labels = []
    for i, label in enumerate(examples['tags']):
        # word_ids maps each sub-token to its source word (None for [CLS], [SEP], [PAD])
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # -100 is ignored by the loss
            else:
                # every sub-token inherits the tag of its source word
                label_ids.append(label[word_idx])
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Tokenize the dataset and align labels with tokens
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
tokenized_datasets['train'][0]

{'tokens': ['Naloxone',
  'reverses',
  'the',
  'antihypertensive',
  'effect',
  'of',
  'clonidine',
  '.'],
 'tags': [1, 0, 0, 0, 0, 0, 1, 0],
 'input_ids': [101, 11896, 2858, 21501, 1162, 7936, 1116, 1103, 2848, 7889,
  17786, 5026, 2109, 2629, 1104, 172, 4934, 2386, 2042, 119, 102,
  0, 0, 0, ...],  # 107 padding zeros in total, giving a sequence length of 128
 'token_type_ids': [0, 0, 0, ...
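
The alignment function above copies each word's tag to all of its WordPiece sub-tokens, so a word tagged `B-Chemical` produces several `B-Chemical` sub-tokens. One common alternative, which is not what the run reported in this notebook uses, is to label only the first sub-token of each word and mask the remaining pieces with `-100`. The sketch below shows that idea in plain Python; the function name `align_labels_first_subtoken`, the `word_ids` list, and the gold tags are illustrative assumptions.

```python
def align_labels_first_subtoken(word_ids, word_labels, ignore_index=-100):
    """Label only the first sub-token of each word; mask everything else."""
    aligned = []
    previous_word = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(ignore_index)            # special or padding token
        elif word_idx != previous_word:
            aligned.append(word_labels[word_idx])   # first piece of a word
        else:
            aligned.append(ignore_index)            # continuation piece
        previous_word = word_idx
    return aligned

# Illustrative word_ids for:
# [CLS] The patient was diagnosed with diabetes and prescribed met ##form ##in . [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, None]
word_labels = [0, 0, 0, 0, 0, 2, 0, 0, 1, 0]  # hypothetical gold tags, one per word
print(align_labels_first_subtoken(word_ids, word_labels))
```

With this scheme each word contributes exactly one labeled position, so entity counts at evaluation time are per word and not per sub-token. Switching schemes would change the reported metrics, so results from the two approaches are not directly comparable.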

In [7]:
import evaluate
metric = evaluate.load("seqeval")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for l in label if l != -100]
        for label in labels
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    # Multiply only float values by 100
    def scale_metrics(d):
        return {k: (v * 100 if isinstance(v, float) else v) for k, v in d.items()}
    return {k: scale_metrics(v) if isinstance(v, dict) else (v * 100 if isinstance(v, float) else v)
            for k, v in results.items()}
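
To see what the `-100` filtering in `compute_metrics` does before the sequences reach seqeval, here is a small sketch on toy values. The predictions and labels are hypothetical; only the label mapping comes from the notebook.

```python
id2label = {0: "O", 1: "B-Chemical", 2: "B-Disease", 3: "I-Disease", 4: "I-Chemical"}

# Toy values (hypothetical): one sequence of five positions.
predictions = [[0, 1, 4, 0, 0]]        # argmax over the logits
labels = [[-100, 1, 4, 0, -100]]       # -100 marks [CLS], [SEP], or [PAD]

# Keep a prediction only where the gold label is not masked.
true_predictions = [
    [id2label[p] for p, l in zip(pred, lab) if l != -100]
    for pred, lab in zip(predictions, labels)
]
true_labels = [
    [id2label[l] for l in lab if l != -100]
    for lab in labels
]
print(true_predictions)
print(true_labels)
```

Both lists end up with three string labels per sequence, which is the format seqeval expects for entity-level scoring.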


In [8]:
args = TrainingArguments(
    output_dir="ner-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch"
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()


| Step | Training Loss |
|------|---------------|
| 500  | 0.191         |




TrainOutput(global_step=981, training_loss=0.13031390051107766, metrics={'train_runtime': 616.1656, 'train_samples_per_second': 25.454, 'train_steps_per_second': 1.592, 'total_flos': 1024572371696640.0, 'train_loss': 0.13031390051107766, 'epoch': 3.0})
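
The `global_step=981` and `train_samples_per_second` values can be reproduced from the training configuration. The arithmetic below is a quick check that assumes a single device and no gradient accumulation, which are the defaults and are not stated explicitly in the notebook.

```python
import math

train_examples = 5228      # rows in the train split
batch_size = 16            # per_device_train_batch_size
epochs = 3                 # num_train_epochs
train_runtime = 616.1656   # seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

Because logging happens every 500 steps by default, a run of 981 steps logs the training loss only once, which is why the table shows a single row.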

In [9]:
trainer.evaluate()



{'eval_loss': 0.18123093247413635,
 'eval_Chemical': {'precision': 92.13810110974106,
  'recall': 93.84136233485708,
  'f1': 92.98193220845154,
  'number': 19907},
 'eval_Disease': {'precision': 81.11102620921525,
  'recall': 84.66937863922789,
  'f1': 82.85201373712145,
  'number': 12537},
 'eval_overall_precision': 87.81248126611115,
 'eval_overall_recall': 90.29712735790902,
 'eval_overall_f1': 89.03747378658481,
 'eval_overall_accuracy': 95.08518333908465,
 'eval_runtime': 56.7029,
 'eval_samples_per_second': 93.999,
 'eval_steps_per_second': 5.89,
 'epoch': 3.0}
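
The F1 values in this output are the harmonic mean of the corresponding precision and recall. As a sanity check, the sketch below recomputes them from the reported numbers; `f1_score` is a hypothetical helper written for this check, not a seqeval function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation output, already scaled to percentages.
overall = f1_score(87.81248126611115, 90.29712735790902)
chemical = f1_score(92.13810110974106, 93.84136233485708)
disease = f1_score(81.11102620921525, 84.66937863922789)
print(round(overall, 2), round(chemical, 2), round(disease, 2))
```

Note that `eval_overall_accuracy` is a token-level figure and is dominated by the frequent `O` label, so the entity-level F1 scores are the more informative numbers here.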

In [10]:
model.save_pretrained('ner-healthcare-part1')
tokenizer.save_pretrained('ner-healthcare-part1')

('ner-healthcare-part1/tokenizer_config.json',
 'ner-healthcare-part1/special_tokens_map.json',
 'ner-healthcare-part1/vocab.txt',
 'ner-healthcare-part1/added_tokens.json',
 'ner-healthcare-part1/tokenizer.json')

In [11]:
model_path = 'ner-healthcare-part1'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)
model.eval()
sentence = "The patient was diagnosed with diabetes and prescribed metformin."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)
outputs = model(**inputs)
outputs

TokenClassifierOutput(loss=None, logits=tensor([[[ 5.4552, -0.9119, -0.3546, -1.7974, -0.9001],
         [ 8.7544, -2.2196, -1.4454, -1.4424, -1.5638],
         [ 8.7365, -2.2940, -1.1467, -1.1861, -1.6643],
         [ 8.9532, -2.1271, -1.4135, -1.5300, -1.6134],
         [ 8.1951, -2.3656, -0.5390, -1.4019, -1.8728],
         [ 7.6669, -2.0340,  0.1609, -1.4354, -2.2945],
         [-0.7699, -1.1184,  5.2833, -0.0622, -2.5795],
         [ 8.4840, -1.6472, -1.7583, -1.6701, -1.4407],
         [ 8.2167, -1.0587, -1.5711, -1.8173, -1.5328],
         [-1.3135,  6.8937, -2.1130, -2.3221, -1.6221],
         [-0.9988,  6.8142, -2.2059, -2.4825, -1.5235],
         [-0.9092,  6.9303, -2.2710, -2.4274, -1.0259],
         [ 8.2220, -1.3677, -1.8447, -1.7283, -1.4924],
         [ 4.6525, -0.0528, -1.8846, -1.7758, -0.2108]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [13]:
logits = outputs.logits
predicted_label = torch.argmax(logits, dim=-1)
predicted_label

tensor([[0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 1, 1, 0, 0]])
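
`argmax` gives only the winning label. Applying a softmax to a logit row also gives a rough confidence for that label. The sketch below does this in plain Python for the row at index 6 of the logits output, which corresponds to the token `diabetes`; the `softmax` helper is written for this illustration and the probability should be read as an uncalibrated score.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "O", 1: "B-Chemical", 2: "B-Disease", 3: "I-Disease", 4: "I-Chemical"}

# Logit row for the token "diabetes", copied from the model output.
row = [-0.7699, -1.1184, 5.2833, -0.0622, -2.5795]
probs = softmax(row)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 3))
```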

In [14]:
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
predictions = predicted_label[0].tolist()
for token, prediction in zip(tokens, predictions):
    print(f"{token}: {id2label[prediction]}")

[CLS]: O
The: O
patient: O
was: O
diagnosed: O
with: O
diabetes: B-Disease
and: O
prescribed: O
met: B-Chemical
##form: B-Chemical
##in: B-Chemical
.: O
[SEP]: O
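
The printout is per WordPiece sub-token, so `metformin` appears as three pieces that each carry `B-Chemical`. For reporting, it is usually more convenient to merge the pieces back into words. The following is a minimal sketch that keeps the label of each word's first piece; `merge_wordpieces` is a hypothetical helper, and the token and label lists are copied from the output above.

```python
def merge_wordpieces(tokens, labels, special=("[CLS]", "[SEP]", "[PAD]")):
    """Rebuild words from WordPiece tokens, keeping each word's first label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in special:
            continue  # drop special tokens
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word, keep its label.
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words

tokens = ["[CLS]", "The", "patient", "was", "diagnosed", "with", "diabetes",
          "and", "prescribed", "met", "##form", "##in", ".", "[SEP]"]
labels = ["O", "O", "O", "O", "O", "O", "B-Disease",
          "O", "O", "B-Chemical", "B-Chemical", "B-Chemical", "O", "O"]

for word, label in merge_wordpieces(tokens, labels):
    if label != "O":
        print(f"{word}: {label}")
```

When the tokenizer is available, the same grouping can be done with the `word_ids()` mapping of the encoded input, which avoids relying on the `##` prefix.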


## Evaluation