# Named Entity Recognition for Healthcare

## What is Named Entity Recognition?
Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities in text into predefined categories such as the names of persons, organizations, locations, medical terms, etc. In healthcare, NER can be used to extract relevant medical entities from clinical notes, research papers, and other medical documents.
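
To make the "locate and classify" idea concrete, the sketch below encodes entity spans as per-token labels using the BIO convention (B = beginning of an entity, I = inside, O = outside). This is an illustration only: the sentence, the spans, and the helper `spans_to_bio` are hypothetical and are not taken from this project's data or code.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical example sentence with one chemical and one disease mention.
tokens = ["Aspirin", "can", "cause", "gastric", "bleeding", "."]
spans = [(0, 1, "Chemical"), (3, 5, "Disease")]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token:10s} {tag}")
```

A token-level tagger then only has to predict one label per token, and entity boundaries can be recovered from the B/I/O pattern.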

## Aim and Objectives
The aim of this project is to develop a Named Entity Recognition (NER) system specifically tailored for healthcare applications. The objectives include:
- Developing a robust NER model that can accurately identify and classify medical entities in clinical texts.
- Evaluating the model's performance using standard metrics such as precision, recall, and F1-score.
- Exploring the use of pre-trained language models and transfer learning techniques to improve NER performance.
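
For reference, the evaluation metrics named above can be computed at the entity level: a prediction counts as correct only if both its span and its type match a gold entity exactly. The following is a minimal sketch under that assumption; the function `entity_prf` and the example entities are hypothetical, and in practice a library such as `seqeval` is often used for this.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall, and F1.

    `gold` and `pred` are sets of (sentence_id, start, end, label) tuples.
    """
    true_positives = len(gold & pred)  # exact span and label matches
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted entities for two sentences.
gold = {(0, 0, 1, "Chemical"), (0, 3, 5, "Disease"), (1, 2, 3, "Disease")}
pred = {(0, 0, 1, "Chemical"), (0, 3, 4, "Disease")}  # second span is too short

precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here only one of the two predictions matches a gold entity exactly, so the partially overlapping disease span counts as both a false positive and a false negative.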

## Related Work
In healthcare, NER has been applied to tasks such as extracting drug names, medical conditions, and treatment plans from clinical notes. Previous studies have shown that domain-specific language models can significantly improve NER performance in healthcare contexts. For instance, models like BioBERT and ClinicalBERT build on BERT with additional pre-training on large biomedical or clinical corpora, which improves their handling of medical terminology and context.

Examples of these models include:
- BioBERT: A pre-trained biomedical language representation model based on BERT.
- ClinicalBERT: A variant of BERT further pre-trained on clinical notes to improve performance on healthcare-related tasks.
- Med7: A spaCy-based clinical NER model that recognizes seven medication-related concepts (drug name, dosage, strength, form, route, frequency, and duration) in electronic health records.

## Datasets

The datasets used for training and evaluating the NER model include:
- BC5CDR: The BioCreative V Chemical-Disease Relation corpus, consisting of 1,500 PubMed abstracts annotated with chemical and disease mentions.
- NCBI Disease Corpus: A collection of PubMed abstracts annotated with disease mentions, providing a rich source of disease terminology and context.
- MedMentions: A dataset of PubMed abstracts in which mentions of biomedical concepts are annotated with their corresponding UMLS (Unified Medical Language System) concepts.

> Other datasets that can be used for NER in healthcare will be listed here in the future.

## Methodology

## Data Preprocessing

In [1]:
from datasets import load_dataset
# Load the BC5CDR dataset
dataset = load_dataset('tner/bc5cdr')
dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5228
    })
    validation: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5330
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5865
    })
})

> The dataset has three splits: `train`, `validation`, and `test`. Each example holds a list of `tokens` and a parallel list of integer `tags`, one label ID per token.
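
Because every example carries one integer tag per token, a quick way to get a feel for the data is to count how often each tag ID occurs. The sketch below does this with `collections.Counter`. The helper `tag_distribution` and the two stand-in rows are hypothetical; they only mimic the `tokens`/`tags` schema shown above so that the snippet runs without downloading anything.

```python
from collections import Counter


def tag_distribution(examples):
    """Count how often each integer tag ID occurs across all examples."""
    counts = Counter()
    for example in examples:
        counts.update(example["tags"])
    return counts


# Hypothetical stand-in rows with the same schema as the dataset ('tokens', 'tags').
sample_rows = [
    {"tokens": ["Aspirin", "reduced", "fever", "."], "tags": [1, 0, 2, 0]},
    {"tokens": ["No", "toxicity", "was", "observed", "."], "tags": [0, 2, 0, 0, 0]},
]

counts = tag_distribution(sample_rows)
total = sum(counts.values())
for tag_id, n in sorted(counts.items()):
    print(f"tag {tag_id}: {n} tokens ({n / total:.0%})")
```

The same function should work on a real split, for example `tag_distribution(dataset['train'])`, since iterating over a split yields one dictionary per example. NER data is typically dominated by the "outside" tag, which is worth keeping in mind when interpreting token-level accuracy.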

In [6]:
len(dataset['train']), len(dataset['validation']), len(dataset['test'])

(5228, 5330, 5865)

In [9]:
import random
random.seed(0)  # Set a random seed for reproducibility
dataset['train'][random.randrange(len(dataset['train']))]  # Pick a random training example (randrange excludes the upper bound, so the index is always valid)

{'tokens': ['RESULTS',
  ':',
  'All',
  'the',
  'patients',
  'were',
  'examined',
  'for',
  'toxicity',
  ';',
  '34',
  'were',
  'examinable',
  'for',
  'response',
  '.'],
 'tags': [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]}
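
The integer tags are easier to read once they are mapped back to label names and grouped into entities. The sketch below does this for the example above. Note the assumptions: the `ID2LABEL` mapping is what the `tner/bc5cdr` dataset card appears to document and should be verified before relying on it, and `decode_entities` is a hypothetical helper, not part of any library.

```python
# ASSUMPTION: label IDs as documented on the tner/bc5cdr dataset card; verify before use.
ID2LABEL = {0: "O", 1: "B-Chemical", 2: "B-Disease", 3: "I-Disease", 4: "I-Chemical"}


def decode_entities(tokens, tag_ids, id2label=ID2LABEL):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag_id in zip(tokens, tag_ids):
        label = id2label[tag_id]
        if label == "O":
            flush()
            current_tokens, current_type = [], None
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)  # continue the current entity
        else:
            # A B- tag, or an I- tag with no matching open entity, starts a new one.
            flush()
            current_tokens, current_type = [token], label[2:]
    flush()
    return entities


tokens = ['RESULTS', ':', 'All', 'the', 'patients', 'were', 'examined', 'for',
          'toxicity', ';', '34', 'were', 'examinable', 'for', 'response', '.']
tags = [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]

entities = decode_entities(tokens, tags)
print(entities)
```

Under the assumed mapping, the only non-zero tag in this sentence (ID 2 on `toxicity`) corresponds to a single-token disease mention.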

## Model Selection and Training

## Evaluation