<a href="https://colab.research.google.com/github/ahmed191034/Analysis-on-Nlp-toolkits-on-Mimic-Iv/blob/main/Entity_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
'''
Importing essential libraries for the machine learning workflow.
pandas and numpy are used for loading datasets and performing numerical operations.
The 're' module supports regular expression processing for text cleaning.
PyTorch provides the deep-learning framework required to run ClinicalBERT, while tqdm
adds progress bars to monitor long-running loops.
This block also prints the environment information for reproducibility and debugging,
and automatically selects 'cuda' if a GPU is available; otherwise it defaults to CPU.
This ensures the user knows where the ClinicalBERT model will run.
'''
import pandas as pd
import numpy as np
import re
import torch
from tqdm import tqdm

print("Torch version:", torch.__version__)
print("GPU Available:", torch.cuda.is_available())
print("Using device:", "cuda" if torch.cuda.is_available() else "cpu")


Torch version: 2.9.0+cu126
GPU Available: True
Using device: cuda


In [None]:
'''
This cell prints detailed information about the hardware environment used to run the NLP models.
It first reports whether the system is using a GPU or CPU, which matters for understanding
the execution speed of ClinicalBERT. The nvidia-smi command retrieves GPU specifications such as
memory, driver version, and utilisation; if no GPU is available, it prints a fallback message.

The next section prints CPU information, showing the processor type and core details, which helps
evaluate system performance when the model falls back to CPU. Finally, the code displays the system's
RAM capacity using free -h, making it possible to assess whether there is sufficient memory to process
large clinical text datasets. This entire block supports reproducibility and transparency in model evaluation.
'''

print("Using device:", "cuda" if torch.cuda.is_available() else "cpu")
print("=== GPU Info ===")
!nvidia-smi || echo "No GPU detected"

print("\n=== CPU Info ===")
!cat /proc/cpuinfo | head -20

print("\n=== RAM Info ===")
!free -h


Using device: cuda
=== GPU Info ===
Sat Dec  6 03:00:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   34C    P8             12W /   72W |       3MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
            

In [None]:
'''The clean_text() function standardises each note by converting text to lowercase, replacing
runs of underscores with spaces, and expanding the abbreviations listed in ABBR_DICT using the
precompiled patterns. It also normalises numerical expressions such as “4wk” → “4 week”,
replaces unwanted symbols with spaces, and collapses extra whitespace. This gives cleaner, more
consistent input for Stanza and ClinicalBERT, which is intended to improve entity-extraction
accuracy and reduce noise from clinical formatting variations.
'''
# Abbreviation dictionary
ABBR_DICT = {
    'y/o': 'year old', 'yo': 'year old', 'wk': 'week', 'mo': 'month',
    'yr': 'year', 's/p': 'status post', 'hx': 'history', 'h/o': 'history of',
    'pmh': 'past medical history', 'hpi': 'history of present illness',
    'ros': 'review of systems', 'n/v/d': 'nausea vomiting and diarrhea',
    'n/v': 'nausea and vomiting', 'c/o': 'complains of', 'c/w': 'consistent with',
    'sob': 'shortness of breath', 'htn': 'hypertension', 'dm': 'diabetes mellitus',
    'cad': 'coronary artery disease', 'copd': 'chronic obstructive pulmonary disease',
    'mi': 'myocardial infarction', 'chf': 'congestive heart failure',
    'pna': 'pneumonia', 'uti': 'urinary tract infection', 'gi': 'gastrointestinal',
    'ct': 'computed tomography', 'mri': 'magnetic resonance imaging',
    'rrr': 'regular rate and rhythm', 'iv': 'intravenous', 'po': 'by mouth',
    'abx': 'antibiotics', 'fx': 'fracture', 'prn': 'as needed'
}

ABBR_PATTERNS = {
    re.compile(r'\b' + re.escape(k) + r'\b'): v for k, v in ABBR_DICT.items()
}

def clean_text(text):
    if not isinstance(text, str):
        return ""
    s = text.lower()
    s = re.sub(r'_+', ' ', s)

    # abbreviation expansion
    for pattern, full in ABBR_PATTERNS.items():
        s = pattern.sub(full, s)

    # number patterns: 4wk → 4 week
    s = re.sub(r'(\d+)wk\b', r'\1 week', s)
    s = re.sub(r'(\d+)mo\b', r'\1 month', s)
    s = re.sub(r'(\d+)yr\b', r'\1 year', s)

    # remove symbols
    s = re.sub(r"[^a-zA-Z0-9\. ]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()

    return s
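
The expansion loop above applies the patterns in dictionary order, so an abbreviation such as `n/v/d` is only expanded correctly if it is tried before its prefix `n/v`. The following is a minimal sketch of that ordering effect, not the notebook's own code: `SAMPLE_ABBR` is a hypothetical three-entry subset listed prefix-first on purpose, and `expand_abbreviations` is an illustrative helper that sorts keys longest-first to make the result independent of dictionary order.

```python
import re

# Hypothetical three-entry subset of the abbreviation dictionary,
# deliberately listed with the short prefix "n/v" before "n/v/d".
SAMPLE_ABBR = {
    "n/v": "nausea and vomiting",
    "n/v/d": "nausea vomiting and diarrhea",
    "hx": "history",
}

def expand_abbreviations(text, mapping, longest_first=True):
    """Expand whole-word abbreviations; optionally try longer keys before their prefixes."""
    keys = sorted(mapping, key=len, reverse=True) if longest_first else list(mapping)
    for key in keys:
        text = re.sub(r"\b" + re.escape(key) + r"\b", mapping[key], text)
    return text

note = "hx of n/v/d"
naive = expand_abbreviations(note, SAMPLE_ABBR, longest_first=False)
ordered = expand_abbreviations(note, SAMPLE_ABBR)
print(naive)    # history of nausea and vomiting/d
print(ordered)  # history of nausea vomiting and diarrhea
```

In `ABBR_DICT` as written, `n/v/d` already precedes `n/v`, so the sort only matters if entries are later reordered or added.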


In [None]:
'''
This section loads the MIMIC-IV discharge summaries, randomly selects a subset of 40,000 notes
for analysis, and applies the earlier text-cleaning function to standardise the clinical
narratives. The dataset is read using pandas, and sampling with a fixed random_state ensures
the subset is reproducible across runs. The tqdm progress bar is enabled to provide visual
feedback during preprocessing, which can take significant time due to the size of the dataset.
Each note is then cleaned using clean_text(), expanding abbreviations and normalising formatting
before passing the text to the NLP models. The cleaned subset is finally saved as an intermediate
CSV file to allow reuse without repeating the expensive preprocessing step.
'''

df = pd.read_csv("/content/drive/MyDrive/discharge.csv")
print("Loaded:", len(df), "notes")

df_40k = df.sample(n=40000, random_state=42).reset_index(drop=True)
print("Sampled:", len(df_40k), "notes")

# Clean text
tqdm.pandas()
df_40k["cleaned_text"] = df_40k["text"].progress_apply(clean_text)

# Save intermediate cleaned dataset (optional)
df_40k.to_csv("/content/drive/MyDrive/df_40k_cleaned.csv", index=False)
print("✔ cleaned dataset saved")


Loaded: 331793 notes
Sampled: 40000 notes


100%|██████████| 40000/40000 [04:48<00:00, 138.59it/s]


✔ cleaned dataset saved
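
The reproducibility argument for a fixed seed can be illustrated without pandas. This is a sketch under stated assumptions: `random.Random(seed)` stands in for `random_state=42`, `sample_ids` and `note_ids` are hypothetical names, and the rows chosen here will not match the ones pandas selects.

```python
import random

def sample_ids(ids, n, seed):
    """Draw n ids without replacement using a private, seeded generator."""
    return random.Random(seed).sample(ids, n)

note_ids = list(range(1000))  # hypothetical stand-in for the note index
first = sample_ids(note_ids, 5, seed=42)
second = sample_ids(note_ids, 5, seed=42)
print(first == second)  # True: the same seed yields the same subset
```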


In [None]:
'''
This block loads the ClinicalBERT model and prepares it for clinical entity extraction.
Using the HuggingFace Transformers library, the model and tokenizer are retrieved from the
pretrained checkpoint “samrawal/bert-base-uncased_clinical-ner”, which is specifically
fine-tuned for identifying medical entities in clinical text.

The AutoTokenizer converts raw text into tokens that the model can understand, while
AutoModelForTokenClassification loads the underlying transformer architecture configured
for token-level predictions. A HuggingFace pipeline is then created to streamline inference:
it performs tokenisation, model execution, and output aggregation in a single step.
The aggregation_strategy="simple" merges subword fragments into complete entity spans,
producing cleaner outputs. The device parameter selects GPU if available, otherwise CPU,
ensuring optimal runtime performance.

This setup prepares ClinicalBERT for efficient, large-scale entity extraction across thousands
of ICU discharge summaries.
'''

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_name = "samrawal/bert-base-uncased_clinical-ner"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

nlp_cb = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1
)

print("🔥 ClinicalBERT loaded")


Device set to use cuda:0


🔥 ClinicalBERT loaded


In [None]:
'''
This block measures the processing speed of the ClinicalBERT pipeline by timing how long it
takes to extract entities from a sample of 100 cleaned discharge summaries. A small subset of
notes is selected to obtain an average inference time per note, which helps estimate the model's
scalability and overall computational cost.

The script records the start and end timestamps while applying the ClinicalBERT pipeline to each
sample note, then calculates the average runtime per note. This value is extrapolated to estimate
how long it would take to process the full dataset of 40,000 notes. Such runtime estimation is
important for understanding hardware requirements and evaluating whether ClinicalBERT is suitable
for large-scale real-world clinical deployments.
'''

import time

print("\n⏱ Measuring ClinicalBERT speed on 100 notes...")

sample_texts = df_40k["cleaned_text"].head(100).tolist()

start = time.time()
for t in sample_texts:
    _ = nlp_cb(t)
end = time.time()

cb_time_per_note = (end - start) / len(sample_texts)
print(f"ClinicalBERT → {cb_time_per_note:.4f} seconds per note")

# Estimate full 40k runtime
cb_total_minutes = (cb_time_per_note * 40000) / 60
print(f"Estimated runtime for 40,000 notes: {cb_total_minutes:.2f} minutes")



⏱ Measuring ClinicalBERT speed on 100 notes...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


ClinicalBERT → 0.0509 seconds per note
Estimated runtime for 40,000 notes: 33.95 minutes
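
The extrapolation above is plain arithmetic: seconds per note, times the number of notes, divided by 60. A minimal sketch follows, with hypothetical helper names `seconds_per_item` and `extrapolate_minutes`, and `time.perf_counter()` swapped in for `time.time()` as a monotonic clock better suited to short intervals.

```python
import time

def seconds_per_item(fn, items):
    """Average wall-clock seconds spent calling fn on each item."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    return (time.perf_counter() - start) / len(items)

def extrapolate_minutes(seconds_each, n_items):
    """Scale a per-item time to a full run, in minutes."""
    return seconds_each * n_items / 60

# Trivial stand-in workload; in the notebook fn would be the NER pipeline.
measured = seconds_per_item(len, ["a", "bb", "ccc"])

# Using the rounded per-note figure printed above.
estimate = extrapolate_minutes(0.0509, 40000)
print(f"{estimate:.2f} minutes")  # 33.93 minutes
```

The notebook reports 33.95 minutes because it extrapolates from the unrounded per-note time, not from the rounded 0.0509 shown in the output.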


In [None]:
'''
This block defines two helper functions used during the post-processing stage of entity
extraction. The first function, pick_longest(), receives a list of extracted entity spans
and returns the longest string. This heuristic is useful when models produce overlapping
or fragmented predictions; choosing the longest span often captures the most complete and
clinically meaningful phrase.

The second function, extract_outcome(), implements a simple rule-based method for detecting
patient outcomes directly from the text. Because outcomes are often expressed in consistent,
formulaic language in discharge summaries, keyword matching can reliably identify categories
such as “discharged home,” “transferred,” “died,” or “palliative care.” If none of these
phrases appear, the function returns an empty string. Although lightweight, this approach
provides a baseline outcome extraction method that complements model-based predictions.
'''

def pick_longest(entities_list):
    if not entities_list:
        return ""
    return max(entities_list, key=len)

def extract_outcome(text):
    # Simple rule-based keyword matching; the first matching phrase wins
    if "discharged home" in text:
        return "discharged home"
    if "transferred" in text:
        return "transferred"
    if "died" in text or "expired" in text:
        return "died"
    if "palliative" in text:
        return "palliative / comfort care"
    return ""
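
Two properties of these helpers are worth seeing in isolation: `pick_longest()` keeps only the longest span, and the outcome rules are checked in a fixed order, so the first match wins. The sketch below is a variation, not the notebook's method: `OUTCOME_RULES` and `first_outcome` are hypothetical names, and word-boundary regular expressions replace the plain substring tests so that a word such as "studied" does not trigger the "died" rule.

```python
import re

# Ordered (pattern, label) pairs; earlier rules take priority.
OUTCOME_RULES = [
    (re.compile(r"\bdischarged home\b"), "discharged home"),
    (re.compile(r"\btransferred\b"), "transferred"),
    (re.compile(r"\b(died|expired)\b"), "died"),
    (re.compile(r"\bpalliative\b"), "palliative / comfort care"),
]

def first_outcome(text):
    """Return the label of the first rule whose pattern occurs in text, else ''."""
    for pattern, label in OUTCOME_RULES:
        if pattern.search(text):
            return label
    return ""

print(max(["pain", "chest pain"], key=len))                # chest pain
print(first_outcome("transferred to icu and later died"))  # transferred
print(first_outcome("imaging was studied"))                # (empty string)
```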


In [None]:
'''
This function performs entity extraction using ClinicalBERT and organises the output into
well-defined clinical categories. It first validates the input and returns empty fields
if the text is missing or invalid. The ClinicalBERT pipeline (nlp_cb) is then applied to
the note, producing a list of token-level predictions.

Each predicted entity is mapped into one of four categories (diseases, treatments, procedures,
and medications) based on the model's entity labels. Because ClinicalBERT sometimes outputs
subword fragments or inconsistent punctuation, the code cleans each entity span using a
regular expression before storing it.

After collecting all spans, the function applies pick_longest() to each list to return the
most complete and meaningful phrase for each clinical entity type. This helps reduce noise
from fragmented or overlapping predictions. The outcome category is populated separately
using a rule-based function (extract_outcome) that detects common discharge outcome phrases.

The function ultimately returns a dictionary containing one representative extracted entity
per category: disease, treatment, procedure, medication, and outcome.
'''

def extract_cb(text):
    if not isinstance(text, str) or not text.strip():
        return {"CB_Disease":"", "CB_Treatment":"", "CB_Procedure":"",
                "CB_Medication":"", "CB_Outcome":""}

    res = nlp_cb(text)

    diseases, treatments, procedures, medications = [], [], [], []

    for ent in res:
        lab = ent["entity_group"].upper()
        word = re.sub(r"[^a-zA-Z0-9\s\-\.]", "", ent["word"]).strip()

        if lab in ["PROBLEM", "DISEASE", "DISORDER", "SYMPTOM", "FINDING"]:
            diseases.append(word)
        elif lab in ["TREATMENT", "THERAPY"]:
            treatments.append(word)
        elif lab in ["DRUG", "MEDICATION"]:
            medications.append(word)
        elif lab in ["PROCEDURE", "OPERATION", "TEST"]:
            procedures.append(word)

    return {
        "CB_Disease": pick_longest(diseases),
        "CB_Treatment": pick_longest(treatments),
        "CB_Procedure": pick_longest(procedures),
        "CB_Medication": pick_longest(medications),
        "CB_Outcome": extract_outcome(text)
    }
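
Because the category mapping depends entirely on which labels the checkpoint emits, it can help to check which categories are reachable at all. The sketch below assumes, for illustration only, an i2b2-style label set of `problem`, `treatment`, and `test`; the actual labels should be read from `model.config.id2label`. `CATEGORY_LABELS` mirrors the lists in `extract_cb()`, and `unreachable_categories` is a hypothetical helper.

```python
# Label lists copied from the branches of extract_cb().
CATEGORY_LABELS = {
    "disease": {"PROBLEM", "DISEASE", "DISORDER", "SYMPTOM", "FINDING"},
    "treatment": {"TREATMENT", "THERAPY"},
    "medication": {"DRUG", "MEDICATION"},
    "procedure": {"PROCEDURE", "OPERATION", "TEST"},
}

def unreachable_categories(model_labels):
    """Return categories for which the model emits none of the mapped labels."""
    seen = {label.upper() for label in model_labels}
    return sorted(cat for cat, labels in CATEGORY_LABELS.items() if not labels & seen)

# Assumed label set, for illustration; not read from the real checkpoint.
assumed_labels = ["problem", "treatment", "test"]
print(unreachable_categories(assumed_labels))  # ['medication']
```

Under that assumption, the medication list would stay empty for every note regardless of the text.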


In [None]:
'''
This block applies the ClinicalBERT extraction function to all 40,000 cleaned discharge
summaries in chunks of 2,000 notes. Within each chunk the notes are still passed to the
pipeline one at a time; chunking mainly keeps intermediate results manageable and lets the
progress bar give regular feedback on a long-running operation.

For each batch, the extract_cb() function is applied to every note, returning a structured
dictionary of extracted clinical entities. These results are stored and gradually appended
to a master output list. After all batches are processed, the outputs are converted into a
DataFrame and concatenated with the original df_40k dataset, ensuring each note retains its
associated extracted entities.

Finally, the combined dataset, containing the original note metadata and all ClinicalBERT
predictions, is saved as “ClinicalBERT_40k.csv.” This file serves as the complete output
of ClinicalBERT's entity extraction across the 40,000-note sample and can be used for
evaluation, comparison with Stanza, and further analysis.
'''

batch_size = 2000
outputs = []

for i in range(0, len(df_40k), batch_size):
    print(f"Batch {i} → {i+batch_size}")
    batch = df_40k["cleaned_text"].iloc[i:i+batch_size]
    batch_results = batch.progress_apply(extract_cb)
    outputs.extend(batch_results)

df_cb_40k = pd.concat([df_40k, pd.DataFrame(outputs)], axis=1)

df_cb_40k.to_csv("/content/drive/MyDrive/ClinicalBERT_40k.csv", index=False)
print("🔥 ClinicalBERT_40k.csv saved!")


Batch 0 → 2000

Batch 2000 → 4000
100%|██████████| 2000/2000 [01:26<00:00, 23.10it/s]

Batch 4000 → 6000
100%|██████████| 2000/2000 [01:26<00:00, 23.01it/s]

Batch 6000 → 8000
100%|██████████| 2000/2000 [01:26<00:00, 23.09it/s]

Batch 8000 → 10000
100%|██████████| 2000/2000 [01:26<00:00, 23.04it/s]

Batch 10000 → 12000
100%|██████████| 2000/2000 [01:26<00:00, 23.17it/s]

Batch 12000 → 14000
100%|██████████| 2000/2000 [01:26<00:00, 22.99it/s]

Batch 14000 → 16000
100%|██████████| 2000/2000 [01:27<00:00, 22.98it/s]

Batch 16000 → 18000
100%|██████████| 2000/2000 [01:26<00:00, 23.04it/s]

Batch 18000 → 20000
100%|██████████| 2000/2000 [01:26<00:00, 23.06it/s]

Batch 20000 → 22000
100%|██████████| 2000/2000 [01:27<00:00, 22.96it/s]

Batch 22000 → 24000
100%|██████████| 2000/2000 [01:26<00:00, 23.05it/s]

Batch 24000 → 26000
100%|██████████| 2000/2000 [01:26<00:00, 23.07it/s]

Batch 26000 → 28000
100%|██████████| 2000/2000 [01:27<00:00, 22.94it/s]

Batch 28000 → 30000
100%|██████████| 2000/2000 [01:27<00:00, 22.87it/s]

Batch 30000 → 32000
100%|██████████| 2000/2000 [01:27<00:00, 22.84it/s]

Batch 32000 → 34000
100%|██████████| 2000/2000 [01:27<00:00, 22.85it/s]

Batch 34000 → 36000
100%|██████████| 2000/2000 [01:26<00:00, 23.06it/s]

Batch 36000 → 38000
100%|██████████| 2000/2000 [01:27<00:00, 22.91it/s]

Batch 38000 → 40000
100%|██████████| 2000/2000 [01:26<00:00, 23.00it/s]


🔥 ClinicalBERT_40k.csv saved!
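
The chunking in the loop above comes from stepping `range()` by the batch size and slicing. The sketch below shows the resulting boundaries, with `chunk_bounds` as a hypothetical helper. It clips the final end index to the dataset length, which the loop's printed label (`i+batch_size`) does not do; with 40,000 notes and chunks of 2,000 the two agree, because the size divides evenly.

```python
def chunk_bounds(n_items, batch_size):
    """Return (start, end) index pairs covering n_items in steps of batch_size."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]

print(chunk_bounds(5, 2))               # [(0, 2), (2, 4), (4, 5)]
print(len(chunk_bounds(40000, 2000)))   # 20
```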


In [None]:
'''
This block generates a quantitative summary of ClinicalBERT's extraction performance across
the 40,000-note dataset. Empty strings are first converted to NaN values to ensure that
pandas functions such as .notna() and .isna() correctly identify missing extractions.
A list of the five extracted entity categories (disease, treatment, procedure, medication,
and outcome) is then defined for systematic evaluation.

For each entity type, the script calculates three key metrics:
• the number of notes where an entity was successfully extracted,
• the number of notes where the entity is missing, and
• the overall coverage percentage, representing how frequently ClinicalBERT identified
  that entity across the dataset.

These results are stored in a structured list and converted into a DataFrame to produce a
clear, interpretable summary table. This table provides an overview of ClinicalBERT's
strengths and weaknesses across different clinical entity categories and is used later in
the evaluation and discussion sections of the dissertation.
'''

df_cb = df_cb_40k.replace("", np.nan)

entity_cols = ["CB_Disease", "CB_Treatment", "CB_Procedure", "CB_Medication", "CB_Outcome"]

summary_data = []

for col in entity_cols:
    extracted = df_cb[col].notna().sum()
    missing = df_cb[col].isna().sum()
    coverage = round((extracted / len(df_cb)) * 100, 2)

    summary_data.append({
        "Entity Type": col.replace("CB_", ""),
        "Extracted Count": extracted,
        "Missing Count": missing,
        "Coverage (%)": coverage
    })

cb_summary = pd.DataFrame(summary_data)
print("\n====================================")
print("📊 ClinicalBERT Extraction Summary")
print("====================================")
print(cb_summary)


📊 ClinicalBERT Extraction Summary
  Entity Type  Extracted Count  Missing Count  Coverage (%)
0     Disease            39975             25         99.94
1   Treatment            39916             84         99.79
2   Procedure            39420            580         98.55
3  Medication                0          40000          0.00
4     Outcome            24273          15727         60.68
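
The coverage figure in this table is the share of notes with a non-empty extraction, expressed as a percentage. Here is a minimal pure-Python sketch of the same three metrics; `coverage_summary` is a hypothetical helper, and empty strings and `None` stand in for the NaN values used in the pandas version.

```python
def coverage_summary(values):
    """Count extracted vs. missing entries and the coverage percentage."""
    total = len(values)
    extracted = sum(1 for v in values if v not in ("", None))
    return {
        "extracted": extracted,
        "missing": total - extracted,
        "coverage_pct": round(extracted / total * 100, 2),
    }

summary = coverage_summary(["sepsis", "", None, "pneumonia"])
print(summary)  # {'extracted': 2, 'missing': 2, 'coverage_pct': 50.0}
```

Coverage measured this way only says that something was extracted for a note; it says nothing about whether the extracted span is correct.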




In [None]:
'''
This block installs and loads the Stanza NLP pipeline configured with the i2b2 clinical
NER model. Stanza is a neural NLP toolkit developed by the Stanford NLP Group,
and its i2b2 package is trained specifically on clinical narratives from the i2b2/VA
challenge dataset. Installing the library and downloading the English i2b2 model ensures
that the pipeline is equipped with domain-specific entity recognisers suitable for
extracting diseases, treatments, and procedures from clinical text.

The stanza.Pipeline is initialised with tokenisation and NER processors, while specifying
the i2b2 model for the NER component. GPU usage is automatically enabled if available,
allowing faster inference on large datasets. Once loaded, the Stanza pipeline can be
applied directly to each discharge summary to generate structured clinical entity
annotations, making it the second system evaluated against ClinicalBERT.
'''

!pip install stanza
import stanza

stanza.download('en', package='i2b2')

nlp_stz = stanza.Pipeline(
    lang='en',
    processors='tokenize,ner',
    package={'ner': 'i2b2'},
    use_gpu=torch.cuda.is_available()
)

print("🔥 Stanza loaded")


Collecting stanza
  Downloading stanza-1.11.0-py3-none-any.whl.metadata (14 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.15.0-py3-none-any.whl.metadata (5.7 kB)
Downloading stanza-1.11.0-py3-none-any.whl (1.7 MB)
Downloading emoji-2.15.0-py3-none-any.whl (608 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.15.0 stanza-1.11.0


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading these customized packages for language: en (English)...
| Processor       | Package |
-----------------------------
| ner             | i2b2    |
| backward_charlm | mimic   |
| pretrain        | mimic   |
| forward_charlm  | mimic   |

INFO:stanza:Downloaded file to /root/stanza_resources/en/ner/i2b2.pt
INFO:stanza:Downloaded file to /root/stanza_resources/en/backward_charlm/mimic.pt
INFO:stanza:Downloaded file to /root/stanza_resources/en/pretrain/mimic.pt
INFO:stanza:Downloaded file to /root/stanza_resources/en/forward_charlm/mimic.pt
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
INFO:stanza:Downloaded file to /root/stanza_resources/resources.json

INFO:stanza:Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |
| mwt       | combined |
| ner       | i2b2     |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


🔥 Stanza loaded


In [None]:
'''
This function performs entity extraction using the Stanza i2b2 clinical NER model and returns
a structured dictionary of entities grouped into five categories. After validating the input
text, the function passes the note through the Stanza pipeline, which generates tokenisation
and NER annotations.

Stanza labels each detected entity with a type; the i2b2 model uses *problem*, *treatment*,
and *test*, reflecting the original i2b2/VA challenge annotation scheme. Based on these
types, the function sorts each extracted span into one of five lists: disease, treatment,
procedure, medication, and outcome. The medication and outcome lists are filled only if the
model emits a label containing "drug", "med", or "outcome". Because Stanza often returns
multiple entities for each category, the function concatenates all extracted spans into a
semicolon-separated string rather than selecting only a single representative, preserving
Stanza's richer output.

If no entities are found for a category, the function stores a `None` value to indicate that
the field is missing. The final output is a dictionary containing Stanza's extracted entities
for all clinical categories, ready to be merged with the dataset and compared against
ClinicalBERT's predictions.
'''

def extract_stanza(text):
    if not isinstance(text, str) or not text.strip():
        return {"SZ_Disease":None, "SZ_Treatment":None, "SZ_Procedure":None,
                "SZ_Medication":None, "SZ_Outcome":None}

    doc = nlp_stz(text)

    disease, treatment, procedure, medication, outcome = [], [], [], [], []

    for ent in doc.ents:
        lab = ent.type.lower()

        if "problem" in lab:
            disease.append(ent.text)
        if "treatment" in lab:
            treatment.append(ent.text)
        if "test" in lab:
            procedure.append(ent.text)
        if "drug" in lab or "med" in lab:
            medication.append(ent.text)
        if "outcome" in lab:
            outcome.append(ent.text)

    return {
        "SZ_Disease": "; ".join(disease) if disease else None,
        "SZ_Treatment": "; ".join(treatment) if treatment else None,
        "SZ_Procedure": "; ".join(procedure) if procedure else None,
        "SZ_Medication": "; ".join(medication) if medication else None,
        "SZ_Outcome": "; ".join(outcome) if outcome else None
    }
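
Joining every span with `"; "` preserves all of Stanza's output, including repeated mentions of the same entity within a note. If repeats are not wanted, one option (an illustrative variation, not part of the original pipeline) is an order-preserving de-duplication before the join. `join_unique` is a hypothetical helper; it returns `None` for an empty list, matching the convention above.

```python
def join_unique(spans, sep="; "):
    """Join spans in first-seen order, dropping blanks and exact repeats."""
    cleaned = (s.strip() for s in spans if s and s.strip())
    unique = list(dict.fromkeys(cleaned))  # dict keys keep insertion order
    return sep.join(unique) if unique else None

joined = join_unique(["cough", "asthma", "cough", "doe", "cough"])
print(joined)  # cough; asthma; doe
```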


In [None]:
'''
This block measures the processing speed of the Stanza i2b2 NER pipeline by evaluating how
long it takes to analyse a sample of 100 cleaned clinical notes. A small subset is selected
to compute an average runtime per note, which is necessary for assessing scalability and
comparing Stanza's efficiency against ClinicalBERT.

The script records timestamps before and after processing the sample notes, allowing precise
calculation of Stanza's inference time per note. This average is then extrapolated to estimate
the total time required to process the full dataset of 40,000 discharge summaries. Runtime
profiling is an essential aspect of evaluating NLP tools for clinical use, as real-world
deployment depends not only on accuracy but also on computational cost and throughput.
'''

import time

print("\n⏱ Measuring Stanza speed on 100 notes...")

sample_texts = df_cb_40k["cleaned_text"].head(100).tolist()

start = time.time()
for t in sample_texts:
    _ = nlp_stz(t)
end = time.time()

stz_time_per_note = (end - start) / len(sample_texts)
print(f"Stanza → {stz_time_per_note:.4f} seconds per note")

# Estimate full 40k runtime
stz_total_minutes = (stz_time_per_note * 40000) / 60
print(f"Estimated runtime for 40,000 notes: {stz_total_minutes:.2f} minutes")



⏱ Measuring Stanza speed on 100 notes...
Stanza → 0.9028 seconds per note
Estimated runtime for 40,000 notes: 601.87 minutes


In [None]:
'''
This block applies the Stanza i2b2 NER model to all 40,000 cleaned discharge summaries and
stores the extracted entities in a structured output file. The tqdm progress bar is enabled
to provide real-time feedback during processing, which is important because Stanza runs in a
sequential manner and can take considerable time on a large dataset.

For each note, the extract_stanza() function is applied, producing a dictionary of entities
across five clinical categories (disease, treatment, procedure, medication, and outcome).
The results are collected into a list, converted into a DataFrame, and concatenated with the
existing df_cb_40k dataset, preserving original text, ClinicalBERT outputs, and now Stanza's
outputs in a single unified table.

Finally, the completed dataset is saved as “Stanza_40000.csv,” representing the full set of
Stanza-derived entity annotations for later comparison, evaluation, and analysis alongside
ClinicalBERT results.
'''

tqdm.pandas()

stanza_res = df_cb_40k["cleaned_text"].progress_apply(extract_stanza)

df_stanza_40k = pd.concat([df_cb_40k, pd.DataFrame(list(stanza_res))], axis=1)

df_stanza_40k.to_csv("/content/drive/MyDrive/Stanza_40000.csv", index=False)

print("🔥 Stanza extraction on all 40,000 summaries saved!")


100%|██████████| 40000/40000 [9:59:38<00:00,  1.11it/s]


🔥 Stanza extraction on all 40,000 summaries saved!


In [None]:
df_stz = pd.read_csv("/content/drive/MyDrive/Stanza_40000.csv")

print("Loaded:", df_stz.shape)
df_stz.head()


Loaded: (40000, 19)


Unnamed: 0,note_id,subject_id,hadm_id,note_type,note_seq,charttime,storetime,text,cleaned_text,CB_Disease,CB_Treatment,CB_Procedure,CB_Medication,CB_Outcome,SZ_Disease,SZ_Treatment,SZ_Procedure,SZ_Medication,SZ_Outcome
0,10202247-DS-15,10202247,28736349,DS,15,2173-11-11 00:00:00,2173-11-15 13:25:00,\nName: ___ Unit No: __...,name unit no admission date discharge date dat...,pseudocholinesterase defficiency in sister and...,laparoscopic cholecystectomy dietary modificat...,physical exam vital,,,abdominal pain; chronic pancreatitis; worsenin...,anticholinergics; invasive procedure; cholecys...,vitals; t; hr; bp; rr; o2 sat; ra.; physical e...,,
1,12784119-DS-19,12784119,27383409,DS,19,2196-11-13 00:00:00,2196-11-14 19:33:00,\nName: ___ Unit No: ___\n...,name unit no admission date discharge date dat...,bilateral pleuritic pain in the lateral ribs,trimethoprim,admission exam,,,cough; asthma; cough; doe; cough; thick green ...,invasive procedure; levofloxacin; benzonatate;...,initial vs; cxr; vbg; po2; vs; transfer; admis...,,
2,16314105-DS-3,16314105,27871553,DS,3,2141-05-10 00:00:00,2141-06-02 15:28:00,\nName: ___ Unit No: ___...,name unit no admission date discharge date dat...,partially cystic partially solid exophytic,major surgical or invasive procedure,trigger finger aortic reg,,transferred,left renal mass; a left kidney complex cystic ...,penicillins pseudoephedrine lisinopril; invasi...,blood; wbc; rbc; hgb; hct; mcv; mch; mchc; rdw...,,
3,16805731-DS-23,16805731,24081862,DS,23,2149-10-14 00:00:00,2149-10-14 15:27:00,\nName: ___. Unit No: ___\n...,name . unit no admission date discharge date d...,gastrointestinal disorders,intravenous pain medications,physical exam,,discharged home,known allergies; adverse drug reactions; ulcer...,ileostomy major surgical; invasive procedure i...,blood; wbc; rbc; hgb; hct; mcv; mch; mchc; rdw...,,
4,14334225-DS-10,14334225,29709912,DS,10,2154-09-26 00:00:00,2154-09-26 16:07:00,\nName: ___ Unit No: ___\n...,name unit no admission date discharge date dat...,significant epigastric pain,invasive procedure,physical exam,,,known allergies; vomiting; terrible epigastric...,invasive procedure; hospiatlization; any medic...,physical exam; glucose; urea n; creat; sodium;...,,


In [None]:
import numpy as np
import pandas as pd

# Replace empty values with NaN
df_stz_clean = df_stz.replace("", np.nan)

entity_cols_stanza = ["SZ_Disease", "SZ_Treatment", "SZ_Procedure", "SZ_Medication", "SZ_Outcome"]

summary_data_stanza = []

for col in entity_cols_stanza:
    extracted = df_stz_clean[col].notna().sum()
    missing = df_stz_clean[col].isna().sum()
    coverage = round((extracted / len(df_stz_clean)) * 100, 2)

    summary_data_stanza.append({
        "Entity Type": col.replace("SZ_", ""),
        "Extracted Count": extracted,
        "Missing Count": missing,
        "Coverage (%)": coverage
    })

stanza_summary = pd.DataFrame(summary_data_stanza)

print("\n====================================")
print("üìä STANZA Extraction Summary")
print("====================================")
print(stanza_summary)



📊 STANZA Extraction Summary
  Entity Type  Extracted Count  Missing Count  Coverage (%)
0     Disease            40000              0         100.0
1   Treatment            40000              0         100.0
2   Procedure            39918             82          99.8
3  Medication                0          40000           0.0
4     Outcome                0          40000           0.0
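
The coverage loop above counts a cell as extracted whenever it is not NaN, so a cell holding only spaces would still count. The following is a minimal sketch of a stricter count, assuming whitespace-only strings should be treated as missing; `coverage_table` and the toy frame are hypothetical names used only for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def coverage_table(frame, cols, prefix="SZ_"):
    """Per-column extraction coverage; NaN, "" and whitespace-only count as missing."""
    rows = []
    for col in cols:
        # A cell is "filled" only if it is non-null and non-blank after stripping.
        filled = frame[col].notna() & frame[col].astype(str).str.strip().ne("")
        extracted = int(filled.sum())
        label = col[len(prefix):] if col.startswith(prefix) else col
        rows.append({
            "Entity Type": label,
            "Extracted Count": extracted,
            "Missing Count": len(frame) - extracted,
            "Coverage (%)": round(extracted / len(frame) * 100, 2),
        })
    return pd.DataFrame(rows)

# Toy data (illustrative only, not taken from the MIMIC-IV notes).
toy = pd.DataFrame({
    "SZ_Disease": ["cough; asthma", "  ", None, "pain"],
    "SZ_Outcome": [np.nan, np.nan, np.nan, np.nan],
})
toy_summary = coverage_table(toy, ["SZ_Disease", "SZ_Outcome"])
print(toy_summary)
```

On the real data this could be called as `coverage_table(df_stz, entity_cols_stanza)`; whether the numbers differ from the table above depends on whether any blank strings are present.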


In [None]:
compare_cols = [
    "note_id",
    "cleaned_text",
    "CB_Disease", "SZ_Disease",
    "CB_Treatment", "SZ_Treatment",
    "CB_Procedure", "SZ_Procedure",
    "CB_Medication", "SZ_Medication",
    "CB_Outcome", "SZ_Outcome"
]

df_compare = df_stz[compare_cols].copy()  # df_stz is the Stanza output loaded above


In [None]:
df_compare.head(5)


Unnamed: 0,note_id,cleaned_text,CB_Disease,SZ_Disease,CB_Treatment,SZ_Treatment,CB_Procedure,SZ_Procedure,CB_Medication,SZ_Medication,CB_Outcome,SZ_Outcome
0,10202247-DS-15,name unit no admission date discharge date dat...,pseudocholinesterase defficiency in sister and...,abdominal pain; chronic pancreatitis; worsenin...,laparoscopic cholecystectomy dietary modificat...,anticholinergics; invasive procedure; cholecys...,physical exam vital,vitals; t; hr; bp; rr; o2 sat; ra.; physical e...,,,,
1,12784119-DS-19,name unit no admission date discharge date dat...,bilateral pleuritic pain in the lateral ribs,cough; asthma; cough; doe; cough; thick green ...,trimethoprim,invasive procedure; levofloxacin; benzonatate;...,admission exam,initial vs; cxr; vbg; po2; vs; transfer; admis...,,,,
2,16314105-DS-3,name unit no admission date discharge date dat...,partially cystic partially solid exophytic,left renal mass; a left kidney complex cystic ...,major surgical or invasive procedure,penicillins pseudoephedrine lisinopril; invasi...,trigger finger aortic reg,blood; wbc; rbc; hgb; hct; mcv; mch; mchc; rdw...,,,transferred,
3,16805731-DS-23,name . unit no admission date discharge date d...,gastrointestinal disorders,known allergies; adverse drug reactions; ulcer...,intravenous pain medications,ileostomy major surgical; invasive procedure i...,physical exam,blood; wbc; rbc; hgb; hct; mcv; mch; mchc; rdw...,,,discharged home,
4,14334225-DS-10,name unit no admission date discharge date dat...,significant epigastric pain,known allergies; vomiting; terrible epigastric...,invasive procedure,invasive procedure; hospiatlization; any medic...,physical exam,physical exam; glucose; urea n; creat; sodium;...,,,,


In [None]:
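# Draws 500 notes (not 5,000, despite the `_5k` suffix); the fixed seed keeps the sample reproducible.
df_compare_5k = df_compare.sample(500, random_state=42)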


In [None]:
import pandas as pd
import numpy as np

# ================================
# LOAD DATA
# ================================
df_pred = pd.read_csv("/content/Compare_500_sample.csv")
df_gold = pd.read_csv("/content/disease_treatment_extraction.csv")

# Merge on note_id
df = df_pred.merge(df_gold, on="note_id", how="inner")
print("Merged dataset shape:", df.shape)

# ================================
# CLEAN TEXT
# ================================
def clean(x):
    return "" if pd.isna(x) else str(x).strip().lower()

# ================================
# STRICT MATCH
# ================================
def strict_match(pred, gold):
    pred = clean(pred)
    gold = clean(gold)
    return int(pred == gold and gold != "")

def strict_correct(pred, gold):
    pred = clean(pred)
    gold = clean(gold)
    return int(pred == gold)

# ================================
# LENIENT MATCH
# ================================
def lenient_overlap(pred, gold):
    pred = clean(pred)
    gold = clean(gold)
    if pred == "" or gold == "":
        return 0
    return int(pred in gold or gold in pred)

def lenient_correct(pred, gold):
    pred = clean(pred)
    gold = clean(gold)
    if pred == "" and gold == "":
        return 1
    return lenient_overlap(pred, gold)

# ================================
# METRIC CALCULATION
# ================================
def evaluate(pred_col, gold_col, df):

    TP_strict = df.apply(lambda x: strict_match(x[pred_col], x[gold_col]), axis=1).sum()
    TP_lenient = df.apply(lambda x: lenient_overlap(x[pred_col], x[gold_col]), axis=1).sum()

    total_gold = (df[gold_col].apply(clean) != "").sum()
    total_pred = (df[pred_col].apply(clean) != "").sum()

    FN = total_gold - TP_strict
    FP = total_pred - TP_strict

    strict_accuracy = df.apply(lambda x: strict_correct(x[pred_col], x[gold_col]), axis=1).mean()
    lenient_accuracy = df.apply(lambda x: lenient_correct(x[pred_col], x[gold_col]), axis=1).mean()

    def compute(tp, fp, fn):
        precision = tp / (tp + fp + 1e-9)
        recall = tp / (tp + fn + 1e-9)
        f1 = 2 * precision * recall / (precision + recall + 1e-9)
        return precision, recall, f1

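    strict_scores = compute(TP_strict, FP, FN)
    # NOTE: FP and FN above are derived from TP_strict, so the lenient scores
    # below reuse the strict error counts rather than lenient-specific ones.
    lenient_scores = compute(TP_lenient, FP, FN)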

    return {
        "strict_precision": strict_scores[0],
        "strict_recall": strict_scores[1],
        "strict_f1": strict_scores[2],
        "strict_accuracy": strict_accuracy,
        "lenient_precision": lenient_scores[0],
        "lenient_recall": lenient_scores[1],
        "lenient_f1": lenient_scores[2],
        "lenient_accuracy": lenient_accuracy
    }

# ================================
# RUN FOR DISEASE + TREATMENT ONLY
# ================================
entities = {
    "Disease": ("CB_Disease", "SZ_Disease", "Diseases"),
    "Treatment": ("CB_Treatment", "SZ_Treatment", "Treatments")
}

results = {}

for ent, (cb_col, sz_col, gold_col) in entities.items():
    results[f"CB_{ent}"] = evaluate(cb_col, gold_col, df)
    results[f"SZ_{ent}"] = evaluate(sz_col, gold_col, df)

# ================================
# PRINT CLEAN RESULTS (PERCENT FORMAT)
# ================================
for key, val in results.items():
    print("\n==============================")
    print(key)

    print(f"Strict Precision: {val['strict_precision'] * 100:.2f}%")
    print(f"Strict Recall: {val['strict_recall'] * 100:.2f}%")
    print(f"Strict F1 Score: {val['strict_f1'] * 100:.2f}%")
    print(f"Strict Accuracy: {val['strict_accuracy'] * 100:.2f}%")

    print(f"Lenient Precision: {val['lenient_precision'] * 100:.2f}%")
    print(f"Lenient Recall: {val['lenient_recall'] * 100:.2f}%")
    print(f"Lenient F1 Score: {val['lenient_f1'] * 100:.2f}%")
    print(f"Lenient Accuracy: {val['lenient_accuracy'] * 100:.2f}%")


Merged dataset shape: (500, 14)

CB_Disease
Strict Precision: 0.40%
Strict Recall: 0.42%
Strict F1 Score: 0.41%
Strict Accuracy: 0.40%
Lenient Precision: 3.49%
Lenient Recall: 3.62%
Lenient F1 Score: 3.55%
Lenient Accuracy: 3.60%

SZ_Disease
Strict Precision: 0.00%
Strict Recall: 0.00%
Strict F1 Score: 0.00%
Strict Accuracy: 0.00%
Lenient Precision: 23.08%
Lenient Recall: 23.77%
Lenient F1 Score: 23.42%
Lenient Accuracy: 30.00%

CB_Treatment
Strict Precision: 0.00%
Strict Recall: 0.00%
Strict F1 Score: 0.00%
Strict Accuracy: 0.00%
Lenient Precision: 0.40%
Lenient Recall: 0.51%
Lenient F1 Score: 0.45%
Lenient Accuracy: 0.40%

SZ_Treatment
Strict Precision: 0.00%
Strict Recall: 0.00%
Strict F1 Score: 0.00%
Strict Accuracy: 0.00%
Lenient Precision: 8.93%
Lenient Recall: 11.24%
Lenient F1 Score: 9.95%
Lenient Accuracy: 9.80%
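
In `evaluate` above, the lenient precision and recall are computed from `TP_lenient` together with the `FP` and `FN` counts derived from `TP_strict`. A more conventional treatment derives the error counts from the same match criterion as the true positives. The sketch below shows that variant; `prf_from_counts` is a hypothetical helper, and the counts in the example are made up for illustration rather than taken from the results above.

```python
def prf_from_counts(tp, total_pred, total_gold):
    """Precision, recall and F1 where FP and FN come from the same TP count."""
    fp = total_pred - tp   # predictions that did not match under this criterion
    fn = total_gold - tp   # gold labels that were not matched under this criterion
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts only: 3 lenient matches, 10 non-empty predictions, 12 gold labels.
example_p, example_r, example_f1 = prf_from_counts(tp=3, total_pred=10, total_gold=12)
print(f"precision={example_p:.3f} recall={example_r:.3f} f1={example_f1:.3f}")
```

Inside `evaluate`, this would correspond to calling the helper once with `TP_strict` and once with `TP_lenient`, each time passing `total_pred` and `total_gold`.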


In [None]:
# Helper functions redefined from the evaluation cell
def clean(x):
    return "" if pd.isna(x) else str(x).strip().lower()

def strict_correct(pred, gold):
    return clean(pred) == clean(gold)

# Keep rows where the ClinicalBERT disease string exactly matches the gold label
cb_correct = df[df.apply(lambda x: strict_correct(x["CB_Disease"], x["Diseases"]), axis=1)]

print("Number of strict matches for ClinicalBERT Disease:", len(cb_correct))
cb_correct[["note_id", "Diseases", "CB_Disease", "cleaned_text"]].head(20)


Number of strict matches for ClinicalBERT Disease: 2


Unnamed: 0,note_id,Diseases,CB_Disease,cleaned_text
326,11065839-DS-7,intentional clonidine overdose,intentional clonidine overdose,name unit no admission date discharge date dat...
370,11063129-DS-8,left knee osteoarthritis,left knee osteoarthritis,name unit no admission date discharge date dat...
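
The `SZ_*` columns hold several entities joined by `;`, so comparing the whole cell against a gold string rarely gives an exact match. A possible alternative is to score at the entity level by splitting both sides into sets. The sketch below assumes the gold column can also be split on `;`, which the notebook does not confirm; `to_entity_set` and `entity_prf` are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

def to_entity_set(value, sep=";"):
    """Split a delimited cell into a set of lower-cased, stripped entity strings."""
    if pd.isna(value):
        return set()
    return {part.strip().lower() for part in str(value).split(sep) if part.strip()}

def entity_prf(pred, gold, sep=";"):
    """Set-based precision, recall and F1 for a single note."""
    pred_set = to_entity_set(pred, sep)
    gold_set = to_entity_set(gold, sep)
    tp = len(pred_set & gold_set)  # entities present on both sides
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative strings only; the gold value here is invented for the example.
note_p, note_r, note_f1 = entity_prf("cough; asthma; cough; doe", "Asthma; pneumonia")
print(note_p, note_r, note_f1)
```

Duplicates such as the repeated `cough` collapse in the set, so each distinct entity is counted once per note. Averaging these per-note scores or pooling the counts across notes are both reasonable; which is appropriate depends on how the gold annotations were produced.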

