# Introduction

This notebook extracts drug descriptions, targets and mechanisms of action (MoA) from the DrugBank XML file, then converts each drug's textual MoA description (together with its pharmacodynamics and general description) into one or more MoA class labels derived from MeSH terms. The labels are saved alongside supporting fields such as the drug's targets and a biologic flag.

[MeSH](https://www.ncbi.nlm.nih.gov/mesh/) (Medical Subject Headings) is the NLM controlled vocabulary thesaurus used for indexing articles for PubMed. MeSH categories for drugs are organized hierarchically under the main "Chemicals and Drugs" category. Within this category, there are main headings like "Drugs" and "Pharmaceutical Preparations," and these are further broken down by subheadings that describe specific aspects like administration and dosage, pharmacology, toxicity, or chemical synthesis.

The MoA classes used in this notebook, each mapped from a MeSH term or a custom heuristic, are:
- Protein Kinase Inhibitor
- Tyrosine Kinase Inhibitor
- GPCR Agonist
- GPCR Antagonist
- Ion Channel Blocker
- Ion Channel Opener
- Protease Inhibitor
- Enzyme Inhibitor (General)
- COX Inhibitor (NSAID)
- HMG-CoA Reductase Inhibitor (Statin)
- ACE Inhibitor
- ARB (Angiotensin Receptor Blocker)
- Glucocorticoid Receptor Agonist
- Estrogen Receptor Modulator
- DNA Synthesis Inhibitor / Antimetabolite
- Microtubule Inhibitor / Antimitotic
- Antibiotic (Antibacterial)
- Antiviral
- Immunosuppressant / Immunomodulator
- Monoclonal Antibody — Receptor Antagonist / Blocker
- Monoclonal Antibody — Ligand Neutralizer (e.g., cytokine neutralizer)
- Monoclonal Antibody — ADCC/CDC / Immune Effector (antibody with effector function)
- Recombinant Protein / Replacement Therapy (enzymes, hormones)
- Vaccine / Immunogen
- Peptide Therapeutic (non-replacement)
- ADC / Targeted Bioconjugate (antibody-drug conjugates)

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import xml.etree.ElementTree as ET
import re
import csv

!pip install scispacy spacy
import scispacy, spacy
from scispacy.umls_linking import UmlsEntityLinker

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

from pathlib import Path
# path = Path('drugbank')
path = Path('../input/drugbank-db-sept-2025')

/kaggle/input/drugbank-xlm/full database.xml
/kaggle/input/drugbank-db-sept-2025/drugbank_full_database.xml


# Loading the Data

This dataset comes from [Hugging Face Datasets](https://huggingface.co/datasets/agenticx/DrugBank/tree/main) and is an XML file containing the entire DrugBank database, last updated September 2025.

In [2]:
NS = {'db': 'http://www.drugbank.ca'}

tree = ET.parse('../input/drugbank-db-sept-2025/drugbank_full_database.xml')
root = tree.getroot()
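
`ET.parse` builds the whole tree in memory, which can be heavy for a large XML file. As a minimal sketch of a streaming alternative, the snippet below uses `ET.iterparse` on a tiny invented XML string. `iter_drugs`, `SAMPLE_XML` and the record contents are illustrative assumptions, not part of this notebook's pipeline.

```python
import io
import xml.etree.ElementTree as ET

DB = "{http://www.drugbank.ca}"

# Tiny stand-in for the DrugBank file (invented content, for illustration only)
SAMPLE_XML = b"""<drugbank xmlns="http://www.drugbank.ca">
  <drug><name>Drug A</name><mechanism-of-action>Inhibits X.</mechanism-of-action></drug>
  <drug><name>Drug B</name><mechanism-of-action>Activates Y.</mechanism-of-action></drug>
</drugbank>"""

def iter_drugs(source):
    """Yield (name, moa) for each top-level <drug>, clearing it afterwards to free memory."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # depth == 1 means we just closed a direct child of the root element.
        # The check guards against nested elements that might reuse the `drug` tag.
        if elem.tag == DB + "drug" and depth == 1:
            yield elem.findtext(DB + "name"), elem.findtext(DB + "mechanism-of-action")
            elem.clear()

drugs = list(iter_drugs(io.BytesIO(SAMPLE_XML)))
print(drugs)
```

The rest of this notebook keeps the simpler `ET.parse` approach; streaming only becomes worthwhile if memory is a constraint.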

Let's check to see that parsing the XML file has worked, and that the expected features are present in the data, as well as how many drugs are represented in the data:

In [3]:
# Print all features in xml
for elem in root[0]:
    print(elem)

# Print the number of drugs in the database
print(f"\nNumber of drugs in database: {len(list(root))}")

<Element '{http://www.drugbank.ca}drugbank-id' at 0x7878b69ad8f0>
<Element '{http://www.drugbank.ca}drugbank-id' at 0x7878b69ad940>
<Element '{http://www.drugbank.ca}drugbank-id' at 0x7878b69ad990>
<Element '{http://www.drugbank.ca}name' at 0x7878b69ada30>
<Element '{http://www.drugbank.ca}description' at 0x7878b69adad0>
<Element '{http://www.drugbank.ca}cas-number' at 0x7878b69adb70>
<Element '{http://www.drugbank.ca}unii' at 0x7878b69adc10>
<Element '{http://www.drugbank.ca}state' at 0x7878b69adcb0>
<Element '{http://www.drugbank.ca}groups' at 0x7878b69add50>
<Element '{http://www.drugbank.ca}general-references' at 0x7878b69adee0>
<Element '{http://www.drugbank.ca}synthesis-reference' at 0x7878b69af010>
<Element '{http://www.drugbank.ca}indication' at 0x7878b69af0b0>
<Element '{http://www.drugbank.ca}pharmacodynamics' at 0x7878b69af150>
<Element '{http://www.drugbank.ca}mechanism-of-action' at 0x7878b69af1f0>
<Element '{http://www.drugbank.ca}toxicity' at 0x7878b69af240>
<Element '{h

...And now let's look at the first example, to check it looks okay.

In [4]:
index = 0 # The example to show
features = ['name', 'mechanism-of-action', 'state', 'description', 'indication', 'dosages', 
            'classification', 'groups', 'strength', 'targets', 'pathways', 'synonyms', 'categories']
drug_example = {}
example_df = pd.DataFrame()

for feature in features:
    f_tag = root[index].find("{http://www.drugbank.ca}" + feature)
    drug_example[feature] = None if f_tag is None or f_tag.text is None or f_tag.text.strip() == '' else f_tag.text

example_df = pd.concat([example_df, pd.DataFrame([drug_example])], ignore_index=True)
example_df

| | name | mechanism-of-action | state | description | indication | dosages | classification | groups | strength | targets | pathways | synonyms | categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lepirudin | Lepirudin is a direct thrombin inhibitor used ... | solid | Lepirudin is a recombinant hirudin formed by 6... | Lepirudin is indicated for anticoagulation in ... | | | | | | | | |
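
Several columns in the example above are empty. The most likely reason is that elements such as `groups` or `targets` hold child elements rather than text, so their own `.text` is only whitespace. The sketch below shows how such nested values could be read instead; the record is invented and heavily trimmed, and the names `SAMPLE` and `container_text` are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"db": "http://www.drugbank.ca"}

# Hypothetical, heavily trimmed record (not real DrugBank content)
SAMPLE = """<drug xmlns="http://www.drugbank.ca">
  <name>Example Drug</name>
  <groups>
    <group>approved</group>
    <group>investigational</group>
  </groups>
  <synonyms>
    <synonym>Example synonym</synonym>
  </synonyms>
</drug>"""

drug = ET.fromstring(SAMPLE)

# A container element's own .text is just the whitespace before its first child
groups_elem = drug.find("db:groups", NS)
container_text = (groups_elem.text or "").strip()

# The actual values live one level down
groups = [g.text for g in drug.findall("db:groups/db:group", NS)]
synonyms = [s.text for s in drug.findall("db:synonyms/db:synonym", NS)]

print(repr(container_text), groups, synonyms)
```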


In [5]:
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz 
#import en_core_sci_lg

#from spacy import displacy
#!python -m spacy download en_core_sci_sm

!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz (14.8 MB)
  Preparing metadata (setup.py) ... done


# Inferring the MeSH Labels From Textual MoA Descriptions

Now that the data has been loaded, it is clear that there are no simple, succinct categories of drug MoA included in the data set. Therefore, in order to be able to predict drug category from textual descriptions (see my notebook [Drug MoA From Description Classification](https://www.kaggle.com/code/audlang/drug-moa-from-description-classification)), we need to infer category labels for each drug to form the target of later model training and testing.

To do this, we will use a combination of:
- Keyword / synonym matching
- Entity linking with a biomedical NLP model: named entity recognition (NER) followed by a similarity-scored lookup of each entity against UMLS concepts

## Loading and Configuring the NLP Model and UMLS Entity Linker

`en_core_sci_sm` is a small, general scientific English NLP model released as part of SciSpaCy (a scientific NLP extension to spaCy created by the Allen Institute for AI).

It is trained specifically on biomedical and scientific text, unlike spaCy’s standard English models, which are trained on news and web text. This makes it better at tokenizing and parsing biomedical wording, complex terminology, and scientific-style grammar.

In [6]:
#nlp = spacy.load("en_core_sci_md")
#nlp = en_core_sci_lg.load()
nlp = spacy.load("en_core_sci_sm")

  deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(  # type: ignore[union-attr]


On its own, the model only detects entity mentions: its NER component marks spans as generic entities, without typing them or connecting them to any external vocabulary. To link those mentions to other data sources, a **linker** must be attached to the model pipeline.

The UMLS linker links entities already recognized by an NER model or rule-based entity finder to UMLS concepts.

For each entity span (`ent`), the linker computes `ent._.umls_ents`, which is a list of tuples in the format: `[(CUI, score), (CUI, score), ...]`. These CUIs (Concept Unique Identifiers) are unique identifiers for medical and clinical concepts within the UMLS (Unified Medical Language System), and the `score` is a similarity score, indicating how similar the entity is to the CUI. The metadata for the most similar CUI can then be looked up. In this case, we are interested in categories representing the drug's MoA, and this information is contained in the `Pharmacologic Action` concept.
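
To make the `(CUI, score)` format concrete, here is a minimal sketch of picking the most similar concept for one entity and looking up its metadata. Everything in it is a stand-in: the `Concept` record, the `KB` dictionary, the CUIs, the scores and the `min_score` threshold are invented for illustration and do not come from the real UMLS knowledge base or the scispacy API.

```python
from typing import NamedTuple, Optional

class Concept(NamedTuple):
    """Hypothetical stand-in for a knowledge-base record."""
    cui: str
    name: str
    types: list

# Hypothetical mini knowledge base; CUIs, names and type codes are invented
KB = {
    "C0000001": Concept("C0000001", "Example Kinase Inhibitor", ["T121"]),
    "C0000002": Concept("C0000002", "Example Protein", ["T116"]),
}

def best_concept(candidates, kb, min_score=0.8) -> Optional[Concept]:
    """Return the highest-scoring candidate present in kb, or None if none reaches min_score."""
    for cui, score in sorted(candidates, key=lambda pair: pair[1], reverse=True):
        if score < min_score:
            break  # candidates are sorted, so nothing further can qualify
        concept = kb.get(cui)
        if concept is not None:
            return concept
    return None

# Same shape as the linker output described above: a list of (CUI, score) tuples
candidates = [("C0000002", 0.83), ("C0000001", 0.91), ("C9999999", 0.95)]
top = best_concept(candidates, KB)
print(top.name)
```

The highest-scoring candidate is skipped here because it is missing from the toy knowledge base, which is the same situation the `concept is None` check in `extract_mesh_actions` handles.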

Overall, the pipeline is structured as follows:
1. NER model (SciSpaCy) - Detects entities (what text refers to a thing)
2. UMLS Entity Linker - Converts detected entity to UMLS concept IDs
3. UMLS Knowledge Base - Provides metadata (preferred name, semantic type, pharmacological actions)

In the code below, we configure and load the UMLS entity linker, and add it to the NER model pipeline:

In [8]:
# Configure linker
from spacy.tokens import Span
# Register the custom attribute only once, so that re-running this cell does not raise E090
if not Span.has_extension('umls_ents'):
    Span.set_extension('umls_ents', default=[])

In [9]:
# Add UMLS entity linker to NER model pipeline
from spacy.language import Language

def get_ent_linker(nlp, name):
    return UmlsEntityLinker(resolve_abbreviations=True, max_entities_per_mention=10)

Language.factory("ent_linker", func=get_ent_linker)
# Keep a reference to the component: extract_mesh_actions() looks up concepts through `linker`
linker = nlp.add_pipe("ent_linker")

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


<scispacy.linking.EntityLinker at 0x787661f342d0>

## Defining the Set of Target Labels

Now we define the target MoA label set:

In [10]:
# Map MeSH terms → the controlled MoA label set
MOA_MAP = {
    "Receptor Tyrosine Kinase Inhibitors": "Tyrosine Kinase Inhibitor",
    "Protein Kinase Inhibitors": "Protein Kinase Inhibitor",
    "G-Protein-Coupled Receptor Antagonists": "GPCR Antagonist",
    "G-Protein-Coupled Receptor Agonists": "GPCR Agonist",
    "Ion Channel Blockers": "Ion Channel Blocker",
    "Ion Channel Agonists": "Ion Channel Opener",
    "Protease Inhibitors": "Protease Inhibitor",
    "Enzyme Inhibitors": "Enzyme Inhibitor",
    "Cyclooxygenase Inhibitors": "COX Inhibitor",
    "Immunosuppressive Agents": "Immunosuppressant",
    "Antiviral Agents": "Antiviral",
    "Anti-Bacterial Agents": "Antibiotic",
    "Antimetabolites": "DNA Synthesis Inhibitor",
    "Antimitotic Agents": "Microtubule Inhibitor",
    "Hydroxymethylglutaryl-CoA Reductase Inhibitors": "HMG-CoA Reductase Inhibitor",
    "Angiotensin-Converting Enzyme Inhibitors": "ACE Inhibitor",
    "Angiotensin Receptor Antagonists": "ARB (Angiotensin Receptor Blocker)",
    "Glucocorticoids": "Glucocorticoid Receptor Agonist",
    "Selective Estrogen Receptor Modulators": "Estrogen Receptor Modulator",
    # biologic-focused (source keys should be MeSH names or custom heuristics)
    "Monoclonal Antibodies": "Monoclonal Antibody — Receptor Antagonist / Blocker",
    "Antigen-Antibody Complexes": "Monoclonal Antibody — ADCC/CDC / Immune Effector",
    "Cytokine Inhibitors": "Monoclonal Antibody — Ligand Neutralizer",
    "Vaccines": "Vaccine / Immunogen",
    "Recombinant Proteins": "Recombinant Protein / Replacement Therapy",
    "Peptides": "Peptide Therapeutic",
    "Antibody-Drug Conjugates": "ADC / Targeted Bioconjugate"
}
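
Lookups in `MOA_MAP` are exact string matches, so a term that differs only in case or spacing would be missed. A minimal sketch of a more forgiving lookup follows; `SAMPLE_MAP` copies two entries from the dictionary above, while `normalize` and `map_terms` are hypothetical helpers that the rest of this notebook does not use.

```python
# Two entries copied from MOA_MAP above; the rest are omitted for brevity
SAMPLE_MAP = {
    "Protein Kinase Inhibitors": "Protein Kinase Inhibitor",
    "Antiviral Agents": "Antiviral",
}

def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so lookups ignore formatting differences."""
    return " ".join(term.lower().split())

NORMALIZED_MAP = {normalize(key): label for key, label in SAMPLE_MAP.items()}

def map_terms(terms):
    """Map discovered terms to labels, returning (sorted labels, unmapped terms)."""
    labels, unmapped = set(), []
    for term in terms:
        label = NORMALIZED_MAP.get(normalize(term))
        if label is None:
            unmapped.append(term)
        else:
            labels.add(label)
    return sorted(labels), unmapped

labels, unmapped = map_terms(["protein kinase  inhibitors", "Antiviral Agents", "Lepirudin"])
print(labels, unmapped)
```

Returning the unmapped terms as well makes it easy to see which discovered concepts never reach the label set.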

## Defining the Functions to Extract Target Label

...And define the functions that will be used to assign each drug's category.

In [12]:
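def is_biologic(drug_elem):
    """Return True if the record indicates a biologic / biotech drug - narrows down possible categories"""
    # Assumption: DrugBank records the molecule class in the <drug type="..."> attribute
    if (drug_elem.get('type') or '').lower() == 'biotech':
        return True
    groups = [g.text.lower() for g in drug_elem.findall('db:groups/db:group', NS) if g.text]
    keywords = ('biotech','biological','biologic','vaccine','antibody','fusion protein','recombinant')
    if any(keyword in gr for keyword in keywords for gr in groups):
        return True
    name = drug_elem.find('db:name', NS)
    # Guard against an empty <name>; also treat the "-mab" suffix of antibody names as a signal
    if name is not None and name.text and re.search(r'\b(monoclonal|antibody|recombinant|vaccine)\b|mab\b', name.text, re.I):
        return True
    return False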

def extract_text(drug_elem):
    '''
    Combines all textual descriptions (MoA, pharmacodynamics, description) that are 
    available in the dataset for a given drug into a single input feature (text).
    '''
    moa = drug_elem.find('db:mechanism-of-action', NS)
    pharm = drug_elem.find('db:pharmacodynamics', NS)
    desc = " ".join([x.text for x in (moa, pharm) if x is not None and x.text]).strip()
    # also include short description if present
    short = drug_elem.find('db:description', NS)
    if short is not None and short.text:
        desc = desc + " " + short.text
    return desc.strip()

def extract_targets(drug_elem):
    """Extract target info: name, gene name (from the polypeptide entry, if any) and actions"""
    targets=[]
    for t in drug_elem.findall('db:targets/db:target', NS):
        actions = [a.text for a in t.findall('db:actions/db:action', NS) if a.text]
        polypep = t.find('db:polypeptide', NS)
        gene = None
        name = t.find('db:name', NS)
        if polypep is not None:
            gene = polypep.find('db:gene-name', NS)
            gene = gene.text if gene is not None else None
        targets.append({'name': name.text if name is not None else None,'gene': gene,'actions': actions})
    return targets

def extract_mesh_actions(text):
    '''
    Leverages the NER model -> UMLS entity linker -> UMLS knowledge base pipeline to 
    discover MeSH labels from the extracted text (MoA + pharmacodynamics + description) 
    '''
    doc = nlp(text)
    actions=set()
    for ent in doc.ents:
        for cui,score in ent._.umls_ents:
            concept = linker.umls.cui_to_entity.get(cui)
            if concept is None: 
                continue
            # concept.types holds UMLS semantic type codes (TUIs), not type names:
            # T121 = Pharmacologic Substance, T109 = Organic Chemical
            if "T121" in concept.types or "T109" in concept.types:
                # scispacy knowledge-base entries expose the concept name as canonical_name
                actions.add(concept.canonical_name)
    return list(actions)


def heuristic_biologic_mapping(drug_elem, text, targets):
    """Apply biologic heuristics to derive labels"""
    labels = []
    # 1. Vaccine detection
    if re.search(r'\bvaccin(e|es|ation|ations|ic)\b', text, re.I):
        labels.append("Vaccine / Immunogen")
    # 2. ADC detection
    if re.search(r'\bADC\b|\bantibody-?drug conjugate\b', text, re.I):
        labels.append("ADC / Targeted Bioconjugate")
    # 3. Recombinant / replacement
    if re.search(r'\brecombinant\b|\breplacement therapy\b|\benhanced human\b', text, re.I):
        labels.append("Recombinant Protein / Replacement Therapy")
    # 4. Antibody patterns via targets
    for t in targets:
        # Join the action list once; `actions` is a plain string from here on
        actions = actions_to_string(t.get('actions'))
        gene = t.get('gene') or ""
        name = t.get('name') or ""
        # cytokine neutralizer
        if re.search(r'\b(IL|Interleukin|TNF|Tumor necrosis factor|VEGF|IFN)\b', gene + " " + name, re.I) or re.search(r'cytokine', text, re.I):
            if re.search(r'neutraliz|bind|block|sequester', actions + " " + text, re.I):
                labels.append("Monoclonal Antibody — Ligand Neutralizer")
        # receptor blocker
        if re.search(r'receptor', name + " " + actions, re.I) and re.search(r'antagonist|block|inhibit', actions + " " + text, re.I):
            labels.append("Monoclonal Antibody — Receptor Antagonist / Blocker")
    # effector: depends only on the text, so it is checked once, outside the target loop
    if re.search(r'ADCC|CDC|effector', text, re.I):
        labels.append("Monoclonal Antibody — ADCC/CDC / Immune Effector")
    return sorted(set(labels))

def actions_to_string(actions):
    """Join a list of action strings into one space-separated string ('' if empty or None)."""
    return " ".join(actions or [])
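
The target heuristics above rely on `\b` word boundaries, which interact with gene symbols in ways that are easy to overlook. The sketch below compares the strict pattern with a looser, hypothetical variant (`pattern_family`, not used in this notebook) on a few hand-picked strings.

```python
import re

# Same alternation style as the cytokine check, restricted to the short symbols
pattern_strict = re.compile(r"\b(IL|TNF|VEGF|IFN)\b", re.I)
# Hypothetical looser variant: allow letters, digits or hyphens after the family prefix
pattern_family = re.compile(r"\b(IL|TNF|VEGF|IFN)[-\w]*", re.I)

samples = ["IL6", "TNF", "VEGFA", "Interleukin-6", "ILK"]

# "IL6" and "VEGFA" fail the strict pattern: there is no word boundary
# between the prefix and the trailing digit or letter.
strict_hits = [s for s in samples if pattern_strict.search(s)]

# The looser pattern catches them, but also matches "ILK", which is not an interleukin.
family_hits = [s for s in samples if pattern_family.search(s)]

print(strict_hits)
print(family_hits)
```

Neither pattern is right for every case: the strict one favors precision, the loose one favors recall, so the choice depends on how the resulting labels will be used.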

## Applying the Label Discovery Process to the Dataset

Finally, we apply the MeSH discovery process to each drug in the database:
1. Combine all textual descriptions (MoA, pharmacodynamics, description) that are available in the dataset for a given drug into a single input feature (`text`). If the `text` feature is empty (because there is no info available for any of the above), skip this drug.
2. Extract target info: name, gene name and actions
3. Determine whether the drug is a biologic or not to narrow down class possibilities
4. Leverage the NER model -> UMLS entity linker -> UMLS knowledge base pipeline to discover MeSH labels from the extracted `text` feature (MoA + pharmacodynamics + description)
5. Map the discovered MeSH labels to the target MoA label set
6. If the drug is a biologic, additionally apply the set of heuristic mappings to search for keywords indicating a given MoA category
7. If all of the above fails, use a simple keyword search to look for patterns/words in the text that indicate unambiguous categories.
8. Output the results as a list of dictionaries, where each dict contains the following features for a single drug: `drugbank_id`, `name`, `is_biologic`, `mesh_actions`, `targets`, `labels`, `text`. 
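
The cascade in steps 4 to 7 can be sketched in isolation as follows. This is a simplified stand-in rather than the notebook's implementation: `MESH_TO_LABEL`, `KEYWORDS` and `biologic_heuristics` are trimmed-down, hypothetical versions of the real tables and rules, and the extra `source` value is added here only to show which stage produced the labels. As in the code cell that follows, the biologic heuristics run only when the drug is flagged as a biologic.

```python
import re

# Hypothetical, trimmed-down stand-ins for the notebook's mapping and keyword tables
MESH_TO_LABEL = {"Antiviral Agents": "Antiviral"}
KEYWORDS = {"protease": "Protease Inhibitor", "vaccine": "Vaccine / Immunogen"}

def biologic_heuristics(text):
    """Stand-in for the biologic rules: only a vaccine pattern is shown."""
    return ["Vaccine / Immunogen"] if re.search(r"\bvaccin", text, re.I) else []

def assign_labels(text, mesh_actions, is_biologic):
    """Return (labels, source), where source names the first stage that produced a label."""
    # Stage 1: direct mapping of discovered MeSH terms
    labels = [MESH_TO_LABEL[a] for a in mesh_actions if a in MESH_TO_LABEL]
    source = "mesh" if labels else None
    # Stage 2: biologic-specific heuristics
    if is_biologic:
        extra = biologic_heuristics(text)
        if extra:
            labels += extra
            source = source or "heuristic"
    # Stage 3: keyword fallback, only when nothing has been found so far
    if not labels:
        labels = [label for kw, label in KEYWORDS.items()
                  if re.search(r"\b" + re.escape(kw) + r"\b", text, re.I)]
        source = "keyword" if labels else "none"
    return sorted(set(labels)), source

print(assign_labels("Inhibits viral replication.", ["Antiviral Agents"], False))
print(assign_labels("A live attenuated vaccine.", [], True))
print(assign_labels("Binds the HIV protease active site.", [], False))
```

Recording the source of each label is optional, but it makes it easier to judge later how much of the training target comes from the high-precision mapping and how much from the fallback rules.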

In [13]:
# Iterate over all drugs
out_rows=[]

for drug in root.findall('db:drug', NS):
    drugbank_id = drug.find('db:drugbank-id[@primary="true"]', NS).text
    name = drug.find('db:name', NS).text
    text = extract_text(drug)
    if not text:
        continue
    targets = extract_targets(drug)
    biologic_flag = is_biologic(drug)
    mesh_actions = extract_mesh_actions(text)
    labels = []
    # map mesh actions first (high precision)
    for a in mesh_actions:
        if a in MOA_MAP:
            labels.append(MOA_MAP[a])
    # apply biologic heuristics (only for drugs flagged as biologics)
    if biologic_flag:
        labels += heuristic_biologic_mapping(drug, text, targets)
    # fallback: keyword rules
    if not labels:
        # simple keyword fallback for non-biologics (and biologics if nothing found)
        kws = {
            'kinase': 'Protein Kinase Inhibitor',
            'tyrosine kinase': 'Tyrosine Kinase Inhibitor',
            'gpcr': 'GPCR Antagonist',
            'antagonist': 'GPCR Antagonist',
            'agonist': 'GPCR Agonist',
            'protease': 'Protease Inhibitor',
            'antibiotic': 'Antibiotic',  # same spelling as the MOA_MAP value, so both routes yield one label
            'antiviral': 'Antiviral',
            'vaccine': 'Vaccine / Immunogen',
            'antibody': 'Monoclonal Antibody — Receptor Antagonist / Blocker'
        }
        for k,v in kws.items():
            if re.search(r'\b' + re.escape(k) + r'\b', text, re.I):
                labels.append(v)
    labels = sorted(set(labels))
    out_rows.append({
        'drugbank_id': drugbank_id,
        'name': name,
        'is_biologic': biologic_flag,
        'mesh_actions': ";".join(mesh_actions),
        'targets': ";".join([t.get('gene') or t.get('name') or "" for t in targets]),
        'labels': ";".join(labels),
        'text': text[:2000]  # truncate if needed
    })


The final output is saved as a CSV file, for later use in other notebooks.

In [14]:
# save CSV
keys = ['drugbank_id','name','is_biologic','mesh_actions','targets','labels','text']
with open('drugbank_moa_labels.csv','w',encoding='utf-8',newline='') as f:
    writer = csv.DictWriter(f, fieldnames=keys)
    writer.writeheader()
    for r in out_rows:
        writer.writerow(r)
print("Saved", len(out_rows), "rows to drugbank_moa_labels.csv")

Saved 10172 rows to drugbank_moa_labels.csv
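
Because `labels` and `targets` are stored as `;`-joined strings, a downstream notebook has to split them again. Here is a minimal sketch of reading the file back, using a few invented rows in the same column layout; `SAMPLE_CSV` and `split_field` are illustrative names, and the values are not real output.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the same layout as the saved file (values invented)
SAMPLE_CSV = """drugbank_id,name,is_biologic,mesh_actions,targets,labels,text
DB_A,Drug A,False,,GENE1,Antiviral;Protease Inhibitor,Some text
DB_B,Drug B,True,,GENE2;GENE3,Vaccine / Immunogen,Other text
DB_C,Drug C,False,,,,No label found
"""

def split_field(value):
    """Turn a ';'-joined field back into a list, treating '' as no values."""
    return [v for v in value.split(";") if v]

rows = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    row["labels"] = split_field(row["labels"])
    row["targets"] = split_field(row["targets"])
    # csv stores booleans as the strings "True" / "False"
    row["is_biologic"] = row["is_biologic"] == "True"
    rows.append(row)

label_counts = Counter(label for row in rows for label in row["labels"])
unlabelled = [row["name"] for row in rows if not row["labels"]]
print(label_counts, unlabelled)
```

Checking the label distribution and the number of unlabelled drugs in this way is a quick sanity check before the file is used as a training target.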


# Moving Forward: Future Uses of the MeSH Labelled Data

The MeSH labelled data will be used as input for the [Drug MoA From Description Classification](https://www.kaggle.com/code/audlang/drug-moa-from-description-classification-v2/edit) project, in which an NLP transformer model is trained to predict the MeSH MoA class of a drug from its textual description. The MeSH labels derived here will be the target feature for this project. 
