# Mapping Clinical Trials to ChEBI

This notebook assess the impact of mapping between MeSH and ChEBI through the scope of clinical trial data from ClinicalTrials.gov. Note that this notebook is rather difficult to re-run due to the difficulty of downloading the clinical trials data in bulk.

In [1]:
import time
from collections import defaultdict

import gilda
import pandas
import pystow
from indra_cogex.sources.clinicaltrials import get_correct_mesh_id
from tqdm.auto import tqdm

from biomappings import load_mappings

In [2]:
print(time.asctime())

Tue Mar  7 14:28:57 2023


## Loading ClinicalTrials.gov data

In [3]:
df = pandas.read_csv(
    pystow.join("indra", "cogex", "clinicaltrials", name="clinical_trials.csv.gz"), skiprows=10
)
del df["Rank"]

n_trials = len(df.index)

# Note that each row corresponds to a unique NCT identifier
print(f"There are {n_trials:,} clinical trials.")

There are 422,767 clinical trials.


Fix errors in data due to incorrect encoding of MeSH identifiers (both syntax and mismatch with labels for interventions/conditions).

In [4]:
conditions = defaultdict(list)
missing_conditions = 0
interventions = defaultdict(list)
missing_interventions = 0

for row in tqdm(df.itertuples(), unit_scale=True, leave=False):
    if pandas.isna(row.ConditionMeshTerm):
        missing_conditions += 1
    else:
        for mesh_id, mesh_term in zip(
            row.ConditionMeshId.split("|"), row.ConditionMeshTerm.split("|")
        ):
            fixed_mesh_id = get_correct_mesh_id(mesh_id, mesh_term)
            if not fixed_mesh_id:
                continue

            conditions[row.NCTId].append(fixed_mesh_id)
    if pandas.isna(row.InterventionMeshTerm):
        missing_interventions += 1
    else:
        for mesh_id, mesh_term in zip(
            row.InterventionMeshId.split("|"), row.InterventionMeshTerm.split("|")
        ):
            fixed_mesh_id = get_correct_mesh_id(mesh_id, mesh_term)
            if not fixed_mesh_id:
                continue
            interventions[row.NCTId].append(fixed_mesh_id)

print(
    f"""\
{missing_conditions:,}/{n_trials:,} ({missing_conditions / n_trials:.1%}) trials are \
missing condition annotations.

{missing_interventions:,}/{n_trials:,} ({missing_interventions / n_trials:.1%}) \
trials are missing intervention annotations."""
)

0.00it [00:00, ?it/s]

72,277/422,767 (17.1%) trials are missing condition annotations.

280,537/422,767 (66.4%) trials are missing intervention annotations.


In [5]:
condition_to_trials = defaultdict(list)
for ncit_id, mesh_ids in conditions.items():
    for mesh_id in mesh_ids:
        condition_to_trials[mesh_id].append(ncit_id)
n_condition_annotations = sum(len(v) for v in conditions.values())


print(f"There are {len(condition_to_trials):,} unique conditions.")
print(f"There are {n_condition_annotations:,} annotations.")

There are 4,181 unique conditions.
There are 721,997 annotations.


In [6]:
intervention_to_trials = defaultdict(list)
for ncit_id, mesh_ids in interventions.items():
    for mesh_id in mesh_ids:
        intervention_to_trials[mesh_id].append(ncit_id)
n_intervention_annotations = sum(len(v) for v in conditions.values())

print(f"There are {len(intervention_to_trials):,} unique interventions")
print(f"There are {n_intervention_annotations:,} intervention annotations")

There are 3,614 unique interventions
There are 721,997 intervention annotations


## Loading Biomappings data

In [7]:
mesh_chebi_mappings = {}

for mapping in load_mappings():
    if mapping["source prefix"] == "mesh" and mapping["target prefix"] == "chebi":
        mesh_chebi_mappings[mapping["source identifier"]] = mapping["target identifier"]
    elif mapping["target prefix"] == "mesh" and mapping["source prefix"] == "chebi":
        mesh_chebi_mappings[mapping["target identifier"]] = mapping["source identifier"]

print(
    f"Biomappings contains {len(mesh_chebi_mappings):,} manually curated "
    "positive mappings between MeSH and ChEBI"
)

Biomappings contains 2,909 manually curated positive mappings between MeSH and ChEBI


## Analysis

In [8]:
absolute_distribution = []
relative_distribution = []
all_mappable = 0
some_mappable = 0
none_mappable = 0
n_trials = len(interventions)
unique_chemicals = set()
for _trial, mesh_ids in interventions.items():
    n_mappable = 0
    for mesh_id in mesh_ids:
        chebi_id = mesh_chebi_mappings.get(mesh_id)
        if chebi_id:
            n_mappable += 1
            unique_chemicals.add(chebi_id)

    if n_mappable == len(mesh_ids):
        all_mappable += 1
    elif n_mappable > 0:
        some_mappable += 1
    else:
        none_mappable += 1

print(
    f"""\
{all_mappable:,}/{n_trials:,} ({all_mappable / n_trials:.1%}) trials were fully mapped
{some_mappable:,}/{n_trials:,} ({some_mappable / n_trials:.1%}) trials were only partially mapped
{all_mappable + some_mappable:,}/{n_trials:,} ({(all_mappable + some_mappable) / n_trials:.1%}) trials were either partially or fully mapped
{none_mappable:,}/{n_trials:,} ({none_mappable / n_trials:.1%}) trials were unmapped
{len(unique_chemicals):,}/{len(mesh_chebi_mappings):,} ({len(unique_chemicals) / len(mesh_chebi_mappings):.1%}) ChEBI mappings were used
"""
)

66,690/142,213 (46.9%) trials were fully mapped
33,652/142,213 (23.7%) trials were only partially mapped
100,342/142,213 (70.6%) trials were either partially or fully mapped
41,871/142,213 (29.4%) trials were unmapped
995/2,909 (34.2%) ChEBI mappings were used



## Post-game Check

This isn't actually within the scope of Biomappings, but it's interesting to see that clinical trials that don't have MeSH annotations but still have string labels for interventions can be post facto grounded.

In [9]:
potential_df = df[pandas.isna(df.InterventionMeshTerm) & pandas.notna(df.InterventionName)]
potential_df = potential_df.InterventionName.str.lower().value_counts().to_frame().reset_index()
potential_df.head()

Unnamed: 0,index,InterventionName
0,no intervention,1480
1,exercise,559
2,questionnaire,479
3,blood sample,275
4,mri,275


In [10]:
gilda.ground_df(potential_df, "index")

In [11]:
potential_df[potential_df["index_grounded"].notna()].head(20)

Unnamed: 0,index,InterventionName,index_grounded
1,exercise,559,mesh:D015444
4,mri,275,hgnc:22432
6,acupuncture,230,mesh:D026881
11,data collection,188,mesh:D003625
12,observation,186,mesh:D019370
13,transcranial direct current stimulation,170,mesh:D065908
16,physical activity,152,mesh:D015444
18,intervention,126,efo:0002571
19,observational study,126,mesh:D064888
20,education,110,mesh:D004493
