# Open Targets Cancer Gene-Disease-MeSH Pipeline Walkthrough

This notebook walks through each step of the data pipeline, showing the data transformations and key decisions.

In [1]:
!pip install polars


In [2]:
import polars as pl
from pathlib import Path

DATA_DIR = Path("../data")
pl.Config.set_tbl_rows(10)
pl.Config.set_tbl_cols(12)


## Step 1: Load the Open Targets Disease Index

Open Targets provides a disease index with ~23K diseases. Each disease has:
- `id`: Disease identifier (EFO/MONDO/Orphanet)
- `name`: Human-readable name
- `ancestors`: List of parent disease IDs in the ontology
- `dbXRefs`: Cross-references to other databases (including MeSH!)

In [None]:
# Load the disease index
diseases = pl.read_parquet(DATA_DIR / "opentargets/disease/disease.parquet")
print(f"Total diseases in Open Targets: {len(diseases):,}")
print(f"\nColumns: {diseases.columns}")
diseases.head()

## Step 2: Filter to Cancer Diseases Only

We filter diseases where `ancestors` contains **EFO_0000616** (the EFO ID for "neoplasm").

This gives us all cancer-related diseases in the ontology.

In [None]:
# Filter to cancer diseases (ancestors contain EFO_0000616 = neoplasm)
NEOPLASM_EFO_ID = "EFO_0000616"

cancer_diseases = diseases.filter(
    pl.col("ancestors").list.contains(NEOPLASM_EFO_ID)
)

print(f"Cancer diseases: {len(cancer_diseases):,} of {len(diseases):,} ({100*len(cancer_diseases)/len(diseases):.1f}%)")
cancer_diseases.select(["id", "name", "dbXRefs"]).head(10)

## Step 3: Extract MeSH IDs from dbXRefs

The `dbXRefs` field contains cross-references like:
- `MeSH:D009369` (MeSH descriptor)
- `UMLS:C0027651` (UMLS concept)
- `SNOMEDCT:363346000` (SNOMED CT)

We extract only the MeSH IDs. **Key finding**: Only ~18.5% of cancer diseases have MeSH mappings!

In [None]:
# Extract MeSH IDs from dbXRefs
cancer_with_mesh = cancer_diseases.with_columns(
    pl.col("dbXRefs")
    .list.eval(pl.element().filter(pl.element().str.starts_with("MeSH:")))
    .list.eval(pl.element().str.replace("MeSH:", ""))
    .alias("meshIds")
).filter(
    pl.col("meshIds").list.len() > 0
)

print(f"Cancer diseases WITH MeSH mappings: {len(cancer_with_mesh):,}")
print(f"Cancer diseases WITHOUT MeSH: {len(cancer_diseases) - len(cancer_with_mesh):,}")
print(f"Coverage: {100*len(cancer_with_mesh)/len(cancer_diseases):.1f}%")
print("\n⚠️  Coverage is low, likely because research ontologies (EFO/MONDO) are more granular than clinical vocabulary (MeSH)")

cancer_with_mesh.select(["id", "name", "meshIds"]).head(10)
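
For readers less familiar with `list.eval`, here is a plain-Python sketch of the same per-row logic. The helper name `extract_mesh_ids` is hypothetical and not part of the pipeline. The match is case-sensitive on the `MeSH:` prefix, as in the cell above; whether other spellings of the prefix occur in the data is not checked here.

```python
from typing import Optional

MESH_PREFIX = "MeSH:"


def extract_mesh_ids(xrefs: Optional[list[str]]) -> list[str]:
    """Return the MeSH IDs found in a list of 'SOURCE:ID' cross-references."""
    if not xrefs:
        # Missing or empty dbXRefs: no MeSH mapping for this disease.
        return []
    # Keep entries with the MeSH prefix and strip the prefix off.
    return [x[len(MESH_PREFIX):] for x in xrefs if x.startswith(MESH_PREFIX)]


example = extract_mesh_ids(["MeSH:D009369", "UMLS:C0027651", "SNOMEDCT:363346000"])
print(example)
```

A disease is kept in `cancer_with_mesh` only when this list is non-empty.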

## Step 4: Load MeSH Neoplasm Hierarchy (C04)

MeSH organizes neoplasms under **C04**. This pipeline focuses on its two main parallel branches:
- **C04.588**: By anatomical site (lung, breast, liver...)
- **C04.557**: By histologic type (carcinoma, sarcoma, adenoma...)

Each term has a `tree_number` showing its position and a `level` (depth in hierarchy).

In [None]:
# Load MeSH C04 hierarchy
mesh_hierarchy = pl.read_csv(DATA_DIR / "mesh/mesh_c04_complete.csv")
print(f"Total MeSH neoplasm terms: {len(mesh_hierarchy)}")
print(f"\nColumns: {mesh_hierarchy.columns}")
mesh_hierarchy.head(10)

In [None]:
# Show the two parallel hierarchies
print("=== C04.588: Neoplasms by ANATOMICAL SITE ===")
mesh_hierarchy.filter(
    pl.col("tree_number").str.starts_with("C04.588")
).filter(pl.col("level") <= 4).head(15)

In [None]:
print("=== C04.557: Neoplasms by HISTOLOGIC TYPE ===")
mesh_hierarchy.filter(
    pl.col("tree_number").str.starts_with("C04.557")
).filter(pl.col("level") <= 4).head(15)
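
The relationship between `tree_number` and `level` can be sketched as follows. This assumes `level` is the number of dot-separated segments in the tree number; the notebook does not show how the column was built, so treat that as an assumption. The helper names are hypothetical. `is_under` also shows a slightly stricter branch test than a bare prefix match, since it only matches on whole segments.

```python
def tree_level(tree_number: str) -> int:
    """Depth of a MeSH tree number, counted as dot-separated segments (assumed)."""
    return len(tree_number.split("."))


def is_under(tree_number: str, branch: str) -> bool:
    """True if tree_number is the branch itself or one of its descendants."""
    return tree_number == branch or tree_number.startswith(branch + ".")


# C04.588.894.797.520 is assumed here to be the tree number for "Lung Neoplasms".
lung = "C04.588.894.797.520"
print(tree_level(lung))
print(is_under(lung, "C04.588"), is_under(lung, "C04.557"))
```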

## Step 5: Load Gene-Disease Associations

Open Targets provides association scores between genes and diseases.

We use **"overall_direct"** associations (not indirect) because:
- **Direct**: Evidence explicitly links the gene to *that specific disease*
- **Indirect**: Also includes evidence propagated up from descendant diseases (a gene linked to "Breast Cancer" also counts for its ancestor "Neoplasms")

Direct is stricter and avoids inflated counts.
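
A toy illustration of why indirect counts are larger: the sketch below propagates gene sets up a three-term ontology. The disease names, gene names, and the set-union rule are all simplifications for illustration; real Open Targets scores aggregate evidence and are not plain set unions.

```python
# Toy ontology: each disease mapped to its ancestors (all names invented).
ancestors = {
    "breast carcinoma": ["breast cancer", "neoplasm"],
    "breast cancer": ["neoplasm"],
    "neoplasm": [],
}

# Direct associations: genes with evidence for that exact disease.
direct = {
    "breast carcinoma": {"GENE_A"},
    "breast cancer": {"GENE_B"},
    "neoplasm": set(),
}

# Indirect associations: direct genes plus genes propagated from descendants.
indirect = {disease: set(genes) for disease, genes in direct.items()}
for disease, genes in direct.items():
    for ancestor in ancestors[disease]:
        indirect[ancestor] |= genes

for disease in ancestors:
    print(disease, len(direct[disease]), len(indirect[disease]))
```

The root term has no direct genes but collects every gene from below, which is the inflation that the direct-only choice avoids.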

In [None]:
# Load gene-disease associations (direct only)
associations = pl.read_parquet(DATA_DIR / "opentargets/association_overall_direct/")
print(f"Total gene-disease associations: {len(associations):,}")
print(f"\nColumns: {associations.columns}")
associations.head(10)

In [None]:
# Filter to cancer diseases that have at least one MeSH mapping
cancer_disease_ids = cancer_with_mesh.select("id").to_series().to_list()
cancer_associations = associations.filter(
    pl.col("diseaseId").is_in(cancer_disease_ids)
)
print(f"Cancer gene-disease associations: {len(cancer_associations):,}")
print(f"Unique genes: {cancer_associations['targetId'].n_unique():,}")
print(f"Unique diseases: {cancer_associations['diseaseId'].n_unique():,}")
cancer_associations.head(10)

## Step 6: Build the Crosswalk (Disease → MeSH)

We explode the MeSH IDs (one disease can map to multiple MeSH terms) and join with the hierarchy to get tree numbers and levels.

In [None]:
# Explode meshIds to one row per disease-mesh pair
crosswalk = (
    cancer_with_mesh.select(["id", "name", "meshIds"])
    .explode("meshIds")
    .rename({"meshIds": "meshId", "id": "diseaseId", "name": "diseaseName"})
)
print(f"Disease-MeSH pairs: {len(crosswalk):,}")
crosswalk.head(10)

In [None]:
# Join with MeSH hierarchy to get tree numbers and levels
# Note: mesh_hierarchy uses 'mesh_id' column
crosswalk_with_hierarchy = crosswalk.join(
    mesh_hierarchy.select(["mesh_id", "mesh_name", "tree_number", "level"]),
    left_on="meshId",
    right_on="mesh_id",
    how="inner"
)
print(f"Crosswalk rows with hierarchy info: {len(crosswalk_with_hierarchy):,}")
crosswalk_with_hierarchy.head(10)

## Step 7: Final Join - Gene + Disease + MeSH

Join associations with the crosswalk to get the final dataset:
- Each row = one gene-disease pair with MeSH hierarchy info
- Multiple rows per gene-disease if disease maps to multiple MeSH terms

In [None]:
# Final join: associations + crosswalk
final_dataset = cancer_associations.join(
    crosswalk_with_hierarchy,
    on="diseaseId",
    how="inner"
)
print(f"Final dataset rows: {len(final_dataset):,}")
print(f"Unique genes: {final_dataset['targetId'].n_unique():,}")
print(f"Unique diseases: {final_dataset['diseaseId'].n_unique():,}")
print(f"Unique MeSH terms: {final_dataset['meshId'].n_unique():,}")
final_dataset.head(10)

## Step 8: Filter to Site-Only (C04.588)

For anatomical site analysis, filter to only the **C04.588** hierarchy.

This excludes histologic type classifications (C04.557) and gives cleaner organ-based groupings.

In [None]:
# Filter to anatomical site hierarchy only
site_only = final_dataset.filter(
    pl.col("tree_number").str.starts_with("C04.588")
)
print(f"Site-only rows: {len(site_only):,} ({100*len(site_only)/len(final_dataset):.1f}% of full dataset)")
print(f"Unique MeSH site terms: {site_only['meshId'].n_unique():,}")
site_only.head(10)

## Step 9: Granularity Analysis by MeSH Level

MeSH levels control granularity:
- **Level 3-4**: Broad ("Digestive System Neoplasms")
- **Level 5**: Clinical trial level ("Lung Neoplasms", "Prostatic Neoplasms") ← **Recommended**
- **Level 6+**: Research-specific ("Small Cell Lung Carcinoma")

In [None]:
# Summary by MeSH level
level_summary = final_dataset.group_by("level").agg([
    pl.len().alias("rows"),
    pl.col("targetId").n_unique().alias("unique_genes"),
    pl.col("diseaseId").n_unique().alias("unique_diseases"),
    pl.col("meshId").n_unique().alias("unique_mesh_terms"),
    pl.col("score").mean().alias("mean_score")
]).sort("level")

print("=== Coverage by MeSH Hierarchy Level ===")
level_summary

In [None]:
# Show example terms at each level (site hierarchy)
print("=== Example MeSH Terms by Level (C04.588 only) ===")
for level in range(3, 7):
    terms = mesh_hierarchy.filter(
        (pl.col("level") == level) & 
        (pl.col("tree_number").str.starts_with("C04.588"))
    ).select("mesh_name").head(5).to_series().to_list()
    print(f"\nLevel {level}: {', '.join(terms)}")
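
To group deeper terms at a chosen granularity, one option is to truncate each tree number to the target level. The sketch below is a hypothetical helper, not part of the pipeline, and again assumes that level equals the number of dot-separated segments. The trailing segment in the example is illustrative only.

```python
from typing import Optional


def roll_up(tree_number: str, level: int) -> Optional[str]:
    """Truncate a tree number to `level` segments; None if it is shallower."""
    parts = tree_number.split(".")
    if len(parts) < level:
        # Term sits above the requested level, so it has no ancestor there.
        return None
    return ".".join(parts[:level])


# A level-6 tree number rolled up to level 5 (trailing segment is illustrative).
deep = "C04.588.894.797.520.109"
print(roll_up(deep, 5))
print(roll_up("C04.588.274", 5))
```

Terms that return `None` are broader than the target level and would need separate handling.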

## Step 10: Load Pre-Built Outputs

The pipeline saves these files for downstream use:

In [None]:
# Load the pre-built main dataset
main_output = pl.read_parquet(DATA_DIR / "processed/cancer_gene_disease_mesh.parquet")
print(f"Pre-built main dataset: {len(main_output):,} rows")
main_output.head()

In [None]:
# Load site-only version
site_output = pl.read_parquet(DATA_DIR / "processed/cancer_gene_disease_mesh_site_only.parquet")
print(f"Pre-built site-only dataset: {len(site_output):,} rows")
site_output.head()

In [None]:
# Load crosswalk reference
crosswalk_ref = pl.read_csv(DATA_DIR / "processed/cancer_mesh_crosswalk.csv")
print(f"Disease-MeSH crosswalk: {len(crosswalk_ref):,} rows")
crosswalk_ref.head()

## Summary

### Key Numbers
- **3,395** cancer diseases in Open Targets
- **627** (18.5%) have MeSH mappings
- **1.38M** gene-disease-MeSH association rows
- **~20K** unique genes
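
As a quick sanity check, the coverage figure follows directly from the two counts above:

```python
cancer_total = 3395  # cancer diseases in Open Targets (from the list above)
with_mesh = 627      # cancer diseases with at least one MeSH mapping

coverage_pct = 100 * with_mesh / cancer_total
without_mesh = cancer_total - with_mesh

print(f"Coverage: {coverage_pct:.1f}%")
print(f"Without MeSH: {without_mesh:,}")
```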

### Key Decisions
1. **EFO_0000616** filters for neoplasm/cancer diseases
2. **dbXRefs** field provides MeSH IDs (no external crosswalk needed)
3. **Direct associations** only (not indirect/inherited)
4. **C04.588** for anatomical site analysis
5. **Level 5** recommended for clinical trial granularity