# Open Targets Cancer Gene-Disease-MeSH Pipeline

This notebook demonstrates the data pipeline that maps cancer gene-disease associations to MeSH vocabulary.

**Pipeline Steps:**
1. Extract cancer diseases from Open Targets (filter by EFO_0000616)
2. Extract the MeSH C04.588 (Neoplasms by Site) hierarchy directly from the raw d2025.bin file at run time
3. Build gene-MeSH crosswalk with Entrez IDs

**Final Output:** `gene_disease_mesh_final.tsv`
- disease_mesh_id
- gene_entrez_id
- ot_score
- evidence_count

In [17]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path("../data")
PROCESSED_DIR = DATA_DIR / "processed"

pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 200)

---
## Final Output

The final output is a four-column TSV used for patent matching:

In [19]:
final = pd.read_csv(PROCESSED_DIR / "gene_disease_mesh_final.tsv", sep='\t')

print(f"Shape: {final.shape}")
print(f"Columns: {list(final.columns)}")
print(f"\nUnique MeSH diseases: {final['disease_mesh_id'].nunique()}")
print(f"Unique Entrez genes: {final['gene_entrez_id'].nunique()}")

final.tail(30)

Shape: (171856, 4)
Columns: ['disease_mesh_id', 'gene_entrez_id', 'ot_score', 'evidence_count']

Unique MeSH diseases: 146
Unique Entrez genes: 19275


Unnamed: 0,disease_mesh_id,gene_entrez_id,ot_score,evidence_count
171826,D002289,100506374,0.001043,1
171827,D018246,10935,0.001043,1
171828,D002289,2109,0.00104,1
171829,D018246,1740,0.00104,1
171830,D000077192,54813,0.001039,1
171831,D002289,89876,0.001038,1
171832,D002289,9315,0.001033,1
171833,D021441,55731,0.001033,1
171834,D009303,3397,0.001031,1
171835,D002289,642819,0.00103,1


---
## Step 2a: Extract MeSH Hierarchy from d2025.bin

The pipeline extracts the C04.588 (Neoplasms by Site) branch directly from the raw MeSH 2025 file.

In [21]:
# Show raw MeSH file
from itertools import islice

mesh_path = DATA_DIR / "mesh" / "d2025.bin"
print(f"MeSH source: {mesh_path}")
print(f"Size: {mesh_path.stat().st_size / 1e6:.1f} MB")

# Show the first 30 lines of the file (the start of the first record)
print("\n=== Sample record ===")
with open(mesh_path, 'r', encoding='utf-8') as f:
    # islice stops after 30 lines without reading the rest of the file
    lines = [line.strip() for line in islice(f, 30)]
print('\n'.join(lines))

MeSH source: ../data/mesh/d2025.bin
Size: 31.4 MB

=== Sample record ===
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA AD AE AG AI AN BI BL CF CH CL CS EC HI IM IP ME PD PK PO RE SD ST TO TU UR
PRINT ENTRY = 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))-|T109|T195|NON|EQV|NLM (2024)|220922|abbcdef
ENTRY = A23187|T109|T195|LAB|NRW|UNK (19XX)|741111|abbcdef
ENTRY = A-23187|T109|T195|LAB|NRW|NLM (1991)|900308|abbcdef
ENTRY = Antibiotic A23187|T109|T195|NON|NRW|NLM (1991)|900308|abbcdef
ENTRY = A 23187
ENTRY = A23187, Antibiotic
MN = D02.355.291.933.125
MN = D02.540.576.625.125
MN = D03.633.100.221.173
MN = D04.345.241.654.125
MN = D04.345.674.625.125
PA = Anti-Bacterial Agents
PA = Calcium Ionophores
MH_TH = FDA SRS (2014)
MH_TH = NLM (1975)
ST = T109
ST = T195
RN = 37H9VM9WZL
RR = 52665-69-7 (Calcimycin)
PI = Antibiotics (1973-1974)
PI = 
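
The record layout above is line-oriented: `*NEWRECORD` starts a descriptor and each following line is a `KEY = value` pair, with keys such as `MN` repeating. As a rough sketch of how such records could be collected and filtered to tree numbers under `C04.588`, the code below uses two hypothetical helpers, `parse_mesh_records` and `tree_paths`. They are not the pipeline's `parse_mesh_file` or `extract_c04_hierarchy`, whose internals are not shown in this notebook. The sketch assumes each descriptor carries its ID in a `UI` field (not visible in the truncated sample) and that the level is the number of dot-separated segments in the tree number, which is consistent with `C04.588` being level 2.

```python
import io


def parse_mesh_records(handle):
    """Yield one dict per *NEWRECORD block; every field maps to a list of values."""
    record = None
    for raw in handle:
        line = raw.rstrip("\n")
        if line.strip() == "*NEWRECORD":
            if record:
                yield record
            record = {}
        elif record is not None and " = " in line:
            key, value = line.split(" = ", 1)
            record.setdefault(key.strip(), []).append(value.strip())
    if record:
        yield record


def tree_paths(records, prefix="C04.588"):
    """Return one row per tree number at or below `prefix`."""
    rows = []
    for rec in records:
        for mn in rec.get("MN", []):
            # Match the prefix itself or a true descendant (avoid e.g. "C04.5881").
            if mn == prefix or mn.startswith(prefix + "."):
                rows.append({
                    "mesh_id": rec.get("UI", [None])[0],    # assumed ID field
                    "mesh_name": rec.get("MH", [None])[0],
                    "tree_number": mn,
                    "level": mn.count(".") + 1,             # assumed level definition
                })
    return rows


# Toy input in the same layout as the sample record (abbreviated fields).
SAMPLE = """*NEWRECORD
RECTYPE = D
MH = Abdominal Neoplasms
MN = C04.588.033
UI = D000008

*NEWRECORD
RECTYPE = D
MH = Calcimycin
MN = D02.355.291.933.125
MN = D02.540.576.625.125
UI = D000001
"""

rows = tree_paths(parse_mesh_records(io.StringIO(SAMPLE)))
print(rows)
```

Because a descriptor can have several `MN` lines, this approach emits one row per tree path rather than one per term, which is why a path count can exceed the unique-term count.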

In [22]:
# Extract C04.588 hierarchy (this is what the pipeline does live)
import sys
sys.path.insert(0, '..')
from src.pipeline.extract_mesh import parse_mesh_file, extract_c04_hierarchy

records = parse_mesh_file(mesh_path)
print(f"Total MeSH descriptors: {len(records):,}")

# Extract C04.588 (site only)
site_hierarchy = extract_c04_hierarchy(records, prefix="C04.588")
print(f"\nC04.588 (Neoplasms by Site):")
print(f"  Tree paths: {len(site_hierarchy)}")
print(f"  Unique terms: {site_hierarchy['mesh_id'].nunique()}")
print(f"  Levels: {site_hierarchy['level'].min()}-{site_hierarchy['level'].max()}")

site_hierarchy.head(15)

Total MeSH descriptors: 30,954

C04.588 (Neoplasms by Site):
  Tree paths: 271
  Unique terms: 236
  Levels: 2-9


Unnamed: 0,mesh_id,mesh_name,tree_number,level
0,D009371,Neoplasms by Site,C04.588,2
1,D000008,Abdominal Neoplasms,C04.588.033,3
2,D010534,Peritoneal Neoplasms,C04.588.033.513,4
3,D012186,Retroperitoneal Neoplasms,C04.588.033.731,4
4,D058288,Sister Mary Joseph's Nodule,C04.588.033.740,4
5,D000694,Anal Gland Neoplasms,C04.588.083,3
6,D001859,Bone Neoplasms,C04.588.149,3
7,D050398,Adamantinoma,C04.588.149.030,4
8,D005266,Femoral Neoplasms,C04.588.149.276,4
9,D012888,Skull Neoplasms,C04.588.149.721,4


In [23]:
# Level distribution
print("=== MeSH Level Distribution ===")
print(site_hierarchy['level'].value_counts().sort_index())

print("\n=== Sample Terms by Level ===")
for level in range(2, 7):
    terms = site_hierarchy[site_hierarchy['level'] == level]['mesh_name'].head(3).tolist()
    print(f"Level {level}: {', '.join(terms)}")

=== MeSH Level Distribution ===
2     1
3    17
4    67
5    90
6    65
7    20
8    10
9     1
Name: level, dtype: int64

=== Sample Terms by Level ===
Level 2: Neoplasms by Site
Level 3: Abdominal Neoplasms, Anal Gland Neoplasms, Bone Neoplasms
Level 4: Peritoneal Neoplasms, Retroperitoneal Neoplasms, Sister Mary Joseph's Nodule
Level 5: Jaw Neoplasms, Nose Neoplasms, Orbital Neoplasms
Level 6: Mandibular Neoplasms, Maxillary Neoplasms, Palatal Neoplasms


---
## Step 1: Cancer Diseases from Open Targets

Filter diseases where `ancestors` contains EFO_0000616 (neoplasm), then extract MeSH IDs from `dbXRefs`.
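
The filter described above can be sketched on a toy table. This is only an illustration, not the pipeline's code: the input column names `id` and `name`, the `MeSH:` prefix on cross-references, and the helper `extract_mesh_ids` are assumptions, and the `TOY_` and `TOY:` values are placeholders.

```python
import pandas as pd

NEOPLASM_ID = "EFO_0000616"

# Toy stand-in for the Open Targets disease table (assumed column names).
toy = pd.DataFrame({
    "id": ["EFO_0000756", "MONDO_0000371", "TOY_0000001"],
    "name": ["melanoma", "oral cavity carcinoma in situ", "toy non-cancer disease"],
    "ancestors": [[NEOPLASM_ID, "TOY_0000009"], [NEOPLASM_ID], ["TOY_0000009"]],
    "dbXRefs": [["MeSH:D008545", "TOY:123"], ["TOY:456"], ["MeSH:D000000"]],
})


def extract_mesh_ids(xrefs):
    """Return the MeSH IDs found in a list of 'SOURCE:ID' strings, or None."""
    if xrefs is None:
        return None
    ids = [x.split(":", 1)[1] for x in xrefs if x.upper().startswith("MESH:")]
    return ids or None


# Keep rows whose ancestor list contains the neoplasm term.
is_cancer = toy["ancestors"].apply(lambda anc: NEOPLASM_ID in anc)
cancer = (
    toy.loc[is_cancer, ["id", "name", "dbXRefs"]]
    .rename(columns={"id": "diseaseId", "name": "diseaseName"})
)
cancer["meshIds"] = cancer.pop("dbXRefs").apply(extract_mesh_ids)
print(cancer)
```

Diseases with no MeSH cross-reference keep a missing `meshIds` value instead of an empty list, so `notna()` can be used to count coverage.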

In [26]:
diseases = pd.read_parquet(PROCESSED_DIR / "intermediate" / "cancer_diseases_mesh_crosswalk.parquet")
diseases

Unnamed: 0,diseaseId,diseaseName,meshIds
0,HP_0100698,Subcutaneous neurofibroma,
1,MONDO_0000148,"pulmonary fibrosis and/or bone marrow failure,...",
2,MONDO_0000271,tuberculous salpingitis,
3,MONDO_0000371,oral cavity carcinoma in situ,
4,MONDO_0000372,pharynx carcinoma in situ,
...,...,...,...
3390,Orphanet_93178,Partial prune belly syndrome,
3391,Orphanet_94095,Spondylocostal dysostosis - anal and genitouri...,
3392,Orphanet_98086,"46,XY disorder of sex development due to a def...",
3393,Orphanet_98087,"Syndrome with 46,XY disorder of sex development",


In [27]:

total = len(diseases)
with_mesh = diseases['meshIds'].notna().sum()

print("=== CANCER DISEASES ===")
print(f"Total: {total:,}")
print(f"With MeSH: {with_mesh:,} ({with_mesh/total*100:.1f}%)")
print(f"Without MeSH: {total - with_mesh:,}")

diseases[diseases['meshIds'].notna()].head(10)

=== CANCER DISEASES ===
Total: 3,395
With MeSH: 627 (18.5%)
Without MeSH: 2,768


Unnamed: 0,diseaseId,diseaseName,meshIds
18,MONDO_0000430,mature T-cell and NK-cell non-Hodgkin lymphoma,[D016411]
21,MONDO_0000502,villous adenoma,[D018253]
78,MONDO_0000920,duodenum cancer,[D004379]
102,MONDO_0001023,prolymphocytic leukemia,[D015463]
121,MONDO_0001256,arteriovenous hemangioma/malformation,[D001165]
126,MONDO_0001340,heart cancer,[D006338]
136,MONDO_0001398,ureter benign neoplasm,[D014516]
139,MONDO_0001402,vaginal cancer,[D014625]
152,MONDO_0001528,vulva cancer,[D014846]
155,MONDO_0001569,acoustic neuroma,[D009464]


---
## Step 2b: Disease → MeSH Crosswalk

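Join the cancer diseases that have a MeSH ID to the C04.588 hierarchy on that ID. Only diseases whose MeSH term falls under Neoplasms by Site remain in the crosswalk.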
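
A minimal sketch of such a join, using toy frames with the same column names as the tables in this notebook. The choice to keep a single row per disease (the shallowest tree path) is an assumption made for illustration; how the pipeline resolves terms with several tree paths is not shown here. The names `toy_diseases`, `toy_hierarchy`, and `pairs` are hypothetical.

```python
import pandas as pd

toy_diseases = pd.DataFrame({
    "diseaseId": ["EFO_0000756", "EFO_0000182", "MONDO_0000430"],
    "diseaseName": [
        "melanoma",
        "hepatocellular carcinoma",
        "mature T-cell and NK-cell non-Hodgkin lymphoma",
    ],
    "meshIds": [["D008545"], ["D006528"], ["D016411"]],
})

# Toy hierarchy with only two terms; D016411 is deliberately absent.
toy_hierarchy = pd.DataFrame({
    "mesh_id": ["D008545", "D006528"],
    "mesh_name": ["Melanoma", "Carcinoma, Hepatocellular"],
    "tree_number": ["C04.588.805.377", "C04.588.274.623.160"],
    "level": [4, 5],
})

pairs = (
    toy_diseases.dropna(subset=["meshIds"])
    .explode("meshIds")                       # one row per (disease, MeSH ID)
    .rename(columns={"meshIds": "meshId"})
    .merge(toy_hierarchy, left_on="meshId", right_on="mesh_id", how="inner")
    .sort_values(["diseaseId", "level"])
    .drop_duplicates("diseaseId")             # assumed: keep the shallowest path
    .drop(columns="mesh_id")
    .reset_index(drop=True)
)
print(pairs)
```

The inner join is what drops diseases whose MeSH term lies outside the extracted branch.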

In [14]:
crosswalk = pd.read_csv(PROCESSED_DIR / "crosswalks" / "disease_mesh_crosswalk.csv")

print("=== DISEASE → MeSH CROSSWALK ===")
print(f"Pairs: {len(crosswalk):,}")
print(f"Diseases: {crosswalk['diseaseId'].nunique()}")
print(f"MeSH terms: {crosswalk['meshId'].nunique()}")

crosswalk.head(10)

=== DISEASE → MeSH CROSSWALK ===
Pairs: 181
Diseases: 181
MeSH terms: 158


Unnamed: 0,diseaseId,diseaseName,meshId,mesh_name,tree_number,level
0,EFO_0000181,head and neck squamous cell carcinoma,D000077195,Squamous Cell Carcinoma of Head and Neck,C04.588.443.177,4
1,EFO_0000182,hepatocellular carcinoma,D006528,"Carcinoma, Hepatocellular",C04.588.274.623.160,5
2,EFO_0000308,bronchoalveolar adenocarcinoma,D002282,"Adenocarcinoma, Bronchiolo-Alveolar",C04.588.894.797.520.055.500,7
3,EFO_0000326,central nervous system cancer,D016543,Central Nervous System Neoplasms,C04.588.614.250,4
4,EFO_0000571,lung adenocarcinoma,D000077192,Adenocarcinoma of Lung,C04.588.894.797.520.055,6
5,EFO_0000681,renal cell carcinoma,D002292,"Carcinoma, Renal Cell",C04.588.945.947.535.160,6
6,EFO_0000702,small cell lung carcinoma,D055752,Small Cell Lung Carcinoma,C04.588.894.797.520.109.220.624,8
7,EFO_0000756,melanoma,D008545,Melanoma,C04.588.805.377,4
8,EFO_0000762,hepatocellular adenoma,D018248,"Adenoma, Liver Cell",C04.588.274.623.040,5
9,EFO_0002429,polycythemia vera,D011087,Polycythemia Vera,C04.588.448.200.500,5


---
## Step 3: Ensembl → Entrez Gene Mapping

Map Ensembl Gene IDs to Entrez Gene IDs using NCBI gene2ensembl.
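
As a hedged sketch of how a mapping table like this could be derived, the snippet below reads a small in-memory sample laid out like the tab-separated NCBI `gene2ensembl` file, keeps human rows (taxonomy ID 9606), and deduplicates gene-level pairs. The sample rows are toy data (the non-human row is a placeholder), and the output column names simply mirror the table shown in this notebook; the pipeline's actual script is not shown here.

```python
import io

import pandas as pd

# Toy sample in the gene2ensembl column layout; "-" marks unused fields.
GENE2ENSEMBL_SAMPLE = (
    "#tax_id\tGeneID\tEnsembl_gene_identifier\tRNA_nucleotide_accession.version\t"
    "Ensembl_rna_identifier\tprotein_accession.version\tEnsembl_protein_identifier\n"
    "9606\t1\tENSG00000121410\t-\t-\t-\t-\n"
    "9606\t1\tENSG00000121410\t-\t-\t-\t-\n"   # same gene, another transcript row
    "9606\t2\tENSG00000175899\t-\t-\t-\t-\n"
    "10090\t0\tTOY_NONHUMAN_GENE\t-\t-\t-\t-\n"  # placeholder non-human row
)

raw = pd.read_csv(
    io.StringIO(GENE2ENSEMBL_SAMPLE),
    sep="\t",
    usecols=["#tax_id", "GeneID", "Ensembl_gene_identifier"],
)

human = raw[raw["#tax_id"] == 9606]
ensembl_entrez = (
    human.rename(columns={
        "GeneID": "entrezGeneId",
        "Ensembl_gene_identifier": "ensemblGeneId",
    })[["entrezGeneId", "ensemblGeneId"]]
    .drop_duplicates()          # the file has one row per transcript/protein
    .reset_index(drop=True)
)
print(ensembl_entrez)
```

Deduplicating matters because the source file repeats each gene pair once per transcript, which would otherwise inflate downstream joins.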

In [15]:
entrez = pd.read_csv(PROCESSED_DIR / "crosswalks" / "ensembl_entrez.csv")

print("=== ENSEMBL → ENTREZ ===")
print(f"Human gene mappings: {len(entrez):,}")
print(f"Source: NCBI gene2ensembl")

entrez.head(10)

=== ENSEMBL → ENTREZ ===
Human gene mappings: 38,278
Source: NCBI gene2ensembl


Unnamed: 0,entrezGeneId,ensemblGeneId
0,1,ENSG00000121410
1,2,ENSG00000175899
2,9,ENSG00000171428
3,10,ENSG00000156006
4,12,ENSG00000196136
5,13,ENSG00000114771
6,14,ENSG00000127837
7,15,ENSG00000129673
8,16,ENSG00000090861
9,18,ENSG00000183044


---
## Analysis: Top Cancer Sites by Gene Count

In [16]:
# Get MeSH names for the final data
mesh_names = site_hierarchy[['mesh_id', 'mesh_name']].drop_duplicates()

# Named aggregation keeps each output column next to the function that produces it
top_sites = (
    final.groupby('disease_mesh_id')
    .agg(
        unique_genes=('gene_entrez_id', 'nunique'),
        mean_score=('ot_score', 'mean'),
        total_evidence=('evidence_count', 'sum'),
    )
    .reset_index()
    .rename(columns={'disease_mesh_id': 'mesh_id'})
)

top_sites = top_sites.merge(mesh_names, on='mesh_id', how='left')
top_sites = top_sites.sort_values('unique_genes', ascending=False)

print("=== TOP 20 CANCER SITES BY GENE COUNT ===")
top_sites[['mesh_id', 'mesh_name', 'unique_genes', 'mean_score', 'total_evidence']].head(20)

=== TOP 20 CANCER SITES BY GENE COUNT ===


Unnamed: 0,mesh_id,mesh_name,unique_genes,mean_score,total_evidence
38,D006528,"Carcinoma, Hepatocellular",13257,0.054394,403994
18,D002289,"Carcinoma, Non-Small-Cell Lung",10408,0.052866,297025
1,D000077192,Adenocarcinoma of Lung,10204,0.044312,106594
67,D010051,Ovarian Neoplasms,9665,0.055605,168383
78,D011471,Prostatic Neoplasms,9471,0.03695,275620
57,D008545,Melanoma,9152,0.048467,265669
145,D064726,Triple Negative Breast Neoplasms,7396,0.02985,132748
2,D000077195,Squamous Cell Carcinoma of Head and Neck,6680,0.048315,72347
4,D000077277,Esophageal Squamous Cell Carcinoma,6544,0.05438,53899
14,D001943,Breast Neoplasms,6013,0.023041,58200


---
## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|----------|
| MeSH source | d2025.bin (live extraction) | Reproducible, no stale CSV |
| MeSH branch | C04.588 (site only) | Anatomical classification, not histologic |
| Disease source | OT dbXRefs only | Curated mappings, no external crosswalks |
| Gene IDs | Entrez (from gene2ensembl) | Required for patent matching |
| Aggregation | MAX score, SUM evidence | Collapse diseases → MeSH terms |
| Coverage | 18.5% of OT cancer diseases (627 of 3,395) have a MeSH ID | Expected: research vs. clinical vocabulary |
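
The aggregation row in the table can be illustrated with a small sketch. It assumes the collapse happens per (MeSH term, gene) pair, so that several Open Targets diseases mapping to one MeSH term contribute their maximum score and their summed evidence counts; the grouping key and the toy scores are assumptions, not values taken from the pipeline.

```python
import pandas as pd

# Toy associations after diseases have been mapped to MeSH terms.
# The first two rows stand for two different diseases that share one MeSH term.
associations = pd.DataFrame({
    "disease_mesh_id": ["D002289", "D002289", "D002289", "D008545"],
    "gene_entrez_id": [2109, 2109, 9315, 2109],
    "ot_score": [0.20, 0.50, 0.10, 0.30],
    "evidence_count": [3, 4, 1, 2],
})

collapsed = (
    associations
    .groupby(["disease_mesh_id", "gene_entrez_id"], as_index=False)
    .agg(
        ot_score=("ot_score", "max"),              # strongest association wins
        evidence_count=("evidence_count", "sum"),  # evidence accumulates
    )
)
print(collapsed)
```

Taking the maximum avoids diluting a strong association with weaker ones from sibling diseases, while summing evidence keeps the total support visible.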