# 01 - Parse PDON & PMDO ontologies (Colab)

This notebook loads **PDON** and **PMDO** from files in your project folder, validates parsing, extracts basic statistics (classes/properties), and optionally exports Turtle versions for easier downstream work.

**Inputs** (place in `ontologies/`):
- `pdon.xrdf`
- `pmdo.xrdf`

**Outputs** (written to `output/`):
- `pdon.ttl` (optional)
- `pmdo.ttl` (optional)
- `ontologies_summary.csv`


## 0) Install dependencies

In [None]:
!pip -q install rdflib pandas owlrl

## 1) Project folders (Google Drive recommended)

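All files for this project live in one folder, so every notebook in the series reads the same inputs and writes to the same outputs. The cell below:

- Mounts Google Drive
- Creates the project folder if it does not exist
- Creates the standard subfolders (`ontologies/`, `mapping/`, `data/`, `output/`)

> If you prefer not to use Drive, set `USE_DRIVE = False`. The notebook then works in `/content`, which is ephemeral: its contents are lost when the Colab runtime is recycled.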


In [None]:
from pathlib import Path

USE_DRIVE = True  # set to False to work in /content only

if USE_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    PROJECT_DIR = Path('/content/drive/MyDrive/ppmi-ontology-alignment')
else:
    PROJECT_DIR = Path('/content/ppmi-ontology-alignment')

# Create project folder + standard subfolders
PROJECT_DIR.mkdir(parents=True, exist_ok=True)

ONTO = PROJECT_DIR / 'ontologies'
MAPS = PROJECT_DIR / 'mapping'
DATA = PROJECT_DIR / 'data'
OUT  = PROJECT_DIR / 'output'

for p in [ONTO, MAPS, DATA, OUT]:
    p.mkdir(parents=True, exist_ok=True)

print('PROJECT_DIR:', PROJECT_DIR)
print('ONTO:', ONTO)
print('MAPS:', MAPS)
print('DATA:', DATA)
print('OUT :', OUT)


## 2) Upload files (if needed)

If `pdon.xrdf` and `pmdo.xrdf` are already in `ontologies/`, you can skip this step.


In [None]:
from google.colab import files
import shutil

required = ['pdon.xrdf', 'pmdo.xrdf']
missing = [f for f in required if not (ONTO / f).exists()]
print('Missing:', missing)

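if missing:
    # files.upload() saves each file into the current working directory
    # (/content by default) and returns a {filename: bytes} dict.
    uploaded = files.upload()
    for fname in uploaded.keys():
        src = Path.cwd() / fname
        dst = ONTO / fname
        shutil.move(str(src), str(dst))
    print('Uploaded to:', ONTO)
    # Re-check, in case a file was uploaded under a different name
    still_missing = [f for f in required if not (ONTO / f).exists()]
    if still_missing:
        print('Still missing (check the file names):', still_missing)
else:
    print('All required ontology files are present in:', ONTO)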


## 3) Robust parsing with rdflib

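`.xrdf` is not a standard extension, so rdflib cannot infer the serialization from the file name. We try common RDF serializations in turn, starting with RDF/XML, and keep the first one that parses.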


In [None]:
from rdflib import Graph

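def load_graph(path: Path):
    """Parse `path` by trying several RDF serializations in turn.

    Returns (graph, format_used, failed_attempts).
    """
    tried = []
    # 'xml' is rdflib's RDF/XML parser; 'application/rdf+xml' names the same parser,
    # so it is not tried separately.
    for fmt in ['xml', 'turtle', 'n3', 'nt']:
        # Fresh graph per attempt: a failed parse can leave partial triples behind.
        g = Graph()
        try:
            g.parse(str(path), format=fmt)
            return g, fmt, tried
        except Exception as e:
            tried.append((fmt, str(e)[:220]))
    # Last attempt: let rdflib choose the format itself (raises if this fails too)
    g = Graph()
    g.parse(str(path))
    return g, 'auto', tried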

pdon_path = ONTO / 'pdon.xrdf'
pmdo_path = ONTO / 'pmdo.xrdf'

pdon_g, pdon_fmt, pdon_tried = load_graph(pdon_path)
pmdo_g, pmdo_fmt, pmdo_tried = load_graph(pmdo_path)

print('PDON triples:', len(pdon_g), 'format:', pdon_fmt)
print('PMDO triples:', len(pmdo_g), 'format:', pmdo_fmt)
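
Because the extension gives no hint, it can help to peek at the file content before choosing which parser to try first. The sketch below is a rough heuristic, not part of rdflib: `sniff_rdf_format` is a hypothetical helper, and the markers it looks for (`<?xml`, `<rdf:RDF`, `@prefix`, `@base`) are assumptions that cover only typical RDF/XML and Turtle files.

```python
from pathlib import Path
from typing import Optional
import tempfile

def sniff_rdf_format(path: Path, n_bytes: int = 4096) -> Optional[str]:
    """Guess an rdflib format name from the first bytes of a file (heuristic only)."""
    with path.open('rb') as fh:
        head = fh.read(n_bytes).decode('utf-8', errors='ignore')
    # Drop a byte-order mark and leading whitespace before inspecting the text
    lowered = head.lstrip('\ufeff \t\r\n').lower()
    if lowered.startswith('<?xml') or '<rdf:rdf' in lowered:
        return 'xml'
    if '@prefix' in lowered or '@base' in lowered:
        return 'turtle'
    return None  # unknown: fall back to trying every format

# Demo on a throwaway file with a non-standard extension
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.xrdf'
    sample.write_text(
        '<?xml version="1.0"?>\n'
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>\n',
        encoding='utf-8',
    )
    guessed = sniff_rdf_format(sample)

print('Guessed format:', guessed)
```

A guess like this is only useful for ordering the attempts; the parser remains the real check, so a wrong or missing guess should still fall through to the full list of formats.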


## 4) Extract basic ontology statistics

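We count OWL classes and properties, and extract ontology IRI(s) if present. The counts rely on explicit `rdf:type` declarations, so anonymous (blank-node) class expressions typed as `owl:Class` are included, and entities that are used but never declared are not.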


In [None]:
import pandas as pd
from rdflib.namespace import RDF, RDFS, OWL

def ontology_stats(g: Graph, name: str):
    ont_iris = list(g.subjects(RDF.type, OWL.Ontology))
    classes  = set(g.subjects(RDF.type, OWL.Class))
    objp     = set(g.subjects(RDF.type, OWL.ObjectProperty))
    datap    = set(g.subjects(RDF.type, OWL.DatatypeProperty))
    annp     = set(g.subjects(RDF.type, OWL.AnnotationProperty))
    return {
        'ontology': name,
        'triples': len(g),
        'ontology_iris': ';'.join(str(x) for x in ont_iris[:10]),
        'n_classes': len(classes),
        'n_object_properties': len(objp),
        'n_datatype_properties': len(datap),
        'n_annotation_properties': len(annp),
    }

stats = pd.DataFrame([
    ontology_stats(pdon_g, 'PDON'),
    ontology_stats(pmdo_g, 'PMDO'),
])

stats


## 5) Save summary to CSV

In [None]:
summary_csv = OUT / 'ontologies_summary.csv'
stats.to_csv(summary_csv, index=False)
print('Wrote:', summary_csv)


## 6) Quick label search helpers

Useful for locating candidate classes/properties by `rdfs:label`. The search is a case-insensitive substring match and stops after `limit` hits.


In [None]:
from rdflib.namespace import RDFS

def find_by_label(g: Graph, query: str, limit: int = 25):
    q = query.lower()
    hits = []
    for s, _, o in g.triples((None, RDFS.label, None)):
        if q in str(o).lower():
            hits.append((str(s), str(o)))
            if len(hits) >= limit:
                break
    return pd.DataFrame(hits, columns=['iri', 'label'])

# Example searches (edit freely)
find_by_label(pdon_g, 'parkinson', limit=10)


## 7) Optional: export Turtle versions

Turtle is generally easier to read and diff than RDF/XML, which can make later processing easier. The Turtle files are written to `output/`.


In [None]:
pdon_ttl = OUT / 'pdon.ttl'
pmdo_ttl = OUT / 'pmdo.ttl'

pdon_g.serialize(destination=str(pdon_ttl), format='turtle')
pmdo_g.serialize(destination=str(pmdo_ttl), format='turtle')

print('Wrote:', pdon_ttl)
print('Wrote:', pmdo_ttl)


## 8) Next notebook

Proceed to `02_build_bridge_ontology.ipynb` to create the **bridge ontology** with:
- `owl:imports` to PDON and PMDO
- minimal T-box for Subject/Visit/Observation
- mapping scaffolding for PPMI variables
