# Standardizing Clinical Symptoms of Rare Disease with Human Phenotype Ontology (HPO) in Python

## Background

In literature reviews and evidence synthesis for rare diseases, clinical symptoms are often reported in non-standardized ways, making it difficult to compare or merge them computationally.

This real-world challenge motivated us to develop a data solution for standardizing free-text symptom reports. Using the open-source Human Phenotype Ontology (HPO) and its API, we can map reported symptoms to controlled ontology terms, with the process automated in Python.

We streamlined the workflow into several steps: retrieving a candidate HPO term, linking it to its ID, definition, and synonyms, and applying fuzzy matching to measure the similarity between the reported symptom and the retrieved HPO term. This notebook demonstrates the pipeline with minimal documentation as a reference for the community.

The workflow has been tested in a real-world project on congenital myasthenic syndromes (CMS). While effective, there remain opportunities to refine and expand the approach.


## Algorithm explained

**Goal**
Standardize free-text clinical symptoms by mapping them to HPO (Human Phenotype Ontology) terms, then verify and contextualize each match.

**Inputs**: symptom (str): A reported, free-text symptom (e.g., “ptosis”, “muscle weakness”).

**External resources & libs**
- Search API: https://ontology.jax.org/api/hp/search/?q=<symptom> (top result taken)
- Ontology file: http://purl.obolibrary.org/obo/hp.obo (definitions, synonyms, hierarchy)
- Python libs: requests, fuzzywuzzy.process.extractOne, obonet, functools.lru_cache, pandas (optional), rapidfuzz (optional; used for the synonym check when installed)

**High-level flow**
1. Search HPO: Query the JAX HPO API with the input symptom → get the top candidate (name, id) or no result.
2. Fuzzy validation: Compute a fuzzy score between the input symptom and the returned HPO term name.
3. Context retrieval: From hp.obo, pull the definition and synonyms for the candidate HPO ID.
4. Lineage extraction: From the same ontology graph, compute the depth and the path to the root HP:0000001 (following the first parent when a term has several).
5. Accept/Reject decision
   - Accept if fuzzy_score ≥ 80 OR the input symptom matches one of the term's synonyms (exact case-insensitive match, or fuzzy score ≥ 90).
   - Reject otherwise, or if the API returns no candidate.

**Decision rule (acceptance)**
- Threshold: fuzzy_score ≥ 80 between the input symptom and the HPO term name
- Synonym override: Accept if the input symptom matches one of the term's synonyms (exact case-insensitive match, or fuzzy score ≥ 90)

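**Rationale**: Taking only the top hit puts speed and recall first; the similarity threshold is a sanity check on that hit, and the synonym check adds a semantic cushion for wordings that differ from the primary term name.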
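The decision rule can be sketched in isolation. This is a minimal illustration rather than the notebook's implementation: it swaps in the standard library's `difflib.SequenceMatcher` for fuzzywuzzy so it runs without extra installs, so scores will differ from the real pipeline. The names `similarity` and `accept_match` and the synonym lists are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Iterable


def similarity(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in, not fuzzywuzzy)."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return round(100 * ratio)


def accept_match(symptom: str, hpo_name: str, synonyms: Iterable[str],
                 name_threshold: int = 80, synonym_threshold: int = 90) -> bool:
    """Accept if the symptom is close to the term name or to any synonym."""
    if similarity(symptom, hpo_name) >= name_threshold:
        return True
    return any(similarity(symptom, s) >= synonym_threshold for s in synonyms)


# Illustrative inputs; the synonym lists are hand-written for the example.
print(accept_match("headache", "Headache", []))                     # close to the name
print(accept_match("droopy eyelid", "Ptosis", ["Droopy eyelid"]))   # rescued by a synonym
print(accept_match("fever", "Ptosis", ["Blepharoptosis"]))          # neither check passes
```

The second call shows why the synonym override exists: a lay description can be far from the primary term name yet identical to a listed synonym.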

**Outputs (as implemented)**

The pipeline returns 8 fields in fixed order:
1. reported_symptom (str)
2. hpo_term (str | None)
3. hpo_id (str | None)
4. definition (str | None)
5. rank (int | None): depth of the term below the root
6. path (list[str]): HPO IDs from the root to the term; empty if no candidate
7. fuzzy_score (int): 0 if no candidate
8. status (“matched” | “not matched”)
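Because the result is a positional tuple, it can help to give the fields names. The sketch below is one possible wrapper, not part of the notebook: `MappingResult` is a hypothetical name, and its field order is assumed to mirror the 8 outputs listed above.

```python
from typing import List, NamedTuple, Optional


class MappingResult(NamedTuple):
    """Hypothetical wrapper; field order mirrors the 8 pipeline outputs."""
    reported_symptom: str
    hpo_term: Optional[str]
    hpo_id: Optional[str]
    definition: Optional[str]
    rank: Optional[int]
    path: List[str]
    fuzzy_score: int
    status: str


# The shape returned when the search finds no candidate
no_match = MappingResult("unknown symptom", None, None, None, None, [], 0, "not matched")

print(no_match.status)      # access by name instead of position
print(no_match._asdict())   # dict form, convenient for building a table
```

A pipeline tuple could be wrapped with `MappingResult(*result)` without changing the pipeline itself.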

## Load necessary modules

In [None]:
# General modules
import requests
import pandas as pd

# Specific modules
from fuzzywuzzy import process
import obonet
from functools import lru_cache



**Brief information for the specific modules**

- `fuzzywuzzy.process`: Provides fuzzy string matching, useful for comparing free-text symptoms to HPO terms and synonyms.
- `obonet`: Loads and parses OBO-formatted ontology files, such as the Human Phenotype Ontology, into network structures.
- `functools.lru_cache`: Decorator for caching function results, improving performance when repeatedly querying or processing the same data.
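To make the caching behavior concrete, here is a small sketch of `lru_cache` wrapped around a stand-in for a slow loader. The function name `load_expensive_resource` and the call counter are illustrative assumptions; the notebook applies the same decorator to its ontology loader.

```python
from functools import lru_cache

calls = {"count": 0}


@lru_cache(maxsize=1)
def load_expensive_resource():
    """Stand-in for a slow step such as downloading and parsing an ontology."""
    calls["count"] += 1
    return {"loaded": True}


first = load_expensive_resource()
second = load_expensive_resource()

print(calls["count"])    # the function body ran only once
print(first is second)   # both calls return the same cached object
```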

## Implementation

### Step 1: Map reported symptoms to HPO terms using the JAX HPO API

In [None]:
import requests

def map_symptoms_to_hpo(symptom):
    """
    Map a reported symptom to an HPO term using the JAX HPO search API.

    Returns a (name, id) tuple for the top hit, or (None, None) if the
    search returns nothing or the request fails.
    """
    url = "https://ontology.jax.org/api/hp/search/"
    try:
        # Pass the query via params so spaces and special characters are URL-encoded
        response = requests.get(url, params={"q": symptom}, timeout=10)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx responses
        json_data = response.json()
        results = json_data.get('terms', [])
        if results:
            top = results[0]  # take the top result
            return (top["name"], top["id"])  # matched term and HPO ID as a tuple
        return (None, None)
    except (requests.exceptions.RequestException, ValueError, KeyError):
        # Network/HTTP error, invalid JSON, or an unexpected payload shape
        return (None, None)


In [4]:
# Example usage:
symptom = "headache"
print(map_symptoms_to_hpo(symptom))

('Headache', 'HP:0002315')
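The two mechanical parts of Step 1, encoding the query and reading the top hit, can be tried offline. This is a sketch under assumptions: `build_search_url` and `top_hit` are hypothetical helper names, and `sample` is a hand-written payload that only mimics the `terms` / `name` / `id` shape Step 1 reads, not a recorded API response.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://ontology.jax.org/api/hp/search/"


def build_search_url(symptom: str) -> str:
    """Return the search URL with the symptom safely URL-encoded."""
    return f"{SEARCH_URL}?{urlencode({'q': symptom})}"


def top_hit(payload: dict):
    """Return (name, id) of the first entry under 'terms', or (None, None)."""
    terms = payload.get("terms") or []
    if not terms:
        return (None, None)
    first = terms[0]
    return (first.get("name"), first.get("id"))


# Hand-written payload for illustration only
sample = {"terms": [{"name": "Headache", "id": "HP:0002315"}]}

print(build_search_url("muscle weakness"))  # the space is encoded as '+'
print(top_hit(sample))
print(top_hit({"terms": []}))               # empty search result
```

Encoding matters mostly for multi-word symptoms and for inputs containing characters such as `&` or `#`, which would otherwise change the meaning of the URL.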


### Step 2: Estimate the fuzzy score between the input term and the HPO term

In [None]:

from fuzzywuzzy import process

def estimate_fuzzy_score(input_term, hpo_term):
    """
    Estimate the fuzzy score between the input term and the HPO term.

    Parameters:
    input_term: str
        The term reported in the study.
    hpo_term: str
        The term from the HPO database.
    Returns:
    fuzzy_score: int
        The fuzzy score (0-100) between the input term and the HPO term.

    """
    if not isinstance(input_term, str):
        raise ValueError("input_term must be a string.")

    # extractOne returns (best_match, score); with a single choice, the score
    # is the similarity between input_term and hpo_term
    _, fuzzy_score = process.extractOne(input_term, [hpo_term])
    return fuzzy_score


In [None]:
# Example usage:
input_term = "headache"
hpo_term = "Headache"
print(estimate_fuzzy_score(input_term, hpo_term))

100


### Step 3: Look up HPO synonyms and definition

In [24]:
# --- Shared ontology loader ---
import obonet
from functools import lru_cache

HPO_URL = "http://purl.obolibrary.org/obo/hp.obo"

@lru_cache(maxsize=1)
def load_graph():
    """Load and cache the HPO graph once per session."""
    return obonet.read_obo(HPO_URL)


# --- Improved: fast, cached meta lookup ---
@lru_cache(maxsize=8192)  # cache per-HPO-ID lookups, too
def get_hpo_definitions_and_synonyms(hpo_id: str):
    """
    Return (synonyms, definition) for an HPO ID using the cached graph.
    - Avoids repeated obo downloads/parsing.
    - Cleans OBO-quoted strings for readability.
    """
    graph = load_graph()  # <-- reuse cached graph
    node = graph.nodes.get(hpo_id)
    if not node:
        return [], None

    # Synonyms: in OBO it’s usually 'synonym' (singular); keep a fallback.
    raw_syn = node.get("synonym", []) or node.get("synonyms", [])

    def _clean_obo_text(s: str) -> str:
        # OBO annotations often look like: "text" EXACT [XREF:...]
        if isinstance(s, str) and '"' in s:
            try:
                return s.split('"', 2)[1]
            except Exception:
                return s
        return s

    synonyms = [_clean_obo_text(s) for s in raw_syn]

    # Definition may be a single string like: "text" [PMID:...]
    raw_def = node.get("def")
    definition = _clean_obo_text(raw_def) if isinstance(raw_def, str) else None

    return synonyms, definition

In [26]:
# Example usage:
hpo_id = "HP:0002315"
synonyms, definition = get_hpo_definitions_and_synonyms(hpo_id)
print(f"Synonyms: {synonyms}")
print(f"Definition: {definition}")    


Synonyms: ['Headache', 'Headaches']
Definition: Cephalgia, or pain sensed in various parts of the head, not confined to the area of distribution of any nerve.
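The cleaning step keeps only the quoted text, but OBO synonym values also carry a scope keyword (such as EXACT or RELATED) after the closing quote. The sketch below shows one way to keep both; `parse_obo_synonym` is a hypothetical helper, the regular expression is an assumption about the common layout described in the code comments above, and the sample strings are hand-written rather than copied from hp.obo.

```python
import re
from typing import Optional, Tuple

# Quoted text (allowing backslash-escaped characters), then an optional scope keyword
_SYNONYM_RE = re.compile(r'^"((?:[^"\\]|\\.)*)"\s*(\w+)?')


def parse_obo_synonym(raw: str) -> Tuple[str, Optional[str]]:
    """Split a raw OBO synonym value into (text, scope)."""
    match = _SYNONYM_RE.match(raw)
    if not match:
        return raw, None  # leave unrecognized strings untouched
    text = match.group(1).replace('\\"', '"')  # unescape embedded quotes
    return text, match.group(2)


# Hand-written examples of the layout
print(parse_obo_synonym('"Headaches" EXACT []'))
print(parse_obo_synonym('"Cephalgia" RELATED [XREF:123]'))
print(parse_obo_synonym("no quotes here"))
```

Keeping the scope would allow, for instance, restricting the synonym override to EXACT synonyms; whether that improves precision on real data is untested here.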


### Step 4: Get full lineage for a given HPO ID

In [25]:
# Reuses the cached load_graph() defined in Step 3, so the ontology is
# downloaded and parsed only once per session.

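def get_rank_and_path(hpo_id):
    """
    Return (depth, path) for an HPO ID, where path runs from the root
    HP:0000001 down to the term. When a term has several parents, only the
    first is followed, so the path is not guaranteed to be the shortest.
    """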
    graph = load_graph()
    if hpo_id not in graph:
        return None, []

    path = [hpo_id]
    depth = 0
    current = hpo_id
    while True:
        parents = graph.nodes[current].get("is_a", [])
        if not parents:
            break
        current = parents[0]  # take first parent if multiple
        path.append(current)
        depth += 1
        if current == "HP:0000001":
            break
    return depth, list(reversed(path)) 


In [11]:
# Example usage:
hpo_id = "HP:0002315"
rank, lineage = get_rank_and_path(hpo_id)
print(f"Rank: {rank}")
print(f"Lineage: {lineage}")

Rank: 4
Lineage: ['HP:0000001', 'HP:0000118', 'HP:0000707', 'HP:0012638', 'HP:0002315']
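Following the first parent is simple, but a term with several parents can have a shorter route to the root. The sketch below contrasts the two strategies on a toy child-to-parents map. All IDs are invented for illustration and do not reflect real HPO structure, and `shortest_path_to_root` is a hypothetical alternative, not what Step 4 does.

```python
from collections import deque
from typing import Dict, List

# Toy is_a map (child -> parents) with invented IDs; not real HPO structure.
TOY_PARENTS: Dict[str, List[str]] = {
    "T:root": [],
    "T:a": ["T:root"],
    "T:b": ["T:a"],
    "T:c": ["T:b"],
    "T:leaf": ["T:c", "T:a"],  # the first parent gives the longer route
}


def first_parent_path(term: str, parents: Dict[str, List[str]]) -> List[str]:
    """Follow the first parent at every step (the Step 4 strategy)."""
    path = [term]
    while parents.get(path[-1]):
        path.append(parents[path[-1]][0])
    return path[::-1]


def shortest_path_to_root(term: str, root: str, parents: Dict[str, List[str]]) -> List[str]:
    """Breadth-first search upward; returns the root-to-term path, or [] if unreachable."""
    queue = deque([[term]])
    seen = {term}
    while queue:
        path = queue.popleft()
        if path[-1] == root:
            return path[::-1]
        for parent in parents.get(path[-1], []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return []


print(first_parent_path("T:leaf", TOY_PARENTS))               # 4 steps below the root
print(shortest_path_to_root("T:leaf", "T:root", TOY_PARENTS)) # 2 steps below the root
```

The two strategies report different depths for the same term, which matters if the rank is later used to compare how specific two matched terms are.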


## Pipeline function to chain Steps 1-4


First, add a small normalizer and a synonym-match helper.

In [29]:
from typing import Iterable, Optional
try:
    # Prefer RapidFuzz (faster, no GPL issues)
    from rapidfuzz import fuzz, process as rf_process
    _USE_RF = True
except ImportError:
    # Fall back to fuzzywuzzy if RapidFuzz isn't available
    from fuzzywuzzy import fuzz, process as fw_process
    _USE_RF = False

def _norm(s: Optional[str]) -> str:
    return (s or "").strip().lower()

def synonym_matches_input(
    input_symptom: str,
    synonyms: Iterable[str],
    exact: bool = True,
    fuzzy_threshold: Optional[int] = 90
) -> bool:
    """
    Return True if the input symptom matches any synonym (exact case-insensitive
    or fuzzy >= threshold). Pass fuzzy_threshold=None to disable the fuzzy check.
    """
    inp = _norm(input_symptom)
    syns = [_norm(s) for s in (synonyms or []) if s]

    # Exact (case-insensitive)
    if exact and inp in syns:
        return True

    # Fuzzy fallback if desired
    if fuzzy_threshold is not None and len(syns) > 0:
        if _USE_RF:
            # RapidFuzz: compute max similarity quickly
            # (rf_process.extractOne returns (match, score, idx))
            _, score, _ = rf_process.extractOne(inp, syns, scorer=fuzz.ratio)
            return score >= fuzzy_threshold
        else:
            # FuzzyWuzzy fallback (default WRatio scorer, so scores can
            # differ from the RapidFuzz branch, which uses fuzz.ratio)
            _, score = fw_process.extractOne(inp, syns)
            return score >= fuzzy_threshold

    return False

In [None]:

def map_symptoms_to_hpo_pipeline(symptom):
    """
    Map a reported symptom to an HPO term, then retrieve its definition and lineage.

    Parameters:
    symptom: str
        The term reported in the study.

    Returns (as a tuple, in this order):
    symptom: str
        The reported term, unchanged.
    hpo_term: str or None
        The term from the HPO database.
    hpo_id: str or None
        The ID of the term from the HPO database.
    definition: str or None
        The definition of the HPO term.
    rank: int or None
        The rank (depth) of the HPO term in the ontology.
    path: list of str
        The list of HPO IDs representing the path from the root to this term.
    fuzzy_score: int
        The fuzzy score between the input term and the HPO term (0 if no match).
    status: str
        The status of the mapping ('matched' or 'not matched').
    """
    # Step 1: Map reported symptoms to HPO terms
    hpo_term, hpo_id = map_symptoms_to_hpo(symptom)

    # If no match found, return immediately
    if hpo_term is None or hpo_id is None:
        return symptom, None, None, None, None, [], 0, 'not matched'

    # Step 2: Estimate fuzzy score
    fuzzy_score = estimate_fuzzy_score(symptom, hpo_term)

    # Step 3: Get HPO definitions and synonyms
    synonyms, definition = get_hpo_definitions_and_synonyms(hpo_id)
    
    # Step 4: Get full lineage
    rank, path = get_rank_and_path(hpo_id)

    # Step 5: Acceptance rule (compare the INPUT symptom to the synonyms)
    syn_match = synonym_matches_input(symptom, synonyms, exact=True, fuzzy_threshold=90)
    accept = (fuzzy_score >= 80) or syn_match
    status = 'matched' if accept else 'not matched'

    return symptom, hpo_term, hpo_id, definition, rank, path, fuzzy_score, status



In [31]:
# Example usage:
result = map_symptoms_to_hpo_pipeline("headache")
print(result)


('headache', 'Headache', 'HP:0002315', 'Cephalgia, or pain sensed in various parts of the head, not confined to the area of distribution of any nerve.', 4, ['HP:0000001', 'HP:0000118', 'HP:0000707', 'HP:0012638', 'HP:0002315'], 100, 'matched')
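In practice the pipeline is applied to many reported symptoms at once. The sketch below shows one way to batch it, assuming a hypothetical `run_batch` helper that takes the pipeline as an argument and calls it once per distinct symptom (ignoring case and surrounding whitespace). `fake_pipeline` is an offline stand-in so the example runs without network access; in the notebook, `map_symptoms_to_hpo_pipeline` would be passed instead.

```python
from typing import Callable, Dict, Iterable, List, Tuple

FIELDS = ("reported_symptom", "hpo_term", "hpo_id", "definition",
          "rank", "path", "fuzzy_score", "status")

CALLS: List[str] = []  # records which inputs actually reached the pipeline


def run_batch(symptoms: Iterable[str], pipeline: Callable[[str], Tuple]) -> List[Dict]:
    """Run `pipeline` once per distinct symptom and return one dict per input."""
    cache: Dict[str, Tuple] = {}
    rows = []
    for symptom in symptoms:
        key = symptom.strip().lower()
        if key not in cache:
            cache[key] = pipeline(symptom.strip())
        row = dict(zip(FIELDS, cache[key]))
        row["reported_symptom"] = symptom  # keep the wording as reported
        rows.append(row)
    return rows


def fake_pipeline(symptom: str) -> Tuple:
    """Offline stand-in that mimics the 'not matched' return shape."""
    CALLS.append(symptom)
    return (symptom, None, None, None, None, [], 0, "not matched")


rows = run_batch(["Ptosis", "ptosis ", "fatigue"], fake_pipeline)
print(len(rows))   # one row per input
print(CALLS)       # the duplicate spelling did not trigger a second call
```

The resulting list of dicts can be handed to `pd.DataFrame(rows)` for review or export, which is where the optional pandas import comes in.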
