# MPS Annotation Pipeline #

## Pre-Install ##

Install [Python](https://www.python.org/downloads/); we recommend version 3.11.x.

### Windows ###
On Windows, packages such as onnx, duckdb, pyarrow, and tiktoken may need to be compiled from C/C++ source. Therefore you must:
1. Download [CMake](https://cmake.org/download/)
2. During installation, check the box to "Add CMake to system PATH for all users".
3. Download [Visual Studio C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)

### MacOS ###
macOS is easier because Apple's command line tools include a C/C++ compiler.
1. Run `xcode-select --install` to install command line utilities such as git, make, and clang.
2. Install CMake with [Homebrew](https://brew.sh/): `brew install cmake`

## Preparing Your Environment ##
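1. Create and activate a virtual environment to avoid dependency hell. The commands below are for Windows; on macOS, use `python3.11 -m venv .venv` followed by `source .venv/bin/activate`: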
    ```
    py -3.11 -m venv .venv
    .venv\Scripts\activate
    ```
2. Upgrade pip: `python -m pip install --upgrade pip`
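
To confirm that the environment is set up as intended, a quick check from inside Python can help. The sketch below uses only the standard library; the helper names `in_virtualenv` and `python_matches` are illustrative and are not part of any package used in this notebook.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def python_matches(major: int, minor: int) -> bool:
    """Return True when the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


print("virtual environment active:", in_virtualenv())
print("running Python 3.11:", python_matches(3, 11))
```

If the first line prints `False`, the virtual environment was probably not activated in the shell that launched Jupyter.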

## Install LLM Tools and Dependencies ##

OntoGPT is based on Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a method for extracting ontological content from text or structured data, described by [Caufield et al., 2024](https://doi.org/10.1093/bioinformatics/btae104).

CurateGPT is another library that uses LLM embeddings to rank ontology content by semantic similarity to text or structured data input. CurateGPT also lets users suggest new ontology content and interact programmatically with GitHub issue trackers. Find the preprint for CurateGPT [here](https://doi.org/10.48550/arXiv.2411.00046).

In [None]:
!pip install -r requirements.txt

This may take a few minutes. During installation, pip will report a dependency conflict: Selenium 4.34.0 requires urllib3==2.4.0, while OntoGPT requires urllib3<2. Install a 1.26.x release of urllib3 with `pip install "urllib3<2"`; this shouldn't cause problems at runtime. If CurateGPT does run into a conflict, switch to urllib3 2.4.0 with `pip install "urllib3==2.4.0"` while using it, then revert to the 1.26.x release with the previous command.
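
To see which urllib3 release is currently active, the installed version can be read with the standard library's `importlib.metadata`. This is a minimal sketch: the helpers `major_version` and `satisfies_ontogpt_pin` are hypothetical names introduced here, and the check only compares the major version against the `urllib3<2` constraint mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '1.26.18'."""
    return int(version_string.split(".")[0])


def satisfies_ontogpt_pin(version_string: str) -> bool:
    """Return True when the version meets the urllib3<2 constraint."""
    return major_version(version_string) < 2


try:
    installed = version("urllib3")
    print(f"urllib3 {installed}: satisfies urllib3<2 -> {satisfies_ontogpt_pin(installed)}")
except PackageNotFoundError:
    print("urllib3 is not installed in this environment")
```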

### Learn More ###
If you get stuck at any time, visit the GitHub repository for either package, or browse the example pages for pre-built Jupyter notebooks.
* [CurateGPT GitHub](https://github.com/monarch-initiative/curategpt/)
* CurateGPT examples: [here](https://github.com/monarch-initiative/curategpt/blob/main/notebooks/command-line/)
* [OntoGPT GitHub](https://github.com/monarch-initiative/ontogpt/) 
* OntoGPT examples: [here](https://github.com/monarch-initiative/ontogpt/blob/main/notebooks/)
* OntoGPT templates: [here](https://github.com/monarch-initiative/ontogpt/blob/main/src/ontogpt/templates/)


## Getting Started ##

Make sure the `.env` file contains your OpenAI API key in the form `OPENAI_API_KEY=examplekey1234`. Now let's get to curating!

In [None]:
import os
import subprocess
import yaml
import pipe
from dotenv import load_dotenv
from pprintpp import pprint as pp
from datetime import datetime, date
load_dotenv() # load .env

# load OpenAI API key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# we don't want to use onnx embeddings
os.environ["CHROMADB_USE_ANN"] = "false"
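
Before calling any LLM-backed command, it can be worth confirming that the key was actually picked up from the environment, without printing it in full. The following is a sketch using only the standard library; `mask_secret` is a hypothetical helper, not part of python-dotenv or OntoGPT.

```python
import os


def mask_secret(secret: str, visible: int = 4) -> str:
    """Show only the first and last few characters of a secret."""
    if len(secret) <= 2 * visible:
        # Too short to reveal anything safely, so hide every character
        return "*" * len(secret)
    return f"{secret[:visible]}...{secret[-visible:]}"


api_key = os.getenv("OPENAI_API_KEY")
if api_key:
    print("OPENAI_API_KEY loaded:", mask_secret(api_key))
else:
    print("OPENAI_API_KEY is not set; check the .env file")
```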

### Set OpenAI API Key ###

In [None]:
!runoak set-apikey -e openai $OPENAI_API_KEY

### Show OntoGPT Options ###

OntoGPT has a `--help` option. Unfortunately, CurateGPT has no equivalent, since it assumes considerable background knowledge of current ontology editing workflows on the part of the user. Fortunately, we only use it to 1) search the PubMed or Wikipedia APIs and 2) return terms that are semantically similar to those missed by OntoGPT.

In [None]:
!ontogpt --help

## Search PubMed API or Wikipedia API for a Topic of Interest (TOI) ##

In our example, we are interested in mucopolysaccharidosis (MPS), a rare lysosomal disease caused by defects in lysosomal enzymes that catabolize glycosaminoglycans. Learn more [here](https://www.ninds.nih.gov/health-information/disorders/mucopolysaccharidoses).

We are going to search PubMed for papers matching the query "What are mucopolysaccharidosis clinical symptoms?", ranked using OpenAI embeddings. CurateGPT depends on *indexes* for many operations; these provide context for LLM queries.
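
To give an intuition for what an embedding index does, the toy sketch below ranks a few labels by cosine similarity to a query vector. It is not CurateGPT's implementation: the three-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions), and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, index, limit=3):
    """Return up to `limit` (label, score) pairs, most similar first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Invented vectors standing in for real embeddings of indexed terms
toy_index = {
    "Hepatomegaly": [0.9, 0.1, 0.0],
    "Corneal clouding": [0.1, 0.9, 0.1],
    "Joint stiffness": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # invented embedding of a query

for label, score in rank_by_similarity(query_vector, toy_index):
    print(f"{label}: {score:.3f}")
```

The label whose vector points in nearly the same direction as the query is listed first; a real index does the same comparison over stored embeddings of abstracts or ontology terms.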

In [None]:
import json
import disable_onnx
from curategpt.wrappers.literature.pubmed_wrapper import PubmedWrapper
from curategpt.wrappers.clinical.hpoa_wrapper import HPOAWrapper
from curategpt import BasicExtractor

# Setup
extractor = BasicExtractor()
pubmed = PubmedWrapper(extractor=extractor)

query = "What are mucopolysaccharidosis clinical symptoms?"
results = pubmed.search(query, expand=False, limit=10)

# Save to JSON file
output = []
for result in results:
    if not isinstance(result, tuple) or len(result) < 2:
        continue

    doc, score = result[0], result[1]
    if not isinstance(doc, dict):
        continue

    output.append({
        "score": score,
        **doc  # includes id, title, abstract, pmcid, etc.
    })

# Save full JSON results
with open("pubmed_mps_results.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2, ensure_ascii=False)

# Print selected fields
for i, item in enumerate(output):
    title = item.get("title", "No title")
    pmid = item.get("id", "No PMID")
    score = item.get("score", "No score")
    print(f"{i+1}. {title}\n   {pmid}\n   Score: {score:.6f}\n")


In [None]:
import disable_onnx
# from curategpt import BasicExtractor
# from curategpt.wrappers.literature.pubmed_wrapper import PubmedWrapper
# from curategpt.store.chromadb_adapter import ChromaDBAdapter
# from curategpt.wrappers.clinical.hpoa_wrapper import HPOAWrapper

# # Step 1: Initialize extractor, store, and PubMed wrapper
# extractor = BasicExtractor()
# adapter = ChromaDBAdapter("chromadb_pubmed")
# pubmed = PubmedWrapper(extractor=extractor)

# # Step 2: Ask a question and save retrieved docs into ChromaDB
# results = pubmed.search("What neurons express VIP?", limit=10)
# for i, (doc, _, _) in enumerate(results):
#     title = doc.get("title", "")
#     abstract = doc.get("abstract", "")
#     combined = f"{title}\n{abstract}"
#     adapter.insert([{
#         "id": f"doc_{i}",
#         "text": combined
#     }], collection="pubmed")

# # Step 3: Wrap with HPOAWrapper and extract filtered annotations
# wrapper = HPOAWrapper(local_store=adapter, extractor=extractor)
# rows = list(wrapper.objects(collection="pubmed"))

# # Only keep rows with valid HP: phenotype IDs
# filtered_rows = [row for row in rows if str(row.get("phenotype_id", "")).startswith("HP:")]

# # Step 4: Write the filtered HPOA-formatted output
# with open("filtered_output.hpoa.tsv", "w") as f:
#     for row in filtered_rows:
#         f.write("\t".join([
#             row.get("disease_id", ""),
#             row.get("phenotype_id", ""),
#             row.get("qualifier", ""),
#             row.get("evidence", ""),
#             row.get("onset", ""),
#             row.get("frequency", ""),
#             row.get("sex", ""),
#             row.get("modifier", ""),
#             row.get("aspect", "P"),
#             row.get("publication", ""),
#             row.get("description", ""),
#         ]) + "\n")

# print(f"✅ Wrote {len(filtered_rows)} filtered HPOA entries to filtered_output.hpoa.tsv")


from curategpt import BasicExtractor
from curategpt.wrappers.literature.pubmed_wrapper import PubmedWrapper
from curategpt.wrappers.clinical.hpoa_wrapper import HPOAWrapper
from curategpt.store.chromadb_adapter import ChromaDBAdapter

# Step 1: Search PubMed for MPS symptoms
extractor = BasicExtractor()
pubmed = PubmedWrapper(extractor=extractor)
results = list(pubmed.search("Mucopolysaccharidosis clinical symptoms", limit=15))

# Step 2: Insert PubMed docs into a local ChromaDB
adapter = ChromaDBAdapter("hpoa_pubmed_db")
adapter.client.reset()
objects = []

for i, (doc, score, meta) in enumerate(results):
    text = f"{doc.get('title', '')}\n{doc.get('abstract', '')}".strip()
    pub_id = doc.get("id", "")
    if text:
        objects.append({
            "id": f"doc_{i}",
            "text": text,
            "publication": pub_id  # <-- Add PMID here
        })

adapter.insert(objects, collection="pubmed")

# Step 3: Extract HPOA-like objects
wrapper = HPOAWrapper(local_store=adapter, extractor=extractor, expand_publications=False)

# Step 4: Save to TSV
output_file = "output.mps4.hpoa.tsv"
seen_rows = set()

# Column order of the HPOA-style TSV
columns = [
    "disease_id", "phenotype_id", "qualifier", "evidence", "onset", "frequency",
    "sex", "modifier", "aspect", "publication", "description",
]

with open(output_file, "w", encoding="utf-8") as f:
    f.write("\t".join(columns) + "\n")  # write column headers
    for row in wrapper.objects(collection="pubmed"):
        # Coerce missing or None values to strings so the join never fails;
        # "aspect" falls back to "P" when it is absent or empty
        line = "\t".join(
            str(row.get(col) or ("P" if col == "aspect" else "")) for col in columns
        )

        if line not in seen_rows:  # skip exact duplicate rows
            seen_rows.add(line)
            f.write(line + "\n")

print(f"\nWrote {len(seen_rows)} unique HPOA rows to: {output_file}")
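
Once the TSV exists, it is useful to check how many rows carry a real HPO identifier rather than an unmatched term. The sketch below parses TSV text with the standard `csv` module; `count_hp_rows` is a hypothetical helper, and the sample rows are placeholders for illustration, not real annotations.

```python
import csv
import io


def count_hp_rows(tsv_text: str) -> tuple[int, int]:
    """Return (rows whose phenotype_id starts with 'HP:', total rows)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    total = 0
    with_hp = 0
    for row in reader:
        total += 1
        if (row.get("phenotype_id") or "").startswith("HP:"):
            with_hp += 1
    return with_hp, total


# Placeholder rows for illustration only; not real annotations
sample = (
    "disease_id\tphenotype_id\tpublication\n"
    "EXAMPLE:1\tHP:0000001\tPMID:0\n"
    "EXAMPLE:1\tAUTO:unmatched%20term\tPMID:0\n"
)
print(count_hp_rows(sample))
```

To apply the same check to the file written above, read `output.mps4.hpoa.tsv` with `open(..., encoding="utf-8").read()` and pass the text to `count_hp_rows`.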

### Extract Human Phenotype Ontology Terms from MPS Papers ###

In [None]:
!ontogpt -vvv extract -i example1.txt -t templates/human_phenotype.yaml -o output/output.yaml --model-provider openai

### Index HPO For AUTO Prefix Terms ###

Behind the scenes, OAK (the Ontology Access Kit) is used to access a variety of ontologies and allows them to be indexed. See the oaklib docs for documentation on handles such as `sqlite:obo:hp`.

Let's start by making an index of the Human Phenotype Ontology (HP); the equivalent command for the MONDO Disease Ontology (MONDO) is included below but commented out:

In [None]:
!curategpt ontology index -m openai: -e openai -c terms_hp sqlite:obo:hp
#!curategpt ontology index -m openai: -c terms_hp sqlite:obo:hp
#!curategpt ontology index -m openai: -c terms_mondo sqlite:obo:mondo

Warning: indexing currently takes about 2 hours. Embedding the terms with OpenAI requires an OpenAI API key. Alternatively, leave the `-m` option off to use the default ChromaDB embedding model. gpt-4o is the recommended model for CurateGPT.

Next, we will determine which terms (prefix AUTO:) could not be matched to HPO by OntoGPT and run a search using OpenAI embeddings for the most similar terms.

In [None]:
with open("output/output.yaml", "r") as f:
    data = yaml.safe_load(f)

# Extract label and attribute it to a MONDO ID
label = data["extracted_object"]["label"]
#!curategpt search -c terms_mondo label

# Extract AUTO terms and find semantically similar phenotypes
raw_auto_terms = [item for item in data["extracted_object"]["phenotypes"] if item.startswith("AUTO:")]
auto_terms = [item.replace("AUTO:", "").replace("%20", " ") for item in raw_auto_terms]

print(auto_terms)

#!curategpt search -c terms_hp auto_terms
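
The cell above only converts `%20` back to spaces. If the AUTO terms contain other percent-encoded characters, the standard library's `urllib.parse.unquote` handles them all. This is a sketch under the assumption that the terms are percent-encoded; `clean_auto_terms` is a hypothetical helper and the example list is made up.

```python
from urllib.parse import unquote


def clean_auto_terms(phenotypes):
    """Return readable labels for entries that carry the AUTO: prefix."""
    prefix = "AUTO:"
    return [
        unquote(item[len(prefix):])  # strip the prefix, then decode %XX escapes
        for item in phenotypes
        if isinstance(item, str) and item.startswith(prefix)
    ]


# Made-up example list mixing a grounded ID with two unmatched terms
example = ["HP:0000001", "AUTO:coarse%20facial%20features", "AUTO:hearing%20loss"]
print(clean_auto_terms(example))
```

Each cleaned label can then be passed as the query to `curategpt search -c terms_hp`, one term at a time.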

In [None]:
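# `save` was not defined earlier in the notebook; set it here
save = True  # set to False to skip writing the results to disk

if save:
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
    filename = f"article_urls_{timestamp}.json"
    output_dir = os.path.join("requests", "json")
    os.makedirs(output_dir, exist_ok=True)  # create the folder if it is missing
    output_path = os.path.join(output_dir, filename)

    # Save the search results as JSON
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)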

In [None]:
!curategpt ask -c phenopackets_384 "what genes are associated with renal phenotypes?"