# MPS Annotation Pipeline #

In [None]:
import json
import os
import subprocess
from datetime import datetime, date

import pipe
import yaml
from dotenv import load_dotenv
from pprintpp import pprint as pp

load_dotenv() # load .env

# load OpenAI API key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
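
`os.getenv` returns `None` when the variable is missing, so a missing key would only surface later when a tool tries to use it. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of dotenv or any of the tools used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

In this notebook it could replace the plain lookup, for example `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`.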

OntoGPT examples: [here](https://github.com/monarch-initiative/ontogpt/blob/main/notebooks/)

OntoGPT templates: [here](https://github.com/monarch-initiative/ontogpt/blob/main/src/ontogpt/templates/)

CurateGPT examples: [here](https://github.com/monarch-initiative/curategpt/blob/main/notebooks/command-line/)

### Pre-Install ###
1. Download [CMake](https://cmake.org/download/)
2. During installation, check the box to "Add CMake to system PATH for all users".
3. Download and install [Visual Studio C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/)
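
To confirm that the tools from the steps above are visible to the notebook, a quick PATH lookup may help. This is a sketch: `find_tools` is a hypothetical helper, and checking for `cl` (the MSVC compiler) is an assumption, since it is usually only on PATH inside a Visual Studio developer prompt.

```python
import shutil


def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# "cmake" comes from the pre-install steps; "cl" is an assumed extra check.
for tool, path in find_tools(["cmake", "cl"]).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

A kernel that was started before the installation may need a restart before it sees the updated PATH.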

### Install LLM Tools ###

OntoGPT is based on Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a method for extracting ontological content from text or structured data, described by [Caufield et al., 2024](https://doi.org/10.1093/bioinformatics/btae104).

CurateGPT is another library. It uses LLM embeddings to prioritize ontology content that is semantically similar to the input text or structured data. CurateGPT also lets users suggest new ontology content and interact programmatically with GitHub issue trackers. The CurateGPT preprint is available [here](https://doi.org/10.48550/arXiv.2411.00046).

In [None]:
!pip install ontogpt
!pip install curategpt
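
After installation, it can be useful to confirm which versions the kernel actually sees. The sketch below uses the standard-library `importlib.metadata`; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for package in ("ontogpt", "curategpt"):
    print(package, installed_version(package) or "not installed")
```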

### Set OpenAI API Key ###

In [None]:
!runoak set-apikey -e openai $OPENAI_API_KEY

### Show OntoGPT and CurateGPT Options ###

In [None]:
!ontogpt --help

In [None]:
!curategpt --help

### Extract Human Phenotype Ontology Terms from MPS Papers ###

In [None]:
!ontogpt -vvv extract -i example1.txt -t templates/human_phenotype.yaml -o output/output.yaml --model-provider openai

### Index HPO For AUTO Prefix Terms ###

Behind the scenes, OAK (oaklib) provides access to a variety of ontologies so that they can be indexed. See the oaklib docs for documentation on handles such as sqlite:obo:hp.

Let's start by making an index of the Human Phenotype Ontology (HP) and the MONDO Disease Ontology (MONDO):

In [None]:
!curategpt ontology index -m openai: -c terms_hp sqlite:obo:hp
!curategpt ontology index -m openai: -c terms_mondo sqlite:obo:mondo

Warning: indexing currently takes about 2 hours. If you embed the terms with OpenAI (the -m openai: option), you will need an OpenAI API key. You can also omit the -m option, in which case the default chromadb embedding model is used. gpt-4o is the recommended model for use with CurateGPT.
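
The idea behind an embedding index can be sketched in a few lines: each term is stored as a vector, and a query is answered by ranking the stored terms by cosine similarity to the query vector. This is only an illustration of the general technique, not CurateGPT's implementation. The three-dimensional vectors are made up, real embeddings have far more dimensions, and `rank_by_similarity` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, index, top_k=2):
    """Return the top_k (label, score) pairs, most similar first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors for illustration only; they do not come from any real model.
toy_index = {
    "term A": [0.9, 0.1, 0.0],
    "term B": [0.1, 0.9, 0.0],
    "term C": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

for label, score in rank_by_similarity(toy_query, toy_index):
    print(f"{label}: {score:.3f}")
```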

Next, we identify the terms that OntoGPT could not match to HPO (those with the AUTO: prefix) and search the embedding index for the most similar terms.

In [None]:
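with open("output/output.yaml", "r", encoding="utf-8") as f:
    data = yaml.safe_load(f)

# Extract the label so that it can be matched to a MONDO ID.
# In a notebook, {label} is expanded to the Python value before the command runs.
label = data["extracted_object"]["label"]
#!curategpt search -c terms_mondo "{label}"

# Collect the phenotypes that OntoGPT could not match to HPO (AUTO: prefix)
# and turn them back into plain text (%20 encodes a space)
raw_auto_terms = [item for item in data["extracted_object"]["phenotypes"] if item.startswith("AUTO:")]
auto_terms = [item.replace("AUTO:", "").replace("%20", " ") for item in raw_auto_terms]

print(auto_terms)

# Search the HPO index once per term, e.g. for the first one:
#!curategpt search -c terms_hp "{auto_terms[0]}"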
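
Because `auto_terms` is a list, the search has to be issued once per term. The sketch below shows one way to do that with `subprocess`, which the notebook already imports. The helper names are hypothetical, and the command form `curategpt search -c <collection> <query>` is taken from the commented lines above; the format of the output is not assumed.

```python
import subprocess
from urllib.parse import unquote


def decode_auto_terms(phenotypes):
    """Return plain-text labels for entries that carry the AUTO: prefix."""
    return [
        unquote(item[len("AUTO:"):])
        for item in phenotypes
        if isinstance(item, str) and item.startswith("AUTO:")
    ]


def build_search_command(term, collection="terms_hp"):
    """Build the argument list for one `curategpt search` call."""
    return ["curategpt", "search", "-c", collection, term]


def search_terms(terms, collection="terms_hp"):
    """Run one search per term and return {term: captured stdout}."""
    results = {}
    for term in terms:
        completed = subprocess.run(
            build_search_command(term, collection),
            capture_output=True,
            text=True,
            check=False,  # inspect completed.returncode if a search fails
        )
        results[term] = completed.stdout
    return results
```

Passing the command as a list avoids shell quoting problems with terms that contain spaces. With the variables from the cell above, `search_terms(auto_terms)` would run the searches against the terms_hp collection.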

In [None]:
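# NOTE: `save` (bool) and `results` (JSON-serializable) are not defined in this
# notebook; they must be set in an earlier cell before this one is run.
if save:
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
    filename = f"article_urls_{timestamp}.json"
    output_dir = os.path.join("requests", "json")
    os.makedirs(output_dir, exist_ok=True)  # create the folder if it is missing
    output_path = os.path.join(output_dir, filename)

    # save json
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)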

In [None]:
!curategpt ask -c phenopackets_384 "what genes are associated with renal phenotypes?"