# **OntoGPT: Structured Information Extraction Using LLMs**

**OntoGPT** is a powerful Python package developed by the [Monarch Initiative](https://monarch-initiative.github.io/ontogpt/) for extracting structured, ontology-grounded information from unstructured text using **Large Language Models (LLMs)**. It enables zero-shot extraction based on a user-defined schema, and is ideal for:

- Named Entity Recognition (NER)  
- Relation extraction  
- Summarization  
- Knowledge base or knowledge graph construction  
- Clinical report structuring  
- Literature mining for biomedicine and beyond

> 🔗 **Documentation**: [https://monarch-initiative.github.io/ontogpt/](https://monarch-initiative.github.io/ontogpt/)

---

## **Methods**

### ✳️ SPIRES: Structured Prompt Interrogation and Recursive Extraction of Semantics

The main method implemented in OntoGPT is **SPIRES**, a zero-shot learning (ZSL) approach that extracts nested and hierarchical semantic structures from text.

**Inputs:**
1. **LinkML schema** – Defines the structure and the **ontology** used for each type (e.g., SNOMED CT for diseases, HPO for phenotypes, HGNC for genes).
2. **Free text** – Natural language input (e.g., clinical notes, literature abstracts).

**Outputs:**
- Structured knowledge in:
  - JSON
  - YAML

This allows for seamless integration with semantic web tools, databases, and downstream analytics.
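
To make the schema-to-prompt idea concrete, here is a minimal sketch of how attribute descriptions could be turned into a zero-shot prompt. It is a simplified illustration, not OntoGPT's actual prompt code: `build_prompt` and the `ATTRIBUTES` dictionary are hypothetical names, and the prompt wording is an assumption.

```python
# Hypothetical, hand-written subset of a schema: attribute name -> description.
ATTRIBUTES = {
    "genes": "semicolon-separated list of genes",
    "organisms": "semicolon-separated list of organism taxons",
}


def build_prompt(attributes, text):
    """Assemble a zero-shot prompt asking for one 'name: value' line per attribute."""
    lines = [
        "From the text below, extract the following entities in the following format:",
        "",
    ]
    # Each attribute becomes a line the model is expected to fill in.
    lines += [f"{name}: <{description}>" for name, description in attributes.items()]
    lines += ["", "Text:", text]
    return "\n".join(lines)


prompt = build_prompt(ATTRIBUTES, "the mouse Shh gene")
print(prompt)
```

The model's reply is then expected to follow the same `name: value` layout, which is what makes it possible to map the answer back onto the schema.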

---

## **LLM Backend Support**

OntoGPT can use multiple LLMs:

- **OpenAI GPT models** via API (e.g., GPT-3.5, GPT-4); note that API usage is billed
- **Local LLMs** via Ollama (e.g., LLaMA, Mistral, Mixtral, Phi-3)

Choose the backend based on your computational resources and privacy needs. IRIDIS X offers a selection of GPUs (A100s and H100s) that can be used to run larger LLMs. For this tutorial, we will stick with CPUs and smaller LLMs.

---

## **Running Locally with Ollama**

To run OntoGPT with local LLMs using [Ollama](https://ollama.com), open a new terminal window and run the following (this cannot be run from within Jupyter):

```bash
module load ollama  
ollama serve &
```

Press enter and proceed to download an appropriate model using:

```bash
# Replace the model name with the one you want to use
ollama pull llama3.2:1b-instruct-q2_K
```

Confirm it is downloaded with:

```bash
ollama list
```

Now leave the terminal instance running. 

Congratulations, you are now hosting your own local instance of an open-weight LLM!
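
If you want to confirm from Python that the server is reachable, the sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models. It assumes the default address `http://localhost:11434`; `model_names` and `list_local_models` are hypothetical helper names introduced for this example.

```python
import json
import urllib.error
import urllib.request

# Assumption: Ollama is listening on its default address.
OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload):
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Return the names of local models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None


# Offline demonstration of the parsing step with a hand-written payload.
print(model_names({"models": [{"name": "gemma:7b"}]}))
```

Calling `list_local_models()` while `ollama serve` is running should return the same names that `ollama list` prints; a return value of `None` means the server could not be reached.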

## ⚙️ **OntoGPT Setup for This Hackathon**

You **do not need to install OntoGPT manually**: it is already pre-installed inside the Apptainer (formerly Singularity) container environment provided for this hackathon.

However, to run the `ontogpt` command from your terminal or notebooks, you need to **add the directory that contains the `ontogpt` executable to your `$PATH`**.

### 🔧 Step: Add OntoGPT to Your `$PATH`

In the notebook (or in the shell script that runs the container), prepend that directory to `PATH`. The cell below uses `../.local/bin`; if OntoGPT is installed elsewhere (for example, `/usr/local/bin`), substitute that path:

```python
%env PATH=../.local/bin:$PATH
```

```
env: PATH=../.local/bin:$PATH
```
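
The echoed value still contains a literal `$PATH`, which suggests the existing search path was not expanded. If other commands stop being found, a more defensive alternative is to edit `os.environ` directly. This is a minimal sketch; `prepend_to_path` is a hypothetical helper, and the directory is assumed to be the same one used in the cell above.

```python
import os


def prepend_to_path(directory, env=os.environ):
    """Prepend the absolute form of `directory` to PATH in `env`, skipping duplicates."""
    absolute = os.path.abspath(directory)
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if absolute not in parts:
        parts.insert(0, absolute)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


# Same directory as in the cell above, resolved against the current working directory.
prepend_to_path("../.local/bin")
```

Using an absolute path means the setting keeps working if the notebook's working directory changes later.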


Next, test that OntoGPT works as expected:

```python
!ontogpt extract --help
```

```
Usage: ontogpt extract [OPTIONS]

  Extract knowledge from text guided by schema, using SPIRES engine.

  Example:

      ontogpt extract -t gocam.GoCamAnnotations -i gocam-27929086.txt

  The input argument may be:     A file path,     A directory,     or a
  string. Use the -i/--input-file option followed by the path to the input
  file or directory. If the input is a directory, all files with the .txt
  extension will be read. This is not recursive. Otherwise, the input is
  assumed to be a string to be read as input.

  You can also use fragments of existing schemas, use the --target-class
  option (-T) to specify an alternative Container/root class.

  Example:

      ontogpt extract -t gocam.GoCamAnnotations -T GeneOrganismRelationship
      "the mouse Shh gene"

Options:
  -i, --inputfile TEXT            Path to a file containing input text.
  -t, --template TEXT             Template to use. This may be the name of a
                                  predefined template or a pat
```

You can now test OntoGPT with a Gemma model served by Ollama.

Gemma is a family of lightweight, state-of-the-art open models built by Google DeepMind. Several versions of Gemma exist, but for this exercise, please run the following in your terminal (make sure your Ollama server is running):

```bash
ollama pull gemma:7b
ollama list
```

To use an Ollama model with OntoGPT, specify it with `-m ollama/<model name>`. The model name must match the name shown by `ollama list`. Run the following command to test OntoGPT's `complete` function:

```python
!ontogpt complete -m ollama/gemma:7b  "Why did the squid cross the coral reef?"
```

```
To get to the other tentacles!
```


## 🤖 **Running OntoGPT**

### **Extracting information from scientific papers**

We can extract knowledge from text (e.g. a scientific paper or abstract) into a given structured format. The GO-CAM data model extracts information on genes, pathways, cells, etc. Below is the YAML schema that describes the classes to extract; the prompts are given in the `description` fields (or in `prompt` annotations). The extracted values are then standardised against a given ontology (e.g. HGNC for genes):

```yaml
id: http://w3id.org/ontogpt/gocam
name: gocam-template
title: GO-CAM Template
description: >-
  A template for GO-CAMs
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  CHEBI: http://purl.obolibrary.org/obo/CHEBI_
  CL: http://purl.obolibrary.org/obo/CL_
  EFO: http://www.ebi.ac.uk/efo/EFO_
  GO: http://purl.obolibrary.org/obo/GO_
  HGNC: http://identifiers.org/hgnc/
  NCBITaxon: http://purl.obolibrary.org/obo/NCBITAXON_
  PR: http://purl.obolibrary.org/obo/PR_
  PW: http://purl.obolibrary.org/obo/PW_
  UBERON: http://purl.obolibrary.org/obo/UBERON_
  UniProtKB: http://purl.uniprot.org/uniprot/
  gocam: http://w3id.org/ontogpt/gocam/
  linkml: https://w3id.org/linkml/

default_prefix: gocam
default_range: string

imports:
  - linkml:types
  - core

classes:
  GoCamAnnotations:
    tree_root: true
    attributes:
      genes:
        description: semicolon-separated list of genes
        multivalued: true
        range: Gene
      organisms:
        description: semicolon-separated list of organism taxons
        multivalued: true
        range: Organism
      gene_organisms:
        annotations:
          prompt: semicolon-separated list of asterisk separated gene to organism relationships
        multivalued: true
        range: GeneOrganismRelationship
      activities:
        description: semicolon-separated list of molecular activities
        multivalued: true
        range: MolecularActivity
      gene_functions:
        description: semicolon-separated list of gene to molecular activity relationships
        multivalued: true
        range: GeneMolecularActivityRelationship
      cellular_processes:
        description: semicolon-separated list of cellular processes
        multivalued: true
        range: CellularProcess
      pathways:
        description: semicolon-separated list of pathways
        multivalued: true
        range: Pathway
      gene_gene_interactions:
        description: semicolon-separated list of gene to gene interactions
        multivalued: true
        range: GeneGeneInteraction
      gene_localizations:
        description: >-
          semicolon-separated list of genes plus their location in the cell;
          for example, "gene1 / cytoplasm; gene2 / mitochondrion"
        multivalued: true
        range: GeneSubcellularLocalizationRelationship

  Gene:
    is_a: NamedEntity
    id_prefixes:
      - HGNC
      - PR
      - UniProtKB
    annotations:
      annotators: gilda:, bioportal:hgnc-nr
  Pathway:
    is_a: NamedEntity
    id_prefixes:
      - GO
      - PW
    annotations:
      annotators: sqlite:obo:go, sqlite:obo:pw
  CellularProcess:
    is_a: NamedEntity
    id_prefixes:
      - GO
    annotations:
      annotators: sqlite:obo:go
  MolecularActivity:
    is_a: NamedEntity
    id_prefixes:
      - GO
    annotations:
      annotators: sqlite:obo:go
  GeneLocation:
    is_a: NamedEntity
    id_prefixes:
      - GO
      - CL
      - UBERON
    annotations:
      annotators: "sqlite:obo:go, sqlite:obo:cl"
    slot_usage:
      id:
        values_from:
          - GOCellComponentType
          - CellType
  Organism:
    is_a: NamedEntity
    id_prefixes:
      - NCBITaxon
      - EFO
    annotations:
      annotators: gilda:, sqlite:obo:ncbitaxon
  Molecule:
    is_a: NamedEntity
    id_prefixes:
      - CHEBI
      - PR
    annotations:
      annotators: gilda:, sqlite:obo:chebi

  GeneOrganismRelationship:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
      organism:
        range: Organism

  GeneMolecularActivityRelationship:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
        annotations:
          prompt: the name of the gene in the pair. This comes first.
      molecular_activity:
        range: MolecularActivity
        annotations:
          prompt: the name of the molecular function in the pair. This comes second. May be a GO term.
    annotations:
      prompt.example: |-
        TODO

        gene: HGNC:1234
        molecular_activity: GO:0003674

  GeneMolecularActivityRelationship2:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
        annotations:
          prompt: the name of the gene.
      molecular_activity:
        range: MolecularActivity
        annotations:
          prompt: the name of the molecular activity, for example, ubiquitination. May be a GO term.
      target:
        range: Molecule
        annotations:
          prompt: the name of the molecular entity that is the target of the molecular activity.

  GeneSubcellularLocalizationRelationship:
    is_a: CompoundExpression
    attributes:
      gene:
        range: Gene
      location:
        range: GeneLocation

  GeneGeneInteraction:
    is_a: CompoundExpression
    attributes:
      gene1:
        range: Gene
      gene2:
        range: Gene

enums:
  GeneLocationEnum:
    inherits:
      - GOCellComponent
      - CellType

  GOCellComponentType:
    reachable_from:
      source_ontology: obo:go
      source_nodes:
        - GO:0005575 ## cellular_component
  CellType:
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000000 ## cell
```

The following command runs OntoGPT's `extract` function using Gemma 7B with the GO-CAM schema:

```python
!ontogpt extract -m ollama/gemma:7b -t gocam.GoCamAnnotations -i ../gocam-betacat.txt
```

```
ERROR:root:Line '- cGAS' does not contain a colon; ignoring
ERROR:root:Line '- STING' does not contain a colon; ignoring
ERROR:root:Line '- US3' does not contain a colon; ignoring
ERROR:root:Cannot find slot for gene-organism in Gene-organisms:;- β-catenin / HSV-1
ERROR:root:Line '- US3 / HSV-1' does not contain a colon; ignoring
ERROR:root:Cannot find slot for gene-gene_interaction in Gene-gene_interactions:;- Not mentioned in the provided text.
ERROR:root:Line 'The provided text does not include any information about genes or locations, so I am unable to extract the requested data from the given context.' does not contain a colon; ignoring
---
input_text: |-
  Title: β-Catenin Is Required for the cGAS/STING Signaling Pathway but Antagonized by the Herpes Simplex Virus 1 US3 Protein
  Text:
  The cGAS/STING-mediated DNA-sensing signaling pathway is crucial
  for interferon (IFN) production and host antiviral
  responses. Herpes simplex virus I (HSV-1) is a DNA virus that has
  evolved
```
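
The `does not contain a colon; ignoring` messages suggest that the model answered with a bulleted list rather than the `field: value; value` lines the prompt asks for, which is a common failure mode with smaller models. The sketch below is a deliberately simplified illustration of that behaviour, not OntoGPT's actual parser; `parse_response` and `FIELDS` are hypothetical names introduced for this example.

```python
FIELDS = {"genes", "organisms"}  # hypothetical subset of the schema's attributes


def parse_response(text, known_fields):
    """Parse 'field: v1; v2' lines into a dict; collect lines that cannot be parsed."""
    parsed, skipped = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ":" not in line:
            # Mirrors the "does not contain a colon; ignoring" situation.
            skipped.append(line)
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower().replace(" ", "_")
        if field not in known_fields:
            skipped.append(line)
            continue
        parsed[field] = [v.strip() for v in value.split(";") if v.strip()]
    return parsed, skipped


# A reply in the expected layout is parsed completely.
print(parse_response("genes: cGAS; STING; US3\norganisms: HSV-1", FIELDS))

# A bulleted reply loses every value: only the empty 'genes:' header is kept.
print(parse_response("genes:\n- cGAS\n- STING\n- US3", FIELDS))
```

In practice, this kind of output usually means the chosen model does not follow the format instructions reliably; trying a larger or instruction-tuned model is a reasonable next step.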

Note: The value accepted by the `-t` / `--template` argument is the base name of one of the LinkML schemas (data models) available to OntoGPT.  

Use the command `ontogpt list-templates` to see all templates, and pass the name shown in the first column to the `--template` option.  
The output of the above command can optionally be written to a file using the `-o` / `--output` option.  

```python
!ontogpt extract -m ollama/gemma:7b -t drug -i ../gocam-betacat.txt
```
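
When running several templates or input files, it can be convenient to assemble the command in Python. The following is a minimal sketch that uses only the options mentioned in this tutorial (`-m`, `-t`, `-i`, `-o`); `build_extract_command` and `run_extract` are hypothetical helpers, and `drug-output.yaml` is a made-up output file name.

```python
import shlex
import subprocess


def build_extract_command(template, input_path, model=None, output_path=None):
    """Return the argument list for an `ontogpt extract` call."""
    cmd = ["ontogpt", "extract", "-t", template, "-i", str(input_path)]
    if model:
        cmd += ["-m", model]
    if output_path:
        cmd += ["-o", str(output_path)]
    return cmd


def run_extract(cmd):
    """Run the command; requires ontogpt on PATH and a running Ollama server."""
    return subprocess.run(cmd, check=True)


cmd = build_extract_command(
    "drug",
    "../gocam-betacat.txt",
    model="ollama/gemma:7b",
    output_path="drug-output.yaml",  # hypothetical file name
)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.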

Now, try to extract information from a full text paper (converted to txt):

```python
!ontogpt extract -m ollama/gemma:7b -t gocam.GoCamAnnotations -i ../Med15_paper.txt
```
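
Once a run has finished, the structured result can be inspected programmatically. This is a minimal sketch, assuming the output was saved as JSON and that the extracted fields sit under an `extracted_object` key (an assumption about the output layout; check your own output file). The record below is a hand-written placeholder that reuses the example identifiers from the schema, not a real extraction result.

```python
import json

# Placeholder record, not real output; the layout is an assumption.
EXAMPLE_JSON = """
{
  "input_text": "placeholder text",
  "extracted_object": {
    "genes": ["HGNC:1234"],
    "gene_functions": [
      {"gene": "HGNC:1234", "molecular_activity": "GO:0003674"}
    ]
  }
}
"""


def flatten(obj, prefix=""):
    """Yield (path, value) pairs for every leaf in a nested JSON structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, obj


record = json.loads(EXAMPLE_JSON)
leaves = dict(flatten(record["extracted_object"]))
for path, value in leaves.items():
    print(f"{path} = {value}")
```

Flattening nested relationships into path/value pairs makes it easier to load results into a table or to compare runs from different models.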