- Allen Baron (University of Maryland - Baltimore)
- Yibei Chen (MIT)
- Anne Ketter (Computercraft)
- Samarpan Mohanty (University of Nebraska - Lincoln)
- Evan Molinelli (Chan Zuckerberg Initiative)
- Van Truong (University of Pennsylvania)
Biomedical knowledge graphs (KGs) are powerful tools for linking genes, diseases, and phenotypes — but when AI models generate new edges, they often hallucinate or introduce errors. Our project focuses on pruning these errors. We show how combining human review, grounded AI, and graph learning can work together to keep biomedical knowledge graphs accurate and trustworthy.
The KG Model Garbage Collection Tool is a proof-of-concept that:
- Starts with a trusted subset of the Monarch KG (with edges randomly removed).
- Fills in missing edges using three approaches to simulate a real-world, messy knowledge graph: random assignment (control), a general LLM, and an LLM with biomedical RAG.
- Invites participants, such as subject-matter experts (SMEs), to validate a sample of these edges through a simple interface, measuring how close each method comes to the truth.
- Uses this validation data to train a graph neural network (GNN) that learns to spot questionable edges and flag them for review and removal.
- Tests the resulting knowledge graph against the original, trusted knowledge graph.
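The loop above can be sketched in a few lines of Python. This is a toy illustration with made-up data and function names (not code from this repo), and the random-predicate guess stands in for all three assignment strategies:

```python
import random

def simulate_pipeline(trusted_edges, removal_fraction=0.2, seed=0):
    """Toy sketch of the PoC loop: hold out edges, refill them, score the refill."""
    rng = random.Random(seed)
    held_out = rng.sample(trusted_edges, int(len(trusted_edges) * removal_fraction))

    # Stand-in for the three refill strategies (random control / LLM / LLM + RAG):
    # here we simply guess a random predicate for each broken (subject, object) pair.
    predicates = sorted({p for _, p, _ in trusted_edges})
    refilled = [(s, rng.choice(predicates), o) for s, _, o in held_out]

    # "Human review" in this toy is exact matching against the trusted edge.
    correct = sum(guess == truth for guess, truth in zip(refilled, held_out))
    return correct / max(len(held_out), 1)

toy_kg = [("GeneA", "associated_with", "Disease1"),
          ("GeneB", "causes", "Disease2"),
          ("Disease1", "has_phenotype", "Pheno1"),
          ("GeneC", "associated_with", "Disease2"),
          ("GeneD", "causes", "Disease3")]
accuracy = simulate_pipeline(toy_kg, removal_fraction=0.4)
```

In the real tool the refill step is an LLM (with or without RAG) and the review step is a human in the interface; the GNN then learns from those review decisions.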
Large Language Models (LLMs) are increasingly used to scale up knowledge graphs, but they introduce errors and hallucinations. In biomedicine, these mistakes can have real-world consequences. While Human-in-the-Loop (HITL) approaches can mitigate risks, they are not scalable solutions for large, complex knowledge graphs.
The KG Model Garbage Collection Tool provides a proof-of-concept (PoC) framework allowing curators to probe a KG using real scientific questions, provide feedback, and use that feedback to train a GNN. This tool extends the impact of human curation by learning from expert human validation patterns.

- Collaboratively find and remove problem edges.
- Isolate part of a large knowledge graph to curate a smaller, workable dataset.
- Teach a GNN to find problems and curate only the problems it identifies.
- Check the flagged problems manually and iterate until the team agrees on what to prune.
- Python 3.8 or higher
- Node.js 14.x or higher (for frontend components)
- AWS CLI configured with appropriate credentials
- Access to PubMed E-utilities (for RAG functionality)
Key Python dependencies include: `pandas`, `numpy`, `boto3` (AWS SDK), `sentence-transformers`, `chromadb`, `langchain`, and `requests`.
Configure AWS credentials for Bedrock access:
```shell
aws configure
```
Ensure access to the following AWS services:
- Amazon Bedrock (for LLM inference)
- Appropriate IAM permissions for model access
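As a sketch of what LLM inference through Bedrock looks like from Python, the snippet below uses `boto3`'s `bedrock-runtime` client with Anthropic's Bedrock request schema. The model ID is an example; use whichever model your account has enabled:

```python
import json

def build_claude_body(prompt: str, max_tokens: int = 256) -> str:
    """Build the request body for an Anthropic model on Bedrock."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

def ask_bedrock(prompt: str,
                model_id: str = "anthropic.claude-3-haiku-20240307-v1:0") -> str:
    """Send one prompt to Bedrock and return the model's text reply."""
    import boto3  # deferred so the helper above works without AWS installed
    client = boto3.client("bedrock-runtime")  # region/profile read from your AWS config
    resp = client.invoke_model(modelId=model_id, body=build_claude_body(prompt))
    payload = json.loads(resp["body"].read())
    return payload["content"][0]["text"]
```

Calling `ask_bedrock("...")` requires the credentials and model-access permissions described above.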
- Clone the repository:

  ```shell
  git clone https://github.com/collaborativebioinformatics/Model_Garbage_Collection.git
  cd Model_Garbage_Collection
  ```

- Install dependencies:

  ```shell
  pip install .
  ```

- Configure environment variables:

  ```shell
  export AWS_REGION=your-region
  export AWS_PROFILE=your-profile
  ```
Here's an overview of our filetree in this repo.
```
Model_Garbage_Collection/
├── app/
│   └── frontend/                 # React frontend application
├── data/                         # Input datasets
│   ├── alzheimers_nodes.json
│   └── alzheimers_triples.csv
├── notebooks/                    # Jupyter notebooks for analysis
│   └── model_testing.ipynb
├── outputs/                      # Generated results and datasets
│   └── cytoscape/                # Graph visualization files
├── src/                          # Core source code
│   ├── gnn/                      # Graph Neural Network components
│   │   ├── lcilp/                # Link prediction implementation
│   │   │   ├── data/             # Training datasets
│   │   │   ├── ensembling/       # Model ensemble methods
│   │   │   │   ├── blend.py
│   │   │   │   ├── compute_auc.py
│   │   │   │   └── score_triplets_kge.py
│   │   │   ├── kge/              # Knowledge graph embeddings
│   │   │   │   ├── dataloader.py
│   │   │   │   ├── model.py
│   │   │   │   └── run.py
│   │   │   ├── managers/         # Training and evaluation
│   │   │   │   ├── evaluator.py
│   │   │   │   └── trainer.py
│   │   │   ├── model/            # Neural network architectures
│   │   │   │   └── dgl/
│   │   │   │       ├── aggregators.py
│   │   │   │       ├── graph_classifier.py
│   │   │   │       ├── layers.py
│   │   │   │       └── rgcn_model.py
│   │   │   ├── subgraph_extraction/  # Graph sampling
│   │   │   │   ├── datasets.py
│   │   │   │   ├── graph_sampler.py
│   │   │   │   └── multicom.py
│   │   │   ├── utils/            # Utility functions
│   │   │   ├── graph_sampler.py
│   │   │   ├── score_edges.py
│   │   │   ├── train.py
│   │   │   └── test_*.py
│   │   ├── extract.py
│   │   ├── model.py
│   │   └── README_HITL.md
│   ├── knowledge-graph/          # Knowledge graph processing
│   │   ├── create_cytoscape_files.py
│   │   ├── download_nodes.py
│   │   ├── download.py
│   │   ├── extract.py
│   │   ├── synthetic_llm.py
│   │   ├── synthetic_random.py
│   │   └── triples_to_csv.py
│   └── ux/                       # User experience components
│       ├── chat.py
│       └── select_edges_for_review.py
├── Edge_Assignor.ipynb           # Main RAG pipeline notebook
├── main.py                       # Main application entry point
├── logo.svg
├── pyproject.toml                # Python project configuration
└── README.md                     # Project documentation
```
- Knowledge Graph Extraction: Download subgraphs from the Monarch Knowledge Graph, including node metadata (identifiers, labels, descriptions) - src/knowledge-graph/download.py
- Data Preprocessing: Convert graph triples from JSON to structured CSV format for analysis - src/knowledge-graph/triples_to_csv.py
- Edge Removal & Assignment Methodologies: Systematically remove a percentage of edges from trusted graph data to create incomplete subgraphs, then reassign them using three strategies (random control, general LLM, LLM with biomedical RAG) to create the test KGs - Edge_Assignor.ipynb. See README-Edge_Assignor.md for details on the RAG pipeline.
- Validation Framework: Compare predicted edges against ground truth using exact matching and validation scoring
- Graph Neural Network Training: Extract graph backbones for GNN input and training on validation patterns
- Extract Graph Backbone:

  ```shell
  python src/knowledge-graph/extract.py
  ```

- Prepare Training Data (see src/gnn/README_HITL.md for details):

  ```shell
  bash src/gnn/run_hitl_prep.sh
  ```

- Train GNN Model (see src/gnn/lcilp/README.md for details):

  ```shell
  python src/gnn/lcilp/train.py
  ```
Simulated Human Curation: A Python script that simulates human review by comparing assigned edges against ground truth, generating curated datasets for GNN training - src/human_simulator.py
- Ground Truth Comparison: Systematic comparison against trusted Monarch KG data
- Accuracy Metrics: Predicate matching rates, precision, and recall calculations
- Error Analysis: Categorization of prediction errors and failure modes
- Human Validation Interface: Prototype of an interactive web browser tool to collect expert review and feedback
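As a sketch of the metrics above (our own helper, not code from this repo), precision, recall, and a predicate-match rate over triples can be computed like this:

```python
def edge_metrics(predicted, truth):
    """Precision/recall over (subject, predicate, object) triples, plus the
    predicate-match rate for edges whose endpoints exist in the ground truth."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # exact-match true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0

    # Predicate matching: among predicted edges whose (subject, object) pair
    # appears in the ground truth, how often is the predicate also correct?
    # (Assumes one predicate per pair, which holds for this simple sketch.)
    truth_pairs = {(s, o): p for s, p, o in truth}
    scored = [p == truth_pairs[(s, o)]
              for s, p, o in predicted if (s, o) in truth_pairs]
    predicate_match = sum(scored) / len(scored) if scored else 0.0
    return {"precision": precision, "recall": recall,
            "predicate_match": predicate_match}
```

Separating exact-match precision/recall from predicate matching helps distinguish "wrong relationship" errors from outright hallucinated edges.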
Prepare cytoscape visualization files: Create files for visualization in Cytoscape with node & edge data for each rebuilt knowledge graph & associated backbones - src/knowledge-graph/create_cytoscape_files.py
The GUI we built is a simulated example.
We built this prototype over 3 days as a hackathon team. We're stoked about it and are considering extending it in the future. We welcome any contributors or folks who want to continue building off our proof-of-concept.
We welcome contributions from the biomedical informatics and AI research communities. Please submit feedback and requests as 'issues'!
See LICENSE file.
The KG Model Garbage Collection tool uses and displays data and algorithms from the Monarch Initiative. The Monarch Initiative (https://monarchinitiative.org) makes biomedical knowledge exploration more efficient and effective by providing tools for genotype-phenotype analysis, genomic diagnostics, and precision medicine across broad areas of disease. We acknowledge the contributions of domain experts and the broader biomedical informatics community.