# BERDL Data API Utilities

This notebook demonstrates how to use KBBERDLUtils to query the BERDL (Biological and Environmental Research Data Lake) API.

## Overview

BERDL provides access to KBase data through SQL queries. KBBERDLUtils offers:
- SQL query execution against Delta tables
- Genome data queries (contigs, features)
- Ontology data queries (reactions, compounds)
- Automatic pagination for large result sets

## Prerequisites

- **KBase Token**: You need a valid KBase token with the BERDL user role.
- Store the token in `~/.kbase/token`, or set it programmatically.
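
Before initializing anything, it can help to confirm that a token file is actually in place. The sketch below is a minimal check that assumes the file holds the token as plain text and nothing else; `read_kbase_token` is a hypothetical helper for illustration, not part of KBUtilLib.

```python
from pathlib import Path
from typing import Optional

# Location taken from the prerequisite above.
DEFAULT_TOKEN_PATH = Path.home() / ".kbase" / "token"


def read_kbase_token(path: Path = DEFAULT_TOKEN_PATH) -> Optional[str]:
    """Return the token stored at `path`, or None if it is missing or empty.

    Assumption: the file contains only the token as plain text.
    """
    try:
        token = path.read_text(encoding="utf-8").strip()
    except OSError:
        # Covers a missing file, a directory in its place, or no read permission.
        return None
    return token or None


token = read_kbase_token()
print("Token found" if token else f"No token at {DEFAULT_TOKEN_PATH}")
```

The helper never prints the token itself, which keeps it out of saved notebook output.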

## 1. Setup

In [None]:
import sys
from pathlib import Path

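# Assumes the notebook is run from a directory directly under the project root
project_root = Path.cwd().parent
src_path = project_root / "src"
# Make the local package importable without installing it
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))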

print(f"Project root: {project_root}")

## 2. Initialize KBBERDLUtils

In [None]:
from kbutillib import KBBERDLUtils

# Initialize BERDL utilities
util = KBBERDLUtils()

print(f"API URL: {util.api_url}")
print(f"Default limit: {util.berdl_default_limit}")
print(f"Timeout: {util.berdl_timeout}s")

## 3. Test Connection

Verify that we can connect to the BERDL API with our token:

In [None]:
# Test the connection
result = util.test_connection()

if result["success"]:
    print("Connected successfully!")
    print(f"API URL: {result['api_url']}")
else:
    print(f"Connection failed: {result['message']}")
    print("\nMake sure you have:")
    print("  1. A valid KBase token in ~/.kbase/token")
    print("  2. BERDL user role assigned to your account")

## 4. Query Genome Contigs

Query contig data from the `kbase_genomes.contig` table:

In [None]:
# Query contigs
result = util.query_contigs(limit=5)

if result["success"]:
    print(f"Retrieved {result['row_count']} contigs")
    print(f"Columns: {result['columns']}")
    print("\nData:")
    for row in result["data"]:
        print(f"  {row}")
else:
    print(f"Query failed: {result['error']}")
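
The query cells in this notebook all branch on a `success` flag and read `data` or `error` from the returned dict. If that pattern gets repetitive, one option is a small wrapper that raises on failure. This is only a sketch: `BERDLQueryError` and `unwrap_rows` are hypothetical names, and the sample dicts are made up to mirror the keys used above.

```python
from typing import Any, Dict, List


class BERDLQueryError(RuntimeError):
    """Hypothetical exception type; not part of KBBERDLUtils."""


def unwrap_rows(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the rows of a successful result dict, or raise on failure."""
    if not result.get("success"):
        raise BERDLQueryError(result.get("error", "unknown error"))
    return result.get("data", [])


# Made-up result dicts shaped like the ones handled in the cells above.
ok = {
    "success": True,
    "row_count": 1,
    "columns": ["contig_id"],
    "data": [{"contig_id": "example_contig"}],
}
failed = {"success": False, "error": "example failure"}

print(unwrap_rows(ok))
try:
    unwrap_rows(failed)
except BERDLQueryError as exc:
    print(f"Query failed: {exc}")
```

With a helper like this, a cell could read `rows = unwrap_rows(util.query_contigs(limit=5))` and let a failure surface as an exception instead of a printed message.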

## 5. Custom SQL Queries

Execute arbitrary SQL queries against BERDL:

In [None]:
# Custom query - find contigs by GC content
sql = """
SELECT contig_id, gc_content, length
FROM kbase_genomes.contig
WHERE gc_content > 0.5
ORDER BY gc_content DESC
"""

result = util.query(sql, limit=10)

if result["success"]:
    print(f"Retrieved {result['row_count']} contigs with GC > 50% (limit 10)")
    for row in result["data"]:
        print(f"  {row['contig_id']}: GC={row['gc_content']:.2%}, Length={row['length']}")
else:
    print(f"Query failed: {result['error']}")
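
The notebook passes raw SQL strings to `util.query` and does not show any parameter binding, so values that vary between runs have to be interpolated into the string. A cautious approach is to validate such values first. The sketch below rebuilds the GC-content query for a numeric threshold; `build_gc_query` is a hypothetical helper, and it assumes `gc_content` is stored as a fraction between 0 and 1, as the `> 0.5` filter and percent formatting above suggest.

```python
def build_gc_query(min_gc: float) -> str:
    """Build the GC-content query for a validated numeric threshold."""
    threshold = float(min_gc)  # raises ValueError or TypeError for non-numeric input
    if not 0.0 <= threshold <= 1.0:
        # Also rejects NaN, since comparisons with NaN are always false.
        raise ValueError("min_gc must be a fraction between 0 and 1")
    return (
        "SELECT contig_id, gc_content, length\n"
        "FROM kbase_genomes.contig\n"
        f"WHERE gc_content > {threshold}\n"
        "ORDER BY gc_content DESC"
    )


print(build_gc_query(0.6))
```

The resulting string can be passed to `util.query(sql, limit=10)` exactly like the hand-written query. Converting to `float` before formatting means arbitrary text never reaches the SQL.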

## 6. Query Ontology Data

Query ontology statements for reactions, compounds, and other terms:

In [None]:
# Query SEED reaction names
result = util.query_ontology_statements(
    subject_prefix="seed.reaction:",
    predicate="rdfs:label",
    limit=10
)

if result["success"]:
    print(f"Retrieved {result['row_count']} reaction labels")
    print("\nSEED Reactions:")
    for row in result["data"]:
        rxn_id = row.get("subject", "").replace("seed.reaction:", "")
        name = row.get("value", "")
        print(f"  {rxn_id}: {name}")
else:
    print(f"Query failed: {result['error']}")
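
For later lookups it is often handier to hold the labels in a dict than to print them. The sketch below does that using the same `subject` and `value` keys read in the cell above. `labels_by_id` is a hypothetical helper and the sample rows are placeholders, not real ontology entries.

```python
from typing import Dict, Iterable, Mapping


def labels_by_id(rows: Iterable[Mapping[str, str]], prefix: str) -> Dict[str, str]:
    """Map bare IDs (prefix stripped) to their labels, skipping other subjects."""
    labels: Dict[str, str] = {}
    for row in rows:
        subject = row.get("subject", "")
        if subject.startswith(prefix):
            # Strip only the leading prefix, not later occurrences.
            labels[subject[len(prefix):]] = row.get("value", "")
    return labels


# Placeholder rows shaped like the ontology rows printed above.
sample_rows = [
    {"subject": "seed.reaction:rxn_example_1", "value": "example label 1"},
    {"subject": "seed.reaction:rxn_example_2", "value": "example label 2"},
    {"subject": "other.prefix:term", "value": "ignored"},
]

print(labels_by_id(sample_rows, "seed.reaction:"))
```

If a subject appears more than once, the last label wins. Whether that can happen in the ontology table is not something this notebook establishes.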

## 7. Get Reaction and Compound Names

Convenience methods for common ontology queries:

In [None]:
# Get specific reaction names
reaction_ids = ["rxn00001", "rxn00002", "rxn00004"]
result = util.get_reaction_names(reaction_ids)

if result["success"]:
    print("Reaction Names:")
    for row in result["data"]:
        print(f"  {row['reaction_id']}: {row['reaction_name']}")
else:
    print(f"Query failed: {result['error']}")

In [None]:
# Get compound names
compound_ids = ["cpd00001", "cpd00002", "cpd00003"]
result = util.get_compound_names(compound_ids)

if result["success"]:
    print("Compound Names:")
    for row in result["data"]:
        print(f"  {row['compound_id']}: {row['compound_name']}")
else:
    print(f"Query failed: {result['error']}")
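
The cells above print whatever rows come back, so an ID with no match could go unnoticed. The notebook does not say how the convenience methods treat unmatched IDs; the sketch below simply compares the requested IDs with the returned rows. `missing_ids` is a hypothetical helper, and the sample row uses a placeholder name.

```python
from typing import Iterable, List, Mapping


def missing_ids(
    requested: Iterable[str],
    rows: Iterable[Mapping[str, str]],
    id_key: str,
) -> List[str]:
    """Return the requested IDs that have no row, in their original order."""
    returned = {row[id_key] for row in rows if id_key in row}
    return [item for item in requested if item not in returned]


# Placeholder row using the keys read in the compound cell above.
sample_rows = [{"compound_id": "cpd00001", "compound_name": "placeholder name"}]

print(missing_ids(["cpd00001", "cpd00002"], sample_rows, "compound_id"))
# → ['cpd00002']
```

The same helper works for reactions by passing `id_key="reaction_id"`.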

## 8. Paginated Queries

For large result sets, use pagination to fetch all data:

In [None]:
# Paginated query example
result = util.paginate_query(
    sql="SELECT * FROM kbase_genomes.contig WHERE length > 5000",
    page_size=100,
    max_pages=5  # Limit pages for demo
)

if result["success"]:
    print(f"Total rows retrieved: {result['row_count']}")
    print(f"Pages fetched: {result['pages_fetched']}")
    print("\nFirst 3 results:")
    for row in result["data"][:3]:
        print(f"  {row}")
else:
    print(f"Query failed: {result['error']}")
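
To make the `page_size` and `max_pages` parameters concrete, here is a generic limit/offset loop run against an in-memory list. It is a sketch of the general technique only: how `paginate_query` is implemented is not shown in this notebook, and `paginate` and `fetch_fake_page` are hypothetical names. With real SQL, offset-based paging is generally only reliable when the query has a deterministic `ORDER BY`.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def paginate(
    fetch_page: Callable[[int, int], List[Row]],
    page_size: int,
    max_pages: int,
) -> Dict[str, object]:
    """Collect rows page by page until a short page or max_pages is reached."""
    rows: List[Row] = []
    pages_fetched = 0
    while pages_fetched < max_pages:
        # fetch_page takes (limit, offset) and returns at most `limit` rows.
        page = fetch_page(page_size, pages_fetched * page_size)
        pages_fetched += 1
        rows.extend(page)
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
    return {"data": rows, "row_count": len(rows), "pages_fetched": pages_fetched}


# Stand-in data source: 250 fake rows instead of a real BERDL table.
fake_table = [{"contig_id": f"contig_{i}"} for i in range(250)]


def fetch_fake_page(limit: int, offset: int) -> List[Row]:
    return fake_table[offset:offset + limit]


demo = paginate(fetch_fake_page, page_size=100, max_pages=5)
print(demo["row_count"], demo["pages_fetched"])
# → 250 3
```

Lowering `max_pages` to 2 stops the loop after 200 rows, which is the trade-off the demo cell above makes by capping the page count.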

## 9. Configuration Options

KBBERDLUtils can be configured via `~/.kbutillib/config.yaml`. The cell below prints the active settings and an example `berdl` block:

In [None]:
# Display current configuration
print("Current BERDL Configuration:")
print(f"  Base URL: {util.berdl_base_url}")
print(f"  API Path: {util.berdl_api_path}")
print(f"  Full API URL: {util.api_url}")
print(f"  Default Limit: {util.berdl_default_limit}")
print(f"  Timeout: {util.berdl_timeout} seconds")

print("\nTo customize, add to ~/.kbutillib/config.yaml:")
print("""
berdl:
  base_url: https://hub.berdl.kbase.us
  api_path: /apis/mcp/delta/tables
  timeout: 120
  default_limit: 500
""")

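The relationship between these settings can be illustrated with plain dicts. The sketch below layers user overrides on top of a base configuration and derives a full URL from `base_url` and `api_path`. It rests on two assumptions: that `api_url` is simply the two values joined, and that the base values are the ones from the YAML example above (the library's real defaults are not stated here). `resolve_berdl_config` is a hypothetical helper.

```python
from typing import Any, Dict, Optional

# Values copied from the YAML example above; not confirmed library defaults.
EXAMPLE_BERDL_CONFIG: Dict[str, Any] = {
    "base_url": "https://hub.berdl.kbase.us",
    "api_path": "/apis/mcp/delta/tables",
    "timeout": 120,
    "default_limit": 500,
}


def resolve_berdl_config(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Apply overrides to the example config and derive the full API URL."""
    config = dict(EXAMPLE_BERDL_CONFIG)  # copy so the base values stay untouched
    config.update(overrides or {})
    # Assumption: the full URL is base_url and api_path joined by a single slash.
    config["api_url"] = (
        config["base_url"].rstrip("/") + "/" + config["api_path"].lstrip("/")
    )
    return config


config = resolve_berdl_config({"timeout": 300})
print(config["api_url"])
print(config["timeout"])
```

Only the keys present in the override dict change; everything else falls back to the base values, which mirrors how a partial `berdl` block in a YAML file is usually expected to behave.
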
## Summary

KBBERDLUtils provides:
- **SQL Queries** - Execute arbitrary queries against BERDL
- **Genome Data** - Query contigs, features, and assemblies
- **Ontology Data** - Access reactions, compounds, and annotations
- **Pagination** - Handle large result sets automatically
- **Configuration** - Customize API endpoints and timeouts

### Performance Notes

- **JupyterHub login**: Queries run on your personal cluster (faster)
- **No login**: Queries run on the shared cluster (slower)
- **KBase Apps**: Queries always run on the shared cluster

### Next Steps

- Explore available tables via the [BERDL Swagger Docs](https://hub.berdl.kbase.us/apis/mcp/docs)
- Build custom queries for your research needs
- Integrate BERDL data with other KBUtilLib utilities