# NB01: PHB Gene Discovery and Data Exploration

**Purpose**: Identify all polyhydroxybutyrate (PHB) pathway gene clusters across the pangenome (27K species, 293K genomes) and explore the NMDC schema to see what data is available for metagenomic analysis.

**Requires**: BERDL JupyterHub (Spark session)

**Outputs**:
- `data/phb_gene_clusters.tsv` — all PHB pathway gene clusters with species, KEGG KO, core/aux status
- `data/phb_species_summary.tsv` — per-species PHB pathway completeness
- NMDC schema exploration results (inline)

## PHB Pathway Markers

| Gene | Function | KEGG KO | Specificity |
|------|----------|---------|-------------|
| phaC | PHA synthase (committed step) | K03821 | **PHB-specific** |
| phaP | Phasin (granule protein) | K14205 | PHB-specific |
| phaR | PHB transcriptional regulator | K18080 | PHB-specific |
| phaZ | PHB depolymerase | K05973 | PHB-specific |
| phaA | Beta-ketothiolase | K00626 | Shared with fatty acid metabolism |
| phaB | Acetoacetyl-CoA reductase | K00023 | Shared with SDR family |

In [None]:
# Initialize Spark session
# On BERDL JupyterHub — no import needed (injected into kernel)
spark = get_spark_session()
print(f'Spark session active: {spark.version}')

In [None]:
import os
import pandas as pd
import numpy as np

# Project paths
PROJECT_DIR = os.path.expanduser('~/BERIL-research-observatory/projects/phb_granule_ecology')
DATA_DIR = os.path.join(PROJECT_DIR, 'data')
FIG_DIR = os.path.join(PROJECT_DIR, 'figures')
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)
print(f'Project dir: {PROJECT_DIR}')

## Part 1: PHB Gene Discovery in the Pangenome

Search `eggnog_mapper_annotations` for PHB pathway genes using KEGG KO identifiers.

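**Important**: The `KEGG_ko` column can hold several comma-separated KOs in one row, so the queries use `LIKE` substring patterns rather than exact matches. An exact match would silently drop any gene cluster annotated with more than one KO.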
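
As a minimal sketch of why exact matching fails on multi-KO cells, the snippet below splits a cell into tokens and checks membership. The helper names (`split_kos`, `phb_genes_in`, `like_clause`) are hypothetical, the example cell is invented, and the `ko:` prefix and comma separator are assumptions about how `KEGG_ko` is formatted rather than something confirmed in this notebook.

```python
# Hypothetical helpers for illustration; not part of the notebook's pipeline.
# Assumption: KEGG_ko cells look like 'ko:K00626,ko:K99999' (prefix + commas).
PHB_KOS = {
    'K03821': 'phaC',
    'K00023': 'phaB',
    'K00626': 'phaA',
    'K05973': 'phaZ',
    'K14205': 'phaP',
    'K18080': 'phaR',
}


def split_kos(kegg_ko):
    """Split a KEGG_ko cell into bare KO identifiers such as 'K00626'."""
    if not isinstance(kegg_ko, str):  # None, NaN, or other non-string values
        return []
    tokens = [t.strip() for t in kegg_ko.split(',')]
    return [t[3:] if t.startswith('ko:') else t for t in tokens if t and t != '-']


def phb_genes_in(kegg_ko):
    """Return every PHB gene whose KO appears as a whole token, sorted by name."""
    return sorted({PHB_KOS[ko] for ko in split_kos(kegg_ko) if ko in PHB_KOS})


def like_clause(column, kos):
    """Build an OR-joined SQL LIKE filter from a list of KO identifiers."""
    return ' OR '.join(f"{column} LIKE '%{ko}%'" for ko in kos)


example = 'ko:K00626,ko:K99999'  # invented multi-KO cell
print(example == 'ko:K00626')    # exact comparison misses the multi-KO cell
print(phb_genes_in(example))     # token-level membership still finds phaA
print(like_clause('ann.KEGG_ko', ['K03821', 'K00626']))
```

Unlike a single-label assignment, `phb_genes_in` reports every PHB gene present in a cell, which may be useful if some clusters carry more than one PHB KO. Generating the `LIKE` filter from the KO dictionary keeps the SQL and the marker list from drifting apart.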

In [None]:
# Define PHB pathway KEGG KOs
PHB_KOS = {
    'K03821': 'phaC - PHA synthase (committed step)',
    'K00023': 'phaB - acetoacetyl-CoA reductase',
    'K00626': 'phaA - beta-ketothiolase',
    'K05973': 'phaZ - PHB depolymerase',
    'K14205': 'phaP - phasin (granule protein)',
    'K18080': 'phaR - PHB transcriptional regulator',
}

# Also search by description keywords as a cross-check
PHB_KEYWORDS = [
    'polyhydroxybutyrate', 'polyhydroxyalkanoate', 'phasin',
    'pha synthase', 'phb synthase', 'poly-beta-hydroxybutyrate',
    'acetoacetyl-coa reductase',
]

print('PHB pathway markers defined')
for ko, desc in PHB_KOS.items():
    print(f'  {ko}: {desc}')

In [None]:
# Query 1: Find all gene clusters with PHB-related KEGG KOs
# Join eggnog_mapper_annotations with gene_cluster to get species and core/aux status

phb_clusters_df = spark.sql("""
    SELECT gc.gtdb_species_clade_id,
           gc.gene_cluster_id,
           gc.is_core,
           gc.is_auxiliary,
           gc.is_singleton,
           ann.KEGG_ko,
           ann.COG_category,
           ann.EC,
           ann.PFAMs,
           ann.Description
    FROM kbase_ke_pangenome.gene_cluster gc
    JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann
        ON gc.gene_cluster_id = ann.query_name
    WHERE ann.KEGG_ko LIKE '%K03821%'
       OR ann.KEGG_ko LIKE '%K00023%'
       OR ann.KEGG_ko LIKE '%K00626%'
       OR ann.KEGG_ko LIKE '%K05973%'
       OR ann.KEGG_ko LIKE '%K14205%'
       OR ann.KEGG_ko LIKE '%K18080%'
""")

# Cache for reuse
phb_clusters_df.cache()
n_clusters = phb_clusters_df.count()
print(f'Found {n_clusters:,} PHB-related gene clusters across all species')

In [None]:
# Convert to pandas for analysis (should be manageable — one row per gene cluster per species)
phb_pd = phb_clusters_df.toPandas()
print(f'Shape: {phb_pd.shape}')
print(f'\nUnique species: {phb_pd["gtdb_species_clade_id"].nunique():,}')
print(f'\nKEGG KO distribution:')

# Parse which PHB KOs are present in each row
for ko, desc in PHB_KOS.items():
    mask = phb_pd['KEGG_ko'].str.contains(ko, na=False)
    n = mask.sum()
    n_species = phb_pd.loc[mask, 'gtdb_species_clade_id'].nunique()
    print(f'  {ko} ({desc.split(" - ")[0]}): {n:,} clusters in {n_species:,} species')

In [None]:
# Query 2: Cross-check with description-based search
# This catches annotations that might have PHB function but different KO assignments

desc_conditions = ' OR '.join(
    [f"LOWER(ann.Description) LIKE '%{kw}%'" for kw in PHB_KEYWORDS]
)

phb_desc_df = spark.sql(f"""
    SELECT ann.Description, ann.KEGG_ko, ann.COG_category, ann.PFAMs,
           COUNT(*) as n_clusters,
           COUNT(DISTINCT gc.gtdb_species_clade_id) as n_species
    FROM kbase_ke_pangenome.gene_cluster gc
    JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann
        ON gc.gene_cluster_id = ann.query_name
    WHERE {desc_conditions}
    GROUP BY ann.Description, ann.KEGG_ko, ann.COG_category, ann.PFAMs
    ORDER BY n_clusters DESC
""")

phb_desc_pd = phb_desc_df.toPandas()
print(f'Description-based search found {len(phb_desc_pd)} distinct annotation groups')
print(f'Total clusters: {phb_desc_pd["n_clusters"].sum():,}')
print(f'\nTop annotations:')
phb_desc_pd.head(20)

In [None]:
# Assign each gene cluster a PHB gene label based on which KO it matches
# Priority: phaC > phaZ > phaP > phaR > phaB > phaA (PHB-specific markers first, shared enzymes last)
# A row carrying several PHB KOs receives only the highest-priority label

def assign_phb_gene(kegg_ko):
    """Assign PHB gene name based on KEGG KO content."""
    if pd.isna(kegg_ko):
        return 'unknown'
    ko = str(kegg_ko)
    if 'K03821' in ko: return 'phaC'
    if 'K05973' in ko: return 'phaZ'
    if 'K14205' in ko: return 'phaP'
    if 'K18080' in ko: return 'phaR'
    if 'K00023' in ko: return 'phaB'
    if 'K00626' in ko: return 'phaA'
    return 'unknown'

phb_pd['phb_gene'] = phb_pd['KEGG_ko'].apply(assign_phb_gene)
print('PHB gene assignments:')
print(phb_pd['phb_gene'].value_counts())

In [None]:
# Core/Accessory/Singleton status by PHB gene
status_summary = phb_pd.groupby('phb_gene').agg(
    n_clusters=('gene_cluster_id', 'count'),
    n_species=('gtdb_species_clade_id', 'nunique'),
    pct_core=('is_core', lambda x: (x == 1).mean() * 100),
    pct_aux=('is_auxiliary', lambda x: (x == 1).mean() * 100),
    pct_singleton=('is_singleton', lambda x: (x == 1).mean() * 100),
).round(1)

print('PHB gene clusters — core/accessory/singleton status:')
status_summary

In [None]:
# Save gene cluster data
out_path = os.path.join(DATA_DIR, 'phb_gene_clusters.tsv')
phb_pd.to_csv(out_path, sep='\t', index=False)
print(f'Saved {len(phb_pd):,} rows to {out_path}')

## Part 2: Species-Level PHB Pathway Completeness

Classify each species by PHB pathway status:
- **Complete**: has phaC + at least one of (phaA, phaB)
- **Synthase only**: has phaC but no phaA/phaB
- **Precursors only**: has phaA and/or phaB but no phaC
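- **Accessory only**: has only phaP, phaR, and/or phaZ (no phaC, phaA, or phaB)
- **Absent**: no PHB pathway genes detected (these species have no rows in the gene cluster table and are counted by subtraction from the total species count)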
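
The rules above can be sketched on toy data, including one way to label absent species explicitly instead of only counting them. This is an illustration under assumptions: the species IDs and hits are invented, and the `classify` helper and the left join against a full species list are hypothetical additions, not the notebook's own method.

```python
import pandas as pd

# Toy inputs; species IDs and gene hits are invented for illustration.
all_species = pd.DataFrame(
    {'gtdb_species_clade_id': ['sp1', 'sp2', 'sp3', 'sp4', 'sp5']}
)
hits = pd.DataFrame({
    'gtdb_species_clade_id': ['sp1', 'sp1', 'sp2', 'sp3', 'sp4'],
    'phb_gene': ['phaC', 'phaA', 'phaC', 'phaB', 'phaP'],
})


def classify(genes):
    """Map a set of PHB gene names to a pathway status label."""
    if not genes:
        return 'absent'
    has_synthase = 'phaC' in genes
    has_precursor = bool({'phaA', 'phaB'} & genes)
    if has_synthase and has_precursor:
        return 'complete'
    if has_synthase:
        return 'synthase_only'
    if has_precursor:
        return 'precursors_only'
    return 'accessory_only'  # only phaP/phaR/phaZ


# One set of gene names per species that has at least one hit.
gene_sets = (
    hits.groupby('gtdb_species_clade_id')['phb_gene']
    .apply(set)
    .rename('genes')
    .reset_index()
)

# Left join keeps species without hits; their 'genes' value is NaN.
status = all_species.merge(gene_sets, on='gtdb_species_clade_id', how='left')
status['genes'] = status['genes'].apply(lambda g: g if isinstance(g, set) else set())
status['phb_status'] = status['genes'].apply(classify)
print(status[['gtdb_species_clade_id', 'phb_status']])
```

Joining against the full species list gives every species a row, so downstream tables would not need to reconstruct the absent group from a count.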

In [None]:
# Aggregate to species level: which PHB genes does each species have?
species_genes = phb_pd.groupby('gtdb_species_clade_id')['phb_gene'].apply(set).reset_index()
species_genes.columns = ['gtdb_species_clade_id', 'phb_genes_present']

def classify_phb_status(gene_set):
    has_phaC = 'phaC' in gene_set
    has_phaA = 'phaA' in gene_set
    has_phaB = 'phaB' in gene_set
    if has_phaC and (has_phaA or has_phaB):
        return 'complete'
    elif has_phaC:
        return 'synthase_only'
    elif has_phaA or has_phaB:
        return 'precursors_only'
    else:
        return 'accessory_only'  # only phaP/phaR/phaZ

species_genes['phb_status'] = species_genes['phb_genes_present'].apply(classify_phb_status)
species_genes['phb_genes_str'] = species_genes['phb_genes_present'].apply(lambda s: ','.join(sorted(s)))

print('Species PHB pathway status:')
print(species_genes['phb_status'].value_counts())
print(f'\nTotal species with any PHB gene: {len(species_genes):,}')

In [None]:
# Get total species count to calculate PHB-absent species
# Assumes kbase_ke_pangenome.pangenome holds exactly one row per species
total_species = spark.sql("""
    SELECT COUNT(*) as n FROM kbase_ke_pangenome.pangenome
""").collect()[0]['n']

n_phb_any = len(species_genes)
n_phb_absent = total_species - n_phb_any

print(f'Total species in pangenome: {total_species:,}')
print(f'Species with any PHB gene: {n_phb_any:,} ({n_phb_any/total_species*100:.1f}%)')
print(f'Species with no PHB genes: {n_phb_absent:,} ({n_phb_absent/total_species*100:.1f}%)')

In [None]:
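# Get phaC core/accessory status per species (is the synthase itself core or accessory?)
# 'max' flags a species if ANY of its phaC clusters has that status, so a species
# with several phaC clusters can be counted as both core and accessory below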
phac_status = phb_pd[phb_pd['phb_gene'] == 'phaC'].groupby('gtdb_species_clade_id').agg(
    n_phaC_clusters=('gene_cluster_id', 'count'),
    phaC_is_core=('is_core', 'max'),
    phaC_is_aux=('is_auxiliary', 'max'),
).reset_index()

# Merge with species-level summary
species_summary = species_genes.merge(phac_status, on='gtdb_species_clade_id', how='left')

print('\nAmong species with phaC:')
phac_species = species_summary[species_summary['phb_status'].isin(['complete', 'synthase_only'])]
print(f'  Total: {len(phac_species):,}')
print(f'  phaC is core: {(phac_species["phaC_is_core"] == 1).sum():,}')
print(f'  phaC is accessory: {(phac_species["phaC_is_aux"] == 1).sum():,}')

In [None]:
# Save species summary
out_path = os.path.join(DATA_DIR, 'phb_species_summary.tsv')
species_summary.to_csv(out_path, sep='\t', index=False)
print(f'Saved {len(species_summary):,} species to {out_path}')

## Part 3: NMDC Schema Exploration

Explore the `nmdc_arkin` database to understand what per-sample functional annotation data is available for the metagenomic arm of the analysis.

In [None]:
# List all tables in nmdc_arkin
nmdc_tables = spark.sql("SHOW TABLES IN nmdc_arkin").toPandas()
print(f'NMDC tables: {len(nmdc_tables)}')
nmdc_tables

In [None]:
# Check trait_features columns — do any relate to PHB/PHA?
trait_schema = spark.sql("DESCRIBE nmdc_arkin.trait_features").toPandas()
print('trait_features columns:')
for _, row in trait_schema.iterrows():
    col = row['col_name']
    if any(kw in col.lower() for kw in ['pha', 'phb', 'poly', 'granule', 'storage', 'carbon', 'ferment']):
        print(f'  *** {col} ({row["data_type"]})')
    else:
        print(f'      {col} ({row["data_type"]})')

In [None]:
# Check abiotic_features schema — what environmental measurements are available?
abiotic_schema = spark.sql("DESCRIBE nmdc_arkin.abiotic_features").toPandas()
print(f'abiotic_features columns ({len(abiotic_schema)}):')
for _, row in abiotic_schema.iterrows():
    print(f'  {row["col_name"]} ({row["data_type"]})')

In [None]:
# Check study_table — what NMDC studies are available?
studies = spark.sql("SELECT * FROM nmdc_arkin.study_table").toPandas()
print(f'NMDC studies: {len(studies)}')
studies

In [None]:
# Check taxonomy_features — sample count and structure
tax_count = spark.sql("SELECT COUNT(*) as n FROM nmdc_arkin.taxonomy_features").collect()[0]['n']
tax_schema = spark.sql("DESCRIBE nmdc_arkin.taxonomy_features").toPandas()
print(f'taxonomy_features: {tax_count:,} samples, {len(tax_schema)} columns')
print('\nFirst 10 columns:')
tax_schema.head(10)

In [None]:
# Check if there are per-sample functional annotation tables we might have missed
# Look for tables with 'annot', 'ko', 'kegg', 'function', 'gene', or 'contig' in the name
# (substring match, so short patterns such as 'ko' may also pick up unrelated tables)
annotation_tables = nmdc_tables[nmdc_tables['tableName'].str.contains(
    'annot|ko|kegg|function|gene|contig', case=False, na=False
)]
print('Potential per-sample annotation tables:')
for _, row in annotation_tables.iterrows():
    tname = row['tableName']
    try:
        cnt = spark.sql(f"SELECT COUNT(*) as n FROM nmdc_arkin.{tname}").collect()[0]['n']
        print(f'  {tname}: {cnt:,} rows')
    except Exception as e:
        print(f'  {tname}: error - {e}')

In [None]:
# Check KEGG KO terms — verify our PHB KOs exist in NMDC
phb_ko_list = "', '".join(PHB_KOS.keys())
nmdc_kos = spark.sql(f"""
    SELECT * FROM nmdc_arkin.kegg_ko_terms 
    WHERE term_id IN ('{phb_ko_list}')
""").toPandas()
print('PHB KEGG KOs in NMDC reference:')
nmdc_kos

In [None]:
# Check metabolomics_gold for 3-hydroxybutyrate
hb_metabolites = spark.sql("""
    SELECT DISTINCT compound_name, compound_id
    FROM nmdc_arkin.metabolomics_gold
    WHERE LOWER(compound_name) LIKE '%hydroxybutyr%'
       OR LOWER(compound_name) LIKE '%phb%'
    LIMIT 20
""").toPandas()
print('3-hydroxybutyrate-related metabolites in NMDC:')
hb_metabolites

In [None]:
# Check ncbi_env harmonized_name categories (for pangenome environment data)
env_categories = spark.sql("""
    SELECT harmonized_name, COUNT(*) as n
    FROM kbase_ke_pangenome.ncbi_env
    GROUP BY harmonized_name
    ORDER BY n DESC
""").toPandas()
print('NCBI environment metadata categories:')
env_categories

## Summary and Next Steps

### Pangenome Results
- Total PHB gene clusters found: ?
- Species with phaC (committed step): ?
- Species with complete pathway (phaC + phaA/B): ?
- phaC core vs accessory breakdown: ?

### NMDC Data Availability
- Per-sample functional annotation tables: ?
- Trait features relevant to PHB: ?
- Abiotic features available: ?
- 3-hydroxybutyrate in metabolomics: ?

### Next Notebook (NB02)
Map PHB pathway completeness across the GTDB tree using taxonomy data.