# NB04: NMDC Metagenomic Analysis of PHB Pathway Prevalence

**Purpose**: Test PHB pathway prevalence across NMDC environments using independent metagenomic data.

**Requires**: BERDL JupyterHub (Spark session)

**Inputs**:
- `data/phb_species_summary.tsv` from NB01
- `data/phb_by_taxonomy.tsv` from NB02

**Outputs**:
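- `data/nmdc_phb_prevalence.tsv`: per-sample PHB inference scores
- `data/nmdc_abiotic_correlations.tsv`: Spearman correlations between PHB score and abiotic variables (written only if any are computed)
- `data/nmdc_study_summary.tsv`: study-level PHB summary (planned; no cell below writes it yet)
- `figures/nmdc_phb_by_environment.png`: PHB score distribution and pangenome matching coverage across NMDC samples
- `figures/nmdc_phb_vs_abiotic.png`: PHB score vs. the top-correlated abiotic variables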

## Strategy

Two paths depending on data availability (explored in NB01):
- **Path A** (preferred): If per-sample functional annotations (KO counts) exist, directly count PHB pathway KOs
- **Path B** (fallback): Infer PHB capability from taxonomic composition × pangenome PHB status
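
If Path A turns out to be available, the direct count could look roughly like the sketch below. It assumes a hypothetical long-format table with columns `sample_id`, `ko_id`, and `count`; none of these names are confirmed for any NMDC table, and the toy DataFrame stands in for a Spark query. The KO list is the one checked later in this notebook.

```python
import pandas as pd

# PHB pathway KOs, same list as queried later in this notebook
PHB_KOS = ['K03821', 'K00023', 'K00626', 'K05973', 'K14205', 'K18080']

def phb_ko_fraction(ko_counts: pd.DataFrame) -> pd.DataFrame:
    """Per-sample PHB KO count and its fraction of all KO counts.

    Assumes hypothetical columns: sample_id, ko_id, count.
    """
    total = ko_counts.groupby('sample_id')['count'].sum().rename('total_count')
    phb = (ko_counts[ko_counts['ko_id'].isin(PHB_KOS)]
           .groupby('sample_id')['count'].sum()
           .rename('phb_count'))
    # Left join keeps samples that have no PHB KOs at all
    out = total.to_frame().join(phb)
    out['phb_count'] = out['phb_count'].fillna(0)
    out['phb_fraction'] = out['phb_count'] / out['total_count']
    return out.reset_index()

# Toy data (illustrative only)
toy = pd.DataFrame({
    'sample_id': ['s1', 's1', 's1', 's2', 's2'],
    'ko_id':     ['K03821', 'K00626', 'K00001', 'K00001', 'K00002'],
    'count':     [2, 3, 5, 4, 6],
})
result = phb_ko_fraction(toy)
print(result)
```

Normalizing by the total KO count is one possible choice; normalizing by a single-copy marker gene would be another, and the notebook does not commit to either.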

In [None]:
spark = get_spark_session()

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

PROJECT_DIR = os.path.expanduser('~/BERIL-research-observatory/projects/phb_granule_ecology')
DATA_DIR = os.path.join(PROJECT_DIR, 'data')
FIG_DIR = os.path.join(PROJECT_DIR, 'figures')
os.makedirs(FIG_DIR, exist_ok=True)  # plt.savefig fails if the directory is missing

# Load NB01/NB02 results
species_phb = pd.read_csv(os.path.join(DATA_DIR, 'phb_species_summary.tsv'), sep='\t')
tax_phb = pd.read_csv(os.path.join(DATA_DIR, 'phb_by_taxonomy.tsv'), sep='\t')
print(f'Loaded {len(species_phb):,} PHB+ species, {len(tax_phb):,} total species')

## Part 1: NMDC Study & Sample Overview

In [None]:
# Overview of NMDC studies
studies = spark.sql("""
    SELECT * FROM nmdc_arkin.study_table
""").toPandas()

print(f'Total NMDC studies: {len(studies)}')
print(f'\nColumns: {list(studies.columns)}')
studies.head(10)

In [None]:
# Check available abiotic features — these provide environmental context
abiotic = spark.sql("""
    SELECT * FROM nmdc_arkin.abiotic_features LIMIT 5
""").toPandas()

print(f'Abiotic feature columns ({len(abiotic.columns)}):') 
for col in abiotic.columns:
    print(f'  {col}')
abiotic

In [None]:
# How many samples have abiotic data?
abiotic_count = spark.sql("""
    SELECT COUNT(*) as n_samples FROM nmdc_arkin.abiotic_features
""").toPandas()
print(f'Samples with abiotic features: {abiotic_count["n_samples"].iloc[0]:,}')

# Check taxonomy features — this is our primary path for PHB inference
tax_feat_sample = spark.sql("""
    SELECT * FROM nmdc_arkin.taxonomy_features LIMIT 5
""").toPandas()

print(f'\nTaxonomy feature columns ({len(tax_feat_sample.columns)}):')
# Show first 20 columns and last 5
cols = list(tax_feat_sample.columns)
for c in cols[:20]:
    print(f'  {c}')
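if len(cols) > 25:
    print(f'  ... ({len(cols) - 25} more columns) ...')
    for c in cols[-5:]:
        print(f'  {c}')
else:
    # With 21-25 columns, print the remainder so none are silently skipped
    for c in cols[20:]:
        print(f'  {c}')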

tax_feat_count = spark.sql("""
    SELECT COUNT(*) as n FROM nmdc_arkin.taxonomy_features
""").toPandas()
print(f'\nSamples with taxonomy features: {tax_feat_count["n"].iloc[0]:,}')

In [None]:
# Check for per-sample functional annotations (Path A)
# Look for tables with KO counts per sample
print('=== Checking for per-sample functional annotation tables ===')

# Check metatranscriptomics_gold — might have per-sample functional counts
try:
    mt_sample = spark.sql("SELECT * FROM nmdc_arkin.metatranscriptomics_gold LIMIT 3").toPandas()
    print(f'\nmetatranscriptomics_gold columns: {list(mt_sample.columns)}')
    print(mt_sample)
except Exception as e:
    print(f'metatranscriptomics_gold: {e}')

# Check trait_features — has functional group columns
try:
    trait_sample = spark.sql("SELECT * FROM nmdc_arkin.trait_features LIMIT 3").toPandas()
    func_cols = [c for c in trait_sample.columns if 'function' in c.lower()]  # also matches 'functional_group'
    print(f'\ntrait_features: {len(trait_sample.columns)} columns')
    print(f'Functional group columns: {len(func_cols)}')
    for c in func_cols[:15]:
        print(f'  {c}: {trait_sample[c].iloc[0]}')
except Exception as e:
    print(f'trait_features: {e}')

In [None]:
# Check if KEGG KO annotations exist per sample
# annotation_terms_unified has COG, EC, GO, KEGG, MetaCyc — but is it per-sample?
print('=== Checking annotation_terms_unified ===')
ann_sample = spark.sql("""
    SELECT * FROM nmdc_arkin.annotation_terms_unified 
    WHERE source = 'KEGG'
    LIMIT 5
""").toPandas()
print(f'Columns: {list(ann_sample.columns)}')
ann_sample

In [None]:
# Check if PHB-related KOs exist in NMDC KEGG terms
phb_kos = ['K03821', 'K00023', 'K00626', 'K05973', 'K14205', 'K18080']
ko_str = "','".join(phb_kos)

phb_in_nmdc = spark.sql(f"""
    SELECT * FROM nmdc_arkin.kegg_ko_terms
    WHERE ko_id IN ('{ko_str}')
""").toPandas()

print('PHB KOs in NMDC KEGG reference:')
if len(phb_in_nmdc) > 0:
    print(phb_in_nmdc[['ko_id', 'name']].to_string(index=False))
else:
    # Inspect table schema if no results
    all_kegg = spark.sql("SELECT * FROM nmdc_arkin.kegg_ko_terms LIMIT 5").toPandas()
    print(f'Columns: {list(all_kegg.columns)}')
    print(all_kegg)

## Part 2: Taxonomy-Based PHB Inference (Path B)

The `taxonomy_features` table contains per-sample taxonomic profiles from metagenomic classification.
We infer PHB capability for each sample by weighting the taxonomic composition by the pangenome PHB status from NB01/NB02.

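**PHB inference score** = Σ_i (relative_abundance_taxon_i × phaC_prevalence_in_clade_i), summed over the taxa i that map to a pangenome genus. The code uses the values in `taxonomy_features` as given and does not renormalize them, so scores are comparable across samples only if those values are already relative abundances.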
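
The weighted sum above can be sketched on toy data as follows. The taxon IDs, abundances, and prevalence values are made up for illustration, and the vectorized form is an alternative to the row-by-row loop used in the cells below, not a description of it.

```python
import pandas as pd

# Toy wide table: one row per sample, one column per taxon ID (values = abundance)
abund = pd.DataFrame(
    {'7': [0.5, 0.0], '11': [0.25, 0.5], '33': [0.25, 0.5]},
    index=pd.Index(['s1', 's2'], name='sample_id'),
)

# Hypothetical phaC prevalence (proportion, 0-1) for taxa that map to a pangenome genus.
# Taxon '33' is deliberately left unmatched.
prevalence = pd.Series({'7': 1.0, '11': 0.5})

matched = abund[prevalence.index]                        # matched taxon columns only
phb_score = matched.mul(prevalence, axis=1).sum(axis=1)  # sum_i abundance_i * prevalence_i
pct_matched = 100 * matched.sum(axis=1) / abund.sum(axis=1)

print(pd.DataFrame({'phb_score': phb_score, 'pct_matched': pct_matched}))
```

Reporting `pct_matched` next to the score matters because a low score can mean either few PHB-capable genera or simply poor overlap with the pangenome.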

In [None]:
# Load taxonomy dimension table to map numeric column IDs to taxon names
# taxonomy_features columns are numeric IDs (7, 11, 33, ...), not taxon names
tax_dim = spark.sql("""
    SELECT * FROM nmdc_arkin.taxonomy_dim
""").toPandas()

print(f'taxonomy_dim: {tax_dim.shape[0]} rows x {tax_dim.shape[1]} columns')
print(f'Columns: {list(tax_dim.columns)}')
tax_dim.head(10)

In [None]:
# Also check taxstring_lookup for alternative ID-to-name mapping
try:
    taxstring = spark.sql("""
        SELECT * FROM nmdc_arkin.taxstring_lookup LIMIT 20
    """).toPandas()
    print(f'taxstring_lookup columns: {list(taxstring.columns)}')
    print(f'Shape: {taxstring.shape}')
    print(taxstring.head(10))  # a bare expression inside try is not displayed by Jupyter
except Exception as e:
    print(f'taxstring_lookup: {e}')

# Load taxonomy features
tax_features = spark.sql("""
    SELECT * FROM nmdc_arkin.taxonomy_features
""").toPandas()

print(f'\nTaxonomy features: {tax_features.shape[0]} samples x {tax_features.shape[1]} columns')
taxon_cols = [c for c in tax_features.columns if c != 'sample_id']
print(f'Taxon columns (first 20): {taxon_cols[:20]}')
print(f'Total taxon columns: {len(taxon_cols)}')

In [None]:
# Build mapping from numeric taxon IDs (column names in taxonomy_features)
# to genus names that we can match to pangenome data
#
# Strategy: taxonomy_dim should have columns like taxon_id + taxon name/rank info
# We need to extract the genus-level name for each taxon ID

# Identify the ID column and name/rank columns in taxonomy_dim
print('taxonomy_dim columns and sample values:')
for col in tax_dim.columns:
    print(f'  {col}: {tax_dim[col].iloc[:3].tolist()}')

# Build ID → genus mapping
# The exact approach depends on taxonomy_dim schema (will adapt after seeing output above)
# Common patterns: taxon_id + lineage string, or taxon_id + rank + name

# Try to find genus from taxonomy_dim
id_col = tax_dim.columns[0]  # First column is likely the ID
print(f'\nUsing "{id_col}" as taxonomy ID column')

# Check which taxonomy_features column IDs exist in taxonomy_dim
taxon_id_set = set(str(x) for x in tax_dim[id_col])
matched_ids = [c for c in taxon_cols if c in taxon_id_set]
print(f'Taxon columns matched to taxonomy_dim: {len(matched_ids)}/{len(taxon_cols)}')

In [None]:
# Extract genus from taxonomy_dim and build ID → genus lookup
# Adapt based on what columns are available in taxonomy_dim

# Look for columns containing lineage, genus, name, rank
name_candidates = [c for c in tax_dim.columns 
                   if any(kw in c.lower() for kw in ['genus', 'name', 'lineage', 'tax', 'string', 'rank'])]
print(f'Potential name/lineage columns: {name_candidates}')

# Build the taxon_id → genus name mapping
# Will use the first suitable column that contains genus-level information
taxid_to_genus = {}

for col in name_candidates:
    sample_vals = tax_dim[col].dropna().head(5).tolist()
    print(f'\n  {col} samples: {sample_vals}')

# Extract genus from lineage string (e.g., "d__Bacteria;p__Proteobacteria;...;g__Pseudomonas;s__...")
# or from a dedicated genus column
for col in tax_dim.columns:
    vals = tax_dim[col].dropna().astype(str)
    # Check if this column contains genus-level taxonomy strings
    has_genus_prefix = vals.str.contains('g__', na=False).any()
    if has_genus_prefix:
        print(f'\nFound genus-level info in column: {col}')
        # Parse genus from taxonomy string
        for _, row in tax_dim.iterrows():
            taxid = str(row[id_col])
            lineage = str(row[col])
            # Extract genus from lineage (format: ...;g__GenusName;...)
            parts = lineage.split(';')
            genus = None
            for p in parts:
                p = p.strip()
                if p.startswith('g__') and len(p) > 3:
                    genus = p.replace('g__', '').strip()
                    break
            if genus:
                taxid_to_genus[taxid] = genus
        break

print(f'\nMapped {len(taxid_to_genus)} taxon IDs to genus names')
if taxid_to_genus:
    sample_items = list(taxid_to_genus.items())[:5]
    for tid, genus in sample_items:
        print(f'  ID {tid} → {genus}')

In [None]:
# Build genus-level PHB prevalence from pangenome data
genus_phb = tax_phb.groupby('gtdb_genus').agg(
    n_species=('gtdb_species_clade_id', 'count'),
    n_phaC=('has_phaC', 'sum'),
    pct_phaC=('has_phaC', lambda x: x.mean()),  # proportion, not percentage
).reset_index()
genus_phb['genus_clean'] = genus_phb['gtdb_genus'].str.replace('g__', '', regex=False).str.strip()

# Match NMDC taxon IDs → genus → pangenome PHB prevalence
genus_phb_lookup = dict(zip(genus_phb['genus_clean'].str.lower(), genus_phb['pct_phaC']))

# Build matched list: (taxon_col_name, genus_name) for columns we can map
matched = []
unmatched_ids = []
for col_id in taxon_cols:
    genus = taxid_to_genus.get(col_id, None)
    if genus and genus.lower() in genus_phb_lookup:
        matched.append((col_id, genus.lower()))
    else:
        unmatched_ids.append(col_id)

print(f'Matched taxon IDs to pangenome genera: {len(matched)}/{len(taxon_cols)}')
print(f'Unmatched: {len(unmatched_ids)}')
if matched:
    print(f'\nFirst 10 matches:')
    for col_id, genus in matched[:10]:
        print(f'  ID {col_id} → {genus} (phaC: {genus_phb_lookup[genus]*100:.1f}%)')

# Compute per-sample PHB inference score
sample_scores = []
for _, row in tax_features.iterrows():
    sample_id = row['sample_id']
    phb_score = 0.0
    total_abundance = 0.0
    matched_abundance = 0.0
    
    for col_id, genus in matched:
        abundance = pd.to_numeric(row.get(col_id, 0), errors='coerce')
        if pd.notna(abundance) and abundance > 0:
            phb_score += abundance * genus_phb_lookup.get(genus, 0)
            matched_abundance += abundance
    
    for col in taxon_cols:
        val = pd.to_numeric(row.get(col, 0), errors='coerce')
        if pd.notna(val) and val > 0:
            total_abundance += val
    
    sample_scores.append({
        'sample_id': sample_id,
        'phb_score': phb_score,
        'matched_abundance': matched_abundance,
        'total_abundance': total_abundance,
        'pct_matched': matched_abundance / total_abundance * 100 if total_abundance > 0 else 0,
    })

sample_phb = pd.DataFrame(sample_scores)
print(f'\nComputed PHB scores for {len(sample_phb):,} samples')
print(f'\nPHB score distribution:')
print(sample_phb['phb_score'].describe())
print(f'\nTaxon matching coverage:')
print(f'  Median % abundance matched: {sample_phb["pct_matched"].median():.1f}%')

## Part 3: Correlate PHB Signal with Abiotic Features
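
The correlation cells in this part test every abiotic variable with enough coverage, so many p-values are produced at once. A minimal sketch of one way to account for that, assuming a Benjamini-Hochberg adjustment is wanted (the notebook itself reports unadjusted p-values); `bh_adjust` and the toy variables are hypothetical names introduced here.

```python
import numpy as np
from scipy import stats

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

# Toy data (illustrative only)
rng = np.random.default_rng(0)
score = rng.random(50)
variables = {
    'var_a': score + rng.normal(0, 0.05, 50),  # strongly related to the score
    'var_b': rng.random(50),                   # unrelated noise
}
pvals = [stats.spearmanr(score, v).pvalue for v in variables.values()]
qvals = bh_adjust(pvals)

for name, p, q in zip(variables, pvals, qvals):
    print(f'{name}: p={p:.2e}, q={q:.2e}')
```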

In [None]:
# Load abiotic features
abiotic_all = spark.sql("""
    SELECT * FROM nmdc_arkin.abiotic_features
""").toPandas()

print(f'Abiotic features: {abiotic_all.shape[0]} samples x {abiotic_all.shape[1]} columns')

# Merge with PHB scores
phb_abiotic = sample_phb.merge(abiotic_all, on='sample_id', how='inner')
print(f'Samples with both PHB scores and abiotic data: {len(phb_abiotic):,}')

In [None]:
# Identify key abiotic variables for feast/famine hypothesis
# Focus on: dissolved oxygen, pH, temperature, carbon-related measures
abiotic_cols = [c for c in abiotic_all.columns if c != 'sample_id']

# Cast numeric columns and check coverage
print('Abiotic variable coverage:')
abiotic_coverage = []
for col in abiotic_cols:
    vals = pd.to_numeric(phb_abiotic[col], errors='coerce')
    n_valid = vals.notna().sum()
    if n_valid > 0:
        abiotic_coverage.append({
            'column': col,
            'n_valid': n_valid,
            'pct_valid': n_valid / len(phb_abiotic) * 100,
            'mean': vals.mean(),
            'std': vals.std(),
        })

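# Explicit columns keep sort_values from raising KeyError when no column has valid values
abiotic_cov_df = pd.DataFrame(
    abiotic_coverage, columns=['column', 'n_valid', 'pct_valid', 'mean', 'std']
).sort_values('n_valid', ascending=False)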
print(abiotic_cov_df.to_string(index=False))

In [None]:
# Link samples to studies for within-study variability analysis
# Check if study_table has sample-to-study mapping
print('study_table columns:', list(studies.columns))

# Try to get sample-study links from embeddings or metadata
try:
    emb_meta = spark.sql("""
        SELECT * FROM nmdc_arkin.embedding_metadata LIMIT 5
    """).toPandas()
    print(f'\nembedding_metadata columns: {list(emb_meta.columns)}')
    print(emb_meta)  # a bare expression inside try is not displayed by Jupyter
except Exception as e:
    print(f'embedding_metadata: {e}')

In [None]:
# Correlate PHB score with key abiotic variables
# Select every abiotic column with sufficient coverage (no keyword filter is applied)
key_abiotic = []
for _, row in abiotic_cov_df.iterrows():
    col = row['column']
    if row['n_valid'] >= 30:  # minimum for meaningful correlation
        key_abiotic.append(col)

print(f'Abiotic variables with >=30 valid values: {len(key_abiotic)}')

# Compute Spearman correlations with PHB score (p-values are not corrected for multiple testing)
correlations = []
for col in key_abiotic:
    vals = pd.to_numeric(phb_abiotic[col], errors='coerce')
    valid = vals.notna() & phb_abiotic['phb_score'].notna()
    if valid.sum() >= 30:
        rho, p = stats.spearmanr(phb_abiotic.loc[valid, 'phb_score'], vals[valid])
        correlations.append({
            'abiotic_variable': col,
            'n': valid.sum(),
            'spearman_rho': rho,
            'p_value': p,
        })

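# Explicit columns keep sort_values from raising KeyError when no correlation was computed
corr_df = pd.DataFrame(
    correlations, columns=['abiotic_variable', 'n', 'spearman_rho', 'p_value']
).sort_values('p_value')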
print('\nSpearman correlations: PHB score vs abiotic variables')
print(corr_df.to_string(index=False))

In [None]:
# Check trait_features for PHB-relevant functional groups
traits = spark.sql("""
    SELECT * FROM nmdc_arkin.trait_features
""").toPandas()

print(f'Trait features: {traits.shape[0]} samples x {traits.shape[1]} columns')

# Look for PHB-related traits
phb_trait_cols = [c for c in traits.columns 
                  if any(kw in c.lower() for kw in ['pha', 'phb', 'polyhydrox', 
                         'carbon_storage', 'granule', 'storage'])]
print(f'\nPHB-related trait columns: {phb_trait_cols}')

# Also check for general metabolic traits that might correlate
metab_trait_cols = [c for c in traits.columns 
                    if any(kw in c.lower() for kw in ['ferment', 'aerob', 'anaerob',
                           'respir', 'nitrogen', 'carbon', 'fatty_acid'])]
print(f'Metabolic trait columns: {metab_trait_cols[:15]}')

In [None]:
# If PHB-related traits exist, correlate with our PHB inference score
if phb_trait_cols:
    traits_phb = traits[['sample_id'] + phb_trait_cols].merge(
        sample_phb[['sample_id', 'phb_score']], on='sample_id', how='inner')
    
    for col in phb_trait_cols:
        vals = pd.to_numeric(traits_phb[col], errors='coerce')
        valid = vals.notna() & traits_phb['phb_score'].notna()
        if valid.sum() >= 10:
            rho, p = stats.spearmanr(traits_phb.loc[valid, 'phb_score'], vals[valid])
            print(f'{col}: rho={rho:.3f}, p={p:.2e}, n={valid.sum()}')
else:
    print('No PHB-specific trait columns found in trait_features.')
    print('Will rely on taxonomy-based inference only.')

In [None]:
# Check metabolomics for 3-hydroxybutyrate / PHB monomers
try:
    metab_sample = spark.sql("""
        SELECT * FROM nmdc_arkin.metabolomics_gold LIMIT 5
    """).toPandas()
    print(f'metabolomics_gold columns: {list(metab_sample.columns)}')
    print(metab_sample)  # a bare expression inside try is not displayed by Jupyter
except Exception as e:
    print(f'metabolomics_gold error: {e}')

In [None]:
# Search metabolomics for 3-hydroxybutyrate or related compounds
# From NB01b: columns include "Compound Name", "Common Name", "Traditional Name", "name"
# Use backticks for columns with spaces
try:
    phb_metabolites = spark.sql("""
        SELECT DISTINCT `Compound Name`, `Common Name`, `Traditional Name`, name, kegg, smiles,
               `Sample Name`
        FROM nmdc_arkin.metabolomics_gold
        WHERE LOWER(`Compound Name`) LIKE '%hydroxybutyrat%'
           OR LOWER(`Compound Name`) LIKE '%phb%'
           OR LOWER(name) LIKE '%hydroxybutyrat%'
           OR LOWER(`Common Name`) LIKE '%hydroxybutyrat%'
        LIMIT 20
    """).toPandas()
    
    if len(phb_metabolites) > 0:
        print(f'Found PHB-related metabolites: {len(phb_metabolites)}')
        print(phb_metabolites)
    else:
        print('No PHB-related metabolites found by name.')
except Exception as e:
    print(f'Metabolomics query failed: {e}')

## Part 4: Visualize PHB Signal Across NMDC Environments
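
Once an environment label is located, a per-environment summary of the score could be built along these lines. The `environment` column name and the toy values are assumptions; the cells below only search `embedding_metadata` for candidate columns.

```python
import pandas as pd

# Toy per-sample scores (illustrative only)
scores = pd.DataFrame({
    'sample_id': ['s1', 's2', 's3', 's4', 's5'],
    'phb_score': [0.2, 0.4, 0.6, 0.1, 0.3],
})

# Hypothetical sample-to-environment labels; 's5' has no label
labels = pd.DataFrame({
    'sample_id': ['s1', 's2', 's3', 's4'],
    'environment': ['soil', 'soil', 'freshwater', 'freshwater'],
})

# Inner join drops samples without a label
merged = scores.merge(labels, on='sample_id', how='inner')

env_summary = (merged.groupby('environment')['phb_score']
               .agg(n_samples='count', median_score='median')
               .sort_values('median_score', ascending=False))
print(env_summary)
```

Keeping `n_samples` in the summary makes it easier to spot environments whose median rests on only a handful of samples.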

In [None]:
# Classify NMDC samples by environment type using study metadata and abiotic features
# Merge PHB scores with available metadata

# Attempt to get environment labels from embedding_metadata or study_table
try:
    emb_meta_all = spark.sql("""
        SELECT * FROM nmdc_arkin.embedding_metadata
    """).toPandas()
    print(f'Embedding metadata: {len(emb_meta_all)} samples')
    print(f'Columns: {list(emb_meta_all.columns)}')
    
    # Check for environment-related columns
    env_cols = [c for c in emb_meta_all.columns if any(kw in c.lower() 
                for kw in ['env', 'ecosystem', 'habitat', 'biome', 'study'])]
    print(f'\nEnvironment-related columns: {env_cols}')
    if env_cols:
        for col in env_cols:
            print(f'\n{col} value counts:')
            print(emb_meta_all[col].value_counts().head(10))
except Exception as e:
    print(f'embedding_metadata: {e}')

In [None]:
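# Figure 1: PHB inference score distribution and pangenome matching coverage across samples
# (saved under the planned output name; no per-environment split is drawn in this cell)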
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
ax.hist(sample_phb['phb_score'], bins=50, color='#2196F3', alpha=0.8, edgecolor='white')
ax.set_xlabel('PHB inference score')
ax.set_ylabel('Number of samples')
ax.set_title('Distribution of PHB Scores Across NMDC Samples')
ax.axvline(sample_phb['phb_score'].median(), color='red', linestyle='--', 
           label=f'Median={sample_phb["phb_score"].median():.3f}')
ax.legend()

ax = axes[1]
ax.hist(sample_phb['pct_matched'], bins=50, color='#4CAF50', alpha=0.8, edgecolor='white')
ax.set_xlabel('% taxonomic abundance matched to pangenome')
ax.set_ylabel('Number of samples')
ax.set_title('Pangenome Matching Coverage')

plt.tight_layout()
plt.savefig(os.path.join(FIG_DIR, 'nmdc_phb_by_environment.png'), dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Figure 2: PHB score vs key abiotic variables
# Plot top correlations (by p-value)
if len(corr_df) > 0:
    top_n = min(4, len(corr_df))
    top_corr = corr_df.head(top_n)
    
    fig, axes = plt.subplots(1, top_n, figsize=(4*top_n, 4))
    if top_n == 1:
        axes = [axes]
    
    for i, (_, row) in enumerate(top_corr.iterrows()):
        col = row['abiotic_variable']
        ax = axes[i]
        vals = pd.to_numeric(phb_abiotic[col], errors='coerce')
        valid = vals.notna() & phb_abiotic['phb_score'].notna()
        
        ax.scatter(vals[valid], phb_abiotic.loc[valid, 'phb_score'], 
                   alpha=0.3, s=10, color='#2196F3')
        # Clean column name for display
        clean_name = col.replace('annotations_', '').replace('_has_numeric_value', '')
        ax.set_xlabel(clean_name)
        ax.set_ylabel('PHB score')
        ax.set_title(f'rho={row["spearman_rho"]:.2f}, p={row["p_value"]:.1e}')
    
    plt.suptitle('PHB Score vs Abiotic Variables (Top Correlations)', y=1.02)
    plt.tight_layout()
    plt.savefig(os.path.join(FIG_DIR, 'nmdc_phb_vs_abiotic.png'), dpi=150, bbox_inches='tight')
    plt.show()
else:
    print('No abiotic correlations were computed (no variable had >=30 valid values).')

In [None]:
# Save results
sample_phb.to_csv(os.path.join(DATA_DIR, 'nmdc_phb_prevalence.tsv'), sep='\t', index=False)

print(f'Saved PHB prevalence: {len(sample_phb):,} samples')

# Save correlation results (only when at least one correlation was computed)
if len(corr_df) > 0:
    corr_df.to_csv(os.path.join(DATA_DIR, 'nmdc_abiotic_correlations.tsv'), sep='\t', index=False)
    print(f'Saved abiotic correlations: {len(corr_df)} variables')
else:
    print('No abiotic correlations to save')

## Summary

### Key Findings (to be filled after execution)
- Per-sample functional annotations available: ?
- Taxonomy-based PHB inference: ? samples scored
- Pangenome matching coverage: ?%
- Top abiotic correlations: ?
- PHB-related metabolites found: ?

### Caveats
- Taxonomy-based inference is indirect — assumes genus-level PHB prevalence applies to metagenome taxa
- Taxonomic classification methods (Centrifuge/Kraken/GOTTCHA) have different genus-level accuracy
- NMDC samples are biased toward specific ecosystems (soil, aquatic) per NMDC study portfolio
- Abiotic features are snapshots, not measures of temporal variability

### Next Notebook (NB05)
Subclade enrichment and cross-validation — test for differential PHB enrichment within clades and validate pangenome patterns against NMDC.