# Notebook 02: GapMind Pathway Analysis

**Goal**: Identify universally complete metabolic pathways in 48 Fitness Browser organisms

**Research Question**: Which metabolic pathways are universally complete across bacteria with essential gene data?

**Data sources**:
- Essential gene families: `projects/essential_genome/` (48 FB organisms)
- GapMind predictions: `kbase_ke_pangenome.gapmind_pathways` (305M predictions)
- Genome metadata: `kbase_ke_pangenome.genome`

**Approach**:
1. Load FB organism list (48 organisms from essential_genome)
2. Map FB organism names → pangenome genome_ids
3. Extract GapMind predictions for those 48 genomes
4. Identify universally complete pathways
5. Characterize essential metabolic repertoire

**Note**: This uses revised approach (v2) - pathway-level analysis instead of EC→reactions

In [None]:
# Imports
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

# Spark Connect
from get_spark_session import get_spark_session

print("✅ Imports successful")

In [None]:
# Create Spark session
spark = get_spark_session()
print(f"✅ Spark session created: version {spark.version}")

## Step 1: Load FB Organism List

Get the 48 organisms from essential_genome project. Need to extract organism names and map them to genome IDs.

In [None]:
# Download essential families from lakehouse (or use cached)
# This references data from essential_genome project (proper attribution)

import os

essential_families_file = Path("../data/essential_genome_families.tsv")

if not essential_families_file.exists():
    print("Downloading essential families from lakehouse...")
    os.system(f"""
        export https_proxy=http://127.0.0.1:8123 && 
        export no_proxy=localhost,127.0.0.1 && 
        mc cp berdl-minio/cdm-lake/tenant-general-warehouse/microbialdiscoveryforge/projects/essential_genome/data/essential_families.tsv {essential_families_file}
    """)
    print(f"✅ Downloaded to {essential_families_file}")
else:
    print(f"✅ Using cached file: {essential_families_file}")

# Load
families = pd.read_csv(essential_families_file, sep='\t')
print(f"\nLoaded {len(families):,} ortholog groups")
print(f"Universally essential: {(families['essentiality_class'] == 'universally_essential').sum():,}")

In [None]:
# Extract organism names from essential_organisms column
# Example: "BFirm;Burk376;Caulo;Cup4G11;..."

universal = families[families['essentiality_class'] == 'universally_essential']

# Get organism list from first universal family
organism_str = universal.iloc[0]['essential_organisms']
fb_organisms = sorted(organism_str.split(';'))

print(f"FB organisms ({len(fb_organisms)}):")
for i, org in enumerate(fb_organisms, 1):
    print(f"{i:2d}. {org}")

## Step 2: Map FB Names to Genome IDs

FB organism names (e.g., "Keio", "DvH") need to map to pangenome genome_ids (e.g., "GCF_000005845.2").

Strategy:
1. Use known mappings from previous projects if available
2. Otherwise, search genome table by taxonomy or NCBI IDs

In [None]:
# Check if mapping file exists from conservation_vs_fitness project
link_file = Path("../../conservation_vs_fitness/data/fb_pangenome_link.tsv")

if link_file.exists():
    print(f"✅ Found FB-pangenome link file: {link_file}")
    fb_link = pd.read_csv(link_file, sep='\t')
    print(f"   {len(fb_link):,} rows")
    print(f"\nColumns: {list(fb_link.columns)}")
    print("\nSample:")
    print(fb_link.head())
    
    # Extract unique organism → genome_id mappings
    if 'organism' in fb_link.columns and 'gtdb_species_clade_id' in fb_link.columns:
        org_genome_map = fb_link[['organism', 'gtdb_species_clade_id']].drop_duplicates()
        print(f"\nUnique organism mappings: {len(org_genome_map)}")
else:
    print("⚠️  FB-pangenome link file not found")
    print("   Will need to create mapping manually")
    fb_link = None

## Step 3: Extract GapMind Predictions for FB Organisms

Query GapMind for the 48 FB organisms (or their representative genomes).

In [None]:
# For now, let's explore GapMind structure to understand how to link
# We'll use genome_id patterns or species names

print("Exploring GapMind genome_id patterns...")
genome_sample = spark.sql("""
    SELECT DISTINCT genome_id
    FROM kbase_ke_pangenome.gapmind_pathways
    LIMIT 20
""").toPandas()

print("Sample genome IDs in GapMind:")
print(genome_sample)

In [None]:
# Get pathway summary across all genomes to understand data structure
print("Pathway completeness summary (amino acids):")
aa_summary = spark.sql("""
    SELECT 
        pathway,
        COUNT(DISTINCT genome_id) as n_genomes,
        SUM(CASE WHEN score_category = 'complete' THEN 1 ELSE 0 END) as n_complete,
        SUM(CASE WHEN score_category = 'likely_complete' THEN 1 ELSE 0 END) as n_likely,
        ROUND(100.0 * SUM(CASE WHEN score_category IN ('complete', 'likely_complete') THEN 1 ELSE 0 END) / COUNT(*), 1) as pct_present
    FROM kbase_ke_pangenome.gapmind_pathways
    WHERE metabolic_category = 'aa'
    GROUP BY pathway
    ORDER BY pct_present DESC
""").toPandas()

print(aa_summary.to_string(index=False))

## Step 4: Identify Universally Complete Pathways

Once we have the FB organism mappings, we'll:
1. Filter GapMind to those 48 genomes
2. Find pathways that are complete/likely_complete in ALL 48
3. Characterize the minimal metabolic repertoire

In [None]:
# Placeholder for next step
# Will need FB → genome_id mapping to proceed

print("Next steps:")
print("1. Complete FB organism → genome_id mapping")
print("2. Query GapMind for those specific genomes")
print("3. Identify universally complete pathways")
print("4. Create visualizations")
print("5. Compare to minimal genome studies")

## Save Progress

Save intermediate results.

In [None]:
# Save FB organism list
data_dir = Path("../data")
data_dir.mkdir(exist_ok=True)

fb_org_file = data_dir / "fb_organisms.txt"
with open(fb_org_file, 'w') as f:
    for org in fb_organisms:
        f.write(f"{org}\n")

print(f"✅ Saved FB organism list: {fb_org_file}")
print(f"   {len(fb_organisms)} organisms")

# Save pathway summary
aa_summary.to_csv(data_dir / "gapmind_aa_pathway_summary.tsv", sep='\t', index=False)
print(f"\n✅ Saved pathway summary: gapmind_aa_pathway_summary.tsv")

In [None]:
# Clean up
spark.stop()
print("✅ Spark session closed")

## Summary

**Completed**:
- ✅ Loaded 48 FB organism names from essential_genome project
- ✅ Explored GapMind pathway structure
- ✅ Analyzed amino acid pathway completeness across all genomes

**Findings**:
- GapMind has 80 pathways (18 amino acids + 62 carbon sources)
- Pathway completeness varies: met > gln > asn > gly > thr
- Need FB organism → genome_id mapping to proceed

**Next Steps**:
1. Create or find FB → genome_id mapping
2. Extract GapMind predictions for 48 FB genomes
3. Identify universally complete pathways
4. Visualize essential metabolic repertoire
5. Compare to JCVI-syn3.0 minimal genome