# Dump KE Genome List

This notebook creates a dictionary mapping genome IDs to their GTDB taxonomy IDs from the KBase KE pangenome database.

**Workflow:**
1. Query the kbase_ke_pangenome.genome table to get all genome IDs and their GTDB taxonomy IDs
2. Create a dictionary where genome_id maps to gtdb_taxonomy_id
3. Save the dictionary to the datacache for future use, then export per-taxonomy genome counts as a summary table

**Database:** kbase_ke_pangenome
- **genome**: Contains genome metadata including genome_id and gtdb_taxonomy_id

## Step 1: Query All Genomes

Query the genome table to retrieve all genome IDs and their corresponding GTDB taxonomy IDs.

**Key objectives:**
- Retrieve all genome_id and gtdb_taxonomy_id pairs from the genome table
- Display count of genomes retrieved
- Show sample of the data

**Input:**
- None (queries entire genome table)

**Output:**
- `genome_data`: List of all genome records with genome_id and gtdb_taxonomy_id
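
Before the records are passed on, it can help to confirm that each one carries both fields. The following is a minimal sketch, assuming the records are plain dicts shaped like the `asDict()` output saved as `genome_data`. The helper `find_incomplete_records` is hypothetical and not part of `util.py`; the sample rows are copied from the first records printed by this run.

```python
REQUIRED_FIELDS = ("genome_id", "gtdb_taxonomy_id")


def find_incomplete_records(records):
    """Return the records that lack a non-empty value for any required field."""
    return [
        record
        for record in records
        if any(not record.get(field) for field in REQUIRED_FIELDS)
    ]


# Sample rows taken from the notebook output; a real run would pass genome_data.
sample_records = [
    {"genome_id": "GB_GCA_000007325.1",
     "gtdb_taxonomy_id": "s__Fusobacterium_nucleatum--GB_GCA_000007325.1"},
    {"genome_id": "GB_GCA_000009845.1",
     "gtdb_taxonomy_id": "s__Phytoplasma_sp000009845--GB_GCA_000009845.1"},
    {"genome_id": "GB_GCA_000012145.1",
     "gtdb_taxonomy_id": "s__Rickettsia_felis--GB_GCA_000012145.1"},
]

incomplete = find_incomplete_records(sample_records)
print(f"{len(incomplete)} of {len(sample_records)} records are incomplete")
```

Treating empty strings the same as `None` is a design choice made for this sketch; a stricter or looser rule may suit the actual table better.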

In [1]:
%run util.py
util.spark = get_spark_session()

print("Querying all genomes from kbase_ke_pangenome.genome table...")
print("="*80)

# Query all genomes with their taxonomy IDs
genome_query = """
    SELECT 
        genome_id,
        gtdb_taxonomy_id
    FROM kbase_ke_pangenome.genome
    ORDER BY genome_id
"""

genome_df = util.run_query(genome_query)
genome_records = genome_df.collect()

print(f"Retrieved {len(genome_records):,} genome records")
print()

# Display first 10 records as examples
print("Sample of genome data (first 10 records):")
print("-" * 80)
for i, record in enumerate(genome_records[:10]):
    print(f"{i+1}. {record['genome_id']} -> {record['gtdb_taxonomy_id']}")

# Save the raw genome data
genome_data = [record.asDict() for record in genome_records]
util.save('genome_data', genome_data)

print()
print("="*80)
print(f"Genome data saved. Total records: {len(genome_data):,}")

2025-12-03 20:44:54,086 - __main__.NotebookUtil - INFO - Notebook environment detected


/home/chenry/Projects/KBUtilLib/src
Querying all genomes from kbase_ke_pangenome.genome table...
wall=22.41s user=0.11s sys=0.04s
Retrieved 293,059 genome records

Sample of genome data (first 10 records):
--------------------------------------------------------------------------------
1. GB_GCA_000007325.1 -> s__Fusobacterium_nucleatum--GB_GCA_000007325.1
2. GB_GCA_000009845.1 -> s__Phytoplasma_sp000009845--GB_GCA_000009845.1
3. GB_GCA_000012145.1 -> s__Rickettsia_felis--GB_GCA_000012145.1
4. GB_GCA_000013185.1 -> s__Baumannia_cicadellinicola--GB_GCA_000013185.1
5. GB_GCA_000013685.1 -> s__Rhodopseudomonas_pseudopalustris--RS_GCF_900110435.1
6. GB_GCA_000013945.1 -> s__Leptospira_borgpetersenii--RS_GCF_003046425.1
7. GB_GCA_000013965.1 -> s__Leptospira_borgpetersenii--RS_GCF_003046425.1
8. GB_GCA_000015545.1 -> s__Diaphorobacter_nitroreducens--GB_GCA_003755025.1
9. GB_GCA_000016325.1 -> s__Lelliottia_sp000016325--GB_GCA_000016325.1
10. GB_GCA_000016605.1 -> s__Metallosphaera_sedula--G

## Step 2: Create Genome ID to Taxonomy ID Dictionary

Transform the genome data into a dictionary structure for efficient lookup.

**Key objectives:**
- Create dictionary with genome_id as keys and gtdb_taxonomy_id as values
- Verify dictionary structure
- Display statistics about the mapping

**Input:**
- `genome_data`: List of genome records from Step 1

**Output:**
- `genome_taxonomy_dict`: Dictionary mapping genome_id -> gtdb_taxonomy_id
- Dictionary saved to datacache for future use
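
A dict comprehension silently keeps the last value when a key repeats. In the run shown here the dictionary size matches the record count (293,059), so no genome ID was repeated, but a later version of the table might differ. The sketch below shows one way to build the mapping while recording conflicts. The function name `build_genome_taxonomy_dict` and the keep-first policy are assumptions made for illustration, and the taxonomy ID in the conflict check is made up.

```python
def build_genome_taxonomy_dict(records):
    """Build genome_id -> gtdb_taxonomy_id, collecting conflicting duplicates.

    Returns (mapping, conflicts). conflicts maps each genome_id that appeared
    with more than one distinct taxonomy ID to the set of IDs seen for it.
    """
    mapping = {}
    conflicts = {}
    for record in records:
        genome_id = record["genome_id"]
        taxonomy_id = record["gtdb_taxonomy_id"]
        if genome_id in mapping and mapping[genome_id] != taxonomy_id:
            # Keep the first value seen and note the disagreement.
            conflicts.setdefault(genome_id, {mapping[genome_id]}).add(taxonomy_id)
            continue
        mapping[genome_id] = taxonomy_id
    return mapping, conflicts


# Sample rows taken from the notebook output; a real run would pass genome_data.
sample_records = [
    {"genome_id": "GB_GCA_000007325.1",
     "gtdb_taxonomy_id": "s__Fusobacterium_nucleatum--GB_GCA_000007325.1"},
    {"genome_id": "GB_GCA_000009845.1",
     "gtdb_taxonomy_id": "s__Phytoplasma_sp000009845--GB_GCA_000009845.1"},
    {"genome_id": "GB_GCA_000012145.1",
     "gtdb_taxonomy_id": "s__Rickettsia_felis--GB_GCA_000012145.1"},
]

mapping, conflicts = build_genome_taxonomy_dict(sample_records)
print(f"{len(mapping)} genome IDs mapped, {len(conflicts)} conflicts")
```

Exact duplicates (same genome ID, same taxonomy ID) are harmless and are not reported.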

In [2]:
%run util.py

# Load genome data from Step 1
genome_data = util.load('genome_data')

print("Creating genome_id -> gtdb_taxonomy_id dictionary...")
print("="*80)

# Create the dictionary
genome_taxonomy_dict = {
    record['genome_id']: record['gtdb_taxonomy_id']
    for record in genome_data
}

print(f"Dictionary created with {len(genome_taxonomy_dict):,} entries")
print()

# Display sample entries (islice avoids copying every item into a list)
from itertools import islice

print("Sample dictionary entries (first 10):")
print("-" * 80)
for i, (genome_id, taxonomy_id) in enumerate(islice(genome_taxonomy_dict.items(), 10)):
    print(f"{i+1}. '{genome_id}': '{taxonomy_id}'")

# Check for any None or missing values
none_count = sum(1 for v in genome_taxonomy_dict.values() if v is None)
print()
print(f"Records with None taxonomy_id: {none_count}")

# Save the dictionary
util.save('genome_taxonomy_dict', genome_taxonomy_dict)

print()
print("="*80)
print("Dictionary saved to datacache as 'genome_taxonomy_dict'")
print(f"Total genome IDs mapped: {len(genome_taxonomy_dict):,}")

2025-12-03 20:45:25,012 - __main__.NotebookUtil - INFO - Notebook environment detected


/home/chenry/Projects/KBUtilLib/src
Creating genome_id -> gtdb_taxonomy_id dictionary...
Dictionary created with 293,059 entries

Sample dictionary entries (first 10):
--------------------------------------------------------------------------------
1. 'GB_GCA_000007325.1': 's__Fusobacterium_nucleatum--GB_GCA_000007325.1'
2. 'GB_GCA_000009845.1': 's__Phytoplasma_sp000009845--GB_GCA_000009845.1'
3. 'GB_GCA_000012145.1': 's__Rickettsia_felis--GB_GCA_000012145.1'
4. 'GB_GCA_000013185.1': 's__Baumannia_cicadellinicola--GB_GCA_000013185.1'
5. 'GB_GCA_000013685.1': 's__Rhodopseudomonas_pseudopalustris--RS_GCF_900110435.1'
6. 'GB_GCA_000013945.1': 's__Leptospira_borgpetersenii--RS_GCF_003046425.1'
7. 'GB_GCA_000013965.1': 's__Leptospira_borgpetersenii--RS_GCF_003046425.1'
8. 'GB_GCA_000015545.1': 's__Diaphorobacter_nitroreducens--GB_GCA_003755025.1'
9. 'GB_GCA_000016325.1': 's__Lelliottia_sp000016325--GB_GCA_000016325.1'
10. 'GB_GCA_000016605.1': 's__Metallosphaera_sedula--GB_GCA_000016605.1'
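
The sample entries above suggest that each taxonomy ID has the form `s__<species name>--<genome accession>`, and that the accession sometimes differs from the genome's own ID (entries 5 through 8). This format is inferred from the printed output only, not from a schema, and the notebook does not say what the second accession represents. Under that assumption, a small hypothetical helper could split the two parts:

```python
def split_taxonomy_id(taxonomy_id):
    """Split 's__<name>--<accession>' into (name, accession).

    The format is an assumption based on the printed sample entries.
    """
    name, separator, accession = taxonomy_id.partition("--")
    if not separator:
        raise ValueError(f"no '--' separator in {taxonomy_id!r}")
    if name.startswith("s__"):
        name = name[len("s__"):]
    return name, accession


# Entry 5 from the sample output: the accession differs from the genome ID.
genome_id = "GB_GCA_000013685.1"
taxonomy_id = "s__Rhodopseudomonas_pseudopalustris--RS_GCF_900110435.1"

species_name, accession = split_taxonomy_id(taxonomy_id)
print(species_name, accession, accession == genome_id)
```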


## Step 3: Export Dictionary Statistics and Summary

Generate summary statistics about the genome taxonomy mapping.

**Key objectives:**
- Count unique taxonomy IDs
- Identify most common taxonomy IDs
- Export summary to nboutput for documentation

**Input:**
- `genome_taxonomy_dict`: Dictionary from Step 2

**Output:**
- Summary statistics printed to console
- `genome_taxonomy_summary.tsv`: Summary table exported to nboutput/
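
The counting step can be tried on a handful of entries without the datacache. This is a minimal sketch that uses five entries copied from the sample output and writes the summary with the standard `csv` module to an in-memory buffer. It stands in for `util.save_dataframe`, whose behavior is not shown here, so the exact file layout it produces may differ.

```python
import csv
import io
from collections import Counter

# Five entries from the sample output; two genomes share one taxonomy ID.
sample_dict = {
    "GB_GCA_000007325.1": "s__Fusobacterium_nucleatum--GB_GCA_000007325.1",
    "GB_GCA_000012145.1": "s__Rickettsia_felis--GB_GCA_000012145.1",
    "GB_GCA_000013685.1": "s__Rhodopseudomonas_pseudopalustris--RS_GCF_900110435.1",
    "GB_GCA_000013945.1": "s__Leptospira_borgpetersenii--RS_GCF_003046425.1",
    "GB_GCA_000013965.1": "s__Leptospira_borgpetersenii--RS_GCF_003046425.1",
}

# Count how many genomes map to each taxonomy ID.
taxonomy_counts = Counter(sample_dict.values())

# Write a tab-separated summary, most common taxonomy ID first.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["taxonomy_id", "genome_count"])
writer.writerows(taxonomy_counts.most_common())

summary_tsv = buffer.getvalue()
print(summary_tsv)
```

`Counter.most_common()` with no argument returns every entry sorted by descending count, which is why the same call can feed both the top-20 listing and the full summary table.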

In [3]:
%run util.py

# Load dictionary from Step 2
genome_taxonomy_dict = util.load('genome_taxonomy_dict')

print("Analyzing genome taxonomy mapping...")
print("="*80)

# Count taxonomy IDs
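from collections import Counter

# Imported explicitly because pd.DataFrame is used below; harmless if util.py already provides it
import pandas as pd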

taxonomy_counts = Counter(genome_taxonomy_dict.values())

print(f"Total genome IDs: {len(genome_taxonomy_dict):,}")
print(f"Unique taxonomy IDs: {len(taxonomy_counts):,}")
print()

# Find most common taxonomy IDs
print("Top 20 most common taxonomy IDs:")
print("-" * 80)
for i, (taxonomy_id, count) in enumerate(taxonomy_counts.most_common(20)):
    print(f"{i+1:2d}. {taxonomy_id}: {count:,} genomes")

# Create summary DataFrame
summary_data = [
    {'taxonomy_id': taxonomy_id, 'genome_count': count}
    for taxonomy_id, count in taxonomy_counts.most_common()
]

summary_df = pd.DataFrame(summary_data)

# Save summary to nboutput
util.save_dataframe('genome_taxonomy_summary', summary_df, format='tsv')

print()
print("="*80)
print("Summary exported to nboutput/genome_taxonomy_summary.tsv")
print("Dictionary available in datacache as 'genome_taxonomy_dict'")

2025-12-03 20:45:25,409 - __main__.NotebookUtil - INFO - Notebook environment detected


/home/chenry/Projects/KBUtilLib/src
Analyzing genome taxonomy mapping...
Total genome IDs: 293,059
Unique taxonomy IDs: 27,690

Top 20 most common taxonomy IDs:
--------------------------------------------------------------------------------
 1. s__Staphylococcus_aureus--RS_GCF_001027105.1: 14,526 genomes
 2. s__Klebsiella_pneumoniae--RS_GCF_000742135.1: 14,240 genomes
 3. s__Salmonella_enterica--RS_GCF_000006945.2: 11,402 genomes
 4. s__Streptococcus_pneumoniae--RS_GCF_001457635.1: 8,434 genomes
 5. s__Mycobacterium_tuberculosis--RS_GCF_000195955.2: 6,903 genomes
 6. s__Pseudomonas_aeruginosa--RS_GCF_001457615.1: 6,760 genomes
 7. s__Acinetobacter_baumannii--RS_GCF_009759685.1: 6,647 genomes
 8. s__Clostridioides_difficile--RS_GCF_001077535.1: 2,604 genomes
 9. s__Enterococcus_B_faecium--RS_GCF_001544255.1: 2,533 genomes
10. s__Enterobacter_hormaechei_A--RS_GCF_001729745.1: 2,453 genomes
11. s__Campylobacter_D_jejuni--RS_GCF_001457695.1: 2,313 genomes
12. s__Enterococcus_faecalis--RS_