# SKANI Genome Distance Computation

This notebook demonstrates how to use SKANIUtils for fast genome sketching and distance computation.

## Overview

SKANI is a fast tool for genome sketching and ANI (Average Nucleotide Identity) calculation. SKANIUtils supports:
- Creating sketch databases from genome directories
- Querying genomes against sketch databases
- Managing multiple sketch databases
- Computing genome distances quickly
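
To make the ANI figure concrete: it is, roughly, the percentage of matching bases across the regions of two genomes that align to each other. The toy function below computes percent identity for two already-aligned sequences of equal length. It only illustrates the quantity being reported; SKANI estimates ANI from sketches rather than from a base-by-base comparison like this one, and `percent_identity` is a hypothetical helper, not part of SKANIUtils.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of positions at which two aligned, equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


# Toy sequences, invented for illustration: 7 of 8 positions agree.
identity = percent_identity("ACGTACGT", "ACGTACGA")
print(f"Identity: {identity:.1f}%")
```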

## 1. Setup

In [None]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
src_path = project_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

print(f"Project root: {project_root}")

## 2. Initialize SKANIUtils

In [None]:
from kbutillib import SKANIUtils

# Initialize SKANI utilities
util = SKANIUtils()

print(f"SKANI available: {util.skani_available}")
print(f"SKANI executable: {util.skani_executable}")
print(f"Cache file: {util.cache_file}")
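
If `skani_available` comes back `False`, one quick, independent check is whether a `skani` executable is on your `PATH`. The sketch below uses only the standard library. `find_skani` is a hypothetical helper written for this notebook, and SKANIUtils may locate the executable in a different way (for example, through a configured path).

```python
import shutil
from typing import Optional


def find_skani(executable: str = "skani") -> Optional[str]:
    """Return the full path to the executable if it is on PATH, else None."""
    return shutil.which(executable)


skani_path = find_skani()
if skani_path is None:
    print("skani not found on PATH; install it before sketching or querying")
else:
    print(f"skani found at: {skani_path}")
```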

## 3. Create Sketch Database

Sketch a directory of FASTA genomes:

In [None]:
# Example: Create sketch database from genome directory
'''
result = util.sketch_genome_directory(
    genome_dir="./genomes",
    database_name="MyGenomes",
    description="Collection of bacterial genomes",
    marker="marker",  # optional: use marker-based sketching
)

print(f"Database created: {result['database']}")
print(f"Genomes sketched: {result['genome_count']}")
print(f"Database path: {result['database_path']}")
'''

print("Example: Sketch genomes from a directory")
print("Remove the triple quotes above and set genome_dir to your genome directory")
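
Before sketching, it can help to confirm which files in the directory will be treated as genomes. The sketch below lists FASTA files with `pathlib`. The extension set is an assumption (the source does not say which extensions `sketch_genome_directory` recognizes), and `find_fasta_files` is a hypothetical helper, not part of SKANIUtils.

```python
import tempfile
from pathlib import Path

# Assumed set of FASTA extensions; adjust to match your files.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def find_fasta_files(genome_dir):
    """Return FASTA files directly inside genome_dir, sorted by name."""
    return sorted(
        p for p in Path(genome_dir).iterdir()
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    )


# Demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("genome_b.fna", "genome_a.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">placeholder\nACGT\n")
    found_names = [p.name for p in find_fasta_files(tmp)]

print(found_names)
```

To use it on real data, call `find_fasta_files("./genomes")` and check that the count matches what you expect before creating the database.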

## 4. Query Genomes Against Database

Find similar genomes using the sketch database:

In [None]:
# Example: Query a genome against the database
'''
results = util.query_genomes(
    query_genome="./my_genome.fasta",
    database_name="MyGenomes",
    min_af=0.15  # minimum aligned fraction
)

print(f"Found {len(results)} matches")
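for match in results[:5]:  # Show top 5
    # Note: skani's own output reports aligned fraction as a percentage (0-100).
    # If SKANIUtils passes that value through unchanged, use :.2f instead of :.2%.
    print(f"  {match['Ref_name']}: ANI={match['ANI']:.2f}%, AF={match['Align_fraction_ref']:.2%}")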
'''

print("Example: Query genome for similar matches")
print("Returns ANI (Average Nucleotide Identity) and alignment metrics")
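
Once results come back, a common follow-up is to keep only close matches. About 95% ANI is a widely used rule of thumb for a species boundary, though the right cutoff depends on the application. The sketch below assumes results are a list of dicts with `Ref_name` and `ANI` keys, as in the query example above; the records are made up for illustration, and `filter_by_ani` is a hypothetical helper.

```python
# Made-up records in the shape used by the query example (not real results).
example_matches = [
    {"Ref_name": "genome_A", "ANI": 98.7},
    {"Ref_name": "genome_B", "ANI": 82.4},
    {"Ref_name": "genome_C", "ANI": 95.3},
]


def filter_by_ani(matches, min_ani=95.0):
    """Keep matches with ANI >= min_ani, highest ANI first."""
    # float() guards against values that arrive as strings.
    kept = [m for m in matches if float(m["ANI"]) >= min_ani]
    return sorted(kept, key=lambda m: float(m["ANI"]), reverse=True)


same_species = filter_by_ani(example_matches)
for m in same_species:
    print(f"{m['Ref_name']}: ANI={float(m['ANI']):.2f}%")
```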

## 5. Manage Sketch Databases

In [None]:
# List all registered databases
databases = util.list_databases()

print("Registered SKANI Databases:")
print("=" * 60)
if databases:
    for db in databases:
        print(f"Name: {db['name']}")
        print(f"  Path: {db['path']}")
        print(f"  Genomes: {db['genome_count']}")
        print(f"  Description: {db['description']}")
        print()
else:
    print("No databases registered yet.")
    print("Create one using sketch_genome_directory()")

## 6. Add External Database

In [None]:
# Example: Register an externally created SKANI database
'''
success = util.add_skani_database(
    database_name="GTDB_Bacteria",
    database_path="/path/to/gtdb_bacteria.sketch",
    description="GTDB bacterial genomes",
    genome_count=50000
)

print(f"Database registered: {success}")
'''

print("Example: Register externally created sketch database")
print("Useful for sharing databases or using pre-computed sketches")
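
When registering a database created elsewhere, a cheap sanity check on the path can catch typos before the entry is recorded. The source does not say whether a sketch database is a single file or a directory, so the hypothetical helper below accepts either; it checks only that the path exists and is not an empty directory, and does not validate the sketch contents.

```python
from pathlib import Path


def sketch_path_problem(database_path):
    """Return a description of the problem with the path, or None if it looks usable."""
    path = Path(database_path)
    if not path.exists():
        return f"{path} does not exist"
    if path.is_dir() and not any(path.iterdir()):
        return f"{path} is an empty directory"
    return None


# Placeholder path from the example above; replace with your own.
problem = sketch_path_problem("/path/to/gtdb_bacteria.sketch")
print(problem or "Path looks usable")
```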

## Summary

SKANIUtils provides:
- **Fast sketching** - Create sketch databases from directories of FASTA genomes
- **ANI calculation** - Report Average Nucleotide Identity between a query and database genomes
- **Database management** - Track and reuse sketch databases
- **Flexible querying** - Find similar genomes quickly

### Next Steps
- Sketch your genome collections
- Query genomes for taxonomic identification
- Build reference databases for your projects