# Database pipeline for quantification of E. coli from metagenomic data

This notebook is a simplified version of `ecoli.ipynb`, found in this same directory, that uses a small subset of markers (MLST genes only).

**[IMPORTANT] This notebook does NOT construct the database used in our paper. It downloads only _REFERENCE_ genomes, and _ONLY_ from the species "Escherichia coli", using the NCBI Datasets API. Analyses of metagenomic samples should probably use all available genomes at the family level.**

## Prerequisites

### Software requirements

We recommend using a `conda` environment for this notebook.
The following software must be installed:
- chronostrain (python>=3.10, the basic recipe `conda_basic.yml` or the full recipe `conda_full.yml`)
- dashing2 (2023 Baker and Langmead: https://github.com/dnbaker/dashing2)
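
Before running the recipe, it can help to confirm that the required executables are visible on `PATH`. The following is a minimal sketch using only the Python standard library. The tool list is an assumption based on the commands that appear later in this notebook and its logs (`chronostrain`, `dashing2`, `datasets`, `makeblastdb`, `blastn`); adjust it to match your installation.

```python
import shutil

# Executable names are assumptions based on the tools this notebook invokes.
REQUIRED_TOOLS = ["chronostrain", "dashing2", "datasets", "makeblastdb", "blastn"]


def find_missing_tools(tools, path=None):
    """Return the subset of `tools` that cannot be found on the given PATH."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

Note that the settings cell below appends `DASHING2_DIR` to `PATH`, so `dashing2` may only be found after that cell has run.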

### Hardware requirements

None of the operations in this notebook require a GPU.

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


""" ============================================ EDIT THESE SETTINGS BASED ON USER'S CHOICE. ============================================ """
""" RefSeq catalog settings"""
TARGET_DIR = Path("./ecoli_db_simple")  # the base directory for everything else.
TARGET_TAXA = "Escherichia coli"  # the taxonomic identifier to download. Can be species, genus or even family.
NCBI_REFSEQ_DIR = TARGET_DIR / "ref_genomes"
REFSEQ_INDEX = NCBI_REFSEQ_DIR / "index.tsv"

""" RefSeq BLAST database """
BLAST_DB_DIR = TARGET_DIR / "blast_db"  # The location of the Blast DB that you wish to create using RefSeq indices (they will be downloaded by this notebook).
BLAST_DB_NAME = "Enterobacteriaceae_refseq"  # Blast DB to create.

""" Marker seeds """
MARKER_SEED_DIR = TARGET_DIR / "marker_seeds"

""" chronostrain-specific settings """
NUM_CORES = 8  # number of cores to use (e.g. for blastn)
MIN_PCT_IDTY = 75  # accept BLAST hits as markers above this threshold.
CHRONOSTRAIN_DB_DIR = TARGET_DIR / "chronostrain_files"  # The directory to use for chronostrain's database files.
CHRONOSTRAIN_TARGET_JSON = CHRONOSTRAIN_DB_DIR / "ecoli.json"  # the desired final product.
CHRONOSTRAIN_TARGET_CLUSTERS = CHRONOSTRAIN_DB_DIR / "ecoli.clusters.txt"  # the clustering file.
DASHING2_DIR = Path("/home/youn/work/bin")  # Directory that contains the dashing2 executable.

""" MetaPhlAn settings """
# METAPHLAN_DB_PATH = Path("/mnt/e/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103/mpa_vJan21_CHOCOPhlAnSGB_202103.pkl") # MetaPhlan 3 or newer
# METAPHLAN_TAXONOMIC_KEY = 's__Escherichia_coli'

""" ============================================ DO NOT EDIT BELOW ============================================ """
""" environment variable extraction """
try:
    VARS_SET
except NameError:
    VARS_SET = True
    _cwd = %pwd
    _parent_cwd = Path(_cwd).parent
    _start_path = %env PATH

# Work in parent directory, where all the helper scripts and settings.sh are.
%cd "$_parent_cwd"
%env TARGET_TAXA=$TARGET_TAXA
%env NCBI_REFSEQ_DIR=$NCBI_REFSEQ_DIR
%env REFSEQ_INDEX=$REFSEQ_INDEX
# Need basic executables, such as "which" and "basename" (required by primersearch)
%env PATH=/usr/bin:$_start_path:$DASHING2_DIR

/home/youn/work/chronostrain/examples/database
env: TARGET_TAXA=Escherichia coli
env: NCBI_REFSEQ_DIR=ecoli_db_simple/ref_genomes
env: REFSEQ_INDEX=ecoli_db_simple/ref_genomes/index.tsv
env: PATH=/usr/bin:/usr/bin:/home/youn/mambaforge/envs/chronostrain2/bin:/home/youn/work/bin


In [2]:
!dashing2 --version

#Calling Dashing2 version v2.1.19 with command '/home/youn/work/chronostrain/examples/database/dashing2 --version'
dashing2 has several subcommands: sketch, cmp, wsketch, and contain.
Usage can be seen in those subcommands. (e.g., `dashing2 sketch -h`)

	sketch: converts FastX into k-mer sets/sketches, and sketches BigWig and BED files; also contains functionality from cmp, for one-step sketch and comparisons
This is probably the most common subcommand to use.

	cmp: compares previously sketched/decomposed k-mer sets and emits results. alias: dist

	contain: Takes a k-mer database (built with dashing2 sketch --save-kmers), then computes coverage for all k-mer references using input streams.
	wsketch: Takes a tuple of [1-3] input binary files [(u32 or u64), (float or double), (u32 or u64)] and performs weighted minhash sketching.
Three files are treated as Compressed Sparse Row (CSR)-format, where the third file contains indptr values, specifying the lengths of consecutive runs of pairs

# Recipe starts here.

In [3]:
# Prepare directories.
TARGET_DIR.mkdir(exist_ok=True, parents=True)
BLAST_DB_DIR.mkdir(exist_ok=True, parents=True)
NCBI_REFSEQ_DIR.mkdir(exist_ok=True, parents=True)
MARKER_SEED_DIR.mkdir(exist_ok=True, parents=True)

## Step 1: Download chromosomal catalog.

In [13]:
!python python_helpers/download_ncbi.py -t "Escherichia coli" -d $NCBI_REFSEQ_DIR -o $REFSEQ_INDEX --level "complete" --reference-only

2025-05-17 05:29:33,257 - Got reference only = True. Fetching reference genomes only.
2025-05-17 05:29:33,257 - Handling taxid Escherichia coli
2025-05-17 05:29:33,257 - [EXECUTE] datasets summary genome taxon "Escherichia coli" --assembly-source all --assembly-version latest --assembly-level complete --exclude-atypical --mag exclude --reference> ecoli_db_simple/ref_genomes/catalog.Escherichia_coli.json
2025-05-17 05:29:33,542 - Found 2 raw records.
2025-05-17 05:29:33,543 - [Escherichia coli] Found 2 unique strain records.
SLEEPING -- Escherichia coli
2025-05-17 05:29:34,544 - Downloading assemblies to directory ecoli_db_simple/ref_genomes
[EXECUTE] datasets download genome accession --inputfile ecoli_db_simple/ref_genomes/__ncbi_input.txt --include genome,gff3 --assembly-version latest --filename ecoli_db_simple/ref_genomes/ncbi_dataset.zip
Collecting 2 genome records [------------------------------------------------]   0% 0/2

## Step 2: Download MLST markers.

In [8]:
!python python_helpers/mlst_download.py -t "Escherichia coli" -w "$TARGET_DIR"/mlst_schema -o "$MARKER_SEED_DIR"/mlst_seeds.tsv

Downloading marker seeds from MLST schema.
Targeting 1 taxa using MLST scheme.
Fetching URL resource https://pubmlst.org/static/data/dbases.xml
Got a response of size 152.35 KB.
Schema type id: 1
Handling locus adk
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/adk/alleles_fasta
Got a response of size 1021.48 KB.
Handling locus fumC
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/fumC/alleles_fasta
Got a response of size 1.19 MB.
Handling locus gyrB
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/gyrB/alleles_fasta
Got a response of size 752.67 KB.
Handling locus icd
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/icd/alleles_fasta
Got a response of size 1.05 MB.
Handling locus mdh
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/mdh/alleles_fasta
Got a response of size 740.53 KB.
Handling locus purA
Fetching
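
The log above shows that allele sequences are fetched per locus from the PubMLST REST API. As an illustration only, the sketch below rebuilds those URLs from the pattern visible in the log. It performs no network requests and is not the code in `mlst_download.py`; the helper name `alleles_fasta_url` is hypothetical, and only loci that appear in the log are listed.

```python
# URL pattern copied from the download log; the database name is specific to
# the E. coli Achtman scheme and would differ for other taxa.
PUBMLST_REST = "https://rest.pubmlst.org/db"
ECOLI_SEQDEF_DB = "pubmlst_ecoli_achtman_seqdef"


def alleles_fasta_url(locus, seqdef_db=ECOLI_SEQDEF_DB):
    """Build the allele FASTA URL for one locus (hypothetical helper)."""
    return f"{PUBMLST_REST}/{seqdef_db}/loci/{locus}/alleles_fasta"


# Loci as they appear in the log output of this step.
for locus in ["adk", "fumC", "gyrB", "icd", "mdh", "purA"]:
    print(locus, alleles_fasta_url(locus))
```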

## Step 3: Invoke `chronostrain make-db`.

In [9]:
!env \
    JAX_PLATFORM_NAME=cpu \
    CHRONOSTRAIN_DB_DIR={CHRONOSTRAIN_DB_DIR} \
    CHRONOSTRAIN_LOG_INI={_cwd}/logging.ini \
    chronostrain -c chronostrain.ini \
      make-db \
      -m "$MARKER_SEED_DIR"/mlst_seeds.tsv \
      -r $REFSEQ_INDEX \
      -b $BLAST_DB_NAME -bd $BLAST_DB_DIR \
      --min-pct-idty $MIN_PCT_IDTY \
      -o $CHRONOSTRAIN_TARGET_JSON \
      --threads $NUM_CORES

2025-05-17 05:21:49,732 [DEBUG - chronostrain.logging.initialize] - Using logging configuration /home/youn/work/chronostrain/examples/database/complete_recipes/logging.ini
2025-05-17 05:21:50,401 [DEBUG - chronostrain.config.initialize] - Loaded chronostrain INI from chronostrain.ini.
2025-05-17 05:21:50,489 [INFO - chronostrain.cli.make_db_json] - Using marker index: ecoli_db_simple/marker_seeds/mlst_seeds.tsv
2025-05-17 05:21:50,491 [INFO - chronostrain.cli.make_db_json] - Adding strain index catalog: ecoli_db_simple/ref_genomes/index.tsv [2 entries; 1 species]
2025-05-17 05:21:50,491 [INFO - chronostrain.cli.make_db_json] - Building raw DB using BLAST.
2025-05-17 05:21:50,492 [INFO - chronostrain.cli.make_db_json] - Blast DB `Enterobacteriaceae_refseq` not found in ecoli_db_simple/blast_db. Running makeblastdb.
2025-05-17 05:21:50,492 [INFO - chronostrain.cli.make_db_json] - Concatenating input files from 2 entries.
2025-05-17 05:21:50,510 [DEBUG - chronostrain.util.external.command
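
To illustrate what the `--min-pct-idty` threshold means, the sketch below filters BLAST hits in the standard tabular format (`-outfmt 6`) by percent identity. This is not chronostrain's implementation: the function name, the inclusive comparison at the threshold, and the two example hits are all assumptions made for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tsv_handle, min_pct_idty):
    """Yield hits (as dicts) whose percent identity is at least `min_pct_idty`."""
    reader = csv.DictReader(tsv_handle, fieldnames=OUTFMT6_COLUMNS, delimiter="\t")
    for hit in reader:
        if float(hit["pident"]) >= min_pct_idty:
            yield hit


# Two made-up hits, for illustration only.
example = io.StringIO(
    "adk_1\tstrainA\t99.1\t536\t5\t0\t1\t536\t100\t635\t0.0\t960\n"
    "adk_1\tstrainB\t71.4\t500\t143\t0\t1\t500\t200\t699\t1e-50\t200\n"
)
kept = list(filter_hits(example, min_pct_idty=75))
print([hit["sseqid"] for hit in kept])
```

With `MIN_PCT_IDTY = 75`, the first hit (99.1% identity) is kept and the second (71.4%) is discarded.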

# Clustering?

Note that this example downloads just a few genomes (two E. coli genomes, as of May 2025) because we request only reference E. coli genomes from NCBI.
These two genomes are fairly different strains, so there is no need to cluster.

However, if you did want to cluster, you would run something like this:

In [10]:
!env \
    JAX_PLATFORM_NAME=cpu \
    CHRONOSTRAIN_DB_DIR={CHRONOSTRAIN_DB_DIR} \
    CHRONOSTRAIN_LOG_INI={_cwd}/logging.ini \
    chronostrain -c chronostrain.ini \
      cluster-db \
      -i $CHRONOSTRAIN_TARGET_JSON \
      -o $CHRONOSTRAIN_TARGET_CLUSTERS \
      --ident-threshold 0.998

2025-05-17 05:25:11,444 [DEBUG - chronostrain.logging.initialize] - Using logging configuration /home/youn/work/chronostrain/examples/database/complete_recipes/logging.ini
2025-05-17 05:25:11,456 [INFO - chronostrain.cli.prune_json] - Pruning database via clustering
2025-05-17 05:25:11,456 [DEBUG - chronostrain.cli.prune_json] - Src: ecoli_db_simple/chronostrain_files/ecoli.json, Output: ecoli_db_simple/chronostrain_files/ecoli.clusters.txt
2025-05-17 05:25:11,456 [INFO - chronostrain.cli.prune_json] - Target identity threshold = 0.998
2025-05-17 05:25:12,114 [DEBUG - chronostrain.config.initialize] - Loaded chronostrain INI from chronostrain.ini.
2025-05-17 05:25:12,203 [INFO - chronostrain.cli.prune_json] - Preprocessing step -- Loading old DB instance, using data directory: ecoli_db_simple/chronostrain_files
2025-05-17 05:25:12,203 [DEBUG - chronostrain.database.parser.json] - Couldn't find instance (ecoli_db_simple/chronostrain_files/__ecoli_/database.posix.pkl).
2025-05-17 05:25:1
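
To give a rough intuition for what an identity threshold such as `0.998` does, the sketch below groups strains by single-linkage clustering: any two strains whose pairwise identity meets the threshold end up in the same cluster. This is a generic illustration and may differ from the algorithm that `chronostrain cluster-db` actually uses; the function name, strain names, and identity values are hypothetical.

```python
def cluster_by_threshold(names, identity, threshold):
    """Single-linkage clustering: link any pair whose identity >= threshold.

    `identity` maps frozenset({a, b}) to a similarity in [0, 1].
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, value in identity.items():
        a, b = tuple(pair)
        if value >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical strain names and identities, for illustration only.
names = ["s1", "s2", "s3"]
identity = {
    frozenset({"s1", "s2"}): 0.9995,
    frozenset({"s1", "s3"}): 0.9700,
    frozenset({"s2", "s3"}): 0.9710,
}
print(cluster_by_threshold(names, identity, threshold=0.998))
```

Here `s1` and `s2` fall into one cluster because their identity exceeds the threshold, while `s3` remains on its own. With only two fairly different reference genomes, as in this notebook, every strain would stay in its own cluster.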