# Database pipeline for quantification of _E. coli_ from metagenomic data

## What's in this notebook?

This notebook implements the pipeline that we used to generate the JSON configuration file (`ecoli.json`) for ChronoStrain in our paper. Our database encodes the sequences from _E. coli_ that align to the marker seeds, as well as all sequences known to us at the family level with >75% similarity.
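
To make the similarity cutoff concrete, here is a minimal sketch of filtering BLAST hits by percent identity. It is not ChronoStrain's implementation (the `make-db` command in Step 4 handles this internally). It assumes BLAST's standard 12-column tabular output (`-outfmt 6`); the function name, the toy accessions, and the inclusive boundary (`>=`) are assumptions made for illustration.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits_by_identity(handle, min_pct_idty=75.0):
    """Return the hits (as dicts) whose percent identity meets the threshold."""
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    return [row for row in reader if float(row["pident"]) >= min_pct_idty]


# Toy input: two hits for a hypothetical seed, one above and one below 75%.
toy_output = io.StringIO(
    "adk_1\tNZ_EXAMPLE.1\t98.5\t536\t8\t0\t1\t536\t1000\t1535\t0.0\t950\n"
    "adk_1\tNZ_EXAMPLE.2\t71.2\t520\t150\t2\t1\t520\t2000\t2519\t1e-50\t200\n"
)
kept = filter_hits_by_identity(toy_output, min_pct_idty=75.0)
print([hit["sseqid"] for hit in kept])
```

In this notebook the threshold corresponds to the `MIN_PCT_IDTY` setting defined below.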

In principle, the JSON file is the only thing needed to run ChronoStrain; but this notebook has some other by-products:

1. A repository of chromosomal assemblies for the Enterobacteriaceae family, with a TSV-formatted index (`index.tsv`)
2. A repository of marker seed sequences and associated gene names.
3. (Optional) A binary-stored instance of the JSON database, readable by `chronostrain`. 

If the sequence records listed in the JSON file do not exist locally (the directory is specified in the configuration file `chronostrain.ini`), then ChronoStrain will attempt to download them by searching for each ID on NCBI. This notebook creates symbolic links to the sequences found in the chromosomal assembly index (by-product #1) to skip this process.
Once the sequences are available, the database is loaded by extracting subsequences from the FASTA records; this is a heavily I/O-bound task, so the pre-loaded binary (by-product #3) helps skip the initialization.
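
The symbolic-link idea can be sketched as follows. This is only an illustration: the helper name `link_sequence` and the `<accession>.fasta` file-naming scheme are hypothetical, and the layout that ChronoStrain actually expects inside its data directory may differ.

```python
import tempfile
from pathlib import Path


def link_sequence(source_fasta: Path, data_dir: Path, accession: str) -> Path:
    """Create data_dir/<accession>.fasta as a symlink to source_fasta (idempotent)."""
    data_dir.mkdir(parents=True, exist_ok=True)
    link_path = data_dir / f"{accession}.fasta"  # hypothetical naming scheme
    if link_path.is_symlink() or link_path.exists():
        return link_path  # already present; nothing to do
    link_path.symlink_to(source_fasta.resolve())
    return link_path


# Toy demonstration inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    src = tmp / "catalog" / "NZ_EXAMPLE.1.chrom.fna"
    src.parent.mkdir()
    src.write_text(">NZ_EXAMPLE.1\nACGT\n")
    link = link_sequence(src, tmp / "chronostrain_files", "NZ_EXAMPLE.1")
    link_is_symlink = link.is_symlink()
    linked_text = link.read_text()

print(link_is_symlink, linked_text.splitlines()[0])
```

Linking instead of copying avoids duplicating tens of gigabytes of assemblies on disk.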

## Prerequisites

### Software requirements

We recommend using a `conda` environment for this notebook, with `ipywidgets` installed and updated.
This notebook requires that the following software is installed.
- chronostrain (python>=3.10, the basic recipe `conda_basic.yml` or the full recipe `conda_full.yml`)
- primersearch (http://emboss.open-bio.org/, https://anaconda.org/bioconda/emboss)
- dashing2 (2023 Baker and Langmead: https://github.com/dnbaker/dashing2)
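
Before running the recipe, it can help to confirm that the executables are visible on `PATH`. The following is a minimal sketch using only the standard library; the helper name `find_missing_tools` is ours, and the executable names are taken from the commands invoked later in this notebook.

```python
import shutil

# Executable names as they are invoked in this notebook.
REQUIRED_TOOLS = ["chronostrain", "primersearch", "dashing2"]


def find_missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```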

### Hardware requirements
 
None of the operations of this notebook requires a GPU. 
As of Feb 2023, we estimate that the contents of this notebook require ~60 GB of hard disk space. At the time that we ran this pipeline, the catalog of chromosomal assemblies totalled 38.8 GB; other files (such as the BLAST database, marker seeds and ChronoStrain-specific by-products) totalled 17.2 GB, with a peak of ~60 GB when accounting for temporary files.
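
A quick way to check the available disk space before starting is sketched below. The helper name `has_free_space` is ours, and `"."` stands in for the base directory (`TARGET_DIR` in this notebook).

```python
import shutil
from pathlib import Path


def has_free_space(directory: Path, required_gb: float) -> bool:
    """Return True if the filesystem holding `directory` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1e9


# Replace Path(".") with the base directory that will hold the database.
print("Enough space for ~60 GB:", has_free_space(Path("."), 60))
```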

## File paths and environment variables

In [14]:
from pathlib import Path
import pandas as pd
import numpy as np

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


""" ============================================ EDIT THESE SETTINGS BASED ON USER'S CHOICE. ============================================ """
""" RefSeq catalog settings"""
TARGET_DIR = Path("/mnt/e/ecoli_db")  # the base directory for everything else.
TARGET_TAXA = "Enterobacteriaceae"  # the taxonomic identifier to download. Can be species, genus or even family.
NCBI_REFSEQ_DIR = TARGET_DIR / "ref_genomes"
REFSEQ_INDEX = NCBI_REFSEQ_DIR / "index.tsv"

""" RefSeq BLAST database """
BLAST_DB_DIR = TARGET_DIR / "blast_db"  # The location of the Blast DB that you wish to create using RefSeq indices (they will be downloaded by this notebook).
BLAST_DB_NAME = "Enterobacteriaceae_refseq"  # Blast DB to create.

""" Marker seeds """
MARKER_SEED_DIR = TARGET_DIR / "marker_seeds"
MARKER_SEED_INDEX = MARKER_SEED_DIR / "marker_seed_index.tsv"

""" chronostrain-specific settings """
NUM_CORES = 8  # number of cores to use (e.g. for blastn)
MIN_PCT_IDTY = 75  # accept BLAST hits as markers above this threshold.
CHRONOSTRAIN_DB_DIR = TARGET_DIR / "chronostrain_files"  # The directory to use for chronostrain's database files.
CHRONOSTRAIN_TARGET_JSON = CHRONOSTRAIN_DB_DIR / "ecoli.json"  # the desired final product.
CHRONOSTRAIN_TARGET_CLUSTERS = CHRONOSTRAIN_DB_DIR / "ecoli.clusters.txt"  # the clustering file.
CHRONOSTRAIN_TARGET_CLUSTERS_100pct = CHRONOSTRAIN_DB_DIR / "ecoli.clusters_100pct.txt"  # the clustering file.
DASHING2_DIR = Path("/home/youn/work/bin")  # Directory that contains the dashing2 executable.

""" MetaPhlAn settings """
# METAPHLAN_DB_PATH = Path("/mnt/e/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103/mpa_vJan21_CHOCOPhlAnSGB_202103.pkl") # MetaPhlan 3 or newer
# METAPHLAN_TAXONOMIC_KEY = 's__Escherichia_coli'

""" ============================================ DO NOT EDIT BELOW ============================================ """
""" environment variable extraction """
try:
    VARS_SET
except NameError:
    VARS_SET = True
    _cwd = %pwd
    _parent_cwd = Path(_cwd).parent
    _start_path = %env PATH

# Work in parent directory, where all the helper scripts and settings.sh are.
%cd "$_parent_cwd"
%env TARGET_TAXA=$TARGET_TAXA
%env NCBI_REFSEQ_DIR=$NCBI_REFSEQ_DIR
%env REFSEQ_INDEX=$REFSEQ_INDEX
# Need basic executables, such as "which" and "basename" (required by primersearch)
%env PATH=/usr/bin:$_start_path:$DASHING2_DIR

/home/youn/work/chronostrain/examples/database
env: TARGET_TAXA=Enterobacteriaceae
env: NCBI_REFSEQ_DIR=/mnt/e/ecoli_db/ref_genomes
env: REFSEQ_INDEX=/mnt/e/ecoli_db/ref_genomes/index.tsv
env: PATH=/usr/bin:/home/youn/mambaforge/envs/chronostrain2/bin:/home/youn/work/bin


In [2]:
!primersearch --version
!dashing2 --version

EMBOSS:6.6.0.0
#Calling Dashing2 version v2.1.19 with command '/home/youn/work/chronostrain/examples/database/dashing2 --version'
dashing2 has several subcommands: sketch, cmp, wsketch, and contain.
Usage can be seen in those subcommands. (e.g., `dashing2 sketch -h`)

	sketch: converts FastX into k-mer sets/sketches, and sketches BigWig and BED files; also contains functionality from cmp, for one-step sketch and comparisons
This is probably the most common subcommand to use.

	cmp: compares previously sketched/decomposed k-mer sets and emits results. alias: dist

	contain: Takes a k-mer database (built with dashing2 sketch --save-kmers), then computes coverage for all k-mer references using input streams.
	wsketch: Takes a tuple of [1-3] input binary files [(u32 or u64), (float or double), (u32 or u64)] and performs weighted minhash sketching.
Three files are treated as Compressed Sparse Row (CSR)-format, where the third file contains indptr values, specifying the lengths of consecutiv

## Recipe starts here.

In [4]:
# Prepare directories.
TARGET_DIR.mkdir(exist_ok=True, parents=True)
BLAST_DB_DIR.mkdir(exist_ok=True, parents=True)
NCBI_REFSEQ_DIR.mkdir(exist_ok=True, parents=True)
MARKER_SEED_DIR.mkdir(exist_ok=True, parents=True)

### Step 1: Download chromosomal catalog.

In [3]:
!bash download_ncbi2.sh

^C

Aborted!


### Step 2: Build Blast database.

In [4]:
!env BLAST_DB_DIR=$BLAST_DB_DIR \
    BLAST_DB_NAME=$BLAST_DB_NAME \
    bash create_blast_db.sh

[*] Creating Blast database.
Target fasta file: __tmp_refseqs.fasta
Concatenating /mnt/e/ecoli_db/ref_genomes/human_readable/refseq/bacteria/Buttiauxella/agrestis/DSM_9389/NZ_AP023184.1.chrom.fna...
Concatenating /mnt/e/ecoli_db/ref_genomes/human_readable/refseq/bacteria/Buttiauxella/sp./3AFRM03/NZ_CP033076.1.chrom.fna...
Concatenating /mnt/e/ecoli_db/ref_genomes/human_readable/refseq/bacteria/Cedecea/lapagei/NCTC11466/NZ_LR134201.1.chrom.fna...
Concatenating /mnt/e/ecoli_db/ref_genomes/human_readable/refseq/bacteria/Cedecea/neteri/FDAARGOS_392/NZ_CP023525.1.chrom.fna...
Concatenating /mnt/e/ecoli_db/ref_genomes/human_readable/refseq/bacteria/Cedecea/neteri/M006/NZ_CP009458.1.chrom.fna...
Concatenating /mnt/e/ecoli_db/ref_genomes/human_readable/refseq/bacteria/Cedecea/neteri/ND14a/NZ_CP009459.1.chrom.fna...
Concatenating /mnt/e/ecoli_db/ref_genomes/human_readable/refseq/bacteria/Cedecea/neteri/SSMD04/NZ_CP009451.1.chrom.fna...
Concatenating /mnt/e/ecoli_db/ref_genomes/human_readable/re

### Step 3: Build the marker seed catalog.

**GOAL**: a FASTA file of marker seeds (one multi-fasta file per marker gene), and a single TSV file that catalogs them.

However, to get there, we need to take a few steps...

#### Step 3.1: Download MLST schema.

In [5]:
!python python_helpers/mlst_download.py -t "Escherichia coli" -w "$TARGET_DIR"/mlst_schema -o "$MARKER_SEED_DIR"/mlst_seeds.tsv

Downloading marker seeds from MLST schema.
Targeting 1 taxa using MLST scheme.
Fetching URL resource https://pubmlst.org/static/data/dbases.xml
Got a response of size 152.35 KB.
Schema type id: 1
Handling locus adk
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/adk/alleles_fasta
Got a response of size 659.89 KB.
Handling locus fumC
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/fumC/alleles_fasta
Got a response of size 941.86 KB.
Handling locus gyrB
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/gyrB/alleles_fasta
Got a response of size 552.74 KB.
Handling locus icd
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/icd/alleles_fasta
Got a response of size 842.08 KB.
Handling locus mdh
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_ecoli_achtman_seqdef/loci/mdh/alleles_fasta
Got a response of size 571.89 KB.
Handling locus purA
Fetch

#### Step 3.2: Non-standard genes: run GFF-annotated gene search with primers across entire catalog.

In [46]:
"""
Search #1:
-------------
Search for O-antigen gene cluster.
JumpSTART 5'--3' primer = CATGGTAGCTGTAAAGCCAGGGGCGGTAGCGTG
GND 5'--3' primer = CATGCTGCCATACCGACGACGCCGATCTGTTGCTTKGACA


Phylotyping gene primers from: https://github.com/A-BN/ClermonTyping
"""

def primer_search_cluster(gene_name: str, forward_primer: str, rev_primer: str):
    !python python_helpers/identify_gene_cluster_primersearch.py \
        -i "$REFSEQ_INDEX" \
        -t "$TARGET_DIR"/_tmp \
        -o "$MARKER_SEED_DIR"/"$gene_name".feather \
        -g "Escherichia" -s "coli" \
        -p1 "$forward_primer" -p2 "$rev_primer" \
        --mismatch-pct 30 \
        -n "$gene_name" \
        --use-gff


def primer_search_gene(gene_name: str, forward_primer: str, rev_primer: str, expected_len: int):
    !python python_helpers/identify_gene_cluster_primersearch.py \
        -i "$REFSEQ_INDEX" \
        -t "$TARGET_DIR"/_tmp \
        -o "$MARKER_SEED_DIR"/"$gene_name".feather \
        -g "Escherichia" -s "coli" \
        -p1 "$forward_primer" -p2 "$rev_primer" \
        --mismatch-pct 0 \
        -n "$gene_name" \
        -l "$expected_len" \
        --dont-use-gff

# Note: the cluster search uses GFF annotations (--use-gff) to extract the individual genes from the cluster.
# Step 3.3 reads O_antigen_cluster.feather, so run this call if that file does not exist yet.
# primer_search_cluster("O_antigen_cluster", "CATGGTAGCTGTAAAGCCAGGGGCGGTAGCGTG", "CATGCTGCCATACCGACGACGCCGATCTGTTGCTTKGACA")

primer_search_gene("trpA", "GCTACGAATCTCTGTTTGCC", "CGCTTTCATCGGTTGTACA", 783)
primer_search_gene("trpBA", "CGGCGATAAAGACATCTTCAC", "GCAACGCGGCCTGGCGGAAG", 489)
primer_search_gene("chuA", "ATGGTACCGGACGAACCAAC", "TGCCGCCAGTACCAAAGACA", 288)
primer_search_gene("yjaA", "CAAACGTGAAGTGTCAGGAG", "AATGCGTTCCTCAACCTGTG", 211)
primer_search_gene("TspE4.C2", "CACTATTCGTAAGGTCATCC", "AGTTTATCGCTGCGGGTCGC", 152)
primer_search_gene("arpA", "AACGCTATTCGCCAGCTTGC", "TCTCCCCATACCGTACGCTA", 400)
primer_search_gene("ArpAgpE", "GATTCCATCTTGTCAAAATATGCC", "GAAAAGAAAAAGAATTCCCAAGAG", 301)
primer_search_gene("trpAgpC", "AGTTTTATGCCCAGTGCGAG", "TCTGCGCCGGTCACGCCC", 219)
primer_search_gene("aesI", "CCTCTACTCACCCAAAAGTC", "ATCACGTAACCACAACGCAC", 315)
primer_search_gene("aesII", "CGCCTGTTGTCACTTCCACG", "GTTTATCACGCAGCCACAAG", 125)
primer_search_gene("chuIII", "GTGTTGAGATTGTCCGTGGG", "CAAAAGCACTGGCGCCCAG", 183)
primer_search_gene("chuIV", "CTGGCGAAAGGAACCTGGA", "GTTATCTCATCTTGCAGCCAA", 461)
primer_search_gene("chuV", "ACTGTATGGCAGTGGCGCAT", "GCAAAACTATCGGCAAACAGC", 600)
primer_search_gene("ybgD", "GTTGACTAAGCGCAGGTCGA", "TATGCGGCTGATGAAGGATC", 177)



Performing primer-based search for trpA in Escherichia coli. (FWD=GCTACGAATCTCTGTTTGCC, REV=CGCTTTCATCGGTTGTACA, len approx. 783)
Will NOT use GFF files; primer PCR hits will be interpreted as gene hits.
100%|███████████████████████████████████| 2063/2063 [04:39<00:00,  7.38genome/s]
Wrote 1787 dataframe records to trpA.feather
Performing primer-based search for trpBA in Escherichia coli. (FWD=CGGCGATAAAGACATCTTCAC, REV=GCAACGCGGCCTGGCGGAAG, len approx. 489)
Will NOT use GFF files; primer PCR hits will be interpreted as gene hits.
100%|███████████████████████████████████| 2063/2063 [04:43<00:00,  7.29genome/s]
Wrote 2046 dataframe records to trpBA.feather
Performing primer-based search for chuA in Escherichia coli. (FWD=ATGGTACCGGACGAACCAAC, REV=TGCCGCCAGTACCAAAGACA, len approx. 288)
Will NOT use GFF files; primer PCR hits will be interpreted as gene hits.
100%|███████████████████████████████████| 2063/2063 [04:10<00:00,  8.24genome/s]
Wrote 859 dataframe records to chuA.feather
Perfor

In [9]:
"""
Search #2:
------------
fim genes: fim*
H antigen: fliC, flk* fll* flm*
Shigatoxin: stx*
"""
!python python_helpers/extract_genes_by_name.py \
    -i "$REFSEQ_INDEX" \
    -o "$MARKER_SEED_DIR"/misc_genes.feather \
    -g "Escherichia" -s "coli" \
    -p "fim" -p "fliC" -p "flk" -p "fll" -p "flm" -p "stx"

100%|███████████████████████████████████████| 2063/2063 [08:17<00:00,  4.15it/s]


#### Step 3.3: Convert previous step to marker seed (multi-)FASTA files.

ChronoStrain reads marker seeds from FASTA files. The previous step merely creates a raw catalog of genes.

We need to generate FASTA files from that catalog, but some cleanup is required first, since genes are sometimes annotated inconsistently (for example, "stx1A" sometimes shows up as "stxA1").
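
The label cleanup can be expressed as a single alias mapping. The sketch below runs on a toy dataframe; the column names `Gene` and `Accession` follow the dataframes used in this notebook, while the helper name `normalize_gene_labels` and the toy rows are made up for illustration.

```python
import pandas as pd

# Variant spellings mapped to the single label kept in the catalog.
STX_ALIASES = {"stx1A": "stxA1", "stx2A": "stxA2", "stx1B": "stxB1", "stx2B": "stxB2"}


def normalize_gene_labels(df: pd.DataFrame, aliases: dict) -> pd.DataFrame:
    """Return a copy of `df` whose Gene column has aliases replaced by canonical labels."""
    out = df.copy()
    out["Gene"] = out["Gene"].replace(aliases)
    return out


toy = pd.DataFrame({"Gene": ["stx1A", "stxA1", "fimH"], "Accession": ["A", "B", "C"]})
fixed = normalize_gene_labels(toy, STX_ALIASES)
print(fixed["Gene"].tolist())
```

Keeping the aliases in one dictionary makes it easy to extend the cleanup when other inconsistent labels turn up.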

In [54]:
gene_dfs = []

In [55]:
# Each name must match a `<name>.feather` file in MARKER_SEED_DIR (written in Step 3.2).
for gene_name in ['arpA', 'chuA', 'yjaA', 'tspE4.C4', 'ArpAgpE.f', 'trpAgpC.1']:
    feather_file = MARKER_SEED_DIR / f'{gene_name}.feather'
    df = pd.read_feather(feather_file)
    if df.shape[0] > 1000:
        df = df.sample(n=1000)
    if df.shape[0] > 0:
        gene_dfs.append(df)

In [57]:
np.random.seed(1234)  # for reproducibility

# ========== Load the dataframe.
misc_df = pd.read_feather(MARKER_SEED_DIR / "misc_genes.feather")
misc_df['GeneFull'] = misc_df['Gene']
index_df = pd.read_csv(REFSEQ_INDEX, sep='\t')  # use the configured path rather than a hard-coded one
misc_df = misc_df.merge(index_df[['Accession', 'Species']], on='Accession')
misc_df = misc_df.loc[misc_df['Species'] == 'coli']  # this is probably no longer needed as of latest version of this notebook.

print("Raw counts:")
display(misc_df.groupby("Gene")['Accession'].count().rename("Counts Per Gene"))

# Subsample if more than 1000 hits.
misc_df = misc_df.groupby("Gene").apply(
    lambda x: x.sample(n=min(1000, x.shape[0]))
)
misc_df = misc_df.set_index([pd.Index(list(range(misc_df.shape[0])))])


# STX labels are inconsistent. Fix them.
misc_df.loc[misc_df['Gene'] == 'stx1A', 'Gene'] = 'stxA1'
misc_df.loc[misc_df['Gene'] == 'stx2A', 'Gene'] = 'stxA2'
misc_df.loc[misc_df['Gene'] == 'stx1B', 'Gene'] = 'stxB1'
misc_df.loc[misc_df['Gene'] == 'stx2B', 'Gene'] = 'stxB2'

# print statistics
print("After fixing & subsampling: ")
display(misc_df.groupby("Gene")['Accession'].count().rename("Counts Per Gene"))

gene_dfs.append(misc_df)

Raw counts:


Gene
fim41a       1
fimA      4701
fimB      1625
fimC      3420
fimD      1719
fimE      1765
fimF      1858
fimG      1872
fimH      3469
fimI      1771
fimZ      1642
fliC       220
flk       2049
stx1A        1
stx1B        1
stx2A        1
stx2B        1
stxA1      166
stxA2       49
stxB1        2
stxB2       18
Name: Counts Per Gene, dtype: int64

After fixing & subsampling: 


Gene
fim41a       1
fimA      1000
fimB      1000
fimC      1000
fimD      1000
fimE      1000
fimF      1000
fimG      1000
fimH      1000
fimI      1000
fimZ      1000
fliC       220
flk       1000
stxA1      167
stxA2       50
stxB1        3
stxB2       19
Name: Counts Per Gene, dtype: int64
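
As an alternative to seeding NumPy's global random state, the per-gene subsampling can be made reproducible by passing `random_state` to `DataFrame.sample`. This is a sketch on toy data, not the code that produced the counts above; the helper name `subsample_per_gene` is ours.

```python
import pandas as pd


def subsample_per_gene(df: pd.DataFrame, max_per_gene: int = 1000, seed: int = 1234) -> pd.DataFrame:
    """Keep at most `max_per_gene` rows per Gene label, sampled with a fixed seed."""
    parts = [
        group.sample(n=min(max_per_gene, len(group)), random_state=seed)
        for _, group in df.groupby("Gene")
    ]
    return pd.concat(parts, ignore_index=True)


toy = pd.DataFrame({
    "Gene": ["fimA"] * 5 + ["fliC"] * 2,
    "Accession": [f"ACC{i}" for i in range(7)],
})
sub = subsample_per_gene(toy, max_per_gene=3, seed=1234)
print(sub.groupby("Gene")["Accession"].count())
```

Because the seed is passed explicitly, the result does not depend on the order in which the notebook cells were executed.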

In [58]:
np.random.seed(1235)  # for reproducibility

# ========== Load the dataframe.
serotype_O_df = pd.read_feather(MARKER_SEED_DIR / "O_antigen_cluster.feather")
serotype_O_df['GeneFull'] = serotype_O_df['Gene']

index_df = pd.read_csv(REFSEQ_INDEX, sep='\t')  # use the configured path rather than a hard-coded one
serotype_O_df = serotype_O_df.merge(index_df[['Accession', 'Species']], on='Accession')
serotype_O_df = serotype_O_df.loc[serotype_O_df['Species'] == 'coli']

print("Raw counts: ")
display(serotype_O_df.groupby("Gene")['Accession'].count())


# Subsample if more than 1000 hits.
serotype_O_df = serotype_O_df.groupby("Gene").apply(
    lambda x: x.sample(n=min(1000, x.shape[0]))
)
serotype_O_df = serotype_O_df.set_index([pd.Index(list(range(serotype_O_df.shape[0])))])


# Remove all putative genes (e.g. WM47_RS18200). Annotated genes typically follow a 3-letter format (followed by a specifier).
ecoli_serotype_O_genes = set()
for x in serotype_O_df.groupby("Gene")['Accession'].count().index:
    if "_" in x:
        continue
    ecoli_serotype_O_genes.add(x)
serotype_O_df = serotype_O_df.loc[serotype_O_df['Gene'].isin(ecoli_serotype_O_genes)]


# print statistics
print("After fixing & subsampling: ")
display(serotype_O_df.groupby("Gene")['Accession'].count())

gene_dfs.append(serotype_O_df)

Raw counts: 


Gene
A0I22_RS25315       1
A0I22_RS25325       1
A0I22_RS25340       1
A0I22_RS25350       1
A0I22_RS25355       1
                 ... 
wzy              1324
yfgO                1
ypfH                1
ypfJ                1
ypfN                1
Name: Accession, Length: 11232, dtype: int64

After fixing & subsampling: 


Gene
acs        1
arnB       1
asnB      11
bamC       1
bcp        1
        ... 
wzy     1000
yfgO       1
ypfH       1
ypfJ       1
ypfN       1
Name: Accession, Length: 95, dtype: int64
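
The locus-tag filter in the cell above relies on a simple heuristic: labels containing an underscore are treated as putative genes and dropped. A vectorized equivalent, shown on toy data, might look like the sketch below; the helper name `drop_locus_tags` and the toy accessions are assumptions.

```python
import pandas as pd


def drop_locus_tags(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose Gene label has no underscore (the heuristic used above)."""
    is_locus_tag = df["Gene"].str.contains("_", regex=False)
    return df.loc[~is_locus_tag].reset_index(drop=True)


toy = pd.DataFrame({
    "Gene": ["wzy", "WM47_RS18200", "A0I22_RS25315", "asnB"],
    "Accession": ["ACC1", "ACC2", "ACC3", "ACC4"],
})
named = drop_locus_tags(toy)
print(named["Gene"].tolist())
```

Note that the heuristic would also discard any properly named gene whose label happens to contain an underscore.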

In [59]:
print("Concatenating.")
concat_gene_df = pd.concat(gene_dfs, ignore_index=True)
print("# CUSTOM GENES TOTAL = {}".format( len(pd.unique(concat_gene_df['Gene'])) ))


# ======= Write FASTA files.
print("Generating FASTA files.")
for gene, section in concat_gene_df.groupby("Gene"):
    if gene.endswith("*"):
        prefix = gene[:-1]
        gene_name_for_file = f"{prefix}_any"
    else:
        gene_name_for_file = gene

    target_fasta = MARKER_SEED_DIR / f'{gene_name_for_file}.fasta'
    
    print("Handling {} ({} records) --> {}".format(gene, section.shape[0], target_fasta))
    with open(target_fasta, 'wt') as f:
        for row_idx, row in section.iterrows():
            record = SeqRecord(
                Seq(row['GeneSeq']),
                id="{}_{}".format(gene, row_idx),
                description="Src={}:{}".format(row['Accession'], row['GeneFull'])
            )
            SeqIO.write([record], f, 'fasta')

            
# ======= Human-readable metadata (to be written to TSV)
o_antigen_genes = set(pd.unique(serotype_O_df['Gene']))  # computed once instead of on every call


def assign_misc_metadata(gene_name):
    if gene_name.startswith("fim"):
        return "Fimbrial_gene"
    elif gene_name.startswith("stx"):
        return "ShigaToxin"
    elif gene_name == "fliC" or gene_name.startswith(("flk", "fll", "flm")):
        return "H-antigen"
    elif gene_name in o_antigen_genes:
        return "O-antigen"
    else:
        return "Misc gene"


# ======= write index TSV file.
tsv_path = MARKER_SEED_DIR / "manual_seeds.tsv"
print(f"Writing TSV file: {tsv_path}")

with open(tsv_path, 'wt') as metadata_f:
    for gene, section in concat_gene_df.groupby("Gene"):
        if gene.endswith("*"):
            prefix = gene[:-1]
            gene_name_for_file = f"{prefix}_any"
        else:
            gene_name_for_file = gene

        target_fasta = MARKER_SEED_DIR / f'{gene_name_for_file}.fasta'
        print(
            "{}\t{}\t{}".format(
                gene_name_for_file, target_fasta, assign_misc_metadata(gene)
            ), 
            file=metadata_f
        )

print("Cleaning up.")
del concat_gene_df
del misc_df
del serotype_O_df

Concatenating.
# CUSTOM GENES TOTAL = 118
Generating FASTA files.
Handling ArpAgpE.f (1000 records) --> /mnt/e/ecoli_db/marker_seeds/ArpAgpE.f.fasta
Handling acs (1 records) --> /mnt/e/ecoli_db/marker_seeds/acs.fasta
Handling arnB (1 records) --> /mnt/e/ecoli_db/marker_seeds/arnB.fasta
Handling arpA (1000 records) --> /mnt/e/ecoli_db/marker_seeds/arpA.fasta
Handling asnB (11 records) --> /mnt/e/ecoli_db/marker_seeds/asnB.fasta
Handling bamC (1 records) --> /mnt/e/ecoli_db/marker_seeds/bamC.fasta
Handling bcp (1 records) --> /mnt/e/ecoli_db/marker_seeds/bcp.fasta
Handling bepA (1 records) --> /mnt/e/ecoli_db/marker_seeds/bepA.fasta
Handling chuA (859 records) --> /mnt/e/ecoli_db/marker_seeds/chuA.fasta
Handling cpsB (172 records) --> /mnt/e/ecoli_db/marker_seeds/cpsB.fasta
Handling cpsG (570 records) --> /mnt/e/ecoli_db/marker_seeds/cpsG.fasta
Handling dapA (1 records) --> /mnt/e/ecoli_db/marker_seeds/dapA.fasta
Handling dapE (1 records) --> /mnt/e/ecoli_db/marker_seeds/dapE.fasta
Handl

#### Step 3.4: Add MetaPhlAn markers (optional; disabled in this run).

In [60]:
# !python python_helpers/extract_metaphlan_markers.py \
#     -t "$METAPHLAN_TAXONOMIC_KEY" \
#     -i "$METAPHLAN_DB_PATH" \
#     -o "$MARKER_SEED_DIR"/metaphlan_seeds.tsv

#### Step 3.5: process and combine marker TSV files.

In [61]:
!cat "$MARKER_SEED_DIR"/mlst_seeds.tsv > $MARKER_SEED_INDEX
!cat "$MARKER_SEED_DIR"/manual_seeds.tsv >> $MARKER_SEED_INDEX
# !cat "$MARKER_SEED_DIR"/metaphlan_seeds.tsv >> $MARKER_SEED_INDEX

print("Created Marker seed index: {}".format(MARKER_SEED_INDEX))
assert MARKER_SEED_INDEX.exists()

Created Marker seed index: /mnt/e/ecoli_db/marker_seeds/marker_seed_index.tsv


### Step 4: Run ChronoStrain's make-db command.

By the end of the previous step, we have:

1) FASTA files for each gene, listing out seed sequence(s).
2) A TSV file (marker_seed_index.tsv) containing a list of gene names and the paths to each of these FASTA files.

Using these as inputs, we now construct the database files:
1) A JSON file of the strain records and their markers.
2) A TXT file of strain records clustered by similarity.
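
Before launching `make-db`, it may be worth checking that every FASTA file listed in the marker seed index exists. The sketch below assumes the three-column layout written in Step 3.3 (gene name, FASTA path, metadata); the rows produced by the MLST helper script are assumed, not verified, to follow the same layout. The helper name `find_missing_fastas` is ours.

```python
import csv
import tempfile
from pathlib import Path


def find_missing_fastas(index_path: Path):
    """Return (gene_name, fasta_path) pairs whose FASTA file does not exist."""
    missing = []
    with open(index_path, "rt", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            gene_name, fasta_path = row[0], Path(row[1])
            if not fasta_path.is_file():
                missing.append((gene_name, fasta_path))
    return missing


# Toy index with one existing and one missing FASTA file.
with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    present = tmp / "fimH.fasta"
    absent = tmp / "fliC.fasta"
    present.write_text(">fimH_0\nACGT\n")
    index = tmp / "toy_index.tsv"
    index.write_text(
        "fimH\t{}\tFimbrial_gene\n".format(present)
        + "fliC\t{}\tH-antigen\n".format(absent)
    )
    missing_names = [name for name, _ in find_missing_fastas(index)]

print(missing_names)
```

In this notebook the function would be called as `find_missing_fastas(MARKER_SEED_INDEX)`.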

In [9]:
!env \
    JAX_PLATFORM_NAME=cpu \
    CHRONOSTRAIN_DB_DIR={CHRONOSTRAIN_DB_DIR} \
    CHRONOSTRAIN_LOG_INI={_cwd}/logging.ini \
    chronostrain -c chronostrain.ini \
      make-db \
      -m $MARKER_SEED_INDEX \
      -r $REFSEQ_INDEX \
      -b $BLAST_DB_NAME -bd $BLAST_DB_DIR \
      --min-pct-idty $MIN_PCT_IDTY \
      -o $CHRONOSTRAIN_TARGET_JSON \
      --threads $NUM_CORES

2024-04-17 14:12:49,799 [DEBUG - chronostrain.logging.initialize] - Using logging configuration /home/youn/work/chronostrain/examples/database/complete_recipes/logging.ini
2024-04-17 14:12:51,379 [DEBUG - chronostrain.config.initialize] - Loaded chronostrain INI from chronostrain.ini.
2024-04-17 14:12:51,523 [INFO - chronostrain.cli.make_db_json] - Using marker index: /mnt/e/ecoli_db/marker_seeds/marker_seed_index.tsv
2024-04-17 14:12:51,549 [INFO - chronostrain.cli.make_db_json] - Adding strain index catalog: /mnt/e/ecoli_db/ref_genomes/index.tsv [5406 entries; 96 species]
2024-04-17 14:12:51,549 [INFO - chronostrain.cli.make_db_json] - Building raw DB using BLAST.
2024-04-17 14:12:51,693 [DEBUG - chronostrain.cli.make_db_json] - Running blastn on adk.
2024-04-17 14:12:51,693 [DEBUG - chronostrain.cli.make_db_json] - Running blastn on fumC.
2024-04-17 14:12:51,693 [DEBUG - chronostrain.cli.make_db_json] - Running blastn on gyrB.
2024-04-17 14:12:51,694 [DEBUG - chronostrain.cli.make_d

In [11]:
# Perform clustering
thresh = 0.998
!env \
    JAX_PLATFORM_NAME=cpu \
    CHRONOSTRAIN_DB_DIR={CHRONOSTRAIN_DB_DIR} \
    CHRONOSTRAIN_LOG_INI={_cwd}/logging.ini \
    chronostrain -c chronostrain.ini \
      cluster-db \
      -i $CHRONOSTRAIN_TARGET_JSON \
      -o $CHRONOSTRAIN_TARGET_CLUSTERS \
      --ident-threshold $thresh

2024-04-17 18:05:58,241 [DEBUG - chronostrain.logging.initialize] - Using logging configuration /home/youn/work/chronostrain/examples/database/complete_recipes/logging.ini
2024-04-17 18:05:58,272 [INFO - chronostrain.cli.prune_json] - Pruning database via clustering
2024-04-17 18:05:58,272 [DEBUG - chronostrain.cli.prune_json] - Src: /mnt/e/ecoli_db/chronostrain_files/ecoli.json, Output: /mnt/e/ecoli_db/chronostrain_files/ecoli.clusters.txt
2024-04-17 18:05:58,272 [INFO - chronostrain.cli.prune_json] - Target identity threshold = 0.998
2024-04-17 18:05:59,688 [DEBUG - chronostrain.config.initialize] - Loaded chronostrain INI from chronostrain.ini.
2024-04-17 18:05:59,969 [INFO - chronostrain.cli.prune_json] - Preprocessing step -- Loading old DB instance, using data directory: /mnt/e/ecoli_db/chronostrain_files
2024-04-17 18:05:59,969 [DEBUG - chronostrain.database.parser.json] - Couldn't find instance (/mnt/e/ecoli_db/chronostrain_files/__ecoli_/database.posix.pkl).
2024-04-17 18:05:5

In [15]:
# Perform clustering again, this time at a (near-)100% identity threshold
thresh = 0.9999999999
!env \
    JAX_PLATFORM_NAME=cpu \
    CHRONOSTRAIN_DB_DIR={CHRONOSTRAIN_DB_DIR} \
    CHRONOSTRAIN_LOG_INI={_cwd}/logging.ini \
    chronostrain -c chronostrain.ini \
      cluster-db \
      -i $CHRONOSTRAIN_TARGET_JSON \
      -o $CHRONOSTRAIN_TARGET_CLUSTERS_100pct \
      --ident-threshold $thresh

2024-04-17 18:21:19,458 [DEBUG - chronostrain.logging.initialize] - Using logging configuration /home/youn/work/chronostrain/examples/database/complete_recipes/logging.ini
2024-04-17 18:21:19,472 [INFO - chronostrain.cli.prune_json] - Pruning database via clustering
2024-04-17 18:21:19,472 [DEBUG - chronostrain.cli.prune_json] - Src: /mnt/e/ecoli_db/chronostrain_files/ecoli.json, Output: /mnt/e/ecoli_db/chronostrain_files/ecoli.clusters_100pct.txt
2024-04-17 18:21:19,472 [INFO - chronostrain.cli.prune_json] - Target identity threshold = 0.9999999999
2024-04-17 18:21:20,460 [DEBUG - chronostrain.config.initialize] - Loaded chronostrain INI from chronostrain.ini.
2024-04-17 18:21:20,593 [INFO - chronostrain.cli.prune_json] - Preprocessing step -- Loading old DB instance, using data directory: /mnt/e/ecoli_db/chronostrain_files
2024-04-17 18:21:22,360 [INFO - chronostrain.database.database] - Instantiating database `ecoli`.
2024-04-17 18:21:22,362 [DEBUG - chronostrain.database.parser.jso

In [28]:
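# Sanity check: every strain listed in a previously generated cluster file should exist in the database.
# NOTE: this cell requires `src_db`, which is loaded in the statistics section below.
with open("/data/cctm/youn/chronostrain_zenodo_new/done/umb/databases/chronostrain_files/ecoli.100pct.txt", "rt") as f:
    f.readline()  # skip the first line
    _x = [line.strip().split("\t")[0] for line in f]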

print(len(_x))
print("missing strains:")
_s_ids = {s.id for s in src_db.all_strains()}
for k in _x:
    if k not in _s_ids:
        print(k)
print("done.")

4082
missing strains:
done.


## Optional: compute some statistics

<!-- ChronoStrain's database is specified by a single JSON file. In this notebook, it is given by the variable `CHRONOSTRAIN_TARGET_JSON`. The `JSONParser` object will attempt to parse the file, and extract the markers (each specified by a start and end position on the assembly) from the chromosomal assembly.

The directory in which the extracted markers are stored is in `<data_dir>` (in this notebook it is given by the variable `CHRONOSTRAIN_DB_DIR`), or more specifically `<data_dir>/__<database_name>_/markers/<strain_id>/<marker_id>.fasta`.
Note that the JSON generated by this notebook has the sequence FASTA paths embedded into the entries (`seqs` field of each strain).
Without that field, ChronoStrain will attempt to download it from NCBI (by assuming that the sequence accession is an NCBI nucleotide accession).

The marker extraction is disk I/O-bound. The database will be stored in binary via `pickle`, so that on future usages it will load faster. (try running this cell a second time and see the difference!) -->

In [16]:
from chronostrain.database import JSONParser
src_db = JSONParser(
    entries_file=CHRONOSTRAIN_TARGET_JSON,
    data_dir=CHRONOSTRAIN_DB_DIR,
    marker_max_len=50000,
    force_refresh=False
).parse()

n = 0
for s in src_db.all_strains():
    if s.metadata.genus == 'Escherichia' and s.metadata.species == 'coli':
        n += 1
print("# of E. coli entries:", n)

2024-04-17 18:26:30,930 [INFO - chronostrain.database.database] - Instantiating database `ecoli`.
# of E. coli entries: 2058


In [17]:
n = 0
ratios = []
for s in src_db.all_strains():
    if s.metadata.genus == 'Escherichia' and s.metadata.species == 'coli':
        n += 1
        marker_len = sum(len(m) for m in s.markers)
        genome_len = s.metadata.total_len
        ratios.append(
            marker_len / genome_len
        )
print("Database's E.coli marker fraction of genome: {} [mean]".format(np.mean(ratios)))

Database's E.coli marker fraction of genome: 0.015044251863146507 [mean]


In [18]:
df_entries = []
with open(CHRONOSTRAIN_TARGET_CLUSTERS, "rt") as f:
    for line in f:
        if line.startswith("#"):
            continue
        tokens = line.rstrip().split("\t")
        rep = tokens[0]
        members = tokens[1].split(",")
        for member in members:
            # if member.startswith("GCA"):  # only include isolates
            df_entries.append({'Accession': member, 'Cluster': rep})
cluster_df = pd.DataFrame(df_entries)
del df_entries
print("# clusters = {}".format(len(pd.unique(cluster_df['Cluster']))))


n_ecoli = 0
for cluster_id in pd.unique(cluster_df['Cluster']):
    s = src_db.get_strain(cluster_id)
    if s.metadata.genus == 'Escherichia' and s.metadata.species == 'coli':
        n_ecoli += 1
print("# ecoli clusters = {}".format(n_ecoli))

# clusters = 2323
# ecoli clusters = 842
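
Beyond the number of clusters, the distribution of cluster sizes can be informative. The sketch below parses the same tab-separated layout as the cell above (representative, then comma-separated members, with `#` marking comment lines) from a toy in-memory file. Whether the representative is itself listed among the members is an assumption of the toy data.

```python
import io
from collections import Counter


def cluster_sizes(handle):
    """Map each cluster representative to its number of members."""
    sizes = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        rep, members = line.rstrip("\n").split("\t")[:2]
        sizes[rep] = len(members.split(","))
    return sizes


toy_clusters = io.StringIO("# representative\tmembers\nS1\tS1,S2,S3\nS4\tS4\n")
sizes = cluster_sizes(toy_clusters)
histogram = Counter(sizes.values())  # cluster size -> number of clusters
print(sizes, dict(histogram))
```

To apply it to the real output, open `CHRONOSTRAIN_TARGET_CLUSTERS` in text mode and pass the file handle to `cluster_sizes`.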
