# Database pipeline for quantification of E. faecalis from metagenomic data

## What's in this notebook?

This notebook implements the pipeline that we used to generate the JSON configuration file (`efaecalis.json`) for ChronoStrain in our paper. The database encodes the _E. faecalis_ sequences that align to our marker seeds, as well as all sequences known to us at the family level with >75% similarity to those seeds.

In principle, the JSON file is the only thing needed to run ChronoStrain, but this notebook also produces some by-products:

1. A repository of chromosomal assemblies for the Enterococcaceae family, with a TSV-formatted index (`index.tsv`).
2. A repository of marker seed sequences and associated gene names.
3. (Optional) A binary-stored instance of the JSON database, readable by `chronostrain`.

If the sequence records listed in the JSON file do not exist locally (the directory is specified in the configuration file `chronostrain.ini`), ChronoStrain will attempt to download them by searching for each ID on NCBI. This notebook creates symbolic links to the sequences found in the chromosomal assembly index (by-product #1) to skip this process.
Once the sequences are available, the database is loaded by extracting subsequences from the FASTA records; this is a very I/O-bound task, so the pre-loaded binary (by-product #3) helps skip the initialization.
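
The symbolic-link idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the code ChronoStrain or the helper scripts actually run: `link_sequence` is a hypothetical helper, and the `<accession>.fasta` naming inside the sequence directory is assumed for the example.

```python
import tempfile
from pathlib import Path


def link_sequence(source_fasta: Path, seq_dir: Path, accession: str) -> Path:
    """Symlink `source_fasta` into `seq_dir` as `<accession>.fasta`.

    The file naming is an assumption made for this sketch.
    """
    seq_dir.mkdir(parents=True, exist_ok=True)
    link_path = seq_dir / f"{accession}.fasta"
    if link_path.is_symlink() or link_path.exists():
        return link_path  # already present; leave it untouched
    link_path.symlink_to(source_fasta.resolve())
    return link_path


# Usage demo in a throwaway directory (all names here are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "NZ_EXAMPLE.1.chrom.fna"
    src.write_text(">NZ_EXAMPLE.1\nACGT\n")
    link = link_sequence(src, Path(tmp) / "sequences", "NZ_EXAMPLE.1")
    print(link.name, "->", link.resolve().name)
```

Linking rather than copying keeps a single copy of each assembly on disk, which matters given the size of the catalog.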

## Prerequisites

We recommend using a `conda` environment for this notebook, with `ipywidgets` installed and updated.
This notebook requires the following software to be installed:
- chronostrain (python>=3.10, the basic recipe `conda_basic.yml` or the full recipe `conda_full.yml`)
- primersearch (http://emboss.open-bio.org/, https://anaconda.org/bioconda/emboss)
- dashing2 (2023 Baker and Langmead: https://github.com/dnbaker/dashing2)

### Hardware requirements
 
None of the operations of this notebook requires a GPU. 
As of Aug 2023, we estimate that the contents of this notebook require ~20 GB of disk space. At the time that we ran this pipeline, the catalog of chromosomal assemblies totalled 14.3 GB; other files (such as the BLAST database, marker seeds and chronostrain-specific by-products) totalled 5.4 GB, with a peak of ~28 GB when accounting for temporary files.
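
Before starting, it may be worth checking that the target filesystem has room for the ~28 GB peak quoted above. The following is a small sketch using only the standard library; `has_free_space` is a hypothetical helper, and the threshold is simply the estimate from this section.

```python
import shutil
from pathlib import Path


def has_free_space(path: Path, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    # Walk up to the nearest existing ancestor so the check works before mkdir.
    probe = path
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    return free_gb >= required_gb


# Example: check the current directory against the ~28 GB peak estimate.
print(has_free_space(Path("."), 28))
```

In the notebook, the same check could be pointed at `TARGET_DIR` once that variable is defined.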

## File paths and environment variables

In [20]:
from pathlib import Path
import pandas as pd
import numpy as np

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

""" ============================================ EDIT THESE SETTINGS BASED ON USER'S CHOICE. ============================================ """
""" RefSeq catalog settings"""
TARGET_DIR = Path("/mnt/e/infant_nt/database")  # the base directory for everything else.
TARGET_TAXA = "Enterococcaceae"  # the taxonomic identifier to download. Can be species, genus or even family.
NCBI_REFSEQ_DIR = TARGET_DIR / "ref_genomes"
REFSEQ_INDEX = NCBI_REFSEQ_DIR / "index.tsv"


# ========== optional; for infant catalog. It's ok if these files don't exist.
INFANT_CATALOG_DIR = Path("/mnt/e/infant_nt")
INFANT_ISOLATE_INDEX = TARGET_DIR / 'infant_isolates' / 'index.tsv'


""" RefSeq BLAST database """
BLAST_DB_DIR = TARGET_DIR / "blast_db"
BLAST_DB_NAME = "Efaecalis_refseq"  # Blast DB to create.

""" Marker seeds """
MARKER_SEED_DIR = TARGET_DIR / "marker_seeds"
MARKER_SEED_INDEX = MARKER_SEED_DIR / "marker_seed_index.tsv"

""" chronostrain-specific settings """
NUM_CORES = 8  # number of cores to use (e.g. for blastn)
MIN_PCT_IDTY = 75  # accept BLAST hits as markers above this threshold.
CHRONOSTRAIN_DB_DIR = TARGET_DIR / "chronostrain_files"  # The directory to use for chronostrain's database files.
CHRONOSTRAIN_TARGET_JSON = CHRONOSTRAIN_DB_DIR / "efaecalis.json"  # the desired final product.
DASHING2_DIR = Path("/home/youn/work/bin")  # Directory that contains the dashing2 executable.

""" MetaPhlAn settings """
METAPHLAN_DB_PATH = Path("/mnt/e/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103/mpa_vJan21_CHOCOPhlAnSGB_202103.pkl")  # MetaPhlAn 3 or newer
METAPHLAN_TAXONOMIC_KEY = 's__Enterococcus_faecalis'

""" ============================================ DO NOT EDIT BELOW ============================================ """
""" environment variable extraction """
try:
    VARS_SET
except NameError:
    VARS_SET = True
    _cwd = %pwd
    _parent_cwd = Path(_cwd).parent
    _start_path = %env PATH

# Work in parent directory, where all the helper scripts and settings.sh are.
%cd "$_parent_cwd"
# Don't use GPU when importing jaxlib through chronostrain.
%env JAX_PLATFORM_NAME=cpu  
%env TARGET_TAXA=$TARGET_TAXA
%env NCBI_REFSEQ_DIR=$NCBI_REFSEQ_DIR
%env REFSEQ_INDEX=$REFSEQ_INDEX
# Need basic executables, such as "which" and "basename" (required by primersearch)
%env PATH=/usr/bin:$_start_path:$DASHING2_DIR

/home/youn/work/chronostrain/examples/database
env: JAX_PLATFORM_NAME=cpu
env: TARGET_TAXA=Enterococcaceae
env: NCBI_REFSEQ_DIR=/mnt/e/infant_nt/database/ref_genomes
env: REFSEQ_INDEX=/mnt/e/infant_nt/database/ref_genomes/index.tsv
env: PATH=/usr/bin:/home/youn/mambaforge/envs/chronostrain/bin:/home/youn/work/bin


In [2]:
# === Ensure that these commands work.
!primersearch --version
!dashing2 --version

EMBOSS:6.6.0.0
#Calling Dashing2 version v2.1.19 with command '/home/youn/work/chronostrain/examples/database/dashing2 --version'
dashing2 has several subcommands: sketch, cmp, wsketch, and contain.
Usage can be seen in those subcommands. (e.g., `dashing2 sketch -h`)

	sketch: converts FastX into k-mer sets/sketches, and sketches BigWig and BED files; also contains functionality from cmp, for one-step sketch and comparisons
This is probably the most common subcommand to use.

	cmp: compares previously sketched/decomposed k-mer sets and emits results. alias: dist

	contain: Takes a k-mer database (built with dashing2 sketch --save-kmers), then computes coverage for all k-mer references using input streams.
	wsketch: Takes a tuple of [1-3] input binary files [(u32 or u64), (float or double), (u32 or u64)] and performs weighted minhash sketching.
Three files are treated as Compressed Sparse Row (CSR)-format, where the third file contains indptr values, specifying the lengths of consecutiv

## Recipe starts here.

In [4]:
# Prepare directories.
TARGET_DIR.mkdir(exist_ok=True, parents=True)
BLAST_DB_DIR.mkdir(exist_ok=True, parents=True)
NCBI_REFSEQ_DIR.mkdir(exist_ok=True, parents=True)
MARKER_SEED_DIR.mkdir(exist_ok=True, parents=True)

### Step 1: Download RefSeq catalog.

In [5]:
!bash download_ncbi2.sh

2023-10-03 13:05:22,082 [INFO - chronostrain.download_ncbi] - Found 2197 records.
2023-10-03 13:05:22,082 [INFO - chronostrain.download_ncbi] - Downloading assemblies to directory /mnt/e/infant_nt/database/ref_genomes
Indexing GCF_003711125.1 (FDAARGOS_184)
Indexing GCF_005347505.1 (352)
Indexing GCF_013394325.1 (G-15)
Indexing GCA_003711125.1 (FDAARGOS_184)
Indexing GCA_005347505.1 (352)
Indexing GCA_013394325.1 (G-15)
Indexing GCF_003641225.1 (EC-369)
Indexing GCF_009707345.1 (EC291)
Indexing GCF_014844215.1 (EGM182)
Indexing GCF_016127635.1 (FDAARGOS_998)
Indexing GCF_016727305.1 (FDAARGOS_1122)
Indexing GCF_016727325.1 (FDAARGOS_1121)
Indexing GCF_016727345.1 (FDAARGOS_1120)
Indexing GCF_022870765.1 (ECB140)
Indexing GCF_023523715.1 (SP11)
Indexing GCF_027944555.1 (CQFYY22-063)
Indexing GCF_029201225.1 (ASE2)
Indexing GCF_029215545.1 (ASE4)
Indexing GCA_003641225.1 (EC-369)
Indexing GCA_009707345.1 (EC291)
Indexing GCA_014844215.1 (EGM182)
Indexing GCA_016127635.1 (FDAARGOS_998)
In

### Step 2: Build BLAST database.

In [12]:
!env BLAST_DB_DIR=$BLAST_DB_DIR \
    BLAST_DB_NAME=$BLAST_DB_NAME \
    bash create_blast_db.sh

[*] Creating Blast database.
Target fasta file: __tmp_refseqs.fasta
Concatenating /mnt/e/infant_nt/database/ref_genomes/ncbi_dataset/data/GCF_003711125.1/NZ_CP024590.1.chrom.fna...
Concatenating /mnt/e/infant_nt/database/ref_genomes/ncbi_dataset/data/GCF_005347505.1/NZ_CP034169.1.chrom.fna...
Concatenating /mnt/e/infant_nt/database/ref_genomes/ncbi_dataset/data/GCF_013394325.1/NZ_AP019814.1.chrom.fna...
Concatenating /mnt/e/infant_nt/database/ref_genomes/ncbi_dataset/data/GCA_003711125.1/CP024590.1.chrom.fna...
Concatenating /mnt/e/infant_nt/database/ref_genomes/ncbi_dataset/data/GCA_005347505.1/CP034169.1.chrom.fna...
Concatenating /mnt/e/infant_nt/database/ref_genomes/ncbi_dataset/data/GCA_013394325.1/AP019814.1.chrom.fna...
Concatenating /mnt/e/infant_nt/database/ref_genomes/ncbi_dataset/data/GCF_003641225.1/NZ_CP032739.1.chrom.fna...
Concatenating /mnt/e/infant_nt/database/ref_genomes/ncbi_dataset/data/GCF_009707345.1/NZ_CP046123.1.chrom.fna...
Concatenating /mnt/e/infant_nt/databa

### Step 3: Build the marker seed catalog.

ChronoStrain only needs one multi-FASTA file of marker seeds per marker gene, and a single TSV file that catalogs them.
However, getting there takes a few steps.

#### Step 3.1: Download MLST schema.

In [13]:
!python python_helpers/mlst_download.py -t "Enterococcus faecalis" -w "$TARGET_DIR"/mlst_schema -o "$MARKER_SEED_DIR"/mlst_seeds.tsv

Downloading marker seeds from MLST schema.
Targeting 1 taxa using MLST scheme.
Fetching URL resource https://pubmlst.org/static/data/dbases.xml
Got a response of size 152.35 KB.
Schema type id: 
Handling locus gdh
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_efaecalis_seqdef/loci/gdh/alleles_fasta
Got a response of size 69.5 KB.
Handling locus gyd
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_efaecalis_seqdef/loci/gyd/alleles_fasta
Got a response of size 20.51 KB.
Handling locus pstS
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_efaecalis_seqdef/loci/pstS/alleles_fasta
Got a response of size 74.72 KB.
Handling locus gki
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_efaecalis_seqdef/loci/gki/alleles_fasta
Got a response of size 54.14 KB.
Handling locus aroE
Fetching URL resource https://rest.pubmlst.org/db/pubmlst_efaecalis_seqdef/loci/aroE/alleles_fasta
Got a response of size 60.48 KB.
Handling locus xpt
Fetching URL resource https://res

#### Step 3.2: Non-standard genes: run PCR primer search across entire catalog.

In [11]:
def perform_primer_search(gene_name: str, forward_primer: str, rev_primer: str, est_length: int):
    !python python_helpers/identify_gene_cluster_primersearch.py \
        -i "$REFSEQ_INDEX" \
        -t "$TARGET_DIR"/_tmp \
        -o "$MARKER_SEED_DIR"/"$gene_name".feather \
        -g "Enterococcus" \
        -s "faecalis" \
        -p1 "$forward_primer" -p2 "$rev_primer" \
        -n "$gene_name" \
        -l "$est_length" \
        --dont-use-gff


# ====== Known strain polymorphisms
perform_primer_search("cpsA", "GTAGAAGAAGCAAGCCAGTACGCC", "CCTCTGCAGCAATCTGTTTCATGG", 478)
perform_primer_search("cpsB", "GTGTCATCACAGCTATCGTCGC", "CCGGCATTGGATAAGGAAATAGCC", 603)
perform_primer_search("cpsC", "CCTGAATATCAATGTATTTGGGCAGTC", "CCAACGCTTTGCTTCTTGAATGAC", 300)
perform_primer_search("cpsD", "GGATTCTCTTGTTCAACAAACCATTGG", "CGCATGGCTTCATAAAAGAACAGC", 522)
perform_primer_search("cpsE", "GAGGTTGAGCGAGATATATTATGGC", "CACTTCATAAACCGACTCATCACG", 450)
perform_primer_search("cpsF", "GCATTACAAGGTTATACAGTTGATGG", "GACTGTTCCATCTTATCTTTTATTCGG", 580)
perform_primer_search("cpsG", "GGCTCTGATCAAATGTGGAATCCC", "GGTGTATCTTCAGAAACATATTCTACTG", 503)
perform_primer_search("cpsH", "GTGTCTTTAGCAATTGGTATCGGTTG", "CACTAGAGTAGCTAATACTTTTTTTTCCC", 366)
perform_primer_search("cpsI", "GCTTGTGAAGCAGCTAAACGAGG", "CTCTGATAAGTAAGTTTCTTTCTCTGCC", 630)
perform_primer_search("cpsJ", "CCTCGACGTATATTCTGGAGAAAC", "GCTTAGTTTCACCAAATGCACGTAG", 553)
perform_primer_search("cpsK", "GCGTTGCACAACGAATTGCTAAATAC", "CGCTACAATATAGTAAGGTAGCTGAATC", 422)

# ======= suspected virulence determinants
perform_primer_search("cylA", "GGTTATGCATCAGATCTCTCAA", "TCTTCAGGTTTAAAATCTGG", 223)
perform_primer_search("cylB", "GGAGAATTAGTGTTTAGAGCG", "GCTTCATAACCATTGTTACTATAGAAAC", 522)
perform_primer_search("cylM", "AAGATTGTCTGTGCCATGGA", "TACTCACTTCCGGCAACCTT", 159)
perform_primer_search("cbh", "CTCATAGGATCCATCACCAACATCAC", "TGGCTGGAATTCACTTTTCAGGCTAT", 580)
perform_primer_search("gelE", "TTGTTGGAAGTTCATGTCTA", "TTCATTGACCAGAACAGATT", 1484)
perform_primer_search("fsrB", "GCATTGTTATCTATGTCGCCATACC", "GGCTTAGTTCCCACACCATC", 396)

# ======= pathogenicity island
perform_primer_search("gls24_like", "GCATTAGATGAGATTGATGGTC", "GCGAGGTTCAGTTTCTTC", 446)
perform_primer_search("esp", "CGATAAAGAAGAGAGCGGAG", "GCAAACTCTACATCCACGTC", 539)
perform_primer_search("psaA", "CTATTTTGCAGCAAGTGATG", "CGCATAGTAACTATCACCATCTTG", 540)

Performing primer-based search for cpsA in Enterococcus faecalis. (FWD=GTAGAAGAAGCAAGCCAGTACGCC, REV=CCTCTGCAGCAATCTGTTTCATGG, len approx. 478)
Will NOT use GFF files; primer PCR hits will be interpreted as gene hits.
100%|███████████████████████████████████| 1081/1081 [02:59<00:00,  6.02genome/s]
Performing primer-based search for cpsB in Enterococcus faecalis. (FWD=GTGTCATCACAGCTATCGTCGC, REV=CCGGCATTGGATAAGGAAATAGCC, len approx. 603)
Will NOT use GFF files; primer PCR hits will be interpreted as gene hits.
100%|███████████████████████████████████| 1081/1081 [03:00<00:00,  6.00genome/s]
Performing primer-based search for cpsC in Enterococcus faecalis. (FWD=CCTGAATATCAATGTATTTGGGCAGTC, REV=CCAACGCTTTGCTTCTTGAATGAC, len approx. 300)
Will NOT use GFF files; primer PCR hits will be interpreted as gene hits.
100%|███████████████████████████████████| 1081/1081 [02:56<00:00,  6.11genome/s]
Performing primer-based search for cpsD in Enterococcus faecalis. (FWD=GGATTCTCTTGTTCAACAAACCATTGG, RE
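
For intuition about what a primer-based search does, here is a deliberately simplified in-silico PCR sketch. It is not how `primersearch` or `identify_gene_cluster_primersearch.py` is implemented: it only handles exact matches on the forward strand, whereas a real search would also tolerate mismatches and scan the reverse strand. The function names and the `max_len` cutoff are assumptions for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def find_amplicons(genome: str, fwd_primer: str, rev_primer: str, max_len: int = 5000) -> list[str]:
    """Exact-match in-silico PCR on the forward strand of `genome`.

    An amplicon starts at the forward primer and ends at the first downstream
    occurrence of the reverse primer's reverse complement.
    """
    genome = genome.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer.upper())
    amplicons = []
    start = genome.find(fwd)
    while start != -1:
        end = genome.find(rev_site, start + len(fwd))
        if end != -1:
            stop = end + len(rev_site)
            if stop - start <= max_len:
                amplicons.append(genome[start:stop])
        start = genome.find(fwd, start + 1)
    return amplicons
```

The estimated amplicon length passed to `perform_primer_search` plays a role similar to `max_len` here, in that it bounds how far apart the two primer sites may plausibly be.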

#### Step 3.3: Convert previous step to marker seed (multi-)FASTA files.

ChronoStrain reads marker seeds from FASTA files, but the previous step merely produces a raw catalog of gene hits (one `.feather` file per gene).

The cell below converts each catalog into a FASTA file and records it in `manual_seeds.tsv`.

In [51]:
with open(MARKER_SEED_DIR / "manual_seeds.tsv", "wt") as metadata_tsv:
    for feather_file in MARKER_SEED_DIR.glob("*.feather"):
        gene_name = feather_file.stem
        fasta_path = MARKER_SEED_DIR / f'{gene_name}.fasta'
        _df = pd.read_feather(feather_file)
        with open(fasta_path, 'wt') as f:
            for row_idx, row in _df.iterrows():
                record = SeqRecord(
                    Seq(row['GeneSeq']),
                    id="{}_{}".format(gene_name, row_idx),
                    description="Src={}:{}".format(row['Accession'], row['Gene'])
                )
                SeqIO.write([record], f, 'fasta')
        print(f'{feather_file.name} -> {fasta_path.name}')
        
        print(
            "{}\t{}\t{}".format(
                gene_name, fasta_path, f"POLYMORPHIC_{gene_name}"
            ), 
            file=metadata_tsv
        )

cpsK.feather -> cpsK.fasta
cbh.feather -> cbh.fasta
cpsA.feather -> cpsA.fasta
cpsB.feather -> cpsB.fasta
cpsC.feather -> cpsC.fasta
cpsD.feather -> cpsD.fasta
cpsE.feather -> cpsE.fasta
cpsF.feather -> cpsF.fasta
cpsG.feather -> cpsG.fasta
cpsH.feather -> cpsH.fasta
cpsI.feather -> cpsI.fasta
cpsJ.feather -> cpsJ.fasta
cylA.feather -> cylA.fasta
cylB.feather -> cylB.fasta
cylM.feather -> cylM.fasta
esp.feather -> esp.fasta
fsrB.feather -> fsrB.fasta
gelE.feather -> gelE.fasta
gls24_like.feather -> gls24_like.fasta
psaA.feather -> psaA.fasta


#### Step 3.4: Extract MetaPhlAn markers.

In [77]:
!python python_helpers/extract_metaphlan_markers.py \
    -t "$METAPHLAN_TAXONOMIC_KEY" \
    -i "$METAPHLAN_DB_PATH" \
    -o "$MARKER_SEED_DIR"/metaphlan_seeds.tsv

Searching for marker seeds from MetaPhlAn database: mpa_vJan21_CHOCOPhlAnSGB_202103.
Target # of markers: 400
Found metaphlan marker ID SGB7962__ILGNLFCD_00717.
Found marker `SGB7962__ILGNLFCD_00717` (length 1869)
Found metaphlan marker ID SGB7962__IDAMNEFF_00013.
Found marker `SGB7962__IDAMNEFF_00013` (length 1650)
Found metaphlan marker ID SGB7962__KIHIIFOM_01796.
Found marker `SGB7962__KIHIIFOM_01796` (length 1227)
Found metaphlan marker ID SGB7962__KGLKPILN_02253.
Found marker `SGB7962__KGLKPILN_02253` (length 1224)
Found metaphlan marker ID SGB7962__DCIIIAKG_00503.
Found marker `SGB7962__DCIIIAKG_00503` (length 1209)
Found metaphlan marker ID SGB7962__LMHKBOKG_02316.
Found marker `SGB7962__LMHKBOKG_02316` (length 1209)
Found metaphlan marker ID SGB7962__OIKIMJFJ_01574.
Found marker `SGB7962__OIKIMJFJ_01574` (length 1167)
Found metaphlan marker ID SGB7962__JFHBNJPK_02477.
Found marker `SGB7962__JFHBNJPK_02477` (length 675)
Found metaphlan marker ID SGB7962__CDLKOKGH_00112.
Found ma

#### Step 3.5: Combine marker seed files.

In [78]:
!cat "$MARKER_SEED_DIR"/mlst_seeds.tsv > $MARKER_SEED_INDEX
!cat "$MARKER_SEED_DIR"/manual_seeds.tsv >> $MARKER_SEED_INDEX
!cat "$MARKER_SEED_DIR"/metaphlan_seeds.tsv >> $MARKER_SEED_INDEX

print("Created Marker seed index: {}".format(MARKER_SEED_INDEX))
assert MARKER_SEED_INDEX.exists()

Created Marker seed index: /mnt/e/infant_nt/database/marker_seeds/marker_seed_index.tsv
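
A quick sanity check on the combined index can catch a missing FASTA file before `make-db` runs. The sketch below assumes that every line of the index is tab-separated, with the gene name in the first column and the FASTA path in the second, as in the `manual_seeds.tsv` rows written in Step 3.3. `missing_seed_fastas` is a hypothetical helper, not part of ChronoStrain.

```python
import csv
from pathlib import Path


def missing_seed_fastas(index_path: Path) -> list[tuple[str, Path]]:
    """Return (gene_name, fasta_path) pairs whose FASTA file is absent or empty."""
    missing = []
    with open(index_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            gene_name, fasta_path = row[0], Path(row[1])
            if not fasta_path.is_file() or fasta_path.stat().st_size == 0:
                missing.append((gene_name, fasta_path))
    return missing
```

In this notebook, it could be called as `missing_seed_fastas(MARKER_SEED_INDEX)`; an empty list means every cataloged seed file is present and non-empty.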


### Step 4.1 [Specific to Infant dataset]: include isolate assemblies

Add the infant isolates to the database catalog.

*Note: This cell does nothing if `infant_nt/download_assemblies.sh` has not been run. It can be safely skipped if one does not want to include these isolates.*

In [35]:
# this file points to the output of infant_nt/download_assembly_catalog.sh.
from Bio import SeqIO


infant_isolate_df_entries = []
isolate_seqdir = INFANT_ISOLATE_INDEX.parent / 'sequences'
for f in INFANT_CATALOG_DIR.glob("*/isolate_assemblies/metadata.tsv"):
    print("Found metadata in dir {}.".format(f.parent))
    isolate_seqdir.mkdir(exist_ok=True, parents=True)

    # =============== Create a dataframe.
    isolate_df = pd.read_csv(f, sep='\t')
    for _, row in isolate_df.iterrows():
        # Parse entries.
        participant = row['Participant']
        acc = row['Accession']
        genus = row['Genus']
        species = row['Species']
        timepoint = row['Timepoint']
        sample_id = row['SampleId']
        source_fasta_path = Path(row['FastaPath'])

        # Skip if not E. faecalis.
        if not (genus == 'Enterococcus' and species == 'faecalis'):
            continue

        # Extract the records.
        records = list(SeqIO.parse(source_fasta_path, format="fasta"))
        chrom_len = sum(len(record.seq) for record in records)

        # Do nothing if no records are available.
        if len(records) == 0:
            continue
            
        # Embed the accession into the record IDs.
        target_fasta_dir = isolate_seqdir / acc
        target_fasta_dir.mkdir(exist_ok=True, parents=True)
        for record_idx, record in enumerate(records):
            record.id = f'{acc}_CONTIG_{record_idx}'
            target_fasta = target_fasta_dir / f'{record.id}.fasta'
            with open(target_fasta, 'w') as out_f:
                SeqIO.write([record], out_f, format="fasta")

        # Add to the dataframe.
        infant_isolate_df_entries.append(
            (genus, species, f'{participant}_t:{timepoint}_s:{sample_id}', acc, acc, str(target_fasta_dir / '*.fasta'), chrom_len, 'None')
        )


# Create dataframe and save to file.
if len(infant_isolate_df_entries) > 0:
    infant_isolate_df = pd.DataFrame(
        infant_isolate_df_entries, 
        columns=['Genus', 'Species', 'Strain', 'Accession', 'Assembly', 'SeqPath', 'ChromosomeLen', 'GFF']
    ).astype(
        {
            'Genus': 'string',
            'Species': 'string',
            'Strain': 'string',
            'Accession': 'string',
            'Assembly': 'string',
            'SeqPath': 'string',
            'ChromosomeLen': 'int',
            'GFF': 'string'
        }
    )
    infant_isolate_df.to_csv(INFANT_ISOLATE_INDEX, sep='\t', index=False)

Found metadata in dir /mnt/e/infant_nt/A00021_T1/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/A00021_T2/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/A00031/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/A00043/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/A00067/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/A00106_T1/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00090/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00092/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00096/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00097/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00100/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00101/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00111/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00116/isolate_assemblies.
Found metadata in dir /mnt/e/infant_nt/B00119/isolate_assemblies.
F

### Step 4: Run ChronoStrain's make-db command.

By the end of the previous step, we have:

1) FASTA files for each gene, listing out seed sequence(s).
2) A TSV file (`marker_seed_index.tsv`) containing a list of gene names and the paths to each of these FASTA files.

In [58]:
if INFANT_ISOLATE_INDEX.exists():
    !env \
        JAX_PLATFORM_NAME=cpu \
        CHRONOSTRAIN_DB_DIR="$CHRONOSTRAIN_DB_DIR" \
        chronostrain -c chronostrain.ini \
          make-db \
          -m $MARKER_SEED_INDEX \
          -r $REFSEQ_INDEX \
          -b $BLAST_DB_NAME -bd $BLAST_DB_DIR \
          --min-pct-idty $MIN_PCT_IDTY \
          --ident-threshold 0.998 \
          -o $CHRONOSTRAIN_TARGET_JSON \
          --threads $NUM_CORES \
          --add-isolates $INFANT_ISOLATE_INDEX
else:
    !env \
        JAX_PLATFORM_NAME=cpu \
        CHRONOSTRAIN_DB_DIR="$CHRONOSTRAIN_DB_DIR" \
        chronostrain -c chronostrain.ini \
          make-db \
          -m $MARKER_SEED_INDEX \
          -r $REFSEQ_INDEX \
          -b $BLAST_DB_NAME -bd $BLAST_DB_DIR \
          --min-pct-idty $MIN_PCT_IDTY \
          --ident-threshold 0.998 \
          -o $CHRONOSTRAIN_TARGET_JSON \
          --threads $NUM_CORES

2023-10-03 17:37:55,909 [INFO - chronostrain.cli.make_db_json] - Building raw DB using BLAST.
2023-10-03 17:37:55,920 [INFO - chronostrain.cli.make_db_json] - Creating strain entries from catalog /mnt/e/infant_nt/database/ref_genomes/index.tsv
2023-10-03 18:02:42,012 [INFO - chronostrain.cli.make_db_json] - Creating strain entries from catalog /mnt/e/infant_nt/database/infant_isolates/index.tsv
2023-10-03 18:10:10,036 [INFO - chronostrain.cli.make_db_json] - Wrote raw blast DB entries to /mnt/e/infant_nt/database/chronostrain_files/efaecalis-1raw.json.
2023-10-03 18:10:10,037 [INFO - chronostrain.cli.make_db_json] - Resolving overlaps.
2023-10-03 18:10:10,537 [INFO - chronostrain.cli.make_db_json] - Created 8 merged markers for strain CP022488.1, seqID CP022488.1.
2023-10-03 18:10:10,607 [INFO - chronostrain.cli.make_db_json] - Created 8 merged markers for strain NZ_CP022488.1, seqID NZ_CP022488.1.
2023-10-03 18:10:10,670 [INFO - chronostrain.cli.make_db_json] - Created 8 merged marker
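
The `--min-pct-idty` option passed to `make-db` carries `MIN_PCT_IDTY`, the percent-identity threshold for accepting BLAST hits as markers. As an illustration only (this is not ChronoStrain's implementation), the sketch below applies such a threshold to BLAST tabular output. It assumes the standard 12-column `-outfmt 6` layout and an inclusive comparison; `filter_hits` is a hypothetical name and the example rows are made up.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(blast_tsv, min_pct_idty: float) -> list[dict]:
    """Keep rows whose percent identity is at least `min_pct_idty` (assumed inclusive)."""
    reader = csv.DictReader(blast_tsv, fieldnames=BLAST_COLUMNS, delimiter="\t")
    return [row for row in reader if float(row["pident"]) >= min_pct_idty]


# Two made-up hits: one above and one below a 75% threshold.
example = io.StringIO(
    "gdh\tNZ_EXAMPLE_A\t98.5\t500\t7\t0\t1\t500\t100\t599\t0.0\t900\n"
    "gdh\tNZ_EXAMPLE_B\t70.0\t500\t150\t0\t1\t500\t100\t599\t1e-50\t200\n"
)
kept = filter_hits(example, min_pct_idty=75)
print([hit["sseqid"] for hit in kept])
```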

## OPTIONAL: pre-load the database files.

ChronoStrain's database is specified by a single JSON file. In this notebook, it is given by the variable `CHRONOSTRAIN_TARGET_JSON`. The `JSONParser` object will attempt to parse the file, and extract the markers (each specified by a start and end position on the assembly) from the chromosomal assembly.
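
Conceptually, extracting a marker from an assembly is a slice of the chromosome between the start and end positions. The sketch below shows the idea on plain strings. The 1-based inclusive coordinates and the `strand` argument are assumptions made for this example, not a statement about the JSON schema, and `extract_marker` is a hypothetical helper.

```python
def extract_marker(chromosome: str, start: int, end: int, strand: str = "+") -> str:
    """Slice positions [start, end] from `chromosome`.

    Coordinates are treated as 1-based and inclusive (an assumed convention).
    For strand "-", the reverse complement of the slice is returned.
    """
    if not (1 <= start <= end <= len(chromosome)):
        raise ValueError(f"Invalid coordinates: start={start}, end={end}")
    subseq = chromosome[start - 1:end]
    if strand == "-":
        subseq = subseq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]
    return subseq
```

For a real assembly, the string would come from a parsed FASTA record, for example `str(record.seq)` with Biopython.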

The extracted markers are stored under `<data_dir>` (in this notebook, given by the variable `CHRONOSTRAIN_DB_DIR`), or more specifically at `<data_dir>/__<database_name>_/markers/<strain_id>/<marker_id>.fasta`.
Note that the JSON generated by this notebook has the sequence FASTA paths embedded into the entries (the `seqs` field of each strain).
Without that field, ChronoStrain will attempt to download the sequences from NCBI (by assuming that each sequence accession is an NCBI nucleotide accession).

The marker extraction is disk I/O-bound. The database is stored in binary form via `pickle`, so that it loads faster on subsequent uses. (Try running the cell below a second time and see the difference!)

In [59]:
from chronostrain.database import JSONParser
src_db = JSONParser(
    entries_file=CHRONOSTRAIN_TARGET_JSON,
    data_dir=CHRONOSTRAIN_DB_DIR,
    marker_max_len=50000,
    force_refresh=False
).parse()

2023-10-03 18:23:09,315 [INFO - chronostrain.database.parser.json] - Loading from JSON marker database file /mnt/e/infant_nt/database/chronostrain_files/efaecalis.json.
2023-10-03 18:23:18,642 [INFO - chronostrain.database.database] - Instantiating database `efaecalis`.
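
The speed-up on a second run comes from a load-or-build caching pattern. The following is a generic sketch of that pattern with `pickle`, not ChronoStrain's actual code: `load_or_build` and the cache location are hypothetical, and the `force_refresh` flag merely mirrors the name of the `JSONParser` argument used above.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_path: Path, build: Callable[[], Any], force_refresh: bool = False) -> Any:
    """Return the cached object if present; otherwise build it and cache it."""
    if cache_path.exists() and not force_refresh:
        with open(cache_path, "rb") as f:
            return pickle.load(f)  # fast path: skip the expensive build
    obj = build()  # slow path, e.g. I/O-bound marker extraction
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```

As with any pickle-based cache, the stored file should only be loaded when it comes from a trusted source, and it should be rebuilt (for example with `force_refresh=True`) when the underlying JSON changes.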
