# BEFORE YOU START

## Step 1: Try starting a GPU-enabled session. (`Runtime > Change runtime type > Dropdown "GPU"`)

# Install the package.

In [None]:
# ========================== Do not modify below. =======================
!git clone https://github.com/gibsonlab/chronostrain
%pip install -e chronostrain/.

## Some extra dependencies:

- bwa
- samtools

In [None]:
import os

# bwa (We'll be using this for alignments)
!apt install -y bwa

# samtools (required by ChronoStrain)
!wget https://github.com/samtools/samtools/releases/download/1.19.2/samtools-1.19.2.tar.bz2 -O samtools.tar.bz2
!tar -xjvf samtools.tar.bz2
%cd ./samtools-1.19.2
!make
%cd ..
os.environ['PATH'] += ":/content/samtools-1.19.2"
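
Before moving on, it can help to confirm that both tools are actually reachable from the notebook's `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `find_missing_tools` is ours and is not part of ChronoStrain.

```python
import shutil


def find_missing_tools(tool_names):
    """Return the names in tool_names that cannot be located on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


missing = find_missing_tools(["bwa", "samtools"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("bwa and samtools are both available.")
```

If `samtools` is reported as missing, re-run the cell above and check that the `make` step finished without errors.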

# Set up the test environment.

The following zip file contains the necessary configuration files, a pre-constructed (and cached) database, and FASTQ inputs containing simulated data.

In [None]:
!wget "https://zenodo.org/records/14597417/files/example_files.zip?download=1" -O example_files.zip
!unzip example_files.zip
!rm example_files.zip
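
A quick way to verify the extraction is to check for the files that the later cells refer to. This sketch assumes the archive unpacks into the current working directory; the paths are the ones used by the commands further down in this notebook, and the helper name `find_missing_paths` is hypothetical.

```python
from pathlib import Path


def find_missing_paths(paths, root="."):
    """Return the entries of paths that do not exist under root."""
    root = Path(root)
    return [p for p in paths if not (root / p).exists()]


# Paths referenced by the commands later in this notebook.
expected_paths = [
    "configs/chronostrain.ini",
    "inputs/input_files.csv",
    "database/ecoli.json",
    "database/ecoli.clusters.txt",
]

missing_paths = find_missing_paths(expected_paths)
if missing_paths:
    print("Missing after unzip:", missing_paths)
else:
    print("All expected example files are present.")
```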

# Example usage

Display the help text for chronostrain and exit.

In [1]:
!chronostrain -h

Usage: chronostrain [OPTIONS] COMMAND [ARGS]...

  ChronoStrain (Time-Series Abundance Estimation from Metagenomic Shotgun
  Sequencing)

  Contact: Younhun Kim (ykim78@bwh.harvard.edu)

Options:
  -c, --config FILE               The path to a chronostrain INI configuration
                                  file.
  --profile-jax / --dont-profile-jax
                                  Specify whether to profile JAX memory usage
                                  using jax-smi.
  -h, --help                      Show this message and exit.

Commands:
  advi           Perform posterior estimation using ADVI.
  analyze        Run the pipeline (filter + advi) with mainly default...
  cluster-db     Cluster a JSON database's strain entries using markers'...
  filter         Perform filtering on a timeseries dataset, specified by...
  filter-single  Filter a single read file.
  make-db        Create a database using marker seeds.
  precompute     Perform alignments and fragment counting to...


# Running chronostrain:

The following cells run ChronoStrain on a small example input. The reads are simulated from six mutated *E. coli* genomes and are taken from our semisynthetic benchmark for *E. coli* profiling.

*Note: the real background reads are not included in this demonstration.*

*Semisynthetic seed detail: mutation rate = 0.002, genome replicate #1, 5000 total simulated reads, read simulation replicate #1*


## Step 1: Pre-filter reads using alignment.
Use the `CHRONOSTRAIN_INI` environment variable (or the `-c` option) to pass in a configuration file, along with a command.

The cell below uses this configuration to perform the initial filtering of reads (`chronostrain filter`).

- The `%env CHRONOSTRAIN_INI=configs/chronostrain.ini` line tells the program to use the `chronostrain.ini` file provided in the `configs` subdirectory.

- The `-r ./inputs/input_files.csv` argument tells the program to use the `input_files.csv` file, which lists the input sequencing FASTQ files along with their metadata.

- The `-o ./output/filtered` argument tells the program to write the filtered FASTQ results to the `output/filtered` subdirectory.

- The `--aligner bwa` argument tells the program to use `bwa` for alignments during filtering.

- The `-s ./database/ecoli.clusters.txt` argument tells the program to use the clustering specified by the text file, applied to the database `database/ecoli.json` that is specified in the `chronostrain.ini` file.
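
The same invocation can also be assembled in Python, which makes the role of each flag explicit. The sketch below passes the configuration through the top-level `-c` option instead of the environment variable; following the usage line in the help text (`chronostrain [OPTIONS] COMMAND [ARGS]...`), it is placed before the subcommand. The function name `build_filter_command` is hypothetical, and the actual call is left commented out because it requires ChronoStrain to be installed.

```python
import subprocess


def build_filter_command(config, reads_csv, out_dir, clusters, aligner="bwa"):
    """Assemble the argument list for `chronostrain filter` (illustrative helper)."""
    return [
        "chronostrain",
        "-c", config,          # top-level option: INI configuration file
        "filter",
        "-r", reads_csv,       # index of input FASTQ files and their metadata
        "-o", out_dir,         # destination directory for filtered reads
        "--aligner", aligner,  # aligner used during filtering
        "-s", clusters,        # strain clustering file
    ]


cmd = build_filter_command(
    config="configs/chronostrain.ini",
    reads_csv="./inputs/input_files.csv",
    out_dir="./output/filtered",
    clusters="./database/ecoli.clusters.txt",
)
print(" ".join(cmd))

# To run it from Python rather than with a `!` shell escape:
# subprocess.run(cmd, check=True)
```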

In [2]:
%env CHRONOSTRAIN_INI=configs/chronostrain.ini

!chronostrain filter -r ./inputs/input_files.csv -o ./output/filtered --aligner bwa -s ./database/ecoli.clusters.txt

env: CHRONOSTRAIN_INI=configs/chronostrain.ini
2025-01-07 14:59:21,826 [INFO - chronostrain.cli.filter_timeseries] - Performing filtering to timeseries dataset `inputs/input_files.csv`.
2025-01-07 14:59:22,668 [INFO - chronostrain.util.alignments.pairwise.wrappers] - If invoked, bowtie2 will initialize using default setting `phred33`. (TODO: implement some flexibility here.)
2025-01-07 14:59:24,157 [INFO - chronostrain.database.database] - Instantiating database `ecoli`.
2025-01-07 14:59:24,192 [INFO - chronostrain.cli.filter_timeseries] - Loaded list of 2325 strains.
2025-01-07 14:59:24,192 [INFO - chronostrain.cli.filter_timeseries] - Target index file: output/filtered/filtered_input_files.csv
2025-01-07 14:59:31,332 [INFO - chronostrain.cli.filter_timeseries] - Applying filter to timepoint 10.0, inputs/0_sim_1.fq
2025-01-07 15:00:29,800 [INFO - chronostrain.cli.filter_timeseries] - Applying filter to timepoint 10.0, inputs/0_sim_2.fq
2025-01-07 15:00:30,433 [INFO - chronostrain.cli.

## Step 2: Perform Bayesian Inference
This command runs inference, using the filtered reads as input.

In [12]:
!chronostrain advi -r output/filtered/filtered_input_files.csv -o output/inference -s ./database/ecoli.clusters.txt

2025-01-07 18:31:29,411 - Pipeline for algs started.
2025-01-07 18:31:30,465 - If invoked, bowtie2 will initialize using default setting `phred33`. (TODO: implement some flexibility here.)
2025-01-07 18:31:30,477 - Loading time-series read files from output/filtered/filtered_input_files.csv
2025-01-07 18:31:32,141 - Instantiating database `ecoli`.
2025-01-07 18:31:32,175 - Loaded list of 2325 strains.
2025-01-07 18:31:32,201 - Initialized model inclusion prior with p=0.5
2025-01-07 18:31:32,201 - Initializing solver with Gaussian-Zero posterior
2025-01-07 18:31:36,565 - Initializing Fully joint posterior with Global Zeros
2025-01-07 18:31:36,643 - The first ELBO iteration will take longer due to JIT compilation.
2025-01-07 18:31:36,644 - Starting ELBO optimization.
2025-01-07 18:31:39,987 - Epoch 1 | Average ELBO = -14321.71 | LR = 0.0005
2025-01-07 18:31:40,546 - Epoch 2 | Average ELBO = -14249.47 | LR = 0.0005
2025-01-07 18:31:41,101 - Epoch 3 | Average ELBO = -14179.23 | LR = 0.0005

## Step 3: (Optional) Interpretation of posterior (transform abundance ratios into overall abundances)

Using `chronostrain advi`, the inference code has performed approximate posterior inference. The default options enable model trimming: strains with low read mapping counts are removed from the model, and strains that are indistinguishable given the reads are merged together.

In the paper, we also perform **posterior thresholding** ($\overline{\pi}$ in the paper), where we sample the abundance profiles restricted to those strains with posterior indicator probability $q(Z_s) > \overline{\pi}$, say $\overline{\pi} = 0.99$.
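
To make the thresholding idea concrete, here is a toy sketch in NumPy. It assumes that thresholding amounts to zeroing out strains whose inclusion probability does not exceed $\overline{\pi}$ and renormalizing the remaining abundances within each sample. This is an illustrative reading and not necessarily what ChronoStrain does internally; the arrays and the function name are made up.

```python
import numpy as np


def threshold_and_renormalize(samples, inclusion_probs, threshold=0.99):
    """Restrict abundance samples to strains with inclusion_probs > threshold.

    samples: float array of shape (N, S), one abundance vector per posterior sample.
    inclusion_probs: array of shape (S,), one probability q(Z_s) per strain.
    Returns the renormalized samples and the boolean keep-mask.
    """
    keep = inclusion_probs > threshold
    restricted = samples * keep  # zero out the excluded strains
    totals = restricted.sum(axis=1, keepdims=True)
    # Guard against division by zero when no strain passes the threshold.
    renormalized = np.divide(
        restricted, totals, out=np.zeros_like(restricted), where=totals > 0
    )
    return renormalized, keep


# Toy inputs: 2 posterior samples over 3 strains (values are invented).
toy_samples = np.array([[0.5, 0.3, 0.2],
                        [0.6, 0.1, 0.3]])
toy_q = np.array([0.999, 0.40, 0.995])

toy_result, toy_keep = threshold_and_renormalize(toy_samples, toy_q, threshold=0.99)
print(toy_keep)
print(toy_result)
```

In this toy case the second strain is dropped, and each row of the result sums to one over the two remaining strains.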

The following call, `chronostrain interpret`, transforms the output data (the raw approximate posterior distribution) into relative abundance profiles in which the ad-hoc clustering has been "unwound", restricted to *E. coli* strains that have large posterior scores.

Note: The raw outputs are **abundance ratios**. Sometimes the actual **relative abundance** (e.g. the overall abundance of each strain divided by the total bacterial count) is more interpretable; to obtain it, specify the `--convert-to-overall` option below.

In [52]:
# use this for raw abundance ratios.
!chronostrain interpret -a ./output/inference -r ./output/filtered/filtered_input_files.csv -o ./output/abundances -s ./database/ecoli.clusters.txt -p 0.99 -rs "Escherichia coli"

# use this instead for conversion to relative abundance.
#!chronostrain interpret -a ./output/inference -r ./output/filtered/filtered_input_files.csv -o ./output/abundances -s ./database/ecoli.clusters.txt -p 0.99 -rs "Escherichia coli" --convert-to-overall

2025-01-07 19:31:11,944 - If invoked, bowtie2 will initialize using default setting `phred33`. (TODO: implement some flexibility here.)
2025-01-07 19:31:13,462 - Instantiating database `ecoli`.
2025-01-07 19:31:13,472 - Loaded list of 2325 strains.
2025-01-07 19:31:13,473 - Restricting to target species Escherichia coli (found 842 strain entries/clusters)
2025-01-07 19:31:13,490 - Initializing Fully joint posterior with Global Zeros
2025-01-07 19:31:13,552 - Using posterior threshold = 0.99
2025-01-07 19:31:14,136 - 7 of 852 inference strains passed Posterior p(Z_s|Data) > 0.99
2025-01-07 19:31:14,442 - Finished the conversion.


Now, load the abundance profiles using Python (the abundance profiles are saved as 3-D NumPy arrays).

In [53]:
import numpy as np

# The previous step generated a profile of 842 strain IDs.
with open("./output/abundances/profiled_strains.txt", "rt") as f:
    strain_ids = [line.strip() for line in f]

# The inference was performed jointly across 3 timepoints.
time_points = np.load("./output/abundances/time_points.npy")

# Abundance profiles are of shape (T, N, S), where T = (# timepoints), N = (# posterior samples), and S = (# strains profiled).
# Using 5000 posterior samples, this should be of shape (3, 5000, 842)
abundance_profile = np.load("./output/abundances/abundance_profile.npy")  


# 95% credible intervals and the median, computed across posterior samples (axis 1).
per_timepoint_upper = np.quantile(abundance_profile, axis=1, q=0.975)
per_timepoint_lower = np.quantile(abundance_profile, axis=1, q=0.025)
per_timepoint_medians = np.median(abundance_profile, axis=1)
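
As a sanity check on the axis convention, the following sketch repeats the same reductions on synthetic data. The Dirichlet samples and the small array sizes are stand-ins chosen for illustration; they are not ChronoStrain output, and the `toy_` variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, S = 3, 200, 4  # timepoints, posterior samples, strains (toy sizes)

# Each length-S vector is a random point on the simplex, mimicking an abundance profile.
toy_profile = rng.dirichlet(np.ones(S), size=(T, N))

# Reducing over axis=1 collapses the posterior samples, leaving one value per (timepoint, strain).
toy_upper = np.quantile(toy_profile, q=0.975, axis=1)
toy_lower = np.quantile(toy_profile, q=0.025, axis=1)
toy_median = np.median(toy_profile, axis=1)

print(toy_profile.shape)  # (3, 200, 4)
print(toy_median.shape)   # (3, 4)
```

The reduced arrays have shape `(T, S)`, which is why the later cells index them as `per_timepoint_medians[t_idx, s_idx]`.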

The ground-truth **abundance ratio** is this:

| timepoint | NZ_LS992166.1 | NZ_CP090456.1 | NZ_CP035870.1 | NZ_CP054224.1 | NZ_CP070103.1 | NZ_LT883142.1 |
|---|---|---|---|---|---|---|
| 10 | 0.261437908496732 | 0.2745098039215686 | 0.0980392156862745 | 0.2941176470588235 | 0.0130718954248366 | 0.05882352941176469 |
| 22 | 0.3146067415730337 | 0.348314606741573 | 0.13483146067415727 | 0.12359550561797751 | 0.03370786516853932 | 0.04494382022471909 |
| 36 | 0.2590673575129534 | 0.5008635578583766 | 0.07772020725388601 | 0.0690846286701209 | 0.08635578583765113 | 0.00690846286701209 |

with $R=5000$ simulated reads (Illumina paired-end: $150 \times 2$ nt) per timepoint, so we should expect quite a noisy reconstruction.

(*back-of-the-napkin calculation: assuming each chromosome has 5,000,000 nucleotides, this roughly amounts to $(5000 \times 150 \times 2) / (6 \times 5{,}000{,}000) = 0.05$ mean coverage, averaged across the six ground-truth genomes.*)
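
The same estimate can be reproduced in a few lines of Python. The function and parameter names below are ours, and the 5,000,000-nucleotide genome length is the rough assumption stated above rather than a measured value.

```python
def mean_coverage(num_reads, read_length, reads_per_pair, num_genomes, genome_length):
    """Total sequenced bases divided by the combined length of all genomes."""
    sequenced_bases = num_reads * read_length * reads_per_pair
    return sequenced_bases / (num_genomes * genome_length)


coverage = mean_coverage(
    num_reads=5000,           # R, simulated reads per timepoint
    read_length=150,          # nt per mate
    reads_per_pair=2,         # paired-end
    num_genomes=6,            # ground-truth strains
    genome_length=5_000_000,  # assumed chromosome size in nt
)
print(f"{coverage:.2f}")  # 0.05
```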

In [54]:
import pandas as pd
ground_truth_ids = ["NZ_LS992166.1", "NZ_CP090456.1", "NZ_CP035870.1", "NZ_CP054224.1", "NZ_CP070103.1", "NZ_LT883142.1"]
strain_id_indices = {s_id: s_idx for s_idx, s_id in enumerate(strain_ids)}

# Let's print the median estimations.
df_entries = []
for t_idx, t in enumerate(time_points):
    df_entry = {"timepoint": t}
    for s_id in ground_truth_ids:
        # Note: the ground-truth ID may not exist, if it was culled during inference due to poor/non-existing alignments of reads to markers.
        if s_id in strain_id_indices:
            s_idx = strain_id_indices[s_id]
            pred = per_timepoint_medians[t_idx, s_idx]
        else:
            pred = 0.0
        df_entry[s_id] = pred
    df_entries.append(df_entry)
    

display(pd.DataFrame(df_entries))

| | timepoint | NZ_LS992166.1 | NZ_CP090456.1 | NZ_CP035870.1 | NZ_CP054224.1 | NZ_CP070103.1 | NZ_LT883142.1 |
|---|---|---|---|---|---|---|---|
| 0 | 10.0 | 0.237226 | 0.244581 | 0.0 | 0.0 | 0.008621 | 0.030648 |
| 1 | 22.0 | 0.361223 | 0.378973 | 0.0 | 0.0 | 0.004775 | 0.0247 |
| 2 | 36.0 | 0.214484 | 0.50008 | 0.0 | 0.0 | 0.05138 | 0.022046 |


The above result should have two false negatives (NZ_CP035870.1 and NZ_CP054224.1) and four true positives (NZ_LS992166.1, NZ_CP090456.1, NZ_CP070103.1, NZ_LT883142.1).
There are 7 nonzero-abundance strains in total (check the output of `chronostrain interpret`), so the posterior threshold employed ($\overline{\pi} = 0.99$) resulted in $7-4=3$ false positives.
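
This tally can be reproduced directly from the printed medians. The sketch below hard-codes the median estimates shown in the output above and the count of 7 strains reported by `chronostrain interpret`; treating "nonzero median at any timepoint" as a detection is our simplification for this example.

```python
# Median estimates copied from the output above (timepoints 10, 22, 36).
median_table = {
    "NZ_LS992166.1": [0.237226, 0.361223, 0.214484],
    "NZ_CP090456.1": [0.244581, 0.378973, 0.50008],
    "NZ_CP035870.1": [0.0, 0.0, 0.0],
    "NZ_CP054224.1": [0.0, 0.0, 0.0],
    "NZ_CP070103.1": [0.008621, 0.004775, 0.05138],
    "NZ_LT883142.1": [0.030648, 0.0247, 0.022046],
}

# Number of strains that passed the posterior threshold, per the `chronostrain interpret` log.
num_passing_threshold = 7

# A ground-truth strain counts as detected if its median is nonzero at any timepoint.
true_positives = [s for s, vals in median_table.items() if any(v > 0 for v in vals)]
false_negatives = [s for s in median_table if s not in true_positives]
num_false_positives = num_passing_threshold - len(true_positives)

print("True positives: ", true_positives)
print("False negatives:", false_negatives)
print("False positives:", num_false_positives)
```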