# How much gets filtered when removing rRNA, tRNA, and mtDNA/RNA? 
# Notebook 2: supplementing input fastqs with faked RNA virus reads for tracking of their removal/retention   

In addition to the different cellular/organelle rRNA and tRNA sequence sets, I want to see how the parameters used to filter these sequence types affect how many RNA virus reads get removed as well.  
This is a crude experiment: I will use the same RefSeq / RVMT set (or a subset of it) both to generate the fake reads and to mask the potential contaminants, to see if masking really helps retain more RNA virus reads.  


The test datasets are loosely defined in [get_data.py](../../../src/rolypoly/utils/benchmarking/get_data.py):  
- concentrated viral RNA of river estuary (SRR11097768) https://www.nature.com/articles/s41564-020-0755-4  
- soil metatranscriptome (SRR14039684) https://www.nature.com/articles/s41564-022-01180-2 (total RNA metatranscriptome from soil)
- human infected RNA-seq (SRR14871112) https://www.ncbi.nlm.nih.gov/pubmed/34970230 (lyssa virus on cell culture, I think)
- [five hiv mix dataset](https://github.com/cbg-ethz/5-virus-mix) - **NOTE1** not 150x2 but: ([SRR961514](https://www.ncbi.nlm.nih.gov/sra/?term=SRR961514)) - **NOTE2** not a true RNA virus, so should help getting a sense of the retention of non-RNA virus reads.
- Soil metatranscriptome from Yellowstone National Park with rRNA depletion (SRX25111908) or polyA selection (SRX25111907) https://doi.org/10.1093/ismeco/ycae151
- [VIROMOCK dataset 9](https://gitlab.com/ilvo/VIROMOCKchallenge) - real or semi-artificial metatranscriptome with known composition. [Dataset 9](https://gitlab.com/ilvo/VIROMOCKchallenge/-/blob/master/Datasets/Dataset9.md) is described as "2 x 151 (R1), 2 x 84 (R2); 5,259,903 (R1), 5,259,903 (R2) - Concentration of different PiVB genomic segments". I still need to understand what the dual read lengths mean; it doesn't look like regular 150x2.

- marine metatranscriptome (SRRnnn)  --- placeholder, need to pick a real one
- gut metatranscriptome (SRRnnn)  --- placeholder, need to pick a real one
- fungal isolate RNA-seq (SRRnnn)  --- placeholder, need to pick a real one - Marco replied on Slack he might have a suggestion. 


The general workflow (for each dataset):
- Calculate the initial stats of the input fastq (number of reads, total bases, mean quality, quality variance) (using `reformat.sh` and then `testformat2.sh`)
- Create fake RNA virus reads from the RVMT/RefSeq set (using `randomreadsmg.sh` or `shred.sh`), with an excess number of reads (e.g. let's say enough to get 10x coverage of each contig) with quality scores similar to the input fastq.  
- Create a Name-->read ID/Header mapping dataframe for the fake RNA virus reads record keeping/tracking. We might want to do this prior to creating the reads as randomreadsmg.sh reads only have the contig number (position from the original fasta file) in the faked read headers. 
- Combine (spike-in) the fake RNA virus reads with the input fastq to create a new fastq file (using `cat.sh`, to retain sequence order when stacking bgzipped stuff). 
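
The stacking step relies on the fact that gzip files appended back to back still form a valid (multi-member) gzip stream, read in the order they were appended. A minimal standard-library sketch of that idea follows; it is not what `cat.sh` does internally, and the file names and the `concat_gzip` helper are hypothetical. For paired data, both inputs would need the same layout (e.g. both interleaved).

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_gzip(inputs, output):
    """Append gzip files byte-for-byte; the result is a valid multi-member gzip."""
    with open(output, "wb") as out_handle:
        for path in inputs:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)


def demo_concat():
    """Write two tiny gzipped FASTQ files, stack them, and list the read headers."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        sample = tmp / "sample.fq.gz"      # stands in for the real input fastq
        spike = tmp / "spike.fq.gz"        # stands in for the fake virus reads
        combined = tmp / "combined.fq.gz"
        with gzip.open(sample, "wt") as handle:
            handle.write("@real_read_1\nACGT\n+\nIIII\n")
        with gzip.open(spike, "wt") as handle:
            handle.write("@fake_read_1\nTTGA\n+\nIIII\n")
        concat_gzip([sample, spike], combined)
        with gzip.open(combined, "rt") as handle:
            return [line.strip() for line in handle if line.startswith("@")]


print(demo_concat())  # ['@real_read_1', '@fake_read_1']
```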

Reminder:
- spike in < 10% of the original sample's total raw reads
- fake reads - maybe only for <1000 genomes at a time.
- use stratified sampling so that most of the large taxa are represented.

<!-- 
# coverage = (read count * read length) / total genome size
# read count = (coverage * genome length) / read length
# (10 * 5000) / 150 = 333.3 reads (x3 for a 15 kbp genome) -->
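
The coverage arithmetic behind the "10x" target can be written as a small helper. This is only a sketch: `read_pairs_for_coverage` is a hypothetical name, and it assumes every read is exactly `read_length` bases long.

```python
from math import ceil


def read_pairs_for_coverage(genome_length, coverage, read_length, paired=True):
    """Reads (or read pairs, if paired) needed to reach `coverage` on one genome."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return ceil(coverage * genome_length / bases_per_fragment)


# 5 kbp genome, 10x, 150 bp single-end reads: 50,000 / 150 = 333.3, rounded up
print(read_pairs_for_coverage(5_000, 10, 150, paired=False))   # 334
# Same genome with 2 x 150 bp pairs needs half as many fragments
print(read_pairs_for_coverage(5_000, 10, 150, paired=True))    # 167
# A 15 kbp genome needs three times the reads of the 5 kbp one
print(read_pairs_for_coverage(15_000, 10, 150, paired=False))  # 1000
```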

Loading libraries and defining paths to sets already created/downloaded:

In [None]:
import json
import logging
import shutil
import subprocess
import tempfile
import os
import time
import glob
from pathlib import Path as pt
import polars as pl
from tqdm.notebook import tqdm
from bbmapy import bbduk, bbmask, kcompress, cat, randomreadsmg, reformat, testformat2

from rolypoly.utils.bio.sequences import (
    filter_fasta_by_headers,
    write_fasta_file,
    remove_duplicates
)

from rolypoly.utils.bio.polars_fastx import from_fastx_eager, fasta_stats, compute_aggregate_stats

from rolypoly.utils.logging.loggit import  setup_logging
from rolypoly.utils.various import run_command_comp

### DEBUG ARGS (for manually building, not entering via CLI):
threads = 12
log_file = "notebooks/Exprimental/trrna.log"
data_dir = "/clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data"

global rrna_dir
global contam_dir

logger = setup_logging(log_file)
print(f"Starting data preparation to : {data_dir}")

contam_dir = os.path.join(data_dir, "contam")
os.makedirs(contam_dir, exist_ok=True)

rrna_dir = os.path.join(contam_dir, "rrna")
os.makedirs(rrna_dir, exist_ok=True)

trna_dir = os.path.join(contam_dir, "trna")
os.makedirs(trna_dir, exist_ok=True)

masking_dir = os.path.join(contam_dir, "masking")
os.makedirs(masking_dir, exist_ok=True)

# taxonomy_dir = os.path.join(data_dir, "taxdump")
# os.makedirs(taxonomy_dir, exist_ok=True)

reference_seqs = os.path.join(data_dir, "reference_seqs")
os.makedirs(reference_seqs, exist_ok=True)

mmseqs_ref_dir = os.path.join(reference_seqs, "mmseqs")
os.makedirs(mmseqs_ref_dir, exist_ok=True)

rvmt_dir = os.path.join(reference_seqs, "RVMT")
os.makedirs(rvmt_dir, exist_ok=True)

ncbi_ribovirus_dir = os.path.join(reference_seqs, "ncbi_ribovirus")
os.makedirs(ncbi_ribovirus_dir, exist_ok=True)

# Masking sequences preparation
rvmt_fasta_path = os.path.join(
    data_dir, "reference_seqs", "RVMT", "RVMT_cleaned_contigs.fasta"
)
ncbi_ribovirus_fasta_path = os.path.join(
    data_dir,
    "reference_seqs",
    "ncbi_ribovirus",
    "refseq_ribovirus_genomes.fasta",
)

rna_viruses_entropy_masked_path = os.path.join(
    masking_dir, "combined_entropy_masked.fasta"
)

Starting data preparation to : /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data


## Prepare Test Datasets
Define the test datasets and their SRA accessions:


In [None]:
# Define test datasets as polars DataFrame
test_datasets = pl.DataFrame({
    # TODO: double-check the last two entries. The intro lists SRX25111908 as rRNA depletion
    # and SRX25111907 as polyA selection, which does not match the name order used here.
    "dataset_name": ["viral_river", "soil_metatranscriptome", "human_infected", "hiv_mix", "hot_spring_soil_polyA", "hot_spring_soil_rRNAdepletion"],
    "accession": ["SRR11097768", "SRR14039684", "SRR14871112", "SRR961514", "SRX25111908", "SRX25111907"],
    "description": [
        "Concentrated viral RNA of river estuary",
        "Total RNA metatranscriptome from soil",
        "Lyssa virus on cell culture",
        "Five HIV mix dataset (not 150x2, not true RNA virus)",
        "Soil metatranscriptome from Yellowstone National Park with rRNA depletion",  # comma was missing, which silently merged the last two strings
        "Soil metatranscriptome from Yellowstone National Park with polyA selection",
    ],
    "url": [
        "https://www.nature.com/articles/s41564-020-0755-4",
        "https://www.nature.com/articles/s41564-022-01180-2",
        "https://www.ncbi.nlm.nih.gov/pubmed/34970230",
        "https://github.com/cbg-ethz/5-virus-mix",
        "https://doi.org/10.1093/ismeco/ycae151",
        "https://doi.org/10.1093/ismeco/ycae151"
    ]
})

# Add directory paths
test_datasets = test_datasets.with_columns(
    pl.format("{}/{}", pl.lit(os.path.join(data_dir, "test_fastqs")), pl.col("dataset_name")).alias("dataset_dir")
)

# TODO: add some more datasets...

# Create output directory for test data
test_data_dir = os.path.join(data_dir, "test_fastqs")
print(f"Creating test data directory at: {test_data_dir}")
# remember to add to .gitignore
os.makedirs(test_data_dir, exist_ok=True)

for dataset_dir in test_datasets["dataset_dir"].unique().to_list():
    os.makedirs(dataset_dir, exist_ok=True)
    print(f"Created directory: {dataset_dir}")

print("\nTest datasets:")
print(test_datasets)

Creating test data directory at: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs
Created directory: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hot_spring_soil_rRNAdepletion
Created directory: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/viral_river
Created directory: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hot_spring_soil_polyA
Created directory: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/soil_metatranscriptome
Created directory: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/human_infected
Created directory: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hiv_mix

Test datasets:
shape: (6, 5)
┌─────────────────────┬─────────────┬────────────────────┬────────────────────┬────────────────────┐
│ dataset_name        ┆ accession   ┆ description        ┆ url                ┆ dataset

## Step 1: Download test FASTQ files from SRA
Using rolypoly's `download_fastq` (which fetches the FASTQ files from ENA with `aria2c`) to download the test datasets

In [52]:
# Download SRA data for each test dataset using rolypoly's fetch method
from rolypoly.commands.misc.fetch_sra_fastq import download_fastq

for row in test_datasets.iter_rows(named=True):

    accession = row["accession"]
    dataset_dir = row["dataset_dir"]
    dataset_name = row["dataset_name"]
    logger.info(f"Processing {dataset_name} ({accession})")
    
    # Skip the download if FASTQ files already exist, so this cell can be re-run without thinking about it
    existing_fastq = glob.glob(os.path.join(dataset_dir, "*.fastq.gz"))
    if len(existing_fastq) > 0:
        logger.info(f"FASTQ files already exist for {dataset_name}, skipping download")
        continue
    
    # Download FASTQ files from ENA using rolypoly's method
    logger.info(f"Downloading {accession} from ENA")
    try:
        download_fastq(accession, pt(dataset_dir))
        logger.info(f"Completed download of {dataset_name}")
    except Exception as e:
        logger.error(f"Failed to download {dataset_name}: {e}")
        continue

print("All datasets downloaded and processed")

Running command: aria2c  --dir /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hot_spring_soil_polyA --out SRR29605745_1.fastq.gz --max-connection-per-server 10 --split 16 --summary-interval 0 --console-log-level warn  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR296/045/SRR29605745/SRR29605745_1.fastq.gz

Running command: aria2c  --dir /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hot_spring_soil_polyA --out SRR29605745_2.fastq.gz --max-connection-per-server 10 --split 16 --summary-interval 0 --console-log-level warn  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR296/045/SRR29605745/SRR29605745_2.fastq.gz

Running command: aria2c  --dir /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hot_spring_soil_rRNAdepletion --out SRR29605746_1.fastq.gz --max-connection-per-server 10 --split 16 --summary-interval 0 --console-log-level warn  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR296/046/SRR29605746/SRR29605746_1.fastq.gz

Running command: aria2c  --dir /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hot_spring_soil_rRNAdepletion --out SRR29605746_2.fastq.gz --max-connection-per-server 10 --split 16 --summary-interval 0 --console-log-level warn  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR296/046/SRR29605746/SRR29605746_2.fastq.gz

All datasets downloaded and processed


## Step 2: Calculate initial FASTQ statistics
Using `reformat.sh` to get stats on read count, total bases, mean quality, quality variance
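
As a sketch of how the read and base totals fall out of a length histogram: the helper below assumes a two-column, tab-separated `length<TAB>count` layout with `#`-prefixed header lines, which is the layout the `pl.read_csv` call in the next cell expects. The `length_hist_totals` name and the example text are made up for illustration.

```python
def length_hist_totals(lhist_text):
    """Return (total_reads, total_bases) from tab-separated length/count rows."""
    total_reads = 0
    total_bases = 0
    for line in lhist_text.splitlines():
        # Skip blank lines and '#'-prefixed header/comment lines
        if not line.strip() or line.startswith("#"):
            continue
        length, count = line.split("\t")[:2]
        total_reads += int(count)
        total_bases += int(length) * int(count)
    return total_reads, total_bases


example = "#Length\tCount\n100\t2\n150\t10\n"
print(length_hist_totals(example))  # (12, 1700)
```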

In [None]:
# Calculate initial stats of input fastqs
print("Calculating stats for downloaded FASTQs...")

# Lists to collect stats
mean_qualities = []
quality_variances = []
legnths = []
total_reads_list = []
total_bases_list = []
stats_list = []
for row in test_datasets.iter_rows(named=True):
    dataset_name = row["dataset_name"]
    dataset_dir = row["dataset_dir"]
    accession =row["accession"]
    
    print(f"\nProcessing {dataset_name}...")
    
    # Find FASTQ files
    r1_path = glob.glob(f"{dataset_dir}/*_1.fastq.gz")[0]
    r2_path = glob.glob(f"{dataset_dir}/*_2.fastq.gz")[0]
    
    # Output stats files
    qhist_path = os.path.join(dataset_dir,  f"{accession}_qhist.txt")
    out_fastq_file = os.path.join(dataset_dir, f"{accession}_rf.fastq.gz")
    
    # Run reformat to get quality histogram and stats
    # check if exists first cause I ran this cell before
    if os.path.exists(qhist_path) or os.path.exists(out_fastq_file):
        print("nice")
    else:
        reformat(
            in1=r1_path,
            in2=r2_path,
            out=out_fastq_file,
            bhist=os.path.join(dataset_dir, f"{accession}_bhist.txt"),
            lhist=os.path.join(dataset_dir, f"{accession}_lhist.txt"),
            gchist=os.path.join(dataset_dir, f"{accession}_gchist.txt"),
            threads=threads,
            qhist=qhist_path
      )
    # testformat2 histograms go to their own *_tf2_* files: writing them to the same paths
    # as reformat's would overwrite the qhist/lhist files parsed below
    # (a likely cause of the ShapeError when this cell is re-run).
    testsss = testformat2(
        threads=threads,
        in_file=out_fastq_file,
        barcodes=os.path.join(dataset_dir, f"{accession}_barcodes.txt"),  # Print barcodes to this file. Not sure this works when barcodes are not explicitly supplied
        bhist=os.path.join(dataset_dir, f"{accession}_tf2_bhist.txt"),
        lhist=os.path.join(dataset_dir, f"{accession}_tf2_lhist.txt"),
        gchist=os.path.join(dataset_dir, f"{accession}_tf2_gchist.txt"),
        qhist=os.path.join(dataset_dir, f"{accession}_tf2_qhist.txt"),
        junk=os.path.join(dataset_dir, f"{accession}_junk.txt"), # Print headers of junk reads to this file.
        ihist=os.path.join(dataset_dir, f"{accession}_ihist.txt"),
        zmwhist=os.path.join(dataset_dir, f"{accession}_zmwhist.txt"),
        sketch="false",
        merge="true",
        trim="t",
        capture_output=True
    )[0] # when capture_output is true bbmapy returns a tuple (stdout, stderr)
    # parse the stdout of testformat2
    stats_dict = {}
    for line in testsss.splitlines():
        if "\t" in line:
            key, value = line.split("\t", 1)
            key = key if not key.startswith("-") else key[1:]
            stats_dict[key.strip()] = value.strip()

    # Parse qhist.txt to calculate mean quality and variance
    qhist_data = pl.read_csv(
        qhist_path, 
        separator="\t", 
        comment_prefix="#",
        has_header=False,
        new_columns=["BaseNum", "Read1_linear", "Read1_log", "Read2_linear", "Read2_log"]
    )

    length_data = pl.read_csv(
        os.path.join(dataset_dir, f"{accession}_lhist.txt"), 
        separator="\t", 
        comment_prefix="#",
        has_header=False,
        new_columns=["Length","Count"]
    ) 

    total_reads = length_data["Count"].sum()
    total_bases = (length_data["Length"] * length_data["Count"]).sum()

    # insert_data = pl.read_csv(
    #     os.path.join(dataset_dir, f"{accession}_ihist.txt"), 
    #     separator="\t", 
    #     comment_prefix="#",
    #     has_header=False,
    #     new_columns=["Length","Count"]
    # ) # TODO: add to the dataframe 

    exp_read_len = qhist_data.height # this is really the max read length (each row of qhist_data holds the mean quality at one read position), assuming no outliers
    
    # Mean of the per-position mean qualities (unweighted: every position counts equally)
    mean_q1 = qhist_data["Read1_log"].mean()
    mean_q2 = qhist_data["Read2_log"].mean()
    mean_quality = (mean_q1 + mean_q2) / 2
    
    # Variance of the per-position mean qualities (not the per-base quality variance)
    var_q1 = qhist_data["Read1_linear"].var()
    var_q2 = qhist_data["Read2_linear"].var()
    quality_variance = (var_q1 + var_q2) / 2
    
    mean_qualities.append(mean_quality)
    quality_variances.append(quality_variance)
    legnths.append(exp_read_len)
    total_reads_list.append(total_reads)
    total_bases_list.append(total_bases)
    stats_list.append(stats_dict)
    
test_datasets = test_datasets.with_columns(
    pl.Series(mean_qualities).alias("mean_quality"),
    pl.Series(quality_variances).alias("quality_variance"),
    pl.Series(legnths).alias("expected_read_length"),
    pl.Series(total_reads_list).alias("total_reads"),
    pl.Series(total_bases_list).alias("total_bases"),
    pl.Series(stats_list).alias("stats"),
)

Calculating stats for downloaded FASTQs...

Processing viral_river...
nice


ShapeError: 5 column names provided for a DataFrame of width 2

In [41]:
test_datasets

dataset_name,accession,description,url,dataset_dir
str,str,str,str,str
"""viral_river""","""SRR11097768""","""Concentrated viral RNA of rive…","""https://www.nature.com/article…","""/clusterfs/jgi/scratch/science…"
"""soil_metatranscriptome""","""SRR14039684""","""Total RNA metatranscriptome fr…","""https://www.nature.com/article…","""/clusterfs/jgi/scratch/science…"
"""human_infected""","""SRR14871112""","""Lyssa virus on cell culture""","""https://www.ncbi.nlm.nih.gov/p…","""/clusterfs/jgi/scratch/science…"
"""hiv_mix""","""SRR961514""","""Five HIV mix dataset (not 150x…","""https://github.com/cbg-ethz/5-…","""/clusterfs/jgi/scratch/science…"


## Step 3: Select RNA virus reference sequences
Select a subset of RVMT/RefSeq sequences for generating fake reads. Using stratified sampling by taxonomy to get diverse representation.
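
A small sketch of the proportional allocation with a floor of one sequence per family. The family counts below are invented for illustration and `allocate_samples` is a hypothetical helper; it uses integer floor division, which truncates the same way as the cast to `Int64` in the cell below. It also shows why the final selection can end up larger than the target: every family whose share rounds down to zero is bumped up to one.

```python
def allocate_samples(family_counts, target_total, minimum=1):
    """Proportional sample size per family, truncated, with a per-family floor."""
    total = sum(family_counts.values())
    allocation = {}
    for family, count in family_counts.items():
        share = count * target_total // total  # truncating proportional share
        # Never ask for more sequences than the family has
        allocation[family] = min(count, max(share, minimum))
    return allocation


# Made-up counts: two large families and two rare ones
counts = {"Fiersviridae": 9000, "Narnaviridae": 900, "Totiviridae": 95, "f.0032": 5}
allocation = allocate_samples(counts, target_total=100)
print(allocation)                # {'Fiersviridae': 90, 'Narnaviridae': 9, 'Totiviridae': 1, 'f.0032': 1}
print(sum(allocation.values()))  # 101, one more than the target of 100
```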

In [None]:
# Load RNA virus reference sequences
# Use RVMT sequences as reference
rvmt_fasta_df = from_fastx_eager(rvmt_fasta_path)

rvmt_info_df = pl.read_csv(
    "https://portal.nersc.gov/dna/microbial/prokpubs/Riboviria/RiboV1.4/RiboV1.6_Info.tsv",
    separator="\t",
    null_values=["NA", ""],
)
rvmt_info_df = pl.concat([rvmt_info_df.filter(
            ~(pl.col("Note")
            .str.contains_any(
                ["chim", "rRNA", "cell"], ascii_case_insensitive=True,
            )) 
        ),
    rvmt_info_df.filter(
            (pl.col("Note").is_null()) 
        )
])
rvmt_info_df = rvmt_fasta_df.join(
    rvmt_info_df, left_on="header", right_on="ND", how="inner"
)
print(f"Total RVMT sequences: {rvmt_info_df.height}")

# Add length column to joined dataframe
rvmt_info_df = rvmt_info_df.with_columns(
    pl.col("sequence").str.len_chars().alias("seq_length")
)

# Target ~1000 genomes, stratified by taxonomy to preserve abundance but still see most(all?) families
target_total = 1000

# Group by Family and calculate proportional sampling
family_counts = rvmt_info_df["Family"].value_counts().sort("count", descending=True)
print(f"\nOriginal Family distribution (top 10):")
print(family_counts.head(10))

# Calculate sampling size for each family proportional to its abundance
total_seqs = rvmt_info_df.height
family_counts = family_counts.with_columns(
    (pl.col("count") / total_seqs * target_total).cast(pl.Int64).alias("target_sample")
)

# Ensure at least 1 sequence per family if possible
family_counts = family_counts.with_columns(
    pl.when(pl.col("target_sample") == 0)
    .then(1)
    .otherwise(pl.col("target_sample"))
    .alias("target_sample")
)

# Sample from each family
sampled_seqs = []
for row in family_counts.iter_rows(named=True):
    family = row["Family"]
    target_sample = row["target_sample"]
    
    family_df = rvmt_info_df.filter(pl.col("Family") == family)
    family_count = family_df.height
    
    if family_count == 0:
        continue
    
    # Sample up to target, but not more than available
    sample_size = min(target_sample, family_count)
    
    if sample_size > 0:
        sampled = family_df.sample(n=sample_size, seed=42)
        sampled_seqs.append(sampled)
        print(f"Sampled {sample_size} from {family} (had {family_count}, target {target_sample})")

selected_viruses = pl.concat(sampled_seqs)
print(f"\nTotal selected sequences: {selected_viruses.height}")

# Verify taxonomic distribution is preserved
selected_family_counts = selected_viruses["Family"].value_counts().sort("count", descending=True)
print(f"\nSelected Family distribution (top 10):")
print(selected_family_counts.head(10))

# Write selected sequences to file
selected_virus_fasta = os.path.join(reference_seqs, "selected_rna_viruses.fasta")
write_fasta_file(
    seqs=selected_viruses["sequence"].to_list(),
    headers=selected_viruses["header"].to_list(),
    output_file=selected_virus_fasta,
)
print(f"\nWrote selected RNA virus sequences to {selected_virus_fasta}")

# Save mapping table
selected_viruses.write_parquet(
    os.path.join(reference_seqs, "selected_rna_viruses_metadata.parquet")
)

header,sequence,Full_name,RCR90,Segmented,RID,Note,Genetic_Code,Type.hit,Hit.s.,RBS,RvANI90,AfLvl,Host_Evidence,Host,BinID,Phylum,Class,Order,Family,Genus,Novel,Source,Set,Length
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,bool,str,str,i64
"""ND_000631""","""GTTCATCATTCGGAGAAACTCAATGACGAC…","""SRR6960799_865_length_3716_cov…","""al2_000124""",,"""al2_000124""","""Putative_Lysis_Encoding,""","""Standard""",,,"""3/4""","""ANI90_0022516""","""Lvl 0 - Megatree leaves.""","""Host known clade - Leviviricet…","""Bacteria""","""ND_000631""","""Lenarviricota""","""Leviviricetes""","""Norzivirales""","""Fiersviridae""","""Brudgevirus_borborovivens""",false,"""SRR6960799""","""Callanan et al 2020""",3716
"""ND_000651""","""CCGTTTCTGCTTTAAAAAAGAGTAAGCAGA…","""SRR6960799_991_length_3554_cov…","""al2_000141""",,"""al2_000141""","""Putative_Lysis_Encoding,""","""Standard""",,,"""3/3""","""ANI90_0022570""","""Lvl 0 - Megatree leaves.""","""Host known clade - Leviviricet…","""Bacteria""","""ND_000651""","""Lenarviricota""","""Leviviricetes""","""Norzivirales""","""Fiersviridae""","""Brudgevirus_caenenecus""",false,"""SRR6960799""","""Callanan et al 2020""",3554
"""ND_000726""","""AGATTGAGAACCTAACCTTGCGGTATAGGG…","""SRR6960799_6339_length_1654_co…","""Rv4_124201""",,"""al2_000173""","""Putative_Lysis_Encoding,""","""Standard""",,,"""1/1""","""ANI90_0022528""","""Lvl 1 - BLASTp match ID >= 90%…","""Host known clade - Leviviricet…","""Bacteria""","""ND_000726""","""Lenarviricota""","""Leviviricetes""","""Norzivirales""","""Fiersviridae""",,false,"""SRR6960799""","""Callanan et al 2020""",1654
"""ND_002418""","""ATCTCCTTTACGTCCCTCACAGGACAACCA…","""SRR6960803_3007_length_3659_co…","""al2_000202""",,"""al2_000202""","""Putative_Lysis_Encoding,""","""Standard""",,,"""3/3""","""ANI90_0022214""","""Lvl 0 - Megatree leaves.""","""Host known clade - Leviviricet…","""Bacteria""","""ND_002418""","""Lenarviricota""","""Leviviricetes""","""Norzivirales""","""Fiersviridae""","""Brudgevirus_defluviicola""",false,"""SRR6960803""","""Callanan et al 2020""",3659
"""ND_003654""","""AGAAGAGGGGGAACTCCCCTCTCCTCCCTT…","""SRR5466364_2068_length_3566_co…","""Rv4_039701""",,"""ND_3654.752-1087.fr2""","""Putative_Lysis_Encoding,""","""Standard""",,,"""3/3""","""ANI90_0011705""","""Lvl 1 - BLASTp match ID >= 90%…","""Host known clade - Leviviricet…","""Bacteria""","""ND_003654""","""Lenarviricota""","""Leviviricetes""","""Norzivirales""","""Fiersviridae""","""Brudgevirus_borborovicinum""",false,"""SRR5466364""","""Callanan et al 2020""",3566
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""ND_432607""","""AATATCACTCGGAATGCCTTCCGAGCGATT…","""3300030770_Ga0315878_103206""","""Rv4_238876""",,"""Rv4_311016""",,"""Standard""",,,,"""ANI90_0003396""","""Lvl 1 - BLASTp match ID >= 90%…",,,"""ND_432607""","""Lenarviricota""","""Amabiliviricetes""","""Wolframvirales""","""Narnaviridae""",,true,"""3300030770""","""IMG/M - Chen et al 2019""",1201
"""ND_432608""","""AAATAATCTACTACCTTATTCATTAAAGTA…","""3300021303_Ga0210308_1120055""","""Rv4_123694""",,,,"""Standard""",,,,"""ANI90_0003395""","""Lvl 2 - members of single-leaf…",,,"""ND_432608""","""Pisuviricota""","""Pisoniviricetes""","""Picornavirales""","""f.0032""",,true,"""3300021303""","""IMG/M - Chen et al 2019""",1084
"""ND_432609""","""CTTGTCTGTGGTGACAAGATCAAAAGCCAC…","""3300032466_Ga0214503_1022112""","""Rv4_258650""",,"""Rv4_311018""",,"""Standard""",,,,"""ANI90_0003394""","""Lvl 1 - BLASTp match ID >= 90%…",,,"""ND_432609""","""Duplornaviricota""","""Chrymotiviricetes""","""Ghabrivirales""","""Totiviridae""",,true,"""3300032466""","""IMG/M - Chen et al 2019""",1734
"""ND_432610""","""GCCAGCTGTTGAGCGCGTGGAAGGCGCATC…","""3300021845_Ga0210297_1033801""","""Rv4_122081""",,"""Rv4_311019""",,"""Standard""",,,,"""ANI90_0003393""","""Lvl 1 - BLASTp match ID >= 90%…",,,"""ND_432610""","""Lenarviricota""","""Miaviricetes""","""Ourlivirales""","""Botourmiaviridae""",,true,"""3300021845""","""IMG/M - Chen et al 2019""",1267


## Step 4: Generate fake RNA virus reads
Using `randomreadsmg.sh` (via `bbmapy`) to create simulated reads with quality scores similar to the input datasets. Target 10x coverage with <10% spike-in rate.
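
The <10% spike-in constraint can be checked with a couple of lines once the fake and real read counts are known. The sketch below uses illustrative numbers and hypothetical helper names; it only computes the fraction and the subsampling probability that would bring the spike-in back under the cap, it does not do the subsampling itself.

```python
def spike_in_fraction(fake_reads, sample_reads):
    """Spike-in size relative to the original sample's read count."""
    return fake_reads / sample_reads


def keep_probability(fake_reads, sample_reads, max_fraction=0.10):
    """Per-read keep probability that caps the spike-in at max_fraction (1.0 if already under)."""
    allowed = max_fraction * sample_reads
    return min(1.0, allowed / fake_reads)


# Illustrative numbers only: ~1.39M fake reads against a 10M-read sample
fraction = spike_in_fraction(1_393_950, 10_000_000)
keep = keep_probability(1_393_950, 10_000_000)
print(f"spike-in fraction: {fraction:.3f}, keep probability: {keep:.3f}")
```

If the keep probability comes out below 1.0, the fake reads would need to be subsampled (or the target coverage lowered) before combining them with the input fastq.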

In [None]:
# Generate fake reads for each dataset
# Coverage formula: (read_count * read_length) / genome_size = coverage
# Read count = (coverage * genome_size) / read_length
target_coverage = 10
# read_length = 150  # Standard Illumina read length, but not true for all...

# # Calculate total genome size
# total_genome_size = selected_viruses["seq_length"].sum()
# print(f"Total selected genome size: {total_genome_size:,} bp")

# # Calculate required reads for 10x coverage
# required_reads = (target_coverage * total_genome_size) / (read_length * 2)  # *2 for paired-end
# print(f"Required read pairs for {target_coverage}x coverage: {required_reads:,.0f}")


from math import ceil
# Generate fake reads for each dataset
for row in test_datasets.iter_rows(named=True):
    dataset_name = row["dataset_name"]
    dataset_dir = row["dataset_dir"]
    accession = row["accession"]
    mean_q = row["mean_quality"]
    
    print(f"Generating fake RNA virus reads for {dataset_name}")
    print(f"  Using mean quality: {mean_q:.2f}")
    
    # Output file for fake reads
    fake_reads_file = os.path.join(dataset_dir, f"{accession}_fake_virus_reads.fq.gz")
    # Use the bbmapy randomreadsmg wrapper
    randomreadsmg(
        ref=selected_virus_fasta,
        paired="t",
        illuminanames="t",  
        mindepth=target_coverage,
        qavg=ceil(mean_q),
        qrange= ceil(row["quality_variance"]),
        length=row["expected_read_length"],
        out=fake_reads_file,
        overwrite="t",
        Xmx = "58G"
    )
    
    print(f"Generated fake reads: {fake_reads_file}")

print("Fake RNA virus reads generated for all datasets")


Generating fake RNA virus reads for viral_river
  Using mean quality: 35.92


  Generated fake reads: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/viral_river/SRR11097768_fake_virus_reads.fq.gz

Generating fake RNA virus reads for soil_metatranscriptome
  Using mean quality: 38.91


  Generated fake reads: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/soil_metatranscriptome/SRR14039684_fake_virus_reads.fq.gz

Generating fake RNA virus reads for human_infected
  Using mean quality: 35.49


  Generated fake reads: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/human_infected/SRR14871112_fake_virus_reads.fq.gz

Generating fake RNA virus reads for hiv_mix
  Using mean quality: 36.39


  Generated fake reads: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hiv_mix/SRR961514_fake_virus_reads.fq.gz

Fake RNA virus reads generated for all datasets


## Step 5: Create read ID mapping for fake reads
Parse fake read headers and create a mapping dataframe for tracking. Note: the headers written by `randomreadsmg.sh` only carry the contig position number (0-indexed), not the contig name.
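
A plain-Python sketch of the header parsing, assuming the header layout quoted in the cell below (`f_0_c_9_s_1_p_434_i_355_name_...`). Matching all four fields with one anchored pattern avoids a short pattern such as `s_(\d+)` hitting an unrelated part of the header. `parse_fake_header` and the two contig names in the example are for illustration only.

```python
import re

# c = contig index (0-based), s = strand, p = start position, i = insert size
HEADER_PATTERN = re.compile(
    r"c_(?P<contig>\d+)_s_(?P<strand>\d+)_p_(?P<start>\d+)_i_(?P<insert>\d+)"
)


def parse_fake_header(header, contig_names):
    """Extract the numeric fields and look up the source contig by its position."""
    match = HEADER_PATTERN.search(header)
    if match is None:
        return None
    fields = {key: int(value) for key, value in match.groupdict().items()}
    index = fields["contig"]
    in_range = 0 <= index < len(contig_names)
    fields["source_contig"] = contig_names[index] if in_range else "unknown"
    return fields


names = ["ND_000631", "ND_000651"]  # order must match the FASTA given to the simulator
parsed = parse_fake_header("f_0_c_1_s_1_p_434_i_355_name_selected_rna_viruses.fasta", names)
print(parsed)
# {'contig': 1, 'strand': 1, 'start': 434, 'insert': 355, 'source_contig': 'ND_000651'}
```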

In [72]:
# Create mapping between fake read IDs and source contigs
# The simulated read headers only carry the contig position (0-indexed) in the reference fasta

# First, create position-to-name mapping from selected virus fasta
position_to_name = {}
for i, header in enumerate(selected_viruses["header"].to_list()):
    position_to_name[i] = header

print(f"Created position mapping for {len(position_to_name)} contigs")

# Parse fake read headers and create mapping
fake_read_mappings = {}

for row in test_datasets.iter_rows(named=True):
    dataset_name = row["dataset_name"]
    dataset_dir = row["dataset_dir"]
    accession = row["accession"]
    fake_reads_file = os.path.join(dataset_dir, f"{accession}_fake_virus_reads.fq.gz")
    
    if not os.path.exists(fake_reads_file):
        print(f"WARNING: Fake reads file not found for {dataset_name}")
        continue
    
    print(f"\nParsing fake read headers for {dataset_name}")
    
    # Read fake read headers
    fake_reads_df = from_fastx_eager(fake_reads_file)
    
    # Extract contig position, strand, start position, and insert size from header
    # Header format: f_0_c_9_s_1_p_434_i_355_name_selected_rna_viruses.fasta
    # c_9 means contig index 9 (0-indexed, so the 10th contig in the file)
    # s_1 means strand (0=forward, 1=reverse)
    # p_434 means start position (0-indexed)
    # i_355 means insert size
    fake_reads_df = fake_reads_df.with_columns([
        pl.col("header").str.extract(r"c_(\d+)", 1).cast(pl.Int64).alias("contig_position"),
        pl.col("header").str.extract(r"s_(\d+)", 1).cast(pl.Int64).alias("strand"),
        pl.col("header").str.extract(r"p_(\d+)", 1).cast(pl.Int64).alias("start_position"),
        pl.col("header").str.extract(r"i_(\d+)", 1).cast(pl.Int64).alias("insert_size")
    ])
    
    # Map position to original contig name
    fake_reads_df = fake_reads_df.with_columns(
        pl.col("contig_position").map_elements(
            lambda x: position_to_name.get(x, "unknown"),
            return_dtype=pl.Utf8
        ).alias("source_contig")
    )
    
    # Save mapping with all extracted fields
    mapping_file = os.path.join(dataset_dir, f"{accession}_fake_read_mapping.parquet")
    fake_reads_df.select([
        "header", "contig_position", "source_contig", 
        "strand", "start_position", "insert_size"
    ]).write_parquet(mapping_file)
    
    fake_read_mappings[dataset_name] = mapping_file
    print(f"  Saved read mapping: {mapping_file}")
    print(f"  {fake_reads_df.height} fake reads mapped")

print("\nRead ID mapping completed for all datasets")

Created position mapping for 1176 contigs

Parsing fake read headers for viral_river
  Saved read mapping: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/viral_river/SRR11097768_fake_read_mapping.parquet
  1393950 fake reads mapped

Parsing fake read headers for soil_metatranscriptome
  Saved read mapping: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/soil_metatranscriptome/SRR14039684_fake_read_mapping.parquet
  683648 fake reads mapped

Parsing fake read headers for human_infected

In [73]:
fake_reads_df

header,sequence,quality,contig_position,strand,start_position,insert_size,source_contig
str,str,str,i64,i64,i64,i64,str
"""LH00088:93:90GLGMLT3:7:2275:83…","""AATAAAAGTGGAGATAGAGAACCCGTAACA…","""??????????????????????????????…",0,0,531,368,"""ND_381844"""
"""LH00088:93:90GLGMLT3:7:2271:13…","""TGCCAGACAGCATTAGTTAGTCAGCTGTCA…","""??????????????????????????????…",0,0,531,368,"""ND_381844"""
"""LH00088:93:90GLGMLT3:7:1453:13…","""ATGTCAGAAGCTCAATTTGATAGAGAGTTG…","""??????????????????????????????…",0,1,116,318,"""ND_381844"""
"""LH00088:93:90GLGMLT3:7:1099:94…","""GTACTGATCCGACTCATTAATAGTACAAAT…","""??????????????????????????????…",0,1,116,318,"""ND_381844"""
"""LH00088:93:90GLGMLT3:7:2509:14…","""CTCATAAGAGAGTCAGAATTTAATCCAACA…","""??????????????????????????????…",0,0,1203,410,"""ND_381844"""
…,…,…,…,…,…,…,…
"""LH00088:93:90GLGMLT3:7:2496:57…","""CTGCATTTGTTAATCTTTTATACATTTTAC…","""??????????????????????????????…",1175,0,6520,348,"""ND_022671"""
"""LH00088:93:90GLGMLT3:7:2059:39…","""GGAAGACATCTATTCACTATGTAGTCATTT…","""??????????????????????????????…",1175,0,4484,391,"""ND_022671"""
"""LH00088:93:90GLGMLT3:7:2429:79…","""CTCTGATGGTTCTAGATCTAAAAGATCGTT…","""??????????????????????????????…",1175,0,4484,391,"""ND_022671"""
"""LH00088:93:90GLGMLT3:7:1945:62…","""AACTCAATGATTCTGTACCCTGTAGAAGAC…","""??????????????????????????????…",1175,0,7278,304,"""ND_022671"""


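The header parsing above can also be sketched in plain Python, which makes the field layout easier to test on a single header. This is a minimal sketch that assumes the randomreads.sh header layout quoted in the code comments (`f_0_c_9_s_1_p_434_i_355_name_...`); the function name `parse_fake_read_header` and the `contig_9` name in the usage lines are illustrative, not part of the notebook.

```python
import re

# Assumed layout, taken from the example header in the notebook comments:
#   f_0_c_9_s_1_p_434_i_355_name_selected_rna_viruses.fasta
# Matching all four fields in one anchored pattern avoids picking up a stray
# "c_<digits>" or "s_<digits>" elsewhere in the header.
HEADER_RE = re.compile(r"_c_(\d+)_s_(\d+)_p_(\d+)_i_(\d+)_")


def parse_fake_read_header(header, position_to_name):
    """Return the parsed fields as a dict, or None if the header does not match."""
    match = HEADER_RE.search(header)
    if match is None:
        return None
    contig_position, strand, start_position, insert_size = map(int, match.groups())
    return {
        "contig_position": contig_position,  # 0-based index into the reference FASTA
        "strand": strand,                    # 0 = forward, 1 = reverse
        "start_position": start_position,    # 0-based start on the contig
        "insert_size": insert_size,
        "source_contig": position_to_name.get(contig_position, "unknown"),
    }


example = parse_fake_read_header(
    "f_0_c_9_s_1_p_434_i_355_name_selected_rna_viruses.fasta",
    {9: "contig_9"},  # hypothetical mapping for illustration
)
```

Returning `None` for non-matching headers makes it easy to count how many reads in a file are not fake reads at all.
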
## Step 6: Spike-in fake reads with original FASTQ
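Concatenate the fake RNA virus reads with each original FASTQ, keeping the spike-in below 10% of the combined read count. The cell only warns when the fraction exceeds this cap; it does not downsample. The original reads are taken from the reformatted, interleaved file (`{accession}_rf.fq.gz`) and concatenated with the fake reads using bbmapy's `cat`, which handles the gzip compression of both inputs.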
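
The spike-in check reduces to a small amount of arithmetic. The sketch below uses the same definition as the cell that follows (fake reads divided by original plus fake reads) and adds a hypothetical helper, `max_fake_reads`, that solves the inequality for the largest fake read count that still respects the cap. Both function names are illustrative.

```python
import math


def spike_in_fraction(original_reads, fake_reads):
    """Fraction of the combined file that consists of fake reads."""
    return fake_reads / (original_reads + fake_reads)


def max_fake_reads(original_reads, max_fraction=0.10):
    """Largest fake read count with spike_in_fraction <= max_fraction.

    fake / (orig + fake) <= f   <=>   fake <= f * orig / (1 - f)
    """
    return math.floor(max_fraction * original_reads / (1.0 - max_fraction))


# Read counts reported for the viral_river dataset in this notebook.
viral_river_fraction = spike_in_fraction(123_363_818, 1_393_950)
```

If a dataset exceeds the cap, subsampling the fake reads down to `max_fake_reads(total_reads)` before concatenation would be one way to enforce it rather than only warn.
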

In [75]:
row

{'dataset_name': 'hiv_mix',
 'accession': 'SRR961514',
 'description': 'Five HIV mix dataset (not 150x2, not true RNA virus)',
 'url': 'https://github.com/cbg-ethz/5-virus-mix',
 'dataset_dir': '/clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/data/test_fastqs/hiv_mix',
 'mean_qualities': 36.39075896414343,
 'quality_variance': 2.6106767956015937,
 'mean_quality': 36.39075896414343,
 'expected_read_length': 251,
 'total_reads': 1429988,
 'total_bases': 308025617}

In [83]:
# Spike-in fake reads with original FASTQ files
# Verify that fake reads make up <10% of the combined (original + fake) reads

max_spike_in_fraction = 0.10

for row in test_datasets.iter_rows(named=True):
    dataset_name = row["dataset_name"]
    dataset_dir = row["dataset_dir"]
    accession = row["accession"]
    total_reads = row["total_reads"]
    
    # Find original FASTQ files
    original_fastq = sorted(glob.glob(os.path.join(dataset_dir, f"{accession}*.fastq.gz")))
    original_fastq = [f for f in original_fastq if "fake" not in f and "spiked" not in f]
    
    fake_reads_file = os.path.join(dataset_dir, f"{accession}_fake_virus_reads.fq.gz")
    
    if not os.path.exists(fake_reads_file):
        print(f"WARNING: Fake reads file not found for {dataset_name}")
        continue
    
    if len(original_fastq) == 0:
        print(f"WARNING: Original FASTQ files not found for {dataset_name}")
        continue
    
    print(f"\nSpiking-in fake reads for {dataset_name}")
    print(f"  Original read count: {total_reads:,}")
    
    # Load fake reads count
    fake_reads_df = from_fastx_eager(fake_reads_file)
    fake_read_count = fake_reads_df.height
    print(f"  Fake read count: {fake_read_count:,}")
    
    # Verify spike-in fraction is acceptable
    spike_in_fraction = fake_read_count / (total_reads + fake_read_count)
    print(f"  Spike-in fraction: {spike_in_fraction:.2%}")
    
    if spike_in_fraction > max_spike_in_fraction:
        print(f"  WARNING: Spike-in fraction {spike_in_fraction:.2%} exceeds {max_spike_in_fraction:.2%}!")
    
    # Path to the reformatted, interleaved original reads
    interleaved = os.path.join(dataset_dir, f"{accession}_rf.fq.gz")
    if not os.path.exists(interleaved):
        print(f"WARNING: Interleaved FASTQ not found for {dataset_name}: {interleaved}")
        continue
    
    # Combine original and fake reads using bbmapy.cat
    # cat handles different compression types automatically
    spiked_fastq = os.path.join(dataset_dir, f"{accession}_spiked.fq.gz")
    
    print("  Concatenating with fake reads using bbmapy.cat...")
    cat(
        in_file=",".join([fake_reads_file, interleaved]),
        out=spiked_fastq
    )
    
    print(f"  Created spiked FASTQ: {spiked_fastq}")

print("\nSpike-in completed for all datasets")


Spiking-in fake reads for viral_river
  Original read count: 123,363,818
  Fake read count: 1,393,950
  Spike-in fraction: 1.12%
  Concatenating with fake reads using bbmapy.cat...


KeyboardInterrupt: 

## Summary: Spiked datasets ready for filtering experiments
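The cell below records, for each test dataset, whether its spiked FASTQ and read mapping file exist. Datasets with both files are marked `ready` for the filtering experiments. The check is existence-only, so a spiked file left behind by an interrupted concatenation is still reported as `ready`; the `file_size_mb` column is the quickest way to spot such a file.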

In [77]:
# Create summary dataframe of prepared datasets
summary_records = []

for row in test_datasets.iter_rows(named=True):
    dataset_name = row["dataset_name"]
    dataset_dir = row["dataset_dir"]
    accession = row["accession"]
    description = row["description"]
    
    spiked_fastq = os.path.join(dataset_dir, f"{accession}_spiked.fq.gz")
    mapping_file = os.path.join(dataset_dir, f"{accession}_fake_read_mapping.parquet")
    
    if os.path.exists(spiked_fastq) and os.path.exists(mapping_file):
        # Get file size
        file_size_mb = os.path.getsize(spiked_fastq) / (1024 * 1024)
        
        # Load mapping to get stats
        mapping_df = pl.read_parquet(mapping_file)
        fake_read_count = mapping_df.height
        
        summary_records.append({
            "dataset_name": dataset_name,
            "accession": accession,
            "description": description,
            "spiked_fastq": spiked_fastq,
            "mapping_file": mapping_file,
            "fake_read_count": fake_read_count,
            "file_size_mb": file_size_mb,
            "status": "ready"
        })
    else:
        summary_records.append({
            "dataset_name": dataset_name,
            "accession": accession,
            "description": description,
            "spiked_fastq": spiked_fastq if os.path.exists(spiked_fastq) else "missing",
            "mapping_file": mapping_file if os.path.exists(mapping_file) else "missing",
            "fake_read_count": 0,
            "file_size_mb": 0,
            "status": "incomplete"
        })

summary_df = pl.DataFrame(summary_records)

# Save summary
summary_file = os.path.join(test_data_dir, "dataset_preparation_summary.parquet")
summary_df.write_parquet(summary_file)

print("Dataset Preparation Summary:")
print(summary_df)
print(f"\nSummary saved to: {summary_file}")

Dataset Preparation Summary:
shape: (4, 8)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┐
│ dataset_na ┆ accession  ┆ descriptio ┆ spiked_fas ┆ mapping_fi ┆ fake_read_ ┆ file_size ┆ status │
│ me         ┆ ---        ┆ n          ┆ tq         ┆ le         ┆ count      ┆ _mb       ┆ ---    │
│ ---        ┆ str        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---       ┆ str    │
│ str        ┆            ┆ str        ┆ str        ┆ str        ┆ i64        ┆ f64       ┆        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╪═══════════╪════════╡
│ viral_rive ┆ SRR1109776 ┆ Concentrat ┆ /clusterfs ┆ /clusterfs ┆ 1393950    ┆ 0.000027  ┆ ready  │
│ r          ┆ 8          ┆ ed viral   ┆ /jgi/scrat ┆ /jgi/scrat ┆            ┆           ┆        │
│            ┆            ┆ RNA of     ┆ ch/science ┆ ch/science ┆            ┆           ┆        │
│            ┆            ┆ rive…      ┆ …      
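
A `file_size_mb` of 0.000027 corresponds to a file of roughly 28 bytes, so an existence check alone is not enough to call a spiked FASTQ ready. A stricter check could count the records in the file and compare against the expected total. The sketch below is one way to do that with the standard library only; `count_fastq_records` and `spiked_file_is_complete` are hypothetical helpers, and the code assumes strict four-line FASTQ records.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ, assuming exactly 4 lines per record."""
    n_lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def spiked_file_is_complete(path, expected_records):
    """True only if the file decompresses cleanly and holds the expected records."""
    try:
        return count_fastq_records(path) == expected_records
    except (OSError, EOFError, ValueError):
        # Truncated or corrupt gzip streams and partial records all count as incomplete.
        return False
```

For the larger datasets a pure-Python line count is slow; the same comparison could be made from the read counts that the concatenation tool or a stats tool reports.
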

## Optional: Prepare entropy-masked contamination databases
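Create virus-masked versions of the rRNA, tRNA, mitochondrial, and plastid databases to test whether masking helps retain RNA virus reads during filtering. The RNA virus references (RVMT plus NCBI riboviruses) are first entropy-masked, then mapped to each contamination database, and the regions they align to are masked.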
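
To give a feel for what entropy masking does, the sketch below computes a normalized Shannon entropy over the k-mers in a sliding window and replaces low-entropy windows with `N`. It reuses the parameter values from the cell below (cutoff 0.6, k=4, window 24), but it is a simplified illustration and an assumption on my part: it is not BBDuk's exact entropy formula or windowing, and the function names are hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(window, k=4):
    """Shannon entropy of the k-mers in `window`, scaled to the range 0..1."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    if len(kmers) < 2:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Maximum possible entropy: every k-mer distinct, capped by the 4**k alphabet.
    max_entropy = math.log2(min(total, 4 ** k))
    return entropy / max_entropy


def mask_low_entropy(seq, cutoff=0.6, k=4, window=24):
    """Replace every window whose k-mer entropy is below `cutoff` with N."""
    masked = list(seq)
    for start in range(max(len(seq) - window + 1, 0)):
        if kmer_entropy(seq[start:start + window], k) < cutoff:
            masked[start:start + window] = "N" * window
    return "".join(masked)
```

Homopolymers and short tandem repeats score near 0 and get masked, while windows in which almost every k-mer is distinct score near 1 and are left alone.
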

In [None]:
# Create entropy-masked versions of contamination databases
# This tests whether masking shared regions with RNA viruses helps retention

# Check if RNA virus masked sequences exist
if not os.path.exists(rna_viruses_entropy_masked_path):
    logger.info("Creating entropy-masked RNA virus sequences")
    
    # Merge RVMT and NCBI ribovirus sequences
    combined_virus_fasta = os.path.join(masking_dir, "combined_rna_viruses.fasta")
    run_command_comp(
        base_cmd="cat",
        positional_args=[rvmt_fasta_path, ncbi_ribovirus_fasta_path],
        positional_args_location="end",
        params={},
        output_file=combined_virus_fasta,
        logger=logger,
    )
    
    # Apply entropy masking to RNA virus sequences
    bbduk(
        in1=combined_virus_fasta,
        out=rna_viruses_entropy_masked_path,
        entropy=0.6,
        entropyk=4,
        entropywindow=24,
        maskentropy=True,
        ziplevel=9,
    )
    
    logger.info(f"Created entropy-masked RNA virus sequences: {rna_viruses_entropy_masked_path}")
else:
    logger.info(f"Entropy-masked RNA virus sequences already exist: {rna_viruses_entropy_masked_path}")

# Now create masked versions of contamination databases
# These will mask regions that overlap with RNA viruses

contam_databases = {
    "rrna_silva": silva_masked,
    "rrna_ncbi": rrna_fasta_path,
    "trna": os.path.join(trna_dir, "tRNA_sequences_deduplicated_filtered.fasta"),
    "mito": os.path.join(reference_seqs, "mito_refseq", "combined_mito_refseq.fasta"),
    "plastid": os.path.join(reference_seqs, "plastid_refseq", "combined_plastid_refseq.fasta"),
}

# For each contamination database, create a virus-masked version
for db_name, db_path in contam_databases.items():
    if not os.path.exists(db_path):
        logger.warning(f"Database not found: {db_path}")
        continue
    
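    # Insert the suffix before the extension; chaining .replace(".fasta", ...).replace(".fa", ...)
    # would apply the suffix twice to ".fasta" paths
    db_root, db_ext = os.path.splitext(db_path)
    masked_output = f"{db_root}_virus_masked{db_ext}"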
    
    if os.path.exists(masked_output):
        logger.info(f"Virus-masked database already exists: {masked_output}")
        continue
    
    logger.info(f"Creating virus-masked version of {db_name}")
    
    # Use bbmask to mask regions matching RNA viruses
    # This requires bbmap alignment first
    temp_sam = db_path + ".temp.sam"
    
    # Map virus sequences to contamination database
    run_command_comp(
        base_cmd="bbmap.sh",
        positional_args=[
            f"ref={db_path}",
            f"in={rna_viruses_entropy_masked_path}",
            f"out={temp_sam}",
            f"threads={threads}",
            "minid=0.85",
            "maxindel=3",
            "ambiguous=all",
            "nodisk",
        ],
        positional_args_location="end",
        params={},
        logger=logger,
    )
    
    # Use bbmask to mask aligned regions
    bbmask(
        in_file=db_path,
        out=masked_output,
        sam=temp_sam,
        minkr=4,
        maxkr=8,
        minlen=40,
        minke=4,
        ziplevel=9,
    )
    
    # Clean up temp file
    if os.path.exists(temp_sam):
        os.remove(temp_sam)
    
    logger.info(f"Created virus-masked database: {masked_output}")

print("Entropy-masked contamination databases prepared")
print("Now ready for filtering experiments comparing masked vs unmasked databases")
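
The second masking step, hiding the regions of a contamination database that RNA virus sequences align to, comes down to replacing aligned intervals with `N`. The sketch below shows that operation on a single sequence. It assumes the intervals have already been extracted from the alignment as 0-based, end-exclusive coordinates; `mask_intervals` is a hypothetical helper and not how bbmask is implemented.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace each (start, end) interval in `seq` with `mask_char`.

    Intervals are 0-based and end-exclusive; overlapping or out-of-range
    intervals are tolerated and clipped to the sequence bounds.
    """
    masked = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = mask_char
    return "".join(masked)


# Two overlapping virus alignments on a toy reference sequence.
toy_masked = mask_intervals("ACGTACGTAC", [(2, 5), (4, 7)])
```

Reads that originate from the masked regions can no longer match the contamination database, which is the mechanism by which masking is expected to improve RNA virus retention.
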


## Next Steps

The prepared datasets are now ready for filtering experiments in the companion notebook [trrna_filter.ipynb](trrna_filter.ipynb).

**Summary of outputs:**
- Spiked FASTQ files with fake RNA virus reads (<10% spike-in)
- Read ID mapping files for tracking fake reads through filtering
- Entropy-masked contamination databases (optional, for comparison)
- Initial statistics for all datasets

**Filtering experiments to perform:**
1. Filter with different rRNA/tRNA/mtDNA combinations
2. Compare masked vs unmasked contamination databases
3. Track retention/removal of fake RNA virus reads
4. Measure filtering time and resource usage
5. Analyze taxonomic distribution of removed reads

The goal is to optimize contamination removal while maximizing RNA virus retention.
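
Tracking retention (experiment 3) can be sketched as a comparison between the saved read mapping and the read headers that survive a filtering run. The example below uses plain dictionaries and sets to stay self-contained; in the notebook the mapping would come from the `*_fake_read_mapping.parquet` files. The function name and the toy headers are illustrative only.

```python
from collections import Counter


def retention_by_contig(read_to_contig, surviving_headers):
    """Fraction of spiked fake reads per source contig that survived filtering.

    read_to_contig: dict of fake read header -> source contig name
    surviving_headers: iterable of read headers present after filtering
    """
    spiked = Counter(read_to_contig.values())
    kept = Counter(
        read_to_contig[header]
        for header in set(surviving_headers)
        if header in read_to_contig  # ignore real (non-fake) reads
    )
    return {contig: kept.get(contig, 0) / n for contig, n in spiked.items()}


# Toy example: contig A loses one of two reads, contig B keeps its only read.
toy_mapping = {"r1": "A", "r2": "A", "r3": "B"}
toy_retention = retention_by_contig(toy_mapping, ["r1", "r3", "host_read_1"])
```

Comparing these per-contig fractions between masked and unmasked databases would show which virus groups benefit from masking.
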
