In [1]:
# import os
# os.environ['NUMEXPR_MAX_THREADS'] = '16'

In [2]:
%%bash

hashFrag -h

usage: hashFrag [-h]
                {blastn_module,filter_candidates_module,filter_test_split_module,stratify_test_split_module,identify_homologous_groups_module,create_orthogonal_splits_module,filter_existing_splits,stratify_test_split,create_orthogonal_splits}
                ...

hashFrag is a tool developed to mitigate the impacts of homology-based data leakage in sequence-to-expression models. By identifying homology (based on pairwise alignment scores) in a sequence dataset, this tool can be used to
filter homologous sequences spanning existing train-test splits (e.g., chromosomal splits), stratify a test split according to different levels of homology, or create homology-aware train-test splits.

positional arguments:
  {blastn_module,filter_candidates_module,filter_test_split_module,stratify_test_split_module,identify_homologous_groups_module,create_orthogonal_splits_module,filter_existing_splits,stratify_test_split,create_orthogonal_splits}
    blastn_module       A wrapper s

# Introduction

> This notebook covers the case where users have existing train-test splits and want to identify and mitigate data leakage caused by sequence homology shared across splits.

The basic workflow is demonstrated on two data splits from a subsampled MPRA dataset (K562): an 8,000-sequence train split and a 2,000-sequence test split (provided in the `data` directory).

hashFrag is a command-line tool. This notebook provides example calls and an explanation of each step in the process.

# Section 0: Basic usage - full pipeline command

In [3]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_train_split.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work

WORD_SIZE=7
MAX_TARGET_SEQS=8000 # size of train dataset
E_VALUE=100
STEP=10

hashFrag stratify_test_split \
--train_fasta_path $TRAIN_FASTA_PATH \
--test_fasta_path $TEST_FASTA_PATH \
--word_size $WORD_SIZE \
--max_target_seqs $MAX_TARGET_SEQS \
--e_value $E_VALUE \
--step $STEP \
--output_dir $WORK_DIR

2025-01-01 05:50:18 - pipeline - INFO - Initializing `stratify_test_split` pipeline.

2025-01-01 05:50:18 - blastn_module - INFO - Calling module...
2025-01-01 05:50:18 - blastn_module - INFO - Train and test FASTA files detected. Computing pairwise BLAST comparisons across splits...
2025-01-01 05:50:22 - blastn_module - INFO - BLASTn output: 

Building a new DB, current time: 01/01/2025 05:50:20
New DB name:   /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/hashFrag.blastdb
New DB title:  hashFrag
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 8000 sequences in 0.208322 seconds.



2025-01-01 05:50:22 - blastn_module - INFO - BLAST DataBase construction finished and written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/hashFrag.blastdb
2025-01-01 05:52:06 - blastn_module - INFO - BLASTn process finished and written to: /home/brett/work/Or

# Section 1: Identifying candidate similar sequences

When user-derived train-test splits are provided, comparisons are constrained to pairs of sequences across splits. The process of identifying candidate pairs of similar sequences involves first creating a BLAST database of sequences in the train split, and then querying each test split sequence against the database. The BLASTn algorithm returns pairwise matches that represent potential cases of homology. 

Successful identification of homology is paramount to effectively mitigating homology-based data leakage. We therefore configure the BLASTn parameters to maximize recall, even at the expense of more false positives. Here we consider the following BLASTn parameters:

* word_size: smaller word sizes result in more exact word matches between the query and the sequences in the database, so more alignment score calculations are initiated.
* max_target_seqs: set to the size of the database to remove any constraint and allow all possible candidate sequences to be returned for a given query.
* evalue: the E-value is the number of alignments with an equal or better score expected to occur by chance in a database of this size (lower values indicate more significant matches). Raising the E-value threshold returns less stringent matches that could be due to chance.
* dust: setting dust off means low-complexity regions (e.g., repetitive sequences) are no longer masked/filtered out.
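
To make these settings concrete, the sketch below assembles a comparable pair of BLAST+ command lines in Python. It is not hashFrag's internal implementation: the helper name `build_blast_commands`, the tabular output format (`-outfmt 6`), and the file names are assumptions made for illustration. The flags themselves (`-word_size`, `-max_target_seqs`, `-evalue`, `-dust`) are standard `blastn` options.

```python
from typing import List, Tuple


def build_blast_commands(
    train_fasta: str,
    test_fasta: str,
    db_path: str,
    out_path: str,
    word_size: int = 7,
    max_target_seqs: int = 8000,
    e_value: float = 100.0,
) -> Tuple[List[str], List[str]]:
    """Return (makeblastdb argv, blastn argv) without running them."""
    # Build a nucleotide database from the train split.
    makeblastdb_cmd = [
        "makeblastdb",
        "-in", train_fasta,
        "-dbtype", "nucl",
        "-out", db_path,
    ]
    # Query every test sequence against that database with permissive settings.
    blastn_cmd = [
        "blastn",
        "-query", test_fasta,
        "-db", db_path,
        "-word_size", str(word_size),
        "-max_target_seqs", str(max_target_seqs),
        "-evalue", str(e_value),
        "-dust", "no",      # do not mask low-complexity regions
        "-outfmt", "6",     # assumed tabular output, for illustration only
        "-out", out_path,
    ]
    return makeblastdb_cmd, blastn_cmd


# Hypothetical file names; nothing is executed here.
db_cmd, query_cmd = build_blast_commands(
    "train.fa", "test.fa", "hashFrag.blastdb", "hashFrag.blastn.out"
)
print(" ".join(db_cmd))
print(" ".join(query_cmd))
```

Passing either list to `subprocess.run(cmd, check=True)` would execute it, provided BLAST+ is installed. Note that `makeblastdb` typically expects uncompressed FASTA input, so gzipped files would need to be decompressed first.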

In [4]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_train_split.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work

WORD_SIZE=7
MAX_TARGET_SEQS=8000 # size of train dataset
E_VALUE=100

hashFrag blastn_module \
--train_fasta_path $TRAIN_FASTA_PATH \
--test_fasta_path $TEST_FASTA_PATH \
-w $WORD_SIZE \
-m $MAX_TARGET_SEQS \
-e $E_VALUE \
--blastdb_label "hashFrag" \
-o $WORK_DIR

2025-01-01 05:52:07 - blastn_module - INFO - Calling module...
2025-01-01 05:52:07 - blastn_module - INFO - Train and test FASTA files detected. Computing pairwise BLAST comparisons across splits...
2025-01-01 05:52:07 - blastn_module - INFO - Existing BLAST database found. Path: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/hashFrag.blastdb
	skipping `makeblastdb` call.
2025-01-01 05:52:07 - blastn_module - INFO - Existing BLAST results file found. Path: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/hashFrag.blastn.out
	skipping `blastn` call.
2025-01-01 05:52:07 - blastn_module - INFO - Module execution completed.



# Section 2: Use Case(s)

## Stratify test split based on homology

A potentially useful feature is stratifying the test split by each test sequence's maximum alignment score against all sequences in the train split. This helps in studying how homology affects model performance evaluation. The range of scores covered by each stratum is specified by the `step` parameter.
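
As a rough illustration of the idea (not hashFrag's actual implementation), the Python sketch below assigns each test sequence to a score bin of width `step` based on its maximum alignment score against the train split. The function name `stratify_by_max_score`, the score of 0 given to test sequences without any candidate match, and the half-open bin boundaries are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def stratify_by_max_score(
    test_ids: Iterable[str],
    scored_pairs: Iterable[Tuple[str, str, float]],
    step: int = 10,
) -> Dict[Tuple[int, int], List[str]]:
    """Group test IDs into [lower, lower + step) bins of their max score."""
    # Assumption: a test sequence with no candidate match gets a score of 0.
    max_score: Dict[str, float] = {seq_id: 0.0 for seq_id in test_ids}
    for query_id, _target_id, score in scored_pairs:
        if query_id in max_score and score > max_score[query_id]:
            max_score[query_id] = score

    strata: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for seq_id, score in max_score.items():
        lower = int(score // step) * step
        strata[(lower, lower + step)].append(seq_id)
    return dict(strata)


# Made-up IDs and scores, for demonstration only.
pairs = [
    ("test1", "trainA", 42.0),
    ("test1", "trainB", 57.0),
    ("test2", "trainA", 12.0),
]
strata = stratify_by_max_score(["test1", "test2", "test3"], pairs, step=10)
for bounds in sorted(strata):
    print(bounds, strata[bounds])
```

Evaluating a model separately on each stratum then shows whether performance rises with similarity to the training data.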

### `hashFrag-lightning` mode (Default behavior)

In [5]:
%%bash

TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work
INPUT_PATH=$WORK_DIR/hashFrag.blastn.out
MODE=lightning
STEP=10

LABEL=$( basename -s ".test.fa.gz" $TEST_FASTA_PATH )
OUTPUT_PATH=$WORK_DIR/hashFrag_lightning.stratified_test.tsv.gz

hashFrag stratify_test_split_module -f $TEST_FASTA_PATH -i $INPUT_PATH -m $MODE -s $STEP -o $OUTPUT_PATH

2025-01-01 05:52:08 - stratify_test_split_module - INFO - Calling module...
2025-01-01 05:52:08 - stratify_test_split_module - INFO - Stratifying based on corrected BLAST alignment scores (lightning mode).
2025-01-01 05:52:08 - stratify_test_split_module - INFO - Stratification results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/hashFrag_lightning.stratified_test.tsv.gz
2025-01-01 05:52:08 - stratify_test_split_module - INFO - Module execution completed.



### `hashFrag-pure` mode

In [6]:
%%bash

SPLIT_SIZE=100000
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work
BLAST_PATH=$WORK_DIR/hashFrag.blastn.out
BLAST_DIR=$WORK_DIR/blast_partitions

LABEL=$( basename -s ".out" $BLAST_PATH )

mkdir -p $BLAST_DIR
cd $BLAST_DIR

split -l $SPLIT_SIZE -a 4 --additional-suffix=.tsv $BLAST_PATH ${LABEL}.partition_

The following bash cell calls a custom Python script that computes pairwise Smith-Waterman (SW) local alignment scores for the candidate pairs of sequences identified by the BLASTn algorithm. Note that any other scoring metric of interest could feasibly be substituted.

The expected output is a tab-delimited file with three columns: the query sequence ID, the target sequence ID, and their alignment score:
```
seq1    seq2    100
seq3    seq4    30
seq5    seq6    65
```
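
For readers unfamiliar with the Smith-Waterman algorithm, here is a minimal pure-Python sketch of the local alignment score with a linear gap penalty. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function name `smith_waterman_score` are assumptions for illustration; the script in `src/external` may use different scoring parameters and a much faster implementation.

```python
def smith_waterman_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> int:
    """Return the best local alignment score between sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Emit one row in the three-column format described above (made-up sequences).
query_id, target_id = "seq1", "seq2"
score = smith_waterman_score("ACGTACGT", "TTACGTAA")
print(f"{query_id}\t{target_id}\t{score}")
```

Only two rows of the matrix are kept in memory because the score alone is needed, not the alignment itself.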

In [7]:
%%bash

cd ../src/external

# NOTE: this is the concatenation of the train and test FASTA files
FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work
BLAST_DIR=$WORK_DIR/blast_partitions

for PARTITIONED_BLAST_PATH in $BLAST_DIR/*.blastn.partition_*.tsv
do
    echo $PARTITIONED_BLAST_PATH
    bash compute_blast_candidate_SW_scores.sh $FASTA_PATH $PARTITIONED_BLAST_PATH
done

/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/blast_partitions/hashFrag.blastn.partition_aaaa.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/blast_partitions/hashFrag.blastn.partition_aaab.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/blast_partitions/hashFrag.blastn.partition_aaac.tsv


The SW scores for candidate pairs of sequences can subsequently be concatenated into a single `.tsv` file.

In [8]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work
BLAST_DIR=$WORK_DIR/blast_partitions

zcat $BLAST_DIR/*.pairwise_scores.tsv.gz | gzip > $WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz
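
Before stratifying, it can be useful to sanity-check the concatenated scores. The sketch below reads a gzipped three-column TSV and reports the maximum score per query sequence. The function name `max_score_per_query` is hypothetical, and the code assumes the file has no header row; the demo runs on a small temporary file with made-up values rather than on the tutorial data.

```python
import gzip
import os
import tempfile
from typing import Dict


def max_score_per_query(path: str) -> Dict[str, float]:
    """Map each query ID to its highest alignment score in a gzipped TSV."""
    best: Dict[str, float] = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            query_id, _target_id, score = line.rstrip("\n").split("\t")
            best[query_id] = max(best.get(query_id, float("-inf")), float(score))
    return best


# Demo on a tiny temporary file (made-up IDs and scores).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.sw_scores.tsv.gz")
    with gzip.open(demo_path, "wt") as handle:
        handle.write("seq1\tseq2\t100\nseq1\tseq4\t30\nseq5\tseq6\t65\n")
    demo_best = max_score_per_query(demo_path)

print(demo_best)
```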

Rather than relying on the heuristic alignment scores reported by the BLASTn algorithm, we can filter out false positives using the exact SW alignment scores. Make sure to set the mode to `pure`.

In [9]:
%%bash

TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work
INPUT_PATH=$WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz
MODE=pure
STEP=10

LABEL=$( basename -s ".test.fa.gz" $TEST_FASTA_PATH )
OUTPUT_PATH=$WORK_DIR/hashFrag_pure.stratified_test.tsv.gz

hashFrag stratify_test_split_module -f $TEST_FASTA_PATH -i $INPUT_PATH -m $MODE -s $STEP -o $OUTPUT_PATH


2025-01-01 05:54:46 - stratify_test_split_module - INFO - Calling module...
2025-01-01 05:54:46 - stratify_test_split_module - INFO - Stratifying based on precomputed alignment scores (pure mode).
2025-01-01 05:54:46 - stratify_test_split_module - INFO - Stratification results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.stratify_test_split.work/hashFrag_pure.stratified_test.tsv.gz
2025-01-01 05:54:46 - stratify_test_split_module - INFO - Module execution completed.

