In [2]:
%%bash

hashFrag -h

usage: hashFrag [-h]
                {blastn_module,filter_candidates_module,filter_test_split_module,stratify_test_split_module,identify_homologous_groups_module,create_orthogonal_splits_module,filter_existing_splits,stratify_test_split,create_orthogonal_splits}
                ...

hashFrag is a tool developed to mitigate the impacts of homology-based data leakage in sequence-to-expression models. By identifying homology (based on pairwise alignment scores) in a sequence dataset, this tool can be used to
filter homologous sequences spanning existing train-test splits (e.g., chromosomal splits), stratify a test split according to different levels of homology, or create homology-aware train-test splits.

positional arguments:
  {blastn_module,filter_candidates_module,filter_test_split_module,stratify_test_split_module,identify_homologous_groups_module,create_orthogonal_splits_module,filter_existing_splits,stratify_test_split,create_orthogonal_splits}
    blastn_module       A wrapper s

# Introduction

> This notebook covers the case where users already have train-test splits and want to identify and mitigate data leakage caused by sequence homology shared across the splits.

The basic workflow is demonstrated on two data splits from a subsampled MPRA dataset (K562): an 8,000-sequence train split and a 2,000-sequence test split (both provided in the `data` directory).

hashFrag is a command-line tool. This notebook provides example calls and an explanation of each step in the process.

# Section 0: Basic usage - full pipeline command

In [3]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_train_split.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work

WORD_SIZE=7
MAX_TARGET_SEQS=8000 # size of train dataset
E_VALUE=100
THRESHOLD=60

hashFrag filter_existing_splits \
--train_fasta_path $TRAIN_FASTA_PATH \
--test_fasta_path $TEST_FASTA_PATH \
--word_size $WORD_SIZE \
--max_target_seqs $MAX_TARGET_SEQS \
--e_value $E_VALUE \
--output_dir $WORK_DIR \
--threshold $THRESHOLD

2025-01-01 05:42:47 - pipeline - INFO - Initializing `filter_existing_splits` pipeline.

2025-01-01 05:42:47 - blastn_module - INFO - Calling module...
2025-01-01 05:42:47 - blastn_module - INFO - Train and test FASTA files detected. Computing pairwise BLAST comparisons across splits...
2025-01-01 05:42:51 - blastn_module - INFO - BLASTn output: 

Building a new DB, current time: 01/01/2025 05:42:51
New DB name:   /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/hashFrag.blastdb
New DB title:  hashFrag
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 8000 sequences in 0.318449 seconds.



2025-01-01 05:42:51 - blastn_module - INFO - BLAST DataBase construction finished and written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/hashFrag.blastdb
2025-01-01 05:44:36 - blastn_module - INFO - BLASTn process finished and written to: /home/bret

# Section 1: Identifying candidate similar sequences

When user-derived train-test splits are provided, comparisons are constrained to pairs of sequences across splits. The process of identifying candidate pairs of similar sequences involves first creating a BLAST database of sequences in the train split, and then querying each test split sequence against the database. The BLASTn algorithm returns pairwise matches that represent potential cases of homology. 

Successful identification of cases of homology is paramount to effectively mitigating homology-based data leakage. As such, we configure the BLASTn parameters to maximize recall, even at the expense of more false positives. Here we consider the following parameters of BLASTn:

* word_size: smaller word sizes result in more exact word matches being found between the query and sequences in the database, leading to more alignment score calculations being initialized.
* max_target_seqs: set to the size of the database to remove any constraints and allow all possible candidate sequences to be returned for a given query.
* evalue: the E-value is the expected number of alignments with an equal or better score that would occur by chance in a database of this size (lower values indicate matches that are less likely to be due to chance). Raising the E-value threshold returns less stringent matches that could be due to chance.
* dust: turning dust off means that low-complexity regions (e.g., repetitive sequences) are no longer masked/filtered out.
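
To make the parameter list above concrete, here is a minimal Python sketch that assembles a recall-oriented `blastn` call as an argument list. The option names (`-word_size`, `-max_target_seqs`, `-evalue`, `-dust`) are standard BLAST+ options, but the function name, the file names, and the tabular output format (`-outfmt 6`) are assumptions made for illustration. The exact command that hashFrag issues internally may differ.

```python
import shlex


def build_blastn_command(query_fasta, db_path, out_path,
                         word_size=7, max_target_seqs=8000, evalue=100):
    """Assemble a recall-oriented blastn call as an argument list.

    Hypothetical helper for illustration; not part of hashFrag.
    """
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", str(db_path),
        "-out", str(out_path),
        "-outfmt", "6",                              # tabular output (assumed)
        "-word_size", str(word_size),                # small seeds: more candidates
        "-max_target_seqs", str(max_target_seqs),    # size of the train database
        "-evalue", str(evalue),                      # permissive E-value cutoff
        "-dust", "no",                               # keep low-complexity regions
    ]


# Placeholder file names; substitute your own paths.
cmd = build_blastn_command("test.fa", "hashFrag.blastdb", "hashFrag.blastn.out")
print(shlex.join(cmd))
```

Building the command as a list (rather than a single string) means it can be passed to `subprocess.run` without shell quoting issues.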

In [4]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_train_split.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work

WORD_SIZE=7
MAX_TARGET_SEQS=8000 # size of train dataset
E_VALUE=100

hashFrag blastn_module \
--train_fasta_path $TRAIN_FASTA_PATH \
--test_fasta_path $TEST_FASTA_PATH \
-w $WORD_SIZE \
-m $MAX_TARGET_SEQS \
-e $E_VALUE \
--blastdb_label "hashFrag" \
-o $WORK_DIR

2025-01-01 05:45:00 - blastn_module - INFO - Calling module...
2025-01-01 05:45:00 - blastn_module - INFO - Train and test FASTA files detected. Computing pairwise BLAST comparisons across splits...
2025-01-01 05:45:00 - blastn_module - INFO - Existing BLAST database found. Path: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/hashFrag.blastdb
	skipping `makeblastdb` call.
2025-01-01 05:45:00 - blastn_module - INFO - Existing BLAST results file found. Path: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/hashFrag.blastn.out
	skipping `blastn` call.
2025-01-01 05:45:00 - blastn_module - INFO - Module execution completed.



# Section 2: Filter false-positives based on a defined threshold

The next step filters out candidate pairs whose alignment scores fall below the specified threshold. hashFrag offers two modes, depending on which alignment score is used.

* `hashFrag-lightning` is the faster mode, in which the alignment score is computed from the BLAST output file. BLASTn is a heuristic method whose alignment scores were found to correlate highly with the optimal alignment scores; however, it underestimates moderate levels of homology, which leads to slightly lower recall.
* `hashFrag-pure` is the slower but more comprehensive mode, based on the optimal Smith-Waterman local alignment scores between pairs of sequences. Computing the optimal alignment scores adds cost to the filtering step.

An alignment score threshold of 60 was determined to be appropriate for a sequence length of 200 bp, based on analyses comparing alignment scores between dinucleotide-shuffled (i.e., random) and genomic nucleotide sequences.

## Section 2.1: Lightning mode (Default behavior)

Candidate pairs are subjected to filtering based on the alignment score calculated by the BLASTn algorithm.

In [5]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work
INPUT_PATH=$WORK_DIR/hashFrag.blastn.out
METHOD=lightning
THRESHOLD=60

hashFrag filter_candidates_module -m $METHOD -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

2025-01-01 05:45:04 - filter_candidates_module - INFO - Calling module...
2025-01-01 05:45:04 - filter_candidates_module - INFO - Filtering based on corrected BLAST alignment scores (lightning mode).
2025-01-01 05:45:04 - filter_candidates_module - INFO - Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/hashFrag_lightning.similar_pairs.tsv.gz
2025-01-01 05:45:04 - filter_candidates_module - INFO - Module execution completed.



## Section 2.2: Pure mode (optional)

To limit memory usage, we start by partitioning the BLAST output file into chunks of a fixed number of lines.

In [6]:
%%bash

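SPLIT_SIZE=100000 # number of lines per partition
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work
BLAST_PATH=$WORK_DIR/hashFrag.blastn.out
BLAST_DIR=$WORK_DIR/blast_partitions

# Strip the ".out" suffix to get the partition file prefix (hashFrag.blastn)
LABEL=$( basename -s ".out" "$BLAST_PATH" )

mkdir -p "$BLAST_DIR"
cd "$BLAST_DIR"

# Write partitions with 4-letter suffixes (aaaa, aaab, ...); --additional-suffix requires GNU split
split -l $SPLIT_SIZE -a 4 --additional-suffix=.tsv "$BLAST_PATH" ${LABEL}.partition_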

The bash script below calls a custom Python script that computes pairwise Smith-Waterman local alignment scores for the candidate pairs of sequences identified by the BLASTn algorithm. In principle, any scoring metric of interest could be substituted here.
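
For readers unfamiliar with Smith-Waterman scoring, the following is a minimal, unoptimized sketch of the local alignment score computation. The scoring parameters (match, mismatch, and a linear gap penalty) and the function name are assumptions chosen for illustration; they are not necessarily the values or the implementation used by the script in `src/external`.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming matrix, since only the score (not the alignment) is needed.
    All scoring parameters here are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                           # start a new local alignment
                prev[j - 1] + substitution,  # extend with match/mismatch
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Identical 4-mers score 4 under match=1; unrelated sequences score 0.
print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("AAAA", "TTTT"))
```

The runtime is proportional to the product of the two sequence lengths, which is why the pure mode restricts this computation to the candidate pairs returned by BLASTn rather than all possible pairs.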

Each output file of the scoring script is expected to be tab-delimited with three columns: the query sequence ID, the target sequence ID, and their alignment score:
```
seq1    seq2    100
seq3    seq4    30
seq5    seq6    65
```
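
As a rough illustration of how a file in this format can be filtered against a threshold, the sketch below parses the three-column rows and keeps pairs whose score meets the cutoff. The function name is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption; hashFrag's `filter_candidates_module` may handle the boundary differently.

```python
import csv
import io


def filter_pairs(lines, threshold=60.0):
    """Keep (query, target, score) rows whose score meets the threshold.

    `lines` is any iterable of tab-delimited text lines. The inclusive
    comparison is an assumption made for this sketch.
    """
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        query, target, score = row[0], row[1], float(row[2])
        if score >= threshold:
            kept.append((query, target, score))
    return kept


# The three example rows shown above, with a threshold of 60.
example = io.StringIO("seq1\tseq2\t100\nseq3\tseq4\t30\nseq5\tseq6\t65\n")
similar = filter_pairs(example, threshold=60)
print(similar)  # [('seq1', 'seq2', 100.0), ('seq5', 'seq6', 65.0)]
```

For a gzipped file, the same function can be given the handle returned by `gzip.open(path, "rt")`.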

In [7]:
%%bash

cd ../src/external

# NOTE: this is the concatenation of the train and test FASTA files
FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

for PARTITIONED_BLAST_PATH in $BLAST_DIR/*.blastn.partition_*.tsv
do
    echo $PARTITIONED_BLAST_PATH
    bash compute_blast_candidate_SW_scores.sh $FASTA_PATH $PARTITIONED_BLAST_PATH
done

/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/blast_partitions/hashFrag.blastn.partition_aaaa.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/blast_partitions/hashFrag.blastn.partition_aaab.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/blast_partitions/hashFrag.blastn.partition_aaac.tsv


The SW scores for candidate pairs of sequences can subsequently be concatenated into a single `.tsv` file.

In [15]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

zcat $BLAST_DIR/*.pairwise_scores.tsv.gz | gzip > $WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz

zcat $WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz | head -n 10

BCL11A_1238	peak21149	16.0
BCL11A_1238	peak79031	13.0
BCL11A_1238	peak4329	16.0
BCL11A_1238	peak7720	13.0
BCL11A_1238	peak49476_Reversed	13.0
BCL11A_1238	peak52197_Reversed	14.0
BCL11A_1238	peak42571	15.0
BCL11A_1238	peak29196	13.0
BCL11A_1238	peak49119_Reversed	14.0
BCL11A_1238	peak69717	14.0


Rather than using the heuristic alignment scores provided by the BLASTn algorithm, we can filter false positives based on the SW alignment scores. Make sure to set the method (`-m`) to `pure`.

In [9]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work
INPUT_PATH=$WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz
METHOD=pure

THRESHOLD=60

hashFrag filter_candidates_module -m $METHOD -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

2025-01-01 05:47:04 - filter_candidates_module - INFO - Calling module...
2025-01-01 05:47:04 - filter_candidates_module - INFO - Filtering based on precomputed alignment scores (pure mode).
2025-01-01 05:47:04 - filter_candidates_module - INFO - Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/hashFrag_pure.similar_pairs.tsv.gz
2025-01-01 05:47:04 - filter_candidates_module - INFO - Module execution completed.



# Section 3: Use Case(s)

## Filter test split sequences that exhibit homology with any sequences in the train split

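We show this process for both the `hashFrag-lightning` and `hashFrag-pure` filtering methods. Note that both calls write to the same output path (`example_test_split.filtered.fa.gz`), so the second run overwrites the output of the first.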
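
Conceptually, this use case removes every test sequence that appears in at least one similar pair. The sketch below illustrates the idea on plain ID lists. The function name and the example IDs are hypothetical, and the sketch assumes that train and test IDs do not overlap, so an ID can be flagged regardless of which column it appears in. It is not the implementation of `filter_test_split_module`.

```python
def filter_test_ids(test_ids, similar_pairs):
    """Drop test IDs that occur in any similar (query, target, score) pair.

    IDs from both columns are collected, so the result does not depend on
    whether the test sequence is stored as the query or the target. This
    assumes train and test IDs are disjoint.
    """
    flagged = set()
    for query, target, _score in similar_pairs:
        flagged.add(query)
        flagged.add(target)
    kept = [seq_id for seq_id in test_ids if seq_id not in flagged]
    n_removed = len(test_ids) - len(kept)
    return kept, n_removed


# Hypothetical IDs: "test_2" is homologous to a train sequence and is removed.
kept_ids, n_removed = filter_test_ids(
    ["test_1", "test_2", "test_3"],
    [("test_2", "train_A", 80.0)],
)
print(kept_ids, n_removed)  # ['test_1', 'test_3'] 1
```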

In [10]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_train_split.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work

HITS_PATH=$WORK_DIR/hashFrag_lightning.similar_pairs.tsv.gz

hashFrag filter_test_split_module --train_fasta_path $TRAIN_FASTA_PATH --test_fasta_path $TEST_FASTA_PATH --hits_path $HITS_PATH

2025-01-01 05:47:36 - filter_test_split_module - INFO - Calling module...
2025-01-01 05:47:36 - filter_test_split_module - INFO - 199 sequences filtered from test split.
2025-01-01 05:47:37 - filter_test_split_module - INFO - Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.filtered.fa.gz
2025-01-01 05:47:37 - filter_test_split_module - INFO - Module execution completed.



In [11]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_train_split.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work

HITS_PATH=$WORK_DIR/hashFrag_pure.similar_pairs.tsv.gz

hashFrag filter_test_split_module --train_fasta_path $TRAIN_FASTA_PATH --test_fasta_path $TEST_FASTA_PATH --hits_path $HITS_PATH

2025-01-01 05:47:40 - filter_test_split_module - INFO - Calling module...
2025-01-01 05:47:40 - filter_test_split_module - INFO - 201 sequences filtered from test split.
2025-01-01 05:47:40 - filter_test_split_module - INFO - Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/example_test_split.filtered.fa.gz
2025-01-01 05:47:40 - filter_test_split_module - INFO - Module execution completed.

