# `help` commands for general usage

In [2]:
%%bash

hashFrag -h

usage: hashFrag [-h]
                {blastn_module,filter_candidates_module,filter_test_split_module,stratify_test_split_module,identify_homologous_groups_module,create_orthogonal_splits_module,filter_existing_splits,stratify_test_split,create_orthogonal_splits}
                ...

hashFrag is a tool developed to mitigate the impacts of homology-based data leakage in sequence-to-expression models. By identifying homology (based on pairwise alignment scores) in a sequence dataset, this tool can be used to
filter homologous sequences spanning existing train-test splits (e.g., chromosomal splits), stratify a test split according to different levels of homology, or create homology-aware train-test splits.

positional arguments:
  {blastn_module,filter_candidates_module,filter_test_split_module,stratify_test_split_module,identify_homologous_groups_module,create_orthogonal_splits_module,filter_existing_splits,stratify_test_split,create_orthogonal_splits}
    blastn_module       A wrapper s

In [2]:
%%bash

hashFrag filter_existing_splits -h

usage: hashFrag filter_existing_splits [-h] [--train_fasta_path TRAIN_FASTA_PATH] [--test_fasta_path TEST_FASTA_PATH] [-w WORD_SIZE] [-g GAPOPEN]
                                       [-x GAPEXTEND] [-p PENALTY] [-r REWARD] [-m MAX_TARGET_SEQS] [--xdrop_ungap XDROP_UNGAP] [--xdrop_gap XDROP_GAP]
                                       [--xdrop_gap_final XDROP_GAP_FINAL] [-e E_VALUE] [-d DUST] [-b BLASTDB_ARGS] [--blastdb_label BLASTDB_LABEL]
                                       [-B BLASTN_ARGS] [-T THREADS] [--force] -t THRESHOLD [-o OUTPUT_DIR]

Execute the full workflow of commands to filter homology spanning the input test splits. This involves identifying identifying pairs of sequences
sharing similarities with BLAST, filtering candidates based on a specified threshold, and filtering the input test split such that there exist no shared
homology between the train and test splits..

optional arguments:
  -h, --help            show this help message and exit
  --train_fasta_path TRA

# Section 0: Introduction

> This notebook refers to the case when users have existing train-test splits and are interested in identifying and mitigating data leakage attributed to shared sequence homology across splits.

The basic workflow is demonstrated on two data splits from a subsampled MPRA dataset (K562): an 8,000-sequence train split and a 2,000-sequence test split (provided in the `data` directory).

This notebook serves as a walkthrough of calling the individual modules that make up the `hashFrag filter_existing_splits` command. The basic usage command only supports "lightning" mode, and can be called as follows:
```
TRAIN_FASTA_PATH=../data/example_train_split.fa.gz
TEST_FASTA_PATH=../data/example_test_split.fa.gz
WORK_DIR=../data/tutorial.filter_existing_splits.work

WORD_SIZE=7
MAX_TARGET_SEQS=8000 # size of train dataset
E_VALUE=100
THRESHOLD=60

hashFrag filter_existing_splits \
--train_fasta_path $TRAIN_FASTA_PATH \
--test_fasta_path $TEST_FASTA_PATH \
--word_size $WORD_SIZE \
--max_target_seqs $MAX_TARGET_SEQS \
--e_value $E_VALUE \
--output_dir $WORK_DIR \
--threshold $THRESHOLD \
--force
```

# Section 1: Identifying candidate similar sequences

When user-derived train-test splits are provided, comparisons are constrained to pairs of sequences across splits. The process of identifying candidate pairs of similar sequences involves first creating a BLAST database of sequences in the train split, and then querying each test split sequence against the database. The BLASTn algorithm returns pairwise matches that represent potential cases of homology. 

Successful identification of cases of homology is paramount to effectively mitigate homology-based data leakage. As such, we configure the BLASTn parameters such that recall is maximized, even if it comes at the expense of increased false-positives. Here we consider the following parameters of BLASTn:

* word_size: smaller word sizes result in more exact word matches between the query and the sequences in the database, so more alignment score calculations are initiated.
* max_target_seqs: set to the size of the database to remove any constraint and allow all possible candidate sequences to be returned for a given query.
* evalue: the E-value is the number of alignments with an equal or better score that are expected to occur by chance (a lower value means the alignment is less likely to be a chance match). Raising the E-value threshold therefore also returns less stringent matches that could be due to chance.
* dust: turning dust off means low-complexity regions (e.g., repetitive sequences) are no longer masked or filtered out.
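To make the effect of `word_size` concrete, the toy sketch below counts how many words of a query occur exactly in a target sequence for different word lengths. It is only an illustration of exact word matching, not BLASTn's actual seeding and extension procedure; the function name `shared_words` and the two example sequences are invented for this example.

```python
def shared_words(query: str, target: str, word_size: int) -> int:
    """Count query positions whose word of length `word_size` occurs anywhere in `target`."""
    # All exact words of the requested length present in the target.
    target_words = {
        target[i:i + word_size] for i in range(len(target) - word_size + 1)
    }
    # Count query words that have at least one exact match in the target.
    return sum(
        query[i:i + word_size] in target_words
        for i in range(len(query) - word_size + 1)
    )


# Hypothetical sequences, used only to compare word sizes.
query = "ACGTACGTTAGC"
target = "TTACGTAGGCTAGC"
for k in (4, 7):
    print(f"word_size={k}: {shared_words(query, target, k)} matching words")
```

With these sequences the shorter word length finds several exact matches while the longer one finds none, which is why a small `word_size` favors recall at the cost of more candidate alignments to score.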

In [5]:
%%bash

TRAIN_FASTA_PATH=../data/example_train_split.fa.gz
TEST_FASTA_PATH=../data/example_test_split.fa.gz
WORK_DIR=../data/tutorial.filter_existing_splits.work

WORD_SIZE=7
MAX_TARGET_SEQS=8000 # size of train dataset
E_VALUE=100

hashFrag blastn_module \
--train_fasta_path $TRAIN_FASTA_PATH \
--test_fasta_path $TEST_FASTA_PATH \
-w $WORD_SIZE \
-m $MAX_TARGET_SEQS \
-e $E_VALUE \
--blastdb_label "hashFrag" \
-o $WORK_DIR

2025-01-11 06:08:59 - blastn_module - INFO - Calling module...
2025-01-11 06:08:59 - blastn_module - INFO - Train and test FASTA files detected. Computing pairwise BLAST comparisons across splits...
2025-01-11 06:08:59 - blastn_module - INFO - Existing BLAST database found. Path: ../data/tutorial.filter_existing_splits.work/hashFrag.blastdb
2025-01-11 06:08:59 - blastn_module - INFO - Skipping `makeblastdb` call...
2025-01-11 06:08:59 - blastn_module - INFO - Existing BLAST results file found. Path: ../data/tutorial.filter_existing_splits.work/hashFrag.blastn.out
2025-01-11 06:08:59 - blastn_module - INFO - Skipping `blastn` call.
2025-01-11 06:08:59 - blastn_module - INFO - Module execution completed.



# Section 2: Filter false-positives based on a defined threshold

The next step filters out candidate pairings whose alignment scores are lower than the specified threshold. hashFrag has two modes, which differ in the alignment score used for filtering.

* `hashFrag-lightning` is the faster mode, in which the alignment score is computed from the BLAST output file. BLASTn is a heuristic method whose alignment scores were found to correlate highly with the optimal alignment scores; however, it underestimates moderate levels of homology, which leads to slightly lower recall.
* `hashFrag-pure` is the slower but more comprehensive mode, based on the optimal Smith-Waterman local alignment scores between pairs of sequences. Computing the optimal alignment scores adds cost to the filtering step.
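For readers unfamiliar with Smith-Waterman scoring, here is a minimal sketch of the dynamic program that yields an optimal local alignment score. The scoring scheme (match +1, mismatch -1, linear gap penalty -2) and the function name `smith_waterman_score` are assumptions chosen for illustration; the script shipped with hashFrag may use different parameters (for example, affine gap penalties) and a faster implementation.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between `a` and `b` (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical sequences sharing the local segment "ACGT".
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The cost is proportional to the product of the two sequence lengths for every pair, which is why pure mode only scores the candidate pairs that BLASTn has already reported.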

An alignment score threshold of 60 was determined to be appropriate for a sequence length of 200 bp, based on analyses comparing alignment scores between dinucleotide-shuffled (i.e., random) and genomic nucleotide sequences.

## Section 2.1: Lightning mode (Default behavior)

Candidate pairs are subjected to filtering based on the alignment score calculated by the BLASTn algorithm.

In [6]:
%%bash

WORK_DIR=../data/tutorial.filter_existing_splits.work
INPUT_PATH=$WORK_DIR/hashFrag.blastn.out
METHOD=lightning
THRESHOLD=60

hashFrag filter_candidates_module -m $METHOD -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

2025-01-11 06:09:06 - numexpr.utils - INFO - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-01-11 06:09:06 - numexpr.utils - INFO - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-01-11 06:09:06 - numexpr.utils - INFO - NumExpr defaulting to 16 threads.
2025-01-11 06:09:06 - filter_candidates_module - INFO - Calling module...
2025-01-11 06:09:06 - filter_candidates_module - INFO - Filtering based on corrected BLAST alignment scores (lightning mode).
2025-01-11 06:09:06 - filter_candidates_module - INFO - Filtered results written to: ../data/tutorial.filter_existing_splits.work/hashFrag_lightning.similar_pairs.tsv.gz
2025-01-11 06:09:06 - filter_candidates_module - INFO - Module execution completed.



## Section 2.2: Pure mode (optional)

To limit memory usage, we start by partitioning the BLAST output file into chunks with a fixed number of lines.

In [9]:
%%bash

WORK_DIR=../data/tutorial.filter_existing_splits.work

cd $WORK_DIR
BLAST_PATH=$PWD/hashFrag.blastn.out
BLAST_DIR=$PWD/blast_partitions
LABEL=$( basename -s ".out" $BLAST_PATH )

mkdir -p $BLAST_DIR
cd $BLAST_DIR

SPLIT_SIZE=100000

split -l $SPLIT_SIZE -a 4 --additional-suffix=.tsv $BLAST_PATH ${LABEL}.partition_

ls -thor $BLAST_DIR

total 1.5K
-rw-r----- 1 brett 4.6M Jan 11 06:11 hashFrag.blastn.partition_aaac.tsv
-rw-r----- 1 brett 7.2M Jan 11 06:11 hashFrag.blastn.partition_aaab.tsv
-rw-r----- 1 brett 7.2M Jan 11 06:11 hashFrag.blastn.partition_aaaa.tsv
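The `--additional-suffix` option used above is specific to GNU `split`. On systems without it, the same partitioning can be sketched in Python as follows. The function name `split_by_lines` and its parameters are hypothetical; the output naming simply mirrors the `<label>.partition_<suffix>.tsv` pattern shown in the listing above.

```python
from itertools import islice, product
from pathlib import Path
from string import ascii_lowercase


def split_by_lines(path, out_dir, label, lines_per_file=100_000, suffix_len=4):
    """Write consecutive line chunks of `path` to `<label>.partition_<suffix>.tsv` files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Generates aaaa, aaab, aaac, ... in the same order as `split -a 4`.
    suffixes = ("".join(chars) for chars in product(ascii_lowercase, repeat=suffix_len))
    written = []
    with open(path) as handle:
        while True:
            # Read at most `lines_per_file` lines without loading the whole file.
            chunk = list(islice(handle, lines_per_file))
            if not chunk:
                break
            out_path = out_dir / f"{label}.partition_{next(suffixes)}.tsv"
            out_path.write_text("".join(chunk))
            written.append(out_path)
    return written
```

Calling `split_by_lines("hashFrag.blastn.out", "blast_partitions", "hashFrag.blastn")` from the work directory would be the rough equivalent of the `split` call in the cell above.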


The bash script below calls a custom Python script that computes pairwise Smith-Waterman local alignment scores for the candidate pairs of sequences identified by the BLASTn algorithm. Note that this could feasibly be replaced with any scoring metric of interest.

The expected output is a tab-delimited file with 3 columns: the query sequence ID, the target sequence ID, and their alignment score:
```
seq1    seq2    100
seq3    seq4    30
seq5    seq6    65
```
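As a rough sketch of how a file in this three-column format can be thresholded, the generator below keeps only the pairs whose score reaches a cutoff. The function name `filter_pairs` is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption; `hashFrag filter_candidates_module` may handle ties and input validation differently.

```python
import csv
import gzip


def filter_pairs(scores_path, threshold):
    """Yield (query_id, target_id, score) rows whose score is at least `threshold`."""
    # Expects a gzip-compressed, tab-delimited file with exactly three columns per row.
    with gzip.open(scores_path, "rt", newline="") as handle:
        for query_id, target_id, score in csv.reader(handle, delimiter="\t"):
            value = float(score)
            if value >= threshold:  # assumption: the threshold is inclusive
                yield query_id, target_id, value
```

Applied to the three example rows above with a threshold of 60, this would keep the `seq1`/`seq2` and `seq5`/`seq6` pairs and drop the `seq3`/`seq4` pair.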

In [16]:
%%bash

DATA_DIR=../data
cd $DATA_DIR

FASTA_PATH=$PWD/example_full_dataset.fa.gz
WORK_DIR=$PWD/tutorial.filter_existing_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

cd ../src/external

for PARTITIONED_BLAST_PATH in $BLAST_DIR/*.blastn.partition_*.tsv
do
    echo $PARTITIONED_BLAST_PATH
    bash compute_blast_candidate_SW_scores.sh $FASTA_PATH $PARTITIONED_BLAST_PATH
done

/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/blast_partitions/hashFrag.blastn.partition_aaaa.tsv
/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/blast_partitions/hashFrag.blastn.partition_aaab.tsv
/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.filter_existing_splits.work/blast_partitions/hashFrag.blastn.partition_aaac.tsv


The SW scores for candidate pairs of sequences can subsequently be concatenated into a single `.tsv` file.

In [17]:
%%bash

WORK_DIR=../data/tutorial.filter_existing_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

zcat $BLAST_DIR/*.pairwise_scores.tsv.gz | gzip > $WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz

zcat $WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz | head -n 10

BCL11A_1238	peak73841_Reversed	16.0
BCL11A_1238	peak42457	13.0
BCL11A_1238	peak82207_Reversed	21.0
BCL11A_1238	peak69717	14.0
BCL11A_1238	peak21149	16.0
BCL11A_1238	peak1501	15.0
BCL11A_1238	HBE1_1082	13.0
BCL11A_1238	peak16460_Reversed	15.0
BCL11A_1238	HBE1_7814	13.0
BCL11A_1238	peak62315	15.0


Rather than using the heuristic alignment scores provided by the BLASTn algorithm, we can filter false-positives based on SW alignment scores. Make sure to set the mode to `pure`.

In [18]:
%%bash

WORK_DIR=../data/tutorial.filter_existing_splits.work
INPUT_PATH=$WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz
METHOD=pure

THRESHOLD=60

hashFrag filter_candidates_module -m $METHOD -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

2025-01-11 06:17:28 - numexpr.utils - INFO - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-01-11 06:17:28 - numexpr.utils - INFO - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-01-11 06:17:28 - numexpr.utils - INFO - NumExpr defaulting to 16 threads.
2025-01-11 06:17:28 - filter_candidates_module - INFO - Calling module...
2025-01-11 06:17:28 - filter_candidates_module - INFO - Filtering based on precomputed alignment scores (pure mode).
2025-01-11 06:17:28 - filter_candidates_module - INFO - Filtered results written to: ../data/tutorial.filter_existing_splits.work/hashFrag_pure.similar_pairs.tsv.gz
2025-01-11 06:17:28 - filter_candidates_module - INFO - Module execution completed.



# Section 3: Use Case(s)

## Filter test split sequences that exhibit homology with any sequences in the train split

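We show this process for both the `hashFrag-lightning` and `hashFrag-pure` filtering methods. As the logs below show, both runs write to the same output path, so the second run overwrites the output of the first.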

In [19]:
%%bash

TRAIN_FASTA_PATH=../data/example_train_split.fa.gz
TEST_FASTA_PATH=../data/example_test_split.fa.gz
WORK_DIR=../data/tutorial.filter_existing_splits.work

HITS_PATH=$WORK_DIR/hashFrag_lightning.similar_pairs.tsv.gz

hashFrag filter_test_split_module --train_fasta_path $TRAIN_FASTA_PATH --test_fasta_path $TEST_FASTA_PATH --hits_path $HITS_PATH

2025-01-11 06:17:35 - numexpr.utils - INFO - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-01-11 06:17:35 - numexpr.utils - INFO - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-01-11 06:17:35 - numexpr.utils - INFO - NumExpr defaulting to 16 threads.
2025-01-11 06:17:35 - filter_test_split_module - INFO - Calling module...
2025-01-11 06:17:35 - filter_test_split_module - INFO - 199 sequences filtered from test split.
2025-01-11 06:17:36 - filter_test_split_module - INFO - Filtered results written to: ../data/example_test_split.filtered.fa.gz
2025-01-11 06:17:36 - filter_test_split_module - INFO - Module execution completed.



In [20]:
%%bash

TRAIN_FASTA_PATH=../data/example_train_split.fa.gz
TEST_FASTA_PATH=../data/example_test_split.fa.gz
WORK_DIR=../data/tutorial.filter_existing_splits.work

HITS_PATH=$WORK_DIR/hashFrag_pure.similar_pairs.tsv.gz

hashFrag filter_test_split_module --train_fasta_path $TRAIN_FASTA_PATH --test_fasta_path $TEST_FASTA_PATH --hits_path $HITS_PATH

2025-01-11 06:17:38 - numexpr.utils - INFO - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-01-11 06:17:38 - numexpr.utils - INFO - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-01-11 06:17:38 - numexpr.utils - INFO - NumExpr defaulting to 16 threads.
2025-01-11 06:17:38 - filter_test_split_module - INFO - Calling module...
2025-01-11 06:17:38 - filter_test_split_module - INFO - 201 sequences filtered from test split.
2025-01-11 06:17:39 - filter_test_split_module - INFO - Filtered results written to: ../data/example_test_split.filtered.fa.gz
2025-01-11 06:17:39 - filter_test_split_module - INFO - Module execution completed.
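Conceptually, this filtering step removes every test sequence that appears in at least one above-threshold pair. The sketch below illustrates that idea on in-memory data; the function name `filter_test_split` and the data structures are hypothetical, and it makes no claim about how `filter_test_split_module` is implemented internally.

```python
def filter_test_split(test_records, similar_pairs):
    """Drop test sequences that appear in any similar pair.

    test_records: dict mapping sequence ID -> sequence.
    similar_pairs: iterable of (id_a, id_b) pairs that passed the threshold.
    Returns (kept_records, removed_ids).
    """
    # Collect every ID seen in a pair, then keep only those that are test IDs,
    # so the result does not depend on which column holds the test sequence.
    paired_ids = {seq_id for pair in similar_pairs for seq_id in pair}
    removed = paired_ids & test_records.keys()
    kept = {seq_id: seq for seq_id, seq in test_records.items() if seq_id not in removed}
    return kept, removed


# Hypothetical IDs and sequences, for illustration only.
test_records = {"t1": "ACGT", "t2": "GGCC", "t3": "TTAA"}
pairs = [("t1", "train9"), ("train3", "t3")]
kept, removed = filter_test_split(test_records, pairs)
print(sorted(kept), sorted(removed))
```

Because a single homologous train sequence is enough to remove a test sequence, the number of filtered sequences can only grow as the threshold is lowered or as the scoring method recovers more pairs.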

