# `help` commands for general usage

In [1]:
%%bash

hashFrag -h

usage: hashFrag [-h]
                {blastn_module,filter_candidates_module,filter_test_split_module,stratify_test_split_module,identify_homologous_groups_module,create_orthogonal_splits_module,filter_existing_splits,stratify_test_split,create_orthogonal_splits}
                ...

hashFrag is a tool developed to mitigate the impacts of homology-based data leakage in sequence-to-expression models. By identifying homology (based on
pairwise alignment scores) in a sequence dataset, this tool can be used to filter homologous sequences spanning existing train-test splits (e.g.,
chromosomal splits), stratify a test split according to different levels of homology, or create homology-aware train-test splits.

positional arguments:
  {blastn_module,filter_candidates_module,filter_test_split_module,stratify_test_split_module,identify_homologous_groups_module,create_orthogonal_splits_module,filter_existing_splits,stratify_test_split,create_orthogonal_splits}
    blastn_module       A wrapper s

In [2]:
%%bash

hashFrag create_orthogonal_splits -h

usage: hashFrag create_orthogonal_splits [-h] [-f FASTA_PATH] [-w WORD_SIZE] [-g GAPOPEN] [-x GAPEXTEND] [-p PENALTY] [-r REWARD] [-m MAX_TARGET_SEQS]
                                         [-e E_VALUE] [-d DUST] [-b BLASTDB_ARGS] [--blastdb_label BLASTDB_LABEL] [-B BLASTN_ARGS] [-T THREADS] -t
                                         THRESHOLD [--p_train P_TRAIN] [--p_test P_TEST] [-n N_SPLITS] [-s SEED] [--force] [-o OUTPUT_DIR]

Execute the full workflow of commands to create homology-aware train-test splits. This involves identifying identifying pairs of sequences sharing
similarities with BLAST, filtering candidates based on a specified threshold, identifying all the different subgroups of sequences exhibiting a distinct
case of homology, and creating train-test splits with no leakage.

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA_PATH, --fasta_path FASTA_PATH
                        Input FASTA file containing all sequences in the datas

# Section 0: Introduction

> This notebook refers to the case when users have a nucleotide sequence dataset and are interested in creating homology-aware train-test data splits for sequence-to-expression models.

The basic workflow is demonstrated on a subsampled MPRA dataset (K562): a 10,000-sequence FASTA file (provided in the `data` directory).

This notebook serves as a walkthrough of calling the individual modules that make up the `hashFrag create_orthogonal_splits` command. The basic usage command only supports "lightning" mode, and can be called as follows:
```
FASTA_PATH=../data/example_full_dataset.fa.gz
WORK_DIR=../data/tutorial.create_orthogonal_splits.work

WORD_SIZE=7
MAX_TARGET_SEQS=10000 # size of dataset
EVALUE=100
THRESHOLD=60
N_SPLITS=10

hashFrag create_orthogonal_splits \
-f $FASTA_PATH \
-w $WORD_SIZE \
-m $MAX_TARGET_SEQS \
-e $EVALUE \
-t $THRESHOLD \
-n $N_SPLITS \
--force \
-o $WORK_DIR
```

# Section 1: Identifying candidate similar sequences

The process of identifying candidate pairs of similar sequences involves first creating a BLAST database of the dataset, and then querying each sequence against the database. The BLASTn algorithm returns pairwise matches that represent potential cases of homology. 

Successful identification of cases of homology is paramount to effectively mitigating homology-based data leakage. As such, we configure the BLASTn parameters to maximize recall, even at the expense of more false positives. Here we consider the following parameters of BLASTn:

* word_size: smaller word sizes result in more exact word matches being found between the query and sequences in the database, leading to more alignment score calculations being initialized.
* max_target_seqs: set to the size of the database to remove any constraints and allow all possible candidate sequences to be returned for a given query.
* evalue: the E-value is the number of alignments with an equal or better score that are expected to occur by chance (a lower value means the alignment is less likely to be a chance match). Increasing the E-value threshold also returns less stringent matches that could be due to chance.
* dust: by setting dust off, low-complexity regions (e.g., repetitive sequences) are no longer masked/filtered out.
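
To make these four settings concrete, the sketch below assembles the kind of BLAST+ command lines they map onto. It assumes the standard `makeblastdb` and `blastn` executables and their usual options. The helper name `build_blast_commands`, the file paths, the uncompressed FASTA input, and the tabular output format (`-outfmt 6`) are illustrative assumptions and not necessarily what `hashFrag blastn_module` runs internally.

```python
def build_blast_commands(fasta_path, db_path, out_path,
                         word_size=7, max_target_seqs=10000,
                         evalue=100, dust="no"):
    """Return (makeblastdb_cmd, blastn_cmd) as argument lists (hypothetical helper)."""
    makeblastdb_cmd = [
        "makeblastdb",
        "-in", fasta_path,          # assumed to be an uncompressed FASTA file
        "-dbtype", "nucl",
        "-out", db_path,
    ]
    blastn_cmd = [
        "blastn",
        "-query", fasta_path,       # every sequence is queried against the database
        "-db", db_path,
        "-word_size", str(word_size),              # small words seed more alignments
        "-max_target_seqs", str(max_target_seqs),  # no cap below the database size
        "-evalue", str(evalue),                    # permissive E-value cutoff
        "-dust", dust,                             # "no" disables low-complexity masking
        "-outfmt", "6",                            # assumption: tabular output
        "-out", out_path,
    ]
    return makeblastdb_cmd, blastn_cmd


db_cmd, query_cmd = build_blast_commands(
    "dataset.fa", "work/example.blastdb", "work/example.blastn.out"
)
print(" ".join(db_cmd))
print(" ".join(query_cmd))
```

The commands are only printed here; passing each list to `subprocess.run` would execute them if BLAST+ is installed.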

In [None]:
%%bash

FASTA_PATH=../data/example_full_dataset.fa.gz
WORK_DIR=../data/tutorial.create_orthogonal_splits.work

WORD_SIZE=7
MAX_TARGET_SEQS=10000 # size of dataset
E_VALUE=100

hashFrag blastn_module \
-f $FASTA_PATH \
-m $MAX_TARGET_SEQS \
-w $WORD_SIZE \
-e $E_VALUE \
--blastdb_label "hashFrag" \
-o $WORK_DIR

2025-01-11 05:03:59 - blastn_module - INFO - Calling module...
2025-01-11 05:03:59 - blastn_module - INFO - One FASTA files detected. Computing pairwise BLAST comparisons for all sequence-pairs...
2025-01-11 05:03:59 - blastn_module - INFO - Existing BLAST database found. Path: ../data/tutorial.create_orthogonal_splits.work/hashFrag.blastdb
2025-01-11 05:03:59 - blastn_module - INFO - Skipping `makeblastdb` call...
2025-01-11 05:03:59 - blastn_module - INFO - Existing BLAST results file found. Path: ../data/tutorial.create_orthogonal_splits.work/hashFrag.blastn.out
2025-01-11 05:03:59 - blastn_module - INFO - Skipping `blastn` call.
2025-01-11 05:03:59 - blastn_module - INFO - Module execution completed.



# Section 2: Filter false positives based on a defined threshold

The next step involves filtering out candidate pairings with alignment scores lower than the specified threshold. There are two different modes of hashFrag, depending on which alignment score is used.

* `hashFrag-lightning` is the faster version, where the alignment score is computed from the BLAST output file. BLASTn is a heuristic method whose alignment scores were found to correlate highly with the optimal alignment scores; however, it underestimates homology in some cases, which can lead to slightly worse recall.
* `hashFrag-pure` is the slower but more comprehensive method that is based on the optimal, Smith-Waterman local alignment scores between pairs of sequences. The calculation of optimal alignment scores incurs an additional cost to filtering.

An alignment score threshold of 60 was determined to be appropriate based on an analysis of alignment scores between dinucleotide-shuffled (i.e., random) sequences.
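
The thresholding step itself is simple, whichever score is used. The following is a minimal sketch, assuming candidate pairs are available as `(query_id, target_id, score)` tuples. The function name `filter_candidate_pairs`, the toy IDs, the choice to keep scores equal to the threshold, and the removal of self-matches are all assumptions made for illustration rather than confirmed hashFrag behaviour.

```python
def filter_candidate_pairs(scored_pairs, threshold=60):
    """Keep pairs whose score meets the threshold, ignoring self-matches."""
    kept = []
    for query_id, target_id, score in scored_pairs:
        if query_id == target_id:
            continue  # a sequence always aligns perfectly to itself
        if score >= threshold:  # assumption: a score equal to the threshold is kept
            kept.append((query_id, target_id, score))
    return kept


candidates = [
    ("seq1", "seq2", 100.0),
    ("seq3", "seq4", 30.0),
    ("seq5", "seq6", 65.0),
    ("seq1", "seq1", 200.0),  # self-match, dropped
]
print(filter_candidate_pairs(candidates))
# [('seq1', 'seq2', 100.0), ('seq5', 'seq6', 65.0)]
```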

## Section 2.1: Lightning mode (Default behaviour)

In [9]:
%%bash

WORK_DIR=../data/tutorial.create_orthogonal_splits.work
INPUT_PATH=$WORK_DIR/hashFrag.blastn.out
MODE=lightning
THRESHOLD=60

hashFrag filter_candidates_module -m $MODE -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

2025-01-11 05:04:09 - numexpr.utils - INFO - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-01-11 05:04:09 - numexpr.utils - INFO - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-01-11 05:04:09 - numexpr.utils - INFO - NumExpr defaulting to 16 threads.
2025-01-11 05:04:11 - filter_candidates_module - INFO - Calling module...
2025-01-11 05:04:11 - filter_candidates_module - INFO - Filtering based on corrected BLAST alignment scores (lightning mode).
2025-01-11 05:04:12 - filter_candidates_module - INFO - Filtered results written to: ../data/tutorial.create_orthogonal_splits.work/hashFrag_lightning.similar_pairs.tsv.gz
2025-01-11 05:04:12 - filter_candidates_module - INFO - Module execution completed.



## Section 2.2: Pure mode (optional)

To limit memory usage, we'll start by partitioning the BLAST output file into chunks of a fixed number of lines.

In [18]:
%%bash

WORK_DIR=../data/tutorial.create_orthogonal_splits.work

cd $WORK_DIR
BLAST_PATH=$PWD/hashFrag.blastn.out
BLAST_DIR=$PWD/blast_partitions
LABEL=$( basename -s ".out" $BLAST_PATH )

mkdir -p $BLAST_DIR
cd $BLAST_DIR

SPLIT_SIZE=100000

split -l $SPLIT_SIZE -a 4 --additional-suffix=.tsv $BLAST_PATH ${LABEL}.partition_

ls -thor $BLAST_DIR

total 4.5K
-rw-r----- 1 brett 584K Jan 11 05:11 hashFrag.blastn.partition_aaai.tsv
-rw-r----- 1 brett 7.3M Jan 11 05:11 hashFrag.blastn.partition_aaah.tsv
-rw-r----- 1 brett 7.3M Jan 11 05:11 hashFrag.blastn.partition_aaag.tsv
-rw-r----- 1 brett 7.3M Jan 11 05:11 hashFrag.blastn.partition_aaaf.tsv
-rw-r----- 1 brett 7.3M Jan 11 05:11 hashFrag.blastn.partition_aaae.tsv
-rw-r----- 1 brett 7.3M Jan 11 05:11 hashFrag.blastn.partition_aaad.tsv
-rw-r----- 1 brett 7.3M Jan 11 05:11 hashFrag.blastn.partition_aaac.tsv
-rw-r----- 1 brett 7.3M Jan 11 05:11 hashFrag.blastn.partition_aaab.tsv
-rw-r----- 1 brett 7.3M Jan 11 05:11 hashFrag.blastn.partition_aaaa.tsv


The next step runs a Bash script that calls a custom Python script to compute pairwise Smith-Waterman local alignment scores for the candidate pairs of sequences identified by the BLASTn algorithm. Note that this could feasibly be replaced with any scoring metric of interest.
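
For readers unfamiliar with Smith-Waterman, here is a minimal sketch of the scoring recurrence for a single pair of sequences. It is not the script shipped in `src/external`: the function name `smith_waterman_score`, the scoring parameters (match +1, mismatch -1, linear gap penalty -2), and the pure-Python implementation are assumptions chosen for readability, and a real run over millions of pairs would want an optimized library.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # shared "ACGT" -> 4
```

Only two rows of the table are kept at a time, since the score (rather than the alignment itself) is all that the filtering step needs.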

The expected output is a tab-delimited file with three columns: the query sequence ID, the target sequence ID, and their alignment score:
```
seq1    seq2    100
seq3    seq4    30
seq5    seq6    65
```

In [20]:
%%bash

DATA_DIR=../data
cd $DATA_DIR

FASTA_PATH=$PWD/example_full_dataset.fa.gz
WORK_DIR=$PWD/tutorial.create_orthogonal_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

cd ../src/external

for PARTITIONED_BLAST_PATH in $BLAST_DIR/*.blastn.partition_*.tsv
do
    echo $PARTITIONED_BLAST_PATH
    bash compute_blast_candidate_SW_scores.sh $FASTA_PATH $PARTITIONED_BLAST_PATH
done

/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.create_orthogonal_splits.work/blast_partitions/hashFrag.blastn.partition_aaaa.tsv
/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.create_orthogonal_splits.work/blast_partitions/hashFrag.blastn.partition_aaab.tsv
/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.create_orthogonal_splits.work/blast_partitions/hashFrag.blastn.partition_aaac.tsv
/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.create_orthogonal_splits.work/blast_partitions/hashFrag.blastn.partition_aaad.tsv
/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/tutorial.create_orthogonal_splits.work/blast_partitions/hashFrag.blastn.partition_aaae.tsv
/rshare1/ZETTAI_path_WA_slash_home_KARA/home/brett/work/OrthogonalTrai

The SW scores for candidate pairs of sequences can subsequently be concatenated into a single `.tsv` file.

In [21]:
%%bash

WORK_DIR=../data/tutorial.create_orthogonal_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

zcat $BLAST_DIR/*.pairwise_scores.tsv.gz | gzip > $WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz

zcat $WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz | head -n 10

BCL11A_1542	GATA1_9703	15.0
BCL11A_1542	peak63709_Reversed	15.0
BCL11A_1542	BCL11A_1542	200.0
BCL11A_1542	peak64077_Reversed	17.0
BCL11A_1542	peak9935	14.0
BCL11A_1542	peak83991_Reversed	15.0
BCL11A_1542	peak58146	16.0
BCL11A_1542	RBM38_1662	16.0
BCL11A_1542	peak31193	14.0
BCL11A_1542	HBA2_591_Reversed	16.0


Rather than using the heuristic alignment scores provided by the BLASTn algorithm, we can filter false positives based on SW alignment scores. Make sure to set the mode to `pure`.

In [22]:
%%bash

WORK_DIR=../data/tutorial.create_orthogonal_splits.work
INPUT_PATH=$WORK_DIR/hashFrag_pure.blastn_candidates.sw_scores.tsv.gz
MODE=pure
THRESHOLD=60

hashFrag filter_candidates_module -m $MODE -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

2025-01-11 05:22:14 - numexpr.utils - INFO - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-01-11 05:22:14 - numexpr.utils - INFO - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-01-11 05:22:14 - numexpr.utils - INFO - NumExpr defaulting to 16 threads.
2025-01-11 05:22:15 - filter_candidates_module - INFO - Calling module...
2025-01-11 05:22:15 - filter_candidates_module - INFO - Filtering based on precomputed alignment scores (pure mode).
2025-01-11 05:22:15 - filter_candidates_module - INFO - Filtered results written to: ../data/tutorial.create_orthogonal_splits.work/hashFrag_pure.similar_pairs.tsv.gz
2025-01-11 05:22:15 - filter_candidates_module - INFO - Module execution completed.



# Section 3: Determine groups of homology

There are often distinct groups of sequences exhibiting different cases of homology throughout the dataset. To determine such groups, we represent the "hits" (i.e., pairs of sequences with an alignment score greater than the threshold) as a sparse adjacency matrix. A graph can then be constructed, where nodes correspond to sequences and edges denote shared homology between the two sequences. Identifying groups of homology then amounts to finding the connected components of this graph (i.e., subgraphs that are disconnected from one another).

An efficient implementation for this graph-based task is provided in the `igraph` Python library.
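
To illustrate the idea without depending on `igraph`, the sketch below finds the same kind of groups with a plain union-find structure. This is a substitute technique shown for clarity, not the module's implementation; the function name `homologous_groups` and the toy sequence IDs are hypothetical.

```python
from collections import defaultdict


def homologous_groups(pairs):
    """Group sequence IDs into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Each homologous pair is an edge: merge the two components it connects.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(members) for members in groups.values())


hits = [("seqA", "seqB"), ("seqB", "seqC"), ("seqD", "seqE")]
print(homologous_groups(hits))
# [['seqA', 'seqB', 'seqC'], ['seqD', 'seqE']]
```

Note that `seqA` and `seqC` end up in the same group even though they were never reported as a direct hit, because both are linked through `seqB`.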

## Section 3.1: `lightning`-filtered homologous pairs (Default behaviour)

In [23]:
%%bash

WORK_DIR=../data/tutorial.create_orthogonal_splits.work
HITS_PATH=$WORK_DIR/hashFrag_lightning.similar_pairs.tsv.gz
OUTPUT_PATH=$WORK_DIR/homologous_groups.lightning.csv

hashFrag identify_homologous_groups_module -i $HITS_PATH -o $OUTPUT_PATH

2025-01-11 05:22:29 - numexpr.utils - INFO - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-01-11 05:22:29 - numexpr.utils - INFO - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-01-11 05:22:29 - numexpr.utils - INFO - NumExpr defaulting to 16 threads.
2025-01-11 05:22:30 - identify_homologous_groups_module - INFO - Calling module...
2025-01-11 05:22:30 - identify_homologous_groups_module - INFO - 1114 sequences exhibiting homology.
2025-01-11 05:22:30 - identify_homologous_groups_module - INFO - 90 distinct groups.
2025-01-11 05:22:30 - identify_homologous_groups_module - INFO - Homologous groups written to: ../data/tutorial.create_orthogonal_splits.work/homologous_groups.lightning.csv
2025-01-11 05:22:30 - identify_homologous_groups_module - INFO - Module execution completed.



## Section 3.2: `pure`-filtered homologous pairs

In [24]:
%%bash

WORK_DIR=../data/tutorial.create_orthogonal_splits.work
HITS_PATH=$WORK_DIR/hashFrag_pure.similar_pairs.tsv.gz
OUTPUT_PATH=$WORK_DIR/homologous_groups.pure.csv

hashFrag identify_homologous_groups_module -i $HITS_PATH -o $OUTPUT_PATH

2025-01-11 05:22:31 - numexpr.utils - INFO - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-01-11 05:22:31 - numexpr.utils - INFO - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-01-11 05:22:31 - numexpr.utils - INFO - NumExpr defaulting to 16 threads.
2025-01-11 05:22:31 - identify_homologous_groups_module - INFO - Calling module...
2025-01-11 05:22:31 - identify_homologous_groups_module - INFO - 1138 sequences exhibiting homology.
2025-01-11 05:22:31 - identify_homologous_groups_module - INFO - 92 distinct groups.
2025-01-11 05:22:31 - identify_homologous_groups_module - INFO - Homologous groups written to: ../data/tutorial.create_orthogonal_splits.work/homologous_groups.pure.csv
2025-01-11 05:22:31 - identify_homologous_groups_module - INFO - Module execution completed.



# Section 4: Use case(s)

Upon identifying groups of sequences exhibiting high similarity (i.e., homology), we can create train-test data splits from the graph-based grouping. Because sequences are nodes and edges denote whether two sequences were found to be homologous, each homologous group is a connected component of the graph, and keeping all members of a group on the same side of a split prevents homologous sequences from spanning train and test.
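
One way to turn groups into a leakage-free split is to treat every group as an indivisible unit. The following is a minimal sketch under stated assumptions: the function name `split_by_groups`, the greedy fill strategy, and the `p_test` and `seed` defaults are illustrative and are not taken from the hashFrag source.

```python
import random


def split_by_groups(all_ids, groups, p_test=0.2, seed=0):
    """Assign whole homology groups to train or test (illustrative sketch)."""
    grouped = {seq_id for group in groups for seq_id in group}
    # Sequences without any homolog form their own single-member unit.
    units = [list(g) for g in groups] + [[s] for s in all_ids if s not in grouped]
    rng = random.Random(seed)
    rng.shuffle(units)

    target = p_test * len(all_ids)  # desired number of test sequences
    train, test = [], []
    for unit in units:
        # Fill the test split first; a unit is never divided between splits.
        if len(test) + len(unit) <= target:
            test.extend(unit)
        else:
            train.extend(unit)
    return train, test


ids = [f"seq{i}" for i in range(20)]
groups = [["seq0", "seq1", "seq2"], ["seq3", "seq4"]]
train, test = split_by_groups(ids, groups, p_test=0.2, seed=1)
print(len(train), len(test))
```

Repeating the call with different seeds gives different splits, each of which keeps every homologous group intact.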

## Creating homology-aware data splits

Below we show how splits can be created based on the homologous groups identified from either the `hashFrag-lightning` or `hashFrag-pure` methods.

In [25]:
%%bash

FASTA_PATH=../data/example_full_dataset.fa.gz
WORK_DIR=../data/tutorial.create_orthogonal_splits.work
HOMOLOGY_PATH=$WORK_DIR/homologous_groups.lightning.csv # lightning mode (Default behaviour)
OUT_DIR=$WORK_DIR

hashFrag create_orthogonal_splits_module -f $FASTA_PATH -i $HOMOLOGY_PATH -n 10 -o $OUT_DIR

2025-01-11 05:22:49 - create_orthogonal_splits_module - INFO - Calling module...
2025-01-11 05:22:49 - create_orthogonal_splits_module - INFO - Creating 10 orthogonal splits in directory: ../data/tutorial.create_orthogonal_splits.work
2025-01-11 05:22:52 - create_orthogonal_splits_module - INFO - Module execution completed.



In [26]:
%%bash

FASTA_PATH=../data/example_full_dataset.fa.gz
WORK_DIR=../data/tutorial.create_orthogonal_splits.work
HOMOLOGY_PATH=$WORK_DIR/homologous_groups.pure.csv # pure mode
OUT_DIR=$WORK_DIR

hashFrag create_orthogonal_splits_module -f $FASTA_PATH -i $HOMOLOGY_PATH -n 10 -o $OUT_DIR

2025-01-11 05:22:52 - create_orthogonal_splits_module - INFO - Calling module...
2025-01-11 05:22:52 - create_orthogonal_splits_module - INFO - Creating 10 orthogonal splits in directory: ../data/tutorial.create_orthogonal_splits.work
2025-01-11 05:22:55 - create_orthogonal_splits_module - INFO - Module execution completed.

