In [1]:
import os
from pathlib import Path

In [2]:
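# `!export` runs in a throwaway subshell and does not persist, so extend PATH for this kernel instead.
os.environ["PATH"] += os.pathsep + "/home/brett/work/OrthogonalTrainValSplits/hashFrag/src"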

In [4]:
%%bash

hashFrag -h

usage: hashFrag [-h]
                {blastn,filter_false_positives,filter_existing_splits,stratify_test_split,identify_homologous_groups,create_orthogonal_splits}
                ...

positional arguments:
  {blastn,filter_false_positives,filter_existing_splits,stratify_test_split,identify_homologous_groups,create_orthogonal_splits}

optional arguments:
  -h, --help            show this help message and exit


# Introduction

This notebook provides a basic tutorial on the core functionality of hashFrag. Specifically, it covers the case where users have existing train-test splits and want to identify and mitigate data leakage caused by sequence homology shared across splits.

We work with two small MPRA datasets: an 8,000-sequence train split and a 2,000-sequence test split (provided in the `data` directory).

This notebook will encompass the following tasks:

1. `hashFrag blastn`: Identifying candidate pairs of sequences exhibiting similarity across data splits.
2. (Optional) For each candidate pairing, compute its optimal local alignment score ([Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)).
3. `hashFrag filter_false_positives`: Using the local alignment scores calculated in the previous step, filter false-positive candidates based on a specified threshold.
    + Note that if the previous step was NOT performed, false-positive candidates will be filtered based on the heuristic local alignment scores provided by the BLAST algorithm.
    + The appropriate threshold is not only dataset-dependent, but also depends heavily on the BLAST (and, if performed, Smith-Waterman) local alignment parameters. See the advanced tutorial (TODO) for an example of our recommended workflow to tune relevant parameters and select an appropriate threshold.
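4. `hashFrag filter_existing_splits`: Removing sequences from the test split that share homology with sequences in the train split.
5. `hashFrag stratify_test_split`: Stratifying the test split based on homology to the train split.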

hashFrag is a command line tool. This notebook serves as a resource to better understand each step in the process.

In [5]:
data_dir   = "/home/brett/work/OrthogonalTrainValSplits/hashFrag/data"
fasta_path = os.path.join(data_dir,"K562.sample_10000.fa.gz")
label      = os.path.basename(fasta_path).replace(".fa.gz","")
score_path = os.path.join(data_dir,f"{label}.pairwise_scores.csv.gz")
work_dir   = os.path.join(data_dir,f"{label}.hashFrag.existing_splits.work")
Path(work_dir).mkdir(parents=True,exist_ok=True)
print(work_dir)

blast_dir = os.path.join(work_dir,"blast_partitions")
Path(blast_dir).mkdir(parents=True,exist_ok=True)

/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work


# Section 1 - Identifying candidate similar sequences

The process of identifying candidate pairs of similar sequences involves first creating a BLAST database from the train split, and then querying each test sequence against that database, collecting all pairwise matches that represent potential cases of homology.

To effectively mitigate potential biases caused by homology-based data leakage, it is imperative that all cases of homology are identified. To that end, we configure the BLASTn call to maximize recall (this comes at the expense of more false positives, which are subsequently filtered). Here we consider the following parameters of BLASTn:

* word_size: smaller word sizes result in more exact word matches between the query and the sequences in the database, so more alignment score calculations are initiated.
* max_target_seqs: set to the size of the database to remove any constraints and allow all possible candidate sequences to be returned for a given query.
* evalue: the e-value is the expected number of alignments with an equal or better score that would occur by chance in a database of this size (a lower value means the match is less likely to be due to chance). Increasing the e-value threshold also returns matches that could be due to chance.
* dust: setting dust off means that low-complexity regions (e.g., repetitive sequences) are no longer masked/filtered out.
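
To make these four parameters concrete, the sketch below assembles a plain BLAST+ `blastn` argument list with the same settings. It is only an illustration: the file names are hypothetical, the tabular `-outfmt 6` is an assumption, and nothing here is confirmed to be the call that `hashFrag blastn` issues internally. The command is built but not executed.

```python
import shlex


def build_blastn_command(query_fasta, db_path, out_path,
                         word_size=7, max_target_seqs=8000,
                         evalue=100, dust="no"):
    """Assemble a high-recall BLASTn argument list (not executed here)."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db_path,
        "-out", out_path,
        "-outfmt", "6",                            # assumption: tabular output
        "-word_size", str(word_size),              # small words seed more alignments
        "-max_target_seqs", str(max_target_seqs),  # database size, i.e. no cap on hits
        "-evalue", str(evalue),                    # permissive: keep chance-level hits
        "-dust", dust,                             # "no" keeps low-complexity regions
    ]


# Hypothetical file names, for illustration only.
cmd = build_blastn_command("test.fa", "train.blastdb", "test.blastn.out")
print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value, which makes it easy to see which setting trades precision for recall.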

In [7]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_8000.train.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_2000.test.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work

WORD_SIZE=7
MAX_TARGET_SEQS=8000 # size of train dataset (NOTE: defined here but not passed to the hashFrag call below)
E_VALUE=100
DUST=no

hashFrag blastn --train_fasta_path $TRAIN_FASTA_PATH --test_fasta_path $TEST_FASTA_PATH -w $WORD_SIZE -e $E_VALUE -d $DUST -o $WORK_DIR

FASTA files for existing train-test splits detected.
Computing pairwise BLAST comparisons across splits.
Existing BLAST DataBase found (/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/K562.sample_8000.train.blastdb). Skipping makeblastdb call.

BLASTn process finished and written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/K562.sample_2000.test.blastn.out


# Section 3: Filter false-positives based on a defined threshold

The next step involves filtering out candidate pairings with alignment scores lower than the specified threshold. hashFrag has two modes, which differ in the alignment score they use.

* `hashFrag-lightning` is the faster version, where the alignment score is taken directly from the BLAST output file. BLASTn is a heuristic method, and its alignment scores were found to correlate highly with the optimal alignment scores; however, it underestimates homology in some cases, which can lead to slightly worse recall.
* `hashFrag-pure` is the slower but more comprehensive method, based on the optimal Smith-Waterman local alignment scores between pairs of sequences. Calculating optimal alignment scores adds computational cost to the filtering step.

An alignment score threshold of 60 was determined to be appropriate based on an analysis of alignment scores between dinucleotide-shuffled (i.e., random) sequences.
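
As a rough illustration of the thresholding idea, the sketch below keeps the best score seen for each (test, train) pair and discards pairs scoring below the threshold. The tuple layout, the sequence IDs, and the inclusive `>=` comparison are assumptions made for this example; the actual file format and logic used by `hashFrag filter_false_positives` may differ.

```python
def filter_candidates(candidates, threshold):
    """Keep the best score per (query, subject) pair, then drop pairs below threshold."""
    best = {}
    for query_id, subject_id, score in candidates:
        key = (query_id, subject_id)
        # A pair can be reported more than once; retain its highest score.
        if score > best.get(key, float("-inf")):
            best[key] = score
    # Assumption: scores equal to the threshold are kept.
    return {pair: score for pair, score in best.items() if score >= threshold}


# Hypothetical candidate pairs: (test ID, train ID, alignment score).
candidates = [
    ("test_1", "train_9", 42.0),
    ("test_1", "train_9", 75.0),
    ("test_2", "train_3", 58.0),
    ("test_3", "train_3", 60.0),
]
similar_pairs = filter_candidates(candidates, threshold=60)
print(similar_pairs)
```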

## Section 3.1: Lightning mode

In [8]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work
INPUT_PATH=$WORK_DIR/K562.sample_2000.test.blastn.out
METHOD=lightning
THRESHOLD=60

hashFrag filter_false_positives -m $METHOD -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/hashFrag_lightning.similar_pairs.tsv.gz


## Section 3.2: Pure mode (optional)
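
Pure mode relies on optimal local alignment scores. For readers unfamiliar with Smith-Waterman, here is a minimal score-only sketch using a linear gap penalty. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are placeholders chosen for illustration, not the parameters used by hashFrag's scoring script, and this pure-Python version is far too slow for real datasets.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only two rows of the matrix are kept in memory because the score, not the alignment itself, is what the thresholding step needs.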

In [9]:
%%bash

SPLIT_SIZE=100000
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work
BLAST_PATH=$WORK_DIR/K562.sample_2000.test.blastn.out
BLAST_DIR=$WORK_DIR/blast_partitions

LABEL=$( basename -s ".out" $BLAST_PATH )

mkdir -p $BLAST_DIR
cd $BLAST_DIR

split -l $SPLIT_SIZE -a 4 --additional-suffix=.tsv $BLAST_PATH ${LABEL}.partition_

In [10]:
%%bash

cd ../src/external

# NOTE THIS IS THE CONCATENATED TRAIN AND TEST FASTA FILES
FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

for PARTITIONED_BLAST_PATH in $BLAST_DIR/*.blastn.partition_*.tsv
do
    echo $PARTITIONED_BLAST_PATH
    bash compute_blast_candidate_SW_scores.sh $FASTA_PATH $PARTITIONED_BLAST_PATH
done

/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/blast_partitions/K562.sample_2000.test.blastn.partition_aaaa.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/blast_partitions/K562.sample_2000.test.blastn.partition_aaab.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/blast_partitions/K562.sample_2000.test.blastn.partition_aaac.tsv


In [11]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

zcat $BLAST_DIR/*.pairwise_scores.tsv.gz | gzip > $WORK_DIR/K562.sample_10000.blastn_candidates.custom_scores.tsv.gz

In [12]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work
INPUT_PATH=$WORK_DIR/K562.sample_10000.blastn_candidates.custom_scores.tsv.gz
METHOD=pure

THRESHOLD=60

hashFrag filter_false_positives -m $METHOD -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/hashFrag_pure.similar_pairs.tsv.gz


# Section: Use Cases

## Section: Filter homology from test split
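
Conceptually, this use case removes every test sequence that appears in at least one similar pair. The sketch below shows that idea on in-memory data; the record layout, the IDs, and the `(test_id, train_id)` ordering of each pair are assumptions for illustration and do not describe the real hits file format or the internals of `hashFrag filter_existing_splits`.

```python
def filter_test_split(test_records, similar_pairs):
    """Drop test sequences that share a similar pair with any train sequence."""
    leaked_ids = {test_id for test_id, _train_id in similar_pairs}
    kept = {seq_id: seq for seq_id, seq in test_records.items()
            if seq_id not in leaked_ids}
    return kept, len(test_records) - len(kept)


# Hypothetical test records and similar pairs.
test_records = {"test_1": "ACGTACGT", "test_2": "GGGTTTCC", "test_3": "TTGACCAA"}
similar_pairs = [("test_1", "train_9"), ("test_1", "train_4"), ("test_3", "train_3")]

filtered, n_removed = filter_test_split(test_records, similar_pairs)
print(f"{n_removed} sequences filtered from test split.")
```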

In [14]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_8000.train.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_2000.test.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work

HITS_PATH=$WORK_DIR/hashFrag_lightning.similar_pairs.tsv.gz

hashFrag filter_existing_splits --train_fasta_path $TRAIN_FASTA_PATH --test_fasta_path $TEST_FASTA_PATH --hits_path $HITS_PATH

199 sequences filtered from test split.
Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_2000.test.filtered.fa.gz


In [15]:
%%bash

TRAIN_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_8000.train.fa.gz
TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_2000.test.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work

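# NOTE: as the logged output paths show, this run writes to the same filtered FASTA as the lightning-mode run above and overwrites it.
HITS_PATH=$WORK_DIR/hashFrag_pure.similar_pairs.tsv.gz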

hashFrag filter_existing_splits --train_fasta_path $TRAIN_FASTA_PATH --test_fasta_path $TEST_FASTA_PATH --hits_path $HITS_PATH

201 sequences filtered from test split.
Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_2000.test.filtered.fa.gz


## Section: Stratify test split based on homology
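
One plausible reading of stratification is sketched below: each test sequence is assigned to a bin according to its highest alignment score against any train sequence, with bins `step` units wide. This interpretation, the zero score for test sequences without hits, and the data layout are all assumptions for illustration; the output of `hashFrag stratify_test_split` may be organized differently.

```python
def stratify_by_score(test_ids, scored_pairs, step=10):
    """Map each test ID to the lower edge of its maximum-score bin."""
    max_score = {test_id: 0 for test_id in test_ids}  # assumption: no hit means 0
    for test_id, _train_id, score in scored_pairs:
        if test_id in max_score and score > max_score[test_id]:
            max_score[test_id] = score
    return {test_id: int(score // step) * step
            for test_id, score in max_score.items()}


# Hypothetical scored pairs: (test ID, train ID, alignment score).
scored_pairs = [("t1", "a", 75), ("t1", "b", 34), ("t2", "c", 19.5)]
bins = stratify_by_score(["t1", "t2", "t3"], scored_pairs, step=10)
print(bins)
```

Binning by the maximum score lets model performance be reported separately for test sequences that are close to, or far from, anything seen in training.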

In [19]:
%%bash

TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_2000.test.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work
INPUT_PATH=$WORK_DIR/K562.sample_2000.test.blastn.out
MODE=lightning
STEP=10

LABEL=$( basename -s ".test.fa.gz" $TEST_FASTA_PATH )
OUTPUT_PATH=$WORK_DIR/${LABEL}.stratified_test.tsv.gz

hashFrag stratify_test_split -f $TEST_FASTA_PATH -i $INPUT_PATH -m $MODE -s $STEP -o $OUTPUT_PATH


Stratified test split written to file: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/K562.sample_2000.stratified_test.tsv.gz


In [20]:
%%bash

TEST_FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_2000.test.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work
INPUT_PATH=$WORK_DIR/K562.sample_10000.blastn_candidates.custom_scores.tsv.gz
MODE=pure
STEP=10

LABEL=$( basename -s ".test.fa.gz" $TEST_FASTA_PATH )
# NOTE: this is the same OUTPUT_PATH as the lightning-mode run above, so that file is overwritten.
OUTPUT_PATH=$WORK_DIR/${LABEL}.stratified_test.tsv.gz

hashFrag stratify_test_split -f $TEST_FASTA_PATH -i $INPUT_PATH -m $MODE -s $STEP -o $OUTPUT_PATH


Stratified test split written to file: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.existing_splits.work/K562.sample_2000.stratified_test.tsv.gz
