In [1]:
import os
from pathlib import Path

In [5]:
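# `!export` only affects a throwaway subshell, so extend PATH for the kernel (and later %%bash cells) instead
os.environ["PATH"] += os.pathsep + "/home/brett/work/OrthogonalTrainValSplits/hashFrag/src"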

In [1]:
%%bash

hashFrag

usage: hashFrag [-h]
                {blastn,filter_false_positives,identify_homologous_groups,create_orthogonal_splits}
                ...

positional arguments:
  {blastn,filter_false_positives,identify_homologous_groups,create_orthogonal_splits}

optional arguments:
  -h, --help            show this help message and exit


# Introduction

> This notebook covers the case where users have a nucleotide sequence dataset and are interested in creating homology-aware train-test data splits for sequence-to-expression models.

The basic workflow is demonstrated on a subsampled MPRA dataset (K562): a 10,000-sequence FASTA file (provided in the `data` directory).

hashFrag is a command line tool. This notebook provides example calls and explanations for each step in the process.


# Section 0: Setup

In [72]:
data_dir   = "/home/brett/work/OrthogonalTrainValSplits/hashFrag/data"
fasta_path = os.path.join(data_dir,"K562.sample_10000.fa.gz")
label      = os.path.basename(fasta_path).replace(".fa.gz","")
score_path = os.path.join(data_dir,f"{label}.pairwise_scores.csv.gz")
work_dir   = os.path.join(data_dir,f"{label}.hashFrag.create_splits.work")
Path(work_dir).mkdir(parents=True,exist_ok=True)
print(work_dir)

blast_dir = os.path.join(work_dir,"blast_partitions")
Path(blast_dir).mkdir(parents=True,exist_ok=True)

/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work


# tl;dr

In [4]:
%%bash

FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits2.work

WORD_SIZE=7
MAX_TARGET_SEQS=8000 # size of train dataset
E_VALUE=100
DUST=no
THRESHOLD=60
SEED=21
P_TRAIN=0.8
P_TEST=0.2
N_SPLITS=10

hashFrag create_orthogonal_splits_pipeline \
--fasta_path $FASTA_PATH \
--word_size $WORD_SIZE \
--max_target_seqs $MAX_TARGET_SEQS \
--e_value $E_VALUE \
--dust $DUST \
--threshold $THRESHOLD \
--seed $SEED \
--p_train $P_TRAIN \
--p_test $P_TEST \
--n_splits $N_SPLITS \
--output_dir $WORK_DIR

Running blastn...
FASTA file containing all sequences detected.
Computing all pairwise BLAST comparisons.
Existing BLAST DataBase found (/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits2.work/hashFrag_lightning.blastdb). Skipping makeblastdb call.

BLASTn process finished and written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits2.work/hashFrag_lightning.blastn.out
Running filter_false_positives...
Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits2.work/hashFrag_lightning.similar_pairs.tsv.gz
Running identify_homologous_groups...
1114 sequences exhibiting homology.
90 homologous groups identified.
Homologous groups written to file: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits2.work/hashFrag_lightning.homologous_groups.tsv
Running create_orthogonal_splits...
Wr

# Section 1: Identifying candidate similar sequences

The process of identifying candidate pairs of similar sequences involves first creating a BLAST database of the dataset, and then querying each sequence against the database. The BLASTn algorithm returns pairwise matches that represent potential cases of homology. 

Successfully identifying cases of homology is essential for mitigating homology-based data leakage. As such, we configure the BLASTn parameters to maximize recall, even at the expense of more false positives. Here we consider the following BLASTn parameters:

* word_size: smaller word sizes result in more exact word matches between the query and sequences in the database, so more alignment score calculations are initiated.
* max_target_seqs: set to the size of the database to remove any constraints and allow all possible candidate sequences to be returned for a given query.
* evalue: the e-value is the number of alignments with an equal or better score that are expected to occur by chance (a lower value means the alignment is less likely to be a chance match). Increasing the e-value threshold allows less stringent matches, which could be due to chance, to be returned.
* dust: by turning dust off, low-complexity regions (e.g., repetitive sequences) are no longer masked/filtered out.
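
To make the parameter choices above concrete, here is a minimal sketch of how they map onto a plain BLAST+ `blastn` call. The flags are standard BLAST+ options, but whether hashFrag invokes `blastn` with exactly this argument list is an assumption, and the helper name `build_blastn_command` and the file names are hypothetical.

```python
import shlex


def build_blastn_command(query_fasta, db_path, out_path,
                         word_size=7, max_target_seqs=10000,
                         evalue=100, dust="no"):
    """Return a BLAST+ `blastn` argument list tuned for high recall.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", str(db_path),
        "-out", str(out_path),
        "-outfmt", "6",                             # tabular output
        "-word_size", str(word_size),               # small seeds -> more candidate hits
        "-max_target_seqs", str(max_target_seqs),   # do not cap hits per query
        "-evalue", str(evalue),                     # permissive e-value cutoff
        "-dust", str(dust),                         # keep low-complexity regions
    ]


# Placeholder file names, not files shipped with hashFrag.
cmd = build_blastn_command("sequences.fa", "sequences.blastdb", "sequences.blastn.out")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and the database has been built with `makeblastdb`.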

In [73]:
%%bash

FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work

WORD_SIZE=7
MAX_TARGET_SEQS=10000 # size of dataset
E_VALUE=100
DUST=no

hashFrag blastn -f $FASTA_PATH -w $WORD_SIZE -e $E_VALUE -d $DUST -o $WORK_DIR

FASTA file containing all sequences detected.
Computing all pairwise BLAST comparisons.


Building a new DB, current time: 12/20/2024 17:15:18
New DB name:   /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/K562.sample_10000.blastdb
New DB title:  K562.sample_10000
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 10000 sequences in 0.666264 seconds.



BLAST DataBase construction finished and written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/K562.sample_10000.blastdb

BLASTn process finished and written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/K562.sample_10000.blastn.out


# Section 2: Filter false-positives based on a defined threshold

The next step filters out candidate pairs with alignment scores lower than the specified threshold. hashFrag offers two modes, depending on which alignment score is used.

* `hashFrag-lightning` is the faster mode, in which the alignment score is taken directly from the BLAST output file. BLASTn is a heuristic method whose alignment scores were found to correlate highly with the optimal alignment scores; however, it underestimates homology in some cases, which can lead to slightly worse recall.
* `hashFrag-pure` is the slower but more comprehensive mode, based on the optimal Smith-Waterman local alignment scores between pairs of sequences. Computing optimal alignment scores adds cost to the filtering step.

An alignment score threshold of 60 was determined to be appropriate based on an analysis of alignment scores between dinucleotide-shuffled (i.e., random) sequences.
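
As a rough illustration of the filtering step, the sketch below applies a score threshold to a list of `(query, target, score)` candidates. It is not hashFrag's implementation: the function name is hypothetical, and dropping self-hits, collapsing reciprocal hits to their best score, and treating a score equal to the threshold as a hit are all assumptions made for the example.

```python
def filter_candidate_pairs(scored_pairs, threshold=60):
    """Keep the best score per unordered pair and drop pairs below threshold.

    Hypothetical helper; `scored_pairs` is an iterable of
    (query_id, target_id, score) tuples.
    """
    best = {}
    for query_id, target_id, score in scored_pairs:
        if query_id == target_id:
            continue  # a sequence trivially matches itself
        key = tuple(sorted((query_id, target_id)))  # (a, b) and (b, a) are the same pair
        if score > best.get(key, float("-inf")):
            best[key] = score
    # Assumption: scores equal to the threshold are kept.
    return {pair: score for pair, score in best.items() if score >= threshold}


candidates = [
    ("seq1", "seq2", 100),
    ("seq2", "seq1", 98),   # reciprocal hit of the first pair
    ("seq3", "seq4", 30),   # below threshold, treated as a false positive
    ("seq5", "seq6", 65),
    ("seq7", "seq7", 200),  # self-hit
]
similar_pairs = filter_candidate_pairs(candidates, threshold=60)
print(similar_pairs)
```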

## Section 2.1: Lightning mode

In [75]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
INPUT_PATH=$WORK_DIR/K562.sample_10000.blastn.out
MODE=lightning
THRESHOLD=60

hashFrag filter_false_positives -m $MODE -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag_lightning.similar_pairs.tsv.gz


## Section 2.2: Pure mode (optional)

To limit memory usage, we'll start by partitioning the BLAST output file into chunks with a fixed number of lines.

In [2]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
BLAST_PATH=$WORK_DIR/K562.sample_10000.blastn.out
BLAST_DIR=$WORK_DIR/blast_partitions

LABEL=$( basename -s ".out" $BLAST_PATH )

mkdir -p $BLAST_DIR
cd $BLAST_DIR

SPLIT_SIZE=100000

split -l $SPLIT_SIZE -a 4 --additional-suffix=.tsv $BLAST_PATH ${LABEL}.partition_

The bash script executed below calls a custom Python script that computes pairwise Smith-Waterman local alignment scores for the candidate pairs of sequences identified by the BLASTn algorithm. Note that this could feasibly be replaced with any scoring metric of interest.

The expected output is a tab-delimited file with 3 columns: the query sequence ID, the target sequence ID, and their alignment score:
```
seq1    seq2    100
seq3    seq4    30
seq5    seq6    65
```
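
For readers unfamiliar with Smith-Waterman scoring, the sketch below computes a local alignment score with a simple dynamic program and formats one row in the three-column layout shown above. It is an illustration only, not the script shipped with hashFrag: the scoring scheme (match `+1`, mismatch `-1`, linear gap penalty `-2`) and both function names are assumptions.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        curr = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def format_score_row(query_id, target_id, score):
    """Format one tab-delimited output row: query ID, target ID, score."""
    return f"{query_id}\t{target_id}\t{score}"


score = smith_waterman_score("ACGTACGT", "ACGTTACGT")
print(format_score_row("seq1", "seq2", score))
```

Any other pairwise metric could be substituted for `smith_waterman_score` as long as the rows keep the same three-column layout.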

In [9]:
%%bash

cd ../src/external

FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

for PARTITIONED_BLAST_PATH in $BLAST_DIR/*.blastn.partition_*.tsv
do
    echo $PARTITIONED_BLAST_PATH
    bash compute_blast_candidate_SW_scores.sh $FASTA_PATH $PARTITIONED_BLAST_PATH
done

/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/blast_partitions/K562.sample_10000.blastn.partition_aaaa.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/blast_partitions/K562.sample_10000.blastn.partition_aaab.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/blast_partitions/K562.sample_10000.blastn.partition_aaac.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/blast_partitions/K562.sample_10000.blastn.partition_aaad.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/blast_partitions/K562.sample_10000.blastn.partition_aaae.tsv
/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/blast_partitions/K562.sample_10000.blastn.partition_aaaf.tsv
/home/brett/work/Ortho

The SW scores for candidate pairs of sequences can subsequently be concatenated into a single `.tsv` file.

In [10]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
BLAST_DIR=$WORK_DIR/blast_partitions

zcat $BLAST_DIR/*.pairwise_scores.tsv.gz | gzip > $WORK_DIR/K562.sample_10000.blastn_candidates.custom_scores.tsv.gz

Rather than using the heuristic alignment scores provided by the BLASTn algorithm, we can filter false-positives based on SW alignment scores. Make sure to set the mode to `pure`.

In [11]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
INPUT_PATH=$WORK_DIR/K562.sample_10000.blastn_candidates.custom_scores.tsv.gz
MODE=pure
THRESHOLD=60

hashFrag filter_false_positives -m $MODE -i $INPUT_PATH -t $THRESHOLD -o $WORK_DIR

Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag_pure.similar_pairs.tsv.gz


# Section 3: Determine groups of homology

There are often distinct groups of sequences exhibiting different cases of homology throughout the dataset. To determine such groups, we represent the "hits" (i.e., pairs of sequences with an alignment score greater than the threshold) as a sparse adjacency matrix. A graph can then be constructed in which nodes correspond to sequences and edges denote homology between two sequences. Identifying groups of homology then reduces to finding the connected components (i.e., the mutually disconnected subgraphs) of this graph.

An efficient implementation for this graph-based task is provided in the `igraph` Python library.
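
To show what the grouping step computes, here is a small self-contained sketch that finds connected components with a union-find structure from the standard library. This is a stand-in chosen so the example runs without extra dependencies; it is not the `igraph`-based code used by hashFrag, and the function name is hypothetical.

```python
from collections import defaultdict


def find_homologous_groups(similar_pairs):
    """Group sequence IDs into connected components of the homology graph.

    `similar_pairs` is an iterable of (id_a, id_b) edges. Returns a list
    of sorted ID lists, ordered by their first member.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for id_a, id_b in similar_pairs:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(set)
    for seq_id in list(parent):
        groups[find(seq_id)].add(seq_id)
    return sorted((sorted(members) for members in groups.values()),
                  key=lambda members: members[0])


pairs = [("seq1", "seq2"), ("seq2", "seq3"), ("seq5", "seq6")]
print(find_homologous_groups(pairs))
```

Note that `seq1` and `seq3` end up in the same group even though they were never directly paired: homology is propagated through `seq2`.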

## Section 3.1: `lightning`-filtered homologous pairs

In [12]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
HITS_PATH=$WORK_DIR/hashFrag_lightning.similar_pairs.tsv.gz
OUTPUT_PATH=$WORK_DIR/homologous_groups.lightning.csv

hashFrag identify_homologous_groups -i $HITS_PATH -o $OUTPUT_PATH

1114 sequences exhibiting homology.
90 homologous groups identified.
Homologous groups written to file: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/homologous_groups.lightning.csv


## Section 3.2: `pure`-filtered homologous pairs

In [13]:
%%bash

WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
HITS_PATH=$WORK_DIR/hashFrag_pure.similar_pairs.tsv.gz
OUTPUT_PATH=$WORK_DIR/homologous_groups.pure.csv

hashFrag identify_homologous_groups -i $HITS_PATH -o $OUTPUT_PATH

1138 sequences exhibiting homology.
92 homologous groups identified.
Homologous groups written to file: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/homologous_groups.pure.csv


# Section 4: Use case(s)

After identifying groups of sequences exhibiting high similarity (i.e., homology), we can create train-test data splits using a graph-based method. Specifically, with sequences represented as nodes and edges denoting whether two sequences were found to be homologous, identifying homologous groups reduces to finding all disconnected subgraphs (connected components) of the graph. Keeping each group on one side of a split prevents homologous sequences from spanning train and test.

## Creating homology-aware data splits

Below we show how splits can be created based on the homologous groups identified from either the `hashFrag-lightning` or `hashFrag-pure` methods.
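
Before running the command, it may help to see the idea in miniature. The sketch below assigns whole homologous groups (and ungrouped sequences as singletons) to either train or test, then checks that no similar pair crosses the split. The greedy assignment strategy and both function names are assumptions made for illustration; hashFrag's actual splitting procedure may differ.

```python
import random


def create_group_aware_split(all_ids, groups, p_test=0.2, seed=21):
    """Assign whole homologous groups to train or test.

    Greedy sketch: shuffle the units (groups plus singleton sequences)
    and add a unit to test only if it still fits the target test size.
    """
    grouped = {seq_id for group in groups for seq_id in group}
    units = [list(group) for group in groups]
    units += [[seq_id] for seq_id in all_ids if seq_id not in grouped]

    rng = random.Random(seed)
    rng.shuffle(units)

    target_test_size = round(p_test * len(all_ids))
    train, test = [], []
    for unit in units:
        if len(test) + len(unit) <= target_test_size:
            test.extend(unit)
        else:
            train.extend(unit)
    return train, test


def is_orthogonal(train, test, similar_pairs):
    """Return True if no similar pair has one member in each split."""
    train_set, test_set = set(train), set(test)
    return not any(
        (a in train_set and b in test_set) or (a in test_set and b in train_set)
        for a, b in similar_pairs
    )


all_ids = [f"seq{i}" for i in range(1, 11)]
groups = [["seq1", "seq2", "seq3"], ["seq5", "seq6"]]
pairs = [("seq1", "seq2"), ("seq2", "seq3"), ("seq5", "seq6")]

train, test = create_group_aware_split(all_ids, groups, p_test=0.2, seed=21)
print(len(train), len(test), is_orthogonal(train, test, pairs))
```

Changing `seed` yields a different but equally leakage-free split, which is the idea behind requesting several splits with `-n`.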

In [15]:
%%bash

FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
HOMOLOGY_PATH=$WORK_DIR/homologous_groups.lightning.csv # lightning mode
OUT_DIR=$WORK_DIR

hashFrag create_orthogonal_splits -f $FASTA_PATH -i $HOMOLOGY_PATH -n 10 -o $OUT_DIR

Writing splits...
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_001.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_002.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_003.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_004.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_005.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_006.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.

In [16]:
%%bash

FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work
HOMOLOGY_PATH=$WORK_DIR/homologous_groups.pure.csv # pure mode
OUT_DIR=$WORK_DIR

hashFrag create_orthogonal_splits -f $FASTA_PATH -i $HOMOLOGY_PATH -n 10 -o $OUT_DIR

Writing splits...
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_001.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_002.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_003.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_004.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_005.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.create_splits.work/hashFrag.train_8000.test_2000.split_006.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.