In [1]:
import os
from Bio import SeqIO
from tqdm import tqdm
from pathlib import Path

In [2]:
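# `!export` runs in a throwaway subshell, so the change would not persist.
# Updating os.environ makes the hashFrag executable visible to later `%%bash` cells.
os.environ["PATH"] += os.pathsep + "/home/brett/work/OrthogonalTrainValSplits/hashFrag/src"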

# Introduction

This notebook provides a basic tutorial on the core functionality of hashFrag.

Working with a small MPRA dataset of 10,000 sequences (provided in the `data` directory), users can see example calls to hashFrag. Specifically, this notebook covers the following tasks:

1. `hashFrag blastn`: Identifying candidate pairs of sequences exhibiting similarity in the dataset.
2. (Optional) For each candidate pair, compute its optimal local alignment score ([Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm)).
3. `hashFrag filter_false_positives`: Using the local alignment scores calculated in the previous step, filter false-positive candidates based on a specified threshold.
    + Note that if the previous step was NOT performed, false-positive candidates will be filtered based on the heuristic local alignment scores provided by the BLAST algorithm.
    + The appropriate threshold is not only dataset-dependent, but also depends strongly on the BLAST (and, if performed, Smith-Waterman) local alignment parameters. See the advanced tutorial (TODO) for an example of our recommended workflow for tuning the relevant parameters and selecting an appropriate threshold.
4. `hashFrag identify_homologous_groups`: Determining the different subgroups of sequences exhibiting homology.
5. `hashFrag create_orthogonal_splits`: Creating homology-aware data splits.
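
To make the optional Smith-Waterman step concrete, here is a minimal sketch of how an optimal local alignment score can be computed. It is an illustration only and not hashFrag's implementation: the function name `smith_waterman_score` and the scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for readability.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal local alignment score of strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty),
    not the parameters used by hashFrag.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment can restart anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "CCCC")
shared_core = smith_waterman_score("TTACGTTT", "GGACGTGG")
```

Because the score is taken over the best-matching local region, two sequences that share only a short core (here `ACGT`) still receive a positive score, which is why a threshold is needed to separate real homology from chance matches.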

hashFrag is a command line tool. This notebook serves as a resource to better understand each step in the process.


# Section 0: Setup

In [3]:
data_dir   = "/home/brett/work/OrthogonalTrainValSplits/hashFrag/data"
fasta_path = os.path.join(data_dir,"K562.sample_10000.fa.gz")
label      = os.path.basename(fasta_path).replace(".fa.gz","")
score_path = os.path.join(data_dir,f"{label}.pairwise_scores.csv.gz")
work_dir   = os.path.join(data_dir,f"{label}.hashFrag.work")
Path(work_dir).mkdir(parents=True,exist_ok=True)
print(work_dir)

blast_dir = os.path.join(work_dir,"blast_partitions")
Path(blast_dir).mkdir(parents=True,exist_ok=True)

/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work


# Section 1: Identifying candidate similar sequences

The process of identifying candidate pairs of similar sequences involves first creating a BLAST database of the dataset, and then querying each sequence against the database, collecting all pairwise matches that represent potential cases of homology. 

To effectively mitigate potential biases caused by homology-based data leakage, it is imperative that all cases of homology are identified. To that end, we configure the BLASTn call to maximize recall (this comes at the expense of more false positives, which are subsequently filtered). Here we consider the following BLASTn parameters:

* word_size: smaller word sizes result in more exact word matches being found between the query and sequences in the database, so more alignment score calculations are initiated.
* max_target_seqs: set to the size of the database to remove any constraint and allow all possible candidate sequences to be returned for a given query.
* evalue: the e-value is the expected number of alignments with a score at least as high that would arise by chance in a database of this size (a lower value means the match is less likely to be due to chance). Increasing the e-value threshold allows weaker matches, including those that could be due to chance, to be returned.
* dust: by setting dust off, low-complexity regions (e.g., repetitive sequences) are no longer masked/filtered out.
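
As a rough guide to how these four parameters map onto a standalone BLAST+ call, the sketch below assembles a `blastn` argument list using the standard `-word_size`, `-max_target_seqs`, `-evalue`, and `-dust` flags. The exact command hashFrag issues internally is not shown in this notebook, so the helper `build_blastn_command`, the file names, and the choice of tabular output (`-outfmt 6`) are assumptions for illustration.

```python
def build_blastn_command(query, db, word_size, max_target_seqs, evalue, dust, out):
    """Assemble a recall-oriented blastn command as an argument list.

    Hypothetical helper: hashFrag's own invocation may differ.
    """
    return [
        "blastn",
        "-query", str(query),
        "-db", str(db),
        "-word_size", str(word_size),              # small seeds -> more candidate hits
        "-max_target_seqs", str(max_target_seqs),  # do not cap hits per query
        "-evalue", str(evalue),                    # permissive significance cutoff
        "-dust", str(dust),                        # "no" disables low-complexity masking
        "-outfmt", "6",                            # tabular output (assumed choice)
        "-out", str(out),
    ]


# Placeholder file names; the list could be passed to subprocess.run if BLAST+ is installed.
cmd = build_blastn_command(
    query="sequences.fa", db="sequences.blastdb",
    word_size=7, max_target_seqs=10000, evalue=100, dust="no",
    out="sequences.blastn.out",
)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.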

In [6]:
%%bash

FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work

WORD_SIZE=7
MAX_TARGET_SEQS=10000 # size of dataset (defined for reference; not passed to the call below)
E_VALUE=100
DUST=no

hashFrag blastn -f $FASTA_PATH -w $WORD_SIZE -e $E_VALUE -d $DUST -o $WORK_DIR



Building a new DB, current time: 12/10/2024 08:00:47
New DB name:   /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/K562.sample_10000.blastdb
New DB title:  K562.sample_10000
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 10000 sequences in 0.260447 seconds.



BLAST DataBase construction finished and written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/K562.sample_10000.blastdb

BLASTn process finished and written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/K562.sample_10000.blastn.out


# Section 3: Filter false-positives based on a defined threshold

We filter out candidate pairings whose alignment scores fall below the specified threshold. An alignment score threshold of 60 was determined to be appropriate based on an analysis of alignment scores between dinucleotide-shuffled (i.e., random) sequences.
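
Conceptually, the filtering step is a simple comparison against the threshold. The sketch below illustrates it on a handful of made-up candidate pairs; it is not hashFrag's code, the sequence IDs and scores are invented, and treating a score exactly equal to the threshold as a hit is an assumption.

```python
def filter_candidates(pairs, threshold):
    """Keep (query, subject, score) tuples that pass the score threshold.

    Assumptions: self-hits are discarded, and a score equal to the
    threshold counts as a hit.
    """
    kept = []
    for query, subject, score in pairs:
        if query == subject:
            continue  # a sequence always matches itself; not informative
        if score >= threshold:
            kept.append((query, subject, score))
    return kept


# Invented example data, for illustration only.
candidates = [
    ("seq1", "seq1", 200),  # self-hit
    ("seq1", "seq2", 185),  # strong match
    ("seq1", "seq3", 22),   # likely chance match
    ("seq4", "seq5", 60),   # exactly at the threshold
]
hits = filter_candidates(candidates, threshold=60)
```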

In [17]:
%%bash

BLAST_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/K562.sample_10000.blastn.out
THRESHOLD=60
WORK_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work

hashFrag filter_false_positives -m "lightning" -b $BLAST_PATH -t $THRESHOLD -o $WORK_DIR

Filtered results written to: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/blastn_results.filtered_candidates.tsv.gz


# Section 4: Determine groups of homology

Homologous sequences often fall into distinct groups across the dataset. To determine these groups, we represent the "hits" (i.e., pairs of sequences with an alignment score above the threshold) as a sparse adjacency matrix. This defines a graph in which nodes correspond to sequences and an edge denotes homology between the two sequences it connects. Identifying groups of homology then reduces to finding the connected components of this graph (i.e., subgraphs that are disconnected from one another).

An efficient implementation for this graph-based task is provided in the `igraph` Python library.
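
To show what the grouping step computes without requiring `igraph`, the sketch below finds connected components with a plain union-find structure. This is a swapped-in technique for illustration, not the `igraph`-based implementation used by hashFrag; the function name `homologous_groups` and the example hits are hypothetical.

```python
def homologous_groups(hits):
    """Group sequence IDs into connected components given (id_a, id_b) hits."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Invented hits: a-b and b-c chain into one group even though a and c never hit directly.
example_hits = [("a", "b"), ("b", "c"), ("d", "e")]
example_groups = homologous_groups(example_hits)
```

Note that grouping is transitive: two sequences end up in the same group if any chain of hits links them, even when they were never aligned to each other directly.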

In [22]:
%%bash

HITS_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/blastn_results.filtered_candidates.tsv.gz
OUTPUT_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/homologous_groups.csv

hashFrag identify_homologous_groups -i $HITS_PATH -o $OUTPUT_PATH

1138 sequences exhibiting homology.
92 homologous groups identified.
Homologous groups written to file: /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/homologous_groups.csv


# Section 5: Create orthogonal data split
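
A homology-aware split keeps every member of a homologous group on the same side of the split, so that no test sequence has a homolog in the training set. The sketch below shows one simple way to do this: shuffle whole groups (and ungrouped sequences as singletons) and greedily fill the test set up to a target size. It is a minimal illustration under stated assumptions, not hashFrag's algorithm; the function `group_aware_split`, its parameters, and the example IDs are hypothetical.

```python
import random


def group_aware_split(seq_ids, group_of, test_fraction=0.2, seed=0):
    """Split IDs into (train, test) without separating homologous groups.

    group_of maps a sequence ID to its group label; IDs missing from
    group_of are treated as their own singleton unit.
    """
    units = {}
    for sid in seq_ids:
        key = ("group", group_of[sid]) if sid in group_of else ("single", sid)
        units.setdefault(key, []).append(sid)

    unit_list = list(units.values())
    random.Random(seed).shuffle(unit_list)  # a different seed gives a different split

    target = round(test_fraction * len(seq_ids))
    train, test = [], []
    for unit in unit_list:
        # Place a whole unit in the test set only if it still fits.
        if len(test) + len(unit) <= target:
            test.extend(unit)
        else:
            train.extend(unit)
    return train, test


# Invented example: 10 sequences, two homologous groups.
ids = [f"s{i}" for i in range(10)]
group_of = {"s0": 0, "s1": 0, "s2": 0, "s3": 1, "s4": 1}
train_ids, test_ids = group_aware_split(ids, group_of, test_fraction=0.2, seed=1)
```

Repeating the call with different seeds produces multiple independent splits, each of which keeps homologous groups intact.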

In [40]:
%%bash

FASTA_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.fa.gz
HOMOLOGY_PATH=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/homologous_groups.csv
OUT_DIR=/home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work

hashFrag create_orthogonal_splits -f $FASTA_PATH -i $HOMOLOGY_PATH -n 10 -o $OUT_DIR

Writing splits...
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/hashFrag.train_8000.test_2000.split_001.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/hashFrag.train_8000.test_2000.split_002.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/hashFrag.train_8000.test_2000.split_003.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/hashFrag.train_8000.test_2000.split_004.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/hashFrag.train_8000.test_2000.split_005.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/hashFrag.train_8000.test_2000.split_006.csv.gz
  /home/brett/work/OrthogonalTrainValSplits/hashFrag/data/K562.sample_10000.hashFrag.work/hashFrag.train_8000.test_2000.split_007.csv.gz
  /home/brett/work/Orth