DARDN (DnAResDualNet): Identifying transcription factors inducing cancer-specific CTCF binding using multi-CNNs and DeepLIFT
git clone git@github.com:berkuva/DARDN.git
wget https://zanglab.github.io/data/cancerCTCF/data/union_binding.bed
For other chromosomes, you can use generate_onehot.py to generate one-hot encoded T-ALL gained CTCF sites.
3. Download the T-ALL gained and constitutive CTCF sites for chromosome 1 (chr1_len5045.npz) from Google Drive.
This file contains 26009 CTCF-centered sites, each with length 10090.
python run_model.py
Otherwise, modify load_data.py as needed to load the desired input data to train DARDN. To change how subsequences are selected from gained CTCF sites, modify the parameters in extract_subsequences.py.
DARDN is a deep learning framework for identifying transcription factors bound near cancer-specific CTCF sites. It extends a previous study.
We ran DARDN on T-cell acute lymphoblastic leukemia (T-ALL) data and identified RBPJ as one of the most enriched motifs. RBPJ remained among the most enriched motifs under various perturbations, which shows the robustness of our pipeline.
DARDN is cancer-type agnostic and can easily be adapted to other cancer types. We show the adaptability by running our pipeline on 5 additional cancer types: acute myeloid leukemia (AML), breast cancer (BRCA), colorectal cancer (CRC), lung adenocarcinoma (LUAD), and prostate cancer (PRAD). Please find detailed tables and figures in our manuscript.
- Prepare CTCF-centered DNA sequences.
- Perform data augmentation if necessary.
- Train and evaluate DARDN, which contains dual-CNN networks and residual connections.
- Apply DeepLIFT to gained sites to assign scores to subsequences.
- Select subsequences with high DeepLIFT scores (above a cutoff score or a fixed number of subsequences) and input them into a motif analysis tool such as HOMER.
If you don't have your own CTCF-centered data and would like to use the data used in our work, you can download cancer-specific CTCF gained sites here. This link includes CTCF gained sites for T-ALL, AML, BRCA, CRC, LUAD, and PRAD as well as constitutive CTCF sites. Additionally, you can generate CTCF-centered data by using the provided generate_onehot.py.
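As an illustration of what one-hot encoding a DNA sequence involves, here is a minimal pure-Python sketch. It is not the code in generate_onehot.py: the helper name `one_hot_encode`, the A, C, G, T channel order, and the all-zero encoding for unknown bases such as N are assumptions made for this example.

```python
# Hypothetical helper for illustration; generate_onehot.py may differ.
BASES = "ACGT"  # assumed channel order: A, C, G, T


def one_hot_encode(seq):
    """Return one 4-element 0/1 row per base.

    Bases outside A/C/G/T (for example N) are encoded as all zeros,
    which is an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

A real pipeline would typically store the result as a NumPy array so that all sites can be stacked into a single tensor for training.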
Augmenting DNA sequences can include shifting and reverse complementation of the original sequences. utils.py contains functions for data augmentation. Please modify them as necessary.
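The two augmentations mentioned above can be sketched on plain sequence strings as follows. This is an illustrative example, not the code in utils.py: the function names, the `shifts` default, and the choice to raise an error when a shifted window leaves the sequence are all assumptions.

```python
# Hypothetical helpers for illustration; the functions in utils.py may differ.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def shifted_window(seq, center, half_length, shift):
    """Cut a window of length 2 * half_length centered at center + shift."""
    start = center + shift - half_length
    end = center + shift + half_length
    if start < 0 or end > len(seq):
        raise ValueError("shifted window falls outside the sequence")
    return seq[start:end]


def augment(seq, center, half_length, shifts=(-2, 0, 2)):
    """Return each shifted window followed by its reverse complement."""
    augmented = []
    for shift in shifts:
        window = shifted_window(seq, center, half_length, shift)
        augmented.append(window)
        augmented.append(reverse_complement(window))
    return augmented
```

For one-hot arrays whose channels are ordered A, C, G, T, the reverse complement can be obtained by reversing both the position axis and the channel axis, because that channel order swaps A with T and C with G.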
Run run_model.py to train and evaluate DARDN. We also provide pre-trained weights in pretrained_weights.pth.
extract_subsequences.py contains two methods to select subsequences after DARDN is trained: 1. Select a fixed number (say 10) of subsequences from each gained sequence. 2. Select the subsequences whose DeepLIFT score is above a cutoff. We also provide attributions.pt, which contains DeepLIFT calculations for each gained site for T-ALL.
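The two selection strategies can be sketched as below for a single gained site. This is a simplified illustration rather than the logic in extract_subsequences.py: it assumes the DeepLIFT attributions have already been reduced to one score per base, scores each window by the sum of its per-base scores, and uses hypothetical function and parameter names (`width`, `stride`, `k`, `cutoff`).

```python
# Hypothetical helpers for illustration; extract_subsequences.py may differ.

def window_scores(scores, width, stride):
    """Sum per-base scores over sliding windows.

    Returns a list of (window_start, total_score) pairs.
    """
    return [
        (start, sum(scores[start:start + width]))
        for start in range(0, len(scores) - width + 1, stride)
    ]


def select_top_k(scores, width, stride, k):
    """Method 1: keep the k highest-scoring windows of one site."""
    ranked = sorted(
        window_scores(scores, width, stride),
        key=lambda window: window[1],
        reverse=True,
    )
    return ranked[:k]


def select_above_cutoff(scores, width, stride, cutoff):
    """Method 2: keep every window whose total score exceeds the cutoff."""
    return [
        window
        for window in window_scores(scores, width, stride)
        if window[1] > cutoff
    ]
```

A fixed number per site yields the same amount of input from every gained sequence, whereas a cutoff lets strongly attributed sites contribute more windows and weakly attributed sites contribute few or none.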
Sample HOMER command: findMotifsGenome.pl /path/to/subsequences.txt hg38 /path/to/save_dir/ -size 200
- Run generate_onehot.py to generate sequences.
- Run run_model.py for model training and DeepLIFT calculations. Or, use provided pre-trained parameters (pretrained_weights.pth) and DeepLIFT scores for each gained site in T-ALL (attributions.pt).
- Run extract_subsequences.py to filter subsequences with the most positive DeepLIFT scores. This creates a txt file with the genomic coordinates of the selected subsequences.
- Run motif analysis.
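To make the coordinate step concrete, the sketch below maps a window offset inside a CTCF-centered site back to genomic coordinates and formats it as a 6-column BED line, a region format that HOMER's findMotifsGenome.pl accepts. The function name, its arguments, and the BED layout are assumptions for illustration; the txt file written by extract_subsequences.py may use a different layout.

```python
# Hypothetical helper for illustration; the actual output format may differ.

def window_to_bed_line(chrom, site_center, half_length,
                       window_start, window_width, name):
    """Convert a window offset within a centered site to a BED line.

    The site is assumed to span
    [site_center - half_length, site_center + half_length).
    """
    site_start = site_center - half_length
    start = site_start + window_start
    end = start + window_width
    # Columns: chrom, start, end, name, score (unused), strand.
    return "\t".join([chrom, str(start), str(end), name, "0", "+"])


if __name__ == "__main__":
    # Example values are made up; 5045 matches the half-length implied
    # by the file name chr1_len5045.npz (2 * 5045 = 10090).
    print(window_to_bed_line("chr1", 1000000, 5045, 100, 200, "window_1"))
```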
If you use any data from this website, please cite:
Cho HJ, Wang Z, Cong Y, Bekiranov S, Zhang A, Zang C. DARDN: A Deep-Learning Approach for CTCF Binding Sequence Classification and Oncogenic Regulatory Feature Discovery. Genes. 2024; 15(2):144. https://doi.org/10.3390/genes15020144