Tzu-Hsien Yang*, Yu-Huai Yu+, Shang-Hang Wu+, Fang-Yuan Zhang+, "CFA: an explainable deep learning model for cis-regulatory module transcriptional role annotation based on epigenetic codes", Computers in Biology and Medicine, 2022.
+: These authors contributed equally.
Suggested running environments: Linux Ubuntu 16.04.6, Python 3.8.13
We recommend using Conda to create a dedicated environment for CFA. The required Python packages are then installed into this environment in the steps below.
Here is an example:
- Install the Conda package for your system. The installation instructions for the package can be found here.
- Create the CFA Conda environment. This may take a while, depending on the network status.
conda create -n "CFA" python=3.8.13
- Activate your CFA Conda environment.
conda activate CFA
- Download the code from the following link. Skip this step if you have already done it.
wget https://cobis-fs.bme.ncku.edu.tw/CFA/CFA_Model.tar.gz
- Unzip the file.
tar -zxvf CFA_Model.tar.gz
- Change the working directory.
cd CFA_Model
- Download the processed epigenetic datasets from the following link.
wget https://cobis-fs.bme.ncku.edu.tw/CFA/CFA_Dataset.tar.gz
- Unzip the file.
tar -zxvf CFA_Dataset.tar.gz
- If this is your first time using CFA, run the following command to install the necessary packages.
pip install -r requirements.txt
CFA also supports GPU acceleration. To use a GPU, run the following command instead:
pip install -r requirements_gpu.txt
- Prepare the input Drosophila CRM chromosomal regions (ver. DM6). Multiple chromosomal regions can be provided in the same input file.
The input file must follow this format: chromosome,start,end
For example (input file named input_Test.csv):
Note: The input chromosomal regions start from the 5' end.
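Checking an input file before running the model can catch formatting mistakes early. The following is a minimal sketch, not part of CFA: the function name `parse_regions` and the `has_header` switch are hypothetical, and it assumes that each line holds exactly three comma-separated fields and that start is smaller than end. Whether CFA itself expects a header row is not stated here, so the sketch leaves that choice to the caller.

```python
import csv
import io


def parse_regions(text, has_header=False):
    """Parse 'chromosome,start,end' lines into (chrom, start, end) tuples.

    Hypothetical helper, not part of CFA. Assumes start < end.
    """
    regions = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if has_header and line_no == 1:
            continue  # skip the header row when the caller says there is one
        if not row or all(not cell.strip() for cell in row):
            continue  # ignore blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(row)}")
        chrom, start, end = (cell.strip() for cell in row)
        if not (start.isdigit() and end.isdigit()):
            raise ValueError(f"line {line_no}: start and end must be integers")
        start, end = int(start), int(end)
        if start >= end:
            raise ValueError(f"line {line_no}: start must be smaller than end")
        regions.append((chrom, start, end))
    return regions
```

For instance, `parse_regions(open("input_Test.csv").read())` would return one tuple per region, or raise a `ValueError` naming the first malformed line.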
- Predict the probability of each CRM transcriptional role (promoter, enhancer, and insulator).
python main.py -i <input_csv_file> -o <output_directory>
Required arguments:
-i: The input file for the CFA Model.
-o: The output directory of the predicted results and SHAP bar plots.
The probability results will be saved to the file named "predicted_result.csv" in the specified output directory.
CFA also provides the top five determining epigenetic profiling types for every CRM. The SHAP value plots and the standardized ChIP-seq scores are stored in the specified output directory.
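Since each run takes one input file and one output directory, several input files can be processed in a loop. The sketch below is one possible way to do this and is not part of CFA: `build_cfa_command` and `run_cfa_batch` are hypothetical names, only the documented -i and -o arguments are used, and the script is assumed to be started from the CFA_Model directory with the CFA environment active.

```python
import subprocess
from pathlib import Path


def build_cfa_command(input_csv, output_dir):
    """Return the argument list for one CFA run (documented -i and -o only)."""
    return ["python", "main.py", "-i", str(input_csv), "-o", str(output_dir)]


def run_cfa_batch(input_files, output_root):
    """Run CFA once per input file, each with its own output subdirectory."""
    for input_csv in input_files:
        # Name the output subdirectory after the input file (assumed layout).
        output_dir = Path(output_root) / Path(input_csv).stem
        output_dir.mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if main.py exits with an error.
        subprocess.run(build_cfa_command(input_csv, output_dir), check=True)
```

Keeping the command construction separate from the loop makes it easy to print or log the exact command before anything is executed.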
For example, with input_Test.csv as the input file and the following command:
python main.py -i input_Test.csv -o output/
**output/predicted_result.csv:**
Output format explanation:
- CRM_location,0,0,0 --> Contains nothing.
- CRM_location,1,0,0 --> Contains promoter.
- CRM_location,0,1,0 --> Contains enhancer.
- CRM_location,0,0,1 --> Contains insulator.
- CRM_location,1,1,0 --> Contains promoter and enhancer.
- CRM_location,1,0,1 --> Contains promoter and insulator.
- CRM_location,0,1,1 --> Contains enhancer and insulator.
- CRM_location,1,1,1 --> Contains promoter, enhancer and insulator.
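The eight cases above amount to reading the three values as independent flags. A minimal sketch of that decoding, assuming the column order promoter, enhancer, insulator implied by the list (the function name `describe_roles` is hypothetical and not part of CFA):

```python
# Column order assumed from the format explanation above.
ROLES = ("promoter", "enhancer", "insulator")


def describe_roles(flags):
    """Return the role names whose flag is 1, in column order."""
    if len(flags) != len(ROLES):
        raise ValueError(f"expected {len(ROLES)} flags, got {len(flags)}")
    return [role for role, flag in zip(ROLES, flags) if int(flag) == 1]
```

With this helper, `describe_roles((1, 0, 1))` gives the promoter and insulator roles, and an all-zero row gives an empty list.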
Output plots:
Take X_8305146_8307354 from the above input as an example.
The CRM is identified as having promoter and insulator functions, and the output plots for this CRM are in the subdirectory output/X_8305146_8307354.
- The SHAP value plots
- The standardized ChIP-seq score plots
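To locate the plots for a given region programmatically, the subdirectory name can be rebuilt from the region itself. The sketch below assumes the chromosome_start_end naming pattern seen in the single example above; the pattern is inferred, not documented, and `crm_plot_dir` is a hypothetical helper.

```python
from pathlib import Path


def crm_plot_dir(output_dir, chrom, start, end):
    """Build the per-CRM plot directory, assuming a chrom_start_end name."""
    return Path(output_dir) / f"{chrom}_{start}_{end}"
```

For the example region, `crm_plot_dir("output", "X", 8305146, 8307354)` points at output/X_8305146_8307354.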