Boosting the Unbiased Functional Enrichment Analysis (BUFET)
Table of contents
2. Compiling BUFET
3. Executing BUFET
3.1. Files required
3.2. Script Execution
3.4. Adding more species
4. Reproduction of the BUFET paper's experiments
The BUFET algorithm generates an empirical distribution of the genes targeted by miRNAs and calculates p-values for the related biological processes. After Benjamini-Hochberg FDR correction, results are marked with '*' or '**' for significance at 0.05 FDR and 0.01 FDR, respectively.
The publication for BUFET can be found here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1812-8
Konstantinos Zagganas, Thanasis Vergoulis, Ioannis S. Vlachos, Maria D. Paraskevopoulou, Spiros Skiadopoulos and Theodore Dalamagas. BUFET: boosting the unbiased miRNA functional enrichment analysis using bitsets. BMC Bioinformatics volume 18, page 399, doi 10.1186/s12859-017-1812-8, 2017.
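The empirical p-value and the Benjamini-Hochberg marking mentioned in the introduction can be illustrated with a minimal Python sketch. This is not BUFET's actual implementation (the core is written in C++ and uses bitsets). The function names, the "greater than or equal" comparison and the "<=" thresholds are assumptions made for illustration.

```python
# Illustrative sketch only; not BUFET's actual implementation.

def empirical_p_value(observed, random_counts):
    """Fraction of random miRNA groups whose overlap with a biological
    process is at least as large as the observed overlap (assumed rule)."""
    hits = sum(1 for count in random_counts if count >= observed)
    return hits / float(len(random_counts))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / float(rank))
        adjusted[i] = running_min
    return adjusted


def significance_mark(q_value):
    """'**' at 0.01 FDR, '*' at 0.05 FDR, empty string otherwise."""
    if q_value <= 0.01:
        return "**"
    if q_value <= 0.05:
        return "*"
    return ""
```

For instance, `benjamini_hochberg([0.001, 0.02, 0.03, 0.5])` adjusts each p-value for the four tests performed, and `significance_mark` then turns each adjusted value into the marker shown in the output.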
2. Compiling BUFET
In order for the program to run, the system must meet the following requirements:
- At least 4 GB of RAM.
- Linux or macOS.
- A Python interpreter (version 2.7 or later) that can be run from the command line.
- g++ 4.8 or later.
Before you can run the BUFET script, you need to compile the C++ program file. A Makefile is provided for that purpose. The process is as follows:
- Download the code and unzip the files.
- From the command line, navigate inside the folder containing the .cpp, .py and Makefile files.
- Run the following command:
make
3. Executing BUFET
3.1. Files required
- Input miRNA file, which is a text file containing only the names
of differentially expressed miRNAs, each on a separate line. For
- Gene synonym data file from NCBI, http://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/All_Mammalia.gene_info.gz. Decompress the file with:
gzip -d All_Mammalia.gene_info.gz
- Gene annotation data file retrieved by GO, KEGG, PANTHER, DisGeNet, etc. The file must contain the following CSV format for each line:
*Alternatively, a list of Ensembl formatted annotations of genes to GO terms can be supplied. From http://www.ensembl.org/biomart select Ensembl Genes XX and species of interest. In attributes select in the following order:
- Ensembl Gene ID
- Ensembl Transcript ID
- Associated Gene Name
- GO Term Accession
- GO Term Name
- GO Term Definition
- GO domain
- miRNA-gene interactions file, which has the following CSV format for each line:
*The user can also use the output of a miRanda target prediction run. This requires:
- FASTA sequences for known mature miRNA from http://www.mirbase.org/ftp.shtml filtered for species of interest
- FASTA sequences for 3' UTR of genes from http://www.biomart.org/
Select the following, in this order:
- Sequence Retrieval
- 3' UTR
- Ensembl Gene ID
- Ensembl Transcript ID
- Associated Gene Name
- miRanda software from http://www.microrna.org/microrna/getDownloads.do.
To generate the correct format for the script input, run miRanda as follows:
miranda hsa-mature-miRNA.fa ensembl3utr.txt -quiet | grep '>>hsa' > miRandaPredictions.txt
3.2. Script Execution
Navigate inside the folder containing the .py and .bin files and run the following command:
python bufet.py [OPTIONS]
By default, the Python script verifies that all input files exist, are not empty and have the correct format. Since the file check increases execution time, it can be disabled with the "--disable-file-check" option. However, we recommend keeping the file check enabled, since non-existent or empty files can crash the C++ core.
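The existence and emptiness part of this check can be pictured with the short sketch below. It is only an illustration of the idea: the function name `check_input_file` and the messages are hypothetical, and the actual validation in bufet.py (which also checks the file format) may differ.

```python
import os


def check_input_file(path):
    """Return an error message, or None if the file exists and is not empty.

    Hypothetical helper for illustration; not taken from bufet.py.
    """
    if not os.path.isfile(path):
        return "file not found: " + path
    if os.path.getsize(path) == 0:
        return "file is empty: " + path
    return None
```

Running such a check on every input path before starting a long analysis is cheap compared with a failed run.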
The script options are listed below:
- "-miRNA [filename]": path to the input miRNA file.
- "-interactions [filename]": path to miRNA-gene interactions file.
- "-synonyms [filename]": path to the gene synonym data file.
- "-ontology [filename]": path to gene annotation data file.
- "-output [filename]": path to output file. Created if it doesn't exist. Default filename: "output.txt"
- "-iterations [value]": number of random miRNA groups to test against. Default value: 10000
- "-processors [value]": the number of cores to be used in parallel. Default value: system cores-1.
- "-species [species_name]": specify either "human" or "mouse". Default species: "human". Further species can be added by following the instructions in section 3.4.
- "--ensGO": must be added when using GO ontology data supplied by Ensembl
- "--miRanda": must be added when using prediction data from a miRanda run.
- "-miScore [score]": miRanda score threshold if the miRanda mode is specified. Default score: "155"
- "-miFree [energy]": miRanda free energy threshold if the miRanda mode is specified. Default energy: "-20.0"
- "--disable-interactions-check": disables the validation for the interactions file (not recommended).
- "--disable-ontology-check": disables the validation for the ontology file (not recommended).
- "--disable-synonyms-check": disables the validation for the synonyms file (not recommended).
- "--disable-file-check": disables the validation for all files (not recommended).
- "-h" or "--help": print help message and exit
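When BUFET is launched from another Python program, the options above can be assembled into an argument list. The sketch below is one possible way to do this; the helper `build_bufet_command` and its parameter names are hypothetical, and only the command-line flags themselves come from the list above.

```python
def build_bufet_command(mirna, interactions, synonyms, ontology,
                        output="output.txt", iterations=10000,
                        processors=None, species="human",
                        ens_go=False, miranda=False):
    """Build the argument list for a bufet.py run (hypothetical helper)."""
    cmd = ["python", "bufet.py",
           "-miRNA", mirna,
           "-interactions", interactions,
           "-synonyms", synonyms,
           "-ontology", ontology,
           "-output", output,
           "-iterations", str(iterations),
           "-species", species]
    # Leave out -processors so that BUFET falls back to its own default.
    if processors is not None:
        cmd += ["-processors", str(processors)]
    if ens_go:
        cmd.append("--ensGO")
    if miranda:
        cmd.append("--miRanda")
    return cmd
```

The resulting list can be passed to `subprocess.check_call(cmd)` from the folder containing the .py and .bin files.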
- Download the code and compile it according to the instructions (See section "Compiling BUFET").
- Download synonym data from NCBI.
- Place all files in the same folder as the .py and .bin files.
- Assuming that your current folder contains the .py and .bin files and all input files are located in the example folder, run an experiment as follows (or add the right paths for each file, accordingly):
python bufet.py -interactions example/interactions_example.csv -ontology example/ontology_example.csv -output output.txt -miRNA XX -synonyms example/All_Mammalia.gene_info
where XX is one of the sample input miRNA files (example/input_example5.txt, example/input_example10.txt, example/input_example25.txt, example/input_example50.txt).
- The file "output.txt" contains the results of the analysis.
3.4. Adding more species
Open bufet.py and, in line 239, add the label you want to use for the species and the taxonomy ID inside the dictionary, enclosed by single or double quotes (', "). The script can now be run using the new label as an argument to the "-species" option.
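Adding a species amounts to adding one key-value pair to a Python dictionary. The sketch below only illustrates the shape of that change: the variable name `species_taxonomy` is hypothetical, and storing the IDs as quoted strings is an assumption, so check the actual dictionary in bufet.py before editing. The numbers are NCBI Taxonomy IDs (9606 for human, 10090 for mouse, 10116 for rat).

```python
# Hypothetical name; the dictionary in bufet.py may be named differently.
species_taxonomy = {
    "human": "9606",
    "mouse": "10090",
}

# New label mapped to its NCBI Taxonomy ID (10116 is Rattus norvegicus).
species_taxonomy["rat"] = "10116"
```

With such an entry in place, the new label would be passed on the command line as "-species rat".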
4. Reproduction of the BUFET paper's experiments
- Download the code and compile it according to the instructions (See section "Compiling BUFET").
- Files required for the reproduction of the experiments:
- GO gene annotations for Ensembl v.69
- microT miRNA-gene interactions for miRBase v.21 and Ensembl v.69
- miRanda miRNA-gene interactions for miRBase v.21 and Ensembl v.69
- miRNA input files
- Synonym data from NCBI
- Decompress the file:
- Copy all files and folders from the "reproduction_files/" directory, to the folder containing the .py and .bin files.
- Assuming that your current folder contains the .py, .bin, and all input files, run an experiment as follows (if not, modify the paths for each file, accordingly):
python bufet.py -interactions microT_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp1/miRNA-5.txt -synonyms All_Mammalia.gene_info -processors 1 -iterations 10000
This will run a BUFET analysis for an input of 5 miRNAs using microT interactions (random miRNA groups used: 10000, cores used: 1). The file "output.txt" contains the results of the analysis.
- Reproduce the rest of the experiments. Repeat the analysis for:
- every input miRNA file in folders exp1, exp2, ..., exp10 (miRNA-5.txt, miRNA-10.txt, miRNA-50.txt, miRNA-100.txt). Examples: "input/exp7/miRNA-50.txt", "input/exp4/miRNA-100.txt"
- both types of miRNA-gene interactions data files: microT_dataset.csv and miRanda_dataset.csv.
- 10000, 100000 and 1000000 random miRNA groups.
- 1 and 7 cores.
- (miRanda interactions, 10 miRNAs, 100000 random groups, 7 cores):
python bufet.py -interactions miRanda_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp8/miRNA-10.txt -synonyms All_Mammalia.gene_info -processors 7 -iterations 100000
- (microT interactions, 50 miRNAs, 1000000 random groups, 7 cores):
python bufet.py -interactions microT_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp3/miRNA-50.txt -synonyms All_Mammalia.gene_info -processors 7 -iterations 1000000
- (miRanda interactions, 100 miRNAs, 10000 random groups, 1 core):
python bufet.py -interactions miRanda_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp2/miRNA-100.txt -synonyms All_Mammalia.gene_info -processors 1 -iterations 10000
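The full set of combinations listed above can be enumerated programmatically rather than typed by hand. The sketch below generates one argument list per combination, assuming the folder layout used in the commands above. The function name and the per-run output file names are hypothetical; the commands above all write to output.txt, which would be overwritten by each run.

```python
from itertools import product

EXPERIMENTS = ["exp%d" % i for i in range(1, 11)]   # exp1 ... exp10
GROUP_SIZES = [5, 10, 50, 100]                      # miRNA-5.txt ... miRNA-100.txt
DATASETS = ["microT_dataset.csv", "miRanda_dataset.csv"]
ITERATIONS = [10000, 100000, 1000000]
CORES = [1, 7]


def reproduction_commands():
    """Yield one bufet.py argument list per experiment combination."""
    for exp, size, dataset, iters, cores in product(
            EXPERIMENTS, GROUP_SIZES, DATASETS, ITERATIONS, CORES):
        # Hypothetical naming scheme so that runs do not overwrite each other.
        output = "output_%s_%d_%s_%d_%d.txt" % (
            exp, size, dataset.split("_")[0], iters, cores)
        yield ["python", "bufet.py",
               "-interactions", dataset,
               "-ontology", "annotation_dataset.csv",
               "-output", output,
               "-miRNA", "input/%s/miRNA-%d.txt" % (exp, size),
               "-synonyms", "All_Mammalia.gene_info",
               "-processors", str(cores),
               "-iterations", str(iters)]
```

Each yielded list can be run with `subprocess.check_call(cmd)`. Running the commands one at a time keeps the 1-core and 7-core timings comparable.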
For any problems with the execution of this code please contact us at firstname.lastname@example.org