Boosting the Unbiased Functional Enrichment Analysis (BUFET)

Table of contents

  1. Introduction
      1.1. Publication
      1.2. Citation
  2. Compiling BUFET
  3. Executing BUFET
      3.1. Files required
      3.2. Script Execution
      3.3. Example
      3.4. Adding more species
  4. Reproduction of the BUFET paper's experiments
  5. Contact

1. Introduction

BUFET is open-source software released under the GPL v.3 licence. It is designed to speed up the unbiased miRNA enrichment analysis algorithm described by Bleazard et al. in their paper.

The BUFET algorithm generates an empirical distribution of genes targeted by miRNAs and calculates p-values for the related biological processes. After Benjamini-Hochberg FDR correction, results are marked with '*' or '**' for significance at 0.05 FDR and 0.01 FDR, respectively.
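
As an illustration of the correction step mentioned above, the following is a minimal Python sketch of the standard Benjamini-Hochberg procedure followed by the '*' / '**' marking. It is not taken from the BUFET source code: the function names are hypothetical, and treating the thresholds as inclusive (<=) is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay
    # monotone: q_(k) = min(q_(k+1), p_(k) * m / k).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / float(rank))
        adjusted[i] = running_min
    return adjusted


def significance_marker(q_value):
    """Map an adjusted p-value to a marker (inclusive thresholds are assumed)."""
    if q_value <= 0.01:
        return "**"
    if q_value <= 0.05:
        return "*"
    return ""
```

For example, `[significance_marker(q) for q in benjamini_hochberg(p_values)]` yields one marker per tested biological process.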

1.1. Publication

The publication for BUFET can be found here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1812-8

1.2. Citation

Please cite:
Konstantinos Zagganas, Thanasis Vergoulis, Ioannis S. Vlachos, Maria D. Paraskevopoulou, Spiros Skiadopoulos and Theodore Dalamagas. BUFET: boosting the unbiased miRNA functional enrichment analysis using bitsets. BMC Bioinformatics volume 18, page 399, doi 10.1186/s12859-017-1812-8, 2017.

2. Compiling BUFET

In order for the program to run, the system must comply with the following specifications:

Hardware:

  • A system with at least 4GB of RAM

Software:

  • Linux or macOS
  • Python interpreter (version 2.7 or above) that can be run from the command line.
  • g++ 4.8 or above.

Before you can run the BUFET script, you must compile the C++ program file. A Makefile is provided for this purpose. The process is as follows:

  1. Download the code and unzip the files.
  2. From the command line, navigate inside the folder containing the .cpp, .py and Makefile files.
  3. Run the following command:
    make
This will compile the code and create a .bin file. The .bin file must be in the same folder as the .py file at all times, in order for the program to execute correctly.

3. Executing BUFET

3.1. Files required

  1. Input miRNA file, which is a text file containing only the names of differentially expressed miRNAs, each on a separate line. For example:
    hsa-miR-132-5p
    hsa-miR-132-3p
  2. Gene synonym data file from NCBI, http://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/All_Mammalia.gene_info.gz. Decompress the file with:
    gzip -d All_Mammalia.gene_info.gz
  3. Gene annotation data file retrieved from GO, KEGG, PANTHER, DisGeNet, etc. Each line of the file must have the following CSV format:
    gene_name|pathway_id|pathway_name
    *Alternatively, a list of Ensembl-formatted annotations of genes to GO terms can be supplied. From http://www.ensembl.org/biomart select Ensembl Genes XX and the species of interest. Under attributes, select the following, in this order:
    • Ensembl Gene ID
    • Ensembl Transcript ID
    • Associated Gene Name
    • GO Term Accession
    • GO Term Name
    • GO Term Definition
    • GO domain
    Note that in this case you will need to use the "--ensGO" option in order for the script to execute correctly!
  4. miRNA-gene interactions file, which has the following CSV format for each line:
    miRNA_name|gene_name
    *Alternatively, the output of a miRanda target prediction run can be used. This requires:
    • FASTA sequences for known mature miRNA from http://www.mirbase.org/ftp.shtml filtered for species of interest
    • FASTA sequences for the 3' UTR of genes from http://www.biomart.org/. Select the following, in this order:
      • Sequence Retrieval
      • 3' UTR
      • headers
      • Ensembl Gene ID
      • Ensembl Transcript ID
      • Associated Gene Name
      After the file has been downloaded, remove entries with "Sequence Unavailable".
    • miRanda software from http://www.microrna.org/microrna/getDownloads.do. To generate the correct format for script input, run it as follows:
      miranda hsa-mature-miRNA.fa ensembl3utr.txt -quiet | grep '>>hsa' > miRandaPredictions.txt
    In this case you will need to use the "--miRanda" option in order for the script to execute correctly!
Note that all files listed above can contain header lines starting with the "#" character.
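
The step in item 4 that removes entries with "Sequence Unavailable" from the downloaded 3' UTR FASTA file could be sketched in Python as follows. This is a hedged illustration, not part of BUFET: the function names are hypothetical, and the sketch assumes that the marker text appears as its own sequence line under the record header and that a case-insensitive match is acceptable.

```python
def drop_unavailable(lines, marker="sequence unavailable"):
    """Return FASTA lines, skipping records whose sequence is the marker text."""
    # Group the lines into records; a line starting with ">" opens a new record.
    records = []
    for line in lines:
        if line.startswith(">") or not records:
            records.append([line])
        else:
            records[-1].append(line)
    kept = []
    for record in records:
        # Compare every sequence line (everything after the header) to the marker.
        sequence = [part.strip().lower() for part in record[1:]]
        if marker not in sequence:
            kept.extend(record)
    return kept


def filter_fasta_file(source_path, target_path):
    """Write a copy of source_path without the unavailable records."""
    with open(source_path) as source:
        kept = drop_unavailable(source.readlines())
    with open(target_path, "w") as target:
        target.writelines(kept)
```

A call such as `filter_fasta_file("ensembl3utr_raw.txt", "ensembl3utr.txt")` would produce the file passed to miranda above; the name of the raw file is only a placeholder.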

3.2. Script Execution

Navigate inside the folder containing the .py and .bin files and run the following command:

python bufet.py [OPTIONS]

By default, the Python script verifies that all input files exist, that they are not empty and that they have the correct format. Since the file check increases execution time, it can be disabled with the "--disable-file-check" option. However, we recommend keeping the file check enabled, since missing or empty files can crash the C++ core.
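
To give an idea of what such a check involves, here is a minimal Python sketch that tests a "|"-delimited input file for existence, emptiness and field count. It is not the validation code in bufet.py: the function name, the error messages and the choice to skip blank lines are assumptions made for illustration.

```python
import os


def check_delimited_file(path, n_fields, delimiter="|"):
    """Return a list of problems found in the file (an empty list means OK)."""
    if not os.path.isfile(path):
        return ["file does not exist: %s" % path]
    if os.path.getsize(path) == 0:
        return ["file is empty: %s" % path]
    problems = []
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            # Skip blank lines (an assumption) and "#" header lines.
            if not line or line.startswith("#"):
                continue
            if len(line.split(delimiter)) != n_fields:
                problems.append("line %d: expected %d fields" % (number, n_fields))
    return problems
```

An interactions file (miRNA_name|gene_name) would be checked with `n_fields=2` and an annotation file (gene_name|pathway_id|pathway_name) with `n_fields=3`.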

The script options are listed below:

  • "-miRNA [filename]": path to the input miRNA file.
  • "-interactions [filename]": path to miRNA-gene interactions file.
  • "-synonyms [filename]": path to the gene synonym data file.
  • "-ontology [filename]": path to gene annotation data file.
  • "-output [filename]": path to output file. Created if it doesn't exist. Default filename: "output.txt"
  • "-iterations [value]": number of random miRNA groups to test against. Default value: 10000
  • "-processors [value]": the number of cores to be used in parallel. Default value: system cores-1.
  • "-species [species_name]": specify either "human" or "mouse". Default species: "human". Further species can be added by following the instructions in section 3.4.
  • "--ensGO": must be added when using GO ontology data supplied by Ensembl
  • "--miRanda": must be added when using prediction data from a miRanda run.
  • "-miScore [score]": miRanda score threshold if the miRanda mode is specified. Default score: "155"
  • "-miFree [energy]": miRanda free energy threshold if the miRanda mode is specified. Default energy: "-20.0"
  • "--disable-interactions-check": disables the validation for the interactions file (not recommended).
  • "--disable-ontology-check": disables the validation for the ontology file (not recommended).
  • "--disable-synonyms-check": disables the validation for the synonyms file (not recommended).
  • "--disable-file-check": disables the validation for all files (not recommended).
  • "-h" or "--help": print help message and exit

3.3. Example

  1. Download the code and compile it according to the instructions (See section "Compiling BUFET").
  2. Download synonym data from NCBI.
  3. Place all files in the same folder as the .py and .bin files.
  4. Assuming that your current folder contains the .py and .bin files and all input files are located in the example folder, run an experiment as follows (or add the right paths for each file, accordingly):
    python bufet.py -interactions example/interactions_example.csv -ontology example/ontology_example.csv -output output.txt -miRNA XX -synonyms example/All_Mammalia.gene_info
    where XX is one of the sample input miRNA files (example/input_example5.txt, example/input_example10.txt, example/input_example25.txt, example/input_example50.txt).
  5. The file "output.txt" contains the results of the analysis.

3.4. Adding more species

Open bufet.py and, in the dictionary at line 239, add the label you want to use for the species together with its taxonomy ID, enclosed in single or double quotes (' or "). The script can then be run using the new label as the argument to the "-species" option.
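
The kind of edit described above might look like the following Python sketch. The dictionary name and the use of strings for the IDs are assumptions, so check the actual code in bufet.py before editing; the taxonomy IDs themselves are the NCBI Taxonomy identifiers for these species.

```python
# Hypothetical stand-in for the species dictionary in bufet.py.
# The real variable name and value types may differ.
species_taxonomy = {
    "human": "9606",   # Homo sapiens
    "mouse": "10090",  # Mus musculus
    "rat": "10116",    # newly added entry: Rattus norvegicus
}


def taxonomy_id(label):
    """Look up the taxonomy ID for a label given to the "-species" option."""
    if label not in species_taxonomy:
        raise ValueError("unknown species label: %s" % label)
    return species_taxonomy[label]
```

With an entry like this in place, the new label would be passed as "-species rat".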

4. Reproduction of the BUFET paper's experiments

  1. Download the code and compile it according to the instructions (See section "Compiling BUFET").
  2. Files required for the reproduction of the experiments:
    • GO gene annotations for Ensembl v.69
    • microT miRNA-gene interactions for miRBase v.21 and Ensembl v.69
    • miRanda miRNA-gene interactions for miRBase v.21 and Ensembl v.69
    • miRNA input files
    • Synonym data from NCBI
    All files listed above can be found in a compressed file here.
  3. Decompress the file:
    unzip reproduction_files.zip
  4. Copy all files and folders from the "reproduction_files/" directory to the folder containing the .py and .bin files.
  5. Assuming that your current folder contains the .py, .bin, and all input files, run an experiment as follows (if not, modify the paths for each file, accordingly):
    python bufet.py -interactions microT_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp1/miRNA-5.txt -synonyms All_Mammalia.gene_info -processors 1 -iterations 10000
    This will run a BUFET analysis for an input of 5 miRNAs using microT interactions (random miRNA groups used: 10000, cores used: 1). The file "output.txt" contains the results of the analysis.
  6. Reproduce the rest of the experiments. Repeat the analysis for:
    • every input miRNA file in folders exp1, exp2, ..., exp10 (miRNA-5.txt, miRNA-10.txt, miRNA-50.txt, miRNA-100.txt). Examples: "input/exp7/miRNA-50.txt", "input/exp4/miRNA-100.txt"
    • both types of miRNA-gene interactions data files: microT_dataset.csv and miRanda_dataset.csv.
    • 10000, 100000 and 1000000 random miRNA groups.
    • 1 and 7 cores.

    Examples:

    • (miRanda interactions, 10 miRNAs, 100000 random groups, 7 cores):
      python bufet.py -interactions miRanda_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp8/miRNA-10.txt -synonyms All_Mammalia.gene_info -processors 7 -iterations 100000
    • (microT interactions, 50 miRNAs, 1000000 random groups, 7 cores):
      python bufet.py -interactions microT_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp3/miRNA-50.txt -synonyms All_Mammalia.gene_info -processors 7 -iterations 1000000
    • (miRanda interactions, 100 miRNAs, 10000 random groups, 1 core):
      python bufet.py -interactions miRanda_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp2/miRNA-100.txt -synonyms All_Mammalia.gene_info -processors 1 -iterations 10000
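
The full set of runs listed in step 6 can be enumerated programmatically. The Python sketch below builds one command per combination, following the pattern of the examples above. The function name and the per-run output file names are hypothetical (the examples above all write to output.txt); distinct names are used here only so that results are not overwritten.

```python
import itertools


def experiment_commands():
    """Yield one bufet.py command (as an argument list) per experiment."""
    experiments = range(1, 11)                      # folders exp1 ... exp10
    sizes = (5, 10, 50, 100)                        # miRNA-5.txt ... miRNA-100.txt
    datasets = ("microT_dataset.csv", "miRanda_dataset.csv")
    iterations = (10000, 100000, 1000000)           # random miRNA groups
    cores = (1, 7)
    grid = itertools.product(experiments, sizes, datasets, iterations, cores)
    for exp, size, dataset, iters, procs in grid:
        # Hypothetical naming scheme, one output file per run.
        output = "output_exp%d_miRNA-%d_%s_%d_%dcores.txt" % (
            exp, size, dataset.split("_")[0], iters, procs)
        yield [
            "python", "bufet.py",
            "-interactions", dataset,
            "-ontology", "annotation_dataset.csv",
            "-output", output,
            "-miRNA", "input/exp%d/miRNA-%d.txt" % (exp, size),
            "-synonyms", "All_Mammalia.gene_info",
            "-processors", str(procs),
            "-iterations", str(iters),
        ]
```

Each yielded list can be passed to `subprocess.call` from the folder containing the .py and .bin files. The grid contains 10 x 4 x 2 x 3 x 2 = 480 runs.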

5. Contact

For any problems with the execution of this code please contact us at zagganas@imis.athena-innovation.gr
