
Boosting the Unbiased Functional Enrichment Analysis (BUFET)

Table of contents

  1. Introduction
      1.1. Publication
      1.2. Citation
  2. Compiling BUFET
  3. Executing BUFET
      3.1. Files Required
      3.2. Script Execution
      3.3. Example
      3.4. Adding more species
      3.5. Print target genes involved in each annotation class (NEW!)
      3.6. Disable synonym matching (NEW!)
  4. Reproduction of the BUFET paper's experiments
  5. Funding
  6. Contact

1. Introduction

BUFET is open-source software released under the GPL v.3 licence, designed to speed up the unbiased miRNA enrichment analysis algorithm described by Bleazard et al. in their paper.

The BUFET algorithm generates an empirical distribution of genes targeted by miRNAs and calculates p-values for related biological processes. Benjamini-Hochberg FDR correction is then applied, and results are marked with '*' or '**' for significance at 0.05 FDR and 0.01 FDR, respectively.
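
For readers unfamiliar with the correction step, the following is a minimal Python sketch of the standard Benjamini-Hochberg adjustment and of the star labelling described above. It is not taken from the BUFET source: the function names are illustrative, and the use of strict "<" comparisons at the 0.05 and 0.01 thresholds is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significance_label(fdr):
    """Map an adjusted p-value to '**', '*' or '' (thresholds assumed strict)."""
    if fdr < 0.01:
        return "**"
    if fdr < 0.05:
        return "*"
    return ""


if __name__ == "__main__":
    raw = [0.001, 0.02, 0.03, 0.5]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(p, round(q, 4), significance_label(q))
```

Each raw p-value is scaled by the number of tests divided by its rank, and the backward pass ensures that a smaller raw p-value never ends up with a larger adjusted value than a bigger one.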

1.1. Publication

The publication for BUFET can be found here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1812-8

1.2. Citation

Please cite:
Konstantinos Zagganas, Thanasis Vergoulis, Ioannis S. Vlachos, Maria D. Paraskevopoulou, Spiros Skiadopoulos and Theodore Dalamagas. BUFET: boosting the unbiased miRNA functional enrichment analysis using bitsets. BMC Bioinformatics volume 18, page 399, doi 10.1186/s12859-017-1812-8, 2017.

2. Compiling BUFET

In order for the program to run, the system must comply with the following specifications:

Hardware:

  • A system with at least 4GB of RAM

Software:

  • Linux or macOS
  • A Python interpreter (version 2.7 or later) that can be run from the command line.
  • g++ 4.8 or later.

Before you can run the BUFET script, you need to compile the C++ program file. A Makefile is provided for this purpose. The process is as follows:

  1. Download the code and unzip the files.
  2. From the command line, navigate inside the folder containing the .cpp, .py and Makefile files.
  3. Run the following command:
    make
This will compile the code and create a .bin file. The .bin file must be in the same folder as the .py file at all times, in order for the program to execute correctly.

3. Executing BUFET

3.1. Files required

  1. Input miRNA file, which is a text file containing only the names of differentially expressed miRNAs, each on a separate line. For example:
    hsa-miR-132-5p
    hsa-miR-132-3p
  2. Gene synonym data file from NCBI, http://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/All_Mammalia.gene_info.gz. Decompress the file with:
    gzip -d All_Mammalia.gene_info.gz
  3. Gene annotation data file retrieved from GO, KEGG, PANTHER, DisGeNet, etc. Each line of the file must have the following format, with fields separated by the "|" character:
    gene_name|pathway_id|pathway_name
    *Alternatively, a list of Ensembl formatted annotations of genes to GO terms can be supplied. From http://www.ensembl.org/biomart select Ensembl Genes XX and species of interest. In attributes select in the following order:
    • Ensembl Gene ID
    • Ensembl Transcript ID
    • Associated Gene Name
    • GO Term Accession
    • GO Term Name
    • GO Term Definition
    • GO domain
    Note that in this case you will need to use the "--ensGO" option in order for the script to execute correctly!
  4. miRNA-gene interactions file, which has the following CSV format for each line:
    miRNA_name|gene_name
    *Alternatively, the output of a miRanda target prediction run can be used. This requires:
    • FASTA sequences for known mature miRNA from http://www.mirbase.org/ftp.shtml filtered for species of interest
    • FASTA sequences for the 3' UTR of genes from http://www.biomart.org/. Select the following, in this order:
      • Sequence Retrieval
      • 3' UTR
      • headers
      • Ensembl Gene ID
      • Ensembl Transcript ID
      • Associated Gene Name
      After the file has been downloaded, remove entries with "Sequence Unavailable".
    • miRanda software from http://www.microrna.org/microrna/getDownloads.do. To generate the correct format for the script input, run miRanda as follows:
      miranda hsa-mature-miRNA.fa ensembl3utr.txt -quiet | grep '>>hsa' > miRandaPredictions.txt
    In this case you will need to use the "--miRanda" option in order for the script to execute correctly!
Note that all files listed above can contain header lines starting with the "#" character.
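
The "|"-separated formats above can be checked with a few lines of Python before starting a long run. The sketch below is illustrative only and is not BUFET's own file check: the function name `read_pipe_delimited` is hypothetical, and it assumes that field values never contain the "|" character themselves.

```python
def read_pipe_delimited(lines, n_fields):
    """Parse '|'-separated records, skipping blank lines and '#' header lines.

    `lines` can be an open file or any iterable of strings.
    `n_fields` is 3 for the annotation file and 2 for the interactions file.
    """
    records = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # header or empty line
        fields = line.split("|")
        if len(fields) != n_fields:
            raise ValueError(
                "line {}: expected {} fields, found {}".format(
                    number, n_fields, len(fields)))
        records.append(tuple(fields))
    return records


if __name__ == "__main__":
    sample = [
        "# gene_name|pathway_id|pathway_name",
        "ADCY4|GO:0034199|example pathway name",
    ]
    print(read_pipe_delimited(sample, 3))
```

For a file on disk, pass the open file object, for example `read_pipe_delimited(open("example/ontology_example.csv"), 3)`.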

3.2. Script Execution

Navigate inside the folder containing the .py and .bin files and run the following command:

python bufet.py [OPTIONS]

By default, the Python script verifies that all input files exist, that they are not empty and that they have the correct format. Since the file check increases execution time, it can be disabled with the "--disable-file-check" option. However, we recommend keeping the file check enabled, since missing or empty files can crash the C++ core.

The script options are listed below:

  • "-miRNA [filename]": path to the input miRNA file.
  • "-interactions [filename]": path to miRNA-gene interactions file.
  • "-synonyms [filename]": path to the gene synonym data file.
  • "-ontology [filename]": path to gene annotation data file.
  • "-output [filename]": path to output file. Created if it doesn't exist. Default filename: "output.txt"
  • "-iterations [value]": number of random miRNA groups to test against. Default value: 10000
  • "-processors [value]": the number of cores to be used in parallel. Default value: system cores-1.
  • "-species [species_name]": specify either "human" or "mouse". Default species: "human". Further species can be added by following the instructions in section 3.4.
  • "--ensGO": must be added when using GO ontology data supplied by Ensembl
  • "--miRanda": must be added when using prediction data from a miRanda run.
  • "-miScore [score]": miRanda score threshold if the miRanda mode is specified. Default score: "155"
  • "-miFree [energy]": miRanda free energy threshold if the miRanda mode is specified. Default energy: "-20.0"
  • "--disable-interactions-check": disables the validation for the interactions file (not recommended).
  • "--disable-ontology-check": disables the validation for the ontology file (not recommended).
  • "--disable-synonyms-check": disables the validation for the synonyms file (not recommended).
  • "--disable-file-check": disables the validation for all files (not recommended).
  • "--no-synonyms": disables the use of synonyms (the synonyms file is not required by the script).
  • "-h" or "--help": print help message and exit
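
When BUFET is called from another Python program, the options above can be assembled into an argument list. The following is a sketch under stated assumptions: `build_bufet_command` is a hypothetical helper, not part of BUFET, and it only uses options listed in this section. It assumes bufet.py is in the current folder.

```python
import sys


def build_bufet_command(mirna, interactions, ontology, synonyms=None,
                        output="output.txt", iterations=10000,
                        processors=None, species="human"):
    """Build the argument list for one bufet.py run (illustrative helper)."""
    cmd = [sys.executable, "bufet.py",
           "-miRNA", mirna,
           "-interactions", interactions,
           "-ontology", ontology,
           "-output", output,
           "-iterations", str(iterations),
           "-species", species]
    if synonyms is None:
        # Without a synonyms file, synonym matching has to be switched off.
        cmd.append("--no-synonyms")
    else:
        cmd += ["-synonyms", synonyms]
    if processors is not None:
        # When omitted, BUFET falls back to its own default (system cores-1).
        cmd += ["-processors", str(processors)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_bufet_command(
        "example/input_example5.txt",
        "example/interactions_example.csv",
        "example/ontology_example.csv",
        synonyms="example/All_Mammalia.gene_info")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting problems with file paths that contain spaces.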

3.3. Example

  1. Download the code and compile it according to the instructions (See section "Compiling BUFET").
  2. Download synonym data from NCBI.
  3. Place all files in the same folder as the .py and .bin files.
  4. Assuming that your current folder contains the .py and .bin files and all input files are located in the example folder, run an experiment as follows (or add the right paths for each file, accordingly):
    python bufet.py -interactions example/interactions_example.csv -ontology example/ontology_example.csv -output output.txt -miRNA XX -synonyms example/All_Mammalia.gene_info
    where XX is one of the sample input miRNA files (example/input_example5.txt, example/input_example10.txt, example/input_example25.txt, example/input_example50.txt).
  5. The file "output.txt" contains the results of the analysis.

3.4. Adding more species

Open bufet.py and, at line 240, add the label you want to use for the species and its taxonomy ID to the dictionary, enclosed in single or double quotes (', "). The script can then be run using the new label as the argument to the "-species" option.
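
As an illustration, adding rat (NCBI taxonomy ID 10116) might look like the sketch below. The dictionary name `species_taxonomy` is hypothetical, and storing both labels and IDs as quoted strings is an assumption; check the actual dictionary in bufet.py and follow the form of its existing entries.

```python
# Hypothetical stand-in for the species dictionary in bufet.py.
# The real variable name and value types may differ.
species_taxonomy = {
    "human": "9606",   # NCBI taxonomy ID for Homo sapiens
    "mouse": "10090",  # NCBI taxonomy ID for Mus musculus
}

# New entry: label to use with "-species", then the NCBI taxonomy ID.
species_taxonomy["rat"] = "10116"  # Rattus norvegicus

print(sorted(species_taxonomy))
```

With such an entry in place, the script would be called with "-species rat".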

3.5. Print target genes involved in each annotation class (NEW!)

Use the argument
--print-involved-genes
to print the common genes between each annotation class (GO category) and each miRNA in the sample under examination in a file. You can also use the
-involved-genes-filename < filename >
argument to specify an output file name other than the default (involved-genes.txt). More specifically, the structure of the output is:
>OntologyClass1
miRNA1     Gene1,Gene2,Gene3
miRNA2     
miRNAn     Gene5,Gene6,Gene1

An example of the output can be seen below:

>GO:0034199
hsa-miR-6834-3p 
hsa-miR-3142    
hsa-miR-4306    ADCY4,PRKAR2A,ADCY2
hsa-miR-3613-3p PRKAR2A,PRKAR2B,ADCY8,PRKAR1A
hsa-miR-30d-3p  
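
For downstream processing, this output can be loaded into a nested dictionary. The sketch below is based only on the structure shown above: `parse_involved_genes` is a hypothetical helper, and it assumes that the miRNA name and the gene list are separated by whitespace (tabs or spaces).

```python
def parse_involved_genes(lines):
    """Parse involved-genes output into {class_id: {miRNA: [genes]}}."""
    result = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new annotation class.
            current = line[1:].strip()
            result[current] = {}
            continue
        if current is None:
            raise ValueError("miRNA line found before any '>' header")
        parts = line.split(None, 1)
        # A miRNA with no target genes in the class has no second column.
        genes = parts[1].strip().split(",") if len(parts) > 1 else []
        result[current][parts[0]] = genes
    return result


if __name__ == "__main__":
    sample = [
        ">GO:0034199",
        "hsa-miR-3142    ",
        "hsa-miR-4306    ADCY4,PRKAR2A,ADCY2",
    ]
    print(parse_involved_genes(sample))
```

To parse a real run, pass the open output file, for example `parse_involved_genes(open("involved-genes.txt"))`.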

3.6. Disable gene synonym matching (NEW!)

Use the argument
--no-synonyms
to disable matching for gene synonyms. In this case a synonyms file is not required and if one is provided, it will be ignored.

4. Reproduction of the BUFET paper's experiments

  1. Download the code and compile it according to the instructions (See section "Compiling BUFET").
  2. Files required for the reproduction of the experiments:
    • GO gene annotations for Ensembl v.69
    • microT miRNA-gene interactions for miRBase v.21 and Ensembl v.69
    • miRanda miRNA-gene interactions for miRBase v.21 and Ensembl v.69
    • miRNA input files
    • Synonym data from NCBI
    All files listed above can be found in a compressed file here.
  3. Decompress the file:
    unzip reproduction_files.zip
  4. Copy all files and folders from the "reproduction_files/" directory, to the folder containing the .py and .bin files.
  5. Assuming that your current folder contains the .py, .bin, and all input files, run an experiment as follows (if not, modify the paths for each file, accordingly):
    python bufet.py -interactions microT_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp1/miRNA-5.txt -synonyms All_Mammalia.gene_info -processors 1 -iterations 10000
    This will run a BUFET analysis for an input of 5 miRNAs using microT interactions (random miRNA groups used: 10000, cores used: 1). The file "output.txt" contains the results of the analysis.
  6. Reproduce the rest of the experiments. Repeat the analysis for:
    • every input miRNA file in folders exp1, exp2, ..., exp10 (miRNA-5.txt, miRNA-10.txt, miRNA-50.txt, miRNA-100.txt). Examples: "input/exp7/miRNA-50.txt", "input/exp4/miRNA-100.txt"
    • both types of miRNA-gene interactions data files: microT_dataset.csv and miRanda_dataset.csv.
    • 10000, 100000 and 1000000 random miRNA groups.
    • 1 and 7 cores.

    Examples:

    • (miRanda interactions, 10 miRNAs, 100000 random groups, 7 cores):
      python bufet.py -interactions miRanda_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp8/miRNA-10.txt -synonyms All_Mammalia.gene_info -processors 7 -iterations 100000
    • (microT interactions, 50 miRNAs, 1000000 random groups, 7 cores):
      python bufet.py -interactions microT_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp3/miRNA-50.txt -synonyms All_Mammalia.gene_info -processors 7 -iterations 1000000
    • (miRanda interactions, 100 miRNAs, 10000 random groups, 1 core):
      python bufet.py -interactions miRanda_dataset.csv -ontology annotation_dataset.csv -output output.txt -miRNA input/exp2/miRNA-100.txt -synonyms All_Mammalia.gene_info -processors 1 -iterations 10000
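
The full set of combinations listed in step 6 can be enumerated programmatically. The following sketch generates one command per combination using the file names from the examples above. The helper name `reproduction_commands` and the per-run output file names are assumptions, introduced so that successive runs do not overwrite "output.txt".

```python
from itertools import product


def reproduction_commands():
    """Return one bufet.py argument list per experiment combination."""
    datasets = ["microT_dataset.csv", "miRanda_dataset.csv"]
    experiments = range(1, 11)           # folders exp1 ... exp10
    sizes = [5, 10, 50, 100]             # miRNA-5.txt ... miRNA-100.txt
    iteration_counts = [10000, 100000, 1000000]
    core_counts = [1, 7]
    commands = []
    for dataset, exp, size, iters, cores in product(
            datasets, experiments, sizes, iteration_counts, core_counts):
        # Hypothetical naming scheme: one output file per combination.
        output = "output_{}_exp{}_{}_{}_{}.txt".format(
            dataset.split("_")[0], exp, size, iters, cores)
        commands.append([
            "python", "bufet.py",
            "-interactions", dataset,
            "-ontology", "annotation_dataset.csv",
            "-output", output,
            "-miRNA", "input/exp{}/miRNA-{}.txt".format(exp, size),
            "-synonyms", "All_Mammalia.gene_info",
            "-processors", str(cores),
            "-iterations", str(iters),
        ])
    return commands


if __name__ == "__main__":
    for cmd in reproduction_commands()[:3]:
        print(" ".join(cmd))
```

Each list can be run with `subprocess.run(cmd, check=True)` from the folder containing the .py and .bin files. Runs with 1000000 random groups can take considerably longer than the others, so it may be preferable to run them in separate batches.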

5. Funding

This work was funded by the European Commission under the Research Infrastructure (H2020) programme (project: ELIXIR-EXCELERATE, grant: GA676559).

6. Contact

For any problems with the execution of this code please contact us at zagganas@athenarc.gr
