Scripts and datasets to reproduce the experiments reported in the paper: “Deep learning predicts non-coding RNA functions from only raw sequence data”
Two datasets have been adopted in the experiments: Rfam novel, a newly generated dataset of sequences downloaded from the Rfam database; and RNAGCN/nRC, the dataset made publicly available by (Fiannaca et al. 2017) and adopted by the authors of RNAGCN (Rossi et al. 2019). Raw data are available in datasets. To generate the initial set of fasta files, the Rfam novel dataset must first be prepared with the R script dataset-preparation.R. Run the script from the directory that contains it:
Rscript dataset-preparation.R
To run this script, a working R environment with the Biostrings and ggplot2 packages is needed. The script generates, in the same directory, three fasta files (x_train.fasta, x_val.fasta, and x_test.fasta) used by the subsequent scripts, and class-distribution.pdf, the distribution graph of sequences among Rfam classes shown in the paper.
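As a quick sanity check after running the R script, the generated files can be inspected from Python. The snippet below is a minimal sketch and not part of this repository: `read_fasta` and `summarize_splits` are hypothetical helper names, and the parser assumes plain FASTA with `>` header lines.

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) pairs read from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize_splits(directory=".",
                     names=("x_train.fasta", "x_val.fasta", "x_test.fasta")):
    """Return {file name: number of sequences} for the split files that exist."""
    summary = {}
    for name in names:
        path = Path(directory) / name
        if path.exists():
            summary[name] = len(read_fasta(path))
    return summary
```

Calling `summarize_splits()` in the directory where the R script was run should list all three files; a missing key suggests the preparation step did not complete.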
To run the experiments a working Python 3 environment with the following libraries is necessary:
- tensorflow (1.13.1)
- sklearn (0.21.3)
- numpy (1.16.2)
- matplotlib (3.1.1)
- pandas (0.25.0)
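To confirm that an environment matches the versions listed above, a small check along these lines may help. It is only a sketch: `check_versions` is a hypothetical helper, and it relies on each library exposing a `__version__` attribute, which is an assumption about the installed packages.

```python
import importlib

# Versions listed in this README, keyed by importable module name.
EXPECTED = {
    "tensorflow": "1.13.1",
    "sklearn": "0.21.3",
    "numpy": "1.16.2",
    "matplotlib": "3.1.1",
    "pandas": "0.25.0",
}


def check_versions(expected=EXPECTED):
    """Return {module: (wanted, found, matches)}; found is None if unavailable."""
    report = {}
    for module_name, wanted in expected.items():
        try:
            module = importlib.import_module(module_name)
            found = getattr(module, "__version__", None)
        except ImportError:
            found = None  # module is not installed
        report[module_name] = (wanted, found, found == wanted)
    return report
```

Printing the entries of `check_versions()` shows which libraries are missing or differ from the listed versions. Other versions may also work, but they are not the ones stated here.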
Included modules:
- ExpConfiguration.py, contains settings for the experiments and encoder configurations.
- modelUtils.py, includes functions to build the standard and the improved CNN architectures.
- seqEncoders.py, a collection of functions to encode sequences into k-mers and spatial-curves representations.
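To give an idea of the two families of representations mentioned above, the following sketch splits a sequence into overlapping k-mers and lays its symbols out on a square grid along a snake or Morton (Z-order) curve. It is an illustration under stated assumptions, not the code in seqEncoders.py: all function names are hypothetical, the Morton variant assumes a power-of-two grid side, and the Hilbert curve is omitted for brevity.

```python
def kmers(sequence, k=3):
    """Return the overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def snake_coords(side):
    """Visit a side x side grid row by row, reversing direction on odd rows."""
    coords = []
    for row in range(side):
        cols = range(side) if row % 2 == 0 else range(side - 1, -1, -1)
        coords.extend((row, col) for col in cols)
    return coords


def morton_coords(side):
    """Visit a side x side grid in Z-order (side must be a power of two)."""
    coords = []
    for d in range(side * side):
        row = col = 0
        bit = 0
        # De-interleave the index: even bits give the column, odd bits the row.
        while (d >> (2 * bit)) > 0:
            col |= ((d >> (2 * bit)) & 1) << bit
            row |= ((d >> (2 * bit + 1)) & 1) << bit
            bit += 1
        coords.append((row, col))
    return coords


def place_on_grid(symbols, coords, side, fill=None):
    """Place symbols on a grid following coords; unused cells keep fill."""
    grid = [[fill] * side for _ in range(side)]
    for symbol, (row, col) in zip(symbols, coords):
        grid[row][col] = symbol
    return grid
```

For example, `place_on_grid("ACGU", snake_coords(2), 2)` arranges the four nucleotides on a 2 x 2 grid so that neighbours in the sequence stay adjacent in the grid.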
The Python notebook datasets.ipynb generates all the data, in numpy format, necessary to run the experiments for both the Rfam novel and RNAGCN/nRC datasets. The notebook is self-explanatory and creates the necessary train, validation, and test splits for each combination of boundary noise (0, 25, 50, 75, 100), padding (new, random, constant), and encoder (K-mer, Snake, Morton, Hilbert).
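The combinations listed above form a grid of 5 x 3 x 4 = 60 configurations per dataset. A minimal sketch of how such a grid can be enumerated is shown below; the constant and key names are hypothetical and do not necessarily match those used in the notebook.

```python
from itertools import product

# Values as listed in this README; the identifiers themselves are illustrative.
BOUNDARY_NOISE = (0, 25, 50, 75, 100)
PADDING = ("new", "random", "constant")
ENCODERS = ("K-mer", "Snake", "Morton", "Hilbert")


def all_configurations():
    """Return one dict per (noise, padding, encoder) combination."""
    return [
        {"noise": noise, "padding": padding, "encoder": encoder}
        for noise, padding, encoder in product(BOUNDARY_NOISE, PADDING, ENCODERS)
    ]
```

Iterating over `all_configurations()` is one way to loop over every setting when generating or loading the corresponding splits.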
The Python notebooks experiment-Rfam-novel.ipynb, experiment-dataset-nRC.ipynb, and experiment-dropout-rejection.ipynb provide everything needed to run the experiments described in the paper on the Rfam novel and RNAGCN/nRC datasets. The Python notebook experiment-dataset-nRC-ImprovedModel.ipynb provides everything needed to run the experiments with the improved architecture on the RNAGCN/nRC dataset.
All experiment results are organized into a Python dictionary and then stored, using the pickle library, in a file in the results directory. The results shown in the paper are provided in this directory for convenience.
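The storage pattern described above can be sketched as follows. The helper names and any file name passed to them are hypothetical; the actual file names and dictionary layout are defined in the experiment notebooks.

```python
import pickle
from pathlib import Path


def save_results(results, path):
    """Pickle a results dictionary, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(results, handle)


def load_results(path):
    """Load a results dictionary previously written with save_results."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Since unpickling can execute arbitrary code, only load result files from a trusted source, such as those shipped with this repository.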
The Python notebook figures-tables.ipynb provides the scripts needed to generate the figures and tables summarizing the results obtained and stored by the experiment notebooks. Supplemental figures and tables can be easily generated by changing the parameters at the beginning of each cell.