Analysis workflow for smallRNA sequenced data

miRPursuit

Check out our Read the Docs page for a more structured overview of this project.

miRPursuit – a pipeline for automated analyses of small RNAs in non-model plants

    Inês Chaves a,b,*,φ , Bruno Costa a,b,φ , Andreia S. Rodrigues a,b , Andreas Bohn a,b , Célia M. Miguel a,b,c
    a iBET, Instituto de Biologia Experimental e Tecnológica, Apartado 12, 2781-901 Oeiras, Portugal
    b Instituto de Tecnologia Química e Biológica António Xavier, Universidade Nova de Lisboa, Av. República, 2780-157 Oeiras, Portugal
    c Biosystems & Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa (FCUL), Campo Grande, 1749-016, Lisboa, Portugal
      φ These authors contributed equally to this work.

Abstract

miRPursuit is a pipeline developed for running end-to-end analyses of high-throughput small RNA (sRNA) sequence data in model and non-model plants, from raw data to identified and annotated conserved and novel sequences. It consists of a series of UNIX shell scripts that connect the inputs and outputs of several established, open-source sRNA analysis tools. In this way, high customizability and full transparency of the analyses and their parameters are combined with convenient workflow management, including for users without advanced computational skills. One considerable advantage is that several sRNA libraries can be processed in parallel.

Small non-coding RNAs (sRNAs) are pivotal in the regulation of gene expression during plant growth and development, and in response to abiotic and biotic stresses. The affordable, high-throughput sequencing provided by NGS platforms is an attractive approach to discover the small RNAs involved in the regulation of important biological processes in plants. However, the amount of data generated by such studies can be staggering and requires efficient tools to analyze it quickly.

This pipeline has been built around a publicly available software package, the University of East Anglia sRNA workbench[1], which includes various tools for identifying sRNA classes, such as microRNAs (miRNAs) and trans-acting siRNAs (tasiRNAs), both conserved and novel, and for predicting their precursor RNA using a user-specified reference genome. Moreover, target genes can be predicted and validated using degradome fragment sequences and a reference transcriptome.

By setting up a workflow, a predefined sequence of tools can be run autonomously. The raw NGS data obtained from various libraries can be supplied as input files, allowing the user to process multiple libraries in a single command-line interaction. The degree of customization in this pipeline makes it possible to fine-tune the workflow and to use user-supplied omics data.

Thus, the main advantage of using this system over the workbench's individual tools is that it minimizes the need for repetitive manual tasks. The pipeline automatically connects each step by handling the data flow between tools. The workflow is implemented in bash, which makes it well suited to UNIX servers, allowing uninterrupted runs on high-capacity clusters and the processing of multiple large-scale datasets. The end result is the identification and annotation of conserved and novel miRNAs and tasiRNAs, along with the expression matrix of the libraries in the input dataset, which can easily be imported into Excel or R for differential expression analyses.

Future development of the pipeline will include a database of the generated annotations and a user-friendly graphical interface.

This pipeline was built to simplify the manipulation of NGS data. It provides seamless classification of sRNAs and prediction of tasiRNAs and sRNA targets from FASTQ files.

How to start:

    Make sure you have all the necessary software (checklist):
      UEA sRNA Workbench, optimized for the Linux version (~3.2)
      Perl (5.8)
      Java, optimized for version ~1.7
      RNAfold depends on /lib/ld-linux.so.2 (alternatively, you can compile a new version of RNAfold and replace the copy in srna-workbenchV4.0Alpha/ExeFiles/linux/)
        On Ubuntu it can be found in the package libc6-i386:
        sudo apt-get update
        sudo apt-get install libc6-i386
    Set up the variables in the config dir.
    You should also have the following software available in your PATH:
      PatMaN (can be installed with the install script)
      tar (sudo apt-get install tar)
    Run miRPursuit.sh
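
The checklist above can be partly automated. The following is a minimal sketch, not part of miRPursuit, that reports whether each command-line tool is reachable through PATH. The function name check_tools is hypothetical, and patman is assumed to be the name of the PatMaN binary. It does not check versions or the 32-bit loader that RNAfold depends on.

```shell
#!/usr/bin/env bash
# Report which of the given commands can be found in PATH.
# Prints one "ok"/"MISSING" line per tool and returns non-zero if any is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'ok       %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in the checklist; "patman" is an assumed binary name.
check_tools perl java tar patman || echo "Install the missing tools before running miRPursuit.sh"
```

Because the function only returns a status, it can also guard a run, for example by calling it at the top of a wrapper script and stopping when it fails.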

Installation

From GitHub
$ cd /toDesiredLocation/
$ git clone https://github.com/forestbiotech-lab/miRPursuit.git
$ cd miRPursuit
From a zip archive
# Download the archive from GitHub
$ cd /toDesiredLocation/
$ unzip miRPursuit-master.zip
Dependencies
    To install the necessary dependencies you can run install.sh in the main folder
$ cd /pathtoMiRPursuit/
$ ./install.sh
Custom Installation
    Set the software directory in the config file.
    Fill out the software variables in the software_dirs.cfg file.
    Set the paths to any listed programs that are already installed.
$ cd /pathtoMiRPursuit/
$ vim config/software_dirs.cfg
Running test dataset
    A test dataset is provided to check that the pipeline was installed successfully.
      Edit config/workdirs.cfg
      Set INSERTS_DIR=pathToMiRPursuit/testDataset (example for the test dataset)
      Use a simple plant genome as the reference genome. (The dataset contains sRNAs detected with the C. canephora genome.)
    Example command to analyse the test dataset (make sure all the variables mentioned above are already set):
$ bash pathToMirPursuit/miRPursuit.sh -f 1 -l 2 --fasta test_dataset-
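
Pointing the inserts directory at the test dataset can also be scripted. The sketch below assumes that workdirs.cfg holds plain KEY=value lines, which is an assumption rather than something stated here, and it works on a scratch file instead of the real config. The helper set_cfg_var and the THREADS=4 line are hypothetical.

```shell
#!/usr/bin/env bash
# Set KEY=VALUE in a config file: drop any existing KEY= line, then append the new one.
set_cfg_var() {
    local file="$1" key="$2" value="$3" tmp
    tmp=$(mktemp)
    # grep -v exits non-zero when it prints nothing, hence the "|| true".
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}

# Scratch config standing in for config/workdirs.cfg (contents are made up).
cfg=$(mktemp)
printf 'INSERTS_DIR=/some/old/path\nTHREADS=4\n' > "$cfg"

set_cfg_var "$cfg" INSERTS_DIR "pathToMiRPursuit/testDataset"
cat "$cfg"
```

Note that the updated line is moved to the end of the file, which is harmless only if the config does not depend on assignment order.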

Analysing sRNA

Works for fastq and fasta input formats.

config - Directory that has all the variables for the workflow.

    workdirs.cfg - Sets the variables with the directories and files necessary for the project.
      workdir - Path to the working directory (it is created if it doesn't exist)
      genomes - Path to the genomes
      GENOME_MIRCAT - Path to the genome to be used by mircat. Set to ${GENOME} if you don't need to run the genome in several parts (which may be necessary if you are short on RAM).
      FILTER_SUF - Filter suffix that selects the predefined filter settings to be used
      MEMORY - Amount of memory to be used by Java when running memory-intensive scripts. Ex: 10g, 2000m ...
      THREADS - Number of cores to be used during execution
      INSERTS_DIR - Path to the inserts directory
      MIRBASE - Path to the miRBase database

    software_dirs.cfg - Sets the directory paths to all major programs

    patman_genome.cfg - General genome filtering parameters

    wbench_mircat.cfg - General parameters for mircat

    wbench_tasi.cfg - General parameters for TaSi.
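
Before a long run it can help to confirm that the variables listed above are actually filled in. This is a rough sketch under the assumption that the config files can be sourced by the shell as KEY=value assignments. The function missing_cfg_vars and the demo values are hypothetical, and the demo uses a scratch file rather than the real workdirs.cfg.

```shell
#!/usr/bin/env bash
# Print the names of variables that are unset or empty after sourcing a config file.
missing_cfg_vars() {
    local file="$1" name value
    shift
    (
        # Source in a subshell so the caller's environment is left untouched.
        . "$file"
        for name in "$@"; do
            eval "value=\${$name:-}"
            [ -n "$value" ] || printf '%s\n' "$name"
        done
    )
}

# Scratch config with made-up values; MIRBASE is deliberately left empty.
demo_cfg=$(mktemp)
printf 'MEMORY=10g\nTHREADS=4\nINSERTS_DIR=pathToMiRPursuit/testDataset\nMIRBASE=\n' > "$demo_cfg"

missing_cfg_vars "$demo_cfg" MEMORY THREADS INSERTS_DIR MIRBASE
```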

Programs

    miRPursuit.sh
    Description: This is the main script that runs the full pipeline. Some command-line options are being moved to config files.
      inputs:
        -f|--lib-first "First library to be processed"
        -l|--lib-last "Last library to be processed"
        -h|--help "Display help"
      Optional arguments:
        -s|--step Step is an optional argument used to jump steps and start analysis from a different point.
          Step 1: Wbench Filter
          Step 2: Filter Genome & miRBase
          Step 3: Tasi
          Step 4: Mircat
          Step 5: PareSnip
        --fasta Start the pipeline from FASTA files. As an argument, supply the file name prefix that identifies the series to be used. Ex: Lib_1.fa, Lib_2.fa, ... --> the argument should be Lib_
        --fastq Start the pipeline from FASTQ files. As an argument, supply the file name prefix that identifies the series to be used. Ex: Lib_1.fq, Lib_2.fq, ... --> the argument should be Lib_. Files with the extension fastq.gz are extracted first.
        --trim Set this flag to perform adaptor trimming. No argument should be given. The adaptor is set in the ADAPTOR variable of the workdirs.cfg config file.
      Outputs:
        mirbase hits
        predicted targets
        predicted mRNA
        [workdir]/logs
        [workdir]/counts
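
To make the relation between the library range and the series prefix concrete, here is a small sketch that lists the file names implied by the Lib_1.fa, Lib_2.fa, ... example in the option description. The function expected_libs is hypothetical and only mirrors that naming example; the pipeline itself may resolve files differently.

```shell
#!/usr/bin/env bash
# List the input files implied by a prefix, an extension and a library range,
# following the Lib_1.fa, Lib_2.fa, ... naming used in the option description.
expected_libs() {
    local prefix="$1" ext="$2" first="$3" last="$4" n
    n=$first
    while [ "$n" -le "$last" ]; do
        printf '%s%s.%s\n' "$prefix" "$n" "$ext"
        n=$((n + 1))
    done
}

# File names that "-f 1 -l 3 --fasta Lib_" would refer to under that naming.
expected_libs Lib_ fa 1 3
```

Under the same naming, a run over those three libraries would be started with bash miRPursuit.sh -f 1 -l 3 --fasta Lib_, in the same way as the test dataset example.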

      Figure 1 - MiRPursuit pipeline diagram and miRPursuit file structure.
      In the file structure, each rectangle represents a folder, dotted lines indicate relative paths, while solid lines indicate direct relation (folder is child of arrow origin).
      /miRPursuit is located in the path where it was installed. /[workdir_name] is the path that was set in workdir in workdirs.cfg. /config has all the configuration files specified in supp file 1. /count has all the count files generated. /data stores all generated files, along with any intermediary files generated by the processes in the pipeline. /log stores all the log files related to the pipeline execution.
------
    predict-target.sh
    Description: This is the last step of the pipeline, responsible for identifying sRNA targets in the transcriptome through a degradome-mediated search.
      inputs:
        -f|--lib-first "First library to be processed"
        -l|--lib-last "Last library to be processed"
      Optional arguments: (If no degradome file parameter is given, the script will list options based on the location of the last used degradome file.)
        -d|--degradome "Degradome location"
        -h|--help "Display help"
      Outputs:
        targets

For detailed file names check the corresponding pipeline. This program executes the following programs in that order. Stats on the number of reads are stored in the count directory. Despite its .tsv extension, the count file is not a true TSV: its values are separated by spaces rather than tabs. The format used for count file names is %y%m%d:%h%m%s-type-lib[lib_first]-[lib_last].tsv
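
Since the count files are space-separated, tools that expect real tabs may need a conversion first. The following sketch shows one way to do that with awk. The function name to_tsv and the demo rows are made up for illustration, and it assumes that no field itself contains spaces.

```shell
#!/usr/bin/env bash
# Convert space-separated fields to tab-separated ones.
# Assigning $1 to itself makes awk rebuild the line using OFS (a tab here).
to_tsv() {
    awk -v OFS='\t' '{ $1 = $1; print }' "$@"
}

# Made-up rows standing in for a count file.
printf 'lib01 1200 950\nlib02 1100 870\n' | to_tsv
```

The function reads standard input when called without arguments, or the named files otherwise, so its output can be redirected to a new file before importing it elsewhere.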

The log directory has a lot of information about what happened during the execution of the scripts. Log files use a notation similar to that of the count files: %y%m%d:%h%m%s-type.log, or *.log.ok if the script ran to the end.
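
The .log versus .log.ok suffix gives a quick way to see which steps ran to the end. Below is a minimal sketch that classifies the files in a log directory on that basis alone. The function log_status, the demo directory and the type names in the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Classify log files in a directory by whether they carry the ".log.ok" suffix.
log_status() {
    local dir="$1" f
    for f in "$dir"/*.log "$dir"/*.log.ok; do
        [ -e "$f" ] || continue          # skip glob patterns that matched nothing
        case $f in
            *.log.ok) printf 'finished    %s\n' "${f##*/}" ;;
            *)        printf 'incomplete  %s\n' "${f##*/}" ;;
        esac
    done
}

# Scratch directory with made-up log names in the %y%m%d:%h%m%s-type pattern.
demo_logs=$(mktemp -d)
touch "$demo_logs/170301:101500-filter.log.ok" "$demo_logs/170301:113000-mircat.log"

log_status "$demo_logs"
```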


References:

1 - Borges F & Martienssen RA (2015) The expanding world of small RNAs in plants. Nat Rev Mol Cell Biol 16, 727–741.

2 - Sunkar R (2010) MicroRNAs with macro-effects on plant stress responses. Semin Cell Dev Biol 21, 805–811.

3 - Liu J & Vance CP (2010) Crucial roles of sucrose and miRNA399 in systemic signaling of P deficiency - A tale of two team players? Plant Signaling and Behaviour 5, 1–5.

4 - Bartel DP (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281–297.

5 - Allen E, Xie Z, Gustafson AM & Carrington JC (2005) microRNA-directed phasing during trans-acting siRNA biogenesis in plants. Cell 121, 207–221.

6 - Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X & Mortazavi A (2016) A survey of best practices for RNA-seq data analysis. Genome Biol 17, 13.

7 - Kozomara A & Griffiths-Jones S (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 39, D152–7.

8 - Chaves I, Lin Y-C, Pinto-Ricardo C, Van de Peer Y & Miguel C (2014) miRNA profiling in leaf and cork tissues of Quercus suber reveals novel miRNAs and tissue-specific expression patterns. Tree Genet. Genomes 10, 721–737.

9 - Stocks MB, Moxon S, Mapleson D, Woolfenden HC, Mohorianu I, Folkes L, Schwach F, Dalmay T & Moulton V (2012) The UEA sRNA workbench: a suite of tools for analysing and visualizing next generation sequencing microRNA and small RNA datasets. Bioinformatics 28, 2059–2061.

10 - BabrahamBioinformatics (2016) A quality control tool for high throughput sequence data http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

11 - HannonLab (2010) FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/index.html.

12 - Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, Floden EW, Gardner PP, Jones TA, Tate J & Finn RD (2015) Rfam 12.0: updates to the RNA families database. Nucleic Acids Res 43, D130–7.

13 - Prüfer K, Stenzel U, Dannemann M, Green RE, Lachmann M & Kelso J (2008) PatMaN: rapid alignment of short sequences to large databases. Bioinformatics 24, 1530–1531.

14 - Chen H-M, Li Y-H & Wu S-H (2007) Bioinformatic prediction and experimental validation of a microRNA-directed tandem trans-acting siRNA cascade in Arabidopsis. Proc Natl Acad Sci U S A 104, 3318–3323.

15 - Griffiths-Jones S (2006) miRBase: the microRNA sequence database. Methods Mol Biol 342, 129–138.

16 - Griffiths-Jones S, Saini HK, van Dongen S & Enright AJ (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res 36, D154–8.

17 - Taylor RS, Tarver JE, Foroozani A & Donoghue PCJ (2017) MicroRNA annotation of plant genomes - Do it right or not at all. Bioessays 39.

18 - Meyers BC, Axtell MJ, Bartel B, Bartel DP, Baulcombe D, Bowman JL, Cao X, Carrington JC, Chen X, Green PJ, Griffiths-Jones S, Jacobsen SE, Mallory AC, Martienssen RA, Poethig RS, Qi Y, Vaucheret H, Voinnet O, Watanabe Y, Weigel D & Zhu J-K (2008) Criteria for annotation of plant MicroRNAs. Plant Cell 20, 3186–3190.

19 - Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N & Rokhsar DS (2012) Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 40, D1178–86.

20 - Kersey PJ, Allen JE, Armean I, Boddu S, Bolt BJ, Carvalho-Silva D, Christensen M, Davis P, Falin LJ, Grabmueller C, Humphrey J, Kerhornou A, Khobova J, Aranganathan NK, Langridge N, Lowy E, McDowall MD, Maheswari U, Nuhn M, Ong CK & Staines DM (2016) Ensembl Genomes 2016: more genomes, more complexity. Nucleic Acids Res 44, D574–80.