The config.yaml configuration file

markrobinsonuzh edited this page Apr 25, 2019 · 22 revisions

The configuration file (config.yaml) contains all the paths to input, output and reference files and additional parameters to customize the pipeline and the performed tests. All of these need to be carefully specified in accordance with the specific experiment.

Important: ALL relative paths will be interpreted relative to the directory where the Snakefile is located. Alternatively, you can use absolute paths.

The following details, parameters and paths have to be adjusted for your specific experiment:

1. Reference annotation details: Specify the source of your reference files. Please be consistent with build and release versions and fill in:

  • annotation: Either Ensembl or Gencode
  • organism: Species name separated by an underscore _ (e.g. Homo_sapiens)
  • build: Genome build (e.g. GRCh38)
  • release: Release number (e.g. 93)
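
As a rough illustration, the four fields above could be filled in as follows in config.yaml. The values are simply the examples given in the list, not a recommendation for any particular experiment.

```yaml
# Reference annotation details (example values from the list above)
annotation: Ensembl      # either Ensembl or Gencode
organism: Homo_sapiens   # species name, words separated by an underscore
build: GRCh38            # genome build
release: 93              # release number; keep consistent with the build
```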

2. Paths to existing reference files: Add the paths to your reference files. You also have to specify the paths where the workflow should write the indexes and other files that it generates. The following reference files are used in the workflow, and can be downloaded from Ensembl or Gencode:

  • txome: A single transcriptome fasta file. Can be compressed. For Ensembl references, the cDNA and ncRNA fasta files need to be combined before running the workflow, e.g., cat cdna.fa.gz ncrna.fa.gz > cdna.ncrna.fa.gz.
  • genome: Genome fasta file. Must be uncompressed.
  • gtf: Corresponding GTF file. Must be uncompressed.
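
A minimal sketch of the three reference entries is shown below. All file and folder names are hypothetical placeholders and are interpreted relative to the directory containing the Snakefile. The entries for the generated indexes are left out here because their key names are not listed on this page.

```yaml
# Hypothetical paths; replace them with the location of your own files
txome: reference/cdna.ncrna.fa.gz   # combined transcriptome fasta, may be compressed
genome: reference/genome.fa         # genome fasta, must be uncompressed
gtf: reference/annotation.gtf       # matching GTF, must be uncompressed
```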

3. Information about the experiment:

  • readlength: Specify the read length for your RNA-seq reads.
  • fldMean, fldSD: Specify the mean and standard deviation of the fragment length (see the Salmon documentation for more information). This is only important for single-end libraries. For paired-end libraries, the fragment length distribution will be learned from the data and the specified values will only be used to define the prior distribution. Thus, for paired-end libraries these values can be left at their defaults (which are also the Salmon defaults).
  • metatxt: Specify the path to the metadata text file.
  • design: Specify a design formula, e.g. "~ 0 + group". This formula will be used to fit a model in edgeR, camera and DRIMSeq. It must be a string (enclose in "") and the predictors (here group) should correspond to column names in the metadata text file. Arbitrary designs are supported.
  • contrast: Define one or more contrasts as a (comma-separated) list.
  • genesets: Specify the gene sets for the gene set analysis with camera. This parameter is only required if run_camera is True (see below: 7. Optional rules).
  • ncores: Sets the maximum number of cores to use for the tools that support multi-threading (e.g., FastQC, STAR, Salmon and DRIMSeq). Note that this is separate from the total number of cores that snakemake uses (via the --cores argument); see Running the analysis for more details.
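
The experiment block might look roughly like the sketch below. It assumes a metadata file with a column named group whose levels are A and B, so that the design "~ 0 + group" yields the coefficients groupA and groupB. The read length, metadata path, contrast, gene set identifier and core count are hypothetical values chosen for illustration; 250 and 25 are the Salmon defaults for the fragment length mean and standard deviation.

```yaml
readlength: 100             # hypothetical read length of the RNA-seq reads
fldMean: 250                # Salmon default; mainly relevant for single-end libraries
fldSD: 25                   # Salmon default
metatxt: metadata.txt       # hypothetical path to the metadata text file
design: "~ 0 + group"       # must be a quoted string; 'group' is a metadata column
contrast: groupA-groupB     # assumed levels A and B; separate several contrasts with commas
genesets: H                 # assumed identifier; only read if run_camera is True
ncores: 4                   # hypothetical; independent of snakemake's --cores
```

With a no-intercept design such as this one, each coefficient corresponds to one group, which makes contrasts between groups straightforward to write down.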

4. FASTQ input files

  • FASTQ: Set the path to a folder containing gzipped FASTQ files
  • fqext1, fqext2: Specify the extensions that distinguish the two files of a paired-end read pair (not required for single-end reads).
  • fqsuffix: Specify the file extension (e.g. fastq).
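
The following sketch shows one possible set of FASTQ settings. It assumes, purely for illustration, a folder called FASTQ holding paired-end files named like sample1_R1.fastq.gz and sample1_R2.fastq.gz; the exact way the workflow assembles file names from these fields is an assumption here and should be checked against your own files.

```yaml
FASTQ: FASTQ      # hypothetical folder containing the gzipped FASTQ files
fqext1: R1        # assumed tag of the first read file of each pair
fqext2: R2        # assumed tag of the second read file of each pair
fqsuffix: fastq   # file extension, as in the example above
```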

5. Set the path to the output directory. If you want the output in the directory with your Snakefile, set this to ".".

6. R setup

  • useCondaR: Set to True to install R and all required packages inside a conda environment, or to False to use a system R installation instead. See 03 Managing software for more information on using a system R installation.
  • Rbin: Specify the R binary if you want to use a local R installation (only required if useCondaR: False).
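
As a small sketch, the two R-related settings could be written as below. The value given for Rbin is a hypothetical placeholder and, according to the description above, only matters when useCondaR is False.

```yaml
useCondaR: True   # install R and the required packages in a conda environment
Rbin: R           # hypothetical R binary; only required if useCondaR is False
```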

7. Optional rules: Set any of the following variables to False if the corresponding analysis step is not required:

  • run_trimming: Adapter and quality trimming with TrimGalore!.
  • run_STAR: Genome mapping with STAR, BAM file indexing and conversion to bigWig format.
  • run_DRIMSeq: Differential transcript usage analysis with DRIMSeq.
  • run_camera: Gene set analysis with camera.
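
For example, a run that keeps trimming and genome mapping but skips the two downstream analyses could be configured as sketched here. The choice of which steps to switch off is arbitrary and only meant to show the syntax.

```yaml
run_trimming: True    # adapter and quality trimming with TrimGalore!
run_STAR: True        # STAR mapping, BAM indexing and bigWig conversion
run_DRIMSeq: False    # skip differential transcript usage analysis
run_camera: False     # skip gene set analysis; genesets is then not required
```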

Real data example

An example setup for a full dataset with the ARMOR workflow is provided in the chiron_realdataworkflow branch of the ARMOR GitHub repository. You can find it here.