Provide additional documentation on configuration parameters including resources section and ensemble calling
commit 4d81bbadc3aa5050eae139bf4cbfcd518ad7916e 1 parent e9e0b9d
@chapmanb authored
7 config/bcbio_system.yaml
@@ -3,6 +3,8 @@
# These pipeline apply generally across multiple projects. Adjust them in sample
# specific configuration files when needed.
+# -- Base setup
+
# General attributes that apply across multiple pipelines.
algorithm:
aligner: bowtie
@@ -65,6 +67,10 @@ resources:
snpEff:
jvm_opts: ["-Xms2g", "-Xmx4g"]
dir: /usr/share/java/snpeff
+ bcbio_variation:
+ dir: /usr/share/java/bcbio_variation
+
+# -- Additional options for specific integration, not required for standalone usage.
# Galaxy integration. Required for upload/download from a Galaxy instance.
galaxy_url: http://your/galaxy/url
@@ -92,4 +98,3 @@ analysis:
base_dir: /array0/projects/Sequencing
upload_program: upload_to_galaxy.py
worker_program: nextgen_analysis_server.py
-
1  config/examples/NA12878-ensemble.yaml
@@ -22,7 +22,6 @@ details:
calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
classifier-params:
type: svm
- classifier-type: svm
trusted-pct: 0.65
quality_format: Standard
coverage_interval: regional
112 docs/contents/configuration.rst
@@ -17,12 +17,38 @@ Commented example files are available in the ``config`` directory:
- `example system config`_
- `example sample config`_
-Options
-~~~~~~~
+Sample information
+~~~~~~~~~~~~~~~~~~
+
+The sample configuration file defines ``details`` of each sample to process::
+
+    details:
+      - analysis: variant
+        algorithm:
+        metadata:
+          batch: Batch1
+        description: Example1
+        genome_build: hg19
+
+- ``analysis`` Analysis method to use [variant, RNA-seq]
+- ``algorithm`` Parameters to configure algorithm inputs. Options
+ described in more detail below.
+- ``metadata`` Additional descriptive metadata about the sample. The
+ ``batch`` input defines a batch that the sample falls in. We perform
+ multi-sample variant calling on all samples with the same batch name.
+- ``description`` Unique name for this sample. Required.
+- ``genome_build`` Genome build to align to, which references a genome
+ keyword in Galaxy to find the location of build files.
+
+Algorithm parameters
+~~~~~~~~~~~~~~~~~~~~
The YAML configuration file provides a number of hooks to customize
analysis in the sample configuration file. Place these under the
-``analysis`` keyword. For variant calling:
+``algorithm`` keyword.
+
+Alignment
+=========
- ``aligner`` Aligner to use: [bwa, bowtie, bowtie2, mosaik, novoalign,
false]
@@ -31,21 +57,31 @@ analysis in the sample configuration file. Place these under the
- ``align_split_size``: Split FASTQ files into specified number of
records per file. Allows parallelization at the cost of increased
temporary disk space usage.
-- ``variantcaller`` Variant calling algorithm [gatk, freebayes]
- ``quality_format`` Quality format of fastq inputs [illumina,
standard]
+- ``write_summary`` Write a PDF summary of results [true, false]
+
+Experimental information
+========================
+
- ``coverage_interval`` Regions covered by sequencing. Influences GATK
options for filtering [exome, genome, regional]
- ``coverage_depth`` Depth of sequencing coverage. Influences GATK
variant calling [high, low]
- ``hybrid_target`` BED file with target regions for hybrid selection
experiments.
-- ``variant_regions`` BED file of regions to call variants in.
- ``ploidy`` Ploidy of called reads. Defaults to 2 (diploid).
-- ``recalibrate`` Perform variant recalibration [true, false]
+
+Variant calling
+===============
+
+- ``variantcaller`` Variant calling algorithm. Can be a list of
+ multiple options [gatk, freebayes, varscan, samtools,
+ gatk-haplotype, cortex]
+- ``variant_regions`` BED file of regions to call variants in.
- ``mark_duplicates`` Identify and remove duplicate reads [false, true]
+- ``recalibrate`` Perform variant recalibration [true, false]
- ``realign`` Do variant realignment [true, false]
-- ``write_summary`` Write a PDF summary of results [true, false]
Broad's `GATK`_ pipeline drives variant (SNP and Indel) analysis.
This requires some associated data files, and also has some configurable
@@ -61,6 +97,68 @@ are inputs into the training models for recalibration. The automated
`CloudBioLinux`_ data scripts will download and install these in the
variation subdirectory relative to the genome files.
+Ensemble variant calling
+========================
+
+In addition to single method variant calling, we support calling with
+multiple calling methods and consolidating into a final Ensemble
+callset. This requires the `bcbio.variation`_ toolkit to perform the
+consolidation. An example configuration in the ``algorithm`` section is::
+
+    variantcaller: [gatk, freebayes, samtools, gatk-haplotype, varscan]
+    ensemble:
+      format-filters: [DP < 4]
+      classifier-params:
+        type: svm
+      classifiers:
+        balance: [AD, FS, Entropy]
+        calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
+      trusted-pct: 0.65
+
+The ``ensemble`` set of parameters configures how to combine calls from
+the multiple methods:
+
+- ``format-filters`` A set of filters to apply to variants before
+ combining. The example removes all calls with a depth of less than
+ 4.
+- ``classifier-params`` Parameters to configure the machine learning
+ approaches used to consolidate calls. The example defines an SVM
+ classifier.
+- ``classifiers`` Groups of classifiers to use for training and
+ evaluating during machine learning. The example defines two sets of
+ criteria for distinguishing reads with allele balance issues and
+ those with low calling support.
+- ``trusted-pct`` Defines the threshold of variants to include in the final
+ callset. In the example, variants called by more than 65% of the
+ approaches (4 or more callers) pass without requiring SVM
+ filtering.
+
+Resources
+~~~~~~~~~
+
+The ``resources`` section allows customization of locations of programs
+and memory and compute resources to devote to them::
+
+    resources:
+      bwa:
+        cores: 12
+        cmd: /an/alternative/path/to/bwa
+      gatk:
+        jvm_opts: ["-Xms2g", "-Xmx4g"]
+        dir: /usr/share/java/gatk
+
+- ``cmd`` Location of an executable. By default, we assume executables
+ are on the path.
+- ``dir`` For software not distributed as a single executable, like
+ files of Java jars, the location of the base directory.
+- ``cores`` Cores to use for multi-processor enabled software.
+- ``jvm_opts`` Specific memory usage options for Java software.
+
+Resources will continue to expand to allow direct customization of
+command-line options as well as fine-grained control over resource
+usage.
+
+.. _bcbio.variation: https://github.com/chapmanb/bcbio.variation
.. _CloudBioLinux: https://github.com/chapmanb/cloudbiolinux
.. _YAML format: https://en.wikipedia.org/wiki/YAML#Examples
.. _GATK resource bundle: http://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle
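
The ``trusted-pct`` arithmetic in the ensemble documentation (five callers, a 0.65 threshold, "4 or more callers") can be checked with a short sketch. The helper ``min_callers_to_trust`` is a hypothetical name used only for illustration and is not part of bcbio.variation. The strict "more than" comparison follows the wording of the documentation; the toolkit's actual comparison may differ.

```python
from typing import Optional


def min_callers_to_trust(n_callers: int, trusted_pct: float) -> Optional[int]:
    """Return the smallest number of agreeing callers whose share of all
    callers strictly exceeds trusted_pct, or None if no count qualifies.

    Hypothetical helper for illustration; assumes a strict '>' comparison.
    """
    for k in range(1, n_callers + 1):
        if k / n_callers > trusted_pct:
            return k
    return None


if __name__ == "__main__":
    # Caller list taken from the example ``variantcaller`` setting.
    callers = ["gatk", "freebayes", "samtools", "gatk-haplotype", "varscan"]
    print(min_callers_to_trust(len(callers), 0.65))
```

With five callers, three agreeing callers give 60% and four give 80%, so four is the first count above the 65% threshold, matching the "4 or more callers" note in the documentation.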