mega-non-model-wgs-snakeflow

Major Update/Upgrade Notes (New)

I have now pulled the more-parallel branch into main. There are two main changes:

  1. The genomics databases are now created directly from the gVCF sections. As a consequence, it is no longer necessary for the gVCFs for all chromosomes and scaffold groups to be complete before going on to the genomics DB stage. This is helpful when there are some very long chromosomes. Otherwise, this change entails no difference in the user’s responsibilities.

  2. The GenotypeGVCFs step is now done in a scattered fashion over intervals that are all less than some desired length. This means that, instead of having to wait a very long time for a long chromosome to finish this step, you can roughly equalize the time required by each instance of this step by effectively chopping each chromosome into, say, 5-megabase chunks, dispatching each of those jobs separately, and stitching them all together at the end. This also means that the workflow can be designed so that none of these jobs exceeds 24 hours, so that they can be run on the standard queue on most clusters. There are some new requirements of the user, however!

    1. There is a new required file: the “scatter_intervals_file.” The path to this must be given in the config file variable scatter_intervals_file. For example, in the .test/config/config.yaml we have:

      scatter_intervals_file: .test/config/scatters_1200000.tsv

      The name of the file can be anything you want. In the above case, I named it that to remind me that it has chopped the genome up into segments that are no longer than 1.2 megabases for genotype calling.

    2. The format of the file can be seen by looking at it on GitHub, here. Importantly, the order of the columns must be as shown in this file.

    3. If you already have your chromosomes.tsv and your scaffold_groups.tsv file paths noted in your config file, you can create the scatter intervals file by first setting scatter_intervals_file: "" in your config file, then running this command:

      snakemake --cores 1 --use-conda results/scatter_config/scatters_XXXXXXX.tsv --configfile path_to/your_config/config.yaml

      Where XXXXXXX should be replaced with the maximum length of genome segment you want to have in a scatter (for example, 5000000), and path_to/your_config/config.yaml should be replaced by the actual path to your config file. This step will create the file results/scatter_config/scatters_XXXXXXX.tsv, after which you can copy it to your config directory and give its path in the scatter_intervals_file: line of your config.yaml.

Major Update/Upgrade Notes (Old)

  • I have now pulled the bqsr-directory-sructure branch into main. This lets us easily do one or more rounds of “bootstrapped base quality score recalibration.” The directory structure has been modified rather simply. For each round of BQSR (0 = no BQSR; 1 = using variants from round 0 to do a round of BQSR and call variants from the results; 2 = using the variants from round 1 to do another round of BQSR and call variants from the result; etc.), the entire old directory structure (except for the slurm_logs) gets placed into a trunk directory of bqsr-round-x, where x is 0, 1, 2, etc.
  • There are a few files/steps that could be marked as temporary, but I haven’t done that yet, because, while developing, it can be useful to have some of those stages for re-running from that point.
  • The way I have done the directory structure, there doesn’t seem to be a good way to mark, for example, the gVCFs from round 0 of BQSR for deletion after round 1 has been done, without also triggering deletion of round 1’s gVCFs. So, for now, this has to be done by hand.
  • Some new things are required in the config file now. Basically, the new additions look like this:
bqsr_rounds: 2
bqsr_maf: 0.0225
bqsr_qual: 37
bqsr_qd: 15

# the following must be a list, even if it is just one element
maf_cutoffs: [0.01, 0.05]

Some notes on Indel Realignment

I am now working on making ANGSD-ready BAMs, and I would like to do indel realignment, as recommended by Nina’s group. I note that in their workflow they submit all the BAM files to RealignerTargetCreator and then realign all the BAM files together.

It seems to me like that would not parallelize well.

However, since we have already created VCF files with all the known indels in them, it makes sense to just use those. Those will hold all the indels found from this particular set of BAMs, and so they should give a similar result. Here is a note on the legacy GATK website about this. It points out that most of the indels within an individual will already be known if you have a fairly complete set of indels, so it is permissible in large projects to realign one lane at a time. (Ha! For the folks at the Broad, one lane typically means one sample.)

In our case, since the known indels come from using HaplotypeCaller on all the samples, that set will hold all the indels that are believed to exist after filtering. It seems like that should be good enough (if not potentially better). And it also means that we can deploy jobs across individuals, which, with hundreds of individuals, should let us do this a little more quickly.
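
To make the per-sample idea concrete, here is a minimal sketch of realigning a single sample around a set of known indels with the legacy GATK3 RealignerTargetCreator/IndelRealigner pair. This is only an illustration of the approach, not the workflow’s actual rule, and all paths (the reference, the known-indels VCF, and the input/output BAMs) are placeholders:

# Sketch only: per-sample indel realignment with GATK3 around known indels.
# All paths below are placeholders; adjust them to your own layout.
SAMPLE=s001
REF=resources/genome.fasta
KNOWN=known-indels.vcf.gz

# 1. Create realignment target intervals for this sample, guided by the known indels
java -jar GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R "$REF" \
  -I "${SAMPLE}.bam" \
  --known "$KNOWN" \
  -o "${SAMPLE}.intervals"

# 2. Realign the reads in those intervals
java -jar GenomeAnalysisTK.jar \
  -T IndelRealigner \
  -R "$REF" \
  -I "${SAMPLE}.bam" \
  -targetIntervals "${SAMPLE}.intervals" \
  -known "$KNOWN" \
  -o "${SAMPLE}_realigned.bam"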

This has been implemented, and the angsd-ready BAMs (overlap-clipped and indel-realigned) are now part of the default output and go into results/bqsr-round-{bqsr_round}/indel_realigned/{sample}.bam.

Note that there are some intermittent Java failures on the test data set. I am not sure why this is. Re-running it will often let it finish appropriately. Weird. Must see if the same happens on Linux.

Realignment around species-specific indels

Sometimes we map multiple closely related species to a single reference genome. If we do that, it makes sense not to realign around indels that are not actually seen in a given species. To deal with that, you can set indel_grps in the config. This is a path that points to a file that tells us which species each of the samples (and sample_ids) belongs to. For example, in the test data set, if we uncomment the line:

indel_grps: ".test/config/igrps.tsv" 

Then Snakemake understands that it will do indel realignment only around the indels observed in the individuals that are of the same species. The file specifying this (i.e., .test/config/igrps.tsv in the above) must be TAB-delimited and formatted like this:

sample  sample_id   indel_grp
s001    T199967 spp1
s002    T199968 spp1
s003    T199969 spp1
s004    T199970 spp1
s005    T199971 spp2
s006    T199972 spp2
s007    T199974 spp2

Quick install and run

If you would like to put this on your system and test it by running the tiny test data set it comes with on a single node (or across multiple nodes if you are on a SLURM cluster), these are the steps you have to take. We assume that you already have git installed.

  1. Install Snakemake if you don’t already have it. I have been developing and testing this mostly on snakemake-6, but am now testing and using it with snakemake-7.7.0. You must have a full installation of snakemake, and you must have mamba. To install snakemake so that this all works, follow the installation directions at https://snakemake.readthedocs.io/en/stable/getting_started/installation.html.

I have developed and tested this workflow using snakemake version 7.7.0. The upgrade to 7.8.0 introduces a lot of new snakemake behaviors that can be a little unexpected, so I will be sticking with 7.7.0 for a while. Accordingly, you should create a snakemake environment with 7.7.0 to run this. You can do that like this:

conda activate base
mamba create -c conda-forge -c bioconda -n snakemake-7.7.0 snakemake=7.7.0
  2. You must have cloned this repository. If you are savvy with this sort of thing, you might as well fork it then clone it. If not, the simplest way to clone it will be to use this command:
git clone https://github.com/eriqande/mega-non-model-wgs-snakeflow.git
  3. When that is done, change into the repository directory, and activate the snakemake conda environment:
cd mega-non-model-wgs-snakeflow/
conda activate snakemake-7.7.0
  4. The first thing we will do is a “dry-run” of the workflow. This tells you all the different steps that will be taken, but does not actually run them.
 snakemake --cores 20 --use-conda  -np --configfile .test/config/config.yaml
  • The --configfile option tells snakemake to find all the configurations for the run in .test/config/config.yaml. This runs a very small test data set of 8 samples from fastq to VCF.
  • The -np option tells snakemake to do a dry run and also to print all the shell commands that it would use.

After you run that command, there should be a lot of output (one little block for each job) and then a summary at the end that looks something like this:

Job stats:
job                                   count    min threads    max threads
----------------------------------  -------  -------------  -------------
all                                       1              1              1
apply_bqsr                               14              1              1
bcf_concat                                1              1              1
bcf_concat_mafs                           5              1              1
bcf_maf_section_summaries                30              1              1
bcf_section_summaries                    18              1              1
bung_filtered_vcfs_back_together         18              1              1
bwa_index                                 1              1              1
combine_bcftools_stats                    3              1              1
combine_maf_bcftools_stats                5              1              1
concat_gvcf_sections                      7              1              1
condense_variants_for_bqsr                2              1              1
fastqc_read1                             18              1              1
fastqc_read2                             18              1              1
gather_scattered_vcfs                    18              1              1
genome_dict                               1              1              1
genome_faidx                              1              1              1
genomics_db2vcf_scattered               126              2              2
genomics_db_import_chromosomes           12              2              2
genomics_db_import_scaffold_groups        6              2              2
get_genome                                1              1              1
hard_filter_indels                       18              1              1
hard_filter_snps                         18              1              1
maf_filter                               30              1              1
make_chromo_interval_lists               12              1              1
make_gvcf_sections                      126              1              1
make_indel_vcf                           18              1              1
make_scaff_group_interval_lists           6              1              1
make_scatter_interval_lists             126              1              1
make_snp_vcf                             18              1              1
map_reads                                18              4              4
mark_dp0_as_missing                      18              1              1
mark_duplicates                           7              1              1
multiqc                                   3              1              1
recalibrate_bases                        14              1              1
samtools_stats                           21              1              1
trim_reads_pe                            18              1              1
total                                   777              1              4

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
  5. Do the run. But for the first go-round, only install the necessary software environments by using the --conda-create-envs-only option:
snakemake --cores 20 --use-conda  --conda-create-envs-only --configfile .test/config/config.yaml

This can take 5 or 10 minutes, or even longer, but you only have to do it once. After that, when you run the workflow, all the software environments will already be in place.

  6. Once that has finished, do a whole run of the test data set. Note that this is set up to use 20 cores, which is reasonable if you have checked out an entire node on SEDNA, using, for example, srun -c 20 --pty /bin/bash. At any rate, to do the run you give this command:
 snakemake --cores 20 --use-conda --configfile .test/config/config.yaml

When you do that it should take about 5 to 20 minutes to run through the whole workflow on the tiny test data.

  7. Once that has finished, do a dry run of snakemake again; you should see that it tells you there is nothing to do because all requested output files have been created.
snakemake --cores 20 --use-conda  --keep-going  -np --configfile .test/config/config.yaml

This will output something like:

Building DAG of jobs...
The input used to generate one or several output files has changed:
    To inspect which output files have changes, run 'snakemake --list-input-changes'.
    To trigger a re-run, use 'snakemake -R $(snakemake --list-input-changes)'.
The params used to generate one or several output files has changed:
    To inspect which output files have changes, run 'snakemake --list-params-changes'.
    To trigger a re-run, use 'snakemake -R $(snakemake --list-params-changes)'.
Nothing to be done (all requested files are present and up to date).

The changes it thinks have occurred have not. Those are a consequence of some chicanery I did in the workflow to enable genomics database updating, which I will probably revamp at some point, because it is not as useful as one might think, since loading genomics databases from gVCFs is actually not terribly costly.

So, what just happened there

The upshot is that this workflow started with the fastq files in .test that represent 7 samples sequenced across multiple lanes and prepared in different libraries:

(snakemake-7.7.0) [node08: mega-non-model-wgs-snakeflow]--% ls .test/data/fastq/
T199967_T2087_HY75HDSX2_L001_R1_001.fastq.gz  T199970_T2094_HY75HDSX2_L004_R2_001.fastq.gz  T199973_T2087_HY75HDSX2_L002_R1_001.fastq.gz
T199967_T2087_HY75HDSX2_L001_R2_001.fastq.gz  T199971_T2087_HY75HDSX2_L002_R1_001.fastq.gz  T199973_T2087_HY75HDSX2_L002_R2_001.fastq.gz
T199968_T2087_HY75HDSX2_L001_R1_001.fastq.gz  T199971_T2087_HY75HDSX2_L002_R2_001.fastq.gz  T199973_T2087_HY75HDSX2_L003_R1_001.fastq.gz
T199968_T2087_HY75HDSX2_L001_R2_001.fastq.gz  T199971_T2087_HY75HDSX2_L004_R1_001.fastq.gz  T199973_T2087_HY75HDSX2_L003_R2_001.fastq.gz
T199968_T2087_HY75HDSX2_L002_R1_001.fastq.gz  T199971_T2087_HY75HDSX2_L004_R2_001.fastq.gz  T199973_T2094_HY75HDSX2_L002_R1_001.fastq.gz
T199968_T2087_HY75HDSX2_L002_R2_001.fastq.gz  T199971_T2099_HTYYCBBXX_L002_R1_001.fastq.gz  T199973_T2094_HY75HDSX2_L002_R2_001.fastq.gz
T199969_T2087_HTYYCBBXX_L002_R1_001.fastq.gz  T199971_T2099_HTYYCBBXX_L002_R2_001.fastq.gz  T199973_T2094_HY75HDSX2_L003_R1_001.fastq.gz
T199969_T2087_HTYYCBBXX_L002_R2_001.fastq.gz  T199972_T2087_HTYYCBBXX_L003_R1_001.fastq.gz  T199973_T2094_HY75HDSX2_L003_R2_001.fastq.gz
T199969_T2087_HY75HDSX2_L002_R1_001.fastq.gz  T199972_T2087_HTYYCBBXX_L003_R2_001.fastq.gz  T199974_T2087_HY75HDSX2_L001_R1_001.fastq.gz
T199969_T2087_HY75HDSX2_L002_R2_001.fastq.gz  T199972_T2087_HY75HDSX2_L001_R1_001.fastq.gz  T199974_T2087_HY75HDSX2_L001_R2_001.fastq.gz
T199969_T2087_HY75HDSX2_L003_R1_001.fastq.gz  T199972_T2087_HY75HDSX2_L001_R2_001.fastq.gz  T199974_T2094_HY75HDSX2_L001_R1_001.fastq.gz
T199969_T2087_HY75HDSX2_L003_R2_001.fastq.gz  T199972_T2094_HTYYCBBXX_L004_R1_001.fastq.gz  T199974_T2094_HY75HDSX2_L001_R2_001.fastq.gz
T199970_T2087_HY75HDSX2_L003_R1_001.fastq.gz  T199972_T2094_HTYYCBBXX_L004_R2_001.fastq.gz  T199974_T2099_HY75HDSX2_L001_R1_001.fastq.gz
T199970_T2087_HY75HDSX2_L003_R2_001.fastq.gz  T199972_T2094_HY75HDSX2_L002_R1_001.fastq.gz  T199974_T2099_HY75HDSX2_L001_R2_001.fastq.gz
T199970_T2094_HY75HDSX2_L004_R1_001.fastq.gz  T199972_T2094_HY75HDSX2_L002_R2_001.fastq.gz

And then it downloaded the genome for those samples, trimmed the fastqs, fastqc-ed them, mapped them to the genome, marked duplicates, created gVCF files for each sample, imported those gVCF files into a genomics database, genotyped the samples from those genomics databases, marked sites with 0 read depth as missing, did best-practices GATK hard-filtering on those genotypes, combined a lot of VCF files across multiple regions of the genome into a single VCF file called results/vcf/all-filtered.vcf.gz, and then printed out some summaries of those variants using bcftools stats. You can have a look at the resulting variants with the command:

zcat results/bqsr-round-{x}/bcf/all.bcf | less

Where {x} is 0, 1, or 2. If you are doing this on a Mac, then you can use gzcat instead of zcat.
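
Note that all.bcf is a BCF file, so if zcat gives you unreadable binary output, a more dependable way to look at it (assuming bcftools is available on your system, for example from one of the workflow’s conda environments) is:

bcftools view results/bqsr-round-{x}/bcf/all.bcf | less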

Additionally, the log files from every job that got run have been recorded in various directories and files in results/bqsr-round-{x}/logs.

Finally, run-time information (how long it took, how much memory was required, how much disk I/O occurred) for every one of the jobs that ran is recorded in various directories and files in results/bqsr-round-{x}/benchmarks. This can be a treasure trove for estimating how long different jobs/steps of this workflow will take on new data sets.

All the files generated by the workflow are stored in

  • resources: downloaded and indexed genomes, etc. This also contains some adapter sequence files for trimmomatic that are distributed with this repo.
  • results: all the logs, all the outputs, etc.

A number of files are temporary files and are deleted after all downstream products they depend on have been produced. There are many more such files to mark as temporary, but I will do that after I have used this updated workflow to finish out a long project.

Some files are marked as protected so that they cannot easily be accidentally deleted or modified, such as the files in:

results/bqsr-round-{x}/bcf

It would be typical practice to copy and archive all those to some other place upon completion of the project.
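
For example, archiving the protected BCFs for the final round of BQSR might look something like this (the round number and the destination here are just placeholders; use whatever archive location your group uses):

# copy the protected BCFs (and their indexes) to an archive location
rsync -av results/bqsr-round-2/bcf/ /path/to/your/project-archive/bqsr-round-2/bcf/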

The following section shows a nice directed acyclic graph (DAG) diagram of all the steps in the workflow.

Condensed DAG for the workflow

Here is a DAG for the workflow on the test data in .test, condensed into an easier-to-look-at picture by the condense_dag() function in Eric’s SnakemakeDagR package.

Rulegraph for the workflow

Here is the pure rulegraph, which might be a little easier to read:

Running this with SLURM

This repository includes a snakemake profile that allows all the jobs in the workflow to be dispatched via the SLURM scheduler. This can be really handy. To test this on SEDNA, for example, do this:

  1. Remove the results directory and the genome parts of the resources directory, so that snakemake will run through the entire workflow again:
rm -rf resources/genome* results

The -f in the -rf option in the command above is used to override the write-protection on some of the files.

  2. Do a dry run using the SEDNA slurm profile:
snakemake --profile hpcc-profiles/slurm/sedna -np --configfile .test/config/config.yaml

You should get dry-run output like before.

  3. Make sure you have another shell available that you can put this command into, in order to see your SLURM job queue:
squeue -u $(whoami) -o "%.12i %.9P %.50j %.10u %.2t %.15M %.6D %.18R %.5C %.12m"
  4. Start the snakemake job, using the slurm profile:
snakemake --profile hpcc-profiles/slurm/sedna --configfile .test/config/config.yaml

While this is running, go to your other shell and use the above squeue command to see all of your jobs that are queued or running. (To be honest, there seems to be some latency with squeue on SEDNA. Since all these jobs are super short, it might be that they are not there long enough for squeue to show them. Instead, you can use sacct -u $(whoami) to see all those jobs when running or completed).

What the user must do and values to be set, etc

units.tsv

The user has to make a file that lists all the different units of each sample. Typically, different units are different fastq files that hold sequences from a single biological sample. For example, the same sample might have been sequenced on different lanes, or on different machines, or it might have been prepared in more than one library prep. All that can be accounted for. The units.tsv file holds a lot of necessary information for each sample. Here is a link to the units.tsv file used for the .test data set:

https://github.com/eriqande/mega-non-model-wgs-snakeflow/blob/main/.test/config/units.tsv

All columns are required. Study it!

chromosomes.tsv

The user must make this file that tells the workflow about the different fully assembled chromosomes in the reference genome. Here is an example:

https://github.com/eriqande/mega-non-model-wgs-snakeflow/blob/main/.test/config/chromosomes.tsv

It is a TAB-separated values file. You have to follow the format exactly. The file can easily be made from the .fai file for the genome. You can modify the helper script at workflow/prepare/make_chromosomes_and_scaffolds.R to prepare this file for yourself. Here is a link to that file if you want to see the contents:

https://github.com/eriqande/mega-non-model-wgs-snakeflow/blob/main/workflow/prepare/make_chromosomes_and_scaffolds.R
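
If you just want a quick look at the raw material for this file, the chromosome names and lengths are the first two columns of the .fai index. For example (the genome path here is just a placeholder for wherever your indexed genome lives):

# a .fai has columns: name, length, offset, linebases, linewidth;
# the name and length columns are what chromosomes.tsv gets built from
cut -f1,2 resources/genome.fasta.fai | head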

The workflow operates a lot on individual chromosomes to allow parallelization, so this is critical information.

However, in some cases, your genome might be in a lot of small scaffolds, and you have no true chromosomes. Or, you might have a bunch of scaffolds with stupid characters in their names, like a |, that will break most shell scripts. In such cases, you can put everything into the scaffold_groups file and leave the chromosomes.tsv file as just the TAB-separated column names (see .test/config/empty-chromosomes.tsv as an example).

scaffold_groups.tsv

The user must make this file that tells snakemake which collections of scaffolds should be merged together into scaffold groups.
Here is what the scaffold_groups.tsv file looks like:

https://github.com/eriqande/mega-non-model-wgs-snakeflow/blob/main/.test/config/scaffold_groups.tsv

You have to follow the format exactly. Also, the order of the scaffolds in this file must match the order of the scaffolds in the reference genome EXACTLY.

This file can also be made from the .fai file for the genome using the helper script at workflow/prepare/make_chromosomes_and_scaffolds.R:

https://github.com/eriqande/mega-non-model-wgs-snakeflow/blob/main/workflow/prepare/make_chromosomes_and_scaffolds.R

If you don’t have any scaffolds, this file can be left as just the TAB-separated column names, as in .test/config/empty-scaffold_groups.tsv.

config.yaml

The user must make a config.yaml file. It serves a lot of purposes, like:

  • giving the relative path to the units.tsv, chromosomes.tsv, and scaffold_groups.tsv files. (When we say “relative” path here we mean relative to the top level of the repo directory where the snakemake command will be given.)
  • Giving the URL from which the reference genome can be downloaded. (If there is not a URL for it, then just copy the reference FASTA file to resources/genome.fasta).
  • The location of the adapter file for Trimmomatic must be specified. The correct one to use depends on what sequencing platform your data come from.
  • The BQSR parameters as described in the next section.
  • The google drive directory to copy results back to, if desired. See below.
  • Some parameters can be set here; however, some of the YAML blocks here are vestigial and need to be cleaned up. Not all of these options actually change things. For now, ask Eric for help…

The current config.yaml file in the test directory can be viewed at:

https://github.com/eriqande/mega-non-model-wgs-snakeflow/blob/main/.test/config/config.yaml

As mentioned above, there is a little bit of cruft in it that should stay in there for now, but which ought to be cleaned up, ultimately.

Bootstrapped Base Quality Score Recalibration

The GATK folks insist that Base Quality Score Recalibration (BQSR) is a very important step—even for non-model organisms that don’t have a well-curated data base of known variants. Back in the days of the Illumina HiSeq, I had been somewhat skeptical of the importance of BQSR for non-model organisms. However, now that sequencing is mostly happening on NovaSeq machines, my views have changed somewhat, because these sequencing machines only deliver four possible base quality scores, and one of those, #, is given to uncalled bases. Observe:

# here we extract all the base quality scores recorded by the
# machine for the fastq file 
# .test/data/fastq/T199967_T2087_HY75HDSX2_L001_R1_001.fastq.gz
library(tidyverse)

bqs <- tibble(
  base_quals = read_lines(".test/data/fastq/T199967_T2087_HY75HDSX2_L001_R1_001.fastq.gz")
) %>%
  mutate(line = 1:n()) %>%
  filter(line %% 4 == 0) %>%
  mutate(singles = map(base_quals, strsplit, split = "")) %>%
  pull(singles) %>%
  unlist(.)

tibble(
  base_quals = bqs
) %>%
  count(base_quals) %>%
  mutate(
    ascii = map_int(base_quals, utf8ToInt),
    PHRED = ascii - 33
  )
## # A tibble: 4 × 4
##   base_quals       n ascii PHRED
##   <chr>        <int> <int> <dbl>
## 1 #               17    35     2
## 2 ,            68212    44    11
## 3 :            77998    58    25
## 4 F          1207941    70    37

So, these new machines produce a whole lot of sequence, but they really only provide four possible values for the base qualities, which correspond to these error probabilities:

10 ^{-c(2, 11, 25, 37) / 10}
## [1] 0.6309573445 0.0794328235 0.0031622777 0.0001995262

Given this, some sort of empirical recalibration of the base quality scores seems like it is probably a very good idea.

The problem with this for non-model organisms is that such species don’t have a well-known data base of variants that can be used for the base quality score recalibration. In that case, the GATK folks recommend using “high-confidence” variants as the known variant set, and then possibly doing several rounds of this. Doing so requires a boatload of computation, but it might be worthwhile, so this workflow has been set up to do that.

Of course, it is hard to know what is meant by a “high-confidence” variant. The purpose of having known variants is so that they don’t get included in making an empirical model of base quality scores. Basically, GATK goes through the BAMs and assumes that any base that is not the reference base is an error; it tallies those up, broken down by a number of sequence and read-group features, and uses that to estimate what the true sequencing error rate is. Of course, you don’t want reference mismatches at sites that are actually variants to contribute to this calculation—because those mismatches are not actually errors. But we are in a chicken-and-egg situation here—we don’t know which variants are real and which are not!

I am of the mind that it is good to include as many known variants as you can (because, otherwise, the BQSR model will tell us that there are a lot of errors). But, on the other hand, you don’t want to include dubious sites in that known variant set, because mismatches at those sites are likely actual errors and should be counted.

It is important to recognize that the known variants are only excluded from the BQSR model-building stage. Those sites themselves will still get recalibrated once the model is made. And, if there isn’t a strong pattern to the mismatches that occur because some true variants were left out of the known variants data base, the consequences might not be too bad.

All this is to say that this is a pretty inexact science. Nonetheless, we have a few parameters that can be set in the config to select the “known variants” set:

  • bqsr_maf. Sites with a minor allele frequency less than this will not be included in the known variants set. The idea is that you are more confident that a variant is real if it was actually observed in more than one or just a few individuals. Not only that, but of all the true variants found in an individual, only a fraction somewhat less than 2 * bqsr_maf will be discarded because of this filter. Accordingly, the default value in the test config is 0.0225, which means that about 3.5% of the mismatches at true variant sites in any individual will still be counted as mismatches. That seems OK to me.

  • bqsr_qual. Only sites with a variant quality (QUAL) score equal to or greater than this value will be retained in the known-variants set. For the test data set, this is set at 37, but should be larger (perhaps 100) for data sets with more depth and more individuals.

  • bqsr_qd. Only sites with a QD (a variant quality score normalized by the number of reads) equal to or greater than this value will be retained. INFO/QD is calculated by GATK. If it is low, it means that a variant has been called, but the base quality scores for that variant are low on most, if not all, of the reads supporting it. The value in the .test data set is 15, but the effect of that should be investigated.
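
Just to make those three thresholds concrete, here is roughly the sort of bcftools filter they amount to, using the test-config values. This is only an illustration, not the workflow’s actual rule, and the input and output paths are placeholders:

# illustration only: select "known variants" for BQSR using the test-config
# thresholds bqsr_maf=0.0225, bqsr_qual=37, bqsr_qd=15 (paths are placeholders)
bcftools view \
  -i 'MAF >= 0.0225 && QUAL >= 37 && INFO/QD >= 15' \
  -O z -o known-variants-for-bqsr.vcf.gz \
  results/bqsr-round-0/bcf/all.bcf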

What values should be chosen?

This is still an area where there are no solid answers; however, it can be helpful to look at the distribution of the QUAL and the QD values. The workflow is set up to make it easy to get those values and do all the quality control steps in the workflow, and then you can investigate the results to choose values of the three bqsr_* config parameters listed above. That is done by setting bqsr_rounds: 0 in the config.yaml, and then choosing the dest_qc_0 and the dest_bqsr_histos_0 rules as targets for the first run. This will do all the mapping and qc and one round of variant calling and filtering and then it will compute QUAL and QD histograms. For the test data set on a local set of cores that looks like this:

snakemake --cores 6  --use-conda  dest_bqsr_histos_0 dest_qc_0 --configfile .test/config/config.yaml

For determining the effect of the values of bqsr_qual and bqsr_qd, after this has run you can investigate the files:

results/bqsr-round-0/qc/bqsr_relevant_histograms/qd.tsv
results/bqsr-round-0/qc/bqsr_relevant_histograms/qual.tsv

Versions of those files have been stored in this repo so that we can see some example R code for tallying them up:

library(tidyverse)  # provides read_tsv() and the dplyr verbs used below

quals <- read_tsv("README_files/qual.tsv", col_names = c("value", "n")) %>%
  arrange(value) %>%
  mutate(
    fract = n / sum(n),
    cumul = cumsum(fract)
  )
qds <- read_tsv("README_files/qd.tsv", col_names = c("value", "n")) %>%
  arrange(value) %>%
  mutate(
    fract = n / sum(n),
    cumul = cumsum(fract)
  )

Now, look at the cumulative distribution of qd values:

ggplot(qds, aes(x = value)) +
  geom_col(aes(y = fract)) +
  geom_point(aes(y = cumul)) +
  geom_line(aes(y = cumul)) +
  xlab("INFO/QD value")

And the distribution of qual values:

g <- ggplot(quals, aes(x = value)) +
  geom_col(aes(y = fract)) +
  geom_point(aes(y = cumul)) +
  geom_line(aes(y = cumul)) +
  xlab("QUAL value")

g

That is a little harder to see, so we can limit it to QUAL values less than 100:

g +
  xlim(0, 100)
## Warning: Removed 126 rows containing missing values (`position_stack()`).

## Warning: Removed 1 rows containing missing values (`geom_col()`).

## Warning: Removed 126 rows containing missing values (`geom_point()`).

## Warning: Removed 126 rows containing missing values (`geom_line()`).

Obviously, with this very small test data set, it is hard to interpret these results. But, you can sort of see why it seems like for this test data set, the values of bqsr_qd = 15 and bqsr_qual = 37 might be reasonable.

You really should check these values for your own data set and select values for those parameters accordingly.

Offloading results to google drive

Set the config parameter rclone_base to the directory on google drive that you want created (if necessary) and everything copied to. For Eric, this looks like:

rclone_base: "gdrive-rclone:Bioinformatic-Project-Archives/mega-flow-test"

for the test data set.

I have an rclone profile set up to copy results back to google drive. After you have finished a run, you can run snakemake on the send_to_gdrive target, like this (for SEDNA):

snakemake -np --profile  hpcc-profiles/slurm/sedna  send_to_gdrive --configfile config/config.yaml

And it will print out a command line that does:

  1. copies multiqc.html and the bcftools stats to qc_summaries directories
  2. tarballs up the qc, benchmarks, and logs directories in all bqsr-round-X directories
  3. uses rclone to copy:
    • the bams and gvcfs (and their indexes) from the final round of BQSR,
    • the BCF files (and their indexes) from the final and all previous rounds of BQSR
    • the qc and benchmark tarballs from every bqsr-round-X directory,
    • the qc_summaries directory,
    • the resources directory with the indexed genome
    • the bq_variants and bq_recal_tables
    • Finally, if there is a data directory present at the top level we assume that is what holds the original fastqs, and we copy all that back, too!
    • It will also copy the entire contents of the subdirectory that contains the units file. Accordingly, when you set your run up, you should put your config.yaml, your chromosomes and scaffold groups, etc. all in that subdirectory along with units.tsv.

For example, on the test data set it prints out something like this (but not exactly this, because I keep tweaking it):

mkdir -p results/qc_summaries/bqsr-round-{0..2};  
for i in {0..2}; do 
  cp -r results/bqsr-round-$i/qc/{multiqc.html,bcftools_stats/*.txt} results/qc_summaries/bqsr-round-$i/; 
done;  
for i in {0..2}; do 
tar -cvf results/bqsr-round-$i/qc.tar results/bqsr-round-$i/qc; gzip results/bqsr-round-$i/qc.tar;  
done;  
for i in {0..2}; do 
tar -cvf results/bqsr-round-$i/benchmarks.tar results/bqsr-round-$i/benchmarks; gzip results/bqsr-round-$i/benchmarks.tar; 
done; 
for i in {0..2}; do 
tar -cvf results/bqsr-round-$i/logs.tar results/bqsr-round-$i/logs; gzip results/bqsr-round-$i/logs.tar; 
done; 

rclone copy --dry-run  --drive-stop-on-upload-limit . gdrive-rclone:Bioinformatic-Project-Archives/mega-flow-test   \
  --include='config/**'  \
  --include='results/qc_summaries/**'  \
  --include='results/bqsr-round-{0,1,2}/{qc,benchmarks,logs}.tar.gz'  \
  --include='results/bqsr-round-{0,1,2}/{bcf,bq_recal_tables,bq_variants}/**'  \
  --include='resources/**'  --include='data/**'  --include='results/bqsr-round-2/gvcf/*' \
  --include='results/bqsr-round-2/recal/*'

But it prints it all out on one line. You then need to copy that line into the Unix terminal so you can execute it and provide rclone with your password.

Google Drive Directory Structure

The resulting directory structure on google drive then looks like this—an example from a case where one round of BQSR was done:

└── Archive-Directory  # this will be named different things for different runs
    ├── config # sample meta data and info for doing the run
    ├── data   # The fastqs
    │   ├── NVS144B_R1_Columbus
    │   └── NVS144B_R2_Columbus
    ├── resources # the indexed genome it was mapped to
    │   └── adapters # adapter seqs used for Trimmomatic
    └── results
        ├── bqsr-round-0
        │   └── bcf  # the BCF files with no BQSR. pass = passes filters, maf is minor allele freq
        ├── bqsr-round-1
        │   ├── bcf  # more BCF files. These are after a round of BQSR.
        │   ├── bq_recal_tables # BQSR info for each individual
        │   ├── bq_variants  # the variants used as "known" for BQSR round 1
        │   ├── gvcf  # the gVCF files after this round of BQSR
        │   └── recal  # the bam files after this round of BQSR
        └── qc_summaries  # multiqc summaries and bcftools stats for variants for each round
            ├── bqsr-round-0
            └── bqsr-round-1

Within each of the results/bqsr-round-X directories you will also find tarballs of all the log files and all the benchmark files.

You will also find a directory like results/bqsr-round-0/indel_realigned.

This contains subdirectories of indel-realigned bam files. For example:

results/bqsr-round-0/indel_realigned
├── __ALL
├── spp-carnatus
├── spp-chrysomelas
├── spp-diaconus
├── spp-flavidus
├── spp-melanops
├── spp-mystinus
└── spp-serranoides

In the above case, the subdirectory __ALL has BAM files that were realigned around indels discovered in all samples, while the subdirectory spp-carnatus includes only the BAM files of the individuals listed as S. carnatus (according to the indel_grps file), and the indel realignment for those was done only with indels found in the individuals in that directory (i.e., that was species-specific indel realignment).

Assumptions

  • Paired end

Things fixed or added relative to JK’s snakemake workflow

  • fastqc on both reads
  • don’t bother with single end
  • add adapters so illumina clip can work
  • benchmark each rule
  • use genomicsDBimport
  • allow for merging of lots of small scaffolds into genomicsDB

Stepwise addition of new samples to the Workflow (and the Genomics Data bases)

I originally made a scheme whereby we could start with a units.tsv file that maybe only has six samples in it, and run that to completion. Then you could update the units.tsv file to have two additional samples in it, and that would properly update the genomics databases. This was done by a system of writing Genomics_DBI receipts that told us what was already in there.

I ended up never using that feature, and the receipts system complicated the workflow and caused endless headaches during some reruns. So I completely tossed it. You can’t add anything to the genomics databases now.
