This repository contains the scripts necessary to reproduce the analysis of genome-wide germline mutation patterns described in Carlson et al. (2017).
This pipeline can be applied to analyze mutation patterns from any human genomic data mapped to the hg19/GRCh37 reference genome. The only required input is one or more indexed .vcf files containing the standard fields (chromosome, position, reference allele, alternate allele, etc.) for the single-nucleotide variants in a sample. Individual genotypes for each site are not necessary.
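If your variant calls are not yet compressed and indexed, a minimal sketch of preparing one input file is shown below (assumes htslib's `bgzip` and `tabix` are available; `chr20.vcf` is a hypothetical file name):

```
# compress the VCF with bgzip and build a tabix index so it can be read by position
bgzip chr20.vcf              # writes chr20.vcf.gz
tabix -p vcf chr20.vcf.gz    # writes chr20.vcf.gz.tbi
```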
- Create a project directory at `/path/to/project/`, where we will store this codebase and subsequent output.
- Enter your project directory and clone this repository with the command `git clone https://github.com/carjed/smaug-genetics.git`.
- Edit the `/path/to/project/smaug-genetics/_config.yaml` file to specify the absolute paths to the input directory, project folder, and other parameters.
- Run `bash download_ref_data.sh` to download the necessary reference data.
- Run `perl data_mgmt/data_prep/vcf_to_summary.pl` to extract the positions and alleles of singleton (or common) variants.
- Run `perl data_mgmt/data_prep/augment_batch.pl` to annotate these summary files with local K-mer sequence context and to generate a data file of motif counts in the reference genome.
- Run `R/mod_shell.r` to begin the analysis. This can be run from the command line with `Rscript`, but we recommend running it within an R session so you can troubleshoot any errors without having to reload the data each time (memory usage may be high [~30-40GB]). A condensed sketch of the setup commands up to this point appears after this list.
- The analyses called by `R/mod_shell.r` can be categorized as sequence-only or sequence+features. The first time you run this script, the sequence-only analyses will run; if you try to run any of the sequence+features analyses, you will be notified that the required models have not yet been run. Running them requires some additional processing, handled by scripts in the `data_mgmt` directory.
- Follow the steps in `data_mgmt/per-site_dp/README.md` to extract depth information from .glf files.
- Follow the steps in `data_mgmt/logit_scripts/README.md` to generate the model input data and run the models.
- Follow the steps in `data_mgmt/process_predicted/README.md` to parse the model output.
- Assuming the above steps have completed successfully, running `R/mod_shell.r` again should execute the final set of analyses.
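Taken together, the setup steps above amount to a shell session along the following lines (paths are placeholders, the config edit is shown with a generic editor, and the script invocations are shown without any options they may accept):

```
mkdir -p /path/to/project && cd /path/to/project
git clone https://github.com/carjed/smaug-genetics.git
cd smaug-genetics
nano _config.yaml                            # set input dir, project dir, and other parameters
bash download_ref_data.sh                    # download reference data
perl data_mgmt/data_prep/vcf_to_summary.pl   # summarize singleton (or common) variants
perl data_mgmt/data_prep/augment_batch.pl    # add K-mer context and motif counts
Rscript R/mod_shell.r                        # first pass: sequence-only analyses
```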
All analyses were run on a server running Ubuntu 14.04. We recommend at least 40GB of RAM and 2-3TB of free disk space for storing output and intermediate files (this disk space recommendation does NOT include the space used by input files).
Many of the computationally intensive elements of this pipeline are run using custom batch files processed with the slurm workload management system, which queues and distributes jobs to run simultaneously across multiple server nodes. At peak efficiency, our system used 300-400 cores simultaneously, but mileage may vary according to your specific hardware/software configuration, resource availability, etc. For this reason, we do not have a specific recommendation for a minimum number of processors; the more, the better!
Note that the slurm batch files generated in this pipeline are included here primarily for reference purposes, but will likely need to be customized to your unique server environment or ported to your workload management system of choice in order to run these steps successfully.
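As a rough illustration of the sort of customization involved, a generic slurm batch header might look like the following; the partition name, resource requests, and the placeholder command are site-specific examples, not values taken from this repository:

```
#!/bin/bash
#SBATCH --job-name=smaug_step        # hypothetical job name
#SBATCH --partition=standard         # replace with a partition on your cluster
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=24:00:00
#SBATCH --output=slurm/%j.out        # logs land under PROJDIR/slurm/

# load site-specific modules or adjust PATH here, then run the command for this job
srun ./pipeline_step.sh              # placeholder for the actual pipeline command
```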
This pipeline depends on the following software and libraries:

- R (v3.2)
  - ggplot2
  - dplyr
  - tidyr
  - broom
  - RColorBrewer
  - MASS
  - speedglm
  - boot
  - devtools
  - psych
  - lmtest
  - fmsb
  - cowplot
  - hexbin
  - stringr
  - grid
  - gridExtra
  - gtable
- perl (v5.18)
  - POSIX
  - File::Basename
  - File::Path qw(make_path)
  - Benchmark
  - FindBin
  - YAML::XS 'LoadFile'
  - Getopt::Long
  - Math::Round
  - Pod::Usage
  - Compress::Zlib
  - IO::Compress::Gzip
- bash (v4.2)
- bcftools v1.3.1
- vcftools
- bedtools v2.22.0
- samtools-hybrid v0.1.7
  - (this is a modified version of samtools with better support for .glf files)
- bigWigToWig
- wigToBigWig
We leave it to the user to download these programs and any missing libraries and ensure they run properly. The `R/get_functions.r` script includes a helper function named `usePackage()` that automatically installs and loads R packages from the CRAN repository.
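If you would rather install the R dependencies up front instead of relying on `usePackage()`, something like the following should work (assumes R is on your PATH and CRAN is reachable; `grid` ships with base R and does not need to be installed separately):

```
R --no-save <<'EOF'
pkgs <- c("ggplot2", "dplyr", "tidyr", "broom", "RColorBrewer", "MASS", "speedglm",
          "boot", "devtools", "psych", "lmtest", "fmsb", "cowplot", "hexbin",
          "stringr", "gridExtra", "gtable")
# install only the packages that are not already present
install.packages(setdiff(pkgs, rownames(installed.packages())),
                 repos = "https://cran.r-project.org")
EOF
```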
This pipeline requires several external data files, almost all of which are publicly available. A script for downloading and formatting these files can be found in this repository at `download_ref_data.sh`. These files include:
- hg19/GRCh37 reference genome (all other files must be mapped to this build)
- chromosome sizes for reference genome
- RefSeq v69 exon positions
- Histone modifications (broad peaks) for peripheral blood mononuclear cells from Roadmap Epigenomics Project
- Lamin B1 domains
- DNase Hypersensitive Sites
- CpG island coordinates
- deCODE sex-averaged recombination rates
- Replication timing profile of lymphoblastoid cell lines
- de novo mutations from two studies:
The input and project directories are organized as follows:

```
INPUTDIR/
|---vcfs/ [initial vcf directory]
|---summaries/ [initial summary files]

PROJDIR/
|---slurm/ [generated slurm batch files]
|---images/ [figures output by R scripts]
|---reference_data/ [external files for analysis]
|---smaug-genetics/ [this repository]
|   |---data_mgmt/
|   |   |---lib/
|   |   |---data_prep/
|   |   |---logit_scripts/
|   |   |---per-site_dp/
|   |   |---process_predicted/
|   |---R/
|   |---sandbox/ [old files]
|---output/
|   |---{K}bp_{N}k/ [augmented summary info]
|   |---glf_depth/
|   |   |---meandp/
|   |   |---chr*/ [intermediate depth files]
|   |---logmod_data/
|   |   |---coefs/ [estimated effects of genomic features]
|   |   |---motifs/ [per-subtype model input]
|   |---predicted/ [predicted single-base mutation rates]
|   |   |---tracks/ [OPTIONAL: predicted rates in .bw format]
|   |---slurm/ [slurm error logs]
```
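For orientation, a minimal sketch of creating the input-side directories by hand is shown below; the `INPUTDIR` and `PROJDIR` values are placeholders and should match the paths set in `_config.yaml`, and the remaining directories are either part of the cloned repository or generated as output:

```
# placeholders: substitute your actual input and project locations, matching _config.yaml
INPUTDIR=/path/to/input
PROJDIR=/path/to/project

mkdir -p "$INPUTDIR"/vcfs "$INPUTDIR"/summaries   # initial VCFs and summary files
mkdir -p "$PROJDIR"                               # repository is cloned here (see setup steps above)
```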