The SMAUG (Singleton Mutation Analysis Utility with Graphics) Pipeline

Summary

This repository contains the scripts necessary to reproduce the analysis of genome-wide germline mutation patterns described in Carlson et al. (2017).

This pipeline can be applied to analyze mutation patterns in any human genomic data mapped to the hg19/GRCh37 reference genome. The only required input is one or more indexed .vcf files containing the standard fields (chromosome, position, reference allele, alternate allele, etc.) for the single-nucleotide variants in a sample. Individual genotypes for each site are not necessary.
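For example, assuming bcftools and the htslib bgzip utility are installed, an input VCF could be compressed and indexed as in the following sketch (chr20.sites.vcf is a placeholder filename):

```bash
# Compress and index a sites-only VCF so it can be used as pipeline input.
# chr20.sites.vcf is a placeholder; substitute your own file.
bgzip chr20.sites.vcf                  # produces chr20.sites.vcf.gz
bcftools index -t chr20.sites.vcf.gz   # produces a .tbi index
```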

Outline

  1. Create a project directory at /path/to/project/ where we will store this codebase and subsequent output.

  2. Enter your project directory and clone this repository with the command: git clone https://github.com/carjed/smaug-genetics.git.

  3. Edit the /path/to/project/smaug-genetics/_config.yaml file to specify the absolute paths to the input directory, project folder, and other parameters.

  4. Run bash download_ref_data.sh to download the necessary reference data.

  5. Run perl data_mgmt/data_prep/vcf_to_summary.pl to extract positions and alleles of singleton (or common) variants.

  6. Run perl data_mgmt/data_prep/augment_batch.pl to annotate these summary files with local K-mer sequence context and to generate a data file of motif counts in the reference genome.

  7. Run R/mod_shell.r to begin the analysis and additional processing. This can be run from the command line with Rscript, but we recommend running it within an interactive R session so you can troubleshoot errors without reloading the data each time (note that this step may use roughly 30-40 GB of memory).

  8. The analyses called by R/mod_shell.r can be categorized as either sequence-only or sequence+features. The first time you run this script, the sequence-only analyses will run; if you try to run any of the sequence+features analyses, you will be notified that the necessary models have not yet been run. Running them requires some additional processing, which is handled by scripts in the data_mgmt directory.

  9. Follow the steps in data_mgmt/per-site_dp/README.md to extract depth information from glf files.

  10. Follow the steps in data_mgmt/logit_scripts/README.md to generate model input data and run the models.

  11. Follow the steps in data_mgmt/process_predicted/README.md to parse the model output.

  12. Assuming the above steps have completed successfully, running R/mod_shell.r will execute the final set of analyses. A condensed command-line sketch of steps 1-7 is shown below.
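Putting the first seven steps together, a minimal end-to-end run might look like the following sketch (/path/to/project is a placeholder, and the paths and parameters in _config.yaml still need to be set before the downstream steps will work):

```bash
# Condensed sketch of steps 1-7; /path/to/project is a placeholder.
mkdir -p /path/to/project
cd /path/to/project
git clone https://github.com/carjed/smaug-genetics.git
cd smaug-genetics
# (edit _config.yaml here to set absolute paths and other parameters)
bash download_ref_data.sh                     # step 4: fetch reference data
perl data_mgmt/data_prep/vcf_to_summary.pl    # step 5: extract variant summaries
perl data_mgmt/data_prep/augment_batch.pl     # step 6: add K-mer context
Rscript R/mod_shell.r                         # step 7: sequence-only analyses
```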

Dependencies

Hardware

All analyses were run on a server running Ubuntu 14.04. We recommend at least 40 GB of RAM and 2-3 TB of free disk space for storing output and intermediate files (this disk space recommendation does NOT include the space used by input files).

Many of the computationally intensive elements of this pipeline are run using custom batch files processed with the slurm workload management system, which queues and distributes jobs to run simultaneously across multiple server nodes. At peak efficiency, our system utilized 300-400 cores simultaneously, but mileage may vary according to your specific hardware/software configuration, resource availability, and so on. For this reason, we do not have a specific recommendation for a minimum number of processors: the more, the better.

Note that the slurm batch files generated by this pipeline are included here primarily for reference purposes; they will likely need to be customized to your server environment, or ported to your workload management system of choice, in order to run these steps successfully. A generic example is sketched below.
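As a purely illustrative sketch (the job name, resource requests, and payload here are hypothetical; the batch files actually generated by the pipeline will differ), a slurm batch file generally has this shape:

```bash
#!/bin/bash
# Hypothetical slurm batch file; adjust partition, memory, and time
# limits to match your cluster's configuration.
#SBATCH --job-name=smaug_step        # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=02:00:00
#SBATCH --output=slurm/slurm-%j.out  # error/output logs land in slurm/

# Example payload; substitute the command for the step being run.
srun perl data_mgmt/data_prep/vcf_to_summary.pl
```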

Software

  • R (v3.2)
    • ggplot2
    • dplyr
    • tidyr
    • broom
    • RColorBrewer
    • MASS
    • speedglm
    • boot
    • devtools
    • psych
    • lmtest
    • fmsb
    • cowplot
    • hexbin
    • stringr
    • grid
    • gridExtra
    • gtable
  • perl (v5.18)
    • POSIX
    • File::Basename
    • File::Path qw(make_path)
    • Benchmark
    • FindBin
    • YAML::XS 'LoadFile'
    • Getopt::Long
    • Math::Round
    • Pod::Usage
    • Compress::Zlib
    • IO::Compress::Gzip
  • bash (v4.2)
  • bcftools v1.3.1
  • vcftools
  • bedtools v2.22.0
  • samtools-hybrid v0.1.7
    • (this is a modified version of samtools with better support for .glf files)
  • bigWigToWig
  • wigToBigWig

We leave it to the user to download these programs and any missing libraries and to ensure they run properly. The R/get_functions.r script includes a helper function named usePackage() that automatically installs and loads R packages from the CRAN repository. A quick shell check for the command-line tools is sketched below.
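As a hedged sanity check that the command-line tools are installed and on your PATH (version flags vary between tools, so only two are queried here):

```bash
# Report any required command-line tools that are missing from PATH.
for tool in bcftools vcftools bedtools samtools-hybrid bigWigToWig wigToBigWig; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
bcftools --version | head -n 1   # expect v1.3.1 or compatible
bedtools --version               # expect v2.22.0 or compatible
```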

Reference data

This pipeline requires the use of several external data files, almost all of which are publicly available. The download_ref_data.sh script in this repository downloads and formats these files.

Project directory structure

INPUTDIR/
|---vcfs/                       [initial vcf directory]
|---summaries/                  [initial summary files]

PROJDIR/
|---slurm/                      [generated slurm batch files]
|---images/                     [figures output by R scripts]
|---reference_data/             [external files for analysis]
|---smaug-genetics/             [this repository]
    |---data_mgmt/
        |---lib/
        |---data_prep/
        |---logit_scripts/
        |---per-site_dp/
        |---process_predicted/
    |---R/
    |---sandbox/                  [old files]
|---output/
    |---{K}bp_{N}k/               [augmented summary info]
    |---glf_depth/
        |---meandp/
        |---chr*/                   [intermediate depth files]
    |---logmod_data/
        |---coefs/                  [estimated effects of genomic features]
        |---motifs/                 [per-subtype model input]
    |---predicted/                [predicted single-base mutation rates]
        |---tracks/                 [OPTIONAL: predicted rates in .bw format]
    |---slurm/                    [slurm error logs]
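The pipeline scripts may create some of these directories themselves, but if you prefer to pre-create the output skeleton, the following convenience sketch does so (PROJDIR is a placeholder for the project path set in _config.yaml):

```bash
# Convenience sketch: pre-create the output skeleton shown above.
# PROJDIR is a placeholder for the project path set in _config.yaml.
PROJDIR=/path/to/project
mkdir -p "$PROJDIR"/{slurm,images,reference_data}
mkdir -p "$PROJDIR"/output/{glf_depth/meandp,predicted/tracks,slurm}
mkdir -p "$PROJDIR"/output/logmod_data/{coefs,motifs}
```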
