DNA sequencing analysis notes from Ming Tang
Latest commit caa8c9d Jan 2, 2019



Databases for variants

Important paper: DNA damage is a major cause of sequencing errors, directly confounding variant identification

However, in this study we show that false positive variants can account for more than 70% of identified somatic variations, rendering conventional detection methods inadequate for accurate determination of low allelic variants. Interestingly, these false positive variants primarily originate from mutagenic DNA damage which directly confounds determination of genuine somatic mutations. Furthermore, we developed and validated a simple metric to measure mutagenic DNA damage and demonstrated that mutagenic DNA damage is the leading cause of sequencing errors in widely-used resources including the 1000 Genomes Project and The Cancer Genome Atlas.

Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

How to represent sequence variants

Sequence Variant Nomenclature from Human Genome Variation Society

dbSNP IDs are not unique?

Oh God, why are people still using dbSNP IDs as though they're unique identifiers?

— Daniel MacArthur (@dgmacarthur) July 27, 2016
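A minimal sketch of one way around this: key variants by chromosome, position, reference and alternate allele rather than by rsID alone, since one rsID can cover more than one alternate allele. The records, the `variant_key` helper and the key format are all hypothetical and only for illustration; they are not real dbSNP entries.

```python
from collections import defaultdict

# Hypothetical records: (rsID, chrom, pos, ref, alt). Values are made up
# for illustration and are not real dbSNP entries.
records = [
    ("rs0000001", "1", 1000, "A", "G"),
    ("rs0000001", "1", 1000, "A", "T"),  # same rsID, different alternate allele
    ("rs0000002", "2", 5000, "C", "T"),
]


def variant_key(chrom, pos, ref, alt):
    """Build an allele-specific key for one variant (hypothetical format)."""
    return f"{chrom}:{pos}:{ref}>{alt}"


# Collect every allele-specific key seen under each rsID.
alleles_by_rsid = defaultdict(set)
for rsid, chrom, pos, ref, alt in records:
    alleles_by_rsid[rsid].add(variant_key(chrom, pos, ref, alt))

# rsIDs that map to more than one allele are unsafe as unique identifiers.
ambiguous = {
    rsid: sorted(keys) for rsid, keys in alleles_by_rsid.items() if len(keys) > 1
}
print(ambiguous)
```

Joining two call sets on the allele-specific key avoids silently merging different alleles that share an rsID.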

The Evolving Utility of dbSNP

See the post: dbSNP (build 147) exceeds a ridiculous 150 million variants

In the early days of next-generation sequencing, dbSNP provided a vital discriminatory tool. In exome sequencing studies of Mendelian disorders, any variant already present in dbSNP was usually common, and therefore unlikely to cause rare genetic diseases. Some of the first high-profile disease gene studies therefore used dbSNP as a filter. Similarly, in cancer genomics, a candidate somatic mutation observed at the position of a known polymorphism typically indicated a germline variant that was under-called in the normal sample. Again, dbSNP provided an important filter.

Now, the presence or absence of a variant in dbSNP carries very little meaning. The database includes over 100,000 variants from disease mutation databases such as OMIM or HGMD. It also contains some appreciable number of somatic mutations that were submitted there before databases like COSMIC became available. And, like any biological database, dbSNP undoubtedly includes false positives.

Thus, while the mere presence of a variant in dbSNP is a blunt tool for variant filtering, dbSNP’s deep allele frequency data make it incredibly powerful for genetics studies: it can rule out variants that are too prevalent to be disease-causing, and prioritize ones that are rarely observed in human populations. This discriminatory power will only increase as ambitious large-scale sequencing projects like CCDG make their data publicly available.

Tips and lessons learned during my DNA-seq data analysis journey.

  1. Allele frequency (AF)
    Allele frequency, or gene frequency, is the proportion of a particular allele (variant of a gene) among all allele copies being considered. It can be formally defined as the percentage of all alleles at a given locus on a chromosome in a population gene pool represented by a particular allele. AF is affected by copy-number variation, which is common in cancers. Tools such as PyClone take tumor purity and copy-number data into account to calculate Cancer Cell Fractions (CCFs).

  2. "For SNVs, we are interested in genotypes 0/1 and 1/1 for tumor and 0/0 for normal. The 1/1 genotype is very rare.
    It requires the same mutation to occur at the same place in two sister chromosomes, which is very rare. One possible way to get 1/1 is deletion of one chromosome and duplication of the mutated chromosome." Quote from Siyuan Zheng.

  3. "Mutect analysis on the TCGA samples finds around 5000 ~ 8000 SNVs per sample." Quote from Siyuan Zheng.

  4. Cell lines might be contaminated or mislabeled. The Great Big Clean-Up

  5. Tumor samples are not pure; you will always have stromal cells and infiltrating immune cells in the tumor bulk. When you analyze the data, keep this in mind.

  6. The devil is in the 0-based and 1-based coordinate systems! Make sure you know which system your file is using:

Credit to Vince Buffalo. Also, read this post and this post.
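A small sketch of the conversion between the two systems, for point 6 above. BED uses 0-based, half-open intervals, while VCF and GFF use 1-based, closed intervals. The function names are hypothetical helpers, not part of any library.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based half-open interval [start, end) to 1-based closed."""
    # Zero-length BED intervals (start == end) are rejected here for simplicity.
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open."""
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def bed_length(start, end):
    """Length of a 0-based half-open interval."""
    return end - start


def one_based_length(start, end):
    """Length of a 1-based closed interval."""
    return end - start + 1


# The first 100 bases of a chromosome in both systems.
print(bed_to_one_based(0, 100))
print(one_based_to_bed(1, 100))
```

Only the start coordinate shifts; the end stays the same number, which is why off-by-one errors are easy to miss.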

Also read The UCSC Genome Browser Coordinate Counting Systems
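Going back to points 1 and 5 above, here is a rough sketch of how tumor purity and copy number shift the observed allele frequency. It is a simple point estimate that assumes a diploid normal and a known number of mutated copies; it is not PyClone's model, which is Bayesian. All function and parameter names are hypothetical.

```python
def expected_vaf(purity, tumor_cn, mutated_copies, ccf=1.0, normal_cn=2):
    """Expected variant allele frequency in a bulk tumor sample.

    purity:         fraction of cells in the sample that are tumor cells
    tumor_cn:       total copy number at the locus in tumor cells
    mutated_copies: number of copies carrying the mutation per mutated cell
    ccf:            fraction of tumor cells carrying the mutation
    normal_cn:      copy number in normal cells (assumed diploid)
    """
    mutant_alleles = purity * ccf * mutated_copies
    total_alleles = purity * tumor_cn + (1 - purity) * normal_cn
    return mutant_alleles / total_alleles


def ccf_from_vaf(vaf, purity, tumor_cn, mutated_copies=1, normal_cn=2):
    """Invert expected_vaf to get a point estimate of the cancer cell fraction."""
    total_alleles = purity * tumor_cn + (1 - purity) * normal_cn
    return vaf * total_alleles / (purity * mutated_copies)


# A clonal heterozygous SNV in a diploid region of a 60% pure tumor.
print(expected_vaf(purity=0.6, tumor_cn=2, mutated_copies=1))
# The same VAF converted back to a cancer cell fraction.
print(ccf_from_vaf(vaf=0.3, purity=0.6, tumor_cn=2))
```

With 60% purity, a clonal heterozygous mutation shows up at a VAF of about 0.3 rather than 0.5, so a raw VAF should not be read as a genotype without accounting for purity and copy number.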

some useful tools for preprocessing

  • FastqPuri fastq quality assessment and filtering tool.
  • fastp A tool designed to provide fast all-in-one preprocessing for FastQ files. It is developed in C++ with multithreading support for high performance. Really promising, take a look!
  • bazam A new read extraction and realignment tool for next generation sequencing data. Take a look!

Mutation caller, structural variant caller

Delly is the best SV caller in the DREAM challenge: https://www.synapse.org/#!Synapse:syn312572/wiki/70726

SV caller benchmark

SNV filtering

Whole exome and genome sequencing have transformed the discovery of genetic variants that cause human Mendelian disease, but discriminating pathogenic from benign variants remains a daunting challenge. Rarity is recognised as a necessary, although not sufficient, criterion for pathogenicity, but frequency cutoffs used in Mendelian analysis are often arbitrary and overly lenient. Recent very large reference datasets, such as the Exome Aggregation Consortium (ExAC), provide an unprecedented opportunity to obtain robust frequency estimates even for very rare variants. Here we present a statistical framework for the frequency-based filtering of candidate disease-causing variants, accounting for disease prevalence, genetic and allelic heterogeneity, inheritance mode, penetrance, and sampling variance in reference datasets.
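A rough sketch of the kind of calculation such a frequency filter involves, for a dominant (monoallelic) disorder only. The formula is an assumption about how prevalence, heterogeneity and penetrance combine and should be checked against the paper; it ignores sampling variance in the reference dataset. The function name, the input values and the candidate frequencies are all hypothetical.

```python
def max_credible_af_dominant(prevalence, allelic_het, genetic_het, penetrance):
    """Maximum credible population allele frequency for a dominant disorder.

    Assumed formula (check the paper before relying on it):
        prevalence * allelic_het * genetic_het / penetrance / 2
    The final division by 2 reflects two alleles per individual.
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    return prevalence * allelic_het * genetic_het / penetrance / 2


# Illustrative inputs: prevalence 1 in 500, no single variant explaining more
# than 2% of cases, one gene considered in full, 50% penetrance.
threshold = max_credible_af_dominant(
    prevalence=1 / 500, allelic_het=0.02, genetic_het=1.0, penetrance=0.5
)
print(threshold)

# Hypothetical candidate variants with their reference population frequencies.
candidates = {"variant_a": 1e-6, "variant_b": 3e-4, "variant_c": 2e-5}
kept = sorted(name for name, af in candidates.items() if af <= threshold)
print(kept)
```

A variant more common than the threshold is too frequent to plausibly cause the disease under the stated assumptions, so it can be deprioritized.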

  • dbDSM: a new database of Deleterious Synonymous Mutations, continually updated, that collects, curates and manages available human disease-related SM data obtained from published literature.

  • LncVar: a database of genetic variation associated with long non-coding genes

Annotation of the variants

Manual review of the called variants in IGV

Third generation sequencing for Structural variants (works on short reads as well!)

tools useful for everyday bioinformatics

A series of posts from Brad Chapman

  1. Validating multiple cancer variant callers and prioritization in tumor-only samples
  2. Benchmarking variation and RNA-seq analyses on Amazon Web Services with Docker
  3. Validating generalized incremental joint variant calling with GATK HaplotypeCaller, FreeBayes, Platypus and samtools
  4. Validated whole genome structural variation detection using multiple callers
  5. Validated variant calling with human genome build 38

Copy number variants

Tools for visualization

  1. New app gene.iobio
    App here. I will definitely give it a try.

  2. ASCIIGenome is a command-line genome browser running from terminal window and solely based on ASCII characters. Since ASCIIGenome does not require a graphical interface it is particularly useful for quickly visualizing genomic data on remote servers. The idea is to make ASCIIGenome the Vim of genome viewers.

Tools for vcf files

  1. tools for pedigree files. It can determine sex from PED and VCF files. Developed by Brent Pedersen. I really like tools from Aaron Quinlan's lab.
  2. cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files
  3. PyVCF - A Variant Call Format Parser for Python
  4. VcfR: an R package to manipulate and visualize VCF format data
  5. Varapp is an application to filter genetic variants, with a reactive graphical user interface. Powered by GEMINI.
  6. varmatch: robust matching of small variant datasets using flexible scoring schemes
  7. vcf-validator validate your VCF files!
  8. BrowseVCF: a web-based application and workflow to quickly prioritize disease-causative variants in VCF files

mutation signature

Tools for MAF files

TCGA has all the variant calls in MAF format. Please read a post by Cyriac Kandoth.

  1. convert vcf to MAF: perl script by Cyriac Kandoth.
  2. Once converted to MAF, one can use MAFtools for visualization: oncoprint (wraps ComplexHeatmap), lollipop plots, mutational signatures, etc. Very cool, I just found it...
  3. MutationalPatterns: an integrative R package for studying patterns in base substitution catalogues

Tools for bam files

  1. VariantBam: Filtering and profiling of next-generational sequencing data using region-specific rules

Annotate and explore variants

  1. Variant Effect Predictor: VEP
  3. vcfanno
  4. myvariant.info tutorial
  5. FunSeq2: A flexible framework to prioritize regulatory mutations from cancer genome sequencing
  6. ClinVar
  7. ExAC
  8. vcf2db and GEMINI: a flexible framework for exploring genome variation from the Quinlan lab.


  1. oncoprint
  2. deconstructSigs aims to determine the contribution of known mutational processes to a tumor sample. By using deconstructSigs, one can: determine the weights of each mutational signature contributing to an individual tumor sample; plot the reconstructed mutational profile (using the calculated weights) and compare it to the original input sample.
  3. Fast Principal Component Analysis of Large-Scale Genome-Wide Data
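To make the "reconstructed profile" idea concrete, here is a toy sketch in Python. It is not the deconstructSigs API (that is an R package), and it does not fit the weights; it only shows how a profile is rebuilt from given weights and compared to the observed one. The signatures use 4 made-up channels instead of the usual 96 trinucleotide contexts, and every name and number is hypothetical.

```python
def reconstruct(signatures, weights):
    """Rebuild a mutational profile as the weighted sum of signatures."""
    n_channels = len(next(iter(signatures.values())))
    profile = [0.0] * n_channels
    for name, weight in weights.items():
        for i, probability in enumerate(signatures[name]):
            profile[i] += weight * probability
    return profile


def sum_squared_error(observed, reconstructed):
    """Sum of squared differences between two profiles of equal length."""
    return sum((o - r) ** 2 for o, r in zip(observed, reconstructed))


# Toy signatures: each is a probability distribution over 4 channels.
signatures = {
    "sig_A": [0.7, 0.1, 0.1, 0.1],
    "sig_B": [0.1, 0.1, 0.1, 0.7],
}
# Hypothetical weights, as a fitting step might report them.
weights = {"sig_A": 0.5, "sig_B": 0.5}
# Observed profile of the tumor sample, as fractions of mutations per channel.
observed = [0.4, 0.1, 0.1, 0.4]

reconstructed = reconstruct(signatures, weights)
print(reconstructed)
print(sum_squared_error(observed, reconstructed))
```

A small error means the chosen signatures and weights explain the sample well; a large one suggests a missing signature or a poor fit.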

Identify driver genes

intra-tumor heterogeneity

tumor clonality and evolution

mutual exclusiveness of mutations

  • MEGSA: A powerful and flexible framework for analyzing mutual exclusivity of tumor mutations.
  • CoMet
  • DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data.

mutation enrichment in pathways

  • PathScore: a web tool for identifying altered pathways in cancer data

Non-coding mutations


long reads

  • Quality Assessment Tools for Oxford Nanopore MinION data
  • Signal-level algorithms for MinION data

Single-cell DNA sequencing