Index Hopping Analysis

Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms

Costello et al. - 10.1186/s12864-018-4703-0

  • HiSeq X, 3000/4000 utilize a patterned flowcell.
  • The patterned flowcell has been observed to cause higher levels of index swapping. It utilizes ExAmp chemistry to provide higher yields, but this also contributes to greater levels of swapping because adapters/primers are not washed away as they are on the HiSeq 2500.

Single v. Dual indexing

Single indexing was observed to cause greater levels of contamination

PCR-plus - 8 Extra cycles

Factors contributing to swapping

The authors in this paper identified several factors that contribute to increased rates of swapping.

  • Presence of excess free primer or adapters
  • Swapping between samples increases as plex increases: Prob(self-swaps) decreases, and the probability of swaps with other samples increases.
  • Library prep method influences swap rates; DNA shearing + adapter ligation + HiSeq showed the highest rates.
  • Higher rates observed for chimeric reads.
  • Higher rates at lower % GC, because shorter fragments with lower % GC amplify more efficiently in polymerase-based amplification assays, which is what the ExAmp chemistry utilizes.
  • Swapping also occurs during exome capture
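
The plex effect above can be illustrated with a toy model. This is only a sketch under a strong assumption that is not from the paper: a swapped read picks up an index drawn uniformly from the `plex` samples in the pool. The function name is hypothetical.

```python
def swap_destination_probs(plex: int) -> tuple:
    """Toy model: a swapped read receives an index drawn uniformly
    from the `plex` samples in the pool (an assumption, not a result
    from Costello et al.).

    Returns (P(self-swap), P(swap to another sample)).
    """
    if plex < 1:
        raise ValueError("plex must be at least 1")
    p_self = 1.0 / plex
    return p_self, 1.0 - p_self


for plex in (1, 2, 8, 24):
    p_self, p_other = swap_destination_probs(plex)
    print(f"plex={plex:>2}  P(self)={p_self:.3f}  P(other)={p_other:.3f}")
```

Under this assumption, self-swaps (which are invisible) shrink as 1/plex while detectable cross-sample swaps grow, matching the direction of the trend noted above.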

  • "Index switching causes spreading of signal" paper reported 10% swap rate.
  • Highest found in this paper is 6%.
  • Typical swap rates:
    • 3% - PCR- (free)
    • 0.25% - PCR+ genomes.
  • Authors hypothesize that the higher swap rates in PCR-free libraries are due to low library fragment yield, i.e. a higher ratio of free adapter to library fragments.
  • PCR-plus libraries are diluted following PCR, which they hypothesize reduces the ratio of free adapter to library fragments.
  • Index swapping examined at the flowcell - lane level.
  • Observed rates of 0.2 - 6%, avg. 1%
  • In the case of WES, most adapters would be lost during the exome capture stage, and PCR further reduces likelihood of swapping.
  • Swapping is therefore more of an issue with WGS without adequate cleanup.

Swap rates are examined on a per-lane per-flowcell basis

Both smaller inserts and chimeras have increased swap rates

A novel post hoc method for detecting index switching finds no evidence for increased switching on the Illumina HiSeq X

Owens et al.

Examines unbalanced heterozygotes, which would be produced by index switching.

  • Working with WGS sunflower (_Helianthus annuus_).
  • 350 bp size.
  • The problem is that only a single read is observed to switch in their dataset.
  • Calculated the binomial probability that the rare allele would be found in one or more samples based on the allele frequency for all samples sequenced with that machine.

$\hat{p} = 1 - (1-f)^{2n}$

If index switching is occurring, then $\hat{p} > p$.
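
A minimal sketch of the formula above. The notes do not define the symbols, so this assumes $f$ is the allele frequency and $n$ is the number of diploid samples (hence $2n$ chromosomes); the function name is hypothetical.

```python
def prob_allele_in_samples(f: float, n: int) -> float:
    """Probability that an allele at frequency f appears at least once
    among n diploid samples, i.e. 1 - (1 - f)^(2n).

    Assumes f = allele frequency and n = number of diploid samples;
    these interpretations are not stated in the notes.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - f) ** (2 * n)


# Example: a 1% allele and 10 other samples on the same machine.
print(round(prob_allele_in_samples(0.01, 10), 4))
```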

Occult Specimen Contamination in Routine Clinical Next-Generation Sequencing Testing

Sehn et al.

  • Examined haplotypes to identify contamination.
  • Developed a list of 200 pairs of closely spaced SNPs.
    • Within 400 bp of one another
    • Population minor allele frequency $> 0.1$
    • $r^2 < 0.5$
    • $HWE \geq 0.05$
  • Contamination estimated as $2\times$ the mean frequency of the minor haplotype at multihaplotype loci to correct for zygosity (???).
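
The estimate in the last bullet can be sketched as follows. This is an illustration only, not Sehn et al.'s code: the input is assumed to be the minor-haplotype read fraction at each multihaplotype locus, and the function name is hypothetical.

```python
from statistics import mean


def estimate_contamination(minor_haplotype_fractions):
    """Contamination estimate: 2 x the mean minor-haplotype fraction
    across multihaplotype loci (per the description in the notes).

    `minor_haplotype_fractions` is an assumed input format: one read
    fraction per locus at which more than two haplotypes were seen.
    """
    if not minor_haplotype_fractions:
        return 0.0  # no multihaplotype loci, so no evidence of contamination
    return 2.0 * mean(minor_haplotype_fractions)


# Example: three loci with minor haplotypes at 1%, 2% and 3% of reads.
print(estimate_contamination([0.01, 0.02, 0.03]))
```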

Sample-Index Misassignment Impacts Tumour Exome Sequencing

Vodak et al.

  • Git repo

  • "index misassignment is a source of false positive somatic variant calls in a form of true variation obtained from co-multiplexed samples."

  • Performed deep exome sequencing; Median coverage: 315x Tumor; 146x control. Compared HiSeq 2000/2500 and HiSeq 4000 (ExAmp).

  • Measured sample-wise contamination using Conpair.

  • Simulations suggest Conpair underestimates contamination rates with increasing sample counts.

  • Conpair is designed for 2 sample mixtures only (supposedly)

Contamination Estimates

Comparing ExAmp and Bridge amplification, they identify median per-sample rates:

  • 0.839% with ExAmp
  • 0.187% with Bridge Amplification

Examining samples that were sequenced as part of a pool or on their own revealed estimates of:

  • 0.644% pooled (multiplexed)
  • 0.046% in individual lanes

"The dependency on sample pooling indicated that co-multiplexed samples serve as contaminants."

Artifactual variant calls

SSNV = Somatic single nucleotide variants

  • Examined the "pool complement" which consisted of reads from all co-multiplexed samples.
  • Calculated two allelic fractions for each somatic variant:
    • PC-AF - "Pool Complement" allelic fraction; Allelic fraction in the source of contamination.
    • AF - Allelic fraction in a sample variant suspected of contamination.
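
A rough sketch of the two fractions at a single variant site. This assumes per-sample (alt reads, total reads) counts are available; the data layout, sample names, and function names are hypothetical and not taken from the paper's repository.

```python
from typing import Dict, Tuple


def allelic_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads supporting the alternate allele."""
    return alt_reads / total_reads if total_reads else 0.0


def pool_complement_af(counts: Dict[str, Tuple[int, int]], target: str) -> float:
    """PC-AF: allelic fraction over the reads of all co-multiplexed
    samples except `target` (the pool complement)."""
    alt = sum(a for name, (a, _) in counts.items() if name != target)
    total = sum(t for name, (_, t) in counts.items() if name != target)
    return allelic_fraction(alt, total)


# Hypothetical counts at one site: sample -> (alt reads, total reads).
site = {"tumor_1": (3, 300), "sample_2": (50, 100), "sample_3": (0, 100)}
print(allelic_fraction(*site["tumor_1"]))   # AF in the suspect sample
print(pool_complement_af(site, "tumor_1"))  # PC-AF in the pool complement
```

A low AF paired with a substantial PC-AF is the pattern that would point to a co-multiplexed sample as the source of the variant.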

Two distinct classes of variants identified from resulting analysis:

  1. Apparently true somatic variants consisting of variants not present in the Norwegian population, and lacking support in the pool complement.
  2. Suspected contaminant variants, consisting of common Norwegian germline variants with >= 5% allele frequency and considerable support in their pool complements.

Conpair: concordance and contamination estimator for matched tumor–normal pairs

Bergmann et al.

"Using a grid-search over a range of contamination fractions"

A grid search is simply a parameter sweep or exhaustive search.
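
To make the idea concrete, here is a minimal grid search over contamination fractions. The likelihood is a simplified binomial model chosen for illustration and is not Conpair's actual model: at sites where the sample is assumed homozygous reference, an alt read appears with probability `c * q + (1 - c) * error_rate`, where `q` is the alt-allele fraction in the contaminant. All names and defaults are assumptions.

```python
import math


def log_likelihood(c, sites, error_rate=0.001):
    """Binomial log-likelihood (up to a constant) of contamination
    fraction c. `sites` holds (alt_reads, depth, q) tuples, where q is
    the alt fraction expected in the contaminant. Simplified model for
    illustration only."""
    ll = 0.0
    for alt, depth, q in sites:
        p = c * q + (1.0 - c) * error_rate
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        ll += alt * math.log(p) + (depth - alt) * math.log(1.0 - p)
    return ll


def grid_search_contamination(sites, grid=None):
    """Evaluate every candidate fraction and return the best one."""
    if grid is None:
        grid = [i / 1000 for i in range(501)]  # 0% to 50% in 0.1% steps
    return max(grid, key=lambda c: log_likelihood(c, sites))


# Hypothetical data: 20 sites, depth 1000, 11 alt reads each, q = 0.5.
sites = [(11, 1000, 0.5)] * 20
print(grid_search_contamination(sites))
```

The sweep is exhaustive, so its resolution is set by the grid step; a finer grid costs proportionally more likelihood evaluations.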

  • VerifyBamID works well for copy-neutral samples.