Skip to content

Identify microhaplotypes from low depth sequencing data

Notifications You must be signed in to change notification settings

delomast/mhFromLowDepSeq

Repository files navigation

mhFromLowDepSeq

Identify microhaplotypes from low depth sequencing data

Situation: You have either pooled sequencing data (sequencing data for a bunch of individuals with no barcodes or other means of identifying which reads came from which individual) or you have low coverage sequencing data for a bunch of individuals (not enough to reliably call genotypes).

Desired outcome: Allele frequencices and/or expected heterozygosity for microhaps in the population that was sequenced.

Method: Mixture models fit with EM algorithms to infer allele frequencies (see full description in ...).

Limitations: Can only handle substitution SNPs

A few notes

To use the method, the only file you need to download from this repository is calc_mh_freq.py (and I might also recommend the example directory). The other files are only included because they are relevant to the analyses discussed in ...

Note on speed: Combining all reads into a single BAM file (and identiifying individuals with read groups, if you have data for individuals) is significantly quicker than having one BAM file for each individual (sorting reads by read group internally is quicker than multiple BAM reading operations). To speed things up, you can run it separately (and concurrently) on each chromosome and/or population. If running populations separately, you should be able to join the results together directly (or using the Chr and Pos columns as keys if you want to double check). As long as each population is run with the same -s input file, -w window size, and -ms maximum number of SNPs in a window, these columns should be identical (the same windows evaluated in the same order). In some limited speed testing, it seems that using smaller values of -maxS, such as 1, increases speed. Additionally, using the pool method (-pool) is generally faster (at the expense of model complexity). Also note that since it uses numba, there is some fixed compilation time at the start of the script. For analyses of realistic size, this extra time is quickly made up.

Note on memory: memory usage should be pretty low (a few GB or less), as long as you are using smaller values of -maxS, do not have hundreds of individuals (for the individual method), and use reasonable values of -maxH and maxR. If memory usage is high, one common cause is that you are inputing a large number of BAM files and they have large headers (lots of @SQ from thousands of small contigs and/or an unreasonable number of @PG lines). This happens because the header is stored in memory while the file is opened for reading. You can remedy this by combining the bam files into one per population (mark individuals with @RG). Theoretically, you could also make the header smaller by removing irrelevant information, but only do this if you understand what you're doing and any downstream repurcussions that you may cause, particularly if you use the file(s) for something else.

Manual

Quick examples:

For low coverage data from multiple individuals

python3 calc_mh_freq.py -m ./example/filePop.txt -s ./example/snpLoc.txt -o output.txt -af -w 65

For pooled data (can't assign reads to individuals)

python3 calc_mh_freq.py -m ./example/filePop.txt -s ./example/snpLoc.txt -o outputPool.txt -af -w 65 -pool

Output file

The output file is a tab-delimited text file with each line representing a microhaplotype locus. The first two columns give the chromosome name and the positions of SNPs (comma-separated) in the locus. The following columns give the expected heterozygosity for each population and are titled by the population name. The next columns give the number of reads used to inform the estimate for each population and are titled as NumReads_[PopName]. If using the "individual-based" mode, the next columns give the number of individuals that contributed at least one read to the estimate and are titled as NumInds_[PopName]. If the -af option is specified, the next columns give the allele frequencies estimated for each population and are titled as AlleleFreq_[PopName].

Command line options

  • -m FILE a tab delimited text file with no header that maps bam files (pools or individuals, first column) to populations (second column). The bam files must be sorted by position and indexed. bamFilePath \t popName
  • -s FILE a tab delimited text file with no header that indicates the positions of substitution SNPs (no indels). First column is the chromosome name and second column is the 1-based position. Variants must be sorted by chromosome and by position, with position sorted from smallest to largest. Chr \t Position
  • -w INT window size for defining microhaplotypes. default 125
  • -ms INT maximum number of SNPs in a window to be valid (if exceeded, window is skipped and nothing is written to output for this window). default 25
  • -o STRING output file name. default He_mh.txt
  • -af include this argument if you want the output file to contain allele frequencies in addition to He. default is to not output allele frequencies
  • -minAF FLOAT minimum allele frequency to keep an allele in the model when pruning. Note that pruning is not performed after the last cycle, so the final result may have alleles with frequencies below this minimum. default 0.001
  • -eps FLOAT probability a base in the template is wrong, should be > 0 and < 1. default 0.01
  • -maxH INT maximum number of alleles to try to estimate results for. If this is exceeded, the window is skipped and NA is output. default is 128 for low coverage data from individuals and 256 for pooled data (option -pool is used)
  • -maxS INT maximum number of SNPs to add in one iteration. Must be > 0. default 1
  • -pool include this argument to treat data (even if multiple bam files per population are provided) as a pool. default is to treat data as coming from individuals with no knowledge of which reads came from which individual
  • -maxR INT The maximum number of reads within a window to consider for a given individual or, if -pool is used, for the entire pool. If more reads are present, a random subsample is taken. default for individual based analyses: 40 reads per individual; for -pool analyses: 100 reads total.
  • -r INT The seed to use for random subsampling of reads when a window exceeds -maxR. default is 7
  • -rg include this argument to split reads within ONE BAM file into separate samples based on the read group SM tag. Note that all reads for a given sample must be in one BAM file. If you use this option, it is strongly recommended that you only input one BAM file for each population.

This software is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use.

About

Identify microhaplotypes from low depth sequencing data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published