brianabernathy/ABGD

ABGD

variant-based AB Genome Dosage tools

Overview

While ABGD has some flexibility, it was originally designed to support peanut research. Cultivated peanut, Arachis hypogaea, is an allotetraploid derived from two ancestral diploid parents: Arachis duranensis represents the A subgenome and Arachis ipaensis the B subgenome. The basic workflow has two stages. First, diploid parent reads are aligned to a diploid parent genome and variants are called; ABGD then assigns subgenome-specific alleles to those variants. Second, tetraploid sample reads are aligned to the same parent genome and variants are called; ABGD then determines the subgenomic content at each of the previously identified target variant sites.

Usage

target variant site and consensus allele identification

ABGD requires input VCF variant file(s) with the AD (allelic depth) FORMAT tag present. Methods for aligning reads and calling variants vary depending on project details. As an example, bcftools mpileup uses the -a/--annotate AD option to add the AD tag and its values to its output VCF files. Please consult your preferred variant calling tool's documentation for more details. A truncated example VCF file, test.ref.vcf, can be found in the test_data directory.
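To illustrate what the AD tag looks like in practice, here is a minimal Python sketch that pulls per-sample allelic depths out of a single VCF data line. It is not part of ABGD. The function name parse_ad and the example line are hypothetical; only the sample names are borrowed from this README.

```python
def parse_ad(vcf_line, sample_names):
    """Map each sample name to its list of AD (allelic depth) integers.

    Returns an empty dict when the FORMAT column has no AD tag.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")  # column 9 of a VCF line is FORMAT
    if "AD" not in format_keys:
        return {}
    ad_index = format_keys.index("AD")
    depths = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        values = sample_field.split(":")
        # Trailing FORMAT fields may be dropped and missing values are "."
        if ad_index >= len(values) or values[ad_index] in (".", ""):
            depths[name] = []
            continue
        depths[name] = [0 if v == "." else int(v) for v in values[ad_index].split(",")]
    return depths


# Hypothetical data line with two samples; AD lists ref depth, then alt depth.
line = "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:PL:AD\t0/0:0,30,255:10,0\t1/1:255,30,0:0,12"
print(parse_ad(line, ["dur.bgi", "ipa.bgi"]))
```

If the function returns an empty dict for your own data lines, the variant caller was most likely run without AD annotation.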

ABGD also requires a "group samples file", which defines the sample genotypes to be grouped for consensus allele calling. In our project, groups represent the A and B subgenomes, but this functionality can be used in other ways, and one or more groups can be defined. The file also defines a minimum mapping depth and a minimum allele percentage for each sample, which provide some basic filtering of target variant sites. This file should be in tab-separated value format. See the example below or test.ref.groups in the test_data directory.

#group	sample	min_depth	min_allele_pct
a_cons	dur.gnm1.frags	1	100
a_cons	dur.gnm2.frags	1	100
a_cons	dur.bgi	10	95
a_cons	dur.ha	10	95
b_cons	ipa.gnm1.frags	1	100
b_cons	ipa.gnm2.frags	1	100
b_cons	ipa.bgi	10	95
b_cons	ipa.ha	10	95
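As a rough illustration of the file layout above, the following Python sketch reads a group samples file into a nested dictionary and applies the two thresholds to one sample's allelic depths. It is not ABGD's own code. The names read_groups and passes_filter are hypothetical, and the filter logic (total depth at least min_depth, most common allele at least min_allele_pct percent of reads) is an assumed reading of the two columns that may differ from what abgd_consensus_alleles.pl actually does.

```python
import csv
import io
from collections import defaultdict


def read_groups(handle):
    """Return {group: {sample: (min_depth, min_allele_pct)}} from a TSV handle."""
    groups = defaultdict(dict)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header/comment line
        group, sample, min_depth, min_allele_pct = row[:4]
        groups[group][sample] = (int(min_depth), float(min_allele_pct))
    return dict(groups)


def passes_filter(allele_depths, min_depth, min_allele_pct):
    """ASSUMED interpretation: enough reads, and one allele dominates them."""
    total = sum(allele_depths)
    if total == 0 or total < min_depth:
        return False
    return 100.0 * max(allele_depths) / total >= min_allele_pct


example = io.StringIO(
    "#group\tsample\tmin_depth\tmin_allele_pct\n"
    "a_cons\tdur.gnm1.frags\t1\t100\n"
    "a_cons\tdur.bgi\t10\t95\n"
    "b_cons\tipa.bgi\t10\t95\n"
)
groups = read_groups(example)
min_depth, min_pct = groups["a_cons"]["dur.bgi"]
print(passes_filter([12, 0], min_depth, min_pct))  # 12 reads, all one allele
print(passes_filter([9, 1], min_depth, min_pct))   # 10 reads, 90% one allele
```

Under this reading, assembly-derived samples such as dur.gnm1.frags (depth 1, 100 percent) only need a single unambiguous read, while sequencing samples such as dur.bgi need deeper and near-unanimous support.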

Once the above conditions are satisfied, abgd_consensus_alleles.pl can produce a file containing consensus alleles for each defined group (subgenome) at each target variant site. Run the command without options or with the -h/--help option to view the full documentation. Note that the variant file containing the group samples can contain non-group samples as well; they will simply be ignored.

/path/to/ABGD/bin/abgd_consensus_alleles.pl -v /path/to/ABGD/test_data/test.ref.vcf -g /path/to/ABGD/test_data/test.ref.groups -o /path/to/ABGD/test_data/test.ref.cons.alleles

sample variant calling and dosage processing

Now that the target variant sites and consensus alleles have been identified, the next step is to analyze allopolyploid sample variants. If you have not already called sample variants, it is more efficient to instruct the variant caller to process only the target variant sites. This step is optional, because abgd_sample_dosage.pl can still process variant files containing non-target variant sites. Again, please consult your preferred variant caller's documentation for more information. bcftools mpileup, for example, uses the -R/--regions-file FILE option to specify a subset of variant sites to process. You can use the command below to obtain a properly formatted regions file for bcftools mpileup from the test.ref.cons.alleles file generated by abgd_consensus_alleles.pl above.

tail -n +2 /path/to/ABGD/test_data/test.ref.cons.alleles | cut -f 1,2 > /path/to/ABGD/test_data/test.ref.target.variant.sites

Note

Depending on your preferred variant caller, it may be necessary to include a sample from each consensus group to ensure that all consensus alleles and corresponding allele depths appear in the sample variant file.

without normalization

If normalized dosage output is not desired, abgd_sample_dosage.pl can be run on the test_data using the command below.

/path/to/ABGD/bin/abgd_sample_dosage.pl -v /path/to/ABGD/test_data/test.samples.vcf -g /path/to/ABGD/test_data/test.ref.cons.alleles -d -p > /path/to/ABGD/test_data/test.samples.group.dosage

with normalization

If normalized dosage output is desired, the variant file containing allopolyploid samples will also require the inclusion of a normalization reference sample. The normalization reference sample is used to calculate normalized allopolyploid sample subgenome dosages. In our case, the normalization reference sample was a synthetic tetraploid created from proportional quantities of high-quality subgenome reads. Please see the publication for more details.

Additionally, a tab-separated coverage file is required, which defines approximate genome coverages for the normalization reference and the samples. See the example below or test.samples.cov in the test_data directory.

#sample	coverage
norm.ref	100.5
sample1	30.5
sample2	30.2
sample3	30.2

Normalization is performed using the following formula, where 'cons allele X' represents one of the previously identified consensus alleles:

sample cons allele X norm = (sample cons allele X count / sample mean cov) / (norm.ref cons allele X count / norm.ref mean cov)
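The formula above can be checked by hand with a short Python sketch. This is a worked illustration, not ABGD's implementation. The function name normalized_dosage and the allele counts are hypothetical; the coverages come from the example coverage file. Returning None when a denominator would be zero is a choice made for this sketch, and ABGD may handle that case differently.

```python
def normalized_dosage(sample_count, sample_cov, ref_count, ref_cov):
    """(sample count / sample mean cov) / (norm.ref count / norm.ref mean cov)."""
    if sample_cov <= 0 or ref_cov <= 0 or ref_count <= 0:
        return None  # sketch-only choice: undefined when a denominator is zero
    return (sample_count / sample_cov) / (ref_count / ref_cov)


# Coverages taken from the example coverage file.
coverage = {"norm.ref": 100.5, "sample1": 30.5}

# Hypothetical consensus allele read counts at one target variant site.
ref_count = 50      # norm.ref reads supporting consensus allele X
sample_count = 15   # sample1 reads supporting consensus allele X

value = normalized_dosage(
    sample_count, coverage["sample1"], ref_count, coverage["norm.ref"]
)
print(round(value, 3))
```

With these numbers the result is close to 1, meaning sample1 carries consensus allele X at roughly the same coverage-adjusted depth as the normalization reference.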

Once these conditions are met, abgd_sample_dosage.pl can be run to produce normalized group dosages for each sample using the command below.

/path/to/ABGD/bin/abgd_sample_dosage.pl -v /path/to/ABGD/test_data/test.samples.vcf -a /path/to/ABGD/test_data/test.ref.cons.alleles -c /path/to/ABGD/test_data/test.samples.cov -d -p -n norm.ref > /path/to/ABGD/test_data/test.samples.group.dosage
