variant-based AB Genome Dosage tools
While ABGD has some flexibility, it was originally designed to support peanut research. Cultivated peanut, Arachis hypogaea, is an allotetraploid derived from two ancestral diploid parents: Arachis duranensis represents the A subgenome and Arachis ipaensis the B subgenome. The basic workflow is to align diploid parent reads to a diploid parent genome and call variants, then use ABGD to assign subgenome-specific alleles to those variants. Tetraploid sample reads are then aligned to the same parent genome and variants are called, after which ABGD determines the subgenomic content at each of the previously identified target variant sites.
ABGD requires input VCF variant file(s) with the AD (allelic depth) FORMAT tag present. Methods for aligning reads and calling variants vary depending on project details. As an example, bcftools mpileup uses the -a/--annotate AD option to add the AD tag to its output VCF files. Please consult your preferred variant calling tool's documentation for more details. A truncated example VCF file, test.ref.vcf, can be found in the test_data directory.
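Before running ABGD, it can be worth confirming that a VCF header actually declares the AD tag. The sketch below is a minimal check, not part of ABGD: the helper name has_ad_tag and the tiny stand-in header are hypothetical, and it assumes an uncompressed VCF.

```shell
# Build a tiny stand-in VCF header (hypothetical content, for illustration only).
vcf=$(mktemp)
printf '%s\n' \
  '##fileformat=VCFv4.2' \
  '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">' \
  '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">' \
  > "$vcf"

# has_ad_tag FILE: succeed if the header declares a FORMAT field with ID=AD.
has_ad_tag() {
  grep -q '^##FORMAT=<ID=AD,' "$1"
}

if has_ad_tag "$vcf"; then
  ad_status="AD present"
else
  ad_status="AD missing"
fi
echo "$ad_status"
rm -f "$vcf"
```

For a compressed file, the header could instead be piped into the same grep pattern from bcftools view -h.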
ABGD also requires a "group samples file", which defines the sample genotypes to be grouped for consensus allele calling. In our project, groups represent the A and B subgenomes, but this functionality can be used in other ways, and one or more groups can be defined. The file also defines a minimum mapping depth and a minimum allele percentage for each sample, which provide basic filtering of target variant sites. This file should be in tab-separated value format. See the example below or test.ref.groups in the test_data directory.
#group sample min_depth min_allele_pct
a_cons dur.gnm1.frags 1 100
a_cons dur.gnm2.frags 1 100
a_cons dur.bgi 10 95
a_cons dur.ha 10 95
b_cons ipa.gnm1.frags 1 100
b_cons ipa.gnm2.frags 1 100
b_cons ipa.bgi 10 95
b_cons ipa.ha 10 95
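A malformed group samples file is easy to produce by hand, so a quick sanity check may help. The sketch below is not part of ABGD. It assumes four tab-separated fields per data row, a non-negative integer min_depth, and a min_allele_pct between 0 and 100; the third data row is deliberately broken to show what gets flagged.

```shell
groups_file=$(mktemp)
# Hypothetical rows modeled on the example above; fields are tab-separated.
printf '#group\tsample\tmin_depth\tmin_allele_pct\n' > "$groups_file"
printf 'a_cons\tdur.bgi\t10\t95\n' >> "$groups_file"
printf 'b_cons\tipa.bgi\t10\t95\n' >> "$groups_file"
printf 'b_cons\tipa.ha\t10\n' >> "$groups_file"   # deliberately missing a column

# Collect the line numbers of data rows that break the assumed layout.
bad_rows=$(awk -F'\t' '
  /^#/ || NF == 0 { next }    # skip the header/comment line and blank lines
  NF != 4 || $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+(\.[0-9]+)?$/ || $4 > 100 {
    print NR
  }' "$groups_file")
echo "rows needing attention: $bad_rows"
rm -f "$groups_file"
```

Here only line 4 of the stand-in file is reported, because it has three fields instead of four.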
Once the above conditions are satisfied, abgd_consensus_alleles.pl can produce a file containing consensus alleles for each defined group (subgenome) at each target variant site. Run the command without options or with the -h/--help option to view the full documentation. Note that the variant file containing the group samples can contain non-group samples as well; they will simply be ignored.
/path/to/ABGD/bin/abgd_consensus_alleles.pl -v /path/to/ABGD/test_data/test.ref.vcf -g /path/to/ABGD/test_data/test.ref.groups -o /path/to/ABGD/test_data/test.ref.cons.alleles
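How the min_depth and min_allele_pct thresholds might act on a single genotype can be sketched as follows. This is only one plausible reading, not the logic of abgd_consensus_alleles.pl itself: it assumes depth is the sum of the AD values and that the allele percentage is the share of the most-supported allele. The function name passes_filter is hypothetical.

```shell
# passes_filter AD MIN_DEPTH MIN_ALLELE_PCT
# Prints "pass" or "fail" for one sample genotype (assumed interpretation).
passes_filter() {
  awk -v ad="$1" -v min_depth="$2" -v min_pct="$3" 'BEGIN {
    n = split(ad, counts, ",")
    depth = 0; top = 0
    for (i = 1; i <= n; i++) {
      depth += counts[i]                           # total reads at the site
      if (counts[i] + 0 > top) { top = counts[i] + 0 }
    }
    if (depth > 0 && depth >= min_depth && 100 * top / depth >= min_pct) {
      print "pass"
    } else {
      print "fail"
    }
  }'
}

passes_filter "0,12" 10 95    # depth 12, major allele at 100%
passes_filter "1,11" 10 95    # depth 12, major allele at about 91.7%
passes_filter "0,4" 10 95     # depth 4, below min_depth
```

Under these assumptions, only the first genotype would contribute to a consensus call.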
Now that the target variant sites and consensus alleles have been identified, the next step is to analyze allopolyploid sample variants. If you have not already called sample variants, it is more efficient to instruct the variant caller to process only the target variant sites. This step is optional, because abgd_norm_dosage_samples.pl can still process variant files containing non-target variant sites. Again, please consult your preferred variant caller's documentation for more information. bcftools mpileup, for example, uses the -R/--regions-file FILE option to specify a subset of variant sites to process. You can use the command below to obtain a properly formatted regions file for bcftools mpileup from the test.ref.cons.alleles file generated by abgd_consensus_alleles.pl above.
tail -n +2 /path/to/ABGD/test_data/test.ref.cons.alleles | cut -f 1,2 > /path/to/ABGD/test_data/test.ref.target.variant.sites
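To see what the pipeline above does without running ABGD first, it can be tried on a small stand-in file. The file contents here are hypothetical: the only properties assumed are a single header line and tab-separated rows whose first two columns are chromosome and position. The other column names are placeholders, not the tool's real output format.

```shell
alleles=$(mktemp)
# Hypothetical consensus-alleles content (header plus two sites).
printf '#chrom\tpos\ta_cons\tb_cons\n' > "$alleles"
printf 'chr01\t1045\tA\tG\n' >> "$alleles"
printf 'chr01\t2310\tC\tT\n' >> "$alleles"

# Same pipeline as above: drop the header line, keep columns 1 and 2.
sites=$(tail -n +2 "$alleles" | cut -f 1,2)
printf '%s\n' "$sites"
rm -f "$alleles"
```

The result is one tab-separated chromosome and position pair per target site, which is the layout the text above describes for the bcftools mpileup regions file.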
Note
Depending on your preferred variant caller, it may be necessary to include a sample from each consensus group to ensure that all consensus alleles and corresponding allele depths appear in the sample variant file.
If normalized dosage output is not desired, abgd_sample_dosage.pl can be run on the test_data using the command below.
/path/to/ABGD/bin/abgd_sample_dosage.pl -v /path/to/ABGD/test_data/test.samples.vcf -a /path/to/ABGD/test_data/test.ref.cons.alleles -d -p > /path/to/ABGD/test_data/test.samples.group.dosage
If normalized dosage output is desired, the variant file containing allopolyploid samples will also require the inclusion of a normalization reference sample. The normalization reference sample is used to calculate normalized allopolyploid sample subgenome dosages. In our case, the normalization reference sample was a synthetic tetraploid created from proportional quantities of high-quality subgenome reads. Please see the publication for more details.
Additionally, a tab-separated coverage file is required, which defines approximate genome coverages for the normalization reference and samples. See the example below or test.samples.cov in the test_data directory.
#sample coverage
norm.ref 100.5
sample1 30.5
sample2 30.2
sample3 30.2
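Because normalization divides by each sample's coverage, a missing or zero coverage entry is worth catching early. The sketch below is not part of ABGD; the helper name has_coverage is hypothetical, and sample4 is included only to show a missing entry.

```shell
cov_file=$(mktemp)
# Same rows as the example above, written with tab separators.
printf '#sample\tcoverage\n' > "$cov_file"
printf 'norm.ref\t100.5\nsample1\t30.5\nsample2\t30.2\nsample3\t30.2\n' >> "$cov_file"

# has_coverage NAME FILE: succeed if NAME has a positive numeric coverage entry.
has_coverage() {
  awk -F'\t' -v name="$1" '
    !/^#/ && $1 == name && $2 + 0 > 0 { found = 1 }
    END { exit !found }' "$2"
}

for s in norm.ref sample1 sample4; do
  if has_coverage "$s" "$cov_file"; then
    echo "$s: ok"
  else
    echo "$s: missing from coverage file"
  fi
done
rm -f "$cov_file"
```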
Normalization is performed using the following formula, where 'cons allele X' represents one of the previously identified consensus alleles:
sample cons allele X norm = (sample cons allele X count / sample mean cov) / (norm.ref cons allele X count / norm.ref mean cov)
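The formula can be checked by hand with a short worked example. The allele counts below (12 for the sample, 40 for norm.ref) are hypothetical; the coverages are taken from the example coverage file. The helper name norm_dosage is also hypothetical and only mirrors the formula as written.

```shell
# norm_dosage SAMPLE_COUNT SAMPLE_COV REF_COUNT REF_COV
# Prints (sample count / sample cov) / (norm.ref count / norm.ref cov).
norm_dosage() {
  awk -v sc="$1" -v scov="$2" -v rc="$3" -v rcov="$4" \
    'BEGIN { printf "%.4f\n", (sc / scov) / (rc / rcov) }'
}

# sample1 at 30.5x with 12 reads of allele X; norm.ref at 100.5x with 40 reads.
norm_dosage 12 30.5 40 100.5    # prints 0.9885
```

A value near 1 means the sample carries allele X at about the same coverage-adjusted rate as the normalization reference. The sketch does not guard against a zero count in the normalization reference, which would make the ratio undefined.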
Once these conditions are met, abgd_sample_dosage.pl can be run to produce normalized group dosages for each sample using the command below.
/path/to/ABGD/bin/abgd_sample_dosage.pl -v /path/to/ABGD/test_data/test.samples.vcf -a /path/to/ABGD/test_data/test.ref.cons.alleles -c /path/to/ABGD/test_data/test.samples.cov -d -p -n norm.ref > /path/to/ABGD/test_data/test.samples.group.dosage