Perl software for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
Branch: master
Clone or download
cwnelson88 Fixed score recognition and s/products/lines/ in the output.
The original did not recognize decimal scores; that has been remedied.
The script now asks the user to verify the number of lines, not
products.
Latest commit 4f2738e Jan 14, 2019

README.md

SNPGenie logo

SNPGenie is a collection of Perl scripts for estimating πN/πS, dN/dS, and other evolutionary parameters from next-generation sequencing (NGS) single-nucleotide polymorphism (SNP) variant data. Each type of analysis uses its own script:

  1. WITHIN-POOL ANALYSIS. Use snpgenie.pl, the original SNPGenie. This will perform an analysis of within-sample πN/πS from pooled NGS SNP data. SNP reports (acceptable in a variety of formats) must each correspond to a single population, with variants called relative to a single reference sequence (one sequence in one FASTA file). For example, call:

     snpgenie.pl --vcfformat=4 --snpreport=<variants>.vcf --fastafile=<ref_seq>.fa --gtffile=<CDS_annotations>.gtf
    
  2. WITHIN-GROUP ANALYSIS. Use snpgenie_within_group.pl. This will perform a traditional within-group analysis of dN/dS among aligned sequences in a FASTA file, such as possible using the MEGA software, with automation, sliding windows, parallelism, and overlapping gene features. For example, call:

     snpgenie_within_group.pl --fasta_file_name=<aligned_seqs>.fa --gtf_file_name=<CDS_annotations>.gtf --num_bootstraps=10000 --procs_per_node=8
    
  3. BETWEEN-GROUP ANALYSIS. For a traditional between-group analysis comparing different groups of aligned sequences in two or more FASTA files, such as possible using the MEGA software, with automation, sliding windows, parallelism, and overlapping gene features. For example, call:

     snpgenie_between_group.pl
    

UPDATE FOR VCF INPUT: given myriad VCF formats in use, it is NECESSARY TO SPECIFY THE FORMAT OF THE VCF SNP report input using the --vcfformat argument. Format 4 is the most common. For details, see the section on VCF.

Contents

Introduction

New applications of next-generation sequencing (NGS) use pooled samples containing DNA from multiple individuals to perform population genetic analyses. SNPGenie is a Perl program which analyzes the variant results of a single-nucleotide polymorphism (SNP) caller to calculate evolutionary parameters, such as nucleotide diversity (including its nonsynonymous and synonymous partitions, πN and πS) and gene diversity. Variant calls are typically present in annotation tables and assume that the pooled DNA sample is representative of some population of interest. For example, if one is interested in determining the nucleotide diversity of a virus population within a single host, it might be appropriate to sequence the pooled nucleic acid content of the virus in a blood sample from that host. Comparing πN and πS for a gene product, or comparing gene diversity at polymorphic sites of different types, may help to identify genes or gene regions subject to positive (Darwinian) selection, negative (purifying) selection, and random genetic drift. SNPGenie also includes such features as minimum allele frequency trimming (see Options), and can be combined with upstream applications such as maximum-likelihood SNP calling techniques (e.g., see Lynch et al. 2014). For additional background, see Nelson & Hughes (2015) in the References.

SNPGenie

SNPGenie Input

SNPGenie is a command-line application written in Perl, with no additional dependencies beyond the standard Perl package (i.e., simply download and run). Input includes:

  1. One Reference Sequence file, containing one reference sequence, in FASTA format (.fa/.fasta);
  2. One file with CDS annotations in Gene Transfer Format (.gtf); and
  3. One or more tab-delimited (.txt) SNP Reports, each corresponding to a single pooled-sequencing run (i.e., sample from a population) in CLC, VCF, or Geneious formats, with variants called relative to the reference sequence.

All input files must be placed in the same directory. Then simply run the appropriate SNPGenie script from the command line (see Options if you wish for more control). To do this, first download the snpgenie.pl script and place it in your system’s $PATH, or simply in your working directory. Next, place your SNP report(s), FASTA (.fa/.fasta), and GTF (.gtf) files in your working directory. Open the command line prompt (or Terminal) and navigate to the directory containing these files using the "cd" command in your shell. Finally, execute SNPGenie by typing the name of the script and pressing the <RETURN> (<ENTER>) key:

snpgenie.pl

Further details on input and options are below.

INPUT (1) One Reference Sequence File

Only one reference sequence must be provided in a single FASTA (.fa/.fasta) file (i.e., one file containing one sequence). Thus, all SNP positions in the SNP reports are called relative to the single reference sequence. Because of this one-sequence stipulation, a script has been provided to split a multi-sequence FASTA file into its constituent sequences for multiple or parallel analyses; see Additional Scripts below.

INPUT (2) One Gene Transfer Format File

The Gene Transfer Format (.gtf) file is tab (\t)-delimited, and must include non-redundant records for all CDS elements (i.e., open reading frames, or ORFs) present in your SNP report(s). Thus, if multiple records exist with the same gene name (e.g., different transcripts), ONLY ONE may be included in your analysis at a time. The GTF should also include any ORFs which do not contain any variants (if they exist). SNPGenie expects every coding element to be labeled as type "CDS", and for its product name to follow a "gene_id" tag. In the case of CLC and Geneious SNP reports, this name must match that present in the SNP report. If a single coding element has multiple segments (e.g., exons) with different coordinates, simply enter one line for each segment, using the same product name; they will be concatenated. Finally, for sequences with genes on the reverse '–' strand, SNPGenie must be run twice, once for each strand, with the minus strand's own set of input files (i.e., the '–' strand FASTA, GTF, and SNP report); see A Note on Reverse Complement ('–' Strand) Records below. For more information about GTF, please visit The Brent Lab. To convert a GFF file to GTF format, use a tool like gffread. A short GTF example follows:

reference.gbk	CLC	CDS	5694	8369	.	+	0	gene_id "ORF1";
reference.gbk	CLC	CDS	8203	8772	.	+	0	gene_id "ORF2";
reference.gbk	CLC	CDS	1465	4485	.	+	0	gene_id "ORF3";
reference.gbk	CLC	CDS	5621	5687	.	+	0	gene_id "ORF4";
reference.gbk	CLC	CDS	7920	8167	.	+	0	gene_id "ORF4";
reference.gbk	CLC	CDS	5395	5687	.	+	0	gene_id "ORF5";
reference.gbk	CLC	CDS	7920	8016	.	+	0	gene_id "ORF5";
reference.gbk	CLC	CDS	4439	5080	.	+	0	gene_id "ORF6";
reference.gbk	CLC	CDS	5247	5549	.	+	0	gene_id "ORF7";
reference.gbk	CLC	CDS	4911	5246	.	+	0	gene_id "ORF8";

INPUT (3) SNP Report File(s)

Each SNP report should contain variant calls for a single pooled-sequencing run (i.e., sample from population), or alternatively a summary of multiple individual sequences. Its site numbers should refer to the exact reference sequence present in the FASTA input. SNP reports must be provided in one of the following formats:

CLC Genomics Workbench

A CLC Genomics Workbench SNP report must include the following columns to be used with SNPGenie, with the unaltered CLC column headers:

  • Reference Position, which refers to the start site of the polymorphism within the reference FASTA sequence;
  • Type, which refers to the nature of the record, usually the type of polymorphism, e.g., "SNV” for single-nucleotide variants;
  • Reference, the reference nucleotide(s) at that site(s);
  • Allele, the variant nucleotide(s) at that site(s);
  • Count, the number of reads containing the variant;
  • Coverage, the total number of sequencing reads at the site(s);
  • Frequency, the frequency of the variant as a percentage, e.g., “14.6” for 14.60%; and
  • Overlapping annotations, containing the name of the protein product or open reading frame (ORF), e.g., “CDS: ORF1”.

In addition to containing the aforementioned columns, the SNP report should ideally be free of thousand separators (,) in the Reference Position, Count, and Coverage columns (default .txt output). The Frequency must remain a percentage (default .txt output). Finally, the user should verify that the reading frame in the CLC input file is correct.

Geneious

A Geneious SNP report must include the following columns to be used with SNPGenie, with the unaltered Geneious column headers:

  • Minimum and Maximum, which refer to the start and end sites of the polymorphism within the reference FASTA sequence (for SNPs, both will be the same site number);
  • CDS Position, the nucleotide position relative to the start site of the relevant CDS annotation;
  • Type, which refers to the nature of the record entry, e.g., “Polymorphism”;
  • Polymorphism Type, which gives the type of polymorphism;
  • product, containing the name of the protein product or open reading frame, e.g., ORF1;
  • Change, which contains the reference and variant nucleotides, e.g., "A -> G", and is always populated for SNP records;
  • Coverage, containing the number of sequencing reads that overlap the site; and
  • Variant Frequency, which contains the frequency of the nucleotide variant as a percentage, e.g., 14.60%.

In addition to containing the aforementioned columns, the SNP report should ideally be free of extraneous characters such as thousand separators (,). The Variant Frequency must remain a percentage (default .txt output). Finally, the user should verify that the reading frame in the Geneious output is correct.

Variant Call Format (VCF)

A VCF SNP report must include the following columns, with the unaltered VCF column headers:

  • CHROM, the name of the reference genome;
  • POS, which refers to the start site of the polymorphism within the reference FASTA sequence;
  • REF, the reference nucleotide(s) at that site(s);
  • ALT, the variant nucleotide(s) at that site(s), with multiple variants at the same site separated by commas (,);
  • QUAL, the Phred quality score for the variant;
  • FILTER, the filter status, based on such metrics as minimum frequencies and minimum quality scores;
  • INFO, information regarding read depth and/or allele frequencies;
  • FORMAT, sometimes an alternative to the INFO column for read depth and/or allele frequency data;
  • <SAMPLE>, column(s) named by sample, a common alternative to the INFO column for read depth and/or allele frequency data.

Because VCF files vary widely in format, SNPGenie now requires users to specify exactly which VCF format is being used with the --vcfformat argument. New formats are being added on a case-by-case basis; users should contact the author to have new formats incorporated. FORMAT 4 is the most common, and necessary if your file contains multiple <SAMPLE> columns. All supported formats are:

  1. FORMAT (1): --vcfformat=1. Multiple individual genomes have been sequenced separately, with SNPs summarized in the VCF file. SNPGenie will require the following in the INFO column:

    • AN, the number of alleles sequenced (e.g., "AN=60"). If absent, will use NS, the number of samples (i.e., individual sequencing experiments) being summarized (e.g., "NS=30");
    • AF, the fractional (decimal) allele frequency(-ies) for the variant alleles in the same order as listed in the ALT column (e.g., "AF=0.200"). If multiple variants exist at the same site, their frequencies must be separated by commans (e.g., "AF=0.200,0.087").
  2. FORMAT (2): --vcfformat=2. Variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). SNPGenie will require the following in the INFO column:

    • DP, the coverage or total read depth (e.g., "DP=4249");
    • AF, the fractional (decimal) allele frequency(-ies) of the variant alleles in the same order as listed in the ALT column (e.g., "AF=0.01247"; "AF=0.01247,0.08956" for two variants; etc.).
  3. FORMAT (3): --vcfformat=3. Like format 2, variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). Unlike format 2, SNPGenie will require the following in the INFO column:

    • DP4, containing the number of reference and variant reads on the forward and reverse strands in the format DP4=<num. fw ref reads>,<num. rev ref reads>,<num. fw var reads>,<num. rev var reads> (e.g., "DP4=11,9,219,38").
    • N.B.: if multiple single nucleotide variants exist at the same site, this VCF format does not specify the number of reads for each variant. SNPGenie thus approximates by dividing the number of non-reference reads evenly amongst the variant alleles.
  4. FORMAT (4): --vcfformat=4. THE MOST COMMON FORMAT. Like formats 2 and 3, variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). For this format, SNPGenie will require AD and DP data in the FORMAT and <SAMPLE> columns, where FORMAT is the final column header before the <SAMPLE> column(s) begins. The order of the data keys in the FORMAT column and the data values in the <SAMPLE> columns must match.

    • For reference allele depth, include the AD tag in the FORMAT column, which refers to the read depth value for each allele in the <SAMPLE> column(s). Values for variant allele(s) must be in the same order as listed in the ALT column (e.g., "AD" in the FORMAT column and "75,77" in the <SAMPLE> column);
    • For coverage (total read depth), include DP in the FORMAT column, which refers to the total read depth in the <SAMPLE> column(s) (e.g., "DP" in the FORMAT column and "152" in the <SAMPLE> column). If the value of DP does not equal the sum of reads for all variant nucleotides (sum of AD values), the latter will be used.

As usual, you will want to make sure to maintain the VCF file's features, such as TAB(\t)-delimited columns. Unlike some other formats, the allele frequency in VCF is a decimal.

A Note on Reverse Complement ('–' Strand) Records

Many large genomes have coding products on both strands. In this case, SNPGenie must be run twice: once for the '+' strand, and once for the '—' strand. This requires FASTA, GTF, and SNP report input for the '–' strand. I provide a script for converting input files to their reverse complement strand in the Additional Scripts below. Note that, regardless of the original SNP report format, the reverse complement SNP report is in a CLC-like format that SNPGenie will recognize. For both runs, the GTF should include all products for both strands, with products on the strand being analyzed labeled '+' and coordinates defined with respect to the beginning of that strand's FASTA sequence. Also note that a GTF file containing only '—' strand records will not run; SNPGenie does calculations only for the products on the current + strand, using the '—' strand products only to determine the number of overlapping reading frames for each variant.

Options

To alter how SNPGenie works, the following options may be used:

  • --minfreq: the minimum allele (SNP) frequency to include. Enter as a proportion/decimal (e.g., 0.01), not as a percentage (e.g., 1.0%). Default: 0.
  • --snpreport: the (one) SNP report to analyze. Default: auto-detect all .txt and .csv file(s).
  • --vcfformat: the exact format of VCF input. See VCF.
  • --fastafile: the (one) reference sequence. Default: auto-detect a .fa or .fasta file.
  • --gtffile: the (one) file with CDS annotations. Default: auto-detect the .gtf file.
  • --slidingwindow: the length of the (optional) sliding (codon) window used in the analysis. Default: 9 codons.

For example, if you wanted to specify a minimum allele frequency of 1% and specify all three input files, you could run the following:

snpgenie.pl --minfreq=0.01 --snpreport=mySNPreport.txt --fastafile=myFASTA.fa --gtffile=myGTF.gtf

Output

SNPGenie creates a new folder called SNPGenie_Results within the working directory. This contains the following TAB-delimited results files:

  1. SNPGenie_parameters.txt, containing the input parameters and file names.

  2. SNPGenie_LOG.txt, documenting any peculiarities or errors encountered. Warnings are also printed to the Terminal (shell) window.

  3. site_results.txt, providing results for all polymorphic sites. Note that, if the population is genetically homogenous at a site, even if it differs from the reference or ancestral sequence, it will not be considered polymorphic. Also keep in mind that columns are sorted by product first, then site number, with noncoding sites at the end of the file. Columns are:

    • file. The SNP report analyzed.
    • product. The CDS annotation to which the site belongs; "noncoding" if none on this strand.
    • site. The site coordinate of the nucleotide in the reference sequence.
    • ref_nt. The identity of the nucleotide in the reference sequence.
    • maj_nt. The most common nucleotide in the population at this site.
    • position_in_codon. If present in a CDS annotation, the position of this site within its codon (1, 2, or 3).
    • overlapping_ORFs. The number of additional CDS annotations overlapping this site. For example, if the site is part of only one open reading frame, the value will be 0. If the site is part of two open reading frames, the value will be 1.
    • codon_start_site. The site coordinate of the relevant codon's first nucleotide in the reference sequence.
    • codon. The identity of the relevant codon.
    • pi. Nucleotide diversity at this site.
    • gdiv. Gene diversity at this site.
    • class_vs_ref. This site's classification, as compared to the reference sequence. For example, if the site contains only one SNP, and that SNP is synonymous, the site will be classified as Synonymous. Nonsynonymous or Synonymous.
    • class. This site's classification as compared to all sequences present in the population. For example, if the population contains both A and G residues at the third site of a GAA (reference) codon, then the site will be Synonymous, because both GAA and GAG encode Glu. On the other hand, if the population also contains a C at this site, the site will be Ambiguous, because GAC encodes Asp, meaning both nonsynonymous and synonymous polymorphisms exist at the site. Nonsynonymous, Synonymous, or Ambiguous.
    • coverage. The NGS read depth at the site.
    • A. The number of reads containing an A (adenine) nucleotide at this site. N.B.: may be fractional if the coverage and variant frequency given in the SNP report do not imply a whole number.
    • C. For C (cytosine), as for A.
    • G. For G (guanine), as for A.
    • T. For T (thymine), as for A.
  4. codon_results.txt, providing results for all polymorphic sites. Columns are:

    • file. The SNP report analyzed.
    • product. The CDS annotation to which the site belongs; "noncoding" if none.
    • site. The site coordinate of the nucleotide in the reference sequence.
    • codon. The identity of the relevant codon.
    • num_overlap_ORF_nts. The number of nucleotides in this codon (up to 3) which overlap other ORFs (in addition to the current "product" annotation).
    • N_diffs. The mean number of pairwise nucleotide comparisons in this codon which are nonsynonymous (i.e., amino acid-altering) in the pooled sequence sample. The numerator of πN.
    • S_diffs. The mean number of pairwise nucleotide comparisons in this codon which are synonymous (i.e., amino acid-conserving) in the pooled sequence sample. The numerator of πS.
    • N_sites. The mean number of sites in this codon which are nonsynonymous, given all sequences in the pooled sample. The denominator of πN and mean dN versus the reference.
    • S_sites. The mean number of sites in this codon which are synonymous, given all sequences in the pooled sample. The denominator of πS mean dS versus the reference.
    • N_sites_ref. The number of sites in this codon which are nonsynonymous in the reference sequence.
    • S_sites_ref. The number of sites in this codon which are synonymous in the reference sequence.
    • N_diffs_vs_ref. This codon's mean number of nonsynonymous nucleotide differences from the reference sequence in the pooled sequence sample. The numerator of mean dN versus the reference.
    • S_diffs_vs_ref. This codon's mean number of synonymous nucleotide differences from the reference sequence in the pooled sequence sample. The numerator of mean dS versus the reference.
    • gdiv. Mean gene diversity (observed heterozygosity) for this codon's nucleotide sites.
    • N_gdiv. Mean gene diversity for this codon's nonsynonymous polymorphic sites.
    • S_gdiv. Mean gene diversity for this codon's synonymous polymorphic sites.
  5. product_results.txt, providing results for all CDS elements present in the GTF file for the '+' strand. Columns are:

    • file. The SNP report analyzed.
    • product. The CDS annotation to which the site belongs; "noncoding" if none.
    • N_diffs. The sum over all codons in this product of the mean number of pairwise nucleotide comparisons which are nonsynonymous (i.e., amino acid-altering) in the pooled sequence sample. The numerator of πN.
    • S_diffs. The sum over all codons in this product of the mean number of pairwise nucleotide comparisons which are synonymous (i.e., amino acid-conserving) in the pooled sequence sample. The numerator of πS.
    • N_diffs_vs_ref. The sum over all codons in this product of the mean number of nonsynonymous nucleotide differences from the reference sequence in the pooled sequence sample. The numerator of mean πN versus the reference.
    • S_diffs_vs_ref. The sum over all codons in this product of the mean number of synonymous nucleotide differences from the reference sequence in the pooled sequence sample. The numerator of mean πS versus the reference.
    • N_sites. The mean number of sites in this product which are nonsynonymous, given all sequences in the pooled sample. The denominator of πN and mean dN versus the reference.
    • S_sites. The mean number of sites in this product which are synonymous, given all sequences in the pooled sample. The denominator of πS and mean dS versus the reference.
    • piN. (πN.) The mean number of pairwise nonsynonymous differences per nonsynonymous site in this product.
    • piS. (πS.) The mean number of pairwise synonymous differences per synonymous site in this product.
    • mean_dN_vs_ref. The mean number of nonsynonymous differences from the reference per nonsynonymous site in this product.
    • mean_dS_vs_ref. The mean number of synonymous differences from the reference per synonymous site in this product.
    • mean_gdiv_polymorphic. Mean gene diversity (observed heterozygosity) at all polymorphic nucleotide sites in this product.
    • mean_N_gdiv. Mean gene diversity at all nonsynonymous polymorphic nucleotide sites in this product.
    • mean_S_gdiv. Mean gene diversity at all synonymous polymorphic nucleotide sites in this product.
  6. population_summary.txt, providing summary results for each population's sample (SNP report) with respect to the '+' strand. Columns are:

    • file. The SNP report analyzed.
    • sites. Total number of sites in the reference genome.
    • sites_coding. Total number of sites in the reference genome which code for a protein product on the analyzed '+' strand, given the CDS annotations in the GTF file.
    • sites_noncoding. Total number of sites in the reference genome which do not code for a protein product, given the CDS annotations in the GTF file.
    • pi. Mean number of pairwise differences per site in the pooled sample across the whole genome.
    • pi_coding. Mean number of pairwise differences per site in the pooled sample across all coding sites in the genome.
    • pi_noncoding. Mean number of pairwise differences per site in the pooled sample across all noncoding sites in the genome.
    • N_sites. The mean number of sites in the genome which are nonsynonymous, given all sequences in the pooled sample. The denominator of πN and mean dN versus the reference.
    • S_sites. The mean number of sites in the genome which are synonymous, given all sequences in the pooled sample. The denominator of πS and mean dS versus the reference.
    • piN. The mean number of pairwise nonsynonymous differences per nonsynonymous site across the genome of the pooled sample.
    • piS. The mean number of pairwise synonymous differences per synonymous site across the genome of the pooled sample.
    • mean_dN_vs_ref. The mean number of nonsynonymous differences from the reference per nonsynonymous site across the genome of the pooled sample.
    • mean_dS_vs_ref. The mean number of synonymous differences from the reference per synonymous site across the genome of the pooled sample.
    • mean_gdiv_polymorphic. Mean gene diversity (observed heterozygosity) at all polymorphic nucleotide sites in the genome of the pooled sample.
    • mean_N_gdiv. Mean gene diversity at all nonsynonymous polymorphic nucleotide sites in the genome of the pooled sample.
    • mean_S_gdiv. Mean gene diversity at all synonymous polymorphic nucleotide sites in the genome of the pooled sample.
    • mean_gdiv. Mean gene diversity at all nucleotide sites in the genome of the pooled sample.
    • sites_polymorphic. The number of sites in the genome of the pooled sample which are polymorphic.
    • mean_gdiv_coding_poly. Mean gene diversity at all polymorphic nucleotide sites in the genome of the pooled sample which code for a protein product, given the CDS annotations in the GTF file.
    • sites_coding_poly. The number of sites in the genome of the pooled sample which are polymorphic and code for a protein product, given the CDS annotations in the GTF file.
    • mean_gdiv_noncoding_poly. Mean gene diversity at all polymorphic nucleotide sites in the genome of the pooled sample which do not code for a protein product, given the CDS annotations in the GTF file.
    • sites_noncoding_poly. The number of sites in the genome of the pooled sample which are polymorphic and do not code for a protein product, given the CDS annotations in the GTF file.
  7. sliding_window_length<Length>_results.txt, containing codon-based results over a sliding window, with a default length of 9 codons.

SNPGenie Within-Group

The script snpgenie_within_group.pl estimates mean dN and dS within a group of aligned sequences in a single FASTA file. Designed for use with sequence data that outsizes what can be handled by other software platforms, this script invokes parallelism for codons and bootstrapping, and as a result has one dependency: the Parallel::ForkManager module. If this isn't already installed, please use CPAN to install it before running the script. Click here to learn how.

Note that within-group mean dN and dS are mathematically equivalent to πN and πS when groups contain sequences sampled from a single population.

Provide the script with the following arguments:

  • --fasta_file_name: the name or path of the (one) FASTA file containing aligned nucleotide sequences.
  • --gtf_file_name: the name or path of the (one) GTF file containing CDS information.
  • --num_bootstraps: OPTIONAL. the number of bootstrap replicates to be performed for calculating the standard error of dN-dS. DEFAULT: 0.
  • --procs_per_node: OPTIONAL. The number of parallel processes to be invoked for processing codons and bootstraps (speeds up SNPGenie if high performance computing resources are available). DEFAULT: 1.

The format for using this script is:

    snpgenie_within_group.pl --fasta_file_name=<aligned_sequences>.fa --gtf_file_name=<coding_annotations>.gtf --num_bootstraps=<bootstraps> --procs_per_node=<procs_per_node>

For example, on a node with access to 64 processors in a high performance computing environment, you might submit a job including the following (DON'T run this from the head node!):

    snpgenie_within_group.pl --fasta_file_name=my_virus_genomes.fa --gtf_file_name=my_virus_genes.gtf --num_bootstraps=10000 --procs_per_node=64

Results files will be generated in the working directory. For a detailed explanation of the results, consult the output descriptions above (however, note that not all files and columns generated for the within-pool analysis will be present for the within-group analysis).

To execute bootstrapping and/or sliding window analyses, run the processing script snpgenie_within_group_processor.pl with the following arguments:

  • --codon_file: the name or path of the within_group_codon_results.txt file produced using snpgenie_within_group.pl.
  • --sliding_window_size: the size of the sliding windows to be calculated, in codons.
  • --sliding_window_step: the number of codons to advance forward for each subsequent sliding window.
  • --num_bootstraps: OPTIONAL. the number of bootstrap replicates to be performed for calculating the standard error of dN-dS for genes and sliding windows. DEFAULT: 0.

The format for using the script is:

snpgenie_within_group_processor.pl --codon_file=within_group_codon_results.txt --sliding_window_size=10 --sliding_window_step=1 --num_bootstraps=10000

A typical workflow thus includes one run of snpgenie_within_group.pl to obtain codon results, followed by one or more runs of snpgenie_within_group_processor.pl to obtain sliding window results with bootstraps.

SNPGenie Between-Group

The script snpgenie_between_group.pl estimates mean dN and dS between two or more groups of sequences in FASTA format. Designed for use with sequence data that outsizes what can be handled by other software platforms, this script invokes parallelism for codons and bootstrapping, and as a result has one dependency: the Parallel::ForkManager module. If this isn't already installed, please use CPAN to install it before running the script. Click here to learn how.

Just run the script in a directory containing two or more FASTA files, each containing a group of aligned sequences which are also aligned to one another across groups (files), ending with a .fa, .fas, or .fasta extension. In other words, each FASTA file contains one group. For example, one might wish to compare sequences of the West Nile Virus 2K gene derived from three different mosquito vector species. In such a case, all sequences would first be aligned. Next, all sequences from mosquito species 1 would be placed together in one FASTA file; all sequences from mosquito species 2 would be placed in a second FASTA file; and all sequences from mosquito species 3 would be placed in a third FASTA file.

SNPGenie automatically detects all FASTA files in the working directory. The GTF file (see Gene Transfer Format for details) must be specified at the command line, along with an optional the number of bootstraps and processes:

snpgenie_between_group.pl --gtf_file=<CDS_annotations>.gtf --num_bootstraps=<integer> --procs_per_node=<integer>

Results files will be generated in the working directory. For a detailed explanation of the results, consult the output descriptions above (however, note that not all files and columns generated for the within-pool analysis will be present for the within-group analysis).

To execute bootstrapping and/or sliding window analyses, run the processing script snpgenie_between_group_processor.pl with the following arguments:

  • --between_group_codon_file: the name or path of the between_group_codon_results.txt file produced using snpgenie_between_group.pl.
  • --sliding_window_size: the size of the sliding windows to be calculated, in codons.
  • --sliding_window_step: the number of codons to advance forward for each subsequent sliding window.
  • --codon_min_taxa_total: the minimum number of total distinct sequences or taxa across both groups having defined (non-gap) codons at a codon position. If this criterion is not met for the groups being compared, the codon position is ignored.
  • --codon_min_taxa_group: the minimum number of distinct sequences or taxa in a single groups having defined (non-gap) codons at a codon position. If this criterion is not met for either of the groups being compared, the codon position is ignored.
  • --num_bootstraps: OPTIONAL. the number of bootstrap replicates to be performed for calculating the standard error of dN-dS for genes and sliding windows. DEFAULT: 0.
  • --procs_per_node: OPTIONAL. The number of parallel processes to be invoked for processing codons and bootstraps (speeds up SNPGenie if high performance computing resources are available). DEFAULT: 1.

The format for using the script is:

snpgenie_between_group_processor.pl --between_group_codon_file=<between_group_codon_results>.txt --sliding_window_size=10 --sliding_window_step=1 --codon_min_taxa_total=5 --codon_min_taxa_group=2 --num_bootstraps=10000 --procs_per_node=8 

A typical workflow thus includes one run of snpgenie_between_group.pl to obtain codon results, followed by one or more runs of snpgenie_between_group_processor.pl to obtain sliding window and/or whole-gene results with various codon criteria, numbers of bootstraps, and sliding window sizes.

How SNPGenie Works

SNPGenie calculates gene and nucleotide diversities for sites in a protein-coding sequence. Nucleotide diversity, π, is the average number of nucleotide variants per nucleotide site for all pairwise comparisons. To distinguish between nonsynonymous and synonymous differences and sites, it is necessary to consider the codon context of each nucleotide in a sequence. This is why the user must submit the starting and ending sites of the coding regions in the .gtf file, along with the reference FASTA sequence file, so that the numbers of nonsynonymous and synonymous sites for each codon may be accurately estimated by a derivation of the Nei-Gojobori (1986) method.

SNPGenie first splits the coding sequence into codons, each of which contains 3 sites. The software then determines which are nonsynonymous and synonymous by testing all possible polymorphisms present at each site of every codon in the sequence. Because different nucleotide variants at the same site may lead to both nonsynonymous and synonymous polymorphisms, fractional sites may occur (e.g., only 2 of 3 possible nucleotide substitutions at the third position of AGA cause an amino acid change; thus, that site is considered 2/3 nonsynonymous and 1/3 synonymous). Next, the SNP report is consulted for the presence of variants to produce a revised estimate of the number of sites based on the presence of variant alleles. Variants are incorporated through averaging, weighted by their frequency. Although it is relatively rare, high levels of sequence variation may alter the number of nonsynonymous and synonymous sites in a particular codon.

SNPGenie calculates the number of nucleotide differences for each codon in each ORF as follows. For every variant in the SNP Report, the number of variants is calculated as the product of the variant’s relative frequency and the coverage at that site. For each variant nucleotide (up to 3 non-reference nucleotides), the number of variants is stored, and their sum is subtracted from the coverage to yield the reference’s absolute frequency. Next, for each pairwise nucleotide comparison at the site, it is determined whether the comparison represents a nonsynonymous or synonymous change. If the former, the product of their absolute frequencies contributes to the number of nonsynonymous pairwise differences; if the latter, it contributes to the number of synonymous pairwise differences. When comparing codons with more than one nucleotide difference, all possible mutational pathways are considered, per the method of Nei & Gojobori (1986). The sum of pairwise differences is divided by the total number of pairwise comparisons at the codon (nC2, where n = coverage) to yield the mean number of differences per site of each type. This is calculated separately for nonsynonymous and synonymous comparisons. Calculating nucleotide diversity codon-by-codon enables sliding-window analyses that may help to pinpoint important nucleotide regions subject to varying forms of natural selection.

For further background, see Nelson & Hughes (2015) in the References.

Additional Scripts

Additional scripts, provided to help in the preparation of SNPGenie input, have moved to the CHASeq repository. There you can find gb2gtf.pl, gff2gtf.pl, split_fasta.pl, vcf2revcom.pl, and lots of other useful stuff. Enjoy!

Troubleshooting

  • "coverage is 0" WARNING. SNPGenie will run even when some sites lack sequencing coverage. Sometimes this presents no problems, e.g., if all sequences lack coverage in the same regions or at terminal sites. However, if coverage differs between samples, this will affect diversity estimates and will need to be accounted for: one sample might have relatively low diversity simply because numerous sites lack coverage. Such a difference is not biological, but rather a result of incomplete data. As always, it is up to the user to understand their own data, and to only supply samples that are biologically meaningful.

  • Using Windows? Unfortunately, SNPGenie was written for Unix systems, including Mac, which have Perl installed by default. Windows sometimes doesn't, but getting Perl installed is as simple as following these three-minute download instructions, and with any luck the program will work! Just open the Windows Command Prompt, and remember to type "perl" first when you run SNPGenie, i.e., type "perl snpgenie.pl". Unfortunately, if this does not work, we are unable to provide any further Windows support.

  • Try preceding the whole command line with "perl" to make sure SNPGenie is being treated as a script. For example:

      perl snpgenie.pl --minfreq=0.01 --snpreport=mySNPreport.txt --fastafile=myFASTA.fa --gtffile=myGTF.gtf
    
  • Make sure that the first line of the script contains the correct path to your machine's copy of Perl. Right now SNPGenie assumes it's located at /usr/bin/perl. You can find out if this matches your machine by typing which perl at the command line. If it's a different path than the one above, copy and paste your machine's path after the #! at the beginning of the SNPGenie script.

  • Make sure that the script is executable at the command line, as follows:

      chmod +x snpgenie.pl	
    
  • Make sure end-of-line characters are in Unix LF (\n) format. Although an attempt has been made to ensure SNPGenie can accept Windows CRLF (\r\n) or Mac CR (\r) formats, these can sometimes introduce unexpected problems causing SNPGenie to crash or return all 0 values. This is often a problem when saving Excel spreadsheets on Mac, which defaults to a Mac newline format. Trying changing the newline character to Unix LF using a free program such as TextWrangler.

  • Make sure CLC files are tab (\t)-delimited.

  • Make sure Geneious files are comma-separated.

  • Make sure the SNP reports contain all necessary columns. (See sections on Input above.)

  • Make sure the Frequency (CLC) or Variant Frequency (Geneious) SNP report columns contain a percentage, not a decimal (e.g., 11.0% rather than 0.11).

  • Make sure the SNP calling frame correct (e.g., do the codons for a product in the SNP report begin with ATG and end with TAA, TAG, or TGA?).

  • Make sure the product coordinates in the GTF file are correct. (You might use a free program such as MEGA to check that the CDS coordinates begin with ATG and end with TAA, TAG, or TGA, if that's what you expected.)

Citation

When using this software, please refer to and cite:

Nelson CW, Moncla LH, Hughes AL (2015) SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Bioinformatics 31(22):3709-11. doi: 10.1093/bioinformatics/btv449.

and this page:

https://github.com/chasewnelson/snpgenie

Studies Using SNPGenie

Contact

If you have questions about using SNPGenie, please click on the "Issues" tab at the top of this page and begin a new thread, so that others might benefit from the discussion.

Correspondance should be addressed to Chase W. Nelson at the American Museum of Natural History:

  • cnelson <AT> amnh <DOT> org

References

Image Copyright 2015 Elizabeth Ogle