Perl scripts providing Computational Help for the Analysis of SEQuence data
Branch: master
Clone or download
Latest commit 995a881 Jul 27, 2018
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
LICENSE Initial commit Nov 2, 2017
LinkGe_all_site_pairs.pl Added additional statistics at end results Mar 31, 2018
README.md INITIAL COMMIT: align_codon2aa.pl Jul 27, 2018
align_codon2aa.pl Added example line Jul 27, 2018
aligned_fasta2site_nt_freqs.pl Update aligned_fasta2site_nt_freqs.pl May 30, 2018
aligned_fasta_group_diffs.pl Added capability for groups to consist of 1 sequence Mar 31, 2018
calculate_p_distance.pl Changed example syntax Nov 14, 2017
count_k-mers.pl Added k-mer size functionality and added input filename to output May 1, 2018
determine_consensus_IUPAC_seqs.pl Improved formatting by adding spaces Mar 28, 2018
extract_codon_from_ANN_VCFs.pl Initial commit extract_codon_from_ANN_VCFs Nov 14, 2017
extract_fasta_by_sites.pl Initial commit of extract_fasta_by_sites.pl Nov 12, 2017
gb2gtf.pl Added funding acknowledgments Nov 11, 2017
get_random_integers.pl Initial commit get_random_integers Nov 14, 2017
get_unique_values.pl Initial commit get_unique_values Nov 16, 2017
gff2gtf.pl Added funding acknowledgments Nov 11, 2017
haplotypes_provided_sites.pl Initial commit: haplotypes_provided_sites.pl May 30, 2018
remove_seqs_with_stops.pl Initial commit remove_seqs_with_stops Nov 15, 2017
reverse_complement.pl Initial commit: reverse_complement.pl May 30, 2018
split_fasta.pl Added funding acknowledgments Nov 11, 2017
store_fasta_by_ID.pl Allow the header to have more characters Mar 28, 2018
vcf2revcom.pl Updated generate_reverse_complement_gtf subroutine Dec 14, 2017

README.md

CHASeq

Perl scripts providing Computational Help for the Analysis of Sequence data.

Contents

  • Before You Begin
  • Citation
  • Troubleshooting
  • Scripts
    • align_codon2aa.pl. You want to impose an amino acid alignment on the underlying nucleotide sequence.
    • aligned_fasta_group_diffs.pl. You have two or more groups of sequences, aligned to one another but in separate FASTA files, and want to identify the sites at which the groups exhibit differences.
    • aligned_fasta2site_nt_freqs.pl. You want to tabulate the number (and proportion) of each nucleotide at each variable position across an aligned nucleotide FASTA file.
    • calculate_p_distance.pl. You want to calculate a p-distance between two nucleotide sequences.
    • count_k-mers.pl. You want to tally all k-mers for an aligned sequence and its genes.
    • determine_consensus_IUPAC_seqs.pl. You want to determine the consensus and/or IUPAC sequence(s) for an aligned nucleotide FASTA file.
    • extract_codon_from_ANN_VCFs.pl. You want to pull codon variants out of annotated VCF files (i.e., files that have been annotated using the snpeff -formatEff option), so that reference and variant codons can be compared.
    • extract_fasta_by_sites.pl. You want to create separate FASTA files for segments (e.g., each of the genes) in an aligned sequence file.
    • gb2gtf.pl. You want to create a GTF file from a GenBank file, compatible with SNPGenie input.
    • get_random_integers.pl. You want to get a certain number of random integer values in a range.
    • get_unique_values.pl. You have a table with a column containing multiple rows with the same value(s) (i.e., duplicates), and you want to extract just the unique ones (i.e., remove duplicate values) and generate summary statistics for the duplicates.
    • gff2gtf.pl. You want to convert a GFF file to a simpler GTF file, compatible with SNPGenie input.
    • haplotypes_provided_sites.pl. You want to determine all haplotypes in a given FASTA alignment for a specific combination of sites.
    • LinkGe_all_site_pairs.pl. You have an aligned BAM file and a list of sites with their SNPs, and want to use Gabriel Starrett's LinkGe program to check for linkage of SNPs in the same reads.
    • remove_seqs_with_stops.pl. You have a FASTA file where each sequence is an in-frame coding sequence, and want to remove all sequences containing mid-sequence STOP codons.
    • reverse_complement.pl. You want to obtain the reverse complement of a sequence supplied at the command line.
    • split_fasta.pl. You have a FASTA file, and want to create an individual FASTA file for each sequence inside.
    • store_fasta_by_ID.pl. You want to create a new FASTA file containing only a certain subset of another FASTA.
    • vcf2revcom.pl. You want to convert a VCF SNP report file, along with its accompanying FASTA file and GTF file, to the reverse complement strand.

Before You Begin

I provide these scripts for anyone to use freely for help in performing routine bioinformatics tasks with nucleotide sequence and related data. Please just cite this GitHub page. The scripts are meant for use on Unix/Mac machines; no support is offered for Windows. Readers of code, be warned: very little effort has been made to 'clean up' my coding comments and alternative operations. Think of them as pseudogenes!

How to Use

Most scripts are designed to take (unnamed) arguments for simple data file manipulations, in the following format:

    CHASeq_script.pl <argument1> <argument2>

where <argument1> is ommitted and replaced with the desired input value. Some more complicated scripts will contain named arguments in Perl's long form, in the following format:

    CHASeq_script.pl --argument-name=<value>

where <value> is ommitted and replaced with the desired input value.

Citation

When using this software, please refer to and cite:

CHASeq software package, https://github.com/chasewnelson/CHASeq

Troubleshooting

If a script isn't working, try working through the following checklist:

  • Remember that you must open the Terminal and navigate to the directory which contains the script before you're able to call it. More advanced users may wish to place the script(s) in a directory included in your computer's PATH so that it can be called from any directory.

  • Did you place the script in the same directory as your data? If not, you could move the script to the data-containing directory, or vice versa. Another option is to provide the full path of the input files to the script, e.g., /Users/cwnelson88/Desktop/my_project/my_data.fasta

  • These scripts assume that your computer's copy of Perl is located at /usr/bin/perl. You can check whether this is the case by opening the Terminal and typing which perl at the command line. If this does not return /usr/bin/perl, then (1) copy the path it provides; (2) open the CHASeq script; (3) replace #! /usr/bin/perl at the top of the script with your own computer's path.

  • Scripts must be made executable before use. You can check whether this is the case my opening the Terminal, navigating to the directory containing the script, and typing ls -l, which will return something like the following:

      -rwxr-xr-x@  1  name  staff  155  Sep 25  2013  script.pl
    

    If the string of letters at the beginning of the line containing your script (here, it's -rwxr-xr-x) does not contain an 'x', it means it is not yet executable. You can add executable status by typing chmod +x <script.pl>, where <script.pl> is replaced with the script's name.

Scripts

  • align_codon2aa.pl. You want to impose an amino acid alignment on the underlying nucleotide sequence. At the command line, call this script with two arguments: (1) aligned amino acid sequence; and (2) nucleotide sequence. The codon-aligned nucleotide sequence will the be printed to the screen. For example:

      align_codon2aa.pl <aligned amino acid sequence> <nucleotide sequence>
      align_codon2aa.pl MSAARKRTL-L ATGTCGGCGGCTCGTAAGCGCACCTTGTTG
    
  • aligned_fasta_group_diffs.pl. You have two or more groups of sequences, aligned to one another but in separate FASTA files, and want to identify the sites at which the groups exhibit differences. First, make sure all sequences are aligned to one another (even across groups), place each group of sequences in a separate FASTA file, and create a directory containing all the group FASTA files (but no others). Then, at the command line, call this script. A results file will be placed in the working directory describing positions at which the groups differ in their major nucleotide. For more control use the following options:

    • --min_variant_maj_nt_freq to specify a minimum frequency cutoff for assigning a group's majority (consensus) nucleotide. Must be a decimal. Default=0.9 (90%).
    • --min_site_coverage to specify a minimum number of defined sequences required for calculating a group's frequencies. Must be an integer. For example, some sites in some groups may be mostly gaps (-) or undetermined (N), in which case we may not want to consider frequency calculations reliable. Default=5.

    Here is an example using defaults:

      aligned_fasta_group_diffs.pl
    

    Here is an example using using a frequency cutoff of 90% with a minimum of 8 defined nucleotides per site for each group:

      aligned_fasta_group_diffs.pl --min_variant_maj_nt_freq=.9 --min_site_coverage=8
    
  • aligned_fasta2site_nt_freqs.pl. You want to tabulate the number (and proportion) of each nucleotide at each variable position across an aligned nucleotide FASTA file. At the command line, call this script with one argument, an aligned FASTA file. One TAB-delimited results file will be placed in the working directory (*_site_summary.txt) and brief summary statistics will be printed to the Terminal. Here's an example:

      aligned_fasta2site_nt_freqs.pl <aligned_seqs.fasta>
    
  • calculate_p_distance.pl. You want to calculate a p-distance between two nucleotide sequences. At the command line, provide this script with two arguments: two FASTA (.fa or .fasta) files, each containing one sequence, which are aligned to each other. This script will exclude positions which are gaps (-) or undetermined (N) in both sequences and return a p-distance. Here's an example:

      calculate_p_distance.pl <aligned_seq_1.fasta> <aligned_seq_2.fasta>
    
  • count_k-mers.pl. You want to tally all k-mers for a group of aligned sequence and separately for its genes. At the command line, provide this script with three arguments: one ALIGNED FASTA (.fa or .fasta) file; one GTF file with gene annotations; and the max k-mer length. This script will exclude k-mers which contain gaps (-) or undetermined nucleotides (N). Here's an example:

      count_k-mers.pl <aligned.fasta> <gene_annotations.gtf> <max_k>
    
  • determine_consensus_IUPAC_seqs.pl. You want to determine the consensus and/or IUPAC sequence(s) for an aligned nucleotide FASTA file. At the command line, call this script with one argument, an aligned FASTA file. Two results files will be placed in the working directory, one with a consensus sequence (*_consensus.fa) and one with an IUPAC sequence (*_IUPAC.fa). Brief summary statistics will be printed to the Terminal. Here's an example:

      determine_consensus_IUPAC_seqs.pl <aligned_seqs.fasta>
    
  • extract_codon_from_ANN_VCFs.pl. You want to pull codon variants out of annotated VCF files (i.e., files that have been annotated using the snpeff -formatEff option), so that reference and variant codons can be compared. At the command line, call this script in a directory containing one or more files ending in .ann.vcf. It will extract the information in the "CODON: Codon change" field (e.g. ggT/ggG) for each record and output to the files ref_codons_seq.txt, alt_codons_seq.txt, and codon_metadata.txt. Here's an example:

      extract_codon_from_ANN_VCFs.pl
    
  • extract_fasta_by_sites.pl. You want to create separate FASTA files for segments (e.g., each of the genes) in an aligned sequence file. At the command line, provide this script with two arguments: (1) one FASTA file containing one or more aligned sequences; and (2) a GTF file containing the CDS products to extract. It will extract the sites for each coding element (CDS) annotation in the Gene Transfer Format, and output one FASTA file for each. Here's an example:

      extract_fasta_by_sites.pl <multiple_aligned_seqs.fasta> <gene_coordinates_to_extract.gtf>
    
  • gb2gtf.pl. (Helpful for preparing SNPGenie input!) DEPRECATED! Use a tool like gffread.

  • get_random_integers.pl. You want to get a certain number of random integer values in a range. At the command line, provide this script with three arguments: (1) number of random integer values wanted; (2) bottom of range (inclusive); (3) top of range (inclusive). Returns the specified number of random integers in the desired range. Here's an example to return 10 values in the range [3,999]:

      get_random_integers.pl 10 3 999
    
  • get_unique_values.pl. You have a table with a column containing multiple rows with the same value(s) (i.e., duplicates), and you want to extract just the unique ones (i.e., remove duplicate values) and generate summary statistics for the duplicates. At the command line, provide this script with two arguments: (1) the name of the input tab-delimited file; and (2) the column number to analyze. The script outputs the unique values in the column, i.e., duplicate values eliminated, including the header. Also creates a summary statistics file in the working directory, by the name *_unique_values_summary.txt. Here's an example:

      get_unique_values.pl <my_table.txt> <column_number>
    
  • gff2gtf.pl. (Helpful for preparing SNPGenie input!) DEPRECATED! Use a tool like gffread.

  • haplotypes_provided_sites.pl. You want to determine all haplotypes in a given FASTA alignment for a specific set of sites provided. At the command line, provide this script with two named arguments: (1) --fasta_file, the name of the aligned FASTA file; and (2) --sites_file, the name of a file containing a list of the sites of interst, separated by any whitespace (TABS, spaces, or newlines). The script outputs the unique haplotypes and their counts. Run as follows:

      haplotypes_provided_sites.pl --sites_file=<sites_of_interest>.txt --fasta_file=<aligned_seqs>.fa
    
  • LinkGe_all_site_pairs.pl. You have an aligned BAM file and a list of sites with their SNPs, and want to use Gabriel Starrett's LinkGe program to check for linkage of SNPs in the same reads. For example, you may wish to test if high-frequency variant nucleotides are present in linked haplotypes in a deep-sequenced viral sample from a single host. At the command line, provide this script with the named arguments:

    • --snp_file: REQUIRED. One TAB-delimited, CLC-style "SNP report" file with, at minimum, the following three columns (with headers):
      • "Reference Position": the site number of the single nucleotide variant within the aligned BAM reads.
      • "Reference": the reference (majority) nucleotide.
      • "Allele": the allele (variant or mutant) nucleotide.
    • --BAM_file: REQUIRED. Standard format; must be aligned; reads of any length.
    • --max_dist: the maximum distance at which to search for pairs of sites covered by the same read. DEFAULT: 1000.

    This script will then determine all pairs of sites provided in the snp_file which fall within max_dist of one another; run LinkGe for all identified pairs; and output read linkage information for all individual pairs (LinkGe_* file generated in working directory) and total summary data for linkage of reference and variant nucleotides (printed to screen; WT refers to the reference nucleotide, Mut refers to the alternative nucleotide). Here's an example:

      LinkGe_all_site_pairs.pl --snp_file=<my_SNPs>.txt --BAM_file=<aligned_BAM>.bam --max_dist=500
    

    Note that this script requires you to install LinkGe, and to add it to your computer's PATH. Three dependencies are also required: Bio::DB::Sam, BioPerl, and Samtools version 0.1.15. See the LinkGe page for details.

  • remove_seqs_with_stops.pl. You have a FASTA file where each sequence is an in-frame coding sequence, and want to remove all sequences containing mid-sequence STOP codons. At the command line, provide this script with one argument: a FASTA (.fa or .fasta) file containing multiple coding sequences that all begin at the first position of the first codon (they need not be aligned to each other). This script will create a new FASTA file in the working directory, containing just those sequences lacking an in-frame mid-sequence STOP codon. Here's an example:

      remove_seqs_with_stops.pl <my_fasta_file.fasta>
    
  • reverse_complement.pl. You want to obtain the reverse complement of a sequence supplied at the command line. Simply provide this script with one argument, the sequence:

      reverse_complement.pl <nucleotide_sequence>
    
  • split_fasta.pl. You want to create an individual FASTA file for each sequence in another FASTA file. At the command line, provide this script with one argument: a FASTA (.fa or .fasta) file containing multiple sequences. This script will create multiple files in the working directory, each containing one of the sequences. Here's an example:

      split_fasta.pl <my_fasta_file.fasta>
    
  • store_fasta_by_ID.pl. You want to create a new FASTA file containing only a certain subset of another FASTA. At the command line, provide this script with two arguments: (1) a FASTA (.fa or .fasta) file containing multiple sequences; and (2) a .txt file containing one column with the FASTA IDs you want in your new FASTA file — there should be NO HEADER. This script will create a new FASTA file in the working directory, containing the sequences whose headers begin with the IDs in argument (2). Redirect (>) the output to create a file. Here's an example:

      store_fasta_by_ID.pl <all_seqs.fasta> <wanted_seqs_headers.txt> > <just_wanted_seqs.fasta>
    
  • vcf2revcom.pl. (Helpful for preparing SNPGenie input!) You want to convert a VCF SNP report file, along with its accompanying FASTA file and GTF file, to the reverse complement strand. This script automates the creation of such reverse complement files, compatible with SNPGenie input, but the input must already be compatible with SNPGenie specifications. Note that the resulting SNP report will not be a VCF file, but rather a CLC Genomics Workbench format file. At the command line, provide this script with three arguments, in the following order:

    1. A '+' strand FASTA (.fa or .fasta) file containing the reference sequence (ALL UPPERCASE) against which SNPs were called;
    2. A '+' strand GTF file containing both '+' and '–' strand products from the '+' strand point of view; and
    3. A '+' strand SNP report in VCF format.

    This script will then create a '-' strand (reverse complement) version of each file in the working directory, with "_revcom" concatenated to the original file name. Here's an example:

      vcf2revcom.pl <my_reference_sequence.fasta> <my_cds_file.gtf> <my_snp_report.vcf>