Skip to content

Zingiberales Sass et al PeerJ 2016

chodon sass edited this page Feb 15, 2017 · 1 revision

This document outlines the bioinformatic processing steps that were used in Sass et al (2015). PeerJ. When appropriate, the perl scripts and associated files are included in the corresponding numbered directory. In other cases, command syntax for external software, a bash command file, or a description of a process is given.

1. Pre-process transcriptomes: Transcriptomes were often sequenced before Illumina converted to Illumina 1.8+ quality scores. Because most external software use Illumina 1.8+ quality encoding by default, quality scores were converted prior to any other processing. Low complexity, low quality, adapter sequences, and PCR duplicates were removed and overlapping paired-end reads were merged.

  • a. convert quality scores -- converter.pl
  • b. clean and merge reads -- sonal_2scrubReads_cs2.pl

2. Pre-process Musa acuminata CDS: The gene annotation file accompanying the draft Musa acuminata genome was used to generate a .bed format file that contained the location of the exon boundaries in the draft genome coding DNA sequences file (CDS file). This .bed format file was then used to split each gene into its component exons.

  • a. extract separate exons -- parse_gff_to_bed.pl

3. Align, SNP call, genotype, and print for tiling for Agilent array: Each cleaned transcriptome was aligned to the Musa acuminata CDS, which had been separated into exons, using novoalign with very lenient alignment stringency to enable reads with many substitutions to align. SNPs were called with SAMtools and VarScan. Retained exons had a length of at least 150 bp and had coverage in each of the 7 additional families. The SNP calls were used to print out a family-specific bait, after adjusting for insertions and deletions. The list of genes were culled to remove genes with unusually high or low GC content, that blasted to self, or were matched by the repeatMasker database. Finally, an external software was used to divide the exons into 60mer oligos at a specified tiling density (in our case, we used 1x tiling density) for designing the microarray chip to be printed by Agilent.

  • a. Alignment and SNP calling
  • run_8_2_novoalign_full_cds_sepexons.pl
  • make_pileups.sh
  • run_varscan.sh
  • b. Filter gene list
  • make_covered_bed_files.sh
  • intersectBedFiles.sh
    • c. Genotype transcriptome and generate bed files to extract covered portions of exons
  • use the SNP calls and reference sequences to generate a genotype for each transcriptome: parseVarScanv5.pl
  • adjust for indels: remove_Zs_and_delete.sh
  • gather indel list from printed output of parseVarScanv5.pl. Use indel list as input for editBedv4e.pl and remove_Zs_and_delete.sh
  • extend the length of the exon sequence on an individual basis, even if not covered by all individuals (to help reduce edge effects): add_slops.pl
    • d. Filter gene list part 2
  • remove genes that have 30>GC%>70: countGC.pl
  • remove genes blast to self: blastn -db musa.slopped.fasta -query musa.slopped.fasta -outfmt 6 | awk '$1!=$2' | awk '{print $1}' | sort | uniq
  • remove repeatMasker hits

4. Pre-process captured reads:

5. Align genotyped fasta files, filter genes

  • a. grepForAllFromConcatenatedFastaC.sh
  • b. mafft --maxiterate 1000 --localpair --thread 2 GENE.fasta > GENE.aligned.fasta
  • c. find bait edges in alignment mothur>summary.seqs(fasta=GENE.aligned.fasta)
  • d. trim to MUAC boundaries selectsites.pl –x 1 –s [muac boundaries from mothur output] GENE.aligned.fasta > GENE.B.fas (selectsites.pl can be found here:http://raven.iab.alaska.edu/~ntakebay/teaching/programming/perl-scripts/perl-scripts.html)
  • e. trim to coding position starting with 1st position utilizing position from Musa cds
  • f. run macse to place sequences into coding frame java -Xmx12g -jar macse_v1.01b.jar -prog alignSequences -seq GENE.aligned.trimmed2.fas
  • g. check for internal frameshifts README_frameshiftCheck.txt
  • h. align by codon position with prank prank -d=geneIn -o=geneOut -codon -gaprate=0.009 -gapext=0.5 -iterate=20
  • i. run RAxML to test gene tree length raxmlHPC-SSE3 -m GTRGAMMA -p 12345 -N 10 -s geneOut -n raxOut
  • j. remove genes with skewed tree length (>15% of tree length is from one individual)

6. Phylogenetic analyses

  • a. partitionFinder: see .cfg and starting tree files
  • b. maximum likelihood search and bootstrapping: raxmlHPC-HYBRID -T 4 -n result -s infile.txt -q part.txt -p 12345 -x 12345 -N 1000 -c 25 -f a -m GTRCAT [see infile and part.txt file]
  • c. parsimony search: paup> HSearch addSeq=random nreps=100 rseed=725638180
  • d. parsimony bootstrap: paup> Bootstrap nreps=1000 seed=1486916772 / nreps=10
  • d. maximum likelihood gene trees and gene tree bootsrapping: runRaxml1000boot.pl
  • d. ASTRAL-II: java -Xmx8000M -jar astral.4.7.8.jar -i best309trees.txt -b bs_paths -r 1000
Clone this wiki locally