Zingiberales Sass et al PeerJ 2016
This document outlines the bioinformatic processing steps that were used in Sass et al (2015). PeerJ. When appropriate, the perl scripts and associated files are included in the corresponding numbered directory. In other cases, command syntax for external software, a bash command file, or a description of a process is given.
1. Pre-process transcriptomes: Transcriptomes were often sequenced before Illumina converted to Illumina 1.8+ quality scores. Because most external software use Illumina 1.8+ quality encoding by default, quality scores were converted prior to any other processing. Low complexity, low quality, adapter sequences, and PCR duplicates were removed and overlapping paired-end reads were merged.
-
a. convert quality scores --
converter.pl
-
b. clean and merge reads --
sonal_2scrubReads_cs2.pl
2. Pre-process Musa acuminata CDS: The gene annotation file accompanying the draft Musa acuminata genome was used to generate a .bed format file that contained the location of the exon boundaries in the draft genome coding DNA sequences file (CDS file). This .bed format file was then used to split each gene into its component exons.
-
a. extract separate exons --
parse_gff_to_bed.pl
3. Align, SNP call, genotype, and print for tiling for Agilent array: Each cleaned transcriptome was aligned to the Musa acuminata CDS, which had been separated into exons, using novoalign with very lenient alignment stringency to enable reads with many substitutions to align. SNPs were called with SAMtools and VarScan. Retained exons had a length of at least 150 bp and had coverage in each of the 7 additional families. The SNP calls were used to print out a family-specific bait, after adjusting for insertions and deletions. The list of genes were culled to remove genes with unusually high or low GC content, that blasted to self, or were matched by the repeatMasker database. Finally, an external software was used to divide the exons into 60mer oligos at a specified tiling density (in our case, we used 1x tiling density) for designing the microarray chip to be printed by Agilent.
- a. Alignment and SNP calling
run_8_2_novoalign_full_cds_sepexons.pl
make_pileups.sh
run_varscan.sh
- b. Filter gene list
make_covered_bed_files.sh
-
intersectBedFiles.sh
- c. Genotype transcriptome and generate bed files to extract covered portions of exons
- use the SNP calls and reference sequences to generate a genotype for each transcriptome:
parseVarScanv5.pl
- adjust for indels:
remove_Zs_and_delete.sh
- gather indel list from printed output of parseVarScanv5.pl. Use indel list as input for
editBedv4e.pl
andremove_Zs_and_delete.sh
- extend the length of the exon sequence on an individual basis, even if not covered by all individuals (to help reduce edge effects):
add_slops.pl
- d. Filter gene list part 2
- remove genes that have 30>GC%>70:
countGC.pl
- remove genes blast to self:
blastn -db musa.slopped.fasta -query musa.slopped.fasta -outfmt 6 | awk '$1!=$2' | awk '{print $1}' | sort | uniq
- remove repeatMasker hits
-
e. Generate array: use
array_design.pl
(author Hernan A. Burbano, MPI-EVA and Kay Pruefer svn co http://biofs04/svn/bioinf/motiv-counts/) to design tiling for printing by Agilent
-
e. Generate array: use
4. Pre-process captured reads:
-
a.
1PreCleanup.pl
(https://github.com/MVZSEQ/denovoTargetCapturePhylogenomics) -
b.
2ScrubReads.pl
(https://github.com/MVZSEQ/denovoTargetCapturePhylogenomics) -
c.
perl uZingNucAlignFix_140923_famx.pl -Z famx.final.fasta
-
d.
postAlignFixMapsemble.pl -L LIBX -Z anything
(this script relies on 7recipBlasting.pl written by Sonal Singhal for: Singhal S. 2013. De novo transcriptomic analyses for non-model organisms: an evaluation of methods across a multi-species data set. Molecular Ecology Resources 13:403-416.) -
e.
uZingSNPCall_150128.pl
-
f.
uZingGATK2Fa_150129.pl
5. Align genotyped fasta files, filter genes
-
a.
grepForAllFromConcatenatedFastaC.sh
-
b.
mafft --maxiterate 1000 --localpair --thread 2 GENE.fasta > GENE.aligned.fasta
-
c. find bait edges in alignment
mothur>summary.seqs(fasta=GENE.aligned.fasta)
-
d. trim to MUAC boundaries
selectsites.pl –x 1 –s [muac boundaries from mothur output] GENE.aligned.fasta > GENE.B.fas
(selectsites.pl can be found here:http://raven.iab.alaska.edu/~ntakebay/teaching/programming/perl-scripts/perl-scripts.html) - e. trim to coding position starting with 1st position utilizing position from Musa cds
-
f. run macse to place sequences into coding frame
java -Xmx12g -jar macse_v1.01b.jar -prog alignSequences -seq GENE.aligned.trimmed2.fas
- g. check for internal frameshifts README_frameshiftCheck.txt
-
h. align by codon position with prank
prank -d=geneIn -o=geneOut -codon -gaprate=0.009 -gapext=0.5 -iterate=20
-
i. run RAxML to test gene tree length
raxmlHPC-SSE3 -m GTRGAMMA -p 12345 -N 10 -s geneOut -n raxOut
- j. remove genes with skewed tree length (>15% of tree length is from one individual)
6. Phylogenetic analyses
- a. partitionFinder: see .cfg and starting tree files
-
b. maximum likelihood search and bootstrapping:
raxmlHPC-HYBRID -T 4 -n result -s infile.txt -q part.txt -p 12345 -x 12345 -N 1000 -c 25 -f a -m GTRCAT
[see infile and part.txt file] -
c. parsimony search:
paup> HSearch addSeq=random nreps=100 rseed=725638180
-
d. parsimony bootstrap:
paup> Bootstrap nreps=1000 seed=1486916772 / nreps=10
-
d. maximum likelihood gene trees and gene tree bootsrapping:
runRaxml1000boot.pl
-
d. ASTRAL-II:
java -Xmx8000M -jar astral.4.7.8.jar -i best309trees.txt -b bs_paths -r 1000