This repository contains all the code used in our study (Pardo-De la Hoz et. al., 2023) of phylogentic conflicts in nostocalean cyanobacteria, where we showed that an ancient rapid radiation explains most phylogenetic conflicts in the relationships among the major lineages of this order. The paper is now out in Systematic Biology as Advanced Access.
The file commands.sh
has the full workflow with calls to all the scripts in the scripts/
directory. This file was basically my logbook while I was doing the project. It is a reference to navigate the scripts, and has everything needed to replicate the study.
I am also putting together a project website where I will have a more detailed and reader-friendly breakdown of the workflow and computational setup that I used. Hopefully this will be helpful for people trying to do similar analyses.
Disclaimer: This was the first genomic scale project that I worked on. I started in 2020 and continued until late 2022. I didn't have formal training in computational biology or programming, and I have been learning and adopting good coding practices through the whole process. You will see this reflected in some of the scripts I used for this study. In some cases, I reorganized or re-wrote parts of my scripts so that they were easier to follow. However, some parts may still be a little clunky, albeit still functional. The project website is one more attempt to make everything as clear, transparent as possible, but feel free to contact me if something is not clear or doesn't work they way it's suposed to.
The data and output from all analyses are deposited in a Dryad Repository. You can download the data.zip
file. WARNING: the contents of this archive will take ~69 Gb of space when decompressed. The only files we did not include where the raw outputs from the posterior sampling done with PhyloBayes, because it would double the size of this-already huge-repository. However, the posterior samples from this analyses are summarized in the files under analyses/phylonetworks/bucky/infiles/
. Please feel free to conctact me at cjpadodelahoz[at]gmail.com if you want access to the raw outputs. The rest of it is organized as follows:
genomes/
: Contains the assemblies of all 220 genomes that we used in the study in fasta format. These assemblies were the starting point for all analyses.databases/
: Contains the blast database used by mafft-homologs during the alignments of the amino acid sequences.analyses/
: Contains the output from all analyses. It is structured like this:genome_qc/
: Contains the output of the BUSCO analyses that were used for QC of the sampled genomes and to extract the loci used in several of the phylogenomic analyses.busco/all_cyanodb10/
contains the output of the BUSCO analyses done with the cyanobacteria_odb10 database.busco/all_nostocalesdb10/
contains the output of the BUSCO analyses done with the nostocales_odb10 database. The structure of both of these folders is the same:BUSCO_result_all.csv
: Matrix of busco loci (rows) and taxa (columns) where the elements indicate whether a particular locus was complete, duplicated, fragmented, or missing from a particular taxon genome.BUSCO_num_result_all.csv
: Same asBUSCO_result_all.csv
but the elements of the matrix were replaced by 0 (duplicated, fragmented or missing) and 1 (complete). Both of these files are generated by the R scriptsscripts/filter_taxa_and_loci_all.R
for the cyanodb10, andscripts/filter_loci_nostocalesdb10.R
for the nostocalesdb10.summary
: compilation of thefull_table.tsv
(see below) files from all the taxa. I used this to summarize the BUSCO results across all taxa.by_taxon/
: contains the output of the BUSCO analyses done with the respective database, with a directory for each of the 220 taxa. For each taxon, the busco output consists of:short_summary.specific.*
: a short summary of the busco results.logs/
directory with the STDERR and STDOUT of BUSCO, HMMER, and Prodigal.prodigal_output/
: gene predictions in nucleotide and amino acid format (predicted_genes/predicted.*
), andtmp/
subdirectory with the prodigal run temp files.run_*_odb10/
: database-specific busco results:busco_sequences/
: FASTA files with the hits of the BUSCO loci found for the target genome in nucleotide (*.fna
) and amino acid (\*.faa
) form, organized by busco status (complete, multicopy, and fragmented).hmmer_output/
: HMMER output for the seraches with each of the HMMs of the corresponding ortholog database gainst the target genome.full_table.tsv
: table with full details of BUSCO results, including status for each locus code, sequence coordinates of the hits, and gene names.missing_busco_list.tsv
: list of missing busco loci codes.short_summary.txt
: same asshort_summary.specific.*
above.
prelim/
: Contains the sequences and alignments used to infer the preliminary phylogenetic tree with all taxa used in the study (Fig. S1). It also contains the RAxML output of the concatenated analyses, including the tree in newick format undertrees/concat
L31/
: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L31 dataset.L70/
: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L70 dataset.L746/
: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L746 dataset.L1648/
: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L1648 datasets, including the alignments and trees generated with different alignment trimming strategies (ng, strict, kcg, and kcg2). This directory also contains the alignment summaries obtained with AMAS.py, which were used to generate Fig. S3. Finally, there is also subdirectory with the output tables from Modelfinder, which were used to compare the fit of site-heterogeneous and site-homogeneous model, summarized in Fig. S4.ngmin/
: Contains the nucleotide and amino acid alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L1082 (nucleotide) and L1233 (amino acid) datasets. It also contains the input and output from the treeshring analyses, which was used to filter the L1648 alignments from taxa in relatively long branches before producing the ngmin datasets.tbas/
: Contains the nucleotide and amino acid sequences, alignments, as well as the concatenated nucleotide tree and iqtree output obtained with the L1648 loci and 16S rDNA for the 211 taxa that passed the initial QC filter. This is the tree that we will upload to T-BAS for people to use as a reference for placement of new nostocalean taxa.conflict/
: Contains the input and output files from the three discovista analyses that we conducted to investigate phylogenetic conflict: gene trees vs their corresponding concatenated trees; concatenated trees vs 22 key topological bipartitions; and astral trees vs 22 key topological bipartitions. The Discovista outputs from these analyses were used to generate the pie charts in Fig. 2, the heatmap in figure three, as well as the boxplots in Figs. S6-S7.phylonetworks/
: Contains the 1293 alignments with from taxa subset1 used to infer the bayesian gene trees from which we calculated concordance factors for the snaq inferences. It also contains the raw (modelfinder_bulk/
) and extracted (modelfinder_out
) output from the modelfinder analyses on these alignments, which we use to select the best model for each locus for MCMC sampling using phylobayes. Within thealignments/
directory, there are also two.pb
file for each locus. Those files contain the locus-specific command line that was used to run Phylobayes with the best model found for each locus. Thebucky/infiles/*.in
files contain the counts of the different tree topologies from the bayesian posterior samples from each locus, which were used as input to run bucky and infer concordance factors. Thebucky/outfiles/*.concordance
files contain the log of the BUCKy run for each quartet (set of for taxa with labels 1–12). Thebucky/outfiles/*.cf
contain the concordance factors (CF) estimated for each quartet in CSV format. The first four fields are taxon names, followed by the CF estimate and lower (CF_*lo) and higher (CF*_hi) end of the 95% HPD of the estimates for each of the three alternative topologies of the quartet. The last field is the number of genes used to infer the quartet CFs. Thesnaq/
directory contains the input and output of the networ estimation analyses. Thesnaq/CFtable.csv
is a compilation of the CFs for all quartets, i.e., allbucky/outfiles/*.cf
files described above. Thestart_tree_subset1.tre
file was the starting tree for the snaq inferences, and thesubset1.tree
is the major edge topology of the best network with h = 2, which we used for Figure 4 of the paper. Thesnaq/net*.out
files contain the inferred networks in newixk format for each value of h we tested, while thesnaq/net*.best
files contain the best network for each h according to the pseudolikelihood score. Thesnaq/net*.log
files contain the SNaQ log for each run. Theplog_scores.csv
has a table with the pseudolikelihood scores of the best network for each h. Thesnaq/bootstrap/*
contains the output and logs of the 100 bootsnaq pseudoreplicate searches on the net2 network. The numbers afterboot
in the filenames are the starting seeds used for each pseudoreplicate serach.divtime/
: Contains all the files used and generated as part of the divergence time estimation.data/
has the tree from figure 4 in phylip format, including the fossil calibration, and the concatenated alignment of the 1293 loci for taxa in subset 1.gH/
contains the control file (mcmctree-outBV.ctl
) used to infer the gradient and hessian matrices needed to run the approximate likelihood analyses, the LG matrix (lg.dat
) and the output from the codeml analyis.mcmc/c{1..3}
andprior/c{1..3}
contain the control files (.ctl
files) as well as the output from the MCMC sampling of the posterior and prior distribution of node times for each of three chains (c1-c3).phylogenomic_jackknifing/
: Contains the alignments and trees obtained for the phylogenomic jackknifing analyses, summarized in Fig. 5.loci_samples/
contains files with the list of randomly sampled loci for each one of dataset sizes explored. For example,31_rep1
has a list of the first replicate of 31 randomly sampled loci from the 1293 loci which are complete for taxa subset 1.