Computational workflow used in our study on phylogenomics of Nostocales

This repository contains all the code used in our study (Pardo-De la Hoz et al., 2023) of phylogenetic conflicts in nostocalean cyanobacteria, where we showed that an ancient rapid radiation explains most phylogenetic conflicts in the relationships among the major lineages of this order. The paper is now out in Systematic Biology as Advance Access.

The code

The file commands.sh has the full workflow with calls to all the scripts in the scripts/ directory. This file was basically my logbook while I was doing the project. It is a reference to navigate the scripts, and has everything needed to replicate the study.

I am also putting together a project website where I will have a more detailed and reader-friendly breakdown of the workflow and computational setup that I used. Hopefully this will be helpful for people trying to do similar analyses.

Disclaimer: This was the first genome-scale project that I worked on. I started in 2020 and continued until late 2022. I didn't have formal training in computational biology or programming, and I have been learning and adopting good coding practices throughout the process. You will see this reflected in some of the scripts I used for this study. In some cases, I reorganized or re-wrote parts of my scripts so that they were easier to follow. However, some parts may still be a little clunky, albeit still functional. The project website is one more attempt to make everything as clear and transparent as possible, but feel free to contact me if something is not clear or doesn't work the way it's supposed to.

The Data

The data and output from all analyses are deposited in a Dryad repository. You can download the data.zip file. WARNING: the contents of this archive will take ~69 GB of space when decompressed. The only files we did not include were the raw outputs from the posterior sampling done with PhyloBayes, because they would double the size of this already huge archive. However, the posterior samples from these analyses are summarized in the files under analyses/phylonetworks/bucky/infiles/. Please feel free to contact me at cjpadodelahoz[at]gmail.com if you want access to the raw outputs. The rest of it is organized as follows:

  • genomes/: Contains the assemblies of all 220 genomes that we used in the study in fasta format. These assemblies were the starting point for all analyses.
  • databases/: Contains the blast database used by mafft-homologs during the alignments of the amino acid sequences.
  • analyses/: Contains the output from all analyses. It is structured like this:
    • genome_qc/: Contains the output of the BUSCO analyses that were used for QC of the sampled genomes and to extract the loci used in several of the phylogenomic analyses (an example BUSCO call and a matrix-summary command are sketched below). busco/all_cyanodb10/ contains the output of the BUSCO analyses done with the cyanobacteria_odb10 database. busco/all_nostocalesdb10/ contains the output of the BUSCO analyses done with the nostocales_odb10 database. The structure of both of these folders is the same:
      • BUSCO_result_all.csv: Matrix of busco loci (rows) and taxa (columns) where the elements indicate whether a particular locus was complete, duplicated, fragmented, or missing from a particular taxon genome.
      • BUSCO_num_result_all.csv: Same as BUSCO_result_all.csv but the elements of the matrix were replaced by 0 (duplicated, fragmented or missing) and 1 (complete). Both of these files are generated by the R scripts scripts/filter_taxa_and_loci_all.R for the cyanodb10, and scripts/filter_loci_nostocalesdb10.R for the nostocalesdb10.
      • summary: compilation of the full_table.tsv (see below) files from all the taxa. I used this to summarize the BUSCO results across all taxa.
      • by_taxon/: contains the output of the BUSCO analyses done with the respective database, with a directory for each of the 220 taxa. For each taxon, the busco output consists of:
        • short_summary.specific.*: a short summary of the busco results.
        • logs/ directory with the STDERR and STDOUT of BUSCO, HMMER, and Prodigal.
        • prodigal_output/: gene predictions in nucleotide and amino acid format (predicted_genes/predicted.*), and tmp/ subdirectory with the prodigal run temp files.
        • run_*_odb10/: database-specific busco results:
          • busco_sequences/: FASTA files with the hits of the BUSCO loci found for the target genome in nucleotide (*.fna) and amino acid (*.faa) form, organized by busco status (complete, multicopy, and fragmented).
          • hmmer_output/: HMMER output for the searches with each of the HMMs of the corresponding ortholog database against the target genome.
          • full_table.tsv: table with full details of BUSCO results, including status for each locus code, sequence coordinates of the hits, and gene names.
          • missing_busco_list.tsv: list of missing busco loci codes.
          • short_summary.txt: same as short_summary.specific.* above.
    • prelim/: Contains the sequences and alignments used to infer the preliminary phylogenetic tree with all taxa used in the study (Fig. S1). It also contains the RAxML output of the concatenated analyses, including the tree in newick format under trees/concat (an example RAxML call is sketched below).
    • L31/: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L31 dataset.
    • L70/: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L70 dataset.
    • L746/: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L746 dataset.
    • L1648/: Contains the nucleotide and amino acid sequences, alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L1648 datasets, including the alignments and trees generated with different alignment trimming strategies (ng, strict, kcg, and kcg2). This directory also contains the alignment summaries obtained with AMAS.py, which were used to generate Fig. S3. Finally, there is also a subdirectory with the output tables from ModelFinder, which were used to compare the fit of site-heterogeneous and site-homogeneous models, summarized in Fig. S4. Example iqtree, astral, and AMAS calls are sketched below.
    • ngmin/: Contains the nucleotide and amino acid alignments, as well as trees (single gene, concatenated, and astral) and iqtree output obtained with the L1082 (nucleotide) and L1233 (amino acid) datasets. It also contains the input and output from the TreeShrink analyses, which were used to remove taxa on relatively long branches from the L1648 alignments before producing the ngmin datasets (see the TreeShrink sketch below).
    • tbas/: Contains the nucleotide and amino acid sequences, alignments, as well as the concatenated nucleotide tree and iqtree output obtained with the L1648 loci and 16S rDNA for the 211 taxa that passed the initial QC filter. This is the tree that we will upload to T-BAS for people to use as a reference for placement of new nostocalean taxa.
    • conflict/: Contains the input and output files from the three DiscoVista analyses that we conducted to investigate phylogenetic conflict: gene trees vs. their corresponding concatenated trees; concatenated trees vs. 22 key topological bipartitions; and astral trees vs. 22 key topological bipartitions. The DiscoVista outputs from these analyses were used to generate the pie charts in Fig. 2, the heatmap in Fig. 3, as well as the boxplots in Figs. S6-S7.
    • phylonetworks/: Contains the 1293 alignments from taxa subset 1 that were used to infer the Bayesian gene trees from which we calculated concordance factors for the SNaQ inferences. It also contains the raw (modelfinder_bulk/) and extracted (modelfinder_out) output from the ModelFinder analyses on these alignments, which we used to select the best model for each locus for MCMC sampling with PhyloBayes. Within the alignments/ directory, there are also two .pb files for each locus; those files contain the locus-specific command line that was used to run PhyloBayes with the best model found for that locus. The rest of this directory is organized as follows (an example BUCKy run is sketched below):
      • bucky/infiles/*.in: counts of the different tree topologies in the Bayesian posterior sample from each locus, which were used as input to run BUCKy and infer concordance factors.
      • bucky/outfiles/*.concordance: log of the BUCKy run for each quartet (set of four taxa with labels 1–12).
      • bucky/outfiles/*.cf: concordance factors (CF) estimated for each quartet in CSV format. The first four fields are taxon names, followed by the CF estimate and the lower (CF_lo) and higher (CF_hi) ends of the 95% HPD of the estimates for each of the three alternative topologies of the quartet. The last field is the number of genes used to infer the quartet CFs.
      • snaq/: input and output of the network estimation analyses. snaq/CFtable.csv is a compilation of the CFs for all quartets, i.e., all bucky/outfiles/*.cf files described above. The start_tree_subset1.tre file was the starting tree for the SNaQ inferences, and subset1.tree is the major-edge topology of the best network with h = 2, which we used for Fig. 4 of the paper. The snaq/net*.out files contain the inferred networks in newick format for each value of h we tested, while the snaq/net*.best files contain the best network for each h according to the pseudolikelihood score. The snaq/net*.log files contain the SNaQ log for each run. plog_scores.csv has a table with the pseudolikelihood scores of the best network for each h.
      • snaq/bootstrap/: output and logs of the 100 bootsnaq pseudoreplicate searches on the net2 network. The numbers after boot in the filenames are the starting seeds used for each pseudoreplicate search.
    • divtime/: Contains all the files used and generated as part of the divergence time estimation (the MCMCtree sequence is sketched below). data/ has the tree from Fig. 4 in phylip format, including the fossil calibration, and the concatenated alignment of the 1293 loci for taxa in subset 1. gH/ contains the control file (mcmctree-outBV.ctl) used to infer the gradient and Hessian matrices needed to run the approximate likelihood analyses, the LG matrix (lg.dat), and the output from the codeml analysis. mcmc/c{1..3} and prior/c{1..3} contain the control files (.ctl files) as well as the output from the MCMC sampling of the posterior and prior distributions of node times for each of three chains (c1-c3).
    • phylogenomic_jackknifing/: Contains the alignments and trees obtained for the phylogenomic jackknifing analyses, summarized in Fig. 5. loci_samples/ contains files with the lists of randomly sampled loci for each of the dataset sizes explored (see the loci-sampling sketch below). For example, 31_rep1 has the list for the first replicate of 31 loci randomly sampled from the 1293 loci that are complete for taxa subset 1.
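
Example commands

The sketches below illustrate the kind of commands behind the directories described above. They are minimal, generic examples (file names, paths, seeds, and thread counts are placeholders); the exact calls used in the study are in commands.sh and scripts/.

For genome_qc/, each genome was screened with BUSCO against an ortholog database. A minimal per-genome run against cyanobacteria_odb10 might look like this:

```bash
# Per-genome BUSCO run in genome mode (Prodigal + HMMER under the hood);
# input path, output name, and thread count are placeholders.
busco -i genomes/taxon_1.fasta \
      -l cyanobacteria_odb10 \
      -m genome \
      -o taxon_1 \
      -c 8
```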
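
The 0/1 matrix in BUSCO_num_result_all.csv makes it easy to tally complete loci per taxon. A sketch with awk, assuming a comma-separated file with locus IDs in the first column and taxon names in the header row:

```bash
# Count complete (value 1) loci per taxon from the 0/1 matrix.
awk -F, '
  NR == 1 { for (i = 2; i <= NF; i++) taxon[i] = $i; next }
          { for (i = 2; i <= NF; i++) sum[i] += $i }
  END     { for (i = 2; i <= NF; i++) print taxon[i], sum[i] }
' analyses/genome_qc/busco/all_cyanodb10/BUSCO_num_result_all.csv
```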
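
For prelim/, the concatenated tree under trees/concat was inferred with RAxML. A generic rapid-bootstrap plus ML search (binary name, partition file, and seeds are placeholders, not necessarily what the study used):

```bash
# Rapid bootstrap (-x) plus best ML tree search (-f a) on a concatenated alignment.
raxmlHPC-PTHREADS-SSE3 -f a \
  -s concat.phy -q partitions.txt \
  -m GTRGAMMA -p 12345 -x 12345 -N 100 \
  -n prelim_concat -T 8
```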
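
For the L31/L70/L746/L1648 directories, single-locus trees and astral species trees follow the usual gene-tree/species-tree pattern. A sketch assuming IQ-TREE 1.6-style flags and a local astral.jar:

```bash
# Single-locus ML tree with ModelFinder (-m MFP) and ultrafast bootstrap.
iqtree -s alignments/locus_001.fasta -m MFP -bb 1000 -nt 1 -pre trees/single/locus_001

# ASTRAL species tree from the collection of single-locus trees.
cat trees/single/*.treefile > trees/single/gene_trees.nwk
java -jar astral.jar -i trees/single/gene_trees.nwk -o trees/astral/astral.tre
```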
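
The alignment summaries in L1648/ (Fig. S3) came from AMAS.py. A minimal sketch (input paths are placeholders):

```bash
# Per-alignment summary statistics (length, missing data, variable sites, ...);
# AMAS writes its summary table to summary.txt by default.
AMAS.py summary -f fasta -d dna -i alignments/ng/*.fasta
```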
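
For ngmin/, taxa on relatively long branches were flagged with TreeShrink before rebuilding the datasets. A minimal sketch of the bundled run script (input name is a placeholder; the per-gene input layout used in the study may differ):

```bash
# Flag unusually long branches across a set of gene trees with TreeShrink.
run_treeshrink.py -t gene_trees.nwk
```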
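
For phylonetworks/bucky/, the per-locus topology counts (*.in) are combined by BUCKy into concordance factors for each quartet. A sketch of one quartet run (the alpha value, output prefix, and the per-quartet grouping of input files are assumptions for illustration):

```bash
# Estimate concordance factors for one quartet from per-locus topology counts;
# writes quartet_1.concordance among other outputs.
bucky -a 1 -o bucky/outfiles/quartet_1 bucky/infiles/quartet_1/*.in
```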
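
For divtime/, the approximate-likelihood MCMCtree sequence has two stages: first compute the gradient and Hessian, then sample node times. A sketch consistent with the files described above (the chain control file name is a placeholder):

```bash
# 1) usedata = 3 in mcmctree-outBV.ctl: prepare the gradient/Hessian step;
#    for amino acid data this involves codeml with lg.dat, matching the
#    codeml output stored in gH/.
mcmctree mcmctree-outBV.ctl

# 2) The resulting out.BV (or rst2 from codeml for amino acid data) is
#    renamed to in.BV for the sampling step.
mv out.BV in.BV

# 3) usedata = 2 in each chain's control file: sample the posterior of node
#    times (usedata = 0 for the prior-only chains).
mcmctree c1.ctl
```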
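
For phylogenomic_jackknifing/loci_samples/, each replicate is just a random draw of loci from the 1293 loci that are complete for taxa subset 1. A sketch with shuf (the master locus list name is a placeholder):

```bash
# Draw the first replicate of 31 randomly sampled loci.
shuf -n 31 loci_subset1_all.txt > 31_rep1
```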
