Skip to content

Tutorial

biologyguy edited this page Mar 6, 2018 · 45 revisions

Curating putative orthogroups with RD-MCL

In the following tutorial, we will perform an end-to-end walkthrough of a workflow that can leave you with high quality orthogroup assignments for a gene family.

A quick note up front: RD-MCL is computationally intensive – Some of the examples below could take over a week to run on a basic workstation. The good news is that all results files have been saved to the GitHub repository and links are provided in the appropriate sections, so feel free to terminate jobs after you learn how to get them running and grab the pre-calculated output to carry on with. Now without further ado, the tutorial!

Table of Contents

Setup

Before we get started, you will need to ensure that:

  1. You are working on a *NIX system (most likely Linux or Mac OS X)
  2. You can launch a bash shell and have at least a rudimentary understanding of how to navigate within that shell
  3. Python3 and RD-MCL are installed
  4. BuddySuite is installed
  5. The alignment program Clustal Omega is installed
  6. You have an active internet connection on the machine you will be running RD-MCL on
  7. Optional: Phylogenetic tree visualization software (FigTree is a very good choice)

List of files created through the tutorial

All of the files downloaded/generated during this tutorial can be accessed from this list if you do not want to create them yourself.

Download files

Preamble

The caspases will be used as our example protein family for this tutorial, because they are widely distributed across the Metazoa, there is diversity among the individual proteins, the family as a whole has already enjoyed considerable attention and annotation, and the presence of a CASc domain can be used as the minimum requirement to classify a protein as a caspase. This final point is particularly important, because it constitutes the human imposed definition of the family. Whichever family you choose to study on your own, be very clear about this definition up front!

Why? Consider that genomic plasticity is a fact of life, leading to all manner of complications when we try to reconstruct the lineage of specific genomic regions. Indels are rampant over evolutionary time, as is wholesale domain swapping, so it is rare for any two genes to be homologous along their entire length. Therefore, contemporary genetic nomenclature is largely a convenience that we use to help us understand patterns and function, not a rigorous means of documenting precise homology and descent.

For example, consider the domain architecture of the mammalian caspases below.

Lavrik and Krammer, Acta Naturae. 2009 Apr; 1(1): 80–83

They all contain large and small catalytic domains, which presumably share homologous residues across most of their lengths, but the N-termini of each protein has some combination of DEDs and CARDs (or nothing at all). Therefore, the proteins in this 'family' are fusions of several disparate domains, and the community has decided that the catalytically active domains are the most pertinent for naming purposes. But what if the first people to study caspases were only interested in its protein binding properties? What if they decided that CARD domains were the most important feature? There are dozens of other CARD containing proteins, so one may argue that the naming convention we currently have is simply a quirk of history.

The takeaway message is to not get too hung up on a protein or gene's name, but instead to ask yourself what it is about a particular class of protein that you find the most interesting. What defining features are homologous across all of the 'family' you want to study? Once you have answered that question, you're all set to sub-classify them into useful orthogroups.

Step 1: Acquiring an initial caspase sequence from RefSeq

We will begin by searching the well-curated RefSeq database for the human caspase-2 sequence using DatabaseBuddy (part of BuddySuite).

Launch DatabaseBuddy and follow along. Hopefully the DatabaseBuddy commands will be intuitive, but if you're feeling lost there's a BuddySuite tutorial that you may want to go through before continuing:

$: databasebuddy
DbBuddy> database ncbi_prot

DbBuddy> search "Homo sapiens[Organism] AND caspase-2*[Protein Name] AND srcdb_refseq[PROP]"

DbBuddy> show

If all has gone well, you should see a similar output as shown below after running the show command:

You'll notice that our search has pulled in a few more sequences than we wanted, but NP_116764.2 (highlighted) looks like the primary isoform, so keep that one and fetch the sequence as a GenBank file.

DbBuddy> keep "NP_116764.2"

DbBuddy> fetch

DbBuddy> format genbank

DbBuddy> write hsap_casp2.gb

DbBuddy> quit

Step 2: PSI-BLAST search RefSeq for CASc domains

We are starting with the human caspase-2 sequence because it is well curated. As stated in the preamble, the catalytic CASc domain is the common feature we have assigned to the family, so let's use SeqBuddy to extract that feature from caspase2 and redirect it into a FASTA file.

$: seqbuddy hsap_casp2.gb --extract_feature_sequences CASc --out_format fasta > hsap_casp2_casc.fa

Now use PSI-BLAST to query the RefSeq database for other sequences containing a CASc domain. Navigate to the NCBI BLASTP web portal and run your search after changing the default parameters as marked with red arrows below:

You should get back almost 6000 results from the first search, but we want to iterate the PSI-BLAST a few times to pull out as many as possible. Click the "Go" button to do so:

After the third or fourth iteration, you should be seeing less than ten new records being identified, so call it quits at that point. Download the CSV hit-table:

Rename the file to something a little more meaningful.

$: mv /path/to/Downloads/<name of your file> ./casc_blast.csv

Step 3: Filter BLAST hits

Pull out the uniq accession numbers from the hit table using the following UNIX commands and regular expression:

$: cut -f2 -d ',' casc_blast.csv | \
awk 'match($0,/[NX]P_[0-9]+(.[0-9]+)?/) {print substr($0,RSTART,RLENGTH)}' | \
sort | uniq > casc_blast_accns.txt

Now load the accessions up into DatabaseBuddy to get a better look at what was returned.

$: databasebuddy casc_blast_accns.txt -d ncbi_prot
DbBuddy> sort organism

DbBuddy> show 100

Many of the records are alternate splice variants/isoforms. Although this is useful information in many contexts, splice variants are not homologs (they are literally the same gene!), so they need to be filtered down to a representative transcript. Let's remove everything except the primary variants, and also get rid of anything marked as partial or LOW QUALITY.

DbBuddy> remove "[iI]soform (X1[0-9]|X[2-9])" "[iI]soform (1[0-9]|[2-9])" "[iI]soform ([b-zB-Z]|A[b-zB-Z]+)"

DbBuddy> remove "LOW QUALITY PROTEIN" "partial"

This leaves just under 3,000 sequences.

If we sort the records by size, we also see a few that are very large and others that are quite small. The average caspase is around 400 residues, with a CASc domain of about 250 residues, so let's filter out the hundred-ish sequences less than 200 residues and the fifty-ish sequences that are greater than 800 residues. Large sequence can be particularly troublesome from a performance perspective, because they slow down the multiple sequence alignment considerably.

DbBuddy> sort length

DbBuddy> show 100

DbBuddy> remove (length<200)

DbBuddy> sort length reverse

DbBuddy> show 100

DbBuddy> remove (length>800)

Fetch the records and save them in GenBank format:

DbBuddy> fetch

DbBuddy> format genbank

DbBuddy> write refseq_casps.gb

DbBuddy> quit

Step 4: Select sequences for RD-MCL

Use the SeqBuddy command taxonomic_breakdown to visualize the taxonomic spread of the records you just curated (the integer value passed in specifies the depth of the breakdown):

$: seqbuddy refseq_casps.gb --taxonomic_breakdown 4
Eukaryota    2889
 |Metazoa    2889
 | |Echinodermata    34
 | | |Eleutherozoa    34
 | |Placozoa    1
 | | |Trichoplax    1
 | |Hemichordata    15
 | | |Enteropneusta    15
 | |Porifera    27
 | | |Demospongiae    27
 | |Platyhelminthes    8
 | | |Trematoda    8
 | |Cnidaria    80
 | | |Hydrozoa    21
 | | |Anthozoa    59
 | |Chordata    1950
 | | |Craniata    1899
 | | |Tunicata    15
 | | |Cephalochordata    36
 | |Ecdysozoa    604
 | | |Nematoda    12
 | | |Scalidophora    3
 | | |Arthropoda    589
 | |Lophotrochozoa    170
 | | |Annelida    5
 | | |Brachiopoda    35
 | | |Mollusca    130

The first thing that should be jumping out at you is how unbalanced the taxonomic distribution is. There is currently a strong bias in our public databases towards vertebrates and certain orders of insects, so don't be fooled into thinking there is a dearth of caspases throughout broad swaths of the Metazoa; in reality, we just haven't sequenced them yet.

RD-MCL deals reasonably well with imbalances, but the final results will be much cleaner and easier to interpret if we focus on the clades with high coverage. Let's begin by digging deeper into the distribution within the chordates:

$: seqbuddy refseq_casps.gb --keep_taxa Chordata | seqbuddy --taxonomic_breakdown 7
Eukaryota    1950
 |Metazoa    1950
 | |Chordata    1950
 | | |Craniata    1899
 | | | |Vertebrata    1899
 | | | | |Chondrichthyes    11
 | | | | | |Holocephali    8
 | | | | | |Elasmobranchii    3
 | | | | |Euteleostomi    1888
 | | | | | |Actinopterygii    525
 | | | | | |Amphibia    37
 | | | | | |Lepidosauria    45
 | | | | | |Coelacanthiformes    10
 | | | | | |Archelosauria    465
 | | | | | |Mammalia    806
 | | |Tunicata    15
 | | | |Ascidiacea    15
 | | | | |Enterogona    15
 | | | | | |Phlebobranchia    15
 | | |Cephalochordata    36
 | | | |Branchiostomidae    36
 | | | | |Branchiostoma    36
 | | | | | |floridae    17
 | | | | | |belcheri    19

Great! It looks like three distinct groups within the Euteleostomi have high coverage: the Actinopterygii (fish), Archelosauria (turtles, crocodiles, birds), and Mammalia. Run --taxonomic_breakdown 0 on each of these clades in turn and observe the number of genes present in each species, this will give you an approximate idea as to how many orthogroups to expect:

$: seqbuddy refseq_casps.gb --keep_taxa Archelosauria | seqbuddy --taxonomic_breakdown 0
...
 | | | | | | | | | | | | | | |Passeroidea    23
 | | | | | | | | | | | | | | | |Estrildidae    18
 | | | | | | | | | | | | | | | | |Estrildinae    18
 | | | | | | | | | | | | | | | | | |Taeniopygia    10
 | | | | | | | | | | | | | | | | | | |guttata    10
 | | | | | | | | | | | | | | | | | |Lonchura    8
 | | | | | | | | | | | | | | | | | | |striata domestica    8
 | | | | | | | | | | | | | | | |Fringillidae    5
 | | | | | | | | | | | | | | | | |Carduelinae    5
 | | | | | | | | | | | | | | | | | |Serinus    5
 | | | | | | | | | | | | | | | | | | |canaria    5
 | | | | | | | | | | | | | | |Paridae    10
 | | | | | | | | | | | | | | | |Pseudopodoces    6
 | | | | | | | | | | | | | | | | |humilis    6
 | | | | | | | | | | | | | | | |Parus    4
 | | | | | | | | | | | | | | | | |major    4
 | | | | | | | | | | | | | | |Pipridae    11
 | | | | | | | | | | | | | | | |Manacus    6
 | | | | | | | | | | | | | | | | |vitellinus    6
 | | | | | | | | | | | | | | | |Lepidothrix    5
 | | | | | | | | | | | | | | | | |coronata    5
 | | | | | | | |Crocodylia    22
 | | | | | | | | |Alligatoridae    13
 | | | | | | | | | |Alligatorinae    13
 | | | | | | | | | | |Alligator    13
 | | | | | | | | | | | |sinensis    8
 | | | | | | | | | | | |mississippiensis    5
 | | | | | | | | |Longirostres    9
 | | | | | | | | | |Gavialidae    4
 ...

By my guess, we should anticipate around 10 major orthogroups for each of these clades. If you observe 20+ genes in each species for your pet protein family, than you would expect closer to twenty, and likewise, expect much fewer if you only see two or three genes per species. It's just a rough estimate.

Let's prepare each group for RD-MCL:

$: seqbuddy refseq_casps.gb --keep_taxa Actinopterygii | seqbuddy --prepend_organism --out_format fasta > fish_casps.fa

$: seqbuddy refseq_casps.gb --keep_taxa Archelosauria | seqbuddy --prepend_organism --out_format fasta > bird_casps.fa

$: seqbuddy refseq_casps.gb --keep_taxa Mammalia | seqbuddy --prepend_organism --out_format fasta > mammal_casps.fa

Similarly, have a look at the Arthropoda:

$: seqbuddy refseq_casps.gb --keep_taxa Arthropoda | seqbuddy --taxonomic_breakdown 6

Total: 599

Eukaryota    589
 |Metazoa    589
 | |Ecdysozoa    589
 | | |Arthropoda    589
 | | | |Crustacea    12
 | | | | |Malacostraca    12
 | | | |Hexapoda    553
 | | | | |Insecta    544
 | | | | |Collembola    9
 | | | |Chelicerata    24
 | | | | |Merostomata    7
 | | | | |Arachnida    17

Insects are making up the vast majority, so zoom in on them:

seqbuddy refseq_casps.gb --keep_taxa Insecta | seqbuddy --taxonomic_breakdown 10

Total: 554

Eukaryota    544
 |Metazoa    544
 | |Ecdysozoa    544
 | | |Arthropoda    544
 | | | |Hexapoda    544
 | | | | |Insecta    544
 | | | | | |Pterygota    544
 | | | | | | |Neoptera    544
 | | | | | | | |Polyneoptera    3
 | | | | | | | | |Dictyoptera    3
 | | | | | | | |Paraneoptera    25
 | | | | | | | | |Psocodea    4
 | | | | | | | | |Hemiptera    21
 | | | | | | | |Holometabola    516
 | | | | | | | | |Lepidoptera    45
 | | | | | | | | |Coleoptera    42
 | | | | | | | | |Diptera    282
 | | | | | | | | |Hymenoptera    147

Most of the coverage is in the Diptera (flies) and Hymenoptera (wasps/bees/ants), so grab out those two clades for further analysis as well:

$: seqbuddy refseq_casps.gb --keep_taxa Diptera | seqbuddy --prepend_organism --out_format fasta > diptera_casps.fa

$: seqbuddy refseq_casps.gb --keep_taxa Hymenoptera | seqbuddy --prepend_organism --out_format fasta > hymenoptera_casps.fa

Note that in each of the commands above, there is a final call to the SeqBuddy function prepend_organism. This creates and appends a species identifier (first character of the genus name, first three characters from the species, and an integer if there would otherwise be a naming conflict) to each gene in the data-set, as is required by RD-MCL.

Step 5: Run RD-MCL

RD-MCL builds all-by-all similarity graphs among the sequences it is provided; this step can be very computationally expensive because these types of problems grow quadratically (i.e., doubling the number of sequences quadruples the runtime, at minimum). On my workstation, which runs 12 hyper-threaded processors, I can process 100 sequences in a few hours, while 800 sequences takes more than a week. If you have access to a cluster, I strongly recommend parallelizing the job for more than a couple hundred sequences.

Running RD-MCL serially is straightforward; at minimum, just call the program and tell it where the sequence file is:

$: rdmcl bird_casps.fa

RD-MCL has a couple dozen optional parameters you can set if you want to modify the run. For a list of those options call rdmcl --help or see the dedicated wiki page. We'll use some of these parameters to add more informative names to our output directory and final clusters.

$: rdmcl bird_casps.fa --outdir bird --group_name bird

$: rdmcl mammal_casps.fa --outdir mammal --group_name mamm

$: rdmcl fish_casps.fa --outdir fish --group_name fish

$: rdmcl diptera_casps.fa --outdir diptera --group_name dipt

$: rdmcl hymenoptera_casps.fa --outdir hymenoptera --group_name hyme

If you are following along with your own data, it will probably take a while for the job(s) to finish. If you are following along with the caspase example, you can grab the pre-computed results from here (they will be used in the following steps).

Step 6: First look at RD-MCL results

After a run has complete, the output directory should have a collection of files and folders that looks like this:

$: ls bird/
alignments/         hmm/                mcmcmc/
psi_pred/           placement.log       rdmcl.log
final_clusters.txt  paralog_cliques     sim_scores/
sqlite_db.sqlite
File/folder Description
alignments/ The MSAs used to generate the all-by-all similarity scores for each final cluster
mcmcmc/ The output of each mcmcmc chain for each final cluster
psi_pred/ Secondary structure .ss2 files for each input sequence
hmm/ Collection of hidden Markov model files
placement.log Final sequence placement into clusters log
rdmcl.log Description of the run
final_clusters.txt Your main result!
paralog_cliques List of RBH paralogs extracted from the dataset before running RDMCL
sim_scores/ All-by-all similarity graphs for each final cluster
sqlite_db.sqlite SQLlight database with all of the MSAs and similarity graphs from the run

The two files you will be most interested in, at least at first, are rdmcl.log and final_clusters.txt

The rdmcl.log file gives a full breakdown of all parameters used to generate your RD-MCL run. This will allow you to keep track of what you did and reproduce the result if necessary. Each part of this file is explained in greater detail in a separate wiki page.

$: cat bird/rdmcl.log
*************************** Recursive Dynamic Markov Clustering ****************************
RD-MCL version 1.0.4

Public Domain Notice
--------------------
This is free software; see the LICENSE for further details.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Questions/comments/concerns can be directed to Steve Bond, steve.bond@nih.gov
--------------------

********************************************************************************************

Function call: /home/bondsr/bioinf_tools/rdmcl bird_casps.fa -o bird_casps -rs 1 -q -cpu 16 -wdb /data/bondsr/workers/ -psi psipred
Date: 11:35-21-12-2017

** Environment setup **
Random seed used for this run: 1

Working directory: /data/bondsr/caspases
Output directory:
    /data/bondsr/caspases/bird_casps
SeqBuddy version: 1.3.0
AlignBuddy version: 1.3.0

Alignment method: CLUSTALO 1.2
Alignment calls: $: clustalo --pileup
TrimAl values: ['gappyout', 0.5, 0.75, 0.9, 0.95, 'clean']

Launching SQLite Daemons
Checking that the format of all sequence ids matches 'taxa-gene'
    465 sequences PASSED
68 unique taxa present

** PSI-Pred **
RESUME: All PSI-Pred .ss2 files found

** All-by-all graph **
gap open penalty: -5
gap extend penalty: 0
Generating initial all-by-all similarity graph (107880 comparisons)
 written to: bird_casps/sim_scores/complete_all_by_all.scores
	-- finished in 9 min, 5 sec --

Base cluster score: 236.2077

** Recursive MCL **
Auto detect MCMCMC convergence
Gelman-Rubin convergence breakpoint: 1.1
Number of MCMC chains: 3
Number of Metropolis-Hastings walkers per chain: 3
Total MCL runs: 4503
	-- finished in 9 hrs, 36 min, 36 sec --

** HMM-based sequence-to-cluster reassignment **
	-- finished in 11 min, 58 sec --

Placing any collapsed paralogs into their respective clusters
Preparing final_clusters.txt

Total execution time: 9 hrs, 57 min, 39 sec
Final score: 1068.3789
Clusters written to: bird_casps/final_clusters.txt

Probably even more interesting to you, are the results in final_clusters.txt:

The first column is the group name assigned to the cluster. The second column is the final score of the cluster, based on a diminishing returns algorithm. The remaining tab separated columns are the sequence names for the members of that cluster.

$: cat bird/final_clusters.txt
bird_0_0_0	161.6026	Acar-XP_010171751.1	Achl-XP_009067452.1	Achr-XP_011577228.1	Acyg-XP_013035279.1	Afor-XP_009275633.1	Amis-XP_006263939.1	Apla-XP_005030551.1	Asin-XP_006026683.1	Breg-XP_010308818.1	Brhi-XP_010129400.1	Cann-XP_008490097.1	Cbra-XP_008639693.1	Ccan-XP_009563972.1	Ccri-XP_009697353.1	Cjap-XP_015716892.1	Cliv-XP_005500235.1	Cmyd-XP_007054526.1	Cpel-XP_010000752.1	Cpor-XP_019390228.1	Cpug-XP_014800971.1	Cvoc-XP_009890080.1	Egar-XP_009640697.1	Falb-XP_005045177.1	Fche-XP_005445415.1	Fgla-XP_009570651.1	Fper-XP_005230888.1	Gfor-XP_005415590.1	Ggal-NP_990056.1	Ggan-XP_019377451.1	Gste-XP_009810205.1	Halb-XP_009922022.1	Lcor-XP_017664322.1	Ldis-XP_009958677.1	Lstr-XP_021399050.1	Mnub-XP_008944070.1	Mund-XP_005149203.1	Muni-XP_010182555.1	Mvit-XP_008923018.1	Nmel-XP_021249038.1	Nnip-XP_009473093.1	Nnot-XP_010009471.1	Ohoa-XP_009932264.1	Pade-XP_009328950.1	Pcar-XP_009512955.1	Pcri-XP_009479867.1	Pgut-XP_010078873.1	Phum-XP_005517714.1	Plep-XP_010281454.1	Pmaj-XP_015479949.1	Ppub-XP_009909179.1	Svul-XP_014745174.1	Talb-XP_009970400.1	Tery-XP_009988707.1	Tgut1-XP_002191294.1	Tgut1-XP_012428936.1	Zalb-XP_005483896.1
bird_0_1	115.6107	Acar-XP_010163578.1	Acyg-XP_013040763.1	Afor-XP_019326128.1	Apla-XP_021128354.1	Asin-XP_006025610.2	Avit-XP_009875221.1	Breg-XP_010310914.1	Brhi-XP_010134615.1	Cann-XP_008497831.1	Ccan-XP_009563276.1	Ccor-XP_010410023.1	Ccri-XP_009699574.1	Cjap-XP_015717471.1	Cmac-XP_010124189.1	Cmyd-XP_007054543.1	Cpel-XP_010002388.1	Cpic-XP_005287972.1	Cstr-XP_010200177.1	Cvoc-XP_009883787.1	Egar-XP_009647487.1	Fgla-XP_009578173.1	Fper-XP_013155138.1	Ggal-NP_990057.1	Halb-XP_009924247.1	Hleu-XP_010584548.1	Lcor-XP_017661517.1	Ldis-XP_009947440.1	Mgal-XP_010708306.2	Mund-XP_005148683.1	Muni-XP_010187950.1	Mvit-XP_017927808.1	Nmel-XP_021252130.1	Nnip-XP_009472930.1	Nnot-XP_010015536.1	Ohoa-XP_009930434.1	Pade-XP_009325341.1	Pcar-XP_009511931.1	Pcri-XP_009483599.1	Pgut-XP_010082525.1	Phum-XP_005517572.1	Plep-XP_010290767.1	Pmaj-XP_015480600.1	Ppub-XP_009903163.1	Psin-XP_014427079.1	Talb-XP_009966306.1	Tery-XP_009977213.1	Tgut1-XP_002197219.1
bird_0_2	114.1264	Aaus-XP_013807673.1	Acar-XP_010167151.1	Achl-XP_009074486.1	Achr-XP_011574988.1	Avit-XP_009876193.1	Brhi-XP_010144488.1	Cann-XP_008494158.1	Ccan-XP_009558080.1	Ccor-XP_010404899.1	Ccri-XP_009707893.1	Cjap-XP_015723629.1	Cmac-XP_010123198.1	Cpel-XP_010005644.1	Cpic-XP_005314340.1	Cpug-XP_014809790.1	Cstr-XP_010196023.1	Cvoc-XP_009888482.1	Egar-XP_009641851.1	Ehel-XP_010145466.1	Falb-XP_005049244.1	Fgla-XP_009573771.1	Gfor-XP_005415995.1	Ggal-NP_001189391.1	Gste-XP_009807246.1	Hleu-XP_010563070.1	Ldis-XP_009950008.1	Lstr-XP_021411077.1	Mgal-XP_010711730.1	Mund-XP_005145602.2	Muni-XP_010189510.1	Nnip-XP_009460549.1	Nnot-XP_010009278.1	Ohoa-XP_009938107.1	Pcar-XP_009506123.1	Pcri-XP_009479033.1	Pgut-XP_010076877.1	Phum-XP_005519962.2	Plep-XP_010285479.1	Ppub-XP_009894784.1	Psin-XP_006118011.1	Scan-XP_009086337.1	Svul-XP_014734929.1	Talb-XP_009974979.1	Tery-XP_009976784.1	Tgut1-XP_002188318.3	Zalb-XP_005484242.1
bird_0_3	111.1932	Aaus-XP_013797196.1	Achl-XP_009078556.1	Acyg-XP_013030032.1	Apla-XP_005014501.1	Avit-XP_009876249.1	Breg-XP_010295946.1	Cann-XP_008488344.1	Cbra-XP_008632648.1	Ccan-XP_009568871.1	Ccri-XP_009697487.1	Cjap-XP_015722817.1	Cmac-XP_010117445.1	Cpic-XP_005293940.1	Cpug-XP_014814053.1	Cstr-XP_010198191.1	Cvoc-XP_009879870.1	Egar-XP_009634188.1	Ehel-XP_010155261.1	Fgla-XP_009578529.1	Gfor-XP_005416425.1	Ggal-XP_421764.3	Gste-XP_009805143.1	Halb-XP_009923386.1	Ldis-XP_009948704.1	Lstr-XP_021381698.1	Mgal-XP_019473523.1	Mvit-XP_017924402.1	Nmel-XP_021255526.1	Nnip-XP_009474328.1	Nnot-XP_010010881.1	Ohoa-XP_009932919.1	Pade-XP_009324216.1	Pcar-XP_009508983.1	Pcri-XP_009483464.1	Phum-XP_005520811.1	Plep-XP_010288794.1	Ppub-XP_009906461.1	Psin-XP_006135034.1	Scam-XP_009666939.1	Scan-XP_009085424.1	Svul-XP_014728008.1	Talb-XP_009965879.1	Tgut1-XP_012430035.1	Tgut2-XP_010215551.1	Zalb-XP_005481147.1
bird_0_4	103.7729	Aaus-XP_013807671.1	Acar-XP_010165007.1	Achr-XP_011574945.1	Afor-XP_009282933.1	Apla-XP_012950189.1	Avit-XP_009876852.1	Breg-XP_010296000.1	Cann-XP_008494202.1	Ccan-XP_009558108.1	Ccri-XP_009701338.1	Cmac-XP_010125639.1	Cmyd-XP_007072004.1	Cpel-XP_010005642.1	Cpug-XP_014810034.1	Cvoc-XP_009888483.1	Egar-XP_009641850.1	Ehel-XP_010147131.1	Falb-XP_005049234.1	Fgla-XP_009583496.1	Gfor-XP_005415994.1	Ggal-XP_421936.4	Gste-XP_009807895.1	Halb-XP_009924272.1	Ldis-XP_009949374.1	Lstr-XP_021411054.1	Mgal-XP_010711743.1	Mnub-XP_008941675.1	Mund-XP_005140102.1	Mund-XP_005144727.1	Nnip-XP_009460550.1	Nnot-XP_010017544.1	Ohoa-XP_009938108.1	Pade-XP_009330078.1	Pcar-XP_009500813.1	Pcri-XP_009483191.1	Phum-XP_005519961.1	Plep-XP_010288754.1	Pmaj-XP_015489704.1	Ppub-XP_009894800.1	Scam-XP_009682549.1	Scan-XP_009086338.1	Talb-XP_009963924.1	Tgut1-XP_002188263.2	Tgut2-XP_010222872.1	Zalb-XP_005484241.1
bird_0_5	96.5389	Achl-XP_009074285.1	Afor-XP_009285422.1	Avit-XP_009872031.1	Breg-XP_010308307.1	Brhi-XP_010140546.1	Cann-XP_008497332.1	Ccan-XP_009564969.1	Ccor-XP_010394452.1	Ccri-XP_009706668.1	Cmyd-XP_007066518.1	Cpel-XP_009996635.1	Cpic-XP_005281708.1	Cvoc-XP_009891089.1	Egar-XP_009637435.1	Falb-XP_005038461.2	Fche-XP_014143045.1	Fper-XP_013153061.1	Gfor-XP_005430019.1	Ggal-NP_001161173.1	Gste-XP_009806633.1	Halb-XP_009910597.1	Hleu-XP_010565918.1	Ldis-XP_009946982.1	Lstr-XP_021404675.1	Mgal-XP_010715921.1	Mund-XP_005144852.1	Muni-XP_010184195.1	Nmel-XP_021251148.1	Nnip-XP_009459864.1	Nnot-XP_010011239.1	Ohoa-XP_009930304.1	Pade-XP_009326897.1	Phum-XP_005523866.1	Plep-XP_010281146.1	Pmaj-XP_015496309.1	Ppub-XP_009897931.1	Scam-XP_009683457.1	Scan-XP_009100903.1	Svul-XP_014747857.1	Talb-XP_009968311.1	Tgut1-XP_002192067.2	Zalb-XP_005495024.1
bird_0_6	95.903	Aaus-XP_013807675.1	Acar-XP_010165008.1	Achl-XP_009074487.1	Afor-XP_009282932.1	Apla-XP_021124568.1	Avit-XP_009876853.1	Breg-XP_010296001.1	Cann-XP_008494157.1	Ccan-XP_009558078.1	Ccor-XP_010404904.1	Ccri-XP_009701337.1	Cjap-XP_015723635.1	Cliv-XP_021155180.1	Cmac-XP_010125640.1	Cpel-XP_010005641.1	Cpug-XP_014818396.1	Cstr-XP_010197451.1	Ehel-XP_010147130.1	Falb-XP_005049235.1	Fgla-XP_009583493.1	Gfor-XP_005415993.1	Ggal-NP_001038154.1	Gste-XP_009807896.1	Halb-XP_009924273.1	Lcor-XP_017693623.1	Ldis-XP_009949365.1	Lstr-XP_021411019.1	Mgal-XP_010711755.1	Muni-XP_010177246.1	Mvit-XP_008929353.1	Nmel-XP_021256512.1	Nnot-XP_010017555.1	Pade-XP_009331031.1	Pcar-XP_009500812.1	Pcri-XP_009482626.1	Pgut-XP_010071041.1	Plep-XP_010288743.1	Ppub-XP_009894798.1	Scam-XP_009682550.1	Svul-XP_014734940.1	Talb-XP_009963925.1	Tgut1-XP_012430417.1
bird_0_7	86.1436	Aaus-XP_013807775.1	Achr-XP_011575010.1	Acyg-XP_013043041.1	Afor-XP_009282931.1	Asin-XP_006022169.1	Brhi-XP_010133584.1	Cann-XP_008494156.1	Cjap-XP_015723633.1	Cliv-XP_021155179.1	Cmyd-XP_007059506.1	Cpel-XP_010005640.1	Cpug-XP_014809874.1	Cvoc-XP_009888484.1	Egar-XP_009641849.1	Falb-XP_005049226.1	Fche-XP_014133274.1	Fper-XP_013161015.1	Gfor-XP_005415992.1	Ggal-NP_989923.1	Halb-XP_009919268.1	Hleu-XP_010563140.1	Lcor-XP_017689443.1	Lstr-XP_021411142.1	Mgal-XP_003207543.1	Muni-XP_010177244.1	Mvit-XP_017927167.1	Nmel-XP_021253377.1	Nnip-XP_009460551.1	Ohoa-XP_009938109.1	Pade-XP_009330066.1	Ppub-XP_009894783.1	Scam-XP_009682548.1	Scan-XP_009086335.1	Svul-XP_014734964.1	Talb-XP_009961244.1	Tgut1-XP_002190143.1	Tgut2-XP_010222877.1	Zalb-XP_005484240.1
bird_0_8	62.6434	Aaus-XP_013808908.1	Achr-XP_011583470.1	Afor-XP_009288035.1	Apla-XP_021124881.1	Asin-XP_006037744.1	Cann-XP_008489460.1	Ccan-XP_009560717.1	Cjap-XP_015737946.1	Cmyd-XP_007067800.1	Cpel-XP_010003525.1	Cpor-XP_019397069.1	Cvoc-XP_009888283.1	Egar-XP_009644820.1	Falb-XP_005057807.1	Fper-XP_013155413.1	Gfor-XP_014165459.1	Ggal-XP_424580.5	Ggan-XP_019367434.1	Hleu-XP_010575587.1	Lcor-XP_017687439.1	Lstr-XP_021389475.1	Mgal-XP_010721334.2	Mund-XP_005145535.1	Mvit-XP_017934813.1	Nmel-XP_021272957.1	Ohoa-XP_009940847.1	Pcri-XP_009484245.1	Ppub-XP_009905124.1	Psin-XP_006132789.1	Scam-XP_009687912.1	Tgut1-XP_002190691.2	Tgut2-XP_010226293.1	Zalb-XP_014128778.1
bird_0_9	51.3829	Aaus-XP_013798363.1	Achr-XP_011592833.1	Afor-XP_009274713.1	Breg-XP_010307140.1	Ccan-XP_009560405.1	Ccri-XP_009703294.1	Cliv-XP_005515064.1	Cmac-XP_010125475.1	Cpug-XP_014808566.1	Cstr-XP_010193785.1	Cvoc-XP_009887176.1	Egar-XP_009643261.1	Fgla-XP_009573990.1	Fper-XP_005236179.1	Hleu-XP_010584226.1	Mnub-XP_008937577.1	Mund-XP_005139687.1	Muni-XP_010189744.1	Nmel-XP_021270990.1	Nnip-XP_009463110.1	Nnot-XP_010022076.1	Pade-XP_009321893.1	Pcar-XP_009511202.1	Pgut-XP_010085267.1	Plep-XP_010282153.1	Scam-XP_009676466.1	Talb-XP_009970325.1
bird_0_10	30.2185	Achr-XP_011573657.1	Acyg-XP_013056451.1	Afor-XP_009283935.1	Breg-XP_010306058.1	Ccan-XP_009567978.1	Cpel-XP_010000563.1	Cpug-XP_014814213.1	Cstr-XP_010197422.1	Cvoc-XP_009881756.1	Egar-XP_009632096.1	Fgla-XP_009572183.1	Fper-XP_005235802.1	Gste-XP_009807903.1	Halb-XP_009921492.1	Mgal-XP_003214157.1	Nmel-XP_021230580.1	Nnip-XP_009474633.1	Pade-XP_009333346.1	Talb-XP_009966940.1
bird_0_11	12.4519	Amis-XP_006268804.1	Asin-XP_006031223.1	Cmyd-XP_007067525.1	Cpic-XP_005313463.1	Cpor-XP_019412216.1	Ggan-XP_019356914.1	Psin-XP_006129195.1
bird_0_12	8.2432	Amis-XP_006267994.1	Asin-XP_006033845.1	Cmyd-XP_007054624.1	Cpic-XP_008173016.1	Cpor-XP_019402013.1	Psin-XP_014427356.1
bird_0_14	8.2059	Amis-XP_006263127.1	Asin-XP_006037942.1	Cpor-XP_019393144.1	Ggan-XP_019364464.1
bird_0_13_0	3.6958	Cmyd-XP_007070748.1	Cpic-XP_005278627.1	Psin-XP_006121821.1
bird_0_15	2.1732	Cmyd-XP_007072003.1	Psin-XP_014434339.1
bird_0_16	3.3456	Amis-XP_006272609.1	Asin-XP_006022170.1
bird_0_13_1	1.1275	Psin-XP_006121825.1

Use the group_by_cluster program (bundled with RD-MCL) to map the definition line from the original sequence file back onto the gene ids and print them out as a list:

$: group_by_cluster bird
bird_0_0_0
Acar-XP_010171751.1  PREDICTED: caspase-3 [Antrostomus carolinensis]
Achl-XP_009067452.1  PREDICTED: caspase-3 [Acanthisitta chloris]
Achr-XP_011577228.1  PREDICTED: caspase-3 [Aquila chrysaetos canadensis]
Acyg-XP_013035279.1  PREDICTED: caspase-3 [Anser cygnoides domesticus]
Afor-XP_009275633.1  PREDICTED: caspase-3 [Aptenodytes forsteri]
Amis-XP_006263939.1  PREDICTED: caspase-3 [Alligator mississippiensis]

......

bird_0_13_0
Cmyd-XP_007070748.1  PREDICTED: caspase-14-like [Chelonia mydas]
Cpic-XP_005278627.1  PREDICTED: caspase-14-like [Chrysemys picta bellii]
Psin-XP_006121821.1  PREDICTED: caspase-14-like [Pelodiscus sinensis]

bird_0_15
Cmyd-XP_007072003.1  PREDICTED: caspase-8-like [Chelonia mydas]
Psin-XP_014434339.1  PREDICTED: caspase-8-like [Pelodiscus sinensis]

bird_0_16
Amis-XP_006272609.1  PREDICTED: caspase-8 [Alligator mississippiensis]
Asin-XP_006022170.1  PREDICTED: caspase-8-like [Alligator sinensis]

bird_0_13_1
Psin-XP_006121825.1  PREDICTED: caspase-14-like [Pelodiscus sinensis]

This is an excellent way to get an initial feel for how clean the orthogroups are, how well these genes have been annotated in RefSeq, and whether there are any previously unrecognized orthogroups. For example, almost all of the bird caspase records are annotated and almost all of the annotations are sorted perfectly into the clusters returned by RD-MCL. Out of the 465 sequences analyzed, one is labelled as uncharacterized and four are labelled differently from the majority of sequences in their cluster. All of these are listed below:

bird_0_0_0 (putative caspase-3)
.....
Tery-XP_009988707.1  PREDICTED: caspase-3 [Tauraco erythrolophus]
Tgut1-XP_002191294.1  PREDICTED: caspase-3-like [Taeniopygia guttata]
Tgut1-XP_012428936.1  PREDICTED: uncharacterized protein LOC100222232 [Taeniopygia guttata]
Zalb-XP_005483896.1  PREDICTED: caspase-3 [Zonotrichia albicollis]

bird_0_7 (putative caspase-8)
.....
Falb-XP_005049226.1  PREDICTED: caspase-8-like [Ficedula albicollis]
Fche-XP_014133274.1  PREDICTED: caspase-8-like [Falco cherrug]
Fper-XP_013161015.1  PREDICTED: CASP8 and FADD-like apoptosis regulator [Falco peregrinus]
Gfor-XP_005415992.1  PREDICTED: caspase-8-like [Geospiza fortis]
Ggal-NP_989923.1  caspase-8 [Gallus gallus]
.....

bird_0_10 (putative caspase-3)
Achr-XP_011573657.1  PREDICTED: caspase-3-like [Aquila chrysaetos canadensis]
Acyg-XP_013056451.1  PREDICTED: caspase-10-like [Anser cygnoides domesticus]
Afor-XP_009283935.1  PREDICTED: caspase-3 [Aptenodytes forsteri]
Breg-XP_010306058.1  PREDICTED: caspase-3-like [Balearica regulorum gibbericeps]
.....

bird_0_11 (caspase-3/7?)
Amis-XP_006268804.1  PREDICTED: caspase-3 [Alligator mississippiensis]
Asin-XP_006031223.1  PREDICTED: caspase-3-like [Alligator sinensis]
Cmyd-XP_007067525.1  PREDICTED: caspase-3-like [Chelonia mydas]
Cpic-XP_005313463.1  PREDICTED: caspase-7-like [Chrysemys picta bellii]
Cpor-XP_019412216.1  PREDICTED: caspase-3-like [Crocodylus porosus]
Ggan-XP_019356914.1  PREDICTED: caspase-3-like [Gavialis gangeticus]
Psin-XP_006129195.1  PREDICTED: caspase-7-like [Pelodiscus sinensis]

For the sequences that appear misplaced, is it a failing of RD-MCL or are they misannotated in the public database? It's easy enough to test by creating a gene tree with the questionable sequence(s), a few sequences from the group it has been assigned to, and a few sequences from the group it is annotated as belonging to. The example below looks at the issue in bird_0_7, taking sequences from caspase-8 and CFLAR:

$: seqbuddy bird_casps.fa --pull_records 'Falb-XP_005049226' 'Fche-XP_014133274' 'Fper-XP_013161015' 'Gfor-XP_005415992' 'Ggal-NP_989923' 'Aaus-XP_013807673' 'Acar-XP_010167151' 'Achl-XP_009074486' 'Achr-XP_011574988' 'Avit-XP_009876193' | \
alignbuddy --generate_alignment clustalo | \
phylobuddy --generate_tree raxml

Clearly, the Falco peregrinus sequence has been assigned the wrong name in GenBank.

Similarly, to check on an uncharacterized sequence, include sequences from every cluster and determine which group it most closely groups with:

$: seqbuddy bird_casps.fa --pull_records Tgut1-XP_002191294 Tgut1-XP_012428936 Tgut1-XP_002197219 Tgut1-XP_002188318 Tgut1-XP_012430035 Tgut1-XP_002188263 Tgut1-XP_002192067 Tgut1-XP_012430417 Tgut1-XP_002190143 Tgut1-XP_002190691 Talb-XP_009970325 Talb-XP_009966940 | \
 alignbuddy --generate_alignment clustalo | \
 phylobuddy --generate_tree raxml

This result confirms that the uncharacterized sequence has been properly classified as caspase-3 by RD-MCL.

Step 7: Visualize RD-MCL results

To generate a nice visual representation of your RD-MCL results, use the included orthogroup_polytomy_builder program. It will convert the final_clusters.txt file into a nexus formatted tree file, combining each group into a color coded polytomy.

$: orthogroup_polytomy_builder bird/final_clusters.txt > bird/bird_casps.nex

Use your favorite phylogenetic tree visualization software to open up the file (I recommend FigTree). In the tree below, I have exported the image as SVG and annotated each cluster with the appropriate name in Adobe Illustrator.

Step 8: Piecing together evolutionary puzzles in bird caspases

It was easy to miss while scrolling through the list of accessions and gene descriptions, but visualizing the data in tree form and annotating each cluster relative to the names pulled from RefSeq makes it clear that multiple large clusters exist for both Caspase-8 and Caspase-3. What might be the most probable explanations for this?

  1. Was a single large orthogroup incorrectly split up by RD-MCL?
  2. Do the subdivided orthogroups represent 'recent' paralogs?
  3. Have the genes been incorrectly assigned in the database?

All three possibilities are fairly easy to test.

Hypothesis 1) RD-MCL is to blame

Our RD-MCL result will be supported if the members of each putative orthogroup forms a monophyletic clade on a maximum likelihood tree. Below, the seqs mode and --group flag of group_by_cluster are used to pull out the Casp3 and Casp8 orthogroups, plus Casp9 as an outgroup.

$: group_by_cluster bird --mode seqs --group bird_0_0_0 bird_0_6 bird_0_7 bird_0_8 bird_0_10 bird_0_11 bird_0_15 bird_0_16 | \
seqbuddy --order_recs_by_len | \
alignbuddy --generate_alignment clustalo "--pileup" | \
phylobuddy --generate_tree fasttree -o nexus > bird/casp_3_8_9.nex

Given how cleanly separated the groups are, it's pretty safe to say that RD-MCL is performing as expected.

Note: In this example, SeqBuddy was used to order all of the incoming records by sequence length from smallest to largest, then AlignBuddy called Clustal Omega with the --pileup option. 'Pileup' iteratively builds the alignment based on the order of the sequences provided, which ensures that all sequences are anchored to the smallest common domain used to define the protein family. This is also the default behavior of RD-MCL.

Hypothesis 2) The orthogroups are paralogous

Now that orthogroups are defined, the phylogenetic relationships among each of those groups can be analyzed. To do so, first create a consensus sequence for each cluster and then infer a tree from there. Again, group_by_cluster can help us out by converting each orthogroup into a consensus sequence; simply pass in the keyword cons:

$: group_by_cluster bird --mode cons | \
alignbuddy --generate_alignment clustalo | \
phylobuddy --generate_tree raxml -o nexus > bird/bird_consensus.nex

Working from consensus sequences cleans up much of the noise and greatly reduces runtime compared to inferring very large trees from hundreds of sequences. Two things pop out from the above tree: 1) The Casp8 orthogroups form a monophyletic clade, indicating that they can be considered 'recent' paralogs in birds, and 2) The Casp3 orthogroups are separated by Casp6 and Casp7, which suggests an issue with current nomenclature (see hypothesis 3, below).

Note: For this example the 'orphaned' sequences (i.e., clusters with only 1 or 2 records) in bird_0_15, bird_0_16, and bird_0_13_1 were included. We'll see later that more difficult cases can leave many orphans, so it is often helpful to set a minimum cluster size with the group_by_cluster --min_group_size flag.

Hypothesis 3) The genes have been incorrectly annotated in RefSeq

Clearly, there are two distinct orthogroups in birds that have been annotated as Casp8. This may not not technically be a case of misannotation, however, because the gene duplication leading to these two orthogroups may be restricted to Archelosauria (indeed, we'll see this is the case in the next section). It's confusing to not differentiate them, but not necessarily incorrect. The two Casp3 orthogroups, on the other hand, are absolutely an annotation error. Both Casp7 and Casp6 share more recent common ancestors with the bird_0_0_0 Casp3 than with the bird_0_10/11 Casp3.

Let's first alleviate any misgivings you may have about using consensus sequence phylogenies, which are not commonly used at present. Make a tree from all the sequences in the Casp3, 6, 7 neighborhood, using one of the Casp14 clusters as an outgroup:

$: group_by_cluster bird --mode seqs --group bird_0_0_0 bird_0_1 bird_0_3 bird_0_10 bird_0_11 bird_0_14 | \
seqbuddy --order_recs_by_len | \
alignbuddy --generate_alignment clustalo "--pileup" | \
phylobuddy --generate_tree fasttree -o nexus > bird/casp_3_6_7_14.nex

The branching pattern among orthogroups is the same (trust in the consensus!). So what might have caused such an egregious oversight in RefSeq? The 'problem' is that we're sequencing genomes at breakneck speed and there's a ton of evolution to sift through (not actually a bad problem to have, in my opinion). There isn't enough person-power at any one institution to tease apart all of the intricacies of every gene family, which is why it's now up to us in the community to perform detailed analyses on our pet proteins. But I digress — let's get back to what's up with Casp3.

The reason for this misnaming is simple: When the first bird genome was sequenced, its predicted gene models were BLASTed against reference organisms. The bird_0_10/11 orthogroup did not exist in those references, but it was still homologous to caspases. From our analysis here, we know that it is equally related to the orthogroups represented by bird_0_1 (Casp6), bird_0_0_0 (Casp3), and bird_0_3 (Casp7), but at the time, the top hit just happened to be Casp3. Therefore, it was named Casp3, even though it would have been only a marginally better hit than that for Casp6 or Casp7. The name was cemented in at this point, however, so the next time a bird_0_10/11 sequence was identified, of course its top hit was the faux-Casp3 from the first bird genome. I assure you, RefSeq is riddled with cases of misannotation like this.

Step 9: Piecing together deeper evolutionary puzzles in caspases

The bird caspases are a very easy group to work with because the orthogroups are so cleanly separated from one another. In fact, this tends to be true across most gene families, so I highly recommend including them in your analysis. The mammals also tend to generate rather clean clusters.

$: orthogroup_polytomy_builder mammal/final_clusters.txt > mammal/mammal_casps.nex

$: group_by_cluster mammal --mode cons | \
alignbuddy --generate_alignment clustalo | \
phylobuddy --generate_tree raxml -o nexus > mammal/mammal_consensus.nex

As mentioned earlier, including small clusters with orphaned sequences can clutter the tree, but keeping them in this mammal tree nicely illustrates that the method is working; all of the orphaned sequences have been pre-annotated in RefSeq and they all group closely with their larger consensus orthogroups (you can confirm this with $: group_by_cluster mammal/final_clusters.txt mammal_casps.fa). For reference, the size of each orthogroup is indicated in parentheses beside each gene name. Interestingly, this tree also supplies new information pertaining to the Casp8 orthogroups we observed in the bird analysis. Notice mamm_0_43; it contains a single orphaned record that has been annotated as Casp18. Have a quick look at its GenBank record:

$: seqbuddy refseq_casps.gb --pull_records 'NP_00110783'
LOCUS       NP_001107832             471 aa            linear   MAM 01-MAY-2016
DEFINITION  caspase-18 [Monodelphis domestica].
ACCESSION   NP_001107832
VERSION     NP_001107832.1
KEYWORDS    RefSeq.
SOURCE      Monodelphis domestica (gray short-tailed opossum)
  ORGANISM  Monodelphis domestica
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Metatheria; Didelphimorphia; Didelphidae; Monodelphis.
REFERENCE   1  (residues 1 to 471)
  AUTHORS   Eckhart,L., Ballaun,C., Hermann,M., VandeBerg,J.L., Sipos,W.,
            Uthman,A., Fischer,H. and Tschachler,E.
  TITLE     Identification of novel mammalian caspases reveals an important role
            of gene loss in shaping the human caspase repertoire
  JOURNAL   Mol. Biol. Evol. 25 (5), 831-841 (2008)
   PUBMED   18281271
FEATURES             Location/Qualifiers
     CDS             1..471
                     /gene="CASP18"
                     /coded_by="NM_001114360.1:1..1416"
                     /db_xref="GeneID:100017025"
ORIGIN
        1 mdeillqikd kltrsdlesl kflssdiipl qkqenitrsl affqalqkre lldnnsilee
       61 llfligrkdi ltkqlnvnie elkrklqhsd slkispyrkl lfkiaenmts dyldsakfll
      121 rrdlpqskle giktplklfi emekhgliek tnlkalkdil qslnaeclik hikeyekktk
      181 dlagsavtaq leshlghlsl lsgnagqaaa plnipeesem ntlpqsmppy kmehvphgyv
      241 viidnihfsn pvdvrigtek dvaalrkvfg rlqfkeeyhs nldasqlhev mkdyskrdyt
      301 dqdaficcil shgkkgvvlg tdwkpvaikk llsyftanec ktlkdkpklf fiqacqngks
      361 dslpevdvef dleadaicsn tthewsdifi gmatvedsla qrsgsigspy iqnlckelea
      421 hcpqkkelle imtsvnskvs niiqmpefrs tlrapfifqv peesqkapig h
//

In the publication referenced in the GenBank record, Eckhart et. al. explains that "...caspase-18 originated, presumably by duplication of caspase-8 or -10, after the split of the amphibian lineage from amniotes, was further conserved in birds and [monotreme] as well as [marsupial] mammals, but was deleted in a common ancestor of all placental (eutherian) mammals."

They classify the chicken 'initiator caspase' as Casp18, which is part of bird_0_6 (Casp8), so let's see if we can support their claim and classify all of the sequences in bird_0_6 as caspase-18. To do so, create a tree from all of the orthogroups in the evolutionary neighborhood of Casp-18 from both birds and mammals: Casps-8, -10, and CFLAR, with Casp-2 as an outgroup.

$: group_by_cluster mammal --mode seqs --group mamm_0_6 mamm_0_8 mamm_0_11 mamm_0_13 mamm_0_15 mamm_0_43 > Casp18.fa

$: group_by_cluster bird --mode seqs --group bird_0_2 bird_0_4 bird_0_5 bird_0_6 bird_0_7 bird_0_15 bird_0_16 >> Casp18.fa

$: alignbuddy Casp18.fa --generate_alignment clustalo | phylobuddy --generate_tree fasttree -o nexus > Casp18.nex

The Monodelphis domestica Casp18 sequence (red arrow) sits neatly before the split between alligator/turtle Casp8 sequences (blue labels) and the Casp8 sequences represented by bird_0_6. The other two Casp8 orthogroups, mamm_0_6 and bird_0_7, form a sister clade, so I would feel very confident renaming all sequences in bird_0_6 to Casp18. Eckhart carefully assessed genomic synteny information to piece together their evolutionary story regarding Casp8/18, but we are able to arrive at the same conclusions using a much less manually intensive sequence similarity based clustering approach. Furthermore, our high coverage of placental mammals indicates that Eckhart was correct to assert that the gene was lost from the last eutherian common ancestor.

Step 10: Merging clusters semi-manually

A dedicated wiki page digs further into how orphaned sequences can be utilized, but from here on we will focus on genes contained in larger, more highly conserved, orthogroups. Use the --min_size flag to strip out small clusters:

$: group_by_cluster mammal/final_clusters.txt mammal_casps.fa cons --min_size 5 | \
alignbuddy --generate_alignment clustalo | \
phylobuddy --generate_tree raxml -o nexus > mammal/mammal_consensus_min5.nex

Caspase-2 and CFLAR

The naming of genes in the Casp2 and CFLAF orthogroups are very consistent, but both have one large cluster and then a second smaller cluster. Let's have a look at all of these sequences to see if we can identify the cause of the splits (using Casp12 as an outgroup):

$: group_by_cluster mammal/final_clusters.txt mammal_casps.fa seqs --group mamm_0_11 mamm_0_12 mamm_0_13 mamm_0_14 mamm_0_15 | \
alignbuddy --generate_alignment clustalo | \
phylobuddy --generate_tree fasttree -o nexus > mammal/mamm_casps2_12_cflar.nex

Hmmm... Okay, the sequences in each cluster all seem to, well, cluster together on the tree. The Casp2s are both most closely related to one another, as are the CFLARs. Once again, we might be tempted to fault RD-MCL for breaking up groups that should otherwise be combined, but have a look at the actual sequence alignment for the two Casp2 groups.

$: group_by_cluster mammal/final_clusters.txt mammal_casps.fa seqs -g mamm_0_14_0 mamm_0_11 | \
alignbuddy -generate_alignment clustalo

All of the sequences in mamm_0_14_0 are missing a large chunk off of their C-terminal end. The taxonomic distribution of the group does not suggest an obvious evolutionary explanation (i.e., they are scattered throughout the distribution of mamm_0_11), so my guess is an issue with genome assembly/annotation. As an example, there is a large assembly gap following the annotated Leptonychotes weddellii Casp2 gene:

Given this obvious issue, I would personally leave these clusters separated, or even go so far as to remove the mamm_0_14_0 sequences completely from the dataset; if we suspect that the issue is technical and not real biology, there is little reason to pollute our carefully curated orthogroups with extra noise.

The CFLAR clusters do not show the same problem:

Nothing really looks amiss, except maybe the long C-terminal end of Jjac-XP_004653783.2. It's probably safe to merge mamm_0_13 and mamm_0_15 into a single orthogroup, but we can further test if it's appropriate using the merge_orthogroups tool.

$: merge_orthogroups mammal mamm_0_15

The best cluster that mamm_0_15 could be merged with has been identified as mamm_0_13, as we expected, and it's all green. We can go ahead and merge it by re-running merge_orthogroups with the --merge flag.

$: merge_orthogroups mammal mamm_0_15 --merge mamm_0_13

You will be prompted to confirm the merge, following which the final_clusters.txt file will be updated.

BEFORE MERGE:
mamm_0_13  63.7586 Anan-XP_021523112.1 Bmut-XP_005909895.2 Btau-NP_001012281.1 Cang-XP_011805399.1 Casi-XP_006862049.1
Ccri-XP_004674885.2 Chir-NP_001272571.1 Csyr-XP_021574597.1 Dord-XP_012867414.1 Eedw-XP_006889143.1 Efus-XP_008144686.1
Etel-XP_004701801.1 Hsap-NP_003870.4    Lwed-XP_006745969.1 Mput-XP_004764213.1 Nsch-XP_021559060.1 Oafe-XP_007933391.1
Opri-XP_012781194.1 Pabe-NP_001125140.1 Phod-XP_005980242.1 Sara-XP_004601467.1 Sscr-NP_001001628.1 Tman-XP_004378275.2

mamm_0_15  14.1775 Jjac-XP_004653783.2 Mcar-XP_021016969.1 Mmus-NP_997536.1    Moch-XP_013208909.1 Mpah-XP_021054181.1
Mung-XP_021507853.1 Rnor-NP_001029036.1

AFTER MERGE:
mamm_0_13  85.551  Anan-XP_021523112.1 Bmut-XP_005909895.2 Btau-NP_001012281.1 Cang-XP_011805399.1 Casi-XP_006862049.1
Ccri-XP_004674885.2 Chir-NP_001272571.1 Csyr-XP_021574597.1 Dord-XP_012867414.1 Eedw-XP_006889143.1 Efus-XP_008144686.1
Etel-XP_004701801.1 Hsap-NP_003870.4    Jjac-XP_004653783.2 Lwed-XP_006745969.1 Mcar-XP_021016969.1 Mmus-NP_997536.1
Moch-XP_013208909.1 Mpah-XP_021054181.1 Mput-XP_004764213.1 Mung-XP_021507853.1 Nsch-XP_021559060.1 Oafe-XP_007933391.1
Opri-XP_012781194.1 Pabe-NP_001125140.1 Phod-XP_005980242.1 Rnor-NP_001029036.1 Sara-XP_004601467.1 Sscr-NP_001001628.1
Tman-XP_004378275.2

Step 11: Placing new sequences

Now let's say you have a bunch of new putative caspase sequences, isolated from species with insufficient taxonomic coverage to generate high quality orthogroups on their own.

To check a new sequence that wasn't included in the original RD-MCL run, use the place_sequence program.

Example sequence: Ggar_casc.fa

>Ggar-xyz123foo Casc containing sequence from spotted garbargal of gar [Garbargal garususus]
MQTNSVHEVFEQDHITWQNRLRESKGTLADGSKNLLQSLPILGTDIRQIDALMLELMVRN
VEVQFKQQTSEADIFALSSCIFLAKGENPYVYSYDDKEKLQVVWKLLVGKNMKQLVGKYR
AGSEACPGLKDQVVIAQADIVDSAVSSANTAKVKKQPLYALKCGIARQLKPENVRGFLGK
RSNVLGSWWLSMANEFLHRVPMAVPFAEVLLVWARCAAHRNNNISHDTHAICITRMCKFP
LILLRLLFFNTDLK
$ place_sequence Ggar_casc.fa bird

Output:

Placement of sequence(s) in RD-MCL clusters

R² between clusters: 0.000 - 0.259
R² within clusters: 0.925 - 1.000

FwdScore between clusters: 9.938 - 236.542
FwdScore within clusters: 289.924 - 766.872

Testing Ggar-xyz123foo
Group        Size  R² 95%-CI        Forward 95%-CI
bird_0_1    47    0.5124 - 0.5459  59.71 - 77.03
bird_0_3    45    0.0108 - 0.0156  5.56  - 9.06
bird_0_0_0  56    0.0019 - 0.0073  -0.72 - 4.65
bird_0_7    38    0.0075 - 0.0173  -6.44 - 2.54
bird_0_4    45    0.0096 - 0.0157  -6.46 - 2.43
bird_0_11   7     0.0010 - 0.0069  -6.76 - -3.83
bird_0_6    42    0.0131 - 0.0203  -7.49 - -2.62
bird_0_5    42    0.0157 - 0.0198  -7.68 - -6.42
bird_0_8    33    0.0098 - 0.0200  -8.66 - -6.14
bird_0_12   6     0.0092 - 0.0180  -8.89 - -7.41
bird_0_2    46    0.0192 - 0.0262  -9.21 - -5.59
bird_0_10   19    0.0072 - 0.0135  -9.44 - -5.2
bird_0_9    27    0.0105 - 0.0157  -9.48 - -6.51

This sequence is really divergent .

Still working on this tutorial

Papers using the paralogs of chicken Casp3 and Casp8 that have been annotated as such in GenBank, but make no mention of the other two (labelled as 'uncharacterized protein' and 'initiator caspase', respectively).

Step 10: Fish orthogroups

Step 9: Assign the sequences not included in RD-MCL to the correct orthogroups