# Subset phylogeny with more complex models

To elucidate the position of leptophytes (also referred to as NEW, from before the lineage had a name), we selected a subset of the more complete MAGs and references (107 taxa in total) so that we could apply more complex mixture models to our dataset of 93 genes. 

In [1]:
# Check if python is 3.10.5
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__


print(sys.version)
%load_ext autoreload
%autoreload 2

3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]


In [2]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

with open(PATH_FILE, "r") as handle:
    paths_dict = json.load(handle)

## 1. Get dataset

Our taxon selection consisted mostly of red plastids, with glaucophytes and Viridiplantae as outgroups. We chose not to include cyanobacteria in an effort to reduce compositional heterogeneity (plastid genomes tend to be much more AT-rich than cyanobacterial genomes). 

Here, we generate our dataset as a subset of the V2 concatenated phylogeny. 

In [9]:
## V2 full dataset
DATASET_FULL = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["DATASET"]["V2"]

# Read_genes
GENE_LIST = paths_dict['DATABASES']['GENE_LISTS']['GENES_93']

## Output folder for V14 subset dataset
DATASET = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["DATASET"]["V14"]

In [None]:
%%bash -s "$GENE_LIST" "$DATASET_FULL" "$DATASET"

# Read gene names line by line and extract the selected taxa for each gene
while read -r gene
do
    seqkit grep -f "$3/extract.txt" "$2/$gene.fasta" > "$3/$gene.fasta"
done < "$1"

Get stats!

In [None]:
%%bash -s "$DATASET"

/home/mahja/ptMAGs/src/get_stats.sh "$1"

## 2. Run prequal 

We remove sequence stretches with no clear homology using Prequal. We used a posterior probability threshold of 0.95.

In [14]:
# Folder with extracted gene dataset
DATASET = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["DATASET"]["V14"]

# Output folder for prequal files
PREQUAL = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["PREQUAL"]

In [None]:
%%bash -s "$DATASET" "$PREQUAL"

for i in "$1"/*fasta
do 
    sbatch ../../uppmax_scripts/script_bin/job_prequal.sh "$i" "$2"
    sleep 1
done

## 3. Align

We align the genes with mafft-ginsi using the --unalign 0.6 option to avoid over-alignment.

In [4]:
from gene_iterator import GeneIterator

In [17]:
# Folder with prequal-filtered gene dataset
PREQUAL = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["PREQUAL"]

# Read_genes
GENE_LIST = paths_dict['DATABASES']['GENE_LISTS']['GENES_93']
genes = list(map(lambda x : x.strip(), open(GENE_LIST, "r").readlines()))

# Directory for mafft output
MAFFT_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["MAFFT"]

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["MAFFTLOG"]

In [None]:
gi = GeneIterator(PREQUAL, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_mafft(MAFFT_DIR, SLURMLOG)

## 4. Trim

We trim the alignments with BMGE (BLOSUM35 matrix, filter columns > 80% gaps).
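
The gap criterion can be illustrated with a minimal Python sketch. It covers only the gap filter (columns with more than 80% gaps are dropped) and not BMGE's entropy-based trimming with BLOSUM35; the function name `drop_gappy_columns` and the toy alignment are illustrative assumptions, not part of the pipeline.

```python
def drop_gappy_columns(seqs, max_gap_fraction=0.8):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    seqs: list of equal-length aligned sequences (strings).
    Illustrates the gap filter only, not BMGE's entropy-based trimming.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    n_taxa = len(seqs)
    keep = [
        j for j in range(len(seqs[0]))
        if sum(s[j] == "-" for s in seqs) / n_taxa <= max_gap_fraction
    ]
    return ["".join(s[j] for j in keep) for s in seqs]


# Toy alignment: column 2 is all gaps (removed), column 4 is 75% gaps (kept).
toy = ["A-C-", "A-C-", "A-C-", "A-CT"]
print(drop_gappy_columns(toy))
```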

In [3]:
# Folder containing aligned fasta files
MAFFT_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["MAFFT"]

# Read_genes
GENE_LIST = paths_dict['DATABASES']['GENE_LISTS']['GENES_93']
genes = list(map(lambda x : x.strip(), open(GENE_LIST, "r").readlines()))

# Directory for BMGE output
BMGE_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["BMGE"]

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["BMGELOG"]

In [None]:
gi = GeneIterator(MAFFT_DIR, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_bmge(BMGE_DIR, SLURMLOG, MAFFT_DIR)

## 5. Concatenate

In [3]:
# Directory for aligned and trimmed fasta files
BMGE_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["BMGE"]

## Output directory for fasta files
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

In [None]:
%%bash -s "$BMGE_DIR" "$CONCAT_DIR"

files=("$1"/*fasta)

perl /home/mahja/ptMAGs/src/cat_fasta.pl -f "${files[@]}" > "$2"/concat_107t_93g_prequal_ginsi_bmge.fasta
mv partitions.txt "$2"/partitions_prequal_ginsi_bmge.txt

I manually edited the header names slightly. 
- Removed the * after New
- Changed ZHAN22 to Haptophyte from New

The alignment has 107 taxa, 93 genes, and 20,292 positions. 
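
These dimensions can be checked directly from the concatenated FASTA. The sketch below is a small dependency-free parser; the function name `alignment_dimensions` and the toy records are illustrative assumptions and not part of the pipeline.

```python
def alignment_dimensions(fasta_text):
    """Return (n_taxa, n_positions) for an aligned FASTA given as a string."""
    seqs = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            if name in seqs:
                raise ValueError(f"duplicate header: {name}")
            seqs[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            seqs[name].append(line)
    lengths = {len("".join(parts)) for parts in seqs.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return len(seqs), (lengths.pop() if lengths else 0)


# Toy example with wrapped sequence lines.
toy = ">taxon_A\nMKV-L\nAA\n>taxon_B\nMRVTL\nA-\n"
print(alignment_dimensions(toy))
```

For the real file, the text would be read from `concat_107t_93g_prequal_ginsi_bmge.fasta`, and the expected result is 107 taxa and 20,292 positions.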

## 6. Run cpREV+C60+G tree

We started by running a tree with the cpREV+C60+G model (the best fit model as assessed by model fit tests on preliminary phylogenies). This phylogeny will be used as the guide tree for calculating the parameters of the MEOW models. 

In [8]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Output directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/concat_107t_93g_prequal_ginsi_bmge_cpREV-C60-G

The resulting tree recovered leptophytes as sister to cryptophytes and haptophytes (62% ufb support for the monophyly of cryptophytes and haptophytes). It also recovered the complex plastids as monophyletic (88% ufb support).

## 7. Run CAT-GTR tree

Three chains of PhyloBayes were started on the lab cluster. We convert the fasta file to a phylip file first.

In [11]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

In [None]:
%%bash -s "$CONCAT_DIR"

perl ../../src/fasta2phylip.pl -f "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta -o "$1"/concat_107t_93g_prequal_ginsi_bmge.phy

cat "$1"/concat_107t_93g_prequal_ginsi_bmge.phy | tr '=' '_' > phylip
mv phylip "$1"/concat_107t_93g_prequal_ginsi_bmge.phy

The following command was used to run PhyloBayes (with three chains).

In [None]:
mpirun -np 10 pb_mpi -d concat_107t_93g_prequal_ginsi_bmge.phy -cat -gtr -dgam 4 -dc concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1

We then ran posterior predictive tests in PhyloBayes. 

*Compositional heterogeneity.*  
The test simulates alignments based on model parameter configurations sampled from the chains (chain 2 in this case) and then performs a posterior predictive test of compositional homogeneity. The test statistic is the maximum squared deviation between global and taxon-specific empirical frequencies. The observed value is computed on the true data and compared with its null posterior predictive distribution. The observed maximum heterogeneity was an order of magnitude higher than the simulated values and significantly different (p value = 0). The same was true for the mean heterogeneity.

*Diversity test.*  
The diversity (div) test works similarly: it simulates alignments based on the model parameters sampled from chain 2. The test statistic is the mean diversity per site (the mean number of distinct amino acids per site). This statistic is also not captured sufficiently well by the model.
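
To make the two test statistics concrete, here is a minimal Python sketch that computes them on a toy alignment. It assumes the squared deviation is summed over the 20 amino acids for each taxon before taking the maximum across taxa; the exact PhyloBayes implementation may differ in detail, and the function names are illustrative only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequencies(residues):
    """Empirical amino-acid frequencies, ignoring gaps and ambiguous states."""
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no amino acids found")
    return [counts[a] / total for a in AMINO_ACIDS]


def max_compositional_deviation(seqs):
    """Maximum over taxa of the squared deviation from the global frequencies."""
    global_freq = frequencies("".join(seqs))
    return max(
        sum((t - g) ** 2 for t, g in zip(frequencies(s), global_freq))
        for s in seqs
    )


def mean_site_diversity(seqs):
    """Mean number of distinct amino acids per alignment column."""
    per_site = [len({r for r in col if r in AMINO_ACIDS}) for col in zip(*seqs)]
    return sum(per_site) / len(per_site)


toy = ["AAAA", "AAAA", "AACC"]
print(max_compositional_deviation(toy), mean_site_diversity(toy))
```

In the posterior predictive test, the same statistics are computed on each simulated alignment, and the observed value is compared with that simulated distribution.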

In [None]:
mpirun -np 20 /opt/pbmpi-master_1.9/data/readpb_mpi -ppred -comp -x 2000 1 9000 concat_107t_93g_prequal_ginsi_bmge_chain2 > concat_107t_93g_prequal_ginsi_bmge_chain2_comp.out
mpirun -np 20 /opt/pbmpi-master_1.9/data/readpb_mpi -ppred -div -x 2000 1 9000 concat_107t_93g_prequal_ginsi_bmge_chain2 > concat_107t_93g_prequal_ginsi_bmge_chain2_div.out

## 8. Run tree with CATPMSF model

We use the approach of [Szánthó et al 2023](https://doi.org/10.1093/sysbio/syad013) to run the CATPMSF model. In the paper, the authors infer an ML tree using a site-homogeneous model and use it as a fixed tree for the PhyloBayes analysis (using CATGTR). Here, we generate a tree under the LG+G model to use as the fixed tree for the CATGTR analysis. The results of the PhyloBayes analysis are then used to extract site-specific stationary distributions and exchangeabilities.

### 8.1. Run tree with site-homogeneous model

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_LG-G

### 8.2. Use as guide tree to run PhyloBayes

In [4]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.phy "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_LG-G.treefile "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1
sleep 1
sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.phy "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_LG-G.treefile "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain2

### 8.3. Check for convergence

The chains were run for more than 5,000 cycles after which we checked for convergence. 

In [3]:
## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_pb-convergence.sh "$1"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr

In [4]:
%%bash -s "$TREE_DIR"

cat "$1"/catpmsf/tracecomp.contdiff

name                effsize	rel_diff

loglik              35		1.6552
length              113		0.0624932
alpha               330		0.525332
Nmode               146		0.227368
statent             118		3.53681
statalpha           110		0.210322
rrent               104		0.979222
rrmean              3525		0.0110885


The effective sample sizes are above 100 for 7 out of 8 variables when setting burn-in to 1800 cycles. While this is not ideal, running the chains for longer is unlikely to improve things. I therefore decided to proceed as is (plus [Szánthó et al 2023](https://doi.org/10.1093/sysbio/syad013) failed to get convergence in all the empirical datasets they tested).
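
The count of variables passing the threshold can be reproduced from the `tracecomp` output above. The following Python sketch parses the table as printed and lists the variables whose effective sample size falls below a chosen cut-off; the function names and the cut-off of 100 as a parameter are illustrative.

```python
# tracecomp.contdiff content as shown in the cell output above.
TRACECOMP = """\
name                effsize rel_diff

loglik              35      1.6552
length              113     0.0624932
alpha               330     0.525332
Nmode               146     0.227368
statent             118     3.53681
statalpha           110     0.210322
rrent               104     0.979222
rrmean              3525    0.0110885
"""


def parse_tracecomp(text):
    """Return {variable: (effsize, rel_diff)} from tracecomp output."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "name":
            continue  # skip blank lines and the header
        rows[fields[0]] = (int(fields[1]), float(fields[2]))
    return rows


def low_effsize(rows, min_effsize=100):
    """Names of variables whose effective sample size is below min_effsize."""
    return [name for name, (ess, _) in rows.items() if ess < min_effsize]


rows = parse_tracecomp(TRACECOMP)
print(len(rows) - len(low_effsize(rows)), "of", len(rows), "variables pass")
print("below threshold:", low_effsize(rows))
```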

### 8.4. Export and convert to site distributions and exchangeabilities

In [None]:
%%bash -s "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_sitedists_exchangeabilities.sh "$1"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1

The generated exchangeabilities file is in paml format, so I manually turned it into a nexus file which is required by iqtree v2. 

### 8.5. Run tree

In [6]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

Run tree with CAT-PMSF model and 100 non-parametric bootstraps.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_CATPMSF_fbp "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.exchangeabilities "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq

Submitted batch job 9730163 on cluster snowy


I was also curious about the LG-PMSF model so I ran that with 1000 ufb.

In [10]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_2023_11_16_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_LGPMSF "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq

Submitted batch job 9730164 on cluster snowy


## 9. Estimate alternative topologies with CAT-PMSF model

We want to assess several alternative hypotheses relating to (1) the position of NEW, and (2) the monophyly of complex plastids. The branches in question have the lowest statistical support in all analyses run so far. We set up the following constraints: 

1. NEW sister to (haptophytes + cryptophytes), complex plastids monophyletic   
2. NEW sister to (haptophytes + cryptophytes), complex plastids non-monophyletic    
3. NEW sister to cryptophytes, complex plastids monophyletic   
4. NEW sister to cryptophytes, complex plastids non-monophyletic   
5. NEW sister to haptophytes, complex plastids monophyletic  
6. NEW sister to haptophytes, complex plastids non-monophyletic  

For each constraint, we set up 5 independent searches to check that we were estimating the best ML tree.

In [7]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_catpmsf_new-sister-c_complex-mono "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.exchangeabilities "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq "$2"/constraints/new-sister-c_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_catpmsf_new-sister-c_complex-non-mono "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.exchangeabilities "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq "$2"/constraints/new-sister-c_complex-non-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_catpmsf_new-sister-h_complex-mono "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.exchangeabilities "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq "$2"/constraints/new-sister-h_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_catpmsf_new-sister-h_complex-non-mono "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.exchangeabilities "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq "$2"/constraints/new-sister-h_complex-non-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_catpmsf_new-sister-h-c_complex-mono "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.exchangeabilities "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq "$2"/constraints/new-sister-h-c_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/catpmsf/concat_107t_93g_catpmsf_new-sister-h-c_complex-non-mono "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.exchangeabilities "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq "$2"/constraints/new-sister-h-c_complex-non-mono.tre

We can now check whether the AU test rejects any of the topologies under the CAT-PMSF model. Let's concatenate the trees we want to test first.

In [8]:
%%bash -s "$TREE_DIR"

cat "$1"/catpmsf/concat_107t_93g_catpmsf_new-sister-h-c_complex-mono.treefile \
    "$1"/catpmsf/concat_107t_93g_catpmsf_new-sister-h-c_complex-non-mono.treefile \
    "$1"/catpmsf/concat_107t_93g_catpmsf_new-sister-c_complex-mono.treefile \
    "$1"/catpmsf/concat_107t_93g_catpmsf_new-sister-h_complex-mono.treefile \
    "$1"/catpmsf/concat_107t_93g_catpmsf_new-sister-h_complex-non-mono.treefile \
    "$1"/catpmsf/concat_107t_93g_catpmsf_new-sister-c_complex-non-mono.treefile \
    > "$1"/catpmsf/au_test.trees

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_topology_test.sh \
    "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta \
    "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_au-test \
    "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.exchangeabilities \
    "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_catgtr_chain1.sitefreq \
    "$2"/catpmsf/concat_107t_93g_prequal_ginsi_bmge_CATPMSF_fbp.treefile \
    "$2"/catpmsf/au_test.trees

Submitted batch job 9735351 on cluster snowy


## 10. Estimate site profiles with MEOW 
I will now estimate site profiles using [MEOW](https://github.com/jdaneau/pm), which is an extension of MAMMaL. The difference between MAMMaL and MEOW is that while MAMMaL only uses high rate sites for estimating custom site profiles, MEOW uses all sites in the input alignment for estimating custom site profiles. 

Here, we will follow [Williamson et al 2024](https://doi.org/10.1101/2024.09.04.611237) for using MEOW.

### 10.1. Get phylip file for MEOW input

MEOW requires a phylip file with 10 character headers as input. So we replace the fasta headers of the V14 concat file, and convert to phylip format.
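
For reference, the classic strict PHYLIP layout pads each name to exactly 10 characters and follows it immediately with the sequence. The Python sketch below writes that layout from a dictionary of already shortened names. It is an illustration only: the pipeline itself uses `seqkit` and `fasta2phylip.pl`, and the exact layout MEOW accepts (for example, whether a separating space is tolerated) is an assumption that should be checked against the actual input file.

```python
def to_strict_phylip(records, name_width=10):
    """Format {short_name: aligned_sequence} as strict PHYLIP text.

    Names longer than name_width or containing whitespace are rejected
    rather than silently truncated, so that tip labels stay unique.
    """
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    for name in records:
        if len(name) > name_width:
            raise ValueError(f"name longer than {name_width} characters: {name}")
        if any(ch.isspace() for ch in name):
            raise ValueError(f"name contains whitespace: {name!r}")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        lines.append(f"{name:<{name_width}}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical short names, for illustration only.
print(to_strict_phylip({"tax1": "ACGT", "taxon00002": "AC-T"}), end="")
```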

In [5]:
## Directory for V14 concat fasta files
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

In [None]:
%%bash -s "$CONCAT_DIR"

## replace fasta headers
seqkit replace -p '^(\S+)$' -r '{kv}$2' \
-k "$1"/replace_headers.txt "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta \
> "$1"/concat_107t_93g_prequal_ginsi_bmge_replaced.fasta

In [None]:
%%bash -s "$CONCAT_DIR"

## convert to phylip
perl ../../src/fasta2phylip.pl -f "$1"/concat_107t_93g_prequal_ginsi_bmge_replaced.fasta \
-o "$1"/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip

## replace '=' with '_' so it will be consistent with the tree in the next step
cat "$1"/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip | \
tr '=' '_' \
> phylip

mv phylip "$1"/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip

### 10.2. Get tree for MEOW input
We will use the tree inferred with the V14 concat phylogeny and the cpREV+C60+G model as a guide tree for inferring the custom 80 site frequency profiles with MEOW. We now need to replace the tip labels of this tree so that it is consistent with the input phylip file.

In [5]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [11]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

## copy the phylip file to the meow directory 
cp "$1"/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip "$2"/meow/.

perl ../../src/replace_tip_labels.pl "$2"/concat_107t_93g_prequal_ginsi_bmge_cpREV-C60-G.treefile \
<(cat "$1"/replace_headers.txt | tr '=' '_') \
"$2"/meow/concat_107t_93g_prequal_ginsi_bmge_cpREV-C60-G_replaced.treefile

### 10.3. Run MEOW

We run MEOW. Following Williamson et al 2024, we exclude invariant sites from consideration and use a guide tree (inferred with a C60 mixture model) to help estimate the rates of the variable sites with the Discrete Gamma Probability Estimate (DGPE) method of site rate estimation ([Susko et al 2003](https://doi.org/10.1080/10635150390235395)).

In [12]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

Starting first with MEOW(80,0).

In [None]:
%%bash -s "$TREE_DIR"

meow -s "$1"/meow/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    -ch 80 \
    -cl 0 \
    -t "$1"/meow/concat_107t_93g_prequal_ginsi_bmge_cpREV-C60-G_replaced.treefile \
    -ri \
    -p R \
    -C 5 \
    -f H \
    -o "$1"/meow/meow_80_0

Then MEOW(60,20). After initially getting an error ("Error in solve.QP(Dmat = Sigma * 2, dvec = rep(0, ntaxa), Amat = Amat,  : matrix D in quadratic function is not positive definite!"), I used the `-l` flag, which disables likelihood weights for the low-rate partition. This flag is available in a custom version of the script kindly provided by Hector Baños (included in the `src` directory).

In [None]:
%%bash -s "$TREE_DIR"

meow_custom.R -s "$1"/meow/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    -ch 60 \
    -cl 20 \
    -t "$1"/meow/concat_107t_93g_prequal_ginsi_bmge_cpREV-C60-G_replaced.treefile \
    -ri \
    -p R \
    -C 5 \
    -f H \
    -l \
    -o "$1"/meow/meow_60_20

Finally, MEOW(40,40).

In [None]:
%%bash -s "$TREE_DIR"

meow_custom.R -s "$1"/meow/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    -ch 40 \
    -cl 40 \
    -t "$1"/meow/concat_107t_93g_prequal_ginsi_bmge_cpREV-C60-G_replaced.treefile \
    -ri \
    -p R \
    -C 5 \
    -f H \
    -l \
    -o "$1"/meow/meow_40_40

### 10.4. IQTree Model Test

I was interested in trying a regular model test in IQTree using the BIC criterion. I'll be testing the following models:  
  
- cpREV+MEOW(40,40)+G   
- cpREV+MEOW(60,20)+G   
- cpREV+MEOW(80,0)+G     
- LG+MEOW(40,40)+G    
- LG+MEOW(60,20)+G    
- LG+MEOW(80,0)+G   

There is a small caveat of using BIC. According to Hector Baños: "Since the classes are estimated from the data, these are also parameters, nonetheless, these are not accounted for in model finder and even if you do account for them they were not estimated under the likelihood estimation.  This is why you cannot compare  LG+C60+G vs LG+MAM60+G using a model finder (and we do cross-validation). Without giving it much thought, I think if you only compare  MEOW(60,0), MEOW(40,20), and MEOW(30,30) then you can get a proxy since all these models estimate the classes similarly so AIC and BIC can give you a good estimation."

So, I use BIC to get a reasonable estimation of the best model since I'm testing models with 80 classes. 
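
As a reminder of what is being compared, BIC is computed as -2 lnL + k ln(n), where k is the number of free parameters and n is the number of alignment sites. The Python sketch below ranks candidate models by BIC. The log-likelihoods and parameter counts are hypothetical placeholders, not results from this analysis; because the estimated class profiles are not counted in k, the comparison is only meaningful among models with the same number of classes, as noted above.

```python
import math


def bic(log_likelihood, n_free_params, n_sites):
    """Bayesian information criterion; lower values indicate better fit."""
    return -2.0 * log_likelihood + n_free_params * math.log(n_sites)


def rank_by_bic(models, n_sites):
    """models: {name: (log_likelihood, n_free_params)}. Best model first."""
    scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in models.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical log-likelihoods and parameter counts, for illustration only.
example = {
    "LG+MEOW(80,0)+G": (-1_000_500.0, 212),
    "LG+MEOW(60,20)+G": (-1_000_000.0, 212),
    "LG+MEOW(40,40)+G": (-1_000_200.0, 212),
}

# 20,292 is the number of positions in the concatenated alignment.
for name, score in rank_by_bic(example, n_sites=20292):
    print(f"{name}\t{score:.1f}")
```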

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/meow/concat_107t_93g_prequal_ginsi_bmge_modeltest "$2"/meow/meow.nex

The best fit model was LG+MEOW(60,20)+G. We now run a tree with that model. 

## 11. Run tree with MEOW(60,20) model

The best fit model! We run an unconstrained tree and estimate support with 1000 ufb.

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/meow/concat_107t_93g_prequal_ginsi_bmge_LG-MEOW6020-G "$2"/meow/meow.nex

The topology recovered from this tree was exactly the same as the LG+C60+G tree: (1) NEW sister to cryptophytes and haptophytes (61% ufb support for the monophyly of cryptophytes and haptophytes), and (2) complex plastids monophyletic (but with lower support this time, 68%).

## 12. Estimate alternative topologies with LG+MEOW(60,20)+G model

We want to assess several alternative hypotheses relating to (1) the position of NEW, and (2) the monophyly of complex plastids. The branches in question have the lowest statistical support in all analyses run so far. We set up the following constraints: 

1. NEW sister to (haptophytes + cryptophytes), complex plastids monophyletic   
2. NEW sister to (haptophytes + cryptophytes), complex plastids non-monophyletic    
3. NEW sister to cryptophytes, complex plastids monophyletic   
4. NEW sister to cryptophytes, complex plastids non-monophyletic   
5. NEW sister to haptophytes, complex plastids monophyletic  
6. NEW sister to haptophytes, complex plastids non-monophyletic  

For each constraint, we set up 10 independent searches, as [Liu et al 2024](https://doi.org/10.1093/sysbio/syae031) showed that the number of tree searches considerably influences the identification of phylogenies with the highest log-likelihood scores.
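
Selecting the best of the replicate searches amounts to comparing their final log-likelihoods. The sketch below does this in Python, assuming each IQ-TREE log contains a line of the form `BEST SCORE FOUND : <lnL>`; that pattern should be checked against the IQ-TREE version in use. The run labels and log snippets are hypothetical.

```python
import re

# Assumed pattern of the final score line in an IQ-TREE log file.
BEST_SCORE = re.compile(r"BEST SCORE FOUND\s*:\s*(-?\d+(?:\.\d+)?)")


def best_score(log_text):
    """Extract the final log-likelihood from the text of one log file."""
    match = BEST_SCORE.search(log_text)
    if match is None:
        raise ValueError("no 'BEST SCORE FOUND' line in log")
    return float(match.group(1))


def pick_best_run(logs):
    """logs: {run_label: log_text}. Return (label, lnL) with the highest lnL."""
    scores = {label: best_score(text) for label, text in logs.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical log snippets for three replicate searches of one constraint.
toy_logs = {
    "run_1": "...\nBEST SCORE FOUND : -1000123.456\n...",
    "run_2": "...\nBEST SCORE FOUND : -1000098.120\n...",
    "run_3": "...\nBEST SCORE FOUND : -1000110.900\n...",
}
print(pick_best_run(toy_logs))
```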

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono_2 "$2"/meow/meow.nex "$2"/constraints/new-sister-c_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_2 "$2"/meow/meow.nex "$2"/constraints/new-sister-c_complex-non-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_2 "$2"/meow/meow.nex "$2"/constraints/new-sister-h_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono_2 "$2"/meow/meow.nex "$2"/constraints/new-sister-h_complex-non-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono_2 "$2"/meow/meow.nex "$2"/constraints/new-sister-h-c_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono_2 "$2"/meow/meow.nex "$2"/constraints/new-sister-h-c_complex-non-mono.tre

## 13. AU test of constrained trees inferred with LG+MEOW(60,20) model

We perform the Approximately Unbiased (AU) test to see whether any of the following topologies is rejected:

1. NEW sister to (haptophytes + cryptophytes), complex plastids monophyletic   
2. NEW sister to (haptophytes + cryptophytes), complex plastids non-monophyletic    
3. NEW sister to cryptophytes, complex plastids monophyletic   
4. NEW sister to cryptophytes, complex plastids non-monophyletic   
5. NEW sister to haptophytes, complex plastids monophyletic  
6. NEW sister to haptophytes, complex plastids non-monophyletic 

We first concatenate the trees that we want to test.

In [5]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [4]:
%%bash -s "$TREE_DIR"

cat "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.treefile \
    "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono.treefile \
    "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono.treefile \
    "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_2.treefile \
    "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono.treefile \
    "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_2.treefile \
    > "$1"/meow/au_test.trees

We set up the AU test! Here we estimate model parameters based on the "best" ML tree reconstructed previously.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_topology_test.sh \
    "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta \
    "$2"/meow/concat_107t_93g_prequal_ginsi_bmge_au-test \
    "$2"/meow/meow.nex \
    "$2"/meow/concat_107t_93g_prequal_ginsi_bmge_LG-MEOW6020-G.treefile \
    "$2"/meow/au_test.trees

Submitted batch job 9730443 on cluster snowy


None of the topologies was rejected (unsurprisingly). Perhaps the guide tree used to estimate parameters is super important? I use the best scoring constrained tree (which had a higher log-likelihood than the unconstrained tree used before) found so far to run the AU test. 

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_topology_test.sh \
    "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta \
    "$2"/meow/concat_107t_93g_prequal_ginsi_bmge_au-test2 \
    "$2"/meow/meow.nex \
    "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.treefile \
    "$2"/meow/au_test.trees

Submitted batch job 9735347 on cluster snowy


## 14. Removing compositionally heterogeneous sites

The previous models (CAT-GTR, MEOW models) assume that amino acid frequencies are similar across all branches of the tree. However, this assumption might not hold, and serious violations can mislead phylogenetic inference. This may be the case here, as our dataset includes Cyanidiales, which are known extremophiles. 

### 14.1. Investigating compositional heterogeneity
We first need to calculate the amino acid composition for each taxon. This is easily done using the tool [nRCFV_Reader](https://github.com/JFFleming/RCFV_Reader) ([Fleming and Struck 2023](https://doi.org/10.1186/s12859-023-05270-8)).

In [5]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

perl ../../src/RCFVReader_v1.pl protein "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/comp_het/concat_107t_93g_aa_comp

In [None]:
%%bash -s "$TREE_DIR"

cat "$1"/comp_het/concat_107t_93g_aa_comp.ncsRCFV.txt

We see that amino acids I, V, A, and N have the highest normalised relative compositional frequency variability. 
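
For intuition, the character-specific quantity can be sketched in Python. This shows the un-normalised relative composition frequency variability (for each amino acid, the mean absolute deviation of taxon frequencies from the across-taxon mean), not the normalised ncsRCFV that nRCFV_Reader reports, which applies additional normalisation. The function name and the toy frequencies are illustrative assumptions.

```python
def character_specific_rcfv(freqs):
    """Un-normalised character-specific RCFV.

    freqs: {taxon: {amino_acid: frequency}}.
    For each amino acid, returns the sum over taxa of
    |taxon frequency - mean frequency|, divided by the number of taxa.
    """
    taxa = list(freqs)
    n_taxa = len(taxa)
    chars = sorted({c for per_taxon in freqs.values() for c in per_taxon})
    result = {}
    for c in chars:
        mean = sum(freqs[t].get(c, 0.0) for t in taxa) / n_taxa
        result[c] = sum(abs(freqs[t].get(c, 0.0) - mean) for t in taxa) / n_taxa
    return result


# Toy frequencies for two taxa and two amino acids.
toy = {
    "taxon_A": {"A": 0.6, "I": 0.4},
    "taxon_B": {"A": 0.4, "I": 0.6},
}
cs_rcfv = character_specific_rcfv(toy)
print(cs_rcfv, "total RCFV:", sum(cs_rcfv.values()))
```

Amino acids with the largest values contribute most to the overall compositional heterogeneity.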

The tool also generates a frequency file to which we can add taxonomic affiliation.

In [None]:
%%bash -s "$TREE_DIR"

cat "$1"/comp_het/concat_107t_93g_aa_comp.Frequencies.txt | \
    sed -E 's/(.*Ochrophyta.*)/\1\tOchrophyta/' | \
    sed -E 's/(.*Haptophyta.*)/\1\tHaptophyta/' | \
    sed -E 's/(.*Cryptop.*)/\1\tCryptophyta/' | \
    sed -E 's/(.*Glauco.*)/\1\tGlaucophyta/' | \
    sed -E 's/(.*Cyanidiales.*)/\1\tCyanidiophytina/' | \
    sed -E 's/(.*Galdieriales.*)/\1\tCyanidiophytina/' | \
    sed -E 's/(.*Florideo.*)/\1\tRhodophytina/' | \
    sed -E 's/(.*Compsopogono.*)/\1\tRhodophytina/' | \
    sed -E 's/(.*Porphyridiales.*)/\1\tRhodophytina/' | \
    sed -E 's/(.*Rhodello.*)/\1\tRhodophytina/' | \
    sed -E 's/(.*Bangiales.*)/\1\tRhodophytina/' | \
    sed -E 's/(.*Stylonema.*)/\1\tRhodophytina/' | \
    sed -E 's/(.*Viridiplantae.*)/\1\tViridiplantae/' | \
    sed -E 's/(.*New.*)/\1\tNEW/' | \
    grep -v "Mean_Freq_Across_Taxa" | \
    sed -E 's/(NAME.*)/\1\tGroup/' > "$1"/comp_het/concat_107_93g_aa_freq_group.txt

The table was analysed in the R script `PCA_aa_usage.R`. 

### 14.2. Removing compositionally heterogeneous sites and inferring a phylogeny

Here, we trim the concatenated alignment with BMGE's stationary-based trimming. This method trims the alignment until the remaining characters are compositionally homogeneous, as assessed by Stuart’s test of marginal homogeneity between each pair of sequences.

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

In [None]:
%%bash -s "$CONCAT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_bmge.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta

The final alignment has 107 taxa and 15,754 sites (i.e. roughly 5,000 sites were removed). What is the new nRCFV score?

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

perl ../../src/RCFVReader_v1.pl protein "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/concat_107t_93g_comp-rem_aa_comp

The nRCFV is much lower now, 0.00112250162175195 compared to 0.00311920771907216 for the untreated dataset.
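
As a quick check of the size of this change, the relative reduction can be computed directly from the two values quoted above (roughly 64%). The variable names are only illustrative.

```python
nrcfv_before = 0.00311920771907216  # untreated alignment
nrcfv_after = 0.00112250162175195   # after stationary-based trimming

# Fraction of the original nRCFV removed by trimming.
reduction = 1 - nrcfv_after / nrcfv_before
print(f"nRCFV reduced by {reduction:.1%}")
```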

I decided to infer a tree with the LG+C60+G model. 

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/concat_107t_93g_prequal_ginsi_bmge_bmge_LG-C60-G

We also infer a tree under the LG+MEOW80+G model.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/stationary_trimming/concat_107t_93g_prequal_ginsi_bmge_bmge_LG-MEOW6020-G "$2"/meow/meow.nex

### 14.3 Inferring constrained trees

As before, we can infer constrained trees under the LG+MEOW80+G model and compare their likelihoods.

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono "$2"/meow/meow.nex "$2"/constraints/new-sister-c_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono "$2"/meow/meow.nex "$2"/constraints/new-sister-c_complex-non-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono "$2"/meow/meow.nex "$2"/constraints/new-sister-h_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono "$2"/meow/meow.nex "$2"/constraints/new-sister-h_complex-non-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono "$2"/meow/meow.nex "$2"/constraints/new-sister-h-c_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono "$2"/meow/meow.nex "$2"/constraints/new-sister-h-c_complex-non-mono.tre
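
The six submissions differ only in the constraint name, so they could also be generated programmatically. The sketch below only builds the command lines and does not submit anything; `constrained_jobs` is a hypothetical helper and the directory arguments in the usage example are placeholders.

```python
from itertools import product

JOB_SCRIPT = "../../uppmax_scripts/script_bin/job_iqtree.sh"
ALIGNMENT = "concat_107t_93g_prequal_ginsi_bmge.bmge.fasta"


def constrained_jobs(concat_dir, tree_dir):
    """Build the six sbatch command lines without submitting them."""
    jobs = []
    for sister, state in product(("c", "h", "h-c"), ("mono", "non-mono")):
        name = f"new-sister-{sister}_complex-{state}"
        out_dir = f"{tree_dir}/comp_het/stationary_trimming"
        jobs.append([
            "sbatch", JOB_SCRIPT,
            f"{concat_dir}/{ALIGNMENT}",
            f"{out_dir}/concat_107t_93g_LG-MEOW6020-G_{name}",
            f"{tree_dir}/meow/meow.nex",
            f"{tree_dir}/constraints/{name}.tre",
        ])
    return jobs


# Placeholder directories; in the notebook these would be CONCAT_DIR and TREE_DIR.
for job in constrained_jobs("CONCAT_DIR", "TREE_DIR"):
    print(" ".join(job))
```

Each list could be passed to `subprocess.run` to submit the job, with a short pause between submissions as in the cell above.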

### 14.4 Topology test

We perform the approximately unbiased (AU) test to see which topologies can be statistically rejected. First, we concatenate the trees we are testing.

In [9]:
%%bash -s "$TREE_DIR"

cat "$1"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono.treefile \
    "$1"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono.treefile \
    "$1"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.treefile \
    "$1"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono.treefile \
    "$1"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono.treefile \
    "$1"/comp_het/stationary_trimming/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono.treefile \
    > "$1"/comp_het/stationary_trimming/au_test.trees


We estimate parameters from the unconstrained ML tree.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_topology_test.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta "$2"/comp_het/stationary_trimming/AU_test "$2"/meow/meow.nex "$2"/comp_het/stationary_trimming/concat_107t_93g_prequal_ginsi_bmge_bmge_LG-MEOW6020-G.treefile "$2"/comp_het/stationary_trimming/au_test.trees

### 14.5 CAT-GTR tree

We run a Bayesian phylogenetic inference in PhyloBayes. 

Convert alignment to phylip format first.

In [None]:
%%bash -s "$CONCAT_DIR"

perl ../../src/fasta2phylip.pl -f "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.fasta -o "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.phy

cat "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.phy | tr '=' '_' > phylip
mv phylip "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.phy
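
For readers without the Perl script, the conversion and the `=` to `_` replacement can be sketched in Python. This is an assumption-laden illustration, not a port of `fasta2phylip.pl`: it writes relaxed sequential PHYLIP, keeps only the first word of each header, and the function name is hypothetical.

```python
def fasta_to_phylip(fasta_text):
    """Convert aligned FASTA text to relaxed sequential PHYLIP text."""
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep the first word of the header and replace '=' with '_'.
            name = line[1:].split()[0].replace("=", "_")
            records[name] = []
        elif name is not None:
            records[name].append(line)

    seqs = {n: "".join(parts) for n, parts in records.items()}
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned to the same length")
    header = f"{len(seqs)} {lengths.pop()}"
    return "\n".join([header] + [f"{n} {s}" for n, s in seqs.items()]) + "\n"


# Toy input (not from our dataset).
print(fasta_to_phylip(">taxon=1\nAC\nGT\n>taxon=2\nACGT\n"))
```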

Run three chains of PhyloBayes.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.phy "$2"/comp_het/stationary_trimming/concat_107t_93g_prequal_ginsi_bmge.bmge_catgtr_chain1
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.phy "$2"/comp_het/stationary_trimming/concat_107t_93g_prequal_ginsi_bmge.bmge_catgtr_chain2
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.bmge.phy "$2"/comp_het/stationary_trimming/concat_107t_93g_prequal_ginsi_bmge.bmge_catgtr_chain3


## 15. Modelling compositional heterogeneity with the GFmix model

The GFmix model modifies the vector of amino acid frequencies in a branch-specific manner to account for shifts in the relative frequencies of amino acids in different branches of the tree. While Williamson et al 2024 used five partitions, we use a single partition. All of our genes are plastid-encoded, so we expect less discrepancy in compositional heterogeneity than in the dataset of Williamson et al 2024, which included both mitochondrial and nuclear-encoded genes. Here, we estimate the likelihood of the six ML trees, corresponding to different topologies, under the GFmix model.

GFmix expects the sequence file to be in phylip format with names being 10 characters long, which should correspond to the names in the treefile. Let's fulfill these requirements. We already have the sequence file: `concat_107t_93g_prequal_ginsi_bmge_replaced.phylip`. Let's convert the names of the six treefiles we inferred earlier. 
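
Before running GFmix it can be worth verifying these requirements. The following sketch is one way to do so under stated assumptions: the PHYLIP file is sequential with whitespace between name and sequence, tip labels are unquoted, and all function names are hypothetical.

```python
import re


def phylip_names(phylip_text):
    """Taxon names from sequential PHYLIP text (first field of each row)."""
    rows = [line for line in phylip_text.splitlines()[1:] if line.strip()]
    return [row.split()[0] for row in rows]


def newick_tips(newick):
    """Tip labels of a Newick string: labels that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def check_names(phylip_text, newick, max_len=10):
    """Return a list of problems; an empty list means the files agree."""
    names, tips = phylip_names(phylip_text), newick_tips(newick)
    problems = [f"name longer than {max_len}: {n}" for n in names if len(n) > max_len]
    if len(set(names)) != len(names):
        problems.append("duplicate names in alignment")
    if set(names) != set(tips):
        problems.append(f"mismatch: {sorted(set(names) ^ set(tips))}")
    return problems


# Toy inputs (not from our dataset).
phy = "2 4\ntaxon00001 ACGT\ntaxon00002 ACGT\n"
print(check_names(phy, "(taxon00001:0.1,taxon00002:0.2);"))
```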

In [6]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

## copy the phylip file to the tree directory 
cp "$1"/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip "$2"/comp_het/gfmix/.

## replace tip labels of trees inferred with LG+MEOW80+G model
perl ../../src/replace_tip_labels.pl "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.treefile \
<(cat "$1"/replace_headers.txt | tr '=' '_') \
"$2"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono_replaced.treefile

perl ../../src/replace_tip_labels.pl "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono.treefile \
<(cat "$1"/replace_headers.txt | tr '=' '_') \
"$2"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono_replaced.treefile

perl ../../src/replace_tip_labels.pl "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono.treefile \
<(cat "$1"/replace_headers.txt | tr '=' '_') \
"$2"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono_replaced.treefile

perl ../../src/replace_tip_labels.pl "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_2.treefile \
<(cat "$1"/replace_headers.txt | tr '=' '_') \
"$2"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_replaced.treefile

perl ../../src/replace_tip_labels.pl "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono.treefile \
<(cat "$1"/replace_headers.txt | tr '=' '_') \
"$2"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono_replaced.treefile

perl ../../src/replace_tip_labels.pl "$2"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_2.treefile \
<(cat "$1"/replace_headers.txt | tr '=' '_') \
"$2"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_replaced.treefile

Also copy the IQTree files.

In [8]:
%%bash -s "$TREE_DIR"

## copy the .iqtree reports of trees inferred with LG+MEOW80+G model
cp "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.iqtree \
"$1"/comp_het/gfmix/.

cp "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono.iqtree \
"$1"/comp_het/gfmix/.

cp "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono.iqtree \
"$1"/comp_het/gfmix/.

cp "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_2.iqtree \
"$1"/comp_het/gfmix/.

cp "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono.iqtree \
"$1"/comp_het/gfmix/.

cp "$1"/meow/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_2.iqtree \
"$1"/comp_het/gfmix/.

### 15.1 Binomial test to determine groups of enriched/depleted amino acids

We now estimate the groups of amino acids that are depleted and enriched in different lineages from our data. We don't have any a priori knowledge of how groups might differ (e.g. as is known in the case of halophiles vs non-halophiles), nor do we know which amino acids might be involved in composition bias. 

We now use an R script (from Williamson et al 2024, kindly provided by Charley McCarthey) that:

1) reads in the alignment file and tabulates the counts of all 20 amino acids,
2) performs a chi-squared test on that table, and uses the residuals from that test to build a hierarchically clustered tree of taxa based on UPGMA,
3) cuts the tree into two groups, and uses them to run the binomial test for our alignment (the groups are also printed to screen),
4) prints the results for each amino acid to screen or to an output file,
5) presents the results as a three-column table of amino acid, Z-score, and GFmix class assignment based on Z-score.
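
To illustrate the quantity that drives the clustering in step 2, the sketch below computes Pearson residuals for a table of amino acid counts. It covers only that one step, it is not the code of `quickBinomial.R`, and the function name and toy counts are hypothetical.

```python
from math import sqrt


def pearson_residuals(counts):
    """Pearson residuals for a taxa x amino-acid table of counts.

    counts: {taxon: {amino_acid: count}}; returns the same nested shape.
    """
    columns = sorted({aa for row in counts.values() for aa in row})
    row_totals = {t: sum(row.get(aa, 0) for aa in columns) for t, row in counts.items()}
    col_totals = {aa: sum(row.get(aa, 0) for row in counts.values()) for aa in columns}
    grand = sum(row_totals.values())

    residuals = {}
    for taxon, row in counts.items():
        residuals[taxon] = {}
        for aa in columns:
            # Expected count under independence of taxon and amino acid.
            expected = row_totals[taxon] * col_totals[aa] / grand
            residuals[taxon][aa] = (row.get(aa, 0) - expected) / sqrt(expected)
    return residuals


# Toy counts (not from our dataset).
toy_counts = {"t1": {"A": 10, "V": 10}, "t2": {"A": 20, "V": 0}}
print(pearson_residuals(toy_counts))
```

Taxa with similar residual profiles have similar compositional biases, which is what a hierarchical clustering of these rows would group together.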

In [4]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

Rscript ../../src/quickBinomial.R -aln "$1"/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip -fmt phylip -out "$2"/comp_het/gfmix/bionimal_results.txt

The two groups of taxa were found to be Cyanidiococcus-yangmingshanensis-accession_NC-051883 and Cyanidioschyzon-merolae-accession_NC-004799, vs. everything else. 

The groups of amino acids that are enriched/depleted in the two groups of lineages are as follows:

G-class: I, N, F, K, T, D, S, E   
F-class: R, C, V, W, H, M, A, Q

### 15.2 Run gfmix

Run the gfmix_custombins model! (An updated version should be available via the official release on Ed Susko's website. The version used for our analyses is provided in the `src` folder).

In [None]:
%%bash -s "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_gfmix.sh "$1"/comp_het/gfmix/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono_replaced.treefile \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono.iqtree \
    "$1"/comp_het/gfmix/meow_60_20.aafreq.dat \
    "$1"/comp_het/gfmix/glauco.rootfile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_gfmix.sh "$1"/comp_het/gfmix/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_replaced.treefile \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_2.iqtree \
    "$1"/comp_het/gfmix/meow_60_20.aafreq.dat \
    "$1"/comp_het/gfmix/glauco.rootfile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_gfmix.sh "$1"/comp_het/gfmix/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono_replaced.treefile \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.iqtree \
    "$1"/comp_het/gfmix/meow_60_20.aafreq.dat \
    "$1"/comp_het/gfmix/glauco.rootfile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_gfmix.sh "$1"/comp_het/gfmix/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono_replaced.treefile \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono.iqtree \
    "$1"/comp_het/gfmix/meow_60_20.aafreq.dat \
    "$1"/comp_het/gfmix/glauco.rootfile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_gfmix.sh "$1"/comp_het/gfmix/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_replaced.treefile \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_2.iqtree \
    "$1"/comp_het/gfmix/meow_60_20.aafreq.dat \
    "$1"/comp_het/gfmix/glauco.rootfile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_gfmix.sh "$1"/comp_het/gfmix/concat_107t_93g_prequal_ginsi_bmge_replaced.phylip \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono_replaced.treefile \
    "$1"/comp_het/gfmix/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono.iqtree \
    "$1"/comp_het/gfmix/meow_60_20.aafreq.dat \
    "$1"/comp_het/gfmix/glauco.rootfile
sleep 1



## 16. Modelling heterotachy with the GHOST model

All models used so far assume that the substitution rate for each site is constant across all lineages. However, we know that substitution rates can vary across lineages. To accommodate rate variation across lineages, we use the GHOST model (General Heterogeneous evolution On a Single Topology; [Crotty et al 2019](https://doi.org/10.1093/sysbio/syz051)). GHOST is a mixture model composed of several site classes, each having a separate set of model parameters and edge lengths on the same tree topology.

We follow [Williamson et al 2024](https://doi.org/10.1101/2024.09.04.611237) and calculate the likelihoods of the six topologies under the GHOST model. We will evaluate each topology with 4, 6, and 8 linked heterotachy classes (LG+MEOW80+H4/6/8).

In [4]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

Note that the IQTree commands include a "-optlen BFGS" flag. I asked Kelsey Williamson about it, and she relayed her communication with Minh:

"This is only for technical reason, if you want to know: the EM (expectation maximization) algorithm is used to estimate branch lengths for +H mixture model. Currently it does not work in the combination of two mixture models +C60 and +H model. The BFGS algorithm is an alternative, and it works for this combination.

IQ-TREE could have switched internally, but we intentionally print this message to remind us that we can implement the EM algorithm for this case."


### 16.1. Four rate categories

Estimate likelihoods with 4 linked heterotachy classes.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_4_new-sister-c_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_4_new-sister-c_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_2.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_4_new-sister-h-c_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_4_new-sister-h-c_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_4_new-sister-h_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_2.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_4_new-sister-h_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono.treefile


### 16.2. Six rate categories

Estimate likelihoods with 6 linked heterotachy classes.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_6_new-sister-c_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_6_new-sister-c_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_2.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_6_new-sister-h-c_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_6_new-sister-h-c_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_6_new-sister-h_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_2.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_6_new-sister-h_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono.treefile


### 16.3. Eight rate categories

Estimate likelihoods with 8 linked heterotachy classes.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_8_new-sister-c_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_8_new-sister-c_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-c_complex-non-mono_2.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_8_new-sister-h-c_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_8_new-sister-h-c_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h-c_complex-non-mono.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_8_new-sister-h_complex-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-mono_2.treefile
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_8_new-sister-h_complex-non-mono "$2"/meow/meow.nex "$2"/ghost/concat_107t_93g_LG-MEOW6020-G_new-sister-h_complex-non-mono.treefile


### 16.4. Model test

But which GHOST model fits the best?

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_ghost.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_model_test "$2"/meow/meow.nex
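
Model fit is typically compared with information criteria such as AIC and BIC. The sketch below shows how these are computed from a log-likelihood and a parameter count; it is a generic illustration, the helper names are hypothetical, and the numbers in the usage example are made up and are not results from our analyses.

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion (lower is better)."""
    return n_params * log(n_sites) - 2 * log_likelihood


def best_model(fits, n_sites):
    """fits: {model_name: (log_likelihood, n_params)}; returns the lowest-BIC name."""
    return min(fits, key=lambda m: bic(fits[m][0], fits[m][1], n_sites))


# Made-up values for illustration only.
example_fits = {"model_a": (-100.0, 5), "model_b": (-90.0, 50)}
print(best_model(example_fits, n_sites=100))
```

BIC penalises additional parameters more strongly than AIC when the alignment is long, which matters when comparing models with many heterotachy classes.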

### 16.5. AU test

GHOST with 8 categories fits the best. Let's do an AU test of the trees. We concatenate the 6 different trees to test.

In [6]:
%%bash -s "$TREE_DIR"

cat "$1"/ghost/ghost_8_new-sister-c_complex-mono.treefile \
    "$1"/ghost/ghost_8_new-sister-c_complex-non-mono.treefile \
    "$1"/ghost/ghost_8_new-sister-h-c_complex-mono.treefile \
    "$1"/ghost/ghost_8_new-sister-h-c_complex-non-mono.treefile \
    "$1"/ghost/ghost_8_new-sister-h_complex-mono.treefile \
    "$1"/ghost/ghost_8_new-sister-h_complex-non-mono.treefile \
    > "$1"/ghost/ghost_8_au_test.trees

We will use the best scoring tree (ghost_8_new-sister-h-c_complex-mono.treefile) to estimate parameters for the topology tests.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_topology_test.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta "$2"/ghost/ghost_8_AU_test "$2"/meow/meow.nex "$2"/ghost/ghost_8_new-sister-h-c_complex-mono.treefile "$2"/ghost/ghost_8_au_test.trees

## 17. Fast taxa removal

We now test the impact of fast-evolving taxa on the resulting topology by removing the fastest-evolving taxa. We use a custom Python script to identify them as the taxa with the largest root-to-tip distances.

In [18]:
## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [20]:
%%bash -s "$TREE_DIR"

python ../../src/root_to_tip_distances.py "$1"/meow/concat_107t_93g_prequal_ginsi_bmge_LG-MEOW6020-G.treefile Glaucocystophyceae "$1"/meow/bls_result.txt
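
The idea behind this step can be sketched with a small Newick parser. This is not `root_to_tip_distances.py`: the sketch assumes the tree is already rooted on the outgroup, that labels are unquoted, and the function name is hypothetical.

```python
import re


def root_to_tip_distances(newick):
    """Root-to-tip path length for every tip of a rooted Newick string."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack = [[]]     # one list of child clades per currently open parenthesis
    last = None      # clade closed most recently: {tip: distance to that node}
    previous = None
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            last = {}
            for child in stack.pop():
                last.update(child)
            stack[-1].append(last)
        elif token not in (",", ";"):
            label, _, length = token.partition(":")
            length = float(length) if length.strip() else 0.0
            if previous == ")":
                # Internal branch: lengthen the path of every tip below it.
                for tip in last:
                    last[tip] += length
            else:
                stack[-1].append({label.strip(): length})
        previous = token

    distances = {}
    for clade in stack[0]:
        distances.update(clade)
    return distances


tree = "((A:1,B:2):3,C:4);"  # toy tree, not from our data
distances = root_to_tip_distances(tree)
longest = sorted(distances, key=distances.get, reverse=True)[:10]
print(longest)
```

Sorting the tips by distance and keeping the top entries gives the list of candidate long-branch taxa.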

We decided to remove the 10 taxa with the longest root to tip distances.

In [21]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

cat "$2"/meow/bls_result.txt | tr ' ' '\t' | sort -nk2 | tail | cut -f1 | sed -E 's/taxo_/taxo=/' | sed -E 's/mag_/mag=/' | sed -E 's/accession_/accession=/' | sed -E 's/taxonomy_/taxonomy=/' > "$1"/lb_taxa.list

seqkit grep -f "$1"/lb_taxa.list "$1"/concat_107t_93g_prequal_ginsi_bmge.fasta -v > "$1"/concat_97t_93g_prequal_ginsi_bmge_nLB.fasta

A quick test (not shown) showed that the nRCFV value was now 0.00307385430785046.

Infer a tree from the newly generated alignment under the MEOW(60,20) model.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_97t_93g_prequal_ginsi_bmge_nLB.fasta "$2"/meow/concat_97t_93g_prequal_ginsi_bmge_LG-MEOW6020-G_nLB "$2"/meow/meow.nex

## 18. Recoding amino acids (SR4 recoding)

The results of the GFmix model indicate that our phylogenetic inference is impacted by compositional heterogeneity. We now try recoding our dataset to reduce the impact of compositional heterogeneity. We opted for SR4 recoding. This was done using the PhyloFisher script `aa_recoder.py` ([Tice et al 2021](https://doi.org/10.1371/journal.pbio.3001365)).  

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
!aa_recoder.py -i concat_107t_93g_prequal_ginsi_bmge.fasta -o concat_107t_93g_prequal_ginsi_bmge.SR4 -re SR4
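
For reference, SR4 recoding collapses the 20 amino acids into four groups (AGNPST, CHWY, DEKQR, FILMV). The sketch below illustrates the mapping; it is not the PhyloFisher implementation, and the output symbols (A, C, G, T) and the treatment of gaps and unknown characters are assumptions made for this example.

```python
# SR4 groups; the one-letter codes assigned to each group are an assumption.
SR4_GROUPS = {"A": "AGNPST", "C": "CHWY", "G": "DEKQR", "T": "FILMV"}
SR4_MAP = {aa: code for code, group in SR4_GROUPS.items() for aa in group}


def recode_sr4(sequence):
    """Recode a protein sequence into four SR4 states.

    Gaps and any character outside the 20 amino acids become '-'.
    """
    return "".join(SR4_MAP.get(aa, "-") for aa in sequence.upper())


print(recode_sr4("MKVLAG-X"))  # toy sequence, not from our dataset
```

Because each group pools amino acids that frequently replace one another, recoding discards part of the compositional signal at the cost of some phylogenetic information.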

The nRCFV value is now at its lowest: 0.000975221142460481.

We can now set up two trees! A Bayesian inference under CAT-GTR, and a Maximum Likelihood tree using a recoded version of the C60 model obtained from: https://github.com/xgrau/recoded-mixture-models/blob/master/recoded_models/xmC60SR4.nex


### 18.1 CAT-GTR tree

Convert alignment to phylip format first.

In [None]:
%%bash -s "$CONCAT_DIR"

perl ../../src/fasta2phylip.pl -f "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.fas -o "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.phy

cat "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.phy | tr '=' '_' > phylip
mv phylip "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.phy

Run three chains of PhyloBayes.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.phy "$2"/recoded_sr4/concat_107t_93g_prequal_ginsi_bmge.SR4_catgtr_chain1
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.phy "$2"/recoded_sr4/concat_107t_93g_prequal_ginsi_bmge.SR4_catgtr_chain2
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.phy "$2"/recoded_sr4/concat_107t_93g_prequal_ginsi_bmge.SR4_catgtr_chain3


We checked convergence with bpcomp. 
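
The key summary reported by bpcomp is `maxdiff`, the largest discrepancy in bipartition frequency between chains. The sketch below shows how such a value is defined; the function name and toy frequencies are hypothetical, and in practice the value is read from the bpcomp output rather than recomputed.

```python
def maxdiff(chain_frequencies):
    """Largest across-chain discrepancy in bipartition posterior frequency.

    chain_frequencies: list of {bipartition: frequency}, one dict per chain.
    A bipartition missing from a chain counts as frequency 0.
    """
    bipartitions = set().union(*chain_frequencies)
    return max(
        max(c.get(b, 0.0) for c in chain_frequencies)
        - min(c.get(b, 0.0) for c in chain_frequencies)
        for b in bipartitions
    )


# Toy frequencies for three chains (not from our runs).
chains = [{"AB|CD": 1.0, "AC|BD": 0.5}, {"AB|CD": 1.0, "AC|BD": 0.75}, {"AB|CD": 1.0}]
print(maxdiff(chains))
```

As a rule of thumb, the PhyloBayes manual treats a maxdiff below 0.1 as a good run and below 0.3 as acceptable.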

### 18.2. Constrained ML trees

We now infer constrained trees with the recoded C60 model in an ML framework (GTR+C60+G).

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.fas "$2"/recoded_sr4/concat_107t_93g_sr4_LG-C60-G_new-sister-c_complex-mono "$2"/recoded_sr4/xmC60SR4.nex "$2"/constraints/new-sister-c_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.fas "$2"/recoded_sr4/concat_107t_93g_sr4_LG-C60-G_new-sister-c_complex-non-mono "$2"/recoded_sr4/xmC60SR4.nex "$2"/constraints/new-sister-c_complex-non-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.fas "$2"/recoded_sr4/concat_107t_93g_sr4_LG-C60-G_new-sister-h_complex-mono "$2"/recoded_sr4/xmC60SR4.nex "$2"/constraints/new-sister-h_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.fas "$2"/recoded_sr4/concat_107t_93g_sr4_LG-C60-G_new-sister-h_complex-non-mono "$2"/recoded_sr4/xmC60SR4.nex "$2"/constraints/new-sister-h_complex-non-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.fas "$2"/recoded_sr4/concat_107t_93g_sr4_LG-C60-G_new-sister-h-c_complex-mono "$2"/recoded_sr4/xmC60SR4.nex "$2"/constraints/new-sister-h-c_complex-mono.tre
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_93g_prequal_ginsi_bmge.SR4.fas "$2"/recoded_sr4/concat_107t_93g_sr4_LG-C60-G_new-sister-h-c_complex-non-mono "$2"/recoded_sr4/xmC60SR4.nex "$2"/constraints/new-sister-h-c_complex-non-mono.tre

## 19. Check the effect of alternative gene sets

Previous studies frequently used the "slow evolving" and "fast evolving" gene sets as defined by [Janouskovec et al 2010](https://doi.org/10.1073/pnas.1003335107). In fact, these were the gene sets used for our preliminary analyses. Here, as requested by a reviewer, we infer phylogenies of this 107-taxon set based on the 32- and 66-gene sets. Note that Janouskovec et al have 34 (not 32) genes in the slow set and 68 genes in the fast set. The discrepancy arises because we excluded ascF (too few taxa) and psbH (slightly weird gene tree).


### 19.1 32 genes (slow evolving genes)

We don't need to align or trim the sequences as we are taking just a subset of the genes. 

In [17]:
# Read_genes
GENE_LIST = paths_dict['DATABASES']['GENE_LISTS']['GENES_32']

# Directory for aligned and trimmed fasta files
BMGE_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["BMGE"]

## Output directory for fasta files
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

Copy the relevant files to a directory.

In [21]:
%%bash -s "$GENE_LIST" "$BMGE_DIR"

cat "$1" | while read line
    do cp "$2"/"$line".fasta "$2"/32g/.
done


Concatenate the genes!

In [22]:
%%bash -s "$BMGE_DIR" "$CONCAT_DIR"

files=("$1"/32g/*fasta)

perl /home/mahja/beta-Cyclocitral/src/cat_fasta.pl -f "${files[@]}" > "$2"/concat_107t_32g_prequal_ginsi_bmge.fasta
mv partitions.txt "$2"/partitions_prequal_ginsi_bmge_32g.txt

The alignment has 107 taxa, 32 genes, and 7,921 sites. 

I manually edited the header names slightly. 
- Removed the * after New
- Changed ZHAN22 to Haptophyte from New

#### 19.1.1. Run cpREV+C60+G tree

We ran a tree with the cpREV+C60+G model. 

In [10]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Output directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_32g_prequal_ginsi_bmge.fasta "$2"/concat_107t_32g_prequal_ginsi_bmge_cpREV-C60-G

#### 19.1.2. Run CAT-GTR tree
Convert alignment to phylip format. 

In [None]:
%%bash -s "$CONCAT_DIR"

perl ../../src/fasta2phylip.pl -f "$1"/concat_107t_32g_prequal_ginsi_bmge.fasta -o "$1"/concat_107t_32g_prequal_ginsi_bmge.phy

cat "$1"/concat_107t_32g_prequal_ginsi_bmge.phy | tr '=' '_' > phylip
mv phylip "$1"/concat_107t_32g_prequal_ginsi_bmge.phy

Run three chains of PhyloBayes. 

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_32g_prequal_ginsi_bmge.phy "$2"/gene_subsets/concat_107t_32g_prequal_ginsi_bmge_catgtr_chain1
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_32g_prequal_ginsi_bmge.phy "$2"/gene_subsets/concat_107t_32g_prequal_ginsi_bmge_catgtr_chain2
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_32g_prequal_ginsi_bmge.phy "$2"/gene_subsets/concat_107t_32g_prequal_ginsi_bmge_catgtr_chain3


In [None]:
%%bash -s "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/gene_subsets/concat_107t_32g_prequal_ginsi_bmge_catgtr_chain1
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/gene_subsets/concat_107t_32g_prequal_ginsi_bmge_catgtr_chain2
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/gene_subsets/concat_107t_32g_prequal_ginsi_bmge_catgtr_chain3


### 19.2 66 genes (slow+fast evolving genes)

We don't need to align or trim the sequences as we are taking just a subset of the genes. 

In [27]:
# Read_genes
GENE_LIST = paths_dict['DATABASES']['GENE_LISTS']['GENES_66']

# Directory for aligned and trimmed fasta files
BMGE_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["BMGE"]

## Output directory for fasta files
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

Copy files to relevant directory.

In [29]:
%%bash -s "$GENE_LIST" "$BMGE_DIR"

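# Make sure the target directory exists, then copy each listed gene alignment into it
mkdir -p "$2"/66g
while read -r line
do
    cp "$2"/"$line".fasta "$2"/66g/
done < "$1"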
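
The copy step above can also be sketched in Python, which makes it easy to report genes from the list that have no alignment file. This is a minimal sketch, assuming the gene list holds one gene name per line and that the files are named `<gene>.fasta`, as in the bash cell. The function name `copy_gene_alignments` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_gene_alignments(gene_list, src_dir, dest_dir):
    """Copy <gene>.fasta for each listed gene; return the genes with no file."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    missing = []
    for gene in Path(gene_list).read_text().split():
        source = src_dir / f"{gene}.fasta"
        if source.is_file():
            shutil.copy2(source, dest_dir / source.name)
        else:
            missing.append(gene)
    return missing
```

For example, `copy_gene_alignments(GENE_LIST, BMGE_DIR, f"{BMGE_DIR}/66g")` should return an empty list if all 66 alignments are present.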

Concatenate the genes!

In [30]:
%%bash -s "$BMGE_DIR" "$CONCAT_DIR"

files=("$1"/66g/*fasta)

perl /home/mahja/beta-Cyclocitral/src/cat_fasta.pl -f "${files[@]}" > "$2"/concat_107t_66g_prequal_ginsi_bmge.fasta
mv partitions.txt "$2"/partitions_prequal_ginsi_bmge_66g.txt

The alignment has 107 taxa, 66 genes, and 16,438 sites.
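
The taxon and site counts can be checked directly from the concatenated file. The following is a minimal sketch, assuming a standard aligned FASTA file in which sequences may wrap over several lines; the function name `alignment_stats` is hypothetical.

```python
def alignment_stats(fasta_path):
    """Return (n_taxa, n_sites) for an aligned FASTA file."""
    lengths = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].strip()
                if name in lengths:
                    raise ValueError(f"duplicate taxon: {name}")
                lengths[name] = 0
            elif name is None:
                raise ValueError("sequence data before the first header")
            else:
                lengths[name] += len(line)

    # In an alignment every sequence must have the same length.
    if len(set(lengths.values())) > 1:
        raise ValueError("sequences differ in length; not an alignment")
    n_sites = next(iter(lengths.values()), 0)
    return len(lengths), n_sites
```

Calling `alignment_stats(f"{CONCAT_DIR}/concat_107t_66g_prequal_ginsi_bmge.fasta")` returns the number of taxa and sites to compare against the figures reported here.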

I manually edited the header names slightly:
- Removed the * after New
- Changed ZHAN22 from New to Haptophyte
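
To make these manual edits reproducible, they could be scripted. The sketch below is only illustrative: it assumes the lineage label appears literally as `New` (optionally followed by `*`) in the header and that `ZHAN22` appears in the header of the reassigned taxon, neither of which has been checked against the actual files. The function name `clean_header` is hypothetical.

```python
def clean_header(header):
    """Apply the two manual edits to one FASTA header (without the leading '>')."""
    # Edit 1: drop the asterisk that follows the 'New' label.
    header = header.replace("New*", "New")
    # Edit 2: relabel ZHAN22 from New to Haptophyte.
    if "ZHAN22" in header:
        header = header.replace("New", "Haptophyte")
    return header
```

Headers that contain neither pattern are returned unchanged.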

#### 19.2.1. Run cpREV+C60+G tree

We ran a tree with the cpREV+C60+G model. 

In [3]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["ALIGNMENTS"]["V14"]["CONCAT"]

## Output directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["SUBSET"]["TREES"]["V14"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_107t_66g_prequal_ginsi_bmge.fasta "$2"/gene_subsets/concat_107t_66g_prequal_ginsi_bmge_cpREV-C60-G

#### 19.2.2. Run cat-gtr tree
Convert the alignment to PHYLIP format.

In [None]:
%%bash -s "$CONCAT_DIR"

perl ../../src/fasta2phylip.pl -f "$1"/concat_107t_66g_prequal_ginsi_bmge.fasta -o "$1"/concat_107t_66g_prequal_ginsi_bmge.phy

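# Replace every '=' with '_' (write to a temporary file next to the alignment, then move it into place)
tr '=' '_' < "$1"/concat_107t_66g_prequal_ginsi_bmge.phy > "$1"/concat_107t_66g_prequal_ginsi_bmge.phy.tmp
mv "$1"/concat_107t_66g_prequal_ginsi_bmge.phy.tmp "$1"/concat_107t_66g_prequal_ginsi_bmge.phy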

Run three chains of PhyloBayes.

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_66g_prequal_ginsi_bmge.phy "$2"/gene_subsets/concat_107t_66g_prequal_ginsi_bmge_catgtr_chain1
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_66g_prequal_ginsi_bmge.phy "$2"/gene_subsets/concat_107t_66g_prequal_ginsi_bmge_catgtr_chain2
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/concat_107t_66g_prequal_ginsi_bmge.phy "$2"/gene_subsets/concat_107t_66g_prequal_ginsi_bmge_catgtr_chain3


In [None]:
%%bash -s "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/gene_subsets/concat_107t_66g_prequal_ginsi_bmge_catgtr_chain1
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/gene_subsets/concat_107t_66g_prequal_ginsi_bmge_catgtr_chain2
sleep 1

sbatch ../../uppmax_scripts/script_bin/job_pb.sh "$1"/gene_subsets/concat_107t_66g_prequal_ginsi_bmge_catgtr_chain3


## References

Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4), 772-780. https://doi.org/10.1093/molbev/mst010

Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15), 1972-1973. https://doi.org/10.1093/bioinformatics/btp348

cat_fasta.pl script. Written by Martin Ryberg. Available at https://github.com/mr-y/my_bioinfPerls/blob/master/cat_fasta.pl. 

Susko, E. (2022). MAMMaL: (M)ultinomial (A)pproximate (M)ixture (Ma)ximum (L)ikelihood Accelerated Estimation of Frequency Classes in Site-heterogeneous Profile Mixture Models Version 1.1. 3 June 20, 2022. https://www.mathstat.dal.ca/~tsusko/doc/mammal.pdf

Susko, E., Lincker, L., & Roger, A. J. (2018). Accelerated estimation of frequency classes in site-heterogeneous profile mixture models. Molecular Biology and Evolution, 35(5), 1266-1283. https://doi.org/10.1093/molbev/msy026

Baños, H., Wong, T. K., Daneau, J., Susko, E., Minh, B. Q., Lanfear, R., Brown, M., Eme, L., & Roger, A. J. (2024). GTRpmix: A linked general-time reversible model for profile mixture models. bioRxiv, 2024-03.

Zhou, H. Q., Ning, L. W., Zhang, H. X., & Guo, F. B. (2014). Analysis of the relationship between genomic GC content and patterns of base usage, codon usage and amino acid usage in prokaryotes: similar GC content adopts similar compositional frequencies regardless of the phylogenetic lineages. PLoS ONE, 9(9), e107319.

Fleming, J. F., & Struck, T. H. (2023). nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets. BMC Bioinformatics, 24(1), 145.

Hernandez, A. M., & Ryan, J. F. (2021). Six-state amino acid recoding is not an effective strategy to offset compositional heterogeneity and saturation in phylogenetic analyses. Systematic Biology, 70(6), 1200-1212. 

MEOW (2024). https://github.com/jdaneau/pm

Williamson, K., Eme, L., Baños, H., McCarthy, C., Susko, E., Kamikawa, R., ... & Roger, A. J. (2024). A robustly rooted tree of eukaryotes reveals their excavate ancestry. bioRxiv, 2024-09. https://doi.org/10.1101/2024.09.04.611237

Tice, A. K., Žihala, D., Pánek, T., Jones, R. E., Salomaki, E. D., Nenarokov, S., ... & Brown, M. W. (2021). PhyloFisher: a phylogenomic package for resolving eukaryotic relationships. PLoS Biology, 19(8), e3001365.

Crotty, S. M., Minh, B. Q., Bean, N. G., Holland, B. R., Tuke, J., Jermiin, L. S., & von Haeseler, A. (2020). GHOST: recovering historical signal from heterotachously evolved sequence alignments. Systematic Biology, 69(2), 249-264. https://doi.org/10.1093/sysbio/syz051

Eglit, Y., Shiratori, T., Jerlström-Hultqvist, J., Williamson, K., Roger, A. J., Ishida, K. I., & Simpson, A. G. (2024). Meteora sporadica, a protist with incredible cell architecture, is related to Hemimastigophora. Current Biology, 34(2), 451-459. https://doi.org/10.1016/j.cub.2023.12.032