# Mitochondrial contigs from targeted coassembly  

In the search for a potential host for leptophytes, we now turn to the mitochondria. Putative nuclear genome sequences of leptophytes have been difficult to recover. One reason could be that plastid genomes seem to be multi-copy, and therefore more abundant than the corresponding nuclear genome, which makes them easier to recover. Since the mitochondrial genome is also expected to be more abundant than the nuclear genome, we try to (1) recover mitochondrial contigs from the targeted co-assembly, (2) infer phylogenies to identify the contigs, and (3) compare their distribution patterns with those of Lepto-01. Together, steps 2 and 3 could help identify putative leptophyte hosts.

In [None]:
# Check if python is 3.10.5
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__


print(sys.version)
%load_ext autoreload
%autoreload 2

In [2]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# Read the paths with a context manager so the file handle is closed
with open(PATH_FILE, "r") as fh:
    paths_dict = json.load(fh)

## 1. Annotate mitochondrial contigs

Tom recovered 34 complete mitochondrial contigs. I will now run MFannot (as done for the plastid contigs) to annotate the contigs and extract the protein-coding genes.

### 1.1. Run MFAnnot

We use genetic code 1 (standard) or genetic code 4 (mold mitochondrial) as appropriate; the genetic code of each contig was determined with Codetta.

In [4]:
## Define directory with samples (SMP_DIR). Paths are defined in PATHS.json in the main directory.
SMP_DIR = paths_dict['DATABASES']["COASSEMBLY"]["MITO"]

## Define output directory
MFOUT_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["MFANNOT"]["MASTERFILE"]
MFSLURM_CSV = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["MFANNOT"]["MFSLURMLOG"]

## Define MFannot database
PROTEIN_COLLECTION_DB = paths_dict["DATABASES"]["MF_ANNOT_REFS"]["PROTEINS"]

In [5]:
from plastome_raw_data import PlastomeRawIterator

In [None]:
pri = PlastomeRawIterator(SMP_DIR, suffix="fa")

pri.run_mfannot(MFOUT_DIR, MFSLURM_CSV, PROTEIN_COLLECTION_DB, force=False, restart_fails=False)

### 1.2. Convert to GenBank Format

We used the asn2gb tool from NCBI to convert the sqn files (files in ASN.1 syntax that contain the sequences, their features, and the metadata about submission to NCBI) to flat GenBank files.

In [7]:
## Define directory with sqn files (SQN_DIR). Paths are defined in PATHS.json in the main directory.
SQN_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["MFANNOT"]["MASTERFILE"]

## Define output directory
GBOUT_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["MFANNOT"]["GENBANK"]

## Define slurm csv to track jobs (named GBSLURM_CSV so it does not overwrite the MFannot log path)
GBSLURM_CSV = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["MFANNOT"]["GBSLURMLOG"]

In [8]:
from sqn2gb import PlastomeSQNIterator

In [None]:
psi = PlastomeSQNIterator(SQN_DIR, GBOUT_DIR, GBSLURM_CSV)

psi.run_asn2gb()

### 1.3. Extract proteome

We extract the proteome from the genbank files.

In [10]:
## Define directory with GenBank files
MAG_GB = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["MFANNOT"]["GENBANK"]

## Define output directory to store proteomes
MAG_PROT = paths_dict['ANALYSIS_DATA']["COASSEMBLY"]["MITO"]["MFANNOT"]["PROTEOMES"]

## Define whether the file is a reference (from NCBI) or a MAG (opt between "ref" and "mag").
FILE_TYPE = "mag"

In [11]:
from gb_to_prot import mfannot_gb_prot

In [12]:
mfannot_gb_prot(MAG_GB, MAG_PROT, FILE_TYPE)

## 2. Get references for phylogenies

We will now infer some phylogenies to identify the mitochondrial contigs. To do so, we need a reference dataset. Here, we start from the 100-taxon, 93-gene dataset from [Williamson et al 2025](https://doi.org/10.1038/s41586-025-08709-5). This dataset will be reduced to include only eukaryotic taxa and mitochondrion-encoded genes. To this, we will add the rappemonad mitochondrial genome from [Kawachi et al 2021](https://www.cell.com/current-biology/fulltext/S0960-9822(21)00351-1), and seven other representatives of cryptophytes and haptophytes from [Kim et al 2018](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4626-9).

### 2.1. Get reference dataset from Williamson et al 2025

I downloaded the unaligned single-gene fasta files from the Figshare repo associated with Williamson et al 2025 (https://figshare.com/s/59b28ecc0056dc8d0d03), and then kept only the 40 genes that are mitochondrion-encoded (at least in some eukaryotes).

Each single gene file contains prokaryote sequences and other eukaryote sequences that were not in the final concatenated dataset. I will retain the 61 eukaryotes in the final concatenated file of Williamson et al 2025 (without the Anaeramoebae) and remove everything else. The header names were also restricted to 10 characters, so we update the headers to include the full name and taxonomy string. 

In [4]:
## Fasta file directory
REF_FASTA = paths_dict['ANALYSIS_DATA']["COASSEMBLY"]["MITO"]["REFS"]["FASTA"]

In [None]:
%%bash -s "$REF_FASTA"

for i in "$1"/*fasta; do 
    echo $i
    seqkit grep -f <(cat "$1"/references_taxo.list.txt | cut -f1) $i > test
    seqkit replace -p '^(\S+)$' -r '{kv}$2' -k "$1"/references_taxo.list.txt test > test2
    mv test2 $i
done
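For readers without seqkit at hand, the filter-and-rename step above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `read_fasta` and `keep_and_rename` are hypothetical, the lookup table is assumed to map each short ID to the full name plus taxonomy string (as in the two-column `references_taxo.list.txt`), and the toy IDs and sequences are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_and_rename(lines, id_to_name):
    """Keep records whose ID is in id_to_name and swap in the full name."""
    kept = []
    for header, seq in read_fasta(lines):
        tokens = header.split()
        if tokens and tokens[0] in id_to_name:
            kept.append((id_to_name[tokens[0]], seq))
    return kept


# Hypothetical entries: the short IDs, names and sequences are made up.
id_to_name = {"Arabthalia": "Arabidopsis_thaliana;Archaeplastida"}
toy_fasta = [">Arabthalia", "MKV", "LLA", ">Esccoli_K1", "MRT"]
renamed = keep_and_rename(toy_fasta, id_to_name)
```

Records whose ID is missing from the table (here the prokaryote-like `Esccoli_K1`) are dropped, which mirrors the intent of keeping only the selected eukaryotes.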

### 2.2. Add more haptophyte, cryptophyte, and other references

We add more haptophyte and cryptophyte references, as the leptophyte plastids are most closely related to these groups. In total, this corresponds to 9 mitochondrial genomes:

MG680941	Chroomonas placoidea  
MG680942	Cryptomonas curvata  
MG680945	Proteomonas sulcata  
NC_002572	Rhodomonas salina  
MG680943	Storeatula species  
MG680944	Teleaulax amphioxeia  
JN131834	Phaeocystis antarctica  
KC967226	Phaeocystis globosa  
LC564891	Pavlomulina ranunculiformis  

A fasta file containing all the mitochondria genomes was manually downloaded using Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez). 

In addition, we added mitochondrial genomes from major lineages in the eukaryotic tree of life, including katablepharids, centrohelids, and Microheliella. In total, these correspond to 18 genomes from [Yazaki et al 2022](https://doi.org/10.3389/fevo.2022.1030570), [Nishimura et al 2019](https://doi.org/10.1038/s41598-019-41238-6), [Janouskovec et al 2017](https://doi.org/10.1016/j.cub.2017.10.051) and [Wideman et al 2020](https://doi.org/10.1038/s41564-019-0605-4).

I then split this fasta file into multiple files, with one mitochondrial genome per file, and the file name corresponding to the sequence header.

In [9]:
MFANNOT = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["REFS"]["MFANNOT"]

In [None]:
%%bash -s "$MFANNOT"

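## Write one .fa file per record into the same directory as sequence.fasta (not the current working directory)
awk -v out="$1" '/^>/{if(N){close(N)} N=out "/" substr($1,2) ".fa"; print > N; next;} {if(N) print >> N}' "$1"/sequence.fasta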

rm "$1"/sequence.fasta
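The splitting step can also be sketched in Python, which may be easier to read than the awk one-liner. This is a minimal sketch, not the code that was run: `split_fasta` is a hypothetical helper, the file name is assumed to be the first whitespace-delimited token of the header plus `.fa`, and the sequences in the example are toy strings rather than the real genomes.

```python
import os
import tempfile


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to <first header token>.fa."""
    written = []
    handle = None
    with open(fasta_path) as src:
        for line in src:
            if line.startswith(">"):
                if handle:
                    handle.close()
                # Fall back to a placeholder name if the header is empty
                name = (line[1:].split() or ["unnamed"])[0] + ".fa"
                path = os.path.join(out_dir, name)
                handle = open(path, "w")
                written.append(path)
            if handle:
                handle.write(line)
    if handle:
        handle.close()
    return written


# Toy example in a temporary directory (sequences are made up)
with tempfile.TemporaryDirectory() as tmp:
    multi = os.path.join(tmp, "sequence.fasta")
    with open(multi, "w") as fh:
        fh.write(">MG680941 toy record\nATGC\nGGTA\n>NC_002572 toy record\nTTAA\n")
    written = split_fasta(multi, tmp)
    names = sorted(os.path.basename(p) for p in written)
    with open(os.path.join(tmp, "MG680941.fa")) as fh:
        first_record = fh.read()
```

Passing the output directory explicitly avoids any dependence on the directory the notebook happens to be running in.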

#### 2.2.1 Run MFAnnot

We run MFannot on the reference sequences so that their annotations are consistent with those of the mitochondrial contigs.

In [21]:
## Define directory with samples. This will also be the output directory.
MFANNOT = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["REFS"]["MFANNOT"]

## Define slurmlog csv
MFSLURM_CSV = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["REFS"]["MFSLURMLOG"]

## Define MFannot database
PROTEIN_COLLECTION_DB = paths_dict["DATABASES"]["MF_ANNOT_REFS"]["PROTEINS"]

In [None]:
pri = PlastomeRawIterator(MFANNOT, suffix="fa")

pri.run_mfannot(MFANNOT, MFSLURM_CSV, PROTEIN_COLLECTION_DB, force=False, restart_fails=False)

We then convert the output to GenBank format.

In [23]:
## Define slurm csv to track jobs 
GBSLURM_CSV = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["REFS"]["GBSLURMLOG"]

In [None]:
psi = PlastomeSQNIterator(MFANNOT, MFANNOT, GBSLURM_CSV)

psi.run_asn2gb()

#### 2.2.2 Extract the proteome.

In [26]:
## Define directory with GenBank files and where proteomes will be stored.
MFANNOT = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["REFS"]["MFANNOT"]

## Define whether the file is a reference (from NCBI) or a MAG (opt between "ref" and "mag").
FILE_TYPE = "ref"

In [27]:
mfannot_gb_prot(MFANNOT, MFANNOT, FILE_TYPE)

## 3. Collect homologs

We now collect homologs of the 40 genes from both the 9 additional references as well as the 34 mitochondrial contigs extracted by Tom. We start by putting the query (references from Williamson et al 2025) sequences in a working_dataset folder. 

### 3.1. Create BLAST databases for all new proteomes

In [3]:
## Folder containing proteomes of the mitochondrial contigs
MAG_PROT = paths_dict['ANALYSIS_DATA']["COASSEMBLY"]["MITO"]["MFANNOT"]["PROTEOMES"]

REF_PROT = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["REFS"]["MFANNOT"]

## Output folder that will contain the blast dbs of the new proteomes
TO_ADD = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['TO_ADD']

In [None]:
%%bash -s "$MAG_PROT" "$REF_PROT" "$TO_ADD"

## load the blast module on Uppmax
module load bioinfo-tools
module load blast

for i in "$1"/*fasta; do
    seq=$(basename $i | cut -f 1 -d '.')
    makeblastdb -in $i -dbtype prot -out "$3"/"$seq".db
done

for i in "$2"/*fasta; do
    seq=$(basename $i | cut -f 1 -d '.')
    makeblastdb -in $i -dbtype prot -out "$3"/"$seq".db
done

### 3.2. BLAST references against database

We first create a text file called `blastdb_name.txt` with the names of the blast databases.

In [18]:
%%bash -s "$TO_ADD"

for i in "$1"/*; do
    echo $i | \
    cut -f 1 -d '.' | \
    sed -E 's/(.*)/\1\.db/'
done | uniq > "$1"/blastdb_name.txt

In [16]:
## Folder containing the reference fasta files that we will use as blast query
WORKING_DATASET = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['WORKING']

## Folder containing the blast dbs
TO_ADD = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['TO_ADD']

We now submit the BLAST jobs, using an E-value of 1e-01.

In [None]:
%%bash -s "$WORKING_DATASET" "$TO_ADD" 

for i in  "$1"/*.fasta; do
    sbatch ../../uppmax_scripts/script_bin/job_running_blast.sh \
        $i "$2"/blastdb_name.txt
    sleep 1
done

### 3.3 Extract best BLAST hits

Blast output files are generated in the folder containing the reference query sequences. For each gene, we want to:
- parse the blast output files, 
- extract the best hit,
- pull out the corresponding sequence

In [26]:
## Full gene dataset
GENE_LIST = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['MITOGENES']

## Folder containing the query fasta files and the blast outputs
WORKING_DATASET = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['WORKING']

We first pull the best blast hit. We can do that by taking the first line of the blast output (since the blast output is already sorted). The second column of the blast tabular output contains the name of the best sequence.

In [23]:
%%bash -s "$GENE_LIST" "$WORKING_DATASET"

## Extract best blast hit for each gene and taxon
cat $1 | while read line; do
    for i in "$2"/"$line".fasta__*; do
        cat $i | head -n 1 | cut -f2 >> "$2"/"$line"_toAdd.list
    done
done
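The best-hit extraction can be made a little more explicit in Python. The sketch below assumes the BLAST output is in the standard 12-column tabular format (bit score in the last column), which is not confirmed by the notebook; `best_hit` is a hypothetical helper and the IDs are made up. Note that it is a slightly different design from the cell above: instead of trusting the sort order and taking the first line, it picks the row with the highest bit score across all queries in the file.

```python
import csv
import io


def best_hit(blast_tab_text):
    """Return the subject ID (column 2) of the highest-bit-score row, or None."""
    best_id, best_score = None, float("-inf")
    for row in csv.reader(io.StringIO(blast_tab_text), delimiter="\t"):
        # Skip comment lines and rows that are not 12-column tabular output
        if len(row) < 12 or row[0].startswith("#"):
            continue
        score = float(row[11])
        if score > best_score:
            best_id, best_score = row[1], score
    return best_id


# Toy tabular output: IDs and scores are made up for illustration
toy_blast = (
    "q1\thitA\t45.0\t200\t90\t3\t1\t200\t5\t204\t1e-30\t120\n"
    "q2\thitB\t60.0\t210\t70\t2\t1\t210\t1\t210\t1e-50\t185\n"
    "q3\thitA\t40.0\t190\t95\t4\t1\t190\t8\t197\t1e-20\t98.5\n"
)
top = best_hit(toy_blast)
```

Either way, the hits still need the manual check described below, since a best hit can belong to a different gene.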

I manually went through each {gene}_toAdd.list file and removed the hits that were from other genes. For 12 genes, fewer than 10 hits were found, leaving 28 genes with enough hits. The genes with fewer than 10 hits were: atp3, atp9, ccmFC, cox11, nad8, rpl11, rpl1, rpl20, rpl27, rpl31, sdh2, tufA.

We pull the corresponding sequences now. 

First we concatenate all the proteomes together to make it easier to search the sequences.

In [13]:
## Folders containing proteomes
MAG_PROT = paths_dict['ANALYSIS_DATA']["COASSEMBLY"]["MITO"]["MFANNOT"]["PROTEOMES"]

REF_PROT = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["REFS"]["MFANNOT"]

## Folder containing the blast dbs
TO_ADD = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['TO_ADD']

In [10]:
%%bash -s "$MAG_PROT" "$REF_PROT" "$TO_ADD" 
for i in "$1"/*fasta; do (cat "${i}"; echo) >> "$3"/all.fasta; done
for i in "$2"/*fasta; do (cat "${i}"; echo) >> "$3"/all.fasta; done

Next we use [seqkit](https://github.com/shenwei356/seqkit) to pull out the sequences for each gene.

In [None]:
%%bash -s "$WORKING_DATASET" "$TO_ADD"

for i in "$1"/*_toAdd.list; do
    gene=$(basename $i | cut -f 1 -d '.')
    seqkit grep -f $i "$2"/all.fasta > "$1"/"$gene".fasta
done

Put together the query and the extracted sequences (for only the 28 selected genes).

In [20]:
GENE_SUBLIST = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['SUBMITOGENES']

In [21]:
%%bash -s "$GENE_SUBLIST" "$WORKING_DATASET"

cat $1 | while read line; do
    cat "$2"/"$line".fasta "$2"/"$line"_toAdd.fasta > "$2"/"$line".all.fasta
done

Before running the trees, we format the fasta headers so that the tree inference programme is happy. That means replacing ';' with '_' and moving the mag/accession field to the front of the header.

Finally we also use seqkit to rename any duplicates that may exist.

In [25]:
CLEAN = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['CLEAN']

In [27]:
%%bash -s "$WORKING_DATASET" "$CLEAN"

for i in "$1"/*all.fasta; do
    gene=$(basename $i | cut -f 1 -d '.')
    cat $i | \
    tr ';' '_' | \
    sed -E 's/>(gene=.*_)(mag=.*)_(contig=.*)/>\2_\1\3/' | \
    sed -E 's/>(gene=.*_)(accession=.*)_(contig=.*)/>\2_\1\3/' | \
    seqkit rename \
    > "$2"/"$gene".fasta
done
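To make the header rewriting easier to follow, here is a small Python sketch of the same regular expressions. It assumes headers of the form `gene=...;mag=...;contig=...` (or `accession=` in place of `mag=`), inferred from the sed patterns; the example values are made up. The duplicate renaming only approximates `seqkit rename`, and the `_2`, `_3` suffix style is an assumption.

```python
import re
from collections import Counter


def clean_header(header):
    """Replace ';' with '_' and move the mag/accession field to the front."""
    header = header.replace(";", "_")
    header = re.sub(r"^(gene=.*_)(mag=.*)_(contig=.*)$", r"\2_\1\3", header)
    header = re.sub(r"^(gene=.*_)(accession=.*)_(contig=.*)$", r"\2_\1\3", header)
    return header


def rename_duplicates(headers):
    """Append _2, _3, ... to repeated headers (first occurrence unchanged)."""
    seen = Counter()
    out = []
    for h in headers:
        seen[h] += 1
        out.append(h if seen[h] == 1 else f"{h}_{seen[h]}")
    return out


# Hypothetical headers, without the leading '>'
mag_header = clean_header("gene=cox1;mag=K4;contig=c1")
ref_header = clean_header("gene=cob;accession=MG680941;contig=c1")
unique = rename_duplicates(["a", "b", "a"])
```

Headers that match neither pattern (for example the Williamson et al 2025 references) pass through with only the ';' replacement.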

Lastly, we rename the references we added to include a taxonomy string. 

In [None]:
%%bash -s "$CLEAN"

for i in "$1"/*fasta
    do cat $i | \
    sed -E 's/_gene/\tgene/' | \
    seqkit replace -p '^(\S+)(.+?)$' -r '{kv}$2' -k "$1"/taxonomy_string.tsv --keep-key | \
    tr '\t' '_' \
    > tmp
    mv tmp $i
done

## 4. Align, trim, and infer single gene trees

### 4.1 Align

We align the genes with mafft-linsi.

In [5]:
from gene_iterator import GeneIterator

In [4]:
# Folder with extracted gene dataset
DATASET = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['CLEAN']

# Read_genes
GENE_SUBLIST = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['SUBMITOGENES']
genes = set(line.split()[0].strip() for line in open(GENE_SUBLIST, "r"))

# Directory for mafft output
MAFFT_DIR = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']['MAFFT']

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']['MAFFTLOG']


In [None]:
gi = GeneIterator(DATASET, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_mafft(MAFFT_DIR, SLURMLOG)

### 4.2 Trim

We do gentle trimming with trimAl by removing columns with 80% or more gaps.

In [6]:
# Folder containing aligned fasta files
MAFFT_DIR = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']['MAFFT']

# Read_genes
GENE_SUBLIST = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['SUBMITOGENES']
genes = set(line.split()[0].strip() for line in open(GENE_SUBLIST, "r"))

# Directory for trimal output
TRIMAL_DIR = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']['TRIMAL']

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']['TRIMALLOG']

In [None]:
gi = GeneIterator(MAFFT_DIR, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_trimal(TRIMAL_DIR, SLURMLOG, MAFFT_DIR)
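What "removing columns with 80% or more gaps" means can be shown with a short Python sketch. This is not trimAl itself: `trim_gappy_columns` is a hypothetical helper, only '-' is treated as a gap, and how trimAl handles a column sitting exactly on the threshold may differ from this illustration.

```python
def trim_gappy_columns(seqs, max_gap_fraction=0.8):
    """Drop alignment columns whose gap fraction is >= max_gap_fraction."""
    names = list(seqs)
    lengths = {len(seqs[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    keep = []
    for col in range(length):
        gaps = sum(seqs[n][col] == "-" for n in names)
        if gaps / len(names) < max_gap_fraction:
            keep.append(col)
    return {n: "".join(seqs[n][c] for c in keep) for n in names}


# Toy alignment: column 3 has 4/5 gaps (removed), column 2 has 3/5 (kept)
toy_alignment = {
    "t1": "MK-A",
    "t2": "M--A",
    "t3": "M--A",
    "t4": "MK-A",
    "t5": "M-LA",
}
trimmed = trim_gappy_columns(toy_alignment)
```

With such a permissive threshold only the columns that are almost entirely gaps are removed, which is why the trimming is described as gentle.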

### 4.3 Infer SGTs

We infer trees with [IQ-TREE](http://www.iqtree.org/) using the best fitting model for each gene. 

In [41]:
# Folder containing aligned, trimmed fasta files
TRIMAL_DIR = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']['TRIMAL']

# Read_genes
GENE_SUBLIST = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['SUBMITOGENES']
genes = set(line.split()[0].strip() for line in open(GENE_SUBLIST, "r"))

# Directory for trees output
TREES_DIR = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']['SGTS']

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']['SGTLOG']

In [None]:
gi = GeneIterator(TRIMAL_DIR, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_siqtree(TREES_DIR, SLURMLOG, TRIMAL_DIR)

I inspected the trees manually, and they looked mostly fine. The only change needed was removing a couple of duplicates, which I did by hand.

## 5. Preliminary concatenated phylogeny

Now infer a preliminary concatenated phylogeny with the LG4X model. For the alignment, I don't realign the sequences but simply concatenate the trimmed files after removing duplicates. The tree will be used to do the final taxon sampling.

In [None]:
# Directory for aligned and trimmed fasta files
TRIMAL_DIR = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['CONCAT']['TRIMAL']

## Output directory for fasta files
PRELIM = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['CONCAT']['ALIGNMENTS']['PRELIM']

First change the headers so that they are the same across all genes for corresponding taxa.

In [7]:
%%bash -s "$TRIMAL_DIR"

for i in "$1"/*fasta; do 
    cat $i | sed -E 's/(>.*)_gene.*/\1/' > tmp 
    mv tmp $i 
done

Then concatenate!

In [None]:
%%bash -s "$TRIMAL_DIR" "$PRELIM"

files=("$1"/*fasta)

perl /home/mahja/beta-Cyclocitral/src/cat_fasta.pl -f "${files[@]}" > "$2"/concat.fasta
mv partitions.txt "$2"/.

The final concatenated alignment contains 121 taxa, 28 genes, and 8,401 alignment positions. 
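As an illustration of what the concatenation step produces, the sketch below joins per-gene alignments into a supermatrix and records the partition coordinates. It is not `cat_fasta.pl`: the helper `concatenate` is hypothetical, and it assumes that taxa missing from a gene are filled with gaps and that partitions are reported as 1-based inclusive ranges.

```python
def concatenate(alignments):
    """Concatenate (gene, {taxon: aligned_seq}) pairs into a supermatrix.

    Returns the concatenated sequences and a list of
    (gene, start, end) partitions with 1-based inclusive coordinates.
    """
    taxa = sorted({t for _, aln in alignments for t in aln})
    parts = {t: [] for t in taxa}
    partitions = []
    start = 1
    for gene, aln in alignments:
        lengths = {len(s) for s in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences differ in length")
        length = lengths.pop()
        for t in taxa:
            # Taxa without this gene get an all-gap block
            parts[t].append(aln.get(t, "-" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    return {t: "".join(p) for t, p in parts.items()}, partitions


# Toy input: taxon B lacks the second gene
toy = [("cox1", {"A": "MK", "B": "ML"}), ("cob", {"A": "QRS"})]
matrix, partitions = concatenate(toy)
```

The partition list is what a partitioned analysis would need; for the single-model trees used here only the concatenated sequences matter.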

Run tree!

In [None]:
## Output directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["TREE"]

In [None]:
%%bash -s "$PRELIM" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat.fasta "$2"/prelim/concat_121t_28g_LG4X

## 6. Concatenated tree

### 6.1 Dataset
We remove 16 taxa to reduce taxonomic redundancy and to get a taxon sampling small enough to make a figure for the main text. I also decided to remove Roombia truncata since it had low occupancy (9/28 genes) and did not cluster with K4 (based on BLAST results, K4 consistently hit cryptophytes, while Roombia never did).


In [18]:
DATASET = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["DATASET"]

In [None]:
%%bash -s "$DATASET"

for i in "$1"/*.fasta; do seqkit grep -v -f "$1"/remove.list $i > tmp; mv tmp $i; done
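Occupancy figures such as "9/28 genes" can be tallied with a few lines of Python. The sketch below is only an illustration: `gene_occupancy` and `low_occupancy` are hypothetical helpers, and the gene and taxon names in the example are placeholders rather than values from this dataset.

```python
from collections import Counter


def gene_occupancy(gene_to_taxa):
    """Count, for each taxon, the number of genes in which it is present."""
    counts = Counter()
    for taxa in gene_to_taxa.values():
        counts.update(set(taxa))  # count each taxon once per gene
    return dict(counts)


def low_occupancy(gene_to_taxa, min_fraction):
    """Return taxa present in fewer than min_fraction of the genes."""
    n_genes = len(gene_to_taxa)
    occ = gene_occupancy(gene_to_taxa)
    return sorted(t for t, c in occ.items() if c / n_genes < min_fraction)


# Placeholder data: four genes, three taxa
toy_genes = {
    "geneA": ["taxon1", "taxon2", "taxon3"],
    "geneB": ["taxon1", "taxon2"],
    "geneC": ["taxon1", "taxon2", "taxon2"],
    "geneD": ["taxon1"],
}
occupancy = gene_occupancy(toy_genes)
sparse = low_occupancy(toy_genes, 0.5)
```

A list like `sparse` could then be compared against `remove.list` before filtering, although the final decision here was made by hand.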

### 6.2 Align
Align with mafft-ginsi (unalign = 0.6) as done for plastids.

In [19]:
from gene_iterator import GeneIterator

In [20]:
# Folder with extracted gene dataset
DATASET = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["DATASET"]

# Read_genes
GENE_SUBLIST = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['SUBMITOGENES']
genes = set(line.split()[0].strip() for line in open(GENE_SUBLIST, "r"))

# Directory for mafft output
MAFFT_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["MAFFT"]

# Slurmlog csv
SLURMLOG = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["MAFFTLOG"]

In [None]:
gi = GeneIterator(DATASET, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_mafft(MAFFT_DIR, SLURMLOG)

### 6.3 Trim 

We do gentle trimming with BMGE by removing columns with 80% or more gaps, and using the BLOSUM30 matrix.

In [23]:
# Folder containing aligned fasta files
MAFFT_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["MAFFT"]

# Read_genes
GENE_SUBLIST = paths_dict['ANALYSIS_DATA']['COASSEMBLY']['MITO']['TREES']["DATASET"]['SUBMITOGENES']
genes = set(line.split()[0].strip() for line in open(GENE_SUBLIST, "r"))

# Directory for trimmed (BMGE) output; the TRIMAL variable and path names are kept from the earlier trimAl step
TRIMAL_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["TRIMAL"]

# Slurmlog csv
SLURMLOG = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["TRIMALLOG"]

In [None]:
gi = GeneIterator(MAFFT_DIR, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_bmge(TRIMAL_DIR, SLURMLOG, MAFFT_DIR)

### 6.4 Concatenate

In [36]:
# Directory for aligned and trimmed fasta files
TRIMAL_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["TRIMAL"]

## Output directory for fasta files
CONCAT = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["ALIGNMENTS"]["CONCAT"]

In [38]:
%%bash -s "$TRIMAL_DIR" "$CONCAT"

files=("$1"/*fasta)

perl /home/mahja/beta-Cyclocitral/src/cat_fasta.pl -f "${files[@]}" > "$2"/concat_101t_28g_ginsi_bmge.fasta
mv partitions.txt "$2"/.

The final alignment contains 101 taxa, 28 genes and 3,270 alignment sites.

### 6.5 Run ML and Bayesian phylogenetic analyses

We run a tree with the LG+C60+G and CAT-GTR models. 

#### 6.5.1 Run tree with PhyloBayes

Convert alignment to phylip format. 

In [None]:
%%bash -s "$CONCAT"

perl ../../src/fasta2phylip.pl -f "$1"/concat_101t_28g_ginsi_bmge.fasta -o "$1"/concat_101t_28g_ginsi_bmge.phy

cat "$1"/concat_101t_28g_ginsi_bmge.phy | tr '=' '_' > phylip
mv phylip "$1"/concat_101t_28g_ginsi_bmge.phy
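The conversion can also be sketched in Python. This is not `fasta2phylip.pl`: `to_phylip` is a hypothetical helper that writes a relaxed sequential PHYLIP layout (one taxon per line, name and sequence separated by two spaces) and replaces '=' with '_' in the names only, whereas the `tr` call above replaces it throughout the file. The taxon names are made up.

```python
def to_phylip(seqs):
    """Format aligned sequences as relaxed sequential PHYLIP text."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(seqs)} {lengths.pop()}"]
    for name, seq in seqs.items():
        # '=' in taxon names is swapped for '_' before writing
        safe_name = name.replace("=", "_")
        lines.append(f"{safe_name}  {seq}")
    return "\n".join(lines) + "\n"


# Toy alignment with made-up names
phylip_text = to_phylip({"mag=K4": "MK-A", "ref=X": "MKLA"})
```

Whether the downstream program accepts long names in this layout should be checked against its own documentation.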

Set up 3 PhyloBayes chains on noisy with the CAT-GTR model.

#### 6.5.2 Run tree with IQ-TREE

We use the LG+C60+G model.

In [40]:
## Output directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MITO"]["CONCAT"]["TREE"]

In [None]:
%%bash -s "$CONCAT" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/mito-only-genes/concat_101t_28g_ginsi_bmge.fasta "$2"/mito-only-genes/concat_101t_28g_LG-C60-G

## References

Lang, B. F., Beck, N., Prince, S., Sarrasin, M., Rioux, P., & Burger, G. (2023). Mitochondrial genome annotation with MFannot: a critical analysis of gene identification and gene model prediction. Frontiers in Plant Science, 14.

Beck, N., & Lang, B. F. (2010). MFannot, organelle genome annotation webserver. Available at: https://github.com/BFL-lab/Mfannot.

asn2gb tool from NCBI. Available at: https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/asn2gb/ 

Williamson, K., Eme, L., Baños, H., McCarthy, C. G., Susko, E., Kamikawa, R., ... & Roger, A. J. (2025). A robustly rooted tree of eukaryotes reveals their excavate ancestry. Nature, 1-8.

Kawachi, M., Nakayama, T., Kayama, M., Nomura, M., Miyashita, H., Bojo, O., ... & Kamikawa, R. (2021). Rappemonads are haptophyte phytoplankton. Current Biology, 31(11), 2395-2403.

Wideman, J. G., Monier, A., Rodríguez-Martínez, R., et al. (2020). Unexpected mitochondrial genome diversity revealed by targeted single-cell genomics of heterotrophic flagellated protists. Nature Microbiology, 5, 154-165. https://doi.org/10.1038/s41564-019-0605-4

Nishimura, Y., Shiratori, T., Ishida, Ki., et al. (2019). Horizontally-acquired genetic elements in the mitochondrial genome of a centrohelid Marophrys sp. SRT127. Scientific Reports, 9, 4850. https://doi.org/10.1038/s41598-019-41238-6

Yazaki, E., Yabuki, A., Nishimura, Y., Shiratori, T., Hashimoto, T., & Inagaki, Y. (2022). Microheliella maris possesses the most gene-rich mitochondrial genome in Diaphoretickes. Frontiers in Ecology and Evolution, 10, 1030570.

Janouškovec, J., Tikhonenkov, D. V., Burki, F., Howe, A. T., Rohwer, F. L., Mylnikov, A. P., & Keeling, P. J. (2017). A new lineage of eukaryotes illuminates early mitochondrial genome reduction. Current Biology, 27(23), 3717-3724.