# Concatenated phylogenies

In this notebook, we infer the large concatenated phylogeny (Figure 1 in the manuscript). Preliminary phylogenies were based on a reduced gene set (not shown or documented here for the sake of brevity).

In [None]:
# Check that the Python version is 3.10.5 (printed below)
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__

print(sys.version)
%load_ext autoreload
%autoreload 2

In [3]:
# We store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# Use a context manager so the file handle is closed after reading
with open(PATH_FILE, "r") as handle:
    paths_dict = json.load(handle)
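
The same nested key chains are looked up repeatedly in the cells below. As a minimal sketch, a small helper can walk the nested dictionary and report which key is missing. The helper name `get_path` and the toy dictionary are hypothetical and are not part of the original pipeline.

```python
def get_path(paths, *keys):
    """Walk a nested dict along `keys`; raise a readable KeyError if a key is missing."""
    node = paths
    for depth, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            raise KeyError("missing key: " + " -> ".join(keys[: depth + 1]))
        node = node[key]
    return node


# Toy stand-in for the content of PATHS.json (the path value is made up)
example_paths = {
    "ANALYSIS_DATA": {
        "CONCAT_GENE_ANALYSIS": {"FULL": {"DATASET": {"V3": "/tmp/dataset_v3"}}}
    }
}

dataset = get_path(
    example_paths, "ANALYSIS_DATA", "CONCAT_GENE_ANALYSIS", "FULL", "DATASET", "V3"
)
print(dataset)
```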

## 1. Full tree - 93 genes

### 1.1. Assembled dataset
**Genes**  
We have a clean dataset for 96 genes. However, we exclude ascF (too few taxa), psbH (slightly weird gene tree), and rbcL (different origins of the gene in red and green algae), leaving us with 93 genes.  

**Taxa**  
We exclude 7 ptMAGs that had high redundancy values. We also exclude two redundant references and add one reference from preliminary versions (v1 of the preprint on bioRxiv).   


### 1.2 Some stats

We calculate gene and taxon occupancy statistics for the dataset.

In [4]:
## Dataset
DATASET = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["DATASET"]["V3"]

In [None]:
%%bash -s "$DATASET"

/home/mahja/ptMAGs/src/get_stats.sh "$1"
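
The contents of `get_stats.sh` are not shown here. As a rough sketch of what gene and taxon occupancy mean, the Python below computes both from per-gene FASTA text. The function names and the toy data are assumptions for illustration, and the sketch assumes the taxon name is the first word of each FASTA header.

```python
def read_fasta_ids(text):
    """Return the first word of every non-empty FASTA header in `text`."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def occupancy(gene_to_fasta):
    """Return (gene_occupancy, taxon_occupancy) as fractions between 0 and 1.

    gene occupancy  = fraction of all taxa that have the gene
    taxon occupancy = fraction of all genes that the taxon has
    """
    gene_taxa = {gene: set(read_fasta_ids(text)) for gene, text in gene_to_fasta.items()}
    all_taxa = set().union(*gene_taxa.values())
    if not all_taxa:
        return {}, {}
    gene_occ = {gene: len(taxa) / len(all_taxa) for gene, taxa in gene_taxa.items()}
    taxon_occ = {
        taxon: sum(taxon in taxa for taxa in gene_taxa.values()) / len(gene_taxa)
        for taxon in all_taxa
    }
    return gene_occ, taxon_occ


# Toy dataset: two genes, two taxa (sequences are made up)
toy = {"atpA": ">t1\nMK\n>t2\nMK\n", "psbA": ">t1\nMR\n"}
gene_occ, taxon_occ = occupancy(toy)
print(gene_occ, taxon_occ)
```

Genes or taxa with low occupancy are natural candidates for exclusion, which is the kind of decision described in section 1.1.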

### 3.2 Filter with prequal

In one case, we filter sequence stretches with no clear homology. We use a posterior probability threshold of 0.95. 

In [6]:
# Folder with extracted gene dataset
DATASET = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["DATASET"]["V3"]

# Output folder for prequal files
PREQUAL = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["PREQUAL"]

In [None]:
%%bash -s "$DATASET" "$PREQUAL"

for i in "$1"/*fasta
do 
    sbatch ../../uppmax_scripts/script_bin/job_prequal.sh "$i" "$2"
    sleep 1
done
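
To illustrate what a posterior probability threshold of 0.95 means in practice, the toy sketch below masks residues whose support falls below the threshold. It is not prequal's algorithm: prequal estimates the posteriors itself and applies its own filtering rules. The function name, the mask character, and the example values are assumptions.

```python
def mask_low_confidence(sequence, posteriors, threshold=0.95, mask_char="X"):
    """Replace residues whose posterior probability is below `threshold`."""
    if len(sequence) != len(posteriors):
        raise ValueError("sequence and posteriors must have the same length")
    return "".join(
        residue if prob >= threshold else mask_char
        for residue, prob in zip(sequence, posteriors)
    )


# Made-up per-residue posterior probabilities for a five-residue sequence
masked = mask_low_confidence("MKTAY", [0.99, 0.97, 0.40, 0.95, 0.10])
print(masked)
```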

### 3.3 Align
We align with mafft-ginsi, the G-INS-i strategy of MAFFT (Katoh & Standley, 2013).

In [4]:
from gene_iterator import GeneIterator

In [5]:
# Folder with prequal-filtered gene dataset
PREQUAL = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["PREQUAL"]

# Read the gene list (one gene name per line, blank lines skipped)
GENE_LIST = paths_dict['DATABASES']['GENE_LISTS']['GENES_93']
with open(GENE_LIST, "r") as handle:
    genes = [line.strip() for line in handle if line.strip()]

# Directory for mafft output
MAFFT_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["MAFFT"]

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["MAFFTLOG"]

In [None]:
gi = GeneIterator(PREQUAL, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_mafft(MAFFT_DIR, SLURMLOG)
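
The internals of `GeneIterator.run_mafft` are not shown in this notebook. As a hedged sketch, a G-INS-i run for a single gene corresponds to a MAFFT call with `--globalpair --maxiterate 1000`, which is what the `mafft-ginsi` wrapper expands to. The helper name, the thread count, and the file name below are assumptions; the command is only built and printed, not executed.

```python
import shlex


def ginsi_command(in_fasta, threads=1):
    """Build the MAFFT G-INS-i command for one input file (MAFFT writes to stdout)."""
    return [
        "mafft",
        "--globalpair",          # global pairwise alignments (G-INS-i)
        "--maxiterate", "1000",  # iterative refinement cycles
        "--thread", str(threads),
        in_fasta,
    ]


cmd = ginsi_command("atpA.fasta", threads=4)  # hypothetical file name
print(shlex.join(cmd) + " > atpA.aln.fasta")
```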

### 3.4 Trim
We trim the alignments with BMGE (BLOSUM35 matrix; columns with more than 80% gaps are removed).

In [10]:
# Folder containing aligned fasta files
MAFFT_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["MAFFT"]

# Read the gene list (one gene name per line, blank lines skipped)
GENE_LIST = paths_dict['DATABASES']['GENE_LISTS']['GENES_93']
with open(GENE_LIST, "r") as handle:
    genes = [line.strip() for line in handle if line.strip()]

# Directory for BMGE output
BMGE_DIR = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["BMGE"]

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["BMGELOG"]

In [None]:
gi = GeneIterator(MAFFT_DIR, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_bmge(BMGE_DIR, SLURMLOG, MAFFT_DIR)
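
The gap filter mentioned above can be sketched in a few lines. This covers only the gap-fraction criterion (drop columns with more than 80% gaps); BMGE additionally scores columns by entropy using the substitution matrix, which is not reproduced here. The function name and the toy alignment are assumptions.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.8, gap_chars="-"):
    """Remove alignment columns whose gap fraction exceeds `max_gap_fraction`.

    `alignment` maps taxon name -> aligned sequence; all sequences must have equal length.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col
        for col in range(length)
        if sum(seq[col] in gap_chars for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[col] for col in keep) for name, seq in alignment.items()}


# Toy alignment: the third column is 100% gaps and is removed,
# the second column is 40% gaps and is kept
toy_alignment = {"t1": "MK-A", "t2": "MK-A", "t3": "M--A", "t4": "M--S", "t5": "MR-A"}
trimmed = drop_gappy_columns(toy_alignment)
print(trimmed)
```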

### 3.5 Concatenate

In [13]:
# Directory for aligned and trimmed fasta files
BMGE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["BMGE"]

## Output directory for fasta files
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["CONCAT"]

In [None]:
%%bash -s "$BMGE_DIR" "$CONCAT_DIR"

files=("$1"/*fasta)

perl /home/mahja/ptMAGs/src/cat_fasta.pl -f "${files[@]}" > "$2"/concat_839t_93g_prequal_ginsi_bmge.fasta
mv partitions.txt "$2"/partitions_prequal_ginsi_bmge.txt

In [None]:
%%bash -s "$CONCAT_DIR"

seqkit stats "$1"/concat_839t_93g_prequal_ginsi_bmge.fasta

The concatenated file has 839 sequences and 19,242 aligned sites.
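
For readers unfamiliar with supermatrix construction, the concatenation step can be sketched as follows: each trimmed gene alignment is appended in turn, taxa missing from a gene are padded with gaps, and the column range of each gene is recorded as a partition. This is an illustration under those assumptions, not a description of how `cat_fasta.pl` works internally; the function name and toy data are hypothetical.

```python
def concatenate(gene_alignments):
    """Concatenate per-gene alignments into one supermatrix.

    `gene_alignments` maps gene -> {taxon: aligned sequence}.
    Returns (supermatrix, partitions), where partitions holds
    (gene, start, end) tuples with 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for aln in gene_alignments.values() for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to a single length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa missing from this gene are filled with gaps
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy input: taxon t2 lacks psbA (sequences are made up)
toy_genes = {"atpA": {"t1": "MKA", "t2": "MRA"}, "psbA": {"t1": "LL"}}
supermatrix, partitions = concatenate(toy_genes)
print(supermatrix, partitions)
```

The number of entries in the supermatrix and the length of any one sequence correspond to the sequence count and site count reported by `seqkit stats`.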

### 3.6 Infer tree

We infer the tree with IQ-TREE, using the best-fitting site-homogeneous model (LG or cpREV matrix). We treat the concatenated alignment as a single partition. We estimate branch support with 1000 ultrafast bootstrap replicates.

In [17]:
## Directory for fasta file
CONCAT_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["FULL"]["ALIGNMENTS"]["V3"]["CONCAT"]

## Output directory for tree
TREE_DIR = paths_dict["ANALYSIS_DATA"]["CONCAT_GENE_ANALYSIS"]["FULL"]["TREES"]["V3"]

In [None]:
%%bash -s "$CONCAT_DIR" "$TREE_DIR"

sbatch ../../uppmax_scripts/script_bin/job_iqtree.sh "$1"/concat_839t_93g_prequal_ginsi_bmge.fasta "$2"/concat_839t_93g_prequal_ginsi_bmge_ML

Submitted batch job 55859943 on cluster rackham
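
The contents of `job_iqtree.sh` are not shown here. As a sketch only, the settings described above (model selection restricted to the LG and cpREV matrices, 1000 ultrafast bootstraps, a single partition) would correspond to an IQ-TREE 2 call along the following lines. The executable name `iqtree2`, the thread setting, and the file names are assumptions; the command is built and printed, not executed.

```python
import shlex


def iqtree_command(alignment, prefix, threads="AUTO"):
    """Build an IQ-TREE 2 command for an unpartitioned ML tree with ultrafast bootstraps."""
    return [
        "iqtree2",
        "-s", alignment,         # concatenated alignment, treated as one partition
        "-m", "MFP",             # ModelFinder, then tree inference
        "--mset", "LG,cpREV",    # restrict candidate matrices to LG and cpREV
        "-B", "1000",            # ultrafast bootstrap replicates
        "--prefix", prefix,      # prefix for all output files
        "-T", str(threads),
    ]


tree_cmd = iqtree_command("concat.fasta", "concat_ML")  # hypothetical file names
print(shlex.join(tree_cmd))
```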


## References

Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4), 772-780. https://doi.org/10.1093/molbev/mst010

Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15), 1972-1973. https://doi.org/10.1093/bioinformatics/btp348

cat_fasta.pl script. Written by Martin Ryberg. Available at https://github.com/mr-y/my_bioinfPerls/blob/master/cat_fasta.pl. 