# Reference SGTs

In this notebook, we assemble a phylogenetic dataset from reference plastomes only. 

In [1]:
# Check if python is 3.10.5
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__


print(sys.version)
%load_ext autoreload
%autoreload 2

3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]


In [2]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# use a context manager so the file handle is closed after loading
with open(PATH_FILE, "r") as fh:
    paths_dict = json.load(fh)

## 1. Gene lists

Which genes should we use to infer plastid phylogenies? The gene list was finalised in an iterative manner following many rounds of preliminary phylogenies. Here, we go through the final gene list and how it came about. 

We start with the gene dataset of [Janouškovec et al 2010](https://doi.org/10.1073/pnas.1003335107). This dataset includes 68 orthologous plastid proteins (the full dataset) and a subset of 34 slow-evolving genes (the subset).

These gene lists are defined here:

In [3]:
## Full gene dataset
JANO_FULL = paths_dict['DATABASES']["GENE_LISTS"]["JANO_FULL"]

## Subset of slow evolving genes
JANO_CONS = paths_dict['DATABASES']["GENE_LISTS"]["JANO_CONS"]

Let's look at the JANO_FULL file.

In [5]:
## head JANO_FULL
print("".join(open(JANO_FULL).readlines()[:10]))

gene	gene_name
acsF	acsF
acsF	ycf59
atpA	atpA
atpB	atpB
atpH	atpH
atpI	atpI
ccs1	ccs1
ccs1	ycf44
ccsA	ccsA



The first column is the gene, while the second column lists all the commonly used gene names (there might be multiple). For example the gene acsF is also known as ycf59. 
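
The two-column layout can also be read into a lookup table in Python. This is a minimal sketch, not part of the original workflow: the function name `read_gene_aliases` is hypothetical, and the sample rows are copied from the output printed above.

```python
import csv
import io
from collections import defaultdict

# A few rows of the gene list as printed above (tab-separated).
SAMPLE = (
    "gene\tgene_name\n"
    "acsF\tacsF\n"
    "acsF\tycf59\n"
    "atpA\tatpA\n"
    "ccs1\tccs1\n"
    "ccs1\tycf44\n"
)


def read_gene_aliases(handle):
    """Map each gene (column 1) to the list of names it goes by (column 2)."""
    aliases = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        aliases[row["gene"]].append(row["gene_name"])
    return dict(aliases)


aliases = read_gene_aliases(io.StringIO(SAMPLE))
print(aliases["acsF"])  # ['acsF', 'ycf59']
```

To use it on the real file, pass an open handle to `JANO_FULL` instead of the in-memory sample.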

We later came across [Ponce-Toledo et al 2017](https://doi.org/10.1016/j.cub.2016.11.056) and [Pietluch et al 2024](https://doi.org/10.1093/gbe/evae192), which both used 97-gene datasets that only partially overlap with each other. We compared those datasets to our 68-gene set. 50 genes were shared between our dataset and the Ponce-Toledo dataset, with 18 genes present *only* in our dataset. A further 47 genes were present only in the Ponce-Toledo dataset, of which 31 were not present in leptophytes (so there was no need to add them). This left 16 genes that we could add to our dataset from Ponce-Toledo et al 2017. Similarly, there were 12 additional genes we could add from Pietluch et al 2024.
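
The bookkeeping for the comparison with the Ponce-Toledo dataset can be checked with a few lines of arithmetic. The numbers are taken from the text; the variable names are only for illustration.

```python
# Counts quoted in the text.
jano_full = 68       # our starting dataset (Janouškovec et al 2010)
ponce_total = 97     # Ponce-Toledo et al 2017 dataset
shared = 50          # genes present in both datasets

only_ours = jano_full - shared      # genes unique to our dataset
only_ponce = ponce_total - shared   # genes unique to Ponce-Toledo

absent_in_leptophytes = 31          # of the Ponce-Toledo-only genes
from_ponce = only_ponce - absent_in_leptophytes

from_pietluch = 12                  # extra genes from Pietluch et al 2024
genes_add = from_ponce + from_pietluch

print(only_ours, only_ponce, from_ponce, genes_add)  # 18 47 16 28
```

The total of 28 matches the number of genes listed in the additional gene file.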

In [4]:
## New gene dataset
GENES_ADD = paths_dict['DATABASES']["GENE_LISTS"]["GENES_ADD"]

In [6]:
%%bash -s "$GENES_ADD"
cat $1

atpF
atpG
atpD
atpE
groEL
dnaK
chlI
psaF
rpl1
rpl12
rpl21
rpl33
rps10
rps20
rps9
rbcL
cbbX
petL
petM
psaI
psaJ
psaL
psaM
psbL
psbX
psbY
rbcS
rpl34


## 2. Single gene trees of ref taxa

We first needed to extract the genes of interest from the reference taxa. (It is worth mentioning that we had done some preliminary phylogenetic analyses with sequences retrieved based on gene names - however, this approach led to errors in some cases due to inconsistency in gene names.)

To confirm that the genes from the reference taxa are: 
- free of potential contaminants 
- free of paralogs (rare in plastids, but we should still check), 

we start by inferring single gene trees of the reference sequences only.

### 2.1 Extract gene fasta files

Step 1 is extracting the 68 (later 93 in total) gene sequences from each taxon. In a preliminary analysis, we tried using the gene names themselves to do so. However, this did not work well because of differences in labelling and poor annotation in some cases (particularly for the cyanobacteria; for example, the gene atpE was labelled as atpH in some cyanobacteria).

So instead, we decided to use BLAST to pull the relevant genes from each taxon. As the query, we opted to use sequences from the brown alga _Fucus vesiculosus_, which has a well-annotated plastid genome ([Le Corguillé et al 2009](https://doi.org/10.1186/1471-2148-9-253)) and contains all 68 genes we are interested in.

#### 2.1.1. Extract genes from _Fucus vesiculosus_

We extract the genes from _Fucus vesiculosus_, which has the accession number NC_016735.

In [None]:
## Folder containing proteomes of reference taxa
## Provided here (in the ptMAGs folder): https://doi.org/10.17044/scilifelab.28212173
REF_PROTEOMES = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['PROTEOMES']['REF_PROTEOMES']

## File of gene set. We start by extracting the full dataset of Janouškovec et al 2010.
JANO_FULL = paths_dict['DATABASES']['GENE_LISTS']['JANO_FULL']

## Output folder where we place the gene extracts for Fucus
## (Intermediate files - not provided)
WORKING_DATASET = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['REF']

In [None]:
%%bash -s "$REF_PROTEOMES" "$JANO_FULL" "$WORKING_DATASET"

cut -f 1 "$2" | uniq | grep -v "gene" | while read line; do
    echo $line
    seqkit grep -rp "gene=$line;" "$1"/NC_016735.taxo.fasta > "$3"/"$line".fasta
done

And do the same for the additional 28 genes.

In [9]:
## Folder containing proteomes of reference taxa
REF_PROTEOMES = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['PROTEOMES']['REF_PROTEOMES']

## File of gene set. 
GENES_ADD = paths_dict['DATABASES']["GENE_LISTS"]["GENES_ADD"]

## Output folder where we place the gene extracts for Fucus
WORKING_DATASET = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['REF']

In [None]:
%%bash -s "$REF_PROTEOMES" "$GENES_ADD" "$WORKING_DATASET"

cat "$2" | while read line; do
    echo $line
    seqkit grep -rp "gene=$line;" "$1"/NC_016735.taxo.fasta > "$3"/"$line".fasta
done

#### 2.1.2. Create a blast database for all the other reference proteomes

Create blast databases for the other 178 reference taxa. 

In [7]:
## Folder containing proteomes of reference taxa
REF_PROTEOMES = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['PROTEOMES']['REF_PROTEOMES']

## Output folder that will contain the blast dbs of the remaining 179 taxa
REFS_TO_ADD = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['TOADD']

In [None]:
%%bash -s "$REF_PROTEOMES" "$REFS_TO_ADD"

## load the blast module on the Uppmax cluster
module load bioinfo-tools
module load blast

for i in "$1"/*taxo.fasta; do
    ## compare the file name only, since $i includes the directory path
    if [ "$(basename "$i")" != "NC_016735.taxo.fasta" ]; then
        taxon=$(basename "$i" | cut -f1 -d '.')
        makeblastdb -in "$i" -dbtype prot -out "$2"/"$taxon".db
    fi
done

#### 2.1.3. Blast extracted _Fucus_ genes against the blast databases

We first create a text file called `blastdb_name.txt` with the names of the blast databases.

In [17]:
%%bash -s "$REFS_TO_ADD"

for i in "$1"/*; do
    echo $i | \
    cut -f 1 -d '.' | \
    sed -E 's/(.*)/\1\.db/'
done | uniq > "$1"/blastdb_name.txt

In [11]:
## Folder containing the Fucus fasta files that we will use as blast query
WORKING_DATASET = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['REF']

## Folder containing the blast dbs
REFS_TO_ADD = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['TOADD']

We submit the blast jobs now!

In [None]:
%%bash -s "$WORKING_DATASET" "$REFS_TO_ADD" 

for i in  "$1"/*.fasta; do
    sbatch ../../uppmax_scripts/script_bin/job_running_blast.sh \
        $i "$2"/blastdb_name.txt
    sleep 1
done

#### 2.1.4. Extract best blast hits

Blast output files are generated in the folder containing the single genes from _Fucus_, our query. For each gene, we want to:
- parse the blast output files, 
- extract the best hit,
- pull out the corresponding sequence

In [13]:
## Full gene dataset
JANO_FULL = paths_dict['DATABASES']["GENE_LISTS"]["JANO_FULL"]

## 28 gene dataset
GENES_ADD = paths_dict['DATABASES']["GENE_LISTS"]["GENES_ADD"]

## Folder containing the Fucus fasta files and the blast outputs
WORKING_DATASET = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['REF']

## Folder containing proteomes of reference taxa
REF_PROTEOMES = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['PROTEOMES']['REF_PROTEOMES']


We first pull the best blast hit. We can do that by taking the first line of the blast output (since the blast output is already sorted). The second column of the blast tabular output contains the name of the best sequence.
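
The selection rule can be sketched in Python as follows. This is an illustration only: it assumes the standard 12-column tabular layout (where column 2 is the subject ID), and the function name `best_hit` and the sample hits are made up for the example.

```python
def best_hit(blast_tabular_text):
    """Return the subject ID (column 2) of the first hit line, or None if there are no hits."""
    for line in blast_tabular_text.splitlines():
        # skip blank lines and comment lines (present in some tabular variants)
        if line.strip() and not line.startswith("#"):
            return line.split("\t")[1]
    return None


# Hypothetical BLAST output: hits are already sorted, best first.
SAMPLE = (
    "fucus_atpA\ttaxonX_prot1\t85.0\t500\t75\t0\t1\t500\t1\t500\t0.0\t900\n"
    "fucus_atpA\ttaxonX_prot2\t40.0\t300\t180\t5\t10\t310\t5\t300\t1e-50\t200\n"
)

print(best_hit(SAMPLE))  # taxonX_prot1
print(best_hit(""))      # None
```

Unlike `head -n 1 | cut -f2`, this sketch returns `None` for an empty output file rather than an empty line, which makes taxa without any hit easier to spot.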

In [None]:
%%bash -s "$JANO_FULL" "$WORKING_DATASET"

## Extract best blast hit for each gene and taxon
cut -f 1 $1 | grep -v "gene" | sort -u | while read line; do
    for i in "$2"/"$line".fasta__*; do
        cat $i | head -n 1 | cut -f2 >> "$2"/"$line"_toAdd.list
    done
done

In [None]:
%%bash -s "$GENES_ADD" "$WORKING_DATASET"

## Extract best blast hit for each gene and taxon
cat $1 | while read line; do
    for i in "$2"/"$line".fasta__*; do
        cat $i | head -n 1 | cut -f2 >> "$2"/"$line"_toAdd.list
    done
done

We pull the corresponding sequences now. 

First we concatenate all the proteomes together to make it easier to search the sequences.

In [16]:
%%bash -s "$REF_PROTEOMES"
cat "$1"/*.taxo.fasta > "$1"/all.fasta

Next we use [seqkit](https://github.com/shenwei356/seqkit) to pull out the sequences for each gene.

In [None]:
%%bash -s "$WORKING_DATASET" "$REF_PROTEOMES"

for i in "$1"/*_toAdd.list; do
    gene=$(basename $i | cut -f 1 -d '.')
    seqkit grep -f $i "$2"/all.fasta > "$1"/"$gene".fasta
done

Put together the query and the extracted sequences. 

In [8]:
%%bash -s "$JANO_FULL" "$WORKING_DATASET"

cut -f 1 $1 | grep -v "gene" | sort -u | while read line; do
    cat "$2"/"$line".fasta "$2"/"$line"_toAdd.fasta > "$2"/"$line".all.fasta
done

And the additional 28 genes too!

In [None]:
%%bash -s "$GENES_ADD" "$WORKING_DATASET"

cat $1 | while read line; do
    cat "$2"/"$line".fasta "$2"/"$line"_toAdd.fasta > "$2"/"$line".all.fasta
done

At this stage, we may have duplicates, particularly in the cyanobacteria. To avoid problems downstream, we can rename duplicated sequences. We can use seqkit for this, which appends '_N' to duplicated sequence IDs to make them unique.
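
The renaming idea can be illustrated in a few lines of Python. This is a sketch of the general approach, not seqkit's actual implementation; it assumes the first occurrence keeps its ID and later copies get `_2`, `_3`, and so on. The function name `rename_duplicates` is hypothetical.

```python
from collections import Counter


def rename_duplicates(ids):
    """Make IDs unique by appending _N to the N-th occurrence (N >= 2)."""
    seen = Counter()
    unique = []
    for seq_id in ids:
        seen[seq_id] += 1
        count = seen[seq_id]
        unique.append(seq_id if count == 1 else f"{seq_id}_{count}")
    return unique


print(rename_duplicates(["a", "b", "a", "a"]))  # ['a', 'b', 'a_2', 'a_3']
```

Note that this simple version does not guard against an input that already contains an ID such as `a_2`.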

In [19]:
# Folder containing output fasta files
DATASET = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['ROOT']

In [20]:
%%bash -s "$WORKING_DATASET" "$DATASET"

for i in "$1"/*all.fasta; do
    gene=$(basename $i | cut -f 1 -d '.')
    cat $i | \
    seqkit rename > "$2"/"$gene".fasta
done

Before running the trees, we probably need to format the fasta headers so that RAxML (or any tree inference programme) is happy. That means replacing ';' and '.' with '-'. 

We also edited our fasta headers so they would work with PhyloFisher (we ended up not using the PhyloFisher workflow, as it removes all hits to bacterial orthogroups when collecting homologs via the fisher algorithm, which might not be appropriate for plastid genes). Since PhyloFisher expects the fasta headers to be identical across all genes for each taxon, we need to drop the gene name from the fasta header. We still want to keep the protein ID or contig, so we place it at the end of the header after an underscore (as PhyloFisher seems to ignore everything after the underscore).
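
The header transformation can be traced step by step in Python. This is a sketch under assumptions: the example header is hypothetical (the exact field layout of the real headers is inferred from the patterns used in this notebook), and the function `format_header` is not part of the workflow. It only handles header lines, whereas the shell pipeline runs over whole files.

```python
import re


def format_header(header):
    """Apply the header clean-up steps, in order, to a single fasta header line."""
    header = re.sub(r"gene=.*;accession", "accession", header, count=1)  # drop the gene name
    header = header.replace("_", "-")                                    # free up '_' as a separator
    header = header.replace(";protein", "_protein", 1)                   # keep protein ID after '_'
    header = header.replace(";contig", "_contig", 1)                     # or the contig name
    header = header.replace(";", "-").replace(".", "-")                  # remaining special characters
    return header


# Hypothetical header, for illustration only.
raw = ">taxo=Eukaryota;Sar;Stramenopiles;gene=atpA;accession=NC_016735;protein=YP_000001.1"
print(format_header(raw))
# >taxo=Eukaryota-Sar-Stramenopiles-accession=NC-016735_protein=YP-000001-1
```

The order matters: underscores are replaced before the single `_protein` or `_contig` separator is introduced, so that separator is the only underscore left in the header.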

In [22]:
%%bash -s "$DATASET"

for i in "$1"/*fasta; do
    cat $i | sed -E 's/gene=.*;accession/accession/' | \
    tr '_' '-' | \
    sed -E 's/;protein/_protein/' | \
    sed -E 's/;contig/_contig/' | \
    tr ';' '-' | tr '.' '-' \
    > txt
    mv txt $i
done

### 2.2 Align, trim, and infer single gene trees

For our preliminary reference SGTs, we align the genes with mafft --auto. 

In [23]:
from gene_iterator import GeneIterator

In [24]:
# Folder with extracted gene dataset
DATASET = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['ROOT']

# Read_genes
JANO_FULL = paths_dict['DATABASES']['GENE_LISTS']['JANO_FULL']
genes = set(line.split()[0].strip() for line in open(JANO_FULL, "r"))

# Read genes
GENES_ADD = paths_dict['DATABASES']["GENE_LISTS"]["GENES_ADD"]
genes_ADD = set(line.split()[0].strip() for line in open(GENES_ADD, "r"))

# Directory for mafft output
MAFFT_DIR = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['ALIGNMENTS']['PRELIM_REF']['MAFFT']

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['ALIGNMENTS']['PRELIM_REF']['MAFFTLOG']


In [None]:
gi = GeneIterator(DATASET, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi = GeneIterator(DATASET, gene_list=genes_ADD, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_mafft(MAFFT_DIR, SLURMLOG)

I quickly scanned a few of the alignments and then trimmed them with trimal, using a gap threshold of 0.8 (aggressive trimming).
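
To make the threshold concrete, here is a small Python sketch of gap-based column trimming. It assumes the 0.8 value corresponds to trimAl's `-gt` option, that is, a column is kept only if at least 80% of the sequences have a residue (not a gap) at that position. How `run_trimal` actually calls trimal is not shown in this notebook, so treat this as an illustration of the idea rather than the exact procedure.

```python
def trim_gappy_columns(alignment, min_non_gap=0.8):
    """Keep only columns where the fraction of non-gap characters is >= min_non_gap."""
    seqs = list(alignment.values())
    n_seqs = len(seqs)
    keep = [
        col
        for col in range(len(seqs[0]))
        if sum(seq[col] != "-" for seq in seqs) / n_seqs >= min_non_gap
    ]
    return {name: "".join(seq[col] for col in keep) for name, seq in alignment.items()}


# Toy alignment (hypothetical): column 4 has a residue in only 1 of 5 sequences.
toy = {
    "t1": "MKV-L",
    "t2": "MKVAL",
    "t3": "MK--L",
    "t4": "MKV-L",
    "t5": "M-V-L",
}
trimmed = trim_gappy_columns(toy)
print(trimmed["t2"])  # MKVL
```

Columns 2 and 3 each have one gap in five sequences (80% occupancy) and are kept, while the fourth column (20% occupancy) is removed.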

In [28]:
# Directory for mafft files
MAFFT_DIR = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['ALIGNMENTS']['PRELIM_REF']['MAFFT']

# Read_genes
JANO_FULL = paths_dict['DATABASES']['GENE_LISTS']['JANO_FULL']
genes = set(line.split()[0].strip() for line in open(JANO_FULL, "r"))

# Read genes
GENES_ADD = paths_dict['DATABASES']["GENE_LISTS"]["GENES_ADD"]
genes_add = set(line.split()[0].strip() for line in open(GENES_ADD, "r"))

# Directory for trimal output
TRIMAL_DIR = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['ALIGNMENTS']['PRELIM_REF']['TRIMAL']

# Slurmlog csv
SLURMLOG = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['ALIGNMENTS']['PRELIM_REF']['TRIMALLOG']


In [None]:
gi = GeneIterator(MAFFT_DIR, gene_list=genes, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi = GeneIterator(MAFFT_DIR, gene_list=genes_add, suffix="fasta")
gi.unlock_pipeline()

In [None]:
gi.run_trimal(TRIMAL_DIR, SLURMLOG, MAFFT_DIR)

Now we are ready to run trees! For these preliminary trees, we'll use raxml-ng v1.2 with the LG4X model and 100 rapid bootstrap searches.

In [31]:
# Directory for raxml trees
TREES = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['TREES']['PRELIM_REF']

We manually created soft links for the processed fasta files (from TRIMAL_DIR) in the TREES directory.

We can now submit jobs to run the trees.

In [None]:
%%bash -s "$TREES" 

for i in "$1"/*fasta; do
    gene=$(basename $i | cut -f 1 -d '.')
    echo $gene
    sbatch /crex/proj/naiss2023-6-81/Mahwash/beta-Cyclocitral/uppmax_scripts/script_bin/job_raxml.sh "$i" "$gene"
    sleep 1
done 

### 2.3 Manually go through the single gene trees

We manually checked the single gene trees. Once checked, each tree was saved as `{file name}.checked`.

Tips were marked in red (#ff0000) if they were duplicates (each taxon should have only one ortholog) or if they clearly belonged to another gene (very long branches).

We marked paralogs in blue (#0000ff). As these are plastid genes, there weren't really any paralogs in the genes we looked at. The exception was the clpC gene, where one distinct clade was made up of clpB sequences, which we marked in blue. This will help distinguish between the clpC and clpB genes when we extract sequences from the MAGs.

## 3. Get clean reference sequences for each gene

### 3.1 Remove contaminants

To get clean sequences for each gene, we remove sequences marked in red or blue in the previous step. 
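
The filtering logic can be sketched in Python as well. This is an illustration under assumptions: the tree file is assumed to store coloured tips as lines like `'label'[&!color=#ff0000]`, and the sample tree block, sample fasta, and function names are all hypothetical.

```python
import re


def marked_tips(tree_text):
    """Return tip labels from lines that carry a colour annotation."""
    tips = []
    for line in tree_text.splitlines():
        if "color=" not in line:
            continue
        match = re.match(r"\s*(.*?)\[.*\]\s*$", line)
        if match:
            tips.append(match.group(1).replace("'", "").strip())
    return tips


def drop_marked(fasta_text, marked):
    """Drop records whose ID is marked; strip the _protein/_contig suffix from kept headers."""
    marked = set(marked)
    kept, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            keep = line[1:].split()[0] not in marked
            if keep:
                line = re.sub(r"_(protein|contig).*", "", line)
        if keep:
            kept.append(line)
    return "\n".join(kept)


# Hypothetical inputs, for illustration only.
TREE = (
    "begin taxa;\n"
    "\ttaxlabels\n"
    "\t'taxA_protein=P1'[&!color=#ff0000]\n"
    "\ttaxB_protein=P2\n"
    "\t'taxC_protein=P3'[&!color=#0000ff]\n"
    ";\nend;\n"
)
FASTA = ">taxA_protein=P1\nMKV\n>taxB_protein=P2\nMKL\n>taxC_protein=P3\nMRL\n"

marked = marked_tips(TREE)
print(marked)                      # ['taxA_protein=P1', 'taxC_protein=P3']
print(drop_marked(FASTA, marked))  # prints ">taxB" then "MKL"
```

Both red and blue tips are removed here, since any line with a colour annotation counts as marked.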

In [3]:
# Gene list
JANO_FULL = paths_dict['DATABASES']['GENE_LISTS']['JANO_FULL']

# Read genes
GENES_ADD = paths_dict['DATABASES']["GENE_LISTS"]["GENES_ADD"]

# Directory for raxml trees
TREES = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['TREES']['PRELIM_REF']

# Directory for input fasta files (unaligned and untrimmed)
DATASET = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['WORKING']['ROOT']

# Out directory
FINAL = paths_dict['ANALYSIS_DATA']['SINGLE_GENE_ANALYSIS']['DATASET']['REF']['FINAL']

In [None]:
%%bash -s "$JANO_FULL" "$TREES" "$DATASET" "$FINAL"

## Get gene list
genes=$(cut -f 1 $1 | grep -v "gene" | sort -u | tr '\n' ' ')

## Loop over gene list, get list of marked sequences from trees, and remove the marked sequences 
## from the corresponding fasta files to get a clean fasta file for each gene
for gene in $genes
do
    echo $gene 
    grep "color=" "$2"/RAxML_bipartitions."$gene".selected | \
    sed -E 's/\s+(.*)\[.*\]/\1/' | \
    tr -d "'" \
    > "$2"/marked.list
    seqkit grep -f "$2"/marked.list "$3"/"$gene".fasta -v | \
    sed -E 's/(>.*)_protein.*/\1/' | \
    sed -E 's/(>.*)_contig.*/\1/' \
    > "$4"/"$gene".fas
done

In [None]:
%%bash -s "$GENES_ADD" "$TREES" "$DATASET" "$FINAL"

## Get gene list
genes=$(cat $1 | tr '\n' ' ')

## Loop over gene list, get list of marked sequences from trees, and remove the marked sequences 
## from the corresponding fasta files to get a clean fasta file for each gene
for gene in $genes
do
    echo $gene 
    grep "color=" "$2"/"$gene".raxml.support.checked | \
    sed -E 's/\s+(.*)\[.*\]/\1/' | \
    tr -d "'" \
    > "$2"/marked.list
    seqkit grep -f "$2"/marked.list "$3"/"$gene".fasta -v | \
    sed -E 's/(>.*)_protein.*/\1/' | \
    sed -E 's/(>.*)_contig.*/\1/' \
    > "$4"/"$gene".fas
done

At this point, I noticed that one of the taxa had a `/` in its header. To avoid confusion, I changed the header as follows:

In [5]:
%%bash -s "$FINAL"

for i in "$1"/*fas
do 
    cat $i | \
    sed -E 's/taxo=Eukaryota-Sar-Stramenopiles-Ochrophyta-Eustigmatophyceae-sp-Chic-10\/23-P-6w-accession=NC-040296/taxo=Eukaryota-Sar-Stramenopiles-Ochrophyta-Eustigmatophyceae-sp-accession=NC-040296/' \
    > fasta
    mv fasta "$i"ta
done

How many sequences do we have for each gene?

In [None]:
%%bash -s "$FINAL"

seqkit stats "$1"/*fasta

## References

Janouškovec, J., Horák, A., Oborník, M., Lukeš, J., & Keeling, P. J. (2010). A common red algal origin of the apicomplexan, dinoflagellate, and heterokont plastids. Proceedings of the National Academy of Sciences, 107(24), 10949-10954. https://doi.org/10.1073/pnas.1003335107

Le Corguillé, G., Pearson, G., Valente, M., Viegas, C., Gschloessl, B., Corre, E., ... & Leblanc, C. (2009). Plastid genomes of two brown algae, Ectocarpus siliculosus and Fucus vesiculosus: further insights on the evolution of red-algal derived plastids. BMC Evolutionary Biology, 9(1), 1-14. https://doi.org/10.1186/1471-2148-9-253

Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE, 11(10), e0163962. https://doi.org/10.1371/journal.pone.0163962

Rambaut, A. (2009). FigTree. Tree figure drawing tool. http://tree.bio.ed.ac.uk/software/figtree/.

Tice, A. K., Žihala, D., Pánek, T., Jones, R. E., Salomaki, E. D., Nenarokov, S., ... & Brown, M. W. (2021). PhyloFisher: a phylogenomic package for resolving eukaryotic relationships. PLoS Biology, 19(8), e3001365. https://doi.org/10.1371/journal.pbio.3001365

Pietluch, F., Mackiewicz, P., Ludwig, K., & Gagat, P. (2024). A New Model and Dating for the Evolution of Complex Plastids of Red Alga Origin. Genome Biology and Evolution, evae192. https://doi.org/10.1093/gbe/evae192

Ponce-Toledo, R. I., Deschamps, P., López-García, P., Zivanovic, Y., Benzerara, K., & Moreira, D. (2017). An early-branching freshwater cyanobacterium at the origin of plastids. Current Biology, 27(3), 386-391. https://doi.org/10.1016/j.cub.2016.11.056 

