# psbO gene   

This is an attempt to link our ptMAGs to nuclear genetic data. The psbO gene encodes the manganese-stabilising polypeptide of the photosystem II oxygen-evolving complex. It is a single-copy gene found in nuclear genomes.

In [1]:
# Check that the Python version is 3.10.5
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__


print(sys.version)
%load_ext autoreload
%autoreload 2

3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]


In [2]:
# We store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# Use a context manager so the file handle is closed after reading
with open(PATH_FILE, "r") as handle:
    paths_dict = json.load(handle)

## 1. Most abundant psbO sequences in filters where Lepto-01 is abundant

We start by focusing on the filters where Lepto-01 is most abundant: 

- 194SUR1GGZZ11 (mean coverage of NEW = 124.15, 6th most abundant plastid lineage)  
- 194SUR0CCKK11 (mean coverage of NEW = 70.54, 3rd most abundant plastid lineage)  

Let's start by extracting the top 50 most abundant psbO sequences in these two filters.

I downloaded the psbO database provided by [Karlusich et al. 2022](https://doi.org/10.1111/1755-0998.13592), which also includes the abundance of each sequence per Tara filter.

### 1.1 Extract most abundant psbO sequences

In [3]:
## Path to psbO database, and sequence list
DATABASE = paths_dict["DATABASES"]["PSBO"]["ROOT"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ABUNDANT_PSBO"]["DATA"]

In [4]:
%%bash -s "$DATABASE" "$OUT_DIR"

cat "$1"/PsbO_metaG.tsv | \
tr '\t' '@' | \
grep "194@SUR@1@GGZZ@11" | \
tr '@' '\t' | \
sort -nrk 7 | \
head -50 | \
cut -f 6 \
> "$2"/194SUR1GGZZ11_psbO_top50.list

cat "$1"/PsbO_metaG.tsv | \
tr '\t' '@' | \
grep "194@SUR@0@CCKK@11" | \
tr '@' '\t' | \
sort -nrk 7 | \
head -50 | \
cut -f 6 \
> "$2"/194SUR0CCKK11_psbO_top50.list

In [5]:
%%bash -s "$DATABASE" "$OUT_DIR"

seqkit grep -f "$2"/194SUR1GGZZ11_psbO_top50.list "$1"/psbO_20210825.fna > "$2"/194SUR1GGZZ11_psbO_top50.fasta
seqkit grep -f "$2"/194SUR0CCKK11_psbO_top50.list "$1"/psbO_20210825.fna > "$2"/194SUR0CCKK11_psbO_top50.fasta

[INFO] 50 patterns loaded from file
[INFO] 50 patterns loaded from file
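
The selection of the most abundant sequences per filter can also be sketched in plain Python. This is only an illustration, not the code used in this notebook. The column layout is an assumption inferred from the shell pipeline above (five leading columns identifying the filter, the sequence ID in column 6, the ranking value in column 7), and `top_sequences` and the demo rows are hypothetical. Comparing whole fields avoids a possible pitfall of the `grep` approach, where an unanchored pattern could in principle also match a longer station or fraction code.

```python
import csv
import io


def top_sequences(tsv_handle, sample_fields, n=50, id_col=5, value_col=6):
    """Return the n highest-ranked (sequence_id, value) pairs for one filter.

    Assumed layout (inferred from the shell pipeline, not from the database
    documentation): the leading columns identify the filter, column 6 holds
    the sequence ID and column 7 the value used for ranking.
    """
    wanted = list(sample_fields)
    rows = []
    for fields in csv.reader(tsv_handle, delimiter="\t"):
        # Compare whole fields, so station "194" never matches "1194".
        if fields[: len(wanted)] == wanted:
            rows.append((fields[id_col], float(fields[value_col])))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Made-up rows, for illustration only.
demo = io.StringIO(
    "194\tSUR\t1\tGGZZ\t11\tseqA\t10\n"
    "194\tSUR\t1\tGGZZ\t11\tseqB\t250\n"
    "194\tSUR\t0\tCCKK\t11\tseqC\t999\n"
    "1194\tSUR\t1\tGGZZ\t11\tseqD\t500\n"
)
top = top_sequences(demo, ("194", "SUR", "1", "GGZZ", "11"), n=2)
print(top)
```

With the real table, the handle would come from `open(...)` on `PsbO_metaG.tsv` and `n` would stay at its default of 50.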


In [None]:
%%bash -s "$OUT_DIR"

seqkit stats "$1"/194SUR1GGZZ11_psbO_top50.fasta
seqkit stats "$1"/194SUR0CCKK11_psbO_top50.fasta

Let's add some information to our fasta files, so that they contain: (1) order of abundance, (2) number of reads mapped.

In [7]:
%%bash -s "$DATABASE" "$OUT_DIR"

## Create a file with the abundance rank and the number of reads
cat "$1"/PsbO_metaG.tsv | \
tr '\t' '@' | \
grep "194@SUR@1@GGZZ@11" | \
tr '@' '\t' | \
sort -nrk 7 | \
head -50 | \
cat -n | \
cut -f1,7,8 | \
sed -E 's/([0-9]+)\t(.*)\t([0-9]+)/\2\trank=\1_reads=\3/' | \
awk '{$1=$1;print}' | \
tr ' ' '\t' \
> "$2"/194SUR1GGZZ11_psbO_top50.metadata

## Create a file with the abundance rank and the number of reads
cat "$1"/PsbO_metaG.tsv | \
tr '\t' '@' | \
grep "194@SUR@0@CCKK@11" | \
tr '@' '\t' | \
sort -nrk 7 | \
head -50 | \
cat -n | \
cut -f1,7,8 | \
sed -E 's/([0-9]+)\t(.*)\t([0-9]+)/\2\trank=\1_reads=\3/' | \
awk '{$1=$1;print}' | \
tr ' ' '\t' \
> "$2"/194SUR0CCKK11_psbO_top50.metadata

In [9]:
%%bash -s "$OUT_DIR"

## Create a file with the taxonomy
grep ">" "$1"/194SUR1GGZZ11_psbO_top50.fasta | \
tr -d '>' | \
cut -f 1,2 -d ' ' | \
sed -E 's/(.*) (.*)/\1\t\1_tax=\2/' | \
tr ';' '_' \
> "$1"/194SUR1GGZZ11_psbO_top50.taxonomy

## Create a file with the taxonomy
grep ">" "$1"/194SUR0CCKK11_psbO_top50.fasta | \
tr -d '>' | \
cut -f 1,2 -d ' ' | \
sed -E 's/(.*) (.*)/\1\t\1_tax=\2/' | \
tr ';' '_' \
> "$1"/194SUR0CCKK11_psbO_top50.taxonomy


I had to manually edit the taxonomy files a bit, as the column content was not consistent across the fasta headers from which these files were generated.

Now generate the file to rename the fasta headers.
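
The join and the header renaming done in the next two shell cells can be sketched in plain Python, which makes it easy to see which IDs are lost in the join. This is a minimal sketch under assumptions: `build_rename_map` and `rename_fasta` are hypothetical helpers, the records are made up, and keeping headers without a match unchanged is my choice here (it is not necessarily what `seqkit replace` does by default).

```python
def build_rename_map(taxonomy, metadata):
    """Inner-join two {sequence_id: label} mappings into new header names."""
    return {
        seq_id: f"{taxonomy[seq_id]}_{metadata[seq_id]}"
        for seq_id in taxonomy
        if seq_id in metadata
    }


def rename_fasta(lines, rename):
    """Yield FASTA lines, replacing each header whose ID is in `rename`."""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            header = line[1:]
            parts = header.split(maxsplit=1)
            seq_id = parts[0] if parts else ""
            # Assumption: headers without an entry are kept unchanged.
            yield ">" + rename.get(seq_id, header)
        else:
            yield line


# Made-up records, shaped like the .taxonomy and .metadata files above.
taxonomy = {"seqA": "seqA_tax=Eukaryota_Haptophyta"}
metadata = {"seqA": "rank=2_reads=10", "seqB": "rank=1_reads=250"}
rename = build_rename_map(taxonomy, metadata)

# IDs present in only one of the two tables (dropped by the join).
unmatched = sorted(set(taxonomy) ^ set(metadata))

fasta = [">seqA Eukaryota;Haptophyta", "ATGC", ">seqB", "GGCC"]
renamed = list(rename_fasta(fasta, rename))
print(unmatched)
print(renamed)
```

Listing `unmatched` is a quick check after manually editing the taxonomy files, since an inconsistent ID silently drops the record from the join.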

In [10]:
%%bash -s "$OUT_DIR"

join <(sort -k1 "$1"/194SUR1GGZZ11_psbO_top50.taxonomy) <(sort -k1 "$1"/194SUR1GGZZ11_psbO_top50.metadata) | \
tr ' ' '\t' | \
sed -E 's/(.*)\t(.*)\t(.*)/\1\t\2_\3/' \
> "$1"/194SUR1GGZZ11_psbO_top50.rename

join <(sort -k1 "$1"/194SUR0CCKK11_psbO_top50.taxonomy) <(sort -k1 "$1"/194SUR0CCKK11_psbO_top50.metadata) | \
tr ' ' '\t' | \
sed -E 's/(.*)\t(.*)\t(.*)/\1\t\2_\3/' \
> "$1"/194SUR0CCKK11_psbO_top50.rename

Rename the fasta headers now!

In [None]:
%%bash -s "$OUT_DIR"

seqkit replace -p '^(\S+)(.+?)$' -r '{kv}' -k "$1"/194SUR1GGZZ11_psbO_top50.rename "$1"/194SUR1GGZZ11_psbO_top50.fasta \
> "$1"/194SUR1GGZZ11_psbO_top50.renamed.fasta 

seqkit replace -p '^(\S+)(.+?)$' -r '{kv}' -k "$1"/194SUR0CCKK11_psbO_top50.rename "$1"/194SUR0CCKK11_psbO_top50.fasta \
> "$1"/194SUR0CCKK11_psbO_top50.renamed.fasta 

Finally, we can add the reference psbO sequences to the extracted sequences in order to build a phylogeny. 

In [12]:
%%bash -s "$DATABASE" "$OUT_DIR"

## extract mmetsp psbO sequences
seqkit grep -nf "$1"/mmetsp.list "$1"/psbO_20210825.fna > "$1"/mmetsp.fasta

## add to sequences from each filter
cat "$1"/mmetsp.fasta "$2"/194SUR1GGZZ11_psbO_top50.renamed.fasta > "$2"/194SUR1GGZZ11_psbO_top50.reference.fasta

cat "$1"/mmetsp.fasta "$2"/194SUR0CCKK11_psbO_top50.renamed.fasta > "$2"/194SUR0CCKK11_psbO_top50.reference.fasta

[INFO] 346 patterns loaded from file


### 1.2 Align and trim sequences

In [22]:
## Path to input fasta file
DATA = paths_dict["ANALYSIS_DATA"]["PSBO"]["ABUNDANT_PSBO"]["DATA"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ABUNDANT_PSBO"]["ALIGNMENTS"]

Align with mafft-linsi!

In [None]:
%%bash -s "$DATA" "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_mafft-linsi.sh "$1"/194SUR0CCKK11_psbO_top50.reference.fasta "$2"/194SUR0CCKK11_psbO_top50.mafft.fasta

sbatch ../../uppmax_scripts/script_bin/job_mafft-linsi.sh "$1"/194SUR1GGZZ11_psbO_top50.reference.fasta "$2"/194SUR1GGZZ11_psbO_top50.mafft.fasta

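Replace spaces and semicolons with underscores, and remove parentheses and periods, in the alignment files (alignment gaps are left untouched).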
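
The `tr` calls in this step act on every line of the file, which is fine as long as the sequence lines contain none of the affected characters. A header-only variant can be sketched in Python as follows. `sanitise_header` and `sanitise_fasta` are hypothetical helpers and the example header is made up.

```python
# Character table mirroring the tr calls: spaces and semicolons become
# underscores; parentheses and periods are deleted.
_HEADER_TABLE = str.maketrans({" ": "_", ";": "_", "(": None, ")": None, ".": None})


def sanitise_header(header):
    """Return a FASTA header with problematic characters replaced or removed."""
    return header.translate(_HEADER_TABLE)


def sanitise_fasta(lines):
    """Yield FASTA lines, editing header lines only."""
    for raw in lines:
        line = raw.rstrip("\n")
        yield sanitise_header(line) if line.startswith(">") else line


example = [">MMETSP0001 Eukaryota;Haptophyta (sp. 1)", "AT--GC"]
cleaned = list(sanitise_fasta(example))
print(cleaned)
```

Restricting the edit to header lines leaves the aligned sequences, including their gap characters, exactly as MAFFT wrote them.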

In [16]:
%%bash -s "$OUT_DIR"

cat "$1"/194SUR0CCKK11_psbO_top50.mafft.fasta | \
    tr ' ' '_' | \
    tr ';' '_' | \
    tr -d '(' | tr -d ')' \
    | tr -d '.' \
    > "$1"/194SUR0CCKK11_psbO_top50.mafft.edit.fasta

cat "$1"/194SUR1GGZZ11_psbO_top50.mafft.fasta | \
    tr ' ' '_' | \
    tr ';' '_' | \
    tr -d '(' | tr -d ')' \
    | tr -d '.' \
    > "$1"/194SUR1GGZZ11_psbO_top50.mafft.edit.fasta

Trim gently with trimal.

In [None]:
%%bash -s "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_trimal_ssu.sh "$1"/194SUR0CCKK11_psbO_top50.mafft.edit.fasta "$1"/194SUR0CCKK11_psbO_top50.mafft.trimal.fasta

sbatch ../../uppmax_scripts/script_bin/job_trimal_ssu.sh "$1"/194SUR1GGZZ11_psbO_top50.mafft.edit.fasta "$1"/194SUR1GGZZ11_psbO_top50.mafft.trimal.fasta

We found a couple of duplicates in our files (not sure why...). We remove these.

In [23]:
%%bash -s "$OUT_DIR"

seqkit rmdup -n "$1"/194SUR0CCKK11_psbO_top50.mafft.trimal.fasta > temp
mv temp "$1"/194SUR0CCKK11_psbO_top50.mafft.trimal.fasta 

seqkit rmdup -n "$1"/194SUR1GGZZ11_psbO_top50.mafft.trimal.fasta > temp
mv temp "$1"/194SUR1GGZZ11_psbO_top50.mafft.trimal.fasta

[INFO] 1 duplicated records removed
[INFO] 1 duplicated records removed
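
To see which record is duplicated before removing it, the full header names can be counted. This is a small sketch with a hypothetical helper (`duplicated_names`) and made-up records. It compares complete header lines, which I believe is what `seqkit rmdup -n` does as well.

```python
from collections import Counter


def duplicated_names(lines):
    """Return {full_header: count} for headers that occur more than once."""
    names = [line[1:].strip() for line in lines if line.startswith(">")]
    return {name: count for name, count in Counter(names).items() if count > 1}


# Made-up alignment records, for illustration only.
records = [">seqA_tax=x", "ATGC", ">seqB", "GGCC", ">seqA_tax=x", "ATGC"]
dups = duplicated_names(records)
print(dups)
```

Running this on the files before and after the header editing step would show whether the duplicates are already present in the input or only appear once characters have been removed from the names.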


### 1.3 Run phylogenies!

In [24]:
## Path to input fasta file
ALIGNMENTS = paths_dict["ANALYSIS_DATA"]["PSBO"]["ABUNDANT_PSBO"]["ALIGNMENTS"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ABUNDANT_PSBO"]["TREES"]

In [None]:
%%bash -s "$ALIGNMENTS" "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_2023_10_22_raxml-ng.sh "$1"/194SUR0CCKK11_psbO_top50.mafft.trimal.fasta "$2"/194SUR0CCKK11_psbO_top50
sbatch ../../uppmax_scripts/script_bin/job_2023_10_22_raxml-ng.sh "$1"/194SUR1GGZZ11_psbO_top50.mafft.trimal.fasta "$2"/194SUR1GGZZ11_psbO_top50


## 2. Search for psbO gene in Arctic assembled metagenomes

We would like to:  

1) Search for psbO genes in the Arctic metagenomes (where NEW is most abundant)
2) Correlate the abundance of all hits against the abundance of NEW
3) Build a psbO phylogeny of potential candidates (fingers crossed we get some!). The NEW psbO gene should fall in a deep position in the tree.
4) Try and find other genes from the same contig/MAG
5) Use those to try and determine the identity of NEW

### 2.1 Get data

We downloaded the metagenomic co-assembly from the Arctic from https://www.genoscope.cns.fr/tara/. We downloaded both the contigs >2,500nt (which were used for binning), and >1,000nt.

### 2.2 Generate contigs database

We use anvi'o to generate the contigs database. This runs Prodigal in the background for gene calling. This of course may not be the best approach for eukaryotes, but it's a good start! 

In [None]:
## Path to input fasta file
DATABASE = paths_dict["DATABASES"]["PSBO"]["ARCTIC"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ARCTIC"]["CANDIDATES"]

Submit job for contigs above 2,500 nucleotides.

In [None]:
%%bash -s "$DATABASE" "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_anvio_contigs_db.sh "$1"/TARA_ARC_GGZZ_SSUU_QQSS_KKQQ_2500nt.fa "$2"/TARA_ARC_CONTIGS_2500.db

And now for the contigs above 1,000 nucleotides.

In [None]:
%%bash -s "$DATABASE" "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_anvio_contigs_db.sh "$1"/TARA_ARC_GGZZ_SSUU_QQSS_KKQQ_1000nt.fa "$2"/TARA_ARC_CONTIGS_1000.db

### 2.3 Run HMM search
We run an HMM search for the psbO gene using the profile for Pfam accession PF01716.

In [None]:
## Path to psbO HMM profile
DATABASE = paths_dict["DATABASES"]["PSBO"]["ROOT"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ARCTIC"]["CANDIDATES"]

In [None]:
%%bash -s "$DATABASE" "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_anvio_hmm_search.sh "$2"/TARA_ARC_CONTIGS_2500.db "$1"/psbOHmms

We detected 51 hits.

In [None]:
%%bash -s "$DATABASE" "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_anvio_hmm_search.sh "$2"/TARA_ARC_CONTIGS_1000.db "$1"/psbOHmms

We detected 84 hits.

### 2.4 Extract HMM hits
Now we simply extract the HMM hits!

In [None]:
## Path to psbO HMM profile
DATABASE = paths_dict["DATABASES"]["PSBO"]["ROOT"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ARCTIC"]["CANDIDATES"]

In [None]:
%%bash -s "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_anvio_hmm_extract.sh "$1"/TARA_ARC_CONTIGS_2500.db psbOHmms "$1"/psbO_TARA_ARC_CONTIGS_2500.fasta

sbatch ../../uppmax_scripts/script_bin/job_anvio_hmm_extract.sh "$1"/TARA_ARC_CONTIGS_1000.db psbOHmms "$1"/psbO_TARA_ARC_CONTIGS_1000.fasta

### 2.5 Run tree!

Now we can infer a phylogeny with the extracted psbO sequences to see where they fall in the psbO tree. First we align and trim the sequences.

In [None]:
## Path to input fasta file
DATABASE = paths_dict["ANALYSIS_DATA"]["PSBO"]["ARCTIC"]["PHYLOGENY"]["DATA"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ARCTIC"]["PHYLOGENY"]["ALIGNMENTS"]

In [None]:
%%bash -s "$DATABASE" "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_mafft-linsi.sh "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_2500.fasta "$2"/mmetsp_psbO_TARA_ARC_CONTIGS_2500.mafft.fasta

sbatch ../../uppmax_scripts/script_bin/job_mafft-linsi.sh "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_1000.fasta "$2"/mmetsp_psbO_TARA_ARC_CONTIGS_1000.mafft.fasta

We replace spaces, semicolons and colons with underscores, and remove parentheses and periods, in the fasta headers.

In [None]:
%%bash -s "$OUT_DIR"

cat "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_2500.mafft.fasta | \
    tr ' ' '_' | \
    tr ';' '_' | \
    tr -d '(' | tr -d ')' | \
    tr ':' '_' | \
    tr -d '.' \
    > "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_2500.mafft.edit.fasta

cat "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_1000.mafft.fasta | \
    tr ' ' '_' | \
    tr ';' '_' | \
    tr -d '(' | tr -d ')' | \
    tr ':' '_' | \
    tr -d '.' \
    > "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_1000.mafft.edit.fasta

Trim gently with trimal (keep columns with 90% data).

In [None]:
## Path to output folder 
DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ARCTIC"]["PHYLOGENY"]["ALIGNMENTS"]

In [None]:
%%bash -s "$DIR"

sbatch ../../uppmax_scripts/script_bin/job_trimal_ssu.sh "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_2500.mafft.edit.fasta "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_2500.mafft.trimal.fasta

sbatch ../../uppmax_scripts/script_bin/job_trimal_ssu.sh "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_1000.mafft.edit.fasta "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_1000.mafft.trimal.fasta
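
The idea behind this kind of trimming can be illustrated with a short sketch. The contents of `job_trimal_ssu.sh` are not shown in this notebook, so the exact trimAl options are unknown to me; the function below simply implements one reading of "keep columns with 90% data", namely keeping columns in which at least a given fraction of sequences has a residue. `trim_gappy_columns`, its parameters and the toy alignment are hypothetical.

```python
def trim_gappy_columns(alignment, min_occupancy=0.9, gap_chars="-"):
    """Keep columns where at least `min_occupancy` of sequences have a residue.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col
        for col in range(length)
        if sum(seq[col] not in gap_chars for seq in seqs) / len(seqs) >= min_occupancy
    ]
    return {name: "".join(seq[col] for col in keep) for name, seq in alignment.items()}


# Toy alignment; a lower threshold is used so that the effect is visible.
toy = {"a": "AT-GC", "b": "A--GC", "c": "ATCGC", "d": "A--G-"}
trimmed = trim_gappy_columns(toy, min_occupancy=0.75)
print(trimmed)
```

In the toy alignment, columns 2 and 3 are occupied in only 50% and 25% of the sequences and are dropped, while the last column (75% occupied) is kept.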

Now run a tree with raxml-ng!

In [None]:
## Path to input fasta file
ALIGNMENTS = paths_dict["ANALYSIS_DATA"]["PSBO"]["ARCTIC"]["PHYLOGENY"]["ALIGNMENTS"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["PSBO"]["ARCTIC"]["PHYLOGENY"]["TREES"]

In [None]:
%%bash -s "$ALIGNMENTS" "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_raxml-ng.sh "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_2500.mafft.trimal.fasta "$2"/psbO_arctic_2500

sbatch ../../uppmax_scripts/script_bin/job_raxml-ng.sh "$1"/mmetsp_psbO_TARA_ARC_CONTIGS_1000.mafft.trimal.fasta "$2"/psbO_arctic_1000