# Plastid MAGs

In this notebook, we annotate the plastid MAGs with MFannot and convert the resulting masterfiles to GenBank format. We began with a redundant set of 902 plastid MAGs. The final curated set of 660 non-redundant plastid MAGs is provided at the Figshare repository associated with the manuscript: https://doi.org/10.17044/scilifelab.28212173.

### Settings

In [1]:
# Check if python is 3.10.5
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__


print(sys.version)
%load_ext autoreload
%autoreload 2

3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]


In [2]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# use a context manager so the file handle is closed after loading
with open(PATH_FILE, "r") as path_handle:
    paths_dict = json.load(path_handle)
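
The nested lookups used throughout this notebook (for example `paths_dict["ANALYSIS_DATA"]["PLASTOMES"][...]`) raise a bare `KeyError` when an entry is missing from `PATHS.json`. A minimal sketch of a lookup helper that reports the full key chain is shown below. The names `load_paths`, `get_path`, and `example_paths` are hypothetical and not part of the original code, and the path in the example is made up.

```python
import json


def load_paths(path_file):
    """Load the JSON path registry and close the file handle afterwards."""
    with open(path_file, "r") as handle:
        return json.load(handle)


def get_path(paths, *keys):
    """Walk nested keys; on failure, name the key chain that was missing."""
    node = paths
    for depth, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            chain = " -> ".join(keys[: depth + 1])
            raise KeyError(f"Missing entry in path registry: {chain}")
        node = node[key]
    return node


# Hypothetical registry fragment (the real PATHS.json is not shown here)
example_paths = {"DATABASES": {"TARA_PLASTID_GENOMES": "/hypothetical/path/to/mags"}}
smp_dir = get_path(example_paths, "DATABASES", "TARA_PLASTID_GENOMES")
```

With the real registry, the call would presumably look like `get_path(load_paths(PATH_FILE), "DATABASES", "TARA_PLASTID_GENOMES")`.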

In [18]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 150)
pd.set_option('display.max_rows', None)

Preliminary analyses tested whether MFannot performed differently when its protein collection was extended with plastomes from RefSeq (both including and excluding the highly over-represented Viridiplantae). Thomas did not find a difference in performance, so all downstream analyses were carried out with the protein collection shipped with MFannot.

## 1. Run MFannot
We annotate all the plastid MAGs with MFannot.

We first define the input and output directories.

In [None]:
## Define directory with samples (SMP_DIR). Paths are defined in PATHS.json in the main directory.
SMP_DIR = paths_dict['DATABASES']["TARA_PLASTID_GENOMES"]

## Define output directory
MFOUT_DIR = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["MFANNOTATIONS"]["MASTERFILE"]["ROOT"]
MFSLURM_CSV = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["MFANNOTATIONS"]["MASTERFILE"]["SLURMLOG"]

## Define MFannot database
PROTEIN_COLLECTION_DB = paths_dict["DATABASES"]["MF_ANNOT_REFS"]["PROTEINS"]

For all our genomes we start a batch of MFannot runs. Jobs are submitted through the Uppmax `SlurmSubmitter` and, as defined above, are tracked in `/crex/proj/naiss2023-6-81/Mahwash/ptMAGs/00_plastid_MAGS/results/mfannot/2023_10_mfannot_slurmlog.csv`.

In [13]:
from plastome_raw_data import PlastomeRawIterator

In [None]:
pri = PlastomeRawIterator(SMP_DIR, suffix="fa")

pri.run_mfannot(MFOUT_DIR, MFSLURM_CSV, PROTEIN_COLLECTION_DB, force=False, restart_fails=False)

### Track jobs
Jobs can be tracked below:

In [5]:
from dateutil import parser
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_theme(style="whitegrid", palette="pastel")

In [None]:
job_df = pd.read_csv(MFSLURM_CSV)

job_df.sort_values(by=["duration"])
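
Sorting shows the individual jobs, but a short summary can make slow or stalled runs easier to spot. The sketch below assumes the `duration` column holds Slurm-style elapsed strings (`[D-]HH:MM:SS`), which the notebook does not confirm. The function names and the example values are hypothetical.

```python
from datetime import timedelta
from statistics import median


def parse_slurm_duration(text):
    """Parse '[D-]HH:MM:SS' or 'MM:SS' into a timedelta (assumed format)."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    # Pad missing leading fields, e.g. 'MM:SS' becomes [0, MM, SS]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def summarize_durations(durations):
    """Return the job count plus median and longest run time."""
    seconds = [parse_slurm_duration(d).total_seconds() for d in durations]
    return {
        "n_jobs": len(seconds),
        "median": timedelta(seconds=median(seconds)),
        "longest": timedelta(seconds=max(seconds)),
    }


# Made-up durations for illustration only
summary = summarize_durations(["00:05:00", "00:15:00", "1-00:00:10"])
```

If the column really is in this format, `summarize_durations(job_df["duration"].dropna())` would apply the same summary to the job log.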

## 2. Convert to GenBank Format

We used the asn2gb tool from NCBI to convert the sqn files (files in ASN.1 syntax that contain the sequences, their features, and the metadata about submission to NCBI) to GenBank flat files.

In [15]:
## Define directory with sqn files (SQN_DIR). Paths are defined in PATHS.json in the main directory.
SQN_DIR = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["MFANNOTATIONS"]["MASTERFILE"]["ROOT"]

## Define output directory
GBOUT_DIR = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["MFANNOTATIONS"]["GENBANK"]["ROOT"]

## Define slurm csv to track jobs 
MFSLURM_CSV = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["MFANNOTATIONS"]["GENBANK"]["SLURMLOG"]

In [16]:
from sqn2gb import PlastomeSQNIterator

In [None]:
psi = PlastomeSQNIterator(SQN_DIR, GBOUT_DIR, MFSLURM_CSV)

psi.run_asn2gb()
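
The internals of `PlastomeSQNIterator` are not shown in this notebook. As a rough sketch of what a single conversion might look like, the helper below builds an asn2gb command line for one sqn file. It assumes the `-i` (input) and `-o` (output) options of asn2gb and a `.gb` output suffix; the function name and the example paths are hypothetical.

```python
from pathlib import Path


def build_asn2gb_command(sqn_file, gb_dir, executable="asn2gb"):
    """Build the argument list for converting one sqn file to GenBank.

    Assumes asn2gb accepts -i <input> and -o <output>; the .gb suffix
    for the output file is a choice made for this sketch.
    """
    sqn_file = Path(sqn_file)
    gb_file = Path(gb_dir) / (sqn_file.stem + ".gb")
    return [executable, "-i", str(sqn_file), "-o", str(gb_file)]


# Hypothetical paths for illustration only
command = build_asn2gb_command("masterfiles/example_mag.sqn", "genbank")
```

The resulting list can be passed to `subprocess.run(command, check=True)` or written into a Slurm job script.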

## 3. Extract proteome
We extract the proteome from the GenBank files.

In [None]:
## Define directory with GenBank files
MAG_GB = paths_dict['ANALYSIS_DATA']["PLASTOMES"]["MFANNOTATIONS"]["GENBANK"]["ROOT"]

## Define output directory to store proteomes
MAG_PROT = paths_dict['ANALYSIS_DATA']["PLASTOMES"]["MFANNOTATIONS"]["PROTEOMES"]

## Define whether the file is a reference (from NCBI) or a MAG (option between "ref" and "mag").
FILE_TYPE = "mag"

In [19]:
from gb_to_prot import mfannot_gb_prot

In [20]:
mfannot_gb_prot(MAG_GB, MAG_PROT, FILE_TYPE)

Now we have all the proteomes from the plastid MAGs!

## 4. Mitochondrial contaminants

At a much later point, we identified that one of the ptMAGs had mitochondrial contamination (thanks to R2 for discovering this after the first submission)!! This is something we had not previously checked! So we will now systematically check all our ptMAGs for mitochondrial genes. I'm curious to see if the contaminant ptMAGs are assembly chimeras (one contig contains both plastid and mitochondrial genes), or if plastid and mitochondrial contigs were accidentally binned together. 

### 4.1 Search for mito genes
I put together a list of canonical mitochondrial genes from [Butenko et al 2024](https://doi.org/10.1186/s12915-024-01824-1), containing 25 genes. 

I then searched all the GenBank files for the mito genes. 

In [None]:
grep -f mfannot/mito_genes.txt mfannot/genbank/* > mfannot/mito_contaminants.tsv

cat mfannot/mito_contaminants.tsv | cut -f1 -d ' ' | uniq | wc -l

51 redundant ptMAGs had mitochondrial gene contaminants. Manual checking revealed that this corresponded to only 6 non-redundant ptMAGs in our final database (and 45 redundant ptMAGs that are not in the final database). Interestingly, this indicates that reducing redundancy takes care of most low-quality ptMAGs.

The list of contaminants/non-contaminants is stored in the file `mfannot/mito_contaminants_ptMAGS.txt`
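
The `grep -f` search above matches the gene names as substrings anywhere in a line. A stricter alternative is to match only `/gene` qualifiers and to group the hits by contig, which also helps separate chimeric contigs from binning errors. The sketch below is not the code used for the analysis. The function name and the example record are made up, and it matches gene names exactly, so any copy suffixes that MFannot may add would need extra handling.

```python
import re
from collections import defaultdict

# Matches a GenBank qualifier line such as:   /gene="nad5"
GENE_QUALIFIER = re.compile(r'^\s+/gene="([^"]+)"')


def mito_genes_by_contig(genbank_text, mito_genes):
    """Map each contig (LOCUS name) to the mitochondrial genes it carries."""
    hits = defaultdict(list)
    contig = None
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            contig = line.split()[1]
            continue
        match = GENE_QUALIFIER.match(line)
        if match and contig is not None and match.group(1) in mito_genes:
            gene = match.group(1)
            # gene and CDS features repeat the qualifier, so record it once
            # (this also collapses true duplicate copies on one contig)
            if gene not in hits[contig]:
                hits[contig].append(gene)
    return dict(hits)


# Synthetic two-contig record for illustration only
example_record = """LOCUS       C_0    4639 bp    DNA
FEATURES             Location/Qualifiers
     gene            1..700
                     /gene="atp6"
     CDS             1..700
                     /gene="atp6"
     gene            800..1900
                     /gene="cob"
//
LOCUS       C_1    3000 bp    DNA
FEATURES             Location/Qualifiers
     gene            1..300
                     /gene="psbA"
//
"""
contig_hits = mito_genes_by_contig(example_record, {"atp6", "cob", "nad5"})
```

A contig that carries only mitochondrial genes points to a binning error, whereas mitochondrial and plastid genes on the same contig would suggest a chimera or a misannotation.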

### 4.2 Manually checking for chimeras/binning errors

I now manually check the 6 NR ptMAGs. 


**1. CHL_AON_Bin_184_91_c**  
Contains 4 mitochondrial genes: nad5, nad4, nad9, nad4L

All four mito genes (nad5, nad4, nad9, nad4L) were on C_4 (6645 bp) along with trnA, trnE, and rpl14. The rpl14 gene indeed seems closer to a mitochondrial gene upon BLASTing.

**2. CHL_AON_Bin_218_19_c**  
Contains 2 mitochondrial genes: atp6 and cob.

Both genes were found on C_0 (4639 bp) along with 16S, trnM, trnX, and 23S. Only BLASTed the 16S - definitely mitochondrial.

**3. CHL_AOS_Bin_125_33_c**  
Contains the mito gene cox11.

This gene was found on C_4 (3385 bp) along with ycf60 (copy 1), psbI, ycf12, psaC, trnL, trnC, ycf60 (copy 2), rpl19, and ycf46. 

This struck me as a bit strange as the contig clearly contains genes involved in photosynthesis. Could it be a misassembly? I BLASTed the cox11 amino acid and nucleotide sequence but found no hits. Could be a misannotation. 

**4. CHL_ARC_Bin_238_20_c**  
Contains 2 mito genes: cox1 and atp8. 

Both genes were found on C_0 (12507 bp) along with orf145, orf170, orf168, orf141, 23S (copy 1), trnM (copy 1), 16S (copy 1), 5S, orf116, trnY, orf159, orf102, and tatC. The 16S sequence BLASTed to some Acetobacteria (with only 80% similarity). Unclear, but my guess is that the whole contig is mitochondrial. 

**5. CHL_ION_Bin_67_7_c**  
Contains 12 mitochondrial genes: atp6 (copy 1), cox3 (copy 1), atp8, nad3, nad2 (copy 1), nad5, cob, nad9, nad2 (copy 2), nad6, atp6 (copy 2), cox3 (copy 2). 

atp6, cox3, atp8, nad3, and nad2 were on C_3 (3454 bp) along with rps14.  
nad5, cob, nad9, nad2, nad6, atp6, and cox3 were on C_17 (6511 bp) along with trnR, trnF, and trnS.

**6. CHL_IOS_Bin_126_17_c**  
Contains 1 mito gene: nad10.

Found on C_2 (3399 bp) along with rps4 (copy 1), trnM, trnH, trnF, trnC, rps13, trnM, 23S, and orf102. Both the 23S and rps4 gene clearly BLASTed to a mitochondrial genome. 


All the contaminating contigs were removed. 

## Final ptMAGs dataset

The final dataset consisted of 660 ptMAGs that were:
- non-redundant (less than 98% average nucleotide identity with other ptMAGs and our selected references)  
- checked for mitochondrial contamination  
- checked for chimeras (through single gene tree inference)  
- less than 15% redundant based on 44 core genes 

## References

Lang, B. F., Beck, N., Prince, S., Sarrasin, M., Rioux, P., & Burger, G. (2023). Mitochondrial genome annotation with MFannot: a critical analysis of gene identification and gene model prediction. Frontiers in Plant Science, 14.

Beck, N., & Lang, B. F. (2010). MFannot, organelle genome annotation webserver. Available at: https://github.com/BFL-lab/Mfannot.

asn2gb tool from NCBI. Available at: https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/asn2gb/ 

Butenko, A., Lukeš, J., Speijer, D., & Wideman, J. G. (2024). Mitochondrial genomes revisited: why do different lineages retain different genes? BMC Biology, 22(1), 15.
