# References

In this notebook, we obtain the reference plastomes to be used in downstream phylogenetic analyses of plastid genomes.  

### Settings

In [1]:
# Check if python is 3.10.5
import json
import os
import pandas as pd
import sys
import __init__

print(sys.version)
%load_ext autoreload
%autoreload 2

3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]


In [2]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

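# Load the paths with a context manager so the file handle is closed afterwards
with open(PATH_FILE, "r") as handle:
    paths_dict = json.load(handle)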
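
For orientation, the lookups in this notebook imply a nested layout for `PATHS.json` roughly like the sketch below. The key names are taken from the cells in this notebook; every path value is a hypothetical placeholder, and the real file may well contain additional keys.

```python
import json

# Hypothetical PATHS.json content: key names come from this notebook,
# path values are placeholders only.
example_paths = {
    "ANALYSIS_DATA": {
        "REFERENCE_ORGANIZATION": {
            "FASTA_FILES": "/path/to/references/fasta",
            "GENBANK_FILES": "/path/to/references/genbank",
            "PROTEOMES": "/path/to/references/proteomes",
            "LIST": "/path/to/references/accessions.txt",
            "TAXONOMY": "/path/to/references/taxonomy.tsv",
            "MF_ANNOT_REFS": {
                "ROOT": "/path/to/references/mfannot",
                "SLURMLOG": "/path/to/references/mfannot/slurm_mfannot.csv",
                "SLURMLOG_GB": "/path/to/references/mfannot/slurm_asn2gb.csv",
            },
        }
    },
    "DATABASES": {"MF_ANNOT_REFS": {"PROTEINS": "/path/to/databases/proteins"}},
}

# Round-trip through JSON to mimic loading the file, then look up one path.
loaded = json.loads(json.dumps(example_paths))
fasta_dir = loaded["ANALYSIS_DATA"]["REFERENCE_ORGANIZATION"]["FASTA_FILES"]
print(fasta_dir)
```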

## 1. Retrieve reference plastomes/genomes
The reference plastomes and genomes used in downstream phylogenetic analyses were selected manually, mostly from RefSeq, over several rounds of preliminary phylogenies. The aim was to:
- include plastomes closely related to the ptMAGs,
- cover the broad diversity of plastid-bearing lineages,
- exclude very fast-evolving lineages (such as some dinoflagellate groups) that can distort phylogenetic analyses through long-branch attraction.

The final list of selected references contained 178 accession numbers. (NB: this number was updated later, after many more rounds of preliminary phylogenies; these can be found in the Figshare repository: https://doi.org/10.17044/scilifelab.28212173.)
 

I downloaded the references manually (in both GenBank and FASTA formats) through Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez).

This worked for 178 records. Two could not be processed:

```
Id=GCF_000315565: nuccore: Wrong UID GCF_000315565
Id=GCF_000173555: nuccore: Wrong UID GCF_000173555
```

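These two identifiers are RefSeq assembly accessions rather than nucleotide record accessions, which is why the nuccore lookup rejected them. I downloaded 'GCF_000315565' and 'GCF_000173555' separately.

I then edited the FASTA file so that each header contains only the accession number.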

In [None]:
## Fasta file directory
REF_FASTA = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["FASTA_FILES"]

In [None]:
%%bash -s "$REF_FASTA"

# Keep only the accession in each header (drop the version suffix and description)
sed -E 's/^(>[^ .]+).*/\1/' "$1"/sequence.fasta > "$1"/sequence.tmp
mv "$1"/sequence.tmp "$1"/sequence.fasta
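
The same header clean-up can be sketched in Python, which makes the rule explicit: keep everything after `>` up to the first dot or whitespace. The function name `accession_only` and the example header are illustrative assumptions, not part of the project code.

```python
import re

def accession_only(header):
    """Reduce a FASTA header to '>' plus the accession, without version or description."""
    match = re.match(r">([^\s.]+)", header)
    if match is None:
        raise ValueError(f"Not a FASTA header: {header!r}")
    return ">" + match.group(1)

# Illustrative header in the usual NCBI style (accession.version, then a description)
cleaned = accession_only(">NC_022600.1 Gloeobacter kilaueensis JS1, complete genome")
print(cleaned)
```

Anchoring the pattern at the start of the header means that a dot followed by a digit later in the description (for example a strain name) cannot change where the header is cut.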

I then split this fasta file into multiple files, with one plastid genome per file, and the file name corresponding to the sequence header. 

In [None]:
%%bash -s "$REF_FASTA"

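# At every header line, start a new output file named <header>.fa and write the
# header plus all following sequence lines to it.
# Note: the per-genome files are created in the current working directory.
awk '/^>/{if(N){close(N)} N=substr($1,2) ".fa"; print > N; next;} {if(N) print >> N}' "$1"/sequence.fasta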

rm "$1"/sequence.fasta
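
For readers less familiar with awk, the splitting logic can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the code used in the project: `split_fasta` is a hypothetical name, the records are made up, and the output directory is passed explicitly.

```python
import tempfile
from pathlib import Path

def split_fasta(multi_fasta, out_dir):
    """Write each record of multi_fasta to out_dir/<header>.fa and return the new paths."""
    out_dir = Path(out_dir)
    written = []
    handle = None
    with open(multi_fasta) as source:
        for line in source:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                # The first word of the header (without '>') becomes the file name
                path = out_dir / f"{line[1:].split()[0]}.fa"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written

# Tiny demonstration with two made-up records in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    source_file = Path(tmp) / "sequence.fasta"
    source_file.write_text(">ACC_1\nATGC\nGGCC\n>ACC_2\nTTAA\n")
    files = split_fasta(source_file, tmp)
    contents = {path.name: path.read_text() for path in files}

print(sorted(contents))
```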

## 2. Extract proteomes
We converted the retrieved GenBank files to protein fasta files to enable downstream phylogenetic analyses. 

In [5]:
## Define directory with GenBank files
REF_GB = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["GENBANK_FILES"]

## Define output directory to store proteomes
REF_PROT = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["PROTEOMES"]

In [6]:
from gb_to_prot import refseq_gb_prot

In [7]:
refseq_gb_prot(REF_GB, REF_PROT)
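
`refseq_gb_prot` is a project-specific helper whose implementation is not shown in this notebook. To illustrate the general idea of pulling proteins out of a GenBank flat file, here is a minimal standard-library sketch. It assumes that only CDS features carrying both a `/gene` and a `/translation` qualifier are kept; the function name `cds_proteins` and the sample record are made up for the example.

```python
import re

def cds_proteins(genbank_text):
    """Yield (gene, protein) for every CDS that has both /gene and /translation."""
    # Feature keys start after exactly five spaces; qualifier lines are indented further.
    for block in re.split(r"\n {5}(?=\S)", genbank_text):
        if not block.startswith("CDS"):
            continue
        gene = re.search(r'/gene="([^"]+)"', block)
        translation = re.search(r'/translation="([^"]+)"', block)
        if gene and translation:
            # Translations wrap over several lines; remove the embedded whitespace.
            yield gene.group(1), re.sub(r"\s+", "", translation.group(1))

# Made-up record fragment: the second CDS has no /gene qualifier and is skipped.
sample = (
    "FEATURES             Location/Qualifiers\n"
    "     source          1..30\n"
    '                     /organism="Example"\n'
    "     CDS             1..12\n"
    '                     /gene="psbA"\n'
    '                     /translation="MTAIL\n'
    '                     ERRES"\n'
    "     CDS             13..30\n"
    '                     /translation="MKK"\n'
    "ORIGIN\n"
)

proteins = list(cds_proteins(sample))
for gene, protein in proteins:
    print(f">{gene}\n{protein}")
```

Under this assumption, a record whose CDS features all lack a gene name, or that has no CDS features at all, yields nothing and would produce an empty protein file.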

Now we check whether any proteome files are empty, i.e. whether any GenBank-to-FASTA conversion failed.

In [None]:
# Iterate through all files in the output directory
for filename in os.listdir(REF_PROT):
    file_path = os.path.join(REF_PROT, filename)
    
    # Check if the file is a regular file and its size is 0 bytes
    if os.path.isfile(file_path) and os.path.getsize(file_path) == 0:
        print(f"Empty file found: {filename}")

We have empty files for the accession numbers LC716140, NC_012575, LC716139 and NC_039967, as well as for the genomes GCF_000173555 and GCF_000315565. These files are empty either because the CDS sequences do not have gene information, or because the CDS sequences are missing. We deal with these sequences manually.
First, we get the corresponding nucleotide FASTA files, and then we run MFannot on them.

Run MFannot! 

In [3]:
## Define directory with fasta files for genomes that failed in the last step
## This is also the output folder for the mfannot analysis
REF_MFANNOT = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["ROOT"]

## Define slurm csv to track jobs 
MFSLURM_CSV = paths_dict["ANALYSIS_DATA"]["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["SLURMLOG"]

## Define MFannot database
PROTEIN_COLLECTION_DB = paths_dict["DATABASES"]["MF_ANNOT_REFS"]["PROTEINS"]

In [4]:
from plastome_raw_data import PlastomeRawIterator

In [None]:
pri = PlastomeRawIterator(REF_MFANNOT, suffix="fa")

pri.run_mfannot(REF_MFANNOT, MFSLURM_CSV, PROTEIN_COLLECTION_DB, force=False, restart_fails=False)

Now we can convert the MFannot output files to GenBank format. We use the asn2gb tool from NCBI to do so.

In [3]:
## Define directory with sqn files (SQN_DIR). Paths are defined in PATHS.json in the main directory.
SQN_DIR = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["ROOT"]

## Define output directory
GBOUT_DIR = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["ROOT"]

## Define slurm csv to track jobs 
MFSLURM_CSV = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["SLURMLOG_GB"]

In [4]:
from sqn2gb import PlastomeSQNIterator

In [None]:
psi = PlastomeSQNIterator(SQN_DIR, GBOUT_DIR, MFSLURM_CSV)

psi.run_asn2gb()

We track the submitted jobs by loading the SLURM log CSV and sorting the jobs by duration.

In [6]:
from dateutil import parser
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_theme(style="whitegrid", palette="pastel")

In [None]:
job_df = pd.read_csv(MFSLURM_CSV)

job_df.sort_values(by=["duration"])

Finally, we extract the proteomes from the GenBank files.

In [15]:
## Define directory with GenBank files
GBOUT_DIR = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["ROOT"]

## Define output directory to store proteomes
REF_PROT = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["PROTEOMES"]

## Define whether the file is a reference (from NCBI) or a MAG (opt between "ref" and "mag").
FILE_TYPE = "ref"

In [17]:
from gb_to_prot import mfannot_gb_prot

In [18]:
mfannot_gb_prot(GBOUT_DIR, REF_PROT, FILE_TYPE)

Now we have all the proteomes from the reference sequences!!

## 3. Get taxonomy

We have the accession numbers for all the sequences, but we don't have the taxonomy information. Let's add this now to help interpret all our phylogenies.

To get the taxonomic IDs and their associated lineage information, we use a custom Python script that looks up each accession number on NCBI and extracts the corresponding taxonomy string.

In [17]:
from accession2taxonomy import acc_to_taxo

In [18]:
## List of accession numbers
LIST = paths_dict["ANALYSIS_DATA"]['REFERENCE_ORGANIZATION']['LIST']

## Output taxonomy file
TAXONOMY = paths_dict["ANALYSIS_DATA"]['REFERENCE_ORGANIZATION']['TAXONOMY']

## Type of database to query
DATABASE = "nuccore"

In [None]:
acc_to_taxo(LIST, TAXONOMY, DATABASE)
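
`acc_to_taxo` is a project helper and its implementation is not shown here. As a rough illustration of the formatting step only (no NCBI query is made), the sketch below turns a semicolon-separated lineage and an organism name into a single underscore-joined taxonomy string. The function name `taxonomy_string`, the input format, and the rule that drops a trailing genus already present in the organism name are all assumptions.

```python
def taxonomy_string(lineage, organism):
    """Join a semicolon-separated lineage and an organism name with underscores."""
    ranks = [part.strip().rstrip(".") for part in lineage.split(";") if part.strip()]
    words = organism.split()
    # Avoid repeating the genus when the lineage already ends with it
    if ranks and words and ranks[-1] == words[0]:
        ranks.pop()
    return "_".join(word for name in ranks + [organism] for word in name.split())

example = taxonomy_string(
    "Bacteria; Cyanobacteriota; Cyanophyceae; Gloeobacterales; Gloeobacteraceae; Gloeobacter.",
    "Gloeobacter kilaueensis JS1",
)
print("NC_022600", example, sep="\t")
```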

We couldn't get the taxonomy information for two accession numbers, GCF_000315565 and GCF_000173555, so we added it manually.
This is what the file looks like:

In [21]:
## head TAXONOMY file
print("".join(open(TAXONOMY).readlines()[:10]))

NC_022600	Bacteria_Cyanobacteriota_Cyanophyceae_Gloeobacterales_Gloeobacteraceae_Gloeobacter_kilaueensis_JS1
NC_005125	Bacteria_Cyanobacteriota_Cyanophyceae_Gloeobacterales_Gloeobacteraceae_Gloeobacter_violaceus_PCC_7421
CP017675	Bacteria_Cyanobacteriota_Cyanophyceae_Gloeomargaritales_Gloeomargaritaceae_Gloeomargarita_lithophora_Alchichica-D10
CP088016	Bacteria_Cyanobacteriota_Cyanophyceae_Gloeomargaritales_cyanobacterium_VI4D9
GCF_000315565	Bacteria_Cyanobacteriota_Cyanophyceae_Nostocales_Symphyonemataceae_Mastigocladopsis_repens_PCC_10914
NC_019693	Bacteria_Cyanobacteriota_Cyanophyceae_Oscillatoriophycideae_Oscillatoriales_Oscillatoriaceae_Oscillatoria_acuminata_PCC_6304
GCF_000173555	Bacteria_Cyanobacteriota_Cyanophyceae_Oscillatoriales_Sirenicapillariaceae_Limnospira_maxima_CS-328
NZ_AP014638	Bacteria_Cyanobacteriota_Cyanophyceae_Leptolyngbyales_Leptolyngbyaceae_Leptolyngbya_boryana_IAM_M-101
NC_009925	Bacteria_Cyanobacteriota_Cyanophyceae_Acaryochloridales_Acaryochloridaceae_Acary

Nice! We want to add the taxonomy information to all the FASTA headers in the proteome files (generated in the previous step). Before that, I want to change a couple of things:
1. The taxonomy strings are very long, so I will manually edit them to retain only the informative taxonomic ranks.
2. PhyloFisher does not like the underscore character, so we should replace it with a dash. (NB: in the end we did not use PhyloFisher, because it discards plastid data as bacterial contaminants.)

Let's get started! We start by creating a third column and replacing '_' with '-'.

In [23]:
%%bash -s "$TAXONOMY"
cat $1 | awk 'BEGIN{FS=OFS="\t"} {gsub("_", "-", $2)} 1' | sed -E 's/(.*)\t(.*)/\1\t\2\ttaxo=\2/' > txt
mv txt $1
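
The per-line effect of the awk and sed pipeline can be sketched in Python as follows. The function name `add_taxo_column` is hypothetical, and the sketch assumes each input line has exactly two tab-separated columns.

```python
def add_taxo_column(line):
    """Replace '_' with '-' in the taxonomy column and append a 'taxo=' copy of it."""
    accession, taxonomy = line.rstrip("\n").split("\t")
    taxonomy = taxonomy.replace("_", "-")
    return "\t".join([accession, taxonomy, f"taxo={taxonomy}"])

row = "CP088016\tBacteria_Cyanobacteriota_Cyanophyceae_Gloeomargaritales_cyanobacterium_VI4D9"
new_row = add_taxo_column(row)
print(new_row)
```

Keeping the full taxonomy in the second column and a copy in the third means the third column can be shortened by hand without losing the original string.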

Let's examine the file now.

In [24]:
## head updated TAXONOMY file
print("".join(open(TAXONOMY).readlines()[:10]))

NC-022600	Bacteria-Cyanobacteriota-Cyanophyceae-Gloeobacterales-Gloeobacteraceae-Gloeobacter-kilaueensis-JS1	taxo=Bacteria-Cyanobacteriota-Cyanophyceae-Gloeobacterales-Gloeobacteraceae-Gloeobacter-kilaueensis-JS1
NC-005125	Bacteria-Cyanobacteriota-Cyanophyceae-Gloeobacterales-Gloeobacteraceae-Gloeobacter-violaceus-PCC-7421	taxo=Bacteria-Cyanobacteriota-Cyanophyceae-Gloeobacterales-Gloeobacteraceae-Gloeobacter-violaceus-PCC-7421
CP017675	Bacteria-Cyanobacteriota-Cyanophyceae-Gloeomargaritales-Gloeomargaritaceae-Gloeomargarita-lithophora-Alchichica-D10	taxo=Bacteria-Cyanobacteriota-Cyanophyceae-Gloeomargaritales-Gloeomargaritaceae-Gloeomargarita-lithophora-Alchichica-D10
CP088016	Bacteria-Cyanobacteriota-Cyanophyceae-Gloeomargaritales-cyanobacterium-VI4D9	taxo=Bacteria-Cyanobacteriota-Cyanophyceae-Gloeomargaritales-cyanobacterium-VI4D9
GCF-000315565	Bacteria-Cyanobacteriota-Cyanophyceae-Nostocales-Symphyonemataceae-Mastigocladopsis-repens-PCC-10914	taxo=Bacteria-Cyanobacteriota-Cyanophyc

Looking good. I will now manually edit the third column.

We can look at the edited file now. 

In [26]:
## head updated TAXONOMY file
print("".join(open(TAXONOMY).readlines()[:10]))

NC-022600	Bacteria-Cyanobacteriota-Cyanophyceae-Gloeobacterales-Gloeobacteraceae-Gloeobacter-kilaueensis-JS1	taxo=Cyanobacteriota-Gloeobacterales-Gloeobacter-kilaueensis
NC-005125	Bacteria-Cyanobacteriota-Cyanophyceae-Gloeobacterales-Gloeobacteraceae-Gloeobacter-violaceus-PCC-7421	taxo=Cyanobacteriota-Gloeobacterales-Gloeobacter-violaceus
CP017675	Bacteria-Cyanobacteriota-Cyanophyceae-Gloeomargaritales-Gloeomargaritaceae-Gloeomargarita-lithophora-Alchichica-D10	taxo=Cyanobacteriota-Gloeomargaritales-Gloeomargarita-lithophora-Alchichica
CP088016	Bacteria-Cyanobacteriota-Cyanophyceae-Gloeomargaritales-cyanobacterium-VI4D9	taxo=Cyanobacteriota-Gloeomargaritales-cyanobacterium-VI4D9
GCF-000315565	Bacteria-Cyanobacteriota-Cyanophyceae-Nostocales-Symphyonemataceae-Mastigocladopsis-repens-PCC-10914	taxo=Cyanobacteriota-Nostocales-Mastigocladopsis-repens
NC-019693	Bacteria-Cyanobacteriota-Cyanophyceae-Oscillatoriophycideae-Oscillatoriales-Oscillatoriaceae-Oscillatoria-acuminata-PCC-6304	taxo=C

Let's add the taxonomy to the FASTA headers now.

In [11]:
## Define directory containing proteomes (fasta files). Also the output folder.
REF_PROT = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["PROTEOMES"]

## Taxonomy mapping file
TAXONOMY = paths_dict["ANALYSIS_DATA"]['REFERENCE_ORGANIZATION']['TAXONOMY']

In [12]:
from rename_fasta import FastaIterator

In [13]:
fi = FastaIterator(REF_PROT, REF_PROT, TAXONOMY)

fi.prepend_taxo()
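
`FastaIterator.prepend_taxo` is project code that is not shown in this notebook. Purely as an illustration of the idea, the sketch below reads the three-column mapping into a dictionary and puts the shortened taxonomy label in front of a FASTA header. The function names, the `_` separator, and the example header `>psbA` are assumptions and may differ from what the project code actually produces.

```python
def load_labels(mapping_lines):
    """Map accession -> short taxonomy label (third column without the 'taxo=' prefix)."""
    labels = {}
    for line in mapping_lines:
        accession, _full, short = line.rstrip("\n").split("\t")
        prefix = "taxo="
        labels[accession] = short[len(prefix):] if short.startswith(prefix) else short
    return labels

def prepend_label(header, label, separator="_"):
    """Return the FASTA header with the taxonomy label placed before the original name."""
    return f">{label}{separator}{header[1:]}"

mapping = [
    "NC-022600\tBacteria-Cyanobacteriota-Cyanophyceae-Gloeobacterales-Gloeobacteraceae-"
    "Gloeobacter-kilaueensis-JS1\ttaxo=Cyanobacteriota-Gloeobacterales-Gloeobacter-kilaueensis\n"
]
labels = load_labels(mapping)
new_header = prepend_label(">psbA", labels["NC-022600"])
print(new_header)
```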
