# Overview

We compute summary statistics for the MAGs and the references. How complete are the MAGs? How many genes do they encode? How does their size compare with that of the references? What is the abundance of these MAGs in the global ocean?

All stats are presented in Supplementary Data 1 of the corresponding manuscript (and are also present in the Figshare repo: https://doi.org/10.17044/scilifelab.28212173).

In [1]:
# Check if python is 3.10.5
import sys
import os
import pandas as pd
import json
import __init__

print(sys.version)
%load_ext autoreload
%autoreload 2

3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]


In [2]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# Use a context manager so the file handle is closed after reading
with open(PATH_FILE, "r") as fh:
    paths_dict = json.load(fh)

## 1. Completeness

The first question is how complete the MAGs are compared with the references. To answer it, we identified a set of 44 core genes that are found in all plastids ([Puerta et al. 2005](https://doi.org/10.1093/dnares/12.2.151) and [Janouškovec et al. 2010](https://doi.org/10.1073/pnas.1003335107)). We use the presence or absence of these genes to estimate the completeness of every MAG and reference taxon: completeness is the number of core genes detected divided by 44, expressed as a percentage.
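
As a rough illustration of this calculation, the sketch below counts how many core genes appear as whole words in a set of FASTA headers and converts the count to a percentage. The function name `completeness`, the toy headers, and the 4-gene "core" set are hypothetical; the notebook itself performs this step with `grep`, `cut`, and `sort -u` in the bash cells that follow, which count distinct header fields rather than distinct gene names, so the two approaches may differ in edge cases.

```python
import re

def completeness(headers, core_genes, n_core=44):
    """Return (genes detected, completeness %) for one proteome.

    headers: iterable of FASTA header lines (hypothetical input format).
    core_genes: iterable of core gene names.
    n_core: size of the core set (44 in this notebook).
    """
    detected = set()
    for gene in core_genes:
        # Whole-word match, similar in spirit to `grep -w`
        pattern = re.compile(r"(?<!\w)" + re.escape(gene) + r"(?!\w)")
        if any(pattern.search(h) for h in headers):
            detected.add(gene)
    return len(detected), len(detected) / n_core * 100

# Toy example with made-up headers and a made-up 4-gene core set
headers = [">rbcL;taxonA", ">psbA;taxonA", ">psbA;taxonA_copy2", ">atpB2;taxonA"]
n, pct = completeness(headers, ["rbcL", "psbA", "atpB", "rpoB"], n_core=4)
print(n, pct)
```

Here `atpB` is not counted because `atpB2` is a different word, which is the same reason the bash cells use `grep -w`.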

In [3]:
## List of core genes
CORE_GENES = paths_dict["DATABASES"]["CORE_GENES"]

## Output folder (which also contains list of taxa in Figure 1)
OUT = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["STATS"]["ROOT"]

## Proteomes path for MAGS
PROTEOMES_MAGS = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["MFANNOTATIONS"]["PROTEOMES"]

## Proteomes PATH for references
PROTEOMES_REFS = paths_dict["ANALYSIS_DATA"]["REFERENCE_ORGANIZATION"]["PROTEOMES"]

Calculate completeness for MAGs first.

In [7]:
%%bash -s "$CORE_GENES" "$OUT" "$PROTEOMES_MAGS"

printf "Taxon\tType\tNumber of genes detected\tCompleteness percentage\n" > $2/completeness_taxa.txt

while read -r line; do
    # Count distinct first ';'-delimited fields among lines matching a core gene as a whole word
    genes=$(grep -w -f "$1" "$3/$line.fasta" | cut -f1 -d ';' | sort -u | wc -l)
    completeness=$(awk -v g="$genes" 'BEGIN {print g/44*100}')
    printf "%s\tmag\t%s\t%s\n" "$line" "$genes" "$completeness" >> "$2/completeness_taxa.txt"
done < "$2/mags.list"

Now estimate completeness for all references and add to the same file.

In [8]:
%%bash -s "$CORE_GENES" "$OUT" "$PROTEOMES_REFS"

while read -r line; do
    # Count distinct first ';'-delimited fields among lines matching a core gene as a whole word
    genes=$(grep -w -f "$1" "$3/$line.fasta" | cut -f1 -d ';' | sort -u | wc -l)
    completeness=$(awk -v g="$genes" 'BEGIN {print g/44*100}')
    printf "%s\tref\t%s\t%s\n" "$line" "$genes" "$completeness" >> "$2/completeness_taxa.txt"
done < "$2/refs.list"

We can now plot the completeness of the MAGs vs. the references in R using the script `src/stats.R`. 

## 2. Redundancy

We can also calculate the redundancy of the MAGs and references, i.e. how many copies of each core gene each proteome contains, using the same set of 44 genes.
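
To illustrate the per-gene counting, here is a minimal sketch that tallies how many header lines mention each core gene as a whole word. The function name `copy_counts` and the toy headers are hypothetical and only stand in for the proteome FASTA files; the notebook does the equivalent with `grep -w ... | wc -l` in the bash cells.

```python
import re
from collections import Counter

def copy_counts(headers, core_genes):
    """Count, for each core gene, how many header lines mention it as a whole word."""
    counts = Counter()
    for gene in core_genes:
        pattern = re.compile(r"(?<!\w)" + re.escape(gene) + r"(?!\w)")
        counts[gene] = sum(1 for h in headers if pattern.search(h))
    return counts

# Toy example with made-up headers
headers = [">psbA;contig1", ">psbA;contig2", ">rbcL;contig1"]
counts = copy_counts(headers, ["psbA", "rbcL", "atpB"])

# Genes seen more than once are candidates for redundancy
duplicated = [g for g, c in counts.items() if c > 1]
print(dict(counts), duplicated)
```

A count of 0 means the gene was not detected, 1 means a single copy, and values above 1 point to possible redundancy.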

In [9]:
## List of core genes
CORE_GENES = paths_dict["DATABASES"]["CORE_GENES"]

## Output folder (which also contains list of taxa in Figure 1)
OUT = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["STATS"]["ROOT"]

## Proteomes path for MAGS
PROTEOMES_MAGS = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["MFANNOTATIONS"]["PROTEOMES"]

## Proteomes PATH for references
PROTEOMES_REFS = paths_dict["ANALYSIS_DATA"]["REFERENCE_ORGANIZATION"]["PROTEOMES"]

First MAGs!

In [10]:
%%bash -s "$CORE_GENES" "$OUT" "$PROTEOMES_MAGS"

## read list of genes
gene_list=($(cat $1))

# Generate the tab-separated header line with gene names
header="Taxa$(printf '\t%s' "${gene_list[@]}")"

# Initialize the content (one row per taxon is appended below)
content=""

# Read list of taxa
taxa=($(cat $2/mags.list))

# Iterate through taxa and genes, check number of genes, and append the number
for taxon in ${taxa[@]}; do
    line="${taxon}"
    for gene in "${gene_list[@]}"; do
        count=$(grep -w "${gene}" "$3/${taxon}.fasta" | wc -l)
        line+="\t${count}"
    done
    content+="${line}\n"
done

# Write header and content to redundancy_taxa.txt (without extra blank lines)
echo -e "$header" > "$2"/redundancy_taxa.txt
echo -en "$content" >> "$2"/redundancy_taxa.txt

Then references!

In [11]:
%%bash -s "$CORE_GENES" "$OUT" "$PROTEOMES_REFS"

## read list of genes
gene_list=($(cat $1))

# Initialize the content (one row per taxon is appended below)
content=""

# Read list of taxa
taxa=($(cat $2/refs.list))

# Iterate through taxa and genes, check number of genes, and append the number
for taxon in ${taxa[@]}; do
    line="${taxon}"
    for gene in "${gene_list[@]}"; do
        count=$(grep -w "${gene}" "$3/${taxon}.fasta" | wc -l)
        line+="\t${count}"
    done
    content+="${line}\n"
done

# Append to redundancy_taxa.txt (without a trailing blank line)
echo -en "$content" >> "$2"/redundancy_taxa.txt

## 3. Size

What is the total size of the MAGs and the references in base pairs (bp)? We calculate the size of each genome with [seqkit](https://bioinf.shenwei.me/seqkit/) (Shen et al. 2016).

In [12]:
## Output folder (which also contains list of taxa in Figure 1)
OUT = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["STATS"]["ROOT"]

## Fasta path for MAGS
MAGS = paths_dict["DATABASES"]["TARA_PLASTID_GENOMES"]["REDUNDANCY_FILTERED"]

## Fasta PATH for references
REFS = paths_dict["ANALYSIS_DATA"]["REFERENCE_ORGANIZATION"]["FASTA_FILES"]

Calculate size of MAGs first.

In [13]:
%%bash -s "$OUT" "$MAGS"

printf "Taxon\tType\tSize\n" > $1/size_taxa.txt

while read -r line; do
    # sum_len is the fifth column of the tabular seqkit output; drop the header row
    length=$(seqkit stats -T "$2/$line.fa" | cut -f 5 | grep -v "sum")
    printf "%s\tmag\t%s\n" "$line" "$length" >> "$1/size_taxa.txt"
done < "$1/mags.list"

Now calculate the size of the references.

In [14]:
%%bash -s "$OUT" "$REFS"

while read -r line; do
    # sum_len is the fifth column of the tabular seqkit output; drop the header row
    length=$(seqkit stats -T "$2/$line.fa" | cut -f 5 | grep -v "sum")
    printf "%s\tref\t%s\n" "$line" "$length" >> "$1/size_taxa.txt"
done < "$1/refs.list"
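
As an optional sanity check on the sizes above, the total length of a FASTA file can also be summed directly. The sketch below is a pure-Python stand-in, not the method used in this notebook; the function name `total_length` and the toy sequences are hypothetical, and we would expect it to agree with seqkit's `sum_len` only for plainly formatted FASTA files.

```python
def total_length(fasta_lines):
    """Sum the lengths of all sequences in a FASTA given as an iterable of lines."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        total += len(line)
    return total

# Toy example: two made-up contigs of 8 and 4 characters
toy = [">contig1", "ACGTAC", "GT", ">contig2", "NNNA"]
size_bp = total_length(toy)
print(size_bp)
```

For a real file, the same function can be called on an open file handle, e.g. `total_length(open(path))`, where `path` is a placeholder.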

Plot in R using the script `src/stats.R`. 

## 4. GC content

What is the GC content of each MAG and reference? We calculate it using seqkit again.

In [None]:
%%bash -s "$OUT" "$MAGS"

printf "Taxon\tType\tGCcontent\n" > $1/gc_taxa.txt

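while read -r line; do
    # NOTE: field 16 is assumed to be GC(%); the column position may differ between seqkit versions
    gc=$(seqkit stats -T -a "$2/$line.fa" | cut -f 16 | grep -v "GC")
    printf "%s\tmag\t%s\n" "$line" "$gc" >> "$1/gc_taxa.txt"
done < "$1/mags.list"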

And now the references!

In [None]:
%%bash -s "$OUT" "$REFS"

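while read -r line; do
    # NOTE: field 16 is assumed to be GC(%); the column position may differ between seqkit versions
    gc=$(seqkit stats -T -a "$2/$line.fa" | cut -f 16 | grep -v "GC")
    printf "%s\tref\t%s\n" "$line" "$gc" >> "$1/gc_taxa.txt"
done < "$1/refs.list"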
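
Because the cells above select the GC column by position (`cut -f 16`), they depend on the column layout of the installed seqkit version. A possible alternative is to look the value up by column name. The sketch below assumes the GC column header is `GC(%)` (check this against the output of your seqkit version); the function name `stats_value` and the example table are hypothetical, and the values in it are invented.

```python
import csv
import io

def stats_value(tsv_text, column):
    """Return the value of `column` from the first data row of a
    tab-separated table that starts with a header line."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    row = next(reader)
    return row[column]

# Made-up table that mimics a tabular stats output (values invented)
example = (
    "file\tformat\ttype\tnum_seqs\tsum_len\tGC(%)\n"
    "toy.fa\tFASTA\tDNA\t1\t120000\t31.50\n"
)
print(stats_value(example, "GC(%)"))
```

The same helper works for the size column by passing `"sum_len"` instead of `"GC(%)"`.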

Plot in R using the script `src/stats.R`. 

## 5. Taxonomy

The taxonomy of each MAG and reference is needed to annotate the tree. It is derived from the position of the taxon in the full concatenated tree in Figure 1.

We collect the taxonomy in the file `taxonomy_taxa.txt`.

At the same time, we generate a file containing the original MAG/taxa names, the new MAG/taxa names, and the taxon names used in the phylogenies. We compile this file manually and call it `correspondence_taxa.txt`.

## 6. Number of genes

How many genes are encoded in the MAGs vs. the references? For the MAGs we have data from MFAnnot, while the references have been annotated in different ways. For consistency, we annotated the references with MFAnnot too.

### 6.1. Annotate references with MFAnnot for consistency

In [17]:
## Define directory with the reference FASTA files. Paths are defined in PATHS.json in the main directory.
REFS = paths_dict['ANALYSIS_DATA']["REFERENCE_ORGANIZATION"]["FASTA_FILES"]

## Define output directory
MFOUT_DIR = paths_dict["ANALYSIS_DATA"]["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["ROOT"]
MFSLURM_CSV = paths_dict["ANALYSIS_DATA"]["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["SLURMLOG"]

## Define MFannot database
PROTEIN_COLLECTION_DB = paths_dict["DATABASES"]["MF_ANNOT_REFS"]["PROTEINS"]

In [18]:
from plastome_raw_data import PlastomeRawIterator

In [None]:
pri = PlastomeRawIterator(REFS, suffix="fa")

pri.run_mfannot(MFOUT_DIR, MFSLURM_CSV, PROTEIN_COLLECTION_DB, force=False, restart_fails=False)

### 6.2. Count number of genes

Now we can get the number of genes for each MAG and reference sequence.

In [19]:
## Output folder (which also contains list of taxa in Figure 1)
OUT = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["STATS"]["ROOT"]

## MFAnnot path for MAGS
MAGS = paths_dict["ANALYSIS_DATA"]["PLASTOMES"]["MFANNOTATIONS"]["MASTERFILE"]["ROOT"]

## MFAnnot path for references
REFS = paths_dict["ANALYSIS_DATA"]["REFERENCE_ORGANIZATION"]["MF_ANNOT_REFS"]["ROOT"]

Calculate number of genes of MAGs first.

In [20]:
%%bash -s "$OUT" "$MAGS"

printf "Taxon\tType\tNumber of genes\n" > $1/no-of-genes_taxa.txt

while read -r line; do
    # Extract the number that follows "Gene Totals:" in the MFAnnot masterfile
    genes=$(grep "Gene Totals" "$2/$line.mf" | cut -f2 -d ':' | tr -d ' ')
    printf "%s\tmag\t%s\n" "$line" "$genes" >> "$1/no-of-genes_taxa.txt"
done < "$1/mags.list"

And now the references!

In [21]:
%%bash -s "$OUT" "$REFS"

while read -r line; do
    # Extract the number that follows "Gene Totals:" in the MFAnnot masterfile
    genes=$(grep "Gene Totals" "$2/$line.mf" | cut -f2 -d ':' | tr -d ' ')
    printf "%s\tref\t%s\n" "$line" "$genes" >> "$1/no-of-genes_taxa.txt"
done < "$1/refs.list"
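
The extraction step in the two cells above can be sketched in Python as follows. The function name `gene_total` is hypothetical, and the toy lines only assume that the masterfile contains a line with `Gene Totals:` followed by a number, which is what the `grep`/`cut`/`tr` pipeline relies on; the rest of the layout is made up.

```python
def gene_total(masterfile_lines):
    """Return the integer after 'Gene Totals:' from the first matching line, or None."""
    for line in masterfile_lines:
        if "Gene Totals" in line:
            # Take the field between the first and second colon, as `cut -f2 -d ':'` does
            fields = line.split(":")
            if len(fields) > 1:
                return int(fields[1].replace(" ", "").strip())
    return None

# Toy example with a made-up masterfile excerpt
toy = [";; Example masterfile comment", ";; Gene Totals: 132", ">contig1"]
n_genes = gene_total(toy)
print(n_genes)
```

Returning `None` when no matching line exists makes a missing annotation easy to spot, whereas the bash version silently writes an empty field.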

Plot in R using the script `src/stats.R`. 
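
Since every table written above shares the `Taxon` and `Type` columns, the per-statistic files can be combined into a single table keyed on `Taxon`. The sketch below shows one way to do this with the standard library; it is an illustration only and not necessarily how Supplementary Data 1 was assembled. The function names `read_table` and `merge_tables` and the toy values are hypothetical, while the column names follow the headers written by the bash cells.

```python
import csv
import io

def read_table(tsv_text):
    """Parse a tab-separated table into {taxon: row dict}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["Taxon"]: row for row in reader}

def merge_tables(*tables):
    """Merge several {taxon: row} mappings, keeping taxa present in all of them."""
    shared = set(tables[0])
    for t in tables[1:]:
        shared &= set(t)
    merged = {}
    for taxon in sorted(shared):
        row = {}
        for t in tables:
            row.update(t[taxon])
        merged[taxon] = row
    return merged

# Toy tables with invented values, using the same headers as size_taxa.txt and gc_taxa.txt
size = read_table("Taxon\tType\tSize\nmagA\tmag\t120000\nrefB\tref\t150000\n")
gc = read_table("Taxon\tType\tGCcontent\nmagA\tmag\t31.5\nrefB\tref\t35.2\n")
merged = merge_tables(size, gc)
print(merged["magA"])
```

Taxa missing from any one table are dropped here; keeping them with empty fields instead would be an equally reasonable choice.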

---
## References

Puerta, M. V. S., Bachvaroff, T. R., & Delwiche, C. F. (2005). The complete plastid genome sequence of the haptophyte Emiliania huxleyi: a comparison to other plastid genomes. DNA Research, 12(2), 151-156. 

Janouškovec, J., Horák, A., Oborník, M., Lukeš, J., & Keeling, P. J. (2010). A common red algal origin of the apicomplexan, dinoflagellate, and heterokont plastids. Proceedings of the National Academy of Sciences, 107(24), 10949-10954.

Shen, W., Le, S., Li, Y., & Hu, F. (2016). SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE, 11(10), e0163962.