# 16S phylogeny

In this notebook, we extract plastid 16S genes and infer phylogenies from them.

In [1]:
# Check that the Python version is 3.10.5
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__


print(sys.version)
%load_ext autoreload
%autoreload 2

3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]


In [2]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# Use a context manager so the file handle is closed after reading
with open(PATH_FILE, "r") as path_handle:
    paths_dict = json.load(path_handle)
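
The later cells index `paths_dict` with nested keys. As a rough sketch of the layout those lookups imply, the example below builds a stand-in dictionary and a small lookup helper that reports which key is missing. The key names are taken from the lookups in this notebook; the directory values and the `get_path` helper are hypothetical and are not part of the project's `PATHS.json`.

```python
# Hypothetical stand-in for the contents of PATHS.json: the keys mirror the
# lookups used later in this notebook, the directory values are placeholders.
example_paths = {
    "ANALYSIS_DATA": {
        "16S_GENE": {
            "DATASET": {"ROOT": "data/16S/dataset"},
            "ALIGNMENTS": {
                "V4": {
                    "MAFFT": "data/16S/alignments/v4/mafft",
                    "TRIMAL": "data/16S/alignments/v4/trimal",
                }
            },
            "TREES": "data/16S/trees",
        }
    }
}


def get_path(paths, *keys):
    """Walk nested dictionaries and say which key is missing (hypothetical helper)."""
    node = paths
    for depth, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            walked = " -> ".join(keys[:depth]) or "<root>"
            raise KeyError(f"'{key}' not found under {walked}")
        node = node[key]
    return node


print(get_path(example_paths, "ANALYSIS_DATA", "16S_GENE", "TREES"))
```

A helper like this mainly makes a misspelled key easier to track down than a bare `KeyError` from chained indexing.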

Metagenome-assembled genomes (MAGs) often do not contain SSU genes. This is because contigs from assembled metagenomes are usually binned based on differential read coverage, GC content, and k-mer frequency. Because rRNA genes are so highly conserved, their GC content and k-mer frequencies are poor guides for assigning them to any particular MAG.
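
To make the two composition signals mentioned above concrete, here is a minimal sketch that computes GC content and a tetranucleotide frequency profile for a sequence. It is only an illustration: binning tools use their own implementations (and typically merge reverse-complement k-mers, which this sketch does not), and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    n_acgt = sum(seq.count(base) for base in "ACGT")
    if n_acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / n_acgt


def kmer_frequencies(seq, k=4):
    """Relative frequency of every possible k-mer over the ACGT alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases (e.g. N) are ignored in the total
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


toy_contig = "ATGCGCGATATCGCGGCTA"  # made-up sequence, for illustration only
print(round(gc_content(toy_contig), 3))
print(len(kmer_frequencies(toy_contig)))
```

Two contigs from the same genome tend to have similar profiles, whereas a conserved rRNA gene can carry a profile that looks alike across many genomes, which is why it is hard to place.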

MAGs *can* sometimes contain fragments of the SSU gene (presumably when the gene is part of a larger contig?), e.g. [Delmont et al. 2018](https://doi.org/10.1038/s41564-018-0176-9).

I decided to run [barrnap](https://github.com/tseemann/barrnap) to see whether we could detect 16S in the long, unbinned contigs, since they were the most complete and therefore the most likely to contain the gene. Happily, we found a roughly 800 bp fragment in one of them!
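
The barrnap command used here is not recorded in this notebook. As a sketch of how its output could be screened, the snippet below parses GFF3 text and keeps only 16S hits, assuming hits are labelled with a `Name=16S_rRNA` attribute as in the examples in the barrnap README. The contig names and coordinates in the example are made up, and `parse_gff_16s` is a hypothetical helper.

```python
def parse_gff_16s(gff_text):
    """Return 16S rRNA hits from GFF3 text as a list of dictionaries."""
    hits = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF directives/comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if attributes.get("Name") != "16S_rRNA":
            continue
        start, end = int(fields[3]), int(fields[4])
        hits.append({
            "contig": fields[0],
            "start": start,
            "end": end,
            "strand": fields[6],
            "length": end - start + 1,  # GFF coordinates are 1-based, inclusive
            "product": attributes.get("product", ""),
        })
    return hits


# Made-up example records (not real output from this project)
example_gff = "\n".join([
    "##gff-version 3",
    "\t".join(["contig_A", "barrnap:0.9", "rRNA", "1201", "2000", "1e-50", "+", ".",
               "Name=16S_rRNA;product=16S ribosomal RNA (partial)"]),
    "\t".join(["contig_B", "barrnap:0.9", "rRNA", "10", "2900", "0", "-", ".",
               "Name=23S_rRNA;product=23S ribosomal RNA"]),
])

for hit in parse_gff_16s(example_gff):
    print(hit["contig"], hit["length"], hit["product"])
```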

The best BLAST hit for this fragment was a DPL2 16S sequence ([Choi et al. 2017](https://doi.org/10.1016/j.cub.2016.11.032)), with 91% similarity!

I also found a 16S rDNA gene fragment in Lepto-04 (CHL_AOS_Bin_125_5_c). However, this 808 bp fragment turned out to be chimeric: the first half matched the DPL2 sequences almost perfectly (100% sequence similarity, 98% query cover), whereas the second half matched a cyanobacterial sequence extremely well. We therefore kept only a 440 bp fragment of the Lepto-04 16S sequence.
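
One simple way to check a suspected chimera like this is to query each half of the fragment separately and then keep only the trusted region. The sketch below shows that bookkeeping under stated assumptions: the sequence is a placeholder of the same length as the fragment, the helper names are hypothetical, and the coordinates 1 to 440 are illustrative only, since the exact region that was retained is not recorded in this notebook.

```python
def split_in_halves(seq):
    """Split a sequence into two halves for separate BLAST queries."""
    midpoint = len(seq) // 2
    return seq[:midpoint], seq[midpoint:]


def keep_region(seq, start, end):
    """Return the region from start to end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"region {start}-{end} is outside 1-{len(seq)}")
    return seq[start - 1:end]


def to_fasta(name, seq, width=60):
    """Format a single sequence as a FASTA record."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Placeholder 808 bp sequence (NOT the real Lepto-04 fragment)
fragment = "ACGT" * 202

first_half, second_half = split_in_halves(fragment)
print(len(first_half), len(second_half))

# Illustrative coordinates only; the region actually kept is not stated above
trimmed = keep_region(fragment, 1, 440)
print(to_fasta("Lepto-04_16S_trimmed", trimmed).splitlines()[0], len(trimmed))
```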


## 1. Taxon sampling

I gathered 16S sequences from the reference taxa in Figure 2 (the subset phylogeny), the 16S gene fragments from Lepto-01 and Lepto-04, and the two DPL2 sequences. 

## 2. Align

We align with mafft-linsi.
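
The SLURM script submitted below is not shown in this notebook. As a rough local equivalent, the sketch here builds and runs a `mafft-linsi` call from Python; `mafft-linsi` is MAFFT's wrapper for `mafft --localpair --maxiterate 1000` and writes the alignment to standard output. The helper names and the thread count are assumptions.

```python
import shlex
import subprocess


def mafft_linsi_command(input_fasta, threads=1):
    """Build the argument list for an L-INS-i alignment."""
    return ["mafft-linsi", "--thread", str(threads), input_fasta]


def run_mafft_linsi(input_fasta, output_fasta, threads=1):
    """Run mafft-linsi and write the alignment to output_fasta (needs MAFFT on PATH)."""
    with open(output_fasta, "w") as out_handle:
        subprocess.run(
            mafft_linsi_command(input_fasta, threads),
            stdout=out_handle,  # MAFFT prints the alignment to stdout
            check=True,
        )


# Only print the command here; call run_mafft_linsi(...) to actually align
print(shlex.join(mafft_linsi_command("v4/16S.fasta", threads=4)))
```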

In [6]:
## Input directory containing fasta
DATASET = paths_dict['ANALYSIS_DATA']['16S_GENE']['DATASET']['ROOT']

## Output directory 
ALIGNED = paths_dict['ANALYSIS_DATA']['16S_GENE']['ALIGNMENTS']["V4"]['MAFFT']

In [None]:
%%bash -s "$DATASET" "$ALIGNED"

sbatch /crex/proj/naiss2023-6-81/Mahwash/beta-Cyclocitral/uppmax_scripts/script_bin/job_mafft-linsi.sh "$1"/v4/16S.fasta "$2"/16S.mafft.fasta


## 3. Trim

We trim with trimal using a gap threshold of 0.1.
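
To illustrate what a gap threshold of 0.1 means, the sketch below reimplements the column filter in plain Python: in trimAl, the gap threshold (`-gt`) is 1 minus the fraction of sequences allowed to have a gap, so a column is kept when at least 10% of the sequences have a residue there. This is a simplified illustration, not trimAl itself, and the exact flags in the job script are not shown in this notebook.

```python
def trim_gappy_columns(alignment, gap_threshold=0.1, gap_chars="-"):
    """Keep columns where the fraction of non-gap residues is >= gap_threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    kept_columns = []
    for col in range(length):
        n_residues = sum(1 for name in names if alignment[name][col] not in gap_chars)
        if n_residues / len(names) >= gap_threshold:
            kept_columns.append(col)

    return {
        name: "".join(alignment[name][col] for col in kept_columns)
        for name in names
    }


# Toy alignment (made up): the last column is all gaps
toy_alignment = {
    "seq_a": "AC-G-",
    "seq_b": "A--G-",
    "seq_c": "AC-T-",
    "seq_d": "A-TG-",
}
print(trim_gappy_columns(toy_alignment, gap_threshold=0.1))
```

With a threshold as permissive as 0.1, only columns that are almost entirely gaps are removed, which preserves as many sites as possible in a short 16S alignment.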

In [8]:
## Input directory containing fasta
ALIGNED = paths_dict['ANALYSIS_DATA']['16S_GENE']['ALIGNMENTS']["V4"]['MAFFT']

## Output directory 
TRIMMED = paths_dict['ANALYSIS_DATA']['16S_GENE']['ALIGNMENTS']["V4"]['TRIMAL']

In [None]:
%%bash -s "$ALIGNED" "$TRIMMED" 

sbatch /crex/proj/naiss2023-6-81/Mahwash/beta-Cyclocitral/uppmax_scripts/script_bin/job_2023_10_22_trimal_ssu.sh "$1"/16S.mafft.fasta "$2"/16S.mafft.trimal.fasta


## 4. Run trees

We use raxml-ng to infer the phylogenies under the GTR+G model, with 20 ML searches and 100 bootstrap replicates.
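
The job script itself is not shown, so the following is only a sketch of a raxml-ng call that matches the settings described above. It assumes the 20 ML searches are split evenly between parsimony and random starting trees (the notebook does not say how they were split), and the helper name, thread count, and seed are hypothetical.

```python
def raxml_ng_command(msa, prefix, model="GTR+G", n_searches=20,
                     n_bootstraps=100, threads=1, seed=12345):
    """Build a raxml-ng --all command: ML search, bootstrapping, and support mapping."""
    n_parsimony = n_searches // 2          # assumed even split of starting trees
    n_random = n_searches - n_parsimony
    return [
        "raxml-ng", "--all",
        "--msa", msa,
        "--model", model,
        "--tree", f"pars{{{n_parsimony}}},rand{{{n_random}}}",
        "--bs-trees", str(n_bootstraps),
        "--prefix", prefix,
        "--threads", str(threads),
        "--seed", str(seed),               # fixed seed for reproducibility
    ]


command = raxml_ng_command("16S.mafft.trimal.fasta", "v4/16S")
print(command)
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where raxml-ng is installed.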

In [10]:
## Input directory containing fasta
TRIMMED = paths_dict['ANALYSIS_DATA']['16S_GENE']['ALIGNMENTS']["V4"]['TRIMAL']

## Output directory 
TREES = paths_dict['ANALYSIS_DATA']['16S_GENE']['TREES']

In [None]:
%%bash -s "$TRIMMED" "$TREES" 

sbatch /crex/proj/naiss2023-6-81/Mahwash/beta-Cyclocitral/uppmax_scripts/script_bin/job_raxml-ng.sh "$1"/16S.mafft.trimal.fasta "$2"/v4/16S


## References

Delmont, T. O., Quince, C., Shaiber, A., Esen, Ö. C., Lee, S. T., Rappé, M. S., ... & Eren, A. M. (2018). Nitrogen-fixing populations of Planctomycetes and Proteobacteria are abundant in surface ocean metagenomes. Nature Microbiology, 3(7), 804-813. https://doi.org/10.1038/s41564-018-0176-9

Barrnap. https://github.com/tseemann/barrnap

Choi, C. J., Bachy, C., Jaeger, G. S., Poirier, C., Sudek, L., Sarma, V. V. S. S., ... & Worden, A. Z. (2017). Newly discovered deep-branching marine plastid lineages are numerically rare but globally distributed. Current Biology, 27(1), R15-R16. https://doi.org/10.1016/j.cub.2016.11.032 

FigTree. http://tree.bio.ed.ac.uk/software/figtree/