# 18S rRNA gene from the targeted coassembly  

The targeted coassembly of the filters where NEW is abundant did not recover a candidate psbO gene (at least none that we could identify).

We now try to extract the 18S genes from the targeted coassembly, on the reasoning that the 18S gene is more likely to be recovered than the psbO gene because it usually (though not always) occurs in multiple copies per genome.

I first try to extract the 18S gene using [barrnap](https://github.com/tseemann/barrnap).

In [None]:
# Imports; the Python version is printed below (expected: 3.10.5)
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__


print(sys.version)
%load_ext autoreload
%autoreload 2

In [7]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# Use a context manager so the file handle is closed after reading
with open(PATH_FILE, "r") as fh:
    paths_dict = json.load(fh)

## 1. Extract 18S sequences from the filters where NEW is abundant

We use barrnap with a very relaxed length cutoff of 0.1.

In [3]:
## Path to assembly folder 
ASSEMBLY = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MEGAHIT"]

In [None]:
%%bash -s "$ASSEMBLY"

sbatch ../../uppmax_scripts/script_bin/job_2024_09_barrnap.sh "$1"/out/final.contigs.fa "$1"/out/18S/rrna.fasta

We extract the 18S sequences only. 

In [None]:
%%bash -s "$ASSEMBLY"

seqkit grep -rp "18S" "$1"/out/18S/rrna.fasta > "$1"/out/18S/18S.fasta
seqkit stats "$1"/out/18S/18S.fasta

We got 95 18S sequences, from 468 to 1800 bp. What are these 18S sequences? Let us find out! 
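For readers without seqkit, the header filter and length summary from the cell above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (`read_fasta`, `filter_18s`, `length_stats`) and the toy records are hypothetical, and the pattern is matched against the sequence ID (the first whitespace-delimited token of the header), which is my understanding of what `seqkit grep -rp` does by default.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_18s(records, pattern="18S"):
    """Keep records whose sequence ID matches the regular expression."""
    kept = []
    for header, seq in records:
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if re.search(pattern, seq_id):
            kept.append((header, seq))
    return kept


def length_stats(records):
    """Return the number of sequences and their min/max lengths."""
    lengths = [len(seq) for _, seq in records]
    if not lengths:
        return {"num_seqs": 0, "min_len": None, "max_len": None}
    return {"num_seqs": len(lengths), "min_len": min(lengths), "max_len": max(lengths)}


# Toy input (hypothetical records, not taken from the real assembly)
example = """>18S_rRNA::contig_1:10-30(+)
ACGTACGTACGTACGTACGT
>28S_rRNA::contig_2:5-17(-)
ACGTACGTACGT
>18S_rRNA::contig_3:1-8(+)
ACGT
ACGT
""".splitlines()

hits_18s = filter_18s(read_fasta(example))
stats = length_stats(hits_18s)
print(stats)
```

To run it on a real file, the lines of `rrna.fasta` could be passed to `read_fasta` in place of the toy input.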

## 2. BLAST 18S sequences against PR2 v5

We BLAST our extracted 18S sequences against [PR2 v5](https://github.com/pr2database/pr2database/releases/tag/v5.0.0) to identify what is present in our metagenomes. Are there sequences that are very dissimilar to known reference sequences?

In [8]:
## Path to assembly folder 
ASSEMBLY = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MEGAHIT"]

## Path to PR2 blast database
PR2 = paths_dict["DATABASES"]["COASSEMBLY"]["PR2"]

In [None]:
%%bash -s "$ASSEMBLY" "$PR2"

sbatch ../../uppmax_scripts/script_bin/job_blastn.sh "$1"/out/18S/18S.fasta "$2"/pr2_version_5.0.0_SSU_taxo_long.db "$1"/out/18S

We pull the best blast hit for each query.
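The selection rule is: for each query, keep the hit with the highest bit score. A minimal Python sketch of that rule follows. It assumes the BLAST job writes standard 12-column tabular output (`-outfmt 6`), where column 1 is the query ID and column 12 is the bit score; the job script is not shown here, so that layout is an assumption. The function name `best_hits` and the toy rows are hypothetical.

```python
import csv
import io


def best_hits(rows, query_col=0, bitscore_col=11):
    """Return {query_id: row} keeping the highest-bit-score row per query.

    Ties keep the first row encountered, which may differ from the
    tie-breaking of a sort-based shell pipeline.
    """
    best = {}
    for row in rows:
        query = row[query_col]
        score = float(row[bitscore_col])
        if query not in best or score > float(best[query][bitscore_col]):
            best[query] = row
    return best


# Toy tabular BLAST output (hypothetical hits, assumed outfmt 6 columns)
toy = (
    "q1\trefB\t91.0\t780\t70\t0\t1\t780\t1\t780\t0.0\t1100\n"
    "q1\trefA\t98.5\t800\t12\t0\t1\t800\t1\t800\t0.0\t1400\n"
    "q2\trefC\t87.2\t500\t64\t0\t1\t500\t1\t500\t1e-150\t540\n"
)

rows = list(csv.reader(io.StringIO(toy), delimiter="\t"))
top = best_hits(rows)
for query, row in top.items():
    print(query, row[1], row[2], row[11])
```

Selecting on bit score rather than percent identity favours long, well-supported alignments over short, near-perfect ones.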

In [15]:
%%bash -s "$ASSEMBLY" 

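## Extract best blast hit for each 18S seq:
## sort by query (column 1), then by column 12 in descending numeric order, and keep the first row per query
sort -k1,1 -k12,12nr "$1"/out/18S/18S__pr2_version_5.blastout | awk '!seen[$1]++' > "$1"/out/18S/18S__pr2_version_5.blastout.bestHit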

I looked at the best hits manually. Most of the putative 18S sequences were actually 16S sequences from plastids and mitochondria, leaving only 29 nuclear 18S sequences. Of these, most were >90% identical to known references (animals, haptophytes, dinoflagellates, ciliates, etc.). About four were less than 90% identical; I examined these in more detail and BLASTed them against NCBI's nt_core. Three of them could be convincingly placed in existing groups.

One sequence (18S_RRNA::C_000000708636:221-1073) looked very different, which initially excited me. However, a large proportion of the sequence had no hits at all when I BLASTed it in segments, so it is likely a chimera of 18S and non-18S sequence.
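The manual triage described here can be partly sketched in code: flag queries whose best hit falls below 90% identity, then cut a suspicious sequence into segments that can be BLASTed separately to check for a chimera. This is a hedged illustration, not the procedure actually used. It assumes percent identity sits in column 3 of tabular BLAST output; the 200 bp segment size is an arbitrary choice (the size used in the manual check is not recorded), and `flag_low_identity`, `split_into_segments` and the toy data are hypothetical.

```python
def flag_low_identity(best, threshold=90.0, pident_col=2):
    """Return sorted query IDs whose best hit is below the identity threshold."""
    return sorted(
        query for query, row in best.items() if float(row[pident_col]) < threshold
    )


def split_into_segments(seq, size=200):
    """Split a sequence into consecutive, non-overlapping segments.

    Returns (start, end, subsequence) tuples with 1-based inclusive
    coordinates; the last segment may be shorter than `size`.
    """
    segments = []
    for start in range(0, len(seq), size):
        chunk = seq[start:start + size]
        segments.append((start + 1, start + len(chunk), chunk))
    return segments


# Toy best-hit table: query -> (query, subject, percent identity)
best = {
    "q1": ["q1", "refA", "98.5"],
    "q2": ["q2", "refC", "87.2"],
}
low = flag_low_identity(best)
print(low)

# Toy sequence of 853 bp, the length of the interval 221-1073
seq = "ACGT" * 213 + "A"
segments = split_into_segments(seq, size=200)
for start, end, chunk in segments:
    print(start, end, len(chunk))
```

Each segment can then be written to its own FASTA record and searched independently; segments without any hit next to segments with clear 18S hits would be consistent with a chimeric contig.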

To summarise, searching the assembled metagenomes alone did not recover the 18S gene of NEW. Next, I will look at the 18S ASVs from these filters.

## References

Seemann, T. barrnap 0.9: rapid ribosomal RNA prediction. https://github.com/tseemann/barrnap

Guillou, L., Bachar, D., Audic, S., Bass, D., Berney, C., Bittner, L., ... & Christen, R. (2012). The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote small sub-unit rRNA sequences with curated taxonomy. Nucleic Acids Research, 41(D1), D597-D604.
