# Targeted co-assembly

The large Arctic co-assembly did not recover the psbO gene of NEW (at least, not as far as we can tell).

We now try a targeted co-assembly of only the filters where NEW is abundant, in the hope that this assembles the leptophyte nuclear genome better.


In [None]:
# Imports; the Python version (expected: 3.10.5) is printed below
import json
import os
import pandas as pd
import sys
import numpy as np
import __init__


print(sys.version)
%load_ext autoreload
%autoreload 2

In [2]:
# we store the important data paths in PATH_FILE
PATH_FILE = "../../PATHS.json"

# Use a context manager so the file handle is closed after reading
with open(PATH_FILE, "r") as path_file:
    paths_dict = json.load(path_file)
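
Every later cell indexes `paths_dict` with nested keys, so a missing entry only shows up as a bare `KeyError` somewhere further down. A minimal sketch of a lookup helper that reports the full key chain is given here. `get_path` is a hypothetical name (not part of this project), and the example dictionary only mirrors the nesting of the keys used in this notebook, with placeholder paths.

```python
from functools import reduce


def get_path(paths, *keys):
    """Return paths[keys[0]][keys[1]]... and raise a readable error if a key is missing.

    Hypothetical helper for illustration; it is not defined anywhere in this project.
    """
    try:
        return reduce(lambda node, key: node[key], keys, paths)
    except (KeyError, TypeError) as err:
        raise KeyError(f"PATHS.json has no entry {' -> '.join(keys)}") from err


# Placeholder dictionary: same nesting as the keys used in this notebook,
# but the paths themselves are made up.
example_paths = {
    "DATABASES": {
        "COASSEMBLY": "/placeholder/reads",
        "PSBO": {"ROOT": "/placeholder/psbo"},
    },
    "ANALYSIS_DATA": {"COASSEMBLY": {"MEGAHIT": "/placeholder/megahit"}},
}

print(get_path(example_paths, "ANALYSIS_DATA", "COASSEMBLY", "MEGAHIT"))
```

With the real dictionary, the call would look like `get_path(paths_dict, "DATABASES", "PSBO", "ROOT")`.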

## 1. Megahit

I tried preliminary assemblies using metaSPAdes; however, I quickly ran into memory issues (both with and without normalising the reads). I then tried a co-assembly with MEGAHIT, as it is very memory efficient.

In [11]:
## Path to the folder with the reads used for the co-assembly
DATA = paths_dict["DATABASES"]["COASSEMBLY"]

## Path to output folder 
ASSEMBLY = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MEGAHIT"]

In [None]:
%%bash -s "$DATA" "$ASSEMBLY"

R1s=$(ls "$1"/*F_1.fastq.gz | tr '\n' ',' | sed 's/.$//')
R2s=$(ls "$1"/*F_2.fastq.gz | tr '\n' ',' | sed 's/.$//')

sbatch ../../uppmax_scripts/script_bin/job_megahit.sh "$R1s" "$R2s" 1000 "$2"/out
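
The bash cell above builds the comma-separated forward and reverse read lists from `ls` output. As a rough sketch, the same lists could be built in Python with `pathlib`, with an added check that every forward file has a reverse mate in the same position. The function name `paired_read_lists` is hypothetical; the file name patterns are the ones used in the cell above.

```python
from pathlib import Path

FORWARD_SUFFIX = "1.fastq.gz"
REVERSE_SUFFIX = "2.fastq.gz"


def paired_read_lists(read_dir):
    """Return (forward, reverse) comma-separated file lists for paired reads.

    Hypothetical helper: mirrors the `ls ... | tr | sed` lines of the bash cell
    and additionally checks that forward and reverse files pair up in order.
    """
    forward = sorted(Path(read_dir).glob("*F_" + FORWARD_SUFFIX))
    reverse = sorted(Path(read_dir).glob("*F_" + REVERSE_SUFFIX))
    if not forward:
        raise FileNotFoundError(f"no forward read files found in {read_dir}")
    # Derive the expected mate of each forward file by swapping the suffix.
    expected = [
        f.with_name(f.name[: -len(FORWARD_SUFFIX)] + REVERSE_SUFFIX) for f in forward
    ]
    if expected != reverse:
        raise ValueError("forward and reverse read files do not pair up")
    return ",".join(map(str, forward)), ",".join(map(str, reverse))
```

The two returned strings correspond to `$R1s` and `$R2s` in the bash cell.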

Yay, that worked!! Before going any further, let's check whether we recover the psbO sequence of interest in our co-assembly. We start by simplifying the contig names.

In [5]:
## Path to output folder 
ASSEMBLY = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MEGAHIT"]

In [None]:
%%bash -s "$ASSEMBLY"

sbatch ../../uppmax_scripts/script_bin/job_anvio_simplify.sh "$1"/out/final.contigs.fa "$1"/out/final.contigs.fixed.fa

I replaced the contigs file manually by going to the output folder and running:

`mv final.contigs.fixed.fa final.contigs.fa`

## 2. Create contigs database


In [4]:
## Path to output folder 
ASSEMBLY = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MEGAHIT"]

In [None]:
%%bash -s "$ASSEMBLY" 

sbatch ../../uppmax_scripts/script_bin/job_anvio_contigs_db.sh "$1"/out/final.contigs.fa "$1"/out/final.contigs.db

## 3. Run HMM search for psbO gene
We search for the psbO gene with an HMM based on the Pfam accession PF01716.

In [6]:
## Path to psbO HMM profile
DATABASE = paths_dict["DATABASES"]["PSBO"]["ROOT"]

## Path to output folder 
ASSEMBLY = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MEGAHIT"]

In [None]:
%%bash -s "$DATABASE" "$ASSEMBLY"

sbatch ../../uppmax_scripts/script_bin/job_anvio_hmm_search.sh "$2"/out/final.contigs.db "$1"/psbOHmms

We recovered only 16 hits for the psbO gene. 

## 4. Extract HMM hits
Now we simply extract the HMM hits.

In [3]:
## Path to psbO HMM profile
DATABASE = paths_dict["DATABASES"]["PSBO"]["ROOT"]

## Path to output folder 
ASSEMBLY = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MEGAHIT"]

In [None]:
%%bash -s "$ASSEMBLY"

sbatch ../../uppmax_scripts/script_bin/job_anvio_hmm_extract.sh "$1"/out/final.contigs.db psbOHmms "$1"/out/psbO_top6filters.fasta
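
Once the job has finished, a quick sanity check is to count the records in the extracted FASTA file. If the extraction writes one record per hit (an assumption about the wrapper script, which is not shown here), the count should match the 16 hits reported above. A minimal sketch, with the hypothetical helper name `count_fasta_records` and made-up example records:

```python
def count_fasta_records(lines):
    """Count FASTA records in an iterable of lines (one record per '>' header)."""
    return sum(1 for line in lines if line.startswith(">"))


# Made-up example records, only to show the call.
example = [">hit_1\n", "MKTAY\n", ">hit_2\n", "MRAVL\n", "KLS\n"]
print(count_fasta_records(example))
```

For the real file, pass an open handle, for example `with open(fasta_path) as handle: count_fasta_records(handle)`, where `fasta_path` points to `psbO_top6filters.fasta` in the output folder.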

## 5. Run tree!
Now we can infer a phylogeny with the extracted psbO sequences to see where they fall in the psbO tree. First, let us combine our extracted sequences with some reference sequences.

In [4]:
## Path to psbO database, and sequence list
DATABASE = paths_dict["DATABASES"]["PSBO"]["ROOT"]

## Path to output folder 
OUT_DIR = paths_dict["ANALYSIS_DATA"]["COASSEMBLY"]["MEGAHIT"]

In [6]:
%%bash -s "$DATABASE" "$OUT_DIR"

## combine the reference sequences (mmetsp.fasta) with the psbO hits extracted from the co-assembly
cat "$1"/mmetsp.fasta "$2"/out/psbO_top6filters.fasta > "$2"/out/psbO_top6filters.reference.fasta
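
Concatenating two FASTA files can introduce duplicated sequence names, which alignment and tree inference tools generally do not handle well. The sketch below lists headers that occur more than once. `duplicated_headers` is a hypothetical helper and the example headers are made up; whether duplicates actually occur in these files is not known from this notebook.

```python
from collections import Counter


def duplicated_headers(lines):
    """Return the FASTA headers (without '>') that appear more than once."""
    headers = [line[1:].strip() for line in lines if line.startswith(">")]
    return sorted(name for name, n in Counter(headers).items() if n > 1)


# Made-up example: 'ref_A' appears twice.
example = [">ref_A\n", "MKT\n", ">contig_1\n", "MRA\n", ">ref_A\n", "MKT\n"]
print(duplicated_headers(example))
```

An empty list means every header in the combined file is unique.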

Align the sequences with MAFFT L-INS-i (trimming follows below)!

In [None]:
%%bash -s "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_mafft-linsi.sh "$1"/out/psbO_top6filters.reference.fasta "$1"/out/psbO_top6filters.reference.mafft.fasta

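Replace spaces and semicolons with underscores, and remove parentheses and periods, so that the sequence names are safe for tree inference.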

In [6]:
%%bash -s "$OUT_DIR"

# Replace spaces and semicolons with underscores, then delete parentheses and periods
tr ' ;' '__' < "$1"/out/psbO_top6filters.reference.mafft.fasta | \
    tr -d '().' \
    > "$1"/out/psbO_top6filters.reference.mafft.edit.fasta
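
Note that the `tr` pipeline above acts on every line of the file, sequence lines included. A minimal sketch of the same substitutions restricted to header lines is given here, in case sequence lines should be left untouched. The function names are hypothetical and the example header is made up.

```python
# Same substitutions as the tr pipeline: space and ';' become '_',
# while '(', ')' and '.' are deleted.
HEADER_TABLE = str.maketrans({" ": "_", ";": "_", "(": None, ")": None, ".": None})


def sanitize_header(header):
    """Apply the substitutions to a single header line."""
    return header.translate(HEADER_TABLE)


def sanitize_fasta(lines):
    """Yield lines with only the '>' headers rewritten; sequence lines pass through."""
    for line in lines:
        yield sanitize_header(line) if line.startswith(">") else line


# Made-up example header and aligned sequence line.
example = [">MMETSP0001 sp. (strain A);x\n", "MK-T.\n"]
print("".join(sanitize_fasta(example)), end="")
```

The trailing newline of each line is preserved, so the output can be written straight back to a file.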

Trim gently with trimAl.

In [None]:
%%bash -s "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/job_trimal_ssu.sh "$1"/out/psbO_top6filters.reference.mafft.edit.fasta "$1"/out/psbO_top6filters.reference.mafft.trimal.fasta

Submitted batch job 47719428 on cluster rackham


Now run treeeee!

In [None]:
%%bash -s "$OUT_DIR"

sbatch ../../uppmax_scripts/script_bin/jobraxml-ng.sh "$1"/out/psbO_top6filters.reference.mafft.trimal.fasta "$1"/out/psbO_top6filters

It doesn't look like we recovered the psbO of NEW. My guess is that NEW is simply not abundant enough for its nuclear genome to be assembled.