## Quality Check: Assemblies

This notebook works through two quality checks to run after your assemblies finish. We'll be looking at QUAST and CheckM2.


## Getting Started

Let's make sure the assembly ran with no issue. We're looking for log files and the final .fasta files containing the contigs.

In [None]:
%cd /xdisk/bhurwitz/YOUR_NAME/assembly/out_spades

## Checking the Metagenome Assembled Genomes (MAGs)

All of your metagenomes should have a results directory. If any are missing, this is a clue that your assemblies didn't all run. Let's check what we have so far in our SPAdes output.

In [None]:
!ls out_*
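If you have many metagenomes, scanning the `ls` output by eye gets tedious. Here is a minimal Python sketch that compares an accession list against the directories that exist. The function name `missing_outputs` and the `out_spades.` directory prefix are assumptions for illustration, so adjust the prefix to match how your output directories are actually named.

```python
from pathlib import Path


def missing_outputs(accessions, assembly_dir, prefix="out_spades."):
    """Return the accessions that have no results directory under assembly_dir.

    The directory name is assumed to be prefix + accession,
    e.g. out_spades.SRR14881731 (an assumption; change prefix if yours differs).
    """
    base = Path(assembly_dir)
    return [acc for acc in accessions if not (base / f"{prefix}{acc}").is_dir()]


def read_accessions(list_path):
    """Read one accession per line, skipping blank lines."""
    text = Path(list_path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, `missing_outputs(read_accessions("SRR_Acc_List.txt"), ".")` would list the accessions whose assemblies still need attention.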

Great! Looks like we have all of our metagenomes (if not, check the slurm logs to find out what went wrong). Now let's see if the assemblies completed. Let's say your accession is SRR14881731. Let's go into that directory and look at the output files.

In [None]:
# double check that this matches the auto-generated output directory name
%cd out_spades.SRR14881731
!ls -l

OK, that is a lot of information. Which files matter? First, we can look at the log file to see how the assembly went...

In [None]:
!cat spades.log

This file contains all of the output from the SPAdes assembler as it was running. Most of it isn't terribly useful if you just plan on using the tool's defaults, but it can be useful if you get a bad assembly and want to optimize the parameters. For now, let's just be happy that we see the very last line, "Thank you for using SPAdes!". This means that the assembly process ran all the way through without dying. Good news!

OK, so now we have an assembly. Where are the assembled contigs? They are all in a file called scaffolds.fasta. Let's make sure that file exists and that we have some contigs in it.
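The log check described above can also be scripted, which helps when you have dozens of assemblies. This is a rough sketch, not part of SPAdes itself: the function name `spades_finished` and the `tail` parameter are made up for illustration. It looks at the last few non-empty lines instead of only the very last one, on the assumption that some versions may print extra lines after the sign-off message.

```python
from pathlib import Path

# The sign-off message quoted in the text above.
SUCCESS_LINE = "Thank you for using SPAdes!"


def spades_finished(log_path, tail=5):
    """Return True if the sign-off message appears in the last `tail` non-empty lines."""
    text = Path(log_path).read_text(errors="replace")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return any(SUCCESS_LINE in line for line in lines[-tail:])
```

Calling `spades_finished("spades.log")` from inside a results directory should return `True` for a run that completed.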

In [None]:
# metaspades
!head scaffolds.fasta

Yeah! We got contigs. How big are the contigs? You can see this by looking at the description line in the FASTA file for each contig.

In [None]:
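# megahit
# list the ten largest contig lengths, taken from the len= field of each header
!grep ">" final.contigs.fa | sed 's/.*len=//' | sort -nr | head
# show the header and sequence of the contig with that length
!grep -A1 882650 final.contigs.fa | more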
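The same length lookup can be done in Python, which avoids writing a different `sed` expression for each assembler. The sketch below assumes two header styles: MEGAHIT headers carrying a `len=` field (as the `sed` command above implies) and SPAdes headers of the form `NODE_1_length_882650_cov_15.2`. The function name `contig_lengths` is hypothetical, and headers in any other style are simply skipped.

```python
import re

# Assumed header styles:
#   SPAdes:  >NODE_1_length_882650_cov_15.2
#   MEGAHIT: >k141_7 flag=1 multi=12.0000 len=882650
LENGTH_PATTERN = re.compile(r"(?:_length_|\blen=)(\d+)")


def contig_lengths(fasta_lines):
    """Return contig lengths parsed from FASTA header lines, largest first."""
    lengths = []
    for line in fasta_lines:
        if line.startswith(">"):
            match = LENGTH_PATTERN.search(line)
            if match:
                lengths.append(int(match.group(1)))
    return sorted(lengths, reverse=True)
```

Because an open file is iterable line by line, you could call this as `contig_lengths(open("final.contigs.fa"))[:10]` to get the ten largest contigs.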

Great news! We are getting contigs that are >140K in length. We can check what these are by running [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) on the web and selecting "Nucleotide BLAST". First, let's grab the first lines of the contig sequence to paste in.

In [None]:
!head -500 scaffolds.fasta
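`head -500` grabs a fixed number of lines, which may cut a contig short or run into the next one. As an alternative, here is a small sketch that pulls out exactly the first record of a FASTA file. The function name `first_record` and the `max_bases` option are illustrative assumptions, not part of any tool used here.

```python
def first_record(fasta_lines, max_bases=None):
    """Return (header, sequence) for the first record in a FASTA file.

    max_bases optionally trims the sequence so it is easier to paste
    into a web form.
    """
    header = None
    chunks = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # reached the second record, so stop reading
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    sequence = "".join(chunks)
    if max_bases is not None:
        sequence = sequence[:max_bases]
    return header, sequence
```

For instance, `first_record(open("scaffolds.fasta"), max_bases=5000)` would return the first header together with the first 5,000 bases of that contig.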

What organism do you see? Is this contamination? Or a real microbe in your microbiome?

#### Note: SPAdes writes its contigs from longest to shortest, so the first contig is the largest

## QUAST

How good are our assemblies? We can check the quality by running tools that look at the contigs produced by our assembly algorithms. 

Let's see what the quality of this assembly looks like using another bioinformatics tool called QUAST. We can run this tool from the assembly directory on multiple assemblies at once by submitting an sbatch script, which we write in the cells that follow.

You can read the documentation [here](https://quast.sourceforge.net/docs/manual.html).

For this tool we'll use our metaSPAdes output.

In [None]:
%cd /xdisk/bhurwitz/YOUR_NAME/check_assembly
!mkdir out_quast

In [None]:
!nano ~/scripts/run_quast.sh

In [None]:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=12:00:00
#SBATCH --partition=standard
#SBATCH --account=bhurwitz
#SBATCH --array=0-46                           # 0 to (number of metagenomes - 1)
#SBATCH --output=Job-%a.out
#SBATCH --cpus-per-task=24                    # num CPUs per task
#SBATCH --mem=32G                             # total memory per node

pwd; hostname; date

names=($(cat /xdisk/bhurwitz/YOUR_NAME/bio_pipeline/data/SRR_Acc_List.txt))

readdir="/xdisk/bhurwitz/YOUR_NAME/assembly/out_spades"
outdir="/xdisk/bhurwitz/YOUR_NAME/check_assembly/out_quast"

# the scaffolds.fasta path below might change
apptainer run /contrib/singularity/shared/bhurwitz/quast:5.2.0--py39pl5321h4e691d4_3.sif metaquast.py -t 24 \
        -o ${outdir}/${names[${SLURM_ARRAY_TASK_ID}]} \
        -m 500 $readdir/${names[${SLURM_ARRAY_TASK_ID}]}/scaffolds.fasta
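The script above uses `SLURM_ARRAY_TASK_ID` as a zero-based index into the accession list, so the `--array` range has to run from 0 to one less than the number of metagenomes. A quick way to sanity-check the range before submitting is sketched here. Both function names are hypothetical helpers, not part of SLURM.

```python
def array_spec(accessions):
    """Return the --array range that covers every accession exactly once."""
    count = len(accessions)
    if count == 0:
        raise ValueError("accession list is empty")
    return f"0-{count - 1}"


def accession_for_task(accessions, task_id):
    """Mimic ${names[${SLURM_ARRAY_TASK_ID}]} from the bash script."""
    return accessions[task_id]
```

With 47 accessions, `array_spec` gives `0-46`, matching the header in the script. A range that is too long produces tasks with an empty accession name, and one that is too short silently skips samples.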

#### Let's check out the assembly stats from QUAST

In [None]:
%cd /xdisk/bhurwitz/YOUR_NAME/check_assembly/out_quast
!cat */report.txt #prints the report for every sample
!cat report.txt #prints only the report in the directory you're in (cd into a sample directory first)
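One of the statistics QUAST reports is N50: the contig length at which contigs of that size or longer contain at least half of the assembled bases. To make that definition concrete, here is a minimal sketch that computes it from a list of contig lengths. This illustrates the definition and is not QUAST's own code. The `min_length` parameter is an assumption added to mirror the `-m 500` threshold used in the script.

```python
def n50(lengths, min_length=0):
    """Return the N50 of the contigs that are at least min_length long.

    Returns 0 when no contigs remain after filtering.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return 0
    half_total = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half_total:
            return length
    return kept[-1]  # not reached in practice; keeps the return type consistent
```

A larger N50 generally means the assembly is less fragmented, but it says nothing about correctness, so read it alongside the other numbers in the report.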

## CheckM2

CheckM2 is another tool that produces a quality report on the assembled contigs. It estimates how complete and how contaminated each input genome is.

The documentation can be found [here](https://github.com/chklovski/CheckM2).

For this tool we'll be using our MEGAHIT output.

### CheckM2 database file

This tool requires a database file to run. More information on downloading the database can be found in the documentation. The current database has been downloaded and saved in the following location:

/groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd

In [None]:
%cd /xdisk/bhurwitz/YOUR_NAME/check_assembly
!mkdir out_checkm2

In [None]:
!nano ~/scripts/run_checkm2.sh

In [None]:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=12:00:00
#SBATCH --partition=standard
#SBATCH --account=bhurwitz
#SBATCH --array=0-4                           # 0 to (number of metagenomes - 1)
#SBATCH --output=Job-%a.out
#SBATCH --cpus-per-task=24                    # num CPUs per task
#SBATCH --mem=48G                             # total memory per node

# Fix the resource allocation 

pwd; hostname; date

names=($(cat /xdisk/bhurwitz/YOUR_NAME/bio_pipeline/data/SRR_Acc_List.txt))

readdir="/xdisk/bhurwitz/YOUR_NAME/assembly/out_megahit"
outdir="/xdisk/bhurwitz/YOUR_NAME/check_assembly/out_checkm2"

# the final.contigs.fa file name will change
apptainer run /contrib/singularity/shared/bhurwitz/checkm2\:1.0.1--pyh7cba7a3_0.sif checkm2 predict --threads 24 \
        --input $readdir/${names[${SLURM_ARRAY_TASK_ID}]}/final.contigs.fa \
        --output-directory $outdir/${names[${SLURM_ARRAY_TASK_ID}]} \
        --database_path /groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd

#### Let's check out the assembly stats from CheckM2

In [None]:
%cd /xdisk/bhurwitz/YOUR_NAME/check_assembly/out_checkm2
!cat */checkm2.log #prints the CheckM2 log for every sample
!cat checkm2.log #prints only the log in the directory you're in (cd into a sample directory first)
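Besides the log, CheckM2 should also write a tab-separated `quality_report.tsv` into each output directory. The sketch below reads that table and lists the entries that pass a quality cutoff. It assumes the file has `Name`, `Completeness`, and `Contamination` columns, so check the header of your own file first. The function names and the default cutoffs of 90 and 5 are illustrative choices, not CheckM2 settings.

```python
import csv


def load_quality_report(path):
    """Read a tab-separated quality report into a list of dicts (one per row)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def passing(rows, min_completeness=90.0, max_contamination=5.0):
    """Return the names of rows that meet both cutoffs.

    Column names are assumed to be Name, Completeness, Contamination.
    """
    return [
        row["Name"]
        for row in rows
        if float(row["Completeness"]) >= min_completeness
        and float(row["Contamination"]) <= max_contamination
    ]
```

For example, `passing(load_quality_report("SRR14881731/quality_report.tsv"))` would return the names that clear both cutoffs for that sample.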