# Assembly

This notebook walks through the assembly workflow using the metaspades and megahit tools. Both tools produce contigs (contiguous DNA sequences assembled from overlapping reads).

1. An introduction to [Metaspades](https://cab.spbu.ru/files/release3.12.0/manual.html)
2. An introduction to [Megahit](https://github.com/voutcn/megahit)

## Metaspades

In this section we are going to assemble our reads into contigs. Contigs are contiguous fragments of DNA that represent part of a genome. If you are lucky, you might even be able to assemble an entire genome in a single contig! But most of the time, contigs cover only part of a genome, and the gaps between contigs prevent you from assembling the entire genome.

You will assemble your reads with a program called spades, which includes a metaspades.py program for assembling metagenomes composed of multiple organisms.

Note that this assembler is memory intensive: large files take a lot of resources and time. A common error with large files is running out of memory before the job completes on the HPC. The HPCs allot resources differently, and we can modify our script if a job requires more memory.

Puma allows up to 94 CPUs at 5 GB per CPU <br>
Ocelote allows up to 28 CPUs at 6 GB per CPU
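
The total memory a job receives is the number of CPUs multiplied by the memory per CPU, so raising `--cpus-per-task` is one way to get more memory. The sketch below works through that arithmetic for the figures quoted above; the helper name `total_mem_gb` is made up for illustration and is not part of SLURM.

```shell
#!/bin/bash
# Hypothetical helper: total memory (GB) = CPUs x GB per CPU.
total_mem_gb() {
    local cpus=$1
    local gb_per_cpu=$2
    echo $(( cpus * gb_per_cpu ))
}

# Maximum figures quoted above for each cluster
echo "Puma:    $(total_mem_gb 94 5) GB"
echo "Ocelote: $(total_mem_gb 28 6) GB"

# Request used by the scripts in this notebook (28 CPUs at 5 GB each)
echo "Request: $(total_mem_gb 28 5) GB"
```

If an assembly runs out of memory with the 28-CPU request, this suggests there is still headroom on Puma before you need a different kind of node.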

This [HPC documentation](https://public.confluence.arizona.edu/display/UAHPC/Running+Jobs+with+SLURM) is handy to have as you edit your scripts and use different HPCs within UA.

### Data Management

We'll be creating two assemblies based on the trimmed reads we produced by running trimmomatic. Let's set up the output directories ahead of time.

In [None]:
!mkdir -p /xdisk/bhurwitz/YOUR_NAME/assembly/out_spades
!mkdir -p /xdisk/bhurwitz/YOUR_NAME/assembly/out_megahit

In [None]:
!nano run_metaspades.sh

Paste in the following, replacing YOUR_NAME with your own directory name:

In [None]:
#!/bin/bash
#SBATCH --job-name=run_metaspades
#SBATCH --account=bhurwitz
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=28
#SBATCH --mem-per-cpu=5gb
#SBATCH --array=0-46

# Build an array of sample names from the SRR accession list
names=($(cat /xdisk/bhurwitz/YOUR_NAME/bio_pipeline/data/SRR_Acc_List.txt))

# Input (trimmed reads) and output (assembly) directories; bash allows no spaces around "="
readdir="/xdisk/bhurwitz/YOUR_NAME/trimming/trimmed_reads"
assemdir="/xdisk/bhurwitz/YOUR_NAME/assembly/out_spades"

# Pass the thread count explicitly (-t); otherwise SPAdes falls back to its
# default of 16 threads and leaves some of the requested CPUs idle
apptainer run /contrib/singularity/shared/bhurwitz/spades:3.15.5--h95f258a_1.sif metaspades.py \
   -t ${SLURM_CPUS_PER_TASK} \
   -o ${assemdir}/${names[${SLURM_ARRAY_TASK_ID}]} \
   --pe1-1 ${readdir}/${names[${SLURM_ARRAY_TASK_ID}]}_1.fastq.gz \
   --pe1-2 ${readdir}/${names[${SLURM_ARRAY_TASK_ID}]}_2.fastq.gz
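
The `--array=0-46` line assumes the accession list holds 47 entries, and each array task picks its own sample through `SLURM_ARRAY_TASK_ID`. The sketch below shows that lookup on a small stand-in list; the accession IDs are made up, and the loop only imitates what SLURM does when it launches one task per index.

```shell
#!/bin/bash
# Build a small stand-in accession list (IDs are made up for illustration).
list=$(mktemp)
printf 'SRR0000001\nSRR0000002\nSRR0000003\n' > "$list"

# Same pattern as the job script: one array element per line of the list.
names=($(cat "$list"))

# Bash arrays start at 0, so the last valid index is (number of entries - 1).
last_index=$(( ${#names[@]} - 1 ))
echo "#SBATCH --array=0-${last_index}"

# SLURM sets SLURM_ARRAY_TASK_ID for each task; fake it here to show the lookup.
for SLURM_ARRAY_TASK_ID in $(seq 0 "$last_index"); do
    echo "task ${SLURM_ARRAY_TASK_ID} -> ${names[${SLURM_ARRAY_TASK_ID}]}"
done

rm -f "$list"
```

Running the same count against your own `SRR_Acc_List.txt` is a quick way to check that the `--array` range matches the number of samples you actually have.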

In [None]:
!sbatch ~/run_metaspades.sh

## Megahit

In [None]:
!nano run_megahit.sh

In [None]:
#!/bin/bash
#SBATCH --job-name=run_megahit
#SBATCH --account=bhurwitz
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=28
#SBATCH --mem-per-cpu=5gb
#SBATCH --array=0-46


# Build an array of sample names from the SRR accession list
names=($(cat /xdisk/bhurwitz/YOUR_NAME/bio_pipeline/data/SRR_Acc_List.txt))

# Input (trimmed reads) and output (assembly) directories; bash allows no spaces around "="
readdir="/xdisk/bhurwitz/YOUR_NAME/trimming/trimmed_reads"
assemdir="/xdisk/bhurwitz/YOUR_NAME/assembly/out_megahit"

# Set the thread count explicitly (-t) so megahit matches the CPUs requested from SLURM
apptainer run /contrib/singularity/shared/bhurwitz/megahit:1.2.9--h5b5514e_3.sif megahit \
   -t ${SLURM_CPUS_PER_TASK} \
   -1 ${readdir}/${names[${SLURM_ARRAY_TASK_ID}]}_1.fastq.gz \
   -2 ${readdir}/${names[${SLURM_ARRAY_TASK_ID}]}_2.fastq.gz \
   -o ${assemdir}/${names[${SLURM_ARRAY_TASK_ID}]}
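
Once the jobs finish, each sample's output directory should contain the assembled contigs as a FASTA file: metaspades writes `contigs.fasta` and megahit writes `final.contigs.fa`. A quick sanity check is to count the contigs, since every FASTA record starts with a `>` header line. The sketch below does this on a tiny made-up FASTA; the helper name `count_contigs` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count FASTA records by counting header lines.
count_contigs() {
    grep -c '^>' "$1"
}

# Tiny made-up FASTA standing in for an assembler's output.
demo=$(mktemp)
printf '>contig_1\nACGTACGT\n>contig_2\nGGCCTTAA\n' > "$demo"

n_contigs=$(count_contigs "$demo")
echo "contigs: ${n_contigs}"

rm -f "$demo"

# On real output this might look like the following, where SAMPLE is one
# accession from your list (paths follow the scripts in this notebook):
#   count_contigs /xdisk/bhurwitz/YOUR_NAME/assembly/out_spades/SAMPLE/contigs.fasta
#   count_contigs /xdisk/bhurwitz/YOUR_NAME/assembly/out_megahit/SAMPLE/final.contigs.fa
```

Comparing the two counts for the same sample gives a first rough impression of how differently the two assemblers split up the same reads.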