# Assembly

**Megahit**
https://www.metagenomics.wiki/tools/assembly/megahit
- de novo assembly (without a reference genome)
- assembles overlapping short reads into contigs to reconstruct one 'metagenome'
- assembled contigs are stored in a FASTA file

### Installation

In [None]:
module load miniconda/22.11.1-1

In [None]:
conda create -n assembly
#dir=/home/brooke_sienkiewicz_student_uml_edu/.conda/envs/assembly

In [None]:
#conda info --envs
##lists all your conda envs
conda activate assembly

In [None]:
#installation - just do the first time upon creating assembly env
conda install -c bioconda megahit
conda install -c bioconda quast python=2.7

### MCAV

#### MCAV - healthy, 2019

In [None]:
# Using trimmed, qc seqs from redo_auto_detect_01312024 folder
# 1)remove host from sample reads
# 2)concatenate all f and r seqs into single file (1 for f, 1 for r)
# 3)ASSEMBLE reads into contigs (contiguous sequences: overlapping reads are joined into longer, gap-free sequences, so larger portions of genomes, if not all, end up together in one sequence)
# 4)remove ITS2 seqs from assembled contigs (& remove adapters) ... should try to perform this on raw reads so we end up with just one final contig file

# can definitely combine these into 1 or 2 batch scripts

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=180G  # Requested Memory
#SBATCH -p cpu-long  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/assembly/healthy_2019_mcav/slurm-removal-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate anvio-8

# 1)remove host from sample reads
# Host seq removal - Thijs's script https://github.com/ThijsSt/SCTLD-metagenomes/blob/main/Quality_control_metagenomes.ipynb

#set parameters:
SAMPLENAME="healthy_2019_mcav"
GENOME="mcav"
READSPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/trimmed/redo_auto_detect_01312024
INDEX="$GENOME"_DB
INPUTPATH="/project/pi_sarah_gignouxwolfsohn_uml_edu/Reference_genomes/Mcav_genome"
WORKINGPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/assembly/healthy_2019_mcav/host_removed'

#build a bowtie2 index from a known genome
bowtie2-build $INPUTPATH/Mcavernosa_July2018.fasta $INPUTPATH/"$INDEX"

#loop through samples
while IFS= read -r SAMPLEID; do

#re-align reads back to the index
bowtie2 -p 8 -x $INPUTPATH/$INDEX -1 "$READSPATH"/"${SAMPLEID}_R1_001_val_1.fq" -2 "$READSPATH"/"${SAMPLEID}_R2_001_val_2.fq" -S $WORKINGPATH/"${SAMPLEID}"_mapped_and_unmapped.sam

#convert sam file from bowtie to a bam file for processing
samtools view -bS $WORKINGPATH/"${SAMPLEID}"_mapped_and_unmapped.sam > $WORKINGPATH/"${SAMPLEID}"_mapped_and_unmapped.bam

#extract only the read pairs in which neither mate maps to the host genome
#-f 12 keeps records with both flag 4 (read unmapped) and flag 8 (mate unmapped) set; -F 256 drops secondary alignments
samtools view -b -f 12 -F 256 $WORKINGPATH/"${SAMPLEID}"_mapped_and_unmapped.bam > $WORKINGPATH/"${SAMPLEID}"_bothReadsUnmapped.bam

# sorts the file so both mates are together and then extracts them back as .fastq files
samtools sort -n -m 5G -@ 2 $WORKINGPATH/"${SAMPLEID}"_bothReadsUnmapped.bam -o $WORKINGPATH/"${SAMPLEID}"_bothReadsUnmapped_sorted.bam
samtools fastq -@ 8 $WORKINGPATH/"${SAMPLEID}"_bothReadsUnmapped_sorted.bam \
    -1 $WORKINGPATH/"${SAMPLEID}"_host_removed_R1.fastq \
    -2 $WORKINGPATH/"${SAMPLEID}"_host_removed_R2.fastq \
    -0 /dev/null -s /dev/null -n
#outputs are written to $WORKINGPATH (host_removed), which is where the concatenation script looks for them

done < "healthy_2019_MCAV_sampleids"
#run in dir with sampleids txt file (~/mcav/assembly/healthy_2019_mcav)

# JOB-ID: 19358781
# bash script file name: brooke/mcav/assembly/healthy_2019_mcav/removal
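
The `-f 12 -F 256` filter used in the host-removal step can be read as bit arithmetic on the SAM FLAG field. A minimal sketch, assuming the standard SAM flag bits (4 = read unmapped, 8 = mate unmapped, 256 = secondary alignment); `keep_read` is a hypothetical helper for illustration, not part of samtools:

```shell
# Sketch of the test behind "samtools view -f 12 -F 256", using shell bit arithmetic.
REQUIRED=12    # 4 (read unmapped) + 8 (mate unmapped)
EXCLUDED=256   # secondary alignment

keep_read() {
    flag=$1
    # keep when all REQUIRED bits are set and no EXCLUDED bit is set
    [ $(( flag & REQUIRED )) -eq "$REQUIRED" ] && [ $(( flag & EXCLUDED )) -eq 0 ]
}

# 77 and 141: both mates unmapped; 73: only the mate unmapped;
# 99: properly mapped pair; 333: 77 plus the secondary-alignment bit
for flag in 77 141 73 99 333; do
    if keep_read "$flag"; then
        echo "$flag keep"
    else
        echo "$flag drop"
    fi
done
```

Only records where both mates failed to align to the host survive, which is why the extracted FASTQ files can be treated as host-removed pairs.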

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=180G  # Requested Memory
#SBATCH -p cpu-long  # Partition
#SBATCH -t 56:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/assembly/healthy_2019_mcav/slurm-assembly-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate assembly
# 2)concatenate all f and r seqs into single file (1 for f, 1 for r)

READSPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/assembly/healthy_2019_mcav/host_removed'
SAMPLENAME="healthy_2019_mcav"
OUTDIR=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/assembly/healthy_2019_mcav

# Read the sample IDs from the file
while IFS= read -r SAMPLEID; do
    # Construct the file paths for forward and reverse reads
    FORWARD_READ="$READSPATH/${SAMPLEID}_host_removed_R1.fastq"
    REVERSE_READ="$READSPATH/${SAMPLEID}_host_removed_R2.fastq"

    # Check if the files exist before concatenating
    if [ -e "$FORWARD_READ" ]; then
        cat "$FORWARD_READ" >> "$OUTDIR/${SAMPLENAME}_reads_R1_ALL.fastq"
    else
        echo "Forward read file not found for sample $SAMPLEID"
    fi

    if [ -e "$REVERSE_READ" ]; then
        cat "$REVERSE_READ" >> "$OUTDIR/${SAMPLENAME}_reads_R2_ALL.fastq"
    else
        echo "Reverse read file not found for sample $SAMPLEID"
    fi
done < "$OUTDIR/healthy_2019_MCAV_sampleids"

# 3)ASSEMBLE reads into contigs (contiguous sequences built by joining overlapping reads, with no gaps)
megahit --presets meta-large \
-1 "$OUTDIR"/"$SAMPLENAME"_reads_R1_ALL.fastq \
-2 "$OUTDIR"/"$SAMPLENAME"_reads_R2_ALL.fastq \
--keep-tmp-files \
-o "$OUTDIR"/megahit_host_removed --out-prefix "$SAMPLENAME"
#add --continue to the command above to resume an interrupted run
#megahit has to make the output directory itself; it will fail if the directory already exists

# try metavelvet next? 

# JOB-ID: 19388626
# bash script file name: $OUTDIR/assembly

Total time elapsed: 46 hrs.
Looks like 190 GB of memory is enough, and 24 CPUs ran at 90% efficiency on the last part of the run.
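
One thing to watch in the concatenation step: the loop appends with `>>`, so rerunning the script would add the same reads a second time. A minimal sketch of a rerun-safe version on toy data, with a check that R1 and R2 hold the same number of records. The temporary directory, sample IDs, and helper names (`concat_reads`, `count_records`) are made up for illustration:

```shell
# Toy demonstration in a temporary directory; file names and sample IDs are made up.
tmp=$(mktemp -d)
printf 'S1\nS2\n' > "$tmp/sampleids"
for id in S1 S2; do
    printf '@%s/1\nACGT\n+\nIIII\n' "$id" > "$tmp/${id}_host_removed_R1.fastq"
    printf '@%s/2\nTGCA\n+\nIIII\n' "$id" > "$tmp/${id}_host_removed_R2.fastq"
done

concat_reads() {
    # $1 = mate label (R1 or R2); writes $tmp/ALL_<mate>.fastq
    out="$tmp/ALL_$1.fastq"
    : > "$out"                      # start from an empty file on every run
    while IFS= read -r id; do
        cat "$tmp/${id}_host_removed_$1.fastq" >> "$out"
    done < "$tmp/sampleids"
}

count_records() {
    # FASTQ records are 4 lines each (assumes sequences are not line-wrapped)
    echo $(( $(wc -l < "$1") / 4 ))
}

concat_reads R1
concat_reads R2
concat_reads R1                     # rerunning does not duplicate reads
n1=$(count_records "$tmp/ALL_R1.fastq")
n2=$(count_records "$tmp/ALL_R2.fastq")
echo "R1 records: $n1, R2 records: $n2"
[ "$n1" -eq "$n2" ] || echo "WARNING: R1/R2 record counts differ" >&2
```

Mismatched counts between the two files would suggest a missing sample file, which the assembler would otherwise only report much later in the run.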

In [None]:
# NEED TO: Rename concatenated seqs so it can just be done in 1 step instead of renaming all individual files
# (the QC step adds "val_1/2" to each seq file name)
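
If the individual files ever do need renaming, shell parameter expansion can strip the `_val_1`/`_val_2` suffix. A minimal sketch as a dry run that only prints the `mv` commands; `strip_val_suffix` and the `sampleA` file names are hypothetical:

```shell
# Hypothetical helper: map "<id>_R1_001_val_1.fq" to "<id>_R1_001.fq".
strip_val_suffix() {
    name=$1
    case $name in
        *_val_1.fq) echo "${name%_val_1.fq}.fq" ;;
        *_val_2.fq) echo "${name%_val_2.fq}.fq" ;;
        *)          echo "$name" ;;          # leave other names unchanged
    esac
}

# Dry run: print the rename commands instead of executing them.
for f in sampleA_R1_001_val_1.fq sampleA_R2_001_val_2.fq; do
    echo mv "$f" "$(strip_val_suffix "$f")"
done
```

Removing the leading `echo` would perform the renames, so it is worth reading the printed commands first.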

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=180G  # Requested Memory
#SBATCH -p cpu-long  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/assembly/healthy_2019_mcav/slurm-%j.out  # %j = job ID


#load modules
module load miniconda/22.11.1-1
conda activate cutadaptenv

# 4)remove ITS2 seqs from assembled contigs (& remove adapters)

# Set your input and output files
SAMPLENAME="healthy_2019_mcav"
INPUTDIR="/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/assembly/healthy_2019_mcav/megahit_host_removed"
OUTPUTDIR="/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/assembly/healthy_2019_mcav"


#READSPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/trimmed/redo_auto_detect_01312024'

input_fasta="$SAMPLENAME.contigs.fa"
output_fasta="${SAMPLENAME}_filtered.contigs.fasta"
#note: braces go after the $ ("${SAMPLENAME}_..."); "{$SAMPLENAME}_..." keeps the braces as literal characters in the file name

# Set your primer sequences
forward_primer="GAATTGCAGAACTCCGTGAACC"
reverse_primer="CGGGTTCWCTTGTYTGACTTCATGC"

# Verify path and input dir - only need for troubleshooting
echo "Working Directory: $(pwd)"
#ls -l $INPUTDIR

# Run cutadapt
# --discard-trimmed drops every contig in which a primer was found
cutadapt \
  -g "$forward_primer" \
  -a "$reverse_primer" \
  --discard-trimmed \
  -o "$OUTPUTDIR/$output_fasta" \
  "$INPUTDIR/$input_fasta"
  
ls -l $OUTPUTDIR
#check results dir to see if it was successful in creating output file 

# JOB-ID: 19420673
# bash script file name: brooke/mcav/assembly/healthy_2019_mcav/ITS2_trim


#final assembled contigs: healthy_2019_mcav_filtered.contigs.fasta in ~/brooke/mcav/assembly/healthy_2019_mcav
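
A note on building output file names from variables: in bash, `"{$VAR}_x"` and `"${VAR}_x"` expand differently, which is a likely reason an output-name variable would not behave as expected. A minimal sketch (the variable names `wrong` and `right` are only for illustration):

```shell
SAMPLENAME="healthy_2019_mcav"

wrong="{$SAMPLENAME}_filtered"     # braces are literal characters here
right="${SAMPLENAME}_filtered"     # braces delimit the variable name

echo "$wrong"
echo "$right"
```

The first form yields a file name with literal braces in it, while the second yields the intended `healthy_2019_mcav_filtered` prefix.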

# Quality Check
**Metaquast**
- quality assessment of metagenome assemblies (contigs); no reference genome included here

https://quast.sourceforge.net/docs/manual.html#sec1

How to interpret the quality results?
- check how many large contigs you have (>1000 bp)
- contigs were not mapped to a reference genome, so only reference-free statistics are reported
- right now it is just helpful to see the length and quality of contigs; maybe can reassess after mapping back to metagenome?

Cite metaquast: https://quast.sourceforge.net/publications.html
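
The "how many large contigs" check can also be done directly on a FASTA file, independent of the QUAST report. A minimal sketch using awk on made-up toy data; `count_long_contigs` is a hypothetical helper, and the 1000 bp cutoff is simply the threshold mentioned above:

```shell
# Count sequences of at least MIN bp in a FASTA file (handles wrapped sequences).
count_long_contigs() {
    # $1 = FASTA file, $2 = minimum length in bp
    awk -v min="$2" '
        /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
             { len += length($0) }
        END  { if (seen && len >= min) n++; print n + 0 }
    ' "$1"
}

# Toy input: three contigs of 1200, 400 and 1000 bp (made-up data).
toy=$(mktemp)
{
    echo ">c1"; printf '%1200s\n' '' | tr ' ' 'A'
    echo ">c2"; printf '%400s\n'  '' | tr ' ' 'C'
    echo ">c3"; printf '%1000s\n' '' | tr ' ' 'G'
} > "$toy"

count_long_contigs "$toy" 1000
```

On the real assembly, the first argument would be the filtered contigs file, and the result should be comparable to the "# contigs (>= 1000 bp)" row of the QUAST report.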

In [None]:
module load miniconda/22.11.1-1
conda activate assembly

In [None]:
metaquast healthy_2019_mcav_filtered.contigs.fasta -o quast_output

In [None]:
# mcav, healthy, 2019 sample assembly (host and ITS2 removal)

Statistics without reference	
# contigs 	1040595
# contigs (>= 0 bp)	2661767
# contigs (>= 1000 bp)	448229
# contigs (>= 5000 bp)	42749
# contigs (>= 10000 bp)	9799
# contigs (>= 25000 bp)	682
# contigs (>= 50000 bp)	47
Largest contig	437957
Total length	1591373878
Total length (>= 0 bp)	2118369555
Total length (>= 1000 bp)	1186762994
Total length (>= 5000 bp)	373469189
Total length (>= 10000 bp)	153333034
Total length (>= 25000 bp)	25000783
Total length (>= 50000 bp)	5240913
N50	2168
N75	986
L50	176435
L75	455045
GC (%)	...
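
For reading the N50 and L50 rows: with contigs sorted from longest to shortest, L50 is the number of contigs needed to reach half of the total assembly length, and N50 is the length of the last contig added. In the report above, that means the 176435 longest contigs (each at least 2168 bp) hold half of the total length. A minimal sketch of the calculation on made-up lengths; `n50_l50` is a hypothetical helper, and QUAST's own numbers may differ slightly because it applies a minimum contig length before computing them:

```shell
# n50_l50 reads one contig length per line on stdin and prints "N50 L50".
# Hypothetical helper for illustration; QUAST computes these for you.
n50_l50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= total / 2) { print len[i], i; exit }
            }
        }'
}

# Toy lengths (made-up): total = 100, so the halfway point is 50.
printf '%s\n' 40 25 15 10 10 | n50_l50
```

Here the two longest contigs (40 + 25 = 65) pass the halfway point, so the sketch reports an N50 of 25 and an L50 of 2.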