# Genome assembly and quality control of assembly

## Required tools

After the previous quality control (QC) steps, our reads are still in raw format, but we now have a summary of their quality. We have also removed regions with poor sequencing quality (where we cannot be sure that the assigned nucleotides are right) and the adapter sequences that were added to our DNA for sequencing. 

In this series of steps, we will assemble the reads using a tool called `shovill`. Once again, we will mimic how to run the commands on the **Compute Canada (CC)** cluster. 

For these tutorials, tools will be made available using singularity containers, which can be run using the command `singularity run tool_image`. These tools have been made available in the environment already, so there is no need to download them.

Tools used in this tutorial:
- shovill
- singularity

We will first explore the structure of our environment and the folders available. The tools downloaded for the tutorial are in the `tools` folder, and the `tutorials` directory holds the primary datasets as well as the results of our runs. 

```
.
|-- tools
`-- tutorials
    |-- raw_reads
    |-- results_qc
    `-- trimmed_reads
```

For downstream assembly, we will use the curated reads contained in the `trimmed_reads` subdirectory (n=20 files containing paired reads for 10 isolates of _P. aeruginosa_). 

In [23]:
# source PATH to use module function
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh

In [24]:
cd
ls tutorials/trimmed_reads

ERR10479510_R1.fastq.gz  ERR10479513_R2.fastq.gz  ERR10479517_R1.fastq.gz
ERR10479510_R2.fastq.gz  ERR10479514_R1.fastq.gz  ERR10479517_R2.fastq.gz
ERR10479511_R1.fastq.gz  ERR10479514_R2.fastq.gz  ERR10479518_R1.fastq.gz
ERR10479511_R2.fastq.gz  ERR10479515_R1.fastq.gz  ERR10479518_R2.fastq.gz
ERR10479512_R1.fastq.gz  ERR10479515_R2.fastq.gz  ERR10479519_R1.fastq.gz
ERR10479512_R2.fastq.gz  ERR10479516_R1.fastq.gz  ERR10479519_R2.fastq.gz
ERR10479513_R1.fastq.gz  ERR10479516_R2.fastq.gz
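Before assembling, it can help to confirm that every forward file has its reverse partner. The following is a minimal sketch, assuming the `_R1.fastq.gz` / `_R2.fastq.gz` naming shown in the listing above; the function name `check_pairs` is hypothetical.

```shell
# Report R1 files that lack a matching R2 file in a directory.
# Assumes the <isolate>_R1.fastq.gz / <isolate>_R2.fastq.gz naming used in this tutorial.
check_pairs() {
    local dir=$1 missing=0 r1 r2
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue              # skip the literal pattern when nothing matches
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz     # expected partner file
        if [ ! -e "$r2" ]; then
            echo "Missing pair for $(basename "$r1")"
            missing=$((missing + 1))
        fi
    done
    echo "Unpaired R1 files: $missing"
}
```

It could be called as `check_pairs "$HOME/tutorials/trimmed_reads"`; a count of 0 means every isolate has both read files.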


## De-novo assembly with Shovill

Shovill is a tool that optimizes the assembler `SPAdes` to minimize the run time while maintaining the quality of the assembly. It generates a draft genome using heuristic algorithms and works _de novo_, that is, without a reference genome to guide the process. See the GitHub repositories of [shovill](https://github.com/tseemann/shovill) and [SPAdes](https://github.com/ablab/spades) for more details. 

Shovill is not available as a pre-installed module in **CC**, so we must use another strategy. The easiest one is to use a container. We could install it from a **Docker** container, but Docker is generally not suitable for high performance computing (HPC) clusters like **Compute Canada** because running it requires root privileges. Thus, many HPC clusters allow the use of **Singularity** images as an alternative (for more information about containerization, see https://www.melbournebioinformatics.org.au/tutorials/tutorials/docker/docker/).

A useful repository of **Singularity** images is located at https://depot.galaxyproject.org/singularity/

<font color='darkred'>_**Notes for Compute Canada:**_ </font>  
- Singularity needs to be loaded into the system. On the CIDGOH servers, it is loaded by default. Run the following code to make singularity available in your Compute Canada session. 

>    module load singularity

- We have already downloaded the **Shovill** singularity container. In CC, you may need to download it yourself with the following command, which pulls a container from a repository into your local system.

>    singularity pull tools/shovill_1.1.sif https://depot.galaxyproject.org/singularity/shovill%3A1.1.0--hdfd78af_1
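To avoid downloading the image again when it is already present, the pull can be wrapped in a small check. This is only a sketch: `ensure_image` is a hypothetical helper name, and it assumes `singularity` is on the PATH whenever a download is actually needed.

```shell
# Pull a Singularity image only when it is not already on disk.
# `ensure_image` is a hypothetical helper name.
ensure_image() {
    local image=$1 url=$2
    if [ -f "$image" ]; then
        echo "Found $image, skipping download"
    else
        singularity pull "$image" "$url"
    fi
}
```

With the paths from this tutorial, the call would be `ensure_image tools/shovill_1.1.sif https://depot.galaxyproject.org/singularity/shovill%3A1.1.0--hdfd78af_1`.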


In [25]:
# executing shovill
singularity exec tools/shovill_1.1.sif shovill --help

SYNOPSIS
  De novo assembly pipeline for Illumina paired reads
USAGE
  shovill [options] --outdir DIR --R1 R1.fq.gz --R2 R2.fq.gz
GENERAL
  --help          This help
  --version       Print version and exit
  --check         Check dependencies are installed
INPUT
  --R1 XXX        Read 1 FASTQ (default: '')
  --R2 XXX        Read 2 FASTQ (default: '')
  --depth N       Sub-sample --R1/--R2 to this depth. Disable with --depth 0 (default: 150)
  --gsize XXX     Estimated genome size eg. 3.2M <blank=AUTODETECT> (default: '')
OUTPUT
  --outdir XXX    Output folder (default: '')
  --force         Force overwite of existing output folder (default: OFF)
  --minlen N      Minimum contig length <0=AUTO> (default: 0)
  --mincov n.nn   Minimum contig coverage <0=AUTO> (default: 2)
  --namefmt XXX   Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
  --keepfiles     Keep intermediate files (default: OFF)
RESOURCES
  --tmpdir XXX    Fast temporary directory (default: '')
  --cpus

#### What does shovill do?

1. Subsamples the reads so that all genomes have a similar coverage depth (150x by default, see `--depth`)
2. Trims adapters and poor quality reads if necessary
3. Assembles using SPAdes
4. Polishes genomes (improves quality) and filters low quality contigs
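The first step comes down to simple arithmetic: depth is the total number of sequenced bases divided by the genome size, and reads only need to be subsampled when that depth exceeds the target. The sketch below illustrates the calculation; it is not shovill's own code. The helper name `depth_fraction` and the read total are hypothetical, and the genome size is an approximate value for PAO1.

```shell
# Estimate sequencing depth and the fraction of reads to keep for a target depth.
# All input values below are hypothetical.
depth_fraction() {
    local total_bp=$1 genome_size=$2 target=$3
    awk -v bp="$total_bp" -v gs="$genome_size" -v t="$target" 'BEGIN {
        depth = bp / gs
        frac = (depth > t) ? t / depth : 1     # keep everything when depth is already below target
        printf "depth=%.1f keep_fraction=%.2f\n", depth, frac
    }'
}

# 1.878 Gbp of reads over a genome of about 6.26 Mbp, target depth 150
depth_fraction 1878000000 6260000 150
# prints: depth=300.0 keep_fraction=0.50
```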


In [26]:
# define PATH to the output directory for assemblies
OUTPUT_ASSEMBLY="/home/jupyter-mdprieto/tutorials/contigs"

# define PATH to trimmed_reads
TRIMMED_READS="/home/jupyter-mdprieto/tutorials/trimmed_reads"

To execute shovill, we run the command from the singularity container we just downloaded. Genome assembly is the most resource-intensive process in the pipeline, so it will probably take a while to run. As input, we will use our `trimmed_reads` files, and to keep the run time short we will assemble only two isolates. The remaining ones are already available in the `tutorials/results` folder.

<font color='darkred'>_**Notes for Compute Canada:**_ </font>  
- Allocate sufficient memory, as every genome must be kept in memory while it is assembled
- Bioinformatic tools (including **SPAdes**) usually use multiple threads to optimize performance, so their efficiency increases with the number of available cores. 
- In shovill, the `--ram` option sets the amount of RAM (in GB) that the pipeline tries to stay below
    - SPAdes takes this value from shovill as its total available memory; to set the limit manually, pass it with `--opts "-m XX"`

In [None]:
# for loop to run a command for each sample (first two isolates only)

for read1 in $(ls "$TRIMMED_READS"/*R1.fastq.gz | head -n 2)
do

    read2=${read1/_R1/_R2}                                      # substitute R1 for R2 in variable
    prefix_isolate=$(basename "$read1" _R1.fastq.gz)            # extract isolate name from file name
    echo "Started processing $prefix_isolate"

    # dry run: `echo` prints each command; remove it to launch the assembly
    echo singularity exec tools/shovill_1.1.sif shovill         `# execute shovill image` \
        --R1 "$read1" --R2 "$read2"                             `# specify paired reads R1 and R2` \
        --outdir "$OUTPUT_ASSEMBLY"                             `# define output directory` \
        --force                                                 `# overwrite results if already available` \
        --ram 140                                               `# how much RAM to use`

    # name resulting file with isolate id
    echo mv "$OUTPUT_ASSEMBLY/contigs.fa" "$OUTPUT_ASSEMBLY/${prefix_isolate}_contigs.fa"

    echo "Finished assembly of sample $prefix_isolate"

done

Started processing ERR10479510
singularity exec tools/shovill_1.1.sif shovill --R1 /home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479510_R1.fastq.gz --R2 /home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479510_R2.fastq.gz --outdir /home/jupyter-mdprieto/tutorials/contigs --force --ram 140
mv /home/jupyter-mdprieto/tutorials/contigs/contigs.fa /home/jupyter-mdprieto/tutorials/contigs/ERR10479510\_contigs.fa
Finished assembly of sample ERR10479510
Started processing ERR10479511
singularity exec tools/shovill_1.1.sif shovill --R1 /home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479511_R1.fastq.gz --R2 /home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479511_R2.fastq.gz --outdir /home/jupyter-mdprieto/tutorials/contigs --force --ram 140
mv /home/jupyter-mdprieto/tutorials/contigs/contigs.fa /home/jupyter-mdprieto/tutorials/contigs/ERR10479511\_contigs.fa
Finished assembly of sample ERR10479511
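The loop derives the R2 path and the isolate name from the R1 path with two shell idioms. The sketch below isolates them on one path from the listing above, together with the brace form for appending a suffix to a variable, so the substitutions can be checked without running an assembly. No tools are invoked, and the variable `contigs_name` is introduced only for illustration.

```shell
read1="/home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479510_R1.fastq.gz"

# ${var/pattern/replacement} replaces the first match of the pattern
read2=${read1/_R1/_R2}

# basename strips the directory and, optionally, a trailing suffix
prefix_isolate=$(basename "$read1" _R1.fastq.gz)

# braces mark where the variable name ends, so "_contigs.fa" is appended literally
contigs_name="${prefix_isolate}_contigs.fa"

echo "$read2"            # /home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479510_R2.fastq.gz
echo "$prefix_isolate"   # ERR10479510
echo "$contigs_name"     # ERR10479510_contigs.fa
```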


In this tutorial, we processed only two samples to optimize the runtime. The remaining assemblies can be found in the same directory for future steps. 

The main outputs of the **Shovill** pipeline are the files ending in `contigs.fa`, which contain the assembled contigs in FASTA format. In this format, every contig has a header line starting with `>`, followed by its nucleotide sequence.  

In [30]:
head "$OUTPUT_ASSEMBLY/ERR10479510_contigs.fa"

>contig00001 len=531597 cov=40.5 corr=0 origname=NODE_1_length_531597_cov_40.531919_pilon sw=shovill-spades/1.1.0 date=20230301
CGGCGGCAGTTGGCGAAAGAAATCCCGCACCTGTGCCCGCTTGAGTTGGCGACGACATAC
CACATGCTCGTGACGATCAACGCCGTGCACCTGGAATACTTGTTTTGCCAGATCCAGACC
AATGCGACTAAGGTTCATGCTGACTCCCCCTCCGGGACTTGTGGCTGCACCATTAGTCTG
GCGCTTGACGCCGTAGGAGGGAGGAGTCCATTTCATTGCCCTACCCCAGCTCTCCATCGC
CGCCAATCTCCCGCATATCCCCGGAGTCCGCCATGTCCTCACCCCAACCGCCCCGCTTCG
ACGGCCAACGCTGGAGCAACGCCGACGACGACCGCATCGAGGTGCTGCCTGCCGACCCCG
CCTGGCCACAACACTTCGCCGCCGAAGCCGAGGCCATCCGCACGGCGCTGGCGCTGCCCG
GGCTGGGCATCGAGCATGTCGGCAGCACCGCGGTGCCCGGGCTCGACGCCAAGCCGATCA
TCGACATCCTCCTGCTGCCGCCGCCCGGCCACGATCCGCAGCGGCTGGTAGCCCCGCTGG
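Because every contig starts with a `>` header line, simple text tools are enough for a quick sanity check of an assembly. The following is a minimal sketch, assuming a FASTA file in the layout shown above; `fasta_summary` is a hypothetical helper name and not part of shovill.

```shell
# Print the number of contigs and the total assembled length of a FASTA file.
fasta_summary() {
    awk '
        /^>/ { n++; next }           # header line: count one contig
        { total += length($0) }      # sequence line: add its length
        END { printf "contigs=%d total_bp=%d\n", n, total }
    ' "$1"
}
```

For example, `fasta_summary "$OUTPUT_ASSEMBLY/ERR10479510_contigs.fa"` would report the contig count and total size of the first assembly.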


# Quality control of draft genomes

**Shovill** produces contigs (contiguous consensus sequences built from overlapping reads) for every isolate. However, cultures are sometimes contaminated with bacteria other than our organism of interest. Also, given the untargeted approach used for sequencing, the reads from an isolate may have poor quality (low reliability in base calling or poor coverage of certain regions).

Thus, after producing draft genomes, we typically conduct additional checks to verify the quality of the resulting files and make sure that we do not have contamination in our samples. 

## Quast

Quast [(github:quast)](https://github.com/ablab/quast) produces quantitative summaries of the contigs in every assembly. When a reference genome is provided, it also evaluates misassemblies, unaligned contigs, and metrics of coverage against that reference. 

**_Some metrics include:_** 
- Number of contigs and number of contigs > 500bp
- **N50**, the contig length such that all contigs of at least that length together cover half of the assembly 
- **NG50**, similar to **N50** but computed against the length of the reference genome instead of the length of the assembly
- Number of misassemblies including inversions, relocations, and translocations
- Number and total length of unaligned contigs (against the reference genome)
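To make the N50 definition concrete, the sketch below computes it from a list of contig lengths: sort the lengths from longest to shortest and walk down the list until the running sum reaches half of the total. The helper name `n50` and the toy lengths are illustrative assumptions, and Quast's own implementation may differ in details.

```shell
# Compute N50 from contig lengths given one per line on stdin.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running >= half) { print len[i]; exit }
            }
        }'
}

# toy example: total = 100, half = 50; sorted lengths are 40, 30, 20, 10
printf '%s\n' 10 40 20 30 | n50
# prints: 30
```

Replacing `total` with the length of the reference genome in the same calculation would give NG50.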

As the data we are analyzing in the tutorial comes from _P. aeruginosa_ isolates ([PMID:34412676](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8376114/) - [BioProject:PRJEB56397](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB56397)), we will use the [PAO1 reference strain](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000006765.1/) for quality control. 

In [31]:
# load quast into our environment
module load StdEnv/2020 gcc/9.3.0 quast/5.0.2

# reference genomes are found in the tools directory
ls $HOME/tools/GCF*


Lmod is automatically replacing "intel/2020.1.217" with "gcc/9.3.0".


The following have been reloaded with a version change:
  1) StdEnv/2016.4 => StdEnv/2020           4) mii/1.1.1 => mii/1.1.2
  2) gcccore/.5.4.0 => gcccore/.9.3.0       5) openmpi/2.1.1 => openmpi/4.0.3
  3) imkl/11.3.4.258 => imkl/2020.1.217

/home/jupyter-mdprieto/tools/GCF_000006765.1_ASM676v1_genomic.fna.gz
/home/jupyter-mdprieto/tools/GCF_000006765.1_ASM676v1_genomic.gff.gz


In [33]:
# create variables for the input and output directories
CONTIGS_DIR='/home/jupyter-mdprieto/tutorials/contigs/'
ASSEMBLY_QUAST='/home/jupyter-mdprieto/tutorials/assembly_quast/'

The main command `quast.py` produces several reports with the previously mentioned metrics. The results are provided in multiple formats such as `.pdf`, `.html`, and `.tsv`. They can be opened directly in Jupyter by clicking on them in the file explorer or exported to your local computer for further visualization. 

In [17]:
quast.py "$CONTIGS_DIR"/*contigs.fa                                             `# pattern for contig files produced by shovill` \
    -r /home/jupyter-mdprieto/tools/GCF_000006765.1_ASM676v1_genomic.fna.gz     `# reference genome` \
    -g /home/jupyter-mdprieto/tools/GCF_000006765.1_ASM676v1_genomic.gff.gz     `# reference genomic features positions` \
    -o "$ASSEMBLY_QUAST"                                                        `# output directory` \
    --threads 12

/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Compiler/gcc9/quast/5.0.2/bin/quast.py /home/jupyter-mdprieto/tutorials/assemblies//ERR10479510_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479511_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479512_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479513_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479514_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479515_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479516_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479517_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479518_contigs.fa /home/jupyter-mdprieto/tutorials/assemblies//ERR10479519_contigs.fa -r /home/jupyter-mdprieto/tools/GCF_000006765.1_ASM676v1_genomic.fna.gz --features /home/jupyter-mdprieto/tools/GCF_000006765.1_ASM676v1_genomic.gff.gz -o /home/jupyter-mdprieto/tutorials/assembly_qc/ --threads 12

Versio

We can preview metrics such as the N50 of the assemblies and the coverage of the reference genome using the commands below. 

In [36]:
ls $HOME/tutorials/assembly_quast/

# overview of N50 and coverage
echo
cat $HOME/tutorials/assembly_quast/report.tsv | grep -E 'Assembly|N50|fraction'

[0m[01;34maligned_stats[0m    [01;34mgenome_stats[0m    quast.log    report.tex  transposed_report.tex
[01;34mbasic_stats[0m      icarus.html     report.html  report.tsv  transposed_report.tsv
[01;34mcontigs_reports[0m  [01;34micarus_viewers[0m  report.pdf   report.txt  transposed_report.txt

Assembly	ERR10479510_contigs	ERR10479511_contigs	ERR10479512_contigs	ERR10479513_contigs	ERR10479514_contigs	ERR10479515_contigs	ERR10479516_contigs	ERR10479517_contigs	ERR10479518_contigs	ERR10479519_contigs
N50	221997	184817	226997	223675	216284	226489	215154	223643	223542	223542
Genome fraction (%)	95.402	95.397	95.458	95.500	95.476	95.507	95.554	95.484	95.515	95.390
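The `report.tsv` layout has one metric per row and one assembly per column, which is awkward to scan by eye. As a sketch, the hypothetical helper `quast_per_sample` below flips the selected rows so that each assembly is on its own line. It assumes tab-separated input with the row labels shown above.

```shell
# Print assembly name, N50, and genome fraction as one line per assembly.
quast_per_sample() {
    awk -F'\t' '
        $1 == "Assembly"            { for (i = 2; i <= NF; i++) name[i] = $i; n = NF }
        $1 == "N50"                 { for (i = 2; i <= NF; i++) n50[i] = $i }
        $1 == "Genome fraction (%)" { for (i = 2; i <= NF; i++) frac[i] = $i }
        END { for (i = 2; i <= n; i++) printf "%s\t%s\t%s\n", name[i], n50[i], frac[i] }
    ' "$1"
}
```

It could be called as `quast_per_sample "$HOME/tutorials/assembly_quast/report.tsv"`. The directory listing above also shows a `transposed_report.tsv`, which is an alternative starting point with one assembly per row.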


## CheckM


**CheckM** infers the quality of a genome assembly from the presence and uniqueness of sets of marker genes that are specific to a species or taxon. It estimates the completeness (the proportion of expected marker genes that are found) and the contamination of the input draft genomes.

**CheckM** is not available in the CC cluster. So, we use a singularity container with the latest version.


In [35]:
# create ENV variable to input/output directory and singularity container for checkm
CONTIGS_DIR='/home/jupyter-mdprieto/tutorials/contigs/'
ASSEMBLY_CHECKM='/home/jupyter-mdprieto/tutorials/assembly_checkm/'
IMG_CHECKM="/home/jupyter-mdprieto/tools/checkm_1.2.2.sif"

The first step is to create a file with the marker genes that are specific to a taxon at a given rank (for example, species or genus) using `checkm taxon_set <rank> <taxon_name> <marker_file>`.


In [7]:
singularity exec "$IMG_CHECKM" checkm \
    taxon_set species 'Pseudomonas aeruginosa' /home/jupyter-mdprieto/tools/pseudomonas.ms

[2023-03-08 01:00:20] INFO: CheckM v1.2.2
[2023-03-08 01:00:20] INFO: checkm taxon_set species Pseudomonas aeruginosa /home/jupyter-mdprieto/tools/pseudomonas.ms
[2023-03-08 01:00:20] INFO: CheckM data: /usr/local/checkm_data
[2023-03-08 01:00:20] INFO: [CheckM - taxon_set] Generate taxonomic-specific marker set.
[2023-03-08 01:00:25] INFO: Marker set for Pseudomonas aeruginosa contains 1617 marker genes arranged in 469 sets.
[2023-03-08 01:00:25] INFO: Marker set inferred from 19 reference genomes.
[2023-03-08 01:00:25] INFO: Marker set for Pseudomonas contains 833 marker genes arranged in 312 sets.
[2023-03-08 01:00:25] INFO: Marker set inferred from 182 reference genomes.
[2023-03-08 01:00:25] INFO: Marker set for Pseudomonadaceae contains 800 marker genes arranged in 302 sets.
[2023-03-08 01:00:25] INFO: Marker set inferred from 186 reference genomes.
[2023-03-08 01:00:25] INFO: Marker set for Pseudomonadales contains 549 marker genes arranged in 326 sets.
[2023-03-08 01:00:25] INF

With the reference markers file created in our tools directory, we perform two additional steps:
- Using `checkm analyze`, we identify which of the taxon-specific marker genes are present in every assembly
- Then, with `checkm qa`, we produce a summary table that includes completeness, contamination, and strain heterogeneity

The process for the samples used in the tutorial (n = 10) should take around 10-15 min. 

In [11]:
# analyze presence of markers 
singularity exec "$IMG_CHECKM" checkm analyze \
    /home/jupyter-mdprieto/tools/pseudomonas.ms         `#file with checkm marker set for assemblies` \
    "$CONTIGS_DIR"                                      `#dir with assemblies in fasta format` \
    "$ASSEMBLY_CHECKM"                                  `#output directory` \
    -x fa                                               `#extension of assemblies` \
    -t 8 
    
# produce table of contaminations
singularity exec "$IMG_CHECKM" checkm qa \
        /home/jupyter-mdprieto/tools/pseudomonas.ms     `#file with checkm marker set for assemblies` \
        "$ASSEMBLY_CHECKM"                              `#output directory` \
        --file "$ASSEMBLY_CHECKM/checkm_output.tsv" \
        --tab_table                                     `# print tabular output` \
        --threads 8                                     `# number of simultaneous threads for process` \
        --out_format 1                                 `# format of output 1 = summary, 2 = extended`

[2023-03-08 01:18:31] INFO: CheckM v1.2.2
[2023-03-08 01:18:31] INFO: checkm analyze /home/jupyter-mdprieto/tools/pseudomonas.ms /home/jupyter-mdprieto/tutorials/assemblies/ /home/jupyter-mdprieto/tutorials/assembly_checkm/ -x fa -t 8
[2023-03-08 01:18:31] INFO: CheckM data: /usr/local/checkm_data
[2023-03-08 01:18:31] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2023-03-08 01:18:32] INFO: Identifying marker genes in 10 bins with 8 threads:
    Finished processing 10 of 10 (100.00%) bins.
[2023-03-08 01:27:11] INFO: Saving HMM info to file.
[2023-03-08 01:27:12] INFO: { Current stage: 0:08:41.168 || Total: 0:08:41.168 }
[2023-03-08 01:27:12] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 10 of 10 (100.00%) bins.
[2023-03-08 01:27:19] INFO: Aligning marker genes with multiple hits in a single bin:
    Finished processing 10 of 10 (100.00%) bins.
[2023-03-08 01:27:21] INFO: { Current stage: 0:00:08.632 || Total: 0:08:49.800 }
[2023-03-08 01:27:21] IN

Now, we can review the output of results for contamination using **CheckM**. 

- *Completeness* measures the proportion of the marker gene sets expected for a species that are found in a given assembly. 
- *Contamination* reflects the presence of multi-copy marker genes in the genome assembly. 
- *Strain heterogeneity* is determined from the multi-copy marker genes whose copies have an amino acid identity >= 90%. A high value suggests that most of the contamination comes from closely related organisms, whereas a low value suggests phylogenetically distinct sources.

In [40]:
cat ~/tutorials/assembly_checkm/checkm_output.tsv

Bin Id	Marker lineage	# genomes	# markers	# marker sets	0	1	2	3	4	5+	Completeness	Contamination	Strain heterogeneity
ERR10479510_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	1601	13	0	0	0	99.86	1.56	7.69
ERR10479511_contigs	Pseudomonas aeruginosa (6)	19	1617	469	4	1600	13	0	0	0	99.64	1.56	7.69
ERR10479512_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	1601	13	0	0	0	99.86	1.56	7.69
ERR10479513_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	1602	12	0	0	0	99.86	1.45	8.33
ERR10479514_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	1601	13	0	0	0	99.86	1.56	7.69
ERR10479515_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	1601	13	0	0	0	99.86	1.56	7.69
ERR10479516_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	1601	13	0	0	0	99.86	1.56	7.69
ERR10479517_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	1601	13	0	0	0	99.86	1.56	7.69
ERR10479518_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	1599	14	1	0	0	99.86	2.03	5.88
ERR10479519_contigs	Pseudomonas aeruginosa (6)	19	1617	469	3	160
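A common follow-up is to flag assemblies that fall outside chosen quality limits. The sketch below reads the tab-separated table and locates the `Completeness` and `Contamination` columns by header name. The thresholds (at least 95% completeness, at most 5% contamination) and the function name `checkm_flag` are assumptions for illustration, not CheckM defaults; choose limits suited to your study.

```shell
# Label each assembly PASS or FAIL from a CheckM tab-separated summary table.
# Defaults (95 and 5) are illustrative assumptions, not CheckM settings.
checkm_flag() {
    local table=$1 min_comp=${2:-95} max_cont=${3:-5}
    awk -F'\t' -v min_comp="$min_comp" -v max_cont="$max_cont" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            next
        }
        {
            status = ($comp >= min_comp && $cont <= max_cont) ? "PASS" : "FAIL"
            printf "%s\t%s\t%s\t%s\n", $1, $comp, $cont, status
        }
    ' "$table"
}
```

For example, `checkm_flag ~/tutorials/assembly_checkm/checkm_output.tsv` applies the default limits, and `checkm_flag ~/tutorials/assembly_checkm/checkm_output.tsv 99 2` applies stricter ones.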