# Introduction

Throughout these tutorials we want to introduce some of the common tools used in our laboratory, the **Center for Infectious Disease Genomics and One Health**, for analysis of bacterial sequencing data. 

As bioinformatics is a rapidly changing field, the pipelines and software used may change over time or become obsolete. New trainees will be in charge of reproducing the pipeline format and evaluating the content. Ideally, the tutorials will be updated and upgraded iteratively. 

# Handling of raw sequencing reads

## Required tools

After a sequencing experiment, the data are delivered as reads (short stretches of called nucleotides), typically in [fastq](https://en.wikipedia.org/wiki/FASTQ_format) format. The length of the reads and the quality of the base calls depend largely on the processing steps and the technology used (Illumina, Nanopore, PacBio).

Most of our analyses are run on a High Performance Computing cluster, because most tools are resource intensive and running them on your local computer would be too time consuming. The cluster we currently use is **Compute Canada (CC cluster)**; you can log in using the instructions available in the [wiki page for new users](https://docs.alliancecan.ca/wiki/SSH). 

In the **CC cluster**, several tools are already available as modules. This environment (computing system) reproduces that setup, so you can load software with a similar command: `module load desired_software/version`.
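
Before loading a tool it helps to check which versions exist. The sketch below assumes the Lmod `module` command used on Compute Canada is defined in the current shell; the tool and version are the ones loaded later in this tutorial, and the `checked` variable is only there for illustration.

```bash
# Only call the module system if it is defined in this shell
if command -v module >/dev/null 2>&1; then
    module spider seqkit                    # list the available versions of a tool
    module load StdEnv/2020 seqkit/2.3.1    # load one specific version
    module list                             # confirm which modules are now loaded
    checked="yes"
else
    echo "module command not available in this shell"
    checked="no"
fi
```

If the `module` command is not found, sourcing the profile script shown in the first code cell of the next sections should make it available.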

Manually installing tools can create conflicts and errors. Thus, we will use containers (ready to use packages to run a software tool). Singularity containers are executed using the command `singularity run tool_image` or `singularity exec tool_image command`. There is no need to download said containers as they should be readily available once you connect to this server. 

Tools used in this section:
- seqkit
- fastqc
- multiqc

## Load bash kernel

Kernels are the processes that run the code in a notebook. In Jupyter, the kernel determines which programming language is interpreted. A pre-built environment was developed to run these notebooks, so make sure to select the **bash_kernel** in the upper right corner of the notebook. 

## Environment structure

We will first explore the structure of our environment and the folders available. It is good practice to assign a directory (`tutorials`) to every project. Inside this main directory, we will create subdirectories with results, analysis instructions, and tools necessary for analysis.

Every command can be explored using `command --help` for further details. 

In [1]:
# change to home directory and print the PATH to it
cd 
pwd -P

# show what is available in home directory
tree -L 2 -d tutorials

# source PATH to use module function
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh

/home/jupyter-mdprieto
tutorials
|-- raw_reads
|-- results
|   `-- reads_qc
`-- tools
    |-- reference_data
    `-- sing_imgs

6 directories


## Exploring the data

Datasets for this tutorial are available in a shared folder called `raw_reads` inside the `tutorials` directory. These are paired-end Illumina short-read sequences from a biorepository; the seqkit statistics later in this tutorial report read lengths between 35 and 251 bp for the files shown. The isolates were cultured to investigate an outbreak of multidrug-resistant _Pseudomonas aeruginosa_ in Switzerland [PMID:34412676](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8376114/). To make the tutorial faster and easier to run, we use a subset of 10 bacterial isolates. 

In [9]:
# show content of raw_reads directory
ls /home/jupyter-mdprieto/tutorials/raw_reads

ERR10479510_R1.fastq.gz  ERR10479513_R2.fastq.gz  ERR10479517_R1.fastq.gz
ERR10479510_R2.fastq.gz  ERR10479514_R1.fastq.gz  ERR10479517_R2.fastq.gz
ERR10479511_R1.fastq.gz  ERR10479514_R2.fastq.gz  ERR10479518_R1.fastq.gz
ERR10479511_R2.fastq.gz  ERR10479515_R1.fastq.gz  ERR10479518_R2.fastq.gz
ERR10479512_R1.fastq.gz  ERR10479515_R2.fastq.gz  ERR10479519_R1.fastq.gz
ERR10479512_R2.fastq.gz  ERR10479516_R1.fastq.gz  ERR10479519_R2.fastq.gz
ERR10479513_R1.fastq.gz  ERR10479516_R2.fastq.gz


These reads are compressed with `gzip` (extension .gz), and most bioinformatic programs accept `gzip` files directly as input. If you need to decompress them, the command is 

    gunzip file(s)_to_decompress
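
It is often enough to look inside a compressed file without writing a decompressed copy. This is a minimal sketch, assuming standard four-line FASTQ records; the two-read file is made up so the example runs anywhere.

```bash
# Hypothetical two-read FASTQ file, created only so the example is self-contained
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGCA\n+\nIIIII\n' | gzip > "$tmpdir/example_R1.fastq.gz"

# Show the first record (4 lines: header, sequence, separator, qualities)
gzip -dc "$tmpdir/example_R1.fastq.gz" | head -n 4

# Each record spans 4 lines, so number of reads = number of lines / 4
n_lines=$(gzip -dc "$tmpdir/example_R1.fastq.gz" | wc -l)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"
```

`gzip -dc` streams the decompressed content to the terminal and leaves the original `.gz` file untouched.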

The fastq files have the naming convention `_R1.fastq` or `_R2.fastq` for every isolate because we are using paired-end reads (segments of the DNA are analyzed from both ends, left and right). Other naming structures can be used to represent paired end reads (`_1.fastq, _R001.fastq`). 
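
Because every `_R1` file should have an `_R2` mate, it is worth confirming that no file is missing before running any tool. This sketch assumes the `_R1.fastq.gz` / `_R2.fastq.gz` convention described above; the directory and sample names are hypothetical.

```bash
# Hypothetical directory with made-up empty files, used only for illustration
reads_dir=$(mktemp -d)
touch "$reads_dir/sampleA_R1.fastq.gz" "$reads_dir/sampleA_R2.fastq.gz" \
      "$reads_dir/sampleB_R1.fastq.gz"

missing=0
for r1 in "$reads_dir"/*_R1.fastq.gz; do
    # Swap the R1 suffix for the R2 suffix to get the expected mate file
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    sample=$(basename "$r1" _R1.fastq.gz)
    if [ -f "$r2" ]; then
        echo "$sample: pair complete"
    else
        echo "$sample: missing $(basename "$r2")"
        missing=$((missing + 1))
    fi
done
echo "isolates without a mate file: $missing"
```

Pointing `reads_dir` at the real `raw_reads` directory would run the same check on the tutorial data.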

## Quality control

It is good practice to organize preliminary results in their own directory so you can trace back when necessary. You can also use GitHub repositories to track all changes related to a project. 

We will create a new directory for output `tutorials/results` and another subdirectory for this particular step `tutorials/results/reads_qc`. 

<font color='darkred'>_**Notes for Compute Canada:**_ </font>  
- Compute Canada provides different directories for storage. Jobs cannot be launched from the *HOME* and *PROJECT* directories or any of their subdirectories. The ideal place to run jobs is the *SCRATCH* folder, which offers short-term storage for large amounts of data. 
- Once you have final results, move them to your *PROJECT* directory, because the *SCRATCH* folder is cleaned regularly.

In [12]:
# creates new directory for results of QC 
mkdir -p $HOME/tutorials/results/reads_qc

# show our new project structure
tree -dL 2 tutorials

tutorials
|-- raw_reads
|-- results
|   `-- reads_qc
`-- tools
    |-- reference_data
    `-- sing_imgs

6 directories


### 1. Basic statistics of reads with seqkit

***seqkit*** summarizes basic statistics of the **.fastq** files in a folder (number of reads, total number of bases, and minimum, average, and maximum read length) into a text file. 

In [13]:
# load seqkit and dependencies to our computing environment
module load StdEnv/2020 seqkit/2.3.1


The following have been reloaded with a version change:
  1) StdEnv/2016.4 => StdEnv/2020           4) intel/2016.4 => intel/2020.1.217
  2) gcccore/.5.4.0 => gcccore/.9.3.0       5) mii/1.1.1 => mii/1.1.2
  3) imkl/11.3.4.258 => imkl/2020.1.217     6) openmpi/2.1.1 => openmpi/4.0.3



On the command line, when we want to run a command that involves a file, we have to either move to the directory that contains it or include the complete path to the file. 

To avoid extra typing, we can create shell variables that store these paths and reuse them as often as needed. 

In [6]:
# create shell variables to avoid typing PATHs every time
RAW_READS="/home/jupyter-mdprieto/tutorials/raw_reads"
READS_QC_DIR="/home/jupyter-mdprieto/tutorials/results/reads_qc"

Then, we run the `seqkit stats` command and preview the text file that summarizes the results. The output should be ready in about a minute, as this is not a demanding process.  

In [15]:
# run sequence statistics
seqkit stats $RAW_READS/*.fastq.gz > $READS_QC_DIR/seqkit_output.tsv

# check results
head $READS_QC_DIR/seqkit_output.tsv

file                                                                format  type   num_seqs      sum_len  min_len  avg_len  max_len
/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479510_R1.fastq.gz  FASTQ   DNA   1,673,279  386,900,145       35    231.2      251
/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479510_R2.fastq.gz  FASTQ   DNA   1,673,279  387,923,954       35    231.8      251
/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479511_R1.fastq.gz  FASTQ   DNA   1,344,257  305,517,801       35    227.3      251
/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479511_R2.fastq.gz  FASTQ   DNA   1,344,257  307,892,897       35      229      251
/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479512_R1.fastq.gz  FASTQ   DNA   2,091,662  453,818,911       35      217      251
/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479512_R2.fastq.gz  FASTQ   DNA   2,091,662  458,746,957       35    219.3      251
/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479513_R1.fastq.gz  FASTQ   
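
In the table above, both files of a pair report the same `num_seqs`, as expected for paired-end data. The following sketch automates that check with `awk`. It assumes the default column layout shown above and file paths without spaces; the table in the example is made up.

```bash
# Made-up table in the same layout as the seqkit output, for illustration only
stats_file=$(mktemp)
cat > "$stats_file" <<'EOF'
file                   format  type  num_seqs  sum_len  min_len  avg_len  max_len
sampleA_R1.fastq.gz    FASTQ   DNA      1,200  270,000       35    225.0      251
sampleA_R2.fastq.gz    FASTQ   DNA      1,200  271,000       35    225.8      251
sampleB_R1.fastq.gz    FASTQ   DNA        900  200,000       35    222.2      251
sampleB_R2.fastq.gz    FASTQ   DNA        850  190,000       35    223.5      251
EOF

# Report isolates whose R1 and R2 files contain different numbers of reads
mismatched=$(awk 'NR > 1 {
    sample = $1; sub(/_R[12]\.fastq\.gz$/, "", sample)   # strip the read-pair suffix
    n = $4; gsub(/,/, "", n)                              # drop thousands separators
    if ((sample in seen) && (seen[sample] != n)) print sample
    seen[sample] = n
}' "$stats_file")
echo "isolates with unequal R1/R2 read counts: ${mismatched:-none}"
```

Replacing `$stats_file` with `$READS_QC_DIR/seqkit_output.tsv` would apply the check to the real results.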

### 2. fastqc

We will use **fastqc** to create additional sequencing metrics, including nucleotide distribution, presence of repeats, quality of base calling, GC content, and adapter content.


In [4]:
# load the fastqc tool
module load StdEnv/2020 fastqc/0.11.9

For improved speed, add 'usejni=t' to the command line of BBMap tools which support the use of the compiled jni C code.


- We input all the files in our reads directory that have the suffix 'fastq.gz' and specify that 10 files will be processed simultaneously (`-t 10`). The processing should take around 3 minutes for this set of 10 samples. 

In [7]:
fastqc $RAW_READS/*.fastq.gz \
    -o $READS_QC_DIR \
    -t 10 \
    --quiet
    # output is saved as individual files in READS_QC_DIR

Picked up JAVA_TOOL_OPTIONS: -Xmx2g



- Listing the output directory with `ls` shows that **fastqc** produces a summary in `zip` and `html` formats for every input file.

In [8]:
# resulting files
ls $READS_QC_DIR | head 

ERR10479510_R1_fastqc.html
ERR10479510_R1_fastqc.zip
ERR10479510_R2_fastqc.html
ERR10479510_R2_fastqc.zip
ERR10479511_R1_fastqc.html
ERR10479511_R1_fastqc.zip
ERR10479511_R2_fastqc.html
ERR10479511_R2_fastqc.zip
ERR10479512_R1_fastqc.html
ERR10479512_R1_fastqc.zip
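
Each `zip` archive contains a `summary.txt` file with one tab-separated line per FastQC module (status, module name, input file). This sketch lists the modules that did not pass; the summary content is made up for illustration.

```bash
# Made-up summary.txt content in the tab-separated layout FastQC writes
summary=$(mktemp)
printf 'PASS\tBasic Statistics\tsampleA_R1.fastq.gz\n' >  "$summary"
printf 'WARN\tPer base sequence content\tsampleA_R1.fastq.gz\n' >> "$summary"
printf 'FAIL\tAdapter Content\tsampleA_R1.fastq.gz\n' >> "$summary"

# Keep only the modules flagged as WARN or FAIL
flagged=$(awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$summary")
echo "$flagged"
```

On real output, `unzip -p "$READS_QC_DIR/ERR10479510_R1_fastqc.zip" '*/summary.txt'` should stream the summary of one report straight into the same `awk` filter.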


### 3. Trimming reads with BBtools

This step is not mandatory because many genome assemblers can remove sequencing adapters from the raw reads before trying to produce a complete genome. In this step we aim to remove these contaminant sequences (adapters added during sequencing) and poor-quality regions of the reads.

In [1]:
module load StdEnv/2020 bbmap/38.86

For improved speed, add 'usejni=t' to the command line of BBMap tools which support the use of the compiled jni C code.

The following have been reloaded with a version change:
  1) StdEnv/2016.4 => StdEnv/2020           4) intel/2016.4 => intel/2020.1.217
  2) gcccore/.5.4.0 => gcccore/.9.3.0       5) mii/1.1.1 => mii/1.1.2
  3) imkl/11.3.4.258 => imkl/2020.1.217     6) openmpi/2.1.1 => openmpi/4.0.3



The tool includes a fasta file with commonly used adapter sequences for the Illumina platform. The process removes reads from our sequencing files that match these reference sequences. The command we use is `bbduk.sh`, with several options:

- Specify PATHs to the adapter sequences file and to the output directory for trimmed reads
- `k=23` sets the k-mer length used to match reads against the adapter sequences
- `trimq=6` sets the threshold for trimming regions with a quality score below 6; BBDuk only applies it when a `qtrim` mode (for example `qtrim=rl`) is also set
- `tbo` trims adapters based on where the two reads of a pair overlap, and `tpe` trims both reads of a pair to the same length
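
BBDuk's documentation describes two mode switches that change what the options above do. `ktrim` trims the matched adapter off a read instead of discarding the read, and `qtrim` activates the `trimq` threshold. The sketch below prints, without running it, a variant that enables both. The helper function `print_bbduk_cmd`, the file names, and the `mink`/`hdist` values are illustrative assumptions, not the settings used in this tutorial.

```bash
# Hypothetical helper: print (not run) a bbduk.sh command for one read pair
print_bbduk_cmd() {
    r1=$1
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    # ktrim=r  : trim adapter matches and everything to their right (3' end)
    # mink     : allow shorter k-mers at read ends; hdist: allowed mismatches
    # qtrim=rl : quality-trim both ends using the trimq threshold
    echo bbduk.sh \
        "in1=raw_reads/$r1" "in2=raw_reads/$r2" \
        "out1=trimmed_reads/$r1" "out2=trimmed_reads/$r2" \
        ref=adapters.fa \
        ktrim=r k=23 mink=11 hdist=1 \
        qtrim=rl trimq=6 \
        tpe tbo
}

cmd=$(print_bbduk_cmd sampleA_R1.fastq.gz)
echo "$cmd"
```

Printing the command first makes it easy to review the options before committing cluster time to the real run.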


In [2]:
# establish necessary paths
adapters_file='/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/bbmap/38.86/resources/adapters.fa'
OUTPUT_TRIM="/home/jupyter-mdprieto/tutorials/trimmed_reads"
RAW_READS="/home/jupyter-mdprieto/tutorials/raw_reads"

# create the output directory for trimmed reads if it does not exist
mkdir -p "$OUTPUT_TRIM"

# loop over every R1 file and derive the name of its R2 mate
for i in "$RAW_READS"/*_R1*
do
R1=$(basename "$i")
R2=$(echo "$R1" | sed 's/_R1/_R2/')
bbduk.sh \
    in1="$RAW_READS/$R1" in2="$RAW_READS/$R2" \
    out1="$OUTPUT_TRIM/$R1" out2="$OUTPUT_TRIM/$R2" \
    ref="$adapters_file" \
    k=23 \
    trimq=6 \
    tpe \
    tbo \
    threads=9
done

java -ea -Xmx36396m -Xms36396m -cp /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/bbmap/38.86/current/ jgi.BBDuk in1=/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479510_R1.fastq.gz in2=/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479510_R2.fastq.gz out1=/home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479510_R1.fastq.gz out2=/home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479510_R2.fastq.gz ref=/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/bbmap/38.86/resources/adapters.fa k=23 trimq=6 tpe tbo threads=9
Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Executing jgi.BBDuk [in1=/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479510_R1.fastq.gz, in2=/home/jupyter-mdprieto/tutorials/raw_reads/ERR10479510_R2.fastq.gz, out1=/home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479510_R1.fastq.gz, out2=/home/jupyter-mdprieto/tutorials/trimmed_reads/ERR10479510_R2.fastq.gz, ref=/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/bbmap/38.86/resources/adapters.fa, k=2

### 4. Summarize with multiqc
Finally, **multiqc** is a really useful tool that can summarize all quality control output into a single html file.

With the code below, we summarize all **fastqc** output contained in the `reads_qc` results directory. The output can be visualized using any web browser. 

This tool is available as a singularity container and is called using `singularity exec PATH/TO/IMAGE multiqc qc_directory`

In [10]:
singularity exec $HOME/tutorials/tools/sing_imgs/multiqc_1.14.sif multiqc $READS_QC_DIR \
    -o $HOME/tutorials/results/reads_qc


  [34m/[0m[32m/[0m[31m/[0m ]8;id=442026;https://multiqc.info\[1mMultiQC[0m]8;;\ 🔍 [2m| v1.14[0m

[34m|           multiqc[0m | Search path : /home/jupyter-mdprieto/tutorials/results/reads_qc
[2K[34m|[0m         [34msearching[0m | [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [32m41/41[0m  [0m0m  
[?25h[34m|            fastqc[0m | Found 20 reports
[34m|           multiqc[0m | Compressing plot data
[34m|           multiqc[0m | Report      : ../tutorials/results/reads_qc/multiqc_report.html
[34m|           multiqc[0m | Data        : ../tutorials/results/reads_qc/multiqc_data
[34m|           multiqc[0m | MultiQC complete


## Reproduce the QC pipeline in Compute Canada

In Compute Canada, we run intensive commands as part of jobs. In a job, we specify how much memory and processing power we require, and the scheduler automatically assigns these resources and runs our commands. 

To do this, we save all of our commands and the instructions for the job in a text file. We open a text editor (`nano text_filename` or `vim text_filename`) and copy the following commands.
Once the commands are written, we save the file and exit using **Ctrl+X** (then **Y** and **Enter** to confirm) in `nano` or **Esc + ZZ** in `vim`.

```
#!/bin/bash
#SBATCH --account=rrg-whsiao-ab                    # compute canada PI allocation
#SBATCH --mem=25gb                                 # 25 GB of memory
#SBATCH --time=06:00:00                            # approximate time to complete all actions
#SBATCH --job-name="quality_control"               # name of job
#SBATCH --chdir=/scratch/mdprieto/tutorials        # change to directory before start
#SBATCH --cpus-per-task=9                          # number of threads, how many simultaneous tasks
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

################################ preparation ######################################

# load necessary modules
module load StdEnv/2020 bbmap/38.86 fastqc/0.11.9 seqkit/0.15.0 

# establish path for outputs and input
adapters_file='/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/bbmap/38.86/resources/adapters.fa'
OUTPUT_QC="/home/jupyter-mdprieto/tutorials/results_qc"
OUTPUT_TRIM="/home/jupyter-mdprieto/tutorials/trimmed_reads"
INPUT_DIR="/home/jupyter-mdprieto/tutorials/raw_reads"

# create the output directories used by the steps below
mkdir -p "$OUTPUT_QC" "$OUTPUT_TRIM"

################################## seqkit #########################################

# run seqkit in fastq file and save output in tab separated file
seqkit stats $INPUT_DIR/*.fastq.gz > $OUTPUT_QC/seqkit_output.tsv

################################## fastqc #########################################

# for all files in raw_reads directory
fastqc $INPUT_DIR/*.fastq.gz \
    -o $OUTPUT_QC \
    -t 9 \
    --quiet

################################## multiqc #########################################

# execute from singularity
singularity exec /home/jupyter-mdprieto/tools/multiqc_1.14.sif multiqc $OUTPUT_QC

################################## trimming ########################################

for i in "$INPUT_DIR"/*_R1*
do
R1=$(basename "$i")
R2=$(echo "$R1" | sed 's/_R1/_R2/')
bbduk.sh \
    in1="$INPUT_DIR/$R1" in2="$INPUT_DIR/$R2" \
    out1="$OUTPUT_TRIM/$R1" out2="$OUTPUT_TRIM/$R2" \
    ref="$adapters_file" \
    k=23 trimq=6 tpe tbo \
    threads=9
done

```
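
Once the file is saved, the job is handed to the Slurm scheduler. This is a minimal sketch, assuming the script was saved under the hypothetical name `quality_control.sh` in the current directory; the checks only make the example safe to run outside the cluster.

```bash
# Hypothetical script name; use whatever name you saved the file under
job_script="quality_control.sh"

if ! command -v sbatch >/dev/null 2>&1; then
    status="sbatch not found: run this on a cluster login node"
elif [ ! -f "$job_script" ]; then
    status="job script $job_script not found"
else
    sbatch "$job_script"      # submit the job; Slurm prints the job ID
    squeue -u "$USER"         # list your pending and running jobs
    status="submitted"
fi
echo "$status"
```

By default Slurm writes everything the job prints to a file named `slurm-<jobid>.out` in the directory the job runs from, which is the first place to look if a step fails.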

<font color='darkred'>_**Notes for Compute Canada:**_ </font>  
Our pipeline for quality control of raw reads includes several steps:

1. We use the ***seqkit*** tool to obtain basic statistics from the **.fastq** files. **seqkit** is available as a module in CC, so it can be loaded using `module load`. 
    - The tool takes `fastq` files as input and produces a text output 
2. We use **fastqc**, which is also available as a module in Compute Canada. The tool creates an overall summary of sequencing metrics including nucleotide distribution, presence of repeats, quality of base calling, GC content, and adapter content
    - **fastqc** produces an output summary for every input file
3. Another tool (**multiqc**) takes all output summaries of **fastqc** in a directory and creates a single HTML report that can be viewed in any web browser.
    - It is not available as a module in Compute Canada, so we use a singularity container to execute it
4. To have reads ready for assembly, we trim or remove any PCR adapters, barcodes, or poor-quality regions using **BBtools**