# Introduction
**Last updated:** September 2023

Throughout these tutorials we want to introduce some of the common tools used in our laboratory, the **Center for Infectious Disease Genomics and One Health**, for analysis of bacterial sequencing data. 

As bioinformatics is a rapidly changing field, the pipelines and software used may change over time or become obsolete. New trainees will be in charge of reproducing the pipeline format and evaluating the content. 

## Before starting

### Disclaimer: updating the notebook

Please update the tools and information in this Jupyter notebook if you realize that there is a new version or a more appropriate procedure. Also feel free to add new content by forking and modifying the file. To keep track of changes, please open a **pull request** to the base repository. 

### Pre-requisites

These tutorials assume that you have a basic understanding of how next generation sequencing reads are produced, as we start with data that has already been sequenced. You should also be familiar with the command line: at least how to move between folders, plus a basic grasp of commonly used commands and regular expressions. 

If you have not used the command line before or have limited experience, we recommend that you take a look at the resources below:

- Software Carpentry: guided tutorial introducing the command line: [The Unix Shell](https://swcarpentry.github.io/shell-novice/)
- [Extra Unix Shell Material from Software Carpentry](https://carpentries-incubator.github.io/shell-extras/): includes additional setup options and describes advanced commands such as `awk` and the use of aliases.
- Bioinformatics Workbook by Andrew Severin has a text-based introduction to setting up a shell (command line) on many systems: [Unix Basics](https://bioinformaticsworkbook.org/Appendix/Unix/unix-basics-1.html#gsc.tab=0)
- Happy Belly Informatics, maintained by Mike Lee (AstroBioMike), has several tutorials ([link here](https://astrobiomike.github.io/all_tutorials/)) about the Unix command line and bioinformatics pipelines

It is not necessary to be familiar with software containers, although you may find it helpful for your later work. You can find additional information in this [Carpentries introduction to Singularity containers](https://carpentries-incubator.github.io/singularity-introduction/). 

Finally, this is a Jupyter notebook, so it helps to have a basic notion of how notebooks work and how to modify them. You can find more information about Jupyter notebooks here: https://docs.jupyter.org/en/latest/. At a minimum, take a look at [this practical introduction](https://docs.jupyter.org/en/latest/start/index.html) before digging into the tutorials.

### Required tools

After a sequencing experiment, the data is available as reads (representations of stretches of nucleotides), typically in [fastq](https://en.wikipedia.org/wiki/FASTQ_format) format. The length of the reads and the quality of the assigned nucleotides depend largely on the processing steps and the technology used (Illumina, Nanopore, PacBio).
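
As a rough illustration of the format: each read in a fastq file occupies four lines (identifier, sequence, a `+` separator, and one quality character per base), so the number of reads is the line count divided by four. The sketch below builds a two-read file in a temporary directory; the file name `example.fastq` and the sequences are made up for illustration only.

```shell
# Build a tiny FASTQ file with two made-up reads (illustrative data only)
demo_dir=$(mktemp -d)
printf '%s\n' \
    '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
    '@read2' 'TTGCATGCAA' '+' 'IIIIHHHHGG' \
    > "$demo_dir/example.fastq"

# Each record spans 4 lines, so reads = lines / 4
n_lines=$(wc -l < "$demo_dir/example.fastq")
n_reads=$((n_lines / 4))
echo "reads in example.fastq: $n_reads"

# Print only the sequence line (2nd line of every record)
awk 'NR % 4 == 2' "$demo_dir/example.fastq"
```

The same line-count arithmetic works on real files, although dedicated tools such as `seqkit` are faster and safer for large datasets.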

Most of our analyses are done on a High Performance Computing cluster, as most tools are resource intensive and running them on your local computer would be too time consuming. The cluster currently used is **Compute Canada (CC cluster)**; you can log in using the instructions available in the [wiki page for new users](https://docs.alliancecan.ca/wiki/SSH). 

In the **CC cluster**, there are several tools already available to be used as modules. This environment (computing system) has reproduced this setup and you can use a similar command to load software: `module load desired_software/version`

Manually installing tools can create conflicts and errors. Thus, we will use containers (ready to use packages with necessary environment for a tool). Singularity containers are executed using the command `singularity run tool_image` or `singularity exec tool_image command`. There is no need to download said containers as they should be readily available once you connect to this server. See below how we use `singularity` to execute multiqc at the end of this tutorial.

Tools used in this section:
- `seqkit   v2.3.1`
- `fastqc   v0.11.9`
- `BBtools  v38.86`
- `multiqc  v1.14`

## Preparing the environment

### Select interpreter for Jupyter notebook

A kernel is the program that runs the code contained in a notebook. In Jupyter, the kernel determines what kind of programming language is interpreted. 

In order to run these notebooks, a pre-built environment was developed and we are using bash/shell commands. So make sure to load the **bash_kernel/bash** in the upper right corner of the notebook. 

### Load additional software pre-packaged in the environment

The `source` command imports a script and executes it. As we are running the command line from inside the Jupyter notebook, it does not have all the functionality of our [high-performance cluster](https://www.hpc.iastate.edu/guides/introduction-to-hpc-clusters/what-is-an-hpc-cluster) by default (Compute Canada/Digital Research Alliance). 

Most HPC systems have pre-installed software that you can make available for your session using the `module` command. With the line below, we make the same resources available to us during the tutorials.

In [None]:
# source PATH to use module function
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh

### Environment structure

We will first explore the structure of our environment and the folders available. It is good practice to assign a directory (`tutorials`) to every project. Inside this main directory, we will create subdirectories with results, analysis instructions, and tools necessary for analysis.

Every command can be explored using `command --help` for further details. 

In [None]:
# create a tutorials folder for all results
mkdir -p $HOME/tutorials

echo -e "The path to your home directory is
$(realpath $HOME)"

echo -e "
Take a look at the folder containing software/data for the tutorials:"
tree -dL 2 /mnt/cidgoh-object-storage/seagull/jupyter-mdprieto

## Exploring the data

Datasets for this tutorial are available in a shared folder called `raw_reads_tutorial`. These are Illumina short read sequences from a biological data repository (75 bp paired-end reads). The isolates were cultured to investigate an outbreak of multi-drug resistant _Pseudomonas aeruginosa_ in Switzerland [PMID:34412676](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8376114/). To make it faster and easier to run, we use a subset of 10 bacterial isolates. 

In [None]:
# show content of raw_reads directory
ls -1 /mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/raw_reads_tutorial

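These reads are compressed with gzip (extension `.gz`), and most bioinformatic programs can read `.gz` files directly as input, so there is usually no need to decompress them. If you do need the plain text file, the command is `gunzip file(s)_to_decompress` (`tar -zxf` is only for `.tar.gz` archives that bundle several files).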
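
A minimal sketch of handling gzip-compressed reads, assuming only the standard `gzip` tool is installed. The file `demo.fastq` and its single read are placeholders created for this example, not part of the tutorial data.

```shell
# Create a small made-up FASTQ file and compress it (placeholder data)
gz_dir=$(mktemp -d)
printf '%s\n' '@read1' 'ACGTACGT' '+' 'IIIIIIII' > "$gz_dir/demo.fastq"
gzip "$gz_dir/demo.fastq"                 # produces demo.fastq.gz and removes the original

# Peek at the content without writing a decompressed copy to disk
gzip -dc "$gz_dir/demo.fastq.gz" | head -n 4

# Decompress while keeping the compressed file (-c writes to standard output)
gzip -dc "$gz_dir/demo.fastq.gz" > "$gz_dir/demo.fastq"
ls -1 "$gz_dir"
```

Streaming with `gzip -dc` is usually preferable on a cluster, because decompressed reads take several times more disk space.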

The fastq files have the naming convention `_R1.fastq` or `_R2.fastq` for every isolate because we are using paired-end reads (segments of the DNA are analyzed from both ends, left and right). Other naming structures can be used to represent paired end reads (`_1.fastq, _R001.fastq`). 
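
Because downstream tools expect both files of a pair, it can be worth checking that every `_R1` file has a matching `_R2` file before starting. The sketch below does this on empty placeholder files; the isolate names are hypothetical and the suffix `_R1.fastq.gz` is an assumption that should be adapted to the naming used in your own data.

```shell
# Create empty placeholder files that follow the _R1/_R2 convention (hypothetical names)
pair_dir=$(mktemp -d)
touch "$pair_dir/isolateA_R1.fastq.gz" "$pair_dir/isolateA_R2.fastq.gz" \
      "$pair_dir/isolateB_R1.fastq.gz"

# For every forward file, derive the expected reverse file name and test that it exists
missing=0
for r1 in "$pair_dir"/*_R1.fastq.gz
do
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    if [ -f "$r2" ]; then
        echo "paired:  $(basename "$r1")  <->  $(basename "$r2")"
    else
        echo "missing mate for $(basename "$r1")"
        missing=$((missing + 1))
    fi
done
echo "files without a mate: $missing"
```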

# Quality control

It is a good practice to organize preliminary results in their own directory so you can trace back when necessary. You can also use github repositories to track all changes related to a project. 

We will create a new directory for output `tutorials/results` and another subdirectory for this particular step `tutorials/results/raw_seqkit`. 

<font color='darkred'>_**Notes for compute canada:**_ </font>  
- Compute Canada provides different directories for storage. Jobs cannot be launched from the *HOME* and *PROJECT* directories or any of their subdirectories. The ideal place to run jobs is the *SCRATCH* folder, where you have short term storage of large amounts of data. 
- Once you have final results, these should be moved to your *PROJECT* directory, as the *SCRATCH* folder is cleaned regularly.

In [None]:
# creates new directory for results of QC 
mkdir -p $HOME/tutorials/results/raw_seqkit

# show our new project structure
tree -dL 2 $HOME/tutorials

## 1. Basic statistics of reads with Seqkit 

***seqkit*** computes basic statistics from the **.fastq** files (number of reads, total number of bases, and minimum, average and maximum read length) for all reads in a folder and writes them to a text file. 

To make the software available to us, we can use the functionality `module` that we made available in the preparation section. More information about how to use modules can be found in the [Digital Research Alliance Canada wiki](https://docs.alliancecan.ca/wiki/Utiliser_des_modules/en). 

The basic syntax for any software/module is `module load module_dependency desired_module`

In [None]:
# load seqkit and dependencies to our computing environment
module load StdEnv/2020 seqkit/2.3.1

In the command line, when we want to run a command that involves a file, we have to move to the specific directory and run the desired command or include the complete path to said file (/mnt/source/user/...). 

To avoid extra typing, we can create variables (**environment variables**) that store this path and reuse them whenever needed. You can learn more about the use of environment variables in Linux environments in the following link: [ENVIRONMENT VARIABLES](https://linuxize.com/post/how-to-set-and-list-environment-variables-in-linux/). 

In [None]:
# create environment variables to avoid typing PATHs every time
RAW_READS="/mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/raw_reads_tutorial"
RESULTS_SEQKIT="$HOME/tutorials/results/raw_seqkit"
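
A small aside on scope: a variable assigned this way is visible in the current shell session (here, the notebook kernel), while programs started from it only inherit the variable after `export`. The sketch below shows the difference; `DEMO_DIR` and its path are hypothetical names used only for this illustration.

```shell
# A plain shell variable is visible in the current session only
DEMO_DIR="/tmp/demo_reads"            # hypothetical path, used only for illustration
seen_by_child=$(sh -c 'echo "${DEMO_DIR:-unset}"')
echo "before export, a child process sees: $seen_by_child"

# After export, programs started from this session inherit the variable
export DEMO_DIR
seen_after_export=$(sh -c 'echo "${DEMO_DIR:-unset}"')
echo "after export, a child process sees:  $seen_after_export"
```

For the tutorials this distinction rarely matters, because the variables are expanded by the shell before each tool is called.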

Then, we run the `seqkit` command and preview the text file that summarizes the results. The wait time for output should be around a minute, as this is not a demanding process.  

In [None]:
# run sequence statistics
seqkit stats $RAW_READS/*.fastq.gz > $RESULTS_SEQKIT/seqkit_output.tsv

# check results
head $RESULTS_SEQKIT/seqkit_output.tsv
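
One quick sanity check on this table is that the forward and reverse files of each isolate report the same number of reads. The sketch below runs on a mock table that mimics the column layout of `seqkit stats`; the sample names and all numbers are invented for illustration, and the `_R1`/`_R2` suffix pattern is an assumption about file naming.

```shell
# Mock table in the column layout of `seqkit stats` (all values are invented)
stats_file=$(mktemp)
cat > "$stats_file" <<'EOF'
file                 format  type  num_seqs  sum_len  min_len  avg_len  max_len
sampleA_R1.fastq.gz  FASTQ   DNA      1,000   75,000       75       75       75
sampleA_R2.fastq.gz  FASTQ   DNA      1,000   75,000       75       75       75
sampleB_R1.fastq.gz  FASTQ   DNA      2,000  150,000       75       75       75
sampleB_R2.fastq.gz  FASTQ   DNA      1,990  149,250       75       75       75
EOF

# Compare the num_seqs column (4th field) between the two files of each sample
mismatched=$(awk 'NR > 1 {
        sample = $1; sub(/_R[12]\.fastq\.gz$/, "", sample)   # strip the read suffix
        n = $4; gsub(/,/, "", n)                              # drop thousands separators
        if (sample in first) { if (first[sample] != n) print sample }
        else first[sample] = n
    }' "$stats_file")
echo "samples with unequal R1/R2 read counts: ${mismatched:-none}"
```

An isolate flagged here would be worth inspecting before trimming, since most paired-end tools assume both files contain the same reads in the same order.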

## 2. Trimming reads with BBtools

This step is not mandatory because many genome assemblers can remove sequencing adapters from the raw reads before trying to produce a complete genome. 

However, it is good practice to filter out poor quality sequences and check the quality of the starting data. The command below matches reads against known adapter sequences using k-mers of 23 bp (`k=23`) and sets a quality threshold (`trimq=6`) for trimming low quality regions.

In [None]:
module load StdEnv/2020 bbmap/38.86

The tool includes a text file with commonly used adaptor sequences for the **Illumina** platform. The process removes any reads from our sequencing files that match these reference sequences (fasta format). 

The command used is `bbduk.sh` and we add several options:

- Specify `PATHs` to the adapter sequences file (`ref`) and to the output files for trimmed reads (`out1`, `out2`)
- `k=23`    - k-mer length used to match reads against the adapter sequences
- `trimq=6` - quality threshold: regions with an average quality below 6 are trimmed (according to the BBDuk usage notes, this only takes effect when a trimming mode is also selected with `qtrim`)
- `tbo`     - Trim adapters based on where paired reads overlap
- `tpe`     - When kmer right-trimming, trim both reads to the minimum length of either
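
To make the term k-mer more concrete: a k-mer is every substring of length k obtained by sliding a window along a sequence one base at a time, so a sequence of length L contains L - k + 1 of them. The sketch below only illustrates the concept on a made-up 10 bp sequence with a small k; it is not a description of how `bbduk.sh` is implemented internally.

```shell
seq="ACGTTGCAAC"     # made-up 10 bp sequence
k=4                  # small k so the output stays readable (the tutorial uses k=23)

n_kmers=$(( ${#seq} - k + 1 ))
echo "a ${#seq} bp sequence contains $n_kmers k-mers of length $k:"

# Slide a window of length k along the sequence, one base at a time
echo "$seq" | awk -v k="$k" '{ for (i = 1; i <= length($0) - k + 1; i++) print substr($0, i, k) }'

# The same arithmetic for a 75 bp read and k=23
echo "k-mers in a 75 bp read with k=23: $(( 75 - 23 + 1 ))"
```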

In [None]:
# establish necessary paths
ADAPTERS='/mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/reference_data/adapters.fa'
TRIMMED_READS="$HOME/tutorials/trimmed_reads"
RAW_READS="/mnt/cidgoh-object-storage/seagull/jupyter-mdprieto/raw_reads_tutorial"

# create the output directory for trimmed reads if it does not exist yet
mkdir -p "$TRIMMED_READS"

After defining additional environment variables to minimize our typing, we create a for loop to execute the same command (in this case `bbduk.sh`) for every sample available. It takes around 5 minutes to run as we only have 10 samples.

If you want to learn more about the use of for loops in bioinformatics, make sure to review the suggested tutorials for the shell. 

In [None]:
for i in "$RAW_READS"/*_R1*
do
    # save the filename of read_1 in a variable called R1
    R1=$(basename "$i")
    # specify the matching read_2 in variable R2
    R2=$(echo "$R1" | sed 's/_R1/_R2/')
    # in1/in2:   define read_1 and read_2 as inputs
    # out1/out2: output trimmed reads with same name in trimmed_reads folder
    # ref:       specify the adaptor sequence for matching
    bbduk.sh \
        in1="$RAW_READS/$R1" in2="$RAW_READS/$R2" \
        out1="$TRIMMED_READS/$R1" out2="$TRIMMED_READS/$R2" \
        ref="$ADAPTERS" \
        k=23 \
        trimq=6 \
        tpe \
        tbo \
        threads=9
done

**Output of trimming** can be seen in the `trimmed_reads` folder as shown below

In [None]:
ls -1 $HOME/tutorials/trimmed_reads

## 3. fastqc

We will use **fastqc**  to create additional metrics of sequencing including nucleotide distribution, presence of repeats, quality of base calling, GC content and adapter content. 

The outputs will be produced in new subdirectories of `results` called **raw_fastqc** and **trimmed_fastqc**.


In [None]:
# create output directories
mkdir -p $HOME/tutorials/results/raw_fastqc
mkdir -p $HOME/tutorials/results/trimmed_fastqc

# define environment variables
FASTQC_RAW="$HOME/tutorials/results/raw_fastqc"
FASTQC_TRIMMED="$HOME/tutorials/results/trimmed_fastqc"

# load the fastqc tool
module load StdEnv/2020 fastqc/0.11.9

- We input all the files that have the suffix `fastq.gz` from our **raw reads** directory and specify that 10 files will be processed simultaneously (`-t 10`). 
    - The processing should take around 3 minutes for this set of 10 samples. 
- Then, we run the same analysis on the **trimmed_reads** to see if pre-processing improved the quality of the results.

In [None]:
fastqc $RAW_READS/*.fastq.gz \
    -o $FASTQC_RAW \
    -t 10 \
    --quiet
    # output is saved as individual files in the FASTQC_RAW directory

In [None]:
fastqc $TRIMMED_READS/*.fastq.gz \
    -o $FASTQC_TRIMMED \
    -t 10 \
    --quiet
    # output is saved as individual files in the FASTQC_TRIMMED directory


- We can list all files in the output directory of **fastqc**. By default, the command creates a `zip` and an `html` file for every read file.

In [None]:
echo -e "\nRAW READS:"
ls $FASTQC_RAW | head 

echo -e "\nTRIMMED READS:"
ls $FASTQC_TRIMMED | head
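
Each `zip` archive produced by fastqc contains a `summary.txt` file with one line per module: the result (PASS, WARN or FAIL), the module name, and the input file, separated by tabs. The sketch below lists the modules that did not pass. It runs on a mock `summary.txt` with invented results for a hypothetical `sampleA_R1.fastq.gz`, so the statuses shown do not describe the tutorial data.

```shell
# Mock summary.txt with invented results (tab-separated: status, module, file)
summary=$(mktemp)
printf 'PASS\tBasic Statistics\tsampleA_R1.fastq.gz\n'          >  "$summary"
printf 'PASS\tPer base sequence quality\tsampleA_R1.fastq.gz\n' >> "$summary"
printf 'WARN\tPer sequence GC content\tsampleA_R1.fastq.gz\n'   >> "$summary"
printf 'FAIL\tAdapter Content\tsampleA_R1.fastq.gz\n'           >> "$summary"

# With real output, the same file could be read without unpacking the archive, e.g.:
#   unzip -p "$FASTQC_RAW/sampleA_R1_fastqc.zip" sampleA_R1_fastqc/summary.txt

# Count failed modules, then list everything that did not pass
n_fail=$(awk -F '\t' '$1 == "FAIL" { n++ } END { print n + 0 }' "$summary")
echo "modules flagged FAIL: $n_fail"
awk -F '\t' '$1 != "PASS" { print $1 ": " $2 " (" $3 ")" }' "$summary"
```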


We can also see that we have added several new folders in an organized way to our tutorials/results folder, take a look with the following command:

In [None]:
echo -e "Updated tutorials folder structure:\n"
tree -d $HOME/tutorials

## 4. Summarize with multiqc
Finally, **multiqc** is a really useful tool that can summarize all quality control output into a single html file.

With the code below, we summarize all **fastqc** output contained in the `raw_fastqc` and `trimmed_fastqc` results directories. The output can be visualized using any web browser. 

Once again, the tool is available as a singularity container and is called using `singularity exec PATH/TO/IMAGE multiqc qc_directory`

In [None]:
singularity exec /mnt/cidgoh-object-storage/images/multiqc_1.14.img multiqc $FASTQC_RAW \
    -f \
    -o $FASTQC_RAW

In [None]:
singularity exec /mnt/cidgoh-object-storage/images/multiqc_1.14.img multiqc $FASTQC_TRIMMED  \
    -f \
    -o  $FASTQC_TRIMMED


`multiqc` produces an `.html` file with several visualizations. You can download it to your own computer and explore it in a web browser (Chrome, Firefox, Edge). 
If you are using the Jupyter environment (Graphical User Interface), you can navigate the folders to the location of the files (`$HOME/tutorials/results/raw_fastqc/multiqc_report.html`) and open it manually.

## Pipeline script for Compute Canada (REFERENCE ONLY)

In Compute Canada, we run intensive commands as part of jobs. In a job, we specify how much memory and processing power we require, and the system automatically assigns these resources and runs our specified commands. 

To do this, we save all of our commands and the instructions for the job in a text file: we open a text editor (`nano text_filename` or `vim text_filename`) and copy the following commands.
Once the commands are written, we save the file and exit using **Ctrl+X** (then **Y** to confirm) in `nano` or **Esc + ZZ** in `vim`.

```
#!/bin/bash
#SBATCH --account=rrg-whsiao-ab                    # compute canada PI allocation
#SBATCH --mem=25gb                                 # 25 GB of memory
#SBATCH --time=06:00:00                            # approximate time to complete all actions
#SBATCH --job-name="quality_control"               # name of job
#SBATCH --chdir=/scratch/mdprieto/tutorials        # change to directory before start
#SBATCH --cpus-per-task=9                          # number of threads, how many simultaneous tasks
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

################################ preparation ######################################

# load necessary modules
module load StdEnv/2020 bbmap/38.86 fastqc/0.11.9 seqkit/0.15.0 

# establish path for outputs and input
adapters_file='PATH/TO/adapters.fa'
OUTPUT_QC="$HOME/tutorials/results_qc"
OUTPUT_TRIM="$HOME/tutorials/trimmed_reads"
INPUT_DIR="PATH/TO/raw_reads"

# create output directories (same paths as the variables defined above)
mkdir -p "$OUTPUT_QC" "$OUTPUT_TRIM"

################################## seqkit #########################################

# run seqkit in fastq file and save output in tab separated file
seqkit stats $INPUT_DIR/*.fastq.gz > $OUTPUT_QC/seqkit_output.tsv

################################## fastqc #########################################

# for all files in raw_reads directory
fastqc $INPUT_DIR/*.fastq.gz \
    -o $OUTPUT_QC \
    -t 9 \
    --quiet

################################## multiqc #########################################

# execute from singularity
singularity exec $HOME/tools/multiqc_1.14.sif multiqc $OUTPUT_QC

################################## trimming ########################################

for i in $(ls $INPUT_DIR/*_R1*)
do
R1=$(basename $i)
R2=$(echo $R1 | sed 's/_R1/_R2/')
bbduk.sh \
    in1=$INPUT_DIR/$R1 in2=$INPUT_DIR/$R2 \
    out1=$OUTPUT_TRIM/$R1 out2=$OUTPUT_TRIM/$R2 \
    ref=$adapters_file \
    k=23 trimq=6 tpe tbo \
    threads=9
done

```
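
Once the script is saved, a syntax check with `bash -n` can catch typos before the job waits in the queue; the job is then submitted with `sbatch` and monitored with `squeue`. The sketch below writes a tiny stand-in script (the file name `quality_control.sh` and its content are assumptions for illustration) and only prints the Slurm commands when `sbatch` is available on the machine.

```shell
# Write a tiny stand-in job script (hypothetical name and content)
job_dir=$(mktemp -d)
cat > "$job_dir/quality_control.sh" <<'EOF'
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --mem=1G
echo "job started on $(hostname)"
EOF

# Parse the script without executing it; a non-zero status means a syntax error
if bash -n "$job_dir/quality_control.sh"; then
    syntax_ok=yes
else
    syntax_ok=no
fi
echo "syntax check passed: $syntax_ok"

# Show the submit and monitor commands only where Slurm is installed
if command -v sbatch >/dev/null 2>&1; then
    echo "submit with:  sbatch $job_dir/quality_control.sh"
    echo "monitor with: squeue -u ${USER:-your_username}"
else
    echo "sbatch not found: run this step on the cluster"
fi
```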

### <font color='darkred'>_**Notes for compute canada**_ </font>  
Our pipeline for quality control of raw reads includes several steps:

1. We use the ***seqkit*** tool to obtain basic statistics from the **.fastq** files. **seqkit** is available as a module in CC, so it can be loaded using `module load`. 
    - The tool takes `fastq` files as inputs and produces a text output 
2. We use **fastqc**, which is also available as a module in Compute Canada. The tool creates an overall summary of different metrics of sequencing including nucleotide distribution, presence of repeats, quality of base calling, GC content and adapter content
    - **fastqc** produces an output summary for every file entered
3. Another tool (**multiqc**) takes all output summaries of **fastqc** in a directory and creates a single HTML output that can be visualized in any web browser.
    - Not available in Compute Canada, so we use a singularity container to execute it
4. To have reads ready for assembly, we trim or remove any PCR adaptors, barcodes or poor quality regions using **BBtools**