# Handling of sequencing reads

Reads are typically available in [fastq](https://en.wikipedia.org/wiki/FASTQ_format) format after conducting sequencing experiments (Illumina, Nanopore, PacBio).
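
For readers new to the format, the sketch below illustrates the four-line structure of a FASTQ record (header, sequence, separator, quality string). The two reads and the `parse_fastq` helper are made up for illustration and assume uncompressed, strictly four-line records; this is not a replacement for a dedicated parser.

```python
from io import StringIO

# Two made-up reads in FASTQ format (hypothetical data for illustration)
EXAMPLE_FASTQ = """@read_1
ACGTACGT
+
IIIIIIII
@read_2
GGCCTTAA
+
IIII####
"""

def parse_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'

records = list(parse_fastq(StringIO(EXAMPLE_FASTQ)))
for name, sequence, quality in records:
    print(name, len(sequence), quality)
```

Each base in the sequence line has one matching character in the quality line, which is why both strings have the same length.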

Our analysis will be run on the **Compute Canada (CC)** cluster.
To log in to the cluster, follow the instructions in the [wiki page for new users](https://docs.alliancecan.ca/wiki/SSH).

## Exploring the data

Data may be shared through a folder or downloaded from a biorepository. The data used here is the result of Illumina sequencing (75 bp paired-end reads). The first step is to make it available to yourself on the cluster.

The dataset is available in a shared directory (`reads_directory`), so you can copy it to the folder of your choice using the following code.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
%%bash
# replace the location with your folder
cd /project/6056895/mdprieto/tutorials

cp -ri /project/6007413/globus_share/Bronchiectasis_genomes/bronch_fastq.tar.gz .

# extract the reads 
tar -zxf bronch_fastq.tar.gz

If you list the files in your newly created folder, you will see that they are separated into '_R1.fastq' and '_R2.fastq' files.
This means that the reads were produced with paired ends, and each R1/R2 pair represents a single sample. This project also includes '_R0.fastq' files, which contain reads that were already trimmed.

In [None]:
# use the %cd magic: a `!cd` runs in its own subshell and does not persist
%cd /project/6056895/mdprieto/tutorials
!ls bronchiectasis_reads/* | head -n 10
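
As a rough illustration of how R1/R2 files map to samples, the following sketch groups file names by the text before the `_R1.fastq` or `_R2.fastq` suffix. The file names and the `pair_reads` helper are hypothetical, not the actual contents of the shared directory; files with other suffixes (such as '_R0.fastq') are simply ignored here.

```python
from collections import defaultdict

def pair_reads(filenames):
    """Group file names by sample, keyed on the text before the _R1/_R2 suffix."""
    samples = defaultdict(dict)
    for name in filenames:
        for tag in ("_R1.fastq", "_R2.fastq"):
            if name.endswith(tag):
                sample = name[: -len(tag)]
                samples[sample][tag[1:3]] = name  # key is "R1" or "R2"
    return dict(samples)

# Hypothetical file names for illustration only
example_files = ["A01_R1.fastq", "A01_R2.fastq", "B07_R1.fastq"]
pairs = pair_reads(example_files)

for sample, mates in sorted(pairs.items()):
    status = "paired" if {"R1", "R2"} <= set(mates) else "missing mate"
    print(sample, status)
```

A check like this can help spot samples whose mate file was lost during transfer before any analysis is started.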

## Quality control

It is good practice to organize every project or analysis in its own directory. You can also use GitHub repositories to synchronize your results and scripts so other researchers can easily reproduce your work.

Also, as a Compute Canada restriction, we cannot submit jobs to the cluster from directories under **HOME** or **PROJECT**, so we will work from a directory in scratch.

In [None]:
# create a working directory in scratch
mkdir -p /scratch/mdprieto/tutorials
cd /scratch/mdprieto/tutorials

Our pipeline for quality control of raw reads includes several steps:

1. We use the **seqkit** tool to obtain basic statistics from the **.fastq** files. **seqkit** is available as a module in CC, so it can be loaded using `module load`.
    - The tool takes `fastq` files as input and produces a plain-text table as output
2. We will use **fastqc**, which is also available as a module in Compute Canada. The tool creates an overall summary of sequencing metrics, including nucleotide distribution, presence of repeats, quality of base calling, GC content, and adapter content.
    - A `for` loop lets us apply a command to every read file, in this case to all files matching `*.fastq.gz` in `INPUT_DIR`
    - **fastqc** produces an output summary for every input file
3. Finally, another tool (**multiqc**) takes all the **fastqc** output summaries in a directory and creates a single HTML report that can be viewed in any web browser.
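
To give a feel for two of the metrics listed above, the sketch below computes GC content and mean base-call quality for a single made-up read. It assumes Phred+33 quality encoding; the helper names are hypothetical, and this is only an illustration of the metrics, not how **fastqc** is implemented.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    gc = sum(base in "GCgc" for base in sequence)
    return gc / len(sequence)

def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    scores = [ord(char) - offset for char in quality]
    return sum(scores) / len(scores)

# Made-up read for illustration
sequence = "ACGTGGCC"
quality = "IIII5555"

print(f"GC content: {gc_content(sequence):.2f}")
print(f"Mean Phred: {mean_phred(quality):.1f}")
```

In this encoding, the character `I` corresponds to a Phred score of 40 and `5` to a score of 20, so the second half of the example read has noticeably lower base-call quality.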
    

To make it easier to reproduce, we will open a text file with the `nano` editor and save the following code.

In [None]:
# open text editor
nano tutorial_seqkit.sh

# ------------------------------- write following code inside file

#!/bin/bash
#SBATCH --account=rrg-whsiao-ab                    # compute canada PI allocation
#SBATCH --mem=25gb                                 # 25 GB of memory
#SBATCH --time=06:00:00
#SBATCH --job-name="quality_control"               # name of job
#SBATCH --chdir=/scratch/mdprieto/tutorials        # change to directory before start
#SBATCH --cpus-per-task=9
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

################################ preparation ######################################

# load necessary modules
module load StdEnv/2020
module load nixpkgs/16.09
module load fastqc/0.11.9
module load python/3.10.2
module load seqkit/0.15.0
pip install multiqc

# create output directory
mkdir -p /scratch/mdprieto/tutorials/qc_results

# establish path for output and input
OUTPUT_DIR="/scratch/mdprieto/tutorials/qc_results"
INPUT_DIR="/project/6056895/mdprieto/tutorials/fastq_reads"

################################## seqkit #########################################

# run seqkit on the fastq files and save the output as a tab-separated file (-T)
seqkit stats -T "$INPUT_DIR"/*.fastq.gz > "$OUTPUT_DIR"/seqkit_output.tsv

################################## fastqc #########################################

# run fastqc tool for every file in INPUT_DIR
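for fastq_file in "$INPUT_DIR"/*.fastq.gz
do
    # note: -t sets how many files fastqc processes at once,
    # so it has little effect when files are passed one at a time
    fastqc \
        -o "$OUTPUT_DIR" \
        -t 9 \
        "$fastq_file"
    # output is saved as individual files in OUTPUT_DIR
done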

################################## multiqc #########################################

# move to output dir, multiqc automatically reads all files
cd "$OUTPUT_DIR"
multiqc .

## Results

Once the job has finished (the script is submitted with `sbatch tutorial_seqkit.sh`), the `qc_results` output directory for the tutorial will contain the individual **fastqc** summaries, a tab-separated file with statistics for all the sequencing reads, and an `html` report produced by **multiqc**.

Based on the **multiqc** report, we can exclude samples with poor sequencing quality and check for adapter sequences, which may need to be trimmed before using older assemblers.
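
As a complement to inspecting the report visually, read counts in the **seqkit** table can also be screened programmatically. The sketch below assumes the table was written in tab-separated form (for example with `seqkit stats -T`); the example rows, the `low_depth_files` helper, and the `MIN_READS` cutoff are all hypothetical and should be adapted to the project.

```python
import csv
from io import StringIO

# Made-up table in the column layout of `seqkit stats -T` (values are not real results)
EXAMPLE_TSV = (
    "file\tformat\ttype\tnum_seqs\tsum_len\tmin_len\tavg_len\tmax_len\n"
    "A01_R1.fastq.gz\tFASTQ\tDNA\t1200000\t90000000\t75\t75.0\t75\n"
    "B07_R1.fastq.gz\tFASTQ\tDNA\t40000\t3000000\t75\t75.0\t75\n"
)

MIN_READS = 100_000  # hypothetical cutoff; choose one that suits your project

def low_depth_files(handle, min_reads):
    """Return the files whose read count falls below min_reads."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["file"] for row in reader if int(row["num_seqs"]) < min_reads]

flagged = low_depth_files(StringIO(EXAMPLE_TSV), MIN_READS)
print(flagged)
```

To run this on real output, the `StringIO` object can be replaced with an open file handle to `seqkit_output.tsv`.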