# Quality Control and Pre-processing

We start from the raw reads produced by the sequencer, stored as FASTQ files.
The first crucial step in whole-exome sequencing (WES) analysis is to check the quality of these raw reads. There are many sources of error in sequencing reads (including ...).

Problematic bases and reads need to be trimmed or filtered out. We therefore first assess the quality of the reads with FastQC, and then use Trimmomatic for trimming and cropping (removing low-quality bases and clipping adapter sequences).


In [None]:
# number of fastq files
! find /data/humgen/guffantilab/exome/SystemsBio/input/Exome/EXOME -type f -name '*.fastq.*' | sort | uniq | wc -l

The command counts 614 FASTQ files. Since the reads are paired (one forward and one reverse file per sample), this corresponds to 307 samples.
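
The sample count can also be checked by pairing files explicitly rather than dividing the file count by two. The following is a minimal sketch, assuming the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming that the Trimmomatic script in this notebook relies on; `count_pairs` is a hypothetical helper, not part of any tool.

```shell
# count_pairs DIR: print how many R1 files in DIR have a matching R2 file.
# Hypothetical helper; assumes names end in _R1_001.fastq.gz / _R2_001.fastq.gz.
count_pairs() {
    local dir=$1 n=0 f1 f2
    for f1 in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$f1" ] || continue              # the glob matched nothing
        f2=${f1%_R1_001.fastq.gz}_R2_001.fastq.gz
        if [ -e "$f2" ]; then
            n=$((n + 1))
        else
            echo "missing mate for $f1" >&2   # report unpaired files on stderr
        fi
    done
    echo "$n"
}
```

Called as `count_pairs /data/humgen/burook/sysbio_exome/raw`, this would print 307 only if every sample has both files under that naming scheme; any forward file without a mate is reported on stderr.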

In [4]:
%%bash
# copy the FASTQ files into one folder (cp -t needs the target directory to exist)
mkdir -p /data/humgen/burook/sysbio_exome/raw
find /data/humgen/guffantilab/exome/SystemsBio/input/Exome/EXOME -type f -name '*.fastq.*' -exec cp -t /data/humgen/burook/sysbio_exome/raw/ {} +

### FastQC analysis

Save the following into a file (~/WES_analysis/fastqc_analysis1.sh).

In [None]:
#!/bin/sh

# FastQC is already available on erisone
module load FastQC/0.11.8-Java-1.8

# directory where raw data is saved
dir1=/data/humgen/burook/sysbio_exome/raw/
# directory where results will be saved
dir2=/data/humgen/burook/sysbio_exome/fastq_result1

# FastQC does not create the output directory itself
mkdir -p "$dir2"

cd "$dir1" || exit 1

# loop over the FASTQ files with a glob instead of parsing ls output
for file in *.fastq.*
do
    fastqc "$file" -o "$dir2"
done

Now run this script in bash, either from a terminal opened in Jupyter or from a notebook cell.

In [None]:
! ~/WES_analysis/fastqc_analysis1.sh

FastQC saves a separate report (an HTML file and a zip archive) for each individual FASTQ file. Now, let's use MultiQC to aggregate these results into a single report.
Note: FastQC could read and process only 540 of the 614 FASTQ files.
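
To find out which files FastQC skipped, the raw directory can be compared against the report directory. The following is a minimal sketch, assuming the inputs end in `.fastq.gz` and that FastQC uses its default naming, in which `X.fastq.gz` yields `X_fastqc.html`; `list_unprocessed` and `check_gzip` are hypothetical helpers.

```shell
# list_unprocessed RAW_DIR RESULT_DIR: print raw FASTQ files with no FastQC HTML report.
list_unprocessed() {
    local raw=$1 res=$2 f base
    for f in "$raw"/*.fastq.gz; do
        [ -e "$f" ] || continue               # the glob matched nothing
        base=$(basename "$f" .fastq.gz)
        [ -e "$res/${base}_fastqc.html" ] || echo "$f"
    done
}

# check_gzip FILE: gzip -t tests archive integrity without extracting,
# so a truncated or damaged copy is reported as corrupt.
check_gzip() {
    if gzip -t "$1" 2>/dev/null; then
        echo "ok: $1"
    else
        echo "corrupt: $1"
    fi
}
```

For example, `list_unprocessed "$dir1" "$dir2" | while IFS= read -r f; do check_gzip "$f"; done` would show whether the skipped files are damaged archives. This only tests gzip integrity; a file can pass and still be rejected by FastQC for other reasons.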

### MultiQC analysis

Run the following in a terminal.

In [None]:
%%bash

# let's install MultiQC in conda
module load anaconda2
conda create --name conda_test python=2.7
source activate conda_test
conda install -c bioconda multiqc

# cd to the folder where multiqc results should be saved (create it if needed)
mkdir -p /data/humgen/burook/sysbio_exome/fastq_result1_multiqc/
cd /data/humgen/burook/sysbio_exome/fastq_result1_multiqc/
multiqc /data/humgen/burook/sysbio_exome/fastq_result1

### Trimmomatic

Save the following in a text file (~/WES_analysis/trimmomatic_analysis.sh).

In [None]:
#!/bin/bash

# directory where raw data is saved
dir1=/data/humgen/burook/sysbio_exome/raw/
# directory where results will be saved
dir2=/data/humgen/burook/sysbio_exome/trimmomatic_results1

# make sure the output directory exists before writing to it
mkdir -p "${dir2}"

cd "${dir1}" || exit 1

for f1 in *_R1_001.fastq.gz
do
    # sample prefix shared by the forward and reverse files
    base=$(basename "$f1" _R1_001.fastq.gz)
    # f1 and f2 are input files for forward and reverse reads
    f2=${base}_R2_001.fastq.gz
    # the following four are output files (forward and reverse, paired and unpaired)
    f1p=${base}_R1_001.pe.fastq.gz
    f1s=${base}_R1_001.se.fastq.gz
    f2p=${base}_R2_001.pe.fastq.gz
    f2s=${base}_R2_001.se.fastq.gz

    # NOTE (worth verifying against the Trimmomatic manual): the last ILLUMINACLIP
    # field appears to be parsed as a boolean, so True may be needed in place of
    # the literal keepBothReads for both reads to be kept.
    java -jar /PHShome/bm363/bin/trimmomatic/trimmomatic-0.39.jar PE -threads 24 \
        "${f1}" "${f2}" \
        "${dir2}/${f1p}" "${dir2}/${f1s}" "${dir2}/${f2p}" "${dir2}/${f2s}" \
        ILLUMINACLIP:/PHShome/bm363/bin/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10:2:keepBothReads \
        LEADING:3 TRAILING:3 MINLEN:36
done


Run this script from a terminal (opened in Jupyter). Note that there are four output files per sample. The two paired (.pe) files contain the reads whose mate also survived trimming. The two unpaired (.se) files contain the reads that survived while their mate was discarded.
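
One sanity check on the Trimmomatic output is that the two paired files of a sample contain the same number of reads, since surviving pairs are written together. The following is a minimal sketch, assuming standard four-line FASTQ records; `count_reads` and `check_paired` are hypothetical helpers.

```shell
# count_reads FILE: number of reads in a gzipped FASTQ file (4 lines per record).
count_reads() {
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

# check_paired R1 R2: succeed only if both paired files hold the same number of reads.
check_paired() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "ok: $n1 pairs"
    else
        echo "mismatch: $n1 vs $n2" >&2
        return 1
    fi
}
```

For a given sample prefix, say SAMPLE (a placeholder), the check would be `check_paired "$dir2/SAMPLE_R1_001.pe.fastq.gz" "$dir2/SAMPLE_R2_001.pe.fastq.gz"`. A mismatch would suggest a truncated or interrupted run for that sample.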