# Observing raw data before processing them

<b>File completed </b> (05/17/2021)
Table of Contents + tool versions on 06/22/2021

Bénédicte Noblet

- Preparing session for IFB core cluster
- Some checks as a precaution
- First quality control on raw `.fastq.gz` files
- Having a summary report with MultiQC


---

## <b>Preparing session for IFB core cluster</b>

<em>loaded JupyterLab</em> : Version 2.2.9

In [1]:
echo "=== Cell launched on $(date) ==="

echo "=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}

echo "=== Working's root folder is ==="
gohome="/shared/projects/gonseq/Building/" # to adjust with your project's folder
echo "${gohome}"
echo ""

echo "=== current folder tree ==="
tree "${gohome}"
echo "=== current working directory ==="
echo "${PWD}"

=== Cell launched on Mon May 17 12:18:33 CEST 2021 ===
=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
       JobID  AllocCPUS        NodeList 
------------ ---------- --------------- 
16507470             10     cpu-node-39 
16507470.ba+         10     cpu-node-39 
16507470.0           10     cpu-node-39 
=== Working's root folder is ===
/shared/projects/gonseq/Building/

=== current folder tree ===
/shared/projects/gonseq/Building/
├── Data
│   ├── fastq
│   │   └── raw
│   │       ├── SRR7430706_1.fastq.gz
│   │       ├── SRR7430706_2.fastq.gz
│   │       ├── SRR7430707_1.fastq.gz
│   │       ├── SRR7430707_2.fastq.gz
│   │       ├── SRR7430708_1.fastq.gz
│   │       ├── SRR7430708_2.fastq.gz
│   │       ├── SRR7430709_1.fastq.gz
│   │       ├── SRR7430709_2.fastq.gz
│   │       ├── SRR7430710_1.fastq.gz
│   │       ├── SRR7430710_2.fastq.gz
│   │       ├── SRR7430711_1.fastq.gz
│   │       ├── SRR7430711_2.fastq.gz
│   │       ├── SRR7430712_1.fastq.gz
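
Every later path in this notebook is built by appending a subfolder directly to `${gohome}` (for example `"${gohome}Data/fastq/raw/"`), so the variable has to end with `/`. The sketch below shows one way to enforce that; `with_trailing_slash` is a hypothetical helper name, not something provided by the cluster.

```bash
# Hypothetical helper (not part of the original notebook):
# print the given path with exactly one trailing slash.
with_trailing_slash() {
    local path="$1"
    # An empty path is refused rather than silently turned into "/".
    if [[ -z "${path}" ]]; then
        return 1
    fi
    # Strip every trailing slash, but never reduce "/" to an empty string.
    while [[ "${path}" == */ && "${path}" != "/" ]]; do
        path="${path%/}"
    done
    if [[ "${path}" == "/" ]]; then
        printf '%s\n' "/"
    else
        printf '%s/\n' "${path}"
    fi
}

# Usage: normalise the project root once, right after defining it.
# gohome=$(with_trailing_slash "${gohome}")
```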


In [2]:
module load bc/1.07.1 fastqc/0.11.9 multiqc/1.9

echo "===== basic calculator ====="
bc --version | head -n 1
echo "===== individual reports ====="
fastqc --version
echo "===== compiled report ====="
multiqc --version

===== basic calculator =====
bc 1.07.1
===== individual reports =====
FastQC v0.11.9
===== compiled report =====
multiqc, version 1.9
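
If a module failed to load, the version commands above simply end with a "command not found" error. As a small safeguard, one could test for the commands first; `missing_tools` below is a hypothetical helper built on the shell's own `command -v`.

```bash
# Hypothetical helper: report which of the given commands are not on the PATH.
# Prints one missing command per line and returns 1 if any is missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "${tool}" > /dev/null 2>&1; then
            printf '%s\n' "${tool}"
            status=1
        fi
    done
    return "${status}"
}

# Usage after `module load`:
# missing_tools bc fastqc multiqc || echo "some tools are missing, check the module load line"
```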


---
## <b>I- Some checks as a precaution</b>

### **1- Available files**

The data files are already present on the server, in the `Data/fastq/raw/` folder of our current working directory.

As we change session and or day, let's first check all files are there using following commands and they correspo

In [3]:
rawfolder="${gohome}Data/fastq/raw/"

echo "There are $(ls ${rawfolder} | wc -l) raw .fastq.gz files:"
ls ${rawfolder}

There are 32 raw .fastq.gz files:
SRR7430706_1.fastq.gz  SRR7430711_2.fastq.gz  SRR7430741_1.fastq.gz
SRR7430706_2.fastq.gz  SRR7430712_1.fastq.gz  SRR7430741_2.fastq.gz
SRR7430707_1.fastq.gz  SRR7430712_2.fastq.gz  SRR7430742_1.fastq.gz
SRR7430707_2.fastq.gz  SRR7430713_1.fastq.gz  SRR7430742_2.fastq.gz
SRR7430708_1.fastq.gz  SRR7430713_2.fastq.gz  SRR7430743_1.fastq.gz
SRR7430708_2.fastq.gz  SRR7430738_1.fastq.gz  SRR7430743_2.fastq.gz
SRR7430709_1.fastq.gz  SRR7430738_2.fastq.gz  SRR7430744_1.fastq.gz
SRR7430709_2.fastq.gz  SRR7430739_1.fastq.gz  SRR7430744_2.fastq.gz
SRR7430710_1.fastq.gz  SRR7430739_2.fastq.gz  SRR7430745_1.fastq.gz
SRR7430710_2.fastq.gz  SRR7430740_1.fastq.gz  SRR7430745_2.fastq.gz
SRR7430711_1.fastq.gz  SRR7430740_2.fastq.gz
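
Counting files tells us nothing about whether each sample has both of its paired-end files. A minimal sketch of such a check follows, assuming the `_1.fastq.gz` / `_2.fastq.gz` naming seen above; `unpaired_fastq` is a hypothetical helper name.

```bash
# Hypothetical helper: list every *_1.fastq.gz file in a folder whose
# *_2.fastq.gz mate is missing, and the other way round.
unpaired_fastq() {
    local dir="${1%/}" file mate
    for file in "${dir}"/*_[12].fastq.gz; do
        [[ -e "${file}" ]] || continue          # folder without any match
        case "${file}" in
            *_1.fastq.gz) mate="${file%_1.fastq.gz}_2.fastq.gz" ;;
            *_2.fastq.gz) mate="${file%_2.fastq.gz}_1.fastq.gz" ;;
        esac
        [[ -e "${mate}" ]] || printf '%s\n' "${file##*/}"
    done
    return 0
}

# Usage: no output means every sample has both read files.
# unpaired_fastq "${rawfolder}"
```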


The files consist of raw data from the Illumina sequencer (`.fastq`) whose size has been reduced (`.gz`) thanks to compression (``gzip`` or ``pigz``, see `Pipe_1.ipynb` notebook). As genomics tools can deal with both compressed and uncompressed file formats, we'll save disk space by using the compressed ones.
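
A compressed file can be truncated by an interrupted download or copy. `gzip -t` reads an archive to the end and reports whether it is intact; the sketch below wraps it in a hypothetical helper, `corrupted_gz`. Since every file is fully decompressed, expect a run time per file similar to a `zcat` of that file.

```bash
# Hypothetical helper: test each .gz file and print the names of those
# that gzip cannot read through to the end (truncated or corrupted).
corrupted_gz() {
    local file
    for file in "$@"; do
        gzip -t "${file}" 2> /dev/null || printf '%s\n' "${file##*/}"
    done
    return 0
}

# Usage: no output means every archive is readable.
# corrupted_gz "${rawfolder}"*.gz
```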

### **2- Examining data files: are they what we expect?**

Let's pick one file and look inside.

We list the files in the folder and ask for only the first line (``-n 1``).

In [4]:
arawfile=$(ls "${rawfolder}"*gz | head -n 1)
echo ${arawfile}

/shared/projects/gonseq/Building/Data/fastq/raw/SRR7430706_1.fastq.gz


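``.fastq`` files are plain text and readable by the human eye, so we can display the first lines of this file by piping ``zcat``, which decompresses ``.gz`` files on the fly, into the Unix ``head`` command. The final `gzip: stdout: Broken pipe` message is harmless: ``head`` closes the pipe as soon as it has its 10 lines.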

In [5]:
zcat ${arawfile} | head

@SRR7430706.1 HWI-ST1136:323:HS214:3:1101:1373:1854 length=50
NCCAGCCACTCGGGAGGCCAAGGCAGGAGAATCACTCGGACCCAGGGTTG
+SRR7430706.1 HWI-ST1136:323:HS214:3:1101:1373:1854 length=50
#1=DFFFFHHHHHIIIJJJJJJJJJJJGFHIJIJJIJJJJJJJJJJJEHH
@SRR7430706.2 HWI-ST1136:323:HS214:3:1101:1409:1928 length=50
NTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTT
+SRR7430706.2 HWI-ST1136:323:HS214:3:1101:1409:1928 length=50
#4=DFFDEHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@SRR7430706.3 HWI-ST1136:323:HS214:3:1101:1305:1955 length=50
NTATGGTGGGAGCTTTCCCACAGAACATTTTCAGTGTGAGTGAATCAATA

gzip: stdout: Broken pipe


We expect to have a text file with 4 lines per read (sequence):
- read identifier, starting with `@`
- sequence itself (some `N` may appear when bases are undetermined)
- a separator line starting with `+`, followed either by the identifier again (as with the first sequencers) or by nothing else
- Phred quality string, one character per base (with the Phred+33 encoding used here, ASCII codes 33 to 74 stand for scores 0 to 41)
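
The four points above can be checked automatically on the first records of a file. The following is a minimal sketch with `awk`; `check_fastq_layout` is a hypothetical helper name, and it additionally assumes that the quality string must be as long as the sequence.

```bash
# Hypothetical helper: read FASTQ text on standard input and check the
# 4-line layout described above. Prints "ok <n> records" or the first problem.
check_fastq_layout() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            print "line " NR ": header does not start with @"; bad = 1; exit
        }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            print "line " NR ": separator does not start with +"; bad = 1; exit
        }
        NR % 4 == 0 && length($0) != seqlen {
            print "line " NR ": quality and sequence lengths differ"; bad = 1; exit
        }
        END {
            if (bad) exit 1
            if (NR % 4 != 0) { print "incomplete last record (" NR " lines)"; exit 1 }
            print "ok " (NR / 4) " records"
        }
    '
}

# Usage on the first 1000 records (4000 lines) of the file picked above:
# zcat "${arawfile}" | head -n 4000 | check_fastq_layout
```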

<div class="alert alert-block alert-info">
    For more information on the Phred score and its history, please refer to the <a href="https://en.wikipedia.org/wiki/FASTQ_format#Encoding">FASTQ format Wikipedia page</a>, which displays a graphical view of the different Phred score encodings.
</div>
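
To make the encoding concrete, the sketch below converts quality characters into scores by subtracting 33 from their ASCII code. It assumes the Phred+33 encoding; `phred33_scores` is a hypothetical helper name.

```bash
# Hypothetical helper: convert a Phred+33 quality string into numeric scores.
phred33_scores() {
    local quality="$1" i code scores=()
    for (( i = 0; i < ${#quality}; i++ )); do
        # printf with a leading quote returns the ASCII code of the character.
        printf -v code '%d' "'${quality:i:1}"
        scores+=( $(( code - 33 )) )
    done
    echo "${scores[*]}"
}

# First ten quality characters of read SRR7430706.1 shown above:
phred33_scores '#1=DFFFFHH'
# prints: 2 16 28 35 37 37 37 37 39 39
```

The leading `#` (score 2) matches the undetermined `N` base at the first position of the read.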

To count lines in that file:

In [6]:
time wcloutput=$(zcat ${arawfile} | wc -l)
echo ${wcloutput}


real	1m22.779s
user	1m21.912s
sys	0m6.036s
209335740


For those who would rather not reach for a calculator, we will use `bc`, a basic calculator that handles decimals, which `bash` arithmetic alone does not.

In [7]:
echo "scale=2; ${wcloutput}/4" | bc -l

52333935.00


If the result has no decimal part (*i.e.* it ends with `.00`) and the file follows the format described in the bullet list above, we are off to a good start. Otherwise, please ask the data supplier (platform or colleagues) for more information: file extensions are easy to add or change, and files can even be overwritten.
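
The same test can be written with `bash` integer arithmetic, where the remainder of the division by 4 gives the answer directly. This is only a sketch; `reads_from_lines` is a hypothetical helper name.

```bash
# Hypothetical helper: turn a line count into a read count, refusing
# counts that are not a multiple of 4.
reads_from_lines() {
    local lines="$1"
    if (( lines % 4 != 0 )); then
        echo "line count ${lines} is not a multiple of 4" >&2
        return 1
    fi
    echo $(( lines / 4 ))
}

# With the value obtained above (wcloutput=209335740):
reads_from_lines 209335740
# prints: 52333935
```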

<blockquote>
    <ul>
        <li> Alternatively, we can get the number of reads directly, noticing that every read identifier in this file starts (<code>^</code> in a search pattern) with <code>@SRR</code>, and using the <code>zgrep</code> command to search for that pattern in a <code>.gz</code> file: <br> <mark>makes Adenine stop running</mark> </li>
    </ul>
</blockquote>

In [8]:
time zgrep "^@SRR" ${arawfile} | wc -l

52333935

real	1m20.775s
user	1m30.358s
sys	0m10.563s
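
The `SRR` part of the pattern matters: `@` is also a valid quality character (ASCII 64, score 31 in Phred+33), so a bare `^@` may match quality lines too. The toy example below, with two made-up records, compares a pattern-based count with a position-based one.

```bash
# Two made-up records; the quality line of the first one starts with "@".
toy_fastq=$(printf '%s\n' \
    '@SRR0.1' 'ACGT' '+' '@III' \
    '@SRR0.2' 'TTGA' '+' 'IIII')

# Pattern-based count with a bare "^@": 3, one too many.
grep -c '^@' <<< "${toy_fastq}"

# Position-based count (lines 1, 5, 9, ...): 2, the real number of records.
awk 'NR % 4 == 1 { n++ } END { print n + 0 }' <<< "${toy_fastq}"
```

On a real file, the position-based variant would read `zcat "${arawfile}" | awk 'NR % 4 == 1 { n++ } END { print n + 0 }'`.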


---
## <b>II - First quality control on raw <code>.fastq.gz</code> files</b>

### **1- Tool version and introduction**
For this step, we will use <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/"><b>FastQC</b></a> (notebook developed with ``FastQC v0.11.9``).

In [9]:
fastqc --version

FastQC v0.11.9


To analyze a sample, we could launch: <br>
<code>fastqc --outdir path/to/destination/folder/ \ <br>        path/to/file.fastq.gz</code> <br>
where <code>--outdir</code> introduces the path where you want the newly created files to be saved, while the file to be analyzed is placed at the end of the line. <br>
<br>
For several samples, we can directly launch <code>fastqc</code> with a list of files to analyze. As several cores are available, we can ask <code>fastqc</code> to deal with several files at a time.
<blockquote>
    <code>-t 16</code> or <code>--threads 16</code> to ask for 16 files to be processed in parallel, knowing that each thread is allocated 250 MB of RAM (<em>so 4 GB at a time for 16 threads; our 32 files then run as 2 batches of 16</em>)
</blockquote>
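
One reasonable rule of thumb, given here only as a sketch, is to ask for no more threads than there are files or allocated CPUs, and to estimate memory from the 250 MB per thread mentioned above. `fastqc_threads` is a hypothetical helper name.

```bash
# Hypothetical helper: pick a FastQC thread count no larger than the number
# of files or the number of CPUs, and estimate memory at 250 MB per thread.
fastqc_threads() {
    local nfiles="$1" ncpus="$2"
    local threads=$(( nfiles < ncpus ? nfiles : ncpus ))
    echo "${threads} threads, about $(( threads * 250 )) MB"
}

# 32 files on the 10-CPU session shown at the top of this notebook:
fastqc_threads 32 10
# prints: 10 threads, about 2500 MB

# On the cluster, the CPU count could be taken from `nproc` (assumption:
# it reflects the CPUs allocated to the session):
# fastqc_threads "$(ls "${rawfolder}"*.gz | wc -l)" "$(nproc)"
```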

### **2- Prepare destination folders**

We will store output files in ``Results/`` and in a subfolder called ``fastqc/``.

In [10]:
qcfolder="${gohome}Results/fastqc/"
mkdir -p "${qcfolder}"

As it's easier to work with files saved close to each other, the matching ``.log`` file will be saved in a ``logfiles/`` subfolder, also placed in ``Results/``.

In [11]:
logfolder="${gohome}Results/logfiles/"
mkdir -p "${logfolder}"

### **3- Run ``fastqc`` tool**

In [12]:
logfile="${logfolder}fastqc_raw-quality-processing.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starting by $(date)" >> ${logfile}

time fastqc --outdir ${qcfolder} --threads 16 \
            $(echo "${rawfolder}"*.gz)\
            &>> ${logfile}
echo "operation finished by $(date)" >> ${logfile}

# to see which files we have afterward and follow folder sizes
ls -lh ${qcfolder} >> ${logfile}
ls -lh "${gohome}Results/" >> ${logfile}

echo "$(ls -l "${qcfolder}"*.html | wc -l) generated .html reports"

Screen output is redirected to /shared/projects/gonseq/Building/Results/logfiles/fastqc_raw-quality-processing.log

real	18m18.879s
user	140m50.451s
sys	3m42.930s
32 generated .html reports
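
A count of 32 reports is reassuring, but it does not say which file is missing when the count is wrong. The sketch below lists the raw files that have no report. It assumes FastQC's usual output naming (`<name>_fastqc.html`, where `<name>` is the input name without `.fastq.gz`); `missing_fastqc_reports` is a hypothetical helper name.

```bash
# Hypothetical helper: print the raw files for which no FastQC .html report
# is found in the results folder.
missing_fastqc_reports() {
    local rawdir="${1%/}" qcdir="${2%/}" file name
    for file in "${rawdir}"/*.fastq.gz; do
        [[ -e "${file}" ]] || continue          # folder without any match
        name="${file##*/}"                      # drop the folder part
        name="${name%.fastq.gz}"                # drop the extension
        [[ -e "${qcdir}/${name}_fastqc.html" ]] || printf '%s\n' "${file##*/}"
    done
    return 0
}

# Usage: no output means one report per input file.
# missing_fastqc_reports "${rawfolder}" "${qcfolder}"
```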


The outputs are a `.zip` archive and a `.html` file for each input file, the latter being a complete summary of the analysis. <br>
To open this `.html` file, double-click the "Results" folder in the left-hand panel of *JupyterLab*, then the `fastqc` subfolder, then the html file: it should open in a new tab beside this notebook. <br>

If you have no teacher or bioinformatician at hand (or if they are not familiar with this subject either), you can browse the following links:
<ul class="alert alert-block alert-info">
    <li><code>fastqc</code>'s help sections on <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/">its raw (no layout!) website</a>
    </li>
    <li>Michigan State University's support facility offers a nicer <a href="https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/">FASTQC Tutorial and FAQ</a>
    </li>
    <li><i>Assessing quality metrics</i>'s section in <a href="https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html">Quality control</a>, from a former workshop by the <i>Harvard Chan Bioinformatics Core</i> (HBC). <br>
        <i>++</i>: <b>some diagrams of the sequencing run and detailed information in general</b> 
    </li>
    <li><a href="https://huoww07.github.io/Bioinformatics-for-RNA-Seq/lessons/02_Quality_Control.html#understand-fastqc-report">RNA sequencing quality control with fastQC</a>'s section of <i>Tufts University Research Technology Workshop</i>
    </li>
</ul>

After eyeing one file (<i>so only one sample!</i>), you may want to view all results at the same time to compare between samples. That's where MultiQC and next steps will help us.
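
For a quick text-only overview before building the full report, the flags stored in each FastQC archive can be listed. This sketch assumes that each `.zip` contains a `<name>_fastqc/summary.txt` file with tab-separated status, module and file name, and that `unzip` is available; `fastqc_flags` is a hypothetical helper name.

```bash
# Hypothetical helper: read one or more FastQC summary.txt contents on
# standard input (tab-separated: status, module, file name) and print
# every module that is not PASS, as "status<TAB>file<TAB>module".
fastqc_flags() {
    awk -F '\t' '$1 != "PASS" && NF >= 3 { print $1 "\t" $3 "\t" $2 }'
}

# Usage (assumed archive layout, see above):
# for z in "${qcfolder}"*_fastqc.zip; do unzip -p "${z}" '*/summary.txt'; done | fastqc_flags
```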

## <b>III- Having a summary report with MultiQC</b>

When numerous samples are processed, looking at each quality report one by one quickly becomes tedious. For that purpose, we will run <a href="https://multiqc.info/"><b>MultiQC</b></a>, which automatically scans a folder for all quality-check outputs and produces a single report.

### **1- Tool version and short presentation**

This notebook was developed with ``multiqc, version 1.9``, even though the next version (1.10) was released in the meantime.

In [13]:
multiqc --version

multiqc, version 1.9


This tool works with a large number of NGS tools: see <a href="https://multiqc.info/docs/#multiqc-modules">the full, up-to-date online list</a> for more details and to learn how it works (detected files and folder extensions).

By default, ``multiqc`` identifies any report it can parse from the input directory.
If you only want a MultiQC report on specific analyses, you can add the ``-m`` argument followed by the name of the module, for example:
<code>multiqc -m fastqc ./Results/Fastqc/ -o ./Results/MultiQC_on_FastQC</code>
> You can add several modules: ``-m fastqc dir_fastqc -m qualimap dir_qualimap``, etc.

### **2- Folder, filename, title and comment**

We will create a subfolder in the ``Results/`` folder for ``multiqc``.

In [14]:
qcsummaries="${gohome}Results/multiqc/"
mkdir -p ${qcsummaries}

All downstream reports will also be saved here and we will use different file names.

We will ask MultiQC for a specific, meaningful file name and title using the ``-n`` and ``-i`` options.
<blockquote>
    <code>-n</code> or <code>--filename TEXT</code> to have a non-default report filename (warning: <code>stdout</code> will just print results to the console) <br>
    <code>-i</code> or <code>--title</code> to change the report header; also used for the filename if <code>-n</code> is not specified <br>
    <code>-b</code> or <code>--comment</code> to add any text section in report
</blockquote>

Please, specify **file name** you want to have (do not worry about extension, MultiQC will handle this for us) inside quotes in the next cell.  
<b>DO NOT use spaces or any special characters!</b> 

In [15]:
inamemyfile="1_raw-fastq-files"
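
The chosen name can be checked before it is handed to MultiQC. In this sketch, "no special characters" is read as "letters, digits, dot, underscore and hyphen only", which is our own assumption; `is_safe_filename` is a hypothetical helper name.

```bash
# Hypothetical helper: accept only letters, digits, dot, underscore and hyphen.
is_safe_filename() {
    [[ "$1" =~ ^[A-Za-z0-9._-]+$ ]]
}

# Usage with the name chosen above:
if is_safe_filename "1_raw-fastq-files"; then
    echo "file name accepted"
else
    echo "file name contains spaces or special characters" >&2
fi
```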

Please specify a meaningful **title** inside quotes in the next cell, to display at the top of your upcoming report.  
<b>Spaces are allowed here, but still avoid any special characters.</b> 

In [16]:
mytitle="Raw fastq files quality for 16 paired end sequenced samples"

Besides, we can add a comment in the report's header. It's good practice to do so, so we will define it in the following cell.

> In this cell, we use several lines to keep the text readable when displaying the notebook. As your text lines are simply joined together in the html report, be sure to keep the trailing blank space at the end of every line (``echo`` also puts a space between its arguments, and HTML collapses repeated spaces, so the extra blank does no harm).

In [17]:
mycomment=$(echo "Raw fastq files from bulk RNA sequencing (human, stranded, rRNA depletion) " \
"performed by Lecluze et al. 2020 (PMID: 32412604, GEO: GSE116278, SRA: SR151462). " \
"Sample set is limited to fetal ovaries and testes at 6 and 12 PCWeeks (n=4 per condition). " \
"SRR7430706-709: female 6 PCW, 710-713: male 6 PCW, 738-741: female 12 PCW, 742-745: male 12 PCW.")

### **3- Generate summary report**

MultiQC is verbose but, as it will only work on FastQC reports here, its output is quite short.  
So, we will let its lines show in the notebook while also saving them to a file for later use.

In [19]:
logfile="${logfolder}multiqc-processing_raw-quality.log"
echo "Screen output is also saved in ${logfile}"

# as time command does not redirect output
echo "operation starting by $(date)" >> ${logfile}
multiqc --interactive --export \
        --outdir "${qcsummaries}" \
        --filename "${inamemyfile}" \
        --title "${mytitle}"  \
        --comment "${mycomment}" \
        "${qcfolder}" \
        |& tee -a ${logfile}
echo "operation finished by $(date)" >> ${logfile}

# to see which files we have afterward and follow folder sizes
ls -lh "${qcsummaries}" >> ${logfile}
ls -lh "${gohome}Results/" >> ${logfile}

Screen output is also saved in /shared/projects/gonseq/Building/Results/logfiles/multiqc-processing_raw-quality.log
[INFO   ]         multiqc : This is MultiQC v1.9
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Report title: Raw fastq files quality for 16 paired end sequenced samples
[INFO   ]         multiqc : Searching   : /shared/projects/gonseq/Building/Results/fastqc
[INFO   ]          fastqc : Found 32 reports
  axes.set_xticklabels(['{:.0f}%'.format(x) for x in vals])
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : ../../../../../projects/gonseq/Building/Results/multiqc/1_raw-fastq-files.html
[INFO   ]         multiqc : Data        : ../../../../../projects/gonseq/Building/Results/multiqc/1_raw-fastq-files_data
[INFO   ]         multiqc : Plots       : ../../../../../projects/gonseq/Building/Results/multiqc/1_raw-fastq-files_plots
[INFO   ]         multiqc : MultiQC complete
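
The `Found 32 reports` line is the one to compare with the number of raw files. As a sketch, it can be extracted from the log text; `fastqc_reports_found` is a hypothetical helper name, and the pattern relies on the log wording shown above.

```bash
# Hypothetical helper: extract the number announced by the FastQC module
# ("fastqc : Found 32 reports") from MultiQC's log text on standard input.
# The log file is appended to at each run, so only the last match is kept.
fastqc_reports_found() {
    sed -n 's/.*fastqc : Found \([0-9][0-9]*\) reports.*/\1/p' | tail -n 1
}

# Usage: compare with the number of raw files.
# found=$(fastqc_reports_found < "${logfile}")
# expected=$(ls "${rawfolder}"*.gz | wc -l)
# [[ "${found}" -eq "${expected}" ]] && echo "all ${found} reports were picked up"
```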


The three options that have not yet been introduced in this notebook are:
<blockquote>
    <code>-ip</code> or <code>--interactive</code> to use dynamic graphics, so that the plots in the html report are interactive <br>
    <code>-p</code> or <code>--export</code> to export plots as static images alongside the html report <br>
    <code>-o</code> or <code>--outdir</code> to define the destination folder for output and report files <br>
    then, as last argument, the folder we want to be scanned <br>
</blockquote>

Others options exist: <br>
<blockquote>
     <code>-m</code> or <code>--module</code> to limit MultiQC's searches to some quality tools <br>
    <code>-d</code> or <code>--dirs</code> to prepend directory names to sample names (useful for same names in different folders) <br>
    <code>-f</code> or <code>--force</code> to force overwriting existing files <br>
    <code>-v</code> or <code>--verbose</code> to increase output verbosity <br>
    <code>--tag TEXT</code> if only TEXT-matching modules are desired <br>
    <code>--pdf</code> to get a pdf report (requires <code>pandoc</code> to be installed)
</blockquote>

To open the report (an ``.html`` file that relies on JavaScript, which *JupyterLab* does not support so far), download the html file from the left-hand panel and open it in your own browser so that all its features work.


---
___

Once you have reviewed the MultiQC report to identify what needs correcting in your data, please proceed to the next step.
  
**=> Step 3 : Preprocessing reads and checking for their quality** 