# Preprocessing reads: cleaning and checking quality

<b>File completed </b> (05/18/2021)  
Table of Contents + tool versions on 06/22/2021

Bénédicte Noblet

- Preparing session for IFB core cluster
- Cleaning reads, along with some quality control, thanks to `fastp`
- MultiQC report summary post prepping
- Checking for species contamination (added on 06/23/2021)
- MultiQC summary report after `fastq-screen` (run on 06/24/2021)
- Current disk usage situation after preprocessing


---

## <b>Preparing session for IFB core cluster</b>

<em>loaded JupyterLab</em> : Version 2.2.9

In [1]:
echo "=== Cell launched on $(date) ==="

echo "=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}

echo "=== Working's root folder is ==="
gohome="/shared/projects/gonseq/Building/" # to adjust with your project's folder
echo "${gohome}"
echo ""

echo "=== current folder tree ==="
tree -d "${gohome}"
echo "=== current working directory ==="
echo "${PWD}"

=== Cell launched on Tue May 18 16:42:21 CEST 2021 ===
=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
       JobID  AllocCPUS        NodeList 
------------ ---------- --------------- 
16600945             10      cpu-node-8 
16600945.ba+         10      cpu-node-8 
16600945.0           10      cpu-node-8 
=== Working's root folder is ===
/shared/projects/gonseq/Building/

=== current folder tree ===
/shared/projects/gonseq/Building/
├── Data
│   ├── fastq
│   │   ├── fastp
│   │   └── raw
│   ├── info
│   └── sra
│       ├── SRR7430706
│       ├── SRR7430707
│       ├── SRR7430708
│       ├── SRR7430709
│       ├── SRR7430710
│       ├── SRR7430711
│       ├── SRR7430712
│       ├── SRR7430713
│       ├── SRR7430738
│       ├── SRR7430739
│       ├── SRR7430740
│       ├── SRR7430741
│       ├── SRR7430742
│       ├── SRR7430743
│       ├── SRR7430744
│       └── SRR7430745
├── Pipeline
└── Results
    ├── fastp
    ├── fastqc
    ├── logfiles
    └── mult

In [2]:
module load fastp/0.20.0 multiqc/1.9 fastq-screen/0.13.0

echo "===== cleaning, pre- and post-qualities ====="
fastp --version
echo "===== compiled report ====="
multiqc --version
echo "===== checking species contamination ====="  # not used, see below
fastq_screen --version

===== cleaning, pre- and post-qualities =====
fastp 0.20.0
===== compiled report =====
multiqc, version 1.9
===== checking species contamination =====
FastQ Screen v0.13.0


---
## <b>I- Cleaning reads, along with some quality control, thanks to <code>fastp</code></b>

A widely used tool is <a href="http://www.usadellab.org/cms/?page=trimmomatic">Trimmomatic</a>. <br>
Unfortunately, this tool only performs the trimming step (removal of low-quality bases) and does not include quality control afterwards. Thus, files need to be opened at least 3 times (quality check, trimming, quality check again), and for big files that takes quite some time, as you discovered with FastQC. <br>

Researchers from China developed a complete preprocessing tool, called <b>Fastp</b>, and published it in <a href="https://academic.oup.com/bioinformatics/article/34/17/i884/5093234"><em>Bioinformatics</em> in 2018</a>. Currently, code and guides are available on their <a href="https://github.com/OpenGene/fastp">GitHub repository</a>.

### **1- Tool version and default options overview**
This notebook was developed with ``fastp 0.20.0``.

In [3]:
fastp --version

fastp 0.20.0


The simplest usage for paired-end (PE) sequencing data is:

<code>fastp -i R1.fastqsanger.gz -I R2.fastqsanger.gz \
       -o R1.fastp.fastqsanger.gz -O R2.fastp.fastqsanger.gz <br></code>
       
If output options (`-o` and `-O` for paired end data) are omitted, `fastp` will operate the same and generate a report but won't write clean `.fastq.gz` files!

<div class="alert alert-block alert-info">
    As its manual, available either on the <a href="https://github.com/OpenGene/fastp/blob/master/README.md">GitHub repository</a> or with <code>fastp --help</code> (also displayed on the GitHub page), is quite long, we summarize some features here.
</div>

In its default options, *Fastp* proceeds to:
- adapter removal, also named trimming (automatic identification for commercial sets, ``-A`` to disable it)
- base qualification: a base with a Phred quality below 15 is counted as unqualified (``-q 15``, change value if wanted)
- read removal when more than 40% of its bases are unqualified (``-u 40``) <br>
- read removal when length drops below 15 bases (``-l 15``, ``-L`` can be used to inactivate this option)
- read removal when there are more than 5 N (undetermined) bases (``-n 5`` or ``--n_base_limit 5`` to adjust) <br>
- record of mated reads in separated files (specified with ``-o`` and ``-O``)
- usage of 2 threads (<a href="https://github.com/OpenGene/fastp/issues/13">IN FACT CORES</a>!), standing for 4 computer threads (``-w 2`` or ``--thread 2``)
- a compression of created ``.fastq`` files (``-z 4`` compression level gzip, ranging from 1-faster to 9-smaller)
- files overwriting if names already used (to change it, ``--dont_overwrite``) 
- naming report as "fastp report" (title inside file)
- report files writing IN CURRENT WORKING DIRECTORY

Thus, afterwards, the removed reads are gone, whatever the reason for their removal (a good-quality read whose mate failed, or a bad-quality read in itself)... and each report overwrites the previous one!  
We will adjust this later!
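
To make these defaults visible, here is a small sketch that writes them out as explicit long options and only prints the resulting command (a dry run, nothing is executed). The file names are placeholders and the values are the defaults listed above for `fastp 0.20.0`; other versions may differ. As a reminder, a Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so Q15 is roughly a 3.2% chance of a wrong base call.

```shell
# Dry run: print a fastp call with the default filters written explicitly.
# R1/R2 names are placeholders, not files from this project.
in1="R1.fastq.gz"
in2="R2.fastq.gz"

explicit_defaults="fastp --in1 ${in1} --in2 ${in2} \
--qualified_quality_phred 15 \
--unqualified_percent_limit 40 \
--length_required 15 \
--n_base_limit 5 \
--thread 2 \
--compression 4"

# Nothing is run here, the command is only displayed
echo "${explicit_defaults}"
```

Since no output option is given in this sketch, such a call would only produce the reports, not cleaned `.fastq.gz` files.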

### **2- Further options for ``fastp``**

Other available options are:
<blockquote>
    <code>-e score</code> or <code>--average_qual score</code> to filter reads by their mean quality (default 0, no requirement) <br>
    <code>-V</code> for verbose mode, with a log message every time 1M reads are processed <br>
    <code>-a</code> or <code>--adapter_sequence</code> to specify adapter sequence, else autodetected <br>
    <code>--adapter_sequence_r2</code> for read2 adapter sequence <br>
    <code>--adapter_fasta</code> to add a <code>.fasta</code> file with adapter sequences, all of them being applied on both read1 and read2 <br>
    <code>--filter_by_index1</code> (same for 2), to specify a file containing a barcodes list and <code>--filter_by_index_threshold</code> to allow for mismatches (default is set to 0) <br>
    <code>-c</code> or <code>--correction</code>, enable base correction for overlapping regions in PE: if one base is unqualified, use the corresponding base in the mate read if it is of good quality. <i>Caution: There are other options to add with this one.</i> <br>
    <code>-p</code> or <code>--overrepresentation_analysis</code>, to have overrepresented sequences per sample (<i>careful, big html report!</i>) and <code>-P number</code> or <code>--overrepresentation_sampling number</code>, to have 1 in <i>number</i> reads used to identify those overrepresented sequences (default, <code>-P 20</code>)
</blockquote>

PolyG tail trimming is required for sequencing data from Illumina NextSeq and NovaSeq platforms (two-color chemistry). It is activated by default for such datasets (<code>-g</code> or <code>--trim_poly_g</code>). See <a href="https://github.com/OpenGene/fastp#polyg-tail-trimming">corresponding section</a> for more details and <a href="https://github.com/OpenGene/fastp#polyx-tail-trimming">polyX tail trimming</a> counterpart.

In addition to default bad quality removal, you can add more stringent filters if needed:
<blockquote>
    <code>-f x</code> or <code>--trim_front1 x</code> to trim x bases from the 5' end of the reads in read 1 file (<code>-F x</code> or <code>--trim_front2 x</code> for read 2) <br>
    <code>-t x</code> or <code>--trim_tail1 x</code> to trim x bases from the 3' end of the reads in read 1 file (<code>-T x</code> or <code>--trim_tail2 x</code> for read 2) <br>
    <code>-y</code> or <code>--low_complexity_filter</code> to enable removal of reads containing few base changes (with <code>-Y 30</code> percent by default) <br>
</blockquote>

The next 3 options work on a sliding window, whose size can be adjusted (default <code>-W 4</code> or <code>--cut_window_size 4</code>, number ranging from 1 to 1000).
<blockquote>
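    <code>-5</code> or <code>--cut_front</code>: sliding window moving from the 5' start, drops the leading windows whose mean quality is below the threshold, then stops <br>
    <code>-3</code> or <code>--cut_tail</code>: same, moving from the 3' tail <br>
    <code>-r</code> or <code>--cut_right</code>: sliding window moving from the 5' start, drops the first low-quality window and all downstream bases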
</blockquote>

With a window size of 1 (``-W 1``), ``-5`` and ``-3`` are similar to ``LEADING`` and ``TRAILING`` in *Trimmomatic*, while ``-r`` is similar to ``SLIDINGWINDOW``. Please be aware that **these options interfere with downstream deduplication algorithms** (see <a href="https://github.com/OpenGene/fastp#per-read-cutting-by-quality-score">corresponding section</a> in manual).
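
To picture what ``-r`` does, here is a simplified sketch of the sliding-window logic on a single read. The quality values are made up, the window size matches ``-W 4``, and the threshold of 20 is, to my knowledge, the default of ``--cut_mean_quality``; the real implementation inside `fastp` may handle details differently.

```shell
# Phred scores of one imaginary read, from 5' to 3' (made-up values)
quals="36 36 35 34 30 22 18 12 10 8 30 31"
window=4      # same as -W 4
threshold=20  # assumed default mean quality threshold

keep=$(echo "${quals}" | awk -v w="${window}" -v t="${threshold}" '{
    keep = NF                                # by default, keep the whole read
    for (i = 1; i + w - 1 <= NF; i++) {      # slide the window from the 5prime end
        sum = 0
        for (j = i; j < i + w; j++) sum += $j
        if (sum / w < t) {                   # first window below the threshold:
            keep = i - 1                     # drop it and everything downstream
            break
        }
    }
    print keep
}')
echo "bases kept from the 5' end: ${keep}"
```

Here the first window whose mean drops below 20 starts at the 6th base, so only the first 5 bases are kept, even though the last two bases have a good quality again.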

### **3- Preparing step**

For *Fastp* to treat raw ``.fastq.gz`` files, let's remember where they can be found:

In [4]:
rawfolder="${gohome}Data/fastq/raw/"

We will create 2 folders to store the files ``fastp`` will create:
- cleaned ``.fastq.gz`` files

In [5]:
cleanedfastqfolder="${gohome}Data/fastq/fastp/"
mkdir -p ${cleanedfastqfolder}

- quality reports, one per sample (including quality before and after processing reads)

In [6]:
reportfolder="${gohome}Results/fastp/"
mkdir -p ${reportfolder}

In addition, we recall here the destination folder for the text files holding redirected outputs.

In [7]:
logfolder="${gohome}Results/logfiles/"
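
Before launching a long loop on paired-end data, it can be worth checking that every read 1 file has its mate. The sketch below is one possible way to do it; it runs on empty placeholder files in a temporary folder (the sample names are invented), whereas in this notebook the folder would be `${rawfolder}`.

```shell
# Self-contained demo with empty placeholder files in a temporary folder
demo_raw="$(mktemp -d)/"
touch "${demo_raw}SRR0000001_1.fastq.gz" "${demo_raw}SRR0000001_2.fastq.gz" \
      "${demo_raw}SRR0000002_1.fastq.gz"   # second sample lacks its mate on purpose

missing=0
for read1 in "${demo_raw}"*_1.fastq.gz; do
    # swap only the trailing "_1.fastq.gz", even if "_1" occurs elsewhere in the path
    read2="${read1%_1.fastq.gz}_2.fastq.gz"
    if [ ! -f "${read2}" ]; then
        echo "missing mate for $(basename "${read1}")"
        missing=$((missing + 1))
    fi
done
echo "${missing} sample(s) without read 2"

# cleaning the demo folder
rm -r "${demo_raw}"
```

The `${read1%_1.fastq.gz}` expansion removes the suffix only at the end of the name, which is a bit safer than a `sed` substitution that replaces the first `_1` found anywhere in the path.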

### **4- Looping for bad quality removal**

And, now the big loop! :-D

In [8]:
logfile="${logfolder}fastp_prequality-filtering-postquality.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starting by $(date)" >> ${logfile}

time for read1 in "${rawfolder}"*_1.fastq.gz; do

    # starting with the sample name
    id=$(basename "${read1}" | cut -d"_" -f1)
    echo "====== Processing sampleID: ${id}..." | tee -a "${logfile}"
        
    # fastq files section: paths are built from the sample ID
    read2="${rawfolder}${id}_2.fastq.gz"
    outread1="${cleanedfastqfolder}${id}_1.fastp.fastq.gz"
    outread2="${cleanedfastqfolder}${id}_2.fastp.fastq.gz"
    outremoved="${cleanedfastqfolder}${id}_removed.fastp.fastq.gz"
    
    # report section
    hreport="${reportfolder}${id}_fastp.html"  # "fastp" at the end, else MultiQC doesn't see the report
    jreport="${reportfolder}${id}_fastp.json"
    myheader="Sample ${id} fastp report" # append sample reference to default title

    echo "fastp starts by $(date)" >> "${logfile}"
    # fastp working
    fastp --in1 "${read1}" --in2 "${read2}" \
          --out1 "${outread1}" --out2 "${outread2}" \
          --failed_out "${outremoved}" \
          --qualified_quality_phred 15 \
          --length_required 30 \
          --thread 8 \
          --html "${hreport}" --json "${jreport}" \
          --report_title "${myheader}" \
          &>> "${logfile}"
    echo "fastp ends by $(date)" >> "${logfile}"
    
    echo "...done" | tee -a "${logfile}"
    
done

# size of files
ls -lh "${cleanedfastqfolder}" >> ${logfile}

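# note: "ls -l" on a folder also prints a leading "total" line,
# so the first count is one above the actual number of files
echo "fastq folder contains $(ls -l "${cleanedfastqfolder}" | wc -l) fastq.gz files." \
     | tee -a ${logfile}
echo "quality folder contains $(ls -l "${reportfolder}"*.html | wc -l) html reports." \
     | tee -a ${logfile}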

Screen output is redirected to /shared/projects/gonseq/Building/Results/logfiles/fastp_prequality-filtering-postquality.log
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done

real	82m57.082s
user	329m13.003s
sys	15m34.068s
fastq folder contains 49 fastq.gz files.
quality folder contains 16 html reports.


<blockquote>
    <code>--failed_out</code> to write failed reads to a file, with the failure reason appended to each read name. For paired-end (PE) data with no filename given for unpaired reads, a read that passes the filters while its mate fails is recorded as <code>paired_read_is_failing</code> <br>
    <code>-R text</code> or <code>--report_title text</code>, report <b>title</b> (default "fastp report") <br>
    <code>-j</code> or <code>--json</code> for json <b>filename</b> <br>
    <code>-h</code> or <code>--html</code> for html <b>filename</b>
</blockquote>

Before modifying filenames, please note:
<ul class="alert alert-block alert-info">
    <li>
        MultiQC detects reports only based <a href="https://multiqc.info/docs/#fastp">on the end of the filenames</a>: it looks for <code>fastp.json</code> and <code>_fastqc.zip</code>.
    </li>
</ul>
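
Following the rule stated above, a quick way to check whether a report name would be picked up is to test the end of the filename. This is only an illustration with hypothetical names; it mimics the suffix rule and does not call MultiQC itself.

```shell
# Hypothetical report names: only those ending with "fastp.json" follow the rule
detected=0
for name in "SRR7430706_fastp.json" "SRR7430706.json" "fastp_SRR7430706.json"; do
    case "${name}" in
        *fastp.json)
            echo "${name}: would be detected"
            detected=$((detected + 1))
            ;;
        *)
            echo "${name}: would NOT be detected"
            ;;
    esac
done
echo "${detected} name(s) matching"
```

This is why the loop above names its reports `<sampleID>_fastp.html` and `<sampleID>_fastp.json`, with the sample identifier first.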

---
## <b>II- MultiQC report summary post prepping</b>

Now let's have a look at our cleaned (we expect) dataset before mapping reads to the reference genome. <br>

In [9]:
multiqc --version

multiqc, version 1.9


For more details about MultiQC, please refer to previous notebook.

### **1- Folder, filename, title and comment**

Let's recall where report files are to be placed:

In [10]:
qcsummaries="${gohome}Results/multiqc/"

We then specify the names of the files and the title to display on the html report page.

In [11]:
inamemyfile="2_fastp-fastq-files"
mytitle="Fastq files qualities by Fastp: before and after filtering"

To keep a record of what has been done with these files, I add an additional comment, both for later use and to inform other readers:

In [12]:
mycomment=$(echo "Fastq files processed by Fastp with following options. " \
"qualified quality: 15, minimum length: 30, removed reads kept in same file.")

### **2- Generate summary report**

In [13]:
logfile="${logfolder}multiqc-processing_fastp-quality.log"
echo "Screen output is also saved in ${logfile}"

echo "operation starting by $(date)" >> ${logfile}
multiqc --interactive --export \
        --module fastp \
        --outdir "${qcsummaries}" \
        --filename "${inamemyfile}" \
        --title "${mytitle}"  \
        --comment "${mycomment}" \
        "${reportfolder}" \
        |& tee -a ${logfile}
echo "operation finished by $(date)" >> ${logfile}

# to see which files we have afterward and follow folder sizes
ls -lh "${gohome}Results/multiqc/" >> ${logfile}
ls -lh "${gohome}Results/" >> ${logfile}

Screen output is also saved in /shared/projects/gonseq/Building/Results/logfiles/multiqc-processing_fastp-quality.log
[INFO   ]         multiqc : This is MultiQC v1.9
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Report title: Fastq files qualities by Fastp: before and after filtering
[INFO   ]         multiqc : Searching   : /shared/projects/gonseq/Building/Results/fastp
[INFO   ]         multiqc : Only using modules fastp
[INFO   ]           fastp : Found 16 reports
  axes.set_xticklabels(['{:.0f}%'.format(x) for x in vals])
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : ../../../../../projects/gonseq/Building/Results/multiqc/2_fastp-fastq-files.html
[INFO   ]         multiqc : Data        : ../../../../../projects/gonseq/Building/Results/multiqc/2_fastp-fastq-files_data
[INFO   ]         multiqc : Plots       : ../../../../../projects/gonseq/Building/Results/multiqc/2_fastp-fastq-files_plots
[INFO   ]      

---
## <b>III- Checking for species contamination</b>

<ul class="alert alert-block alert-info">
    <li>
        While reading the introduction of the <b>FastQ Screen</b> documentation to find out how to retrieve the application's version, I found that <b>FastQ Screen</b> <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html?highlight=version#installation">is compatible with Bowtie, Bowtie2 or BWA</a> aligners... and <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html?highlight=version#configuration">the aligner used to build the index files must be used to map RNA reads</a>
    </li>
    <li>
       After discussion with Claire Vandiedonck, we will use the FastQ Screen aligner (I will use the default one, that is Bowtie2) only to identify contaminations, regardless of nucleic acid type, and align RNA reads properly later with a splice-aware aligner.
    </li>
</ul>

The session I used for ``fastq_screen`` addition:

In [1]:
echo "=== Cell launched on $(date) ==="
squeue -hu $USER

echo "=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid} | head -n 3

echo "=== Working's root folder is ==="
gohome="/shared/projects/gonseq/Building/" # to adjust with your project's folder
echo "${gohome}"

=== Cell launched on Wed Jun 23 18:20:23 CEST 2021 ===
          17246424      fast  jupyter  bnoblet  R    8:19:49      1 cpu-node-47
=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
       JobID  AllocCPUS        NodeList 
------------ ---------- --------------- 
17246424             10     cpu-node-47 
=== Working's root folder is ===
/shared/projects/gonseq/Building/


In [2]:
module load fastq-screen/0.13.0 multiqc/1.9

### **1- Tool version and default options overview**
This notebook was developed with ``fastq-screen 0.13.0``.

In [3]:
fastq_screen --version

FastQ Screen v0.13.0


FastQ Screen detects species contamination by aligning a subset of the reads from a ``.fastq`` file on chosen reference genomes.  
To date, only 3 aligners can be used to do so: ``bowtie``, ``bowtie2`` and ``bwa``.

To use this tool, we need index files for other species. We can:  
- download genomes sequence files and create indexes using the chosen aligner
- retrieve <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html#obtaining-reference-genomes">pre-indexed <code>bowtie2</code> genomes for a range of commonly studied species and sequences</a> using ``--get_genomes``

During this first step, we can specify destination folder for retrieved and generated files with <code>--outdir TEXT</code> option. If not specified then output files are saved in our current working directory.  
A folder called ``FastQ_Screen_Genomes/`` is created and contains a text configuration file along with reference genomes indexes. The configuration gives ``fastq_screen`` tool's script the paths to reference genomes (same folder when using ``--get_genomes``). It is meant to be modified to specify path to the aligner but we can also directly use ``--aligner`` option in command line.  
  
In the second step, the analysis itself, the default command line is ``fastq_screen path/to/file.fastq``.  
  
Along with the path to ``.fastq.gz`` files, we will use the options below: 
<blockquote>
    <code>--aligner OPTION</code>, to specify the aligner to use for the mapping. Valid arguments are 'bowtie', 'bowtie2' (default) or 'bwa'. If not specified, it will search for <code>bowtie2</code> location in config file <br>
    <code>--threads INTEGER</code>, to specify how many threads the aligner will be allowed to run. This option overrides the default value set in the configuration file <br>
    <code>--conf TEXT</code> to manually specify a location for the configuration. Alternatively, the configuration file should be within the program's folder <br>
</blockquote>

### **2- Preparation steps**

First, we will create a destination folder for all ``fastq_screen``-related files.

In [4]:
contascreenfolder="${gohome}Results/fastq_screen/"
mkdir -p ${contascreenfolder}

Then, we will download genome references thanks to ``--get_genomes`` option.

<div class="alert alert-block alert-warning">
    It took about an hour (58 minutes in the run below) to download all reference files
</div>

In [6]:
logfile=${gohome}Results/logfiles/fastq_screen_get_genomes.log
echo "Screen output is redirected to ${logfile}"

echo "fastq_screen get_genomes starts at $(date)" >> ${logfile}
time fastq_screen --get_genomes \
                  --outdir ${contascreenfolder} \
                  &>> ${logfile}
echo "fastq_screen get_genomes ends at $(date)" >> ${logfile}

echo "Folder's composition:" >> ${logfile}
tree ${contascreenfolder} >> ${logfile}
du -h "${contascreenfolder}FastQ_Screen_Genomes/" >> ${logfile}

echo "Downloaded genomes are: $(ls "${contascreenfolder}FastQ_Screen_Genomes/")" | tee -a ${logfile}

Screen output is redirected to /shared/projects/gonseq/Building/Results/logfiles/fastq_screen_get_genomes.log

real	58m7.440s
user	0m11.905s
sys	3m55.364s
Downloaded genomes are: Adapters
Arabidopsis
Drosophila
E_coli
fastq_screen.conf
Human
Lambda
Mitochondria
Mouse
PhiX
Rat
rRNA
Vectors
Worm
Yeast


We can see genome versions inside the logfile by searching for network links (``http://``), then filtering for particular species.

In [8]:
grep "http://" "${logfile}" | grep -e "Human/Homo_sapiens" -e "Mouse/Mus_musculus"

--2021-06-23 19:13:04--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38.1.bt2
--2021-06-23 19:17:22--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38.2.bt2
--2021-06-23 19:20:42--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38.3.bt2
--2021-06-23 19:20:42--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38.4.bt2
--2021-06-23 19:24:02--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38.rev.1.bt2
--2021-06-23 19:28:21--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38.rev.2.bt2
--2021-06-23 19:31:47--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Mouse/Mus_musculus.GRCm38.1.bt2
--2021-06-23 19:35:51--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genomes/Mouse/Mus_musculus.GRCm38.2.bt2
--2021-06-23 19:39:05--  http://ftp1.babraham.ac.uk/ftpusr46/FastQ_Screen_Genome

### **3- All samples analysis**

We will need to use the predefined config file to tell ``fastq_screen`` where to find genome files. Let's create a variable to handle it more easily.

In [10]:
fscreenconffile="${contascreenfolder}FastQ_Screen_Genomes/fastq_screen.conf"
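
Before launching the screening, it can be useful to check which genomes the configuration file declares. The sketch below assumes the usual layout of `fastq_screen.conf`, where each reference is declared on a line starting with `DATABASE`, followed by a name and the index basename; it runs on a small made-up file with placeholder paths, and in the notebook the file would be `${fscreenconffile}`.

```shell
# Made-up configuration file imitating the layout of fastq_screen.conf
demo_conf="$(mktemp)"
cat > "${demo_conf}" <<'EOF'
# lines starting with a hash are comments
THREADS 8
DATABASE Human /path/to/FastQ_Screen_Genomes/Human/Homo_sapiens.GRCh38
DATABASE Mouse /path/to/FastQ_Screen_Genomes/Mouse/Mus_musculus.GRCm38
#DATABASE Rat /path/to/FastQ_Screen_Genomes/Rat/placeholder_index
EOF

# keep only active DATABASE lines and print the genome name (second field)
databases=$(awk '$1 == "DATABASE" { print $2 }' "${demo_conf}")
n_db=$(awk '$1 == "DATABASE" { n++ } END { print n + 0 }' "${demo_conf}")
echo "${n_db} active database(s):"
echo "${databases}"

rm "${demo_conf}"
```

Commented lines (here the Rat one) are ignored since their first field is `#DATABASE`, which is a simple way to exclude a genome from the screening.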

We can now launch on all cleaned ``.fastq`` files produced earlier by ``fastp``.

<div class="alert alert-block alert-danger">
    The cells of the following section were not run inside this notebook, but they were run in the same IFB session and with the same destination folder. <br>
    <b>This cell uses 8 CPUs, please lower the <code>--threads</code> value if your session has fewer than 10 CPUs</b>.
</div>

In [None]:
logfile="${gohome}Results/logfiles/fastq_screen_all_v2.log"
echo "Screen output is redirected to ${logfile}"

echo "Destination folder already contains:" >> ${logfile}
ls -lh ${contascreenfolder} >> ${logfile}

echo "operations start at $(date)" >> ${logfile}

time for fastqfile in $(ls "${cleanedfastqfolder}"* | grep -v "removed"); do
    
    echo "=== starting for $(basename ${fastqfile})..." |& tee -a ${logfile}
    date >> ${logfile}
    
    fastq_screen --aligner bowtie2  \
                 --outdir ${contascreenfolder} \
                 --threads 8 \
                 --conf ${fscreenconffile} \
                 ${fastqfile} \
                 &>> ${logfile}
                 
    echo "... done ===" | tee -a ${logfile}
    date >> ${logfile}
    
done

echo "operations end at $(date)" >> ${logfile}

ls -lh ${contascreenfolder} >> ${logfile}
echo "$(ls ${contascreenfolder} | wc -l) files were created."

Screen output is redirected to /shared/projects/gonseq/Building/Results/logfiles/fastq_screen_all_v2.log
=== starting for SRR7430706_1.fastp.fastq.gz...
... done ===
=== starting for SRR7430706_2.fastp.fastq.gz...
... done ===
=== starting for SRR7430707_1.fastp.fastq.gz...
... done ===
=== starting for SRR7430707_2.fastp.fastq.gz...
... done ===
=== starting for SRR7430708_1.fastp.fastq.gz...
... done ===
=== starting for SRR7430708_2.fastp.fastq.gz...
... done ===
=== starting for SRR7430709_1.fastp.fastq.gz...
... done ===
=== starting for SRR7430709_2.fastp.fastq.gz...
... done ===
=== starting for SRR7430710_1.fastp.fastq.gz...
... done ===
=== starting for SRR7430710_2.fastp.fastq.gz...
... done ===
=== starting for SRR7430711_1.fastp.fastq.gz...
... done ===
=== starting for SRR7430711_2.fastp.fastq.gz...
... done ===
=== starting for SRR7430712_1.fastp.fastq.gz...
... done ===
=== starting for SRR7430712_2.fastp.fastq.gz...
... done ===
=== starting for SRR7430713_1.fastp.fastq

<div class="alert alert-block alert-danger">
    Output screen information stopped. <br>
    To handle <b>2 x 16 <code>.fastq.gz</code> files, it took almost 2 hours</b>.
</div>

In [67]:
grep "operations" "${logfile}"

operations start at Wed Jun 23 14:48:49 CEST 2021
operations end at Wed Jun 23 16:43:20 CEST 2021
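
The elapsed time can be worked out from these two timestamps. Below is a small sketch doing the subtraction with `awk`; the two times are copied by hand from the log lines above, and it assumes that both fall on the same day.

```shell
# Times copied from the "operations start/end" log lines (same day assumed)
start="14:48:49"
end="16:43:20"

elapsed=$(awk -v s="${start}" -v e="${end}" 'BEGIN {
    split(s, a, ":"); split(e, b, ":")
    # convert both times to seconds since midnight, then subtract
    d = (b[1] * 3600 + b[2] * 60 + b[3]) - (a[1] * 3600 + a[2] * 60 + a[3])
    printf "%dh %dm %ds", int(d / 3600), int((d % 3600) / 60), d % 60
}')
echo "elapsed time: ${elapsed}"
```

This gives 1h 54m 31s for the 32 files, which is the "almost 2 hours" mentioned above.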


In [68]:
ls "${contascreenfolder}"*.html | wc -l

32


---
## <b>IV- MultiQC summary report after <code>fastq_screen</code></b>

In [1]:
echo "=== Cell launched on $(date) ==="
squeue -hu $USER

echo "=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid} | head -n 3

echo "=== Working's root folder is ==="
gohome="/shared/projects/gonseq/Building/" # to adjust with your project's folder
echo "${gohome}"

module load multiqc/1.9
echo "===== compiled report ====="
multiqc --version

=== Cell launched on Thu Jun 24 10:24:54 CEST 2021 ===
          17256102      fast  jupyter  bnoblet  R      13:03      1 cpu-node-12
=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
       JobID  AllocCPUS        NodeList 
------------ ---------- --------------- 
17256102             10     cpu-node-12 
=== Working's root folder is ===
/shared/projects/gonseq/Building/
===== compiled report =====
multiqc, version 1.9


In [2]:
reportfolder="${gohome}Results/fastp/"
contascreenfolder="${gohome}Results/fastq_screen/"
logfolder="${gohome}Results/logfiles/"

### **1- Folder, filename, title and comment**

Let's recall where report files are to be placed:

In [3]:
qcsummaries="${gohome}Results/multiqc/"

We then specify the names of the files and the title to display on the html report page.

In [4]:
inamemyfile="2_fastp-fastq-files_with_fastqscreen"

mytitle=$(echo "Fastq files qualities by Fastp (before and after filtering) " \
"and FastQ Screen genomes contamination screening")

To keep a record of what has been done with these files, I add an additional comment, both for later use and to inform other readers:

In [5]:
mycomment=$(echo "Fastq files processed by Fastp with following options. " \
"qualified quality: 15, minimum length: 30, removed reads kept in same file." \
"FastQ Screen with bowtie2 (included version) and Babraham genomes reference.")

### **2- Generate summary report**

In [6]:
logfile="${logfolder}multiqc-processing_fastp-quality_and_fastqscreen_screening.log"
echo "Screen output is also saved in ${logfile}"

echo "operation starting by $(date)" >> ${logfile}
multiqc --interactive --export \
        --module fastp ${reportfolder} \
        --module fastq_screen ${contascreenfolder} \
        --outdir "${qcsummaries}" \
        --filename "${inamemyfile}" \
        --title "${mytitle}"  \
        --comment "${mycomment}" \
        "${gohome}Results/" \
        |& tee -a ${logfile}
echo "operation finished by $(date)" >> ${logfile}

# to see which files we have afterward and follow folder sizes
ls -lh "${qcsummaries}" >> ${logfile}
ls -lh "${gohome}Results/" >> ${logfile}

Screen output is also saved in multiqc-processing_fastp-quality_and_fastqscreen_screening.log
[INFO   ]         multiqc : This is MultiQC v1.9
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Report title: Fastq files qualities by Fastp (before and after filtering)  and FastQ Screen genomes contamination screening
[INFO   ]         multiqc : Searching   : /shared/projects/gonseq/Building/Results/fastp
[INFO   ]         multiqc : Searching   : /shared/projects/gonseq/Building/Results/fastq_screen
[INFO   ]         multiqc : Searching   : /shared/projects/gonseq/Building/Results
[INFO   ]         multiqc : Only using modules fastp, fastq_screen
[INFO   ]    fastq_screen : Found 32 reports
  axes.set_xticklabels(['{:.0f}%'.format(x) for x in vals])
[INFO   ]           fastp : Found 16 reports
  axes.set_xticklabels(['{:.0f}%'.format(x) for x in vals])
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : ../../../../../pro

---
## <b>V- Current disk usage situation after preprocessing</b>

<div class="alert alert-block alert-danger">
    The cells below were run again after I performed the additional FastQ Screen and MultiQC steps, that is, after I had also done the mapping and gene-level counting.
</div>

In [7]:
# little look at the folder's tree
tree -d -L 2 "${gohome}"

/shared/projects/gonseq/Building/
├── Data
│   ├── fastq
│   ├── info
│   └── sra
├── Pipeline
├── Reference
│   ├── extracted
│   └── indexes_upto49bases
└── Results
    ├── fastp
    ├── fastqc
    ├── fastq_screen
    ├── featurecounts
    ├── logfiles
    ├── multiqc
    ├── qualimap
    ├── samtools
    └── star

18 directories


Used options are:
<blockquote>
    <code>-d</code> to list only directories <br>
    Adding <code>-L</code> option and a number to avoid going too deep in the tree... and let the output be still readable.
</blockquote>

In [8]:
# disk usage by folder and subfolders
du -ch -d2 "${gohome}"

34M	/shared/projects/gonseq/Building/Results/multiqc
676M	/shared/projects/gonseq/Building/Results/featurecounts
8.4M	/shared/projects/gonseq/Building/Results/fastp
3.8M	/shared/projects/gonseq/Building/Results/samtools
20M	/shared/projects/gonseq/Building/Results/logfiles
34M	/shared/projects/gonseq/Building/Results/fastqc
117G	/shared/projects/gonseq/Building/Results/star
4.0K	/shared/projects/gonseq/Building/Results/.ipynb_checkpoints
12G	/shared/projects/gonseq/Building/Results/fastq_screen
38M	/shared/projects/gonseq/Building/Results/qualimap
130G	/shared/projects/gonseq/Building/Results
154G	/shared/projects/gonseq/Building/Data/fastq
46G	/shared/projects/gonseq/Building/Data/sra
44K	/shared/projects/gonseq/Building/Data/info
36K	/shared/projects/gonseq/Building/Data/.ipynb_checkpoints
199G	/shared/projects/gonseq/Building/Data
4.0K	/shared/projects/gonseq/Building/.ipynb_checkpoints
248K	/shared/projects/gonseq/Building/Pipeline/.ipynb_checkpoints
504K	/shared/projects/gonseq/Bu

The options stand for:
<blockquote>
    <code>-c</code> or <code>--total</code> to have total amount displayed <br>
    <code>-h</code> or <code>--human-readable</code> to get sizes in human-readable units (K, M, G) <br>
    <code>-dx</code> or <code>--max-depth=x</code> to limit folder enumeration to x levels
</blockquote>


---
___

Now we go on to map reads to the reference genome and examine the results.  
  
**=> Step 4 : Classical reads mapping** 


---
___

Another option exists to go faster (but we'll probably miss some observations): carrying out a pseudo mapping.  
  
**=> Alternate step 4 : Pseudomapping**