# __RNAseq Analysis Module__

## **Practical Session 3: Quality check of raw data and mapping**

Tuesday, the 1st of December, 2020   
Claire Vandiedonck and Sandrine Caburet - 2020  


   1. Getting started   
   2. Quality controls on *C. parapsilosis* fastq files   
   3. Mapping the reads on the *C. parapsilosis* genome using the BOWTIE program  
   4. Managing the output files
   5. Batch analysis of the other samples


---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly.
</div>

<div class="alert alert-block alert-info"> 
    
__*About jupyter notebooks:*__

- To add a new cell, click on the "+" icon in the toolbar above your notebook
- You can "click and drag" to move a cell up or down
- You choose the type of cell in the toolbar above your notebook:
    - 'Code' to enter command lines to be executed 
    - 'Markdown' cells to add text, that can be formatted with some characters
- To execute a 'Code' cell, press SHIFT+ENTER or click on the "play" icon 
- To display a 'Markdown' cell, press SHIFT+ENTER or click on the "play" icon  
- To modify a 'Markdown' cell, double-click on it

  
*To make nice html reports with markdown:* [html visualization tool1](https://dillinger.io/) or [html visualization tool2](https://stackedit.io/app#) and [to draw nice tables](https://www.tablesgenerator.com/markdown_tables) and the [Ultimate guide](https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd)   
*Further reading on JupyterLab notebooks:*  https://jupyterlab.readthedocs.io/en/latest/user/notebook.html <br>
*Here we are using the JupyterLab interface implemented as part of the https://plasmabio.org/ project led by Sandrine Caburet, Pierre Poulain and Claire Vandiedonck.*

</div>    

___

__*=> About this jupyter notebook*__

This is a jupyter notebook in **bash**, meaning that the commands you will enter or run in `Code` cells are directly understood by the server. <br>You could run the same commands in a `Terminal` (the frightening black window that computer scientists use :-D). 

>_If you want to see this by yourself, you can open a terminal on adenine:_
>- _in the `File` menu in the top bar, select `New Launcher` or click on the `+` sign below_
>- _open either a bash `Console` or a `Terminal`_
>- _you'll be able to copy and paste the commands from the `Code` cells of the notebook in the "bottom cell" (for the console) or after the `$` sign (for the terminal)_
>
>_This is for your information only, and not needed. All the commands are already included in this notebook_
<br>

- In Unix, all characters are case sensitive.
- It is good practice to avoid accents and special characters.   
- Within `Code` cells, lines starting with a `#` are comments and are not interpreted as a command. They are meant to help you.  
- You may add your own comments as well, either in a `Code` cell using this `#`, or in a new `Markdown` cell added with the "+" above.  
- <mark>If you add cells with comments, or modify existing cells, **don't forget to save your notebook**.</mark>
___

## **I - Getting started**

### **1- Working directory**

The working directory is where you are currently located on the server. By default, for this practical session, it is the folder that was created in your home directory when you selected the correct 'server' and launched this JupyterLab environment.  

To check where you are working, use the `pwd` command, which stands for "print working directory".

In [None]:
pwd

<div class="alert alert-block alert-warning"><b>The result should be like this:</b> <code>/srv/home/mylogin/m2meg-rnaseq-tp3to5-bash</code> with your own login in place of "mylogin". If not, call us! The working directory can be changed using the Unix command <b>cd</b> (change directory).</div>

>_Here is a link for some basic Unix commands: https://files.fosswire.com/2007/08/fwunixref.pdf (there are plenty of other good ones on the net).<br>
>You can also get an explanation of general Unix commands using this tool: https://explainshell.com/. <br>
> Another tip: you can autocomplete the names of your files and folders with the Tab key on your keyboard._

The content of this working directory is displayed in the left panel. You can also list the content of this folder with the `ls` command (which stands for "list"):

In [None]:
# the option -l provides details for each file, including its size, 
# the option -h stands for "human-readable": sizes are shown in K, M or G rather than in bytes. 
# the two options are combined as -lh
# you may add -tr as well to sort the files by modification time, oldest first.

ls -lh

### **2- Data** 
The data files are already present on the server, in the `/srv/data/meg-m2-rnaseq/genome/` and in `/srv/data/meg-m2-rnaseq/experimental_data` folders.
<br><mark> Do not copy them to your working directory. </mark> <br>
We will directly read them from where they are by indicating the  **absolute path** to these folders.

#### **2.a- list of input files:**

In [None]:
# Here we list the content of the folder containing the genome data

ls -lh /srv/data/meg-m2-rnaseq/genome/

In [None]:
# Here we list the content of the folder containing the experimental data

ls -lhtr /srv/data/meg-m2-rnaseq/experimental_data/

You may count the number of files in a folder using the following command. The symbol `|` is a "pipe": it passes the output of the command on its left as input to the command on its right. The command `grep` (*globally search for a regular expression and print matching lines*) keeps only the lines that match a given pattern. The final part of the command, `wc -l`, counts the number of lines.

In [None]:
ls /srv/data/meg-m2-rnaseq/experimental_data/ | grep "fastq" | wc -l 
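
To see how each stage of such a pipeline narrows the result, here is a minimal sketch on a throwaway folder. The folder and the file names in it are made up for illustration and are not the course data.

```shell
# Build a scratch folder with a few empty files (hypothetical names).
demo_dir=$(mktemp -d)
touch "${demo_dir}/a.fastq" "${demo_dir}/b.fastqsanger.gz" "${demo_dir}/md5sums.txt"

# Stage 1 only: list everything in the folder and count the lines.
n_all=$(ls "${demo_dir}" | wc -l)

# Stages 1 to 3: keep only the names containing "fastq", then count them.
n_fastq=$(ls "${demo_dir}" | grep "fastq" | wc -l)

# grep -c counts the matching lines itself, so the final wc -l is optional.
n_fastq_c=$(ls "${demo_dir}" | grep -c "fastq")

echo "all files: ${n_all}; fastq files: ${n_fastq}; with grep -c: ${n_fastq_c}"

# Clean up the scratch folder.
rm -r "${demo_dir}"
```

Running the pipeline one stage at a time like this is a simple way to check what each command contributes.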

The first two files are `.fastq` files containing raw data from the Illumina sequencer. The other 8 are gzipped (`.gz`) compressed files. You can notice that their size is reduced compared to the `.fastq` files. Most genomics tools can work with both compressed and uncompressed files.

#### **2.b- checking files integrity:**

<div class="alert alert-block alert-warning"><b>Checking the data are not corrupted</b><br>
Whenever you get such input files, it is mandatory to verify that they are intact and not corrupted before analysing the data further.
This can be performed by computing an <b>md5sum</b>, a kind of "barcode" or "fingerprint" of each file. It should remain the same after a copy to your computer, for example.<br>
Similarly in your laboratories, if you get files from collaborators or a Next-Generation-Sequencing platform, always ask for the md5sums to check file integrity.</div>

You may either get the md5sum of one file at a time like this using the command `md5sum` followed by a space and the name of the file:

   - on the __genomic files__:

In [None]:
md5sum /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta

Or you may get the `md5sum` fingerprint of all the files at once in the folder by using `*` which stands for "anything"

In [None]:
# In a command, * stands for 'anything'.

md5sum /srv/data/meg-m2-rnaseq/genome/*

# You should get the following "barcodes" for each file:
# 6455d97a060c3c7d1e94112f818fa046  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CDC317_GO_distrib-5958g.txt
# e189032dafc2b7013eeae7d33cbf9458  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta
# 537217ec9ac54343af31b28521c0c6f3  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta.fai
# e86c62e99a240c0ac309cd067d105522  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_ORFs.gff
# be6f316b0fcca1b653ee5b98648ddfb2  /srv/data/meg-m2-rnaseq/genome/md5sums.txt


Even better is to have in the folder a file, conventionally called `md5sums.txt`, that stores the output of the above `md5sum` command. If you had write permission on that folder, you could generate this file by redirecting the output of `md5sum` into it with `>`.
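
As a minimal sketch of that idea, the cell below creates such a file and uses it in a scratch folder. The two sample files and their content are made up, so nothing in the course data is touched.

```shell
# Work in a scratch folder.
demo_dir=$(mktemp -d)
cd "${demo_dir}"
printf 'ACGT\n' > sample_A.txt
printf 'TTGA\n' > sample_B.txt

# Store one "fingerprint  filename" line per file.
md5sum sample_*.txt > md5sums.txt
cat md5sums.txt

# Verify the files against the stored fingerprints: each one is reported as OK.
if md5sum -c md5sums.txt; then check_intact="passed"; else check_intact="failed"; fi

# Simulate a corrupted copy: the check now reports FAILED for that file.
printf 'ACGA\n' > sample_A.txt
if md5sum -c md5sums.txt; then check_corrupted="passed"; else check_corrupted="failed"; fi

echo "intact files: ${check_intact}; after corruption: ${check_corrupted}"

# Go back to the previous folder and clean up.
cd - > /dev/null
rm -r "${demo_dir}"
```

Because the file names are stored as they were given on the command line, the check has to be run from a location where those paths are valid.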

With such a file, you can automatically compare the md5sum fingerprints you compute with the ones stored in the `md5sums.txt` file, using the argument `-c` (check). This is very convenient when you have a lot of files to check from a platform.

In [None]:
md5sum -c /srv/data/meg-m2-rnaseq/genome/md5sums.txt 

_Remark: To get information on a Unix command, just enter the name of the command followed by `--help` as below. If it is installed on the server/computer, you can also enter the command `man` followed by the name of the command._

In [None]:
md5sum --help
#man md5sum #

   - on the __experimental data__ :
   
*Be patient, it can take a minute.*

In [None]:
md5sum -c /srv/data/meg-m2-rnaseq/experimental_data/md5sums.txt

# Each file should be reported as "OK". For reference, the expected "barcodes" are:

# 2fb96155f5c708709a7539c7ff19e9ff  /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq
# 0d8d81a7464f6b662b89a9cea5bb8d1c  /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq
# 18a714651a337245bc728f3de2d14c87  /srv/data/meg-m2-rnaseq/experimental_data/SRR352261.fastqsanger.gz
# 72249ca523761575a85c61345529595b  /srv/data/meg-m2-rnaseq/experimental_data/SRR352264.fastqsanger.gz
# 857247cf34e788aef24aeaf9c4081a10  /srv/data/meg-m2-rnaseq/experimental_data/SRR352266.fastqsanger.gz
# d7f3e511652f9f6f08092cb6dbde37b4  /srv/data/meg-m2-rnaseq/experimental_data/SRR352267.fastqsanger.gz
# 55350bf610cafb705956068851038447  /srv/data/meg-m2-rnaseq/experimental_data/SRR352270.fastqsanger.gz
# fa987e543da5da808dd73e36e341c621  /srv/data/meg-m2-rnaseq/experimental_data/SRR352273.fastqsanger.gz
# 4a3449674775c9baa76296244dfe9e3d  /srv/data/meg-m2-rnaseq/experimental_data/SRR352274.fastqsanger.gz
# 39dc93ec7820c315d1a9742444b7f83b  /srv/data/meg-m2-rnaseq/experimental_data/SRR352276.fastqsanger.gz

### **3- Creating a folder for analysis results:**

Now we'll create a new directory to store the results of our analysis, using the `mkdir` command (for "make directory"), and within it a sub-folder for the quality check outputs:

In [None]:
mkdir Results
mkdir Results/Fastqc
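
If you run the cell above a second time, `mkdir` complains that the folders already exist. As a hedged alternative, the `-p` option creates the parent folder and the sub-folder in one command and stays silent when they are already there. The sketch below tries it in a scratch location rather than in your real working directory.

```shell
# Scratch location so the real Results folder is left alone.
demo_dir=$(mktemp -d)

# -p creates the parent (Results) and the sub-folder (Fastqc) in one go...
mkdir -p "${demo_dir}/Results/Fastqc"

# ...and does not complain if the folders already exist.
mkdir -p "${demo_dir}/Results/Fastqc"
rerun_status=$?

# Show what was created.
ls -R "${demo_dir}"
if [ -d "${demo_dir}/Results/Fastqc" ]; then created="yes"; else created="no"; fi
echo "sub-folder created: ${created}; exit status of the second run: ${rerun_status}"

rm -r "${demo_dir}"
```

In the notebook itself, the equivalent single command would be `mkdir -p Results/Fastqc`.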

You can check the tree structure of your folder with the Unix command `tree`.

In [None]:
tree

_Of note, the `binder` folder was automatically created with your environment. For those interested, it contains all the configuration information to recreate a similar JupyterLab environment outside of adenine._ 

**=> Well done, you are now ready to check and analyse the data!** 

-------

## **II - Quality controls on *C. parapsilosis* `.fastq` and `fastq.gz` files**

### **1- Examining the data**

- `.fastq` files are readable by the human eye, and we can display the first and last lines of each file, using the Unix `head` and `tail` commands:  

In [None]:
head /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq

In [None]:
tail /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq

In [None]:
head /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq

In [None]:
tail /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq

>Another great Unix command is `less`, when installed. If you want to try it on adenine, you have to do it in a terminal (it does not work in this notebook). It initially displays the first lines of a file. By pressing the spacebar, you will see the next lines. The options `-S` and `-N` respectively display the lines without wrapping and add the line number at the beginning of each line. Press `q` to quit.

> _For geeks only:_
>
> Similarly, you can count the number of rows in a file:

In [None]:
wc -l /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq

> and get the number of reads by dividing by 4:

In [None]:
nb_row=$(wc -l /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq | cut -d" " -f1) 
echo $((${nb_row}/4))

> or directly get the number of reads noticing all reads in this file start with an `@noO2`:

In [None]:
grep "^@noO2" /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq | wc -l
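
A word of caution on the `grep` approach: it is only safe with a prefix specific to the read names, because a quality line may itself start with `@`. The sketch below illustrates this on a made-up two-read file (not the course data), and also shows one possible way to get the length of a read with `awk`.

```shell
demo_fastq=$(mktemp)
# Two made-up reads; the quality string of the second one starts with "@".
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTGACCAA' '+' '@IIIIIII' > "${demo_fastq}"

# Robust count: total number of lines divided by 4.
nb_row=$(wc -l < "${demo_fastq}")
nb_reads=$((nb_row / 4))

# Naive count: every line starting with "@", which also catches the quality line.
nb_at=$(grep -c "^@" "${demo_fastq}")

# Read length: number of characters on the sequence line of the first record (line 2).
read_len=$(awk 'NR == 2 { print length($0) }' "${demo_fastq}")

echo "reads: ${nb_reads}; lines starting with @: ${nb_at}; length of first read: ${read_len}"
rm "${demo_fastq}"
```

Reading the file with `wc -l < file` prints the number alone, so no `cut` is needed to remove the file name.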

- On the `gz` files, you need to combine the `zcat` command first that reads compressed files, and the `head` or `tail` commands using a pipe `|`.

In [None]:
zcat /srv/data/meg-m2-rnaseq/experimental_data/SRR352261.fastqsanger.gz | head

> and for geeks, the command `zgrep` will do the pattern search in a gz file: 

In [None]:
zgrep "^@SRR" /srv/data/meg-m2-rnaseq/experimental_data/SRR352261.fastqsanger.gz | wc -l
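
The "lines divided by 4" count also works on a compressed file without writing an uncompressed copy to disk. This is a small sketch on a made-up file; it uses `gzip -dc`, which does the same job as `zcat`.

```shell
demo_dir=$(mktemp -d)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'TTGA' '+' 'IIII' > "${demo_dir}/toy.fastq"

# Compress a copy (-c writes to standard output, so the original file is kept).
gzip -c "${demo_dir}/toy.fastq" > "${demo_dir}/toy.fastq.gz"

# Decompress to a stream, count the lines, divide by 4.
nb_row=$(gzip -dc "${demo_dir}/toy.fastq.gz" | wc -l)
nb_reads=$((nb_row / 4))
echo "reads in the compressed file: ${nb_reads}"

# The decompressed stream is identical to the original file.
if gzip -dc "${demo_dir}/toy.fastq.gz" | cmp -s - "${demo_dir}/toy.fastq"; then
    same_content="yes"
else
    same_content="no"
fi
echo "same content after decompression: ${same_content}"

rm -r "${demo_dir}"
```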

<div class="alert alert-block alert-success"><b>=> Question: What can you say on the data?</b><br>

*(you can click here to add your answers directly in this markdown cell)*<br>

For each dataset:

- How many reads do you have in each file?
- What is the size of the reads?</div>

### **2- fastqc**
Now we run the quality control with **FASTQC** (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). First, let's check which version of the tool is installed.

In [None]:
fastqc --version

To run it on a sample, use the following command line: after the command `fastqc`, we indicate the name of the file to examine (with its path), then where to write the results after the argument `--outdir`. Here the dot `.` stands for "current working directory". 

In [None]:
fastqc /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq --outdir ./Results/Fastqc

The outputs are in a `.zip` archive that you could unzip with the `unzip` Unix command. But there is no need to do so, as a summary in `.html` format is also provided. To open this `html` file, in the left-hand panel of JupyterLab double-click the "Results" folder, then the "Fastqc" folder inside it, and finally the html file: it should open in a new tab beside this notebook.

In [None]:
fastqc /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq --outdir ./Results/Fastqc

> In some web browsers, letters and special characters might not be displayed correctly. If you encounter this problem with Firefox, open the menu in the top right-hand corner. Click on "customize" and select the text encoding icon. Slide it to the menu on the right. It now appears in your menu bar. Click on it and select "Unicode" instead of "occidental".

In [None]:
#For more help on fastqc used in command line, you can always type:
fastqc --help

---

## **III - Mapping reads on the *C. parapsilosis* genome using the BOWTIE algorithm (version 1.3.0)**


Let's check which version of **BOWTIE** (http://bowtie-bio.sourceforge.net/manual.shtml) is used.

In [None]:
bowtie --version


### **1- Generating the indexes of the *C.parapsilosis* genome**
The indexes are small files that tell a program where to look for data in a large data file. They are required by mapping algorithms, as they allow faster processing of millions of reads. With BOWTIE they are generated with the `bowtie-build` command.

In [None]:
bowtie-build -q /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta C_parapsilosis 

The 6 created index files have the `.ebwt` suffix :

In [None]:
ls -lh *.ebwt
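
If you want to make sure that no index file is missing before mapping, a small check can be scripted. The sketch below defines a hypothetical helper, `check_index` (not part of bowtie), that looks for the six files bowtie version 1 writes for a given prefix. It is tried here on stand-in files in a scratch folder, where one file is left out on purpose.

```shell
# Hypothetical helper: report which of the six bowtie (version 1) index files
# are missing or empty for a given prefix. The count is left in ${missing}.
check_index() {
    prefix=$1
    missing=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -s "${prefix}.${suffix}.ebwt" ]; then
            echo "missing or empty: ${prefix}.${suffix}.ebwt"
            missing=$((missing + 1))
        fi
    done
    echo "${missing} index file(s) missing for prefix ${prefix}"
}

# Stand-in files with made-up content; rev.2 is deliberately not created.
demo_dir=$(mktemp -d)
for suffix in 1 2 3 4 rev.1; do
    echo "placeholder" > "${demo_dir}/toy.${suffix}.ebwt"
done

check_index "${demo_dir}/toy"
n_missing=${missing}

rm -r "${demo_dir}"
```

In the notebook's working directory, the call would be `check_index C_parapsilosis`.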

### **2- Mapping the reads**
We use BOWTIE, a mapper that is very simple and efficient. It is not recent at all, and cannot align reads across introns (spliced reads), but here it works fine.

To start with, we will run BOWTIE on the two `.fastq` files. In section V of this notebook, we will run it on the other `fastq.gz` samples.

In [None]:
# the -S option tells bowtie to generate a .sam file  
# the -x option indicates the prefix name of the various index files 
# then you specify the name of the fastq file
# the last argument is the name of the output file, here located directly in the Results folder ./Results/

bowtie -S -x C_parapsilosis /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq ./Results/Normoxia_1_bowtie_mapping.sam

In [None]:
bowtie -S -x C_parapsilosis /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq ./Results/Hypoxia_1_bowtie_mapping.sam

<div class="alert alert-block alert-success"><b>=> Question: What can you say on the data?</b><br>

*(you can click here to add your answers directly in this markdown cell)*<br>

For each dataset, how many reads were:
- processed?  
- mapped?  
- written in the output file?</div>

---

## **IV - Managing the output files**

### **1- Converting, sorting and indexing the output files**
The downstream analysis is not performed on `.sam` files, but on binary versions of these : `bam` files.  
So we are going to:  
- convert the `sam` into `bam` files, 
- then sort them in genomic order,  
- finally index them, to produce the companion `bai` files

The commands used for this part belong to a large package of utilities that are very useful to manage those types of files: **SAMTOOLS** (http://www.htslib.org/).  

Let's first check which version of SAMTOOLS we are using.

In [None]:
samtools --version

<br>- We will start first with the ***Normoxia dataset:***

#### **1.a-** Converting .sam into .bam with **samtools view**

In [None]:
# The 'view' command displays or converts bam/sam files, 
# -b specifies that the output is in .bam format
# it is followed by the name of the .sam file
# -o provides the name of the output .bam file.

samtools view -b ./Results/Normoxia_1_bowtie_mapping.sam -o ./Results/Normoxia_1_bowtie_mapping.bam

#### **1.b-** Sorting .bam with **samtools sort**

Again, `-o` is to provide the name of the output file.

In [None]:
samtools sort ./Results/Normoxia_1_bowtie_mapping.bam -o ./Results/Normoxia_1_bowtie_mapping.sorted.bam

#### **1.c-** Generating an index with **samtools index**.  
There is no need to provide a name for the output file, as it should always be the same as the corresponding *bam* file, with the `.bai` suffix added.

In [None]:
samtools index ./Results/Normoxia_1_bowtie_mapping.sorted.bam

<br>***- For the Hypoxia data set***, we can proceed to the 3 steps in the same cell: the commands will be executed one after another:

In [None]:
samtools view -b ./Results/Hypoxia_1_bowtie_mapping.sam -o ./Results/Hypoxia_1_bowtie_mapping.bam
samtools sort ./Results/Hypoxia_1_bowtie_mapping.bam -o ./Results/Hypoxia_1_bowtie_mapping.sorted.bam
samtools index ./Results/Hypoxia_1_bowtie_mapping.sorted.bam
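
To see the three steps at work on something small enough to read by eye, here is a sketch on a hand-written toy `.sam` file. The reference name `toy_chr` and the three reads are made up. The cell also counts the records with `samtools view -c`, once for all records and once without the unmapped ones (`-F 4`). It is skipped if `samtools` is not installed.

```shell
if command -v samtools > /dev/null 2>&1; then
    demo_dir=$(mktemp -d)

    # Hand-written toy SAM (tab-separated): r1 and r3 are mapped, r2 is unmapped (flag 4).
    printf '@HD\tVN:1.6\tSO:unsorted\n'                            >  "${demo_dir}/toy.sam"
    printf '@SQ\tSN:toy_chr\tLN:100\n'                             >> "${demo_dir}/toy.sam"
    printf 'r1\t0\ttoy_chr\t10\t255\t5M\t*\t0\t0\tACGTA\tIIIII\n'  >> "${demo_dir}/toy.sam"
    printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGG\tIIIII\n'            >> "${demo_dir}/toy.sam"
    printf 'r3\t16\ttoy_chr\t3\t255\t5M\t*\t0\t0\tTTTTT\tIIIII\n'  >> "${demo_dir}/toy.sam"

    # Same three steps as above: convert, sort, index.
    samtools view -b "${demo_dir}/toy.sam" -o "${demo_dir}/toy.bam"
    samtools sort "${demo_dir}/toy.bam" -o "${demo_dir}/toy.sorted.bam"
    samtools index "${demo_dir}/toy.sorted.bam"

    # -c counts records instead of printing them; -F 4 excludes unmapped reads.
    n_total=$(samtools view -c "${demo_dir}/toy.sorted.bam")
    n_mapped=$(samtools view -c -F 4 "${demo_dir}/toy.sorted.bam")
    echo "total records: ${n_total}; mapped: ${n_mapped}"

    rm -r "${demo_dir}"
else
    echo "samtools not found: skipping the toy example"
fi
```

The same two counting commands can be pointed at `./Results/Normoxia_1_bowtie_mapping.sorted.bam` or `./Results/Hypoxia_1_bowtie_mapping.sorted.bam` to cross-check the numbers reported by bowtie.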

### **2- Removing the intermediate files**  
The only files needed for the rest of the analysis are the `.sorted.bam` files and their corresponding `.bai` index files. So we are going to save some space by deleting the intermediate files that are no longer needed. (You can easily produce them again anyway, by running the corresponding Code cells above.)  
You can delete a file by right-clicking on it and choosing 'x Delete', or by running the `rm` command (remove) in a cell:

In [None]:
rm ./Results/Normoxia_1_bowtie_mapping.bam
rm ./Results/Hypoxia_1_bowtie_mapping.bam

In [None]:
# removing all the .sam files at the same time

rm ./Results/*.sam
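
Since `rm` deletes files for good, one cautious option is to remove the intermediate files only when the final ones are really there. The sketch below defines a hypothetical helper, `safe_cleanup`, and tries it on stand-in files with made-up names in a scratch folder: one sample is complete, the other has no sorted file yet.

```shell
# Hypothetical helper: remove the intermediate .sam and .bam of a sample only
# when its .sorted.bam and .sorted.bam.bai files are present and non-empty.
safe_cleanup() {
    sample_prefix=$1
    if [ -s "${sample_prefix}.sorted.bam" ] && [ -s "${sample_prefix}.sorted.bam.bai" ]; then
        rm -f "${sample_prefix}.sam" "${sample_prefix}.bam"
        echo "cleaned: ${sample_prefix}"
    else
        echo "kept intermediates (final files missing): ${sample_prefix}"
    fi
}

# Stand-in files with placeholder content.
demo_dir=$(mktemp -d)
for f in done_sample.sam done_sample.sorted.bam done_sample.sorted.bam.bai \
         pending_sample.sam pending_sample.bam; do
    echo "placeholder" > "${demo_dir}/${f}"
done

safe_cleanup "${demo_dir}/done_sample"
safe_cleanup "${demo_dir}/pending_sample"

# Only the .sam of the unfinished sample should be left.
remaining=$(ls "${demo_dir}" | grep -c '\.sam$')
echo "remaining .sam files: ${remaining}"

rm -r "${demo_dir}"
```

With the naming used in this notebook, a call would look like `safe_cleanup ./Results/Normoxia_1_bowtie_mapping`.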

___

## **V - Analysis of the other 8 samples**

The complete study involves 6 Normoxia samples and 4 Hypoxia samples. For the remaining 8 samples, we will perform a batch analysis (all the steps together, for multiple files at once) :
- quality check with fastqc
- mapping with bowtie
- sam-to-bam conversion with samtools
- bam sorting and indexing with samtools
- removal of intermediate files



FASTQC can deal with several files without a loop.

In [None]:
fastqc /srv/data/meg-m2-rnaseq/experimental_data/*.fastqsanger.gz --outdir ./Results/Fastqc

For the next steps, we use a `for` **loop**, that will run the program once for each element in the provided list, and produce the properly-named output files.

> Here are some explanations on the loop:<br>
> - `fn` is used as a variable to define the "filenames" in the folder containing the data; for each file, we iterate the loop
> - `${}` is used to say we are using a predefined variable
> - an `id` variable is created with the prefix name of the fastqsanger.gz files
> - `basename` is used as a shortcut to extract the name of the file from its absolute path: only the name of the file is kept
> - `cut` is used to split the file name with `.` as separator, defined with the `-d` argument; `-f1` then keeps only the first element, i.e. the part before the first `.`
> - `echo` is used to print a message
> - we then define the variable `myoutsortedbam` with the name of the output and its relative path
> - then we use the bowtie command, but we redirect its output to samtools using the pipe `|`
> - for samtools, `-` is given instead of the name of the input file to specify that the input is the output of the command on the left of the pipe; the same applies after the next pipe
> - only the `.sorted.bam` and `.sorted.bam.bai` files are saved, without any intermediate files
>
<div class="alert alert-block alert-danger"><b>Danger:<br></b>The loop will probably take ~30 minutes to 1 hour. It generates <b>temporary "bam.tmp" files</b> in the Results folder.<br> <b>Do not delete them during the process!</b> Once the sample is processed, the server will automatically delete these temporary files</div>

In [None]:
date

# the shell expands the * pattern into the list of files itself, so ls is not needed
for fn in /srv/data/meg-m2-rnaseq/experimental_data/*.fastqsanger.gz; do
       
    id=$(basename "${fn}" | cut -d. -f1)
    echo "========Processing sampleID: ${id}..."
    
    myoutsortedbam="./Results/${id}_bowtie_mapping.sorted.bam"
    bowtie -S -x C_parapsilosis "${fn}" | samtools view -b - | samtools sort - -o "${myoutsortedbam}"
    samtools index "${myoutsortedbam}"

    echo "...done"
    
done
date
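
To see what the `id` and output-name lines of the loop compute, they can be run on their own on a single path. The path below is made up, in the same style as the course files. The last lines show, as an optional alternative, how bash can do the same trimming without `basename` or `cut`.

```shell
# A made-up path in the same style as the course files.
fn="/some/folder/SRR000000.fastqsanger.gz"

# basename drops the folders and keeps only the file name.
file_name=$(basename "${fn}")

# cut splits on "." and keeps the first field: the sample identifier.
id=$(basename "${fn}" | cut -d. -f1)

# The output name is built from the identifier.
myoutsortedbam="./Results/${id}_bowtie_mapping.sorted.bam"

echo "file name : ${file_name}"
echo "sample id : ${id}"
echo "output    : ${myoutsortedbam}"

# Bash can also do the trimming itself, without basename or cut:
no_dir=${fn##*/}       # remove everything up to the last "/"
id_bash=${no_dir%%.*}  # remove everything from the first "." onwards
echo "sample id (parameter expansion): ${id_bash}"
```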

<div class="alert alert-block alert-success"><b>Success:</b> Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to adenine! </div>


___
___

Now we go on with a lecture about what is indicated in the output sorted *bam* files. 

**=> Lecture 5 : Mapping output** 