# Mapping reads to genome

<big><b>File completed </b></big> (05/21/2021) <br>
An issue happened with <code>STAR</code> while mapping the first sample: <b>I added some cells (thus not to be included in the default pipeline)</b> to re-run this sample.  
TODO for next use: change how files generated by STAR mapping are counted. Counting only <code>*.bam</code> files would be better, but it does not help when one of them is empty.

<div class="alert alert-block alert-danger">
    Please note that genome indexing is set up here for RNA sequencing with 50-base reads. <br>
    If your reads have a different length, change the value of <code>rawreadlength</code> set below to match your dataset.
</div>

---

## <b>Preparing session for IFB core cluster</b>

<em>loaded JupyterLab</em> : Version 2.2.9

In [1]:
echo "=== Cell launched on $(date) ==="

echo "=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}

echo "=== Working's root folder is ==="
gohome="/shared/projects/gonseq/Building/" # to adjust with your project's folder
echo "${gohome}"
echo ""

echo "=== current folder tree ==="
tree -d -L 2 "${gohome}"
echo "=== current working directory ==="
echo "${PWD}"

=== Cell launched on Thu May 20 13:12:55 CEST 2021 ===
=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
       JobID  AllocCPUS        NodeList 
------------ ---------- --------------- 
16677880             10     cpu-node-17 
16677880.ba+         10     cpu-node-17 
16677880.0           10     cpu-node-17 
=== Working's root folder is ===
/shared/projects/gonseq/Building/

=== current folder tree ===
/shared/projects/gonseq/Building/
├── Data
│   ├── fastq
│   ├── info
│   └── sra
├── Pipeline
└── Results
    ├── fastp
    ├── fastqc
    ├── logfiles
    └── multiqc

10 directories
=== current working directory ===
/shared/ifbstor1/projects/gonseq/Testing/Pipeline


In [2]:
module load star samtools

echo "===== download network files ====="
wget --version | head -n 1
echo "===== alignement tool ====="
STAR --version
echo "===== index construction + quality ====="
samtools --version

===== download network files =====
GNU Wget 1.14 built on linux-gnu.
===== alignement tool =====
2.7.5a
===== index construction + quality =====
samtools 1.10
Using htslib 1.10.2
Copyright (C) 2019 Genome Research Ltd.


---
## <b>I- Reference genome and annotation files</b>

### **1- Searching for them (their url!) on the web**

There are several websites to download reference genome and annotation files:
- <a href="https://www.gencodegenes.org/human/"><i>Gencode</i></a> from the European Bioinformatics Institute (EBI)  
- <a href="http://www.ensembl.org/info/data/ftp/index.html"><i>Accessing Ensembl Data</i></a> from Ensembl project database  
- <a href="https://www.ncbi.nlm.nih.gov/genome/guide/human/"><i>Human Genome Resources</i></a> at NCBI  
- ... and maybe one day a common NCBI and Ensembl/Gencode release (MANE collaboration, <a href="https://ncbiinsights.ncbi.nlm.nih.gov/2020/11/02/ncbi-refseq-ensembl-gencode-mane-v0-92/#more-4781">a story beginning in 2020</a>)

We will use a **Primary assembly** (PRI) release. It includes chromosomes and scaffolds (candidate regions to be integrated or discarded in the next genome build).  
In contrast, the main annotation file is limited to chromosomes, while the extensive annotation file also includes all known haplotypes (for highly variable regions).

<div class="alert alert-block alert-warning">
    In this notebook, we use the <b>Gencode release</b>, which provides <b><code>.gz</code> compressed</b> files, and we choose the <b>GTF format</b> for the annotation file. <br>
    Feel free to choose any of the sources cited above, as long as the downloaded files follow the same file formats (otherwise, adapt the code cells in the next sections!).
</div>

<div class="alert alert-block alert-info">
    Nonetheless, please note that the annotation file format is still an open issue, as some relevant official sources contradict each other: 
    <ul>
        <li>
            According to the US Galaxy project: <a href="https://galaxyproject.org/learn/datatypes/#gtf">GTF</a> is GFF version 2 while <a href="https://galaxyproject.org/learn/datatypes/#gff">GFF</a> is version 1 and <a href="https://galaxyproject.org/learn/datatypes/#gff3">GFF3</a> is the latest and 3rd version... 
        </li>
        <li>
        ... but the Broad Institute's IGV, like the UCSC genome browser, makes a distinction between <a href="http://software.broadinstitute.org/software/igv/GFF">GFF2 and GTF formats</a>, <a href="https://genome.ucsc.edu/FAQ/FAQformat.html#format3"> the latter being only compatible with the former</a>.
        </li>
        <li>
            While both <a href="https://biocorecrg.github.io/PhD_course/gtf_format.html">GTF</a> and <a href="https://biostar.usegalaxy.org/p/28147/">GFF</a> formats have 9 columns, field in the ninth column is longer for <code>.gtf</code> files than for <code>.gff</code> files (<a href="https://genome.ucsc.edu/FAQ/FAQformat.html#format3">UCSC Genome browser documentation</a> and <a href="https://www.ensembl.org/info/website/upload/gff.html">ensembl documentation</a>).
        </li>
        <li>
            Even if both file formats have header lines, some tools do not support them (<a href="https://biostar.usegalaxy.org/p/28147/">second bullet point in last answer</a>) and the US Galaxy portal asks users to remove those lines before use (see the US Galaxy links above).
        </li>
        <li>
            <code>FeatureCounts</code> (a downstream tool we will use) only <a href="http://bioinf.wehi.edu.au/featureCounts/">works with GTF files</a>. Tool expects to have <i>exon</i> in <i>features</i> column (both GFF and GTF!) and <i>gene_id</i> as a gene identifier (missing in GFF), see <a href="https://biostar.usegalaxy.org/p/28094/index.html#28099">item 4 in latest answer</a>.
        </li>
    </ul>
</div>

To use the latest genome release for your analyses, please go to Gencode's <a href="https://www.gencodegenes.org/human/">download page</a> (or to the download page of the reference you chose) and adapt the URLs for:
- Primary annotation (notebook developed with GTF file)
> in *GTF/GFF3 files* Gencode's chart: Comprehensive gene annotation > primary annotation > *gtf* file 

In [3]:
gtfgzurl="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.primary_assembly.annotation.gtf.gz"

- Primary genome sequence file
> in *Fasta files* Gencode's chart: Genome sequence, primary assembly > *Fasta* file 

In [4]:
fagzurl="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.primary_assembly.genome.fa.gz"

*Note*: You can get url link with a right click on download links, then *copy link to clipboard*.

<div class="alert alert-block alert-danger">
    Both files have to be retrieved from the same source, as sequence region names need to be identical in both to be usable and to avoid downstream analysis issues. <br>
    <i>In Gencode, this compatibility between files is specified in the <b>fasta's description field</b></i>.
</div>

### **2- Retrieve files with ``wget``**

We will download those files in a distinct folder:

In [5]:
reffolder="${gohome}Reference/"
mkdir -p ${reffolder}

<ul class="alert alert-block alert-info">
    <li>
        Sometimes (often?!), other users' issues help us understand a command better than its manual. For instance, a Stack Exchange <a href="https://unix.stackexchange.com/questions/23501/download-using-wget-to-a-different-directory-than-current-directory">thread</a> about the <code>wget</code> command and the way to write into a chosen output folder. 
    </li>
</ul>

In [6]:
logfile="${reffolder}wget_reference_files_downloads.log"
echo "Some output is redirected to ${logfile} for record"

echo "===== Annotation file retrieval ..." >> ${logfile}
wget -P "${reffolder}" -N "${gtfgzurl}"
echo "... done" >> ${logfile}

# to get record of used command line thus url and file size
#history | tail -n 4 | head -n 1 >> ${logfile}  # not informative
echo "Used command is: wget -P ${reffolder} -N ${gtfgzurl}"
ls -lh "${reffolder}"*.gtf.gz >> ${logfile}

Some output is redirected to /shared/projects/gonseq/Building/Reference/wget_reference_files_downloads.log for record
--2021-05-20 13:12:58--  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/gencode.v37.primary_assembly.annotation.gtf.gz
           => ‘/shared/projects/gonseq/Building/Reference/.listing’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/gencode/Gencode_human/release_37 ... done.
==> PASV ... done.    ==> LIST ... done.

    [ <=>                                   ] 3,928       --.-K/s   in 0.005s  

2021-05-20 13:12:58 (742 KB/s) - ‘/shared/projects/gonseq/Building/Reference/.listing’ saved [3928]

Removed ‘/shared/projects/gonseq/Building/Reference/.listing’.
--2021-05-20 13:12:58--  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/rel

We only use these two options:
> ``-P PREFIX`` or ``--directory-prefix=PREFIX`` to specify output folder  
> ``-N`` or ``--timestamping``: don't re-retrieve files unless newer than local  

Some other available options exist and among them this one:
> ``-a FILE`` or ``--append-output=FILE`` to append messages to FILE  

In [7]:
echo "Some output is redirected to ${logfile} for record"

echo "===== Genome sequence retrieval ..." >> ${logfile}
wget -P "${reffolder}" -N "${fagzurl}"
echo "... done" >> ${logfile}

# to get record of used command line thus url and file size
#history | tail -n 4 | head -n 1 >> ${logfile}  # not informative
echo "Used command is: wget -P ${reffolder} -N ${fagzurl}"
ls -lh "${reffolder}"*.fa.gz >> ${logfile}

Some output is redirected to /shared/projects/gonseq/Building/Reference/wget_reference_files_downloads.log for record
--2021-05-20 13:13:04--  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh38.primary_assembly.genome.fa.gz
           => ‘/shared/projects/gonseq/Building/Reference/.listing’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.197.74
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.197.74|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/gencode/Gencode_human/release_37 ... done.
==> PASV ... done.    ==> LIST ... done.

    [ <=>                                   ] 3,928       --.-K/s   in 0.04s   

2021-05-20 13:13:04 (90.8 KB/s) - ‘/shared/projects/gonseq/Building/Reference/.listing’ saved [3928]

Removed ‘/shared/projects/gonseq/Building/Reference/.listing’.
--2021-05-20 13:13:04--  ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/G
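
A download can be interrupted without any obvious sign, so before extracting the archives it may be worth testing their integrity. Below is a minimal sketch using `gzip -t`; the `check_gz` helper and the demo files are hypothetical and are not part of the original pipeline.

```bash
# Hypothetical helper: test each gzip archive given as argument
# and print one status line per file
check_gz() {
    for gzfile in "$@"; do
        if gzip -t "${gzfile}" 2>/dev/null; then
            echo "OK      ${gzfile}"
        else
            echo "BROKEN  ${gzfile}"
        fi
    done
}

# Demo on throw-away files (not real reference data)
demodir=$(mktemp -d)
echo "chr1 demo" | gzip > "${demodir}/good.gtf.gz"
echo "this is not a gzip stream" > "${demodir}/bad.fa.gz"
check_gz "${demodir}/good.gtf.gz" "${demodir}/bad.fa.gz"

# On the real downloads, this would be:
# check_gz "${reffolder}"*.gz
```

A `BROKEN` line suggests downloading the file again before going further.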

### **3- Extract archive files**

``STAR``, like other downstream tools, can't deal with compressed reference files.  
Extracted files are quite big, whereas compressed ones are of a rather affordable size. So we keep the compressed files as they are and give the extracted files simpler names (easier to handle, in particular when changing release and/or database source).

In [8]:
mkdir -p "${reffolder}extracted/"

- Primary annotation (notebook developed with GTF file)

In [9]:
echo "===== Extracting annotation file ..." |& tee -a  ${logfile}
gtfgzfile=$(ls "${reffolder}"*.gtf.gz)
gtffile="${reffolder}extracted/genome_annotation.gtf"

zcat ${gtfgzfile} > ${gtffile}
echo "... done" |& tee -a  ${logfile}

ls -lh "${reffolder}extracted/"*.gtf >> ${logfile}

===== Extracting annotation file ...
... done


- Primary genome sequence file

In [10]:
echo "===== Extracting sequence file ..." |& tee -a  ${logfile}
fastagzfile=$(ls "${reffolder}"*.fa.gz)
fastafile="${reffolder}extracted/genome_sequence.fa"

zcat ${fastagzfile} > ${fastafile}
echo "... done" |& tee -a  ${logfile}

ls -lh "${reffolder}extracted/"*.fa >> ${logfile}

===== Extracting sequence file ...
... done


### **4- Take a look at downloaded files**
Let's take a look at those files to check that they correspond to what we expect (or just to discover the file formats).

- Primary annotation (notebook developed with GTF file)

In [11]:
head ${gtffile}

##description: evidence-based annotation of the human genome (GRCh38), version 37 (Ensembl 103)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2020-12-07
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_n

- Primary genome sequence file

In [12]:
head ${fastafile}

>chr1 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
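
Since sequence region names must be the same in the genome and annotation files, a quick consistency check can complement the visual inspection above. The sketch below lists names used in the GTF that have no matching FASTA header; `missing_seqnames` and the demo files are hypothetical illustrations, not part of the original pipeline.

```bash
# Hypothetical helper: print sequence names used in the GTF
# but absent from the FASTA headers
missing_seqnames() {
    local fasta="$1" gtf="$2" tmpdir
    tmpdir=$(mktemp -d)
    # FASTA: first word of each header line, without the leading ">"
    grep '^>' "${fasta}" | sed 's/^>//' | cut -d' ' -f1 | sort -u > "${tmpdir}/fasta.names"
    # GTF: first column of every non-comment line
    grep -v '^#' "${gtf}" | cut -f1 | sort -u > "${tmpdir}/gtf.names"
    # comm -13 keeps lines found only in the second file
    comm -13 "${tmpdir}/fasta.names" "${tmpdir}/gtf.names"
    rm -r "${tmpdir}"
}

# Demo with made-up files mixing two naming styles ("chr1" versus "1")
demodir=$(mktemp -d)
printf '>chr1 1\nACGT\n>chr2 2\nACGT\n' > "${demodir}/demo.fa"
printf '##format: gtf\nchr1\tHAVANA\tgene\t1\t4\n1\tensembl\tgene\t1\t4\n' > "${demodir}/demo.gtf"
missing_seqnames "${demodir}/demo.fa" "${demodir}/demo.gtf"

# On the extracted reference files, this would be:
# missing_seqnames "${fastafile}" "${gtffile}"
```

An empty output means that every annotated sequence name exists in the genome file.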


---
## <b>II- Building genome reference files</b>

Index files tell a program where to look for data in a large data file. They are required by mapping algorithms, as they allow faster processing of millions of reads.

### **1- Tool version and command line presentation**

In [13]:
STAR --version

2.7.5a


To create reference genome files, the default command is: <br>
<code>STAR --runMode genomeGenerate --genomeDir destination/folder \
      --genomeFastaFiles path/to/sequence.fa
</code>

<blockquote>
    <code>--runMode genomeGenerate</code>, to switch to the indexing step; otherwise STAR defaults to <code>alignReads</code> (mapping step) <br>
    <code>--genomeDir</code>, to specify the folder where reference genome indexes are written <br>
    <code>--genomeFastaFiles</code>, path to the reference genome fasta file (DOES NOT work with gz files)
</blockquote>

We are working on RNAseq data and need to have files that take splice junctions into account. Thus, we need to add following 2 parameters: <br>
<code>--sjdbGTFfile path/to/annotation.file --sjdbOverhang readlengthnum</code>

They stand for:
<blockquote>
    <code>--sjdbGTFfile</code>, to specify where to find the annotation file that contains exon positions, thus placing splice junctions along the genome sequence <br>
    <code>--sjdbOverhang</code>, the maximum number of bases expected on one side of a splice junction (<em>ideally, mate length - 1</em>)
</blockquote>

For STAR, these two additional options can be specified either when generating genome index files or when mapping samples. As we may be limited in computational resources, we add them here, which avoids repeating a memory-consuming operation later when iterating over all samples for mapping.

### **2- Preparing command line variables**

The dataset used to develop this pipeline consists of 50-base reads.  
As this may not be the case for yours, please check and/or change the following value:

In [14]:
rawreadlength=50
maxoneside=$((${rawreadlength}-1))
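
If the sequenced read length is not documented for your dataset, it can be estimated from a FASTQ file. The sketch below reports the longest read among the first records of a gzipped FASTQ; the `max_read_length` helper and the demo file are hypothetical. Trimmed files may contain shorter reads, so raw files are probably the better input for this check.

```bash
# Hypothetical helper: longest read among the first N records
# (default 10000) of a gzipped FASTQ file
max_read_length() {
    local fastqgz="$1" nrecords="${2:-10000}"
    # the sequence is the 2nd line of each 4-line FASTQ record
    gzip -dc "${fastqgz}" 2>/dev/null | head -n "$((nrecords * 4))" \
        | awk 'NR % 4 == 2 && length($0) > max { max = length($0) } END { print max + 0 }'
}

# Demo on a made-up file with one 50-base and one 36-base read
demodir=$(mktemp -d)
long=$(printf '%050d' 0 | tr 0 A)
short=$(printf '%036d' 0 | tr 0 C)
printf '@r1\n%s\n+\n%s\n@r2\n%s\n+\n%s\n' "${long}" "${long}" "${short}" "${short}" \
    | gzip > "${demodir}/demo_1.fastq.gz"
max_read_length "${demodir}/demo_1.fastq.gz"
```

The returned value is a candidate for `rawreadlength`; only the beginning of the file is read, so this stays fast on large files.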

We then create a folder for these genome index files, with an explicit name for later use:

In [15]:
indexfolder="${gohome}Reference/indexes_upto${maxoneside}bases/"
mkdir -p ${indexfolder}

Regarding files, let's use again ``fastafile`` and ``gtffile`` variables, defined when extracting compressed files in previous step.  

### **3- Running command line**

<div class="alert alert-block alert-danger">
    The following <b>command is prepared for use on a computing cluster</b> and was developed on the <i>Institut Français de Bioinformatique</i> (IFB)'s core cluster. We use a <i>Large</i> session defined as <b>10 CPU with 50 GB available for RAM</b>. 
</div>

If you have limited computing resources, please change the following parameters directly in the command cell below. 
<blockquote>
    <code>--limitGenomeGenerateRAM</code>, to set the maximum RAM available (in bytes, called <i>octets</i> in French) for genome generation (strictly positive integer, default value: 31000000000) <br>
    <code>--runThreadN</code>, to set the number of threads that <code>STAR</code> can use; it should not exceed the number of available cores
</blockquote>
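
Rather than hard-coding these two values, they can be derived from the session. This is only a sketch under assumptions: it supposes that SLURM exports `SLURM_CPUS_PER_TASK` in the session (falling back to `nproc` otherwise), and it keeps one core free, mirroring the 9 threads used on a 10 CPU session in the cell below. The names `starthreads`, `starram` and `gb_to_bytes` are hypothetical.

```bash
# CPUs available: SLURM_CPUS_PER_TASK if set (assumption), else nproc, else 1
ncpu="${SLURM_CPUS_PER_TASK:-$(nproc 2>/dev/null || echo 1)}"

# keep one core free, but never go below 1 thread
if [ "${ncpu}" -gt 1 ]; then
    starthreads=$((ncpu - 1))
else
    starthreads=1
fi

# Hypothetical helper: convert gigabytes (10^9 bytes) to bytes
gb_to_bytes() {
    echo "$(( $1 * 1000000000 ))"
}
starram=$(gb_to_bytes 48)   # 48 GB out of the 50 GB of a Large session

echo "--runThreadN ${starthreads} --limitGenomeGenerateRAM ${starram}"
```

The two variables could then replace the literal `9` and `48000000000` in the command line.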


In [16]:
logfile="${gohome}Reference/star_indexing_genome.log"
echo "Screen output is also redirected to ${logfile} for record"

echo "=== starting genome indexing ..." |& tee -a "${logfile}"
echo "operation starts at $(date)" >> ${logfile}

time STAR --runThreadN 9 --runMode genomeGenerate \
          --genomeDir "${indexfolder}" \
          --genomeFastaFiles "${fastafile}" \
          --sjdbGTFfile "${gtffile}" \
          --sjdbOverhang "${maxoneside}" \
          --limitGenomeGenerateRAM 48000000000 \
          |& tee -a "${logfile}"
echo "STAR indexing ends at $(date)" >> ${logfile}

# list files with their size
ls -lh "${indexfolder}" >> ${logfile}

echo "... done" |& tee -a "${logfile}"

Screen output is also redirected to /shared/projects/gonseq/Building/Reference/star_indexing_genome.log for record
=== starting genome indexing ...
May 20 13:13:51 ..... started STAR run
May 20 13:13:51 ... starting to generate Genome files
May 20 13:14:51 ..... processing annotations GTF
May 20 13:15:27 ... starting to sort Suffix Array. This may take a long time...
May 20 13:15:50 ... sorting Suffix Array chunks and saving them to disk...
May 20 14:04:43 ... loading chunks from disk, packing SA...
May 20 14:06:23 ... finished generating suffix array
May 20 14:06:23 ... generating Suffix Array index
May 20 14:09:43 ... completed Suffix Array index
May 20 14:09:43 ..... inserting junctions into the genome indices
May 20 14:12:19 ... writing Genome to disk ...
May 20 14:12:22 ... writing Suffix Array to disk ...
May 20 14:12:48 ... writing SAindex to disk
May 20 14:12:50 ..... finished successfully

real	58m59.366s
user	393m48.061s
sys	9m47.630s
... done


If there is any issue, among all the output files that STAR writes, start with ``Log.out``. It is a plain text file that records the command line as STAR understood it. It is quite verbose, which is very helpful!

In [17]:
head -n 25 "${indexfolder}Log.out"

STAR version=2.7.5a
STAR compilation time,server,dir=Tue Jun 16 12:17:16 EDT 2020 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
##### Command Line:
STAR --runThreadN 9 --runMode genomeGenerate --genomeDir /shared/projects/gonseq/Building/Reference/indexes_upto49bases/ --genomeFastaFiles /shared/projects/gonseq/Building/Reference/extracted/genome_sequence.fa --sjdbGTFfile /shared/projects/gonseq/Building/Reference/extracted/genome_annotation.gtf --sjdbOverhang 49 --limitGenomeGenerateRAM 48000000000
##### Initial USER parameters from Command Line:
###### All USER parameters from Command Line:
runThreadN                    9     ~RE-DEFINED
runMode                       genomeGenerate     ~RE-DEFINED
genomeDir                     /shared/projects/gonseq/Building/Reference/indexes_upto49bases/     ~RE-DEFINED
genomeFastaFiles              /shared/projects/gonseq/Building/Reference/extracted/genome_sequence.fa        ~RE-DEFINED
sjdbGTFfile                   /shared/projects/gonse
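
Before removing any source file, it may be reassuring to verify that the main index files were actually written. The sketch below checks a few files that a STAR index folder contains (`Genome`, `SA`, `SAindex`, `genomeParameters.txt`, `chrName.txt`); this list is deliberately partial, and `check_star_index` is a hypothetical helper.

```bash
# Hypothetical helper: report index files that are missing or empty,
# and return a non-zero status if any is found
check_star_index() {
    local indexdir="$1" status=0 f
    for f in Genome SA SAindex genomeParameters.txt chrName.txt; do
        if [ ! -s "${indexdir}/${f}" ]; then
            echo "missing or empty: ${indexdir}/${f}"
            status=1
        fi
    done
    return "${status}"
}

# Demo on a throw-away folder where the SA file is deliberately absent
demodir=$(mktemp -d)
for f in Genome SAindex genomeParameters.txt chrName.txt; do
    echo "x" > "${demodir}/${f}"
done
check_star_index "${demodir}" || echo "index looks incomplete"

# On the real index folder, this would be:
# check_star_index "${indexfolder}"
```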

### **4- Removing extracted genome files**

We can see how much disk space Reference files use:

In [23]:
du -h ${reffolder}

4.3G	/shared/projects/gonseq/Building/Reference/extracted
28G	/shared/projects/gonseq/Building/Reference/indexes_upto49bases
33G	/shared/projects/gonseq/Building/Reference/


The extracted genome file is no longer used, so let's remove it to save some space:

In [24]:
rm "${reffolder}extracted/genome_sequence.fa"  # line changed after run
du -h ${reffolder}
#ls -lh "${reffolder}extracted/"  # suggested additional line to use

28G	/shared/projects/gonseq/Building/Reference/indexes_upto49bases
29G	/shared/projects/gonseq/Building/Reference/


## <b>III- Mapping samples on reference genome</b>

### **1- Tool version and command line presentation**

A short stop to check the ``STAR`` version, as you may have skipped genome indexing:

In [25]:
STAR --version

2.7.5a


A rather simple version of the command line for mapping is: <br>
<code>STAR --genomeDir path/to/indexes/folder/ \
      --readFilesIn path/to/read1.fastq.gz path/to/read2.fastq.gz \
      --readFilesCommand zcat \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts
</code>

<blockquote>
    <code>--readFilesIn</code> for <code>Read</code> (for Single End data) or both <code>Read1 Read2</code> (for Paired End data) as full paths to files that contain input read(s) <br>
    <code>--readFilesCommand</code>, to indicate the tool that can handle the read file format. <code>STAR</code> allows direct use of compressed files but relies on tools available on the system <br>
    <br>
    <code>--outSAMtype word1 word2</code>, to set the output file format we want (default, SAM). <br>
    Options for <code>word1</code> are <code>BAM</code>, <code>SAM</code> and <code>None</code> (no SAM/BAM output). <br>
    Options for <code>word2</code> are <code>Unsorted</code> or <code>SortedByCoordinate</code>. Sorting allocates extra memory, which can be limited with <code>--limitBAMsortRAM</code>.<br>
    <br>
    <code>--quantMode</code> (default, <i>none</i>), to activate and ask for one or several quantification outputs.  <br>
    Available options are: <code>GeneCounts</code> and <code>TranscriptomeSAM</code>. The latter writes SAM/BAM alignments to the transcriptome into a separate file, while the former only generates a text file with read counts per gene.
</blockquote>

The ``_Aligned.toTranscriptome.out.bam`` files, needed for downstream transcript-level analysis, are as big as or bigger than the ``_Aligned.sortedByCoord.out.bam`` files, which are the only ones required for downstream quantification by ``FeatureCounts``. We will therefore focus on gene-level quantification.

If you want to use transcript-level quantification, we have previously successfully used the options below: <br>
<code>--quantMode TranscriptomeSAM GeneCounts</code>
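
The text file written by `GeneCounts` (named `<prefix>ReadsPerGene.out.tab`) has four tab-separated columns: the gene identifier, then read counts for unstranded data, for reads on the same strand as the gene, and for reads on the reverse strand. Its first four lines hold summary counters (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`). Once mapping has run, summing each column gives a hint about library strandedness. The sketch below uses a hypothetical `sum_gene_counts` helper on a small made-up table, not on real results.

```bash
# Hypothetical helper: sum the three count columns of a ReadsPerGene.out.tab
# file, skipping the four summary lines at the top
sum_gene_counts() {
    awk -F'\t' 'NR > 4 { u += $2; f += $3; r += $4 }
                END { printf "unstranded=%d forward=%d reverse=%d\n", u, f, r }' "$1"
}

# Demo on made-up counts
demodir=$(mktemp -d)
printf 'N_unmapped\t10\t10\t10\n'      >  "${demodir}/DEMO_ReadsPerGene.out.tab"
printf 'N_multimapping\t5\t5\t5\n'     >> "${demodir}/DEMO_ReadsPerGene.out.tab"
printf 'N_noFeature\t3\t40\t4\n'       >> "${demodir}/DEMO_ReadsPerGene.out.tab"
printf 'N_ambiguous\t2\t0\t1\n'        >> "${demodir}/DEMO_ReadsPerGene.out.tab"
printf 'geneA\t30\t1\t29\n'            >> "${demodir}/DEMO_ReadsPerGene.out.tab"
printf 'geneB\t12\t0\t12\n'            >> "${demodir}/DEMO_ReadsPerGene.out.tab"
sum_gene_counts "${demodir}/DEMO_ReadsPerGene.out.tab"
```

In this demo, the reverse column carries almost all counts, which is the pattern expected from a reverse-stranded library; similar sums in the last two columns would rather point to unstranded data.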

### **2- Preparing command line variables**

Let's check that we still have all ``.fastq.gz`` files where we left them. We count files that do not include *_removed* in their name:

In [26]:
ls "${gohome}Data/fastq/fastp/" | grep -v -e "_removed" | wc -l

32
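
Counting files does not tell whether each sample still has both mates. As a complement, the sketch below lists every `_1` file whose `_2` companion is missing or empty, assuming the `<sample>_1.fastp.fastq.gz` / `<sample>_2.fastp.fastq.gz` naming used in this notebook; `unpaired_reads` is a hypothetical helper and the demo files are empty placeholders.

```bash
# Hypothetical helper: list read 1 files that have no usable read 2 mate
unpaired_reads() {
    local fastqdir="$1" read1 read2
    for read1 in "${fastqdir}"/*_1.fastp.fastq.gz; do
        [ -e "${read1}" ] || continue      # nothing matched the pattern
        read2="${read1%_1.fastp.fastq.gz}_2.fastp.fastq.gz"
        [ -s "${read2}" ] || echo "no mate for ${read1}"
    done
}

# Demo: sample A is complete, sample B lacks its second mate
demodir=$(mktemp -d)
echo "x" > "${demodir}/A_1.fastp.fastq.gz"
echo "x" > "${demodir}/A_2.fastp.fastq.gz"
echo "x" > "${demodir}/B_1.fastp.fastq.gz"
unpaired_reads "${demodir}"

# On the real data, this would be:
# unpaired_reads "${gohome}Data/fastq/fastp"
```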


Here we create the destination folder for aligned ``.bam`` and other output files:

In [27]:
mappedfolder="${gohome}Results/star/"
mkdir -p ${mappedfolder}

... and remember matched ``Results/`` destination folder for log files...

In [28]:
logfolder="${gohome}Results/logfiles/"

### **3- Running command line**

<div class="alert alert-block alert-danger">
    The following <b>command is prepared for use on a computing cluster</b> and was developed on the <i>Institut Français de Bioinformatique</i> (IFB)'s core cluster. We use a <i>Large</i> session defined as <b>10 CPU with 50 GB available for RAM</b>. 
</div>

If you have limited computing resources, please change the following parameters directly in the command cell below. 
<blockquote>
    <code>--limitBAMsortRAM</code>, to set the maximum RAM available (in bytes, called <i>octets</i> in French) for sorting the <code>.bam</code> file (non-negative integer). <i>Note: the value can be 0 only if the <code>--genomeLoad</code> option is left unchanged; it is then set to the genome index size.</i> <br>
    <code>--runThreadN</code>, to set the number of threads that <code>STAR</code> can use; it should not exceed the number of available cores
</blockquote>


In [31]:
logfile="${logfolder}star_mapping_samples.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

time for read1 in $(ls "${gohome}Data/fastq/fastp/"*_1.fastp.fastq.gz); do

    # handling names with the sample name
    samplenum=$(basename ${read1} | cut -d"_" -f1)
    echo "====== Processing sampleID: ${samplenum}..." | tee -a ${logfile}
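    # replace only the file name suffix, so that a "_1" elsewhere in the path is left untouched
    read2=$(echo "${read1}" | sed 's#_1\.fastp\.fastq\.gz$#_2.fastp.fastq.gz#')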

    echo "STAR starts at $(date)" >> ${logfile}
    # STAR working
    STAR --runThreadN 9 --runMode alignReads \
        --genomeDir "${indexfolder}" \
        --readFilesIn "${read1}" "${read2}" \
        --readFilesCommand zcat \
        --outFileNamePrefix "${mappedfolder}${samplenum}_" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattributes All \
        --outReadsUnmapped Fastx \
        --limitBAMsortRAM 48000000000 \
        --quantMode GeneCounts \
        &>> ${logfile}
    echo "STAR ends at $(date)" >> ${logfile}
    
    echo "...done" | tee -a ${logfile} 
    
done
echo "operation ends at $(date)" >> ${logfile}

echo "=== files created during mapping step ===" >> ${logfile}
ls -lh "${mappedfolder}" >> ${logfile}

echo "STAR generated $(ls "${mappedfolder}" | wc -l) files during this step." \
     | tee -a ${logfile}

Screen output is redirected to /shared/projects/gonseq/Building/Results/logfiles/star_mapping_samples.log
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done

real	157m2.801s
user	963m29.190s
sys	65m41.767s
STAR generated 127 files during this step.
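
A total file count does not say which sample is incomplete. A per-sample check on the sorted `.bam` file may be more informative; the sketch below lists samples whose `.bam` is missing or has a size of zero. `missing_or_empty_bam` is a hypothetical helper, and the demo uses placeholder files rather than real alignments.

```bash
# Hypothetical helper: print sample IDs whose sorted BAM is missing or empty
missing_or_empty_bam() {
    local fastqdir="$1" bamdir="$2" read1 sample bam
    for read1 in "${fastqdir}"/*_1.fastp.fastq.gz; do
        [ -e "${read1}" ] || continue
        sample=$(basename "${read1}" | cut -d"_" -f1)
        bam="${bamdir}/${sample}_Aligned.sortedByCoord.out.bam"
        [ -s "${bam}" ] || echo "${sample}"
    done
}

# Demo: S1 has a BAM, S2 has an empty one, S3 has none
demodir=$(mktemp -d)
mkdir "${demodir}/fastq" "${demodir}/bam"
for s in S1 S2 S3; do echo "x" > "${demodir}/fastq/${s}_1.fastp.fastq.gz"; done
echo "x" > "${demodir}/bam/S1_Aligned.sortedByCoord.out.bam"
: > "${demodir}/bam/S2_Aligned.sortedByCoord.out.bam"
missing_or_empty_bam "${demodir}/fastq" "${demodir}/bam"

# On the real data, this would be:
# missing_or_empty_bam "${gohome}Data/fastq/fastp" "${mappedfolder}"
```

This only catches absent or zero-size files. A truncated `.bam` would pass this test; `samtools quickcheck` is designed for that case.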


In [33]:
du -ch -d2 ${gohome}

16M	/shared/projects/gonseq/Building/Results/multiqc
8.4M	/shared/projects/gonseq/Building/Results/fastp
244K	/shared/projects/gonseq/Building/Results/logfiles
34M	/shared/projects/gonseq/Building/Results/fastqc
112G	/shared/projects/gonseq/Building/Results/star
4.0K	/shared/projects/gonseq/Building/Results/.ipynb_checkpoints
112G	/shared/projects/gonseq/Building/Results
154G	/shared/projects/gonseq/Building/Data/fastq
46G	/shared/projects/gonseq/Building/Data/sra
44K	/shared/projects/gonseq/Building/Data/info
36K	/shared/projects/gonseq/Building/Data/.ipynb_checkpoints
199G	/shared/projects/gonseq/Building/Data
4.0K	/shared/projects/gonseq/Building/.ipynb_checkpoints
104K	/shared/projects/gonseq/Building/Pipeline/.ipynb_checkpoints
208K	/shared/projects/gonseq/Building/Pipeline
28G	/shared/projects/gonseq/Building/Reference/indexes_upto49bases
29G	/shared/projects/gonseq/Building/Reference
339G	/shared/projects/gonseq/Building/
339G	total


In [34]:
read1="${gohome}Data/fastq/fastp/SRR7430706_1.fastp.fastq.gz"

samplenum=$(basename ${read1} | cut -d"_" -f1)
echo "====== Processing sampleID: ${samplenum}..." | tee -a ${logfile}
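# replace only the file name suffix, so that a "_1" elsewhere in the path is left untouched
read2=$(echo "${read1}" | sed 's#_1\.fastp\.fastq\.gz$#_2.fastp.fastq.gz#')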

echo "STAR starts at $(date)" >> ${logfile}
# STAR working
STAR --runThreadN 9 --runMode alignReads \
    --genomeDir "${indexfolder}" \
    --readFilesIn "${read1}" "${read2}" \
    --readFilesCommand zcat \
    --outFileNamePrefix "${mappedfolder}${samplenum}_" \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattributes All \
    --outReadsUnmapped Fastx \
    --limitBAMsortRAM 48000000000 \
    --quantMode GeneCounts \
    |& tee -a ${logfile}
echo "STAR ends at $(date)" >> ${logfile}

echo "...done" | tee -a ${logfile}

echo "=== files created during SRR7430706 mapping step ===" >> ${logfile}
ls -lh "${mappedfolder}" | grep "SRR7430706" >> ${logfile}

May 20 18:43:45 ..... started STAR run
May 20 18:43:45 ..... loading genome
May 20 18:44:36 ..... started mapping
May 20 18:51:09 ..... finished mapping
May 20 18:51:10 ..... started sorting BAM
May 20 18:52:50 ..... finished successfully
...done
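
For each sample, STAR also writes a `<prefix>Log.final.out` summary that includes a `Uniquely mapped reads %` line. Extracting this value for every sample gives a quick first impression before the dedicated quality step. The sketch below uses a hypothetical `unique_rate` helper on a made-up summary file.

```bash
# Hypothetical helper: extract the "Uniquely mapped reads %" value
# (without the % sign) from a STAR Log.final.out file
unique_rate() {
    awk -F'|' '/Uniquely mapped reads %/ { gsub(/[ \t%]/, "", $2); print $2 }' "$1"
}

# Demo on a made-up excerpt that follows the "label |<tab>value" layout
demodir=$(mktemp -d)
printf '  Number of input reads |\t1000\n'        >  "${demodir}/DEMO_Log.final.out"
printf '  Uniquely mapped reads number |\t850\n'  >> "${demodir}/DEMO_Log.final.out"
printf '  Uniquely mapped reads %% |\t85.00%%\n'  >> "${demodir}/DEMO_Log.final.out"
unique_rate "${demodir}/DEMO_Log.final.out"

# On the real results, this would be:
# for log in "${mappedfolder}"*_Log.final.out; do
#     echo "$(basename "${log}" _Log.final.out) $(unique_rate "${log}")"
# done
```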


## <b>IV- Building sample ``.bam`` indexes with ``samtools``</b>

Here we index the ``.bam`` files to produce their companion ``.bai`` files. Such index files speed up, in particular, the display of ``.bam`` alignments in a genome browser.

### **1- Tool version**

The commands used for this part belong to a large package of utilities that are very useful to manage those types of files: SAMTOOLS (http://www.htslib.org/).

Let's check first which version of SAMTOOLS we are using:

In [35]:
samtools --version

samtools 1.10
Using htslib 1.10.2
Copyright (C) 2019 Genome Research Ltd.


Simple commandline syntax is: <code>samtools index path/to/file.bam</code>
  
There is no need to provide a name for the output file, as it is always the same as the corresponding ``.bam`` file name, except for the added ``.bai`` suffix.

### **2- Creating files**

In [36]:
logfile="${logfolder}samtools_indexing_samples.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

time for bamfile in $(ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam); do

    samplenum=$(basename ${bamfile} | cut -d"_" -f1)
    echo "====== Processing sampleID: ${samplenum}..." | tee -a ${logfile}
    
    echo "samtools index starts at $(date)" >> ${logfile}
    samtools index "${bamfile}" \
             &>> ${logfile}
    echo "samtools index ends at $(date)" >> ${logfile}
    
    echo "...done" | tee -a ${logfile} 
    
done
echo "operation ends at $(date)" >> ${logfile}

echo "=== files created during indexing step ===" >> ${logfile}
ls -lh "${mappedfolder}"*.bai >> ${logfile}

echo "samtools index generated $(ls "${mappedfolder}"*.bai | wc -l) files during this step." \
     | tee -a ${logfile}

Screen output is redirected to /shared/projects/gonseq/Building/Results/logfiles/samtools_indexing_samples.log
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done

real	18m43.168s
user	17m39.576s
sys	0m56.905s
samtools index generated 16 files during this step.


<div class="alert alert-block alert-warning">
    If one or more <code>.bai</code> files are missing, there is probably an error in the matching <code>.bam</code> file. Have a look at the generated <code>.log</code> file. <br>
    When there is not enough disk space during mapping, the <code>.bam</code> file may be incomplete: you will then find a <i>missing EOF block when one should be present</i> error for this sample. <br>
    Hence, before running the mapping step again, please make sure that you have at least 5 times more free space than the size of one sample's <code>.fastq</code> files (or 10 times if you activate <code>TranscriptomeSAM</code> along with <code>GeneCounts</code>).
</div>
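
To find which samples are concerned, the `.bam` files without a companion index can be listed. The sketch below relies on the `<file>.bam.bai` naming produced by `samtools index`; `bam_without_bai` is a hypothetical helper and the demo files are placeholders.

```bash
# Hypothetical helper: list sorted BAM files that have no (or an empty) .bai
bam_without_bai() {
    local bamdir="$1" bam
    for bam in "${bamdir}"/*_Aligned.sortedByCoord.out.bam; do
        [ -e "${bam}" ] || continue
        [ -s "${bam}.bai" ] || echo "${bam}"
    done
}

# Demo: S1 is indexed, S2 is not
demodir=$(mktemp -d)
echo "x" > "${demodir}/S1_Aligned.sortedByCoord.out.bam"
echo "x" > "${demodir}/S1_Aligned.sortedByCoord.out.bam.bai"
echo "x" > "${demodir}/S2_Aligned.sortedByCoord.out.bam"
bam_without_bai "${demodir}"

# On the real results, this would be:
# bam_without_bai "${mappedfolder}"
```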

### **3- Check used disk space**

In [37]:
du -h -d3 ${gohome}

6.0M	/shared/projects/gonseq/Building/Results/multiqc/2_fastp-fastq-files_data
2.0M	/shared/projects/gonseq/Building/Results/multiqc/1_raw-fastq-files_data
1.7M	/shared/projects/gonseq/Building/Results/multiqc/1_raw-fastq-files_plots
2.0M	/shared/projects/gonseq/Building/Results/multiqc/2_fastp-fastq-files_plots
1.2M	/shared/projects/gonseq/Building/Results/multiqc/.ipynb_checkpoints
16M	/shared/projects/gonseq/Building/Results/multiqc
4.0K	/shared/projects/gonseq/Building/Results/fastp/.ipynb_checkpoints
8.4M	/shared/projects/gonseq/Building/Results/fastp
112K	/shared/projects/gonseq/Building/Results/logfiles/.ipynb_checkpoints
256K	/shared/projects/gonseq/Building/Results/logfiles
34M	/shared/projects/gonseq/Building/Results/fastqc
4.0K	/shared/projects/gonseq/Building/Results/star/.ipynb_checkpoints
117G	/shared/projects/gonseq/Building/Results/star
4.0K	/shared/projects/gonseq/Building/Results/.ipynb_checkpoints
118G	/shared/projects/gonseq/Building/Results
75G	/shared/projects/gon

For the current project, we can use up to 600 GB. As next steps are less space-consuming, some cleaning shouldn't be required.
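
The rule of thumb given earlier (free space of at least 5 times the size of a sample's `.fastq` files, 10 times with `TranscriptomeSAM`) can be turned into a small estimate. This is only a sketch: `total_bytes`, `needed_bytes` and `available_bytes` are hypothetical helpers, and on a shared cluster the value reported by `df` may not reflect the project quota.

```bash
# Hypothetical helper: total size in bytes of the files given as arguments
total_bytes() {
    local total=0 size f
    for f in "$@"; do
        size=$(wc -c < "${f}")
        total=$((total + size))
    done
    echo "${total}"
}

# Hypothetical helper: space needed = factor x total size of the files
needed_bytes() {
    local factor="$1"
    shift
    echo "$(( $(total_bytes "$@") * factor ))"
}

# Hypothetical helper: free space in bytes on the filesystem holding a folder
available_bytes() {
    df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

# Demo on two throw-away files of 100 and 200 bytes
demodir=$(mktemp -d)
head -c 100 /dev/zero > "${demodir}/demo_1.fastq.gz"
head -c 200 /dev/zero > "${demodir}/demo_2.fastq.gz"
echo "needed:    $(needed_bytes 5 "${demodir}"/demo_*.fastq.gz) bytes"
echo "available: $(available_bytes "${demodir}") bytes"
```

With real data, the arguments would be the two `.fastq.gz` files of the largest sample, and the factor would be 10 instead of 5 when `TranscriptomeSAM` is activated.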

---
___

Now we go on to check mapping quality.

**=> Step 5: Quality post mapping** 

___