<a href="https://colab.research.google.com/github/cappelchi/T-Bio/blob/master/T_Bio_Practice_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Pipeline:**

Step 1: Pre-processing raw reads

Step 2: Mapping on a reference genome

Step 3: Calculating the abundance of reads aligned to each genomic element (i.e. exon, gene or isoform)

Step 4: Biological Interpretation of gene expression profiles

![Pipeline](https://edu.t-bio.info/wp-content/uploads/2019/12/PDX-Sailfish-DEMO-edit.gif)


In [0]:
!head links.svl

/export-data/sciservice/t15/ER-ERR1084763_1.fq

/export-data/sciservice/t15/ER-ERR1084763_2.fq

/export-data/sciservice/t15/ER-ERR1084764_1.fq

/export-data/sciservice/t15/ER-ERR1084764_2.fq

/export-data/sciservice/t15/ER-ERR1084765_1.fq



In [0]:
!head ER-ERR1084763_1.fq

@ERR1084763.2761803
GGCGTTATGGAGTGGAAGTGAAATCACATGGCTAGGCCGGAGGTCATTAGGAGGGCTGAGAGGGCCCCTGTTAGGGGTCATGGGCTGGGT
+
@@@DDDDDHFDFACDEEF3CFAHFGHIH>EEHIIGHIGHI6?A-B@FGIIICHIEEHBBBBCAA'(9>9@:>CCCAB5<BCCCBBCBBB9
@ERR1084763.46715705
GTGGGTTTTACTATATGATAGGCATGTGATTGGTGGGTCATTATGTGTTGTCGTGCAGGTAGAGGCTGAGAGGGCCCCTGTTAGGGGTCA
+
@:?DD)=ADHFDDGIHDHEAHGB?@F9?<ACGC*CD:@DBAD*BG?FHIGDBGFFG@@D.8=;CHGHF;;)7(9>5>8?B>A;59<@'8@
@ERR1084763.18852160
CTTCTAGTAAGCCTCTACCTGCACGACAACACATAATGACCCACCAATCACATGGCTAGGACCGAGGTCATTAGGAGGGGTGAGGGGGCC
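
The listing above follows the four-line FASTQ layout: an `@` header, the base sequence, a `+` separator, and one quality character per base. As a minimal sketch, assuming the qualities use the common Phred+33 encoding (the notebook does not state this), the records can be parsed as follows. The function name `read_fastq` and the shortened example record are illustrative only.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33, so score = ASCII code of the character minus 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Hypothetical record built from the first 8 bases of the first read shown above.
example = StringIO("@ERR1084763.2761803\nGGCGTTAT\n+\n@@@DDDDD\n")
records = list(read_fastq(example))
for read_id, seq, scores in records:
    print(read_id, seq, scores)
```

To read a real file, pass an open file handle instead of the `StringIO` object, for example `open('ER-ERR1084763_1.fq')`.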


### Step 1: Pre-processing raw reads 

#### Trimmomatic
Trimmomatic removes technical sequences, such as the adaptors used in NGS experiments, from raw sequencing data by matching the reads against a database of known adaptor sequences.

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice2/Trimmomatic.jpg" width="340" height="270" />
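
The idea of adaptor clipping can be illustrated with a deliberately simplified sketch. This is not Trimmomatic's actual algorithm, which tolerates mismatches and also handles paired reads; the version below looks only for exact matches. The function name `clip_adapter`, the `min_overlap` parameter, and the example adaptor string are assumptions made for illustration.

```python
def clip_adapter(sequence, quality, adapter, min_overlap=5):
    """Cut a read at the first exact adaptor match, or at an adaptor prefix at its 3' end."""
    pos = sequence.find(adapter)
    if pos == -1:
        # No full match: look for the longest adaptor prefix hanging off the 3' end.
        longest = min(len(adapter), len(sequence)) - 1
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                pos = len(sequence) - overlap
                break
    if pos == -1:
        return sequence, quality  # nothing to clip
    # Trim the quality string together with the sequence so they stay aligned.
    return sequence[:pos], quality[:pos]


ADAPTER = "AGATCGGAAGAGC"  # example adaptor string, used here only for illustration
full = clip_adapter("ACGTACGTAC" + ADAPTER, "I" * 23, ADAPTER)
partial = clip_adapter("ACGTACGTAC" + ADAPTER[:7], "I" * 17, ADAPTER)
clean = clip_adapter("ACGTACGTAC", "I" * 10, ADAPTER)
print(full, partial, clean)
```

All three calls return the same ten-base read, because the adaptor (full or partial) is removed in the first two cases and absent in the third.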

In [0]:
!head ER-ERR1084763_1_trim.fq

@ERR1084763.2761803
GGCGTTATGGAGTGGAAGTGAAATCACATGGCTAGGCCGGAGGTCATTAGGAGGGCTGAGAGGGCCCCTGTTAGGGGTCATGGGCTGGGT
+
@@@DDDDDHFDFACDEEF3CFAHFGHIH>EEHIIGHIGHI6?A-B@FGIIICHIEEHBBBBCAA'(9>9@:>CCCAB5<BCCCBBCBBB9
@ERR1084763.46715705
GTGGGTTTTACTATATGATAGGCATGTGATTGGTGGGTCATTATGTGTTGTCGTGCAGGTAGAGGCTGAGAGGGCCCCTGTTAGGGGTCA
+
@:?DD)=ADHFDDGIHDHEAHGB?@F9?<ACGC*CD:@DBAD*BG?FHIGDBGFFG@@D.8=;CHGHF;;)7(9>5>8?B>A;59<@'8@
@ERR1084763.34197928
ATAACGCTCCTCATACTAGGCCTACTAACCAACACACTAACCATATACCAATGATGGCGCGATGGAGTGGAAGTGAAATCACATGGCTA


#### PCR Clean
PCR Clean removes all duplicated reads from raw sequencing data. The presence of duplicated reads from polymerase chain reaction (PCR) amplification can distort estimates of gene expression levels.

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice2/PCR_Clean.jpg" width="340" height="270">
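
Duplicate removal can be sketched in a few lines. The notebook does not say which criterion PCR Clean uses to decide that two reads are duplicates, so the sketch below assumes the simplest one: two reads are duplicates when their sequences are identical, and the first one seen is kept. The function name `remove_duplicate_reads` and the toy records are hypothetical.

```python
def remove_duplicate_reads(records):
    """Keep the first read seen for each distinct sequence.

    `records` is an iterable of (read_id, sequence, quality) tuples.
    """
    seen = set()
    unique = []
    for read_id, sequence, quality in records:
        if sequence in seen:
            continue  # assumption: identical sequence means PCR duplicate
        seen.add(sequence)
        unique.append((read_id, sequence, quality))
    return unique


# Toy input: read r2 repeats the sequence of r1 and is dropped.
reads = [
    ("r1", "ACGTACGT", "IIIIIIII"),
    ("r2", "ACGTACGT", "HHHHHHHH"),
    ("r3", "TTGACCAT", "IIIIIIII"),
]
deduplicated = remove_duplicate_reads(reads)
print([read_id for read_id, _, _ in deduplicated])
```

For paired-end data a real tool would typically compare both mates of a pair; that is left out here to keep the sketch short.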

In [0]:
!head ER-ERR1084763_1_trim_pcr.fq

@ERR1084763.387172/1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
CCCFFFFFHHHGHJJJHFDDDDDDDDDDDDDDDDDDDDDDDDDDDB98<85-507@5557;@
@ERR1084763.548902/1
AAAAAAAAAAAAAAATTTCTCTTCTTCCTGTTATTGGTAGTTCTGAACGTTAGATATTTTTTTTCCATGGGGTCAAAAGGTACCTAAG
+
@@@DDDDDHHHHHI@FHG;@=DC=CCE>C?DE7?CDC>@ACD@C>@A>;?CC?::;@CDECBBBCC>B@?BB9>BCC:>?4:@>@CC?
@ERR1084763.135941/1
AAAAAAAAAAAAAATTTCTCTTCTTCCTGTTATTGGTAGTTCTGAACGTTAGATATTTTTTTTCCATGGGGTCAAAAGGTACCTAA


### Step 2: Mapping on Transcripts

#### Bowtie2-t
Bowtie2-t is a version of the Bowtie2 algorithm configured to map reads to the transcripts defined in a GTF file.

Mapping reads to a reference can be computationally expensive and slow, but it is an essential step in short-read sequencing analysis. Bowtie2 is a fast alignment (mapping) algorithm based on the “seed” (or k-mer) approach. “Seed” substrings are extracted from each read and from its reverse complement and aligned to the reference in an ungapped fashion. Once their positions on the reference are recorded, the seeds are extended into full alignments using SIMD-accelerated dynamic programming.

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice2/Bowtie2-g.jpg" width="340" height="270">
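
The seeding stage described above can be sketched as follows. This is a toy illustration, not Bowtie2's implementation: Bowtie2 searches a compressed FM index, whereas the sketch uses a plain dictionary of k-mers, and the extension step by dynamic programming is omitted. All function names, the toy reference, and the values of `k` and `step` are assumptions chosen for the example.

```python
from collections import defaultdict


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_hits(read, index, k, step):
    """Find exact (ungapped) seed matches for the read and its reverse complement.

    Returns (strand, offset_in_read, position_in_reference) tuples.
    """
    hits = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(0, len(seq) - k + 1, step):
            for ref_pos in index.get(seq[offset:offset + k], []):
                hits.append((strand, offset, ref_pos))
    return hits


reference = "TTGACCATGGCTAGGCCGGAGGTCATT"  # toy reference sequence
read = "ATGGCTAGGCC"                       # taken from position 6 of the toy reference
index = build_kmer_index(reference, k=5)
hits = seed_hits(read, index, k=5, step=3)
print(hits)
```

Seeds that agree on `position_in_reference - offset_in_read` point to the same candidate alignment start, which is where a full aligner would begin the extension step.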

### Step 3: Calculating the abundance of reads aligned to each genomic element (i.e. exon, gene or isoform)

#### RSEM
RSEM (RNA-Seq by Expectation-Maximization) is a software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. It outputs abundance estimates, 95% credibility intervals, and visualization files, and it can also simulate RNA-Seq data. Unlike quantification tools that depend on alignments to a reference genome, RSEM needs only a set of reference transcript sequences. In combination with a de novo transcriptome assembler, it therefore enables accurate transcript quantification for species without sequenced genomes. According to its authors, on simulated and real data sets RSEM performs as well as or better than quantification methods that rely on a reference genome. Because RSEM makes effective use of ambiguously mapping reads, the authors also report that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. Estimates of the relative frequencies of isoforms within a single gene, on the other hand, may be improved by paired-end reads, depending on the number of possible splice forms for each gene.

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice2/gene_expression_table.jpg" width="340" height="270">
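
The way expectation-maximization handles ambiguously mapping reads can be shown with a toy model. This is a heavily simplified sketch and not RSEM's actual statistical model, which also accounts for sequencing errors, read positions, and fragment lengths. Here each read is described only by the set of transcripts it is compatible with; the function name `em_abundance` and the toy data are hypothetical.

```python
def em_abundance(read_compat, lengths, n_iter=100):
    """Estimate the fraction of reads originating from each transcript.

    read_compat: list of sets, the transcripts each read maps to.
    lengths: dict of transcript name -> length in bases.
    """
    transcripts = list(lengths)
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start
    for _ in range(n_iter):
        expected = dict.fromkeys(transcripts, 0.0)
        # E-step: share each read among its compatible transcripts,
        # proportionally to theta[t] / length[t].
        for compat in read_compat:
            weights = {t: theta[t] / lengths[t] for t in compat}
            total = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: the new read fractions are the normalized expected counts.
        theta = {t: expected[t] / len(read_compat) for t in transcripts}
    return theta


# Toy data: 6 reads unique to isoform A, 2 unique to B, 4 compatible with both.
reads = [{"A"}] * 6 + [{"B"}] * 2 + [{"A", "B"}] * 4
theta = em_abundance(reads, lengths={"A": 1000, "B": 1000})
print(theta)
```

With equal transcript lengths the estimates settle near 0.75 for A and 0.25 for B: the ambiguous reads are shared in the same 3:1 ratio as the uniquely mapping ones rather than being discarded or split evenly.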

### Step 4: Biological Interpretation of gene expression profiles

In [7]:
# Download from raw.githubusercontent.com: the github.com/.../blob/... URL returns an HTML page, not the data file.
!curl -LJO https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice2/PDX_project_pre-processing__bowtie2-t__RSEM_expression_genes_FPKM.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  393k  100  393k    0     0  2553k      0 --:--:-- --:--:-- --:--:-- 2553k


In [8]:
!head PDX_project_pre-processing__bowtie2-t__RSEM_expression_genes_FPKM.txt

id	ER-ERR1084763_PE	ER-ERR1084764_PE	ER-ERR1084765_PE	ER-ERR1084775_PE	ER-ERR1084805_PE	ER-ERR1084806_PE	ER-ERR1084811_PE	TN-ERR1084766_PE	TN-ERR1084768_PE	TN-ERR1084798_PE	TN-ERR1084799_PE	TN-ERR1084800_PE	TN-ERR1084801_PE	TN-ERR1084802_PE	TN-ERR1084803_PE	TN-ERR1084804_PE	TN-ERR1084807_PE	TN-ERR1084808_PE	TN-ERR1084809_PE	TN-ERR1084810_PE
ENSG00000001630	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	23.00	21.22	0.00
ENSG00000002834	18107.99	19943.94	19480.73	16688.86	11468.35	25380.75	25531.70	6908.48	1392.67	19009.15	12868.90	8726.15	8844.10	9536.82	15575.65	6247.11	7130.97	9967.39	4307.43	13905.67
ENSG00000003402	0.90	0.00	0.93	1.70	0.00	1.49	3.56	0.00	0.77	2.84	0.00	2.22	1.76	4.63	0.00	1.96	1.58	4.70	3.72	7.24
ENSG00000003436	0.00	0.00	0.00	1.09	1.03	0.00	0.00	0.00	0.00	3.73	1.07	0.00	1.00	0.00	1.02	0.00	0.00	0.00	1.01	0.00
ENSG00000003509	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	3.30
ENSG

In [0]:
import pandas as pd

In [0]:
df = pd.read_csv('PDX_project_pre-processing__bowtie2-t__RSEM_expression_genes_FPKM.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning

In [13]:
df

Unnamed: 0,id,ER-ERR1084763_PE,ER-ERR1084764_PE,ER-ERR1084765_PE,ER-ERR1084775_PE,ER-ERR1084805_PE,ER-ERR1084806_PE,ER-ERR1084811_PE,TN-ERR1084766_PE,TN-ERR1084768_PE,TN-ERR1084798_PE,TN-ERR1084799_PE,TN-ERR1084800_PE,TN-ERR1084801_PE,TN-ERR1084802_PE,TN-ERR1084803_PE,TN-ERR1084804_PE,TN-ERR1084807_PE,TN-ERR1084808_PE,TN-ERR1084809_PE,TN-ERR1084810_PE
0,ENSG00000001630,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,23.00,21.22,0.00
1,ENSG00000002834,18107.99,19943.94,19480.73,16688.86,11468.35,25380.75,25531.70,6908.48,1392.67,19009.15,12868.90,8726.15,8844.10,9536.82,15575.65,6247.11,7130.97,9967.39,4307.43,13905.67
2,ENSG00000003402,0.90,0.00,0.93,1.70,0.00,1.49,3.56,0.00,0.77,2.84,0.00,2.22,1.76,4.63,0.00,1.96,1.58,4.70,3.72,7.24
3,ENSG00000003436,0.00,0.00,0.00,1.09,1.03,0.00,0.00,0.00,0.00,3.73,1.07,0.00,1.00,0.00,1.02,0.00,0.00,0.00,1.01,0.00
4,ENSG00000003509,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,3.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3305,ENSMUSG00000097705,0.00,0.00,0.00,74.51,125.65,0.00,0.00,0.00,5.23,47.45,41.95,11.18,130.55,0.00,74.65,31.42,28.59,437.45,0.00,71.00
3306,ENSMUSG00000097779,10.28,9.52,31.24,0.00,6.20,5.64,0.00,58.04,0.00,0.00,12.41,0.00,39.16,8.59,5.95,21.69,0.00,730.10,46.76,4.28
3307,ENSMUSG00000097873,0.00,0.00,0.00,0.00,3.72,0.00,0.00,0.00,0.00,3.20,3.03,0.00,0.00,0.00,0.00,0.00,1.82,3.03,0.00,0.00
3308,ENSMUSG00000098650,0.00,0.00,0.00,0.00,0.00,0.00,207.83,712.00,8.49,0.00,0.00,50.91,31.01,68.03,54.68,0.00,0.00,0.00,0.00,222.61
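
A possible first step toward interpretation is to compare the two sample groups and to separate human from mouse genes. The sketch below uses a small table copied from the values shown above (two genes, four samples) so that it runs on its own. `ENSG` and `ENSMUSG` are the Ensembl prefixes for human and mouse genes. Treating the `ER-` and `TN-` column prefixes as two biological groups is an assumption based on the column names, and the pseudocount of 1 in the fold change is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

# Subset of the FPKM table shown above: two genes, four samples.
toy = pd.DataFrame({
    "id": ["ENSG00000002834", "ENSMUSG00000097779"],
    "ER-ERR1084763_PE": [18107.99, 10.28],
    "ER-ERR1084764_PE": [19943.94, 9.52],
    "TN-ERR1084766_PE": [6908.48, 58.04],
    "TN-ERR1084768_PE": [1392.67, 0.00],
}).set_index("id")

# Assumption: the column prefix encodes the sample group.
er_cols = [c for c in toy.columns if c.startswith("ER-")]
tn_cols = [c for c in toy.columns if c.startswith("TN-")]

summary = pd.DataFrame({
    # Ensembl gene ID prefix tells human (ENSG) from mouse (ENSMUSG) genes.
    "species": np.where(toy.index.str.startswith("ENSMUSG"), "mouse", "human"),
    "mean_ER": toy[er_cols].mean(axis=1),
    "mean_TN": toy[tn_cols].mean(axis=1),
}, index=toy.index)

# log2 fold change with a pseudocount of 1 to avoid division by zero.
summary["log2FC_ER_vs_TN"] = np.log2((summary["mean_ER"] + 1) / (summary["mean_TN"] + 1))
print(summary)
```

To apply the same steps to the full table, replace `toy` with `df.set_index('id')`.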
