<a href="https://colab.research.google.com/github/cappelchi/T-Bio/blob/master/T_Bio_Practice_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Hands-on Assignment: Transcriptomics 1
YOUR ASSIGNMENT:

**Run the RNA-seq pipeline and upload an Excel file with KRAS gene expression visualized as a bar plot.**

RNA-Seq
RNA-Seq, also known as whole transcriptome sequencing, identifies and quantifies the RNA (usually mRNA) present in a biological sample at a given moment. It can therefore be used to analyze the continually changing cellular transcriptome.

Example: Jabbari et al. used RNA-seq to investigate psoriasis and find new genes for functional analysis. They compared their RNA-seq data with published array studies and found 1700 new candidates. These were validated by qPCR, and comparison with functional databases for psoriasis supported their role in pathogenesis.

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/RNA-Seq1.jpg" width="600" height="300" />


## **Pipeline:**

### Description:

Step 0: No pre-processing of raw reads

Step 1: Mapping on a reference genome

Step 2: Calculating the abundance of reads aligned to each genomic element (i.e. exon, gene or isoform)

Step 3: Biological Interpretation of gene expression profiles

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/Pipeline1_tr1_p3.png" width="600" height="300" />
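Steps 1 and 2 are often run together, because RSEM can call Bowtie2 itself. The sketch below only builds the two command lines and does not execute them. It assumes RSEM and Bowtie2 are installed, that a transcript FASTA is available, and that the reads are paired-end. All file names, the reference name, and the sample name are hypothetical placeholders, and this is not necessarily how the assignment's result files were produced.

```python
def rsem_bowtie2_commands(transcripts_fasta, reference_name,
                          reads_1, reads_2, sample_name, threads=4):
    """Build (but do not run) RSEM command lines that use Bowtie2 as the aligner."""
    # Build the RSEM reference together with its Bowtie2 index.
    prepare = ["rsem-prepare-reference", "--bowtie2",
               transcripts_fasta, reference_name]
    # Align paired-end reads with Bowtie2 and estimate abundances.
    quantify = ["rsem-calculate-expression", "--bowtie2", "--paired-end",
                "-p", str(threads),
                reads_1, reads_2, reference_name, sample_name]
    return [prepare, quantify]


# Hypothetical inputs, for illustration only.
commands = rsem_bowtie2_commands(
    "transcripts.fa", "ref/transcripts",
    "sample_1.fastq", "sample_2.fastq", "sample",
)
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the tools are available.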


### Step 1: Mapping on a reference genome

#### Bowtie2

Bowtie2 is a fast alignment algorithm based on the “seed” (or k-mer) approach. Seed substrings are extracted from the read and from its reverse complement and aligned to the reference in an ungapped fashion. The seed positions on the reference are recorded, and the seeds are then extended into full alignments using SIMD-accelerated dynamic programming.

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/bowtie2-2.jpg" width="600" height="300" />
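The seed-and-extend idea described above can be illustrated with a toy example. This is a simplified sketch, not Bowtie2's implementation: it looks seeds up in a plain dictionary of k-mers (Bowtie2 uses an FM index) and "extends" by counting mismatches over an ungapped window (Bowtie2 uses gapped dynamic programming). The reference sequence, the reads, and all function names are made up for illustration.

```python
from collections import defaultdict


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return sorted (ref_start, strand, mismatches) hits for one read."""
    hits = set()
    # Try the read itself ("+") and its reverse complement ("-").
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        for offset in range(len(query) - k + 1):
            seed = query[offset:offset + k]
            # Exact, ungapped seed lookup.
            for pos in index.get(seed, []):
                start = pos - offset  # implied start of the whole read
                if start < 0 or start + len(query) > len(reference):
                    continue
                # Toy "extension": count mismatches across the full read.
                window = reference[start:start + len(query)]
                mismatches = sum(a != b for a, b in zip(query, window))
                if mismatches <= max_mismatches:
                    hits.add((start, strand, mismatches))
    return sorted(hits)


# Made-up reference and reads.
reference = "ACGTTGCATGGCCTAGGATCCAGTTTACGGA"
k = 5
index = build_kmer_index(reference, k)

forward_read = reference[8:20]                            # exact copy
mutated_read = forward_read[:6] + "C" + forward_read[7:]  # one substitution
reverse_read = reverse_complement(reference[15:27])       # opposite strand

for read in (forward_read, mutated_read, reverse_read):
    print(read, seed_and_extend(read, reference, index, k))
```

In this toy data the exact read is placed at position 8 with no mismatches, the mutated read at the same position with one mismatch, and the reverse-strand read at position 15 on the "-" strand.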

### Step 2: Calculating the abundance of reads aligned to each genomic element (i.e. exon, gene or isoform)

#### RSEM 
RSEM (RNA-Seq by Expectation-Maximization) is a software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. It outputs abundance estimates, 95% credibility intervals, and visualization files, and it can also simulate RNA-Seq data.

In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome.

RSEM makes effective use of ambiguously mapping reads. Using this ability, its authors report that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/RSEM.png" width="500" height="270">
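How expectation maximization handles ambiguously mapping reads can be shown with a small sketch. It is a deliberately reduced model, not RSEM's: it ignores transcript lengths, read positions, and sequencing errors, and only estimates which fraction of reads comes from each transcript. The read counts and names below are invented for illustration.

```python
def em_abundance(read_compat, n_transcripts, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    read_compat: one list per read with the indices of the transcripts
                 that the read aligns to (more than one = ambiguous read).
    """
    # Start from a uniform guess.
    theta = [1.0 / n_transcripts] * n_transcripts
    n_reads = len(read_compat)
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: new abundances are the normalized expected read counts.
        theta = [count / n_reads for count in expected]
    return theta


# Invented example with two isoforms of one gene:
# 6 reads map only to isoform 0, 2 only to isoform 1, 4 map to both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_abundance(reads, n_transcripts=2)
print([round(x, 3) for x in theta])
```

In this example the ambiguous reads end up shared in the same 3:1 ratio as the uniquely mapping reads, so the estimates converge to 0.75 and 0.25. A length-aware model would additionally normalize by effective transcript length.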

### Step 3: Biological Interpretation of gene expression profiles

In [0]:
mypath = '/content/pipeline1p1ch3/'

In [0]:
!wget https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice2/pipeline1/PDX_project_pre-processing__bowtie2-t__RSEM_AllResultsRSEMRun.zip
!unzip PDX_project_pre-processing__bowtie2-t__RSEM_AllResultsRSEMRun.zip -d {mypath}

--2020-03-05 12:52:43--  https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice2/pipeline1/PDX_project_pre-processing__bowtie2-t__RSEM_AllResultsRSEMRun.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3040179 (2.9M) [application/zip]
Saving to: ‘PDX_project_pre-processing__bowtie2-t__RSEM_AllResultsRSEMRun.zip.1’


2020-03-05 12:52:43 (10.9 MB/s) - ‘PDX_project_pre-processing__bowtie2-t__RSEM_AllResultsRSEMRun.zip.1’ saved [3040179/3040179]

Archive:  PDX_project_pre-processing__bowtie2-t__RSEM_AllResultsRSEMRun.zip
  inflating: mypath/expression_genes_FPKM_not_filtered.txt  
  inflating: mypath/expression_genes_FPKM.txt  
  inflating: mypath/expression_genes_not_filtered.txt  
  inflating: mypath/expression_genes.txt  
  inflating: mypath/expressi

In [0]:
#!pip install jupyter-datatables
#!pip install pandas-summary

In [0]:
import pandas as pd
#from jupyter_datatables import init_datatables_mode
#init_datatables_mode()  # initialize [DataTables]
#from pandas_summary import DataFrameSummary
from os import listdir
from os.path import isfile, join

In [0]:
results = [f for f in listdir(mypath) if isfile(join(mypath, f))]

In [0]:
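tables = {}  # file name without extension -> DataFrame
for result in results:
    name = result[:-4]  # drop the 4-character extension (e.g. ".txt")
    # Columns are whitespace-separated; the raw string avoids an invalid "\s" escape.
    tables[name] = pd.read_csv(join(mypath, result), sep=r"\s+")
    globals()[name] = tables[name]  # also expose each table as a notebook variable, as before
    print(name)
    print(tables[name].describe())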

expression_genes_FPKM
       ER-ERR1084763_PE  ER-ERR1084764_PE  ...  TN-ERR1084809_PE  TN-ERR1084810_PE
count       3310.000000       3310.000000  ...       3310.000000        3310.00000
mean         248.134335        252.239305  ...        307.271483         287.03629
std         3580.248232       3467.634219  ...       3731.114167        3714.63254
min            0.000000          0.000000  ...          0.000000           0.00000
25%            0.000000          0.000000  ...          0.000000           0.00000
50%            0.000000          0.000000  ...          0.000000           0.00000
75%            0.000000          0.000000  ...          4.697500           2.55000
max       110309.210000     102040.800000  ...     113424.790000      123289.43000

[8 rows x 20 columns]
temp
       ER-ERR1084763_PE  ER-ERR1084764_PE  ...  TN-ERR1084809_PE  TN-ERR1084810_PE
count       4981.000000       4981.000000  ...       4981.000000       4981.000000
mean         164.891508        167.61