<a href="https://colab.research.google.com/github/cappelchi/T-Bio/blob/master/T_Bio_Practice_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Hands-on Assignment: Transcriptomics 1
YOUR ASSIGNMENT:

**Run the RNA-seq pipeline and upload an Excel file with the KRAS gene expression visualized as a bar plot.**

RNA-Seq
RNA-Seq, also known as whole-transcriptome sequencing, identifies the RNA (generally mRNA) present in a biological sample at a given time. It can therefore be used to analyze the constantly changing cellular transcriptome.

Example: Jabbari et al. used RNA-seq to investigate psoriasis and find new genes for functional analysis. They compared their RNA-seq data to published array studies and found 1700 new candidates. These were validated by qPCR, and comparison to functional databases for psoriasis supported their role in pathogenesis.

The dataset for this exercise comes from a breast cancer study. The three different types of breast tumors are listed below.

○ Estrogen Receptor positive (ER+),

○ Human Epidermal Growth Factor Receptor 2 positive (HER2+), and

○ Triple Negative (TN), which is negative for expression of ER, HER2,
and progesterone receptor.

There are seven ER+ samples and thirteen TN samples. Each type of breast cancer
has different characteristics. For this first practice exercise, however, we will
not make any prior assumptions about the different types of breast cancer:
we will look at all of the samples as a single group and see whether a very basic
RNA-seq analysis can show some differences between them. This is the approach
you would take if you had no hypothesis about whether the groups you are
looking at differ. For example, an oncologist with several unknown cancer
biopsies might take this approach to see whether the cancers are biologically
different and therefore amenable to different treatments. (Data analysis steps
like this are followed by biological analysis to confirm the differences found
in the data and interpret them in biological terms.)
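
As a quick check of the sample composition, the subtype of each sample can be read from its name. The sketch below assumes that sample names follow the `ER-…` / `TN-…` pattern used as column headers in the expression tables of this notebook; `group_samples_by_subtype` is a hypothetical helper, not part of the pipeline.

```python
from collections import defaultdict

# Sample names as they appear in the column headers of the expression tables.
SAMPLES = [
    "ER-ERR1084763_PE", "ER-ERR1084764_PE", "ER-ERR1084765_PE",
    "ER-ERR1084775_PE", "ER-ERR1084805_PE", "ER-ERR1084806_PE",
    "ER-ERR1084811_PE", "TN-ERR1084766_PE", "TN-ERR1084768_PE",
    "TN-ERR1084798_PE", "TN-ERR1084799_PE", "TN-ERR1084800_PE",
    "TN-ERR1084801_PE", "TN-ERR1084802_PE", "TN-ERR1084803_PE",
    "TN-ERR1084804_PE", "TN-ERR1084807_PE", "TN-ERR1084808_PE",
    "TN-ERR1084809_PE", "TN-ERR1084810_PE",
]


def group_samples_by_subtype(samples):
    """Group sample names by the subtype prefix before the first '-'."""
    groups = defaultdict(list)
    for name in samples:
        subtype = name.split("-", 1)[0]
        groups[subtype].append(name)
    return dict(groups)


groups = group_samples_by_subtype(SAMPLES)
for subtype, names in groups.items():
    print(subtype, len(names))
```

The grouping is only used for counting and labeling here; the analysis itself still treats all samples as one group.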

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/RNA-Seq1.jpg" width="600" height="300" />

For this analysis, we will use an assembled genome as a scaffold (that is, a
structure on which we hang sequences) on which to map the RNA reads. An
assembled genome sequence is a chromosomal DNA sequence. It is created
by fragmenting the genome, sequencing the fragments, then putting the
sequenced fragments together in the same order as in the real biological
genome. The human reference genome produced by the Human Genome Project
is an example of an assembled genome. In this case, since this dataset is from
a study of human breast cancer in mouse models, the assembled genome used
as the reference genome is “human mouse,” a mixed assembled genome of
Homo sapiens and Mus musculus.
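
Because the reference is a mixed human and mouse genome, the result tables contain identifiers from both species. A minimal sketch of telling them apart, assuming the Ensembl stable ID prefixes seen in this notebook's tables (`ENSG`/`ENST` for human, `ENSMUSG`/`ENSMUST` for mouse); `species_of` is a hypothetical helper name:

```python
from collections import Counter


def species_of(ensembl_id):
    """Infer the species from the prefix of an Ensembl stable ID."""
    if ensembl_id.startswith("ENSMUS"):
        return "mouse"
    if ensembl_id.startswith(("ENSG", "ENST")):
        return "human"
    return "unknown"


# A few identifiers taken from the expression tables in this notebook.
example_ids = [
    "ENSG00000001630",
    "ENSMUSG00000097705",
    "ENSMUST00000000412",
    "ENST00000610020",
]
counts = Counter(species_of(i) for i in example_ids)
print(counts)
```

Applied to the `id` column of a result table, the same function gives a rough idea of how much of the signal comes from the human tumor and how much from the mouse host.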

## **Pipeline:**

### Description:

Step 0: Pre-processing of raw reads (skipped in this pipeline)

Step 1: Mapping on a reference genome

Step 2: Calculating the abundance of reads aligned to each genomic element (i.e. exon, gene or isoform)

Step 3: Biological Interpretation of gene expression profiles

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/Pipeline1_tr1_p3.png" width="600" height="300" />


### Step 1: Mapping on a reference genome

#### Bowtie2

Bowtie2 is a fast alignment algorithm that is based on the “seed” (or k-mer) approach. “Seed” substrings are extracted from the read and from its reverse complement and aligned to the reference in an ungapped fashion. The positions of the seed hits on the reference are recorded, and the hits are then extended into full alignments using SIMD-accelerated dynamic programming.
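
The seed idea can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not Bowtie2's implementation: seeds are looked up in a plain k-mer dictionary, and the extension step is an ungapped mismatch count rather than dynamic programming. All function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Find candidate placements of a read from exact seed hits."""
    hits = {}
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            # Toy "extension": compare the full read without gaps.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items(), key=lambda item: item[1])


def align(read, reference, index, k):
    """Try both the read and its reverse complement."""
    results = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for start, mismatches in seed_and_extend(seq, reference, index, k):
            results.append((strand, start, mismatches))
    return results


reference = "GATTACAGGCTTAACCGGTTACGTACGATCCA"
index = build_kmer_index(reference, k=4)
read = "AGGCTAAACCGG"  # reference[6:18] with one substitution
alignments = align(read, reference, index, k=4)
print(alignments)  # [('+', 6, 1)]
```

The read is placed even though one of its three seeds contains the substitution, because the other two seeds still match exactly.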

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/bowtie2-2.jpg" width="600" height="300" />

### Step 2: Calculating the abundance of reads aligned to each genomic element (i.e. exon, gene or isoform)

#### RSEM 
RSEM (RNA-Seq by Expectation-Maximization) is a software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files, and it can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously mapping reads, its authors show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.
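
To see how expectation maximization can handle ambiguously mapping reads, consider the toy sketch below. It is not RSEM's model: it ignores transcript lengths, read qualities, and fragment-length distributions, and only estimates the fraction of reads coming from each transcript. The function name and the example reads are hypothetical.

```python
def em_read_fractions(read_compatibility, n_transcripts, n_iter=100):
    """Estimate the fraction of reads from each transcript.

    read_compatibility: one entry per read, listing the transcript
    indices that the read maps to equally well.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform start
    n_reads = len(read_compatibility)
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compatible in read_compatibility:
            total = sum(theta[t] for t in compatible)
            for t in compatible:
                expected[t] += theta[t] / total
        # M-step: new abundances are the normalized expected counts.
        theta = [count / n_reads for count in expected]
    return theta


# 6 reads unique to transcript 0, 2 unique to transcript 1,
# and 4 reads that map to both transcripts.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_read_fractions(reads, n_transcripts=2)
print([round(x, 3) for x in theta])  # [0.75, 0.25]
```

The ambiguous reads end up being shared 3:1, the same ratio as the uniquely mapping reads, instead of being discarded or split evenly.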

<img src="https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/RSEM.png" width="500" height="270">

### Step 3: Biological Interpretation of gene expression profiles

In [0]:
mypath = '/content/pipeline1p1ch3/'

In [2]:
!wget https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/Practice_p1ch3_AllResultsRSEMRun.zip
!unzip Practice_p1ch3_AllResultsRSEMRun.zip -d {mypath}

--2020-03-06 08:12:55--  https://raw.githubusercontent.com/cappelchi/T-Bio/master/practice3/Practice_p1ch3_AllResultsRSEMRun.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3258926 (3.1M) [application/zip]
Saving to: ‘Practice_p1ch3_AllResultsRSEMRun.zip’


2020-03-06 08:13:01 (13.6 MB/s) - ‘Practice_p1ch3_AllResultsRSEMRun.zip’ saved [3258926/3258926]

Archive:  Practice_p1ch3_AllResultsRSEMRun.zip
  inflating: /content/pipeline1p1ch3/expression_genes_FPKM_not_filtered.txt  
  inflating: /content/pipeline1p1ch3/expression_genes_FPKM.txt  
  inflating: /content/pipeline1p1ch3/expression_genes_FPKM_with_annotation.txt  
  inflating: /content/pipeline1p1ch3/expression_genes_not_filtered.txt  
  inflating: /content/pipeline1p1ch3/expression_genes.txt  
  inflati

In [0]:
#!pip install jupyter-datatables
#!pip install pandas-summary

In [0]:
import pandas as pd
#from jupyter_datatables import init_datatables_mode
#init_datatables_mode()  # initialize [DataTables]
#from pandas_summary import DataFrameSummary
from os import listdir
from os.path import isfile, join

In [0]:
results = [f for f in listdir(mypath) if isfile(join(mypath, f))]

In [13]:
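# Load every result file into a DataFrame named after the file,
# e.g. expression_genes_FPKM.txt -> genes_FPKM.
for result in results:
    if result.startswith('expression_'):
        df_name = result[11:-4]   # drop the 'expression_' prefix and '.txt'
    else:
        df_name = result[:-4]     # drop '.txt' only
    # Raw f-string, so that \s+ (any run of whitespace) is not read as an escape here
    command = rf'{df_name} = pd.read_csv("{mypath + result}", sep = "\s+")'
    print(command)
    exec(command)                 # creates the variable df_name in the notebook namespace
    print(df_name)
    exec(f'print ({df_name}.describe())')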

genes_FPKM = pd.read_csv("/content/pipeline1p1ch3/expression_genes_FPKM.txt", sep = "\s+")
genes_FPKM
       ER-ERR1084763_PE  ER-ERR1084764_PE  ...  TN-ERR1084809_PE  TN-ERR1084810_PE
count       3070.000000       3070.000000  ...       3070.000000       3070.000000
mean         291.171003        295.306749  ...        365.988137        345.710003
std         4768.355100       4653.392001  ...       5272.297673       5286.727094
min            0.000000          0.000000  ...          0.000000          0.000000
25%            0.000000          0.000000  ...          0.000000          0.000000
50%            0.000000          0.000000  ...          0.000000          0.000000
75%            0.000000          0.000000  ...          4.535000          3.232500
max       183643.520000     183262.460000  ...     181583.440000     212177.950000

[8 rows x 20 columns]
temp = pd.read_csv("/content/pipeline1p1ch3/temp.txt", sep = "\s+")
temp
       ER-ERR1084763_PE  ER-ERR1084764_PE  ...  TN-ERR1

The output files labeled with ‘FPKM’ contain expression measurements in
FPKM (fragments per kilobase of transcript per million mapped fragments), whereas the others simply contain the count of reads per gene (or
isoform). The files labeled with ‘_not_filtered.txt’ also include genes with zero expression in all samples, whereas those genes are filtered out of the other files.
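
The relationship between a read count and an FPKM value can be sketched with the standard definition below. This is an illustration only: RSEM derives its values from expected counts and effective transcript lengths, and the numbers in the example are made up.

```python
def fpkm(fragment_count, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # count / (length in kb) / (total in millions) == count * 1e9 / (length * total)
    return fragment_count * 1e9 / (length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2,000 bp long, 10 million mapped fragments.
value = fpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```

Normalizing by length and by sequencing depth is what makes FPKM values comparable across genes and samples in a way that raw counts are not.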

Recall from section 2.1 that the pipeline can output tables of the expression levels of all genes (including those with expression levels of zero), or a table of only those genes with expression above zero. For most applications, we will only be interested in genes with expression above zero. Further, when dealing with large datasets, it is computationally much more expensive to open a table that includes zero-expression genes. Therefore, we will practice using the table of expressed genes only; in other words, a table from which genes with expression = 0 in all samples have been removed.
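
The filtering itself is simple to reproduce. The sketch below assumes a table laid out like the ones in this notebook (an `id` column followed by one numeric column per sample); the toy values and the helper name `drop_unexpressed` are made up for illustration.

```python
import pandas as pd

# Toy table with made-up values, laid out like the expression tables here.
toy = pd.DataFrame({
    "id": ["gene_a", "gene_b", "gene_c"],
    "ER-sample1": [0.0, 3.5, 0.0],
    "TN-sample1": [0.0, 0.0, 12.0],
})


def drop_unexpressed(df, id_column="id"):
    """Keep only rows with expression above zero in at least one sample."""
    sample_columns = [c for c in df.columns if c != id_column]
    expressed = (df[sample_columns] > 0).any(axis=1)
    return df[expressed].reset_index(drop=True)


filtered = drop_unexpressed(toy)
print(filtered)
```

Here `gene_a` is dropped because it is zero in every sample, while `gene_b` and `gene_c` are kept because each is expressed in at least one sample.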

Open the files in Excel (open the gene expression file to look at genes, and the isoform expression file to look at isoforms). Click on the links to the gene and isoform expression tables to open them.

Row 1 contains the names of the samples. Column A contains the name of each gene in the GTF file for which there was some expression above zero in any sample. Remember, the values represent the level of expression for the gene (named in Column A), not values for individual reads. (We have now moved from the level of reads to the level of genes.)

Searching the interactive table below (a subset of 20 genes) for the Ensembl ID ENSG00000133703 will locate KRAS, a known proto-oncogene. You will further explore the biological significance of your results in the following courses.

The isoform expression table looks just like the gene expression table, except that there are more rows, since some genes have more than one isoform. For the rest of this practice example, we will use the gene expression table.
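
For the assignment, the KRAS row can also be extracted and plotted directly in the notebook. The following is a minimal sketch, assuming a table with an `id` column and one column per sample, as in `genes_FPKM`. The helper names, file paths, and toy values are hypothetical, and writing `.xlsx` files with `to_excel` requires an Excel writer engine such as `openpyxl` to be installed.

```python
import pandas as pd

KRAS_ID = "ENSG00000133703"  # Ensembl gene ID for KRAS, as given in the text


def gene_expression(df, gene_id, id_column="id"):
    """Return one gene's expression as a Series indexed by sample name."""
    rows = df[df[id_column] == gene_id]
    if rows.empty:
        raise KeyError(f"{gene_id} not found in column {id_column!r}")
    return rows.drop(columns=[id_column]).iloc[0].astype(float)


def save_bar_plot(expression, png_path, title):
    """Draw a bar plot of per-sample expression and save it as a PNG."""
    # Imported here so the extraction step works without matplotlib.
    import matplotlib.pyplot as plt

    colors = ["tab:blue" if s.startswith("ER") else "tab:orange"
              for s in expression.index]
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(list(expression.index), expression.values, color=colors)
    ax.set_ylabel("FPKM")
    ax.set_title(title)
    ax.tick_params(axis="x", rotation=90)
    fig.tight_layout()
    fig.savefig(png_path)
    plt.close(fig)


def save_excel(expression, xlsx_path):
    """Write the per-sample values to an Excel sheet."""
    table = expression.rename("FPKM").rename_axis("sample").to_frame()
    table.to_excel(xlsx_path)


# Toy table with made-up values, only to show the expected layout.
toy = pd.DataFrame({
    "id": ["ENSG00000003402", KRAS_ID],
    "ER-sample1": [0.5, 4.2],
    "TN-sample1": [1.2, 9.8],
})
kras = gene_expression(toy, KRAS_ID)
print(kras)
```

In the notebook, `gene_expression(genes_FPKM, KRAS_ID)` would be the corresponding call; it raises a `KeyError` if KRAS is not present in the filtered table. The values written by `save_excel` can then be turned into a bar chart in Excel itself.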


In [14]:
genes_FPKM

Unnamed: 0,id,ER-ERR1084763_PE,ER-ERR1084764_PE,ER-ERR1084765_PE,ER-ERR1084775_PE,ER-ERR1084805_PE,ER-ERR1084806_PE,ER-ERR1084811_PE,TN-ERR1084766_PE,TN-ERR1084768_PE,TN-ERR1084798_PE,TN-ERR1084799_PE,TN-ERR1084800_PE,TN-ERR1084801_PE,TN-ERR1084802_PE,TN-ERR1084803_PE,TN-ERR1084804_PE,TN-ERR1084807_PE,TN-ERR1084808_PE,TN-ERR1084809_PE,TN-ERR1084810_PE
0,ENSG00000001630,0.00,0.00,0.00,12.95,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,10.27,10.28,0.00
1,ENSG00000001631,0.00,0.00,3.49,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2,ENSG00000002834,13707.91,15284.44,15155.14,8173.33,7621.02,21836.52,20287.39,4810.31,794.38,11611.88,8155.97,5737.05,5914.49,7506.39,11705.55,3982.07,5158.16,5831.16,2946.15,9858.35
3,ENSG00000003402,0.00,0.00,0.54,0.41,0.00,0.00,2.53,0.00,0.44,1.23,0.26,1.83,0.70,3.40,0.00,1.04,0.92,2.09,2.38,4.39
4,ENSG00000003436,0.00,0.00,0.00,0.39,0.53,0.00,0.00,0.00,0.00,1.61,0.50,0.00,1.03,0.00,0.51,0.00,0.00,0.00,1.08,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3065,ENSMUSG00000097705,0.00,0.00,0.00,46.67,83.16,0.00,0.00,0.00,2.76,38.81,33.89,8.31,73.94,0.00,46.09,47.86,20.06,319.77,0.00,50.12
3066,ENSMUSG00000097779,5.87,5.55,18.00,0.00,3.08,2.87,0.00,73.65,0.00,0.00,11.70,0.00,22.51,5.41,2.98,14.38,3.46,443.57,40.66,5.18
3067,ENSMUSG00000097873,0.00,0.00,0.00,0.00,3.43,0.00,0.00,0.00,0.00,0.00,1.43,0.00,0.00,0.00,2.63,0.00,1.05,1.60,0.00,0.86
3068,ENSMUSG00000098650,0.00,0.00,0.00,0.00,0.00,0.00,49.60,782.79,0.00,0.00,0.00,62.99,23.33,124.08,23.47,0.00,0.00,0.00,12.99,241.46


In [15]:
genes_FPKM_with_annotation

Unnamed: 0,id,ER-ERR1084763_PE,ER-ERR1084764_PE,ER-ERR1084765_PE,ER-ERR1084775_PE,ER-ERR1084805_PE,ER-ERR1084806_PE,ER-ERR1084811_PE,TN-ERR1084766_PE,TN-ERR1084768_PE,TN-ERR1084798_PE,TN-ERR1084799_PE,TN-ERR1084800_PE,TN-ERR1084801_PE,TN-ERR1084802_PE,TN-ERR1084803_PE,TN-ERR1084804_PE,TN-ERR1084807_PE,TN-ERR1084808_PE,TN-ERR1084809_PE,TN-ERR1084810_PE
0,ENSG00000001630_ensembl_havana_protein_coding,0.00,0.00,0.00,12.95,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,10.27,10.28,0.00
1,ENSG00000001631_KRIT1_ensembl_havana,0.00,0.00,3.49,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2,ENSG00000002834_ensembl_havana_protein_coding,13707.91,15284.44,15155.14,8173.33,7621.02,21836.52,20287.39,4810.31,794.38,11611.88,8155.97,5737.05,5914.49,7506.39,11705.55,3982.07,5158.16,5831.16,2946.15,9858.35
3,ENSG00000003402_CFLAR_ensembl_havana,0.00,0.00,0.54,0.41,0.00,0.00,2.53,0.00,0.44,1.23,0.26,1.83,0.70,3.40,0.00,1.04,0.92,2.09,2.38,4.39
4,ENSG00000003436_ensembl_havana_protein_coding,0.00,0.00,0.00,0.39,0.53,0.00,0.00,0.00,0.00,1.61,0.50,0.00,1.03,0.00,0.51,0.00,0.00,0.00,1.08,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3065,ENSMUSG00000097705_Gm26740_havana,0.00,0.00,0.00,46.67,83.16,0.00,0.00,0.00,2.76,38.81,33.89,8.31,73.94,0.00,46.09,47.86,20.06,319.77,0.00,50.12
3066,ENSMUSG00000097779_4833407H14Rik_ensembl,5.87,5.55,18.00,0.00,3.08,2.87,0.00,73.65,0.00,0.00,11.70,0.00,22.51,5.41,2.98,14.38,3.46,443.57,40.66,5.18
3067,ENSMUSG00000097873_Gm26867_ensembl,0.00,0.00,0.00,0.00,3.43,0.00,0.00,0.00,0.00,0.00,1.43,0.00,0.00,0.00,2.63,0.00,1.05,1.60,0.00,0.86
3068,ENSMUSG00000098650_havana_protein_coding,0.00,0.00,0.00,0.00,0.00,0.00,49.60,782.79,0.00,0.00,0.00,62.99,23.33,124.08,23.47,0.00,0.00,0.00,12.99,241.46
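
In the annotated tables, the `id` column carries extra fields after the Ensembl ID, so an exact match on `ENSG00000133703` will not find KRAS there. A small sketch of separating the ID from the rest, assuming only that the Ensembl ID is everything before the first underscore (the layout of the remaining fields varies between rows and is not interpreted here); `split_annotated_id` is a hypothetical helper:

```python
def split_annotated_id(annotated_id):
    """Split 'ENSG..._<annotation>' into (ensembl_id, annotation)."""
    ensembl_id, _, annotation = annotated_id.partition("_")
    return ensembl_id, annotation


# Identifiers copied from the annotated gene table above.
examples = [
    "ENSG00000001631_KRIT1_ensembl_havana",
    "ENSMUSG00000097705_Gm26740_havana",
]
for annotated in examples:
    print(split_annotated_id(annotated))
```

With a DataFrame, the same split can be applied to the whole column, for example with `genes_FPKM_with_annotation["id"].str.split("_", n=1).str[0]`, before matching against a plain Ensembl ID.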


In [16]:
isoforms_FPKM

Unnamed: 0,id,ER-ERR1084763_PE,ER-ERR1084764_PE,ER-ERR1084765_PE,ER-ERR1084775_PE,ER-ERR1084805_PE,ER-ERR1084806_PE,ER-ERR1084811_PE,TN-ERR1084766_PE,TN-ERR1084768_PE,TN-ERR1084798_PE,TN-ERR1084799_PE,TN-ERR1084800_PE,TN-ERR1084801_PE,TN-ERR1084802_PE,TN-ERR1084803_PE,TN-ERR1084804_PE,TN-ERR1084807_PE,TN-ERR1084808_PE,TN-ERR1084809_PE,TN-ERR1084810_PE
0,ENSMUST00000000412,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.44,0.00,0.00
1,ENSMUST00000000704,0.00,5.34,0.00,8.88,0.00,63.53,11.11,8.17,0.00,0.00,8.43,11.03,0.00,15.50,2.87,8.30,23.30,113.32,3.01,4.99
2,ENSMUST00000001675,0.00,0.00,6.78,2.58,0.00,0.00,12.84,4.90,0.00,2.23,1.66,1.61,0.83,3.07,0.00,1.63,0.00,6.30,0.00,2.96
3,ENSMUST00000001706,0.00,0.00,0.00,0.00,0.00,0.00,0.00,3.38,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4,ENSMUST00000001809,32.65,59.09,91.26,37.08,34.55,10.76,63.32,54.12,15.13,30.46,40.34,32.67,23.78,114.72,18.52,43.75,22.17,83.85,36.09,45.72
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4765,ENST00000610020,0.00,0.51,0.00,1.57,0.24,3.52,1.37,1.22,2.73,3.14,3.48,97.96,2.48,6.61,1.29,0.78,3.42,0.72,6.66,1.69
4766,ENST00000610026,0.00,0.00,0.00,0.00,1.19,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4767,ENST00000610067,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,10.47,0.00,0.00,0.00,0.00
4768,ENST00000610091,0.00,0.00,0.00,0.00,0.00,2.99,0.00,0.00,0.00,0.00,0.00,1.18,0.00,0.00,0.00,0.00,0.00,0.00,0.64,0.51


In [18]:
isoforms_FPKM_with_annotation

Unnamed: 0,id,ER-ERR1084763_PE,ER-ERR1084764_PE,ER-ERR1084765_PE,ER-ERR1084775_PE,ER-ERR1084805_PE,ER-ERR1084806_PE,ER-ERR1084811_PE,TN-ERR1084766_PE,TN-ERR1084768_PE,TN-ERR1084798_PE,TN-ERR1084799_PE,TN-ERR1084800_PE,TN-ERR1084801_PE,TN-ERR1084802_PE,TN-ERR1084803_PE,TN-ERR1084804_PE,TN-ERR1084807_PE,TN-ERR1084808_PE,TN-ERR1084809_PE,TN-ERR1084810_PE
0,ENSMUST00000000412_ensembl_havana_protein_coding,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.44,0.00,0.00
1,ENSMUST00000000704_ensembl_havana_protein_coding,0.00,5.34,0.00,8.88,0.00,63.53,11.11,8.17,0.00,0.00,8.43,11.03,0.00,15.50,2.87,8.30,23.30,113.32,3.01,4.99
2,ENSMUST00000001675_ensembl_havana_protein_coding,0.00,0.00,6.78,2.58,0.00,0.00,12.84,4.90,0.00,2.23,1.66,1.61,0.83,3.07,0.00,1.63,0.00,6.30,0.00,2.96
3,ENSMUST00000001706_ensembl_havana_protein_coding,0.00,0.00,0.00,0.00,0.00,0.00,0.00,3.38,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4,ENSMUST00000001809_ensembl_havana_protein_coding,32.65,59.09,91.26,37.08,34.55,10.76,63.32,54.12,15.13,30.46,40.34,32.67,23.78,114.72,18.52,43.75,22.17,83.85,36.09,45.72
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4765,ENST00000610020_ensembl_havana_protein_coding,0.00,0.51,0.00,1.57,0.24,3.52,1.37,1.22,2.73,3.14,3.48,97.96,2.48,6.61,1.29,0.78,3.42,0.72,6.66,1.69
4766,ENST00000610026_PPA1_ensembl_havana,0.00,0.00,0.00,0.00,1.19,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4767,ENST00000610067_LINC01128_havana,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,10.47,0.00,0.00,0.00,0.00
4768,ENST00000610091_RP11-417J8.6_ensembl_havana,0.00,0.00,0.00,0.00,0.00,2.99,0.00,0.00,0.00,0.00,0.00,1.18,0.00,0.00,0.00,0.00,0.00,0.00,0.64,0.51
