# RNA Expression

With RNA sequencing we have the promise of a measurement that should help us understand the actions and reactions of a cell.

Genomic analysis found fewer protein-coding differences between humans and chimps than originally expected.  RNA expression levels and feedback control look like the major differences that arose during human evolution.
 * 'there are significant differences in how genes are expressed and regulated'
 * '.. those differences are most marked in the brain'
 * https://science.sciencemag.org/content/292/5514/44
 * '.. rapid change, along the five million years of the human lineage, that was concentrated on these specific groups of genes.'
 * 'Nearly half of the genes that had been pushed to express themselves more in humans involved transcription factors--gene-encoded proteins that control the expression of other proteins'
 * https://www.scientificamerican.com/article/separation-of-man-and-ape/


## Expression Level Differences can be complicated

What is a 'significant' difference in expression?

### Small differences can have a large effect

Classic blond vs brunette hair in Europeans is caused by a 20% decrease in mRNA expression.
 * Caused by a SNP 350 kb away from the KITLG coding region
 * SNP alters transcription factor binding
 * A change of only 20% is smaller than what most analyses check for significance.
 * https://www.nature.com/articles/ng.3019 Guenther et al. 2014
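
To put the 20% figure in context, the arithmetic below converts it to a log2 fold change and compares it with a 2-fold cutoff. The cutoff is an assumption chosen for illustration (a widely used convention), not a value taken from the paper.

```python
import math

# A 20% decrease means expression at 0.8x the reference level.
fold_change = 0.8
log2_fc = math.log2(fold_change)

# Assumed illustrative cutoff: report only genes with at least a
# 2-fold change, i.e. |log2 fold change| >= 1.
assumed_log2_cutoff = 1.0
passes_cutoff = abs(log2_fc) >= assumed_log2_cutoff

print(f"log2 fold change: {log2_fc:.3f}")
print(f"passes 2-fold cutoff: {passes_cutoff}")
```

Under that assumed cutoff, a change of this size would be filtered out long before anyone looked at its biology.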

### But not always
 * [mRNA] / [protein] ratio varies by several orders of magnitude across genes.
 * but is consistent across cells and tissues, for each gene
 * Quantifying gene expression: the importance of being subtle
 * https://www.embopress.org/doi/pdf/10.15252/msb.20167325

![Variations in mRNA vs Protein levels](img/Gene_expression_subtle.png)

### Population at large can have outliers
There can be large differences within the normal population.
 * The impact of rare variation on gene expression across tissues
  * 58% of underexpression and 28% of overexpression outliers have nearby conserved rare variants compared to 8% of non-outliers
  * A Bayesian approach improved calling
 * https://www.ncbi.nlm.nih.gov/pubmed/29022581

# Data Exploration - Nature of the beast

Most uses of expression data compare a gene expression pattern to known references.
What would this involve?

## Normal Tissue Expressions

The Genotype-Tissue Expression (GTEx) project is becoming a major resource for normal human tissue.
 * v8 to be released this month.

The spread includes extreme values, but most of the tissues look similar.

The mean is ~ 16 - 17 TPM, but the 50th percentile is 0 and the 75th percentile is ~ 2.

This indicates that TPM values are not normally distributed: a small number of extreme values pull the mean far above the median.  At least half of the 56K gene set show no expression.

In [1]:
import importlib
import os, sys, glob

import pandas as pd
import numpy as np
import seaborn as sns
import pathlib
script_dir = pathlib.Path().resolve()

gm_fn = os.path.join(script_dir, 'tables', 'GTEX_tissue_medians_desc.tsv')
gm = pd.read_csv(gm_fn, sep='\t')
print(gm[gm.columns[:5]])

  Unnamed: 0  Adipose Tissue  Adrenal Gland       Bladder          Blood
0      count    56202.000000   56202.000000  56202.000000   56202.000000
1       mean       16.335619      17.107780     16.663300      15.233666
2        std      398.650451     582.642470    351.055527    1113.018285
3        min        0.000000       0.000000      0.000000       0.000000
4        25%        0.000000       0.000000      0.000000       0.000000
5        50%        0.000000       0.000000      0.000000       0.000000
6        75%        1.897000       1.492375      2.668000       0.492100
7        max    38300.000000   60775.000000  34110.000000  246600.000000


Most oncogenes and tumour suppressors are among genes with higher expression.

In [3]:
gmo_fn = os.path.join(script_dir, 'tables', 'GTEX_tissue_medians_onco_ts.tsv')
gmo = pd.read_csv(gmo_fn, sep='\t')
print(gmo[gmo.columns[:5]])

  Unnamed: 0  Adipose Tissue  Adrenal Gland      Bladder        Blood
0      count      779.000000     779.000000   779.000000   779.000000
1       mean       50.191434      33.618767    49.578746    50.097736
2        std      158.616941     101.872038   131.885045   360.652820
3        min        0.000000       0.000000     0.000000     0.000000
4        25%        4.507500       2.593250     5.602000     0.576500
5        50%       14.680000      10.005000    18.820000     5.052000
6        75%       33.765000      25.515000    41.905000    20.595000
7        max     1979.000000    1084.500000  1770.000000  8644.000000


## Natural Distribution is log normal

This means the log of the values shows a normal distribution.
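
A minimal sketch of what the log transform does, using synthetic log-normal values as a stand-in for one gene's TPM across samples. The data, the seed and the pseudocount of 1 are assumptions for illustration; none of this is GTEx data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for one gene's TPM values across 5000 samples.
tpm = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# The pseudocount of 1 keeps zero TPM values finite after the log.
log_tpm = np.log2(tpm + 1)

# Skewness near 0 is what a symmetric, normal-like curve gives.
print("skew of raw TPM: ", round(tpm.skew(), 2))
print("skew of log2 TPM:", round(log_tpm.skew(), 2))
```

The raw values have a long right tail and a large positive skew, while the log2 values are close to symmetric.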

### PTEN shows the normal distribution after log transform

![](img/PTEN_raw.png)
![](img/PTEN_log2.png)

### What's happening with TP53

What does the camel-like curve mean?

![](img/TP53_raw.png)
![](img/TP53_log2.png)

### What's happening with TP53 - Different Expression by tissue

![](img/TP53_by_tissue.png)

# Experimental Variations


### Units RPKM, FPKM, TPM

 * RPKM (single-end data) - Reads Per Kilobase of transcript per Million mapped reads
 * FPKM (paired-end data) - Fragments Per Kilobase of transcript per Million mapped reads
 * TPM  - Transcripts Per Million RNA molecules

RPKM and FPKM are normalized within a sample: they correct for transcript length and sequencing depth, but their total across all genes differs from sample to sample.

TPM is better suited to comparisons between samples, because the values in every sample sum to one million.

If you have RPKM (single-end data) or FPKM (paired-end data) computed for a set of genes or transcripts, convert to TPM by dividing each value by the sum of the values for all genes (or transcripts) in that sample and multiplying by one million:

TPM = FPKM / (sum of FPKM over all genes/transcripts) * 10^6

This is equivalent to:

TPM = (mean transcript length in kilobases) x RPKM

where "mean transcript length" is the expression-weighted mean of the lengths of all isoforms.  Because the mean transcript length can change from sample to sample, TPM is generally recommended over RPKM.

Since TPM is independent of the mean transcript length, it should be more comparable between samples.
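
The conversion can be sketched in a few lines of pandas. The function name `fpkm_to_tpm` and the small table of FPKM values are hypothetical, made up for illustration only.

```python
import pandas as pd

def fpkm_to_tpm(fpkm: pd.DataFrame) -> pd.DataFrame:
    """Rescale each sample (column) so that its values sum to one million."""
    return fpkm.div(fpkm.sum(axis=0), axis=1) * 1e6

# Hypothetical FPKM values: rows are genes, columns are samples.
fpkm = pd.DataFrame(
    {"sample_a": [10.0, 30.0, 60.0], "sample_b": [5.0, 5.0, 40.0]},
    index=["gene1", "gene2", "gene3"],
)

tpm = fpkm_to_tpm(fpkm)
print(tpm)
print(tpm.sum(axis=0))  # every sample now totals 1e6
```

The two samples start with different FPKM totals (100 and 50) and end with the same TPM total, which is what makes the columns comparable.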


### Experimental Differences

 * Gene models used
  * Ensembl69 vs Ensembl75 vs Ensembl92
  * Collapsed model, or individual transcripts?
 * Ribodepleted
 
Other complications:
 * known factors - age, gender
 * sample treatment - FFPE samples
 * tumour content - mixtures
 * hidden factors - environment, temperature

## GSC RNA Expression Uses

### Spearman Correlation Against Known Cancer Types

Identify cancer types by gene expression patterns using LOGANOVA Spearman correlations.  By C Chng and P Eirew

![Spearman correlation plot](img/Spearman_tcga.png)
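
A minimal sketch of the underlying idea: rank the genes in a sample and in each reference profile, then correlate the ranks. This is plain Spearman correlation computed as the Pearson correlation of ranks; it does not reproduce LOGANOVA. The function name, the reference profiles and the sample values are all hypothetical.

```python
import numpy as np
import pandas as pd

def spearman_to_references(sample: pd.Series, references: pd.DataFrame) -> pd.Series:
    """Spearman correlation of one sample against each reference column."""
    genes = sample.index.intersection(references.index)
    sample_ranks = sample.loc[genes].rank()
    ref_ranks = references.loc[genes].rank()
    # Spearman correlation is the Pearson correlation of the ranks.
    scores = {
        name: np.corrcoef(sample_ranks, ref_ranks[name])[0, 1]
        for name in ref_ranks.columns
    }
    return pd.Series(scores).sort_values(ascending=False)

# Hypothetical median expression per cancer type (rows are genes).
genes = [f"gene{i}" for i in range(1, 7)]
references = pd.DataFrame(
    {
        "type_A": [1.0, 5.0, 20.0, 50.0, 100.0, 300.0],
        "type_B": [300.0, 100.0, 50.0, 20.0, 5.0, 1.0],
        "type_C": [20.0, 1.0, 300.0, 5.0, 100.0, 50.0],
    },
    index=genes,
)
sample = pd.Series([2.0, 4.0, 35.0, 30.0, 250.0, 900.0], index=genes)

scores = spearman_to_references(sample, references)
print(scores)  # best-matching reference type first
```

Because only ranks are used, the sample's much larger absolute values (900 vs 300 for the top gene) do not affect the score, which is one reason rank correlation suits data from different pipelines.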

### Cibersort

Uses patterns of mRNA expression levels to quantify immune cell types in a mixed sample.

### Within Project Correlation (WPC)

<can display>?

### Report Outliers and KB matching

Identify unusual gene expression.  The literature reports which expression changes are relevant.

Current method is a number of simple cutoff values, based on available data:
 * 2 x Illumina BodyMap expression comparator
 * TCGA top quartile
  * > 2 inter-quartile ranges ~ top 1%
 * > TCGA adjacent normal
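
A sketch of how cutoffs of this kind could be applied to one gene. It is not the production code: the function name, the flag names and the toy reference values are hypothetical, and it assumes that "2 inter-quartile ranges" is measured above the third quartile. The adjacent-normal comparison is left out for brevity.

```python
import numpy as np

def expression_flags(sample_tpm, reference_tpm, comparator_tpm):
    """Apply three simple cutoffs to one gene's expression value."""
    q1, q3 = np.percentile(reference_tpm, [25, 75])
    iqr = q3 - q1
    return {
        "above_2x_comparator": bool(sample_tpm > 2 * comparator_tpm),
        "in_top_quartile": bool(sample_tpm > q3),
        # Assumption: the distance is measured from the third quartile.
        "above_2_iqr": bool(sample_tpm > q3 + 2 * iqr),
    }

# Toy reference distribution for one gene (values 1..9, so Q1=3, Q3=7).
reference = np.arange(1.0, 10.0)

high = expression_flags(16.0, reference, comparator_tpm=5.0)
moderate = expression_flags(12.0, reference, comparator_tpm=5.0)
print("high:    ", high)
print("moderate:", moderate)
```

With these toy numbers the 2 x IQR line sits at 15 TPM, so a value of 12 is in the top quartile but is not reported as an extreme outlier.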
 
#### TF4CN Problem Case

Many outliers, but not obviously wrong.

 {Show histogram - 0 and 100 % too high} 


# Current Work

## Improve Reference Values

Which genes have strange behaviour in GTEX/TCGA?
 * Check curves for skew and kurtosis outliers
  * Can these cases be sub-divided into further subtypes?
   * Sex differences?
   * Age differences?
   * Drug? Caffeine? Medication?
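
One way such a screen could look, as a sketch on synthetic data. The three example genes, the seed and the thresholds of 1.0 are assumptions for illustration, not values derived from GTEX or TCGA.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400

# Synthetic log2 expression: columns are genes, rows are samples.
log_expr = pd.DataFrame(
    {
        "unimodal_gene": rng.normal(5.0, 1.0, n),
        # Two sub-populations, e.g. expression that differs by sex or tissue.
        "bimodal_gene": np.concatenate(
            [rng.normal(2.0, 0.5, n // 2), rng.normal(8.0, 0.5, n // 2)]
        ),
        "skewed_gene": rng.exponential(2.0, n),
    }
)

# pandas reports excess kurtosis, so a normal curve scores about 0.
shape = pd.DataFrame(
    {"skew": log_expr.skew(), "excess_kurtosis": log_expr.kurt()}
)

# Assumed thresholds for "strange" curve shapes.
strange = shape[(shape["skew"].abs() > 1.0) | (shape["excess_kurtosis"].abs() > 1.0)]
print(shape.round(2))
print("flagged:", list(strange.index))
```

A camel-like (bimodal) curve shows up as strongly negative excess kurtosis, which makes it a candidate for splitting by a known factor.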

Separations of known values, like sex and age, can be used as tests for classification schemes.

Current known methods:
 * TMM
 * PEER Bayesian score of significance
 

# Expression Effects

## Genes are highly correlated in expression

 * Expression as a pathway property
 * Altered pathway expression - re-wiring concept
 * Measure metabolite(s) cycle activity
 * Measure Transcription Factor Levels
 * Other genomic influences:
  * Many SNPs, copy number changes alter expression
  * Driver fusions creating inappropriate activity

In [2]:
from gtex import GTEX_preprocess
importlib.reload(GTEX_preprocess)
gtex = GTEX_preprocess.GtexData()


2019-08-20 10:12:59 - root - INFO - logging
2019-08-20 10:12:59 - root - INFO - logging
2019-08-20 10:12:59 - root - INFO - Found: /home/dbleile/science/Xpress/gtex/data_cache/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct
2019-08-20 10:12:59 - root - INFO - Found: /home/dbleile/science/Xpress/gtex/data_cache/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct
2019-08-20 10:12:59 - root - INFO - Found: /home/dbleile/science/Xpress/gtex/data_cache/GTEx_v7_Annotations_SampleAttributesDS.txt
2019-08-20 10:12:59 - root - INFO - Found: /home/dbleile/science/Xpress/gtex/data_cache/GTEx_Analysis_v7_Annotations_SampleAttributesDD.xlsx
2019-08-20 10:12:59 - root - INFO - Found: /home/dbleile/science/Xpress/gtex/data_cache/GTEx_v7_Annotations_SubjectPhenotypesDS.txt


In [3]:
tp53 = gtex.gtex_gene_table(['TP53', 'PTEN'])

2019-08-20 10:13:07 - root - INFO - Loading reference GTEX7 row table
2019-08-20 10:13:20 - root - INFO - loading 3 GTEX TPM rows
2019-08-20 10:13:31 - root - INFO - Gtex TPM partial table took 0.3966263214747111 minutes


In [6]:
gtpm = gtex.gtex_gene_table(None)

2019-08-20 10:15:56 - root - INFO - loading all rows
2019-08-20 10:19:41 - root - INFO - Gtex TPM partial table took 3.757295882701874 minutes


In [7]:
# gtpm was loaded in the previous cell; show the first rows
gtpm.head()

Unnamed: 0,Name,Description,GTEX-1117F-0226-SM-5GZZ7,GTEX-111CU-1826-SM-5GZYN,GTEX-111FC-0226-SM-5N9B8,GTEX-111VG-2326-SM-5N9BK,GTEX-111YS-2426-SM-5GZZQ,GTEX-1122O-2026-SM-5NQ91,GTEX-1128S-2126-SM-5H12U,GTEX-113IC-0226-SM-5HL5C,...,GTEX-ZVE2-0006-SM-51MRW,GTEX-ZVP2-0005-SM-51MRK,GTEX-ZVT2-0005-SM-57WBW,GTEX-ZVT3-0006-SM-51MT9,GTEX-ZVT4-0006-SM-57WB8,GTEX-ZVTK-0006-SM-57WBK,GTEX-ZVZP-0006-SM-51MSW,GTEX-ZVZQ-0006-SM-51MR8,GTEX-ZXES-0005-SM-57WCB,GTEX-ZXG5-0005-SM-57WCN
0,ENSG00000223972.4,DDX11L1,0.1082,0.1158,0.02104,0.02329,0.0,0.04641,0.03076,0.09358,...,0.09012,0.1462,0.1045,0.0,0.6603,0.695,0.1213,0.4169,0.2355,0.145
1,ENSG00000227232.4,WASH7P,21.4,11.03,16.75,8.172,7.658,9.372,10.08,13.56,...,3.926,13.13,5.537,5.789,8.439,7.843,12.39,12.53,8.027,12.76
2,ENSG00000243485.2,MIR1302-11,0.1602,0.06433,0.04674,0.0,0.05864,0.0,0.1367,0.2079,...,0.08008,0.03607,0.0,0.1059,0.0,0.06432,0.05388,0.0,0.04756,0.05367
3,ENSG00000237613.2,FAM138A,0.05045,0.0,0.02945,0.0326,0.0,0.0,0.0861,0.131,...,0.0,0.06818,0.07309,0.03336,0.0,0.08105,0.0,0.05304,0.02996,0.03381
4,ENSG00000268020.2,OR4G4P,0.0,0.0,0.0,0.0,0.0,0.0,0.1108,0.05619,...,0.0,0.0,0.0,0.0,0.0,0.0,0.08739,0.0,0.0,0.04353
