-
Notifications
You must be signed in to change notification settings - Fork 1
/
tissue.Rmd
323 lines (231 loc) · 9.94 KB
/
tissue.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
---
title: "Tissue specificity"
date: "`r Sys.Date()`"
output:
workflowr::wflow_html:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```
A key measure in [information theory](https://en.wikipedia.org/wiki/Entropy_(information_theory)) is entropy, which is:
>"The amount of uncertainty involved in a random process; the lower the uncertainty, the lower the entropy."
For example, there is lower entropy in a fair coin flip versus a fair die roll since there are more possible outcomes with a die roll ${1, 2, 3, 4, 5, 6}$ compared to a coin flip ${H, T}$.
Entropy is measured in bits, which has a single binary value of either 1 or 0. Since a coin toss has only two outcomes, each toss has one bit of information. However, if the coin is not fair, meaning that it is biased towards either heads or tails, there is less uncertainty, i.e. lower entropy; if a die lands on heads 60% of the time, we are more certain of heads than in a fair die (50% heads).
There is a nice example on the [Information Content Wikipedia page](https://en.wikipedia.org/wiki/Information_content#Derivation) explaining the relationship between entropy (uncertainty) and information content.
>For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin, "Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning." Assuming one not residing near the Earth's poles or polar circles, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.
There is no uncertainty in the above statement hence that piece of information has 0 bits.
Mathematically, the Shannon entropy is defined as:
$$ -\sum_{i=1}^n p(x_i) log_{b}p(x_i) $$
Let's test this out in R using the coin flip example above.
Firstly let's define the `entropy` function according to the formula above.
```{r entropy}
entropy <- function(x){
-sum(log2(x)*x)
}
```
Generate 100 fair coin tosses.
```{r fair_res}
set.seed(1984)
fair_res <- rbinom(n = 100, size = 1, prob = 0.5)
prop.table(table(fair_res))
```
Calculate Shannon entropy of fair coin tosses.
```{r entropy_fair_res}
entropy(as.vector(prop.table(table(fair_res))))
```
Generate 100 biased coin tosses.
```{r unfair_res}
set.seed(1984)
unfair_res <- rbinom(n = 100, size = 1, prob = 0.2)
prop.table(table(unfair_res))
```
Calculate Shannon entropy of biased coin tosses.
```{r entropy_unfair_res}
entropy(as.vector(prop.table(table(unfair_res))))
```
A biased coin will is more predicable, i.e. has less uncertainty, and therefore has less entropy than a fair coin.
In fact a fair coin has the highest entropy. This makes sense because when it's 50/50, it is the most uncertain!
```{r coin_highest_entropy}
x <- seq(0.05, 0.95, 0.05)
y <- 1 - x
e <- sapply(seq_along(x), function(i) entropy(c(x[i], y[i])))
plot(x, e, xlab = "Probability of heads", ylab = "Entropy", pch = 16, xaxt = 'n')
axis(side=1, at=x)
abline(v = 0.5, lty = 3)
abline(h = 1, lty = 3)
```
## Tissue specificity
What does all this have to do with measuring tissue specificity? I came across this paper: [Promoter features related to tissue specificity as measured by Shannon entropy](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2005-6-4-r33) and it spurred me to learn about entropy. Basically, if a gene is expressed in a tissue specific manner, we are more certain of its expression and hence there is lower entropy.
Let's begin by defining a Shannon entropy function for use with tissue expression. The code is from the [R help mail](https://stat.ethz.ch/pipermail/r-help/2008-July/167112.html). This function includes a simple normalisation method of normalising each value by the sum. In addition, if any value is less than 0 or if the sum of all values is less than or equal to zero, the function will return an `NA`.
```{r shannon_entropy}
shannon.entropy <- function(p){
if (min(p) < 0 || sum(p) <= 0) return(NA)
p.norm <- p[p>0]/sum(p)
-sum(log2(p.norm)*p.norm)
}
```
Let's imagine we have 30 samples and we have a gene that is ubiquitously expressed.
```{r all_gene}
set.seed(1984)
all_gene <- rnorm(n = 30, mean = 50, sd = 15)
shannon.entropy(all_gene)
```
A ubiquitously expressed gene that is highly expressed in one sample.
```{r all_gene_one_higher}
all_gene_one_higher <- all_gene
all_gene_one_higher[30] <- 100
shannon.entropy(all_gene_one_higher)
```
Higher expression in half of the samples.
```{r half_gene}
set.seed(1984)
half_gene <- c(
rnorm(n = 15, mean = 50, sd = 10),
rnorm(n = 15, mean = 5, sd = 1)
)
shannon.entropy(half_gene)
```
Expression only in one sample.
```{r one_gene}
one_gene <- rep(0, 29)
one_gene[30] <- 50
shannon.entropy(one_gene)
```
Expression much higher in one sample.
```{r one_gene_with_bg}
one_gene_with_bg <- rep(1, 29)
one_gene_with_bg[30] <- 50
shannon.entropy(one_gene_with_bg)
```
Expression only in three samples.
```{r three_gene}
three_gene <- c(rep(1,27), 25, 65, 100)
shannon.entropy(three_gene)
```
Equal expression in all samples; note that the Shannon entropy will be the same regardless of the expression strength.
```{r all_gene_equal}
all_gene_equal <- rep(50, 30)
shannon.entropy(all_gene_equal)
```
Plot the expression patterns for the 6 scenarios.
```{r plot_entropy}
plot_entropy <- function(x){
barplot(x, main = round(shannon.entropy(x), 3), xlab = 'Samples', ylab = 'Expression')
}
par(mfrow=c(2,3))
examples <- list(all_gene_equal, all_gene, all_gene_one_higher, half_gene, three_gene, one_gene)
invisible(sapply(examples, plot_entropy))
```
## Other tissue specificity metrics
Metrics and code from [A benchmark of gene expression tissue-specificity metrics](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5444245/).
Tau:
$$ \tau = \frac{\sum_{i=1}^n (1 - \widehat x_i)}{n-1}; \widehat x_i = \frac{x_i}{\max_{1\le i \le n} (x_i)}$$
Implementation in R.
```{r tau}
tau <- function(x){
x <- (1-(x/max(x)))
res <- sum(x, na.rm=TRUE)
res/(length(x)-1)
}
sapply(examples, tau)
```
EE score (summary of the expression of all genes in tissue $i$):
$$ EE = \frac{x_i}{\sum^n_{i=1}x_i \times \frac{s_i}{\sum^n_{i=1} s_i}} = \frac{\sum^n_{i=1} s_i}{s_i} \times \frac{x_i}{\sum^n_{i=1}x_i}s_i $$
No implementation yet as this metric requires expression data.
TSI:
$$ TSI = \frac{\max_{1 \le i \le n}(x_i)}{\sum^n_{i=1}x_i} $$
Implementation in R.
```{r tsi}
tsi <- function(x){
max(x) / sum(x)
}
sapply(examples, tsi)
```
Gini coefficient, where $x_i$ has to be ordered from least to greatest:
$$ Gini = \frac{n + 1}{n} - \frac{2 \sum^n_{i=1} (n + 1 - i) x_i}{n \sum^n_{i=1}x_i} $$
Implementation in R (the `gini` function is missing in the supplementary data of the paper and the loaded packages do not have this function, so it is unclear where `gini` is from.). There is a `gini` function in the [reldist](https://cran.r-project.org/web/packages/reldist/index.html) package, which computes the Gini coefficient.
```{r reldist_gini}
reldist::gini
```
Calculate Gini coefficients (using the `reldist` code) for the examples. (To save you from installing `reldist`, I've copied the code.)
```{r gini}
gini <- function (x, weights = rep(1, length = length(x))) {
# code is from the reldist package by Mark S. Handcock
# https://cran.r-project.org/web/packages/reldist/index.html
ox <- order(x)
x <- x[ox]
weights <- weights[ox]/sum(weights)
p <- cumsum(weights)
nu <- cumsum(weights * x)
n <- length(nu)
nu <- nu/nu[n]
sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}
sapply(examples, gini)
```
Shannon entropy:
$$ H_g = - \sum^n_{i=1} p_i \times log_2(p_i); p_i = \frac{x_i}{\sum^n_{i=1}x_i} $$
Implementation in R.
```{r hg}
hg <- function(x, norm = FALSE){
p <- x / sum(x)
res <- -sum(p*log2(p), na.rm=TRUE)
if(norm){
res <- 1 - (res/log2(length(p)))
}
res
}
sapply(examples, hg)
sapply(examples, hg, norm = TRUE)
```
The z-score can be implemented in two ways: either only over-expressed genes are defined as tissue specific or the absolute distance from the mean is used, so that under-expressed genes are also defined as tissue specific.
$$ \zeta = \frac{x_i - \mu}{\sigma} $$
SPM:
$$ SPM = \frac{x_i^2}{\sum^n_{i=1}x^2_i} $$
Implementation in R.
```{r spm}
spm <- function(x, norm=TRUE){
res <- x^2/sum(x^2)
if (norm){
res <- max(res)
}
res
}
sapply(examples, spm)
```
PEM estimates how different the expression of the gene is relative to an expected expression, under the assumption of uniform expression in all tissues. $s_i$ is the summary of the expression of all genes in tissue $i$.
$$ PEM = log_{10} \left( \frac{\sum^n_{i=1}}{s_i} \times \frac{x_i}{\sum^n_{i=1} x_i} \right) $$
No implementation yet as this metric requires expression data.
### Comparing metrics
Compare the metrics.
```{r compare_metrics}
my_metrics <- data.frame(
eg = c('identical', 'all_exp', 'all_exp_one_higher', 'half_exp', 'three_exp', 'one_exp'),
tau = sapply(examples, tau),
tsi = sapply(examples, tsi),
gini = sapply(examples, gini),
hg = sapply(examples, hg, norm=TRUE),
spm = sapply(examples, spm)
)
pivot_longer(my_metrics, -eg, names_to = "metrics") |>
mutate(eg = factor(eg, levels = unique(eg))) |>
ggplot(data = _, aes(eg, value, group = metrics, colour = metrics)) +
geom_line() +
geom_point() +
theme_minimal()
```
Taking the average of all metrics.
```{r consensus}
my_metrics$mean <- apply(my_metrics[, -1], 1, mean)
my_metrics$median <- apply(my_metrics[, -1], 1, median)
pivot_longer(my_metrics, -eg, names_to = "metrics") |>
mutate(eg = factor(eg, levels = unique(eg))) |>
ggplot(data = _, aes(eg, value, group = metrics, colour = metrics)) +
geom_line() +
geom_point() +
theme_minimal()
```
## Summary
Equal expression amongst the 30 libraries resulted in a Shannon entropy of ~4.91 bits; this is similar to an even coin toss. This is close to 5 bits because we need 5 bits to transfer information on 30 samples. The more specific a gene is expressed, the less uncertainty, and therefore the lower the entropy.