# Analysis Notebook - Hierarchical Bayesian Modelling

## **NOTE**:

We assume that you have cloned the analysis repository and have changed (`cd`) into the parent directory. Before starting the analysis, make sure you have completed the dependency setup by following the instructions in **`dependencies/README.md`**. All paths defined in this Notebook are relative to the parent directory (repository). If you have not completed these steps, please close this Notebook and start again by following the guidelines above.

## Prerequisite input files

Before starting the execution of the following code, make sure you have available in the folders `sbas/data` and `sbas/assets` the files listed below as prerequisites.

### **`sbas/data`**
The present analysis requires the following files to be present in the folder **`sbas/data`**.


- [x] The contents of `data.tar.gz`, unpacked into the `sbas/data` folder with `tar xvzf data.tar.gz -C sbas/data`
- [x] `rmats_final.se.jc.ijc.txt`
- [x] `rmats_final.se.jc.sjc.txt`
- [x] `srr_pdata.csv`: the corrected GTEx data, as created by the forked `yarn` package on its `annes-changes` branch (https://github.com/TheJacksonLaboratory/yarn/tree/annes-changes), with the SRR data as used in the `rMATS 3.2.5` experiment.


Additionally, the file `GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct` is required. The script retrieves it from [`https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/`](https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz) and stores it in the folder `sbas/data` as well.


### **`sbas/assets`**
The present analysis requires the following files to be present in the folder **`sbas/assets`**.

- [x] `tissues.tsv`: metadata file with information on which tissues will be used for analysis; only tissues whose `include` column equals 1 are analysed
- [x] `splice-relevant-genes.txt`: list of RNA binding proteins that are annotated to splicing relevant functions from GO.

## Loading dependencies

If `conda` is available in your environment, you can install the required dependencies by running the following commands:


```bash
time conda install -y r-base==3.6.2 &&
conda install -y r-ggplot2 r-ggsci r-coda r-rstan r-rjags r-compute.es r-snakecase &&
Rscript -e 'install.packages("runjags", repos = "https://cloud.r-project.org/")'
```



In [1]:
# Start the clock!
start_time <- Sys.time()

In [2]:
# dataviz dependencies
library(ggplot2)
library(ggsci)
library(grid)
library(gridExtra)
library(stringr)
library(snakecase)

# BDA2E-utilities dependencies
library(rstan)
library(parallel)
library(rjags)
library(runjags)
library(compute.es)

“package ‘ggplot2’ was built under R version 3.6.3”
“package ‘ggsci’ was built under R version 3.6.3”
“package ‘gridExtra’ was built under R version 3.6.3”
“package ‘snakecase’ was built under R version 3.6.3”
“package ‘rstan’ was built under R version 3.6.3”
Loading required package: StanHeaders

“package ‘StanHeaders’ was built under R version 3.6.3”
rstan (Version 2.19.3, GitRev: 2e1f913d3ca3)

For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)

“package ‘rjags’ was built under R version 3.6.3”
Loading required package: coda

“package ‘coda’ was built under R version 3.6.3”

Attaching package: ‘coda’


The following object is masked from ‘package:rstan’:

    traceplot


Linked to JAGS 4.3.0

Loaded modules: basemod,bugs


Attaching package: ‘runjags’


The following object is masked from ‘package:rstan’:

   

In [3]:
file.exists("../data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct")

Download GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct from Google Cloud


In [4]:
# NOTE: the download URL below points at the GTEx v7 release
# (GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz), but the file is
# saved locally under a v8-style name, which the rest of the notebook expects.
if (!("GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct" %in% list.files("../data/"))) {
    message("Downloading GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct \nfrom https://console.cloud.google.com/storage/browser/_details/gtex_analysis_v7/rna_seq_data/ ..")
    system("wget -O ../data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz", intern = TRUE)
    message("Done!\n\n")
    message("Unzipping compressed file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz..")
    system("gunzip ../data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz", intern = TRUE)
    message("Done! \n\nThe file GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct can be found in ../data/")
}

List of tissues previously used for the hierarchical Bayesian modelling (the list is now read from `../assets/tissues.tsv` in the cells that follow):



```R
tissue.list<-c("Heart - Left Ventricle",
               "Breast - Mammary Tissue",
               "Brain - Cortex.Brain - Frontal Cortex (BA9).Brain - Anterior cingulate cortex (BA24)",
               "Adrenal Gland",
               "Adipose - Subcutaneous",
               "Muscle - Skeletal",
               "Thyroid",
               "Cells - Transformed fibroblasts",
               "Artery - Aorta",
               "Skin - Sun Exposed (Lower leg).Skin - Not Sun Exposed (Suprapubic)")
```

In [5]:
tissues_df <- readr::read_delim("../assets/tissues.tsv", delim = "\t")

Parsed with column specification:
cols(
  name = col_character(),
  female = col_double(),
  male = col_double(),
  include = col_double(),
  display.name = col_character()
)



In [6]:
tissue.list <- tissues_df$name[ tissues_df$include ==1]

In [7]:
message(length(tissue.list), " tissues")
cat(tissue.list, sep = "\n")

39 tissues



adipose_subcutaneous
adipose_visceral_omentum
adrenal_gland
artery_aorta
artery_coronary
artery_tibial
brain_caudate_basal_ganglia
brain_cerebellar_hemisphere
brain_cerebellum
brain_cortex
brain_frontal_cortex_ba_9
brain_hippocampus
brain_hypothalamus
brain_nucleus_accumbens_basal_ganglia
brain_putamen_basal_ganglia
brain_spinal_cord_cervical_c_1
breast_mammary_tissue
cells_cultured_fibroblasts
cells_ebv_transformed_lymphocytes
colon_sigmoid
colon_transverse
esophagus_gastroesophageal_junction
esophagus_mucosa
esophagus_muscularis
heart_atrial_appendage
heart_left_ventricle
liver
lung
muscle_skeletal
nerve_tibial
pancreas
pituitary
skin_not_sun_exposed_suprapubic
skin_sun_exposed_lower_leg
small_intestine_terminal_ileum
spleen
stomach
thyroid
whole_blood


In [8]:
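# `tissue_index` is not defined anywhere in this notebook; it must be supplied
# beforehand (for example as a notebook parameter). The outputs shown further
# down correspond to breast_mammary_tissue.
tissue <- tissue.list[tissue_index]  # can be replaced with a loop or argument to choose a different tissue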

In [9]:
tissue

## Pattern for choosing `topTable()` files from `limma`

```bash
# {as_site_type} + '_' + {tissue} + '_' + suffix_pattern 
se_skin_not_sun_exposed_suprapubic_AS_model_B_sex_as_events.csv
```

In [10]:
dataDir <- "../data/"
assetsDir <- "../assets/"
as_site_type <- "se"
suffix_pattern <- "AS_model_B_sex_as_events.csv"

file.with.de.results <- paste0(dataDir, as_site_type, "_", tissue, "_" , suffix_pattern  )
file.with.de.results
file.exists(file.with.de.results)
system( paste0("ls -l ", file.with.de.results), intern = TRUE )

In [11]:
events.table         <- read.table(file.with.de.results, sep = ",")
head(events.table, 2)

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
XIST-10154,-6.796072,1.38572,-38.73465,5.301573e-131,1.523859e-126,280.911
XIST-10149,-7.124726,1.597306,-38.68238,7.86853e-131,1.523859e-126,280.5225


## Add annotation columns to the topTable dataframe:

The feature information is encoded in the topTable dataframe as rownames. The `ID` and `geneSymbol` variables have been combined in the following pattern:

```console
{geneSymbol}-{ID} 
```

- `ID`: everything **_after_** the last occurrence of a hyphen `-`
example:
```R
stringr::str_replace("apples - oranges - bananas", "^.+-", "")
```

```console
# output:

' bananas'
```

- `geneSymbol`: everything **_before_** the last occurrence of a hyphen `-`
example:

```R
sub('-[^-]*$', '',"apples - oranges - bananas")
```

```console
# output:

'apples - oranges '
```

```diff
- NOTE: The above solution covers the cases where a hyphen is part of the geneSymbol.
```
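
As a minimal sketch of the two patterns applied to row names in the `{geneSymbol}-{ID}` layout, the example below uses base `sub()` for both steps (the notebook itself uses `stringr::str_replace` for the `ID`). The row name `HLA-DRB1-42` is an invented illustration of a gene symbol that contains a hyphen; it is not taken from the data.

```R
# Row names in the {geneSymbol}-{ID} pattern.
# "HLA-DRB1-42" is a hypothetical example with a hyphen inside the gene symbol.
row_ids <- c("XIST-10154", "RPLP0-28659", "HLA-DRB1-42")

# ID: the greedy "^.+-" consumes everything up to and including the LAST hyphen.
event_id <- sub("^.+-", "", row_ids)

# geneSymbol: remove the last hyphen together with the hyphen-free tail after it.
gene_symbol <- sub("-[^-]*$", "", row_ids)

data.frame(row_ids, gene_symbol, event_id)
```

Because both patterns anchor on the last hyphen, any hyphens inside the gene symbol are left untouched.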

In [12]:
cols_initially <- colnames(events.table)
cols_initially

In [13]:
events.table[["ID"]] <- stringr::str_replace(rownames(events.table),  "^.+-", "")
events.table[["gene_name"]] <- sub('-[^-]*$', '', rownames(events.table))

In [14]:
keepInOrderCols <- c("gene_name", "ID", cols_initially)

In [15]:
events.table <- events.table[ , keepInOrderCols ]

In [16]:
tail(events.table, 2)

Unnamed: 0_level_0,gene_name,ID,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
RPLP0-28659,RPLP0,28659,0.02222658,9.584065,0.1354014,0.8923696,0.9544973,-7.100906
CD74-25493,CD74,25493,-0.01277937,9.839909,-0.1084203,0.9137222,0.9639965,-7.254767


## Define filepaths of required inputs

`file.with.de.results` has been defined above

In [17]:
rbp.table.name        <- paste0(assetsDir, "splice-relevant-genes.txt")
file.exists(rbp.table.name)

In [18]:
events.table.name     <- paste0(dataDir, "fromGTF.SE.txt")
file.exists(events.table.name)

In [19]:
inc.counts.file.name  <- paste0(dataDir, "rmats_final.se.jc.ijc.txt")
file.exists(inc.counts.file.name)

In [20]:
skip.counts.file.name <- paste0(dataDir, "rmats_final.se.jc.sjc.txt")
file.exists(skip.counts.file.name)

In [21]:
metadata.file.name    <- paste0(dataDir, "srr_pdata.csv")
file.exists(metadata.file.name)

In [22]:
expression.file.name  <- paste0(dataDir, "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct")
file.exists(expression.file.name)

## Use the defined filepaths to load/read in the tables

Load the skip and inclusion count matrices, and the list of RNA binding proteins that are annotated to either:
- mRNA splicing, via spliceosome `(GO:0000398)`,
- regulation of mRNA splicing, via spliceosome `(GO:0048024)`, or 
- both. 

The table has the following columns:
- the Gene Symbol (`Gene`)
- the Uniprot ID (`uprot.id`)
- the NCBI Gene ID (`gene.id`)
- boolean columns indicating annotation to
  - `S` = mRNA splicing, via spliceosome `(GO:0000398)` and
  - `R` = regulation of mRNA splicing, via spliceosome `(GO:0048024)`.

### Filtering of the `topTable()` object

- `abs(events.table$logFC)>=log2(1.5)`
- `events.table$adj.P.Val<=0.05`
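
To make the two criteria concrete, here is a small sketch on an invented four-row table (the values and the `event_*` row names are illustrative only). An absolute `logFC` of at least `log2(1.5)`, roughly 0.585, corresponds to a fold change of at least 1.5 in either direction.

```R
# Toy topTable-like data frame; all values are made up for illustration.
toy <- data.frame(
  logFC     = c(-6.80, 0.84, 0.30, -0.70),
  adj.P.Val = c(1e-126, 4e-08, 1e-04, 0.20),
  row.names = c("event_a", "event_b", "event_c", "event_d")
)

# Keep events that pass BOTH the effect-size and the adjusted p-value cutoffs.
keep <- abs(toy$logFC) >= log2(1.5) & toy$adj.P.Val <= 0.05
toy[keep, ]
```

Here `event_c` fails the fold-change cutoff and `event_d` fails the adjusted p-value cutoff, so only `event_a` and `event_b` remain.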

In [23]:
dim(events.table)
events.table <- events.table[abs(events.table$logFC)>=log2(1.5) & events.table$adj.P.Val<=0.05,]
dim(events.table)
head(events.table,2)

Unnamed: 0_level_0,gene_name,ID,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
XIST-10154,XIST,10154,-6.796072,1.38572,-38.73465,5.301573e-131,1.523859e-126,280.911
XIST-10149,XIST,10149,-7.124726,1.597306,-38.68238,7.86853e-131,1.523859e-126,280.5225


Make sure the command `gunzip sbas/data/fromGTF.*` has been executed before running the next cell, as the files are expected to be uncompressed.


In [24]:
annot.table  <- read.table(events.table.name,header=T)
dim(annot.table)
head(annot.table, 1)

Unnamed: 0_level_0,ID,GeneID,geneSymbol,chr,strand,exonStart_0base,exonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
Unnamed: 0_level_1,<int>,<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,ENSG00000034152.18,MAP2K3,chr17,+,21287990,21288091,21284709,21284969,21295674,21295769


In [25]:
merged.table <- merge(events.table, annot.table, by="ID")

In [26]:
dim(merged.table)
head(merged.table, 2)

Unnamed: 0_level_0,ID,gene_name,logFC,AveExpr,t,P.Value,adj.P.Val,B,GeneID,geneSymbol,chr,strand,exonStart_0base,exonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>
1,10012,RBM4B,0.8450169,1.183428,6.352709,6.335981e-10,4.076604e-08,12.144677,ENSG00000173914.12,RBM4B,chr11,-,66670935,66670983,66668970,66669291,66676667,66677091
2,10013,RBM4B,-0.7621263,1.752806,-4.68935,3.885523e-06,9.641124e-05,3.795546,ENSG00000173914.12,RBM4B,chr11,-,66668614,66669291,66664997,66665578,66676667,66677091


In [27]:
rbp.table    <- read.table(rbp.table.name,sep="\t",header=TRUE)
dim(rbp.table)
head(rbp.table, 1)

Unnamed: 0_level_0,Gene,uprot.id,gene.id,S,R,omim
Unnamed: 0_level_1,<fct>,<fct>,<int>,<lgl>,<lgl>,<fct>
1,AAR2,Q9Y312,25980,True,False,


Make sure the command `gunzip sbas/data/rmats_final.se.jc.*jc.*` has been executed before running the next cell, as the files are expected to be uncompressed.


In [28]:
inc.counts   <- as.data.frame(data.table::fread(inc.counts.file.name))
dim(inc.counts)
inc.counts[1:2,1:3]

Unnamed: 0_level_0,ID,SRR1068788,SRR1068808
Unnamed: 0_level_1,<int>,<int>,<int>
1,1,0,0
2,2,26,247


In [29]:
skip.counts  <- as.data.frame(data.table::fread(skip.counts.file.name))
dim(skip.counts)
skip.counts[1:2,1:3]

Unnamed: 0_level_0,ID,SRR1068788,SRR1068808
Unnamed: 0_level_1,<int>,<int>,<int>
1,1,2,0
2,2,0,0


## Check `dim()` of loaded objects

In [30]:
dim(events.table)
dim(annot.table)
dim(merged.table)
dim(rbp.table)
dim(inc.counts)
dim(skip.counts)

## Read sample info

In [31]:
metadata.file.name
file.exists(metadata.file.name)
system(paste0("ls -l", " ../data/srr_pdata.csv"), intern = TRUE)

In [32]:
meta.data <- readr::read_csv(metadata.file.name)
dim(meta.data)
head(meta.data, 1)

Parsed with column specification:
cols(
  .default = col_double(),
  SAMPID = col_character(),
  SMATSSCR = col_character(),
  SMCENTER = col_character(),
  SMPTHNTS = col_character(),
  SMTS = col_character(),
  SMTSD = col_character(),
  SMUBRID = col_character(),
  SMNABTCH = col_character(),
  SMNABTCHT = col_character(),
  SMNABTCHD = col_character(),
  SMGEBTCH = col_character(),
  SMGEBTCHD = col_character(),
  SMGEBTCHT = col_character(),
  SMAFRZE = col_character(),
  SMGTC = col_logical(),
  SMNUMGPS = col_logical(),
  SM550NRM = col_logical(),
  SM350NRM = col_logical(),
  SMMNCPB = col_logical(),
  SMMNCV = col_logical()
  # ... with 6 more columns
)

See spec(...) for full column specifications.



SAMPID,SMATSSCR,SMCENTER,SMPTHNTS,SMRIN,SMTS,SMTSD,SMUBRID,SMTSISCH,SMTSPAX,⋯,SMRRNART,SME1MPRT,SMNUM5CD,SMDPMPRT,SME2PCTS,SUBJID,SEX,AGE,DTHHRDY,SRR
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<lgl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<chr>
GTEX.PW2O.0006.SM.2I3DV,,B1,,7.4,Blood,Whole Blood,13756,-126,,⋯,0.00351302,0.859573,,0,50.6829,GTEX-PW2O,1,20-29,0,SRR604002


In [33]:
meta.data$SMTSD[1:3]

In [34]:
meta.data[["SMTSD"]] <- as.character(meta.data[["SMTSD"]])

In [35]:
meta.data$SMTSD[1:3]

In [36]:
meta.data <- meta.data[ snakecase::to_snake_case(meta.data$SMTSD) == tissue,]

In [37]:
tissue
dim(meta.data)
head(meta.data,1)

SAMPID,SMATSSCR,SMCENTER,SMPTHNTS,SMRIN,SMTS,SMTSD,SMUBRID,SMTSISCH,SMTSPAX,⋯,SMRRNART,SME1MPRT,SMNUM5CD,SMDPMPRT,SME2PCTS,SUBJID,SEX,AGE,DTHHRDY,SRR
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<lgl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<chr>
GTEX.S4Q7.1126.SM.4AD6R,0,B1,2 aliquots,6.9,Breast,Breast - Mammary Tissue,8367,178,395,⋯,0.00513003,0.995356,,0,50.1072,GTEX-S4Q7,1,20-29,0,SRR1100893


In [38]:
# Undo snakecase of SMTSD
tissue
tissue <- unique(meta.data$SMTSD)
tissue

In [39]:
dim(inc.counts)
inc.counts   <- inc.counts[,colnames(inc.counts) %in% meta.data$SRR]
dim(inc.counts)

In [40]:
dim(skip.counts)
skip.counts  <- skip.counts[,colnames(skip.counts) %in% meta.data$SRR]
dim(skip.counts)

In [41]:
sd.threshold <- quantile(apply(inc.counts,1,sd)+apply(skip.counts,1,sd),0.95)
sd.threshold

In [42]:
dim(skip.counts)
skip.counts  <- skip.counts[rownames(skip.counts) %in% merged.table$ID,]
dim(skip.counts)

In [43]:
dim(inc.counts)
inc.counts   <- inc.counts[rownames(inc.counts) %in% merged.table$ID,]
dim(inc.counts)

In [44]:
nrow(skip.counts)>100

In [45]:
if (nrow(skip.counts)>100)
{
  select.events <- apply(inc.counts,1,sd)+apply(skip.counts,1,sd)>sd.threshold
  inc.counts    <- inc.counts[select.events,]
  skip.counts   <- skip.counts[select.events,]
  merged.table  <- merged.table[select.events,]
}
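
The selection step above keeps only the most variable events, scoring each event by the sum of the standard deviations of its inclusion and skip counts. The sketch below reproduces only that scoring and thresholding on simulated Poisson counts; the matrix sizes, the `lambda` values and the seed are arbitrary assumptions, and unlike the notebook it computes the threshold on the same set of events it filters.

```R
set.seed(1)

# Simulated count matrices: 200 events (rows) x 6 samples (columns).
inc  <- matrix(rpois(200 * 6, lambda = 20), nrow = 200)
skip <- matrix(rpois(200 * 6, lambda = 5),  nrow = 200)

# Per-event variability score: sd of inclusion counts plus sd of skip counts.
variability <- apply(inc, 1, sd) + apply(skip, 1, sd)

# 95th percentile of the score, used as the cutoff.
threshold <- quantile(variability, 0.95)

# Logical vector marking the events above the cutoff (about the top 5%).
selected <- variability > threshold
sum(selected)
```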

In [46]:
dim(inc.counts)
dim(skip.counts)
dim(merged.table)

## Read expression data:

In [47]:
expression.file.name
file.exists(expression.file.name)

In [48]:
expression.mat <- read.table(expression.file.name, 
                             nrows = 1,
                             sep = "\t",
                             header = T,
                             skip = 2)

In [49]:
dim(expression.mat)
head(expression.mat, 2)

Unnamed: 0_level_0,Name,Description,GTEX.1117F.0226.SM.5GZZ7,GTEX.111CU.1826.SM.5GZYN,GTEX.111FC.0226.SM.5N9B8,GTEX.111VG.2326.SM.5N9BK,GTEX.111YS.2426.SM.5GZZQ,GTEX.1122O.2026.SM.5NQ91,GTEX.1128S.2126.SM.5H12U,GTEX.113IC.0226.SM.5HL5C,⋯,GTEX.ZVE2.0006.SM.51MRW,GTEX.ZVP2.0005.SM.51MRK,GTEX.ZVT2.0005.SM.57WBW,GTEX.ZVT3.0006.SM.51MT9,GTEX.ZVT4.0006.SM.57WB8,GTEX.ZVTK.0006.SM.57WBK,GTEX.ZVZP.0006.SM.51MSW,GTEX.ZVZQ.0006.SM.51MR8,GTEX.ZXES.0005.SM.57WCB,GTEX.ZXG5.0005.SM.57WCN
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ENSG00000223972.4,DDX11L1,0.1082,0.1158,0.02104,0.02329,0,0.04641,0.03076,0.09358,⋯,0.09012,0.1462,0.1045,0,0.6603,0.695,0.1213,0.4169,0.2355,0.145


In [50]:
colnames(expression.mat)[1:3]

In [51]:
colnames.expression.mat <- colnames(expression.mat)

In [52]:
length(colnames.expression.mat)
colnames.expression.mat[1:4]

In [53]:
length(colnames.expression.mat)

In [54]:
total.samples           <- length(colnames.expression.mat)
total.samples

In [55]:
meta.data$SAMPID[1]
gsub("-","\\.",meta.data$SAMPID[1])

In [56]:
meta.data$SAMPID   <- gsub("-","\\.",meta.data$SAMPID)

In [57]:
dim(meta.data)
meta.data               <- meta.data[meta.data$SAMPID %in% colnames(expression.mat),]
dim(meta.data)

In [58]:
head(expression.mat)

Unnamed: 0_level_0,Name,Description,GTEX.1117F.0226.SM.5GZZ7,GTEX.111CU.1826.SM.5GZYN,GTEX.111FC.0226.SM.5N9B8,GTEX.111VG.2326.SM.5N9BK,GTEX.111YS.2426.SM.5GZZQ,GTEX.1122O.2026.SM.5NQ91,GTEX.1128S.2126.SM.5H12U,GTEX.113IC.0226.SM.5HL5C,⋯,GTEX.ZVE2.0006.SM.51MRW,GTEX.ZVP2.0005.SM.51MRK,GTEX.ZVT2.0005.SM.57WBW,GTEX.ZVT3.0006.SM.51MT9,GTEX.ZVT4.0006.SM.57WB8,GTEX.ZVTK.0006.SM.57WBK,GTEX.ZVZP.0006.SM.51MSW,GTEX.ZVZQ.0006.SM.51MR8,GTEX.ZXES.0005.SM.57WCB,GTEX.ZXG5.0005.SM.57WCN
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ENSG00000223972.4,DDX11L1,0.1082,0.1158,0.02104,0.02329,0,0.04641,0.03076,0.09358,⋯,0.09012,0.1462,0.1045,0,0.6603,0.695,0.1213,0.4169,0.2355,0.145


In [59]:
meta.data <- meta.data[!duplicated(meta.data$SAMPID),]

In [60]:
dim(meta.data)

In [61]:
inc.counts <- inc.counts[,colnames(inc.counts) %in% meta.data$SRR]
dim(inc.counts)
head(inc.counts,1)

Unnamed: 0_level_0,SRR1068977,SRR1068999,SRR1070208,SRR1071084,SRR1071905,SRR1074860,SRR1075484,SRR1076219,SRR1076441,SRR1077139,⋯,SRR660283,SRR662306,SRR662378,SRR662811,SRR808428,SRR811285,SRR812198,SRR815208,SRR820571,SRR821498
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
676,22,1,16,6,18,6,34,14,16,25,⋯,8,21,25,22,14,29,26,22,21,13


In [62]:
skip.counts <- skip.counts[,colnames(skip.counts) %in% meta.data$SRR]
dim(skip.counts)
head(skip.counts, 1)

Unnamed: 0_level_0,SRR1068977,SRR1068999,SRR1070208,SRR1071084,SRR1071905,SRR1074860,SRR1075484,SRR1076219,SRR1076441,SRR1077139,⋯,SRR660283,SRR662306,SRR662378,SRR662811,SRR808428,SRR811285,SRR812198,SRR815208,SRR820571,SRR821498
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
676,292,616,320,195,399,285,83,515,1026,139,⋯,30,728,261,314,89,270,364,108,198,185


In [63]:
meta.data <- meta.data[meta.data$SRR %in% colnames(inc.counts),]
dim(meta.data)
head(meta.data, 1)

SAMPID,SMATSSCR,SMCENTER,SMPTHNTS,SMRIN,SMTS,SMTSD,SMUBRID,SMTSISCH,SMTSPAX,⋯,SMRRNART,SME1MPRT,SMNUM5CD,SMDPMPRT,SME2PCTS,SUBJID,SEX,AGE,DTHHRDY,SRR
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<lgl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<chr>
GTEX.S4Q7.1126.SM.4AD6R,0,B1,2 aliquots,6.9,Breast,Breast - Mammary Tissue,8367,178,395,⋯,0.00513003,0.995356,,0,50.1072,GTEX-S4Q7,1,20-29,0,SRR1100893


In [64]:
colnames.expression.mat[1:4]

In [65]:
dim(expression.mat)
head(expression.mat, 1)

Unnamed: 0_level_0,Name,Description,GTEX.1117F.0226.SM.5GZZ7,GTEX.111CU.1826.SM.5GZYN,GTEX.111FC.0226.SM.5N9B8,GTEX.111VG.2326.SM.5N9BK,GTEX.111YS.2426.SM.5GZZQ,GTEX.1122O.2026.SM.5NQ91,GTEX.1128S.2126.SM.5H12U,GTEX.113IC.0226.SM.5HL5C,⋯,GTEX.ZVE2.0006.SM.51MRW,GTEX.ZVP2.0005.SM.51MRK,GTEX.ZVT2.0005.SM.57WBW,GTEX.ZVT3.0006.SM.51MT9,GTEX.ZVT4.0006.SM.57WB8,GTEX.ZVTK.0006.SM.57WBK,GTEX.ZVZP.0006.SM.51MSW,GTEX.ZVZQ.0006.SM.51MR8,GTEX.ZXES.0005.SM.57WCB,GTEX.ZXG5.0005.SM.57WCN
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ENSG00000223972.4,DDX11L1,0.1082,0.1158,0.02104,0.02329,0,0.04641,0.03076,0.09358,⋯,0.09012,0.1462,0.1045,0,0.6603,0.695,0.1213,0.4169,0.2355,0.145


In [66]:
tissue <- unique(meta.data$SMTSD [ meta.data$SMTSD == tissue])
tissue

In [67]:
col.in.tissue<-c()
for (col in colnames.expression.mat) {
  # TRUE only for sample columns that belong to the selected tissue (subject GTEX-11ILO is excluded)
  col.in.tissue<-c(col.in.tissue, (col %in% meta.data$SAMPID) && (meta.data$SMTSD[which(meta.data$SAMPID==col)] %in% tissue) && (meta.data$SUBJID[which(meta.data$SAMPID==col)]!='GTEX-11ILO'))
}

In [68]:
length(col.in.tissue)
table(col.in.tissue)

col.in.tissue
FALSE  TRUE 
11517   173 

In [69]:
length(colnames.expression.mat)
length(col.in.tissue)

col.in.tissue[1:3]

In [70]:
# colClasses is used to skip columns
expression.mat <-read.table(expression.file.name, 
                            sep= "\t",
                            header = T,
                            skip = 2, 
                            colClasses = ifelse(col.in.tissue, "numeric", "NULL"))
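
As a self-contained sketch of the `colClasses` trick used above, the example below writes a tiny tab-separated file to a temporary location and reads back only two of its columns. The file contents and the names `keep_col` and `subset_df` are invented for illustration.

```R
# Write a small tab-separated file with five columns to a temporary path.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("Name\tDescription\tS1\tS2\tS3",
             "g1\tA\t1.0\t2.0\t3.0",
             "g2\tB\t4.0\t5.0\t6.0"), tmp)

# One flag per column in the file: TRUE = read as numeric, FALSE = skip.
keep_col <- c(FALSE, FALSE, TRUE, FALSE, TRUE)

# A colClasses entry of "NULL" (the string) tells read.table to drop that column.
subset_df <- read.table(tmp, sep = "\t", header = TRUE,
                        colClasses = ifelse(keep_col, "numeric", "NULL"))
subset_df
unlink(tmp)
```

Skipped columns are never stored in the result, which keeps memory use down when only a small subset of a very wide table is needed.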

In [71]:
length(col.in.tissue)

## Read gene names:

In [72]:
dim(expression.mat)
expression.mat <- expression.mat[,order(match(colnames(expression.mat),meta.data$SAMPID))]
dim(expression.mat)

In [73]:
inc.counts     <- inc.counts[,order(match(colnames(inc.counts),meta.data$SRR))]
dim(inc.counts)

In [74]:
skip.counts    <- skip.counts[,order(match(colnames(skip.counts),meta.data$SRR))]
dim(skip.counts)

In [75]:
all.genes      <- read.table(expression.file.name,sep="\t",header=T,skip=2,colClasses = c(rep("character", 2), rep("NULL", total.samples-2)))
dim(all.genes)
head(all.genes, 2)

Unnamed: 0_level_0,Name,Description
Unnamed: 0_level_1,<chr>,<chr>
1,ENSG00000223972.4,DDX11L1
2,ENSG00000227232.4,WASH7P


In [76]:
expression.mat <- expression.mat[!duplicated(all.genes$Description),]
dim(expression.mat)
head(expression.mat,2)

Unnamed: 0_level_0,GTEX.S4Q7.1126.SM.4AD6R,GTEX.ZZ64.1226.SM.5E43R,GTEX.ZA64.1526.SM.5CVMD,GTEX.11TT1.2126.SM.5GU5Y,GTEX.11NSD.0926.SM.5N9DR,GTEX.RU1J.0626.SM.4WAWY,GTEX.133LE.1726.SM.5K7VQ,GTEX.11EM3.1326.SM.5N9C6,GTEX.13FTX.1126.SM.5N9EN,GTEX.XQ3S.1326.SM.4BOPQ,⋯,GTEX.13OW5.2226.SM.5L3HC,GTEX.13O3O.0826.SM.5K7WE,GTEX.XMD1.0826.SM.4AT52,GTEX.11EI6.0626.SM.5985T,GTEX.14753.2426.SM.5LU8U,GTEX.X4EP.2926.SM.3P5YQ,GTEX.QVJO.1826.SM.447C9,GTEX.13NZ8.0126.SM.5IJCT,GTEX.1117F.2826.SM.5GZXL,GTEX.13N1W.0626.SM.5MR4U
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0.02393,0.07924,0.07894,0.0548,0.1406,0.02687,0.09482,0.05028,0.2165,0.05749,⋯,0.0382,0.02005,0.05832,0.06225,0.1121,0.1559,0.05962,0.03483,0.05445,0.02398
2,8.881,9.967,7.076,15.03,11.79,25.43,17.98,18.39,9.288,13.66,⋯,18.37,14.53,21.75,15.63,7.83,17.29,12.76,19.2,20.81,13.56


In [77]:
all.genes      <- all.genes[!duplicated(all.genes$Description),]
dim(all.genes)

In [78]:
skip.counts    <- skip.counts[merged.table$geneSymbol %in% all.genes$Description,]
dim(skip.counts)

In [79]:
inc.counts     <- inc.counts[merged.table$geneSymbol %in% all.genes$Description,]
dim(inc.counts)

In [80]:
merged.table   <- merged.table[merged.table$geneSymbol %in% all.genes$Description,]
dim(merged.table)

In [81]:
gene.names     <- unique(merged.table$geneSymbol)
length(gene.names)

In [82]:
expression.mat <- expression.mat[all.genes$Description %in% c(as.character(rbp.table$Gene),as.character(gene.names)),]
dim(expression.mat)

In [83]:
rownames.expression.mat <-all.genes$Description[all.genes$Description %in% c(as.character(rbp.table$Gene),as.character(gene.names))]
length(rownames.expression.mat)

In [84]:
expression.mat <-expression.mat[!duplicated(rownames.expression.mat),]
dim(expression.mat)

In [85]:
rownames.expression.mat <-rownames.expression.mat[!duplicated(rownames.expression.mat)]
length(rownames.expression.mat)

## Prepare expression of genes and RBPs:

In [86]:
num.events     <- nrow(merged.table)
num.events

In [87]:
event.to.gene  <- c()

In [88]:
gexp           <- expression.mat[rownames.expression.mat %in% gene.names,]
dim(gexp)

In [89]:
rownames(gexp) <- rownames.expression.mat[rownames.expression.mat %in% gene.names]

In [90]:
gexp           <- gexp[order(match(rownames(gexp),gene.names)),]
dim(gexp)
head(gexp,2)

Unnamed: 0_level_0,GTEX.S4Q7.1126.SM.4AD6R,GTEX.ZZ64.1226.SM.5E43R,GTEX.ZA64.1526.SM.5CVMD,GTEX.11TT1.2126.SM.5GU5Y,GTEX.11NSD.0926.SM.5N9DR,GTEX.RU1J.0626.SM.4WAWY,GTEX.133LE.1726.SM.5K7VQ,GTEX.11EM3.1326.SM.5N9C6,GTEX.13FTX.1126.SM.5N9EN,GTEX.XQ3S.1326.SM.4BOPQ,⋯,GTEX.13OW5.2226.SM.5L3HC,GTEX.13O3O.0826.SM.5K7WE,GTEX.XMD1.0826.SM.4AT52,GTEX.11EI6.0626.SM.5985T,GTEX.14753.2426.SM.5LU8U,GTEX.X4EP.2926.SM.3P5YQ,GTEX.QVJO.1826.SM.447C9,GTEX.13NZ8.0126.SM.5IJCT,GTEX.1117F.2826.SM.5GZXL,GTEX.13N1W.0626.SM.5MR4U
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
LYPD6B,0.0,0.1255,0.04762,0.3347,0.8268,1.434,3.453,3.048,0.0,0.03901,⋯,7.604,4.889,8.482,4.732,0.3549,6.12,0.5125,3.025,3.411,0.5423
SH2D3A,2.657,1.536,1.832,15.57,8.527,46.77,36.08,47.0,1.534,2.322,⋯,39.14,28.95,25.8,23.76,1.583,47.17,6.071,28.56,11.86,3.244


In [91]:
gexp           <- log2(gexp+0.5)

In [92]:
gexp           <- gexp-rowMeans(gexp)

In [93]:
gexp[apply(gexp,1,sd)>0,] <- gexp[apply(gexp,1,sd)>0,]/apply(gexp[apply(gexp,1,sd)>0,],1,sd)

In [94]:
rexp           <- expression.mat[rownames.expression.mat %in% rbp.table$Gene,]

In [95]:
rownames(rexp) <- rownames.expression.mat[rownames.expression.mat %in% rbp.table$Gene]

In [96]:
rexp           <- rexp[order(match(rownames(rexp),rbp.table$Gene)),]

In [97]:
rexp           <- log2(rexp+0.5)

In [98]:
rexp           <- rexp-rowMeans(rexp)

In [99]:
rexp           <- rexp/apply(rexp,1,function(v){ifelse(sum(v==v[1])<length(v),sd(v),1)})
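
The three transformation steps applied to `gexp` and `rexp` above (log2 with a 0.5 pseudocount, row centring, row scaling with a guard for constant rows) can be sketched on a toy matrix as follows. The values and the row names `varying` and `constant` are invented, and the zero-sd guard is written slightly differently from the notebook's `ifelse()` version.

```R
# Toy TPM matrix: one gene that varies across samples and one that is constant.
tpm <- matrix(c(0,   1.5, 3.5, 7.5,
                1.5, 1.5, 1.5, 1.5),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("varying", "constant"), paste0("s", 1:4)))

z <- log2(tpm + 0.5)      # log transform with a pseudocount of 0.5
z <- z - rowMeans(z)      # centre each gene (row) at zero

row_sd <- apply(z, 1, sd) # per-gene standard deviation
row_sd[row_sd == 0] <- 1  # avoid dividing constant rows by zero
z <- z / row_sd           # the vector recycles down columns, so row i is divided by row_sd[i]
z
```

After these steps every non-constant row has mean 0 and standard deviation 1, while constant rows are left at 0.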

In [100]:
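# Map each event to the index of its gene within unique(merged.table$geneSymbol)
for (i in (1:num.events)) {
  event.to.gene<-c(event.to.gene,which(unique(merged.table$geneSymbol)==merged.table[i,"geneSymbol"]))
}
# Binary sex indicator: 1 where meta.data$SEX == 1, otherwise 0
sex<-ifelse(meta.data$SEX==1,1,0)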
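
The loop above assigns every event the position of its gene among the distinct gene symbols. As a sketch on an invented vector of gene symbols, the same mapping can also be obtained in one vectorised call with `match()`; this is offered as an equivalent alternative, not as the notebook's own code.

```R
# Hypothetical gene symbols, one per event (some genes host several events).
gene_symbol <- c("RBM4B", "RBM4B", "XIST", "CD74", "XIST")

# Loop version, mirroring the notebook's approach.
event_to_gene_loop <- c()
for (i in seq_along(gene_symbol)) {
  event_to_gene_loop <- c(event_to_gene_loop,
                          which(unique(gene_symbol) == gene_symbol[i]))
}

# Vectorised version: position of each symbol within the distinct symbols.
event_to_gene_vec <- match(gene_symbol, unique(gene_symbol))

identical(event_to_gene_loop, event_to_gene_vec)
```

The vectorised form avoids growing a vector inside a loop, which matters when the number of events is large.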

In [101]:
sex[1:4]
table(sex)

sex
  0   1 
 73 100 

In [102]:
end_time <- Sys.time()
end_time - start_time

Time difference of 3.002217 mins

## Run Stan:

In [103]:
dataList = list(
  as = round(skip.counts),              # skip event counts across experiments
  c = round(skip.counts + inc.counts),  # total counts per event, i.e. skip + inclusion, across experiments
  gexp = gexp,                          # standardised log2(TPM + 0.5) expression of the genes hosting the events
  rexp = rexp,                          # standardised log2(TPM + 0.5) expression of the RBPs
  event_to_gene = event.to.gene,        # the gene index for each event (1 to the number of distinct genes)
  Nrbp = nrow(rexp),                    # number of RBPs
  Nevents = nrow(merged.table),         # number of AS events retained after filtering
  Nexp = ncol(expression.mat),          # number of experiments in which each event, gene and RBP was measured
  Ngenes = nrow(gexp),                  # number of distinct genes
  sex = sex                             # 1 where meta.data$SEX == 1, otherwise 0
)


modelString = "
data {
int<lower=0> Nevents;
int<lower=0> Nexp;
int<lower=0> Nrbp;
int<lower=0> Ngenes;
int<lower=0> as[Nevents,Nexp] ;
int<lower=0> c[Nevents,Nexp] ;
matrix[Ngenes,Nexp] gexp ; 
matrix[Nrbp,Nexp] rexp ; 
int<lower=0> event_to_gene[Nevents];
int<lower=0,upper=1> sex[Nexp];

}


parameters {
real beta0[Nevents] ;
real beta1[Nevents] ;
matrix[Nevents,Nrbp] beta2 ;
real beta3[Nevents];
real beta4[Nrbp];

}
model {

for ( i in 1:Nexp ) {  


    for ( j in 1:Nevents ) if (c[j,i]>0) { 

      as[j,i] ~ binomial(c[j,i], inv_logit(beta0[j]+beta1[j]*sex[i]+dot_product(beta2[j,],rexp[,i])+beta3[j]*gexp[event_to_gene[j],i] ) );

  }
}

for (k in 1:Nrbp){

  for ( j in 1:Nevents ) { 

        beta2[j,k] ~normal(beta4[k],1);
  }

  beta4[k]~normal(0,1);

}


for ( j in 1:Nevents ) { 

    beta1[j] ~ normal(0,1);
    beta0[j] ~ normal(0,1);
    beta3[j] ~ normal(0,1);
  }

}
"

# Start the clock!
start_time <- Sys.time()

stanDso <- rstan::stan_model( model_code=modelString ) 
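# NOTE: iter = 8 with warmup = 6 keeps only 2 post-warmup draws per chain, so
# the convergence warnings printed below are expected. The trailing comments
# record larger values (3 chains, 8000 iterations, 6000 warmup).
stanFit <- sampling( object=stanDso , 
                    data = dataList , 
                    chains = 2 , #3
                    iter = 8, #8000
                    warmup = 6, #6000
                    thin = 1,
                    init = 0, 
                    cores = parallel::detectCores() - 2 )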

mcmcCoda = coda::mcmc.list( lapply( 1:ncol(stanFit) , function(x) { mcmc(as.array(stanFit)[,x,]) } ) )

end_time <- Sys.time()
end_time - start_time

“There were 4 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
“Examine the pairs() plot to diagnose sampling problems
”
“The largest R-hat is NA, indicating chains have not mixed.
Running the chains for more iterations may help. See
“Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
“Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See


Time difference of 2.419398 mins
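
For reference, the `model` block of the Stan program above can be written out as the following hierarchical logistic-binomial model. This is a reading of the code, not a formula stated by the authors. Here $j$ indexes events, $i$ experiments and $k$ RBPs; $g(j)$ is shorthand introduced here for `event_to_gene[j]`, and $p_{j,i}$ is a name introduced here for the skipping probability.

$$
\begin{aligned}
as_{j,i} &\sim \mathrm{Binomial}\left(c_{j,i},\, p_{j,i}\right) \quad \text{for } c_{j,i} > 0,\\
\mathrm{logit}\left(p_{j,i}\right) &= \beta_{0,j} + \beta_{1,j}\, sex_i + \sum_{k=1}^{N_{rbp}} \beta_{2,jk}\, rexp_{k,i} + \beta_{3,j}\, gexp_{g(j),i},\\
\beta_{2,jk} &\sim \mathcal{N}\left(\beta_{4,k},\, 1\right), \qquad \beta_{4,k} \sim \mathcal{N}(0,\, 1),\\
\beta_{0,j},\ \beta_{1,j},\ \beta_{3,j} &\sim \mathcal{N}(0,\, 1).
\end{aligned}
$$

Since `as` holds the skip counts and `c` the total counts, $p_{j,i}$ is the probability that a read supporting event $j$ in experiment $i$ is a skipping read, and $\beta_{1,j}$ is the per-event sex effect on the logit scale. The RBP coefficients $\beta_{2,jk}$ are shrunk towards a shared per-RBP mean $\beta_{4,k}$, which is what makes the model hierarchical.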

## Save R objects

In [108]:
save.image(file = "notebook.RData")
file.exists("notebook.RData")
system("pwd && ls -l notebook.RData", intern = TRUE)

## Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **"artefacts"**, i.e. files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### 1. Checksums with the sha256 algorithm

In [105]:
figure_id       <- "bayesian-modeling"

message("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
system(paste0("cd ../data/ && find . -type f -exec sha256sum {} \\; > ../metadata/",  figure_id, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

data.table::fread(paste0("../metadata/", figure_id, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))

Generating sha256 checksums of the artefacts in the `../data/` directory .. 



Done!




sha256sum,file
<chr>,<chr>
6f2b3dc37ab2186f7cb6d317d80d9f70e9cd1c2bcab3cfc86c7de5c5e76e2514,./mxe_spleen_AS_model_B_sex.csv
3d378262fa13e905cfa51d6ec2a88f01965d82dd54357131771c0a95c226805a,./mxe_brain_cerebellum_AS_model_A_ijc_wo_DGE_sex_gene_set.txt
94a7fa75a9c0072a745bb513040e778f90961310fc3751b90141b62a06429903,./liver_DGE.csv
c04f787ffba8d422a66d2e807ece6ac8c6900287572f31105a5549c4021faf81,./ri_pancreas_AS_model_A_ijc_wo_DGE_sex.csv
b4776f57f9780ef4c2068f55a9a56e02f2544d70438ce16de8a98a454bbaa14e,./ri_breast_mammary_tissue_AS_model_A_ijc_wo_DGE_sex_refined.csv
f6c59cceb70e2f036dc4ad705125a10412d22dca8dc2b6b2f8ebadd4aaf98280,./mxe_spleen_AS_model_A_sjc_sex.csv
2b5e26957f499a525c03053e7b5362681e1ebc7e71f0ae69e621153e56cbc3b9,./se_lung_AS_model_A_ijc_wo_DGE_sex.csv
cf6e8ed384c002dd3beaf00318db269252e11e447f57a2829c87c597dacc6dfb,./se_whole_blood_AS_model_B_sex_as_events_universe.txt
63c22570491153b15c881dd82928c26491fac0bd246beb1de65cbf81d9049911,./a5ss_thyroid_AS_model_B_sex_as_events_universe.txt
9d72aeeee5ff54e5764bd33715031e8e035f6e84c2ec662ebf4eef5a7a74174a,./ri_brain_hippocampus_AS_model_A_ijc_sex_refined.csv


### 2. Libraries metadata

In [106]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", figure_id, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/utils_session_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", figure_id ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]

Saving `devtools::session_info()` objects in ../metadata/devtools_session_info.rds  ..

Done!


Saving `utils::sessionInfo()` objects in ../metadata/utils_session_info.rds  ..

Done!




 setting  value                       
 version  R version 3.6.2 (2019-12-12)
 os       Ubuntu 18.04.3 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_US.UTF-8                 
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2020-06-26                  

Unnamed: 0_level_0,package,ondiskversion,loadedversion,path,loadedpath,attached,is_base,date,source,md5ok,library
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<lgl>,<lgl>,<chr>,<chr>,<lgl>,<fct>
coda,coda,0.19.3,0.19-3,/opt/conda/lib/R/library/coda,/opt/conda/lib/R/library/coda,True,False,2019-07-05,CRAN (R 3.6.3),,/opt/conda/lib/R/library
compute.es,compute.es,0.2.5,0.2-5,/opt/conda/lib/R/library/compute.es,/opt/conda/lib/R/library/compute.es,True,False,2020-04-01,CRAN (R 3.6.3),,/opt/conda/lib/R/library
ggplot2,ggplot2,3.3.2,3.3.2,/opt/conda/lib/R/library/ggplot2,/opt/conda/lib/R/library/ggplot2,True,False,2020-06-19,CRAN (R 3.6.3),,/opt/conda/lib/R/library
ggsci,ggsci,2.9,2.9,/opt/conda/lib/R/library/ggsci,/opt/conda/lib/R/library/ggsci,True,False,2018-05-14,CRAN (R 3.6.3),,/opt/conda/lib/R/library
gridExtra,gridExtra,2.3,2.3,/opt/conda/lib/R/library/gridExtra,/opt/conda/lib/R/library/gridExtra,True,False,2017-09-09,CRAN (R 3.6.3),,/opt/conda/lib/R/library
rjags,rjags,4.10,4-10,/opt/conda/lib/R/library/rjags,/opt/conda/lib/R/library/rjags,True,False,2019-11-06,CRAN (R 3.6.3),,/opt/conda/lib/R/library
rstan,rstan,2.19.3,2.19.3,/opt/conda/lib/R/library/rstan,/opt/conda/lib/R/library/rstan,True,False,2020-02-11,CRAN (R 3.6.3),,/opt/conda/lib/R/library
runjags,runjags,2.0.4.6,2.0.4-6,/opt/conda/lib/R/library/runjags,/opt/conda/lib/R/library/runjags,True,False,2019-12-17,CRAN (R 3.6.2),,/opt/conda/lib/R/library
snakecase,snakecase,0.11.0,0.11.0,/opt/conda/lib/R/library/snakecase,/opt/conda/lib/R/library/snakecase,True,False,2019-05-25,CRAN (R 3.6.3),,/opt/conda/lib/R/library
StanHeaders,StanHeaders,2.21.0.5,2.21.0-5,/opt/conda/lib/R/library/StanHeaders,/opt/conda/lib/R/library/StanHeaders,True,False,2020-06-09,CRAN (R 3.6.3),,/opt/conda/lib/R/library
