# Pediatric DS AML vs TAM

Sleuth Analysis with Kallisto Quantitation Input

## Following the Pachterlab walkthrough
[Using p-value aggregation to obtain gene differential expression in datasets with multiple experimental conditions](https://pachterlab.github.io/sleuth_walkthroughs/pval_agg/analysis.html)




### Getting started with Sleuth

**sleuth** is a tool for the analysis and comparison of multiple related RNA-Seq experiments. Key features include:

* The ability to perform both transcript-level and gene-level analysis.
* Compatibility with kallisto enabling a fast and accurate workflow from reads to results.
* The use of bootstraps to ascertain and correct for technical variation in experiments.
* An interactive app for exploratory data analysis.

To use sleuth, RNA-Seq data must first be quantified with kallisto, a program for very fast RNA-Seq quantification based on pseudo-alignment (we did this with the CAVATICA workflow). An important feature of kallisto is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. sleuth has been designed to work seamlessly and efficiently with kallisto, and therefore RNA-Seq analysis with kallisto and sleuth is tractable on a laptop computer in a matter of minutes. More details about kallisto and sleuth are provided in the papers describing the methods:

#### Citations

* Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016), doi:10.1038/nbt.3519

* Harold Pimentel, Nicolas L Bray, Suzette Puente, Páll Melsted and Lior Pachter, Differential analysis of RNA-seq incorporating quantification uncertainty, Nature Methods 14, 687–690 (2017), doi:10.1038/nmeth.4324

sleuth has been designed to facilitate the exploration of RNA-Seq data by utilizing the Shiny web application framework by RStudio. The worked example below illustrates how to load data into sleuth and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully “open source”.



### Introduction

Here we apply the techniques from the walkthrough to our own DS-AML vs TAM data.

First we need to set up our working directory. We also need two packages:

* `cowplot` - for making prettier plots and plots with grids
* `biomaRt` - for extracting the Ensembl transcript-to-gene mapping



In [1]:
install.packages("cowplot")


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



In [2]:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.14")


Bioconductor version '3.14' is out-of-date; the current release version '3.15'
  is available with R version '4.2'; see https://bioconductor.org/install

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.1 (2021-08-10)

Old packages: 'backports', 'blob', 'brew', 'brio', 'broom', 'bslib', 'callr',
  'caret', 'class', 'clipr', 'conflicted', 'covr', 'credentials', 'crosstalk',
  'curl', 'DBI', 'dbplyr', 'desc', 'devtools', 'dials', 'diffobj', 'dplyr',
  'DT', 'dtplyr', 'e1071', 'evaluate', 'forcats', 'foreach', 'forecast',
  'furrr', 'future', 'future.apply', 'gargle', 'gert', 'ggplot2', 'gh',
  'git2r', 'gitcreds', 'globals', 'googlesheets4', 'gower', 'hardhat', 'haven',
  'hms', 'httpuv', 'httr', 'infer', 'ipred', 'IRdisplay', 'IRkernel',
  'iterators', 'knitr', 'lattice', 'lhs', 'lmtest', 'lubridate', 'MASS',
  'Matrix'

In [3]:
BiocManager::install("biomaRt")

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.1 (2021-08-10)

“package(s) not installed when version(s) same as current; use `force = TRUE` to
  re-install: 'biomaRt'”
Old packages: 'backports', 'blob', 'brew', 'brio', 'broom', 'bslib', 'callr',
  'caret', 'class', 'clipr', 'conflicted', 'covr', 'credentials', 'crosstalk',
  'curl', 'DBI', 'dbplyr', 'desc', 'devtools', 'dials', 'diffobj', 'dplyr',
  'DT', 'dtplyr', 'e1071', 'evaluate', 'forcats', 'foreach', 'forecast',
  'furrr', 'future', 'future.apply', 'gargle', 'gert', 'ggplot2', 'gh',
  'git2r', 'gitcreds', 'globals', 'googlesheets4', 'gower', 'hardhat', 'haven',
  'hms', 'httpuv', 'httr', 'infer', 'ipred', 'IRdisplay', 'IRkernel',
  'iterators', 'knitr', 'lattice', 'lhs', 'lmtest', 'lubridate', 'MASS',
  'Matrix', 'mgcv', 'modeldata', 'modelr', 'nlme', 'nnet',

In [4]:
library("devtools")

Loading required package: usethis


Attaching package: ‘devtools’


The following object is masked from ‘package:BiocManager’:

    install




#### First time through: received a warning that rhdf5 is not available for this version of R

We looked up our R version and searched for "R 4.1.1 rhdf5".

The package can be installed using `BiocManager::install("rhdf5")`.

Pachterlab says to install `rhdf5` first.

In [5]:
BiocManager::install("rhdf5")

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.1 (2021-08-10)

“package(s) not installed when version(s) same as current; use `force = TRUE` to
  re-install: 'rhdf5'”
Old packages: 'backports', 'blob', 'brew', 'brio', 'broom', 'bslib', 'callr',
  'caret', 'class', 'clipr', 'conflicted', 'covr', 'credentials', 'crosstalk',
  'curl', 'DBI', 'dbplyr', 'desc', 'devtools', 'dials', 'diffobj', 'dplyr',
  'DT', 'dtplyr', 'e1071', 'evaluate', 'forcats', 'foreach', 'forecast',
  'furrr', 'future', 'future.apply', 'gargle', 'gert', 'ggplot2', 'gh',
  'git2r', 'gitcreds', 'globals', 'googlesheets4', 'gower', 'hardhat', 'haven',
  'hms', 'httpuv', 'httr', 'infer', 'ipred', 'IRdisplay', 'IRkernel',
  'iterators', 'knitr', 'lattice', 'lhs', 'lmtest', 'lubridate', 'MASS',
  'Matrix', 'mgcv', 'modeldata', 'modelr', 'nlme', 'nnet', '

In [9]:
devtools::install_github("pachterlab/sleuth")

Downloading GitHub repo pachterlab/sleuth@HEAD



sass        (0.4.0 -> 0.4.2 ) [CRAN]
pillar      (1.6.2 -> 1.8.1 ) [CRAN]
vctrs       (0.3.8 -> 0.4.2 ) [CRAN]
tidyselect  (1.1.1 -> 1.2.0 ) [CRAN]
tibble      (3.1.4 -> 3.1.8 ) [CRAN]
bslib       (0.3.0 -> 0.4.0 ) [CRAN]
fontawesome (NA    -> 0.3.0 ) [CRAN]
httpuv      (1.6.3 -> 1.6.6 ) [CRAN]
scales      (1.1.1 -> 1.2.1 ) [CRAN]
plyr        (1.8.6 -> 1.8.7 ) [CRAN]
purrr       (0.3.4 -> 0.3.5 ) [CRAN]
dplyr       (1.0.7 -> 1.0.10) [CRAN]
shiny       (1.6.0 -> 1.7.2 ) [CRAN]
tidyr       (1.1.3 -> 1.2.1 ) [CRAN]
ggplot2     (3.3.5 -> 3.3.6 ) [CRAN]


Skipping 1 packages not available: rhdf5

Installing 15 packages: sass, pillar, vctrs, tidyselect, tibble, bslib, fontawesome, httpuv, scales, plyr, purrr, dplyr, shiny, tidyr, ggplot2

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



✔  checking for file ‘/tmp/RtmpD7VLyX/remotes485d2a7914f/pachterlab-sleuth-1f3760a/DESCRIPTION’
─  preparing ‘sleuth’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
   Omitted ‘LazyData’ from DESCRIPTION
─  building ‘sleuth_0.30.0.tar.gz’


In [10]:
library(sleuth)

#### If this fails ...

This failure is noted in the sleuth issue tracker at https://github.com/pachterlab/sleuth/issues/259. Follow the instructions from [Paast](https://github.com/pachterlab/sleuth/issues/259#issuecomment-966270599).

We have successfully run the Kallisto Quantitation workflow.

Results may be found after running an application on Cavatica here:

```bash
/sbgenomics/project-files/
```

For this analysis we will use the results from the run using `metadata_ten_samples_only_txt`

Results are in:

```bash
/sbgenomics/project-files/ten_samples_expression_matrix.tpm.txt
```

### Parsing metadata

A sleuth analysis depends on a metadata file, which describes the experimental design, the sample names, conditions and covariates. The metadata file is external to sleuth and must be prepared prior to analysis. A metadata file should have been downloaded along with the kallisto quantifications. The first step in a sleuth analysis is loading the metadata file. You might need to change the path in `read.table` below to point to where you have downloaded the kallisto dataset, so that it directs to the metadata file (`sample_table.txt` in the walkthrough). We then select the relevant columns of the metadata.

In our case, I used:

```bash
/sbgenomics/project-files/metadata_ten_samples_only.csv
```

In [17]:
metadata <- read.table('/sbgenomics/project-files/metadata_ten_samples_only.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [18]:
head(metadata, n=20)

Unnamed: 0_level_0,File.name,Case.ID,subject,Sample.ID,sample,Gender,Disease.type,Paired.end
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
1,PAXSBH-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PAXSBH,PAXSBH,PAXSBH-03A-01R,PAXSBH-03A-01R,Female,TAM,1.0
2,PAXSBH-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXSBH,PAXSBH,PAXSBH-03A-01R,PAXSBH-03A-01R,Female,TAM,2.0
3,PAXWGW-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PAXWGW,PAXWGW,PAXWGW-03A-01R,PAXWGW-03A-01R,Female,TAM,1.0
4,PAXWGW-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXWGW,PAXWGW,PAXWGW-03A-01R,PAXWGW-03A-01R,Female,TAM,2.0
5,PASNSP-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PASNSP,PASNSP,PASNSP-03A-01R,PASNSP-03A-01R,Male,TAM,1.0
6,PASNSP-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PASNSP,PASNSP,PASNSP-03A-01R,PASNSP-03A-01R,Male,TAM,2.0
7,PASWXF-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PASWXF,PASWXF,PASWXF-03A-01R,PASWXF-03A-01R,Male,TAM,1.0
8,PASWXF-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PASWXF,PASWXF,PASWXF-03A-01R,PASWXF-03A-01R,Male,TAM,2.0
9,PASXCL-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PASXCL,PASXCL,PASXCL-03A-01R,PASXCL-03A-01R,Male,TAM,1.0
10,PASXCL-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PASXCL,PASXCL,PASXCL-03A-01R,PASXCL-03A-01R,Male,TAM,2.0


There is an error in the last sample's entry: `Paired.end` should read `2`, not `NA`. I copied the file to a local directory and corrected it. The project copy has since been corrected permanently, but for this run-through the copy was made with:
```bash
cp /sbgenomics/project-files/metadata_ten_samples_only.csv /sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data
```

I edited the copy in that directory, and we now read that one in.
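
A check along these lines could flag such a missing value before the analysis proceeds. This is only a sketch on a toy two-row table, not the project file; `toy_csv`, `toy` and `bad_rows` are illustrative names, and the file names in the toy data are made up.

```r
# Toy CSV mimicking the metadata layout; the second row has an empty Paired.end
toy_csv <- "File.name,Sample.ID,Paired.end
sampleA_r1.fq.gz,sampleA,1
sampleA_r2.fq.gz,sampleA,"

toy <- read.table(text = toy_csv, sep = ",", header = TRUE,
                  stringsAsFactors = FALSE)

# Blank fields in a numeric column are read as NA;
# complete.cases() is FALSE for any row containing an NA
bad_rows <- which(!complete.cases(toy))
print(toy[bad_rows, ])
```

Running the same `complete.cases()` test on the real `metadata` data frame right after `read.table` would list the rows that need correcting.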

In [21]:
metadata <- read.table('/sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data/metadata_ten_samples_only.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [22]:
head(metadata)

Unnamed: 0_level_0,File.name,Case.ID,subject,Sample.ID,sample,Gender,Disease.type,Paired.end
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
1,PAXSBH-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PAXSBH,PAXSBH,PAXSBH-03A-01R,PAXSBH-03A-01R,Female,TAM,1
2,PAXSBH-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXSBH,PAXSBH,PAXSBH-03A-01R,PAXSBH-03A-01R,Female,TAM,2
3,PAXWGW-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PAXWGW,PAXWGW,PAXWGW-03A-01R,PAXWGW-03A-01R,Female,TAM,1
4,PAXWGW-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXWGW,PAXWGW,PAXWGW-03A-01R,PAXWGW-03A-01R,Female,TAM,2
5,PASNSP-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PASNSP,PASNSP,PASNSP-03A-01R,PASNSP-03A-01R,Male,TAM,1
6,PASNSP-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PASNSP,PASNSP,PASNSP-03A-01R,PASNSP-03A-01R,Male,TAM,2


In [31]:
metadata <- dplyr::select(metadata, c('Case.ID', 'Sample.ID', 'Gender', 'Disease.type'))

In [32]:
head(metadata)

Unnamed: 0_level_0,Case.ID,Sample.ID,Gender,Disease.type
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,PAXSBH,PAXSBH-03A-01R,Female,TAM
2,PAXSBH,PAXSBH-03A-01R,Female,TAM
3,PAXWGW,PAXWGW-03A-01R,Female,TAM
4,PAXWGW,PAXWGW-03A-01R,Female,TAM
5,PASNSP,PASNSP-03A-01R,Male,TAM
6,PASNSP,PASNSP-03A-01R,Male,TAM


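This file describes the experimental design. The original file has columns for cases, samples, read files, gender, disease type and paired-end read number. After selecting only the case, sample, gender and disease type columns, each sample appears twice (once per read file), so we de-duplicate next.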


In [34]:
metadata <- dplyr::distinct(metadata)

In [35]:
head(metadata)

Unnamed: 0_level_0,Case.ID,Sample.ID,Gender,Disease.type
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,PAXSBH,PAXSBH-03A-01R,Female,TAM
2,PAXWGW,PAXWGW-03A-01R,Female,TAM
3,PASNSP,PASNSP-03A-01R,Male,TAM
4,PASWXF,PASWXF-03A-01R,Male,TAM
5,PASXCL,PASXCL-03A-01R,Male,TAM
6,PAVZTK,PAVZTK-09A-01R,Female,DS-AML


In [37]:
dim(metadata)

The file is now de-duplicated. The workflow run previously was smart enough to give a single abundance file per sample (recognizing that the samples are paired-end). Next we add the path names to the samples. I will do this in the external file that I copied (since I am not following how to do this with the dplyr command!).

I created a macro in Emacs that copied the Sample.ID and added the directory and the abundance details.
Let's re-read and de-duplicate.
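
For reference, the same column could probably be built in R rather than by editing the file by hand. The following is a minimal sketch, assuming every quantification file follows the `<Sample.ID>.kallisto_quant.abundance.h5` naming used in this project; `metadata_demo` and `quant_dir` are illustrative names, and the toy table holds only two of the samples.

```r
# Toy stand-in for the de-duplicated metadata
metadata_demo <- data.frame(
  Sample.ID    = c("PAXSBH-03A-01R", "PAVZTK-09A-01R"),
  Disease.type = c("TAM", "DS-AML"),
  stringsAsFactors = FALSE
)

# Directory holding the kallisto output (assumed to be the same for all samples)
quant_dir <- "/sbgenomics/project-files"

# Build one path per sample from its Sample.ID
metadata_demo$Abundance <- file.path(
  quant_dir,
  paste0(metadata_demo$Sample.ID, ".kallisto_quant.abundance.h5")
)

# dplyr equivalent of the assignment above:
# metadata_demo <- dplyr::mutate(metadata_demo,
#   Abundance = file.path(quant_dir,
#                         paste0(Sample.ID, ".kallisto_quant.abundance.h5")))

print(metadata_demo$Abundance)
```

Both `paste0` and `file.path` are vectorized, so the same assignment works on any number of samples.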


In [57]:
metadata <- read.table('/sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data/metadata_ten_samples_only.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [58]:
head(metadata)

Unnamed: 0_level_0,File.name,Case.ID,subject,Sample.ID,sample,Gender,Disease.type,Paired.end,Abundance
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>
1,PAXSBH-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXSBH,PAXSBH,PAXSBH-03A-01R,PAXSBH-03A-01R,Female,TAM,1,/sbgenomics/project-files/PAXSBH-03A-01R.kallisto_quant.abundance.h5
2,PAXSBH-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXSBH,PAXSBH,PAXSBH-03A-01R,PAXSBH-03A-01R,Female,TAM,2,/sbgenomics/project-files/PAXSBH-03A-01R.kallisto_quant.abundance.h5
3,PAXWGW-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PAXWGW,PAXWGW,PAXWGW-03A-01R,PAXWGW-03A-01R,Female,TAM,1,/sbgenomics/project-files/PAXWGW-03A-01R.kallisto_quant.abundance.h5
4,PAXWGW-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXWGW,PAXWGW,PAXWGW-03A-01R,PAXWGW-03A-01R,Female,TAM,2,/sbgenomics/project-files/PAXWGW-03A-01R.kallisto_quant.abundance.h5
5,PASNSP-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PASNSP,PASNSP,PASNSP-03A-01R,PASNSP-03A-01R,Male,TAM,1,/sbgenomics/project-files/PASNSP-03A-01R.kallisto_quant.abundance.h5
6,PASNSP-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PASNSP,PASNSP,PASNSP-03A-01R,PASNSP-03A-01R,Male,TAM,2,/sbgenomics/project-files/PASNSP-03A-01R.kallisto_quant.abundance.h5


In [59]:
dim(metadata)

In [60]:
metadata <- dplyr::select(metadata, c('Case.ID', 'Sample.ID', 'Gender', 'Disease.type', 'Abundance'))

In [61]:
head(metadata)

Unnamed: 0_level_0,Case.ID,Sample.ID,Gender,Disease.type,Abundance
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>
1,PAXSBH,PAXSBH-03A-01R,Female,TAM,/sbgenomics/project-files/PAXSBH-03A-01R.kallisto_quant.abundance.h5
2,PAXSBH,PAXSBH-03A-01R,Female,TAM,/sbgenomics/project-files/PAXSBH-03A-01R.kallisto_quant.abundance.h5
3,PAXWGW,PAXWGW-03A-01R,Female,TAM,/sbgenomics/project-files/PAXWGW-03A-01R.kallisto_quant.abundance.h5
4,PAXWGW,PAXWGW-03A-01R,Female,TAM,/sbgenomics/project-files/PAXWGW-03A-01R.kallisto_quant.abundance.h5
5,PASNSP,PASNSP-03A-01R,Male,TAM,/sbgenomics/project-files/PASNSP-03A-01R.kallisto_quant.abundance.h5
6,PASNSP,PASNSP-03A-01R,Male,TAM,/sbgenomics/project-files/PASNSP-03A-01R.kallisto_quant.abundance.h5


In [62]:
metadata <- dplyr::distinct(metadata)

In [63]:
head(metadata)

Unnamed: 0_level_0,Case.ID,Sample.ID,Gender,Disease.type,Abundance
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>
1,PAXSBH,PAXSBH-03A-01R,Female,TAM,/sbgenomics/project-files/PAXSBH-03A-01R.kallisto_quant.abundance.h5
2,PAXWGW,PAXWGW-03A-01R,Female,TAM,/sbgenomics/project-files/PAXWGW-03A-01R.kallisto_quant.abundance.h5
3,PASNSP,PASNSP-03A-01R,Male,TAM,/sbgenomics/project-files/PASNSP-03A-01R.kallisto_quant.abundance.h5
4,PASWXF,PASWXF-03A-01R,Male,TAM,/sbgenomics/project-files/PASWXF-03A-01R.kallisto_quant.abundance.h5
5,PASXCL,PASXCL-03A-01R,Male,TAM,/sbgenomics/project-files/PASXCL-03A-01R.kallisto_quant.abundance.h5
6,PAVZTK,PAVZTK-09A-01R,Female,DS-AML,/sbgenomics/project-files/PAVZTK-09A-01R.kallisto_quant.abundance.h5


In [64]:
dim(metadata)

In [65]:
metadata <- dplyr::rename(metadata, sample = Sample.ID)

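sleuth expects the sample identifiers in a column named `sample` and the kallisto output locations in a column named `path`, so we also need to rename the `Abundance` column to `path`.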

In [79]:
metadata <- dplyr::rename(metadata, path = Abundance)

In [80]:
head(metadata)

Unnamed: 0_level_0,Case.ID,sample,Gender,Disease.type,path
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>
1,PAXSBH,PAXSBH-03A-01R,Female,TAM,/sbgenomics/project-files/PAXSBH-03A-01R.kallisto_quant.abundance.h5
2,PAXWGW,PAXWGW-03A-01R,Female,TAM,/sbgenomics/project-files/PAXWGW-03A-01R.kallisto_quant.abundance.h5
3,PASNSP,PASNSP-03A-01R,Male,TAM,/sbgenomics/project-files/PASNSP-03A-01R.kallisto_quant.abundance.h5
4,PASWXF,PASWXF-03A-01R,Male,TAM,/sbgenomics/project-files/PASWXF-03A-01R.kallisto_quant.abundance.h5
5,PASXCL,PASXCL-03A-01R,Male,TAM,/sbgenomics/project-files/PASXCL-03A-01R.kallisto_quant.abundance.h5
6,PAVZTK,PAVZTK-09A-01R,Female,DS-AML,/sbgenomics/project-files/PAVZTK-09A-01R.kallisto_quant.abundance.h5
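
Before handing the table to sleuth, it may be worth confirming that the renamed columns exist and that the listed files can be found. The helper below is a hypothetical sketch (`check_sleuth_metadata` is not part of sleuth), shown on a toy table with temporary files.

```r
# Hypothetical helper: stop if the expected columns are absent or sample
# names repeat; otherwise return the paths that do not exist on disk
check_sleuth_metadata <- function(s2c) {
  required <- c("sample", "path")
  missing_cols <- setdiff(required, colnames(s2c))
  if (length(missing_cols) > 0) {
    stop("metadata is missing column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(s2c$sample) > 0) {
    stop("sample names must be unique")
  }
  s2c$path[!file.exists(s2c$path)]
}

# Toy usage: one file that exists and one that does not
existing <- tempfile(fileext = ".h5")
file.create(existing)
toy <- data.frame(
  sample = c("A", "B"),
  path   = c(existing, file.path(tempdir(), "no_such_file.h5")),
  stringsAsFactors = FALSE
)
missing_files <- check_sleuth_metadata(toy)
print(missing_files)
```

An empty result from `check_sleuth_metadata(metadata)` would indicate that every listed path was found.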


#### biomaRt - how to use

Following instructions from the [ensembl site](https://grch37.ensembl.org/info/data/biomart/biomart_r_package.html)

In [67]:
library(biomaRt)

In [68]:
listEnsembl()

biomart,version
<chr>,<chr>
genes,Ensembl Genes 107
mouse_strains,Mouse strains 107
snps,Ensembl Variation 107
regulation,Ensembl Regulation 107


In [70]:
grch37 = useEnsembl(biomart="ensembl",GRCh=37)

In [76]:
mart <- biomaRt::useMart(biomart="ensembl", 
                     dataset = "hsapiens_gene_ensembl",
                        host = "https://useast.ensembl.org")

In [77]:
ttg <- biomaRt::getBM(
  attributes = c("ensembl_transcript_id", "transcript_version",
  "ensembl_gene_id", "external_gene_name", "description",
  "transcript_biotype"),
  mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))
head(ttg)


Unnamed: 0_level_0,target_id,ens_gene,ext_gene
Unnamed: 0_level_1,<chr>,<chr>,<chr>
1,ENST00000387314,ENSG00000210049,MT-TF
2,ENST00000389680,ENSG00000211459,MT-RNR1
3,ENST00000387342,ENSG00000210077,MT-TV
4,ENST00000387347,ENSG00000210082,MT-RNR2
5,ENST00000386347,ENSG00000209082,MT-TL1
6,ENST00000361390,ENSG00000198888,MT-ND1


The resulting table contains Ensembl gene names (‘ens_gene’) and the associated transcripts (‘target_id’). Note that the gene-transcript mapping must be compatible with the transcriptome used with kallisto. In other words, to use Ensembl transcript-gene associations, kallisto must have been run using the Ensembl transcriptome.
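
One way to test this compatibility is to measure how many kallisto target identifiers appear in the mapping. The sketch below uses toy identifiers: the version suffixes are made up for illustration, and whether our kallisto output carries such suffixes is an assumption to verify against the real files.

```r
# Toy identifiers; real ones would come from the kallisto abundance files
kallisto_ids <- c("ENST00000387314.1", "ENST00000389680.2", "ENST00000361390.2")
ttg_ids      <- c("ENST00000387314", "ENST00000389680", "ENST00000387342")

# Fraction of kallisto targets found in the mapping as-is
overlap_raw <- mean(kallisto_ids %in% ttg_ids)

# Remove a trailing ".<digits>" version suffix, if present
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Fraction found once the version suffix is removed
overlap_stripped <- mean(strip_version(kallisto_ids) %in% ttg_ids)

print(c(raw = overlap_raw, stripped = overlap_stripped))
```

A raw overlap near zero alongside a high stripped overlap would suggest that the two identifier sets differ only by version suffix.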

#### Preparing the analysis

The next step is to build a sleuth object. The sleuth object contains the specification of the experimental design, a map describing the grouping of transcripts into genes (or other groups), and a number of user-specified parameters. In the example that follows, `metadata` is the experimental design and `target_mapping` describes the previously constructed grouping of transcripts into genes. Furthermore, we provide an `aggregation_column`, the name of the column in the ‘target_mapping’ table that is used to aggregate the transcripts. When both ‘target_mapping’ and ‘aggregation_column’ are provided, sleuth will automatically run in gene mode, returning gene differential expression results that come from the aggregation of transcript p-values.
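
To give a feel for what aggregating transcript p-values means, the sketch below combines several p-values with Fisher's method. This is the simplest textbook instance of p-value aggregation, chosen for illustration only; sleuth's own aggregation scheme is not described in this notebook and may weight transcripts differently. The function name and the p-values are hypothetical.

```r
# Fisher's method: combine k independent p-values into one.
# Under the null, -2 * sum(log(p)) follows a chi-squared
# distribution with 2k degrees of freedom.
fisher_combine <- function(pvals) {
  stat <- -2 * sum(log(pvals))
  pchisq(stat, df = 2 * length(pvals), lower.tail = FALSE)
}

# Hypothetical transcript-level p-values for a single gene
transcript_p <- c(0.04, 0.20, 0.01)
gene_p <- fisher_combine(transcript_p)
print(gene_p)
```

With a single input the method returns that p-value unchanged, which makes a convenient sanity check.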


In [None]:
so <- sleuth_prep(metadata, 
                  target_mapping = ttg, 
                  aggregation_column = 'ens_gene',
                  extra_bootstrap_summary = TRUE)

reading in kallisto results

dropping unused factor levels

..........


