### Getting started with Sleuth

**sleuth** is a tool for the analysis and comparison of multiple related RNA-Seq experiments. Key features include:

* The ability to perform both transcript-level and gene-level analysis.
* Compatibility with kallisto, enabling a fast and accurate workflow from reads to results.
* The use of bootstraps to ascertain and correct for technical variation in experiments.
* An interactive app for exploratory data analysis.

To use sleuth, RNA-Seq data must first be quantified with kallisto, a program for very fast RNA-Seq quantification based on pseudo-alignment (we ran this step with the CAVATICA workflow). An important feature of kallisto is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. sleuth has been designed to work seamlessly and efficiently with kallisto, and therefore RNA-Seq analysis with kallisto and sleuth is tractable on a laptop computer in a matter of minutes. More details about kallisto and sleuth are provided in the papers describing the methods:

#### Citations

* Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016), doi:10.1038/nbt.3519

* Harold Pimentel, Nicolas L Bray, Suzette Puente, Páll Melsted and Lior Pachter, Differential analysis of RNA-seq incorporating quantification uncertainty, Nature Methods 14, 687–690 (2017), doi:10.1038/nmeth.4324
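
As a rough illustration of how bootstraps can act as proxies for technical replicates, the sketch below simulates bootstrap estimates for three hypothetical transcripts and compares their variance on a log scale. The data, the transcript names and the Poisson model are assumptions made purely for illustration; real bootstrap values come from the kallisto output, and sleuth's own variance model is more involved than this.

```r
set.seed(1)

# Hypothetical data: 100 bootstrap estimates of counts for three transcripts
# with low, medium and high abundance (one column per transcript).
boot <- matrix(
  rpois(300, lambda = rep(c(5, 50, 500), each = 100)),
  nrow = 100,
  dimnames = list(NULL, c("tx_low", "tx_mid", "tx_high"))
)

# Variance across bootstraps on a log scale; the 0.5 offset avoids log(0).
boot_var <- apply(log(boot + 0.5), 2, var)
print(round(boot_var, 4))
```

In this simulation the low-abundance transcript shows a much larger variance on the log scale than the high-abundance one, which is the kind of quantification uncertainty that bootstraps expose.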

sleuth has been designed to facilitate the exploration of RNA-Seq data by utilizing the Shiny web application framework by RStudio. The worked example below illustrates how to load data into sleuth and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully “open source”.



### Introduction

Here we apply the techniques from the sleuth walkthrough to our own data, comparing AML with TAM.

We need to set up our working directory and install two packages:

* `cowplot` - for making prettier plots and plots with grids.
* `biomaRt` - for extracting the Ensembl transcript to gene mapping.



In [1]:
install.packages("cowplot")


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



In [2]:
library("cowplot")


In [3]:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.14")


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.1 (2021-08-10)

Installing package(s) 'BiocVersion'

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

Old packages: 'backports', 'BiocGenerics', 'blob', 'brew', 'brio', 'broom',
  'bslib', 'callr', 'caret', 'class', 'cli', 'clipr', 'colorspace',
  'commonmark', 'conflicted', 'covr', 'cpp11', 'crayon', 'credentials',
  'crosstalk', 'curl', 'data.table', 'DBI', 'dbplyr', 'desc', 'devtools',
  'dials', 'diffobj', 'digest', 'dplyr', 'DT', 'dtplyr', 'e1071', 'evaluate',
  'fansi', 'farver', 'forcats', 'foreach', 'forecast', 'fs', 'furrr', 'future',
  'future.apply', 'gargle', 'generics', 'gert', 'ggplot2', 'gh', 'git2r',
  'gitcreds', 'globals', 'glue', 'googleshee

In [4]:
BiocManager::install("biomaRt")

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.1 (2021-08-10)

Installing package(s) 'biomaRt'

also installing the dependencies ‘zlibbioc’, ‘GenomeInfoDbData’, ‘XVector’, ‘GenomeInfoDb’, ‘BiocGenerics’, ‘png’, ‘Biostrings’, ‘Biobase’, ‘IRanges’, ‘KEGGREST’, ‘filelock’, ‘XML’, ‘AnnotationDbi’, ‘BiocFileCache’


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

Old packages: 'backports', 'blob', 'brew', 'brio', 'broom', 'bslib', 'callr',
  'caret', 'class', 'cli', 'clipr', 'colorspace', 'commonmark', 'conflicted',
  'covr', 'cpp11', 'crayon', 'credentials', 'crosstalk', 'curl', 'data.table',
  'DBI', 'dbplyr', 'desc', 'devtools', 'dials', 'diffobj', 'digest', 'dplyr',
  'DT', 'dtplyr', 'e1071', 'evaluate', 'fansi', 'farver', 'forcats', 'foreach',
  'forecast', 'fs', 'furrr', 'future', 'f

In [None]:
BiocManager::install("DESeq2")

In [None]:
library(DESeq2)

In [None]:
library("devtools")

#### First time through: warning that rhdf5 is not available for this version of R

We looked up our R version and searched for "R 4.1.1 rhdf5".

`rhdf5` can be installed with `BiocManager::install("rhdf5")`.

The Pachter lab recommends installing `rhdf5` first.

In [None]:
BiocManager::install("rhdf5", force=TRUE)

In [None]:
library(rhdf5)


#### Issues

This problem is described in the sleuth issue tracker at https://github.com/pachterlab/sleuth/issues/259. Follow the instructions from [Paast](https://github.com/pachterlab/sleuth/issues/259#issuecomment-966270599):

Install rhdf5 as noted above.

Load the library

##### Clone sleuth and install after editing the file

Change directory to the top working directory on this ephemeral machine.

```bash
cd /sbgenomics/workspace
```

Now clone the sleuth repository:

```bash
git clone https://github.com/pachterlab/sleuth.git
```

Edit `NAMESPACE` as the instructions note: delete the last line, which references **rhdf5**, to remove the dependency.
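
The edit can also be scripted rather than done by hand. The sketch below works on a temporary stand-in file whose contents are hypothetical (the real `NAMESPACE` in the cloned repository is much longer), and it drops the last line only if that line mentions `rhdf5`. To apply it to the clone, `ns_file` would need to point at the real file, which is an assumption about where the repository was cloned.

```r
# Hypothetical stand-in for sleuth/NAMESPACE, written to a temporary file.
ns_file <- tempfile("NAMESPACE")
writeLines(c("export(sleuth_prep)",
             "import(data.table)",
             "importFrom(rhdf5,h5write.default)"),
           ns_file)

ns <- readLines(ns_file)
last <- length(ns)

# Remove the last line only when it is the rhdf5 reference we expect.
if (grepl("rhdf5", ns[last], fixed = TRUE)) {
  ns <- ns[-last]
}
writeLines(ns, ns_file)

ns_after <- readLines(ns_file)
print(ns_after)
```

Guarding on the content of the last line keeps the script from deleting an unrelated line if the file layout differs from what the instructions describe.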

And then run the install.


In [None]:
devtools::install('../../sleuth/')

In [None]:
library(sleuth)

We have successfully run Kallisto with Kallisto Quantitation.

Results may be found after running an application on Cavatica here:

```bash
/sbgenomics/project-files/
```

For this analysis we will use the results from the run using `metadata_ten_samples_only_txt`

Results are in:

```bash
/sbgenomics/project-files/ten_samples_expression_matrix.tpm.txt
```

### Parsing metadata

A sleuth analysis depends on a metadata file, which describes the experimental design, the sample names, conditions and covariates. The metadata file is external to sleuth and must be prepared prior to analysis. A metadata file should have been downloaded along with the kallisto quantifications. The first step in a sleuth analysis is loading the metadata file. You may need to change the path in `read.table` below so that it points to the metadata file (`sample_table.txt` in the original walkthrough) in the location where you downloaded the kallisto dataset. We then select the relevant columns of the metadata.

In our case, I used:

```bash
/sbgenomics/project-files/metadata_ten_samples_only.csv
```

In [None]:
metadata <- read.table('/sbgenomics/project-files/metadata_ten_samples_only.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [None]:
head(metadata, n=20)

The last sample row contains an error: `paired` should read `2`, not `NA`. The source file has since been corrected permanently, but for this run-through I copied it to a local directory:
```bash
cp /sbgenomics/project-files/metadata_ten_samples_only.csv /sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data
```

I edited the copy there, and we now read that one in.

In [None]:
metadata <- read.table('/sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data/metadata_ten_samples_only.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [None]:
head(metadata)

In [None]:
dim(metadata)

In [None]:
metadata <- dplyr::select(metadata, c('Case.ID', 'Sample.ID', 'Gender', 'Disease.type', 'Abundance'))

In [None]:
head(metadata)

In [None]:
metadata <- dplyr::distinct(metadata)

In [None]:
head(metadata)

In [None]:
dim(metadata)
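
To see why selecting columns and then removing duplicates changes the row count, here is a small sketch on made-up data. It assumes a layout like ours, with one row per FASTQ file of a paired-end sample and a header containing spaces; all values and paths are hypothetical. Base R `unique()` stands in for `dplyr::distinct()` so that the example runs without extra packages.

```r
# Hypothetical CSV with two rows per sample (paired = 1 and 2).
csv_text <- "Case ID,Sample ID,Gender,Disease type,paired,Abundance
case_A,sample_1,Male,TAM,1,/path/to/sample_1
case_A,sample_1,Male,TAM,2,/path/to/sample_1
case_B,sample_2,Female,DS-AML,1,/path/to/sample_2
case_B,sample_2,Female,DS-AML,2,/path/to/sample_2"

toy <- read.table(text = csv_text, sep = ",", header = TRUE,
                  stringsAsFactors = FALSE)

# read.table() turns "Case ID" into the syntactic name Case.ID.
print(names(toy))

# Dropping the 'paired' column makes the two rows of each sample identical,
# so removing duplicates leaves one row per sample.
keep <- c("Case.ID", "Sample.ID", "Gender", "Disease.type", "Abundance")
toy_samples <- unique(toy[, keep])

print(nrow(toy))
print(nrow(toy_samples))
```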

In [None]:
metadata <- dplyr::rename(metadata, sample = Sample.ID)

We also need to rename the `Abundance` column to `path`.

In [None]:
metadata <- dplyr::rename(metadata, path = Abundance)

In [None]:
head(metadata)
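
Before handing the table to sleuth, it can help to confirm that the `sample` and `path` columns are present and that every path exists on disk. The helper below, `check_sample_table()`, is a hypothetical name introduced for this sketch and is not part of sleuth; the toy table uses the R temporary directory as a stand-in for a kallisto output location.

```r
# Hypothetical helper: basic sanity checks on a sample-to-covariates table.
check_sample_table <- function(s2c) {
  missing_cols <- setdiff(c("sample", "path"), names(s2c))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(s2c$sample) > 0) {
    stop("sample names must be unique")
  }
  # Report, per sample, whether the path exists on this machine.
  data.frame(sample = s2c$sample,
             path_exists = file.exists(s2c$path),
             stringsAsFactors = FALSE)
}

# Toy table: the first path exists, the second does not.
toy_s2c <- data.frame(
  sample = c("sample_1", "sample_2"),
  path   = c(tempdir(), file.path(tempdir(), "no_such_kallisto_output")),
  stringsAsFactors = FALSE
)

path_report <- check_sample_table(toy_s2c)
print(path_report)
```

With the real table, `check_sample_table(metadata)` would flag any sample whose path is not reachable from the current session.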

#### biomaRt - how to use

Following instructions from the [ensembl site](https://grch37.ensembl.org/info/data/biomart/biomart_r_package.html)

In [None]:
library(biomaRt)

In [None]:
mart <- biomaRt::useMart(biomart = "ensembl",
                         dataset = "hsapiens_gene_ensembl",
                         host    = "https://useast.ensembl.org")

In [None]:
ttg <- biomaRt::getBM(
  attributes = c("ensembl_transcript_id", "transcript_version",
  "ensembl_gene_id", "external_gene_name", "description",
  "transcript_biotype"),
  mart = mart)


In [None]:
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)


In [None]:
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))
head(ttg)

The resulting table contains Ensembl gene IDs (‘ens_gene’), gene names (‘ext_gene’) and the associated transcripts (‘target_id’). Note that the gene-transcript mapping must be compatible with the transcriptome used with kallisto. In other words, to use Ensembl transcript-gene associations, kallisto must have been run using the Ensembl transcriptome.
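
One common way the mapping and the quantification can disagree is transcript version suffixes: the kallisto targets may carry a version (for example `.10`) while `target_id` in the mapping does not. The sketch below uses made-up identifiers and version numbers to show a quick overlap check; whether our kallisto output actually uses versioned IDs is an assumption that should be checked against the real abundance files.

```r
# Hypothetical transcript IDs as they might appear in kallisto output.
kallisto_ids <- c("ENST00000000233.10",
                  "ENST00000000412.8",
                  "ENST00000000442.11")

# Hypothetical unversioned IDs as returned in the target_id column.
ttg_ids <- c("ENST00000000233", "ENST00000000412", "ENST00000000442")

# Fraction of kallisto targets found in the mapping, before any fix.
overlap_before <- mean(kallisto_ids %in% ttg_ids)

# Strip a trailing ".<digits>" version suffix and check again.
stripped_ids <- sub("\\.[0-9]+$", "", kallisto_ids)
overlap_after <- mean(stripped_ids %in% ttg_ids)

print(c(before = overlap_before, after = overlap_after))
```

An alternative to stripping is to keep `transcript_version` from the biomaRt query and paste it onto `target_id`, so that the mapping matches the versioned form instead.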

#### Preparing the analysis

The next step is to build a sleuth object. The sleuth object contains the specification of the experimental design, a map describing the grouping of transcripts into genes (or other groups), and a number of user-specified parameters. In the example that follows, `metadata` is the experimental design and `target_mapping` describes the previously constructed grouping of transcripts into genes. Furthermore, we provide an `aggregation_column`, the name of the column in the ‘target_mapping’ table that is used to aggregate the transcripts. When both ‘target_mapping’ and ‘aggregation_column’ are provided, sleuth will automatically run in gene mode, returning gene differential expression results that come from the aggregation of transcript p-values.


#### Model (Design) Matrix Required
We need to supply a model (design) matrix through `full_model`, which `sleuth_prep` accepts as either an R formula or a model matrix. Designs are written in the same way as for DESeq2, so its vignette is a useful reference:

[How to use DESeq2](https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)

We want to compare the effect of gender (Male, Female) together with disease type (TAM, DS-AML) in our cases, so we have two groups (Male, Female) and two conditions (TAM, DS-AML).

In [None]:
group <- factor(metadata$Gender)
group

In [None]:
condition <- factor(metadata$Disease.type)
condition

In [None]:
full_model <- model.matrix(~group + condition + group:condition)
full_model
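
To make the structure of this design matrix easier to read, here is the same formula applied to a small made-up design with two samples in every gender and disease-type combination. The toy data and the `toy_` variable names are assumptions for illustration and do not reflect our actual sample counts.

```r
# Hypothetical balanced design: 8 samples, 2 per Gender x Disease.type cell.
toy_design <- data.frame(
  Gender       = rep(c("Female", "Male"), times = 4),
  Disease.type = rep(c("DS-AML", "TAM"), each = 2, times = 2),
  stringsAsFactors = FALSE
)

toy_group     <- factor(toy_design$Gender)        # levels: Female, Male
toy_condition <- factor(toy_design$Disease.type)  # levels: DS-AML, TAM

# Main effects plus their interaction, as in the cell above.
toy_model <- model.matrix(~ toy_group + toy_condition + toy_group:toy_condition)
print(colnames(toy_model))

# The interaction can only be estimated if every combination is observed;
# a full-rank matrix is a quick way to confirm that.
cell_counts <- table(toy_group, toy_condition)
full_rank <- qr(toy_model)$rank == ncol(toy_model)
print(cell_counts)
print(full_rank)
```

The first level of each factor (Female, DS-AML) is the reference, so the remaining columns describe the Male effect, the TAM effect and their interaction. Running the same `table()` and rank check on the real `group` and `condition` shows whether ten samples cover all four combinations.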

In [None]:
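# Stand-alone copies of the arguments that are passed to sleuth_prep() below
sample_to_covariates <- metadata
target_mapping <- ttg
aggregation_column <- "ens_gene"
gene_mode <- TRUE
extra_bootstrap_summary <- TRUE
read_bootstrap_tpm <- TRUE
# full_model was already built with model.matrix() above
normalize <- TRUE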

In [None]:
# Collect the optional arguments in a named list
extra_opts <- list(gene_mode               = gene_mode,
                   extra_bootstrap_summary = extra_bootstrap_summary,
                   read_bootstrap_tpm      = read_bootstrap_tpm,
                   full_model              = full_model,
                   normalize               = normalize)

# Use the supplied value when an option is present, otherwise fall back to a default
if ("extra_bootstrap_summary" %in% names(extra_opts)) {
  extra_bootstrap_summary <- extra_opts$extra_bootstrap_summary
} else {
  extra_bootstrap_summary <- FALSE
}
if ("read_bootstrap_tpm" %in% names(extra_opts)) {
  read_bootstrap_tpm <- extra_opts$read_bootstrap_tpm
} else {
  read_bootstrap_tpm <- FALSE
}
if ("max_bootstrap" %in% names(extra_opts)) {
  max_bootstrap <- extra_opts$max_bootstrap
} else {
  max_bootstrap <- NULL
}


In [None]:
extra_bootstrap_summary
read_bootstrap_tpm
max_bootstrap

In [None]:
names(extra_opts)

In [None]:
so <- sleuth_prep(sample_to_covariates    = metadata, 
                  target_mapping          = ttg, 
                  aggregation_column      = 'ens_gene',
                  gene_mode               = TRUE,
                  extra_bootstrap_summary = TRUE,
                  read_bootstrap_tpm      = TRUE,
                  full_model              = full_model,
                  normalize               = TRUE)


In [None]:
ttg.df <- data.frame(ttg)

In [None]:
sample_to_covariates <- as.data.frame(sample_to_covariates)
sample_to_covariates$sample <- as.character(sample_to_covariates$sample)


In [None]:
nrow(sample_to_covariates)