# Pediatric DS AML vs TAM

DESeq2 Analysis with Kallisto Quantitation Input

## Following the Instructions from the Bioconductor DESeq2 Vignette
[Transcript abundance to DESeq2 Analysis](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#quick-start)



### Transcript abundance files and tximport / tximeta

Our recommended pipeline for DESeq2 is to use fast transcript abundance quantifiers upstream of DESeq2, and then to create gene-level count matrices for use with DESeq2 by importing the quantification data using tximport (Soneson, Love, and Robinson 2015). This workflow allows users to import transcript abundance estimates from a variety of external software, including the following methods:

* [Salmon](http://combine-lab.github.io/salmon/) (Patro et al. 2017)
* [Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish/) (Patro, Mount, and Kingsford 2014)
* [kallisto](https://pachterlab.github.io/kallisto/about.html) (Bray et al. 2016)
* [RSEM](http://deweylab.github.io/RSEM/) (Li and Dewey 2011)

Some advantages of using the above methods for transcript abundance estimation are: 
* (i) this approach corrects for potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013), 
* (ii) some of these methods (Salmon, Sailfish, kallisto) are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files, and 
* (iii) it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence, thus increasing sensitivity (Robert and Watson 2015).

Full details on the motivation and methods for importing transcript level abundance and count estimates, summarizing to gene-level count matrices and producing an offset which corrects for potential changes in average transcript length across samples are described in (Soneson, Love, and Robinson 2015). Note that the tximport-to-DESeq2 approach uses estimated gene counts from the transcript abundance quantifiers, but not normalized counts.
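
As a rough illustration of the gene-level summarization step, the base R sketch below sums toy transcript-level estimated counts into gene-level counts using a two-column transcript-to-gene table. All names and numbers (`tx_counts`, `tx2gene_toy`, `geneA`, and so on) are made up for illustration. tximport itself does more than this, for example producing the average-transcript-length offset mentioned above.

```r
# Toy transcript-level estimated counts: 4 transcripts x 2 samples.
# All identifiers and values are hypothetical.
tx_counts <- matrix(
  c(10, 5, 0, 20,    # sampleA
    12, 3, 1, 18),   # sampleB
  ncol = 2,
  dimnames = list(c("tx1", "tx2", "tx3", "tx4"), c("sampleA", "sampleB"))
)

# Two-column mapping: transcript id first, gene id second.
tx2gene_toy <- data.frame(
  tx   = c("tx1", "tx2", "tx3", "tx4"),
  gene = c("geneA", "geneA", "geneB", "geneB"),
  stringsAsFactors = FALSE
)

# Look up the gene for each row of the count matrix, then sum rows per gene.
gene_of_tx  <- tx2gene_toy$gene[match(rownames(tx_counts), tx2gene_toy$tx)]
gene_counts <- rowsum(tx_counts, group = gene_of_tx)
print(gene_counts)
```

The result has one row per gene and one column per sample, which is the shape DESeq2 expects for a count matrix.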

In [1]:
install.packages("readr")
library("readr")

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



In [2]:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.14")

Bioconductor version '3.14' is out-of-date; the current release version '3.15'
  is available with R version '4.2'; see https://bioconductor.org/install

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.1 (2021-08-10)

Old packages: 'backports', 'blob', 'brew', 'brio', 'broom', 'bslib', 'callr',
  'caret', 'class', 'clipr', 'colorspace', 'commonmark', 'conflicted', 'covr',
  'cpp11', 'crayon', 'credentials', 'crosstalk', 'curl', 'data.table', 'DBI',
  'dbplyr', 'desc', 'devtools', 'dials', 'diffobj', 'digest', 'dplyr', 'DT',
  'dtplyr', 'e1071', 'evaluate', 'fansi', 'farver', 'forcats', 'foreach',
  'forecast', 'fs', 'furrr', 'future', 'future.apply', 'gargle', 'generics',
  'gert', 'ggplot2', 'gh', 'git2r', 'gitcreds', 'globals', 'glue',
  'googlesheets4', 'gower', 'gtable', 'hardhat', 'haven', 'hms', 'htmltools',
  '

In [3]:
BiocManager::install("tximport")
library(tximport)


In [4]:
BiocManager::install("biomaRt")
library(biomaRt)


In [5]:
mart <- biomaRt::useMart(biomart="ensembl", 
                     dataset = "hsapiens_gene_ensembl",
                        host = "https://useast.ensembl.org")

In [6]:
ttg <- biomaRt::getBM(
  attributes = c("ensembl_transcript_id", "transcript_version",
  "ensembl_gene_id", "external_gene_name", "description",
  "transcript_biotype"),
  mart = mart)


In [7]:
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)


In [8]:
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))
head(ttg)
tx2gene <- dplyr::select(ttg, c('target_id','ext_gene'))
head(tx2gene)

Unnamed: 0_level_0,target_id,ens_gene,ext_gene
Unnamed: 0_level_1,<chr>,<chr>,<chr>
1,ENST00000387314,ENSG00000210049,MT-TF
2,ENST00000389680,ENSG00000211459,MT-RNR1
3,ENST00000387342,ENSG00000210077,MT-TV
4,ENST00000387347,ENSG00000210082,MT-RNR2
5,ENST00000386347,ENSG00000209082,MT-TL1
6,ENST00000361390,ENSG00000198888,MT-ND1


Unnamed: 0_level_0,target_id,ext_gene
Unnamed: 0_level_1,<chr>,<chr>
1,ENST00000387314,MT-TF
2,ENST00000389680,MT-RNR1
3,ENST00000387342,MT-TV
4,ENST00000387347,MT-RNR2
5,ENST00000386347,MT-TL1
6,ENST00000361390,MT-ND1


### Read in Metadata
A custom `metadata_ten_samples_only.csv` file was created by copying the original from the project-files directory.

In [33]:
metadata <- read.table('/sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data/metadata_ten_samples_only.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [35]:
head(metadata)

Unnamed: 0_level_0,File.name,Case.ID,subject,Sample.ID,sample,Gender,Disease.type,Paired.end,Abundance
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>
1,PAXSBH-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXSBH,PAXSBH,PAXSBH-03A-01R,PAXSBH-03A-01R,Female,TAM,1,PAXSBH-03A-01R.kallisto_quant.abundance.tsv
2,PAXSBH-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXSBH,PAXSBH,PAXSBH-03A-01R,PAXSBH-03A-01R,Female,TAM,2,PAXSBH-03A-01R.kallisto_quant.abundance.tsv
3,PAXWGW-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PAXWGW,PAXWGW,PAXWGW-03A-01R,PAXWGW-03A-01R,Female,TAM,1,PAXWGW-03A-01R.kallisto_quant.abundance.tsv
4,PAXWGW-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PAXWGW,PAXWGW,PAXWGW-03A-01R,PAXWGW-03A-01R,Female,TAM,2,PAXWGW-03A-01R.kallisto_quant.abundance.tsv
5,PASNSP-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r1.fq.gz,PASNSP,PASNSP,PASNSP-03A-01R,PASNSP-03A-01R,Male,TAM,1,PASNSP-03A-01R.kallisto_quant.abundance.tsv
6,PASNSP-03A-01R_RBS_withJunctionsOnGenome_dupsFlagged_r2.fq.gz,PASNSP,PASNSP,PASNSP-03A-01R,PASNSP-03A-01R,Male,TAM,2,PASNSP-03A-01R.kallisto_quant.abundance.tsv


In [36]:
metadata <- dplyr::select(metadata, c('Case.ID', 'Sample.ID', 'Gender', 'Disease.type', 'Abundance'))

In [37]:
metadata <- dplyr::distinct(metadata)

In [38]:
head(metadata)

Unnamed: 0_level_0,Case.ID,Sample.ID,Gender,Disease.type,Abundance
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>
1,PAXSBH,PAXSBH-03A-01R,Female,TAM,PAXSBH-03A-01R.kallisto_quant.abundance.tsv
2,PAXWGW,PAXWGW-03A-01R,Female,TAM,PAXWGW-03A-01R.kallisto_quant.abundance.tsv
3,PASNSP,PASNSP-03A-01R,Male,TAM,PASNSP-03A-01R.kallisto_quant.abundance.tsv
4,PASWXF,PASWXF-03A-01R,Male,TAM,PASWXF-03A-01R.kallisto_quant.abundance.tsv
5,PASXCL,PASXCL-03A-01R,Male,TAM,PASXCL-03A-01R.kallisto_quant.abundance.tsv
6,PAVZTK,PAVZTK-09A-01R,Female,DS-AML,PAVZTK-09A-01R.kallisto_quant.abundance.tsv


In [39]:
metadata <- dplyr::rename(metadata, sample = Sample.ID)

In [40]:
metadata <- dplyr::rename(metadata, path = Abundance)

In [41]:
head(metadata)

Unnamed: 0_level_0,Case.ID,sample,Gender,Disease.type,path
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>
1,PAXSBH,PAXSBH-03A-01R,Female,TAM,PAXSBH-03A-01R.kallisto_quant.abundance.tsv
2,PAXWGW,PAXWGW-03A-01R,Female,TAM,PAXWGW-03A-01R.kallisto_quant.abundance.tsv
3,PASNSP,PASNSP-03A-01R,Male,TAM,PASNSP-03A-01R.kallisto_quant.abundance.tsv
4,PASWXF,PASWXF-03A-01R,Male,TAM,PASWXF-03A-01R.kallisto_quant.abundance.tsv
5,PASXCL,PASXCL-03A-01R,Male,TAM,PASXCL-03A-01R.kallisto_quant.abundance.tsv
6,PAVZTK,PAVZTK-09A-01R,Female,DS-AML,PAVZTK-09A-01R.kallisto_quant.abundance.tsv
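
The original metadata has one row per FASTQ file, so each paired-end sample appears twice until the per-file columns are dropped and duplicates are removed. Here is a minimal base R sketch of that collapse, using a toy table with made-up sample names (`S1`, `S2`), followed by a check that each sample ends up on exactly one row:

```r
# Toy metadata with one row per read file (hypothetical values).
meta_toy <- data.frame(
  File.name    = c("S1_r1.fq.gz", "S1_r2.fq.gz", "S2_r1.fq.gz", "S2_r2.fq.gz"),
  sample       = c("S1", "S1", "S2", "S2"),
  Disease.type = c("TAM", "TAM", "DS-AML", "DS-AML"),
  path         = c("S1.abundance.tsv", "S1.abundance.tsv",
                   "S2.abundance.tsv", "S2.abundance.tsv"),
  stringsAsFactors = FALSE
)

# Drop the per-file column, then keep one copy of each remaining row.
per_sample <- unique(meta_toy[, c("sample", "Disease.type", "path")])
rownames(per_sample) <- NULL

# Guard against samples that still appear more than once.
stopifnot(!anyDuplicated(per_sample$sample))
print(per_sample)
```

If a sample still appeared twice after this step, some retained column would differ between its rows, which is worth resolving before import.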


In [17]:
BiocManager::install("tximport")


In [18]:
library(tximport)

In [19]:
BiocManager::install("rhdf5", force=TRUE)

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org


Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.1 (2021-08-10)

Installing package(s) 'rhdf5'

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

Old packages: 'backports', 'blob', 'brew', 'brio', 'broom', 'bslib', 'callr',
  'caret', 'class', 'clipr', 'colorspace', 'commonmark', 'conflicted', 'covr',
  'cpp11', 'crayon', 'credentials', 'crosstalk', 'curl', 'data.table', 'DBI',
  'dbplyr', 'desc', 'devtools', 'dials', 'diffobj', 'digest', 'dplyr', 'DT',
  'dtplyr', 'e1071', 'evaluate', 'fansi', 'farver', 'forcats', 'foreach',
  'forecast', 'fs', 'furrr', 'future', 'future.apply', 'gargle', 'generics',
  'gert', 'ggplot2', 'gh', 'git2r', 'gitcreds', 'globals', 'glue',
  'googlesheets4', 'gower', 'gtable', 'hardhat', 'haven', 'hms', 'htmltools',
  'httpuv', 'httr', 'infer', 'ipred', 'IRdisp

In [20]:
library(rhdf5)

At this point, a local copy of the file from the project-files directory was made.

### Prepping abundance data from Kallisto using the BioConductor tximport Manual options

The copy of the manual we used was built on October 16, 2022.

From the manual on tximport we see the following options:

* **files**: a character vector of filenames for the transcript-level abundances
* **type**: a character string, the type of software used to generate the abundances. Options are "salmon", "sailfish", "alevin", "kallisto", "rsem", "stringtie", or "none". This argument is used to autofill the arguments below (geneIdCol, etc.). "none" means that the user will specify these columns. Be aware that specifying a type other than "none" will ignore the arguments below (geneIdCol, etc.)
* **txIn**: a logical (TRUE/FALSE), whether the incoming files are transcript level (default TRUE)
* **txOut**: a logical (TRUE/FALSE), whether the function should just output transcript-level (default FALSE)
* **countsFromAbundance**: a character string, either "no" (default), "scaledTPM", "lengthScaledTPM", or "dtuScaledTPM". Whether to generate estimated counts using abundance estimates:
  * scaled up to library size (scaledTPM),
  * scaled using the average transcript length over samples and then the library size (lengthScaledTPM), or
  * scaled using the median transcript length among isoforms of a gene, and then the library size (dtuScaledTPM).

  dtuScaledTPM is designed for DTU analysis in combination with txOut=TRUE, and it requires specifying a tx2gene data.frame. dtuScaledTPM works such that within a gene, values from all samples and all transcripts get scaled by the same fixed median transcript length. If using scaledTPM, lengthScaledTPM, or geneLengthScaledTPM, the counts are no longer correlated across samples with transcript length, and so the length offset matrix should not be used.
* **tx2gene**: a two-column data.frame linking transcript id (column 1) to gene id (column 2). The column names are not relevant, but this column order must be used. This argument is required for gene-level summarization, and the tximport vignette describes how to construct this data.frame (see Details below). An automated solution to avoid having to create tx2gene if one has quantified with Salmon or alevin with human or mouse transcriptomes is to use the tximeta function from the tximeta Bioconductor package.
* **ignoreTxVersion**: a logical (TRUE/FALSE), whether to split the tx id on the '.' character to remove version information to facilitate matching with the tx id in tx2gene (default FALSE)
* **ignoreAfterBar**: a logical (TRUE/FALSE), whether to split the tx id on the '|' character to facilitate matching with the tx id in tx2gene (default FALSE). If txOut=TRUE it will strip the text after '|' on the rownames of the matrices
* **geneIdCol**: name of column with gene id. If missing, the tx2gene argument can be used. Note that this argument and the other four "...Col" arguments below are ignored unless type="none"
* **txIdCol**: name of column with tx id
* **abundanceCol**: name of column with abundances (e.g. TPM or FPKM)
* **countsCol**: name of column with estimated counts
* **lengthCol**: name of column with feature length information
* **importer**: a function used to read in the files
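
To make the `countsFromAbundance` options more concrete, the following base R sketch works through the arithmetic for "scaledTPM" and "lengthScaledTPM" as the manual describes them. It is an illustration on made-up numbers, not tximport's own code; the matrices `tpm`, `counts`, and `len` and their values are hypothetical.

```r
# Hypothetical gene-level inputs for 3 genes x 2 samples.
dn     <- list(c("g1", "g2", "g3"), c("s1", "s2"))
tpm    <- matrix(c(100, 300, 600,  200, 200, 600), ncol = 2, dimnames = dn)
counts <- matrix(c( 10,  30,  60,   40,  40, 120), ncol = 2, dimnames = dn)
len    <- matrix(c(1000, 2000, 500,  1000, 2000, 500), ncol = 2, dimnames = dn)

# Library size: total estimated counts per sample.
lib_size <- colSums(counts)

# scaledTPM: rescale each sample's abundances so they sum to its library size.
scaled_tpm <- t(t(tpm) * (lib_size / colSums(tpm)))

# lengthScaledTPM: first weight abundances by the average length over samples,
# then rescale each sample to its library size.
avg_len       <- rowMeans(len)
weighted      <- tpm * avg_len
length_scaled <- t(t(weighted) * (lib_size / colSums(weighted)))

print(scaled_tpm)
print(length_scaled)
```

In both cases the column sums equal the original library sizes; only the distribution of counts across genes changes.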

We are going to use the following:

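* files           - taken from the metadata file (the `Abundance` column, renamed to `path`)
* type            - "kallisto"
* txIn            - TRUE
* txOut           - FALSE
* tx2gene         - `tx2gene` (the data.frame we created with biomaRt, passed as an object rather than a quoted string)
* ignoreTxVersion - TRUE (we are going to ignore the version info after the ".")
* ignoreAfterBar  - TRUE (we will ignore the other info after the "|")
* geneIdCol       - "ext_gene"
* txIdCol         - "target_id" (from the ttg)
* abundanceCol    - "tpm"
* countsCol       - "est_counts"
* lengthCol       - "eff_length"
* importer        - left at its default (tximport reads the files with `read_tsv` from readr when readr is installed)

Because type is "kallisto" rather than "none", the five "...Col" arguments are ignored, as the manual notes above. They are listed here only to record kallisto's column names.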
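
The effect of `ignoreAfterBar` and `ignoreTxVersion` on transcript ids can be sketched in base R as below. This is an illustration of the behavior the manual describes (split on '|', then on '.'), not tximport's internal code, and the version suffixes and the bar-separated id are made-up examples.

```r
# Hypothetical transcript ids as they might appear in a quantification file.
raw_ids <- c("ENST00000387314.1",
             "ENST00000389680.2|ENSG00000211459.2|MT-RNR1")

# ignoreAfterBar: keep only the text before the first '|'.
before_bar <- sub("\\|.*", "", raw_ids)

# ignoreTxVersion: keep only the text before the first '.'.
clean_ids <- sub("\\..*", "", before_bar)

print(clean_ids)
```

After both steps the ids have the same form as the `target_id` column in `tx2gene`, which is what allows the two to be matched.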



In [28]:
library(readr)

In [44]:
metadata$path

In [45]:
setwd("/sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data/")

In [46]:
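# Pass the objects themselves, not their names in quotes. An earlier attempt
# used files = "metadata$path" and tx2gene = "tx2gene" and stopped with:
#   Error: 'metadata$path' does not exist in current working directory
#   ('/sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data').
files <- metadata$path
names(files) <- metadata$sample   # sample names become the matrix column names

txi.kallisto <- tximport(files             = files,
                         existenceOptional = TRUE,   # skips tximport's own file-existence check
                         type              = "kallisto",
                         txIn              = TRUE,
                         txOut             = FALSE,
                         tx2gene           = tx2gene,
                         ignoreTxVersion   = TRUE,
                         ignoreAfterBar    = TRUE,
                         # The "...Col" arguments are ignored unless type = "none";
                         # they are kept to document kallisto's column names.
                         geneIdCol         = "ext_gene",
                         txIdCol           = "target_id",
                         abundanceCol      = "tpm",
                         countsCol         = "est_counts",
                         lengthCol         = "eff_length")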


In [24]:
all(file.exists(metadata$path))
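
A slightly more informative version of this check reports which files are missing rather than returning a single TRUE/FALSE. The sketch below runs on a temporary directory with made-up file and sample names (`S1`, `S2`); in the notebook, `data_dir` would be the data directory and `meta_toy` would be `metadata`.

```r
# Toy setup: a temporary directory that holds only one of two expected files.
data_dir <- file.path(tempdir(), "kallisto_toy")
dir.create(data_dir, showWarnings = FALSE)
invisible(file.create(file.path(data_dir, "S1.abundance.tsv")))

# Hypothetical per-sample metadata with relative file names.
meta_toy <- data.frame(
  sample = c("S1", "S2"),
  path   = c("S1.abundance.tsv", "S2.abundance.tsv"),
  stringsAsFactors = FALSE
)

# Build full paths, label them by sample, and list the ones that do not exist.
files <- file.path(data_dir, meta_toy$path)
names(files) <- meta_toy$sample
missing_files <- files[!file.exists(files)]
print(missing_files)
```

Using full paths also removes the dependence on `setwd()`, so the check gives the same answer regardless of the current working directory.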

In [26]:
metadata$path

In [25]:
tximport

In [None]:
install.packages("cowplot")


In [None]:
BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# Alternative tx2gene built from the UCSC annotation (transcript name to gene id)
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, k, "GENEID", "TXNAME")

In [None]:
library("cowplot")


In [None]:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.14")


In [None]:
BiocManager::install("biomaRt")

In [None]:
BiocManager::install("DESeq2")

In [None]:
library(DESeq2)

In [None]:
library("devtools")

#### First time through: received a warning that rhdf5 is not available for this version of R

We looked up our R version (4.1.1) and searched for "R 4.1.1 rhdf5".

rhdf5 can be installed using `BiocManager::install("rhdf5")`.

The Pachter Lab instructions say to install `rhdf5` first.

In [None]:
BiocManager::install("rhdf5", force=TRUE)

In [None]:
library (rhdf5)


#### Issues

This problem is noted in sleuth issue https://github.com/pachterlab/sleuth/issues/259. Follow the instructions from [Paast](https://github.com/pachterlab/sleuth/issues/259#issuecomment-966270599).

Install rhdf5 as noted above.

Load the library

##### Clone sleuth and install after editing the file

Change directory to the top working directory on this ephemeral machine.

```bash
cd /sbgenomics/workspace
```

Now clone the repository:

```bash
git clone https://github.com/pachterlab/sleuth.git
```

Edit `NAMESPACE` as the instructions note: delete the last line, which references **rhdf5**, to remove the dependency.

Then run the install.


In [None]:
# Install from the clone made above in /sbgenomics/workspace
devtools::install('/sbgenomics/workspace/sleuth/')

In [None]:
library(sleuth)

We have successfully run kallisto using the Kallisto Quantitation application.

Results may be found after running an application on Cavatica here:

```bash
/sbgenomics/project-files/
```

For this analysis we will use the results from the run using `metadata_ten_samples_only_txt`

Results are in:

```bash
/sbgenomics/project-files/ten_samples_expression_matrix.tpm.txt
```

### Parsing metadata

A sleuth analysis is dependent on a metadata file, which describes the experimental design, the sample names, conditions and covariates. The metadata file is external to sleuth, and must be prepared prior to analysis. A metadata file should have been downloaded along with the kallisto quantifications. The first step in a sleuth analysis is loading the metadata file. You might need to change the path in read_table below to point to where you have downloaded the kallisto dataset, so that the path directs to the sample_table.txt. We then select the relevant columns of the metadata.

In our case, I used:

```bash
/sbgenomics/project-files/metadata_ten_samples_only.csv
```

In [None]:
metadata <- read.table('/sbgenomics/project-files/metadata_ten_samples_only.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [None]:
head(metadata, n=20)

The last sample row contains an error: its `Paired.end` value should read `2`, not `NA`. I copied the file to a local directory and corrected it there. The project copy has since been corrected permanently, but for this run-through the copy step was:
```bash
cp /sbgenomics/project-files/metadata_ten_samples_only.csv /sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data
```

I edited that copy, which is the file read in below.

In [None]:
metadata <- read.table('/sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data/metadata_ten_samples_only.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [None]:
head(metadata)

In [None]:
dim(metadata)

In [None]:
metadata <- dplyr::select(metadata, c('Case.ID', 'Sample.ID', 'Gender', 'Disease.type', 'Abundance'))

In [None]:
head(metadata)

In [None]:
metadata <- dplyr::distinct(metadata)

In [None]:
head(metadata)

In [None]:
dim(metadata)

In [None]:
metadata <- dplyr::rename(metadata, sample = Sample.ID)

We also need to rename the `Abundance` column to `path`.

In [None]:
metadata <- dplyr::rename(metadata, path = Abundance)

In [None]:
head(metadata)

#### biomaRt - how to use

Following instructions from the [ensembl site](https://grch37.ensembl.org/info/data/biomart/biomart_r_package.html)

In [None]:
library(biomaRt)

In [None]:
mart <- biomaRt::useMart(biomart="ensembl", 
                     dataset = "hsapiens_gene_ensembl",
                        host = "https://useast.ensembl.org")

In [None]:
ttg <- biomaRt::getBM(
  attributes = c("ensembl_transcript_id", "transcript_version",
  "ensembl_gene_id", "external_gene_name", "description",
  "transcript_biotype"),
  mart = mart)


In [None]:
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)


In [None]:
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))
head(ttg)

In [None]:
ttg <- read.table('/sbgenomics/workspace/pediatric-DS-AML-TAM-Analysis/data/ttg.csv', sep=",", header=TRUE, stringsAsFactors = FALSE)

In [None]:
head(ttg)

The resulting table contains Ensembl gene names (‘ens_gene’) and the associated transcripts (‘target_id’). Note that the gene-transcript mapping must be compatible with the transcriptome used with kallisto. In other words, to use Ensembl transcript-gene associations, kallisto must have been run using the Ensembl transcriptome.
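
One way to check that compatibility is to measure how many of the quantified transcript ids are found in the mapping table. The base R sketch below does this on toy data; `kallisto_ids`, `ttg_toy`, the version suffixes, and the deliberately fake last id are assumptions for illustration. In the notebook the ids would come from the `target_id` column of one `abundance.tsv` file and the table would be `ttg`.

```r
# Hypothetical ids from a kallisto abundance file; the last one is fake on purpose.
kallisto_ids <- c("ENST00000387314.1", "ENST00000389680.2", "ENST99999999999.1")

# Toy transcript-to-gene table in the same layout as ttg.
ttg_toy <- data.frame(
  target_id = c("ENST00000387314", "ENST00000389680"),
  ens_gene  = c("ENSG00000210049", "ENSG00000211459"),
  stringsAsFactors = FALSE
)

# Strip the version suffix, then test membership in the mapping table.
stripped  <- sub("\\..*", "", kallisto_ids)
is_mapped <- stripped %in% ttg_toy$target_id

fraction_mapped <- mean(is_mapped)
unmapped_ids    <- kallisto_ids[!is_mapped]
print(fraction_mapped)
print(unmapped_ids)
```

A low mapped fraction would suggest that the quantification and the mapping table come from different annotations or releases.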

#### Preparing the analysis

The next step is to build a sleuth object. The sleuth object contains specification of the experimental design, a map describing grouping of transcripts into genes (or other groups), and a number of user-specified parameters. In the example that follows, metadata is the experimental design and target_mapping describes the transcript groupings into genes previously constructed. Furthermore, we provide an aggregation_column, the name of the column in the ‘target_mapping’ table that is used to aggregate the transcripts. When both ‘target_mapping’ and ‘aggregation_column’ are provided, sleuth will automatically run in gene mode, returning gene differential expression results that came from the aggregation of transcript p-values.


In [None]:
ttg      <- data.frame(ttg)
metadata <- data.frame(metadata)

In [None]:
head(ttg)

In [None]:
head(metadata)

#### Model (Design) Matrix Required
We need to supply a model matrix. sleuth normalizes counts with the DESeq median-of-ratios method, and the DESeq2 vignette is a useful reference for writing design formulas:

[How to use DESeq2](https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)

We want to compare the effects of gender (Male, Female) and disease type (TAM, DS-AML) in our cases.

That gives two groups (Male, Female) and two conditions (TAM, DS-AML).

In [None]:
group <- factor(metadata$Gender)
group

In [None]:
condition <- factor(metadata$Disease.type)
condition

In [None]:
full_model <- model.matrix(~group + condition + group:condition)
full_model
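
To show what this formula produces, here is a small self-contained sketch with a made-up six-sample design (the sample assignments are hypothetical, not the study's). R orders factor levels alphabetically by default, so `Female` and `DS-AML` become the reference levels unless the factors are releveled.

```r
# Hypothetical design: every gender x disease combination occurs at least once.
group     <- factor(c("Female", "Female", "Male", "Male", "Female", "Male"))
condition <- factor(c("TAM", "DS-AML", "TAM", "DS-AML", "TAM", "DS-AML"))

# Main effects plus their interaction; equivalent to ~ group * condition.
design <- model.matrix(~ group + condition + group:condition)
print(colnames(design))

# The interaction is only estimable if the matrix has full column rank,
# which requires samples in all four group/condition cells.
design_rank <- qr(design)$rank
print(design_rank == ncol(design))
```

If one of the four cells had no samples, the rank would drop below the number of columns and the interaction term could not be estimated. To make TAM the reference condition instead, `relevel(condition, ref = "TAM")` can be applied before building the matrix.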

In [None]:
sample_to_covariates = metadata
target_mapping = ttg
aggregation_column = "ens_gene"
gene_mode = TRUE
extra_bootstrap_summary = TRUE
read_bootstrap_tpm = TRUE
full_model = full_model
normalize = TRUE

In [None]:
# Collect the optional settings in a named list.
extra_opts <- list(gene_mode               = gene_mode,
                   extra_bootstrap_summary = extra_bootstrap_summary,
                   read_bootstrap_tpm      = read_bootstrap_tpm,
                   full_model              = full_model,
                   normalize               = normalize)

# For each option, use the supplied value or fall back to a default.
if ("extra_bootstrap_summary" %in% names(extra_opts)) {
  extra_bootstrap_summary <- extra_opts$extra_bootstrap_summary
} else {
  extra_bootstrap_summary <- FALSE
}
if ("read_bootstrap_tpm" %in% names(extra_opts)) {
  read_bootstrap_tpm <- extra_opts$read_bootstrap_tpm
} else {
  read_bootstrap_tpm <- FALSE
}
if ("max_bootstrap" %in% names(extra_opts)) {
  max_bootstrap <- extra_opts$max_bootstrap
} else {
  max_bootstrap <- NULL
}


In [None]:
extra_bootstrap_summary
read_bootstrap_tpm
max_bootstrap

In [None]:
names(extra_opts)

In [None]:
so <- sleuth_prep(sample_to_covariates    = metadata, 
                  target_mapping          = ttg, 
                  aggregation_column      = 'ens_gene',
                  gene_mode               = TRUE,
                  extra_bootstrap_summary = TRUE,
                  read_bootstrap_tpm      = TRUE,
                  full_model              = full_model,
                  normalize               = TRUE)


In [None]:
ttg.df <- data.frame (ttg)

In [None]:
sample_to_covariates <- as.data.frame(sample_to_covariates)
sample_to_covariates$sample <- as.character(sample_to_covariates$sample)


In [None]:
nrow(sample_to_covariates)