# Transcript quantification with kallisto

[kallisto](https://pachterlab.github.io/kallisto/about) is a tool for fast transcript quantification, with optional bootstrapping, that uses *pseudoalignment* to map reads rapidly. To quantify transcripts, we first need to build an index of the **transcriptome**.

Note: kallisto and companion tools have been installed under:

    /tigress/BEE/tools_group/bin/

Add this directory to your PATH to run the commands directly.
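
In a notebook, one way to do this is to extend PATH from a Python cell, since later `%%bash` cells inherit the kernel's environment. The following is a minimal sketch that assumes a Python kernel; the helper name `prepend_to_path` is made up for illustration.

```python
import os
import shutil

# Install location quoted above.
TOOLS_DIR = "/tigress/BEE/tools_group/bin/"


def prepend_to_path(directory):
    """Put `directory` at the front of PATH unless it is already listed."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + entries)
    return os.environ["PATH"]


prepend_to_path(TOOLS_DIR)

# shutil.which returns None if kallisto is not found on the updated PATH.
print("kallisto found at:", shutil.which("kallisto"))
```

For a regular shell session, the equivalent is an `export PATH=...` line in your shell startup file.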

First, we can obtain the transcriptome fasta files from:

http://ftp.ensembl.org/pub/grch37/release-85/fasta/homo_sapiens/

Note that for the GTEx RNA-seq samples we are still using GRCh37 (hg19). This is because the genotyping, RNA quantification, and eQTL studies of the GTEx consortium are all based on this version of the genome.

[A quick overview of the differences between GRCh37 and GRCh38](http://genomeref.blogspot.com/2013/12/announcing-grch38.html)

I downloaded both the cdna and the ncrna (non-coding RNA) data for this release, since some of our reads come from non-coding RNA as well. The files are under:

    /tigress/BEE/RNAseq/Data/Resources/transcriptome

Note that there is also a combined cdna and ncrna version:

    Homo_sapiens.GRCh37.cdna_ncrna.fa.gz

Now, we can run the commands:

In [None]:
%%bash
kallisto index -i /tigress/BEE/RNAseq/Data/kallisto/GRCh37.p13.transcriptome.idx /tigress/BEE/RNAseq/Data/Resources/transcriptome/Homo_sapiens.GRCh37.cdna.all.fa.gz

kallisto index -i /tigress/BEE/RNAseq/Data/kallisto/GRCh37.p13.transcriptome_ncrna.idx /tigress/BEE/RNAseq/Data/Resources/transcriptome/Homo_sapiens.GRCh37.cdna_ncrna.fa.gz

The output is as follows:

    [build] loading fasta file /tigress/BEE/RNAseq/Data/Resources/transcriptome/Homo_sapiens.GRCh37.cdna.all.fa.gz
    [build] k-mer length: 31
    [build] warning: clipped off poly-A tail (longer than 10)
            from 1401 target sequences
    [build] counting k-mers ... done.
    [build] building target de Bruijn graph ...  done
    [build] creating equivalence classes ...  done
    [build] target de Bruijn graph has 1022307 contigs and contains 101446106 k-mers
    
for the cdna-only version, and 

    [build] loading fasta file /tigress/BEE/RNAseq/Data/Resources/transcriptome/Homo_sapiens.GRCh37.cdna_ncrna.fa.gz
    [build] k-mer length: 31
    [build] warning: clipped off poly-A tail (longer than 10)
            from 1578 target sequences
    [build] warning: replaced 1 non-ACGUT characters in the input sequence
            with pseudorandom nucleotides
    [build] counting k-mers ... done.
    [build] building target de Bruijn graph ...  done
    [build] creating equivalence classes ...  done
    [build] target de Bruijn graph has 1202693 contigs and contains 117035043 k-mers

for the cdna-ncrna combined version.
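
To put the two build logs side by side, the short calculation below uses only the contig and k-mer counts printed above. The variable names are illustrative.

```python
# Counts copied from the two "[build] target de Bruijn graph ..." lines above.
cdna = {"contigs": 1022307, "kmers": 101446106}
cdna_ncrna = {"contigs": 1202693, "kmers": 117035043}

# Absolute and relative growth of the index when ncrna is added.
added = {key: cdna_ncrna[key] - cdna[key] for key in cdna}
growth = {key: added[key] / cdna[key] for key in cdna}

for key in ("contigs", "kmers"):
    print(f"{key}: +{added[key]:,} ({growth[key]:.1%} more than cdna only)")
```

Adding ncrna grows the index by roughly 15% in k-mers and roughly 18% in contigs.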

## Running kallisto on GTEx samples

Our target runs use the trimmed RNA-seq reads from GTEx, located at:

    /tigress/BEE/gtex/data/phenotype/expression/trimmed_rna_seq_reads/

For more details on how these were prepared, consult [Ian's notebook on the qc and trimming steps](http://nbviewer.jupyter.org/gist/IanMcDowell/adbbbc31d5c840680536).

An example quantification call against the cdna index we built earlier would be the following:

In [None]:
%%bash

kallisto quant -i /tigress/BEE/RNAseq/Data/kallisto/GRCh37.p13.transcriptome.idx -o /tigress/BEE/RNAseq/Output/kallisto/SRR1476125 \
-b 100 /tigress/BEE/gtex/data/phenotype/expression/trimmed_rna_seq_reads/SRR1476125_1.trimmed.P.fastq.bz2 \
/tigress/BEE/gtex/data/phenotype/expression/trimmed_rna_seq_reads/SRR1476125_2.trimmed.P.fastq.bz2

The result of running the above code is the following:

    [quant] fragment length distribution will be estimated from the data
    [index] k-mer length: 31
    [index] number of targets: 180,253
    [index] number of k-mers: 101,446,106
    [index] number of equivalence classes: 693,993
    [quant] running in paired-end mode
    [quant] will process pair 1: /tigress/BEE/RNAseq/Data/Expression/SRR1476125_1.trimmed.P.fastq
                                 /tigress/BEE/RNAseq/Data/Expression/SRR1476125_2.trimmed.P.fastq
    [quant] finding pseudoalignments for the reads ... done
    [quant] processed 37,417,059 reads, 31,650,885 reads pseudoaligned
    [quant] estimated average fragment length: 195.191
    [   em] quantifying the abundances ... done
    [   em] the Expectation-Maximization algorithm ran for 1,261 rounds
    [bstrp] running EM for the bootstrap: 100
    
The quantification plus 100 bootstraps took about 1.5 hours to finish on the head node (it should be much faster on the cluster with multiple cores). The version with ncrna had the following output:

    [quant] fragment length distribution will be estimated from the data
    [index] k-mer length: 31
    [index] number of targets: 215,170
    [index] number of k-mers: 117,035,043
    [index] number of equivalence classes: 808,924
    [quant] running in paired-end mode
    [quant] will process pair 1: /dev/fd/63
                                 /dev/fd/62
    [quant] finding pseudoalignments for the reads ... done
    [quant] processed 37,417,059 reads, 32,582,173 reads pseudoaligned
    [quant] estimated average fragment length: 193.658
    [   em] quantifying the abundances ... done
    [   em] the Expectation-Maximization algorithm ran for 1,298 rounds
    [bstrp] running EM for the bootstrap: 100
    
We can see that including ncrna increases the number of pseudoaligned reads by about 930,000 (from 31,650,885 to 32,582,173), bringing the rate from ~85% up to ~87% of all reads. The tradeoff is run time: the ncrna version took more than 0.5 hours longer.
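
The rates can be checked directly from the two `[quant] processed ...` lines in the logs. The snippet below is a small sketch of that arithmetic; the regular expression and the helper name `parse_counts` are illustrative and assume the log line format shown above.

```python
import re

# The two "[quant] processed ..." lines copied from the logs above.
LOG_LINES = {
    "cdna": "[quant] processed 37,417,059 reads, 31,650,885 reads pseudoaligned",
    "cdna_ncrna": "[quant] processed 37,417,059 reads, 32,582,173 reads pseudoaligned",
}

PATTERN = re.compile(r"processed ([\d,]+) reads, ([\d,]+) reads pseudoaligned")


def parse_counts(line):
    """Return (processed, pseudoaligned) as integers from a read-count log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a read-count line: {line!r}")
    processed, aligned = (int(group.replace(",", "")) for group in match.groups())
    return processed, aligned


counts = {name: parse_counts(line) for name, line in LOG_LINES.items()}
rates = {name: aligned / processed for name, (processed, aligned) in counts.items()}
gained = counts["cdna_ncrna"][1] - counts["cdna"][1]

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} pseudoaligned")
print(f"extra reads pseudoaligned with ncrna: {gained:,}")
```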

### Testing the specs of various kallisto bootstraps with an example tissue: Pancreas

Pancreas has 170 samples (148 with genotypes) in the release we are using. Let's quantify them with bootstrap samples, for both the cdna and cdna_ncrna versions:

In [4]:
%%bash

python /tigress/BEE/RNAseq/Scripts/kallisto/quantification_script_wrapper.py 5 1 40 pancreas

Samples to process: 119
sh /tigress/BEE/RNAseq/Scripts/kallisto/batch/ka_quant_7A3I9US6.sh
Don't forget to change the permissions for the master scripts before distributing to other users!


Refer to this code: [quantification_script_wrapper](https://github.com/bee-hive/bjo-notebook/blob/master/Scripts/RNAseq/kallisto/quantification_script_wrapper.py)
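
As a rough illustration of how a per-sample command could be assembled, here is a sketch that rebuilds the example `kallisto quant` call from earlier for an arbitrary sample ID. It is not taken from the wrapper script: the function name `quant_command` and its `threads` parameter are hypothetical, the paths are the ones used in this notebook, and `-t` (kallisto's thread-count option) does not appear in the original call.

```python
import shlex
from pathlib import PurePosixPath

# Paths as used earlier in this notebook.
INDEX = PurePosixPath("/tigress/BEE/RNAseq/Data/kallisto/GRCh37.p13.transcriptome.idx")
READS_DIR = PurePosixPath("/tigress/BEE/gtex/data/phenotype/expression/trimmed_rna_seq_reads")
OUT_DIR = PurePosixPath("/tigress/BEE/RNAseq/Output/kallisto")


def quant_command(sample_id, bootstraps=100, threads=1):
    """Build the argument list for one paired-end `kallisto quant` run.

    Hypothetical helper: the read file naming follows the example call above.
    """
    reads = [
        READS_DIR / f"{sample_id}_{mate}.trimmed.P.fastq.bz2" for mate in (1, 2)
    ]
    return [
        "kallisto", "quant",
        "-i", str(INDEX),
        "-o", str(OUT_DIR / sample_id),
        "-b", str(bootstraps),
        "-t", str(threads),
        *map(str, reads),
    ]


# Print the command as a single shell-quoted line (nothing is executed here).
print(shlex.join(quant_command("SRR1476125", threads=4)))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` as is or written into a batch script after quoting.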