In [1]:
!date

Wed May 27 01:01:23 UTC 2020


## Introduction

In this part of the course, we will discuss how to quantify droplet-based single-cell RNA-seq (dscRNA-seq) data using _alevin_. We will cover the command-line flags used by the [_alevin_](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1670-y) tool in its indexing and quantification stages, and quantify a small subset of the data from the experiment by [Hermann et al.](https://pubmed.ncbi.nlm.nih.gov/30404016/).

## Reference Transcriptome

Alevin aligns dscRNA-seq reads to the transcriptome rather than to the genome.
Under the hood, alevin uses [Salmon's](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/) [selective-alignment](https://www.biorxiv.org/content/10.1101/657874v2) infrastructure to generate the alignments, and it starts by _indexing_ the reference transcriptome.
In this tutorial we will use a small reference transcriptome, generated by subsetting the mouse transcriptome to the transcripts from chromosomes 18 and 19;
it has already been copied into your environment.  
**NOTE**: The full transcriptome can be downloaded from https://www.gencodegenes.org/ .

Let's start by checking that we can access salmon and the required data in our environment.  
**NOTE**: `!` enables the bash command mode for a line in the ipython notebook  
**NOTE**: `%%bash` enables the bash command mode for the cell in the ipython notebook

In [5]:
!salmon --help

salmon v1.2.1

Usage:  salmon -h|--help or 
        salmon -v|--version or 
        salmon -c|--cite or 
        salmon [--no-version-check] <COMMAND> [-h | options]

Commands:
     index Create a salmon index
     quant Quantify a sample
     alevin single cell analysis
     swim  Perform super-secret operation
     quantmerge Merge multiple quantifications into a single file


In [6]:
%%bash
ls data/spermatogenesis_subset
head -4 data/spermatogenesis_subset/GRCm38.gencode.vM21.chr18.chr19.txome.fa

AdultMouseRep3sub1M_S1_L001_R1_001.fastq.gz
AdultMouseRep3sub1M_S1_L001_R2_001.fastq.gz
GRCm38.gencode.vM21.chr18.chr19.genome.fa
GRCm38.gencode.vM21.chr18.chr19.gtf
GRCm38.gencode.vM21.chr18.chr19.tgMap.txt
GRCm38.gencode.vM21.chr18.chr19.txome.fa
>ENSMUST00000234132.1|ENSMUSG00000117547.1|OTTMUSG00000072753.1|OTTMUST00000176063.1|AC125218.3-201|AC125218.3|252|processed_pseudogene|
CCTTAACCATAGGTACAGGTAATCAACTCAGAATGAAAAGCCAGTAGCTATGAACAAGGCGGAGGTGCCACTGCTAACCC
TGTGGCCACAGCACCCTTACCGCAGCTCTCAAGTGAGATTGAACGCCTCATGAGTCAGGGTTATTACTACCAGGACATTC
AGAAATCTCTGGTCATTGCCCAAAACAACATTGAGATTGCTAAAAACATCCTCCAGGAATTTGTTTCTATTTCTTCTCCT


## Salmon Indexing

Indexing is the process by which salmon preprocesses the reference sequences and stores them in an efficient data structure designed to optimize alignment speed and accuracy. Salmon follows a k-mer-based indexing approach (more discussion to follow), which is invoked with the `salmon index` command. Understanding a tool's command-line flags is important for tuning its efficiency and customizing it to your use case. Let's look in detail at some of the frequently used command-line flags and index the subsampled transcriptome.
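To make the idea of a k-mer-based index concrete, here is a minimal Python sketch. It is only an illustration: it stores every k-mer of every transcript in a plain dictionary, whereas salmon's real index (described in the papers listed in this section) is far more compact. The function name `build_kmer_index` and the toy sequences are made up for this example, and `k=5` is used instead of salmon's default of 31 so that the toy sequences stay short.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k=31):
    """Map every k-mer to the (transcript, position) pairs where it occurs.

    Toy illustration only; this is not salmon's actual data structure.
    """
    index = defaultdict(list)
    for name, seq in transcripts.items():
        # Slide a window of width k along the transcript.
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


# Hypothetical toy transcripts, not taken from the tutorial data.
toy_transcripts = {"tx1": "ACGTACGTAC", "tx2": "CGTACGGA"}
kmer_index = build_kmer_index(toy_transcripts, k=5)

# A k-mer shared by both transcripts points to every place it occurs.
print(kmer_index["CGTAC"])
```

Looking up a k-mer from a read in such a table gives candidate transcripts and positions, which is the starting point for alignment.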

In [7]:
! salmon index --help

Version Info: This is the most recent version of salmon.

Index
Creates a salmon index.

Command Line Options:
  -v [ --version ]              print version string
  -h [ --help ]                 produce help message
  -t [ --transcripts ] arg      Transcript fasta file.
  -k [ --kmerLen ] arg (=31)    The size of k-mers that should be used for the 
                                quasi index.
  -i [ --index ] arg            salmon index.
  --gencode                     This flag will expect the input transcript 
                                fasta to be in GENCODE format, and will split 
                                the transcript name at the first '|' character.
                                These reduced names will be used in the output 
                                and when looking for these transcripts in a 
                                gene to transcript GTF.
  --features                    This flag will expect the input reference to be
                             

### Some papers about indexing reference sequences.
* rapmap paper: https://academic.oup.com/bioinformatics/article/32/12/i192/2288985
* pufferfish paper: https://academic.oup.com/bioinformatics/article/34/13/i169/5045749
* selective-alignment paper: https://www.biorxiv.org/content/10.1101/138800v2
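According to the help text above, `--gencode` splits each transcript name at the first `|` character and uses the reduced name in the output. A minimal Python sketch of that renaming, applied to the first FASTA header of our transcriptome, could look like this. The helper `reduced_name` is hypothetical and only mirrors the described behaviour; it is not salmon's code.

```python
def reduced_name(fasta_header):
    """Return the part of a GENCODE FASTA header before the first '|'."""
    # Drop the leading '>' that marks a FASTA header line, if present.
    name = fasta_header[1:] if fasta_header.startswith(">") else fasta_header
    return name.split("|", 1)[0]


# First header of the chr18/chr19 transcriptome used in this tutorial.
header = (
    ">ENSMUST00000234132.1|ENSMUSG00000117547.1|OTTMUSG00000072753.1|"
    "OTTMUST00000176063.1|AC125218.3-201|AC125218.3|252|processed_pseudogene|"
)
print(reduced_name(header))  # ENSMUST00000234132.1
```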

In [8]:
! salmon index -t data/spermatogenesis_subset/GRCm38.gencode.vM21.chr18.chr19.txome.fa -k 31 -i data/spermatogenesis_subset/salmon_index --gencode -p 2 

Version Info: This is the most recent version of salmon.
index ["data/spermatogenesis_subset/salmon_index"] did not previously exist  . . . creating it
[2020-05-27 01:39:13.270] [jLog] [info] building index
out : data/spermatogenesis_subset/salmon_index
[2020-05-27 01:39:13.270] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers

[2020-05-27 01:39:13.705] [puff::index::jointLog] [info] Replaced 0 non-ATCG nucleotides
[2020-05-27 01:39:13.705] [puff::index::jointLog] [info] Clipped poly-A tails from 67 transcripts
wrote 7879 cleaned references
[2020-05-27 01:39:13.743] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2020-05-27 01:39:13.880] [puff::index::jointLog] [info] ntHll estimated 6976516 distinct k-mers, setting filter size to 2^27
Threads = 2
Vertex length = 31
Hash functions = 5
Filter size = 134217728
Capacity = 2
Files: 
data/spermatogenesis

In [9]:
! ls data/spermatogenesis_subset/salmon_index

complete_ref_lens.bin	info.json	  rank.bin	       refseq.bin
ctable.bin		mphf.bin	  refAccumLengths.bin  seq.bin
ctg_offsets.bin		pos.bin		  ref_indexing.log     versionInfo.json
duplicate_clusters.tsv	pre_indexing.log  reflengths.bin


In [10]:
! head -4 data/spermatogenesis_subset/salmon_index/duplicate_clusters.tsv

RetainedRef	DuplicateRef
ENSMUST00000198203.1	ENSMUST00000199618.1
ENSMUST00000235145.1	ENSMUST00000237994.1
ENSMUST00000236485.1	ENSMUST00000237580.1


## Understanding the Input data

Droplet-based single-cell sequencing experiments, such as Drop-seq and 10x Chromium, typically generate a set of paired-end (PE) FASTQ files. Depending on the protocol, the library is built with a fixed cellular barcode (CB) length and unique molecular identifier (UMI) length: typically 16 & 10 bases for 10x V2, 16 & 12 for 10x V3, and 12 & 8 for Drop-seq.  
The PE reads come as two files, typically recognized by the `R1` and `R2` tags in their names. The `R1` file contains the concatenated CB and UMI sequence, while the `R2` file contains the read sequence from the transcript.

In [6]:
!zcat data/spermatogenesis_subset/AdultMouseRep3sub1M_S1_L001_R1_001.fastq.gz | head -4 

@J00167:56:HK2GNBBXX:6:1227:13352:9684 1:N:0:0
TTGACTTGTGAGGGAGTGCCCTGCTG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJ

gzip: stdout: Broken pipe
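The `R1` read printed above is 26 bases long, which matches a 16-base CB followed by a 10-base UMI (10x V2). As a rough sketch of how such a read decomposes, the Python snippet below slices it into its two parts. The function `split_barcode_read` and the protocol keys are names made up for this example; alevin performs this step internally based on the protocol flag.

```python
# (CB length, UMI length) per protocol; the key names are hypothetical.
PROTOCOL_LENGTHS = {"10x_v2": (16, 10), "10x_v3": (16, 12)}


def split_barcode_read(r1_seq, protocol="10x_v2"):
    """Split an R1 sequence into (cellular barcode, UMI)."""
    cb_len, umi_len = PROTOCOL_LENGTHS[protocol]
    if len(r1_seq) < cb_len + umi_len:
        raise ValueError("R1 read is shorter than CB + UMI for this protocol")
    return r1_seq[:cb_len], r1_seq[cb_len:cb_len + umi_len]


# The first R1 read of the subset shown above.
cb, umi = split_barcode_read("TTGACTTGTGAGGGAGTGCCCTGCTG")
print("CB :", cb)   # TTGACTTGTGAGGGAG
print("UMI:", umi)  # TGCCCTGCTG
```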


In [7]:
!zcat data/spermatogenesis_subset/AdultMouseRep3sub1M_S1_L001_R2_001.fastq.gz | head -4 

@J00167:56:HK2GNBBXX:6:1227:13352:9684 3:N:0:0
AGAAGAGCCTGGACAGATGTTATACAGACACTAAGAGAACACAAATTCCAGCCCAGGCTACTATACCCAGCCAACTCTCAATTACCATAGATGGAGAAAC
+
AAFFFJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJ<JJJJJJJJJJJJJJJJJJJJJJJFJJFFJJFFJJJJJJJJJJJ

gzip: stdout: Broken pipe


In [11]:
! head -4 data/spermatogenesis_subset/GRCm38.gencode.vM21.chr18.chr19.tgMap.txt

ENSMUST00000234132.1	AC125218.3
ENSMUST00000176956.1	Vmn1r-ps151
ENSMUST00000176452.1	Vmn1r-ps152
ENSMUST00000234774.1	AC125218.2
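The file printed above has two tab-separated columns: a transcript ID and a gene name. For the first line, both values also appear in the GENCODE FASTA header of that transcript shown earlier (first and sixth `|`-separated fields). How this particular file was produced is not stated here, so the Python sketch below is only one plausible way to derive such a line from a header; `tgmap_entry` is a hypothetical helper.

```python
def tgmap_entry(fasta_header):
    """Build a 'transcript<TAB>gene' line from a GENCODE FASTA header.

    Assumes the field layout seen in this tutorial's transcriptome:
    field 0 is the transcript ID and field 5 is the gene name.
    """
    fields = fasta_header.lstrip(">").split("|")
    transcript_id, gene_name = fields[0], fields[5]
    return f"{transcript_id}\t{gene_name}"


header = (
    ">ENSMUST00000234132.1|ENSMUSG00000117547.1|OTTMUSG00000072753.1|"
    "OTTMUST00000176063.1|AC125218.3-201|AC125218.3|252|processed_pseudogene|"
)
print(tgmap_entry(header))
```

The transcript names in this file must match the names stored in the index, which is why the index was built with `--gencode`.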


## dscRNA-seq Quantification w/ alevin

Now that we have a basic understanding of some of the inputs alevin requires to quantify dscRNA-seq data, let's take a closer look at some of the frequently used command-line flags (options) of the `salmon alevin` command.

#### Some useful links
* libtype: https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype
* single-cell protocol type: https://github.com/COMBINE-lab/salmon/blob/master/include/SingleCellProtocols.hpp#L28-L84

In [12]:
! salmon alevin --help

Version Info: This is the most recent version of salmon.

alevin
salmon-based processing of single-cell RNA-seq data.

alevin options:


mapping input options:
  -l [ --libType ] arg                  Format string describing the library 
                                        type
  -i [ --index ] arg                    salmon index
  -r [ --unmatedReads ] arg             List of files containing unmated reads 
                                        of (e.g. single-end reads)
  -1 [ --mates1 ] arg                   File containing the #1 mates
  -2 [ --mates2 ] arg                   File containing the #2 mates


alevin-specific Options:
  -v [ --version ]                      print version string
  -h [ --help ]                         produce help message
  -o [ --output ] arg                   Output quantification directory.
  -p [ --threads ] arg (=2)             The number of threads to use 
                                        concurrently.
  --tgMap arg                    

In [14]:
! salmon alevin -lISR \
-1 data/spermatogenesis_subset/AdultMouseRep3sub1M_S1_L001_R1_001.fastq.gz -2 data/spermatogenesis_subset/AdultMouseRep3sub1M_S1_L001_R2_001.fastq.gz \
--chromium \
-i data/spermatogenesis_subset/salmon_index \
-p 2 \
-o data/spermatogenesis_subset/alevin_output \
--tgMap data/spermatogenesis_subset/GRCm38.gencode.vM21.chr18.chr19.tgMap.txt \
--expectCells 1000

Version Info: This is the most recent version of salmon.
Logs will be written to data/spermatogenesis_subset/alevin_output/logs
[2020-05-27 02:01:24.257] [jointLog] [info] setting maxHashResizeThreads to 2
[2020-05-27 02:01:24.257] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2020-05-27 02:01:24.257] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation.
[2020-05-27 02:01:24.257] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2020-05-27 02:01:24.257] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter.  Disabling range-factorized equivalence classes.
[2020-05-27 02:01:24.257] [jointLog] [info] Usage of --validateMappings i

## Understanding alevin output

In [15]:
! ls data/spermatogenesis_subset/alevin_output/alevin

alevin.log	 quants_mat_cols.txt  quants_mat_rows.txt
featureDump.txt  quants_mat.gz	      quants_tier_mat.gz


After quantification completes successfully, alevin creates a folder named `alevin` inside the output folder given by `-o` in the `salmon alevin` command; it contains the cell-by-gene count matrix. A brief summary of the generated files follows:
* `alevin.log`: This file contains the logs generated by alevin while quantifying the dscRNA-seq data. It is very useful for debugging.
* `featureDump.txt`: This is a `tsv` file with per-cell summary statistics. It is used for whitelisting the cells.  

In [16]:
! head -4 data/spermatogenesis_subset/alevin_output/alevin/featureDump.txt

CB	CorrectedReads	MappedReads	DeduplicatedReads	MappingRate	DedupRate	MeanByMax	NumGenesExpressed	NumGenesOverMean
CCTACCAGTAGCCTAT	3491	4	3	0.0011458	0.25	1	1	0
GAGTCCGGTCGTCTTC	2706	4	4	0.0014782	0	0.666667	3	1
TCGAGGCCATTAGGCT	1840	2	1	0.00108696	0.5	1	1	0
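The column definitions are not spelled out here, but for the three rows shown the two rate columns are consistent with `MappingRate = MappedReads / CorrectedReads` and `DedupRate = 1 - DeduplicatedReads / MappedReads`. The Python sketch below checks this; these relations are inferred from the printed rows, not taken from alevin's documentation, and only the first six columns are copied.

```python
import csv
import io
import math

# First six columns of the featureDump.txt rows shown above.
FEATURE_DUMP = (
    "CB\tCorrectedReads\tMappedReads\tDeduplicatedReads\tMappingRate\tDedupRate\n"
    "CCTACCAGTAGCCTAT\t3491\t4\t3\t0.0011458\t0.25\n"
    "GAGTCCGGTCGTCTTC\t2706\t4\t4\t0.0014782\t0\n"
    "TCGAGGCCATTAGGCT\t1840\t2\t1\t0.00108696\t0.5\n"
)


def recompute_rates(row):
    """Recompute (mapping rate, dedup rate) from the raw read counts."""
    corrected = int(row["CorrectedReads"])
    mapped = int(row["MappedReads"])
    dedup = int(row["DeduplicatedReads"])
    mapping_rate = mapped / corrected if corrected else 0.0
    dedup_rate = 1 - dedup / mapped if mapped else 0.0
    return mapping_rate, dedup_rate


rows = list(csv.DictReader(io.StringIO(FEATURE_DUMP), delimiter="\t"))
for row in rows:
    mapping_rate, dedup_rate = recompute_rates(row)
    # Compare with the values reported by alevin (printed with limited precision).
    ok = math.isclose(mapping_rate, float(row["MappingRate"]), rel_tol=1e-4) and \
        math.isclose(dedup_rate, float(row["DedupRate"]), abs_tol=1e-9)
    print(row["CB"], round(mapping_rate, 7), dedup_rate, "matches" if ok else "differs")
```

The very low mapping rates are expected in this exercise, because the reads are aligned against transcripts from only two chromosomes.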


* `quants_mat_cols.txt`: The alevin output matrix is cell by gene. This file contains the ordered list of genes, one per column of the matrix.

In [17]:
! head -4 data/spermatogenesis_subset/alevin_output/alevin/quants_mat_cols.txt

AC125218.3
Vmn1r-ps151
Vmn1r-ps152
AC125218.2


* `quants_mat_rows.txt`: The alevin output matrix is cell by gene. This file contains the ordered list of cellular barcodes, one per row of the matrix.

In [18]:
! head -4 data/spermatogenesis_subset/alevin_output/alevin/quants_mat_rows.txt

CCTACCAGTAGCCTAT
GAGTCCGGTCGTCTTC
TCGAGGCCATTAGGCT
CAACCAACACAAGACG
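Together, the two text files label the axes of the count matrix: entry `(i, j)` belongs to the `i`-th barcode in `quants_mat_rows.txt` and the `j`-th gene in `quants_mat_cols.txt`. The Python sketch below reads both files into lists. The helper names are made up for this example, and so that it runs anywhere it writes the first four lines shown above into a temporary folder instead of reading the real output folder.

```python
import tempfile
from pathlib import Path


def read_labels(path):
    """Read one label per line, skipping empty lines."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def matrix_labels(alevin_dir):
    """Return (row labels, column labels) of the cell-by-gene matrix."""
    alevin_dir = Path(alevin_dir)
    barcodes = read_labels(alevin_dir / "quants_mat_rows.txt")
    genes = read_labels(alevin_dir / "quants_mat_cols.txt")
    return barcodes, genes


# Stand-in for the real output folder, filled with the lines printed above.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "quants_mat_rows.txt").write_text(
        "CCTACCAGTAGCCTAT\nGAGTCCGGTCGTCTTC\nTCGAGGCCATTAGGCT\nCAACCAACACAAGACG\n"
    )
    Path(tmp, "quants_mat_cols.txt").write_text(
        "AC125218.3\nVmn1r-ps151\nVmn1r-ps152\nAC125218.2\n"
    )
    barcodes, genes = matrix_labels(tmp)

# The matrix shape is (number of barcodes, number of genes).
print((len(barcodes), len(genes)))
# Entry (1, 2) of the matrix belongs to this cell and gene.
print(barcodes[1], genes[2])
```

On the real data, the same call would take `data/spermatogenesis_subset/alevin_output/alevin` as its argument.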


* `quants_mat.gz`: This file contains the output cell-by-gene count matrix in a *binary* (non-human-readable) format. It is stored in the compressed [EDS](https://github.com/COMBINE-lab/EDS) format, which is compact on disk and can be loaded back into memory efficiently. The benchmarks below compare it with other common single-cell formats.

![Time](https://github.com/COMBINE-lab/EDS/raw/master/benchmarks/time.jpg)
![Size](https://github.com/COMBINE-lab/EDS/raw/master/benchmarks/size.jpg)
![Memory](https://github.com/COMBINE-lab/EDS/raw/master/benchmarks/memory.jpg)

## Summary

In this exercise we learned:
* The input data requirements for alevin.
* The main command-line flags for running alevin and how to use them.
* The output formats generated by alevin.

In the next exercise, we'll learn how to import alevin-quantified data into an R environment and process it for downstream analysis.