# CellRanger

The CellRanger [website](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) has great documentation on how to process 10X data. Here, we will run CellRanger on the Macoska dataset from the previous lecture.

Since this pipeline requires a long time to run, we will submit a job rather than running it interactively like we did with dropseqtools.
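
A batch job for a run like this could look roughly like the sketch below. It assumes a PBS/Torque-style scheduler (jobs submitted with `qsub`), which may not match the current TSCC setup. The job name, core count, walltime, and the file name `run_cellranger.pbs` are placeholders, not recommended settings, so check the cluster documentation for the values and queue you should use. The snippet only writes the script to disk; it does not submit anything.

```shell
# Write a hypothetical PBS job script; all resource values are placeholders.
job_script="run_cellranger.pbs"

cat > "$job_script" <<'EOF'
#!/bin/bash
#PBS -N cellranger_count
#PBS -l nodes=1:ppn=8
#PBS -l walltime=24:00:00
#PBS -j oe

# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

module load cellranger

# Replace the <...> placeholders before submitting.
cellranger count \
    --id=<unique_id> \
    --transcriptome=<path_to_reference_transcriptome> \
    --fastqs=<path_to_directory_containing_fastq_files> \
    --sample=<prefix_of_the_sample_to_process> \
    --localcores=8
EOF

echo "Wrote $job_script; submit it with: qsub $job_script"
```

Keeping `--localcores` equal to the number of cores requested from the scheduler stops cellranger from using more CPUs than the job was allocated.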


**Before we begin processing data with cellranger, make sure you meet the following requirements:**

**1) You have logged into TSCC and have a screen running.**

**2) You have secured an interactive node through the screen. We will be executing all commands over an interactive node.**

**You should have access to the cellranger module. If you still do not have access to cellranger, please let us know. You can load the cellranger module as follows (to be performed on command line in TSCC):**

**module load cellranger**

**To test if the module was successfully loaded, simply type in "cellranger"**
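
If you prefer a check that does not print the full usage text, a small sketch like the following reports whether the `cellranger` executable is visible on your `PATH`. It assumes you have already run `module load cellranger` in the same shell; the variable name `cellranger_status` is only illustrative.

```shell
# Run "module load cellranger" first, then check that the executable resolves.
if command -v cellranger >/dev/null 2>&1; then
    cellranger_status="available at $(command -v cellranger)"
else
    cellranger_status="not found on PATH"
fi

echo "cellranger: $cellranger_status"
```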

**Processing data using CellRanger**

***cellranger mkfastq***

Illumina sequencing instruments generate per-cycle raw base call (BCL) files as primary sequencing output. The cellranger mkfastq pipeline performs the demultiplexing and conversion step for BCL files for each flowcell directory. The final output of demultiplexing is the fastq files, which can be used to perform alignments and gene expression analyses. Usually, the sequencing core will perform this step for you and return the fastq files, so we will not cover this step in this tutorial. However, you can download example datasets to test this command from [here](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq).

***cellranger count***

Cellranger count quantifies single-cell gene expression.

The cellranger count command takes FASTQ files from cellranger mkfastq and performs alignment, filtering, and UMI counting. It uses the Chromium cellular barcodes to generate gene-barcode matrices and perform clustering and gene expression analysis. cellranger count can take input from multiple sequencing runs on the same library.

*cellranger count --expect-cells= < expected number of captured cells > --id=< unique_id > --transcriptome=< path_to_reference_transcriptome > --fastqs=< path_to_directory_containing_fastq_files > --sample=< prefix_of_the_sample_to_process >*

As shown above, cellranger count takes 4 required arguments. 

--id is a unique run ID string, which can be assigned arbitrarily.

--fastqs specifies the path to the FASTQ directory, which is usually generated by mkfastq. In this case, this is the location where you downloaded the *Macoska* dataset.

--sample indicates the sample name as specified in the sample sheet. The sample name must match the prefix of the fastq files.

--transcriptome specifies the path to the Cell Ranger compatible transcriptome reference. In this case, the reference genome is hg19, which can be found here:

--expect-cells is an optional flag that specifies the expected number of cells in the sample.

**As an exercise, fill in the appropriate path and arguments to the cellranger count command**

***cellranger count --expect-cells= < expected number of captured cells > --id=< unique_id > --transcriptome=< path_to_reference_transcriptome > --fastqs=< path_to_directory_containing_fastq_files > --sample=< prefix_of_the_sample_to_process >***
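
One way to check your answer before launching a long run is to assemble the command from variables and print it first. The sketch below does only that. Every value in it (run ID, reference path, FASTQ directory, sample prefix, expected cell count) is a hypothetical placeholder that you need to replace with your own.

```shell
# All values below are hypothetical placeholders; substitute your own.
run_id="my_run_01"
transcriptome="/path/to/hg19_reference"
fastq_dir="/path/to/fastqs"
sample_prefix="SAMPLE_PREFIX"
expected_cells=3000

# Store the command as positional parameters so each argument stays one word.
set -- cellranger count \
    --id="$run_id" \
    --transcriptome="$transcriptome" \
    --fastqs="$fastq_dir" \
    --sample="$sample_prefix" \
    --expect-cells="$expected_cells"

# Dry run: show the command instead of executing it.
cmd_preview="$*"
echo "$cmd_preview"

# To actually launch the pipeline, run:  "$@"
```

Printing the command first makes it easy to spot a wrong path or a sample prefix that does not match the fastq file names.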

***How does cellranger count work?***
![cellranger_workflow](img/cellranger_workflow.png)

![output_files](img/cellranger_output_files.png)

![web_summary](img/cellranger_html_summary.png)

There are 2 principal steps to quality control on single cell data:

- remove poor quality cells
- remove genes with very sparse information

For any given gene there will be many cells where there is no observed expression. Most of the time this is a consequence of the low amount of input material. In order to properly estimate the normalization factors for each cell, we need to reduce the number of 0s contained in each cell without discarding too many genes. The easiest way is to remove genes with all 0-values, i.e. no evidence of expression in any cell. We can also set a more conservative threshold, where a gene must be expressed in at least N cells.
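
The "expressed in at least N cells" rule can be sketched with `awk` on a sparse matrix in Matrix Market coordinate format, where each data line is `gene_row cell_column count`. The matrix below is a made-up toy example, and `min_cells=2` is an arbitrary threshold chosen for illustration. A real cellranger matrix may be gzip-compressed and would need to be decompressed first.

```shell
# Toy matrix in Matrix Market coordinate format: gene_row cell_column count.
workdir=$(mktemp -d)
cat > "$workdir/matrix.mtx" <<'EOF'
%%MatrixMarket matrix coordinate integer general
3 2 5
1 1 8
2 1 2
1 2 1
2 2 6
3 2 3
EOF

min_cells=2   # hypothetical threshold N

kept_genes=$(awk -v min_cells="$min_cells" '
    /^%/   { next }                 # skip comment lines
    !dims  { dims = 1; next }       # skip the "rows cols entries" line
    $3 > 0 { ncells[$1]++ }         # count cells with non-zero expression
    END    { for (g in ncells) if (ncells[g] >= min_cells) print g }
' "$workdir/matrix.mtx" | sort -n)

echo "Gene rows kept:" $kept_genes
rm -rf "$workdir"
```

Here gene rows 1 and 2 are detected in both cells and are kept, while gene row 3 is detected in only one cell and is dropped.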

We can judge the quality of a cell by several metrics:

- Total sequencing coverage or cell library size
- Mitochondrial content: cells with high mitochondrial content may have already lysed prior to encapsulation.
- Cell sparsity, i.e. the proportion of genes in a cell with 0-values
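
As a rough illustration, these three metrics can be computed per cell with `awk` from a gene list and a sparse count matrix. The sketch below uses made-up toy files, and it assumes that mitochondrial genes can be recognised by an `MT-` prefix in the gene symbol (the convention for human gene names) and that the rows of the matrix follow the order of the gene list.

```shell
workdir=$(mktemp -d)

# Toy gene list (gene id, gene symbol); row order matches the matrix rows.
printf 'G1\tMT-CO1\nG2\tACTB\nG3\tGAPDH\n' > "$workdir/genes.tsv"

# Toy matrix in Matrix Market coordinate format: gene_row cell_column count.
cat > "$workdir/matrix.mtx" <<'EOF'
%%MatrixMarket matrix coordinate integer general
3 2 5
1 1 8
2 1 2
1 2 1
2 2 6
3 2 3
EOF

qc_table=$(awk '
    # First file: remember which gene rows are mitochondrial.
    NR == FNR { n_genes = FNR; if ($2 ~ /^MT-/) mito[FNR] = 1; next }
    /^%/      { next }               # skip comment lines
    !dims     { dims = 1; next }     # skip the "rows cols entries" line
    $3 > 0    { lib[$2] += $3; detected[$2]++; if ($1 in mito) mt[$2] += $3 }
    END {
        for (c in lib)
            printf "%s\t%d\t%.2f\t%.2f\n", c, lib[c], mt[c] / lib[c], 1 - detected[c] / n_genes
    }
' "$workdir/genes.tsv" "$workdir/matrix.mtx" | sort -n)

printf 'cell\tlibrary_size\tmito_fraction\tsparsity\n'
printf '%s\n' "$qc_table"
rm -rf "$workdir"
```

In this toy example cell 1 has 80% of its counts in the mitochondrial gene, which is the pattern that would flag it as a likely poor quality cell.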

**[cellranger reanalyze](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/reanalyze)**

The cellranger reanalyze command reruns the secondary analysis performed on the feature-barcode matrix (dimensionality reduction, clustering and visualization) using different parameter settings.
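
A reanalysis run is typically driven by a small CSV file of parameter overrides. The sketch below writes such a file and prints the matching command without running it. The run ID, the matrix path, and the parameter values are hypothetical placeholders, and the name of the `.h5` matrix file differs between Cell Ranger versions, so check the `outs` directory of your own run and the linked documentation for the parameters your version accepts.

```shell
workdir=$(mktemp -d)

# Parameter overrides, one "name,value" pair per line (values are examples only).
params_csv="$workdir/reanalyze_params.csv"
printf 'num_principal_comps,20\nmax_clusters,15\n' > "$params_csv"

# Hypothetical placeholders; substitute your own run ID and matrix path.
run_id="my_reanalysis_01"
matrix_h5="/path/to/outs/filtered_feature_bc_matrix.h5"

set -- cellranger reanalyze \
    --id="$run_id" \
    --matrix="$matrix_h5" \
    --params="$params_csv"

# Dry run: show the command instead of executing it.
reanalyze_preview="$*"
echo "$reanalyze_preview"

# To actually launch the reanalysis, run:  "$@"
```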