# CellRanger

The CellRanger [website](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) has great documentation on how to process 10X data. Here, we will run CellRanger on an example dataset:

Human peripheral blood mononuclear cells (PBMCs) of a healthy female donor aged 25-30 were obtained by 10x Genomics from AllCells.

Libraries were generated from ~16,000 cells (11,984 cells recovered) as described in the Chromium Single Cell 3' Reagent Kits User Guide (v3.1 Chemistry Dual Index) (CG000315 Rev C) using the Chromium X and sequenced on an Illumina NovaSeq 6000 to a read depth of approximately 40,000 mean reads per cell.

Paired-end, dual indexing:

    Read 1: 28 cycles (16 bp barcode, 12 bp UMI)
    i5 index: 10 cycles (sample index)
    i7 index: 10 cycles (sample index)
    Read 2: 90 cycles (transcript)
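
To make the Read 1 layout above concrete, the following minimal sketch splits a 28-base Read 1 sequence into its 16 bp cell barcode and 12 bp UMI. The function name `split_read1` and the example sequence are made up for illustration. Cell Ranger performs this step internally, together with barcode correction that is not shown here.

```python
# Read 1 layout for this library, taken from the description above:
# a 16 bp cell barcode followed by a 12 bp UMI (28 cycles in total).
BARCODE_LEN = 16
UMI_LEN = 12


def split_read1(sequence):
    """Return (cell_barcode, umi) from a Read 1 sequence."""
    expected = BARCODE_LEN + UMI_LEN
    if len(sequence) < expected:
        raise ValueError(
            f"Read 1 has {len(sequence)} bases, expected at least {expected}"
        )
    barcode = sequence[:BARCODE_LEN]
    umi = sequence[BARCODE_LEN:expected]
    return barcode, umi


# Made-up 28-base read, for illustration only.
example_read = "AAACCCAAGGTTCCAA" + "GGGTTTAAACCC"
barcode, umi = split_read1(example_read)
print(barcode, umi)
```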

Analysis parameters used: --expect-cells=10000 and --include-introns

Since this pipeline requires a long time to run, we will submit a job on the cluster rather than running it interactively on JupyterHub.


## Before we begin processing data with cellranger, make sure you have completed the following steps:

**1) Log in to TSCC.**
    
    ssh trainXY@tscc-login11.sdsc.edu
    
    Replace "XY" with your ID (24,25,26, etc)

**2) Start a "screen" session.**
    
    Check if you already have a screen session running by typing "screen -ls"
    
    If the output of the above command is "No Sockets found", start a new screen session by typing "screen"
    
    Else, connect to the existing session by typing "screen -r <ID>". The ID is the numerical ID of the existing screen session.

**3) Submit a request to the cluster for a compute node:**

    "qsub -I -l walltime=12:00:00 -l nodes=1:ppn=4 -q home-yeo"
    
    This sends a request to the cluster for a single node (nodes=1) with 4 processors (ppn=4) for 12 hours (walltime=12:00:00).
    
    Please wait for the request to be approved before proceeding with data processing

**4) Since data processing creates large amounts of data, we need to process the data in a different location on the cluster. Change your working directory as follows:**

    "cd /oasis/tscc/scratch/ucsd-trainXY"
    
    Replace "XY" with your course ID (23, 24, 25, etc.)

**5) Load the cellranger program so that we can begin processing the data**
    
    module load cellranger
    
    Test if the module was successfully loaded by typing "cellranger --version"

## Processing data using CellRanger

***cellranger mkfastq***

Illumina sequencing instruments generate per-cycle raw base call (BCL) files as primary sequencing output. The cellranger mkfastq pipeline performs the demultiplexing and conversion step for BCL files for each flowcell directory. The final output of demultiplexing is the FASTQ files, which can be used to perform alignments and gene expression analyses.

Usually, the sequencing core will perform this step for you and return the fastq files. Hence, we will not cover this step for this tutorial.

***cellranger count***

Cellranger count quantifies single-cell gene expression.

The cellranger count command takes FASTQ files from cellranger mkfastq and performs alignment, filtering, and UMI counting. It uses the Chromium cellular barcodes to generate gene-barcode matrices and perform clustering and gene expression analysis. count can take input from multiple sequencing runs on the same library.

*cellranger count --expect-cells= < expected number of captured cells > --id=< unique_id > --transcriptome=< path_to_reference_transcriptome > --fastqs=< path_to_directory_containing_fastq_files > --sample=< prefix_of_the_sample_to_process >*

As shown above, cellranger count is run here with four main arguments and one optional flag.

--id is a unique run ID string, which can be assigned arbitrarily. 

--fastqs specifies the path to the FASTQ directory, usually generated by mkfastq. In this case, it is the directory containing the PBMC FASTQ files given in the exercise below.

--sample indicates the sample name as specified in the sample sheet. The sample name must match the prefix of the FASTQ file names.

--transcriptome specifies the path to the Cell Ranger compatible transcriptome reference. In this case, the reference is GRCh38 (refdata-gex-GRCh38-2020-A), whose location is given in the exercise below.

--expect-cells is an optional flag that specifies the expected number of cells in the sample.
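
To find the value to pass to --sample, it can help to look at the FASTQ file names. FASTQ files produced by mkfastq or bcl2fastq are normally named `<sample>_S<n>_L<lane>_<read>_001.fastq.gz`, and the sample name is the part before `_S<n>`. The sketch below extracts that prefix. The helper `sample_prefixes` and the file names in the example are hypothetical and are not the actual contents of the course directory.

```python
import re

# Illumina/bcl2fastq-style naming: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L\d{3}_(?P<read>[RI][12])_001\.fastq(\.gz)?$"
)


def sample_prefixes(filenames):
    """Return the sorted sample prefixes found in a list of FASTQ file names."""
    prefixes = set()
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match:
            prefixes.add(match.group("sample"))
    return sorted(prefixes)


# Hypothetical file names, for illustration only.
files = [
    "pbmc_demo_S1_L001_R1_001.fastq.gz",
    "pbmc_demo_S1_L001_R2_001.fastq.gz",
    "pbmc_demo_S1_L002_R1_001.fastq.gz",
    "notes.txt",
]
print(sample_prefixes(files))
```

On a real directory, the list of names could come from `os.listdir(path_to_fastqs)`. If more than one prefix is returned, the directory holds several samples and --sample selects which one to process.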

**As an exercise, fill in the appropriate path and arguments to the cellranger count command**

You can find the necessary fastq files in the following directory : **/oasis/tscc/scratch/CSHL_single_cell_2022/data/rnaseq/10k_PBMC_3p_nextgem_Chromium_X_fastqs_downsampled/**

Reference genome is available here : **/oasis/tscc/scratch/CSHL_single_cell_2022/single_cell_rnaseq/refdata-gex-GRCh38-2020-A/**

Expected # of cells == **10,000**

***cellranger count --expect-cells < expected number of captured cells > --id < unique_id > --transcriptome < path_to_reference_transcriptome > --fastqs < path_to_directory_containing_fastq_files > --sample < prefix_of_the_sample_to_process > --localcores 4***
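
One way to avoid typos when filling in the template above is to assemble the command in a small script and inspect it before running it. The following is a dry-run sketch only: the helper `build_count_command` and every value passed to it are placeholders, and nothing is executed.

```python
import shlex


def build_count_command(run_id, transcriptome, fastqs, sample,
                        expect_cells=None, localcores=None):
    """Assemble a `cellranger count` argument list without running it."""
    command = [
        "cellranger", "count",
        f"--id={run_id}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastqs}",
        f"--sample={sample}",
    ]
    # Optional flags are only added when a value is supplied.
    if expect_cells is not None:
        command.append(f"--expect-cells={expect_cells}")
    if localcores is not None:
        command.append(f"--localcores={localcores}")
    return command


# All values below are placeholders; substitute the paths given in the exercise.
command = build_count_command(
    run_id="my_run",
    transcriptome="/path/to/reference",
    fastqs="/path/to/fastqs",
    sample="my_sample",
    expect_cells=10000,
    localcores=4,
)
print(shlex.join(command))
```

The printed line can be pasted into the terminal once the placeholders have been replaced. Keeping --localcores equal to the ppn value requested from the cluster avoids using more cores than were allocated.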

***How does cellranger count work?***
![cellranger_workflow](img/cellranger_workflow.png)

![output_files](img/cellranger_output_files.png)


**Cellranger web summary report**

https://assets.ctfassets.net/an68im79xiti/163qWiQBTVi2YLbskJphQX/e90bb82151b1cdab6d7e9b6c845e6130/CG000329_TechnicalNote_InterpretingCellRangerWebSummaryFiles_RevA.pdf

There are 2 principal steps to quality control on single cell data:

    remove poor quality cells
    remove genes with very sparse information

For any given gene there will be many cells where there is no observed expression. Most of the time this is a consequence of the low amount of input material. To properly estimate the normalization factors for each cell, we need to reduce the number of 0's contained in each cell without discarding too many genes. The easiest way is to remove genes with all 0-values, i.e. no evidence of expression in any cell. We can also set a more conservative threshold, where a gene must be expressed in at least N cells.
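
The gene filter described above can be sketched in a few lines. This is a toy illustration in plain Python, assuming the counts are held as a dictionary mapping each gene to a list of per-cell counts. The gene names and counts are made up, and the function `filter_genes` is not part of Cell Ranger.

```python
def filter_genes(counts, min_cells=1):
    """Keep genes detected (count > 0) in at least `min_cells` cells.

    `counts` maps gene name -> list of UMI counts, one entry per cell.
    """
    kept = {}
    for gene, values in counts.items():
        cells_expressing = sum(1 for value in values if value > 0)
        if cells_expressing >= min_cells:
            kept[gene] = values
    return kept


# Toy matrix with made-up counts: 4 genes across 5 cells.
toy_counts = {
    "GENE_A": [0, 0, 0, 0, 0],  # never detected
    "GENE_B": [3, 0, 0, 0, 0],  # detected in 1 cell
    "GENE_C": [1, 2, 0, 4, 0],  # detected in 3 cells
    "GENE_D": [5, 1, 2, 3, 7],  # detected in all 5 cells
}

# min_cells=1 drops only the all-zero gene.
print(sorted(filter_genes(toy_counts, min_cells=1)))
# min_cells=3 is the more conservative "at least N cells" threshold.
print(sorted(filter_genes(toy_counts, min_cells=3)))
```

With `min_cells=1` only GENE_A is removed. Raising the threshold to 3 also removes GENE_B, which is detected in a single cell.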

We can judge the quality of a cell by several metrics:

    Total sequencing coverage or cell library size
    Mitochondrial content - cells with high mitochondrial content may have already lysed prior to encapsulation.
    Cell sparsity - i.e. proportion of genes in a cell with 0-values
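
The three cell-level metrics above can be computed directly from a count matrix. The sketch below is a toy illustration in plain Python, assuming human gene symbols where mitochondrial genes start with "MT-". The counts are made up and the function `cell_qc_metrics` is hypothetical, not part of Cell Ranger.

```python
def cell_qc_metrics(counts, cell_index, mito_prefix="MT-"):
    """Compute library size, mitochondrial fraction and sparsity for one cell.

    `counts` maps gene name -> list of UMI counts, one entry per cell.
    """
    cell_counts = {gene: values[cell_index] for gene, values in counts.items()}
    library_size = sum(cell_counts.values())
    mito_counts = sum(
        count for gene, count in cell_counts.items()
        if gene.startswith(mito_prefix)
    )
    zero_genes = sum(1 for count in cell_counts.values() if count == 0)
    return {
        "library_size": library_size,
        # Fraction of the cell's counts that come from mitochondrial genes.
        "mito_fraction": mito_counts / library_size if library_size else 0.0,
        # Proportion of genes with a count of 0 in this cell.
        "sparsity": zero_genes / len(cell_counts) if cell_counts else 0.0,
    }


# Toy matrix with made-up counts: 4 genes across 2 cells.
toy_counts = {
    "MT-CO1": [2, 30],
    "MT-ND1": [1, 20],
    "CD3E": [40, 0],
    "LYZ": [57, 50],
}

for index in range(2):
    print(index, cell_qc_metrics(toy_counts, index))
```

In this toy example the second cell has half of its counts in mitochondrial genes and would be a candidate for removal. Suitable cutoffs depend on the tissue and protocol, so they are usually chosen after inspecting the distribution of each metric.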