# Contents
[1. How to use this notebook](#overview)

[2. scRNA-Seq analysis pipeline](#pipeline)

[3. Importing your data](#import)

[4. Results summary](#summary)

[5. Filtering cells and clustering](./scRNASeq_5_Filter_Cells.ipynb)

[6. Aggregate clustering](./scRNASeq_6_aggregate_cluster.ipynb)

[7. Differential expression](./scRNASeq_7_DE.ipynb)


# 1. How to use this notebook <a class="anchor" id="overview"></a>

This is a [Jupyter notebook](https://jupyter.org/) for downstream analysis of single cell RNA-Seq (scRNA-Seq) data that has been initially processed by the [10x Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) pipeline.

Most sections of this notebook can be opened as separate Jupyter notebooks (by clicking on the link in the table of contents) and are designed to be run completely independently. This also means that some of the initial code steps, such as defining your working directory, are repeated in each section.

Jupyter notebooks are interactive documents that contain 'live code', which allows the user to complete an analysis by running code 'cells', which can be modified, updated or added to by the user.

Each Jupyter notebook is based on a specific 'kernel', or analysis environment (usually a programming language). This particular notebook is based on R. To see which version of R this notebook uses, and as an example of running a code cell, click on the cell below and press the 'Run' button (top of the page).

In [None]:
R.Version()

You should now see details about the version of R installed for this notebook. Every other code cell can be run in the same way. There are two main types of cells: plain R code (as seen above) and [markdown](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf).

Markdown is a simple language for formatting text. The instructions (including this cell), headings, etc. are written in markdown cells. You can add additional cells (markdown or R) by clicking on the plus sign and then selecting 'Markdown' or 'Code' in the dropdown box. This way you can add your own analysis code cells or your own notes in markdown.

# 2. scRNA-Seq analysis pipeline <a class="anchor" id="pipeline"></a>



**IMPORTANT: This notebook is for downstream analysis of [10x Genomics single cell RNA-Seq](https://www.10xgenomics.com/products/single-cell-gene-expression) data. It assumes that the initial, upstream analysis has been completed using the 10x Genomics analysis tool, [Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger).**

![](https://support.10xgenomics.com/img/single-cell-gex/gex-analysis-tour-1.png)

Processing of 10x scRNA-Seq data using Cell Ranger involves 3 main steps:

1. [cellranger mkfastq](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq) is used to demultiplex raw base call data files and convert these to fastq format files.

2. [cellranger count](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count) is the core analysis tool. It aligns sequences to a reference genome and quantifies the number of aligned sequences per genomic feature (e.g. gene). Differentially expressed genes are calculated from this data.

3. [cellranger aggr](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/aggregate) aggregates the output from multiple cellranger count runs. Counts are re-quantified based on relative library size (normalisation) and gene expression is re-calculated.

Output directories and files from Cell Ranger follow a standard directory structure and naming convention, which allows the code in this notebook to be run on any Cell Ranger output with minimal modifications (e.g. experiment-specific information such as sample groups).

## R analysis

Cell Ranger is usually run on a Linux server or cloud service. The Cell Ranger output is used in this R-based Jupyter workflow to do downstream analysis (generate figures, statistics, etc) on this data.

The main R package used in this workflow is [Seurat: 'R toolkit for single cell genomics'](https://satijalab.org/seurat/). Seurat can directly import 10x Genomics datasets and analyse them in a wide variety of ways. See the Seurat website for more details.
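
As a rough illustration of the direct import mentioned above, the sketch below wraps two Seurat functions, `Read10X()` and `CreateSeuratObject()`. The helper name `load_10x_sample`, the example path and the filtering thresholds are assumptions made for this example and are not taken from this workflow; the later notebooks may import the data differently.

```r
# Sketch only: wraps the Seurat functions Read10X() and CreateSeuratObject().
# The helper name, the default thresholds and the example path are assumptions.
load_10x_sample <- function(matrix_dir, project = "scRNAseq",
                            min_cells = 3, min_features = 200) {
  if (!dir.exists(matrix_dir)) {
    stop("Matrix directory not found: ", matrix_dir)
  }
  if (!requireNamespace("Seurat", quietly = TRUE)) {
    stop("The Seurat package is not installed in this kernel.")
  }
  # Read10X() returns a sparse gene-by-cell count matrix. For runs with several
  # feature types it returns a named list instead, which this sketch does not handle.
  counts <- Seurat::Read10X(data.dir = matrix_dir)
  Seurat::CreateSeuratObject(counts = counts, project = project,
                             min.cells = min_cells, min.features = min_features)
}

# Example call (hypothetical path; point it at your own feature-barcode matrix directory):
# liver <- load_10x_sample("path/to/filtered_feature_bc_matrix", project = "Liver")
```

The directory check runs before Seurat is touched, so a mistyped path fails with a clear message rather than an import error.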

To cite Seurat:

>Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W.M., Zheng, S., Butler, A., Lee, M.J., Wilk, A.J., Darby, C., Zager, M. and Hoffman, P., 2021. Integrated analysis of multimodal single-cell data. Cell, 184(13), pp.3573-3587.

## cloupe files

The Cell Ranger output includes a 'cloupe' file, which is a 10x database designed to be opened and viewed in 10x's Loupe Browser:

https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/what-is-loupe-cell-browser

The Loupe browser can easily be installed on a Windows or macOS computer. This can be used as an alternative to using the R-based analysis in this notebook.

# 3. Importing your data <a class="anchor" id="import"></a>

This analysis pipeline imports several database files that have been generated by Cell Ranger. Cell Ranger creates a default directory structure, described here: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/overview

The pipeline is written in R, which reads files from a [working directory](https://intro2r.com/work-d.html) that is defined in the script. This means anyone who has Cell Ranger data can set the working directory in R to the location of this data, and the script will then automatically find the database files needed to complete the analysis.

**NOTE** At the time of writing (27-04-2022) the QUT Jupyter Hub is not connected to the main QUT file storage system, so anyone following this pipeline first needs to upload their Cell Ranger data to the Jupyter Hub. In the left-hand panel of this notebook is an 'upload' button; click on it to upload files. Unfortunately, Jupyter can only upload single files, so you will need to 1) zip your entire Cell Ranger output directory, 2) upload this single zip file to Jupyter, and 3) extract the zip file using a terminal in Jupyter (the 'launcher' tab has a button to launch a terminal). These 3 steps are fairly easy to figure out with a small amount of Googling, or contact Paul Whatmore at eResearch (paul.whatmore@qut.edu.au) if you are still having difficulties.

We are currently working on integrating QUT's file system with this Jupyter Hub. When this is complete you will simply need to set the location of your Cell Ranger results as the working directory (i.e. no uploading and unzipping of files on the Jupyter Hub).
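
If you prefer to stay in R for the extraction step of the upload procedure, base R's `utils::unzip()` can be used instead of a terminal. The following is a minimal sketch: the helper name `extract_upload`, the zip file name and the choice of `~/scDATA` as the destination are assumptions for illustration, and the resulting layout depends on how the archive was zipped.

```r
# Hypothetical helper: extract an uploaded zip archive with base R's utils::unzip().
extract_upload <- function(zip_path, dest_dir = "~/scDATA") {
  zip_path <- path.expand(zip_path)
  dest_dir <- path.expand(dest_dir)
  if (!file.exists(zip_path)) {
    stop("Zip file not found: ", zip_path)
  }
  # Create the destination if needed, then extract everything into it.
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  extracted <- utils::unzip(zip_path, exdir = dest_dir)
  invisible(extracted)  # paths of the extracted files
}

# Example call (hypothetical file name):
# extract_upload("~/cellranger_output.zip")
```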

## Set your working directory and check your data

To make sure you have everything correctly set up (i.e. all your Cell Ranger database files) to run the following analysis, we will now check your data.

R needs a base directory to work from (a 'working directory'). Set this to be the 'scDATA' directory by running the code cell below.

In [None]:
setwd("~/scDATA")

<mark><font color="red">**Fazeleh, your datasets are both already there. You can skip the data import and instead run one of the below cells to set the working directory to process either your first or second dataset. If you want to switch between datasets, just run the 'setwd' code cell for that dataset.**</font></mark>

Run one of the two cells below to set the working directory for the dataset you want to process. Re-run the other cell whenever you want to switch datasets.

In [None]:
setwd("~/Fazeleh/Dataset1/scDATA")

In [None]:
setwd("~/Fazeleh/Dataset2/scDATA")
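
An alternative to keeping one `setwd` cell per dataset is to record a root directory once and build each dataset path from it. The sketch below assumes the `<root>/Fazeleh/<dataset>/scDATA` layout used in the cells above; the names `root_dir`, `dataset_path` and `switch_dataset` are hypothetical helpers introduced for this example.

```r
# Assumption: the datasets live under the home directory, as in the setwd() cells above.
root_dir <- path.expand("~")

# Build the path to a dataset's scDATA directory from the recorded root.
dataset_path <- function(dataset, root = root_dir) {
  file.path(root, "Fazeleh", dataset, "scDATA")
}

# Change the working directory to a dataset, failing early if it does not exist.
switch_dataset <- function(dataset, root = root_dir) {
  target <- dataset_path(dataset, root)
  if (!dir.exists(target)) {
    stop("Directory not found: ", target)
  }
  setwd(target)
  invisible(target)
}

# Example calls:
# switch_dataset("Dataset1")
# switch_dataset("Dataset2")
```

Because the paths are always built from `root_dir`, switching works regardless of which dataset directory is currently the working directory.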

Now you can see what is in this directory by running the list.dirs() function:

In [None]:
# List only the top-level sample directories, by name rather than full path
list.dirs(full.names = FALSE, recursive = FALSE)

You should see a list of all the sample directories for your dataset.
You can see what is under each directory by using the dir_tree() command from the **fs** package. First, load the fs package by running the code cell below (if the package is not yet installed, run `install.packages("fs")` first).

In [None]:
library(fs)

Now run the dir_tree() command:

In [None]:
dir_tree(recurse = 2, type = "directory")

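You should see each of your sample directories with its analysis subdirectories. Because the command uses `type = "directory"`, only directories are listed; remove that argument if you also want to see the data files. If the sample directories are missing, go back to the top of the 'Importing your data' section and make sure you followed every step.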
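
The visual check above can be complemented with a small programmatic one. The sketch below reports, for every sample directory, whether the `metrics_summary.csv` file used in the next section is present. The function name `check_samples` is a hypothetical helper, and it assumes the file sits directly inside each sample directory, as the code in this notebook does.

```r
# Hypothetical helper: report which sample directories contain a required file.
check_samples <- function(base_dir = ".", required = "metrics_summary.csv") {
  # Top-level directories only, returned as names relative to base_dir
  samples <- list.dirs(base_dir, full.names = FALSE, recursive = FALSE)
  present <- file.exists(file.path(base_dir, samples, required))
  data.frame(sample = samples, has_file = present, stringsAsFactors = FALSE)
}

# Example call from the working directory set earlier:
# check_samples()
```

Any row with `has_file` equal to `FALSE` points to a sample whose upload or extraction is incomplete, or to a directory that is not a sample at all.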

# 4. Results summary <a class="anchor" id="summary"></a>


The **metrics_summary.csv** file contains information about each sample, such as estimated number of cells, number of reads, read quality metrics, where reads mapped to genomic regions, etc.
See: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/metrics-summary-csv


**NOTE: the 'metrics_summary.csv' file is only available for individual samples. If you are working on an aggregate dataset (see Aggregate Clustering section) then the below commands won't work.**

To view the data, read it in to R using the `read.csv` function. First, enter which sample you wish to examine (based on the sample directory names in your Cell Ranger output) in the code cell below. E.g. if your sample directory is called "Liver", the below cell will be `sample <- "Liver"`

In [None]:
sample <- "Cerebellum"

Now the following cell can run based on the sample directory information you provided. 

In [None]:
# Build the path to the sample's metrics file and read it into a data frame
metrics <- read.csv(file.path(".", sample, "metrics_summary.csv"))

View the metrics as a table:

In [None]:
paste(sample, "metrics")
t(metrics)

You can go back and change the sample name then re-run the last two code cells to display metrics on each sample.
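
To compare samples side by side instead of re-running the cells one sample at a time, the metrics files can be read in a loop and stacked into one table. This is a sketch under two assumptions: every sample directory holds its `metrics_summary.csv` directly inside it (as above), and all files share the same columns, which may not hold if samples were processed with different Cell Ranger versions. The function name `read_all_metrics` is hypothetical.

```r
# Hypothetical helper: read every sample's metrics file and stack them into one table.
read_all_metrics <- function(base_dir = ".") {
  samples <- list.dirs(base_dir, full.names = FALSE, recursive = FALSE)
  files <- file.path(base_dir, samples, "metrics_summary.csv")
  keep <- file.exists(files)
  if (!any(keep)) {
    stop("No metrics_summary.csv files found under: ", base_dir)
  }
  # check.names = FALSE keeps the original column headers, spaces included
  tables <- lapply(files[keep], read.csv, check.names = FALSE)
  combined <- do.call(rbind, tables)  # assumes identical columns in every file
  data.frame(sample = samples[keep], combined,
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Example call; t() shows one column per sample, as in the single-sample view:
# all_metrics <- read_all_metrics()
# t(all_metrics)
```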

[Click here to go to the next section: Filtering cells using markers](./scRNASeq_5_Filter_Cells.ipynb)