# Single-cell RNA sequencing data analysis tutorial

In this first session, we will preprocess and cell type a [dataset of peripheral blood mononuclear cells](https://www.10xgenomics.com/datasets/10-k-pbm-cs-from-a-healthy-donor-v-3-chemistry-3-standard-3-0-0) (PBMCs).

If you cloned the GitHub repo and installed the corresponding package, save the data in the `data/` directory.

Useful links:
* [anndata](https://anndata.readthedocs.io/en/stable/)
* [scanpy](https://scanpy.readthedocs.io/en/stable/)
* [Single-cell best practices book](https://www.sc-best-practices.org/preamble.html)

## Library imports

In [None]:
# `DATA_DIR` is a pathlib Path pointing to this repo's data directory.
# You can specify a path with the syntax `DATA_DIR / path / to / file`.
from crabs import DATA_DIR  # noqa

## General settings

In [None]:
# Set the verbosity of Scanpy to 2

## Constants

## Function definitions

## Data loading

In [None]:
# Load the PBMC data into the AnnData format

## Data exploration

In [None]:
# Follow the instruction issued when loading the data

In [None]:
# How many cells and genes does the data contain?

In [None]:
# Which format is the data saved as? Why?

# scRNA-seq data is very sparse, i.e., it has many zero entries
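To make the sparsity argument concrete, here is a minimal NumPy-only sketch. It uses a synthetic count matrix (the shape and Poisson rate are made up, not taken from the PBMC data) and compares dense storage with an estimate of compressed sparse row (CSR) storage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts: 1000 cells x 2000 genes; a low Poisson rate yields mostly zeros.
counts = rng.poisson(lam=0.1, size=(1000, 2000)).astype(np.float32)

n_cells, n_genes = counts.shape
nnz = int(np.count_nonzero(counts))
zero_fraction = 1.0 - nnz / counts.size

# Dense storage keeps every entry, zeros included.
dense_bytes = counts.size * counts.itemsize

# CSR keeps one value and one column index per non-zero entry,
# plus one row pointer per row and a final end pointer.
index_bytes = np.dtype(np.int32).itemsize
csr_bytes = nnz * (counts.itemsize + index_bytes) + (n_cells + 1) * index_bytes

print(f"zero fraction: {zero_fraction:.1%}")
print(f"dense: {dense_bytes / 1e6:.1f} MB, CSR estimate: {csr_bytes / 1e6:.1f} MB")
```

The more zeros the matrix contains, the larger the saving, which is why count matrices are usually stored in a sparse format.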

## Quality control

In [None]:
# Remove cells with fewer than 100 transcripts

In [None]:
# Remove genes expressed in fewer than 10 cells
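The cell and gene filters above can be sketched with plain NumPy on a synthetic count matrix (all data here is made up for illustration). In Scanpy, the corresponding calls would be `sc.pp.filter_cells(adata, min_counts=100)` and `sc.pp.filter_genes(adata, min_cells=10)`. The sketch only shows what such filters compute.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts (cells x genes); values are illustrative only.
counts = rng.poisson(lam=0.5, size=(300, 500))
counts[:5, :] = 0   # five empty "cells" that the cell filter should remove
counts[:, :20] = 0  # twenty undetected "genes" that the gene filter should remove

MIN_COUNTS_PER_CELL = 100
MIN_CELLS_PER_GENE = 10

# Keep cells with at least MIN_COUNTS_PER_CELL transcripts in total.
keep_cells = counts.sum(axis=1) >= MIN_COUNTS_PER_CELL
counts = counts[keep_cells]

# Keep genes detected (count > 0) in at least MIN_CELLS_PER_GENE remaining cells.
keep_genes = (counts > 0).sum(axis=0) >= MIN_CELLS_PER_GENE
counts = counts[:, keep_genes]

print(counts.shape)
```

Filtering cells first and genes second matters: the gene filter is evaluated only on the cells that survived.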

In [None]:
# Detect mitochondrial genes and add a boolean flag to the gene metadata

In [None]:
# Detect ribosomal genes and add a boolean flag to the gene metadata
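One common way to build such flags is by gene-symbol prefix. The sketch below assumes human gene symbols, where mitochondrial genes start with `MT-` and ribosomal protein genes start with `RPS` or `RPL`. The gene list is hypothetical. With an AnnData object, the resulting boolean arrays would typically be stored as columns of `adata.var`.

```python
import numpy as np

# Hypothetical gene symbols standing in for the gene names of the dataset.
gene_names = ["MT-CO1", "MT-ND1", "RPS6", "RPL13", "MRPL12", "CD3D", "MS4A1"]

# Assumption: human naming convention ("MT-" prefix for mitochondrial genes).
is_mt = np.array([name.startswith("MT-") for name in gene_names])

# str.startswith accepts a tuple of prefixes. "MRPL12" (a mitochondrial
# ribosomal protein) is not matched because it does not start with "RPS"/"RPL".
is_ribo = np.array([name.startswith(("RPS", "RPL")) for name in gene_names])

for name, mt, ribo in zip(gene_names, is_mt, is_ribo):
    print(f"{name:8s} mt={mt} ribo={ribo}")
```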

In [None]:
# Compute cell-wise quality metrics
# - Total counts
# - The number of genes expressed in each cell
# - Total number of counts for mitochondrial genes
# - Proportion of total counts which are mitochondrial
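The four metrics listed above reduce to row-wise sums over the count matrix. The following NumPy sketch computes them on synthetic data with a made-up mitochondrial flag. The variable names mirror the columns that `sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)` adds to `adata.obs`, but the code itself is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts (cells x genes) and a made-up mitochondrial gene flag.
counts = rng.poisson(lam=1.0, size=(100, 50))
is_mt = np.zeros(50, dtype=bool)
is_mt[:5] = True  # pretend the first five genes are mitochondrial

# Total number of transcripts per cell.
total_counts = counts.sum(axis=1)

# Number of genes with at least one count per cell.
n_genes_by_counts = (counts > 0).sum(axis=1)

# Total mitochondrial counts per cell and their share of all counts in percent.
total_counts_mt = counts[:, is_mt].sum(axis=1)
pct_counts_mt = 100.0 * total_counts_mt / np.maximum(total_counts, 1)

print(total_counts[:3], n_genes_by_counts[:3], pct_counts_mt[:3].round(1))
```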

In [None]:
# Plot the number of transcripts present in each cell against the percentage of mitochondrial counts

In [None]:
# Plot the number of transcripts present in each cell against the number of genes

In [None]:
# Remove likely dead/dying cells and data outliers
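One possible way to flag outliers is a median absolute deviation (MAD) rule combined with a fixed cap on the mitochondrial percentage. The sketch below is an assumption-laden illustration: the helper `is_outlier`, the 5-MAD cutoff, and the 20% cap are choices made here, not values prescribed by this tutorial, and the data is synthetic.

```python
import numpy as np


def is_outlier(values, n_mads):
    """Flag entries further than n_mads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad


rng = np.random.default_rng(0)

# Synthetic per-cell metrics; the first ten cells mimic dying cells
# (few transcripts, high mitochondrial fraction).
total_counts = rng.lognormal(mean=8.0, sigma=0.3, size=500)
pct_counts_mt = rng.uniform(1.0, 8.0, size=500)
total_counts[:10] = 50.0
pct_counts_mt[:10] = 60.0

N_MADS = 5          # assumed cutoff, tune per dataset
MAX_PCT_MT = 20.0   # assumed cap, tune per dataset

remove = is_outlier(np.log1p(total_counts), N_MADS) | (pct_counts_mt > MAX_PCT_MT)
keep = ~remove

print(f"kept {keep.sum()} of {keep.size} cells")
```

Thresholds are best chosen after inspecting the scatter plots, since suitable values depend on tissue and protocol.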

In [None]:
# Redraw the two plots above

## Data preprocessing

### Doublet detection

In [None]:
# Compute doublet statistics with scrublet

In [None]:
# Visualize the distribution of the doublet score
# Above which threshold is a cell considered a doublet?

In [None]:
# How many cells does scrublet identify as putative doublets?

### Transformation and feature selection

In [None]:
# Normalize the cells to their median library size (total number of transcripts per cell)
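Library-size normalization rescales every cell so that its total count equals a common target, here the median library size. A minimal NumPy sketch on synthetic data follows. In Scanpy, `sc.pp.normalize_total(adata)` with the default `target_sum=None` normalizes to the median of the total counts per cell.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cells sequenced at different depths (illustrative values only).
depth = rng.uniform(0.5, 2.0, size=(200, 1))
counts = rng.poisson(lam=5.0 * depth * np.ones((200, 100))).astype(float)

# Library size: total number of transcripts per cell.
library_size = counts.sum(axis=1, keepdims=True)
target = np.median(library_size)

# Scale each cell so that its counts sum to the median library size.
normalized = counts / library_size * target

print(library_size[:3].ravel(), normalized.sum(axis=1)[:3])
```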

In [None]:
# Log1p transform the data
# Why do we use log1p transformation?
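A short numerical illustration of the question above: `log1p(x) = log(1 + x)` is defined at zero (unlike `log(x)`), maps zero counts to zero so the matrix stays sparse, and compresses the large dynamic range of expression values. The input values below are arbitrary.

```python
import numpy as np

# Arbitrary normalized expression values spanning several orders of magnitude.
normalized = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# log1p(x) = log(1 + x): zeros stay zero, large values are compressed.
log_transformed = np.log1p(normalized)

# The transformation is invertible with expm1(x) = exp(x) - 1.
recovered = np.expm1(log_transformed)

print(log_transformed.round(2))
print(recovered.round(2))
```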

In [None]:
# Select the 4000 most highly variable genes
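The idea behind highly variable gene selection can be sketched by ranking genes by their dispersion (variance divided by mean). This is a deliberate simplification and not the procedure implemented in `sc.pp.highly_variable_genes(adata, n_top_genes=4000)`, which offers several more refined flavors. The data is synthetic and the number of selected genes is scaled down to 100.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_top = 500, 1000, 100  # scaled down from 4000 genes

# Synthetic counts: the first 100 genes have cell-to-cell variation in their
# underlying rate (overdispersed), the rest are pure Poisson noise.
rates = np.ones((n_cells, n_genes))
rates[:, :100] = rng.gamma(shape=0.5, scale=2.0, size=(n_cells, 100))
counts = rng.poisson(rates)

# Dispersion = variance / mean; pure Poisson noise has dispersion close to 1.
mean = counts.mean(axis=0)
variance = counts.var(axis=0, ddof=1)
dispersion = variance / np.maximum(mean, 1e-12)

# Flag the n_top genes with the highest dispersion.
top_genes = np.argsort(dispersion)[::-1][:n_top]
highly_variable = np.zeros(n_genes, dtype=bool)
highly_variable[top_genes] = True

print(f"overdispersed genes recovered: {highly_variable[:100].sum()} of 100")
```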

### Dimensionality reduction

In [None]:
# Compute the principal component embedding

In [None]:
# What is a suitable number of principal components to use?
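A common heuristic is to inspect the variance explained by each principal component and keep the components before the curve flattens. The sketch below computes the explained variance ratio via SVD on synthetic data with five latent dimensions. The 90% cumulative threshold is an assumption made for illustration. In Scanpy, `sc.pl.pca_variance_ratio(adata, log=True)` plots the same quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 5 strong latent dimensions embedded in 50 features plus noise.
latent = 3.0 * rng.normal(size=(300, 5))
loadings = rng.normal(size=(5, 50))
data = latent @ loadings + rng.normal(size=(300, 50))

# PCA via SVD of the centered data; squared singular values are
# proportional to the variance captured by each component.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative variance reaches the threshold.
THRESHOLD = 0.90  # assumed cutoff, not a universal rule
n_components = int(np.searchsorted(cumulative, THRESHOLD) + 1)

print(explained_variance_ratio[:8].round(3))
print(f"components needed for {THRESHOLD:.0%} variance: {n_components}")
```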

In [None]:
# Compute a k-nearest neighbor graph with k=30
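Conceptually, a k-nearest neighbor graph connects every cell to the k cells closest to it in the reduced space. The brute-force NumPy sketch below uses random coordinates as a stand-in for the PCA embedding. `sc.pp.neighbors(adata, n_neighbors=30)` is the Scanpy counterpart and additionally computes connectivity weights, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random coordinates standing in for the PCA embedding (cells x components).
embedding = rng.normal(size=(200, 10))
K = 30

# Pairwise squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq_norms = (embedding**2).sum(axis=1)
sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embedding @ embedding.T

# Exclude each cell from its own neighbor list.
np.fill_diagonal(sq_dist, np.inf)

# Indices of the K nearest cells for every cell, closest first.
neighbors = np.argsort(sq_dist, axis=1)[:, :K]

print(neighbors.shape)
print(neighbors[0, :5])
```

Brute force scales quadratically with the number of cells, so it is only practical for small examples like this one.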

In [None]:
# Compute and plot the UMAP embedding of the data

## Data clustering

In [None]:
# Compute a Leiden clustering of the data

In [None]:
# Display the UMAP embedding, colored by Leiden cluster

In [None]:
# Is there a cluster of cells likely comprised of doublets?

## Cell typing

In [None]:
# Given the following cell type markers, assign cell types to clusters
# B cells: MS4A1, CD79A, IGHM
# T cells: CD4, CD8A, CD3D, CD3E, TRAC, LTB, IL7R, CCL4
# Natural killer cells: NKG7, KLRD1, PRF1, CD74, FCGR3A and low T cell marker expression
# Monocytes: CD14, CD163, TYROBP, LYZ, CSF1R, CD68
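A simple way to turn the marker lists above into cluster labels is to average the expression of each marker set per cluster and pick the best-scoring cell type. The sketch below uses the markers from this tutorial but entirely synthetic per-cluster mean expression. With real data, the means would come from the log-normalized expression averaged per Leiden cluster, and a plot such as `sc.pl.dotplot(adata, MARKERS, groupby="leiden")` helps to check the assignment visually.

```python
import numpy as np

# Marker genes as listed in the tutorial.
MARKERS = {
    "B cells": ["MS4A1", "CD79A", "IGHM"],
    "T cells": ["CD4", "CD8A", "CD3D", "CD3E", "TRAC", "LTB", "IL7R", "CCL4"],
    "Natural killer cells": ["NKG7", "KLRD1", "PRF1", "CD74", "FCGR3A"],
    "Monocytes": ["CD14", "CD163", "TYROBP", "LYZ", "CSF1R", "CD68"],
}

genes = sorted({gene for marker_set in MARKERS.values() for gene in marker_set})
gene_index = {gene: i for i, gene in enumerate(genes)}
cell_types = list(MARKERS)

# Synthetic mean expression per cluster (clusters x genes): cluster i is
# simulated to express the markers of cell type i above a low background.
rng = np.random.default_rng(0)
cluster_means = rng.uniform(0.0, 0.5, size=(len(cell_types), len(genes)))
for cluster, cell_type in enumerate(cell_types):
    for gene in MARKERS[cell_type]:
        cluster_means[cluster, gene_index[gene]] += 2.0

# Score every cluster for every cell type by the mean of its marker genes.
scores = np.array([
    [cluster_means[cluster, [gene_index[g] for g in marker_set]].mean()
     for marker_set in MARKERS.values()]
    for cluster in range(cluster_means.shape[0])
])

# Assign each cluster the cell type with the highest score.
assigned = [cell_types[i] for i in scores.argmax(axis=1)]

for cluster, cell_type in enumerate(assigned):
    print(f"cluster {cluster}: {cell_type}")
```

A plain argmax ignores the "low T cell marker expression" criterion for natural killer cells, so ambiguous clusters still need manual inspection.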