# 10X Genomics - test dataset "3000 PBMCs From a Healthy Individual"

You can find this data and many other single cell datasets here:

https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k

## 1 Import Python packages/libraries to use in analysis

In [None]:
import scanpy as sc
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib import colors
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from gprofiler import GProfiler
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=170, color_map='viridis')  # low dpi (dots per inch) yields small inline figures
sc.logging.print_header()
results_file = './PBMCs3000_answers.h5ad'
results_file_denoised = './PBMCs3000_deno_answers.h5ad'

When we import a package as something else, we are creating a shortcut to easily call the package. For example, import pandas as pd will allow us to type pd when calling a pandas function rather than pandas.

## Starting from a 10X Dataset

Because we opened Jupyter Lab/Notebook from the directory that we made, we only need to enter the path and filename starting from this point. We are going to use Scanpy's "read_10x_mtx" function because we are reading in a 10X mtx file to analyze. We could also start with a matrix generated by featureCounts, DESeq2, or any other program and load it with the Python package pandas, or read an .h5 file from Cell Ranger (10X) with Scanpy's "read_10x_h5" function.

Here, we are reading in sparse count matrices. If we were going to perform batch correction because we had multiple sets of data we would need to convert these to dense representation with the .toarray() function. We create dense matrices as our batch correction method outputs a dense expression matrix, and the data transfer between R and python is currently limited to dense matrices. When sparse batch correction methods are available, and rpy2 is extended to sparse matrices, it is more memory-efficient to keep the data in a sparse format.
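To make the sparse versus dense trade-off concrete, here is a minimal sketch on a made-up toy matrix (not the PBMC data). It assumes only NumPy and SciPy and shows that `.toarray()` returns the same values in dense form, while the sparse format stores only the non-zero entries.

```python
import numpy as np
from scipy import sparse

# Toy count matrix (cells x genes); most entries are zero, as in scRNA-seq data
dense_counts = np.array([
    [0, 0, 3, 0, 0, 0],
    [1, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 4, 0],
    [0, 5, 0, 0, 0, 0],
], dtype=np.float32)

# Sparse (CSR) storage keeps only the non-zero entries plus their positions
sparse_counts = sparse.csr_matrix(dense_counts)
n_stored = sparse_counts.nnz      # number of explicitly stored values
n_total = dense_counts.size       # number of values held by the dense array

# .toarray() returns an ordinary dense NumPy array with identical values
round_trip = sparse_counts.toarray()

print(f"stored values: {n_stored} of {n_total}")
print("round trip equal:", np.array_equal(round_trip, dense_counts))
```

On real data the fraction of zeros is usually much larger than in this toy example, so the memory saving from staying sparse is correspondingly larger.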

It should also be noted that the conventions for storing single cell data differ between R (Seurat, or Scater) and python platforms (Scanpy). Scanpy expects the data to be stored in the format (cells x genes), while R platforms expect the transpose of that. As data is typically stored in the format (genes x cells) as in GEO database, we must transpose the data before using it. However, because we are using scanpy's built in 10x data loading, it will transpose the data for us.

In [None]:
adata = sc.read_10x_mtx('./PBMC/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                  # use gene symbols for the variable names (variables-axis index)
    cache=True)                                # write a cache file for faster subsequent reading

We can't have duplicate gene names, so we need to make sure that they are unique.

In [None]:
adata.var_names_make_unique()  # this is unnecessary if using 'gene_ids'

In [None]:
adata

The above shows that we have 2700 cells and over 32,000 genes.

In [None]:
from IPython.display import Image
Image("PBMCs.png")

## 2 Pre-processing and Visualization 

## 2.1 Quality Control 

Data quality control can be split into cell QC and gene QC. Typical quality measures for assessing the quality of a cell include the number of molecule counts (UMIs), the number of expressed genes, and the fraction of counts that are mitochondrial. A high fraction of mitochondrial reads being picked up can indicate cell stress, as there is a low proportion of nuclear mRNA in the cell. It should be noted that high mitochondrial RNA fractions can also be biological signals indicating elevated respiration.

Alternatively, preprocessing can be done somewhere else (such as in R) and then loaded into scanpy. If this is your situation, skip this portion. You can also freeze the raw data at any point if you want to come back to it. Set the .raw attribute of AnnData object to the logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object returned by sc.pp.log1p.

Log transforming the data is one simple way to "variance-stabilize" it, that is, to reduce the influence of gene expression magnitude on gene variance.
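As a rough illustration of this variance-stabilizing effect, the sketch below uses two made-up genes that share the same relative noise but differ 100-fold in mean expression. The noise model (log-normal, multiplicative) is an assumption chosen for simplicity, not a property of this dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared multiplicative noise: each cell's expression is mean * noise
noise = np.exp(rng.normal(loc=0.0, scale=0.5, size=5000))
low_gene = 10 * noise       # lowly expressed toy gene
high_gene = 1000 * noise    # highly expressed toy gene

# On the raw scale the variance grows with the square of the mean
raw_ratio = high_gene.var() / low_gene.var()

# After log(1 + x) the two variances are of similar size
log_ratio = np.log1p(high_gene).var() / np.log1p(low_gene).var()

print(f"variance ratio (raw):   {raw_ratio:.0f}")
print(f"variance ratio (log1p): {log_ratio:.2f}")
```

The raw variance ratio is several orders of magnitude, while the ratio after the log transform is close to one.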

In [None]:
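# Quality control - calculate QC covariates
# X is sparse, so row sums come back as an (n_cells x 1) matrix; flatten them to 1-D arrays
adata.obs['n_counts'] = np.asarray(adata.X.sum(1)).ravel()
adata.obs['log_counts'] = np.log(adata.obs['n_counts'])
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(1)).ravel()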

Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017):

High proportions of mitochondrial reads are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.

In [None]:
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1

In [None]:
adata.obs.head()

Scanpy has a "cheat" command called "pp.calculate_qc_metrics()" that we could use instead of doing all the calculations we did above.

In [None]:
#first we need to make a copy so we don't write anything in our Anndata
adata_qcTest = adata.copy()
sc.pp.calculate_qc_metrics(adata_qcTest, inplace=True)

In [None]:
#check to see we got the same numbers as our "manual" calculations above:
adata_qcTest.obs.head()

Now let's use a violin plot to take a look at the quality of our data.

In [None]:
sc.pl.violin(adata, ['n_genes', 'n_counts'],jitter=0.4)
sc.pl.violin(adata, ['percent_mito'],jitter=0.4)

The plots show that the counts per cell are around 2500 and the fraction of mitochondrial reads (MT frac) for most cells is still far below 20-25%, which is the typical range for filtering thresholds.

In [None]:
#Data quality summary plots
p1 = sc.pl.scatter(adata, 'n_counts', 'n_genes', color='percent_mito')
p2 = sc.pl.scatter(adata[adata.obs['n_counts']<10000], 'n_counts', 'n_genes', color='percent_mito')

By looking at plots of the number of genes versus the number of counts with MT fraction information, we can assess whether there are cells with unexpected summary statistics. It is important here to look at these statistics jointly. For example, there is a cloud of points with many counts, but few genes. Our first instinct would be to filter out these as "dying" outliers, however they don't seem to show high MT fraction. We should probably still filter out some cells with very few genes as these may be difficult to annotate later. This will be true for the initial cellular density between 1000-4000 counts and < ~500 genes.

Furthermore it can be seen in the main cloud of data points, that cells with lower counts and genes tend to have a higher fraction of mitochondrial counts. These cells are likely under stress or are dying. When apoptotic cells are sequenced, there is less mRNA to be captured in the nucleus, and therefore fewer counts overall, and thus a higher fraction of counts fall upon mitochondrial RNA. If cells with high mitochondrial activity were found at higher counts/genes per cell, this would indicate biologically relevant mitochondrial activity.

In [None]:
#Thresholding decision: counts
p3 = sns.distplot(adata.obs['n_counts'], kde=False)
plt.show()

p4 = sns.distplot(adata.obs['n_counts'][adata.obs['n_counts']<4000], kde=False, bins=60)
plt.show()

p5 = sns.distplot(adata.obs['n_counts'][adata.obs['n_counts']>8000], kde=False, bins=60)
plt.show()

Histograms of the number of counts per cell show that there are two small peaks of groups of cells with fewer than 1500 counts, which are likely uninformative given the overall distribution of counts. This may be cellular debris found in droplets.

On the upper end of counts, we see just a few cells with counts over 8000.

In [None]:
#Thresholding decision: genes
p6 = sns.distplot(adata.obs['n_genes'], kde=False, bins=60)
plt.show()

p7 = sns.distplot(adata.obs['n_genes'][adata.obs['n_genes']<750], kde=False, bins=60)
plt.show()

It looks like our data just has a small population of cells with low gene counts. Not surprisingly, given that this data was taken from the 10X website, it is quite clean with very little filtering needed. However, given these plots, and the plot of genes vs counts above, we decide to filter out cells with fewer than 500 genes expressed. Below this we are likely to find dying cells or empty droplets with ambient RNA. Looking above at the joint plots, we see that we filter out the main density of low gene cells with this threshold.

In general it is a good idea to be permissive in the early filtering steps, and then come back to filter out more stringently when a clear picture is available of what would be filtered out. This is available after visualization/clustering. For demonstration purposes we stick to a simple (and slightly more stringent) filtering here.
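The filtering logic itself is just a combination of boolean masks over the per-cell QC covariates. The sketch below applies the thresholds chosen in this notebook (1250 and 8000 counts, a mitochondrial fraction of 0.2, 500 genes) to six hypothetical cells. The covariate values are made up for illustration.

```python
import numpy as np

# Toy per-cell QC covariates (hypothetical values, one entry per cell)
n_counts = np.array([900, 2500, 3100, 9500, 2700, 1800])
n_genes = np.array([300, 900, 1100, 2500, 450, 700])
percent_mito = np.array([0.05, 0.02, 0.30, 0.03, 0.04, 0.01])

# A cell is kept only if it passes every threshold
keep = (
    (n_counts >= 1250)        # minimum count depth
    & (n_counts <= 8000)      # maximum count depth
    & (percent_mito < 0.2)    # mitochondrial fraction
    & (n_genes >= 500)        # minimum number of detected genes
)
kept_cells = np.flatnonzero(keep)

print("cells kept:", kept_cells.tolist())
print("cells removed:", int((~keep).sum()))
```

Inspecting which individual mask removes each cell is a quick way to see which threshold is doing most of the filtering.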

In [None]:
# Filter cells according to identified QC thresholds:
print('Total number of cells: {:d}'.format(adata.n_obs))

sc.pp.filter_cells(adata, min_counts = 1250)
print('Number of cells after min count filter: {:d}'.format(adata.n_obs))

sc.pp.filter_cells(adata, max_counts = 8000)
print('Number of cells after max count filter: {:d}'.format(adata.n_obs))

adata = adata[adata.obs['percent_mito'] < 0.2].copy()  # copy so later filters act on real data, not a view
print('Number of cells after MT filter: {:d}'.format(adata.n_obs))

sc.pp.filter_cells(adata, min_genes = 500)
print('Number of cells after gene filter: {:d}'.format(adata.n_obs))

In [None]:
#Filter genes:
print('Total number of genes: {:d}'.format(adata.n_vars))

# Min 15 cells - filters out 0 count genes
sc.pp.filter_genes(adata, min_cells=15)
print('Number of genes after cell filter: {:d}'.format(adata.n_vars))

Genes are also filtered if they are not detected in at least 15 cells. This reduces the dimensions of the matrix by removing 0 count genes and genes which are not sufficiently informative of the dataset. 15 is not a magic number and should be adjusted down or up depending on how many cells are in your dataset.
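The gene filter can be sketched the same way: count, for each gene, the number of cells with at least one count and compare against the threshold. The matrix and the threshold of 2 below are toy values (the notebook uses 15).

```python
import numpy as np

# Toy count matrix (5 cells x 4 genes); values are made up
counts = np.array([
    [0, 2, 0, 1],
    [0, 1, 0, 0],
    [0, 4, 3, 0],
    [0, 1, 0, 2],
    [0, 3, 0, 0],
])

min_cells = 2  # toy threshold for this small example

# Number of cells in which each gene has at least one count
cells_per_gene = (counts > 0).sum(axis=0)

# Keep genes detected in at least `min_cells` cells
gene_mask = cells_per_gene >= min_cells
filtered = counts[:, gene_mask]

print("cells per gene:", cells_per_gene.tolist())
print("matrix shape after filtering:", filtered.shape)
```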

## 2.2 Normalization

Each count in a count matrix represents the successful capture, reverse transcription and sequencing of a molecule of cellular mRNA. Count depths for identical cells can differ due to the variability inherent in each of these steps. Thus, when gene expression is compared between cells based on count data, any difference may have arisen solely due to sampling effects. Normalization addresses this issue by e.g. scaling count data to obtain correct relative gene expression abundances between cells.

Scanpy has a built-in command (scanpy.pp.normalize_total) which normalizes the counts per cell. If we choose target_sum=1e6, this is CPM normalization. CPM normalization is the most commonly used normalization protocol and is also referred to as “counts per million”. This protocol comes from bulk expression analysis and normalizes count data using a so-called size factor proportional to the count depth per cell. Variations of this method scale the size factors with different factors of 10, or by the median count depth per cell in the dataset. CPM normalization assumes that all cells in the dataset initially contained an equal number of mRNA molecules and count depth differences arise only due to sampling. If target_sum=None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.
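The arithmetic behind both variants is a per-cell rescaling. This is a minimal NumPy sketch on a made-up 3 x 4 matrix, not a reimplementation of Scanpy's function.

```python
import numpy as np

# Toy count matrix (3 cells x 4 genes); values are made up
counts = np.array([
    [10, 0, 30, 60],     # total 100
    [20, 0, 60, 120],    # total 200, same composition as the first cell
    [5, 5, 20, 20],      # total 50
], dtype=float)

totals = counts.sum(axis=1, keepdims=True)

# CPM-style scaling: every cell is rescaled to a total of one million
cpm = counts / totals * 1e6

# Median scaling: every cell is rescaled to the median total count
median_total = np.median(totals)
median_scaled = counts / totals * median_total

print("first two cells identical after CPM:", np.allclose(cpm[0], cpm[1]))
print("totals after median scaling:", median_scaled.sum(axis=1))
```

The first two cells differ only in count depth, so they become identical after normalization, which is exactly the sampling effect the method is meant to remove.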

If exclude_highly_expressed=True, very highly expressed genes are excluded from the computation of the normalization factor (size factor) for each cell. This is meaningful as these can strongly influence the resulting normalized values for all other genes [Weinreb17].

We will use this normalization for our data today because we have limited cellular variability in our data and due to the time and effort it takes to set up SCRAN, but it is probably not the best method to use.

**Important: Before normalizing the data, we ensure that a copy of the raw count data is kept in a separate layer of the AnnData object. This allows us to use methods downstream that require this data as input.**

In [None]:
#Keep the count data in a counts layer
adata.layers["counts"] = adata.X.copy()

In [None]:
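#Normalize the data and log transform it
# normalize_total replaces the deprecated normalize_per_cell; with the default target_sum=None each cell is scaled to the median total count
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)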

In [None]:
# Store the full data set in 'raw' as log-normalised data for statistical testing
adata.raw = adata

The count data has been normalized and log-transformed with an offset of 1. The latter is performed to normalize the data distributions. The offset of 1 ensures that zero counts map to zeros. We keep this data in the '.raw' part of the AnnData object as it will be used to visualize gene expression and perform statistical tests such as computing marker genes for clusters.

## 2.3 Batch Correction

Batch correction is performed to adjust for batch effects from the different samples that you might load together. Our dataset is one patient's PBMCs, so in our case we do not have to batch correct. However, if we had multiple tissue samples that were being combined together we would want to correct for any batch effects. As the batch effect from samples are often overlapping, correcting for this batch effect will also partially regress out differences between regions. We allow for this to optimally cluster the data. This approach can also be helpful to find differentiation trajectories, but we revert back to non-batch-corrected data for differential testing and computing marker genes.

Regression batch correction essentially assumes that each batch must have the same average gene expression across all cells and removes the difference (this can over-correct your data).
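The assumption of equal per-batch means can be illustrated with plain mean-centering. Note that this is a deliberately simplified stand-in and not ComBat itself; the matrix and the batch labels are made up.

```python
import numpy as np

# Toy log-expression matrix (6 cells x 2 genes); values are made up
expr = np.array([
    [1.0, 2.0],
    [2.0, 2.0],
    [3.0, 2.0],
    [4.0, 5.0],
    [5.0, 5.0],
    [6.0, 5.0],
])
batch = np.array(["a", "a", "a", "b", "b", "b"])

corrected = expr.copy()
overall_mean = expr.mean(axis=0)
for b in np.unique(batch):
    in_batch = batch == b
    # Shift each batch so that its per-gene mean equals the overall mean
    corrected[in_batch] += overall_mean - expr[in_batch].mean(axis=0)

print("batch a mean:", corrected[batch == "a"].mean(axis=0))
print("batch b mean:", corrected[batch == "b"].mean(axis=0))
```

After the shift both batches have identical per-gene means. If the original difference in the second gene had been biological rather than technical, it would have been removed as well, which is the over-correction risk mentioned above.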

Note that ComBat batch correction works on a dense matrix. Our data is still in sparse format in this example, so it may first need to be converted with .toarray(), as described above.

If we wanted to batch correct we would perform the below:

In [None]:
'''# ComBat batch correction
sc.pp.combat(adata, key='sample')'''
#where "sample" is some batch we want to correct for. We would set this up when we load the data.

Note ComBat batch correction can produce negative expression values. One can either set all negative values to 0 or force zeros pre-batch-correction to remain zero post-batch-correction.

There are also several data integration tools that you can and should use to integrate single cell data from multiple experiments, such as bbknn (https://doi.org/10.1093/bioinformatics/btz625) or Harmony (Nature Methods <https://doi.org/10.1038/s41592-019-0619-0>). Scanpy has "ingest" (scanpy.tl.ingest) built into the API and also provides bbknn, Harmony, and mutual nearest neighbors in the external API: pp.bbknn(adata[, batch_key, approx, metric, …]), pp.harmony_integrate(adata, key[, basis, …]), and pp.mnn_correct(*datas[, var_index, …]).

## 2.4 Highly Variable Genes

We extract highly variable genes (HVGs) to further reduce the dimensionality of the dataset and include only the most informative genes. Genes that vary substantially across the dataset are informative of the underlying biological variation in the data. As we only want to capture biological variation in these genes, we select highly variable genes after normalization and batch correction. **HVGs are used for clustering, trajectory inference, and dimensionality reduction/visualization, while the full data set is used for computing marker genes, differential testing, cell cycle scoring, and visualizing expression values on the data.**

Keep in mind that any given tissue or cell type expresses 11,000 to 13,000 genes, of which 3,000 to 5,000 have a cell-type-specific expression pattern, whereas the remaining genes are ubiquitously expressed.

NOTE: The below command expects log data (except when flavor='seurat_v3' is used).

In [None]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=4, min_disp=0.5)

In [None]:
sc.pl.highly_variable_genes(adata)

Alternatively, we could use the standard technique for the extraction of highly variable genes from the 10X genomics preprocessing software CellRanger. Typically between 1000 and 5000 genes are selected. The commented-out example here would extract the top 4000 most variable genes for further processing. If particular genes of importance are known, one could assess how many highly variable genes are necessary to include all, or the majority, of these.

In [None]:
'''sc.pp.highly_variable_genes(adata, flavor='cell_ranger', n_top_genes=4000)
print('\n','Number of highly variable genes: {:d}'.format(np.sum(adata.var['highly_variable'])))'''

The plots show how the data was normalized to select highly variable genes irrespective of the mean expression of the genes, unless you set a maximum. This is achieved by using the index of dispersion, which divides by mean expression, and subsequently binning the data by mean expression and selecting the most variable genes within each bin.
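The idea of a binned, normalized dispersion can be sketched in a few lines of NumPy. This is a simplified illustration on simulated Poisson data, with an arbitrary choice of five equal-sized bins, and should not be read as the exact procedure Scanpy uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix (200 cells x 50 genes) spanning a range of mean levels
gene_means = np.linspace(0.5, 20.0, 50)
expr = rng.poisson(lam=gene_means, size=(200, 50)).astype(float)

mean = expr.mean(axis=0)
var = expr.var(axis=0, ddof=1)
dispersion = var / mean  # index of dispersion

# Bin genes by mean expression (5 equal-sized bins of 10 genes each)
order = np.argsort(mean)
bins = np.empty(50, dtype=int)
bins[order] = np.arange(50) // 10

# z-score the dispersion within each bin
norm_disp = np.empty(50)
for b in range(5):
    in_bin = bins == b
    d = dispersion[in_bin]
    norm_disp[in_bin] = (d - d.mean()) / d.std(ddof=1)

# Genes with the highest normalized dispersion are the HVG candidates
top_genes = np.argsort(norm_disp)[::-1][:5]
print("candidate HVG indices:", top_genes.tolist())
```

Because the z-score is computed within each bin, a gene is only compared against genes of similar mean expression, so highly expressed genes are not favored automatically.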

Highly variable gene information is stored automatically in the adata.var['highly_variable'] field. The dataset now contains:

- a 'counts' layer with count data
- log-normalized data in adata.raw
- highly variable gene annotations in adata.var['highly_variable']

The HVG labels will be used to subselect genes for clustering and trajectory analysis.

In [None]:
adata

## 2.5 Dimensionality Reduction Methods: Visualization and Summarization

There are two main objectives of dimensionality reduction methods: visualization and summarization. Visualization is the attempt to optimally describe the dataset in two or three dimensions. These reduced dimensions are used as coordinates on a scatter plot to obtain a visual representation of the data. Summarization does not prescribe the number of output components. Instead, higher components become less important for describing the variability present in the data. Summarization techniques can be used to reduce the data to its essential components by finding the inherent dimensionality of the data, and are thus helpful for downstream analysis.


Visualizing scRNA-seq data is the process of projecting a high-dimensional matrix of cells and genes into a few coordinates such that every cell is meaningfully represented in a two-dimensional graph. However, the visualization of scRNA-seq data is an active area of research and each method defines 'meaningful' in its own way. Thus, it is a good idea to look at several visualizations and decide which visualization best represents the aspect of the data that is being investigated.

Overall t-SNE visualizations have been very popular in the community, however the recent UMAP algorithm has been shown to better represent the topology of the data.

Note that we do not scale the genes to have zero mean and unit variance. A lack of rescaling is equivalent to giving genes with a higher mean expression a higher weight in dimensionality reduction (despite correcting for mean offsets in PCA, due to the mean-variance relationship). We argue that this weighting is justified because mean expression is a biologically relevant signal. However, rescaling HVG expression is also common, and the number of publications that use this approach suggests that scaling is at least not detrimental to downstream scRNA-seq analysis.
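To see how scaling changes the weighting, the sketch below runs a bare-bones PCA (via SVD) on three simulated genes: one with high variance and two low-variance genes that share a common signal. All values are simulated, and the helper `first_pc_loadings` is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log-expression matrix (300 cells x 3 genes)
shared = rng.normal(0.0, 1.0, 300)   # signal shared by genes 1 and 2
expr = np.column_stack([
    rng.normal(5.0, 3.0, 300),                        # gene 0: high variance
    1.0 + 0.3 * shared + rng.normal(0.0, 0.1, 300),   # gene 1: low variance
    1.0 + 0.3 * shared + rng.normal(0.0, 0.1, 300),   # gene 2: low variance
])

def first_pc_loadings(x):
    """Return the absolute gene loadings of the first principal component."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.abs(vt[0])

unscaled = first_pc_loadings(expr)

# Scale each gene to zero mean and unit variance before PCA
scaled_expr = (expr - expr.mean(axis=0)) / expr.std(axis=0)
scaled = first_pc_loadings(scaled_expr)

print("PC1 loadings without scaling:", unscaled.round(2))
print("PC1 loadings with scaling:   ", scaled.round(2))
```

Without scaling, the first component is driven almost entirely by the high-variance gene. With scaling, it is driven by the two correlated genes. Which behavior is preferable depends on whether the high variance is considered signal.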


In [None]:
# Calculate the visualizations
sc.pp.pca(adata, n_comps=50, use_highly_variable=True, svd_solver='arpack')
sc.pp.neighbors(adata)

sc.tl.tsne(adata, n_jobs=12)  # note: n_jobs works for MulticoreTSNE, but not the regular implementation
sc.tl.umap(adata)
sc.tl.diffmap(adata)
sc.tl.draw_graph(adata)

In [None]:
sc.pl.pca_scatter(adata, color='n_counts')
sc.pl.tsne(adata, color='n_counts')
sc.pl.umap(adata, color='n_counts')
sc.pl.diffmap(adata, color='n_counts', components=['1,2','1,3'])
sc.pl.draw_graph(adata, color='n_counts')

PCA:
- Unsurprisingly, the first principal component captures variation in count depth between cells, and is thus only marginally informative
- The plot usually will not show the expected clustering of the data in two dimensions unless the data is not very heterogeneous, as is the case for ours

t-SNE:
- Shows several distinct clusters with clear subcluster structure
- Connections between clusters are difficult to interpret visually

UMAP:
- Data points are spread out on the plot showing several clusters
- Connections between clusters can be readily identified

Diffusion Maps:
- Shows connections between regions of higher density
- Very clear trajectories are suggested, but clusters are less clear
- Each diffusion component extracts heterogeneity in a different part of the data

Graph:
- Does not show much new information when compared to the others but often will show clear connections between the clusters and hint at stem cell clusters

The strengths and weaknesses of the visualizations might not be readily identified in the above plots. However, generally t-SNE exaggerates differences (but sacrifices accurate global distance representation in favor of local distance), while diffusion maps exaggerate transitions. Overall UMAP and force-directed graph drawings show the best compromise of the two aspects, however UMAP is much faster to compute (8s vs 11s here). UMAP has furthermore been shown to more accurately display the structure in the data.

## 2.6 Cell Cycle Scoring

So far we have tried to correct for known sources of technical variation in the data (e.g. batch, count depth). There are also known sources of biological variation that can explain the data and also can be corrected for (e.g. cell cycle). This data correction can be performed by a simple linear regression against a cell cycle score as implemented in the Scanpy and Seurat platforms (Butler et al, 2018; Wolf et al, 2018) or in specialized packages with more complex mixture models such as scLVM (Buettner et al, 2015) or f-scLVM (Buettner et al, 2017). Lists of marker genes to compute cell cycle scores are obtained from the literature (Macosko et al, 2015). These methods can also be used to regress out other known biological effects such as mitochondrial gene expression, which is interpreted as an indication of cell stress. Several aspects should be considered prior to correcting data for biological effects. Firstly, correcting for biological covariates is not always helpful to interpret scRNA-seq data. While removing cell cycle effects can improve the inference of developmental trajectories (Buettner et al, 2015; Vento-Tormo et al, 2018), cell cycle signals can also be informative of the biology.

Here, we will use a gene list from Macosko et al., Cell (2015) to score the cell cycle effect in the data and classify cells by cell cycle phase. Please note, the gene list was generated for human HeLa cells. We perform cell cycle scoring on the full data set in adata (no batch correction was applied to this single-sample data).
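Conceptually, a gene-set score is the mean expression of the set minus the mean expression of a reference set, and the phase follows from comparing the two scores. The sketch below uses placeholder gene names and a fixed reference set, which is a simplification (our understanding is that Scanpy samples its reference genes by expression level).

```python
import numpy as np

# Toy log-normalized expression (4 cells x 6 genes); gene names are placeholders
genes = np.array(["S1", "S2", "G2M1", "G2M2", "REF1", "REF2"])
expr = np.array([
    [3.0, 2.0, 0.5, 0.5, 1.0, 1.0],   # S genes high
    [0.5, 0.5, 3.0, 2.0, 1.0, 1.0],   # G2/M genes high
    [0.5, 0.5, 0.5, 0.5, 1.0, 1.0],   # neither set high
    [2.0, 2.0, 1.5, 1.5, 1.0, 1.0],   # both high, S higher
])

def score(gene_set, reference):
    """Mean expression of a gene set minus the mean of a reference set."""
    in_set = np.isin(genes, gene_set)
    in_ref = np.isin(genes, reference)
    return expr[:, in_set].mean(axis=1) - expr[:, in_ref].mean(axis=1)

s_score = score(["S1", "S2"], ["REF1", "REF2"])
g2m_score = score(["G2M1", "G2M2"], ["REF1", "REF2"])

# Phase: the larger score wins; G1 if both scores are negative
phase = np.where(s_score > g2m_score, "S", "G2M")
phase[(s_score < 0) & (g2m_score < 0)] = "G1"

print(phase.tolist())
```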

In [None]:
#Score cell cycle and visualize the effect:
cc_genes = pd.read_table('./Macosko_cell_cycle_genes.txt', delimiter='\t')
s_genes = cc_genes['S'].dropna()
g2m_genes = cc_genes['G2.M'].dropna()

s_genes_ens = adata.var_names[np.isin(adata.var_names, s_genes)] #isin = test whether each element of an array is also present in a second array (replaces the deprecated np.in1d)
g2m_genes_ens = adata.var_names[np.isin(adata.var_names, g2m_genes)]

sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes_ens, g2m_genes=g2m_genes_ens)

In [None]:
sc.pl.umap(adata, color=['S_score', 'G2M_score'], use_raw=False)
sc.pl.umap(adata, color='phase', use_raw=False)

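No region of the UMAP shows a pronounced proliferation signal. This is expected for PBMCs from a healthy donor, which are mostly non-dividing cells, and the low mitochondrial fractions seen during QC argue against dying cells as the explanation.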

## 3 Downstream analysis

## 3.1 Clustering

Clustering is a central component of the scRNA-seq analysis pipeline. To understand the data, we must identify cell types and states present. The first step of doing so is clustering. Performing modularity optimization by Louvain community detection on the k-nearest-neighbour graph of cells has become an established practice in scRNA-seq analysis. Thus, this is the method of choice in this tutorial as well.

Here, we perform clustering at two resolutions. Investigating several resolutions allows us to select a clustering that appears to capture the main clusters in the visualization and can provide a good baseline for further subclustering of the data to identify more specific substructure.

Clustering is performed on the highly variable gene data, dimensionality reduced by PCA, and embedded into a KNN graph (see the sc.pp.pca() and sc.pp.neighbors() functions used in the visualization section).


Modularity optimization via Louvain has a stochastic element to it. This stochasticity typically does not affect the biological interpretation of the data, but can change the details of analysis scripts. Normally scanpy fixes the random seed to 0 to make scripts exactly reproducible.
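The quantity being optimized, modularity, compares the number of edges inside clusters with the number expected by chance given the node degrees. The sketch below computes it for a tiny hand-made graph. It only evaluates a given partition and does not implement the Louvain search itself.

```python
import numpy as np

# Toy undirected graph: two triangles (0-1-2 and 3-4-5) joined by edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n_nodes = 6
adjacency = np.zeros((n_nodes, n_nodes))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

def modularity(adj, labels):
    """Newman modularity Q of a partition of an undirected graph."""
    labels = np.asarray(labels)
    degree = adj.sum(axis=1)
    two_m = degree.sum()                        # twice the number of edges
    same_cluster = labels[:, None] == labels[None, :]
    expected = np.outer(degree, degree) / two_m
    return ((adj - expected) * same_cluster).sum() / two_m

q_two_clusters = modularity(adjacency, [0, 0, 0, 1, 1, 1])
q_one_cluster = modularity(adjacency, [0, 0, 0, 0, 0, 0])

print(f"Q, one cluster per triangle: {q_two_clusters:.4f}")
print(f"Q, everything in one cluster: {q_one_cluster:.4f}")
```

Splitting the graph into its two triangles gives a higher modularity than lumping all nodes together, which is why a modularity-based method would prefer that partition.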

In [None]:
sc.tl.louvain(adata, key_added='louvain_r1')
sc.tl.louvain(adata, resolution=0.5, key_added='louvain_r0.5')

In [None]:
adata.obs['louvain_r0.5'].value_counts()

In [None]:
#Visualize the clustering and how this is reflected by different technical covariates
sc.pl.umap(adata, color=['louvain_r1', 'louvain_r0.5'], palette='Set2')
sc.pl.umap(adata, color=['n_counts'])
sc.pl.umap(adata, color=['percent_mito'])

At a resolution of 0.5 the broad clusters in the visualization are captured well in the data. It remains to be seen if cluster 1 in the 0.5 resolution should be two clusters instead of one. The covariate plots show that none of the clusters have unusually high or low counts or high percent mitochondria, so none of the clusters seem to represent stressed or dying cells.

To look at differences in clustering algorithms we will cluster cells using the Leiden algorithm [Traag18], an improved version of the Louvain algorithm [Blondel08]. It has been proposed for single-cell analysis by [Levine15].

In [None]:
sc.tl.leiden(adata, key_added='leiden_r1')
sc.tl.leiden(adata, resolution=0.5, key_added='leiden_r0.5')
sc.pl.umap(adata, color=['louvain_r1', 'leiden_r1'], palette='Set2')

In [None]:
adata.obs['leiden_r0.5'].value_counts()

In [None]:
adata.obs['louvain_r0.5'].value_counts()

In [None]:
sc.tl.louvain(adata, resolution=0.6, key_added='louvain_r0.6')
adata.obs['louvain_r0.6'].value_counts()

In [None]:
sc.pl.umap(adata, color=['louvain_r0.6', 'louvain_r0.5'], palette='Dark2')

## 3.2 Marker genes & cluster annotation 

To annotate the clusters we obtained, we find genes that are up-regulated in the cluster compared to all other clusters (marker genes). This differential expression test is performed by a Welch t-test. This is the default in scanpy. The test is automatically performed on the .raw data set, which is uncorrected and contains all genes. All genes are taken into account, as any gene may be an informative marker.

Also note that the results are displayed as ranked gene lists instead of genes that have a p-value below some cut-off. This is because the test treats each cell as an independent replicate leading to very low p-values. It might be useful in some cases to create pseudo-bulk counts.
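For reference, the Welch t statistic for a single gene can be computed by hand as follows. The two groups of expression values are made up, and the helper `welch_t` is a hypothetical name used only for this illustration.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard errors of the two group means (unequal variances allowed)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df

# Toy expression values of one gene in a cluster vs. all other cells
in_cluster = [1, 2, 3, 4, 5]
rest = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(in_cluster, rest)

print(f"t = {t_stat:.4f}, df = {dof:.2f}")
```

Because the standard errors shrink with the number of cells in each group, thousands of cells per cluster produce very large absolute t values even for modest differences, which is why the ranked list is more informative than a p-value cut-off.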

We can use some known marker genes to annotate the clusters, and if we didn't have any information on the cells in question we could run automated annotation using scmap or garnett or the recently published MARS (https://github.com/snap-stanford/mars) as well as do a gene ontology enrichment analysis.

In [None]:
#Calculate marker genes
sc.tl.rank_genes_groups(adata, groupby='louvain_r0.5', key_added='rank_genes_r0.5')

In [None]:
#Plot marker genes
sc.pl.rank_genes_groups(adata, key='rank_genes_r0.5', groups=['0','1','2'], fontsize=12)
sc.pl.rank_genes_groups(adata, key='rank_genes_r0.5', groups=['3','4'], fontsize=12)

Clusters 3 and 4 share quite a few "marker genes", so instead of finding highly expressed genes in these clusters vs all of the other clusters we can ask what genes are highly expressed in cluster 3 compared to just cluster 4 and vice versa.

In [None]:
sc.tl.rank_genes_groups(adata,groupby='louvain_r0.5',groups=['3'],reference='4',key_added='rank_genes_r0.5_clust3')

In [None]:
sc.pl.rank_genes_groups(adata, key='rank_genes_r0.5_clust3', groups=['3'], n_genes=30, sharey=False)

To make things easier to look at, we can put the top 20 genes in a table.

In [None]:
pd.DataFrame(adata.uns['rank_genes_r0.5']['names']).head(20)

If we would like to take a look at some of the marker genes across the groups we can do:

In [None]:
sc.pl.violin(adata, ['CD3D', 'CST3', 'CD79A', 'CCL5', 'NKG7'], groupby='louvain_r0.5')

In [None]:
sc.pl.violin(adata, ['IL7R', 'LYZ', 'MS4A1', 'CD8A', 'GNLY'], groupby='louvain_r0.5')

If we were unsure about the makeup of one or more cluster we could check the fraction of known marker genes that are found in the cluster marker gene set.

I think cluster 0 is a Naive CD4+ T-Cell. To check the fraction of the known marker genes I can do:

In [None]:
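# Note: 'CD45RA' and 'CD62L' are protein/isoform names (genes PTPRC and SELL), so they will not match gene symbols in adata.var_names
marker_genes = {'Naive_T': {'IL7R','LDHB','CD45RA', 'CCR7', 'CD62L', 'CD27','CD3D'}}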

In [None]:
marker_genes

In [None]:
cell_annotation = sc.tl.marker_gene_overlap(adata, marker_genes, key='rank_genes_r0.5')
cell_annotation
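The overlap being computed is essentially set intersection. The following plain-Python sketch uses hypothetical top-gene lists per cluster (not the actual results of this notebook) to show how an overlap fraction relative to the reference marker set can be derived.

```python
# Hypothetical inputs: a reference marker set and top-ranked genes per cluster
reference_markers = {"Naive_T": {"IL7R", "LDHB", "CCR7", "CD27", "CD3D"}}
cluster_top_genes = {
    "0": ["LDHB", "CD3D", "IL7R", "RPS12", "CCR7"],
    "1": ["LYZ", "S100A9", "CST3", "FCN1", "TYROBP"],
}

overlap_fraction = {}
for cell_type, markers in reference_markers.items():
    for cluster, top_genes in cluster_top_genes.items():
        shared = markers & set(top_genes)
        # Fraction of the reference markers found among the cluster's top genes
        overlap_fraction[(cell_type, cluster)] = len(shared) / len(markers)

for key, fraction in overlap_fraction.items():
    print(key, fraction)
```

A marker name that does not exist as a gene symbol in the data can never be matched, so it silently lowers the fraction for every cluster.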

If you had a lot of different cell types you could make a heatmap of the gene overlap

In [None]:
cell_annotation_norm = sc.tl.marker_gene_overlap(adata, marker_genes, key='rank_genes_r0.5', normalize='reference')
sns.heatmap(cell_annotation_norm, cbar=False, annot=True)

A more rigorous analysis would be to perform an enrichment test. Yet, in this data set the assignment is sufficiently clear so that it is not necessary.

Note that passing use_raw=False to a plotting function makes it read expression values from adata.X instead of adata.raw.

In [None]:
new_cluster_names = ['CD4 T', 'Monocytes', 'B', 'CD8 T', 'NK']
adata.rename_categories('louvain_r0.5', new_cluster_names)

In [None]:
sc.pl.tsne(adata, color=['louvain_r0.5'], legend_loc='on data' ,title='', frameon=False, legend_fontsize=10, alpha=0.8, size=20, save = 'tsne_cellType_res0_5_winter2021.pdf')

In [None]:
adata.obs['louvain_r0.5'].value_counts()

We could also look at some of our marker genes on the UMAP or t-SNE plots.

In [None]:
sc.pl.umap(adata, color=['MS4A1', 'louvain_r0.5'], legend_loc='on data', legend_fontsize=10, alpha=0.2, size=20)

## 3.3 Subclustering 

To build on the base clustering, we can now subcluster parts of the data to identify substructure within the identified cell types. Here, we subcluster the 'Monocytes' population to see if we can find any differences among the monocytes in this cluster.

Subclustering is normally performed at a lower resolution than on the entire dataset given that clustering is more sensitive when performed on a small subset of the data.

In [None]:
#Subcluster monocytes
sc.tl.louvain(adata, restrict_to=('louvain_r0.5', ['Monocytes']), resolution=0.2, key_added='louvain_r0.5_Mono_sub')

In [None]:
#Show the new clustering
if 'louvain_r0.5_Mono_sub_colors' in adata.uns:
    del adata.uns['louvain_r0.5_Mono_sub_colors']

sc.pl.umap(adata, color='louvain_r0.5_Mono_sub', palette='ocean')
sc.pl.umap(adata, color='louvain_r0.5', palette='ocean')

The monocyte cluster broke up into three clusters and now we can find the marker genes in these clusters

In [None]:
#Get the new marker genes
sc.tl.rank_genes_groups(adata, groupby='louvain_r0.5_Mono_sub', key_added='rank_genes_r0.5_Mono_sub')

In [None]:
#Plot the new marker genes
sc.pl.rank_genes_groups(adata, key='rank_genes_r0.5_Mono_sub', groups=['Monocytes,0','Monocytes,1','Monocytes,2'], fontsize=12)

In [None]:
pd.DataFrame(adata.uns['rank_genes_r0.5_Mono_sub']['names']).head(10)

In [None]:
Image("Fig4_Mono_het.jpg")

"Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors" Science, April 21, 2017 = seems to indicate the our "Monocytes 2" cluster are dendritic cells = "Numerous studies have shown that human dendritic cells express high levels of major histocompatibility complex (MHC) class II (HLA-DR), a molecule essential for antigen presentation..."

Judging from their Fig. 4, our 'Monocytes,0' cluster corresponds to their Mono1 and our 'Monocytes,1' cluster corresponds to their Mono2.

In [None]:
#Visualize some Monocyte markers
Mono_genes = ['LYZ', 'FCGR3A', 'HLA-DPB1']
sc.pl.umap(adata, color=Mono_genes[:3], title=Mono_genes[:3])

In [None]:
#Categories to rename
adata.obs['louvain_r0.5_Mono_sub'].cat.categories

In [None]:
adata.rename_categories('louvain_r0.5_Mono_sub', ['B', 'CD4 T', 'CD8 T', 'Mono1', 'Mono2', 'Dendritic', 'NK'])
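Passing a list to rename_categories relies on the new names being in exactly the same order as the existing categories. As a hedged alternative sketch on a toy pandas Series (the `labels` values are made up), an explicit mapping avoids depending on that order; categories not mentioned in the mapping are left unchanged.

```python
import pandas as pd

# Toy categorical column with subcluster labels (hypothetical values)
labels = pd.Series(['Monocytes,0', 'B', 'Monocytes,2', 'Monocytes,1', 'NK'],
                   dtype='category')

# Explicit old -> new mapping, following the assignment discussed above
mapping = {
    'Monocytes,0': 'Mono1',
    'Monocytes,1': 'Mono2',
    'Monocytes,2': 'Dendritic',
}

# Categories missing from the mapping ('B', 'NK') pass through untouched
renamed = labels.cat.rename_categories(mapping)
print(renamed.tolist())
```

Checking the output of the previous cell (the current category order) before renaming with a list remains a good habit either way.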

To make things simpler, we now store this annotation as our final clustering, "louvain_final".

In [None]:
adata.obs['louvain_final'] = adata.obs['louvain_r0.5_Mono_sub']

In [None]:
sc.pl.umap(adata, color='louvain_final', palette='ocean', legend_loc='on data')

In [None]:
adata.obs['louvain_final'].value_counts()

## 3.4 Trajectory Inference and pseudotime analysis 

As our data set contains differentiation processes, we can investigate the differentiation trajectories in the data. This analysis is centred around the concept of 'pseudotime'. In pseudotime analysis a latent variable is inferred based on which the cells are ordered. This latent variable is supposed to measure the differentiation state along a trajectory.

Pseudotime analysis is complicated when there are multiple trajectories in the data. In this case, the trajectory structure in the data must first be found before pseudotime can be inferred along each trajectory. The analysis is then called 'trajectory inference'.

Once the pseudotime variable is inferred, we can test for genes that vary continuously along pseudotime. These genes are seen as being associated with the trajectory, and may play a regulatory role in the potential differentiation trajectory that the analysis found.
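As a minimal sketch of what "varying along pseudotime" can mean, the following uses synthetic data and a simple Spearman rank correlation between expression and pseudotime. This is an illustrative assumption and not the test used by any particular trajectory method; the gene names and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200
pseudotime = rng.uniform(0, 1, size=n_cells)

# Synthetic expression: 'gene_up' rises with pseudotime, 'gene_flat' is pure noise
expr = np.column_stack([
    2.0 * pseudotime + rng.normal(scale=0.1, size=n_cells),
    rng.normal(scale=0.1, size=n_cells),
])
gene_names = ['gene_up', 'gene_flat']

def ranks(v):
    """Return the rank (0 = smallest) of each entry of v."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

rho = {g: spearman(expr[:, j], pseudotime) for j, g in enumerate(gene_names)}
print(rho)
```

A rank correlation only captures monotonic trends; genes that peak in the middle of a trajectory would need a more flexible model.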

Here, we measure the trajectory of the B/T/NK cells, which are supposed to be derived from a common progenitor. We also investigate which genes vary along pseudotime.

Based on a recent comparison of pseudotime methods [Saelens et al., 2018], we have selected two of the top-performing methods: 'Monocle2' and 'Diffusion Pseudotime (DPT)'. Two methods were chosen because trajectory inference is a complex problem that is not yet solved, and different methods perform well on different types of trajectories. For example, 'Slingshot' was the top performer for simple bifurcating and multifurcating trajectories, but it takes more time to run, so we skip it here; 'Monocle2' performed best for complex tree structures, and 'DPT' performed well on bifurcating trajectories. As the complexity of trajectories is generally not known in advance, it is advisable to compare the outputs of several trajectory inference methods.

We first subset the data to include only the B cell, T cell, and NK cell clusters. After subsetting it is important to recalculate the dimensionality reduction methods such as PCA, and diffusion maps, as the variability of the subsetted data set will be projected onto different basis vectors.

Note that we subset the data to include only the cells we want to examine. Trajectory inference, and especially measuring gene expression changes over pseudotime can be a computationally expensive process, thus we often work with reduced gene sets that are informative of the variance in the data.

In [None]:
#Subsetting to relevant clusters
clusters_to_include = [g for g in adata.obs['louvain_final'].cat.categories if (g.startswith('CD4 T') or g.startswith('CD8 T') or g.startswith('NK') or g.startswith('B'))]
adata_BTNK = adata[np.isin(adata.obs['louvain_final'], clusters_to_include),:].copy()

#Subset to highly variable genes
sc.pp.highly_variable_genes(adata_BTNK, flavor='cell_ranger', n_top_genes=2500, subset=True)

As we have subsetted the data to include only cell types that we assume are of interest, we recalculate the dimension reduction algorithms on this data. This ensures that for example the first few PCs capture only the variance in this data and not variance in parts of the full data set we have filtered out.
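A toy numpy sketch of why this matters, using made-up two-dimensional data: when two well-separated groups are present, the first PC points along the between-group axis, whereas after subsetting to one group it points along the within-group axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic groups: separated along axis 0, varying internally along axis 1
group_a = rng.normal(size=(100, 2)) * [0.2, 1.0]
group_b = rng.normal(size=(100, 2)) * [0.2, 1.0] + [10.0, 0.0]
full = np.vstack([group_a, group_b])

def first_pc(X):
    """First principal axis of X, computed via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

pc_full = first_pc(full)      # dominated by the group separation (axis 0)
pc_sub = first_pc(group_a)    # dominated by within-group variation (axis 1)
print(np.round(np.abs(pc_full), 2), np.round(np.abs(pc_sub), 2))
```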

In [None]:
#Recalculating PCA for subset
sc.pp.pca(adata_BTNK, svd_solver='arpack')
sc.pl.pca(adata_BTNK)
sc.pl.pca_variance_ratio(adata_BTNK)

Trajectory inference is often performed on PCA-reduced data, as is the case for Slingshot and Monocle2. To assess how many principal components (PCs) should be included in the low-dimensional representation, we can use the 'elbow method', which looks for the 'elbow' in the plot of the variance ratio explained per PC. Above we can see the elbow at PC6, so we keep only the first six PCs in the reduced representation.

In [None]:
adata_BTNK.obsm['X_pca'] = adata_BTNK.obsm['X_pca'][:,0:6]
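The elbow above was picked by eye. As a hedged sketch of one possible way to automate that choice, the heuristic below returns the PC whose point on the variance curve lies farthest from the straight line joining the first and last points. The `elbow_pc` helper and the example values are hypothetical and are not part of scanpy.

```python
import numpy as np

def elbow_pc(variance_ratio):
    """Return the 1-based PC index farthest from the chord joining the
    first and last points of the variance-ratio curve (a simple heuristic)."""
    y = np.asarray(variance_ratio, dtype=float)
    x = np.arange(len(y), dtype=float)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # Perpendicular distance of every point to the chord
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist)) + 1

# Made-up variance values with a visible bend around the fourth PC
toy_variance = [10, 6, 3, 1.5, 1.0, 0.9, 0.8, 0.7]
elbow = elbow_pc(toy_variance)
print(elbow)
```

Such a heuristic is best used as a starting point; the variance ratio plot should still be inspected before truncating the PCA representation.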

## 3.4.2 Diffusion Pseudotime (DPT) 

We include Diffusion Pseudotime in the analysis to further support the found trajectories. Diffusion pseudotime is integrated into scanpy and is therefore easy to use with the current setup.

DPT is based on diffusion maps, thus a diffusion map representation must be calculated prior to pseudotime inference. This in turn is based on a KNN graph embedding obtained from sc.pp.neighbors().

In [None]:
sc.pp.neighbors(adata_BTNK)
sc.tl.diffmap(adata_BTNK)

In [None]:
sc.pl.diffmap(adata_BTNK, components='1,2', color='louvain_final')
sc.pl.diffmap(adata_BTNK, components='1,3', color='louvain_final')

Looking at the first three diffusion components (DCs) we can see that DC3 separates the NK trajectory.

In DPT we must assign a root cell to infer pseudotime. In the plots we can observe that the most appropriate root will be either the CD4 T or CD8 T cell. We will stick with the CD4 T cell for consistency and because it has the minimum DC1 and the maximum DC2 value.

Note that 'DC3' is stored in adata_BTNK.obsm['X_diffmap'][:,3] as the 0-th column is the steady-state solution, which is non-informative in diffusion maps.
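To illustrate why the 0-th component carries no information, here is a generic diffusion map sketch in numpy on random toy data. It assumes a Gaussian kernel with a fixed bandwidth and a row-normalized transition matrix, which is not necessarily the normalization scanpy uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # toy cells in 2 dimensions

# Gaussian kernel on pairwise squared distances (bandwidth is an assumption)
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dist / 2.0)
d = K.sum(axis=1)

# Symmetric form S = D^-1/2 K D^-1/2 shares eigenvalues with T = D^-1 K
S = K / np.sqrt(np.outer(d, d))
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]                   # sort by decreasing eigenvalue
evals, evecs = evals[order], evecs[:, order]

# Right eigenvectors of the transition matrix T
psi = evecs / np.sqrt(d)[:, None]

print(round(float(evals[0]), 6))                  # top eigenvalue
print(float(np.ptp(psi[:, 0])))                   # spread of the 0-th component
```

In this setting the top eigenvalue is 1 and its eigenvector takes the same value for every cell, so it cannot separate cells; the informative components start at index 1.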

In [None]:
#Find the CD4 T cell with the lowest DC3 value (np.argmin) to act as root for the diffusion pseudotime and compute DPT
stem_mask = np.isin(adata_BTNK.obs['louvain_final'], 'CD4 T')
max_stem_id = np.argmin(adata_BTNK.obsm['X_diffmap'][stem_mask,3])
root_id = np.arange(len(stem_mask))[stem_mask][max_stem_id]
adata_BTNK.uns['iroot'] = root_id

#Compute dpt
sc.tl.dpt(adata_BTNK)
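The root selection above first finds a position within the masked subset and then maps it back to a position in the full data. A toy numpy sketch of that index mapping, with made-up labels and DC3 values:

```python
import numpy as np

# Toy cluster labels and DC3 values for five cells (hypothetical)
labels = np.array(['B', 'CD4 T', 'CD4 T', 'NK', 'CD4 T'])
dc3 = np.array([0.5, 0.2, -0.3, 0.9, 0.1])

mask = labels == 'CD4 T'

# Position of the minimum *within the masked subset*
local_idx = np.argmin(dc3[mask])

# Map the subset position back to a position in the full array
root_id = np.flatnonzero(mask)[local_idx]

# Equivalent to the np.arange(len(mask))[mask][...] pattern used above
root_id_alt = np.arange(len(mask))[mask][local_idx]

print(root_id, labels[root_id], dc3[root_id])
```

Skipping the mapping step and using the subset position directly would silently pick the wrong cell as root.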

In [None]:
#Visualize pseudotime over differentiation
sc.pl.diffmap(adata_BTNK, components='1,3', color='dpt_pseudotime')

## 3.5 Partition-based graph abstraction

Partition-based graph abstraction (PAGA) is a method to reconcile clustering and pseudotemporal ordering. It can be applied to an entire dataset and does not assume that there are continuous trajectories connecting all clusters.

As PAGA is integrated into scanpy, we can easily run it on the entire data set. Here we run and visualize PAGA with different clustering inputs.

In [None]:
sc.tl.paga(adata, groups='louvain_final')
sc.pl.paga_compare(adata)
sc.pl.paga(adata)

The close connectivity of the two different subclusters shows the similarity of these cells. However, we would have predicted that the B cells would be closer to the CD4 T/CD8 T clusters. This result shows that not all connections in PAGA represent differentiation trajectories; some reflect transcriptional similarity between states. Thus, further experiments are required to confirm potential lineage trajectories obtained via PAGA or other trajectory inference methods.
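To make the idea of cluster-level connectivity concrete, here is a simplified numpy sketch that counts the edges of a toy cell-cell graph between clusters. PAGA's actual connectivity measure additionally compares such counts to what is expected under a random model, so this is an illustration of the principle only; the graph and cluster assignments are made up.

```python
import numpy as np

# Symmetric adjacency matrix of a toy cell-cell neighbor graph (6 cells)
edges = [(0, 1), (2, 3), (4, 5), (1, 2), (0, 2)]
A = np.zeros((6, 6), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Hypothetical cluster assignment of the six cells
clusters = np.array([0, 0, 1, 1, 2, 2])

# One-hot membership matrix (cells x clusters)
M = np.eye(3, dtype=int)[clusters]

# C[a, b] = number of graph edges between cells of cluster a and cluster b
# (diagonal entries count each within-cluster edge twice)
C = M.T @ A @ M
print(C)
```

Here clusters 0 and 1 share two edges while cluster 2 is disconnected from both, which says that their cells are neighbors in expression space and nothing about lineage.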

We can do the same visualization on a umap layout.

In [None]:
sc.pl.paga_compare(adata, basis='umap')

In [None]:
fig1, ax1 = plt.subplots()
sc.pl.umap(adata, size=40, ax=ax1, show=False)
sc.pl.paga(adata, pos=adata.uns['paga']['pos'], show=False, node_size_scale=10, node_size_power=1, ax=ax1, text_kwds={'alpha':0})
#plt.savefig('./figures/umap_paga_overlay_PBMCs.pdf', dpi=300, format='pdf')
plt.show()

Implementation note:

Note that the above plotting function only works if sc.pl.paga_compare(adata, basis='umap') has been run beforehand. The sc.pl.paga_compare() function stores the correct positions in adata.uns['paga']['pos'] to overlay the PAGA plot on a UMAP representation. To overlay PAGA on another representation, run sc.pl.paga_compare() with a different basis parameter before plotting the combined plot.

Regressing out the cell cycle effect will likely change how cycling cells are included in the trajectory. In this manner, trajectory inference and graph abstraction (PAGA) can be iteratively improved to better represent the biology. Knowing when to stop attempting to improve, or assessing when all of the relevant technical covariates have been taken into account, can only be achieved with sufficient knowledge of the biological system, experience, and possibly some luck.

It should also be noted that while an abstracted graph or an inferred trajectory can help to infer a lineage tree, experimental validation is necessary. Key driving forces in lineage specification might be lowly expressed genes and therefore neglected in the graph or even excluded in the HVG filtering.

## 4 Summary

In this case study we went through the typical steps of an scRNA-seq data analysis workflow. We started with general preprocessing steps, which included cell and gene quality control, normalization, batch correction, selection of highly variable genes, visualization, and cell cycle scoring. In these steps the overall structure of the data is explored and filtered to produce optimal downstream analysis results. In the downstream scRNA-seq analysis section we then used methods to interpret the data and investigate particular parts of it. These steps included clustering and cluster identification via marker genes, trajectory inference, inferring an abstracted graph to relate clusters and trajectory inference in one visualization, and an example of differential expression although we did not have any experimental conditions to test.

Any individual data analysis script will not always follow all of these steps. For example, trajectory inference may not be relevant where no differentiation processes are captured (this is probably the case with our data). Furthermore, a typical analysis will not always traverse the above steps as linearly as shown here. We have attempted to show this by suggesting how one can move back to data preprocessing to improve trajectory inference or PAGA. Indeed, all downstream analyses may require going back to tweak or add preprocessing steps to improve the downstream analysis results.

A further aspect we stressed in this case study is that one should take care which stage of the data is used for each analysis step. Uncorrected, normalized data (or raw counts, depending on the tool) are generally used for statistical tests where technical covariates such as batch can be included; corrected data should be used to visually compare results, as the human eye cannot account for technical differences that confound the comparison. When computing tests such as differential expression or finding marker genes, it is advisable to use the full data set rather than one restricted to highly variable genes.

In [None]:
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))