### Exercises: Basic single-cell data analysis workflow 
In this exercise, we will import an example dataset, perform quality control, preprocess the data, embed them in a lower-dimensional space and assign cell type labels to clusters of cells.

#### Import required packages and data
We will use a subset of the 10x Genomics v3 10k PBMC dataset (“10k PBMCs from a Healthy Donor (v3 chemistry)”; https://www.10xgenomics.com/datasets/10-k-pbm-cs-from-a-healthy-donor-v-3-chemistry-3-standard-3-0-0). Cell numbers were reduced while keeping the individual cells’ sequencing depth the same. Only some cell types in the original dataset were kept. A number of empty droplets and low-quality cells are also included.

You need to download the dataset from the course GitHub repository (https://github.com/buchauer-lab/charite-sc-data-course/blob/main/materials/Day2/healthy_PBMCs.zip), unzip it, and enter the correct path to the data on your system in the import function below.

In [None]:
# general data handling
import numpy as np
import pandas as pd
from scipy import sparse

# single cell analysis
import scanpy as sc
import decoupler as dc

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# adjust the path to the location of the folder "healthy_PBMCs" on your system
pbmc_data_path = "path/to/healthy_PBMCs"  # placeholder: replace with the path on your system

In [None]:
# import single cell data
adata = sc.read_10x_mtx(pbmc_data_path)

In [None]:
# basic information about the object is displayed if you enter 'adata' and execute the cell


In [None]:
# documentation of anndata objects is here https://anndata.readthedocs.io/en/stable/
# inspect the count matrix adata.X


In [None]:
# inspect the gene names in adata.var_names


In [None]:
# inspect the cell identifiers in adata.obs_names


#### Questions
1. How many cells does the object contain? What do the names of the cells represent?
2. How many genes have been recorded?

### Quality control
Before getting started, recap with your neighbor what we discussed in the lectures:
1. Which kind of quality issues could our data suffer from?
2. Which metrics can we inspect to identify problematic cells?  

Now, we will begin by calculating the fraction of mitochondrial reads per cell. Human mitochondrial gene names start with "MT-", a pattern we can use to find the relevant genes.
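
To make the metric concrete, here is a minimal sketch that computes the mitochondrial percentage by hand on a made-up count matrix. The gene names and counts are invented for illustration only; in the exercise itself, `sc.pp.calculate_qc_metrics` performs this calculation for you.

```python
import numpy as np

# toy data (invented): 3 cells x 4 genes
gene_names = np.array(["MT-CO1", "ACTB", "MT-ND1", "CD3E"])
counts = np.array([
    [10, 80, 10, 0],   # cell 0
    [45, 40, 5, 10],   # cell 1
    [0, 50, 0, 50],    # cell 2
])

# flag mitochondrial genes by their "MT-" prefix
is_mt = np.char.startswith(gene_names, "MT-")

# percentage of each cell's counts that come from mitochondrial genes
total_counts = counts.sum(axis=1)
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts
print(pct_counts_mt)
```

The boolean flag plays the same role as the `adata.var["mt"]` column created in the next cell.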

In [None]:
# we first introduce a flag (True or False) value for each gene indicating whether it is mitochondrial
adata.var["mt"] = adata.var_names.str.startswith("MT-")

In [None]:
# use the function sc.pp.calculate_qc_metrics to calculate QC metrics including ones on mitochondrial genes
# use flag inplace=True


Which other QC metrics does the function calculate automatically? Visit the documentation to find out.

In [None]:
# plot violin plots of number of counts per cell, number of genes per cell
# and percentage of mitochondrial reads per cell using sc.pl.violin
# use flag multi_panel=True to get three separate panels


Discuss the shapes of these distributions with your neighbor. Which features of these distributions are of particular interest? Are there intuitive cut-offs that come to mind?

In [None]:
# inspect if there is a correlation between transcript counts and the number of recorded features
# as well as with the mitochondrial percentage
# use sc.pl.scatter to show total_counts, n_genes_by_counts and pct_counts_mt in one plot
# generate three plots, one with each pair of the three QC metrics on the axes


In [None]:
# Run this cell to have a quick look at the Pearson correlation coefficients between the QC metrics
qc_metrics = ['total_counts', 'n_genes_by_counts', 'pct_counts_mt']
correlation_matrix = adata.obs[qc_metrics].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('QC Metrics Correlation Matrix')

#### Questions
1. Is there a correlation between the number of counts per cell and the number of genes per cell? Why (not)?
2. Is there a correlation between the number of recorded genes per cell and the mitochondrial content? Why (not)?
3. What can we learn about cells with high mitochondrial content from all the plots we have generated so far?

We now need to decide on QC cut-offs for filtering, i.e. values for each of the QC metrics below (or above) which we will discard cells from the dataset. Generally, we advise you to be conservative at this stage in order to prevent losing too many cells. You can always add additional filtering steps later on.   
  
4. Keeping this advice in mind, decide on the following cut-offs:  
- minimum n_genes_by_counts   
- maximum n_genes_by_counts  
- maximum pct_counts_mt  
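
The logic of threshold-based filtering can be sketched on a small, invented QC table. The cut-off values below are hypothetical and chosen only to suit the toy numbers; they are not a recommendation for the PBMC dataset.

```python
import pandas as pd

# toy per-cell QC table (invented values), mimicking columns of adata.obs
obs = pd.DataFrame(
    {
        "n_genes_by_counts": [150, 1200, 2500, 7000, 1800],
        "pct_counts_mt": [5.0, 8.0, 35.0, 6.0, 12.0],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d", "cell_e"],
)

# hypothetical cut-offs, chosen only for this toy table
min_genes, max_genes, max_pct_mt = 200, 6000, 20.0

# a cell is kept only if it passes all three criteria
keep = (
    (obs["n_genes_by_counts"] >= min_genes)
    & (obs["n_genes_by_counts"] <= max_genes)
    & (obs["pct_counts_mt"] < max_pct_mt)
)
filtered = obs[keep]
print(f"kept {keep.sum()} of {len(obs)} cells")
```

Comparing the number of cells before and after each criterion shows which cut-off removes the most cells.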

In [None]:
# apply the gene number filters to the data using sc.pp.filter_cells

# modify the line below to use your mitochondrial cut-off
# (.copy() turns the filtered view into an independent object)
adata = adata[adata.obs.pct_counts_mt < 57].copy()

In [None]:
# we will also remove genes which appear in fewer than 5 cells in order to reduce dataset size
sc.pp.filter_genes(adata, min_cells=5)

In [None]:
# how many cells did we lose through this filtering?


### Normalisation and identification of variable features
Following the lecture on data preprocessing, we will now normalize and log-transform the data and select highly variable features. We will also explore Pearson residuals and compare the result to lognorm preprocessing.
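
As a minimal sketch of what lognorm preprocessing does numerically, the example below scales two invented cells to a common total and applies the log transform. It illustrates the arithmetic only and is not scanpy's implementation.

```python
import numpy as np

# toy raw counts (invented): 2 cells x 3 genes
counts = np.array([
    [1.0, 3.0, 6.0],       # 10 counts in total
    [20.0, 60.0, 120.0],   # 200 counts in total, same composition
])

target_sum = 1e4  # normalize every cell to 10,000 counts

# scale each cell (row) so that its counts sum to target_sum
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# log1p computes log(1 + x), which keeps zeros at zero
lognorm = np.log1p(normalized)
print(normalized)
```

Both cells end up with identical normalized profiles, because they differ only in sequencing depth and not in composition.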

In [None]:
# unlike Seurat, which saves intermediate processed data layers by default,
# processing steps in scanpy overwrite the original data. Therefore, we store
# the original count data in a separate data layer before proceeding
adata.layers["counts"] = adata.X.copy()
# we will also make a copy to be used in alternative processing with Pearson residuals below
adata2 = adata.copy()

In [None]:
# next, use scanpy's normalize_total to normalize each cell to 10,000 counts


In [None]:
# use scanpy's log1p to log-transform the data


In [None]:
# run scanpy's highly_variable_genes to find the top 2000 variable genes


### Principal Component Analysis
We will calculate the principal components of our data (using the highly variable genes identified above as an input) and then inspect our data in the first few components - a first step towards cell type identification!
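
For intuition about scores, loadings and explained variance, here is a minimal NumPy sketch of PCA via singular value decomposition on random toy data. It is an illustration under simplified assumptions (dense data, no gene selection or scaling), not a description of how scanpy computes its PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data (random): 100 cells x 5 genes, with extra variance along gene 0
X = rng.normal(size=(100, 5))
X[:, 0] *= 5

# center each gene, then decompose
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S                    # cell coordinates in PC space (cells x PCs)
loadings = Vt.T                   # gene weights per PC (genes x PCs)
var_ratio = S**2 / np.sum(S**2)   # fraction of variance explained per PC

print("variance ratio:", np.round(var_ratio, 3))
print("gene with largest absolute loading on PC1:", np.abs(loadings[:, 0]).argmax())
```

Genes with large absolute loadings on a component are the ones that drive the separation of cells along that component.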

In [None]:
# run scanpy's tl.pca function on the data


In [None]:
# use scanpy's plotting function pca_loadings to display the loadings of the first 3 principal components


In [None]:
# plot a 2D scatter plot of the dataset in its first two principal components using
# scanpy's pca plot function. Colour the points by log1p_total_counts.
# use annotate_var_explained=True


In [None]:
# make the same 2D PCA plot as above, but use the top positive and
# negative loading genes for PCs 1 and 2 to colour the dots
# (should result in 4 panels, one per gene).


#### Questions
1. How much variability is contained in PC1, PC2 and PC3?
2. What is the biological interpretation of genes with positive vs. negative loadings on the same principal component? What does it mean when genes have opposite signs in their PC loadings?
3. Using PC1 and PC2 as an example, explain what biological processes or cell states might be represented by the positive vs. negative gene sets.
4. Looking at the 2D PCA plot above, there seem to be 3 main groups of cells separated by PCs 1 and 2. Based on the loading information, make an intelligent guess which cell types these might represent (you might search for the relevant genes online or ask a Chatbot of your choice).

### Alternative preprocessing with Pearson residuals [optional]
Pearson residuals, as introduced in the lecture, represent an alternative preprocessing strategy which can provide advantages for the identification of small populations. It is important to note that Pearson residuals _replace_ the lognorm transformation and are not to be used on top of it. Their computation requires raw counts as input.
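
The sketch below computes analytic Pearson residuals by hand on an invented count matrix, to show why raw counts are required. The overdispersion value `theta=100` and the clipping at the square root of the number of cells are assumptions that mirror what we understand to be scanpy's defaults; please check the documentation of `sc.experimental.pp.normalize_pearson_residuals` before relying on them. The helper name `pearson_residuals` is hypothetical.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Analytic Pearson residuals under a negative binomial null model."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    # expected count if every gene scaled purely with sequencing depth
    mu = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / counts.sum()
    # negative binomial variance: mu + mu^2 / theta (theta is an assumed value)
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    # clip extreme values to +/- sqrt(n_cells) (assumed default)
    clip = np.sqrt(n_cells)
    return np.clip(residuals, -clip, clip)

# toy raw counts (invented): 4 cells x 3 genes
counts = np.array([
    [10, 0, 5],
    [20, 1, 9],
    [5, 0, 30],
    [15, 2, 6],
])
res = pearson_residuals(counts)
print(np.round(res, 2))
```

Positive residuals mark genes that a cell expresses more strongly than expected from its sequencing depth alone, negative residuals the opposite.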

In [None]:
# use the unprocessed copy of the anndata object made above and
# transform it using sc.experimental.pp.normalize_pearson_residuals
sc.experimental.pp.normalize_pearson_residuals(adata2)

In [None]:
# we will also calculate highly variable genes based on Pearson residuals
# of note, this function expects raw counts as input, which we here provide via the corresponding layer
sc.experimental.pp.highly_variable_genes(adata2, flavor='pearson_residuals', n_top_genes=2000, layer='counts')

In [None]:
# calculate a pca for the Pearson transformed dataset


In [None]:
# use scanpy's plotting function pca_loadings to display the loadings of the first 3 principal components
sc.pl.pca_loadings(adata2, 
                   components=[1, 2, 3], 
                   include_lowest=True
    )

In [None]:
# plot a 2D scatter plot of the dataset in its first two principal components using
# scanpy's pca plot function. Colour the points by log1p_total_counts.
# use annotate_var_explained=True


In [None]:
# make the same 2D PCA plot as above, but use the top positive and
# negative loading genes for PCs 1 and 2 to colour the dots
# (should result in 4 panels, one per gene).


In [None]:
# for a simple comparison, plot the four genes identified above from the Pearson residual analysis
# onto the PCA derived from the lognorm approach (i.e. on object adata)


The differences between the two normalization methods are minor for our dataset, as is to be expected based on benchmarking work (Ahlmann-Eltze et al., 2023). Interestingly, Pearson residuals here pick up on ribosomal genes on PC1, a group of genes which most cells express to some degree and whose dynamic range in the lognorm approach is not very large (lognormalized expression between 3 and 6). Pearson residuals open up this dynamic range more, resulting in a higher weight on PC1. This illustrates how Pearson residuals may pick up on more subtle differences than standard approaches, but also shows that in most cases (including ours), the extra effort is not necessary.

### Preliminary cell type exploration
The data we are working with contain subtypes of peripheral blood mononuclear cells (some cell types were excluded during the preparation of the example dataset). Find marker genes for cell types that may be included, for example using the Protein Atlas (https://www.proteinatlas.org/humanproteome/immune+cell), and plot them onto the PCA to check whether these cell types are present.

In [None]:
# plot the candidate marker genes you found onto the 2D PCA plot


### Non-linear dimension reduction and clustering
Below, we will further reduce the dimensionality of the data by discarding higher PCs, create a UMAP as an example for a non-linear 2D embedding, and use graph-based clustering for community detection.

In [None]:
# use the plot function sc.pl.pca_variance_ratio to get a look at
# the variance captured by each PC
# play with using log=True and log=False


#### Questions
1. What do we learn from the elbow plot? 
2. Based on the plots as well as on what was discussed during the lecture, how many PCs would you suggest keeping for downstream steps?

In [None]:
# Graph-based clustering and UMAP calculation need a nearest-neighbor graph
# as an input. Calculate the nearest neighbors using scanpy's neighbors function,
# and be sure to specify the number of PCs to consider for this step


In [None]:
# next, we calculate a clustering using the Leiden algorithm
# by calling the leiden function.


In [None]:
# calculate a umap embedding using the corresponding scanpy function


In [None]:
# plot the UMAP coloured by the Leiden clustering returned above


In [None]:
# plot the Leiden clustering onto the PCA calculated above


#### Questions
1. Please describe the UMAP and PCA plots above. What are the main differences?
2. Discuss with your neighbor how you think these differences may arise given that PCA is a linear transformation and UMAP is not, but rather an algorithm optimised to project local similarities in high-dimensional space into 2D.

In [None]:
# the UMAP projection algorithm has stochastic elements, meaning
# that its outcome depends on random initial assignments of cells. Explore these random effects
# by running and plotting UMAP three times with different random_state values.
sc.tl.umap(adata, random_state=1)
sc.pl.umap(adata, color='leiden')
plt.show()


In [None]:
# An important parameter of Leiden clustering is resolution. Run the clustering three
# times with different resolutions and plot the results. You can store each clustering
# outcome by using different name tags using key_added='leiden_res1.0' or similar
sc.tl.leiden(adata, resolution=0.5, key_added='leiden_res0.5')


#### Questions
1. Which UMAP would you choose for publication?
2. What happens as you change the resolution parameter? With which resolution would you proceed for further investigation?

In [None]:
# plot the QC metrics discussed above onto a UMAP plot with 4 panels,
# the fourth showing the clustering in the resolution you intend to
# proceed with


### Marker gene calculation and bottom-up cell type annotation
We will now calculate genes characteristic of each cluster (marker genes) and use them to identify the cell types in our dataset. We will also explore an automated cell type annotation method and compare the results.

In [None]:
# for annotation purposes, select a clustering resolution to continue working with
# then, run marker gene calculations with the rank_genes_groups function


In [None]:
# first, have a look at the top marker genes associated with each cluster
# using the plot function rank_genes_groups


In [None]:
# you can inspect each group's marker genes using a get function
# as shown below, and by entering the group name you are interested in
sc.get.rank_genes_groups_df(adata, group='0')

In [None]:
# now, we visualize the top 5 differentially expressed genes per cluster as a dotplot
# using rank_genes_groups_dotplot
# the function automatically calculates and plots a dendrogram showing similarity between
# cell types


In [None]:
# scanpy offers other types of plots for marker gene analysis. Have a look at the
# corresponding documentation and try another 1-2 types of plots, e.g. the tracksplot,
# which helps you visualize expression across individual cells
# https://scanpy-tutorials.readthedocs.io/en/multiomics/visualizing-marker-genes.html


#### Questions
1. What do positive and negative fold changes in the marker gene table mean?
2. Why is there a p-value and an adjusted p-value?
3. What does the dendrogram on the side of the dot plot tell us? Are there any pairs of clusters which seem very similar to each other based on the dendrogram?
4. Use the marker genes derived for each group and the human protein atlas (https://www.proteinatlas.org/humanproteome/immune+cell) or other resources of your choice to give cell type labels to each cluster.
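
As background for the question on adjusted p-values, the following sketch implements the Benjamini-Hochberg procedure on a handful of invented p-values. Benjamini-Hochberg is, to our knowledge, the default correction in `rank_genes_groups`; the helper name `benjamini_hochberg` and the input values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    # scale the i-th smallest p-value by n / i
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.20]   # invented raw p-values for four genes
adjusted = benjamini_hochberg(pvals)
print(np.round(adjusted, 4))
```

Adjusted values are never smaller than the raw ones, and the gap widens as more genes are tested at once.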

In [None]:
# if there are clusters for which you are not sure yet, it can help to compare them
# to a neighboring population specifically using rank_genes_groups with the additional
# keys groups=['0', '1'], reference='2' which would compare groups 0 and 1 to group 2


#### Cell-type annotation using marker gene lists (from literature)
Typically, you know which tissue you are analysing and can find annotated datasets of the same type or even published marker gene lists for your tissue. You can use these to help you annotate your cell types. Search the web for a few marker genes for each of the cell types below and plot these for your clusters.

In [None]:
PBMC_marker_genes = {
    "Classical Monocytes": ["gene1", "gene2"],
    "B cells": [],
    "CD4+ T": [],
    "CD8+ T": []}

In [None]:
# visualize these genes using the scanpy function dotplot


In [None]:
# you can also visualise these markers onto your umap, try it with a few genes
# next to the clustering


In [None]:
# based on the two annotation approaches above, give a cell type label to every cluster.
# fill the dictionary below according to your needs
mapping_dict = {
    '0': 'strawberries',
    '1': 'blackberries', 
    '2': 'raisins',
    '3': 'smarties',
    '4': 'bertie botts \n beans',
    '5': 'MAOAM',
    '6': 'gummi bears',
    '7': 'blueberries',
    '8': 'raspberries',
    '9': 'grapes'
}
# replace 'leiden_res1' with the key of the clustering you chose (e.g. 'leiden_res0.5')
adata.obs['cell_type'] = adata.obs['leiden_res1'].map(mapping_dict).astype('category')
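
A common pitfall with dictionary-based annotation is a cluster that has no entry in the dictionary. The minimal sketch below, using invented cluster assignments and a hypothetical `toy_mapping`, shows that such clusters are silently mapped to missing values and how to detect them.

```python
import pandas as pd

# toy cluster assignments (invented); cluster '2' has no entry in the mapping
clusters = pd.Series(["0", "1", "2", "0"], name="leiden")
toy_mapping = {"0": "T cells", "1": "B cells"}

cell_type = clusters.map(toy_mapping)

# clusters absent from the dictionary are mapped to NaN
unmapped = sorted(set(clusters[cell_type.isna()]))
print("clusters without a label:", unmapped)
```

Running a check like this on your own mapping before plotting helps to avoid unlabelled cells in the UMAP.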

In [None]:
# plot a umap with cell type labels on top of each cluster


#### Questions
1. Did your chosen clustering resolution make sense? Were there some clusters which you would combine (annotate with the same label), or, conversely, clusters which you would like to split and analyse in more depth?

### Automated cell type annotation
There are several packages available for automated cell type annotation. One option is `celltypist`, which is available as a Python package, a command line tool or a web service (https://www.celltypist.org/). The online analysis accepts a .csv file containing an expression matrix with cells as rows and gene symbols as columns (or the opposite). A raw count matrix is expected. In this section, we will generate the file, upload it, and compare the celltypist results to ours.

In [None]:
# we will first put the required data into a dataframe and then export to csv
# .toarray() converts the sparse count matrix into a dense numpy array
celltypist_df = pd.DataFrame(data=adata.layers['counts'].toarray(), index=adata.obs_names, columns=adata.var_names)
celltypist_df.to_csv('celltypist_input.csv')

Now, please go to the celltypist website and upload your file (it should be around 100 MB); hopefully the WiFi will cooperate. After submitting the query, you will receive an email when the results are available for download (typically quite fast). In case you encounter problems with the data upload, you can also use the pre-generated celltypist results in the `materials/Day3` section of the course GitHub repository.

In [None]:
# import the results and match them to the dataframe
celltypist_results = pd.read_csv('../Day3/celltypist_predicted_labels.csv', index_col=0)
# inspect :)
celltypist_results
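
When adding external results to a per-cell table, it helps to know that pandas aligns on the index (here, the cell barcodes) rather than on row position. The sketch below demonstrates this with invented barcodes and labels; the column name `predicted_labels` follows the celltypist output mentioned in the questions below.

```python
import pandas as pd

# toy per-cell table (invented barcodes), standing in for adata.obs
obs = pd.DataFrame(index=["AAAC-1", "CCGT-1", "GGTA-1"])

# toy prediction table in a different row order and lacking one cell
results = pd.DataFrame(
    {"predicted_labels": ["B cells", "T cells"]},
    index=["GGTA-1", "AAAC-1"],
)

# assigning a Series aligns on the index, not on row position
obs["predicted_labels"] = results["predicted_labels"]
print(obs)
```

Cells that are absent from the results table receive a missing value, so it is worth checking for NaN entries after the assignment.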

In [None]:
# add the results to the anndata object


In [None]:
# plot two UMAPs, one coloured by your annotation and the other by the celltypist labels


#### Questions
1. How does your annotation compare to celltypist? Are there any differences? Who is right?
2. What is the difference between celltypist outputs 'predicted_labels' and 'majority_voting'?
3. There seem to be two populations each of naive and memory B cells. What is the difference between them?
4. If you have the time, run the processing steps neighbors/leiden/umap on the object normalized with Pearson residuals (adata2). Does the split still appear? Which data analysis strategy is better?