## Introduction
In this demo, we will analyze single-cell RNA-seq data using Python on the Google Colab platform.


## Download data and install required libraries
The data contain 3,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor (provided by 10x Genomics)

In [None]:
!mkdir -p data
!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
!cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz

In [None]:
!ls -la data/filtered_gene_bc_matrices/hg19

The Google Colab environment does not include some of the required libraries, so we have to install them

Once the installation has finished, you have to restart the runtime to use the new libraries (there should be a *RESTART RUNTIME* button at the end of the output)

Then rerun the *import* commands below

In [None]:
!pip install scanpy leidenalg bbknn

import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

## Load single-cell data into Python

In [None]:
adata = sc.read_10x_mtx('data/filtered_gene_bc_matrices/hg19/',
                        var_names = 'gene_symbols',
                        cache = True)

adata.var_names_make_unique()  # ensure that all gene symbols are unique
print(adata)

Visualize highly expressed genes

In [None]:
sc.pl.highest_expr_genes(adata, n_top = 20)

## Quality filter
* Each cell must express at least 200 genes
* Each gene must be observed in at least 3 cells

In [None]:
sc.pp.filter_cells(adata, min_genes = 200)
sc.pp.filter_genes(adata, min_cells = 3)
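
To make the two thresholds concrete, here is a minimal NumPy sketch of the same idea on a made-up count matrix. The function name `filter_counts` and the toy values are assumptions for illustration only; Scanpy's own functions work on the `AnnData` object and also record the per-cell and per-gene counts.

```python
import numpy as np

# Toy count matrix (made-up values): rows are cells, columns are genes.
counts = np.array([
    [1, 0, 3, 0],
    [0, 0, 0, 2],
    [4, 1, 0, 1],
    [2, 0, 5, 0],
])

def filter_counts(counts, min_genes, min_cells):
    """Drop cells with too few detected genes, then genes seen in too few cells."""
    genes_per_cell = (counts > 0).sum(axis=1)
    keep_cells = genes_per_cell >= min_genes
    counts = counts[keep_cells]
    cells_per_gene = (counts > 0).sum(axis=0)
    keep_genes = cells_per_gene >= min_cells
    return counts[:, keep_genes], keep_cells, keep_genes

# Small thresholds so that the toy matrix is actually filtered.
filtered, keep_cells, keep_genes = filter_counts(counts, min_genes=2, min_cells=2)
print(filtered.shape)
```

The second cell expresses only one gene and is dropped, after which two genes are left in fewer than two cells and are dropped as well.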

Analyze proportion of *mitochondrial* gene expression (gene symbols beginning with *MT-*)

Remove cells with more than 5% mitochondrial expression

In [None]:
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars = ['mt'], percent_top = None, log1p = False, inplace = True)

sc.pl.scatter(adata, x = 'total_counts', y = 'pct_counts_mt')
adata = adata[adata.obs.pct_counts_mt < 5, :]
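
The `pct_counts_mt` value used above is simply the share of each cell's counts that come from *MT-* genes. The sketch below reproduces that calculation with NumPy on invented gene names and counts; it is an illustration of the metric, not Scanpy's implementation.

```python
import numpy as np

# Made-up gene symbols and counts (rows are cells, columns are genes).
gene_names = np.array(['MT-CO1', 'CST3', 'MT-ND1', 'NKG7'])
counts = np.array([
    [ 1, 50,  1, 48],
    [10, 40, 10, 40],
    [ 0, 30,  3, 67],
])

# Flag mitochondrial genes by their symbol prefix.
is_mt = np.char.startswith(gene_names, 'MT-')

# Percentage of each cell's total counts that is mitochondrial.
total_counts = counts.sum(axis=1)
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

# Keep cells below the 5% threshold.
keep = pct_counts_mt < 5
print(pct_counts_mt, keep)
```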

Remove cells with more than 2,500 genes detected (possibly doublets or multiplets, i.e. more than one cell captured together)

In [None]:
sc.pl.scatter(adata, x = 'total_counts', y = 'n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 2500, :]

## Data normalization and preprocessing
* Scale the total read count of each cell to 10,000
* Log-transform read counts


In [None]:
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
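
The two steps above amount to a per-cell rescaling followed by `log(1 + x)`. A minimal NumPy sketch on made-up counts, assuming every cell has a non-zero total (the helper name `normalize_log1p` is hypothetical):

```python
import numpy as np

# Two toy cells with very different sequencing depth (made-up values).
counts = np.array([
    [10.0, 30.0, 60.0],
    [ 1.0,  1.0,  2.0],
])

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    totals = counts.sum(axis=1, keepdims=True)
    scaled = counts / totals * target_sum
    return np.log1p(scaled)

lognorm = normalize_log1p(counts)

# Undoing the log shows that every cell now sums to 10,000.
print(np.expm1(lognorm).sum(axis=1))
```

After scaling, the two cells are comparable even though the first one was sequenced 25 times deeper.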

Identify highly variable genes (high variance across cells), as they are likely to be cell type-specific markers

Note the effect of normalizing the dispersions (left panel) compared to the non-normalized dispersions (right panel)

In [None]:
sc.pp.highly_variable_genes(adata, min_mean = 0.0125, max_mean = 3, min_disp = 0.5)
sc.pl.highly_variable_genes(adata)

adata.raw = adata
adata = adata[:, adata.var.highly_variable]

Remove biases from sequencing depth and mitochondrial expression (linear effect model)

*Regress out* = use regression to identify the effect sizes and then subtract them from the expression data

In [None]:
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
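
To illustrate what regressing out means, the sketch below fits an ordinary least-squares model for a single simulated gene and keeps only the residuals. All data are simulated and the helper `regress_out` is a hypothetical stand-in; Scanpy fits its own model gene by gene, so treat this as the idea rather than the exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200

# Simulated covariates and one gene whose expression depends on them linearly.
total_counts = rng.uniform(1000, 5000, n_cells)
pct_mt = rng.uniform(0, 5, n_cells)
expr = 0.001 * total_counts + 0.2 * pct_mt + rng.normal(0, 0.1, n_cells)

def regress_out(y, covariates):
    """Fit y ~ intercept + covariates by least squares and return the residuals."""
    X = np.column_stack([np.ones(len(y))] + list(covariates))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

residual = regress_out(expr, [total_counts, pct_mt])

# The residuals are no longer correlated with either covariate.
print(np.corrcoef(residual, total_counts)[0, 1])
print(np.corrcoef(residual, pct_mt)[0, 1])
```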

## Visualize data on 2D
First, using Principal Component Analysis (PCA, linear dimensionality reduction)

Coloring the cells by [*CST3*](https://www.genecards.org/cgi-bin/carddisp.pl?gene=CST3) expression level shows that the cells are well ordered along the principal components

In [None]:
sc.pp.scale(adata, max_value = 10)
sc.tl.pca(adata, svd_solver = 'arpack')
sc.pl.pca(adata, color = 'CST3')
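
For readers who want to see what PCA does underneath, here is a small sketch using a singular value decomposition on random data. The function `pca` and the data are illustrative assumptions; `sc.tl.pca` uses its own solver (here `arpack`) and stores the results in the `AnnData` object.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 toy cells, 5 toy genes
X[:, 0] *= 5                    # give one gene a much larger variance

def pca(X, n_components):
    """Center the data, decompose it with SVD, and project onto the top components."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, Vt[:n_components], explained_variance

scores, components, explained_variance = pca(X, n_components=2)
print(scores.shape, explained_variance)
```

The first component captures most of the variance because it aligns with the inflated gene.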

Multi-step dimensionality reduction
* Perform PCA (linear dimensionality reduction)
* Compute the *neighbor network* using only the first 40 dimensions from PCA
* Cluster cells using the *Leiden* algorithm
* Perform UMAP (non-linear dimensionality reduction)

Labeling with CST3, [NKG7](https://www.genecards.org/cgi-bin/carddisp.pl?gene=NKG7), and [PPBP](https://www.genecards.org/cgi-bin/carddisp.pl?gene=PPBP) expression levels shows that each cell cluster on the UMAP is associated with some marker genes

In [None]:
sc.pp.neighbors(adata, n_neighbors = 10, n_pcs = 40)
sc.tl.leiden(adata)
sc.tl.paga(adata)
sc.pl.paga(adata, plot = False)
sc.tl.umap(adata, init_pos = 'paga')
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7', 'PPBP'])
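
The neighbor network is the basis for both clustering and UMAP. As a rough illustration, the sketch below finds the k nearest neighbors of each point by brute force on made-up 2D coordinates. The helper `knn_indices` is hypothetical; Scanpy works on the PCA coordinates and also computes edge weights, which this sketch leaves out.

```python
import numpy as np

# Two well-separated toy groups of three points each (made-up coordinates).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

def knn_indices(points, k):
    """Return the indices of the k nearest neighbors of every point (brute force)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

neighbors = knn_indices(points, k=2)
print(neighbors)
```

Each point only connects to points from its own group, which is why graph-based clustering can separate the two groups.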

## Finding marker genes
For each cluster identified, perform a differential expression test between that cluster and the rest of the cells using the *Wilcoxon* rank-sum test

Show the top 25 genes along with their scores (*Wilcoxon* test statistics)

* CST3 shows up as a marker for cluster 1
* NKG7 shows up as a marker for clusters 4 and 5
* PPBP shows up as a marker for cluster 10

In [None]:
sc.tl.rank_genes_groups(adata, 'leiden', method = 'wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes = 25, sharey = False)
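
To give a feeling for the score, the following sketch computes a rank-sum z-score for one gene from the Mann-Whitney U statistic with a normal approximation. The expression values are invented and the function `rank_sum_z` is a hypothetical helper that applies no tie correction to the variance, so its numbers need not match Scanpy's output exactly.

```python
import numpy as np

def rank_sum_z(x, y):
    """z-score of the Mann-Whitney U statistic (normal approximation, no tie correction)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    # U counts the pairs in which the value from x exceeds the value from y.
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    u = greater + 0.5 * ties
    mean_u = n1 * n2 / 2
    sd_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u

# Made-up expression of one gene in a cluster and in all remaining cells.
cluster = [5.1, 4.8, 6.0, 5.5]
rest = [1.0, 0.5, 2.2, 1.7, 0.9]

z = rank_sum_z(cluster, rest)
print(z)
```

A large positive score means the gene tends to be higher in the cluster than in the rest of the cells; swapping the two groups flips the sign.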

View the expression of the top 8 marker genes for cluster 1

In [None]:
sc.pl.rank_genes_groups_violin(adata, groups = '1', n_genes = 8)

Compare gene expression across clusters

In [None]:
sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby = 'leiden')

Summarize marker genes using *dot plot*
* Color indicates expression level
* Circle size indicates the percentage of cells expressing that marker

In [None]:
marker_genes = ['IL7R', 'CST3', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'CD79A', 'MS4A1', 
                'CD8A', 'CD8B', 'NKG7', 'GNLY', 
                'FCGR3A', 'MS4A7', 'FCER1A', 'PTP4A3',
                'ISG15', 'IFI6', 'PPBP']

sc.pl.dotplot(adata, marker_genes, groupby = 'leiden', vmax = 5);
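
The two quantities encoded in a dot plot can be computed by hand. The sketch below does this for one made-up gene and two made-up clusters; `dotplot_stats` is a hypothetical helper, and treating any value above zero as "expressed" is an assumption of this sketch.

```python
import numpy as np

# Made-up expression of one gene in six cells, with their cluster labels.
expr = np.array([0.0, 2.0, 4.0, 0.0, 0.0, 1.0])
cluster = np.array(['0', '0', '0', '1', '1', '1'])

def dotplot_stats(expr, cluster):
    """Per cluster: (mean expression, fraction of cells with expression > 0)."""
    stats = {}
    for label in np.unique(cluster):
        values = expr[cluster == label]
        stats[str(label)] = (float(values.mean()), float((values > 0).mean()))
    return stats

stats = dotplot_stats(expr, cluster)
for label, (mean_expr, frac) in stats.items():
    print(label, mean_expr, frac)
```

In the plot, the first number would set the color of the dot and the second its size.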

## Data integration
We will combine two PBMC datasets

In [None]:
adata_ref = sc.datasets.pbmc3k_processed()  # This is an annotated version of the PBMC dataset above
adata = sc.datasets.pbmc68k_reduced()

Identify overlapping gene sets

In [None]:
var_names = adata_ref.var_names.intersection(adata.var_names)
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]
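
Restricting both datasets to shared genes matters because the columns must refer to the same genes in the same order before the datasets can be compared. A small NumPy sketch with made-up gene lists and matrices (all names and values here are illustrative):

```python
import numpy as np

# Made-up gene lists and expression matrices for two datasets.
ref_genes = np.array(['CST3', 'NKG7', 'PPBP', 'LYZ'])
new_genes = np.array(['LYZ', 'CD8A', 'CST3'])
ref_X = np.arange(8).reshape(2, 4)
new_X = np.arange(6).reshape(2, 3)

# Genes present in both datasets (np.intersect1d returns them sorted).
shared = np.intersect1d(ref_genes, new_genes)

# Column positions of the shared genes in each dataset, in the same gene order.
ref_idx = [ref_genes.tolist().index(g) for g in shared]
new_idx = [new_genes.tolist().index(g) for g in shared]

ref_sub = ref_X[:, ref_idx]
new_sub = new_X[:, new_idx]
print(shared, ref_sub.shape, new_sub.shape)
```

After subsetting, column 0 of both matrices is CST3 and column 1 is LYZ, even though the genes were stored in a different order originally.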

Visualize data before integration using UMAP

In [None]:
adata_concat = adata_ref.concatenate(adata, batch_categories = ['reference', 'new batch'])
sc.pl.umap(adata_concat, color=['batch'])

Perform dimensionality reduction and clustering on the reference dataset (same as earlier)

In [None]:
sc.tl.pca(adata_ref, svd_solver = 'arpack')
sc.pp.neighbors(adata_ref, n_neighbors = 10, n_pcs = 40)
sc.tl.leiden(adata_ref)
sc.tl.paga(adata_ref)
sc.pl.paga(adata_ref, plot = False)
sc.tl.umap(adata_ref, init_pos = 'paga')

### One-way mapping

In [None]:
sc.tl.ingest(adata, adata_ref, obs = 'leiden')
adata_concat = adata_ref.concatenate(adata, batch_categories = ['reference', 'new batch'])
sc.pl.umap(adata_concat, color=['batch'])

### Batch-balanced k-nearest neighbors (BBKNN)

In [None]:
sc.tl.pca(adata_concat)
sc.external.pp.bbknn(adata_concat, batch_key = 'batch')
sc.tl.umap(adata_concat)
sc.pl.umap(adata_concat, color = ['batch', 'leiden'], wspace = 0.3)

## Another data integration example
Pancreas tissue sample with 14,693 cells

In [None]:
adata_all = sc.read('data/pancreas.h5ad', backup_url='https://www.dropbox.com/s/qj1jlm9w10wmt0u/pancreas.h5ad?dl=1')
print('number of cells:', adata_all.shape[0])

Remove minority cell types

In [None]:
counts = adata_all.obs.celltype.value_counts()
counts

In [None]:
minority_classes = counts.index[-10:].tolist()
minority_classes.append('not applicable')

adata_all = adata_all[~adata_all.obs.celltype.isin(minority_classes)]
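
The selection above relies on the counts being sorted from most to least frequent, so that the last entries are the rarest cell types. The same logic is sketched below with the standard library on made-up labels; only one rare type is dropped here because the toy data contain so few categories.

```python
from collections import Counter

# Made-up cell type labels.
celltypes = (['alpha'] * 5 + ['beta'] * 4 + ['delta'] * 1
             + ['not applicable'] * 2)

# most_common() sorts from most to least frequent, like value_counts().
counts = Counter(celltypes).most_common()

# Take the rarest type from the end of the list and add the placeholder label.
minority_classes = [name for name, _ in counts[-1:]]
minority_classes.append('not applicable')

kept = [c for c in celltypes if c not in minority_classes]
print(minority_classes, len(kept))
```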

Before integration

In [None]:
sc.pp.pca(adata_all)
sc.pp.neighbors(adata_all)
sc.tl.umap(adata_all)
sc.pl.umap(adata_all, color = ['batch', 'celltype'], palette = sc.pl.palettes.vega_20_scanpy)

Integration using batch-balanced k-nearest neighbors (BBKNN)

In [None]:
sc.external.pp.bbknn(adata_all, batch_key = 'batch')
sc.tl.umap(adata_all)
sc.pl.umap(adata_all, color = ['batch', 'celltype'])

Visualize distribution of batches

In [None]:
for batch in ['1', '2', '3']:
    sc.pl.umap(adata_all, color = 'batch', groups = [batch])

## Trajectory inference
2,700 cells from myeloid and erythroid differentiation

Use a *diffusion map* to calculate transition probabilities between cells

In [None]:
adata = sc.datasets.paul15()

sc.pp.recipe_zheng17(adata)
sc.tl.pca(adata, svd_solver = 'arpack')
sc.pp.neighbors(adata, n_neighbors = 4, n_pcs = 20)
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors = 10, use_rep = 'X_diffmap')

sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color = 'paul15_clusters', legend_loc = 'on data')
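
The idea of transition probabilities between cells can be shown with a deliberately simplified sketch: a Gaussian kernel on pairwise distances, normalized so that each row sums to one. The points, the kernel width `sigma`, and the function `transition_matrix` are assumptions for illustration; Scanpy builds its diffusion map from the neighbor graph with additional normalization steps.

```python
import numpy as np

# Four toy cells placed along a line (made-up coordinates).
points = np.array([[0.0], [1.0], [2.0], [3.0]])

def transition_matrix(points, sigma=1.0):
    """Row-normalized Gaussian kernel: entry (i, j) is the probability of moving from i to j."""
    sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2 * sigma ** 2))
    np.fill_diagonal(kernel, 0.0)   # no self-transitions in this sketch
    return kernel / kernel.sum(axis=1, keepdims=True)

T = transition_matrix(points)
print(T.round(3))
```

A random walk following these probabilities mostly moves between adjacent cells, so distant cells are reached only through intermediate ones, which is what lets a diffusion map recover an ordering along a trajectory.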

### Cluster-level trajectory
Reconstruct the cluster-to-cluster trajectory outline using the [PAGA](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1663-x) algorithm

In [None]:
sc.tl.leiden(adata, resolution = 1.0)
sc.tl.paga(adata, groups = 'leiden')
sc.pl.paga(adata, color = ['leiden', 'Hba-a2', 'Elane'])
sc.pl.paga(adata, color=['Prss34', 'Itga2b', 'Cma1'])

### Cell-level trajectory
Use cluster-level connectivity as a template for placing individual cells

In [None]:
sc.tl.draw_graph(adata, init_pos = 'paga')
sc.pl.draw_graph(adata, color = ['Elane', 'Itga2b', 'Hba-a2'])

In [None]:
sc.pl.paga_compare(
    adata, threshold = 0.03, title='', right_margin = 0.2, size = 10, edge_width_scale = 0.5,
    legend_fontsize = 12, fontsize = 12, frameon = False, edges = True)

### Estimate pseudotime
Use annotated stem cells as the origin (leiden cluster 2)

In [None]:
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden']  == '2')[0]
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color = ['dpt_pseudotime'])

The diffusion map simplifies the developmental trajectory into a nearly straight line

In [None]:
sc.pl.diffmap(adata[~np.isinf(adata.obs['dpt_pseudotime']), :], color = 'dpt_pseudotime')
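
Two NumPy idioms carry the pseudotime cells above: `np.flatnonzero(...)[0]` picks the first cell of the chosen cluster as the root, and `~np.isinf(...)` drops cells whose pseudotime is infinite because they cannot be reached from the root. A toy sketch with made-up labels and values:

```python
import numpy as np

# Made-up cluster labels and pseudotime values for five cells.
leiden = np.array(['0', '2', '1', '2', '0'])
pseudotime = np.array([0.4, 0.0, np.inf, 0.1, 1.0])

# Index of the first cell that belongs to cluster '2' (used as the root cell).
iroot = np.flatnonzero(leiden == '2')[0]

# Keep only cells with a finite pseudotime.
reachable = ~np.isinf(pseudotime)
print(iroot, pseudotime[reachable])
```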