analysis/scRNAseq_workshop_2.Rmd

---
title: "scRNAseq_workshop_2"
output: html_document
---

# Annotating the cell types with large data set

We have gone through a basic Seurat workflow in the last section using 5k PBMC as an example. We have identified some marker genes for each cluster, and how do we assign each cluster a cell type? Usually, cell type assignment requires a lot of expert opinions based on the known biology of the cells being studied. Large single-cell consortiums such as The Human Cell Atalas (HCA) has produced a lot of data sets with a lot of cells for each tissue/organ and has annotated each cell type. A practical problem is that we have our own single-cell experiemnt done and want to know what are the cell types in our own data set when comparing to a reference data set e.g. in HCA?

Seurat V3 provide convinient functions to do that. For more details, read the paper: [Comprehensive Integration of Single-Cell Data](https://www.biorxiv.org/content/10.1101/460147v1) and [tutorial](https://satijalab.org/seurat/v3.0/integration.html)

Their method aims to first identify ‘anchors’ between pairs of datasets. These represent pairwise correspondences between individual cells (one in each dataset), that we hypothesize originate from the same biological state. These ‘anchors’ are then used to harmonize the datasets, or transfer information from one dataset to another.

For this example, we have a 10k PBMC data set (reference data set) which was annotated by the Seurat developing group. Let's annotate our 5k PMBC data with the reference.

**Transfer of cell type labels from a reference dataset onto a new query dataset**

go to the `Terminal` tab in your Rstudio:

### download 10k pbmc reference data

```{bash eval = FALSE}
cd data
mkdir pbmc10k
cd pbmc10k

curl -Lo pbmc_10k_v3.rds https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1 
# the size of the data
ls -sh
```


### read in the 10k pbmc data 

```{r}
library(Seurat)
library(tidyverse)
# this returns a seurat object
pbmc.10k<- readRDS("data/pbmc10k/pbmc_10k_v3.rds")
pbmc.10k
```

```{r}
pbmc.10k@meta.data %>% head()

## how many cells for each cell type? 14 cell types
table(pbmc.10k@meta.data$celltype)
# 10k cells
length(colnames(pbmc.10k))

p1<- DimPlot(pbmc.10k, group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("10k pbmc")

```

### plot the UMAP for two data sets side by side

```{r}
pbmc<- readRDS("data/pbmc5k/pbmc_5k_v3.rds")

p2<- DimPlot(pbmc, group.by = "seurat_clusters", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("5k pbmc")

CombinePlots(plots = list(p1, p2))
```

Now, we can identify anchors between the two dataset and use these anchors to transfer the celltype labels we learned from the 10K scRNA-seq data to the 5k pmbc cells.

```{r}
transfer.anchors <- FindTransferAnchors(reference = pbmc.10k, query = pbmc, features = VariableFeatures(object = pbmc.10k), 
    reference.assay = "RNA", query.assay = "RNA", reduction = "pcaproject")
```

**Note if transferring scRNAseq label to scATACseq data, set reduction = "cca" is recommended**

To transfer the cluster ids, we provide a vector of previously annotated cell type labels for the RNA to the refdata parameter. The output will contain a matrix with predictions and confidence scores for each 5k pbmc cell.


```{r}
celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.10k$celltype, 
    dims = 1:30)

head(celltype.predictions)

pbmc<- AddMetaData(pbmc, metadata = celltype.predictions)

pbmc@meta.data %>% head()

DimPlot(pbmc, group.by = "predicted.id", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("5k pbmc")
```

### confidence of the label transfering

```{r}
pbmc@meta.data$prediction.score.max %>%
  tibble::enframe() %>%
  ggplot(aes(x=value)) + 
  geom_histogram(color = "white", bins = 50) +
  geom_vline(xintercept = 0.9, linetype = 2, color = "red") + 
  theme_classic(base_size = 14) +
  ggtitle("distribution of max prediciton score")
```

Most of the cells receive a high prediction score (> 0.9).


# Assembly of multiple distinct scRNA-seq datasets into an integrated reference

There are times we want to merge multiple data sets to produce a master reference data. To do this,
we can identify anchors using the `FindIntegrationAnchors` function, which takes a list of Seurat objects as input. Here, We can combine the 5k and 10k data set into a 15k data set.  

Use `IntegrateData` when the data sets may have batch effect. (e.g. sequenced in different platforms, on different days).

Let's first check if those two data sets have batch effect or not.

```{r}
pbmc.merge<- merge(pbmc, pbmc.10k)
table(pbmc.merge@meta.data$orig.ident)
```

Investigate in a PCA plot 

```{r}
## using default for all steps
pbmc.merge<- pbmc.merge %>% 
  FindVariableFeatures() %>% 
  NormalizeData() %>% 
  ScaleData() %>% 
  RunPCA(verbose = FALSE)


DimPlot(pbmc.merge, group.by = "orig.ident")
```

The points are overlapping each other and make it difficult to read.
use `split.by` argument

```{r}
p1<- DimPlot(pbmc.merge, group.by = "orig.ident", split.by = "orig.ident")
p1
```

We see the cells in these two data sets mixed well in the PCA space and no batch effect was observed.

If we do see the cells are seprated by batch. we can do 

```{r}
pbmc.anchors <- FindIntegrationAnchors(object.list = list(pbmc.10k, pbmc), dims = 1:30)
pbmc.integrated <- IntegrateData(anchorset = pbmc.anchors, dims = 1:30)

DefaultAssay(pbmc.integrated) <- "integrated"

# Run the standard workflow for visualization and clustering
pbmc.integrated <- pbmc.integrated %>%
  ScaleData() %>% 
  RunPCA(verbose = FALSE) %>%
  RunUMAP(reduction = "pca", dims = 1:30)

DimPlot(pbmc.integrated, reduction = "umap", group.by = "orig.ident", split.by = "orig.ident")

```

Let's compare the PCA with the merged but not "integrated" data

```{r}
p2<- DimPlot(pbmc.integrated, reduction = "pca", group.by = "orig.ident", split.by = "orig.ident")

CombinePlots(plots = list(p1,p2), ncol = 1)
```

You may observe the subtle differences :)