vignettes/single_basic_tutorial.Rmd

---
title: "A basic overview of single data analysis"
output:
  html_document:
  theme: united
df_print: kable
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
---
***

```{r setup, include=FALSE}
all_times <- list()  # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      now <<- Sys.time()
    } else {
      res <- difftime(Sys.time(), now, units = "secs")
      all_times[[options$label]] <<- res
    }
  }
}))
knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(width.cutoff = 95),
  message = FALSE,
  warning = FALSE,
  time_it = TRUE
)
```

# Overview
This tutorial demonstrates how to use `SRTpipeline` (>=0.1.0) to analyze a spatially-resolved transcriptomics (SRT) data. This tutorial will cover the following tasks, which we believe will be common for many SRT data analyses:

* Normalization
* Feature selection, including detecting spatially-variable features
* Dimensional reduction
* Clustering
* Visualization
* DEG analysis
* Trajectory inference


# Prepare SRTProject

For this tutorial, we will introduce how to create a `SRTProject` object with single SRT sample using SRTpipeline that includes an introduction to common analytical workflows for a single data batch.
Here, we will be taking  a spatial transcriptomics dataset (SampleID: 151672) for human dorsolateral prefrontal cortex (DLPFC) as an example. 
There are 4015  spots and 33538 genes that were sequenced on the 10x Visium platform. Our preprocessed data can be downloaded [here](https://github.com/feiyoung/PRECAST/blob/main/vignettes_data/dlpfc_151672.rda?raw=true), and the raw data can be found [here](https://github.com/LieberInstitute/spatialLIBD).


First, we  loaded the required packages with the command:
```{r init}
library(SRTpipeline)


```

Next, we  download the data to the current working path for the followed analysis by the following command:
```{r eval = FALSE}
githubURL <- "https://github.com/feiyoung/PRECAST/blob/main/vignettes_data/dlpfc_151672.rda?raw=true"
#download.file(githubURL,"dlpfc_151672.rda",mode='wb')
```
Then load to R. The data is saved in a Seurat object named `dlpfc_151672`.
```{r  eval = FALSE}
load("dlpfc_151672.rda")
```

<details>
 <summary>**How to read the raw sptial transcriptomics data?**</summary>
In addition, if users' data are in the raw form from spaceranger.
We start by reading in the data. The `read10XVisium()` function reads in the output of the [spaceranger](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/what-is-space-ranger) pipeline from 10X, returning a unique molecular identified (UMI) count matrix and a spatial coordinates file. The values in count matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each spot (column). The spatial information of all spots is represented by the spatial coordinates file.

We next use the count matrix and spatial coordinates to create a `SRTProject` object. The object serves as a container that contains both data (like the count matrix, spatial coordinates) and analysis (like PCA, or clustering results) for a spatial transcriptomics dataset. For a technical discussion of the `SRTProject` object structure, check the '?SRTProject-class'.
</details>

<details>
  <summary>**What does data in a count matrix look like?**</summary>

```{r eval = FALSE}
# Let's examine a few genes  in the first five spots
count <- dlpfc_151672[["RNA"]]@counts
count[1:5, 1:5]
```

The `.` values in the matrix represent 0s (no molecules detected). Since most values in an count matrix are 0,  a sparse-matrix representation is necessary to save  memory and speed savings for  data.

````{r eval = FALSE}
meta_data <- dlpfc_151672@meta.data
head(meta_data[,1:5])
```
</details>

We check the data by printing it. We can see there are 4015 spots and 33538 genes.
```{r  eval = FALSE}
dlpfc_151672
```
Check the meta data: each row contains the meta information of a spot with row name such as `AAACAAGTATCTCCCA-1`, which consists of the barcode, to identify each spot in RNA sequencing, separator `-` and a number `1`; note that the columns, `row`, `col`, `imagerow` and `imagecol`, save the spatial information of each spot, in wihch `row` and `col` are the location of a spot on the slide and will be used in spatial analysis, `imagerow` and `imagecol` are the location of a spot on the H&E image; there is the manual annotations `layer_guess_reordered` based on the cytoarchitecture in the original study (Maynard et al., 2021).
```{r  eval = FALSE}
head(dlpfc_151672@meta.data)
```
Check the manual annotations: this slice is annotated as five layers: Layer 3-6 and White matter. 
```{r  eval = FALSE}
table(dlpfc_151672@meta.data$layer_guess_reordered)
```


# Pre-processing workflow

## Create a SRTProject object

First, we show how to create a SRTProject object step by step.

Because the gene name is Ensembl format in this data, we change the genes' Ensembl name to symbols for convenience of followed analyses.
```{r  eval = FALSE}
count_matrix <- dlpfc_151672[['RNA']]@counts
## Use eg.db database: this method is fast
symbol_name <- transferGeneNames(row.names(count_matrix), now_name = "ensembl", to_name="symbol", species="Human", Method='eg.db')
symbol_name[1:10]
row.names(count_matrix) <- symbol_name
```


```{r  eval = FALSE}
## create count matrix list: note each component has a name, i.e., `ID151672`.

cntList <- list(ID151672=count_matrix)

## create spatial coordinate matrix 
coordList <- list(cbind(dlpfc_151672$row, dlpfc_151672$col))

## create metadata list
meta.data <- dlpfc_151672@meta.data
metadataList <- list(meta.data)

## create meta data for each data batches. Here we only have one data batch.
sampleMetadata <- data.frame(species="Human", tissues=c('DLPFC'))
row.names(sampleMetadata) <- names(cntList)

## Name of this project
projectName <- "DLPFC151672"

rm(dlpfc_151672)
```

Next, we start creating `SRTProject` object. We can print the basic information of this object, including  three parts. The first part has the class of this object, outputPath of data that require to output, h5filePath that save the memory-cusuming data (i.e., count, logcount, ...). The second part is about the datasets basic information, such as how many data batches(sample) and the data names, sample meta data (sampleColData) and meta data for each spot (cellMetaData). The last part is about downstream analyses information (that is empty) when this object created.
```{r  eval = FALSE}
SRTProj <- CreateSRTProject(cntList, coordList, projectName =projectName, metadataList,
    sampleMetadata, force=F)
SRTProj
```   

<details>
  <summary>**What does data in a h5 file look like?**</summary>

```{r  eval = FALSE}
h5ls(SRTProj@projectMetadata$h5filePath)
```

</details>


## Normalizing the data
After removing unwanted cells and genes from the dataset, the next step is to normalize the data.  To save RAM memory, normalized values are stored in disk as a `h5file`.
```{r  eval = FALSE}
SRTProj <- normalizeSRT(SRTProj)
```

### Feature selection
We next select a subset of genes that exhibit high spot-to-spot variation in the dataset (i.e, they are highly expressed in some spots, and lowly expressed in others). It has been  found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets [here](https://www.nature.com/articles/nmeth.2645). 

Then we choose variable features. The default number of variable features is 2,000, and users can change it using argument `nfeatures`. The default type is highly variable genes (HVGs), but users can use spatially variable genes by seting `type='SVGs'`, then `SPARK-X` will be used to choose SVGs.

#### Choose highly variable features
We can also directly use `selectVariableFeatures()` function to choose the top highly variable genes (HVGs). We found the performance of using SVGs or HVGs has no significant difference for the downstream analyses, such as dimension reduction and spatial clustering.

```{r  eval = FALSE}
SRTProj <- selectVariableFeatures(SRTProj)
```
#### Choose spatially variable features
Additionally, `selectVariableFeatures()` function can also  choose the top spatially variable genes (SVGs) based on SPARK-X. SPARK-X is embeded in the `DR.SC` R package and is implemented in the `FindSVGs()` function. By default, we return 2,000 genes per dataset.  These will be used in downstream analysis, like probabilistic embeddings and spatial clustering. To further speed up the computation, we first use `FindVariableFeatures()` to select 5,000 highly variable genes, then use SPARK-X to choose top SVGs in the  implementation of `FindSVGs()` function. 

```{r  eval =FALSE}
SRTProj <- selectVariableFeatures(SRTProj, type='SVGs', method='spark-x')
```

Calculate the adjcence matrix
```{r  eval = FALSE}
## Obtain adjacence matrix
SRTProj <- AddAdj(SRTProj, platform = "Visium", force=T)

```
Visualize
```{r, eval = FALSE}
selectFeatures <- row.names(SRTProj@geneMetaData[which(SRTProj@geneMetaData$`isVGs#combine`),])[1:2]
EachExprSpaHeatMap(SRTProj, features=selectFeatures, title_name=T, quantVec = c(0.90,0.90))
```


# Dimension reduction

PCA is the most popular dimension reduction technique in single cell RNA sequencing (scRNA-seq) data and SRT　data because of its simplicity, computational efficiency, and relatively comparable performance. SRTpipeline provided three versions of PCA: standard PCA (PCA), approximated PCA (APCA),  and weighted PCA (WPCA) by the function `AddPCA` with default version as APCA for fastest computation. The 15-dimensional PCs are extracted by default, and users can use their own values. 

After running `AddPCA`, we would see the output of `SRTProj` includes `PCA` in the `Low-dimensional embeddings` field.
```{r, eval = FALSE}
## APCA
SRTProj <- AddPCA(SRTProj, n_comp=15)
#accurate PCA

#SRTProj <- AddPCA(SRTProj, Method='PCA')

#weighted PCA

# SRTProj <- AddPCA(SRTProj, Method='WPCA')
SRTProj
```

# Clustering using SC-MEB

Some SRT clustering methods use Markov random field to model  clusters of spots, such as SC-MEB and DR-SC. These approaches work extremely well for data with spatial smoothness and are a standard practice in SRT data. For this reason, SRTpipeline uses existing state-of-the-art clustering methods from SRT packages for clustering.

We have had the most success using the  Markov random field model implemented by SC-MEB. In SRTpipeline, SC-MEB clustering is performed using the ` Cluster_SCMEB()` function.
```{r, eval = FALSE}
SRTProj <- Cluster_SCMEB(SRTProj, K= 5, reduction = "PCA")
SRTProj
```


## Visualization
To run tSNE in SRTpipeline we use the AddTSNE() function:
```{r, eval = FALSE}
SRTProj <- AddTSNE(SRTProj, n_comp = 2, reduction='PCA')
```
The `reduction` tells `AddTSNE()` function uses specified reduction in the `SRTProj@reductions` slot.

To plot the two-dimensional tSNE results, we use the `EmbedPlot()` function and pass the name of the tSNE embedding we just generated (“tSNE”). We can tell SRTpipeline how to color the spots by using `item` argument which tells SRTpipeline which matrix to use to find the specified metadata column provided to name.

`SRTpipeline` package oﬀers a variety of visualization tools.  We
visualized the inferred domain types on  embeddings  using two
components from tSNE. The tSNE plot showed the domain clusters were well segregated.
```{r, eval = FALSE}
p_pca_tsne2 <- EmbedPlot(SRTProj, item='cluster', plotEmbeddings = 'tSNE', cols=chooseColors(n_colors = 5), legend.position='right')
p_pca_tsne2
```

To run UMAP in SRTpipeline we use the AddUMAP() function. Frist, we evaluate the two-dimensional UMAPs.
```{r, eval = FALSE}
SRTProj <- AddUMAP(SRTProj, n_comp = 2, reduction='PCA')
p_pca_umap2 <- EmbedPlot(SRTProj, item='cluster', plotEmbeddings = 'UMAP', cols=chooseColors(n_colors = 5), legend.position='right')
p_pca_umap2
```

Next, we merge all plots into one figure.
```{r, eval = FALSE}
library(patchwork)
p_all <- p_pca_tsne2 +p_pca_umap2 + plot_layout(nrow=1, ncol=2)
p_all
```

To save the plot, we can use `write_fig()` function.
```{r, eval = FALSE}
write_fig(p_all, filename = 'SCMEB_plots.png')
```


Except for the embedding plots, SRTpipeline also provides a variaty of visualization functions.
First, we visualize the spatial distribution of cluster labels that shows the layer structure.
```{r, eval = FALSE}
## choose colors to function chooseColors
cols <- chooseColors(n_colors = 5)
EachClusterSpaHeatMap(SRTProj, cols=cols, legend.position='right',base_size=16)
# remove the border
# EachClusterSpaHeatMap(SRTProj, cols=cols, legend.position='right',base_size=16, border_col='white')
```

We plotted the heatmap of Pearson’s correlation coefcients of the PCA embeddings among
the detected domains shows the good separation of the estimated embeddings across
domains and the correlations between deeper layers were high, e.g., there were high
correlations between domain 2 and 3, while correlations among the separated layers were low, i.e., domain 1 and 4.

```{r, eval = FALSE}
CCHeatMap(SRTProj, reduction = 'PCA', grp_color=cols)
```

After adding the quantities for data visualization, the `SRTProject` object will have more information in the downstream analyses information.
Now, we print this `SRTProject` object to check it. We observed two components added in the slot `plotEmbeddings` (Embeddings for plotting): `tSNE`, `tSNE3`, `UMAP` and `UMAP3`.
```{r, eval = FALSE}
SRTProj
```


# DEG analysis
To do downstream analyses, we require to get the count data from h5file. The function `getGeneSpotData()` access the gene-by-spot matrix and other data in `SRTProj` and then return a `SpatialExperiment` object,  including two assays: counts, logcounts; rowData: from `SRTProj@geneMetaData`; colData: from `SRTProj@cellMetaData`, `SRTProj@clusters` and sample_id; reducedDim: from `SRTProj@reductions`.
```{r, eval = FALSE}
spe <- getGeneSpotData(SRTProj)
spe
```
After obtain the spatial cluster labels using a clustering model, we can perform differentially expression analysis. First, we detect the DE genes for cluster 3.
```{r, eval = FALSE}
dat_deg <- FindDEGs(spe, cluster1='3')
subset(dat_deg, avg_log2FC >0.25)
```
We perform differential expression analysis for all clusters by using `FindAllDEGs()` function, then the DE genes' information is saved in a data.frame object `dat_degs`. 
```{r, eval = FALSE}
dat_degs <- FindAllDEGs(spe)
dat_degs
```
We identify the significant DE genes by two criteria: (a) adjustd p-value less than 0.01 and (b) average log fold change greater than 0.25.

```{r, eval = FALSE}
degs_sig <- subset(dat_degs, p_val_adj < 0.01 & avg_log2FC>0.25)
degs_sig
```

In the following, we perform gene set enrichment analysis for the DE genes of each  Domain identified by SC-MEB model using R package `gprofiler2`. 
```{r, eval = FALSE}
library(gprofiler2)
termList <- list()
for(k in 1: 5){
  # k <- 1
  cat("k = ", k, '\n')
  dat_degs_sub <- subset(degs_sig,  cluster==k)
  
  que1 <- dat_degs_sub$gene
  gostres <- gost(query = que1, organism = "hsapiens", correction_method='fdr')
  termList[[k]] <- gostres
}
head(termList[[1]]$result)
```

 To understand the functions of the identified spatial domains by SC-MEB model, we compare the top significant biological process (BP)  pathways in GO database for the DE genes from Domain 1 and 2. Here, we only show to visualize the significant BP pathways and users can explore the other databases such as KEGG and HPA.
 

```{r, eval = FALSE}
## Most commonly used databases
source_set <- c("GO:BP","GO:CC", "GO:MF",   'KEGG', "HPA")
cols <- c("steelblue3", "goldenrod", "brown3", "#f98866", "#CE6DBD")
## Here, we show GO:BP
source1 <- "GO:BP"
ss <- which(source_set==source1)
ntop = 5
names(cols) <- source_set
pList_enrich <- list()
for(ii in 1: 5){
  ## ii <- 3
  message("ii=", ii)
  gostres2 <- termList[[ii]]
  dat1 <- subset(gostres2$result, term_size< 500)
  dat1 <- get_top_pathway(dat1, ntop=ntop, source_set = source1)
  dat1 <- dat1[complete.cases(dat1),]
  dat1$nlog10P <- -log10(dat1$p_value)
 
  pList_enrich[[ii]] <-  barPlot_enrich(dat1[order(dat1$nlog10P),], source='source', 'term_name',
                                        'nlog10P', cols=cols[source_set[ss]],base_size=14) + ylab("-log10(p-adj)")+
    xlab("Biological terms") + ggtitle(paste0("Domain", ii))
}

drawFigs(pList_enrich[c(1,2)], layout.dim = c(2,1), common.legend = T, align='hv')

```

We take out the top DE genes for each cluster for visualization. 
```{r, eval = FALSE}

library(dplyr)
n <- 3
dat_degs %>% as.data.frame %>%
  group_by(cluster) %>%
  top_n(n = n, wt = avg_log2FC) -> topGene
topGene
```
We visualize the DE genes for each cluster group by gene-by-cell heatmap using the `GCHeatMap()` function.
```{r, eval = FALSE}
cols_cluster <- chooseColors(n_colors = 5)

p1 <- EachGCHeatMap(spe, features=topGene$gene, grp_color = cols_cluster ,y_text_size=12)
p1
```


# Trajectory inference
Next, we performed trajectory inference using the PCA embeddings and domain labels estimated by SC-MEB model. The `EmbedPlot()` function can be used to visualize the inferred pseudotime on a specified embedding. If there is only one data batch, the function `EachEmbedPlot()` plays the same role as function `EmbedPlot()`.
```{r, eval = FALSE}
spe <- AddTrajectory(spe, reduction = 'PCA')
p1 <- EmbedPlot(spe, reduction = 'PCA' ,colour_by='PT')
p2 <- EmbedPlot(spe, reduction = 'Coord' ,colour_by='PT')
drawFigs(list(p1, p2),layout.dim = c(1,2),common.legend = TRUE, 
         legend.position = 'right')
# EachEmbedPlot(spe, reduction = 'PCA', colour_by='TrajPT')
# save(spe, SRTProj, file=paste0(SRTProj@projectMetadata$outputPath,"/SRTProj.rds"))
# load("F:/Research paper/IntegrateDRcluster/AnalysisCode/SRTpipeline/vignettes/DLPFC151672/SRTProj.rds")
```


# Other downstream analyses


<details>
<summary>**Session Info**</summary>
```{r}
sessionInfo()
```
</details>