We'll start with some setup magic that makes torch play nicely with R on Talapas.

In [None]:
Sys.setenv(TORCH_HOME='/gpfs/projects/datascience/shared/R/torch/lantern/build/libtorch')
libdir='/gpfs/projects/datascience/shared/R/Data4ML'
.libPaths(libdir)
Sys.setenv(R_LIBS = paste(libdir, Sys.getenv("R_LIBS"), sep=.Platform$path.sep))

#  Unsupervised Deep Learning for Single Cell Data

All the principles of unsupervised learning still apply when you use deep unsupervised techniques. Implementing these models from scratch can be pretty involved, but there are packages that implement some of the algorithms for you.

1. We'll run the demo of scCAN, an auto-encoder-based tool
2. We'll combine this method with Seurat so we can use its standard tools


# Libraries
We'll use UMAP from the uwot package, scCAN, and Seurat.

In [None]:
library('ggplot2')
library('uwot')
library("scCAN")
data("SCE")
library("Seurat")

In [None]:
data <- t(SCE$data)
head(data)

max(data)
min(data)
mean(data)

# Normalization

As a reminder, these algorithms decide what's important based on distances. If you're not careful, single-cell data will be dominated by the genes with the largest counts. To help deal with this, it's common to log-normalize your data and scale it.
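To make the point about distances concrete, here is a minimal sketch in base R. The counts are made up for illustration, and `share_from_gene` is a hypothetical helper, not part of scCAN or Seurat. It measures how much of the squared Euclidean distance between two cells comes from a single high-count gene, before and after a log transform.

```r
# Made-up counts: 3 cells (rows) x 3 genes (columns).
counts <- matrix(
  c(5000, 1, 0,
    5200, 0, 9,
    5100, 8, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), c("high_gene", "geneA", "geneB"))
)

# Hypothetical helper: fraction of the squared Euclidean distance
# between cells i and j that is contributed by one gene.
share_from_gene <- function(m, i, j, gene) {
  sq_diff <- (m[i, ] - m[j, ])^2
  unname(sq_diff[gene] / sum(sq_diff))
}

# On raw counts, almost all of the distance comes from high_gene.
raw_share <- share_from_gene(counts, "cell1", "cell2", "high_gene")

# After log(x + 1), the low-count genes carry most of the distance.
log_share <- share_from_gene(log1p(counts), "cell1", "cell2", "high_gene")

print(c(raw = raw_share, log = log_share))
```

In this toy case the high-count gene accounts for nearly all of the raw distance and almost none of the log-transformed distance, which is the behavior the normalization step is meant to achieve.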

* Another important note
 Remember that the ML standard is Rows = Examples (in this case cells) and Columns = Features (in this case gene counts).
 
Single-cell data often comes with rows and columns switched relative to the ML standard (Rows = Genes, Columns = Cells), so you'll see **t()**, which transposes the matrix.
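As a quick illustration of the orientation issue, here is a small sketch with a made-up genes-by-cells matrix (the gene and cell names are placeholders). It shows what `t()` does to the dimensions and names.

```r
# Made-up matrix in the usual single-cell layout: genes in rows, cells in columns.
genes_by_cells <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

# Transpose into the ML layout: cells in rows, genes in columns.
cells_by_genes <- t(genes_by_cells)

print(dim(genes_by_cells))  # genes x cells
print(dim(cells_by_genes))  # cells x genes
print(cells_by_genes)
```

Each value keeps its gene and cell; only the axes swap, so `cells_by_genes["cell2", "geneA"]` is the same number as `genes_by_cells["geneA", "cell2"]`.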
 

In [None]:

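# Get the data matrix (transposed so rows are cells) and the cell-type labels
data <- t(SCE$data)
label <- as.character(SCE$cell_type1)
# Log-normalize: log(x + 1) keeps zero counts at zero
log_data <- log(data + 1)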




## Auto-encoder training and clustering in scCAN is a one-liner, but it takes a little while to run; the more cores, the better.

In [None]:
# Generate the clustering result. The input matrix has cells (samples) as rows and genes as columns
result <- scCAN(log_data,ncores=8)


In [None]:
View(result)

Like PCA, this returns a reduced-dimension latent space, by default 15 dimensions. We'll still need something like UMAP to plot this data.
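To show what a latent matrix of this kind looks like, here is a small sketch that uses PCA (`prcomp` from base R) as a stand-in for the auto-encoder. It runs on random toy data, not the scCAN output, and the 15 dimensions simply mirror the default mentioned above.

```r
set.seed(1)

# Toy data: 50 cells (rows) x 30 genes (columns) of random values.
toy <- matrix(rnorm(50 * 30), nrow = 50)

# PCA as a stand-in for the auto-encoder: keep the first 15 components.
pca <- prcomp(toy, center = TRUE, scale. = FALSE)
latent <- pca$x[, 1:15]

# One row per cell, one column per latent dimension.
print(dim(latent))
```

Fifteen columns can't be drawn on a 2D plot directly, which is why a second step such as UMAP is still needed to get two plotting coordinates per cell.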


In [None]:
head(result$latent)
head(result$cluster)

plot_data=umap(result$latent)

plot(plot_data,col=result$cluster)
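Since the demo data comes with cell-type labels, one way to check the clusters is a contingency table. The sketch below uses made-up stand-ins (`toy_cluster`, `toy_label`) so it runs on its own; in this notebook the analogous call would be `table(cluster = result$cluster, label = label)`. The purity score is an illustrative summary, not something scCAN reports.

```r
# Made-up stand-ins for result$cluster and label.
toy_cluster <- c(1, 1, 1, 2, 2, 2, 3, 3)
toy_label <- c("neuron", "neuron", "astro", "astro",
               "astro", "astro", "oligo", "oligo")

# Rows are clusters, columns are known cell types.
confusion <- table(cluster = toy_cluster, label = toy_label)
print(confusion)

# Purity: fraction of cells whose label is the majority label in their cluster.
purity <- sum(apply(confusion, 1, max)) / sum(confusion)
print(purity)
```

A table with one dominant entry per row means the clusters line up well with the known cell types.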


# The above is the example that comes with scCAN; let's try a more complicated analysis in Seurat

We'll use the dataset from the Seurat tutorial and follow its pipeline.

In [None]:
pbmc.data <- Read10X(data.dir = "/projects/datascience/shared/filtered_gene_bc_matrices/hg19/")

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- RunUMAP(pbmc, dims = 1:10)

features=VariableFeatures(object=pbmc)


In [None]:
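# Dense alternative, restricted to the variable features:
#count_data=t(as.matrix(GetAssayData(object = pbmc, slot = "data")[features,])) 

# The Matrix package is needed to transpose the sparse matrix on the next line (it's loaded above);
# otherwise use the dense line at the top.
# Note: slot = "data" holds the log-normalized values from NormalizeData(), not raw counts.
count_data=t(GetAssayData(object = pbmc, slot = "data"))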
#[features,]

In [None]:

s_result <- scCAN(count_data, ncores = 16, sparse = TRUE)

In [None]:
pbmc@meta.data$scan.clusters <- s_result$cluster
Idents(object = pbmc)<- "scan.clusters"
pbmc.markers <- FindAllMarkers(pbmc,min.pct = 0.25, logfc.threshold = 0.25)
head(pbmc.markers)


In [None]:
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)


In [None]:
library(dplyr)
pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

In [None]:

sc_umap=umap(s_result$latent)
# Use the same prefix for the column names and the reduction key so Seurat doesn't have to rename them
colnames(sc_umap) <- paste0("SCC2_", 1:2)
row.names(sc_umap) <- row.names(pbmc@reductions$pca)
colnames(sc_umap)
pbmc[['scCANumap']]<-CreateDimReducObject(embeddings = sc_umap , key = "SCC2_", assay = DefaultAssay(pbmc))

DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
DimPlot(pbmc, reduction = "scCANumap", label = TRUE, pt.size = 0.5 ) + NoLegend()
