# Mouse snRNA Integrative Analysis
## Hippocampus
### Data
- [Hippocampus data table](https://github.com/erebboah/ENC4_Mouse_SingleCell/blob/master/snrna/ref/hippocampus_minimal_metadata.tsv)

### Aims
[integrate_parse_10x.R](https://github.com/erebboah/ENC4_Mouse_SingleCell/blob/master/snrna/scripts/integrate_parse_10x.R):
1. Read in pre-processed Parse and 10x data and merge counts matrices across experiments (within the same technology) for each tissue.
2. Filter nuclei by # genes, # UMIs, percent mitochondrial gene expression, and doublet score. See [detailed metadata](https://github.com/erebboah/ENC4_Mouse_SingleCell/blob/master/snrna/ref/enc4_mouse_snrna_metadata.tsv) for filter cutoffs. **Also filter 10x nuclei for those passing snATAC filters.**
3. Run SCT on the 3 objects to regress `percent.mt` and `nFeature_RNA`. Use `method = "glmGamPoi"` to speed up this step, and save the pre-integrated data in the `seurat` folder.
4. Combine Parse standard, Parse deep, and 10x data by CCA integration. Use Parse standard as reference dataset because it contains all timepoints, while 10x data only contains 2 timepoints. 
5. Score nuclei by cell cycle using these [mouse cell cycle genes](https://github.com/erebboah/ENC4_Mouse_SingleCell/blob/master/snrna/ref/mouse_cellcycle_genes.rda) to aid in manual celltype annotation.
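
The filtering in step 2 can be sketched in base R as a set of per-nucleus threshold tests. This is only an illustration: the metadata values and every cutoff below are made up, and the real cutoffs live in the detailed metadata table linked above.

```r
# Hypothetical per-nucleus metadata; column names follow Seurat conventions,
# values are invented for illustration.
meta <- data.frame(
  cell_id        = paste0("nucleus_", 1:6),
  nFeature_RNA   = c(250, 1200, 3000, 900, 4000, 1500),
  nCount_RNA     = c(400, 2500, 9000, 1800, 60000, 3000),
  percent.mt     = c(0.2, 0.5, 3.0, 0.1, 0.4, 0.3),
  doublet_scores = c(0.05, 0.10, 0.08, 0.60, 0.12, 0.07),
  stringsAsFactors = FALSE
)

# Assumed cutoffs (NOT the values used in the project)
cutoffs <- list(min_genes = 500, min_umis = 500, max_umis = 30000,
                max_mt = 1, max_doublet = 0.25)

# A nucleus is kept only if it passes every test
keep <- meta$nFeature_RNA   >= cutoffs$min_genes &
        meta$nCount_RNA     >= cutoffs$min_umis &
        meta$nCount_RNA     <= cutoffs$max_umis &
        meta$percent.mt     <= cutoffs$max_mt &
        meta$doublet_scores <= cutoffs$max_doublet

filtered <- meta[keep, ]
```

Combining the tests into one logical vector makes it easy to tabulate how many nuclei each individual criterion removes before subsetting.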

[predict_hippocampus_celltypes.R](https://github.com/erebboah/ENC4_Mouse_SingleCell/blob/master/snrna/scripts/predict_hippocampus_celltypes.R): Use an [external 10x brain atlas](https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-whole-cortex-and-hippocampus-10x) to predict celltype labels. The 1.1M-cell atlas was subsampled to 1,000 cells per celltype, giving a ~250,000-cell reference (code coming soon).

**In this notebook**:
Manually annotate celltypes by assigning each cluster the celltype predicted for the largest share of its nuclei, then adjusting the labels by hand as needed. Find marker genes for `gen_celltype`, `celltypes`, and `subtypes` and save them in `seurat/markers`.
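
The "most common prediction per cluster" step can be sketched in base R with a contingency table and a named lookup vector. The cluster IDs and labels below are toy data, and ties are broken deterministically here (`ties.method = "first"`), whereas `max.col` breaks ties at random by default.

```r
# Toy data: cluster IDs and atlas-predicted labels for 8 nuclei (made up)
clusters    <- c("0", "0", "0", "1", "1", "1", "2", "2")
predictions <- c("DG", "DG", "CA1", "Sst", "Sst", "Sst", "Astro", "Oligo")

# Contingency table: rows = clusters, columns = predicted celltypes
mat <- as.matrix(table(clusters, predictions))

# Most frequent prediction per cluster; ties go to the first column
top_label <- colnames(mat)[max.col(mat, ties.method = "first")]
names(top_label) <- rownames(mat)

# Look up each nucleus's cluster by name and append the cluster ID
cluster_label <- paste0(top_label[clusters], ".", clusters)
```

Indexing the lookup by cluster name avoids regular expressions entirely, so a cluster number can never accidentally match digits inside a label such as `L2/3 IT PPP`.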


### Results
- We decided on 3 levels of annotation: `gen_celltype` or general celltype (e.g. "Neuron"), `celltypes` for higher resolution (e.g. "Inhibitory"), and finally `subtypes` for the highest resolution of celltype annotations (e.g. "Pvalb"). 
- The external atlas does not split oligodendrocyte-lineage cells into OPCs, MFOLs, and MOLs, so we use marker genes reported in the literature and on [mousebrain.org](http://www.mousebrain.org/adolescent/celltypes.html) to assign these labels.



In [4]:
library(Matrix)
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(viridis))
library(glmGamPoi)
library(RColorBrewer)
options(future.globals.maxSize = 10000 * 1024^2)
future.seed=TRUE


In [None]:
setwd("../../enc4_mouse/snrna/")

In [5]:
setwd("/share/crsp/lab/seyedam/share/enc4_mouse/snrna/")

In [6]:
system("mkdir plots/hippocampus")
system("mkdir plots/hippocampus/qc")
system("mkdir plots/hippocampus/clustering")
system("mkdir plots/hippocampus/annotation")
system("mkdir seurat/markers")
system("mkdir seurat/markers/hippocampus")
system("mkdir ref/hippocampus")

# Functions

In [4]:
get_orig_counts = function(file){
    metadata = metadata[metadata$file_accession == file,]
    counts = readMM(paste0("counts_10x/",file,"/matrix.mtx"))
    
    barcodes = read.delim(paste0("counts_10x/",file,"/barcodes.tsv"),header = F, col.names="barcode")
    features = read.delim(paste0("counts_10x/",file,"/genes.tsv"),header = F, col.names="gene_name") 
    colnames(counts) = barcodes$barcode
    rownames(counts) = features$gene_name
    out = counts

}

In [5]:
knee_df = function(mtx,expt_name){
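    # barcodes are the columns of the genes x barcodes matrix, so per-barcode UMI totals are column sums
    df = as.data.frame(Matrix::colSums(mtx))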
    colnames(df) = c("nUMI")
    df <- tibble(total = df$nUMI,
               rank = row_number(dplyr::desc(total))) %>%
    distinct() %>%
    arrange(rank)
    df$experiment = expt_name
    out = df
}
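
As a small illustration of the data behind a knee plot, the sketch below ranks barcodes by their total UMI count in base R. It assumes a genes x barcodes matrix, so per-barcode totals are column sums; the counts are toy values.

```r
# Toy genes x barcodes count matrix (values made up); filled column by column
counts <- matrix(c(5, 0, 1,
                   20, 3, 7,
                   0, 0, 1,
                   9, 2, 4),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("bc1", "bc2", "bc3", "bc4")))

# Total UMIs per barcode: sum over genes, i.e. over each column
total <- colSums(counts)

# Rank 1 = barcode with the most UMIs
knee <- data.frame(barcode = names(total),
                   total   = unname(total),
                   rank    = rank(-total, ties.method = "first"),
                   stringsAsFactors = FALSE)
knee <- knee[order(knee$rank), ]
```

Plotting `total` against `rank` on log-log axes gives the knee curve; barcodes with a total of zero drop out on a log scale.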

# QC plots

## Knee plot

In [6]:
combined.sct = readRDS("seurat/hippocampus_Parse_10x_integrated.rds")
cellbend_10x = subset(combined.sct,subset=technology =="10x")
orig_parse = subset(combined.sct,subset=technology =="Parse")
parse_standard = subset(orig_parse,subset=depth1 =="shallow")
parse_deep = subset(orig_parse,subset=depth1 =="deep")

In [7]:
metadata = read.delim("ref/enc4_mouse_snrna_metadata.tsv")
metadata = metadata[metadata$technology == "10x",]
metadata = metadata[metadata$tissue == "Hippocampus",]

files = metadata$file_accession

orig_10x = get_orig_counts(files[1])

for (j in 2:length(files)){
    counts_adding = get_orig_counts(files[j])
    orig_10x = cbind(orig_10x, counts_adding)
}


In [8]:
cellbend_knee_plot = knee_df(cellbend_10x@assays$RNA@counts, "10x + Cellbender")
orig_knee_plot = knee_df(orig_10x, "10x")

parse_standard_knee_plot = knee_df(parse_standard@assays$RNA@counts, "Parse standard")
parse_deep_knee_plot = knee_df(parse_deep@assays$RNA@counts, "Parse deep")

pdf(file="plots/hippocampus/qc/experiment_kneeplots.pdf",
    width = 10, height = 8)
ggplot(rbind(cellbend_knee_plot,orig_knee_plot,parse_standard_knee_plot,parse_deep_knee_plot), 
       aes(rank, total, group = experiment, color = experiment)) +
geom_path() + 
scale_y_log10() + scale_x_log10() + annotation_logticks() +
labs(y = "Total UMI count", x = "Barcode rank", title = "Mouse hippocampus knee plot") + 
geom_hline(yintercept=500, linetype="dashed", color = "red", size=1)
dev.off()


"Transformation introduced infinite values in continuous y-axis"


In [9]:
pdf(file="plots/hippocampus/qc/experiment_violinplots.pdf",
    width = 20, height = 5)
VlnPlot(combined.sct, features = c("nFeature_RNA"), ncol = 1, split.by = "depth2",
        pt.size = 0, group.by = "sample", cols = c("#811b74","#C08DBA","#00a1e0"))+ ggtitle("# genes per nucleus") +
stat_summary(fun.y = median, geom='point', size = 15, colour = "black", shape = 95) & theme(text = element_text(size = 20), 
                                                                              axis.text.x = element_text(size = 20), 
                                                                              axis.text.y = element_text(size = 20))
VlnPlot(combined.sct, features = c("nCount_RNA"), ncol = 1, split.by = "depth2",
        pt.size = 0, group.by = "sample", cols = c("#811b74","#C08DBA","#00a1e0")) + ggtitle("# UMIs per nucleus") +
stat_summary(fun.y = median, geom='point', size = 15, colour = "black", shape = 95)& theme(text = element_text(size = 20), 
                                                  axis.text.x = element_text(size = 20), 
                                                  axis.text.y = element_text(size = 20))
VlnPlot(combined.sct, features = c("percent.mt"), ncol = 1, split.by = "depth2",
        pt.size = 0, group.by = "sample", cols = c("#811b74","#C08DBA","#00a1e0")) & theme(text = element_text(size = 20), 
                                                  axis.text.x = element_text(size = 20), 
                                                  axis.text.y = element_text(size = 20)) 
dev.off()

The default behaviour of split.by has changed.
Separate violin plots are now plotted side-by-side.
To restore the old behaviour of a single split violin,
set split.plot = TRUE.
      
This message will be shown once per session.

"`fun.y` is deprecated. Use `fun` instead."
"`fun.y` is deprecated. Use `fun` instead."


## UMAP "Feature Plots" of QC metadata

In [10]:
png(file="plots/hippocampus/qc/qc_featureplot.png",
    width = 1200, height = 500)
FeaturePlot(combined.sct, pt.size = 0.1,
            features =c("nFeature_RNA",
                        "nCount_RNA",
                        "percent.mt",
                        "percent.ribo",
                        "doublet_scores",
                        "G2M.Score"), ncol =3)  & scale_colour_gradientn(colours = viridis(11)) & 
                        NoAxes()& 
                        theme(text = element_text(size = 18))

dev.off()


Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.




# Check integration and clustering
We want the Vip+ and Sncg+ clusters to be separate, so the clustering resolution is increased to 2.

In [12]:
DefaultAssay(combined.sct) = "integrated"
combined.sct <- FindClusters(combined.sct,resolution=2,verbose = F)


In [13]:
nclusters = length(unique(combined.sct$seurat_clusters))
cluster_cols = colorRampPalette(brewer.pal(9,"Set1"))(nclusters)

In [14]:
pdf(file="plots/hippocampus/clustering/UMAP_Parse_10x.pdf",
    width = 20, height = 8)
p1 <- DimPlot(combined.sct, reduction = "umap", group.by = "technology")
p2 <- DimPlot(combined.sct, reduction = "umap", label = TRUE, repel = TRUE, cols = cluster_cols)
p1 + p2

dev.off()

In [15]:
pdf(file="plots/hippocampus/clustering/Parse_10x_experiment_distribution.pdf",
    width = 20, height = 6)
DimPlot(combined.sct, reduction = "umap", group.by = "seurat_clusters",split.by = "depth2", label = TRUE, label.size = 6, repel = TRUE, shuffle = T,cols = cluster_cols)

ggplot(combined.sct@meta.data, aes(x=seurat_clusters, fill=depth2)) + geom_bar(position = "fill") & 
theme(text = element_text(size = 20), axis.text.x = element_text(size = 20), axis.text.y = element_text(size = 20))

dev.off()


In [16]:
combined.sct$sample = factor(combined.sct$sample, levels=paste0("HC_",rep(c("10","14","25","36","2m","18m"),each=4),rep(c("_M","_F"),each=2),c("_1","_2")))

pdf(file="plots/hippocampus/clustering/UMAP_cluster_sample_barplot.pdf",
    width = 20, height = 10)
p1=DimPlot(combined.sct, reduction = "umap", group.by = "seurat_clusters", label = TRUE, label.size = 8, repel = TRUE, 
          cols = cluster_cols)
p2=ggplot(combined.sct@meta.data, aes(x=seurat_clusters, fill=sample)) + geom_bar(position = "fill") +
theme(text = element_text(size = 20), axis.text.x = element_text(size = 20), axis.text.y = element_text(size = 20)) & coord_flip()
gridExtra::grid.arrange(
  p1, p2,
  widths = c(2,1.6),
  layout_matrix = rbind(c(1, 2)))

dev.off()

In [17]:
pdf(file="plots/hippocampus/clustering/age_sex_barplot.pdf",
    width = 18, height = 19)
p1=DimPlot(combined.sct, reduction = "umap", group.by = "timepoint", label = TRUE, label.size = 5, repel = TRUE)
p2 = ggplot(combined.sct@meta.data, aes(x=seurat_clusters, fill=timepoint)) + geom_bar(position = "fill") & 
theme(text = element_text(size = 20), axis.text.x = element_text(size = 20), axis.text.y = element_text(size = 20)) & coord_flip()

p3=DimPlot(combined.sct, reduction = "umap", group.by = "sex", label = TRUE, label.size = 5, repel = TRUE, shuffle = T)
p4 = ggplot(combined.sct@meta.data, aes(x=seurat_clusters, fill=sex)) + geom_bar(position = "fill") & 
theme(text = element_text(size = 20), axis.text.x = element_text(size = 20), axis.text.y = element_text(size = 20)) & coord_flip()
gridExtra::grid.arrange(
  p1, p2, p3, p4,
  widths = c(2,1),
  layout_matrix = rbind(c(1, 2),
                        c(3, 4)))

dev.off()

In [18]:
# I want Vip+ and Sncg+ clusters to be separate, and Sst+ and Pvalb+.
png(file="plots/hippocampus/clustering/inhib_neuron_featureplots.png",
    width = 1200, height = 1000)
DefaultAssay(combined.sct) = "SCT" # do NOT use integrated assay to visualize gene expression
FeaturePlot(combined.sct, pt.size = 0.1, order = T,
            features =c("Sst","Pvalb",
                        "Vip","Sncg"), ncol =2)  & scale_colour_gradientn(colours = viridis(11)) & 
                        NoAxes()& 
                        theme(text = element_text(size = 20))

dev.off()

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.




# Plotting: check predicted celltypes

In [19]:
pdf(file="plots/hippocampus/annotation/UMAP_predictions.pdf",
    width = 15, height = 12)
nclusters = length(unique(combined.sct$atlas_predictions))
DimPlot(combined.sct, reduction = "umap", group.by = "atlas_predictions",
        label = TRUE, label.size = 6, repel = TRUE,cols = colorRampPalette(brewer.pal(9,"Set1"))(nclusters)) + NoLegend()
dev.off()

# Rename clusters based on maximum predicted celltype

In [20]:
Idents(combined.sct) = "seurat_clusters"
mat = as.matrix(table(Idents(combined.sct), combined.sct$atlas_predictions))
ct = colnames(mat)[max.col(mat)]
names(ct) = 0:(length(ct)-1)

# basically add the cluster info to the maximum predicted celltype
for (i in 1:length(unique(Idents(combined.sct))))
{
    search = paste0("\\<",names(ct)[i],"\\>")
    replace = paste0(ct[i],".",names(ct)[i])
    Idents(combined.sct) = gsub(search,replace,Idents(combined.sct))
}

combined.sct[["atlas_celltypes"]] <- Idents(combined.sct)

In [21]:
pdf(file="plots/hippocampus/annotation/UMAP_maximum_predictions.pdf",
    width = 15, height = 12)
nclusters = length(unique(combined.sct$atlas_celltypes))
DimPlot(combined.sct, reduction = "umap", group.by = "atlas_celltypes",
        label = TRUE, label.size = 6, repel = TRUE,cols = colorRampPalette(brewer.pal(9,"Set1"))(nclusters))  & NoLegend()

dev.off()

# Manual celltype annotation

In [32]:
combined.sct$subtypes = combined.sct$atlas_celltypes

In [23]:
genes = c( "Gfap","Slc1a2","Slc1a3","Gja1", # astro
                                   "Pde1a","Galntl6","Cpne4","Hs3st4",# TEGLUs
                                   "Fibcd1","4921539H07Rik","Wfs1",# CA1
                                   "Nr4a3","Grik4", # CA3
                                   "Ndnf","Reln","Trp73", # cajal-retzius
                                    "Prox1","Bhlhe22","Igfbpl1", # early DG
                                    "Flt1", # endothelial 
                                   "Dnah6","Dnah12", "Ccdc153",# ependymal
                                   "Nr4a2","Hs3st2","Tshz2","Vwc2l", # TEGLUs
                                   "Lamp5","Lhx6",
                                   "Pdgfra","Sox6","C1ql1",# OPC
                                   "Csf1r","Cx3cr1","P2ry12",# microglia
                                   "Plp1","Mbp","Ccp110",# MFOL
                                   "Mog","Opalin","Ninj2","Hapln2","Dock5", # MOL
                                   "Sncg","Sst",
                                   "Dcn","Slc6a13","Ptgds",  # VLMC
                                   "Vip" # inhibitory subtype
          )
        

In [24]:
# dot plot of some marker genes
# http://www.mousebrain.org/adolescent/celltypes.html
pdf(file="plots/hippocampus/annotation/reference_predictions_marker_dotplot.pdf",
    width = 20, height = 12)
Idents(combined.sct) = "subtypes"
Idents(combined.sct) = factor(Idents(combined.sct), levels = sort(as.character(unique(Idents(combined.sct)))))
DotPlot(combined.sct, features = genes)+ 
theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
dev.off()

## Fix oligodendrocyte clusters

In [33]:
combined.sct$subtypes = gsub("\\<Oligo.13\\>","OPC.13",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Oligo.14\\>","OPC.14",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Oligo.48\\>","OPC.48",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Oligo.55\\>","OPC.55",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Oligo.34\\>","MFOL.34",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Oligo.33\\>","MFOL.33",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Oligo.43\\>","MOL.43",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Oligo.15\\>","MOL.15",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Oligo.56\\>","MOL.56",combined.sct$subtypes) 


## Add early DG, ependymal, another VLMC

In [34]:
combined.sct$subtypes = gsub("\\<Vip.19\\>","DG_early.19",combined.sct$subtypes)
combined.sct$subtypes = gsub("\\<Vip.12\\>","DG_early.12",combined.sct$subtypes)
combined.sct$subtypes = gsub("\\<Vip.45\\>","DG_early.45",combined.sct$subtypes)
combined.sct$subtypes = gsub("\\<Micro-PVM.54\\>","DG_early.54",combined.sct$subtypes)
combined.sct$subtypes = gsub("\\<DG.8\\>","DG_early.8",combined.sct$subtypes)
combined.sct$subtypes = gsub("\\<L2/3 IT ENTl.20\\>","DG_early.20",combined.sct$subtypes)

combined.sct$subtypes = gsub("\\<Vip.39\\>","Astro.39",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Astro.50\\>","Ependymal.50",combined.sct$subtypes) 

combined.sct$subtypes = gsub("\\<Astro.49\\>","Endo.49",combined.sct$subtypes) 


## Fix some neuron clusters

In [35]:
combined.sct$subtypes = gsub("\\<L2 IT ENTm.47\\>","CA1-ProS.47",combined.sct$subtypes)
combined.sct$subtypes = gsub("\\<L6 CT CTX.37\\>","CA3.37",combined.sct$subtypes)


In [36]:
# get rid of cluster #
combined.sct$subtypes = do.call("rbind", strsplit(as.character(combined.sct$subtypes), "[.]"))[,1]

## Change some names and check dot plot and UMAP

In [37]:
combined.sct$subtypes = gsub("\\<Astro\\>","Astrocyte",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<Endo\\>","Endothelial",combined.sct$subtypes) 
combined.sct$subtypes = gsub("\\<CA1-ProS\\>","CA1",combined.sct$subtypes) 


In [41]:
# dot plot of some marker genes
# http://www.mousebrain.org/adolescent/celltypes.html
pdf(file="plots/hippocampus/annotation/subtype_marker_dotplot.pdf",
    width = 20, height = 12)
Idents(combined.sct) = "subtypes"
Idents(combined.sct) = factor(Idents(combined.sct), levels = sort(as.character(unique(Idents(combined.sct)))))
DotPlot(combined.sct, features = genes)+ 
theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
dev.off()

In [40]:
pdf(file="plots/hippocampus/annotation/UMAP_subtypes.pdf",
    width = 15, height = 10)
nclusters = length(unique(combined.sct$subtypes))
DimPlot(combined.sct, reduction = "umap", group.by = "subtypes", label = TRUE, label.size = 8, repel = TRUE, 
          cols = colorRampPalette(brewer.pal(9,"Set1"))(nclusters))
dev.off()

# Add celltypes and gen_celltypes metadata
Based on the subtypes annotation, we can group the cells into broader categories.
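
One way to picture this grouping is as two named lookup vectors applied in sequence. The sketch below is base R on a handful of labels taken from this notebook; `map_labels` is a hypothetical helper, not part of the analysis code, and labels missing from a lookup pass through unchanged.

```r
# A subset of the subtype -> celltype and celltype -> general mappings
subtype_to_celltype <- c(CA1 = "Excitatory", CA3 = "Excitatory", DG = "Excitatory",
                         Pvalb = "Inhibitory", Sst = "Inhibitory",
                         MOL = "Oligodendrocyte", MFOL = "Oligodendrocyte")
celltype_to_general <- c(Excitatory = "Neuron", Inhibitory = "Neuron",
                         Oligodendrocyte = "Glial", Astrocyte = "Glial")

# Hypothetical helper: map labels through a lookup, keeping unmapped labels as-is
map_labels <- function(x, lookup) {
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

subtypes     <- c("CA1", "Sst", "MOL", "Astrocyte", "DG")
celltypes    <- map_labels(subtypes, subtype_to_celltype)
gen_celltype <- map_labels(celltypes, celltype_to_general)
```

Exact-name lookup behaves the same as the whole-word `gsub` calls for these labels, while keeping the full mapping visible in one place.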

In [42]:
combined.sct$celltypes = combined.sct$subtypes

combined.sct$celltypes = gsub("\\<CA1\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<CA3\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<CR\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<CT SUB\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<DG\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<DG_early\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<L2/3 IT PPP\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<L4/5 IT CTX\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<L5/6 NP CTX\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<SUB-ProS\\>","Excitatory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<Lamp5\\>","Inhibitory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<Pvalb\\>","Inhibitory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<Sncg\\>","Inhibitory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<Vip\\>","Inhibitory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<Sst\\>","Inhibitory",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<MFOL\\>","Oligodendrocyte",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<MOL\\>","Oligodendrocyte",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<SMC-Peri\\>","Smooth_muscle",combined.sct$celltypes)
combined.sct$celltypes = gsub("\\<Micro-PVM\\>","Microglia",combined.sct$celltypes)


In [43]:
combined.sct$gen_celltype = combined.sct$celltypes

combined.sct$gen_celltype = gsub("\\<Excitatory\\>","Neuron",combined.sct$gen_celltype)
combined.sct$gen_celltype = gsub("\\<Inhibitory\\>","Neuron",combined.sct$gen_celltype)
combined.sct$gen_celltype = gsub("\\<Astrocyte\\>","Glial",combined.sct$gen_celltype)
combined.sct$gen_celltype = gsub("\\<OPC\\>","Glial",combined.sct$gen_celltype)
combined.sct$gen_celltype = gsub("\\<Oligodendrocyte\\>","Glial",combined.sct$gen_celltype)
combined.sct$gen_celltype = gsub("\\<Microglia\\>","Myeloid",combined.sct$gen_celltype)
combined.sct$gen_celltype = gsub("\\<Ependymal\\>","Stromal",combined.sct$gen_celltype)



# Plotting the 3 levels of annotations

In [None]:
color_ref = read.delim("ref/enc4_mouse_snrna_celltypes_c2c12.csv",sep=",",col.names = c("tissue","gen_celltype","celltypes",
                                                                              "subtypes","gen_celltype_color",
                                                                              "celltype_color","subtype_color"))
gen_celltype_colors = unique(color_ref[color_ref$tissue == "Hippocampus",c("gen_celltype","gen_celltype_color")])
rownames(gen_celltype_colors) = gen_celltype_colors$gen_celltype
gen_celltype_colors = gen_celltype_colors[sort(unique(combined.sct$gen_celltype)),]

pdf(file="plots/hippocampus/annotation/UMAP_final_gen_celltype.pdf",
    width = 15, height = 10)

DimPlot(combined.sct, reduction = "umap", 
        group.by = "gen_celltype", 
        label = TRUE, label.size = 8, repel = TRUE,
       cols = gen_celltype_colors$gen_celltype_color)

dev.off()

In [None]:
celltype_colors = unique(color_ref[color_ref$tissue == "Hippocampus",c("celltypes","celltype_color")])
rownames(celltype_colors) = celltype_colors$celltypes
celltype_colors = celltype_colors[sort(unique(combined.sct$celltypes)),]

pdf(file="plots/hippocampus/annotation/UMAP_final_celltypes.pdf",
    width = 15, height = 10)

DimPlot(combined.sct, reduction = "umap", 
        group.by = "celltypes", 
        label = TRUE, label.size = 8, repel = TRUE,
       cols = celltype_colors$celltype_color)

dev.off()

In [48]:
subtype_colors = unique(color_ref[color_ref$tissue == "Hippocampus",c("subtypes","subtype_color")])
rownames(subtype_colors) = subtype_colors$subtypes
subtype_colors = subtype_colors[sort(unique(combined.sct$subtypes)),]

pdf(file="plots/hippocampus/annotation/UMAP_final_subtypes.pdf",
    width = 15, height = 10)

DimPlot(combined.sct, reduction = "umap", 
        group.by = "subtypes", 
        label = TRUE, label.size = 8, repel = TRUE,
       cols = subtype_colors$subtype_color)

dev.off()

## Proportion plot of celltypes over timepoint

In [49]:
combined.sct_parse = subset(combined.sct,subset= technology == "Parse")

samples = sort(unique(combined.sct_parse$timepoint))
dflist = list()
for (i in 1:length(unique(combined.sct_parse$timepoint))){
  tp=combined.sct_parse@meta.data[combined.sct_parse@meta.data$timepoint == samples[i],]
  #tp=tp[complete.cases(tp),]
  tp_df=as.data.frame(table(tp$celltypes))
  tp_df$percentage=tp_df$Freq/nrow(tp)
  tp_df$timepoint=rep(i,nrow(tp_df))
  dflist[[i]]=tp_df
}
df = do.call(rbind, dflist)
df <- df[order(df$timepoint),]
colnames(df)= c("celltypes","Freq","percentage","timepoint")
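
The per-timepoint proportions computed in the loop above can also be sketched with `table` and `prop.table`. The metadata here is toy data, and unlike the loop, this version keeps timepoints as labels rather than converting them to integer positions.

```r
# Toy metadata (made up): timepoint and celltype for 8 nuclei
meta <- data.frame(
  timepoint = c("PND_10", "PND_10", "PND_10", "PND_10",
                "PND_14", "PND_14", "PND_14", "PND_14"),
  celltypes = c("Excitatory", "Excitatory", "Inhibitory", "Astrocyte",
                "Excitatory", "Astrocyte", "Astrocyte", "Astrocyte"),
  stringsAsFactors = FALSE
)

# Rows = timepoints, columns = celltypes; margin = 1 makes each row sum to 1
props <- prop.table(table(meta$timepoint, meta$celltypes), margin = 1)

# Long format, one row per (timepoint, celltype) pair
df_long <- as.data.frame(props, stringsAsFactors = FALSE)
colnames(df_long) <- c("timepoint", "celltypes", "percentage")
```

Because the table covers every timepoint and celltype combination, celltypes absent at a timepoint appear with a proportion of 0, which keeps stacked area plots aligned.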



In [50]:
pdf(file="plots/hippocampus/annotation/timepoint_celltypes_proportions.pdf",
    width = 15, height = 10)

ggplot(df, aes(x=timepoint, y=percentage, fill=celltypes)) + 
  geom_area()  +
  scale_fill_manual(values= celltype_colors$celltype_color) + 
  scale_x_continuous(breaks = c(1,2,3,4,5,6),labels= c("PND_10","PND_14",
                                                         "PND_25","PND_36","PNM_02","PNM_18-20"))+
  scale_y_continuous(breaks = c(0,0.1,0.2,0.3,0.4,0.5,
                                0.6,0.7,0.8,0.9,1.0),labels= c("0%","10%","20%","30%","40%","50%","60%","70%","80%","90%","100%")) + 
theme_minimal()+theme(text = element_text(size = 30)) + 
theme(axis.text.x = element_text(size = 30))  + 
theme(axis.text.y = element_text(size = 30))   + 
theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
dev.off()

# Save

In [51]:
saveRDS(combined.sct,file="seurat/hippocampus_Parse_10x_integrated.rds")


# Save embedding coordinates file

In [53]:
cell_id = colnames(combined.sct)
coords = cbind(cell_id,combined.sct@reductions$umap@cell.embeddings,
               combined.sct@reductions$pca@cell.embeddings)
fn = paste0("ref/hippocampus/hippocampus_integrated_analysis_embedding_coordinates_noheader.tsv")
write.table(coords,file=fn,
            sep="\t",
            quote=F,
            row.names = F)
out = paste0("ref/hippocampus/hippocampus_integrated_analysis_embedding_coordinates.tsv")
system(paste0("cat ref/embedding_coordinates_header ", fn, "> ", out))
system(paste0("rm ", fn))
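
For readers without a shell, prepending a header can be sketched in pure R by writing the header lines first and appending the table. Everything below is illustrative: the table is toy data, the header text is hypothetical, and `col.names = FALSE` assumes the header file already supplies the column names.

```r
# Toy coordinates table (made up) and a hypothetical header line
coords <- data.frame(cell_id = c("c1", "c2"),
                     UMAP_1  = c(0.5, -1.2),
                     UMAP_2  = c(2.0, 0.3),
                     stringsAsFactors = FALSE)
header_lines <- c("# hypothetical header line")

out_file <- tempfile(fileext = ".tsv")

# Write the header first, then append the table below it
writeLines(header_lines, out_file)
write.table(coords, file = out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE, append = TRUE)

result <- readLines(out_file)
```

In the notebook's setting the header lines would presumably come from `readLines("ref/embedding_coordinates_header")`, and no intermediate `_noheader` file would be needed.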

# Get marker genes

In [22]:
combined.sct  = readRDS("seurat/hippocampus_Parse_10x_integrated.rds")

In [23]:
gene_id_to_name = read.csv("ref/gene_id_to_name.csv",row.names = 1)
gene_id_to_name$seurat_gene_name = rownames(combined.sct@assays$RNA)

In [24]:
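format_marker_genes = function(seurat_markers){
    # work on the function argument rather than the global `markers` object
    markers = seurat_markers
    markers$seurat_gene_name = markers$gene
    # attach gene_id and gene_name via the shared seurat_gene_name column
    markers = left_join(markers, gene_id_to_name, by = "seurat_gene_name")
    markers = markers[,c("gene_id","gene_name","p_val",
                     "avg_log2FC","pct.1","pct.2",
                     "p_val_adj","cluster")]
    markers
}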

In [None]:
DefaultAssay(combined.sct)= "RNA"
Idents(combined.sct) = "gen_celltype"
markers <- FindAllMarkers(combined.sct,
                             only.pos = TRUE, 
                             min.pct = 0.1, 
                             logfc.threshold = 0.25, 
                             verbose = T)
markers_formatted = format_marker_genes(markers)

write.table(markers_formatted,file=paste0("seurat/markers/hippocampus/hippocampus_gen_celltype_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv"),
            sep="\t",
            quote=F,
            row.names = F)


In [None]:
for (celltype in unique(markers_formatted$cluster)){
    keep = c("gene_id","gene_name","p_val",
             "avg_log2FC","pct.1","pct.2",
             "p_val_adj")
    markers_celltype = markers_formatted[markers_formatted$cluster == celltype,]
    markers_celltype$is_enriched = 1
    markers_celltype = markers_celltype[,c("gene_id","gene_name","is_enriched","p_val","avg_log2FC","pct.1","pct.2","p_val_adj")]    
    fn = paste0("seurat/markers/hippocampus/hippocampus_gen_celltype_",
                                             celltype,"_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25_noheader.tsv")
    write.table(markers_celltype,file=fn,
            sep="\t",
            quote=F,
            row.names = F)
    
    out = paste0("seurat/markers/hippocampus/hippocampus_gen_celltype_",
                                             celltype,"_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv")
    system(paste0("cat ref/marker_gene_header ", fn, "> ", out))
    system(paste0("rm ", fn))
}

In [None]:
DefaultAssay(combined.sct)= "RNA"
Idents(combined.sct) = "celltypes"
markers <- FindAllMarkers(combined.sct,
                             only.pos = TRUE, 
                             min.pct = 0.1, 
                             logfc.threshold = 0.25, 
                             verbose = T)
markers_formatted = format_marker_genes(markers)

write.table(markers_formatted,file=paste0("seurat/markers/hippocampus/hippocampus_celltypes_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv"),
            sep="\t",
            quote=F,
            row.names = F)


In [None]:
for (celltype in unique(markers_formatted$cluster)){
    keep = c("gene_id","gene_name","p_val",
             "avg_log2FC","pct.1","pct.2",
             "p_val_adj")
    markers_celltype = markers_formatted[markers_formatted$cluster == celltype,]
    markers_celltype$is_enriched = 1
    markers_celltype = markers_celltype[,c("gene_id","gene_name","is_enriched","p_val","avg_log2FC","pct.1","pct.2","p_val_adj")]    
    fn = paste0("seurat/markers/hippocampus/hippocampus_celltypes_",
                                             celltype,"_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25_noheader.tsv")
    write.table(markers_celltype,file=fn,
            sep="\t",
            quote=F,
            row.names = F)
    
    out = paste0("seurat/markers/hippocampus/hippocampus_celltypes_",
                                             celltype,"_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv")
    system(paste0("cat ref/marker_gene_header ", fn, "> ", out))
    system(paste0("rm ", fn))
}

In [None]:
DefaultAssay(combined.sct)= "RNA"
Idents(combined.sct) = "subtypes"
markers <- FindAllMarkers(combined.sct,
                             only.pos = TRUE, 
                             min.pct = 0.1, 
                             logfc.threshold = 0.25, 
                             verbose = T)
markers_formatted = format_marker_genes(markers)

write.table(markers_formatted,file=paste0("seurat/markers/hippocampus/hippocampus_subtypes_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv"),
            sep="\t",
            quote=F,
            row.names = F)


In [None]:
for (celltype in unique(markers_formatted$cluster)){
    keep = c("gene_id","gene_name","p_val",
             "avg_log2FC","pct.1","pct.2",
             "p_val_adj")
    markers_celltype = markers_formatted[markers_formatted$cluster == celltype,]
    markers_celltype$is_enriched = 1
    markers_celltype = markers_celltype[,c("gene_id","gene_name","is_enriched","p_val","avg_log2FC","pct.1","pct.2","p_val_adj")]    
    fn = paste0("seurat/markers/hippocampus/hippocampus_subtypes_",
                                             celltype,"_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25_noheader.tsv")
    write.table(markers_celltype,file=fn,
            sep="\t",
            quote=F,
            row.names = F)
    
    out = paste0("seurat/markers/hippocampus/hippocampus_subtypes_",
                                             celltype,"_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv")
    system(paste0("cat ref/marker_gene_header ", fn, "> ", out))
    system(paste0("rm ", fn))
}

# Save cell type labels file

In [None]:
get_membership_scores = function(seurat_markers,slot){
    celltype = unique(seurat_markers$cluster)
    celltype_metadata_scored = list()
    
    # for each celltype
    for (i in 1:length(celltype)){ 
        celltype_markers = seurat_markers[seurat_markers$cluster == celltype[i],]
        combined.sct[["membership_score"]] = PercentageFeatureSet(combined.sct,
                                                              pattern = NULL,
                                                              features = celltype_markers$gene_name,
                                                              assay = "RNA")
        celltype_metadata = combined.sct@meta.data
        celltype_metadata = celltype_metadata[,c("cellID",slot,"membership_score")]
        celltype_metadata_scored[[i]] = celltype_metadata[celltype_metadata[,2] == celltype[i], ]

    }
    celltype_metadata_scored = do.call(rbind.data.frame, celltype_metadata_scored)
}

# gen_celltype
markers = read.delim("seurat/markers/hippocampus/hippocampus_gen_celltype_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv")
gen_celltype_membership = get_membership_scores(markers,"gen_celltype")
colnames(gen_celltype_membership) = c("cell_id","general_cell_type_name","general_cell_type_membership_score")

# celltypes
markers = read.delim("seurat/markers/hippocampus/hippocampus_celltypes_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv")
celltype_membership = get_membership_scores(markers,"celltypes")
colnames(celltype_membership) = c("cell_id","cell_type_name","cell_type_membership_score")

# subtypes
markers = read.delim("seurat/markers/hippocampus/hippocampus_subtypes_marker_genes_only.pos_min.pct0.1_logfc.threshold0.25.tsv")
subtype_membership = get_membership_scores(markers,"subtypes")
colnames(subtype_membership) = c("cell_id","sub_cell_type_name","sub_type_membership_score")

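# Combine the three label tables into one table with one row per cell.
# Assumption: membership_metadata is the join of the three tables on cell_id.
membership_metadata = Reduce(function(x, y) merge(x, y, by = "cell_id"),
                             list(gen_celltype_membership, celltype_membership, subtype_membership))

fn = "ref/hippocampus_integrated_analysis_cell_type_labels_noheader.tsv"
write.table(membership_metadata,file=fn,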
            sep="\t",
            quote=F,
            row.names = F)
out = paste0("ref/hippocampus/hippocampus_integrated_analysis_cell_type_labels.tsv")
system(paste0("cat ref/celltype_labels_header ", fn, "> ", out))
system(paste0("rm ", fn))
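
The membership score computed above is the percentage of a nucleus's RNA counts that fall in the marker genes of its assigned cell type, which is what `PercentageFeatureSet` reports when it is given `features`. A minimal base R sketch of that calculation follows; the counts matrix and the gene names are made up for illustration and are not from this dataset.

```r
# Toy counts matrix: 4 genes (rows) x 3 nuclei (columns); values are invented.
counts <- matrix(c(5, 0, 1,
                   3, 2, 0,
                   1, 6, 4,
                   1, 2, 5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("GeneA", "GeneB", "GeneC", "GeneD"),
                                 c("cell1", "cell2", "cell3")))

# Hypothetical marker genes for one cell type.
marker_genes <- c("GeneA", "GeneB")

# Percentage of each nucleus's total counts that come from the marker genes.
membership_score <- colSums(counts[marker_genes, , drop = FALSE]) / colSums(counts) * 100
print(membership_score)
```

A nucleus whose counts are dominated by its cell type's markers scores close to 100, while a nucleus that barely expresses them scores close to 0.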


## Get unfiltered RNA information

In [25]:
get_orig_10x_counts = function(file){
    metadata = metadata[metadata$file_accession == file,]
    counts = readMM(paste0("counts_10x/",file,"/matrix.mtx"))
    
    barcodes = read.delim(paste0("counts_10x/",file,"/barcodes.tsv"),header = F, col.names="barcode")
    features = read.delim(paste0("counts_10x/",file,"/genes.tsv"),header = F, col.names="gene_name") 
    colnames(counts) = barcodes$barcode
    rownames(counts) = features$gene_name
    out = counts

}

metadata = read.delim("ref/enc4_mouse_snrna_metadata.tsv")
metadata = metadata[metadata$technology == "10x",]
metadata = metadata[metadata$tissue == "Hippocampus",]

files = unique(metadata$file_accession)

orig_10x = get_orig_10x_counts(files[1])

for (j in seq_along(files)[-1]){  # skips the loop when there is only one file
    counts_adding = get_orig_10x_counts(files[j])
    orig_10x = cbind(orig_10x, counts_adding)
}
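
The loop above grows the merged matrix one `cbind` at a time. A minimal sketch of an equivalent merge that reads every file with `lapply` and binds once is shown here; it also works when there is only one input file. `make_toy_counts` and the `LIB1`/`LIB2`/`LIB3` IDs are hypothetical stand-ins for `get_orig_10x_counts` and the real file accessions.

```r
library(Matrix)

# Hypothetical stand-in for get_orig_10x_counts(): returns a small sparse
# genes x barcodes matrix whose column names end in the library ID.
make_toy_counts <- function(library_id, n_cells) {
    sparseMatrix(i = rep(1, n_cells), j = seq_len(n_cells), x = rep(1, n_cells),
                 dims = c(3, n_cells),
                 dimnames = list(c("GeneA", "GeneB", "GeneC"),
                                 paste0("BC", seq_len(n_cells), ".", library_id)))
}

toy_files <- c("LIB1", "LIB2", "LIB3")
per_file <- lapply(toy_files, make_toy_counts, n_cells = 2)

# Column-binding only makes sense if every matrix has the same genes in the same order.
genes <- rownames(per_file[[1]])
same_genes <- vapply(per_file, function(m) identical(rownames(m), genes), logical(1))
stopifnot(all(same_genes))

# Bind all matrices in one call instead of growing the result inside a loop.
combined <- do.call(cbind, per_file)
print(dim(combined))
```

Checking the row names before binding guards against silently misaligned genes if the input files were ever produced with different annotations.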


In [26]:
get_orig_parse_counts = function(file){
    metadata = metadata[metadata$file_accession == file,]
    counts = readMM(paste0("scrublet/",file,"_matrix.mtx"))
    
    barcodes = read.delim(paste0("scrublet/",file,"_barcodes.tsv"),header = F, col.names="barcode")
    features = read.delim(paste0("scrublet/",file,"_genes.tsv"),header = F, col.names="gene_name") 
    colnames(counts) = barcodes$barcode
    rownames(counts) = features$gene_name
    out = counts

}

metadata = read.delim("ref/enc4_mouse_snrna_metadata.tsv")
metadata = metadata[metadata$technology == "Parse",]
metadata = metadata[metadata$tissue == "Hippocampus",]

files = unique(metadata$experiment_batch)

orig_parse = get_orig_parse_counts(files[1])

for (j in seq_along(files)[-1]){  # skips the loop when there is only one file
    counts_adding = get_orig_parse_counts(files[j])
    orig_parse = cbind(orig_parse, counts_adding)
}

## Join unfiltered RNA and ATAC metadata

In [27]:
metadata = read.delim("../enc4_mouse_metadata.tsv")
orig_rna = cbind(orig_10x, orig_parse)
obj = CreateSeuratObject(counts = orig_rna, min.cells = 0, min.features = 0)
obj[["percent.mt"]] = PercentageFeatureSet(obj, pattern = "^mt-")
obj[["percent.ribo"]] <- PercentageFeatureSet(obj, pattern = "^Rp[sl][[:digit:]]|^Rplp[[:digit:]]|^Rpsa")
obj$cellID = colnames(obj)
obj$rna_library_accession = do.call("rbind", strsplit(as.character(obj$cellID), "[.]"))[,2]


rna_metadata = left_join(obj@meta.data[,c("cellID","rna_library_accession","nCount_RNA","percent.mt","percent.ribo")],
                         metadata)


"Non-unique features (rownames) present in the input matrix, making unique"
[1m[22mJoining, by = "rna_library_accession"


In [29]:
atac_metadata = read.csv("../snatac/ref/atac_unfiltered_metadata_all_tissues.csv")
atac_metadata = atac_metadata[,c("cellID","Sample","nFrags","TSSEnrichment","original_atac_bc")]
atac_metadata$rna_library_accession = atac_metadata$Sample
atac_metadata$Sample = NULL
atac_metadata = left_join(atac_metadata, metadata)
atac_metadata = atac_metadata[atac_metadata$tissue == "Hippocampus",]


[1m[22mJoining, by = "rna_library_accession"


In [30]:
combined_metadata = full_join(rna_metadata,atac_metadata)
combined_metadata$rna_bc = do.call("rbind", strsplit(as.character(combined_metadata$cellID), "[.]"))[,1]
combined_metadata$passed_filtering = as.numeric(combined_metadata$cellID %in% combined.sct$cellID)


[1m[22mJoining, by = c("cellID", "rna_library_accession", "technology", "species",
"tissue", "sex", "timepoint", "rep", "sample", "depth1", "depth2",
"experiment", "experiment_batch", "integration_batch", "run_number",
"lower_nCount_RNA", "upper_nCount_RNA", "lower_nFeature_RNA",
"upper_doublet_scores", "upper_percent.mt", "rna_experiment_accession",
"rna_file_accession", "multiome_experiment_accession",
"atac_experiment_accession", "atac_library_accession", "atac_file_accession",
"minTSS", "minFrags")
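
The full join keeps every cell seen in either modality, and `passed_filtering` then flags the cells that survived into the integrated object. A minimal sketch of that logic on toy tables is given here, using base R `merge(..., all = TRUE)` as a stand-in for `full_join`; the cell IDs and values are invented.

```r
# Toy per-modality metadata; c1 is RNA-only and c4 is ATAC-only.
rna <- data.frame(cellID = c("c1", "c2", "c3"),
                  nCount_RNA = c(100, 250, 80),
                  stringsAsFactors = FALSE)
atac <- data.frame(cellID = c("c2", "c3", "c4"),
                   nFrags = c(5000, 1200, 3000),
                   stringsAsFactors = FALSE)

# all = TRUE keeps unmatched rows from both sides, filling gaps with NA.
combined <- merge(rna, atac, by = "cellID", all = TRUE)

# Hypothetical IDs of cells retained after filtering.
filtered_ids <- c("c2")
combined$passed_filtering <- as.numeric(combined$cellID %in% filtered_ids)
print(combined)
```

Cells present in only one modality get `NA` in the other modality's columns, so downstream code should expect missing values in `nCount_RNA` or `nFrags`.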


# Save cell metadata file

In [31]:
combined_metadata = combined_metadata[,c("cellID",
                                         "rna_library_accession","rna_bc",
                                         "atac_library_accession","original_atac_bc",
                                         "nCount_RNA","nFrags",
                                         "percent.mt","percent.ribo","TSSEnrichment","passed_filtering",
                                        "rna_experiment_accession","atac_experiment_accession",
                                         "rna_file_accession","atac_file_accession","multiome_experiment_accession",
                                         "sample","technology","tissue","sex","timepoint","rep")]
colnames(combined_metadata) = c("cell_id","rna_dataset","rna_barcode","atac_dataset","atac_barcode",
                                "rna_umi_count","atac_fragment_count","rna_frac_mito","rna_frac_ribo","atac_tss_enrichment","passed_filtering",
                                        "rna_experiment_accession","atac_experiment_accession",
                                         "rna_file_accession","atac_file_accession","multiome_experiment_accession",
                                         "sample","technology","tissue","sex","timepoint","rep")


In [33]:
fn = paste0("ref/hippocampus/hippocampus_integrated_analysis_cell_metadata_noheader.tsv")
write.table(combined_metadata,file=fn,
            sep="\t",
            quote=F,
            row.names = F)
out = paste0("ref/hippocampus/hippocampus_integrated_analysis_cell_metadata.tsv")
system(paste0("cat ref/cell_metadata_header ", fn, "> ", out))
system(paste0("rm ", fn))
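
The header is prepended above by shelling out to `cat` and `rm`. A minimal sketch of the same step in plain R, which avoids depending on a Unix shell, is shown here; the header lines and temporary file paths are hypothetical and do not reflect the contents of `ref/cell_metadata_header`.

```r
# Temporary files standing in for the header, the headerless table, and the output.
header_file <- tempfile(fileext = ".txt")
body_file <- tempfile(fileext = ".tsv")
out_file <- tempfile(fileext = ".tsv")

# Hypothetical header content.
writeLines(c("# hypothetical header line 1", "# hypothetical header line 2"), header_file)

# Write a toy table the same way the notebook does.
toy <- data.frame(cell_id = c("c1", "c2"), tissue = "Hippocampus")
write.table(toy, file = body_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Concatenate header and table, then delete the intermediate file.
writeLines(c(readLines(header_file), readLines(body_file)), out_file)
unlink(body_file)

result <- readLines(out_file)
print(result)
```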