Make bone marrow scRNA-seq example dataset smaller

Make the example data smaller by excluding unexpressed genes and reducing the number of cells. This should reduce the memory usage and the time needed to build and run the tapseq_target_genes vignette.
argschwind · Mar 11, 2020 · 945a66d
1 parent 85c321c
commit 945a66d
Showing 4 changed files with 17 additions and 11 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: TAPseq
 Type: Package
 Title: Targeted scRNA-seq primer design for TAP-seq
-Version: 0.99.2
+Version: 0.99.3
 Authors@R: c(
   person("Andreas", "Gschwind", email = "andreas.gschwind@stanford.edu",
          role = c("aut", "cre"), comment = c(ORCID = "0000-0002-0769-6907")),

diff --git a/data-raw/bone_marrow_genex.R b/data-raw/bone_marrow_genex.R
@@ -1,4 +1,5 @@
 library(Seurat)
+library(Matrix)
 
 ## create Seurat object containing cell population example data
 
@@ -22,19 +23,24 @@ cell_idents <- Idents(NicheData10x_filt)
 object <- CreateSeuratObject(counts = counts)
 Idents(object) <- cell_idents
 
-# subsample cells to about 10% of cells (~350cells)
-set.seed("20200115")
+# get top 5% cells per population (~180cells)
+n_txs <- colSums(object)
+cell_idents <- cell_idents[names(sort(n_txs, decreasing = TRUE))]
 idents_split <- split(cell_idents, f = cell_idents)
-idents_sampled <- lapply(idents_split, FUN = function(x) {
-  sample(x, size = length(x) * 0.10)
+idents_top <- lapply(idents_split, FUN = function(x) {
+  head(x, n = length(x) * 0.05)
 })
 
 # create vector with cell ids for these cells
-names(idents_sampled) <- NULL
-sampled_cells <- names(unlist(idents_sampled))
+names(idents_top) <- NULL
+top_cells <- names(unlist(idents_top))
 
 # subset object to these cells
-bone_marrow_genex <- subset(object, cells = sampled_cells)
+bone_marrow_genex <- subset(object, cells = top_cells)
+
+# remove any genes with 10 or fewer total transcripts
+txs <- rowSums(GetAssayData(bone_marrow_genex))
+bone_marrow_genex <- subset(bone_marrow_genex, features = names(txs[txs > 10]))
 
 # save data as RData files in data directory
 usethis::use_data(bone_marrow_genex, overwrite = TRUE)
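
The selection logic in the script above can be tried without Seurat or the original data. The following is a minimal base-R sketch on a made-up count matrix: the matrix, the population labels `idents`, and the parameters `top_fraction` and `min_txs` are all hypothetical stand-ins, not part of the package. Unlike the script, it rounds the per-population cell count down explicitly with `floor()`.

```r
# Toy count matrix: 6 genes x 20 cells (all names and values are made up).
set.seed(1)
counts <- matrix(rpois(6 * 20, lambda = 3), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("cell", 1:20)))
counts["gene6", ] <- 0  # a gene that is never detected

# Hypothetical population label per cell, named by cell id.
idents <- setNames(rep(c("A", "B"), each = 10), colnames(counts))

# Rank cells by total transcripts, then keep the top fraction per population.
n_txs <- colSums(counts)
idents_sorted <- idents[names(sort(n_txs, decreasing = TRUE))]
top_fraction <- 0.2  # assumed value for this toy example
ids_by_population <- split(names(idents_sorted), idents_sorted)
top_cells <- unlist(lapply(ids_by_population, function(ids) {
  head(ids, n = floor(length(ids) * top_fraction))
}), use.names = FALSE)

# Subset to those cells, then keep only genes above a total-count threshold.
sub <- counts[, top_cells, drop = FALSE]
min_txs <- 10  # genes with exactly min_txs transcripts are dropped as well
sub <- sub[rowSums(sub) > min_txs, , drop = FALSE]
```

Because cells are sorted by total transcripts before splitting, `head()` picks the deepest-sequenced cells in each population rather than a random sample, which makes the result deterministic without `set.seed()`.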
diff --git a/data/bone_marrow_genex.rda b/data/bone_marrow_genex.rda
diff --git a/vignettes/tapseq_target_genes.Rmd b/vignettes/tapseq_target_genes.Rmd
@@ -70,11 +70,11 @@ length(target_genes_100)
 To intuitively assess how well a chosen set of target genes distinguishes cell types, we can use
 UMAP plots based on the full gene expression data and on target genes only.
 ```{r, message=FALSE, warning=FALSE, fig.height=3, fig.width=7.15}
-plotTargetGenes(bone_marrow_genex, target_genes = target_genes_cv)
+plotTargetGenes(bone_marrow_genex, target_genes = target_genes_100)
 ```
 
-We can see that the expression of  the `r length(target_genes_cv)` automatically selected target
-genes groups cells of different populations together.
+We can see that the expression of the `r length(target_genes_100)` selected target genes groups
+cells of different populations together.
 
 A good follow up would be to cluster the cells based on only the target genes following the same
 workflow used to define the cell identities in the original object. This could then be used to