Josh/vignette (#158)

* Created Tissue Annotation Vignette * Synced Liver_facs with the more detailed vignette. * fixed typos.
czbiohub-sf · Mar 19, 2018 · 2ccc12a · 2ccc12a
1 parent 0791b44
commit 2ccc12a
Show file tree

Hide file tree

Showing 3 changed files with 107 additions and 129 deletions.
diff --git a/00_data_ingest/02_tissue_analysis_rmd/Liver_facs.Rmd b/00_data_ingest/02_tissue_analysis_rmd/Liver_facs.Rmd
@@ -5,8 +5,6 @@ output:
   html_notebook: default
 ---
 
-## Preprocess and Cluster
-
 Enter the name of the tissue you want to analyze.
 
 ```{r}
@@ -16,34 +14,25 @@ source(here("00_data_ingest", "02_tissue_analysis_rmd", "boilerplate.R"))
 tiss = load_tissue_facs(tissue_of_interest)
 ```
 
-The loading function above includes basic preprocessing of the gene x cell matrix, including
-
-1. Filtering out cells with fewer than 500 genes or 50,000 reads.
-2. Log-normalizing counts for each cell (log(1 + counts per million)).
-3. Shifting and scaling each gene to have mean 0 and variance 1.
-4. Finding variable genes (those with high dispersion, given their mean).
-5. PCA to project the variable genes to 20 dimensions.
-
 We can visualize top genes in each principal component.
 
-```{r, echo=FALSE, fig.height=4, fig.width=8}
+```{r, echo=FALSE}
 PCHeatmap(object = tiss, pc.use = 1:3, cells.use = 500, do.balanced = TRUE, label.columns = FALSE, num.genes = 8)
 ```
 
-To further reduce variance in the data, we will project onto the top principal components. This has the effect of keeping the major directions of variation in the data and, ideally, supressing noise. A decent rule of thumb is to pick the elbow in the plot below.
+We then project onto just the top principal components. This has the effect of keeping the major directions of variation in the data and, ideally, supressing noise. A decent rule of thumb is to pick the elbow in the plot below.
 
 ```{r}
 PCElbowPlot(object = tiss)
 ```
 
 Choose the number of principal components to use.
+
 ```{r}
-# Set number of principal components. 
 n.pcs = 11
 ```
 
-
-The clustering is performed using on a shared-nearest-neighbors graph on the cells. Two cells are connected in the graph of their k-neighborhoods of cells overlap. The (enhanced) Louvain algorithm looks for groups of cells with high modularity--more connections within the group than between groups. The resolution parameter determines the tradeoff between in-group connections and between-group connections in that objective. Higher resolution will give more clusters, lower resolution will give fewer.
+## Cluster
 
 ```{r}
 # Set resolution 
@@ -67,32 +56,26 @@ TSNEPlot(object = tiss, do.label = T, pt.size = 1.2, label.size = 4)
 
 Check expression of genes useful for indicating cell type.
 
-hepatocyte: Alb, Ttr, Apoa1, and Serpina1c
-endothelial cells: Pecam1, Nrp1, Kdr+ and Oit3+
-Kuppfer cells: Emr1, Clec4f, Cd68, Irf7
-NK/NKT cells: Zap70, Il2rb, Nkg7, Cxcr6, Klr1c, Gzma
-B cells: Cd79a, Cd79b, Cd74 and Cd19
-
 ```{r}
-genes_hep = c('Alb', 'Ttr', 'Apoa1', 'Serpina1c')
-genes_endo = c('Pecam1', 'Nrp1', 'Kdr','Oit3')
-genes_kuppfer = c('Emr1', 'Clec4f', 'Cd68', 'Irf7')
-genes_nk = c('Zap70', 'Il2rb', 'Nkg7', 'Cxcr6', 'Gzma')
-genes_b = c('Cd79a', 'Cd79b', 'Cd74')
+genes_hep = c('Alb', 'Ttr', 'Apoa1', 'Serpina1c') #hepatocyte
+genes_endo = c('Pecam1', 'Nrp1', 'Kdr','Oit3') # endothelial
+genes_kuppfer = c('Emr1', 'Clec4f', 'Cd68', 'Irf7') # Kuppfer cells
+genes_nk = c('Zap70', 'Il2rb', 'Nkg7', 'Cxcr6') # Natural Killer cells
+genes_b = c('Cd79a', 'Cd79b', 'Cd74', 'Cd19') # B Cells
 
 genes_all = c(genes_hep, genes_endo, genes_kuppfer, genes_nk, genes_b)
 ```
 
-In the tSNE plots below, the intensity of each point represents the gene expression, scaled to mean 0 and variance 1 across all cells.
+In the tSNE plots below, the intensity of each point represents the log-normalized gene expression $N_{ij}$.
 
-```{r, echo=FALSE, fig.height=24, fig.width=12}
-FeaturePlot(tiss, genes_all, pt.size = 3, nCol = 3, cols.use = c("lightgrey", "blue"))
+```{r, echo=FALSE, fig.height=20, fig.width=16}
+FeaturePlot(tiss, genes_all, pt.size = 3, nCol = 4, cols.use = c("lightgrey", "blue"), no.legend = F)
 ```
 
 Dotplots show, for each cluster and gene, the fraction of cells with at least one read for the gene (circle size) and the average scaled expression for that gene among the cells expressing it (circle color).
 
-```{r, echo=FALSE, fig.height=10, fig.width=6}
-DotPlot(tiss, genes_all, plot.legend = T, col.max = 2.5, do.return = T) + coord_flip()
+```{r, echo=FALSE}
+DotPlot(tiss, genes_all, plot.legend = T, col.max = 2.5, x.lab.rot = T)
 ```
 
 The low but nonzero levels of Albumin present in all clusters is consistent with a small amount of leakage, either through physical contamination or index hopping. Nevertheless, the absolute levels of expression confirm a sharp difference between the hepatocyte clusters and the others.
@@ -104,16 +87,17 @@ VlnPlot(tiss, 'Alb', use.raw = T, do.return = T)
 To confirm the identity of a cluster, you can inspect the genes differentially expressed in that cluster compared to the others.
 
 ```{r}
-clust.markers7 <- FindMarkers(object = tiss, ident.1 = 7, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)
+clust.markers7 <- FindMarkers(object = tiss, ident.1 = 7, 
+                              only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)
 ```
 
 The top markers for cluster 7 include histocompatibility markers H2-*, consistent with the expression of other B-cell markers seen above.
 
 ```{r}
-clust.markers7
+head(clust.markers7)
 ```
 
-Using the markers, we can confidentaly label the clusters:
+Using the markers, we can confidentaly label the clusters. We provide both a free annotation (where anything name can be used) and a cell ontology class. The latter uses a controlled vocabulary for easy comparison between studies and different levels of the taxonomy.
 
 ```{r}
 tiss <- StashIdent(object = tiss, save.name = "cluster.ids")
@@ -129,7 +113,7 @@ free_annotation <- c(
   "kuppfer",
   "hepatocyte",
   "B cell",
-  "Nk/NKT cells")
+  "NK/NKT cells")
 
 cell_ontology_class <-c(
   "endothelial cell of hepatic sinusoid",
@@ -148,13 +132,13 @@ tiss = stash_annotations(tiss, cluster.ids, free_annotation, cell_ontology_class
 
 ## Checking for batch effects
 
-Color by metadata, like plate barcode, to check for batch effects. Here we see that the clusters are segregated by sex, though cell types are mixed within each population.
+Color by metadata, like plate barcode, to check for batch effects. Here we see that the clusters are segregated by sex.
 
 ```{r}
 TSNEPlot(object = tiss, do.return = TRUE, group.by = "mouse.id")
 ```
 
-Every cluster contains cell types from multiple mice.
+Nevertheless, every cluster contains cells from multiple mice.
 
 ```{r}
 table(FetchData(tiss, c('mouse.id','ident')) %>% droplevels())
@@ -190,3 +174,5 @@ save(tiss, file=filename)
 ```{r}
 save_annotation_csv(tiss, tissue_of_interest, "facs")
 ```
+
+
diff --git a/00_data_ingest/02_tissue_analysis_rmd/Liver_facs.nb.html b/00_data_ingest/02_tissue_analysis_rmd/Liver_facs.nb.html
diff --git a/00_data_ingest/02_tissue_analysis_rmd/Tissue_Annotation_Vignette.Rmd b/00_data_ingest/02_tissue_analysis_rmd/Tissue_Annotation_Vignette.Rmd
@@ -139,7 +139,7 @@ $$Q = \sum_{ij} \left (A_{ij} - \gamma \frac{k_i k_j}{2m} \right ) \delta(c_i, c
 
 where $A_{ij}$ is the weighted adjacency matrix, $k_i$ and $k_j$ are the weighted degrees of cells $i$ and $j$, $m$ is the total weight of edges in the graph, $c_i$ denotes cluster membership, and $\delta(c_i,c_j)$ is $1$ if $i$ and $j$ are in the same cluster, and $0$ otherwise.
 
-The resolution $\gamma$ is a tuneable parameter in this analysis that sets the tradeoff off between in-group connections and between-group connections. High resolution favors smaller clusters.
+The resolution $\gamma$ is a tuneable parameter in this analysis that sets the tradeoff between in-group connections and between-group connections. High resolution favors smaller clusters.
 
 ```{r}
 # Set resolution 
@@ -204,7 +204,7 @@ The top markers for cluster 7 include histocompatibility markers H2-*, consisten
 head(clust.markers7)
 ```
 
-Using the markers, we can confidentaly label the clusters. We provide both a free annotation (where anything name can be used) and a cell ontology class. The latter uses a controlled vocabulary for easy comparison between studies and different levels of the taxonomy.
+Using the markers, we can confidentaly label the clusters. We provide both a free annotation (where any name can be used) and a cell ontology class. The latter uses a controlled vocabulary for easy comparison between studies and different levels of the taxonomy.
 
 ```{r}
 tiss <- StashIdent(object = tiss, save.name = "cluster.ids")