<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
Use this cell to log in to the GenePattern Cloud server.
</div>

In [23]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))

GPAuthWidget()

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to initialize notebook parameters and download the input data.
</div>

In [3]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import os 
import urllib.request
import subprocess
import rpy2
%load_ext nbtools.r_support

# for setting default plot size see
# https://stackoverflow.com/questions/40745163/jupyter-notebook-rpy2-rmagics-how-to-set-the-default-plot-size
# but it's pretty brittle: it fails if you call it twice or change it after the function is set




@genepattern.build_ui(name="Notebook Set up", description="Setup the R and Python environments for the rest of this notebook. Downloads the example dataset to the notebook server.", 
                      parameters={
                            "output_var": {
                                "hide": True,
                            }
})
def notebook_setup():
    %load_ext rpy2.ipython
    
    print("Configuring visualization libraries...")
    import seaborn as sns
    sns.set(rc={'figure.figsize':(8,4.5)})
    
    print("Retrieving input data...")
    os.makedirs('data/pbmc3k/', exist_ok=True)
    url = 'https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz'
    urllib.request.urlretrieve(url, 'data/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz')

    subprocess.run(["tar", "-xvf", "data/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz",
                "--directory", "data/pbmc3k/" ])
    
    
    print("Done.")


UIBuilder(description='Setup the R and Python environments for the rest of this notebook. Downloads the exampl…

<div id="header" class="fluid-row" style="color: #333333; font-family: 'open sans' , 'arial' , sans-serif; font-size: 14px;">
<h1 id="Seurat---Guided-Clustering-Tutorial" class="title toc-ignore" style="margin-top: 20px; margin-bottom: 10px; font-size: 38px; padding: 0px; font-family: 'ubuntu' , 'tahoma' , 'helvetica neue' , 'helvetica' , 'arial' , sans-serif;" data-toc-modified-id="Seurat---Guided-Clustering-Tutorial-2"><a id="Seurat---Guided-Clustering-Tutorial-2" class="toc-mod-link"></a>Seurat - Guided Clustering Tutorial</h1>
<h4 id="Compiled:-January-2020" class="date" style="margin: 10px 0px; font-size: 18px; padding: 0px; font-family: 'ubuntu' , 'tahoma' , 'helvetica neue' , 'helvetica' , 'arial' , sans-serif;" data-toc-modified-id="Compiled:-January-2020-2.0.0.1"><a id="Compiled:-January-2020-2.0.0.1" class="toc-mod-link"></a>Compiled: January 2020</h4>
</div>
<hr style="overflow: visible; margin: 20px 0px; padding: 0px; color: #333333; font-family: 'open sans' , 'arial' , sans-serif; font-size: 14px;" />
<div id="setup-the-seurat-object" class="section level3" style="color: #333333; font-family: 'open sans' , 'arial' , sans-serif; font-size: 14px;">

# Set up the Seurat Object

For this tutorial, we will be analyzing a dataset of **Peripheral Blood Mononuclear Cells (PBMCs)** freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data was downloaded as part of the **Notebook Set Up** cell (above) and the original files can be found [here](https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz).

The three files for this dataset include
- **matrix.mtx**: Triples of gene ID index, cell barcode index, and UMI count.
- **barcodes.tsv**: The barcodes referenced by the indices in the **matrix.mtx** file.
- **genes.tsv**: All the annotated genes, one per row. Referenced by the indices in the **matrix.mtx** file. Sometimes named **features.tsv**.

Seurat [**1**] starts with a feature (e.g. gene) expression matrix. The expected format of the input matrix is `features x cells`. Here we use the information embedded in the three files described above for this purpose. `barcodes.tsv` and `genes.tsv` contain the column names (cells) and the row names (genes) respectively. `matrix.mtx` stores the expression data in a *sparse* format to save hard drive space. Because single cell expression matrices have many zeros, `matrix.mtx` only lists the matrix entries with non-zero values.
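To make the sparse layout concrete, here is a minimal sketch of how coordinate-format triples expand into a `features x cells` matrix. The toy text below only mimics the layout of `matrix.mtx`; its values are made up, and the helper names `read_mtx_triples` and `to_dense` are hypothetical, not part of Seurat or `Read10X`.

```python
# Toy Matrix Market text in the same coordinate layout as matrix.mtx.
# The values are made up for illustration; they are not from the PBMC dataset.
MTX_TEXT = """%%MatrixMarket matrix coordinate integer general
3 4 5
1 1 2
1 3 1
2 2 4
3 1 7
3 4 3
"""


def read_mtx_triples(text):
    """Parse coordinate-format text into (n_rows, n_cols, {(row, col): value})."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("%")]
    # The first non-comment line holds: number of rows, columns, and non-zero entries.
    n_rows, n_cols, n_entries = (int(x) for x in lines[0].split())
    entries = {}
    for ln in lines[1:]:
        row, col, value = ln.split()
        # Matrix Market indices are 1-based; shift to 0-based for Python.
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    assert len(entries) == n_entries
    return n_rows, n_cols, entries


def to_dense(n_rows, n_cols, entries):
    """Expand the sparse triples into a features x cells list of lists."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense


n_genes, n_cells, triples = read_mtx_triples(MTX_TEXT)
dense_counts = to_dense(n_genes, n_cells, triples)
```

Only 5 of the 12 entries are stored in the text; every position that is not listed is implicitly zero.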

We start by reading the data. The following GenePattern cell wraps the `Read10X` function, which reads in the output of the [cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) pipeline from 10X Genomics, returning a gene x cell count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column). 

We next use the count matrix to create a `Seurat` object. The object serves as a container that includes both data (i.e. the count matrix) and analysis (e.g. the principal component analysis -- PCA, or clustering results) for a single-cell dataset. For a technical discussion of the `Seurat` object structure, check out Seurat's [Github Wiki](https://github.com/satijalab/seurat/wiki). The `Seurat` object has *slots* that store different types of data to be accessed by later code. For example, the counts matrix is accessed by executing `pbmc[["RNA"]]@counts`. In this tutorial we will not be writing any code, so knowledge of the `Seurat` object structure is not required.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to load the 10X input files into the notebook. 
</div>

In [4]:
%%r_build_ui { "name": "Setup Seurat Objects", "parameters": {"tenx_data_dir": {"name": "10X data dir", "default": "data/pbmc3k/filtered_gene_bc_matrices/hg19/"}, "output_var": { "hide": "True" } } }

setupR <- function(tenx_data_dir){
    write("Loading libraries...", stdout())
#     library(dplyr)
    suppressMessages(library(Seurat))
    suppressMessages(library(scater))
    fig_height=450
    fig_width=800
    # Load the PBMC dataset
    write("Loading the dataset...", stdout())
    #suppressMessages(pbmc.data <- Read10X(data.dir = "data/pbmc3k/filtered_gene_bc_matrices/hg19/"))
    suppressMessages(pbmc.data <- Read10X(data.dir = tenx_data_dir))
    
#     raw_counts <- readSparseCounts(file="https://datasets.genepattern.org/data/module_support_files/Conos/HNSCC_noribo.txt")
#     hnscc <- CreateSeuratObject(counts = raw_counts, project = "HNSCC")
    
    # Initialize the Seurat object with the raw (non-normalized data).
    suppressMessages(pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200))
    write("Done", stdout())
#     return(hnscc)
    return(pbmc)
}
suppressMessages(pbmc <- setupR(tenx_data_dir))

UIBuilder(function_import='nbtools._r_wrappers["7ACA96C0728E8D9F20849567356DD40C"]', name='Setup Seurat Object…

# Standard pre-processing workflow

The steps below encompass the standard pre-processing workflow of scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.

## QC and selecting cells for further analysis

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics [commonly used](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4758103/) by the community include
- The number of unique genes detected in each cell
    - Low-quality cells or empty droplets will often have very few genes
    - Cell doublets or multiplets may exhibit an aberrantly high gene count
- Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
- The percentage of reads that map to the mitochondrial genome
    - Low-quality/dying cells often exhibit extensive mitochondrial contamination
    - We calculate mitochondrial QC metrics with the `PercentageFeatureSet` function, which calculates the percentage of counts originating from a set of features
    - We use the set of all genes starting with `MT-` as a set of mitochondrial genes
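The three metrics above can be sketched in plain Python on a toy count matrix. This is an illustration of the arithmetic only, not Seurat's implementation; the gene list, the counts, and the function name `qc_metrics` are assumptions made for the example.

```python
# Hypothetical toy data: rows are genes, columns are cells.
gene_names = ["MT-CO1", "MT-ND1", "CD3E", "LYZ"]
counts = [
    [5, 0, 1],   # MT-CO1
    [5, 0, 0],   # MT-ND1
    [40, 3, 0],  # CD3E
    [50, 7, 9],  # LYZ
]


def qc_metrics(gene_names, counts, prefix="MT-"):
    """Return one dict of QC metrics per cell (column)."""
    n_cells = len(counts[0])
    metrics = []
    for cell in range(n_cells):
        column = [row[cell] for row in counts]
        n_count = sum(column)                       # total molecules in the cell
        n_feature = sum(1 for v in column if v > 0)  # genes with at least one count
        mito = sum(v for name, v in zip(gene_names, column) if name.startswith(prefix))
        percent_mt = 100.0 * mito / n_count if n_count else 0.0
        metrics.append(
            {"nCount_RNA": n_count, "nFeature_RNA": n_feature, "percent.mt": percent_mt}
        )
    return metrics


cell_metrics = qc_metrics(gene_names, counts)
```

The `prefix` check plays the role of the `^MT-` pattern: only gene names that start with the prefix contribute to the mitochondrial total.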

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to calculate and annotate mitochondrial gene percentage for each cell.
</div>

In [5]:
%%r_build_ui { "name": "Add Mitochondrial QC Metrics", "parameters": { "column_name": { "type": "string", "default":"percent.mt" },"pattern": { "type": "string", "default":"MT-" }, "output_var": { "hide": "True" } } }

set_mito_qc <- function(colName, pat) {
    write("Calculating the frequency of mitochondrial genes...", stdout())
    pattern <- paste("^", trimws(pat, which = "both"), sep="")
    
    # The [[ operator can add columns to object metadata. This is a great place to stash QC stats
    pbmc[[colName]] <- PercentageFeatureSet(pbmc, pattern = pattern)
    write("Done!", stdout())
    return(pbmc)
}


suppressMessages(pbmc <- set_mito_qc(column_name, pattern))

UIBuilder(function_import='nbtools._r_wrappers["335BC149DA9E8370E10E8277B81380B6"]', name='Add Mitochondrial Q…

In the example below, we visualize QC metrics and use these to filter cells.
- We filter out cells that have unique feature counts over 2,500 or less than 200
- We filter out cells that have >5% mitochondrial counts
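The thresholds above amount to a simple per-cell predicate. The sketch below applies it to a handful of made-up cells; the barcodes, metric values, and the function name `passes_qc` are hypothetical, and the strict inequalities mirror the defaults listed above.

```python
# Made-up QC metrics for four cells, for illustration only.
cells = [
    {"barcode": "cell_a", "nFeature_RNA": 150, "percent.mt": 2.0},   # too few genes
    {"barcode": "cell_b", "nFeature_RNA": 900, "percent.mt": 3.1},   # passes
    {"barcode": "cell_c", "nFeature_RNA": 3100, "percent.mt": 1.5},  # too many genes
    {"barcode": "cell_d", "nFeature_RNA": 1200, "percent.mt": 8.4},  # high mitochondrial %
]


def passes_qc(cell, min_n_features=200, max_n_features=2500, max_percent_mitochondrial=5.0):
    """True when the cell lies strictly inside all three thresholds."""
    return (
        min_n_features < cell["nFeature_RNA"] < max_n_features
        and cell["percent.mt"] < max_percent_mitochondrial
    )


kept = [c["barcode"] for c in cells if passes_qc(c)]
```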

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to visualize various quality control metrics.
</div>

In [6]:
%%r_build_ui -w 800 { "width": 10, "height": 300, "name": "Triple Violin Plot", "parameters": { "first_feature": { "type": "string", "default":"nFeature_RNA" }, "second_feature":{ "type": "string", "default":"nCount_RNA"}, "third_feature": { "type": "string", "default":"percent.mt" }, "output_var":{"hide":"True"} } }
# Visualize QC metrics as a violin plot
#VlnPlot(pbmc, features = c(first_feature, second_feature, third_feature), ncol = 3)
tripleViolin <- function(first, second, third){
     
    feats <- c(first, second, third)
    plot(VlnPlot(pbmc, features = feats, ncol = 3, combine=TRUE), fig.height=5, fig.width=15)
    return("")
}

tripleViolin(first_feature, second_feature, third_feature)

UIBuilder(function_import='nbtools._r_wrappers["A34D52FB75F8FFBD9CEAC96D82E57CE2"]', name='Triple Violin Plot'…

### Filtering data

Here we select which cells to filter out of our dataset based on the plots above.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to filter the data based on quality control metric thresholds.
</div>

In [7]:
%%r_build_ui { "name": "Subset Data", "parameters": { "min_n_features": { "type": "number", "default":"200" },"max_n_features": { "type": "number", "default":"2500" },"max_percent_mitochondrial": { "type": "number", "default":"5" }, "output_var": { "hide": "True" } } }

my_subset <- function(min_n_features, max_n_features, max_percent_mitochondrial){
#     print(pbmc)
    pbmc <- subset(pbmc, subset = nFeature_RNA > min_n_features & nFeature_RNA < max_n_features & percent.mt < max_percent_mitochondrial)
#     print(pbmc)
    write('filtering done!', stdout())
    return(pbmc)
}

pbmc <- my_subset(min_n_features, max_n_features, max_percent_mitochondrial)

UIBuilder(function_import='nbtools._r_wrappers["79840578BF9EC0F7F351A9D20E586F92"]', name='Subset Data', origi…

# Normalizing the data

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method, "LogNormalize", that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in `pbmc[["RNA"]]@data`. The methods we will use in this notebook all assume log-normalized data; other methods may not make that assumption, in which case other normalization methods can be considered.
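As a rough sketch of the arithmetic described above, each count is divided by the cell's total, multiplied by the scale factor, and passed through `log1p` (the natural log of one plus the value). The function name `log_normalize` and the example counts are assumptions for illustration; this is not a substitute for `NormalizeData`.

```python
import math


def log_normalize(cell_counts, scale_factor=10000):
    """Log-normalize the counts of a single cell."""
    total = sum(cell_counts)
    if total == 0:
        # An empty cell has nothing to normalize.
        return [0.0 for _ in cell_counts]
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


# Made-up counts for one cell across three genes.
normalized = log_normalize([0, 1, 9])
```

Zero counts stay at zero after the transform, so the matrix remains sparse.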

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to log-normalize the input data.
</div>

In [8]:
%%r_build_ui { "name": "Normalize", "parameters": { "method": { "type": "string", "default":"LogNormalize" },"scale_factor": { "type": "number", "default":"10000" }, "output_var": { "hide": "True" } } }

norm_pbmc <- function(meth, scale){
    write("Normalizing data...", stdout())
    invisible(pbmc <- NormalizeData(pbmc, normalization.method = meth, scale.factor = scale, verbose = F))
    write('Normalization done!', stdout())
    return(pbmc)
}

pbmc <- norm_pbmc(method, scale_factor)

UIBuilder(function_import='nbtools._r_wrappers["07451326C6384AF37DC263EC1B74A12C"]', name='Normalize', origin=…

# Identification of highly variable features (feature selection)

We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., they are highly expressed in some cells, and lowly expressed in others). [The literature](https://www.nature.com/articles/nmeth.2645) suggests that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

Seurat's variable feature selection is described in detail [here](https://www.biorxiv.org/content/early/2018/11/02/460147.full.pdf). It improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and it is implemented in the `FindVariableFeatures` function. By default, it returns 2,000 features per dataset. These will be used in downstream analysis, like PCA.
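To give a feel for what "highly variable" means, the sketch below ranks genes by a variance-to-mean ratio. This is a deliberately simplified stand-in and is not the `vst` method that `FindVariableFeatures` uses; the gene names, values, and the function name `top_variable_features` are made up for illustration.

```python
import statistics

# Made-up expression values for three genes across four cells.
expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],  # identical in every cell
    "GENE_B": [0.0, 4.0, 0.0, 4.0],  # high in some cells, absent in others
    "GENE_C": [2.0, 2.5, 2.0, 2.5],  # only mildly variable
}


def top_variable_features(expression, n_features=2):
    """Rank genes by variance / mean and return the top n_features names."""
    dispersion = {}
    for gene, values in expression.items():
        mean = statistics.mean(values)
        dispersion[gene] = statistics.variance(values) / mean if mean > 0 else 0.0
    ranked = sorted(dispersion, key=dispersion.get, reverse=True)
    return ranked[:n_features]


selected = top_variable_features(expression)
```

Dividing by the mean is a crude way to account for the fact that variance tends to grow with expression level, which is the relationship the real method models explicitly.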

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to select and visualize the most variable features in the data.
</div>

In [9]:
%%r_build_ui -w 800 { "name": "Feature Selection", "parameters": { "method": { "type": "string", "default":"vst","hide":"True" },"num_features": { "type": "number", "default":"2000" }, "num_to_label":{"type": "number", "default": "10", "description": "label the top N features in the plot."}, "output_var": { "hide": "True" } } }
#%%R -w 800 -h 450

feat_sel_plot <- function(meth, nFeat, nLabel){
    write("Identifying variable features...", stdout())
    invisible(capture.output(pbmc <- FindVariableFeatures(pbmc, selection.method = meth, nfeatures = nFeat, 
                                                         verbose=F)))
    write("Done!", stdout())

    # Identify the most highly variable genes to label (nLabel of them, 10 by default)
    top10 <- head(VariableFeatures(pbmc), nLabel)

    # plot variable features with and without labels
    invisible(capture.output(plot1 <- VariableFeaturePlot(pbmc)))
    invisible(capture.output(plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)))
    print(plot2)
    #plot(CombinePlots(plots = list(plot1, plot2)))
    return(pbmc)
}

pbmc <- feat_sel_plot(method, num_features, num_to_label)

UIBuilder(function_import='nbtools._r_wrappers["8C34B15B1E4C0022B38A13827CCE9FD7"]', name='Feature Selection',…

# Scaling the data

Next, we apply a linear transformation ('scaling') that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The `ScaleData` function:
- Shifts the expression of each gene so that the mean expression across cells is 0
- Scales the expression of each gene so that the variance across cells is 1
    - This step gives equal weight in downstream analyses (which may apply weights based on variance), so that highly-expressed genes do not dominate
- The results of this are stored in `pbmc[["RNA"]]@scale.data`
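The two steps above are a per-gene z-score. A minimal sketch, assuming the sample standard deviation (an illustrative choice) and ignoring any additional options `ScaleData` may apply:

```python
import statistics


def scale_gene(values):
    """Center one gene's expression at 0 and scale it to unit variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        # A gene with no variation carries no signal; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Made-up expression of one gene across five cells.
scaled = scale_gene([1.0, 2.0, 3.0, 4.0, 5.0])
```

After scaling, a gene expressed at 1,000 counts and a gene expressed at 10 counts contribute on the same footing, which is why highly expressed genes no longer dominate.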

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to scale the input data.
</div>

In [10]:
%%r_build_ui {"name": "Scale Data", "parameters": {"output_var":{"hide": "True"}}}
myscale <- function(pbmc){
    write("Scaling data...", stdout())
    all.genes <- rownames(pbmc)
    invisible(capture.output(pbmc <- ScaleData(pbmc, features = all.genes, verbose = F)))
    write('done!', stdout())
    return(pbmc)
}
pbmc <- myscale(pbmc)

UIBuilder(function_import='nbtools._r_wrappers["AD02F8D067B3F9344DC615C90D7AD772"]', name='Scale Data', origin…

# Perform Principal Component Analysis (PCA)

Next we perform PCA on the scaled data. PCA is a method for reducing the dimensionality of large data in order to find useful latent patterns within that data. PCA defines a set of *principal components*, which are vectors that point in the directions of highest variance within the data (they "explain" that variance). The first principal component is the axis in the data with the highest variance, the second principal component explains the second most variance, and so on. Seurat uses the results of PCA to perform further analysis in a more efficient manner rather than operating on the large expression matrix directly.

By default, only the previously determined variable features are used as input, but a different subset can be specified using the `features` argument.
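For intuition, the first principal component can be found with power iteration on the covariance matrix. The sketch below is a toy illustration in plain Python, not how `RunPCA` computes its result; the function name `first_principal_component` and the five 2-D points are assumptions made for the example.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (unit direction, variance along it) for the leading PC."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Arbitrary non-zero start; it must not be orthogonal to the answer.
    vec = [1.0] * d
    for _ in range(n_iter):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Variance of the data projected onto the direction (the eigenvalue).
    variance = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)
    )
    return vec, variance


# Made-up points that lie close to the line y = x.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
pc1, pc1_variance = first_principal_component(points)
```

Because the points lie almost on a line, a single component captures nearly all of the variance, which is the sense in which PCA compresses the data.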

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to perform Principal Component Analysis (PCA) on the input data.
</div>

In [11]:
%%r_build_ui {"name":"Perform PCA", "parameters": {"num_pcs": {"type": "string", "default": "20"}, "output_var": {"hide":"True"}}}
mypca <-function(pbmc, npcs){ 
    feats <- VariableFeatures(object = pbmc, verbose = F)
    # num_pcs is declared as a string in the UI, so convert it before passing it to RunPCA
    pbmc <-RunPCA(pbmc, npcs = as.integer(npcs), features = feats, nfeatures.print=5, verbose = F)
   
    return(pbmc)
}
write("Performing PCA...", stdout())
pbmc <- mypca(pbmc, num_pcs)

UIBuilder(function_import='nbtools._r_wrappers["E8BD36ED4E46870F17EB7F8B17DC1C3E"]', name='Perform PCA', origi…

Seurat provides several useful ways of visualizing both cells and features that define the PCA, including `VizDimLoadings`, `DimPlot`, and `DimHeatmap`.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to visualize the first two principal components.
</div>

In [12]:
%%r_build_ui -w 800 {"name":"Vizualize Dimension Plot", "parameters": { "output_var": {"hide": "True"} }}

vdp <- function(p1){
    
    plot(DimPlot(pbmc, reduction = "pca"))
    return("")
}
vdp()

UIBuilder(function_import='nbtools._r_wrappers["E32BD5E45F688F904A7354F8A0318EBA"]', name='Vizualize Dimension…

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to visualize the first two dimension loadings.
</div>

In [13]:
%%r_build_ui -w 800 {"name":"Vizualize Dimension Loadings", "parameters": {"num_dims": {"type":"string", "default":"2"}, "output_var": {"hide": "True"} }}

vdl <- function(nDims){
    # Plot dimensions 1 through nDims (not just the first and last)
    dim_range <- 1:strtoi(nDims)
    print(VizDimLoadings(pbmc, dims = dim_range, reduction = "pca"))
    return("")
}
vdl(num_dims)

UIBuilder(function_import='nbtools._r_wrappers["68E956B063744E3054B2A9DAF4F55770"]', name='Vizualize Dimension…

# Determine the 'dimensionality' of the dataset

To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a 'metafeature' that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should we choose to include? 10? 20? 100?

To answer this question, we will use a heuristic method that generates an 'Elbow plot': a ranking of principal components based on the percentage of variance explained by each one (`ElbowPlot` function). In this example, we can observe an 'elbow' around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.
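The quantity behind an elbow plot can be sketched from the standard deviation of each component. The numbers below are made up, the percentages are relative to the components listed rather than to all variance in the data, and `naive_elbow` is a hypothetical helper showing one crude way to locate the bend; in practice the elbow is usually judged by eye.

```python
from itertools import accumulate


def percent_variance(stdevs):
    """Percentage of variance carried by each PC, relative to the PCs given."""
    variances = [s * s for s in stdevs]
    total = sum(variances)
    return [100.0 * v / total for v in variances]


def naive_elbow(percentages, min_drop=1.0):
    """Number of leading PCs to keep.

    Stops at the first PC whose drop from its predecessor is below min_drop
    percentage points, and keeps every PC before it.
    """
    for i in range(1, len(percentages)):
        if percentages[i - 1] - percentages[i] < min_drop:
            return i
    return len(percentages)


# Hypothetical standard deviations for six PCs, made up for illustration.
pc_stdevs = [4.0, 3.0, 2.0, 1.0, 1.0, 1.0]
pct = percent_variance(pc_stdevs)
cumulative = list(accumulate(pct))
n_keep = naive_elbow(pct)
```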

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to visualize an "Elbow Plot" of the principal components.
</div>

In [14]:
%%r_build_ui -w 800 {"name":"Elbow Plot", "parameters": {  "output_var": {"hide": "True"} }}
ebp <- function(){
    plot(ElbowPlot(pbmc))
    return(pbmc)
}
ebp()

UIBuilder(function_import='nbtools._r_wrappers["C36CD7530E7A4CF4C4296CBFE1D4A669"]', name='Elbow Plot', origin…

In particular, `DimHeatmap` allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide how many PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting `cells` to a number plots the 'extreme' cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated feature sets.
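Selecting the 'extreme' cells for one component can be pictured as sorting cells by their score and taking an equal number from each end. The sketch below assumes that even split, which may differ in detail from what `DimHeatmap` does internally; the scores and the function name `extreme_cells` are made up.

```python
def extreme_cells(scores, n_cells):
    """Pick n_cells barcodes: half with the lowest and half with the highest scores."""
    ranked = sorted(scores, key=scores.get)  # ascending by PC score
    if n_cells >= len(ranked):
        return ranked
    half = n_cells // 2
    return ranked[:half] + ranked[len(ranked) - half:]


# Made-up scores of six cells on a single principal component.
pc_scores = {"c1": -3.0, "c2": 0.1, "c3": 2.5, "c4": -0.2, "c5": 4.0, "c6": -1.5}
selected_cells = extreme_cells(pc_scores, 4)
```

Cells near zero on the component (here `c2` and `c4`) are left out, which is why plotting only the extremes is much faster on large datasets.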

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to visualize a heatmap of the first <i>n</i> principal components.
</div>

In [15]:
%%r_build_ui -w 800 -h 1200 {"name":"DimHeatMap", "parameters": { "num_dims":{"type":"string", "default": "15"}, "cells":{"type": "number","default":"500"}, "output_var": {"hide": "True"} }}

vdhm <- function(nd,c){    
    if (nd == 1){
        dim_range = 1
    } else {
        dim_range = c(1:strtoi(nd))
    }
    
    print(DimHeatmap(pbmc, dims = dim_range, cells = c, balanced = TRUE))
    return(pbmc)
}

vdhm(num_dims, cells)

UIBuilder(function_import='nbtools._r_wrappers["4F4FF24A11880F509C0731BF113E4461"]', name='DimHeatMap', origin…

# Export preprocessed RDS

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to save the processed data.
</div>

In [16]:
%%r_build_ui {"name":"Save preprocessed dataset", "parameters": {  "file_name": {"default":"pbmc_preprocessed.rds"}, "output_var": {"hide": "True"} }}
save_it <- function(fileName){
    saveRDS(pbmc, file = fileName)
    print("Saved file!")
    return(pbmc)
}
save_it(file_name)

UIBuilder(function_import='nbtools._r_wrappers["0FD26CD685EA9B8BA75841C3A8D5C80C"]', name='Save preprocessed d…

# Cluster Cells

Following preprocessing, we will now *cluster*, or identify similar populations of cells within our dataset. This can be used to, for example, identify cells corresponding to specific cell types or separate healthy and diseased cells. Seurat uses a "graph-based" clustering approach, which relies on drawing connections between similar cells and separating them into similar groups, or "cliques".
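The idea of "drawing connections between similar cells" can be illustrated with a k-nearest-neighbor graph. In the sketch below, groups are simply the connected components of that graph, which is a much cruder grouping step than the community detection a real clustering module performs; the points and the function names `knn_graph` and `connected_groups` are hypothetical.

```python
import math


def knn_graph(points, k):
    """Map each point index to the set of its k nearest neighbors."""
    neighbors = {}
    for i, p in enumerate(points):
        dists = sorted(
            (math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))), j)
            for j, q in enumerate(points)
            if j != i
        )
        neighbors[i] = {j for _, j in dists[:k]}
    return neighbors


def connected_groups(neighbors):
    """Label each point by the connected component it belongs to."""
    # Treat every neighbor link as an undirected edge.
    adjacency = {i: set(ns) for i, ns in neighbors.items()}
    for i, ns in neighbors.items():
        for j in ns:
            adjacency[j].add(i)
    labels = {}
    current = 0
    for start in sorted(adjacency):
        if start in labels:
            continue
        stack = [start]
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in labels:
                continue
            labels[node] = current
            stack.extend(n for n in adjacency[node] if n not in labels)
        current += 1
    return [labels[i] for i in range(len(adjacency))]


# Made-up 2-D coordinates (think of two PC scores per cell): two tight blobs.
cells_2d = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cluster_labels = connected_groups(knn_graph(cells_2d, k=2))
```

Real data rarely separates this cleanly, which is why graph clustering methods weigh and partition edges rather than relying on disconnected components.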

The GenePattern Module below will use the GenePattern Cloud infrastructure to run Seurat clustering. You can upload the preprocessed `.rds` file we saved above and specify the **maximum dimension**, the number of principal components to use for clustering, and the **resolution** parameter, which affects the number of clusters detected. Higher resolution values will generally yield more clusters. When clustering new data, it is wise to experiment with a range of resolution values.

You can also specify the **reduction** method. This refers to the algorithm that will be used to further reduce the PCA data to two dimensions for the purpose of visualization. The single cell community is still considering the merits and drawbacks of various dimensionality reduction methods for visualization, such as UMAP[**2**] and tSNE[**3**], but consensus is currently forming around UMAP, and we recommend its use in this notebook.

**NOTE FOR WORKSHOP USERS**: In the interest of time, we have provided a link to the result of this module, so you do not need to run this module during the workshop. We have left this module here so you can use this notebook to explore your own data.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to upload your preprocessed RDS file to the GenePattern Cloud server.
</div>

In [17]:
import nbtools  # used below for build_ui and UIOutput

@nbtools.build_ui(name="Upload file to GenePattern Server", parameters={
    "file": {
        "name": "File to upload:",
        "default":"pbmc_preprocessed.rds"
    },
    "output_var": {
    "name": "results",
    "description": "",
    "default": "quantification_source",
    "hide": True
    }
})
def load_file(file):
    import genepattern
    uio = nbtools.UIOutput()
    display(uio)
    size = os.path.getsize(file)
    print(f'This file size is {round(size/1e6)} MB, it may take a while to upload.')
    uio.status = "Uploading..."
    uploaded_file = genepattern.session.get(0).upload_file(file_name=os.path.basename(file),file_path=file)
    uio.status = "Uploaded!"
    display(nbtools.UIOutput(files=[uploaded_file.get_url()]))
    return()

UIBuilder(function_import='nbtools.tool(id="Upload file to GenePattern Server", origin="Notebook").function_or…

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
Use the cell below to run a clustering analysis on the processed data.
</div>

In [24]:
seuratclustering_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00408')
seuratclustering_job_spec = seuratclustering_task.make_job_spec()
seuratclustering_job_spec.set_parameter("input.seurat.rds.file", "")
seuratclustering_job_spec.set_parameter("output.filename", "<input.seurat.rds.file_basename>.clustered")
seuratclustering_job_spec.set_parameter("maximum_dimension", "10")
seuratclustering_job_spec.set_parameter("resolution", "0.5")
seuratclustering_job_spec.set_parameter("reduction", "umap")
seuratclustering_job_spec.set_parameter("job.memory", "2 Gb")
seuratclustering_job_spec.set_parameter("job.queue", "gp-cloud-default")
seuratclustering_job_spec.set_parameter("job.cpuCount", "1")
seuratclustering_job_spec.set_parameter("job.walltime", "02:00:00")
genepattern.display(seuratclustering_task)

job256786 = gp.GPJob(genepattern.session.get(0), 256786)
genepattern.display(job256786)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00408')

GPJobWidget(job_number=256786)

# Download clustering results

This module retrieves the clustered RDS and cluster markers CSV files from the GenePattern server.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to download the clustering result and cluster markers files from the GenePattern Cloud server.
</div>

In [18]:
import os

ui_description = "Download file from a GenePattern module job result."
ui_parameters = {
    "clustered_rds_file": {"type": "file", "kinds": ["rds"]}, 
    "output_var": {"hide": True},
    "markers_csv_file": {"type": "file", "kinds": ["csv"]}
}   
        
def DownloadJobResultFile(file):
    # extract job number and file name from url
    job_num = file.split("/")[-2]
    remote_file_name = file.split("/")[-1]
    
    # get the job based on the job number parsed from the URL
    job = gp.GPJob(genepattern.get_session(0), job_num)

    # fetch a specific file from the job
    remote_file = job.get_file(remote_file_name)
    
    uio = nbtools.UIOutput(text=file)
    display(uio)
    uio.status = "Downloading..."
    
    File_Name = os.path.basename(file)

    response = remote_file.open()
    CHUNK = 16 * 1024
    with open(File_Name, 'wb') as f:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            f.write(chunk)
    uio.status = "Downloaded!"
    print(File_Name)
    #display(nbtools.UIOutput(files=[File_Name]))
    
def download_rds_csv(clustered_rds_file, markers_csv_file):
    DownloadJobResultFile(clustered_rds_file)
    DownloadJobResultFile(markers_csv_file)
    
genepattern.GPUIBuilder(download_rds_csv, collapse=False,
                    name='Download clustered RDS and cluster markers CSV',
                    description=ui_description,
                    parameters=ui_parameters)

UIBuilder(collapse=False, description='Download file from a GenePattern module job result.', function_import='…

# Load clustering results

This module loads the clustered RDS file and the cluster markers CSV file into the notebook.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to load the clustering result and cluster markers file into the notebook.
</div>

In [19]:
%%r_build_ui {"name":"Load dataset with clustering", "parameters": {"RDS_url":{"name":"clustered_rds_file*","default":"https://datasets.genepattern.org/data/module_support_files/SeuratClustering/Seurat_preprocessed_prebaked.clustered.rds.rds","type":"file","kinds":["rds"]}, "CSV_url":{"name":"markers csv file*","type":"file","kinds":["csv"],"default":"https://datasets.genepattern.org/data/module_support_files/SeuratClustering/Seurat_preprocessed_prebaked.clustered.rds.csv"}, "output_var": {"hide": "True"} }}

load_markers <- function(CSV_url) {
    write("Loading cluster markers into notebook...", stdout())
    markers <- read.csv(CSV_url)
    write("Done!", stdout())
    return(markers)
}
load_it <- function(RDS_url){
    write("Loading clustering results into notebook...", stdout())
    pbmc <- readRDS(file = RDS_url)
    write("Loaded file!", stdout())
    return(pbmc)
}
suppressWarnings(markers <- load_markers(CSV_url))
pbmc <- load_it(RDS_url)

UIBuilder(function_import='nbtools._r_wrappers["5774D6F33E75C162FB3C9C7730F6B597"]', name='Load dataset with c…

# Visualize clustering

The following cell shows the UMAP (Uniform Manifold Approximation and Projection) visualization of the cells colored by cluster. UMAP is one of a class of methods developed to visualize high dimensional data in two dimensions. The UMAP algorithm finds a latent subspace, or manifold, within the data and uses it to project the data onto a two dimensional visualization. The following plot shows a UMAP plot where each dot is a cell and colors correspond to cluster assignments.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to visualize the cluster results.
</div>

In [20]:
%%r_build_ui -w 800 {"name": "Visualize clusters", "parameters": {"output_var": {"hide": "True"}}}
do_dim_plot <- function() {
    plot(DimPlot(pbmc, reduction = "umap"))
    return("")
}
do_dim_plot()

UIBuilder(function_import='nbtools._r_wrappers["8CBA75AFE4C2A30B8145C23078C78F54"]', name='Visualize clusters'…

# Visualize marker gene expression

This cell produces a violin plot of gene expression in each cluster.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to visualize the expression of a gene.
</div>

In [21]:
%%r_build_ui -w 800 {"name": "Violin plot of gene expression", "parameters": {"gene": {}, "output_var": {"hide": "True"}}}
do_violin <- function(gene) {
    plot(VlnPlot(pbmc, features = c(gene), slot = "counts", log = TRUE))
    return("")
}
do_violin(gene)

UIBuilder(function_import='nbtools._r_wrappers["9197CAEE74E9BCA43777B30BA530DA3E"]', name='Violin plot of gene…

# Show markers on dimension reduction

This cell produces a UMAP plot colored by marker gene expression.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Click <b>Run</b> to display a dimension reduction plot colored by a single gene's expression.
</div>

In [22]:
%%r_build_ui -w 800 {"name": "UMAP plot of gene expression", "parameters": {"gene": {}, "output_var": {"hide": "True"}}}
do_umap_gene <- function(gene) {
    plot(FeaturePlot(pbmc, features = c(gene)))
    return("")
}
do_umap_gene(gene)

UIBuilder(function_import='nbtools._r_wrappers["C74CC8AF90A72DC62B30DB4365419426"]', name='UMAP plot of gene e…

# References

1. Villani, A.-C., Satija, R., Reynolds, G., Sarkizova, S., Shekhar, K., Fletcher, J., Griesbeck, M., Butler, A., Zheng, S., Lazo, S., et al. (2017). Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356.

2. Becht, E., McInnes, L., Healy, J., Dutertre, C.-A., Kwok, I.W.H., Ng, L.G., Ginhoux, F., and Newell, E.W. (2019). Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology 37, 38–44.

3. Maaten, L. van der, and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 2579–2605.