In [None]:
from SAM import SAM

This notebook explores the parameters of the SAM functions in more depth.

## Preprocessing the data

In [None]:
sam = SAM()
sam.load_data('../example_data/GSE74596_data.csv.gz')
sam.load_annotations('../example_data/GSE74596_ann.csv')

Shown below are all the possible parameters to the data preprocessing function.

In [None]:
"""
div : float, optional, default 1
    The factor by which the gene expression will be divided prior to
    log normalization.

downsample : float, optional, default 0
    The factor by which to randomly downsample the data. If 0, the
    data will not be downsampled.

sum_norm : float, optional, default None
    If specified, the total number of transcripts in each cell will be
    normalized to this value prior to log normalization and filtering.
    Otherwise, nothing happens.

norm : str, optional, default 'log'
    If 'log', log-normalizes the expression data. If the loaded data is
    already log-normalized, set norm = None.

include_genes : array-like of string, optional, default None
    A vector of gene names or indices that specifies the genes to keep.
    All other genes will be filtered out. Gene names are case-sensitive.

exclude_genes : array-like of string, optional, default None
    A vector of gene names or indices that specifies the genes to exclude.
    These genes will be filtered out. Gene names are case-sensitive.

include_cells : array-like of string, optional, default None
    A vector of cell names that specifies the cells to keep.
    All other cells will be filtered out. Cell names are
    case-sensitive.

exclude_cells : array-like of string, optional, default None
    A vector of cell names that specifies the cells to exclude.
    These cells will be filtered out. Cell names are
    case-sensitive.

min_expression : float, optional, default 1
    The threshold above which a gene is considered
    expressed. Gene expression values less than 'min_expression' are
    set to zero.

thresh : float, optional, default 0.2
    Keep genes expressed in greater than 'thresh'*100 % of cells and
    less than (1-'thresh')*100 % of cells, where a gene is considered
    expressed if its expression value exceeds 'min_expression'.

filter_genes : bool, optional, default True
    Setting this to False turns off filtering operations aside from
    removing genes with zero expression across all cells. Genes passed
    in exclude_genes or not passed in include_genes will still be
    filtered.
"""

sam.preprocess_data(div=1,
                    downsample=0,
                    sum_norm=None,
                    include_genes=None,
                    exclude_genes=None,
                    include_cells=None,
                    exclude_cells=None,
                    norm='log',
                    min_expression=1,
                    thresh=0.01,
                    filter_genes=True)
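
To make the roles of `sum_norm`, `div`, and `norm='log'` concrete, here is a minimal sketch of the order of operations described above, applied to a single cell. It is plain Python rather than SAM's own code. The helper name `normalize_cell`, the base-2 logarithm, and the pseudocount of 1 are assumptions made for illustration; the parameter descriptions above do not specify them.

```python
import math

def normalize_cell(counts, sum_norm=None, div=1):
    """Illustrative per-cell normalization (not SAM's implementation)."""
    total = sum(counts)
    # If sum_norm is given, rescale the cell so its counts add up to sum_norm.
    if sum_norm is not None and total > 0:
        counts = [c * sum_norm / total for c in counts]
    # Divide by `div`, then log-transform.
    # Assumption: log base 2 with a pseudocount of 1.
    return [math.log2(c / div + 1) for c in counts]

raw = [3, 1, 0]                      # hypothetical transcript counts for one cell
log_only = normalize_cell(raw)       # sum_norm=None leaves the totals untouched
rescaled = normalize_cell(raw, sum_norm=8)
print(log_only)
print(rescaled)
```

With `sum_norm=None` the counts are only divided by `div` and log-transformed; with `sum_norm=8` the cell is first rescaled from a total of 4 to a total of 8.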

In most cases, the default preprocessing parameters do not need to be changed.
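
The interaction between `min_expression` and `thresh` is easiest to see on a toy matrix. The sketch below applies the rule as described in the parameter list: a gene is kept if it is expressed in more than `thresh` and fewer than `1 - thresh` of the cells. The function `filter_genes_by_expression` and the toy data are hypothetical and written in plain Python; this is not SAM's implementation.

```python
def filter_genes_by_expression(expr, gene_names, min_expression=1, thresh=0.2):
    """Return the names of genes kept by the rule described above.

    expr: list of rows, one row per cell and one column per gene.
    """
    n_cells = len(expr)
    kept = []
    for j, gene in enumerate(gene_names):
        # A gene counts as expressed in a cell if its value exceeds min_expression.
        n_expressing = sum(1 for row in expr if row[j] > min_expression)
        frac = n_expressing / n_cells
        # Keep genes expressed in more than `thresh` and fewer than
        # `1 - thresh` of the cells.
        if thresh < frac < 1 - thresh:
            kept.append(gene)
    return kept

# Toy data: 5 cells x 4 genes (hypothetical, already log-normalized values).
expr = [
    [0.0, 3.0, 2.5, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 4.0, 3.1, 0.0],
    [0.0, 2.2, 0.0, 1.5],
    [0.0, 5.0, 0.5, 0.0],
]
genes = ["never_on", "always_on", "variable", "rare"]

kept_strict = filter_genes_by_expression(expr, genes, thresh=0.2)
kept_loose = filter_genes_by_expression(expr, genes, thresh=0.01)
print(kept_strict)  # ['variable']
print(kept_loose)   # ['variable', 'rare']
```

Genes that are never expressed or expressed in every cell are dropped at either threshold, while lowering `thresh` lets the rarely expressed gene through.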

## Running SAM

The arguments below almost never need to be changed, aside from the `preprocessing` parameter.

`preprocessing` can be either `'Normalizer'` or `'StandardScaler'`. Prior to dimensionality reduction, SAM either standardizes the gene expression matrix so that each gene has zero mean and unit variance (which corrects for differences in distributions between genes) or normalizes the expression values so that each cell has unit Euclidean (L2) norm (which prevents cells with large variances in gene expression from dominating downstream analyses). Empirically, we have found that standardization (`preprocessing='StandardScaler'`) performs well on large, sparse datasets collected through droplet-based methods, whereas L2-normalization (`preprocessing='Normalizer'`) works better on smaller datasets with higher sequencing depth, such as those prepared with the Smart-Seq2 protocol. The default is `preprocessing='Normalizer'`.
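
The difference between the two options comes down to which axis is rescaled. The sketch below reimplements both transformations in plain Python on a tiny cells-by-genes matrix, so that it runs without any dependencies. The helper names are hypothetical; inside SAM the work is done by the sklearn classes named above, and any additional steps SAM applies are not shown here.

```python
import math

def l2_normalize_cells(X):
    """Scale each row (cell) to unit Euclidean norm, as Normalizer does."""
    out = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm > 0 else 0.0 for v in row])
    return out

def standardize_genes(X):
    """Give each column (gene) zero mean and unit variance, as StandardScaler does."""
    n = len(X)
    scaled_cols = []
    for col in zip(*X):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n  # population variance
        std = math.sqrt(var) if var > 0 else 1.0     # leave constant genes unscaled
        scaled_cols.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*scaled_cols)]

# Toy data: 3 cells x 2 genes. The second cell is the first at twice the depth.
X = [[3.0, 4.0],
     [6.0, 8.0],
     [0.0, 5.0]]

X_l2 = l2_normalize_cells(X)
X_std = standardize_genes(X)
print(X_l2)
print(X_std)
```

After L2-normalization the first two cells become identical, because they differ only in overall magnitude. After standardization every gene has the same spread, but the difference in magnitude between those two cells is preserved.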

In [None]:
"""
k : int, optional, default 20
    The number of nearest neighbors to identify for each cell.

distance : string, optional, default 'correlation'
    The distance metric to use when constructing cell distance
    matrices. Can be any of the distance metrics supported by
    scipy's 'pdist' (scipy.spatial.distance.pdist).

max_iter : int, optional, default 10
    The maximum number of iterations SAM will run.

stopping_condition : float, optional, default 5e-3
    The stopping condition threshold for the RMSE between gene weights in
    adjacent iterations.

verbose : bool, optional, default True
    If True, the iteration number and error between gene weights in adjacent
    iterations will be displayed.

projection : str, optional, default 'umap'
    If 'tsne', generates a t-SNE embedding. If 'umap', generates a UMAP
    embedding. Otherwise, no embedding will be generated.

preprocessing : str, optional, default 'Normalizer'
    If 'Normalizer', use sklearn.preprocessing.Normalizer, which
    normalizes expression data prior to PCA such that each cell has
    unit L2 norm. If 'StandardScaler', use
    sklearn.preprocessing.StandardScaler, which normalizes expression
    data prior to PCA such that each gene has zero mean and unit
    variance. Otherwise, do not normalize the expression data. We
    recommend using 'StandardScaler' for large datasets and 'Normalizer'
    otherwise.

num_norm_avg : int, optional, default 50
    The top 'num_norm_avg' dispersions are averaged to determine the
    normalization factor when calculating the weights. This prevents
    genes with large spatial dispersions from skewing the distribution
    of weights.
"""

sam.run(k=20,
        distance='correlation',
        max_iter=10,
        verbose=True,
        projection='umap',
        stopping_condition=5e-3,
        num_norm_avg=50,
        preprocessing='Normalizer')
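
The parameters `num_norm_avg`, `stopping_condition`, and `max_iter` are easier to reason about with a small sketch. The code below is plain Python and is not SAM's implementation. The helper names, the cap of weights at 1, and the toy update rule are assumptions made for illustration; only the averaging of the top dispersions and the RMSE-based stopping rule come from the parameter descriptions above.

```python
import math

def weights_from_dispersions(dispersions, num_norm_avg=50):
    # Average the largest `num_norm_avg` dispersions to get the normalization factor.
    top = sorted(dispersions, reverse=True)[:num_norm_avg]
    factor = sum(top) / len(top)
    # Assumption: weights are capped at 1 so that a few very dispersed
    # genes cannot skew the distribution of weights.
    return [min(d / factor, 1.0) for d in dispersions]

def rmse(a, b):
    """Root-mean-square error between two equally long weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def run_until_converged(update, weights, max_iter=10, stopping_condition=5e-3):
    """Apply `update` until the weights stop changing or max_iter is reached."""
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        new_weights = update(weights)
        err = rmse(new_weights, weights)
        weights = new_weights
        if err < stopping_condition:
            break
    return weights, n_iter

# Normalization factor: mean of the top 2 dispersions, (10 + 8) / 2 = 9.
toy_weights = weights_from_dispersions([10.0, 8.0, 2.0, 1.0], num_norm_avg=2)
print(toy_weights)

# Hypothetical update rule: move the weights halfway towards a fixed target.
target = [1.0, 0.5, 0.0]

def toy_update(weights):
    return [(w + t) / 2 for w, t in zip(weights, target)]

final_weights, n_iter = run_until_converged(toy_update, [1.0, 1.0, 1.0])
print(n_iter, final_weights)
```

In this toy run the change between iterations halves each time, so the loop stops before reaching `max_iter`. Setting `stopping_condition` to 0 would force all `max_iter` iterations.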

## Key SAM attributes

All SAM outputs are stored in the `sam.adata` object. See the AnnData documentation for more details (https://anndata.readthedocs.io/en/latest/anndata.AnnData.html).

Cell annotations and cluster assignments are stored in `sam.adata.obs`:

-  `annotations` -- Provided cell annotations
-  `louvain_clusters` -- Cluster assignments output by `sam.louvain_clustering`
-  `kmeans_clusters` -- Cluster assignments output by `sam.kmeans_clustering`
-  `hdbknn_clusters` -- Cluster assignments output by `sam.hdbknn_clustering`
-  `density_clusters` -- Cluster assignments output by `sam.density_clustering`

Gene information is stored in `sam.adata.var`:

-  `mask_genes` -- A boolean vector in which `False` indicates that the corresponding genes were filtered out
-  `spatial_dispersions` -- The spatial dispersions calculated with respect to the nearest-neighbor matrix produced by SAM
-  `weights` -- The gene weights calculated from the spatial dispersions

Unstructured objects output by SAM are stored in `sam.adata.uns`:

-  `pca_obj` -- The sklearn PCA object used by SAM
-  `X_processed` -- The processed, weighted data used as input to PCA
-  `gene_groups` -- The gene groups output by `sam.corr_bin_genes`
-  `ranked_genes` -- The gene IDs ranked by their weights
-  `neighbors` -- The nearest neighbor graph produced by SAM
-  `marker_genes_ratio` -- The output of `sam.identify_marker_genes_ratio`
-  `marker_genes_rf` -- The output of `sam.identify_marker_genes_rf`


The PCA output, UMAP projection, and t-SNE projection are stored in `sam.adata.obsm`:

-  `X_pca` -- The weighted PCA output
-  `X_umap` -- The UMAP projection (output from `sam.run_umap` and by default from `sam.run`)
-  `X_tsne` -- The t-SNE projection (output from `sam.run_tsne`)

Different forms of the expression data are stored in `sam.adata.layers`:

-  `X_disp` -- The expression data used to calculate gene weights
-  `X_knn_avg` -- The kNN-averaged expression data
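
As an illustration of what "kNN-averaged" can mean, the sketch below replaces each cell's expression with the mean over its neighborhood. It is a plain-Python toy under stated assumptions, not SAM's implementation: the helper `knn_average` is hypothetical, and whether SAM includes the cell itself in its own neighborhood or weights the neighbors is not specified here.

```python
def knn_average(expr, neighbors):
    """Average each cell's expression over its listed neighbors.

    expr: list of rows (cells x genes).
    neighbors: neighbors[i] lists the cell indices in cell i's neighborhood
        (assumed here to include i itself).
    """
    n_genes = len(expr[0])
    averaged = []
    for nbrs in neighbors:
        averaged.append([
            sum(expr[j][g] for j in nbrs) / len(nbrs)
            for g in range(n_genes)
        ])
    return averaged

# Toy data: 4 cells x 2 genes, forming two pairs of mutual neighbors.
expr = [[4.0, 0.0],
        [2.0, 0.0],
        [0.0, 6.0],
        [0.0, 2.0]]
neighbors = [[0, 1], [1, 0], [2, 3], [3, 2]]

smoothed = knn_average(expr, neighbors)
print(smoothed)  # [[3.0, 0.0], [3.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
```

Averaging over neighbors smooths out cell-to-cell noise within a neighborhood while keeping the differences between neighborhoods.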

The expression data used to generate `sam.adata.uns['X_processed']` is stored in `sam.adata.X`.

In [None]:
sam.adata