In [None]:
from SAM import SAM

This notebook explores the parameters of the SAM functions in more depth.

## Preprocessing the data

In [None]:
sam = SAM()
sam.load_data('../example_data/GSE74596_data.csv.gz')

Shown below are all the possible parameters to the data preprocessing function.

In [None]:
"""
div : float, optional, default 1
    The factor by which the gene expression will be divided prior to
    log normalization.

downsample : float, optional, default 0
    The factor by which to randomly downsample the data. If 0, the
    data will not be downsampled.

sum_norm : float, optional, default None
    If specified, the total number of transcripts in each cell will be
    normalized to this value prior to log normalization and filtering.
    Otherwise, nothing happens.

norm : str, optional, default 'log'
    If 'log', log-normalizes the expression data. If the loaded data is
    already log-normalized, set norm = None.

include_genes : array-like of string, optional, default None
    A vector of gene names or indices that specifies the genes to keep.
    All other genes will be filtered out. Gene names are case-sensitive.

exclude_genes : array-like of string, optional, default None
    A vector of gene names or indices that specifies the genes to exclude.
    These genes will be filtered out. Gene names are case-sensitive.

include_cells : array-like of string, optional, default None
    A vector of cell names that specifies the cells to keep.
    All other cells will be filtered out. Cell names are
    case-sensitive.

exclude_cells : array-like of string, optional, default None
    A vector of cell names that specifies the cells to exclude.
    These cells will be filtered out. Cell names are
    case-sensitive.

min_expression : float, optional, default 1
    The threshold above which a gene is considered
    expressed. Gene expression values less than 'min_expression' are
    set to zero.

thresh : float, optional, default 0.01
    Keep genes expressed in greater than 'thresh'*100 % of cells and
    less than (1-'thresh')*100 % of cells, where a gene is considered
    expressed if its expression value exceeds 'min_expression'.

filter_genes : bool, optional, default True
    Setting this to False turns off filtering operations aside from
    removing genes with zero expression across all cells. Genes passed
    in exclude_genes or not passed in include_genes will still be
    filtered.
"""

sam.preprocess_data(div=1,
                    downsample=0,
                    sum_norm=None,
                    include_genes=None,
                    exclude_genes=None,
                    include_cells=None,
                    exclude_cells=None,
                    norm='log',
                    min_expression=1,
                    thresh=0.01,
                    filter_genes=True)

In most cases, the default preprocessing parameters do not need to be changed.
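
The interaction between `min_expression` and `thresh` can be illustrated with a small, dependency-free sketch. This is not SAM's implementation: the helper `keep_gene` and the toy data are hypothetical, and the handling of values exactly equal to `min_expression` is an assumption based on the wording "exceeds" in the docstring above. A larger `thresh` of 0.2 is passed explicitly so that the effect is visible on ten cells.

```python
def keep_gene(expression, min_expression=1, thresh=0.01):
    """Return True if a gene passes the expression-frequency filter (hypothetical helper)."""
    # A gene counts as expressed in a cell when its value exceeds min_expression
    # (assumption: values exactly equal to min_expression are not counted).
    n_expressed = sum(1 for value in expression if value > min_expression)
    fraction = n_expressed / len(expression)
    # Keep genes expressed in more than thresh and fewer than (1 - thresh) of cells.
    return thresh < fraction < 1 - thresh


# Toy expression values for three genes across ten cells.
genes = {
    "rare": [0, 0, 0, 0, 0, 0, 0, 0, 0, 5],        # expressed in 10% of cells
    "variable": [0, 3, 0, 4, 0, 6, 0, 2, 0, 5],    # expressed in 50% of cells
    "ubiquitous": [3, 4, 5, 6, 2, 3, 4, 5, 6, 2],  # expressed in 100% of cells
}

kept = [name for name, values in genes.items() if keep_gene(values, thresh=0.2)]
print(kept)  # ['variable']
```

With `thresh=0.2`, both the rarely expressed and the ubiquitously expressed gene are removed. A smaller `thresh` makes the filter more permissive at both ends.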

## Running SAM

In [None]:
"""
k : int, optional, default 20
    The number of nearest neighbors to identify for each cell.

distance : string, optional, default 'correlation'
    The distance metric to use when constructing cell distance
    matrices. Can be any of the distance metrics supported by
    scipy's 'pdist' (scipy.spatial.distance.pdist).

max_iter : int, optional, default 10
    The maximum number of iterations SAM will run.

stopping_condition : float, optional, default 5e-3
    The stopping condition threshold for the RMSE between gene weights in
    adjacent iterations.

verbose : bool, optional, default True
    If True, the iteration number and error between gene weights in adjacent
    iterations will be displayed.

projection : str, optional, default 'umap'
    If 'tsne', generates a t-SNE embedding. If 'umap', generates a UMAP
    embedding. Otherwise, no embedding will be generated.

preprocessing : str, optional, default 'Normalizer'
    If 'Normalizer', use sklearn.preprocessing.Normalizer, which
    normalizes expression data prior to PCA such that each cell has
    unit L2 norm. If 'StandardScaler', use
    sklearn.preprocessing.StandardScaler, which normalizes expression
    data prior to PCA such that each gene has zero mean and unit
    variance. Otherwise, do not normalize the expression data. We
    recommend using 'StandardScaler' for large datasets and 'Normalizer'
    otherwise.

num_norm_avg : int, optional, default 50
    The top 'num_norm_avg' dispersions are averaged to determine the
    normalization factor when calculating the weights. This prevents
    genes with large spatial dispersions from skewing the distribution
    of weights.
"""

sam.run(k=20,
        distance='correlation',
        max_iter=10,
        verbose=True,
        projection='umap',
        stopping_condition=5e-3,
        num_norm_avg=50,
        preprocessing='Normalizer')
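
To make `stopping_condition` and `num_norm_avg` more concrete, here is a rough sketch in plain Python. It is not taken from `SAM.py`: the helpers `rmse` and `normalize_dispersions`, the weight values, and the choice to cap normalized weights at 1 are all illustrative assumptions about how such a scheme could work.

```python
import math


def rmse(previous, current):
    """Root-mean-square error between two equal-length weight vectors."""
    squared = [(p - c) ** 2 for p, c in zip(previous, current)]
    return math.sqrt(sum(squared) / len(squared))


def normalize_dispersions(dispersions, num_norm_avg=50):
    """Scale dispersions by the mean of the top `num_norm_avg` values.

    Capping the result at 1 is an assumption made for this sketch; it keeps
    a few very large dispersions from dominating the weights.
    """
    top = sorted(dispersions, reverse=True)[:num_norm_avg]
    factor = sum(top) / len(top)
    return [min(d / factor, 1.0) for d in dispersions]


# Hypothetical gene weights after successive iterations (four genes).
weight_history = [
    [1.0, 1.0, 1.0, 1.0],
    [0.9, 0.5, 0.2, 0.1],
    [0.95, 0.45, 0.15, 0.05],
    [0.951, 0.449, 0.151, 0.049],
]
stopping_condition = 5e-3

# Stop as soon as the weights change by less than the threshold.
for iteration, (prev, curr) in enumerate(
    zip(weight_history, weight_history[1:]), start=1
):
    error = rmse(prev, curr)
    print(f"Iteration {iteration}, error: {error:.4f}")
    if error < stopping_condition:
        break

# The two largest dispersions set the normalization factor: (10 + 8) / 2 = 9.
weights = normalize_dispersions([10.0, 8.0, 2.0, 1.0], num_norm_avg=2)
```

In this toy run the loop stops at the third iteration, when the change in weights falls below `5e-3`. Lowering `stopping_condition` or raising `max_iter` in `sam.run` allows more iterations before the run ends.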

## Key SAM attributes

SAM stores a number of useful outputs in `sam.adata` and `sam.output_vars`. For example, `sam.output_vars` contains the nearest neighbor matrix, the UMAP projection, the ranked genes, the indices of the ranked genes in the data matrix, the cluster assignments, and the marker genes.

In [None]:
print(list(sam.output_vars.keys()))

In [None]:
sam.adata

See below for a list of all attributes stored in SAM after running the analysis. Refer to the code documentation (in `SAM.py`) for a description of these attributes.

In [None]:
list(sam.__dict__.keys())