# Differential expression & cluster annotation  

## Introduction

Now that we've assigned cells to clusters, we'd like to understand what makes each cluster different from the other cells in the dataset, or to annotate clusters according to their cell types (as has previously been done for this dataset).  

There are several approaches to this task:  

* Look for upregulation of marker genes for cell types of interest (compared to the rest of the dataset)  
* Compare the complete gene expression profiles between groups  
* Use automated methods that compare cells of interest to databases of cell type expression profiles, combining clustering and annotation  

Automated methods are a promising advance, but are not yet able to replace careful human curation. 

For well-defined cell types, we expect marker genes to show large differences in expression between the cell type of interest and the rest of the dataset, allowing us to use simple methods. We'll focus on this approach for this workshop, while building intuition that is broadly applicable to other approaches.
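To make the marker-gene idea concrete, here is a minimal sketch on simulated data (not the workshop dataset). The toy matrix `expr`, the labels `clusters`, and the helper `mean_difference_vs_rest` are all hypothetical names introduced for illustration. A bare difference in means is only the simplest possible score; it ignores spread, which the tests discussed later account for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 300 cells x 4 genes of log-scale expression values.
n_cells, n_genes = 300, 4
expr = rng.normal(loc=1.0, scale=0.5, size=(n_cells, n_genes))
clusters = np.repeat(np.array(["a", "b", "c"]), 100)

# Make gene 2 a marker for cluster "b" by shifting its expression upward.
expr[clusters == "b", 2] += 2.0


def mean_difference_vs_rest(expr, clusters, cluster):
    """Per-gene mean in `cluster` minus the per-gene mean in all other cells."""
    in_cluster = clusters == cluster
    return expr[in_cluster].mean(axis=0) - expr[~in_cluster].mean(axis=0)


diff_b = mean_difference_vs_rest(expr, clusters, "b")
top_gene = int(np.argmax(diff_b))
print("difference vs. rest per gene:", diff_b.round(2))
print("strongest candidate marker for cluster b: gene", top_gene)
```

In this simulation the planted marker stands out clearly because its shift is large relative to the noise, which is the situation the simple approach relies on.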


## Comparing distributions

Unlike in bulk RNA-seq, single-cell experiments generally give us a large number of samples (i.e., cells) for each group we are comparing. We can therefore use the whole distribution of expression values in each group to identify differences between groups, rather than only comparing estimates of mean expression as is standard for bulk RNA-seq.

There are two main approaches to comparing distributions: parametric and non-parametric. 

For parametric comparisons, we infer the parameters of a distribution so that it matches the expression values in each group as closely as possible. We can then ask whether there are significant differences between the parameters that best describe each group. 
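As a minimal sketch of a parametric comparison, the example below assumes that expression within each group is roughly normal, estimates the mean and standard deviation per group, and then tests for a difference in means with Welch's t-test from SciPy. The simulated groups and the normality assumption belong to this illustration only; they are not claims about the workshop dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated expression of one gene in two groups of cells (hypothetical values).
group_a = rng.normal(loc=2.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=200)

# Step 1: infer the parameters of a normal distribution for each group.
mu_a, sigma_a = group_a.mean(), group_a.std(ddof=1)
mu_b, sigma_b = group_b.mean(), group_b.std(ddof=1)
print(f"group a: mean={mu_a:.2f}, sd={sigma_a:.2f}")
print(f"group b: mean={mu_b:.2f}, sd={sigma_b:.2f}")

# Step 2: test whether the fitted means differ, without assuming equal variances.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t={result.statistic:.2f}, p={result.pvalue:.2e}")
```

Setting `equal_var=False` selects Welch's variant, which is a reasonable default when the two groups may differ in spread as well as in center.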

Alternatively, we can use a non-parametric test, which does not assume that expression values follow any particular distribution. Non-parametric tests generally convert observed expression values to ranks and test whether the distribution of ranks for one group is significantly different from the distribution of ranks for the other group. However, some non-parametric methods fail in the presence of a large number of tied values, as is the case for dropouts (zeros) in single-cell RNA-seq expression data. Moreover, if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test.
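The rank-based idea can be sketched with SciPy's Mann-Whitney U test, used here as one common example of a non-parametric test. The data are simulated with many zeros to mimic dropouts; the helper `simulate_gene` and the chosen fractions are hypothetical. The first line also shows how tied values share a single averaged rank, which is why a large block of zeros reduces the information available to such a test.

```python
import numpy as np
from scipy import stats

# Ties share the average of the ranks they span: the three zeros take ranks 1-3.
ranks = stats.rankdata([0.0, 0.0, 0.0, 1.2, 3.4])
print("ranks with ties:", ranks)

rng = np.random.default_rng(3)
n_cells = 200


def simulate_gene(expressed_fraction):
    """Zero for 'dropout' cells, gamma-distributed expression otherwise."""
    expressed = rng.random(n_cells) < expressed_fraction
    values = rng.gamma(shape=2.0, scale=1.0, size=n_cells)
    return np.where(expressed, values, 0.0)


group_a = simulate_gene(0.3)  # gene detected in about 30% of cells
group_b = simulate_gene(0.7)  # gene detected in about 70% of cells

# Fraction of all values that are exactly zero, i.e. tied at the lowest rank.
zero_fraction = np.mean(np.concatenate([group_a, group_b]) == 0)
print(f"fraction of tied zeros: {zero_fraction:.2f}")

result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U={result.statistic:.0f}, p={result.pvalue:.2e}")
```

Here the difference between groups is large enough to be detected despite the ties; with subtler differences and more zeros, the loss of power becomes more of a concern.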

Here, we'll demonstrate N approaches to parametric comparisons between groups.

## Load data

We'll continue working with the mouse brain data with assigned clusters. 

In [None]:
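# Scanpy for single-cell analysis; NumPy and Matplotlib for arrays and plotting
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
# IPython magic: render plots inline in the notebook
%matplotlib inline

# Load the clustered mouse brain data
adata = sc.read('../data/brain_')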

### Important note! For differential expression, we need to use the _raw_ values stored in `adata.raw`.

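With differential expression, we want to account for both the center and spread of the expression in each group. Recall that when we normalized our values, we standardized the distribution of each gene across cells to be centered at 0 and scaled to variance 1. That scaling discards the information about center and spread that we want to compare, so we work with the unscaled values kept in `adata.raw` instead. 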
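A small simulation illustrates why standardized values are a poor starting point. The two genes below are hypothetical, and the manual z-scoring stands in for whatever scaling step was applied earlier; it is a sketch of the effect, not a reproduction of the workshop's preprocessing.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two hypothetical genes with very different raw centers and spreads.
raw = np.column_stack([
    rng.normal(loc=5.0, scale=2.0, size=500),  # highly expressed, variable gene
    rng.normal(loc=0.5, scale=0.1, size=500),  # lowly expressed, stable gene
])

# Standardize each gene (column) across cells: mean 0, variance 1.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print("raw means:    ", raw.mean(axis=0).round(2))
print("raw sds:      ", raw.std(axis=0).round(2))
print("scaled means: ", scaled.mean(axis=0).round(2))
print("scaled sds:   ", scaled.std(axis=0).round(2))
```

After scaling, both genes have the same center and spread, so the differences in magnitude and variability that were present in the unscaled values can no longer be seen.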

This has several tricky questions embedded in it:  
* What constitutes a cell type? With what granularity? (E.g., lymphocytes > T cells > CD4+ T cells)  
* How does cell state interact with cell type? (E.g., when annotating in/activated T cells)  
* 