# Table of Contents
 <p><div class="lev1"><a href="#Overview"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></div><div class="lev1"><a href="#Examination-of-raw-data:-there-is-a-lot-of-variance-that-current-methods-do-not-capture"><span class="toc-item-num">2&nbsp;&nbsp;</span>Examination of raw data: there is a lot of variance that current methods do not capture</a></div><div class="lev2"><a href="#Among-samples-of-one-cell-type:-how-much-variation-is-there-in-genes-of-interest?-Is-variance-motivated?"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Among samples of one cell type: how much variation is there in genes of interest? Is variance motivated?</a></div><div class="lev2"><a href="#What-do-selected-vs-unselected-genes-look-like-in-terms-of-their-expressions-across-samples-in-a-cell-type,-across-all-samples-in-all-cell-types-(sort),-across-average-expression-in-all-cell-types?"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>What do selected vs unselected genes look like in terms of their expressions across samples in a cell type, across all samples in all cell types (sort), across average expression in all cell types?</a></div><div class="lev1"><a href="#Examination-of-reference-profile-selection-output"><span class="toc-item-num">3&nbsp;&nbsp;</span>Examination of reference profile selection output</a></div><div class="lev2"><a href="#Raw-vs-processed-reference-samples"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Raw vs processed reference samples</a></div><div class="lev2"><a href="#Differences-between-methods"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Differences between methods</a></div><div class="lev3"><a href="#What-is-being-included"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>What is being included</a></div><div class="lev3"><a href="#What-is-the-discriminatory-ability?"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>What is the discriminatory ability?</a></div><div class="lev1"><a href="#Pathological-cases"><span 
class="toc-item-num">4&nbsp;&nbsp;</span>Pathological cases</a></div><div class="lev2"><a href="#Distinguishing-between-two-very-similar-cell-types"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Distinguishing between two very similar cell types</a></div><div class="lev2"><a href="#Distinguishing-between-very-broad-cell-types"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Distinguishing between very broad cell types</a></div><div class="lev1"><a href="#Deconvolution-in-progressively-more-complicated-synthetic-mixtures"><span class="toc-item-num">5&nbsp;&nbsp;</span>Deconvolution in progressively more complicated synthetic mixtures</a></div><div class="lev1"><a href="#Does-RNAseq-behave-differently-from-microarray?"><span class="toc-item-num">6&nbsp;&nbsp;</span>Does RNAseq behave differently from microarray?</a></div><div class="lev1"><a href="#Remaining-things-to-read"><span class="toc-item-num">7&nbsp;&nbsp;</span>Remaining things to read</a></div>

# Overview

We explore the performance of 4 methods for isolating genes or expression profiles that are unique to immune cell types.

We also download 390 enriched cell type samples and perform exploratory data analysis (EDA) to understand how much expression variance exists within and across cell types, especially when comparing genes that were and were not considered representative of a particular cell type.

We compare deconvolution performance in several synthetic mixtures, with and without noise.


# Examination of raw data: there is a lot of variance that current methods do not capture


## Among samples of one cell type: how much variation is there in genes of interest? Is variance motivated?

[notebook](Variance within and between cell types.ipynb)

For the genes selected for a certain cell type: how much variance in expression is there within this cell type for these genes? Is it worth putting variance into our model?

I took many CD8 T cell samples from across papers and normalized them all (log base 2). Then I took the genes that Bindea had flagged as CD8 T cell-specific. I plotted a histogram of each gene's expression values and overlaid Cibersort LM22's chosen value for this gene.

![](allvariance.png?new)


Note that there are sometimes multiple plots per gene, because each plot actually is for a particular probeset (there can be multiple probesets for a single gene). However, Cibersort's matrix has gene-level specificity, not probeset-level.

There is a lot of variance in the expression of genes of interest, and the point estimates that Cibersort uses do not seem to be the means of the expression distributions of all samples. Thus, better means could be found, and variance information might improve prediction.
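The same comparison can be made numerically as well as visually. The snippet below is a minimal sketch, assuming a hypothetical samples-by-genes array `expr` of raw intensities for one cell type and a hypothetical vector `reference` of point estimates already on the log2 scale. Neither name comes from the notebook, and all numbers are synthetic.

```python
import numpy as np

def summarize_against_reference(expr, reference):
    """Per-gene mean, standard deviation, and z-score of a reference value.

    expr: (n_samples, n_genes) raw intensities for one cell type (hypothetical input).
    reference: (n_genes,) point estimates, assumed to be on the log2 scale.
    """
    log_expr = np.log2(expr)               # log base 2 transform, as in the text
    mean = log_expr.mean(axis=0)
    std = log_expr.std(axis=0, ddof=1)     # sample standard deviation
    z = (reference - mean) / std           # distance of the point estimate from the sample mean
    return mean, std, z

# Synthetic stand-in data: 50 samples, 3 genes.
rng = np.random.default_rng(0)
true_log_means = np.array([8.0, 10.0, 6.0])
expr = 2.0 ** rng.normal(true_log_means, 0.5, size=(50, 3))
reference = np.array([8.0, 11.5, 6.0])     # the second gene is deliberately off-centre

mean, std, z = summarize_against_reference(expr, reference)
for g in range(3):
    print(f"gene {g}: mean={mean[g]:.2f} sd={std[g]:.2f} reference z={z[g]:+.2f}")
```

A large absolute z-score flags a gene whose point estimate sits far from the centre of the observed distribution, and the per-gene standard deviations are the variance information that a model could use.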

## What do selected vs unselected genes look like in terms of their expressions across samples in a cell type, across all samples in all cell types (sort), across average expression in all cell types?

Are we going to see that unselected genes are fairly uniform and meaningless? Are they all low expression throughout? Are these methods selecting very high expression genes only? "Selected" could mean the genes Cibersort includes in its expression profiles, or the ones flagged by Bindea (probably better).

When looking at CD8 T cell genes (as flagged by Bindea) in samples from other cell types, we see wider expression distributions that seem more Gaussian. This may be due to the much larger sample size in the histogram. [(notebook)](Variance%20within%20and%20between%20cell%20types.ipynb#What-do-those-CD8-T-cell-genes-look-like-in-samples-from-other-cell-types?)

When looking at non-CD8 T cell genes in CD8 T cell samples, we see that lots of other genes have high expression. Perhaps they're not unique enough or not high enough to be an outlier. [(notebook)](http://localhost:8888/notebooks/Variance%20within%20and%20between%20cell%20types.ipynb#What-do-non-CD8-T-cell-genes-look-line-in-CD8-T-cell-samples?)



# Examination of reference profile selection output

## Raw vs processed reference samples

Here is a pairwise Pearson correlation matrix from the raw data of Abbas 2009's first experiment. Notice the scale -- very poor differentiation.

![](abbas.corr.png)

Here is a similar matrix from the basis matrix Cibersort made out of this. Much better differentiation.

![](abbas.cib.corr.png)
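A pairwise Pearson correlation matrix of this kind can be computed as sketched below. This is only an illustration on synthetic data, assuming a hypothetical genes-by-cell-types matrix. The gene selection rule used here (highest variance across cell types) is a stand-in and is not Cibersort's procedure.

```python
import numpy as np

def pairwise_pearson(profiles):
    """Pearson correlation between the columns of a (n_genes, n_cell_types) matrix."""
    return np.corrcoef(profiles, rowvar=False)

def off_diagonal(corr):
    """Return the off-diagonal entries of a square correlation matrix."""
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return corr[mask]

rng = np.random.default_rng(1)
n_genes, n_types = 2000, 4

# Synthetic "raw" profiles: a baseline shared by every cell type plus small noise,
# with 50 marker genes per cell type raised above the baseline.
baseline = rng.normal(8.0, 2.0, size=(n_genes, 1))
raw = baseline + rng.normal(0.0, 0.3, size=(n_genes, n_types))
for t in range(n_types):
    raw[t * 50:(t + 1) * 50, t] += 4.0

# Stand-in gene selection: keep the 200 genes that vary most across cell types.
top = np.argsort(raw.var(axis=1))[-200:]
selected = raw[top]

raw_corr = off_diagonal(pairwise_pearson(raw))
sel_corr = off_diagonal(pairwise_pearson(selected))
print(f"raw:      min={raw_corr.min():.3f} max={raw_corr.max():.3f}")
print(f"selected: min={sel_corr.min():.3f} max={sel_corr.max():.3f}")
```

In this synthetic setup the shared baseline pushes every raw pairwise correlation toward 1, while restricting to the discriminating genes lowers the correlations and widens the usable scale.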


## Differences between methods

### What is being included

The marker gene lists have mostly different genes but represent many of the same pathways. IRIS has more noise though -- e.g. lots of cell division pathways. [See section 2.1.1 of the writeup for the GO tables.](docs/reference_extraction.pdf)

Pairwise Pearson correlation and hierarchical clustering on the basis matrices recover biological similarities between certain cell types. (There is one exception: gamma delta T cells -- however these have been flagged as problematic and may be ignored.) These patterns are generally preserved across methods and across datasets. 

![Pairwise Pearson correlation in combination of _LM22_ and _Abbas_ basis matrices, as well as with raw data from @abbas.](lm22_abbas_abbasbig.corr.png)

![Hierarchical clustering of LM22](lm22.pdf)






### What is the discriminatory ability?

The condition number of the basis matrices is low: LM22's is 11.38 (1 is best). When one member of each of the two most similar pairs of cell types is removed, the condition number decreases to 9.30.

The most similar cell types are:
* B cells memory, naive
* CD4 T cells naive, memory resting






# Pathological cases

## Distinguishing between two very similar cell types

Construct a synthetic mixture in the same way, containing just the two most similar cell types at 48% each plus 4% of some other cell type. Use the reference profiles exactly, maybe with some noise. Feed it to Cibersort or CellMix and see what the output is.
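
A minimal sketch of building such a mixture is given below. It assumes the profiles are on a linear (non-log) scale so that mixing is additive, and it uses random stand-in columns rather than real reference profiles. The function name `make_mixture` and the cell type labels are hypothetical.

```python
import numpy as np

def make_mixture(basis, fractions, noise_sd=0.0, rng=None):
    """Weighted sum of reference columns, optionally with Gaussian noise per gene."""
    fractions = np.asarray(fractions, dtype=float)
    if not np.isclose(fractions.sum(), 1.0):
        raise ValueError("fractions must sum to 1")
    mixture = basis @ fractions
    if noise_sd > 0:
        if rng is None:
            rng = np.random.default_rng()
        mixture = mixture + rng.normal(0.0, noise_sd, size=mixture.shape)
    return mixture

rng = np.random.default_rng(3)
n_genes = 300
cell_types = ["similar A", "similar B", "other"]   # placeholder labels

# Stand-in reference profiles: the first two columns differ only slightly.
type_a = rng.uniform(0.0, 10.0, size=n_genes)
type_b = type_a + rng.normal(0.0, 0.3, size=n_genes)
other = rng.uniform(0.0, 10.0, size=n_genes)
basis = np.column_stack([type_a, type_b, other])

fractions = [0.48, 0.48, 0.04]
clean = make_mixture(basis, fractions)
noisy = make_mixture(basis, fractions, noise_sd=0.5, rng=rng)
print(clean[:5])
print(noisy[:5])
```

The resulting vectors would then be written out in whatever input format Cibersort or CellMix expects; that step is not shown here.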


## Distinguishing between very broad cell types

Would Cibersort fail on superset classification, i.e. on distinguishing T cells from B cells instead of precise subtypes?

What do the diagnostics tell us in this case?



# Deconvolution in progressively more complicated synthetic mixtures

First, I reproduced the deconvolution of the example mixtures that come with Cibersort. I got $R^2 > 0.99$.

Next, deconvolving weighted sums of two quite different columns: naive B cells and Tregs.
* Reference profiles with no noise, with simple noise (a Gaussian added to each gene), with complex noise (a Gaussian added to each element of the weights matrix for weighted sum): $R^2$'s are .93, .93, .96 respectively. Unclear why complex noise halves the error for cell types that are actually present. However, as expected, noise makes Cibersort less certain that other cell types are not there. (TODO: need to double check this because might not have been clipping properly)
* Raw samples with no noise, with simple noise (a Gaussian added to each gene), with complex noise (a Gaussian added to each element of the weights matrix for weighted sum): $R^2$'s are TODO, respectively. The actual cell types consistently have poor deconvolution, especially Tregs, because there are many similar cell types. Error in the incorrect columns is relatively consistent regardless of noise. P value still very good.
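
The two noise models above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the numbers reported here. The estimator is a plain least-squares fit with clipping and renormalization, used as a stand-in and not Cibersort's regression. $R^2$ is taken to be the squared Pearson correlation between true and estimated fractions, which these notes do not specify. The basis matrix is random, and all function names are hypothetical.

```python
import numpy as np

def simple_noise(mixtures, sd, rng):
    """Add independent Gaussian noise to every gene of every mixture."""
    return mixtures + rng.normal(0.0, sd, size=mixtures.shape)

def complex_noise(basis, weights, sd, rng):
    """Perturb each element of the weights matrix, then form the weighted sum."""
    noisy_weights = weights + rng.normal(0.0, sd, size=weights.shape)
    return basis @ noisy_weights

def deconvolve(basis, mixtures):
    """Stand-in estimator: least squares, clipped at zero and renormalized per mixture."""
    est, *_ = np.linalg.lstsq(basis, mixtures, rcond=None)
    est = np.clip(est, 0.0, None)
    return est / est.sum(axis=0, keepdims=True)

def r_squared(truth, estimate):
    """Squared Pearson correlation between flattened true and estimated fractions."""
    r = np.corrcoef(truth.ravel(), estimate.ravel())[0, 1]
    return r ** 2

rng = np.random.default_rng(4)
n_genes, n_types, n_mixtures = 400, 5, 20
basis = rng.uniform(0.0, 10.0, size=(n_genes, n_types))   # synthetic reference profiles

# True weights: only the first two cell types are present in each mixture.
weights = np.zeros((n_types, n_mixtures))
weights[0] = rng.uniform(0.1, 0.9, size=n_mixtures)
weights[1] = 1.0 - weights[0]

scenarios = {
    "no noise": basis @ weights,
    "simple noise": simple_noise(basis @ weights, sd=1.0, rng=rng),
    "complex noise": complex_noise(basis, weights, sd=0.05, rng=rng),
}
scores = {name: r_squared(weights, deconvolve(basis, m)) for name, m in scenarios.items()}
for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Whether negative estimates are clipped before scoring changes the error attributed to absent cell types, so the clipping step is kept explicit in `deconvolve`.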

Next, two similar cell types: naive and memory B cells.
* Reference profiles: 0.99, 0.99, 0.836
* Raw cell lines: 0.943, 0.946, 0.932





# Does RNAseq behave differently from microarray?



# Remaining things to read

* How are superset profiles made? Average over all types of T cells?
* Implementation of Cibersort regression
* Implementation of CellMix Abbas linear regression

