When using molecular cross-validation (MCV), one may fit the model to many random splits of the data. How much does the self-supervised (ss) loss depend on the split? How much does the ground-truth (gt) loss?

One may also average the denoised outputs obtained from many splits. The (relative) quality of the averaged denoiser should be given by the average ss-loss minus the variance of the outputs across splits.
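The reasoning behind this is the usual bias-variance identity, written out here as a sketch. The notation is introduced only for illustration: $d_i$ is the denoised output from split $i$, $\bar d$ is their average over $n$ splits, and $t$ is a fixed target.

$$
\frac{1}{n}\sum_{i=1}^{n} \lVert d_i - t \rVert^2
= \lVert \bar d - t \rVert^2 + \frac{1}{n}\sum_{i=1}^{n} \lVert d_i - \bar d \rVert^2,
\qquad \bar d = \frac{1}{n}\sum_{i=1}^{n} d_i .
$$

The identity is exact when $t$ is the same for every split, as with the ground truth. For the ss-loss the held-out target changes with each split, so the relation is presumably only approximate there.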

In [1]:
# %load_ext autoreload
# %autoreload 2
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
import numpy as np
import scanpy as sc
import os
from util import normalize_rows, mse, expected_sqrt, expected_log1p, poisson_log_lik
import pickle

from sklearn.utils.extmath import randomized_svd

In [3]:
downsample_to = 3000
min_counts_per_gene = 500

In [4]:
data_file = '/Users/josh/src/noise2self-single-cell/data/neurons/neurons_deep.h5ad'
data = sc.read(data_file)

In [5]:
sc.pp.filter_genes(data, min_counts=min_counts_per_gene)
data_down = sc.pp.downsample_counts(data, downsample_to, replace = False, copy = True)

In [6]:
x = np.array(data_down.X.todense())

In [7]:
x1 = np.random.binomial(x, 0.5)
x2 = x - x1

In [8]:
y = np.array(data.X.todense())
mean = y/y.sum(axis = 1, keepdims = True) * downsample_to/2
z = expected_sqrt(mean)

In [9]:
expected_sqrt_half_means = z
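
The helper `expected_sqrt` comes from the local `util` module, whose source is not shown here. Assuming it returns $E[\sqrt{X}]$ for $X \sim \mathrm{Poisson}(\text{mean})$, a minimal stand-in could sum the series directly. The function name `expected_sqrt_poisson` and the truncation rule are hypothetical choices made for this sketch, not the notebook's implementation.

```python
import numpy as np

def expected_sqrt_poisson(mu, tail_sds=12.0):
    """Hypothetical stand-in: E[sqrt(X)] for X ~ Poisson(mu), by direct summation."""
    if mu <= 0:
        return 0.0
    # Truncate the sum well past the bulk of the distribution.
    k_max = int(np.ceil(mu + tail_sds * np.sqrt(mu) + 20))
    ks = np.arange(k_max + 1)
    # log(k!) from cumulative sums of logs, which avoids overflow.
    log_fact = np.concatenate([[0.0], np.cumsum(np.log(np.arange(1, k_max + 1)))])
    log_pmf = ks * np.log(mu) - mu - log_fact
    return float(np.sum(np.sqrt(ks) * np.exp(log_pmf)))

small = expected_sqrt_poisson(0.1)
large = expected_sqrt_poisson(100.0)
print(small, np.sqrt(0.1))
print(large, np.sqrt(100.0))
```

By Jensen's inequality the expected square root lies below the square root of the mean, and the gap matters most at the low means typical of sparse count data. That is presumably why the ground truth `z` is built with `expected_sqrt(mean)` rather than `np.sqrt(mean)`.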

## PCA

In [10]:
k_opt = 19

For each split $X = X_1 + X_2$, we compute a rank-`k_opt` PCA (truncated SVD) of $\sqrt{X_1}$ and record the quality of the resulting denoised output.
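
As a small illustration of one such split, the sketch below thins a toy count matrix binomially and projects the square-rooted half onto its top singular vectors. The toy data, the rank `k = 5`, and the use of `np.linalg.svd` in place of `randomized_svd` are assumptions made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix (cells x genes); stands in for the real data.
x = rng.poisson(2.0, size=(40, 30))

# Binomial thinning: each molecule goes to x1 with probability 1/2.
x1 = rng.binomial(x, 0.5)
x2 = x - x1

# Rank-k projection of sqrt(x1) onto its top-k right singular vectors.
k = 5
a = np.sqrt(x1)
_, _, vt = np.linalg.svd(a, full_matrices=False)
denoised = a @ vt[:k].T @ vt[:k]

print(np.array_equal(x1 + x2, x))            # the two halves sum back to x
print(np.linalg.matrix_rank(denoised) <= k)  # the output has rank at most k
```

Because the two halves are drawn by thinning the same counts, `x2` can serve as a held-out target for a model fit on `x1`.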

In [11]:
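n_splits = 12
ss_losses = []
gt_losses = []

# Running sums of the denoised outputs and their squares, used below
# to compute the per-entry mean and variance across splits.
accumulator = np.zeros(x.shape)
accumulator_sq = np.zeros(x.shape)
for i in range(n_splits):
    print("Computing ", i)
    np.random.seed(i)  # a different, reproducible split for each i
    x1 = np.random.binomial(x, 0.5)
    x2 = x - x1
    # Project sqrt(x1) onto its top k_opt right singular vectors.
    U, S, V = randomized_svd(np.sqrt(x1), 50)
    denoised = np.sqrt(x1).dot(V[:k_opt,:].T).dot(V[:k_opt,:])
    accumulator += denoised
    accumulator_sq += denoised**2
    ss_losses.append(mse(denoised, x2))  # loss against the held-out half
    gt_losses.append(mse(denoised, z))   # loss against the ground truth
average = accumulator/n_splits
var = accumulator_sq/n_splits - average**2
gt_loss = mse(average, z)  # ground-truth loss of the averaged output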

Computing  0
Computing  1
Computing  2
Computing  3
Computing  4
Computing  5
Computing  6
Computing  7
Computing  8
Computing  9
Computing  10
Computing  11


In [12]:
# The gain from averaging equals the mean variance of the denoised outputs across splits
np.allclose(var.mean() + gt_loss, np.mean(gt_losses))

True
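
The same check can be reproduced on synthetic data. This sketch assumes that `mse` is the mean of squared differences (the `util` source is not shown). The random `target` and `outputs` arrays are stand-ins for `z` and the per-split denoised outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=(20, 10))  # stands in for z
# Twelve hypothetical denoised outputs scattered around the target.
outputs = target + rng.normal(scale=0.3, size=(12, 20, 10))

average = outputs.mean(axis=0)
variance = (outputs ** 2).mean(axis=0) - average ** 2

mean_individual_loss = np.mean([np.mean((d - target) ** 2) for d in outputs])
averaged_loss = np.mean((average - target) ** 2)

# Mean individual loss = loss of the average + mean variance across outputs.
print(np.allclose(variance.mean() + averaged_loss, mean_individual_loss))
```

Since the variance term is non-negative, the averaged output can never have a larger loss against a fixed target than the mean of the individual losses.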

In [13]:
print("SS Losses")
print([np.round(loss, 3) for loss in sorted(ss_losses)])
print("GT Losses")
print([np.round(loss, 5) for loss in sorted(gt_losses)])
print("Accumulator GT Loss")
print(np.round(gt_loss, 5))
print("Gain from averaging")
print(np.round(var.mean(), 5))

SS Losses
[0.73, 0.732, 0.733, 0.733, 0.734, 0.736, 0.736, 0.736, 0.736, 0.737, 0.737, 0.738]
GT Losses
[0.0055, 0.00551, 0.00551, 0.00551, 0.00551, 0.00551, 0.00551, 0.00551, 0.00551, 0.00552, 0.00552, 0.00552]
Accumulator GT Loss
0.00511
Gain from averaging
0.0004


Note that the denoisers from each split have very similar ground-truth performance: the GT losses differ only in the third significant digit (about 0.00550 to 0.00552). The self-supervised loss varies more across splits, ranging from 0.730 to 0.738.
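
One way to put the two spreads on a common scale is the range divided by the mean. The sketch below applies it to the rounded values printed above, so the GT figure in particular is coarse. The helper `relative_range` is a name introduced here for illustration.

```python
import numpy as np

# Rounded losses copied from the printed output above.
ss = np.array([0.73, 0.732, 0.733, 0.733, 0.734, 0.736,
               0.736, 0.736, 0.736, 0.737, 0.737, 0.738])
gt = np.array([0.0055, 0.00551, 0.00551, 0.00551, 0.00551, 0.00551,
               0.00551, 0.00551, 0.00551, 0.00552, 0.00552, 0.00552])

def relative_range(values):
    """Spread of the values as a fraction of their mean."""
    return (values.max() - values.min()) / values.mean()

rel_ss = relative_range(ss)
rel_gt = relative_range(gt)
print(rel_ss, rel_gt)
```

On these rounded numbers the relative spread of the ss-loss is roughly three times that of the gt loss.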