In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
import numpy as np
import pandas as pd
import scanpy.api as sc

examples.directory is deprecated; in the future, examples will be found relative to the 'datapath' directory.
  "found relative to the 'datapath' directory.".format(key))


In [5]:
import anndata

In [6]:
import sys

In [7]:
from scipy.stats import rankdata
import scipy.sparse as spsp

## Initial data loading and exploration

Load the data from the `h5` file.  As written in the tutorial notebooks, expect the loaded object to take about 32 GB of memory (though peak usage while loading everything might be around 70 GB).

In [5]:
%%time
adata = sc.read_10x_h5("1M_neurons_filtered_gene_bc_matrices_h5.h5")

Variable names are not unique. To make them unique, call `.var_names_make_unique`.


CPU times: user 1min 49s, sys: 14.4 s, total: 2min 4s
Wall time: 2min 9s


In [6]:
adata.var_names_make_unique()

In [7]:
sc.logging.print_memory_usage()

Memory usage: current 31.28 GB, difference +31.28 GB


This was already run once.

In [16]:
#adata.write(filename="1M-full.h5ad", force_dense=False)

In [15]:
type(adata)

anndata.base.AnnData

In [25]:
type(adata.X)

scipy.sparse.csr.csr_matrix

In [26]:
adata.X.dtype

dtype('float32')

Density (the fraction of entries that are nonzero, in the unfiltered 1306127 × 27998 matrix):

In [39]:
2624828308/(1306127 * 27998)

0.0717775259310603

In [29]:
adata.X[0:10,0:10].astype('int32').dtype

dtype('int32')

## Why does this take so much memory?

Below is the full size of the `adata.X` object: 31508388720 bytes, or 31.5 GB.

In [35]:
adata.X.data.nbytes + adata.X.indptr.nbytes + adata.X.indices.nbytes

31508388720

Both the `indptr` and `indices` arrays have type `int64`, and I don't want to mess with those.  Thus, I'm not getting back any of that memory.

In [34]:
adata.X.indices.dtype

dtype('int64')

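I could change the data to be of type `int16` (the largest count, checked below, fits).  This would only save me about 5 GB, however, since the values in `adata.X` are stored as `float32` (4 bytes each) and account for only about 10.5 GB of the total.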

In [24]:
adata.X.max()

10528.0

In [36]:
adata.X.data.nbytes/10**9

10.499313232

In [38]:
adata.X.dtype

dtype('float32')
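
The byte counts above can be reproduced from the matrix dimensions alone. This is a minimal sketch, assuming the usual CSR layout (one value and one column index per stored entry, plus one row pointer per row and a final sentinel). The helper name `csr_nbytes` is made up for this illustration.

```python
def csr_nbytes(n_rows, nnz, data_itemsize, index_itemsize):
    """Bytes used by the three arrays of a CSR matrix (data, indices, indptr)."""
    data = nnz * data_itemsize
    indices = nnz * index_itemsize
    indptr = (n_rows + 1) * index_itemsize
    return data + indices + indptr

N_CELLS = 1306127    # rows of adata.X
NNZ = 2624828308     # stored entries of adata.X

current = csr_nbytes(N_CELLS, NNZ, data_itemsize=4, index_itemsize=8)   # float32 values, int64 indices
as_int16 = csr_nbytes(N_CELLS, NNZ, data_itemsize=2, index_itemsize=8)  # int16 values, int64 indices

print(current)                      # total bytes in the current layout
print((current - as_int16) / 1e9)   # GB saved by storing the values as int16
```

About two thirds of the footprint is the `int64` column indices. Note that the number of stored entries here is larger than $2^{31} - 1$, so the row pointers could not be stored as 32-bit integers anyway.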

This probably loads things as `float64`.  In any case, the following uses over 56 GB of memory (so don't run it!).

In [5]:
%%time
adata= anndata.read_h5ad("1M-full.h5ad")

CPU times: user 2min 47s, sys: 1min 22s, total: 4min 9s
Wall time: 4min 9s


On `fluxm`, this ran.  It peaks at about 75GB of memory, but then the final object is actually slightly smaller (30GB).  Strange.

In [7]:
adata.X.data.nbytes + adata.X.indptr.nbytes + adata.X.indices.nbytes

31508388720

Same number of bytes as after the original load, so the matrix itself is unchanged.

## Do a little bit of preprocessing to maybe help

Just eliminate the genes that are never observed.

Strangely, this hits 90GB peak memory usage and levels off at 60GB afterwards.

In [8]:
%%time
sc.pp.filter_genes(adata, min_counts=1)  # keep only genes with at least 1 count

CPU times: user 1min 48s, sys: 35.1 s, total: 2min 23s
Wall time: 2min 23s


In [9]:
sc.logging.print_memory_usage()

Memory usage: current 60.62 GB, difference +29.33 GB


We only filter out around 4,000 genes (27998 down to 24015).  These genes have no counts in any of the 1.3 million cells, though, so... that's not bad at all.  Of course, it shouldn't reduce the storage costs since we were only eliminating 0s.

In [10]:
adata.X.shape

(1306127, 24015)

In [12]:
adata.X.eliminate_zeros()

In [13]:
adata.X.nnz

2624828308

## Get the clustering

Here I will use the graph based clustering that is provided by 10x as well as the graph based clustering generated by the scanpy team (https://github.com/theislab/scanpy_usage/tree/master/170522_visualizing_one_million_cells).  It is easy to obtain these clusterings.

In [12]:
louvain = pd.read_csv("louvain.csv", dtype='category')
graphclust = pd.read_csv("analysis/clustering/graphclust/clusters.csv", dtype='category')

In [13]:
adata.obs['louvain'] = louvain['x']

In [14]:
adata.obs['louvain']

index
AAACCTGAGATAGGAG-1      14
AAACCTGAGCGGCTTC-1       1
AAACCTGAGGAATCGC-1       7
AAACCTGAGGACACCA-1       4
AAACCTGAGGCCCGTT-1       0
AAACCTGAGTCCGGTC-1      14
AAACCTGCAACACGCC-1      11
AAACCTGCACAGCGTC-1       5
AAACCTGCAGCCACCA-1       0
AAACCTGCAGGATTGG-1      11
AAACCTGCAGGCGATA-1       0
AAACCTGCATATGAGA-1      23
AAACCTGGTACAGCAG-1       1
AAACCTGGTATCAGTC-1      17
AAACCTGGTCTCTTTA-1      23
AAACCTGGTGGTCTCG-1      15
AAACCTGGTTGGAGGT-1      10
AAACCTGGTTTGTTTC-1      21
AAACCTGTCAATCACG-1      20
AAACCTGTCACGCGGT-1       1
AAACCTGTCCGCTGTT-1       0
AAACGGGAGAATTCCC-1       6
AAACGGGAGACCACGA-1       0
AAACGGGAGAGTCGGT-1       0
AAACGGGAGCACGCCT-1       1
AAACGGGAGTGTTGAA-1       0
AAACGGGAGTTTGCGT-1       1
AAACGGGCACAGGCCT-1       3
AAACGGGCACGGCCAT-1      14
AAACGGGCACGGTAGA-1      21
                        ..
TTTGGTTCATCACGTA-133    13
TTTGGTTCATCGGTTA-133    15
TTTGGTTGTCCATCCT-133     2
TTTGGTTGTCTAGTCA-133    19
TTTGGTTGTGCACTTA-133     8
TTTGGTTGTTCCCGAG-133  

In [30]:
adata.obs['graphclust'] = graphclust.astype('category')

Did this magically work?  It appears that it did...

In [12]:
wrong = 0
for name in louvain.index.values:
    if adata.obs.loc[name]['louvain'] != louvain.loc[name]['x']:
        wrong += 1
        
print(wrong)

0


In [21]:
%%time
wrong = 0
for name in louvain.index.values:
    if adata.obs.loc[name]['graphclust'] != graphclust.loc[name]['Cluster']:
        wrong += 1
        
print(wrong)

0
CPU times: user 5min 7s, sys: 6.38 s, total: 5min 13s
Wall time: 5min 13s


In [13]:
data.X.max()

10528.0

## Save the data 

Note that I never actually changed things to be `int16` in the `anndata` object.  I'm currently on `fluxm` with 160GB of memory so I'm not worried about it right now.

In [14]:
test = adata.X.astype('int16')

In [15]:
# save the dataset as integers
spsp.save_npz("1M-nzGenes-int16.npz", test)

In [22]:
adata.write("1M-nzGenes-clusts.h5ad")

Normalize the `AnnData` object and save it to a new file.  Only do this once and then load this new object.  You will need to redo this step if you want to change the library scaling factor.

In [28]:
%%time
sc.pp.normalize_per_cell(adata, counts_per_cell_after=10000, min_counts=0)
sc.pp.log1p( adata )

CPU times: user 2min 9s, sys: 1min 28s, total: 3min 38s
Wall time: 3min 38s
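
As a reminder of what the two calls above compute: each cell's counts are rescaled so that they sum to 10,000, and then $\log(1 + x)$ is applied entrywise. Here is a small dense NumPy sketch of the same arithmetic on toy data. It is only an illustration; scanpy itself works on the sparse matrix.

```python
import numpy as np

counts = np.array([[1.0, 3.0, 0.0],
                   [0.0, 5.0, 5.0]])           # toy cells x genes count matrix

target = 10000.0                               # plays the role of counts_per_cell_after
per_cell = counts.sum(axis=1, keepdims=True)   # total counts in each cell
scaled = counts / per_cell * target            # every row now sums to `target`
normalized = np.log1p(scaled)                  # natural log of (1 + x); zeros stay zero
```

Because zeros stay zero under both steps, the sparsity pattern of the matrix is unchanged by this normalization.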


In [29]:
adata.write("1M-10knorm-clusts.h5ad")

In [35]:
adata

AnnData object with n_obs × n_vars = 1306127 × 24015 
    obs: 'louvain', 'graphclust', 'n_counts'
    var: 'gene_ids', 'n_counts', 'ind'

## Make and save folds, run 1bcs

In [8]:
import sys
sys.path.append('/home/ahsvargo/xvalid')

In [9]:
from picturedrocks import Rocks
from picturedrocks.performance import FoldTester, PerformanceReport, NearestCentroidClassifier

In [11]:
adata = sc.read_h5ad("1M-nzGenes-clusts.h5ad")

These folds are created below (see the commented-out `ft.makefolds` call).

In [12]:
folds = np.load("10x-5folds.npz")
folds = [folds["fold{}".format(i)] for i in range(5)]

In [13]:
yVec = np.array([int(val) for val in adata.obs['louvain']])

In [15]:
yVec.dtype

dtype('int64')

Largest cluster label (labels run from 0 to 38, so there are 39 clusters):

In [16]:
yVec.max()

38

In [17]:
data = Rocks(adata.X, yVec)

In [18]:
data.X = data.X.tocsc()
data.cs_currX = data.X

I killed the following cell after 5 hours.  These all take a long time.

In [16]:
%%time
data.CSrankMarkers(lamb=2.0)

Converted to csc matrix for generating consts


KeyboardInterrupt: 

In [38]:
ft = FoldTester(data)

In [40]:
#ft.makefolds(k=5, random=True)
#ft.savefolds("./10x-5folds.npz")

In [None]:
ft.loadfolds("./10x-5folds.npz")
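
The contents of `10x-5folds.npz` come from `FoldTester.makefolds`, whose implementation is not shown in this notebook. As a rough sketch of what a random 5-fold split looks like (an assumption about the general technique, not the `picturedrocks` code), the hypothetical helper below uses the same `fold0` through `fold4` keys that the loading cells expect:

```python
import numpy as np

def make_folds(n_cells, k=5, seed=0):
    """Randomly partition the cell indices 0..n_cells-1 into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_cells)
    parts = np.array_split(shuffled, k)   # sizes differ by at most one
    return {"fold{}".format(i): np.sort(part) for i, part in enumerate(parts)}

folds_demo = make_folds(n_cells=23, k=5)
# np.savez("demo-folds.npz", **folds_demo) would write one array per key.
```

Each cell index lands in exactly one fold, so every cell is held out exactly once across the five folds.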

## Optimize the 1bcs method

With optimizations to the `sparse_tau_dot` method, the following (100 genes for one cluster) should take about 7.5 seconds (down from 17 seconds).  There are 24015 genes, or about 240 batches of 100, so each cluster takes about $7.5 \times 240/60 = 30$ minutes on one core.  That's a long time, but acceptable.  We can run on 10 cores to finish a fold (all 39 clusters) in less than 2 hours.
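
The estimate above, written out. The profiled cell below times 100 genes for one cluster, and there are 24015 genes, hence roughly 240 batches. The per-fold figure assumes perfect load balancing across cores, which is an idealization.

```python
def minutes_per_cluster(seconds_per_batch, n_genes, batch_size=100):
    """Single-core time for one cluster, extrapolated from one timed batch of genes."""
    n_batches = n_genes / batch_size
    return seconds_per_batch * n_batches / 60

def minutes_per_fold(per_cluster_minutes, n_clusters, n_cores):
    """Ideal wall time for all clusters of one fold, spread evenly over the cores."""
    return per_cluster_minutes * n_clusters / n_cores

per_cluster = minutes_per_cluster(seconds_per_batch=7.5, n_genes=24015)
per_fold = minutes_per_fold(per_cluster, n_clusters=39, n_cores=10)

print(per_cluster)   # about 30 minutes
print(per_fold)      # a bit under 2 hours
```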

In [120]:
%%prun
shoop = set(data.clusterindices[0])
for gene in range(300,400):
    data.sparse_tau_dot(data.cs_currX.getcol(gene), data.clusterindices[0], highval=1, lowval=-1, dim=data.N, setindices=shoop)

 

The three cells below show that converting from `set` to `list` is much faster than converting from `list` to `set` (about 0.2 s versus 0.7 s per 100 conversions).  Both conversions take a significant amount of time, however: one extra `set` to `list` conversion per gene will take $0.2 \times 240/60 = 0.8$ minutes per cluster.  So we definitely want to avoid these conversions.

In [None]:
shoop = set(data.clusterindices[0])

In [101]:
%%time
for i in range(100):
    sutff = list(shoop)

CPU times: user 215 ms, sys: 0 ns, total: 215 ms
Wall time: 213 ms


In [102]:
%%time
for i in range(100):
    shoop = set(sutff)

CPU times: user 697 ms, sys: 973 µs, total: 698 ms
Wall time: 697 ms


Strangely, it looked like `sparse_tau_dot` was faster with floating-point values than with integers.  My hypothesis was that this was because memory allocation is costly; thus, I attempted to eliminate some of the memory allocation in `sparse_tau_dot` to speed it up.  This may be an avenue for further optimization if necessary, however.

In [30]:
yum = data.cs_currX.getcol(0).astype('int16')
bum = data.cs_currX.getcol(0)

`first` holds the constants for the first 100 genes in fold 0, from the first time that we generated them.  I extensively verified that `sparse_tau_dot` was correct when I first wrote it, so I am fairly confident that these values are correct.  Thus, use this to verify any changes that you make to `sparse_tau_dot`.

In [28]:
first = [6287.1794107465375,-1156.0851078572125,-6787.095057770366,-22144.221795874957,1193.7604637096958,-430.85144515221504,-25187.105995613114,-1598.4757607790975,-9667.266758859945,-53058.696381394526,-2116.6662213823784,10457.908788890396,-5904.591847924456,-3232.794370499311,31369.298954643808,-12938.947289044472,-694.9763999804626,-10746.409969135033,4490.151305822765,1932.4752505991335,-292.3169507701047,666.515683258755,-3227.190040248823,-24173.84256549451,25918.590574471884,-19138.071709879558,8079.593223863857,-3480.6179521187005,-20450.684238353882,-7303.30352495926,-29211.639080758032,-2810.853183706404,-5626.9584650570705,-15027.194683722355,-2584.903122562866,-5105.47518303952,-2274.2242906937195,-37430.52932793371,-11412.940070032539,-376.6036678671628,-413.39891279867606,-12109.198063790387,25269.97514995728,97.81837841203243,-5740.89436587098,-511.38380831564575,-31868.03893210871,-19764.458447066027,-876.9562236731431,-42446.091542626746,-4747.658002492089,-10594.463487845656,-1446.9211204563023, 19701.114340243454, -24482.705612243342, -5202.773130721975, 4943.660113173941, -44058.51956457996, -11333.088030161065, 896.7601089169984, 28587.714720870674, -338.2035489702048, 5422.635802236747, -18722.14151856173, -1368.282689957625, 11895.581465375903, -14343.966588034818, -13490.50780948552, -462.1942123548548, 29269.931164587742, 495.0380816230479, -3519.1422656841296, -1403.7217502495694, -3327.5978996248564, -6263.393733641591, -2680.1119647866003, -1205.2684287834686, -2364.6279939842952, -292.3169507701047, -3306.9569958831707, -3640.7149177365495, -4220.812230901144, -1093.7607644987204, -413.39891279867606, -358.01382332843866, -1720.991048557724, -56828.668642589415, -4362.617457458707, -29802.371039389946, -6151.944153763376, -25740.104609909955, -773.4015085608092, -20474.849790891767, -2038.6636118451997, -900.9872157797056, -716.0301136056983, -1012.6241495081275, -10437.934458573014, -1580.772107753235, 17860.117832102016]

The calculation which generated `first`.  I'm keeping it in case I messed up `first` when I removed all of the extra newlines.

In [68]:
list(map(fudge, range(100)))

[6287.1794107465375,
 -1156.0851078572125,
 -6787.095057770366,
 -22144.221795874957,
 1193.7604637096958,
 -430.85144515221504,
 -25187.105995613114,
 -1598.4757607790975,
 -9667.266758859945,
 -53058.696381394526,
 -2116.6662213823784,
 10457.908788890396,
 -5904.591847924456,
 -3232.794370499311,
 31369.298954643808,
 -12938.947289044472,
 -694.9763999804626,
 -10746.409969135033,
 4490.151305822765,
 1932.4752505991335,
 -292.3169507701047,
 666.515683258755,
 -3227.190040248823,
 -24173.84256549451,
 25918.590574471884,
 -19138.071709879558,
 8079.593223863857,
 -3480.6179521187005,
 -20450.684238353882,
 -7303.30352495926,
 -29211.639080758032,
 -2810.853183706404,
 -5626.9584650570705,
 -15027.194683722355,
 -2584.903122562866,
 -5105.47518303952,
 -2274.2242906937195,
 -37430.52932793371,
 -11412.940070032539,
 -376.6036678671628,
 -413.39891279867606,
 -12109.198063790387,
 25269.97514995728,
 97.81837841203243,
 -5740.89436587098,
 -511.38380831564575,
 -31868.03893210871,
 -

A test, also showing how to use `map` with a class method.

In [94]:
shoop = set(data.clusterindices[5])
test = list(map( lambda x: Rocks.sparse_tau_dot(data, data.cs_currX.getcol(x), data.clusterindices[5], highval=1, lowval=-1, dim=data.N, setindices=shoop), range(10)))

In [95]:
test

[23612.663500827835,
 -157.78790586751435,
 -3679.604291596297,
 -11684.100903309936,
 933.3103233016669,
 6376.044582804206,
 -14587.302215939357,
 -1405.4648778217668,
 10173.026531194588,
 -22718.731700285454]

In [50]:
np.array(list(test)) - np.array(first[:10])

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [27]:
shoop = set(data.clusterindices[0])
def fudge(gene):
    return data.sparse_tau_dot(data.cs_currX.getcol(gene), data.clusterindices[0], highval=1, lowval=-1, dim=data.N, setindices=shoop)

In [30]:
%%time
np.array(list(map(fudge, range(100)))) - np.array(first)

KeyboardInterrupt: 

One more small bit of optimization - this is now gone from the `sparse_tau_dot` function, but I like seeing the timing here.  Iterating with `enumerate` is faster than indexing into the array with `range` (about 2.0 ms versus 3.7 ms below).

In [34]:
%%time

rawr = data.X.getcol(0)
slurp = set(data.clusterindices[0])
#indicesClust = set(rawr.indices) & set(data.clusterindices[0])
inds = [i for i, ind in enumerate(rawr.indices) if ind in slurp]

CPU times: user 7.48 s, sys: 14.9 ms, total: 7.49 s
Wall time: 7.48 s


In [32]:
%%time
inds = [i for i, ind in enumerate(rawr.indices) if ind in slurp]

CPU times: user 2.02 ms, sys: 2 µs, total: 2.02 ms
Wall time: 2.03 ms


In [33]:
%%time
inds = [i for i in range(rawr.indices.shape[0]) if rawr.indices[i] in slurp]

CPU times: user 3.65 ms, sys: 0 ns, total: 3.65 ms
Wall time: 3.66 ms
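
The membership test timed above can also be written without a Python-level loop. This is a sketch on toy arrays using `np.isin`. Whether it beats the set-based comprehension on the real columns is untested here, and `col_rows` and `cluster_rows` are stand-ins for `rawr.indices` and `data.clusterindices[0]`.

```python
import numpy as np

col_rows = np.array([2, 5, 7, 11, 13, 20])       # row indices stored in one sparse column (toy values)
cluster_rows = np.array([0, 5, 6, 13, 20, 21])   # cells belonging to one cluster (toy values)

# Set-based version, as in the cells above.
lookup = set(cluster_rows.tolist())
inds_loop = [i for i, ind in enumerate(col_rows) if ind in lookup]

# Vectorized version: positions within the column whose row index is in the cluster.
inds_vec = np.flatnonzero(np.isin(col_rows, cluster_rows))
```

Both versions return positions into the column's `indices` array (not the row indices themselves), so the result can index the column's `data` array directly.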


## Setting up a multiprocessing pool to use with 1bcs

In [8]:
from pathos.multiprocessing import ProcessingPool

In [9]:
p = ProcessingPool(10)

Working on cluster 0
Working on cluster 1
Working on cluster 3
Working on cluster 2
Working on cluster 4
Working on cluster 5
Working on cluster 6
Working on cluster 7
Working on cluster 8
Working on cluster 9
Working on cluster 11
Working on cluster 14
Working on cluster 12
Working on cluster 10
Working on cluster 15
Working on cluster 13
Working on cluster 17
Working on cluster 20
Working on cluster 16
Working on cluster 23
Working on cluster 21
Working on cluster 24
Working on cluster 19
Working on cluster 22
Working on cluster 18
Working on cluster 25
Working on cluster 27
Working on cluster 29
Working on cluster 28
Working on cluster 30
Working on cluster 26
Working on cluster 31
Working on cluster 32
Working on cluster 34
Working on cluster 35
Working on cluster 33
Working on cluster 37
Working on cluster 36
Working on cluster 38
Working on cluster 0
Working on cluster 9
Working on cluster 1
Working on cluster 2
Working on cluster 3
Working on cluster 4
Working on cluster 5
Worki

You don't want to use the `multiprocessing` library.  It is buggy.  Below is a quick example. 

In [1]:
from multiprocessing import Pool
p = Pool(9)

In [5]:
p.terminate()

In [6]:
p.join()

In [7]:
p.close()

Sometimes the pool seems to malfunction.  Not sure why.  Hitting a combination of the following three lines seems to help.  See [this StackOverflow post](https://stackoverflow.com/questions/36403855/cant-stop-kill-all-processes-at-once-produced-by-multiprocessing-pool) or [the Pool documentation](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool).

In [23]:
p.close()

In [24]:
p.terminate()

In [25]:
del(p)

In [179]:
def sparse_tau_dot2(gene):
    return 2

In [180]:
list(p.map(sparse_tau_dot2, range(10)) )

[18889.419300916117,
 -615.702405483076,
 -2975.7100592428496,
 -9418.955244524006,
 929.5238908126449,
 30971.1261580768,
 -7683.496238822861,
 -1115.0174725527372,
 23865.777979693157,
 -18444.97090145795]

In [18]:
p.terminate()

Working on cluster 36
Working on cluster 37
Working on cluster 38


In [20]:
del(p)

In [17]:
p.close()
p.join()
p.terminate()
del(p)

KeyboardInterrupt: 

In [182]:
p=Pool(9)

In [183]:
def sparse_tau_dot2(gene):
    return 1

In [184]:
list(p.map(sparse_tau_dot2, range(10)) )

[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

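Buggy.  This is just wrong: the pool still returns the value from the old definition of `sparse_tau_dot2`.  A likely cause is that `Pool(9)` forked its workers before the function was redefined, and the workers look the function up by name in their own copy of the notebook namespace.  Make sure to use pathos.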

Turns out that it's all buggy.  You actually want ipyparallel.  That's not here right now though.
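
One possible explanation for the stale results above (an assumption, not something verified in this notebook): the standard `pickle` module serializes a function as a reference, meaning its module and name, and not its body. A worker process then resolves that name in its own namespace, which under the fork start method is a snapshot taken when the pool was created. The single-process sketch below only demonstrates the by-reference part. The module `notebook_namespace` and the helper `define_in_notebook` are hypothetical stand-ins for the notebook's `__main__`.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the notebook's __main__ namespace.
notebook = types.ModuleType("notebook_namespace")
sys.modules[notebook.__name__] = notebook

def define_in_notebook(name, return_value):
    """Define (or redefine) a function called `name` inside the stand-in module."""
    def func(gene):
        return return_value
    func.__name__ = name
    func.__qualname__ = name
    func.__module__ = notebook.__name__
    setattr(notebook, name, func)
    return func

old = define_in_notebook("sparse_tau_dot2", 2)
payload = pickle.dumps(old)                      # roughly what gets sent to a worker

# The payload carries the function's name, not its body.
names_only = b"sparse_tau_dot2" in payload

# Redefine the function under the same name, as in the cells above.
define_in_notebook("sparse_tau_dot2", 1)

# Unpickling resolves the name in the *current* namespace of the unpickling process.
resolved = pickle.loads(payload)
result = resolved(0)
```

Here the unpickling process sees the new definition, so `result` comes from the redefined function. A worker forked before the redefinition would resolve the same name to the old one. Under this reading, creating the pool after all function definitions (and after any globals they use) are final should avoid the problem.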

---

Here's a test - the time it takes to generate the constants for cluster 0.  We need the definition of the function fudge so that we can map (can't `p.map` with a `lambda` function: see [here](https://stackoverflow.com/questions/4827432/how-to-let-pool-map-take-a-lambda-function)).

Edit: now using the `pathos` library, using a `lambda` function shouldn't be a problem.  But I've already got the framework below, so I'm going to keep making extra function definitions for now.

In [190]:
shoop = set(data.clusterindices[0])
def fudge(gene):
    return data.sparse_tau_dot(data.cs_currX.getcol(gene), data.clusterindices[0], highval=1, lowval=-1, dim=data.N, setindices=shoop)

In [191]:
%%time

stuff = list(p.map(fudge, range(data.P)))

CPU times: user 613 ms, sys: 338 ms, total: 950 ms
Wall time: 3min 24s


`pathos` doesn't slow us down

### Run 1bcs on one fold and save the constants

In [50]:
%%time
foldNum = 4

mask = np.zeros(data.N, dtype=bool)
mask[folds[foldNum]] = True
foldData = Rocks(data.X[~mask], data.y[~mask], verbose=1)
foldData.cs_currX = foldData.X

print("Loaded data for fold {}, starting generation of constants".format(foldNum), flush=True)

def clustConsts4(clust):
    print("Working on cluster {}".format(clust), flush=True)
    setindices = set(foldData.clusterindices[clust])
    
    rankvec = rankdata(foldData.clust2vec(clust+1))
    rankvec = rankvec - rankvec.mean()
    
    consts = list(
        map( lambda x: Rocks.sparse_tau_dot(
            foldData, 
            foldData.cs_currX.getcol(x), 
            foldData.clusterindices[clust], 
            #highval=1, 
            #lowval=-1, 
            highval=rankvec.max(), 
            lowval=rankvec.min(), 
            dim=foldData.N, 
            setindices=setindices
            ),
            range(data.P)
        )
    )
    
    return np.array(consts)



Loaded data for fold 4, starting generation of constants
CPU times: user 2min 19s, sys: 3min 6s, total: 5min 25s
Wall time: 6min 16s


In [22]:
%%time
foldNum = 0

mask = np.zeros(data.N, dtype=bool)
mask[folds[foldNum]] = True
foldData = Rocks(data.X[~mask], data.y[~mask], verbose=1)
foldData.cs_currX = foldData.X

CPU times: user 1min 27s, sys: 21.1 s, total: 1min 48s
Wall time: 1min 48s
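
The masking pattern used in the cell above, on toy arrays. This sketch assumes that `folds[foldNum]` holds the indices of the held-out cells, so `~mask` selects the training cells. The variable names here are illustrative.

```python
import numpy as np

n_cells = 10
test_idx = np.array([1, 4, 7])                    # plays the role of folds[foldNum]

mask = np.zeros(n_cells, dtype=bool)
mask[test_idx] = True                             # True marks the held-out cells

X = np.arange(n_cells * 2).reshape(n_cells, 2)    # toy data, one row per cell
y = np.arange(n_cells) % 3                        # toy cluster labels

X_train, y_train = X[~mask], y[~mask]             # what Rocks(...) receives above
X_test, y_test = X[mask], y[mask]                 # the held-out cells
```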


Could save about one second per cluster by computing `highval` and `lowval` in closed form instead of calling `rankdata` (compare the two timed cells below)... don't think that it matters right now.

In [24]:
%%time
rankvec = rankdata(foldData.clust2vec(1))
rankvec = rankvec - rankvec.mean()
highval = rankvec.max()
lowval = rankvec.min()

CPU times: user 1.35 s, sys: 12.1 ms, total: 1.36 s
Wall time: 1.36 s


In [25]:
highval

475193.5

In [26]:
lowval

-47257.0

In [37]:
%%time
mu = (foldData.N + 1)/2
lowval = (foldData.N - foldData.clusterindices[0].shape[0] + 1)/2 - mu
highval = (foldData.N - foldData.clusterindices[0].shape[0]) +  (foldData.clusterindices[0].shape[0] + 1)/2 - mu

CPU times: user 12 µs, sys: 4 µs, total: 16 µs
Wall time: 23.1 µs


In [38]:
highval

475193.5

In [39]:
lowval

-47257.0
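
Why the closed form agrees with `rankdata`: for a 0/1 cluster indicator over $N$ cells with $m$ cells in the cluster, the $N - m$ zeros share the average rank $(N - m + 1)/2$ and the $m$ ones share $(N - m) + (m + 1)/2$. Subtracting the mean rank $(N + 1)/2$ gives $-m/2$ and $(N - m)/2$. Below is a small sketch checking this against a brute-force average ranking. The function names are made up, and the values of $N$ and $m$ in the last line are inferred from the outputs above, not read from the data.

```python
def centered_rank_extremes(n_cells, cluster_size):
    """Closed-form (lowval, highval) for the centered ranks of a 0/1 indicator."""
    mu = (n_cells + 1) / 2
    n_out = n_cells - cluster_size
    low = (n_out + 1) / 2 - mu                    # simplifies to -cluster_size / 2
    high = n_out + (cluster_size + 1) / 2 - mu    # simplifies to n_out / 2
    return low, high

def brute_force_extremes(indicator):
    """Average-rank the values (ties share the mean rank), center, return (min, max)."""
    n = len(indicator)
    order = sorted(range(n), key=lambda i: indicator[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        stop = start
        while stop < n and indicator[order[stop]] == indicator[order[start]]:
            stop += 1
        average = (start + 1 + stop) / 2          # mean of the ranks start+1 .. stop
        for pos in range(start, stop):
            ranks[order[pos]] = average
        start = stop
    mean = sum(ranks) / n
    centered = [r - mean for r in ranks]
    return min(centered), max(centered)

toy = [0, 1, 0, 0, 1, 0, 0]                       # 7 cells, 2 of them in the cluster
print(centered_rank_extremes(len(toy), sum(toy)))
print(brute_force_extremes(toy))

# N and m inferred from highval = (N - m)/2 and lowval = -m/2 in the outputs above.
print(centered_rank_extremes(1044901, 94514))
```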

In [21]:
%%time
zam = np.array(list( p.map(clustConsts3, range(6)) ))

KeyboardInterrupt: 

In [63]:
%%time
bam = np.array(list( p.map(clustConsts4, range(foldData.K)) ))

CPU times: user 8.24 s, sys: 6.07 s, total: 14.3 s
Wall time: 1h 44min 51s


In [77]:
bam.shape

(39, 24015)

In [73]:
p.terminate()

In [74]:
p.close()

In [75]:
p.join()

In [76]:
del(p)

In [37]:
bam[0,:10]

array([  4613.45871887,  -1184.32667381,  -5381.66758987, -17723.41775877,
          924.07810428,     92.51213699, -20087.73365042,  -1302.68201972,
        -7446.95110748, -42428.04132737])

In [65]:
np.savez("1M-fold4-rankConsts.npz", consts=bam)

In [71]:
diff = np.load("1M-fold4-rankConsts.npz")['consts']

In [39]:
diff.shape

(39, 24015)

In [70]:
np.nonzero(bam[0,:100] - salv)[0].shape

(0,)

In [72]:
np.nonzero(diff[0,:100] - salv)

(array([], dtype=int64),)

In [60]:
bam.shape

(39, 24015)

In [50]:
p.close()

In [69]:
p.close()

In [None]:
p.terminate()

In [None]:
del(p)

Older testing/fighting with various parallel processing libraries... don't need to run

In [10]:
fold_adata = sc.read_h5ad("scanpy/fold3/1M-fold3.h5ad")
yVec = np.array([int(val) for val in fold_adata.obs['louvain']])
sc.logging.print_memory_usage()

Memory usage: current 16.14 GB, difference +16.14 GB


In [11]:
fold_adata.X = fold_adata.X.tocsc()

In [12]:
foldData = Rocks(fold_adata.X, yVec)
foldData.cs_currX = foldData.X

In [20]:
foldData.K

39

In [16]:
foldData

<picturedrocks.rocks.Rocks at 0x2b0e22d505c0>

In [69]:
bam.shape

(39, 24015)

In [70]:
bam[0,:10] - stuff[2,0,:10]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

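This makes no sense.  I changed the NAME of the function and everything.  I don't understand.  (One possibility: the workers still hold the copy of the global `foldData` from when the pool was started, so renaming the function changes nothing.)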

In [68]:
%%time
setindices = set(foldData.clusterindices[0])

rankvec = rankdata(foldData.clust2vec(1))
rankvec = rankvec - rankvec.mean()

CPU times: user 1.54 s, sys: 155 ms, total: 1.69 s
Wall time: 1.68 s


In [69]:
%%time
salv = list(map( lambda x: Rocks.sparse_tau_dot(
        foldData, 
        foldData.cs_currX.getcol(x), 
        foldData.clusterindices[0], 
        highval=rankvec.max(), 
        lowval=rankvec.min(), 
        dim=foldData.N, 
        setindices=setindices
        ),
        range(100)
    ))

CPU times: user 5.77 s, sys: 310 ms, total: 6.08 s
Wall time: 6.05 s


In [54]:
rankvec = rankdata(foldData.clust2vec(1))
rankvec = rankvec - rankvec.mean()

foldData.sparse_tau_dot(
    foldData.cs_currX.getcol(0),
    foldData.clusterindices[0], 
    highval=rankvec.max(), 
    lowval=rankvec.min(), 
    dim=foldData.N
) - salv[0]

0.0

In [72]:
bam[0,:10]

array([  5192.89623578,   -899.79820618,  -5343.16963752, -17719.21974714,
         1126.11230423,   -287.79380852, -20193.80149437,  -1301.14327524,
        -8257.00716916, -42524.28747155])

In [41]:
stuff = [np.load("1M-fold{}-consts.npz".format(i))['consts'] for i in range(5)]

In [42]:
np.array(stuff)[:,0,:10]

array([[  5296.94113288,   -808.30365671,  -5490.02190093,
        -17647.31017347,    948.39699087,   -572.74427196,
        -20489.79414377,  -1400.25283949,  -7529.69551733,
        -42312.75312821],
       [  5125.9971411 ,   -917.52549398,  -5458.38318901,
        -17775.49146906,   1048.53275752,   -764.59207149,
        -20381.04450438,  -1248.98593285,  -7935.14396102,
        -42470.80942293],
       [  5192.89623578,   -899.79820618,  -5343.16963752,
        -17719.21974714,   1126.11230423,   -287.79380852,
        -20193.80149437,  -1301.14327524,  -8257.00716916,
        -42524.28747155],
       [  4613.45871887,  -1184.32667381,  -5381.66758987,
        -17723.41775877,    924.07810428,     92.51213699,
        -20087.73365042,  -1302.68201972,  -7446.95110748,
        -42428.04132737],
       [  4919.98950299,   -808.56829976,  -5474.74801012,
        -17711.53871073,    737.10764331,   -190.61711616,
        -19595.77306344,  -1142.23665154,  -7499.32093566,
        -42

In [43]:
stuff = np.array(stuff)

After that mess, let's do a check to make sure that we actually get some of the correct constants.

In [44]:
%%time

test = []

for foldNum in range(5):

    mask = np.zeros(data.N, dtype=bool)
    mask[folds[foldNum]] = True
    foldData = Rocks(data.X[~mask], data.y[~mask], verbose=1)
    foldData.cs_currX = foldData.X

    print("Loaded data for fold {}, starting generation of constants".format(foldNum), flush=True)
    setindices = set(foldData.clusterindices[0])
    
    test.append(
        list(
            map( lambda x: Rocks.sparse_tau_dot(
                foldData, 
                foldData.cs_currX.getcol(x), 
                foldData.clusterindices[0], 
                highval=1, 
                lowval=-1, 
                dim=foldData.N, 
                setindices=setindices
                ),
                range(100)
            )
        )
    )
    
test = np.array(test)



Loaded data for fold 0, starting generation of constants
Loaded data for fold 1, starting generation of constants
Loaded data for fold 2, starting generation of constants
Loaded data for fold 3, starting generation of constants
Loaded data for fold 4, starting generation of constants
CPU times: user 7min 2s, sys: 18min 45s, total: 25min 48s
Wall time: 21min 54s


In [49]:
np.nonzero(stuff[:,0,:100] - test)

(array([], dtype=int64), array([], dtype=int64))

Hooray!

Another potential parallelizing technique

In [218]:
import concurrent.futures

In [220]:
concurrent.futures.ProcessPoolExecutor

concurrent.futures.process.ProcessPoolExecutor

This doesn't work for some reason:

In [None]:
for clust in range(data.K):
    print("Working on cluster {}".format(clust), flush=True)
    
    setindices = set(foldData.clusterindices[clust])
    
    def sparse_tau_dot(gene):
        return Rocks.sparse_tau_dot(
            foldData, 
            foldData.cs_currX.getcol(gene), 
            foldData.clusterindices[clust], 
            highval=1, 
            lowval=-1, 
            dim=foldData.N, 
            setindices=setindices
        )
    
    consts = list( p.map(sparse_tau_dot, range(foldData.P)) )

    foldData.cs_OvA[clust] = np.array(consts)

In [250]:
clust = 10
setindices = set(foldData.clusterindices[clust])

def sparse_tau_dot(gene):
    return Rocks.sparse_tau_dot(
        foldData, 
        foldData.cs_currX.getcol(gene), 
        foldData.clusterindices[clust], 
        highval=1, 
        lowval=-1, 
        dim=foldData.N, 
        setindices=setindices
    )

foldData.cs_OvA[clust,:20] - list( map(sparse_tau_dot, range(20)) )

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.])

In [251]:
list( map(sparse_tau_dot, range(20)) )

[1854.4595558147462,
 -128.24245096925108,
 -2320.9912975450748,
 -7106.024145665475,
 331.02999345871206,
 -10833.528068319602,
 -12820.987271304646,
 -151.83366843598108,
 -11937.960246046123,
 -15595.620156888213,
 -376.44316607784276,
 -8264.016581095753,
 -1824.3626259125767,
 -81.46279087247795,
 -616.2163718755385,
 -1734.5776345859838,
 160.87341981144186,
 -9680.077349966758,
 -2185.4803671247314,
 -1206.8807244813256]

In [79]:
fold0 = np.load("1M-fold0-consts.npz")['consts']

In [80]:
fold0.shape

(39, 24015)

In [195]:
fold0[1,:10]

array([  5296.94113288,   -808.30365671,  -5490.02190093, -17647.31017347,
          948.39699087,   -572.74427196, -20489.79414377,  -1400.25283949,
        -7529.69551733, -42312.75312821])

In [199]:
foldData.cs_OvA[1,:10] 

array([  5296.94113288,   -808.30365671,  -5490.02190093, -17647.31017347,
          948.39699087,   -572.74427196, -20489.79414377,  -1400.25283949,
        -7529.69551733, -42312.75312821])

In [78]:
def riiipFold(foldNum, pool=p, folds=folds, data=data):

    mask = np.zeros(data.N, dtype=bool)
    mask[folds[foldNum]] = True
    foldData = Rocks(data.X[~mask], data.y[~mask], verbose=1)
    foldData.cs_currX = foldData.X

    print("Loaded data for fold {}, starting generation of constants".format(foldNum), flush=True)

    for clust in range(data.K):
        print("Working on cluster {}".format(clust), flush=True)

        setindices = set(foldData.clusterindices[clust])

        def sparse_tau_dot(gene):
            return Rocks.sparse_tau_dot(
                foldData, 
                foldData.cs_currX.getcol(gene), 
                foldData.clusterindices[clust], 
                highval=1, 
                lowval=-1, 
                dim=foldData.N, 
                setindices=setindices
            )

        consts = list( p.map(sparse_tau_dot, range(foldData.P)) )

        foldData.cs_OvA[clust] = np.array(consts)
    
    np.savez("1M-fold{}-consts.npz".format(foldNum), consts=foldData.cs_OvA)

## Find p values using scanpy

This was copied from `~/publicData/scanpy-pvals.ipynb` and appropriately edited on 13 February 2019.  All three of the scanpy methods take too long on the full data set for cross-validation: thus, we have made a separate file so that we can run in parallel.

(Specifically - we don't run in a jupyter notebook.  See the extra scripts in this directory)

In [6]:
def geneName2index(adata, name):
    geneNames = np.array(adata.var.index)
    inds = np.where(geneNames == name)[0]
    if inds.size == 0:
        print("Warning: gene name not found in adata variable index: returning 0", flush=True)
        return 0
    else:
        return inds[0]
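
`geneName2index` scans the whole gene list on every call, and the marker loop further down calls it once per gene per group. If that lookup ever becomes a bottleneck, a dictionary built once is a possible alternative. This is a sketch with hypothetical helper names and toy gene names, which keeps the same "first match, else 0" behavior:

```python
def build_gene_index(gene_names):
    """Map each gene name to the position of its first occurrence."""
    index = {}
    for position, name in enumerate(gene_names):
        index.setdefault(name, position)   # keep the first match, like np.where(...)[0][0]
    return index

def gene_name_to_index(index, name):
    # Same fallback as geneName2index: unknown names map to 0, with a warning.
    if name not in index:
        print("Warning: gene name not found in index: returning 0", flush=True)
        return 0
    return index[name]

gene_index = build_gene_index(["Xkr4", "Sox17", "Mrpl15", "Sox17"])   # toy names
print(gene_name_to_index(gene_index, "Mrpl15"))
```

With the real data, the index would be built once from `adata.var.index` and reused for every fold.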

In [8]:
adata = sc.read_h5ad("1M-nzGenes-clusts.h5ad")

In [9]:
adata

AnnData object with n_obs × n_vars = 1306127 × 24015 
    obs: 'louvain', 'graphclust'
    var: 'gene_ids', 'n_counts'

Originally, the saved data didn't have the clustering as categorical.  Thus, need to redefine that.

In [10]:
louvain = pd.read_csv("louvain.csv", dtype='category')
graphclust = pd.read_csv("analysis/clustering/graphclust/clusters.csv", dtype='category')

adata.obs['louvain'] = louvain['x']
adata.obs['graphclust'] = graphclust['Cluster']

In [11]:
folds = np.load("10x-5folds.npz")
folds = [folds["fold{}".format(i)] for i in range(5)]

Save every fold since we are going to need to run them individually:

In [None]:
for ifold, fold in enumerate(folds):
    
    mask = np.zeros(adata.X.shape[0], dtype=bool)
    mask[fold] = True
    
    foldAdata = adata[~mask]
    foldAdata.write("1M-fold{}.h5ad".format(ifold) )

In [10]:
mask = np.zeros(adata.X.shape[0], dtype=bool)
mask[folds[4]] = True

foldAdata = adata[~mask]
foldAdata.write("1M-fold{}.h5ad".format(4) )

Set up for running the scanpy methods

In [12]:
methods = ['wilcoxon', 't-test_overestim_var', 'logreg']
method = methods[1]

Run on the full data set to get an indication of timing

In [17]:
%%time
sc.tl.rank_genes_groups(adata, groupby='louvain', n_genes=24015, method=method, rankby_abs=True, groups='all')

After select_groups, groups_order = ['0' '1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '21'
 '22' '23' '24' '25' '26' '27' '28' '29' '3' '30' '31' '32' '33' '34' '35'
 '36' '37' '38' '4' '5' '6' '7' '8' '9']
['0', '1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '4', '5', '6', '7', '8', '9']
CPU times: user 2h 22min 49s, sys: 2h 31min 20s, total: 4h 54min 9s
Wall time: 4h 54min 7s


**Running on all of the ~24000 genes**:

Welp, this didn't finish running after about 3 hours 45 minutes using wilcoxon.  So... that's not good.  Not possible to cross validate that.  (It has also only been using 1 core for some reason.)

Even the t-test takes 5 hours.  That's not scalable at all.  I also think that something may be wrong, since the example provided by scanpy ([here](https://github.com/theislab/scanpy_usage/blob/master/170522_visualizing_one_million_cells/cluster.py)) takes only 8 minutes to run `rank_genes_groups` (at least for 100 genes... perhaps it scales with the number of genes that you select?).  Hopefully it is not linear.  They also use the zheng17 recipe to select variable genes, so they are working with a smaller number of genes (I think 1000) to begin with.

After this fails, I will try to select a smaller number of genes, but we should still need the scores for all of them.

In [None]:
%%time
nMarkers = 24015
method = methods[1]
print("Method is " + method, flush=True)
groupby = "louvain"
# louvain or graphclust

# mark this if you are comparing only two clusters.
twoGroups = False

pvals = []
marks = []

for foldind, fold in enumerate(folds):
    
    # initialize
    print("Finding markers for fold " + str(foldind), flush=True)
    
    foldPvals = []
    foldMarkers = []

    # run the method on the fold
    if twoGroups: groups = [0,1]
    else: groups = 'all'
        
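    # NOTE: this selects the cells listed in `fold` itself; the 1bcs cells above use the complement (`~mask`) instead
    fold_adata = adata[fold]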
    sc.tl.rank_genes_groups(
        fold_adata, 
        groupby=groupby, 
        n_genes=nMarkers, 
        method=method, 
        rankby_abs=True,
        groups=groups
    )
    
    
    # store the scores and marker indices for each group, in group order
    # (foldPvals holds the method's scores, despite the name)
    for name in fold_adata.uns['rank_genes_groups']['names'].dtype.names:
        foldPvals.append(fold_adata.uns['rank_genes_groups']['scores'][name])
        
        currMarks = fold_adata.uns['rank_genes_groups']['names'][name]
        foldMarkers.append(np.array([ geneName2index(adata, gene) for gene in currMarks ]))
        
    if (method == 'logreg' and twoGroups):
        foldPvals.append(foldPvals[0])
        foldMarkers.append(foldMarkers[0])
        
    pvals.append(foldPvals)
    marks.append(foldMarkers)
    
pvals = np.array(pvals)
marks = np.array(marks)