# RAPIDS & Scanpy Single-Cell RNA-seq Workflow

Copyright (c) 2020, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0 

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This notebook demonstrates a single-cell RNA analysis workflow that begins with preprocessing a count matrix of size `(n_gene, n_cell)` and results in a visualization of the clustered cells for further analysis.

For demonstration purposes, we use a dataset of ~70,000 human lung cells from Travaglini et al. 2020 (https://www.biorxiv.org/content/10.1101/742320v2) and label cells using the ACE2 and TMPRSS2 genes. See the README for instructions to download this dataset.

## Import requirements

In [1]:
import numpy as np
import scanpy as sc
import anndata

import sys
import time
import os

import cudf
import cupy as cp

from cuml.decomposition import PCA
from cuml.manifold import TSNE
from cuml.cluster import KMeans

import rapids_scanpy_funcs

import warnings
warnings.filterwarnings('ignore', 'Expected ')

import cuml
import sklearn.linear_model

We use the RAPIDS memory manager on the GPU to control how memory is allocated.

In [2]:
import rmm

rmm.reinitialize(
    managed_memory=True, # Allows oversubscription
    pool_allocator=False, # default is False
    devices=0, # GPU device IDs to register. By default registers only GPU 0.
)

cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

## Input data

In the cell below, we provide the path to the `.h5ad` file containing the count matrix to analyze. Please see the README for instructions on how to download the dataset we use here.

We recommend saving count matrices in the sparse `.h5ad` format, as it is much faster to load than a dense CSV file. To run this notebook using your own dataset, please see the README for instructions to convert your own count matrix into this format. Then, replace the path in the cell below with the path to your generated `.h5ad` file.

In [3]:
input_file = "../data/krasnow_hlca_10x_UMIs.sparse.h5ad"

## Set parameters

In [4]:
# marker genes
RIBO_GENE_PREFIX = "RPS" # Prefix for ribosomal genes to regress out
markers = ["ACE2", "TMPRSS2", "EPCAM"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed 
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed 

# filtering genes
n_top_genes = 5000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 20 # Number of clusters for k-means

# KNN
n_neighbors = 15 # Number of nearest neighbors for KNN graph
knn_n_pcs = 50 # Number of principal components to use for finding nearest neighbors

# UMAP
umap_min_dist = 0.3 
umap_spread = 1.0

# Louvain
louvain_resolution = 0.4

# Gene ranking
ranking_n_top_genes = 50 # Number of differential genes to compute for each cluster

In [5]:
start = time.time()

## Load and Prepare Data

We load the sparse count matrix from an `h5ad` file using Scanpy. The sparse count matrix will then be placed on the GPU. 

In [6]:
data_load_start = time.time()

In [7]:
%%time
adata = sc.read(input_file)
adata = adata.T

CPU times: user 4.6 s, sys: 659 ms, total: 5.26 s
Wall time: 5.27 s


In [8]:
adata.shape

(65662, 26485)

We maintain the index of unique cells and genes in our dataset:

In [9]:
%%time
cells = cudf.Series(adata.obs_names)
genes = cudf.Series(adata.var_names)

CPU times: user 627 ms, sys: 485 ms, total: 1.11 s
Wall time: 1.18 s


In [10]:
%%time
sparse_gpu_array = cp.sparse.csr_matrix(adata.X)

CPU times: user 162 ms, sys: 554 ms, total: 717 ms
Wall time: 718 ms


Verify the shape of the resulting sparse matrix:

In [11]:
sparse_gpu_array.shape

(65662, 26485)

And the number of non-zero values in the matrix:

In [12]:
sparse_gpu_array.nnz

126510394

In [13]:
data_load_time = time.time()
print("Total data load and format time: %s" % (data_load_time-data_load_start))

Total data load and format time: 7.212949752807617


## Preprocessing

In [14]:
preprocess_start = time.time()

### Filter

We filter the count matrix to remove cells with an extreme number of genes expressed.

In [15]:
%%time
sparse_gpu_array = rapids_scanpy_funcs.filter_cells(sparse_gpu_array, min_genes=min_genes_per_cell, max_genes=max_genes_per_cell)

CPU times: user 580 ms, sys: 704 ms, total: 1.28 s
Wall time: 1.34 s


Some genes will now have zero expression in all cells. We filter out such genes.

In [16]:
%%time
sparse_gpu_array, genes = rapids_scanpy_funcs.filter_genes(sparse_gpu_array, genes, min_cells=1)

CPU times: user 1.01 s, sys: 239 ms, total: 1.25 s
Wall time: 1.26 s


The size of our count matrix is now reduced.

In [17]:
sparse_gpu_array.shape

(65462, 22058)
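To make the two filtering steps concrete, here is a minimal dense NumPy sketch of the same idea on a toy matrix. The helper names `filter_cells_dense` and `filter_genes_dense` are hypothetical, and the inclusive bounds are an assumption; `rapids_scanpy_funcs` works on sparse GPU arrays and may handle the boundaries differently.

```python
import numpy as np

def filter_cells_dense(counts, min_genes, max_genes):
    """Keep cells (rows) whose number of expressed genes is within the bounds.

    Inclusive bounds are an assumption made for this sketch.
    """
    genes_per_cell = np.count_nonzero(counts, axis=1)
    keep = (genes_per_cell >= min_genes) & (genes_per_cell <= max_genes)
    return counts[keep], keep

def filter_genes_dense(counts, gene_names, min_cells=1):
    """Keep genes (columns) expressed in at least `min_cells` cells."""
    cells_per_gene = np.count_nonzero(counts, axis=0)
    keep = cells_per_gene >= min_cells
    return counts[:, keep], gene_names[keep]

# Toy count matrix: 4 cells x 5 genes
toy_counts = np.array([
    [1, 0, 0, 0, 0],   # 1 gene expressed
    [2, 3, 0, 0, 0],   # 2 genes expressed
    [1, 1, 1, 0, 0],   # 3 genes expressed
    [4, 1, 2, 3, 9],   # 5 genes expressed
])
toy_genes = np.array(["g0", "g1", "g2", "g3", "g4"])

cells_kept, cell_mask = filter_cells_dense(toy_counts, min_genes=2, max_genes=3)
genes_kept, names_kept = filter_genes_dense(cells_kept, toy_genes, min_cells=1)
print(cells_kept.shape, genes_kept.shape, names_kept)
```

Removing the first and last cells leaves genes `g3` and `g4` with no expression at all, which is why the gene filter runs after the cell filter.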

### Normalize

We normalize the count matrix so that the total counts in each cell sum to 1e4.

In [18]:
%%time
sparse_gpu_array = rapids_scanpy_funcs.normalize_total(sparse_gpu_array, target_sum=1e4)

CPU times: user 839 µs, sys: 427 µs, total: 1.27 ms
Wall time: 749 µs


Next, we log transform the count matrix.

In [19]:
%%time
sparse_gpu_array = sparse_gpu_array.log1p()

CPU times: user 77.6 ms, sys: 68.9 ms, total: 146 ms
Wall time: 146 ms
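The two normalization steps can be sketched on a small dense array as follows. This is an illustration only: the helper name `normalize_total_dense` is hypothetical, and it assumes every cell has a nonzero total count, which the cell filter above should guarantee.

```python
import numpy as np

def normalize_total_dense(counts, target_sum=1e4):
    """Rescale each cell (row) so that its counts sum to `target_sum`.

    Assumes every row has a nonzero total.
    """
    totals = counts.sum(axis=1, keepdims=True)
    return counts / totals * target_sum

toy_counts = np.array([
    [1.0, 3.0, 0.0],
    [2.0, 2.0, 4.0],
])

normalized = normalize_total_dense(toy_counts, target_sum=1e4)
logged = np.log1p(normalized)   # log(1 + x), so zeros stay zero

print(normalized.sum(axis=1))
print(logged)
```

Because `log1p` maps zero to zero, the log transform preserves the sparsity pattern of the count matrix.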


### Select Most Variable Genes

We convert the count matrix to an AnnData object.

In [20]:
%%time
adata = anndata.AnnData(sparse_gpu_array.get())
adata.var_names = genes.to_pandas()

CPU times: user 261 ms, sys: 229 ms, total: 491 ms
Wall time: 489 ms


Using Scanpy, we filter the count matrix to retain only the 5000 most variable genes (`n_top_genes`).

In [21]:
%%time
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor="cell_ranger")
adata = adata[:, adata.var.highly_variable]

CPU times: user 1.87 s, sys: 28.5 ms, total: 1.9 s
Wall time: 2.16 s


### Regress out confounding factors (number of counts, ribosomal gene expression)

We can now perform regression on the count matrix to correct for confounding factors. For example purposes, we use the number of counts and the expression of ribosomal genes. Many workflows use the expression of mitochondrial genes (whose names start with `MT-`).

Before regression, we save the 'raw' expression values of the ACE2 and TMPRSS2 genes to use for labeling cells afterward. We will also store the expression of an epithelial marker gene (EPCAM).

In [22]:
%%time
tmp_norm = sparse_gpu_array.tocsc()
ACE2_raw = tmp_norm[:, genes[genes == "ACE2"].index[0]].todense().ravel()
TMPRSS2_raw = tmp_norm[:, genes[genes == "TMPRSS2"].index[0]].todense().ravel()
EPCAM_raw = tmp_norm[:, genes[genes == "EPCAM"].index[0]].todense().ravel()

del tmp_norm

CPU times: user 454 ms, sys: 248 ms, total: 702 ms
Wall time: 709 ms


In [23]:
genes = adata.var_names
ribo_genes = adata.var_names.str.startswith(RIBO_GENE_PREFIX)

In [24]:
%%time
filtered = adata.X

CPU times: user 543 ms, sys: 24 ms, total: 567 ms
Wall time: 565 ms


We now calculate the total counts and the fraction of counts coming from ribosomal genes for each cell.

In [25]:
%%time
n_counts = filtered.sum(axis=1)
percent_ribo = (filtered[:,ribo_genes].sum(axis=1) / n_counts).ravel()

n_counts = cp.array(n_counts).ravel()
percent_ribo = cp.array(percent_ribo).ravel()

CPU times: user 67.3 ms, sys: 0 ns, total: 67.3 ms
Wall time: 65.5 ms
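As a small illustration of the two covariates, the following NumPy sketch computes them for a toy matrix. The gene names and values are made up for the example; only the `"RPS"` prefix comes from the parameters above.

```python
import numpy as np

gene_names = np.array(["RPS3", "ACE2", "RPS27A", "EPCAM"])
toy_expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 5.0, 5.0, 0.0],
])

# Boolean mask over genes whose name starts with the ribosomal prefix
is_ribo = np.char.startswith(gene_names, "RPS")

# Per-cell totals and the share of each total that is ribosomal
total_counts = toy_expr.sum(axis=1)
ribo_fraction = toy_expr[:, is_ribo].sum(axis=1) / total_counts

print(is_ribo)
print(total_counts, ribo_fraction)
```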


And perform regression:

In [26]:
%%time
sparse_gpu_array = cp.sparse.csc_matrix(adata.X)
sparse_gpu_array = rapids_scanpy_funcs.regress_out(sparse_gpu_array, n_counts, percent_ribo)

CPU times: user 59.9 s, sys: 23.6 s, total: 1min 23s
Wall time: 1min 31s
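Conceptually, regressing out covariates means fitting a linear model per gene and keeping the residuals. The sketch below shows this with ordinary least squares in NumPy; `regress_out_dense` is a hypothetical helper, and including an intercept is an assumption that may not match what `rapids_scanpy_funcs.regress_out` does internally.

```python
import numpy as np

def regress_out_dense(X, covariates):
    """Return residuals of each gene (column of X) after a least-squares
    fit on the covariates plus an intercept (the intercept is an assumption)."""
    design = np.column_stack([np.ones(X.shape[0]), covariates])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    return X - design @ coef

rng = np.random.default_rng(0)
n_cells = 200
depth = rng.uniform(1.0, 5.0, size=n_cells)    # stand-in for n_counts
ribo = rng.uniform(0.0, 0.3, size=n_cells)     # stand-in for percent_ribo

# Three synthetic genes whose expression is contaminated by both covariates
signal = rng.normal(size=(n_cells, 3))
X_toy = signal + 2.0 * depth[:, None] - 4.0 * ribo[:, None]

residuals = regress_out_dense(X_toy, np.column_stack([depth, ribo]))

# After regression, the residuals are uncorrelated with both covariates
print(np.corrcoef(depth, residuals[:, 0])[0, 1])
```

All genes are fitted in one `lstsq` call here because they share the same design matrix; this is a convenience of the dense toy example.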


### Scale

Finally, we scale the count matrix to obtain a z-score and apply a cutoff value of 10 standard deviations, obtaining the preprocessed count matrix.

In [27]:
%%time
sparse_gpu_array = rapids_scanpy_funcs.scale(sparse_gpu_array, max_value=10)

CPU times: user 252 ms, sys: 101 ms, total: 353 ms
Wall time: 351 ms
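A minimal NumPy sketch of the scaling step is shown below. `scale_dense` is a hypothetical helper; clipping only the upper tail and leaving constant genes at zero are assumptions of this sketch, and other implementations may clip symmetrically.

```python
import numpy as np

def scale_dense(X, max_value=10.0):
    """Z-score each gene (column), then cap values at `max_value`.

    Only the upper tail is clipped here (an assumption of this sketch).
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # constant genes stay at zero
    z = (X - mean) / std
    return np.minimum(z, max_value)

# Gene 0 has a single extreme outlier; gene 1 is evenly spread
toy_expr = np.zeros((200, 2))
toy_expr[0, 0] = 1000.0
toy_expr[:, 1] = np.arange(200)

scaled = scale_dense(toy_expr, max_value=10.0)
print(scaled[0, 0], scaled[:, 1].std())
```

Without the cutoff, the outlier in gene 0 would have a z-score of about 14, so the cap limits the influence of a single extreme cell on downstream PCA.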


In [28]:
preprocess_time = time.time()
print("Total Preprocessing time: %s" % (preprocess_time-preprocess_start))

Total Preprocessing time: 98.83416080474854


## Cluster & Visualize

In [29]:
cluster_start = time.time()

We store the preprocessed count matrix as an AnnData object, which is currently in host memory. We also add the expression levels of the marker genes as observations to the AnnData object.

In [30]:
%%time

var_names = adata.var_names
adata = anndata.AnnData(sparse_gpu_array.get())
adata.var_names = var_names
adata.obs["ACE2_raw"] = ACE2_raw.get()
adata.obs["TMPRSS2_raw"] = TMPRSS2_raw.get()
adata.obs["EPCAM_raw"] = EPCAM_raw.get()

CPU times: user 289 ms, sys: 314 ms, total: 603 ms
Wall time: 602 ms


### Reduce

We use PCA to reduce the dimensionality of the matrix to its top 50 principal components.

In [31]:
%%time
adata.obsm["X_pca"] = PCA(n_components=n_components, output_type="numpy").fit_transform(adata.X)

CPU times: user 1.3 s, sys: 852 ms, total: 2.15 s
Wall time: 2.15 s
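For readers who want to see what the projection computes, here is a small PCA sketch using a singular value decomposition in NumPy. It is not the cuML implementation; `pca_svd` is a hypothetical helper, and the sign of each component is arbitrary, so scores may be mirrored relative to another library.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]          # cell embeddings
    components = Vt[:n_components]                           # gene loadings
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 8))     # 100 cells x 8 genes

scores, components, explained = pca_svd(X_toy, n_components=3)
print(scores.shape, explained)
```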


We visualize the cells using t-SNE and label cells by color according to the k-means clustering.

### UMAP + Louvain

We can also visualize the cells using the UMAP algorithm in RAPIDS. Before UMAP, we need to construct a k-nearest neighbors graph in which each cell is connected to its nearest neighbors. This can be done conveniently using RAPIDS functionality already integrated into Scanpy.

Note that Scanpy uses an approximation to the nearest neighbors on the CPU while the GPU version performs an exact search. While both methods are known to yield useful results, some differences in the resulting visualization and clusters can be observed.

In [36]:
%%time
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=knn_n_pcs, method='rapids')



CPU times: user 7.65 s, sys: 707 ms, total: 8.36 s
Wall time: 6 s
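To illustrate what an exact (brute-force) nearest-neighbor search does, the sketch below computes all pairwise distances in NumPy and keeps the closest points. `exact_knn` is a hypothetical helper that excludes each point from its own neighbor list, which is a choice of this sketch; it needs memory quadratic in the number of cells and is meant only for small examples.

```python
import numpy as np

def exact_knn(points, n_neighbors):
    """Brute-force k-nearest neighbors using squared Euclidean distances.

    Each point is excluded from its own neighbor list (a choice of this sketch).
    """
    sq_norms = (points ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)                      # never pick yourself
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    nearest_d2 = np.take_along_axis(d2, idx, axis=1)
    return idx, np.sqrt(np.maximum(nearest_d2, 0.0))  # guard against tiny negatives

# Four points on a line at positions 0, 1, 3 and 7
points_toy = np.array([[0.0], [1.0], [3.0], [7.0]])
neighbor_idx, neighbor_dist = exact_knn(points_toy, n_neighbors=1)
print(neighbor_idx.ravel(), neighbor_dist.ravel())
```

An approximate method trades some of this exactness for speed, which is one reason the resulting graphs, and therefore the UMAP layouts and clusters, can differ slightly.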


The UMAP function from RAPIDS is also integrated into Scanpy.

In [37]:
%%time
sc.tl.umap(adata, min_dist=umap_min_dist, spread=umap_spread, method='rapids')

CPU times: user 601 ms, sys: 461 ms, total: 1.06 s
Wall time: 1.08 s


Finally, we use the Louvain algorithm for graph-based clustering, once again using the `rapids` option in Scanpy.

In [38]:
%%time
sc.tl.louvain(adata, resolution=louvain_resolution, flavor='rapids')


CPU times: user 229 ms, sys: 175 ms, total: 404 ms
Wall time: 753 ms


We plot the cells using the UMAP visualization, with the Louvain clusters as labels.

In [39]:
%%time
#sc.pl.umap(adata, color=["louvain"])

CPU times: user 7 µs, sys: 0 ns, total: 7 µs
Wall time: 31.7 µs


We can also view the cells using UMAP and label them by raw EPCAM, ACE2 and TMPRSS2 expression.

In [40]:
%%time
#sc.pl.umap(adata, size=4,color=["ACE2_raw"], color_map="Blues", vmax=1, vmin=-0.05)
#sc.pl.umap(adata, size=4, color=["TMPRSS2_raw"], color_map="Blues", vmax=1, vmin=-0.05)
#sc.pl.umap(adata, size=4, color=["EPCAM_raw"], color_map="Reds", vmax=1, vmin=-0.05)

CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 13.8 µs


In [41]:
cluster_time = time.time()
print("Total cluster time : %s" % (cluster_time-cluster_start))

Total cluster time : 10.664028882980347


## Differential expression analysis

Once we have done clustering, we can compute a ranking for the highly differential genes in each cluster. Here we use the Louvain clusters as labels.

We use logistic regression to identify the top 50 genes distinguishing each cluster.

In [42]:
def select_groups(labels, groups_order_subset='all'):
    adata_obs_key = labels
    groups_order = labels.cat.categories
    groups_masks = cp.zeros(
        (len(labels.cat.categories), len(labels.cat.codes)), dtype=bool
    )
    for iname, name in enumerate(labels.cat.categories):
        # if the name is not found, fallback to index retrieval
        if labels.cat.categories[iname] in labels.cat.codes:
            mask = labels.cat.categories[iname] == labels.cat.codes
        else:
            mask = iname == labels.cat.codes
        groups_masks[iname] = mask.values
    groups_ids = list(range(len(groups_order)))
    if groups_order_subset != 'all':
        groups_ids = []
        for name in groups_order_subset:
            groups_ids.append(
                cp.where(cp.array(labels.cat.categories.to_array().astype("int32")) == int(name))[0][0]
            )
        if len(groups_ids) == 0:
            # fallback to index retrieval
            groups_ids = cp.where(
                cp.in1d(
                    cp.arange(len(labels.cat.categories)).astype(str),
                    cp.array(groups_order_subset),
                )
            )[0]
            
        groups_ids = [groups_id.item() for groups_id in groups_ids]
        groups_masks = groups_masks[groups_ids]
        groups_order_subset = labels.cat.categories[groups_ids].to_array()
    else:
        groups_order_subset = groups_order.to_array()
    return groups_order_subset, groups_masks



In [43]:
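# Imports needed by the helpers below
from math import floor

import pandas as pd
from scipy.sparse import issparse, vstack
from scanpy import logging as logg

def _select_top_n(scores, n_top):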
    n_from = scores.shape[0]
    reference_indices = np.arange(n_from, dtype=int)
    partition = np.argpartition(scores, -n_top)[-n_top:]
    partial_indices = np.argsort(scores[partition])[::-1]
    global_indices = reference_indices[partition][partial_indices]
    return global_indices

def _ranks(X, mask=None, mask_rest=None):
    CONST_MAX_SIZE = 10000000
    n_genes = X.shape[1]
    if issparse(X):
        merge = lambda tpl: vstack(tpl).toarray()
        adapt = lambda X: X.toarray()
    else:
        merge = np.vstack
        adapt = lambda X: X
    masked = mask is not None and mask_rest is not None
    if masked:
        n_cells = np.count_nonzero(mask) + np.count_nonzero(mask_rest)
        get_chunk = lambda X, left, right: merge(
            (X[mask, left:right], X[mask_rest, left:right])
        )
    else:
        n_cells = X.shape[0]
        get_chunk = lambda X, left, right: adapt(X[:, left:right])
    # Calculate chunk frames
    max_chunk = floor(CONST_MAX_SIZE / n_cells)
    for left in range(0, n_genes, max_chunk):
        right = min(left + max_chunk, n_genes)
        df = pd.DataFrame(data=get_chunk(X, left, right))
        ranks = df.rank()
        yield ranks, left, right

def sc_select_groups(adata, groups_order_subset='all', key='groups'):
    """Get subset of groups in adata.obs[key].
    """
    groups_order = adata.obs[key].cat.categories
    if key + '_masks' in adata.uns:
        groups_masks = adata.uns[key + '_masks']
    else:
        groups_masks = np.zeros(
            (len(adata.obs[key].cat.categories), adata.obs[key].values.size), dtype=bool
        )
        for iname, name in enumerate(adata.obs[key].cat.categories):
            # if the name is not found, fallback to index retrieval
            if adata.obs[key].cat.categories[iname] in adata.obs[key].values:
                mask = adata.obs[key].cat.categories[iname] == adata.obs[key].values
            else:
                mask = str(iname) == adata.obs[key].values
            groups_masks[iname] = mask
    groups_ids = list(range(len(groups_order)))
    if groups_order_subset != 'all':
        groups_ids = []
        for name in groups_order_subset:
            groups_ids.append(
                np.where(adata.obs[key].cat.categories.values == name)[0][0]
            )
        if len(groups_ids) == 0:
            # fallback to index retrieval
            groups_ids = np.where(
                np.in1d(
                    np.arange(len(adata.obs[key].cat.categories)).astype(str),
                    np.array(groups_order_subset),
                )
            )[0]
        if len(groups_ids) == 0:
            logg.debug(
                f'{np.array(groups_order_subset)} invalid! specify valid '
                f'groups_order (or indices) from {adata.obs[key].cat.categories}',
            )
            from sys import exit

            exit(0)
        groups_masks = groups_masks[groups_ids]
        groups_order_subset = adata.obs[key].cat.categories[groups_ids].values
    else:
        groups_order_subset = groups_order.values
    return groups_order_subset, groups_masks
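The `_select_top_n` helper above relies on a partition-then-sort pattern: `argpartition` finds the top `n_top` entries without fully sorting the array, and only that small subset is then ordered. A standalone sketch of the same pattern, with made-up scores and gene names, might look like this:

```python
import numpy as np

def top_n_indices(scores, n_top):
    """Indices of the `n_top` largest scores, ordered from largest to smallest."""
    # Unordered indices of the n_top largest values
    partition = np.argpartition(scores, -n_top)[-n_top:]
    # Sort only that subset in descending order
    order = np.argsort(scores[partition])[::-1]
    return partition[order]

toy_scores = np.array([0.1, 0.9, 0.3, 0.7])
toy_gene_names = np.array(["geneA", "geneB", "geneC", "geneD"])

best = top_n_indices(toy_scores, n_top=2)
print(best, toy_gene_names[best])
```

For a few top genes out of thousands, this avoids the cost of sorting every score.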

In [44]:
cluster_labels = cudf.Series.from_categorical(adata.obs["louvain"].cat)
var_names = cudf.Series(var_names)

### Are DGE outputs the same?

In [45]:
scores, names, reference = rapids_scanpy_funcs.rank_genes_groups(
    sparse_gpu_array, cluster_labels, var_names, n_genes=1, groups=['1', '2', '3'], reference='4')
print(scores)
print(names)

['1', '2', '3']
['1', '2', '3', '4']
reched here!
Ranking took (GPU): 0.9700472354888916
Preparing output np.rec.fromarrays took (CPU): 0.00031185150146484375
Note: This operation will be accelerated in a future version
[(0.003589, 0.02074391, 0.01833547, 0.05072002)]
[('CST3', 'AKAP12', 'CST3', 'ACKR1')]


In [46]:
sc.tl.rank_genes_groups(adata, groupby="louvain", n_genes=1, groups=['1', '2', '3'], reference='4', method='logreg', use_raw=False)
adata.uns['rank_genes_groups']

{'params': {'groupby': 'louvain',
  'reference': '4',
  'method': 'logreg',
  'use_raw': False,
  'layer': None,
  'corr_method': 'benjamini-hochberg'},
 'scores': rec.array([(0.10110304, 0.06121348, 0.24898697, 0.10822219)],
           dtype=[('1', '<f4'), ('2', '<f4'), ('3', '<f4'), ('4', '<f4')]),
 'names': rec.array([('GPIHBP1', 'LOC731424', 'ACKR1', 'XCL1')],
           dtype=[('1', '<U50'), ('2', '<U50'), ('3', '<U50'), ('4', '<U50')])}

### Are labels and var_names the same?

In [47]:
list(var_names)==list(adata.var_names)

True

### Is grouping the same?

In [55]:
groups=['1', '2', '3']
labels = cluster_labels
n_genes=1
reference='4'
groups_order = list(groups)
groups_order += [reference]

In [56]:
r_groups_order, r_groups_masks = select_groups(labels, groups_order)
print(r_groups_order, r_groups_masks)

['1' '2' '3' '4'] [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]


In [57]:
sc_groups_order, sc_groups_masks = sc_select_groups(adata, groups_order, 'louvain')
print(sc_groups_order, sc_groups_masks)

['1' '2' '3' '4'] [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]


In [61]:
(r_groups_masks.get()==sc_groups_masks).all()

True

### Are the masks the same?

In [62]:
n_vars = len(var_names)
n_genes_user = n_genes
reference = groups_order[0]
print(n_vars, n_genes_user, reference)

5000 1 1


In [64]:
r_grouping_mask = labels.astype('int').isin(cudf.Series(r_groups_order))
r_grouping = labels.loc[r_grouping_mask]
print(r_grouping)

1493     1
1969     1
1970     1
1971     1
1972     1
        ..
51778    3
51779    3
52332    1
52677    1
53789    4
Length: 12929, dtype: category
Categories (33, object): [0, 1, 10, 11, ..., 6, 7, 8, 9]


In [66]:
sc_grouping_mask = adata.obs['louvain'].isin(sc_groups_order)
sc_grouping = adata.obs.loc[sc_grouping_mask, 'louvain']
print(sc_grouping)

1493     1
1969     1
1970     1
1971     1
1972     1
        ..
51778    3
51779    3
52332    1
52677    1
53789    4
Name: louvain, Length: 12929, dtype: category
Categories (33, object): [0, 1, 2, 3, ..., 29, 30, 31, 32]


In [83]:
r_grouping.to_array()==np.array(sc_grouping)

array([False, False, False, ..., False, False, False])

In [80]:
(r_grouping.to_array()==np.array(sc_grouping).astype(int)).all()

array([1, 1, 1, ..., 1, 1, 4])

### Are the inputs to regression the same?

In [94]:
r_X = sparse_gpu_array[r_grouping_mask.values, :]
print(r_X.shape)

(12929, 5000)


In [95]:
sc_X = adata.X[sc_grouping_mask.values, :]
print(sc_X.shape)

(12929, 5000)


In [96]:
(sc_X == r_X.get()).all()

True

### Are the labels to regression the same?

In [115]:
list(sc_grouping.cat.codes) == list(r_grouping.to_array().astype('float32'))

True

### Are the regression outputs the same?

In [182]:
y = np.array(sc_grouping.cat.codes).astype('float32')
print(y)

[1. 1. 1. ... 1. 1. 4.]


In [282]:
y_monotonic = np.arange(len(np.unique(y)))

In [285]:
y = y - 1

In [319]:
clf = sklearn.linear_model.LogisticRegression(penalty="none", max_iter=100, multi_class="multinomial")


clf.fit(sc_X, y)
sc_scores_all = clf.coef_
print(sc_scores_all.shape)
print(sc_scores_all[0])

(4, 5000)
[ 0.0003458   0.0082248   0.00487657 ...  0.00427103 -0.00247896
 -0.00078385]


In [288]:
np.unique(y)

array([0., 1., 2., 3.], dtype=float32)

In [321]:
r_clf = cuml.linear_model.LogisticRegression(penalty="none", max_iter=100)
r_clf.fit(sc_X, y.astype('float32'))
r_scores_all = cp.array(r_clf.coef_).T
print(r_scores_all.shape)
print(r_scores_all[0])

(4, 5000)
[1.0000695 1.0014609 1.002202  ... 1.0009569 1.0052625 0.9997621]


In [327]:
np.corrcoef(r_scores_all[0].get(), sc_scores_all[0])[1,0]

0.8926745571161203

In [315]:
sc_y_hat = clf.predict(sc_X)

In [322]:
r_y_hat = r_clf.predict(sc_X)

In [310]:
np.sum(sc_y_hat == r_y_hat) / len(sc_y_hat)

1.0
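Identical predictions with different coefficients are not necessarily a contradiction. One general property worth keeping in mind is that an unpenalized softmax (multinomial) model is over-parameterized: adding the same vector to every class's coefficients leaves the predicted probabilities unchanged. Whether this accounts for the gap between the two libraries here is an assumption, not something the outputs above establish. The NumPy sketch below demonstrates only the invariance itself, on random data.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 3))     # 6 samples, 3 features
W = rng.normal(size=(4, 3))          # 4 classes x 3 features
shift = rng.normal(size=3)
W_shifted = W + shift                # same offset added to every class row

probs = softmax(X_demo @ W.T)
probs_shifted = softmax(X_demo @ W_shifted.T)

print(np.allclose(probs, probs_shifted))   # probabilities are unchanged
print(np.abs(W - W_shifted).max())         # yet the coefficients differ
```

A comparison that is insensitive to this ambiguity would, for example, center each coefficient column across classes before comparing the two models.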

In [312]:
np.sum(np.asarray([True, True]))

2

In [323]:
np.exp(sc_scores_all)

array([[1.00034586, 1.00825872, 1.00488848, ..., 1.00428016, 0.99752411,
        0.99921646],
       [0.99960331, 0.99406229, 0.9900294 , ..., 0.99682133, 1.02496485,
        1.00016649],
       [1.00020174, 0.98859758, 0.99955046, ..., 1.0006983 , 0.97754031,
        1.00016961],
       [0.99984926, 1.00924097, 1.00560938, ..., 0.99821624, 1.00053654,
        1.00044787]])

In [324]:
r_scores_all.get()

array([[1.0000695 , 1.0014609 , 1.002202  , ..., 1.0009569 , 1.0052625 ,
        0.9997621 ],
       [0.9998564 , 0.9984232 , 0.99478006, ..., 0.9986997 , 0.99912286,
        1.000052  ],
       [1.0000591 , 0.9970673 , 0.997044  , ..., 1.000362  , 0.9950766 ,
        1.0000067 ],
       [1.0000144 , 1.0030487 , 1.0059748 , ..., 0.99998075, 1.0005378 ,
        1.0001786 ]], dtype=float32)

In [294]:
np.mean((np.exp(sc_scores_all) - r_scores_all.get())**2)

0.00035298969475261623

In [295]:
np.mean((sc_scores_all - r_scores_all.get())**2)

1.0003398875011547

In [296]:
np.allclose(np.exp(sc_scores_all), r_scores_all.get(), rtol=1e-1, atol=1e-1)

False

In [297]:
np.exp(sc_scores_all) - r_scores_all.get()

array([[ 2.76359266e-04,  6.79780708e-03,  2.68644152e-03, ...,
         3.32327021e-03, -7.73838266e-03, -5.45658510e-04],
       [-2.53099830e-04, -4.36092586e-03, -4.75065970e-03, ...,
        -1.87839658e-03,  2.58419881e-02,  1.14518335e-04],
       [ 1.42607585e-04, -8.46969271e-03,  2.50643685e-03, ...,
         3.36263977e-04, -1.75363485e-02,  1.62939152e-04],
       [-1.65160375e-04,  6.19230928e-03, -3.65388562e-04, ...,
        -1.76450850e-03, -1.21373042e-06,  2.69292307e-04]])

In [234]:
r_scores_all

array([[1.0001146 , 1.0008537 , 1.0076411 , ..., 1.0007269 , 0.9982038 ,
        1.0001632 ],
       [1.0000595 , 1.0000442 , 1.0029163 , ..., 1.0007036 , 1.0026325 ,
        0.9998369 ],
       [0.99985063, 0.9985515 , 0.99407315, ..., 0.9986741 , 0.999082  ,
        1.0000243 ],
       [1.0000582 , 0.9997989 , 0.9978093 , ..., 1.0004256 , 0.9982918 ,
        0.9999907 ]], dtype=float32)

In [230]:
np.log(r_scores_all.get())==sc_scores_all

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [231]:
np.corrcoef(np.log(r_scores_all[0].get()), sc_scores_all[0])[1,0]

-0.1008142567664284

In [226]:
np.unique(y)

array([1., 2., 3., 4.], dtype=float32)

In [331]:
np.save("cuml_logistic_coeffs.npy", r_scores_all.get())

In [332]:
np.save("sk_logistic_coeffs.npy", sc_scores_all)

In [227]:
y.shape

(12929,)

### Top gene found by both functions

In [206]:
partition = cp.argpartition(r_scores_all[0], -1)[-1:]
print(partition)
print(var_names[partition], r_scores_all[0][partition])
r_scores_all[0].max()

[3666]
3666    SEPP1
dtype: object [1.0138866]


array(1.0138866, dtype=float32)

In [207]:
partition = np.argpartition(sc_scores_all[0], -1)[-1:]
print(partition)
print(var_names[partition], sc_scores_all[0][partition])
sc_scores_all[0].max()

[4628]
4628    GPIHBP1
dtype: object [0.19351941]


0.19351940748586458