# cNMF in `cellarium-ml`

Yang Xu

Stephen Fleming

2024.11.20

The `cellarium-ml` project:

https://github.com/cellarium-ai/cellarium-ml

The specific implementation of cNMF we are actively working on:

https://github.com/cellarium-ai/cellarium-ml/pull/196

NOTE: You will need to use the `cnmf-yx-streamline` branch of `cellarium-ml` on GitHub.

In [1]:
import os

import numpy as np

import scipy
import scanpy as sc

## Data

This demo uses a human heart dataset hosted in a Google Cloud Storage bucket. We first download the dataset to the machine where this notebook is running, split it into randomly shuffled chunks, and then run cNMF.

- The dataset is in h5ad format. In this case the entire dataset is a single h5ad file, but `cellarium-ml` can use an arbitrary number of h5ad files.

In [2]:
# change these values to run on a different dataset:

# define file paths

dataset_h5ad = "gs://broad-bican-cellarium-file-system/large_datasets/oligodendrocytes_all_tissues_filtered.h5ad"

working_dir = "./tmp"
data_dir = "./tmp/subset"

In [3]:
!mkdir -p $working_dir
!mkdir -p $data_dir

In [4]:
# localize data

local_h5ad = os.path.join(working_dir, "data.h5ad")

# uncomment the next line to download the dataset (requires gsutil)
# !gsutil cp $dataset_h5ad $local_h5ad

In [5]:
adata = sc.read_h5ad(local_h5ad, backed='r')
n_total = adata.n_obs
n_genes_total = adata.n_vars

In [6]:
chunk_size = 50000
the_last_chunk = n_total % chunk_size  # number of cells in the final, partial chunk
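The chunk arithmetic above can be sketched as a small helper. This is only an illustration: `chunk_bounds` is a hypothetical name that is not part of `cellarium-ml`, and the cell counts are made up. It uses ceiling division so that a dataset whose size is an exact multiple of `chunk_size` does not yield an empty trailing chunk.

```python
def chunk_bounds(n_total, chunk_size):
    """Return (start, end) index pairs that cover range(n_total) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n_chunks = -(-n_total // chunk_size)  # ceiling division
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_total))
        for i in range(n_chunks)
    ]


# hypothetical cell counts, chosen only to illustrate the arithmetic
print(chunk_bounds(120_000, 50_000))  # [(0, 50000), (50000, 100000), (100000, 120000)]
print(len(chunk_bounds(100_000, 50_000)))  # 2, with no empty trailing chunk
```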

In [7]:
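# shuffle cell indices once, with a fixed seed so the split is reproducible
r = np.random.RandomState(seed=0).permutation(n_total)
# ceiling division: avoids an empty final chunk when n_total is a multiple of chunk_size
n_chunks = int(np.ceil(n_total / chunk_size))
for i in range(n_chunks):
    start_index = i * chunk_size
    end_index = min((i + 1) * chunk_size, n_total)
    subset_path = os.path.join(data_dir, 'subset_data_%i.h5ad' % i)
    # a view of a backed AnnData is copied to its own file before being loaded into memory
    adata_subset = adata[r[start_index:end_index], :].copy(filename=subset_path)
    adata_subset = adata_subset.to_memory()
    # store counts as float32, then overwrite the chunk file
    adata_subset.X = adata_subset.X.astype(np.float32)
    adata_subset.write(filename=subset_path)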
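Once the chunk files are written, it can be handy to collect their paths in numeric order, since a plain lexicographic sort would place `subset_data_10.h5ad` before `subset_data_2.h5ad`. The following is a minimal sketch using only the standard library; `list_shards` is a hypothetical helper, not a `cellarium-ml` function, and it assumes the `subset_data_<i>.h5ad` naming used in the loop above.

```python
import glob
import os
import re


def list_shards(data_dir):
    """Return subset_data_<i>.h5ad paths in data_dir, ordered by the integer <i>."""
    pattern = re.compile(r"subset_data_(\d+)\.h5ad$")
    paths = glob.glob(os.path.join(data_dir, "subset_data_*.h5ad"))
    indexed = []
    for path in paths:
        match = pattern.search(os.path.basename(path))
        if match:  # skip files that do not follow the naming scheme
            indexed.append((int(match.group(1)), path))
    return [path for _, path in sorted(indexed)]


# returns an empty list if the directory does not exist or holds no chunk files
shards = list_shards("./tmp/subset")
print(len(shards), "chunk files found")
```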