In [1]:
import os
from pathlib import Path

import numpy as np
import scanpy as sc
import faiss

from build_atlas_index_faiss import load_index

You can find the code to build the index at [build_atlas_index_faiss.py](build_atlas_index_faiss.py).
After tuning the index parameters to balance accuracy and efficiency, building the index now takes less than 3 minutes. We store only 16 bytes per cell vector, which gives 936 MB for the whole index of around 40 million cells.
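
As a rough sanity check on these figures, the arithmetic can be sketched as follows. The assumption here (not stated above) is that, besides the 16-byte code, the index stores one 64-bit id per cell, as Faiss inverted-file indexes do; coarse centroids and any other overhead are ignored.

```python
# Back-of-the-envelope index size; the per-cell id overhead is an assumption.
n_cells = 40_628_904      # number of cells reported when the index is loaded
code_bytes = 16           # compressed vector size per cell
id_bytes = 8              # assumed: one int64 id stored per cell

codes_only = n_cells * code_bytes
codes_and_ids = n_cells * (code_bytes + id_bytes)

print(f"codes only:  {codes_only / 2**20:.0f} MiB")
print(f"codes + ids: {codes_and_ids / 2**20:.0f} MiB")
```

Under these assumptions the codes alone take about 620 MiB and codes plus ids about 930 MiB, which is in the same range as the 936 MB quoted above.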

In [2]:
use_gpu = faiss.get_num_gpus() > 0
index, meta_labels = load_index(
    index_dir="/scratch/hdd001/home/haotian/projects/cellxemb/all",
    use_config_file=False,
    use_gpu=use_gpu,
)
print(f"Loaded index with {index.ntotal} cells")
if isinstance(meta_labels[0], bytes):
    meta_labels = meta_labels.astype(str)


Loading index and meta from /scratch/hdd001/home/haotian/projects/cellxemb/all ...
Index loaded, num_embeddings: 40628904
Set nprobe from 128 to 128 for 16384 clusters
Loaded index with 40628904 cells


## Ways to reduce the memory footprint of the index

1. Use fewer feature dimensions, for example by applying an autoencoder or PCA before indexing. One issue is that the projection will need to be refitted from time to time, whenever a large chunk of new data is added to the atlas.
    
    1.1. If using PCA, be aware that different dimensions can have different variance scales, so it is important to use a distance metric such as L2 instead of cosine.
    
2. Use fewer cells, such as meta cells. A naive way to select meta cells is stratified sampling of cells across the cell types in the atlas. The strategy can be tricky and will also need to be updated when new data is added.

    2.1. On the one hand, some form of meta cells may be needed. Because there can be too many cells, we want the K retrieved neighbors to be representative, i.e. to cover a wide enough range of the cell space. More concretely, some regions of the cell space may be much denser than others, which gives cells in those regions an unfairly high chance of being selected. There are two approaches to mitigate this: (1) select meta cells that are spread more evenly and retrieve within the meta cell space; (2) post-process the results, for example with normalized voting across cell types. In summary, even with approach (2), approach (1) is still needed: the cell states that should be covered must actually be retrieved before they can be re-weighted during voting.

    2.2. Stratified sampling by annotated cell type is not appropriate, since the annotation schema can change over time. The meta cell selection should depend only on the raw data themselves.

    2.3. Consider using some layers of the hnswlib index as meta cells.
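
One label-free way to pick evenly spread meta cells, in the spirit of points 2.1 and 2.2, is greedy farthest-point sampling on the embeddings. The sketch below is only an illustration under that assumption; `farthest_point_sample` and the toy data are hypothetical names, not part of the index-building code, and the O(n × n_meta) cost means a 40-million-cell atlas would need subsampling or an approximate variant.

```python
import numpy as np

def farthest_point_sample(embeddings, n_meta, seed=0):
    """Greedily pick n_meta rows that spread out over the embedding space."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = np.empty(n_meta, dtype=np.int64)
    selected[0] = rng.integers(n)  # random starting cell
    # Squared L2 distance from every cell to its nearest selected cell so far.
    min_dist = np.sum((embeddings - embeddings[selected[0]]) ** 2, axis=1)
    for i in range(1, n_meta):
        # The next meta cell is the one farthest from all selected cells.
        selected[i] = np.argmax(min_dist)
        new_dist = np.sum((embeddings - embeddings[selected[i]]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

# Toy data: a dense blob of 1000 cells and a sparse blob of 20 cells.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(1000, 8))
sparse = rng.normal(loc=5.0, scale=0.1, size=(20, 8))
toy = np.vstack([dense, sparse])

meta_idx = farthest_point_sample(toy, n_meta=10)
print("meta cells from the sparse blob:", int(np.sum(meta_idx >= 1000)))
```

Because the selection uses only distances between embeddings, it does not depend on the annotation schema, and sparse regions are represented even when they hold few cells.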

In [3]:
# QUERY
query_data_dir = "/scratch/ssd004/datasets/cellxgene/embed_dataset/ms-dataset/"
# list all files under this directory
query_data_path_list = os.listdir(query_data_dir)
# query_data_path_list = [os.path.join(query_data_dir, i) for i in query_data_path_list]
query_data_path_list = [
    os.path.join(query_data_dir, i) for i in query_data_path_list if "embed_0" in i
]
query_data_path_list

['/scratch/ssd004/datasets/cellxgene/embed_dataset/ms-dataset/embed_0.h5ad']

Another point concerns the retrieval step: it is essential to balance the voting by accounting for the different numbers of cells of each cell type. The voting weights could, for example, be normalized by the proportion of each cell type in the atlas. It is also worth checking how the existing literature handles this.
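
A minimal sketch of such frequency-normalized voting is shown here, assuming each neighbor's vote is weighted by the inverse of its cell type's count in the atlas. The function name `frequency_normalized_vote` and the toy labels are hypothetical, and this is one possible weighting rather than an established method.

```python
import numpy as np

def frequency_normalized_vote(neighbor_labels, atlas_labels):
    """Vote over neighbor labels, weighting each vote by 1 / atlas count of its label."""
    classes, atlas_counts = np.unique(atlas_labels, return_counts=True)
    weights = 1.0 / atlas_counts  # rarer cell types get a larger per-vote weight
    # Map each neighbor label to its class index (labels must come from the atlas).
    codes = np.searchsorted(classes, neighbor_labels)
    n_queries, k = neighbor_labels.shape
    scores = np.zeros((n_queries, classes.size))
    rows = np.repeat(np.arange(n_queries), k)
    # Accumulate the weighted votes for every (query, class) pair.
    np.add.at(scores, (rows, codes.ravel()), weights[codes.ravel()])
    return classes[np.argmax(scores, axis=1)]

# Toy atlas: 90 cells of type "A" and 10 cells of type "B".
atlas = np.array(["A"] * 90 + ["B"] * 10)
# One query whose 10 neighbors are 6 x "A" and 4 x "B".
neighbors = np.array([["A"] * 6 + ["B"] * 4])
print(frequency_normalized_vote(neighbors, atlas))
```

In the toy case a plain majority vote would return "A", while the normalized vote returns "B" because 4 of the 10 "B" cells in the atlas were retrieved versus 6 of 90 "A" cells. In this notebook the inputs would presumably be `meta_labels[idx]` and `meta_labels`.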

In [4]:
query_embed_list = []
query_meta_list = []
for query_data_path in query_data_path_list:
    data = sc.read_h5ad(query_data_path)
    query_embed_list.append(data.X)
    query_meta_list.append(data.obs["cell_type"].values)
    
query_embed_array = np.concatenate(query_embed_list, axis=0)
query_meta_array = np.concatenate(query_meta_list, axis=0)


The search runs remarkably fast. On a GPU, the following search for 10,000 queries against the whole 40-million-cell index typically takes less than 1 second.

In [19]:
k = 1000
# test with the first 10,000 cells
%time distances, idx = index.search(query_embed_array[:10000], k)
gt = query_meta_array[:10000]

CPU times: user 846 ms, sys: 581 ms, total: 1.43 s
Wall time: 761 ms


In [20]:
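from scipy.stats import mode

# look up the cell-type label of every retrieved neighbor: shape (n_queries, k)
matched_array = meta_labels[idx]
# unweighted majority vote over the k neighbors of each query
voting = mode(matched_array, axis=1)[0]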

### TODO: use weighted voting to balance the differing proportions of cell types in the reference atlas

In [21]:
voting[:10]

array([['oligodendrocyte'],
       ['double-positive, alpha-beta thymocyte'],
       ['OFF-bipolar cell'],
       ['oligodendrocyte'],
       ['ON-bipolar cell'],
       ['ON-bipolar cell'],
       ['oligodendrocyte'],
       ['double-positive, alpha-beta thymocyte'],
       ['astrocyte'],
       ['oligodendrocyte']], dtype='<U84')

In [22]:
gt[:10]

array(['oligodendrocyte A', 'PVALB-expressing interneuron',
       'oligodendrocyte A', 'oligodendrocyte precursor cell',
       'oligodendrocyte A', 'mixed glial cell?', 'mixed glial cell?',
       'VIP-expressing interneuron', 'astrocyte',
       'oligodendrocyte precursor cell'], dtype=object)

In [23]:
matched_array[0]

array(['cardiac muscle cell', 'double-positive, alpha-beta thymocyte',
       'cardiac muscle cell', 'epithelial cell',
       'epithelial cell of proximal tubule',
       'epithelial cell of proximal tubule',
       'double-positive, alpha-beta thymocyte', 'mural cell',
       'CD4-positive, alpha-beta T cell',
       'CD4-positive, alpha-beta T cell',
       'cortical cell of adrenal gland', 'cardiac muscle cell',
       'mural cell', 'stromal cell',
       'double-positive, alpha-beta thymocyte',
       'double-positive, alpha-beta thymocyte',
       'epithelial cell of proximal tubule',
       'cortical cell of adrenal gland', 'alpha-beta T cell',
       'cortical cell of adrenal gland', 'stromal cell',
       'double-positive, alpha-beta thymocyte',
       'cortical cell of adrenal gland',
       'double-positive, alpha-beta thymocyte', 'cardiac muscle cell',
       'enterocyte', 'double negative thymocyte', 'oligodendrocyte',
       'cardiac muscle cell', 'endocardial cell', 'nat