# Broad NEU Challenge Helper Notebook

## Installing required libraries

In [1]:
!pip install anndata



## Data overview

We usually work with our data through anndata; the data is stored in the `h5ad` format. You can read more about anndata and its storage format [here](https://anndata.readthedocs.io/en/latest/).

In [2]:
!wget https://storage.googleapis.com/dsp-cellarium-cas-public/neu-broad-challenge/pbmc_10k_neu_challenge_example.h5ad

--2024-05-14 20:45:15--  https://storage.googleapis.com/dsp-cellarium-cas-public/neu-broad-challenge/pbmc_10k_neu_challenge_example.h5ad
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.180.219, 142.250.201.219, 142.251.208.123, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.180.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 189524066 (181M) [application/octet-stream]
Saving to: ‘pbmc_10k_neu_challenge_example.h5ad’


2024-05-14 20:45:44 (6.49 MB/s) - ‘pbmc_10k_neu_challenge_example.h5ad’ saved [189524066/189524066]



In [3]:
import anndata

In [4]:
adata = anndata.read_h5ad("pbmc_10k_neu_challenge_example.h5ad")

General information about the data: here you can see all the metadata variables and the data dimensionality. You don't need these metadata variables; this is just an example of a dataset that usually goes through our pipeline.

In [5]:
adata

AnnData object with n_obs × n_vars = 10246 × 36601
    obs: 'total_mrna_umis', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'feature_name'
    uns: 'hvg', 'log1p', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_umap'
    obsp: 'connectivities', 'distances'

Representation of the sparse count matrix (the output shows a slice of the raw count matrix):

In [9]:
adata.X[:5, 1365:1375].todense()

matrix([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 2., 5., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 2., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

Feel free to explore the data more if needed...

## Embeddings

This step is not needed for the challenge. For context, though, this is the kind of output the embedding model (in our case PCA) returns. The values here are dummy random numbers, but the real output looks exactly like this.

In [10]:
import numpy as np


EMBEDDING_DIMENSION = 512

embeddings = np.random.random((adata.shape[0], EMBEDDING_DIMENSION))

In [11]:
embeddings[:5]

array([[0.61804098, 0.4374286 , 0.96260307, ..., 0.1848124 , 0.05337794,
        0.82654222],
       [0.75054247, 0.32108817, 0.81191402, ..., 0.34008085, 0.00360174,
        0.26584503],
       [0.70063089, 0.15133945, 0.88460563, ..., 0.22607995, 0.68806827,
        0.24766543],
       [0.40241017, 0.75182927, 0.39930617, ..., 0.67262924, 0.86085721,
        0.95091779],
       [0.67152385, 0.93624569, 0.34892989, ..., 0.30340524, 0.93085506,
        0.27393163]])

## Diving into the details of the challenge problem

In [12]:
!wget https://storage.googleapis.com/dsp-cellarium-cas-public/neu-broad-challenge/neu_broad_challenge_inputs.pkl

--2024-05-14 20:47:19--  https://storage.googleapis.com/dsp-cellarium-cas-public/neu-broad-challenge/neu_broad_challenge_inputs.pkl
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.20.27, 142.250.180.251, 142.251.39.91, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.20.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50054196 (48M) [application/octet-stream]
Saving to: ‘neu_broad_challenge_inputs.pkl’


2024-05-14 20:47:28 (5.81 MB/s) - ‘neu_broad_challenge_inputs.pkl’ saved [50054196/50054196]



In [13]:
import pickle


# Load the challenge inputs (a dict) directly from the file handle
with open("./neu_broad_challenge_inputs.pkl", "rb") as f:
    data = pickle.load(f)

In [14]:
data.keys()

dict_keys(['cas_search_output', 'cas_search_all_neighbors_info'])

## Nearest Neighbor Search Engine
This is what the output of the nearest neighbor search engine looks like. Each entry holds a `query_cell_id` (the ID the user assigned to that input cell) and the list of its neighbors, each with the distance returned by the Nearest Neighbor Search Engine.

In [16]:
nearest_neighbors = data["cas_search_output"]
nearest_neighbors[1]

{'query_cell_id': 'AAACCCACAGAGTTGG-1',
 'neighbors': [{'cas_cell_index': 1524013579, 'distance': 0.9772824645042419},
  {'cas_cell_index': 1524055387, 'distance': 0.9762778282165527},
  {'cas_cell_index': 330041453, 'distance': 0.975730299949646},
  {'cas_cell_index': 330045566, 'distance': 0.9749655723571777},
  {'cas_cell_index': 330044658, 'distance': 0.9747146368026733},
  {'cas_cell_index': 330040314, 'distance': 0.9743883013725281},
  {'cas_cell_index': 1524037623, 'distance': 0.9731814861297607},
  {'cas_cell_index': 1524064517, 'distance': 0.9731571078300476},
  {'cas_cell_index': 1524064385, 'distance': 0.9731186628341675},
  {'cas_cell_index': 330047686, 'distance': 0.9730817079544067},
  {'cas_cell_index': 330039427, 'distance': 0.9729874730110168},
  {'cas_cell_index': 1524024579, 'distance': 0.9728894233703613},
  {'cas_cell_index': 1524002970, 'distance': 0.9724669456481934},
  {'cas_cell_index': 1524001588, 'distance': 0.9724181890487671},
  {'cas_cell_index': 152401193

In [18]:
len(nearest_neighbors)

10246
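
To get a feel for this structure, it can help to reduce each entry to a few summary numbers. The snippet below is a minimal sketch, not part of the challenge tooling: `summarize_query` is a hypothetical helper, and `toy_result` is a hand-built entry that reuses only the first three neighbors printed for `AAACCCACAGAGTTGG-1`. With the real data you would pass `nearest_neighbors[i]` instead.

```python
def summarize_query(result):
    """Summarize one search-output entry (hypothetical helper for exploration)."""
    distances = [neighbor["distance"] for neighbor in result["neighbors"]]
    return {
        "query_cell_id": result["query_cell_id"],
        "n_neighbors": len(distances),
        "min_distance": min(distances),
        "max_distance": max(distances),
    }


# Toy entry: only the first three neighbors from the printed output are reused here.
toy_result = {
    "query_cell_id": "AAACCCACAGAGTTGG-1",
    "neighbors": [
        {"cas_cell_index": 1524013579, "distance": 0.9772824645042419},
        {"cas_cell_index": 1524055387, "distance": 0.9762778282165527},
        {"cas_cell_index": 330041453, "distance": 0.975730299949646},
    ],
}

summary = summarize_query(toy_result)
print(summary)
```

Running the same helper over every entry is a quick way to check whether all queries come back with the same number of neighbors.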

## Cell Metadata

Here is the metadata we store in our database. You can match cell metadata to neighbors by `cas_cell_index` and use it to compose the neighborhood context. For this task you need `cell_type` and `cell_type_ontology_term_id`; the other fields can be ignored.

In [19]:
data["cas_search_all_neighbors_info"][0]

{'cas_cell_index': 1524109823,
 'cell_type': 'central memory CD8-positive, alpha-beta T cell',
 'assay': "10x 3' v3",
 'disease': 'normal',
 'suspension_type': 'cell',
 'tissue': 'blood',
 'cell_type_ontology_term_id': 'CL:0000907',
 'assay_ontology_term_id': 'EFO:0009922',
 'disease_ontology_term_id': 'PATO:0000461',
 'tissue_ontology_term_id': 'UBERON:0000178'}
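
One possible way to do the matching is to index the metadata records by `cas_cell_index` once and then look up each neighbor. The sketch below assumes the record shapes shown in this notebook; the helper names (`build_metadata_index`, `neighborhood_cell_type_counts`) and all toy values (indices, `TOY:` term IDs, cell type names) are hypothetical and only illustrate the join. Counting term IDs is just one simple form of neighborhood context, not a prescribed solution.

```python
from collections import Counter


def build_metadata_index(records):
    """Map cas_cell_index -> metadata record for constant-time lookups."""
    return {record["cas_cell_index"]: record for record in records}


def neighborhood_cell_type_counts(query_result, metadata_by_index):
    """Count cell type ontology term IDs among the neighbors of one query cell."""
    counts = Counter()
    for neighbor in query_result["neighbors"]:
        info = metadata_by_index.get(neighbor["cas_cell_index"])
        if info is None:
            continue  # no metadata for this neighbor; skipped in this sketch
        counts[info["cell_type_ontology_term_id"]] += 1
    return counts


# Toy stand-ins shaped like the real records; every value below is made up.
toy_metadata = [
    {"cas_cell_index": 1, "cell_type": "toy type A", "cell_type_ontology_term_id": "TOY:0001"},
    {"cas_cell_index": 2, "cell_type": "toy type A", "cell_type_ontology_term_id": "TOY:0001"},
    {"cas_cell_index": 3, "cell_type": "toy type B", "cell_type_ontology_term_id": "TOY:0002"},
]
toy_query = {
    "query_cell_id": "toy-cell",
    "neighbors": [
        {"cas_cell_index": 1, "distance": 0.9},
        {"cas_cell_index": 2, "distance": 0.8},
        {"cas_cell_index": 3, "distance": 0.7},
        {"cas_cell_index": 99, "distance": 0.6},  # deliberately missing from toy_metadata
    ],
}

toy_index = build_metadata_index(toy_metadata)
toy_counts = neighborhood_cell_type_counts(toy_query, toy_index)
print(toy_counts)
```

With the real inputs, the index would be built from `data["cas_search_all_neighbors_info"]` and the query would be an entry of `data["cas_search_output"]`.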

You may need owlready2 to explore the Cell Ontology graph.

In [20]:
# Uncomment if needed
#!pip install owlready2

Here is the Cell Ontology OWL file

In [21]:
cl_owl_path = 'https://github.com/obophenotype/cell-ontology/raw/v2022-09-15/cl.owl'
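
Once the ontology is loaded, a common exploration step is collecting all ancestors of a term. The sketch below shows that traversal on a toy graph using only the standard library, so it runs without downloading the OWL file. The `TOY_PARENTS` map and its `TOY:` IDs are placeholders rather than real Cell Ontology terms, and `ancestors` is a hypothetical helper; with owlready2 you would derive the parent links from the loaded ontology instead.

```python
from collections import deque

# Hypothetical child -> parents map; the IDs are placeholders, not real CL terms.
TOY_PARENTS = {
    "TOY:0001": [],
    "TOY:0002": ["TOY:0001"],
    "TOY:0003": ["TOY:0002"],
    "TOY:0004": ["TOY:0002", "TOY:0001"],
}


def ancestors(term_id, parents):
    """Return every term reachable from term_id by following parent links."""
    seen = set()
    queue = deque(parents.get(term_id, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a term can be reached along several paths
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


toy_ancestors = ancestors("TOY:0003", TOY_PARENTS)
print(sorted(toy_ancestors))
```

The `seen` set matters because an ontology is a directed acyclic graph rather than a tree, so the same ancestor can be reached through more than one parent.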