# Census datasets presence

*Goal:* demonstrate basic use of the `datasets_presence_matrix` array.

The presence matrix is a sparse array indicating which features (`var`) were present in each dataset. It has dimensions [n_datasets, n_var] and is stored in the `varp` collection of the SOMA Measurement. The first dimension is indexed by the `soma_joinid` in the `census_datasets` dataframe; the second is indexed by the `soma_joinid` in the `var` dataframe of the measurement.
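To make the layout concrete, here is a minimal sketch that builds a toy presence matrix with SciPy. The data is made up for illustration (3 datasets, 4 features) and does not come from the Census; the names `toy_presence`, `dataset_joinids`, and `var_joinids` are assumptions of this sketch.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: each (dataset_joinid, var_joinid) pair below
# marks a feature that was present in a dataset.
dataset_joinids = np.array([0, 0, 1, 2, 2, 2])
var_joinids = np.array([0, 2, 1, 0, 1, 3])
values = np.ones(len(dataset_joinids), dtype=np.uint8)

# Rows are datasets and columns are features: shape is [n_datasets, n_var].
toy_presence = sparse.csr_matrix(
    (values, (dataset_joinids, var_joinids)), shape=(3, 4)
)

n_datasets, n_var = toy_presence.shape

# Summing along the feature axis gives the number of features per dataset.
features_per_dataset = np.asarray(toy_presence.sum(axis=1)).ravel()
print(n_datasets, n_var, features_per_dataset)
```

Only the present (dataset, feature) pairs are stored, which is why a sparse format suits this kind of matrix.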

In [1]:
import numpy as np
from scipy import sparse
import cell_census

census = cell_census.open_soma()

# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]

# The cell census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()
datasets_df

Unnamed: 0,soma_joinid,collection_id,collection_name,collection_doi,dataset_id,dataset_title,dataset_h5ad_path,dataset_total_cell_count
0,0,,,,d2fc9880-e6d3-4922-af5c-61f4f517adfa,,d2fc9880-e6d3-4922-af5c-61f4f517adfa.h5ad,99369
1,1,,,,f75f2ff4-2884-4c2d-b375-70de37a34507,,f75f2ff4-2884-4c2d-b375-70de37a34507.h5ad,3799
2,2,,,,703f00e6-b996-48e5-bc34-00c41b9876f4,,703f00e6-b996-48e5-bc34-00c41b9876f4.h5ad,649
3,3,,,,cdefb878-7f00-4b9d-9eda-b3652cfac0c8,,cdefb878-7f00-4b9d-9eda-b3652cfac0c8.h5ad,1641
4,4,,,,fd072bc3-2dfb-46f8-b4e3-467cb3223182,,fd072bc3-2dfb-46f8-b4e3-467cb3223182.h5ad,908046
...,...,...,...,...,...,...,...,...
469,469,,,,2adb1f8a-a6b1-4909-8ee8-484814e2d4bf,,2adb1f8a-a6b1-4909-8ee8-484814e2d4bf.h5ad,598266
470,470,,,,2190bd4d-3be0-4bf7-8ca8-8d6f71228936,,2190bd4d-3be0-4bf7-8ca8-8d6f71228936.h5ad,126782
471,471,,,,1a018108-b4b6-457b-ba15-046d5e98c169,,1a018108-b4b6-457b-ba15-046d5e98c169.h5ad,21534
472,472,,,,76544818-bc5b-4a0d-87d4-40dde89545cb,,76544818-bc5b-4a0d-87d4-40dde89545cb.h5ad,6777


For convenience, read the entire presence matrix (for Homo sapiens) into memory as a SciPy sparse matrix. A convenience API provides this capability:

In [2]:
presence_matrix = cell_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")
presence_matrix

<474x60664 sparse matrix of type '<class 'numpy.uint8'>'
	with 12450655 stored elements in Compressed Sparse Row format>

We also need the `var` dataframe, which is read into a Pandas DataFrame for convenient manipulation:

In [3]:
var_df = human_rna.var.read().concat().to_pandas()

## Is a feature present in a dataset?

*Goal:* test if a given feature is present in a given dataset.

**Important:** the presence matrix is indexed by `soma_joinid`, and is *NOT* positionally indexed. In other words:
* the first dimension of the presence matrix is the dataset's `soma_joinid`, as stored in the `census_datasets` dataframe.
* the second dimension of the presence matrix is the feature's `soma_joinid`, as stored in the `var` dataframe.
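The distinction matters as soon as a dataframe has been filtered or re-indexed. The sketch below uses made-up data (the names `toy_presence`, `toy_var_df`, and the `GENE_*` identifiers are hypothetical) to show how a positional lookup can silently return the answer for the wrong feature, while a lookup by `soma_joinid` does not.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy presence matrix: 2 datasets x 3 features.
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1],
              [0, 1, 0]], dtype=np.uint8)
)
toy_var_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "feature_id": ["GENE_A", "GENE_B", "GENE_C"]}
)

# After filtering and resetting the index, row positions no longer
# match the soma_joinid values.
subset = toy_var_df[toy_var_df.feature_id != "GENE_A"].reset_index(drop=True)

# GENE_C now sits at position 1, but its soma_joinid is still 2.
position = int(subset.index[subset.feature_id == "GENE_C"][0])
joinid = int(subset.loc[subset.feature_id == "GENE_C", "soma_joinid"].iloc[0])

wrong = bool(toy_presence[0, position])  # actually looks up GENE_B
right = bool(toy_presence[0, joinid])    # looks up GENE_C as intended
print(position, joinid, wrong, right)
```

Always carrying the `soma_joinid` column through any dataframe manipulation avoids this class of mistake.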

In [4]:
# Look up the soma_joinid of the feature and of the dataset
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid
# Index the presence matrix by joinid; the result is 1x1, so take element [0, 0]
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]
print(f'Feature is {"present" if is_present else "not present"}.')

Feature is present.


## What datasets contain a feature?

*Goal:* look up all datasets that have a feature_id present.

In [5]:
# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid

# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row

# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]

Unnamed: 0,soma_joinid,collection_id,collection_name,collection_doi,dataset_id,dataset_title,dataset_h5ad_path,dataset_total_cell_count
38,38,,,,97a17473-e2b1-4f31-a544-44a60773e2dd,,97a17473-e2b1-4f31-a544-44a60773e2dd.h5ad,104148
51,51,,,,5a11f879-d1ef-458a-910c-9b0bdfca5ebf,,5a11f879-d1ef-458a-910c-9b0bdfca5ebf.h5ad,31691
66,66,,,,9372df2d-13d6-4fac-980b-919a5b7eb483,,9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad,33794
70,70,,,,acae7679-d077-461c-b857-ee6ccfeb267f,,acae7679-d077-461c-b857-ee6ccfeb267f.h5ad,39147
98,98,,,,5e5ab909-f73f-4b57-98a0-6d2c5662f6a4,,5e5ab909-f73f-4b57-98a0-6d2c5662f6a4.h5ad,32306
120,120,,,,53d208b0-2cfd-4366-9866-c3c6114081bc,,53d208b0-2cfd-4366-9866-c3c6114081bc.h5ad,483152
123,123,,,,c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771,,c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad,37642
141,141,,,,bdb26abd-f4ba-4ea3-8862-c2340e7a4f55,,bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad,227671
162,162,,,,d2b5efc1-14c6-4b5f-bd98-40f9084872d7,,d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad,36886
189,189,,,,7a0a8891-9a22-4549-a55b-c2aca23c3a2a,,7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad,74979


## What features are in a dataset?

*Goal:* look up the features present in a given dataset.

This example also shows how to run the query on multiple datasets at once.

In [6]:
# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid

# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col

# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]

Unnamed: 0,soma_joinid,feature_id,feature_name,feature_length
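When several datasets are selected, the column indices returned by `.tocoo().col` can contain the same feature more than once (once per dataset that has it). `isin` tolerates the duplicates, so the query above yields the union of features. The following sketch, on made-up data with hypothetical names (`toy_presence`, `selected`), shows one possible way to compute both the union and the intersection explicitly.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy presence matrix: 3 datasets x 4 features.
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1]], dtype=np.uint8)
)

# Joinids of the datasets of interest (assumed for this sketch).
selected = np.array([0, 1])
sub = toy_presence[selected, :]

# Column indices of stored entries; a feature shared by both
# datasets appears twice here.
cols = sub.tocoo().col

# Union: features present in at least one selected dataset.
union = np.unique(cols)

# Intersection: features present in every selected dataset.
counts = np.asarray(sub.sum(axis=0)).ravel()
intersection = np.flatnonzero(counts == len(selected))
print(union, intersection)
```

Either array of joinids can then be passed to `isin` on the `soma_joinid` column of the `var` dataframe, as in the cell above.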
