# Genes measured in each cell (dataset presence matrix)

The Census is a compilation of cells from multiple datasets that may differ by the sets of genes they measure. This notebook describes the way to identify the genes measured per dataset.

The presence matrix is a sparse boolean array indicating which features (`var`) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement `varp` collection. The first dimension is indexed by the `soma_joinid` in the `census_datasets` dataframe (`census["census_info"]["datasets"]`). The second is indexed by the `soma_joinid` in the `var` dataframe of the measurement.

As a reminder, the `obs` dataframe has a `dataset_id` column that can be used to link any cell in the Census to the presence matrix.
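The link from a cell to the presence matrix can be sketched on toy data as follows. The tables, IDs, and matrix values here are hypothetical stand-ins, not Census contents; only the column names `soma_joinid` and `dataset_id` come from the text above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical values) for the datasets table and the obs dataframe.
datasets_df = pd.DataFrame({"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]})
obs_df = pd.DataFrame({"soma_joinid": [0, 1, 2, 3], "dataset_id": ["ds-b", "ds-a", "ds-b", "ds-c"]})

# Toy presence matrix: 3 datasets x 4 features.
presence = sparse.csr_matrix(np.array([[1, 0, 1, 0],
                                       [0, 1, 1, 0],
                                       [1, 1, 0, 1]], dtype=np.uint8))

# Map each cell's dataset_id to the dataset soma_joinid, i.e. its presence-matrix row.
dataset_joinid_by_id = datasets_df.set_index("dataset_id")["soma_joinid"]
cell_rows = obs_df["dataset_id"].map(dataset_joinid_by_id).to_numpy()

# One flag per dataset for the feature with soma_joinid 1 ...
feature_joinid = 1
feature_col = presence[:, feature_joinid].toarray().ravel().astype(bool)

# ... broadcast to cells through each cell's dataset row.
obs_df["feature_measured"] = feature_col[cell_rows]
print(obs_df)
```

This distinguishes a gene that was not measured in a cell's dataset from a gene that was measured but has a zero count.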

**Contents** 

1. Opening the Census.
2. Fetching the IDs of the Census datasets.
3. Fetching the dataset presence matrix.
4. Identifying genes measured in a specific dataset.
5. Identifying datasets that measured specific genes.
6. Identifying all genes measured in a dataset.


## Opening the Census

The `cellxgene_census` Python package provides a convenient API to open the latest version of the Census.

In [1]:
import numpy as np
from scipy import sparse
import cellxgene_census

census = cellxgene_census.open_soma()

The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.
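Following the message above, a release can be pinned so that repeated runs read the same data. This is a minimal sketch: the helper name `open_pinned_census` is hypothetical, and calling it requires the `cellxgene_census` package and network access.

```python
def open_pinned_census(census_version: str = "2023-05-15"):
    """Open a specific Census release instead of whatever "stable" currently points to."""
    # Imported lazily so that defining this helper does not require the package.
    import cellxgene_census

    return cellxgene_census.open_soma(census_version=census_version)


# Example usage (not executed here, since it needs network access):
# with open_pinned_census() as census:
#     datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()
```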


## Fetching the IDs of the Census datasets

Let's grab a table of all the datasets included in the Census and use this table in combination with the presence matrix below.

In [2]:
# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]

# The census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()

datasets_df

Unnamed: 0,soma_joinid,collection_id,collection_name,collection_doi,dataset_id,dataset_title,dataset_h5ad_path,dataset_total_cell_count
0,0,6b701826-37bb-4356-9792-ff41fc4c3161,Abdominal White Adipose Tissue,,9d8e5dca-03a3-457d-b7fb-844c75735c83,22 integrated samples,9d8e5dca-03a3-457d-b7fb-844c75735c83.h5ad,72335
1,1,4195ab4c-20bd-4cd3-8b3d-65601277e731,A spatially resolved single cell genomic atlas...,,a6388a6f-6076-401b-9b30-7d4306a20035,scRNA-seq data - myeloid cells,a6388a6f-6076-401b-9b30-7d4306a20035.h5ad,30789
2,2,4195ab4c-20bd-4cd3-8b3d-65601277e731,A spatially resolved single cell genomic atlas...,,842c6f5d-4a94-4eef-8510-8c792d1124bc,scRNA-seq data - all cells,842c6f5d-4a94-4eef-8510-8c792d1124bc.h5ad,714331
3,3,4195ab4c-20bd-4cd3-8b3d-65601277e731,A spatially resolved single cell genomic atlas...,,74520626-b0ba-4ee9-86b5-714649554def,scRNA-seq data - T cells,74520626-b0ba-4ee9-86b5-714649554def.h5ad,76567
4,4,4195ab4c-20bd-4cd3-8b3d-65601277e731,A spatially resolved single cell genomic atlas...,,396a9124-fb20-4822-bf9c-e93fdf7c999a,scRNA-seq data - B cells,396a9124-fb20-4822-bf9c-e93fdf7c999a.h5ad,12510
...,...,...,...,...,...,...,...,...
557,557,180bff9c-c8a5-4539-b13b-ddbc00d643e6,Molecular characterization of selectively vuln...,10.1038/s41593-020-00764-7,f9ad5649-f372-43e1-a3a8-423383e5a8a2,Molecular characterization of selectively vuln...,f9ad5649-f372-43e1-a3a8-423383e5a8a2.h5ad,8168
558,558,a72afd53-ab92-4511-88da-252fb0e26b9a,Single-cell atlas of peripheral immune respons...,10.1038/s41591-020-0944-y,456e8b9b-f872-488b-871d-94534090a865,Single-cell atlas of peripheral immune respons...,456e8b9b-f872-488b-871d-94534090a865.h5ad,44721
559,559,38833785-fac5-48fd-944a-0f62a4c23ed1,Construction of a human cell landscape at sing...,10.1038/s41586-020-2157-4,2adb1f8a-a6b1-4909-8ee8-484814e2d4bf,Construction of a human cell landscape at sing...,2adb1f8a-a6b1-4909-8ee8-484814e2d4bf.h5ad,598266
560,560,5d445965-6f1a-4b68-ba3a-b8f765155d3a,A molecular cell atlas of the human lung from ...,10.1038/s41586-020-2922-4,e04daea4-4412-45b5-989e-76a9be070a89,"Krasnow Lab Human Lung Cell Atlas, Smart-seq2",e04daea4-4412-45b5-989e-76a9be070a89.h5ad,9409


## Fetching the dataset presence matrix

Now let's fetch the dataset presence matrix. 

For convenience, read the entire presence matrix for Homo sapiens into memory. The `cellxgene_census.get_presence_matrix` API provides this capability and returns a SciPy sparse matrix (Compressed Sparse Row format in the output below).

In [3]:
presence_matrix = cellxgene_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")

presence_matrix

<562x60664 sparse matrix of type '<class 'numpy.uint8'>'
	with 14829450 stored elements in Compressed Sparse Row format>
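Because rows correspond to datasets and columns to features, row and column sums give quick summaries of the matrix. The sketch below uses a small hypothetical matrix in place of the real one, with the same `uint8` dtype and CSR layout as the output above.

```python
import numpy as np
from scipy import sparse

# Toy presence matrix (hypothetical values): 3 datasets x 5 features.
presence = sparse.csr_matrix(np.array([[1, 0, 1, 0, 0],
                                       [1, 1, 1, 1, 0],
                                       [0, 0, 0, 1, 0]], dtype=np.uint8))

n_datasets, n_var = presence.shape

# Row sums: number of features measured by each dataset (row index = dataset soma_joinid).
# An explicit int64 accumulator avoids any concern about summing uint8 values.
genes_per_dataset = np.asarray(presence.sum(axis=1, dtype=np.int64)).ravel()

# Column sums: number of datasets that measured each feature (column index = var soma_joinid).
datasets_per_gene = np.asarray(presence.sum(axis=0, dtype=np.int64)).ravel()

print(genes_per_dataset)
print(datasets_per_gene)
```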

We also need the `var` dataframe, which is read into a Pandas DataFrame for convenient manipulation:

In [4]:
var_df = human_rna.var.read().concat().to_pandas()

var_df

Unnamed: 0,soma_joinid,feature_id,feature_name,feature_length
0,0,ENSG00000243485,MIR1302-2HG,1021
1,1,ENSG00000237613,FAM138A,1219
2,2,ENSG00000186092,OR4F5,2618
3,3,ENSG00000238009,RP11-34P13.7,3726
4,4,ENSG00000239945,RP11-34P13.8,1319
...,...,...,...,...
60659,60659,ENSG00000288719,RP4-669P10.21,4252
60660,60660,ENSG00000288720,RP11-852E15.3,7007
60661,60661,ENSG00000288721,RP5-973N23.5,7765
60662,60662,ENSG00000288723,RP11-553N16.6,1015


## Identifying genes measured in a specific dataset

Now that we have the dataset table, the gene metadata table, and the dataset presence matrix, we can check whether a gene or set of genes was measured in a specific dataset.

**Important:** the presence matrix is indexed by `soma_joinid`, and is *NOT* positionally indexed. In other words:

* the first dimension of the presence matrix is the dataset's `soma_joinid`, as stored in the `census_datasets` dataframe.
* the second dimension of the presence matrix is the feature's `soma_joinid`, as stored in the `var` dataframe.
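The difference matters as soon as a dataframe has been filtered or re-indexed. The following sketch uses hypothetical toy data to show a lookup by position returning the wrong answer where a lookup by `soma_joinid` returns the right one.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy var table (hypothetical values), already filtered and re-indexed,
# so the positional index no longer matches soma_joinid.
var_subset = pd.DataFrame(
    {"soma_joinid": [2, 5, 7], "feature_id": ["GENE_C", "GENE_F", "GENE_H"]}
)

# Toy presence matrix: 2 datasets x 8 features.
dense = np.zeros((2, 8), dtype=np.uint8)
dense[0, [2, 7]] = 1
dense[1, [5]] = 1
presence = sparse.csr_matrix(dense)

row = var_subset.loc[var_subset.feature_id == "GENE_F"]
position = row.index[0]           # position in the filtered table
joinid = row.soma_joinid.iloc[0]  # the feature's soma_joinid

dataset_joinid = 1
by_position = bool(presence[dataset_joinid, position])  # reads the column of a different feature
by_joinid = bool(presence[dataset_joinid, joinid])      # reads the intended column

print(by_position, by_joinid)
```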

Let's find out whether the gene `"ENSG00000286096"` was measured in the dataset with ID `"97a17473-e2b1-4f31-a544-44a60773e2dd"`.


In [5]:
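# Look up the soma_joinid of the feature and of the dataset
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid

# Index the presence matrix by (dataset joinid, feature joinid); a nonzero entry means "measured"
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]
print(f'Feature is {"present" if is_present else "not present"}.')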

Feature is present.
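The same lookup extends to a set of genes. Below is a minimal sketch on hypothetical toy tables that mirror the column names used above; the gene and dataset IDs are placeholders.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy tables (hypothetical values) mirroring the columns used in the notebook.
var_df = pd.DataFrame({"soma_joinid": [0, 1, 2, 3], "feature_id": ["G0", "G1", "G2", "G3"]})
datasets_df = pd.DataFrame({"soma_joinid": [0, 1], "dataset_id": ["ds-a", "ds-b"]})
presence_matrix = sparse.csr_matrix(np.array([[1, 0, 1, 1],
                                              [0, 1, 1, 0]], dtype=np.uint8))

genes = ["G1", "G2", "G3"]
var_rows = var_df.loc[var_df.feature_id.isin(genes)]
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "ds-b", "soma_joinid"].iloc[0]

# Densify the single dataset row, then pick the columns by feature soma_joinid.
dataset_row = presence_matrix[dataset_joinid, :].toarray().ravel()
flags = dataset_row[var_rows.soma_joinid.to_numpy()].astype(bool)

# Label each flag with its feature_id for readability.
is_measured = pd.Series(flags, index=var_rows.feature_id.to_numpy())
print(is_measured)
```

Densifying one row is cheap, since it holds a single value per feature.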


## Identifying datasets that measured specific genes

Similarly, we can determine the datasets that measured a specific gene or set of genes.

In [6]:
# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid

# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row

# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]

Unnamed: 0,soma_joinid,collection_id,collection_name,collection_doi,dataset_id,dataset_title,dataset_h5ad_path,dataset_total_cell_count
105,105,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,fe1a73ab-a203-45fd-84e9-0f7fd19efcbd,Dissection: Amygdaloid complex (AMY) - basolat...,fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad,35285
109,109,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,f8dda921-5fb4-4c94-a654-c6fc346bfd6d,Dissection: Cerebral cortex (Cx) - Occipitotem...,f8dda921-5fb4-4c94-a654-c6fc346bfd6d.h5ad,31899
126,126,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,dd03ce70-3243-4c96-9561-330cc461e4d7,Dissection: Cerebral cortex (Cx) - Perirhinal ...,dd03ce70-3243-4c96-9561-330cc461e4d7.h5ad,23732
131,131,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,d2b5efc1-14c6-4b5f-bd98-40f9084872d7,Dissection: Tail of Hippocampus (HiT) - Caudal...,d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad,36886
141,141,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,c4b03352-af8d-492a-8d6b-40f304e0a122,Supercluster: Medium spiny neuron,c4b03352-af8d-492a-8d6b-40f304e0a122.h5ad,152189
142,142,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771,Dissection: Cerebral cortex (Cx) - Temporal po...,c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad,37642
143,143,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,c202b243-1aa1-4b16-bc9a-b36241f3b1e3,Supercluster: Amygdala excitatory,c202b243-1aa1-4b16-bc9a-b36241f3b1e3.h5ad,109452
144,144,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,bdb26abd-f4ba-4ea3-8862-c2340e7a4f55,Supercluster: CGE interneuron,bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad,227671
149,149,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,acae7679-d077-461c-b857-ee6ccfeb267f,Dissection: Head of hippocampus (HiH) - CA1,acae7679-d077-461c-b857-ee6ccfeb267f.h5ad,39147
162,162,283d65eb-dd53-496d-adb7-7570c7caa443,Transcriptomic diversity of cell types across ...,10.1101/2022.10.12.511898,9372df2d-13d6-4fac-980b-919a5b7eb483,Dissection: Midbrain (M) - Periaqueductal gray...,9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad,33794
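For a set of genes, one may want only the datasets that measured every gene in the set. A sketch of that variation on hypothetical toy data:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy tables (hypothetical values) mirroring the columns used in the notebook.
datasets_df = pd.DataFrame({"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]})
var_df = pd.DataFrame({"soma_joinid": [0, 1, 2, 3], "feature_id": ["G0", "G1", "G2", "G3"]})
presence_matrix = sparse.csr_matrix(np.array([[1, 1, 0, 1],
                                              [0, 1, 1, 1],
                                              [1, 1, 1, 1]], dtype=np.uint8))

genes = ["G1", "G2"]
var_joinids = var_df.loc[var_df.feature_id.isin(genes), "soma_joinid"].to_numpy()

# Count, per dataset, how many of the requested features are present.
hits = np.asarray(presence_matrix[:, var_joinids].sum(axis=1, dtype=np.int64)).ravel()

# Keep the datasets (rows) in which every requested feature was measured.
dataset_joinids = np.flatnonzero(hits == len(var_joinids))
datasets_with_all = datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]
print(datasets_with_all)
```

Replacing the equality test with `hits > 0` would instead keep datasets that measured at least one of the genes.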


## Identifying all genes measured in a dataset 

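Finally, we can find the set of genes that were measured in the cells of a given dataset. The example below selects by `collection_id`, so it returns the genes measured in any dataset of that collection.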

In [7]:
# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid

# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col

# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]

Unnamed: 0,soma_joinid,feature_id,feature_name,feature_length
3,3,ENSG00000238009,RP11-34P13.7,3726
4,4,ENSG00000239945,RP11-34P13.8,1319
13,13,ENSG00000229905,RP11-206L10.4,456
14,14,ENSG00000237491,LINC01409,8413
15,15,ENSG00000177757,FAM87B,1947
...,...,...,...,...
51459,51459,ENSG00000277778,PGM5P2,1980
51983,51983,ENSG00000254893,RAP1BL,555
58043,58043,ENSG00000261408,TEN1-CDK3,3898
58688,58688,ENSG00000279457,WASH9P,1397
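When the selection covers several datasets, the slice above yields genes measured in at least one of them. The following sketch, on hypothetical toy data, separates that union from the genes measured in every selected dataset.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy tables (hypothetical values) mirroring the columns used in the notebook.
datasets_df = pd.DataFrame({
    "soma_joinid": [0, 1, 2],
    "collection_id": ["col-x", "col-x", "col-y"],
})
var_df = pd.DataFrame({"soma_joinid": [0, 1, 2, 3], "feature_id": ["G0", "G1", "G2", "G3"]})
presence_matrix = sparse.csr_matrix(np.array([[1, 1, 0, 0],
                                              [0, 1, 1, 0],
                                              [1, 1, 1, 1]], dtype=np.uint8))

dataset_joinids = datasets_df.loc[datasets_df.collection_id == "col-x", "soma_joinid"].to_numpy()

# Number of selected datasets that measured each feature.
counts = np.asarray(presence_matrix[dataset_joinids, :].sum(axis=0, dtype=np.int64)).ravel()

union_joinids = np.flatnonzero(counts > 0)                       # measured in at least one dataset
shared_joinids = np.flatnonzero(counts == len(dataset_joinids))  # measured in every dataset

genes_any = var_df.loc[var_df.soma_joinid.isin(union_joinids), "feature_id"].tolist()
genes_all = var_df.loc[var_df.soma_joinid.isin(shared_joinids), "feature_id"].tolist()
print(genes_any, genes_all)
```

The shared set is usually the safer choice when comparing expression across the datasets of a collection.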
