# Querying and fetching data from the Cell Census

The Cell Census is a versioned container for the single-cell data hosted at [CELLxGENE Discover](https://cellxgene.cziscience.com/). The Cell Census utilizes [SOMA](https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md) powered by [TileDB](https://tiledb.com/products/tiledb-embedded) for storing, accessing, and efficiently filtering data.

This notebook showcases the easiest ways to query the expression data and cell/gene metadata from the Cell Census.


# Contents
- Opening the census
- Querying cell metadata (obs)
- Querying gene metadata (var)
- Querying expression data

## Opening the census

The `cell_census` Python package provides a convenient API to open the latest version of the Cell Census.

In [1]:
import cell_census

census = cell_census.open_soma()

You can learn more about the `cell_census` methods by accessing their documentation via `help()`, for example `help(cell_census.open_soma)`.

## Querying cell metadata (obs)

The human cell metadata of the Cell Census, for RNA assays, is located at `census["census_data"]["homo_sapiens"].obs`. This is a `SOMADataFrame`, and as such it can be materialized as a `pandas.DataFrame` via the methods `read().concat().to_pandas()`.

The mouse cell metadata is at `census["census_data"]["mus_musculus"].obs`.

For slicing the cell metadata there are two relevant arguments that can be passed through `read()`:

- `column_names`: list of strings indicating which metadata columns to fetch.
- `value_filter`: Python expression with selection conditions to fetch rows. It is similar to [`pandas.DataFrame.query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html); for full details see [`tiledb.QueryCondition`](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition). In short:
   - Expressions are one or more comparisons.
   - Comparisons take the form `column op value` or `column op column`.
   - Expressions can combine comparisons using `and`, `or`, `&` or `|`.
   - `op` is one of `<`, `>`, `<=`, `>=`, `==`, `!=` or `in`.
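Because `value_filter` is a string holding a Python-style expression, it can also be assembled programmatically. The following is a minimal sketch and not part of the `cell_census` API: the helper names `eq_filter`, `in_filter` and `all_of` are hypothetical, and it assumes that string values can be written as Python literals via `repr()`, which also copes with values that contain a single quote, such as the assay name `10x 5' v1`.

```python
def eq_filter(column, value):
    """Build a `column == value` comparison (hypothetical helper)."""
    # repr() produces a quoted Python literal and picks a quote style
    # that does not clash with quotes inside the value.
    return f"{column} == {value!r}"


def in_filter(column, values):
    """Build a `column in [...]` comparison (hypothetical helper)."""
    return f"{column} in {list(values)!r}"


def all_of(*comparisons):
    """Combine comparisons so that every one of them must hold."""
    return " and ".join(comparisons)


gene_filter = in_filter("feature_id", ["ENSG00000161798", "ENSG00000188229"])
cell_filter = all_of(
    eq_filter("cell_type", "B cell"),
    eq_filter("tissue_general", "lung"),
)

print(gene_filter)
print(cell_filter)
print(eq_filter("assay", "10x 5' v1"))
```

The resulting strings can be passed as the `value_filter` argument of `read()`.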

To learn which metadata columns are available for fetching and filtering, we can look directly at the keys of the cell metadata.

In [2]:
list(census["census_data"]["homo_sapiens"].obs.keys())

['soma_joinid',
 'dataset_id',
 'assay',
 'assay_ontology_term_id',
 'cell_type',
 'cell_type_ontology_term_id',
 'development_stage',
 'development_stage_ontology_term_id',
 'disease',
 'disease_ontology_term_id',
 'donor_id',
 'is_primary_data',
 'self_reported_ethnicity',
 'self_reported_ethnicity_ontology_term_id',
 'sex',
 'sex_ontology_term_id',
 'suspension_type',
 'tissue',
 'tissue_ontology_term_id',
 'tissue_general',
 'tissue_general_ontology_term_id']

`soma_joinid` is a special `SOMADataFrame` column that is used for join operations. The definition for all other columns can be found at the [Cell Census schema](https://github.com/chanzuckerberg/cell-census/blob/main/docs/cell_census_schema_0.0.1.md#cell-metadata--census_objcensus_dataorganismobs--somadataframe).

All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for _a priori_.

For example, let's see what values are available for `sex`. To do this, we can load the cell metadata while fetching only the `sex` column.

In [3]:
sex_cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
    column_names = ["sex"]
).concat().to_pandas()

sex_cell_metadata.drop_duplicates()

Unnamed: 0,sex
0,male
3788,female
727233,unknown


As you can see, `sex` takes only three values: `"male"`, `"female"` and `"unknown"`.

With this information we can fetch all cell metadata for a specific `sex` value, for example `"unknown"`.

In [4]:
cell_metadata_all_unknown_sex = census["census_data"]["homo_sapiens"].obs.read(
    value_filter = "sex == 'unknown'"
).concat().to_pandas()

cell_metadata_all_unknown_sex

Unnamed: 0,soma_joinid,dataset_id,assay,assay_ontology_term_id,cell_type,cell_type_ontology_term_id,development_stage,development_stage_ontology_term_id,disease,disease_ontology_term_id,...,is_primary_data,self_reported_ethnicity,self_reported_ethnicity_ontology_term_id,sex,sex_ontology_term_id,suspension_type,tissue,tissue_ontology_term_id,tissue_general,tissue_general_ontology_term_id
0,727233,fa8605cf-f27e-44af-ac2a-476bee4410d3,10x 5' v1,EFO:0011025,"CD4-positive, alpha-beta T cell",CL:0000624,human middle aged stage,HsapDv:0000092,COVID-19,MONDO:0100096,...,True,unknown,unknown,unknown,unknown,cell,blood,UBERON:0000178,blood,UBERON:0000178
1,727234,fa8605cf-f27e-44af-ac2a-476bee4410d3,10x 5' v1,EFO:0011025,monocyte,CL:0000576,80 year-old and over human stage,HsapDv:0000095,COVID-19,MONDO:0100096,...,True,unknown,unknown,unknown,unknown,cell,blood,UBERON:0000178,blood,UBERON:0000178
2,727235,fa8605cf-f27e-44af-ac2a-476bee4410d3,10x 5' v1,EFO:0011025,monocyte,CL:0000576,human early adulthood stage,HsapDv:0000088,COVID-19,MONDO:0100096,...,True,unknown,unknown,unknown,unknown,cell,blood,UBERON:0000178,blood,UBERON:0000178
3,727236,fa8605cf-f27e-44af-ac2a-476bee4410d3,10x 5' v1,EFO:0011025,monocyte,CL:0000576,human early adulthood stage,HsapDv:0000088,COVID-19,MONDO:0100096,...,True,unknown,unknown,unknown,unknown,cell,blood,UBERON:0000178,blood,UBERON:0000178
4,727237,fa8605cf-f27e-44af-ac2a-476bee4410d3,10x 5' v1,EFO:0011025,monocyte,CL:0000576,human early adulthood stage,HsapDv:0000088,COVID-19,MONDO:0100096,...,True,unknown,unknown,unknown,unknown,cell,blood,UBERON:0000178,blood,UBERON:0000178
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2086592,43207791,b46237d1-19c6-4af2-9335-9854634bad16,10x 3' v2,EFO:0009899,mesodermal cell,CL:0000222,Carnegie stage 23,HsapDv:0000030,normal,PATO:0000461,...,True,unknown,unknown,unknown,unknown,cell,colon,UBERON:0001155,colon,UBERON:0001155
2086593,43207792,b46237d1-19c6-4af2-9335-9854634bad16,10x 3' v2,EFO:0009899,colon epithelial cell,CL:0011108,Carnegie stage 23,HsapDv:0000030,normal,PATO:0000461,...,True,unknown,unknown,unknown,unknown,cell,colon,UBERON:0001155,colon,UBERON:0001155
2086594,43207793,b46237d1-19c6-4af2-9335-9854634bad16,10x 3' v2,EFO:0009899,enteroendocrine cell,CL:0000164,Carnegie stage 23,HsapDv:0000030,normal,PATO:0000461,...,True,unknown,unknown,unknown,unknown,cell,colon,UBERON:0001155,colon,UBERON:0001155
2086595,43207794,b46237d1-19c6-4af2-9335-9854634bad16,10x 3' v2,EFO:0009899,colon epithelial cell,CL:0011108,Carnegie stage 23,HsapDv:0000030,normal,PATO:0000461,...,True,unknown,unknown,unknown,unknown,cell,colon,UBERON:0001155,colon,UBERON:0001155


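You can combine `column_names` and `value_filter` to perform specific queries. For example, let's fetch the `disease` column for the `cell_type` `"B cell"` in the `tissue_general` `"lung"`. Note that the columns used in `value_filter` are returned alongside the requested column, as the output shows.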

In [5]:
cell_metadata_b_cell = census["census_data"]["homo_sapiens"].obs.read(
    value_filter = "cell_type == 'B cell' and tissue_general == 'lung'",
    column_names = ["disease"],
).concat().to_pandas()

cell_metadata_b_cell.value_counts()

disease                                cell_type  tissue_general
lung adenocarcinoma                    B cell     lung              50228
non-small cell lung carcinoma          B cell     lung              17484
normal                                 B cell     lung              15081
squamous cell lung carcinoma           B cell     lung              11584
chronic obstructive pulmonary disease  B cell     lung               7147
interstitial lung disease 2            B cell     lung               5141
interstitial lung disease              B cell     lung               1655
COVID-19                               B cell     lung                704
small cell lung carcinoma              B cell     lung                583
non-specific interstitial pneumonia    B cell     lung                284
hypersensitivity pneumonitis           B cell     lung                 13
sarcoidosis                            B cell     lung                  6
dtype: int64

## Querying gene metadata (var)

The human gene metadata of the Cell Census is located at `census["census_data"]["homo_sapiens"].ms["RNA"].var`. Similarly to the cell metadata, it is a `SOMADataFrame` and thus we can also use its method `read()`.

The mouse gene metadata is at `census["census_data"]["mus_musculus"].ms["RNA"].var`.

Let's take a look at the metadata available for column selection and row filtering.

In [6]:
list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())

['soma_joinid', 'feature_id', 'feature_name', 'feature_length']

With the exception of `soma_joinid` these columns are defined in the [Cell Census schema](https://github.com/chanzuckerberg/cell-census/blob/main/docs/cell_census_schema_0.0.1.md#feature-metadata--census_objcensus_dataorganismmsrnavar--somadataframe). Similarly to the cell metadata, we can use the same operations to learn and fetch gene metadata.

For example, to get the `feature_name` and `feature_length` of the genes `"ENSG00000161798"` and `"ENSG00000188229"` we can do the following.

In [7]:
gene_metadata = census["census_data"]["homo_sapiens"].ms["RNA"].var.read(
    value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    column_names = ["feature_name", "feature_length"],
).concat().to_pandas()

gene_metadata

Unnamed: 0,feature_name,feature_length,feature_id
0,AQP5,1884,ENSG00000161798
1,TUBB4B,2037,ENSG00000188229


## Querying expression data

A convenient way to query and fetch expression data is to use the `get_anndata` method of the `cell_census` API. This is a powerful method that combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.

The method returns an `anndata.AnnData` object. It takes as input a census object and the string for an organism. For both cell and gene metadata, we can specify filters and column selection as described above, using the following arguments:

- `column_names`: a dictionary with two keys, `obs` and `var`, whose values are lists of strings indicating the columns to select for cell and gene metadata, respectively.
- `obs_value_filter`: Python expression with selection conditions to fetch **cells** meeting the given criteria. For full details see [`tiledb.QueryCondition`](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition).
- `var_value_filter`: Python expression with selection conditions to fetch **genes** meeting the given criteria. Details as above.
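Since `column_names` is a plain dictionary, it can be sanity-checked before a query is sent. The sketch below is a hypothetical helper, not part of the `cell_census` API. It assumes the available columns are exactly those listed by the `keys()` calls earlier in this notebook, which may differ in other Cell Census versions.

```python
# Column names copied from the `keys()` outputs shown earlier in this notebook.
OBS_COLUMNS = {
    "soma_joinid", "dataset_id", "assay", "assay_ontology_term_id",
    "cell_type", "cell_type_ontology_term_id", "development_stage",
    "development_stage_ontology_term_id", "disease", "disease_ontology_term_id",
    "donor_id", "is_primary_data", "self_reported_ethnicity",
    "self_reported_ethnicity_ontology_term_id", "sex", "sex_ontology_term_id",
    "suspension_type", "tissue", "tissue_ontology_term_id", "tissue_general",
    "tissue_general_ontology_term_id",
}
VAR_COLUMNS = {"soma_joinid", "feature_id", "feature_name", "feature_length"}


def check_column_names(column_names):
    """Return a list of problems found in a `column_names` dictionary."""
    known = {"obs": OBS_COLUMNS, "var": VAR_COLUMNS}
    problems = []
    for axis, columns in column_names.items():
        if axis not in known:
            problems.append(f"unexpected key {axis!r}; expected 'obs' or 'var'")
            continue
        for column in columns:
            if column not in known[axis]:
                problems.append(f"unknown {axis} column {column!r}")
    return problems


# A valid selection yields no problems.
print(check_column_names({"obs": ["sex"]}))

# A gene column requested under `obs`, plus a misspelled key, are both reported.
print(check_column_names({"obs": ["sex", "feature_name"], "cells": ["tissue"]}))
```

An empty list means the dictionary only refers to columns listed in this notebook.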


For example, let's fetch the expression data for:

- Genes `"ENSG00000161798"` and `"ENSG00000188229"`.
- All cells of type `"B cell"` from `"lung"` with `"COVID-19"`.
- All gene metadata, plus `sex` from the cell metadata.

In [8]:
adata = cell_census.get_anndata(
    census = census,
    organism = "Homo sapiens",
    var_value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter = "cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19'",
    column_names = {"obs": ["sex"]},
)

And now we can take a look at the results.

In [9]:
adata

AnnData object with n_obs × n_vars = 704 × 2
    obs: 'sex'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

In [10]:
adata.obs

Unnamed: 0,sex
0,female
1,female
2,female
3,female
4,female
...,...
699,male
700,male
701,male
702,male


In [11]:
adata.var

Unnamed: 0,soma_joinid,feature_id,feature_name,feature_length
0,10625,ENSG00000161798,AQP5,1884
1,31366,ENSG00000188229,TUBB4B,2037


For a full description of `get_anndata()`, refer to `help(cell_census.get_anndata)`.