# Cell Census query & extract subsets

_Goal:_ demonstrate the ability to query subsets of the Cell Census based upon user-defined obs/var metadata, and extract those slices into in-memory data structures for further analysis.

**NOTE:** all examples in this notebook assume that sufficient memory exists on the host machine to store query results. There are other notebooks which provide examples for out-of-core processing.

In [1]:
import cell_census

census = cell_census.open_soma(census_version="latest")

The Cell Census includes SOMA Experiments for both human and mouse. These experiments can be queried based upon metadata values (e.g., tissue type), and the query result can be extracted into a variety of formats.

> ⚠️ **NOTE:** The following is experimental query code. It is built upon SOMA, but is not (yet) part of SOMA. If it becomes sufficiently useful, we plan to propose it as a SOMA extension.

Basic idea:

- define per-axis (i.e., obs, var) query criteria
- specify the experiment and measurement name to be queried
- specify the column names you want as part of the results
- and read the query result _into an in-memory format_.

This utilizes the SOMA `value_filter` query language. Keep in mind that the results must fit into memory, so it is best to define a selective query _and_ only fetch those axis metadata columns which are necessary.
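The filter strings used in this notebook are plain Python strings that combine column comparisons with `and`. As a minimal sketch, the helper below assembles such a string from a dictionary of criteria. `build_value_filter` is a hypothetical name introduced here for illustration; it is not part of `cell_census` or SOMA, and it only handles simple string values without embedded quotes.

```python
def build_value_filter(criteria):
    """Join per-column criteria into one value_filter expression string.

    `criteria` maps a column name to a single value (rendered as `==`)
    or to a list/tuple of values (rendered as `in [...]`).
    Hypothetical helper for illustration; not part of cell_census or SOMA.
    """
    clauses = []
    for column, wanted in criteria.items():
        if isinstance(wanted, (list, tuple)):
            # repr() of a list of strings yields ['a', 'b'], matching the
            # list syntax used in the filter strings in this notebook.
            clauses.append(f"{column} in {list(wanted)!r}")
        else:
            clauses.append(f"{column} == {wanted!r}")
    return " and ".join(clauses)


obs_filter = build_value_filter(
    {
        "tissue_ontology_term_id": "UBERON:0002048",
        "sex_ontology_term_id": "PATO:0000383",
        "cell_type_ontology_term_id": ["CL:0002063", "CL:0000499"],
    }
)
print(obs_filter)
```

The resulting string could then be passed as `obs_value_filter`. Building it from a dictionary keeps long filters readable and makes it easier to keep the query selective.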

The `cell_census` package includes a convenience function to extract a slice of the Census and read it into an [AnnData](https://anndata.readthedocs.io/en/latest/) object, for use with [Scanpy](https://scanpy.readthedocs.io/en/stable/). This function accepts a variety of arguments, including:
* the organism to slice
* the per-axis slice criteria
* the columns to fetch and include in the AnnData

For more complex query scenarios, there is an advanced query API demonstrated in other notebooks.

In [2]:
# Define a simple obs-axis query for all cells where tissue is UBERON:0002048, sex is PATO:0000383, and cell type is CL:0002063 or CL:0000499.
adata = cell_census.get_anndata(
    census,
    "Homo sapiens",
    obs_value_filter="tissue_ontology_term_id=='UBERON:0002048' and sex_ontology_term_id=='PATO:0000383' and cell_type_ontology_term_id in ['CL:0002063', 'CL:0000499']",
)

display(adata)

AnnData object with n_obs × n_vars = 119269 × 60664
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
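Because query results must fit in memory, it can help to estimate the size of a result from its reported shape. The sketch below computes the footprint the matrix would have if stored densely, assuming 4 bytes per value (float32). Both the dense layout and the value width are assumptions made for this estimate; if the matrix is stored sparse, the actual usage depends on the number of nonzero values and is usually far smaller. `dense_size_gib` is a hypothetical helper, not part of `cell_census`.

```python
def dense_size_gib(n_obs, n_vars, bytes_per_value=4):
    """Return the size in GiB of a dense n_obs x n_vars matrix.

    Assumes every value is stored explicitly; 4 bytes per value
    corresponds to float32 (an assumption for this estimate).
    """
    return n_obs * n_vars * bytes_per_value / 2**30


# Shape reported by the AnnData object above (all genes retained).
full_query = dense_size_gib(119269, 60664)

# Same number of cells, restricted to three genes via a var-axis filter.
three_genes = dense_size_gib(119269, 3)

print(f"all genes:   {full_query:.2f} GiB")
print(f"three genes: {three_genes:.5f} GiB")
```

The comparison illustrates why restricting the var axis, in addition to the obs axis, is an effective way to keep a result small.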

In [3]:
# You can also query on both axes. This example adds a var-axis query for a handful of genes, and queries the mouse experiment.
adata = cell_census.get_anndata(
    census,
    "Mus musculus",
    obs_value_filter="tissue == 'brain'",
    var_value_filter="feature_name in ['Gm16259', 'Dcaf5', 'Gm53058']",
    column_names={"obs": ["tissue", "cell_type", "sex"]},
)

display(adata)

AnnData object with n_obs × n_vars = 133674 × 3
    obs: 'tissue', 'cell_type', 'sex'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

Close the census when you are finished with it.

In [4]:
census.close()
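If a query raises an exception, an explicit `close()` call at the end of the notebook is never reached. One way to guard against that is the standard library's `contextlib.closing`, which calls `close()` on any object when the `with` block exits. The sketch below uses a stand-in class so that it runs without network access; `StandInCensus` is hypothetical and only mimics the `close()` method shown above.

```python
from contextlib import closing


class StandInCensus:
    """Hypothetical stand-in for an open census handle; only close() is mimicked."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


handle = StandInCensus()
try:
    # closing() calls handle.close() on exit, even when the body raises.
    with closing(handle) as census:
        raise RuntimeError("simulated failure during a query")
except RuntimeError:
    pass

print(handle.closed)
```

In a real session, the object returned by `cell_census.open_soma(...)` would take the place of `StandInCensus()`, on the assumption that its `close()` method behaves as used above.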