# Tutorial Cellxgene: Data fetching

This notebook is a tutorial on the `cellxgene_census` package developed by the Chan Zuckerberg Initiative (CZI).
In particular, it shows how to retrieve data from the Census.
It is organized as follows:
1. Retrieve AnnData objects
2. Retrieve obs (cell) metadata
3. Retrieve var (gene) metadata

## AnnData objects

In [1]:
# let's start by importing census 
import cellxgene_census

In [2]:
# the reference documentation can be accessed using the help command:
help(cellxgene_census)
# you can also access the information about specific functions of the API 
help(cellxgene_census.get_anndata)

Help on package cellxgene_census:

NAME
    cellxgene_census - An API to facilitate use of the CZI Science CELLxGENE Census. The Census is a versioned container of single-cell data hosted at `CELLxGENE Discover`_.

DESCRIPTION
    The API is built on the `tiledbsoma` SOMA API, and provides a number of helper functions including:
    
        * Open a named version of the Census, for use with the SOMA API
        * Get a list of available Census versions, and for each version, a description
        * Get a slice of the Census as an AnnData, for use with ScanPy
        * Get the URI for, or directly download, underlying data in H5AD format
    
    For more information on the API, visit the `cellxgene_census repo`_. For more information on SOMA, see the `tiledbsoma repo`_.
    
    .. _CELLxGENE Discover:
        https://cellxgene.cziscience.com/
    
    .. _cellxgene_census repo:
        https://github.com/chanzuckerberg/cellxgene-census/
    
    .. _tiledbsoma repo:
        https://g

Remember that the data are downloaded from the Census into local RAM, so please check the system requirement guidelines before running large queries.
Connecting to the server and downloading the data can take several minutes.

In [3]:
# open census 
census = cellxgene_census.open_soma()

The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.


A convenient way to query and fetch expression data is the `get_anndata` function of the `cellxgene_census` API. It combines column selection and value filtering to obtain slices of the expression data based on metadata queries.

The function takes a Census object and the name of an organism as input and returns an `anndata.AnnData` object. For both cell and gene metadata, filters and column selection are specified with the following arguments:

- `obs_column_names` and `var_column_names`: lists of strings indicating the columns to select for cell (`obs`) and gene (`var`) metadata respectively.
- `obs_value_filter`: Python expression with selection conditions to fetch **cells** meeting the criteria.
- `var_value_filter`: Python expression with selection conditions to fetch **genes** meeting the criteria, with the same syntax as `obs_value_filter`.
- `X_name`: selects the raw or normalized data; it defaults to `"raw"`.

For example, suppose we want to fetch the expression data for:

- Genes `"ENSG00000161798"` and `"ENSG00000188229"`.
- All `"B cell"` entries from `"lung"` with `"COVID-19"`, excluding duplicated cells.
- All gene metadata, plus the `sex` cell metadata column.

In [4]:
# In this case, we want all Homo sapiens B cells from the lung of donors with COVID-19.
# To avoid duplicated cell entries, we keep only cells with is_primary_data == True.
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["sex"],
)

In [5]:
adata

AnnData object with n_obs × n_vars = 2729 × 2
    obs: 'sex', 'cell_type', 'tissue_general', 'disease', 'is_primary_data'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'

In [6]:
# remember to always close the census once the download has finished
census.close()
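
One way to make sure `close()` is never forgotten is to open the Census in a `with` block. The sketch below assumes that the object returned by `open_soma()` can be used as a context manager, and it pins the release named in the message printed by `open_soma()` above. The function name `fetch_covid_b_cells` and the constant `CENSUS_VERSION` are made up for illustration.

```python
CENSUS_VERSION = "2024-07-01"  # the "stable" release reported by open_soma() above


def fetch_covid_b_cells(census_version=CENSUS_VERSION):
    """Return the COVID-19 lung B-cell slice, closing the Census afterwards."""
    # Imported here so that defining the function does not require the package.
    import cellxgene_census

    # Leaving the `with` block closes the census, even if the query raises.
    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
            obs_value_filter=(
                "cell_type == 'B cell' and tissue_general == 'lung' "
                "and disease == 'COVID-19' and is_primary_data == True"
            ),
            obs_column_names=["sex"],
        )
```

Calling `adata = fetch_covid_b_cells()` still needs network access and enough RAM for the slice; pinning the version keeps repeated runs on the same release.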


## Fetching cell metadata (obs)

The human cell metadata of the Census is located at `census["census_data"]["homo_sapiens"].obs`. This is a `SOMADataFrame`, so it can be materialized as a `pandas.DataFrame` by chaining the methods `read().concat().to_pandas()`.

The mouse cell metadata is at `census["census_data"]["mus_musculus"].obs`.

To slice the cell metadata, two relevant arguments can be passed to `read()`:

- `column_names`: list of strings indicating which metadata columns to fetch.
- `value_filter`: Python expression with selection conditions to fetch rows. It is similar to [pandas.DataFrame.query()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html); for full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition). In short:
    - Expressions are one or more comparisons.
    - Comparisons are one of `<column> <op> <value>` or `<column> <op> <column>`.
    - Expressions can combine comparisons using `and`, `or`, `&` or `|`.
    - `op` is one of `<`, `>`, `<=`, `>=`, `==`, `!=` or `in`.
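
As a minimal sketch of these two arguments, the helper below reads a filtered slice of the cell metadata directly from the `SOMADataFrame` and materializes it with `read().concat().to_pandas()`. The function name `read_obs` and its defaults are illustrative choices, and `census` is assumed to be an already opened Census handle.

```python
def read_obs(census, organism_key="homo_sapiens", column_names=None, value_filter=None):
    """Materialize a slice of the cell metadata as a pandas.DataFrame."""
    obs = census["census_data"][organism_key].obs
    # Pass only the arguments that were given; otherwise keep the library defaults.
    kwargs = {}
    if column_names is not None:
        kwargs["column_names"] = list(column_names)
    if value_filter is not None:
        kwargs["value_filter"] = value_filter
    return obs.read(**kwargs).concat().to_pandas()
```

For instance, `read_obs(census, column_names=["sex", "cell_type"], value_filter="tissue_general == 'lung' and is_primary_data == True")` would fetch two columns for primary lung cells. Calling it without a filter materializes the metadata of every cell, which can be very large.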

In [7]:
# open census 
census = cellxgene_census.open_soma()

# To learn which metadata columns are available for fetching and filtering, look at the keys of the cell metadata:
keys = list(census["census_data"]["homo_sapiens"].obs.keys())

keys

The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.


['soma_joinid',
 'dataset_id',
 'assay',
 'assay_ontology_term_id',
 'cell_type',
 'cell_type_ontology_term_id',
 'development_stage',
 'development_stage_ontology_term_id',
 'disease',
 'disease_ontology_term_id',
 'donor_id',
 'is_primary_data',
 'observation_joinid',
 'self_reported_ethnicity',
 'self_reported_ethnicity_ontology_term_id',
 'sex',
 'sex_ontology_term_id',
 'suspension_type',
 'tissue',
 'tissue_ontology_term_id',
 'tissue_type',
 'tissue_general',
 'tissue_general_ontology_term_id',
 'raw_sum',
 'nnz',
 'raw_mean_nnz',
 'raw_variance_nnz',
 'n_measured_vars']

In [8]:
# you also need to know which values a column can take, for example the column sex:
sex_cell_metadata = cellxgene_census.get_obs(census, "homo_sapiens", column_names=["sex"])

sex_cell_metadata.drop_duplicates()

Unnamed: 0,sex
0,female
85,male
63809,unknown


In [9]:
# Let's now query only the cells with unknown sex
cell_metadata_all_unknown_sex = cellxgene_census.get_obs(census, "homo_sapiens", value_filter="sex == 'unknown'")

cell_metadata_all_unknown_sex


Unnamed: 0,soma_joinid,dataset_id,assay,assay_ontology_term_id,cell_type,cell_type_ontology_term_id,development_stage,development_stage_ontology_term_id,disease,disease_ontology_term_id,...,tissue,tissue_ontology_term_id,tissue_type,tissue_general,tissue_general_ontology_term_id,raw_sum,nnz,raw_mean_nnz,raw_variance_nnz,n_measured_vars
0,63809,94423ec1-21f8-40e8-b5c9-c3ea82350ca4,10x 3' v2,EFO:0009899,dendritic cell,CL:0000451,unknown,unknown,normal,PATO:0000461,...,body of stomach,UBERON:0001161,tissue,stomach,UBERON:0000945,695.0,368,1.888587,12.142867,19550
1,63825,94423ec1-21f8-40e8-b5c9-c3ea82350ca4,10x 3' v2,EFO:0009899,monocyte,CL:0000576,unknown,unknown,normal,PATO:0000461,...,body of stomach,UBERON:0001161,tissue,stomach,UBERON:0000945,6095.0,1427,4.271198,124.798069,19550
2,63829,94423ec1-21f8-40e8-b5c9-c3ea82350ca4,10x 3' v2,EFO:0009899,monocyte,CL:0000576,unknown,unknown,normal,PATO:0000461,...,body of stomach,UBERON:0001161,tissue,stomach,UBERON:0000945,1045.0,492,2.123984,23.318609,19550
3,63842,94423ec1-21f8-40e8-b5c9-c3ea82350ca4,10x 3' v2,EFO:0009899,mast cell,CL:0000097,unknown,unknown,normal,PATO:0000461,...,body of stomach,UBERON:0001161,tissue,stomach,UBERON:0000945,1546.0,640,2.415625,27.823856,19550
4,63845,94423ec1-21f8-40e8-b5c9-c3ea82350ca4,10x 3' v2,EFO:0009899,monocyte,CL:0000576,unknown,unknown,normal,PATO:0000461,...,body of stomach,UBERON:0001161,tissue,stomach,UBERON:0000945,1308.0,530,2.467925,59.814659,19550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3756275,69305276,9f222629-9e39-47d0-b83f-e08d610c7479,Drop-seq,EFO:0008722,ciliated columnar cell of tracheobronchial tree,CL:0002145,unknown,unknown,cystic fibrosis,MONDO:0009061,...,lung,UBERON:0002048,tissue,lung,UBERON:0002048,2748.0,1592,1.726131,15.334125,50205
3756276,69305278,9f222629-9e39-47d0-b83f-e08d610c7479,10x 3' v2,EFO:0009899,alveolar macrophage,CL:0000583,unknown,unknown,interstitial lung disease,MONDO:0015925,...,lung,UBERON:0002048,tissue,lung,UBERON:0002048,6945.0,2010,3.455224,200.698094,50205
3756277,69305280,9f222629-9e39-47d0-b83f-e08d610c7479,10x 3' v3,EFO:0009922,alveolar macrophage,CL:0000583,unknown,unknown,normal,PATO:0000461,...,lung,UBERON:0002048,tissue,lung,UBERON:0002048,37883.0,5559,6.814715,2129.944792,50205
3756278,69305283,9f222629-9e39-47d0-b83f-e08d610c7479,10x 3' v2,EFO:0009899,unknown,unknown,unknown,unknown,normal,PATO:0000461,...,lung,UBERON:0002048,tissue,lung,UBERON:0002048,10531.0,3077,3.422489,227.936529,50205


In [10]:
# We can do the same with other columns, such as "development_stage"
development_stage_metadata = cellxgene_census.get_obs(census, "homo_sapiens", column_names=["development_stage"])

development_stage_metadata.drop_duplicates()

Unnamed: 0,development_stage
0,human adult stage
129,mature stage
146,6-year-old human stage
148,3-year-old human stage
399,5-year-old human stage
...,...
42256022,eighth LMP month human stage
42271546,sixth LMP month human stage
42282126,seventh LMP month human stage
45115591,25th week post-fertilization human stage


In [11]:
# for example, fetch all cells annotated with the 'mature stage'
cell_metadata_all_mature_stage = cellxgene_census.get_obs(census, "homo_sapiens", value_filter="development_stage == 'mature stage'")

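You can also combine several conditions in a single query.

In [12]:
# combine several conditions in one filter; as the output shows, the columns used in the filter are returned together with those in column_names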
cell_metadata_b_cell = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data==True",
    column_names=["disease"],
)

cell_metadata_b_cell.value_counts()

disease                                cell_type  tissue_general  is_primary_data
lung adenocarcinoma                    B cell     lung            True               42720
squamous cell lung carcinoma           B cell     lung            True               10631
normal                                 B cell     lung            True                9082
non-small cell lung carcinoma          B cell     lung            True                8742
pulmonary fibrosis                     B cell     lung            True                6798
COVID-19                               B cell     lung            True                2729
chronic obstructive pulmonary disease  B cell     lung            True                2203
lung large cell carcinoma              B cell     lung            True                1534
pulmonary emphysema                    B cell     lung            True                1512
pleomorphic carcinoma                  B cell     lung            True                1210
intersti
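
Longer filters like the one above are easy to mistype when written by hand. A small helper can assemble them from a dictionary. This is a sketch that assumes conditions are only combined with `and`, uses `in` for lists and `==` otherwise, and relies on the filter accepting Python literals. The name `build_value_filter` is hypothetical and not part of `cellxgene_census`.

```python
def build_value_filter(conditions):
    """Join {column: value} pairs into a value_filter string using `and`."""
    parts = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            # A list of allowed values becomes a membership test.
            parts.append(f"{column} in {list(value)!r}")
        else:
            # repr() quotes strings and leaves booleans and numbers bare.
            parts.append(f"{column} == {value!r}")
    return " and ".join(parts)


b_cell_filter = build_value_filter(
    {"cell_type": "B cell", "tissue_general": "lung", "is_primary_data": True}
)
print(b_cell_filter)
# cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True
```

The resulting string can be passed as `value_filter` to `get_obs`. Because `repr()` picks the quote style, a value containing an apostrophe, such as the assay `10x 3' v2` seen in the output earlier, is wrapped in double quotes instead of producing an unbalanced string.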

In [13]:
census.close()

## Fetching gene metadata (var)

The human gene metadata of the Census is located at `census["census_data"]["homo_sapiens"].ms["RNA"].var`. Similarly to the cell metadata, it is a `SOMADataFrame` and thus we can also use its method `read()`.

The mouse gene metadata is at `census["census_data"]["mus_musculus"].ms["RNA"].var`.

Let’s take a look at the metadata available for column selection and row filtering.

In [14]:
# open census 
census = cellxgene_census.open_soma()
keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())

keys

The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.


['soma_joinid',
 'feature_id',
 'feature_name',
 'feature_length',
 'nnz',
 'n_measured_obs']

In [15]:
# filter genes by id, keeping only the "feature_name" and "feature_length" columns
gene_metadata = cellxgene_census.get_var(
    census,
    "homo_sapiens",
    value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    column_names=["feature_name", "feature_length"],
)

gene_metadata

Unnamed: 0,feature_name,feature_length,feature_id
0,AQP5,1884,ENSG00000161798
1,TUBB4B,2037,ENSG00000188229
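
A similar selection can be expressed with `read()` on the `var` `SOMADataFrame` described at the start of this section, which makes it easy to switch between human and mouse. This is only a sketch: `read_var` and `organism_key` are illustrative names, `census` is assumed to be an open Census handle, and the returned columns are not guaranteed to match those of `get_var` exactly.

```python
def organism_key(organism):
    """Turn a display name such as 'Homo sapiens' into the key 'homo_sapiens'."""
    return organism.strip().lower().replace(" ", "_")


def read_var(census, organism, column_names, value_filter):
    """Read filtered gene metadata of the RNA measurement as a pandas.DataFrame."""
    var = census["census_data"][organism_key(organism)].ms["RNA"].var
    return var.read(column_names=column_names, value_filter=value_filter).concat().to_pandas()
```

For example, `read_var(census, "Mus musculus", ["feature_name", "feature_length"], "feature_length > 1000")` would read from `census["census_data"]["mus_musculus"].ms["RNA"].var`.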


In [16]:
census.close()

We have now accessed and filtered the Census based on our queries. These data are curated and share consistent column keys, but differences in normalization, as well as technical and biological biases, may persist.

**When conducting analyses with data from multiple datasets, carefully consider these factors.** Pay particular attention to the **assay** column, which indicates the specific technique used to collect the data.
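
One quick way to see how many techniques a query mixes is to tabulate cells per dataset and assay before starting an analysis. The sketch below uses only the standard library and works on any mapping from column name to values, which should include the `pandas.DataFrame` returned by `get_obs` as long as the `dataset_id` and `assay` columns were requested. The function name is illustrative, and the dataset labels in the toy input are placeholders, not real Census records.

```python
from collections import Counter


def cells_per_dataset_and_assay(obs):
    """Count rows for each (dataset_id, assay) pair."""
    return Counter(zip(obs["dataset_id"], obs["assay"]))


# Placeholder values, only to show the shape of the result.
toy_obs = {
    "dataset_id": ["A", "A", "B", "B", "B"],
    "assay": ["10x 3' v2", "10x 3' v2", "Drop-seq", "10x 3' v3", "10x 3' v3"],
}
for (dataset_id, assay), n_cells in cells_per_dataset_and_assay(toy_obs).most_common():
    print(dataset_id, assay, n_cells)
```

A dataset that contributes cells from several assays, or a query dominated by a single assay, is a hint to handle the assay explicitly in downstream analysis.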