# Census Datasets example

*Goal:* demonstrate basic use of the `census_datasets` dataframe.

Each Cell Census contains a top-level dataframe itemizing the datasets contained therein. You can read this into a Pandas DataFrame:

In [1]:
import cell_census
import tiledbsoma as soma

census = cell_census.open_soma()
census_datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# For convenience, index on soma_joinid, which links this dataframe to other census data.
census_datasets = census_datasets.set_index("soma_joinid")
census_datasets

| soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|
| 0 | 03f821b4-87be-4ff4-b65a-b5fc00061da7 | Local and systemic responses to SARS-CoV-2 inf... | 10.1038/s41586-021-04345-x | edc8d3fe-153c-4e3d-8be0-2108d30f8d70 | Airway | edc8d3fe-153c-4e3d-8be0-2108d30f8d70.h5ad | 236977 |
| 1 | 03f821b4-87be-4ff4-b65a-b5fc00061da7 | Local and systemic responses to SARS-CoV-2 inf... | 10.1038/s41586-021-04345-x | 2a498ace-872a-4935-984b-1afa70fd9886 | PBMC | 2a498ace-872a-4935-984b-1afa70fd9886.h5ad | 422220 |
| 2 | 43d4bb39-21af-4d05-b973-4c1fed7b916c | Transcriptional Programming of Normal and Infl... | 10.1016/j.celrep.2018.09.006 | f512b8b6-369d-4a85-a695-116e0806857f | Skin | f512b8b6-369d-4a85-a695-116e0806857f.h5ad | 68036 |
| 3 | 0434a9d4-85fd-4554-b8e3-cf6c582bb2fa | Acute COVID-19 cohort across a range of WHO ca... | 10.1101/2020.11.20.20227355 | fa8605cf-f27e-44af-ac2a-476bee4410d3 | PBMCs | fa8605cf-f27e-44af-ac2a-476bee4410d3.h5ad | 59506 |
| 4 | a9254216-6cd8-4186-b32c-349363777584 | Single-cell reconstruction of the early matern... | 10.1038/s41586-018-0698-6 | 5bc42b88-bb76-4954-927b-8bb7369adc64 | Pregnant Uterus (All) | 5bc42b88-bb76-4954-927b-8bb7369adc64.h5ad | 70325 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 466 | 5d445965-6f1a-4b68-ba3a-b8f765155d3a | A molecular cell atlas of the human lung from ... | 10.1038/s41586-020-2922-4 | e04daea4-4412-45b5-989e-76a9be070a89 | Krasnow Lab Human Lung Cell Atlas, Smart-seq2 | e04daea4-4412-45b5-989e-76a9be070a89.h5ad | 9409 |
| 467 | 5d445965-6f1a-4b68-ba3a-b8f765155d3a | A molecular cell atlas of the human lung from ... | 10.1038/s41586-020-2922-4 | 8c42cfd0-0b0a-46d5-910c-fc833d83c45e | Krasnow Lab Human Lung Cell Atlas, 10X | 8c42cfd0-0b0a-46d5-910c-fc833d83c45e.h5ad | 65662 |
| 468 | 33d19f34-87f5-455b-8ca5-9023a2e5453d | Intra- and Inter-cellular Rewiring of the Huma... | 10.1016/j.cell.2019.06.029 | 4dd00779-7f73-4f50-89bb-e2d3c6b71b18 | Intra- and Inter-cellular Rewiring of the Huma... | 4dd00779-7f73-4f50-89bb-e2d3c6b71b18.h5ad | 34772 |
| 469 | 17481d16-ee44-49e5-bcf0-28c0780d8c4a | Single-Cell Sequencing of Developing Human Gut... | 10.1016/j.devcel.2020.11.010 | 8e47ed12-c658-4252-b126-381df8d52a3d | Paediatric Human Gut (4-14y) | 8e47ed12-c658-4252-b126-381df8d52a3d.h5ad | 22502 |
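
Because each row carries both a `collection_id` and a `dataset_total_cell_count`, the dataframe can be summarized per collection with ordinary Pandas operations. The sketch below is illustrative only: to stay self-contained it uses a small stand-in dataframe (`census_datasets_sample`, a hypothetical name) built from three rows of the output above rather than the live census. The same `groupby` call should apply to the real `census_datasets`.

```python
import pandas as pd

# Stand-in for census_datasets: three rows copied from the output above.
census_datasets_sample = pd.DataFrame(
    {
        "collection_id": [
            "03f821b4-87be-4ff4-b65a-b5fc00061da7",
            "03f821b4-87be-4ff4-b65a-b5fc00061da7",
            "43d4bb39-21af-4d05-b973-4c1fed7b916c",
        ],
        "dataset_title": ["Airway", "PBMC", "Skin"],
        "dataset_total_cell_count": [236977, 422220, 68036],
    },
    index=pd.Index([0, 1, 2], name="soma_joinid"),
)

# Number of datasets and total cells per collection, largest collection first.
cells_per_collection = (
    census_datasets_sample.groupby("collection_id")["dataset_total_cell_count"]
    .agg(["count", "sum"])
    .rename(columns={"count": "n_datasets", "sum": "total_cells"})
    .sort_values("total_cells", ascending=False)
)
print(cells_per_collection)
```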


The sum of cells across all datasets should match the number of cells across all SOMA experiments (human, mouse).

In [2]:
# Count cells across all experiments
experiments_total_cells = 0
print("Count by experiment:")
for organism_name, organism_experiment in census["census_data"].items():
    num_cells = len(organism_experiment.obs.read(column_names=["soma_joinid"]).concat().to_pandas())
    print(f"\t{num_cells} cells in {organism_name}")
    experiments_total_cells += num_cells

print(f"\nFound {experiments_total_cells} cells in all experiments.")

# Count cells across all datasets
print(f"Found {census_datasets.dataset_total_cell_count.sum()} cells in all datasets.")

Count by experiment:
	43207796 cells in homo_sapiens
	3922090 cells in mus_musculus

Found 47129886 cells in all experiments.
Found 47129886 cells in all datasets.


Let's pick one dataset to slice out of the census and turn it into an in-memory [AnnData](https://anndata.readthedocs.io/en/latest/) object. This object can be used with the [Scanpy](https://scanpy.readthedocs.io/en/stable/) toolchain. You can also save it locally using the AnnData [`write`](https://anndata.readthedocs.io/en/latest/api.html#writing) API.

In [3]:
census_datasets[census_datasets.dataset_id == "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149"]

| soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|
| 139 | 0b9d8a04-bb9d-44da-aa27-705bb65b54eb | Tabula Muris Senis | 10.1038/s41586-020-2496-1 | 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149 | Bone marrow - A single-cell transcriptomic atl... | 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad | 40220 |


Create a query on the "RNA" measurement of the mouse experiment, filtered to this `dataset_id`.

In [4]:
mouse = census["census_data"]["mus_musculus"]
with mouse.axis_query(
    "RNA",
    obs_query=soma.AxisQuery(value_filter="dataset_id == '0bd1a1de-3aee-40e0-b2ec-86c7a30c7149'"),
) as query:
    adata = query.to_anndata("raw")

adata

AnnData object with n_obs × n_vars = 40220 × 52392
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
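
The `value_filter` passed to `soma.AxisQuery` above is a plain string, so it can be assembled programmatically. The helper below, `dataset_value_filter`, is a hypothetical name introduced here for illustration and is not part of `cell_census` or `tiledbsoma`. It is a minimal sketch that assumes the filter language accepts an `in [...]` list expression for selecting several datasets at once. The single-dataset form matches the query shown above.

```python
def dataset_value_filter(dataset_ids):
    """Build a value_filter string selecting cells from the given dataset IDs.

    Hypothetical helper: assumes the SOMA filter syntax supports
    `column in ['a', 'b']` for multi-value matches.
    """
    ids = list(dataset_ids)
    if not ids:
        raise ValueError("at least one dataset_id is required")
    if len(ids) == 1:
        # Same form as the query used in this notebook.
        return f"dataset_id == '{ids[0]}'"
    quoted = ", ".join(f"'{d}'" for d in ids)
    return f"dataset_id in [{quoted}]"


single = dataset_value_filter(["0bd1a1de-3aee-40e0-b2ec-86c7a30c7149"])
print(single)
```

The resulting string would then be passed as `soma.AxisQuery(value_filter=...)`. Note that all dataset IDs in one query need to belong to the same organism experiment.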

You can also use the `cell_census.get_source_h5ad_uri()` API to fetch a URI pointing to the H5AD associated with this `dataset_id`. This is the same H5AD you can download from the CELLxGENE Portal, and it may contain additional submitter-provided information that was not included in the Cell Census.

The "locator" returned by this API includes a `uri` and any additional information that may be necessary to use the URI (e.g., the S3 region).

You will need to use a download API to fetch this H5AD, such as [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/).

In [5]:
cell_census.get_source_h5ad_uri("0bd1a1de-3aee-40e0-b2ec-86c7a30c7149")

{'uri': 's3://cellxgene-data-public/cell-census/2023-01-13/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad',
 's3_region': 'us-west-2'}
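
One possible way to fetch the file with `fsspec` is sketched below. This is an illustration under stated assumptions rather than the library's own download path: it assumes the `s3fs` backend for `fsspec` is installed, that the bucket allows anonymous reads, and that `client_kwargs={"region_name": ...}` is the appropriate way to pass the region. The function names `storage_options` and `download_h5ad` are hypothetical.

```python
def storage_options(locator):
    """Translate a locator dict into fsspec/s3fs filesystem options."""
    # Assumption: the bucket is publicly readable, so no credentials are sent.
    options = {"anon": True}
    region = locator.get("s3_region")
    if region:
        options["client_kwargs"] = {"region_name": region}
    return options


def download_h5ad(locator, local_path):
    """Copy the H5AD at locator['uri'] to local_path (requires fsspec and s3fs)."""
    import fsspec  # imported lazily so the helper above works without it

    fs = fsspec.filesystem("s3", **storage_options(locator))
    fs.get(locator["uri"], local_path)


# Locator copied from the output above.
locator = {
    "uri": "s3://cellxgene-data-public/cell-census/2023-01-13/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad",
    "s3_region": "us-west-2",
}
print(storage_options(locator))
# download_h5ad(locator, "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad")  # performs a network download
```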