# CAS-CAP roundtrip

The aim of this demo is to illustrate roundtripping between CAP h5ad format and h5ad + merged CAS.

## How to Run the Notebooks

For detailed instructions on setting up and running the notebooks, please refer to the [README.md](https://github.com/cellannotation/cas-tools/blob/main/notebooks/README.md) in the notebooks directory.


In [2]:
import json
import pandas as pd
import anndata as ad

#### Retrieve AnnData File for `CB Glut`  

The Cellular Semantics group at the Sanger hosts a number of pre-rolled CAS 'taxonomies' for brain-related datasets. These can be browsed at the [Cellular Semantics Taxonomy Catalog](https://cellular-semantics.sanger.ac.uk/tdt/catalog).

For demo purposes we will focus on an h5ad file of cerebellar glutamatergic neurons (Class: 29 CB Glut) from:

Yao, Zizhen, Cindy T. J. van Velthoven, Michael Kunst, Meng Zhang, Delissa McMillen, Changkyu Lee, Won Jung, et al. 2023. “A High-Resolution Transcriptomic and Spatial Atlas of Cell Types in the Whole Mouse Brain.” Nature 624 (7991): 317–32. https://doi.org/10.1038/s41586-023-06812-z

This file already has CAS merged into `uns`.

This data can be viewed on the [Allen Brain Cell Atlas](https://knowledge.brain-map.org/abcatlas) Class: 29 CB Glut.

In [3]:
# Download the h5ad file.

!wget -N http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad

--2025-03-27 15:05:08--  http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad
Resolving cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)... 172.27.51.1, 172.27.51.3, 172.27.51.2, ...
Connecting to cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)|172.27.51.1|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad [following]
--2025-03-27 15:05:08--  https://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad
Connecting to cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)|172.27.51.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2575377493 (2.4G) [application/x-hdf5]
Saving to: ‘merged_CS20230722_CLAS_29.h5ad’


2025-03-27 15:06:44 (25.7 MB/s) - ‘merged_CS20230722_CLAS_29.h5ad’ saved [2575377493/2575377493]



In [None]:
## Inspecting file contents

In [4]:
merged_anndata = ad.read_h5ad("merged_CS20230722_CLAS_29.h5ad", backed="r")
merged_anndata.obs[:5]

cell_label,cell_barcode,library_label,tissue,tissue_ontology_term_id,neurotransmitter,class,subclass,supertype,cluster,organism,disease,assay
AAACCCAAGAACAAGG-472_A05,AAACCCAAGAACAAGG,L8TX_201217_01_G07,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v2
AAACCCAAGAATCCCT-473_A06,AAACCCAAGAATCCCT,L8TX_201217_01_A08,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v3
AAACCCAAGACTACCT-225_A01,AAACCCAAGACTACCT,L8TX_200227_01_F10,Medulla,UBERON:0001896,Glut,29 CB Glut,314 CB Granule Glut,1154 CB Granule Glut_1,5197 CB Granule Glut_1,Mus musculus,normal,10x 3' v2
AAACCCAAGAGCTGAC-231.2_B01,AAACCCAAGAGCTGAC,L8TX_200306_01_H12,Medulla,UBERON:0001896,Glut,29 CB Glut,314 CB Granule Glut,1154 CB Granule Glut_1,5197 CB Granule Glut_1,Mus musculus,normal,10x 3' v3
AAACCCAAGAGGACTC-478_A02,AAACCCAAGAGGACTC,L8TX_210107_02_H11,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v2


### Extract CAS from header and inspect

In [6]:
merged_anndata.file.close()
from cas.reports import get_all_annotations
from cas.file_utils import read_cas_from_anndata
cas = read_cas_from_anndata('./merged_CS20230722_CLAS_29.h5ad')
cas.get_all_annotations(labels = [('subclass', '314 CB Granule Glut')])

Unnamed: 0,labelset,cell_label,cell_set_accession,cell_fullname,cell_ontology_term_id,cell_ontology_term,rationale,rationale_dois,marker_gene_evidence,synonyms,...,author_annotation_fields.merfish.markers.combo,author_annotation_fields.CTX.size,author_annotation_fields.subclass.tf.markers.combo,author_annotation_fields.nt_type_label,author_annotation_fields.nt_type_combo_label,author_annotation_fields.CTX.cluster_id,author_annotation_fields.CTX.neighborhood_id,author_annotation_fields.F,author_annotation_fields.CTX.neighborhood_label,author_annotation_fields.M
2,subclass,314 CB Granule Glut,CS20230722_SUBC_314,,CL:0001031,cerebellar granule cell,,,,,...,,,"Pax6,Neurod2,Etv1",Glut,,,,,,


In [5]:
merged_anndata.file.close()

## Export CAS content to CAP AnnData format

We use the command line interface to do this. This generates a [CAP AnnData format](https://github.com/cellannotation/cell-annotation-schema/blob/main/docs/cap_anndata_schema.md) h5ad file, using information from the CAS JSON stored in `uns`. While doing this, it runs a series of checks to ensure that the CAS representation is in sync with `obs`. It also retains the CAS JSON in the header and stores a hash of the sorted cell IDs for each cell set, which allows any later change in cell set membership to be detected.
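
As a rough illustration of what such consistency checks involve, the sketch below tests two of the conditions on toy data: every labelset has a column in `obs`, and every annotated label occurs in its column. It is not the cas-tools implementation. The function name `find_inconsistencies` and the simplified list-of-dicts representation of `obs` are assumptions made for this example, and the parent-child check is omitted.

```python
# Hypothetical, simplified stand-ins for CAS annotations and obs rows.
annotations = [
    {"labelset": "class", "cell_label": "29 CB Glut"},
    {"labelset": "subclass", "cell_label": "314 CB Granule Glut"},
    {"labelset": "subclass", "cell_label": "315 DCO UBC Glut"},
]
obs_rows = [
    {"class": "29 CB Glut", "subclass": "314 CB Granule Glut"},
    {"class": "29 CB Glut", "subclass": "315 DCO UBC Glut"},
]


def find_inconsistencies(annotations, obs_rows):
    """Return a list of problems; an empty list means CAS and obs agree."""
    problems = []
    # Collect every column name that appears in the obs rows.
    obs_columns = set().union(*(row.keys() for row in obs_rows))

    # Check 1: each labelset must have a matching obs column.
    for labelset in sorted({a["labelset"] for a in annotations}):
        if labelset not in obs_columns:
            problems.append(f"labelset '{labelset}' is missing from obs")

    # Check 2: each annotated label must occur in its obs column.
    for ann in annotations:
        labelset, label = ann["labelset"], ann["cell_label"]
        if labelset in obs_columns and not any(
            row.get(labelset) == label for row in obs_rows
        ):
            problems.append(f"label '{label}' not found in obs column '{labelset}'")
    return problems


print(find_inconsistencies(annotations, obs_rows))
```

An empty list corresponds to the all-clear messages that the real command logs when the checks pass.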

In [7]:
# Note that all command line tools come with built-in help.  

!cas export2cap --help

usage: cas export2cap [-h] [--json JSON] [--anndata ANNDATA] [--output OUTPUT]
                      [--fill-na]

Flattens all content of CAS annotations to an AnnData file.

options:
  -h, --help         show this help message and exit
  --json JSON        Optional input JSON file path. If not provided, the CAS
                     JSON will be extracted from the AnnData file's 'uns'
                     section.
  --anndata ANNDATA  Optional input AnnData file path. If not provided, the
                     AnnData file will be downloaded using the matrix file id
                     from the CAS JSON.
  --output OUTPUT    Output AnnData file name.
  --fill-na          Boolean flag indicating whether to fill missing values in
                     the 'obs' field with pd.NA. If provided, missing values
                     will be replaced with pd.NA; if not provided, they will
                     remain as empty strings.


In [9]:
# Run export2cap using the CAS JSON stored in the file header.
# Note the results of the checks, which are logged to STDERR.
!cas export2cap --anndata merged_CS20230722_CLAS_29.h5ad --output flatten_cas_CS20230722_CLAS_29.h5ad

INFO:root:All labelsets exist in obs.
INFO:root:All labelset members exist in the corresponding obs columns.
INFO:root:Parent-child relationships are consistent between CAS and OBS.
INFO:root:All labelsets exist in obs.
INFO:root:All labelset members exist in the corresponding obs columns.
INFO:root:Parent-child relationships are consistent between CAS and OBS.


In [10]:
# Inspecting the flattened dataframe, we can see CAS content
flatten_df = ad.read_h5ad("./flatten_cas_CS20230722_CLAS_29.h5ad", backed="r")
flatten_df.obs.iloc[0]

cell_barcode                                                                       AAACCCAAGAACAAGG
library_label                                                                    L8TX_201217_01_G07
tissue                                                                                   Cerebellum
tissue_ontology_term_id                                                              UBERON:0002037
neurotransmitter                                                                               Glut
organism                                                                               Mus musculus
disease                                                                                      normal
assay                                                                                     10x 3' v2
class                                                                                    29 CB Glut
class--cell_set_accession                                                        CS20230722_CLAS_29
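
The `class--cell_set_accession` column above suggests a `<labelset>--<field>` naming pattern for flattened CAS fields. The sketch below shows that mapping for a single annotation record. The helper `flatten_annotation` is hypothetical, and skipping nested fields is a simplification made here; see the CAP AnnData schema for the authoritative rules.

```python
def flatten_annotation(annotation, separator="--"):
    """Map one CAS-style annotation record to {obs_column_name: value} pairs."""
    labelset = annotation["labelset"]
    # The label itself goes in a column named after the labelset.
    columns = {labelset: annotation["cell_label"]}
    for field, value in annotation.items():
        if field in ("labelset", "cell_label"):
            continue
        if isinstance(value, dict):
            # Nested fields are skipped in this simplified sketch.
            continue
        columns[f"{labelset}{separator}{field}"] = value
    return columns


record = {
    "labelset": "class",
    "cell_label": "29 CB Glut",
    "cell_set_accession": "CS20230722_CLAS_29",
}
flat = flatten_annotation(record)
print(flat)
```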


In [9]:
# CAS JSON is retained in the header - note the addition of a cellhash, 
# used to test for any changes to the membership of the annotated cell set.
json.loads(flatten_df.uns['cas'])['annotations'][1]

{'labelset': 'subclass',
 'cell_label': '314 CB Granule Glut',
 'cell_set_accession': 'CS20230722_SUBC_314',
 'cell_ontology_term_id': 'CL:0001031',
 'cell_ontology_term': 'cerebellar granule cell',
 'parent_cell_set_accession': 'CS20230722_CLAS_29',
 'author_annotation_fields': {'neighborhood': 'NN-IMN-GC',
  'subclass.tf.markers.combo': 'Pax6,Neurod2,Etv1',
  'subclass.markers.combo': 'Gabra6,Ror1',
  'supertype.markers.combo _within subclass_': 'None',
  'supertype.markers.combo': 'None',
  'anatomical_annotation': 'None',
  'merfish.markers.combo': 'None',
  'cluster.TF.markers.combo': 'None',
  'cluster.markers.combo _within subclass_': 'None',
  'cluster.markers.combo': 'None',
  'cellhash': 'subclass:1961bf7b20'}}
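
To illustrate why hashing the sorted cell IDs detects membership changes, here is a minimal sketch. The choice of SHA-256, the newline separator, and the 10-character truncation are assumptions made for this example, so it is not expected to reproduce the `cellhash` values shown above.

```python
import hashlib


def cell_set_hash(labelset, cell_ids, length=10):
    """Return '<labelset>:<short digest>' for a set of cell IDs.

    Sorting first makes the result independent of the order in which
    the IDs are supplied, so only a change in membership alters it.
    """
    joined = "\n".join(sorted(cell_ids))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return f"{labelset}:{digest[:length]}"


# Hypothetical cell IDs, used only for illustration.
original = cell_set_hash("subclass", ["cell_b", "cell_a", "cell_c"])
reordered = cell_set_hash("subclass", ["cell_c", "cell_b", "cell_a"])
changed = cell_set_hash("subclass", ["cell_a", "cell_b"])

print(original == reordered)  # same members, different order
print(original == changed)    # one member removed
```

Because the label text is not part of the hashed input, renaming a cell set leaves the hash untouched, while adding or removing a cell changes it.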

## Edit annotations in the CAP h5ad file

This is meant to mimic edits made on [CAP](https://celltype.info/), including changes to names and to any annotation metadata. If cell set membership changes, an error is raised.
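
The distinction between a rename and a membership change can be made concrete by comparing how cells are grouped before and after an edit, ignoring the label text. The sketch below does this on toy data. The helper names and cell IDs are hypothetical, and this is not how cas-tools performs the check.

```python
def partition(labels_by_cell):
    """Group cell IDs by label and return the groups without their names."""
    groups = {}
    for cell_id, label in labels_by_cell.items():
        groups.setdefault(label, set()).add(cell_id)
    return {frozenset(members) for members in groups.values()}


def membership_unchanged(before, after):
    """True if both labelings split the cells into the same groups."""
    return partition(before) == partition(after)


before = {
    "cell_1": "314 CB Granule Glut",
    "cell_2": "314 CB Granule Glut",
    "cell_3": "315 DCO UBC Glut",
}
# A pure rename: every cell stays in the same group.
renamed = {
    cell: label.replace("314 CB Granule Glut", "Downgraded CB Granule Glut")
    for cell, label in before.items()
}
# A membership change: cell_2 moves to the other group.
moved = dict(before, cell_2="315 DCO UBC Glut")

print(membership_unchanged(before, renamed))
print(membership_unchanged(before, moved))
```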

In [21]:
flatten_df.obs["subclass"] = flatten_df.obs["subclass"].replace("315 DCO UBC Glut", "Upgraded DCO UBC Glut")
flatten_df.obs["subclass"] = flatten_df.obs["subclass"].replace("314 CB Granule Glut", "Downgraded CB Granule Glut")
flatten_df.obs[["subclass","class"]].drop_duplicates()

Unnamed: 0_level_0,subclass,class
cell_label,Unnamed: 1_level_1,Unnamed: 2_level_1
AAACCCAAGAACAAGG-472_A05,Downgraded CB Granule Glut,29 CB Glut
AAACCCAAGTCGCCAC-231.2_A01,Upgraded DCO UBC Glut,29 CB Glut


In [12]:
flatten_df.write("edited_flatten_cas_CS20230722_CLAS_29.h5ad", compression="gzip")

In [13]:
flatten_df.file.close()

## Unflatten

In [16]:
!cas unflatten --help

usage: cas unflatten [-h] --anndata ANNDATA [--json JSON]
                     [--output_anndata OUTPUT_ANNDATA]
                     [--output_json OUTPUT_JSON]

Unflattens all content of a flattened AnnData file to a CAS JSON file. Also
creates an unflattened AnnData file.

options:
  -h, --help            show this help message and exit
  --anndata ANNDATA     Path to the input AnnData file that contains flattened
                        data.
  --json JSON           Optional path to the CAS JSON file. If provided, the
                        'annotations' within the file will be updated. If not
                        provided, a new CAS JSON file will be created.
  --output_anndata OUTPUT_ANNDATA
                        Optional output AnnData file name. If not provided,
                        'unflattened.h5ad' will be used as default name.
  --output_json OUTPUT_JSON
                        Optional output CAS JSON file name. If not provided,
                        'cas.json' 

In [17]:
!cas unflatten --anndata edited_flatten_cas_CS20230722_CLAS_29.h5ad --output_anndata edited_unflatten_cas_CS20230722_CLAS_29.h5ad



In [None]:
# load and inspect unflattened file.

In [19]:
from cas.reports import get_all_annotations
from cas.file_utils import read_cas_from_anndata
cas = read_cas_from_anndata('./edited_unflatten_cas_CS20230722_CLAS_29.h5ad')
cas.get_all_annotations()

Unnamed: 0,labelset,cell_label,cell_set_accession,cell_fullname,cell_ontology_term_id,cell_ontology_term,rationale,rationale_dois,marker_gene_evidence,synonyms,...,author_annotation_fields.CTX.size,author_annotation_fields.subclass.tf.markers.combo,author_annotation_fields.nt_type_label,author_annotation_fields.nt_type_combo_label,author_annotation_fields.CTX.cluster_id,author_annotation_fields.CTX.neighborhood_id,author_annotation_fields.F,author_annotation_fields.CTX.neighborhood_label,author_annotation_fields.M,author_annotation_fields.cellhash
0,neurotransmitter,Glut,CS20230722_NEUR_Glut,,CL:0000679,glutamatergic neuron,,,,,...,,,,,,,,,,
1,class,29 CB Glut,CS20230722_CLAS_29,,CL:0000540,neuron,,,,,...,,,,,,,,,,class:33ff68cfc4
2,subclass,314 CB Granule Glut,CS20230722_SUBC_314,,CL:0001031,cerebellar granule cell,,,,,...,,"Pax6,Neurod2,Etv1",Glut,,,,,,,subclass:1961bf7b20
3,subclass,315 DCO UBC Glut,CS20230722_SUBC_315,,CL:4023161,unipolar brush cell,,,,,...,,"Eomes,Lmx1a,Klf3",Glut,,,,,,,subclass:1d857515df
4,supertype,1154 CB Granule Glut_1,CS20230722_SUPT_1154,,,,,,,,...,,,,,,,,,,supertype:100fb2c5ce
5,supertype,1155 CB Granule Glut_2,CS20230722_SUPT_1155,,,,,,,,...,,,,,,,,,,supertype:e8d4a48a95
6,supertype,1156 DCO UBC Glut_1,CS20230722_SUPT_1156,,,,,,,,...,,,,,,,,,,supertype:1d857515df
7,cluster,5197 CB Granule Glut_1,CS20230722_CLUS_5197,,,,,,,,...,,,Glut,Glut,,,0.5,,0.5,cluster:34ef02d6a9
8,cluster,5198 CB Granule Glut_1,CS20230722_CLUS_5198,,,,,,,,...,,,Glut,Glut,,,0.45,,0.55,cluster:0e563742cf
9,cluster,5199 CB Granule Glut_1,CS20230722_CLUS_5199,,,,,,,,...,,,Glut,Glut,,,0.44,,0.56,cluster:b98115a4ba
