### Converting an Allen-Style Taxonomy Spreadsheet to CAS Format  

This notebook demonstrates how to convert an Allen-style taxonomy spreadsheet into the CAS (Cell Annotation Schema) format, how to merge it into an h5ad file, and how to export it to the CAP AnnData format. The process includes the following steps:  

1. **Generate CAS**: Convert the spreadsheet data into the CAS format.
2. **Validate CAS against h5ad**: Compare CAS and AnnData to check if the annotations and the hierarchy are the same.
3. **Populate IDs**: Add corresponding Cell IDs to the CAS from a related AnnData (h5ad) file.  
4. **Merge Data**: Integrate the updated CAS into the `uns` field of the AnnData object.
5. **Export to CAP**: Flatten the CAS annotations into `obs`, following the CAP AnnData schema.

This workflow uses a subset of the WMBO dataset. The example Allen-style spreadsheet is derived from the `CB Glut (CS20230722_CLAS_29)` subset of the Mouse Cell-type annotations (https://www.nature.com/articles/s41586-023-06812-z#Sec49, Table 7).

This notebook demonstrates how to carry out these steps in Python, but most can also be done via the [cas-tools command-line interface](https://cellannotation.github.io/cas-tools/cli.html).

#### Installing Required Packages

In [1]:
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install anndata
!{sys.executable} -m pip install --upgrade cas-tools


#### Example of an Allen-Style Spreadsheet  

To keep things simple, we use the `CB Glut (CS20230722_CLAS_29)` subset from the Mouse Cell-Type Annotations dataset (refer to [Nature article, Supplementary Table 7](https://www.nature.com/articles/s41586-023-06812-z#Sec49)).

In [1]:
import pandas as pd

pd.read_csv("./data/wmb_class_29_annotation.tsv", delimiter='\t')[:4]

Unnamed: 0,cluster_id,cluster,supertype,subclass,class,neighborhood,anatomical_annotation,notes,CCF_broad.freq,CCF_acronym.freq,...,CTX.subclass_id,CTX.subclass_id.1,CTX.neighborhood_id,CTX.neighborhood_label,CTX.size,taxonomy_id,cell_set_accession.cluster,cell_set_accession.supertype,cell_set_accession.subclass,cell_set_accession.class
0,5197,5197 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"MY:0.35,CB:0.33,NA:0.26","DCO:0.19,FL:0.11,arb:0.09,VCO:0.09,DN:0.08,PFL...",...,,,,,,CCN202307220,CS20230722_CLUS_5197,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
1,5198,5198 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"CB:0.56,MY:0.24,NA:0.19","FL:0.25,PFL:0.17,DCO:0.13,VCO:0.09,arb:0.08,NA...",...,,,,,,CCN202307220,CS20230722_CLUS_5198,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
2,5199,5199 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"MY:0.4,CB:0.38,NA:0.18","DCO:0.29,DN:0.12,arb:0.09,FL:0.09,PFL:0.06,VCO...",...,,,,,,CCN202307220,CS20230722_CLUS_5199,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
3,5200,5200 CB Granule Glut_2,1155 CB Granule Glut_2,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,NOD PFL,,"CB:0.77,NA:0.22","PFL:0.17,NOD:0.17,FL:0.13,NA:0.11,arb:0.1,UVU:...",...,,,,,,CCN202307220,CS20230722_CLUS_5200,CS20230722_SUPT_1155,CS20230722_SUBC_314,CS20230722_CLAS_29


#### Import Spreadsheet into CAS  

Import assumes that the source spreadsheet will have:

* One row per cluster (where cluster is the most granular cell set)
* A column containing cluster_ids
* A spreadsheet representation of the hierarchy, with one column for each level in the hierarchy.
* A set of columns specific to the sheet (author categories)
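These assumptions can be checked before ingestion. The sketch below is not part of cas-tools: `check_taxonomy_rows` is a hypothetical helper, and it uses only the standard `csv` module on a small inline sample that borrows column names and labels from the example sheet. It reports duplicate cluster IDs and any label that appears under more than one parent.

```python
import csv
import io
from collections import defaultdict


def check_taxonomy_rows(rows, cluster_id_col, hierarchy_cols):
    """Return a list of problems found in spreadsheet rows.

    hierarchy_cols is ordered from most granular to least granular,
    e.g. ["cluster", "supertype", "subclass", "class"].
    """
    problems = []

    # One row per cluster: cluster IDs must be unique.
    seen = set()
    for row in rows:
        cid = row[cluster_id_col]
        if cid in seen:
            problems.append(f"duplicate {cluster_id_col}: {cid}")
        seen.add(cid)

    # Each label must sit under exactly one parent label.
    for child_col, parent_col in zip(hierarchy_cols, hierarchy_cols[1:]):
        parents = defaultdict(set)
        for row in rows:
            parents[row[child_col]].add(row[parent_col])
        for child, parent_set in parents.items():
            if len(parent_set) > 1:
                problems.append(
                    f"{child_col} '{child}' has several {parent_col} parents: "
                    f"{sorted(parent_set)}"
                )
    return problems


# Small inline sample (tab-separated), using labels from the example sheet.
sample_tsv = (
    "cluster_id\tcluster\tsupertype\tsubclass\tclass\n"
    "5197\t5197 CB Granule Glut_1\t1154 CB Granule Glut_1\t314 CB Granule Glut\t29 CB Glut\n"
    "5198\t5198 CB Granule Glut_1\t1154 CB Granule Glut_1\t314 CB Granule Glut\t29 CB Glut\n"
    "5200\t5200 CB Granule Glut_2\t1155 CB Granule Glut_2\t314 CB Granule Glut\t29 CB Glut\n"
)
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
problems = check_taxonomy_rows(
    rows, "cluster_id", ["cluster", "supertype", "subclass", "class"]
)
print(problems or "no problems found")
```

To run the same check on the full sheet, the rows could be read from `./data/wmb_class_29_annotation.tsv` with `csv.DictReader` in the same way.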

To import an Allen-style spreadsheet into CAS, we need a simple mapping file. This includes some general metadata and a specification of the hierarchy fields. Rank refers to the level of a hierarchy field, with 0 being the most granular. The `cluster_id` and `cluster_name` columns require their own specific column types; all other hierarchy fields are typed as `cell_set`. Here is a configuration for our example sheet:

In [None]:
!less ./data/wmb_ingestion_config.yaml

taxonomy_id: CS20230722
brain_region_names:
  - Whole Brain
species_names:
  - Mouse
author_name: Hongkui Zeng
title: Whole Mouse Brain taxonomy

fields:
  - column_name: cluster_id
    column_type: cluster_id
    rank: 0

  - column_name: cluster
    column_type: cluster_name
    rank: 0

  - column_name: supertype
    column_type: cell_set
    rank: 1

  - column_name: subclass
    column_type: cell_set

In [2]:
# Import to CAS

import json
from cas.ingest.ingest_user_table import ingest_user_data
from cas.file_utils import write_json_file


cas_data = ingest_user_data("./data/wmb_class_29_annotation.tsv", "./data/wmb_ingestion_config.yaml", True)
wmb_class_29_cas = cas_data.as_dictionary()
print(json.dumps(wmb_class_29_cas, indent=2)[:2000])
write_json_file(cas_data, 'wmb_class_29.json')  # saving CAS JSON file to disk
 

{
  "author_name": "Hongkui Zeng",
  "annotations": [
    {
      "labelset": "class",
      "cell_label": "29 CB Glut",
      "cell_set_accession": "5206"
    },
    {
      "labelset": "subclass",
      "cell_label": "314 CB Granule Glut",
      "cell_set_accession": "5207",
      "parent_cell_set_accession": "5206"
    },
    {
      "labelset": "supertype",
      "cell_label": "1154 CB Granule Glut_1",
      "cell_set_accession": "5208",
      "parent_cell_set_accession": "5207"
    },
    {
      "labelset": "cluster",
      "cell_label": "5197 CB Granule Glut_1",
      "cell_set_accession": "5197",
      "parent_cell_set_accession": "5208",
      "author_annotation_fields": {
        "neighborhood": "NN-IMN-GC",
        "anatomical_annotation": "DCO VCO",
        "CCF_broad.freq": "MY:0.35,CB:0.33,NA:0.26",
        "CCF_acronym.freq": "DCO:0.19,FL:0.11,arb:0.09,VCO:0.09,DN:0.08,PFL:0.07,MY:0.07,mcp:0.06,CUL4, 5:0.04",
        "v3.size": "14544",
        "v2.size": "0",
        "m

In [None]:
!less wmb_class_29.json

{
  "author_name": "Hongkui Zeng",
  "annotations": [
    {
      "labelset": "class",
      "cell_label": "29 CB Glut",
      "cell_set_accession": "5206"
    },
    {
      "labelset": "subclass",
      "cell_label": "314 CB Granule Glut",
      "cell_set_accession": "5207",
      "parent_cell_set_accession": "5206"
    },
    {
      "labelset": "supertype",
      "cell_label": "1154 CB Granule Glut_1",
      "cell_set_accession": "5208",
      "parent_cell_set_accession": "5207"
    },
    {
      "labelset": "cluster",
      "cell_label": "5197 CB Granule Glut_1",

In [3]:
# Alternatively, use the reporting tools in cas-tools to view the annotations as a DataFrame:

from cas import reports
reports.get_all_annotations(wmb_class_29_cas)

Unnamed: 0,labelset,cell_label,cell_set_accession,parent_cell_set_accession,author_annotation_fields.neighborhood,author_annotation_fields.anatomical_annotation,author_annotation_fields.CCF_broad.freq,author_annotation_fields.CCF_acronym.freq,author_annotation_fields.v3.size,author_annotation_fields.v2.size,...,author_annotation_fields.cluster.markers.combo,author_annotation_fields.merfish.markers.combo,author_annotation_fields.cluster.TF.markers.combo,author_annotation_fields.cluster.markers.combo (within subclass),author_annotation_fields.taxonomy_id,author_annotation_fields.cell_set_accession.cluster,author_annotation_fields.cell_set_accession.supertype,author_annotation_fields.cell_set_accession.subclass,author_annotation_fields.cell_set_accession.class,author_annotation_fields.np.markers
0,class,29 CB Glut,5206,,,,,,,,...,,,,,,,,,,
1,subclass,314 CB Granule Glut,5207,5206.0,,,,,,,...,,,,,,,,,,
2,supertype,1154 CB Granule Glut_1,5208,5207.0,,,,,,,...,,,,,,,,,,
3,cluster,5197 CB Granule Glut_1,5197,5208.0,NN-IMN-GC,DCO VCO,"MY:0.35,CB:0.33,NA:0.26","DCO:0.19,FL:0.11,arb:0.09,VCO:0.09,DN:0.08,PFL...",14544.0,0.0,...,"Gabra6,Lmx1a,Rnf182","Col27a1,Barhl1,St18,Trhde,Spon1,Syt6","Lmx1a,Zic1,St18","Lmx1a,Rnf182",CCN202307220,CS20230722_CLUS_5197,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29,
4,cluster,5198 CB Granule Glut_1,5198,5208.0,NN-IMN-GC,DCO VCO,"CB:0.56,MY:0.24,NA:0.19","FL:0.25,PFL:0.17,DCO:0.13,VCO:0.09,arb:0.08,NA...",2196.0,0.0,...,"Gabra6,Cntn5","Svep1,Slc17a7,Chrm2","Pax6,Neurod2,Etv1,Bcl11b",Cntn5,CCN202307220,CS20230722_CLUS_5198,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29,
5,cluster,5199 CB Granule Glut_1,5199,5208.0,NN-IMN-GC,DCO VCO,"MY:0.4,CB:0.38,NA:0.18","DCO:0.29,DN:0.12,arb:0.09,FL:0.09,PFL:0.06,VCO...",909.0,0.0,...,"Cbln3,Tmem132d","Eomes,Col27a1,Calb2","Eomes,Lmx1a,Nr2f2,Lin28b",Rgs6,CCN202307220,CS20230722_CLUS_5199,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29,
6,supertype,1155 CB Granule Glut_2,5209,5207.0,,,,,,,...,,,,,,,,,,
7,cluster,5200 CB Granule Glut_2,5200,5209.0,NN-IMN-GC,NOD PFL,"CB:0.77,NA:0.22","PFL:0.17,NOD:0.17,FL:0.13,NA:0.11,arb:0.1,UVU:...",220.0,0.0,...,"Gabra6,Gap43,Il1rap","Eomes,Svep1,Ntng1,Medag","Eomes,St18,En2,Nr4a2","Gap43,Kcnq5",CCN202307220,CS20230722_CLUS_5200,CS20230722_SUPT_1155,CS20230722_SUBC_314,CS20230722_CLAS_29,
8,cluster,5201 CB Granule Glut_2,5201,5209.0,NN-IMN-GC,CBX,"CB:0.74,NA:0.26","NA:0.17,CUL4, 5:0.14,SIM:0.11,ANcr1:0.09,arb:0...",115909.0,0.0,...,"Gabra6,Daam2,Calb2","Svep1,Slc17a7,Pappa,Barhl2","Pax6,En2,Etv1,Barhl2,Uncx,Maf",Clstn2,CCN202307220,CS20230722_CLUS_5201,CS20230722_SUPT_1155,CS20230722_SUBC_314,CS20230722_CLAS_29,
9,subclass,315 DCO UBC Glut,5210,5206.0,,,,,,,...,,,,,,,,,,
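The `parent_cell_set_accession` values shown above are what encode the hierarchy in CAS. As a minimal sketch (the helper `lineage` is hypothetical, not a cas-tools function), the chain of parents can be followed from any cell set up to the top-level class. The annotations here are copied from the output above, with other fields omitted.

```python
def lineage(annotations, accession):
    """Return cell labels from the given cell set up to the root."""
    by_accession = {a["cell_set_accession"]: a for a in annotations}
    labels = []
    seen = set()
    current = accession
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at accession {current}")
        seen.add(current)
        annotation = by_accession[current]
        labels.append(annotation["cell_label"])
        # Top-level annotations have no parent, which ends the walk.
        current = annotation.get("parent_cell_set_accession")
    return labels


# A few annotations copied from the CAS output above (extra fields omitted).
annotations = [
    {"labelset": "class", "cell_label": "29 CB Glut", "cell_set_accession": "5206"},
    {"labelset": "subclass", "cell_label": "314 CB Granule Glut",
     "cell_set_accession": "5207", "parent_cell_set_accession": "5206"},
    {"labelset": "supertype", "cell_label": "1154 CB Granule Glut_1",
     "cell_set_accession": "5208", "parent_cell_set_accession": "5207"},
    {"labelset": "cluster", "cell_label": "5197 CB Granule Glut_1",
     "cell_set_accession": "5197", "parent_cell_set_accession": "5208"},
]
print(" -> ".join(lineage(annotations, "5197")))
# prints: 5197 CB Granule Glut_1 -> 1154 CB Granule Glut_1 -> 314 CB Granule Glut -> 29 CB Glut
```

In the notebook, `wmb_class_29_cas["annotations"]` could be passed in place of the inline list, assuming it has the structure printed earlier.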


#### Retrieve AnnData File for `CB Glut`

The assumed starting point is an AnnData file whose annotations correspond to the hierarchy in the spreadsheet.

Download the AnnData file corresponding to `CB Glut (CS20230722_CLAS_29)`. The original [WMB-10Xv2](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/WMB-10Xv2/20230630/) and [WMB-10Xv3](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/WMB-10Xv3/20230630/) AnnData files were generated based on dissection.  

These files were merged and then split into 34 top-level classes. The resulting AnnData files for each class can be accessed at the following links:  
- [Class 01](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_01.h5ad)  
- [Class 02](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_02.h5ad)  
- ...  
- [Class 34](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_34.h5ad)  

The file specific to `CB Glut` is included in this collection.
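The links above follow a common naming pattern, so a download URL can be built from the class number. This is a small sketch: `class_h5ad_url` is a hypothetical helper, and it assumes that the classes not listed explicitly (03 to 33) use the same pattern as the three links shown.

```python
BASE_URL = "http://cellular-semantics.cog.sanger.ac.uk/public"


def class_h5ad_url(class_number):
    """Build the download URL for one of the 34 per-class AnnData files."""
    if not 1 <= class_number <= 34:
        raise ValueError("class_number must be between 1 and 34")
    # Class numbers are zero-padded to two digits, e.g. CLAS_01, CLAS_29.
    return f"{BASE_URL}/merged_CS20230722_CLAS_{class_number:02d}.h5ad"


print(class_h5ad_url(29))
```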

In [10]:
#!rm merged_CS20230722_CLAS_29.h5ad
!wget -N http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad

--2025-02-11 15:55:53--  http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad
Resolving cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)... 193.62.203.62, 193.62.203.61, 193.62.203.63
Connecting to cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)|193.62.203.62|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad [following]
--2025-02-11 15:55:53--  https://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad
Connecting to cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)|193.62.203.62|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2585489225 (2.4G) [binary/octet-stream]
Saving to: ‘merged_CS20230722_CLAS_29.h5ad’


2025-02-11 16:00:32 (8.83 MB/s) - ‘merged_CS20230722_CLAS_29.h5ad’ saved [2585489225/2585489225]



In [11]:
from cas.file_utils import read_anndata_file
CLAS_29_ad = read_anndata_file("./merged_CS20230722_CLAS_29.h5ad")
CLAS_29_ad.obs[:3]

cell_label,cell_barcode,library_label,tissue,tissue_ontology_term_id,neurotransmitter,class,subclass,supertype,cluster,organism,disease,assay
AAACCCAAGAACAAGG-472_A05,AAACCCAAGAACAAGG,L8TX_201217_01_G07,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v2
AAACCCAAGAATCCCT-473_A06,AAACCCAAGAATCCCT,L8TX_201217_01_A08,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v3
AAACCCAAGACTACCT-225_A01,AAACCCAAGACTACCT,L8TX_200227_01_F10,Medulla,UBERON:0001896,Glut,29 CB Glut,314 CB Granule Glut,1154 CB Granule Glut_1,5197 CB Granule Glut_1,Mus musculus,normal,10x 3' v2


#### Analyse CAS and AnnData


#### Populate CAS with Cell IDs  

**Check that the hierarchies match between the h5ad and CAS and, if they do, add Cell IDs to CAS**

* *Checks*: Does the hierarchy of nested cell sets in the AnnData file match the hierarchy in the CAS file (as derived from the Allen spreadsheet)? If two cell sets are identical but one has a higher rank than the other, this is also treated as a valid hierarchy.
* *Warnings*: If the hierarchies do not match, report the differences.
* *Update IDs*: If the hierarchies match, add cell IDs to the most granular level of the hierarchy in CAS. (Cell IDs for other levels can be derived from the hierarchy.)
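To make the check concrete, here is a simplified sketch of comparing the two hierarchies. It is not the cas-tools implementation: all three helper names are hypothetical, and it does not handle the special case of identical cell sets at different ranks. Each hierarchy is reduced to a set of (labelset, child label, parent label) triples, and the triples found on only one side are reported.

```python
def cas_parent_pairs(annotations):
    """(labelset, child label, parent label) triples from CAS annotations."""
    by_accession = {a["cell_set_accession"]: a for a in annotations}
    pairs = set()
    for a in annotations:
        parent_id = a.get("parent_cell_set_accession")
        if parent_id is not None:
            parent_label = by_accession[parent_id]["cell_label"]
            pairs.add((a["labelset"], a["cell_label"], parent_label))
    return pairs


def obs_parent_pairs(obs_rows, labelsets):
    """Same triples from per-cell rows; labelsets run from most to least granular."""
    pairs = set()
    for row in obs_rows:
        for child, parent in zip(labelsets, labelsets[1:]):
            pairs.add((child, row[child], row[parent]))
    return pairs


def hierarchy_differences(cas_pairs, obs_pairs):
    """Triples that appear in only one of the two hierarchies."""
    return {
        "only_in_cas": sorted(cas_pairs - obs_pairs),
        "only_in_anndata": sorted(obs_pairs - cas_pairs),
    }


# Annotations and one obs row taken from the outputs shown in this notebook.
annotations = [
    {"labelset": "class", "cell_label": "29 CB Glut", "cell_set_accession": "5206"},
    {"labelset": "subclass", "cell_label": "314 CB Granule Glut",
     "cell_set_accession": "5207", "parent_cell_set_accession": "5206"},
    {"labelset": "supertype", "cell_label": "1154 CB Granule Glut_1",
     "cell_set_accession": "5208", "parent_cell_set_accession": "5207"},
    {"labelset": "cluster", "cell_label": "5197 CB Granule Glut_1",
     "cell_set_accession": "5197", "parent_cell_set_accession": "5208"},
]
obs_rows = [
    {"cluster": "5197 CB Granule Glut_1", "supertype": "1154 CB Granule Glut_1",
     "subclass": "314 CB Granule Glut", "class": "29 CB Glut"},
]
labelsets = ["cluster", "supertype", "subclass", "class"]

diff = hierarchy_differences(
    cas_parent_pairs(annotations), obs_parent_pairs(obs_rows, labelsets)
)
print(diff)
```

Both lists are empty when the hierarchies agree. With a real AnnData object, the per-cell rows could be obtained with pandas, for example via `obs.to_dict("records")`.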


In [23]:
from cas.populate_cell_ids import add_cell_ids

# get list of labelset names sorted by rank
labelset_names = [labelset.name for labelset in sorted(cas_data.labelsets, key=lambda x: x.rank)]
cas = add_cell_ids(wmb_class_29_cas, CLAS_29_ad.obs, labelset_names)
print(json.dumps(cas["annotations"][5], indent=2)[:1500])
with open('CLAS_29_ad.json', 'w') as f:
    f.write(json.dumps(cas))
CLAS_29_ad.file.close()

{
  "labelset": "cluster",
  "cell_label": "5199 CB Granule Glut_1",
  "cell_set_accession": "5199",
  "parent_cell_set_accession": "5208",
  "author_annotation_fields": {
    "neighborhood": "NN-IMN-GC",
    "anatomical_annotation": "DCO VCO",
    "CCF_broad.freq": "MY:0.4,CB:0.38,NA:0.18",
    "CCF_acronym.freq": "DCO:0.29,DN:0.12,arb:0.09,FL:0.09,PFL:0.06,VCO:0.05,MY:0.05,CUL4, 5:0.05,P:0.04",
    "v3.size": "909",
    "v2.size": "0",
    "multiome.size": "0",
    "F": "0.44",
    "M": "0.56",
    "Dark": "0.03",
    "Light": "0.97",
    "nt_type_label": "Glut",
    "nt.markers": "Slc17a7:8.33,Slc17a6:3.37",
    "nt_type_combo_label": "Glut",
    "cluster.markers.combo": "Cbln3,Tmem132d",
    "merfish.markers.combo": "Eomes,Col27a1,Calb2",
    "cluster.TF.markers.combo": "Eomes,Lmx1a,Nr2f2,Lin28b",
    "cluster.markers.combo (within subclass)": "Rgs6",
    "taxonomy_id": "CCN202307220",
    "cell_set_accession.cluster": "CS20230722_CLUS_5199",
    "cell_set_accession.supertype": "CS

#### Merge CAS into AnnData

This step adds the CAS to the `uns` field of the AnnData file.

Note that further checks are carried out here to assess whether cell set membership matches between the CAS file and the AnnData file.  In this case they do match, but this also protects against files being out of sync.
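The idea behind a membership check can be sketched as follows. This illustrates the concept only and is not the validation code in cas-tools: the helper names are hypothetical, and the `cell_ids` key on the most granular annotations is an assumption about the CAS structure. Cell IDs are rolled up from clusters to their ancestors and compared with the membership implied by the per-cell labels in `obs`.

```python
from collections import defaultdict


def cas_membership(annotations):
    """Map (labelset, cell_label) to the set of cell IDs in that cell set.

    Cell IDs stored on the most granular annotations are rolled up to
    their ancestors via parent_cell_set_accession.
    """
    by_accession = {a["cell_set_accession"]: a for a in annotations}
    members = defaultdict(set)
    for a in annotations:
        ids = set(a.get("cell_ids", []))  # assumed key for cell IDs
        current = a
        while current is not None and ids:
            members[(current["labelset"], current["cell_label"])] |= ids
            parent = current.get("parent_cell_set_accession")
            current = by_accession.get(parent) if parent is not None else None
    return dict(members)


def obs_membership(obs_by_cell, labelsets):
    """Same mapping built from per-cell labels ({cell_id: {labelset: label}})."""
    members = defaultdict(set)
    for cell_id, labels in obs_by_cell.items():
        for labelset in labelsets:
            members[(labelset, labels[labelset])].add(cell_id)
    return dict(members)


# Accessions, labels and cell IDs taken from the outputs shown in this notebook.
annotations = [
    {"labelset": "class", "cell_label": "29 CB Glut", "cell_set_accession": "5206"},
    {"labelset": "subclass", "cell_label": "314 CB Granule Glut",
     "cell_set_accession": "5207", "parent_cell_set_accession": "5206"},
    {"labelset": "supertype", "cell_label": "1154 CB Granule Glut_1",
     "cell_set_accession": "5208", "parent_cell_set_accession": "5207"},
    {"labelset": "supertype", "cell_label": "1155 CB Granule Glut_2",
     "cell_set_accession": "5209", "parent_cell_set_accession": "5207"},
    {"labelset": "cluster", "cell_label": "5197 CB Granule Glut_1",
     "cell_set_accession": "5197", "parent_cell_set_accession": "5208",
     "cell_ids": ["AAACCCAAGACTACCT-225_A01"]},
    {"labelset": "cluster", "cell_label": "5201 CB Granule Glut_2",
     "cell_set_accession": "5201", "parent_cell_set_accession": "5209",
     "cell_ids": ["AAACCCAAGAACAAGG-472_A05", "AAACCCAAGAATCCCT-473_A06"]},
]
glut_2 = {"cluster": "5201 CB Granule Glut_2", "supertype": "1155 CB Granule Glut_2",
          "subclass": "314 CB Granule Glut", "class": "29 CB Glut"}
glut_1 = {"cluster": "5197 CB Granule Glut_1", "supertype": "1154 CB Granule Glut_1",
          "subclass": "314 CB Granule Glut", "class": "29 CB Glut"}
obs_by_cell = {
    "AAACCCAAGAACAAGG-472_A05": glut_2,
    "AAACCCAAGAATCCCT-473_A06": glut_2,
    "AAACCCAAGACTACCT-225_A01": glut_1,
}

from_cas = cas_membership(annotations)
from_obs = obs_membership(obs_by_cell, ["cluster", "supertype", "subclass", "class"])
print("membership matches:", from_cas == from_obs)
```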

In [24]:
from cas.anndata_conversion import merge_cas_object
import anndata as ad

merge_cas_object(input_json=wmb_class_29_cas, 
                 anndata_file_path="./merged_CS20230722_CLAS_29.h5ad", 
                 validate=True, 
                 output_file_path ="./merged_CS20230722_CLAS_29_v2.h5ad")

Error during read anndata file at path: ./merged_CS20230722_CLAS_29_v2.h5ad, error = Unable to open file (file is already open for read-only)!


OSError: Unable to open file (file is already open for read-only)

In [25]:
from cas.file_utils import read_cas_from_anndata

cas_read = read_cas_from_anndata("./merged_CS20230722_CLAS_29_v2.h5ad")
cas_read.get_all_annotations().head(5)

Unnamed: 0,labelset,cell_label,cell_set_accession,cell_fullname,cell_ontology_term_id,cell_ontology_term,rationale,rationale_dois,marker_gene_evidence,synonyms,...,author_annotation_fields.neighborhood,author_annotation_fields.subclass.tf.markers.combo,author_annotation_fields.subclass.markers.combo,author_annotation_fields.supertype.markers.combo _within subclass_,author_annotation_fields.supertype.markers.combo,author_annotation_fields.anatomical_annotation,author_annotation_fields.merfish.markers.combo,author_annotation_fields.cluster.TF.markers.combo,author_annotation_fields.cluster.markers.combo _within subclass_,author_annotation_fields.cluster.markers.combo
0,class,29 CB Glut,CS20230722_CLAS_29,,,,,,,,...,NN-IMN-GC,,,,,,,,,
1,subclass,314 CB Granule Glut,CS20230722_SUBC_314,,CL:0001031,cerebellar granule cell,,,,,...,NN-IMN-GC,"Pax6,Neurod2,Etv1","Gabra6,Ror1",,,,,,,
2,subclass,315 DCO UBC Glut,CS20230722_SUBC_315,,CL:4023161,unipolar brush cell,,,,,...,NN-IMN-GC,"Eomes,Lmx1a,Klf3","Sln,Lmx1a",,,,,,,
3,supertype,1154 CB Granule Glut_1,CS20230722_SUPT_1154,,,,,,,,...,,,,Cntn3,"Gabra6,Lmx1a",,,,,
4,supertype,1155 CB Granule Glut_2,CS20230722_SUPT_1155,,,,,,,,...,,,,Gap43,"Gabra6,Gap43,Rab37",,,,,


#### Export to CAP

This step flattens the annotations into `obs`, following the spec in the [cap_anndata_schema](https://github.com/cellannotation/cell-annotation-schema/blob/main/docs/cap_anndata_schema.md). It also stores the CAS in the header and saves checksums of cell IDs for use in validation.
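The checksum scheme used by the export is not described here, so the following only illustrates the general idea under stated assumptions: an order-independent SHA-256 digest over a set of cell IDs, computed by a hypothetical helper `cell_ids_checksum`. Two cell sets with the same members give the same digest regardless of ordering, which makes it cheap to detect when a CAS file and an AnnData file have drifted apart.

```python
import hashlib


def cell_ids_checksum(cell_ids):
    """Order-independent SHA-256 digest of a collection of cell IDs."""
    digest = hashlib.sha256()
    # Sorting and de-duplicating makes the result independent of input order.
    for cell_id in sorted(set(cell_ids)):
        digest.update(cell_id.encode("utf-8"))
        digest.update(b"\n")  # separator so "ab" + "c" differs from "a" + "bc"
    return digest.hexdigest()


ids_a = ["AAACCCAAGAACAAGG-472_A05", "AAACCCAAGAATCCCT-473_A06"]
ids_b = list(reversed(ids_a))
print(cell_ids_checksum(ids_a) == cell_ids_checksum(ids_b))
```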


In [22]:
!cas export2CAP --json wmb_class_29.json --anndata ./merged_CS20230722_CLAS_29_v2.h5ad --output ./flattened_CS20230722_CLAS_29.h5ad

Traceback (most recent call last):
  File "/Users/do12/Documents/GitHub/cas-tools/venv/bin/cas", line 8, in <module>
    sys.exit(main())
  File "/Users/do12/Documents/GitHub/cas-tools/venv/lib/python3.9/site-packages/cas/__main__.py", line 64, in main
    export2cap(json_file_path, anndata_file_path, output_file_path, fill_na)
  File "/Users/do12/Documents/GitHub/cas-tools/venv/lib/python3.9/site-packages/cas/flatten_data_to_anndata.py", line 76, in export2cap
    export_cas_object2cap(input_json, anndata_file_path, output_file_path, fill_na)
  File "/Users/do12/Documents/GitHub/cas-tools/venv/lib/python3.9/site-packages/cas/flatten_data_to_anndata.py", line 114, in export_cas_object2cap
    flatten_data = process_annotations(
  File "/Users/do12/Documents/GitHub/cas-tools/venv/lib/python3.9/site-packages/cas/flatten_data_to_anndata.py", line 153, in process_annotations
    else accession_manager.generate_accession_id(
  File "/Users/do12/Documents/GitHub/cas-tools/venv/lib/python3.9/

In [None]:
import anndata as ad

flattened_adata = ad.read_h5ad("flattened_CS20230722_CLAS_29.h5ad", backed='r')
flattened_adata.obs[:5]

In [2]:
!which cas

/Users/do12/Documents/GitHub/cas-tools/venv/bin/cas
