### Converting an Allen-Style Taxonomy Spreadsheet to CAS Format  

This notebook demonstrates how to convert an Allen-style taxonomy spreadsheet into the CAS (Cell Annotation Schema) format. The process includes the following steps:  

1. **Generate CAS**: Convert the spreadsheet data into the CAS format.
2. **Validate CAS against h5ad**: Compare CAS and AnnData to check if the annotations and the hierarchy are the same.
3. **Populate IDs**: Add the corresponding cell IDs to the CAS from a related AnnData (h5ad) file.  
4. **Merge Data**: Integrate the updated CAS into the `uns` field of the AnnData object.  

This workflow uses a subset of the WMBO dataset. The example Allen-style spreadsheet is derived from the `CB Glut (CS20230722_CLAS_29)` subset of the Mouse Cell-Type Annotations ([Nature article, Supplementary Table 7](https://www.nature.com/articles/s41586-023-06812-z#Sec49)).

#### Installing Required Packages

In [1]:
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install anndata
!{sys.executable} -m pip install --upgrade cas-tools


#### Example of an Allen-Style Spreadsheet  

To keep things simple, we use the `CB Glut (CS20230722_CLAS_29)` subset from the Mouse Cell-Type Annotations dataset (refer to [Nature article, Supplementary Table 7](https://www.nature.com/articles/s41586-023-06812-z#Sec49)).

In [2]:
import pandas as pd

pd.read_csv("./data/wmb_class_29_annotation.tsv", delimiter='\t')[:4]

Unnamed: 0,cluster_id,cluster,supertype,subclass,class,neighborhood,anatomical_annotation,notes,CCF_broad.freq,CCF_acronym.freq,...,CTX.subclass_id,CTX.subclass_id.1,CTX.neighborhood_id,CTX.neighborhood_label,CTX.size,taxonomy_id,cell_set_accession.cluster,cell_set_accession.supertype,cell_set_accession.subclass,cell_set_accession.class
0,5197,5197 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"MY:0.35,CB:0.33,NA:0.26","DCO:0.19,FL:0.11,arb:0.09,VCO:0.09,DN:0.08,PFL...",...,,,,,,CCN202307220,CS20230722_CLUS_5197,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
1,5198,5198 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"CB:0.56,MY:0.24,NA:0.19","FL:0.25,PFL:0.17,DCO:0.13,VCO:0.09,arb:0.08,NA...",...,,,,,,CCN202307220,CS20230722_CLUS_5198,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
2,5199,5199 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"MY:0.4,CB:0.38,NA:0.18","DCO:0.29,DN:0.12,arb:0.09,FL:0.09,PFL:0.06,VCO...",...,,,,,,CCN202307220,CS20230722_CLUS_5199,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
3,5200,5200 CB Granule Glut_2,1155 CB Granule Glut_2,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,NOD PFL,,"CB:0.77,NA:0.22","PFL:0.17,NOD:0.17,FL:0.13,NA:0.11,arb:0.1,UVU:...",...,,,,,,CCN202307220,CS20230722_CLUS_5200,CS20230722_SUPT_1155,CS20230722_SUBC_314,CS20230722_CLAS_29


#### Import Spreadsheet into CAS  

Import an Allen-style spreadsheet into the CAS data-classes using a simple mapping file.
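
Conceptually, each spreadsheet row describes one cluster together with its supertype, subclass, and class, so the taxonomy hierarchy can be recovered by pairing each of these columns with the column one level above it. The following is a minimal sketch of that idea on toy rows copied from the table above. It is not the `cas-tools` ingestion code, and the helper name `child_to_parent` is hypothetical.

```python
import pandas as pd

# Toy rows mirroring the hierarchy columns of the spreadsheet shown above.
rows = pd.DataFrame(
    {
        "cluster": [
            "5197 CB Granule Glut_1",
            "5198 CB Granule Glut_1",
            "5200 CB Granule Glut_2",
        ],
        "supertype": [
            "1154 CB Granule Glut_1",
            "1154 CB Granule Glut_1",
            "1155 CB Granule Glut_2",
        ],
        "subclass": ["314 CB Granule Glut"] * 3,
        "class": ["29 CB Glut"] * 3,
    }
)

# Labelsets ordered from most specific to most general.
levels = ["cluster", "supertype", "subclass", "class"]


def child_to_parent(df, levels):
    """Hypothetical helper: map every (labelset, label) pair to its parent label."""
    parents = {}
    for child, parent in zip(levels, levels[1:]):
        pairs = df[[child, parent]].drop_duplicates()
        # name=None yields plain tuples, which avoids issues with the "class" column name.
        for child_label, parent_label in pairs.itertuples(index=False, name=None):
            parents[(child, child_label)] = parent_label
    return parents


hierarchy = child_to_parent(rows, levels)
print(hierarchy[("cluster", "5200 CB Granule Glut_2")])
```

The top-level labelset (`class`) has no entry in the mapping because it has no parent column.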

In [3]:
import json
from cas.ingest.ingest_user_table import ingest_user_data

cas_data = ingest_user_data("./data/wmb_class_29_annotation.tsv", "./data/wmb_ingestion_config.yaml", True)
cas = cas_data.as_dictionary()
print(json.dumps(cas, indent=2)[:2000])

{
  "author_name": "Hongkui Zeng",
  "annotations": [
    {
      "labelset": "class",
      "cell_label": "29 CB Glut",
      "cell_set_accession": "5206"
    },
    {
      "labelset": "subclass",
      "cell_label": "314 CB Granule Glut",
      "cell_set_accession": "5207",
      "parent_cell_set_accession": "5206"
    },
    {
      "labelset": "supertype",
      "cell_label": "1154 CB Granule Glut_1",
      "cell_set_accession": "5208",
      "parent_cell_set_accession": "5207"
    },
    {
      "labelset": "cluster",
      "cell_label": "5197 CB Granule Glut_1",
      "cell_set_accession": "5197",
      "parent_cell_set_accession": "5208",
      "author_annotation_fields": {
        "neighborhood": "NN-IMN-GC",
        "anatomical_annotation": "DCO VCO",
        "CCF_broad.freq": "MY:0.35,CB:0.33,NA:0.26",
        "CCF_acronym.freq": "DCO:0.19,FL:0.11,arb:0.09,VCO:0.09,DN:0.08,PFL:0.07,MY:0.07,mcp:0.06,CUL4, 5:0.04",
        "v3.size": "14544",
        "v2.size": "0",
        "m

#### Retrieve AnnData File for `CB Glut`  

Download the AnnData file corresponding to `CB Glut (CS20230722_CLAS_29)`. The original [WMB-10Xv2](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/WMB-10Xv2/20230630/) and [WMB-10Xv3](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/WMB-10Xv3/20230630/) AnnData files were generated based on dissection.  

These files were merged and then split into 34 top-level classes. The resulting AnnData files for each class can be accessed at the following links:  
- [Class 01](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_01.h5ad)  
- [Class 02](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_02.h5ad)  
- ...  
- [Class 34](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_34.h5ad)  

The file specific to `CB Glut` is included in this collection.

In [4]:
!wget -N http://cellular-semantics.cog.sanger.ac.uk/public/CS20230722_CLAS_29.h5ad

--2025-02-17 16:12:23--  http://cellular-semantics.cog.sanger.ac.uk/public/CS20230722_CLAS_29.h5ad
Resolving cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)... 172.27.51.1, 172.27.51.3, 172.27.51.2, ...
Connecting to cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)|172.27.51.1|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cellular-semantics.cog.sanger.ac.uk/public/CS20230722_CLAS_29.h5ad [following]
--2025-02-17 16:12:23--  https://cellular-semantics.cog.sanger.ac.uk/public/CS20230722_CLAS_29.h5ad
Connecting to cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)|172.27.51.1|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘CS20230722_CLAS_29.h5ad’ not modified on server. Omitting download.



In [5]:
from cas.file_utils import read_anndata_file

anndata = read_anndata_file("./CS20230722_CLAS_29.h5ad")
anndata.obs[:3]

cell_label,cell_barcode,library_label,tissue,tissue_ontology_term_id,class,subclass,supertype,cluster,organism,disease,organism_ontology_term_id,disease_ontology_term_id
AAACCCAAGAACAAGG-472_A05,AAACCCAAGAACAAGG,L8TX_201217_01_G07,Cerebellum,UBERON:0002037,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,NCBITaxon:10090,PATO:0000461
AAACCCAAGAATCCCT-473_A06,AAACCCAAGAATCCCT,L8TX_201217_01_A08,Cerebellum,UBERON:0002037,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,NCBITaxon:10090,PATO:0000461
AAACCCAAGACTACCT-225_A01,AAACCCAAGACTACCT,L8TX_200227_01_F10,Medulla,UBERON:0001896,29 CB Glut,314 CB Granule Glut,1154 CB Granule Glut_1,5197 CB Granule Glut_1,Mus musculus,normal,NCBITaxon:10090,PATO:0000461


#### Analyse CAS and AnnData

Compare CAS and AnnData to verify that the annotations and hierarchy are consistent.

In [6]:
obs = anndata.obs 

labelsets = [labelset.name for labelset in sorted(cas_data.labelsets, key=lambda x: x.rank)]
print(labelsets)

# assert that both sources contain the same set of cluster labels
cas_clusters = {annotation.cell_label for annotation in cas_data.annotations if annotation.labelset == "cluster"}
anndata_clusters = set(obs["cluster"].unique())
assert cas_clusters == anndata_clusters

# assert hierarchies are the same
for annotation in cas_data.annotations:
    # the top-level labelset has no parent, so skip it
    if labelsets.index(annotation.labelset) < len(labelsets) - 1:
        # list all parent names for the given annotation in the AnnData
        parent_labelset = labelsets[labelsets.index(annotation.labelset) + 1]
        parent_names = obs[obs[annotation.labelset] == annotation.cell_label][parent_labelset].unique()
        # assert that the annotation has exactly one parent
        assert len(parent_names) == 1
        # assert that the AnnData parent matches the CAS parent
        assert annotation.parent_cell_set_name == parent_names[0]

['cluster', 'supertype', 'subclass', 'class']
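
The loop above checks one annotation at a time. The same single-parent property can also be checked directly on the `obs` table with a `groupby`. The sketch below uses a small toy frame rather than the real AnnData, with one deliberately inconsistent row, and the function name `find_multi_parent_labels` is hypothetical.

```python
import pandas as pd

# Toy stand-in for anndata.obs; the last row assigns cluster 5201 to a second supertype.
obs = pd.DataFrame(
    {
        "cluster": [
            "5197 CB Granule Glut_1",
            "5197 CB Granule Glut_1",
            "5201 CB Granule Glut_2",
            "5201 CB Granule Glut_2",
        ],
        "supertype": [
            "1154 CB Granule Glut_1",
            "1154 CB Granule Glut_1",
            "1155 CB Granule Glut_2",
            "1154 CB Granule Glut_1",
        ],
    }
)


def find_multi_parent_labels(obs, child, parent):
    """Hypothetical helper: return child labels that map to more than one parent."""
    parent_counts = obs.groupby(child, observed=True)[parent].nunique()
    return sorted(parent_counts[parent_counts > 1].index)


print(find_multi_parent_labels(obs, "cluster", "supertype"))
```

An empty result means every child label has a single parent, which is the condition the assertions above enforce.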


#### Populate CAS with Cell IDs  

Add Cell IDs to the CAS using data from the AnnData file.
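
Conceptually, populating cell IDs means collecting, for each annotation, the `obs` index values of the cells that carry its label in the matching labelset column. A minimal sketch of that idea on toy data follows. It is not the implementation of `add_cell_ids`, and both the `cell_ids` key and the helper name `attach_cell_ids` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for anndata.obs: the index holds the cell IDs.
obs = pd.DataFrame(
    {
        "cluster": [
            "5201 CB Granule Glut_2",
            "5201 CB Granule Glut_2",
            "5197 CB Granule Glut_1",
        ],
        "class": ["29 CB Glut"] * 3,
    },
    index=[
        "AAACCCAAGAACAAGG-472_A05",
        "AAACCCAAGAATCCCT-473_A06",
        "AAACCCAAGACTACCT-225_A01",
    ],
)

annotations = [
    {"labelset": "class", "cell_label": "29 CB Glut"},
    {"labelset": "cluster", "cell_label": "5197 CB Granule Glut_1"},
]


def attach_cell_ids(annotations, obs):
    """Hypothetical helper: add the matching obs index values to each annotation."""
    for annotation in annotations:
        # Select the cells whose label in this labelset column matches the annotation.
        mask = (obs[annotation["labelset"]] == annotation["cell_label"]).to_numpy()
        annotation["cell_ids"] = obs.index[mask].tolist()
    return annotations


for annotation in attach_cell_ids(annotations, obs):
    print(annotation["cell_label"], len(annotation["cell_ids"]))
```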

In [8]:
from cas.populate_cell_ids import add_cell_ids

cas = add_cell_ids(cas, anndata.obs)

print(json.dumps(cas["annotations"][5], indent=2)[:1500])

{
  "labelset": "cluster",
  "cell_label": "5199 CB Granule Glut_1",
  "cell_set_accession": "5199",
  "parent_cell_set_accession": "5208",
  "author_annotation_fields": {
    "neighborhood": "NN-IMN-GC",
    "anatomical_annotation": "DCO VCO",
    "CCF_broad.freq": "MY:0.4,CB:0.38,NA:0.18",
    "CCF_acronym.freq": "DCO:0.29,DN:0.12,arb:0.09,FL:0.09,PFL:0.06,VCO:0.05,MY:0.05,CUL4, 5:0.05,P:0.04",
    "v3.size": "909",
    "v2.size": "0",
    "multiome.size": "0",
    "F": "0.44",
    "M": "0.56",
    "Dark": "0.03",
    "Light": "0.97",
    "nt_type_label": "Glut",
    "nt.markers": "Slc17a7:8.33,Slc17a6:3.37",
    "nt_type_combo_label": "Glut",
    "cluster.markers.combo": "Cbln3,Tmem132d",
    "merfish.markers.combo": "Eomes,Col27a1,Calb2",
    "cluster.TF.markers.combo": "Eomes,Lmx1a,Nr2f2,Lin28b",
    "cluster.markers.combo (within subclass)": "Rgs6",
    "taxonomy_id": "CCN202307220",
    "cell_set_accession.cluster": "CS20230722_CLUS_5199",
    "cell_set_accession.supertype": "CS

#### Merge CAS into AnnData

Add the CAS to the `uns` field of the AnnData file.

In [9]:
from cas.anndata_conversion import merge_cas_object

# Release the file handle held by the open AnnData object before merging.
anndata.file.close()
merge_cas_object(cas, "./CS20230722_CLAS_29.h5ad", True, "./merged_CS20230722_CLAS_29_v2.h5ad")

In [10]:
from cas.file_utils import read_cas_from_anndata

cas_read = read_cas_from_anndata("./merged_CS20230722_CLAS_29_v2.h5ad")
cas_read.get_all_annotations().head(5)

Unnamed: 0,labelset,cell_label,cell_set_accession,cell_fullname,cell_ontology_term_id,cell_ontology_term,rationale,rationale_dois,marker_gene_evidence,synonyms,...,author_annotation_fields.cluster.markers.combo,author_annotation_fields.merfish.markers.combo,author_annotation_fields.cluster.TF.markers.combo,author_annotation_fields.cluster.markers.combo (within subclass),author_annotation_fields.taxonomy_id,author_annotation_fields.cell_set_accession.cluster,author_annotation_fields.cell_set_accession.supertype,author_annotation_fields.cell_set_accession.subclass,author_annotation_fields.cell_set_accession.class,author_annotation_fields.np.markers
0,class,29 CB Glut,5206,,,,,,,,...,,,,,,,,,,
1,subclass,314 CB Granule Glut,5207,,,,,,,,...,,,,,,,,,,
2,supertype,1154 CB Granule Glut_1,5208,,,,,,,,...,,,,,,,,,,
3,cluster,5197 CB Granule Glut_1,5197,,,,,,,,...,"Gabra6,Lmx1a,Rnf182","Col27a1,Barhl1,St18,Trhde,Spon1,Syt6","Lmx1a,Zic1,St18","Lmx1a,Rnf182",CCN202307220,CS20230722_CLUS_5197,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29,
4,cluster,5198 CB Granule Glut_1,5198,,,,,,,,...,"Gabra6,Cntn5","Svep1,Slc17a7,Chrm2","Pax6,Neurod2,Etv1,Bcl11b",Cntn5,CCN202307220,CS20230722_CLUS_5198,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29,
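
The dotted column names in this table, such as `author_annotation_fields.taxonomy_id`, are what flattening nested annotation dictionaries produces. How `get_all_annotations` builds its table is not shown here. The sketch below only illustrates the same effect with `pandas.json_normalize` on two toy annotations taken from the output above.

```python
import pandas as pd

# Two toy annotations: only the cluster carries author_annotation_fields.
annotations = [
    {"labelset": "class", "cell_label": "29 CB Glut", "cell_set_accession": "5206"},
    {
        "labelset": "cluster",
        "cell_label": "5197 CB Granule Glut_1",
        "cell_set_accession": "5197",
        "author_annotation_fields": {
            "neighborhood": "NN-IMN-GC",
            "taxonomy_id": "CCN202307220",
        },
    },
]

# Nested dictionaries become columns named "<outer key>.<inner key>".
flat = pd.json_normalize(annotations)
print(flat.columns.tolist())
```

Rows without a given nested field, like the `class` annotation here, get missing values in the corresponding columns, which matches the empty cells in the first rows of the table.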
