### Converting an Allen-Style Taxonomy Spreadsheet to CAS Format  

This notebook demonstrates how to convert an Allen-style taxonomy spreadsheet into the CAS (Cell Annotation Schema) format. The process includes the following steps:  

1. **Generate CAS**: Convert the spreadsheet data into the CAS format.  
2. **Populate IDs**: Add the corresponding cell IDs to the CAS from a related AnnData (h5ad) file.  
3. **Merge Data**: Integrate the updated CAS into the `uns` field of the AnnData object.  

This workflow uses a subset of the WMBO dataset. The example Allen-style spreadsheet is derived from the `CB Glut (CS20230722_CLAS_29)` subset of the mouse cell-type annotations ([Nature article, Supplementary Table 7](https://www.nature.com/articles/s41586-023-06812-z#Sec49)).

#### Installing Required Packages

In [1]:
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install anndata
!{sys.executable} -m pip install --upgrade cas-tools

Collecting pandas
  Using cached pandas-2.2.3-cp39-cp39-macosx_11_0_arm64.whl (11.3 MB)
Collecting numpy>=1.22.4
  Using cached numpy-2.0.2-cp39-cp39-macosx_14_0_arm64.whl (5.3 MB)
Collecting tzdata>=2022.7
  Using cached tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Collecting pytz>=2020.1
  Using cached pytz-2024.2-py2.py3-none-any.whl (508 kB)
Installing collected packages: tzdata, pytz, numpy, pandas
Successfully installed numpy-2.0.2 pandas-2.2.3 pytz-2024.2 tzdata-2024.2
You should consider upgrading via the '/Users/hk9/workspaces/workspace3/cas-tools/notebooks/venv/bin/python3 -m pip install --upgrade pip' command.
Collecting anndata
  Downloading anndata-0.10.9-py3-none-any.whl (128 kB)
Collecting array-api-compat!=1.5,>1.4
  Downloading array_api_compat-1.10.0-py3-none-any.whl (50 kB)
Collecting natsort
  Using cached natsort-8.4
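
After installation, it can be useful to confirm which versions ended up in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of cas-tools.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Distribution names as passed to pip in the cell above
print(installed_versions(["pandas", "anndata", "cas-tools"]))
```

A `None` value indicates that the package is missing from the interpreter running the notebook, which usually means pip installed into a different environment.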

#### Example of an Allen-Style Spreadsheet  

To keep things simple, we use the `CB Glut (CS20230722_CLAS_29)` subset from the Mouse Cell-Type Annotations dataset (refer to [Nature article, Supplementary Table 7](https://www.nature.com/articles/s41586-023-06812-z#Sec49)).

In [23]:
import pandas as pd

pd.read_csv("./data/wmb_class_29_annotation.tsv", delimiter='\t')[:4]

Unnamed: 0,cluster_id,cluster,supertype,subclass,class,neighborhood,anatomical_annotation,notes,CCF_broad.freq,CCF_acronym.freq,...,CTX.subclass_id,CTX.subclass_id.1,CTX.neighborhood_id,CTX.neighborhood_label,CTX.size,taxonomy_id,cell_set_accession.cluster,cell_set_accession.supertype,cell_set_accession.subclass,cell_set_accession.class
0,5197,5197 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"MY:0.35,CB:0.33,NA:0.26","DCO:0.19,FL:0.11,arb:0.09,VCO:0.09,DN:0.08,PFL...",...,,,,,,CCN202307220,CS20230722_CLUS_5197,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
1,5198,5198 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"CB:0.56,MY:0.24,NA:0.19","FL:0.25,PFL:0.17,DCO:0.13,VCO:0.09,arb:0.08,NA...",...,,,,,,CCN202307220,CS20230722_CLUS_5198,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
2,5199,5199 CB Granule Glut_1,1154 CB Granule Glut_1,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,DCO VCO,,"MY:0.4,CB:0.38,NA:0.18","DCO:0.29,DN:0.12,arb:0.09,FL:0.09,PFL:0.06,VCO...",...,,,,,,CCN202307220,CS20230722_CLUS_5199,CS20230722_SUPT_1154,CS20230722_SUBC_314,CS20230722_CLAS_29
3,5200,5200 CB Granule Glut_2,1155 CB Granule Glut_2,314 CB Granule Glut,29 CB Glut,NN-IMN-GC,NOD PFL,,"CB:0.77,NA:0.22","PFL:0.17,NOD:0.17,FL:0.13,NA:0.11,arb:0.1,UVU:...",...,,,,,,CCN202307220,CS20230722_CLUS_5200,CS20230722_SUPT_1155,CS20230722_SUBC_314,CS20230722_CLAS_29


#### Import Spreadsheet into CAS  

Import an Allen-style spreadsheet into the CAS data-classes using a simple mapping file.
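
To make the idea of the conversion concrete, here is a rough sketch of how the taxonomy columns of such a spreadsheet map onto CAS-like annotation records. It is not the cas-tools implementation and does not read the mapping file; the function name `rows_to_annotations` and the inline sample are illustrative assumptions, with values copied from the rows shown above.

```python
import csv
import io

# Tiny stand-in for the spreadsheet: only the four taxonomy columns are kept.
SAMPLE_TSV = (
    "cluster\tsupertype\tsubclass\tclass\n"
    "5197 CB Granule Glut_1\t1154 CB Granule Glut_1\t314 CB Granule Glut\t29 CB Glut\n"
    "5198 CB Granule Glut_1\t1154 CB Granule Glut_1\t314 CB Granule Glut\t29 CB Glut\n"
    "5200 CB Granule Glut_2\t1155 CB Granule Glut_2\t314 CB Granule Glut\t29 CB Glut\n"
)

# Labelsets ordered from most specific to most general
LEVELS = ["cluster", "supertype", "subclass", "class"]


def rows_to_annotations(tsv_text, levels=LEVELS):
    """Collect one record per distinct (labelset, label), with its parent label."""
    annotations = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for i, level in enumerate(levels):
            key = (level, row[level])
            if key in annotations:
                continue  # this label was already recorded from an earlier row
            entry = {"labelset": level, "cell_label": row[level]}
            if i + 1 < len(levels):
                # The next, more general column holds the parent label
                entry["parent_cell_set_name"] = row[levels[i + 1]]
            annotations[key] = entry
    return list(annotations.values())


for annotation in rows_to_annotations(SAMPLE_TSV):
    print(annotation)
```

Each spreadsheet row describes one cluster together with its ancestors, so repeated supertype, subclass, and class labels collapse into a single record each.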

In [4]:
from cas.ingest.ingest_user_table import ingest_user_data

cas_data = ingest_user_data("./data/wmb_class_29_annotation.tsv", "./data/wmb_ingestion_config.yaml")
print(cas_data.to_json(indent=2))

{
  "author_name": "Hongkui Zeng",
  "annotations": [
    {
      "labelset": "class",
      "cell_label": "29 CB Glut"
    },
    {
      "labelset": "subclass",
      "cell_label": "314 CB Granule Glut"
    },
    {
      "labelset": "supertype",
      "cell_label": "1154 CB Granule Glut_1"
    },
    {
      "labelset": "cluster",
      "cell_label": "5197 CB Granule Glut_1",
      "cell_set_accession": "5197",
      "author_annotation_fields": {
        "neighborhood": "NN-IMN-GC",
        "anatomical_annotation": "DCO VCO",
        "CCF_broad.freq": "MY:0.35,CB:0.33,NA:0.26",
        "CCF_acronym.freq": "DCO:0.19,FL:0.11,arb:0.09,VCO:0.09,DN:0.08,PFL:0.07,MY:0.07,mcp:0.06,CUL4, 5:0.04",
        "v3.size": "14544",
        "v2.size": "0",
        "multiome.size": "0",
        "F": "0.5",
        "M": "0.5",
        "Dark": "0.04",
        "Light": "0.96",
        "nt_type_label": "Glut",
        "nt.markers": "Slc17a7:8.41",
        "nt_type_combo_label": "Glut",
        "cluster.m

#### Retrieve AnnData File for `CB Glut`  

Download the AnnData file corresponding to `CB Glut (CS20230722_CLAS_29)`. The original [WMB-10Xv2](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/WMB-10Xv2/20230630/) and [WMB-10Xv3](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#expression_matrices/WMB-10Xv3/20230630/) AnnData files were generated based on dissection.  

These files were merged and then split into 34 top-level classes. The resulting AnnData files for each class can be accessed at the following links:  
- [Class 01](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_01.h5ad)  
- [Class 02](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_02.h5ad)  
- ...  
- [Class 34](http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_34.h5ad)  

The file specific to `CB Glut` is included in this collection.
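
The per-class links follow a visible naming pattern, so the URL for a given class can be built programmatically. This sketch assumes that the elided classes (03 to 33) use the same zero-padded pattern as the three links listed above; the helper name `class_h5ad_url` is hypothetical.

```python
BASE_URL = "https://cellular-semantics.cog.sanger.ac.uk/public"


def class_h5ad_url(class_number):
    """Build the download URL for one of the 34 top-level classes."""
    if not 1 <= class_number <= 34:
        raise ValueError("class_number must be between 1 and 34")
    # Assumption: every class file follows the zero-padded pattern of the listed links
    return f"{BASE_URL}/merged_CS20230722_CLAS_{class_number:02d}.h5ad"


print(class_h5ad_url(29))
```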

In [2]:
!wget -N http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad

--2025-01-21 10:35:08--  http://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad
Resolving cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)... 172.27.51.3, 172.27.51.1, 172.27.51.130, ...
Connecting to cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)|172.27.51.3|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad [following]
--2025-01-21 10:35:08--  https://cellular-semantics.cog.sanger.ac.uk/public/merged_CS20230722_CLAS_29.h5ad
Connecting to cellular-semantics.cog.sanger.ac.uk (cellular-semantics.cog.sanger.ac.uk)|172.27.51.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2585489225 (2.4G) [binary/octet-stream]
Saving to: ‘merged_CS20230722_CLAS_29.h5ad’


2025-01-21 10:45:36 (3.93 MB/s) - ‘merged_CS20230722_CLAS_29.h5ad’ saved [2585489225/2585489225]
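
Because the file is about 2.4 GB, an interrupted download is easy to miss. A minimal check, assuming the byte count reported in the log above is still accurate for the hosted file, compares the size on disk with that value; `download_is_complete` is a hypothetical helper.

```python
import os

EXPECTED_BYTES = 2585489225  # "Length" reported in the wget log above


def download_is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file exists and has exactly the expected size in bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


print(download_is_complete("./merged_CS20230722_CLAS_29.h5ad"))
```

A size match does not prove the content is intact, but it catches truncated downloads cheaply.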



#### Populate CAS with Cell IDs  

Add Cell IDs to the CAS using data from the AnnData file.
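
Conceptually, this step groups the cell IDs in the AnnData `obs` table by their label in each labelset and attaches the matching list to each annotation. The sketch below illustrates that idea on plain Python data; it is not how `add_cell_ids` is implemented, and `attach_cell_ids` and `OBS_ROWS` are illustrative names with values taken from the `obs` table in this notebook.

```python
from collections import defaultdict

# Stand-in for a few rows of the obs table: (cell ID, {labelset: label})
OBS_ROWS = [
    ("AAACCCAAGAACAAGG-472_A05",
     {"supertype": "1155 CB Granule Glut_2", "cluster": "5201 CB Granule Glut_2"}),
    ("AAACCCAAGACTACCT-225_A01",
     {"supertype": "1154 CB Granule Glut_1", "cluster": "5197 CB Granule Glut_1"}),
    ("AAACCCAAGAGCTGAC-231.2_B01",
     {"supertype": "1154 CB Granule Glut_1", "cluster": "5197 CB Granule Glut_1"}),
]


def attach_cell_ids(annotations, obs_rows):
    """Add a 'cell_ids' list to each annotation whose label occurs in obs_rows."""
    ids_by_label = defaultdict(list)
    for cell_id, labels in obs_rows:
        for labelset, label in labels.items():
            ids_by_label[(labelset, label)].append(cell_id)
    for annotation in annotations:
        key = (annotation["labelset"], annotation["cell_label"])
        if key in ids_by_label:
            annotation["cell_ids"] = list(ids_by_label[key])
    return annotations


sample_annotations = [
    {"labelset": "supertype", "cell_label": "1154 CB Granule Glut_1"},
    {"labelset": "cluster", "cell_label": "5201 CB Granule Glut_2"},
]
for annotation in attach_cell_ids(sample_annotations, OBS_ROWS):
    print(annotation)
```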

In [13]:
from cas.file_utils import read_json_file, read_anndata_file

cas = read_json_file("./data/wmb_cas.json")
anndata = read_anndata_file("./merged_CS20230722_CLAS_29.h5ad")
anndata.obs[:5]

cell_label,cell_barcode,library_label,tissue,tissue_ontology_term_id,neurotransmitter,class,subclass,supertype,cluster,organism,disease,assay
AAACCCAAGAACAAGG-472_A05,AAACCCAAGAACAAGG,L8TX_201217_01_G07,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v2
AAACCCAAGAATCCCT-473_A06,AAACCCAAGAATCCCT,L8TX_201217_01_A08,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v3
AAACCCAAGACTACCT-225_A01,AAACCCAAGACTACCT,L8TX_200227_01_F10,Medulla,UBERON:0001896,Glut,29 CB Glut,314 CB Granule Glut,1154 CB Granule Glut_1,5197 CB Granule Glut_1,Mus musculus,normal,10x 3' v2
AAACCCAAGAGCTGAC-231.2_B01,AAACCCAAGAGCTGAC,L8TX_200306_01_H12,Medulla,UBERON:0001896,Glut,29 CB Glut,314 CB Granule Glut,1154 CB Granule Glut_1,5197 CB Granule Glut_1,Mus musculus,normal,10x 3' v3
AAACCCAAGAGGACTC-478_A02,AAACCCAAGAGGACTC,L8TX_210107_02_H11,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v2


In [14]:
from cas.populate_cell_ids import add_cell_ids

labelset_names = [labelset.name for labelset in sorted(cas_data.labelsets, key=lambda x: x.rank)]
cas = add_cell_ids(cas, anndata, labelset_names)

cas["annotations"][2]

{'labelset': 'supertype',
 'cell_label': '1154 CB Granule Glut_1',
 'cell_set_accession': '5208',
 'parent_cell_set_name': '314 CB Granule Glut',
 'parent_cell_set_accession': '5207',
 'cell_ids': ['TACCGAACAGGTCCGT-202.1_A01',
  'AACAACCTCTGCGGGT-175_A01',
  'CTCAAGATCAAAGGAT-231.2_B01',
  'CACACAATCTGTCTCG-175_B01',
  'GTCAAGTCATACCATG-231.2_A01',
  'GAGACTTGTTCCGGTG-179_B01',
  'GATCCCTCATGGCACC-1062_A03',
  'GTTGTGAGTCCCTGAG-231.2_A01',
  'GCTACCTCAGCAGTCC-504_A01',
  'TCATTCAGTGTGTTTG-225_B01',
  'CAAGGGATCACATTGG-179_C01',
  'GGAGGTAGTGATCATC-175_A01',
  'GCCTGTTAGTTTGCTG-179_A01',
  'CTCCGATGTGAATTAG-1032_A09',
  'ACCCTTGCACGGCTAC-179_C01',
  'TGTCAGAAGATACTGA-179_A01',
  'ACGTAGTTCAACACGT-225_B01',
  'GCGGATCTCGACACCG-175_A01',
  'CTATAGGAGCAATTAG-231.2_B01',
  'AGTCTCCTCATAGGCT-231.2_B01',
  'GGTTGTAAGAAGCTCG-231.2_B01',
  'TTACTGTCACTCATAG-225_A01',
  'TCGCAGGAGGCAGTCA-231.2_B01',
  'GTCATGATCACCCTCA-175_A01',
  'CCTAAGAAGAGGTCGT-231.2_B01',
  'GGTGATTAGACTTCAC-231.2_B01',
  

#### Merge CAS into AnnData

Add the CAS to the `uns` field of the AnnData file.

In [18]:
from cas.anndata_conversion import merge_cas_object
import anndata as ad

anndata.file.close()
merge_cas_object(cas, "./merged_CS20230722_CLAS_29.h5ad", True, "./merged_CS20230722_CLAS_29_v2.h5ad")

Can't read raw.var since raw layer doesn't exist!


In [19]:
anndata_merged = ad.read_h5ad("./merged_CS20230722_CLAS_29_v2.h5ad", backed="r")
anndata_merged.obs[:5]

cell_label,cell_barcode,library_label,tissue,tissue_ontology_term_id,neurotransmitter,class,subclass,supertype,cluster,organism,disease,assay
AAACCCAAGAACAAGG-472_A05,AAACCCAAGAACAAGG,L8TX_201217_01_G07,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v2
AAACCCAAGAATCCCT-473_A06,AAACCCAAGAATCCCT,L8TX_201217_01_A08,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v3
AAACCCAAGACTACCT-225_A01,AAACCCAAGACTACCT,L8TX_200227_01_F10,Medulla,UBERON:0001896,Glut,29 CB Glut,314 CB Granule Glut,1154 CB Granule Glut_1,5197 CB Granule Glut_1,Mus musculus,normal,10x 3' v2
AAACCCAAGAGCTGAC-231.2_B01,AAACCCAAGAGCTGAC,L8TX_200306_01_H12,Medulla,UBERON:0001896,Glut,29 CB Glut,314 CB Granule Glut,1154 CB Granule Glut_1,5197 CB Granule Glut_1,Mus musculus,normal,10x 3' v3
AAACCCAAGAGGACTC-478_A02,AAACCCAAGAGGACTC,L8TX_210107_02_H11,Cerebellum,UBERON:0002037,Glut,29 CB Glut,314 CB Granule Glut,1155 CB Granule Glut_2,5201 CB Granule Glut_2,Mus musculus,normal,10x 3' v2


In [20]:
anndata_merged.uns["cas"]

'{"author_name": "Hongkui Zeng", "title": "Whole Mouse Brain taxonomy", "description": "", "labelsets": [{"name": "cluster", "rank": 0}, {"name": "supertype", "rank": 1}, {"name": "subclass", "rank": 2}, {"name": "class", "rank": 3}], "cellannotation_schema_version": "1.0.0", "annotations": [{"labelset": "class", "cell_label": "29 CB Glut", "cell_set_accession": "5206"}, {"labelset": "subclass", "cell_label": "314 CB Granule Glut", "cell_set_accession": "5207", "parent_cell_set_name": "29 CB Glut", "parent_cell_set_accession": "5206"}, {"labelset": "supertype", "cell_label": "1154 CB Granule Glut_1", "cell_set_accession": "5208", "parent_cell_set_name": "314 CB Granule Glut", "parent_cell_set_accession": "5207"}, {"labelset": "cluster", "cell_label": "5197 CB Granule Glut_1", "cell_set_accession": "5197", "parent_cell_set_name": "1154 CB Granule Glut_1", "parent_cell_set_accession": "5208", "author_annotation_fields": {"neighborhood": "NN-IMN-GC", "anatomical_annotation": "DCO VCO", "C
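
The output above is a JSON string rather than a dictionary, so it needs to be parsed before it can be queried. The following sketch shows one way to do that with the standard `json` module; `labelset_counts` is a hypothetical helper, and `sample_uns` is an abbreviated stand-in for `anndata_merged.uns`.

```python
import json


def labelset_counts(uns):
    """Parse the JSON string stored under uns['cas'] and count annotations per labelset."""
    cas_dict = json.loads(uns["cas"])
    counts = {}
    for annotation in cas_dict.get("annotations", []):
        labelset = annotation["labelset"]
        counts[labelset] = counts.get(labelset, 0) + 1
    return counts


# Abbreviated stand-in for anndata_merged.uns (assumption: "cas" holds a JSON string)
sample_uns = {
    "cas": json.dumps({
        "author_name": "Hongkui Zeng",
        "annotations": [
            {"labelset": "class", "cell_label": "29 CB Glut"},
            {"labelset": "cluster", "cell_label": "5197 CB Granule Glut_1"},
        ],
    })
}
print(labelset_counts(sample_uns))
```

With the real file, the same function could be called as `labelset_counts(anndata_merged.uns)`.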