## Introduction

In this notebook, we demonstrate how to split the large Whole Mouse Brain CAS (Cell Annotation Schema) file into smaller, more manageable files. The workflow includes:

- Installing the required packages.
- Downloading the latest Whole Mouse Brain CAS file.
- Inspecting the original file to check the cell set count in the annotations.
- Splitting the large CAS file using the `29 CB Glut` cell set as a reference point.  This splits out the chosen reference cell set and all its subsets.
- Comparing the annotations and cell set counts before and after the split to ensure the operation was successful.


### Installing Required Packages

For detailed instructions on setting up and running the notebooks, please refer to the [README.md](https://github.com/cellannotation/cas-tools/blob/main/notebooks/README.md) in the notebooks directory.

### Downloading the Latest Version of the Whole Mouse Brain CAS File

In [2]:
!wget https://raw.githubusercontent.com/brain-bican/whole_mouse_brain_taxonomy/refs/heads/main/CCN20230722.json -O CCN20230722.json

--2025-03-18 11:55:16--  https://raw.githubusercontent.com/brain-bican/whole_mouse_brain_taxonomy/refs/heads/main/CCN20230722.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8000::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12364592 (12M) [text/plain]
Saving to: ‘CCN20230722.json’


2025-03-18 11:55:20 (3.28 MB/s) - ‘CCN20230722.json’ saved [12364592/12364592]



In [16]:
# Load and inspect the JSON file.
# Note: this cell reads the file from data/, so move the download there or adjust the path.

import json
import pandas as pd
with open("data/CCN20230722.json", "r") as f:
    cas = json.load(f)
print(f"There are {len(cas['annotations'])} cell sets in CAS annotations")

There are 6905 cell sets in CAS annotations
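The total count can be broken down per labelset, which makes the later before/after comparison easier to read. The following is a minimal sketch: `example_cas` is a hypothetical miniature document used only so the snippet runs on its own, and `count_by_labelset` is an illustrative helper, not part of cas-tools. With the real file, pass the `cas` dict loaded above instead.

```python
from collections import Counter

# Hypothetical miniature CAS document, used only so the sketch is self-contained.
# The keys "labelset" and "cell_label" match the annotation columns shown later
# in this notebook.
example_cas = {
    "annotations": [
        {"labelset": "class", "cell_label": "29 CB Glut"},
        {"labelset": "subclass", "cell_label": "314 CB Granule Glut"},
        {"labelset": "subclass", "cell_label": "315 DCO UBC Glut"},
        {"labelset": "cluster", "cell_label": "5197 CB Granule Glut_1"},
    ]
}


def count_by_labelset(cas_doc):
    """Return a Counter mapping labelset name -> number of annotations."""
    return Counter(a["labelset"] for a in cas_doc["annotations"])


counts = count_by_labelset(example_cas)
for labelset, n in counts.most_common():
    print(f"{labelset}: {n}")
```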


Labelsets in the hierarchy have a rank, from 0 (leaf node) upward. Labelsets outside the hierarchy have no rank.

In [4]:
pd.DataFrame.from_records(cas['labelsets'])

Unnamed: 0,name,description,rank
0,neurotransmitter,Clusters are assigned based on the average exp...,
1,class,The top level of cell type definition in the m...,3.0
2,subclass,The coarse level of cell type definition in th...,2.0
3,supertype,The second finest level of cell type definitio...,1.0
4,cluster,The finest level of cell type definition in th...,0.0
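To make the rank convention concrete, the sketch below orders the ranked labelsets from the top of the hierarchy down to the leaves and lists the unranked ones separately. The names and ranks are copied from the table above (descriptions omitted). It assumes that `rank` is numeric and simply missing for labelsets outside the hierarchy.

```python
# Labelset names and ranks copied from the table above; descriptions omitted.
labelsets = [
    {"name": "neurotransmitter"},  # no rank: outside the hierarchy
    {"name": "class", "rank": 3},
    {"name": "subclass", "rank": 2},
    {"name": "supertype", "rank": 1},
    {"name": "cluster", "rank": 0},  # rank 0: leaf level
]

# Keep only ranked labelsets and sort from the highest rank (root) to 0 (leaf).
ranked = sorted(
    (ls for ls in labelsets if ls.get("rank") is not None),
    key=lambda ls: ls["rank"],
    reverse=True,
)
hierarchy = [ls["name"] for ls in ranked]
unranked = [ls["name"] for ls in labelsets if ls.get("rank") is None]

print("hierarchy:", " > ".join(hierarchy))
print("outside hierarchy:", unranked)
```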


### Splitting Large CAS Files

Large CAS files can be split into smaller ones based on cell set accession IDs. The accession IDs passed to `--split_on` determine which cell sets, together with the cell sets nested under them, are included in the split CAS. By default, `split_cas` writes a single output file named `split_cas.json`; with `--multiple_outputs`, it writes a separate output file for each `--split_on` term.
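One way to picture "a cell set and everything nested under it" is a walk over parent links. The sketch below is illustrative only and is not the cas-tools implementation. It assumes each annotation records its parent in a `parent_cell_set_accession` field. The parent links in the sample data are assumed for illustration, and the `EXAMPLE_...` accessions are made up.

```python
from collections import defaultdict

# Hypothetical, heavily trimmed annotations. The CS20230722_* accessions appear
# in this notebook's output; the parent links are assumed for illustration.
annotations = [
    {"cell_set_accession": "CS20230722_CLAS_29", "labelset": "class"},
    {"cell_set_accession": "CS20230722_SUBC_314", "labelset": "subclass",
     "parent_cell_set_accession": "CS20230722_CLAS_29"},
    {"cell_set_accession": "CS20230722_SUPT_1154", "labelset": "supertype",
     "parent_cell_set_accession": "CS20230722_SUBC_314"},
    {"cell_set_accession": "EXAMPLE_OTHER_CLASS", "labelset": "class"},
    {"cell_set_accession": "EXAMPLE_OTHER_SUBCLASS", "labelset": "subclass",
     "parent_cell_set_accession": "EXAMPLE_OTHER_CLASS"},
]


def collect_subtree(annotations, root_accession):
    """Return the root accession plus all accessions nested under it."""
    # Index children by parent so the walk does not rescan the whole list.
    children = defaultdict(list)
    for a in annotations:
        parent = a.get("parent_cell_set_accession")
        if parent:
            children[parent].append(a["cell_set_accession"])

    selected, stack = set(), [root_accession]
    while stack:
        accession = stack.pop()
        if accession in selected:
            continue  # guard against accidental cycles in the data
        selected.add(accession)
        stack.extend(children[accession])
    return selected


subtree = collect_subtree(annotations, "CS20230722_CLAS_29")
print(sorted(subtree))
```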

In [4]:
!cas split_cas --help

usage: cas split_cas [-h] --cas_json CAS_JSON
                     [--split_on SPLIT_ON [SPLIT_ON ...]] [--multiple_outputs]

Split CAS JSON file based on specified cell label/s.

options:
  -h, --help            show this help message and exit
  --cas_json CAS_JSON   Path to the CAS JSON file that will be split
  --split_on SPLIT_ON [SPLIT_ON ...]
                        Cell accession_id(s) to split the CAS file.
  --multiple_outputs    If set, create multiple output files for each split_on
                        term; if not set, create a single output file named
                        `split_cas.json`.
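When the command needs to be driven from a script rather than a notebook cell, the argument list can be assembled in Python. This sketch uses only the flags shown in the help output above. `build_split_command` is a hypothetical helper, not part of cas-tools, and the snippet only prints the command rather than running it.

```python
import shlex


def build_split_command(cas_json, split_on, multiple_outputs=False):
    """Assemble the `cas split_cas` argument list from the documented flags."""
    cmd = ["cas", "split_cas", "--cas_json", cas_json, "--split_on", *split_on]
    if multiple_outputs:
        # Ask for multiple output files instead of a single split_cas.json.
        cmd.append("--multiple_outputs")
    return cmd


cmd = build_split_command(
    "data/CCN20230722.json",
    ["CS20230722_CLAS_29", "CS20230722_SUBC_314"],
    multiple_outputs=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where the `cas` command is installed.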


Split on the `29 CB Glut` cell set using accession_id `CS20230722_CLAS_29`.

In [14]:
!cas split_cas --cas_json data/CCN20230722.json --split_on CS20230722_CLAS_29

In [6]:
with open("split_cas.json", "r") as f:
    split_cas = json.load(f)
print(f"There are {len(split_cas['annotations'])} cell sets in the split CAS annotations")

There are 16 cell sets in the split CAS annotations
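Beyond comparing the two counts, a simple sanity check is that every cell set in the split file also exists in the original. The sketch below shows that check on hypothetical miniature documents; `compare_split` is an illustrative helper, not part of cas-tools. With the real data, pass the `cas` and `split_cas` dicts loaded above.

```python
def compare_split(original, split):
    """Summarise cell set counts and flag accessions absent from the original."""
    original_ids = {a["cell_set_accession"] for a in original["annotations"]}
    split_ids = {a["cell_set_accession"] for a in split["annotations"]}
    return {
        "original_count": len(original_ids),
        "split_count": len(split_ids),
        # Anything listed here would indicate a problem with the split.
        "not_in_original": sorted(split_ids - original_ids),
    }


# Hypothetical miniature documents so the sketch runs on its own.
example_original = {
    "annotations": [
        {"cell_set_accession": "CS20230722_CLAS_29"},
        {"cell_set_accession": "CS20230722_SUBC_314"},
        {"cell_set_accession": "EXAMPLE_OTHER_CLASS"},
    ]
}
example_split = {
    "annotations": [
        {"cell_set_accession": "CS20230722_CLAS_29"},
        {"cell_set_accession": "CS20230722_SUBC_314"},
    ]
}

report = compare_split(example_original, example_split)
print(report)
```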


In [15]:
from cas.file_utils import read_cas_json_file
caz = read_cas_json_file('./split_cas.json')
caz.get_all_annotations()

Unnamed: 0,labelset,cell_label,cell_set_accession,cell_fullname,cell_ontology_term_id,cell_ontology_term,rationale,rationale_dois,marker_gene_evidence,synonyms,...,author_annotation_fields.merfish.markers.combo,author_annotation_fields.CTX.size,author_annotation_fields.subclass.tf.markers.combo,author_annotation_fields.nt_type_label,author_annotation_fields.nt_type_combo_label,author_annotation_fields.CTX.cluster_id,author_annotation_fields.CTX.neighborhood_id,author_annotation_fields.F,author_annotation_fields.CTX.neighborhood_label,author_annotation_fields.M
0,neurotransmitter,Glut,CS20230722_NEUR_Glut,,CL:0000679,glutamatergic neuron,,,,,...,,,,,,,,,,
1,class,29 CB Glut,CS20230722_CLAS_29,,CL:0000540,neuron,,,,,...,,,,,,,,,,
2,subclass,314 CB Granule Glut,CS20230722_SUBC_314,,CL:0001031,cerebellar granule cell,,,,,...,,,"Pax6,Neurod2,Etv1",Glut,,,,,,
3,subclass,315 DCO UBC Glut,CS20230722_SUBC_315,,CL:4023161,unipolar brush cell,,,,,...,,,"Eomes,Lmx1a,Klf3",Glut,,,,,,
4,supertype,1154 CB Granule Glut_1,CS20230722_SUPT_1154,,,,,,,,...,,,,,,,,,,
5,supertype,1155 CB Granule Glut_2,CS20230722_SUPT_1155,,,,,,,,...,,,,,,,,,,
6,supertype,1156 DCO UBC Glut_1,CS20230722_SUPT_1156,,,,,,,,...,,,,,,,,,,
7,cluster,5197 CB Granule Glut_1,CS20230722_CLUS_5197,,,,,,,,...,"Col27a1,Barhl1,St18,Trhde,Spon1,Syt6",,,Glut,Glut,,,0.5,,0.5
8,cluster,5198 CB Granule Glut_1,CS20230722_CLUS_5198,,,,,,,,...,"Svep1,Slc17a7,Chrm2",,,Glut,Glut,,,0.45,,0.55
9,cluster,5199 CB Granule Glut_1,CS20230722_CLUS_5199,,,,,,,,...,"Eomes,Col27a1,Calb2",,,Glut,Glut,,,0.44,,0.56
