## Introduction

In this notebook, we demonstrate how to split a large Whole Mouse Brain CAS (Cell Annotation Schema) file into smaller, more manageable files. The workflow includes:

- Setting up and running the notebooks.
- Downloading the latest Whole Mouse Brain CAS file.
- Inspecting the original file to check the cell set count in the annotations.
- Splitting the large CAS file using the `29 CB Glut` cell set as a reference point.
- Comparing the annotations and cell set counts before and after the split to ensure the operation was successful.


### How to Run the Notebooks

For detailed instructions on setting up and running the notebooks, please refer to the [README.md](https://github.com/cellannotation/cas-tools/blob/main/notebooks/README.md) in the notebooks directory.


### Downloading the latest version of the Whole Mouse Brain CAS file

In [2]:
!wget https://raw.githubusercontent.com/brain-bican/whole_mouse_brain_taxonomy/refs/heads/main/CCN20230722.json -O CCN20230722.json

--2025-03-18 11:55:16--  https://raw.githubusercontent.com/brain-bican/whole_mouse_brain_taxonomy/refs/heads/main/CCN20230722.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8000::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12364592 (12M) [text/plain]
Saving to: ‘CCN20230722.json’


2025-03-18 11:55:20 (3.28 MB/s) - ‘CCN20230722.json’ saved [12364592/12364592]



In [3]:
import json
with open("CCN20230722.json", "r") as f:  # same path that wget wrote to above
    cas = json.load(f)
print(f"There are {len(cas['annotations'])} cell sets in CAS annotations")

There are 6905 cell sets in CAS annotations
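A total count says little about how the cell sets are distributed across the taxonomy. The sketch below tallies annotations per `labelset`, a key that appears in the annotation records shown later in this notebook. It runs on a small in-memory sample so that it is self-contained; `sample_annotations` and `count_by_labelset` are illustrative names, not part of cas-tools. To apply it to the real file, pass `cas["annotations"]` instead.

```python
from collections import Counter

# Tiny stand-in for cas["annotations"]; the real file holds thousands of entries.
# The third entry is made up purely for illustration.
sample_annotations = [
    {"labelset": "neurotransmitter", "cell_label": "Glut"},
    {"labelset": "class", "cell_label": "29 CB Glut"},
    {"labelset": "class", "cell_label": "hypothetical class"},
]


def count_by_labelset(annotations):
    """Return a Counter mapping each labelset name to its number of cell sets."""
    # Annotations without a labelset key are grouped under "<missing>".
    return Counter(a.get("labelset", "<missing>") for a in annotations)


labelset_counts = count_by_labelset(sample_annotations)
for name, n in labelset_counts.most_common():
    print(f"{name}: {n}")
```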


### Splitting Large CAS Files

Large CAS files can be split into smaller ones based on cell set accession IDs. The accession ID(s) passed to `--split_on` determine which cell sets are included in the split CAS. By default, the `split_cas` command writes a single output file named `split_cas.json`; with `--multiple_outputs`, it writes a separate output file for each `--split_on` term.
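To make the idea of selecting a cell set and everything nested under it concrete, here is a conceptual sketch. It is not the cas-tools implementation. It assumes that each annotation links to its parent through a `parent_cell_set_accession` field, which is not shown in this notebook, and the accession IDs in the sample are made up. The real command may also retain related cell sets, such as the `neurotransmitter` entry visible in the split output in this notebook.

```python
from collections import defaultdict, deque


def collect_subtree(annotations, root_accession):
    """Return the accession IDs of root_accession and all cell sets nested under it.

    Assumes parent links are stored in "parent_cell_set_accession" (an assumption).
    """
    # Build a parent -> children lookup once.
    children = defaultdict(list)
    for a in annotations:
        parent = a.get("parent_cell_set_accession")
        if parent:
            children[parent].append(a["cell_set_accession"])

    # Breadth-first walk down from the root.
    selected = {root_accession}
    queue = deque([root_accession])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


# Hypothetical miniature taxonomy with two independent classes.
sample = [
    {"cell_set_accession": "CLAS_A"},
    {"cell_set_accession": "SUBC_A1", "parent_cell_set_accession": "CLAS_A"},
    {"cell_set_accession": "CLUS_A1a", "parent_cell_set_accession": "SUBC_A1"},
    {"cell_set_accession": "CLAS_B"},
    {"cell_set_accession": "SUBC_B1", "parent_cell_set_accession": "CLAS_B"},
]

subtree = collect_subtree(sample, "CLAS_A")
print(sorted(subtree))
```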

In [4]:
!cas split_cas --help

usage: cas split_cas [-h] --cas_json CAS_JSON
                     [--split_on SPLIT_ON [SPLIT_ON ...]] [--multiple_outputs]

Split CAS JSON file based on specified cell label/s.

options:
  -h, --help            show this help message and exit
  --cas_json CAS_JSON   Path to the CAS JSON file that will be split
  --split_on SPLIT_ON [SPLIT_ON ...]
                        Cell accession_id(s) to split the CAS file.
  --multiple_outputs    If set, create multiple output files for each split_on
                        term; if not set, create a single output file named
                        `split_cas.json`.


Split on the `29 CB Glut` cell set using accession_id `CS20230722_CLAS_29`.

In [5]:
!cas split_cas --cas_json CCN20230722.json --split_on CS20230722_CLAS_29

In [6]:
with open("split_cas.json", "r") as f:
    split_cas = json.load(f)
print(f"There are {len(split_cas['annotations'])} cell sets in split CAS annotations")

There are 16 cell sets in split CAS annotations
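Beyond comparing counts, one simple sanity check is that every cell set in the split file also exists in the original. The sketch below compares the two annotation lists by `cell_set_accession`. It assumes that key is the unique identifier of a cell set; `compare_annotations` and the sample data are illustrative, and one accession ID is made up. With the real data, call `compare_annotations(cas["annotations"], split_cas["annotations"])`.

```python
def compare_annotations(original, split):
    """Summarise how the split annotations relate to the original ones."""
    # Skip any annotation that lacks an accession ID.
    original_ids = {a["cell_set_accession"] for a in original if "cell_set_accession" in a}
    split_ids = {a["cell_set_accession"] for a in split if "cell_set_accession" in a}
    return {
        "original_count": len(original_ids),
        "split_count": len(split_ids),
        # IDs present in the split file but absent from the original (should be empty).
        "unexpected": sorted(split_ids - original_ids),
    }


# Miniature stand-ins for cas["annotations"] and split_cas["annotations"].
original_sample = [
    {"cell_set_accession": "CS20230722_NEUR_Glut"},
    {"cell_set_accession": "CS20230722_CLAS_29"},
    {"cell_set_accession": "CS_HYPOTHETICAL_1"},  # made-up ID for illustration
]
split_sample = [
    {"cell_set_accession": "CS20230722_NEUR_Glut"},
    {"cell_set_accession": "CS20230722_CLAS_29"},
]

report = compare_annotations(original_sample, split_sample)
print(report)
```

An empty `unexpected` list together with a smaller `split_count` is consistent with a successful split, although it does not verify that the selected cell sets are the intended ones.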


In [7]:
split_cas["annotations"]

[{'labelset': 'neurotransmitter',
  'cell_label': 'Glut',
  'cell_set_accession': 'CS20230722_NEUR_Glut',
  'cell_ontology_term': 'glutamatergic neuron',
  'cell_ontology_term_id': 'CL:0000679'},
 {'labelset': 'class',
  'cell_label': '29 CB Glut',
  'cell_set_accession': 'CS20230722_CLAS_29',
  'author_annotation_fields': {'neighborhood': 'NN-IMN-GC',
   'Neuronal': 'Y',
   'Glial': 'None',
   'CTX.cluster_label': 'None',
   'supertype.markers.combo _within subclass_': 'None',
   'supertype.markers.combo': 'None',
   'CCF_acronym.freq': 'None',
   'sex.bias': 'None',
   'cluster.markers.combo _within subclass_': 'None',
   'nt.markers': 'None',
   'CTX.supertype_label': 'None',
   'v2.size': 'None',
   'v3.size': 'None',
   'cluster.TF.markers.combo': 'None',
   'multiome.size': 'None',
   'Dark': 'None',
   'subclass.markers.combo': 'None',
   'Light': 'None',
   'anatomical_annotation': 'None',
   'CTX.supertype_id': 'None',
   'np.markers': 'None',
   'CTX.subclass_id': 'None',
   