## Introduction

This notebook covers the essential setup for the analysis workflow. We begin by installing the required packages and downloading the latest version of the Human Brain Cell Atlas CAS (Cell Annotation Schema) file. The sections that follow walk through processing and annotating the data.


### How to Run the Notebooks

For detailed instructions on setting up and running the notebooks, please refer to the [README.md](https://github.com/cellannotation/cas-tools/blob/main/notebooks/README.md) in the notebooks directory.


### Downloading the Latest Version of the Human Brain Cell Atlas v1.0 (Non-neuronal) CAS File

In [2]:
!wget https://raw.githubusercontent.com/brain-bican/human-brain-cell-atlas_v1_non-neuronal/main/CS202210140.json -O CS202210140.json

--2025-03-18 11:54:47--  https://raw.githubusercontent.com/brain-bican/human-brain-cell-atlas_v1_non-neuronal/main/CS202210140.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8000::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33310500 (32M) [text/plain]
Saving to: ‘CS202210140.json’


2025-03-18 11:54:57 (3.08 MB/s) - ‘CS202210140.json’ saved [33310500/33310500]



In [3]:
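import json

# Load the downloaded CAS file into a Python dictionary
with open("CS202210140.json", "r") as f:
    cas = json.load(f)

# Preview the first five annotation records
cas["annotations"][:5]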

[{'labelset': 'Cluster',
  'cell_label': 'Mgl_4',
  'cell_set_accession': 'CS202210140_5',
  'parent_cell_set_accession': 'CS202210140_464',
  'author_annotation_fields': {'Cluster ID': '4',
   'Class auto_annotation': 'MGL',
   'Neurotransmitter auto_annotation': 'None',
   'Neuropeptide auto_annotation': 'NAMPT',
   'Subtype auto_annotation': 'None',
   'Transferred MTG Label': 'Micro-PVM',
   'Top three regions': 'Spinal cord: 31.8%, Pons: 26.0%, Medulla: 13.2%',
   'Top three dissections': 'Human SpC: 31.8%, Human MN: 11.5%, Human PnEN: 11.4%',
   'Top Enriched Genes': 'SRGN, RGS1, GPR183, CD69, HLA-DRA, OLR1, TNFRSF1B, IFI30, CXCR4, CD74',
   'Number of cells': '745.0',
   'DoubletFinder score': '0.013652992',
   'Total UMI': '2439.644295',
   'Fraction unspliced': '0.672263033',
   'Fraction mitochondrial': '0.005220894',
   'H19.30.002': '176.0',
   'H19.30.001': '135.0',
   'H18.30.002': '434.0',
   'H18.30.001': '0.0',
   'Fraction cells from top donor': '0.582550336',
   'Num
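
Because each record is long and the preview above is cut off, a compact summary can be easier to read than the raw list. The sketch below counts annotations per `labelset`. It assumes only the structure visible in the output (an `annotations` list of dictionaries with a `labelset` key); `summarize_labelsets`, `sample_cas`, and the `ExampleLabelset` value are hypothetical names introduced for illustration.

```python
from collections import Counter


def summarize_labelsets(cas: dict) -> dict:
    """Count how many annotation records belong to each labelset."""
    counts = Counter(
        annotation.get("labelset", "<missing>")
        for annotation in cas.get("annotations", [])
    )
    return dict(counts)


# Tiny stand-in for the loaded file; only the first record mirrors the real data,
# the other two are made-up placeholders.
sample_cas = {
    "annotations": [
        {"labelset": "Cluster", "cell_label": "Mgl_4", "cell_set_accession": "CS202210140_5"},
        {"labelset": "Cluster", "cell_label": "example_a", "cell_set_accession": "EX_1"},
        {"labelset": "ExampleLabelset", "cell_label": "example_b", "cell_set_accession": "EX_2"},
    ]
}

print(summarize_labelsets(sample_cas))
```

In the notebook, calling `summarize_labelsets(cas)` on the loaded file should give a quick overview of which labelsets it contains.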

### Generating Random Annotation Data

In this section, we generate synthetic annotation data for testing purposes. This random data simulates real annotation inputs, allowing us to validate our processing pipeline and ensure that the CAS file can be updated correctly with new annotations.
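
Random values change on every run, which can make test failures hard to reproduce. One option, sketched here as an assumption rather than as part of the original workflow, is to draw the values from a seeded generator. The function name `make_random_annotations` and the default seed are hypothetical; the column names match those used in this notebook.

```python
import random


def make_random_annotations(accession_ids, seed=42):
    """Build synthetic annotation columns that are identical for a given seed."""
    ids = list(accession_ids)
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return {
        "cell_set_accession": ids,
        "annotation_1": [rng.randint(1, 100) for _ in ids],
        "annotation_2": [round(rng.uniform(0, 1), 3) for _ in ids],
    }


# Two calls with the same seed yield the same values.
first = make_random_annotations(["CS202210140_5", "CS202210140_6"], seed=0)
second = make_random_annotations(["CS202210140_5", "CS202210140_6"], seed=0)
print(first == second)
```

The returned dictionary has the same shape as the `data` dictionary built in the next cell, so it could be passed to `pd.DataFrame` in the same way.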

In [4]:
import random
import pandas as pd

# Collect the accession ID of every cell set in the CAS file
accession_ids = [annotation["cell_set_accession"] for annotation in cas["annotations"]]

# Build one random integer column and one random float column per cell set
data = {
    "cell_set_accession": accession_ids,
    "annotation_1": [random.randint(1, 100) for _ in range(len(accession_ids))],
    "annotation_2": [round(random.uniform(0, 1), 3) for _ in range(len(accession_ids))],
}
new_data = pd.DataFrame(data)
new_data.head()

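|   | cell_set_accession | annotation_1 | annotation_2 |
|---|--------------------|--------------|--------------|
| 0 | CS202210140_5      | 73           | 0.237        |
| 1 | CS202210140_6      | 48           | 0.543        |
| 2 | CS202210140_7      | 14           | 0.156        |
| 3 | CS202210140_8      | 26           | 0.566        |
| 4 | CS202210140_9      | 60           | 0.109        |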


In [5]:
new_data.to_csv("annotation.csv", index=False)

### Integrating New Annotation Data into the CAS File

In this section, we incorporate the newly created annotation data into the existing CAS file and save the result as a new file. This approach allows you to preserve the original file while verifying the updated annotations.
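
To make the idea of joining on cell set IDs concrete, here is a minimal pure-Python sketch of such a merge. It is an illustration only and not the cas-tools implementation: it assumes that each CSV row is matched to an annotation by `cell_set_accession` and that the remaining columns are written into `author_annotation_fields`. The function name `merge_author_annotations` is hypothetical.

```python
import copy


def merge_author_annotations(cas, rows, key="cell_set_accession"):
    """Return a copy of `cas` with each row's extra columns added to the matching annotation."""
    updated = copy.deepcopy(cas)  # leave the original structure untouched
    rows_by_id = {row[key]: row for row in rows}
    for annotation in updated.get("annotations", []):
        row = rows_by_id.get(annotation.get(key))
        if row is None:
            continue  # no new data for this cell set
        fields = annotation.setdefault("author_annotation_fields", {})
        for column, value in row.items():
            if column != key:
                fields[column] = value
    return updated


original = {
    "annotations": [
        {"cell_set_accession": "CS202210140_5", "author_annotation_fields": {"Cluster ID": "4"}},
        {"cell_set_accession": "CS202210140_6"},
    ]
}
rows = [{"cell_set_accession": "CS202210140_5", "annotation_1": 73, "annotation_2": 0.237}]

merged = merge_author_annotations(original, rows)
print(merged["annotations"][0]["author_annotation_fields"])
```

Working on a deep copy mirrors the approach described above: the original data stays intact while the merged result is inspected separately.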

In [6]:
!cas add_author_annotations --cas_json CS202210140.json --csv annotation.csv --join_on_cellset_ids --output updated_CS202210140.json

In [7]:
with open("updated_CS202210140.json", "r") as f:
    updated_cas = json.load(f)
updated_cas["annotations"][:5]

[{'labelset': 'Cluster',
  'cell_label': 'Mgl_4',
  'cell_set_accession': 'CS202210140_5',
  'parent_cell_set_accession': 'CS202210140_464',
  'author_annotation_fields': {'Cluster ID': '4',
   'Class auto_annotation': 'MGL',
   'Neurotransmitter auto_annotation': 'None',
   'Neuropeptide auto_annotation': 'NAMPT',
   'Subtype auto_annotation': 'None',
   'Transferred MTG Label': 'Micro-PVM',
   'Top three regions': 'Spinal cord: 31.8%, Pons: 26.0%, Medulla: 13.2%',
   'Top three dissections': 'Human SpC: 31.8%, Human MN: 11.5%, Human PnEN: 11.4%',
   'Top Enriched Genes': 'SRGN, RGS1, GPR183, CD69, HLA-DRA, OLR1, TNFRSF1B, IFI30, CXCR4, CD74',
   'Number of cells': '745.0',
   'DoubletFinder score': '0.013652992',
   'Total UMI': '2439.644295',
   'Fraction unspliced': '0.672263033',
   'Fraction mitochondrial': '0.005220894',
   'H19.30.002': '176.0',
   'H19.30.001': '135.0',
   'H18.30.002': '434.0',
   'H18.30.001': '0.0',
   'Fraction cells from top donor': '0.582550336',
   'Num
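
The printed preview is cut off before any newly added fields would appear, so a programmatic check may be more reliable than reading the output. The sketch below reports which cell sets lack the expected fields. It assumes the new CSV columns end up under `author_annotation_fields` with the column names as keys, which this notebook does not confirm; `find_missing_fields` and `sample_updated` are hypothetical names.

```python
def find_missing_fields(cas, field_names):
    """Map each cell_set_accession to the expected fields it does not contain."""
    missing = {}
    for annotation in cas.get("annotations", []):
        fields = annotation.get("author_annotation_fields") or {}
        absent = [name for name in field_names if name not in fields]
        if absent:
            missing[annotation.get("cell_set_accession")] = absent
    return missing


# Small stand-in for the updated file; the second record is deliberately incomplete.
sample_updated = {
    "annotations": [
        {
            "cell_set_accession": "CS202210140_5",
            "author_annotation_fields": {"annotation_1": "73", "annotation_2": "0.237"},
        },
        {
            "cell_set_accession": "CS202210140_6",
            "author_annotation_fields": {"annotation_1": "48"},
        },
    ]
}

print(find_missing_fields(sample_updated, ["annotation_1", "annotation_2"]))
```

In the notebook, an empty result from `find_missing_fields(updated_cas, ["annotation_1", "annotation_2"])` would suggest that every cell set received both columns, provided the storage assumption above holds.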