# Taxonomy model

This notebook will explain how cell type taxonomies are organized in the `abc_atlas_access` package. The goal is to give users the knowledge and skills they need to explore an arbitrary cell type taxonomy.

In [1]:
import abc_atlas_access.abc_atlas_cache.abc_project_cache as abc_project_cache

In [2]:
import numpy as np
import os
import pandas as pd

cache_dir = "/Users/scott.daniel/KnowledgeEngineering/cell_type_mapper/examples/data/abc_cache"

if not os.path.isdir(cache_dir):
    raise RuntimeError(
        "Set cache_dir above to the path to the directory where you want to download ABC Atlas data"
    )

abc_cache = abc_project_cache.AbcProjectCache.from_cache_dir(cache_dir)

type.compare_manifests('releases/20250531/manifest.json', 'releases/20251031/manifest.json')
To load another version of the dataset, run
type.load_manifest('releases/20251031/manifest.json')


As discussed elsewhere, data releases in `abc_atlas_access` are divided into "directories." Each directory contains a discrete set of data or metadata. There is a simple command to list all of the available directories.

In [3]:
abc_cache.list_directories

['ASAP-PMDBS-10X',
 'ASAP-PMDBS-taxonomy',
 'Allen-CCF-2020',
 'HMBA-10xMultiome-BG',
 'HMBA-10xMultiome-BG-Aligned',
 'HMBA-BG-taxonomy-CCN20250428',
 'MERFISH-C57BL6J-638850',
 'MERFISH-C57BL6J-638850-CCF',
 'MERFISH-C57BL6J-638850-imputed',
 'MERFISH-C57BL6J-638850-sections',
 'SEAAD',
 'SEAAD-taxonomy',
 'WHB-10Xv3',
 'WHB-taxonomy',
 'WMB-10X',
 'WMB-10XMulti',
 'WMB-10Xv2',
 'WMB-10Xv3',
 'WMB-neighborhoods',
 'WMB-taxonomy',
 'Zeng-Aging-Mouse-10Xv3',
 'Zeng-Aging-Mouse-WMB-taxonomy',
 'Zhuang-ABCA-1',
 'Zhuang-ABCA-1-CCF',
 'Zhuang-ABCA-2',
 'Zhuang-ABCA-2-CCF',
 'Zhuang-ABCA-3',
 'Zhuang-ABCA-3-CCF',
 'Zhuang-ABCA-4',
 'Zhuang-ABCA-4-CCF']

# HMBA Basal Ganglia taxonomy

We will start by exploring the HMBA Basal Ganglia taxonomy, which is located in this directory.

In [4]:
hmba_taxonomy_dir = 'HMBA-BG-taxonomy-CCN20250428'

Taxonomy definitions are treated as metadata in `abc_atlas_access` (as opposed to "data", which we generally use to refer to information pertaining directly to a biological sample, such as a cell-by-gene matrix). We can list all of the metadata files associated with this taxonomy as follows:

In [5]:
abc_cache.list_metadata_files(directory=hmba_taxonomy_dir)

['abbreviation_term',
 'cell_2d_embedding_coordinates',
 'cell_to_cluster_membership',
 'cluster',
 'cluster_annotation_term',
 'cluster_annotation_term_set',
 'cluster_annotation_to_abbreviation_map',
 'cluster_to_cluster_annotation_membership']

## Aside: how taxonomies are derived

Before we discuss which of these files are most useful and what they contain, it is worth briefly discussing how a cell type taxonomy is derived. This discussion is a gross generalization, but it qualitatively corresponds to the important features of all of the taxonomies served through `abc_atlas_access`.

A transcriptomic cell type taxonomy generally starts with a large collection of cell-by-gene data. The Yao et al. 2023 Whole Mouse Brain taxonomy was derived from a set of ~4 million cells, each sequenced in ~32 thousand genes. This data is run through an almost purely data-driven algorithm to subdivide the dataset into "clusters" (the Whole Mouse Brain taxonomy is made up of 5322 clusters at its finest level). These clusters are groups of cells that have something in common according to the data-driven algorithm. They are more appropriately thought of as artifacts of the computation than anything that is biologically real. These "purely computational" clusters (this is a very simplifying statement, but it highlights an important detail in how `abc_atlas_access` data is organized) are identified by the column `cluster_alias` in our data model.

Once the data has been clustered by the algorithm, expert biologists begin looking at the clusters to determine the biological reality behind what the algorithm detected. They group the clusters into taxons that do (according to the experts) have biological meaning. These taxons are referred to in the `abc_atlas_access` files as "annotation terms."

Sometimes, these annotation terms are grouped into other, larger annotation terms. These different levels of grouping are referred to in `abc_atlas_access` as "annotation term sets."

For instance, the HMBA Basal Ganglia taxonomy we are currently working with is made up of
- 1435 Clusters, which are collected into
- 61 Groups, which are collected into
- 36 Subclasses, which are collected into
- 12 Classes, which are collected into
- 4 Neighborhoods

In this taxonomy, "Neighborhood", "Class", "Subclass", "Group", and "Cluster" are all annotation term sets associated with the taxonomy. The 1548 individual clusters, groups, subclasses, classes, and neighborhoods are the annotation terms associated with the taxonomy. To reiterate, the taxonomy is made up of 1548 annotation terms and 5 annotation term sets.
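
As a quick sanity check on those totals, the per-level counts quoted above can be added up directly. This is plain arithmetic on the numbers in the list, not a query against the cache.

```python
# Term counts per annotation term set, as quoted in the list above.
terms_per_set = {
    "Neighborhood": 4,
    "Class": 12,
    "Subclass": 36,
    "Group": 61,
    "Cluster": 1435,
}

n_term_sets = len(terms_per_set)
n_terms = sum(terms_per_set.values())

print(f"{n_term_sets} annotation term sets, {n_terms} annotation terms")
```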

## What files define a taxonomy?

The minimum set of files from the `hmba_taxonomy_dir` needed to construct the taxonomy and associate it with the cells in ABC Atlas are
- `cluster_annotation_term_set`: a file defining the annotation term sets making up the taxonomy
- `cluster_annotation_term`: a file defining the individual annotation terms making up the taxonomy and any parent-child relationships between them
- `cluster_to_cluster_annotation_membership`: a file that associates the annotation terms with the data-driven clusters in the taxonomy
- `cell_to_cluster_membership`: a file that links individual cells with the data driven clusters in the taxonomy
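
To make the role of each file concrete, here is a minimal sketch that replaces the four files with small dictionaries and follows one cell through them. The dictionary names are illustrative and are not part of `abc_atlas_access`; the values are copied from outputs shown elsewhere in this notebook.

```python
# Toy stand-ins for the four files, keyed the way the joins use them.
# cluster_annotation_term_set: term set label -> term set name
term_set_names = {
    "CCN20250428_LEVEL_3": "Group",
    "CCN20250428_LEVEL_4": "Cluster",
}
# cluster_annotation_term: term label -> (name, term set label, parent label)
terms = {
    "CS20250428_CLUST_0268": ("Human-451", "CCN20250428_LEVEL_4", "CS20250428_GROUP_0062"),
    "CS20250428_GROUP_0062": ("VLMC", "CCN20250428_LEVEL_3", "CS20250428_SUBCL_0035"),
}
# cluster_to_cluster_annotation_membership (Cluster rows only): alias -> term label
alias_to_cluster_term = {"Human-451": "CS20250428_CLUST_0268"}
# cell_to_cluster_membership: cell label -> cluster_alias
cell_to_alias = {"AAACAGCCAAATGCCC-2362_A05": "Human-451"}

# Follow one cell through the chain: cell -> alias -> cluster term -> parent term
cell = "AAACAGCCAAATGCCC-2362_A05"
alias = cell_to_alias[cell]
cluster_label = alias_to_cluster_term[alias]
cluster_name, cluster_set, group_label = terms[cluster_label]
group_name, group_set, _ = terms[group_label]

print(cell, "->", alias, "->", cluster_label)
print(term_set_names[cluster_set], cluster_name, "is a child of",
      term_set_names[group_set], group_name)
```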

## cluster_annotation_term_set

The `cluster_annotation_term_set` file lists the annotation term sets in the taxonomy (i.e. the structured groups into which taxons have been collected).

In [6]:
hmba_term_set_df = abc_cache.get_metadata_dataframe(
    directory=hmba_taxonomy_dir,
    file_name='cluster_annotation_term_set'
)

cluster_annotation_term_set.csv: 100%|████████| 223/223 [00:00<00:00, 3.21kMB/s]


In [7]:
hmba_term_set_df

Unnamed: 0,label,name,description,order
0,CCN20250428_LEVEL_0,Neighborhood,Neighborhood,0
1,CCN20250428_LEVEL_1,Class,Class,1
2,CCN20250428_LEVEL_2,Subclass,Subclass,2
3,CCN20250428_LEVEL_3,Group,Group,3
4,CCN20250428_LEVEL_4,Cluster,Cluster,4


Each term set has a unique `label` and a human readable `name`. The `label` is generally how the term set will be referred to throughout the other files defining the taxonomy. The `name` is how the term set is referred to "in conversation." The `order` column is a courtesy ordinal defining which levels in the taxonomy are grosser or finer than which (i.e. in this taxonomy, `Cluster` is finer than `Group`, which is finer than `Subclass`, etc.).
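
The `order` column can be used to derive the sequence of levels instead of typing the labels by hand. The sketch below runs on a toy copy of the table above with the rows deliberately shuffled. It assumes that a larger `order` means a finer level, which holds for this taxonomy but should be checked before relying on it elsewhere.

```python
import pandas as pd

# Toy copy of the term set table shown above, rows shuffled on purpose.
term_set_df = pd.DataFrame(
    {
        "label": [
            "CCN20250428_LEVEL_3",
            "CCN20250428_LEVEL_0",
            "CCN20250428_LEVEL_4",
            "CCN20250428_LEVEL_1",
            "CCN20250428_LEVEL_2",
        ],
        "name": ["Group", "Neighborhood", "Cluster", "Class", "Subclass"],
        "order": [3, 0, 4, 1, 2],
    }
)

# Sort by `order`, descending, to go from the finest level to the grossest.
fine_to_gross = term_set_df.sort_values("order", ascending=False)["label"].tolist()
print(fine_to_gross)
```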

To see the parent-child relationships in the taxonomy more explicitly, we must look at the `cluster_annotation_term.csv` file.

## cluster_annotation_term

The `cluster_annotation_term` file defines the individual annotation terms in the taxonomy. It tells you which annotation terms belong to which annotation term sets (as defined in `cluster_annotation_term_set`) as well as the parent-child relationships between taxons.

The annotation terms in this file are the entities in the taxonomy generally assigned biological meaning.

In [8]:
hmba_term_df = abc_cache.get_metadata_dataframe(
    directory=hmba_taxonomy_dir,
    file_name='cluster_annotation_term'
)

cluster_annotation_term.csv: 100%|██████████| 206k/206k [00:00<00:00, 1.47MMB/s]


In [9]:
# display a random sample of terms
hmba_term_df.sample(n=15, random_state=np.random.default_rng(4321))

Unnamed: 0,label,name,cluster_annotation_term_set_label,cluster_annotation_term_set_name,color_hex_triplet,term_order,term_set_order,parent_term_label,parent_term_name,parent_term_set_label
141,CS20250428_CLUST_1470,Marmoset-507,CCN20250428_LEVEL_4,Cluster,#eea2dd,28,4,CS20250428_GROUP_0039,Astrocyte,CCN20250428_LEVEL_3
4,CS20250428_CLASS_0000,Astro-Epen,CCN20250428_LEVEL_1,Class,#6ec0da,1,1,CS20250428_NEIGH_0001,Nonneuron,CCN20250428_LEVEL_0
1499,CS20250428_CLUST_1257,Marmoset-211,CCN20250428_LEVEL_4,Cluster,#adc82b,1386,4,CS20250428_GROUP_0051,STR D1D2 Hybrid MSN,CCN20250428_LEVEL_3
1351,CS20250428_CLUST_1028,Macaque-404,CCN20250428_LEVEL_4,Cluster,#4167c3,1238,4,CS20250428_GROUP_0058,STRv D2 MSN,CCN20250428_LEVEL_3
1242,CS20250428_CLUST_1205,Marmoset-120,CCN20250428_LEVEL_4,Cluster,#a52397,1129,4,CS20250428_GROUP_0056,STRv D1 MSN,CCN20250428_LEVEL_3
1384,CS20250428_CLUST_1892,Marmoset-1025,CCN20250428_LEVEL_4,Cluster,#982e45,1271,4,CS20250428_GROUP_0058,STRv D2 MSN,CCN20250428_LEVEL_3
190,CS20250428_CLUST_0648,Macaque-461,CCN20250428_LEVEL_4,Cluster,#9ad367,77,4,CS20250428_GROUP_0012,Ependymal,CCN20250428_LEVEL_3
61,CS20250428_GROUP_0026,Oligo PLEKHG1,CCN20250428_LEVEL_3,Group,#401e66,10,3,CS20250428_SUBCL_0022,Oligodendrocyte,CCN20250428_LEVEL_2
1443,CS20250428_CLUST_0516,Human-484,CCN20250428_LEVEL_4,Cluster,#73f9d4,1330,4,CS20250428_GROUP_0057,STRv D1 NUDAP MSN,CCN20250428_LEVEL_3
1132,CS20250428_CLUST_1946,Marmoset-1239,CCN20250428_LEVEL_4,Cluster,#555912,1019,4,CS20250428_GROUP_0066,AMY-SLEA-BNST GABA,CCN20250428_LEVEL_3


Each row in this dataframe corresponds to a taxon in the taxonomy. The most valuable columns are
- `label`: this is the stable identifier for the annotation term and is generally used to identify the taxon everywhere in our data model.
- `name`: the human-readable name for the annotation term. This usually carries some biologically useful information.
- `cluster_annotation_term_set_label`: the label of the annotation term set to which this annotation term belongs. This links back to the `cluster_annotation_term_set` file.
- `parent_term_label`: the label of the annotation that is the parent to this taxon (if any).
- `parent_term_set_label`: the label of the annotation term set to which the parent belongs (though this is somewhat redundant since `parent_term_label` ought to uniquely identify the parent term).

**Note:** annotation terms for all of the term sets are listed in this single file. To consider only one level of the taxonomy (for instance, `Subclass`), you must filter on `cluster_annotation_term_set_label`.
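
Because every term records its parent in `parent_term_label`, the ancestry of any taxon can be recovered by following that column until it runs out. The sketch below does this with a plain dictionary. The helper `lineage` is hypothetical, the three terms are copied from outputs in this notebook, and using `None` for the parent of the root is an assumption of the toy data (the real dataframe would hold a missing value there).

```python
# label -> (name, parent_term_label); the root term has no parent.
terms = {
    "CS20250428_SUBCL_0025": ("M Dopa", "CS20250428_CLASS_0009"),
    "CS20250428_CLASS_0009": ("M Dopa", "CS20250428_NEIGH_0000"),
    "CS20250428_NEIGH_0000": ("Glut Sero Dopa", None),
}


def lineage(label, terms):
    """Return [(label, name), ...] from the given term up to the root."""
    chain = []
    while label is not None:
        name, parent = terms[label]
        chain.append((label, name))
        label = parent
    return chain


for term_label, term_name in lineage("CS20250428_SUBCL_0025", terms):
    print(term_label, term_name)
```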

### Example use: create a denormalized taxonomy data frame

The data in `cluster_annotation_term` is highly normalized. Every parent-child relationship at every level of the taxonomy is in that one file. Suppose we wanted to create a single dataframe in which each row was a cluster (the finest level of this taxonomy) and the columns represented the parents of that row at each grosser level of the taxonomy.

In [10]:
# create a dict mapping term_set_label to term_set_name
level_name_lookup = {
    label: name for label, name in zip(hmba_term_set_df.label, hmba_term_set_df.name)
}

In [11]:
level_name_lookup

{'CCN20250428_LEVEL_0': 'Neighborhood',
 'CCN20250428_LEVEL_1': 'Class',
 'CCN20250428_LEVEL_2': 'Subclass',
 'CCN20250428_LEVEL_3': 'Group',
 'CCN20250428_LEVEL_4': 'Cluster'}

#### Proof of concept

Before collating the entire taxonomy into a single dataframe, let's show how we would create a dataframe that links subclasses and classes.

Using the `cluster_annotation_term_set_label` column, we can select the subset of rows in `cluster_annotation_term` corresponding to just one level in the taxonomy.

In [12]:
just_subclasses = hmba_term_df[
    hmba_term_df.cluster_annotation_term_set_label=='CCN20250428_LEVEL_2'
]

In [13]:
just_subclasses.sample(n=5, random_state=np.random.default_rng(6000))

Unnamed: 0,label,name,cluster_annotation_term_set_label,cluster_annotation_term_set_name,color_hex_triplet,term_order,term_set_order,parent_term_label,parent_term_name,parent_term_set_label
50,CS20250428_SUBCL_0036,ACx MEIS2 GABA,CCN20250428_LEVEL_2,Subclass,#3636ed,35,2,CS20250428_CLASS_0012,Cx GABA,CCN20250428_LEVEL_1
31,CS20250428_SUBCL_0025,M Dopa,CCN20250428_LEVEL_2,Subclass,#77f0ca,16,2,CS20250428_CLASS_0009,M Dopa,CCN20250428_LEVEL_1
40,CS20250428_SUBCL_0005,CN LHX8 GABA,CCN20250428_LEVEL_2,Subclass,#caf28b,25,2,CS20250428_CLASS_0006,F M GABA,CCN20250428_LEVEL_1
36,CS20250428_SUBCL_0031,STR RSPO2 GABA,CCN20250428_LEVEL_2,Subclass,#5b0e63,21,2,CS20250428_CLASS_0005,CN MGE GABA,CCN20250428_LEVEL_1
20,CS20250428_SUBCL_0015,Lymphocyte,CCN20250428_LEVEL_2,Subclass,#aa66d4,5,2,CS20250428_CLASS_0008,Immune,CCN20250428_LEVEL_1


In [14]:
just_classes = hmba_term_df[
    hmba_term_df.cluster_annotation_term_set_label=='CCN20250428_LEVEL_1'
]

In [15]:
just_classes.sample(n=5, random_state=np.random.default_rng(81912))

Unnamed: 0,label,name,cluster_annotation_term_set_label,cluster_annotation_term_set_name,color_hex_triplet,term_order,term_set_order,parent_term_label,parent_term_name,parent_term_set_label
12,CS20250428_CLASS_0006,F M GABA,CCN20250428_LEVEL_1,Class,#cd0f13,9,1,CS20250428_NEIGH_0002,Subpallium GABA,CCN20250428_LEVEL_0
15,CS20250428_CLASS_0002,CN GABA-Glut,CCN20250428_LEVEL_1,Class,#1c8d83,12,1,CS20250428_NEIGH_0003,Subpallium GABA-Glut,CCN20250428_LEVEL_0
6,CS20250428_CLASS_0010,OPC-Oligo,CCN20250428_LEVEL_1,Class,#8605d4,3,1,CS20250428_NEIGH_0001,Nonneuron,CCN20250428_LEVEL_0
8,CS20250428_CLASS_0007,F M Glut,CCN20250428_LEVEL_1,Class,#7d0f09,5,1,CS20250428_NEIGH_0000,Glut Sero Dopa,CCN20250428_LEVEL_0
7,CS20250428_CLASS_0011,Vascular,CCN20250428_LEVEL_1,Class,#cba5ec,4,1,CS20250428_NEIGH_0001,Nonneuron,CCN20250428_LEVEL_0


For the sake of demonstration, let's strip those two dataframes down to contain just `label`, `name`, and `parent_term_label` columns

In [16]:
just_subclasses = just_subclasses[['label', 'name', 'parent_term_label']]
just_classes = just_classes[['label', 'name', 'parent_term_label']]

In [17]:
just_subclasses.sample(n=5, random_state=np.random.default_rng(67112))

Unnamed: 0,label,name,parent_term_label
48,CS20250428_SUBCL_0030,STR Hybrid MSN,CS20250428_CLASS_0003
30,CS20250428_SUBCL_0017,F M Glut,CS20250428_CLASS_0007
38,CS20250428_SUBCL_0002,CN LAMP5-CXCL14 GABA,CS20250428_CLASS_0001
46,CS20250428_SUBCL_0028,STR D1 MSN,CS20250428_CLASS_0003
17,CS20250428_SUBCL_0012,Ependymal,CS20250428_CLASS_0000


Now, let's rename the columns in the dataframes to reflect the fact that they are different levels in the taxonomy

In [18]:
just_classes = just_classes.rename(
    {'label': 'class_label',
     'name': 'class_name',
     'parent_term_label': 'neighborhood_label'},
    axis=1
)
just_subclasses = just_subclasses.rename(
    {'label': 'subclass_label',
     'name': 'subclass_name',
     'parent_term_label': 'class_label'},
    axis=1
)
just_classes.sample(n=5, random_state=np.random.default_rng(11312))

Unnamed: 0,class_label,class_name,neighborhood_label
12,CS20250428_CLASS_0006,F M GABA,CS20250428_NEIGH_0002
9,CS20250428_CLASS_0009,M Dopa,CS20250428_NEIGH_0000
14,CS20250428_CLASS_0012,Cx GABA,CS20250428_NEIGH_0002
8,CS20250428_CLASS_0007,F M Glut,CS20250428_NEIGH_0000
4,CS20250428_CLASS_0000,Astro-Epen,CS20250428_NEIGH_0001


In [19]:
just_subclasses.sample(n=5, random_state=np.random.default_rng(11312))

Unnamed: 0,subclass_label,subclass_name,class_label
34,CS20250428_SUBCL_0033,STR SST-CHODL GABA,CS20250428_CLASS_0005
39,CS20250428_SUBCL_0009,CN VIP GABA,CS20250428_CLASS_0001
48,CS20250428_SUBCL_0030,STR Hybrid MSN,CS20250428_CLASS_0003
29,CS20250428_SUBCL_0013,F Glut,CS20250428_CLASS_0007
16,CS20250428_SUBCL_0000,Astrocyte,CS20250428_CLASS_0000


We can now add class annotations to the subclass dataframe by joining the `class_label` column of `just_subclasses` to the `class_label` column of `just_classes`

In [20]:
subclasses_and_classes = just_subclasses.join(
    just_classes.set_index('class_label'),
    on='class_label'
)

In [21]:
subclasses_and_classes.sample(n=10, random_state=np.random.default_rng(88111))

Unnamed: 0,subclass_label,subclass_name,class_label,class_name,neighborhood_label
33,CS20250428_SUBCL_0008,CN ST18 GABA,CS20250428_CLASS_0005,CN MGE GABA,CS20250428_NEIGH_0002
17,CS20250428_SUBCL_0012,Ependymal,CS20250428_CLASS_0000,Astro-Epen,CS20250428_NEIGH_0001
42,CS20250428_SUBCL_0016,F M GATA3 GABA,CS20250428_CLASS_0006,F M GABA,CS20250428_NEIGH_0002
31,CS20250428_SUBCL_0025,M Dopa,CS20250428_CLASS_0009,M Dopa,CS20250428_NEIGH_0000
27,CS20250428_SUBCL_0024,SMC,CS20250428_CLASS_0011,Vascular,CS20250428_NEIGH_0001
35,CS20250428_SUBCL_0032,STR SST GABA,CS20250428_CLASS_0005,CN MGE GABA,CS20250428_NEIGH_0002
16,CS20250428_SUBCL_0000,Astrocyte,CS20250428_CLASS_0000,Astro-Epen,CS20250428_NEIGH_0001
36,CS20250428_SUBCL_0031,STR RSPO2 GABA,CS20250428_CLASS_0005,CN MGE GABA,CS20250428_NEIGH_0002
19,CS20250428_SUBCL_0018,Macrophage,CS20250428_CLASS_0008,Immune,CS20250428_NEIGH_0001
38,CS20250428_SUBCL_0002,CN LAMP5-CXCL14 GABA,CS20250428_CLASS_0001,CN CGE GABA,CS20250428_NEIGH_0002
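
The same join can also be written with `DataFrame.merge`, which offers two optional safety checks. This is an alternative sketch on toy rows copied from the output above, not the approach used in the rest of this notebook. `validate="many_to_one"` raises an error if a `class_label` appears more than once in the class table, and `indicator=True` adds a `_merge` column that flags subclasses whose parent class was not found.

```python
import pandas as pd

just_subclasses = pd.DataFrame(
    {
        "subclass_label": [
            "CS20250428_SUBCL_0012",
            "CS20250428_SUBCL_0000",
            "CS20250428_SUBCL_0025",
        ],
        "subclass_name": ["Ependymal", "Astrocyte", "M Dopa"],
        "class_label": [
            "CS20250428_CLASS_0000",
            "CS20250428_CLASS_0000",
            "CS20250428_CLASS_0009",
        ],
    }
)
just_classes = pd.DataFrame(
    {
        "class_label": ["CS20250428_CLASS_0000", "CS20250428_CLASS_0009"],
        "class_name": ["Astro-Epen", "M Dopa"],
        "neighborhood_label": ["CS20250428_NEIGH_0001", "CS20250428_NEIGH_0000"],
    }
)

# Left join: keep every subclass, attach its class if one is found.
merged = just_subclasses.merge(
    just_classes,
    on="class_label",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows marked 'left_only' are subclasses with no matching class.
orphans = merged[merged["_merge"] == "left_only"]
print(merged.drop(columns="_merge"))
print("subclasses without a matching class:", len(orphans))
```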


#### Put it all together

Let's assemble the whole taxonomy into a single dataframe using the principles demonstrated above

In [22]:
hmba_taxonomy_df = None

# loop over the levels in the taxonomy from most fine to most gross
for level in ('CCN20250428_LEVEL_4',
              'CCN20250428_LEVEL_3',
              'CCN20250428_LEVEL_2',
              'CCN20250428_LEVEL_1',
              'CCN20250428_LEVEL_0'):

    # only grab rows corresponding to taxons on this level
    level_data = hmba_term_df[
        hmba_term_df.cluster_annotation_term_set_label==level
    ]

    # fill 'nan' with the string 'NULL' so that the call to np.unique
    # below will handle those values properly
    level_data = level_data.fillna('NULL', axis=1)

    # check that, indeed, the rows only have parents at a single
    # level
    parent_level = np.unique(level_data.parent_term_set_label)
    assert len(parent_level) == 1
    parent_level = parent_level[0]

    # rename the columns in level_data so that they reflect the term_set
    # they belong to (i.e. 'label' becomes 'Group_label' or 'Subclass_label', etc.)
    column_name_mapping = {
        'label': f'{level_name_lookup[level]}_label',
        'name': level_name_lookup[level]
    }
    if parent_level != 'NULL':
        column_name_mapping['parent_term_label'] = f'{level_name_lookup[parent_level]}_label'

    level_data = level_data[column_name_mapping.keys()].rename(
        column_name_mapping,
        axis=1
    )

    if hmba_taxonomy_df is None:
        hmba_taxonomy_df = level_data
    else:
        # join this data onto the taxonomy using the new label column
        level_data = level_data.set_index(f'{level_name_lookup[level]}_label')
        hmba_taxonomy_df = hmba_taxonomy_df.join(
            level_data,
            on=f'{level_name_lookup[level]}_label'
        )


In [23]:
# display a random set of taxons from the combined dataframe

hmba_taxonomy_df.sample(n=10, random_state=np.random.default_rng(6713))

Unnamed: 0,Cluster_label,Cluster,Group_label,Group,Subclass_label,Subclass,Class_label,Class,Neighborhood_label,Neighborhood
486,CS20250428_CLUST_2108,Marmoset-1473,CS20250428_GROUP_0064,VTR-HTH Glut,CS20250428_SUBCL_0017,F M Glut,CS20250428_CLASS_0007,F M Glut,CS20250428_NEIGH_0000,Glut Sero Dopa
1015,CS20250428_CLUST_2043,Marmoset-1408,CS20250428_GROUP_0037,SN-VTR-HTH GATA3-TCF7L2 GABA,CS20250428_SUBCL_0016,F M GATA3 GABA,CS20250428_CLASS_0006,F M GABA,CS20250428_NEIGH_0002,Subpallium GABA
1510,CS20250428_CLUST_0341,Human-460,CS20250428_GROUP_0024,OT D1 ICj,CS20250428_SUBCL_0021,OT Granular GABA,CS20250428_CLASS_0003,CN LGE GABA,CS20250428_NEIGH_0002,Subpallium GABA
916,CS20250428_CLUST_0749,Macaque-171,CS20250428_GROUP_0020,GPe-NDB-SI LHX6-LHX8-GBX1 GABA,CS20250428_SUBCL_0005,CN LHX8 GABA,CS20250428_CLASS_0006,F M GABA,CS20250428_NEIGH_0002,Subpallium GABA
1353,CS20250428_CLUST_1030,Macaque-406,CS20250428_GROUP_0058,STRv D2 MSN,CS20250428_SUBCL_0029,STR D2 MSN,CS20250428_CLASS_0003,CN LGE GABA,CS20250428_NEIGH_0002,Subpallium GABA
1231,CS20250428_CLUST_1009,Macaque-371,CS20250428_GROUP_0056,STRv D1 MSN,CS20250428_SUBCL_0028,STR D1 MSN,CS20250428_CLASS_0003,CN LGE GABA,CS20250428_NEIGH_0002,Subpallium GABA
995,CS20250428_CLUST_0882,Macaque-324,CS20250428_GROUP_0037,SN-VTR-HTH GATA3-TCF7L2 GABA,CS20250428_SUBCL_0016,F M GATA3 GABA,CS20250428_CLASS_0006,F M GABA,CS20250428_NEIGH_0002,Subpallium GABA
1060,CS20250428_CLUST_1992,Marmoset-1349,CS20250428_GROUP_0063,ZI-HTH GABA,CS20250428_SUBCL_0037,F GABA,CS20250428_CLASS_0006,F M GABA,CS20250428_NEIGH_0002,Subpallium GABA
1118,CS20250428_CLUST_0604,Macaque-22,CS20250428_GROUP_0066,AMY-SLEA-BNST GABA,CS20250428_SUBCL_0037,F GABA,CS20250428_CLASS_0006,F M GABA,CS20250428_NEIGH_0002,Subpallium GABA
127,CS20250428_CLUST_0251,Human-31,CS20250428_GROUP_0039,Astrocyte,CS20250428_SUBCL_0000,Astrocyte,CS20250428_CLASS_0000,Astro-Epen,CS20250428_NEIGH_0001,Nonneuron
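
The loop above hard-codes the level labels of this particular taxonomy. For exploring an arbitrary taxonomy, the same idea can be wrapped in a function that reads the levels from the term set table. The sketch below is one possible way to do that and is demonstrated on a three-level toy taxonomy built from rows shown earlier. The function name `denormalize_taxonomy` is hypothetical. It assumes that a larger `order` means a finer level and that all terms at a level share a single parent level; term sets that no finer level points to are skipped.

```python
import pandas as pd


def denormalize_taxonomy(term_set_df, term_df):
    """Return one row per finest-level term, with its ancestors as columns."""
    name_of = dict(zip(term_set_df["label"], term_set_df["name"]))
    # Finest level first (assumes a larger `order` means a finer level).
    levels = term_set_df.sort_values("order", ascending=False)["label"]
    result = None
    for level_label in levels:
        level_name = name_of[level_label]
        rows = term_df[term_df["cluster_annotation_term_set_label"] == level_label]
        this_level = pd.DataFrame(
            {
                f"{level_name}_label": rows["label"].to_numpy(),
                level_name: rows["name"].to_numpy(),
            }
        )
        # All terms at one level are expected to share a single parent level.
        parent_sets = rows["parent_term_set_label"].dropna().unique()
        if len(parent_sets) > 1:
            raise ValueError(f"{level_name} terms have parents at several levels")
        if len(parent_sets) == 1:
            parent_col = f"{name_of[parent_sets[0]]}_label"
            this_level[parent_col] = rows["parent_term_label"].to_numpy()
        if result is None:
            result = this_level
        elif f"{level_name}_label" in result.columns:
            result = result.merge(
                this_level, on=f"{level_name}_label", how="left", validate="many_to_one"
            )
        # else: no finer level points at this term set, so it is skipped
    return result


toy_term_set_df = pd.DataFrame(
    {
        "label": ["CCN20250428_LEVEL_0", "CCN20250428_LEVEL_1", "CCN20250428_LEVEL_2"],
        "name": ["Neighborhood", "Class", "Subclass"],
        "order": [0, 1, 2],
    }
)
L0, L1, L2 = toy_term_set_df["label"]
toy_term_df = pd.DataFrame(
    [
        ("CS20250428_NEIGH_0001", "Nonneuron", L0, None, None),
        ("CS20250428_NEIGH_0000", "Glut Sero Dopa", L0, None, None),
        ("CS20250428_CLASS_0000", "Astro-Epen", L1, "CS20250428_NEIGH_0001", L0),
        ("CS20250428_CLASS_0009", "M Dopa", L1, "CS20250428_NEIGH_0000", L0),
        ("CS20250428_SUBCL_0012", "Ependymal", L2, "CS20250428_CLASS_0000", L1),
        ("CS20250428_SUBCL_0000", "Astrocyte", L2, "CS20250428_CLASS_0000", L1),
        ("CS20250428_SUBCL_0025", "M Dopa", L2, "CS20250428_CLASS_0009", L1),
    ],
    columns=[
        "label",
        "name",
        "cluster_annotation_term_set_label",
        "parent_term_label",
        "parent_term_set_label",
    ],
)

toy_taxonomy_df = denormalize_taxonomy(toy_term_set_df, toy_term_df)
print(toy_taxonomy_df)
```

With the real tables, the call would be `denormalize_taxonomy(hmba_term_set_df, hmba_term_df)`, provided the assumptions listed above hold for the taxonomy in question.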


Now, if we want to list all the unique classes in the taxonomy

In [24]:
np.unique(hmba_taxonomy_df.Class)

array(['Astro-Epen', 'CN CGE GABA', 'CN GABA-Glut', 'CN LGE GABA',
       'CN MGE GABA', 'Cx GABA', 'F M GABA', 'F M Glut', 'Immune',
       'M Dopa', 'OPC-Oligo', 'Vascular'], dtype=object)

versus, using the normalized table

In [25]:
np.unique(
    hmba_term_df[
        hmba_term_df.cluster_annotation_term_set_label=='CCN20250428_LEVEL_1'
    ]['name']
)

array(['Astro-Epen', 'CN CGE GABA', 'CN GABA-Glut', 'CN LGE GABA',
       'CN MGE GABA', 'Cx GABA', 'F M GABA', 'F M Glut', 'Immune',
       'M Dopa', 'OPC-Oligo', 'Vascular'], dtype=object)

Now, suppose I wanted to know all of the Groups which were children of the Class 'CN MGE GABA'

In [26]:
np.unique(
    hmba_taxonomy_df[
        hmba_taxonomy_df.Class=='CN MGE GABA'
    ]['Group']
)

array(['GPin-BF Cholinergic GABA', 'LAMP5-LHX6 GABA',
       'STR Cholinergic GABA', 'STR FS PTHLH-PVALB GABA',
       'STR LYPD6-RSPO2 GABA', 'STR SST-ADARB2 GABA',
       'STR SST-CHODL GABA', 'STR SST-RSPO2 GABA', 'STR TAC3-PLPP4 GABA',
       'STR-BF TAC3-PLPP4-LHX8 GABA', 'STRd Cholinergic GABA'],
      dtype=object)
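
Queries like this one generalize naturally to `groupby`. As a sketch, the block below summarizes a toy denormalized table built from a handful of the rows displayed earlier, so the counts it prints describe only those seven rows and not the full taxonomy.

```python
import pandas as pd

# A few (Cluster, Group, Class) rows copied from the sampled output above.
toy_taxonomy_df = pd.DataFrame(
    {
        "Cluster": [
            "Human-460",
            "Macaque-406",
            "Macaque-371",
            "Macaque-171",
            "Macaque-324",
            "Marmoset-1408",
            "Human-31",
        ],
        "Group": [
            "OT D1 ICj",
            "STRv D2 MSN",
            "STRv D1 MSN",
            "GPe-NDB-SI LHX6-LHX8-GBX1 GABA",
            "SN-VTR-HTH GATA3-TCF7L2 GABA",
            "SN-VTR-HTH GATA3-TCF7L2 GABA",
            "Astrocyte",
        ],
        "Class": [
            "CN LGE GABA",
            "CN LGE GABA",
            "CN LGE GABA",
            "F M GABA",
            "F M GABA",
            "F M GABA",
            "Astro-Epen",
        ],
    }
)

# Number of distinct clusters under each class (toy rows only).
clusters_per_class = toy_taxonomy_df.groupby("Class")["Cluster"].nunique()

# Sorted list of the distinct groups under each class (toy rows only).
groups_per_class = toy_taxonomy_df.groupby("Class")["Group"].agg(
    lambda names: sorted(names.unique())
)

print(clusters_per_class)
print(groups_per_class["F M GABA"])
```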

## Joining taxonomy annotations to cells

Individual cells are linked to taxons via `cluster_alias`. `cluster_alias` is an identifier that refers to the purely computational (as opposed to "biologically meaningful") clusters that result from the *de novo* clustering algorithm used to derive the taxonomy. Cells are assigned a `cluster_alias`. Biological experts assemble the `cluster_aliases` into biologically meaningful annotation terms. This link between an annotation term and its constituent `cluster_alias` is found in the `cluster_to_cluster_annotation_membership` file.

In [27]:
hmba_membership_df = abc_cache.get_metadata_dataframe(
    directory=hmba_taxonomy_dir,
    file_name='cluster_to_cluster_annotation_membership'
)

cluster_to_cluster_annotation_membership.csv: 100%|█| 539k/539k [00:00<00:00, 3.


In [28]:
hmba_membership_df.sample(n=10, random_state=np.random.default_rng(8711222))

Unnamed: 0,cluster_annotation_term_label,cluster_annotation_term_set_label,cluster_alias,cluster_annotation_term_set_name,cluster_annotation_term_name
1441,CS20250428_GROUP_0039,CCN20250428_LEVEL_3,Human-159,Group,Astrocyte
6881,CS20250428_NEIGH_0002,CCN20250428_LEVEL_0,Marmoset-134,Neighborhood,Subpallium GABA
7111,CS20250428_NEIGH_0002,CCN20250428_LEVEL_0,Macaque-260,Neighborhood,Subpallium GABA
5776,CS20250428_NEIGH_0001,CCN20250428_LEVEL_0,Marmoset-515,Neighborhood,Nonneuron
1446,CS20250428_GROUP_0039,CCN20250428_LEVEL_3,Human-7,Group,Astrocyte
5211,CS20250428_CLASS_0006,CCN20250428_LEVEL_1,Marmoset-1412,Class,F M GABA
6097,CS20250428_NEIGH_0000,CCN20250428_LEVEL_0,Marmoset-1345,Neighborhood,Glut Sero Dopa
3981,CS20250428_SUBCL_0028,CCN20250428_LEVEL_2,Human-492,Subclass,STR D1 MSN
74,CS20250428_CLUST_0192,CCN20250428_LEVEL_4,Human-125,Cluster,Human-125
5653,CS20250428_CLASS_0003,CCN20250428_LEVEL_1,Macaque-32,Class,CN LGE GABA


Each row in this file represents a single `cluster_alias`-to-`cluster_annotation_term` relationship.

The key columns in this dataframe are
- `cluster_alias`: the identifier of the purely computational cluster. Again: this is what will be associated to individual cells later on
- `cluster_annotation_term_label`: the label for an annotation term to which the `cluster_alias` has been assigned. Each distinct (`cluster_alias`, `cluster_annotation_term_label`) will have its own row in this file, meaning individual `cluster_aliases` and `cluster_annotation_term_labels` will be repeated throughout.
- `cluster_annotation_term_set_label`: the label of the term set to which the term referred to by `cluster_annotation_term_label` belongs
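
Since this file is in "long" format (one row per alias-and-term pair), it can be pivoted into a "wide" table with one row per `cluster_alias` and one column per term set. The sketch below does this on toy rows that are consistent with the outputs in this notebook. `DataFrame.pivot` raises an error if any (`cluster_alias`, term set) pair occurs twice, which doubles as a check that each alias has at most one term per level.

```python
import pandas as pd

# Toy long-format membership rows for two cluster aliases.
toy_membership_df = pd.DataFrame(
    {
        "cluster_alias": ["Human-159"] * 5 + ["Human-451"] * 5,
        "cluster_annotation_term_set_name": [
            "Neighborhood", "Class", "Subclass", "Group", "Cluster",
        ] * 2,
        "cluster_annotation_term_name": [
            "Nonneuron", "Astro-Epen", "Astrocyte", "Astrocyte", "Human-159",
            "Nonneuron", "Vascular", "VLMC", "VLMC", "Human-451",
        ],
    }
)

# One row per alias, one column per term set.
wide_df = toy_membership_df.pivot(
    index="cluster_alias",
    columns="cluster_annotation_term_set_name",
    values="cluster_annotation_term_name",
)

# pivot sorts the columns alphabetically; restore the taxonomy order.
wide_df = wide_df[["Neighborhood", "Class", "Subclass", "Group", "Cluster"]]
print(wide_df)
```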

To get a dataframe linking `cluster_alias` to the labels of clusters (the finest level of the taxonomy as we have been using it so far), we can just filter this dataframe on `cluster_annotation_term_set_label`

In [29]:
hmba_term_to_alias_df = hmba_membership_df[
    hmba_membership_df.cluster_annotation_term_set_label=='CCN20250428_LEVEL_4'
]
hmba_term_to_alias_df.head(10)

Unnamed: 0,cluster_annotation_term_label,cluster_annotation_term_set_label,cluster_alias,cluster_annotation_term_set_name,cluster_annotation_term_name
0,CS20250428_CLUST_0161,CCN20250428_LEVEL_4,Human-143,Cluster,Human-143
1,CS20250428_CLUST_0162,CCN20250428_LEVEL_4,Human-145,Cluster,Human-145
2,CS20250428_CLUST_0163,CCN20250428_LEVEL_4,Human-146,Cluster,Human-146
3,CS20250428_CLUST_0164,CCN20250428_LEVEL_4,Human-149,Cluster,Human-149
4,CS20250428_CLUST_0165,CCN20250428_LEVEL_4,Human-150,Cluster,Human-150
5,CS20250428_CLUST_0166,CCN20250428_LEVEL_4,Human-152,Cluster,Human-152
6,CS20250428_CLUST_0167,CCN20250428_LEVEL_4,Human-159,Cluster,Human-159
7,CS20250428_CLUST_0168,CCN20250428_LEVEL_4,Human-160,Cluster,Human-160
8,CS20250428_CLUST_0169,CCN20250428_LEVEL_4,Human-450,Cluster,Human-450
9,CS20250428_CLUST_0170,CCN20250428_LEVEL_4,Human-452,Cluster,Human-452


Let's filter down this dataframe to the two columns we need and, for ease of use, rename `cluster_annotation_term_label` to `Cluster_label`

In [30]:
hmba_term_to_alias_df = hmba_term_to_alias_df[
    ['cluster_alias', 'cluster_annotation_term_label']
].rename({'cluster_annotation_term_label': 'Cluster_label'}, axis=1)

hmba_term_to_alias_df.head(10)

Unnamed: 0,cluster_alias,Cluster_label
0,Human-143,CS20250428_CLUST_0161
1,Human-145,CS20250428_CLUST_0162
2,Human-146,CS20250428_CLUST_0163
3,Human-149,CS20250428_CLUST_0164
4,Human-150,CS20250428_CLUST_0165
5,Human-152,CS20250428_CLUST_0166
6,Human-159,CS20250428_CLUST_0167
7,Human-160,CS20250428_CLUST_0168
8,Human-450,CS20250428_CLUST_0169
9,Human-452,CS20250428_CLUST_0170


The link between cells and `cluster_aliases` can be found in `cell_to_cluster_membership`

In [31]:
hmba_cell_to_cluster = abc_cache.get_metadata_dataframe(
    directory=hmba_taxonomy_dir,
    file_name='cell_to_cluster_membership'
)

cell_to_cluster_membership.csv: 100%|███████| 119M/119M [00:15<00:00, 7.62MMB/s]


In [32]:
hmba_cell_to_cluster.head(5)

Unnamed: 0,cell_label,cluster_alias,cluster_label
0,AAACAGCCAAATGCCC-2362_A05,Human-451,CS20250428_CLUST_0268
1,AAACAGCCAATTGAGA-2362_A05,Human-1,CS20250428_CLUST_0227
2,AAACAGCCAGCATGTC-2362_A05,Human-153,CS20250428_CLUST_0215
3,AAACAGCCATTGACAT-2362_A05,Human-1,CS20250428_CLUST_0227
4,AAACAGCCATTGTGGC-2362_A05,Human-14,CS20250428_CLUST_0249


We see here that `cluster_label` is also present in `cell_to_cluster_membership`, which means we could also use that column to join taxonomic annotations to cells. This is not a guarantee of the data model. Generally, it will be safer to assume that the link to individual cells is mediated by `cluster_alias`.
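
When both routes are available, it is cheap to confirm that they agree before relying on either. The sketch below makes that comparison on three toy rows copied from the outputs above. Cells whose alias is missing from the lookup table would also count as mismatches, because a missing value never compares equal to a label.

```python
import pandas as pd

toy_cell_to_cluster = pd.DataFrame(
    {
        "cell_label": [
            "AAACAGCCAAATGCCC-2362_A05",
            "AAACAGCCAATTGAGA-2362_A05",
            "AAACAGCCAGCATGTC-2362_A05",
        ],
        "cluster_alias": ["Human-451", "Human-1", "Human-153"],
        "cluster_label": [
            "CS20250428_CLUST_0268",
            "CS20250428_CLUST_0227",
            "CS20250428_CLUST_0215",
        ],
    }
)
toy_alias_to_label = pd.DataFrame(
    {
        "cluster_alias": ["Human-451", "Human-1", "Human-153"],
        "Cluster_label": [
            "CS20250428_CLUST_0268",
            "CS20250428_CLUST_0227",
            "CS20250428_CLUST_0215",
        ],
    }
)

# Route 1 is the `cluster_label` column already in the cell table.
# Route 2 looks the label up through `cluster_alias`.
check = toy_cell_to_cluster.merge(
    toy_alias_to_label, on="cluster_alias", how="left", validate="many_to_one"
)
n_mismatched = int((check["cluster_label"] != check["Cluster_label"]).sum())
print("cells where the two routes disagree:", n_mismatched)
```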

Let's join our denormalized taxonomy dataframe to cells using these tables.

In [33]:
# for the sake of illustration
hmba_cell_to_cluster = hmba_cell_to_cluster[["cell_label", "cluster_alias"]]

hmba_cell_to_cluster.head(5)

Unnamed: 0,cell_label,cluster_alias
0,AAACAGCCAAATGCCC-2362_A05,Human-451
1,AAACAGCCAATTGAGA-2362_A05,Human-1
2,AAACAGCCAGCATGTC-2362_A05,Human-153
3,AAACAGCCATTGACAT-2362_A05,Human-1
4,AAACAGCCATTGTGGC-2362_A05,Human-14


In [34]:
hmba_cell_to_cluster = hmba_cell_to_cluster.join(
    hmba_term_to_alias_df.set_index('cluster_alias'),
    on='cluster_alias'
)

In [35]:
hmba_cell_to_cluster.head(5)

Unnamed: 0,cell_label,cluster_alias,Cluster_label
0,AAACAGCCAAATGCCC-2362_A05,Human-451,CS20250428_CLUST_0268
1,AAACAGCCAATTGAGA-2362_A05,Human-1,CS20250428_CLUST_0227
2,AAACAGCCAGCATGTC-2362_A05,Human-153,CS20250428_CLUST_0215
3,AAACAGCCATTGACAT-2362_A05,Human-1,CS20250428_CLUST_0227
4,AAACAGCCATTGTGGC-2362_A05,Human-14,CS20250428_CLUST_0249


Now let's join the taxonomy to `hmba_cell_to_cluster`

In [36]:
hmba_cell_to_taxonomy = hmba_cell_to_cluster.join(
    hmba_taxonomy_df.set_index('Cluster_label'),
    on='Cluster_label'
)

In [37]:
hmba_cell_to_taxonomy.head(15)

Unnamed: 0,cell_label,cluster_alias,Cluster_label,Cluster,Group_label,Group,Subclass_label,Subclass,Class_label,Class,Neighborhood_label,Neighborhood
0,AAACAGCCAAATGCCC-2362_A05,Human-451,CS20250428_CLUST_0268,Human-451,CS20250428_GROUP_0062,VLMC,CS20250428_SUBCL_0035,VLMC,CS20250428_CLASS_0011,Vascular,CS20250428_NEIGH_0001,Nonneuron
1,AAACAGCCAATTGAGA-2362_A05,Human-1,CS20250428_CLUST_0227,Human-1,CS20250428_GROUP_0025,Oligo OPALIN,CS20250428_SUBCL_0022,Oligodendrocyte,CS20250428_CLASS_0010,OPC-Oligo,CS20250428_NEIGH_0001,Nonneuron
2,AAACAGCCAGCATGTC-2362_A05,Human-153,CS20250428_CLUST_0215,Human-153,CS20250428_GROUP_0019,Microglia,CS20250428_SUBCL_0019,Microglia,CS20250428_CLASS_0008,Immune,CS20250428_NEIGH_0001,Nonneuron
3,AAACAGCCATTGACAT-2362_A05,Human-1,CS20250428_CLUST_0227,Human-1,CS20250428_GROUP_0025,Oligo OPALIN,CS20250428_SUBCL_0022,Oligodendrocyte,CS20250428_CLASS_0010,OPC-Oligo,CS20250428_NEIGH_0001,Nonneuron
4,AAACAGCCATTGTGGC-2362_A05,Human-14,CS20250428_CLUST_0249,Human-14,CS20250428_GROUP_0039,Astrocyte,CS20250428_SUBCL_0000,Astrocyte,CS20250428_CLASS_0000,Astro-Epen,CS20250428_NEIGH_0001,Nonneuron
5,AAACATGCAAATATCC-2362_A05,Human-1,CS20250428_CLUST_0227,Human-1,CS20250428_GROUP_0025,Oligo OPALIN,CS20250428_SUBCL_0022,Oligodendrocyte,CS20250428_CLASS_0010,OPC-Oligo,CS20250428_NEIGH_0001,Nonneuron
6,AAACATGCAGTAGGAT-2362_A05,Human-159,CS20250428_CLUST_0167,Human-159,CS20250428_GROUP_0039,Astrocyte,CS20250428_SUBCL_0000,Astrocyte,CS20250428_CLASS_0000,Astro-Epen,CS20250428_NEIGH_0001,Nonneuron
7,AAACCAACAAGGTACG-2362_A05,Human-154,CS20250428_CLUST_0524,Human-154,CS20250428_GROUP_0056,STRv D1 MSN,CS20250428_SUBCL_0028,STR D1 MSN,CS20250428_CLASS_0003,CN LGE GABA,CS20250428_NEIGH_0002,Subpallium GABA
8,AAACCAACACAGACTC-2362_A05,Human-15,CS20250428_CLUST_0231,Human-15,CS20250428_GROUP_0026,Oligo PLEKHG1,CS20250428_SUBCL_0022,Oligodendrocyte,CS20250428_CLASS_0010,OPC-Oligo,CS20250428_NEIGH_0001,Nonneuron
9,AAACCAACATGAAGTA-2362_A05,Human-563,CS20250428_CLUST_0476,Human-563,CS20250428_GROUP_0052,STRd D2 Matrix MSN,CS20250428_SUBCL_0029,STR D2 MSN,CS20250428_CLASS_0003,CN LGE GABA,CS20250428_NEIGH_0002,Subpallium GABA
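
After a chain of joins like this, a few quick checks help confirm that nothing was duplicated or left unannotated. The sketch below runs them on a toy table holding the `cell_label` and `Class` columns of the ten rows displayed above, so the counts describe only those ten cells.

```python
import pandas as pd

toy_cell_to_taxonomy = pd.DataFrame(
    {
        "cell_label": [
            "AAACAGCCAAATGCCC-2362_A05",
            "AAACAGCCAATTGAGA-2362_A05",
            "AAACAGCCAGCATGTC-2362_A05",
            "AAACAGCCATTGACAT-2362_A05",
            "AAACAGCCATTGTGGC-2362_A05",
            "AAACATGCAAATATCC-2362_A05",
            "AAACATGCAGTAGGAT-2362_A05",
            "AAACCAACAAGGTACG-2362_A05",
            "AAACCAACACAGACTC-2362_A05",
            "AAACCAACATGAAGTA-2362_A05",
        ],
        "Class": [
            "Vascular",
            "OPC-Oligo",
            "Immune",
            "OPC-Oligo",
            "Astro-Epen",
            "OPC-Oligo",
            "Astro-Epen",
            "CN LGE GABA",
            "OPC-Oligo",
            "CN LGE GABA",
        ],
    }
)

# A join that duplicated rows would break uniqueness of cell_label.
cells_are_unique = bool(toy_cell_to_taxonomy["cell_label"].is_unique)

# A cell whose cluster was not found in the taxonomy would have a missing Class.
n_unannotated = int(toy_cell_to_taxonomy["Class"].isna().sum())

# Number of cells assigned to each class (toy rows only).
cells_per_class = toy_cell_to_taxonomy["Class"].value_counts()

print("unique cells:", cells_are_unique)
print("unannotated cells:", n_unannotated)
print(cells_per_class)
```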


# Subtleties: Whole Mouse Brain taxonomy

The `abc_atlas_access` data model has evolved somewhat over the two years that we have been using `abc_atlas_access` to serve data to the community. Accessing older datasets may involve some exploration and inspection to find the right tables to join under the right circumstances. To illustrate this, let's consider the first taxonomy released with `abc_atlas_access`: the Yao et al. 2023 Whole Mouse Brain taxonomy.

In [38]:
whole_mouse_taxonomy_dir = 'WMB-taxonomy'

In [39]:
abc_cache.list_metadata_files(directory=whole_mouse_taxonomy_dir)

['cluster',
 'cluster_annotation_term',
 'cluster_annotation_term_set',
 'cluster_annotation_term_with_counts',
 'cluster_to_cluster_annotation_membership',
 'cluster_to_cluster_annotation_membership_color',
 'cluster_to_cluster_annotation_membership_pivoted']

The most obvious difference between this taxonomy and the HMBA Basal Ganglia taxonomy is the lack of a `cell_to_cluster_membership` file. In the Whole Mouse Brain taxonomy, the link between individual cells and `cluster_alias` occurs in the `cell_metadata` file.
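
When exploring an unfamiliar taxonomy directory, comparing its file listing against a known one is a quick first step. The sketch below works on the two listings printed in this notebook, copied in as plain lists. The helper `has_cell_to_cluster_file` is hypothetical and only inspects names.

```python
hmba_files = [
    "abbreviation_term",
    "cell_2d_embedding_coordinates",
    "cell_to_cluster_membership",
    "cluster",
    "cluster_annotation_term",
    "cluster_annotation_term_set",
    "cluster_annotation_to_abbreviation_map",
    "cluster_to_cluster_annotation_membership",
]
wmb_files = [
    "cluster",
    "cluster_annotation_term",
    "cluster_annotation_term_set",
    "cluster_annotation_term_with_counts",
    "cluster_to_cluster_annotation_membership",
    "cluster_to_cluster_annotation_membership_color",
    "cluster_to_cluster_annotation_membership_pivoted",
]


def has_cell_to_cluster_file(metadata_files):
    """True if the taxonomy directory itself links cells to cluster_alias."""
    return "cell_to_cluster_membership" in metadata_files


shared = sorted(set(hmba_files) & set(wmb_files))
only_hmba = sorted(set(hmba_files) - set(wmb_files))
only_wmb = sorted(set(wmb_files) - set(hmba_files))

print("HMBA has cell_to_cluster_membership:", has_cell_to_cluster_file(hmba_files))
print("WMB has cell_to_cluster_membership:", has_cell_to_cluster_file(wmb_files))
print("shared files:", shared)
```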

In [40]:
whole_mouse_metadata_dir = 'WMB-10X'

In [41]:
whole_mouse_cell_metadata = abc_cache.get_metadata_dataframe(
    directory=whole_mouse_metadata_dir,
    file_name='cell_metadata'
)[['cell_label', 'feature_matrix_label', 'cluster_alias']]

whole_mouse_cell_metadata.sample(n=10, random_state=np.random.default_rng(88123))

cell_metadata.csv: 100%|██████████████████| 1.01G/1.01G [02:10<00:00, 7.74MMB/s]


Unnamed: 0,cell_label,feature_matrix_label,cluster_alias
3242684,TACTCATTCTGAGTGT-1.04_A01,WMB-10Xv2-Isocortex-2,233
3262844,ACTATCTGTTCCGTCT-005_B01,WMB-10Xv2-Isocortex-2,202
1501160,TCGGATATCTGCTTAT-216_C01,WMB-10Xv3-CB,5158
2517333,TGCGTGGGTTGATTGC-120_B01,WMB-10Xv2-CTXsp,24
3484208,CCTCTGATCTGCCCTA-098_A01,WMB-10Xv2-Isocortex-3,184
3718225,CACAGGCCATCGGTTA-097_C01,WMB-10Xv2-OLF,262
2608964,ACGGAGAGTGCATCTA-059_D01,WMB-10Xv2-MB,3572
825909,CTATCCGGTCCAAAGG-189_D01,WMB-10Xv3-CB,5158
1410845,CCTCTAGCACTGTTCC-455_A04,WMB-10Xv3-OLF,1384
1537242,AACCTGAAGAAACCCG-982_B03,WMB-10Xv3-MB,5231


Now let's inspect the Whole Mouse Brain taxonomy itself.

In [42]:
whole_mouse_term_set_df = abc_cache.get_metadata_dataframe(
    directory=whole_mouse_taxonomy_dir,
    file_name='cluster_annotation_term_set'
)
whole_mouse_term_set_df

cluster_annotation_term_set.csv: 100%|████| 1.11k/1.11k [00:00<00:00, 21.8kMB/s]


Unnamed: 0,label,name,description,order
0,CCN20230722_NEUR,neurotransmitter,Clusters are assigned based on the average exp...,0
1,CCN20230722_CLAS,class,The top level of cell type definition in the m...,1
2,CCN20230722_SUBC,subclass,The coarse level of cell type definition in th...,2
3,CCN20230722_SUPT,supertype,The second finest level of cell type definitio...,3
4,CCN20230722_CLUS,cluster,The finest level of cell type definition in th...,4


In [43]:
whole_mouse_term_df = abc_cache.get_metadata_dataframe(
    directory=whole_mouse_taxonomy_dir,
    file_name='cluster_annotation_term'
)
whole_mouse_term_df.sample(n=10, random_state=np.random.default_rng(999))

Unnamed: 0,label,name,cluster_annotation_term_set_label,parent_term_label,parent_term_set_label,term_set_order,term_order,cluster_annotation_term_set_name,color_hex_triplet
707,CS20230722_SUPT_0326,0326 LSX Prdm12 Zeb2 Gaba_1,CCN20230722_SUPT,CS20230722_SUBC_071,CCN20230722_SUBC,3,325,supertype,#500F66
1181,CS20230722_SUPT_0800,0800 SNr-VTA Pax5 Npas1 Gaba_1,CCN20230722_SUPT,CS20230722_SUBC_195,CCN20230722_SUBC,3,799,supertype,#979945
1248,CS20230722_SUPT_0867,0867 SC Tnnt1 Gli3 Gaba_1,CCN20230722_SUPT,CS20230722_SUBC_211,CCN20230722_SUBC,3,866,supertype,#994588
1201,CS20230722_SUPT_0820,0820 PAG-ND-PCG Onecut1 Gaba_3,CCN20230722_SUPT,CS20230722_SUBC_200,CCN20230722_SUBC,3,819,supertype,#1F65CC
1188,CS20230722_SUPT_0807,0807 IC Six3 En2 Gaba_1,CCN20230722_SUPT,CS20230722_SUBC_198,CCN20230722_SUBC,3,806,supertype,#66501F
4924,CS20230722_CLUS_3342,3342 CUN-PPN Evx2 Meis2 Glut_4,CCN20230722_CLUS,CS20230722_SUPT_0779,CCN20230722_SUPT,4,3341,cluster,#FFBE99
5179,CS20230722_CLUS_3597,3597 PRT Tcf7l2 Gaba_1,CCN20230722_CLUS,CS20230722_SUPT_0829,CCN20230722_SUPT,4,3596,cluster,#4DFF94
5612,CS20230722_CLUS_4030,4030 PB Evx2 Glut_2,CCN20230722_CLUS,CS20230722_SUPT_0915,CCN20230722_SUPT,4,4029,cluster,#CC7B7A
1062,CS20230722_SUPT_0681,0681 MB-ant-ve Dmrta2 Glut_1,CCN20230722_SUPT,CS20230722_SUBC_156,CCN20230722_SUBC,3,680,supertype,#176599
5371,CS20230722_CLUS_3789,3789 SCs Lef1 Gli3 Gaba_1,CCN20230722_CLUS,CS20230722_SUPT_0869,CCN20230722_SUPT,4,3788,cluster,#2B9917


Another big difference between the Whole Mouse Brain taxonomy and the HMBA Basal Ganglia taxonomy is that, while every term set in the HMBA Basal Ganglia taxonomy is a part of a hierarchical taxonomy (every Cluster belongs to only one Group, every Group belongs to only one Subclass, every Subclass belongs to only one Class, and every Class belongs to only one Neighborhood), the `neurotransmitter` term set in Whole Mouse Brain exists outside of the hierarchical taxonomy.

In [44]:
for label, name in zip(whole_mouse_term_set_df.label, whole_mouse_term_set_df.name):

    # grab all terms claiming parents from this level
    subset = whole_mouse_term_df[
        whole_mouse_term_df.parent_term_set_label==label
    ]
    child_levels = np.unique(subset.cluster_annotation_term_set_label)

    print(
        f'level: ({label}, {name}) '
        f'-- {len(subset)} children from level {child_levels} claim a parent from this level'
    )

level: (CCN20230722_NEUR, neurotransmitter) -- 0 children from level [] claim a parent from this level
level: (CCN20230722_CLAS, class) -- 338 children from level ['CCN20230722_SUBC'] claim a parent from this level
level: (CCN20230722_SUBC, subclass) -- 1201 children from level ['CCN20230722_SUPT'] claim a parent from this level
level: (CCN20230722_SUPT, supertype) -- 5322 children from level ['CCN20230722_CLUS'] claim a parent from this level
level: (CCN20230722_CLUS, cluster) -- 0 children from level [] claim a parent from this level
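
The loop above shows which levels claim parents from which other levels. A complementary check is that each term names exactly one parent and that every parent is itself a term in the table. Below is a minimal sketch of that check on a small made-up table that borrows the column names of `whole_mouse_term_df`. The rows, the shortened labels, and the names `toy_term_df`, `parents_per_term`, and `dangling_parents` are illustrative assumptions, not real taxonomy data.

```python
import pandas as pd

# Toy stand-in for whole_mouse_term_df (labels invented for illustration)
toy_term_df = pd.DataFrame({
    'label': ['CLAS_1', 'SUBC_1', 'SUBC_2', 'SUPT_1', 'SUPT_2', 'SUPT_3'],
    'cluster_annotation_term_set_label': [
        'CLAS', 'SUBC', 'SUBC', 'SUPT', 'SUPT', 'SUPT'
    ],
    'parent_term_label': [
        None, 'CLAS_1', 'CLAS_1', 'SUBC_1', 'SUBC_1', 'SUBC_2'
    ],
})

# count the distinct parents claimed by each term that has a parent
parents_per_term = (
    toy_term_df.dropna(subset=['parent_term_label'])
    .groupby('label')['parent_term_label']
    .nunique()
)
is_strict_hierarchy = bool((parents_per_term == 1).all())

# every claimed parent should also appear as a term in the table
dangling_parents = (
    set(toy_term_df['parent_term_label'].dropna())
    - set(toy_term_df['label'])
)

print(is_strict_hierarchy, dangling_parents)
```

Running the same two checks against the real `whole_mouse_term_df` should be a matter of swapping in that dataframe, assuming its columns match the sample shown above.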


In the Whole Mouse Brain taxonomy:
- every Cluster belongs to only one Supertype
- every Supertype belongs to only one Subclass
- every Subclass belongs to only one Class

However, while every Cluster belongs to only one Neurotransmitter, Neurotransmitters can cut across supertypes, as can be seen from [this ABC Atlas visualization](https://knowledge.brain-map.org/abcatlas#AQQBQVA4Sk5ONUxZQUJHVk1HS1kxQgACUTFOQ1dXUEc2RlowRE5JWEpCUQADAQE0TVY3SEE1REcyWEpaM1VEOEc5AAIBRG9wYQAABAEBAn%2B7B4%2BClGNmA4TOQ%2BmEojpNAAUABgEBAjRNVjdIQTVERzJYSlozVUQ4RzkAA34AAAAEAAAIRzRJNEdGSlhKQjlBVFozUFRYMQAJTFZEQkpBVzhCSTVZU1MxUVVCRwAKAAsBbm9uZQACbm9uZQADAQQBAAIjMDAwMDAwAAPIAQAFAQECIzAwMDAwMAADyAEAAAABQVA4Sk5ONUxZQUJHVk1HS1kxQgACUTFOQ1dXUEc2RlowRE5JWEpCUQADAQE0TVY3SEE1REcyWEpaM1VEOEc5AAIBRG9wYQAABAEBAn%2B7B4%2BClGNmA4TOQ%2BmEojpNAAUABgEBAjE1Qks0N0RDSU9GMVNMTFVXOVAAA34AAAAEAAAIRzRJNEdGSlhKQjlBVFozUFRYMQAJTFZEQkpBVzhCSTVZU1MxUVVCRwAKAAsBbm9uZQACbm9uZQADAQQBAAIjMDAwMDAwAAPIAQAFAQECIzAwMDAwMAADyAEAAAABQVA4Sk5ONUxZQUJHVk1HS1kxQgACUTFOQ1dXUEc2RlowRE5JWEpCUQADBAFGUzAwRFhWMFQ5UjFYOUZKNFFFAAIAAAFRWTVTOEtNTzVITEpVRjBQMDBLAAIAAAExNUJLNDdEQ0lPRjFTTExVVzlQAAIBMDE2MiBPQiBEb3BhLUdhYmFfMQAAAUNCR0MwVTMwVlY5SlBSNjBUSlUAAgAABAEBAn%2B7B4%2BClGNmA4TOQ%2BmEojpNAAUABgEBAjRNVjdIQTVERzJYSlozVUQ4RzkAA34AAAAEAAAIRzRJNEdGSlhKQjlBVFozUFRYMQAJTFZEQkpBVzhCSTVZU1MxUVVCRwAKAAsBbm9uZQACbm9uZQADAQQBAAIjMDAwMDAwAAPIAQAFAQECIzAwMDAwMAADyAEAAAABQVA4Sk5ONUxZQUJHVk1HS1kxQgACUTFOQ1dXUEc2RlowRE5JWEpCUQADBAFGUzAwRFhWMFQ5UjFYOUZKNFFFAAIAAAFRWTVTOEtNTzVITEpVRjBQMDBLAAIAAAExNUJLNDdEQ0lPRjFTTExVVzlQAAIBMDE2MiBPQiBEb3BhLUdhYmFfMQAAAUNCR0MwVTMwVlY5SlBSNjBUSlUAAgAABAEBAn%2B7B4%2BClGNmA4TOQ%2BmEojpNAAUABgEBAjE1Qks0N0RDSU9GMVNMTFVXOVAAA34AAAAEAAAIRzRJNEdGSlhKQjlBVFozUFRYMQAJTFZEQkpBVzhCSTVZU1MxUVVCRwAKAAsBbm9uZQACbm9uZQADAQQBAAIjMDAwMDAwAAPIAQAFAQECIzAwMDAwMAADyAEAAAACCAA%3D). This visualization shows four panels. The two panels on the left are colored according to the "neurotransmitter" annotation. The two panels on the right are colored according to the supertype annotation. The two top panels are filtered to include all cells annotated with `neurotransmitter` == "Dopa".
The two panels on the bottom are filtered to show all cells annotated with `supertype` == "0162 Dopa GABA". As we can see, `neurotransmitter` == "Dopa" includes a subset of supertype "0162 Dopa GABA" alongside the entirety of other supertypes. This places `neurotransmitter` outside of the strictly hierarchical taxonomy.

This is just a feature of the taxonomy you have to discover by inspecting the contents of the taxonomy's tables.
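
One way to discover this from the tables rather than from the visualization is to count, for each supertype, how many distinct neurotransmitter values its clusters carry. The sketch below does this on a made-up long-format table that reuses three column names from `cluster_to_cluster_annotation_membership`. The rows and the names `toy_membership_df`, `wide`, and `mixed_supertypes` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the membership table: one row per (cluster, term set)
toy_membership_df = pd.DataFrame({
    'cluster_alias': [1, 1, 2, 2, 3, 3],
    'cluster_annotation_term_set_name': ['supertype', 'neurotransmitter'] * 3,
    'cluster_annotation_term_name': [
        'ST_A', 'Dopa', 'ST_A', 'GABA', 'ST_B', 'Dopa'
    ],
})

# one row per cluster, one column per term set
wide = toy_membership_df.pivot(
    index='cluster_alias',
    columns='cluster_annotation_term_set_name',
    values='cluster_annotation_term_name',
)

# supertypes whose clusters disagree on neurotransmitter
nt_per_supertype = wide.groupby('supertype')['neurotransmitter'].nunique()
mixed_supertypes = nt_per_supertype[nt_per_supertype > 1].index.tolist()

print(mixed_supertypes)
```

In this toy table, supertype `ST_A` contains one "Dopa" cluster and one "GABA" cluster, so it is reported as mixed. Any supertype reported this way shows that `neurotransmitter` does not nest cleanly inside the hierarchy.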

### Joining Whole Mouse cell data to taxonomy annotations

Let's recycle our code to create a denormalized taxonomy dataframe, this time using the Whole Mouse Brain taxonomy.

(We are intentionally copying and pasting code rather than writing utility functions, to force users to inspect what is going on "under the hood.")

In [45]:
whole_mouse_level_name_lookup = {
    label: name for label, name in zip(whole_mouse_term_set_df.label, whole_mouse_term_set_df.name)
}

In [46]:
whole_mouse_taxonomy_df = None

# loop over the levels in the taxonomy from finest to coarsest
for level in ('CCN20230722_CLUS',
              'CCN20230722_SUPT',
              'CCN20230722_SUBC',
              'CCN20230722_CLAS'):

    # only grab rows corresponding to taxons on this level
    level_data = whole_mouse_term_df[
        whole_mouse_term_df.cluster_annotation_term_set_label==level
    ]

    # fill 'nan' with the string 'NULL' so that the call to np.unique
    # below will handle those values properly
    level_data = level_data.fillna('NULL', axis=1)

    # check that, indeed, the rows only have parents at a single
    # level
    parent_level = np.unique(level_data.parent_term_set_label)
    assert len(parent_level) == 1
    parent_level = parent_level[0]

    # rename the columns in level_data so that they reflect the term_set
    # they belong to (e.g. 'label' becomes 'cluster_label' or 'supertype_label', etc.)
    column_name_mapping = {
        'label': f'{whole_mouse_level_name_lookup[level]}_label',
        'name': whole_mouse_level_name_lookup[level]
    }
    if parent_level != 'NULL':
        column_name_mapping['parent_term_label'] = f'{whole_mouse_level_name_lookup[parent_level]}_label'

    level_data = level_data[column_name_mapping.keys()].rename(
        column_name_mapping,
        axis=1
    )

    if whole_mouse_taxonomy_df is None:
        whole_mouse_taxonomy_df = level_data
    else:
        # join this data onto the taxonomy using the new label column
        level_data = level_data.set_index(f'{whole_mouse_level_name_lookup[level]}_label')
        whole_mouse_taxonomy_df = whole_mouse_taxonomy_df.join(
            level_data,
            on=f'{whole_mouse_level_name_lookup[level]}_label'
        )

In [47]:
whole_mouse_taxonomy_df.sample(n=10, random_state=np.random.default_rng(76112))

Unnamed: 0,cluster_label,cluster,supertype_label,supertype,subclass_label,subclass,class_label,class
4070,CS20230722_CLUS_2488,2488 PH Pitx2 Glut_2,CS20230722_SUPT_0611,0611 PH Pitx2 Glut_2,CS20230722_SUBC_138,138 PH Pitx2 Glut,CS20230722_CLAS_14,14 HY Glut
6339,CS20230722_CLUS_4757,4757 PDTg-PCG Pax6 Gaba_4,CS20230722_SUPT_1064,1064 PDTg-PCG Pax6 Gaba_4,CS20230722_SUBC_273,273 PDTg-PCG Pax6 Gaba,CS20230722_CLAS_26,26 P GABA
3177,CS20230722_CLUS_1595,1595 RT-ZI Gnb3 Gaba_4,CS20230722_SUPT_0434,0434 RT-ZI Gnb3 Gaba_4,CS20230722_SUBC_093,093 RT-ZI Gnb3 Gaba,CS20230722_CLAS_12,12 HY GABA
1722,CS20230722_CLUS_0140,0140 L2/3 IT PIR-ENTl Glut_1,CS20230722_SUPT_0039,0039 L2/3 IT PIR-ENTl Glut_1,CS20230722_SUBC_009,009 L2/3 IT PIR-ENTl Glut,CS20230722_CLAS_01,01 IT-ET Glut
5543,CS20230722_CLUS_3961,3961 PSV Lmx1a Trpv6 Glut_1,CS20230722_SUPT_0902,0902 PSV Lmx1a Trpv6 Glut_1,CS20230722_SUBC_218,218 PSV Lmx1a Trpv6 Glut,CS20230722_CLAS_23,23 P Glut
3707,CS20230722_CLUS_2125,2125 ADP-MPO Trp73 Glut_2,CS20230722_SUPT_0528,0528 ADP-MPO Trp73 Glut_2,CS20230722_SUBC_118,118 ADP-MPO Trp73 Glut,CS20230722_CLAS_13,13 CNU-HYa Glut
6054,CS20230722_CLUS_4472,4472 CU-ECU-SPVI Foxb1 Glut_3,CS20230722_SUPT_0990,0990 CU-ECU-SPVI Foxb1 Glut_3,CS20230722_SUBC_246,246 CU-ECU-SPVI Foxb1 Glut,CS20230722_CLAS_24,24 MY Glut
2435,CS20230722_CLUS_0853,0853 Sst Chodl Gaba_2,CS20230722_SUPT_0239,0239 Sst Chodl Gaba_2,CS20230722_SUBC_056,056 Sst Chodl Gaba,CS20230722_CLAS_08,08 CNU-MGE GABA
5534,CS20230722_CLUS_3952,3952 PB Lmx1a Glut_6,CS20230722_SUPT_0900,0900 PB Lmx1a Glut_6,CS20230722_SUBC_217,217 PB Lmx1a Glut,CS20230722_CLAS_23,23 P Glut
3650,CS20230722_CLUS_2068,2068 MS-SF Bsx Glut_1,CS20230722_SUPT_0518,0518 MS-SF Bsx Glut_1,CS20230722_SUBC_115,115 MS-SF Bsx Glut,CS20230722_CLAS_13,13 CNU-HYa Glut


Get the `cluster_to_cluster_annotation_membership` dataframe.

In [48]:
whole_mouse_membership_df = abc_cache.get_metadata_dataframe(
    directory=whole_mouse_taxonomy_dir,
    file_name='cluster_to_cluster_annotation_membership'
)
whole_mouse_membership_df.sample(n=10, random_state=np.random.default_rng(56332))

cluster_to_cluster_annotation_membership.csv: 100%|█| 2.21M/2.21M [00:00<00:00, 


Unnamed: 0,cluster_annotation_term_label,cluster_annotation_term_set_label,cluster_alias,cluster_annotation_term_name,cluster_annotation_term_set_name,number_of_cells,color_hex_triplet
10880,CS20230722_SUBC_014,CCN20230722_SUBC,474,014 LA-BLA-BMA-PA Glut,subclass,1607,#BBFF73
12660,CS20230722_SUBC_114,CCN20230722_SUBC,2557,114 COAa-PAA-MEA Barhl2 Glut,subclass,950,#EDFF99
7237,CS20230722_SUPT_0486,CCN20230722_SUPT,1890,0486 PVpo-VMPO-MPN Hmx2 Gaba_5,supertype,194,#00CC8D
20499,CS20230722_CLAS_24,CCN20230722_CLAS,3934,24 MY Glut,class,53,#F0A0FF
13369,CS20230722_SUBC_155,CCN20230722_SUBC,4701,155 PRC-PAG Pax6 Glut,subclass,40,#3D6566
21622,CS20230722_NEUR_Glut,CCN20230722_NEUR,312,Glut,neurotransmitter,10619,#2B93DF
6720,CS20230722_SUPT_0388,CCN20230722_SUPT,1828,0388 CEA-BST Ebf1 Pdyn Gaba_5,supertype,100,#1FCC61
1960,CS20230722_CLUS_1961,CCN20230722_CLUS,1976,1961 ARH-PVp Tbx3 Gaba_1,cluster,307,#A67ACC
7203,CS20230722_SUPT_0482,CCN20230722_SUPT,1448,0482 PVpo-VMPO-MPN Hmx2 Gaba_1,supertype,143,#DB4DFF
11433,CS20230722_SUBC_053,CCN20230722_SUBC,539,053 Sst Gaba,subclass,57,#99F2FF


Sub-select so that we have a mapping specifically from `cluster_alias` to `cluster_label`.

In [49]:
whole_mouse_alias_to_term_df = whole_mouse_membership_df[
    whole_mouse_membership_df.cluster_annotation_term_set_label=='CCN20230722_CLUS'
][['cluster_annotation_term_label', 'cluster_alias']].rename(
    {'cluster_annotation_term_label': 'cluster_label'},
    axis=1
)

whole_mouse_alias_to_term_df.head(10)

Unnamed: 0,cluster_label,cluster_alias
0,CS20230722_CLUS_0001,128
1,CS20230722_CLUS_0002,129
2,CS20230722_CLUS_0003,130
3,CS20230722_CLUS_0004,143
4,CS20230722_CLUS_0005,131
5,CS20230722_CLUS_0006,116
6,CS20230722_CLUS_0007,120
7,CS20230722_CLUS_0008,121
8,CS20230722_CLUS_0009,122
9,CS20230722_CLUS_0010,125
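
Before using this table as a lookup, it can be worth confirming that it really is one-to-one, since a repeated `cluster_alias` would duplicate cells in any later join. The sketch below shows the check on a made-up three-row table. `toy_alias_to_term_df` and its shortened labels are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for whole_mouse_alias_to_term_df
toy_alias_to_term_df = pd.DataFrame({
    'cluster_label': ['CLUS_0001', 'CLUS_0002', 'CLUS_0003'],
    'cluster_alias': [128, 129, 130],
})

# each alias and each label should appear exactly once
alias_is_unique = toy_alias_to_term_df['cluster_alias'].is_unique
label_is_unique = toy_alias_to_term_df['cluster_label'].is_unique

print(alias_is_unique, label_is_unique)
```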


Join with `cell_metadata` on `cluster_alias` so that we can link `cell_label` to `cluster_label`.

In [50]:
whole_mouse_cell_metadata = whole_mouse_cell_metadata.join(
    whole_mouse_alias_to_term_df.set_index('cluster_alias'),
    on='cluster_alias'
)
whole_mouse_cell_metadata.head(10)

Unnamed: 0,cell_label,feature_matrix_label,cluster_alias,cluster_label
0,GCGAGAAGTTAAGGGC-410_B05,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
1,AATGGCTCAGCTCCTT-411_B06,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
2,AACACACGTTGCTTGA-410_B05,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
3,CACAGATAGAGGCGGA-410_A05,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
4,AAAGTGAAGCATTTCG-410_B05,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
5,GATCGTATCGAATCCA-411_B06,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
6,AGATGAAAGGACCCAA-410_A05,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
7,TCTCACGGTCAGGAGT-411_A06,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
8,GATTCTTGTTCGCGTG-410_B05,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
9,TTTCGATAGTAAAGCT-410_B05,WMB-10Xv3-HPF,1,CS20230722_CLUS_0326
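
A join like this leaves `cluster_label` empty for any cell whose `cluster_alias` has no entry in the lookup table, so it can be useful to look for such cells. The sketch below is one possible way to do that with `pandas.DataFrame.merge`, using its `validate` and `indicator` options on made-up data. The frames `toy_cells` and `toy_alias` and the alias value 99 are illustrative assumptions, and `merge` is swapped in here for the `join` call used above.

```python
import pandas as pd

# Toy cell table; alias 99 deliberately has no match in the lookup
toy_cells = pd.DataFrame({
    'cell_label': ['c0', 'c1', 'c2', 'c3'],
    'cluster_alias': [1, 1, 2, 99],
})
toy_alias = pd.DataFrame({
    'cluster_alias': [1, 2],
    'cluster_label': ['CLUS_A', 'CLUS_B'],
})

merged = toy_cells.merge(
    toy_alias,
    on='cluster_alias',
    how='left',                # keep every cell, matched or not
    validate='many_to_one',    # raise if an alias repeats in toy_alias
    indicator=True,            # add a '_merge' column describing each row
)

# cells that found no cluster_label
unmatched = merged[merged['_merge'] == 'left_only']

print(unmatched[['cell_label', 'cluster_alias']])
```

The left join keeps the row count equal to the number of cells, and the `_merge` column makes unmatched cells easy to isolate.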


Use `cluster_label` to join the rest of the taxonomy.

In [51]:
whole_mouse_cell_metadata = whole_mouse_cell_metadata.join(
    whole_mouse_taxonomy_df.set_index('cluster_label'),
    on='cluster_label'
)
whole_mouse_cell_metadata.sample(n=10, random_state=np.random.default_rng(876661))

Unnamed: 0,cell_label,feature_matrix_label,cluster_alias,cluster_label,cluster,supertype_label,supertype,subclass_label,subclass,class_label,class
2478151,TTCTCAAAGTAACCCT-063_A01,WMB-10Xv2-HY,2043,CS20230722_CLUS_1943,1943 DMH Hmx2 Gaba_3,CS20230722_SUPT_0489,0489 DMH Hmx2 Gaba_3,CS20230722_SUBC_107,107 DMH Hmx2 Gaba,CS20230722_CLAS_12,12 HY GABA
2703697,ATAGACCTCCAATGGT-042_A01,WMB-10Xv2-TH,5073,CS20230722_CLUS_2651,2651 TH Prkcd Grin2c Glut_3,CS20230722_SUPT_0656,0656 TH Prkcd Grin2c Glut_3,CS20230722_SUBC_151,151 TH Prkcd Grin2c Glut,CS20230722_CLAS_18,18 TH Glut
2050784,GAATCGTAGAGTCGAC-433_D01,WMB-10Xv3-MB,14956,CS20230722_CLUS_5214,5214 Astro-NT NN_2,CS20230722_SUPT_1160,1160 Astro-NT NN_2,CS20230722_SUBC_318,318 Astro-NT NN,CS20230722_CLAS_30,30 Astro-Epen
439177,CATTCCGTCTCTAGGA-219_A01,WMB-10Xv3-MY,4123,CS20230722_CLUS_4989,4989 PAS-MV Ebf2 Gly-Gaba_3,CS20230722_SUPT_1111,1111 PAS-MV Ebf2 Gly-Gaba_3,CS20230722_SUBC_293,293 PAS-MV Ebf2 Gly-Gaba,CS20230722_CLAS_27,27 MY GABA
353142,AAGCATCCATCGCCTT-149_B01,WMB-10Xv3-MB,3349,CS20230722_CLUS_3512,3512 PAG-MRN-RN Foxa2 Gaba_1,CS20230722_SUPT_0814,0814 PAG-MRN-RN Foxa2 Gaba_1,CS20230722_SUBC_199,199 PAG-MRN-RN Foxa2 Gaba,CS20230722_CLAS_20,20 MB GABA
379806,GACCTTCAGTGGACGT-140_A01,WMB-10Xv3-MB,3532,CS20230722_CLUS_3703,3703 SCm-PAG Cdh23 Gaba_1,CS20230722_SUPT_0848,0848 SCm-PAG Cdh23 Gaba_1,CS20230722_SUBC_206,206 SCm-PAG Cdh23 Gaba,CS20230722_CLAS_20,20 MB GABA
2594489,TTGGAACGTAGAGTGC-020_C01,WMB-10Xv2-HPF,338,CS20230722_CLUS_0384,0384 SUB-ProS Glut_1,CS20230722_SUPT_0096,0096 SUB-ProS Glut_1,CS20230722_SUBC_023,023 SUB-ProS Glut,CS20230722_CLAS_01,01 IT-ET Glut
1982734,AGGGTGAAGACCGTTT-449_B08,WMB-10Xv3-Isocortex-2,14939,CS20230722_CLUS_5225,5225 Astro-TE NN_3,CS20230722_SUPT_1163,1163 Astro-TE NN_3,CS20230722_SUBC_319,319 Astro-TE NN,CS20230722_CLAS_30,30 Astro-Epen
3963084,GTTCTCGCAGATAATG-121_B01,WMB-10Xv2-OLF,1331,CS20230722_CLUS_0545,0545 OB-in Frmd7 Gaba_2,CS20230722_SUPT_0151,0151 OB-in Frmd7 Gaba_2,CS20230722_SUBC_041,041 OB-in Frmd7 Gaba,CS20230722_CLAS_05,05 OB-IMN GABA
1911746,TCCCACAAGACGTCCC-159_A01,WMB-10Xv3-P,5230,CS20230722_CLUS_5284,5284 MOL NN_4,CS20230722_SUPT_1184,1184 MOL NN_4,CS20230722_SUBC_327,327 Oligo NN,CS20230722_CLAS_31,31 OPC-Oligo
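
Because `neurotransmitter` sits outside the hierarchy, it does not appear in the denormalized taxonomy above. One possible way to attach it is to sub-select the `CCN20230722_NEUR` rows of the membership table and join them on `cluster_alias`, following the same pattern used for `cluster_label`. The sketch below does this on made-up data. The rows and the names `toy_membership_df`, `toy_cells`, and `alias_to_nt` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for whole_mouse_membership_df
toy_membership_df = pd.DataFrame({
    'cluster_annotation_term_set_label': [
        'CCN20230722_NEUR', 'CCN20230722_CLUS',
        'CCN20230722_NEUR', 'CCN20230722_CLUS',
    ],
    'cluster_alias': [1, 1, 2, 2],
    'cluster_annotation_term_name': [
        'Glut', 'toy cluster 1', 'GABA', 'toy cluster 2'
    ],
})

# Toy stand-in for whole_mouse_cell_metadata
toy_cells = pd.DataFrame({
    'cell_label': ['c0', 'c1', 'c2'],
    'cluster_alias': [2, 1, 2],
})

# keep only the neurotransmitter rows and rename the term name column
alias_to_nt = toy_membership_df[
    toy_membership_df.cluster_annotation_term_set_label == 'CCN20230722_NEUR'
][['cluster_alias', 'cluster_annotation_term_name']].rename(
    {'cluster_annotation_term_name': 'neurotransmitter'},
    axis=1
)

toy_cells = toy_cells.join(
    alias_to_nt.set_index('cluster_alias'),
    on='cluster_alias'
)

print(toy_cells)
```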
