# Programmatic access to the MICrONS dataset using CAVE

This tutorial walks through the key functions needed to access the MICrONS dataset programmatically and highlights key resources within it. It is written for the MICrONS dataset specifically, but the underlying technology (CAVE) is used for multiple connectomics datasets.

This tutorial is designed to be run in Google Colab. Some adjustments should be made if run locally, mostly around installation and authentication — see [CAVEclient documentation](https://caveconnectome.github.io/CAVEclient/) for more information.

## CAVEclient and setup

The CAVEclient is a Python library that facilitates communication with a CAVE system. It can be installed with `pip`.

In [None]:
# Run to install caveclient in your colab instance
!pip install caveclient

and imported as usual:

In [None]:
import caveclient

## CAVE account setup

Every user needs to create a CAVE account and download a user token to access CAVE's services programmatically. Accounts are required in order to manage server traffic.
The CAVE infrastructure is described in more detail in our [preprint](https://www.biorxiv.org/content/10.1101/2023.07.26.550598v1).
The MICrONS data is publicly available, which means that a new user account needs no extra permissions to access the data.
Bulk downloads of some static data are also available without an account on [MICrONS Explorer](https://microns-explorer.org/).

A Google account (or Google-enabled account) is required to create a CAVE account.

### Start here if you do not have a CAVE account or are not sure

Log in to CAVE to set up a new account. To do this, go to this [website](https://minnie.microns-daf.com/materialize/views/datastack/minnie65_public).

### Once you have an account: Set up your token

Create a new token by running the next cell. Then copy the token and paste it into the cell after that. Run these two cells together to make sure that the correct token is stored on your machine. You can copy your token and store it on as many machines as you like. If you think your token has been compromised, reset it by rerunning the following cell.

In [None]:
client = caveclient.CAVEclient()
client.auth.setup_token(make_new=True)

### Set or save your token

From the website that just opened up, paste your token here:

In [None]:
my_token = "PASTE_TOKEN_HERE"

If you run this on your local machine, save the token so that it is loaded automatically in the future.

```python
client.auth.save_token(token=my_token, overwrite=True)
```

## Libraries for this notebook

This notebook makes use of a few Python libraries for analysis and visualization. These are installed by default in Colab, but if you are running this notebook locally, you may need to install them via pip.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt

## Initialize CAVEclient with a datastack

Datasets in CAVE are organized as datastacks. These are a combination of an EM dataset, a segmentation and a set of annotations. The datastack for MICrONS public release is `minnie65_public`. When you instantiate your client with this datastack, it loads all relevant information to access it.

In [None]:
datastack_name = "minnie65_public"
client = caveclient.CAVEclient(datastack_name, auth_token=my_token)

## Materialization versions


Properties such as synapses, cell types, and proofreading status are stored in a CAVE database and retrieved through queries.
All data in CAVE is timestamped for complete reproducibility. For convenience, data is periodically "materialized", where a copy of the database is stored at a specific timestamp and assigned a "version" number. Roughly quarterly, a permanent materialization version is released as the latest public version.

The various CAVEclient functions related to querying these materialized databases are available under `client.materialize`. 

For example, to see which versions are available:

In [None]:
client.materialize.get_versions()

And these are their associated timestamps (all timestamps are in UTC):

In [None]:
for version in client.materialize.get_versions():
    print(f"Version {version}: {client.materialize.get_timestamp(version)}")

The client will automatically query the latest materialization version, but note that you can specify a `materialization_version` for every query if you want to access a specific version.
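
To keep an analysis reproducible, it can help to pin one version and pass it to every query. The following is a minimal sketch, assuming a `client` created as above; `query_pinned` is a hypothetical helper name used for illustration and is not part of CAVEclient.

```python
def query_pinned(client, table_name, version, **kwargs):
    """Query `table_name` at a fixed materialization version.

    Hypothetical helper for illustration: it only forwards its arguments
    to `client.materialize.query_table`.
    """
    return client.materialize.query_table(
        table_name, materialization_version=version, **kwargs
    )


# Example usage (requires an authenticated client, so it is not run here):
# version = ...  # any version listed by client.materialize.get_versions()
# nucleus_df = query_pinned(client, "nucleus_detection_v0", version)
```

Recording the version number alongside your results makes it possible to rerun the same queries later against the same state of the data.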

## Tables and generally useful information

A datastack has a large number of tables that can be intimidating to traverse at first. CAVE provides several ways to find the tables you may want to use. To list all available tables, run:

In [None]:
client.materialize.get_tables()

For each datastack, CAVE stores information about key data sources and parameters. These can be accessed through:

In [None]:
client.info.get_datastack_info()

For instance, the table with synapses is named `synapses_pni_2` and the table with one entry per cell body (i.e. soma) is named `nucleus_detection_v0`.

## Query 1: Querying cells and their types

### Querying cell bodies

The basic querying function for CAVE is `client.materialize.query_table`. This accepts a table name as the only required parameter and optionally some filters. Let's query the table of all automatically segmented nuclei:

In [None]:
nucleus_table_name = client.info.get_datastack_info()["soma_table"] # We just saw that this should be `nucleus_detection_v0`
nucleus_df = client.materialize.query_table(nucleus_table_name)
nucleus_df.head(5)

Every annotation table has at least one position column (here: `pt_position`) which serves as an anchor to the segmentation. These positions are automatically associated with the segmentation using `pt_root_id`s, which can be thought of as segment or cell IDs. Beyond positions and their associated IDs, every table stores metadata. For instance, the nucleus table contains the `volume` of each cell body. Note that `pt_root_id=0` is a special value indicating that the point position does not have a segmentation. This can happen if it is on the edge of the imagery or on top of a masked-out image artifact.
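
For most analyses, rows without a segment are best excluded. The sketch below shows the filter on a small made-up table (the IDs and volumes are invented for illustration); the same boolean mask would apply to a DataFrame returned by `query_table`.

```python
import pandas as pd

# Toy stand-in for an annotation table; all values are made up.
toy_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "pt_root_id": [864691135000000001, 0, 864691135000000002, 0],
        "volume": [250.0, 90.5, 310.2, 12.3],
    }
)

# Keep only rows whose point landed on a segment (pt_root_id != 0).
segmented_df = toy_df[toy_df["pt_root_id"] != 0].reset_index(drop=True)
print(len(toy_df), "rows total,", len(segmented_df), "with a segment")
```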

Every table has a description and metadata attached to it that describes how the data was generated, limitations of it, and papers to cite when using it:

In [None]:
client.materialize.get_table_metadata(nucleus_table_name)

### Location vs depth

As a first analysis, we will plot the depth location vs the size of each cell nucleus. `query_table` has additional parameters to modify the results and standardize returns that make such an analysis easier. Using `desired_resolution`, the resolution of all position columns can be defined in nanometers (thus 1000 = 1 µm). Using `split_positions`, the x, y, and z position columns are separated.

In [None]:
nucleus_df = client.materialize.query_table(nucleus_table_name, desired_resolution=[1000, 1000, 1000], split_positions=True)
nucleus_df.head(5)
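
Conceptually, `desired_resolution` rescales the stored coordinates to the requested units, and `split_positions` spreads each position over one column per axis. The sketch below reproduces that on made-up data with plain pandas. The stored resolution of 4 x 4 x 40 nm per voxel is an assumption for illustration (check the table metadata for the actual value), and this is not meant to describe how CAVEclient implements the conversion internally.

```python
import numpy as np
import pandas as pd

# Assumed stored resolution in nm (x, y, z); check the table metadata for the real value.
stored_resolution = np.array([4, 4, 40])
desired_resolution = np.array([1000, 1000, 1000])  # 1000 nm = 1 µm

# Made-up positions in voxel coordinates.
toy_df = pd.DataFrame(
    {"pt_position": [[250000, 125000, 20000], [100000, 200000, 15000]]}
)

# Rescale: voxels -> nm -> desired units.
positions = np.array(toy_df["pt_position"].tolist())
scaled = positions * stored_resolution / desired_resolution

# Split into one column per axis, mirroring the pt_position_x/y/z naming.
for i, axis in enumerate("xyz"):
    toy_df[f"pt_position_{axis}"] = scaled[:, i]
toy_df = toy_df.drop(columns="pt_position")
print(toy_df)
```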

Note that `y` is approximately along the depth axis here, increasing with deeper locations. There is a small tilt (roughly 5 degrees), and information about how to adjust for this tilt and align y=0 with the pial surface can be found [in the MICrONS tutorial](https://alleninstitute.github.io/microns_tutorial/programmatic_access/em_py_07_coordinates.html).

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
ax.tick_params(labelsize=14)
sns.scatterplot(data=nucleus_df, x="volume", y="pt_position_y", size=1, edgecolor=None, alpha=.01, color="k", ax=ax, legend=False)
ax.invert_yaxis()
ax.set_xlabel(r"Volume ($\mu m^3$)", fontsize=16)
ax.set_ylabel(r"Depth ($\mu m$)", fontsize=16)
ax.set_xlim(0, 500)
plt.show()

### Querying cell type information 

There are two distinct ways cell types were classified in the MICrONS dataset: manual and automated. Manual annotations are available for ~1,300 neurons (`allen_v1_column_types_slanted_ref`), and automated classifications based on these manual annotations are available for all cell bodies (`aibs_metamodel_celltypes_v661`). Because these tables annotate existing annotations (the nuclei), they are implemented as "reference" tables:

In [None]:
ct_df = client.materialize.query_table("aibs_metamodel_celltypes_v661", desired_resolution=[1000, 1000, 1000], split_positions=True,
                                       merge_reference=False)

ct_df.head(5)

Reference annotations contain `target_id` to merge them onto the table they target (here: the nucleus table). But do not worry, CAVE automatically merges them onto their target table by default (`merge_reference=True`):

In [None]:
ct_df = client.materialize.query_table("aibs_metamodel_celltypes_v661", desired_resolution=[1000, 1000, 1000], split_positions=True)
# remove segments with merged cell bodies — these are generally rare
ct_df = ct_df.drop_duplicates("pt_root_id", keep=False)
ct_df.head(5)
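
The effect of `merge_reference=True` can be pictured as a join between the reference table's `target_id` and the ID column of the table it targets. The sketch below mimics this on made-up data with plain pandas; the values are invented and the `id` column name on the target table is an assumption for illustration. It then applies the same `drop_duplicates(..., keep=False)` step as the cell above.

```python
import pandas as pd

# Made-up target table (think: nuclei) and reference table (think: cell types).
nuc_df = pd.DataFrame(
    {
        "id": [10, 11, 12, 13],
        "pt_root_id": [501, 502, 502, 503],
        "volume": [250.0, 300.0, 280.0, 90.0],
    }
)
ref_df = pd.DataFrame(
    {"target_id": [10, 11, 12, 13], "cell_type": ["23P", "4P", "4P", "astrocyte"]}
)

# Join each reference annotation onto the row it targets.
merged_df = ref_df.merge(nuc_df, left_on="target_id", right_on="id", how="inner")

# keep=False drops every row of a duplicated root ID, so a segment that
# contains more than one cell body (502 here) is removed entirely.
clean_df = merged_df.drop_duplicates("pt_root_id", keep=False)
print(clean_df[["pt_root_id", "cell_type", "volume"]])
```

With `keep="first"` one of the two nuclei in segment 502 would survive, which would silently assign a merged segment to a single cell body.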

The reference table added two data columns: `classification_system` and `cell_type`. `classification_system` divides the cells into excitatory and inhibitory neurons as well as non-neuronal cells. `cell_type` provides finer-grained cell annotations.

In [None]:
ct_df["classification_system"].value_counts()

In [None]:
ct_df["cell_type"].value_counts()

### Location vs depth + Cell type


Because the cell type table contains the information about the nuclei, we can use it to plot the locations of all cell bodies as well and label them by type.

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
ax.tick_params(labelsize=14)
sns.scatterplot(
    data=ct_df,
    x="volume",
    y="pt_position_y",
    s=5,
    edgecolor=None,
    alpha=.15,
    color="k",
    ax=ax, 
    legend=True,
    hue="classification_system",
)

ax.invert_yaxis()
ax.set_xlabel(r"Volume ($\mu m^3$)", fontsize=16)
ax.set_ylabel(r"Depth ($\mu m$)", fontsize=16)
ax.set_xlim(0, 500)

leg = plt.legend()
for lh in leg.legend_handles: 
    lh.set_alpha(1)

plt.show()

## Query 2: Querying synapses and proofread neurons

### Proofread neurons

Proofreading is necessary to obtain accurate reconstructions of a cell. In the MICrONS dataset, the general rule is that dendrites of cells with a cell body are sufficiently proofread to trust synaptic connections onto a cell. Axons, on the other hand, require so much proofreading that only ~1,000 cells have proofread axons whose outputs can be used for analysis.

The table `proofreading_status_and_strategy` contains proofreading information about ~1,200 neurons. Axons annotated as `status_axon = "t"` and `clean` can be used for analysis. We can obtain such cells by adding a filter to our query; the cells below first list the values of `strategy_axon` and then filter on it:

In [None]:
proof_all_df = client.materialize.query_table("proofreading_status_and_strategy")

In [None]:
proof_all_df["strategy_axon"].value_counts()

In [None]:
proof_df = client.materialize.query_table("proofreading_status_and_strategy", filter_in_dict={"strategy_axon": ["axon_partially_extended", "axon_fully_extended", "axon_interareal"]})
proof_df
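
`filter_in_dict` keeps the rows whose column value appears in the given list, which corresponds to `isin` in pandas. The sketch below shows the local equivalent on made-up data: the root IDs are invented, and `some_other_strategy` is a hypothetical label used only to have a row that gets filtered out.

```python
import pandas as pd

toy_proof_df = pd.DataFrame(
    {
        "pt_root_id": [1, 2, 3, 4],  # invented IDs
        "strategy_axon": [
            "axon_fully_extended",
            "some_other_strategy",  # hypothetical label, for illustration only
            "axon_partially_extended",
            "axon_interareal",
        ],
    }
)

keep = ["axon_partially_extended", "axon_fully_extended", "axon_interareal"]

# Local equivalent of filter_in_dict={"strategy_axon": keep}
filtered_df = toy_proof_df[toy_proof_df["strategy_axon"].isin(keep)]
print(filtered_df)
```

Filtering in the query rather than locally means that only the matching rows are transferred, which matters for large tables.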

In [None]:
# Inspect the edit history (change log) of one proofread cell
root_id = 864691135808684573
client.chunkedgraph.get_tabular_change_log(root_id)[root_id]

### Synapse query

The MICrONS dataset relies on automatically detected synapses for connectivity information. The consortium automatically detected and associated a total of 337 million synaptic clefts. The detections were evaluated by manually identifying synapses in 70 small subvolumes (n=8,611 synapses) distributed across the dataset, giving the automated detection an estimated precision of 96% and recall of 89% with a partner assignment accuracy of 98%.

We can query the synapse table directly. However, it is too large to query all at once. CAVE limits queries to 500,000 rows at a time and displays a warning when that limit is reached. Here, we demonstrate a query with the limit set to 10:

In [None]:
synapse_table_name = client.info.get_datastack_info()["synapse_table"]
syn_df = client.materialize.query_table(synapse_table_name, limit=10)
syn_df

Instead we need to limit our query to a few neurons. We can query the graph spanned by the neurons with "clean" axons using the `filter_in_dict` parameter:

In [None]:
%%time 

synapse_table_name = client.info.get_datastack_info()["synapse_table"]
syn_df = client.materialize.query_table(synapse_table_name, 
                                        filter_in_dict={"pre_pt_root_id": proof_df["pt_root_id"], 
                                                        "post_pt_root_id": proof_df["pt_root_id"]})

# remove internal synapses — almost entirely false detections
syn_df = syn_df[syn_df["pre_pt_root_id"] != syn_df["post_pt_root_id"]]
syn_df

In [None]:
# A similar query using the dedicated convenience function for synapses
syn_df = client.materialize.synapse_query(pre_ids=proof_df["pt_root_id"], post_ids=proof_df["pt_root_id"])
syn_df
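
When a list of root IDs is so long that a single query risks hitting the row limit, one option is to split the list into batches and concatenate the results. The following is a minimal sketch: `query_in_batches` and its `query_fn` argument are hypothetical names for illustration, not part of CAVEclient, and the batch size is arbitrary.

```python
import pandas as pd


def query_in_batches(query_fn, root_ids, batch_size=100):
    """Call `query_fn` on successive slices of `root_ids` and concatenate.

    `query_fn` is any callable that takes a list of IDs and returns a
    DataFrame. Hypothetical helper, not part of CAVEclient.
    """
    root_ids = list(root_ids)
    frames = [
        query_fn(root_ids[start:start + batch_size])
        for start in range(0, len(root_ids), batch_size)
    ]
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Example usage with the client from above (not executed here):
# out_df = query_in_batches(
#     lambda ids: client.materialize.synapse_query(pre_ids=ids),
#     proof_df["pt_root_id"],
# )
```

A single batch can still exceed the limit if its cells have very many synapses, so it remains worth checking for the row-limit warning and reducing `batch_size` if it appears.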

Unlike the nucleus table, the synapse table has two points that are associated with segments (`pre_pt_position` and `post_pt_position`). The associated root ID columns are `pre_pt_root_id` and `post_pt_root_id`.

Using the pandas `pivot_table` function, we can transform this table into a matrix and plot it:

In [None]:
syn_mat = syn_df.pivot_table(index="pre_pt_root_id", columns="post_pt_root_id", values="size", aggfunc="sum")

# Squaring the matrix
syn_mat = syn_mat.reindex(columns=syn_mat.index)

In [None]:
sns.heatmap(np.log2(syn_mat), cmap="magma", xticklabels=[], yticklabels=[])
plt.show()
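
To see what the pivot and the squaring step do, here is the same recipe applied to a made-up synapse table (IDs and sizes are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up synapses; cell 4 only receives input and never appears as "pre".
toy_syn_df = pd.DataFrame(
    {
        "pre_pt_root_id": [1, 1, 2, 3, 1],
        "post_pt_root_id": [2, 2, 3, 1, 4],
        "size": [100, 50, 200, 80, 30],
    }
)

# Sum synapse sizes per (pre, post) pair; pairs without synapses become NaN.
toy_mat = toy_syn_df.pivot_table(
    index="pre_pt_root_id", columns="post_pt_root_id", values="size", aggfunc="sum"
)

# Use the same IDs, in the same order, on both axes.
toy_mat = toy_mat.reindex(columns=toy_mat.index)
print(toy_mat)
```

Two details are worth noting. Pairs without synapses stay `NaN`, and since `np.log2` of `NaN` is `NaN`, they appear blank in the heatmap. Reindexing the columns to the row index also discards postsynaptic IDs that never occur as presynaptic cells (cell 4 in this toy example).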

## Query 3: Functional properties

Before acquiring the EM dataset, the activity of the cells in the same region was recorded using calcium imaging (excitatory neurons only). Joint analysis of the connectivity and functional data is possible for neurons that have been coregistered between the two datasets. Currently there are 10,630 such neurons; some of them were imaged in multiple sessions.

The raw fluorescence traces of all neurons (also available as spike trains) can be accessed via DANDI: https://dandiarchive.org/dandiset/000402. The fluorescence files are large, but DANDI supports streaming via the `dandi` and `pynwb` Python packages. The coregistration is stored in the table `coregistration_manual_v3`, but currently extra information from [this csv](https://github.com/sdorkenw/MICrONS_workshop/blob/main/data/functional_coreg_unit_lookup_all_sessions.csv) is required to complete the matching to the DANDI archive.

Here, we will skip the access to the functional traces and instead work with extracted functional properties of these neurons. The table `functional_properties_v3_bcm` contains functional properties as outlined in the description:

In [None]:
print(client.materialize.get_table_metadata("functional_properties_v3_bcm")["description"])

In [None]:
func_df = client.materialize.query_table("functional_properties_v3_bcm")
func_df = func_df.drop_duplicates("pt_root_id", keep="first")
func_df

In [None]:
sns.histplot(data=func_df, x="gDSI")

In [None]:
sns.histplot(data=func_df[func_df["gDSI"] > .05], x="pref_dir")

Now we can do something fun and link synaptic connectivity information with functional properties. Let's query the targets of a neuron with a lot of synapses and plot their functional properties.

In [None]:
syn_df.value_counts('pre_pt_root_id').head(20)

In [None]:
pre_root_id = 864691135594657067
cell_syn_df = client.materialize.synapse_query(pre_ids=pre_root_id)

cell_syn_func_df = cell_syn_df.merge(
    func_df.drop_duplicates('pt_root_id')[['pt_root_id', 'gOSI', 'gDSI', 'pref_ori', 'pref_dir']],
    left_on='post_pt_root_id',
    right_on='pt_root_id',
)
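
Because `merge` performs an inner join by default, synapses onto targets without functional properties drop out of the result. The sketch below illustrates this on made-up data (IDs, sizes, and tuning values are invented):

```python
import pandas as pd

# Made-up synapses from one presynaptic cell onto targets 11, 12, and 13.
toy_syn_df = pd.DataFrame(
    {"post_pt_root_id": [11, 11, 12, 13], "size": [100, 50, 200, 80]}
)

# Made-up functional properties; target 13 has no functional data.
toy_func_df = pd.DataFrame(
    {"pt_root_id": [11, 12], "gDSI": [0.30, 0.02], "pref_dir": [1.57, 3.14]}
)

# Default inner join: the synapse onto target 13 is dropped.
toy_merged_df = toy_syn_df.merge(
    toy_func_df, left_on="post_pt_root_id", right_on="pt_root_id"
)
print(toy_merged_df)
```

The merged table has one row per synapse, so a target that receives several synapses contributes several times to any histogram drawn from it.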

In [None]:
sns.histplot(
    data=cell_syn_func_df.query('gDSI>0.05'),
    x="pref_dir",
    bins=50,
    stat='percent',
    color=(0.5, 0.7, 0.7),
)
sns.histplot(
    data=func_df[func_df["gDSI"] > .05],
    x="pref_dir",
    bins=50,
    stat='percent',
    element='step',
    fill=False,
    color='r',
    linewidth=3
)