# Get to Know a Dataset: EMBER

This notebook serves as a guided tour of the [EMBER](https://emberarchive.org) Open Data bucket. EMBER is the Ecosystem for Multi-modal Brain-behavior Experimentation and Research.

More usage examples, tutorials, and documentation for this dataset and others can be found at the [Registry of Open Data on AWS](https://registry.opendata.aws/ember).

### Q: How is the EMBER Open Data Bucket organized?

To understand the organization of the EMBER Open Data bucket, it's important to understand the organization of an EMBER project.

An EMBER project is the top-level organizational unit for EMBER. Within an EMBER project, public data and additional metadata are most commonly stored as dandisets in [EMBER-DANDI](https://dandi.emberarchive.org/). In special cases, EMBER also supports storing other forms of datasets.

The EMBER Open Data bucket is organized into three sections, as follows:

1. [EMBER-dandisets](https://dandi.emberarchive.org/)
    - EMBER-dandisets are stored using the prefixes blobs/ and dandisets/
2. Other EMBER Data
    - Other forms of datasets are stored using the prefix other/
3. Tools
    - EMBER tools are stored using the prefix tools/
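
As a minimal sketch of the layout above, the following helper maps an S3 object key to the bucket section it belongs to. The name `bucket_section` and the section labels are illustrative and not part of any EMBER tooling; only the prefixes come from the list above.

```python
# Prefix-to-section mapping taken from the list above; the labels are illustrative.
SECTION_PREFIXES = {
    "blobs/": "EMBER-dandisets",
    "dandisets/": "EMBER-dandisets",
    "other/": "Other EMBER Data",
    "tools/": "Tools",
}


def bucket_section(key: str) -> str:
    """Return the bucket section for an S3 object key, or 'unknown'."""
    for prefix, section in SECTION_PREFIXES.items():
        if key.startswith(prefix):
            return section
    return "unknown"


print(bucket_section("other/kumar2025/pose_files/"))  # Other EMBER Data
```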


For this tutorial, we will demonstrate using 2 EMBER projects:  
- [Kumar2025](https://emberarchive.org/project/kumar2025)
- [Shepherd2025 - EMBER-DANDI:000463](https://dandi.emberarchive.org/dandiset/000463)


First we will import the Python libraries required throughout this notebook.

In [None]:
# This notebook requires the following additional libraries
# (please install using the preferred method for your environment, e.g. uv, pip, conda):
#
# "boto3>=1.42.29"
# "h5py>=3.15.1"
# "matplotlib>=3.10.8"
# "pynwb>=3.1.3"
# "requests>=2.32.5"

import boto3
import h5py
import matplotlib.pyplot as plt
import numpy as np
import requests

from botocore import UNSIGNED
from botocore.config import Config
from pynwb import NWBHDF5IO
from urllib.parse import urlparse

Next, we will define the location of our EMBER Open Data Bucket and create our boto3 S3 client.

In [None]:
# EMBER S3 bucket
bucket = "ember-open-data"

# Create an S3 client with boto3. Because this is a public bucket, requests can be sent unsigned,
# which allows anonymous access without AWS credentials.
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Print the top-level prefixes to see the sections of the EMBER Open Data bucket
for item in s3.list_objects_v2(Bucket=bucket, Delimiter='/')['CommonPrefixes']:
    print(item['Prefix'])

At the top level of the EMBER Open Data bucket, we see that prefixes correspond to EMBER-dandisets (blobs/ and dandisets/), other datasets (other/), and tools (tools/) as described in the previous section.
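
A single `list_objects_v2` call returns at most 1,000 entries, which is enough for the top level of the bucket but may not be for larger prefixes. The sketch below uses boto3's built-in paginator to collect prefixes across all pages. The helper name `list_common_prefixes` is hypothetical, and it takes the client as an argument so that it can be reused with the `s3` client created above.

```python
def list_common_prefixes(s3_client, bucket, prefix=""):
    """Collect every 'directory-like' prefix under `prefix`, across all pages."""
    paginator = s3_client.get_paginator("list_objects_v2")
    prefixes = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # A page has no 'CommonPrefixes' key when it holds none, so default to [].
        for item in page.get("CommonPrefixes", []):
            prefixes.append(item["Prefix"])
    return prefixes
```

With the client and bucket name defined above, `list_common_prefixes(s3, bucket)` would be expected to return the same top-level prefixes as the cell above.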

In the code blocks below, we will dive into each EMBER project.

**Kumar 2025**
- EMBER Open Data bucket S3 Prefix: `other/kumar2025/`
- Data can also be accessed through our [EMBER Project File Browser: Kumar2025](https://ember-open-data.s3.us-east-1.amazonaws.com/other/kumar2025/index.html)


We will see that data is organized into classifiers, pose_files, and videos.


In [None]:
# Kumar2025
project = "kumar2025"

print(f"Prefixes within the project: /other/{project}")
# List the key prefixes within the top level of the Kumar2025 dataset
for item in s3.list_objects_v2(Bucket=bucket, Prefix=f'other/{project}/', Delimiter='/', MaxKeys=10)['CommonPrefixes']:
    print(item['Prefix'])

**Shepherd 2025 - EMBER-DANDI:000463**
- Data can also be accessed through [EMBER-DANDI File Browser: EMBER-DANDI:000463](https://dandi.emberarchive.org/dandiset/000463/draft/files)

We will see that data within the dandiset is organized by subject.

Please note that the organization of data within a dandiset does not have a direct correspondence to the organization of data within the S3 bucket. In the steps below, we will see how to get the full S3 bucket path.

In [None]:
# Shepherd 2025 - EMBER-DANDI:000463
dandiset = "000463"
dandiset_version = "draft"
dandi_api_base = "https://api-dandi.emberarchive.org/api"

response = requests.get(f"{dandi_api_base}/dandisets/{dandiset}/versions/{dandiset_version}/assets/paths")
resp_json = response.json()

paths = set()
for asset_file in resp_json["results"]:
    paths.add(asset_file["path"])

paths = sorted(paths)

for path in paths[:10]:
    print(path)
print("... (only listing the first 10 paths)\n")
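
The cell above reads only the first page that the API returns. The sketch below shows one way to walk through every page, assuming the response carries a `next` field with the URL of the following page (or `None` on the last page). That field is an assumption about this API, since only `results` appears above, so check the response before relying on it. The helper `collect_paths` and its `fetch_json` parameter are hypothetical names.

```python
def collect_paths(fetch_json, first_url):
    """Follow `next` links and return every asset path, sorted.

    `fetch_json` is any callable that takes a URL and returns the decoded
    JSON body, e.g. `lambda url: requests.get(url, timeout=30).json()`.
    """
    paths = set()
    url = first_url
    while url:
        page = fetch_json(url)
        for entry in page.get("results", []):
            paths.add(entry["path"])
        # Assumption: `next` holds the next page URL, or None when done.
        url = page.get("next")
    return sorted(paths)
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP client, which also makes it easy to try out with canned responses.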


### Q: What data formats are present in your dataset? What kinds of data are stored using these formats? Can you give any advice for how you work with these data formats?

EMBER is the data archive for multimodal neurophysiological and behavioral datasets.

Data in our datasets are stored as a set of assets, organized into EMBER-Dandisets. You can find documentation at https://emberarchive.org/documentation.

The landing page of each EMBER-Dandiset contains important information: metadata provided by the owners (contact information, description, license, access information, and keywords), simple statistics such as size and number of files, and a summary of the Dandiset covering species, techniques, and standards.

Data within a dataset are organized in one of two ways. First, assets may be organized via the BIDS standard (https://bids.neuroimaging.io/index.html) and accessed via the pybids library (https://github.com/bids-standard/pybids). Second, assets may be stored as Neurodata Without Borders (NWB) files (https://nwb.org/), an HDF5-based format for hierarchical physiological data, which can be read with the PyNWB library (https://pynwb.readthedocs.io/en/stable/). Note that BIDS datasets can sometimes contain NWB files in subdirectories, so both libraries may need to be used in tandem.
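
Because one dataset can mix formats, a useful first step is to sort asset paths by file type before choosing a reader. The sketch below does this with the standard library only. The helper `group_by_format` and the example paths are hypothetical and are not taken from an EMBER dandiset.

```python
from pathlib import PurePosixPath


def group_by_format(asset_paths):
    """Split asset paths into NWB files and everything else."""
    groups = {"nwb": [], "other": []}
    for path in asset_paths:
        is_nwb = PurePosixPath(path).suffix.lower() == ".nwb"
        groups["nwb" if is_nwb else "other"].append(path)
    return groups


# Hypothetical paths, for illustration only.
example = [
    "sub-01/sub-01_behavior.nwb",
    "sub-01/sub-01_video.mp4",
    "dataset_description.json",
]
print(group_by_format(example))
```

Files in the `nwb` group can then be opened with PyNWB, while the remaining files follow whatever layout the dataset uses.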

### Q: Can you show us an example of downloading and loading data from your dataset?

As an example, we will load a data file from each project.

**Kumar 2025**

We will look at a single HDF5 file stored in this project.

- Data can also be browsed through our [EMBER Project File Browser: Kumar2025](https://ember-open-data.s3.us-east-1.amazonaws.com/other/kumar2025/index.html#pose_files%2F)

In [None]:
# Kumar 2025

print(f"\nList of files in: /other/{project}/pose_files")
# List files within other/kumar2025/pose_files/
objects = s3.list_objects_v2(Bucket=bucket, Prefix=f'other/{project}/pose_files/', Delimiter='/', MaxKeys=10)['Contents']
for item in objects:
    print(item['Key'])
print("... (only listing the first 10 files)\n")

# Select 1 HDF5 file to explore further
h5_path = objects[0]['Key']
local_h5_file = f"kumar2025-{h5_path.split('/')[-1]}"

# Download the file
s3.download_file(bucket, h5_path, local_h5_file)
print(f"Downloaded local file:\n\t{local_h5_file}")

**Shepherd 2025 - EMBER-DANDI:000463**

We will look at a single NWB file stored in this dandiset.

We will see that data files can be accessed directly via S3 or using the EMBER-DANDI API:
- https://ember-open-data.s3.amazonaws.com/blobs/a67/f98/a67f98c3-ffb8-4c40-8b2f-45706e6bf8a9
- https://api-dandi.emberarchive.org/api/assets/2d0c4695-a091-48c3-b70c-61b5567ef515/download/
- Data can also be browsed through the [EMBER-DANDI File Browser](https://dandi.emberarchive.org/dandiset/000463/draft/files?location=sub-ADPTM01).
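
In the S3 URL above, the key appears to nest the blob under two directories taken from the start of its ID (`a67`, then `f98`). Note that this blob ID differs from the asset ID used in the API URL. The sketch below encodes that pattern. It is inferred from this single example rather than from a documented guarantee, and `blob_key` is a hypothetical helper, so the asset metadata remains the dependable source for the real location.

```python
def blob_key(blob_id: str) -> str:
    """Build the S3 key for a blob ID, following the layout in the example URL."""
    # Assumption: the two directory levels are the first and second
    # three-character groups of the blob ID.
    return f"blobs/{blob_id[:3]}/{blob_id[3:6]}/{blob_id}"


print(blob_key("a67f98c3-ffb8-4c40-8b2f-45706e6bf8a9"))
# blobs/a67/f98/a67f98c3-ffb8-4c40-8b2f-45706e6bf8a9
```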

In [None]:
# Shepherd 2025 - EMBER-DANDI:000463

# Use the first file path from above
file_path = paths[0]

# Query Dandiset assets for files with the above file path
response = requests.get(f"{dandi_api_base}/dandisets/{dandiset}/versions/{dandiset_version}/assets/?path={file_path}&metadata=false&zarr=false")

# Filter out non-NWB files (for this demo)
assets = response.json()["results"]
nwb_assets = [asset for asset in assets if asset["path"].endswith(".nwb")]

# Select 1 NWB file to explore further
file = nwb_assets[0]
asset_id = file["asset_id"]
print(f"Asset Path:\n\t{file['path']}")

# Get asset metadata
asset_response = requests.get(f"{dandi_api_base}/dandisets/{dandiset}/versions/{dandiset_version}/assets/{asset_id}")
asset_access_urls = asset_response.json()["contentUrl"]

print("Asset Access Methods:")
for access_method in asset_access_urls:
    print(f"\t{access_method}")

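# Get the S3 access URL (selected by hostname rather than by its position in the list)
s3_access_url = next(url for url in asset_access_urls if urlparse(url).netloc.startswith(f"{bucket}.s3"))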
# Strip the S3 URL to get the prefix path
s3_path = urlparse(s3_access_url).path.lstrip("/")
local_nwb_file = f"dandiset000463-{file['path'].split('/')[-1]}"

# Download the file
s3.download_file(bucket, s3_path, local_nwb_file)
print(f"Downloaded local file:\n\t{local_nwb_file}")
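
The download step above relies on turning an S3 access URL into an object key. The sketch below shows that conversion as a standalone function for virtual-hosted-style URLs such as `https://ember-open-data.s3.amazonaws.com/...`. The name `split_s3_url` is hypothetical, and the function does not handle percent-encoded keys or path-style URLs.

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split a virtual-hosted-style S3 URL into (bucket, key)."""
    parsed = urlparse(url)
    host = parsed.netloc
    if ".s3" not in host or not host.endswith("amazonaws.com"):
        raise ValueError(f"Not a virtual-hosted-style S3 URL: {url}")
    # The bucket name is everything in the hostname before ".s3".
    bucket_name = host.split(".s3", 1)[0]
    # The object key is the URL path without its leading slash.
    key = parsed.path.lstrip("/")
    return bucket_name, key


print(split_s3_url(
    "https://ember-open-data.s3.amazonaws.com/blobs/a67/f98/a67f98c3-ffb8-4c40-8b2f-45706e6bf8a9"
))
# ('ember-open-data', 'blobs/a67/f98/a67f98c3-ffb8-4c40-8b2f-45706e6bf8a9')
```

Raising an error for non-S3 URLs makes it obvious when an API download link is passed in by mistake.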

Now, we will read some basic information from the files.

**Kumar 2025**

In [None]:
# Kumar 2025

# Open the HDF5 file
with h5py.File(local_h5_file, "r") as f:
    print("Top-level keys (groups):", list(f.keys()))

    def show_h5(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"  DATASET {name} shape={obj.shape}")
        else:
            print(f"GROUP   {name}")

    # Print each Group and Dataset
    f.visititems(show_h5)

**Shepherd 2025 - EMBER-DANDI:000463**

NWB files can also be accessed and explored through the [EMBER-DANDI File Browser: EMBER-DANDI:000463](https://dandi.emberarchive.org/dandiset/000463/draft/files?location=sub-ADPTM01).

- Select "Open With" and then select one of:
    - "Neurosift"
        - [Neurosift](https://neurosift.app/) is a browser-based tool designed for visualizing neuroscience data with a focus on NWB files.
        - Example: https://neurosift.app/nwb?url=https://api-dandi.emberarchive.org/api/assets/2d0c4695-a091-48c3-b70c-61b5567ef515/download/&dandisetId=000463&dandisetVersion=draft
    - "MetaCell/NWBExplorer"
        - [NWB Explorer](https://nwbexplorer.v2.opensourcebrain.org/) is a browser-based tool for visualizing and understanding neurophysiology data formatted as an NWB file.

In [None]:
# Shepherd 2025 - EMBER-DANDI:000463

# Open the NWB file
with NWBHDF5IO(local_nwb_file, "r") as io:
    nwbfile = io.read()

    print("Session description:", nwbfile.session_description)
    print("Start time:", nwbfile.session_start_time)

    print("\nWhat's in this file?")
    print("  Acquisition:\t", list(nwbfile.acquisition.keys()))
    print("  Processing:\t", list(nwbfile.processing.keys()))
    print("  Intervals:\t", list(nwbfile.intervals.keys()))
    print("  Stimulus:\t", list(nwbfile.stimulus.keys()))
    print("  Analysis:\t", list(nwbfile.analysis.keys()))

To learn more about NWB, please see the following links:

- NWB overview: https://nwb.org/
- PyNWB documentation: https://pynwb.readthedocs.io/
- PyNWB tutorials: https://pynwb.readthedocs.io/en/stable/tutorials/index.html
- Common NWB structures: https://pynwb.readthedocs.io/en/stable/tutorials/domain.html

### Q: A picture is worth a thousand words. Show us a visual (or several!) from your dataset that either illustrates something informative about your dataset, or that you think might excite someone to dig in further.


**Shepherd 2025 - EMBER-DANDI:000463**

This figure visualizes pose-estimation data from the NWB file, showing how a tracked body point moves over time. The time axis is reconstructed from the sampling rate stored in the file.

In [None]:
# Shepherd 2025 - EMBER-DANDI:000463

# Open the NWB file
with NWBHDF5IO(local_nwb_file, "r") as io:
    nwbfile = io.read()

    pose = nwbfile.processing["behavior"].data_interfaces["PoseEstimationDeepLabCut"]

    print("Series:", list(pose.pose_estimation_series))

    # Pick the first pose-estimation series
    series_name = list(pose.pose_estimation_series.keys())[0]
    series = pose.pose_estimation_series[series_name]

    print("Using series:", series_name)
    print("Data shape:", series.data.shape)

    # How many samples to plot
    n = min(2000, series.data.shape[0])
    xy = np.array(series.data[:n])

    # Build time axis from sampling rate
    rate = series.rate
    starting_time = series.starting_time if series.starting_time is not None else 0.0
    t = starting_time + np.arange(n) / rate

    plt.figure()
    plt.plot(t, xy[:, 0], label="x")
    plt.plot(t, xy[:, 1], label="y")
    plt.xlabel("Time (s)")
    plt.ylabel("Position")
    plt.title(series.name)
    plt.legend()
    plt.tight_layout()
    plt.show()
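
The time axis in the cell above follows `t[i] = starting_time + i / rate`. An NWB time series can instead store explicit timestamps, in which case its `rate` is `None` and the division fails. The sketch below covers both cases in plain Python. The helper `time_axis` is a hypothetical name, and the numbers in the example are made up for illustration.

```python
def time_axis(n, rate=None, starting_time=None, timestamps=None):
    """Return the first `n` sample times in seconds as a list of floats."""
    if timestamps is not None:
        # Explicit timestamps take priority when the series stores them.
        return [float(t) for t in timestamps[:n]]
    if not rate:
        raise ValueError("Need either a sampling rate or explicit timestamps")
    start = 0.0 if starting_time is None else starting_time
    return [start + i / rate for i in range(n)]


print(time_axis(4, rate=4.0, starting_time=1.0))  # [1.0, 1.25, 1.5, 1.75]
```

In the plotting cell, this would correspond to calling `time_axis(n, series.rate, series.starting_time, series.timestamps)` before plotting.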


### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?

The EMBER Archive is a new effort to support the NIH [Brain Behavior Quantification and Synchronization](https://braininitiative.nih.gov/research/systems-neuroscience/brain-behavior-quantification-and-synchronization-program) program, which is still ongoing. Currently, teams are working on collecting, sharing, and publishing their findings.

In the example Shepherd 2025 EMBER-dandiset, the research team investigated dexterous single sniffs for ethological active olfaction. This is to say, they investigated the links between olfaction and movement during feeding in mice, giving novel insight into the link between sensing and behavior.

### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

In the NIH [Brain Behavior Quantification and Synchronization](https://braininitiative.nih.gov/research/systems-neuroscience/brain-behavior-quantification-and-synchronization-program) program, the core question is which common substrates underlie the neural representation of behavior, and how they can be quantified and represented. Key questions include how everyday activities (navigation, exploration, interaction with the world and others) generate repeatable patterns of neural activity. In addition, there is interest in how neurological disease alters or perturbs this function. Integrating multi-lab and multi-species datasets to address these questions is a key challenge moving forward.