nexa

A Python library for working with first-person video recordings. It reads MCAP files coming off an egocentric camera rig (RGB, depth, camera pose, IMU), runs hand tracking, visualizes the result in Rerun, and exports a clean episode directory you can hand to downstream training code.

What's in the box

MCAP reading for Nexa recordings (RGB, depth, camera pose, IMU, TF, tracking state)
Synchronized frame iteration across those streams
Hand tracking with two backends: WiLoR and MediaPipe
3D hand keypoints anchored against the depth image
Rerun based 3D visualization with sensible defaults
Episode export to a self-contained directory (rgb.mp4, mesh.ply, HDF5 annotations, calibrations)
Coordinate transform helpers (quaternion to matrix, optical to world, depth to point cloud)

Installation

Python 3.10 or newer. Start with a fresh env:

conda create -n nexa python=3.10
conda activate nexa

Clone this repo and install in editable mode:

git clone <this repo>
cd nexa
pip install -e ".[mcap,mediapipe,wilor]"

The extras are optional. Pick whichever you need:

mcap to read MCAP files
mediapipe for the MediaPipe hand backend
wilor for the WiLoR hand backend

You also need ffmpeg on your PATH if you want rgb.mp4 to come out of the episode export:

sudo apt install ffmpeg

And h5py for the annotation HDF5 (install separately if you skipped the extras):

pip install h5py

Hand backend setup

The SDK supports two hand tracking backends. One of them needs some upfront setup, the other doesn't.

MediaPipe

Nothing to do. Once you've installed the SDK with the mediapipe extra, it just works. The model file (a few MB) is downloaded to ~/.cache/mediapipe/ on your first detect_hands call and cached after that.

WiLoR

WiLoR is not shipped with this SDK. It's a separate research repo from rolpotamias. You clone it yourself and point the SDK at it. The SDK just wraps it for inference.

Clone and prep the checkpoints:

git clone https://github.com/rolpotamias/WiLoR.git /path/to/WiLoR
cd /path/to/WiLoR

Follow WiLoR's own instructions to download the pretrained model. Once you're done the directory should look like this:

/path/to/WiLoR/
    wilor/
    pretrained_models/
        wilor_final.ckpt
        detector.pt
        model_config.yaml

WiLoR pulls in its own stack (pyrender, chumpy, scikit-image, timm, pytorch-lightning and friends). Install its requirements inside your nexa env:

cd /path/to/WiLoR
pip install --no-build-isolation -r requirements.txt

One gotcha worth knowing: recent numpy releases dropped np.bool, np.int, and np.float, which older chumpy imports on load. If you hit ImportError: cannot import name 'bool' from 'numpy', open chumpy/__init__.py in your env and replace the bad import with Python builtins, or pin numpy<1.24. This is a WiLoR/chumpy compatibility thing, not something the SDK can fix from outside.

Once all that's in place, hand the path to HandTracker:

HandTracker(model="wilor", model_path="/path/to/WiLoR")

If the path is wrong or the checkpoints are missing, the loader will tell you exactly which file it expected to find.

Logging

The SDK is silent by default. To see progress messages (MCAP opened, WiLoR loading, episode written, and so on), call:

import nexa
nexa.setup_logging()

Use level="DEBUG" if you want the noisier stuff.

Reading an MCAP

MCAPReader gives you random access to each stream in a recording, plus a synchronized frame iterator:

from nexa.data import MCAPReader

session = MCAPReader("my_recording.mcap")

print(session.duration)              # seconds
print(session.num_rgb_frames)
print(session.rgb_intrinsics.K)      # 3x3
print(session.depth_intrinsics.K)

for frame in session.frames():
    # frame.index, frame.timestamp
    # frame.rgb          (H, W, 3) uint8
    # frame.depth        (H, W) uint16 mm, or None
    # frame.camera_pose  Pose6D in world frame, or None
    # frame.imu          dict with linear_acceleration, angular_velocity, orientation
    # frame.depth_K      3x3 depth intrinsics
    # frame.rgb_K        3x3 rgb intrinsics
    pass

The reader is permissive by default. If a topic is missing in your MCAP (no depth, no poses, no tracking state), the corresponding fields just come back as None or empty iterators and nothing blows up.

If you want a strict match against the reference topic fingerprint, pass check_format=True:

session = MCAPReader("my_recording.mcap", check_format=True)

Other useful accessors:

session.all_camera_poses()    # list of (timestamp, Pose6D)
session.all_imu_samples()     # list of (timestamp, dict)
session.tf_transforms()       # list of (ts, parent, child, Pose6D)
session.mesh()                # (verts, faces, colors) or None
session.point_cloud()         # (xyz, rgb or None)

Hand tracking

HandTracker is the thing you actually call. You pick a backend with model=, and both backends return the same HandPose type so the rest of your code doesn't care which one you chose.

WiLoR

Higher accuracy, GPU required, needs the WiLoR setup above.

from nexa.data import MCAPReader
from nexa.models import HandTracker

session = MCAPReader("my_recording.mcap")
tracker = HandTracker(model="wilor", model_path="/path/to/WiLoR")

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    for hp in hands:
        print(hp.hand_side, hp.wrist.x, hp.wrist.y, hp.wrist.z)
        print(hp.fingers["index"][3])   # index fingertip

The tracker uses depth (when present) to anchor the 3D joints into the camera optical frame. If depth isn't available or the patch around the wrist is bad, you still get 2D keypoints (z = 0).

MediaPipe

Faster, runs on CPU, no external repo to set up. The model file gets downloaded on first use into ~/.cache/mediapipe/.

tracker = HandTracker(model="mediapipe")

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    for hp in hands:
        print(hp.hand_side, hp.all_keypoints)

What a HandPose looks like

@dataclass
class HandPose:
    wrist: Keypoint                         # always present
    fingers: dict[str, list[Keypoint]]      # thumb, index, middle, ring, pinky
    hand_side: str                          # "left" or "right"
    frame_idx: Optional[int]
    confidence: float

Each finger value is a list of four Keypoint objects in order MCP, PIP, DIP, tip. A Keypoint holds x, y, z, confidence, and an optional name.

Common access patterns:

hp.wrist.x                  # wrist x (metres in camera optical frame if 3D, pixels if 2D)
hp.fingers["thumb"][3]      # thumb tip
hp.all_keypoints            # flat 21 element list; or [wrist] when only the wrist was detected
hp.has_fingers              # False for wrist only detections

Attaching detections to the session

If you want hand tracks to show up in the exported episode, push them into the session as you go:

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    session.add_hand_pose(frame.index, hands)

When you later call session.export(...), the accumulated detections land in annotation.hdf5:/hand-pose as (F, 21, 3) arrays per hand side, with valid and confidence siblings.

Visualizing with Rerun

RerunVisualizer wires up the whole scene: blueprint layout, static mesh (or point cloud), camera frustum, per frame logging. Panels for RGB and depth only appear if those streams exist in the MCAP, so it degrades gracefully on partial recordings.

Minimal usage:

from nexa.data import MCAPReader
from nexa.models import HandTracker
from nexa.viz import RerunVisualizer

session = MCAPReader("my_recording.mcap")
tracker = HandTracker(model="mediapipe")
viz = RerunVisualizer(session)

for frame in session.frames():
    hands = tracker.detect_hands(frame)
    viz.log_frame(frame, hands=hands)

viz.export("my_recording.rrd")

To open the result:

rerun my_recording.rrd

If you're on a headless machine or inside an SSH session with no display, use the web viewer:

rerun --web-viewer my_recording.rrd

It prints a local URL. Open that in any browser (forward the port with ssh -L 9090:localhost:9090 ... if you're remote).

You can also enable a live depth based point cloud and a dense reconstruction by passing flags at construction time:

viz = RerunVisualizer(
    session,
    max_viz=True,          # live RGB-D point cloud per frame
    dense_map=True,        # build a dense coloured point cloud from all depth frames
)

Those are off by default because they're slower.

Exporting episodes

session.export(out_dir, visualizer=viz) writes a clean directory with everything the recording has:

episodes/my_session/
    rgb.mp4
    mesh.ply
    thumbnail.jpg
    visualization.rrd
    calibrations/
        rgb_K.npy
        rgb_D.npy
        depth_K.npy
        depth_D.npy
        R_optical_to_link.npy
        meta.json
    annotation.hdf5
        /depth            (F, H, W) uint16 mm
        /cam-pose         translations + rotations + timestamps
        /imu              linear_acceleration, angular_velocity, orientation_xyzw
        /hand-pose        left_joints, right_joints, valid, confidence
        /text-annotations from /camera/tracking_state
        /metadata

Anything that can't be produced (no mesh topic, no ffmpeg, no hand detections attached) is skipped cleanly. The log output tells you which artifacts were saved and which were skipped.

Full pipeline example

from tqdm import tqdm
import nexa
from nexa.data import MCAPReader
from nexa.models import HandTracker
from nexa.viz import RerunVisualizer

nexa.setup_logging()

session = MCAPReader("recording.mcap")
tracker = HandTracker(model="wilor", model_path="/path/to/WiLoR")
viz = RerunVisualizer(session)

for frame in tqdm(session.frames(), total=session.num_rgb_frames):
    hands = tracker.detect_hands(frame)
    session.add_hand_pose(frame.index, hands)
    viz.log_frame(frame, hands=hands)

session.export("episodes/my_session", visualizer=viz)
viz.export("recording.rrd")

That's the whole loop. Open recording.rrd in Rerun to inspect visually, or hand episodes/my_session/ to whatever downstream code you have.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
src/nexa		src/nexa
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

nexa

What's in the box

Installation

Hand backend setup

MediaPipe

WiLoR

Logging

Reading an MCAP

Hand tracking

WiLoR

MediaPipe

What a HandPose looks like

Attaching detections to the session

Visualizing with Rerun

Exporting episodes

Full pipeline example

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

nexa

What's in the box

Installation

Hand backend setup

MediaPipe

WiLoR

Logging

Reading an MCAP

Hand tracking

WiLoR

MediaPipe

What a HandPose looks like

Attaching detections to the session

Visualizing with Rerun

Exporting episodes

Full pipeline example

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages