A Python library for working with first-person video recordings. It reads MCAP files coming off an egocentric camera rig (RGB, depth, camera pose, IMU), runs hand tracking, visualizes the result in Rerun, and exports a clean episode directory you can hand to downstream training code.
- MCAP reading for Nexa recordings (RGB, depth, camera pose, IMU, TF, tracking state)
- Synchronized frame iteration across those streams
- Hand tracking with two backends: WiLoR and MediaPipe
- 3D hand keypoints anchored against the depth image
- Rerun based 3D visualization with sensible defaults
- Episode export to a self-contained directory (rgb.mp4, mesh.ply, HDF5 annotations, calibrations)
- Coordinate transform helpers (quaternion to matrix, optical to world, depth to point cloud)
Python 3.10 or newer. Start with a fresh env:
conda create -n nexa python=3.10
conda activate nexa
Clone this repo and install in editable mode:
git clone <this repo>
cd nexa
pip install -e ".[mcap,mediapipe,wilor]"
The extras are optional. Pick whichever you need:
mcapto read MCAP filesmediapipefor the MediaPipe hand backendwilorfor the WiLoR hand backend
You also need ffmpeg on your PATH if you want rgb.mp4 to come out of the episode export:
sudo apt install ffmpeg
And h5py for the annotation HDF5 (install separately if you skipped the extras):
pip install h5py
The SDK supports two hand tracking backends. One of them needs some upfront setup, the other doesn't.
Nothing to do. Once you've installed the SDK with the mediapipe extra, it just works. The model file (a few MB) is downloaded to ~/.cache/mediapipe/ on your first detect_hands call and cached after that.
WiLoR is not shipped with this SDK. It's a separate research repo from rolpotamias. You clone it yourself and point the SDK at it. The SDK just wraps it for inference.
Clone and prep the checkpoints:
git clone https://github.com/rolpotamias/WiLoR.git /path/to/WiLoR
cd /path/to/WiLoR
Follow WiLoR's own instructions to download the pretrained model. Once you're done the directory should look like this:
/path/to/WiLoR/
wilor/
pretrained_models/
wilor_final.ckpt
detector.pt
model_config.yaml
WiLoR pulls in its own stack (pyrender, chumpy, scikit-image, timm, pytorch-lightning and friends). Install its requirements inside your nexa env:
cd /path/to/WiLoR
pip install --no-build-isolation -r requirements.txt
One gotcha worth knowing: recent numpy releases dropped np.bool, np.int, and np.float, which older chumpy imports on load. If you hit ImportError: cannot import name 'bool' from 'numpy', open chumpy/__init__.py in your env and replace the bad import with Python builtins, or pin numpy<1.24. This is a WiLoR/chumpy compatibility thing, not something the SDK can fix from outside.
Once all that's in place, hand the path to HandTracker:
HandTracker(model="wilor", model_path="/path/to/WiLoR")If the path is wrong or the checkpoints are missing, the loader will tell you exactly which file it expected to find.
The SDK is silent by default. To see progress messages (MCAP opened, WiLoR loading, episode written, and so on), call:
import nexa
nexa.setup_logging()Use level="DEBUG" if you want the noisier stuff.
MCAPReader gives you random access to each stream in a recording, plus a synchronized frame iterator:
from nexa.data import MCAPReader
session = MCAPReader("my_recording.mcap")
print(session.duration) # seconds
print(session.num_rgb_frames)
print(session.rgb_intrinsics.K) # 3x3
print(session.depth_intrinsics.K)
for frame in session.frames():
# frame.index, frame.timestamp
# frame.rgb (H, W, 3) uint8
# frame.depth (H, W) uint16 mm, or None
# frame.camera_pose Pose6D in world frame, or None
# frame.imu dict with linear_acceleration, angular_velocity, orientation
# frame.depth_K 3x3 depth intrinsics
# frame.rgb_K 3x3 rgb intrinsics
passThe reader is permissive by default. If a topic is missing in your MCAP (no depth, no poses, no tracking state), the corresponding fields just come back as None or empty iterators and nothing blows up.
If you want a strict match against the reference topic fingerprint, pass check_format=True:
session = MCAPReader("my_recording.mcap", check_format=True)Other useful accessors:
session.all_camera_poses() # list of (timestamp, Pose6D)
session.all_imu_samples() # list of (timestamp, dict)
session.tf_transforms() # list of (ts, parent, child, Pose6D)
session.mesh() # (verts, faces, colors) or None
session.point_cloud() # (xyz, rgb or None)HandTracker is the thing you actually call. You pick a backend with model=, and both backends return the same HandPose type so the rest of your code doesn't care which one you chose.
Higher accuracy, GPU required, needs the WiLoR setup above.
from nexa.data import MCAPReader
from nexa.models import HandTracker
session = MCAPReader("my_recording.mcap")
tracker = HandTracker(model="wilor", model_path="/path/to/WiLoR")
for frame in session.frames():
hands = tracker.detect_hands(frame)
for hp in hands:
print(hp.hand_side, hp.wrist.x, hp.wrist.y, hp.wrist.z)
print(hp.fingers["index"][3]) # index fingertipThe tracker uses depth (when present) to anchor the 3D joints into the camera optical frame. If depth isn't available or the patch around the wrist is bad, you still get 2D keypoints (z = 0).
Faster, runs on CPU, no external repo to set up. The model file gets downloaded on first use into ~/.cache/mediapipe/.
tracker = HandTracker(model="mediapipe")
for frame in session.frames():
hands = tracker.detect_hands(frame)
for hp in hands:
print(hp.hand_side, hp.all_keypoints)@dataclass
class HandPose:
wrist: Keypoint # always present
fingers: dict[str, list[Keypoint]] # thumb, index, middle, ring, pinky
hand_side: str # "left" or "right"
frame_idx: Optional[int]
confidence: floatEach finger value is a list of four Keypoint objects in order MCP, PIP, DIP, tip. A Keypoint holds x, y, z, confidence, and an optional name.
Common access patterns:
hp.wrist.x # wrist x (metres in camera optical frame if 3D, pixels if 2D)
hp.fingers["thumb"][3] # thumb tip
hp.all_keypoints # flat 21 element list; or [wrist] when only the wrist was detected
hp.has_fingers # False for wrist only detectionsIf you want hand tracks to show up in the exported episode, push them into the session as you go:
for frame in session.frames():
hands = tracker.detect_hands(frame)
session.add_hand_pose(frame.index, hands)When you later call session.export(...), the accumulated detections land in annotation.hdf5:/hand-pose as (F, 21, 3) arrays per hand side, with valid and confidence siblings.
RerunVisualizer wires up the whole scene: blueprint layout, static mesh (or point cloud), camera frustum, per frame logging. Panels for RGB and depth only appear if those streams exist in the MCAP, so it degrades gracefully on partial recordings.
Minimal usage:
from nexa.data import MCAPReader
from nexa.models import HandTracker
from nexa.viz import RerunVisualizer
session = MCAPReader("my_recording.mcap")
tracker = HandTracker(model="mediapipe")
viz = RerunVisualizer(session)
for frame in session.frames():
hands = tracker.detect_hands(frame)
viz.log_frame(frame, hands=hands)
viz.export("my_recording.rrd")To open the result:
rerun my_recording.rrd
If you're on a headless machine or inside an SSH session with no display, use the web viewer:
rerun --web-viewer my_recording.rrd
It prints a local URL. Open that in any browser (forward the port with ssh -L 9090:localhost:9090 ... if you're remote).
You can also enable a live depth based point cloud and a dense reconstruction by passing flags at construction time:
viz = RerunVisualizer(
session,
max_viz=True, # live RGB-D point cloud per frame
dense_map=True, # build a dense coloured point cloud from all depth frames
)Those are off by default because they're slower.
session.export(out_dir, visualizer=viz) writes a clean directory with everything the recording has:
episodes/my_session/
rgb.mp4
mesh.ply
thumbnail.jpg
visualization.rrd
calibrations/
rgb_K.npy
rgb_D.npy
depth_K.npy
depth_D.npy
R_optical_to_link.npy
meta.json
annotation.hdf5
/depth (F, H, W) uint16 mm
/cam-pose translations + rotations + timestamps
/imu linear_acceleration, angular_velocity, orientation_xyzw
/hand-pose left_joints, right_joints, valid, confidence
/text-annotations from /camera/tracking_state
/metadata
Anything that can't be produced (no mesh topic, no ffmpeg, no hand detections attached) is skipped cleanly. The log output tells you which artifacts were saved and which were skipped.
from tqdm import tqdm
import nexa
from nexa.data import MCAPReader
from nexa.models import HandTracker
from nexa.viz import RerunVisualizer
nexa.setup_logging()
session = MCAPReader("recording.mcap")
tracker = HandTracker(model="wilor", model_path="/path/to/WiLoR")
viz = RerunVisualizer(session)
for frame in tqdm(session.frames(), total=session.num_rgb_frames):
hands = tracker.detect_hands(frame)
session.add_hand_pose(frame.index, hands)
viz.log_frame(frame, hands=hands)
session.export("episodes/my_session", visualizer=viz)
viz.export("recording.rrd")That's the whole loop. Open recording.rrd in Rerun to inspect visually, or hand episodes/my_session/ to whatever downstream code you have.