Standalone C++ inference engine for SAM-3D-Body — zero Python dependency at runtime.
Takes a BGR image and produces per-person MHR body pose parameters, camera translation, and optionally full 3D mesh vertices + 70 body/hand keypoints, all via ONNX Runtime + ggml.
Also includes Python frontends that call the compiled shared library via ctypes, and a CSV exporter for the 70 MHR keypoints.
--bvh PATH writes a standard BVH motion-capture file per detected person (p_0.bvh, p_1.bvh, …).
Identities are kept stable across frames by a built-in 2D-bbox IoU tracker, each
file's joint OFFSETs are auto-resized to the actor's measured bone lengths, and the
output drops straight into Blender / BVHTester / any DCC. A bundled
blender/blender_bvh_plugin.py drives a
MakeHuman-rigged character from the result. See BVH export for details.
./fast_sam_3dbody_run --from clip.mp4 --bvh ./p.bvh --headless
# → p_0.bvh, p_1.bvh, …Pre-built ONNX / GGUF / LBS model files are hosted on HuggingFace:
https://huggingface.co/AmmarkoV/SAM3DBody-cpp-onnx-models
Download SAM3DBody-cpp-onnx-models.zip, extract it, and place the resulting onnx/ directory at the root of this repository:
# Download and extract
wget https://huggingface.co/AmmarkoV/SAM3DBody-cpp-onnx-models/resolve/main/SAM3DBody-cpp-onnx-models.zip
unzip SAM3DBody-cpp-onnx-models.zip
# onnx/ is now at the repo root — ready to build| File | Size | Description |
|---|---|---|
onnx/backbone.onnx + .data |
~4.8 GB | DINOv3-ViT-H/14+ encoder |
onnx/decoder.onnx |
~93 MB | 6-layer PromptableDecoder |
onnx/yolo.onnx |
~81 MB | YOLO11m-pose person detector |
onnx/pipeline.gguf |
~5 MB | MHR + camera projection heads |
onnx/body_model.lbs |
~27 MB | Native C LBS data (joints, weights, shape) |
onnx/correctives.bin |
~33 MB | Pose corrective blend shapes |
onnx/keypoint_mapping.bin |
~8 KB | MHR-70 keypoint index map |
CMake will warn at configure time if neither
onnx/nor the zip is found.
BGR image
│
▼ yolo.onnx ONNX Runtime (CUDA EP) person bboxes + 17 COCO keypoints
│
▼ backbone.onnx ONNX Runtime (CUDA EP) feature map [B, 1280, 32, 32]
│ DINOv3-ViT-H/14+
│
▼ decoder.onnx ONNX Runtime (CUDA EP) pose token [B, 1024]
│ 6-layer PromptableDecoder
│
▼ pipeline.gguf CPU matmul (ggml) MHR params [B, 519] + camera [B, 3]
│ MHR head + camera head weights
│
▼ body_model.lbs native C LBS (optional) vertices [18439, 3] in metres
extracted once by tools/extract_lbs_data.py
Note:
body_model.onnxexport is blocked on PyTorch ≥ 2.x (torch.export rejects TorchScript modules). The native C LBS path readsbody_model.lbsdirectly and produces identical output to Pythonmhr_forward(body model stores data in cm;mhr_lbs_computeapplies ×0.01 to match Python's/100conversion).
Per-person output (MHRResult / FsbResult):
| Field | Shape | Description |
|---|---|---|
bbox |
[4] | x1 y1 x2 y2 in original image pixels |
focal_length |
scalar | Estimated focal length (pixels) |
pred_cam_t |
[3] | Raw camera head output: [s, tx, ty] |
global_rot |
[3] | Global orientation – Euler ZYX (radians) |
body_pose |
[133] | Body joint angles – Euler |
shape |
[45] | SMPL-like identity blend shape betas |
scale |
[28] | Scale PCA components |
hand_pose |
[108] | Hand joints: left [54] + right [54] |
face_params |
[72] | Facial expression parameters |
mhr_model_params |
[204] | Assembled LBS parameter vector (passed to mhr_lbs_compute) |
yolo_kps |
[51] | COCO 17 keypoints × [x, y, confidence] |
pred_vertices |
[55317] | 18439 verts × 3, metres (when native C LBS runs) |
kps_3d |
[210] | 70 joints × 3, metres (when native C LBS runs) |
kps_2d |
[140] | 70 joints × 2 projected (when native C LBS runs) |
SAM3DBody-cpp/
├── CMakeLists.txt
├── body_mesh.tri SMPL-like body mesh for the GL renderer
├── fast_sam_3dbody_frontend.py Python lightweight frontend (ctypes, no extra deps)
├── fast_sam_3dbody_frontend-3D.py Python 3D frontend (ctypes + Python body model)
├── fast_sam_3dbody_dump_csv.py Python CSV exporter – 70 MHR keypoints per frame
├── two_pass.py Second-pass temporal smoother
├── ros_demo_webcam.py ROS demo
├── onnx/ Runtime model files – download from HuggingFace (see above)
│ ├── backbone.onnx + .data ~4.8 GB DINOv3-ViT-H/14+ encoder
│ ├── decoder.onnx ~93 MB 6-layer PromptableDecoder
│ ├── pipeline.gguf ~5 MB MHR + camera heads
│ ├── yolo.onnx ~81 MB YOLO11m-pose
│ ├── body_model.lbs ~27 MB native C LBS data
│ ├── correctives.bin ~33 MB pose corrective blend shapes
│ └── keypoint_mapping.bin ~8 KB MHR-70 keypoint index map
├── GraphicsEngine/
│ ├── System/glx3.{h,c} GLX window management
│ └── ModelLoader/ .tri mesh loader + LBS joint transform
├── AmMatrix/ Lightweight C matrix / quaternion library
├── render/
│ ├── fast_sam_3dbody_render.cpp OpenGL mesh overlay renderer
│ └── mhr_pose_driver.h LBS driver (camera matrices, vertex update)
├── scripts/
│ └── build.sh / setup.sh / webcam.sh / video.sh / offline_video.sh
└── src/
├── fast_sam_3dbody.h C++ public API
├── fast_sam_3dbody.cpp Pipeline implementation
├── fast_sam_3dbody_capi.h Plain C API (for ctypes)
├── fast_sam_3dbody_capi.cpp
├── preprocess.hpp Crop, normalise, ray_cond, NMS, pose conversion
├── bvh_writer.h / bvh_writer.cpp BVH motion-capture exporter (multi-person, name-mapped to MHR joints)
├── mhr_joint_table.h Generated by scripts/build_joint_table.py — MHR joint names + parents
└── main.cpp CLI executable (--bvh, --out CSV, live overlay window)
Developer tools (ONNX/GGUF export, LBS extraction, debug scripts, Python training env) live in the parent project: https://github.com/AmmarkoV/Fast-SAM-3D-Body
See the Models section above. After extracting the zip, onnx/ should be at the repo root.
To build models from source (requires the original Python training environment), see Fast-SAM-3D-Body.
Requirements: CMake ≥ 3.18, C++17 compiler, OpenCV (core/imgproc/videoio/highgui/dnn), optional CUDA Toolkit.
cd fast_sam_3dbody_cpp
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)CMake handles dependencies automatically:
- ONNX Runtime 1.20.1 – downloaded from GitHub releases if not found; point to an existing install with
-DONNX_RUNTIME_DIR=/path/to/onnxruntime - ggml – fetched via
FetchContentfrom GitHub - CUDA – auto-detected; set
-DCMAKE_CUDA_ARCHITECTURES=86(or75,89, etc.) for your GPU; falls back to CPU-only if not found
Outputs in build/:
| File | Description |
|---|---|
fast_sam_3dbody_run |
Standalone CLI executable |
libfast_sam_3dbody.so |
Shared library for C++ linking or ctypes |
cd fast_sam_3dbody_cpp/build
# Single image – prints pose params to stdout
./fast_sam_3dbody_run \
--onnx-dir ../onnx \
--gguf ../onnx/pipeline.gguf \
--yolo ../onnx/yolo.onnx \
--from ../../assets/teaser.png
# Webcam (device 0)
./fast_sam_3dbody_run \
--onnx-dir ../onnx --gguf ../onnx/pipeline.gguf --yolo ../onnx/yolo.onnx \
--from 0
# Video file
./fast_sam_3dbody_run \
--onnx-dir ../onnx --gguf ../onnx/pipeline.gguf --yolo ../onnx/yolo.onnx \
--from /path/to/video.mp4
# Fastest mode – skip LBS body model (no vertices, just pose params)
./fast_sam_3dbody_run ... --skip-body
# CPU-only
./fast_sam_3dbody_run ... --cuda -1Full option list:
--onnx-dir PATH Directory with backbone/decoder/body_model ONNX files
--gguf PATH pipeline.gguf (MHR + camera heads)
--yolo PATH YOLO pose model (.onnx)
--from SRC Webcam index (0,1,..) or path to image/video
-o / --out PATH Write 70-joint 3D keypoints to CSV per frame
--bvh PATH Write BVH motion capture file(s) to PATH (see "BVH export" below)
--bvh-template P BVH skeleton template (default: ./body.bvh)
--no-bvh-body-shape-change Keep template body bone lengths (skip per-person rewrite)
--no-bvh-hand-shape-change Keep template hand/finger bone lengths (skip per-person rewrite)
--bvh-raw-fingers Do NOT rescale finger End-Site OFFSETs (keeps body.bvh's authored fingertips)
--cuda DEVICE CUDA device index (default 0; -1 = CPU)
--skip-body Skip body model (no vertices / keypoints)
--headless No display window
--thresh T YOLO person confidence threshold (default 0.50)
--nms T YOLO NMS IoU threshold (default 0.45)
--fx / --fy F Camera focal length x/y in pixels (0 = image width)
--cx / --cy F Principal point (0 = image centre)
--render-size W H Override display window size
--size W H Webcam capture resolution
--fps Z Webcam capture framerate
--butterworth Apply Butterworth low-pass filter to MHR output vectors
--bw-cutoff HZ Butterworth cutoff frequency in Hz (default 6.0)
--butterworth-root-rotation Filter global_rot in quaternion space (1st-order SLERP-EMA, see "Output filtering")
--rot-clamp DEG Geodesic SLERP-step clamp on global_rot in degrees/frame (default 1; 0 = no clamp)
--info Print pipeline info and exit
--help Show this message
Draws COCO 2D skeletons and a pose-bar panel. Requires only opencv-python and numpy — no PyTorch.
# From the repo root:
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py --from assets/teaser.png
# Webcam, cap at 3 persons
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py --from 0 --max-skeletons 3
# Save output image / video
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py \
--from assets/teaser.png --out out.jpg
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend.py \
--from video.mp4 --headless --out out.mp4Key options (same as CLI, plus):
--max-skeletons N Cap persons drawn per frame (0 = unlimited)
--headless No display window
--out PATH Write result to image or video file
Full 3D mesh rendering identical to demo_webcam.py: four-panel output
[original | 2D skeleton | front mesh | side mesh].
Uses the C engine for the fast path (YOLO → backbone → decoder → MHR FFN heads),
then calls the Python MHR body model (mhr_model.pt) for LBS skinning to produce
mesh vertices. Requires the full Python environment (PyTorch, sam_3d_body package, pyrender).
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py --from assets/teaser.png
# Webcam
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py --from 0 --max-skeletons 3
# Custom checkpoint paths
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py \
--from assets/teaser.png \
--checkpoint ./checkpoints/sam-3d-body-dinov3/model.ckpt \
--mhr-model ./checkpoints/sam-3d-body-dinov3/assets/mhr_model.pt
# Save result
python fast_sam_3dbody_cpp/fast_sam_3dbody_frontend-3D.py \
--from assets/teaser.png --out result_3d.jpgKey options (same as lightweight frontend, plus):
--checkpoint PATH Path to model.ckpt (default: checkpoints/sam-3d-body-dinov3/model.ckpt)
--mhr-model PATH Path to mhr_model.pt
--device STR PyTorch device for body model: cuda or cpu (default: auto)
The CLI can apply a second-order Butterworth low-pass filter to the MHR output vectors on every frame, reducing per-frame jitter without introducing ripple in the passband.
# Enable with the default 6 Hz cutoff
./fast_sam_3dbody_run --from video.mp4 --butterworth
# Lower cutoff for smoother (more lag) output
./fast_sam_3dbody_run --from video.mp4 --butterworth --bw-cutoff 3.0
# Higher cutoff to preserve faster motion
./fast_sam_3dbody_run --from video.mp4 --butterworth --bw-cutoff 10.0The filter is applied in-place to each detected person's result immediately after inference, so all downstream consumers (CSV writer, BVH writer, display overlay) receive filtered data.
| Vector | Channels | Method | Description |
|---|---|---|---|
keypoints_3d |
210 (70 joints × 3) | Butterworth | 3-D joint positions in metres |
body_pose |
133 | Butterworth | Body joint Euler angles |
hand_pose |
108 | Butterworth | Hand joint angles (left 54 + right 54) |
global_rot |
3 | Quaternion SLERP-EMA | Global orientation – filtered on SO(3), see below |
pred_cam_t |
3 | Butterworth | Camera / root translation |
global_rot is not filtered with the scalar Butterworth path. Doing that
on either the Euler triple or the four quaternion components fails for two
reasons rooted in the geometry of rotations:
- Euler wrap. A rotation passing 180° jumps from +π to −π on the axis storage, even though the actual orientation moved 0°. A per-axis low-pass interpolates linearly through that discontinuity, briefly flipping the whole body for ~τ seconds (the filter time-constant) every time the model crosses a wrap or a gimbal-lock pole.
- SO(3) is not a vector space. Butterworth's "maximally-flat magnitude" guarantee is a theorem about scalar LTI systems; it does not transfer to rotations no matter which parametrisation you filter. Per-channel smoothing of an Euler triple produces a composed rotation that is neither the input nor a geometrically meaningful interpolant.
So instead, when --butterworth-root-rotation is on, global_rot is filtered
in quaternion space by a 1st-order SLERP-EMA (QuatLPF in
src/outputFiltering.h):
- Convert the input Euler triple to a unit quaternion.
- Hemisphere-correct — if
dot(q_filt_prev, q_input) < 0, negateq_input.qand−qrepresent the same orientation, so this removes the spurious 180° jump. - Compute the geodesic angle
θ = 2·acos(|dot|)and pick a SLERP fractiont = min(α, --rot-clamp / θ). Theαcomes from the same--bw-cutofftime-constant used elsewhere. q_filt ← SLERP(q_filt_prev, q_input_hemi, t), renormalise.- Convert back to Euler-ZYX for storage.
This trades Butterworth's maximally-flat magnitude (a guarantee that does
not apply to SO(3) anyway) for geodesic monotonicity, no-flip
continuity across the wrap, and a single rotation-angle distance metric.
It's 1st-order (−6 dB/octave) where the linear channels are 2nd-order, but
that's the right trade-off for rotation — see the comment block in
outputFiltering.h for the full rationale.
--rot-clamp is now a geodesic outlier clamp on the SLERP step, not a
frame-rejection threshold: rapid input motion gets attenuated to at most
rot-clamp deg/frame in the output, never frozen. A single bad FFN frame
with a 180° flip is therefore absorbed smoothly across a few frames rather
than overwriting the orientation.
| Flag | Default | Notes |
|---|---|---|
--butterworth |
off | Enable the filter |
--bw-cutoff HZ |
6.0 |
Cutoff frequency in Hz. Lower = smoother but more temporal lag. Human motion typically stays below 6 Hz; use 3–4 Hz for very smooth output, 8–10 Hz to preserve fast gestures. |
--butterworth-root-rotation |
off | Run the quaternion SLERP-EMA on global_rot (see above). Off by default because the very first input frame seeds the filter, so a bad initial prediction would otherwise lock the sequence to it. |
--rot-clamp DEG |
1.0 |
Geodesic clamp on the SLERP step in degrees per frame. The filter's output never moves more than this many degrees of rotation per frame. With the new quaternion filter this attenuates fast input rather than rejecting it. Set to 0 to disable the clamp (pure EMA). |
The sampling rate is taken from --fps when specified, otherwise 30 Hz is assumed.
Each person slot maintains its own independent filter bank; new person slots are
initialised with a warm-up pass on their first frame so the filter starts from the
measured value rather than zero.
Implemented in src/outputFiltering.h — a header-only, dependency-free,
C-compatible file containing two primitives:
ButterWorth— scalar 2nd-order IIR low-pass used forkeypoints_3d,body_pose,hand_pose,pred_cam_t. Initialised withinitButterWorth(sensor, fs, fc)and stepped withfilter(sensor, value).QuatLPF— 1st-order SLERP-EMA on unit quaternions, used only forglobal_rot. Initialised withinit_quat_lpf(&qf, fs, fc)and stepped withfilter_quat(&qf, in_quat, max_step_rad, out_quat). The header also provideseuler_zyx_to_quat(rz, ry, rx, out)andquat_to_euler_zyx(in, &rz, &ry, &rx)sinceglobal_rotis stored in MHR's ZYX-intrinsic Euler order.
Exports the per-frame MHR pose as one or more standard BVH motion-capture files.
The hierarchy is taken from a BVH template (default ./body.bvh, a MocapNET/MakeHuman
T-pose skeleton); the motion comes from the MHR pipeline.
# Single image / video / webcam → one or more <name>_<id>.bvh files
./fast_sam_3dbody_run --from boom.mp4 --bvh ./p.bvh --headless
# produces ./p_0.bvh, ./p_1.bvh, … (one per tracked person)For each detected person, every frame:
- Root joint — translation from
pred_cam_t(×100 to convert metres → cm) and orientation from MHR's global rotation, in the BVH root'sZrotation Yrotation Xrotationchannel order. - Body joints matched by name to MHR (~50 joints incl. spine, arms, legs,
fingers — full table in
src/bvh_writer.cppNAME_MAP). The local rotation is computed in MHR's frame asinv(delta[parent]) · delta[self], wheredelta[j] = R_global_mhr[j] · R_global_mhr_rest[j]⁻¹, then decomposed to the joint's BVH channel order (Zrotation Xrotation Yrotation). - Unmapped BVH joints (toes, face details, BVH metacarpals, etc.) stay at zero rotation — MHR doesn't predict angles for them.
- OFFSETs are rewritten at close-time to each person's median observed bone length so the template T-pose proportions match the actual subject. Bone direction is preserved (changing it disrupts the T-pose look the BVH file was authored with).
- Per-category opt-out with
--no-bvh-body-shape-changeand--no-bvh-hand-shape-change. body.bvh's authored hands are shorter than the MHR skeleton's (proximal phalanx 2.3 cm vs 3.8 cm), so rewriting visibly enlarges fingers — pass--no-bvh-hand-shape-changeto keep body.bvh's authored hand proportions while still resizing the rest of the body to match the subject. The body flag is the symmetric escape hatch for the trunk/limbs. - Finger-tip End-Site compensation (default on). The End-Site OFFSETs at
the end of each finger have no channels so the bone-length rewrite never
touches them, and body.bvh's are ~0.3 cm longer than the MHR
*_nulltip extensions. By default we rescale all 10 finger End-Site OFFSETs to the MHR length (direction preserved). Pass--bvh-raw-fingersto leave them exactly as authored.
Multiple skeletons in the same scene are exported as separate BVH files, one per tracked identity. Identity persistence is handled by a built-in bbox-IoU greedy tracker:
- IoU threshold
0.10; tracks are retired after90frames missing (≈ 3 s at 30 fps). - Detections with degenerate bboxes (anchored at
(0, 0)or near-zero area) are dropped before they hit the tracker — these are common YOLO failure modes that otherwise would spawn spurious tracks. - While a track is alive but missing this frame, its previous pose is duplicated so the BVH timeline stays continuous through brief occlusions.
Filenames are derived from --bvh PATH:
--bvh value |
Outputs |
|---|---|
p.bvh |
p_0.bvh, p_1.bvh, … |
out/run |
out/run_0.bvh, out/run_1.bvh, … |
cap.mocap |
cap_0.mocap, cap_1.mocap, … |
Each file is fully independent — drop into Blender / BVHTester / any DCC.
# Render a 3D-keypoints CSV at the same time, then compare hip-relative
# joint positions between the BVH and MHR's own 3D keypoints.
./fast_sam_3dbody_run --from boom.mp4 \
--bvh ./p.bvh --out /tmp/boom_mhr.csv --headless
source venv/bin/activate
python3 scripts/verify_bvh_motion.py ./p_0.bvh /tmp/boom_mhr.csvA clean run prints per-joint median / p90 / max error in cm; numbers around 2–5 cm on trunk joints and 5–15 cm on extremities are the expected range (the larger residual on hands/feet partly reflects MHR's 70-keypoint surface landmarks not coinciding with the LBS rotation-centre joints).
The MHR body model uses 127 named joints (body_world, root, l_uparm,
r_thumb1, …) — names that don't appear in body_model.lbs but are exported
from the JIT model:
# (Re)generate src/mhr_joint_table.h from a checkpoint
source venv/bin/activate
python3 scripts/build_joint_table.py
# pass MHR_MODEL_PT=/path/to/mhr_model.pt to override the default locationbvh_writer.cpp then matches each BVH joint name to an MHR name via the
hand-authored NAME_MAP table. To support a new BVH template just add entries
to that table (and rebuild).
- The BVH I/O comes from a vendored, trimmed-down copy of
MotionCaptureLoaderinGraphicsEngine/MotionCaptureLoader/(plusTrajectoryParser/InputParser_C). We usebvh_loadBVHfor parsing the template hierarchy anddumpBVHToBVHfor serialising the result. - The per-person motion buffer lives in
std::vector<float>and is transplanted intomc->motionValuesat close-time, then detached beforebvh_free()so the library doesn't try to free our std::vector storage. - If you only have one person in the scene you'll get exactly one file
(
<name>_0.bvh) — there is no single-file mode for backwards compatibility.
blender/blender_bvh_plugin.py is a Blender add-on (by AmmarkoV) that
plays one of these BVHs onto a MakeHuman-rigged character via the
mpfb / MakeHuman plugin for Blender. It maps
the BVH joints onto the MakeHuman armature with copy-rotation bone constraints
(body / hands / feet / face are independently selectable), and exposes a
"MocapNET BVH Animation Helper" panel in the 3D viewport's N-panel.
End-to-end workflow:
# 1) Generate one BVH per detected person
./fast_sam_3dbody_run --from clip.mp4 --bvh ./p.bvh --headless
# → p_0.bvh, p_1.bvh, …
# 2) (One-time) install a known-good Blender + open the plugin
cd blender && ./downloadAndInstallBlender.sh
# downloads blender-3.4.1, launches it with the plugin loaded;
# on first run the plugin will offer to fetch + install the mpfb2 MakeHuman addonInside Blender:
- Use the MakeHuman (mpfb) tab to create or load a skinned character.
File → Import → Motion Capture (.bvh)and pick one of thep_<id>.bvhfiles. The BVH skeleton appears as a separate armature.- Open the MocapNET BVH Animation Helper panel, point Source BVH at the imported armature, tick the body parts you want driven (body / hands / feet / face) and click Apply — the plugin adds copy-rotation constraints on every matching MakeHuman bone.
- Scrub or render the timeline as usual; the character now follows the exported motion.
The plugin auto-handles naming variants between the BVH skeleton (body.bvh,
which is what we export against) and the MakeHuman armature, so the
p_<id>.bvh files coming out of --bvh work without any further renaming.
fast_sam_3dbody_run and fast_sam_3dbody_render are online / causal
binaries: they process frame N before frame N+1 has been decoded, so every
filter and every tracker can only look backwards. That works fine for live
webcam input, but for video files on disk it leaves quality on the table.
offline_sam_3dbody_render is a separate executable that reads the whole
clip first and then runs five passes over it. It produces the same
multi-person BVH files as --bvh does in the live binaries, but with
smoother motion and more stable identities.
# Most common invocation
./scripts/offline_video.sh --from matrix.mp4 --bvh ./mtx.bvh \
--interpolate-jitter
# → mtx_0.bvh, mtx_1.bvh, …, one per tracked identity-
Inference + scene detection — decode the full video, run YOLO + backbone + decoder + MHR per frame, throw away
pred_verticesimmediately to keep memory bounded. In parallel we run a scene-change detector that votes on three signals each frame (any two ⇒ cut):- (A) Lucas-Kanade optical-flow tracking of background-only corners
(41×41 window, 5 pyramid levels — sized for action footage) from the
previous frame. Cut signal fires when the LK success rate drops
below
--scene-success-threshold(default 0.50). The "background" is everything outside the dilated YOLO bboxes — same mask the user asked for, just produced from bbox geometry instead of a GL shader. - (B) HSV-histogram correlation of the same background. Cut signal fires when the correlation drops below 0.50.
- (C) Person-set discontinuity. Cut signal fires when either the detection count changes by ≥ 2 (or doubles/halves), or the median nearest-prev-detection 3D distance exceeds 1 m.
The voting fuses the failure modes of each individual heuristic, and a 4-frame debounce throws out tracker-artefact bursts. Scene cuts then gate Pass 3 and Pass 4.
- (A) Lucas-Kanade optical-flow tracking of background-only corners
(41×41 window, 5 pyramid levels — sized for action footage) from the
previous frame. Cut signal fires when the LK success rate drops
below
-
Global identity tracking — greedy per-frame matching by
cost = (1 − IoU) + λ · ‖Δpred_cam_t‖_metres, then a post-hoc merge step that splices track-ending pairs (Aends,Bstarts within--track-merge-framesand--track-merge-cmof the same 3D location) underA's id. This is what catches "person hidden behind a pillar for 30 frames" — the live tracker would have retired and re-acquired them as two separate IDs. Tracks with fewer than--min-track-framesdetections are pruned at the end (default 8 — drops YOLO single-frame false positives). -
Gap interpolation (on by default; opt out with
--no-gap-interpolation) — fill in any frame inside a track's lifespan that's missing a detection by linear / SLERP-interpolating from the bracketing real detections. Without this,BVHWriterpads missing frames with "duplicate last pose" and Blender sees that as a frozen segment for the whole gap (typical cause: YOLO confidence dipped for a partial occlusion, fast head turn, or low-contrast frame). Scene cuts are respected — we never bridge a cut.--gap-max-frames Nlets you cap how long a gap will be filled; longer gaps revert to padding (useful when the two endpoints are too different and the linear morph looks unphysical). -
Jitter interpolation (opt-in via
--interpolate-jitter) — flag frames whose 3D keypoint velocity exceeds--jitter-threshold-cm(default 30 cm/frame), then replace each flagged frame by a linear / SLERP interpolation of its non-flagged neighbours. Scene cuts are respected — a "200 cm/frame velocity" right at a cut is real motion between shots, not noise, and we never bridge across a cut. -
Zero-phase smoothing — same Butterworth (for linear channels) and QuatLPF (for
global_rot) as the live binaries, but run asfiltfilt: forward pass, reverse, forward pass, reverse. The phase shift of the two passes cancels exactly, so the smoothed output has zero temporal lag. Each track is split at scene cuts and each segment is filtered independently — we never blend pose data from shot A into shot B. -
BVH export — feed the smoothed + interpolated detections, with the globally-determined track IDs, into
BVHWriter::write_frame_external. The writer is exactly the one used by the live binaries; only the source of the IDs differs.
fsb::Pipeline, BVHWriter, ButterWorth, QuatLPF, euler_zyx_to_quat,
fsb::apply_hand_pose — all unchanged, called directly. The new code in
render/offline_sam_3dbody_render.cpp is only the per-pass
orchestration (≈ 700 lines, every function with a top-of-block comment
explaining why it does what it does), the scene detector, and one new
public method on BVHWriter (write_frame_external) that takes
caller-supplied track IDs instead of running its internal greedy tracker.
The binary refuses webcam indices, RTSP URLs, and single still images;
those are an online-binary workflow. --from must be a path to a video
file with at least a handful of frames.
| Flag | Default | Notes |
|---|---|---|
--smoothing zero-phase|forward|off |
zero-phase |
Forward+backward filtfilt (no lag) vs the live binaries' forward-only filter vs no smoothing |
--no-gap-interpolation |
— | Disable Pass 3; missing frames inside a track will be padded with the last-known pose (the pre-fix "frozen segment" behaviour) |
--gap-max-frames N |
0 |
Skip gap-fill for gaps longer than N frames (0 = no limit). Useful when long occlusions produce a visibly unphysical morph between two unrelated poses |
--interpolate-jitter |
off | Enable Pass 4 |
--jitter-threshold-cm CM |
30.0 |
Per-frame 3D-keypoint velocity above which a frame is replaced by an interpolation of its neighbours |
--track-merge-frames N |
30 |
Maximum gap (in frames) for the post-hoc merge in Pass 2 |
--track-merge-cm CM |
50.0 |
Maximum 3D root distance (cm) for the post-hoc merge |
--min-track-frames N |
8 |
Drop tracks with fewer than this many detections (YOLO false-positive filter) |
--no-scene-detection |
— | Treat the whole clip as one continuous shot (skips Pass-1 OpenCV work and prevents any false-positive cuts from interrupting the smoothing) |
--static-scene |
— | Intent-named alias of --no-scene-detection; use when you know the source has no cuts (one-take dance clip, lab recording, …) |
--scene-success-threshold T |
0.50 |
LK-flow success-rate threshold (signal A) |
--scene-min-corners N |
30 |
Re-seed bg corners when fewer than this many remain |
All --bvh-* flags and --rot-clamp, --bw-cutoff documented above
work identically here.
scripts/offline_video.sh is the shell entry point. It mirrors the
scripts/video.sh interface (same model paths, same flag forwarding)
and additionally accepts the offline-only flags above. Usage:
./scripts/offline_video.sh --from clip.mp4 --bvh out.bvh [...]Passing --save [vis.mp4] additionally runs scripts/video.sh after
the offline pass to produce a visualisation mp4 of the per-frame
inference. The two outputs are independent — the BVH reflects the
offline-only tracking + smoothing + jitter handling, while the rendered
mp4 is whatever the live renderer would produce on the same clip — so
this is a quick way to get a "what was the input?" video alongside the
"what was the cleaned-up output?" BVH:
./scripts/offline_video.sh --from matrix.mp4 --bvh mtx.bvh --save vis.mp4#include "fast_sam_3dbody.h"
fsb::PipelineConfig cfg;
cfg.onnx_dir = "./onnx";
cfg.gguf_path = "./onnx/pipeline.gguf";
cfg.yolo_path = "./onnx/yolo.onnx";
cfg.cuda_device = 0; // -1 = CPU only
cfg.skip_body_model = true; // faster: no vertices
cfg.max_persons = 4; // 0 = unlimited
fsb::Pipeline pipeline;
pipeline.load(cfg);
// BGR uint8 pointer, width, height
std::vector<fsb::MHRResult> results =
pipeline.process_bgr(bgr_ptr, width, height);
for (const auto& r : results) {
// r.bbox [4] x1 y1 x2 y2 (original image pixels)
// r.global_rot [3] Euler ZYX global orientation
// r.body_pose [133] joint angles
// r.shape [45] identity betas
// r.pred_cam_t [3] raw camera head: [s, tx, ty]
// r.focal_length estimated focal length (pixels)
// r.keypoints_yolo[51] COCO 17 × [x,y,conf]
// r.pred_vertices [18439*3] (empty when skip_body_model=true)
}
pipeline.free();#include "fast_sam_3dbody_capi.h"
FsbHandle h = fsb_create();
FsbConfig cfg = {
.onnx_dir = "./onnx",
.gguf_path = "./onnx/pipeline.gguf",
.yolo_path = "./onnx/yolo.onnx",
.cuda_device = 0,
.skip_body_model = 1,
.person_thresh = 0.5f,
.person_nms_iou = 0.45f,
.max_persons = 0,
};
fsb_load(h, &cfg);
FsbResult results[32];
int n = fsb_process_bgr(h, bgr, width, height, results, 32);
for (int i = 0; i < n; i++) {
// results[i].bbox, .body_pose, .yolo_kps, ...
}
fsb_destroy(h);| Stage | Time (RTX 3090, B=1) |
|---|---|
| YOLO detection | ~5 ms |
| Backbone (DINOv3-ViT-H) | ~150–200 ms |
| Decoder (6-layer) | ~20 ms |
| MHR + camera FFN (CPU) | <1 ms |
| Native C LBS (optional) | <1 ms |
- Backbone is the bottleneck; it dominates end-to-end latency.
- Use
--skip-bodyunless 3D vertices are required. - For higher throughput, batch multiple crops in a single backbone forward pass (already done when multiple persons are detected).
The official PyTorch SAM 3D Body repository from Meta Superintelligence Labs is :
https://github.com/facebookresearch/sam-3d-body
If you use this software repository in your research or work, please cite:
@misc{qammaz2026sam3dbodycpp,
author = {Qammaz, Ammar},
title = {{SAM3DBody-cpp}: Standalone {C++} Inference Engine for {SAM-3D-Body}},
year = {2026},
howpublished = {\url{https://github.com/AmmarkoV/SAM3DBody-cpp}},
note = {Zero-dependency runtime: ONNX Runtime + ggml, with BVH export and Python ctypes frontends}
}as well as the Fast SAM 3D Body work which was the inspiration creating the cpp version
@article{yang2026fastsam3dbody,
title={Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery},
author={Yang, Timing and He, Sicheng and Jing, Hongyi and Yang, Jiawei and Liu, Zhijian and Zou, Chuhang and Wang, Yue},
journal={arXiv preprint arXiv:2603.15603},
year={2026}
}and of course the Meta AI team behind the awesome paper that proposes the SAM 3D Body method.
@article{yang2026sam3dbody,
title={SAM 3D Body: Robust Full-Body Human Mesh Recovery},
author={Yang, Xitong and Kukreja, Devansh and Pinkus, Don and Sagar, Anushka and Fan, Taosha and Park, Jinhyung and Shin, Soyong and Cao, Jinkun and Liu, Jiawei and Ugrinovic, Nicolas and Feiszli, Matt and Malik, Jitendra and Dollar, Piotr and Kitani, Kris},
journal={arXiv preprint arXiv:2602.15989},
year={2026}
}