Fast face embedding library. 3 ms inference, 7 MB binary, zero dependencies.
A 148 KB face embedding library that runs EdgeFace-XS at 3.0 ms/face on a consumer i5 -- faster than ONNX Runtime on the same model and CPU.
No Python. No GPU. No ONNX Runtime. One header, one static library.
```c
#include "facex.h"

FaceX* fx = facex_init("edgeface_xs_fp32.bin", NULL);
float emb[512];
facex_embed(fx, rgb_112x112, emb);
```

Face embedding is the core of every face recognition system -- turning a face photo into a 512-dimensional vector. The standard way to run it on CPU is ONNX Runtime (a 28 MB DLL plus Python). That's fine for servers, but too heavy for edge devices, kiosks, embedded systems, and anyone who values simplicity.
FaceX is a single C99 file with handwritten AVX2/AVX-512 SIMD kernels that beats ONNX Runtime by 23% with zero dependencies. Six months of optimization: per-op profiling, custom GEMM microkernels, INT8 quantization, cache-tuned memory layout, thread pool -- every millisecond fought for.
- Edge AI / embedded -- kiosks, turnstiles, access control where you have a $40 CPU and no GPU.
- Server-side at scale -- 3 ms/face means 300+ faces/sec on a single core.
- Privacy-first deployments -- runs fully offline, no cloud, no telemetry.
- Developers -- `#include "facex.h"`, link, done. No package managers, no version conflicts.
Measured on Intel i5-11500 (6 cores, AVX-512 + VNNI):
| Engine | Median | Min | vs FaceX |
|---|---|---|---|
| FaceX | 3.0 ms | 2.87 ms | -- |
| ONNX Runtime 1.23 | 3.9 ms | 3.18 ms | 1.30x slower |
| InsightFace (R34) | 17 ms | -- | 5.7x slower |
| FaceNet (PyTorch) | 30 ms | -- | 10x slower |
| dlib | 50+ ms | -- | 17x slower |
| Benchmark | Score |
|---|---|
| LFW verification | 99.73% |
| Model parameters | 1.77M |
| Embedding dim | 512 |
| Metric | FaceX | ONNX Runtime |
|---|---|---|
| Library size | 148 KB | 28 MB |
| Total deploy | 7 MB | 157 MB |
| Dependencies | none | Python + onnxruntime |
| Cold start | ~100 ms | ~350 ms |
```c
#include "facex.h"

int main(void) {
    // Load engine (one-time, ~100 ms)
    FaceX* fx = facex_init("edgeface_xs_fp32.bin", NULL);

    // Compute embedding (3 ms per call)
    float face[112 * 112 * 3];   // RGB, HWC, values in [-1, 1]
    float embedding[512];
    facex_embed(fx, face, embedding);

    // Compare two faces: fill emb_a / emb_b via facex_embed as above
    float emb_a[512], emb_b[512];
    float sim = facex_similarity(emb_a, emb_b);
    // sim > 0.3 → same person

    facex_free(fx);
    return 0;
}
```

Build and link:

```sh
gcc -O3 -march=native -Iinclude -o myapp myapp.c -L. -lfacex -lm -lpthread
```

Go binding:

```go
import "github.com/facex-engine/facex/go/facex"
ff, _ := facex.New(facex.Config{
    Exe:     "./facex-cli",
    Weights: "./edgeface_xs_fp32.bin",
})
defer ff.Close()

embedding, _ := ff.Embed(rgbImage)
sim := facex.CosSim(embA, embB)
```

CLI:

```sh
# Pipe mode: reads 112x112x3 float32 HWC, writes 512 float32
./facex-cli weights.bin --server < faces.raw > embeddings.raw
```

Building from source:

```sh
make          # builds libfacex.a + facex-cli
make example  # builds and runs the example
make encrypt  # builds the weight encryption tool
```

Requirements: GCC with AVX2 support. Nothing else.

To build the static library manually, compile the sources to objects and archive them, then link your application with `-lfacex -lm -lpthread`:

```sh
gcc -O3 -march=x86-64-v3 -mavx2 -mfma -DFACEX_LIB -c src/*.c
ar rcs libfacex.a *.o
```

```c
// Initialize engine. Returns NULL on error.
// license_key: NULL for plain weights, or key string for AES-256 encrypted.
FaceX* facex_init(const char* weights_path, const char* license_key);
// Compute 512-dim face embedding from 112x112 RGB image.
// rgb_hwc: float32 array [112][112][3], values in [-1, 1].
// embedding: output buffer, 512 floats (L2-normalized).
int facex_embed(FaceX* fx, const float* rgb_hwc, float embedding[512]);
// Cosine similarity between two embeddings. Range [-1, 1].
float facex_similarity(const float emb1[512], const float emb2[512]);
// Free engine resources.
void facex_free(FaceX* fx);
// Version string.
const char* facex_version(void);
```

Model architecture:

```
Input: 112x112 RGB float32
        ↓
Stem: Conv 3→32, stride 4
        ↓
Stage 0: 3× ConvNeXt blocks (C=32)
        ↓
Stage 1: 2× ConvNeXt + XCA attention (C=64)
        ↓
Stage 2: 8× ConvNeXt + XCA attention (C=100)
        ↓
Stage 3: 2× ConvNeXt + XCA attention (C=192)
        ↓
Global Average Pool → LayerNorm → FC → L2 Norm
        ↓
Output: 512-dim embedding
```
Engine internals:
- Pure C99 + SIMD intrinsics (AVX2, FMA, AVX-512, VNNI)
- INT8 quantized GEMM with `vpmaddubsw` (AVX2) / `vpdpbusd` (VNNI)
- FP32 packed column-panel MatMul (NR=8 AVX2, NR=16 AVX-512)
- Custom thread pool with work-stealing (WaitOnAddress / futex)
- Exact GELU via a polynomial `erf` approximation (A&S 7.1.26)
- Pre-packed weights at load time for cache-optimal access
- Optional AES-256-CTR weight encryption with hardware binding
For commercial deployment with IP protection:
```sh
# Encrypt weights (binds to target machine hardware)
./facex-encrypt encrypt weights.bin weights.enc "LICENSE-KEY"
```

```c
// Load encrypted weights
FaceX* fx = facex_init("weights.enc", "LICENSE-KEY");
```

Wrong key or a different machine → load fails. The original weights never touch disk in plaintext on the target machine.
| Language | Method | Latency overhead |
|---|---|---|
| C / C++ | `libfacex.a` + `facex.h` | 0 (native) |
| Go | `go/facex` subprocess | ~1 ms IPC |
| Python | subprocess / ctypes | ~1 ms IPC |
| Any | `facex-cli --server` stdin/stdout | ~1 ms IPC |
| HTTP | wrap in your server | your choice |
- x86-64 only. AVX2 required, AVX-512 optional. ARM NEON port planned for Q3 2026.
- Embedding only. Face detection and alignment are separate steps.
- Single model. EdgeFace-XS (1.77M params). Other models need weight conversion.
Uses EdgeFace-XS by George et al.:
- 1.77M parameters (smallest in its accuracy class)
- 99.73% LFW, competitive with models 100x larger
- Released upstream under CC BY-NC-SA 4.0 (non-commercial)
```
include/
  facex.h           — public API (5 functions)
  weight_crypto.h   — encryption API
src/
  facex.c           — API implementation
  edgeface_engine.c — forward pass (all stages + ops)
  transformer_ops.c — SIMD kernels (LN, GELU, MatMul, Conv)
  gemm_int8_4x8c8.c — INT8 GEMM microkernel (AVX2 + VNNI)
  threadpool.c/h    — lock-free thread pool
  weight_crypto.c   — AES-256-CTR encryption
go/facex/           — Go binding (subprocess protocol)
examples/
  example.c         — minimal usage example
docs/               — SVG benchmarks, logo
```
Q: Is it really faster than ONNX Runtime? A: Yes. Measured on the same CPU, same model, same input. FaceX median 3.0 ms vs ONNX Runtime median 3.9 ms. The gap comes from handwritten SIMD kernels that avoid framework overhead.
Q: What accuracy vs ArcFace-R100? A: EdgeFace-XS gets 99.73% on LFW vs ArcFace-R100's 99.80%. That 0.07% gap buys you 10x the speed and a 60x smaller model.
Q: Can I use this commercially? A: The engine code is Apache 2.0 -- fully commercial. The bundled model weights are CC BY-NC-SA 4.0 (non-commercial). For commercial use, train your own weights or contact for licensing.
Q: Does it do face detection? A: No. FaceX is the embedding step only. Pair it with any face detector (RetinaFace, SCRFD, YuNet, etc.) for a complete pipeline.
```bibtex
@software{facex2026,
  author = {Atinov, Baurzhan},
  title  = {FaceX: Fast CPU Face Embedding Library},
  year   = {2026},
  url    = {https://github.com/facex-engine/facex}
}
```

Code: Apache License 2.0 -- free for commercial use. Model weights: CC BY-NC-SA 4.0 (follows the upstream EdgeFace license). Train your own weights for unrestricted commercial use.
For commercial licensing: bauratynov@gmail.com
Created by Baurzhan Atinov (Kazakhstan)
