Pairwise Accuracy Is Not Security: Masterface Attacks Expose a Structural Vulnerability in Face Verification

Ehsan Nazari, 2026

📄 Technical report: technical_report.pdf
Face verification asks whether two face images depict the same person. A model embeds each face as a vector and accepts a match when the two fall within a threshold distance. It powers phone unlock, online ID checks, and border control.
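The accept/reject rule described above can be sketched in a few lines. This is a minimal illustration, assuming cosine distance as the metric and toy 3-D vectors; the function names (`cosine_distance`, `verify`) and the threshold value are hypothetical and not taken from this repository.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def verify(emb_a, emb_b, threshold):
    """Accept the pair as the same person when the distance is within the threshold."""
    return cosine_distance(emb_a, emb_b) <= threshold

# Toy embeddings (assumption: real models output far higher-dimensional vectors).
probe = [0.9, 0.1, 0.0]
enrolled_same = [1.0, 0.0, 0.0]
enrolled_other = [0.0, 1.0, 0.0]

print(verify(probe, enrolled_same, threshold=0.3))   # True
print(verify(probe, enrolled_other, threshold=0.3))  # False
```

The whole security question in this report comes down to which vectors land inside that threshold region, not only whether genuine pairs do.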
You might assume that high pairwise verification accuracy implies a model is robust against adversaries. After all, a score like 99.65% on a face verification benchmark can create the impression that the system is secure.
It isn’t. Pairwise verification benchmarks were never designed to expose security vulnerabilities, and this report makes that gap concrete by showing how much attack surface can remain hidden behind today’s high-accuracy face verification models.
CosFace achieves 99.65% pairwise accuracy on LFW at FAR≈0.001, yet the same model can be fooled by nine optimization-crafted face embeddings that collectively impersonate 47.2% of LFW’s 5,749 identities, about 2,700 people. In each case, at least one of the nine embeddings is accepted as the same person under the verification threshold. This is not unique to one model: we observe the same pattern across ArcFace, CosFace, AdaFace-IR101, AdaFace-ViT, FaceNet-CASIA, and FaceNet-VGG2.
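The "at least one of the nine embeddings is accepted" criterion can be written as a small coverage function. This is a hedged sketch, not the repository's metric code: it assumes one template embedding per identity and a cosine-similarity threshold, and the names `coverage` and `identity_templates` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def coverage(masterfaces, identity_templates, sim_threshold):
    """Fraction of identities matched by at least one masterface embedding.

    Assumption: each identity is represented by a single template embedding.
    """
    covered = sum(
        1
        for template in identity_templates
        if any(cosine_similarity(m, template) >= sim_threshold for m in masterfaces)
    )
    return covered / len(identity_templates)

# Four toy identities on the unit circle.
templates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print(coverage([[1.0, 1.0]], templates, sim_threshold=0.7))                # 0.5
print(coverage([[1.0, 1.0], [-1.0, -1.0]], templates, sim_threshold=0.7))  # 1.0
```

Applied to the headline figure, a coverage of 0.472 over 5,749 identities corresponds to roughly 2,700 covered identities.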
CosFace under three pipeline configurations. Pairwise accuracy (left): uniformly ≈99.6% — the standard benchmark sees no difference. Masterface coverage (right): 47.2% → 0.23%, a more-than-200× swing. Same model, same dataset.
- An optimization-based masterface attack. Two phases: LM-MA-ES on the embedding hypersphere (Phase 1), followed by PGD-style image-space realization (Phase 2). Reaches 49.957% Phase-2 coverage on FaceNet-CASIA, surpassing the prior GAN-based result of Shmelkin et al. (2021) at 43.82%, with a simpler, non-generative pipeline.
- Two pipeline "footguns". The FAR-threshold rounding direction and the face-alignment strategy each modulate masterface coverage by up to two orders of magnitude, while pairwise accuracy moves by ≤0.1 pp. These pipeline choices are security-critical yet structurally invisible to standard benchmarks.
- The irreducible floor (JT-Attack). Even with both pipeline knobs set to their safest values, every model still admits a single embedding covering any 2–4 randomly chosen identities ~100% of the time. Pipeline mitigations bound the magnitude of large-scale attacks; the small-$N$ targeted floor lies in the models themselves.
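To see why the FAR-threshold rounding direction exists at all: with a finite set of impostor scores, FAR is a step function of the threshold, so a target such as 0.001 is usually not hit exactly and the implementation must round one way or the other. The sketch below is one plausible reading of "from below" and "from above", stated here as an assumption; the exact definitions are in the technical report. `far_at` and `pick_threshold` are hypothetical helpers, and scores are similarities (higher means more alike).

```python
def far_at(threshold, impostor_scores):
    """False accept rate: fraction of impostor pairs scoring at or above the threshold."""
    return sum(s >= threshold for s in impostor_scores) / len(impostor_scores)

def pick_threshold(impostor_scores, target_far, direction):
    """Pick a similarity threshold whose FAR approximates target_far.

    direction="under": most permissive threshold with FAR <= target (stricter choice).
    direction="over":  strictest threshold with FAR >= target (more permissive choice).
    """
    candidates = sorted(set(impostor_scores))
    if direction == "under":
        feasible = [t for t in candidates if far_at(t, impostor_scores) <= target_far]
        pick = min
    elif direction == "over":
        feasible = [t for t in candidates if far_at(t, impostor_scores) >= target_far]
        pick = max
    else:
        raise ValueError("direction must be 'under' or 'over'")
    if not feasible:
        raise ValueError("no observed score satisfies the requested FAR bound")
    return pick(feasible)

# Ten toy impostor similarities; a target FAR of 0.25 is not exactly achievable.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(pick_threshold(scores, 0.25, "under"))  # 0.9 (FAR = 0.2)
print(pick_threshold(scores, 0.25, "over"))   # 0.8 (FAR = 0.3)
```

The two thresholds are adjacent steps and barely change pairwise accuracy, but the "over" choice admits a slightly larger acceptance region, which is the kind of region a masterface search is optimized to exploit.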
Accumulative coverage from nine masterface embeddings (one per spherical-k-means cluster, k=9), evaluated against all 5,749 LFW identities. MTCNN = MTCNN-DavidSandberg, Retina = RetinaFace; under = FAR≈0.001 from below, over = from above.
| Model | Pipeline | Coverage % | Accuracy % |
|---|---|---|---|
| CosFace | MTCNN / under | 47.2 | 99.65 |
| CosFace | Retina / over | 24.6 | 99.58 |
| CosFace | Retina / under | 0.23 | 99.58 |
| ArcFace | MTCNN / under | 45.7 | 95.47 |
| ArcFace | Retina / over | 21.4 | 99.55 |
| ArcFace | Retina / under | 0.45 | 99.55 |
| AdaFace-IR101 | MTCNN / under | 32.3 | 99.62 |
| AdaFace-IR101 | Retina / over | 22.5 | 99.58 |
| AdaFace-IR101 | Retina / under | 0.37 | 99.58 |
| AdaFace-ViT | MTCNN / under | 6.6 | 99.90 |
| AdaFace-ViT | Retina / over | 28.8 | 99.70 |
| AdaFace-ViT | Retina / under | 2.4 | 99.70 |
| FaceNet (CASIA) | MTCNN / under | 37.3 | 98.97 |
| FaceNet (CASIA) | Retina / over | 58.1 | 97.97 |
| FaceNet (CASIA) | Retina / under | 41.4 | 97.97 |
| FaceNet (VGG2) | MTCNN / under | 24.2 | 99.53 |
| FaceNet (VGG2) | Retina / over | 34.8 | 99.40 |
| FaceNet (VGG2) | Retina / under | 25.6 | 99.40 |
The CosFace MTCNN / under row is the load-bearing exhibit: a pairwise accuracy of 99.65%, among the highest in the table, coexists with 47.2% coverage, the highest of any row that uses the stricter from-below threshold.
Given a set of target identity embeddings
optimized with LM-MA-ES (Loshchilov et al., 2017) — 1,000 generations, population 100,
Given a masterface embedding $\hat{\mathbf{x}}^*$, Phase 2 iteratively perturbs a source face $\mathbf{s}$ in pixel space (Adam, PGD-style perturbation budget $\epsilon$) so that its embedding under the face mapper $FM$ approaches $\hat{\mathbf{x}}^*$.
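The idea of a budgeted pixel-space search can be illustrated on a toy problem. The sketch below is not the repository's Phase 2: it swaps the deep face mapper for a hypothetical linear map `W`, swaps Adam for plain signed-gradient ascent, and uses an analytic gradient instead of autodiff. It only shows the PGD-style loop: step toward the target embedding, then project back into the $\epsilon$-ball around the source and the valid pixel range.

```python
import math

def matvec(W, s):
    """Multiply matrix W (list of rows) by vector s."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]

def cos_to_target(W, s, target):
    """Cosine similarity between the toy embedding W @ s and a unit-norm target."""
    u = matvec(W, s)
    norm_u = math.sqrt(sum(x * x for x in u))
    return sum((x / norm_u) * t for x, t in zip(u, target))

def grad_cos(W, s, target):
    """Analytic gradient of cos_to_target with respect to s (target must be unit-norm)."""
    u = matvec(W, s)
    norm_u = math.sqrt(sum(x * x for x in u))
    e = [x / norm_u for x in u]
    c = sum(a * t for a, t in zip(e, target))
    g_u = [(t - c * a) / norm_u for t, a in zip(target, e)]
    # Chain rule through u = W s gives W^T g_u.
    return [sum(W[i][j] * g_u[i] for i in range(len(W))) for j in range(len(s))]

def pgd_realize(W, source, target, eps, step, iters):
    """Signed-gradient ascent on cosine similarity under an L-infinity budget eps."""
    s = list(source)
    for _ in range(iters):
        g = grad_cos(W, s, target)
        s = [x + step * ((gi > 0) - (gi < 0)) for x, gi in zip(s, g)]
        # Project into the eps-ball around the source, then into the pixel range [0, 1].
        s = [min(max(x, x0 - eps), x0 + eps) for x, x0 in zip(s, source)]
        s = [min(max(x, 0.0), 1.0) for x in s]
    return s

W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.2]]  # hypothetical stand-in for the face mapper
source = [0.5, 0.5, 0.5]                # toy "image" with three pixels
target = [1.0, 0.0]                     # unit-norm stand-in for the masterface embedding

adv = pgd_realize(W, source, target, eps=0.1, step=0.02, iters=20)
print(cos_to_target(W, source, target), cos_to_target(W, adv, target))
```

The perturbed input stays within `eps` of the source in every coordinate while its embedding moves closer to the target, which is the property Phase 2 relies on.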
The Joint-Threshold Attack asks: under the safest pipeline configuration (RetinaFace + FAR≈0.001 from below), can a single embedding cover $N$ randomly chosen identities at once?
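The joint-threshold condition differs from coverage in that one embedding must be accepted for every target, not just some. A minimal sketch of that check follows, using the normalized mean of the targets as a naive candidate. This baseline is an assumption chosen for illustration, not the optimizer used by the JT-Attack runner, and the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def covers_all(candidate, targets, sim_threshold):
    """True when the candidate is accepted against every target embedding."""
    return all(cosine_similarity(candidate, t) >= sim_threshold for t in targets)

def mean_direction(targets):
    """Naive joint candidate: normalized mean of the unit-normalized targets."""
    unit = [normalize(t) for t in targets]
    dim = len(unit[0])
    return normalize([sum(u[i] for u in unit) / len(unit) for i in range(dim)])

# Three mutually orthogonal toy identities (N = 3).
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
candidate = mean_direction(targets)

print(covers_all(candidate, targets, sim_threshold=0.5))  # True
print(covers_all(candidate, targets, sim_threshold=0.6))  # False
```

Each target here has similarity $1/\sqrt{3} \approx 0.577$ to the candidate, so whether the joint condition holds depends entirely on where the threshold sits relative to the spread of the chosen identities.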
📄 Full technical report: technical_report.pdf. All experimental details, threshold definitions, and additional results live there; this section covers only the commands needed to regenerate the artifacts.
```bash
# 1) Two conda envs (PyTorch + TensorFlow).
conda env create -f environments/master.yml          # ArcFace, CosFace, AdaFace (PyTorch)
conda env create -f environments/master_facenet.yml  # FaceNet variants (TensorFlow)

# 2) Run any config.
conda activate master
python run.py --config <name>                        # e.g. cosface_mtcnn_below
```

Weights, LFW, LMDBs, and embedding caches are auto-built on first use under data/ (the first run pays a one-time download + alignment cost). ArcFace and CosFace weights have no scriptable URL: on first use the loader prints a 3-step pointer to the InsightFace OneDrive folder; download the files manually and drop them in the path it specifies.
Configs live under configs/<section>/<name>.yaml and are looked up by name across all subdirectories. The YAML's attack_mode field selects between two attacks:
- `attack_mode: masterface` (default): Phase-1 GA embedding search + optional Phase-2 image realization.
- `attack_mode: jt`: JT-Attack, a joint-threshold attack across N randomly chosen identities.
| Name in configs | Source | Architecture | Reported LFW acc. |
|---|---|---|---|
| arcface | insightface | IResNet-100 | 99.52% |
| cosface | insightface | IResNet-100 | 99.58% |
| adaface_ir101 | AdaFace | IResNet-101 | 99.58% |
| adaface_vit | CVLface | ViT-Base | 99.70% |
| facenet_casia | davidsandberg/facenet | InceptionResNet | 97.97% |
| facenet_vgg2 | davidsandberg/facenet | InceptionResNet | 99.40% |
Activate master for the first four, master_facenet for the FaceNet variants. reproduce_all.sh switches automatically.
./scripts/reproduce_all.sh runs every YAML under configs/ end-to-end (headline 6×3, JT-Attack, Beatles targeted), switching between the master and master_facenet conda envs automatically based on each config's model_name. Outputs land in results/.
```
run.py            single entry point; dispatches on attack_mode
configs/          experiment YAMLs grouped by section
masterface/       the Python package
  attack/         masterface orchestrator + Phase-1 GA + Phase-2 image realization
  jt_attack/      JT-Attack runner + logger + plotter
  models/         6 face mappers + base + vendored backbones + on-demand weight fetcher
  detectors/      MTCNN-DS + RetinaFace alignment (incl. differentiable)
  data/           LFW fetch, LMDB build, NPZ embeddings, threshold cache
  optimization/   LM-MA-ES + GA + fitness problems
  loss/           euclidean / cosine / arc_cosine (PyTorch and TF variants)
  metrics/        threshold, coverage (Gini, per-identity), clustering stats
  utils/          config lookup, distance functions, result logger, helpers
data/             LFW + LMDB + embedding caches + source faces + model weights (gitignored)
results/          experiment outputs (gitignored)
scripts/          reproduce-all driver + optional pre-fetch weights script
environments/     two conda env files (master, master_facenet)
assets/           figures embedded in this README
```
This research was enabled in part by support provided by the Digital Research Alliance of Canada.
Apache 2.0. Model weights are subject to their upstream licenses.


