Benchmark Details

Carlo Sacchi edited this page Mar 12, 2026 · 10 revisions

This page describes the workloads used in each benchmark category. All tests are native Swift and run locally.

CPU Single-Core

Single-threaded CPU performance across mixed workloads.

Integer

  • What: 64-bit integer arithmetic
  • How: tight loop with add, multiply, shift, and XOR
  • Metric: Mops/s (millions of operations per second)
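A minimal sketch of such a loop, assuming four counted operations per round. The seed, the multiplier, and the function name `integerMops` are illustrative and not taken from the benchmark source.

```swift
import Foundation

/// Runs `iterations` rounds of add, multiply, shift, and XOR on a 64-bit value.
/// Returns the final value (so the loop cannot be optimized away) and Mops/s.
func integerMops(iterations: Int) -> (checksum: UInt64, mops: Double) {
    var x: UInt64 = 0x9E37_79B9_7F4A_7C15          // arbitrary non-zero seed (assumption)
    let start = DispatchTime.now().uptimeNanoseconds
    for i in 0..<iterations {
        x = x &+ UInt64(truncatingIfNeeded: i)     // add (wrapping)
        x = x &* 6_364_136_223_846_793_005         // multiply (wrapping)
        x ^= x >> 29                               // shift, then XOR
    }
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
    let operations = Double(iterations) * 4        // four operations per round (assumption)
    return (x, operations / max(seconds, 1e-9) / 1e6)
}
```

The wrapping operators (`&+`, `&*`) avoid overflow traps, which would otherwise add a branch to every arithmetic step.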

Floating Point

  • What: double-precision math
  • How: multiply, sqrt, sin/cos operations
  • Metric: Mops/s

SIMD (Accelerate)

  • What: vectorized operations using vDSP
  • How: vector multiply-add, vector add, dot product
  • Metric: GFLOPS
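The three operations map onto vDSP calls roughly as follows. This is a sketch that assumes single-precision vectors and a conventional FLOP count (2 per element for multiply-add and dot product, 1 for add); the benchmark's own accounting may differ.

```swift
#if canImport(Accelerate)
import Accelerate

/// One pass of multiply-add, add, and dot product over `n`-element Float vectors.
/// Returns the dot product and the floating-point operations counted for the pass.
func simdPass(n: Int) -> (dot: Float, flops: Int) {
    let a = [Float](repeating: 1.5, count: n)
    let b = [Float](repeating: 2.0, count: n)
    let c = [Float](repeating: 0.5, count: n)
    var d = [Float](repeating: 0, count: n)
    var e = [Float](repeating: 0, count: n)
    var dot: Float = 0
    let length = vDSP_Length(n)

    vDSP_vma(a, 1, b, 1, c, 1, &d, 1, length)   // d = a * b + c    (2 flops per element)
    vDSP_vadd(a, 1, d, 1, &e, 1, length)        // e = a + d        (1 flop per element)
    vDSP_dotpr(d, 1, e, 1, &dot, length)        // dot = sum(d * e) (about 2 flops per element)

    return (dot, 5 * n)
}
#endif
```

GFLOPS then follows as the counted operations divided by elapsed seconds and by 10^9.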

Cryptography

  • What: AES-256-GCM encryption + SHA-256 hashing
  • How: encrypt a data buffer and hash ciphertext
  • Metric: MB/s throughput
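One way to express this workload is with CryptoKit, sketched below. Whether the benchmark uses CryptoKit, and whether it counts 1 MB as 10^6 bytes, are assumptions; `cryptoThroughput` is a hypothetical name.

```swift
#if canImport(CryptoKit)
import CryptoKit
import Foundation

/// Encrypts `plaintext` with AES-256-GCM, then hashes the ciphertext with SHA-256.
/// Returns throughput in MB/s (plaintext bytes per second) and the digest.
func cryptoThroughput(plaintext: Data) throws -> (mbPerSec: Double, digest: SHA256.Digest) {
    let key = SymmetricKey(size: .bits256)                  // random 256-bit key
    let start = DispatchTime.now().uptimeNanoseconds
    let sealed = try AES.GCM.seal(plaintext, using: key)    // random nonce by default
    let digest = SHA256.hash(data: sealed.ciphertext)
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
    let megabytes = Double(plaintext.count) / 1_000_000     // assumes 1 MB = 10^6 bytes
    return (megabytes / max(seconds, 1e-9), digest)
}
#endif
```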

Compression

  • What: LZFSE compression + decompression
  • How: compress and decompress a buffer
  • Metric: MB/s throughput (combined)
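A round trip through LZFSE can be sketched with Foundation's `NSData` compression methods (macOS 10.15 or later). The page does not define "combined", so this sketch assumes it means the bytes fed to the compressor plus the bytes produced by the decompressor, divided by total time.

```swift
#if canImport(Darwin)
import Foundation

/// Compresses and decompresses `input` with LZFSE.
/// Returns combined throughput in MB/s and the restored data for verification.
func lzfseRoundTrip(_ input: Data) throws -> (mbPerSec: Double, restored: Data) {
    let start = DispatchTime.now().uptimeNanoseconds
    let compressed = try (input as NSData).compressed(using: .lzfse)
    let restored = try compressed.decompressed(using: .lzfse) as Data
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
    // "Combined" is interpreted here as input bytes + restored bytes (assumption).
    let megabytes = Double(input.count + restored.count) / 1_000_000
    return (megabytes / max(seconds, 1e-9), restored)
}
#endif
```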

CPU Multi-Core

Same tests as single-core, executed in parallel across all available CPU cores.

  • Uses Swift TaskGroup for parallel execution
  • Sums throughput across tasks
  • Score is normalized by core count (see Scoring Methodology)
  • SIMD (v2.1.6+): Uses 8K-element vectors (~96 KB per core) that fit in L1 cache, so the test is compute-bound and scales with core count. Earlier versions used 1M-element vectors (12 MB), which caused memory-bandwidth contention across cores.
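The sketch below shows the general shape of fanning work out with a task group and summing per-task throughput. It also checks the working-set figures: 96 KB and 12 MB are consistent with three single-precision vectors per core, which is an inference from the numbers rather than something the page states. Function names are hypothetical.

```swift
import Foundation

/// Bytes touched by one task: `vectors` buffers of `elements` 4-byte floats.
func workingSetBytes(elements: Int, vectors: Int = 3) -> Int {
    elements * vectors * MemoryLayout<Float>.size
}

/// Runs `work` once per core inside a task group and sums the returned throughputs.
func totalThroughput(cores: Int, work: @escaping @Sendable () -> Double) async -> Double {
    await withTaskGroup(of: Double.self) { group in
        for _ in 0..<cores {
            group.addTask { work() }
        }
        var total = 0.0
        for await result in group {
            total += result
        }
        return total
    }
}
```

In practice `cores` would come from something like `ProcessInfo.processInfo.activeProcessorCount`; with 8,192 elements the helper returns 98,304 bytes (96 KB), and with 1,048,576 elements it returns 12 MB.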

Memory

Unified memory subsystem performance.

Sequential Read

  • Linear read of a 256 MB buffer
  • Page-aligned allocation with optional mlock
  • Metric: GB/s

Sequential Write

  • Linear write of a 256 MB buffer
  • Page-aligned allocation with optional mlock
  • Metric: GB/s

Copy

  • memcpy between two 256 MB buffers
  • Page-aligned allocation with optional mlock
  • Metric: GB/s
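Page-aligned allocation with a best-effort `mlock` can be sketched as below, followed by a timed `memcpy`. The sketch counts each byte once for the copy and treats 1 GB as 10^9 bytes; both are assumptions, as are the function names.

```swift
import Foundation

/// Allocates `size` bytes aligned to the page size; optionally tries to pin the pages.
func allocatePageAligned(size: Int, lock: Bool) -> UnsafeMutableRawPointer? {
    var pointer: UnsafeMutableRawPointer? = nil
    let pageSize = Int(getpagesize())
    guard posix_memalign(&pointer, pageSize, size) == 0, let buffer = pointer else { return nil }
    if lock { _ = mlock(buffer, size) }   // best-effort: failure (e.g. rlimit) is ignored
    return buffer
}

/// Copies `size` bytes between two page-aligned buffers and returns GB/s.
func copyBandwidthGBps(size: Int) -> Double? {
    guard let source = allocatePageAligned(size: size, lock: true) else { return nil }
    defer { munlock(source, size); free(source) }
    guard let destination = allocatePageAligned(size: size, lock: true) else { return nil }
    defer { munlock(destination, size); free(destination) }

    memset(source, 0xA5, size)            // touch pages so they are resident before timing
    memset(destination, 0, size)

    let start = DispatchTime.now().uptimeNanoseconds
    memcpy(destination, source, size)
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
    return Double(size) / max(seconds, 1e-9) / 1e9
}
```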

Latency

  • Pointer-chase random access
  • 32 MB working set, 10M accesses
  • Metric: ns (lower is better)
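A pointer chase makes every load depend on the previous one, which defeats prefetching. The sketch below builds a single random cycle with Sattolo's algorithm and walks it; the use of an `[Int]` index array and the function names are illustrative choices, not the benchmark's actual data layout.

```swift
import Foundation

/// Builds one random cycle over `count` slots (Sattolo's algorithm), so that
/// following next[i] visits every slot exactly once before returning to the start.
func makeChain(count: Int) -> [Int] {
    var next = Array(0..<count)
    var i = count - 1
    while i > 0 {
        let j = Int.random(in: 0..<i)   // j < i keeps the permutation a single cycle
        next.swapAt(i, j)
        i -= 1
    }
    return next
}

/// Walks the chain for `accesses` dependent loads and returns average ns per access.
func chaseLatencyNs(chain: [Int], accesses: Int) -> (ns: Double, last: Int) {
    precondition(!chain.isEmpty && accesses > 0)
    var index = 0
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<accesses {
        index = chain[index]            // each load depends on the previous result
    }
    let elapsed = Double(DispatchTime.now().uptimeNanoseconds - start)
    return (elapsed / Double(accesses), index)   // returning `index` keeps the loop live
}
```

A 32 MB working set corresponds to `makeChain(count: 32 * 1024 * 1024 / MemoryLayout<Int>.size)` in this layout.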

Note: Advanced profiling adds stride throughput and block-size sweeps for empirical cache boundary detection. Those sweeps report throughput vs stride/size, not latency.

Disk

Storage performance with cache bypass. Patterns are NovaBench-compatible for direct comparison.

Why NovaBench-Compatible?

We align with NovaBench patterns to enable meaningful cross-tool comparisons:

  • Sequential: 4 MB blocks (NovaBench uses up to 8 simultaneous)
  • Random: 4 KB blocks, QD1 (one operation at a time)
  • All metrics in MB/s for consistency

Setup

  • Creates a unique temp directory per run
  • Uses O_NOFOLLOW to prevent symlink attacks
  • Uses F_NOCACHE to bypass the filesystem cache
  • Uses F_FULLFSYNC to force sync to physical media
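On macOS these flags are applied through `open` and `fcntl`. The following is a sketch of that setup under stated assumptions: the directory template, file mode, and function names are hypothetical, and error handling is reduced to a sentinel return.

```swift
#if canImport(Darwin)
import Darwin

/// Creates a unique directory from a mkdtemp template and returns its path.
func makeRunDirectory() -> String? {
    var template = Array("/tmp/bench-sketch.XXXXXX".utf8CString)   // hypothetical prefix
    return template.withUnsafeMutableBufferPointer { buffer -> String? in
        guard let base = buffer.baseAddress, mkdtemp(base) != nil else { return nil }
        return String(cString: base)
    }
}

/// Opens `path` for writing, refusing to follow a symlink, with data caching disabled.
/// Returns the file descriptor, or -1 on failure.
func openUncached(path: String) -> Int32 {
    let fd = open(path, O_WRONLY | O_CREAT | O_NOFOLLOW, 0o600)
    guard fd >= 0 else { return -1 }
    if fcntl(fd, F_NOCACHE, 1) == -1 {     // ask the kernel not to cache this file's data
        close(fd)
        return -1
    }
    return fd
}

/// Flushes the file and asks the drive to flush its own buffers to physical media.
func fullSync(_ fd: Int32) -> Bool {
    fcntl(fd, F_FULLFSYNC) != -1
}
#endif
```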

Sequential Write

  • 4 MB chunks into a 256 MB (quick) or 1 GB (normal) file
  • Cache bypass + final sync
  • Metric: MB/s

Sequential Read

  • 4 MB chunks from a 256 MB (quick) or 1 GB (normal) file
  • File is synced before reading; combined with F_NOCACHE, this aims for a cold read
  • Metric: MB/s

Random Write

  • 4 KB writes at random offsets in a 512 MB (quick) or 1 GB (normal) sparse file
  • QD1 pattern (one I/O at a time)
  • Metric: MB/s (converted from IOPS × 4KB / 1MB)

Random Read

  • 4 KB reads at random offsets in a 512 MB (quick) or 1 GB (normal) file
  • QD1 pattern (one I/O at a time)
  • Metric: MB/s (converted from IOPS × 4KB / 1MB)
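The IOPS-to-MB/s conversion used by both random tests is plain arithmetic. The sketch below assumes 1 MB = 1,048,576 bytes, which the page does not specify; pass `bytesPerMB: 1_000_000` for the decimal convention.

```swift
/// Converts a count of fixed-size random operations into MB/s.
func randomThroughputMBps(operations: Int,
                          seconds: Double,
                          blockBytes: Int = 4096,
                          bytesPerMB: Double = 1_048_576) -> Double {
    let iops = Double(operations) / seconds          // operations per second
    return iops * Double(blockBytes) / bytesPerMB    // IOPS × block size ÷ 1 MB
}
```

For example, 2,000 operations of 4 KB completed in one second correspond to 7.8125 MB/s under the binary convention.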

Disk Parameters

| Parameter | Quick Mode | Normal Mode |
|---|---|---|
| Sequential file size | 256 MB | 1 GB |
| Sequential chunk size | 4 MB | 4 MB |
| Random file size | 512 MB | 1 GB |
| Random block size | 4 KB | 4 KB |
| Random operations | 500 | 2,000 |

Disk I/O Volume (per iteration)

Disk tests run repeatedly for the selected duration, so total bytes scale with duration and drive speed. Per iteration of each subtest:

  • Sequential write/read: writes/reads the sequential file size (256 MB quick / 1 GB normal) in 4 MB blocks, then syncs
  • Random write: 500/2,000 operations of 4 KB each (about 2 MB / 8 MB total), then final sync
  • Random read: 500/2,000 operations of 4 KB each (about 2 MB / 8 MB total); each iteration also creates and syncs a 512 MB / 1 GB file to keep reads cold

Cache bypass is best-effort; macOS VM and SSD controller caches can still influence results.

Comparison with NovaBench

| Metric | Our Tool (M1) | NovaBench (M1) | Why Different |
|---|---|---|---|
| Seq Read | ~2180 MB/s | 3356 MB/s | We use F_NOCACHE |
| Seq Write | ~700 MB/s | 3279 MB/s | We use F_FULLFSYNC |
| Rand Read | ~43 MB/s | 166 MB/s | 1 GB file exceeds cache |
| Rand Write | ~17 MB/s | 761 MB/s | F_NOCACHE + final sync |

Note: Our tool aims to measure device-level throughput with the filesystem cache bypassed and writes flushed to media, which is closer to what sustained workloads see than cache throughput. NovaBench numbers include filesystem cache effects, which makes them higher but less representative of sustained I/O performance.

GPU (Metal)

Integrated GPU compute benchmarks using Metal.

Compute (Matrix Multiply)

  • Dense matrix multiplication
  • Size: 1024×1024 (quick) or 2048×2048 (normal)
  • Metric: GFLOPS
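For a dense n×n multiply, GFLOPS is conventionally derived from 2·n³ floating-point operations (one multiply and one add per inner-product term). The sketch below assumes the benchmark uses that convention, which the page does not confirm.

```swift
/// GFLOPS for an n×n by n×n matrix multiply that took `seconds`.
func matmulGFLOPS(n: Int, seconds: Double) -> Double {
    let size = Double(n)
    let flops = 2.0 * size * size * size   // n³ multiplies + n³ adds
    return flops / seconds / 1e9
}
```

Under this convention a 2048×2048 multiply counts about 17.2 billion operations, eight times the 1024×1024 case.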

Particles Simulation

  • N-body style simulation
  • Particles: 100,000 (quick) or 1,000,000 (normal)
  • Metric: Mparts/s

Gaussian Blur

  • 5×5 convolution on an image
  • Size: 2048×2048 (quick) or 4096×4096 (normal)
  • Metric: MP/s

Edge Detection (Sobel)

  • Sobel filter on the same image size
  • Metric: MP/s

GPU Parameters

| Parameter | Quick Mode | Normal Mode |
|---|---|---|
| Matrix size | 1024×1024 | 2048×2048 |
| Particle count | 100,000 | 1,000,000 |
| Image size (blur/edge) | 2048×2048 | 4096×4096 |

GPU Notes

  • Metal shaders are compiled at runtime from inline source (SPM compatible)
  • Texture data is heap-allocated to avoid stack overflow on large images (v2.1.1+)
  • Uses deterministic patterns for reproducibility
  • If Metal is unavailable, GPU tests return 0 and are scored as "Failed"
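Runtime compilation from inline source, with a graceful fallback when Metal is missing, can be sketched as follows. The `scale` kernel is a placeholder, not one of the benchmark's shaders, and `makePipeline` is a hypothetical name.

```swift
#if canImport(Metal)
import Metal

// Placeholder Metal Shading Language source embedded as a Swift string.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;
kernel void scale(device float *data [[buffer(0)]],
                  uint id [[thread_position_in_grid]]) {
    data[id] = data[id] * 2.0f;
}
"""

/// Compiles the inline kernel at runtime and returns a compute pipeline.
/// Returns nil if no Metal device exists or compilation fails; a caller
/// could then report a result of 0 for the GPU tests.
func makePipeline() -> MTLComputePipelineState? {
    guard let device = MTLCreateSystemDefaultDevice() else { return nil }
    guard let library = try? device.makeLibrary(source: kernelSource, options: nil),
          let function = library.makeFunction(name: "scale") else { return nil }
    return try? device.makeComputePipelineState(function: function)
}
#endif
```

Compiling from a string avoids shipping a precompiled `.metallib`, which is what makes this approach workable in a Swift Package Manager build.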

AI/ML (v2.0.0+)

Neural Engine and machine learning inference benchmarks using CoreML and Accelerate.

Why a Separate AI Score?

The AI/ML score is not included in the Total Score. This mirrors Geekbench AI's approach:

  • AI workloads have different characteristics than traditional benchmarks
  • Neural Engine performance varies significantly between chip generations
  • Users may care about AI performance independently from general compute

CoreML Inference Tests

All CoreML tests use the same model (MobileNetV2 image classification) with different compute units:

CPU Inference

  • What: CoreML inference using CPU-only
  • Compute Units: .cpuOnly
  • Metric: IPS (inferences per second)
  • Measures: CPU's ability to run ML models without GPU/ANE

GPU Inference

  • What: CoreML inference using GPU
  • Compute Units: .cpuAndGPU
  • Metric: IPS (inferences per second)
  • Measures: GPU's ability to accelerate ML inference

Neural Engine Inference

  • What: CoreML inference with Neural Engine
  • Compute Units: .all (CoreML schedules to ANE when beneficial)
  • Metric: IPS (inferences per second)
  • Measures: ANE throughput for supported operations
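Selecting a compute unit in Core ML comes down to one property on `MLModelConfiguration`. The sketch below loads the same compiled model three ways; `loadModel` and `variants` are hypothetical names, and the model URL is supplied by the caller.

```swift
#if canImport(CoreML)
import CoreML

/// Loads a compiled model (.mlmodelc) restricted to the given compute units.
func loadModel(at compiledModelURL: URL, units: MLComputeUnits) throws -> MLModel {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = units
    return try MLModel(contentsOf: compiledModelURL, configuration: configuration)
}

// The three test variants described above, paired with their compute units.
let variants: [(name: String, units: MLComputeUnits)] = [
    ("CPU", .cpuOnly),
    ("GPU", .cpuAndGPU),
    ("Neural Engine", .all),   // Core ML may still place some layers on CPU or GPU
]
#endif
```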

BNNS (Accelerate Framework)

  • What: Matrix multiplication using vDSP
  • Size: 512×512 (quick) or 1024×1024 (normal)
  • Metric: GFLOPS
  • Measures: CPU vector math performance for ML workloads

AI Model Management

The benchmark downloads a CoreML model from Apple's official ML assets on first run:

  • Model: MobileNetV2 image classification (~17MB source)
  • Source: ml-assets.apple.com (Apple's official CoreML model repository)
  • Compilation: Compiled locally using CoreML framework (no Xcode required)
  • Cache: ~/Library/Application Support/osx-bench/models/
  • Offline mode: Use --offline to skip if model not cached
  • Custom model: Use --model-path for local .mlmodelc models

AI Test Parameters

| Parameter | Quick Mode | Normal Mode |
|---|---|---|
| Warmup iterations | 5 | 20 |
| Min iterations | 5 | 10 |
| Max iterations | 1,000 | 10,000 |
| BNNS matrix size | 512×512 | 1024×1024 |
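The page does not say how the warmup, minimum, and maximum iteration counts interact with the time budget. One plausible reading, sketched below as an assumption, is: run the warmup untimed, then iterate until the budget has elapsed, but never fewer than the minimum or more than the maximum. `IterationPolicy` and `measureIPS` are hypothetical names.

```swift
import Foundation

struct IterationPolicy {
    var warmup: Int
    var minIterations: Int
    var maxIterations: Int
}

/// Runs `run` under the policy and returns the timed iteration count and
/// inferences per second (IPS).
func measureIPS(policy: IterationPolicy,
                budgetSeconds: Double,
                run: () -> Void) -> (iterations: Int, ips: Double) {
    for _ in 0..<policy.warmup { run() }            // untimed warmup

    let start = DispatchTime.now().uptimeNanoseconds
    var count = 0
    var elapsed = 0.0
    while count < policy.maxIterations {
        run()
        count += 1
        elapsed = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9
        // Stop once both the minimum count and the time budget are satisfied.
        if count >= policy.minIterations && elapsed >= budgetSeconds { break }
    }
    return (count, Double(count) / max(elapsed, 1e-9))
}
```

With the quick-mode values (5 warmup, 5 minimum, 1,000 maximum), a very fast model stops at 1,000 timed iterations and a very slow one still completes 5.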

AI Notes

  • Neural Engine availability depends on macOS version and chip
  • CoreML decides layer scheduling; not all operations run on the ANE even with .all
  • Input is synthetic (deterministic pattern for reproducibility)
  • Results are intended to be comparable to the Geekbench AI methodology (see the Geekbench AI Workloads paper)

Thermal Monitoring

Thermal state is tracked during a run using the macOS public API:

  • Nominal
  • Fair
  • Serious
  • Critical

If throttling is detected, scores may be lower than peak performance.
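The four states match Foundation's `ProcessInfo.ThermalState`, so the sketch below assumes that is the public API in question. Treating Serious and Critical as "throttling likely" is an illustrative threshold, not the benchmark's documented rule.

```swift
#if canImport(Darwin)
import Foundation

/// Maps a thermal state to the labels used on this page.
func thermalLabel(_ state: ProcessInfo.ThermalState) -> String {
    switch state {
    case .nominal: return "Nominal"
    case .fair: return "Fair"
    case .serious: return "Serious"
    case .critical: return "Critical"
    @unknown default: return "Unknown"
    }
}

/// Illustrative threshold: flag a run when the system reports heavy thermal pressure.
func isThrottlingLikely(_ state: ProcessInfo.ThermalState) -> Bool {
    state == .serious || state == .critical
}
#endif
```

The current value is available as `ProcessInfo.processInfo.thermalState` and can be sampled before and after each category.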

Duration Modes

| Mode | Flag | Duration | Use Case |
|---|---|---|---|
| Quick | --quick | ~3s per category | Fast iteration |
| Normal | default | 10s per category | Standard runs |
| Custom | -d N | N seconds | Tuned runs |
| Stress | --stress | ~60s per category | Sustained performance |
| Repeat | --repeats N | N × full run | Publishable results (median) |
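Repeat mode reports the median across runs, which is less sensitive to a single slow or fast run than the mean. A minimal sketch of that aggregation follows; averaging the two middle values for an even count is an assumption.

```swift
/// Median of the scores from repeated runs; nil when there are no scores.
func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let middle = sorted.count / 2
    if sorted.count % 2 == 1 {
        return sorted[middle]
    }
    return (sorted[middle - 1] + sorted[middle]) / 2   // even count: mean of the two middle values
}
```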
