M4 Mac mini #57

@geerlingguy

Basic information

  • Board URL (official): https://www.apple.com/mac-mini/
  • Board purchased from: Apple (direct)
  • Board purchase date: October 29, 2024 (arrived Nov 11, 2024)
  • Board specs (as tested): M4 (10-core CPU, 10-core GPU, 16-core Neural Engine), 32GB RAM, 1TB SSD, 10 GbE
  • Board price (as tested): $1,499.00

Linux/system information

# output of `screenfetch`
                 -/+:.          jgeerling@jeff-mini
                :++++.          OS: 64bit macOS  
               /+++/.           Kernel: arm64 Darwin 24.1.0
       .:-::- .+/:-``.::-       Uptime: 5h 39m
    .:/++++++/::::/++++++/:`    Packages: 183
  .:///////////////////////:`   Shell: zsh 5.9
  ////////////////////////`     Resolution: 3840x2160 
 -+++++++++++++++++++++++`      DE: Aqua
 /++++++++++++++++++++++/       WM: Quartz Compositor
 /sssssssssssssssssssssss.      WM Theme: Blue (Dark)
 :ssssssssssssssssssssssss-     Font: FMonoMedium
  osssssssssssssssssssssssso/`  Disk: 190G / 995G (20%)
  `syyyyyyyyyyyyyyyyyyyyyyyy+`  CPU: Apple M4
   `ossssssssssssssssssssss/    GPU: Apple M4 
     :ooooooooooooooooooo+.     RAM: 3974MiB / 32768MiB
      `:+oo+/:-..-:/+o+/-      

# output of `uname -a`
Darwin jeff-mini.local 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132 arm64

Benchmark results

CPU

Power

  • Idle power draw (at wall): 4.1 W
  • Maximum simulated power draw (stress-ng --matrix 0): 31.2 W
  • During Geekbench multicore benchmark: 36 W
  • During top500 HPL benchmark: 39.6 W
  • During Cinebench 2024: 38 W
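
To put the load figures in context against the idle baseline, the readings above can be compared with a few lines of Python. This is only a sketch over the numbers listed here; the dictionary keys and the `above_idle` helper are illustrative names, not part of any benchmark tool.

```python
# Wall-power readings copied from the list above, in watts.
power_w = {
    "idle": 4.1,
    "stress-ng --matrix 0": 31.2,
    "Geekbench multicore": 36.0,
    "top500 HPL": 39.6,
    "Cinebench 2024": 38.0,
}

idle = power_w["idle"]


def above_idle(watts, baseline=idle):
    """Extra power drawn over the idle baseline, rounded to 0.1 W."""
    return round(watts - baseline, 1)


# Delta over idle for every load scenario.
deltas = {name: above_idle(w) for name, w in power_w.items() if name != "idle"}

# Scenario with the highest wall-power reading.
peak_name = max(power_w, key=power_w.get)

for name, delta in deltas.items():
    print(f"{name}: +{delta} W over idle")
print(f"highest draw: {peak_name} ({power_w[peak_name]} W)")
```

The deltas are differences between single readings at the wall, so they include power supply losses and whatever else the system was doing at the time.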

Disk

Internal Apple Storage

| Benchmark | Result |
| --- | --- |
| AmorphousDiskMark 4K random read QD64 | 1113.00 MB/s |
| AmorphousDiskMark 4K random write QD64 | 121.97 MB/s |
| AmorphousDiskMark 1M sequential read | 3017.64 MB/s |
| AmorphousDiskMark 1M sequential write | 3196.68 MB/s |
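
The 4K random results are sometimes easier to compare as operations per second. The sketch below converts them under two assumptions that are not stated in the results above: that 1 MB means 1,000,000 bytes and that a "4K" block is 4096 bytes. If either assumption is wrong for this tool, the figures shift accordingly.

```python
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes
BLOCK_BYTES = 4096        # assumption: a "4K" block is 4096 bytes


def approx_iops(mb_per_s):
    """Rough I/O operations per second for a 4K random test result."""
    return int(mb_per_s * BYTES_PER_MB / BLOCK_BYTES)


read_iops = approx_iops(1113.00)   # 4K random read QD64
write_iops = approx_iops(121.97)   # 4K random write QD64

print(f"4K random read:  ~{read_iops} IOPS")
print(f"4K random write: ~{write_iops} IOPS")
```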

Network

iperf3 results:

  • iperf3 -c $SERVER_IP: 9.40 Gbps
  • iperf3 -c $SERVER_IP --reverse: 9.38 Gbps
  • iperf3 -c $SERVER_IP --bidir: 9.37 Gbps up, 7.73 Gbps down

The 10 GbE connection adds about 2 W to total system power draw.
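
As a quick sanity check, the iperf3 results can be expressed as a share of the link's nominal rate. This sketch assumes a nominal 10 Gbps for 10 GbE and ignores protocol overhead; `utilization` and the result labels are illustrative names.

```python
LINE_RATE_GBPS = 10.0  # assumption: nominal 10 GbE line rate

# Throughput figures copied from the iperf3 results above, in Gbps.
results_gbps = {
    "forward": 9.40,
    "reverse": 9.38,
    "bidir up": 9.37,
    "bidir down": 7.73,
}


def utilization(gbps, line_rate=LINE_RATE_GBPS):
    """Throughput as a percentage of the nominal line rate."""
    return round(100 * gbps / line_rate, 1)


for label, gbps in results_gbps.items():
    print(f"{label}: {gbps} Gbps ({utilization(gbps)}% of nominal)")
```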

(Be sure to test all interfaces, noting any that are non-functional.)

GPU

  • Cinebench 2024: 3787
  • Geekbench (Metal): 56652
  • Geekbench (OpenCL): 37773

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  30582.1 MB/s (2.4%)
 C copy backwards (32 byte blocks)                    :  30488.7 MB/s (2.7%)
 C copy backwards (64 byte blocks)                    :  30756.7 MB/s (0.8%)
 C copy                                               :  31050.0 MB/s (0.7%)
 C copy prefetched (32 bytes step)                    :  31217.9 MB/s (0.3%)
 C copy prefetched (64 bytes step)                    :  31255.8 MB/s (1.8%)
 C 2-pass copy                                        :  25266.3 MB/s (1.3%)
 C 2-pass copy prefetched (32 bytes step)             :  25340.9 MB/s (1.5%)
 C 2-pass copy prefetched (64 bytes step)             :  25332.9 MB/s (1.5%)
 C fill                                               :  45000.0 MB/s (10.3%)
 C fill (shuffle within 16 byte blocks)               :  35503.4 MB/s (2.5%)
 C fill (shuffle within 32 byte blocks)               :  37420.2 MB/s (4.0%)
 C fill (shuffle within 64 byte blocks)               :  41411.4 MB/s (6.9%)
 NEON 64x2 COPY                                       :  44108.2 MB/s (1.9%)
 NEON 64x2x4 COPY                                     :  44995.9 MB/s (3.8%)
 NEON 64x1x4_x2 COPY                                  :  43933.6 MB/s (3.5%)
 NEON 64x2 COPY prefetch x2                           :  38081.9 MB/s (4.2%)
 NEON 64x2x4 COPY prefetch x1                         :  37652.8 MB/s (0.8%)
 NEON 64x2 COPY prefetch x1                           :  38499.6 MB/s (1.7%)
 NEON 64x2x4 COPY prefetch x1                         :  36585.0 MB/s (1.6%)
 ---
 standard memcpy                                      :  44986.9 MB/s (2.0%)
 standard memset                                      :  69795.4 MB/s (1.2%)
 ---
 NEON LDP/STP copy                                    :  44254.6 MB/s (4.7%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  45326.7 MB/s (4.9%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  43931.5 MB/s (3.8%)
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  44670.6 MB/s (2.6%)
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  44082.8 MB/s (1.3%)
 NEON LD1/ST1 copy                                    :  42881.2 MB/s (1.6%)
 NEON STP fill                                        :  80754.5 MB/s (5.4%)
 NEON STNP fill                                       :  68623.4 MB/s (0.5%)
 ARM LDP/STP copy                                     :  43418.8 MB/s (0.3%)
 ARM STP fill                                         :  82462.2 MB/s (5.2%)
 ARM STNP fill                                        :  68986.8 MB/s (1.3%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.1 ns 
    131072 :    0.0 ns          /     0.0 ns 
    262144 :    2.0 ns          /     3.0 ns 
    524288 :    2.9 ns          /     3.8 ns 
   1048576 :    3.4 ns          /     4.1 ns 
   2097152 :    3.7 ns          /     4.1 ns 
   4194304 :    5.1 ns          /     5.6 ns 
   8388608 :    6.1 ns          /     6.3 ns 
  16777216 :   12.7 ns          /    17.6 ns 
  33554432 :   49.1 ns          /    71.5 ns 
  67108864 :   71.1 ns          /    91.0 ns 
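
One way to read the latency table is to look at where the single random read time climbs fastest from one block size to the next. The sketch below does that for the rows from 131072 bytes upward, copied from the output above. It only reports the size of each step and draws no conclusion about cache sizes on its own.

```python
# (block size in bytes, single random read latency in ns), copied from the table above.
rows = [
    (131072, 0.0), (262144, 2.0), (524288, 2.9), (1048576, 3.4),
    (2097152, 3.7), (4194304, 5.1), (8388608, 6.1), (16777216, 12.7),
    (33554432, 49.1), (67108864, 71.1),
]

# Increase in latency when moving from one block size to the next larger one.
steps = [
    (size, round(ns - prev_ns, 1))
    for (_, prev_ns), (size, ns) in zip(rows, rows[1:])
]

# The block size at which the largest jump in latency appears.
largest_step = max(steps, key=lambda step: step[1])

for size, delta in steps:
    print(f"{size:>9} bytes: +{delta} ns")
print(f"largest jump: +{largest_step[1]} ns at {largest_step[0]} bytes")
```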

sbc-bench results

The script doesn't run on macOS.

Phoronix Test Suite

Results from pi-general-benchmark.sh:

  • pts/encode-mp3: DNF (doesn't install on macOS)
  • pts/x264 4K: 12.82 fps
  • pts/x264 1080p: 55.53 fps
  • pts/phpbench: 1125967
  • pts/build-linux-kernel (defconfig): DNF (doesn't run on macOS)

Run inside a Docker container:

  • pts/encode-mp3: 4.250 s
  • pts/x264 4K: 25.27 fps
  • pts/x264 1080p: 108.50 fps
  • pts/phpbench: 932720
  • pts/build-linux-kernel (defconfig): 383.776 s
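
Three tests have both a native macOS result and a Docker result, so the two runs can be compared directly. The sketch below computes the Docker-to-native ratio for each. For all three, higher scores are better, so a ratio above 1 means the Docker run scored higher. The variable names are illustrative.

```python
# Results copied from the two lists above: (native macOS, Docker container).
# encode-mp3 and build-linux-kernel are omitted because they have no native result.
paired = {
    "pts/x264 4K (fps)": (12.82, 25.27),
    "pts/x264 1080p (fps)": (55.53, 108.50),
    "pts/phpbench (score)": (1125967, 932720),
}

# Docker result divided by native result, rounded to two decimals.
ratios = {
    name: round(docker / native, 2)
    for name, (native, docker) in paired.items()
}

for name, ratio in ratios.items():
    print(f"{name}: Docker / native = {ratio}")
```

The ratios describe only these runs. They do not say why the two environments differ.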

Additional Benchmarks

Ollama (LLMs)

See: https://github.com/geerlingguy/ollama-benchmark?tab=readme-ov-file#findings and geerlingguy/ai-benchmarks#2

| System | CPU/GPU | Model | Eval Rate | Power (Peak) |
| --- | --- | --- | --- | --- |
| M4 Mac mini (10 core CPU) / 32GB | GPU | llama3.2:3b | 41.31 Tokens/s | 30.1 W |
| M4 Mac mini (10 core CPU) / 32GB | GPU | llama3.1:8b | 20.95 Tokens/s | 29.4 W |
| M4 Mac mini (10 core CPU) / 32GB | GPU | llama2:13b | 13.60 Tokens/s | 29.8 W |
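
Dividing the eval rate by the power reading gives a rough efficiency figure in tokens per joule. Because the power column is a peak value rather than an average, the result is best treated as a conservative estimate. This is a sketch over the three rows above; `tokens_per_joule` is an illustrative helper, not part of the linked benchmark scripts.

```python
# (model, eval rate in tokens/s, peak power in W), copied from the table above.
runs = [
    ("llama3.2:3b", 41.31, 30.1),
    ("llama3.1:8b", 20.95, 29.4),
    ("llama2:13b", 13.60, 29.8),
]


def tokens_per_joule(tokens_per_s, watts):
    """Tokens generated per joule, using peak power as the divisor."""
    return round(tokens_per_s / watts, 2)


efficiency = {model: tokens_per_joule(rate, watts) for model, rate, watts in runs}

for model, value in efficiency.items():
    print(f"{model}: {value} tokens/J at peak power")
```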
