Raspberry Pi 400 #59

@geerlingguy

raspberry-pi-400-hero-back-ports

Basic information

NOTE: I never uploaded my initial Pi 400 test results to this repository, since the repository was only created in 2023. Over time I've been filling in more of the Pi lineup, as I like to have comparisons between all product families. This machine was therefore re-tested in late 2024, running the latest Pi OS and firmware available at that time.

Linux/system information

# output of `screenfetch`
         _,met$$$$$gg.           pi@pi400
      ,g$$$$$$$$$$$$$$$P.        OS: Debian 12 bookworm
    ,g$$P""       """Y$$.".      Kernel: aarch64 Linux 6.6.62+rpt-rpi-v8
   ,$$P'              `$$$.      Uptime: 0m
  ',$$P       ,ggs.     `$$b:    Packages: 1920
  `d$$'     ,$P"'   .    $$$     Shell: bash 5.2.15
   $$P      d$'     ,    $$P     Disk: 13G / 119G (12%)
   $$:      $$.   -    ,d$$'     CPU: ARM Cortex-A72 @ 4x 1.8GHz
   $$\;      Y$b._   _,d$P'      GPU: 
   Y$$.    `.`"Y$$$$P"'          RAM: 415MiB / 3791MiB
   `$$b      "-.__              
    `Y$$                        
     `Y$$.                      
       `$$b.                    
         `Y$$b.                 
            `"Y$b._             
                `""""   

# output of `uname -a`
Linux pi400 6.6.62+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux

Benchmark results

CPU

Power

  • Idle power draw (at wall): 2.7 W
  • Maximum simulated power draw (stress-ng --matrix 0): 5.2 W
  • During Geekbench multicore benchmark: 6.3 W
  • During top500 HPL benchmark: 6.4 W
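The load figures are easier to compare once the idle baseline is subtracted. A minimal Python sketch using only the four readings listed above; the dictionary keys and the `load_delta` helper are labels made up for this example, not names from any benchmark tool.

```python
# Wall-power readings from the list above, in watts.
# The dictionary keys are illustrative labels only.
IDLE_W = 2.7
LOAD_W = {
    "stress-ng --matrix 0": 5.2,
    "Geekbench multicore": 6.3,
    "top500 HPL": 6.4,
}

def load_delta(load_w: float, idle_w: float = IDLE_W) -> float:
    """Return the extra power drawn over idle, rounded to 0.1 W."""
    return round(load_w - idle_w, 1)

for name, watts in LOAD_W.items():
    print(f"{name}: {watts} W total, +{load_delta(watts)} W over idle")
```

All readings are at-the-wall numbers, so the deltas still include power supply losses.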

Disk

SanDisk Extreme 128 GB microSD

| Benchmark | Result |
| --- | --- |
| iozone 4K random read | 8.76 MB/s |
| iozone 4K random write | 4.39 MB/s |
| iozone 1M random read | 43.66 MB/s |
| iozone 1M random write | 34.71 MB/s |
| iozone 1M sequential read | 43.66 MB/s |
| iozone 1M sequential write | 35.50 MB/s |
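For small-block random I/O, operations per second is often a more intuitive figure than throughput. The sketch below converts the 4K results; it assumes 1 MB = 1,000,000 bytes and a 4096-byte block, neither of which is stated in the results, and the `iops` helper is a hypothetical name for this example.

```python
def iops(throughput_mb_s: float, block_bytes: int = 4096,
         mb_bytes: int = 1_000_000) -> float:
    """Convert a throughput figure into I/O operations per second.

    Assumption: `mb_bytes` is the size of one "MB" as reported; pass
    1024 * 1024 if the figures are actually MiB/s.
    """
    return throughput_mb_s * mb_bytes / block_bytes

# 4K random results from the microSD card tested above.
print(f"4K random read:  {iops(8.76):.0f} IOPS")
print(f"4K random write: {iops(4.39):.0f} IOPS")
```

If the reported unit turns out to be MiB/s, the same call with `mb_bytes=1024 * 1024` gives figures about 5% higher.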

Run the benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add the results under an additional heading.

Also consider running the PiBenchmarks.com script.

Network

iperf3 results:

  • iperf3 -c $SERVER_IP: TODO Mbps
  • iperf3 -c $SERVER_IP --reverse: TODO Mbps
  • iperf3 -c $SERVER_IP --bidir: TODO Mbps up, TODO Mbps down

(Be sure to test all interfaces, noting any that are non-functional.)

GPU

glmark2

glmark2-es2 / glmark2-es2-wayland results:

=======================================================
    glmark2 2023.01
=======================================================
    OpenGL Information
    GL_VENDOR:      Broadcom
    GL_RENDERER:    V3D 4.2
    GL_VERSION:     OpenGL ES 3.1 Mesa 23.2.1-1~bpo12+rpt3
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 1027 FrameTime: 0.974 ms
[build] use-vbo=true: FPS: 1527 FrameTime: 0.655 ms
[texture] texture-filter=nearest: FPS: 1252 FrameTime: 0.799 ms
[texture] texture-filter=linear: FPS: 1218 FrameTime: 0.821 ms
[texture] texture-filter=mipmap: FPS: 1150 FrameTime: 0.870 ms
[shading] shading=gouraud: FPS: 1171 FrameTime: 0.854 ms
[shading] shading=blinn-phong-inf: FPS: 940 FrameTime: 1.065 ms
[shading] shading=phong: FPS: 720 FrameTime: 1.389 ms
[shading] shading=cel: FPS: 687 FrameTime: 1.457 ms
[bump] bump-render=high-poly: FPS: 589 FrameTime: 1.701 ms
[bump] bump-render=normals: FPS: 1222 FrameTime: 0.819 ms
[bump] bump-render=height: FPS: 1123 FrameTime: 0.891 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 447 FrameTime: 2.241 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 222 FrameTime: 4.524 ms
[pulsar] light=false:quads=5:texture=false: FPS: 1339 FrameTime: 0.747 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 108 FrameTime: 9.300 ms
[desktop] effect=shadow:windows=4: FPS: 438 FrameTime: 2.284 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 188 FrameTime: 5.338 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 194 FrameTime: 5.176 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 233 FrameTime: 4.306 ms
[ideas] speed=duration: FPS: 914 FrameTime: 1.095 ms
[jellyfish] <default>: FPS: 421 FrameTime: 2.379 ms
[terrain] <default>: FPS: 26 FrameTime: 39.011 ms
[shadow] <default>: FPS: 110 FrameTime: 9.099 ms
[refract] <default>: FPS: 36 FrameTime: 28.205 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 1432 FrameTime: 0.699 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 700 FrameTime: 1.430 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 1331 FrameTime: 0.752 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 1036 FrameTime: 0.966 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 607 FrameTime: 1.649 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 983 FrameTime: 1.018 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 983 FrameTime: 1.018 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 604 FrameTime: 1.656 ms
=======================================================
                                  glmark2 Score: 755 
=======================================================
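To compare scenes across boards, the per-scene lines can be pulled out of the raw output. This is a rough Python sketch that assumes every result line follows the `[scene] options: FPS: n FrameTime: x ms` shape seen above; `parse_glmark2` and `SAMPLE` are hypothetical names, and the sample holds only four of the lines from this run.

```python
import re

# Matches lines such as "[build] use-vbo=false: FPS: 1027 FrameTime: 0.974 ms".
LINE_RE = re.compile(
    r"^\[(?P<scene>[\w-]+)\] (?P<options>.*): "
    r"FPS: (?P<fps>\d+) FrameTime: (?P<ms>[\d.]+) ms$"
)

def parse_glmark2(text):
    """Return a list of (scene, options, fps, frame_ms) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m["scene"], m["options"], int(m["fps"]), float(m["ms"])))
    return rows

# Four lines copied from the output above; paste the full log to parse all scenes.
SAMPLE = """\
[build] use-vbo=false: FPS: 1027 FrameTime: 0.974 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 447 FrameTime: 2.241 ms
[terrain] <default>: FPS: 26 FrameTime: 39.011 ms
[refract] <default>: FPS: 36 FrameTime: 28.205 ms
"""

rows = parse_glmark2(SAMPLE)
slowest = min(rows, key=lambda r: r[2])
print(f"{len(rows)} scenes parsed; slowest is [{slowest[0]}] at {slowest[2]} FPS")
```

The greedy `options` group is deliberate: option strings such as `light=false:quads=5:texture=false` contain colons, so the pattern anchors on the literal `: FPS:` instead of splitting on the first colon.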

GravityMark

GravityMark results:

1. Download the latest version of GravityMark: https://gravitymark.tellusim.com
2. Run `chmod +x [downloaded_file.run]`
3. Run `sudo ./[downloaded_file.run]` and press `y` to accept the terms.
4. Open the link it prints, and run the Benchmark defaults, changing to 720p resolution and 50,000 asteroids.

Note: These benchmarks require an active display on the device. Not all devices may be able to run glmark2-es2, so in that case, make a note and move on!

Ollama

ollama LLM model inference results:

| Pi Model | CPU/GPU | LLM | Rate | Power |
| --- | --- | --- | --- | --- |
| Raspberry Pi 400 - 4GB | CPU | llama3.2:3b | 1.60 Tokens/s | 6 W |

Note that Ollama will run on the CPU if no valid GPU / drivers are present. Be sure to note whether Ollama runs on the CPU, GPU, or a dedicated NPU.
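The rate and power columns can be combined into an energy-per-token figure, which makes it easier to compare boards with very different power envelopes. A small Python sketch, assuming the 6 W reading is total draw during inference rather than the delta over idle; `joules_per_token` is a hypothetical helper.

```python
def joules_per_token(power_w: float, tokens_per_s: float) -> float:
    """Energy spent per generated token (1 W = 1 J/s)."""
    return power_w / tokens_per_s

rate = 1.60   # tokens/s, llama3.2:3b on the Pi 400's CPU
power = 6.0   # W, from the Ollama results above

print(f"{joules_per_token(power, rate):.2f} J/token")
print(f"{rate / power:.3f} tokens/s per watt")
```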

TODO: See this issue for discussion about a full suite of standardized GPU benchmarks.

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :   2574.7 MB/s (2.3%)
 C copy backwards (32 byte blocks)                    :   2580.6 MB/s (0.3%)
 C copy backwards (64 byte blocks)                    :   2577.9 MB/s (0.3%)
 C copy                                               :   2442.4 MB/s (0.3%)
 C copy prefetched (32 bytes step)                    :   2557.6 MB/s (0.4%)
 C copy prefetched (64 bytes step)                    :   2554.6 MB/s (0.3%)
 C 2-pass copy                                        :   1855.1 MB/s (0.5%)
 C 2-pass copy prefetched (32 bytes step)             :   2202.1 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :   2205.1 MB/s (0.2%)
 C fill                                               :   3076.9 MB/s (1.1%)
 C fill (shuffle within 16 byte blocks)               :   3032.4 MB/s (0.6%)
 C fill (shuffle within 32 byte blocks)               :   3040.0 MB/s (0.7%)
 C fill (shuffle within 64 byte blocks)               :   3034.4 MB/s (0.5%)
 NEON 64x2 COPY                                       :   2555.8 MB/s (0.3%)
 NEON 64x2x4 COPY                                     :   2556.9 MB/s (0.3%)
 NEON 64x1x4_x2 COPY                                  :   2557.2 MB/s (0.3%)
 NEON 64x2 COPY prefetch x2                           :   2553.5 MB/s (0.2%)
 NEON 64x2x4 COPY prefetch x1                         :   2552.4 MB/s (0.2%)
 NEON 64x2 COPY prefetch x1                           :   2557.8 MB/s (0.3%)
 NEON 64x2x4 COPY prefetch x1                         :   2549.9 MB/s (0.3%)
 ---
 standard memcpy                                      :   2564.8 MB/s (0.3%)
 standard memset                                      :   3066.5 MB/s (0.9%)
 ---
 NEON LDP/STP copy                                    :   2552.3 MB/s (0.3%)
 NEON LDP/STP copy pldl2strm (32 bytes step)          :   2548.9 MB/s (0.2%)
 NEON LDP/STP copy pldl2strm (64 bytes step)          :   2547.3 MB/s (0.3%)
 NEON LDP/STP copy pldl1keep (32 bytes step)          :   2558.1 MB/s (0.2%)
 NEON LDP/STP copy pldl1keep (64 bytes step)          :   2561.2 MB/s (0.3%)
 NEON LD1/ST1 copy                                    :   2559.0 MB/s (0.3%)
 NEON STP fill                                        :   3035.2 MB/s (0.8%)
 NEON STNP fill                                       :   2862.7 MB/s (0.5%)
 ARM LDP/STP copy                                     :   2553.1 MB/s (0.3%)
 ARM STP fill                                         :   3054.3 MB/s (1.0%)
 ARM STNP fill                                        :   2870.2 MB/s (0.9%)

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON LDP/STP copy (from framebuffer)                 :    765.8 MB/s (0.3%)
 NEON LDP/STP 2-pass copy (from framebuffer)          :    654.9 MB/s (0.2%)
 NEON LD1/ST1 copy (from framebuffer)                 :    824.9 MB/s (5.5%)
 NEON LD1/ST1 2-pass copy (from framebuffer)          :    689.0 MB/s (0.2%)
 ARM LDP/STP copy (from framebuffer)                  :    551.3 MB/s
 ARM LDP/STP 2-pass copy (from framebuffer)           :    523.8 MB/s (0.2%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    1.2 ns          /     2.2 ns 
     65536 :    4.7 ns          /     7.4 ns 
    131072 :    7.2 ns          /     9.9 ns 
    262144 :   10.3 ns          /    13.1 ns 
    524288 :   11.9 ns          /    15.2 ns 
   1048576 :   29.9 ns          /    45.8 ns 
   2097152 :   82.8 ns          /   120.4 ns 
   4194304 :  111.4 ns          /   144.7 ns 
   8388608 :  132.2 ns          /   163.0 ns 
  16777216 :  142.5 ns          /   171.7 ns 
  33554432 :  148.0 ns          /   176.5 ns 
  67108864 :  155.7 ns          /   188.2 ns 
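One way to read the latency table is to look for the block sizes where latency rises most sharply, since those steps usually mark a working set spilling out of a cache level. The Python sketch below does this for the single random read column; `largest_steps` is a hypothetical helper, and the link to cache sizes is an interpretation, not something tinymembench reports.

```python
# (block size in bytes, single random read latency in ns) from the table above.
LATENCY = [
    (1024, 0.0), (2048, 0.0), (4096, 0.0), (8192, 0.0), (16384, 0.0),
    (32768, 1.2), (65536, 4.7), (131072, 7.2), (262144, 10.3),
    (524288, 11.9), (1048576, 29.9), (2097152, 82.8), (4194304, 111.4),
    (8388608, 132.2), (16777216, 142.5), (33554432, 148.0), (67108864, 155.7),
]

def largest_steps(samples, top=3):
    """Return the `top` (size, delta_ns) pairs with the biggest latency
    increase relative to the previous (half-sized) block."""
    steps = [
        (size, round(ns - prev_ns, 1))
        for (_, prev_ns), (size, ns) in zip(samples, samples[1:])
    ]
    return sorted(steps, key=lambda s: s[1], reverse=True)[:top]

for size, delta in largest_steps(LATENCY):
    print(f"{size // 1024:>6} KiB: +{delta} ns")
```

The largest step is the jump from 1 MiB to 2 MiB, which would fit a 1 MiB L2 cache; the first non-zero reading at 32 KiB would similarly fit a 32 KiB L1 data cache. Both cache sizes are assumptions about the SoC, not values taken from this output.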

sbc-bench results

Run sbc-bench and paste a link to the results here: https://0x0.st/XRdJ.bin

Phoronix Test Suite

Results from pi-general-benchmark.sh:

  • pts/encode-mp3: 24.500 sec
  • pts/x264 4K: 1.70 fps
  • pts/x264 1080p: 7.65 fps
  • pts/phpbench: 202897
  • pts/build-linux-kernel (defconfig): 6849.540 sec
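The kernel build time is easier to judge in hours and minutes than in raw seconds. A tiny Python sketch of the conversion; `hms` is a hypothetical helper name.

```python
def hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS, rounded to a whole second."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

# pts/build-linux-kernel (defconfig) result from the list above.
print(hms(6849.540))
```

For the defconfig build above this comes to just under two hours.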
