Upgrade to Nixpkgs 25.11 #303

Draft
knuton wants to merge 29 commits into dividat:main from knuton:nixpkgs-25.11

Conversation

@knuton
Member

@knuton knuton commented Jan 5, 2026

Updated definitions to resolve deprecation warnings, eval errors and build errors.

Issues

  • kiosk segfaults when started locally
  • kiosk crashes in run-in-vm because QML imports are not found
  • tests using nixos-test-script-helpers don't run due to mypy errors
  • Integration tests pass, currently failing:
  • E2E tests pass
  • Release validation tests pass: https://github.com/dividat/playos/actions/runs/25483115066
  • Add a message to all TTYs explaining how to return to the graphical session (to ensure the Ctrl+Alt+F7 change does not confuse operators): 1d81d83

Checklist

  • Changelog updated
  • Code documented
  • User manual updated
  • Live ISO boot
  • Manual install
  • Clean up git commits
  • Bloat check: compare bundle sizes (du -BM -sL $(nix-build -A components.unsignedRaucBundle))
    • RAUC bundle increased by 400M (+30%) compared to main (from 1323M to 1724M), need to investigate
    • Main 3 bloat contributors:
      1. Skeleton RAUC + deps: adds around ~70MB compressed (~240 MB uncompressed)
      2. linux-firmware increase: adds around ~150MB compressed (~500 MB uncompressed)
      3. gst-plugins-bad (new dep via pyqt6->qtmultimedia->gst-plugins-bad): adds around 50MB compressed, useless plugins (MIDI, TTS..), no way to disable without causing massive Qt rebuild...
    • The rest seems to be general package size increases.
    • The biggest potential savings are from trimming the linux-firmware image, made a separate issue to track this: Trim down pkgs.linux-firmware to fit actually used hardware #343
  • Bloat check: compare installer sizes (du -BM -sL $(nix-build -A components.installer.isoImage))
    • on main: 2252M
    • here: 3345M
    • ~49% (~1100MB) increase
    • attributable to the overall +970MB system image increase (see above) + rescueImage is now using different toplevel init due to skeleton split
    • unfortunate situation, but not much we can do here, Trim down pkgs.linux-firmware to fit actually used hardware #343 is still the best remedy
  • Read through NixOS 25.05 and 25.11 release notes: https://nixos.org/manual/nixos/stable/release-notes#sec-release-25.11
    • read through manually and also asked Claude to review, nothing beyond what is already changed/fixed.
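
The percentages quoted in the two bloat checks can be re-derived from the raw `du` figures. A minimal sketch, assuming the sizes are the megabyte values printed by `du -BM`; the helper name `size_increase` is made up for illustration and is not part of the repository:

```python
def size_increase(before_mb: int, after_mb: int) -> tuple[int, float]:
    """Return the absolute (MB) and relative (%) growth between two sizes."""
    delta = after_mb - before_mb
    return delta, 100.0 * delta / before_mb


# Figures taken from the checklist above.
rauc_delta, rauc_pct = size_increase(1323, 1724)
iso_delta, iso_pct = size_increase(2252, 3345)

print(f"RAUC bundle:   +{rauc_delta}M ({rauc_pct:.1f}%)")
print(f"Installer ISO: +{iso_delta}M ({iso_pct:.1f}%)")
```

This only reproduces the headline numbers; it says nothing about which packages contribute to the growth.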

@knuton
Member Author

knuton commented Jan 14, 2026

Closing for now, will revisit in future cycle with GRUB skeleton pinned.

@knuton knuton closed this Jan 14, 2026
@yfyf yfyf reopened this Apr 27, 2026
@yfyf
Collaborator

yfyf commented Apr 27, 2026

Re-based on current main

@yfyf
Collaborator

yfyf commented Apr 30, 2026

kiosk segfaults when started locally

@knuton does kiosk still segfault from nix-shell for you at 3e9a11d? It segfaults in run-in-vm and I believe it did segfault for me on first run in nix-shell, but now it doesn't anymore, so I can't reproduce locally at least. The run-in-vm segfault seems to relate to graphics drivers, e.g. I can get it to work by changing to -vga std.

I suggest to test with kioskUrl set to something other than play.dividat.com, because there seem to be other issues that might be making QtWebEngine crash/hang specific to it.

@yfyf
Collaborator

yfyf commented Apr 30, 2026

Rebased on top of main.

@yfyf
Collaborator

yfyf commented May 5, 2026

I suggest to test with kioskUrl set to something other than play.dividat.com, because there seem to be other issues that might be making QtWebEngine crash/hang specific to it.

Fixed an issue with kiosk going into an infinite reload in 509788d. Strangely, this seems to also get rid of the segfault for me at least. Nevertheless, I also disabled Vulkan in 8513fc4, because it gets rid of some related errors.

@yfyf
Collaborator

yfyf commented May 7, 2026

The automated tests now pass, remaining efforts should focus on manual testing / verification.
I am parking this to focus on other things for now.

In the meantime, @knuton @dividat-jgu could you check if the following work for you on this branch:

  • ./build vm
  • ./result/bin/run-in-vm
  • ./result/bin/run-in-vm --opengl
  • cd kiosk; nix-shell; ./bin/kiosk-browser https://play.dividat.com http://localhost:3333
  • same as above, but before starting, do rm -rf ~/.cache/kiosk-browser/ ~/.local/share/kiosk-browser/

@dividat-jgu
Contributor

In the meantime, @knuton @dividat-jgu could you check if the following work for you on this branch:

All good!

Notes: I had to use nixGL to run the kiosk, and also to run the vm with --opengl. With --opengl, the output was a bit blurry, but I already had that before.

@knuton
Member Author

knuton commented May 7, 2026

  • ./build vm
  • ./result/bin/run-in-vm
  • ./result/bin/run-in-vm --opengl
  • cd kiosk; nix-shell; ./bin/kiosk-browser https://play.dividat.com http://localhost:3333
  • same as above, but before starting, do rm -rf ~/.cache/kiosk-browser/ ~/.local/share/kiosk-browser/

@yfyf yfyf force-pushed the nixpkgs-25.11 branch 3 times, most recently from 3e43b4f to bfb9dad Compare May 12, 2026 14:41
knuton and others added 14 commits May 13, 2026 13:12
nixpkgs 25.11 deprecates substituteAll, however replaceVars has slightly
different behaviour.  Using `substituteAll`, the substitute values were
implicitly coerced to store paths. `replaceVars` doesn't do that and it
causes local paths to appear (at least) in tests, e.g. in e2e-tests:

  Traceback (most recent call last):
    File "/nix/store/qkjvh813d6kjw4clp5kh2fgp3gkb2l8y-install-playos-2026.3.0-TEST/bin/.install-playos-wrapped", line 497, in <module>
      _main(parser.parse_args())
    File "/nix/store/qkjvh813d6kjw4clp5kh2fgp3gkb2l8y-install-playos-2026.3.0-TEST/bin/.install-playos-wrapped", line 438, in _main
      install_bootloader(disk, machine_id)
    File "/nix/store/qkjvh813d6kjw4clp5kh2fgp3gkb2l8y-install-playos-2026.3.0-TEST/bin/.install-playos-wrapped", line 215, in install_bootloader
      shutil.copy2(GRUB_CFG, '/mnt/boot/grub/grub.cfg')
    File "/nix/store/kjgslpdqchx1sm7a5h9xibi5rrqcqfnl-python3-3.12.8/lib/python3.12/shutil.py", line 475, in copy2
      copyfile(src, dst, follow_symlinks=follow_symlinks)
    File "/nix/store/kjgslpdqchx1sm7a5h9xibi5rrqcqfnl-python3-3.12.8/lib/python3.12/shutil.py", line 260, in copyfile
      with open(src, 'rb') as fsrc:
           ^^^^^^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: '/home/yfyf/src/playos2/skeleton/bootloader/grub.cfg'
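
One way to surface this class of mistake earlier is a sanity check in the consuming script, so that a substituted value that is not a store path fails with a clear message instead of a late `FileNotFoundError`. A minimal sketch, not taken from the PR: the helper names are hypothetical, it assumes the default `/nix/store` location, and the check is purely lexical (it does not resolve `..` or symlinks).

```python
from pathlib import PurePosixPath

# Assumption: the Nix store lives at its default location.
NIX_STORE = PurePosixPath("/nix/store")


def is_store_path(path: str) -> bool:
    """Return True if `path` lexically points somewhere below /nix/store."""
    p = PurePosixPath(path)
    return p.is_absolute() and NIX_STORE in p.parents


def require_store_path(path: str) -> str:
    """Raise early if a substituted path did not end up in the store."""
    if not is_store_path(path):
        raise ValueError(f"expected a /nix/store path, got: {path}")
    return path
```

With the path from the traceback above, `require_store_path('/home/yfyf/src/playos2/skeleton/bootloader/grub.cfg')` would raise a `ValueError` at startup rather than during bootloader installation.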
In earlier NixOS versions, connman would silently "win" /etc/resolv.conf
management over resolvconf. Since openresolv has been updated,
it fails with an error when it detects that /etc/resolv.conf is not
managed by it, which makes `network-setup.service` fail.

This does not seem to break anything, but produces noise and confusion
in system logs.

Since we have been using connman to manage /etc/resolv.conf, keeping
things "as is" and simply disabling resolvconf. This has the side-effect
of disabling `network-setup.service` as well, which is desirable.
Since NixOS 25.05, `multi-user.target` no longer depends on
`network-online.target`, see:

        NixOS/nixpkgs@2370696

Moreover, when using connman, nothing starts the
`network-online.target` (see the message of the previous commit, 0ec53e4).

Transitively, depending on the setup, this means the `network.target`
might never be activated at all, which is what actually happens in the
controller-proxy test.
More NixOS networking misconfiguration for connman nonsense.

This fixes the testing/integration/controller-interface-labeling.nix
test, but also potential real mDNS issues down the line.
The `dontWrapQtApps = true` is recommended by Qt packaging docs for PyQt
applications and the makeWrapperArgs approach is what many PyQt packages
seem to use in nixpkgs.
We do not use Nvidia cards at the moment and without this, at least in
the VM we get confusing errors in logs such as:

    radv/amdgpu: failed to initialize device.
    MESA: info: could not get caps: Function not implemented
    MESA: info: could not get caps: Function not implemented
    MESA: error: vdrm_device_connect failed
    radv/amdgpu: failed to initialize device.
    MESA: info: could not get caps: Function not implemented
    MESA: error: vdrm_device_connect failed
    radv/amdgpu: failed to initialize device.
    MESA: info: could not get caps: Function not implemented
    MESA: error: vdrm_device_connect failed
    radv/amdgpu: failed to initialize device.
In earlier Qt / QtWebEngine versions, the loadFinished signal arrived
before the next play:beforeunload event, which would clear the
`_is_full_reload` variable.

After the nixpkgs bump, the events look like:

    load page -> SW triggers play:beforeunload ->
        start full_reload -> SW triggers play:beforeunload -> loop

To prevent this, we do not re-enter the full_reload if one is already in
progress.
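
The guard described above can be sketched without any Qt dependency. This is an illustration under stated assumptions, not the kiosk's actual code: only `_is_full_reload` is a name from the commit message, while `ReloadGuard`, `on_before_unload`, `on_load_finished` and the `reload_page` callback are hypothetical stand-ins for the real signal handlers.

```python
from typing import Callable


class ReloadGuard:
    """Ignore repeated reload requests while a full reload is in progress."""

    def __init__(self, reload_page: Callable[[], None]) -> None:
        self._reload_page = reload_page
        self._is_full_reload = False

    def on_before_unload(self) -> bool:
        """Handle a play:beforeunload event; return True if a reload was started."""
        if self._is_full_reload:
            # A reload is already running: do not re-enter, this breaks the loop.
            return False
        self._is_full_reload = True
        self._reload_page()
        return True

    def on_load_finished(self) -> None:
        """Handle loadFinished: the reload is done, allow the next one."""
        self._is_full_reload = False
```

The point of the design is that the flag is cleared only by the load-finished handler, so the relative order of the two events no longer matters.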
yfyf added 15 commits May 13, 2026 13:25
From the NixOS 25.11 release notes:

> NixOS display manager modules now strictly use tty1, where many of
> them previously used tty7. Options to configure display managers’ VT
> have been dropped.

This was breaking release-validation tests, since they expect to find a
terminal on TTY1 for triggering the reboot. Using TTY3 should be
backwards compatible with previous releases, since TTY1-6 were all
terminals. Under some distro configurations, TTY2 is sometimes also
used for "active graphical session", so jumping straight to TTY3.

Due to this, switching to TTY7 should no longer be allowed, since it no
longer holds the graphical session.
…sion

This is mostly to help existing users learn about the new shortcut if
they have been using Ctrl+Alt+{F7,F8} to switch between the status
screen and graphical interface for whatever reason.

Follow-up to 7756649
At first I thought it was a problem local to
nixos-test-script-helpers.py, so tried to fix it by making sure it
typechecks standalone by making it a separate package, but turns out the
issue is elsewhere...

When typechecking, `runNixOSTest` prepends a silly piece of code to the
test script that defines various global types and variables. It added
`t: TestCase` which conflicts with the `TestCase` we define. The only
way to avoid the conflict is to either rename or always import it using
a qualified name.  Since we use it everywhere, it is nicer without the
qualified name, so renaming.

Closes: dividat#298
There is testers.nixosTest, but it is in the process of being deprecated:
NixOS/nixpkgs#293891
This seems to have been always useless, since NixOS does not actually
set up a `connman-wait-online.service` and the `network-online.target`
was activated by `multi-user.target` implicitly, without any relation to
an actual network "online" status.

Recent NixOS versions have removed the dependency on
network-online.target from multi-user.target, which means that (with
connman as the network manager) it never gets activated, which causes
the TestPrecondition to fail.
New static-web-server version prevents serving files that are symlinks
which resolve to paths outside of the webroot.
The new log message is:

    PlayOS network watchdog skipped, unmet condition check ConditionPathExists=!/home/play/.config/playos-network-watchdog/disabled
The NixOS Stage 2 thing seems to be gone, but it is not relevant.
This vastly increases the test disk size, but seems to be the only way
that works for making the kiosk / QtWebEngine work without crashing /
errors / no display.

Tried disabling graphics acceleration / GPU rendering with all of the
following:

              export QTWEBENGINE_DISABLE_GPU="1"
              export LIBGL_ALWAYS_SOFTWARE="1"
              export QTWEBENGINE_CHROMIUM_FLAGS="--disable-gpu"
              export QSG_RHI_BACKEND="software"

but kiosk/Qt still either crashes or renders nothing.

The positive aspect is that this makes the e2e setup closer to "real
world".
static-web-server made symlinks outside the web root illegal (see previous
commits) in the recent version. For tests using UpdateServer, it is
preferable to avoid copying the bundles, since they are quite large and
tests running in /tmp already occasionally run out of memory.

So instead we switch to a different static web server, which cares less
about security :P
In CI, this sometimes fails at the precondition stage,
while running:

        playos.succeed("curl ${primaryCheckUrl}")

which seems to hang forever, indicating incorrect NAT / port forwarding.

Cannot reproduce locally.

watchdog logs also indicate the primaryCheckUrl is not reachable, while
secondaryCheckUrl is reachable:

    2026-05-07T07:53:19.2454857Z playos # [   16.693981] playos-network-watchdog[661]: DEBUG:watchdog:URL check for http://10.0.2.88:13939/check succeeded!
    2026-05-07T07:53:20.4631618Z playos # [   17.911811] playos-network-watchdog[661]: DEBUG:watchdog:URL check for http://10.0.2.88:13838/check failed: HTTPConnectionPool(host='10.0.2.88', port=13838): Read timed out. (read timeout=0.2)
    2026-05-07T07:53:20.4753877Z playos # [   17.923982] playos-network-watchdog[661]: DEBUG:watchdog:URL check for http://10.0.2.88:13939/check succeeded!
    2026-05-07T07:53:21.7012434Z playos # [   19.149858] playos-network-watchdog[661]: DEBUG:watchdog:URL check for http://10.0.2.88:13838/check failed: HTTPConnectionPool(host='10.0.2.88', port=13838): Read timed out. (read timeout=0.2)
    2026-05-07T07:53:21.7125924Z playos # [   19.162291] playos-network-watchdog[661]: DEBUG:watchdog:URL check for http://10.0.2.88:13939/check succeeded!
    2026-05-07T07:53:22.9341384Z playos # [   20.380213] playos-network-watchdog[661]: DEBUG:watchdog:URL check for http://10.0.2.88:13838/check failed: HTTPConnectionPool(host='10.0.2.88', port=13838): Read timed out. (read timeout=0.2)
    ...

This is very confusing, since the two `HTTPStubServer`s are started
identically and also have identical port forwarding setup.
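
When comparing the two check URLs across a long CI log, a per-URL tally is easier to read than the raw lines. A minimal sketch, assuming only the `URL check for <url> succeeded!` / `failed: ...` message format visible in the excerpt above; the function name `summarize_checks` is made up for illustration:

```python
import re
from collections import Counter

# Matches the watchdog debug lines shown above, e.g.
# "... DEBUG:watchdog:URL check for http://host:port/check succeeded!"
CHECK_RE = re.compile(r"URL check for (?P<url>\S+) (?P<result>succeeded|failed)")


def summarize_checks(lines):
    """Count succeeded/failed checks per URL in an iterable of log lines."""
    counts = {}
    for line in lines:
        match = CHECK_RE.search(line)
        if match:
            counts.setdefault(match["url"], Counter())[match["result"]] += 1
    return counts
```

For example, `summarize_checks(open("ci.log"))` (with `ci.log` being a hypothetical saved log file) returns a dictionary mapping each checked URL to its success and failure counts.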
If the networking setup is broken (e.g. NAT failing or QEMU's slirp being
only partially set up), a client connection might be kept open indefinitely,
since the client never finishes the read or closes the connection.

This is mostly to avoid "strange" hangs and make the failures more
explicit.

The ThreadingHTTPServer seems to be sufficient to fix
network-watchdog integration test flakiness.
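
The combination of a threaded server and a per-connection timeout can be sketched with the standard library alone. This is not the repository's `HTTPStubServer`, whose internals are not shown here: `CheckHandler`, `start_stub_server`, the `OK` response body and the 2-second timeout are assumptions chosen for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CheckHandler(BaseHTTPRequestHandler):
    # Socket timeout (seconds) applied to each connection, so a client that
    # never sends a request or never closes cannot hold a thread forever.
    timeout = 2.0

    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep test output quiet.
        pass


def start_stub_server(port: int = 0) -> ThreadingHTTPServer:
    """Start the stub in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), CheckHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because each connection is handled in its own thread, one stalled client no longer blocks other requests, and the handler timeout bounds how long a stalled connection can linger.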
The initial multiple rfkill switches alone take ~20 seconds on CI,
before anything even "happens".

Also increase the timeout further for good measure.
@yfyf yfyf force-pushed the nixpkgs-25.11 branch from 3fb4bc0 to 6ac7283 Compare May 13, 2026 10:26
@yfyf
Collaborator

yfyf commented May 13, 2026

@knuton I believe this is now at a stage where we can move into manual testing.

I think it would be a good idea to perform the standard PlayOS pre-release manual tests. Do you maybe want to take over this part, since you have all the peripheral hardware? 😊

Alternatively, we can go through code review and then do the manual tests.

@knuton
Member Author

knuton commented May 13, 2026

I think it would be a good idea to perform the standard PlayOS pre-release manual tests. Do you maybe want to take over this part, since you have all the peripheral hardware? 😊

Sure

Alternatively, we can go through code review and then do the manual tests.

No, this is the way.
