steamcmd: Terminating m_ThreadClient, likely to crash down the line #8083
Comments
A user of my valheim Docker container (which uses steamcmd to download valheim dedicated server from Steam) reported that the container now freezes his entire Synology NAS running Linux Kernel He is getting the following message in his steamcmd stderr log:
His syslog stdout captures a very similar output:
The container is using Debian stable as the base Linux system with the following steamcmd and Valheim specific packages installed in it:
|
So, it seems this is Debian 11 related. When I downgraded the Docker container base from debian:stable to debian:buster (Debian 10) steamcmd behaves again and no longer throws any of the errors above. So maybe steamcmd doesn't play nice with the Debian 11 libraries or it is a combination of Debian 11 libs with older host OS kernels? |
Can you post / send me the minidump files from those logs (or new ones for the problem, doesn't matter)? |
I can ask the reporter of the issue to download the dmp for me. Might be challenging as this happened on their Synology NAS and I don't know how comfortable they are with the CLI. |
@TTimo in lloesche/valheim-server-docker#401 (comment) at the end of their comment the user did copy/paste the dmp file (they |
Yeah I'd need the file or a crash id of a successful upload to our servers - I can't retrieve the binary bits from a copy paste to ascii :) |
Apologies for the delay, I'm a bit out of my element and Synology via SSH is pretty barebones. I couldn't run
It doesn't seem like the dmp file suffered from Linux -> Windows extra newline characters but please don't be surprised if it did.
crash_20210920204020_5.zip (GitHub didn't let me upload as *.dmp)
Rough steps I've followed (if we need to run it again in future or you think I mangled the files)
|
For reference, the attached minidump is a DUMP_REQUESTED in crashhandler.so. |
Yeah, DUMP_REQUESTED .. that's an assertion failure (the crash_*.dmp name is misleading, known Steam bug but unrelated here). According to the log, the failure looks immediate after startup, and the .dmp indicates it is an init failure from a worker thread of the HTTP client. I think after that steamcmd just decides to cleanly exit. I don't think Fedora and docker version have anything to do with this, do they? Should be reproducible simply with debian 10 vs debian 11 as the base container image. |
@TTimo so the reason why I thought the host system kernel is playing a role here is that not every user of my container has that issue. The container has 16 million downloads on Dockerhub so is in use by quite a few people. However only a small number of reports came in after the container auto-upgraded from Debian 10 to 11 via the
Also the failure behavior was different across users. Some, like the Synology users, had their entire Synology NAS freeze when running the updated container. Others like @tomekduda would see steamcmd crash and create a dmp file. I myself, on my Fedora 30 host system with Linux 5.6.11, would be able to execute steamcmd and download Valheim server but then get the
On a Fedora 33 host system with Linux 5.11.7 I did not see ANY issues whatsoever. steamcmd worked just fine, downloaded Valheim server and was able to run
This was all with the same Debian 11 based container just on different host systems with different host kernels. I agree the host system distribution shouldn't make any difference, which is why I suspect it's a combination of Kernel version with Debian 11 libc or maybe SDL that's causing the issues here. After downgrading to Debian 10 all users reported that the issues went away. So that's a good workaround for us right now. |
Yeah - Debian 11's libc/threading having a backwards compatibility bug with 5.6 kernels is a strong possibility. If that's the case it's probably affecting software other than steamcmd too so maybe the debian bug tracker will have some things. |
What kernel is the Synology using? In principle Debian 11 user-space is meant to work on at least kernels >= Debian 10, meaning 4.19.x (otherwise we wouldn't be able to upgrade Debian 10 systems to Debian 11 in-place), but there might be something more subtle happening here. @TTimo, if you're able to get a backtrace with a failing or hanging glibc call, that would give me better search terms? |
This post helps? https://old.reddit.com/r/synology/comments/cn9qnd/what_distribution_of_linux_is_synology_using/ They call it DSM (Disk Station Manager). My hardware is from 2016 so I'm sitting on v 6.2. Newer devices sit on 7.0. From SSH (I'm an "admin" according to Synology web admin UI but I'm not sitting on root account) I don't see mention of Debian anywhere. "toster" is the NAS's name so "Linux toster" won't mean anything to anybody, don't bother Googling it.
|
Thanks, 3.10.105 is the version I was looking for. |
glibc on Debian appears to be set up to have a minimum kernel version of 3.2, so Debian user-space should work on older kernels all the way down to 3.2, and if it doesn't then that's likely to be considered to be a bug. However, a reproducer consisting of "run Steam" is not exactly minimal, so even if it's a Debian 11 bug, reporting that bug is probably not helpful yet. |
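As a rough illustration of the version floor mentioned above, here is a minimal Python sketch that compares a kernel release string against a minimum version. The helper names `parse_kernel_release` and `meets_minimum` are hypothetical, and the default floor of 3.2 is simply the figure quoted in the comment above, not something read from glibc itself.

```python
import re


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a release string such as
    '5.11.7-200.fc33.x86_64'. A missing patch level is treated as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(release, minimum=(3, 2, 0)):
    """Return True if the release is at or above the assumed floor.
    The default (3, 2, 0) is the figure quoted in this thread."""
    return parse_kernel_release(release) >= minimum
```

Passing `platform.release()` from the standard library checks the running host. Note that the Synology kernel reported in this thread, 3.10.105, passes such a check, so a version comparison alone would not have predicted the failures seen there.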
@TTimo, is it possible to find out from the crash-dump what this thread is for and how it was created? One possibility is that something involved in thread-creation might have become more strict in the version of glibc used in Debian 11, compared with the version used in Debian 10. Another possibility is that "the other thread has initialized" is being communicated to the main thread in a non-thread-safe way?
In principle this shouldn't be possible for an unprivileged process to achieve, and it is likely to be considered to be a kernel bug (maybe even a security vulnerability) if it can carry out that denial-of-service - but kernel 3.10 is pretty old and is no longer maintained upstream, not even as a LTS kernel.
Am I correct to think that the proprietary code in
Do we have evidence that the original issue you reported here, which seems to be about terminating a thread, has the same root cause as the issue @tomekduda reported, which seems to be about starting a thread?
Does this mean it's using |
It's an HTTP worker thread, nothing special about its creation, Steam uses the worker thread pattern a lot. There is not much evidence that the two problems have the same root cause (thread failing to start and locking up DSM installs, vs thread failing to terminate at exit), except that they both appeared when the container base image changed from 10 to 11. The thread startup failure seems like a more tractable problem that has less chance of being caused by a problem in steamcmd. |
This might be a long shot, but curious to know whether you've tried installing This missing dependency has usually been the culprit with newer linux distros, and considering it's an HTTP request, it would make sense. |
Steamcmd in docker on Synology crashes the Docker and also Synology NAS hard-freezes, has to be rebooted by power button. Even "sudo killall -KILL dockerd" sometimes does not help. I tried multiple docker images, result is the same.
|
Any news on this for Synology DSM 7? I also cannot download steamcmd.
|
It seems that steamcmd is failing during start up to create threads because clock_gettime64 fails to return:
And when trying to run and switch between threads, as inspected by passing a PID of a running steamcmd at 99% CPU to strace the process is just thrashing on clock_gettime64 calls:
I can trivially reproduce this EINVAL failure with:
when building with |
After having done a fair bit more investigating and playing with this issue, it sure seems like the problem is that Synology added system calls under numbers that mainline Linux later assigned to other things (403, for example, is clock_gettime64 in Linux mainline while it is SYNOArchiveBit in Synology's 3.10.77 kernel). It seems that the presence of the similarly numbered syscall causes the issue in glibc >= 2.31 and the random musl I tried, even when the entry for the syscall isn't in the vDSO. I don't expect it should have to be in the vDSO, but this glibc commit made me wonder if it should test both for presence in the vDSO and availability in the running kernel ... I imagine a workaround would require LD_PRELOADing a definition of clock_gettime that forces the plain clock_gettime syscall path instead of trying the vDSO/clock_gettime64/clock_gettime flow. Moving to an older glibc would also fix the steamcmd side of things; doing so for only 32-bit applications feels a bit ugly but would probably serve the need here. |
I think we have two separate things going on in this issue report:
The rest of this comment refers only to the problems seen on Synology systems. The short version is that I think these will have to be "won't fix" from Steam's point of view.
Sorry, what you have there is not Linux, but instead Synology's incompatible fork of Linux. System call numbers are part of the Linux ABI, and derivatives that use a previously-unused syscall number for their own purposes are no longer suitable for running arbitrary Linux programs. The (only) feature-discovery mechanism for the Linux system call interface is that user-space invokes the system call that it would prefer to use, to see whether it works. If it fails with
There is no requirement for a syscall to be in the vDSO, and the majority of syscalls are not (in practice the vDSO only contains a few of the most time-sensitive syscalls). If a syscall exists in the vDSO, then user-space can choose whether to invoke it via the vDSO or the ordinary syscall mechanism. If a syscall does not exist in the vDSO, then user-space is expected to invoke it via the ordinary syscall mechanism (potentially with a fallback on
steamcmd does not ship with its own glibc (it can't, even if we wanted it to, because private symbols in glibc are tightly coupled to the corresponding runtime linker
This doesn't seem like it is really a Steam issue at all: the issue is that Synology's kernel is not suitable for running containers based on a newer version of glibc (for example Debian >= 11), or a newer version of musl. If Synology advertises the ability to run arbitrary Linux containers as a selling point for their systems, then I would suggest that Synology users should report their kernel's incompatibility with modern glibc versions to whatever technical support contact they provide, because this is going to affect any modern container that you want to run (not just Steam).
The oldest Linux kernel with security support on kernel.org is currently 4.9, so a kernel based on 3.10.105 is likely to have multiple unfixed security vulnerabilities. The 3.10.x branch started in 2013, was an LTS branch maintained for several years, but reached end-of-life in 2017. Again, this is something for Synology to address.
@lloesche, if you're the maintainer of this Valheim-dedicated-server container, you will have to decide which is more important to you: being able to run on Synology's fork of Linux, or using a modern version of Debian. If the ability to run your container on Synology is important to you, you will have to either stick to Debian 10, or have a Debian-11-based container for standard Linux systems and a separate Debian-10-based container for Synology systems. Conversely, if you want to require Debian 11, then you might want to document your container as being unsuitable for use with Synology's fork of Linux. |
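The feature-discovery flow described in this comment can be modeled in a few lines of Python. This is a toy model under stated assumptions, not glibc's implementation: `FakeKernel`, `get_time`, the three example kernels and the legacy syscall number 265 are all illustrative. 403 is the number cited earlier in this thread, ENOSYS is assumed to be the "not implemented" signal, and EINVAL is the error reported earlier in the thread.

```python
import errno
import os

CLOCK_GETTIME64 = 403        # number cited in this thread for mainline Linux
LEGACY_CLOCK_GETTIME = 265   # illustrative number for the older syscall


class FakeKernel:
    """Toy kernel: maps syscall numbers to handlers (hypothetical model)."""

    def __init__(self, table):
        self.table = table

    def syscall(self, number):
        handler = self.table.get(number)
        if handler is None:
            return -errno.ENOSYS  # unknown number: "not implemented"
        return handler()


def get_time(kernel):
    """Try the preferred syscall; fall back only when it is not implemented."""
    result = kernel.syscall(CLOCK_GETTIME64)
    if result == -errno.ENOSYS:
        result = kernel.syscall(LEGACY_CLOCK_GETTIME)
    if result < 0:
        # Any other error looks like a real failure of the preferred call.
        raise OSError(-result, os.strerror(-result))
    return result


# Old mainline kernel: 403 is unassigned, so the fallback is taken.
old_mainline = FakeKernel({LEGACY_CLOCK_GETTIME: lambda: 1000})
# New mainline kernel: 403 really is clock_gettime64.
new_mainline = FakeKernel({CLOCK_GETTIME64: lambda: 2000,
                           LEGACY_CLOCK_GETTIME: lambda: 1000})
# Forked kernel: 403 was reused for something else and rejects the arguments.
repurposed = FakeKernel({CLOCK_GETTIME64: lambda: -errno.EINVAL,
                         LEGACY_CLOCK_GETTIME: lambda: 1000})
```

In this model `get_time(old_mainline)` and `get_time(new_mainline)` both succeed, while `get_time(repurposed)` raises `OSError` with `EINVAL`: because the reused number does not answer with the "not implemented" signal, the caller never reaches the fallback even though the legacy syscall would have worked.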
Ack. Makes complete sense. The non-Synology problem was encountered again in lloesche/valheim-server-docker #531 and hopefully more information and a simple steamcmd invocation is forthcoming. As for the rest of the manifesto ...
You are preaching to the choir. I was trying to find a way forward given a sea of constraints and understand the complex interaction of the various pieces for which I am not the originator.
I do understand this. The commit I referenced, as I read it, seems to suggest the only time clock_gettime64 would be available would be in vDSO-supporting kernels. Furthermore, if those kernels support the vDSO, they also include clock_gettime64 in that vDSO (starting in v3.15). Such a test would have also avoided the Synology problem, but may not have been appropriate. I don't have the context to say for sure, and didn't really get much of an answer on the libc-help mailing list.
Of course. This sentence was from a paragraph dedicated to workarounds in the Docker image environment used to reproduce the Synology problem. However, musl supports static linking, and certainly presents an opportunity to ship your own libc.
💯
And you can see the solution here if you like. But it's pretty ugly. I certainly don't feel "good" about it insofar as execution environment cleanliness is concerned. But, given the constraints, I didn't see a much better way. Proof of the pudding being in the eating and all. |
Your system information
Please describe your issue in as much detail as possible:
Whenever steamcmd.sh quit is executed the following error is thrown:
This only happens on the Fedora 30 system. I tried on a Fedora 33 system with Kernel (5.11.7-200.fc33.x86_64) and there the issue doesn't present itself. Previous versions of steamcmd worked just fine on Fedora 30.
Steps for reproducing this issue: