Crash when using linux+asan+static-swift-stdlib+swift 5.8 #1118

Closed
mannuch opened this issue Apr 24, 2023 · 4 comments
Labels: 0 - new (Not sure yet if we should work on it or not), asan

mannuch commented Apr 24, 2023

Hello!

I ran into an issue when running my service in a release configuration on Linux via Docker.

After some digging, I believe I've isolated the issue to when the cluster is initialized. I have a reproduction of the issue with a simple main.swift:

import DistributedCluster

let clusterSystem = await ClusterSystem("TestRunCluster")
try await Task.sleep(for: .seconds(5))
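
For completeness, a package manifest for this reproduction might look roughly like the sketch below. The dependency versions, and the swift-backtrace package (assumed to be what prints the "Received signal 11" backtrace further down), are assumptions rather than copied from the linked repo:

// swift-tools-version:5.8
// Package.swift -- sketch only; versions and the swift-backtrace dependency are assumed
import PackageDescription

let package = Package(
    name: "CrashingCluster",
    dependencies: [
        // Provides the DistributedCluster module imported in main.swift
        .package(url: "https://github.com/apple/swift-distributed-actors.git", from: "1.0.0-beta.3"),
        // Prints a backtrace when the process receives a fatal signal on Linux
        .package(url: "https://github.com/swift-server/swift-backtrace.git", from: "1.3.0"),
    ],
    targets: [
        .executableTarget(
            name: "CrashingCluster",
            dependencies: [
                .product(name: "DistributedCluster", package: "swift-distributed-actors"),
                .product(name: "Backtrace", package: "swift-backtrace"),
            ]
        )
    ]
)

With swift-backtrace in place, main.swift would typically call Backtrace.install() before the cluster system is created.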

When running with Backtrace installed, I get the following:

Received signal 11. Backtrace:
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaad5a3b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaad5a3979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaad5a3971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Binding to: [sact://TestRunCluster@127.0.0.1:7337]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster/leadership cluster/node=sact://TestRunCluster@127.0.0.1:7337 leadership/election=DistributedCluster.Leadership.LowestReachableMember [DistributedCluster] Not enough members [1/2] to run election, members: [Member(sact://TestRunCluster:2481186327279040895@127.0.0.1:7337, status: joining, reachability: reachable)]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Bound to [IPv4]127.0.0.1/127.0.0.1:7337

Since the backtrace only showed that signal 11 was received, I tried AddressSanitizer to see if I could get more information, which produced the following:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==1==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0xffff819de570 sp 0xffff819de560 T3)
==1==Hint: pc points to the zero page.
==1==The signal is caused by a READ memory access.
==1==Hint: address points to the zero page.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
    #0 0x0  (<unknown module>)
    #1 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #2 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #3 0xaaaac2c2008c  (/CrashingCluster+0x1e4008c)
    #4 0xaaaac2c1fdf4  (/CrashingCluster+0x1e3fdf4)
    #5 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #6 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #7 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

AddressSanitizer can not provide additional info.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaac374b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
Thread T3 created by T1 here:
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c2b694  (/CrashingCluster+0x1e4b694)
    #3 0xaaaac2c24c04  (/CrashingCluster+0x1e44c04)
    #4 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #5 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #6 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

Thread T1 created by T0 here:
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaac374979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c634cc  (/CrashingCluster+0x1e834cc)
    #3 0xaaaac2c6293c  (/CrashingCluster+0x1e8293c)
    #4 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #5 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #6 0xaaaac18ce5b4  (/CrashingCluster+0xaee5b4)
    #7 0xffff85f273f8  (/lib/aarch64-linux-gnu/libc.so.6+0x273f8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #8 0xffff85f274c8  (/lib/aarch64-linux-gnu/libc.so.6+0x274c8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #9 0xaaaac143efac  (/CrashingCluster+0x65efac)

2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaac374971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
==1==ABORTING
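
A note on the unsymbolicated frames above (an aside, not part of the original report): since the binary is built as a position-independent executable, the module offsets ASan prints can usually be resolved after the fact with llvm-symbolizer against the build artifact, for example:

# Symbolicate the crashing frames using the offsets ASan printed above;
# requires LLVM tools on the machine that holds the build output.
llvm-symbolizer --obj=.build/release/CrashingCluster 0x1e82014 0x1e82754 0x1e4008c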

As far as I can tell, the problem only arises when running on Linux with this Dockerfile:

# ================================
# Build image
# ================================
FROM swift:5.8-jammy as builder

RUN mkdir /workspace
WORKDIR /workspace

COPY . /workspace

RUN swift build --sanitize=address -c release -Xswiftc -g --static-swift-stdlib

# ================================
# Run image
# ================================
FROM ubuntu:jammy

COPY --from=builder /workspace/.build/release/CrashingCluster /

EXPOSE 7337

ENTRYPOINT ["./CrashingCluster"]

This reproduction, along with the Dockerfile, can be found in this repo, if it helps.

Thanks for all the work on this!

mannuch (Author) commented Apr 24, 2023

However, when running with bind mounts to the local filesystem via

docker run -v "$PWD:/code" -w /code swift:latest swift run -c release

my original application, as well as the reproducer linked above, appears to work.

ktoso (Member) commented Apr 25, 2023

Thanks for the bug report!

We continued looking into this and strongly suspect that this is a bug in Swift 5.8 when --static-swift-stdlib is combined with ASan (AddressSanitizer).

I'll quadruple check some more but that's our strong suspicion so far.

It also does not reproduce on Swift 5.9, and we suspect this might be the fix for it: apple/swift#65254

ktoso added the "bug 🐞 (Bug which should be fixed as soon as possible)", "0 - new (Not sure yet if we should work on it or not)" and "asan" labels on Apr 25, 2023
ktoso modified the milestones: 1.0.0-beta.x → 1.0.0-beta.4 on Apr 25, 2023
mannuch (Author) commented Apr 25, 2023

Ahh okay, got it. Running the Swift 5.8 container without --static-swift-stdlib seems to avoid the issue, so I'll go with that for now!
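
A sketch of what that adjusted Dockerfile could look like (sketch only; the slim run image and its tag are assumptions, needed because dropping --static-swift-stdlib means the final image must provide the Swift runtime libraries):

# ================================
# Build image (no --static-swift-stdlib)
# ================================
FROM swift:5.8-jammy as builder
WORKDIR /workspace
COPY . /workspace
# ASan can stay; only the static stdlib flag is dropped
RUN swift build --sanitize=address -c release -Xswiftc -g

# ================================
# Run image
# ================================
# The slim image ships the dynamic Swift runtime the binary now needs
FROM swift:5.8-jammy-slim
COPY --from=builder /workspace/.build/release/CrashingCluster /
EXPOSE 7337
ENTRYPOINT ["./CrashingCluster"]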

Thanks for the timely help with this!

ktoso changed the title from "Distributed Actors Cluster Crashing" to "Crash when using linux+asan+static-swift-stdlib" on Apr 26, 2023
ktoso changed the title from "Crash when using linux+asan+static-swift-stdlib" to "Crash when using linux+asan+static-swift-stdlib+swift 5.8" on Apr 26, 2023
ktoso (Member) commented May 6, 2023

Thanks for confirming. I'll close this, as I believe it is a static-linking issue with the concurrency library in general.

ktoso closed this as completed on May 6, 2023
ktoso removed the "bug 🐞 (Bug which should be fixed as soon as possible)" label on May 6, 2023