Crash when using linux+asan+static-swift-stdlib+swift 5.8 #1118

Closed
mannuch opened this issue Apr 24, 2023 · 4 comments
Labels: 0 - new (Not sure yet if we should work on it or not), asan

mannuch commented Apr 24, 2023

Hello!

I ran into an issue when running my service in a release configuration on Linux via Docker.

After some digging, I believe I've isolated the issue to when the cluster is initialized. I have a reproduction of the issue with a simple main.swift:

import DistributedCluster

let clusterSystem = await ClusterSystem("TestRunCluster")
try await Task.sleep(for: .seconds(5))
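
For completeness, a package manifest for this reproduction might look roughly like the sketch below. The dependency versions, and the swift-backtrace package (assumed to be what prints the "Received signal 11" backtrace further down), are assumptions rather than copied from the linked repo:

// swift-tools-version:5.8
// Package.swift -- sketch only; versions and the swift-backtrace dependency are assumed
import PackageDescription

let package = Package(
    name: "CrashingCluster",
    dependencies: [
        // Provides the DistributedCluster module imported in main.swift
        .package(url: "https://github.com/apple/swift-distributed-actors.git", from: "1.0.0-beta.3"),
        // Prints a backtrace when the process receives a fatal signal on Linux
        .package(url: "https://github.com/swift-server/swift-backtrace.git", from: "1.3.0"),
    ],
    targets: [
        .executableTarget(
            name: "CrashingCluster",
            dependencies: [
                .product(name: "DistributedCluster", package: "swift-distributed-actors"),
                .product(name: "Backtrace", package: "swift-backtrace"),
            ]
        )
    ]
)

With swift-backtrace in place, main.swift would typically call Backtrace.install() before the cluster system is created.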

When running with Backtrace installed, I get the following:

Received signal 11. Backtrace:
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaad5a3b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaad5a3979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaad5a3971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Binding to: [sact://TestRunCluster@127.0.0.1:7337]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster/leadership cluster/node=sact://TestRunCluster@127.0.0.1:7337 leadership/election=DistributedCluster.Leadership.LowestReachableMember [DistributedCluster] Not enough members [1/2] to run election, members: [Member(sact://TestRunCluster:2481186327279040895@127.0.0.1:7337, status: joining, reachability: reachable)]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Bound to [IPv4]127.0.0.1/127.0.0.1:7337

Since the backtrace only showed that signal 11 was received, I tried AddressSanitizer to see if I could get more information, which produced the following:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==1==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0xffff819de570 sp 0xffff819de560 T3)
==1==Hint: pc points to the zero page.
==1==The signal is caused by a READ memory access.
==1==Hint: address points to the zero page.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
    #0 0x0  (<unknown module>)
    #1 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #2 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #3 0xaaaac2c2008c  (/CrashingCluster+0x1e4008c)
    #4 0xaaaac2c1fdf4  (/CrashingCluster+0x1e3fdf4)
    #5 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #6 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #7 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

AddressSanitizer can not provide additional info.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaac374b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
Thread T3 created by T1 here:
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c2b694  (/CrashingCluster+0x1e4b694)
    #3 0xaaaac2c24c04  (/CrashingCluster+0x1e44c04)
    #4 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #5 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #6 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

Thread T1 created by T0 here:
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaac374979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c634cc  (/CrashingCluster+0x1e834cc)
    #3 0xaaaac2c6293c  (/CrashingCluster+0x1e8293c)
    #4 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #5 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #6 0xaaaac18ce5b4  (/CrashingCluster+0xaee5b4)
    #7 0xffff85f273f8  (/lib/aarch64-linux-gnu/libc.so.6+0x273f8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #8 0xffff85f274c8  (/lib/aarch64-linux-gnu/libc.so.6+0x274c8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #9 0xaaaac143efac  (/CrashingCluster+0x65efac)

2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaac374971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
==1==ABORTING
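
A note on the unsymbolicated frames above (an aside, not part of the original report): since the binary is built as a position-independent executable, the module offsets ASan prints can usually be resolved after the fact with llvm-symbolizer against the build artifact, for example:

# Symbolicate the crashing frames using the offsets ASan printed above;
# requires LLVM tools on the machine that holds the build output.
llvm-symbolizer --obj=.build/release/CrashingCluster 0x1e82014 0x1e82754 0x1e4008c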

As far as I can tell, the problem only arises when running on Linux with this Dockerfile:

# ================================
# Build image
# ================================
FROM swift:5.8-jammy as builder

RUN mkdir /workspace
WORKDIR /workspace

COPY . /workspace

RUN swift build --sanitize=address -c release -Xswiftc -g --static-swift-stdlib

# ================================
# Run image
# ================================
FROM ubuntu:jammy

COPY --from=builder /workspace/.build/release/CrashingCluster /

EXPOSE 7337

ENTRYPOINT ["./CrashingCluster"]

This reproduction, along with the Dockerfile, can be found in this repo, if it helps.

Thanks for all the work on this!

mannuch (Author) commented Apr 24, 2023

However, when running with bind mounts to the local filesystem via

docker run -v "$PWD:/code" -w /code swift:latest swift run -c release

my original application, as well as the reproducer linked above, appears to work.

ktoso (Member) commented Apr 25, 2023

Thanks for the bug report!

We continued looking into this and strongly suspect that this is a bug in Swift 5.8 when --static-swift-stdlib is combined with ASan (AddressSanitizer).

I'll quadruple check some more but that's our strong suspicion so far.

It also does not reproduce on Swift 5.9, and we suspect this might be the fix for it: apple/swift#65254

ktoso added the "bug 🐞 (Bug which should be fixed as soon as possible)", "0 - new (Not sure yet if we should work on it or not)" and "asan" labels on Apr 25, 2023
ktoso modified the milestones: 1.0.0-beta.x → 1.0.0-beta.4 on Apr 25, 2023
mannuch (Author) commented Apr 25, 2023

Ahh okay, got it. Running the Swift 5.8 container without --static-swift-stdlib seems to avoid the issue, so I'll go with that for now!
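
A sketch of what that adjusted Dockerfile could look like (sketch only; the slim run image and its tag are assumptions, needed because dropping --static-swift-stdlib means the final image must provide the Swift runtime libraries):

# ================================
# Build image (no --static-swift-stdlib)
# ================================
FROM swift:5.8-jammy as builder
WORKDIR /workspace
COPY . /workspace
# ASan can stay; only the static stdlib flag is dropped
RUN swift build --sanitize=address -c release -Xswiftc -g

# ================================
# Run image
# ================================
# The slim image ships the dynamic Swift runtime the binary now needs
FROM swift:5.8-jammy-slim
COPY --from=builder /workspace/.build/release/CrashingCluster /
EXPOSE 7337
ENTRYPOINT ["./CrashingCluster"]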

Thanks for the timely help with this!

ktoso changed the title from "Distributed Actors Cluster Crashing" to "Crash when using linux+asan+static-swift-stdlib" on Apr 26, 2023
ktoso changed the title from "Crash when using linux+asan+static-swift-stdlib" to "Crash when using linux+asan+static-swift-stdlib+swift 5.8" on Apr 26, 2023
ktoso (Member) commented May 6, 2023

Thanks for confirming. I'll close this, as I believe it is a static-linking issue with the concurrency library in general.

ktoso closed this as completed on May 6, 2023
ktoso removed the "bug 🐞 (Bug which should be fixed as soon as possible)" label on May 6, 2023