
func.go: Concurrent calls swap return-value pointers (sync.Pool of *syscall15Args, regressed in #282) #451

@nsaini-figma

Description

PureGo Version

v0.9.0 and v0.9.1 (both confirmed). Likely all releases from v0.8.1 onward (the first release including #282).

Operating System

  • Windows
  • macOS
  • Linux
  • FreeBSD
  • NetBSD
  • Android
  • iOS

Go Version (go version)

go 1.24 and go 1.26.1

What steps will reproduce the problem?

Two or more goroutines call the same function, registered with RegisterLibFunc from a shared library and having the signature func(uint64) *byte. Within seconds, the goroutines observe each other's return pointers.

ultra.c: a five-line payload that returns a malloc'd buffer with the seed embedded:

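```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns a malloc'd JSON string embedding the seed; free it with ultra_free. */
char *ultra_alloc(uint64_t seed) {
    char *buf = (char *)malloc(256);
    if (!buf) return NULL;
    /* PRIu64 keeps the format correct where unsigned long is 32 bits. */
    snprintf(buf, 256, "{\"seed\":%" PRIu64 ",\"_\":\"%064" PRIu64 "\"}",
             seed, seed);
    return buf;
}

void ultra_free(char *s) { free(s); }
```

```sh
cc -O2 -shared -fPIC -o libultra.so ultra.c
```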

ultra.go: N goroutines, each with its own disjoint seed range, read the JSON back and check whether the embedded seed matches the one they sent:

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
	"sync"
	"sync/atomic"
	"time"
	"unsafe"

	"github.com/ebitengine/purego"
)

func main() {
	workers := flag.Int("workers", 32, "number of concurrent goroutines")
	dur := flag.Duration("dur", 15*time.Second, "how long to run")
	flag.Parse()

	h, err := purego.Dlopen("./libultra.so", purego.RTLD_NOW|purego.RTLD_GLOBAL)
	if err != nil {
		fmt.Println(err)
		os.Exit(2)
	}

	var alloc func(uint64) *byte
	var free func(*byte)
	purego.RegisterLibFunc(&alloc, h, "ultra_alloc")
	purego.RegisterLibFunc(&free, h, "ultra_free")

	var ops, mismatches atomic.Uint64
	var wg sync.WaitGroup
	done := make(chan struct{})

	for i := 0; i < *workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			seed := uint64(id) * 1_000_000_000 // disjoint per worker
			for {
				select {
				case <-done:
					return
				default:
				}
				seed++
				ptr := alloc(seed)
				if ptr == nil {
					continue
				}
				// ultra_alloc's buffer is 256 bytes, so reading the first 96 stays in bounds.
				s := string(unsafe.Slice(ptr, 96))
				want := fmt.Sprintf(`{"seed":%d,`, seed)
				if !strings.HasPrefix(s, want) {
					if n := mismatches.Add(1); n <= 5 {
						fmt.Fprintf(os.Stderr,
							"worker %d sent=%d head=%q\n",
							id, seed, s[:48])
					}
				}
				free(ptr)
				ops.Add(1)
			}
		}(i)
	}

	time.Sleep(*dur)
	close(done)
	wg.Wait()
	fmt.Printf("ops=%d mismatches=%d\n", ops.Load(), mismatches.Load())
}
```

What is the expected result?

Each goroutine's alloc(seed) call returns a pointer to a buffer whose embedded seed matches the seed that goroutine passed in. Across every call from every worker, the program prints mismatches=0 and exits cleanly. This is what happens on v0.8.0 (153M dispatches across runs with 2 to 32 workers, zero mismatches) and with plain C driving 32 pthreads against the same kind of .so (16M+ ops, clean).

What happens instead?

On v0.9.0 and v0.9.1 with workers ≥ 2, within seconds a goroutine reads back JSON whose embedded seed belongs to another worker's seed range: the return pointer from another goroutine's in-flight alloc call has bled across. The run then often aborts, because two goroutines end up calling free on the same C pointer and glibc reports "free(): double free detected in tcache 2". Verbatim output from a -workers 32 -dur 15s run on v0.9.1:

```
worker 16 sent=16000003040 head="{\"seed\":12000007214,\"_\":\"00000000000000000000000"
worker 12 sent=12000007219 head="{\"seed\":16000003063,\"_\":\"00000000000000000000000"
worker 16 sent=16000003092 head="\xa1Lk%\x11y\x00\x00Ӗ\x8d\xdbo\xa53\x02312,\"_\":\"00000000000000000000000"
free(): double free detected in tcache 2
SIGABRT: abort
```

Workers 16 and 12 swapped return pointers: each read back the other's just-allocated buffer. The third line shows a further-degraded case: by the time worker 16 reads the buffer, it has already been freed and reused, so the prefix is non-JSON garbage. The SIGABRT is glibc reacting to two goroutines freeing the same pointer.
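To keep the repro running long enough to count mismatches instead of dying on the glibc abort, each worker can record which C pointers it currently holds and skip free when it receives a pointer another worker already owns. A minimal sketch of such a diagnostic, using a hypothetical inflight helper (not part of purego or the original repro) built on sync.Map.LoadOrStore; addresses are stored as uintptr purely as map keys:

```go
package main

import (
	"fmt"
	"sync"
)

// inflight records which worker currently owns each C pointer returned by
// alloc. Hypothetical diagnostic helper, not part of purego.
type inflight struct {
	m sync.Map // uintptr -> worker id
}

// claim registers ptr as owned by worker id. If another worker already owns
// it, claim returns that worker's id and false; the caller should then skip
// free(ptr) and count a mismatch instead.
func (f *inflight) claim(ptr uintptr, id int) (owner int, ok bool) {
	prev, loaded := f.m.LoadOrStore(ptr, id)
	if loaded {
		return prev.(int), false
	}
	return id, true
}

// release drops ownership; call it right before the owner calls free(ptr).
func (f *inflight) release(ptr uintptr) { f.m.Delete(ptr) }

func main() {
	var f inflight
	if _, ok := f.claim(0x1000, 16); ok {
		fmt.Println("worker 16 owns 0x1000")
	}
	if owner, ok := f.claim(0x1000, 12); !ok {
		fmt.Printf("worker 12 got 0x1000, already owned by worker %d\n", owner)
	}
	f.release(0x1000)
	_, ok := f.claim(0x1000, 12)
	fmt.Println("after release, worker 12 can claim:", ok)
}
```

In the repro loop this would be claim(uintptr(unsafe.Pointer(ptr)), id) right after alloc, and release followed by free(ptr) only when the claim succeeded. Releasing before free matters, because once free returns, malloc may legitimately hand the same address to another worker. This reduces but does not eliminate the double frees: if the owner has already released and freed the pointer, the second worker's claim succeeds and its free is still a double free.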

Anything else you feel useful to add?

What I've established

  • The race was introduced in #282 (all: improve memory usage). That PR replaced the stack-allocated syscall15Args in RegisterFunc's reflect closure with syscall := thePool.Get().(*syscall15Args) plus defer thePool.Put(syscall) (func.go:310-311 in v0.9.0). Verified by a direct version test: I built the repro against v0.8.0 (the last release before #282 merged on 2024-10-17; its func.go has no thePool references) and ran it with workers ∈ {2, 4, 8, 32} for 20s each: 153M total dispatches, zero mismatches. The same repro against v0.9.0 / v0.9.1 mismatches within seconds at any workers ≥ 2.
  • Reverting just those two lines at HEAD (back to the v0.8.0 stack-allocation pattern) eliminates the race: 137M+ ops over 60s with 32 goroutines and zero mismatches, on the same workload that mismatches within seconds at HEAD.
  • The bug does not reproduce at workers=1 even at 19M+ ops — single-goroutine FFI is unaffected.
  • The bug does not reproduce when the same .so is called from plain C with 32 pthreads under an identical workload (16M+ ops clean), which rules out the C side, the system malloc, and the kernel.
  • The bug does reproduce with GOMAXPROCS=1, since Go still spawns separate OS threads for cgo calls.
  • A function with signature (u64, u64) → u64 (no pointer return) does not race at 32 workers. The pointer-return path is required.
  • A function taking string args but returning u64 does not race at 32 workers, so argument marshaling is not the source: the racy path is return-value extraction.
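For readers unfamiliar with func.go, the change described in the first two bullets has roughly the following shape. This is a sketch with hypothetical names (args, fakeCall, stackCall, pooledCall), not purego's code; fakeCall stands in for the runtime_cgocall into the trampoline and simply writes a result into a1. It also shows that the pool on its own is safe when the result leaves the function as a plain value, because a return expression is evaluated before deferred calls run:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// args is a hypothetical stand-in for purego's syscall15Args.
type args struct{ a1, a2 uintptr }

var pool = sync.Pool{New: func() any { return new(args) }}

// fakeCall stands in for the cgocall into the trampoline: it writes the
// "C return value" into a1.
func fakeCall(a *args) { a.a1 = a.a2 * 10 }

// stackCall mirrors the v0.8.0 shape: a fresh struct for every call.
func stackCall(x uintptr) uintptr {
	var a args
	a.a2 = x
	fakeCall(&a)
	return a.a1
}

// pooledCall mirrors the post-#282 shape: borrow from the pool, Put on return.
func pooledCall(x uintptr) uintptr {
	a := pool.Get().(*args)
	defer pool.Put(a)
	*a = args{a2: x} // pooled items keep stale contents, so reset them
	fakeCall(a)
	return a.a1 // evaluated into the result before the deferred Put runs
}

func main() {
	var wg sync.WaitGroup
	var bad atomic.Int64
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 100_000; i++ {
				x := uintptr(w*1_000_000 + i)
				if stackCall(x) != x*10 || pooledCall(x) != x*10 {
					bad.Add(1)
				}
			}
		}(w)
	}
	wg.Wait()
	fmt.Println("mismatches:", bad.Load()) // mismatches: 0
}
```

Under this model the pool can only hand one call's result to another if a1 is read after the function has returned and the struct has gone back to the pool, so the question becomes whether any code path reads it that late.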

What I don't know

The precise interleaving. I haven't been able to work out from first principles how two sync.Pool.Get() calls could return overlapping items: sync.Pool is documented as safe for concurrent use, and *syscall.a1 is read before the deferred Put runs.
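One mechanism that would make the "read before Put" assumption false without sync.Pool misbehaving at all (a hypothesis, not verified against func.go): a reflect.Value built with reflect.NewAt(t, p).Elem() does not copy the bytes at p, it refers to them. When such a Value is returned from a reflect.MakeFunc closure, reflect copies it into the caller's result only after the closure, including its deferred calls, has returned. The sketch demonstrates that ordering deterministically on one goroutine; the deferred write stands in for thePool.Put followed by another goroutine reusing the struct, and args, slot, and register are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// args is a hypothetical stand-in for the pooled syscall15Args.
type args struct{ a1 uintptr }

// slot plays the role of a pooled struct that another call may reuse.
var slot args

// register builds a func() uintptr with reflect.MakeFunc, similar in spirit
// to RegisterFunc. If aliased is true, the result Value points at slot.a1
// instead of holding a copy of it.
func register(aliased bool) func() uintptr {
	var fn func() uintptr
	impl := func([]reflect.Value) []reflect.Value {
		slot.a1 = 111 // this call's return value
		// Simulates Put plus another goroutine overwriting the struct.
		defer func() { slot.a1 = 222 }()
		if aliased {
			v := reflect.NewAt(reflect.TypeOf(uintptr(0)), unsafe.Pointer(&slot.a1)).Elem()
			return []reflect.Value{v} // refers to slot.a1, read later by reflect
		}
		return []reflect.Value{reflect.ValueOf(slot.a1)} // copies now
	}
	reflect.ValueOf(&fn).Elem().Set(reflect.MakeFunc(reflect.TypeOf(fn), impl))
	return fn
}

func main() {
	fmt.Println("copied: ", register(false)()) // 111
	fmt.Println("aliased:", register(true)())  // 222: read after the defer ran
}
```

If the pointer-return branch does build its result as an addressable Value over &syscall.a1, that would fit the observations: only results built that way would be affected (consistent with u64 returns staying clean, if the integer branches copy eagerly), and the per-call v0.8.0 struct has no other writer. Under this hypothesis, copying a1 into a local before building the reflect.Value would close the window.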

The regression points at the sync.Pool of *syscall15Args introduced in #282 (func.go:310-311) and mirrored on the parallel sysv path in #328 (syscall_sysv.go:18-19). Possibilities I considered but couldn't confirm:

  1. The asm trampoline (syscall15XABI0) writes to the captured *syscall15Args after runtime_cgocall nominally returns to Go, opening a window where another goroutine has the same struct from the pool.
  2. A subtle escape-analysis or stack-growth interaction (PR #328 notes that the nosplit annotation on the parallel syscall_sysv.go path was removed when the pool was reintroduced there).
  3. Goroutine preemption between thePool.Get() and runtime_cgocall lets another goroutine Get the same item if the per-P cache misaligns with the in-flight call.
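Possibility 3 can be tested in isolation. sync.Pool.Get removes the item it returns, so two goroutines should never hold the same item between Get and Put, preempted or not. A minimal stress check that assumes nothing about purego (item and checkPool are hypothetical names) marks each item with its current holder and counts overlaps, yielding inside the held window to invite rescheduling:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// item stands in for a pooled *syscall15Args; holder is 0 while it is free.
type item struct{ holder atomic.Int64 }

// checkPool runs `workers` goroutines that repeatedly Get an item, claim it
// with their id, yield, release it, and Put it back. It returns how many
// times Get handed out an item that another goroutine still held.
func checkPool(workers, iters int) int64 {
	pool := sync.Pool{New: func() any { return new(item) }}
	var overlaps atomic.Int64
	var wg sync.WaitGroup
	for w := 1; w <= workers; w++ {
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				it := pool.Get().(*item)
				if !it.holder.CompareAndSwap(0, id) {
					overlaps.Add(1) // someone else still holds this item
				}
				runtime.Gosched() // widen the window between Get and Put
				it.holder.Store(0)
				pool.Put(it)
			}
		}(int64(w))
	}
	wg.Wait()
	return overlaps.Load()
}

func main() {
	fmt.Println("overlaps:", checkPool(32, 20000)) // overlaps: 0
}
```

A nonzero count would implicate the pool itself; zero points the investigation at code that touches the struct outside its Get/Put window.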
