Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: [Go SDK]: State API writes may become corrupted with Go 1.23 #32245

Closed
1 of 17 tasks
lostluck opened this issue Aug 19, 2024 · 1 comment
Closed
1 of 17 tasks

[Bug]: [Go SDK]: State API writes may become corrupted with Go 1.23 #32245

lostluck opened this issue Aug 19, 2024 · 1 comment
Assignees

Comments

@lostluck
Copy link
Contributor

What happened?

When using the Beam State API in the Go SDK, it's possible that data may become corrupted when performing a Write.

A change to the Go compiler for Go 1.23+ made the issue more frequent/severe/reproducible.

It is not recommended to use Go 1.23+ for Beam Go pipelines using the State API until this issue is resolved. If your job doesn't use the State API, then it is unaffected.

It's not known if the behavior can cause a similar issue elsewhere in the SDK at this time. Other calls such as writing data to data sinks are first written to a buffer, or properly copied to a buffer, avoiding the same problem, or at least rendering it less likely.


In particular, the data written to the state API, might become overwritten or zeroed out before actually being sent over the GRPC channel, corrupting the write. Consider a length prefix byte that is zeroed, or changed prior to being sent to the runner.


The root cause of the issue is that Beam Go is incorrectly implementing go's https://pkg.go.dev/io#Writer interface for writing to State calls. In particular, it's incorrectly retaining a reference to the input byte buffer, passed into the write call.

https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L481

The buffer is simply passed to the FnAPI StateAPI write request, then being sent to another channel for actually sending it on the State channel.

https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L498

This send happens asynchronously, sending it to a different goroutine for serialization.
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/statemgr.go#L704

While this is ultimately a blocking call, the other half is due to Beam Go lightly subverting Go's escape analysis in the name of performance, using ioutilx.WriteUnsafe.

eg used here. https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/graph/coder/varint.go#L44

This was deemed required for performance for small writes in order to avoid usually unnecessary
allocations to the heap. Typically these allocations would be stack allocated, and then not re-used.

However in Go 1.23 a change to the compiler "can now overlap stack frame slots of local variables". It's my suspicion that this allowed the stack frame referred to to by the byte buffer to be zeroed out or re-assigned, corrupting the data.


The fix is to copy the bytes properly when sending them over the state API. This avoids the bytes being changed prior to being sent.

A more robust fix would be to re-write coder handling entirely, to avoid using the io.Reader and io.Writer interfaces, which opened up the issue in the first place.

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@lostluck lostluck added this to the 2.59.0 Release milestone Aug 19, 2024
@lostluck lostluck self-assigned this Aug 19, 2024
lostluck added a commit to lostluck/beam that referenced this issue Aug 19, 2024
lostluck added a commit to lostluck/beam that referenced this issue Aug 19, 2024
jrmccluskey pushed a commit that referenced this issue Aug 20, 2024
* [#32245] Copy bytes sent from the State API.

* Mention #32245 in changes.md

* remove unnecessary chagned line

* weird copypasta

---------

Co-authored-by: lostluck <13907733+lostluck@users.noreply.github.com>
@lostluck
Copy link
Contributor Author

Not sure why the change didn't auto close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant