[BEAM-7192] Fix partitioning of buffered elements during checkpointing #8441

Merged
mxm merged 1 commit into apache:master on May 3, 2019

@mxm mxm commented Apr 30, 2019

When a Flink checkpoint is taken, the current bundle is finalized. The
finalization happens when the checkpoint barrier has already been sent
downstream; emitting elements at this point would violate the checkpoint barrier
alignment.

When elements are emitted during checkpointing they are buffered until the
checkpoint is complete. We should ensure that they are keyed correctly and
emission of the buffered elements does not interfere with any concurrent state
requests (in case of portability).
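
To illustrate the buffering described above, here is a minimal sketch in plain Java. It is not the code from this PR: `CheckpointOutputBuffer`, `beginCheckpoint`, `checkpointComplete`, and the per-key `LinkedHashMap` (standing in for keyed state in the state backend) are hypothetical names chosen for illustration. It only shows the idea that elements emitted while a checkpoint is in progress are held back under their key and released once the checkpoint is complete.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical sketch of buffering output during a checkpoint; not the Beam implementation. */
final class CheckpointOutputBuffer<K, V> {
  // Per-key buffer; an assumed stand-in for keyed state held in a state backend.
  private final Map<K, List<V>> bufferedByKey = new LinkedHashMap<>();
  private final BiConsumer<K, V> downstream;
  private boolean checkpointInProgress = false;

  CheckpointOutputBuffer(BiConsumer<K, V> downstream) {
    this.downstream = downstream;
  }

  /** Called when the checkpoint starts; from here on, output is held back. */
  void beginCheckpoint() {
    checkpointInProgress = true;
  }

  /** Emits directly, or buffers under the element's key while a checkpoint is running. */
  void emit(K key, V value) {
    if (checkpointInProgress) {
      bufferedByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    } else {
      downstream.accept(key, value);
    }
  }

  /** Called once the checkpoint is complete; releases everything that was buffered. */
  void checkpointComplete() {
    checkpointInProgress = false;
    bufferedByKey.forEach((key, values) -> values.forEach(v -> downstream.accept(key, v)));
    bufferedByKey.clear();
  }

  /** Number of elements currently held back. */
  int bufferedCount() {
    int count = 0;
    for (List<V> values : bufferedByKey.values()) {
      count += values.size();
    }
    return count;
  }
}
```

In this sketch the buffered elements are grouped by key on release, so ordering is preserved per key but not across keys.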


mxm commented Apr 30, 2019

Run Java PreCommit

@mxm mxm requested a review from tweise April 30, 2019 21:27

mxm commented May 1, 2019

Run Java PreCommit

tweise commented May 1, 2019

I don't think we need a lock for the output buffer since elements are only emitted by the operator thread.

I will look at the changes in more detail later.

mxm commented May 1, 2019

Thanks for taking an initial look.

> I don't think we need a lock for the output buffer since elements are only emitted by the operator thread.

I think we need a lock because elements are buffered to the state backend while the SDK harness can still process elements and make state requests.

tweise commented May 1, 2019

Output elements are buffered by the same thread (the operator thread) that also performs snapshotState. That is why I think that no lock is necessary.

mxm commented May 2, 2019

There are two threads:

  1. Main operator thread, which emits elements and buffers them in the state backend
  2. gRPC thread, which delegates state requests of pending elements to the state backend

So there can be concurrent writes to the state backend, which must be avoided. We don't need the lock when only the main operator thread has access, as is the case for flushBuffer(), where the bundle is guaranteed to be finished. That's why there is a lock only in the buffer method.
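
A rough sketch of that locking scheme, under stated assumptions: `LockedStateBackendSketch`, `handleStateRequest`, and the `ArrayList` used as a non-thread-safe stand-in for the state backend are hypothetical and not taken from this PR. Only `buffer` and `flushBuffer` echo names used in the discussion. The two writers that can run concurrently take the lock; the flush does not, because it is assumed to run only after the bundle has finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of the locking discussed above; not the code from this PR. */
final class LockedStateBackendSketch {
  private final Lock stateBackendLock = new ReentrantLock();
  // Deliberately not thread-safe: an assumed stand-in for the state backend.
  private final List<String> stateBackend = new ArrayList<>();

  /** Operator thread, while the bundle may still be processing: guarded by the lock. */
  void buffer(String element) {
    stateBackendLock.lock();
    try {
      stateBackend.add("buffered:" + element);
    } finally {
      stateBackendLock.unlock();
    }
  }

  /** gRPC thread serving a state request of a pending element: guarded by the same lock. */
  void handleStateRequest(String request) {
    stateBackendLock.lock();
    try {
      stateBackend.add("state:" + request);
    } finally {
      stateBackendLock.unlock();
    }
  }

  /** Operator thread only, after the bundle has finished: no lock taken in this sketch. */
  List<String> flushBuffer() {
    List<String> flushed = new ArrayList<>();
    for (String entry : stateBackend) {
      if (entry.startsWith("buffered:")) {
        flushed.add(entry.substring("buffered:".length()));
      }
    }
    stateBackend.removeIf(entry -> entry.startsWith("buffered:"));
    return flushed;
  }

  int size() {
    return stateBackend.size();
  }

  /** Demo helper: two spawned threads play the operator thread and the gRPC thread. */
  static LockedStateBackendSketch runConcurrently(int n) {
    LockedStateBackendSketch sketch = new LockedStateBackendSketch();
    Thread operator = new Thread(() -> {
      for (int i = 0; i < n; i++) {
        sketch.buffer("e" + i);
      }
    });
    Thread grpc = new Thread(() -> {
      for (int i = 0; i < n; i++) {
        sketch.handleStateRequest("r" + i);
      }
    });
    operator.start();
    grpc.start();
    try {
      operator.join();
      grpc.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return sketch;
  }
}
```

Without the lock in `buffer` and `handleStateRequest`, the two threads could corrupt the list or lose writes; `flushBuffer` can skip it only as long as nothing else touches the backend at that point.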

mxm commented May 2, 2019

Run Java PreCommit

tweise commented May 2, 2019

There is no need to buffer output except during snapshotState. But since the bundle is still processing and may access state until finishBundle (remoteBundle.close()) returns, we need the lock?

mxm commented May 2, 2019

That's correct.

mxm commented May 3, 2019

Run Flink ValidatesRunner

@mxm mxm merged commit f8d6b90 into apache:master May 3, 2019