-
Notifications
You must be signed in to change notification settings - Fork 13.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-30113] Implement state compression for broadcast and regular operator states #21636
[FLINK-30113] Implement state compression for broadcast and regular operator states #21636
Conversation
d42f351
to
e205374
Compare
e205374
to
371763a
Compare
Thanks for the PR. I'll review it over the next few days. |
@dawidwys Could you review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @echauchot
Thanks for working on the issue.
Correct me if I am wrong, but in the current state you always apply the compression for the operator state.
First of all, imo, this needs to be configurable. I'd say it can be done with the same flag as for the keyed state backend (execution.checkpointing.snapshot-compression
)
Moreover, right now it causes previously written checkpoints to be unreadable. (have you tried restoring from an old checkpoint? If not, could you add such a test?) as it always assumes the stream is compressed. The info if the state is compressed or not needs to be persisted somewhere. I don't have a good idea for that now. In the keyed state case it is done via KeyedBackendSerializationProxy#read/write
.
Now I realise this might require changes to the checkpoint format, unless we figure out a smart place to put it, so we might need to create a small FLIP for that.
Hey, thanks for the review !
No, see
See above, I've found we could reuse
Well, if the user disables compression to read uncompressed snapshots, then it works.
Yes I saw that with key state a boolean is written first when the state is written. I can do similar thing for operator state.
Ok, adding the boolean will require to change the snapshot format indeed. This is why I have not added this boolean to avoid the format change. But, indeed, if a snapshot was compressed and the user then disables compression he could not read the previous compressed snapshot with my code. What about simply adding a compression boolean to operator state meta ? As a backward incompatible change, it requires a FLIP even if simple change, right ? |
Ah, sorry missed that :( That part looks good then.
Yes, I do think it requires a FLIP as it is a substantial change, but I believe this change is unavoidable if we want to add operator state compression. |
No problem.
Ok, I'll write the FLIP and ping you on it for review, then when we agree on the architecture, I'll write the code and commit on this PR. |
@dawidwys I don't have the rights to add a FLIP to Flink confluence space. Can you give me the rights ? |
@echauchot You should have the rights now |
Thanks @MartijnVisser ! |
The vote on the FLIP has passed, but I'm going on vacation tonight for a week. I'll update this PR to reflect the FLIP when I get back |
Absolutely @echauchot ! Thanks for the work so far! |
@dawidwys sorry for the delay, I had to finish this other PR after my vacations. Resuming the work on this PR ... |
@flinkbot run azure |
1 similar comment
@flinkbot run azure |
Hi @dawidwys, my code is failing with versioning. I incremented the proxy version (as the metadata serde code has changed) and the snapshot version (as I added a boolean for compression). Can you explain how the proxy/snapshot versioning works because I see nothing in the docs. |
What you did for versioning looks sane. I think the problem is with line Line 68 in a4a11a3
readVersion < 6 , not less than CURRENT_STATE_META_INFO_SNAPSHOT_VERSION . If we change the condition to readVersion < 6 it should work.
|
a4a11a3
to
be169cf
Compare
@flinkbot run azure |
1 similar comment
@flinkbot run azure |
Hi @dawidwys, thanks for the prompt answer. I suspected something not future proof, thanks for the confirmation. It is now fixed. Tests pass but the CI fails in preparation of the E2E tests (since some days) with a http 404 error. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's wait until e2e tests are fixed though before we merge it.
Thanks for the review Dawid ! |
These E2E tests have been fixed for some time already, you probably need to rebase :) See https://issues.apache.org/jira/browse/FLINK-30972 |
Add a new snapshot format version 6 compatible with the previous versions that adds a boolean for snapshot compression At snapshot time: depending on user configuration, write Snappy compressed regular and broadcast operator states (but not the snapshot meta info) and write the compression boolean accordingly At restoration time: depending on the read compression boolean, read the snapshots as compressed or not.
be169cf
to
83f0a2a
Compare
Thanks @MartijnVisser, I missed this ticket. I just rebased. Let's wait for the green lights. |
@flinkbot run azure |
@dawidwys thanks for reviewing/merging. |
What is the purpose of the change
Implement state compression for broadcast and regular operator states. Only the states themselves are compressed not the metadata.
Brief change log
Verifying this change
This change is already covered by existing tests, such as windowing ITCase tests ... in the flink-runtime module
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation
R: @dawidwys