
Conversation

@Myasuka (Member) commented Jan 25, 2021

What is the purpose of the change

Currently, the RocksDB memory control end-to-end tests fail with a very small probability due to limitations of RocksDB itself. In general, we expect RocksDB memory control to take effect in most cases, but we cannot guarantee that it behaves perfectly. Thus, since the chance of exceeding the limit is really low, we can retry up to 3 times.

Brief change log

Run the test up to three times and fail the end-to-end test case only if all attempts fail.

Verifying this change

This change adds tests and can be verified as follows:

Run the test up to three times and verify that the end-to-end test case fails only if all attempts fail.
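The retry behavior described above can be sketched as a small shell wrapper. This is an illustrative sketch, not the actual Flink e2e-test helper; the function name `retry_times` and the example script path are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the retry logic: run a command up to a maximum
# number of attempts, succeeding on the first passing run and failing
# only if every attempt fails.
retry_times() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "Attempt ${attempt}/${max_attempts}: $*"
    if "$@"; then
      # The test passed; no further retries are needed.
      return 0
    fi
    echo "Attempt ${attempt} failed."
    attempt=$((attempt + 1))
  done
  echo "All ${max_attempts} attempts failed." >&2
  return 1
}

# Example usage (hypothetical script name):
# retry_times 3 ./run_rocksdb_memory_control_test.sh
```

With this wrapper, a test that only fails sporadically (e.g. due to a rare memory-limit overshoot) passes as long as one of the three attempts succeeds, while a consistently broken test still fails the e2e run.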

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot (Collaborator) commented Jan 25, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 0a3a317 (Thu Sep 23 18:00:38 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!
  • This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot (Collaborator) commented Jan 25, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@Myasuka (Member, Author) commented Jan 25, 2021

@flinkbot run azure

@Myasuka (Member, Author) commented Jan 27, 2021

@flinkbot run azure

@Myasuka (Member, Author) commented Jan 27, 2021

@StephanEwen could you please take a look at this PR?

@Myasuka (Member, Author) commented Feb 5, 2021

@flinkbot run azure

@StephanEwen (Contributor)

This looks like a fair way to stabilize the tests, but it ultimately hides a real problem: The fact that the RocksDB memory footprint is not controllable.

I am fine with merging this, but we need to keep working on the other issue to make the experience in containerized environments smoother.

@Myasuka (Member, Author) commented Feb 11, 2021

@flinkbot run azure

@Myasuka (Member, Author) commented Feb 11, 2021

@StephanEwen, yes, I agree with your concern. The current memory control mechanism of RocksDB cannot track memory that is malloc'ed but not accounted to the block cache or write buffers; for example, uncompressed data blocks can still pose an OOM risk.
Actually, we are trying several optimizations to improve the user experience:

  1. The partitioned index option ([FLINK-20496][state backends] RocksDB partitioned index/filters option. #14341) to help avoid performance problems during memory control.
  2. Bumping RocksDB to 6.10+ to get a stricter block cache size limit.

@StephanEwen (Contributor)

Do you think this PR is still relevant after the RocksDB version upgrade and using strict mode?
I understand that even strict mode is not totally strict, but is the improvement enough to stabilize the tests without this change?

If the answer to the above question is "yes", then I would suggest not merging this PR, because it may hide problems if there is a regression in the memory management code.

@Myasuka (Member, Author) commented Apr 12, 2021

@StephanEwen, sorry for missing this reply during my Spring Festival holidays. In the end, Flink 1.13 did not bump the RocksDB version, and it seems no test failures have been reported during this time.
Maybe we could mark FLINK-17511 as resolved in Flink 1.14.

@dawidwys (Contributor) commented Jul 9, 2021

@StephanEwen @Myasuka Could we close this PR? I closed the JIRA ticket as the failure has not appeared in a while.

@github-actions

This PR is being marked as stale since it has not had any activity in the last 180 days.
If you would like to keep this PR alive, please leave a comment asking for a review.
If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out to the [community](https://flink.apache.org/what-is-flink/community/).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 90 days, it will be automatically closed.

@github-actions

This PR has been closed since it has not had any activity in 120 days.
If you feel like this was a mistake, or you would like to continue working on it,
please feel free to re-open the PR and ask for a review.


6 participants