Skip to content

Conversation

@Aitozi
Copy link
Contributor

@Aitozi Aitozi commented May 19, 2019

What is the purpose of the change

As described in the jira, User's customize configuration which is configured by backend.configure() method will be override by the configuration loading from flink-conf.yaml. I think the config in the code should has a higher priority than the default file configuration.

Brief change log

Merge configuration before create the new state backend.

Verifying this change

Adding a test case to verify.

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4 node cluser with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Copy link
Collaborator

flinkbot commented May 19, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 2e316e2 (Fri Feb 28 21:49:54 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@Aitozi
Copy link
Contributor Author

Aitozi commented May 19, 2019

Hi, @StefanRRichter can you help take a look on this PR when you have time, thx.

@Aitozi
Copy link
Contributor Author

Aitozi commented May 20, 2019

The travis-ci failure seems unrelated

* @param classLoader The class loader.
*/
private RocksDBStateBackend(RocksDBStateBackend original, Configuration config, ClassLoader classLoader) {
this.configuration = config;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we use RocksDBStateBackend, seems we always use the original statebackend's conf first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @klion26 for your review. I'm not sure whether i quite get you meaning. But IMO this construct is only for creating a re-configured copy of the original state backend. So the configure passed in will be used first ? And also should keep this configure to a member field to reuse it to config the statebackend loaded from StateBackendLoader#fromApplicationOrConfigOrDefault.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the code below(start from line 302) seems to use the original backend's conf first, so I'm not sure you change now is on the right directory(maybe the change you've made can't solve the problem on the issue)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @klion26 I check the code again, only the priorityQueueStateType is configured from configuration, and the type from configuration will be used first see

final String priorityQueueTypeString = config.getString(TIMER_SERVICE_FACTORY);

this.priorityQueueStateType = priorityQueueTypeString.length() > 0 ?
             PriorityQueueStateType.valueOf(priorityQueueTypeString.toUpperCase()) : original.priorityQueueStateType;

And other parameters are all passed by the constructor, so it can work for the issue i think. but it looks strange indeed(some follow the original backend, some follow the config first), I will think how to make it better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the other configs? Do we need to do the same thing here for the other configs?

Copy link
Contributor Author

@Aitozi Aitozi Jun 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other config are all following this manner: check the original config whether set or undefined, if it is not defined then it will look into the config , at last fallback to the default config which is consistent with the doc of the config()

Creates a copy of this state backend that uses the values defined in the configuration for fields where that were not yet specified in this state backend.

But for the priorityQueueStateType there is no undefined type, it only has two: rocksdb and heap, if we have to follow the manner I think we just have to add an undefined value for priorityQueueStateType, but i am not sure do we need to do this.

What i am do now is just overriding the config loading from file with the original statebackend config to solve this problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we need to check the config first, then fallback the original config, and at last fallback to the default config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't have to, I prefer to keep the former style, but only use the config in user code with higher priority than the default configuration loading from the file.

@Aitozi
Copy link
Contributor Author

Aitozi commented Jun 2, 2019

Hi @klion26 I refactor the test case to verify this fix work explicitly. @StefanRRichter could you help review this bug fix, too.

@azagrebin azagrebin self-assigned this Aug 30, 2019
@flinkbot
Copy link
Collaborator

flinkbot commented Aug 30, 2019

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Copy link
Contributor

@azagrebin azagrebin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this problem @Aitozi
I think you are right that the user custom configuration should have a precedence.
However, it looks like other parameters already have a certain configuration pattern which was probably accidentally overlooked in the original effort.
E.g. localRocksDbDirectories is firstly checked whether it is already set or not by user. If it is set then it is not overwritten otherwise it is set from the config which will be default value in case of cluster config.
It looks like it would be simpler just to stick to the same pattern for priorityQueueStateType instead of changing the existing approach.
wdyt?

@azagrebin azagrebin assigned Aitozi and unassigned azagrebin Aug 30, 2019
@Aitozi
Copy link
Contributor Author

Aitozi commented Sep 2, 2019

Thanks for your suggestion @azagrebin , I will look into this .

@Aitozi
Copy link
Contributor Author

Aitozi commented Sep 4, 2019

Hi @azagrebin , I have done the fix following your comments, please help review when you are free.

@Aitozi
Copy link
Contributor Author

Aitozi commented Sep 4, 2019

Do you know why this push do not trigger travis ci ? How can i trigger it manually @azagrebin ?

@azagrebin
Copy link
Contributor

@Aitozi it triggered the CI, the setup has been recently changed, you can see CI results in the special build comment in this PR.

@Aitozi
Copy link
Contributor Author

Aitozi commented Sep 5, 2019

@azagrebin I check the CI log just now, I think it failed due to

PackagesNotFoundError: The following packages are not available from current channels:

  • python=3.3

seems unrelated to this PR ?

Copy link
Contributor

@azagrebin azagrebin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @Aitozi
I left some smaller comments

PriorityQueueStateType.valueOf(priorityQueueTypeString.toUpperCase()) : original.priorityQueueStateType;
if (original.priorityQueueStateType == PriorityQueueStateType.UNDEFINED) {
String priorityQueueTypeString = config.getString(TIMER_SERVICE_FACTORY);
this.priorityQueueStateType = PriorityQueueStateType.valueOf(priorityQueueTypeString.toUpperCase());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

short-cut for getting Enum values is: config.getEnum(...)

PriorityQueueStateType.valueOf(priorityQueueTypeString.toUpperCase()) : original.priorityQueueStateType;
if (original.priorityQueueStateType == PriorityQueueStateType.UNDEFINED) {
String priorityQueueTypeString = config.getString(TIMER_SERVICE_FACTORY);
this.priorityQueueStateType = PriorityQueueStateType.valueOf(priorityQueueTypeString.toUpperCase());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think introducing UNDEFINED makes sense in general but I am wondering if making it @nullable and leaving initialized by null is actually simpler in this case. If we use UNDEFINED, I think we also have to check here explicitly and then test it that if user configures UNDEFINED then she gets an error at once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree to use null directly.

enableTtlCompactionFilter = TernaryBoolean.TRUE;
}

private PriorityQueueStateType resolvePriorityQueueStateType(PriorityQueueStateType origin) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it is more user-friendly if we add setter/getter for this like we have for some other things.
we can also use this getter with the resolution logic in createKeyedStateBackend.


// Fix the option value string and ensure all are covered
Assert.assertEquals(2, RocksDBStateBackend.PriorityQueueStateType.values().length);
Assert.assertEquals(3, RocksDBStateBackend.PriorityQueueStateType.values().length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also UNDEFINED assertion if we choose to go with it
Assert.assertEquals("UNDEFINED", RocksDBStateBackend.PriorityQueueStateType.UNDEFINED.toString());

*/
@Test
public void testConfigureTimerServiceLoadingFromApplication() throws Exception {
final Environment env = getMockEnvironment(tempFolder.newFolder());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to avoid using mockito mocks in new code. We have e.g. MockEnvironmentBuilder which we could try here to achieve similar behaviour.

final RocksDBStateBackend rocksDBStateBackend = (RocksDBStateBackend) loadedBackend;

RocksDBKeyedStateBackend<Integer> keyedBackend2 = createKeyedStateBackend(rocksDBStateBackend, env);
Assert.assertEquals(RocksDBPriorityQueueSetFactory.class, keyedBackend2.getPriorityQueueFactory().getClass());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keyedBackend(2).dispose(); at the end


RocksDBKeyedStateBackend<Integer> keyedBackend = createKeyedStateBackend(backend, env);

Assert.assertEquals(RocksDBPriorityQueueSetFactory.class, keyedBackend.getPriorityQueueFactory().getClass());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these two lines we duplicate check from testConfigureTimerService.
I would suggest we test only one thing in a test at a time.

@azagrebin
Copy link
Contributor

azagrebin commented Sep 5, 2019

yes, I think it is unrelated
AFAIK it was fixed, you could rebase the PR on the current master and the failure should go away.

@Aitozi
Copy link
Contributor Author

Aitozi commented Sep 9, 2019

Please take a look again when you have time, I have fixed following your comments. @azagrebin

@Aitozi
Copy link
Contributor Author

Aitozi commented Sep 19, 2019

ping @azagrebin

Copy link
Contributor

@azagrebin azagrebin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aitozi
Sorry for the late answer. The changes look good to me, I left couple of really small comments.
I was trying to reach @StefanRRichter, just to check whether there was some other reason to implement this option like it was before. I would prefer to give him some time to comment on that.
If there is no objection from his side after some time, we will merge the PR.

@Aitozi
Copy link
Contributor Author

Aitozi commented Sep 26, 2019

OK, I have addressed your comments @azagrebin .

@Aitozi
Copy link
Contributor Author

Aitozi commented Nov 30, 2019

ping @azagrebin

Copy link
Member

@carp84 carp84 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, +1.

@azagrebin Is there any response from Stefan? IMHO it would be good if we could include this in 1.10.0, wdyt? Thanks.

@Aitozi Aitozi force-pushed the FLINK-11193 branch 2 times, most recently from 784e65d to e102192 Compare December 8, 2019 09:36
@Aitozi
Copy link
Contributor Author

Aitozi commented Dec 8, 2019

Thanks @carp84 for help review this PR , and I have rebased the master solved the conflicts.

@azagrebin
Copy link
Contributor

@Aitozi @carp84
I rebased the PR on the current master and introduced a setter for the queue to use in application code instead of configure. This is more aligned with the other options. I also modified the test accordingly.
Could you have a look? If there are no objections, I will merge this into master, 1.10 and maybe 1.9 depending on the effort.

@azagrebin azagrebin closed this in 3cf8fe5 Mar 4, 2020
azagrebin pushed a commit that referenced this pull request Mar 4, 2020
azagrebin pushed a commit that referenced this pull request Mar 4, 2020
azagrebin pushed a commit that referenced this pull request Mar 4, 2020
@carp84
Copy link
Member

carp84 commented Mar 4, 2020

@Aitozi @carp84
I rebased the PR on the current master and introduced a setter for the queue to use in application code instead of configure. This is more aligned with the other options. I also modified the test accordingly.
Could you have a look? If there are no objections, I will merge this into master, 1.10 and maybe 1.9 depending on the effort.

+1 (belated) on introducing a setter instead of configure. Sorry for the late response, just noticed...

I'd like to see this bug fix also go into release-1.10, please let me know if you need a hand on the backport @azagrebin . Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants