Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-25445] No need to create local recovery dirs when disabled loc… #18306

Merged
merged 1 commit into from Feb 1, 2022

Conversation

zuston
Copy link
Member

@zuston zuston commented Jan 9, 2022

…al-recovery in TaskExecutorLocalStateStoresManager

What is the purpose of the change

TaskExecutor will create file, check localRecoveryEnabled and load local state store for each task submission in method localStateStoreForSubtask. For batch jobs, the localRecoveryEnabled is always false, and there's no need to create these local files for task in TaskExecutor

Brief change log

When disable task local recovery, it will not create subtask state dirs.

Verifying this change

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Copy link
Collaborator

flinkbot commented Jan 9, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@flinkbot
Copy link
Collaborator

flinkbot commented Jan 9, 2022

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 7d972d4 (Sun Jan 09 05:35:41 UTC 2022)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!
  • This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@zuston
Copy link
Member Author

zuston commented Jan 11, 2022

Could you help review? @dmvk Thanks

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @zuston! This already looks pretty good ;) I went through the usages of the LocalRecoveryDirectoryProvider and the change seems safe.

My only concern is around proper handling of @Nullable providers in LocalRecoveryConfig, PTAL

@zuston
Copy link
Member Author

zuston commented Jan 11, 2022

Thanks for your review @dmvk

By the way, why we make task local-recovery disabled defaultly? Only due to the uncompatility with unaligned checkpoint?

@dmvk
Copy link
Member

dmvk commented Jan 11, 2022

The behavior of the local recovery highly depends on the state backend. AFAIK heap state backend still doesn't support it. For RocksDB you can find some context in https://issues.apache.org/jira/browse/FLINK-15507

@zuston
Copy link
Member Author

zuston commented Jan 11, 2022

Thank very much @dmvk . I’ll take a look

@zuston
Copy link
Member Author

zuston commented Jan 12, 2022

Updated @dmvk Could you help review again? Thanks

@zuston
Copy link
Member Author

zuston commented Jan 14, 2022

@dmvk Gentle ping. If you have time, could you help review it? At weekends, i will have more time to deal with it.
Thanks

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @zuston, I've added few more minor comments. I think this is mostly good to go.

My primary concern would be that we should get rid of the SavepointLocalRecoveryProvider from the state processor API, as it's no longer needed after this patch.

Thanks for working on this! 👍

Comment on lines 52 to 54
return localStateDirectories == null
? Optional.empty()
: Optional.of(localStateDirectories);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return localStateDirectories == null
? Optional.empty()
: Optional.of(localStateDirectories);
return Optional.ofNullable(localStateDirectories);

@@ -31,11 +35,11 @@
private final boolean localRecoveryEnabled;

/** Encapsulates the root directories and the subtask-specific path. */
@Nonnull private final LocalRecoveryDirectoryProvider localStateDirectories;
@Nullable private final LocalRecoveryDirectoryProvider localStateDirectories;

public LocalRecoveryConfig(
boolean localRecoveryEnabled,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to get rid of this parameter completely. This two parameters having the same meaning thing, just leads into some weird implementations.

For example we can now easily get rid of the SavepointLocalRecoveryProvider from the state processor API.

@@ -44,9 +48,10 @@ public boolean isLocalRecoveryEnabled() {
return localRecoveryEnabled;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return localRecoveryEnabled;
return localStateDirectories != null;

Comment on lines 66 to 57

/**
* An {@link NullOfDirProviderException} is thrown when the {@link
* LocalRecoveryDirectoryProvider} is null.
*/
@Internal
public static class NullOfDirProviderException extends RuntimeException {
public NullOfDirProviderException(String message, Throwable cause) {
super(message, cause);
}

public NullOfDirProviderException(String message) {
super(message);
}

public NullOfDirProviderException() {
this(
"When task local recovery is enabled, the NullPointerException of LocalRecoveryDirectoryProvider should not happen.");
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* An {@link NullOfDirProviderException} is thrown when the {@link
* LocalRecoveryDirectoryProvider} is null.
*/
@Internal
public static class NullOfDirProviderException extends RuntimeException {
public NullOfDirProviderException(String message, Throwable cause) {
super(message, cause);
}
public NullOfDirProviderException(String message) {
super(message);
}
public NullOfDirProviderException() {
this(
"When task local recovery is enabled, the NullPointerException of LocalRecoveryDirectoryProvider should not happen.");
}
}
public static Supplier<IllegalStateException> localRecoveryNotEnabled() {
return () ->
new IllegalStateException(
"Getting a LocalRecoveryDirectoryProvider is only supported with the local recovery enabled. This is a bug and should be reported.");
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@zuston
Copy link
Member Author

zuston commented Jan 17, 2022

Updated @dmvk. Could you help check again?
Thanks~

@zuston
Copy link
Member Author

zuston commented Jan 19, 2022

Gentle ping @dmvk . Thanks :)

…al-recovery in TaskExecutorLocalStateStoresManager
@zuston
Copy link
Member Author

zuston commented Jan 24, 2022

Rebase latest master and squash all commits. @dmvk If you have time, could you help review again? Thanks :)

@zuston
Copy link
Member Author

zuston commented Jan 25, 2022

@flinkbot run azure

@zuston
Copy link
Member Author

zuston commented Feb 1, 2022

@dmvk Gentle ping. Could u help review it ? Thanks

Copy link
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Thanks for the contribution

@XComp XComp merged commit 9f61e12 into apache:master Feb 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants