Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-11107] Avoid memory stateBackend to create arbitrary folders under HA path when no checkpoint path configured #7281

Closed

Conversation

Myasuka
Copy link
Member

@Myasuka Myasuka commented Dec 11, 2018

What is the purpose of the change

  • This pull request avoids memory state-backend to create arbitrary persist storage directory under HA folder when no checkpoint path has ever been configured. If not so, each memory state-backend within a job would create its directory with UUID name, which would HA storage directory filled with many empty directories, resulting in new jobs cannot create any directory in the worst cases.

Brief change log

  • Remove the code which would let memory state-backend to create arbitrary persist storage directory under HA directory when no checkpoint path has ever been configured.

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests that validate no arbitrary directory would create under HA storage directory when no checkpoint path has ever been configured.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@Myasuka
Copy link
Member Author

Myasuka commented Dec 11, 2018

@StephanEwen , since you have left annotations below:

to keep supporting the old behavior where default (JobManager) Backend + HA mode = checkpoints in HA store
we add the HA persistence dir as the checkpoint directory if none other is set

However, I'm wondering whether this keeps the same behavior as before. For Flink-1.3, (JobManager) Backend + HA mode = only create completedCheckpoint file under HA folder. On the other side, for Flink-1.6, this would create another job-id/chk-x/_metadata except the completedCheckpoint file. Please correct me if I am wrong.

@Myasuka
Copy link
Member Author

Myasuka commented Dec 12, 2018

BTW, If we do not create UUID directories for memory state-backend in this situation, job could still restore from high-availability storage. The only difference is the information of Latest Restore under Checkpoints tab of web UI would show the path is <checkpoint-not-externally-addressable>.

@klion26
Copy link
Member

klion26 commented Dec 13, 2018

@Myasuka thank you for your contribution. the problem you want to solve looks the same as FLINK-10346

@Myasuka
Copy link
Member Author

Myasuka commented Dec 14, 2018

@klion26 Hmm, I think it's not the same thing, even you could clean up checkpoint directories after job finished/failed in time, you cannot avoid running jobs to create so many directories (one operator with one memory state-backend, and one state-backend would create an unique directory). If several really big jobs are running, which created too many sub-directories under HA storage directory, new submitted job might also cannot create those new sub-directories.

@Myasuka
Copy link
Member Author

Myasuka commented Mar 1, 2019

Since Flink-1.8 is about to release, @StephanEwen @StefanRRichter could anyone take a look at this problem?
I submitted the same job with the same configuration (no checkpoint path but HA configured) with released Flink-1.3.2 (still has no such MemoryStateBackend creating random checkpoint path code, which should be treated as old behavior) and Flink-1.7.2 (already contained that part of code.)

As you can see Flink-1.3.2 would have a blob service folder, a completed checkpoint file and a submitted job graph file. I think this is the old behavior.
20190301113237

However, Flink-1.7.2 would have many checkpoint paths created by MemoryStateBackend from task-side, as you could guess, 41a7c8b8e62d81225868d2a5a60846f7 is the actual job-id of this job. These created checkpoint path should actually be useless, and might lead to MaxDirectoryItemsExceededException under high availability folder.
20190301113317
Moreover, as you can see, I don't think this would keep supporting the old behavior due to the great directory structure difference.

@tzulitai tzulitai self-requested a review May 30, 2019 05:13
@Myasuka Myasuka changed the title [FLINK-11107][state] Avoid memory stateBackend to create arbitrary folders under HA path when no checkpoint path configured [FLINK-11107] Avoid memory stateBackend to create arbitrary folders under HA path when no checkpoint path configured May 31, 2019
@Myasuka Myasuka force-pushed the mem-backend-checkpoint-FLINK-11107 branch 2 times, most recently from 0131937 to d4cdf3e Compare May 31, 2019 08:07
tzulitai pushed a commit to tzulitai/flink that referenced this pull request Jun 3, 2019
…lders under HA path when no checkpoint path configured

This closes apache#7281.
tzulitai pushed a commit to tzulitai/flink that referenced this pull request Jun 3, 2019
…lders under HA path when no checkpoint path configured

This closes apache#7281.
@sunjincheng121
Copy link
Member

I look at the error of CI, find the following ERROR:
SqlExecutableStatement.java:[21,39] package org.apache.flink.sql.parser.ddl does not exist
I think it's better to rebase the code and let the CI passed. @Myasuka

@Myasuka Myasuka force-pushed the mem-backend-checkpoint-FLINK-11107 branch from d4cdf3e to 5256e77 Compare June 5, 2019 18:14
@Myasuka
Copy link
Member Author

Myasuka commented Jun 5, 2019

@sunjincheng121 Thanks for your reminder, I have already updated the code base.

@sunjincheng121
Copy link
Member

@Myasuka Thanks for the update! @tzulitai told me he will review this later. :)

@tzulitai
Copy link
Contributor

cc @aljoscha

@aljoscha
Copy link
Contributor

I think it's good to now remove this (supposed) legacy behaviour. Even more so when it causes problems with arbitrary directories being created on TMs. 👍

@tzulitai
Copy link
Contributor

thanks for the review @aljoscha and @Myasuka for the contribution.

Merging this for 1.9.0 and 1.8.1 ..

tzulitai pushed a commit to tzulitai/flink that referenced this pull request Jun 13, 2019
…olders under HA path when no checkpoint path configured

This closes apache#7281.
tzulitai pushed a commit to tzulitai/flink that referenced this pull request Jun 13, 2019
…olders under HA path when no checkpoint path configured

This closes apache#7281.
@asfgit asfgit closed this in c2316f6 Jun 13, 2019
sjwiesman pushed a commit to sjwiesman/flink that referenced this pull request Jun 26, 2019
…olders under HA path when no checkpoint path configured

This closes apache#7281.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants