[FLINK-4150] [runtime] Don't clean up BlobStore on BlobServer shut down #2256

uce · 2016-07-15T13:28:23Z

The BlobServer acts as a local cache for uploaded BLOBs. The life-cycle of each BLOB is bound to the life-cycle of the BlobServer. If the BlobServer shuts down (on JobManager shut down), all local files will be removed.

With HA, BLOBs are persisted to another file system (e.g. HDFS) via the BlobStore in order to have BLOBs available after a JobManager failure (or shut down). These BLOBs are only allowed to be removed when the job that requires them enters a globally terminal state (FINISHED, CANCELED, FAILED).

This commit removes the BlobStore clean up call from the BlobServer shutdown. The BlobStore files will only be cleaned up via the BlobLibraryCacheManager's clean up task (periodically or on BlobLibraryCacheManager shutdown). This means that there is a chance that BLOBs will linger around after the job has terminated, if the job manager fails before the clean up.

tillrohrmann · 2016-07-15T14:14:51Z

Just a quick question. Do we also want to remove failed jobs from the BlobStore and ZK? Or only finished or cancelled jobs?

uce · 2016-07-15T16:10:03Z

I don't know if we "want to", but it is the current behaviour. A job should only fail if its restart strategy is exhausted though. Do you think we should change that behaviour?

tillrohrmann · 2016-07-18T15:06:42Z

In general, I think it would be helpful for users to be able to retrieve checkpoints of a failed job. I could imagine a scenario where a job is faulty but one only runs into the problem after some time. Being able to then transform a checkpoint into a savepoint and restart the failed job with a corrected jar could be helpful.

Thus, I think we should only remove the persisted job data if the job has reached FINISHED or CANCELED. Admittedly, this is a very conservative approach, but then users are less likely to lose data.

However, this should be out of the scope of this PR.
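
To make the difference between the two policies concrete, here is a minimal sketch. The `JobState` enum, the two policy sets, and `mayRemovePersistedData` are hypothetical names used only for illustration; they are not Flink's actual classes or the code in this PR.

```java
import java.util.EnumSet;
import java.util.Set;

public class RetentionPolicySketch {

    // Hypothetical stand-in for the job states named in this thread.
    enum JobState { RUNNING, FINISHED, CANCELED, FAILED }

    // Behaviour described as current: every globally terminal state allows removal.
    static final Set<JobState> CURRENT =
            EnumSet.of(JobState.FINISHED, JobState.CANCELED, JobState.FAILED);

    // Conservative variant proposed above: data of FAILED jobs is kept.
    static final Set<JobState> CONSERVATIVE =
            EnumSet.of(JobState.FINISHED, JobState.CANCELED);

    // Returns true if the given policy permits removing persisted job data in this state.
    static boolean mayRemovePersistedData(Set<JobState> policy, JobState state) {
        return policy.contains(state);
    }

    public static void main(String[] args) {
        for (JobState state : JobState.values()) {
            System.out.println(state
                    + ": current=" + mayRemovePersistedData(CURRENT, state)
                    + ", conservative=" + mayRemovePersistedData(CONSERVATIVE, state));
        }
    }
}
```

The only state on which the two policies disagree is `FAILED`, which is exactly the case where a user might still want to recover a checkpoint.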

tillrohrmann · 2016-07-20T11:42:41Z

...e/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java

@@ -77,7 +77,7 @@ public BlobLibraryCacheManager(BlobService blobService, long cleanupInterval) {

 		// Initializing the clean up task
 		this.cleanupTimer = new Timer(true);
-		this.cleanupTimer.schedule(this, cleanupInterval);
+		this.cleanupTimer.schedule(this, cleanupInterval, cleanupInterval);


Good catch 👍
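
For readers unfamiliar with `java.util.Timer`: the two-argument `schedule(task, delay)` runs the task once, while the three-argument `schedule(task, delay, period)` repeats it. The following standalone sketch shows the difference; the class name, the `countRuns` helper, and the intervals are made up for the demonstration and are not part of the PR.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

public class TimerScheduleSketch {

    // Counts how often a task runs within waitMillis when scheduled once or periodically.
    static int countRuns(boolean periodic, long intervalMillis, long waitMillis)
            throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        Timer timer = new Timer(true); // daemon timer thread, as in the diff
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                runs.incrementAndGet();
            }
        };
        if (periodic) {
            // New call: first run after the interval, then repeated every interval.
            timer.schedule(task, intervalMillis, intervalMillis);
        } else {
            // Old call: a single run after the interval, never repeated.
            timer.schedule(task, intervalMillis);
        }
        Thread.sleep(waitMillis);
        timer.cancel();
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("one-shot runs: " + countRuns(false, 50, 400));
        System.out.println("periodic runs: " + countRuns(true, 50, 400));
    }
}
```

With the old call the clean up task would have fired a single time after `cleanupInterval` and then never again, which matters more now that the periodic task is the main path for BlobStore clean up.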

tillrohrmann · 2016-07-20T11:59:13Z

Changes look good to me :-). Really good work @uce. I'm just wondering whether we could remove empty folders upon shutdown of the BlobStore. Apart from that, +1 for merging.

uce · 2016-07-21T14:15:08Z

Thank you for your review. I've addressed your comment and now parent directories are deleted if empty, resulting in an empty storage folder after regular cleanup. If there are no objections, I would like to merge this later today.
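
As a rough illustration of the "delete parent directories if empty" behaviour, here is a minimal sketch on the local file system using `java.nio.file`. It is an assumption-laden stand-in: the actual BlobStore targets a distributed file system such as HDFS, and the names `deleteWithEmptyParents` and `storageRoot` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmptyParentCleanupSketch {

    // Returns true if the directory has no entries.
    static boolean isEmpty(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            return !entries.iterator().hasNext();
        }
    }

    // Deletes the file, then removes each parent directory that became empty.
    // The walk stops at storageRoot, which is kept even when it ends up empty.
    static void deleteWithEmptyParents(Path file, Path storageRoot) throws IOException {
        Files.deleteIfExists(file);
        Path dir = file.getParent();
        while (dir != null && dir.startsWith(storageRoot) && !dir.equals(storageRoot)) {
            if (!isEmpty(dir)) {
                return; // another file still lives here, so stop climbing
            }
            Files.delete(dir);
            dir = dir.getParent();
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("blob-store-sketch");
        Path jobDir = Files.createDirectories(root.resolve("job").resolve("blobs"));
        Path first = Files.createFile(jobDir.resolve("blob-1"));
        Path second = Files.createFile(jobDir.resolve("blob-2"));

        deleteWithEmptyParents(first, root);
        System.out.println("job dir kept while a BLOB remains: " + Files.exists(jobDir));

        deleteWithEmptyParents(second, root);
        System.out.println("storage root empty: " + isEmpty(root));
        Files.delete(root);
    }
}
```

Stopping at the storage root matches the description above: after regular cleanup the storage folder itself remains, but it is empty.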

tillrohrmann reviewed Jul 20, 2016
uce added 4 commits July 22, 2016 12:21

[pr-comments] Delete parent directories if possible on BlobStore#delete

2fbb095

Make archive actor name unique

dde8d0d

Fix wait for running race

c8d63b7

uce force-pushed the 4150-blobstore branch from 02879e8 to c8d63b7 Compare July 22, 2016 10:22

uce added 3 commits July 22, 2016 14:16

Fix rebase conflict

e723992

Fix test failures

5dcf493

Yet another test issue...

f2d876b

asfgit closed this in 3213016 Jul 25, 2016

uce deleted the 4150-blobstore branch August 2, 2016 09:16

rmetzger added the component=Runtime/Coordination label Mar 14, 2019
