Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-4150] [runtime] Don't clean up BlobStore on BlobServer shut down #2256

Closed
wants to merge 7 commits into from

Conversation

uce
Copy link
Contributor

@uce uce commented Jul 15, 2016

The BlobServer acts as a local cache for uploaded BLOBs. The life-cycle of each BLOB is bound to the life-cycle of the BlobServer. If the BlobServer shuts down (on JobManager shut down), all local files will be removed.

With HA, BLOBs are persisted to another file system (e.g. HDFS) via the BlobStore in order to have BLOBs available after a JobManager failure (or shut down). These BLOBs are only allowed to be removed when the job that requires them enters a globally terminal state (FINISHED, CANCELLED, FAILED).

This commit removes the BlobStore clean up call from the BlobServer shutdown. The BlobStore files will only be cleaned up via the BlobLibraryCacheManager's' clean up task (periodically or on BlobLibraryCacheManager shutdown). This means that there is a chance that BLOBs will linger around after the job has terminated, if the job manager fails before the clean up.

@tillrohrmann
Copy link
Contributor

Just a quick question. Do we want to remove also failed jobs from the BlobStore and ZK? Or only finished or cancelled jobs?

@uce
Copy link
Contributor Author

uce commented Jul 15, 2016

I don't know if we "want to", but it is the current behaviour. A job should only fail if its restart strategy is exhausted though. Do you think we should change that behaviour?

@tillrohrmann
Copy link
Contributor

In general, I think it would be helpful for users to be able to retrieve checkpoints of a failed job. I could imagine a scenario where a job is faulty but one only runs into after some time. Being then able to transform a checkpoint into a savepoint and then restarting the failed job with a corrected jar could be helpful.

Thus, I think we should only remove the persisted job data if the job has reached FINISHED or CANCELED. Admittedly, this is a very conservative approach, but then users are less likely to lose data.

However, this should be out of the scope of this PR.

@@ -77,7 +77,7 @@ public BlobLibraryCacheManager(BlobService blobService, long cleanupInterval) {

// Initializing the clean up task
this.cleanupTimer = new Timer(true);
this.cleanupTimer.schedule(this, cleanupInterval);
this.cleanupTimer.schedule(this, cleanupInterval, cleanupInterval);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch 👍

@tillrohrmann
Copy link
Contributor

Changes look good to me :-). Really good work @uce. I'm just wondering whether we could remove empty folders upon shutdown of the BlobStore. Apart from that, +1 for merging.

@uce
Copy link
Contributor Author

uce commented Jul 21, 2016

Thank you for your review. I've addressed your comment and now parent directories are deleted if empty, resulting in an empty storage folder after regular cleanup. If there are no objections, I would like to merge this later today.

uce added 4 commits July 22, 2016 12:21
The `BlobServer` acts as a local cache for uploaded BLOBs. The life-cycle of
each BLOB is bound to the life-cycle of the `BlobServer`. If the BlobServer
shuts down (on JobManager shut down), all local files will be removed.

With HA, BLOBs are persisted to another file system (e.g. HDFS) via the
`BlobStore` in order to have BLOBs available after a JobManager failure (or
shut down). These BLOBs are only allowed to be removed when the job that
requires them enters a globally terminal state (`FINISHED`, `CANCELLED`,
`FAILED`).

This commit removes the `BlobStore` clean up call from the `BlobServer`
shutdown. The `BlobStore` files will only be cleaned up via the
`BlobLibraryCacheManager`'s' clean up task (periodically or on
BlobLibraryCacheManager shutdown). This means that there is a chance that
BLOBs will linger around after the job has terminated, if the job manager
fails before the clean up.
@asfgit asfgit closed this in 3213016 Jul 25, 2016
@uce uce deleted the 4150-blobstore branch August 2, 2016 09:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants