Skip to content

Conversation

NicoK
Copy link
Contributor

@NicoK NicoK commented Jul 21, 2017

What is the purpose of the change

Transient BLOB files are currently only deleted manually and may thus linger around if not cleaned up. We should introduce a default cleanup time, i.e. a time-to-live (TTL), as a backup cleanup path.

Brief change log

  • Any TransientBlobService, i.e. BlobServer and TransientBlobCache, gets an additional member keeping track of all expiry times of transient BLOBs.
  • Whenever a transient BLOB enters the local store of a TransientBlobService, its TTL will be set to currentTime + cleanupInterval.
  • A cleanup task (TransientBlobCleanupTask) is executed every cleanupInterval seconds and deletes all expired transient BLOBs.
  • As a result, transient BLOBs stay at most 2 * cleanupInterval seconds before getting deleted automatically.

Please note that this PR is based on #4359 in a series to implement FLIP-19.

Verifying this change

This change adds tests and can be verified as follows:

  • a new test verifies (for each BlobServer and TransientBlobCache with and without a job ID) that after adding two transient BLOBs with further accesses only to one of them, that the accessed one remains and the other is deleted. For the existence check, however, there is a slight race condition: if the test is slow and delayed by 1s, the cleanup may kick in and the test may fail. This is hard to test otherwise without further intrusion into the blob store classes.
  • FYI: since transient BLOBs are not used yet, this cannot be tested with a complete flink distribution yet.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes):

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

@NicoK
Copy link
Contributor Author

NicoK commented Oct 5, 2017

rebased onto the latest #4359

Nico Kruber added 2 commits October 6, 2017 11:34
This should guard us from uploading (and deleting) the same file more than
once and also from hash collisions.
Transient BLOB files should not exist for long and are only deleted manually
from the caches and after access on the server. This uses the BLOB storage's
cleanup interval to set a TTL on all transient BLOB files as a backup cleanup
path. The cleanup task itself runs every cleanupInterval seconds and removes all
transient BLOBs for which the TTL is older than the current time. This way, a
transient BLOB stays at most 2*cleanupInterval seconds before getting deleted
automatically.
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. Thanks for your contribution @NicoK. I'll throw again on Travis and if Travis gives green light, then I'll merge it.

tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Oct 20, 2017
Transient BLOB files should not exist for long and are only deleted manually
from the caches and after access on the server. This uses the BLOB storage's
cleanup interval to set a TTL on all transient BLOB files as a backup cleanup
path. The cleanup task itself runs every cleanupInterval seconds and removes all
transient BLOBs for which the TTL is older than the current time. This way, a
transient BLOB stays at most 2*cleanupInterval seconds before getting deleted
automatically.

This closes apache#4381.
@asfgit asfgit closed this in a1d4831 Oct 21, 2017
tillrohrmann pushed a commit to tillrohrmann/flink that referenced this pull request Oct 23, 2017
Transient BLOB files should not exist for long and are only deleted manually
from the caches and after access on the server. This uses the BLOB storage's
cleanup interval to set a TTL on all transient BLOB files as a backup cleanup
path. The cleanup task itself runs every cleanupInterval seconds and removes all
transient BLOBs for which the TTL is older than the current time. This way, a
transient BLOB stays at most 2*cleanupInterval seconds before getting deleted
automatically.

This closes apache#4381.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants