[STORM-3583] Handle exceptions when AsyncLocalizer tries to get local resources#3226
Ethanlm merged 2 commits into apache:master
Referring to #3210
I already made this same change and had to revert it: Similar initial change: Side effect this caused:
| // Precondition2: Both these two blob files are fully downloaded and proper permission been set | ||
| localResources = getLocalResources(pna); | ||
| } catch (FileNotFoundException fnfException) { | ||
| LOG.warn("Local base blobs have not been downloaded yet. " |
Can we add the pna to this message and to the IOException error? What is the side effect of not releasing this if there's a race condition? Could the localizer then be updating this file continually?
Added pna info. As for the race condition, I think it could be addressed by separating the executor pools. I could not think of a scenario in which catching the exception would cause a race condition.
All downloading tasks are CompletableFuture objects. The file-not-found error happens when the task has not completed yet. So here we give up on reading it, dereference whatever resources we can, and release the slot for new assignments.
The background download threads will then succeed at some point. The cleanup thread will take care of these orphaned files even if they are not released properly.
| } catch (FileNotFoundException fnfException) { | ||
| LOG.warn("Local base blobs have not been downloaded yet. " | ||
| + "DownloadExecService is too busy", fnfException); | ||
| LOG.info("Port and assignment info: {}", pna); |
If we're proceeding with this change, I think we should mark a meter so we can add an alert and investigate that this is occurring and evaluate.
Good suggestion, added.
| this.blobCacheUpdateDuration = metricsRegistry.registerTimer("supervisor:blob-cache-update-duration"); | ||
| this.blobLocalizationDuration = metricsRegistry.registerTimer("supervisor:blob-localization-duration"); | ||
| this.numBlobUpdateVersionChanged = metricsRegistry.registerMeter("supervisor:num-blob-update-version-changed"); | ||
| this.localResourceFileNotFound = metricsRegistry.registerMeter("supervisdor:local-resource-file-not-found"); |
Spelling of supervisor.
Also, please update ClusterMetrics.md with this metric and any others in this file that are missing from it.
Thanks for catching that; updated.
… resources address comments add file not found meter fix typo and add documentation
@agresch Squashed
| for (LocalResource lr : getLocalResources(pna)) { | ||
| // ALERT: A possible race condition could be resolved by separating the thread pools into downloadExecService and taskExecService | ||
| // https://git.ouroath.com/storm/storm/commit/ebd52b37c7448d381d31451e46e8f19c6e51352d#diff-74535cb89e9e926ad424a8d1e2fa9586 |
This links to our internal corporate commit; it should be changed to reference the community JIRA instead.
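To illustrate the idea in the code comment above (separating the thread pools), here is a minimal sketch. It is not the AsyncLocalizer implementation: only the pool names `downloadExecService` and `taskExecService` come from the comment, while the class name, pool sizes, timeouts, and the blob future are hypothetical. It shows that a task waiting on a download cannot starve that download when the two run on different pools.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not code from Storm.
class SeparatePoolsSketch {
    static String run() {
        // Assumption: one thread per pool, to make any starvation obvious.
        ExecutorService downloadExecService = Executors.newFixedThreadPool(1);
        ExecutorService taskExecService = Executors.newFixedThreadPool(1);
        try {
            CompletableFuture<String> blob = new CompletableFuture<>();
            CountDownLatch taskStarted = new CountDownLatch(1);

            // The task occupies the only task thread while it waits for the blob.
            Future<String> task = taskExecService.submit(() -> {
                taskStarted.countDown();
                return "read " + blob.get(5, TimeUnit.SECONDS);
            });
            taskStarted.await();

            // The download runs on its own pool, so the waiting task cannot block it.
            downloadExecService.execute(() -> blob.complete("stormconf.ser"));
            return task.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            downloadExecService.shutdownNow();
            taskExecService.shutdownNow();
        }
    }
}
```

If both lambdas were submitted to a single one-thread pool instead, the download would sit in the queue behind the waiting task until the timeout expired.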
| } catch (FileNotFoundException fnfException) { | ||
| localResourceFileNotFound.mark(); | ||
| LOG.warn("Local base blobs have not been downloaded yet. " | ||
| + "DownloadExecService is too busy", fnfException); |
I would remove this line. "too busy" can be misleading. The files may not be fully downloaded yet simply because the download has only just started.
Makes sense. Removed.
| for (LocalResource lr : localResources) { | ||
| try { | ||
| removeBlobReference(lr.getBlobName(), pna, lr.shouldUncompress()); | ||
| } catch (Exception e) { |
This exception handling looks like it can be removed, but I am fine with leaving it there since it is not related to your change in this PR.
| return; | ||
| } catch (IOException ioException) { | ||
| LOG.error("Unable to read local conf file. ", ioException); | ||
| LOG.info("Port and assignment info: {}", pna); |
LOG.info("Port and assignment info: {}", pna); is repeated twice here. We can write it in the following way:
catch (IOException e) {
    // common1
    if (e instanceof FileNotFoundException) {
        // xx
    } else {
        // yy
    }
    // common2
}
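The suggestion above can be sketched as a small runnable example. This is an illustration under assumptions, not the PR's final code: the `ResourceReader` interface, the `release` method, and the list standing in for the logger are all hypothetical; only the log messages are taken from the diff. A single `catch (IOException e)` works because `FileNotFoundException` is a subclass of `IOException`.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not code from Storm.
class SingleCatchSketch {
    // Stand-in for the call that reads local resources and may fail.
    interface ResourceReader {
        void read() throws IOException;
    }

    // Returns the lines that would have been logged, so the flow is easy to inspect.
    static List<String> release(ResourceReader reader, String pna) {
        List<String> log = new ArrayList<>();
        try {
            reader.read();
            log.add("INFO Released slot for " + pna);
        } catch (IOException e) {
            if (e instanceof FileNotFoundException) {
                // Branch specific to blobs that are not downloaded yet.
                log.add("WARN Local base blobs have not been downloaded yet.");
            } else {
                // Branch for every other I/O failure.
                log.add("ERROR Unable to read local conf file.");
            }
            // Shared line, now written once instead of in two catch blocks.
            log.add("INFO Port and assignment info: " + pna);
        }
        return log;
    }
}
```

A caller could exercise the file-not-found branch with `SingleCatchSketch.release(() -> { throw new FileNotFoundException("stormconf.ser"); }, "pna")`.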
| this.blobCacheUpdateDuration = metricsRegistry.registerTimer("supervisor:blob-cache-update-duration"); | ||
| this.blobLocalizationDuration = metricsRegistry.registerTimer("supervisor:blob-localization-duration"); | ||
| this.numBlobUpdateVersionChanged = metricsRegistry.registerMeter("supervisor:num-blob-update-version-changed"); | ||
| this.localResourceFileNotFound = metricsRegistry.registerMeter("supervisor:local-resource-file-not-found"); |
This meter is specific to the local resource file not being found inside releaseSlotFor, so the description and the variable name should be more specific.
| for (LocalResource lr : getLocalResources(pna)) { | ||
| // ALERT: A possible race condition could be resolved by separating the thread pools into downloadExecService and taskExecService |
| // Precondition1: Base blob stormconf.ser and stormcode.ser have been localized | ||
| // Precondition2: Both these two blob files are fully downloaded and proper permission been set | ||
| localResources = getLocalResources(pna); | ||
| } catch (Exception e) { |
The Supervisor relies on AsyncLocalizer to download blobs from the blob store.
AsyncLocalizer uses the downloadService pool to process CompletableFuture objects in parallel.
We have noticed a case where, while a downloading task is still waiting for a thread to execute it, a new assignment change tries to release the slot by dereferencing all of the related local resources.
However, reading the local resources assumes that the two base blob downloading tasks have completed, which is not always true.
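The scenario described above can be reproduced in miniature. The sketch below is a simplified model, not AsyncLocalizer itself: the class name, the one-thread pool, the temporary directory, and the `tryRead` helper are hypothetical stand-ins. A download is queued behind a busy worker, so a read attempted in the meantime finds no file; once the worker is free and the download completes, the same read succeeds.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not code from Storm.
class BusyPoolRaceSketch {
    // Stand-in for reading a base blob while releasing a slot.
    static String tryRead(Path blob) {
        try {
            if (!Files.exists(blob)) {
                throw new FileNotFoundException(blob.toString());
            }
            return "read " + Files.readAllBytes(blob).length + " bytes";
        } catch (FileNotFoundException e) {
            return "file not found";
        } catch (IOException e) {
            return "io error";
        }
    }

    // Returns the read result before and after the queued download has run.
    static String[] run() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch busy = new CountDownLatch(1);
        try {
            Path dir = Files.createTempDirectory("localizer-sketch");
            Path blob = dir.resolve("stormconf.ser");

            // Occupy the only worker thread so later tasks must wait in the queue.
            pool.execute(() -> {
                try {
                    busy.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // The download is accepted but cannot start yet.
            CompletableFuture<Void> download = CompletableFuture.runAsync(() -> {
                try {
                    Files.write(blob, new byte[] {1, 2, 3});
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }, pool);

            String before = tryRead(blob);   // the release happens too early
            busy.countDown();                // the worker becomes free
            download.get(5, TimeUnit.SECONDS);
            String after = tryRead(blob);    // the blob is now on disk

            Files.deleteIfExists(blob);
            Files.deleteIfExists(dir);
            return new String[] {before, after};
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Under these assumptions the first read cannot succeed, because the single worker is blocked until the latch is released, which is why the release path has to tolerate a missing file.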