Skip to content

[STORM-3583] Handle exceptions when AsyncLocalizer tries to get local resources#3226

Merged
Ethanlm merged 2 commits intoapache:masterfrom
RuiLi8080:STORM-3583
Mar 23, 2020
Merged

[STORM-3583] Handle exceptions when AsyncLocalizer tries to get local resources#3226
Ethanlm merged 2 commits intoapache:masterfrom
RuiLi8080:STORM-3583

Conversation

@RuiLi8080
Copy link
Contributor

@RuiLi8080 RuiLi8080 commented Mar 12, 2020

Supervisor relies on AsyncLocalizer to download blobs from blob store.
AsyncLocalizer uses downloadService pool to process CompletableFuture objects in parallel.

We have noticed a case that while the downloading task is waiting for a thread to execute, new assignment changes will try to release the slot by dereferences all of the related local resources.

However, reading local resources assumes two base blob downloading task have been completed which is not always true.

@RuiLi8080
Copy link
Contributor Author

Referring to #3210

@agresch
Copy link
Contributor

agresch commented Mar 12, 2020

// Precondition2: Both these two blob files are fully downloaded and proper permission been set
localResources = getLocalResources(pna);
} catch (FileNotFoundException fnfException) {
LOG.warn("Local base blobs have not been downloaded yet. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add the pna to this message and the IOException error? What is the side effect of not releasing this if there's a race condition? Could the localizer then be updating this file continually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added pna info. As for the race condition, I think that could be addressed by separating executor pools. I did not think of a scenario that race condition will happen because of catching the exception.

All downloading tasks are CompletableFuture object. File no found happens when the object is not completed. So here we will forget about it, and dereference any resources we can, and release the slot for new assignments.

Then the background download threads will succeeds at some points. The cleanup thread will take care of these orphan files even if they are not released properly.

} catch (FileNotFoundException fnfException) {
LOG.warn("Local base blobs have not been downloaded yet. "
+ "DownloadExecService is too busy", fnfException);
LOG.info("Port and assignment info: {}", pna);
Copy link
Contributor

@agresch agresch Mar 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're proceeding with this change, I think we should mark a meter so we can add an alert and investigate that this is occurring and evaluate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, added.

this.blobCacheUpdateDuration = metricsRegistry.registerTimer("supervisor:blob-cache-update-duration");
this.blobLocalizationDuration = metricsRegistry.registerTimer("supervisor:blob-localization-duration");
this.numBlobUpdateVersionChanged = metricsRegistry.registerMeter("supervisor:num-blob-update-version-changed");
this.localResourceFileNotFound = metricsRegistry.registerMeter("supervisdor:local-resource-file-not-found");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling of supervisor.

Also please update ClusterMetrics.md with this metric and any other missing that are in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching, updated

Copy link
Contributor

@agresch agresch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please squash

… resources

address comments

add file not found meter

fix typo and add documentation
@RuiLi8080
Copy link
Contributor Author

@agresch Squashed

for (LocalResource lr : getLocalResources(pna)) {

// ALERT: A possible race condition could be resolved by separating the thread pools into downloadExecService and taskExecService
// https://git.ouroath.com/storm/storm/commit/ebd52b37c7448d381d31451e46e8f19c6e51352d#diff-74535cb89e9e926ad424a8d1e2fa9586
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is our corp internal commit and should be changed to community JIRA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

} catch (FileNotFoundException fnfException) {
localResourceFileNotFound.mark();
LOG.warn("Local base blobs have not been downloaded yet. "
+ "DownloadExecService is too busy", fnfException);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this line. "too busy" can be misleading. The files not yet being fully downloaded could because it is just started to download.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Removed

for (LocalResource lr : localResources) {
try {
removeBlobReference(lr.getBlobName(), pna, lr.shouldUncompress());
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These exception handlings seems can be removed. But I am fine with leaving it there since it's not related to your change in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

return;
} catch (IOException ioException) {
LOG.error("Unable to read local conf file. ", ioException);
LOG.info("Port and assignment info: {}", pna);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOG.info("Port and assignment info: {}", pna); are repeated twice here. We can write it in the following way:

catch (IOException e) {
            //common1
            if (e instanceof FileNotFoundException) {
                //xx
            } else {
                //yy
            }
       // common2
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

this.blobCacheUpdateDuration = metricsRegistry.registerTimer("supervisor:blob-cache-update-duration");
this.blobLocalizationDuration = metricsRegistry.registerTimer("supervisor:blob-localization-duration");
this.numBlobUpdateVersionChanged = metricsRegistry.registerMeter("supervisor:num-blob-update-version-changed");
this.localResourceFileNotFound = metricsRegistry.registerMeter("supervisor:local-resource-file-not-found");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very specific to local resource file not found inside releaseSlotFor. So the description and the variable should be more specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


for (LocalResource lr : getLocalResources(pna)) {

// ALERT: A possible race condition could be resolved by separating the thread pools into downloadExecService and taskExecService
Copy link
Contributor

@Ethanlm Ethanlm Mar 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should have been

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

// Precondition1: Base blob stormconf.ser and stormcode.ser have been localized
// Precondition2: Both these two blob files are fully downloaded and proper permission been set
localResources = getLocalResources(pna);
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be IOException

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

@Ethanlm Ethanlm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@Ethanlm Ethanlm merged commit ec217df into apache:master Mar 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants