METRON-1829: Large Error Message Causes Slow Search Performance#1239
METRON-1829: Large Error Message Causes Slow Search Performance#1239merrimanr wants to merge 4 commits intoapache:masterfrom
Conversation
| .withThrowable(e); | ||
| tuples.forEach(t -> error.addRawMessage(messageGetStrategy.get(t))); | ||
| handleError(tuples, error); | ||
| tuples.forEach(t -> { |
There was a problem hiding this comment.
Seems like this will ack the same tuples repetitively. If 500 messages in a batch fail, then we will ack all 500 of them, 500 times.
There is also the nuisance that we report the error to the collector for each and every failed message, instead of just once for the batch. There is only one Throwable error to report, so we should just report it once.
We may need something like this.
| tuples.forEach(t -> { | |
| // emit one error for each failed message | |
| tuples.forEach(t -> { | |
| MetronError error = new MetronError() | |
| .withSensorType(Collections.singleton(sensorType)) | |
| .withErrorType(Constants.ErrorType.INDEXING_ERROR) | |
| .withThrowable(e) | |
| .addRawMessage(messageGetStrategy.get(t)); | |
| collector.emit(Constants.ERROR_STREAM, new Values(error.getJSONObject())); | |
| collector.ack(t); | |
| }); | |
| // there is only one error to report for all of the failed tuples | |
| collector.reportError(e); | |
| } |
There was a problem hiding this comment.
You're right! My mistake. I changed it to pass in just the current tuple to handleError. Let me know if that works.
There was a problem hiding this comment.
It might also be useful to discuss on this PR, if the other similar method error(Throwable, Iterable<Tuple>) needs updated. That method gets called if messages are unparsable. And in that case we send only a single error, instead of one for each error'd message. Is that what we want?
It might be what we want, but at the very least it would be useful to comment in there as to why we treat it differently, if indeed it should be different. Could be something like this.
public void error(Throwable e, Iterable<Tuple> tuples) {
LOG.error(format("Failing %d tuple(s) due to invalid message; sensorType unknown", Iterables.size(tuples)), e);
// emit only a single error message, even though many failed because ...
MetronError error = new MetronError()
.withErrorType(Constants.ErrorType.INDEXING_ERROR)
.withThrowable(e);
collector.emit(Constants.ERROR_STREAM, new Values(error.getJSONObject()));
tuples.forEach(t -> collector.ack(t));
// there is only one error to report for all of the failed tuples
collector.reportError(e);
}
There was a problem hiding this comment.
You're right! My mistake. I changed it to pass in just the current tuple to handleError
Your latest commit was my first thought of a fix too; it is much simpler. But I don't think it solves the second problem that I mentioned.
There is also the nuisance that we report the error to the collector for each and every failed message, instead of just once for the batch. There is only one Throwable error to report, so we should just report it once.
I think reporting the error once is both a usability and performance improvement.
I had to unravel the code base a bit more in my suggested code snippet so that it only reports the error once.
There was a problem hiding this comment.
You're right again. I agree, we shouldn't be reporting the same exception for every message in the batch. I changed it to what you suggested in the initial comment and updated the tests.
For the other issue you bring up with error(Throwable, Iterable<Tuple>), it looks like it gets called from only one place: BulkMessageWriterBolt.handleMissingMessage. This method only deals with a single tuple so it is misleading that the error method accepts multiple tuples. Would changing that to accept a single tuple make it less confusing?
There was a problem hiding this comment.
This method only deals with a single tuple so it is misleading that the error method accepts multiple tuples. Would changing that to accept a single tuple make it less confusing?
Yes, agreed. That would help clarify things.
There was a problem hiding this comment.
Looks great. I'm just going to spin it up and do some manual testing now.
| MetronError expectedError1 = new MetronError() | ||
| .withSensorType(Collections.singleton(sensorType)) | ||
| .withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Arrays.asList(message1, message2)); | ||
| .withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Collections.singletonList(message1)); |
There was a problem hiding this comment.
Small nit. While we are in here, do you think it is easier to read with newlines?
| .withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Collections.singletonList(message1)); | |
| .withErrorType(Constants.ErrorType.INDEXING_ERROR) | |
| .withThrowable(e) | |
| .withRawMessages(Collections.singletonList(message1)); |
| .withSensorType(Collections.singleton(sensorType)) | ||
| .withErrorType(Constants.ErrorType.INDEXING_ERROR) | ||
| .withThrowable(e) | ||
| .addRawMessage(messageGetStrategy.get(t)); |
There was a problem hiding this comment.
While testing this, I am seeing a problem. I am trying to decide whether we should call that a pre-existing condition and fix that problem after this PR goes in or whether the change here makes things any worse. Let me know what you think.
I have described the problem in METRON-1832. If any index destination goes down, error messages are continually recycled and grow larger and larger. In addition, long sequence of escape characters are created.
There was a problem hiding this comment.
Thinking more about this some more... I feel that what I have described in METRON-1832 is a pre-existing condition. I don't think this PR makes that pre-existing condition any worse. This PR is a solid, necessary fix. I am in favor of merging this and I will open a discuss thread on a resolution for METRON-1832.
|
+1 Works as advertised. Thanks for the fix |
Contributor Comments
When a failure happens in the BulkWriterComponent, all pending messages in a batch are combined into a single error message. This results in a single large objects being written to ES and causes performance problems. This PR attempts to solve this issue by instead creating separate error messages for each message in the batch.
Testing
This has been tested and verified in full dev:
raw_messagefield. Subsequent documents in ES should also only contain a singleraw_messagefield. Before there would have been 5raw_message_*fields in a single document.Pull Request Checklist
Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:
For all changes:
For code changes:
Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
Have you written or updated unit tests and or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?
For documentation related changes:
Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via
site-book/target/site/index.html:Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.