METRON-1829: Large Error Message Causes Slow Search Performance by merrimanr · Pull Request #1239 · apache/metron

merrimanr · 2018-10-16T18:38:37Z

Contributor Comments

When a failure happens in the BulkWriterComponent, all pending messages in a batch are combined into a single error message. This results in a single large objects being written to ES and causes performance problems. This PR attempts to solve this issue by instead creating separate error messages for each message in the batch.

Testing

This has been tested and verified in full dev:

Spin up full dev and verify data is being indexed into ES
Stop HDFS in Ambari. This will cause a failure in the BulkWriterComponent. By default the batch size for OOTB sensors is set to 5.
After HDFS has stopped, verify that documents are being written to an ES error index
Retrieve a single document from the ES error index. There should only be a single raw_message field. Subsequent documents in ES should also only contain a single raw_message field. Before there would have been 5 raw_message_* fields in a single document.
Verify error documents are displayed properly in Kibana.

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:

For all changes:

Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

Have you included steps to reproduce the behavior or problem that is being changed or addressed?
Have you included steps or a guide to how the change may be verified and tested manually?
Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
```
mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh 
```
Have you written or updated unit tests and or integration tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via site-book/target/site/index.html:
```
cd site-book
mvn site
```

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

nickwallen · 2018-10-18T16:07:56Z

metron-platform/metron-writer/src/main/java/org/apache/metron/writer/BulkWriterComponent.java

-            .withThrowable(e);
-    tuples.forEach(t -> error.addRawMessage(messageGetStrategy.get(t)));
-    handleError(tuples, error);
+    tuples.forEach(t -> {


Seems like this will ack the same tuples repetitively. If 500 messages in a batch fail, then we will ack all 500 of them, 500 times.

There is also the nuisance that we report the error to the collector for each and every failed message, instead of just once for the batch. There is only one Throwable error to report, so we should just report it once.

We may need something like this.

Suggested change

tuples.forEach(t -> {

// emit one error for each failed message

tuples.forEach(t -> {

MetronError error = new MetronError()

.withSensorType(Collections.singleton(sensorType))

.withErrorType(Constants.ErrorType.INDEXING_ERROR)

.withThrowable(e)

.addRawMessage(messageGetStrategy.get(t));

collector.emit(Constants.ERROR_STREAM, new Values(error.getJSONObject()));

collector.ack(t);

});

// there is only one error to report for all of the failed tuples

collector.reportError(e);

}

You're right! My mistake. I changed it to pass in just the current tuple to handleError. Let me know if that works.

It might also be useful to discuss on this PR, if the other similar method error(Throwable, Iterable<Tuple>) needs updated. That method gets called if messages are unparsable. And in that case we send only a single error, instead of one for each error'd message. Is that what we want?

It might be what we want, but at the very least it would be useful to comment in there as to why we treat it differently, if indeed it should be different. Could be something like this.

public void error(Throwable e, Iterable<Tuple> tuples) { LOG.error(format("Failing %d tuple(s) due to invalid message; sensorType unknown", Iterables.size(tuples)), e); // emit only a single error message, even though many failed because ... MetronError error = new MetronError() .withErrorType(Constants.ErrorType.INDEXING_ERROR) .withThrowable(e); collector.emit(Constants.ERROR_STREAM, new Values(error.getJSONObject())); tuples.forEach(t -> collector.ack(t)); // there is only one error to report for all of the failed tuples collector.reportError(e); }

You're right! My mistake. I changed it to pass in just the current tuple to handleError

Your latest commit was my first thought of a fix too; it is much simpler. But I don't think it solves the second problem that I mentioned.

There is also the nuisance that we report the error to the collector for each and every failed message, instead of just once for the batch. There is only one Throwable error to report, so we should just report it once.

I think reporting the error once is both a usability and performance improvement.

I had to unravel the code base a bit more in my suggested code snippet so that it only reports the error once.

You're right again. I agree, we shouldn't be reporting the same exception for every message in the batch. I changed it to what you suggested in the initial comment and updated the tests.

For the other issue you bring up with error(Throwable, Iterable<Tuple>), it looks like it gets called from only one place: BulkMessageWriterBolt.handleMissingMessage. This method only deals with a single tuple so it is misleading that the error method accepts multiple tuples. Would changing that to accept a single tuple make it less confusing?

This method only deals with a single tuple so it is misleading that the error method accepts multiple tuples. Would changing that to accept a single tuple make it less confusing?

Yes, agreed. That would help clarify things.

Looks great. I'm just going to spin it up and do some manual testing now.

nickwallen · 2018-10-18T16:12:26Z

...n-platform/metron-writer/src/test/java/org/apache/metron/writer/BulkWriterComponentTest.java

+    MetronError expectedError1 = new MetronError()
            .withSensorType(Collections.singleton(sensorType))
-            .withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Arrays.asList(message1, message2));
+            .withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Collections.singletonList(message1));


Small nit. While we are in here, do you think it is easier to read with newlines?

Suggested change

.withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Collections.singletonList(message1));

.withErrorType(Constants.ErrorType.INDEXING_ERROR)

.withThrowable(e)

.withRawMessages(Collections.singletonList(message1));

I agree. Done.

nickwallen · 2018-10-19T17:14:48Z

metron-platform/metron-writer/src/main/java/org/apache/metron/writer/BulkWriterComponent.java

+              .withSensorType(Collections.singleton(sensorType))
+              .withErrorType(Constants.ErrorType.INDEXING_ERROR)
+              .withThrowable(e)
+              .addRawMessage(messageGetStrategy.get(t));


While testing this, I am seeing a problem. I am trying to decide whether we should call that a pre-existing condition and fix that problem after this PR goes in or whether the change here makes things any worse. Let me know what you think.

I have described the problem in METRON-1832. If any index destination goes down, error messages are continually recycled and grow larger and larger. In addition, long sequence of escape characters are created.

Thinking more about this some more... I feel that what I have described in METRON-1832 is a pre-existing condition. I don't think this PR makes that pre-existing condition any worse. This PR is a solid, necessary fix. I am in favor of merging this and I will open a discuss thread on a resolution for METRON-1832.

nickwallen · 2018-10-19T17:35:12Z

+1 Works as advertised. Thanks for the fix

…manr) closes apache#1239

initial commit

d674d9a

nickwallen reviewed Oct 18, 2018

View reviewed changes

merrimanr added 3 commits October 18, 2018 11:28

pr feedback

4222d3b

only report error once

d472a33

changed error method to accept a single tuple

fd5d859

nickwallen reviewed Oct 19, 2018

View reviewed changes

asfgit closed this in d44a392 Oct 22, 2018

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

METRON-1829 Large Error Message Causes Slow Search Performance (merri…

69ef665

…manr) closes apache#1239

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

METRON-1829 Large Error Message Causes Slow Search Performance (merri…

a56b263

…manr) closes apache#1239

justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018

METRON-1829 Large Error Message Causes Slow Search Performance (merri…

87b9a8c

…manr) closes apache#1239

snyk-bot mentioned this pull request Jul 17, 2020

[Snyk] Security upgrade ajv from 6.5.3 to 6.12.3 sardell/metron#105

Open

Conversation

merrimanr commented Oct 16, 2018

Contributor Comments

Testing

Pull Request Checklist

For all changes:

For code changes:

For documentation related changes:

Note:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nickwallen commented Oct 19, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants