Skip to content
This repository was archived by the owner on Aug 20, 2025. It is now read-only.

METRON-1829: Large Error Message Causes Slow Search Performance#1239

Closed
merrimanr wants to merge 4 commits intoapache:masterfrom
merrimanr:METRON-1829
Closed

METRON-1829: Large Error Message Causes Slow Search Performance#1239
merrimanr wants to merge 4 commits intoapache:masterfrom
merrimanr:METRON-1829

Conversation

@merrimanr
Copy link
Copy Markdown
Contributor

Contributor Comments

When a failure happens in the BulkWriterComponent, all pending messages in a batch are combined into a single error message. This results in a single large objects being written to ES and causes performance problems. This PR attempts to solve this issue by instead creating separate error messages for each message in the batch.

Testing

This has been tested and verified in full dev:

  1. Spin up full dev and verify data is being indexed into ES
  2. Stop HDFS in Ambari. This will cause a failure in the BulkWriterComponent. By default the batch size for OOTB sensors is set to 5.
  3. After HDFS has stopped, verify that documents are being written to an ES error index
  4. Retrieve a single document from the ES error index. There should only be a single raw_message field. Subsequent documents in ES should also only contain a single raw_message field. Before there would have been 5 raw_message_* fields in a single document.
  5. Verify error documents are displayed properly in Kibana.

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:

For all changes:

  • Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:

    mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh 
    
  • Have you written or updated unit tests and or integration tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

.withThrowable(e);
tuples.forEach(t -> error.addRawMessage(messageGetStrategy.get(t)));
handleError(tuples, error);
tuples.forEach(t -> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this will ack the same tuples repetitively. If 500 messages in a batch fail, then we will ack all 500 of them, 500 times.

There is also the nuisance that we report the error to the collector for each and every failed message, instead of just once for the batch. There is only one Throwable error to report, so we should just report it once.

We may need something like this.

Suggested change
tuples.forEach(t -> {
// emit one error for each failed message
tuples.forEach(t -> {
MetronError error = new MetronError()
.withSensorType(Collections.singleton(sensorType))
.withErrorType(Constants.ErrorType.INDEXING_ERROR)
.withThrowable(e)
.addRawMessage(messageGetStrategy.get(t));
collector.emit(Constants.ERROR_STREAM, new Values(error.getJSONObject()));
collector.ack(t);
});
// there is only one error to report for all of the failed tuples
collector.reportError(e);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! My mistake. I changed it to pass in just the current tuple to handleError. Let me know if that works.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be useful to discuss on this PR, if the other similar method error(Throwable, Iterable<Tuple>) needs updated. That method gets called if messages are unparsable. And in that case we send only a single error, instead of one for each error'd message. Is that what we want?

It might be what we want, but at the very least it would be useful to comment in there as to why we treat it differently, if indeed it should be different. Could be something like this.

  public void error(Throwable e, Iterable<Tuple> tuples) {
    LOG.error(format("Failing %d tuple(s) due to invalid message; sensorType unknown", Iterables.size(tuples)), e);

    // emit only a single error message, even though many failed because ...
    MetronError error = new MetronError()
            .withErrorType(Constants.ErrorType.INDEXING_ERROR)
            .withThrowable(e);
    collector.emit(Constants.ERROR_STREAM, new Values(error.getJSONObject()));
    tuples.forEach(t -> collector.ack(t));

    // there is only one error to report for all of the failed tuples
    collector.reportError(e);
  }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! My mistake. I changed it to pass in just the current tuple to handleError

Your latest commit was my first thought of a fix too; it is much simpler. But I don't think it solves the second problem that I mentioned.

There is also the nuisance that we report the error to the collector for each and every failed message, instead of just once for the batch. There is only one Throwable error to report, so we should just report it once.

I think reporting the error once is both a usability and performance improvement.

I had to unravel the code base a bit more in my suggested code snippet so that it only reports the error once.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right again. I agree, we shouldn't be reporting the same exception for every message in the batch. I changed it to what you suggested in the initial comment and updated the tests.

For the other issue you bring up with error(Throwable, Iterable<Tuple>), it looks like it gets called from only one place: BulkMessageWriterBolt.handleMissingMessage. This method only deals with a single tuple so it is misleading that the error method accepts multiple tuples. Would changing that to accept a single tuple make it less confusing?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method only deals with a single tuple so it is misleading that the error method accepts multiple tuples. Would changing that to accept a single tuple make it less confusing?

Yes, agreed. That would help clarify things.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I'm just going to spin it up and do some manual testing now.

MetronError expectedError1 = new MetronError()
.withSensorType(Collections.singleton(sensorType))
.withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Arrays.asList(message1, message2));
.withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Collections.singletonList(message1));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit. While we are in here, do you think it is easier to read with newlines?

Suggested change
.withErrorType(Constants.ErrorType.INDEXING_ERROR).withThrowable(e).withRawMessages(Collections.singletonList(message1));
.withErrorType(Constants.ErrorType.INDEXING_ERROR)
.withThrowable(e)
.withRawMessages(Collections.singletonList(message1));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Done.

.withSensorType(Collections.singleton(sensorType))
.withErrorType(Constants.ErrorType.INDEXING_ERROR)
.withThrowable(e)
.addRawMessage(messageGetStrategy.get(t));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing this, I am seeing a problem. I am trying to decide whether we should call that a pre-existing condition and fix that problem after this PR goes in or whether the change here makes things any worse. Let me know what you think.

I have described the problem in METRON-1832. If any index destination goes down, error messages are continually recycled and grow larger and larger. In addition, long sequence of escape characters are created.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this some more... I feel that what I have described in METRON-1832 is a pre-existing condition. I don't think this PR makes that pre-existing condition any worse. This PR is a solid, necessary fix. I am in favor of merging this and I will open a discuss thread on a resolution for METRON-1832.

@nickwallen
Copy link
Copy Markdown
Contributor

+1 Works as advertised. Thanks for the fix

@asfgit asfgit closed this in d44a392 Oct 22, 2018
justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018
justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018
justinleet pushed a commit to justinleet/metron that referenced this pull request Oct 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants