Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-1218: Metron REST should return better error messages #779

Closed
wants to merge 3 commits into from

Conversation

merrimanr
Copy link
Contributor

@merrimanr merrimanr commented Sep 28, 2017

Contributor Comments

This PR improves the error handling in the REST app and adds logging to failed searches in the ElasticsearchDao. This can be tested in full dev by executing a search against REST with a value for "from" that is larger than the index. For example:

{
  "from": 1000000000,
  "indices": [
    "bro"
  ],
  "query": "*",
  "size": 0
}

Before it would just return:

{
  "responseCode": 500,
  "message": "Could not execute search",
  "fullMessage": "Could not execute search"
}

but after this change it should instead include the root cause:

{
  "responseCode": 500,
  "message": "Could not execute search",
  "fullMessage": "Result window is too large, from + size must be less than or equal to: [10000] but was [100000005]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
}

There should also be the full stacktrace in /var/log/metron/metron-rest.log.

This change should improve all error messages that are returned from REST and are not limited to this case. I also added a unit test for the REST exception handler class.

Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.
Please refer to our Development Guidelines for the complete guide to follow for contributions.
Please refer also to our Build Verification Guidelines for complete smoke testing guides.

In order to streamline the review of the contribution we ask you follow these guidelines and ask you to double check the following:

For all changes:

  • Is there a JIRA ticket associated with this PR? If not one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?

For code changes:

  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?

  • Have you included steps or a guide to how the change may be verified and tested manually?

  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:

    mvn -q clean integration-test install && build_utils/verify_licenses.sh 
    
  • Have you written or updated unit tests and or integration tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered by building and verifying the site-book? If not then run the following commands and the verify changes via site-book/target/site/index.html:

    cd site-book
    mvn site
    

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.
It is also recommended that travis-ci is set up for your personal repository such that your branches are built there before submitting a pull request.

@cestella
Copy link
Member

+1 this one bit me today and this fixes it; great job!

fullMessage = cause.getMessage();
cause = cause.getCause();
}
return fullMessage;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also get the stacktrace? Maybe a call to ExceptionUtils.getStackTrace(ex) appended?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe be a security risk to show stack in ui or rest errors? Could we control it by profile or something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a good alternative would be to show the error to the endpoint (the UI) and at least log the exception in the metron-rest.log

Copy link
Contributor Author

@merrimanr merrimanr Sep 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ottobackwards the stack trace is not shown in the UI, only the root cause. If a user is not authenticated, they wouldn't be see this error anyways, they would get a 401.

@simonellistonball are you suggesting we move the logging call to the exception handler instead of the offending class? In this case it's very likely an error would get logged twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cestella I initially tried to include the whole stack trace in the response but I couldn't find a good way to format it. Multi-line statements just don't work well in JSON so I settled on the root cause because it's usually the most important piece of information anyways. The whole stack trace is logged, just not returned in the JSON response. Any ideas on how to do this better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, in the situation that I was in last night while debugging, I was using this patch and the exception was not logged in metron-rest. I was hunting the same bug as Simon. I ended up adjusting this to include the stack trace and that's how I eventually figured it out. I'm not entirely certain WHY the error logging wasn't showing up.

Since this has happened to 2 of us now, could you validate that when you try to use the alerts-ui and use a sensor that does not have the alert nested template (any new sensor), you get an exception in the log? It should be something like:

Caused by: org.apache.metron.indexing.dao.search.InvalidSearchException: Could not execute search
        at org.apache.metron.elasticsearch.dao.ElasticsearchDao.search(ElasticsearchDao.java:170)
        at org.apache.metron.elasticsearch.dao.ElasticsearchMetaAlertDao.search(ElasticsearchMetaAlertDao.java:190)
        at org.apache.metron.rest.service.impl.SearchServiceImpl.search(SearchServiceImpl.java:53)
        ... 88 more
Caused by: Failed to execute phase [query_fetch], all shards failed; shardFailures {[S6cJSn5nSmOpd7Lm2_OF9g][cowrie_index_2017.09.29.05][0]: RemoteTransportException[[node1][192.168.66.121:9300][indices:data/read/search[phase/query+fetch]]]; nested: SearchParseException[failed to parse search source [{\"from\":0,\"size\":25,\"query\":{\"constant_score\":{\"filter\":{\"bool\":{\"should\":[{\"query_string\":{\"query\":\"*\"}},{\"bool\":{\"must\":[{\"term\":{\"status\":\"active\"}},{\"nested\":{\"query\":{\"query_string\":{\"query\":\"*\"}},\"path\":\"alert\"}}]}}]}}}},\"_source\":{\"includes\":[],\"excludes\":[]},\"sort\":[{\"timestamp\":{\"order\":\"desc\"}}],\"track_scores\":true}]]; nested: QueryParsingException[[nested] failed to find nested object under path [alert]]; }
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onFirstPhaseResult(AbstractSearchAsyncAction.java:206)
        at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:152)
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:46)
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:855)
        at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:833)
        at org.elasticsearch.transport.TransportService$4.onFailure(TransportService.java:387)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)

If you can get it to work, then I'm +1. I absolutely couldn't get it to work with this PR, but I might've been doing something weird and I worry there's something weird with the logging level (which I think is at INFO) on metron-rest by default.

I still, personally, would rather see a stack trace in the error in the UI (since I'd prefer to get a stack trace from users for diagnosing issues and that's where they'll be looking). I take @ottobackwards 's point about security, but I still think it's valid and valuable. For the moment, I'd just append the multiline string and we can sort it out when debugging. Mind you, this would be in addition to in the rest logs.

I guess the net-net of it is that I want it logged and validated at LEAST one place and I'd be ok if it were 2 (one to the user and one to the system logs).

Copy link
Member

@cestella cestella Sep 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, yes, I realize that the stack trace shown is right in the path of your LOG.error call, but I absolutely did not see anything in the log. I just want to make certain I'm doing something stupid and this case is covered by having you validate it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log level is indeed INFO and it should not be. That regression was introduced when we added ${HBASE_HOME}/conf to the REST classpath because that directory contains a log4j.properties file with logging set to INFO. I'm planning on fixing that in a separate PR. I'm not sure that's the cause here though because I'm able to produce an error in the logs with the same type of error (InvalidSearchException).

Are you sure you have the code in this PR deployed properly? Your stack trace says the exception was thrown at line 170 but in this PR the exception is thrown at line 176.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with having it the UI, but it must be configurable or governed by the production profile.
You can have the customer switch and reproduce to get the screen shot ( although, when would you NOT want the whole log in practice? ).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it’s could be a bad merge. If you saw that in the log, then I’m ok with this. My +1 stands. Great job.

@cestella
Copy link
Member

sorry, one small nit.


private String getFullMessage(Throwable ex) {
String fullMessage = ex.getMessage();
Throwable cause = ex.getCause();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a possible recursion point, where ex.getCause() == ex .
Can we use the commons ExceptionUtils.getRootCause()? If not we need to guard against this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this method is just ExceptionUtils.getRootCause.getMessage(), isn't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. The get cause recursion is an oldie but a goodie.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a blocker imo.

@merrimanr
Copy link
Contributor Author

The latest commit replaces the getFullMessage call with ExceptionUtils.getStackTrace. Hopefully that addresses the recursion issue. As for security @ottobackwards, I'm not sure what the requirements are.

@@ -35,7 +36,7 @@
@ResponseBody
ResponseEntity<?> handleControllerException(HttpServletRequest request, Throwable ex) {
HttpStatus status = getStatus(request);
return new ResponseEntity<>(new RestError(status.value(), ex.getMessage(), getFullMessage(ex)), status);
return new ResponseEntity<>(new RestError(status.value(), ex.getMessage(), ExceptionUtils.getStackTrace(ex)), status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change to stack trace from message?
ExceptionUtils.getRootCauseMessage() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because @cestella requested it earlier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but I thought you wanted to show the root cause to the user, and have the stack trace that casey wanted. Not just the stack trace. What is the user going to see now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer just root cause but I had the impression the full stack trace was wanted. Doesn't make that much difference to me so we just need to agree on what it should be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the stack trace is generally more helpful in debugging than just the root cause, so yeah, I kinda wanted the stack trace along with the root cause. I'm willing to be argued with, but as a dev, I'd like to know line numbers of failures along with why.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not in logs that is the problem, it is if it is in the UI. I think that it would apply to rest too... but I am not sure on that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine; I'm with it :) Thanks for the clarity, @ottobackwards

@merrimanr
Copy link
Contributor Author

The most recent commit changes fullMessage to use ExceptionUtils.getRootCauseMessage.

@ottobackwards
Copy link
Contributor

+1

@cestella
Copy link
Member

cestella commented Oct 4, 2017

+1, great job

@asfgit asfgit closed this in a8fd8ea Oct 4, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants