Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change grok watch dog to be Matcher based instead of thread based. #48346

Merged
merged 5 commits into from
Oct 24, 2019

Conversation

martijnvg
Copy link
Member

@martijnvg martijnvg commented Oct 22, 2019

There is a watchdog in order to avoid long running (and expensive)
grok expressions. Currently the watchdog is thread based, threads
that run grok expressions are registered and after completion unregister.
If these threads stay registered for too long then the watch dog interrupts
these threads. Joni (the library that powers grok expressions) has a
mechanism that checks whether the current thread is interrupted and
if so abort the pattern matching.

Newer versions have an additional method to abort long running pattern
matching inside joni. Instead of checking the thread's interrupted flag,
joni now also checks a volatile field that can be set via a Matcher
instance. This is more efficient method for aborting long running matches.
(joni checks each 30k iterations whether interrupted flag is set vs.
just checking a volatile field)

Recently we upgraded to a recent joni version (#47374), and this PR
is a followup of that PR.

This change should also fix #43673, since it appears when unit tests
are ran, the a test runner thread's interrupted flag may already have
been set, due to some thread reuse.

There is a watchdog in order to avoid long running (and expensive)
grok expressions. Currently the watchdog is thread based, threads
that run grok expressions are registered and after completion unregister.
If these threads stay registered for too long then the watch dog interrupts
these threads. Joni (the library that powers grok expressions) has a
mechanism that checks whether the current thread is interrupted and
if so abort the pattern matching.

Newer versions have an additional method to abort long running pattern
matching inside joni. Instead of checking the thread's interrupted flag,
joni now also checks a volatile field that can be set via a `Matcher`
instance. This is more efficient method for aborting long running matches.
(joni checks each 30k iterations whether interrupted flag is set vs.
just checking a volatile field)

Recently we upgraded to a recent joni version (elastic#47374), and this PR
is a followup of that PR.

This change should also fix elastic#43673, since it appears when unit tests
are ran the a test runner thread's interrupted flag may already have
been set, due to some thread reuse.
@martijnvg martijnvg added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.0.0 v7.6.0 labels Oct 22, 2019
@martijnvg martijnvg requested a review from talevy October 22, 2019 12:01
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@@ -38,7 +40,7 @@
public class TimeoutChecker implements Closeable {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 Does this change look good to you?

This change to ML is required in order to make the watchdog Matcher based instead of Thread based.

Copy link
Contributor

@talevy talevy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good find! changes look good to me. I cannot comment on the ML changes. so I would wait to hear from the ML team on those

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for making the necessary updates to ML!

My two comments in the ML test are just nits - it will still work as it is now but in the future somebody might wonder why it's messing with the interrupted flag at all.

The thing I noted about the return value of Grok.match has been like that ever since the timeout functionality was introduced to Grok - it's not a problem introduced by this PR. So up to you if you want to leave it for a followup.

} finally {
threadWatchdog.unregister();
matcherWatchdog.unregister(matcher);
}
return (result != -1);
Copy link
Contributor

@droberts195 droberts195 Oct 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inconsistent with the Javadoc. The Javadoc says the return value is "true if grok expression matches text, false otherwise". But given the current code it should say "true if grok expression matches text or there is a timeout, false otherwise".

Probably a better fix would be to change this line to return (result >= 0);. It looks like ML is the only component outside of test code that calls this method. If the ML file structure finder is told there's a match when actually there's a timeout then it will move onto the next step but then time out almost immediately afterwards when the overall elapsed time is checked during that next step, so the net effect is still that the endpoint times out. So from an ML perspective I don't mind whether you change this line or not. But it might be best to make the return value more intuitive before someone else uses this method in production code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it might be best to make the return value more intuitive before someone else uses this method in production code.

Agreed. I will change the jdocs in this PR and in a followup will do the change that you're suggesting here.

} finally {
TimeoutChecker.watchdog.unregister();
TimeoutChecker.watchdog.unregister(matcher);
assertThat(watchdog.registry.get(Thread.currentThread()).matchers.size(), equalTo(0));
}
} finally {
// ensure the interrupted flag is cleared to stop it making subsequent tests fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this finally block is unnecessary now that the other code in this test is not altering whether the current thread is interrupted.

@@ -59,18 +64,24 @@ public void testCheckTimeoutExceeded() throws Exception {
}
}

public void testWatchdog() {
public void testWatchdog() throws Exception {

assertFalse(Thread.interrupted());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the finally block, I think this is redundant now that the test doesn't check or change whether the current thread is interrupted.

@martijnvg martijnvg merged commit 12d32af into elastic:master Oct 24, 2019
martijnvg added a commit that referenced this pull request Oct 24, 2019
…48346)

There is a watchdog in order to avoid long running (and expensive)
grok expressions. Currently the watchdog is thread based, threads
that run grok expressions are registered and after completion unregister.
If these threads stay registered for too long then the watch dog interrupts
these threads. Joni (the library that powers grok expressions) has a
mechanism that checks whether the current thread is interrupted and
if so abort the pattern matching.

Newer versions have an additional method to abort long running pattern
matching inside joni. Instead of checking the thread's interrupted flag,
joni now also checks a volatile field that can be set via a `Matcher`
instance. This is more efficient method for aborting long running matches.
(joni checks each 30k iterations whether interrupted flag is set vs.
just checking a volatile field)

Recently we upgraded to a recent joni version (#47374), and this PR
is a followup of that PR.

This change should also fix #43673, since it appears when unit tests
are ran the a test runner thread's interrupted flag may already have
been set, due to some thread reuse.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v7.6.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GrokTests unit tests fail with "Unable to find pattern [] in Grok's pattern dictionary"
5 participants