Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Include out-of-order as well as in-order terms in reverse search #950

Merged
merged 3 commits into from
Jan 20, 2020

Conversation

droberts195
Copy link
Contributor

When categories include messages with different token orders,
we were not including the tokens whose order varied between
the messages in the terms of the reverse search. This change
adds these out-of-order tokens that are part of the category
definition back into the reverse search terms.

Fixes the easy part of #949

When categories include messages with different token orders,
we were not including the tokens whose order varied between
the messages in the terms of the reverse search.  This change
adds these out-of-order tokens that are part of the category
definition back into the reverse search terms.

Fixes the easy part of elastic#949
// reverse search creation are tested more thoroughly in the unit tests for
// their respective classes, but this test helps to confirm that they work
// together)
BOOST_TEST_REQUIRE(output.find("\"terms\":\"Killing job\"") != std::string::npos);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix of this PR makes terms here Killing job where before this PR it would have been empty.

// their respective classes, but this test helps to confirm that they work
// together)
BOOST_TEST_REQUIRE(output.find("\"terms\":\"Killing job\"") != std::string::npos);
BOOST_TEST_REQUIRE(output.find("\"regex\":\".*\"") != std::string::npos);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex is still poor, .*, after this PR. A followup that will not be backported as far will fix this.

See #949 (comment) for details.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somebody more familiar with C++ might want to review as well. The proof is in the pudding (the tests in this case).

I wonder how this will effect existing ML Jobs? Do we adjust the terms when we expand categories?

@droberts195
Copy link
Contributor Author

I wonder how this will effect existing ML Jobs? Do we adjust the terms when we expand categories?

Yes, we will if some aspect of the category changes as a result of assigning a new message to it. But it will only make the category definitions stored in the results index better, so I don't think it's a problem. The category definition stored in the opaque categorizer state that the C++ uses is not changed at all by this fix. The problem being fixed here is that the way that C++ internal state was reported in a more friendly format was losing some information in this edge case.

The other part of the fix to #949 will involve changing the C++ internal state, and that is why it is too difficult/dangerous to include in a 6.8 patch.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a minor suggestion on variable naming.

size_t end(category.outOfOrderCommonTokenIndex());
for (size_t index = 0; index < end; ++index) {
size_t tokenId(baseTokenIds[index].first);
bool first{true};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this name. Maybe firstInOrderToken would be clearer?

@droberts195 droberts195 merged commit 4b2bd40 into elastic:master Jan 20, 2020
@droberts195 droberts195 deleted the fix_poor_reverse_search_part1 branch January 20, 2020 13:59
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request Jan 20, 2020
…earch

When categories include messages with different token orders,
we were not including the tokens whose order varied between
the messages in the terms of the reverse search.  This change
adds these out-of-order tokens that are part of the category
definition back into the reverse search terms.

Backport of elastic#950
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request Jan 20, 2020
…earch

When categories include messages with different token orders,
we were not including the tokens whose order varied between
the messages in the terms of the reverse search.  This change
adds these out-of-order tokens that are part of the category
definition back into the reverse search terms.

Backport of elastic#950
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request Jan 20, 2020
…earch

When categories include messages with different token orders,
we were not including the tokens whose order varied between
the messages in the terms of the reverse search.  This change
adds these out-of-order tokens that are part of the category
definition back into the reverse search terms.

Backport of elastic#950
droberts195 added a commit that referenced this pull request Jan 20, 2020
…earch (#954)

When categories include messages with different token orders,
we were not including the tokens whose order varied between
the messages in the terms of the reverse search.  This change
adds these out-of-order tokens that are part of the category
definition back into the reverse search terms.

Backport of #950
droberts195 added a commit that referenced this pull request Jan 20, 2020
…earch (#956)

When categories include messages with different token orders,
we were not including the tokens whose order varied between
the messages in the terms of the reverse search.  This change
adds these out-of-order tokens that are part of the category
definition back into the reverse search terms.

Backport of #950
droberts195 added a commit that referenced this pull request Jan 20, 2020
…earch (#955)

When categories include messages with different token orders,
we were not including the tokens whose order varied between
the messages in the terms of the reverse search.  This change
adds these out-of-order tokens that are part of the category
definition back into the reverse search terms.

Backport of #950
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants