Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synonym Graph Support (LUCENE-6664) #21517

Merged
merged 1 commit into from Nov 28, 2016

Conversation

Projects
None yet
5 participants
@mattweber
Copy link
Contributor

commented Nov 12, 2016

Integrate @mikemccand's patch from LUCENE-6664 into elasticsearch and add support for handling a graph token stream in match/multi-match queries.

This PR still needs some work, ie. lots more tests, fix some TODO's, etc. I wanted to get some feedback and see if you would even consider merging this before I spend much more time on it. I do have it as a plugin as well but it would be much nicer in core since the plugin needs to fork Match and MultiMatch queries.

@clintongormley

This comment has been minimized.

Copy link
Member

commented Nov 13, 2016

heya @mattweber - thanks for the PR!

@mikemccand could you have a look at this please?

@mikemccand
Copy link
Contributor

left a comment

Wow, this is very impressive @mattweber! Thank you ;)

I had given up hope that Lucene would ever correctly handle multi-token synonyms, but maybe ES can, and if that proves successful, maybe we can eventually push the improvements back to Lucene.

Just to give some context here:

Lucene has never handled multi-token synonyms correctly, and it's a well known/publicized problem, e.g.:

Essentially, the multi-tokens synonyms get squashed on top of each other, causing documents that should match to fail to match, and vice versa, at search time. The above blog post explains the problem. It's quite a glaring, long standing, embarrassing bug.

So the goal with LUCENE-6664, which was never committed to Lucene due to controversy, was to be the first step to fix this with a new SynonymGraphFilter that produces a correct graph as output with multi-token synonyms.

The second step (making use of that graph for indexing and searching) is still not solved by LUCENE-6664, but for this PR, it looks like you've done that nicely, by detecting when the query-time analyzer has created a graph and then enumerating all unique paths through that graph and constructing a corresponding BooleanQuery with those paths as SHOULD clauses. Very nice!

Another option (can be done later/separately) is to rewrite directly to TermAutomatonQuery: that query already has helper methods to build itself from a graph TokenStream. It can also make things more efficient at search time since e.g. it will rewrite to a MultiPhraseQuery if it can.

If we parse a double-quoted text (meaning user intended a PhraseQuery), and it had graph synonyms inside it, I think this works very well right? I.e., it will match any document having the requested phrase or any of its graph-synonym-expanded phrases.

core/src/main/java/org/apache/lucene/util/XQueryBuilder.java Outdated
positionCount += positionIncrement;
} else {
hasSynonyms = true;
// TODO: better way to detect a graph token stream?

This comment has been minimized.

Copy link
@mikemccand

mikemccand Nov 14, 2016

Contributor

I think you could just pull the PositionLengthAttribute and if it has a value greater than 1 on any token, it is a graph.

This comment has been minimized.

Copy link
@mattweber

mattweber Nov 14, 2016

Author Contributor

Awesome, I like that much better. I will make this change and set the type of the synonym back to SYNONYM.

This comment has been minimized.

Copy link
@mattweber

mattweber Nov 14, 2016

Author Contributor

Ok this doesn't work because the standard synonym token filter also sets the position length attribute which will cause the query builder to attempt to build the automaton from a non-graph token stream.

Maybe I just introduce a new GraphTokenStreamAttribute attribute that defaults to false and is only set to true when using a graph-aware token filter and we have a synonym with a position length greater than 1?

This comment has been minimized.

Copy link
@mikemccand

mikemccand Nov 15, 2016

Contributor

Ok this doesn't work because the standard synonym token filter also sets the position length attribute

Arrrrgh, you are right.

Here's maybe another idea: why not always treat things as if they were a graph,since a single linear chain of tokens really is just a graph. And then, when the graph enumerates to just one path, it should naturally "become" what already happens today? I.e. don't break out any special treatment for "graph" vs "not graph".

This comment has been minimized.

Copy link
@mattweber

mattweber Nov 15, 2016

Author Contributor

I set a flag using FlagsAttribute when the SynonymGraphFilter sets a synonym. I can then detect a graph by looking for this flag along with position length > 1.

core/src/main/java/org/apache/lucene/util/XRollingBuffer.java Outdated
*
* Temporary until LUCENE-6664.
*/
public abstract class XRollingBuffer<T extends XRollingBuffer.Resettable> {

This comment has been minimized.

Copy link
@mikemccand

mikemccand Nov 14, 2016

Contributor

Hmm, why did we need to fork this class from Lucene? Is it only for the added getBufferSize public API? If so, I think we can make this change directly in Lucene. I'll do that ...

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 14, 2016

@mikemccand I saw that you recently committed the new rewrite code to TermAutomatonQuery. I decided not to go that route because I wanted to make sure I could control the different mode/operators across the different tokens in each path. I think everything is essentially a phrase using TermAutomatonQuery and in this approach we use whatever the original query terms were configured to use. The GraphTokenStreamFiniteStrings class uses all the TokenStream to Automaton code from TermAutomatonQuery, maybe it would be good to extract that out so we can reuse it. Probably not important until we wanted to get it into Lucene.

I have only integrated this forked QueryBuilder into the Match queries which really don't do anything other than analysis. I will have to see what it takes to get it into QueryString and SimpleQueryString queries.

Yes, RollingBuffer was only forked due to making getBufferSize public. Looks like you already handled that. Also, only reason we needed to fork QueryBuilder is because the createFieldQuery method was final. Is that something we could change?

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 14, 2016

While I am thinking about it, I want to note that minimum_should_match is not currently being applied correctly when we detect a graph query. I need to look into it.

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 14, 2016

Also, I had initially tried to use BytesTermAttribute to reuse the BytesRef we had for each term instead of converting back to a string and using CharTermAttribute. I kept getting NPE when used with the CachingTokenFilter in the query builder. Not sure if that is something worth investigating as well.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 15, 2016

Also, only reason we needed to fork QueryBuilder is because the createFieldQuery method was final. Is that something we could change?

I'll open a Lucene issue to explore that; looks like it was originally (born) final here: https://issues.apache.org/jira/browse/LUCENE-5261

I decided not to go that route because I wanted to make sure I could control the different mode/operators across the different tokens in each path. I think everything is essentially a phrase using TermAutomatonQuery and in this approach we use whatever the original query terms were configured to use.

Ahh OK ... can you give an example where you would not want the separate paths to be treated as a phrase query? I'm just trying to understand the use case; maybe we can improve TermAutomatonQuery to work better for such cases.

The GraphTokenStreamFiniteStrings class uses all the TokenStream to Automaton code from TermAutomatonQuery, maybe it would be good to extract that out so we can reuse it. Probably not important until we wanted to get it into Lucene.

I like this idea, basically just a static method to turn a TokenStream into the equivalent set of Token[] ... but yes we can wait I think. We also have TokenStreamToAutomaton, but that puts each byte of each token as a transition.

I kept getting NPE when used with the CachingTokenFilter in the query builder. Not sure if that is something worth investigating as well.

Hmm do you have more details about this? A traceback? Likely something was making the hard assumption that a token would always have a CharTermAttribute. CachingTokenFilter looks ok on quick inspection, and QueryBuilder.createFieldQuery seems to only pull the TermToBytesRefAttribute so I'm not sure where it's breaking.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 15, 2016

I opened https://issues.apache.org/jira/browse/LUCENE-7560 to explore removing final from QueryBuilder.createFieldQuery.

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 15, 2016

@mikemccand thanks for opening that issue! I will get you more info on the BytesTermAttribute and CachingTokenFilter NPE I found. I believe it was during a clone of the attribute where it tried to do deep copy of the BytesRef but the BytesRef itself was null for some reason.

Ahh OK ... can you give an example where you would not want the separate paths to be treated as a phrase query? I'm just trying to understand the use case; maybe we can improve TermAutomatonQuery to work better for such cases.

For example, non-phrase match query of "say what the fudge" using the AND operator. I don't want this to be forced into a phrase, I just want all terms to match. Since we have a synonym also "say wtf" can also match, again not as a phrase. Ideally, only the expanded multi-token synonym itself would be treated as a phrase, not the original search terms. So "say wtf" would end up looking like:

AND(say, wtf)
AND(say, PHRASE(what the fudge))

That AND could also be an OR, or a PHRASE depending on how the user configured the match query. I figured it would get pretty hairy with all the different combinations so treating everything the same using the user's configured query settings would be best. Does this make sense?

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Nov 15, 2016

I love the stuff happening here! Yet, if we do this in ES and later move it to lucene we might be fighting with BWC big time, but don't get me wrong I think we should do it this way. To prevent these problems I think we need to make it experimental AND we need a way for people to opt into these features. I thought about this for a while and I think we need to add a setting to enable this which is disabled by default. To make this explicit we can keep these settings private and have a dedicated / generic experimental_features plugin that does nothing but registers these settings. That way users need to install the plugin AND set the setting to enabled/disabled for each feature. This will be explicit and they will get failures if they use the feature and the feature got removed or moved from experimental to supported @clintongormley WDYT

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 15, 2016

@mikemccand Ignore my issue about the NPE, I must have been doing something else wrong at the time. I just pushed a commit that uses BytesTermAttribute which avoids the unnecessary string conversion.

@simonw You guys tell me what I need to do and I will work on it. This feature is a big one for me and having this in core (even as an experimental plugin) vs. my custom plugin is a big win.

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 15, 2016

@simonw Also, I think the opt-in is the fact that the user uses the synonym_graph token filter at all. If they don't use it the flags attribute never gets set and will never get into the new code path. Is that enough?

@mattweber mattweber force-pushed the mattweber:graph_synonyms branch Nov 16, 2016

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 16, 2016

Just pushed another commit that will allow us to properly handle minimum_should_match.

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Nov 16, 2016

@simonw Also, I think the opt-in is the fact that the user uses the synonym_graph token filter at all. If they don't use it, nothing changes. Is that enough?

I am not sure - people get surprised quickly and BWC layers are difficult. I'd rather have another step to take here for users but don't worry about it here for now

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 16, 2016

I opened https://issues.apache.org/jira/browse/LUCENE-7560 to explore removing final from QueryBuilder.createFieldQuery.

OK this is now done (for Lucene 6.4.0).

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 16, 2016

So "say wtf" would end up looking like:

Ahh thank you for that example @mattweber, that makes perfect sense, and the simplicity of simply enumerating all matching phrases out of the TokenStream graph is really nice. We can later think about how we could make use of TAQ, only for the "graph parts", but keep the other query parser operators for the non-graph parts.

I like the idea of a separate, explicit experimental_features plugin, but I agree we should not let that hold up this issue (which is already complex enough). For this issue we should make it clear in the docs that this is experimental, and when we later have this plugin, we can at that point apply it to this feature.

Can we rename SausageGraphFilter to maybe SquashGraphFilter or FlattenGraphFilter or something? The docs need to make it clear that the index analyzer must use SquashGraphFilter.

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 16, 2016

@mikemccand I will do the rename and start on some documentation. Now that I fixed the minimum should match bug yesterday I think all the major issues are resolved. Do you guys see any major problems that might hold up getting this in?

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 16, 2016

Thanks @mattweber ... I don't see any problems now. One minor thing: can you add TODO to the two forked Lucene classes to un-fork them once we upgrade to Lucene 6.4.0?

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 16, 2016

@mikemccand Should we just remove SausageGraphFilter from this PR? Really, if we were going to use SynonymGraphFilter and then SausageGraphFilter at index time we might as well just use the normal SynonymFilter. We can document that SynonymGraphFilter is meant to be used as a search analyzer only.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 16, 2016

@mattweber I think that's a good approach.

Long term, I expect that SynonymGraphFilter will replace SynonymFilter, but short term (this change) let's minimize the risk and focus only on query time synonyms.

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 17, 2016

@mikemccand Ok, removed SausageGraphFilter and added the TODO for Lucene 6.4.0. On that note, we need more changes to the upstream QueryBuilder...

Can the following private methods be changed to protected so we can access them from our overridden createFieldQuery method?

  • analyzeTerm
  • analyzeMultiPhrase
  • analyzePhrase
  • analyzeBoolean
  • analyzeMultiBoolean

@mattweber mattweber force-pushed the mattweber:graph_synonyms branch Nov 17, 2016

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 17, 2016

Can the following private methods be changed to protected so we can access them from our overridden createFieldQuery method?

I'll try :) I'll reopen the Lucene issue.

@mattweber mattweber force-pushed the mattweber:graph_synonyms branch Nov 18, 2016

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 18, 2016

@mikemccand Thanks! I have made my forked version a copy of what I think it will look like in Lucene 6.4.0 assuming those methods get changed from private to protected. I also moved all the graph detection logic out into overridden methods of MatchQueryBuilder to make un-forking easier.

I think we are ready for review, should I squash everything?

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 28, 2016

@jpountz Can you please take a look?

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Nov 28, 2016

I'll look now.

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Nov 28, 2016

Query parsing looks good to me.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 28, 2016

Thanks @mattweber and @jpountz ... I'll merge this now!

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 28, 2016

Awesome, would you like me to backport to 5x?

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 28, 2016

Awesome, would you like me to backport to 5x?

Oh, that would be wonderful, thank you!

@mikemccand mikemccand merged commit 04e07bc into elastic:master Nov 28, 2016

1 check passed

CLA Commit author has signed the CLA
Details
@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 28, 2016

@mattweber unfortunately 5.1.0 release is already feature frozen, so this will be in the 5.2.0 release.

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 28, 2016

Ok thanks @mikemccand I will just backport and open new PR against 5.x branch. What is the normal process you guys do to backport? Cherry pick the commit? For this I might need to go re-fork the QueryBuilder from the 5.x version of lucene.

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Nov 28, 2016

@mikemccand cherry-picked commit applies clean. I am running tests right now and will open PR for you once those finish up.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Nov 28, 2016

@mattweber Yes, cherry pick is normally how I do it; that's great that it's clean! Thank you.

mattweber added a commit to mattweber/elasticsearch that referenced this pull request Nov 28, 2016

Synonym Graph Support (LUCENE-6664) (elastic#21517)
Integrate the patch from LUCENE-6664 into elasticsearch and
add support for handling a graph token stream in match/multi-match
queries.

This fixes longstanding bugs with multi-token synonyms returning
incorrect results with proximity queries.

mikemccand pushed a commit that referenced this pull request Nov 28, 2016

Synonym Graph Support (LUCENE-6664) (#21517) (#21836)
Integrate the patch from LUCENE-6664 into elasticsearch and
add support for handling a graph token stream in match/multi-match
queries.

This fixes longstanding bugs with multi-token synonyms returning
incorrect results with proximity queries.
@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Dec 22, 2016

@mattweber I just pushed the SynonymGraphFilter to Lucene (https://issues.apache.org/jira/browse/LUCENE-6664) for its next (6.4.0) feature release; once Lucene releases 6.4.0 and we upgrade to it we should try to un-fork the changes here.

So ... now I'm thinking more about how query parsing could work more generally for any "graph" token streams... and I'm wondering what would actually go wrong if you had used PositionLengthAttribute > 1 here (instead of GRAPH_FLAG) to know when to walk the graph for all paths?

In the cases where old SynonymFilter does this, it is in fact a real graph, and it carries important information, i.e. using GraphQuery in that case would fix some buggy cases today?

E.g. if I have a synonym:

  denial of service -> dos

And I try a match query with this phrase search:

  "denial of service attack"

and I'm using the old SynonymFilter, it will match denial of service and insert dos with posLen=3, yet today, we will fail to match documents that have dos attack, but if we used GraphQuery it would match I believe?

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Dec 23, 2016

@mikemccand I believe this will work with any graph token stream. It does fix some buggy cases of the original SynonymFilter which is the reason I added the GRAPH_FLAG detection (so it is not a breaking change forcing me to wait until ES6). This test will actually fail if we remove the GRAPH_FLAG detection because it will actually traverse the graph from the original SynonymFilter.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Dec 23, 2016

@mikemccand I believe this will work with any graph token stream.

OK this is my impression too (that we do not need GRAPH_FLAG, except for back-compat reasons).

Yet, I would argue we are fixing a (nasty, long standing) bug here, by having even the old SynonymFilter return correct results? And that test is indeed showing exactly the bug; thanks for the pointer. Why not just fix the bug for users still using the old SynonymFilter and any other graph token streams?

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Dec 23, 2016

@mikemccand I am all for removing the GRAPH_FLAG and just documenting the "fixed" behavior.

@clintongormley your thoughts?

Also, FYI I started working on support for graph token streams in query_string and simple_query_string and noticed it introduces a lot of duplicate code because of where we need to override QueryBuilder methods. Are those overrides something we should get into lucene?? I imagine more push back there due to the "fixed" behavior.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Dec 23, 2016

Are those overrides something we should get into lucene?? I imagine more push back there due to the "fixed" behavior.

That's exactly what I'd like to do! Patches welcome in Lucene land :)

It's why I was mulling why we can't do this new behavior more generally whenever posLen > 1 ...

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Dec 23, 2016

@mikemccand Alright, let me work up a patch for lucene. If we get that in then the only things we need to handle here are the min should match modifications.

@mikemccand

This comment has been minimized.

Copy link
Contributor

commented Dec 23, 2016

Thanks @mattweber!

@mattweber

This comment has been minimized.

Copy link
Contributor Author

commented Dec 26, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.