New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-7603: Support Graph Token Streams in QueryBuilder #129
Conversation
747fb0d
to
85e5c7c
Compare
Rebased against master, added missing ASF header. |
int posInc = posIncAtt.getPositionIncrement(); | ||
assert pos > -1 || posInc > 0; | ||
|
||
if (posInc > 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a notable limitation that should be documented in javadocs somewhere. Can't we support holes without demanding the stream use '*' ? And might there be a test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsmiley That was actually pulled out of the existing TokenStreamToTermAutomatonQuery.java. Let me look into it more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few small comments, but I'm out of time right now and I still need to think more about the hole-preserving...
throw new IllegalArgumentException("cannot handle holes; to accept any term, use '*' term"); | ||
} | ||
// always use inc 1 while building, but save original increment | ||
int fakePosInc = posInc > 1 ? 1 : posInc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just Math.min(1, posInc)
instead?
Integer id = termToID.get(term); | ||
if (id == null) { | ||
if (incr > 1 || id == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm doesn't this mean that if the same term shows up, but with different incr
, that it will get different id
assigned? But I think that is actually fine, since nowhere here do we depend on / expect that the same term must have the same id.
* Adds a transition to the automaton. | ||
*/ | ||
private void addTransition(int source, int dest, BytesRef term) { | ||
private int addTransition(int source, int dest, int incr, BytesRef term) { | ||
if (term == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can become an assert?
/** | ||
* Gets the list of finite string token streams from the given input graph token stream. | ||
*/ | ||
public List<TokenStream> getTokenStreams(final TokenStream in) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this method private, make this class's constructor private, and add a static
method here, the sole public method on this class, that receives the incoming TokenStream
and returns the resulting TokenStream[]
? Otherwise the API is sort of awkard, since e.g. this method seems like a getter yet it's doing lots of side-effects under the hood ...
// always use inc 1 while building, but save original increment | ||
int fakePosInc = posInc > 1 ? 1 : posInc; | ||
|
||
assert pos > -1 || fakePosInc > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we upgrade this to a real if
? I.e. we need a well-formed TokenStream
input ... it cannot have posInc=0
on its first token.
d8cb393
to
81230f5
Compare
@mikemccand I addressed you comments. I also added some more tests and fixed a bug that would yield wrong increment when a term that had previously been seen was found again with an increment of 0. Tests were added. I have squashed these changes with the previous commit so it is clear to see the difference between the original PR which did not support position increments and the new one that does. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall really nice Matt.
@@ -80,22 +77,41 @@ public boolean incrementToken() throws IOException { | |||
} | |||
} | |||
|
|||
private GraphTokenStreamFiniteStrings() { | |||
this.builder = new Automaton.Builder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other fields are initialized at the declaration; might as well move this here too?
assert term != null; | ||
boolean isStackedGap = incr == 0 && prevIncr > 1; | ||
boolean hasGap = incr > 1; | ||
term = BytesRef.deepCopyOf(term); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deepCopyOf is only needed if you generate a new ID, not for an existing one.
BTW... have you seen BytesRefHash? I think re-using that could minimize the code here to deal with this stuff.
@@ -210,85 +199,41 @@ private void finish() { | |||
*/ | |||
private void finish(int maxDeterminizedStates) { | |||
Automaton automaton = builder.finish(); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So all this code here removed wasn't needed after all? It's nice to see it all go away (less to maintain / less complexity) :-)
/** | ||
* Builds automaton and builds the finite string token streams. | ||
*/ | ||
private List<TokenStream> process(final TokenStream in) throws IOException { | ||
build(in); | ||
|
||
List<TokenStream> tokenStreams = new ArrayList<>(); | ||
final FiniteStringsIterator finiteStrings = new FiniteStringsIterator(det); | ||
for (IntsRef string; (string = finiteStrings.next()) != null; ) { | ||
final BytesRef[] tokens = new BytesRef[string.length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm; rather than materializing an array of tokens and increments, maybe you could simply give the IntsRefString to BytesRefArrayTokenStream (and make BRATS not static) so that it could do this on the fly? Not a big deal either way (current or my proposal). If you do as I suggest then BRATS would no longer be a suitable name; maybe simply FiniteStringTokenStream or CustomTokenStream.
81230f5
to
97f32bf
Compare
Thanks @dsmiley! I have just pushed up code with your suggestions except for using This has been great, love the feedback! |
This change looks great to me! What an awesome improvement, to properly use graph token streams at search time so multi-token synonyms are correct. I'll push this in a few days once I'm back home unless someone pushes first (@dsmiley feel free)... Thank you @mattweber! |
Adds support for handling graph token streams inside the QueryBuilder util class used by query parsers.
97f32bf
to
7d67767
Compare
I've merged this into Lucene's master (7.0), and I'm working on 6.x (#130) now. Thanks @mattweber! Can you close this? |
Adds support for handling graph token streams inside the
QueryBuilder util class used by query parsers.