Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-7603: Support Graph Token Streams in QueryBuilder #129

Closed
wants to merge 1 commit into from

Conversation

mattweber
Copy link
Contributor

Adds support for handling graph token streams inside the
QueryBuilder util class used by query parsers.

@mattweber
Copy link
Contributor Author

Rebased against master, added missing ASF header.

int posInc = posIncAtt.getPositionIncrement();
assert pos > -1 || posInc > 0;

if (posInc > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a notable limitation that should be documented in javadocs somewhere. Can't we support holes without demanding the stream use '*' ? And might there be a test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsmiley That was actually pulled out of the existing TokenStreamToTermAutomatonQuery.java. Let me look into it more.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few small comments, but I'm out of time right now and I still need to think more about the hole-preserving...

throw new IllegalArgumentException("cannot handle holes; to accept any term, use '*' term");
}
// always use inc 1 while building, but save original increment
int fakePosInc = posInc > 1 ? 1 : posInc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just Math.min(1, posInc) instead?

Integer id = termToID.get(term);
if (id == null) {
if (incr > 1 || id == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm doesn't this mean that if the same term shows up, but with different incr, that it will get different id assigned? But I think that is actually fine, since nowhere here do we depend on / expect that the same term must have the same id.

* Adds a transition to the automaton.
*/
private void addTransition(int source, int dest, BytesRef term) {
private int addTransition(int source, int dest, int incr, BytesRef term) {
if (term == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can become an assert?

/**
* Gets the list of finite string token streams from the given input graph token stream.
*/
public List<TokenStream> getTokenStreams(final TokenStream in) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this method private, make this class's constructor private, and add a static method here, the sole public method on this class, that receives the incoming TokenStream and returns the resulting TokenStream[]? Otherwise the API is sort of awkard, since e.g. this method seems like a getter yet it's doing lots of side-effects under the hood ...

// always use inc 1 while building, but save original increment
int fakePosInc = posInc > 1 ? 1 : posInc;

assert pos > -1 || fakePosInc > 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we upgrade this to a real if? I.e. we need a well-formed TokenStream input ... it cannot have posInc=0 on its first token.

@mattweber
Copy link
Contributor Author

@mikemccand I addressed you comments. I also added some more tests and fixed a bug that would yield wrong increment when a term that had previously been seen was found again with an increment of 0. Tests were added. I have squashed these changes with the previous commit so it is clear to see the difference between the original PR which did not support position increments and the new one that does.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall really nice Matt.

@@ -80,22 +77,41 @@ public boolean incrementToken() throws IOException {
}
}

private GraphTokenStreamFiniteStrings() {
this.builder = new Automaton.Builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other fields are initialized at the declaration; might as well move this here too?

assert term != null;
boolean isStackedGap = incr == 0 && prevIncr > 1;
boolean hasGap = incr > 1;
term = BytesRef.deepCopyOf(term);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deepCopyOf is only needed if you generate a new ID, not for an existing one.

BTW... have you seen BytesRefHash? I think re-using that could minimize the code here to deal with this stuff.

@@ -210,85 +199,41 @@ private void finish() {
*/
private void finish(int maxDeterminizedStates) {
Automaton automaton = builder.finish();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So all this code here removed wasn't needed after all? It's nice to see it all go away (less to maintain / less complexity) :-)

/**
* Builds automaton and builds the finite string token streams.
*/
private List<TokenStream> process(final TokenStream in) throws IOException {
build(in);

List<TokenStream> tokenStreams = new ArrayList<>();
final FiniteStringsIterator finiteStrings = new FiniteStringsIterator(det);
for (IntsRef string; (string = finiteStrings.next()) != null; ) {
final BytesRef[] tokens = new BytesRef[string.length];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm; rather than materializing an array of tokens and increments, maybe you could simply give the IntsRefString to BytesRefArrayTokenStream (and make BRATS not static) so that it could do this on the fly? Not a big deal either way (current or my proposal). If you do as I suggest then BRATS would no longer be a suitable name; maybe simply FiniteStringTokenStream or CustomTokenStream.

@mattweber
Copy link
Contributor Author

Thanks @dsmiley! I have just pushed up code with your suggestions except for using BytesRefHash due to the fact we might have the same BytesRef but need a different id because we have position gap.

This has been great, love the feedback!

@mikemccand
Copy link
Member

This change looks great to me! What an awesome improvement, to properly use graph token streams at search time so multi-token synonyms are correct.

I'll push this in a few days once I'm back home unless someone pushes first (@dsmiley feel free)...

Thank you @mattweber!

Adds support for handling graph token streams inside the
QueryBuilder util class used by query parsers.
@mikemccand
Copy link
Member

I've merged this into Lucene's master (7.0), and I'm working on 6.x (#130) now. Thanks @mattweber! Can you close this?

@mattweber mattweber closed this Jan 3, 2017
markrmiller added a commit that referenced this pull request Jul 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants