LUCENE-7603: Support Graph Token Streams in QueryBuilder #129

mattweber · 2016-12-26T17:09:29Z

Adds support for handling graph token streams inside the
QueryBuilder util class used by query parsers.

mattweber · 2016-12-28T18:20:48Z

Rebased against master, added missing ASF header.

dsmiley · 2016-12-29T15:41:47Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

+      int posInc = posIncAtt.getPositionIncrement();
+      assert pos > -1 || posInc > 0;
+
+      if (posInc > 1) {


This seems like a notable limitation that should be documented in the javadocs somewhere. Can't we support holes without demanding that the stream use '*'? And might there be a test for this?

mattweber · 2016-12-29

@dsmiley That was actually pulled out of the existing TokenStreamToTermAutomatonQuery.java. Let me look into it more.

mikemccand

I left a few small comments, but I'm out of time right now and I still need to think more about the hole-preserving...

mikemccand · 2016-12-30T10:03:58Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

-        throw new IllegalArgumentException("cannot handle holes; to accept any term, use '*' term");
-      }
+      // always use inc 1 while building, but save original increment
+      int fakePosInc = posInc > 1 ? 1 : posInc;


Maybe just Math.min(1, posInc) instead?
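For readers following the suggestion, here is a minimal sketch of why the two forms are interchangeable. The class and method names are hypothetical and exist only for this illustration; only the two expressions themselves come from the diff and the comment above.

```java
// Hypothetical helper, not part of the PR: compares the two clamping forms.
final class PosIncClamp {

  // Form used in the diff: cap the increment at 1 with a ternary.
  static int clampTernary(int posInc) {
    return posInc > 1 ? 1 : posInc;
  }

  // Form suggested in review: the smaller of 1 and posInc.
  static int clampMin(int posInc) {
    return Math.min(1, posInc);
  }

  public static void main(String[] args) {
    // Both forms agree for every int; a few representative values are printed.
    for (int posInc = 0; posInc <= 3; posInc++) {
      System.out.println(posInc + " -> " + clampTernary(posInc) + " / " + clampMin(posInc));
    }
  }
}
```

Both forms leave increments of 0 and 1 untouched and cap anything larger at 1, so the choice is purely about readability.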

mikemccand · 2016-12-30T10:15:15Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

    Integer id = termToID.get(term);
-    if (id == null) {
+    if (incr > 1 || id == null) {


Hmm, doesn't this mean that if the same term shows up with a different incr, it will get a different id assigned? But I think that is actually fine, since nowhere here do we depend on or expect the same term to have the same id.
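A rough sketch of the behavior being discussed, assuming ids are handed out from a simple counter. Only the lookup and the `incr > 1 || id == null` condition are taken from the diff; the class name, the `String` key (standing in for `BytesRef`), the `nextId` counter, and the choice to overwrite the map entry are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the id bookkeeping in the PR, using String keys.
final class GapAwareIds {
  private final Map<String, Integer> termToID = new HashMap<>();
  private int nextId = 0; // assumption: ids come from a simple counter

  int getTermID(int incr, String term) {
    Integer id = termToID.get(term);
    // Condition from the diff: a term preceded by a gap always gets a fresh id,
    // even if the same term text was seen before.
    if (incr > 1 || id == null) {
      id = nextId++;
      termToID.put(term, id); // assumption: the latest id replaces the old one
    }
    return id;
  }
}
```

Under these assumptions the same term text can map to several ids over the life of the stream, which is the situation the comment above describes as acceptable.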

mikemccand · 2016-12-30T10:15:47Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

-   * Adds a transition to the automaton.
-   */
-  private void addTransition(int source, int dest, BytesRef term) {
+  private int addTransition(int source, int dest, int incr, BytesRef term) {
    if (term == null) {


This can become an assert?

mikemccand · 2016-12-30T10:26:55Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

+  /**
+   * Gets the list of finite string token streams from the given input graph token stream.
+   */
+  public List<TokenStream> getTokenStreams(final TokenStream in) throws IOException {


Could we make this method private, make this class's constructor private, and add a static method here, the sole public method on this class, that receives the incoming TokenStream and returns the resulting TokenStream[]? Otherwise the API is sort of awkward, since e.g. this method seems like a getter yet it's doing lots of side-effects under the hood ...
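The API shape suggested here could look roughly like the sketch below. It is deliberately not Lucene code: `GraphExpander`, `expand`, and `build` are hypothetical names, and a list of stacked alternatives per position stands in for a real graph token stream (it ignores position lengths and gaps). The point is only the structure: a private constructor, private stateful helpers, and one static entry point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "private constructor + single static entry point".
final class GraphExpander {
  // Mutable working state stays hidden inside the instance.
  private final List<List<String>> paths = new ArrayList<>();

  private GraphExpander() {}

  // Sole entry point: callers never see a half-built instance.
  static List<List<String>> expand(List<List<String>> alternativesPerPosition) {
    GraphExpander expander = new GraphExpander();
    expander.build(alternativesPerPosition, 0, new ArrayList<>());
    return expander.paths;
  }

  // Enumerates every path by picking one alternative at each position.
  private void build(List<List<String>> alternatives, int pos, List<String> prefix) {
    if (pos == alternatives.size()) {
      paths.add(new ArrayList<>(prefix)); // copy, since prefix is reused
      return;
    }
    for (String term : alternatives.get(pos)) {
      prefix.add(term);
      build(alternatives, pos + 1, prefix);
      prefix.remove(prefix.size() - 1);
    }
  }
}
```

With this shape the side effects happen inside a throwaway instance, so the public method reads as a pure transformation rather than as a getter that mutates state.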

mikemccand · 2016-12-30T10:31:45Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

+      // always use inc 1 while building, but save original increment
+      int fakePosInc = posInc > 1 ? 1 : posInc;
+
+      assert pos > -1 || fakePosInc > 0;


Can we upgrade this to a real if? I.e. we need a well-formed TokenStream input ... it cannot have posInc=0 on its first token.
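One way the assert might be turned into a real check is sketched below. The condition is the negation of the assert in the diff, assuming `pos` starts at -1 before the first token; the class name, method name, exception type, and message are assumptions, since the comment does not specify them.

```java
// Hypothetical validation helper; not the code that was committed.
final class FirstTokenCheck {

  // Rejects a stream whose first token (pos still below 0) has no positive increment.
  static void checkPosInc(int pos, int posInc) {
    if (pos < 0 && posInc < 1) {
      // Exception type is an assumption; the review only asks for a "real if".
      throw new IllegalStateException(
          "first token must have a position increment of at least 1, got " + posInc);
    }
  }
}
```

Unlike an assert, this check still runs when assertions are disabled, so a malformed TokenStream is rejected in production as well as in tests.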

mattweber · 2016-12-30T16:50:56Z

@mikemccand I addressed your comments. I also added some more tests and fixed a bug, now covered by tests, that yielded the wrong increment when a previously seen term was found again with an increment of 0. I have squashed these changes with the previous commit, so it is easy to see the difference between the original PR, which did not support position increments, and the new one that does.

dsmiley

Overall really nice Matt.

dsmiley · 2016-12-30T16:53:47Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

@@ -80,22 +77,41 @@ public boolean incrementToken() throws IOException {
    }
  }

+  private GraphTokenStreamFiniteStrings() {
+    this.builder = new Automaton.Builder();


The other fields are initialized at the declaration; might as well move this here too?

dsmiley · 2016-12-30T17:10:55Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

+    assert term != null;
+    boolean isStackedGap = incr == 0 && prevIncr > 1;
+    boolean hasGap = incr > 1;
+    term = BytesRef.deepCopyOf(term);


The deepCopyOf is only needed if you generate a new ID, not for an existing one.

BTW... have you seen BytesRefHash? I think re-using that could minimize the code here to deal with this stuff.
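The first point, copying only when a new ID is generated, can be sketched with plain JDK types. This is an illustration under assumptions, not the PR's code: `ByteBuffer` stands in for `BytesRef`, `Arrays.copyOf` stands in for `BytesRef.deepCopyOf`, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical id table that defers the defensive copy until a term is new.
final class LazyCopyTermIds {
  private final Map<ByteBuffer, Integer> ids = new HashMap<>();

  int idFor(byte[] reusedBuffer, int length) {
    // Lookup uses a view over the caller's buffer: no bytes are copied here.
    ByteBuffer view = ByteBuffer.wrap(reusedBuffer, 0, length);
    Integer id = ids.get(view);
    if (id == null) {
      id = ids.size();
      // Only now copy, because the map key must survive later buffer reuse.
      ids.put(ByteBuffer.wrap(Arrays.copyOf(reusedBuffer, length)), id);
    }
    return id;
  }
}
```

The copy is needed for stored keys because token attribute buffers are typically reused between tokens; for a term that is already in the map, the lookup alone is enough.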

dsmiley · 2016-12-30T17:16:06Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

@@ -210,85 +199,41 @@ private void finish() {
   */
  private void finish(int maxDeterminizedStates) {
    Automaton automaton = builder.finish();
-


So all this code here removed wasn't needed after all? It's nice to see it all go away (less to maintain / less complexity) :-)

dsmiley · 2016-12-30T17:26:29Z

lucene/core/src/java/org/apache/lucene/util/graph/GraphTokenStreamFiniteStrings.java

+  /**
+   * Builds automaton and builds the finite string token streams.
+   */
+  private List<TokenStream> process(final TokenStream in) throws IOException {
    build(in);

    List<TokenStream> tokenStreams = new ArrayList<>();
    final FiniteStringsIterator finiteStrings = new FiniteStringsIterator(det);
    for (IntsRef string; (string = finiteStrings.next()) != null; ) {
      final BytesRef[] tokens = new BytesRef[string.length];


Hmm; rather than materializing an array of tokens and increments, maybe you could simply give the IntsRefString to BytesRefArrayTokenStream (and make BRATS not static) so that it could do this on the fly? Not a big deal either way (current or my proposal). If you do as I suggest then BRATS would no longer be a suitable name; maybe simply FiniteStringTokenStream or CustomTokenStream.

mattweber · 2016-12-30T20:14:06Z

Thanks @dsmiley! I have just pushed up code with your suggestions except for using BytesRefHash due to the fact we might have the same BytesRef but need a different id because we have position gap.

This has been great, love the feedback!

mikemccand · 2016-12-31T10:59:56Z

This change looks great to me! What an awesome improvement, to properly use graph token streams at search time so multi-token synonyms are correct.

I'll push this in a few days once I'm back home unless someone pushes first (@dsmiley feel free)...

Thank you @mattweber!

mikemccand · 2017-01-03T10:22:35Z

I've merged this into Lucene's master (7.0), and I'm working on 6.x (#130) now. Thanks @mattweber! Can you close this?

mattweber force-pushed the LUCENE-7603 branch from 747fb0d to 85e5c7c on December 28, 2016 17:56

dsmiley reviewed Dec 29, 2016

mikemccand reviewed Dec 30, 2016

mattweber force-pushed the LUCENE-7603 branch from d8cb393 to 81230f5 on December 30, 2016 16:46

dsmiley requested changes Dec 30, 2016

mattweber force-pushed the LUCENE-7603 branch from 81230f5 to 97f32bf on December 30, 2016 20:10

Support Graph Token Streams in QueryBuilder

7d67767

Adds support for handling graph token streams inside the QueryBuilder util class used by query parsers.

mattweber force-pushed the LUCENE-7603 branch from 97f32bf to 7d67767 on December 31, 2016 16:30

mattweber closed this Jan 3, 2017

markrmiller added a commit that referenced this pull request Jul 14, 2020

#129 - Flakey test.

472aafa
