LUCENE-9142 Refactor IntSet operations for determinize #1184
Thank you for tackling this @madrob!
@@ -77,6 +88,7 @@ public void incr(int num) {
  values[i] = num;
  counts[i] = 1;
  upto++;
  stale = true;
How come the code didn't need/use `stale` before?
Because it didn't track whether the cached hash code was stale and relied on the caller to do it. I feel like it is more clear to manage it internally.
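A minimal sketch of the pattern described here, in which the set itself tracks whether its cached hash code is out of date. The class and method names (`CachedHashIntList`, `add`) are hypothetical and are not taken from the PR; only the `stale` flag mirrors the diff above.

```java
import java.util.Arrays;

// Hypothetical class for illustration only; this is not the PR's SortedIntSet.
final class CachedHashIntList {
  private int[] values = new int[4];
  private int size;
  private int cachedHash;
  // True whenever cachedHash may no longer match the contents.
  private boolean stale = true;

  void add(int v) {
    if (size == values.length) {
      values = Arrays.copyOf(values, size * 2);
    }
    values[size++] = v;
    stale = true; // every mutation invalidates the cached hash
  }

  @Override
  public int hashCode() {
    if (stale) {
      // Recompute lazily, and only over the active elements.
      int h = size;
      for (int i = 0; i < size; i++) {
        h = 31 * h + values[i];
      }
      cachedHash = h;
      stale = false;
    }
    return cachedHash;
  }
}
```

With the flag kept inside the class, callers cannot forget to recompute the hash after a mutation, which seems to be the clarity benefit argued for above.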
oh, found the replacement we should use: ArrayUtil.copyOfSubArray
  int[] counts;
  int upto;
  private int hashCode;

  // Tracks if the hashCode computation is out of date and also if the array is out of sync with the map
Isn't the array discarded as soon as we switch over to the map (so how could it be out of sync)?
The array is used for equality comparison.
@RunWith(com.carrotsearch.randomizedtesting.RandomizedRunner.class)
public class TestIntSet {
  @Test
We don't need `@Test` annotations -- the `LuceneTestCase` runner knows `test*` methods are tests.
I think you can remove these `@Test` annotations?
  // If we hold more than this many states, we switch from
  // O(N^2) linear ops to O(N log(N)) TreeMap
- private final static int TREE_MAP_CUTOVER = 30;
+ private final static int TREE_MAP_CUTOVER = 32;

  private final Map<Integer,Integer> map = new TreeMap<>();
Would be nice to use a native map from HPPC instead -- does HPPC have a sorted map? Hmm, but this is core, and we don't have any dependencies in Lucene's core :)
Also, I suspect that branch of code that was (asymmetrically) comparing
One other consideration is whether we need the array/map split at all. In small cases, where the map overhead may be comparatively large, the whole workload may be small enough that it doesn't matter. I haven't done performance profiling to determine this, but note that once we switch to the map, we don't go back to arrays when we dip back below the threshold, only when we get down to zero.
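The one-way switch described here can be sketched as follows. This is a simplified illustration, not the PR's code: the class name `CutoverCountingSet`, the tiny `CUTOVER` value, and the unsorted arrays are all assumptions made to keep the example short.

```java
import java.util.TreeMap;

// Hypothetical sketch for illustration; not the PR's SortedIntSet.
final class CutoverCountingSet {
  // Deliberately tiny so the switch is easy to observe.
  private static final int CUTOVER = 4;

  private final int[] values = new int[CUTOVER];
  private final int[] counts = new int[CUTOVER];
  private int upto;
  private final TreeMap<Integer, Integer> map = new TreeMap<>();
  private boolean useMap;

  void incr(int num) {
    if (useMap) {
      map.merge(num, 1, Integer::sum);
      return;
    }
    for (int i = 0; i < upto; i++) {
      if (values[i] == num) {
        counts[i]++;
        return;
      }
    }
    if (upto == CUTOVER) {
      // Arrays are full: move everything into the map and stay there.
      for (int i = 0; i < upto; i++) {
        map.put(values[i], counts[i]);
      }
      upto = 0;
      useMap = true;
      map.put(num, 1);
      return;
    }
    values[upto] = num;
    counts[upto] = 1;
    upto++;
  }

  void decr(int num) {
    if (useMap) {
      // Drop the key once its count reaches zero.
      map.computeIfPresent(num, (k, c) -> c == 1 ? null : c - 1);
      if (map.isEmpty()) {
        useMap = false; // the only path back to the arrays
      }
      return;
    }
    for (int i = 0; i < upto; i++) {
      if (values[i] == num) {
        if (--counts[i] == 0) {
          // Fill the hole with the last active slot.
          upto--;
          values[i] = values[upto];
          counts[i] = counts[upto];
        }
        return;
      }
    }
  }

  int size() {
    return useMap ? map.size() : upto;
  }

  boolean usingMap() {
    return useMap;
  }
}
```

In this sketch, dropping from five distinct values back to two leaves the set in map mode; it returns to the arrays only after the last value is removed.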
Split SortedIntSet into a class hierarchy to make comparisons to FrozenIntSet more meaningful. Use Arrays.equals for more efficient comparison.
Started over on this, tried to be less ambitious but still improving the situation I think.
@mikemccand @dweiss @uschindler can you take another look?
Changes look great! So you solved the crazy asymmetric `equals` bug from before by fixing the inheritance so now it's always `IntSet.equals` that runs, nice.
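As a rough illustration of why a single inherited `equals` removes the asymmetry: when both the mutable and the frozen variant extend one base class whose `equals` is `final`, `a.equals(b)` and `b.equals(a)` always run the same code. The names below (`BaseIntSet`, `GrowableSet`, `FrozenSet`, `getArray`, `size`) are hypothetical stand-ins, not the PR's actual classes or signatures.

```java
import java.util.Arrays;

// Hypothetical base class; both variants share one equals/hashCode.
abstract class BaseIntSet {
  abstract int[] getArray(); // backing array, may be longer than size()
  abstract int size();       // number of active elements

  @Override
  public final boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof BaseIntSet)) return false;
    BaseIntSet that = (BaseIntSet) o;
    // Ranged overload (Java 9+): compares active elements only.
    return Arrays.equals(getArray(), 0, size(), that.getArray(), 0, that.size());
  }

  @Override
  public final int hashCode() {
    int[] a = getArray();
    int h = size();
    for (int i = 0; i < size(); i++) {
      h = 31 * h + a[i];
    }
    return h;
  }
}

final class FrozenSet extends BaseIntSet {
  private final int[] values;

  FrozenSet(int[] values) {
    this.values = values.clone();
  }

  @Override
  int[] getArray() {
    return values;
  }

  @Override
  int size() {
    return values.length;
  }
}

final class GrowableSet extends BaseIntSet {
  private int[] values = new int[8];
  private int upto;

  void add(int v) {
    if (upto == values.length) {
      values = Arrays.copyOf(values, upto * 2);
    }
    values[upto++] = v;
  }

  FrozenSet freeze() {
    return new FrozenSet(Arrays.copyOf(values, upto));
  }

  @Override
  int[] getArray() {
    return values;
  }

  @Override
  int size() {
    return upto;
  }
}
```

Marking `equals` and `hashCode` as `final` in the base class is a design choice of this sketch; it prevents a subclass from reintroducing a one-sided comparison.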
I wonder if any of the `luceneutil` queries would exercise (benchmark) these classes? I don't think so? Who are the heavy consumers of `determinize`?
Fuzzy queries using multi-byte code points. Spell check (I think again, with multi-byte points). Possibly some of the graph queries, but I don't completely understand the context of how those build up their automata.
* LUCENE-9142 Refactor SortedIntSet for equality. Split SortedIntSet into a class hierarchy to make comparisons to FrozenIntSet more meaningful. Use Arrays.equals for more efficient comparison. Add tests for IntSet to verify correctness.
* LUCENE-9142 Refactor IntSet operations for determinize (#1184) Co-authored-by: Mike <madrob@users.noreply.github.com>
…ld not clone bitsets repeatedly (apache#1184) Co-authored-by: David Smiley <dsmiley@salesforce.com>
Fix a bug where a frozen set could be symmetrically unequal to the
sorted set that created it because we compared the backing array
instead of only the active elements.
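The pitfall in that commit message can be shown with a small sketch: a growable set's backing array usually has unused trailing slots, so comparing whole arrays against a frozen copy fails even when the active elements match. The helper name `sameActiveElements` is hypothetical; the ranged `Arrays.equals` overload it calls is part of the JDK since Java 9.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not code from the PR.
final class ActiveRangeEquality {
  // Compares only the first aSize and bSize entries of each array.
  // Ranges of different lengths are never equal.
  static boolean sameActiveElements(int[] a, int aSize, int[] b, int bSize) {
    return Arrays.equals(a, 0, aSize, b, 0, bSize);
  }
}
```

For a backing array `{1, 2, 3, 0, 0, 0, 0, 0}` with three active elements and a frozen array `{1, 2, 3}`, the ranged comparison treats them as equal, whereas a whole-array comparison would not.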