Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9142 Refactor IntSet operations for determinize #1184

Merged
merged 3 commits into from Feb 6, 2020

Conversation

madrob
Copy link
Contributor

@madrob madrob commented Jan 18, 2020

Fix a bug where a frozen set could be symmetrically unequal to the
sorted set that created it because we compared the backing array
instead of only the active elements.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for tackling this @madrob!

@@ -77,6 +88,7 @@ public void incr(int num) {
values[i] = num;
counts[i] = 1;
upto++;
stale = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come the code didn't need/use stale before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it didn't track whether the cached hash code was stale and relied on the caller to do it. I feel like it is more clear to manage it internally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, found the replacement we should use: ArrayUtil.copyOfSubArray

int[] counts;
int upto;
private int hashCode;

// Tracks if the hashCode computation is out of date and also if the array is out of sync with the map
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the array discarded as soon as we switch over to the map (so how could it be out of sync)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The array is used for equality comparison.


@RunWith(com.carrotsearch.randomizedtesting.RandomizedRunner.class)
public class TestIntSet {
@Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need @Test annotations -- LuceneTestCase runner knows test* methods are tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove these @Test annotations?


// If we hold more than this many states, we switch from
// O(N^2) linear ops to O(N log(N)) TreeMap
private final static int TREE_MAP_CUTOVER = 30;
private final static int TREE_MAP_CUTOVER = 32;

private final Map<Integer,Integer> map = new TreeMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to use a native map from HPPC instead -- does HPPC have a sorted map? Hmm, but this is core, and we don't have any dependencies in Lucene's core :)

@mikemccand
Copy link
Member

Also, I suspect that branch of code that was (asymmetrically) comparing IntSet and FrozenIntSet where it was using values.length instead of upto was actually dead code. Maybe we can remove that comparison path?

Copy link
Contributor Author

@madrob madrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One there consideration is whether we need to do the array/map split at all. For small cases where the map overhead may be comparatively large, the total case may be small enough that it doesn't matter. I haven't done performance profiling here to determine this, but note that once we switch to map we don't go back to arrays until we get down to zero, not if we dip back below the threshold.

Split SortedIntSet into a class heirarchy to make comparisons to
FrozenIntSet more meaningful. Use Arrays.equals for more efficient
comparison.
@madrob
Copy link
Contributor Author

madrob commented Jan 23, 2020

Started over on this, tried to be less ambitious but still improving the situation I think.

Unclear why precommit failed according to GitHub, when I click through it looks like it succeeds?
Needed to refresh for this.

@mikemccand @dweiss @uschindler can you take another look?

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great! So you solved the crazy asymmetric equals bug from before by fixing the inheritance so now it's always IntSet.equals that runs, nice.

I wonder if any of the luceneutil queries would exercise (benchmark) these classes? I don't think so? Who are the heavy consumers of determinize?


@RunWith(com.carrotsearch.randomizedtesting.RandomizedRunner.class)
public class TestIntSet {
@Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove these @Test annotations?

@madrob
Copy link
Contributor Author

madrob commented Feb 4, 2020

Who are the heavy consumers of determinize?

Fuzzy queries using multi-byte code points. Spell check (I think again, with multi-byte points). Possibly some of the graph queries, but I don't completely understand the context of how those build up their automata.

@madrob madrob merged commit abd282d into apache:master Feb 6, 2020
@madrob madrob deleted the lucene9142 branch February 6, 2020 20:42
zhaih pushed a commit to zhaih/lucene-solr that referenced this pull request Jun 23, 2021
* LUCENE-9142 Refactor SortedIntSet for equality

Split SortedIntSet into a class heirarchy to make comparisons to
FrozenIntSet more meaningful. Use Arrays.equals for more efficient
comparison. Add tests for IntSet to verify correctness.
mikemccand pushed a commit that referenced this pull request Jun 23, 2021
* LUCENE-9142 Refactor IntSet operations for determinize (#1184)

Co-authored-by: Mike <madrob@users.noreply.github.com>
risdenk added a commit to risdenk/lucene-solr that referenced this pull request Nov 22, 2022
…ld not clone bitsets repeatedly (apache#1184)

Co-authored-by: David Smiley <dsmiley@salesforce.com>
risdenk added a commit to risdenk/lucene-solr that referenced this pull request Nov 22, 2022
…ld not clone bitsets repeatedly (apache#1184)

Co-authored-by: David Smiley <dsmiley@salesforce.com>
risdenk added a commit that referenced this pull request Nov 28, 2022
…ld not clone bitsets repeatedly (#1184) (#2675)

Co-authored-by: David Smiley <dsmiley@salesforce.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants