Cleanup TermsHashPerField #1573

s1monw · 2020-06-12T15:18:21Z

Several classes within the IndexWriter indexing chain haven't been touched for several years. Most of them expose their internals through public members and are difficult to construct in tests because they depend on many other classes. This change cleans up TermsHashPerField and adds a dedicated standalone test for it, which should make the class easier for other developers to understand and work with. The refactoring also improves the documentation along the way.

mikemccand

This looks like a great cleanup -- thanks @s1monw

I left some small comments.

I think we should run an indexing throughput benchmark to make sure there's no real impact here. I'll do that.

lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java

lucene/core/src/java/org/apache/lucene/index/ParallelPostingsArray.java

mikemccand · 2020-06-13T16:33:30Z

lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java

@@ -56,12 +56,6 @@ public FreqProxTermsWriterPerField(FieldInvertState invertState, TermsHash terms
  @Override
  void finish() throws IOException {
    super.finish();
-    sumDocFreq += fieldState.uniqueTermCount;
-    sumTotalTermFreq += fieldState.length;


Hmm, did these aggregations move somewhere else? Oh, they look entirely removed? Were they redundant (computed elsewhere) and these ones were unused?

sumDocFreq and sumTotalTermFreq are unused. They were used in FreqProxFields in the past, but they haven't been for a while now. I removed their commented-out usage in a follow-up commit so you can see it.

see this https://github.com/apache/lucene-solr/pull/1573/files#diff-aa6c5376b6b755262430916164fd0088L84

mikemccand · 2020-06-13T16:36:04Z

lucene/core/src/java/org/apache/lucene/index/TermVectorsConsumerPerField.java

@@ -222,7 +234,7 @@ void writeProx(TermVectorsPostingsArray postings, int termID) {
  }

  @Override
-  void newTerm(final int termID) {
+  void newTerm(final int termID, final int docID) {


Hmm docID is unused in this method? But I guess the other impl (normal postings) needs it?

that's correct.
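
To illustrate the exchange above, here is a minimal sketch of why a shared callback signature can carry a parameter that one implementation ignores. The class names (`PerFieldSketch`, `PostingsSketch`, `TermVectorsSketch`) and their fields are hypothetical stand-ins, not Lucene's actual classes; only the `newTerm(termID, docID)` shape comes from the diff.

```java
import java.util.ArrayList;
import java.util.List;

public class NewTermSketch {

  /** Hypothetical base class: both consumers share one callback signature. */
  public abstract static class PerFieldSketch {
    public abstract void newTerm(int termID, int docID);
  }

  /** Assumed postings-style consumer: it needs docID to remember where the term was first seen. */
  public static class PostingsSketch extends PerFieldSketch {
    public final List<Integer> lastDocIDs = new ArrayList<>();

    @Override
    public void newTerm(int termID, int docID) {
      // Grow the list so termID is a valid index, then record the doc.
      while (lastDocIDs.size() <= termID) {
        lastDocIDs.add(-1);
      }
      lastDocIDs.set(termID, docID);
    }
  }

  /** Assumed term-vectors-style consumer: it works per document, so docID goes unused. */
  public static class TermVectorsSketch extends PerFieldSketch {
    public int newTermCount;

    @Override
    public void newTerm(int termID, int docID) {
      // docID is intentionally ignored here; the parameter exists for the other implementation.
      newTermCount++;
    }
  }

  public static void main(String[] args) {
    PostingsSketch postings = new PostingsSketch();
    TermVectorsSketch vectors = new TermVectorsSketch();
    for (PerFieldSketch consumer : List.of(postings, vectors)) {
      consumer.newTerm(0, 7);
    }
    System.out.println(postings.lastDocIDs + " " + vectors.newTermCount);
  }
}
```

The caller can drive both consumers through the same method, at the cost of one unused argument in the implementation that does not need it.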

lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java

s1monw · 2020-06-14T08:31:13Z

> I think we should run an indexing throughput benchmark to make sure there's no real impact here. I'll do that.

+1 thanks @mikemccand

lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java

dweiss

I really like the changes you made, Simon.

lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java

lucene/core/src/test/org/apache/lucene/index/TestTermsHashPerField.java

mikemccand · 2020-06-16T12:29:49Z

I tested indexing throughput with luceneutil on wikimediumall, using a single indexing thread and SerialMergeScheduler:

[mike@beast3 facet]$ grep "GB/hour" /l/logs/simon?
/l/logs/simon0:Indexer: 46.44432470391602 GB/hour plain text
/l/logs/simon1:Indexer: 46.267723012921515 GB/hour plain text
/l/logs/simon2:Indexer: 46.26414201429784 GB/hour plain text
[mike@beast3 facet]$ grep "GB/hour" /l/logs/trunk?
/l/logs/trunk0:Indexer: 45.632881600179495 GB/hour plain text
/l/logs/trunk1:Indexer: 46.09383252131896 GB/hour plain text
/l/logs/trunk2:Indexer: 45.439666582156924 GB/hour plain text

Net/net this change might be a bit faster, or just noise, so all clear to push! Thanks @s1monw.
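
For anyone who wants to compare the six runs above numerically, the following is a small sketch that averages each group and reports the relative difference. The class and method names are made up for illustration; the numbers are copied from the log lines above.

```java
import java.util.Arrays;

public class ThroughputSummary {

  // GB/hour values copied from /l/logs/simon0..2 (this PR) and /l/logs/trunk0..2.
  public static final double[] PR_RUNS = {
    46.44432470391602, 46.267723012921515, 46.26414201429784
  };
  public static final double[] TRUNK_RUNS = {
    45.632881600179495, 46.09383252131896, 45.439666582156924
  };

  /** Arithmetic mean of the given runs (NaN for an empty array). */
  public static double mean(double[] values) {
    return Arrays.stream(values).average().orElse(Double.NaN);
  }

  /** Relative change of candidate versus baseline, in percent. */
  public static double relativeChangePercent(double baseline, double candidate) {
    return 100.0 * (candidate - baseline) / baseline;
  }

  public static void main(String[] args) {
    double pr = mean(PR_RUNS);
    double trunk = mean(TRUNK_RUNS);
    System.out.printf("PR mean:    %.3f GB/hour%n", pr);
    System.out.printf("trunk mean: %.3f GB/hour%n", trunk);
    System.out.printf("change:     %.2f%%%n", relativeChangePercent(trunk, pr));
  }
}
```

The means come out to roughly 46.33 versus 45.72 GB/hour, a difference of about 1.3%. With only three runs per side, that is compatible with either a small speedup or run-to-run noise.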

s1monw · 2020-06-16T12:32:02Z

thanks @mikemccand - I will run tests again and push.

mikemccand · 2020-06-16T12:35:00Z

> thanks @mikemccand - I will run tests again and push.

++

s1monw added 9 commits June 3, 2020 18:16

first cut

a17d9c0

fo

26d204b

test

103740f

foo

4b2b7c9

Merge branch 'master' into cleanup_terms_hash_per_field

2810b5b

Merge branch 'master' into cleanup_terms_hash_per_field

9930367

remove unnecessary member

57d0043

add more docs and rename vars to improve readability

d3d20dc

fix typo

acacaab

mikemccand reviewed Jun 13, 2020

apply feedback

6b02c1b

dweiss reviewed Jun 14, 2020

lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java (outdated, resolved)

dweiss reviewed Jun 14, 2020

lucene/core/src/java/org/apache/lucene/index/TermsHashPerField.java (outdated, resolved)

lucene/core/src/test/org/apache/lucene/index/TestTermsHashPerField.java (outdated, resolved)

s1monw added 2 commits June 15, 2020 20:13

apply feedback

4ef7fcb

Merge branch 'master' into cleanup_terms_hash_per_field

e04aa19

Merge branch 'master' into cleanup_terms_hash_per_field

5ebbfa1

s1monw merged commit c083e54 into apache:master Jun 16, 2020

s1monw deleted the cleanup_terms_hash_per_field branch June 16, 2020 12:45
