Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup TermsHashPerField #1573

Merged
merged 13 commits into from
Jun 16, 2020
Merged

Conversation

s1monw
Copy link
Member

@s1monw s1monw commented Jun 12, 2020

Several classes within the IndexWriter indexing chain haven't been touched for several years. Most of these classes expose their internals through public members and are difficult to construct in tests since they depend on many other classes. This change tries to clean up TermsHashPerField and adds a dedicated standalone test for it to make it more accessible for other developers since it's simpler to understand. There are also attempts to make documentation better as a result of this refactoring.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great cleanup -- thanks @s1monw

I left some small comments.

I think we should run indexing throughput benchmark to make sure there's no real impact here. I'll do that.

@@ -56,12 +56,6 @@ public FreqProxTermsWriterPerField(FieldInvertState invertState, TermsHash terms
@Override
void finish() throws IOException {
super.finish();
sumDocFreq += fieldState.uniqueTermCount;
sumTotalTermFreq += fieldState.length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, did these aggregations move somewhere else? Oh, they look entirely removed? Were they redundant (computed elsewhere) and these ones were unused?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sumDocFreq and sumTotalTermFreq are unused. They were used in FreqProxFields in the past but not anymore for a while now. I removed their commented out usage so you can see it in a followup commit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -222,7 +234,7 @@ void writeProx(TermVectorsPostingsArray postings, int termID) {
}

@Override
void newTerm(final int termID) {
void newTerm(final int termID, final int docID) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm docID is unused in this method? But I guess the other impl (normal postings) needs it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's correct.

@s1monw
Copy link
Member Author

s1monw commented Jun 14, 2020

I think we should run indexing throughput benchmark to make sure there's no real impact here. I'll do that.

+1 thanks @mikemccand

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the changes you made, Simon.

@mikemccand
Copy link
Member

I tested indexing throughput on luceneutil with wikimediumall, single thread for indexing SerialMergeScheduler:

[mike@beast3 facet]$ grep "GB/hour" /l/logs/simon?
/l/logs/simon0:Indexer: 46.44432470391602 GB/hour plain text
/l/logs/simon1:Indexer: 46.267723012921515 GB/hour plain text
/l/logs/simon2:Indexer: 46.26414201429784 GB/hour plain text
[mike@beast3 facet]$ grep "GB/hour" /l/logs/trunk?
/l/logs/trunk0:Indexer: 45.632881600179495 GB/hour plain text
/l/logs/trunk1:Indexer: 46.09383252131896 GB/hour plain text
/l/logs/trunk2:Indexer: 45.439666582156924 GB/hour plain text

Net/net this change might be a bit faster, or just noise, so all clear to push! Thanks @s1monw.

@s1monw
Copy link
Member Author

s1monw commented Jun 16, 2020

thanks @mikemccand - I will run tests again and push.

@mikemccand
Copy link
Member

thanks @mikemccand - I will run tests again and push.

++

@s1monw s1monw merged commit c083e54 into apache:master Jun 16, 2020
@s1monw s1monw deleted the cleanup_terms_hash_per_field branch June 16, 2020 12:45
s1monw added a commit that referenced this pull request Jun 16, 2020
Several classes within the IndexWriter indexing chain haven't been touched for 
several years. Most of these classes expose their internals through public
members and are difficult to construct in tests since they depend on many other
classes. This change tries to clean up TermsHashPerField and adds a dedicated
standalone test for it to make it more accessible for other developers since
it's simpler to understand. There are also attempts to make documentation better
as a result of this refactoring.
MarcusSorealheis added a commit to MarcusSorealheis/lucene-solr that referenced this pull request Jun 20, 2020
* upstream/master: (218 commits)
  LUCENE-9412 Do not validate jenkins HTTPS cert
  LUCENE-8962: add ability to selectively merge on commit (apache#1552)
  Replace DWPT.DocState with simple method parameters (apache#1594)
  LUCENE-9402: Let MultiCollector handle minCompetitiveScore (apache#1567)
  SOLR-14574: Fix or suppress warnings in solr/core/src/test (part 2)
  SOLR-14561 CoreAdminAPI's parameters instanceDir and dataDir are now validated (apache#1572)
  SOLR-14532: Add *.iml files to gitignore
  SOLR-14577: Return BAD REQUEST when field is missing in terms QP (apache#1588)
  SOLR-14574: Fix or suppress warnings in solr/core/src/test (part 1)
  remove debug code
  LUCENE-9408: roll back only called once enforcement
  LUCENE-8962: Allow waiting for all merges in a merge spec (apache#1585)
  SOLR-14572 document missing SearchComponents (apache#1581)
  LUCENE-9359: Avoid test failures when the extra file is a dir.
  SOLR-14573: Fix or suppress warnings in solrj/src/test
  LUCENE-9353: Move terms metadata to its own file. (apache#1473)
  Cleanup TermsHashPerField (apache#1573)
  SOLR-14558: Record all log lines in SolrLogPostTool (apache#1570)
  LUCENE-9404: simplify checksum calculation of ByteBuffersIndexOutput
  LUCENE-9403: tune BufferedChecksum.DEFAULT_BUFFERSIZE
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants