Speed up NumericDocValuesWriter with index sorting #12381

Merged on Jun 30, 2023 (4 commits)

Contributor

@easyice easyice commented Jun 22, 2023

Description

As in pr-399, where DocsWithFieldSet#add() avoids creating a FixedBitSet instance in the dense case, NumericDocValuesWriter#sortDocValues() can do the same thing. The only part I'm unsure about is whether my way of distinguishing dense from sparse is rigorous enough.
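For readers who have not seen the dense-case trick referred to above, here is a minimal sketch of the idea. It is not Lucene's code: `LazyDocSet` is a hypothetical class, and `java.util.BitSet` stands in for `FixedBitSet`. While doc IDs arrive as 0, 1, 2, ... only a counter is kept; a bitset is allocated on the first gap.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a "docs with field" set; an illustration only. */
final class LazyDocSet {
  private BitSet bits;         // stays null while doc IDs arrive as 0, 1, 2, ...
  private int cardinality;     // number of docs added so far
  private int lastDocId = -1;  // used only to enforce increasing order

  /** Adds a doc ID; IDs must be added in strictly increasing order. */
  void add(int docId) {
    if (docId <= lastDocId) {
      throw new IllegalArgumentException("doc IDs must be added in increasing order");
    }
    if (bits != null) {
      bits.set(docId);
    } else if (docId != cardinality) {
      // First gap: materialize the bitset and back-fill docs 0..cardinality-1.
      bits = new BitSet(docId + 1);
      bits.set(0, cardinality);
      bits.set(docId);
    }
    lastDocId = docId;
    cardinality++;
  }

  boolean hasBitSet() {
    return bits != null;
  }

  int cardinality() {
    return cardinality;
  }

  boolean contains(int docId) {
    return bits != null ? bits.get(docId) : docId >= 0 && docId < cardinality;
  }

  public static void main(String[] args) {
    LazyDocSet dense = new LazyDocSet();
    for (int doc = 0; doc < 1000; doc++) {
      dense.add(doc);
    }
    System.out.println("dense allocated a bitset: " + dense.hasBitSet());

    LazyDocSet sparse = new LazyDocSet();
    sparse.add(0);
    sparse.add(1);
    sparse.add(5); // gap at 2..4 forces allocation
    System.out.println("sparse allocated a bitset: " + sparse.hasBitSet());
  }
}
```

In the dense run no bitset is ever allocated, which is the saving this PR tries to carry over to the sorting path.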

In the benchmark below, which writes ten SortedNumericDocValuesField fields per document, the optimization saves ~7% of commit time.

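import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortedNumericSortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class IndexBenchMarksNDV {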

  public static void main(final String[] args) throws Exception {
    doWriteNDV();
  }

  static void doWriteNDV() throws IOException {
    BenchMark benchMark = new BenchMark(5, 5, 1000000);
    benchMark.run();
  }

  static class BenchMark {
    final int warmup;
    final int numValues;
    final int loopCount;

    Directory dir;
    IndexWriter indexWriter;

    BenchMark(int warmup, int loopCount, int numValues) {
      this.warmup = warmup;
      this.numValues = numValues;
      this.loopCount = loopCount;
    }

    private void init() throws IOException {
      Path tempDir = Files.createTempDirectory(Paths.get("/Volumes/RamDisk"), "tmp");
      dir = MMapDirectory.open(tempDir);
      IndexWriterConfig iwc = new IndexWriterConfig(null);
      iwc.setMergePolicy(NoMergePolicy.INSTANCE);
      iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
      Sort indexSort = new Sort(new SortedNumericSortField("f1", SortField.Type.LONG));
      iwc.setIndexSort(indexSort);
      indexWriter = new IndexWriter(dir, iwc);
    }

    private void close() throws IOException {
      indexWriter.close();
      dir.close();
    }

    private long doWrite() throws IOException {
      for (int i = 0; i < numValues; i++) {
        Document document = new Document();
        for (int f = 0; f < 10; f++) {
          document.add(new SortedNumericDocValuesField("f" + f, i / 1000));
        }
        indexWriter.addDocument(document);
      }
      Document document = new Document();
      for (int f = 0; f < 10; f++) {
        document.add(new SortedNumericDocValuesField("f" + f, 1));
      }
      indexWriter.addDocument(document);
      // Only the commit() call is timed, not the addDocument() loop above.
      long t0 = System.nanoTime();
      indexWriter.commit();
      return System.nanoTime() - t0;
    }

    void run() throws IOException {
      init();
      for (int i = 0; i < warmup; i++) {
        doWrite();
      }
      System.gc();
      List<Double> times = new ArrayList<>();
      for (int i = 0; i < loopCount; i++) {
        long took = doWrite();
        times.add(took / 1000000D);
      }
      double min = times.stream().mapToDouble(Number::doubleValue).min().getAsDouble();
      System.out.println("took(ms):" + String.format(Locale.ROOT, "%.2f", min));
      close();
    }
  }
}

Contributor

@jpountz jpountz left a comment

I left some comments about the implementation but it looks like a good optimization to me. Can you also add a CHANGES entry under 9.8?

@@ -75,4 +75,9 @@ public DocIdSetIterator iterator() {
public int cardinality() {
return cardinality;
}

/** Return the FixedBitSet of this set. */
public FixedBitSet bitSet() {
Contributor

I would rather expose something like boolean dense() instead of the internal bitset.

Contributor Author

ok, it is fixed

@@ -111,4 +111,70 @@ public void or(DocIdSetIterator iter) throws IOException {
set(doc);
}
}

public static final BitSet all(int maxDoc) {
Contributor

We currently only have 2 implementations of BitSet, which the JVM optimizes better than N implementations. Could we remove this special BitSet implementation and use a special null marker instead to imply that all docs match?

Contributor Author

Great suggestion! I tested earlier that the virtual call can always be inlined when there are only 2 implementations.
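A rough sketch of the null-marker idea discussed here, under stated assumptions: `NullMarkerDocs` and `nextDoc` are hypothetical names, and `java.util.BitSet` stands in for Lucene's `BitSet`. Instead of adding a third "all docs match" implementation, a `null` bitset is taken to mean that every doc in `[0, maxDoc)` matches.

```java
import java.util.BitSet;

/** Illustration only: a null bitset is the marker for "all docs match". */
final class NullMarkerDocs {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /**
   * Returns the first matching doc at or after target, or NO_MORE_DOCS.
   * A null docsWithField means every doc in [0, maxDoc) matches.
   */
  static int nextDoc(BitSet docsWithField, int maxDoc, int target) {
    if (target >= maxDoc) {
      return NO_MORE_DOCS;
    }
    if (docsWithField == null) {
      // Dense case: the bounds check above already ran, so target itself matches.
      return target;
    }
    int next = docsWithField.nextSetBit(target); // -1 when no bit is set at or after target
    return (next == -1 || next >= maxDoc) ? NO_MORE_DOCS : next;
  }

  public static void main(String[] args) {
    BitSet sparse = new BitSet();
    sparse.set(2);
    sparse.set(7);
    System.out.println(nextDoc(null, 10, 3));   // dense: returns the target itself
    System.out.println(nextDoc(sparse, 10, 3)); // sparse: skips ahead to the next set bit
  }
}
```

The call sites keep seeing a single concrete bitset type, so no extra implementation is introduced for the dense case.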

Contributor Author

easyice commented Jun 29, 2023

@jpountz Thank you for the comments, they are very helpful to me. The code has been updated.

Contributor

@jpountz jpountz left a comment

I left some minor comments, it looks good to me otherwise.

@@ -76,7 +76,8 @@ public void flush(SegmentWriteState state, Sorter.DocMap sortMap, NormsConsumer
     NumericDocValuesWriter.sortDocValues(
         state.segmentInfo.maxDoc(),
         sortMap,
-        new BufferedNorms(values, docsWithField.iterator()));
+        new BufferedNorms(values, docsWithField.iterator()),
+        docsWithField.dense() && sortMap.size() == docsWithField.cardinality());
Contributor

Only testing dense() should be correct, right?

Suggested change
- docsWithField.dense() && sortMap.size() == docsWithField.cardinality());
+ docsWithField.dense());

Contributor Author

@easyice easyice Jun 30, 2023

No. DocsWithFieldSet only updates its bitset when a doc that actually has the field is added; in the sparse case, DocsWithFieldSet#add is not called for docs without the field. So if the first 64 docs all have the field and some docs without this field are added afterwards, docsWithField.dense() will still return true. I think we can remove docsWithField.dense() and use only sortMap.size() == docsWithField.cardinality() to detect the dense case, since sortMap.size() returns the number of documents for the LeafReader. What do you think?

Contributor

you're right! thanks for explaining

Contributor Author

@easyice easyice Jun 30, 2023

I have removed docsWithField.dense().
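The scenario from this thread can be reproduced with a small simulation. This is a sketch under assumptions, not the merged code: `Tracker` is a hypothetical stand-in for DocsWithFieldSet, `noGapSeen()` plays the role that `dense()` had in the earlier revision, and `maxDoc` plays the role of `sortMap.size()`.

```java
import java.util.BitSet;

/** Illustration only: why "no gap seen so far" does not imply "every doc has the field". */
final class DenseCheckDemo {

  /** Hypothetical tracker that only hears about docs that have the field. */
  static final class Tracker {
    BitSet bits;      // null until a gap in the added doc IDs is observed
    int cardinality;  // number of docs that have the field

    void add(int docId) {
      if (bits != null) {
        bits.set(docId);
      } else if (docId != cardinality) {
        bits = new BitSet(docId + 1);
        bits.set(0, cardinality);
        bits.set(docId);
      }
      cardinality++;
    }

    /** True when every add() so far continued the run 0, 1, 2, ... */
    boolean noGapSeen() {
      return bits == null;
    }
  }

  /** The check kept in the thread: compare the count with the segment's doc count. */
  static boolean allDocsHaveField(Tracker tracker, int maxDoc) {
    return tracker.cardinality == maxDoc;
  }

  public static void main(String[] args) {
    Tracker tracker = new Tracker();
    for (int doc = 0; doc < 64; doc++) {
      tracker.add(doc); // docs 0..63 have the field
    }
    int maxDoc = 100;   // docs 64..99 lack the field, so add() is never called for them

    System.out.println("noGapSeen = " + tracker.noGapSeen());                   // true
    System.out.println("allDocsHaveField = " + allDocsHaveField(tracker, maxDoc)); // false
  }
}
```

Trailing docs without the field never reach the tracker, so only the comparison against the total doc count can tell the two cases apart.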

sortMap.size(),
sortMap,
oldValues,
docsWithField.dense() && sortMap.size() == docsWithField.cardinality());
Contributor

and likewise here?

Suggested change
- docsWithField.dense() && sortMap.size() == docsWithField.cardinality());
+ docsWithField.dense());

if (target < maxDoc) {
return target;
}
return DocIdSetIterator.NO_MORE_DOCS;
Contributor

Looking at the call site, this is only called when target is less than maxDoc, so this could be simplified to just return target, without the target < maxDoc check?

Contributor Author

Nice catch, it's updated

@jpountz jpountz merged commit 01200b5 into apache:main Jun 30, 2023
4 checks passed
@jpountz jpountz added this to the 9.8.0 milestone Jun 30, 2023
jpountz pushed a commit to jpountz/lucene that referenced this pull request Jun 30, 2023