Speed up NumericDocValuesWriter with index sorting #12381

easyice · 2023-06-22T15:36:12Z

Description

Like pr-399, where DocsWithFieldSet#add() avoids creating a FixedBitSet instance in the dense case, we can do the same thing in NumericDocValuesWriter#sortDocValues(). The only open point is how to decide between dense and sparse; I'm not sure my check is rigorous enough.

The benchmark below writes ten SortedNumericDocValuesField fields per document; the optimization saves ~7% of commit time.
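A minimal sketch of the idea, assuming a simplified stand-in for DocsWithFieldSet: the class DenseAwareDocSet and its method names are hypothetical, and it uses java.util.BitSet rather than Lucene's FixedBitSet. As long as doc IDs arrive as 0, 1, 2, ... no bitset is allocated; one is materialized only at the first gap.

```java
import java.util.BitSet;

/** Hypothetical, simplified stand-in for DocsWithFieldSet; not Lucene's actual class. */
final class DenseAwareDocSet {
  private BitSet bits; // stays null while doc IDs arrive as 0, 1, 2, ...
  private int cardinality;
  private int lastDocId = -1;

  /** Doc IDs must be added in strictly increasing order. */
  void add(int docId) {
    if (docId <= lastDocId) {
      throw new IllegalArgumentException("doc IDs must be added in increasing order");
    }
    if (bits == null && docId != cardinality) {
      // First gap: materialize the bitset and back-fill the dense prefix.
      bits = new BitSet(docId + 1);
      bits.set(0, cardinality);
    }
    if (bits != null) {
      bits.set(docId);
    }
    lastDocId = docId;
    cardinality++;
  }

  int cardinality() {
    return cardinality;
  }

  /** True once a gap has forced the bitset to be allocated. */
  boolean usesBitSet() {
    return bits != null;
  }

  boolean contains(int docId) {
    if (docId < 0) {
      return false;
    }
    return bits == null ? docId < cardinality : bits.get(docId);
  }
}
```

In the fully dense case the set costs two ints and no allocation, which is the saving the description refers to.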

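import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortedNumericSortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class IndexBenchMarksNDV {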

  public static void main(final String[] args) throws Exception {
    doWriteNDV();
  }

  static void doWriteNDV() throws IOException {
    BenchMark benchMark = new BenchMark(5, 5, 1000000);
    benchMark.run();
  }

  static class BenchMark {
    final int warmup;
    final int numValues;
    final int loopCount;

    Directory dir;
    IndexWriter indexWriter;

    BenchMark(int warmup, int loopCount, int numValues) {
      this.warmup = warmup;
      this.numValues = numValues;
      this.loopCount = loopCount;
    }

    private void init() throws IOException {
      Path tempDir = Files.createTempDirectory(Paths.get("/Volumes/RamDisk"), "tmp");
      dir = MMapDirectory.open(tempDir);
      IndexWriterConfig iwc = new IndexWriterConfig(null);
      iwc.setMergePolicy(NoMergePolicy.INSTANCE);
      iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
      Sort indexSort = new Sort(new SortedNumericSortField("f1", SortField.Type.LONG));
      iwc.setIndexSort(indexSort);
      indexWriter = new IndexWriter(dir, iwc);
    }

    private void close() throws IOException {
      indexWriter.close();
      dir.close();
    }

    private long doWrite() throws IOException {
      for (int i = 0; i < numValues; i++) {
        Document document = new Document();
        for (int f = 0; f < 10; f++) {
          document.add(new SortedNumericDocValuesField("f" + f, i / 1000));
        }
        indexWriter.addDocument(document);
      }
      Document document = new Document();
      for (int f = 0; f < 10; f++) {
        document.add(new SortedNumericDocValuesField("f" + f, 1));
      }
      indexWriter.addDocument(document);
      long t0 = System.nanoTime();
      indexWriter.commit();
      return System.nanoTime() - t0;
    }

    void run() throws IOException {
      init();
      for (int i = 0; i < warmup; i++) {
        doWrite();
      }
      System.gc();
      List<Double> times = new ArrayList<>();
      for (int i = 0; i < loopCount; i++) {
        long took = doWrite();
        times.add(took / 1000000D);
      }
      double min = times.stream().mapToDouble(Number::doubleValue).min().getAsDouble();
      System.out.println("took(ms):" + String.format(Locale.ROOT, "%.2f", min));
      close();
    }
  }
}

jpountz

I left some comments about the implementation but it looks like a good optimization to me. Can you also add a CHANGES entry under 9.8?

jpountz · 2023-06-28T14:47:13Z

lucene/core/src/java/org/apache/lucene/index/DocsWithFieldSet.java

@@ -75,4 +75,9 @@ public DocIdSetIterator iterator() {
  public int cardinality() {
    return cardinality;
  }
+
+  /** Return the FixedBitSet of this set. */
+  public FixedBitSet bitSet() {


I would rather expose something like boolean dense() instead of the internal bitset.

OK, it is fixed.

jpountz · 2023-06-28T14:48:42Z

lucene/core/src/java/org/apache/lucene/util/BitSet.java

@@ -111,4 +111,70 @@ public void or(DocIdSetIterator iter) throws IOException {
      set(doc);
    }
  }
+
+  public static final BitSet all(int maxDoc) {


We currently only have 2 implementations of BitSet, which the JVM optimizes better than N implementations. Could we remove this special BitSet implementation and use a special null marker instead to imply that all docs match?

Great suggestion! I tested earlier that the virtual call can always be inlined when there are only 2 implementations.
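The null-marker suggestion can be sketched as follows. This is an illustration under assumptions, not the merged code: the class SortedValues, its fields, and the use of java.util.BitSet are hypothetical, and NO_MORE_DOCS merely mirrors the value of DocIdSetIterator.NO_MORE_DOCS. A null docsWithField is read as "every doc has a value", so no third BitSet implementation is needed.

```java
import java.util.BitSet;

/** Hypothetical holder for sorted values; a null docsWithField means "all docs match". */
final class SortedValues {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final long[] values;
  private final BitSet docsWithField; // null marker: every doc in [0, maxDoc) has a value
  private final int maxDoc;

  SortedValues(long[] values, BitSet docsWithField) {
    this.values = values;
    this.docsWithField = docsWithField;
    this.maxDoc = values.length;
  }

  /** Returns the first doc >= target that has a value, or NO_MORE_DOCS. Assumes target >= 0. */
  int advance(int target) {
    if (target >= maxDoc) {
      return NO_MORE_DOCS;
    }
    if (docsWithField == null) {
      return target; // dense: the target itself always matches
    }
    int next = docsWithField.nextSetBit(target);
    return (next == -1 || next >= maxDoc) ? NO_MORE_DOCS : next;
  }

  long value(int doc) {
    return values[doc];
  }
}
```

The dense path is a plain null check, so call sites that go through the bitset still see only the existing implementations.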

easyice · 2023-06-29T10:51:53Z

@jpountz Thank you for the comments, they are very helpful to me. The code has been updated.

jpountz

I left some minor comments, it looks good to me otherwise.

jpountz · 2023-06-29T11:34:11Z

lucene/core/src/java/org/apache/lucene/index/NormValuesWriter.java

@@ -76,7 +76,8 @@ public void flush(SegmentWriteState state, Sorter.DocMap sortMap, NormsConsumer
          NumericDocValuesWriter.sortDocValues(
              state.segmentInfo.maxDoc(),
              sortMap,
-              new BufferedNorms(values, docsWithField.iterator()));
+              new BufferedNorms(values, docsWithField.iterator()),
+              docsWithField.dense() && sortMap.size() == docsWithField.cardinality());


Only testing dense() should be correct, right?

Suggested change

-              docsWithField.dense() && sortMap.size() == docsWithField.cardinality());
+              docsWithField.dense());

No. DocsWithFieldSet only updates its bitset when a doc that actually has the field is added; in the sparse case, DocsWithFieldSet#add is not called for docs without the field. So if the first 64 docs all have the field and then some docs are added without this field, docsWithField.dense() will still return true. I think we can remove docsWithField.dense() and use only sortMap.size() == docsWithField.cardinality() to detect the dense case, since sortMap.size() returns the number of documents for the LeafReader. What do you think?

You're right! Thanks for explaining.

I have removed docsWithField.dense().
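The case described in this thread can be reproduced with a toy model. This is a sketch under assumptions: DenseCheckDemo is a hypothetical miniature of the bookkeeping discussed above, not Lucene's DocsWithFieldSet, and dense() here models "no gap has been seen so far" as in the review. Docs without the field are never reported to the set, so trailing docs without the field leave dense() true; comparing the cardinality with the document count catches them.

```java
/** Hypothetical miniature of the bookkeeping discussed above; not Lucene's DocsWithFieldSet. */
final class DenseCheckDemo {
  private int cardinality;
  private boolean sawGap;

  /** Called only for docs that have the field, in increasing doc ID order. */
  void add(int docId) {
    if (docId != cardinality) {
      sawGap = true; // a doc before this one lacked the field
    }
    cardinality++;
  }

  /** True while every added doc ID continued the run 0, 1, 2, ... */
  boolean dense() {
    return !sawGap;
  }

  int cardinality() {
    return cardinality;
  }

  /** The check the thread settles on: every doc in the segment has the field. */
  static boolean allDocsHaveField(DenseCheckDemo set, int maxDoc) {
    return set.cardinality() == maxDoc;
  }

  public static void main(String[] args) {
    DenseCheckDemo set = new DenseCheckDemo();
    for (int doc = 0; doc < 64; doc++) {
      set.add(doc); // docs 64..99 have no value, so add() is never called for them
    }
    System.out.println("dense() = " + set.dense());
    System.out.println("allDocsHaveField = " + allDocsHaveField(set, 100));
  }
}
```

Because doc IDs are unique and lie in [0, maxDoc), a cardinality equal to maxDoc already implies that no doc is missing, which is why the extra dense() test became redundant.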

jpountz · 2023-06-29T11:34:38Z

lucene/core/src/java/org/apache/lucene/index/NumericDocValuesWriter.java

+              sortMap.size(),
+              sortMap,
+              oldValues,
+              docsWithField.dense() && sortMap.size() == docsWithField.cardinality());


And likewise here?

Suggested change

-              docsWithField.dense() && sortMap.size() == docsWithField.cardinality());
+              docsWithField.dense());

jpountz · 2023-06-29T11:38:20Z

lucene/core/src/java/org/apache/lucene/index/NumericDocValuesWriter.java

+      if (target < maxDoc) {
+        return target;
+      }
+      return DocIdSetIterator.NO_MORE_DOCS;


Looking at the call site, this is only called when target is less than maxDoc, so this could be simplified to just return target, without the target < maxDoc check?

Nice catch, it's updated.

jpountz reviewed Jun 28, 2023

easyice added 2 commits June 29, 2023 18:03

init

a4215fa

fix for review

b5a5b1f

easyice force-pushed the speedup_ndvwriter branch from d00b36c to b5a5b1f Compare June 29, 2023 10:32

jpountz approved these changes Jun 29, 2023

easyice added 2 commits June 30, 2023 12:35

fix for review

6147bd7

remove DocsWithFieldSet#dense()

f0504d5

jpountz approved these changes Jun 30, 2023

jpountz merged commit 01200b5 into apache:main Jun 30, 2023
4 checks passed

jpountz added this to the 9.8.0 milestone Jun 30, 2023

jpountz pushed a commit to jpountz/lucene that referenced this pull request Jun 30, 2023

Speed up NumericDocValuesWriter with index sorting (apache#12381)

5fc16bc
