Use group-varint to encode the positions #12842

Open · easyice wants to merge 5 commits into main
Conversation

easyice (Contributor) commented Nov 24, 2023

Thanks for the suggestion from @jpountz, as discussed in #12826.

This PR uses group-varint to encode some vint values when storeOffsets is true. It still uses the GroupVIntReader and GroupVIntWriter classes; I will update that after #12841 is finished.

Currently I don't use group-varint when (storeOffsets == false && storePayload == false), i.e. when only the positions are stored, because I'm worried it would use extra memory when bulk decoding. Feel free to correct me.

I'll add the benchmark and file-size changes next week.
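
For context, here is a minimal sketch of the group-varint idea, assuming the common layout of one flag byte per group of four values; it is only an illustration, and Lucene's GroupVIntWriter/GroupVIntReader may lay out the bits differently.

// A self-contained sketch of the (assumed) group-varint layout: one flag byte whose four
// 2-bit fields store byteLength - 1 for each of the next four values, followed by 1-4
// little-endian bytes per value.
public class GroupVarIntSketch {

  // Encodes 4 ints (treated as unsigned) into dst starting at off.
  // Returns the number of bytes written.
  static int encodeGroup(int v0, int v1, int v2, int v3, byte[] dst, int off) {
    int[] values = {v0, v1, v2, v3};
    int pos = off + 1; // leave room for the flag byte
    int flag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      int numBytes = 1;
      while (numBytes < 4 && (v >>> (8 * numBytes)) != 0) {
        numBytes++;
      }
      flag |= (numBytes - 1) << (2 * i);
      for (int b = 0; b < numBytes; b++) {
        dst[pos++] = (byte) (v >>> (8 * b));
      }
    }
    dst[off] = (byte) flag;
    return pos - off;
  }

  public static void main(String[] args) {
    byte[] buf = new byte[17]; // worst case: 1 flag byte + 4 * 4 value bytes
    int len = encodeGroup(7, 300, 70000, 5, buf, 0);
    // 8 bytes here (1 flag + 1 + 2 + 3 + 1), versus 7 bytes as plain vints, but decoding
    // a group needs no per-byte continuation-bit branches.
    System.out.println("encoded bytes: " + len);
  }
}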

jpountz (Contributor) commented Nov 24, 2023

Thanks for looking. Unfortunately, the case I'm most interested in is when storeOffsets is false and there are no payloads, since this is the default. :)

easyice (Contributor, Author) commented Nov 25, 2023

Thanks for your suggestion. I'm thinking about that too, and I will continue working on this.

easyice (Contributor, Author) commented Dec 14, 2023

Sorry for the late update! I spent some more time on other PRs. I encoded the positions with group-varint when storeOffsets is false and there are no payloads. With the last commit, it uses a long[] buffer of size 128 to encode/decode. I wrote a simple benchmark to measure flush() performance, and it shows no significant improvement, because readVInt and readGroupVInt have similar performance in ByteBuffersDataOutput on the current branch. I'll test it with the #12841-optimized code tomorrow.
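
To illustrate the buffer-and-flush approach described above (the class and method names here are hypothetical, not the PR's actual code): position deltas are collected into a long[128] scratch buffer and written in bulk with DataOutput#writeGroupVInts.

import java.io.IOException;
import org.apache.lucene.store.DataOutput;

// Hypothetical helper: collect up to 128 deltas, then bulk-encode them with group-varint.
final class PositionDeltaBuffer {
  private final long[] buffer = new long[128];
  private int upto;

  void add(int positionDelta, DataOutput out) throws IOException {
    buffer[upto++] = positionDelta;
    if (upto == buffer.length) {
      flush(out);
    }
  }

  void flush(DataOutput out) throws IOException {
    if (upto > 0) {
      out.writeGroupVInts(buffer, upto); // groups of 4; any remaining values fall back to vints
      upto = 0;
    }
  }
}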

The simple benchmark summary:

  • 200 terms per field.
  • freq per term set to 100, which means each field has only 2 unique terms (the group-varint encoding of the positions does not cross doc boundaries).
  • 10,000 docs total.
Benchmark code
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.tests.util.TestUtil; // Lucene test-framework utility

public class SortedStringWriteBenchmark {

  static class Benchmark {
    Random rand = new Random(0);

    String randomString(int termsPerField, int freqPerTerm) {
      List<String> values = new ArrayList<>();
      for (int i = 0; i < termsPerField; ) {
        String s = TestUtil.randomSimpleString(rand, 5, 10);
        for (int j = 0; j < freqPerTerm; j++) {
          values.add(s);
        }
        i += freqPerTerm;
      }
      Collections.shuffle(values);
      String text = String.join(" ", values);
      return text;
    }

    List<String> randomStrings(int max, int termsPerField, int freqPerTerm) {
      List<String> values = new ArrayList<>();
      for (int i = 0; i < max; i++) {
        values.add(randomString(termsPerField, freqPerTerm));
      }
      return values;
    }

    long write() throws IOException {
      List<String> terms = randomStrings(10000, 200, 100);

      Path temp = Files.createTempDirectory(Paths.get("/Volumes/RamDisk"), "tmpDirPrefix");
      Directory dir = MMapDirectory.open(temp);
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      config.setIndexSort(new Sort(new SortField("sort", SortField.Type.LONG)));
      config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
      IndexWriter w = new IndexWriter(dir, config);
      for (int i = 0; i < terms.size(); ++i) {
        Document doc = new Document();
        doc.add(new NumericDocValuesField("sort", rand.nextInt()));
        doc.add(new TextField("field", terms.get(i), Field.Store.NO));
        w.addDocument(doc);
      }
      long t0 = System.currentTimeMillis();
      w.flush();
      long took = System.currentTimeMillis() - t0;
      w.close();
      dir.close();
      return took;
    }
  }

  public static void main(final String[] args) throws Exception {
    int iter = 50;
    Benchmark benchmark = new Benchmark();
    List<Long> times = new ArrayList<>();
    for (int i = 0; i < iter; i++) {
      long took = benchmark.write();
      times.add(took);
      System.out.println("iteration " + i + ",took(ms):" + took);
    }
    double avg = times.stream().skip(iter / 2).mapToLong(Number::longValue).average().getAsDouble();
    long min = times.stream().mapToLong(Number::longValue).min().getAsLong();
    System.out.println("best took(ms) avg:" + avg + ", min:" + min);
  }
}

easyice (Contributor, Author) commented Dec 15, 2023

I'll test it with the #12841-optimized code tomorrow.

Hmm... there's still no significant performance improvement. Possibly the write path (writeGroupVInts) is a bit slower than writeVInt: group-varint first writes each group to a temporary buffer (since it needs to compute the encoded length of every integer before combining them into the flag byte) and then copies it to the output, so it does an extra memory copy. Secondly, the BytesRefBuilder#append method may also have some overhead. Thinking about other approaches...

JMH output:

Benchmark                                                     (size)   Mode  Cnt  Score   Error   Units
GroupVIntBenchmark.benchByteBuffersIndexInput_writeGroupVInt      64  thrpt    5  1.108 ± 0.624  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_writeVInt           64  thrpt    5  1.603 ± 0.312  ops/us
JMH benchmark code for write
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.store.ByteBuffersDataOutput;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

// The class-level JMH annotations were not part of the original snippet: @State is
// required for the @Param/@Setup fields below, and the microsecond time unit matches
// the ops/us scores reported above.
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class GroupVIntBenchmark {

  // Cumulative frequency for each number of bits per value used by doc deltas of tail postings on
  // wikibigall.
  private static final float[] CUMULATIVE_FREQUENCY_BY_BITS_REQUIRED =
      new float[] {
        0.0f,
        0.01026574f,
        0.021453038f,
        0.03342156f,
        0.046476692f,
        0.060890317f,
        0.07644147f,
        0.093718216f,
        0.11424741f,
        0.13989712f,
        0.17366524f,
        0.22071244f,
        0.2815692f,
        0.3537585f,
        0.43655503f,
        0.52308f,
        0.6104675f,
        0.7047371f,
        0.78155357f,
        0.8671179f,
        0.9740598f,
        1.0f
      };

  final int maxSize = 256;
  final long[] docs = new long[maxSize];

  // benchmark for write
  final ByteBuffersDataOutput byteBuffersDataOutput = new ByteBuffersDataOutput();


  @Param({"64"})
  public int size;


  @Setup(Level.Trial)
  public void init() throws Exception {
    Random r = new Random(0);
    for (int i = 0; i < maxSize; ++i) {
      float randomFloat = r.nextFloat();
      // Reproduce the distribution of the number of bits per values that we're observing for tail
      // postings on wikibigall.
      int numBits = 1 + Arrays.binarySearch(CUMULATIVE_FREQUENCY_BY_BITS_REQUIRED, randomFloat);
      if (numBits < 0) {
        numBits = -numBits;
      }
      docs[i] = r.nextInt(1 << (numBits - 1), 1 << numBits);
    }
  }


  @Benchmark
  public void benchByteBuffersIndexInput_writeGroupVInt(Blackhole bh) throws IOException {
    byteBuffersDataOutput.reset();
    byteBuffersDataOutput.writeGroupVInts(docs, size);
  }

  @Benchmark
  public void benchByteBuffersIndexInput_writeVInt(Blackhole bh) throws IOException {
    byteBuffersDataOutput.reset();
    for (int i = 0; i < size; i++) {
      byteBuffersDataOutput.writeVInt((int)docs[i]);
    }
  }
}
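
To make the extra-copy point concrete, here is a rough sketch (not the actual GroupVIntWriter code) of a buffered write path: each group has to be staged in a scratch buffer because the flag byte, which depends on every value's encoded length, must be written before the value bytes; the staged bytes are then copied to the DataOutput, whereas writeVInt streams its bytes directly to the output.

import java.io.IOException;
import org.apache.lucene.store.DataOutput;

// Sketch only: stage a group of 4 values in a scratch byte[] (flag byte first, then the
// value bytes), then copy the scratch bytes to the output. The staging is the extra
// memory copy that writeVInt does not pay.
final class BufferedGroupVIntWriter {
  private final byte[] scratch = new byte[17]; // 1 flag byte + up to 4 bytes per value

  void writeGroup(DataOutput out, long[] values, int off) throws IOException {
    int flag = 0;
    int upto = 1; // scratch[0] is reserved for the flag byte
    for (int i = 0; i < 4; i++) {
      int v = (int) values[off + i];
      int numBytes = 1;
      while (numBytes < 4 && (v >>> (8 * numBytes)) != 0) {
        numBytes++;
      }
      flag |= (numBytes - 1) << (2 * i);
      for (int b = 0; b < numBytes; b++) {
        scratch[upto++] = (byte) (v >>> (8 * b));
      }
    }
    scratch[0] = (byte) flag;
    out.writeBytes(scratch, 0, upto); // the extra copy: scratch -> output
  }
}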

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label Jan 8, 2024
jpountz (Contributor) left a comment

Sorry, this PR was no longer on my radar; it looks promising!

  private void writePositionsWithOutOffsets(final PostingsEnum in, final DataOutput out, int freq)
      throws IOException {
    int previousPosition = 0;
    if (storePayloads) {
jpountz (Contributor) commented:

Maybe do this branch if storePayloads || storeOffsets (and re-introduce the logic for offsets below), since the common case is that positions are indexed but not offsets and not payloads? That way we could remove writePositionsWithOffsets.
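
A rough illustration of this suggestion (the class, method, and field names are hypothetical, and the payload/offset layout shown is simplified, not necessarily the PR's on-disk format): the rare payload/offset cases keep the per-position vint writes, while the common positions-only case buffers deltas and bulk-encodes them with group-varint.

import java.io.IOException;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.BytesRef;

final class PositionsWriterSketch {
  private final boolean storePayloads;
  private final boolean storeOffsets;
  private final long[] deltas = new long[128];

  PositionsWriterSketch(boolean storePayloads, boolean storeOffsets) {
    this.storePayloads = storePayloads;
    this.storeOffsets = storeOffsets;
  }

  void writePositions(PostingsEnum in, DataOutput out, int freq) throws IOException {
    int previousPosition = 0;
    int previousEndOffset = 0;
    if (storePayloads || storeOffsets) {
      // Uncommon cases: keep the existing one-value-at-a-time vint encoding.
      for (int i = 0; i < freq; i++) {
        int position = in.nextPosition();
        out.writeVInt(position - previousPosition);
        previousPosition = position;
        if (storePayloads) {
          BytesRef payload = in.getPayload();
          int payloadLength = payload == null ? 0 : payload.length;
          out.writeVInt(payloadLength); // simplified; the real format may flag payloads differently
          if (payloadLength > 0) {
            out.writeBytes(payload.bytes, payload.offset, payloadLength);
          }
        }
        if (storeOffsets) {
          out.writeVInt(in.startOffset() - previousEndOffset);
          out.writeVInt(in.endOffset() - in.startOffset());
          previousEndOffset = in.endOffset();
        }
      }
    } else {
      // Common case (positions only): buffer deltas and bulk-encode with group-varint.
      int upto = 0;
      for (int i = 0; i < freq; i++) {
        int position = in.nextPosition();
        deltas[upto++] = position - previousPosition;
        previousPosition = position;
        if (upto == deltas.length) {
          out.writeGroupVInts(deltas, upto);
          upto = 0;
        }
      }
      if (upto > 0) {
        out.writeGroupVInts(deltas, upto);
      }
    }
  }
}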

easyice (Contributor, Author) replied:

Thanks for looking at this! I will update it later next week.

github-actions bot removed the Stale label Feb 24, 2024
easyice (Contributor, Author) commented Mar 4, 2024

Hi Adrien, for simplicity I changed the logic to use group-varint only when positions alone are indexed (no offsets and no payloads). In addition, since writeGroupVInt is slower than writeVInt, we don't see an improvement for IndexWriter.flush; it may even be a bit slower.

I wrote a new ugly JMH benchmark for addPositions/nextPositions:

PR   benchmark_nextPositions  thrpt    5  0.448 ± 0.042  ops/us
main benchmark_nextPositions  thrpt    5  0.646 ± 0.045  ops/us
Code
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.SortingCodecReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 3)
@Measurement(iterations = 5, time = 5)
@Fork(
    value = 1,
    jvmArgsPrepend = {"--add-modules=jdk.unsupported"})
public class PosGroupVIntBenchmark {
  Directory dir;
  private Random rand = new Random(0);
  PostingsEnum reuse;
  TermsEnum termsEnum;

  // copy from TestUtil
  private String randomSimpleString(Random r, int minLength, int maxLength) {
    final int end = r.nextInt(minLength, maxLength);
    if (end == 0) {
      // allow 0 length
      return "";
    }
    final char[] buffer = new char[end];
    for (int i = 0; i < end; i++) {
      buffer[i] = (char) r.nextInt('a', 'z');
    }
    return new String(buffer, 0, end);
  }

  private String randomString(int termsPerField, int freqPerTerm) {
    List<String> values = new ArrayList<>();
    for (int i = 0; i < termsPerField; ) {
      String s = randomSimpleString(rand, 5, 10);
      for (int j = 0; j < freqPerTerm; j++) {
        values.add(s);
      }
      i += freqPerTerm;
    }
    Collections.shuffle(values, rand);
    String text = String.join(" ", values);
    return text;
  }

  private List<String> randomStrings(int size, int termsPerField, int freqPerTerm) {
    List<String> values = new ArrayList<>();
    for (int i = 0; i < size; i++) {
      values.add(randomString(termsPerField, freqPerTerm));
    }
    return values;
  }

  @Setup(Level.Trial)
  public void init() throws Exception {
    dir = new ByteBuffersDirectory();
    List<String> terms = randomStrings(10, 200, 100);
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    Sort indexSort = new Sort(new SortField("sort", SortField.Type.LONG));
    config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
    IndexWriter w = new IndexWriter(dir, config);
    for (int i = 0; i < terms.size(); ++i) {
      Document doc = new Document();
      doc.add(new NumericDocValuesField("sort", rand.nextInt()));
      doc.add(new TextField("field", terms.get(i), Field.Store.NO));
      w.addDocument(doc);
    }

    w.commit();
    IndexReader r = DirectoryReader.open(w);
    CodecReader cr = (CodecReader) r.leaves().get(0).reader();
    CodecReader wrap = SortingCodecReader.wrap(cr, indexSort);
    Terms fieldTerms = wrap.getPostingsReader().terms("field");
    termsEnum = fieldTerms.iterator();
    termsEnum.next();
    reuse = termsEnum.postings(null, PostingsEnum.POSITIONS);
    w.close();
  }

  @Benchmark
  public void benchmark_addPositions() throws Exception {
    termsEnum.postings(reuse, PostingsEnum.POSITIONS);
  }

  @Benchmark
  public void benchmark_nextPositions() throws Exception {
    termsEnum.postings(reuse, PostingsEnum.POSITIONS);
    reuse.nextDoc();
    int freq = reuse.freq();
    for (int i = 0; i < freq; i++) {
      reuse.nextPosition();
    }
  }
}

jpountz (Contributor) commented Mar 13, 2024

It looks like writeGroupVInt has room for improvement. Can we improve it by making it look a bit more like the read logic?

easyice (Contributor, Author) commented Mar 16, 2024

Yeah, it looks like we can optimize writeGroupVInt in the same way as we did for the read logic. I'd love to try and run some benchmarks.
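
One possible direction, sketched below (this is not the actual change, just an illustration of avoiding the per-group staging buffer; whether it is actually faster would need benchmarking): compute the four byte lengths first, emit the flag byte, then emit each value's bytes straight to the DataOutput, mirroring how the read side consumes the flag byte and then the value bytes.

import java.io.IOException;
import org.apache.lucene.store.DataOutput;

// Sketch: write a group of 4 values without an intermediate per-group buffer. The flag
// byte only needs the four byte lengths, which are cheap to compute up front, so nothing
// has to be staged and copied.
final class DirectGroupVIntWriter {

  static void writeGroup(DataOutput out, long[] values, int off) throws IOException {
    int flag = 0;
    int[] numBytes = new int[4];
    for (int i = 0; i < 4; i++) {
      int v = (int) values[off + i];
      int n = 1;
      while (n < 4 && (v >>> (8 * n)) != 0) {
        n++;
      }
      numBytes[i] = n;
      flag |= (n - 1) << (2 * i);
    }
    out.writeByte((byte) flag);
    for (int i = 0; i < 4; i++) {
      int v = (int) values[off + i];
      for (int b = 0; b < numBytes[i]; b++) {
        out.writeByte((byte) (v >>> (8 * b))); // little-endian value bytes
      }
    }
  }
}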

github-actions bot commented
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label Mar 31, 2024