
Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE #12717

Closed
easyice opened this issue Oct 24, 2023 · 5 comments

Comments

@easyice
Contributor

easyice commented Oct 24, 2023

Description

Currently we encode the doc IDs with vints if the doc buffer holds fewer than 128 docs, then decode them in Lucene90PostingsReader#readVIntBlock. For a high-cardinality field this decoding can be slow. Could we use the max BPV to encode the block instead, like DocIdsWriter#writeDocIds or DirectWriter?
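
For illustration, a minimal sketch of the max-BPV idea (the class and method names below are hypothetical, not Lucene's API): derive a single bits-per-value from the largest delta, write it as a one-byte header, then bit-pack every delta at that width.

import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: encode a tail block of doc-ID deltas
// with a single bits-per-value (BPV) derived from the largest delta, instead
// of one vint per delta.
public class MaxBpvTailBlock {

  static void writeDeltas(int[] deltas, int count, ByteArrayOutputStream out) {
    int bits = 0;
    for (int i = 0; i < count; i++) {
      bits |= deltas[i];
    }
    int bpv = Math.max(1, 32 - Integer.numberOfLeadingZeros(bits)); // bits needed for the largest delta
    out.write(bpv); // one-byte header carrying the BPV for the whole block

    long buffer = 0;
    int bitsInBuffer = 0;
    for (int i = 0; i < count; i++) {
      buffer |= (deltas[i] & 0xFFFFFFFFL) << bitsInBuffer; // append the delta at the current bit offset
      bitsInBuffer += bpv;
      while (bitsInBuffer >= 8) { // flush whole bytes
        out.write((int) (buffer & 0xFF));
        buffer >>>= 8;
        bitsInBuffer -= 8;
      }
    }
    if (bitsInBuffer > 0) { // flush the trailing partial byte
      out.write((int) (buffer & 0xFF));
    }
  }
}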

@jpountz
Contributor

jpountz commented Oct 27, 2023

Does this actually matter for performance? My gut feeling is that either a value has a long postings list, and then the vast majority of blocks will be encoded with PFOR and should be fast, or it has a short postings list and then queries should be fast anyway?

@easyice
Contributor Author

easyice commented Oct 27, 2023

@jpountz Thanks for your explanation. I have some flame graphs showing that readVIntBlock takes up a fairly large proportion of the time; I'll try to reproduce it with some mocked data.

@jpountz
Contributor

jpountz commented Oct 31, 2023

Since we're changing the postings format anyway in #12741, I wonder if it would be worth looking into different encodings for these tail postings. Maybe we could use group-varint and only fall back to regular vints for the last 1, 2 or 3 postings? (Group-varint is supposedly faster to decode than regular vints.)
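
Roughly, the layout I have in mind (only a sketch, not existing Lucene code): every group of 4 ints shares one flag byte in which each 2-bit field stores the byte length minus one of the corresponding value, followed by the values' significant little-endian bytes.

import java.io.ByteArrayOutputStream;

// Hypothetical group-varint encoder sketch: one flag byte per 4 values,
// then only the significant little-endian bytes of each value.
public class GroupVarintEncodeSketch {

  static void writeGroup(int[] values, int off, ByteArrayOutputStream out) {
    int flag = 0;
    byte[] scratch = new byte[16]; // at most 4 bytes per value
    int pos = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[off + i];
      int numBytes = Math.max(1, (32 - Integer.numberOfLeadingZeros(v) + 7) / 8); // 1..4 significant bytes
      flag |= (numBytes - 1) << (i * 2); // 2 bits per value in the flag byte
      for (int b = 0; b < numBytes; b++) {
        scratch[pos++] = (byte) (v >>> (8 * b)); // little-endian significant bytes
      }
    }
    out.write(flag);
    out.write(scratch, 0, pos);
  }
}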

@easyice
Contributor Author

easyice commented Nov 1, 2023

Oh, group-varint is an interesting encoder. I'd love to try it later in the week.

@easyice
Contributor Author

easyice commented Nov 5, 2023

I reproduced this using low-cardinality fields. For instance, with a posting size of 100, write 10 million docs, force merge to a single segment, and use a TermInSetQuery with 512 terms as the search benchmark; the flame graph shows readVIntBlock accounted for 27% of the time.

[flame graph screenshot]

I implemented a first version of group-varint. The decoding process for each group is (a rough sketch follows the list):

  • input.readByte() for the flag
  • input.readBytes() to buffer for the int bytes
  • BitUtil.VH_LE_INT.get(buffer, off) & MASKS to get the value
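
Roughly, the per-group decode looks like this (MASKS is a local helper table in this sketch, and the buffer is padded so the 4-byte VarHandle read stays in bounds):

import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.util.BitUtil;

// Rough sketch of the per-group decode described above; not the exact patch.
class GroupVarintDecodeSketch {
  private static final int[] MASKS = {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF}; // keep the low 1..4 bytes
  private final byte[] buffer = new byte[16 + 3]; // 4 values * 4 bytes, plus padding for the VarHandle read

  void readGroup(DataInput in, int[] dst, int dstOff) throws IOException {
    int flag = in.readByte() & 0xFF;              // 1. flag byte
    int len = 0;
    for (int i = 0; i < 4; i++) {
      len += ((flag >>> (i * 2)) & 0x3) + 1;      // total packed length of the 4 values
    }
    in.readBytes(buffer, 0, len);                 // 2. bulk read of the packed bytes
    int off = 0;
    for (int i = 0; i < 4; i++) {
      int numBytesMinus1 = (flag >>> (i * 2)) & 0x3;
      dst[dstOff + i] = (int) BitUtil.VH_LE_INT.get(buffer, off) & MASKS[numBytesMinus1]; // 3. masked LE read
      off += numBytesMinus1 + 1;
    }
  }
}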

But the benchmark showed a slight regression; it seems readBytes() -> Unsafe.copyMemory() is slow.
[flame graph screenshot]

So I fully read the posting into a buffer first, and the performance improved by ~18% for a posting size of 50, while the .doc file increased by ~9%. I also tried using bkd.DocIdsWriter#writeDocIds to encode the docs; the search performance improvement was similar.
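
For clarity, the bulk-read variant looks roughly like this (again only a sketch, reusing the same flag/mask scheme; names and sizes are illustrative): a single readBytes() for the whole tail block, then the groups are decoded straight from the array.

import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.util.BitUtil;

// Sketch of the "read everything up front" variant: one readBytes() for the whole
// tail block, then groups decoded from the array, avoiding a readBytes()/copyMemory
// call per group.
class BulkGroupVarintDecodeSketch {
  private static final int[] MASKS = {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};

  static void readTailBlock(DataInput in, int encodedLength, int numInts, int[] dst) throws IOException {
    byte[] block = new byte[encodedLength + 3];    // padding so the 4-byte VarHandle read never overruns
    in.readBytes(block, 0, encodedLength);         // one bulk copy instead of one per group
    int off = 0;
    for (int i = 0; i < numInts; i += 4) {
      int flag = block[off++] & 0xFF;
      for (int j = 0; j < 4 && i + j < numInts; j++) {
        int numBytesMinus1 = (flag >>> (j * 2)) & 0x3;
        dst[i + j] = (int) BitUtil.VH_LE_INT.get(block, off) & MASKS[numBytesMinus1];
        off += numBytesMinus1 + 1;
      }
    }
  }
}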

benchmark code:

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryCachingPolicy;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TotalHitCountCollectorManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.BytesRef;

// Standalone benchmark: index low-cardinality StringField values, force-merge to a
// single segment, then time a 512-term TermInSetQuery (imports assume Lucene 9.x).
public class SingleBlockPosting {
  private static final boolean INDEX = true;
  private static final boolean SEARCH = true;
  private static final int BENCHMARK_ITERATION = 10;
  private static final long SEED = 3;
  private static final String FIELD = "f1";

  private static final int numDocs = 1000_0000;
  private static final int postingSize = 100;
  private static int cardinality = numDocs / postingSize;

  public static Long[] randomLongs() {
    Random rand = new Random(SEED);
    HashSet<Long> setLongs = new HashSet<>();
    while (setLongs.size() < cardinality) {
      setLongs.add(rand.nextLong());
    }
    return setLongs.toArray(new Long[0]);
  }

  public static void index() throws IOException {
    Long[] longs = randomLongs();
    Directory dir = MMapDirectory.open(Paths.get("/Volumes/RamDisk/singleblock"));
    IndexWriterConfig iwc = new IndexWriterConfig(null);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
    IndexWriter indexWriter = new IndexWriter(dir, iwc);
    for (int i = 0; i < numDocs; i++) {
      Document doc = new Document();
      doc.add(new StringField(FIELD, String.valueOf(longs[i % cardinality]), Field.Store.NO));
      indexWriter.addDocument(doc);
    }
    indexWriter.commit();
    indexWriter.forceMerge(1);
    indexWriter.close();
    dir.close();
  }

  public static void doSearchBenchMark(int termCount) throws IOException {
    List<Long> times = new ArrayList<>();
    for (int i = 0; i < BENCHMARK_ITERATION; i++) {
      times.add(doSearch(termCount));
    }
    long took = times.stream().mapToLong(Number::longValue).min().getAsLong();
    System.out.println("best result: term count: " + termCount + ", took(ms): " + took);
  }

  public static long doSearch(int termCount) throws IOException {
    Directory directory = FSDirectory.open(Paths.get("/Volumes/RamDisk/singleblock"));
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(indexReader);
    searcher.setQueryCachingPolicy(
        new QueryCachingPolicy() {
          @Override
          public void onUse(Query query) {}

          @Override
          public boolean shouldCache(Query query) throws IOException {
            return false;
          }
        });

    long total = 0;
    Query query = getQuery(termCount);
    for (int i = 0; i < 1000; i++) {
      long start = System.currentTimeMillis();
      doQuery(searcher, query);
      long end = System.currentTimeMillis();
      total += end - start;
    }
    System.out.println("term count: " + termCount + ", took(ms): " + total);
    indexReader.close();
    directory.close();
    return total;
  }

  private static Query getQuery(int termCount) {
    List<BytesRef> terms = new ArrayList<>();
    Long[] longs = randomLongs();
    for (int i = 0; i < termCount; i++) {
      terms.add(new BytesRef(Long.toString(longs[i % cardinality])));
    }
    return new TermInSetQuery(FIELD, terms);
  }

  private static void doQuery(IndexSearcher searcher, Query query) throws IOException {
    TotalHitCountCollectorManager collectorManager = new TotalHitCountCollectorManager();
    int totalHits = searcher.search(query, collectorManager);
    // System.out.println(totalHits);
  }

  public static void main(String[] args) throws IOException {
    if (INDEX) {
      index();
    }
    if (SEARCH) {
      doSearchBenchMark(512);
    }
  }
}
