
Use max BPV encoding in postings if doc buffer size less than ForUtil.BLOCK_SIZE #12717

Closed
easyice opened this issue Oct 24, 2023 · 5 comments

Comments

@easyice
Contributor

easyice commented Oct 24, 2023

Description

Currently we encode the doc IDs with vints if the doc buffer holds fewer than 128 docs, then decode them in Lucene90PostingsReader#readVIntBlock. For a high-cardinality field this decoding can be slow. Could we use the max BPV to encode the block instead, like DocIdsWriter#writeDocIds or DirectWriter?
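
For illustration, a minimal sketch of the max-BPV idea (the class and method names below are hypothetical, not Lucene's API): derive a single bits-per-value from the largest delta, write it as a one-byte header, then bit-pack every delta at that width.

import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: encode a tail block of doc-ID deltas
// with a single bits-per-value (BPV) derived from the largest delta, instead
// of one vint per delta.
public class MaxBpvTailBlock {

  static void writeDeltas(int[] deltas, int count, ByteArrayOutputStream out) {
    int bits = 0;
    for (int i = 0; i < count; i++) {
      bits |= deltas[i];
    }
    int bpv = Math.max(1, 32 - Integer.numberOfLeadingZeros(bits)); // bits needed for the largest delta
    out.write(bpv); // one-byte header carrying the BPV for the whole block

    long buffer = 0;
    int bitsInBuffer = 0;
    for (int i = 0; i < count; i++) {
      buffer |= (deltas[i] & 0xFFFFFFFFL) << bitsInBuffer; // append the delta at the current bit offset
      bitsInBuffer += bpv;
      while (bitsInBuffer >= 8) { // flush whole bytes
        out.write((int) (buffer & 0xFF));
        buffer >>>= 8;
        bitsInBuffer -= 8;
      }
    }
    if (bitsInBuffer > 0) { // flush the trailing partial byte
      out.write((int) (buffer & 0xFF));
    }
  }
}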

@jpountz
Contributor

jpountz commented Oct 27, 2023

Does this actually matter for performance? My gut feeling is that either a value has a long postings list, and then the vast majority of blocks will be encoded with PFOR and should be fast, or it has a short postings list and then queries should be fast anyway?

@easyice
Contributor Author

easyice commented Oct 27, 2023

@jpountz Thanks for your explanation. I have some flame graphs showing that readVIntBlock takes up a fairly large proportion of the time; I'll try to reproduce it with some mocked data.

@jpountz
Contributor

jpountz commented Oct 31, 2023

Since we're changing the postings format anyway in #12741, I wonder if it would be worth looking into different encodings for these tail postings. Maybe we could use group-varint and only fall back to regular vints for the last 1, 2 or 3 postings? (Group-varint is supposedly faster to decode than regular vints.)
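
Roughly, the layout I have in mind (only a sketch, not existing Lucene code): every group of 4 ints shares one flag byte in which each 2-bit field stores the byte length minus one of the corresponding value, followed by the values' significant little-endian bytes.

import java.io.ByteArrayOutputStream;

// Hypothetical group-varint encoder sketch: one flag byte per 4 values,
// then only the significant little-endian bytes of each value.
public class GroupVarintEncodeSketch {

  static void writeGroup(int[] values, int off, ByteArrayOutputStream out) {
    int flag = 0;
    byte[] scratch = new byte[16]; // at most 4 bytes per value
    int pos = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[off + i];
      int numBytes = Math.max(1, (32 - Integer.numberOfLeadingZeros(v) + 7) / 8); // 1..4 significant bytes
      flag |= (numBytes - 1) << (i * 2); // 2 bits per value in the flag byte
      for (int b = 0; b < numBytes; b++) {
        scratch[pos++] = (byte) (v >>> (8 * b)); // little-endian significant bytes
      }
    }
    out.write(flag);
    out.write(scratch, 0, pos);
  }
}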

@easyice
Contributor Author

easyice commented Nov 1, 2023

Oh, group-varint is an interesting encoder. I'd love to try it later in the week.

@easyice
Contributor Author

easyice commented Nov 5, 2023

I reproduced this using low-cardinality fields. For instance, with a posting size of 100, write 10 million docs, force merge to a single segment, and use a TermInSetQuery with 512 terms as the search benchmark; the flame graph shows readVIntBlock accounted for 27% of the time.

[flame graph screenshot]

I implemented a first version of group-varint. The decoding process for each group is (a rough sketch follows the list):

  • input.readByte() for the flag
  • input.readBytes() to buffer for the int bytes
  • BitUtil.VH_LE_INT.get(buffer, off) & MASKS to get the value
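
Roughly, the per-group decode looks like this (MASKS is a local helper table in this sketch, and the buffer is padded so the 4-byte VarHandle read stays in bounds):

import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.util.BitUtil;

// Rough sketch of the per-group decode described above; not the exact patch.
class GroupVarintDecodeSketch {
  private static final int[] MASKS = {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF}; // keep the low 1..4 bytes
  private final byte[] buffer = new byte[16 + 3]; // 4 values * 4 bytes, plus padding for the VarHandle read

  void readGroup(DataInput in, int[] dst, int dstOff) throws IOException {
    int flag = in.readByte() & 0xFF;              // 1. flag byte
    int len = 0;
    for (int i = 0; i < 4; i++) {
      len += ((flag >>> (i * 2)) & 0x3) + 1;      // total packed length of the 4 values
    }
    in.readBytes(buffer, 0, len);                 // 2. bulk read of the packed bytes
    int off = 0;
    for (int i = 0; i < 4; i++) {
      int numBytesMinus1 = (flag >>> (i * 2)) & 0x3;
      dst[dstOff + i] = (int) BitUtil.VH_LE_INT.get(buffer, off) & MASKS[numBytesMinus1]; // 3. masked LE read
      off += numBytesMinus1 + 1;
    }
  }
}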

But the benchmark showed a slight regression; it seems readBytes() -> Unsafe.copyMemory() is slow.
[flame graph screenshot]

So I fully read the posting into a buffer first, and the performance improved by ~18% for a posting size of 50, while the .doc file increased by ~9%. I also tried using bkd.DocIdsWriter#writeDocIds to encode the docs; the search performance improvement was similar.
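
For clarity, the bulk-read variant looks roughly like this (again only a sketch, reusing the same flag/mask scheme; names and sizes are illustrative): a single readBytes() for the whole tail block, then the groups are decoded straight from the array.

import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.util.BitUtil;

// Sketch of the "read everything up front" variant: one readBytes() for the whole
// tail block, then groups decoded from the array, avoiding a readBytes()/copyMemory
// call per group.
class BulkGroupVarintDecodeSketch {
  private static final int[] MASKS = {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};

  static void readTailBlock(DataInput in, int encodedLength, int numInts, int[] dst) throws IOException {
    byte[] block = new byte[encodedLength + 3];    // padding so the 4-byte VarHandle read never overruns
    in.readBytes(block, 0, encodedLength);         // one bulk copy instead of one per group
    int off = 0;
    for (int i = 0; i < numInts; i += 4) {
      int flag = block[off++] & 0xFF;
      for (int j = 0; j < 4 && i + j < numInts; j++) {
        int numBytesMinus1 = (flag >>> (j * 2)) & 0x3;
        dst[i + j] = (int) BitUtil.VH_LE_INT.get(block, off) & MASKS[numBytesMinus1];
        off += numBytesMinus1 + 1;
      }
    }
  }
}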

benchmark code:

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryCachingPolicy;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TotalHitCountCollectorManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.BytesRef;

// Standalone benchmark: index low-cardinality StringField values, force-merge to a
// single segment, then time a 512-term TermInSetQuery (imports assume Lucene 9.x).
public class SingleBlockPosting {
  private static final boolean INDEX = true;
  private static final boolean SEARCH = true;
  private static final int BENCHMARK_ITERATION = 10;
  private static final long SEED = 3;
  private static final String FIELD = "f1";

  private static final int numDocs = 1000_0000;
  private static final int postingSize = 100;
  private static int cardinality = numDocs / postingSize;

  public static Long[] randomLongs() {
    Random rand = new Random(SEED);
    HashSet<Long> setLongs = new HashSet<>();
    while (setLongs.size() < cardinality) {
      setLongs.add(rand.nextLong());
    }
    return setLongs.toArray(new Long[0]);
  }

  public static void index() throws IOException {
    Long[] longs = randomLongs();
    Directory dir = MMapDirectory.open(Paths.get("/Volumes/RamDisk/singleblock"));
    IndexWriterConfig iwc = new IndexWriterConfig(null);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);
    IndexWriter indexWriter = new IndexWriter(dir, iwc);
    for (int i = 0; i < numDocs; i++) {
      Document doc = new Document();
      doc.add(new StringField(FIELD, String.valueOf(longs[i % cardinality]), Field.Store.NO));
      indexWriter.addDocument(doc);
    }
    indexWriter.commit();
    indexWriter.forceMerge(1);
    indexWriter.close();
    dir.close();
  }

  public static void doSearchBenchMark(int termCount) throws IOException {
    List<Long> times = new ArrayList<>();
    for (int i = 0; i < BENCHMARK_ITERATION; i++) {
      times.add(doSearch(termCount));
    }
    long took = times.stream().mapToLong(Number::longValue).min().getAsLong();
    System.out.println("best result: term count: " + termCount + ", took(ms): " + took);
  }

  public static long doSearch(int termCount) throws IOException {
    Directory directory = FSDirectory.open(Paths.get("/Volumes/RamDisk/singleblock"));
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(indexReader);
    searcher.setQueryCachingPolicy(
        new QueryCachingPolicy() {
          @Override
          public void onUse(Query query) {}

          @Override
          public boolean shouldCache(Query query) throws IOException {
            return false;
          }
        });

    long total = 0;
    Query query = getQuery(termCount);
    for (int i = 0; i < 1000; i++) {
      long start = System.currentTimeMillis();
      doQuery(searcher, query);
      long end = System.currentTimeMillis();
      total += end - start;
    }
    System.out.println("term count: " + termCount + ", took(ms): " + total);
    indexReader.close();
    directory.close();
    return total;
  }

  private static Query getQuery(int termCount) {
    List<BytesRef> terms = new ArrayList<>();
    Long[] longs = randomLongs();
    for (int i = 0; i < termCount; i++) {
      terms.add(new BytesRef(Long.toString(longs[i % cardinality])));
    }
    return new TermInSetQuery(FIELD, terms);
  }

  private static void doQuery(IndexSearcher searcher, Query query) throws IOException {
    TotalHitCountCollectorManager collectorManager = new TotalHitCountCollectorManager();
    int totalHits = searcher.search(query, collectorManager);
    // System.out.println(totalHits);
  }

  public static void main(String[] args) throws IOException {
    if (INDEX) {
      index();
    }
    if (SEARCH) {
      doSearchBenchMark(512);
    }
  }
}
