
Scalable merge/compaction of big doc values segments. #12203

Open
sherman opened this issue Mar 13, 2023 · 8 comments

@sherman

sherman commented Mar 13, 2023

The question is regarding the scalable merge/compaction of doc values, given the following context:

  • I have a large sharded index.
  • Each shard can contain segments of millions of documents.
  • There are several hundred fields in the index, and half of them are doc values.

Sometimes I face issues with merge times when I need to merge or compact a large segment. The problem is that it's a single-threaded operation: a single segment is merged by a single merge thread.

Looking at the 9.x codec doc values format, it appears possible to use map-reduce techniques when writing new, large doc values segments. This is because all metadata is read before any field data is read, and all doc value types have offset and length fields in their metadata.

My basic idea is to write each field in parallel to a separate file and then perform a low-level merge of the binary data (just appending bytes to the final file). After that, I can rewrite only the metadata to update the offsets.
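
To make the shape of the idea concrete, here is a minimal sketch of the two phases I have in mind (the per-field writers, class name, and file handling are hypothetical, not existing Lucene code): phase 1 writes each field's data to its own temporary file in parallel, and phase 2 appends those files into the final data file while recording where each field starts, so the metadata offsets can be rebased afterwards.

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

final class ParallelDocValuesWrite {

  // Phase 1 (parallel): each callable writes one field with the existing codec
  // logic and returns the name of the temporary data file it produced.
  static List<String> writeFields(List<Callable<String>> perFieldWriters, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<String> tempFiles = new ArrayList<>();
      for (Future<String> future : pool.invokeAll(perFieldWriters)) {
        tempFiles.add(future.get());
      }
      return tempFiles;
    } finally {
      pool.shutdown();
    }
  }

  // Phase 2 (single-threaded): byte-level append of the temporary files into
  // the final data file, remembering each field's start offset for the
  // metadata rewrite.
  static Map<String, Long> concat(Directory dir, List<String> tempFiles, String finalName)
      throws IOException {
    Map<String, Long> fieldStartOffsets = new LinkedHashMap<>();
    try (IndexOutput out = dir.createOutput(finalName, IOContext.DEFAULT)) {
      for (String temp : tempFiles) {
        fieldStartOffsets.put(temp, out.getFilePointer());
        try (IndexInput in = dir.openInput(temp, IOContext.READONCE)) {
          out.copyBytes(in, in.length());
        }
        dir.deleteFile(temp);
      }
    }
    return fieldStartOffsets;
  }
}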

As I am still new to Lucene development, could someone please provide some critique of this idea?

P.S. Unfortunately, the same idea is not applicable to the inverted index format due to its complexity.

@vigyasharma
Contributor

My basic idea is to write each field in parallel to a separate file and then perform a low-level merge of the binary data (just appending bytes to the final file). After that, I can rewrite only the metadata to update the offsets.

This is an interesting approach; I'd like to explore it more. There has been some discussion of this problem in #9626 that you might find useful.

I think it makes sense to chip away at this problem one format type at a time. Let's start with the approach you have in mind for doc values. If you want to raise a PR for this, I'm happy to iterate with you on it. We can start with a draft PR that demonstrates the idea first and benchmark it for viability, and eventually refine it to consolidate it into the general segment merging framework.

I was playing around with doing this for the postings format some time back. It has been on the shelf for a while, but I guess it's time to dust it off and try again. Hopefully, we can collaborate on some common learnings.

@rmuir
Member

rmuir commented Apr 11, 2023

Are you sure docvalues is really the slow part of your merge? I actually think doing this for terms/postings would be more bang for the buck.

Docvalues is a bit harder and trickier: typically docvalues are only a tiny fraction of merge costs compared to postings (especially merging the terms seems to be very intensive).

There are some real traps here with docvalues, especially string fields (SORTED/SORTED_SET). In order to merge these fields, it has to remap the ordinals, which requires an additional data structure. Doing this for many fields at once without being careful could spike memory (and possibly for little benefit, as again these fields are typically much faster to merge than indexed ones).
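
For reference, the remap per SORTED_SET field boils down to building one OrdinalMap over that field's values from every segment being merged; roughly something like this (a sketch using the public OrdinalMap API, with all the surrounding merge plumbing omitted):

import java.io.IOException;
import org.apache.lucene.index.OrdinalMap;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.packed.PackedInts;

final class OrdinalRemapSketch {
  // One OrdinalMap per SORTED_SET field maps each segment's local term
  // ordinals to global ordinals for the merged segment. Building many of
  // these concurrently is what could spike memory.
  static OrdinalMap buildMapForField(SortedSetDocValues[] perSegmentValues) throws IOException {
    return OrdinalMap.build(null, perSegmentValues, PackedInts.COMPACT);
  }
}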

Terms/postings is easier in that you don't have this trap in your way, and I think it is a bigger win since it is typically the majority of the time spent in a merge. There are still other concerns, such as overwhelming the hardware, slowing down searches while it's happening, etc. Typically we want merge to be "easy" on the machine; it can be viewed as a tax that we have to pay...

@sherman
Author

sherman commented Apr 12, 2023

Hi, @rmuir!

Are you sure docvalues is really the slow part of your merge? I actually think doing this for terms/postings would be more bang for the buck.

I am not stating that the doc values are the heaviest part of the force-merge process. In my case, the rewriting of doc values from the original segment (10 million docs) took 318 seconds, which is comparable to the time it takes to merge posting lists. The fully parallel writing (w/o a final metadata update) took 23 seconds!

Docvalues is a bit harder and trickier: typically docvalues are only a tiny fraction of merge costs compared to postings (especially merging the terms seems to be very intensive).

There are some real traps here with docvalues, especially string fields (SORTED/SORTED_SET). In order to merge these fields, it has to remap the ordinals, which requires an additional data structure. Doing this for many fields at once without being careful could spike memory (and possibly for little benefit, as again these fields are typically much faster to merge than indexed ones).

Hmm. After examining the codec code in version 9.x, I came to the opposite conclusion. Please correct me if I'm wrong, but it appears that each doc values field's data consists of two files: meta and data. Moreover, it seems that each doc values field is written separately, without sharing data between fields.

Perhaps I wasn't clear earlier, but what I meant was to write multiple doc values fields using the original codec, if that's possible. For instance, if I have two fields, I would have four files (two data files and two meta files). Then I could copy those data files at the byte level, using something like cat file1 > all_fields; cat file2 >> all_fields. As for the metadata files, I would need to fix the absolute numbers (i.e., the offsets). Writing the data files is a parallel operation; updating and rewriting the metadata is single-threaded.
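
To make the metadata fix-up concrete with made-up numbers: if field2's bytes end up starting at position 1,048,576 in the combined data file, then a value recorded as offset 4,096 in its standalone meta file becomes 1,048,576 + 4,096 = 1,052,672. In other words, every absolute offset is simply rebased by the field's start position, something like (hypothetical sketch):

// Rebase one absolute offset from a per-field meta file into the coordinate
// space of the combined data file; fieldStartInCombinedFile is where that
// field's data bytes were appended.
long rebaseOffset(long offsetInPerFieldFile, long fieldStartInCombinedFile) {
  return fieldStartInCombinedFile + offsetInPerFieldFile;
}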

@rmuir
Member

rmuir commented Apr 14, 2023

Check out the code here for string fields to see what I mean:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/DocValuesConsumer.java#L662-L663
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/DocValuesConsumer.java#L849-L850

It would be bad to instantiate a ton of these OrdinalMap instances (one for each field) all at once, rather than sequentially, which keeps the memory usage low.

@sherman
Author

sherman commented Apr 14, 2023

Yes, I saw that code. However, I thought it was intended for use when merging multiple segments into one, field by field.

May I share a simple example to explain why I think so? Let's say we have the following:

import java.nio.file.Path;
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// Log4j2InfoStream is our own InfoStream adapter for Log4j2, not a Lucene class.
var logger = LogManager.getLogger("lucene_index");
var infoStream = new Log4j2InfoStream(logger, Level.toLevel("info"));

var out = FSDirectory.open(Path.of("/home/sherman/experiments7/"));
var cfg = new IndexWriterConfig();
cfg.setUseCompoundFile(false);
cfg.setInfoStream(infoStream);
var indexWriter = new IndexWriter(out, cfg);
for (var i = 0; i < 10000; i++) {
    var doc = new Document();
    // field1 exists in every document ...
    doc.add(new SortedSetDocValuesField("field1", new BytesRef("term1_" + i)));
    if (i < 500) {
        // ... while field2 exists only in the first 500 documents.
        doc.add(new SortedSetDocValuesField("field2", new BytesRef("term1_" + i)));
    }
    indexWriter.addDocument(doc);
}
indexWriter.close();

In this example, each field has its own term dictionary, because I see two calls to the addTermsDict() method.

I think I need to investigate further and create a prototype that produces the same files as a result.

Maybe I missed something.

@mikemccand
Member

the rewriting of doc values from the original segment (10 million docs) took 318 seconds, which is comparable to the time it takes to merge posting lists. The fully parallel writing (w/o a final metadata update) took 23 seconds!

Wow, this is an amazing speedup! Can you provide some details? Where is the prototype implementation, how many cores/threads did you use, etc.?

@rmuir's concern about concurrently enumerating multiple OrdinalMaps is a real risk: this data structure can be memory-consuming, especially for high-cardinality (many unique strings) fields and for indices where each segment has very different sets of strings for that field. But maybe if we limited the concurrency of such costly fields in some way, this might be OK.
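
For example (just a sketch of the general idea with hypothetical task types, not a proposal for the actual API): a small semaphore could gate the costly SORTED/SORTED_SET fields so only a couple of OrdinalMaps are being built at any moment, while the cheap numeric fields keep the full thread pool busy.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

final class BoundedFieldMergeRunner {
  private final ExecutorService pool = Executors.newFixedThreadPool(32);
  // At most two ordinal-remapping (OrdinalMap-building) merges in flight.
  private final Semaphore costlyPermits = new Semaphore(2);

  void mergeAll(List<Runnable> cheapFieldTasks, List<Runnable> costlyFieldTasks) {
    for (Runnable task : cheapFieldTasks) {
      pool.submit(task);
    }
    for (Runnable task : costlyFieldTasks) {
      pool.submit(() -> {
        costlyPermits.acquireUninterruptibly(); // caps peak memory from concurrent OrdinalMaps
        try {
          task.run();
        } finally {
          costlyPermits.release();
        }
      });
    }
  }
}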

@sherman
Author

sherman commented Apr 19, 2023

Hi, @mikemccand!

I re-ran the tests.

My hardware: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz, 96 cores.

The test was quite simple: rewriting all doc values of a specific index segment, which is similar to what we do when we run a compaction process.

So, in this test I had a segment with 8,021,709 documents and the following counts of doc values fields by type (yes, we have a lot of doc values fields):

SORTED_NUMERIC=1649
SORTED=1
SORTED_SET=4863

The total size of the .dvd file is 8.3 GB.
The baseline (single thread) took 249 seconds.

The parallel test (32 threads) took 19 seconds.

Of course, in a final implementation, we would need to add the time for copying the data files and rewriting the metadata to the total.

@mikemccand
Member

The test was quite simple: rewriting all doc values of a specific index segment, which is similar to what we do when we run a compaction process.

Awesome -- I look forward to seeing the rough PR, even if it is a prototype!
