Add ordering of files in compound files #12241

cbuescher · 2023-04-25T13:33:33Z

Today there is no specific ordering of how files are written to a compound file.
The current order is determined by iterating over the set of file names in
SegmentInfo, which has no defined order. This PR proposes to change to an order
based on file size. Colocating data from files that are smaller (typically
metadata files like terms index, field info, etc.) but accessed often can help
when parts of these files are held in cache.

In our particular case, the motivation comes from reading larger compound
files from a remote object store through a caching layer that keeps chunks of
the file in pages. Keeping small files together can help improve the efficiency
of the cache because data that is read often (like metadata) is kept together.
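
To illustrate the idea outside of Lucene, here is a minimal standalone sketch of ordering files by size before writing them. It is not the code from this PR: the class `CompoundOrderSketch`, the method `orderBySize`, the example sizes, and the tie-break on file name are all assumptions made for illustration. Only the `SizedFile` name is borrowed from the diff.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class CompoundOrderSketch {

  /** Pairs a file name with its length; both fields are final. */
  static final class SizedFile {
    final String name;
    final long length;

    SizedFile(String name, long length) {
      this.name = name;
      this.length = length;
    }
  }

  /**
   * Returns the file names ordered from smallest to largest.
   * Ties are broken by name (an assumption, chosen so the order is deterministic).
   */
  static List<String> orderBySize(Map<String, Long> fileSizes) {
    List<SizedFile> sized = new ArrayList<>();
    for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
      sized.add(new SizedFile(e.getKey(), e.getValue()));
    }
    sized.sort(
        Comparator.comparingLong((SizedFile f) -> f.length).thenComparing(f -> f.name));
    List<String> names = new ArrayList<>();
    for (SizedFile f : sized) {
      names.add(f.name);
    }
    return names;
  }

  public static void main(String[] args) {
    // Hypothetical sizes in bytes; small metadata-like files end up first.
    Map<String, Long> sizes = Map.of("_0.fdt", 4096L, "_0.fnm", 120L, "_0.tip", 800L);
    System.out.println(orderBySize(sizes)); // prints [_0.fnm, _0.tip, _0.fdt]
  }
}
```

With this ordering, the small files occupy a contiguous region at the start of the compound file, so a page-based cache can cover them with fewer pages.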

mikemccand · 2023-04-25T13:52:31Z

+1, cool idea!

cbuescher · 2023-04-25T15:00:00Z

@romseygeek fyi

romseygeek

This is great, thanks @cbuescher. I left a couple of suggestions for changes, can you fix those up and add a CHANGES entry for 9.6?

romseygeek · 2023-04-26T09:20:33Z

lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90CompoundFormat.java

@@ -102,11 +103,40 @@ public void write(Directory dir, SegmentInfo si, IOContext context) throws IOExc
    }
  }

+  private static class SizedFile {
+    String name;


Let's make these final?

romseygeek · 2023-04-26T10:01:04Z

lucene/core/src/test/org/apache/lucene/codecs/lucene90/TestLucene90CompoundFormat.java

+      randomFileSize += random().nextInt(1, 100);
+      files.add(filename);
+    }
+    si.setFiles(files);


Can we explicitly shuffle the files list to make it clear that things are in a random order within the segment info? And maybe add a comment noting that it's held internally as a set, so there's no defined ordering in any case?

I added something for that
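
As a rough illustration of the suggestion above (not the actual test change), a test setup could shuffle the generated names before handing them over as a set. The class `ShuffledFilesSketch`, the method `shuffledFileSet`, and the file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class ShuffledFilesSketch {

  /** Builds hypothetical file names, shuffles them, and returns them as a set. */
  static Set<String> shuffledFileSet(int count, long seed) {
    List<String> files = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      files.add("_file" + i + ".bin"); // hypothetical names
    }
    // Shuffle explicitly so the reader sees that creation order carries no meaning.
    Collections.shuffle(files, new Random(seed));
    // A HashSet has no defined iteration order, so any ordering of the
    // compound file has to be established by the writer, not by the input.
    return new HashSet<>(files);
  }

  public static void main(String[] args) {
    System.out.println(shuffledFileSet(5, 42L).size());
  }
}
```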

romseygeek

LGTM! Thanks @cbuescher

Today there is no specific ordering of how files are written to a compound file. The current order is determined by iterating over the set of file names in SegmentInfo, which is undefined. This commit changes to an order based on file size. Colocating data from files that are smaller (typically metadata files like terms index, field info etc...) but accessed often can help when parts of these files are held in cache.

Add changes.txt entry

e087d07

romseygeek requested changes Apr 26, 2023

cbuescher added 2 commits April 26, 2023 12:03

make clear that file list is random in test

e1bf5ad

iter

4af3f41

romseygeek approved these changes Apr 26, 2023

romseygeek merged commit f45e096 into apache:main Apr 26, 2023

javanna added this to the 9.6.0 milestone May 3, 2023
