-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ordering of files in compound files #12241
Conversation
Today there is no specic ordering of how files are written to a compound file. The current order is determined by iterating over the set of file names in SegmentInfo, which is unspecific. This PR proposes to change to an order based on file size. Colocating data from files that are smaller (typically metadata files like terms index, field info etc...) but accessed often can help when parts of these files are help in cache. In our particular case, the motivation is coming from reading larger compound files from a remote object store through a caching layer that keeps chunks of the file in pages. Keeping small files together can help improve the efficiency of the cache because data that is read often (like metadata) is kept together.
+1, cool idea! |
@romseygeek fyi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks @cbuescher. I left a couple of suggestions for changes, can you fix those up and add a CHANGES entry for 9.6?
@@ -102,11 +103,40 @@ public void write(Directory dir, SegmentInfo si, IOContext context) throws IOExc | |||
} | |||
} | |||
|
|||
private static class SizedFile { | |||
String name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make these final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
randomFileSize += random().nextInt(1, 100); | ||
files.add(filename); | ||
} | ||
si.setFiles(files); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we explicitly shuffle the files
list to make it clear that things are in a random order within the segment info? And maybe add a comment to the point that it's held internally as a set so there's no defined ordering in any case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added something for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @cbuescher
Today there is no specific ordering of how files are written to a compound file. The current order is determined by iterating over the set of file names in SegmentInfo, which is undefined. This commit changes to an order based on file size. Colocating data from files that are smaller (typically metadata files like terms index, field info etc...) but accessed often can help when parts of these files are held in cache.
Today there is no specific ordering of how files are written to a compound file.
The current order is determined by iterating over the set of file names in
SegmentInfo, which is unspecific. This PR proposes to change to an order based
on file size. Colocating data from files that are smaller (typically metadata
files like terms index, field info etc...) but accessed often can help when
parts of these files are held in cache.
In our particular case, the motivation is coming from reading larger compound
files from a remote object store through a caching layer that keeps chunks of
the file in pages. Keeping small files together can help improve the efficiency
of the cache because data that is read often (like metadata) is kept together.