Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ordering of files in compound files #12241

Merged
merged 4 commits into from
Apr 26, 2023

Conversation

cbuescher
Copy link
Contributor

@cbuescher cbuescher commented Apr 25, 2023

Today there is no specific ordering of how files are written to a compound file.
The current order is determined by iterating over the set of file names in
SegmentInfo, which is unspecific. This PR proposes to change to an order based
on file size. Colocating data from files that are smaller (typically metadata
files like terms index, field info etc...) but accessed often can help when
parts of these files are held in cache.

In our particular case, the motivation is coming from reading larger compound
files from a remote object store through a caching layer that keeps chunks of
the file in pages. Keeping small files together can help improve the efficiency
of the cache because data that is read often (like metadata) is kept together.

Today there is no specic ordering of how files are written to a compound file.
The current order is determined by iterating over the set of file names in
SegmentInfo, which is unspecific. This PR proposes to change to an order based
on file size. Colocating data from files that are smaller (typically metadata
files like terms index, field info etc...) but accessed often can help when
parts of these files are help in cache.

In our particular case, the motivation is coming from reading larger compound
files from a remote object store through a caching layer that keeps chunks of
the file in pages. Keeping small files together can help improve the efficiency
of the cache because data that is read often (like metadata) is kept together.
@mikemccand
Copy link
Member

+1, cool idea!

@cbuescher
Copy link
Contributor Author

@romseygeek fyi

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks @cbuescher. I left a couple of suggestions for changes, can you fix those up and add a CHANGES entry for 9.6?

@@ -102,11 +103,40 @@ public void write(Directory dir, SegmentInfo si, IOContext context) throws IOExc
}
}

private static class SizedFile {
String name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make these final?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

randomFileSize += random().nextInt(1, 100);
files.add(filename);
}
si.setFiles(files);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explicitly shuffle the files list to make it clear that things are in a random order within the segment info? And maybe add a comment to the point that it's held internally as a set so there's no defined ordering in any case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added something for that

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @cbuescher

@romseygeek romseygeek merged commit f45e096 into apache:main Apr 26, 2023
asfgit pushed a commit that referenced this pull request Apr 26, 2023
Today there is no specific ordering of how files are written to a compound file.
The current order is determined by iterating over the set of file names in
SegmentInfo, which is undefined. This commit changes to an order based
on file size. Colocating data from files that are smaller (typically metadata
files like terms index, field info etc...) but accessed often can help when
parts of these files are held in cache.
@javanna javanna added this to the 9.6.0 milestone May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants