[C++][Dataset] Devise a mechanism to limit the total "system ram" (process + cache) used by dataset writes #30179

Open
asfimport opened this issue Nov 8, 2021 · 1 comment


asfimport commented Nov 8, 2021

The dataset writer now correctly applies backpressure.  However, that backpressure is only applied when the write calls slow down.  This only happens when the OS disk cache fills up.

However, filling up the OS disk cache is undesirable.  It will cause all running processes to get swapped (assuming the system has any swap configured) and will make the system unusable for anything else.

This typically has no actual benefit to the dataset write.  The marginal performance boost provided by the extra RAM is often not worth the cost.

One way to avoid filling the cache would be to use direct I/O (although that comes with a plethora of warnings). Another way might be to flag the output as DONTNEED (e.g. posix_fadvise with POSIX_FADV_DONTNEED), but I don't know for sure if this works (the OS might still cache it so that it can satisfy the write call quickly). Another way might be to somehow track how much disk cache is being used for writes, but that would get complex. I'm sure there are other ways I'm just not aware of yet.

Reporter: Weston Pace / @westonpace
Assignee: Ziheng Wang / @marsupialtail

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-14635. Please see the migration documentation for further details.


Weston Pace / @westonpace:
Just jotting this down, but I did a bit of looking into this and I think a combination of writing the file with O_DSYNC and then using DONTNEED (POSIX_FADV_DONTNEED) on the written data should accomplish this on Linux, though there may be other ways.
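
For illustration, here is a rough sketch of what that Linux combination might look like. It assumes the advice flag meant above is `POSIX_FADV_DONTNEED` passed to `posix_fadvise`. The helper name `WriteWithoutCaching` is hypothetical and not part of Arrow. Whether the kernel actually drops the pages is up to the kernel, so the advice is treated as a best-effort hint.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper (not an Arrow API): writes `size` bytes to `path` with
// O_DSYNC so each write() returns only after the data reaches the device,
// then hints that the written range will not be needed again.
// Returns 0 on success or an errno-style code on failure.
int WriteWithoutCaching(const std::string& path, const char* data,
                        size_t size) {
  assert(data != nullptr || size == 0);

  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
  if (fd < 0) return errno;

  size_t written = 0;
  while (written < size) {
    ssize_t n = ::write(fd, data + written, size - written);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted, retry the write
      int err = errno;
      ::close(fd);
      return err;
    }
    written += static_cast<size_t>(n);
  }

  // offset 0 with len 0 covers the whole file. Because of O_DSYNC the pages
  // are already clean, so the kernel is free to evict them. The call is only
  // a hint, so its result is deliberately ignored here.
  (void)::posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

  return (::close(fd) != 0) ? errno : 0;
}
```

A caller would invoke something like `WriteWithoutCaching("/tmp/part-0.bin", buf.data(), buf.size())` for each output file. The trade-off is that synchronous writes are slower per call, which is exactly what lets backpressure kick in before the page cache fills.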

On Windows there is a flag, FILE_FLAG_WRITE_THROUGH, which I think will be key.
