Fix "Too many open files" issues #8812

Merged · 6 commits · Sep 6, 2023

Conversation

@lnkuiper (Contributor) commented Sep 6, 2023

This PR fixes two issues with the error "Too many open files".

The first one is in the Parquet reader, where we implement an optimization that lets threads "look ahead" and open files that will need to be scanned while they wait for the "current" file to be opened. This is a great optimization because it keeps threads busy when some files take much longer to open than others. However, in extreme cases it can cause hundreds of Parquet files to be opened that will only be read (and closed) much later. This PR limits the "look ahead" to the number of threads, which should prevent "Too many open files" errors from being thrown.
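
A minimal sketch of the capping logic (the struct and field names here are illustrative, not DuckDB's actual internals):

#include <atomic>
#include <cstddef>

// Sketch: a shared scan state that caps how far ahead threads may open files.
struct MultiFileScanState {
    std::atomic<std::size_t> current_file_idx{0}; // file currently being scanned
    std::size_t num_threads;                      // number of scanning threads

    // A thread may only "look ahead" and open file `idx` if it is within
    // `num_threads` files of the one currently being scanned; otherwise it
    // would pile up open handles that won't be read until much later.
    bool CanOpenFile(std::size_t idx) const {
        return idx < current_file_idx.load() + num_threads;
    }
};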

The second one has to do with temporary files. When DuckDB has to offload data to disk, it opens temporary files that can each hold up to 4,000 blocks. We keep these files open for efficiency, allowing blocks to be moved in and out easily. If the amount of data that needs to be offloaded is huge, we could end up with hundreds or even thousands of temporary files, which of course also causes a "Too many open files" error. I've changed it so that each new temporary file can hold twice as many blocks as the previous one, so the number of open temporary files now grows logarithmically with the amount of offloaded data.
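
A sketch of the resulting growth, using the MAX_ALLOWED_INDEX_BASE constant from the diff later in this thread (the helper function itself is illustrative):

#include <cstdint>
#include <iostream>

// Illustrative: the capacity of each new temporary file doubles, starting
// from a base of 4,000 blocks (file 0: 4,000; file 1: 8,000; file 2: 16,000; ...).
static constexpr uint64_t MAX_ALLOWED_INDEX_BASE = 4000;

uint64_t MaxBlocksInFile(uint64_t temp_file_count) {
    return (1ull << temp_file_count) * MAX_ALLOWED_INDEX_BASE;
}

int main() {
    for (uint64_t n = 0; n < 5; n++) {
        std::cout << "temp file " << n << ": " << MaxBlocksInFile(n) << " blocks\n";
    }
}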

I'm happy to receive feedback and think about other solutions if these aren't appropriate.

@Mytherin (Collaborator) left a comment


Thanks for the PR! LGTM. One potential issue: might the temporary files grow past the file size limit on some file systems?

@lnkuiper (Contributor, Author) commented Sep 6, 2023

Looking at the limits on Wikipedia, I don't think that will be an issue. Anyone using FAT32 (which has a 4 GB limit) will also have an issue with our single-file DB format.

AWS maintains a single-file limit of 5 TB, while Azure maintains a limit of 4 TB. I doubt that anyone will run into this issue, but I can add an upper bound just to be sure?

@Mytherin merged commit 9698e9e into duckdb:main on Sep 6, 2023 (47 checks passed).
@Mytherin (Collaborator) commented Sep 6, 2023

We can leave it like this, thanks!

TemporaryFileHandle(idx_t temp_file_count, DatabaseInstance &db, const string &temp_directory, idx_t index)
    : max_allowed_index((1 << temp_file_count) * MAX_ALLOWED_INDEX_BASE), db(db), file_index(index),
      path(FileSystem::GetFileSystem(db).JoinPath(temp_directory, "duckdb_temp_storage-" + to_string(index) + ".tmp")) {
A contributor left a review comment:
Suggested change:
-    : max_allowed_index((1 << temp_file_count) * MAX_ALLOWED_INDEX_BASE), db(db), file_index(index),
+    : max_allowed_index((1ull << temp_file_count) * MAX_ALLOWED_INDEX_BASE), db(db), file_index(index),

1 << temp_file_count is computed as a 32-bit integer, so just to stay safe I would explicitly use 64-bit types.

@lnkuiper (Contributor, Author) replied:

I agree that this should have been explicitly typed. In practice, though, it will never be an issue: by the time temp_file_count reaches 30 (just before the 32-bit shift overflows), we already have a temp file that can store more than an exabyte.
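
For concreteness (assuming DuckDB's default block size of 256 KiB): at temp_file_count = 30, a single temp file may hold 2^30 * 4,000 ≈ 4.3 * 10^12 blocks, i.e. roughly 4.3 * 10^12 * 256 KiB ≈ 1.1 * 10^18 bytes, just over an exabyte.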

@Mytherin (Collaborator) replied:

Considering temp_file_count is an idx_t, will it actually be computed as a 32-bit integer?

The contributor replied:

I tried this:

#include <cstdint>
#include <iostream>

int main()
{
    auto one = 1;        // deduced as int (32-bit on typical platforms)
    auto bigone = 1ull;  // deduced as unsigned long long (64-bit)
    uint64_t exp = 32;
    // The result type of << is that of the (promoted) left operand, so the
    // 64-bit right operand does not widen the first shift.
    std::cout << (one << exp) << '\n'
              << (bigone << exp) << '\n';
}

On my machine that prints:

1
4294967296

(The first result is not portable: shifting a 32-bit int by 32 is actually undefined behavior, and it merely happens to print 1 here. The point stands, though: the first shift is done in 32 bits, the second in a well-defined 64 bits.)
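
(Background on why the idx_t right operand doesn't help: unlike + or *, the built-in shift operators don't apply the usual arithmetic conversions across both operands. Each operand is promoted independently, and the result has the type of the promoted left operand, so 1 << temp_file_count is an int regardless of temp_file_count's type.)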

@lnkuiper deleted the parquet_too_many_files branch on September 6, 2023.