Fix "Too many open files" issues #8812

Merged · 6 commits · Sep 6, 2023

Conversation

@lnkuiper (Contributor) commented Sep 6, 2023

This PR fixes two issues with the error "Too many open files".

The first one is in the Parquet reader, where we implement an optimization that lets threads "look ahead" and open files that will need to be scanned while they wait for the "current" file to be opened. This is a great optimization because it keeps threads busy when some files take much longer to open than others. However, in extreme cases it can cause hundreds of Parquet files to be opened that will only be read (and closed) much later. This PR limits the "look ahead" to the number of threads, which should prevent "Too many open files" errors from being thrown.
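
A minimal sketch of the capping logic (the struct and field names here are illustrative, not DuckDB's actual internals):

#include <atomic>
#include <cstddef>

// Sketch: a shared scan state that caps how far ahead threads may open files.
struct MultiFileScanState {
    std::atomic<std::size_t> current_file_idx{0}; // file currently being scanned
    std::size_t num_threads;                      // number of scanning threads

    // A thread may only "look ahead" and open file `idx` if it is within
    // `num_threads` files of the one currently being scanned; otherwise it
    // would pile up open handles that won't be read until much later.
    bool CanOpenFile(std::size_t idx) const {
        return idx < current_file_idx.load() + num_threads;
    }
};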

The second one has to do with temporary files. When DuckDB has to offload data to disk, it opens temporary files that can each hold up to 4,000 blocks. We keep these files open for efficiency, allowing blocks to be moved in and out easily. If the amount of data that needs to be offloaded is huge, we could end up with hundreds or even thousands of temporary files, which of course also causes a "Too many open files" error. I've changed it so that each new temporary file can hold twice as many blocks as the previous one, so the number of open temporary files now grows logarithmically with the amount of offloaded data.
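
A sketch of the resulting growth, using the MAX_ALLOWED_INDEX_BASE constant from the diff later in this thread (the helper function itself is illustrative):

#include <cstdint>
#include <iostream>

// Illustrative: the capacity of each new temporary file doubles, starting
// from a base of 4,000 blocks (file 0: 4,000; file 1: 8,000; file 2: 16,000; ...).
static constexpr uint64_t MAX_ALLOWED_INDEX_BASE = 4000;

uint64_t MaxBlocksInFile(uint64_t temp_file_count) {
    return (1ull << temp_file_count) * MAX_ALLOWED_INDEX_BASE;
}

int main() {
    for (uint64_t n = 0; n < 5; n++) {
        std::cout << "temp file " << n << ": " << MaxBlocksInFile(n) << " blocks\n";
    }
}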

I'm happy to receive feedback and think about other solutions if these aren't appropriate.

@Mytherin (Collaborator) left a comment


Thanks for the PR! LGTM. One potential issue: might the temporary files grow past the file size limit on some file systems?

@lnkuiper (Contributor, Author) commented Sep 6, 2023

Looking at the limits on Wikipedia, I don't think that will be an issue. Anyone using FAT32 (which has a 4 GB limit) will also have an issue with our single-file DB format.

AWS maintains a single-file limit of 5 TB, while Azure maintains a limit of 4 TB. I doubt that anyone will run into this issue, but I can add an upper bound just to be sure?

@Mytherin merged commit 9698e9e into duckdb:main on Sep 6, 2023 (47 checks passed).
@Mytherin (Collaborator) commented Sep 6, 2023

We can leave it like this, thanks!

TemporaryFileHandle(idx_t temp_file_count, DatabaseInstance &db, const string &temp_directory, idx_t index)
    : max_allowed_index((1 << temp_file_count) * MAX_ALLOWED_INDEX_BASE), db(db), file_index(index),
      path(FileSystem::GetFileSystem(db).JoinPath(temp_directory, "duckdb_temp_storage-" + to_string(index) + ".tmp")) {
A contributor left a review comment:
Suggested change:
-    : max_allowed_index((1 << temp_file_count) * MAX_ALLOWED_INDEX_BASE), db(db), file_index(index),
+    : max_allowed_index((1ull << temp_file_count) * MAX_ALLOWED_INDEX_BASE), db(db), file_index(index),

1 << temp_file_count is computed as a 32-bit integer, so just to stay safe I would explicitly use 64-bit types.

@lnkuiper (Contributor, Author) replied:

I agree that this should have been explicitly typed. In practice, though, it will never be an issue: by the time temp_file_count reaches 30 (just before the 32-bit shift overflows), we already have a temp file that can store more than an exabyte.
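
For concreteness (assuming DuckDB's default block size of 256 KiB): at temp_file_count = 30, a single temp file may hold 2^30 * 4,000 ≈ 4.3 * 10^12 blocks, i.e. roughly 4.3 * 10^12 * 256 KiB ≈ 1.1 * 10^18 bytes, just over an exabyte.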

@Mytherin (Collaborator) replied:

Considering temp_file_count is an idx_t, will it actually be computed as a 32-bit integer?

The contributor replied:

I tried this:

#include <cstdint>
#include <iostream>

int main()
{
    auto one = 1;        // deduced as int (32-bit on typical platforms)
    auto bigone = 1ull;  // deduced as unsigned long long (64-bit)
    uint64_t exp = 32;
    // The result type of << is that of the (promoted) left operand, so the
    // 64-bit right operand does not widen the first shift.
    std::cout << (one << exp) << '\n'
              << (bigone << exp) << '\n';
}

On my machine that prints:

1
4294967296

(The first result is not portable: shifting a 32-bit int by 32 is actually undefined behavior, and it merely happens to print 1 here. The point stands, though: the first shift is done in 32 bits, the second in a well-defined 64 bits.)
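
(Background on why the idx_t right operand doesn't help: unlike + or *, the built-in shift operators don't apply the usual arithmetic conversions across both operands. Each operand is promoted independently, and the result has the type of the promoted left operand, so 1 << temp_file_count is an int regardless of temp_file_count's type.)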

@lnkuiper deleted the parquet_too_many_files branch on September 6, 2023.