
[#9280] improvement(catalogs-fileset): Refactor FileSystem retrieval to use future and solve hang problem #9282

Merged
jerryshao merged 25 commits into apache:main from yuqi1129:issue_9280
Dec 8, 2025

Conversation


@yuqi1129 yuqi1129 commented Nov 27, 2025

What changes were proposed in this pull request?

Replaced Awaitility-based polling with CompletableFuture for FileSystem retrieval using a dedicated ThreadPoolExecutor. This improves performance and simplifies timeout handling by leveraging async execution and modern concurrency utilities. Added cancellation handling and clearer exception propagation for robustness.
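As a rough illustration of the pattern described above (a minimal sketch, not Gravitino's actual code: `Provider`, the pool construction, and the timeout value are all assumptions), the potentially hanging call is submitted to a dedicated executor and bounded with `Future.get(timeout)`:

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FileSystemFetcher {

  // Hypothetical stand-in for the real FileSystemProvider call that may hang.
  public interface Provider {
    Object getFileSystem() throws IOException;
  }

  // Dedicated daemon pool so a hanging getFileSystem() call cannot block callers.
  private static final ExecutorService EXECUTOR =
      Executors.newCachedThreadPool(
          r -> {
            Thread t = new Thread(r, "fileset-filesystem-getter");
            t.setDaemon(true);
            return t;
          });

  // Run the potentially hanging call asynchronously and bound the wait with a
  // timeout; on timeout, cancel the task (interrupting it) so it does not leak.
  public static Object getFileSystem(Provider provider, long timeoutSeconds) throws IOException {
    Future<Object> future = EXECUTOR.submit(provider::getFileSystem);
    try {
      return future.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      future.cancel(true);
      throw new IOException("Timed out getting FileSystem after " + timeoutSeconds + "s", e);
    } catch (ExecutionException e) {
      throw new IOException("Failed to get FileSystem", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting for FileSystem", e);
    }
  }
}
```

Unlike Awaitility-style polling, this bounds the wall-clock wait even when the underlying call never returns, because the caller only blocks on the future, not on the call itself.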

Why are the changes needed?

provider.getFileSystem(path, config) may hang and never return a value, which causes the Awaitility-based polling mechanism to not work as expected.

Fix: #9280

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Tested locally and with existing tests.

@yuqi1129 (Contributor Author) commented:

Gravitino still hangs, although this has nothing to do with file system initialization. The following is the stack:

	at java.lang.Thread.sleep(java.base@17.0.8/Native Method)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doPauseBeforeRetry(AmazonHttpClient.java:1867)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.pauseBeforeRetry(AmazonHttpClient.java:1841)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1282)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5227)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5173)
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1360)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$6(S3AFileSystem.java:2066)
	at org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda$790/0x000000f8017b9a18.apply(Unknown Source)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:412)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:375)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2056)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2032)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3273)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4263)
	at org.apache.gravitino.catalog.fileset.FilesetCatalogOperations.createMultipleLocationFileset(FilesetCatalogOperations.java:499)

Let me solve it in this PR as well.

@yuqi1129 yuqi1129 changed the title [#9280] improvement(catalogs-fileset): Refactor FileSystem retrieval to use CompletableFuture [#9280] improvement(catalogs-fileset): Refactor FileSystem retrieval to use CompletableFuture and solve hang problem Nov 27, 2025
@yuqi1129 yuqi1129 requested a review from Copilot November 28, 2025 03:17
Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors FileSystem retrieval to use CompletableFuture instead of Awaitility-based polling to address a hang issue where provider.getFileSystem(path, config) could hang indefinitely. The changes include adding a dedicated ThreadPoolExecutor for async FileSystem retrieval with proper timeout and cancellation handling, plus adding default timeout configurations to various FileSystem providers (HDFS, GCS, Azure, S3, OSS) to speed up test failures.

  • Replaces Awaitility polling with Future.get() with timeout for more robust timeout handling
  • Introduces a static ThreadPoolExecutor for async FileSystem retrieval operations
  • Adds default connection timeout and retry configurations across all FileSystem providers

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file:

  • FilesetCatalogOperations.java: Replaces Awaitility with a Future-based timeout mechanism using a new ThreadPoolExecutor
  • HDFSFileSystemProvider.java: Adds default HDFS connection timeout and ping configurations
  • GCSFileSystemProvider.java: Adds default GCS HTTP connect timeout and retry limit configurations
  • AzureFileSystemProvider.java: Adds default Azure retry limit configuration
  • S3FileSystemProvider.java: Adds default S3 retry limit and connection timeout configurations
  • OSSFileSystemProvider.java: Adds default OSS connection timeout and retry limit configurations

Reduced thread pool keep-alive time from 50ms to 5s for better resource management. Added proper shutdown for the executor and improved logging for filesystem retrieval timeouts to enhance debugging and reliability.
@yuqi1129 yuqi1129 changed the title [#9280] improvement(catalogs-fileset): Refactor FileSystem retrieval to use CompletableFuture and solve hang problem [#9280] improvement(catalogs-fileset): Refactor FileSystem retrieval to use future and solve hang problem Nov 28, 2025
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.


private final ThreadPoolExecutor fileSystemExecutor =
new ThreadPoolExecutor(
Math.max(2, Runtime.getRuntime().availableProcessors() * 2),
Contributor:

This will create too many threads on the server if it has more than 32 cores.

Contributor Author:

I limited the max thread count to 24.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.

@yuqi1129 yuqi1129 closed this Dec 4, 2025
@yuqi1129 yuqi1129 reopened this Dec 4, 2025
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Comment on lines +1902 to +1909
// Test the following method should finish with 10s
long now = System.currentTimeMillis();
try {
filesetCatalogOperations.getFileSystem(new Path("file:///tmp"), ImmutableMap.of());
} catch (IOException e) {
long timeTake = System.currentTimeMillis() - now;
Assertions.assertTrue(timeTake <= 10000);
}
Copilot AI commented Dec 5, 2025:

The test doesn't verify that the exception is actually thrown - it silently catches IOException and only checks the timing. If no exception is thrown, the test will pass incorrectly. Consider using Assertions.assertThrows() to ensure an IOException is thrown, then verify the timing within that context.

Suggested change
// Test the following method should finish with 10s
long now = System.currentTimeMillis();
try {
filesetCatalogOperations.getFileSystem(new Path("file:///tmp"), ImmutableMap.of());
} catch (IOException e) {
long timeTake = System.currentTimeMillis() - now;
Assertions.assertTrue(timeTake <= 10000);
}
// Test the following method should finish with 10s and throw IOException
Assertions.assertThrows(
IOException.class,
() -> {
long now = System.currentTimeMillis();
try {
filesetCatalogOperations.getFileSystem(new Path("file:///tmp"), ImmutableMap.of());
} finally {
long timeTake = System.currentTimeMillis() - now;
Assertions.assertTrue(timeTake <= 10000, "Timeout should occur within 10 seconds");
}
});

Contributor Author:

Changed.

Comment on lines +143 to +158
private final ThreadPoolExecutor fileSystemExecutor =
new ThreadPoolExecutor(
Math.max(2, Math.min(Runtime.getRuntime().availableProcessors() * 2, 16)),
Math.max(2, Math.min(Runtime.getRuntime().availableProcessors() * 2, 32)),
5L,
TimeUnit.SECONDS,
new ArrayBlockingQueue<>(1000),
new ThreadFactoryBuilder()
.setDaemon(true)
.setNameFormat("fileset-filesystem-getter-pool-%d")
.build(),
new ThreadPoolExecutor.AbortPolicy()) {
{
allowCoreThreadTimeOut(true);
}
};
Copilot AI commented Dec 5, 2025:

The ThreadPoolExecutor is initialized as a final instance field, but the close() method calls shutdownNow() without checking if it has already been shut down. If close() is called multiple times, this could throw an exception or cause issues. Consider adding a guard to check the executor's state before shutting it down, or use a safer shutdown pattern.

Contributor Author:

added

Comment on lines +1904 to +1909
try {
filesetCatalogOperations.getFileSystem(new Path("file:///tmp"), ImmutableMap.of());
} catch (IOException e) {
long timeTake = System.currentTimeMillis() - now;
Assertions.assertTrue(timeTake <= 10000);
}
Copilot AI commented Dec 5, 2025:

The test expects an IOException to be thrown but doesn't verify the exception message or type. Consider adding assertions to verify that the correct exception is thrown with an appropriate error message about the timeout, to ensure the error handling path works as expected.


private final ThreadPoolExecutor fileSystemExecutor =
new ThreadPoolExecutor(
Math.max(2, Math.min(Runtime.getRuntime().availableProcessors() * 2, 16)),
Contributor:

Do we need to keep these threads alive when idle? My feeling is that most of the threads can be reclaimed when idle to save resources.

Contributor Author:

I have enabled core pool thread timeout (allowCoreThreadTimeOut), so the threads will stop when no tasks are assigned to the pool.
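This behavior can be demonstrated in isolation (a sketch mirroring the pool configuration quoted earlier in the thread; the exact bounds come from the diff, not necessarily the merged code): with allowCoreThreadTimeOut(true), even core threads exit once the keep-alive elapses, so the pool shrinks to zero between bursts of work.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSketch {

  // Core and max sizes are clamped to sane bounds, keep-alive is 5 seconds, and
  // allowCoreThreadTimeOut(true) lets core threads time out like the others, so
  // no threads are kept alive when the pool is idle.
  public static ThreadPoolExecutor newFileSystemPool() {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(
            Math.max(2, Math.min(Runtime.getRuntime().availableProcessors() * 2, 16)),
            Math.max(2, Math.min(Runtime.getRuntime().availableProcessors() * 2, 32)),
            5L,
            TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(1000),
            new ThreadPoolExecutor.AbortPolicy());
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }
}
```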

e);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new IOException("Interrupted while waiting for FileSystem", e);
Contributor:

I think this is expected, shall we throw an exception here?

Contributor Author:

Normally, a TimeoutException will occur if it hangs for a long time; only when we interrupt it deliberately will it throw InterruptedException.

Contributor:

I think you don't understand what I mean. InterruptedException often happens when closing or shutting down; this is expected, and we should not turn it into an IOException.
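One way to honor that suggestion (a hypothetical helper, not necessarily what was merged) is to restore the interrupt flag and surface InterruptedIOException, so a shutdown-time interrupt stays distinguishable from an ordinary I/O failure:

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FsWait {

  // Wait for a FileSystem future, translating each failure mode distinctly.
  // An interrupt usually means the server is shutting down, so restore the
  // flag and throw InterruptedIOException (an IOException subclass callers can
  // still tell apart) rather than a generic IOException.
  public static <T> T await(Future<T> future, long timeoutSec) throws IOException {
    try {
      return future.get(timeoutSec, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve interrupt status for callers
      throw new InterruptedIOException("Interrupted while waiting for FileSystem (likely shutdown)");
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the worker so it does not leak
      throw new IOException("Timed out waiting for FileSystem", e);
    } catch (ExecutionException e) {
      throw new IOException("FileSystem retrieval failed", e.getCause());
    }
  }
}
```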

filesetCatalogOperations.getFileSystem(new Path("file:///tmp"), ImmutableMap.of());
} catch (IOException e) {
long timeTake = System.currentTimeMillis() - now;
Assertions.assertTrue(timeTake <= 10000);
Contributor:

I guess this will be very flaky if the test machine is under heavy load.

Contributor Author:

I have used the @Timeout(15) annotation to replace it.

}

if (!configs.containsKey(HDFS_IPC_PING_KEY)) {
additionalConfigs.put(HDFS_IPC_PING_KEY, "true");
Contributor:

I found that you have several customized conf values here and above; you should also define constants for these.

Contributor Author:

changed
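The suggestion amounts to something like the following (the constant names, the chosen defaults, and the merge helper are illustrative, though keys such as ipc.client.ping and ipc.client.connect.timeout are real Hadoop configuration keys):

```java
import java.util.HashMap;
import java.util.Map;

public class HdfsTimeoutDefaults {

  // Named constants instead of inline string literals (names and default
  // values here are illustrative, not copied from the PR).
  static final String HDFS_IPC_PING_KEY = "ipc.client.ping";
  static final String HDFS_IPC_PING_DEFAULT = "true";
  static final String HDFS_CONNECT_TIMEOUT_KEY = "ipc.client.connect.timeout";
  static final String HDFS_CONNECT_TIMEOUT_DEFAULT = "10000";

  // Apply defaults only for keys the user has not set explicitly, matching the
  // containsKey checks in the quoted diff.
  public static Map<String, String> withDefaults(Map<String, String> configs) {
    Map<String, String> merged = new HashMap<>(configs);
    merged.putIfAbsent(HDFS_IPC_PING_KEY, HDFS_IPC_PING_DEFAULT);
    merged.putIfAbsent(HDFS_CONNECT_TIMEOUT_KEY, HDFS_CONNECT_TIMEOUT_DEFAULT);
    return merged;
  }
}
```

Centralizing the keys this way keeps the provider classes consistent and makes the defaults easy to audit in one place.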

@jerryshao jerryshao merged commit af47d3a into apache:main Dec 8, 2025
26 checks passed


Development

Successfully merging this pull request may close these issues.

[Improvement] Reduce possibility of filesystem instance hangs when connecting to HDFS/S3/GCS

3 participants