[SPARK-44924][SS] Add config for FileStreamSource cached files #45362

Closed

Conversation

ragnarok56
Contributor

@ragnarok56 ragnarok56 commented Mar 2, 2024

What changes were proposed in this pull request?

This change adds configuration options `maxCachedFiles` and `discardCachedInputRatio` for the streaming file source. These values were originally introduced in #27620 but were hardcoded to 10,000 and 0.2, respectively.

Why are the changes needed?

Under certain workloads with large maxFilesPerTrigger settings, the file cache's hard cap of 10,000 entries can leave a cluster underutilized and make jobs take longer to finish when each batch takes a while to complete. For example, a job with maxFilesPerTrigger set to 100,000 would process all 100k files in batch 1, but only the 10k cached files in batch 2; if skewed per-file processing times make both batches take roughly as long, the cluster spends nearly the same amount of time on batch 2 while processing only 1/10 of the files it could have.
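The alternating-batch pattern described above can be illustrated with a small, hypothetical Python simulation. The `batch_sizes` helper and its caching rules are a simplification for illustration only, not Spark's actual `FileStreamSource` implementation:

```python
def batch_sizes(total_files, max_files_per_trigger, max_cached_files):
    """Return the number of files picked up by each simulated micro-batch.

    On a full listing, a batch takes up to max_files_per_trigger files and
    caches up to max_cached_files of the leftover unread files; the next
    batch is then served from that (capped) cache alone.
    """
    sizes = []
    cached = 0
    remaining = total_files
    while remaining > 0:
        if cached:
            # Batch served only from the cache, however small it is.
            take = min(cached, max_files_per_trigger, remaining)
            cached = 0
        else:
            # Full listing: fill the batch, cache a capped slice of the rest.
            take = min(max_files_per_trigger, remaining)
            cached = min(max_cached_files, remaining - take)
        sizes.append(take)
        remaining -= take
    return sizes

# With the hardcoded 10k cap, every other batch is starved:
print(batch_sizes(220_000, 100_000, 10_000))   # [100000, 10000, 100000, 10000]
```

Raising `maxCachedFiles` to match `maxFilesPerTrigger` removes the starved batches in this model, which is the knob this PR exposes.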

Does this PR introduce any user-facing change?

Updated the structured streaming source documentation to describe the new configuration options.

How was this patch tested?

New and existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

Contributor

@HeartSaVioR HeartSaVioR left a comment


+1

Apologies for the delay... I was dealing with too many context switches and forgot about this. Thanks again for pushing this forward!

@HeartSaVioR
Contributor

Thanks! Merging to master.

HeartSaVioR pushed a commit that referenced this pull request May 22, 2024
… Trigger.AvailableNow

### What changes were proposed in this pull request?

Files don't need to be cached for reuse in `FileStreamSource` when using `Trigger.AvailableNow` because all files are already cached for the lifetime of the query in `allFilesForTriggerAvailableNow`.
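A hypothetical sketch of why the unread-file cache is redundant under `Trigger.AvailableNow`: the source snapshots the full listing once (held in `allFilesForTriggerAvailableNow`) and then slices it per batch, so there is never a re-listing for the cache to avoid. The helper below is an illustration, not Spark's implementation:

```python
def available_now_batches(all_files, max_files_per_trigger):
    """Yield successive micro-batches from the one-time listing snapshot."""
    for i in range(0, len(all_files), max_files_per_trigger):
        yield all_files[i:i + max_files_per_trigger]

snapshot = [f"file-{n}" for n in range(25)]   # listed exactly once
sizes = [len(b) for b in available_now_batches(snapshot, 10)]
# Every batch is full (except the last remainder); no cache cap applies.
print(sizes)   # [10, 10, 5]
```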

### Why are the changes needed?

As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with a PR to address it, #45362), the hard-coded cap of 10k cached files can cause problems when maxFilesPerTrigger is greater than 10k: every other batch is limited to 10k files, which can greatly reduce the throughput of a new stream trying to catch up.

### Does this PR introduce _any_ user-facing change?

Every other streaming batch won't be 10k files if using Trigger.AvailableNow and maxFilesPerTrigger greater than 10k.

### How was this patch tested?

New UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46627 from Kimahriman/available-now-no-cache.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
HeartSaVioR pushed a commit that referenced this pull request Jul 8, 2024
…causes batch with no files to be processed

### What changes were proposed in this pull request?

This is a followup to a bug identified in #45362. When `maxCachedFiles` is set to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924), subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles`, which was only checked for null. This update adds a check that `unreadFiles` is also non-empty as a guard against executing batches with no files, and ensures that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0.
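The guard described above can be sketched in a few lines of hypothetical Python (the helper names are illustrative, not Spark's actual code): with `maxCachedFiles = 0` the old logic carried forward an empty, non-None list, so a null check alone let zero-file batches through.

```python
def next_batch_source(unread_files):
    """Decide whether to serve the next batch from cache or re-list.

    Fixed guard: the cache is usable only if it is both non-None AND
    non-empty; an empty list (maxCachedFiles = 0) must trigger a re-list.
    """
    if unread_files is not None and len(unread_files) > 0:
        return "cache"
    return "relist"

def cache_leftovers(leftovers, max_cached_files):
    """Only retain a cache when there are leftovers and a positive budget."""
    if leftovers and max_cached_files > 0:
        return leftovers[:max_cached_files]
    return None
```

Under the old null-only check, `next_batch_source([])` would have returned `"cache"` and produced an empty batch; the fixed guard falls through to a full listing instead.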

### Why are the changes needed?

Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users.

### Does this PR introduce _any_ user-facing change?

Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0

### How was this patch tested?

New test added to verify `maxCachedFiles` set to 0 would perform a file listing each batch

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase.

Lead-authored-by: ragnarok56 <kevin.nacios@gmail.com>
Co-authored-by: Kevin Nacios <kevin.nacios@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>