[SPARK-44924][SS] Add config for FileStreamSource cached files #45362

Closed

Conversation

ragnarok56
Contributor

@ragnarok56 ragnarok56 commented Mar 2, 2024

What changes were proposed in this pull request?

This change adds configuration options `maxCachedFiles` and `discardCachedInputRatio` for the streaming file source. These values were originally introduced in #27620 but were hardcoded to 10,000 and 0.2, respectively.

Why are the changes needed?

Under certain workloads with large maxFilesPerTrigger settings, the file cache's hard cap of 10,000 entries can leave a cluster underutilized and make jobs take longer to finish when each batch takes a while to complete. For example, a job with maxFilesPerTrigger set to 100,000 would process all 100k files in batch 1, but only the 10k cached files in batch 2; if skewed per-file processing times make both batches take roughly as long, the cluster spends nearly the same amount of time on batch 2 while processing only 1/10 of the files it could have.
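The alternating-batch pattern described above can be illustrated with a small, hypothetical Python simulation. The `batch_sizes` helper and its caching rules are a simplification for illustration only, not Spark's actual `FileStreamSource` implementation:

```python
def batch_sizes(total_files, max_files_per_trigger, max_cached_files):
    """Return the number of files picked up by each simulated micro-batch.

    On a full listing, a batch takes up to max_files_per_trigger files and
    caches up to max_cached_files of the leftover unread files; the next
    batch is then served from that (capped) cache alone.
    """
    sizes = []
    cached = 0
    remaining = total_files
    while remaining > 0:
        if cached:
            # Batch served only from the cache, however small it is.
            take = min(cached, max_files_per_trigger, remaining)
            cached = 0
        else:
            # Full listing: fill the batch, cache a capped slice of the rest.
            take = min(max_files_per_trigger, remaining)
            cached = min(max_cached_files, remaining - take)
        sizes.append(take)
        remaining -= take
    return sizes

# With the hardcoded 10k cap, every other batch is starved:
print(batch_sizes(220_000, 100_000, 10_000))   # [100000, 10000, 100000, 10000]
```

Raising `maxCachedFiles` to match `maxFilesPerTrigger` removes the starved batches in this model, which is the knob this PR exposes.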

Does this PR introduce any user-facing change?

Updated the structured streaming source documentation to describe the new configuration options.

How was this patch tested?

New and existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

Contributor

@HeartSaVioR HeartSaVioR left a comment


+1

Apologies for the delay... I was dealing with too many context switches and forgot about this. Thanks again for pushing this forward!

@HeartSaVioR
Contributor

Thanks! Merging to master.

HeartSaVioR pushed a commit that referenced this pull request May 22, 2024
… Trigger.AvailableNow

### What changes were proposed in this pull request?

Files don't need to be cached for reuse in `FileStreamSource` when using `Trigger.AvailableNow` because all files are already cached for the lifetime of the query in `allFilesForTriggerAvailableNow`.
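A hypothetical sketch of why the unread-file cache is redundant under `Trigger.AvailableNow`: the source snapshots the full listing once (held in `allFilesForTriggerAvailableNow`) and then slices it per batch, so there is never a re-listing for the cache to avoid. The helper below is an illustration, not Spark's implementation:

```python
def available_now_batches(all_files, max_files_per_trigger):
    """Yield successive micro-batches from the one-time listing snapshot."""
    for i in range(0, len(all_files), max_files_per_trigger):
        yield all_files[i:i + max_files_per_trigger]

snapshot = [f"file-{n}" for n in range(25)]   # listed exactly once
sizes = [len(b) for b in available_now_batches(snapshot, 10)]
# Every batch is full (except the last remainder); no cache cap applies.
print(sizes)   # [10, 10, 5]
```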

### Why are the changes needed?

As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with a PR to address it, #45362), the hard-coded cap of 10k cached files can cause problems when maxFilesPerTrigger is greater than 10k: every other batch is limited to 10k files, which can greatly reduce the throughput of a new stream trying to catch up.

### Does this PR introduce _any_ user-facing change?

Every other streaming batch won't be 10k files if using Trigger.AvailableNow and maxFilesPerTrigger greater than 10k.

### How was this patch tested?

New UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46627 from Kimahriman/available-now-no-cache.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
HeartSaVioR pushed a commit that referenced this pull request Jul 8, 2024
…causes batch with no files to be processed

### What changes were proposed in this pull request?

This is a followup to a bug identified in #45362. When `maxCachedFiles` is set to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924), subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles`, which was only checked for null. This update adds a check that `unreadFiles` is also non-empty as a guard against executing batches with no files, and ensures that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0.
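The guard described above can be sketched in a few lines of hypothetical Python (the helper names are illustrative, not Spark's actual code): with `maxCachedFiles = 0` the old logic carried forward an empty, non-None list, so a null check alone let zero-file batches through.

```python
def next_batch_source(unread_files):
    """Decide whether to serve the next batch from cache or re-list.

    Fixed guard: the cache is usable only if it is both non-None AND
    non-empty; an empty list (maxCachedFiles = 0) must trigger a re-list.
    """
    if unread_files is not None and len(unread_files) > 0:
        return "cache"
    return "relist"

def cache_leftovers(leftovers, max_cached_files):
    """Only retain a cache when there are leftovers and a positive budget."""
    if leftovers and max_cached_files > 0:
        return leftovers[:max_cached_files]
    return None
```

Under the old null-only check, `next_batch_source([])` would have returned `"cache"` and produced an empty batch; the fixed guard falls through to a full listing instead.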

### Why are the changes needed?

Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users.

### Does this PR introduce _any_ user-facing change?

Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0

### How was this patch tested?

New test added to verify `maxCachedFiles` set to 0 would perform a file listing each batch

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase.

Lead-authored-by: ragnarok56 <kevin.nacios@gmail.com>
Co-authored-by: Kevin Nacios <kevin.nacios@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>