[SPARK-44924][SS] Add config for FileStreamSource cached files #45362
Closed
ragnarok56 wants to merge 6 commits into apache:master from ragnarok56:filestream-cached-files-config
Conversation
ragnarok56 force-pushed the filestream-cached-files-config branch from e1f95ee to 8b81a26 on March 16, 2024 23:07
ragnarok56 force-pushed the filestream-cached-files-config branch from 8b81a26 to 473cd39 on May 10, 2024 01:06
HeartSaVioR approved these changes on May 20, 2024

+1
Apologies for the delay... I was dealing with too many context switches and forgot about this. Thanks again for pushing this forward!
Thanks! Merging to master.
HeartSaVioR pushed a commit that referenced this pull request on May 22, 2024

… Trigger.AvailableNow

### What changes were proposed in this pull request?
Files don't need to be cached for reuse in `FileStreamSource` when using `Trigger.AvailableNow`, because all files are already cached for the lifetime of the query in `allFilesForTriggerAvailableNow`.

### Why are the changes needed?
As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with a PR to address it, #45362), the hardcoded cap of 10k cached files can cause problems when using a `maxFilesPerTrigger` greater than 10k. It causes every other batch to be 10k files, which can greatly limit the throughput of a new stream trying to catch up.

### Does this PR introduce _any_ user-facing change?
Every other streaming batch won't be 10k files when using Trigger.AvailableNow with `maxFilesPerTrigger` greater than 10k.

### How was this patch tested?
New UT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46627 from Kimahriman/available-now-no-cache.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
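A minimal Python sketch (not Spark code; names and behavior are simplified from the commit message above) of why AvailableNow needs no leftover-file cache — the listing is snapshotted once at query start and simply drained in `maxFilesPerTrigger`-sized batches:

```python
def available_now_batches(files, max_files_per_trigger):
    """Simulates Trigger.AvailableNow: the file listing is snapshotted once
    at query start, then drained in maxFilesPerTrigger-sized batches, so no
    cross-batch cache of leftover files is needed."""
    snapshot = list(files)  # analogous to allFilesForTriggerAvailableNow
    batch_sizes = []
    for i in range(0, len(snapshot), max_files_per_trigger):
        batch_sizes.append(len(snapshot[i:i + max_files_per_trigger]))
    return batch_sizes
```

Every batch except possibly the last is full-sized; there is no alternation between full and capped batches.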
HeartSaVioR pushed a commit that referenced this pull request on Jul 8, 2024

…causes batch with no files to be processed

### What changes were proposed in this pull request?
This is a followup to a bug identified from #45362. When setting `maxCachedFiles` to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924), subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles`, which was only being null checked. This update adds checks to verify that `unreadFiles` is also non-empty as a guard condition to prevent batches executing with no files, as well as checks to ensure that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0.

### Why are the changes needed?
Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is unexpected behavior for users.

### Does this PR introduce _any_ user-facing change?
Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0.

### How was this patch tested?
New test added to verify that `maxCachedFiles` set to 0 performs a file listing each batch.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase.

Lead-authored-by: ragnarok56 <kevin.nacios@gmail.com>
Co-authored-by: Kevin Nacios <kevin.nacios@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
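A hypothetical Python sketch of the guard-condition bug described above and its fix (heavily simplified from the Scala source; function names and signatures are invented for illustration):

```python
def next_batch_buggy(listing, max_files, max_cached, unread):
    """Leftover-file cache with only a null check: with maxCachedFiles = 0,
    an empty cache list slips through and yields a batch with no files."""
    if unread is not None:  # [] passes this check
        return unread, None
    batch = listing[:max_files]
    return batch, listing[max_files:max_files + max_cached]  # [] when max_cached == 0

def next_batch_fixed(listing, max_files, max_cached, unread):
    """Fixed guard: require a non-empty cache, and only keep a cache when
    files actually remain in the listing and the cap is greater than 0."""
    if unread:  # None and [] both fall through to a fresh listing
        return unread, None
    batch = listing[:max_files]
    leftover = listing[max_files:max_files + max_cached]
    return batch, (leftover if leftover and max_cached > 0 else None)
```

With `max_cached = 0`, the buggy version serves an empty second batch, while the fixed version performs a full relisting every batch.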
ericm-db pushed a commit to ericm-db/spark that referenced this pull request on Jul 10, 2024
jingz-db pushed a commit to jingz-db/spark that referenced this pull request on Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request on Oct 4, 2024
What changes were proposed in this pull request?
This change adds configuration options `maxCachedFiles` and `discardCachedInputRatio` for the streaming file source. These values were originally introduced with #27620 but were hardcoded to 10,000 and 0.2, respectively.

Why are the changes needed?
Under certain workloads with large `maxFilesPerTrigger` settings, capping the cached input files at 10,000 can leave a cluster underutilized and make jobs take longer to finish when each batch takes a while to complete. For example, a job with `maxFilesPerTrigger` set to 100,000 would process all 100k files in batch 1 but only 10k in batch 2, and both batches could take just as long, since some of the files cause skewed processing times. This results in a cluster spending nearly the same amount of time while processing only 1/10 of the files it could have.

Does this PR introduce any user-facing change?
Updated documentation for structured streaming sources to describe the new configuration options.
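A hypothetical usage sketch of setting these options on a file stream source. Option names are taken from this PR's description; the format, input path, and exact API surface after merge are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream
    .format("json")
    .option("maxFilesPerTrigger", 100_000)   # files per micro-batch
    .option("maxCachedFiles", 100_000)       # previously hardcoded to 10,000
    .option("discardCachedInputRatio", 0.2)  # previously hardcoded to 0.2
    .load("/data/input")                     # hypothetical input path
)
```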
How was this patch tested?
New and existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No
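The alternating batch pattern described under "Why are the changes needed?" can be sketched with a small, self-contained Python model (a simplification of `FileStreamSource`'s cache behavior, not the actual implementation):

```python
def simulate_batches(num_batches, max_files_per_trigger, max_cached_files):
    """Models a file source whose listing always returns more than enough
    files: a fresh listing fills a full batch and caches up to
    max_cached_files leftovers; a cached batch is served as-is."""
    batch_sizes = []
    cached = 0  # leftover files cached from the previous listing
    for _ in range(num_batches):
        if cached > 0:
            batch_sizes.append(cached)  # serve only the cached files
            cached = 0
        else:
            batch_sizes.append(max_files_per_trigger)
            cached = max_cached_files   # leftovers kept for the next batch
    return batch_sizes
```

With the old hardcoded cap of 10,000, a `maxFilesPerTrigger` of 100,000 alternates 100k/10k batches; raising `maxCachedFiles` to match keeps every batch full-sized.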