[SPARK-26872][STREAMING] Use a configurable value for final termination in the JobScheduler.stop() method #23926
Conversation
What changes were proposed in this pull request?
This PR lets the user set spark.streaming.jobTimeout to control how long JobScheduler.stop() waits for running jobs; after that timeout, it terminates them forcefully.
How was this patch tested?
Tested manually.
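The change described above amounts to replacing a hard-coded wait in JobScheduler.stop() with a configured value. A minimal sketch of the idea, assuming the proposed conf key spark.streaming.jobTimeout and a default matching today's one-hour constant (SparkConf.getTimeAsMs is a real API; the surrounding shape mirrors the existing stop() logic, but this is not the literal patch):

```scala
import java.util.concurrent.TimeUnit

// Sketch of the proposed change inside JobScheduler.stop().
// Today the graceful shutdown path waits a hard-coded magic number:
//   jobExecutor.awaitTermination(1, TimeUnit.HOURS)
// The proposal reads the wait from configuration instead:
val timeoutMs = ssc.conf.getTimeAsMs("spark.streaming.jobTimeout", "1h") // proposed key

jobExecutor.shutdown()
val terminated = jobExecutor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)
if (!terminated) {
  jobExecutor.shutdownNow() // force termination once the configured timeout expires
}
```

With getTimeAsMs, users could then pass values like "30m" or "2h" without a rebuild.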
Can one of the admins verify this patch?
What's the use case for this? I think it would be abnormal for termination to take a long time in this case.
Recognizing that my use case may take hours, I would still suggest the units for the configurable value be minutes, not hours.
From my initial Jira request:
I agree I'm abusing the Spark Streaming Context, but take it as a compliment that the Spark code can be used in flexible, unforeseen ways!
Hm, I thought an earlier part of the code is what waited for the batches to complete, but I think this is it. What about waiting for a multiple of the batch interval in this case?
Basically, I found I could process a single batch of file input data through a streaming pipeline by:
The batch interval, not unexpectedly, determines when the first (and in my case only) batch actually begins processing. Since I'm impatient (and who among us isn't?), my batch interval is 1 millisecond, so processing begins immediately. Based upon the size of the input file, my expectation is to set the new spark.streaming.jobTimeout value to twice the guesstimated run time. I expect my jobs to run for hours, not days. While specifying the jobTimeout in units of hours is acceptable, it may not be granular enough for other potential use cases. Specifying the timeout in minutes feels like the proper compromise between flexibility and awkwardly large numbers.
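The single-batch workaround described in this thread could be sketched roughly as below. The paths, app name, and timeout value are placeholders; textFileStream, awaitTerminationOrTimeout, and stop(stopSparkContext, stopGracefully) are real StreamingContext APIs, but the overall program is an illustrative reconstruction, not the commenter's actual code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object SingleBatchViaStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("single-batch-backfill")
    // A 1 ms batch interval so the first (and only) batch starts immediately.
    val ssc = new StreamingContext(conf, Milliseconds(1))

    val lines = ssc.textFileStream("/data/backfill/input") // hypothetical path
    lines.foreachRDD { rdd =>
      // Skip the empty batches that queue up behind the first real one.
      if (!rdd.isEmpty()) rdd.saveAsTextFile("/data/backfill/output") // hypothetical path
    }

    ssc.start()
    // Wait roughly twice the guesstimated run time, then stop gracefully.
    val jobTimeoutMs = 2 * 60 * 60 * 1000L // e.g. 2 hours, tuned per input size
    ssc.awaitTerminationOrTimeout(jobTimeoutMs)
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

The graceful stop is where the hard-coded one-hour wait in JobScheduler.stop() becomes visible to the application.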
In that case, why use streaming at all? Is it for testing?
The application processes streaming data from Kafka 24/7. The file processing is a backup mechanism for those "rare" occasions when something goes bump in the night and downstream processing fails. We manually run the same application to pick up the missing output by processing the raw input files that were saved while processing the streaming data. We have had issues with the manual process using […]. The single-batch file data technique I outlined previously allows us to again use the same application to process the input data, but now by reading the input data directly from where it sits, without the moving and file watching. The longer-term goal is to restructure the application architecture so that either a […]
It still seems much easier to process manually by not using streaming. Running one batch and stopping is exactly what non-streaming running is. I'm not getting it, if that's the use case.
The problem is that the application currently only supports the […]
How about you wait for the batch to finish, and then shut it down? Possibly with shutdownNow()? If there are no more batches, that should terminate quickly anyway, no?
With a "long" batch interval, you have to wait the interval time before processing starts for the first batch; with a "short" batch interval, you'll get a number of empty batches queued until your first batch completes. Since the additional batches are empty, at least a couple of them squeak through with empty results files.

We may not agree on the workaround mechanism (and even at that I agree the workaround is an abuse of the Spark streaming feature), but I suspect we are in agreement that hard-coded values in software lead to inflexibility and limitations that are best avoided.
…On Sun, Mar 3, 2019 at 8:38 PM Sean Owen wrote:
How about you wait for the batch to finish, and then shut it down?
possibly with shutdownNow()? if there are no more batches, that should
terminate quickly anyway, no?
@smrosenberry
@srowen and @smrosenberry
@shivusondur Thanks for the support for community requests!
I think the only change I'd entertain here is some multiple of the batch interval, like 2. I don't think this warrants a whole new config. I don't see why data is not available at the first batch in your example, but whatever the cause, I'm not sure Spark should accommodate it with a new config.
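The alternative floated here, deriving the wait from the batch interval rather than adding a new config, might look something like this inside JobScheduler.stop(). The factor of 2 and the way the batch duration is reached (ssc.graph.batchDuration) are assumptions for illustration; this is a sketch, not a proposed patch:

```scala
import java.util.concurrent.TimeUnit

// Alternative sketch: wait a multiple of the batch interval instead of
// introducing spark.streaming.jobTimeout.
val waitMs = 2 * ssc.graph.batchDuration.milliseconds // hypothetical accessor path

jobExecutor.shutdown()
val terminated = jobExecutor.awaitTermination(waitMs, TimeUnit.MILLISECONDS)
if (!terminated) {
  jobExecutor.shutdownNow() // force shutdown if jobs outlive the derived wait
}
```

Note the tension raised in the thread: with a 1 ms batch interval, a multiple of the interval is far too short for an hours-long single batch, which is why the commenter argues for an independent timeout.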
Unfortunately, the batch interval introduces different issues for this use case (see my previous message), since it controls the ongoing streaming process. What this use case needs is a way to stop the streaming gracefully. From questions on StackOverflow, I know others besides myself would find it useful to limit the number of batches created and processed, followed by a clean termination of the application. Without much research, I expect such a change would reach deeper into the core code than the proposed spark.streaming.jobTimeout, which simply reuses existing code by eliminating a hard-coded magic number.
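The "limit the batches, then terminate cleanly" idea mentioned above can be approximated today with a StreamingListener, without core changes. StreamingListener, onBatchCompleted, and batchInfo.numRecords are real Spark APIs; the stop-after-first-non-empty-batch policy and the thread handoff (so the context is not stopped from inside the listener callback) are this sketch's assumptions:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Stop the streaming context after the first non-empty batch completes.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    if (batch.batchInfo.numRecords > 0) {
      // Stop asynchronously: stop() must not block the listener bus thread.
      new Thread(new Runnable {
        override def run(): Unit =
          ssc.stop(stopSparkContext = true, stopGracefully = true)
      }).start()
    }
  }
})
```

Even with this workaround, the graceful stop still funnels through JobScheduler.stop() and its hard-coded wait, which is the value this PR proposes to make configurable.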
@srowen |
When you write "only developer can configure", are you suggesting changing the value in source and then rebuilding? Security policy would prevent me from doing that. If I could have done that, I would have done so already. ;)
@smrosenberry