
[SPARK-33240][SQL][3.0] Fail fast when fails to instantiate configured v2 session catalog #30167

Closed

Conversation

HeartSaVioR
Contributor

What changes were proposed in this pull request?

This patch proposes to change the behavior so that Spark fails fast when it fails to instantiate the configured v2 session catalog.

Why are the changes needed?

The current Spark behavior goes against the intention of end users: if they configure a session catalog that Spark fails to initialize, Spark swallows the error, logging only the error message, and silently uses the default catalog implementation.

This follows the opinions voiced on the discussion thread on the dev mailing list: https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E

Does this PR introduce any user-facing change?

Yes. After this PR, Spark will fail immediately if it fails to instantiate the configured session catalog.
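
For illustration, here is a minimal sketch of the kind of setup this change affects. The AuditingSessionCatalog class and the table statement are hypothetical examples, not part of this patch; only the spark.sql.catalog.spark_catalog setting and DelegatingCatalogExtension come from Spark itself.

```scala
// Hypothetical custom session catalog, for illustration only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.{DelegatingCatalogExtension, Identifier, Table}

class AuditingSessionCatalog extends DelegatingCatalogExtension {
  override def loadTable(ident: Identifier): Table = {
    // Record the access, then delegate to the built-in session catalog.
    println(s"audit: loading table $ident")
    super.loadTable(ident)
  }
}

object SessionCatalogExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      // Before this patch: if this class cannot be instantiated (e.g. missing jar,
      // constructor throws), Spark only logs the error and silently uses the
      // built-in session catalog. After this patch: the query fails immediately.
      .config("spark.sql.catalog.spark_catalog", classOf[AuditingSessionCatalog].getName)
      .getOrCreate()

    spark.sql("CREATE TABLE audit_demo (id INT) USING parquet")
    spark.stop()
  }
}
```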

How was this patch tested?

New UT added.

Closes apache#30147 from HeartSaVioR/SPARK-33240.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon
Member

HyukjinKwon commented Oct 28, 2020

Hm, is this something we should port back? I think it's hard to call it a bug. Maintenance releases shouldn't have such behaviour changes in general, according to semver. I get that DSv2 is still experimental, but this doesn't look like something necessary to port back.

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34955/

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34955/

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 28, 2020

Wasn't it declared as "incorrect" behavior in the discussion thread on the dev mailing list? If it's not a bug, what exactly are we fixing for the master branch? I think it's a fairly serious bug, since Spark "ignores" the end users' intention and silently falls back. Suppose the custom session catalog performs auditing; then Spark silently skips the auditing and continues with the operation.

@HyukjinKwon
Member

We changed the behaviour after discussion, and I think it's just a new design call we made. The previous behaviour, falling back, also makes sense by itself, although it was a bad design decision.

@HeartSaVioR
Contributor Author

I don't agree on the point of not declaring it "a bug" (as I don't think the previous behavior makes sense), but if we are hesitant about making a behavioral change in a bugfix release, that's a fair point. I'll keep this open to hear third-party voices and close it if I fail to get positive ones. Thanks for the opinion.

@HyukjinKwon
Member

All bug fixes bring behavioural changes. I think the point is whether the previous behaviour was intended when it was proposed and merged.

The previous behaviour looks intended, with a proper error message (which I agree wasn't a good call). Sure, waiting for other people's opinions is fine.

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130352 has finished for PR 30167 at commit f6550d0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@RussellSpitzer
Member

I would consider the old behavior a "swallowed exception." There is almost no chance that a user intended the original catalog to be used; they have no way of catching the error and retrying, and they cannot get any meaningful information from the failure when it occurs.

Like I mentioned on the mailing list, I think failing fast is the right thing to do; we could even view this as a correctness issue. To give a silly example, imagine the replacement catalog redefined "INSERT" to not just write data to a file but also simultaneously send updates to a service that records data in a separate system. If the user has done a catalog replacement but this behavior does not actually occur on insert, then their system will be inconsistent. If they were previously running this job without error, they actually did have the error, but it was swallowed, so they just didn't know about it.

At the very minimum we should dump the whole trace so that end users can at least see why something broke, but I do think it's dangerous to go on with business as usual after we fail to replace the catalog.
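
To make the shape of the debated change concrete, here is a schematic sketch of the swallow-and-fall-back pattern versus the fail-fast version proposed here; this is not the actual CatalogManager code, and all names below are illustrative.

```scala
// Schematic sketch only: the trait and method names are illustrative, not the
// real CatalogManager internals; it just shows the shape of the two behaviours.
import scala.util.control.NonFatal
import org.slf4j.LoggerFactory
import org.apache.spark.sql.connector.catalog.CatalogPlugin

trait SessionCatalogLoading {
  private val logger = LoggerFactory.getLogger(getClass)

  // Instantiates the catalog class named in the configuration; may throw.
  def loadConfiguredSessionCatalog(): CatalogPlugin
  // The built-in session catalog used as the fallback.
  def defaultSessionCatalog: CatalogPlugin

  // Previous behaviour: swallow the error and silently fall back.
  def v2SessionCatalogWithFallback(): CatalogPlugin =
    try {
      loadConfiguredSessionCatalog()
    } catch {
      case NonFatal(e) =>
        // Only the message is logged; the stack trace and the failure itself are hidden.
        logger.error(s"Cannot instantiate the configured v2 session catalog: ${e.getMessage}")
        defaultSessionCatalog
    }

  // Behaviour proposed in this PR: fail fast by letting the error propagate.
  def v2SessionCatalogFailFast(): CatalogPlugin = loadConfiguredSessionCatalog()
}
```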

@HyukjinKwon
Member

@RussellSpitzer, I completely agree with this fix itself. But about porting it back:

The flip side applies too. For example, for whatever reason, the catalog was set in a cluster and users have been using it with no problem. The users don't want to change their existing workload and code but want to switch the catalog they are using, so they just reset the configuration in their cluster. After the fix, this case wouldn't work anymore. Of course, I like the fail-fast approach since it can prevent such misuse, but from the users' perspective I'd imagine they would see it as a behaviour change, which should ideally not happen in a maintenance release.

To be clearer:

  • If users set their catalog properly, things work correctly with or without this fix.
  • This fix throws an exception for a case (which we now define as misuse) that might have worked before.

I think it's arguable whether this counts as a bug fix that should be ported back.

@RussellSpitzer
Member

I think the use case you are describing is rather unlikely compared to users who set this property and expect it to either work or fail. It's hard for me to imagine someone relying on the fact that any exception during catalog initialization is OK; the domain of possible failures is just too large. It's more like their application works in spite of their intent, rather than in coordination with it.

From my perspective, the vast majority of users who set this property expect to use it, and if it doesn't work they will stop and change the configuration; but we make that harder to determine by ignoring the exception, and harder still by swallowing the stack trace. The flip side is much harder for me to imagine: a user who is OK with a fallback that can occur for any number of reasons they cannot actually predict.

So again, at minimum we should be logging the exception and trace, but I don't think we would actually disrupt any real use cases by failing fast either.

@HyukjinKwon
Member

I am not sure. It seems unusual to me to port a fix back to a maintenance release when there's a possibility it breaks something that can make sense. I would be fine with it if other committers prefer it.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 28, 2020

Whether the previous behavior makes sense or not looks to be the major argument; I think silently ignoring the user's intention never makes sense. Setting the config reflects their intention, and end users will get annoyed if that intention is silently disregarded in any case. If they run the query in spark-shell they'll see the error log message, so they might wonder why and try to figure it out immediately; but if they run the query in an application they'll only notice later (earlier with a good monitoring tool, otherwise probably not until they encounter some other issue), and by then the bad things have already occurred.

We have examples of the harm from the previous behavior, but we haven't made clear its bright side. Without enumerating the cases where the previous behavior is beneficial, I wouldn't agree with the statement that this is not a bug but a design call.

It's probably worth reviving the original discussion thread to discuss this.

@HyukjinKwon
Member

HyukjinKwon commented Oct 29, 2020

@HeartSaVioR, I fully agree that the newer behaviour makes much more sense. I also get that this fix brings many benefits. We have many fixes in the master branch that bring a lot of benefit.

As you pointed out, my point is that the fallback behaviour might make sense in a few contexts, e.g. #30167 (comment), and I wouldn't expect such behaviour changes in a maintenance release. That makes me less sure about porting it back.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 29, 2020

For example, for whatever reason, the catalog was set in a cluster and users have been using it with no problem. The users don't want to change their existing workload and code but want to switch the catalog they are using, so they just reset the configuration in their cluster. After the fix, this case wouldn't work

"The catalog was set and users used to use it with no problem" - we are talking about the case catalog was set but it wasn't effectively working as the default catalog (failback) isn't same as they provided, right? My declaration of this is "silently broken" which is "worse" than "not functioning". That is not a kind of "no problem". If you don't agree with the statement, let's hear the opinions from broader group.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 29, 2020

Additionally, if this pattern is common in the Spark codebase, I think we should revisit it: if users configure something (A) and Spark decides to fall back to something else (B), that should only happen when there's no functional difference between A and B (e.g. the whole-stage codegen fallback might be OK, since it should ideally only differ in performance). Otherwise Spark is silently breaking the user's intention.

@HyukjinKwon
Member

HyukjinKwon commented Oct 29, 2020

In a way, a user could think it's all internal as long as the tables are created somewhere and can be read regardless of where they are created. And such a user could think "there's no functional difference".

I agree the above isn't right, for all the reasons we discussed on the mailing list and here, but I would prefer to avoid such a potential surprise in a maintenance release.

@HeartSaVioR
Contributor Author

The current behavior was already a surprise; otherwise I wouldn't insist so strongly on trying to fix it.

Btw, I'd probably just revive the other thread, "disallow setting a custom v2 session catalog until all statements go through the v2 session catalog", instead of this one, as we are finding more and more issues there.

@dongjoon-hyun
Member

In the context of branch-3.0, I agree with @HyukjinKwon's stance.

@HeartSaVioR
Contributor Author

We have examples of the harm from the previous behavior, but we haven't made clear its bright side. Without enumerating the cases where the previous behavior is beneficial, I wouldn't agree with the statement that this is not a bug but a design call.

If we make a call without addressing this, I'd rather say the decision isn't based on a reasonable discussion. As I said, the functionality is already broken in other parts as well, so I don't mind this much, but I'd claim the functionality should be disabled.

Before doing that, I'll raise a PR to "provide" the exception information when swallowing it, which is what @RussellSpitzer mentioned as "at the very minimum" and which I fully agree with (in the sense of "at the very minimum").
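
For reference, a sketch of what such a "provide the exception information" change could look like, reusing the illustrative SessionCatalogLoading trait from the earlier sketch in this thread (again, not the actual follow-up PR): keep the fallback, but hand the exception itself to the logger so the stack trace is no longer swallowed.

```scala
// Sketch of the "at the very minimum" option: still fall back, but keep the
// full exception (stack trace included) in the logs. Names are illustrative.
import scala.util.control.NonFatal
import org.slf4j.LoggerFactory
import org.apache.spark.sql.connector.catalog.CatalogPlugin

trait SessionCatalogLoadingWithLoggedFallback extends SessionCatalogLoading {
  private val log = LoggerFactory.getLogger(getClass)

  def v2SessionCatalogWithLoggedFallback(): CatalogPlugin =
    try {
      loadConfiguredSessionCatalog()
    } catch {
      case NonFatal(e) =>
        // Passing the Throwable itself preserves the stack trace in the log output.
        log.error("Cannot instantiate the configured v2 session catalog; " +
          "falling back to the built-in session catalog.", e)
        defaultSessionCatalog
    }
}
```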

@dongjoon-hyun
Member

Ya. Sure, you can continue to pursue the reasonable discussion here and in another PR until we reach agreement. For now, this PR simply doesn't look like it's there yet.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 11, 2021
@github-actions github-actions bot closed this Feb 12, 2021