
[SPARK-33240][SQL][3.0] Fail fast when fails to instantiate configured v2 session catalog #30167

Closed

Conversation

HeartSaVioR
Contributor

What changes were proposed in this pull request?

This patch proposes to change the behavior so that Spark fails fast when it fails to instantiate the configured v2 session catalog.

Why are the changes needed?

The current Spark behavior goes against the intention of end users: if they configure a session catalog that Spark fails to initialize, Spark swallows the error, logging only the error message, and silently uses the default catalog implementation.

This follows the opinions voiced on the discussion thread on the dev mailing list: https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E

Does this PR introduce any user-facing change?

Yes. After this PR, Spark will fail immediately if it fails to instantiate the configured session catalog.
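
For illustration, here is a minimal sketch of the kind of setup this change affects. The AuditingSessionCatalog class and the table statement are hypothetical examples, not part of this patch; only the spark.sql.catalog.spark_catalog setting and DelegatingCatalogExtension come from Spark itself.

```scala
// Hypothetical custom session catalog, for illustration only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.{DelegatingCatalogExtension, Identifier, Table}

class AuditingSessionCatalog extends DelegatingCatalogExtension {
  override def loadTable(ident: Identifier): Table = {
    // Record the access, then delegate to the built-in session catalog.
    println(s"audit: loading table $ident")
    super.loadTable(ident)
  }
}

object SessionCatalogExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      // Before this patch: if this class cannot be instantiated (e.g. missing jar,
      // constructor throws), Spark only logs the error and silently uses the
      // built-in session catalog. After this patch: the query fails immediately.
      .config("spark.sql.catalog.spark_catalog", classOf[AuditingSessionCatalog].getName)
      .getOrCreate()

    spark.sql("CREATE TABLE audit_demo (id INT) USING parquet")
    spark.stop()
  }
}
```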

How was this patch tested?

New UT added.

Closes apache#30147 from HeartSaVioR/SPARK-33240.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon
Member

HyukjinKwon commented Oct 28, 2020

Hm, is this something we should port back? I think it's hard to call it a bug. Maintenance releases shouldn't have such behaviour changes in general, according to semver. I get that DSv2 is still experimental, but this doesn't look like something necessary to port back.

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34955/

@SparkQA

SparkQA commented Oct 28, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34955/

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 28, 2020

Wasn't it declared as "incorrect" behavior in the discussion thread on the dev mailing list? If it's not a bug, what exactly are we fixing for the master branch? I think it's a fairly serious bug, since Spark "ignores" the end users' intention and silently falls back. Suppose the custom session catalog performs auditing; then Spark silently skips the auditing and continues with the operation.

@HyukjinKwon
Member

We changed the behaviour after discussion, and I think it's just a new design call we made. The previous behaviour, falling back, also makes sense by itself, although it was a bad design decision.

@HeartSaVioR
Contributor Author

I don't agree on the point of not declaring it "a bug" (as I don't think the previous behavior makes sense), but if we are hesitant about making a behavioral change in a bugfix release, that's a fair point. I'll keep this open to hear third-party voices and close it if I fail to get positive ones. Thanks for the opinion.

@HyukjinKwon
Member

All bug fixes bring behavioural changes. I think the point is whether the previous behaviour was intended when it was proposed and merged.

The previous behaviour looks intended, with a proper error message (which I agree wasn't a good call). Sure, waiting for other people's opinions is fine.

@SparkQA

SparkQA commented Oct 28, 2020

Test build #130352 has finished for PR 30167 at commit f6550d0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@RussellSpitzer
Member

I would consider the old behavior a "swallowed exception." There is almost no chance that a user intended the original catalog to be used; they have no way of catching the error and retrying, and they cannot get any meaningful information from the failure when it occurs.

Like I mentioned on the mailing list, I think failing fast is the right thing to do; we could even view this as a correctness issue. To give a silly example, imagine the replacement catalog redefined "INSERT" to not just write data to a file but also simultaneously send updates to a service that records data in a separate system. If the user has done a catalog replacement but this behavior does not actually occur on insert, then their system will be inconsistent. If they were previously running this job without error, they actually did have the error, but it was swallowed, so they just didn't know about it.

At the very minimum we should dump the whole trace so that end users can at least see why something broke, but I do think it's dangerous to go on with business as usual after we fail to replace the catalog.
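
To make the shape of the debated change concrete, here is a schematic sketch of the swallow-and-fall-back pattern versus the fail-fast version proposed here; this is not the actual CatalogManager code, and all names below are illustrative.

```scala
// Schematic sketch only: the trait and method names are illustrative, not the
// real CatalogManager internals; it just shows the shape of the two behaviours.
import scala.util.control.NonFatal
import org.slf4j.LoggerFactory
import org.apache.spark.sql.connector.catalog.CatalogPlugin

trait SessionCatalogLoading {
  private val logger = LoggerFactory.getLogger(getClass)

  // Instantiates the catalog class named in the configuration; may throw.
  def loadConfiguredSessionCatalog(): CatalogPlugin
  // The built-in session catalog used as the fallback.
  def defaultSessionCatalog: CatalogPlugin

  // Previous behaviour: swallow the error and silently fall back.
  def v2SessionCatalogWithFallback(): CatalogPlugin =
    try {
      loadConfiguredSessionCatalog()
    } catch {
      case NonFatal(e) =>
        // Only the message is logged; the stack trace and the failure itself are hidden.
        logger.error(s"Cannot instantiate the configured v2 session catalog: ${e.getMessage}")
        defaultSessionCatalog
    }

  // Behaviour proposed in this PR: fail fast by letting the error propagate.
  def v2SessionCatalogFailFast(): CatalogPlugin = loadConfiguredSessionCatalog()
}
```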

@HyukjinKwon
Member

@RussellSpitzer, I completely agree with this fix itself. But about porting it back:

The flip side applies too. For example, for whatever reason, the catalog was set in a cluster and users have been using it with no problem. The users don't want to change their existing workload and code but want to switch the catalog they are using, so they just reset the configuration in their cluster. After the fix, this case wouldn't work anymore. Of course, I like the fail-fast approach since it can prevent such misuse, but from the users' perspective I'd imagine they would see it as a behaviour change, which should ideally not happen in a maintenance release.

To be clearer:

  • If users set their catalog properly, things work correctly with or without this fix.
  • This fix throws an exception for a case (which we now define as misuse) that might have worked before.

I think it's arguable whether this counts as a bug fix that should be ported back.

@RussellSpitzer
Member

I think the use case you are describing is rather unlikely compared to users who set this property and expect it to either work or fail. It's hard for me to imagine someone relying on the fact that any exception during catalog initialization is OK; the domain of possible failures is just too large. It's more like their application works in spite of their intent, rather than in coordination with it.

From my perspective, the vast majority of users who set this property expect to use it, and if it doesn't work they will stop and change the configuration; but we make that harder to determine by ignoring the exception, and harder still by swallowing the stack trace. The flip side is much harder for me to imagine: a user who is OK with a fallback that can occur for any number of reasons they cannot actually predict.

So again, at minimum we should be logging the exception and trace, but I don't think we would actually disrupt any real use cases by failing fast either.

@HyukjinKwon
Member

I am not sure. It seems unusual to me to port a fix back to a maintenance release when there's a possibility it breaks something that can make sense. I would be fine with it if other committers prefer it.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 28, 2020

Whether the previous behavior makes sense or not looks to be the major argument; I think silently ignoring the user's intention never makes sense. Setting the config reflects their intention, and end users will get annoyed if that intention is silently disregarded in any case. If they run the query in spark-shell they'll see the error log message, so they might wonder why and try to figure it out immediately; but if they run the query in an application they'll only notice later (earlier with a good monitoring tool, otherwise probably not until they encounter some other issue), and by then the bad things have already occurred.

We have examples of the harm from the previous behavior, but we haven't made clear its bright side. Without enumerating the cases where the previous behavior is beneficial, I wouldn't agree with the statement that this is not a bug but a design call.

It's probably worth reviving the original discussion thread to discuss this.

@HyukjinKwon
Member

HyukjinKwon commented Oct 29, 2020

@HeartSaVioR, I fully agree that the newer behaviour makes much more sense. I also get that this fix brings many benefits. We have many fixes in the master branch that bring a lot of benefit.

As you pointed out, my point is that the fallback behaviour might make sense in a few contexts, e.g. #30167 (comment), and I wouldn't expect such behaviour changes in a maintenance release. That makes me less sure about porting it back.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 29, 2020

For example, for whatever reason, the catalog was set in a cluster and users have been using it with no problem. The users don't want to change their existing workload and code but want to switch the catalog they are using, so they just reset the configuration in their cluster. After the fix, this case wouldn't work

"The catalog was set and users used to use it with no problem" - we are talking about the case catalog was set but it wasn't effectively working as the default catalog (failback) isn't same as they provided, right? My declaration of this is "silently broken" which is "worse" than "not functioning". That is not a kind of "no problem". If you don't agree with the statement, let's hear the opinions from broader group.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 29, 2020

Additionally, if this pattern is common in the Spark codebase, I think we should revisit it: if users configure something (A) and Spark decides to fall back to something else (B), that should only happen when there's no functional difference between A and B (e.g. the whole-stage codegen fallback might be OK, since it should ideally only differ in performance). Otherwise Spark is silently breaking the user's intention.

@HyukjinKwon
Member

HyukjinKwon commented Oct 29, 2020

In a way, a user could think it's all internal as long as the tables are created somewhere and can be read regardless of where they are created. And such a user could think "there's no functional difference".

I agree the above isn't right, for all the reasons we discussed on the mailing list and here, but I would prefer to avoid such a potential surprise in a maintenance release.

@HeartSaVioR
Contributor Author

The current behavior was already a surprise; otherwise I wouldn't insist so strongly on trying to fix it.

Btw, I'd probably just revive the other thread, "disallow setting a custom v2 session catalog until all statements go through the v2 session catalog", instead of this one, as we are finding more and more issues there.

@dongjoon-hyun
Member

In the context of branch-3.0, I agree with @HyukjinKwon's stance.

@HeartSaVioR
Contributor Author

We have examples of the harm from the previous behavior, but we haven't made clear its bright side. Without enumerating the cases where the previous behavior is beneficial, I wouldn't agree with the statement that this is not a bug but a design call.

If we make a call without addressing this, I'd rather say the decision isn't based on a reasonable discussion. As I said, the functionality is already broken in other parts as well, so I don't mind this much, but I'd claim the functionality should be disabled.

Before doing that, I'll raise a PR to "provide" the exception information when swallowing it, which is what @RussellSpitzer mentioned as "at the very minimum" and which I fully agree with (in the sense of "at the very minimum").
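
For reference, a sketch of what such a "provide the exception information" change could look like, reusing the illustrative SessionCatalogLoading trait from the earlier sketch in this thread (again, not the actual follow-up PR): keep the fallback, but hand the exception itself to the logger so the stack trace is no longer swallowed.

```scala
// Sketch of the "at the very minimum" option: still fall back, but keep the
// full exception (stack trace included) in the logs. Names are illustrative.
import scala.util.control.NonFatal
import org.slf4j.LoggerFactory
import org.apache.spark.sql.connector.catalog.CatalogPlugin

trait SessionCatalogLoadingWithLoggedFallback extends SessionCatalogLoading {
  private val log = LoggerFactory.getLogger(getClass)

  def v2SessionCatalogWithLoggedFallback(): CatalogPlugin =
    try {
      loadConfiguredSessionCatalog()
    } catch {
      case NonFatal(e) =>
        // Passing the Throwable itself preserves the stack trace in the log output.
        log.error("Cannot instantiate the configured v2 session catalog; " +
          "falling back to the built-in session catalog.", e)
        defaultSessionCatalog
    }
}
```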

@dongjoon-hyun
Member

Ya. Sure, you can continue to pursue the reasonable discussion here and in another PR until we reach agreement. For now, this PR simply doesn't look like it's there yet.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 11, 2021
@github-actions github-actions bot closed this Feb 12, 2021