
[SPARK-17088][hive] Fix 'sharesHadoopClasses' option when creating client. #20169

Closed
vanzin wants to merge 1 commit into apache:master from vanzin:SPARK-17088

Conversation

@vanzin (Contributor) commented Jan 5, 2018

Because the call to the constructor of HiveClientImpl crosses class loader
boundaries, different versions of the same class (Configuration in this
case) were loaded, and that caused a runtime error when instantiating the
client. By using a safer type in the signature of the constructor, it's
possible to avoid the problem.

I considered removing 'sharesHadoopClasses', but it may still be desired (even though it currently has no users, since it was not working). When Spark starts to support Hadoop 3, it may be necessary to use that option to load clients for older Hive metastore versions that don't know about Hadoop 3.

Tested with an added unit test.
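To make the fix easier to follow, here is a minimal sketch of the idea, assuming the general approach described above rather than the exact HiveClientImpl code (ConfRoundTrip and rebuildConf are illustrative names, not Spark identifiers): Hadoop's Configuration already implements java.lang.Iterable<java.util.Map.Entry<String, String>>, and that JDK interface is loaded by the shared bootstrap class loader on both sides of the isolation boundary, so the isolated side can accept the JDK type and rebuild its own Configuration from the entries.

```scala
import java.lang.{Iterable => JIterable}
import java.util.{Map => JMap}

import org.apache.hadoop.conf.Configuration

// A minimal sketch of the idea behind the fix (hypothetical helper, not the
// actual HiveClientImpl code): the caller passes the Hadoop Configuration
// typed as a plain JDK interface, which both class loaders agree on, and the
// receiving side rebuilds a Configuration using its own copy of the Hadoop
// classes.
object ConfRoundTrip {
  def rebuildConf(entries: JIterable[JMap.Entry[String, String]]): Configuration = {
    val conf = new Configuration(/* loadDefaults = */ false)
    val it = entries.iterator()
    while (it.hasNext) {
      val e = it.next()
      conf.set(e.getKey, e.getValue)
    }
    conf
  }

  def main(args: Array[String]): Unit = {
    val original = new Configuration(false)
    original.set("hive.metastore.warehouse.dir", "/tmp/warehouse")
    // Configuration implements Iterable[Map.Entry[String, String]], so it can
    // be passed across the boundary as the JDK type without any conversion.
    val copy = rebuildConf(original)
    println(copy.get("hive.metastore.warehouse.dir"))
  }
}
```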

@SparkQA commented Jan 6, 2018

Test build #85733 has finished for PR 20169 at commit 668fcba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jan 23, 2018

retest this please

@SparkQA commented Jan 23, 2018

Test build #86539 has finished for PR 20169 at commit 668fcba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor, Author) commented Jan 23, 2018

Merging to master.

@asfgit closed this in dc4761f Jan 23, 2018
@@ -82,8 +83,9 @@ import org.apache.spark.util.{CircularBuffer, Utils}
*/
private[hive] class HiveClientImpl(
override val version: HiveVersion,
warehouseDir: Option[String],
Member: Could you add @param?

Member: What are the semantics when warehouseDir is None?

Contributor (Author): Same as the code that existed before.

sparkConf: SparkConf,
-  hadoopConf: Configuration,
+  hadoopConf: JIterable[JMap.Entry[String, String]],
Member: Why is this change needed?

Contributor (Author): That's the bug fix... explained in the PR description.

Member: Any test case?

Contributor (Author): The test case was added in this PR!
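To illustrate why passing a Configuration directly across the isolated class loader fails, here is a small, self-contained sketch of the underlying pitfall; this is a generic JDK class-loading example, not Spark's IsolatedClientLoader code, and someJar/someClassName are placeholders for whatever jar and class you test with.

```scala
import java.net.{URL, URLClassLoader}

// Illustration of the class loader pitfall behind this PR (generic example,
// not Spark code): loading the same class through two non-delegating
// URLClassLoaders yields two distinct runtime Class objects, so instances of
// one are not assignable to parameters typed with the other.
object ClassLoaderPitfall {
  def demonstrate(someJar: URL, someClassName: String): Unit = {
    val loaderA = new URLClassLoader(Array(someJar), null) // null parent: no delegation
    val loaderB = new URLClassLoader(Array(someJar), null)
    val classA = loaderA.loadClass(someClassName)
    val classB = loaderB.loadClass(someClassName)
    println(classA == classB)                // false: same name, different types
    println(classA.isAssignableFrom(classB)) // false: a reflective call would fail
    // JDK interfaces such as java.lang.Iterable come from the bootstrap loader
    // in both cases, so both sides agree on them.
    println(loaderA.loadClass("java.lang.Iterable") ==
            loaderB.loadClass("java.lang.Iterable")) // true
  }
}
```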

@gatorsmile (Member)

Will review this more carefully later.

@vanzin Could you please ping me or @cloud-fan when changing the code in SQL? I believe we are more familiar with the code base in this area compared with the other active committers. Thanks!

@gatorsmile (Member)

@vanzin Could we revert this PR since this is not well reviewed?

@vanzin (Contributor, Author) commented Jan 24, 2018

Could you please ping me or @cloud-fan when changing the codes in SQL?

I've had really spotty results pinging people. How about you guys make an effort to monitor changes in SQL if you really want to personally approve all of them?

Could we revert this PR since this is not well reviewed?

No, it's well reviewed and tested. You not really understanding the fix is a different issue. I'd appreciate it if you made a bigger effort than you have so far in understanding the issue and the fix.

Thanks for taking a late look though.

@cloud-fan (Contributor)

A late LGTM. @vanzin I'm sorry that we sometimes missed your pings, but I think it's still good to ping more SQL people for SQL changes, for notification purposes. A post-hoc review is better than no review. Thanks for your understanding!

@HyukjinKwon (Member)

It might optionally be good to cc people related to the proposal itself for notification purposes, but I don't think it's quite useful to require cc'ing a few specific committers, since we suffer from a reviewing bandwidth problem. I was thinking that we should try to distribute this overhead.

I would do the ping only for big, important, or invasive changes, to let them focus on important stuff.

I think it's basically a committer's responsibility to check PRs, rather than asking PR authors to ping a few specific committers and wait for their reviews.

To be clear, this PR was even left open for more than 2 weeks after a review and approval!

@cloud-fan (Contributor)

but I don't think it's quite useful to require cc'ing a few specific committers

I agree with that; we should not be bound to a few specific committers. But I think it's good and polite to notify the major author of the file being touched, even if they are not a committer. For example, the major author of the touched HiveClientImpl.scala is @gatorsmile.

I think it's basically a committer's responsibility to check PRs, rather than asking PR authors to ping a few specific committers and wait for their reviews.

I also agree with that, and I'm happy to see this PR got reviewed and merged. My only concern is that we should notify the people who wrote the code being changed, and we don't need to wait for their response.

@gatorsmile (Member)

Let me explain my point here. I am not trying to offend anybody. Spark is becoming more and more complex. Peer review and test cases are the major methods we use for quality control. Spark is infrastructure software, not application software. We really need to be careful when reviewing PRs. Nobody's code is perfect. For example, if I made a change in SHS, I would ping @vanzin for sure, because @vanzin is the original author and the most active committer for SHS.

Personally, I am doing my best to review the changes in SQL. It would be nice if you could ping me or @cloud-fan. Since the Spark 2.3 release is the recent focus, the whole community is currently concentrating on testing and reviewing changes for Spark 2.3.

Regarding this PR, I think we do not need to introduce the new parameter warehouseDir, which is already contained in hadoopConf. Keeping the interface clean is always good, I believe. Please review my follow-up PR: #20377
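As a hedged illustration of that point (not the code in the follow-up PR, whose exact approach may differ): the warehouse directory is a Hive/Hadoop configuration entry, so it can be read back from the passed configuration rather than plumbed through as a separate constructor parameter. The key shown is the standard Hive one.

```scala
import org.apache.hadoop.conf.Configuration

// Sketch of the suggestion above: the warehouse directory already lives in
// the Hadoop/Hive configuration, so it can be recovered from the conf
// instead of being passed as an extra parameter.
val hadoopConf = new Configuration(false)
hadoopConf.set("hive.metastore.warehouse.dir", "/user/hive/warehouse")
val warehouseDir: Option[String] =
  Option(hadoopConf.get("hive.metastore.warehouse.dir"))
```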

@HyukjinKwon (Member)

To be clear, I think I personally usually cc the related people. Yup, if I were fixing this code, I would have cc'ed you, @gatorsmile, because I agree that it's good to do. Sure, more eyes are better. Sure, I think I am personally trying to support you and the other committers so that reviews go better.

I left a comment only because it sounded like cc'ing a few specific committers was being made a requirement, which is not quite what I had in mind.

@vanzin (Contributor, Author) commented Jan 24, 2018

My only concern is that we should notify the people who wrote the code

Like, ahem, me? (Check this code's history. I've worked on quite a good chunk of it over time.)

I agree with what @HyukjinKwon said here. A policy of "some people are required to be pinged on every PR in a certain module" is exactly what was voted down by the PMC. We already have a huge bandwidth problem with reviews; we should be encouraging committers (and other contributors) to be more active, not discouraging them with that sort of policy.

@vanzin deleted the SPARK-17088 branch January 24, 2018 18:22
@gatorsmile (Member) commented Jan 24, 2018

Thank you for your previous contributions! In the recent few releases, there has been major refactoring in the related code. The code does not belong to anybody. Our goal is to build the best-quality Spark and benefit more people.

To improve code quality, we always encourage more reviews from committers and contributors, especially the ones who are active in the related code base. That is why I suggested above pinging me and @cloud-fan when the community makes changes in Spark SQL.

Thanks again for your contributions, as always!

@vanzin (Contributor, Author) commented Jan 24, 2018

That is why I suggested above pinging me and @cloud-fan when the community makes changes in Spark SQL.

The point I'm making (and I believe Hyukjin is too) is that it should be the opposite. If you want to be involved in all these reviews, it's up to you to monitor what's going on. A new contributor, for example, cannot know who to ping when they want to make a change.

And I hope you see how "x and y must be pinged in any SQL review" goes against the idea that the project is owned by all committers, not just a few of them.

@srowen (Member) commented Jan 24, 2018

Yes, I'm not following, @gatorsmile. You're saying there are no owners, but you wanted to revert a change because you felt it was code that should only change with your review. You're correct that there are no formal owners and that this all relies on proactive review and proactive pinging. We have to balance expediency with the cost of occasional post hoc review.

Above you suggested you hadn't really looked at this change, and I don't see any specific objection you have. Reverting would have been well out of line.

This seemed fine process-wise until that point.

@gatorsmile (Member)

Yeah. That is a valid argument for new contributors, but it is not true for active contributors and committers like you and me.

Let me give an example. I am not familiar with the code base of SHS, but at least I know you did most of the work recently. I would ping you before merging code changes, although I have the right to merge them without notifying you. Do you think this suggestion would improve the code quality of SHS?

WDYT? @srowen

@vanzin (Contributor, Author) commented Jan 24, 2018

I am not familiar with the code base of SHS

Please leave the strawmen out of the discussion. I am familiar with this part of the code base; you just believe I'm not.

@srowen (Member) commented Jan 24, 2018

I think @vanzin is as qualified to modify this code as anyone. I defer to your and his judgment about who really needs to weigh in, if anyone, before merging. We don't, and can't, efficiently operate otherwise. If this had resulted in a breakage or error, I'd expect the author to learn that, actually, modifying this code is more tricky than expected and really does need a second set of eyes next time.

But it didn't, so I'm not sure what problem you're solving. I do not see anything that should have happened differently, even in hindsight. This is getting into committer-splaining well-understood ideas to your peers, so I'll leave it here.

@gatorsmile (Member)

@vanzin You might not understand my point. We need peer review, even if we are familiar with the code base. Let me give a better example to clarify my point.

Even if I am familiar with SQL, I still ping @cloud-fan, right? Even if @cloud-fan might not have the bandwidth to review the PR, at least he is pinged. Do you think we should encourage this in our community? Also cc @srowen

@vanzin (Contributor, Author) commented Jan 24, 2018

And where exactly was a peer review missing here?

I didn't check this in without review. You're basically saying that only you and Wenchen are acceptable "peers". I just disagree strongly with that statement.

@srowen (Member) commented Jan 24, 2018

@gatorsmile, I see you frequently merge changes with no review, so I'm surprised to see you say this. I trust your judgment when you do. You've made the point repeatedly that you think you should be pinged on changes to this file. OK, people will try to remember that. For your part, you can be more proactive about review, as there are evidently files you really want to keep an eye on.

@gatorsmile (Member)

@vanzin I never said that only Wenchen and I are acceptable peers. I have to correct that. I think you have also noticed that Wenchen and I are the top contributors and reviewers in Spark SQL. If any change is made in Spark SQL, I think it would be nice to ping us, right? Also cc @srowen

@srowen If you check the history of my commits, I never merge my own PRs without pinging the committers who are familiar with the code base. Right?

@vanzin (Contributor, Author) commented Jan 24, 2018

I think you have also noticed that Wenchen and I are the top contributors and reviewers in Spark SQL

No disagreement there.

If any change is made in Spark SQL, I think it would be nice to ping us, right?

Nice but not required. That's the whole point. If a committer is comfortable with the change he made, and the reviews he got, he does not need approval from any specific set of other committers.

Again, if you feel you need to be involved in all reviews, it's up to you to do that.

@gatorsmile (Member)

@vanzin Yes. That is nice to have, but not required in the Apache community.

I believe both of us want to build the best-quality Spark. That is why I asked whether we should encourage this in the community, especially for the active committers who are very familiar with the Spark community.

For example, even if I were confident about my own change in SHS, I would still ping you at the beginning, just to request your review and double-check the code changes in SHS. Maybe you do not have the bandwidth now, but you could still take a look at it when you are free. I do not expect you to check the whole change history and review all the changes in SHS, right?

@vanzin (Contributor, Author) commented Jan 24, 2018

The difference is that I won't care if you get a review from someone else and check that code in, even if you don't ping me. Because you are a committer, I trust your judgement that you made the change carefully and got it properly reviewed. If I happen to find a problem in it later, I won't go back and blame you for not pinging me; I'd fix the issue in a new PR and move on.

You seem to be focused on the fact that you were not consulted about a change. You should be focusing on the fact that you are one person among many peers that we've all trusted to have good judgement in moving the project forward, and that you personally do not need to be involved in everything, and that it's really a little out of line for you to imply that.

I don't really have anything else to offer here.

@gatorsmile (Member)

To be honest, I really care about quality, especially for such important infrastructure software. The above argument is not related to trust at all. Sorry for any misunderstanding.

For example, if I got involved in SHS-related PRs, I would ping you to ensure the code changes in SHS are properly reviewed. I also believe this is one of the best practices we can follow to improve quality. I trust all the committers, but I just want to ensure the best-suited committers get notified and review it carefully, if possible.

@srowen (Member) commented Jan 24, 2018

(I take back my comment, @gatorsmile. I feel sure I've seen you quickly commit a change -- which I think is entirely fine in cases where you think it's appropriate -- but I couldn't find an example clicking through your last 10 PRs or so. You turn them around quickly, but that's not the same.)

tkakantousis pushed a commit to tkakantousis/spark that referenced this pull request Mar 11, 2018

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#20169 from vanzin/SPARK-17088.