[SPARK-23518][SQL] Avoid metastore access when the users only want to read and write data frames #20681
Conversation
@felixcheung Can you take a look at the changes in the R tests? I had to change it because the catalog imp conf is changed to "hive" when switching to hive context. Then when it is switched back, a "hive" external catalog will be materialized (lazily) and causes a test timeout. |
LGTM except R-related codes. cc @felixcheung
assign(".sparkRsession", previousSession, envir = .sparkREnv)
remove(".prevSparkRsession", envir = .sparkREnv)
remove(".prevSparkConf", envir = .sparkREnv)
this should be `.prevSessionConf`?
hiveSession
}

unsetHiveContext <- function() {
  previousSession <- get(".prevSparkRsession", envir = .sparkREnv)
  previousConf <- get(".prevSessionConf", envir = .sparkREnv)
  callJStatic("org.apache.spark.sql.api.r.SQLUtils", "setSparkContextSessionConf", previousSessioni, previousConf)
this should be `previousSession`?
thx, let's wait for tests
Test build #87681 has finished for PR 20681 at commit
Test build #87684 has finished for PR 20681 at commit
Test build #87686 has finished for PR 20681 at commit
Test build #87706 has finished for PR 20681 at commit
lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")
Is it okay to replace these unicode characters?
no, we should not change these
we need to make a few changes
Test build #87731 has finished for PR 20681 at commit
Test build #87737 has finished for PR 20681 at commit
Test build #87747 has finished for PR 20681 at commit
LG this should do it...
Test build #87757 has finished for PR 20681 at commit
retest this please
+1, LGTM.
Test build #87797 has finished for PR 20681 at commit
retest this please
Retest this please.
Test build #87802 has finished for PR 20681 at commit
looks like the test failures are related?
Test build #87821 has finished for PR 20681 at commit
My original plan to fix the test does not work, because of this test: https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343 The new plan is to run some simple catalog commands immediately after the Spark session is created, so the catalog is materialized eagerly (matching the old behavior).
Can we fix this test? https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343
We can remove the test, but that is not good practice. We don't know exactly why the test was added, or which hidden assumption it was meant to guarantee, right?
Overall, I think this suite needs a refactoring: split it into an in-memory catalog suite and a Hive catalog suite. The catalog conf should not be manipulated after the Spark context is created; the other way is just a hack.
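A minimal sketch of that split, assuming SparkR test conventions (the file names in the comments are hypothetical, not part of this PR): each suite pins the catalog implementation at session creation time via `enableHiveSupport`, so `spark.sql.catalogImplementation` never needs to be swapped on a live context.

```r
library(SparkR)

# Hypothetical file: test_sparkSQL_inmemory.R
# Catalog implementation is fixed to "in-memory" for the whole suite.
sparkR.session(master = "local[1]", enableHiveSupport = FALSE)
# ... in-memory catalog tests ...
sparkR.session.stop()

# Hypothetical file: test_sparkSQL_hive.R
# Catalog implementation is fixed to "hive" for the whole suite.
sparkR.session(master = "local[1]", enableHiveSupport = TRUE)
# ... Hive catalog tests ...
sparkR.session.stop()
```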
cc @felixcheung Does this PR look good to you now?
Test build #87858 has finished for PR 20681 at commit
retest this please
I think this test here https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343 just checks that `spark.sql.catalogImplementation` is set; it could easily be done a different way if that is an issue.
@@ -67,6 +67,8 @@ sparkSession <- if (windows_with_hadoop()) {
   sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
 }
 sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getJavaSparkContext", sparkSession)
+# materialize the catalog implementation
+listTables()
we are calling `sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)` in almost every other test file - does the same apply in those other places?
`test_sparkSQL.R` is the only one that uses `newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE)` on the `ssc`, so it is the only one where the catalog impl Spark conf is changed. So `test_sparkSQL.R` is the only one broken.
ok
Test build #87867 has finished for PR 20681 at commit
LGTM Thanks! Merged to master.
What changes were proposed in this pull request?
#18944 added a patch that allowed a Spark session to be created when the Hive metastore server is down. However, it did not allow running any commands with that session. This causes trouble for users who only want to read and write data frames without a metastore setup.
How was this patch tested?
Added some unit tests to read and write data frames based on the original HiveMetastoreLazyInitializationSuite.
Please review http://spark.apache.org/contributing.html before opening a pull request.
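As a hedged illustration of the behavior this PR targets (the data and path below are made up, and this sketch assumes a local SparkR install): plain data-frame reads and writes should succeed without the external catalog being materialized, since the catalog is now created lazily.

```r
library(SparkR)
sparkR.session(master = "local[1]", enableHiveSupport = FALSE)

# No metastore access is needed for a plain write/read round trip.
df <- createDataFrame(data.frame(name = c("a", "b"), age = c(30L, 19L)))
path <- file.path(tempdir(), "people.parquet")
write.df(df, path = path, source = "parquet", mode = "overwrite")

back <- read.df(path, source = "parquet")
count(back)  # 2 rows round-tripped

sparkR.session.stop()
```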