[SPARK-23518][SQL] Avoid metastore access when the users only want to read and write data frames #20681
Conversation
@felixcheung Can you take a look at the changes in the R tests? I had to change it because the catalog imp conf is changed to "hive" when switching to hive context. Then when it is switched back, a "hive" external catalog will be materialized (lazily) and causes a test timeout. |
LGTM except R-related codes. cc @felixcheung
assign(".sparkRsession", previousSession, envir = .sparkREnv)
remove(".prevSparkRsession", envir = .sparkREnv)
remove(".prevSparkConf", envir = .sparkREnv)
this should be `.prevSessionConf`?
hiveSession
}

unsetHiveContext <- function() {
  previousSession <- get(".prevSparkRsession", envir = .sparkREnv)
  previousConf <- get(".prevSessionConf", envir = .sparkREnv)
  callJStatic("org.apache.spark.sql.api.r.SQLUtils", "setSparkContextSessionConf", previousSessioni, previousConf)
this should be `previousSession`?
thx, let's wait for tests
Test build #87681 has finished for PR 20681 at commit
Test build #87684 has finished for PR 20681 at commit
Test build #87686 has finished for PR 20681 at commit
Test build #87706 has finished for PR 20681 at commit
lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")
Is it okay to replace these unicode characters?
no, we should not change these
we need to make a few changes
Test build #87731 has finished for PR 20681 at commit
Test build #87737 has finished for PR 20681 at commit
Test build #87747 has finished for PR 20681 at commit
LG this should do it...
Test build #87757 has finished for PR 20681 at commit
retest this please
+1, LGTM.
Test build #87797 has finished for PR 20681 at commit
retest this please
Retest this please.
Test build #87802 has finished for PR 20681 at commit
looks like the test failures are related?
Test build #87821 has finished for PR 20681 at commit
My original plan to fix the test does not work, because of this test: https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343 The new plan is to run some simple catalog commands immediately after the Spark session is created, so the catalog is materialized eagerly (matching the old behavior).
Can we fix this test? https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343
We can remove the test, but that is not good practice. We don't know exactly why the test was added, or which hidden assumption it was meant to guarantee, right?
Overall, I think this suite needs a refactoring: split it into an in-memory catalog suite and a Hive catalog suite. The catalog conf should not be manipulated after the Spark context is created; the other way is just a hack.
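A minimal sketch of that split, assuming SparkR test conventions (the file names in the comments are hypothetical, not part of this PR): each suite pins the catalog implementation at session creation time via `enableHiveSupport`, so `spark.sql.catalogImplementation` never needs to be swapped on a live context.

```r
library(SparkR)

# Hypothetical file: test_sparkSQL_inmemory.R
# Catalog implementation is fixed to "in-memory" for the whole suite.
sparkR.session(master = "local[1]", enableHiveSupport = FALSE)
# ... in-memory catalog tests ...
sparkR.session.stop()

# Hypothetical file: test_sparkSQL_hive.R
# Catalog implementation is fixed to "hive" for the whole suite.
sparkR.session(master = "local[1]", enableHiveSupport = TRUE)
# ... Hive catalog tests ...
sparkR.session.stop()
```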
cc @felixcheung Does this PR look good to you now?
Test build #87858 has finished for PR 20681 at commit
retest this please
I think this test here https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343 just checks that `spark.sql.catalogImplementation` is set; it could easily be done a different way if that is an issue.
@@ -67,6 +67,8 @@ sparkSession <- if (windows_with_hadoop()) {
   sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
 }
 sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getJavaSparkContext", sparkSession)
+# materialize the catalog implementation
+listTables()
we are calling `sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)` in almost every other test file - does the same apply in those other places?
`test_sparkSQL.R` is the only one that uses `newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE)` on the `ssc`, so it is the only one where the catalog impl Spark conf is changed. So `test_sparkSQL.R` is the only one broken.
ok
Test build #87867 has finished for PR 20681 at commit
LGTM Thanks! Merged to master.
What changes were proposed in this pull request?
#18944 added a patch that allowed a Spark session to be created when the Hive metastore server is down. However, it did not allow running any commands with that session. This causes trouble for users who only want to read and write data frames without a metastore setup.
How was this patch tested?
Added some unit tests to read and write data frames based on the original HiveMetastoreLazyInitializationSuite.
Please review http://spark.apache.org/contributing.html before opening a pull request.
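As a hedged illustration of the behavior this PR targets (the data and path below are made up, and this sketch assumes a local SparkR install): plain data-frame reads and writes should succeed without the external catalog being materialized, since the catalog is now created lazily.

```r
library(SparkR)
sparkR.session(master = "local[1]", enableHiveSupport = FALSE)

# No metastore access is needed for a plain write/read round trip.
df <- createDataFrame(data.frame(name = c("a", "b"), age = c(30L, 19L)))
path <- file.path(tempdir(), "people.parquet")
write.df(df, path = path, source = "parquet", mode = "overwrite")

back <- read.df(path, source = "parquet")
count(back)  # 2 rows round-tripped

sparkR.session.stop()
```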