
[SPARK-23518][SQL] Avoid metastore access when the users only want to read and write data frames #20681

Closed
liufengdb wants to merge 4 commits into apache:master from liufengdb:completely-lazy

Conversation


@liufengdb liufengdb commented Feb 26, 2018

What changes were proposed in this pull request?

#18944 added a patch that allowed a Spark session to be created when the Hive metastore server is down. However, it did not allow running any commands with that session. This causes trouble for users who only want to read and write data frames without a metastore setup.
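For illustration (not code from this PR; the path and data below are made up), a minimal SparkR sketch of the workload this unblocks — a Hive-enabled session that only reads and writes file-based data frames never has to reach the metastore:

```r
library(SparkR)

# With this change, creating a Hive-enabled session no longer touches the
# metastore (the external catalog is materialized lazily), so this succeeds
# even while the Hive metastore server is down.
sparkR.session(enableHiveSupport = TRUE)

df <- as.DataFrame(data.frame(name = c("a", "b"), age = c(30L, 19L)))

# Plain file reads/writes issue no catalog commands, hence no metastore access.
write.df(df, path = "/tmp/people_parquet", source = "parquet", mode = "overwrite")
people <- read.df("/tmp/people_parquet", source = "parquet")
head(people)
```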

How was this patch tested?

Added some unit tests to read and write data frames based on the original HiveMetastoreLazyInitializationSuite.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@liufengdb liufengdb changed the title [SPARK-23518][SQL] Completely remove metastore access if the query is not using tables [SPARK-23518][SQL] Avoid metastore access when the users only want to read and write data frames Feb 26, 2018

liufengdb commented Feb 26, 2018

@felixcheung Can you take a look at the changes in the R tests? I had to change them because the catalog impl conf is changed to "hive" when switching to the hive context. Then, when it is switched back, a "hive" external catalog will be materialized (lazily) and cause a test timeout.
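A rough sketch of the sequence being described, using the setHiveContext/unsetHiveContext helpers from test_sparkSQL.R (sc is the Java Spark context obtained by the test harness); the comments paraphrase the failure mode above:

```r
sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
setHiveContext(sc)   # flips spark.sql.catalogImplementation to "hive"
unsetHiveContext()   # switches back, but since this patch makes the external
                     # catalog lazy, a later catalog access can still
                     # materialize a "hive" catalog and, with no metastore
                     # running in the tests, time out
```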


@gatorsmile gatorsmile left a comment


LGTM except R-related codes. cc @felixcheung

assign(".sparkRsession", previousSession, envir = .sparkREnv)
remove(".prevSparkRsession", envir = .sparkREnv)
remove(".prevSparkConf", envir = .sparkREnv)


this should be .prevSessionConf?

hiveSession
}

unsetHiveContext <- function() {
previousSession <- get(".prevSparkRsession", envir = .sparkREnv)
previousConf <- get(".prevSessionConf", envir = .sparkREnv)
callJStatic("org.apache.spark.sql.api.r.SQLUtils", "setSparkContextSessionConf", previousSessioni, previousConf)


this should be previousSession?

@felixcheung

thx, let's wait for tests


SparkQA commented Feb 27, 2018

Test build #87681 has finished for PR 20681 at commit b447bea.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 27, 2018

Test build #87684 has finished for PR 20681 at commit 292cbce.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 27, 2018

Test build #87686 has finished for PR 20681 at commit 57f2a3d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 27, 2018

Test build #87706 has finished for PR 20681 at commit 8421e2d.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")


Is it okay to replace these unicode characters?



no, we should not change these


@felixcheung felixcheung left a comment


we need to make a few changes


SparkQA commented Feb 27, 2018

Test build #87731 has finished for PR 20681 at commit 1b29bbd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 28, 2018

Test build #87737 has finished for PR 20681 at commit 21c3374.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 28, 2018

Test build #87747 has finished for PR 20681 at commit 999f86f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@felixcheung felixcheung left a comment


LG this should do it...


SparkQA commented Feb 28, 2018

Test build #87757 has finished for PR 20681 at commit 5c922ca.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

retest this please


@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.


SparkQA commented Feb 28, 2018

Test build #87797 has finished for PR 20681 at commit 5c922ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liufengdb

retest this please

@dongjoon-hyun

Retest this please.


SparkQA commented Mar 1, 2018

Test build #87802 has finished for PR 20681 at commit 5c922ca.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung

looks like test failures are related?


SparkQA commented Mar 1, 2018

Test build #87821 has finished for PR 20681 at commit 6a962e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liufengdb

My original plan to fix the test won't work, because of this test: https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343

The new plan is to run some simple catalog commands immediately after the spark session is created, so the catalog is materialized (like the old behavior).
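Concretely, that plan amounts to something like the following sketch (it is what the diff under review further down adds via listTables()):

```r
sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
# Any cheap catalog command right after session creation forces the external
# catalog to materialize eagerly, restoring the old behavior for the tests.
listTables()
```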



liufengdb commented Mar 1, 2018

We can remove the test, but that is not good practice. We don't know exactly why the test was added or which hidden assumption it was meant to guarantee, right?

@liufengdb

Overall, I think this suite needs refactoring: split it into an in-memory catalog suite and a Hive catalog suite. The catalog conf should not be manipulated after the Spark context is created; the other way is just a hack.

@gatorsmile

cc @felixcheung Does this PR look good to you now?


SparkQA commented Mar 2, 2018

Test build #87858 has finished for PR 20681 at commit d0eacc2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveMetastoreLazyInitializationSuite extends SparkFunSuite

@liufengdb

retest this please


@felixcheung felixcheung left a comment


I think this test here https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_sparkSQL.R#L3343 just checks that spark.sql.catalogImplementation is set; it could easily be done in a different way if that is an issue.
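For instance, a hypothetical sketch of such a check (one possible "different way", not code from this PR): assert the conf value directly instead of forcing the catalog to materialize:

```r
# sparkR.conf(key) returns a named list; compare the value directly.
conf <- sparkR.conf("spark.sql.catalogImplementation")
expect_equal(conf[["spark.sql.catalogImplementation"]], "hive")
```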

@@ -67,6 +67,8 @@ sparkSession <- if (windows_with_hadoop()) {
sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE)
}
sc <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getJavaSparkContext", sparkSession)
# materialize the catalog implementation
listTables()


we are calling sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) in almost every other test file - does the same apply in those other places?



test_sparkSQL.R is the only one that uses newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE) on the ssc, so the catalog impl Spark conf is changed. That makes test_sparkSQL.R the only one broken.



ok


SparkQA commented Mar 2, 2018

Test build #87867 has finished for PR 20681 at commit d0eacc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveMetastoreLazyInitializationSuite extends SparkFunSuite

@gatorsmile

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in 3a4d15e Mar 2, 2018
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018

[SPARK-23518][SQL] Avoid metastore access when the users only want to read and write data frames

## What changes were proposed in this pull request?

apache#18944 added a patch that allowed a Spark session to be created when the Hive metastore server is down. However, it did not allow running any commands with that session. This causes trouble for users who only want to read and write data frames without a metastore setup.

## How was this patch tested?

Added some unit tests to read and write data frames based on the original HiveMetastoreLazyInitializationSuite.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Feng Liu <fengliu@databricks.com>

Closes apache#20681 from liufengdb/completely-lazy.