Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2917] [SQL] Avoid table creation in logical plan analyzing for CTAS #1846

Closed
wants to merge 2 commits into from

Conversation

chenghao-intel
Copy link
Contributor

No description provided.

@chenghao-intel
Copy link
Contributor Author

Hello, Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA tests have started for PR 1846. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18186/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 8, 2014

QA results for PR 1846:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class CreateTableAsSelect(

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18186/consoleFull

@chenghao-intel
Copy link
Contributor Author

Failed in Running PySpark tests

@@ -110,18 +110,11 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
*/
object CreateTables extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case InsertIntoCreatedTable(db, tableName, child) =>
case InsertIntoCreatedTable(db, tableName, child) if (db == None) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we want to check if db is None or not. We have

val databaseName = dbName.getOrElse(hive.sessionState.getCurrentDatabase)

right in line 115.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, thank you for noticing this.

@yhuai
Copy link
Contributor

yhuai commented Aug 11, 2014

Can you explain your intention? Why do we want to add a new physical operator to take care CTAS queries and put all actual operations into sideEffectResult?

@chenghao-intel
Copy link
Contributor Author

Thank you @yhuai for reviewing this.
The motivation I create this PR is currently the table will be created in logical plan analyzing (see object CreateTables) for CTAS, which may causes some side effect / bugs.

For example in #1847, I was trying to "explain" the CTAS like explain create table abc as select * from src. And probably the same in some case when developers debug their logical plan before real execution.

I think we'd better move the table creation from logical plan analyzing to executing. So I created a new Physical Operator CreateTableAsSelect

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA tests have started for PR 1846. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18298/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA tests have started for PR 1846. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18299/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA results for PR 1846:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class CreateTableAsSelect(

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18298/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 11, 2014

QA results for PR 1846:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class CreateTableAsSelect(

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18299/consoleFull

database: String,
tableName: String,
child: SparkPlan,
insert: MetastoreRelation => InsertIntoHiveTable)(@transient sc: HiveContext)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are failing because you have parameters that are in a second parameter list but not in otherCopyArgs. However, you shouldn't need to pass the HiveContext in manually at all. It is available automatically through the sqlContext method in SparkPlan.

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA tests have started for PR 1846. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18409/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA results for PR 1846:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class CreateTableAsSelect(

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18409/consoleFull

@chenghao-intel
Copy link
Contributor Author

Is this my fault? Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA tests have started for PR 1846. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18424/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA results for PR 1846:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class CreateTableAsSelect(

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18424/consoleFull

@chenghao-intel
Copy link
Contributor Author

Seems failed due to some other reasons.

@chenghao-intel
Copy link
Contributor Author

Failed in difference places, Jenkins, retest this please.

@chenghao-intel
Copy link
Contributor Author

@marmbrus seems Jenkins doesn't answer me, can you also check if the unit test failure for me? Seems failed in something else.

@marmbrus
Copy link
Contributor

test this please

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA tests have started for PR 1846. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18508/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA results for PR 1846:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class CreateTableAsSelect(

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18508/consoleFull

@chenghao-intel
Copy link
Contributor Author

Still failed in pyspark. But it passed in my local. Any suggestion?

=========================================================================
Running PySpark tests
=========================================================================
Running PySpark tests. Output is in python/unit-tests.log.
Testing with Python version:
Python 2.6.6
Running test: pyspark/rdd.py
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Traceback (most recent call last):
  File "pyspark/rdd.py", line 1849, in <module>
    _test()
  File "pyspark/rdd.py", line 1840, in _test
    globs['sc'] = SparkContext('local[4]', 'PythonTest', batchSize=2)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/context.py", line 155, in _do_init
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 699, in __call__

  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 369, in send_command

  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 362, in send_command

  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 318, in _get_connection

  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 325, in _create_connection

  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 432, in start

py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server

@chenghao-intel
Copy link
Contributor Author

test this please

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA tests have started for PR 1846 at commit 5be9554.

  • This patch merges cleanly.

@chenghao-intel
Copy link
Contributor Author

test this please

@SparkQA
Copy link

SparkQA commented Aug 21, 2014

QA tests have started for PR 1846 at commit 5be9554.

  • This patch merges cleanly.

tableName: String,
subquery: SparkPlan,
insert: MetastoreRelation => InsertIntoHiveTable)
extends LeafNode with Command {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a leaf node? I think that is going to break things like the addition of exchange operators.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a leaf node, and actually the child would be InsertIntoHiveTable, however, the InsertIntoHiveTable requires the table was existed in logical plan analysis, hence I pass in the MetastoreRelation => InsertIntoHiveTable as the parameter of CreateTableAsSelect.

You're right, the name subquery is quite confusing, I will change it, but the logical plan it represented is well transformed in HiveStrategies, see case logical.InsertIntoCreatedTable(database, tableName, child) in HiveStrategies.

@SparkQA
Copy link

SparkQA commented Aug 27, 2014

QA tests have started for PR 1846 at commit ae1e072.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 27, 2014

QA tests have finished for PR 1846 at commit ae1e072.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "$FWDIR"/bin/spark-submit --class $CLASS "$
    • class ExternalSorter(object):
    • "$FWDIR"/bin/spark-submit --class $CLASS "$
    • protected class AttributeEquals(val a: Attribute)
    • case class CreateTableAsSelect(

@chenghao-intel
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 27, 2014

QA tests have started for PR 1846 at commit ae1e072.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 27, 2014

QA tests have finished for PR 1846 at commit ae1e072.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@chenghao-intel
Copy link
Contributor Author

@marmbrus @yhuai Can you review this for me?

@marmbrus
Copy link
Contributor

This is a pretty big change, will investigate further after the 1.1. release.

@chenghao-intel
Copy link
Contributor Author

OK, sounds good to me.

@marmbrus
Copy link
Contributor

@chenghao-intel, I'd like to this merged before I begin on my next big refactoring. One small thing, do you think you could add a test to make sure we don't regress and start creating tables for EXPLAIN queries?

@chenghao-intel
Copy link
Contributor Author

Thank you @marmbrus I will add an unit test for this soon.

@chenghao-intel
Copy link
Contributor Author

test this please.

@SparkQA
Copy link

SparkQA commented Sep 10, 2014

QA tests have started for PR 1846 at commit 6521ce8.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 10, 2014

QA tests have finished for PR 1846 at commit 6521ce8.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(

@@ -47,4 +47,16 @@ class SQLQuerySuite extends QueryTest {
GROUP BY key, value
ORDER BY value) a""").collect().toSeq)
}

test("test CTAS") {
sql("CREATE TABLE test_ctas_123 AS SELECT key, value FROM src").schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually isn't the semantic we want. All DDL commands should act as actions and be executed immediately. It is only when there is an EXPLAIN that we should not run the command eagerly.

I think InsertIntoCreatedTable should inherit from Command so that this happens automatically. I'd also consider renaming it CreateTableAsSelect as you did with the physical operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I understood. But I'd like to make that as 2 independent PRs:

What do you think about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as long as you don't remove InsertIntoCreatedTable from SchemaRDDLike.scala I'm okay with pushing the other changes to another PR.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have started for PR 1846 at commit 56a0578.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have finished for PR 1846 at commit 56a0578.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(
    • case class CreateTableAsSelect(

@marmbrus
Copy link
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in ca83f1e Sep 11, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants