Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5941] [SQL] Unit Test loads the table src twice for leftsemijoin.q #4506

Closed
wants to merge 1 commit into from

Conversation

chenghao-intel
Copy link
Contributor

In leftsemijoin.q, there is a data loading command for table sales already, but in TestHive, it also created the table sales, which causes duplicated records inserted into the sales.

@SparkQA
Copy link

SparkQA commented Feb 10, 2015

Test build #27207 has finished for PR 4506 at commit 3ee839f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

We have eagerly resolved the table since Spark 1.0 when Spark SQL was added.

https://github.com/apache/spark/blob/branch-1.0/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L185

Why do you think this is problematic?

@chenghao-intel
Copy link
Contributor Author

The deferred resolving table should be harmless, and since most of the DF API are yielding the unresolved logical plans(can I say that?), I think we'd better keep it the same for def table(name), .

Ideally, eagerly resolving the table should produce the the result in unit test, but seems not, probably something wrong somewhere, I will keep investigating that.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27505 has finished for PR 4506 at commit 40ccd81.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 17, 2015

Test build #27625 has finished for PR 4506 at commit 4c58fbc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel
Copy link
Contributor Author

/cc @marmbrus @rxin @yhuai I noticed that we have added the dataFrameEagerAnalysis in SQLConf, probably we also need to update the def table code for using UnresolvedRelation instead of lookupRelation eagerly.
However some of the unit test (left semi join) will fail if we do that, as we have bug in TestHive, this PR is also fix the bug. More detailed information about the unit test failure can be found at the description.

@SparkQA
Copy link

SparkQA commented Feb 18, 2015

Test build #27672 has finished for PR 4506 at commit 28dd6d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chenghao-intel chenghao-intel changed the title [SQL] [Minor] Deferred table resolving for DF API [SPARK-5941] [SQL] Deferred table resolving for DF API Feb 22, 2015
@chenghao-intel
Copy link
Contributor Author

@marmbrus @rxin @yhuai any more comment? Or should I close this PR?

@rxin
Copy link
Contributor

rxin commented Feb 25, 2015

@marmbrus can you take a look? not 100% sure what's happening

@yhuai
Copy link
Contributor

yhuai commented Feb 25, 2015

If I remember it correctly, the issue is some test tables are loaded twice. @chenghao-intel Can you change the title?

@chenghao-intel chenghao-intel changed the title [SPARK-5941] [SQL] Deferred table resolving for DF API [SPARK-5941] [DataFrame] Postpone the table resolving for DataFrame API def table Feb 25, 2015
@chenghao-intel
Copy link
Contributor Author

Sorry for the confusing. I've updated the title and description.

|ROW FORMAT SERDE '${classOf[RegexSerDe].getCanonicalName}'
|WITH SERDEPROPERTIES ("input.regex" = "([^ ]*)\t([^ ]*)")
""".stripMargin.cmd,
s"LOAD DATA LOCAL INPATH '${getHiveFile("data/files/sales.txt")}' INTO TABLE sales".cmd),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since sales is not a preloaded in Hive's unit tests (https://github.com/apache/hive/blob/trunk/data/scripts/q_test_init.sql), seems it is fine to remove it.

@chenghao-intel chenghao-intel changed the title [SPARK-5941] [DataFrame] Postpone the table resolving for DataFrame API def table [SPARK-5941] [SQL] Unit Test loads the table src twice for leftsemijoin.q Apr 6, 2015
@SparkQA
Copy link

SparkQA commented Apr 6, 2015

Test build #29735 has finished for PR 4506 at commit b463f8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@chenghao-intel
Copy link
Contributor Author

Closing it since it's only impact a single hive compatible test.

@chenghao-intel
Copy link
Contributor Author

@marmbrus I've reopened this PR, just in case people runs into the bug of this while unit testing.

@SparkQA
Copy link

SparkQA commented Apr 13, 2015

Test build #30132 has finished for PR 4506 at commit dd0b3f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented Apr 13, 2015

Test build #30133 has finished for PR 4506 at commit 0be05f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA
Copy link

SparkQA commented Apr 13, 2015

Test build #30131 has finished for PR 4506 at commit b463f8a.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in c5602bd Apr 13, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants