
[SPARK-9986][SPARK-9991][SPARK-9993][SQL]Create a simple test framework for local operators #8464

Closed · wants to merge 9 commits

Conversation


@zsxwing zsxwing commented Aug 26, 2015

This PR includes the following changes:

  • Add LocalNodeTest for local operator tests and add unit tests for FilterNode and ProjectNode.
  • Add LimitNode and UnionNode and their unit tests to show how to use LocalNodeTest. (SPARK-9991, SPARK-9993)
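The core idea of `LocalNodeTest` can be sketched in isolation: run a local operator over known input and compare its output with an expected answer. This is a minimal standalone sketch, with `Seq[Int]` standing in for rows of `InternalRow`; the names `checkAnswer` and `wrap` are illustrative, not the PR's actual signatures.

```scala
// Simplified sketch of the LocalNodeTest pattern (illustrative names only).
object LocalNodeTestSketch {
  // A local operator is modeled here as a plain iterator over rows.
  type Node = Iterator[Int]

  // Wrap the input in the operator under test and compare against `expected`,
  // optionally sorting both sides so row order does not matter.
  def checkAnswer(input: Seq[Int],
                  wrap: Node => Node,
                  expected: Seq[Int],
                  sortAnswers: Boolean = true): Unit = {
    val actual = wrap(input.iterator).toSeq
    val (a, e) =
      if (sortAnswers) (actual.sorted, expected.sorted) else (actual, expected)
    assert(a == e, s"expected $e but got $a")
  }

  def main(args: Array[String]): Unit = {
    // A "FilterNode" is just a filtering iterator in this sketch.
    checkAnswer(Seq(3, 1, 2), node => node.filter(_ > 1), Seq(2, 3))
  }
}
```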


zsxwing commented Aug 26, 2015

cc @rxin

override def open(): Unit = {
  iterator = data.iterator
}
private var iter: Iterator[InternalRow] = _
Contributor:

Why not just assign the value here (then we could use a val)? Your changes to FilterNode and ProjectNode also assign values to predicate/project immediately instead of in open().
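The suggestion, sketched with simplified types (plain `Int` data rather than the PR's actual `InternalRow`, and hypothetical class names): initialize the field eagerly at construction so it can be an immutable val, instead of deferring assignment to open() with a mutable var.

```scala
// Before (sketch): initialization deferred to open(), so the field must be a var.
class SeqScanNodeVar(data: Seq[Int]) {
  private var iter: Iterator[Int] = _
  def open(): Unit = { iter = data.iterator }
}

// After (sketch): eager assignment lets the field be an immutable val,
// and there is no window where the field is null before open() is called.
class SeqScanNodeVal(data: Seq[Int]) {
  private val iter: Iterator[Int] = data.iterator
}
```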


SparkQA commented Aug 26, 2015

Test build #41628 has finished for PR 8464 at commit 4e101ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LimitNode(limit: Int, child: LocalNode) extends UnaryLocalNode
    • case class UnionNode(children: Seq[LocalNode]) extends LocalNode


rxin commented Aug 27, 2015

I talked to @marmbrus and we thought it'd make more sense to just have the operators be iterators themselves, and then we will create new instances of these somewhere else.
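That design can be sketched as follows: each operator is itself an `Iterator` over rows, so composing operators is just wrapping one iterator in another. Again a simplified standalone sketch (`Int` rows instead of `InternalRow`; the `*Sketch` names are illustrative, though `LimitNode(limit, child)` matches the shape reported in the build output above).

```scala
// Sketch of iterator-style local operators, per the discussion above.
trait LocalNodeSketch extends Iterator[Int]

// Leaf operator: scans an in-memory sequence.
case class SeqScanNodeSketch(data: Seq[Int]) extends LocalNodeSketch {
  private val iter = data.iterator
  override def hasNext: Boolean = iter.hasNext
  override def next(): Int = iter.next()
}

// Unary operator: passes through at most `limit` rows from its child.
case class LimitNodeSketch(limit: Int, child: LocalNodeSketch)
  extends LocalNodeSketch {
  private var taken = 0
  override def hasNext: Boolean = taken < limit && child.hasNext
  override def next(): Int = { taken += 1; child.next() }
}
```

With this shape there is no open()/close() lifecycle to manage in tests: `LimitNodeSketch(2, SeqScanNodeSketch(Seq(5, 6, 7))).toSeq` evaluates the pipeline directly.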


zsxwing commented Aug 27, 2015

I removed OpenCloseIterator and made LocalNode an Iterator, as discussed with @rxin. Also updated the description of this PR.

CatalystTypeConverters.createToCatalystConverter(StructType.fromAttributes(output))
new SeqScanNode(
  output,
  df.collect().map(r => converter(r).asInstanceOf[InternalRow]))
Contributor:

We can use df.queryExecution.toRdd.collect() to get the internal rows directly.
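The difference, sketched with Spark 1.5-era API names (this fragment assumes an existing DataFrame `df` with schema attributes `output`, so it is not self-contained or compiled here):

```scala
// Before: collect external Rows, then convert each one back to an InternalRow.
val converter = CatalystTypeConverters.createToCatalystConverter(
  StructType.fromAttributes(output))
val internalRows = df.collect().map(r => converter(r).asInstanceOf[InternalRow])

// After: read the InternalRows off the query execution directly,
// skipping the internal-to-external-to-internal round trip.
val internalRows2 = df.queryExecution.toRdd.collect()
```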

Member Author:

Cool. Fixed it.


SparkQA commented Aug 27, 2015

Test build #41676 has finished for PR 8464 at commit 62b8d24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LimitNode(limit: Int, child: LocalNode) extends UnaryLocalNode
    • case class UnionNode(children: Seq[LocalNode]) extends LocalNode


SparkQA commented Aug 27, 2015

Test build #41681 has finished for PR 8464 at commit 22e7bc0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LimitNode(limit: Int, child: LocalNode) extends UnaryLocalNode
    • case class UnionNode(children: Seq[LocalNode]) extends LocalNode

* @param sortAnswers if true, the answers will be sorted by their toString representations prior
* to being compared.
*/
protected def checkAnswer2(
Contributor:

Just name this checkAnswer.

Member Author:

It needs to be checkAnswer2 because it takes a default parameter, sortAnswers, and default parameters don't work together with overloading.

Contributor:

Can you use overloading instead of defaults for that as well?

Member Author:

If so, the type inference doesn't work well... E.g.,

    checkAnswer(
      testData,
      node => FilterNode(condition.expr, node),
      testData.filter(condition).collect()
    )

will need to change to

    checkAnswer(
      testData,
      (node: LocalNode) => FilterNode(condition.expr, node),
      testData.filter(condition).collect()
    )

Member Author:

It reminds me of some old arguments about overloading in the Scala community. It seems that they don't like overloading and don't want to improve it: https://groups.google.com/forum/#!msg/scala-language/h7akCAFnu8c/dmReTTsW11gJ


SparkQA commented Aug 28, 2015

Test build #41717 has finished for PR 8464 at commit 7dcd502.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LimitNode(limit: Int, child: LocalNode) extends UnaryLocalNode
    • case class UnionNode(children: Seq[LocalNode]) extends LocalNode


import org.apache.spark.sql.test.SharedSQLContext

class FilterNodeSuite extends LocalNodeTest with SharedSQLContext {
Contributor:

So I was thinking it would be great if we could get rid of the SQLContext in the test cases for this local stuff.

Member Author:

I just wanted to reuse SQLTestData. It would be easy if we could use DataFrame to test the local stuff. I feel that as long as we make sure the main code of LocalNode doesn't use SQLContext, we don't need to get rid of SQLContext in the test cases.

Contributor:

Note that SQLTestData is going away though.... #7406

Member Author:

> Note that SQLTestData is going away though.... #7406

Didn't notice that. I will add test data for each test case manually.

Member Author:

Just realized we need to call DataFrame.resolve to create a Column. It looks like it's hard to get rid of SQLContext in tests alone. I think it's better to add an Analyzer for LocalNode in a separate PR.


rxin commented Aug 30, 2015

OK I'm going to merge this so we have the infrastructure in. We can fix issues later.

@asfgit asfgit closed this in 13f5f8e Aug 30, 2015
@zsxwing zsxwing deleted the local-execution branch August 31, 2015 07:12