Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-1508][SQL] Add SQLConf to SQLContext. #956

Closed
wants to merge 35 commits into from

Conversation

Projects
None yet
7 participants
@concretevitamin
Copy link
Contributor

commented Jun 3, 2014

This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands.

The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by sql() or hql(). Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.

For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

Merged build triggered.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

Merged build started.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

Merged build finished. All automated tests passed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15410/

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

Merged build triggered.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

Merged build started.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

Merged build finished.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

Merged build triggered.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 3, 2014

Merged build started.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

Merged build finished.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

Merged build triggered.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

Merged build started.

@rxin

This comment has been minimized.

Copy link
Contributor

commented Jun 4, 2014

Can you also add a feature that typing "set abcd" should output the value of abcd.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

Merged build finished.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

Merged build triggered.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

Merged build started.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

Merged build finished. All automated tests passed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 4, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15450/

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 9, 2014

Build triggered.

@concretevitamin

This comment has been minimized.

Copy link
Contributor Author

commented Jun 9, 2014

I made the changes -- SQLConf is now a trait that gets mixed in by a SQLContext.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 9, 2014

Build started.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 9, 2014

Build finished. All automated tests passed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 9, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15571/

@marmbrus

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2014

I will merge this once it merges cleanly into master.

Here's a followup PR for sessionizing the config: https://issues.apache.org/jira/browse/SPARK-2087

Merge remote-tracking branch 'upstream/master' into sqlconf
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala
@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 10, 2014

Merged build triggered.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 10, 2014

Merged build started.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 10, 2014

Merged build triggered.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 10, 2014

Merged build finished. All automated tests passed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 10, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15584/

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 10, 2014

Merged build started.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 10, 2014

Merged build finished. All automated tests passed.

@AmplabJenkins

This comment has been minimized.

Copy link

commented Jun 10, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15586/

@@ -269,6 +271,22 @@ class SQLContext(@transient val sparkContext: SparkContext)
protected abstract class QueryExecution {
def logical: LogicalPlan

def eagerlyProcess(plan: LogicalPlan): RDD[Row] = plan match {
case SetCommand(key, value) =>

This comment has been minimized.

Copy link
@liancheng

liancheng Jun 10, 2014

Contributor

I'm a bit confused that the actual logic for the set command is not handled in SetCommandPhysical but here as a special case. I think we can remove eagerlyProcess and processCmd in this file with the help of changes made in PR #948, so that we can move the actual processing logic of set command back to SetCommandPhysical. @rxin @marmbrus What do you think?

This comment has been minimized.

Copy link
@concretevitamin

concretevitamin Jun 10, 2014

Author Contributor

Hey @liancheng -- so this function eagerlyProcess is supposed to be triggered as soon as hql("set ...") is typed in / is being evaluated. The actual physical operator SetCommandPhysical handles evaluating the actual contents of the SchemaRDD. These contents are not (and need not be, in the spirit of RDDs) eagerly processed at this particular place.

This comment has been minimized.

Copy link
@liancheng

liancheng Jun 10, 2014

Contributor

Hi @concretevitamin, previously I thought the only important thing of a SchemaRDD made from an eagerly executed statement (a Hive native command, a set command, or some insertion command etc.) is its side effect, not the content of the RDD itself. But now I do agree that when user call .collect() of such an RDD, we should present some meaningful result.

As for the eager evaluation of the RDD content, IMHO, for command SchemaRDDs, the RDD contents are either empty or usually generated at the same time when the command is executed (at least I can't find a counter example in the current code base), thus this shouldn't be an issue.

So I think the constraints presented here are:

  1. The side effect of a command SchemaRDD should take place eagerly;
  2. The side effect of a command SchemaRDD should take place once and only once;
  3. When .collect() method is called, something meaningful, usually the output message lines of the command, should be presented.

Then how about adding a lazy field inside all the physical command nodes to wrap up the side effect and hold the command output? Take the SetCommandPhysical as an example:

trait PhysicalCommand(@transient context: SQLContext) {
   lazy val commandOutput: Any
}

case class SetCommandPhysical(
    key: Option[String], value: Option[String], output: Seq[Attribute])(
    @transient context: SQLContext)
  extends PhysicalCommand(context)
  with PhysicalCommand {

  override lazy val commandOutput = {
    // Perform the side effect, and record appropriate output
    ???
  }

  def execute(): RDD[Row] = {
    val row = new GenericRow(Array[Any](commandOutput))
    context.sparkContext.parallelize(row, 1)
  }
}

In this way, all the constraints are met:

  1. Eager evaluation: done by the toRdd call in SchemaRDDLike (PR #948),
  2. Side effect should take place once and only once: ensured by the lazy commandOutput field,
  3. Present meaningful output as RDD contents: command output is held by commandOutput and returned in execute().

An additional benefit is that, side effect logic of all the commands can be implemented within their own physical command nodes, instead of adding special cases inside SQLContext.toRdd and/or HiveContext.toRdd.

Did I miss something here?

On the other hand, although I think this solution can be more concise and clean, it may involve some non-trivial changes, so I think we'd better merge this PR as is and make those changes in another PR when appropriate.

This comment has been minimized.

Copy link
@marmbrus

marmbrus Jun 10, 2014

Contributor

Yeah, I merged this, but that sounds like a good plan to me.

Follow up JIRA: https://issues.apache.org/jira/browse/SPARK-2094

This comment has been minimized.

Copy link
@concretevitamin

concretevitamin Jun 10, 2014

Author Contributor

@liancheng That sounds like a very clear & good solution!

@asfgit asfgit closed this in 08ed9ad Jun 10, 2014

asfgit pushed a commit that referenced this pull request Jun 10, 2014

[SPARK-1508][SQL] Add SQLConf to SQLContext.
This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands.

The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.

For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #956 from concretevitamin/sqlconf and squashes the following commits:

4968c11 [Zongheng Yang] Very minor cleanup.
d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
dd19666 [Zongheng Yang] Update a comment.
baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
5f7e6d8 [Zongheng Yang] Add default num partitions.
22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
271f0b1 [Zongheng Yang] Minor change.
f8983d1 [Zongheng Yang] Minor changes per review comments.
1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
b766af9 [Zongheng Yang] Remove a test.
d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
2ea8cdc [Zongheng Yang] Wrap long line.
41d7f09 [Zongheng Yang] Fix imports.
13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
5b67985 [Zongheng Yang] New line at EOF.
c651797 [Zongheng Yang] Add commands.scala.
efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
d0c4578 [Zongheng Yang] Tmux typo.
0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
ce22d80 [Zongheng Yang] Fix parsing issues.
cb722c1 [Zongheng Yang] Finish up SQLConf patch.
4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.

(cherry picked from commit 08ed9ad)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@marmbrus

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2014

Thanks, merged into master an 1.0.

pdeyhim added a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014

[SPARK-1508][SQL] Add SQLConf to SQLContext.
This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands.

The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.

For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes apache#956 from concretevitamin/sqlconf and squashes the following commits:

4968c11 [Zongheng Yang] Very minor cleanup.
d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
dd19666 [Zongheng Yang] Update a comment.
baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
5f7e6d8 [Zongheng Yang] Add default num partitions.
22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
271f0b1 [Zongheng Yang] Minor change.
f8983d1 [Zongheng Yang] Minor changes per review comments.
1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
b766af9 [Zongheng Yang] Remove a test.
d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
2ea8cdc [Zongheng Yang] Wrap long line.
41d7f09 [Zongheng Yang] Fix imports.
13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
5b67985 [Zongheng Yang] New line at EOF.
c651797 [Zongheng Yang] Add commands.scala.
efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
d0c4578 [Zongheng Yang] Tmux typo.
0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
ce22d80 [Zongheng Yang] Fix parsing issues.
cb722c1 [Zongheng Yang] Finish up SQLConf patch.
4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

[SPARK-1508][SQL] Add SQLConf to SQLContext.
This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands.

The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.

For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes apache#956 from concretevitamin/sqlconf and squashes the following commits:

4968c11 [Zongheng Yang] Very minor cleanup.
d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
dd19666 [Zongheng Yang] Update a comment.
baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
5f7e6d8 [Zongheng Yang] Add default num partitions.
22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
271f0b1 [Zongheng Yang] Minor change.
f8983d1 [Zongheng Yang] Minor changes per review comments.
1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
b766af9 [Zongheng Yang] Remove a test.
d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
2ea8cdc [Zongheng Yang] Wrap long line.
41d7f09 [Zongheng Yang] Fix imports.
13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
5b67985 [Zongheng Yang] New line at EOF.
c651797 [Zongheng Yang] Add commands.scala.
efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
d0c4578 [Zongheng Yang] Tmux typo.
0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
ce22d80 [Zongheng Yang] Fix parsing issues.
cb722c1 [Zongheng Yang] Finish up SQLConf patch.
4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.