Conversation

@zsxwing (Member) commented Jun 10, 2015

  1. Add `SQLConfEntry` to store the information about a configuration (see the sketch after this list). For configurations that cannot be found in `sql-programming-guide.md`, I left the doc as `<TODO>`.
  2. Verify the value when setting a configuration, if it is defined in `SQLConf`.
  3. Use `SET -v` to display all public configurations.
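
A minimal, self-contained sketch of the idea, for illustration only: the names and signatures below (ConfEntry, intConf, valueConverter) are assumptions, not the actual Spark internals. The point is that an entry bundles the key, default value, doc text, and visibility together with a converter that validates values when they are set.

```scala
// Hypothetical stand-in for SQLConfEntry (illustration only).
case class ConfEntry[T](
    key: String,
    defaultValue: Option[T],
    valueConverter: String => T, // parses and validates the raw string value
    doc: String,
    isPublic: Boolean)

object ConfEntry {
  // A factory in the spirit of the typed helpers this PR adds; the signature is a guess.
  def intConf(
      key: String,
      defaultValue: Option[Int] = None,
      doc: String = "<TODO>",
      isPublic: Boolean = true): ConfEntry[Int] =
    ConfEntry(key, defaultValue, { s =>
      try s.toInt catch {
        case _: NumberFormatException =>
          throw new IllegalArgumentException(s"$key should be int, but was $s")
      }
    }, doc, isPublic)
}

// Setting a value goes through the converter, so bad input fails fast:
val shufflePartitions = ConfEntry.intConf(
  "spark.sql.shuffle.partitions",
  Some(200),
  doc = "Configures the number of partitions to use when shuffling data for joins or aggregations.")
shufflePartitions.valueConverter("300")     // returns 300
// shufflePartitions.valueConverter("many") // would throw IllegalArgumentException
```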

@zsxwing (Member, Author) commented Jun 10, 2015

A simple example of the conf command:

scala> sqlContext.sql("conf").collect().foreach(v => println(v.toSeq.mkString("\t")))
spark.sql.inMemoryColumnarStorage.batchSize 10000   Controls the size of batches for columnar caching.  Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.
spark.sql.hive.verifyPartitionPath  true    <TODO>
spark.sql.retainGroupColumns    true    <TODO>
spark.sql.codegen   false   When true, code will be dynamically generated at runtime for expression evaluation in a specific query. For some queries with complicated expression this option can lead to significant speed-ups. However, for simple queries this can actually slow down query execution.
spark.sql.planner.sortMergeJoin false   <TODO>
spark.sql.shuffle.partitions    200 Configures the number of partitions to use when shuffling data for joins or aggregations.
spark.sql.planner.externalSort  false   When true, performs sorts spilling to disk as needed otherwise sort each partition in memory.
spark.sql.broadcastTimeout  300 <TODO>
spark.sql.parquet.filterPushdown    false   Turn on Parquet filter pushdown optimization. This feature is turned off by default because of a known bug in Parquet 1.6.0rc3 (PARQUET-136, https://issues.apache.org/jira/browse/PARQUET-136). However, if your table doesn't contain any nullable string or binary columns, it's still safe to turn this feature on.
spark.sql.parquet.cacheMetadata true    Turns on caching of Parquet schema metadata. Can speed up querying of static data.
spark.sql.unsafe.enabled    false   <TODO>
spark.sql.useSerializer2    true    <TODO>
...

@SparkQA commented Jun 10, 2015

Test build #34606 has finished for PR 6747 at commit 326881d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 10, 2015

Test build #34607 has finished for PR 6747 at commit f3c1b33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor (review comment):

Maybe remove the float version and just use the double version?

@SparkQA commented Jun 11, 2015

Test build #34678 has finished for PR 6747 at commit 99c9c16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 11, 2015

Test build #34680 has finished for PR 6747 at commit 8973ced.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 11, 2015

Test build #34685 has finished for PR 6747 at commit 6e47e56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 11, 2015

Test build #34690 has finished for PR 6747 at commit 4abd807.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member, Author) commented Jun 12, 2015

Addressed all comments

Contributor (review comment):

Can we add scaladoc for the semantics of these options? In particular, isPublic isn't obvious to me.

Member Author (review reply):

Added scaladoc for SQLConfEntry
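
For context, a hedged sketch of the kind of documentation being asked for; the wording and the DocumentedConfEntry name are illustrative, not the committed scaladoc. Given that `SET -v` displays all public configurations (per the description above), isPublic plausibly controls that visibility.

```scala
/**
 * Describes one SQL configuration option (illustrative shape, not the merged code).
 *
 * @param key          the configuration key, e.g. "spark.sql.shuffle.partitions"
 * @param defaultValue the value used when the key has not been explicitly set
 * @param doc          user-facing documentation, surfaced by `SET -v`
 * @param isPublic     whether the entry should be shown to end users; internal
 *                     (non-public) entries are hidden from `SET -v` output
 */
case class DocumentedConfEntry[T](
    key: String,
    defaultValue: Option[T],
    doc: String,
    isPublic: Boolean)
```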

@marmbrus (Contributor) commented:

Thanks for working on this much needed cleanup! A few comments (some of which were preexisting):

  • The schema for SET commands is wrong (we need some column name):

    scala> sql("SET -v")
    res4: org.apache.spark.sql.DataFrame = [: string]

  • Related to the above, I think we should consider breaking out the parts of commands like SET and SET -v into columns, instead of just returning a concatenated string.
  • When the option has a default, I think we should show that instead of <undefined>:

    scala> sql("SET spark.sql.autoBroadcastJoinThreshold").collect()
    res1: Array[org.apache.spark.sql.Row] = Array([spark.sql.autoBroadcastJoinThreshold=<undefined>])

  • It seems a bunch of boilerplate could be removed, and we could get some more type safety, by making the conf entries inner classes of SQLConf with methods like .value and .set(...). This would avoid having to have an entry and a def for each config option (a rough sketch follows below). Perhaps I'm missing something though.
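
A rough sketch of that suggestion (SketchSQLConf and Entry are hypothetical names, not the merged design): each option becomes a single typed member with .value and .set, so no separate entry-plus-def pair is needed.

```scala
import scala.collection.mutable

// Hypothetical "inner class" alternative, sketched under the assumptions above.
class SketchSQLConf {
  private val settings = mutable.Map.empty[String, String]

  // One typed member per option replaces the entry plus accessor def.
  class Entry[T](val key: String, default: T, parse: String => T) {
    def value: T = settings.get(key).map(parse).getOrElse(default)
    def set(v: T): Unit = settings(key) = v.toString
  }

  val shufflePartitions = new Entry[Int]("spark.sql.shuffle.partitions", 200, _.toInt)
  val codegenEnabled    = new Entry[Boolean]("spark.sql.codegen", false, _.toBoolean)
}

val conf = new SketchSQLConf
conf.shufflePartitions.set(400)
assert(conf.shufflePartitions.value == 400) // typed access, no string handling at call sites
```

One trade-off of this shape, raised later in the thread, is that every SQLConf instance then carries all of its entries.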

Contributor (review comment):

Could we also have an entry type for configs like these that validates that the option is in some list?

Member Author (review reply):

Added enumConf and used it for PARQUET_COMPRESSION
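
Roughly what an enum-style entry does, as a hedged sketch: the helper name enumConf matches the commit log, but the signature and the codec list below are assumptions. The converter rejects any value outside the allowed set.

```scala
// Illustrative enum-validated entry: returns the value or fails fast.
def enumConf(key: String, validValues: Set[String]): String => String =
  (v: String) =>
    if (validValues.contains(v)) v
    else throw new IllegalArgumentException(
      s"The value of $key should be one of ${validValues.mkString(", ")}, but was $v")

val parquetCompression = enumConf(
  "spark.sql.parquet.compression.codec",
  Set("uncompressed", "snappy", "gzip", "lzo")) // assumed list of codecs

parquetCompression("snappy")  // accepted
// parquetCompression("zstd") // would throw IllegalArgumentException
```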

@SparkQA commented Jun 12, 2015

Test build #34763 has finished for PR 6747 at commit 037b1db.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member, Author) commented Jun 12, 2015

> The schema for SET commands is wrong (we need some column name)

Added schemas for all SET commands and split the results into columns. For a configuration that has a default value, if it is not set, `SET <configuration>` will show the default value.

> It seems a bunch of boilerplate could be removed, and we could get some more type safety, by making the conf entries inner classes of SQLConf with methods like .value and .set(...). This would avoid having to have an entry and a def for each config option. Perhaps I'm missing something though.

Sounds good. I will try it.

@zsxwing (Member, Author) commented Jun 12, 2015

> It seems a bunch of boilerplate could be removed, and we could get some more type safety, by making the conf entries inner classes of SQLConf with methods like .value and .set(...). This would avoid having to have an entry and a def for each config option. Perhaps I'm missing something though.

One disadvantage is that the size of a SQLConf instance will increase a lot, since all of the SQLConfEntry instances are put into it.

Contributor (review comment):

Adding a `def unsetConf[T](entry: SQLConfEntry[T])` here would be convenient.

Contributor (review comment):

Nit: Seems that we can remove the newline here.

@liancheng (Contributor) commented:

One of the things I envy Hive for is ConfVars, and now we finally have our own version :)

Some follow-ups I can think of (but not necessarily to be done in this PR):

  • Add generic versions of setConf, getConf, and unsetConf to SQLContext (note that HiveContext.setConf overrides SQLContext.setConf and also manipulates HiveConf there); see the sketch after this list
  • Replace all occurrences of sqlContext.setConf/getConf/unsetConf with the corresponding generic versions
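
To make the proposal concrete, a sketch of what the generic versions might look like; the Entry type and every signature here are hypothetical, not the actual SQLContext API.

```scala
// Hypothetical typed entry and accessors layered over string-based primitives.
case class Entry[T](key: String, default: T, parse: String => T)

trait TypedConf {
  // Existing string-based operations (abstract in this sketch).
  def setConfString(key: String, value: String): Unit
  def getConfString(key: String): Option[String]
  def unsetConfString(key: String): Unit

  // Generic versions keyed by the entry itself:
  def setConf[T](entry: Entry[T], value: T): Unit =
    setConfString(entry.key, value.toString)
  def getConf[T](entry: Entry[T]): T =
    getConfString(entry.key).map(entry.parse).getOrElse(entry.default)
  def unsetConf[T](entry: Entry[T]): Unit = // as suggested in the review above
    unsetConfString(entry.key)
}
```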

@SparkQA commented Jun 12, 2015

Test build #34767 has finished for PR 6747 at commit a2f4add.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 12, 2015

Test build #34768 has finished for PR 6747 at commit 3c5f03e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor) commented:

Another thought: we probably also want to migrate data source options to this new API.

@marmbrus What do you think?

@SparkQA commented Jun 12, 2015

Test build #34775 has finished for PR 6747 at commit cf950c1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SetCommand(kv: Option[(String, Option[String])]) extends RunnableCommand with Logging
    • case class SetInFilter[T <: Comparable[T]](

@SparkQA commented Jun 16, 2015

Test build #34982 has finished for PR 6747 at commit e014f53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class IsNull(child: Expression) extends UnaryExpression with Predicate
    • case class IsNotNull(child: Expression) extends UnaryExpression with Predicate
    • case class SetCommand(kv: Option[(String, Option[String])]) extends RunnableCommand with Logging

@SparkQA commented Jun 17, 2015

Test build #35047 has finished for PR 6747 at commit 7d09bad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SetCommand(kv: Option[(String, Option[String])]) extends RunnableCommand with Logging

@rxin (Contributor) commented Jun 18, 2015

Alright merging this. Thanks.

@asfgit asfgit closed this in 78a430e Jun 18, 2015
@zsxwing zsxwing deleted the sqlconf branch June 18, 2015 13:26
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
1. Add `SQLConfEntry` to store the information about a configuration. For those configurations that cannot be found in `sql-programming-guide.md`, I left the doc as `<TODO>`.
2. Verify the value when setting a configuration if this is in SQLConf.
3. Use `SET -v` to display all public configurations.

Author: zsxwing <zsxwing@gmail.com>

Closes apache#6747 from zsxwing/sqlconf and squashes the following commits:

7d09bad [zsxwing] Use SQLConfEntry in HiveContext
49f6213 [zsxwing] Add getConf, setConf to SQLContext and HiveContext
e014f53 [zsxwing] Merge branch 'master' into sqlconf
93dad8e [zsxwing] Fix the unit tests
cf950c1 [zsxwing] Fix the code style and tests
3c5f03e [zsxwing] Add unsetConf(SQLConfEntry) and fix the code style
a2f4add [zsxwing] getConf will return the default value if a config is not set
037b1db [zsxwing] Add schema to SetCommand
0520c3c [zsxwing] Merge branch 'master' into sqlconf
7afb0ec [zsxwing] Fix the configurations about HiveThriftServer
7e728e3 [zsxwing] Add doc for SQLConfEntry and fix 'toString'
5e95b10 [zsxwing] Add enumConf
c6ba76d [zsxwing] setRawString => setConfString, getRawString => getConfString
4abd807 [zsxwing] Fix the test for 'set -v'
6e47e56 [zsxwing] Fix the compilation error
8973ced [zsxwing] Remove floatConf
1fc3a8b [zsxwing] Remove the 'conf' command and use 'set -v' instead
99c9c16 [zsxwing] Fix tests that use SQLConfEntry as a string
88a03cc [zsxwing] Add new lines between confs and return types
ce7c6c8 [zsxwing] Remove seqConf
f3c1b33 [zsxwing] Refactor SQLConf to display better error message