[SPARK-12536] [SQL] Added "Empty Seq" in Explain Outputs For Empty LocalRelation and LocalTableScan #10494
Conversation
If we are going to do this at all, we should do it in `TreeNode`.
@marmbrus @cloud-fan Could you give a hint on how to do this in TreeNode?
It sounds like we cannot determine whether a node is empty from within TreeNode itself. To know that, I think we have to check the value of `data` or `rows`? Thank you!
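For the record, a minimal hypothetical sketch of a node-level check (the `LocalRelation` here is simplified and the string format is illustrative, not the actual patch):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Only the concrete leaf node knows whether its `data` is empty, so the
// "Empty Seq" marker would be emitted from its own string representation
// rather than from TreeNode generically.
case class LocalRelation(output: Seq[Attribute], data: Seq[InternalRow] = Nil)
  extends LeafNode {
  override def simpleString: String =
    if (data.isEmpty) s"LocalRelation (Empty Seq), [${output.mkString(",")}]"
    else s"LocalRelation [${output.mkString(",")}]"
}
```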
Thank you! @cloud-fan
Test build #48361 has finished for PR 10494 at commit
…gistration We use scalastyle:off to turn off style checks in certain places where it is not possible to follow the style guide. This is usually OK. However, in UDF registration, we disable the checker for a large amount of code simply because some of it exceeds the 100-char line limit. It is better to just disable the line length check rather than everything. In this pull request, I only disabled the line length check, and fixed a problem (lack of explicit types for public methods). Author: Reynold Xin <rxin@databricks.com> Closes apache#10501 from rxin/SPARK-12547.
…es in postgresql If a DataFrame has BYTE types, it throws an exception: org.postgresql.util.PSQLException: ERROR: type "byte" does not exist Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes apache#9350 from maropu/FixBugInPostgreJdbc.
Test build #48394 has finished for PR 10494 at commit
…umn as value
`ifelse`, `when`, and `otherwise` are unable to take a `Column`-typed S4 object as a value.
For example:
```r
ifelse(lit(1) == lit(1), lit(2), lit(3))
ifelse(df$mpg > 0, df$mpg, 0)
```
will both fail with
```r
attempt to replicate an object of type 'environment'
```
The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid the attempt to vectorize (i.e. `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, because `ifelse` in base R is vectorized, but I cannot foresee any scenario in which these functions would need to be vectorized in SparkR.
For reference, I added test cases which trigger the failures:
```r
. Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
error in evaluating the argument 'x' in selecting a method for function 'collect':
error in evaluating the argument 'col' in selecting a method for function 'select':
attempt to replicate an object of type 'environment'
Calls: when -> when -> ifelse -> ifelse
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(actual, expected, ...)
8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
Error: Test failures
Execution halted
```
Author: Forest Fang <forest.fang@outlook.com>
Closes apache#10481 from saurfang/spark-12526.
…from apache#1293 A compilation error was caused by string concatenations that are not constant. Use a raw string literal to avoid string concatenation. https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/ Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes apache#10488 from kiszk/SPARK-12530.
…rCreate * Changes api.r.SQLUtils to use ```SQLContext.getOrCreate``` instead of creating a new context. * Adds a simple test [SPARK-11199] #comment link with JIRA Author: Hossein <hossein@databricks.com> Closes apache#9185 from falaki/SPARK-11199.
…uced in / PR 10327 Sorry jkbradley Ref: apache#10327 (comment) Author: Sean Owen <sowen@cloudera.com> Closes apache#10508 from srowen/SPARK-12349.2.
…fication. In Spark we allow UDFs to declare their expected input types in order to apply type coercion. The expected-input-type parameter takes a Seq[DataType] and uses Nil when no type coercion is applied. It makes more sense to take Option[Seq[DataType]] instead, so we can differentiate a no-arg function from a function with no expected input types specified. Author: Reynold Xin <rxin@databricks.com> Closes apache#10504 from rxin/SPARK-12549.
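A hedged illustration of the distinction described above, using the types named in the description (variable names are mine):

```scala
import org.apache.spark.sql.types.{DataType, IntegerType}

// Some(Nil): a zero-argument UDF -- there are no inputs to coerce.
val zeroArg: Option[Seq[DataType]] = Some(Nil)
// Some(Seq(...)): coerce the single argument to IntegerType.
val oneTypedArg: Option[Seq[DataType]] = Some(Seq(IntegerType))
// None: the UDF declared no expected input types at all.
val unspecified: Option[Seq[DataType]] = None
```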
This is a WIP. The PR has been taken over from nongli (see apache#10420). I have removed some additional dead code, and fixed a few issues which were caused by the fact that the inlined Hive parser is newer than the Hive parser we currently use in Spark. I am submitting this PR in order to get some feedback and testing done. There is quite a bit of work to do: - [ ] Get it to pass the jenkins build/test. - [ ] Acknowledge the Hive project for using their parser. - [ ] Refactorings between HiveQl and the Java classes. - [ ] Create our own ASTNode and integrate the current implicit extensions. - [ ] Move remaining ```SemanticAnalyzer``` and ```ParseUtils``` functionality to ```HiveQl```. - [ ] Remove Hive dependencies from the parser. This will require some edits in the grammar files. - [ ] Introduce our own context which needs to contain a ```TokenRewriteStream```. - [ ] Add ```useSQL11ReservedKeywordsForIdentifier``` and ```allowQuotedId``` to the catalyst or sql configuration. - [ ] Remove ```HiveConf``` from the grammar files & HiveQl, and pass in our own configuration. - [ ] Move the parser into sql/core. cc nongli rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> Closes apache#10509 from hvanhovell/SPARK-12362.
apache#10441 broke the Streaming UI because of the new CSS style. [screen shot 2015-12-29 at 4 49 04 pm](https://cloud.githubusercontent.com/assets/1000778/12044763/1efce0fe-ae4c-11e5-9f8b-39df08426bf8.png) This PR just added a class for the new style and only applied it to the paged tables. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#10517 from zsxwing/fix-streaming-ui.
``` org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns text; ``` Let's put a `:` after `columns` and put the columns in `[]` so that they match the `toString` of DataFrame. Author: gatorsmile <gatorsmile@gmail.com> Closes apache#10518 from gatorsmile/improveAnalysisExceptionMsg.
This reverts commit b600bcc due to non-deterministic build breaks.
…K_WORKER_MEMORY without unit Updated the Worker unit IllegalStateException message to indicate that values less than 1MB (rather than 0) are not allowed, to help solve this. Requesting review Author: Neelesh Srinivas Salian <nsalian@cloudera.com> Closes apache#10483 from nssalian/SPARK-12263.
…Instance Most of the time we should propagate null when calling `NewInstance`, and so far there is only one case where we should stop null propagation: creating a product/Java bean. So I think it makes more sense to propagate null by default. This also fixes a bug when encoding a null array/map, which was first discovered in apache#10401 Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10443 from cloud-fan/encoder.
Current schema inference for local Python collections halts as soon as there are no NullTypes. This differs from when we specify a sampling ratio of 1.0 on a distributed collection, and could result in incomplete schema information. Author: Holden Karau <holden@us.ibm.com> Closes apache#10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
…r new pull requests This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether pull requests which modify the build have introduced new dependencies or changed transitive dependencies in a way that affects the final classpath. This supplants the checks added in SPARK-4123 / apache#5093, which are currently disabled due to bugs. This patch is based on pwendell's work in apache#8531. Closes apache#8531. Author: Josh Rosen <joshrosen@databricks.com> Author: Patrick Wendell <patrick@databricks.com> Closes apache#10461 from JoshRosen/SPARK-10359.
…ush-down filters for JDBC This is a rework of apache#10386 that adds more tests and LIKE push-down support. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes apache#10468 from maropu/SupportMorePushdownInJdbc.
…ith an unknown app Id I got an exception when accessing the below REST API with an unknown application Id. `http://<server-url>:18080/api/v1/applications/xxx/jobs` Instead of an exception, I expect an error message "no such app: xxx" which is a similar error message when I access `/api/v1/applications/xxx` ``` org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: no app with key xxx at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263) at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000) at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116) at org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226) at org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46) at org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66) ``` Author: Carson Wang <carson.wang@intel.com> Closes apache#10352 from carsonwang/unknownAppFix.
…-up (docs & tests) This PR is a follow-up for PR apache#9819. It adds documentation for the window functions and a couple of NULL tests. The documentation was largely based on the documentation in (the source of) Hive and Presto: * https://prestodb.io/docs/current/functions/window.html * https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics I am not sure if we need to add the licenses of these two projects to the licenses directory. They are both under the ASL. srowen any thoughts? cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes apache#10402 from hvanhovell/SPARK-8641-docs.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0. Author: Reynold Xin <rxin@databricks.com> Closes apache#10531 from rxin/SPARK-12588.
Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#10536 from vanzin/SPARK-3873-yarn.
There's one warning left, caused by a bug in the checker. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#10537 from vanzin/SPARK-3873-graphx.
It was research code and has been deprecated since 1.0.0. No one really uses it since they can just use event logging. Author: Reynold Xin <rxin@databricks.com> Closes apache#10530 from rxin/SPARK-12561.
Closes apache#5358 Closes apache#3744 Closes apache#3677 Closes apache#3536 Closes apache#3249 Closes apache#3221 Closes apache#2446 Closes apache#3794 Closes apache#3815 Closes apache#3816 Closes apache#3866 Closes apache#4286 Closes apache#5184 Closes apache#5170 Closes apache#5142 Closes apache#5025 Closes apache#5005 Closes apache#4897 Closes apache#4887 Closes apache#4849 Closes apache#4632 Closes apache#4622 Closes apache#4456 Closes apache#4449 Closes apache#4417 Closes apache#5483 Closes apache#5325 Closes apache#6545 Closes apache#6449 Closes apache#6433 Closes apache#6416 Closes apache#6403 Closes apache#6386 Closes apache#6263 Closes apache#6245 Closes apache#6213 Closes apache#6155 Closes apache#6133 Closes apache#6018 Closes apache#5978 Closes apache#5869 Closes apache#5852 Closes apache#5848 Closes apache#5754 Closes apache#5598 Closes apache#5503 Closes apache#4380
…egistrator which are deprecated and no longer used The whole of Vector.scala, VectorSuite.scala and GraphKryoRegistrator.scala is no longer used, so it's time to remove them in Spark 2.0. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes apache#10613 from sarutak/SPARK-12665.
…t with exactly the same grouping expressi For queries like: select <> from table group by a distribute by a we can eliminate the distribute by, since the group by will do a hash partitioning anyway. This is also applicable when the user uses the DataFrame API, as sketched below. Author: Yash Datta <Yash.Datta@guavus.com> Closes apache#9858 from saucam/eliminatedistribute.
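A hedged DataFrame-API example of the redundancy being eliminated (assuming a Spark 1.6-era `SQLContext` named `sqlContext`):

```scala
val df = sqlContext.range(100).withColumnRenamed("id", "a")

// The explicit repartitioning (DISTRIBUTE BY) is redundant here: grouping by
// the same expression already hash-partitions the data by `a`.
val redundant = df.repartition(df("a")).groupBy(df("a")).count()
// With this change it is planned the same as:
val direct = df.groupBy(df("a")).count()
```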
From JIRA: Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method. A possible fix would be to include a method "_checkType" in PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available. This fix instead checks the types at set time since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float, and other conversions (like scipy matrix to array) are left for the future. Author: Holden Karau <holden@us.ibm.com> Closes apache#9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
PySpark SparseVector should have "Found duplicate indices" error message Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes apache#9525 from rekhajoshm/SPARK-11531.
… spark.ml Add ```computeCost``` to ```KMeansModel``` as evaluator for PySpark spark.ml. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#9931 from yanboliang/SPARK-11945.
…reeRegressor should support setSeed PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do at Scala side. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#9807 from yanboliang/spark-11815.
This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made: The ANTLR Parser & supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project; I have added acknowledgements wherever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean up the ```ASTNode``` class, and to improve the error handling. The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project: - ```CatalystQl```: This implements query and expression parsing functionality. - ```SparkQl```: This is a subclass of ```CatalystQl``` and provides SQL/Core-only functionality such as Explain and Describe. - ```HiveQl```: This is a subclass of ```SparkQl``` and adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive. cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes apache#10583 from hvanhovell/SPARK-12575.
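Schematically, the layering described above (a sketch of the hierarchy only, not the real signatures):

```scala
// CatalystQl: query & expression parsing, lives in catalyst.
class CatalystQl
// SparkQl: adds SQL/Core-only commands such as EXPLAIN and DESCRIBE.
class SparkQl extends CatalystQl
// HiveQl: adds Hive-only features (Analyze, views, CTAS, transforms);
// still depends on Hive.
class HiveQl extends SparkQl
```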
If the initial model passed to GMM is not empty, it causes `net.razorvine.pickle.PickleException`. This can be fixed by converting `initialModel.weights` to a `list`. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes apache#9986 from zero323/SPARK-12006.
…ator' metricName For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areadUnderROC". Also, in the documentation, it is said that: "The default metric used to choose the best ParamMap can be overriden by the setMetric method in each of these evaluators." However, the method is called setMetricName. This PR aims to fix both issues. Author: BenFradet <benjamin.fradet@gmail.com> Closes apache#10328 from BenFradet/SPARK-12368.
Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#10621 from zsxwing/SPARK-12617-2.
…lt root path to gain the streaming batch url. Author: huangzhaowei <carlmartinmax@gmail.com> Closes apache#10617 from SaintBacchus/SPARK-12672.
…of default root path to gain the streaming batch url." This reverts commit 19e4e9f. Will merge apache#10618 instead.
To avoid generating a huge Java source (over 64K lines of code) that can't be compiled. cc hvanhovell Author: Davies Liu <davies@databricks.com> Closes apache#10624 from davies/split_ident.
This PR adds bucket write support to Spark SQL. User can specify bucketing columns, numBuckets and sorting columns with or without partition columns. For example:
```
df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales")
```
When bucketing is used, we will calculate a bucket id for each record, and group the records by bucket id. For each group, we will create a file with the bucket id in its name, and write the data into it. For each bucket file, if sorting columns are specified, the data will be sorted before writing.
Note that there may be multiple files for one bucket, as the data is distributed.
Currently we store the bucket metadata in the Hive metastore in a non-Hive-compatible way. We use a different bucketing hash function than Hive, so we can't be compatible anyway.
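A minimal sketch of the bucket-id computation described above (illustrative only; `hashOfBucketColumns` is a hypothetical stand-in, and Spark's actual hash function deliberately differs from Hive's):

```scala
// Map a record's bucketing-column hash to a non-negative bucket id
// in [0, numBuckets).
def bucketIdFor(hashOfBucketColumns: Int, numBuckets: Int): Int = {
  val mod = hashOfBucketColumns % numBuckets
  if (mod < 0) mod + numBuckets else mod
}
```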
Limitations:
* Can't write bucketed data without a Hive metastore.
* Can't insert bucketed data into existing Hive tables.
Author: Wenchen Fan <wenchen@databricks.com>
Closes apache#10498 from cloud-fan/bucket-write.
…ala Long not Java Change the Java countByKey and countApproxDistinctByKey return types to use Java Long, not Scala; update similar methods for consistency on java.lang.Long.valueOf with no API change. Author: Sean Owen <sowen@cloudera.com> Closes apache#10554 from srowen/SPARK-12604.
…uet scan benchmarks. [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do this. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes apache#10589 from nongli/spark-12640.
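A hedged usage sketch of such a benchmarking utility (the import path, constructor, and method names are assumptions based on the description, not guaranteed to match the merged API):

```scala
import org.apache.spark.util.Benchmark  // assumed location of the utility

// Register one or more measured cases, then run them and print timings.
val benchmark = new Benchmark("Parquet scan", valuesPerIteration = 1024 * 1024)
benchmark.addCase("read ints") { iteration =>
  // ... run the scan being measured ...
}
benchmark.run()
```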
…bSVMFile This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663). For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html) Author: Robert Dodier <robert_dodier@users.sourceforge.net> Closes apache#10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code. Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard-to-diagnose bugs. For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10534 from JoshRosen/remove-ttl-based-cleaning.
Otherwise the URL will fail to proxy to the right one in YARN mode. Here is the screenshot: [screenshot] Author: jerryshao <sshao@hortonworks.com> Closes apache#10618 from jerryshao/SPARK-12673.
MapPartitionsRDD was keeping a reference to `prev` after a call to `clearDependencies`, which could lead to a memory leak. Author: Guillaume Poulin <poulin.guillaume@gmail.com> Closes apache#10623 from gpoulin/map_partition_deps.
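A simplified sketch of the fix pattern (the real `MapPartitionsRDD` carries more constructor parameters):

```scala
package org.apache.spark.rdd

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}

// Simplified sketch: drop the reference to the parent RDD once dependencies
// are cleared, so the old lineage can be garbage-collected.
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: Iterator[T] => Iterator[U])
  extends RDD[U](prev) {

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(firstParent[T].iterator(split, context))

  override def clearDependencies(): Unit = {
    super.clearDependencies()
    prev = null  // the fix: stop retaining `prev` after cleanup
  }
}
```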
…not None" This reverts commit fcd013c. Author: Yin Huai <yhuai@databricks.com> Closes apache#10632 from yhuai/pythonStyle.
Modify the 'spark.memory.offHeap.enabled' default value to false. Author: zzcclp <xm_zzc@sina.com> Closes apache#10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
This PR manages the memory used by window functions (buffered rows) and also enables external spilling. After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G of memory. Author: Davies Liu <davies@databricks.com> Closes apache#10605 from davies/unsafe_window.
Parse SQL queries with except/intersect in the FROM clause for HiveQL. Author: Davies Liu <davies@databricks.com> Closes apache#10622 from davies/intersect.
Author: Jacek Laskowski <jacek@japila.pl> Closes apache#10603 from jaceklaskowski/streaming-actor-custom-receiver.
Let me close it. I will open another one to fix the issue. Thanks!
The filter `Filter (False)` generates an empty `LocalRelation` in `SimplifyFilters`. In the current `Explain` output, the optimized and physical plans for such a query look confusing: the empty `LocalRelation`/`LocalTableScan` prints exactly like a non-empty one.
After the fix, the Optimized Logical Plan and Physical Plan are changed to mark the empty relation with "Empty Seq".
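A hedged reconstruction of the behavior described (the plan strings are illustrative, not verbatim output), assuming a Spark 1.6-era `SQLContext` named `sqlContext`:

```scala
// A constant-false filter is folded by the SimplifyFilters optimizer rule
// into an empty LocalRelation, which executes as a LocalTableScan.
val df = sqlContext.range(5).filter("1 = 0")
df.explain(true)
// Before the fix the empty relation printed like a non-empty one, e.g.
//   LocalTableScan [id#0L]
// whereas after the fix it would be tagged, e.g.
//   LocalTableScan [id#0L], Empty Seq
```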