Opened by accident #17837
Closed
…treaming Programming Guide ## What changes were proposed in this pull request? Currently some code snippets in the programming guide just do not compile. We should fix them. ## How was this patch tested? ``` SKIP_API=1 jekyll build ``` ## Screenshot from part of the change:  Author: Liwei Lin <lwlin7@gmail.com> Closes apache#16442 from lw-lin/ss-pro-guide-.
…alog ### What changes were proposed in this pull request? Fixed non-thread-safe functions used in SessionCatalog: - refreshTable - lookupRelation ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes apache#16437 from gatorsmile/addSyncToLookUpTable. (cherry picked from commit 35e9740) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…rtitioned Tables in InMemoryCatalog ### What changes were proposed in this pull request? The data in a managed table should be deleted after the table is dropped. However, if the partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for a partition when adding it. This PR deletes the partition location when dropping managed partitioned tables stored in `InMemoryCatalog`. ### How was this patch tested? Added test cases for both HiveExternalCatalog and InMemoryCatalog Author: gatorsmile <gatorsmile@gmail.com> Closes apache#16448 from gatorsmile/unsetSerdeProp. (cherry picked from commit b67b35f) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…lType should find a common type with `typeSoFar`
## What changes were proposed in this pull request?
CSV type inferencing causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.
**decimal.csv**
```
9.03E+12
1.19E+11
```
**BEFORE**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
|-- _c0: decimal(3,-9) (nullable = true)
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
```
**AFTER**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
|-- _c0: decimal(4,-9) (nullable = true)
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
+---------+
| _c0|
+---------+
|9.030E+12|
| 1.19E+11|
+---------+
```
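The widening step can be sketched in plain Scala. This is a minimal illustration, not Spark's `findTightestCommonType`: the `Dec` case class and the `commonType` rule (keep the larger scale and the larger count of integral digits) are assumptions made for this example.

```scala
import java.math.{BigDecimal => JBigDecimal}

object DecimalCommonTypeSketch {
  // Hypothetical stand-in for a decimal type: total digits and scale.
  final case class Dec(precision: Int, scale: Int)

  // Narrowest decimal type that holds a single literal.
  def infer(text: String): Dec = {
    val d = new JBigDecimal(text)
    Dec(d.precision, d.scale)
  }

  // Assumed widening rule: keep the larger scale and the larger
  // number of integral digits, then recompute the precision.
  def commonType(a: Dec, b: Dec): Dec = {
    val scale = math.max(a.scale, b.scale)
    val range = math.max(a.precision - a.scale, b.precision - b.scale)
    Dec(range + scale, scale)
  }

  def main(args: Array[String]): Unit = {
    val types = Seq("9.03E+12", "1.19E+11").map(infer)
    println(types.last)               // the type "use the last one" would pick
    println(types.reduce(commonType)) // a type wide enough for every row
  }
}
```

Under these assumptions, the two rows of `decimal.csv` infer to precision 3 with scales -10 and -9; taking the last type gives `(3,-9)`, while merging them gives `(4,-9)`, which agrees with the BEFORE and AFTER schemas above.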
## How was this patch tested?
Pass the newly added test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes apache#16463 from dongjoon-hyun/SPARK-18877-BACKPORT-21.
## What changes were proposed in this pull request? The HistoryServer's ACLs are currently derived from the application event log, so newly changed ACLs cannot be applied to old data. This becomes a problem when a newly added admin cannot access the history UI of old applications; only new applications are affected. This PR proposes adding admin ACLs for the history server: any configured user/group gets view access to all applications, while the view ACLs derived from application run-time still take effect. ## How was this patch tested? Unit test added. Author: jerryshao <sshao@hortonworks.com> Closes apache#16470 from jerryshao/SPARK-19033. (cherry picked from commit 4a4c3dc) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
…uotes JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19083# The sbin/start-history-server.sh script uses $ without quotes, which affects the length of the args passed to HistoryServerArguments::parse(args: List[String]) Author: zuotingbing <zuo.tingbing9@zte.com.cn> Closes apache#16484 from zuotingbing/sh. (cherry picked from commit a9a1373) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
…e for update mode and source/sink options ## What changes were proposed in this pull request? Updates - Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure. - Updated Output Modes section with Update mode - Added options for all the sources and sinks --------------------------- ---------------------------  --------------------------- --------------------------- <img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png"> <img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png"> --------------------------- ---------------------------    Author: Tathagata Das <tathagata.das1565@gmail.com> Closes apache#16468 from tdas/SPARK-19074. (cherry picked from commit b59cdda) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
…or for original and loaded model
## What changes were proposed in this pull request?
While adding the DistributedLDAModel training summary for SparkR, I found that the logPrior of the original model and that of the loaded model differ.
For example, in the test("read/write DistributedLDAModel"), I add the test:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573
The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex per partition instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns a different `logPrior`.
Please refer to apache#16464 for details.
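The effect can be reproduced without Spark. The sketch below is an assumption-laden model: the local `aggregate` helper only mimics the shape of `RDD.aggregate` (fold each partition with `seqOp`, then merge with `combOp`), and `buggySeqOp` is a hypothetical stand-in for a `seqOp` that drops the running accumulator.

```scala
object AggregateSeqOpSketch {
  // Mimics the shape of RDD.aggregate: fold every partition with seqOp,
  // then merge the per-partition results with combOp.
  def aggregate[A, B](partitions: Seq[Seq[A]], zero: B)(
      seqOp: (B, A) => B, combOp: (B, B) => B): B =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  // Buggy: ignores the accumulator, so each partition yields only its last value.
  val buggySeqOp: (Double, Double) => Double = (_, value) => value

  // Fixed: actually accumulates.
  val fixedSeqOp: (Double, Double) => Double = (sum, value) => sum + value

  // The same four values, laid out in two different partition orders.
  val layoutA: Seq[Seq[Double]] = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))
  val layoutB: Seq[Seq[Double]] = Seq(Seq(4.0, 1.0), Seq(2.0, 3.0))

  def main(args: Array[String]): Unit = {
    println(aggregate(layoutA, 0.0)(buggySeqOp, _ + _)) // depends on the layout
    println(aggregate(layoutB, 0.0)(buggySeqOp, _ + _)) // depends on the layout
    println(aggregate(layoutA, 0.0)(fixedSeqOp, _ + _)) // layout-independent
    println(aggregate(layoutB, 0.0)(fixedSeqOp, _ + _)) // layout-independent
  }
}
```

With the buggy `seqOp` the result changes whenever the element order inside a partition changes, which is the kind of mismatch the failing assertion above reports.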
## How was this patch tested?
Add a new unit test for testing logPrior.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes apache#16491 from wangmiao1981/ldabug.
(cherry picked from commit 036b503)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
configuration.html section headings were not specified correctly in markdown, so they were not being recognized and rendered correctly. Removed extra p tags and pulled level 4 titles up to level 3, since level 3 had been skipped. This improves the TOC. Doc build, manual check. Author: Sean Owen <sowen@cloudera.com> Closes apache#16490 from srowen/SPARK-19106. (cherry picked from commit 54138f6) Signed-off-by: Sean Owen <sowen@cloudera.com>
…ABLE` with `LOCATION` ## What changes were proposed in this pull request? This PR adds a new behavior change description on `CREATE TABLE ... LOCATION` at `sql-programming-guide.md` clearly under `Upgrading From Spark SQL 1.6 to 2.0`. This change is introduced at Apache Spark 2.0.0 as [SPARK-15276](https://issues.apache.org/jira/browse/SPARK-15276). ## How was this patch tested? ``` SKIP_API=1 jekyll build ``` **Newly Added Description** <img width="913" alt="new" src="https://cloud.githubusercontent.com/assets/9700541/21743606/7efe2b12-d4ba-11e6-8a0d-551222718ea2.png"> Author: Dongjoon Hyun <dongjoon@apache.org> Closes apache#16400 from dongjoon-hyun/SPARK-18941. (cherry picked from commit 923e594) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request? - [X] Fix inconsistencies in function reference for dense rank and dense - [X] Make all languages equivalent in their reference to `dense_rank` and `rank`. ## How was this patch tested? N/A for docs. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Closes apache#16505 from anabranch/SPARK-19127. (cherry picked from commit 1f6ded6) Signed-off-by: Reynold Xin <rxin@databricks.com>
## What changes were proposed in this pull request? - [X] Make sure all join types are clearly mentioned - [X] Make join labeling/style consistent - [X] Make join label ordering docs the same - [X] Improve join documentation according to above for Scala - [X] Improve join documentation according to above for Python - [X] Improve join documentation according to above for R ## How was this patch tested? No tests b/c docs. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Closes apache#16504 from anabranch/SPARK-19126. (cherry picked from commit 19d9d4c) Signed-off-by: Felix Cheung <felixcheung@apache.org>
## What changes were proposed in this pull request? backport to 2.1 Author: Felix Cheung <felixcheung_m@hotmail.com> Closes apache#16507 from felixcheung/portsparkuir21.
… for aggregations ## What changes were proposed in this pull request? Backport for apache#16361 to 2.1 branch. ## How was this patch tested? Unit tests Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#16518 from brkyvz/reg-break-2.1.
## What changes were proposed in this pull request?
Prior to this patch, we'll generate `compare(...)` for `GeneratedClass$SpecificOrdering` like below, leading to Janino exceptions saying the code grows beyond 64 KB.
``` scala
/* 005 */ class SpecificOrdering extends o.a.s.sql.catalyst.expressions.codegen.BaseOrdering {
/* ..... */ ...
/* 10969 */ private int compare(InternalRow a, InternalRow b) {
/* 10970 */ InternalRow i = null; // Holds current row being evaluated.
/* 10971 */
/* 1.... */ code for comparing field0
/* 1.... */ code for comparing field1
/* 1.... */ ...
/* 1.... */ code for comparing field449
/* 15012 */
/* 15013 */ return 0;
/* 15014 */ }
/* 15015 */ }
```
This patch would break `compare(...)` into smaller `compare_xxx(...)` methods when necessary; then we'll get generated `compare(...)` like:
``` scala
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */ return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends o.a.s.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */ ...
/* 1.... */
/* 11290 */ private int compare_0(InternalRow a, InternalRow b) {
/* 11291 */ InternalRow i = null; // Holds current row being evaluated.
/* 11292 */
/* 11293 */ i = a;
/* 11294 */ boolean isNullA;
/* 11295 */ UTF8String primitiveA;
/* 11296 */ {
/* 11297 */
/* 11298 */ Object obj = ((Expression) references[0]).eval(null);
/* 11299 */ UTF8String value = (UTF8String) obj;
/* 11300 */ isNullA = false;
/* 11301 */ primitiveA = value;
/* 11302 */ }
/* 11303 */ i = b;
/* 11304 */ boolean isNullB;
/* 11305 */ UTF8String primitiveB;
/* 11306 */ {
/* 11307 */
/* 11308 */ Object obj = ((Expression) references[0]).eval(null);
/* 11309 */ UTF8String value = (UTF8String) obj;
/* 11310 */ isNullB = false;
/* 11311 */ primitiveB = value;
/* 11312 */ }
/* 11313 */ if (isNullA && isNullB) {
/* 11314 */ // Nothing
/* 11315 */ } else if (isNullA) {
/* 11316 */ return -1;
/* 11317 */ } else if (isNullB) {
/* 11318 */ return 1;
/* 11319 */ } else {
/* 11320 */ int comp = primitiveA.compare(primitiveB);
/* 11321 */ if (comp != 0) {
/* 11322 */ return comp;
/* 11323 */ }
/* 11324 */ }
/* 11325 */
/* 11326 */
/* 11327 */ i = a;
/* 11328 */ boolean isNullA1;
/* 11329 */ UTF8String primitiveA1;
/* 11330 */ {
/* 11331 */
/* 11332 */ Object obj1 = ((Expression) references[1]).eval(null);
/* 11333 */ UTF8String value1 = (UTF8String) obj1;
/* 11334 */ isNullA1 = false;
/* 11335 */ primitiveA1 = value1;
/* 11336 */ }
/* 11337 */ i = b;
/* 11338 */ boolean isNullB1;
/* 11339 */ UTF8String primitiveB1;
/* 11340 */ {
/* 11341 */
/* 11342 */ Object obj1 = ((Expression) references[1]).eval(null);
/* 11343 */ UTF8String value1 = (UTF8String) obj1;
/* 11344 */ isNullB1 = false;
/* 11345 */ primitiveB1 = value1;
/* 11346 */ }
/* 11347 */ if (isNullA1 && isNullB1) {
/* 11348 */ // Nothing
/* 11349 */ } else if (isNullA1) {
/* 11350 */ return -1;
/* 11351 */ } else if (isNullB1) {
/* 11352 */ return 1;
/* 11353 */ } else {
/* 11354 */ int comp = primitiveA1.compare(primitiveB1);
/* 11355 */ if (comp != 0) {
/* 11356 */ return comp;
/* 11357 */ }
/* 11358 */ }
/* 1.... */
/* 1.... */ ...
/* 1.... */
/* 12652 */ return 0;
/* 12653 */ }
/* 1.... */
/* 1.... */ ...
/* 15387 */
/* 15388 */ public int compare(InternalRow a, InternalRow b) {
/* 15389 */
/* 15390 */ int comp_0 = compare_0(a, b);
/* 15391 */ if (comp_0 != 0) {
/* 15392 */ return comp_0;
/* 15393 */ }
/* 15394 */
/* 15395 */ int comp_1 = compare_1(a, b);
/* 15396 */ if (comp_1 != 0) {
/* 15397 */ return comp_1;
/* 15398 */ }
/* 1.... */
/* 1.... */ ...
/* 1.... */
/* 15450 */ return 0;
/* 15451 */ }
/* 15452 */ }
```
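The splitting idea can be sketched as a small source generator. This is not the `splitExpression()` refactoring from the patch; `generate`, `fieldSnippets`, and `chunkSize` are hypothetical names, and the emitted Java is only text, it is never compiled here.

```scala
object SplitCompareSketch {
  // Hypothetical generator: `fieldSnippets` holds one Java snippet per sort field.
  def generate(fieldSnippets: Seq[String], chunkSize: Int): String = {
    require(chunkSize > 0, "chunkSize must be positive")
    val chunks = fieldSnippets.grouped(chunkSize).toList

    // One helper method per chunk, so no single method body grows without bound.
    val helpers = chunks.zipWithIndex.map { case (chunk, idx) =>
      val body = chunk.map("  " + _).mkString("\n")
      s"private int compare_${idx}(InternalRow a, InternalRow b) {\n${body}\n  return 0;\n}"
    }

    // The entry point calls the helpers in order and stops at the first non-zero result.
    val calls = chunks.indices.map { idx =>
      s"  int comp_${idx} = compare_${idx}(a, b);\n  if (comp_${idx} != 0) {\n    return comp_${idx};\n  }"
    }
    val dispatcher =
      s"public int compare(InternalRow a, InternalRow b) {\n${calls.mkString("\n")}\n  return 0;\n}"

    (helpers :+ dispatcher).mkString("\n\n")
  }

  def main(args: Array[String]): Unit = {
    val snippets = (0 until 5).map(i => s"// code for comparing field$i")
    println(generate(snippets, chunkSize = 2))
  }
}
```

A real generator would presumably pick the chunk boundary from the accumulated code size rather than a fixed snippet count; the fixed `chunkSize` here just keeps the example short.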
## How was this patch tested?
- a newly added test case which
- would fail prior to this patch
- would pass with this patch
- ordering correctness should already be covered by existing tests like those in `OrderingSuite`
## Acknowledgement
A major part of this PR - the refactoring work of `splitExpression()` - has been done by ueshin.
Author: Liwei Lin <lwlin7@gmail.com>
Author: Takuya UESHIN <ueshin@happy-camper.st>
Author: Takuya Ueshin <ueshin@happy-camper.st>
Closes apache#15480 from lw-lin/spec-ordering-64k-.
(cherry picked from commit acfc5f3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…orrectly
## What changes were proposed in this pull request?
`DataStreamReaderWriterSuite` creates test files in the source folder like the following. Interestingly, the root cause is that `withSQLConf` fails to reset `OptionalConfigEntry` correctly. In other words, it resets the config to `Some(undefined)`.
```bash
$ git status
Untracked files:
(use "git add <file>..." to include in what will be committed)
sql/core/%253Cundefined%253E/
sql/core/%3Cundefined%3E/
```
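One way such a reset can go wrong is sketched below. The exact mechanism inside `withSQLConf` is not shown in this message, so the example assumes a plain mutable map as the config store and a `"<undefined>"` placeholder for unset keys; `withConfBuggy` and `withConfFixed` are hypothetical helpers.

```scala
import scala.collection.mutable

object WithConfSketch {
  // Hypothetical config store; an absent key means "not set".
  val conf: mutable.Map[String, String] = mutable.Map.empty

  // Buggy restore: reads an unset key through a placeholder default
  // and then writes that placeholder back as if it were a real value.
  def withConfBuggy(key: String, value: String)(body: => Unit): Unit = {
    val previous = conf.getOrElse(key, "<undefined>")
    conf(key) = value
    try body finally { conf(key) = previous }
  }

  // Restore that keeps "unset" distinct from "set": unset keys are removed again.
  def withConfFixed(key: String, value: String)(body: => Unit): Unit = {
    val previous = conf.get(key)
    conf(key) = value
    try body finally {
      previous match {
        case Some(old) => conf(key) = old
        case None      => conf.remove(key)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    withConfBuggy("checkpointDir", "/tmp/ckpt") { () }
    println(conf.get("checkpointDir")) // the placeholder leaked into the config
    conf.clear()
    withConfFixed("checkpointDir", "/tmp/ckpt") { () }
    println(conf.get("checkpointDir")) // the key is unset again
  }
}
```

Any later code that treats the leaked placeholder as a path would create a directory with that literal name, which is consistent with the untracked directories listed above.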
## How was this patch tested?
Manual.
```
build/sbt "project sql" test
git status
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes apache#16522 from dongjoon-hyun/SPARK-19137.
(cherry picked from commit d5b1dc9)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…ed to ensure catching fatal errors during query initialization ## What changes were proposed in this pull request? StreamTest sets `UncaughtExceptionHandler` after starting the query now. It may not be able to catch fatal errors during query initialization. This PR uses `onQueryStarted` callback to fix it. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#16492 from zsxwing/SPARK-19113.
## What changes were proposed in this pull request? Updates to libthrift 0.9.3 to address a CVE. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes apache#16530 from srowen/SPARK-18997. (cherry picked from commit 856bae6) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
…ries ## What changes were proposed in this pull request? This PR allows update mode for non-aggregation streaming queries. It behaves the same as append mode if a query has no aggregations. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#16520 from zsxwing/update-without-agg. (cherry picked from commit bc6c56e) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
…m family supported ## What changes were proposed in this pull request? backporting to 2.1, 2.0 and 1.6 ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes apache#16532 from felixcheung/rgammabackport.
## What changes were proposed in this pull request? ``` df$foo <- 1 ``` instead of ``` df$foo <- lit(1) ``` ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes apache#16510 from felixcheung/rlitcol. (cherry picked from commit d749c06) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
… e1071 package. ## What changes were proposed in this pull request? The ```ml.R``` example depends on the ```e1071``` package; if it is not available in the user's environment, the example fails. I think the example should not depend on third-party packages, so I updated it to remove the dependency. ## How was this patch tested? Manual test. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#16548 from yanboliang/spark-19158. (cherry picked from commit 2c586f5) Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
…lect` in Thrift Server ## What changes were proposed in this pull request? To support `FETCH_FIRST`, SPARK-16563 used Scala `Iterator.duplicate`. However, Scala `Iterator.duplicate` uses a **queue to buffer all items between both iterators**, this causes GC and hangs for queries with large number of rows. We should not use this, especially for `spark.sql.thriftServer.incrementalCollect`. https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/Iterator.scala#L1262-L1300 ## How was this patch tested? Pass the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes apache#16440 from dongjoon-hyun/SPARK-18857. (cherry picked from commit a2c6adc) Signed-off-by: Sean Owen <sowen@cloudera.com>
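The buffering behaviour of `Iterator.duplicate` can be observed with a small counter. This is a sketch on a five-element iterator, with the hypothetical helper `run`; it is not the Thrift Server code path.

```scala
object DuplicateBufferSketch {
  // Returns (how many times the source produced an element,
  //          items seen by the first copy, items seen by the second copy).
  def run(n: Int): (Int, List[Int], List[Int]) = {
    var produced = 0
    val source = Iterator.tabulate(n) { i => produced += 1; i }
    val (ahead, behind) = source.duplicate
    val first = ahead.toList   // drains the underlying source completely
    val second = behind.toList // served from the buffer kept between the two copies
    (produced, first, second)
  }

  def main(args: Array[String]): Unit = {
    val (produced, first, second) = run(5)
    println(s"produced=$produced first=$first second=$second")
  }
}
```

The source produces each element only once, yet the second copy still sees every element, so everything the first copy consumed had to be held in memory until the second copy caught up. With a result set of many rows, that buffer is the cost described above.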
## What changes were proposed in this pull request? Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it. close apache#16379 There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue. ## How was this patch tested? a new test suite Author: Wenchen Fan <wenchen@databricks.com> Closes apache#16404 from cloud-fan/groupby. (cherry picked from commit 871d266) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…kContext is stopped ## What changes were proposed in this pull request? During SparkSession initialization, we store the created SparkSession instance in a class variable, _instantiatedContext. The next time, SparkSession.builder.getOrCreate() retrieves the existing SparkSession instance. However, when the active SparkContext is stopped and we create a new SparkContext, the existing SparkSession is still associated with the stopped SparkContext, so operations on this existing SparkSession fail. We need to detect this case in SparkSession and renew the class variable _instantiatedContext if needed. ## How was this patch tested? New test added in PySpark. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#16454 from viirya/fix-pyspark-sparksession. (cherry picked from commit c6c37b8) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Pivoting adds backticks (e.g. 3_count(\`c\`)) in column names and, in some cases,
this causes analysis exceptions like:
```
scala> val df = Seq((2, 3, 4), (3, 4, 5)).toDF("a", "x", "y")
scala> df.groupBy("a").pivot("x").agg(count("y"), avg("y")).na.fill(0)
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`y`)`;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:134)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:144)
...
```
So, this PR proposes to remove these backticks from column names.
## How was this patch tested?
Added a test in `DataFrameAggregateSuite`.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes apache#14812 from maropu/SPARK-17237.
(cherry picked from commit 5585ed9)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Otherwise the opening parenthesis isn't closed in query plan descriptions of batch scans.
PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ...
Author: Andrew Ash <andrew@andrewash.com>
Closes apache#16558 from ash211/patch-9.
(cherry picked from commit b040cef)
Signed-off-by: Reynold Xin <rxin@databricks.com>
…rame on a new SQLContext object fails with a Derby error Change is for SQLContext to reuse the active SparkSession during construction if the sparkContext supplied is the same as the currently active SparkContext. Without this change, a new SparkSession is instantiated that results in a Derby error when attempting to create a dataframe using a new SQLContext object even though the SparkContext supplied to the new SQLContext is same as the currently active one. Refer https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro. Existing unit tests and a new unit test added to pyspark-sql: /python/run-tests --python-executables=python --modules=pyspark-sql Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Vinayak <vijoshi5@in.ibm.com> Author: Vinayak Joshi <vijoshi@users.noreply.github.com> Closes apache#16119 from vijoshi/SPARK-18687_master. (cherry picked from commit 285a779) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…rn null
## What changes were proposed in this pull request?
When we convert a string to an integral type, we first convert the string to `decimal(20, 0)`, so that a string in decimal format is truncated to an integral value, e.g. `CAST('1.2' AS int)` returns `1`.
However, this causes problems when we convert a string holding a large number to an integral type, e.g. `CAST('1234567890123' AS int)` returns `1912276171`, while Hive returns null, which is what we expect.
This is a long-standing bug (it seems to have been there since the first day Spark SQL was created). This PR fixes it by adding native support for converting `UTF8String` to integral types.
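The wrap-around and an overflow-aware alternative can be sketched in plain Scala. This is not the `UTF8String` conversion added by the patch: `wrappingCast` and `checkedCast` are hypothetical helpers, and `None` stands in for SQL `null`.

```scala
object StringToIntSketch {
  // Goes through a wider numeric type and then narrows to 32 bits,
  // so values outside the Int range silently wrap around.
  def wrappingCast(s: String): Int = BigDecimal(s).toLong.toInt

  // Parses the integral digits directly and gives up (None) on overflow
  // or malformed input; a fractional part is validated and then dropped.
  def checkedCast(s: String): Option[Int] = {
    def isAsciiDigit(c: Char): Boolean = c >= '0' && c <= '9'
    val trimmed = s.trim
    val negative = trimmed.startsWith("-")
    val unsigned = if (negative || trimmed.startsWith("+")) trimmed.drop(1) else trimmed
    val intPart = unsigned.takeWhile(_ != '.')
    val fracPart = unsigned.drop(intPart.length + 1)
    if (intPart.isEmpty || !intPart.forall(isAsciiDigit) || !fracPart.forall(isAsciiDigit)) {
      None
    } else {
      // The magnitude may reach 2^31 only for negative values (Int.MinValue).
      val limit = if (negative) -Int.MinValue.toLong else Int.MaxValue.toLong
      var acc = 0L
      var ok = true
      val digits = intPart.iterator
      while (ok && digits.hasNext) {
        acc = acc * 10 + (digits.next() - '0')
        if (acc > limit) ok = false
      }
      if (ok) Some((if (negative) -acc else acc).toInt) else None
    }
  }

  def main(args: Array[String]): Unit = {
    println(wrappingCast("1234567890123")) // wrapped value, not an error
    println(checkedCast("1234567890123"))  // overflow detected
    println(checkedCast("1.2"))            // fractional part truncated
  }
}
```

The bound check runs after every digit, so the accumulator never gets far enough past the limit to overflow the `Long` it is kept in.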
## How was this patch tested?
new regression tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes apache#16550 from cloud-fan/string-to-int.
(cherry picked from commit 6b34e74)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request? To allow specifying number of partitions when the DataFrame is created ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes apache#16512 from felixcheung/rnumpart. (cherry picked from commit b0e8eb6) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
## What changes were proposed in this pull request? fix typo ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes apache#17663 from felixcheung/likedoctypo. (cherry picked from commit b0a1e93) Signed-off-by: Felix Cheung <felixcheung@apache.org>
…optimization that can lead to NPE Avoid unnecessary execution that can lead to NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown ## What changes were proposed in this pull request? Change leftHasNonNullPredicate and rightHasNonNullPredicate to lazy so they are only executed when needed. ## How was this patch tested? Added test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in the catalyst project would be better, but I am unsure how to do this. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Koert Kuipers <koert@tresata.com> Closes apache#17660 from koertkuipers/feat-catch-npe-in-eliminate-outer-join. (cherry picked from commit 608bf30) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
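The difference between an eager `val` and a `lazy val` is easy to show in isolation. In this sketch `mayThrow` is a hypothetical stand-in for a predicate check that can hit a null; it is not the `EliminateOuterJoin` code.

```scala
object LazyPredicateSketch {
  // Hypothetical check that throws a NullPointerException on null input.
  def mayThrow(input: String): Boolean = input.length > 0

  // Eager: the check always runs, even when its result is never used.
  def eager(needed: Boolean, input: String): Boolean = {
    val hasPredicate = mayThrow(input)
    needed && hasPredicate
  }

  // Lazy: the check runs only if `needed` is true and the value is actually read.
  def deferred(needed: Boolean, input: String): Boolean = {
    lazy val hasPredicate = mayThrow(input)
    needed && hasPredicate
  }

  def main(args: Array[String]): Unit = {
    println(deferred(needed = false, input = null))                  // no exception
    println(scala.util.Try(eager(needed = false, input = null)))     // Failure(NPE)
  }
}
```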
…message ## What changes were proposed in this pull request? Also went through the same file to ensure other string concatenation are correct. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#17691 from zsxwing/fix-error-message. (cherry picked from commit 39e303a) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
## What changes were proposed in this pull request? It's illegal to have aggregate function in GROUP BY, and we should fail at analysis phase, if this happens. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes apache#17704 from cloud-fan/minor.
… and years handled correctly' ## What changes were proposed in this pull request? `monthsSinceEpoch` in this test is like `math.floor(num)`, so `monthDiff` has two possible values. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#16449 from zsxwing/watermark-test-hotfix. (cherry picked from commit 2394047) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
Hello PR apache#10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful to Yarn or Mesos only. Author: Hervé <dud225@users.noreply.github.com> Closes apache#17709 from dud225/patch-1. (cherry picked from commit 3476799) Signed-off-by: Sean Owen <sowen@cloudera.com>
…ing ignoreCorruptFiles' flaky test ## What changes were proposed in this pull request? SharedSQLContext.afterEach now calls DebugFilesystem.assertNoOpenStreams inside eventually. SQLTestUtils withTempDir calls waitForTasksToFinish before deleting the directory. ## How was this patch tested? New test but marked as ignored because it takes 30s. Can be unignored for review. Author: Bogdan Raducanu <bogdan@databricks.com> Closes apache#17720 from bogdanrdc/SPARK-20407-BACKPORT2.1.
[SPARK-19050][SS][TESTS] Fix EventTimeWatermarkSuite 'delay in months and years handled correctly'
This commit works around a bug in BigDecimal.equals in Scala 2.10 which fails comparisons with doubles, e.g. BigDecimal(1.1) == 1.1 is false in Scala 2.10, but true in Scala 2.11. See scala/scala@29541c for details.
…yerPerceptronClassifierSuite I have failed to find the change between Scala 2.10 and 2.11 causing this, but operationally the issue originates from 'SparkSession.createDataFrame' which creates a (lazy!) 'Seq' referencing a non-serializable 'java.util.ListIterator'. I suspect one of the collection traits was changed, but given how many of them are in the standard library, it is hard to tell which one. Here is a potential candidate scala/scala@0bb8a13
…2.1.1 ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-19611 fixes a regression from 2.0 where Spark silently fails to read case-sensitive fields missing a case-sensitive schema in the table properties. The fix is to detect this situation, infer the schema, and write the case-sensitive schema into the metastore. However this can incur an unexpected performance hit the first time such a problematic table is queried (and there is a high false-positive rate here since most tables don't actually have case-sensitive fields). This PR changes the default to NEVER_INFER (same behavior as 2.1.0). In 2.2, we can consider leaving the default to INFER_AND_SAVE. ## How was this patch tested? Unit tests. Author: Eric Liang <ekl@databricks.com> Closes apache#17749 from ericl/spark-20450.
…randomSplit ## What changes were proposed in this pull request? In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping splits. To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted, this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism. ## How was this patch tested? Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes and nested mapTypes. Author: Sameer Agarwal <sameerag@cs.berkeley.edu> Closes apache#17751 from sameeragarwal/randomsplit2. (cherry picked from commit 31345fd) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
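Why row order matters can be sketched with a seeded sampler. This is a simplified model, not Spark's sampler: `sample` and `splitInHalf` are hypothetical helpers, and each split is assumed to replay the same seeded random sequence over the rows and keep those whose draw falls in its own range.

```scala
import scala.util.Random

object RandomSplitSketch {
  // Keeps the rows whose random draw falls in [lower, upper).
  // The draw a row receives depends only on the seed and the row's position.
  def sample[A](rows: Seq[A], seed: Long, lower: Double, upper: Double): Seq[A] = {
    val rng = new Random(seed)
    rows.filter { _ =>
      val x = rng.nextDouble()
      x >= lower && x < upper
    }
  }

  // Each split is materialized by its own pass over the data.
  def splitInHalf[A](firstPass: Seq[A], secondPass: Seq[A], seed: Long): (Seq[A], Seq[A]) =
    (sample(firstPass, seed, 0.0, 0.5), sample(secondPass, seed, 0.5, 1.0))

  def main(args: Array[String]): Unit = {
    val rows = (1 to 20).toList
    // Same order in both passes: every row lands in exactly one split.
    val (left, right) = splitInHalf(rows.sorted, rows.reverse.sorted, seed = 42L)
    println((left ++ right).sorted == rows)
    // Different order per pass: rows may appear in both splits or in neither.
    val (l2, r2) = splitInHalf(rows, rows.reverse, seed = 42L)
    println(l2.intersect(r2))
  }
}
```

Sorting both passes gives every row the same position, hence the same draw, in each pass. When no sortable column remains, fixing the order some other way (the patch materializes the dataset) serves the same purpose.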
## What changes were proposed in this pull request? Just added the Maven `test`goal. ## How was this patch tested? No test needed, just a trivial documentation fix. Author: Armin Braun <me@obrown.io> Closes apache#17756 from original-brownbear/SPARK-20455. (cherry picked from commit c8f1219) Signed-off-by: Sean Owen <sowen@cloudera.com>
Using Option(name) instead of Some(name) to prevent runtime failures when using accumulators created like the following ``` sparkContext.accumulator(0, null) ``` Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com> Closes apache#17740 from szhem/SPARK-20404-null-acc-names. (cherry picked from commit 0bc7a90) Signed-off-by: Sean Owen <sowen@cloudera.com>
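The difference between the two constructors is shown below. `wrapUnsafe`, `wrapSafe`, and `nameLength` are hypothetical helpers for this sketch, not accumulator code.

```scala
object NullableNameSketch {
  // Some(null) is a defined Option that wraps a null reference.
  def wrapUnsafe(name: String): Option[String] = Some(name)

  // Option(null) is None, so downstream code never sees the null.
  def wrapSafe(name: String): Option[String] = Option(name)

  // Typical downstream use: map over the name if there is one.
  def nameLength(name: Option[String]): Int = name.map(_.length).getOrElse(0)

  def main(args: Array[String]): Unit = {
    println(nameLength(wrapSafe(null)))                    // falls back to the default
    println(scala.util.Try(nameLength(wrapUnsafe(null))))  // Failure(NullPointerException)
  }
}
```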
The current SHS (Spark History Server) has two different ACLs:
* ACL of the base URL. It is controlled by "spark.acls.enable" or "spark.ui.acls.enable". With this enabled, only users configured with "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started SHS, can list all the applications; otherwise none of them can be listed. This also affects the REST APIs that list the summary of all apps and of one app.
* Per-application ACL. This is controlled by "spark.history.ui.acls.enable". With this enabled, only the history admin user and the user/group who ran this app can access the details of this app.
With these two ACLs, we may encounter several unexpected behaviors:
1. If the base URL's ACL (`spark.acls.enable`) is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of its own app.
2. If the ACL of the base URL (`spark.acls.enable`) is disabled, then user "A" can download any application's event log, even if it was not run by user "A".
3. Changes to the Live UI's ACL affect the History UI's ACL, since they share the same conf file.
These unexpected behaviors arise mainly because we have two different ACLs; ideally we should have only one to manage everything. So to improve the SHS's ACL mechanism, this PR proposes to:
1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for the history server.
2. Check permission for the event-log download REST API.
With this PR:
1. An admin user can see/download the list of all applications, as well as application details.
2. A normal user can see the list of all applications, but can only download and check the details of applications accessible to them.
New UTs are added, also verified in a real cluster. CC tgravescs vanzin please help to review, this PR changes the semantics you implemented previously. Thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes apache#17755 from jerryshao/SPARK-20239-2.1-backport.
…ble when failed to fetch table metadata ### What changes were proposed in this pull request? This PR is to backport apache#17730 to Spark 2.1 --- -- `spark.catalog.listTables` and `spark.catalog.getTable` do not work if we are unable to retrieve table metadata for any reason (e.g., the table serde class is not accessible or the table type is not accepted by Spark SQL). After this PR, the APIs still return the corresponding Table (without the description and tableType). ### How was this patch tested? Added a test case Author: Xiao Li <gatorsmile@gmail.com> Closes apache#17760 from gatorsmile/backport-17730.
## What changes were proposed in this pull request? We didn't enforce analyzed plans in Spark 2.1 when writing out to Kafka. ## How was this patch tested? New unit test. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bill Chambers <bill@databricks.com> Closes apache#17804 from anabranch/SPARK-20496-2. (cherry picked from commit 733b81b) Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
The download link in history server UI is concatenated with:
```
<td><a href="{{uiroot}}/api/v1/applications/{{id}}/{{num}}/logs" class="btn btn-info btn-mini">Download</a></td>
```
Here the `num` field represents the number of attempts, which does not match what the REST API expects. In the REST API, if the attempt id does not exist the URL should be `api/v1/applications/<id>/logs`; otherwise the URL should be `api/v1/applications/<id>/<attemptId>/logs`. Using `<num>` to represent `<attemptId>` leads to a "no such app" error.
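The two URL shapes can be sketched as a single helper. The function name `logsUrl` and its parameters are hypothetical and written in Scala for illustration; the actual fix lives in the history page template.

```scala
object LogsUrlSketch {
  // Builds the event-log download URL; the attempt segment is added
  // only when the application actually has an attempt id.
  def logsUrl(uiRoot: String, appId: String, attemptId: Option[String]): String = {
    val attemptSegment = attemptId.map(id => s"/$id").getOrElse("")
    s"$uiRoot/api/v1/applications/$appId$attemptSegment/logs"
  }

  def main(args: Array[String]): Unit = {
    println(logsUrl("", "app-1", None))
    println(logsUrl("", "app-1", Some("2")))
  }
}
```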
Manual verification.
CC ajbozarth can you please review this change, since you added this feature? Thanks!
Author: jerryshao <sshao@hortonworks.com>
Closes apache#17795 from jerryshao/SPARK-20517.
(cherry picked from commit ab30590)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
There are two problems fixed in this commit. First, the ExecutorAllocationManager sets a timeout to avoid requesting executors too often. However, the timeout is always updated based on its value and a timeout, not the current time. If the call is delayed by locking for more than the ongoing scheduler timeout, the manager will request more executors on every run. This seems to be the main cause of SPARK-20540. The second problem is that the total number of requested executors is not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates the value based on the current status of 3 variables: the number of known executors, the number of executors that have been killed, and the number of pending executors. But, the number of pending executors is never less than 0, even though there may be more known than requested. When executors are killed and not replaced, this can cause the request sent to YARN to be incorrect because there were too many executors due to the scheduler's state being slightly out of date. This is fixed by tracking the currently requested size explicitly. ## How was this patch tested? Existing tests. Author: Ryan Blue <blue@apache.org> Closes apache#17813 from rdblue/SPARK-20540-fix-dynamic-allocation. (cherry picked from commit 2b2dd08) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
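The first problem can be illustrated with a toy model of the deadline update. This is not `ExecutorAllocationManager` code: `countRequests`, the millisecond timestamps, and the `fromNow` flag are assumptions made for the sketch.

```scala
object AddTimeSketch {
  // Counts how many runs issue a request. A run requests when `now` has
  // reached the deadline, then moves the deadline forward by `interval`.
  def countRequests(runTimes: Seq[Long], interval: Long, fromNow: Boolean): Int = {
    var deadline = interval
    var requests = 0
    for (now <- runTimes) {
      if (now >= deadline) {
        requests += 1
        // Old behaviour: advance from the previous deadline.
        // Fixed behaviour: advance from the current time.
        deadline = if (fromNow) now + interval else deadline + interval
      }
    }
    requests
  }

  def main(args: Array[String]): Unit = {
    // The first run is delayed to t=5000; later runs follow every 100 ms.
    val runs = Seq(5000L, 5100L, 5200L, 5300L)
    println(countRequests(runs, interval = 1000L, fromNow = false))
    println(countRequests(runs, interval = 1000L, fromNow = true))
  }
}
```

In this model, advancing from the stale deadline leaves it in the past for several runs, so every run requests again; advancing from the current time yields a single request until a full interval has elapsed.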
Upgrade Netty to `4.0.43.Final` to add the fix for netty/netty#6153 Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#16568 from zsxwing/SPARK-18971.
…c-with-upstream-2.1
Sorry, opened by accident.