forked from apache/spark
Merged Apache branch-1.6 #169
…ady succeeded Don't re-queue a task if another attempt has already succeeded. This currently happens when a speculative task is denied permission to commit its result because another copy of the task has already succeeded. I'm running a job with enough skew in per-task processing time for speculation to trigger on the last quarter of tasks (default settings), causing many commit-denied exceptions to be thrown. Previously, these tasks were retried over and over again until the stage eventually completed (while still consuming compute resources on these superfluous tasks). With this change (applied to the 1.6 branch), they no longer retry and the stage completes successfully without the extra task attempts. Author: Jason Moore <jasonmoore2k@outlook.com> Closes apache#12751 from jasonmoore2k/SPARK-14915. (cherry picked from commit 77361a4) Signed-off-by: Sean Owen <sowen@cloudera.com>
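For context, the behaviour described above involves speculative execution, which is off by default. A minimal sketch of the configuration involved, assuming the standard Spark speculation settings (the values shown are illustrative, not taken from the commit):

```scala
import org.apache.spark.SparkConf

// Illustrative values only; the commit above refers to the default settings.
val conf = new SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.quantile", "0.75")   // start speculating once 75% of tasks have finished
  .set("spark.speculation.multiplier", "1.5")  // a task must be 1.5x slower than the median to be re-launched
```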
…3c060 ## What changes were proposed in this pull request? I botched the back-port of SPARK-14915 to branch-1.6 in apache@bf3c060 resulting in a code block being added twice. This simply removes it, such that the net change is the intended one. ## How was this patch tested? Jenkins tests. (This in theory has already been tested.) Author: Sean Owen <sowen@cloudera.com> Closes apache#12950 from srowen/SPARK-14915.2.
…Thread Temporary patch for branch-1.6 to avoid a deadlock between the BlockManager and the Executor thread. Author: cenyuhai <cenyuhai@didichuxing.com> Closes apache#11546 from cenyuhai/SPARK-13566.
## What changes were proposed in this pull request? The configuration setting `spark.executor.logs.rolling.size.maxBytes` was changed to `spark.executor.logs.rolling.maxSize` in 1.4 or so. This commit fixes a remaining reference to the old name in the documentation. Also the description for `spark.executor.logs.rolling.maxSize` was edited to clearly state that the unit for the size is bytes. ## How was this patch tested? no tests Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes apache#13001 from philipphoffmann/patch-3.
…eb UI timeline ## What changes were proposed in this pull request? This patch fixes an escaping bug in the Web UI's event timeline that caused Javascript errors when displaying timeline entries whose descriptions include single quotes. The original bug can be reproduced by running

```scala
sc.setJobDescription("double quote: \" ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("single quote: ' ")
sc.parallelize(1 to 10).count()
```

and then browsing to the driver UI. Previously, this resulted in an "Uncaught SyntaxError" because the single quote from the description was not escaped and ended up closing a Javascript string literal too early. The fix implemented here is to change the relevant Javascript to define its string literals using double-quotes. Our escaping logic already properly escapes double quotes in the description, so this is safe to do. ## How was this patch tested? Tested manually in `spark-shell` using the following cases:

```scala
sc.setJobDescription("double quote: \" ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("single quote: ' ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("ampersand: &")
sc.parallelize(1 to 10).count()

sc.setJobDescription("newline: \n text after newline ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("carriage return: \r text after return ")
sc.parallelize(1 to 10).count()
```

/cc sarutak for review. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#12995 from JoshRosen/SPARK-15209. (cherry picked from commit 3323d0f) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
…distinct aggregate function #### Symptom: In the latest **branch 1.6**, when a `DISTINCT` aggregation function is used in the `HAVING` clause, the Analyzer throws an `AnalysisException` with a message like the following:

```
resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
```

#### Root cause: The problem is that the distinct aggregate in the having condition is resolved by the rule `DistinctAggregationRewriter` twice, which corrupts the resulting `EXPAND` operator. In the `ResolveAggregateFunctions` rule, when resolving `Filter(havingCondition, _: Aggregate)`, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the rule `DistinctAggregationRewriter` rewrites this distinct aggregate clause to an expanded two-layer aggregation, where the `aggregateExpressions` of the final `Aggregate` contain the resolved `gid` and the aggregate expression attributes (in the above case, `gid#558, id#559`). After the nested analyzer rule execution completes, the resulting `aggregateExpressions` in the `havingCondition` are pushed down into the underlying `Aggregate` operator, and the `DistinctAggregationRewriter` rule is executed again. The `projections` field of the `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (in the above case, `gid#558, id#559`) in the projection list of the `EXPAND` operator cannot be found in the underlying relation. #### Solution: This PR retrofits part of [apache#11579](apache#11579), which moves `DistinctAggregationRewriter` to the beginning of the Optimizer, guaranteeing that the rewrite only happens after all the aggregate functions have been resolved. Thus, it avoids the resolution failure. #### How is the PR change tested? New [test cases](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations involving a having clause. A follow-up PR will be submitted to add these test cases to the master (2.0) branch. Author: xin Wu <xinwu@us.ibm.com> Closes apache#12974 from xwu0226/SPARK-14495_review.
…leaning executor's state ## What changes were proposed in this pull request? When the driver removes an executor's state, the connection between the driver and the executor may still be alive, so the executor cannot exit automatically (e.g., Master will send RemoveExecutor when a worker is lost but the executor is still alive). The driver should therefore try to tell the executor to stop itself; otherwise, we will leak an executor. This PR modifies the driver to send `StopExecutor` to the executor when it is removed. ## How was this patch tested? Manual test: increase the worker heartbeat interval to force it to always time out, and verify that the leaked executors are gone. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11399 from zsxwing/SPARK-13519.
…eartbeat to driver more than N times ## What changes were proposed in this pull request? Sometimes the network disconnection event won't be triggered because of other potential race conditions that we may not have thought of; the executor will then keep sending heartbeats to the driver and won't exit. This PR adds a new configuration, `spark.executor.heartbeat.maxFailures`, to kill the executor when it is unable to heartbeat to the driver more than `spark.executor.heartbeat.maxFailures` times. ## How was this patch tested? unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11401 from zsxwing/SPARK-13522.
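A minimal sketch of how such a setting would be supplied, assuming it is read from the Spark configuration as the commit describes (the value shown is hypothetical):

```scala
import org.apache.spark.SparkConf

// Hypothetical value; the configuration key is the one named in the commit above.
val conf = new SparkConf()
  .set("spark.executor.heartbeat.maxFailures", "60") // executor gives up and exits after 60 consecutive failed heartbeats
```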
## What changes were proposed in this pull request? Just fixed the placement of the log statement introduced by apache#11401. ## How was this patch tested? unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11432 from zsxwing/SPARK-13522-follow-up.
## What changes were proposed in this pull request? If an executor is still alive even after the scheduler has removed its metadata, we may receive a heartbeat from that executor and tell its block manager to reregister itself. If that happens, the block manager master will know about the executor, but the scheduler will not. That is a dangerous situation, because when the executor does get disconnected later, the scheduler will not ask the block manager to also remove metadata for that executor. Later, when we try to clean up an RDD or a broadcast variable, we may try to send a message to that executor, triggering an exception. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes apache#13055 from andrewor14/block-manager-remove. (cherry picked from commit 40a949a) Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
## What changes were proposed in this pull request? (This is the branch-1.6 version of apache#13039) When we acquire execution memory, we do a lot of things between shrinking the storage memory pool and enlarging the execution memory pool. In particular, we call memoryStore.evictBlocksToFreeSpace, which may do a lot of I/O and can throw exceptions. If an exception is thrown, the pool sizes on that executor will be in a bad state. This patch minimizes the things we do between the two calls to make the resizing more atomic. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes apache#13058 from andrewor14/safer-pool-1.6.
Fixed memory leak (HiveConf in the CommandProcessorFactory) Author: Oleg Danilov <oleg.danilov@wandisco.com> Closes apache#12932 from dosoft/SPARK-14261. (cherry picked from commit e384c7f) Signed-off-by: Reynold Xin <rxin@databricks.com>
…for 1.6) ## What changes were proposed in this pull request? Backport apache#13185 to branch 1.6. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#13196 from zsxwing/host-string-1.6.
… in generated code (branch-1.6) ## What changes were proposed in this pull request? This PR introduces placeholders for comments in generated code; the purpose is the same as apache#12939, but this approach is much safer. Generated code to be compiled doesn't include actual comments but includes placeholders instead. Placeholders in generated code are replaced with actual comments only at the time of logging. Also, this PR can resolve SPARK-15205. ## How was this patch tested? Added new test cases. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes apache#13230 from sarutak/SPARK-15165-branch-1.6.
## What changes were proposed in this pull request? To ensure that the deserialization of TaskMetrics uses a ClassLoader that knows about RDDBlockIds. The problem occurs only very rarely since it depends on which thread of the thread pool is used for the heartbeat. I observe that the code in question has been largely rewritten for v2.0.0 of Spark and the problem no longer manifests. However it would seem reasonable to fix this for those users who need to continue with the 1.6 version for some time yet. Hence I have created a fix for the 1.6 code branch. ## How was this patch tested? Due to the nature of the problem a reproducible testcase is difficult to produce. This problem was causing our application's nightly integration tests to fail randomly. Since applying the fix the tests have not failed due to this problem, for nearly six weeks now. Author: Simon Scott <simon.scott@viavisolutions.com> Closes apache#13222 from simonjscott/fix-10722.
This patch fixes a few integer overflows in `UnsafeSortDataFormat.copyRange()` and `ShuffleSortDataFormat.copyRange()` that seem to be the most likely cause behind a number of `TimSort` contract violation errors seen in Spark 2.0 and Spark 1.6 while sorting large datasets. Added a test in `ExternalSorterSuite` that instantiates a large array of the form [150000000, 150000001, 150000002, ..., 300000000, 0, 1, 2, ..., 149999999] that triggers a `copyRange` in `TimSort.mergeLo` or `TimSort.mergeHi`. Note that the input dataset should contain at least 268.43 million rows with a certain data distribution for an overflow to occur. Author: Sameer Agarwal <sameer@databricks.com> Closes apache#13336 from sameeragarwal/timsort-bug. (cherry picked from commit fe6de16) Signed-off-by: Reynold Xin <rxin@databricks.com>
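A small sketch of the class of bug being fixed (illustrative only, not the actual `copyRange()` code): multiplying two `Int` position/width values overflows before the product is widened, so large byte offsets must be computed in `Long` arithmetic.

```scala
// Hypothetical position and record width, chosen so the 32-bit product wraps around.
val srcPos = 200000000       // element position in a very large buffer
val bytesPerRecord = 16      // assumed record width, for illustration only

val overflowed: Int = srcPos * bytesPerRecord          // 3,200,000,000 wraps to a negative Int
val correct: Long   = srcPos.toLong * bytesPerRecord   // widen to Long before multiplying

println(overflowed)  // -1094967296
println(correct)     // 3200000000
```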
## What changes were proposed in this pull request? Makes `UnsafeSortDataFormat` and `RecordPointerAndKeyPrefix` public. These are already public in 2.0 and are used in an `ExternalSorterSuite` test (see apache@0b8bdf7) ## How was this patch tested? Successfully builds locally Author: Sameer Agarwal <sameer@databricks.com> Closes apache#13339 from sameeragarwal/fix-compile.
## What changes were proposed in this pull request? A local variable in NumberConverter is wrongly shared between threads. This PR fixes the race condition. ## How was this patch tested? Manually checked. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes apache#13391 from maropu/SPARK-15528. (cherry picked from commit 95db8a4) Signed-off-by: Sean Owen <sowen@cloudera.com>
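A generic sketch of the fix pattern described above (the names are made up and this is not the actual `NumberConverter` internals): a mutable scratch buffer held in a shared object is a data race when called from multiple threads, while allocating the buffer per call removes the race.

```scala
object ConverterSketch {
  // Racy variant: every caller writes into the same shared scratch array
  // (inputs assumed to fit in 64 chars for this sketch).
  private val sharedScratch = new Array[Char](64)

  def convertRacy(digits: String): String = {
    var i = 0
    while (i < digits.length) { sharedScratch(i) = digits.charAt(i).toUpper; i += 1 }
    new String(sharedScratch, 0, digits.length)   // may contain another thread's writes
  }

  // Thread-safe variant: each call gets its own scratch buffer, so there is nothing to race on.
  def convertSafe(digits: String): String = {
    val scratch = new Array[Char](64)
    var i = 0
    while (i < digits.length) { scratch(i) = digits.charAt(i).toUpper; i += 1 }
    new String(scratch, 0, digits.length)
  }
}
```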
…tents written if buffer isn't full
1. The class allocated 4x more space than needed, as it was using `Int` to store the `Byte` values.
2. If the CircularBuffer isn't full, `toString()` will currently print some garbage chars along with the content written, as it tries to print the entire array allocated for the buffer. The fix is to keep track of whether the buffer has filled up, and not print the tail of the buffer if it isn't full (suggestion by sameeragarwal over apache#12194 (comment)).
3. Simplified `toString()`.

Added a new test case. Author: Tejas Patil <tejasp@fb.com> Closes apache#13351 from tejasapatil/circular_buffer. (cherry picked from commit ac38bdc) Signed-off-by: Sean Owen <sowen@cloudera.com>
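A compact sketch of the idea behind points 1 and 2 (assumed names; this is not Spark's `CircularBuffer`): store the data as `Byte`s rather than `Int`s, remember whether the buffer has wrapped, and render only the written portion in `toString`.

```scala
class TinyCircularBuffer(capacity: Int = 1024) {
  private val buf = new Array[Byte](capacity)  // one byte per slot, not an Int per slot
  private var pos = 0
  private var wrapped = false                  // set once we start overwriting old data

  def write(b: Byte): Unit = {
    buf(pos) = b
    pos += 1
    if (pos == capacity) { pos = 0; wrapped = true }
  }

  override def toString: String = {
    val bytes =
      if (wrapped) buf.slice(pos, capacity) ++ buf.slice(0, pos)  // oldest data first
      else buf.slice(0, pos)                                      // only what has actually been written
    new String(bytes, java.nio.charset.StandardCharsets.UTF_8)
  }
}
```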
This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an `IllegalStateException`. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204 My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user. This patch was tested manually. I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed. Author: Kevin McHale <kevin@premise.com> Closes apache#12000 from mchalek/jdbc-driver-registration.
…iles If an RDD partition is cached on disk and the DiskStore file is lost, then reads of that cached partition will fail and the missing partition is supposed to be recomputed by a new task attempt. In the current BlockManager implementation, however, the missing file does not trigger any metadata updates / does not invalidate the cache, so subsequent task attempts will be scheduled on the same executor and the doomed read will be repeatedly retried, leading to repeated task failures and eventually a total job failure. In order to fix this problem, the executor with the missing file needs to properly mark the corresponding block as missing so that it stops advertising itself as a cache location for that block. This patch fixes this bug and adds an end-to-end regression test (in `FailureSuite`) and a set of unit tests (in `BlockManagerSuite`). This is a branch-1.6 backport of apache#13473. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#13479 from JoshRosen/handle-missing-cache-files-branch-1.6.
…ation tokens to be added in current user credential. ## What changes were proposed in this pull request? The credentials are not added to the credentials of UserGroupInformation.getCurrentUser(). Further, if the client is able to log in using a keytab, the updateDelegationToken thread is not started on the client. ## How was this patch tested? ran dev/run-tests Author: Subroto Sanyal <ssanyal@datameer.com> Closes apache#13499 from subrotosanyal/SPARK-15754-save-ugi-from-changing. (cherry picked from commit 61d729a) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
…form "EST" is … ## What changes were proposed in this pull request? Stop using the abbreviated and ambiguous timezone "EST" in a test, since it is machine-local default timezone dependent, and fails in different timezones. Fixed [SPARK-15723](https://issues.apache.org/jira/browse/SPARK-15723). ## How was this patch tested? Note that to reproduce this problem in any locale/timezone, you can modify the scalatest-maven-plugin argLine to add a timezone: <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="Australia/Sydney"</argLine> and run $ mvn test -DwildcardSuites=org.apache.spark.status.api.v1.SimpleDateParamSuite -Dtest=none. Equally this will fix it in an effected timezone: <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} -Duser.timezone="America/New_York"</argLine> To test the fix, apply the above change to `pom.xml` to set test TZ to `Australia/Sydney`, and confirm the test now passes. Author: Brett Randall <javabrett@gmail.com> Closes apache#13462 from javabrett/SPARK-15723-SimpleDateParamSuite. (cherry picked from commit 4e767d0) Signed-off-by: Sean Owen <sowen@cloudera.com>
Some VertexRDDs and EdgeRDDs are created during the intermediate steps of g.connectedComponents() but are unnecessarily left cached after the method is done. The fix is to unpersist these RDDs once they are no longer in use. A test case is added to confirm the fix for the reported bug. Author: Jason Lee <cjlee@us.ibm.com> Closes apache#10713 from jasoncl/SPARK-12655. (cherry picked from commit d0a5c32) Signed-off-by: Sean Owen <sowen@cloudera.com>
… empty .m2 cache This patch fixes a bug in `./dev/test-dependencies.sh` which caused spurious failures when the script was run on a machine with an empty `.m2` cache. The problem was that extra log output from the dependency download was conflicting with the grep / regex used to identify the classpath in the Maven output. This patch fixes this issue by adjusting the regex pattern. Tested manually with the following reproduction of the bug:

```
rm -rf ~/.m2/repository/org/apache/commons/
./dev/test-dependencies.sh
```

Author: Josh Rosen <joshrosen@databricks.com> Closes apache#13568 from JoshRosen/SPARK-12712. (cherry picked from commit 921fa40) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
…entral Spark's SBT build currently uses a fork of the sbt-pom-reader plugin, but depends on that fork via an SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and is also risky because the build could break if that GitHub repository is ever changed or deleted. In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. This published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication. /cc srowen ScrapCodes for review. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#13564 from JoshRosen/use-published-fork-of-pom-reader. (cherry picked from commit f74b777) Signed-off-by: Josh Rosen <joshrosen@databricks.com>
## What changes were proposed in this pull request? Fixing documentation for the groupby/agg example in Python. ## How was this patch tested? The existing example in the documentation does not contain valid syntax (missing parenthesis) and is not using `Column` in the expression for `agg()`. After the fix, here's how I tested it:

```
In [1]: from pyspark.sql import Row

In [2]: import pyspark.sql.functions as func

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:records = [{'age': 19, 'department': 1, 'expense': 100},
:           {'age': 20, 'department': 1, 'expense': 200},
:           {'age': 21, 'department': 2, 'expense': 300},
:           {'age': 22, 'department': 2, 'expense': 300},
:           {'age': 23, 'department': 3, 'expense': 300}]
:--

In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])

In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()
+----------+----------+--------+------------+
|department|department|max(age)|sum(expense)|
+----------+----------+--------+------------+
|         1|         1|      20|         300|
|         2|         2|      22|         600|
|         3|         3|      23|         300|
+----------+----------+--------+------------+
```

Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes apache#13587 from mortada/groupby_agg_doc_fix. (cherry picked from commit 675a737) Signed-off-by: Reynold Xin <rxin@databricks.com>
…ackport for 1.6)" This reverts commit 7ad82b6. See SPARK-16017.
… 1.6 ## What changes were proposed in this pull request? This PR backports apache#13619. The original test added in branch-2.0 failed in branch-1.6. This seems to be because the behaviour was changed in apache@101663f. The failure occurred while calculating Euler's number, which ends up as infinity regardless of this path. So, I brought the dataset from `AFTSurvivalRegressionExample` to make sure this is working and then wrote the test. I ran the test before/after creating empty partitions: `model.scale` becomes `1.0` with empty partitions and `1.547` without them. After this patch, it is always `1.547`. ## How was this patch tested? Unit test in `AFTSurvivalRegressionSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#13725 from HyukjinKwon/SPARK-15892-1-6.
…nthesis ## What changes were proposed in this pull request? The check on the end parenthesis of the expression to parse was using the wrong variable. I corrected that. ## How was this patch tested? Manual test Author: andreapasqua <andrea@radius.com> Closes apache#13750 from andreapasqua/sparse-vector-parser-assertion-fix. (cherry picked from commit 4c64e88) Signed-off-by: Xiangrui Meng <meng@databricks.com>
…ylight Saving Time ## What changes were proposed in this pull request? Internally, we use an Int to represent a date (the days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since epoch in UTC), we get the offset of a timezone using local millis (the milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a unix timestamp, so the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR changes the code to use a best-effort approximation of the posix timestamp to look up the offset. Around a DST change, some times are not defined (for example, 2016-03-13 02:00:00 PST) or could map to multiple valid results in UTC (for example, 2016-11-06 01:00:00); this best-effort approximation should be enough in practice. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes apache#13652 from davies/fix_timezone. (cherry picked from commit 001a589) Signed-off-by: Davies Liu <davies.liu@gmail.com>
…ue to Daylight Saving Time" This reverts commit 41efd20.
There's actually a race here: the state of the handler was changed before the connection was set, so the test code could be notified of the state change, wake up, and still see the connection as null, triggering the assert. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#12785 from vanzin/SPARK-14391. (cherry picked from commit 73c20bf)
…ylight Saving Time Internally, we use an Int to represent a date (the days since 1970-01-01). When we convert that into a unix timestamp (milliseconds since epoch in UTC), we get the offset of a timezone using local millis (the milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a unix timestamp, so the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR changes the code to use a best-effort approximation of the posix timestamp to look up the offset. Around a DST change, some times are not defined (for example, 2016-03-13 02:00:00 PST) or could map to multiple valid results in UTC (for example, 2016-11-06 01:00:00); this best-effort approximation should be enough in practice. Added regression tests. Author: Davies Liu <davies@databricks.com> Closes apache#13652 from davies/fix_timezone.
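To illustrate the pitfall being worked around (a standalone sketch, not the Spark code): `TimeZone.getOffset()` takes an instant in UTC millis, so feeding it "local millis" can pick up the offset from the wrong side of a DST transition.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

val tz  = TimeZone.getTimeZone("America/Los_Angeles")
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))

// An instant shortly after the 2016-03-13 spring-forward transition in the US.
val utcMillis = fmt.parse("2016-03-13 10:30:00").getTime

// Correct: query the offset with the UTC instant (-7 hours, PDT).
val goodOffset = tz.getOffset(utcMillis)

// Incorrect: querying with "local millis" lands before the transition and returns -8 hours (PST).
val badOffset = tz.getOffset(utcMillis + goodOffset)

println(goodOffset / 3600000.0) // -7.0
println(badOffset / 3600000.0)  // -8.0
```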
## What changes were proposed in this pull request? Fix the bug for Python UDFs that do not have any arguments. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies.liu@gmail.com> Closes apache#13793 from davies/fix_no_arguments.
…dlocks ## What changes were proposed in this pull request? Set minimum number of dispatcher threads to 3 to avoid deadlocks on machines with only 2 cores ## How was this patch tested? Spark test builds Author: Pete Robbins <robbinspg@gmail.com> Closes apache#13355 from robbinspg/SPARK-13906.
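The general shape of the change, as a hedged sketch (assumed variable names, not the actual dispatcher code): derive the thread count from the available cores but clamp it to a floor of 3.

```scala
// Floor the pool size so 1- and 2-core machines still get enough dispatcher threads to avoid the deadlock.
val availableCores    = Runtime.getRuntime.availableProcessors()
val dispatcherThreads = math.max(3, availableCores)
```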
…StreamSuite.offset recovery ## What changes were proposed in this pull request? Because this test extracts data from `DStream.generatedRDDs` before stopping, it may get data before checkpointing. Then after recovering from the checkpoint, `recoveredOffsetRanges` may contain something not in `offsetRangesBeforeStop`, which will fail the test. Adding `Thread.sleep(1000)` before `ssc.stop()` will reproduce this failure. This PR just moves the logic of `offsetRangesBeforeStop` (also renamed to `offsetRangesAfterStop`) after `ssc.stop()` to fix the flaky test. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#12903 from zsxwing/SPARK-6005. (cherry picked from commit 9533f53) Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request? In the case that we don't know which module an object came from, we call pickle.whichmodule() to go through all the loaded modules to find the object, which could fail because of some modules, for example six; see https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling. We should ignore the exception here and use `__main__` as the module name (meaning we can't find the module). ## How was this patch tested? Manually tested. Can't have a unit test for this. Author: Davies Liu <davies@databricks.com> Closes apache#13788 from davies/whichmodule. (cherry picked from commit d489354) Signed-off-by: Davies Liu <davies.liu@gmail.com>
This is needed to avoid odd compiler errors when building just the sql package with maven, because of odd interactions between scalac and shaded classes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#11640 from vanzin/SPARK-13780.
## What changes were proposed in this pull request? This PR fixes `DataFrame.describe()` by forcing materialization to make the `Seq` serializable. Currently, `describe()` of `DataFrame` throws `Task not serializable` Spark exceptions when joining in Scala 2.10. ## How was this patch tested? Manual. (After building with Scala 2.10, test on bin/spark-shell and bin/pyspark.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes apache#13902 from dongjoon-hyun/SPARK-16173-branch-1.6.
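As a rough illustration of the fix pattern (not the actual `describe()` implementation): a lazily evaluated collection keeps references to the state it was derived from, which is the kind of structure that can trip up closure serialization, while forcing it into a concrete collection yields a plain, self-contained `Seq`.

```scala
val columns = Seq("age", "expense")
val stats   = Seq("count", "mean", "max")

// A lazy view is evaluated on access and holds on to the collections it was built from.
val lazyExprs = columns.view.map(c => stats.map(s => s"$s($c)"))

// Forcing materialization produces a plain List with no hidden references.
val strictExprs = lazyExprs.map(_.toList).toList
```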
…ndexOutOfBoundsException. I have found the bug and tested the solution. ## What changes were proposed in this pull request? Just adjust the size of an array in line 58 so it does not cause an ArrayIndexOutOfBoundsException in line 66. ## How was this patch tested? Manual tests. I have recompiled the entire project with the fix; it built successfully and I have run the code, also with good results. Line 66, `val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1`, crashes because trueWeights has length "nfeatures + 1" while "x" has length "features", and they should have the same length. To fix this, just make trueWeights the same length as x. I have recompiled the project with the change and it is working now: `[spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test` And it generates the data successfully now in the specified folder. Author: José Antonio <joseanmunoz@gmail.com> Closes apache#13895 from j4munoz/patch-2. (cherry picked from commit a3c7b41) Signed-off-by: Sean Owen <sowen@cloudera.com>
…g tests ## What changes were proposed in this pull request? Make spill tests wait until job has completed before returning the number of stages that spilled ## How was this patch tested? Existing Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes apache#13896 from srowen/SPARK-16193. (cherry picked from commit e877415) Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request? reduce the denominator of SparkPi by 1 ## How was this patch tested? integration tests Author: 杨浩 <yanghaogn@163.com> Closes apache#13910 from yanghaogn/patch-1. (cherry picked from commit b452026) Signed-off-by: Sean Owen <sowen@cloudera.com>
…oot` module ending up failure of Python tests ## What changes were proposed in this pull request? This PR fixes incorrect checking for the `root` module (meaning all tests). I realised that apache#13806 is failing due to this. The PR corrects two files in `sql` and `core`. Since fixing the `core` module seems to trigger all tests via the `root` value from `determine_modules_for_files`, `changed_modules` becomes:

```
['root', 'sql']
```

and `module.dependent_modules` becomes:

```
['pyspark-mllib', 'pyspark-ml', 'hive-thriftserver', 'sparkr', 'mllib', 'examples', 'pyspark-sql']
```

Now, `modules_to_test` does not include `root`, so this check is skipped, but then both `changed_modules` and `modules_to_test` are merged, which brings the `root` module back into the set of modules to test. This ends up failing with the message below (e.g. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60990/consoleFull):

```
Error: unrecognized module 'root'. Supported modules: pyspark-core, pyspark-sql, pyspark-streaming, pyspark-ml, pyspark-mllib
```

## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#13845 from HyukjinKwon/fix-build-1.6.
markhamstra pushed a commit to markhamstra/spark that referenced this pull request on Nov 7, 2017.