[SPARK-20449][ML] Upgrade breeze version to 0.13.1 #17746

Closed
wants to merge 1 commit into from

Conversation

yanboliang
Contributor

What changes were proposed in this pull request?

Upgrade the Breeze version to 0.13.1, which fixes some critical bugs in L-BFGS-B.

How was this patch tested?

Existing unit tests.
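For context, Spark ML drives Breeze's optimizers through a `DiffFunction`; L-BFGS-B is the box-constrained variant whose bug fixes motivated this upgrade. A minimal standalone sketch of that pattern (illustrative only, not Spark's actual code):

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

object LbfgsSketch {
  def main(args: Array[String]): Unit = {
    // Minimize f(x) = (x - 3)^2; the gradient is 2 * (x - 3).
    val f = new DiffFunction[DenseVector[Double]] {
      override def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) =
        (math.pow(x(0) - 3.0, 2), DenseVector(2.0 * (x(0) - 3.0)))
    }
    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10)
    val xOpt = lbfgs.minimize(f, DenseVector(0.0))
    println(s"minimizer = $xOpt") // expected to be close to 3.0
  }
}
```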

@yanboliang
Contributor Author

cc @dbtsai


model <- spark.mlp(df, label ~ features, layers = c(4, 3), maxIter = 2)
mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
expect_equal(head(mlpPredictions$prediction, 10),
c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "0.0", "2.0", "1.0", "0.0"))
c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0"))
Member

Why did the prediction results change in the R and Python test cases? The bugs being fixed are in L-BFGS-B, and in theory the current optimizer only uses L-BFGS, so the results should be the same.

Contributor Author

Yeah, it’s weird. After in-depth investigation, I found that all the failing tests run on a very tiny dataset with a very tiny maxIter value, which means they have not converged. I suspect some underlying Breeze changes caused these failures, but I don’t think it matters, since all tests against larger datasets pass.
I don’t think checking intermediate results mid-iteration makes sense; those intermediate results can be fragile and unstable. We should only check the final converged result, so I sent #17757 to update the relevant tests and make them robust.
I also ran PySpark LogisticRegression on a larger dataset against Spark built with Breeze 0.12 and with 0.13.1; both produced the same result within a reasonable tolerance:
For breeze 0.12:

>>> df = spark.read.format("libsvm").load("/Users/yliang/data/trunk4/spark/data/mllib/sample_multiclass_classification_data.txt")
>>> from pyspark.ml.classification import LogisticRegression
>>> mlor = LogisticRegression(maxIter=100, regParam=0.01, family="multinomial")
>>> mlorModel = mlor.fit(df)
>>> mlorModel.coefficientMatrix
DenseMatrix(3, 4, [1.0584, -1.8365, 3.2426, 3.6224, -2.1275, 2.8712, -2.8362, -2.5096, 1.069, -1.0347, -0.4064, -1.1128], 1)
>>> mlorModel.interceptVector
DenseVector([-1.1036, -0.5917, 1.6953])

For breeze 0.13.1:

>>> df = spark.read.format("libsvm").load("/Users/yliang/data/trunk4/spark/data/mllib/sample_multiclass_classification_data.txt")
>>> from pyspark.ml.classification import LogisticRegression
>>> mlor = LogisticRegression(maxIter=100, regParam=0.01, family="multinomial")
>>> mlorModel = mlor.fit(df)
>>> mlorModel.coefficientMatrix
DenseMatrix(3, 4, [1.0584, -1.8365, 3.2426, 3.6224, -2.1274, 2.8712, -2.8363, -2.5096, 1.069, -1.0347, -0.4064, -1.1128], 1)
>>> mlorModel.interceptVector
DenseVector([-1.1036, -0.5917, 1.6953])

Member

With small datasets the problem can be ill-conditioned, which is not ideal for unit tests since they can be very unstable.

I wonder whether, with the real, larger datasets we use in the Scala unit tests, the number of iterations required for convergence is the same with Breeze 0.13.1 as it was with Breeze 0.12?
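One way to check this would be to compare the iteration count reported by the training summary under both Breeze versions; a rough spark-shell sketch against a binary dataset (path and parameters are illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Train the same model on builds using Breeze 0.12 and 0.13.1 and compare
// how many iterations the optimizer needed to converge.
val df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)
val model = lr.fit(df)
println(s"iterations to converge: ${model.summary.totalIterations}")
```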

@SparkQA

SparkQA commented Apr 24, 2017

Test build #76110 has finished for PR 17746 at commit aeb7eb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Apr 25, 2017

LGTM. Merged into master and branch 2.2

asfgit pushed a commit that referenced this pull request Apr 25, 2017
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17746 from yanboliang/spark-20449.

(cherry picked from commit 67eef47)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
@asfgit asfgit closed this in 67eef47 Apr 25, 2017
@dbtsai
Member

dbtsai commented Apr 25, 2017

Many thanks to @WeichenXu123 for helping to fix this bug in Breeze!

@srowen
Member

srowen commented Apr 25, 2017

Not that I have any specific concern, but did anyone look at the changes from 0.12 to 0.13 to see if anything might be breaking? Probably not, but it does leak into the user classpath.

@dbtsai
Member

dbtsai commented Apr 25, 2017

@srowen A couple of API changes in Breeze 0.13 are not source-compatible with 0.12. We should call that out in the release notes, and users will need to migrate if they use Breeze in their applications. FYI, we only use the optimization package in Breeze now, and we plan to move the optimization code into Spark to reduce the third-party dependencies Breeze brings in.
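For applications that depend on Breeze directly, aligning with the version Spark ships is just a dependency bump, e.g. in sbt (coordinates shown for illustration):

```scala
// build.sbt (sketch): match the Breeze version shipped with Spark 2.2
libraryDependencies += "org.scalanlp" %% "breeze" % "0.13.1"
```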

@yanboliang yanboliang deleted the spark-20449 branch April 26, 2017 00:20
asfgit pushed a commit that referenced this pull request Apr 26, 2017
## What changes were proposed in this pull request?
Some PySpark & SparkR tests run on a tiny dataset with a tiny ```maxIter```, which means they have not converged. Checking intermediate results mid-iteration doesn’t make sense, and those intermediate results can be fragile and unstable, so we should switch to checking the converged result. We hit this issue at #17746 when we upgraded breeze to 0.13.1.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17757 from yanboliang/flaky-test.

(cherry picked from commit dbb06c6)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
asfgit pushed a commit that referenced this pull request Apr 26, 2017
## What changes were proposed in this pull request?
Some PySpark & SparkR tests run on a tiny dataset with a tiny ```maxIter```, which means they have not converged. Checking intermediate results mid-iteration doesn’t make sense, and those intermediate results can be fragile and unstable, so we should switch to checking the converged result. We hit this issue at #17746 when we upgraded breeze to 0.13.1.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17757 from yanboliang/flaky-test.
@yhuai
Contributor

yhuai commented May 1, 2017

Can I ask how we decided to merge this dependency change after the release branch was cut (especially since this change affects user code)?

@jkbradley
Member

+10 for not merging major changes like this so close to the release, especially after an RC has been cut, unless it's for blocker bugs. The same goes for new APIs such as #17715.

I guess it's OK not to revert them, but let's definitely avoid doing this in the future.

@dbtsai
Member

dbtsai commented May 2, 2017

The motivation for merging this into Spark 2.2 is not just #17715 but also that Breeze 0.13.x fixes many bugs upstream. Since Spark was pinned to 0.12, many users (including my company) have had difficulty upgrading Breeze on their own, which forces people to keep a copy of the implementation with the fixes in their application code.

In the long term, because we only use Breeze's optimizer in Spark, we should have our own optimizer implementation in mllib-local to remove the heavy external dependencies.

In retrospect, we started working on the upstream fix too late, so upstream released the fixes only right after the 2.2 branch was cut. We should definitely plan this earlier in the future.

@yhuai
Contributor

yhuai commented May 2, 2017

@dbtsai Thanks for the explanation and the context :)

@superbobry
Contributor

Hello, are there any plans to backport this into the 2.1 branch? The L-BFGS and other fixes in 0.13.1 seem important enough.

@srowen
Member

srowen commented May 16, 2017

@superbobry See the discussion above; it doesn't seem safe to do so.

@superbobry
Contributor

@srowen Thanks! I had missed that 0.13.1 was intentionally merged only into the upcoming release.

@dbtsai Could you give an example of a breaking API change between 0.12 and 0.13.1? I'm sure I've missed it as well, but from the commit log it seems to be all bug fixes or backward-compatible changes.

@dbtsai
Member

dbtsai commented May 17, 2017

@superbobry As you can see in this PR, one of them is

-    override def link(mu: Double): Double = dist.Gaussian(0.0, 1.0).icdf(mu)
+    override def link(mu: Double): Double = dist.Gaussian(0.0, 1.0).inverseCdf(mu)
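So user code calling the old method needs a corresponding one-line change, e.g. (minimal sketch):

```scala
import breeze.stats.{distributions => dist}

// Breeze 0.12:
// val z = dist.Gaussian(0.0, 1.0).icdf(0.975)

// Breeze 0.13.1:
val z = dist.Gaussian(0.0, 1.0).inverseCdf(0.975) // ~1.96
```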

@superbobry
Contributor

Thank you.

sumwale pushed a commit to TIBCOSoftware/snappy-spark that referenced this pull request Oct 10, 2017
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.

Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#17746 from yanboliang/spark-20449.

(cherry picked from commit 67eef47)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.

Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#17746 from yanboliang/spark-20449.
ahshahid added a commit to TIBCOSoftware/snappy-spark that referenced this pull request Dec 15, 2019
* [SNAP-846][CLUSTER] Ensuring that Uncaught exceptions are handled in the Snappy side and do not cause a system.exit (#2)

Instead of using SparkUncaughtExceptionHandler, executor now gets the uncaught exception handler and uses it to handle the exception. But if it is a local mode, it still uses the SparkUncaughtExceptionHandler

A test has been added in the Snappy side PR for the same.

* [SNAPPYDATA] Updated Benchmark code from Spark PR#13899

Used by the new benchmark from the PR adapted for SnappyData for its vectorized implementation.

Build updated to set testOutput and other variables instead of appending to existing values
(causes double append with both snappydata build adding and this adding for its tests)

* [SNAPPYDATA] Spark version 2.0.1-2

* [SNAPPYDATA] fixing antlr generated code for IDEA

* [SNAP-1083] fix numBuckets handling (#15)

- don't apply numBuckets in Shuffle partitioning since Shuffle cannot create
  a compatible partitioning with matching numBuckets (only numPartitions)
- check numBuckets too in HashPartitioning compatibility

* [SNAPPYDATA] MemoryStore changes for snappydata

* [SNAPPYDATA] Spark version 2.0.1-3

* [SNAPPYDATA] Added SnappyData modification license

* [SNAPPYDATA] updating snappy-spark version after the merge

* [SNAPPYDATA] Bootstrap perf (#16)

Changes involve:
1) Reducing the generated code size when writing a struct having all fields of the same data type.
2) Fixing an issue in WholeStageCodeGenExec where a plan supporting codegen was not being prefixed by InputAdapter in case the node did not participate in whole-stage code gen.

* [SNAPPYDATA] Provide preferred location for each bucket-id in case of partitioned sample table. (#22)

These changes are related to AQP-79.
Provide preferred location for each bucket-id in case of partitioned sample table.

* [SNAPPYDATA] Bumping version to 2.0.3-1

* [SNAPPYDATA] Made two methods in Executor as protected to make them customizable for SnappyExecutors. (#26)

* [SNAPPYDATA]: Honoring JAVA_HOME variable while compiling java files
instead of using system javac. This eliminates problem when system jdk
is set differently from JAVA_HOME

* [SNAPPYDATA] Helper classes for DataSerializable implementation. (#29)

This is to provide support for DataSerializable implementation in AQP

* [SNAPPYDATA] More optimizations to UTF8String

- allow direct UTF8String objects in RDD data conversions to DataFrame;
  new UTF8String.cloneIfRequired to clone only if required used by above
- allow for some precision change in QueryTest result comparison

* [SNAP-1192] correct offsetInBytes calculation (#30)

corrected offsetInBytes in UnsafeRow.writeToStream

* [SNAP-1198] Use ConcurrentHashMap instead of queue for ContextCleaner.referenceBuffer (#32)

Use a map instead of queue for ContextCleaner.referenceBuffer. Profiling shows lot of time being spent removing from queue where a hash map will do (referenceQueue is already present for poll).

* [SNAP-1194] explicit addLong/longValue methods in SQLMetrics (#33)

This avoids runtime erasure for add/value methods that will result in unnecessary boxing/unboxing overheads.

- Adding spark-kafka-sql project
- Update version of deps as per upstream.
- corrected kafka-clients reference

* [SNAPPYDATA] Adding fixed stats to common filter expressions

Missing filter statistics in a filter's logical plan sometimes cause incorrect plan selection.
Also, join statistics always return sizeInBytes as the product of the children's sizeInBytes, which
results in a very large number. For a join, the product makes sense only for a cartesian product join.
Hence, fixed the Spark code to check the join type: if the join is an equi-join,
  we now sum the children's sizeInBytes instead of taking the product.

For missing filter statistics, adding a heuristic sizeInBytes calculation, sketched below.
If the filtering condition is:
- equal to: sizeInBytes is 5% of the child sizeInBytes
- greater than / less than: sizeInBytes is 50% of the child sizeInBytes
- isNull: sizeInBytes is 50% of the child sizeInBytes
- starts with: sizeInBytes is 10% of the child sizeInBytes
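A minimal sketch of that selectivity heuristic (names and structure are illustrative, not the actual SnappyData code):

```scala
import org.apache.spark.sql.catalyst.expressions._

// Rough selectivity factor applied to the child's sizeInBytes, per the list above.
def filterSelectivity(condition: Expression): Double = condition match {
  case _: EqualTo                   => 0.05
  case _: GreaterThan | _: LessThan => 0.50
  case _: IsNull                    => 0.50
  case _: StartsWith                => 0.10
  case _                            => 1.0
}

def filterSizeInBytes(childSizeInBytes: BigInt, condition: Expression): BigInt =
  (BigDecimal(childSizeInBytes) * filterSelectivity(condition)).toBigInt
```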

* [SNAPPYDATA] adding kryo serialization missing in LongHashedRelation

* [SNAPPYDATA] Correcting HashPartitioning interface to match apache spark

Addition of numBuckets as default parameter made HashPartitioning incompatible with upstream apache spark.
Now adding it separately so restore compatibility.

* [SNAP-1233] clear InMemorySorter before calling its reset (#35)

This is done so that any spill call (due to no EVICTION_DOWN) from within the spill
call will return without doing anything, else it results in NPE trying to read
page tables which have already been cleared.

* [SNAPPYDATA] Adding more filter conditions for plan sizing as followup

- IN is 50% of original
- StartsWith, EndsWith 10%
- Contains and LIKE at 20%
- AND is multiplication of sizing of left and right (with max filtering of 5%)
- OR is 1/x+1/y sizing of the left and right (with min filtering of 50%)
- NOT three times of that without NOT

* [SNAPPYDATA] reduced factors in filters a bit to be more conservative

* [SNAP-1240]  Snappy monitoring dashboard (#36)

* UI HTML, CSS and resources changes

* Adding new health status images

* Adding SnappyData Logo.

* Code changes for stting/updating Spark UI tabs list.

* Adding icon images for Running, Stopped and Warning statuses.

* 1. Adding New method for generating Spark UI page without page header text.
2. Updating CSS: Cluster Normal status text color is changed to match color of Normal health logo.

* Suggestion: Rename Storage Tab to Spark Cache.

*  Resolving Precheckin failure due to scala style comments
:snappy-spark:snappy-spark-core_2.11:scalaStyle
SparkUI.scala message=Insert a space after the start of the comment line=75 column=4
UIUtils.scala message=Insert a space after the start of the comment line=267 column=4

* [SNAP-1251] Avoid exchange when number of shuffle partitions > child partitions (#37)

- reason is that shuffle is added first with default shuffle partitions,
  then the child with maximum partitions is selected; now marking children where
  implicit shuffle was introduced then taking max of rest (except if there are no others
      in which case the negative value gets chosen and its abs returns default shuffle partitions)
- second change is to add a optional set of alias columns in OrderlessHashPartitioning
  for expression matching to satisfy partitioning in case it is on an alias for partitioning column
  (helps queries like TPCH Q21 where implicit aliases are introduced to resolve clashes in self-joins);
  data sources can use this to pass projection aliases, if any (only snappydata ones in embedded mode)

* [SNAPPYDATA] reverting lazy val to def for defaultNumPreShufflePartitions

use child.outputPartitioning.numPartitions for shuffle partition case instead of depending
on it being defaultNumPreShufflePartitions

* [SNAPPYDATA] Code changes for displaying product version details. (#38)

* [SNAPPYDATA] Fixes for Scala Style precheckin failure. (#39)

* [SNAPPYDATA] Removing duplicate RDD already in snappy-core

Update OrderlessHashPartitioning to allow multiple aliases for a partitioning column.

Reduce plan size statistics by a factor of 2 for groupBy.

* [SNAP-1256] (#41)

set the memory manager as spark's UnifiedMemoryManager, if spark.memory.manager is set as default

* SNAP-1257 (#40)

* SNAP-1257
1. Adding SnappyData Product documentation link on UI.
2. Fixes for SnappyData Product version not displayed issue.

* SNAP-1257:
 Renamed SnappyData Guide link as Docs.

Conflicts:
	core/src/main/scala/org/apache/spark/ui/UIUtils.scala

* [SNAPPYDATA] Spark Version 2.0.3-2

* [SNAP-1185] Guard logging and time measurements (#28)

- add explicit log-level check for some log lines in java code
  (scala code already uses logging arguments as pass-by-name)
- for System.currentTimeInMillis() calls that are used only by logging,
  guard it with the appropriate log-level check
- use System.nanoTime in a few places where duration is to be measured;
  also using a DoubleAccumulator to add results for better accuracy
- cache commonly used logging.is*Enabled flags
- use explicit flag variable in Logging initialized lazily instead of lazy val that causes hang
  in streaming tests for some reason even if marked transient
- renamed flags for consistency
- add handling for possible DoubleAccumulators in a couple of places that expect only
  LongAccumulators in TaskMetrics
- fixing scalastyle error due to 2c432045
Conflicts:
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
	core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala
	core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
	core/src/main/scala/org/apache/spark/storage/BlockManager.scala

* SNAP-1281: UI does not show up if spark shell is run without snappydata (#42)

Fixes: Re-enabling the default spark redirection handler to redirect user to spark jobs page.

* [SNAP-1136] Kryo closure serialtization support and optimizations (#27)

- added back configurable closure serializer in Spark which was removed in SPARK-12414;
  some minor changes taken from closed Spark PR https://github.com/apache/spark/pull/6361
- added optimized Kryo serialization for multiple classes; currently registration and
  string sharing fix for kryo (https://github.com/EsotericSoftware/kryo/issues/128) is
  only in the SnappyData layer PooledKryoSerializer implementation;
  classes providing maximum benefit have added KryoSerializable notably Accumulators and *Metrics
- use closureSerializer for Netty messaging too instead of fixed JavaSerializer
- updated kryo to 4.0.0 to get the fix for kryo#342
- actually fixing scalastyle errors introduced by d80ef1b4
- set ordering field with kryo serialization in GenerateOrdering
- removed warning if non-closure passed for cleaning

* [SNAP-1190] Reduce partition message overhead from driver to executor (#31)

- DAGScheduler:
  - For small enough common task data (RDD + closure) send inline with the Task instead of a broadcast
  - Transiently store task binary data in Stage to re-use if possible
  - Compress the common task bytes to save on network cost
- Task: New TaskData class to encapsulate task compressed bytes from above, the uncompressed length
  and reference index if TaskData is being read from a separate list (see next comments)
- CoarseGrainedClusterMessage: Added new LaunchTasks message to encapsulate multiple
  Task messages to same executor
- CoarseGrainedSchedulerBackend:
  - Create LaunchTasks by grouping messages in ExecutorTaskGroup per executor
  - Actual TaskData is sent as part of TaskDescription and not the Task to easily
    separate out the common portions in a separate list
  - Send the common TaskData as a separate ArrayBuffer of data with the index into this
    list set in the original task's TaskData
- CoarseGrainedExecutorBackend: Handle LaunchTasks by splitting into individual jobs
- CompressionCodec: added bytes compress/decompress methods for more efficient byte array compression
- Executor:
  - Set the common decompressed task data back into the Task object.
  - Avoid additional serialization of TaskResult just to determine the serialization time.
    Instead now calculate the time inline during serialization write/writeExternal methods
- TaskMetrics: more generic handling for DoubleAccumulator case
- Task: Handling of TaskData during serialization to send a flag to indicate whether
  data is inlined or will be received via broadcast
- ResultTask, ShuffleMapTask: delegate handling of TaskData to parent Task class
- SparkEnv: encapsulate codec creation as a zero-arg function to avoid repeated conf lookups
- SparkContext.clean: avoid checking serializability in case non-default closure serializer is being used
- Test updates for above
Conflicts:
	core/src/main/scala/org/apache/spark/SparkEnv.scala
	core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala
	core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
	core/src/main/scala/org/apache/spark/scheduler/Task.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala

* [SNAP-1202] Reduce serialization overheads of biggest contributors in queries (#34)

- Properties serialization in Task now walks through the properties and writes to same buffer
  instead of using java serialization writeObject on a separate buffer
- Cloning of properties uses SerializationUtils which is inefficient. Instead added
  Utils.cloneProperties that will clone by walking all its entries (including defaults if requested)
- Separate out WholeStageCodegenExec closure invocation into its own WholeStageCodegenRDD
  for optimal serialization of its components including base RDD and CodeAndComment.
  This RDD also removes the limitation of having a max of only 2 RDDs in inputRDDs().
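A minimal sketch of cloning `Properties` by walking its entries instead of round-tripping through serialization (illustrative only, not the actual Utils.cloneProperties code):

```scala
import java.util.Properties

// Copy entries directly; stringPropertyNames() also walks the chain of defaults, if any.
def cloneProperties(props: Properties): Properties = {
  val cloned = new Properties()
  val names = props.stringPropertyNames().iterator()
  while (names.hasNext) {
    val name = names.next()
    cloned.setProperty(name, props.getProperty(name))
  }
  cloned
}
```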

* [SNAP-1067] Optimizations seen in perf analysis related to SnappyData PR#381 (#11)

 - added hashCode/equals to UnsafeMapData and optimized hashing/equals for Decimal
   (assuming scale is same for both as in the calls from Spark layer)
 - optimizations to UTF8String: cached "isAscii" and "hash"
 - more efficient ByteArrayMethods.arrayEquals (~3ns vs ~9ns for 15 byte array)
 - reverting aggregate attribute changes (nullability optimization) from Spark layer and instead take care of it on the SnappyData layer; also reverted other changes in HashAggregateExec made earlier for AQP and nullability
 - copy spark-version-info in generateSources target for IDEA
 - updating snappy-spark version after the merge

Conflicts:
	build.gradle
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala

* [SNAP-1067] Optimizations seen in perf analysis related to SnappyData PR#381 (#11)

 - added hashCode/equals to UnsafeMapData and optimized hashing/equals for Decimal
   (assuming scale is same for both as in the calls from Spark layer)
 - optimizations to UTF8String: cached "isAscii" and "hash"
 - more efficient ByteArrayMethods.arrayEquals (~3ns vs ~9ns for 15 byte array)
 - reverting aggregate attribute changes (nullability optimization) from Spark layer and instead take care of it on the SnappyData layer; also reverted other changes in HashAggregateExec made earlier for AQP and nullability
- copy spark-version-info in generateSources target for IDEA
Conflicts:
	common/unsafe/src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java

* [SNAPPYDATA] Bootstrap perf (#16)

1) Reducing the generated code size when writing a struct having all fields of the same data type.
2) Fixing an issue in WholeStageCodeGenExec where a plan supporting codegen was not being
   prefixed by InputAdapter in case the node did not participate in whole-stage code gen.

* [SNAPPYDATA] Skip cast if non-nullable type is being inserted in nullable target

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

* [SNAPPYDATA] optimized versions for a couple of string functions

* [SNAPPYDATA] Update to gradle-scalatest version 0.13.1

* Snap 982 (#43)

* a) Added a method in SparkContext to manipulate addedJar. This is a workaround for SNAP-1133.
b) made repl classloader a variable in Executor.scala

* Changed Executor field variable to protected.

* Changed build.gradle of launcher and network-yarn to exclude netty dependencies, which were causing some messages to hang.
Made urlclassLoader in Executor.scala a variable.

* Made Utils.doFetchFile method public.

* Made Executor.addReplClassLoaderIfNeeded() method as public.

* [SNAPPYDATA] Increasing the code generation cache eviction size to 300 from 100

* [SNAP-1398] Update janino version to latest 3.0.x

This works around some of the limitations of older janino versions causing SNAP-1398

* [SNAPPYDATA] made some methods protected to be used by SnappyUnifiedManager (#47)

* SNAP-1420

What changes were proposed in this pull request?

Logging level of cluster manager classes is changed to info in store-log4j.properties. But there are multiple task-level logs which generate a lot of unnecessary info-level logs. Changed these logs from info to debug.
Other PRs

#48
SnappyDataInc/snappy-store#168
SnappyDataInc/snappydata#573

* [SNAPPYDATA] Reducing file read/write buffer sizes

Reduced buffer sizes from 1M to 64K to reduce unaccounted memory overhead.
Disk read/write buffers beyond 32K don't help in performance in any case.

* [SNAP-1486] make QueryPlan.cleanArgs a transient lazy val (#51)

cleanArgs can end up holding transient fields of the class which can be
recalculated on the other side if required in any case.

Also added full exception stack for cases of task listener failures.

* SNAP-1420 Review

What changes were proposed in this pull request?

Added a task logger that does task based info logging. This logger has WARN as log level by default. Info logs can be enabled using the following setting in log4j.properties.

log4j.logger.org.apache.spark.Task=INFO
How was this patch tested?

Manual testing.
Precheckin.

* [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMap (#53)

Merging the Spark fix.
Radix sort requires that half of the array be free (as temporary space), so we use 0.5 as the scale factor to make sure that BytesToBytesMap will not have more items than 1/2 of capacity. It turned out this is not true: the current implementation of append() could leave one more item than the threshold (1/2 of capacity) in the array, which breaks the requirement of radix sort (failing the assert in 2.2, or failing to insert into InMemorySorter in 2.1).

This PR fixes the off-by-one bug in BytesToBytesMap.

This PR also fixes a bug where the array would never grow if it failed to grow once (staying at initial capacity), introduced by #15722.
Conflicts:
	core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java

* SNAP-1545: Snappy Dashboard UI Revamping (#52)

Changes:
  - Adding new methods simpleSparkPageWithTabs_2 and commonHeaderNodes_2 for custom snappy UI changes
  - Adding javascript librarires d3.js, liquidFillGauge.js and snappy-dashboard.js for snappy UI new widgets and styling changes.
  - Updating snappy-dashboard.css for new widgets and UI content stylings
  - Relocating snappy-dashboard.css into ui/static/snappydata directory.

* [SNAPPYDATA] handle "prepare" in answer comparison inside Map types too

* [SNAPPYDATA] fixing scalastyle errors introduced in previous commits

* SNAP-1698: Snappy Dashboard UI Enhancements (#55)

* SNAP-1698: Snappy Dashboard UI Enhancements
Changes:
  - CSS styling and JavaScript code changes for displaying Snappy cluster CPU usage widget.
  - Removed Heap and Off-Heap usage widgets.
  - Adding icons/styling for displaying drop down and pull up carets/pointers to expand cell details.
  - Adding handler for toggling expand and collapse cell details.

* [SNAPPYDATA] reduce a byte copy reading from ColumnVector

When creating a UTF8String from a dictionary item from ColumnVector, avoid a copy
by creating it over the range of bytes directly.

Conflicts:
	sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java

* [SNAPPYDATA] moved UTF8String.fromBuffer to Utils.stringFromBuffer

This is done to maintain full compatibility with upstream spark-unsafe module.

Conflicts:
	sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java

* [SNAPPYDATA] reverting changes to increase DECIMAL precision to 127

The changes to DECIMAL precision were incomplete and broken in more ways than one.
The other reason being that future DECIMAL optimization for operations in
generated code will depend on value to fit in two longs and there does not seem
to be a practical use-case of having precision >38 (which is not supported
    by most mainstream databases either)

Renamed UnsafeRow.todata to toData for consistency.

Conflicts:
	sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java

* [SNAPPYDATA][MERGE-2.1] Some fixes after the merge

- Fix for SnappyResourceEventsDUnitTest from Rishi
- Scala style fixes from Sachin J
- deleting unwanted files
- reverting some changes that crept in inadvertently

More code changes:

- adding dependency for org.fusesource.leveldbjni, com.fasterxml.jackson.core,
  io.dropwizard.metrics, io.netty and org.apache.commons
- fixing compilation issues after merge
- adding dependency for jetty-client, jetty-proxy and mllib-local for graphx
- bumped up parquetVersion and scalanlp breeze
- fixed nettyAllVersion, removed hardcoded value
- bumped up version
- Implement Kryo.read/write for subclasses of Task
- Do not implement KryoSerializable in Task
- spark.sql.warehouse.dir moved to StaticSQLConf
- moved VECTORIZED_AGG_MAP_MAX_COLUMNS from StaticSQLConf to SQLConf
- corrected jackson-databind version

* [SNAPPYDATA][MERGE-2.1]

- Removed SimplifyCasts, RemoveDispensableExpressions
- Fixed precheckin failuers
- Fixed Task serialization issues
- Serialize new TaskMetrics using Kryo serializer
- Pass extraOptions in case of saveAsTable
- removed debug statement
- SnappySink for structured streaming query result

* [SNAPPYDATA][MERGE-2.1]

removed struct streaming classes

* [SNAPPYDATA][MERGE-2.1]

- Avoid splitExpressions for DynamicFoldableExpressions. This used to create a lot of codegen issues
- Bump up the Hadoop version, to avoid issues in IDEA.
- Modified AnalysisException to use getSimpleMessage

* [SNAPPYDATA][MERGE-2.1]

- Handled Array[Decimal] type in ScalaReflection,
  fixes SNAP-1772 (SplitSnappyClusterDUnitTest#testComplexTypesForColumnTables_SNAP643)
- Fixing scalaStyle issues
- updated .gitignore; gitignore build-artifacts and .gradle

* [SNAPPYDATA][MERGE-2.1] Missing patches and version changes

- updated optimized ByteArrayMethods.arrayEquals as per the code in Spark 2.1
  - adapt the word alignment code and optimize it a bit
  - in micro-benchmarks the new method is 30-60% faster than upstream version;
    at larger sizes it is 40-50% faster meaning its base word comparison loop itself is faster
- increase default locality time from 3s to 10s since the previous code to force
  executor-specific routing if it is alive has been removed
- added back cast removal optimization when types differ only in nullability
- add serialization and proper nanoTime handling from *CpuTime added in Spark 2.1.x;
  use DoubleAccumulator for these new fields like done for others to get more accurate results;
  also avoid the rare conditions where these cpu times could be negative
- cleanup handling of jobId and related new fields in Task with kryo serialization
- reverted change to AnalysisException with null check for plan since it is transient now
- reverted old Spark 2.0 code that was retained in InsertIntoTable and changed to Spark 2.1 code
- updated library versions and make them uniform as per upstream Spark for
  commons-lang3, metrics-core, py4j, breeze, univocity; also updated exclusions as
  per the changes to Spark side between 2.0.2 to 2.1.0
- added gradle build for the new mesos sub-project

* [SNAP-1790] Fix one case of incorrect offset in ByteArrayMethods.arrayEquals

The endOffset incorrectly uses current leftOffset+length when the leftOffset
may already have been incremented for word alignment.

* Fix from Hemant for fialing :docs target during precheckin run (#61)

* SNAP-1794 (#59)

* Retaining Spark's CodeGenerator#splitExpressions changes

* [SNAP-1389] Optimized UTF8String.compareTo (#62)

- use unsigned long comparisons, followed by unsigned int comparison if possible,
  before finishing with unsigned byte comparisons for better performance
- use big-endian long/int for comparison since it requires the lower-index characters
  to be MSB positions
- no alignment attempted since we expect most cases to fail early in first long comparison itself

Detailed performance results in https://github.com/SnappyDataInc/spark/pull/62
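A simplified sketch of the 8-bytes-at-a-time idea on plain byte arrays (offsets, Platform access and UTF-8 specifics elided; not the actual UTF8String code):

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Lexicographic unsigned comparison: one big-endian long per step, byte-wise tail.
def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
  val len = math.min(a.length, b.length)
  val wordLen = len & ~7
  val bufA = ByteBuffer.wrap(a).order(ByteOrder.BIG_ENDIAN)
  val bufB = ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN)
  var i = 0
  while (i < wordLen) {
    val cmp = java.lang.Long.compareUnsigned(bufA.getLong(i), bufB.getLong(i))
    if (cmp != 0) return cmp
    i += 8
  }
  while (i < len) {
    val cmp = (a(i) & 0xff) - (b(i) & 0xff)
    if (cmp != 0) return cmp
    i += 1
  }
  a.length - b.length
}
```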

*  [SNAPPYDATA][PERF] Fixes for issues found during concurrency testing (#63)

## What changes were proposed in this pull request?

Moved the regex patterns outside the functions into static variables to avoid their recreation.
Made WholeStageCodeGenRDD as a case class so that its member variables can be accessed using productIterator. 

## How was this patch tested?
Precheckin 

## Other PRs

https://github.com/SnappyDataInc/snappy-store/pull/247
https://github.com/SnappyDataInc/snappydata/pull/730

* [SNAPPYDATA][PERF] optimized pattern matching for byte/time strings

also added slf4j excludes to some imports

* SNAP-1792: Display snappy members logs on Snappy Pulse UI (#58)

Changes:
  - Adding snappy member details javascript for new UI view named SnappyData Member Details Page

* SNAP-1744: UI itself needs to consistently refer to itself as "SnappyData Pulse" (#64)

* SNAP-1744: UI itself needs to consistently refer to itself as "SnappyData Pulse"
Changes:
 - SnappyData Dashboard UI is named as SnappyData Pulse now.
 - Code refactoring and code clean up.

* Removed Array[Decimal] handling from spark layer as it only fixes embedded mode. (#66)

* Removed Array[Decimal] handling from spark layer as it only fixes embedded mode

* Snap 1890 : Snappy Pulse UI suggestions for 1.0 (#69)

* SNAP-1890: Snappy Pulse UI suggestions for 1.0
Changes:
 - SnappyData logo shifted to right most side on navigation tab bar.
 - Adding SnappyData's own new Pulse logo on left most side on navigation tab bar.
 - Displaying SnappyData Build details along with product version number on Pulse UI.
 - Adding CSS,HTML, JS code changes for displaying version details pop up.

* [SNAP-1377,SNAP-902] Proper handling of exception in case of Lead and Server HA (#65)

* [SNAP-1377] Added callback used for checking CacheClosedException

* [SNAP-1377] Added check for GemfirexdRuntimeException and GemfireXDException

* Added license header in source file

* Fix issue seen during precheckin

* Snap 1833 (#67)

Added a fallback path for WholeStageCodeGenRDD. As we dynamically change the classloader, generated code compile time classloaders and runtime class loader might be different. There is no clean way to handle this apart from recompiling the generated code.
This code path will be executed only in case of components having dynamically changing class loaders i.e Snappy jobs & UDFs. Other sql queries won't be impacted by this.

* Refactored the executor exception handling for cache (#71)

Refactored the executor exception handling for cache closed exception.

* [SNAP-1930] Rectified a code in WholeStageCodeGenRdd. (#73)

This change will avoid repeatedly calling code compilation incase of a ClassCastException.

* Snap 1813 : Security - Add Server (Jetty web server) level user authentication for Web UI in SnappyData. (#72)

* SNAP-1813: Security - Add Server (Jetty web server) level user authentication for Web UI in SnappyData.
Changes:
 - Adding Securty handler in jetty server with Basic Authentication.
  - Adding LDAP Authentication code changes for Snappy UI. Authenticator (SnappyBasicAuthenticator) is initialized by snappy leader.

* [SNAPPYDATA] fixing scalastyle failure introduced by last commit

merge of SNAP-1813 in 6b8f59e58f6f21103149ebacebfbaa5b7a5cbf00 introduced scalastyle failure

* Resized company logo (#74)

* Changes:.
 - Adding resized SnappyData Logo for UI .
 - Displaying spark version in version details pop up.
 - Code/Files(unused logo images) clean up.
 - Updated CSS

* [SNAPPYDATA] update janino to latest release 3.0.7

* [SNAP-1951] move authentication handler bind to be inside connect (#75)

When bind to default 5050 port fails, then code clears the loginService inside
SecurityHandler.close causing the next attempt on 5051 to fail with
"IllegalStateException: No LoginService for SnappyBasicAuthenticator".

This change moves the authentication handler setting inside the connect method.

* Bump version spark 2.1.1.1-rc1, store 1.5.6-rc1 and sparkJobserver 0.6.2.6-rc1

* Updated the year in the Snappydata copyright header. (#76)

* [SNAPPYDATA] upgrade netty versions (SPARK-18971, SPARK-18586)

- upgrade netty-all to 4.0.43.Final (SPARK-18971)
- upgrade netty-3.8.0.Final to netty-3.9.9.Final for security vulnerabilities (SPARK-18586)

* Added code to dump generated code in case of exception (#77)

## What changes were proposed in this pull request?

Added code to dump the generated code in case of an exception on the server side. The hasNext function of the iterator is the one that fails in case of an exception. Added exception handling for next as well, just in case.

## How was this patch tested?

Manual. Precheckin.

* [SNAPPYDATA] more efficient passing of non-primitive literals

Instead of using CodegenFallback, add the value directly as reference object.
Avoids an unncessary cast for every loop (and a virtual call)
  as also serialized object is smaller.

* [SNAP-1993] Optimize UTF8String.contains (#78)

- Optimized version of UTF8String.contains that improves performance by 40-50%.
  However, it is still 1.5-3X slower than JDK String.contains (that probably uses JVM intrinsics
  since the library version is slower than the new UTF8String.contains)
- Adding native JNI hooks to UTF8String.contains and ByteArrayMethods.arrayEquals if 
  present.

Comparison when searching in decently long strings (100-200 characters from customers.csv treating full line as a single string).

Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01 on Linux 4.10.0-33-generic
Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
compare contains:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
UTF8String (orig)                              241 /  243          4.7         214.4       1.0X
UTF8String (opt)                               133 /  137          8.4         118.4       1.8X
String                                          97 /   99         11.6          86.4       2.5X
Regex                                          267 /  278          4.2         237.5       0.9X

* Fix to avoid dumping of gen code in case of low memory exception.  (#79)

* Don't log the generated code when a low-memory exception is being thrown. Also addressed a review comment to print the exception message before the generated code.

* [SNAPPYDATA][AQP-293] Native JNI callback changes for UTF8String (#80)

- added MacOSX library handling to Native; made minimum size to use JNI
  as configurable (system property "spark.utf8.jniSize")
- added compareString to Native API for string comparison
- commented out JNI for ByteArrayMethods.arrayEquals since it is seen to be less efficient
  for cases where match fails in first few bytes (JNI overhead of 5-7ns is far more)
- made the "memory leak" warning in Executor to be debug level; reason being that
  it comes from proper MemoryConsumers so its never a leak and it should not be
  required of MemoryConsumers to always clean up memory
  (unnecessary additional task listeners for each ParamLiteral)
- pass source size in Native to make the API uniform

* [SNAPPYDATA] update jetty version

update jetty to latest 9.2.x version in an attempt to fix occasional "bad request" errors
seen currently on dashboard

* [SNAP-2033] pass the original number of buckets in table via OrderlessHashPartitioning (#82)

also reduced parallel forks in tests to be same as number of processors/cores

* Update versions for snappydata 1.0.0, store 1.6.0, spark 2.1.1.1 and spark-jobserver 0.6.2.6

* [SNAPPYDATA] use common "vendorName" in build scripts

* [SPARK-21967][CORE] org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance

* Using 64 bit unsigned long comparison instead of unsigned int comparison in `org.apache.spark.unsafe.types.UTF8String#compareTo` for better performance.
* Making `IS_LITTLE_ENDIAN` a constant for correctness reasons (shouldn't use a non-constant in `compareTo` implementations and it def. is a constant per JVM)

Build passes and the functionality is widely covered by existing tests as far as I can see.

Author: Armin <me@obrown.io>

Closes #19180 from original-brownbear/SPARK-21967.

* [SNAPPYDATA] relax access-level of Executor thread pools to protected

* [SNAPPYDATA] Fix previous conflict in GenerateUnsafeProjection (#84)

From @jxwr: remove two useless lines.

* [SPARK-18586][BUILD] netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193

## What changes were proposed in this pull request?

Force update to latest Netty 3.9.x, for dependencies like Flume, to resolve two CVEs. 3.9.2 is the first version that resolves both, and, this is the latest in the 3.9.x line.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16102 from srowen/SPARK-18586.

* [SPARK-18951] Upgrade com.thoughtworks.paranamer/paranamer to 2.6

## What changes were proposed in this pull request?
I recently hit a bug in com.thoughtworks.paranamer/paranamer, which causes jackson to fail to handle a byte array defined in a case class. Then I found https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests that it is caused by a bug in paranamer. Let's upgrade paranamer. Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 uses com.thoughtworks.paranamer/paranamer 2.6, I suggest we upgrade paranamer to 2.6.

Author: Yin Huai <yhuai@databricks.com>

Closes #16359 from yhuai/SPARK-18951.

* [SPARK-18971][CORE] Upgrade Netty to 4.0.43.Final

## What changes were proposed in this pull request?

Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16568 from zsxwing/SPARK-18971.

* [SPARK-19409][BUILD] Bump parquet version to 1.8.2

## What changes were proposed in this pull request?

According to the discussion on #16281 which tried to upgrade toward Apache Parquet 1.9.0, Apache Spark community prefer to upgrade to 1.8.2 instead of 1.9.0. Now, Apache Parquet 1.8.2 is released officially last week on 26 Jan. We can use 1.8.2 now.

https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142%3Cdev.parquet.apache.org%3E

This PR only aims to bump Parquet version to 1.8.2. It didn't touch any other codes.

## How was this patch tested?

Pass the existing tests and also manually by doing `./dev/test-dependencies.sh`.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16751 from dongjoon-hyun/SPARK-19409.

* [SPARK-19409][BUILD][TEST-MAVEN] Fix ParquetAvroCompatibilitySuite failure due to test dependency on avro

## What changes were proposed in this pull request?

After using Apache Parquet 1.8.2, `ParquetAvroCompatibilitySuite` fails on **Maven** test. It is because `org.apache.parquet.avro.AvroParquetWriter` in the test code used new `avro 1.8.0` specific class, `LogicalType`. This PR aims to fix the test dependency of `sql/core` module to use avro 1.8.0.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2530/consoleFull

```
ParquetAvroCompatibilitySuite:
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
  at org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
```

## How was this patch tested?

Pass the existing test with **Maven**.

```
$ build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver test
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:07 h
[INFO] Finished at: 2017-02-04T05:41:43+00:00
[INFO] Final Memory: 77M/987M
[INFO] ------------------------------------------------------------------------
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16795 from dongjoon-hyun/SPARK-19409-2.

* [SPARK-19411][SQL] Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown

There is a metadata introduced before to mark the optional columns in merged Parquet schema for filter predicate pushdown. As we upgrade to Parquet 1.8.2 which includes the fix for the pushdown of optional columns, we don't need this metadata now.

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16756 from viirya/remove-optional-metadata.

* [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/hacks due to bugs of old Parquet versions

## What changes were proposed in this pull request?

We've already upgraded parquet-mr to 1.8.2. This PR does some further cleanup by removing a workaround of PARQUET-686 and a hack due to PARQUET-363 and PARQUET-278. All three Parquet issues are fixed in parquet-mr 1.8.2.

## How was this patch tested?

Existing unit tests.

Author: Cheng Lian <lian@databricks.com>

Closes #16791 from liancheng/parquet-1.8.2-cleanup.

* [SPARK-20449][ML] Upgrade breeze version to 0.13.1

Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.

Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17746 from yanboliang/spark-20449.

(cherry picked from commit 67eef47acfd26f1f0be3e8ef10453514f3655f62)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>

* [SNAPPYDATA] version upgrades as per previous cherry-picks

Following cherry-picked versions for dependency upgrades that fix various issues:
553aac5, 1a64388, a8567e3, 26a4cba, 55834a8

Some were already updated in snappy-spark while others are handled in this.

* Snap 2044 (#85)

* Corrected SnappySession code.

* Snap 2061 (#83)

* added previous code for reference

* added data validation in the test

* Incorporated review comments. added test for dataset encoder conversion to dataframe.

* [SNAPPYDATA] build changes/fixes (#81)

- update gradle to 3.5
- updated many dependencies to latest bugfix releases
- changed provided dependencies to compile/compileOnly
- changed deprecated "<<" with doLast
- changed deprecated JavaCompile.forkOptions.executable with javaHome
- gradlew* script changes as from upstream release
  (as updated by ./gradlew wrapper --gradle-version 3.5.1)

* [SNAP-2061] fix scalastyle errors, add test

- fix scalastyle errors in SQLContext
- moved the Dataset/DataFrame nested POJO tests to JavaDatasetSuite from SQLContextSuite
- added test for Dataset.as(Encoder) for nested POJO in the same

* [SPARK-17788][SPARK-21033][SQL] fix the potential OOM in UnsafeExternalSorter and ShuffleExternalSorter

In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for pointer, 1 `long` for key-prefix, and another 2 `long`s as the temporary buffer for radix sort.

In `UnsafeExternalSorter`, we set the `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to be `1024 * 1024 * 1024 / 2`, and hoping the max size of point array to be 8 GB. However this is wrong, `1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the point array before reach this limitation, we may hit the max-page-size error.

Users may see exception like this on large dataset:
```
Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
...
```

Setting `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to a smaller number is not enough, users can still set the config to a big number and trigger the too large page size issue. This PR fixes it by explicitly handling the too large page size exception in the sorter and spill.

This PR also change the type of `spark.shuffle.spill.numElementsForceSpillThreshold` to int, because it's only compared with `numRecords`, which is an int. This is an internal conf so we don't have a serious compatibility issue.

TODO

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18251 from cloud-fan/sort.

* [SNAPPYDATA] add missing jersey-hk2 dependency

required after the upgrade to jersey 2.26 that does not include it automatically
(used by Executors tab in the GUI)

guard debug logs with "debugEnabled()"

* [SNAPPYDATA][SNAP-2120] make codegen cache size configurable (#87)

- use "spark.sql.codegen.cacheSize" to set codegenerator cache size else default to 1000
- also added explicit returns in MemoryPool else it does boxing/unboxing inside
  the sync block that also shows up in perf analysis (can be seen via decompiler too)
- avoid NPE for "Stages" tab of a standby lead

* Snap 2084 (#86)

If SnappyUMM is found in the classpath, SparkEnv will assign the memory manager to SnappyUMM. If the user has explicitly set the memory manager, that will take precedence.

* [SNAPPYDATA] some optimizations to ExecutionMemoryPool

- avoid multiple lookups into the map in ExecutionMemoryPool.releaseMemory
- avoid an unnecessary boxing/unboxing by adding explicit return from lock.synchronized blocks

* [SNAP-2087] fix ArrayIndexOutOfBoundsException with JSON data

- issue is the custom code generation added for homogeneous Struct types
  where isNullAt check used an incorrect index variable
- also cleaned up determination of isHomogeneousStruct in both safe/unsafe projection

* [SNAPPYDATA] fixing all failures in snappy-spark test suite

Three broad categories of issues fixed:

- handling of double values in JSON conversion layer of the metrics; upstream spark has all
  metrics as Long but snappy-spark has the timings one as double to give more accurate results
- library version differences between Spark's maven poms and SnappyData's gradle builds;
  these are as such not product issues but this checkin changes some versions to be
  matching to maven builds to be fully upstream compatible
- path differences in test resource files/jars when run using gradle rather than using maven

Other fixes and changes:

- the optimized Decimal.equals gave incorrect result in case the scale of the two is different;
  this followed the Java BigDecimal convention of returning false if the scale is different
  but that is incorrect as per Spark's conventions; this should normally not happen from catalyst
  layer but can happen in RDD operations
- correct accumulator result in Task to be empty rather than null when nothing present
- override the extended two argument DStream.initialize in MapWithStateDStream.initialize
- correct the UI path for Spark cache to be "/Spark Cache/" rather than "/storage/"
- avoid sending the whole child plan across in DeserializeToObjectExec to executors when
  only the output is required (also see SNAP-1840 caused due to this)
- rounding of some of the time statistics (that are accumulated as double) in Spark metrics
- SparkListenerSuite local metrics tests frequently failed due to deserialization time being zero
  (despite above change); the reason being the optimizations in snappy-spark that allow it to
  run much quicker and not registering even with System.nanoTime(); now extended the closure
  to force a 1 milliseond sleep in its readExternal method
- use spark.serializer consistently for data only and spark.closureSerializer for others
  (for the case the two are different)
- don't allow inline message size to exceed spark.rpc.message.maxSize
- revert default spark.locality.wait to be 3s in Spark (will be set at snappydata layer if required)
- make SparkEnv.taskLogger to be serializable if required (extend Spark's Logging trait)
- account for task decompression time in the deserialization time too

The full spark test suite can be run either by:

- ./dev/snappy-build.sh && ./dev/snappy-build.sh test (or equivalent)
- ./gradlew check
- from SnappyData:
  - ./gradlew snappy-spark:check, OR
  - ./gradlew precheckin -Pspark (for full test suite run including snappydata suite)

For SnappyData product builds, one of the last two ways from SnappyData should be used

* [SNAPPYDATA] fixing one remaining failure in gradle runs

* Preserve the preferred location in MapPartitionRDD. (#92)

* * SnappyData Spark Version 2.1.1.2

* [SNAP-2218] honour timeout in netty RPC transfers (#93)

use a future for enforcing timeout (2 x configured value) in netty RPC transfers
after which the channel will be closed and fail

* Check for null connection. (#94)

If connection is not established properly null connection should be handled properly.

* [SNAPPYDATA] revert changes in Logging to upstream

reverting flag check optimization in Logging to be compatible with upstream Spark

* [SNAPPYDATA] Changed TestSparkSession in test class APIs to base SparkSession

This is to allow override by SnappySession extensions.

* [SNAPPYDATA] increased default codegen cache size to 2K

also added MemoryMode in MemoryPool warning message

* [SNAP-2225] Removed OrderlessHashPartitioning. (#95)

Handled join order in optimization phase. Also removed custom changes in HashPartition. We won't store bucket information in HashPartitioning. Instead based on the flag "linkPartitionToBucket" we can determine the number of partitions to be either numBuckets or num cores assigned to the executor. 
Reverted changes related to numBuckets in Snappy Spark.

* [SNAP-2242] Unique application names & kill app by names (#98)

The standalone cluster should support unique application names. As they are user visible and easy to track user can write scripts to kill applications by names.
Also, added support to kill Spark applications by names(case insensitive).

* [SNAPPYDATA] make Dataset.boundEnc as lazy val

avoid materializing it immediately (for point queries that won't use it)

* Fix for SNAP-2342: enclose with braces when the child plan of aggregate nodes is not a simple relation or subquery alias (#101)

* Snap 1334 : Auto Refresh feature for Dashboard UI  (#99)

* SNAP-1334:

Summary:

- Fixed the JQuery DataTable Sorting Icons problem in the Spark UI by adding missing sort icons and CSS.

- Adding new snappy-commons.js JavaScript for common utility functions used by Snappy Web UI.

- Updated Snappy Dashboard and Member Details JavaScripts for following
   1. Creating and periodically updating JQuery Data Tables for Members, Tables and External Tables tabular lists.
   2. Loading, creating and updating Google Charts.
   3. Creating and periodically updating the Google Line Charts for CPU and various Memory usages.
   4. Preparing and making AJAX calls to snappy specific web services.
   5. Updated/cleanup of Spark UIUtils class.

Code Change details:

- Sparks UIUtils.headerSparkPage customized to accommodate snappy specific web page changes.
- Removed snappy specific UIUtils.simpleSparkPageWithTabs as most of the content was similar to UIUtils.headerSparkPage.
- Adding snappy-commons.js javascript script for utility functions used by Snappy UI.
- JavaScript implementation of the New Members Grid on the Dashboard page for displaying member stats, which will auto-refresh periodically.
- JavaScript code changes for rendering collapsible details in members grid for description, heap and off-heap.
- JavaScript code changes for rendering progress bar for CPU and Memory usages.
- Display value as "NA" wherever applicable in case of Locator node.
- JavaScript code implementation for displaying Table stats and External Table stats.
- Changes for periodic updating of Table stats and External Table stats.
- CSS updated for page styling and code formatting.
- Adding Sort Control Icons for data tables.
- Code changes for adding, loading and rendering Google Charts for snappy members' usage trends.
- Displaying cluster level usage trends for Average CPU, Heap and Off-Heap with their respective storage and execution splits and Disk usage.
- Removed Snappy page specific javaScripts from UIUtils to respective page classes.
- Grouped all dashboard related ajax calls into single ajax call clusterinfo.
- Utility function convertSizeToHumanReadable is updated in snappy-commons.js to include TB size.
- All line charts updated to include crosshair pointer/marks.
- Chart titles updated with % sign and GB for size to indicate values are in percents or in GB.
- Adding function updateBasicMemoryStats to update members basic memory stats.
- Displaying Connection Error message whenever cluster goes down.
- Disable sorting on Heap and Off-Heap columns, as cell contains multiple values in different units.

* Fixes for SNAP-2376: (#102)

- Adding 5 seconds timeout for auto refresh AJAX calls.
- Displays request timeout message in case AJAX request takes longer than 5 seconds.

* [SNAP-2379] App was getting registered with error (#103)

This change pertains to the modification to Standalone cluster for not allowing applications with the same name.
The change was erroneous and was allowing the app to get registered even after determining a duplicate name.

* Fixes for SNAP-2383: (#106)

- Adding code changes for retaining page selection in tables during stats auto refresh.

* Handling of POJOs containing an array of POJOs while creating data frames (#105)

* Handling of POJOs containing an array of POJOs while creating data frames

* added bug test for SNAP-2384

* Spark compatibility (#107)

Made overrideConfs a variable and made a method protected.

* Fixes for SNAP-2400 : (#108)

- Removed (commented out) timeout from AJAX calls.

* Code changes for SNAP-2144: (#109)

* Code changes for SNAP-2144:
 - JavaScript and CSS changes for displaying CPU cores details on Dashboard page.
 - Adding animation effect to CPU Core details.

* Fixes for SNAP-2415: (#110)

- Removing z-index.

* Fixing scala style issue.

* Code changes for SNAP-2144:
  - Display only Total CPU Cores count and remove cores count break up (into locators, leads
    and data servers).

* Reverting previous commit.

* Code changes for SNAP-2144: (#113)

- Display only Total CPU Cores count and remove cores count break up (into locators, leads
    and data servers).

* Fixes for SNAP-2422: (#112)

  - Code changes for displaying error message if loading Google charts library fails.
  - Code changes for retrying loading of Google charts library.
  - Update Auto-Refresh error message to guide user to go to lead logs if there is any connectivity issue.

* Fix to SNAP-2247 (#114)

* This is a Spark bug.
Please see PR https://github.com/apache/spark/pull/17529
Needed to do a similar change in the code path of prepared statements
where precision needed to be adjusted if smaller than the scale.

* Fixes for SNAP-2437: (#115)

- Updating CSS, to fix the member description details alignment issue.

* SNAP-2307 fixes (#116)

SNAP-2307 fixes related to SnappyTableScanSuite

* reverting changes done in pull request #116 (#119)

Merging after discussing with Rishi

* Code changes for ENT-21: (#118)

- Adding skipHandlerStart flag based on which handler can be started, wherever applicable.
 - Updating access specifiers.

* * Bump up version to 2.1.1.3

* [SNAPPYDATA] fixed scalastyle

* * Version 2.1.1.3-RC1

* Code changes for SNAP-2471: (#120)

- Adding close button in the SnappyData Version Details Pop Up to close it.

* * [ENT-46] Mask sensitive information. (#121)

* Code changes for SNAP-2478: (#122)

 - Updating font size of members basic statistics on Member Details Page.
 - Display External Tables only if available.

* Fixes for SNAP-2377: (#123)

- To fix Trend charts layout issue, changing fixed width to width in percent for all trends charts on UI.

* [SNAPPY-2511] initialize SortMergeJoin build-side scanner lazily (#124)

Avoid sorting the build side of SortMergeJoin if the streaming side is empty.

This already works that way for inner joins with code generation, where the build side
is initialized on the first call from processNext (using the generated variable
"needToSort" in SortExec). This change also enables the behaviour for non-inner
join queries that use "SortMergeJoinScanner", which previously instantiated the build side upfront.
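
A rough sketch of the lazy build-side idea (type parameters and method names here are assumptions, not the real SortMergeJoinScanner API):

```scala
// The build-side iterator is wrapped in a lazy val, so if the streamed side is
// empty it is never created, and the build side is never sorted or advanced.
class LazyBuildSideScanner[T](streamedIter: Iterator[T], buildSide: () => Iterator[T]) {
  private lazy val buildIter: Iterator[T] = buildSide() // created on first use only

  def nextMatch(): Option[(T, T)] =
    if (!streamedIter.hasNext) {
      None // empty streamed side: buildIter stays uninitialized
    } else {
      val streamedRow = streamedIter.next()
      if (buildIter.hasNext) Some((streamedRow, buildIter.next())) else None
    }
}
```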

* [SPARK-24950][SQL] DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13

- Update DateTimeUtilsSuite so that when testing roundtripping in daysToMillis and millisToDays multiple skip dates can be specified.
- Updated test so that both New Year's Eve 2014 and New Year's Day 2015 are skipped for Kiribati time zones. This is necessary as Java versions pre 181-b13 considered New Year's Day 2015 to be skipped while subsequent versions corrected this to New Year's Eve.
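
A condensed sketch of the roundtrip check with skip dates (helper names are illustrative; the real assertions live in DateTimeUtilsSuite):

```scala
// Roundtripping holds for a day unless it is one of the explicitly skipped dates
// (dates that differ between JDK versions for some time zones).
def roundtripHolds(days: Int, daysToMillis: Int => Long, millisToDays: Long => Int,
                   skippedDays: Set[Int]): Boolean =
  skippedDays.contains(days) || millisToDays(daysToMillis(days)) == days
```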

Unit tests

Author: Chris Martin <chris@cmartinit.co.uk>

Closes #21901 from d80tb7/SPARK-24950_datetimeUtilsSuite_failures.

(cherry picked from commit c5b8d54c61780af6e9e157e6c855718df972efad)
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SNAP-2569] remove explicit HiveSessionState dependencies

To enable using any SparkSession with Spark's HiveServer2, explicit
dependencies on HiveSessionState in processing have been removed.

* [SNAPPYDATA] make Benchmark class compatible with upstream

* [SNAPPYDATA] fix default bind-address of ThriftCLIService

- ThriftCLIService uses InetAddress.getLocalHost() as default address to be shown
  but hive thrift server actually uses InetAddress.anyLocalAddress()
- honour bind host property in ThriftHttpCLIService too

* [SNAPPYDATA] generate spark-version-info.properties in source path

spark-version-info.properties is now generated in src/main/extra-resources
rather than in build output so that IDEA can pick it up cleanly

remove Kafka-0.8 support from build: updated examples for Kafka-0.10

* [SNAPPYDATA] Increase hive-thrift shell history file size to 50000 lines

- skip init to set history max-size else it invokes load() in constructor
  that truncates the file to default 500 lines
- update jline to 2.14.6 for this new constructor (https://github.com/jline/jline2/issues/277)
- add explicit dependency on jline2 in hive-thriftserver to get the latest version

* [SNAPPYDATA] fix RDD info URLs to "Spark Cache"

- corrected the URL paths for RDDs to use /Spark Cache/ instead of /storage/
- updated affected tests

* [SNAPPYDATA] improved a gradle dependency to avoid unnecessary re-evaluation

* Changed the year from 2017 to 2018 in license headers.

* SNAP-2602 : On snappy UI, add column named "Overflown Size"/ "Disk Size" in Tables. (#127)

* Changes for SNAP-2602:
 - JavaScript changes for displaying a table's size overflowed to disk as Spill-To-Disk size.

* Changes for SNAP-2612: (#126)

- Displaying external tables fully qualified name (schema.tablename).

* SNAP-2661 : Provide Snappy UI User a control over Auto Update (#128)

* Changes for SNAP-2661 : Provide Snappy UI User a control over Auto Update
 - Adding JavaScript and CSS code changes for Auto Update ON/OFF Switch on Snappy UI (Dashboard and Member Details page).

* [SNAPPYDATA] Property to set if hive meta-store client should use isolated ClassLoader (#132)

- added a property to allow setting whether hive client should be isolated or not
- improved message for max iterations warning in RuleExecutor

* [SNAP-2751] Enable connecting to secure SnappyData via Thrift server (#130)

* * Changes from @sumwale to set the credentials from thrift layer into session conf.

* * This fixes an issue with RANGE operator in non-code generated plans (e.g. if too many target table columns)
* Patch provided by @sumwale

* avoid dumping generated code in quick succession for exceptions
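
One possible shape of that throttling, sketched here with an assumed minimum interval (the actual implementation may differ):

```scala
import java.util.concurrent.atomic.AtomicLong

// Dump generated code for an exception at most once per interval.
object CodeDumpThrottle {
  private val minIntervalNanos = 5L * 1000 * 1000 * 1000 // assumed 5 second gap
  private val lastDumpNanos = new AtomicLong(0L)

  def maybeDump(code: => String)(log: String => Unit): Unit = {
    val now = System.nanoTime()
    val prev = lastDumpNanos.get()
    if (now - prev >= minIntervalNanos && lastDumpNanos.compareAndSet(prev, now)) {
      log(code) // only the first caller within the interval dumps the code
    }
  }
}
```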

* correcting scalastyle errors

* * Trigger authentication check irrespective of presence of credentials.

* [SNAPPYDATA] update gradle to version 5.0

- updated builds for gradle 5.0
- moved all embedded versions to top-level build.gradle

* change javax.servlet-api version to 3.0.1

* Updated the janino compiler version similar to upstream spark (#134)

Updated the Janino compiler dependency version to be similar to and compatible with the Spark dependencies.

* Changes for SNAP-2787: (#137)

- Adding an option "ALL" in Show Entries drop down list of tabular lists, in order to display all the table entries to avoid paging.

* Fixes for SNAP-2750: (#131)

- Adding JavaScript plugin code for JQuery Data Table to sort columns containing file/data sizes in human readable form.
- Updating HTML, CSS and JavaScript, for sorting, of tables columns.

* Changes for SNAP-2611: (#138)

- Setting configuration parameter for setting ordering column.

* SNAP-2457 - enabling plan caching for hive thrift server sessions. (#139)

* Changes for SNAP-2926: (#142)

- Changing default page size for all tabular lists from 10 to 50.
- Sorting the Members List tabular view on Member Type so that nodes are ordered with all locators first, then all leads, then all servers.

* Snap 2900 (#140)

Changes:
  * For SNAP-2900
    - Adding HTML, CSS, and JavaScript code changes for adding an Expand and Collapse control button against each members list entry. Clicking this control button displays or hides all additional cell details.
    - Similarly adding parent expand and collapse control to expand and collapse all rows in the table in single click.
    - Removing existing Expand and Collapse control buttons per cell, as those will be redundant.

  * For SNAP-2908
    - Adding third party library Jquery Sparklines to add sparklines (inline charts) in members list for CPU and Memory Usages.
    - Adding HTML, CSS, and JavaScript code changes for rendering CPU and Memory usages Sparklines.

  * Code clean up.
    - Removing unused icons and images.
    - Removing unused JavaScript library liquidFillGauge.js

* Changes for SNAP-2908: [sparkline enhancements] (#143)

[sparkline enhancements]
  * Adding text above sparklines to display units and time duration of charts.
  * Formatting sparkline tooltips to display numbers with three decimal places.

* [SNAP-2934] Avoid double free of page that caused server crash due to SIGABORT/SIGSEGV (#144)

* [SNAP-2956] Wrap non fatal OOME from Spark layer in a LowMemoryException (#146)

* Fixes for SNAP-2965: (#147)

- Using disk store UUID as a unique identifier for each member node.

* [SNAPPYDATA] correcting typos in some exception messages

* SNAP-2917 - generating SparkR library along with snappy product (#141)

removing some unused build code

* [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in … (#149)

* [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in
strong Wolfe line search

## What changes were proposed in this pull request?

Update breeze to 0.13.2 for an emergency bugfix in strong Wolfe line search
scalanlp/breeze#651

Most of the content of this PR is cherry-picked from https://github.com/apache/spark/commit/b35660dd0e930f4b484a079d9e2516b0a7dacf1d with 
minimal code changes done to resolve merge conflicts.

---
Faced one test failure (ParquetHiveCompatibilitySuite#"SPARK-10177 timestamp") while running
precheckin. This was due to a recent upgrade of the `jodd` library version to `5.0.6`. Downgraded the
`jodd` library version to `3.9.1` to fix this failure.
Note that this change is independent of the breeze version upgrade.

* Changes for SNAP-2974 : Snappy UI re-branding to TIBCO ComputeDB (#150)

* Changes for SNAP-2974: Snappy UI re-branding to TIBCO ComputeDB
  1. Adding TIBCO ComputeDB product logo
  2. Adding Help Icon, clicking on which About box is displayed
  3. Updating About Box content
     - Adding TIBCO ComputeDB product name and its Edition type
     - Adding Copyright information
     - Adding Assistance details web links
     - Adding Product Documentation link
  4. Removing or Changing user visible SnappyData references on UI to TIBCO ComputeDB.
  5. Renaming pages to just Dashboard, Member Details and Jobs
  6. Removing Docs link from tabs bar

* * Version changes

* Code changes for SNAP-2989: Snappy UI rebranding to Tibco ComputeDB iff it's Enterprise Edition  (#151)

Product UI updated for following:

 1. SnappyData (Community Edition):
     - Displays Pulse logo on top left side.
     - Displays SnappyData logo on top right side.
     - About Box:
       Displays product name "Project SnappyData - Community Edition"
       Displays product version, copyright information
       Displays community product documentation link.

 2. TIBCO ComputeDB (Enterprise Edition):
     - Displays TIBCO ComputeDB logo on top left side.
     - About Box:
       Displays product name "TIBCO ComputeDB - Enterprise Edition" 
       Displays product version, copyright information
       Displays enterprise product documentation link.

* * Updated some metainfo in prep for 1.1.0 release

* Changes for SNAP-2989: (#152)

- Removing SnappyData Community page link from Enterprise About Box.
- Fixes for the issue where the SnappyData logo was displayed on first page load in the Enterprise edition.

* [SNAPPYDATA] fix scalastyle error

* Spark compatibility fixes (#153)

- Spark compatibility suite fixes to make them work both in Spark and SD
- expand PathOptionSuite to check for data after table rename
- use Resolver to check intersecting columns in NATURAL JOIN

* Considering jobserver class loader as a key for generated code cache - (#154)

## Considering jobserver class loader as a key for generated code cache
For each submission of a snappy-job, a new URI class loader is used.
The first run of a snappy-job may generate some code and cache it.
Subsequent runs of the snappy-job would end up using the generated code
cached by the first run, which can lead to failures because the class
loader used for the cached code is the one from the first job submission
while subsequent submissions use a different class loader. This change
avoids such failures (see the sketch below).
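
An illustrative sketch of keying the cache on the class loader as well as the source (all names here are assumptions, not the actual cache):

```scala
import scala.collection.concurrent.TrieMap

// The cache key includes the class loader of the submitting job, so two
// snappy-job submissions with different URI class loaders never share entries.
case class CodegenCacheKey(sourceCode: String, loader: ClassLoader)

object CodegenCache {
  private val cache = TrieMap.empty[CodegenCacheKey, AnyRef]

  def lookupOrCompile(source: String)(compile: String => AnyRef): AnyRef = {
    val key = CodegenCacheKey(source, Thread.currentThread().getContextClassLoader)
    cache.getOrElseUpdate(key, compile(source))
  }
}
```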

* SNAP-3054: Rename UI tab "JDBC/ODBC Server" to "Hive Thrift Server" (#156)

- Renaming tab name "JDBC/ODBC Server" to "Hive Thrift Server".

* SNAP-3015: Put thousands separators for Tables > Rows Count column in Dashboard. (#157)

- Adding thousands separators for table row count as per locale.

* Tracking spark block manager directories for each executor and cleaning
them in the next run if left orphan.

* [SNAPPYDATA] fix scalastyle errors introduced by previous commit

* Revert: Tracking spark block manager directories for each executor and cleaning them in the next run if left orphan.

* allow for override of TestHive session

* [SNAP-3010] Cleaning block manager directories if left orphan (#158)

## What changes were proposed in this pull request?
Tracking spark block manager directories for each executor and
cleaning them in the next run if left orphan.

The changes track the spark local directories (which are used by
the block manager to store shuffle data) and clean up those local
directories when they are left orphaned due to an abrupt failure
of the JVM.

The changes to clean the orphan directories are also kept as part
of the Spark module itself instead of cleaning them on Snappy cluster start.
This is done because the changes to track the local directories have to
go in Spark, and if the cleanup is not done at the same place then
the metadata file used to track the local directories will keep
growing while running a spark cluster from snappy's spark distribution.
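
A rough sketch of the tracking idea, with the metadata file location and helper names as assumptions:

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import scala.collection.JavaConverters._

// Each executor appends its block manager local directories to a metadata file;
// on the next run, recorded directories that still exist are treated as orphans
// from a previous abrupt shutdown and deleted before tracking starts afresh.
object LocalDirTracker {
  private val metadataFile = Paths.get(".tempfiles.list") // assumed location

  def recordLocalDir(dir: File): Unit =
    Files.write(metadataFile, (dir.getAbsolutePath + "\n").getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)

  def cleanOrphans(): Unit = {
    if (Files.exists(metadataFile)) {
      Files.readAllLines(metadataFile, StandardCharsets.UTF_8).asScala
        .map(new File(_))
        .filter(_.exists())
        .foreach(deleteRecursively)
      Files.delete(metadataFile) // reset the tracking file for this run
    }
  }

  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.foreach(deleteRecursively)
    f.delete()
  }
}
```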

This cleanup is skipped when master is local because in local mode
driver and executors will end up writing `.tempfiles.list` file in the
same directory which may l…