[SPARK-24763][SS] Remove redundant key data from value in streaming aggregation #21733

HeartSaVioR · 2018-07-09T06:25:28Z

What changes were proposed in this pull request?

This patch proposes a new flag option for stateful aggregation: remove redundant key data from value.
Enabling new option runs similar with current, and uses less memory for state according to key/value fields of state operator.

Please refer below link to see detailed perf. test result:
https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16536539&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16536539

Since the state between enabling the option and disabling the option is not compatible, the option is set to 'disable' by default (to ensure backward compatibility), and OffsetSeqMetadata would prevent modifying the option after executing query.

How was this patch tested?

Modify unit tests to cover both disabling option and enabling option.
Also did manual tests to see whether propose patch improves state memory usage.

HeartSaVioR · 2018-07-09T06:26:15Z

cc. @tdas @zsxwing @jose-torres @jerryshao @arunmahadevan @HyukjinKwon

SparkQA · 2018-07-09T06:30:12Z

Test build #92734 has finished for PR 21733 at commit 2a9cc49.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-09T07:05:01Z

Test build #92735 has finished for PR 21733 at commit 89a30ab.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-07-09T08:29:07Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala

@@ -53,7 +54,30 @@ class StreamingAggregationSuite extends StateStoreMetricsTest

  import testImplicits._

-  test("simple count, update mode") {
+  val confAndTestNamePostfixMatrix = List(


Can we do this within beforeAll and afterAll with spark.conf.set / spark.conf.unset?

withSQLConf looks like used widely between SQL unit tests, and does additional work (SparkSession.setActiveSession), so I'm not sure it will work technically same. Moreover, we need to run same test "multiple times", with changing configuration.

Could you propose your code if you don't really mind? Thanks in advance!

Oops. "we need to run same test "multiple times", with changing configuration." I missed this.
We could consider like:

spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

Lines 61 to 72 in 489a529

override def beforeEach(): Unit = {

super.beforeEach()

// Note that there are many tests here that require record-level filtering set to be true.

spark.conf.set(SQLConf.PARQUET_RECORD_FILTER_ENABLED.key, "true")

}

override def afterEach(): Unit = {

try {

spark.conf.unset(SQLConf.PARQUET_RECORD_FILTER_ENABLED.key)

} finally {

super.afterEach()

}

e.g.,

StreamingAggregationSuite { override def afterAll(): Unit = { // false } override def beforeAll(): Unit = { // false } } RemoveRedundantStreamingAggregationSuite extends StreamingAggregationSuite { override def afterAll(): Unit = { // true } override def beforeAll(): Unit = { // true } }

but I believe the current implementation works too in this case.

OK. I'd like to wait for other reviewers regarding opinions/suggestions on this. Let me keep this as it is until then.

SparkQA · 2018-07-09T10:21:43Z

Test build #92738 has finished for PR 21733 at commit bb5f672.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2018-07-09T12:13:08Z

retest this, please

SparkQA · 2018-07-09T16:19:18Z

Test build #92755 has finished for PR 21733 at commit bb5f672.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2018-07-10T03:49:48Z

retest this, please

SparkQA · 2018-07-10T05:46:14Z

Test build #92791 has finished for PR 21733 at commit bb5f672.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jose-torres · 2018-07-10T15:53:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -825,6 +825,16 @@ object SQLConf {
    .intConf
    .createWithDefault(100)

+  val ADVANCED_REMOVE_REDUNDANT_IN_STATEFUL_AGGREGATION =


I get worried when I see things described as "advanced features". What will go wrong if a user who's insufficiently advanced tries to use it?

This is not compatible with current stateful aggregation (definitely, that's the improvement of this patch) and there is no undo. So only new query can turn on this option, and once end users enable the option in the query, the option must be enabled unless end users clear out checkpoint as well as state. (I've added the new option to OffsetSeqMetadata to remember the first setting like partition count).

I'm seeing performance on far or even slightly better on specific workload (publicized in description link), but I would say I cannot try out exhaustive workloads. I actually expected a tradeoff between performance vs state memory usage, so assuming if other workloads follow the tradeoff, end users may need to try out this option in their query with non-production environment (for example, staged) to ensure enabling option doesn't break their expectation of performance with benefit of reducing state memory.

That's why I also make changes available as an option instead of modifying default behavior. If we apply this to the default behavior, we need to provide state migration.

Can this be a query option instead of a SparkConf, then? I worry it will be very hard to reason about the current scenario, where the conf defines how all states are stored - except that some streams started with a different value will silently override it.

I'm sorry, but I'm not sure if I understand your suggestion correctly. I guess defining configuration to spark conf would be easier to guard against modification after starting query, via existing approach - adding conf to OffsetSeqMetadata - whereas I'm not sure we could guard against modification of query option. I might be missing something here.
Could you elaborate a bit more? Thanks in advance!

And the default value of this is false so end users will be aware of existence of this option, and have a chance to read the explanation before setting this option to true.

We might elaborate a bit more on the config: tradeoff between reduced memory usage vs possible perf. hit and suggest running this in non-production before applying this to production. If we feel safer on elaborating more on this, I'm happy to update it.

arunmahadevan · 2018-07-11T20:02:18Z

@HeartSaVioR , the results looks promising. I am wondering if theres a way to make this default option than introducing new configs. Since this is internal details anyway theres no need to expose any config if we can identify the old vs new format by looking at the fields in the row or by introducing a row version to differentiate old vs new.

HeartSaVioR · 2018-07-11T21:25:56Z

@arunmahadevan
I'm actually in favor of changing default behavior, just not 100% sure the result would be promising for exhaustive use cases. I might need to prepare more kinds of key/value pair (key size bigger than value size, key size smaller than value size, key size equals to value size, what else I'm missing here?) and run some tests and back it up with new numbers.

Btw, as you commented, there seems two approaches to identify the old and new format:

looking at the fields in the row

Actually I tried to do it before (via checking count of fields in value row, since this patch reduces the count of fields in value row), and soon realized I can't do it because HDFSBackedStateStoreProvider relies on provided keySchema and valueSchema when serializing / deserializing rows (only byte arrays are stored... no numFields), not leveraging UnsafeRow's serialization/deserialization mechanism (writeExternal/readExternal or write/read via Kyro), so the numFields in deserialized row could go wrong if the schema doesn't match with actual rows, and we can't verify this.

Current approach saves cost to write/read one additional integer with sacrificing the way to verify the rows. If we would want to add the feature back, state migration should be happened or we should have row version to differentiate the twos.

introducing a row version to differentiate old vs new

We could do this via applying same approach in #21739 so this is valid, but query with old state format should do state migration (not easy to do since it should be done against multiple versions of states), or continue relying on old state format (same as above).

@jose-torres Could you please take a look at @arunmahadevan 's comment as well as this comment and comment yours? Thanks in advance!

jose-torres · 2018-07-11T21:38:41Z

We could still save the value of the option to offsetSeqMetadata and error if it's changed. The value of using an option would just be that there's no global default; a poweruser can set the option for the queries they think would benefit without affecting all the other queries which get run.

I agree it would be nice to just have some safe path allowing us to always use the new strategy. Absent that, there's an unfortunate tradeoff of reduced memory footprint vs added complexity. I think we ultimately need a committer to decide whether that's worth it.

HeartSaVioR · 2018-07-11T22:02:28Z

I guess we would have to treat reducing state memory size to have worth to do: as described in above commit, we already optimized in HDFSBackedStateStoreProvider for reducing state store disk size (as well as network transfer) via not storing 4 bytes per each row (from both key and value). This approach would normally save more than previous optimization on value row, given key would have window information which contains two values: start and end.

The main issue on this approach for me is possible perf. impact on workloads. Hopefully the workload I've covered shows even slight perf. improvement but not sure for other workloads yet. I might say we need to consider changing default behavior when I have overall good backing numbers afterwards, but in any way, I'm sure I agree that deciding from committer(s) is necessary. Would we be better to initiate mail thread in dev. mailing list?

HeartSaVioR · 2018-07-12T09:29:10Z

@arunmahadevan @jose-torres

https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367

I had a chance to test this patch with more kinds of use cases, and in overall enabling option shows on far or slightly better performance whereas it reduces state size according to the ratio of size of key-value pair. I'm now feeling that it would make sense to adopt new strategy to the default and use old behavior as fallback of supporting old app, but the numbers are for persuading committers and I still agree decision would be necessary from committer(s).

SparkQA · 2018-07-18T11:30:55Z

Test build #93221 has finished for PR 21733 at commit db9d9ce.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
sealed trait StreamingAggregationStateManager extends Serializable
abstract class StreamingAggregationStateManagerBaseImpl(
class StreamingAggregationStateManagerImplV1(
class StreamingAggregationStateManagerImplV2(

SparkQA · 2018-07-18T11:33:11Z

Test build #93222 has finished for PR 21733 at commit 4754469.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
sealed trait StreamingAggregationStateManager extends Serializable
abstract class StreamingAggregationStateManagerBaseImpl(
class StreamingAggregationStateManagerImplV1(
class StreamingAggregationStateManagerImplV2(

SparkQA · 2018-07-19T13:36:45Z

Test build #93277 has finished for PR 21733 at commit ca198ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
sealed trait StreamingAggregationStateManager extends Serializable
abstract class StreamingAggregationStateManagerBaseImpl(
class StreamingAggregationStateManagerImplV1(
class StreamingAggregationStateManagerImplV2(

…ggregation * add option to configure enabling new feature: remove redundant key data from value * modify code to respect new option (turning on/off feature) * modify tests to run tests with both on/off * Add guard in OffsetSeqMetadata to prevent modifying option after executing query

…thods

…backward compatible"

HeartSaVioR · 2018-07-20T04:57:35Z

Now I'd like to propose changing default behavior to apply new path but keeping backward compatibility, so applied it to the patch. I'm still open on decision to apply it as advanced option as first approach, and happy to roll back when we decide on that way.

HeartSaVioR · 2018-07-20T07:01:15Z

Add tests for StatefulOperatorsHelper itself as well. (Sorry for pushing commits multiple times which trigger multiple builds. It might be ideal if older test builds are terminated once newer test build for specific PR is just launched: and looks like it already works like the way via failing with -9.)

HeartSaVioR · 2018-08-07T21:22:23Z

@tdas
Done running perf. test with 4 more tests:

https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/blob/master/src/main/scala/com/hortonworks/spark/benchmark/

BenchmarkMovingAggregationsListenerKeyMuchBigger

rate: 160000

version	input rows per second	processed rows per second	total state rows	used bytes of current state version
latest master (`c9914cf`)	159877.232	149537.817	65000	133511303
patch (on top of `c9914cf`)	160049.118	152497.945	65000	73236351

state size: 54.854 % (reduces 45.15%)

BenchmarkMovingAggregationsListenerManyKeys

rate: 120000

version	input rows per second	processed rows per second	total state rows	used bytes of current state version
latest master (`c9914cf`)	120266.810	107482.042	65000	38433719
patch (on top of `c9914cf`)	119865.855	109268.772	65000	24900343

state size: 64.787% (reduces 35.21%)

BenchmarkMovingAggregationsListenerManyValues

rate: 25000

version	input rows per second	processed rows per second	total state rows	used bytes of current state version
latest master (`c9914cf`)	25009.236	21216.126	90000	77161711 (857.352 per row)
patch (on top of `c9914cf`)	25060.635	20774.500	99495	78230335 (786.274 per row)

state size: 91.709 % (reduces 8.29 %)

BenchmarkMovingAggregationsListenerValueMuchBigger

rate: 85000

version	input rows per second	processed rows per second	total state rows	used bytes of current state version
latest master (`c9914cf`)	85310.774	79091.271	1000	1324255
patch (on top of `c9914cf`)	84791.761	79755.905	1000	1282687

state size: 96.861 % (reduces 3.14 %)

I don't find any outstanding perf. hit, and expected state size reduction is shown from all over the cases.

* add docs

HeartSaVioR · 2018-08-08T02:25:44Z

Also added javadoc as well. Most of contents are from StateStore but I didn't copy the note to implementation for state store since it is duplicated. Please let me know if we want to add content for the parameter target state store as well.

SparkQA · 2018-08-08T06:06:21Z

Test build #94403 has finished for PR 21733 at commit e0ee04a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas

Looking much better, thank you for the changes! There are still a few improvements that can be made.

tdas · 2018-08-08T07:33:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

@@ -81,4 +85,221 @@ package object state {
        storeCoordinator)
    }
  }
+
+  /**


Ummm why is it in this package class and not in separate file?? Is there any reason it has to be state package object when not all of stateful require it, only streaming aggregation?

Maybe I misinterpret your suggestion before. I thought you are suggesting move to state package class. Will place it to separate file.

tdas · 2018-08-08T07:34:45Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+    def remove(store: StateStore, key: UnsafeRow): Unit
+
+    /**
+     * Return an iterator containing all the key-value pairs in target state store.


super nit: some of these can be compressed to a single line doc.

Will address.

tdas · 2018-08-08T07:35:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+     * Commit all the updates that have been made to the target state store, and return the
+     * new version.
+     *
+     * @param store The target StateStore instance.


superfluous. just the main statement has all the information.

tdas · 2018-08-08T07:37:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+     * Extract columns consisting key from input row, and return the new row for key columns.
+     *
+     * @param row The input row.
+     * @return The row instance which only contains key columns.


nit: a lot of the @param and @return in the docs are a bit superfluous as it just repeats what the main statement already says.

Will just remove all the @param and @return if they are repeating.

tdas · 2018-08-08T07:39:11Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+     * @param row The input row.
+     * @return The row instance which only contains key columns.
+     */
+    def getKey(row: InternalRow): UnsafeRow


nit: why is the input typed InternalRow where everything else is UnsafeRow? seems inconsistent.

getKey was basically UnsafeProjection in statefulOperator so didn't necessarily require UnsafeRow. I just followed the usage to make it less restrict, but we know, in reality row will be always UnsafeRow. So OK to fix if it provides consistency.

tdas · 2018-08-08T08:07:06Z

...a/org/apache/spark/sql/execution/streaming/state/StreamingAggregationStateManagerSuite.scala

+    // state manager should return row which is same as input row regardless of format version
+    assert(inputRow === stateManager.get(memoryStateStore, keyRow))
+  }
+


nit: extra line.

Will remove.

tdas · 2018-08-08T08:08:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+        return savedState
+      }
+
+      val joinedRow = joiner.join(key, savedState)


cant you dedup the code with restoreOriginRow method?

Missed spot. Will leverage restoreOriginRow.

tdas · 2018-08-08T08:08:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+      store.iterator().map(rowPair => restoreOriginRow(rowPair))
+    }
+
+    private def restoreOriginRow(rowPair: UnsafeRowPair): UnsafeRow = {


rename to restoreOriginalRow

Will rename.

tdas · 2018-08-08T08:09:52Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

@@ -167,6 +165,18 @@ trait WatermarkSupport extends UnaryExecNode {
      }
    }
  }
+
+  protected def removeKeysOlderThanWatermark(storeManager: StreamingAggregationStateManager,


incorrect indent of parameters

Actually... where is this used? This does not seem to be used anywhere?

Missed spot. It will be called from Update mode of StateStoreSaveExec. Will address.

tdas · 2018-08-08T08:24:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

        val hasInput = iter.hasNext
        if (!hasInput && keyExpressions.isEmpty) {
          // If our `keyExpressions` are empty, we're getting a global aggregation. In that case
          // the `HashAggregateExec` will output a 0 value for the partial merge. We need to
          // restore the value, so that we don't overwrite our state with a 0 value, but rather
          // merge the 0 with existing state.
+          // In this case the value should represent origin row, so no need to restore.


what does this mean? I think this is not needed any more.

Yes it can be removed now. Will remove.

HeartSaVioR

Thanks again @tdas for the super detailed and thoughtful review! Will address your review comments and update the PR.

HeartSaVioR · 2018-08-08T14:02:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        "be modified after running.")
+      .intConf
+      .checkValue(v => Set(1, 2).contains(v), "Valid versions are 1 and 2")
+      .createWithDefault(2)


Nice suggestion. Will add the test.

HeartSaVioR · 2018-08-08T14:03:13Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+     * Extract columns consisting key from input row, and return the new row for key columns.
+     *
+     * @param row The input row.
+     * @return The row instance which only contains key columns.


Will just remove all the @param and @return if they are repeating.

HeartSaVioR · 2018-08-08T14:08:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+     * @param row The input row.
+     * @return The row instance which only contains key columns.
+     */
+    def getKey(row: InternalRow): UnsafeRow


getKey was basically UnsafeProjection in statefulOperator so didn't necessarily require UnsafeRow. I just followed the usage to make it less restrict, but we know, in reality row will be always UnsafeRow. So OK to fix if it provides consistency.

HeartSaVioR · 2018-08-08T14:11:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+
+    def getKey(row: InternalRow): UnsafeRow = keyProjector(row)
+
+    override def commit(store: StateStore): Long = store.commit()


This change is actually based on your review comment: always use state manager and don't directly access state store whenever possible. If your suggestion only applies to operations I can remove commit() from this interface.

HeartSaVioR · 2018-08-08T14:13:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala

        val hasInput = iter.hasNext
        if (!hasInput && keyExpressions.isEmpty) {
          // If our `keyExpressions` are empty, we're getting a global aggregation. In that case
          // the `HashAggregateExec` will output a 0 value for the partial merge. We need to
          // restore the value, so that we don't overwrite our state with a 0 value, but rather
          // merge the 0 with existing state.
+          // In this case the value should represent origin row, so no need to restore.


Yes it can be removed now. Will remove.

HeartSaVioR · 2018-08-08T14:19:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+        return savedState
+      }
+
+      val joinedRow = joiner.join(key, savedState)


Missed spot. Will leverage restoreOriginRow.

HeartSaVioR · 2018-08-08T14:22:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+      GenerateUnsafeRowJoiner.create(StructType.fromAttributes(keyExpressions),
+        StructType.fromAttributes(valueExpressions))
+    @transient private lazy val restoreValueProjector = GenerateUnsafeProjection.generate(
+      keyValueJoinedExpressions, inputRowAttributes)


My bad. You're right. Will fix. Btw, needToProjectToRestoreValue is always false, unless sequence of columns for key and value get mixed up.

HeartSaVioR · 2018-08-08T14:22:50Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+
+    private val valueExpressions: Seq[Attribute] = inputRowAttributes.diff(keyExpressions)
+    private val keyValueJoinedExpressions: Seq[Attribute] = keyExpressions ++ valueExpressions
+    private val needToProjectToRestoreValue: Boolean =


HeartSaVioR · 2018-08-08T14:23:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+   * - key: Same as key expressions.
+   * - value: Same as input row attributes. The schema of value contains key expressions as well.
+   *
+   * This implementation only works when input row attributes contain all the key attributes.


I intended to put the sentence as a requirement / precondition for possible future usages, but if you think we don't need to put it explicitly I can remove it.

HeartSaVioR · 2018-08-08T14:27:26Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala

+    def remove(store: StateStore, key: UnsafeRow): Unit
+
+    /**
+     * Return an iterator containing all the key-value pairs in target state store.


Will address.

TODO list * add test which reads checkpoint from 2.3.1 and runs query

* add test which restores query from Spark 2.3.1

HeartSaVioR · 2018-08-09T04:07:07Z

@tdas Addressed review comments. Please take a look again. Thanks in advance.

SparkQA · 2018-08-09T07:05:01Z

Test build #94469 has finished for PR 21733 at commit 65801a6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2018-08-09T07:15:17Z

retest this, please

SparkQA · 2018-08-09T11:33:17Z

Test build #94474 has finished for PR 21733 at commit 65801a6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2018-08-16T03:19:13Z

@tdas Kindly reminder.

tdas · 2018-08-21T07:31:35Z

This looks good!! Only one comment, please don't add the .crc files. They are useless and adds unnecessarily clutter.

tdas · 2018-08-21T07:36:29Z

LGTM. Will merge when tests pass. :)

HeartSaVioR · 2018-08-21T07:38:25Z

@tdas Removed the .crc files. Thanks for reviewing!
Btw, it might be good to find the way to add .crc files to .gitignore if .crc files are mostly useless all the times. Might worth to add another minor PR?

tdas · 2018-08-21T07:39:03Z

Good point. That can be minor Pr.

SparkQA · 2018-08-21T11:38:42Z

Test build #95003 has finished for PR 21733 at commit 19888ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request? Add .crc files to .gitignore so that we don't add .crc files in state checkpoint to git repo which could be added in test resources. This is based on comments in apache#21733, apache#21733 (comment). ## How was this patch tested? Add `.1.delta.crc` and `.2.delta.crc` in `<spark root>/sql/core/src/test/resources`, and confirm git doesn't suggest the files to add to stage. Closes apache#22170 from HeartSaVioR/add-crc-files-to-gitignore. Authored-by: Jungtaek Lim <kabhwan@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>

HeartSaVioR · 2018-08-21T22:41:43Z

Thanks all for reviewing and thanks @tdas for merging this in!

HyukjinKwon reviewed Jul 9, 2018

View reviewed changes

jose-torres reviewed Jul 10, 2018

View reviewed changes

HeartSaVioR force-pushed the SPARK-24763 branch from db9d9ce to 4754469 Compare July 18, 2018 07:40

HeartSaVioR force-pushed the SPARK-24763 branch from 4754469 to ca198ef Compare July 19, 2018 09:45

HeartSaVioR added 4 commits July 20, 2018 12:52

Fix scala checkstyle issue

941b88d

Remove duplicating code, use configuration key instead of string literal

abec57f

Refine code change: introduce trait and classes to group duplicate me…

977428c

…thods

HeartSaVioR force-pushed the SPARK-24763 branch from ca198ef to 977428c Compare July 20, 2018 03:53

Change the strategy: "add new option" -> "apply by default, but keep …

63dfb5d

…backward compatible"

Add tests for StatefulOperatorsHelper as well

e844636

HeartSaVioR force-pushed the SPARK-24763 branch from 0218408 to e844636 Compare July 20, 2018 07:02

Address a part of review comments from @tdas

e0ee04a

* add docs

tdas suggested changes Aug 8, 2018

View reviewed changes

HeartSaVioR commented Aug 8, 2018

View reviewed changes

HeartSaVioR added 2 commits August 9, 2018 09:58

WIP address review comments from @tdas

8629f59

TODO list * add test which reads checkpoint from 2.3.1 and runs query

Address review comments from @tdas

65801a6

* add test which restores query from Spark 2.3.1

Remove .crc files as following up @tdas guide

19888ab

HeartSaVioR mentioned this pull request Aug 21, 2018

[MINOR] Add .crc files to .gitignore #22170

Closed

asfgit closed this in 6c5cb85 Aug 21, 2018

HeartSaVioR deleted the SPARK-24763 branch August 21, 2018 22:42

HeartSaVioR mentioned this pull request Aug 22, 2018

[MINOR] Fix build failure due to non-direct conflict: removing import affects others #22178

Closed

HeartSaVioR mentioned this pull request Sep 20, 2018

WIP - [SPARK-10816][SS] Support session window natively #22482

Closed

HeartSaVioR mentioned this pull request Jan 3, 2019

[MINOR][BUILD] Remove *.crc from .gitignore #23020

Closed

HeartSaVioR mentioned this pull request Feb 18, 2019

[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows #23634

Closed

HeartSaVioR mentioned this pull request Jun 25, 2019

[SPARK-28120][SS] Rocksdb state storage implementation #24922

Closed

HeartSaVioR mentioned this pull request Nov 1, 2019

[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows #26108

Closed

	override def beforeEach(): Unit = {
	super.beforeEach()
	// Note that there are many tests here that require record-level filtering set to be true.
	spark.conf.set(SQLConf.PARQUET_RECORD_FILTER_ENABLED.key, "true")
	}

	override def afterEach(): Unit = {
	try {
	spark.conf.unset(SQLConf.PARQUET_RECORD_FILTER_ENABLED.key)
	} finally {
	super.afterEach()
	}


		def getKey(row: InternalRow): UnsafeRow = keyProjector(row)

		override def commit(store: StateStore): Long = store.commit()

[SPARK-24763][SS] Remove redundant key data from value in streaming aggregation #21733

[SPARK-24763][SS] Remove redundant key data from value in streaming aggregation #21733

Conversation

HeartSaVioR commented Jul 9, 2018 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

HeartSaVioR commented Jul 9, 2018

SparkQA commented Jul 9, 2018

SparkQA commented Jul 9, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Jul 9, 2018

HeartSaVioR commented Jul 9, 2018

SparkQA commented Jul 9, 2018

HeartSaVioR commented Jul 10, 2018

SparkQA commented Jul 10, 2018

Choose a reason for hiding this comment

HeartSaVioR Jul 10, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arunmahadevan commented Jul 11, 2018

HeartSaVioR commented Jul 11, 2018 • edited Loading

jose-torres commented Jul 11, 2018

HeartSaVioR commented Jul 11, 2018 • edited Loading

HeartSaVioR commented Jul 12, 2018 • edited Loading

SparkQA commented Jul 18, 2018

SparkQA commented Jul 18, 2018

SparkQA commented Jul 19, 2018

HeartSaVioR commented Jul 20, 2018

HeartSaVioR commented Jul 20, 2018 • edited Loading

HeartSaVioR commented Aug 7, 2018 • edited Loading

HeartSaVioR commented Aug 8, 2018

SparkQA commented Aug 8, 2018

tdas left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HeartSaVioR left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HeartSaVioR commented Aug 9, 2018

SparkQA commented Aug 9, 2018

HeartSaVioR commented Aug 9, 2018

SparkQA commented Aug 9, 2018

HeartSaVioR commented Aug 16, 2018

tdas commented Aug 21, 2018

tdas commented Aug 21, 2018

HeartSaVioR commented Aug 21, 2018

tdas commented Aug 21, 2018

HeartSaVioR commented Jul 9, 2018 •

edited

Loading

HeartSaVioR Jul 10, 2018 •

edited

Loading

HeartSaVioR commented Jul 11, 2018 •

edited

Loading

HeartSaVioR commented Jul 11, 2018 •

edited

Loading

HeartSaVioR commented Jul 12, 2018 •

edited

Loading

HeartSaVioR commented Jul 20, 2018 •

edited

Loading

HeartSaVioR commented Aug 7, 2018 •

edited

Loading