
[SPARK-24971][SQL] remove SupportsDeprecatedScanRow #21921

Closed
wants to merge 1 commit into apache:master from cloud-fan:row

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

This is a follow-up of #21118.

In #21118 we added `SupportsDeprecatedScanRow`. Ideally, a data source should produce `InternalRow` instead of `Row` for better performance. We should remove `SupportsDeprecatedScanRow` and encourage data sources to produce `InternalRow`, which is also very easy to build.
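
To illustrate the "easy to build" point, here is a minimal sketch (the two-column schema is hypothetical, not from this PR). `InternalRow` has a varargs factory; the only wrinkle is that values must use Catalyst's internal representations, e.g. `UTF8String` for strings:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// For a hypothetical (string, int) schema: instead of returning Row("a", 1)
// through the deprecated Row scan, a reader can build the internal row
// directly. Note strings use Catalyst's UTF8String, not java.lang.String.
val row: InternalRow = InternalRow(UTF8String.fromString("a"), 1)
```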

How was this patch tested?

Existing tests.

@SparkQA commented Jul 30, 2018

Test build #93796 has finished for PR 21921 at commit d6a93b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RateStreamContinuousReader(options: DataSourceOptions) extends ContinuousReader
  • class TextSocketMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader with Logging

@jose-torres (Contributor)

lgtm

@cloud-fan (Contributor, Author)

thanks, merging to master!

@rdblue (Contributor) commented Aug 1, 2018

@cloud-fan, I thought it was a requirement to have a committer +1 before merging. Or is this list of committers out of date?

```diff
@@ -91,7 +90,7 @@ class RateStreamContinuousReader(options: DataSourceOptions)
       i,
       numPartitions,
       perPartitionRate)
-      .asInstanceOf[InputPartition[Row]]
+      .asInstanceOf[InputPartition[InternalRow]]
```
rdblue (Contributor)

Why is this cast necessary?

cloud-fan (Contributor, Author)

I didn't dig into it since the cast was already there. The reason seems to be that `java.util.List` isn't covariant.
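
A self-contained toy (class names are stand-ins, not the PR's actual code) showing the invariance problem and how a per-element upcast avoids `asInstanceOf` entirely:

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

trait InputPartition[T]
class ConcretePartition extends InputPartition[String]

// java.util.List is invariant, so JList[ConcretePartition] does not conform
// to JList[InputPartition[String]] even though the element type does:
// val bad: JList[InputPartition[String]] =
//   (0 until 3).map(_ => new ConcretePartition).asJava   // does not compile

// Upcasting each element with a type ascription (no asInstanceOf) gives the
// collection the intended element type from the start:
val ok: JList[InputPartition[String]] =
  (0 until 3).map(_ => new ConcretePartition: InputPartition[String]).asJava
```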

rdblue (Contributor)

I don't think it's a good idea to leave casts in. Can you check whether this one can be avoided? I found in #21118 that many of the casts were unnecessary when variables had declared types, and it is much better to avoid explicit casts that work around the type system.

```diff
@@ -169,7 +170,7 @@ class RateStreamMicroBatchReader(options: DataSourceOptions, checkpointLocation:
     (0 until numPartitions).map { p =>
       new RateStreamMicroBatchInputPartition(
         p, numPartitions, rangeStart, rangeEnd, localStartTimeMs, relativeMsPerValue)
-        : InputPartition[Row]
+        : InputPartition[InternalRow]
```
rdblue (Contributor)

Is this needed? Doesn't RateStreamMicroBatchInputPartition implement InputPartition[InternalRow]?

cloud-fan (Contributor, Author)

ditto

rdblue (Contributor)

This is fine since it isn't a cast, but it's generally better to check whether these are still necessary after refactoring.
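
For context, the distinction being drawn between an ascription and a cast, in a generic sketch unrelated to the PR's classes:

```scala
// A type ascription is a compile-time checked upcast: the compiler verifies
// the expression already has the target type.
val a: CharSequence = "hello": CharSequence   // fine; String <: CharSequence

// asInstanceOf is an unchecked cast: the compiler trusts it, and a wrong
// cast only fails at runtime.
val any: Any = 42
val n: Int = any.asInstanceOf[Int]            // compiles and happens to succeed
// val s: String = any.asInstanceOf[String]   // compiles, but would throw ClassCastException
```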

```diff
@@ -121,17 +121,6 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
     }
   }
 
-  test("unsafe row scan implementation") {
-    Seq(classOf[UnsafeRowDataSourceV2], classOf[JavaUnsafeRowDataSourceV2]).foreach { cls =>
```
rdblue (Contributor)

Why remove unsafe tests?

cloud-fan (Contributor, Author)

That's a follow-up of #21118: you removed `SupportsScanUnsafeRow` there, so this test became meaningless.

rdblue (Contributor)

Ok.

```diff
-  override def planRowInputPartitions(): JList[InputPartition[Row]] = {
-    val lowerBound = filters.collect {
+  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
+    val lowerBound = filters.collectFirst {
```
rdblue (Contributor)

Nit: this is an unrelated change.

cloud-fan (Contributor, Author)

I agree, but this is really minor. When I changed the nearby code, the IDE showed a warning suggesting `collectFirst` here, so I went for it.
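
A quick sketch of the difference, using a local stand-in for Spark's filter classes rather than the real `org.apache.spark.sql.sources.GreaterThan`:

```scala
// Stand-in for Spark's source filters; the shape mirrors
// org.apache.spark.sql.sources.GreaterThan but is defined locally.
sealed trait Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

val filters: Seq[Filter] = Seq(GreaterThan("i", 4))

// collect builds a whole intermediate collection even when only the first
// match is needed:
val viaCollect: Option[Int] =
  filters.collect { case GreaterThan("i", v: Int) => v + 1 }.headOption

// collectFirst returns an Option directly and stops at the first match,
// which is what the IDE warning suggests here:
val viaCollectFirst: Option[Int] =
  filters.collectFirst { case GreaterThan("i", v: Int) => v + 1 }
```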

rdblue (Contributor)

Fine by me since it's so small; just wanted to point it out.

```diff
     }
   }
 
-class UnsafeRowDataSourceV2 extends DataSourceV2 with ReadSupport {
```
rdblue (Contributor)

These aren't Row implementations. Why remove them?

rdblue (Contributor)

Same answer as above, I'm guessing.

@rdblue (Contributor) commented Aug 1, 2018

This looks fine other than the possibly unnecessary cast.

@cloud-fan (Contributor, Author)

@rdblue I vaguely remember that, if the PR author is himself a committer, we can merge a PR with one more LGTM from the community, as long as no one objects for several days. I'm sorry if that's not the case.

@gatorsmile (Member) commented Aug 1, 2018

@cloud-fan To be safe, let's get one more LGTM from another committer.

@gatorsmile (Member)

retest this please

@rdblue (Contributor) commented Aug 1, 2018

@cloud-fan, @gatorsmile, I'm fine with that if it's documented somewhere. I wasn't aware of that convention, and no one brought it up the last time I pointed out commits without a committer +1.

@gatorsmile (Member)

@rdblue I do not think it is documented. Let's be more conservative and collect LGTMs from committers, no matter whether the PR author is a committer or not.

asfgit closed this in defc54c on Aug 1, 2018
@gatorsmile (Member)

It sounds like GitHub is experiencing a very bad delay. @cloud-fan Could you submit a follow-up PR to address the comments from @rdblue?

@cloud-fan (Contributor, Author)

Addressed in #21948.

@rdblue (Contributor) commented Aug 1, 2018

Yeah, I'd say that if it isn't documented then let's go with the usual RTC conventions.

@SparkQA commented Aug 1, 2018

Test build #93887 has finished for PR 21921 at commit d6a93b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RateStreamContinuousReader(options: DataSourceOptions) extends ContinuousReader
  • class TextSocketMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader with Logging

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21921 from cloud-fan/row.

(cherry picked from commit defc54c)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDD.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ContinuousMemoryStream.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamMicroBatchReader.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/socket.scala
	sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java
	sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceSuite.scala
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21921 from cloud-fan/row.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21921 from cloud-fan/row.

(cherry picked from commit defc54c)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDD.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ContinuousMemoryStream.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamMicroBatchReader.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/socket.scala
	sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java
	sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceSuite.scala