
[SPARK-24971][SQL] remove SupportsDeprecatedScanRow #21921

Closed
wants to merge 1 commit into apache:master from cloud-fan:row

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

This is a follow-up of #21118.

In #21118 we added `SupportsDeprecatedScanRow`. Ideally, a data source should produce `InternalRow` instead of `Row` for better performance. We should remove `SupportsDeprecatedScanRow` and encourage data sources to produce `InternalRow`, which is also very easy to build.
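
To illustrate the "easy to build" point, here is a minimal sketch (the two-column schema is hypothetical, not from this PR). `InternalRow` has a varargs factory; the only wrinkle is that values must use Catalyst's internal representations, e.g. `UTF8String` for strings:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// For a hypothetical (string, int) schema: instead of returning Row("a", 1)
// through the deprecated Row scan, a reader can build the internal row
// directly. Note strings use Catalyst's UTF8String, not java.lang.String.
val row: InternalRow = InternalRow(UTF8String.fromString("a"), 1)
```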

How was this patch tested?

Existing tests.

@SparkQA commented Jul 30, 2018

Test build #93796 has finished for PR 21921 at commit d6a93b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RateStreamContinuousReader(options: DataSourceOptions) extends ContinuousReader
  • class TextSocketMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader with Logging

@jose-torres (Contributor)

lgtm

@cloud-fan (Contributor, Author)

thanks, merging to master!

@rdblue (Contributor) commented Aug 1, 2018

@cloud-fan, I thought it was a requirement to have a committer +1 before merging. Or is this list of committers out of date?

```diff
@@ -91,7 +90,7 @@ class RateStreamContinuousReader(options: DataSourceOptions)
       i,
       numPartitions,
       perPartitionRate)
-      .asInstanceOf[InputPartition[Row]]
+      .asInstanceOf[InputPartition[InternalRow]]
```
rdblue (Contributor)

Why is this cast necessary?

cloud-fan (Contributor, Author)

I didn't dig into it since the cast was already there. The reason seems to be that `java.util.List` isn't covariant.
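
A self-contained toy (class names are stand-ins, not the PR's actual code) showing the invariance problem and how a per-element upcast avoids `asInstanceOf` entirely:

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

trait InputPartition[T]
class ConcretePartition extends InputPartition[String]

// java.util.List is invariant, so JList[ConcretePartition] does not conform
// to JList[InputPartition[String]] even though the element type does:
// val bad: JList[InputPartition[String]] =
//   (0 until 3).map(_ => new ConcretePartition).asJava   // does not compile

// Upcasting each element with a type ascription (no asInstanceOf) gives the
// collection the intended element type from the start:
val ok: JList[InputPartition[String]] =
  (0 until 3).map(_ => new ConcretePartition: InputPartition[String]).asJava
```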

rdblue (Contributor)

I don't think it's a good idea to leave casts in. Can you check whether this one can be avoided? I found in #21118 that many of the casts were unnecessary when variables had declared types, and it is much better to avoid explicit casts that work around the type system.

```diff
@@ -169,7 +170,7 @@ class RateStreamMicroBatchReader(options: DataSourceOptions, checkpointLocation:
     (0 until numPartitions).map { p =>
       new RateStreamMicroBatchInputPartition(
         p, numPartitions, rangeStart, rangeEnd, localStartTimeMs, relativeMsPerValue)
-        : InputPartition[Row]
+        : InputPartition[InternalRow]
```
rdblue (Contributor)

Is this needed? Doesn't RateStreamMicroBatchInputPartition implement InputPartition[InternalRow]?

cloud-fan (Contributor, Author)

ditto

rdblue (Contributor)

This is fine since it isn't a cast, but it's generally better to check whether these are still necessary after refactoring.
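
For context, the distinction being drawn between an ascription and a cast, in a generic sketch unrelated to the PR's classes:

```scala
// A type ascription is a compile-time checked upcast: the compiler verifies
// the expression already has the target type.
val a: CharSequence = "hello": CharSequence   // fine; String <: CharSequence

// asInstanceOf is an unchecked cast: the compiler trusts it, and a wrong
// cast only fails at runtime.
val any: Any = 42
val n: Int = any.asInstanceOf[Int]            // compiles and happens to succeed
// val s: String = any.asInstanceOf[String]   // compiles, but would throw ClassCastException
```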

```diff
@@ -121,17 +121,6 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {
     }
   }
 
-  test("unsafe row scan implementation") {
-    Seq(classOf[UnsafeRowDataSourceV2], classOf[JavaUnsafeRowDataSourceV2]).foreach { cls =>
```
rdblue (Contributor)

Why remove unsafe tests?

cloud-fan (Contributor, Author)

That's a follow-up of #21118: you removed `SupportsScanUnsafeRow` there, so this test became meaningless.

rdblue (Contributor)

Ok.

```diff
-  override def planRowInputPartitions(): JList[InputPartition[Row]] = {
-    val lowerBound = filters.collect {
+  override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
+    val lowerBound = filters.collectFirst {
```
rdblue (Contributor)

Nit: this is an unrelated change.

cloud-fan (Contributor, Author)

I agree, but this is really minor. When I changed the nearby code, the IDE showed a warning suggesting `collectFirst` here, so I went for it.
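
A quick sketch of the difference, using a local stand-in for Spark's filter classes rather than the real `org.apache.spark.sql.sources.GreaterThan`:

```scala
// Stand-in for Spark's source filters; the shape mirrors
// org.apache.spark.sql.sources.GreaterThan but is defined locally.
sealed trait Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

val filters: Seq[Filter] = Seq(GreaterThan("i", 4))

// collect builds a whole intermediate collection even when only the first
// match is needed:
val viaCollect: Option[Int] =
  filters.collect { case GreaterThan("i", v: Int) => v + 1 }.headOption

// collectFirst returns an Option directly and stops at the first match,
// which is what the IDE warning suggests here:
val viaCollectFirst: Option[Int] =
  filters.collectFirst { case GreaterThan("i", v: Int) => v + 1 }
```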

rdblue (Contributor)

Fine by me since it's so small; just wanted to point it out.

```diff
     }
   }
 
-class UnsafeRowDataSourceV2 extends DataSourceV2 with ReadSupport {
```
rdblue (Contributor)

These aren't Row implementations. Why remove them?

rdblue (Contributor)

Same answer as above, I'm guessing.

@rdblue (Contributor) commented Aug 1, 2018

This looks fine other than the possibly unnecessary cast.

@cloud-fan (Contributor, Author)

@rdblue I vaguely remember that, if the PR author is himself a committer, we can merge a PR with one more LGTM from the community, as long as no one objects for several days. I'm sorry if that's not the case.

@gatorsmile (Member) commented Aug 1, 2018

@cloud-fan To be safe, let's get one more LGTM from another committer.

@gatorsmile (Member)

retest this please

@rdblue (Contributor) commented Aug 1, 2018

@cloud-fan, @gatorsmile, I'm fine with that if it's documented somewhere. I wasn't aware of that convention, and no one brought it up the last time I pointed out commits without a committer +1.

@gatorsmile (Member)

@rdblue I do not think it is documented. Let's be more conservative and collect LGTMs from committers, no matter whether the PR author is a committer or not.

asfgit closed this in defc54c on Aug 1, 2018
@gatorsmile (Member)

It sounds like GitHub is experiencing a very bad delay. @cloud-fan Could you submit a follow-up PR to address the comments from @rdblue?

@cloud-fan (Contributor, Author)

Addressed in #21948.

@rdblue (Contributor) commented Aug 1, 2018

Yeah, I'd say that if it isn't documented then let's go with the usual RTC conventions.

@SparkQA commented Aug 1, 2018

Test build #93887 has finished for PR 21921 at commit d6a93b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class RateStreamContinuousReader(options: DataSourceOptions) extends ContinuousReader
  • class TextSocketMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader with Logging

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21921 from cloud-fan/row.

(cherry picked from commit defc54c)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDD.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ContinuousMemoryStream.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamMicroBatchReader.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/socket.scala
	sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java
	sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceSuite.scala
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21921 from cloud-fan/row.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21921 from cloud-fan/row.

(cherry picked from commit defc54c)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousDataSourceRDD.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ContinuousMemoryStream.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/RateStreamMicroBatchReader.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/socket.scala
	sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java
	sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/RateSourceSuite.scala