[SPARK-19085][SQL] cleanup OutputWriterFactory and OutputWriter #16479
Conversation
```diff
@@ -64,18 +64,18 @@ object FileFormatWriter extends Logging {
     val outputWriterFactory: OutputWriterFactory,
     val allColumns: Seq[Attribute],
     val partitionColumns: Seq[Attribute],
-    val nonPartitionColumns: Seq[Attribute],
+    val dataColumns: Seq[Attribute],
```
Rename `nonPartitionColumns` to `dataColumns`, to be consistent with other places in the codebase.
Test build #70932 has finished for PR 16479 at commit
What is the benefit of making these changes?
@yhuai It removes unnecessary code to make the codebase easier to maintain. Besides, the libsvm relation should be a little faster as it doesn't need to go through a converter.
Test build #70947 has finished for PR 16479 at commit
```scala
protected[sql] def writeInternal(row: InternalRow): Unit = {
  write(converter(row))
}
```
I found the original PR that introduced these lines: https://github.com/apache/spark/pull/8010/files
```scala
def newWriter(path: String): OutputWriter = {
  throw new UnsupportedOperationException("newInstance with just path not supported")
}
```
The usage of this function was removed in https://github.com/apache/spark/pull/15710/files, so I think it is safe to remove it.
```scala
override def write(row: Row): Unit = {
  val label = row.get(0)
  val vector = row.get(1).asInstanceOf[Vector]
  // This `asInstanceOf` is safe because it's guaranteed by `LibSVMFileFormat.verifySchema`
```
`LibSVMFileFormat.verifySchema` is only called in `buildReader`, but this is the write path, right?
OK, I added the verification.
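The check discussed here can be sketched as follows. This is a hedged illustration, not the real `LibSVMFileFormat.verifySchema` (which operates on Spark's `StructType`); the `Field` type and `LibSvmSchemaCheck` object are hypothetical stand-ins.

```scala
// Hypothetical stand-in for a schema field; real code uses Spark's StructType/StructField.
case class Field(name: String, dataType: String)

object LibSvmSchemaCheck {
  // Require a two-column (label: double, features: vector) schema before the
  // write path casts row values without checking, as in the snippet above.
  def verifySchema(schema: Seq[Field]): Unit = {
    require(
      schema.length == 2 &&
        schema(0).dataType == "double" &&
        schema(1).dataType == "vector",
      s"Illegal schema for libsvm data, schema=${schema.mkString(",")}")
  }
}
```

Calling such a check once on the write path makes the later `asInstanceOf` casts safe by construction.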
Test build #70971 has finished for PR 16479 at commit

Test build #70973 has finished for PR 16479 at commit
LGTM
Thanks for the review, merging to master!
## What changes were proposed in this pull request?

`OutputWriterFactory`/`OutputWriter` are internal interfaces and we can remove some unnecessary APIs:

1. `OutputWriterFactory.newWriter(path: String)`: no one calls it and no one implements it.
2. `OutputWriter.write(row: Row)`: during execution we only call `writeInternal`, which is weird as `OutputWriter` is already an internal interface. We should rename `writeInternal` to `write`, and remove `def write(row: Row)` and its related converter code. All implementations should just implement `def write(row: InternalRow)`.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16479 from cloud-fan/hive-writer.
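The shape of the change can be sketched with simplified stand-in types (these are not the real `org.apache.spark.sql` classes). Before this PR, implementations wrote external `Row`s while the execution path called `writeInternal`, which converted first; after, implementations consume `InternalRow` directly and the converter disappears.

```scala
// Hypothetical stand-ins for Spark's row types, for illustration only.
case class InternalRow(values: Seq[Any])
case class Row(values: Seq[Any])

// Before this PR: implementations provided write(Row), but execution called
// writeInternal, which converted InternalRow -> Row on every record.
abstract class OutputWriterBefore {
  def write(row: Row): Unit
  private val converter: InternalRow => Row = ir => Row(ir.values)
  protected def writeInternal(row: InternalRow): Unit = write(converter(row))
}

// After this PR: a single entry point taking InternalRow directly.
abstract class OutputWriterAfter {
  def write(row: InternalRow): Unit
}
```

Dropping the per-record conversion is also what makes the libsvm writer slightly faster, as noted earlier in the thread.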
How "internal" are these interfaces really? Every time a change like this is made, spark-avro breaks, since spark-avro implements these interfaces.
Everything in this package is internal. Ideally we should not change any interface if unnecessary, but this change is reasonable. As an internal interface, it's more efficient to use `InternalRow`.
I will just copy the conversion code over for now, thanks.
So it turns out just copying the conversion code doesn't work, as seen in databricks/spark-avro#240, and now I'm running into the same thing writing my own datasource. Since a datasource in the end requires implementing a class that extends `OutputWriter`, and the `OutputWriter` interface changed, a datasource plugin doesn't seem to be able to support both pre- and post-2.2.x versions in the same plugin. Any suggestions on how to handle this, without requiring users to match the Spark version to the datasource version?
This is a common issue of data source v1: it's not powerful enough, so you have to use some Spark internal APIs and hit compatibility problems. AFAIK a workable solution is to create different branches for different Spark versions, or use some dirty reflection workarounds.
I'd be interested in the "dirty reflection workarounds", if you have examples. Not sure how I'd use reflection to handle conflicting interface definitions, but I'd love to learn.
Here is a better solution I found: https://github.com/databricks/spark-avro/pull/217/files#diff-3086eddba29f4034c324541695a2357b — it implements different versions of the writer for different Spark versions.
So it essentially compiles each implementation against a different Spark version, then both bytecodes are included in the final jar, and reflection is used to instantiate the right one. That works, without too much pain. Might go that route, thanks.
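The dispatch described above can be sketched as follows. This is a hedged illustration: the class names (`com.example.AvroWriterPre22`, `com.example.AvroWriterSpark22`) are hypothetical, standing in for writer implementations compiled against different Spark versions and bundled in the same jar.

```scala
object WriterShim {
  // Decide which bundled implementation matches the running Spark version.
  // Spark 2.2 is the cutoff, since that is where OutputWriter.write(Row) was removed.
  def writerClassName(sparkVersion: String): String = {
    val Array(major, minor) = sparkVersion.split("\\.").take(2).map(_.toInt)
    if (major > 2 || (major == 2 && minor >= 2)) "com.example.AvroWriterSpark22"
    else "com.example.AvroWriterPre22"
  }

  // Instantiate the chosen class by name. Both implementations would extend a
  // shared trait (hypothetical here) so callers can use one static type.
  def newWriter(sparkVersion: String): AnyRef =
    Class.forName(writerClassName(sparkVersion))
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[AnyRef]
}
```

At runtime only the matching class is ever loaded, so the other one can safely reference an interface that doesn't exist in the running Spark version.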