
[SPARK-19085][SQL] cleanup OutputWriterFactory and OutputWriter #16479

Closed
wants to merge 2 commits into master from cloud-fan:hive-writer

Conversation

@cloud-fan (Contributor) commented Jan 5, 2017

What changes were proposed in this pull request?

OutputWriterFactory/OutputWriter are internal interfaces, and we can remove some unnecessary APIs:

  1. OutputWriterFactory.newWriter(path: String): no one calls it and no one implements it.
  2. OutputWriter.write(row: Row): during execution we only call writeInternal, which is weird since OutputWriter is already an internal interface. We should rename writeInternal to write, and remove def write(row: Row) along with its related converter code. All implementations should then just implement def write(row: InternalRow) (see the sketch below).
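
A minimal sketch of the trimmed-down writer interface after this change (paraphrased from the description above, not copied from the patch; close() is the pre-existing cleanup hook):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Sketch only: after the cleanup there is a single write method that takes
// Spark's internal row format, with no Row-based overload and no converter.
abstract class OutputWriter {
  /** Writes one row in Spark's internal binary format. */
  def write(row: InternalRow): Unit

  /** Closes the writer after all rows for a task have been written. */
  def close(): Unit
}
```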

How was this patch tested?

existing tests.

@@ -64,18 +64,18 @@ object FileFormatWriter extends Logging {
       val outputWriterFactory: OutputWriterFactory,
       val allColumns: Seq[Attribute],
       val partitionColumns: Seq[Attribute],
-      val nonPartitionColumns: Seq[Attribute],
+      val dataColumns: Seq[Attribute],
@cloud-fan (Contributor, Author) commented on this diff:

rename nonPartitionColumns to dataColumns, to be consistent with other places in the codebase.

@cloud-fan (Contributor, Author) commented:

cc @liancheng @gatorsmile @yhuai

@SparkQA commented Jan 5, 2017

Test build #70932 has finished for PR 16479 at commit 1de8e49.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Jan 5, 2017

What is the benefit of making these changes?

@cloud-fan (Contributor, Author) commented:

@yhuai It removes unnecessary code to make the codebase easier to maintain. Besides, the libsvm relation should be a little faster as it doesn't need to go through a converter.

@SparkQA commented Jan 6, 2017

Test build #70947 has finished for PR 16479 at commit 79bb30c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


protected[sql] def writeInternal(row: InternalRow): Unit = {
  write(converter(row))
}
A reviewer (Member) commented on this diff:

I found the original PR that introduced these lines: https://github.com/apache/spark/pull/8010/files

*/
def newWriter(path: String): OutputWriter = {
  throw new UnsupportedOperationException("newInstance with just path not supported")
}
A reviewer (Member) commented on this diff:

The usage of this function was removed in https://github.com/apache/spark/pull/15710/files

I think it is safe to remove it.

override def write(row: Row): Unit = {
  val label = row.get(0)
  val vector = row.get(1).asInstanceOf[Vector]
  // This `asInstanceOf` is safe because it's guaranteed by `LibSVMFileFormat.verifySchema`
A reviewer (Member) commented on this diff:

LibSVMFileFormat.verifySchema is only called in buildReader, but this is the write path, right?

@cloud-fan (Contributor, Author) replied:

ok I added the verification.
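
For context, a hedged sketch of the kind of check being discussed; the name verifyLibSVMSchema and the exact conditions are illustrative only, not the actual LibSVMFileFormat.verifySchema signature. The point is to confirm the written schema really is (label: Double, features: Vector) before the unchecked casts in write():

```scala
import java.io.IOException

import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DoubleType, StructType}

// Illustrative only: fail fast if the schema handed to the libsvm writer is
// not (label: Double, features: Vector), so the asInstanceOf in write() is safe.
def verifyLibSVMSchema(dataSchema: StructType): Unit = {
  val ok = dataSchema.length == 2 &&
    dataSchema(0).dataType == DoubleType &&
    dataSchema(1).dataType == SQLDataTypes.VectorType
  if (!ok) {
    throw new IOException(s"Illegal schema for libsvm data, schema=$dataSchema")
  }
}
```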

@SparkQA commented Jan 6, 2017

Test build #70971 has finished for PR 16479 at commit 110bcdf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 6, 2017

Test build #70973 has finished for PR 16479 at commit 902f17a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) commented:

LGTM

@cloud-fan (Contributor, Author) commented:

thanks for the review, merging to master!

asfgit closed this in b3d3962 on Jan 7, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 9, 2017

@koertkuipers (Contributor) commented Jan 22, 2017

how "internal" are these interfaces really? every time a change like this is made spark-avro breaks

spark-avro uses OutputWriter.write(row: Row)

@cloud-fan (Contributor, Author) commented:

Everything in package org.apache.spark.sql.execution should be internal to Spark SQL. Technically you can still implement OutputWriter outside of Spark, but there is no guarantee about its stability.

Ideally we should not change any interface unless necessary, but this change is reasonable. As an internal interface, it's more efficient to use InternalRow directly instead of converting InternalRow to Row and then operating on Row. I'm sorry that this breaks spark-avro, but we can make spark-avro more efficient by switching to the new interface. Or we can just copy the previous conversion code into spark-avro, so that it can still convert InternalRow to Row and operate on Row.
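
For reference, a rough sketch of what such a copied conversion shim could look like. It assumes the shim can reach Spark's internal CatalystTypeConverters (not a public API, so it may need to be copied or the shim placed in a Spark-owned package); the class and parameter names are hypothetical:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
import org.apache.spark.sql.types.StructType

// Hypothetical shim: adapts a Row-based writer body to the new
// write(InternalRow) contract by converting each row up front, roughly what
// the removed writeInternal/converter pair used to do inside OutputWriter.
class RowConvertingWriter(dataSchema: StructType, writeRow: Row => Unit) {
  // Converter from Catalyst's internal representation to external Rows.
  private val toExternalRow =
    CatalystTypeConverters.createToScalaConverter(dataSchema)
      .asInstanceOf[InternalRow => Row]

  def write(row: InternalRow): Unit = writeRow(toExternalRow(row))
}
```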

@koertkuipers (Contributor) commented:

I will just copy the conversion code over for now, thanks.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

@lokkju commented Oct 26, 2017

So it turns out just copying the conversion code doesn't work, as seen in databricks/spark-avro#240, and now I'm running into the same thing writing my own datasource. Since a datasource ultimately requires implementing a class that extends OutputWriter, and the OutputWriter interface changed, a datasource plugin doesn't seem to be able to support both pre-2.2.x and post-2.2.x Spark versions in the same plugin.

Any suggestions on how to handle this without requiring users to match the Spark version to the datasource version?

@cloud-fan (Contributor, Author) commented:

This is a common issue with Data Source V1: it's not powerful enough, so you have to use some Spark internal APIs and then hit compatibility problems. AFAIK a workable solution is to create different branches for different Spark versions, or to use some dirty reflection workarounds.

@lokkju commented Oct 26, 2017

I'd be interested in the "dirty reflection workarounds", if you have examples. Not sure how I'd use reflection to handle conflicting interface definitions, but I'd love to learn.

@cloud-fan (Contributor, Author) commented Oct 26, 2017

Here is a better solution I found: https://github.com/databricks/spark-avro/pull/217/files#diff-3086eddba29f4034c324541695a2357b

It implements different OutputWriterFactory variants and switches between them with build files.

@lokkju commented Oct 26, 2017

So it essentially compiles each implementation against a different Spark version, and both sets of bytecode are included in the final jar? Then reflection is used to instantiate the right one.

That works without too much pain. I might go that route, thanks.
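
A minimal sketch of that reflection-based variant, using entirely hypothetical shim class names (the spark-avro change linked above switches source directories in the build instead). Each shim would be compiled against its own Spark dependency and bundled in the same jar, and only the one matching the running version is ever loaded:

```scala
import org.apache.spark.SPARK_VERSION
import org.apache.spark.sql.execution.datasources.OutputWriterFactory

// Hypothetical version switch: Spark21WriterFactory and Spark22WriterFactory
// are placeholder names for shims compiled against Spark 2.1.x and 2.2.x.
object WriterFactoryShim {
  def newFactory(): OutputWriterFactory = {
    val shimClass =
      if (SPARK_VERSION.startsWith("2.2")) "com.example.avro.Spark22WriterFactory"
      else "com.example.avro.Spark21WriterFactory"
    // Class.forName only links the selected shim, so bytecode compiled against
    // the other Spark version is never resolved by the classloader.
    Class.forName(shimClass)
      .getConstructor()
      .newInstance()
      .asInstanceOf[OutputWriterFactory]
  }
}
```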
