
[SPARK-28698][SQL] Support user-specified output schema in to_avro #25419

Closed · gengliangwang wants to merge 5 commits into apache:master from gengliangwang:to_avro

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

The mapping from Spark SQL schema to Avro schema is many-to-many (see https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion), so the default schema mapping might not be exactly what users want. For example, by default a "string" column is always written as the Avro "string" type, but users might want to write the column as the Avro "enum" type instead.
With PR #21847, Spark supports a user-specified schema in the batch writer.
The function to_avro should support a user-specified output schema as well.
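
As an illustration, here is a minimal sketch of the proposed overload in use. The enum schema, column name, and session setup are made up for this example and are not part of the patch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical user-specified Avro schema: encode the string column as an
// Avro enum instead of the default Avro string.
val enumSchema =
  """{"type": "enum", "name": "Suit",
    |  "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}""".stripMargin

val df = Seq("SPADES", "HEARTS").toDF("suit")

// Default mapping: the column is written as Avro "string".
val asString = df.select(to_avro(col("suit")).as("value"))

// Proposed overload: the column is written with the user-specified schema,
// i.e. as Avro "enum".
val asEnum = df.select(to_avro(col("suit"), enumSchema).as("value"))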

How was this patch tested?

Unit test.

@gengliangwang
Member Author

@cloud-fan @HyukjinKwon

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108970 has finished for PR 25419 at commit cdd2add.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CatalystDataToAvro(

        jsonFormatSchema,
        options = Map.empty),
      data.eval())
    intercept[SparkException] {
Contributor

what's the error message?

@cloud-fan
Contributor

makes sense to me, let's fix the build first.

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108974 has finished for PR 25419 at commit b64f9c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2019

Test build #108977 has finished for PR 25419 at commit 621ee0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @dbtsai

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28698][SQL] Allow user-specified output schema in function to_avro [SPARK-28698][SQL] Support user-specified output schema in to_avro Aug 12, 2019
-case class CatalystDataToAvro(child: Expression) extends UnaryExpression {
+case class CatalystDataToAvro(
+    child: Expression,
+    jsonFormatSchema: Option[String]) extends UnaryExpression {
Member

Can we have a default value of None?

-   jsonFormatSchema: Option[String]) extends UnaryExpression {
+   jsonFormatSchema: Option[String] = None) extends UnaryExpression {

Member Author

Here I am trying to avoid a parameter with a default value; the result is quite different with and without a specified schema.
Also, this is consistent with AvroDataToCatalyst.

Contributor

Unless the default value is used a lot in tests, I don't think we should add default values to internal classes. We should force the caller side to specify the parameter when instantiating the internal class.

Member

Got it, @gengliangwang and @cloud-fan.

@@ -72,6 +72,19 @@ object functions {
    */
   @Experimental
   def to_avro(data: Column): Column = {
-    new Column(CatalystDataToAvro(data.expr))
+    new Column(CatalystDataToAvro(data.expr, None))
   }
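
For reference, the hunk header shows the file growing by 13 lines, presumably for the new overload. A sketch reconstructed from the PR description, not the verbatim patch:

  /**
   * Converts a column into binary of Avro format, using the user-specified
   * output schema.
   */
  @Experimental
  def to_avro(data: Column, jsonFormatSchema: String): Column = {
    new Column(CatalystDataToAvro(data.expr, Some(jsonFormatSchema)))
  }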
Member

If we have the default value None, we don't need to touch this function.

Member Author

See my comment in #25419 (comment)

       prepareExpectedResult(expected))
   }

   protected def checkUnsupportedRead(data: Literal, schema: String): Unit = {
-    val binary = CatalystDataToAvro(data)
+    val binary = CatalystDataToAvro(data, None)
Member

Also, if we have the default value, we don't need to change lines 41 and 46.

Member Author

See my comment in #25419 (comment)

        jsonFormatSchema,
        options = Map.empty).eval()
    }.getMessage
    assert(message.contains("Malformed records are detected in record parsing."))
Member

In this PR, CatalystDataToAvro ignores the given schema when it is None, doesn't it? To me, this error seems to come from AvroDataToCatalyst instead of CatalystDataToAvro.

Member

If this error comes from AvroDataToCatalyst, this test coverage is misleading. For example, we had better have test coverage for:

  • a test of whether CatalystDataToAvro(data, None) successfully ignores None without any exception.
  • a test of whether CatalystDataToAvro(data, "") fails with that error message (?)

What do you think, @gengliangwang?

Member Author

Here AvroDataToCatalyst is used only to check the Avro schema produced by CatalystDataToAvro.

  1. When jsonFormatSchema is provided to CatalystDataToAvro, the output Avro schema is the enum type, and we validate it with AvroDataToCatalyst. This proves that the provided schema works.
  2. When jsonFormatSchema is None, the output Avro schema is the string type, and it can't be parsed as the enum type.

I will change the order of the two checks in the test case and add a new test case for an invalid user-specified schema.
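
Putting the thread together, the reordered test would look roughly like the sketch below. The enum schema and literal value are illustrative; checkEvaluation comes from Spark's ExpressionEvalHelper test trait:

  val data = Literal("SPADES")
  val jsonFormatSchema =
    """{"type": "enum", "name": "Suit",
      |  "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}""".stripMargin

  // Invalid combination first: without a user-specified schema, to_avro emits
  // Avro "string", so reading it back as "enum" fails in AvroDataToCatalyst.
  val message = intercept[SparkException] {
    AvroDataToCatalyst(
      CatalystDataToAvro(data, None),
      jsonFormatSchema,
      options = Map.empty).eval()
  }.getMessage
  assert(message.contains("Malformed records are detected in record parsing."))

  // Valid round trip: with the user-specified schema, to_avro emits Avro
  // "enum", which AvroDataToCatalyst reads back to the original value.
  checkEvaluation(
    AvroDataToCatalyst(
      CatalystDataToAvro(data, Some(jsonFormatSchema)),
      jsonFormatSchema,
      options = Map.empty),
    data.eval())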

Member

+1. Thanks, @gengliangwang.

@SparkQA

SparkQA commented Aug 13, 2019

Test build #109016 has finished for PR 25419 at commit 6d7520b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 48adc91 Aug 13, 2019
@dirrao

dirrao commented Aug 27, 2020

Is it possible to get this enhancement backported to 2.4.4?

@cloud-fan
Contributor

@dirrao In general, only bug fixes can go to earlier branches.

moritzmeister pushed a commit to moritzmeister/spark that referenced this pull request Dec 3, 2020
The mapping of Spark schema to Avro schema is many-to-many. (See https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion)
The default schema mapping might not be exactly what users want. For example, by default, a "string" column is always written as "string" Avro type, but users might want to output the column as "enum" Avro type.
With PR apache#21847, Spark supports user-specified schema in the batch writer.
For the function `to_avro`, we should support user-specified output schema as well.

Unit test.

Closes apache#25419 from gengliangwang/to_avro.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 48adc91)
SirOibaf pushed a commit to logicalclocks/spark that referenced this pull request Dec 4, 2020