
[SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas #24405

Closed

Conversation

giamo
Contributor

@giamo giamo commented Apr 18, 2019

What changes were proposed in this pull request?

The current implementation of from_avro and AvroDataToCatalyst doesn't allow schema evolution, since it requires an Avro record to be deserialized with the exact same schema it was serialized with.

The proposed change adds a new option for passing the schema that was used to serialize the records. This makes it possible to read with a different, compatible schema by passing both schemas to GenericDatumReader. If no writer's schema is provided, the behavior is unchanged.
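
For context, this is roughly what happens at the Avro level. The sketch below (plain Avro, outside Spark) shows how GenericDatumReader resolves a writer's schema against a compatible reader's schema; the helper name and variables are illustrative only:

```
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Decode Avro bytes written with writerSchemaJson, reading them with the
// (compatible) readerSchemaJson. Fields missing from the writer's schema are
// filled with the defaults declared in the reader's schema.
def decode(bytes: Array[Byte], writerSchemaJson: String, readerSchemaJson: String): GenericRecord = {
  val writerSchema = new Schema.Parser().parse(writerSchemaJson)
  val readerSchema = new Schema.Parser().parse(readerSchemaJson)
  val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  reader.read(null, decoder)
}
```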

Why are the changes needed?

Consider the following example.

// schema ID: 1
val schema1 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"}
     ]
}
"""

// schema ID: 2
val schema2 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"},
        {"name": "col3", "type": "string", "default": ""}
     ]
}
"""

The two schemas are compatible, i.e. you can use schema2 to deserialize events serialized with schema1, in which case col3 will be present with its default value.

Now imagine that you have two dataframes (read from batch or streaming), one with Avro events from schema1 and the other with events from schema2. We want to combine them into one dataframe for storing or further processing.

With the current from_avro function we can only decode each of them with the corresponding schema:

scala> val df1 = ... // Avro events created with schema1
df1: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scala> val decodedDf1 = df1.select(from_avro('eventBytes, schema1) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string>]

scala> val df2 = ... // Avro events created with schema2
df2: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scala> val decodedDf2 = df2.select(from_avro('eventBytes, schema2) as "decoded")
decodedDf2: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]

but then decodedDf1 and decodedDf2 have different Spark schemas and we can't union them. Instead, with the proposed change we can decode df1 in the following way:

scala> import scala.collection.JavaConverters._
scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("writerSchema" -> schema1).asJava) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]

so that both dataframes have the same schemas and can be merged.
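
For completeness, the merge itself is then a plain union (a small sketch continuing the example above, run in the same shell session):

scala> val combined = decodedDf1.union(decodedDf2)   // both sides now share the same schema
scala> val flattened = combined.select($"decoded.*") // expand the struct into col1, col2, col3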

Does this PR introduce any user-facing change?

This PR allows users to pass a new option; existing behavior is unchanged.

How was this patch tested?

A new unit test was added.

@giamo
Contributor Author

giamo commented Apr 18, 2019

cc @gengliangwang

@HyukjinKwon
Member

@giamo, what use case does this PR target? Most of these operations can already be done with Spark APIs.

@giamo
Contributor Author

giamo commented Apr 19, 2019

@HyukjinKwon Imagine that every hour we run a Spark job that loads some data containing Avro records from an external source into a Spark dataframe, uses the from_avro function to deserialize the records and does some processing with them. We use a schema registry to store the schema of the records and sometimes we need to publish a new version by adding or removing fields in a backward compatible way. Consider that it's important for our business logic that a whole batch of records is deserialized using the latest available schema version.

Without doing schema evolution, the job works fine. Now imagine that at some point the business requires we evolve the schema by adding a new field, so the data loaded in the next run of the Spark job now contains some events serialized with the old version of the schema and some others serialized with the new one. Our job now breaks because it tries to read the older data with the new schema and this is not currently supported by from_avro. This is a real use case that we are facing, by the way :)

A workaround using the Spark API would require detecting the differences between the schema versions and patching the older records so they have the same structure as the newer ones. This would be non-trivial to implement, less efficient, and ultimately unnecessary, since Avro already offers the functionality when both the reader schema and an optional writer schema are passed to a GenericDatumReader (which Spark uses underneath). This PR simply exposes the same capability through the Spark interface.
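
For comparison, a rough sketch of that workaround (the hardcoded empty-string default and the column names come from the example schemas in the PR description; this is purely illustrative):

scala> import org.apache.spark.sql.functions.lit
scala> val flat1 = df1.select(from_avro('eventBytes, schema1) as "decoded").select($"decoded.col1", $"decoded.col2", lit("") as "col3")
scala> val flat2 = df2.select(from_avro('eventBytes, schema2) as "decoded").select($"decoded.*")
scala> val combined = flat1.union(flat2)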

@giamo
Contributor Author

giamo commented May 8, 2019

@HyukjinKwon @mgaido91 any thoughts on this?

Contributor

@mgaido91 mgaido91 left a comment

I think the use case makes sense, especially in streaming apps. WDYT @HyukjinKwon ?

@giamo
Contributor Author

giamo commented May 13, 2019

@mgaido91 @HyukjinKwon changes pushed

@giamo
Contributor Author

giamo commented Jul 17, 2019

@mgaido91 @HyukjinKwon do you think we can move this forward?

@mgaido91
Contributor

I think this change makes sense, @HyukjinKwon WDYT?

@gengliangwang
Member

Overall LGTM.
CC @cloud-fan

@giamo
Contributor Author

giamo commented Jul 18, 2019

Thanks! I think I need an "ok to test" from a committer to start the Jenkins build, right?

Contributor

@mgaido91 mgaido91 left a comment

LGTM too, apart from minor comments

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 25, 2019

Test build #108171 has finished for PR 24405 at commit a1bef67.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 25, 2019

Test build #108183 has finished for PR 24405 at commit 5e2f37d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@giamo, can you clarify the use case with code in the PR description? We talk in code.
The API looks a bit awkward because it takes two Avro schemas.

@cloud-fan
Contributor

cc @dbtsai @gengliangwang do we have this feature in the avro data source? e.g. can we read avro files with a different but compatible schema?

@giamo
Contributor Author

giamo commented Aug 15, 2019

@cloud-fan yes, that's what the GenericDatumReader constructor that I linked above allows you to do. There are plenty of examples online.

@cloud-fan
Contributor

I mean the Spark avro data source.

@mgaido91
Contributor

mgaido91 commented Sep 9, 2019

+1 for having it in the options too, thanks @HyukjinKwon !

@SparkQA

SparkQA commented Sep 14, 2019

Test build #110593 has finished for PR 24405 at commit 3128047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Sep 16, 2019

A bit confused by the example provided:

scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("writerSchema" -> schema1).asJava) as "decoded")

So when schema1 is provided as writerSchema, do we use schema1 to read the Avro bytes back? Doesn't schema1 have only 2 fields? Did you mean schema2 here?

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

HyukjinKwon commented Oct 11, 2019

@giamo mind updating the PR? Sorry for my late response. Looks like we're getting closer to a merge.

@SparkQA

SparkQA commented Oct 11, 2019

Test build #111933 has finished for PR 24405 at commit 3128047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@giamo
Contributor Author

giamo commented Oct 16, 2019

A bit confused by the example provided:

scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("writerSchema" -> schema1).asJava) as "decoded")

So when schema1 is provided as writerSchema, do we use schema1 to read the Avro bytes back? Doesn't schema1 have only 2 fields? Did you mean schema2 here?

The example is correct: jsonFormatSchema is the schema used for deserialization (it determines the output columns), just as before, while the writerSchema option only tells Avro which schema the data was originally serialized with.

@SparkQA

SparkQA commented Oct 16, 2019

Test build #112144 has finished for PR 24405 at commit 54cb6f1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@Fokko Fokko left a comment

LGTM, two small suggestions

<td>None</td>
<td>Optional Avro schema (in JSON format) that was used to serialize the data. This should be set if the schema provided
for deserialization is compatible with - but not the same as - the one used to originally convert the data to Avro.
</td>
Contributor

Would it be possible to link to the Confluent documentation? They have an excellent document on schema compatibility and evolution: https://docs.confluent.io/current/schema-registry/avro.html

@@ -153,4 +153,45 @@ class AvroFunctionsSuite extends QueryTest with SharedSparkSession {
assert(df.collect().map(_.get(0)) === Seq(Row("one"), Row("two"), Row("three"), Row("four")))
}
}

test("SPARK-27506: roundtrip in to_avro and from_avro with different compatible schemas") {
Contributor

I would also add a test with an incompatible schema, for example, changing a string to an int.
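
A hedged sketch of what such a test might look like, in the style of AvroFunctionsSuite above (the writerSchema option name follows this PR's diff; the schema strings, the generic intercept[Exception], and the assumption that the suite's usual imports and implicits are in scope are illustrative, not the final merged test):

```
test("SPARK-27506: from_avro with an incompatible writer schema fails") {
  import scala.collection.JavaConverters._

  // Writer schema declares col1 as string, reader schema declares it as int:
  // Avro schema resolution does not allow this promotion, so reading should fail.
  val writerSchema = """{"type": "record", "name": "S", "fields": [{"name": "col1", "type": "string"}]}"""
  val readerSchema = """{"type": "record", "name": "S", "fields": [{"name": "col1", "type": "int"}]}"""

  // Serialize a string column with to_avro, then try to read it back as an int.
  val avroDF = Seq("one", "two").toDF("col1").select(to_avro(struct($"col1")) as "avro")
  intercept[Exception] {
    avroDF.select(from_avro($"avro", readerSchema, Map("writerSchema" -> writerSchema).asJava) as "decoded").collect()
  }
}
```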

@HyukjinKwon
Member

retest this please

@@ -240,6 +240,14 @@ Data source options of Avro can be set via:
</td>
<td>function <code>from_avro</code></td>
</tr>
<tr>
<td><code>writerSchema</code></td>
Member

How about actualSchema? I think it is more straightforward.

Contributor

I would stick to writerSchema, mostly because this is also the term used in Avro itself: https://avro.apache.org/docs/1.9.1/api/java/org/apache/avro/hadoop/io/AvroValueDeserializer.html

Member

I came up with this because the implementation also uses the name actual: https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java#L67

But I am OK with writerSchema as well, since that is the name in the constructor method.

Member

Yeah, actually the writerSchema name is super confusing to me. Can we use actualSchema? I would prefer that one.

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114811 has finished for PR 24405 at commit 54cb6f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@giamo, sorry it took so long. We're very close now. Can you address the comments?

@HyukjinKwon
Member

Can any of you take over this (by picking the commits here, so that I can credit the original author as a co-author) if @giamo is inactive? There are only a few rather minor comments left to address ...

@Fokko
Contributor

Fokko commented Dec 6, 2019

I'm happy to cherry-pick @giamo's work and fix the last few comments

@HyukjinKwon
Member

Please go ahead. I'll credit both of you as co-authors.

@Fokko
Contributor

Fokko commented Dec 6, 2019

I've opened a follow-up under #26780

gengliangwang pushed a commit that referenced this pull request Dec 11, 2019
…e schemas

Follow up of #24405

### What changes were proposed in this pull request?
The current implementation of _from_avro_ and _AvroDataToCatalyst_ doesn't allow doing schema evolution since it requires the deserialization of an Avro record with the exact same schema with which it was serialized.

The proposed change is to add a new option `actualSchema` to allow passing the schema used to serialize the records. This allows using a different compatible schema for reading by passing both schemas to _GenericDatumReader_. If no writer's schema is provided, nothing changes from before.

### Why are the changes needed?
Consider the following example.

```
// schema ID: 1
val schema1 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"}
     ]
}
"""

// schema ID: 2
val schema2 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"},
        {"name": "col3", "type": "string", "default": ""}
     ]
}
"""
```

The two schemas are compatible - i.e. you can use `schema2` to deserialize events serialized with `schema1`, in which case there will be the field `col3` with the default value.

Now imagine that you have two dataframes (read from batch or streaming), one with Avro events from schema1 and the other with events from schema2. **We want to combine them into one dataframe** for storing or further processing.

With the current `from_avro` function we can only decode each of them with the corresponding schema:

```
scala> val df1 = ... // Avro events created with schema1
df1: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scala> val decodedDf1 = df1.select(from_avro('eventBytes, schema1) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string>]

scala> val df2 = ... // Avro events created with schema2
df2: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scala> val decodedDf2 = df2.select(from_avro('eventBytes, schema2) as "decoded")
decodedDf2: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]
```

but then `decodedDf1` and `decodedDf2` have different Spark schemas and we can't union them. Instead, with the proposed change we can decode `df1` in the following way:

```
scala> import scala.collection.JavaConverters._
scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("actualSchema" -> schema1).asJava) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]
```

so that both dataframes have the same schemas and can be merged.

### Does this PR introduce any user-facing change?
This PR allows users to pass a new configuration but it doesn't affect current code.

### How was this patch tested?
A new unit test was added.

Closes #26780 from Fokko/SPARK-27506.

Lead-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Gianluca Amori <gianluca.amori@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
@Fokko
Contributor

Fokko commented Dec 11, 2019

@gengliangwang can you close this one as well?

@gengliangwang
Member

Closing this one since #26780 is closed.
@giamo Thanks for your work!
