
[SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas #24405

Closed

Conversation

giamo
Contributor

@giamo giamo commented Apr 18, 2019

What changes were proposed in this pull request?

The current implementation of from_avro and AvroDataToCatalyst doesn't allow schema evolution, since it requires an Avro record to be deserialized with the exact same schema it was serialized with.

The proposed change adds a new option for passing the schema that was used to serialize the records. This makes it possible to read with a different, compatible schema by passing both schemas to GenericDatumReader. If no writer's schema is provided, the behavior is unchanged.
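
For context, this is roughly what happens at the Avro level. The sketch below (plain Avro, outside Spark) shows how GenericDatumReader resolves a writer's schema against a compatible reader's schema; the helper name and variables are illustrative only:

```
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Decode Avro bytes written with writerSchemaJson, reading them with the
// (compatible) readerSchemaJson. Fields missing from the writer's schema are
// filled with the defaults declared in the reader's schema.
def decode(bytes: Array[Byte], writerSchemaJson: String, readerSchemaJson: String): GenericRecord = {
  val writerSchema = new Schema.Parser().parse(writerSchemaJson)
  val readerSchema = new Schema.Parser().parse(readerSchemaJson)
  val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  reader.read(null, decoder)
}
```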

Why are the changes needed?

Consider the following example.

// schema ID: 1
val schema1 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"}
     ]
}
"""

// schema ID: 2
val schema2 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"},
        {"name": "col3", "type": "string", "default": ""}
     ]
}
"""

The two schemas are compatible, i.e. you can use schema2 to deserialize events serialized with schema1, in which case col3 will be present with its default value.

Now imagine that you have two dataframes (read from batch or streaming), one with Avro events from schema1 and the other with events from schema2. We want to combine them into one dataframe for storing or further processing.

With the current from_avro function we can only decode each of them with the corresponding schema:

scala> val df1 = ... // Avro events created with schema1
df1: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scala> val decodedDf1 = df1.select(from_avro('eventBytes, schema1) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string>]

scala> val df2 = ... // Avro events created with schema2
df2: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scala> val decodedDf2 = df2.select(from_avro('eventBytes, schema2) as "decoded")
decodedDf2: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]

but then decodedDf1 and decodedDf2 have different Spark schemas and we can't union them. Instead, with the proposed change we can decode df1 in the following way:

scala> import scala.collection.JavaConverters._
scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("writerSchema" -> schema1).asJava) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]

so that both dataframes have the same schemas and can be merged.
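
For completeness, the merge itself is then a plain union (a small sketch continuing the example above, run in the same shell session):

scala> val combined = decodedDf1.union(decodedDf2)   // both sides now share the same schema
scala> val flattened = combined.select($"decoded.*") // expand the struct into col1, col2, col3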

Does this PR introduce any user-facing change?

This PR allows users to pass a new option; existing behavior is unchanged.

How was this patch tested?

A new unit test was added.

@giamo
Contributor Author

giamo commented Apr 18, 2019

cc @gengliangwang

@HyukjinKwon
Member

@giamo, what use case does this PR target? Most of these operations can already be done with Spark APIs.

@giamo
Contributor Author

giamo commented Apr 19, 2019

@HyukjinKwon Imagine that every hour we run a Spark job that loads some data containing Avro records from an external source into a Spark dataframe, uses the from_avro function to deserialize the records and does some processing with them. We use a schema registry to store the schema of the records and sometimes we need to publish a new version by adding or removing fields in a backward compatible way. Consider that it's important for our business logic that a whole batch of records is deserialized using the latest available schema version.

Without doing schema evolution, the job works fine. Now imagine that at some point the business requires we evolve the schema by adding a new field, so the data loaded in the next run of the Spark job now contains some events serialized with the old version of the schema and some others serialized with the new one. Our job now breaks because it tries to read the older data with the new schema and this is not currently supported by from_avro. This is a real use case that we are facing, by the way :)

A workaround using the Spark API would require detecting the differences between the schema versions and patching the older records so they have the same structure as the newer ones. This would be non-trivial to implement, less efficient, and ultimately unnecessary, since Avro already offers the functionality when both the reader schema and an optional writer schema are passed to a GenericDatumReader (which Spark uses underneath). This PR simply exposes the same capability through the Spark interface.
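
For comparison, a rough sketch of that workaround (the hardcoded empty-string default and the column names come from the example schemas in the PR description; this is purely illustrative):

scala> import org.apache.spark.sql.functions.lit
scala> val flat1 = df1.select(from_avro('eventBytes, schema1) as "decoded").select($"decoded.col1", $"decoded.col2", lit("") as "col3")
scala> val flat2 = df2.select(from_avro('eventBytes, schema2) as "decoded").select($"decoded.*")
scala> val combined = flat1.union(flat2)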

@giamo
Contributor Author

giamo commented May 8, 2019

@HyukjinKwon @mgaido91 any thoughts on this?

Contributor

@mgaido91 mgaido91 left a comment

I think the use case makes sense, especially in streaming apps. WDYT @HyukjinKwon ?

@giamo
Contributor Author

giamo commented May 13, 2019

@mgaido91 @HyukjinKwon changes pushed

@giamo
Contributor Author

giamo commented Jul 17, 2019

@mgaido91 @HyukjinKwon do you think we can move this forward?

@mgaido91
Contributor

I think this change makes sense, @HyukjinKwon WDYT?

@gengliangwang
Member

Overall LGTM.
CC @cloud-fan

@giamo
Contributor Author

giamo commented Jul 18, 2019

Thanks! I think I need an "ok to test" from a committer to start the Jenkins build, right?

Contributor

@mgaido91 mgaido91 left a comment

LGTM too, apart from minor comments

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 25, 2019

Test build #108171 has finished for PR 24405 at commit a1bef67.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 25, 2019

Test build #108183 has finished for PR 24405 at commit 5e2f37d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@giamo, can you clarify the use case with code in the PR description? We talk in code.
The API looks a bit awkward because it takes two Avro schemas.

@cloud-fan
Contributor

cc @dbtsai @gengliangwang do we have this feature in the avro data source? e.g. can we read avro files with a different but compatible schema?

@giamo
Contributor Author

giamo commented Aug 15, 2019

@cloud-fan yes, that's what the GenericDatumReader constructor that I linked above allows you to do. There are plenty of examples online.

@cloud-fan
Contributor

I mean the Spark avro data source.

@mgaido91
Contributor

mgaido91 commented Sep 9, 2019

+1 for having it in the options too, thanks @HyukjinKwon !

@SparkQA

SparkQA commented Sep 14, 2019

Test build #110593 has finished for PR 24405 at commit 3128047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Sep 16, 2019

A bit confused by the example provided:

scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("writerSchema" -> schema1).asJava) as "decoded")

So when schema1 is provided as writerSchema, do we use schema1 to read the Avro bytes back? Doesn't schema1 have only 2 fields? Did you mean schema2 here?

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

HyukjinKwon commented Oct 11, 2019

@giamo mind updating the PR? Sorry for my late response. Looks like we're getting closer to a merge.

@SparkQA

SparkQA commented Oct 11, 2019

Test build #111933 has finished for PR 24405 at commit 3128047.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@giamo
Contributor Author

giamo commented Oct 16, 2019

A bit confused by the example provided:

scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("writerSchema" -> schema1).asJava) as "decoded")

So when schema1 is provided as writerSchema, do we use schema1 to read the Avro bytes back? Doesn't schema1 have only 2 fields? Did you mean schema2 here?

The example is correct: jsonFormatSchema is the schema used for deserialization (it determines the output columns), just as before, while the writerSchema option only tells Avro which schema the data was originally serialized with.

@SparkQA

SparkQA commented Oct 16, 2019

Test build #112144 has finished for PR 24405 at commit 54cb6f1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@Fokko Fokko left a comment

LGTM, two small suggestions

<td>None</td>
<td>Optional Avro schema (in JSON format) that was used to serialize the data. This should be set if the schema provided
for deserialization is compatible with - but not the same as - the one used to originally convert the data to Avro.
</td>
Contributor

Would it be possible to link to the Confluent documentation? They have an excellent document on schema compatibility and evolution: https://docs.confluent.io/current/schema-registry/avro.html

@@ -153,4 +153,45 @@ class AvroFunctionsSuite extends QueryTest with SharedSparkSession {
assert(df.collect().map(_.get(0)) === Seq(Row("one"), Row("two"), Row("three"), Row("four")))
}
}

test("SPARK-27506: roundtrip in to_avro and from_avro with different compatible schemas") {
Contributor

I would also add a test with an incompatible schema, for example, changing a string to an int.
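
A hedged sketch of what such a test might look like, in the style of AvroFunctionsSuite above (the writerSchema option name follows this PR's diff; the schema strings, the generic intercept[Exception], and the assumption that the suite's usual imports and implicits are in scope are illustrative, not the final merged test):

```
test("SPARK-27506: from_avro with an incompatible writer schema fails") {
  import scala.collection.JavaConverters._

  // Writer schema declares col1 as string, reader schema declares it as int:
  // Avro schema resolution does not allow this promotion, so reading should fail.
  val writerSchema = """{"type": "record", "name": "S", "fields": [{"name": "col1", "type": "string"}]}"""
  val readerSchema = """{"type": "record", "name": "S", "fields": [{"name": "col1", "type": "int"}]}"""

  // Serialize a string column with to_avro, then try to read it back as an int.
  val avroDF = Seq("one", "two").toDF("col1").select(to_avro(struct($"col1")) as "avro")
  intercept[Exception] {
    avroDF.select(from_avro($"avro", readerSchema, Map("writerSchema" -> writerSchema).asJava) as "decoded").collect()
  }
}
```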

@HyukjinKwon
Member

retest this please

@@ -240,6 +240,14 @@ Data source options of Avro can be set via:
</td>
<td>function <code>from_avro</code></td>
</tr>
<tr>
<td><code>writerSchema</code></td>
Member

How about actualSchema? I think it is more straightforward.

Contributor

I would stick to writerSchema, mostly because this is also the term used in Avro itself: https://avro.apache.org/docs/1.9.1/api/java/org/apache/avro/hadoop/io/AvroValueDeserializer.html

Member

I came up with this because the implementation also uses the name actual: https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java#L67

But I am OK with writerSchema as well, since that is the name in the constructor method.

Member

Yeah, actually the writerSchema name is super confusing to me. Can we use actualSchema? I would prefer that one.

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114811 has finished for PR 24405 at commit 54cb6f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@giamo, sorry it took so long. We're very close now. Can you address the comments?

@HyukjinKwon
Member

Can any of you take over this (by picking the commits here, so that I can credit the original author as a co-author) if @giamo is inactive? There are only a few rather minor comments left to address ...

@Fokko
Contributor

Fokko commented Dec 6, 2019

I'm happy to cherry-pick @giamo's work and fix the last few comments

@HyukjinKwon
Member

Please go ahead. I'll credit both of you as co-authors.

@Fokko
Contributor

Fokko commented Dec 6, 2019

I've opened a follow-up under #26780

gengliangwang pushed a commit that referenced this pull request Dec 11, 2019
…e schemas

Follow up of #24405

### What changes were proposed in this pull request?
The current implementation of _from_avro_ and _AvroDataToCatalyst_ doesn't allow doing schema evolution since it requires the deserialization of an Avro record with the exact same schema with which it was serialized.

The proposed change is to add a new option `actualSchema` to allow passing the schema used to serialize the records. This allows using a different compatible schema for reading by passing both schemas to _GenericDatumReader_. If no writer's schema is provided, nothing changes from before.

### Why are the changes needed?
Consider the following example.

```
// schema ID: 1
val schema1 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"}
     ]
}
"""

// schema ID: 2
val schema2 = """
{
    "type": "record",
    "name": "MySchema",
    "fields": [
        {"name": "col1", "type": "int"},
        {"name": "col2", "type": "string"},
        {"name": "col3", "type": "string", "default": ""}
     ]
}
"""
```

The two schemas are compatible - i.e. you can use `schema2` to deserialize events serialized with `schema1`, in which case there will be the field `col3` with the default value.

Now imagine that you have two dataframes (read from batch or streaming), one with Avro events from schema1 and the other with events from schema2. **We want to combine them into one dataframe** for storing or further processing.

With the current `from_avro` function we can only decode each of them with the corresponding schema:

```
scala> val df1 = ... // Avro events created with schema1
df1: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scala> val decodedDf1 = df1.select(from_avro('eventBytes, schema1) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string>]

scala> val df2 = ... // Avro events created with schema2
df2: org.apache.spark.sql.DataFrame = [eventBytes: binary]
scala> val decodedDf2 = df2.select(from_avro('eventBytes, schema2) as "decoded")
decodedDf2: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]
```

but then `decodedDf1` and `decodedDf2` have different Spark schemas and we can't union them. Instead, with the proposed change we can decode `df1` in the following way:

```
scala> import scala.collection.JavaConverters._
scala> val decodedDf1 = df1.select(from_avro(data = 'eventBytes, jsonFormatSchema = schema2, options = Map("actualSchema" -> schema1).asJava) as "decoded")
decodedDf1: org.apache.spark.sql.DataFrame = [decoded: struct<col1: int, col2: string, col3: string>]
```

so that both dataframes have the same schemas and can be merged.

### Does this PR introduce any user-facing change?
This PR allows users to pass a new configuration but it doesn't affect current code.

### How was this patch tested?
A new unit test was added.

Closes #26780 from Fokko/SPARK-27506.

Lead-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Gianluca Amori <gianluca.amori@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
@Fokko
Contributor

Fokko commented Dec 11, 2019

@gengliangwang can you close this one as well?

@gengliangwang
Member

Closing this one since #26780 is closed.
@giamo Thanks for your work!
