[SPARK-48545][SQL] Create to_avro and from_avro SQL functions to match DataFrame equivalents #46977

Closed
wants to merge 13 commits into from

Contributor

@dtenedor dtenedor commented Jun 13, 2024

What changes were proposed in this pull request?

This PR creates two new SQL functions "to_avro" and "from_avro" to match existing DataFrame equivalents.

For example:

sql(
  """
    |create table t as
    |  select named_struct('u', named_struct('member0', member0, 'member1', member1)) as s
    |  from values (1, null), (null,  'a') tab(member0, member1)
    |""".stripMargin)

val jsonFormatSchema =
  """
    |{
    |  "type": "record",
    |  "name": "struct",
    |  "fields": [{
    |    "name": "u",
    |    "type": ["int","string"]
    |  }]
    |}
    |""".stripMargin

spark.sql(
  s"""
    |select from_avro(result, '$jsonFormatSchema', map()).u from (
    |  select to_avro(s, '$jsonFormatSchema') as result from t
    |)
    |""".stripMargin)
  .collect()

> {1, NULL}
  {NULL, "a"}
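
For comparison, the DataFrame equivalents that these SQL functions mirror can be sketched as below. This is a minimal sketch, not part of the PR: it assumes the `spark-avro` module is on the classpath, the object name `AvroDataFrameExample` and the local session settings are illustrative, and the result is expected to match the SQL output above but was not run here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}

// Illustrative object name; not taken from the PR.
object AvroDataFrameExample {
  def main(args: Array[String]): Unit = {
    // Local session for demonstration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("avro-dataframe-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same Avro schema as in the SQL example: field "u" is a union of int and string.
    val jsonFormatSchema =
      """{"type":"record","name":"struct","fields":[{"name":"u","type":["int","string"]}]}"""

    // Same two rows as table t: s = struct(u = struct(member0, member1)).
    val df = Seq[(Option[Int], Option[String])]((Some(1), None), (None, Some("a")))
      .toDF("member0", "member1")
      .select(struct(struct(col("member0"), col("member1")).as("u")).as("s"))

    // Encode the struct to Avro binary, then decode it and project field "u".
    val roundTripped = df
      .select(to_avro(col("s"), jsonFormatSchema).as("result"))
      .select(from_avro(col("result"), jsonFormatSchema).getField("u").as("u"))

    roundTripped.show()
    spark.stop()
  }
}
```

The SQL form takes the schema as a string literal and the options as a `map()` expression, while the DataFrame form takes a Scala `String` and, in its three-argument overload, a `java.util.Map[String, String]`.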

Why are the changes needed?

This brings parity between SQL and DataFrame APIs in Apache Spark.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds extra unit tests, and I also checked that the functions work with spark-shell.

Was this patch authored or co-authored using generative AI tooling?

No GitHub copilot usage this time

@dtenedor
Contributor Author

Thanks @allisonwang-db for your review, followed through on your comments.

Contributor

@allisonwang-db allisonwang-db left a comment

Looks good! cc @cloud-fan

Contributor

@cloud-fan cloud-fan left a comment

LGTM if CI passes

@dtenedor
Contributor Author

cc @cloud-fan the CI is passing now :)

@gengliangwang
Member

Thanks, merging to master

HyukjinKwon pushed a commit that referenced this pull request Jun 23, 2024
…nd from_avro functions but Avro is not loaded by default

### What changes were proposed in this pull request?

This PR updates the new `to_avro` and `from_avro` SQL functions added in #46977 to return reasonable errors when Avro is not loaded by default.

### Why are the changes needed?

According to the [Apache Spark Avro Data Source Guide](https://spark.apache.org/docs/latest/sql-data-sources-avro.html), Avro is not loaded into Spark by default. With this change, users who call the `to_avro` or `from_avro` SQL functions without Avro loaded get a clear error message with instructions on what to do, rather than an obscure Java `ClassNotFoundException`.
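
The general idea of trading a raw `ClassNotFoundException` for an actionable message can be sketched in plain Scala. This is only an illustration of the technique under stated assumptions, not Spark's actual implementation: the object `OptionalModuleCheck`, the helper `loadOrExplain`, the class name `com.example.optional.Codec`, and the message wording are all hypothetical.

```scala
// Hypothetical helper, not part of Spark: looks up a class by name and
// returns a readable message instead of letting ClassNotFoundException escape.
object OptionalModuleCheck {
  def loadOrExplain(className: String, hint: String): Either[String, Class[_]] =
    try Right(Class.forName(className))
    catch {
      case _: ClassNotFoundException =>
        // Tell the caller what is missing and what to do about it.
        Left(s"Cannot use this function because class '$className' is not on the classpath. $hint")
    }

  def main(args: Array[String]): Unit = {
    // A class that is always present on the JVM.
    println(loadOrExplain("java.lang.String", "n/a"))
    // A made-up class name standing in for an optional module that is not loaded.
    println(loadOrExplain(
      "com.example.optional.Codec",
      "Add the optional module to the classpath and retry."))
  }
}
```

Returning `Either` keeps the check side-effect free; a caller that prefers exceptions could rethrow the `Left` message as a domain-specific error.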

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds golden file based test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No GitHub copilot this time.

Closes #47063 from dtenedor/to-from-avro-error-not-loaded.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>