Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43380][SQL] Fix slowdown in Avro read #43530

Closed

Conversation

zeruibao
Copy link
Contributor

@zeruibao zeruibao commented Oct 25, 2023

What changes were proposed in this pull request?

Fix slowdown in Avro read. There is a #42503 causes the performance regression. It seems that creating an AvroOptions inside toSqlType is very expensive. Try to pass this in the callstack.

After regression
image-20231024-193909

Before regression
image-20231024-193650

Why are the changes needed?

Need to fix the performance regression of Avro read.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UT test

Was this patch authored or co-authored using generative AI tooling?

No

@RunyaoChen
Copy link
Contributor

The PR description mentions #42503 causes the performance regression, while the 42503 itself is a fix to another perf regression.

I think we need some careful review here.

@RunyaoChen
Copy link
Contributor

@cloud-fan Could you help review given that you reviewed the #42503 that this PR is trying to fix

@github-actions github-actions bot added the BUILD label Oct 26, 2023
…Converters.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
@zeruibao
Copy link
Contributor Author

All tests passed. Should be good to merge @cloud-fan.

@cloud-fan
Copy link
Contributor

cloud-fan commented Oct 27, 2023

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in 7d94c57 Oct 27, 2023
cloud-fan pushed a commit that referenced this pull request Oct 27, 2023
### What changes were proposed in this pull request?
Fix slowdown in Avro read. There is a #42503 causes the performance regression. It seems that creating an `AvroOptions` inside `toSqlType` is very expensive. Try to pass this in the callstack.

After regression
![image-20231024-193909](https://github.com/apache/spark/assets/125398515/c6af9376-e058-4da9-8f63-d9e8663b36ef)

Before regression
![image-20231024-193650](https://github.com/apache/spark/assets/125398515/fd609c05-accb-4ce8-8020-2866328a52f7)

### Why are the changes needed?
Need to fix the performance regression of Avro read.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43530 from zeruibao/SPARK-4380-real-fix-regression-2.

Lead-authored-by: zeruibao <zerui.bao@databricks.com>
Co-authored-by: Zerui Bao <125398515+zeruibao@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7d94c57)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
beliefer added a commit that referenced this pull request Oct 30, 2023
……useStableIdForUnionType: Boolean): SchemaType

### What changes were proposed in this pull request?
#43530 provides a new method:
```
  /**
   * Converts an Avro schema to a corresponding Spark SQL schema.
   *
   * since 4.0.0
   */
  def toSqlType(avroSchema: Schema, useStableIdForUnionType: Boolean): SchemaType = {
    toSqlTypeHelper(avroSchema, Set.empty, useStableIdForUnionType)
  }
```
Because take `AvroOptions` as parameter causes the performance regression, the old `toSqlType` looks very useless.

This PR also improve some caller of `toSqlType` by pass `useStableIdForUnionType` directly.

### Why are the changes needed?
Deprecate toSqlType(avroSchema: Schema, …useStableIdForUnionType: Boolean): SchemaType

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Exists test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43557 from beliefer/SPARK-43380_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Jiaan Geng <beliefer@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants