[SPARK-48833][SQL][VARIANT] Support variant in `InMemoryTableScan` by richardc-db · Pull Request #47252 · apache/spark

richardc-db · 2024-07-08T06:44:02Z

What changes were proposed in this pull request?

Adds support for the variant type in `InMemoryTableScan` (that is, `df.cache()`) by supporting writing variant values to an in-memory buffer.

Why are the changes needed?

Prior to this PR, calling `df.cache()` on a DataFrame that has a variant column would fail because `InMemoryTableScan` does not support reading variant types.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

no

cloud-fan · 2024-07-23T05:52:15Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnType.scala

+  override def append(v: VariantVal, buffer: ByteBuffer): Unit = {
+    val varLenSize: Int = 4 + v.getValue().length + v.getMetadata().length
+    ByteBufferHelper.putInt(buffer, varLenSize)
+    ByteBufferHelper.putInt(buffer, v.getValue().length)


Why not simply write one int for the value size and one int for the metadata size?

richardc-db · Jul 23, 2024

This was done initially to mimic the `VariantVal` unsafe row representation. I can switch it to write one int for the value and one int for the metadata if you'd prefer.
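To make the two layouts under discussion concrete, here is a minimal sketch using plain `java.nio.ByteBuffer`. It is not the code from this PR: `VariantBytes`, `VariantLayouts`, and all four method names are hypothetical stand-ins, and the byte order after the first two ints of the nested layout (value bytes, then metadata bytes) is an assumption, since the diff hunk above only shows the two `putInt` calls.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-in for Spark's VariantVal: only the two byte arrays.
final case class VariantBytes(value: Array[Byte], metadata: Array[Byte])

object VariantLayouts {
  // Nested layout, as in the reviewed hunk:
  // [total size: int][value size: int][value bytes][metadata bytes]
  // The ordering of the trailing bytes is an assumption.
  def appendNested(v: VariantBytes, buffer: ByteBuffer): Unit = {
    val varLenSize = 4 + v.value.length + v.metadata.length
    buffer.putInt(varLenSize)
    buffer.putInt(v.value.length)
    buffer.put(v.value)
    buffer.put(v.metadata)
  }

  def extractNested(buffer: ByteBuffer): VariantBytes = {
    val varLenSize = buffer.getInt()
    val valueSize = buffer.getInt()
    val value = new Array[Byte](valueSize)
    buffer.get(value)
    // The metadata size is implied: total minus the value-size int and the value.
    val metadata = new Array[Byte](varLenSize - 4 - valueSize)
    buffer.get(metadata)
    VariantBytes(value, metadata)
  }

  // Flat layout, as the reviewer suggests:
  // [value size: int][metadata size: int][value bytes][metadata bytes]
  def appendFlat(v: VariantBytes, buffer: ByteBuffer): Unit = {
    buffer.putInt(v.value.length)
    buffer.putInt(v.metadata.length)
    buffer.put(v.value)
    buffer.put(v.metadata)
  }

  def extractFlat(buffer: ByteBuffer): VariantBytes = {
    val valueSize = buffer.getInt()
    val metadataSize = buffer.getInt()
    val value = new Array[Byte](valueSize)
    buffer.get(value)
    val metadata = new Array[Byte](metadataSize)
    buffer.get(metadata)
    VariantBytes(value, metadata)
  }
}
```

In this sketch both layouts spend the same 8 bytes on headers. The difference is that the flat layout stores both sizes explicitly, while the nested one requires the reader to derive the metadata size by subtraction.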

cloud-fan · 2024-07-24T10:01:52Z

The protobuf failure is unrelated. Thanks, merging to master!

### What changes were proposed in this pull request?

adds support for variant type in `InMemoryTableScan`, or `df.cache()` by supporting writing variant values to an inmemory buffer.

### Why are the changes needed?

prior to this PR, calling `df.cache()` on a df that has a variant would fail because `InMemoryTableScan` does not support reading variant types.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

added UTs

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47252 from richardc-db/variant_dfcache_support.

Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

init

5388af7

github-actions bot added the SQL label Jul 8, 2024

cloud-fan reviewed Jul 23, 2024

richardc-db added 2 commits July 23, 2024 13:42

change format

a772605

trigger

4975c74

cloud-fan approved these changes Jul 24, 2024

cloud-fan closed this in 0c9b072 Jul 24, 2024

richardc-db wants to merge 3 commits into apache:master from richardc-db:variant_dfcache_support