
[SPARK-45827][SQL] Add Variant data type in Spark. #43707

Closed
wants to merge 13 commits

Conversation

@chenhao-db (Contributor) commented Nov 7, 2023

What changes were proposed in this pull request?

This PR adds the Variant data type in Spark. It doesn't introduce the actual binary encoding yet; a Variant value simply carries two binaries, the value and the metadata.

This PR includes:

  • The in-memory Variant representation in the different types of Spark rows. All row types except UnsafeRow use the VariantVal object to store a Variant value. In UnsafeRow, the two binaries are stored contiguously.
  • Spark Parquet writer and reader support for the Variant type. This is agnostic to the detailed binary encoding and simply reads and writes the two binaries transparently.
  • A dummy Spark parse_json implementation so that the writer and reader can be tested manually. It currently returns a VariantVal whose value is the raw bytes of the input string and whose metadata is empty; this is not a valid Variant value under the final binary encoding (see the sketch below).
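
To make the dummy behaviour concrete, here is a minimal sketch (not the PR's code; the constructor argument order follows the snippet reviewed below, and the empty metadata is the placeholder described above):

```scala
import org.apache.spark.unsafe.types.VariantVal

// Dummy parse_json result as described above: the value binary is the raw
// UTF-8 bytes of the input JSON text and the metadata binary is empty.
// This is NOT a valid value under the final Variant binary encoding.
val v = new VariantVal("1".getBytes("UTF-8"), Array[Byte]())
```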

How was this patch tested?

Manual testing. Some supported usages:

> sql("create table T using parquet as select parse_json('1') as o")
> sql("select * from T").show
+---+
|  o|
+---+
|  1|
+---+
> sql("insert into T select parse_json('[2]') as o")
> sql("select * from T").show
+---+
|  o|
+---+
|[2]|
|  1|
+---+

@github-actions github-actions bot added the SQL label Nov 7, 2023
@github-actions github-actions bot added the DOCS label Nov 7, 2023
@chenhao-db chenhao-db changed the title Add Variant data type in Spark. [SPARK-45827] Add Variant data type in Spark. Nov 7, 2023
protected override def nullSafeEval(input: Any): Any = {
// A dummy implementation: the value is the raw bytes of the input string. This is not the final
// implementation, but only intended for debugging.
new VariantVal(input.asInstanceOf[UTF8String].toString.getBytes, Array())
Contributor:

we should probably implement the checkInputDataTypes method to enforce that the input is actually a UTF8String before we reach this point.
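
For illustration, a hedged sketch of what such a check could look like inside the expression class (this fragment is not from the PR; Spark expressions often express the same constraint by mixing in ExpectsInputTypes with inputTypes = Seq(StringType)):

```scala
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.types.StringType

// Hypothetical check: fail analysis unless the child is STRING, so that
// nullSafeEval can safely cast its input to UTF8String.
override def checkInputDataTypes(): TypeCheckResult = {
  if (child.dataType == StringType) {
    TypeCheckResult.TypeCheckSuccess
  } else {
    TypeCheckResult.TypeCheckFailure(
      s"parse_json expects a STRING argument, got ${child.dataType.catalogString}")
  }
}
```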

Contributor Author:

Contributor:

You're right, my mistake on this.

@@ -326,6 +327,17 @@ case class PhysicalStructType(fields: Array[StructField]) extends PhysicalDataTy
}
}

class PhysicalVariantType extends PhysicalDataType {
Contributor:

can we add a comment for this?

Contributor Author:

I don't feel there is anything special about this class that is worth a comment. None of the Physical*Type classes in this file have a comment.


* This function writes the binary content into {@code buffer} starting from {@code cursor}, as
* described in the class comment. The caller should guarantee there is enough space in `buffer`.
*/
public void writeIntoUnsafeRow(byte[] buffer, long cursor) {
Contributor:

Suggested change
public void writeIntoUnsafeRow(byte[] buffer, long cursor) {
public void writeIntoUnsafeRow(Object baseObject, long baseOffset, long cursor) {

Contributor:

There is no byte[] if Spark is using off-heap mode.

Contributor Author:

I'm not quite sure what you mean. This function is called by UnsafeWriter, which always uses byte[] to build an UnsafeRow. There is no benefit in changing this function to use a generic base object.

Contributor:

Interesting, UnsafeWriter always uses byte[], then it's fine.

Contributor:

I'm a bit confused. Why not add writeIntoUnsafeRow into UnsafeWriter?

Contributor (@cloud-fan, Nov 10, 2023):

Actually this makes sense to me. The job of writing to unsafe row belongs to UnsafeWriter and the code should be put there as well. What do you think? @chenhao-db

Contributor:

But it will introduce extra call-stack depth.

Contributor Author:

It makes sense. The original reason I put writeIntoUnsafeRow in this class was to avoid code duplication (e.g., readFromUnsafeRow is called by two classes; such duplication is quite common for other physical value types). But since writeIntoUnsafeRow only has one caller, UnsafeWriter, it is okay to just put it in UnsafeWriter.
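
For readers following the thread, a hedged sketch of the contiguous layout that writeIntoUnsafeRow produces, inferred from the reader-side snippet below (a 4-byte value length, then the value bytes, then the metadata bytes); the signature and helper name are illustrative, not the merged code:

```scala
import org.apache.spark.unsafe.Platform

// Illustrative write path for the layout [valueSize: int][value bytes][metadata bytes].
// The caller must have reserved 4 + value.length + metadata.length bytes at `cursor`.
def writeVariantIntoBuffer(value: Array[Byte], metadata: Array[Byte],
                           buffer: Array[Byte], cursor: Long): Unit = {
  Platform.putInt(buffer, cursor, value.length)
  Platform.copyMemory(value, Platform.BYTE_ARRAY_OFFSET, buffer, cursor + 4, value.length)
  Platform.copyMemory(metadata, Platform.BYTE_ARRAY_OFFSET,
    buffer, cursor + 4 + value.length, metadata.length)
}
```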

int metadataSize = totalSize - 4 - valueSize;
byte[] value = new byte[valueSize];
byte[] metadata = new byte[metadataSize];
Platform.copyMemory(
Contributor:

We can avoid the copy if VariantVal follows UTF8String and also represents its data as baseObject + baseOffset. Does Java have a nicer way to do this now? cc @rednaxelafx

Contributor Author:

I know how UTF8String works, but I feel it is simpler to keep byte[] in the VariantVal object instead of baseObject + baseOffset. I prefer to start with this version; it is not something that can't be changed in the future.

"""
Examples:
""",
since = "3.4.0",
Contributor:

4.0.0

@cloud-fan (Contributor):

Can we check FileFormat#supportDataType and make sure only Parquet supports it?
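
A hedged sketch of what such a FileFormat#supportDataType override might look like on the Parquet side; the exact cases are assumptions, not the merged change:

```scala
import org.apache.spark.sql.types._

// Hypothetical Parquet-side check: accept Variant (including nested occurrences),
// while other formats keep rejecting it through their own supportDataType.
override def supportDataType(dataType: DataType): Boolean = dataType match {
  case _: VariantType => true
  case st: StructType => st.fields.forall(f => supportDataType(f.dataType))
  case ArrayType(elementType, _) => supportDataType(elementType)
  case MapType(keyType, valueType, _) =>
    supportDataType(keyType) && supportDataType(valueType)
  case udt: UserDefinedType[_] => supportDataType(udt.sqlType)
  case _: AtomicType => true
  case _ => false
}
```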

@cloud-fan (Contributor):

We should also check all the call sites of DataSource#disallowWritingIntervals and also disallow writing variants to DS v1.
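
Similarly, a hedged sketch of the DS v1 guard being suggested, reusing the cannotSaveVariantIntoExternalStorageError helper added in this PR (the helper placement and the recursion are assumptions):

```scala
import org.apache.spark.sql.errors.QueryCompilationErrors
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType, VariantType}

// Hypothetical guard at a DataSource v1 write path: reject schemas that contain
// a Variant column anywhere, mirroring how disallowWritingIntervals is used.
private def disallowWritingVariants(schema: StructType): Unit = {
  def hasVariant(dt: DataType): Boolean = dt match {
    case _: VariantType => true
    case st: StructType => st.fields.exists(f => hasVariant(f.dataType))
    case ArrayType(elementType, _) => hasVariant(elementType)
    case MapType(keyType, valueType, _) => hasVariant(keyType) || hasVariant(valueType)
    case _ => false
  }
  if (schema.fields.exists(f => hasVariant(f.dataType))) {
    throw QueryCompilationErrors.cannotSaveVariantIntoExternalStorageError()
  }
}
```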

Contributor @beliefer left a comment:

Is there any test?

import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.unsafe.types.VariantVal

class VariantSuite extends QueryTest with SharedSparkSession {
Contributor:

@beliefer this is the test.


// At this point, JSON parsing logic is not really implemented. We just construct some number
// inputs that are also valid JSON. This exercises passing VariantVal throughout the system.
val query = spark.sql("select parse_json(repeat('1', id)) as v from range(1, 10)")
Contributor:

We don't need a fake function to get variant values. Since we have also updated the encoder code, I think Seq(VariantVal(...)).toDF("col") should work. If not, we need to check the encoder code.

Contributor Author:

This is not really "fake": when the JSON parsing is implemented, this test will still be valid.

Contributor Author:

I would like to keep it, and add some round-trip tests with spark.createDataFrame(spark.sparkContext.parallelize(rows), schema).
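
A hedged sketch of that round-trip idea, assuming a SparkSession named spark as in VariantSuite (the schema and assertion are illustrative, not the test that was actually added):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, VariantType}
import org.apache.spark.unsafe.types.VariantVal

// Put a VariantVal into a row, round-trip it through a DataFrame, and compare
// via debugString (byte-wise semantics, as discussed elsewhere in this PR).
val expected = new VariantVal(Array[Byte](1, 2, 3), Array[Byte](-1, -2, -3, -4))
val schema = StructType(Seq(StructField("v", VariantType)))
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(expected))), schema)
val actual = df.collect().head.get(0).asInstanceOf[VariantVal]
assert(actual.debugString() == expected.debugString())
```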


/**
* The data type representing semi-structured values with arbitrary hierarchical data structures. At
* this moment, it is intended to store parsed JSON values and almost any other data types in the
Member:

This doesn't sound like a @Stable API at all. Let's replace it with at least @Unstable.

* The data type representing semi-structured values with arbitrary hierarchical data structures. At
* this moment, it is intended to store parsed JSON values and almost any other data types in the
* system (e.g., we don't plan to let it store a map with a non-string key type). In the future, we
* may also extend it to store other semi-structured data representation like XML.
Member:

I think we should rewrite this statement here. The API documentation shouldn't really mention something like "at this moment" or "in the future". Should just say which version supports what. Therefore, should mention the version @since too.

usage = "_FUNC_(jsonStr) - Parse a JSON string as an Variant value. Throw an exception when the string is not valid JSON value.",
examples =
"""
Examples:
Member:

Should add examples.


override def dataType: DataType = VariantType

override def nullable: Boolean = false
Member:

Hm, it implements nullSafeEval so I guess it's nullable?

Contributor:

it at least should be child.nullable

Contributor Author:

it at least should be child.nullable

This is correct, and I don't need to override nullable in this case.

nullSafeEval ignores null inputs, and the expression itself never returns null for non-null inputs.

@HyukjinKwon HyukjinKwon changed the title [SPARK-45827] Add Variant data type in Spark. [SPARK-45827][SQL] Add Variant data type in Spark. Nov 10, 2023
assert(rowWriter.getRow.getVariant(0) === null)
val variant = new VariantVal(Array[Byte](1, 2, 3), Array[Byte](-1, -2, -3, -4))
rowWriter.write(1, variant)
assert(rowWriter.getRow.getVariant(1).debugString() == variant.debugString())
Contributor:

shall we override equals and hashCode for VariantVal?

Contributor Author:

I don't think we should have them at this moment: the equivalence of VariantVal is much more complicated than byte-to-byte comparison (i.e., different binary values can represent the same variant). It will be pretty complex and will depend on the detailed encoding. It will be confusing if we have equals and hashCode as byte-to-byte comparisons.

In my mind, when we are testing the result of VariantVal in the future, we should use semantic equivalence rather than byte-to-byte comparison. This test is really a special case because we want to verify we exactly pass the VariantVal throughout the system. We can add something like equalByBytes, but it doesn't make things any simpler, so I choose to compare the debugString.

@HyukjinKwon (Member) commented Nov 10, 2023

Please create an umbrella JIRA, add this (SPARK-45827) as a sub-task, and add some more sub-tasks if possible (e.g., Python, R support, documentation). Adding a new type needs a huge change to provide proper support (e.g., see SPARK-27790).

class VariantType private () extends AtomicType {
// The default size is used in query planning to drive optimization decisions. 2048 is arbitrarily
// picked and we currently don't have any data to support it. This may need revisiting later.
override def defaultSize: Int = 2048
Contributor:

How do we get the actual length cheaply?

Contributor:

This is a default size for the variant type, not for a certain variant value. StringType has the same thing.

Contributor:

Got it.

@@ -808,6 +809,9 @@ object FunctionRegistry {
expression[LengthOfJsonArray]("json_array_length"),
expression[JsonObjectKeys]("json_object_keys"),

// Variant
expression[ParseJson]("parse_json"),
Contributor (@beliefer, Nov 10, 2023):

Could we implement parse_json in another PR?

Contributor Author:

I will have an actual implementation for parse_json in the future. At this point, I think it is beneficial to include a "fake" implementation to help testing and experimenting.

@@ -1857,6 +1857,12 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
messageParameters = Map("dataType" -> field.dataType.catalogString))
}

def cannotSaveVariantIntoExternalStorageError(): Throwable = {
new AnalysisException(
errorClass = "_LEGACY_ERROR_TEMP_1176",
Contributor:

Shall we give it a useful name?

Contributor Author:

Contributor:

That's a legacy from history, this is a new one.

Contributor Author:

Makes sense, I updated the error class.

Contributor @beliefer left a comment:

LGTM except some comments.


override private[sql] def ordering =
throw QueryExecutionErrors.orderedOperationUnsupportedByDataTypeError(
"PhysicalVariantType")
Contributor:

Shall we support sorting in the future?

Contributor Author:

I think so, but probably not soon. I also added a TODO.

@@ -1564,6 +1564,12 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
messageParameters = Map.empty)
}

def cannotSaveVariantIntoExternalStorageError(): Throwable = {
new AnalysisException(
errorClass = "CANNOT_SAVE_VARIANT",
Contributor:

.map(_.get(0).asInstanceOf[VariantVal].toString)
.sorted
.toSeq
val expected = (1 until 10).map(id => "1" * id)
Contributor:

I mean: put `val expected = (1 until 10).map(id => "1" * id)` outside of verifyResult.

Contributor Author:

I don't think it matters much.

Contributor:

The current code creates expected twice.

Contributor Author:

I'm aware of that, but the cost is really negligible, and I actually like the current code more because the value is created close to where it is used.

@HyukjinKwon (Member):

Merged to master.

cloud-fan pushed a commit that referenced this pull request Nov 16, 2023
## What changes were proposed in this pull request?

This is a follow-up of #43707. The previous PR missed a piece in the variant parquet reader: we are treating the variant type as `struct<value binary, metadata binary>`, so it also needs a similar `assembleStruct` process in the Parquet reader to correctly set the nullness of variant values from def/rep levels.
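
For reference, a sketch of the physical Parquet schema that description implies; the field names follow the commit message, and the nullability defaults are assumptions:

```scala
import org.apache.spark.sql.types.{BinaryType, StructField, StructType}

// A Variant column is materialized in Parquet as a struct of two binary fields,
// which is why the reader needs the same assembleStruct handling as real structs.
val variantPhysicalSchema = StructType(Seq(
  StructField("value", BinaryType),
  StructField("metadata", BinaryType)))
```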

## How was this patch tested?

Extend the existing unit test. It would fail without the change.

Closes #43825 from chenhao-db/fix_variant_parquet_reader.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Nov 21, 2023
### What changes were proposed in this pull request?

Add a small fix for #43707. Since Variant is represented in columnar form as a struct, it must use `StructNullableTypeConverter` so that nulls are set properly in child column vectors.

### Why are the changes needed?

Fixes a potential issue when setting nulls in Variant columns.

### Does this PR introduce _any_ user-facing change?

No, Variant is not released yet.

### How was this patch tested?

Updated existing unit test to test Variant. It fails without the fix.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43911 from cashmand/SPARK-45827-fixnulls.

Authored-by: cashmand <david.cashman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ueshin pushed a commit that referenced this pull request Feb 26, 2024
### What changes were proposed in this pull request?

The Variant datatype was added in #43707 but the equivalent PySpark type was not added. In this PR we add Variant to PySpark which allows us to create PySpark dataframes containing the Variant type.

### Why are the changes needed?

Without this PR, trying to create a dataframe containing a variant type results in
`AssertionError: Undefined error message parameter for error class: CANNOT_PARSE_DATATYPE. Parameters: {'error': "Undefined error message parameter for error class: CANNOT_PARSE_DATATYPE. Parameters: {'error': 'variant'}"}`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new PySpark type tests involving Variant.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45131 from desmondcheongzx/variant-pyspark-type-info.

Authored-by: Desmond Cheong <desmond.cheong@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
TakawaAkirayo pushed a commit to TakawaAkirayo/spark that referenced this pull request Mar 4, 2024
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024