[core] support vector on spark #8019

Merged
JingsongLi merged 1 commit into apache:master from Stefanietry:support_vector_on_spark
Jun 1, 2026

@Stefanietry Stefanietry commented May 28, 2026

Purpose

Linked issue: #8018

Tests

org.apache.paimon.spark.SparkMultimodalITCase#testVector

@Stefanietry Stefanietry force-pushed the support_vector_on_spark branch 5 times, most recently from 1b6cad7 to ea64eb7 May 29, 2026 15:28
@leaves12138
Contributor

Thanks for the contribution. I am holding off on approval because the current CI status has multiple failing jobs. Please fix or rerun the failures, then I can take another pass.

@Stefanietry Stefanietry force-pushed the support_vector_on_spark branch 2 times, most recently from b810ce0 to 52db328 June 1, 2026 03:30
@@ -242,6 +251,11 @@ public DataType visit(ArrayType arrayType) {
        return DataTypes.createArrayType(elementType.accept(this), elementType.isNullable());
    }

    @Override
    public DataType visit(VectorType vectorType) {
        return DataTypes.createArrayType(vectorType.getElementType().accept(this), vectorType.isNullable());
Contributor

Bug: visit(VectorType) passes vectorType.isNullable() (column nullability) as containsNull, but containsNull controls whether array elements can be null. The existing visit(ArrayType) correctly passes elementType.isNullable(). Please also add tests for this.

Contributor Author

Updated to explicitly indicate that the array does not contain null elements:

    public DataType visit(VectorType vectorType) {
        return DataTypes.createArrayType(vectorType.getElementType().accept(this), false);
    }

Test: SparkTypeTest.testVectorType
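
The distinction raised in this thread can be sketched in plain Java. The types below (SketchArray, SketchColumn, SketchVector) and the helper toArrayColumn are hypothetical stand-ins written for illustration; they are not the Spark or Paimon classes. The sketch assumes, following the fix above, that vector elements are never null, so the element-level flag is always false while the column-level flag is carried over unchanged.

```java
// Stand-in types for illustration only; NOT the Spark or Paimon classes.
final class NullabilitySketch {

    /** Models an array type: containsNull says whether ELEMENTS may be null. */
    static final class SketchArray {
        final String elementType;
        final boolean containsNull;

        SketchArray(String elementType, boolean containsNull) {
            this.elementType = elementType;
            this.containsNull = containsNull;
        }
    }

    /** Models a column: nullable says whether the WHOLE value may be null. */
    static final class SketchColumn {
        final String name;
        final SketchArray type;
        final boolean nullable;

        SketchColumn(String name, SketchArray type, boolean nullable) {
            this.name = name;
            this.type = type;
            this.nullable = nullable;
        }
    }

    /** Models a fixed-length vector type with a column-level nullability flag. */
    static final class SketchVector {
        final String elementType;
        final int length;
        final boolean nullable;

        SketchVector(String elementType, int length, boolean nullable) {
            this.elementType = elementType;
            this.length = length;
            this.nullable = nullable;
        }
    }

    /** Hypothetical helper: maps a vector to an array-typed column. */
    static SketchColumn toArrayColumn(String name, SketchVector vector) {
        // Element-level flag: assumed always false for vectors.
        SketchArray array = new SketchArray(vector.elementType, false);
        // Column-level flag: taken from the vector type, not from the elements.
        return new SketchColumn(name, array, vector.nullable);
    }

    public static void main(String[] args) {
        SketchColumn column = toArrayColumn("embedding", new SketchVector("float", 3, true));
        // prints "embedding: containsNull=false, nullable=true"
        System.out.println(
                column.name + ": containsNull=" + column.type.containsNull
                        + ", nullable=" + column.nullable);
    }
}
```

Keeping the two flags in separate places makes it harder to pass one where the other is expected, which is the mix-up the reviewer pointed out.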

                ArrayType arrayType = (ArrayType) field.dataType();
                String dimKey = String.format("field.%s.vector-dim", field.name());
                type =
                        new VectorType(
Contributor

VectorType is constructed with arrayType.containsNull() (element nullability) instead of field.nullable() (column nullability). The general-case code path on line 613 correctly uses field.nullable(), so this vector branch is inconsistent with it.

Contributor Author

Switched to the general API:

    DataTypes.VECTOR(
            Integer.parseInt(properties.get(dimKey)),
            toPaimonType(arrayType.elementType()));

@@ -609,6 +607,17 @@ private Schema toInitialSchema(
                        field.dataType() instanceof org.apache.spark.sql.types.BinaryType,
                        "The type of blob field must be binary");
                type = new BlobType();
            } else if (vectorFields.contains(field.name())) {
                Preconditions.checkArgument(
Contributor

Integer.parseInt(properties.get(dimKey)) throws NumberFormatException: null when the field.<name>.vector-dim property is missing. Add a Preconditions.checkArgument with a descriptive message.

Contributor Author

Added a check:

    String dimKey = String.format("field.%s.vector-dim", field.name());
    Preconditions.checkArgument(
            properties.containsKey(dimKey),
            "When setting '"
                    + CoreOptions.VECTOR_FIELD.key()
                    + "', you must also set 'field.%s.vector-dim',"
                    + " where %s is the name of the vector field.");
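
For readers who want to try the validation idea outside the PR, here is a minimal self-contained sketch using only the JDK. The class VectorDimCheck and the helper vectorDim are hypothetical names, and the error messages are illustrative rather than the ones merged in this change. Beyond the missing-key case discussed above, the sketch also assumes it is useful to reject non-integer values with a message that names the key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Paimon code base.
final class VectorDimCheck {

    /** Reads and validates the dimension property for one vector field. */
    static int vectorDim(Map<String, String> properties, String fieldName) {
        String dimKey = String.format("field.%s.vector-dim", fieldName);
        String raw = properties.get(dimKey);
        if (raw == null) {
            // Fail early and name the concrete key instead of letting
            // Integer.parseInt(null) throw an unhelpful NumberFormatException.
            throw new IllegalArgumentException(
                    String.format(
                            "Vector field '%s' requires the property '%s' to be set.",
                            fieldName, dimKey));
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    String.format("Property '%s' must be an integer, but was '%s'.", dimKey, raw),
                    e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<>();
        properties.put("field.embedding.vector-dim", "128");

        System.out.println(vectorDim(properties, "embedding")); // prints 128

        try {
            vectorDim(properties, "other");
        } catch (IllegalArgumentException e) {
            // The message names the missing key, field.other.vector-dim.
            System.out.println(e.getMessage());
        }
    }
}
```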

@Stefanietry Stefanietry force-pushed the support_vector_on_spark branch from 52db328 to 8c2b5e9 June 1, 2026 07:52
@JingsongLi JingsongLi left a comment

+1

@JingsongLi JingsongLi merged commit 6bb161b into apache:master Jun 1, 2026
12 checks passed

3 participants