[core] support vector on spark#8019
Conversation
1b6cad7 to
ea64eb7
Compare
|
Thanks for the contribution. I am holding off on approval because the current CI status has multiple failing jobs. Please fix or rerun the failures, then I can take another pass. |
b810ce0 to
52db328
Compare
| @@ -242,6 +251,11 @@ public DataType visit(ArrayType arrayType) { | |||
| return DataTypes.createArrayType(elementType.accept(this), elementType.isNullable()); | |||
| } | |||
|
|
|||
| @Override | |||
| public DataType visit(VectorType vectorType) { | |||
| return DataTypes.createArrayType(vectorType.getElementType().accept(this), vectorType.isNullable()); | |||
There was a problem hiding this comment.
Bug: Please add tests, visit(VectorType) passes vectorType.isNullable() (column nullability) as containsNull, but containsNull controls whether array elements can be null. The existing visit(ArrayType) correctly passes elementType.isNullable().
There was a problem hiding this comment.
Explicitly indicate that the array does not contain null:
public DataType visit(VectorType vectorType) {
return DataTypes.createArrayType(vectorType.getElementType().accept(this), false);
}
Test: SparkTypeTest.testVectorType
| ArrayType arrayType = (ArrayType) field.dataType(); | ||
| String dimKey = String.format("field.%s.vector-dim", field.name()); | ||
| type = | ||
| new VectorType( |
There was a problem hiding this comment.
VectorType is constructed with arrayType.containsNull() (element nullability) instead of field.nullable() (column nullability). The general-case code path on line 613 correctly uses field.nullable() — this vector branch is inconsistent.
There was a problem hiding this comment.
Use general api:
DataTypes.VECTOR(
Integer.parseInt(properties.get(dimKey)),
toPaimonType(arrayType.elementType()));
| @@ -609,6 +607,17 @@ private Schema toInitialSchema( | |||
| field.dataType() instanceof org.apache.spark.sql.types.BinaryType, | |||
| "The type of blob field must be binary"); | |||
| type = new BlobType(); | |||
| } else if (vectorFields.contains(field.name())) { | |||
| Preconditions.checkArgument( | |||
There was a problem hiding this comment.
Integer.parseInt(properties.get(dimKey)) throws NumberFormatException: null when field..vector-dim is missing. Add Preconditions.checkArgument with a descriptive message.
There was a problem hiding this comment.
Add check:
String dimKey = String.format("field.%s.vector-dim", field.name());
Preconditions.checkArgument(
properties.containsKey(dimKey),
"When setting '"
+ CoreOptions.VECTOR_FIELD.key()
+ "', you must also set 'field.%s.vector-dim',"
+ " where %s is the name of the vector field.");
52db328 to
8c2b5e9
Compare
Purpose
Linked issue: #8018
Tests
org.apache.paimon.spark.SparkMultimodalITCase#testVector