Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search#2707
Conversation
…ble native vector search
|
@XuQianJin-Stars Appreciate if you can review this PR 🙏 |
There was a problem hiding this comment.
Pull request overview
This PR fixes the Lance writer to emit Arrow FixedSizeList types for array columns, enabling native vector search in PyLance. Previously, array columns were always written as variable-length Arrow Lists, preventing Lance's vector search functionality from working. The fix introduces a configuration property pattern <column>.arrow.fixed-size-list.size that allows users to specify when array columns should be written as FixedSizeList.
Changes:
- Added support for configuring FixedSizeList via table properties in
LanceArrowUtils - Implemented conversion logic from ListVector to FixedSizeListVector in
ArrowDataConverter - Updated table creation in
LanceLakeCatalogand tests to pass custom properties through the schema conversion pipeline
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| LanceArrowUtils.java | Adds configuration property support and logic to create FixedSizeList ArrowType when the property is set |
| ArrowDataConverter.java | Implements new method to copy data from shaded ListVector to non-shaded FixedSizeListVector |
| LanceLakeCatalog.java | Passes custom properties to toArrowSchema to enable FixedSizeList configuration |
| LanceTieringTest.java | Adds comprehensive test coverage with embedding column using FixedSizeList type |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...s-lake/fluss-lake-lance/src/main/java/org/apache/fluss/lake/lance/utils/LanceArrowUtils.java
Outdated
Show resolved
Hide resolved
...s-lake/fluss-lake-lance/src/main/java/org/apache/fluss/lake/lance/utils/LanceArrowUtils.java
Outdated
Show resolved
Hide resolved
...ake/fluss-lake-lance/src/main/java/org/apache/fluss/lake/lance/utils/ArrowDataConverter.java
Show resolved
Hide resolved
|
Addressed Co-pilot comments, also fixed a memory leak that could occur if exception is thrown during conversion from shaded to non-shaded objects. |
| LanceArrowUtils.toArrowSchema(tableDescriptor.getSchema().getRowType()) | ||
| LanceArrowUtils.toArrowSchema( | ||
| tableDescriptor.getSchema().getRowType(), | ||
| tableDescriptor.getCustomProperties()) |
There was a problem hiding this comment.
Need to confirm the return type of getCustomProperties(). If it returns an immutable Map or a Properties type rather than Map<String, String>, will tableProperties.get(fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX) in toArrowField work correctly? Suggest using Map<String, String> in the new method signature and ensuring compatibility at the call site.
There was a problem hiding this comment.
getCustomProperties() returns Map<String, String>
public Map<String, String> getCustomProperties() {
return customProperties;
}| public class LanceArrowUtils { | ||
|
|
||
| /** Property suffix for configuring a fixed-size list Arrow type on array columns. */ | ||
| public static final String FIXED_SIZE_LIST_SIZE_SUFFIX = ".arrow.fixed-size-list.size"; |
There was a problem hiding this comment.
The property key format is <column_name>.arrow.fixed-size-list.size. If the column name itself contains . (e.g., a.b), it leads to parsing ambiguity. For example, a.b.arrow.fixed-size-list.size — it's unclear whether this refers to column a.b or column a with some unrelated suffix.
Suggestion:
- Explicitly document that column names must not contain
.; or - Validate in toArrowField that the column name does not contain ., and throw a meaningful error message if it does.
There was a problem hiding this comment.
Good comment, I actually look up Lance doc further and found that they do not support dots in columns. Ref: https://docs.lancedb.com/search/filtering#advanced-sql-filters
Updated to throw exception
| try { | ||
| listSize = Integer.parseInt(sizeStr); | ||
| } catch (NumberFormatException ignored) { | ||
| // Not really ignored, IllegalArgumentEx still thrown below. |
There was a problem hiding this comment.
Validating for a positive integer is correct. However, if the property value is an empty string "" or whitespace " ", Integer.parseInt will throw a NumberFormatException without a user-friendly error message. Suggest catching NumberFormatException and converting it to an IllegalArgumentException that includes the property key name, making it easier for users to locate the configuration issue.
int listSize;
try {
listSize = Integer.parseInt(sizeStr);
} catch (NumberFormatException e) {
throw new IllegalArgumentException(
String.format("Invalid value '%s' for property '%s', expected a positive integer.",
sizeStr, fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX), e);
}
checkArgument(listSize > 0, "...");
| assertThat(embeddingValues).hasSize(EMBEDDING_LIST_SIZE); | ||
| org.apache.fluss.row.InternalArray expectedArray = expectRecord.getRow().getArray(3); | ||
| for (int j = 0; j < EMBEDDING_LIST_SIZE; j++) { | ||
| assertThat((Float) embeddingValues.get(j)).isEqualTo(expectedArray.getFloat(j)); |
There was a problem hiding this comment.
Missing null value scenario in test data?
There was a problem hiding this comment.
Great comment. Testing for null actually surfaced a bug.
The fix did take me some time to get to..
|
@XuQianJin-Stars Appreciate if you can review again |
well |
|
LGTM |
Purpose
Linked issue: close #2706
Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search.
Querying with pylance native vector search fails because pylance expects embedding column to be of fixedSizeList type instead of variable size list. This PR fixes the conversion from Fluss' arrow data to Lance's arrow data by using FixedSizeList if
<column>.arrow.fixed-size-list.sizeis defined, similar to how Spark SQL does it ref: https://lance.org/integrations/spark/operations/ddl/create-table/#creating-large-string-columnsTests
Added unit test