Skip to content

Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search#2707

Merged
luoyuxia merged 5 commits intoapache:mainfrom
leekeiabstraction:lance-writer-emit-arrow-fixedSizeList
Feb 28, 2026
Merged

Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search#2707
luoyuxia merged 5 commits intoapache:mainfrom
leekeiabstraction:lance-writer-emit-arrow-fixedSizeList

Conversation

@leekeiabstraction
Copy link
Contributor

Purpose

Linked issue: close #2706

Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search.

Querying with pylance native vector search fails because pylance expects embedding column to be of fixedSizeList type instead of variable size list. This PR fixes the conversion from Fluss' arrow data to Lance's arrow data by using FixedSizeList if <column>.arrow.fixed-size-list.size is defined, similar to how Spark SQL does it ref: https://lance.org/integrations/spark/operations/ddl/create-table/#creating-large-string-columns

Tests

Added unit test

  • Tested manually that pylance successfully performed a vector search on the tiered data

@leekeiabstraction
Copy link
Contributor Author

@XuQianJin-Stars Appreciate if you can review this PR 🙏

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the Lance writer to emit Arrow FixedSizeList types for array columns, enabling native vector search in PyLance. Previously, array columns were always written as variable-length Arrow Lists, preventing Lance's vector search functionality from working. The fix introduces a configuration property pattern <column>.arrow.fixed-size-list.size that allows users to specify when array columns should be written as FixedSizeList.

Changes:

  • Added support for configuring FixedSizeList via table properties in LanceArrowUtils
  • Implemented conversion logic from ListVector to FixedSizeListVector in ArrowDataConverter
  • Updated table creation in LanceLakeCatalog and tests to pass custom properties through the schema conversion pipeline

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
LanceArrowUtils.java Adds configuration property support and logic to create FixedSizeList ArrowType when the property is set
ArrowDataConverter.java Implements new method to copy data from shaded ListVector to non-shaded FixedSizeListVector
LanceLakeCatalog.java Passes custom properties to toArrowSchema to enable FixedSizeList configuration
LanceTieringTest.java Adds comprehensive test coverage with embedding column using FixedSizeList type

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@leekeiabstraction
Copy link
Contributor Author

Addressed Co-pilot comments, also fixed a memory leak that could occur if exception is thrown during conversion from shaded to non-shaded objects.

LanceArrowUtils.toArrowSchema(tableDescriptor.getSchema().getRowType())
LanceArrowUtils.toArrowSchema(
tableDescriptor.getSchema().getRowType(),
tableDescriptor.getCustomProperties())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to confirm the return type of getCustomProperties(). If it returns an immutable Map or a Properties type rather than Map<String, String>, will tableProperties.get(fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX) in toArrowField work correctly? Suggest using Map<String, String> in the new method signature and ensuring compatibility at the call site.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getCustomProperties() returns Map<String, String>

    public Map<String, String> getCustomProperties() {
        return customProperties;
    }

public class LanceArrowUtils {

/** Property suffix for configuring a fixed-size list Arrow type on array columns. */
public static final String FIXED_SIZE_LIST_SIZE_SUFFIX = ".arrow.fixed-size-list.size";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The property key format is <column_name>.arrow.fixed-size-list.size. If the column name itself contains . (e.g., a.b), it leads to parsing ambiguity. For example, a.b.arrow.fixed-size-list.size — it's unclear whether this refers to column a.b or column a with some unrelated suffix.

Suggestion:

  1. Explicitly document that column names must not contain .; or
  2. Validate in toArrowField that the column name does not contain ., and throw a meaningful error message if it does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comment, I actually look up Lance doc further and found that they do not support dots in columns. Ref: https://docs.lancedb.com/search/filtering#advanced-sql-filters

Updated to throw exception

try {
listSize = Integer.parseInt(sizeStr);
} catch (NumberFormatException ignored) {
// Not really ignored, IllegalArgumentEx still thrown below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validating for a positive integer is correct. However, if the property value is an empty string "" or whitespace " ", Integer.parseInt will throw a NumberFormatException without a user-friendly error message. Suggest catching NumberFormatException and converting it to an IllegalArgumentException that includes the property key name, making it easier for users to locate the configuration issue.

int listSize;
try {
    listSize = Integer.parseInt(sizeStr);
} catch (NumberFormatException e) {
    throw new IllegalArgumentException(
        String.format("Invalid value '%s' for property '%s', expected a positive integer.",
            sizeStr, fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX), e);
}
checkArgument(listSize > 0, "...");

assertThat(embeddingValues).hasSize(EMBEDDING_LIST_SIZE);
org.apache.fluss.row.InternalArray expectedArray = expectRecord.getRow().getArray(3);
for (int j = 0; j < EMBEDDING_LIST_SIZE; j++) {
assertThat((Float) embeddingValues.get(j)).isEqualTo(expectedArray.getFloat(j));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing null value scenario in test data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment. Testing for null actually surfaced a bug.

The fix did take me some time to get to..

@leekeiabstraction
Copy link
Contributor Author

@XuQianJin-Stars Appreciate if you can review again

@XuQianJin-Stars
Copy link
Contributor

@XuQianJin-Stars Appreciate if you can review again

well

@XuQianJin-Stars
Copy link
Contributor

LGTM

@luoyuxia luoyuxia merged commit ad037cf into apache:main Feb 28, 2026
14 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[lake/lance] Lance writer should emit Arrow FixedSizeList for array columns to enable native vector search

4 participants