
feat: Add iceberg-lance module with Lance columnar format support #15580

Open
fightBoxing wants to merge 2 commits into apache:main from fightBoxing:feature/lance-format-clean


Summary

Add a new iceberg-lance module that supports the Lance columnar data format in Apache Iceberg. Lance is a modern columnar format optimized for ML/AI workloads, with native vector search, O(1) random access, and zero-copy Arrow integration.

Changes

Modified Files

  • api/.../FileFormat.java — Added LANCE("lance", true) enum value
  • settings.gradle — Registered lance module
  • build.gradle — Added iceberg-lance project configuration with Arrow dependencies
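
As a rough illustration of what the two arguments in LANCE("lance", true) mean, here is a minimal sketch using a simplified stand-in enum. FileFormatSketch and its method bodies are assumptions written for this example, not the real org.apache.iceberg.FileFormat source; the first argument is treated as the file extension and the second as the splittable flag, matching the "Enum, splittable, extension, fromString" coverage listed for TestFileFormatLance.

```java
import java.util.Locale;

// Simplified stand-in for Iceberg's FileFormat enum (an assumption, not the real class).
enum FileFormatSketch {
  PARQUET("parquet", true),
  LANCE("lance", true); // the value this PR adds

  private final String ext;
  private final boolean splittable;

  FileFormatSketch(String ext, boolean splittable) {
    this.ext = "." + ext;
    this.splittable = splittable;
  }

  boolean isSplittable() {
    return splittable;
  }

  // Appends the format's extension unless the file name already ends with it.
  String addExtension(String filename) {
    return filename.endsWith(ext) ? filename : filename + ext;
  }

  // Case-insensitive lookup by format name.
  static FileFormatSketch fromString(String name) {
    return valueOf(name.toUpperCase(Locale.ENGLISH));
  }
}

class FileFormatSketchDemo {
  public static void main(String[] args) {
    FileFormatSketch format = FileFormatSketch.fromString("lance");
    System.out.println(format.addExtension("data-00001")); // data-00001.lance
    System.out.println(format.isSplittable()); // true
  }
}
```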

New Module: iceberg-lance (12 core files + 8 test files)

Core Classes

| File | Description |
| --- | --- |
| Lance.java | Main entry class with WriteBuilder, ReadBuilder, DataWriteBuilder (follows Parquet/ORC pattern) |
| LanceSchemaUtil.java | Bidirectional Iceberg Schema ↔ Arrow Schema conversion |
| LanceValueWriters.java | Type-specific value writers (Boolean/Int/Long/Float/Double/String/Date/Time/Timestamp/Decimal/UUID/Binary) |
| LanceValueReaders.java | Type-specific value readers for all supported types |
| LanceFileAppender.java | FileAppender implementation with Metrics collection |
| LanceIterable.java | CloseableIterable implementation with column projection support |
| LanceMetrics.java | Metrics builder and MetricsCollector for rowCount/columnSizes/valueCounts/nullCounts/bounds |
| LanceUtil.java | Configuration constants and utility methods |
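
To make the LanceMetrics entry more concrete, the following is a minimal sketch of how a per-column collector might track rowCount, valueCounts, nullCounts, and bounds while rows are appended. ColumnMetricsSketch and all of its method names are hypothetical and do not come from the PR; it handles only long columns, omits columnSizes, and assumes that value counts include nulls.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical collector; names are illustrative and not taken from the PR.
class ColumnMetricsSketch {
  private long rowCount = 0;
  private final Map<Integer, Long> valueCounts = new HashMap<>();
  private final Map<Integer, Long> nullCounts = new HashMap<>();
  private final Map<Integer, Long> lowerBounds = new HashMap<>();
  private final Map<Integer, Long> upperBounds = new HashMap<>();

  // Call once per appended row.
  void rowWritten() {
    rowCount++;
  }

  // Records one value (possibly null) for the column with the given field ID.
  void add(int fieldId, Long value) {
    valueCounts.merge(fieldId, 1L, Long::sum); // assumption: nulls count as values
    if (value == null) {
      nullCounts.merge(fieldId, 1L, Long::sum);
      return; // nulls do not contribute to bounds
    }
    lowerBounds.merge(fieldId, value, Math::min);
    upperBounds.merge(fieldId, value, Math::max);
  }

  long rowCount() {
    return rowCount;
  }

  long valueCount(int fieldId) {
    return valueCounts.getOrDefault(fieldId, 0L);
  }

  long nullCount(int fieldId) {
    return nullCounts.getOrDefault(fieldId, 0L);
  }

  // Returns null when the column has no non-null values (the "null bounds" case).
  Long lowerBound(int fieldId) {
    return lowerBounds.get(fieldId);
  }

  Long upperBound(int fieldId) {
    return upperBounds.get(fieldId);
  }
}
```

A writer in this style would call rowWritten() once per record and add(fieldId, value) once per column value, then read the accumulated maps when the file is closed.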

Data Layer Integration

| File | Description |
| --- | --- |
| GenericLanceReader.java | Generic Record reader adapter |
| GenericLanceWriter.java | Generic Record writer adapter |

Tests (60 test cases, all passing)

| Test Class | Cases | Coverage |
| --- | --- | --- |
| TestFileFormatLance | 7 | Enum, splittable, extension, fromString |
| TestLanceSchemaUtil | 7 | Primitive/temporal/decimal/nested/map types, round-trip, null validation |
| TestLanceValueReadersWriters | 17 | Full type round-trip + forType factory |
| TestLanceMetrics | 5 | Simple/full metrics, collector, bounds, null bounds |
| TestLanceDataWriter | 8 | Write/builder/metrics/null/empty/schema/length/close-after-write |
| TestLanceDataReader | 6 | Read/builder/round-trip/null/empty/large dataset |
| TestLanceReadProjection | 4 | Column pruning, single/full/builder projection |
| TestLanceUtil | 6 | Extension, properties, fragment size, compression, constants |

Architecture Design

Why Lance in Iceberg?

| Dimension | Parquet/ORC | Lance | Value |
| --- | --- | --- | --- |
| Random Access | Full RowGroup/Stripe scan | O(1) row-level | 10-100x for AI inference |
| Vector Search | Not native | Built-in ANN index | No external vector DB needed |
| Update Efficiency | Copy-on-Write full rewrite | Native row-level update | Frequent update scenarios |
| Arrow Integration | Serialization required | Zero-copy mapping | Memory efficiency |

Extension Architecture

iceberg-lance/
├── src/main/java/org/apache/iceberg/lance/    (8 core classes)
│   ├── Lance.java              — Entry: WriteBuilder / ReadBuilder / DataWriteBuilder
│   ├── LanceSchemaUtil.java    — Iceberg Schema ↔ Arrow Schema conversion
│   ├── LanceValueWriters.java  — Write Iceberg values to Arrow vectors
│   ├── LanceValueReaders.java  — Read Arrow vectors to Iceberg types
│   ├── LanceFileAppender.java  — FileAppender with Metrics
│   ├── LanceIterable.java      — CloseableIterable with projection
│   ├── LanceMetrics.java       — Metrics collection
│   └── LanceUtil.java          — Config constants and utilities
├── src/main/java/org/apache/iceberg/data/lance/  (2 adapters)
│   ├── GenericLanceReader.java
│   └── GenericLanceWriter.java
└── src/test/java/org/apache/iceberg/lance/       (8 test classes)

Type Mapping (Iceberg ↔ Arrow ↔ Lance)

| Iceberg Type | Arrow Type | Lance Type |
| --- | --- | --- |
| BooleanType | Bool | Bool |
| IntegerType | Int32 | Int32 |
| LongType | Int64 | Int64 |
| FloatType | Float32 | Float32 |
| DoubleType | Float64 | Float64 |
| DateType | Date32 | Date32 |
| TimeType | Time64(µs) | Time64(µs) |
| TimestampType | Timestamp(µs) | Timestamp(µs) |
| StringType | Utf8 | Utf8 |
| BinaryType | Binary | Binary |
| DecimalType(p,s) | Decimal128(p,s) | Decimal128(p,s) |
| UUIDType | FixedSizeBinary(16) | FixedSizeBinary(16) |
| ListType | List | List |
| MapType | Map | Map |
| StructType | Struct | Struct |
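
The mapping above can be read as a simple lookup. The sketch below encodes it with plain strings so that it runs without the Iceberg or Arrow libraries on the classpath. TypeMappingSketch and toArrow are hypothetical names for illustration only; the PR's LanceSchemaUtil presumably works on real Iceberg and Arrow type objects instead.

```java
// Hypothetical, string-based rendering of the type mapping table; not the PR's code.
class TypeMappingSketch {
  // Maps an Iceberg type name, as written in the table, to its Arrow type name.
  static String toArrow(String icebergType) {
    // Decimals carry precision and scale through unchanged: DecimalType(p,s) -> Decimal128(p,s).
    if (icebergType.startsWith("DecimalType(")) {
      return "Decimal128" + icebergType.substring("DecimalType".length());
    }
    switch (icebergType) {
      case "BooleanType": return "Bool";
      case "IntegerType": return "Int32";
      case "LongType": return "Int64";
      case "FloatType": return "Float32";
      case "DoubleType": return "Float64";
      case "DateType": return "Date32";
      case "TimeType": return "Time64(µs)";
      case "TimestampType": return "Timestamp(µs)";
      case "StringType": return "Utf8";
      case "BinaryType": return "Binary";
      case "UUIDType": return "FixedSizeBinary(16)";
      case "ListType": return "List";
      case "MapType": return "Map";
      case "StructType": return "Struct";
      default:
        throw new IllegalArgumentException("Unsupported type: " + icebergType);
    }
  }

  public static void main(String[] args) {
    System.out.println(toArrow("LongType")); // Int64
    System.out.println(toArrow("DecimalType(9,2)")); // Decimal128(9,2)
  }
}
```

Because the Arrow and Lance columns in the table are identical, a single lookup covers both targets in this sketch.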

Design Principles

  1. Follow existing patterns — Fully mirrors iceberg-parquet and iceberg-orc module structure
  2. Optional dependency — Registered via reflection in InternalData, no impact on existing functionality
  3. Arrow-native — Uses Arrow as intermediate representation for zero-copy integration
  4. Complete Metrics — Full support for rowCount, columnSizes, valueCounts, nullCounts, lowerBounds, upperBounds
  5. Column projection — LanceIterable supports projection reads with fieldId matching
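
Principle 5 can be sketched as follows: projected fields are resolved against the file's columns by field ID rather than by name, so a renamed column still resolves and a column missing from the file is reported as absent. ProjectionSketch, Field, and resolve are hypothetical names introduced for this example; how LanceIterable actually does this is not shown in the description.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of projection by field ID; not the PR's LanceIterable.
class ProjectionSketch {
  // Minimal field descriptor: an Iceberg-style field ID plus a column name.
  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  // For each projected field, returns the position of the file column with the
  // same field ID, or -1 when the file does not contain that field.
  static int[] resolve(List<Field> fileSchema, List<Field> projection) {
    Map<Integer, Integer> positionById = new HashMap<>();
    for (int i = 0; i < fileSchema.size(); i++) {
      positionById.put(fileSchema.get(i).id, i);
    }
    int[] positions = new int[projection.size()];
    for (int i = 0; i < projection.size(); i++) {
      positions[i] = positionById.getOrDefault(projection.get(i).id, -1);
    }
    return positions;
  }

  public static void main(String[] args) {
    List<Field> file =
        Arrays.asList(new Field(1, "id"), new Field(2, "name"), new Field(3, "embedding"));
    // Field 3 was renamed in the table schema; field 4 was added after this file was written.
    List<Field> projection =
        Arrays.asList(new Field(3, "vec"), new Field(1, "id"), new Field(4, "added"));
    System.out.println(Arrays.toString(resolve(file, projection))); // [2, 0, -1]
  }
}
```

A reader built this way would read only the resolved column positions and fill the -1 slots with nulls or defaults.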

Implementation Roadmap

  • Phase 1 ✅ Core format module (this PR)
  • Phase 2 🔜 Data layer integration (BaseFileWriterFactory, GenericAppenderFactory)
  • Phase 3 🔜 Engine integration (Spark/Flink readers and writers)
  • Phase 4 🔜 Advanced features (ANN vector search, row-level Merge-on-Read)

CI Checks

All checks pass locally:

  • ✅ Spotless (Google Java Format)
  • ✅ Checkstyle
  • ✅ Error-Prone static analysis
  • ✅ Javadoc compilation
  • ✅ Apache License headers
  • ✅ Unit tests (60/60 passed)
  • ✅ Full build

📄 Full architecture design document: see iceberg-lance-format-design.md in this PR.

rockyyin added 2 commits March 10, 2026 19:12
- Add LANCE enum to FileFormat
- Add lance module configuration in settings.gradle and build.gradle
- Implement core lance module:
  - Lance.java: main entry with ReadBuilder/WriteBuilder/DataWriteBuilder
  - LanceSchemaUtil: Iceberg Schema <-> Arrow Schema conversion
  - LanceValueWriters/Readers: type-specific value writers and readers
  - LanceFileAppender: FileAppender implementation with metrics collection
  - LanceIterable: CloseableIterable with projection support
  - LanceMetrics: metrics builder and collector
  - LanceUtil: configuration constants and utilities
- Implement data layer adapters:
  - GenericLanceReader/Writer for GenericRecord support
- Add comprehensive test suite (60 test cases, 8 test classes)
- Add Lance format design document
- Apply Google Java Format via spotlessApply
- Replace new ArrayList<>() with Lists.newArrayList() (checkstyle)
- Add .hasMessage() to assertThatThrownBy assertions (checkstyle)
- Fix Javadoc HTML entity escaping for <-> symbols
- Remove unused imports