Fix TableScan to generate correct partition bucket paths #128

@luoyuxia

Parent Issue

Part of #124 (support partitioned table)
Depends on #126 (BinaryRow deserialization), #127 (partition path generation)

Background

TableScan::plan_snapshot() currently discards partition information when building DataSplits:

// table_scan.rs:154
for ((_partition, bucket), group_entries) in groups {
    // ...
    // table_scan.rs:171-173
    // todo: consider partitioned table
    let bucket_path = format!("{base_path}/bucket-{bucket}");
    let partition = BinaryRow::new(0);  // Always empty!
}

For partitioned tables, the correct path should be {table_path}/{partition_path}/bucket-{bucket}, e.g., {table_path}/dt=2024-01-01/bucket-0/.
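A minimal sketch of the path layout described above. `bucket_path` is a hypothetical helper, not an existing function in the crate, and it assumes that an empty partition path means the table is unpartitioned. It also tolerates a trailing slash on either input, in case the partition path generator from #127 emits one.

```rust
/// Hypothetical helper (not part of the crate): joins the table path, an
/// optional partition path, and the bucket directory name.
///
/// Assumption: an empty `partition_path` means the table is unpartitioned,
/// so the result falls back to `{table_path}/bucket-{bucket}`.
fn bucket_path(table_path: &str, partition_path: &str, bucket: i32) -> String {
    // Normalize slashes so the segments can be joined without doubling them.
    let table_path = table_path.trim_end_matches('/');
    let partition_path = partition_path.trim_matches('/');

    if partition_path.is_empty() {
        format!("{table_path}/bucket-{bucket}")
    } else {
        format!("{table_path}/{partition_path}/bucket-{bucket}")
    }
}
```

With these assumptions, `bucket_path("/warehouse/t", "dt=2024-01-01", 0)` yields `/warehouse/t/dt=2024-01-01/bucket-0`, and an empty partition path keeps the current unpartitioned behavior.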

What needs to be done

  1. Pass partition type info to plan_snapshot()

    • Add partition keys (names) and partition field types (from TableSchema) as parameters, or pass the TableSchema itself
    • Alternatively, change plan_snapshot() from a static method to an instance method that can access self.table.schema
  2. Decode partition bytes into BinaryRow

    • For each group key (partition_bytes, bucket), construct a BinaryRow from the raw bytes using BinaryRow::from_bytes(arity, data)
    • The arity is the number of partition keys
  3. Generate partition path using PartitionPathUtils

  4. Store actual partition data in DataSplit

    • Pass the decoded BinaryRow (with real data) to DataSplitBuilder.with_partition() instead of the empty BinaryRow::new(0)
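The four steps above could fit together roughly as follows. This is a sketch under stated assumptions, not the crate's actual code: `BinaryRow` and `SplitSketch` here are simplified stand-ins for the real `BinaryRow` and `DataSplit`, `plan_splits` is a hypothetical name, and the `to_partition_path` closure stands in for whatever `PartitionPathUtils` API lands in #127.

```rust
use std::collections::BTreeMap;

/// Stand-in for the crate's BinaryRow: keeps only the arity and raw bytes.
#[derive(Debug, Clone, PartialEq)]
struct BinaryRow {
    arity: usize,
    data: Vec<u8>,
}

impl BinaryRow {
    /// Mirrors the `BinaryRow::from_bytes(arity, data)` shape named in step 2.
    fn from_bytes(arity: usize, data: Vec<u8>) -> Self {
        Self { arity, data }
    }
}

/// Stand-in for DataSplit: only the fields relevant to this issue.
#[derive(Debug)]
struct SplitSketch {
    partition: BinaryRow,
    bucket: i32,
    bucket_path: String,
}

/// Hypothetical planner loop. `partition_keys` is the partition type info
/// from step 1; `to_partition_path` is an assumed stand-in for step 3.
fn plan_splits<F>(
    table_path: &str,
    partition_keys: &[String],
    groups: BTreeMap<(Vec<u8>, i32), Vec<String>>,
    to_partition_path: F,
) -> Vec<SplitSketch>
where
    F: Fn(&[String], &BinaryRow) -> String,
{
    let mut splits = Vec::new();
    for ((partition_bytes, bucket), _entries) in groups {
        // Step 2: decode the raw group key; arity = number of partition keys.
        let partition = BinaryRow::from_bytes(partition_keys.len(), partition_bytes);

        // Step 3: build the bucket path, keeping the old layout when the
        // table has no partition keys.
        let bucket_path = if partition_keys.is_empty() {
            format!("{table_path}/bucket-{bucket}")
        } else {
            let partition_path = to_partition_path(partition_keys, &partition);
            let partition_path = partition_path.trim_matches('/');
            format!("{table_path}/{partition_path}/bucket-{bucket}")
        };

        // Step 4: keep the decoded partition instead of an empty row.
        splits.push(SplitSketch { partition, bucket, bucket_path });
    }
    splits
}
```

Passing the path generator as a closure keeps the sketch independent of #126 and #127; in the real change, `plan_snapshot()` would presumably call the crate's own helpers directly.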

Affected files

  • crates/paimon/src/table/table_scan.rs: plan_snapshot() method
