Open
Description
Summary
PR #3224 implements field-major processing for struct fields, which reduces the number of type dispatches from O(rows × fields) to O(fields). However, for complex field types nested inside a struct (Struct, List, Map), it falls back to row-major processing via append_field.
This issue tracks extending the field-major optimization to nested Struct fields specifically.
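To make the dispatch-count difference concrete, here is a minimal sketch in plain Rust. It does not use the real row.rs code or Arrow builders: `FlatColumn`, `FlatRow`, `append_row_major`, and `append_field_major` are hypothetical names invented for illustration. Both functions fill the same columns; only the number of type matches differs.

```rust
/// Toy column builder; a stand-in for Arrow builders, not the real API.
#[derive(Debug, PartialEq)]
enum FlatColumn {
    Int64(Vec<i64>),
    Bool(Vec<bool>),
}

/// Toy row: every field is stored as a raw 8-byte word (an assumption made
/// for this sketch, loosely mimicking fixed-width row slots).
type FlatRow = Vec<u64>;

/// Row-major: the type `match` runs once per (row, field) pair.
/// Returns the number of dispatches performed.
fn append_row_major(rows: &[FlatRow], cols: &mut [FlatColumn]) -> usize {
    let mut dispatches = 0;
    for row in rows {
        for (field_idx, col) in cols.iter_mut().enumerate() {
            dispatches += 1;
            match col {
                FlatColumn::Int64(v) => v.push(row[field_idx] as i64),
                FlatColumn::Bool(v) => v.push(row[field_idx] != 0),
            }
        }
    }
    dispatches
}

/// Field-major: the type `match` runs once per field, followed by a tight
/// loop over all rows for that field.
fn append_field_major(rows: &[FlatRow], cols: &mut [FlatColumn]) -> usize {
    let mut dispatches = 0;
    for (field_idx, col) in cols.iter_mut().enumerate() {
        dispatches += 1;
        match col {
            FlatColumn::Int64(v) => v.extend(rows.iter().map(|r| r[field_idx] as i64)),
            FlatColumn::Bool(v) => v.extend(rows.iter().map(|r| r[field_idx] != 0)),
        }
    }
    dispatches
}
```

With 3 rows and 2 fields, the row-major version dispatches 6 times and the field-major version 2 times, while the resulting columns are identical.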
Current Behavior
In append_struct_fields_field_major() (row.rs), complex types fall back to per-row processing:
    // For complex types (struct, list, map), fall back to append_field
    // since they have their own nested processing logic
    dt @ (DataType::Struct(_) | DataType::List(_) | DataType::Map(_, _)) => {
        for (row_idx, i) in (row_start..row_end).enumerate() {
            let nested_row = if struct_is_null[row_idx] {
                SparkUnsafeRow::default()
            } else {
                // ... extract nested row
            };
            append_field(dt, struct_builder, &nested_row, field_idx)?;
        }
    }

This means that for deeply nested structs, we lose the benefit of field-major processing at each nesting level.
Proposed Optimization
For nested Struct fields:
- Get the nested StructBuilder once per field
- Build nested struct validity in one pass
- Recursively apply field-major processing to nested struct fields
This would require refactoring to separate validity handling from field value processing.
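A rough sketch of what the steps above could look like, using toy types rather than the actual StructBuilder and SparkUnsafeRow APIs. Everything here (`ToyType`, `ToyValue`, `ToyColumn`, `build_fields`) is hypothetical, and writing a placeholder `0` under a null parent is an assumption of this sketch, not a statement about what the real code does. The point is the shape: validity for a nested struct is computed in one pass, kept separate from the child values, and the child fields are then processed field-major by recursion.

```rust
/// Hypothetical toy schema; stands in for Arrow's `DataType`.
enum ToyType {
    Int64,
    Struct(Vec<ToyType>),
}

/// Hypothetical toy field value; `Struct(None)` models a null nested struct.
enum ToyValue {
    Int64(i64),
    Struct(Option<Vec<ToyValue>>),
}

/// Toy output column; a struct column keeps validity apart from its children.
#[derive(Debug, PartialEq)]
enum ToyColumn {
    Int64(Vec<i64>),
    Struct { validity: Vec<bool>, children: Vec<ToyColumn> },
}

/// Field-major build: one type dispatch per field, at every nesting level.
/// `rows[i]` is `None` when the enclosing struct is null in row `i`.
fn build_fields(types: &[ToyType], rows: &[Option<&Vec<ToyValue>>]) -> Vec<ToyColumn> {
    types
        .iter()
        .enumerate()
        .map(|(idx, ty)| match ty {
            ToyType::Int64 => ToyColumn::Int64(
                rows.iter()
                    .map(|row| match row {
                        Some(fields) => match &fields[idx] {
                            ToyValue::Int64(v) => *v,
                            _ => panic!("value does not match schema"),
                        },
                        // Placeholder under a null parent (sketch assumption).
                        None => 0,
                    })
                    .collect(),
            ),
            ToyType::Struct(child_types) => {
                // Pass 1: extract the nested row (or null) for every parent row.
                let nested: Vec<Option<&Vec<ToyValue>>> = rows
                    .iter()
                    .map(|row| match row {
                        Some(fields) => match &fields[idx] {
                            ToyValue::Struct(inner) => inner.as_ref(),
                            _ => panic!("value does not match schema"),
                        },
                        None => None,
                    })
                    .collect();
                // Validity is derived once, independently of the child values.
                let validity: Vec<bool> = nested.iter().map(|r| r.is_some()).collect();
                // Pass 2: recurse, so the nested fields are also field-major.
                let children = build_fields(child_types, &nested);
                ToyColumn::Struct { validity, children }
            }
        })
        .collect()
}
```

For a schema of one Int64 field and one nested struct holding an Int64, two rows where the second row's nested struct is null yield a nested validity of `[true, false]` and a child column of `[10, 0]` when the first row's nested value is 10.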
Expected Impact
- 1.2-1.5x speedup for workloads with deeply nested struct types
- Benefit multiplies with nesting depth
Notes
- List and Map fields are harder to optimize due to variable-length elements per row
- This is a follow-up to PR #3224 (perf: optimize native shuffle struct field processing with field-major order), which implemented the initial field-major optimization
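The note about List fields can be illustrated with a small sketch. It uses a hypothetical `ToyListColumn` with a flat values buffer plus per-row offsets, similar in spirit to a columnar list layout but not the real Arrow API. Because each row contributes a different number of elements, the builder has to track an offset per row instead of appending exactly one value per row as in the struct case.

```rust
/// Toy list column: `offsets[i]..offsets[i + 1]` is the slice of `values`
/// that belongs to row `i`. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
struct ToyListColumn {
    offsets: Vec<usize>,
    values: Vec<i64>,
}

/// Builds a list column from rows that each hold a variable number of elements.
fn build_list_column(rows: &[Vec<i64>]) -> ToyListColumn {
    let mut offsets = Vec::with_capacity(rows.len() + 1);
    let mut values = Vec::new();
    offsets.push(0);
    for row in rows {
        // The element count varies per row, so no fixed stride can be assumed.
        values.extend_from_slice(row);
        offsets.push(values.len());
    }
    ToyListColumn { offsets, values }
}
```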