[VL] Distinct aggregation OOM when getOutput #8025

@ccat3z

Description

Backend

VL (Velox)

Bug description

Distinct aggregation merges all sorted spill files in getOutput() (SpillPartition::createOrderedReader). If there are too many spill files, reading the first batch of each file into memory consumes a significant amount of memory. In one of our internal cases, a single task generated 300 spill files, which required close to 3 GB of memory.
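To make the scale concrete, here is a rough back-of-the-envelope sketch of the merge-time footprint. It assumes one buffered batch per spill file is resident at the same time; the ~10 MiB per-file figure is not a measured Velox value, it is simply back-derived from the numbers above (about 3 GB across 300 files). The function name is hypothetical.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def merge_read_memory_bytes(num_spill_files: int, first_batch_bytes: int) -> int:
    """Rough estimate: every spill file keeps one batch buffered during the ordered merge."""
    return num_spill_files * first_batch_bytes


# Assumption: ~10 MiB buffered per file, inferred from "300 files, close to 3 GB".
estimate = merge_read_memory_bytes(num_spill_files=300, first_batch_bytes=10 * MiB)
print(f"{estimate / GiB:.2f} GiB")  # prints "2.93 GiB"
```

The footprint grows linearly with the number of spill files, which is why a task with hundreds of files can run out of memory before it emits any output.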

Possible workarounds:

  1. Increase kMaxSpillRunRows. A value of 1M generates too many spill files for inputs with hundreds of millions of rows. Related: [GLUTEN-7249][VL] Lower default overhead memory ratio and spill run size #7531
  2. Reduce kSpillWriteBufferSize to 1M or lower. Why is it set to 4M by default? Is there any performance-tuning experience behind this value?
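As an illustration of why the first workaround helps, the sketch below estimates how the spill file count changes with the run size. It assumes each spill run produces exactly one sorted file, which may not match Velox's actual behavior; the 300M-row input and the larger run sizes are hypothetical values chosen to line up with the 300-file case above.

```python
def spill_file_count(input_rows: int, max_spill_run_rows: int) -> int:
    """Ceiling division: assumes one sorted spill file per spill run."""
    return -(-input_rows // max_spill_run_rows)


input_rows = 300_000_000  # hypothetical input size ("hundreds of millions of rows")

for run_rows in (1_000_000, 4_000_000, 16_000_000):
    files = spill_file_count(input_rows, run_rows)
    print(f"run size {run_rows:>10,} rows -> {files} spill files")
```

Under these assumptions a 1M-row run size yields 300 files, while 4M and 16M yield 75 and 19. Fewer files means fewer batches held in memory at once during the merge in getOutput().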

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), triage
