Skip to content

Get rid of instance-level ThreadLocal in ConcurrentGrouper #8884

@leventov

Description

@leventov

Motivation: instance-level ThreadLocals are hard to understand. They are supposed to mask an issue of overly complex logic organization, but actually only add more complexity.

Instance-level ThreadLocals also create extra work for GC (weak references, large ThreadLocalMaps to scan) that can lead to longer pauses, but this is probably not a big issue in this specific case.

In the case of ConcurrentGrouper, threadNumber assignment could be pulled up to the beginning of the task in GroupByMergingQueryRunnerV2, then a special accumulator is created which knows of the number (instead of using a pre-created accumulator from RowBasedGrouperHelper.createGrouperAccumulatorPair()), and that accumulator calls to a specially created ConcurrentGrouper's method aggregate(int threadNumber, key).

This means that creating concurrent and non-concurrent groupers should be un-generalized (currently generalized in RowBasedGrouperHelper.createGrouperAccumulatorPair() method), but this seems to be beneficial because this looks like a false abstraction which only adds coupling and confusion of execution paths (harder to see which grouper type is created and used when) rather than lifts any logic. For that matter, ConcurrentGrouper doesn't need to implement Grouper interface, too, providing only aggregate(int threadNumber, key) method, but not standard Grouper's methods aggregate(key) and aggregate(key, keyHash). I think it would also make things clearer because it would become more evident for people who try to unwind the logic of the subsystem in their heads where ConcurrentGrouper could and could not appear.

It would also be nice to provide evidence in comments that the number of queryables in GroupByMergingQueryRunnerV2 and therefore the number of created tasks doesn' exceed (or exactly match?) the concurrencyHint because currently, this looks to be the prerequisite for ConcurrentGrouper to work without ArrayIndexOutOfBoundsException in this line, but it doesn't seems to follow from anywhere. Also, it means that concurrencyHint is not actually a "hint", that is, something completely heuristical/performance-related which could be any number, but actually a parameter which should have a very specific meaning of the actual number of concurrent threads calling aggregate() on the instance of ConcurrentGrouper.

FYI @jihoonson

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions