Skip to content

How to prioritize executors running on different agents when allocating jobs? #147

@svmhdvn

Description

@svmhdvn

I am wondering if there is a way in gleam to better distribute jobs across available agents rather than just executors. Consider the following example:

  1. read a text file containing (key, value) pairs, with many keys being repeated
  2. partition by keys, process the value and transform it to something
  3. write values to disk distributed evenly by keys across agents

Let's say there are 6 possible keys, ["key1", "key2", ... , "key6"], and I am running 3 agents on different machines. Ideally, I would like to have the following distribution after my flow:
agent1: only has values with two different keys
agent2: only has values with two different keys
agent3: only has values with two different keys

For the example, consider the following flow:

const NUM_AGENTS = 3
...
    f := flow.New("example").
        Read(file.Txt("data/*", 4)).
        Map("process the values", ProcessValues).
        PartitionByKey("shard by key", NUM_AGENTS).
        Map("write to local file on agent's filesystem", WriteToFile)
...

Using this flow, I don't get the results I'm looking for. Instead, gleam simply finds available executors, regardless of the agent they're running on, so I might get something that looks like this:

agent1: has values with three different keys
agent2: has values with 1 key
agent3: has values with two different keys

Is there a way to prioritize available agents without running jobs on them? In particular, I would like the partition to shard all the 6 keys into 2, 2, and 2 running on separate agents and only use the executors available on those agents.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions