Default demand should be 1 #72

Closed
benwilson512 opened this issue Aug 15, 2016 · 10 comments

benwilson512 commented Aug 15, 2016

I think the default demand should be one, or there should be no default at all.

Premises

  1. When people play around with Flow, they're expecting basically a configurable concurrent Enum. Obviously there's a lot more to it, but this comparison is evident both in Flow's API and in the examples given in the Flow / GenStage docs, which include explicit comparisons to Enum-based pipelines.

Issues with current defaults:

  • Much too high for many uses. For anything IO-bound, or where the time taken to perform an operation dominates the runtime of the overall flow, any default other than 1 is entirely too high. The current defaults specifically are orders of magnitude too high.
  • Counterintuitive. Given premise 1, people are used to thinking about consuming enumerables one item at a time; batching an enumerable takes an explicit call. As both my own experience and the experience of others will testify, there have been many cases where we used Flow to build something and saw everything happen sequentially because we didn't know there was batching happening under the covers (see the sketch just after this list). While such batching may be useful when trying to maximize throughput in certain use cases, I think it is more natural to consider batching something you opt into rather than something you need to opt out of.
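
A minimal sketch of that second point (not from the original thread; the option values and the sleep workload are assumed): with the default demand, a small IO-bound flow runs effectively sequentially, and you have to lower `max_demand` explicitly to fan the work out.

```elixir
# With the default max_demand (1000), all ten events fit into a single
# batch, so the sleeps below run one after another on a single stage.
# Setting max_demand: 1 hands each stage one event at a time, so the
# sleeps overlap and the flow finishes in roughly one second.
1..10
|> Flow.from_enumerable(max_demand: 1, stages: 10)
|> Flow.map(fn i ->
  Process.sleep(1000) # stand-in for IO-bound work
  i
end)
|> Flow.run()
```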

Issues with the proposal of 1, and responses:

  • Much too low for many uses. While true, it seems more natural to start with too little grouping and add batching on top than to start with an arbitrary amount of batching and have to adjust up or down.

Issues with any default other than 1:

  • There are definitely cases where 1 is the ideal default. This is less true for every other number.
  • No matter what number you choose, it's going to be wrong for a lot of cases. You can try to pick numbers you think suit the majority of cases, but this is hard to determine a priori; 1 actually does relatively well here. The other factor in choosing a value is intuitiveness, and I think 1 works in that respect as well, due to premise 1.
benwilson512 (Author)

ugh, hit enter too early, please hold....

benwilson512 (Author)

ok we're good now.

josevalim commented Aug 15, 2016

Honestly, I would say it is best to break expectations up front. If people are expecting it to flow one item at a time, then I would like to break this expectation early on, instead of having it apparently work well but with subpar performance and a subpar understanding of the tools. Any expectation of ordering, unity, and sequentiality should be addressed.

For example, I have also seen folks using Flow to map over a list of 10 elements with very basic computations, expecting to see improvements. All of those "intuitions" are wrong and need to be broken. Better docs on those cases may help too.
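
To make that concrete, a hedged illustration (the workload and numbers are assumed, not from the thread): for a tiny list with cheap computations, plain Enum wins because Flow's stage setup and message passing cost more than the work itself.

```elixir
# Cheap work over 10 elements: Enum is the right tool here.
Enum.map(1..10, &(&1 * 2))

# The Flow version spawns stages and shuffles messages for the same
# result; for work this small it is typically slower, not faster, and
# note that Flow does not guarantee the ordering of results.
1..10
|> Flow.from_enumerable()
|> Flow.map(&(&1 * 2))
|> Enum.to_list()
```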

josevalim

4c74352 talks about batch size right in the second paragraph now.

benwilson512 (Author)

I agree on the docs, and I'm glad to see the defaults mentioned further up.

The broader point is that the distribution of usefulness over the set of feasible numbers here has a distinct mode at 1, which isn't true of any other number. 500 is not meaningfully more useful than 499 or 501. There are plenty of cases, however, where 1 is more useful than 2, and wildly more useful than 500.

I'm happy to recognize that there are fundamental differences with respect to how Flow and GenStage manage data, and one such difference is that they operate on batches. The problem is that the currently chosen batch sizes amount to a built-in optimization for only a certain set of problems, and I'm not sure why we should optimize those problems over others.

That last argument could be leveled against choosing 1 and whatever problems it turns out to be optimal for, but 1 is at least the only definitive bound on the range of possible demand values.

josevalim

> The problem is that the currently chosen batch sizes amount to a built-in optimization for only a certain set of problems, and I'm not sure why we should optimize those problems over others.

The main purpose of having the default of 1000 is not to be some silver bullet. I am fine with changing the value to 10 or 100. However, I agree 10 or 100 won't be much better or worse than 1000. But that's my point: the value of "more than 1" is there to show there is batching. Choosing a default of 1 would completely undermine it.

If developers are asking questions, that's a good thing, as long as they are being answered. The only other option I can think of is to have no default, but I am not sure it would solve anything.

paulbalomiri commented Sep 18, 2016

> 1000 is not to be some silver bullet

When I first started to use GenStage a week ago, I had the problem that I did not know where the demand=1000 came from, and somehow suspected it was set to max_demand, which (as I know now) is [partially] wrong.

Before demand and max_demand became clear to me, I kept stumbling upon lots and lots of Process.sleep(1000) calls in the doc examples and in José's talk in London.

Perhaps I was just tired or slow to digest the GenStage workings, but Process.sleep(1042), and a default demand that differs from the defaults for both max_demand and min_demand, would have helped me.
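
For anyone tripping over the same thing, a minimal consumer sketch (MyProducer is an assumed, registered producer; the option values are illustrative) shows where the number comes from: on subscription a consumer initially asks the producer for up to max_demand events, so with the defaults the very first batch can be as large as 1000.

```elixir
defmodule MyConsumer do
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok)

  def init(:ok) do
    # Spelling out max_demand/min_demand makes the demand visible
    # instead of relying on the defaults (1000 and 500).
    {:consumer, :ok, subscribe_to: [{MyProducer, max_demand: 10, min_demand: 5}]}
  end

  def handle_events(events, _from, state) do
    # With the options above, batches arrive with at most 10 events.
    IO.inspect(length(events), label: "batch size")
    {:noreply, [], state}
  end
end
```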

josevalim

> Process.sleep(1000) in the doc examples, and in José's talk in London.

Oh, that's great feedback. I will take note! :D

josevalim

@benwilson512 so should we stick with 1000 after all?

seivan commented Apr 20, 2017

Sorry for waking up an old ticket. Feel free to close it if it's not the right venue or if I should start a new one.

We had the same issue with the demand (though not with Flow). Our current design is one chain in which the producer (P), three producer-consumers (PCs), and the consumer (C) each handle one demand at a time. If we want more throughput, we start new chains.

Why would one want a single PC to process a batch of events instead of one at a time?
If a PC operates on a batch of "jobs" and one of them crashes, the whole batch goes down when only that particular job should fail.

From my point of view, one at a time makes it easier to reason about errors and handle them when they happen.
In our case, we pop work from Redis, process it in several PCs, and eventually write to the DB in the C.
Errors are handled based on which PC fails: PCs that do network requests put the original job back into the queue to be resumed later, while other kinds of errors drop the job into an error bucket.

Now granted, I am pretty sure I am "thinking" about this wrong or have missed something vital. I just want to make sure of that before we proceed with this.
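
For reference, a sketch of the one-job-at-a-time subscription described above (RedisProducer and the job handling are assumed placeholders; only the demand options matter here):

```elixir
defmodule JobWorker do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, :ok, opts)

  def init(:ok) do
    # max_demand: 1 with min_demand: 0 means this PC holds at most one
    # job at any moment, so a crash can only take down that single job.
    {:producer_consumer, :ok,
     subscribe_to: [{RedisProducer, max_demand: 1, min_demand: 0}]}
  end

  def handle_events([job], _from, state) do
    # With max_demand: 1, the events list always has exactly one element.
    {:noreply, [process(job)], state}
  end

  defp process(job), do: job # placeholder for the real per-job work
end
```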
