
perf: enabled streams standby replicas by default #3641

Closed

Conversation

vinothchandar
Contributor

@vinothchandar vinothchandar commented Oct 22, 2019

Description

BREAKING CHANGE: existing multi-instance ksql deployments could see increased disk usage due to additional standby state replication, with the benefit of improved state availability for pull queries. By default, we tolerate 1 instance failure; there is no effect on single-instance deployments.
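For context, here is a minimal sketch of what this default amounts to at the Kafka Streams level. `StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG` is the real Streams setting being enabled; the wrapper class and method names below are illustrative only, not the PR's actual code:

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

/**
 * Minimal sketch (hypothetical helper, not the PR's code): apply a standby-replica
 * default to the Kafka Streams properties that back each persistent query, while
 * still letting an explicit user setting win.
 */
public class StandbyDefaultSketch {

    public static Properties withStandbyDefault(final Properties userProps) {
        final Properties props = new Properties();
        props.putAll(userProps);
        // num.standby.replicas counts extra copies on top of the active task, so a
        // value of 1 keeps one warm copy of each state store on another instance
        // and tolerates a single instance failure without a full state rebuild.
        props.putIfAbsent(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        return props;
    }
}
```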

Testing done

Unit test added. Tested on my box by bringing up the ksql server.

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@vinothchandar vinothchandar requested a review from a team as a code owner October 22, 2019 16:42
@vinothchandar
Contributor Author

vinothchandar commented Oct 22, 2019

@derekjn @mjsax we can continue the discussion here on choosing between the approaches below:

a) Land this change and enable it by default. Existing deployments will see additional resource consumption, but we can message this along with other changes we are making.
b) Just message this along with other streams configuration recommendations. If users simply upgrade ksql and try pull queries, they may have a bad experience.

None of this matters for single-node quickstart/docker playground use cases.

Member

@mjsax mjsax left a comment

LGTM.

Contributor

@agavra agavra left a comment

LGTM - please get a +1 from one of the cloud folk (@vcrfxia @stevenpyzhang @spena)

@vinothchandar
Contributor Author

Capturing the conversation with @derekjn here. IMO we ideally should have enabled standby replication in KStreams by default, since the KStreams design does not rely on checkpointing to recover state upon instance failures.

Given the changes we are making to KSQL anyway, it seems prudent to do this now, so that even pull/streaming queries will see meaningful speedups in recovery times.

Contributor

@vcrfxia vcrfxia left a comment

LGTM. cc @rodesai in case we want to override this default in cloud later. Using the default is definitely fine for now.

@blueedgenick
Contributor

blueedgenick commented Oct 23, 2019 via email

@derekjn
Contributor

derekjn commented Oct 23, 2019

would be to stick with what we have (where the default is for no standby replicas) and we revisit #817 - basically we should provide an entire properties file which is tuned for "resilient, production deployments".

@blueedgenick this makes a lot of sense to me. I think it's very reasonable to expect users to explicitly create a production configuration file (perhaps using our suggested configuration options as guidance), especially considering that KSQL runs as an actual server. This is fairly standard practice IMO, and I don't believe that users expect the OOTB configuration to be suitable for production.

@vinothchandar
Contributor Author

Thanks for all the feedback. The answer really depends on how you look at ksql: as a streaming ETL tool, or as an end-to-end database with pull queries as the mainstay.

should be suitable for dev or test or prod or some other flavor of prod :) e.g. for some large-scale app use cases, it makes very little sense to double the resource usage

Let's try to break this down :)

  • For dev/test where you have a single instance, this is a no-op. So we are good.
  • For multi-node deployments with small-scale state, the overhead is small anyway.
  • For large-scale state (defined as something that is a few GBs in size), I have to disagree with you. This is where the standbys really matter. There are numerous Streams tickets on this very issue where the topology stays in a recovering state for hours. For point/pull queries, I think it will be a very frustrating experience to wait tens of seconds whenever nodes go down. Other systems that checkpoint state by other means, such as cloud storage/HDFS, do pay the additional cost for drastically better availability.

The prod config file is a great idea; I had them on my previous project(s) too. I did not realize ksql does not have one yet. I can pick up #817 if there are no takers yet.

On this PR: if the argument is that this will be disruptive to existing KSQL deployments and that's not okay, I'm happy to close this and do (some form of) #817 instead, as it relates to pull queries.

@vinothchandar
Contributor Author

@apurvam there are trade-offs here. WDYT?

@apurvam
Contributor

apurvam commented Oct 23, 2019

I’d be in favor of having a set of recommended production configs in a clearly named properties file. This file can include this PR plus everything from https://docs.confluent.io/current/ksql/docs/installation/server-config/config-reference.html#ksql-production-settings.

The default can remain optimized for development purposes.

That said, since you will never have a standby replica on the same instance (so num.standby <= num instances), this default is fairly harmless. Philosophically, num.standby = 3 belongs in a production file and num.standby = 1 can be explicitly put in the development file.

@vinothchandar
Contributor Author

I still think this decision should be based on whether it's okay to increase the footprint for existing deployments. Nonetheless, I concede we don't have this data, and I think most of the confusion here is around which use case to optimize for.

So I'll close this and make a PR for #817 instead. Thanks everyone!

@mjsax
Member

mjsax commented Oct 23, 2019

Philosophically, num.standby = 3 belongs in a production file and num.standby = 1 can be explicitly put in the development file.

@apurvam -- for dev the config should be num.standby = 0, and for prod maybe num.standby = 2 (note that the config is different from the topic replication factor, as it excludes the active task).
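To make the distinction above concrete, here is a quick worked example (my own arithmetic, not from the thread; the class is purely illustrative):

```java
public class StandbyCopyCount {

    public static void main(final String[] args) {
        // num.standby.replicas counts only the extra copies, not the active task.
        final int numStandbyReplicas = 1;                // the default this PR proposes
        final int totalCopies = 1 + numStandbyReplicas;  // 2 copies -> tolerates 1 instance failure
        // A topic replication.factor of 3 already counts the leader, so the
        // comparable Streams setting would be num.standby.replicas = 2, not 3.
        System.out.println("Copies of each state store: " + totalCopies);
    }
}
```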

@apurvam
Contributor

apurvam commented Oct 23, 2019

Thanks @mjsax, yeah, I came back to make that correction before seeing your message. I always get thrown off by the semantics of num.standby.replicas. We should be sure that num.standby.replicas = 0 is a valid value, though.

@apurvam
Contributor

apurvam commented Oct 23, 2019

I still think this decision should be based on whether it's okay to increase the footprint for existing deployments. Nonetheless, I concede we don't have this data, and I think most of the confusion here is around which use case to optimize for.

I think for prod deployments, it should be OK to default to increasing footprint for better availability. It is a config, and we have an opinion for the default. If people want to trade off availability for cost, they have the lever to pull.

@mjsax
Member

mjsax commented Oct 24, 2019

We should be sure that num.standby.replicas = 0 is a valid value, though.

Zero is the default and thus valid: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/StreamsConfig.java#L583-L587

In fact, for a dev properties file, the parameter could be omitted entirely.
