
perf: enabled streams standby replicas by default #3641

Closed

Conversation

vinothchandar
Contributor

@vinothchandar vinothchandar commented Oct 22, 2019

Description

BREAKING CHANGE: existing multi-instance ksql deployments could see increased disk usage due to additional standby state replication, with the benefit of improved state availability for pull queries. By default, we tolerate 1 instance failure; there is no effect on single-instance deployments.
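For context, here is a minimal sketch of what this default amounts to at the Kafka Streams level. `StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG` is the real Streams setting being enabled; the wrapper class and method names below are illustrative only, not the PR's actual code:

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

/**
 * Minimal sketch (hypothetical helper, not the PR's code): apply a standby-replica
 * default to the Kafka Streams properties that back each persistent query, while
 * still letting an explicit user setting win.
 */
public class StandbyDefaultSketch {

    public static Properties withStandbyDefault(final Properties userProps) {
        final Properties props = new Properties();
        props.putAll(userProps);
        // num.standby.replicas counts extra copies on top of the active task, so a
        // value of 1 keeps one warm copy of each state store on another instance
        // and tolerates a single instance failure without a full state rebuild.
        props.putIfAbsent(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        return props;
    }
}
```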

Testing done

Unit test added. Tested on my box by bringing up the ksql server.

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@vinothchandar vinothchandar requested a review from a team as a code owner October 22, 2019 16:42
@vinothchandar
Contributor Author

vinothchandar commented Oct 22, 2019

@derekjn @mjsax we can continue the discussion here on choosing between the approaches below:

a) Land this change and enable it by default. Existing deployments will see additional resource consumption, but we can message this along with other changes we are making.
b) Just message this along with other streams configuration recommendations. If users simply upgrade ksql and try pull queries, they may have a bad experience.

None of this matters for single-node quickstart/docker playground use cases.

Member

@mjsax mjsax left a comment

LGTM.

Contributor

@agavra agavra left a comment

LGTM - please get a +1 from one of the cloud folk (@vcrfxia @stevenpyzhang @spena)

@vinothchandar
Contributor Author

Capturing the conversation with @derekjn here. IMO we ideally should have enabled standby replication in KStreams by default, since the KStreams design does not rely on checkpointing to recover state upon instance failures.

Given the changes we are making to KSQL anyway, it seems prudent to do this now, so that even pull/streaming queries will see meaningful speedups in recovery times.

Contributor

@vcrfxia vcrfxia left a comment

LGTM. cc @rodesai in case we want to override this default in cloud later. Using the default is definitely fine for now.

@blueedgenick
Contributor

blueedgenick commented Oct 23, 2019 via email

@derekjn
Contributor

derekjn commented Oct 23, 2019

would be to stick with what we have (where the default is for no standby replicas) and we revisit #817 - basically we should provide an entire properties file which is tuned for "resilient, production deployments".

@blueedgenick this makes a lot of sense to me. I think it's very reasonable to expect users to explicitly create a production configuration file (perhaps using our suggested configuration options as guidance), especially considering that KSQL runs as an actual server. This is fairly standard practice IMO, and I don't believe that users expect the OOTB configuration to be suitable for production.

@vinothchandar
Contributor Author

Thanks for all the feedback. The answer really depends on how you look at ksql: as a streaming ETL tool, or as an end-to-end database with pull queries as the mainstay.

should be suitable for dev or test or prod or some other flavor of prod :) e.g. for some large-scale app use cases, it makes very little sense to double the resource usage

Let's try to break this down :)

  • For dev/test where you have a single instance, this is a no-op. So we are good.
  • For multi-node deployments with small-scale state, the overhead is small anyway.
  • For large-scale state (defined as something that is a few GBs in size), I have to disagree with you. This is where the standbys really matter. There are numerous Streams tickets on this very issue where the topology stays in a recovering state for hours. For point/pull queries, I think it will be a very frustrating experience to wait tens of seconds whenever nodes go down. Other systems that checkpoint state by other means, such as cloud storage/HDFS, do pay the additional cost for drastically better availability.

The prod config file is a great idea; I had them on my previous project(s) too. I did not realize ksql does not have one yet. I can pick up #817 if there are no takers yet.

On this PR: if the argument is that this will be disruptive to existing KSQL deployments and that's not okay, I'm happy to close this and do (some form of) #817 instead, as it relates to pull queries.

@vinothchandar
Contributor Author

@apurvam there are trade-offs here. WDYT?

@apurvam
Contributor

apurvam commented Oct 23, 2019

I’d be in favor of having a set of recommended production configs in a clearly named properties file. This file can include this PR plus everything from https://docs.confluent.io/current/ksql/docs/installation/server-config/config-reference.html#ksql-production-settings.

The default can remain optimized for development purposes.

That said, since you will never have a standby replica on the same instance (so num.standby <= num instances), this default is fairly harmless. Philosophically, num.standby = 3 belongs in a production file and num.standby = 1 can be explicitly put in the development file.

@vinothchandar
Contributor Author

I still think this decision should be based on whether it's okay to increase the footprint for existing deployments. Nonetheless, I concede we don't have this data, and I think most of the confusion here is around which use case to optimize for.

So I'll close this and make a PR for #817 instead. Thanks everyone!

@mjsax
Member

mjsax commented Oct 23, 2019

Philosophically, num.standby = 3 belongs in a production file and num.standby = 1 can be explicitly put in the development file.

@apurvam -- for dev the config should be num.standby = 0, and for prod maybe num.standby = 2 (note that the config is different from the topic replication factor, as it excludes the active task).
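To make the distinction above concrete, here is a quick worked example (my own arithmetic, not from the thread; the class is purely illustrative):

```java
public class StandbyCopyCount {

    public static void main(final String[] args) {
        // num.standby.replicas counts only the extra copies, not the active task.
        final int numStandbyReplicas = 1;                // the default this PR proposes
        final int totalCopies = 1 + numStandbyReplicas;  // 2 copies -> tolerates 1 instance failure
        // A topic replication.factor of 3 already counts the leader, so the
        // comparable Streams setting would be num.standby.replicas = 2, not 3.
        System.out.println("Copies of each state store: " + totalCopies);
    }
}
```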

@apurvam
Contributor

apurvam commented Oct 23, 2019

Thanks @mjsax, yeah, I came back to make that correction before seeing your message. I always get thrown off by the semantics of num.standby.replicas. We should be sure that num.standby.replicas = 0 is a valid value, though.

@apurvam
Contributor

apurvam commented Oct 23, 2019

I still think this decision should be based on whether it's okay to increase the footprint for existing deployments. Nonetheless, I concede we don't have this data, and I think most of the confusion here is around which use case to optimize for.

I think for prod deployments, it should be OK to default to increasing footprint for better availability. It is a config, and we have an opinion for the default. If people want to trade off availability for cost, they have the lever to pull.

@mjsax
Member

mjsax commented Oct 24, 2019

We should be sure that num.standby.replicas = 0 is a valid value, though.

Zero is the default and thus valid: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/StreamsConfig.java#L583-L587

In fact, for a dev properties file, the parameter could be omitted entirely.
