
Use of pzoo and zoo as persistent/ephemeral storage nodes #123

Closed
emedina opened this issue Jan 10, 2018 · 10 comments

@emedina

emedina commented Jan 10, 2018

Hi,

In Zookeeper we have the notion of persistent/ephemeral nodes, but I'm struggling to understand why these concepts have been used here in terms of persistent volumes in K8s.

Can someone elaborate a bit further on what the objectives are for this intentional configuration?

Thanks.

@solsson
Contributor

solsson commented Jan 10, 2018

It was introduced in #34 and discussed in #26 (comment).

The case for this has weakened though, with increased support for dynamic volume provisioning across different Kubernetes setups, and with this setup being used for heavier workloads. I'd prefer it if the two statefulsets could simply be scaled up and down individually. For example, if you're in a single zone you don't have the volume portability issue. In a setup like #118 with local volumes, however, it's quite difficult to ensure quorum survives a single node failure.

Unfortunately Zookeeper's configuration is static prior to 3.5, which is still in development. Adapting to the initial scale would be doable, I think. For example, the init script could use the Kubernetes API to read the desired number of replicas for both StatefulSets and generate the server.X strings accordingly.
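
A minimal sketch of that idea, assuming the StatefulSets are named pzoo and zoo in the kafka namespace and that the generated lines are appended to the pod's zoo.cfg (the path and names are assumptions, not the current init script):

    # Read the desired replica counts and emit one server.X line per pod.
    ZOOCFG=/etc/zookeeper/zoo.cfg   # assumed location of the rendered config
    PZOO=$(kubectl -n kafka get statefulset pzoo -o=jsonpath='{.spec.replicas}')
    ZOO=$(kubectl -n kafka get statefulset zoo -o=jsonpath='{.spec.replicas}')
    ID=1
    for i in $(seq 0 $(( PZOO - 1 ))); do
      echo "server.$ID=pzoo-$i.pzoo:2888:3888:participant" >> "$ZOOCFG"
      ID=$(( ID + 1 ))
    done
    for i in $(seq 0 $(( ZOO - 1 ))); do
      echo "server.$ID=zoo-$i.zoo:2888:3888:participant" >> "$ZOOCFG"
      ID=$(( ID + 1 ))
    done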

@thoslin

thoslin commented Jan 11, 2018

Hi @solsson. I have the same confusion. I've read your comment on the other issue, but it didn't help me understand why two nodes use emptyDir rather than persistent volumes. Could you elaborate a little more on the scenarios where this is useful? How does it compare to using persistent volumes for all 5 nodes? I'm running my Kubernetes cluster on AWS, with 6 worker nodes spread across 3 availability zones. Thanks.

@solsson
Contributor

solsson commented Jan 11, 2018

Good that you question this. The complexity should be removed if it can't be motivated. I'm certainly prepared to switch to all-persistent Zookeeper.

The design goal was to make the persistence layer as robust as the services layer. Probably not as robust as bucket stores or 3rd-party hosted databases, but the same uptime as your frontend is good enough.

Thus workloads will have to migrate when availability zones are lost, as non-stateful apps certainly will with Kubernetes. I recall https://medium.com/spire-labs/mitigating-an-aws-instance-failure-with-the-magic-of-kubernetes-128a44d44c14 and its "sense of awe watching the automatic mitigation".

Unless you have a volume type that can migrate, the problem is that stateful pods will only start in the zone where their volume was provisioned. With either a 5- or a 7-node zk ensemble across 3 zones, if a zone holding 2 or 3 zk pods respectively goes out, you're one pod away from losing your zk majority. My assumption is that a lost majority means your service goes down. A zone outage can be extensive, as in the AWS case above, and due to zk's static configuration you can't reconfigure to adapt to the situation, as that would cause the -1.

With kafka brokers you can throw money at the problem: increase your replication factor. With zk you can't. Or maybe you can, with scale=9?
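
A back-of-envelope check of that, assuming pods are spread as evenly as possible across 3 zones and quorum is floor(n/2)+1:

    for n in 3 5 7 9; do
      quorum=$(( n / 2 + 1 ))          # majority needed to stay up
      biggest_zone=$(( (n + 2) / 3 ))  # pods lost if the fullest zone goes out
      left=$(( n - biggest_zone ))
      echo "ensemble=$n quorum=$quorum after_zone_loss=$left margin=$(( left - quorum ))"
    done

With 3, 5 or 7 servers the margin after a zone loss is 0, i.e. one more pod failure loses the majority; 9 servers leave a margin of 1.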

@shrinandj

@solsson I've tried to rephrase the reason for having pzoo and zoo below. Let me know what you think:

AFAICT, there are at least two types of failures for which there should be some protection.

  • Software errors: This is where something goes wrong with a Zookeeper pod that results in it going down. There is nothing wrong with the underlying infrastructure.

  • Infra errors: Underlying AWS/cloud infrastructure went down.

If there are 3 AZs, the 5 ZK pods are spread across those 3 AZs. If an AZ goes down, there is little benefit in having 5 ZK pods, since losing that AZ can take 2 ZK pods with it. The ZK cluster is then 1 more failure away from being unavailable. The situation would be the same if there were only 3 ZK pods and 1 AZ went down.

However, for software errors each pod can go down by itself, and having 5 ZK nodes helps because the ensemble can tolerate 2 individual pod failures (instead of 1 in the 3-ZK case).

While having only 3 EBS volumes instead of 5 does keep costs low, to avoid confusion, it would be better to have a single statefulset of pzoo with 5 nodes.

@solsson
Contributor

solsson commented Feb 8, 2018

While having only 3 EBS volumes instead of 5 does keep costs low, to avoid confusion, it would be better to have a single statefulset of pzoo with 5 nodes.

@shrinandj I think I agree at this stage. What would be even better, particularly now that support for automatic volume provisioning can be expected (unlike in the k8s 1.2 days), would be to support scaling of the zookeeper statefulset(s). That way everyone can decide for themselves, and we can default to 5 persistent pods. It should be quite doable in the init script, by retrieving the desired number of replicas with kubectl. I'd be happy to accept PRs for such things.

@shrinandj

Can you elaborate a bit on that?

  • The default will be a statefulset with 5 pods.
  • Users can scale this up if needed by simply increasing the number from 5 to whatever using kubectl scale statefulsets pzoo --replicas=<new-replicas>. This should create the new PVCs and then run the pods.
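
A hedged example of that flow (the namespace is an assumption):

    kubectl -n kafka scale statefulset pzoo --replicas=7
    # each new ordinal should get its own PVC from the volumeClaimTemplates
    kubectl -n kafka get pvc
    kubectl -n kafka get pods -w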

What changes are required in the init script?

@solsson
Contributor

solsson commented Feb 8, 2018

Sounds like a good summary, and my ideas for how are sketchy at best. Sadly(?) this repo has come of age already and needs to consider backwards compatibility. Hence we might want a multi-step solution:

  1. Add volume claims to the zoo statefulset, keep the init script as is.
  2. Add an ezoo (ephemeral) statefulset as a copy of the "old" zoo, for the multi-zone frugal use case, but with replicas=0.
  3. Include the above in a kubernetes-kafka release.
  4. Add a branch (for evaluation by those who dare) that generates the server entries based on kubectl -n kafka get statefulset zoo -o=jsonpath='{.status.replicas}' (and equivalent for pzoo - deprecated - and ezoo); see the sketch after this list.
  5. If this is looking good, change defaults to replicas=5 for zoo and replicas=0 for pzoo+ezoo, with a documented migration procedure in release notes.
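
For step 4, a rough sketch extending the earlier idea to all three statefulsets (the names, the fallback to 0 for a missing statefulset, and the <set>-<ordinal>.<set> DNS pattern are assumptions):

    # Build server.X entries from the current replica counts of the three sets.
    replicas() {
      kubectl -n kafka get statefulset "$1" -o=jsonpath='{.status.replicas}' 2>/dev/null || echo 0
    }
    ID=1
    for SET in pzoo zoo ezoo; do
      N=$(replicas "$SET")
      for i in $(seq 0 $(( ${N:-0} - 1 ))); do
        echo "server.$ID=$SET-$i.$SET:2888:3888:participant"
        ID=$(( ID + 1 ))
      done
    done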

@AndresPineros

@solsson I understand that the steps mentioned above are needed for backwards compatibility, but if I just want 5 pzoos I only need to change the replicas to 5 and remove the zoo statefulset, right?

@solsson
Contributor

solsson commented Sep 5, 2018

@AndresPineros You'll also need to change the server.4 and server.5 lines in 10zookeeper-config.yml and prepend the p.

@solsson
Contributor

solsson commented Nov 28, 2018

See #191 (comment) for the suggested way forward.
