Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chart refactor #3

Merged
merged 4 commits into from
Mar 5, 2023
Merged

Chart refactor #3

merged 4 commits into from
Mar 5, 2023

Conversation

FedeBev
Copy link
Contributor

@FedeBev FedeBev commented Jun 14, 2021

Since there's still an open issue in the main repo and I've the feeling this chart has been dropped, I've decided to rework it completely.

The main differences are:

  • different values for server and agent
  • HPA
  • it works (seriously, the original didn't work for me because of this issue)
  • server is deployed using a StatefulSet, allowing the usage of PV (through volume claim template), resolving support K8s pvc #2
  • a custom image can be specified, useful when an image with a custom plugin should be used
  • I've no longer seen Auto-Join issues with Kubernetes #1, maybe just because of dkron upgrade (3.1.6)

Known issues:

Let me know what are your thoughts, I'm open to feedback available to make any changes

@vcastellm
Copy link
Member

Sorry for the mega-delay here @FedeBev, thank you very much for the work! I had not prioritized the K8s work until now, going to test this and merge if it works.

Copy link
Member

@vcastellm vcastellm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the nitpicks, it LGTM but I have been unable to do a rolling restart without breaking the cluster, what's your strategy?

dkron/templates/envsConfigmap.yaml Outdated Show resolved Hide resolved
dkron/templates/serviceaccount.yaml Outdated Show resolved Hide resolved
@FedeBev
Copy link
Contributor Author

FedeBev commented Nov 9, 2021

What kind of problem are you facing @Victorcoder? I just tried a rolling update from version v3.1.8 to v3.1.10 and everything went good except for a slight leader election delay.

Do you mind sharing your values and some details about your cluster?

I've just fixed the missing newlines and cleaned up the sample values file

@vcastellm
Copy link
Member

I just do a helm install helm upgrade and the rolling is too fast for the leader election to happen correctly, when done rolling there's no leader in the cluster, all nodes ends up as followers.

Are you tweeking some other options?

@FedeBev
Copy link
Contributor Author

FedeBev commented Nov 24, 2021

No additional tweek on my configuration.

This also happens to me, but after a few seconds after the end of the rollout a new leader is elected again and everything goes fine.

We can try to tune the rollout strategy, but taking into account the issue about the kubernetes discovery, I don't think we can achieve something better with helm. An operator would solve the issue, but it's a huge effort.

@vcastellm
Copy link
Member

@FedeBev what about a liveness, readiness check?

@FedeBev
Copy link
Contributor Author

FedeBev commented Nov 25, 2021

@vcastellm in the issue about the kubernetes discovery I've mentioned before, you can find why I wasn't able to use those.

I don't see any other possible way right now. I guess we should change something about how dkron starts or request a PR to the kubernetes discovery library.
In my opinion, the point is that dkron should be able to discover itself when the pod is running (/health returns 200), NOT when is ready. This way the k8s discovery works and we can implement a useful /health and ready probe.

@vcastellm
Copy link
Member

Understood, thanks, I will take a look on how we can improve the /health endpoint

@FedeBev
Copy link
Contributor Author

FedeBev commented Nov 25, 2021

Keep me posted, I'd be glad to help.
Dkron could become a core component for a project I'm working on however our platform relies entirely on K8s and this is a big issue

@Espina2
Copy link

Espina2 commented Dec 5, 2021

I'm also very interested in this fix. @vcastellm @FedeBev anything that I can do to make this happen? Also, is this not better than the currently available chart?

@FedeBev
Copy link
Contributor Author

FedeBev commented Dec 7, 2021

@Espina2 the chart is working on my cluster in a dev environment, but it's still very young and there's still a long way to go before being production ready.
There's an issue with the k8s discovery that makes the dkron cluster unavailable due to master election during the rolling upgrade.

* Add Server Pod Disruption Budget

* Add missing newline at EOF

* Add preStop command to mitigate raft peers total caos in Server StatefulSet

Co-authored-by: Alberto Pinardi <alberto.pinardi@criptalia.it>
@omarzouk
Copy link

omarzouk commented Mar 1, 2023

hello! also highly interested in this. Has anyone come up with a way to work around the discovery issue? any thoughts on how to proceed?

@vcastellm vcastellm merged commit 873d8b9 into distribworks:master Mar 5, 2023
@vcastellm
Copy link
Member

Merged this PR and added some changes, I'll take over from here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants