Monitoring ML Workflows using Kubeflow on Amazon EKS #202

Merged
merged 21 commits into from
May 18, 2023

elamaran11
Contributor

What does this PR do?

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.

As part of day 2 operations, customers want to monitor their infrastructure, Amazon EKS clusters, and application components. AWS customers use Amazon EKS to run machine learning workloads. Containerization allows machine learning engineers to package and distribute models easily, while Kubernetes helps in deploying, scaling, and improving them. In addition to monitoring the behavior of the Amazon EKS clusters, it's essential to monitor the behavior of machine learning workflows as well, to ensure the operational resilience of the workloads and platforms an organization runs.

Kubeflow is an open-source machine learning (ML) platform dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. Kubeflow provides many components, including a central dashboard, multi-user Jupyter notebooks, Kubeflow Pipelines, KFServing, and Katib, as well as distributed training operators for TensorFlow, PyTorch, MXNet, and XGBoost. Kubeflow components export metrics that provide insight into the health and function of Kubeflow on Amazon Elastic Kubernetes Service (Amazon EKS).

Motivation

Idea on Monitoring ML Workflows using Kubeflow on EKS

More

  • [Y] Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • [NA] Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • [Y] Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

@elamaran11 elamaran11 requested a review from a team as a code owner May 16, 2023 01:26
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:26 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:27 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:28 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:29 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:31 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:33 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:35 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:39 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 01:46 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 14:34 — with GitHub Actions Inactive
@elamaran11 elamaran11 temporarily deployed to DoEKS Test May 16, 2023 15:05 — with GitHub Actions Inactive
Contributor

@vara-bonthu vara-bonthu left a comment

LGTM 👍🏼 Thanks @elamaran11

@elamaran11
Contributor Author

LGTM 👍🏼 Thanks @elamaran11

Thank you @vara-bonthu

@vara-bonthu vara-bonthu merged commit 1511228 into awslabs:main May 18, 2023
25 checks passed
@elamaran11 elamaran11 added the enhancement New feature or request label May 18, 2023