[0.1.6] Deploying argoflow-aws #227

jai · 2021-09-14T07:19:33Z

We're setting up Kubeflow (argoflow-aws) from scratch, including the infrastructure and hit some stumbling blocks along the way. Wanted to document them all here (for now) and address as needed with PRs etc.

I realize that #84 exists, happy to merge into there but I'm not sure that issue deals with the specific 0.1.6 tag. That might be part of my issue as well since some things are more up-to-date on the master branch.

Current issues (can be triaged and split into separate issues or merged into existing issues)

❌ OPEN ISSUES

These are mainly based off of broken functionality or application statuses in ArgoCD

`knative`

Impact: Low
ArgoCD resources out of sync (MutatingWebhookConfiguration and ValidatingWebhookConfiguration)
Auto Sync currently turned off to debug
Details: [0.1.6] Deploying argoflow-aws #227 (comment)

`mpi-operator` (https://github.com/kubeflow/mpi-operator)

Impact: Low (not used by our org)
Related to
- Update helm chart to use the v2 controller kubeflow/mpi-operator#412
- Fix MPI operator error by syncing image/manifest versions to v0.3.0 #226
What mpi-operator does:

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.

Crashes

Logs

flag provided but not defined: -kubectl-delivery-image
Usage of /opt/mpi-operator:
  -add_dir_header
    	If true, adds the file directory to the header
...

`aws-eks-resources`

Impact: Low
ArgoCD resources out of sync (probably needs ignoreDifferences)
Auto Sync currently turned off to debug

✅ SOLVED ISSUES

[✅ SOLVED] `oauth2-proxy`

Impact: Unknown
Problem - CreateContainerConfigError: secret "oauth2-proxy" not found
- Solution - Secrets need to be manually updated in AWS Secrets Manager for oauth2-proxy:
  - client-id and client-secret (GCP link)
  - cookie-secret (generated by Terraform - see the kubeflow_oidc_cookie_secret output variable)
Problem - cannot contact the Redis cluster
- Solution - Redis cluster needs to be in the same VPC Security Group as the EKS cluster (see https://github.com/honestbank/argoflow-aws-infrastructure/issues/2)

[✅ SOLVED] `pipelines`

Impact: High
Crash Loop
Logs:

F0914 02:03:01.977497       7 main.go:240] Fatal error config file: While parsing config: invalid character 't' after object key:value pair

Solution - values in setup.conf must NOT be quoted

[✅ SOLVED] `aws-load-balancer-controller`

Impact: High
Blocks accessing UI/dashboard
Load Balancer isn't being created, logs:

2021/09/14 09:46:15 http: TLS handshake error from 172.31.39.152:54030: remote error: tls: bad certificate

{"level":"error","ts":1631613104.4709718,"logger":"controller","msg":"Reconciler error","controller":"service","name":"istio-ingressgateway","namespace":"istio-system","error":"Internal error occurred: failed calling webhook \"mtargetgroupbinding.elbv2.k8s.aws\": Post \"https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"aws-load-balancer-controller-ca\")"}

Solution:
- The EKS subnets need to be tagged correctly (see https://github.com/honestbank/argoflow-aws-infrastructure/issues/3)
- Deleted and re-sync'ed the ArgoCD Application. There was most likely some kind of race condition/invalid value from the first installation of argoflow-aws.

[✅ SOLVED] Central Dashboard

Impact: High
Can't access any applications - dashboard 404s for all sub-apps
Related:
- Error: "Sorry, /jupyter/ is not a valid page" kubeflow/kubeflow#3615
- Nested menu - "Sorry, /<endpoint>/ is not a valid page" kubeflow/kubeflow#6080
Logs:

> kubeflow-centraldashboard@0.0.2 start /app
> npm run serve


> kubeflow-centraldashboard@0.0.2 serve /app
> node dist/server.js

Initializing Kubernetes configuration
Unable to fetch Application information: 404 page not found

"aws" is not a supported platform for Metrics
Using Profiles service at http://profiles-kfam.kubeflow:8081/kfam
Server listening on port http://localhost:8082 (in production mode)
Unable to fetch Application information: 404 page not found
2021-09-14T02:39:12.655692792Z

Update - seems we shouldn't port-forward into the dashboard. However aws-load-balancer-controller has an issue (see below)
Solution: the dashboard cannot be accessed using kubectl port-forward but rather needs to be accessed through the proper URL of <<__subdomain_dashboard__>.<<__domain__>>

[✅ SOLVED] `kube-prometheus-stack`

Impact: Low
kube-prometheus-stack-grafana ConfigMap and Secret are going out of sync (in ArgoCD), which causes checksums in the Deployment to go out of sync as well
Was an issue on v0.1.6, resolved by deploying master (b90cb8a)

The text was updated successfully, but these errors were encountered:

EKami · 2021-09-19T13:05:42Z

Wow, thanks a lot for this! Very helpful!

EKami · 2021-09-19T13:14:21Z

I would also add:
Ext-dns record not created in route53:

Delete the istio app in the argocd dashboard, it will recreate the resource and update the DNS entries.
The deployment of ext-dns must happen in this order:
istio-operator-> external-dns -> istio-resources -> istio
to properly update the DNS entries.

Might be possible to fix with:

  annotations:
    argocd.argoproj.io/sync-wave: "2"

EKami · 2021-09-19T21:48:37Z

Any idea why knative is not synchronizing properly?

jai · 2021-09-21T10:20:56Z

I would also add:
Ext-dns record not created in route53:

Delete the istio app in the argocd dashboard, it will recreate the resource and update the DNS entries.
The deployment of ext-dns must happen in this order:
istio-operator-> external-dns -> istio-resources -> istio
to properly update the DNS entries.

Might be possible to fix with:
  annotations:
    argocd.argoproj.io/sync-wave: "2"

yes there are a few applications/scenarios that need to happen in the correct order it seems. One is definitely the external-dns - I am trying to be diligent and write down the others as I see them but it's super hard to make sure of one order of events given the sheer number of ArgoCD Applications!

jai · 2021-09-21T10:29:23Z

Any idea why knative is not synchronizing properly?

There are a few apps that are fighting with K8s - fields going out of sync - I had this with the Knative install in our regular compute cluster too. Below is the ignoreDifferences we use for our Knative install:

spec:
  ignoreDifferences:
  - group: rbac.authorization.k8s.io
    kind: ClusterRole
    jsonPointers:
    - /rules
  - group: admissionregistration.k8s.io
    kind: ValidatingWebhookConfiguration
    jsonPointers:
    - /webhooks/0/rules
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    jsonPointers:
    - /webhooks/0/rules

The argoflow-aws Knative ArgoCD Application is going out of sync on the following objects:

MutatingWebhookConfiguration
- webhook.domainmapping.serving.knative.dev - webhooks.0.rules.0.resources.1 (domainmappings/status)

webhook.serving.knative.dev

ValidatingWebhookConfiguration
- validation.webhook.domainmapping.serving.knative.dev

validation.webhook.serving.knative.dev

ClusterRole
- knative-serving-admin

knative-serving-aggregated-addressable-resolver

Am I the only one having these go out of sync? This isn't the only app - have a few of them, will post the list.

davidspek · 2021-09-21T10:35:46Z

@jai Thanks for the very detailed issue thread you've started here. Sadly I haven't had much time to dedicate to the ArgoFlow repositories since starting my new job. However, there are a lot of very big Kubeflow improvements I'm working on. Basically it's a completely redesigned architecture that simplifies Kubeflow and adds better security and more advanced features around User/Group/Project management.

Regarding the KNative manifests, they are quite a pain, especially with Kustomize. I've got a Helm chart that should be usable instead, that would should get rid of this continuous syncing issue. Would you like to help move the KNative deployment over to Helm? If so, I can clean up the chart a little bit and add it to a registry for you to depend on.

jai · 2021-09-21T10:36:14Z

ArgoCD Applications that are flip-flopping - not sure what the technical term is. Basically ArgoCD installs one manifest the the cluster seems to override some values, causing an update tug-of-war kind of thing. I will post details of which resources are causing this:

aws-eks-resources
istio-resources
kfserving
knative
notebook-controller
pipelines
pod-defaults
roles

jai · 2021-09-21T10:39:37Z

@jai Thanks for the very detailed issue thread you've started here. Sadly I haven't had much time to dedicate to the ArgoFlow repositories since starting my new job. However, there are a lot of very big Kubeflow improvements I'm working on. Basically it's a completely redesigned architecture that simplifies Kubeflow and adds better security and more advanced features around User/Group/Project management.

Regarding the KNative manifests, they are quite a pain, especially with Kustomize. I've got a Helm chart that should be usable instead, that would should get rid of this continuous syncing issue. Would you like to help move the KNative deployment over to Helm? If so, I can clean up the chart a little bit and add it to a registry for you to depend on.

Does argoflow/argoflow-aws use vanilla Knative? If I understand what you're saying, we would have to maintain a Helm repo with the Knative manifests, which sounds like one more thing to maintain. Is there a way we can point it at the Knative Operator and then just install a CRD? I might be way off base since I've only been working with Argoflow/Kubeflow for a couple of weeks 😂

davidspek · 2021-09-21T10:43:57Z

What you're saying is completely correct. The Knative Operator is probably a good fit to reduce the maintenance overhead. However, I haven't yet had time to look into it. The Istio <-> Knative <-> KFServing interplay is very fragile and took a couple weeks to get working properly (which also hasn't been upstreamed yet), so implementing the Knative Operator would need some special attention and testing.

jai · 2021-09-21T12:03:36Z

What you're saying is completely correct. The Knative Operator is probably a good fit to reduce the maintenance overhead. However, I haven't yet had time to look into it. The Istio <-> Knative <-> KFServing interplay is very fragile and took a couple weeks to get working properly (which also hasn't been upstreamed yet), so implementing the Knative Operator would need some special attention and testing.

I'm at an early-stage startup so my availability is super patchy - I wouldn't want to start something and leave it hanging halfway. I will poke around at the KFServing/Knative parts and see what's going on - no promises I can take this on but I will always do what I can!

jai · 2021-09-28T04:20:17Z

Update - also running into this issue: kserve/kserve#848

jai · 2021-10-25T04:49:01Z

Update - I think I've whittled it down to stuff that I think can be addressed by ignoreDifferences in the ArgoCD Application CRD. I'll open a draft PR to see if that's the best way to address these issues or if there's a better way to fix them upstream/in other areas.

jai · 2022-01-19T07:21:46Z

Update - ignoreDifferences is done, I'm currently validating and will submit PRs. Sorry for the long lead time!

jai changed the title ~~[0.1.6] deploying argoflow-aws~~ [0.1.6] Deploying argoflow-aws Sep 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[0.1.6] Deploying argoflow-aws #227

[0.1.6] Deploying argoflow-aws #227

jai commented Sep 14, 2021 •

edited

Loading

EKami commented Sep 19, 2021

EKami commented Sep 19, 2021

EKami commented Sep 19, 2021

jai commented Sep 21, 2021

jai commented Sep 21, 2021 •

edited

Loading

davidspek commented Sep 21, 2021

jai commented Sep 21, 2021 •

edited

Loading

jai commented Sep 21, 2021

davidspek commented Sep 21, 2021

jai commented Sep 21, 2021

jai commented Sep 28, 2021

jai commented Oct 25, 2021

jai commented Jan 19, 2022

[0.1.6] Deploying argoflow-aws #227

[0.1.6] Deploying argoflow-aws #227

Comments

jai commented Sep 14, 2021 • edited Loading

Current issues (can be triaged and split into separate issues or merged into existing issues)

❌ OPEN ISSUES

knative

mpi-operator (https://github.com/kubeflow/mpi-operator)

aws-eks-resources

✅ SOLVED ISSUES

[✅ SOLVED] oauth2-proxy

[✅ SOLVED] pipelines

[✅ SOLVED] aws-load-balancer-controller

[✅ SOLVED] Central Dashboard

[✅ SOLVED] kube-prometheus-stack

EKami commented Sep 19, 2021

EKami commented Sep 19, 2021

EKami commented Sep 19, 2021

jai commented Sep 21, 2021

jai commented Sep 21, 2021 • edited Loading

davidspek commented Sep 21, 2021

jai commented Sep 21, 2021 • edited Loading

jai commented Sep 21, 2021

davidspek commented Sep 21, 2021

jai commented Sep 21, 2021

jai commented Sep 28, 2021

jai commented Oct 25, 2021

jai commented Jan 19, 2022

jai commented Sep 14, 2021 •

edited

Loading

`knative`

`mpi-operator` (https://github.com/kubeflow/mpi-operator)

`aws-eks-resources`

[✅ SOLVED] `oauth2-proxy`

[✅ SOLVED] `pipelines`

[✅ SOLVED] `aws-load-balancer-controller`

[✅ SOLVED] `kube-prometheus-stack`

jai commented Sep 21, 2021 •

edited

Loading

jai commented Sep 21, 2021 •

edited

Loading