Controller clearing status from Mig objects (possibly RBAC related) #275

Closed
eriknelson opened this issue Aug 16, 2019 · 4 comments


@eriknelson
Contributor

Several of us have hit this issue over the last couple of days (@pranavgaikwad, myself, and just now @jwmatthews). The UI appears to lock up during PV discovery, validation, or the connection check on clusters or storage. All of these operations rely on the controller updating the status, or on a timeout. On closer inspection, the Mig objects either 1) never get an initial status, or 2) have their existing status wiped, so the object no longer has a status at all. Which of the two you see depends on when the issue strikes. Once the status is missing, the controller pod logs show the following RBAC error:

E0816 19:01:01.845757       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:196: Failed to list *v1.Deployment: deployments.apps is forbidden: User "system:serviceaccount:mig:default" cannot list resource "deployments" in API group "apps" at the cluster scope
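
One way to confirm that this is a permissions problem is to ask the API server whether the service account named in the error can perform the denied action. The sketch below pulls the user, verb, resource, and API group out of a "forbidden" log line and prints the matching `oc auth can-i` command. The function name `forbidden_to_can_i` is made up for illustration, and the pattern assumes the exact message wording shown in the log line above.

```shell
# Hypothetical helper: read controller log lines on stdin and print one
# `oc auth can-i` command per cluster-scope "forbidden" error.
# Assumes the message wording from the log line above. It does not handle
# the core API group, where the group name in the message is empty.
forbidden_to_can_i() {
  sed -n 's/.*User "\([^"]*\)" cannot \([a-z]*\) resource "\([^"]*\)" in API group "\([^"]*\)" at the cluster scope.*/oc auth can-i \2 \3.\4 --as=\1 --all-namespaces/p'
}
```

Piping the controller pod's `oc logs` output into `forbidden_to_can_i` should print a command for each denied list or watch. Running that command while the problem is present, and again on a healthy cluster, would show whether the permission actually changed.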

I just helped John debug his cluster, which had been up and working about 20 minutes before this appeared. Here are his controller logs: https://gist.github.com/fefcf7f47d8c9b7e1571b3165cbfa9dd

When I helped Pranav, I just gave his service account cluster-admin, which let the controller do everything it needed, so the problem went away.
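
The exact command used for that workaround is not recorded here. A minimal sketch of one common way to do it follows, assuming the namespace `mig` and the service account `default` taken from the error message above; the function name is hypothetical.

```shell
# Workaround sketch only: cluster-admin is far broader than the controller
# needs, so this hides the RBAC gap rather than fixing it.
#   $1 namespace of the service account (assumed "mig", per the error message)
#   $2 service account name (assumed "default", per the error message)
grant_cluster_admin_to_sa() {
  ns="$1"; sa="$2"
  oc adm policy add-cluster-role-to-user cluster-admin -z "$sa" -n "$ns"
}
```

Calling `grant_cluster_admin_to_sa mig default` would bind the role to `system:serviceaccount:mig:default`. This is only reasonable on a throwaway debugging cluster.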

The odd thing is that this seems to appear some time after everything has been working fine. If this were simply a misconfigured role, I would expect nothing to work from the initial deployment.

@jwmatthews
Contributor

Adding some background from my usage.
I deployed a cluster on Thursday evening, 8/15, around 6pm.
I tested multiple migrations of the same mssql-persistent namespace.
After each migration I would:

  • Delete the migmigrations
  • Delete the migplan
  • Delete the mssql-persistent namespace from destination
  • Scale the app back up on source cluster
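
The four steps above can be sketched as a single helper. Everything passed in as an argument is an assumption rather than something stated in this thread: the `--context` names for the three clusters, the MigPlan name, the workload to scale, and the replica count. The `mig` namespace for the Mig CRs is also assumed, and deleting all MigMigrations with `--all` is one possible reading of the first step.

```shell
# Hypothetical helper mirroring the four manual steps above.
#   $1 context of the cluster holding the Mig CRs
#   $2 context of the destination cluster
#   $3 context of the source cluster
#   $4 MigPlan name
#   $5 workload to scale back up, given as kind/name
#   $6 replica count to restore
reset_after_migration() {
  ctrl_ctx="$1"; dest_ctx="$2"; src_ctx="$3"
  plan="$4"; workload="$5"; replicas="$6"
  mig_ns="mig"                # assumed namespace for the Mig CRs
  app_ns="mssql-persistent"   # namespace named in this comment

  # Stop at the first failing step so later steps do not run on a bad state.
  oc --context "$ctrl_ctx" delete migmigration --all -n "$mig_ns" &&
  oc --context "$ctrl_ctx" delete migplan "$plan" -n "$mig_ns" &&
  oc --context "$dest_ctx" delete namespace "$app_ns" &&
  oc --context "$src_ctx" scale "$workload" --replicas="$replicas" -n "$app_ns"
}
```

Like the manual steps, this leaves the Velero Backup and Restore CRs and the registry pods in place.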

I did not do much else and left the Velero Backup and Restore CRs in place.
I also was not closing the plans.

I used the cluster for about 5 migrations that evening, and they all worked fine.
On Friday morning, 8/16, I saw some odd failures when a migration reused the name of an earlier MigPlan. Even though I had deleted the MigPlan and MigMigrations, sometimes I could reuse the same name and sometimes I couldn't. I did see that the registry pods were still present, since I hadn't finalized the plans.

I manually deleted a few of the registry pods.

I still had a working setup for migrations at this point: I was changing the name of the MigPlan, and migrations were going through.

The last successful migration on 8/16 was around 1:40pm.
I attempted to demo the functionality at around 2:30pm on 8/16.

The first thing I saw was that the UI was having issues with check connection, as if it couldn't talk to the backend.

I attempted to walk through the wizard and got stuck at PV Discovery, as Erik mentioned.

Grabbing some more info below:
https://gist.github.com/jwmatthews/2531d26af8475617c377fa71cf0d569f

$ oc describe clusterrole.rbac &> clusterrole.rbac.logs
https://gist.github.com/jwmatthews/0e2ab32682dfda0d8e4e48c1d4dc1d20

@jwmatthews
Contributor

Below is from a cluster I recently provisioned; I installed mig-operator about 10 minutes ago.
I'm about to do some migrations, so I'm grabbing a view of the clusterrole.rbac info in case we want to compare the initial state to a later state.

$ oc describe clusterrole.rbac &> clusterrole.rbac_0817cluster.logs

$ gist clusterrole.rbac_0817cluster.logs
https://gist.github.com/b96602cc1679838cd53e9a4277881ed9
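
If the issue shows up again on this cluster, a later snapshot can be compared against this one. A minimal sketch, with hypothetical function names and file paths chosen by the caller:

```shell
# Hypothetical helpers for capturing and comparing ClusterRole snapshots.

# Save the same `oc describe` output as above to the file named in $1.
snapshot_clusterroles() {
  oc describe clusterrole.rbac > "$1" 2>&1
}

# Print a unified diff of two saved snapshots.
# Exit status follows diff(1): 0 if identical, 1 if they differ.
compare_rbac_snapshots() {
  diff -u "$1" "$2"
}
```

Any ClusterRole rule that was added, removed, or rewritten between the two captures would show up as `+` or `-` lines in the diff.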

@jortel
Contributor

jortel commented Aug 23, 2019

This has been fixed by: migtools/mig-operator#40, right?

@jwmatthews
Contributor

I'm OK with closing this for now. If we see it again, we can re-open.
