
Kubernetes Plugin 500 errors #9942

Closed
asaini11 opened this issue Mar 3, 2022 · 16 comments
Labels
bug (Something isn't working), help wanted (Help/Contributions wanted from community members)

Comments

asaini11 (Contributor) commented Mar 3, 2022

Expected Behavior

The K8s plugin shouldn't show 500 errors in the logs and UI when our config is correct. Alternatively, the error message should disappear from the UI once a subsequent request returns a 200, or not be shown in the UI at all. We know our config is correct because we can see the pod information in the UI.

Actual Behavior

In the UI we can see the K8s plugin load successfully. After approximately 3 minutes we start seeing 500 timeout errors. These are handled by the error handling in code & code.

We would like to understand why this error handling was put in place and why the 500 errors are returned.

When we use the browser's developer tools (F12), we can see the 10-second refresh requests as defined by code. Since these return 200 status codes, we know our config is correct, and we can successfully see pod data in the UI. However, after about 3 minutes we get a 500 error in the logs and the error is shown in the UI. The K8s plugin refresh that runs every 10 seconds usually returns a 200, but the error remains in the UI (only the cluster IP in the message updates after the next 500 error). The 500 error happens roughly every minute, but the error does not disappear from the UI until the page is refreshed.

Despite the error being shown in the UI, and the fact that we also get 200 status codes, the pod information still updates. However, this is not a good user experience for our devs. For example, in the screenshots attached you can see we get a 500 and then a 200 status code; we then get a few 200s until we hit a 500 again. Ideally the UI should remove the error message once a 200 has been received. Because the error remains in the UI, we have to refresh the page roughly every 3 minutes for the errors to go away.

(Screenshots attached: browser network requests showing alternating 500 and 200 status codes.)

Errors in logs:
2022-03-02T09:47:25.428Z kubernetes error action=retrieveObjectsByServiceId service=core, error=Error: connect ETIMEDOUT ourClusterIP:443 type=plugin
(core is one of our services.)

Following this error we see this in the UI: There was a problem retrieving some Kubernetes resources for the entity: core. This could mean that the Error Reporting card is not completely accurate.

Note: after some time we see the following in the logs: 2022-03-02T09:48:00.730Z kubernetes error action=retrieveObjectsByServiceId service=core, error=FetchError: request to https://www.googleapis.com/oauth2/v4/token failed, reason: socket hang up type=plugin, although this shouldn't happen as we are using a Google service account as per Google's docs.

From this error in the logs, we get the following message in the UI: Errors: Request failed with 503 , upstream connect error or disconnect/reset before headers. reset reason: connection termination

Again a refresh in the browser fixes this temporarily.

We have also seen the following error in the logs: kubernetes error action=retrieveObjectsByServiceId service=core, error=HttpError: HTTP request failed type=plugin

Steps to Reproduce

We have configured the K8s plugin following the docs, with the config below:

kubernetes:
  serviceLocatorMethod:
    type: 'multiTenant'
  clusterLocatorMethods:
    - type: 'config'
      clusters:
        - url: https://ourClusterIP
          name: dev
          authProvider: 'googleServiceAccount'
          skipTLSVerify: true
          skipMetricsLookup: false
          dashboardApp: gke
          dashboardParameters:
            projectId: ourProjectId
            region: ourZone
            clusterName: ourClusterName
          caData: ${DEV_K8S_CONFIG_CA_DATA}

where we have mounted DEV_K8S_CONFIG_CA_DATA as a K8s secret and have also defined and mounted GOOGLE_APPLICATION_CREDENTIALS, which points to our GCP service account JSON file. We have 7 clusters (one of which is dev); the other clusters follow a similar config.
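For illustration, a minimal sketch of how these values could be wired into the Backstage Deployment; the secret names, keys, image, and mount path below are hypothetical and not taken from our actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backstage
spec:
  selector:
    matchLabels:
      app: backstage
  template:
    metadata:
      labels:
        app: backstage
    spec:
      containers:
        - name: backstage
          image: our-registry/backstage:latest   # hypothetical image
          env:
            # Consumed by ${DEV_K8S_CONFIG_CA_DATA} in app-config.yaml
            - name: DEV_K8S_CONFIG_CA_DATA
              valueFrom:
                secretKeyRef:
                  name: dev-k8s-config           # hypothetical secret name
                  key: ca-data                   # hypothetical key
            # Points at the mounted GCP service account JSON file
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /var/secrets/google/key.json
          volumeMounts:
            - name: google-sa-key
              mountPath: /var/secrets/google
              readOnly: true
      volumes:
        - name: google-sa-key
          secret:
            secretName: backstage-gcp-sa         # hypothetical secret containing key.json
```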

Context

We would like a smooth developer experience, i.e. devs shouldn't need to keep refreshing the UI to remove the error message.

Your Environment

  • Hosting: We are hosting Backstage in Kubernetes.
  • Backstage version: 0.4.14
  • Related bug on GitHub
  • Browser Information: Chrome
  • Output of yarn backstage-cli info (run locally):
yarn run v1.22.17
OS:   Linux 5.13.0-30-generic - linux/x64
node: v14.18.3
yarn: 1.22.17
cli:  0.12.0 (installed)
Dependencies:
  @backstage/app-defaults                                  0.1.5
  @backstage/backend-common                                0.10.4, 0.10.6
  @backstage/backend-tasks                                 0.1.4
  @backstage/catalog-client                                0.5.5
  @backstage/catalog-model                                 0.9.10
  @backstage/cli-common                                    0.1.6
  @backstage/cli                                           0.12.0
  @backstage/config-loader                                 0.9.3
  @backstage/config                                        0.1.13
  @backstage/core-app-api                                  0.5.0
  @backstage/core-components                               0.8.5, 0.8.8, 0.8.7
  @backstage/core-plugin-api                               0.5.0, 0.6.0
  @backstage/errors                                        0.2.0
  @backstage/integration-react                             0.1.19
  @backstage/integration                                   0.7.2
  @backstage/plugin-api-docs                               0.7.0
  @backstage/plugin-app-backend                            0.3.22
  @backstage/plugin-auth-backend                           0.7.0
  @backstage/plugin-catalog-backend                        0.21.0
  @backstage/plugin-catalog-common                         0.1.1
  @backstage/plugin-catalog-import                         0.7.10
  @backstage/plugin-catalog-react                          0.6.12, 0.6.13
  @backstage/plugin-catalog                                0.7.9
  @backstage/plugin-github-actions                         0.4.32
  @backstage/plugin-kubernetes-backend                     0.4.6
  @backstage/plugin-kubernetes-common                      0.2.2
  @backstage/plugin-kubernetes                             0.5.6
  @backstage/plugin-org                                    0.4.0
  @backstage/plugin-pagerduty                              0.3.23
  @backstage/plugin-permission-common                      0.4.0
  @backstage/plugin-permission-node                        0.4.0
  @backstage/plugin-permission-react                       0.3.0
  @backstage/plugin-proxy-backend                          0.2.16
  @backstage/plugin-scaffolder-backend-module-cookiecutter 0.1.9
  @backstage/plugin-scaffolder-backend                     0.15.21
  @backstage/plugin-scaffolder-common                      0.1.3
  @backstage/plugin-scaffolder                             0.12.0
  @backstage/plugin-search-backend-node                    0.4.4
  @backstage/plugin-search-backend                         0.3.1
  @backstage/plugin-search                                 0.5.6
  @backstage/plugin-sonarqube                              0.2.13
  @backstage/plugin-tech-radar                             0.5.3
  @backstage/plugin-techdocs-backend                       0.13.0
  @backstage/plugin-techdocs                               0.13.0
  @backstage/plugin-user-settings                          0.3.17
  @backstage/search-common                                 0.2.1
  @backstage/techdocs-common                               0.11.4
  @backstage/test-utils                                    0.2.3
  @backstage/theme                                         0.2.14
  @backstage/types                                         0.1.1
  @backstage/version-bridge                                0.1.1
Done in 1.03s.

@asaini11 asaini11 added the bug Something isn't working label Mar 3, 2022
asaini11 (Contributor Author) commented Mar 3, 2022

@mclarke47 As mentioned, I have raised this as a GitHub issue. Thank you.

github-actions bot commented May 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 2, 2022
freben (Member) commented May 5, 2022

@asaini11 I am going through old issues and just wanted to check in, what's your status for this one?

@github-actions github-actions bot removed the stale label May 5, 2022
asaini11 (Contributor Author) commented May 5, 2022

Hi @freben, this is still an issue, but we are on Backstage version 0.4.14. I am going to try updating to the latest and see if it still persists. Thank you.

chriscarpenter12 commented May 10, 2022

I was getting this same error until I adjusted the ClusterRole attached to the ServiceAccount used for Backstage. It's not exactly clear what the required permissions are...

asaini11 (Contributor Author) commented May 16, 2022

@freben we have updated our Backstage version to 1.1.0 and we are still getting the same issue, so we would like to keep this open until we have a fix.
@chriscarpenter12 The permissions on our service account are listed below:
container.configMaps.list
container.cronJobs.list
container.deployments.list
container.horizontalPodAutoscalers.list
container.ingresses.list
container.jobs.list
container.pods.list
container.replicaSets.list
container.services.list
container.clusters.get
container.clusters.getCredentials
container.clusters.list

I think the last 3 may be excessive/unnecessary permissions as per docs
Can you please let me know what permissions you have attached to your Service Account so that I can compare?
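(For reference, a hedged sketch of how the permissions listed above could be captured in a GCP custom role definition used with gcloud iam roles create; the role ID, title, and file name are illustrative, not taken from the actual setup:)

```yaml
# role.yaml (illustrative); applied with something like:
#   gcloud iam roles create backstageKubernetesViewer --project=ourProjectId --file=role.yaml
title: Backstage Kubernetes Viewer
description: Read-only access used by the Backstage Kubernetes plugin
stage: GA
includedPermissions:
  - container.configMaps.list
  - container.cronJobs.list
  - container.deployments.list
  - container.horizontalPodAutoscalers.list
  - container.ingresses.list
  - container.jobs.list
  - container.pods.list
  - container.replicaSets.list
  - container.services.list
  # The following three may be unnecessary, as noted above
  - container.clusters.get
  - container.clusters.getCredentials
  - container.clusters.list
```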

chriscarpenter12 commented May 16, 2022

After updating to the latest version it was failing because jobs and cronjobs (the batch API group) weren't readable. I manually patched our ClusterRole to include them specifically, but it was still failing, this time with a generic error that didn't indicate why. I then used a ClusterRole aggregationRule to bind the default view cluster role to our backstage-view role, and after that everything worked. I haven't had a chance yet to start removing pieces to find the cause, because this cluster role is way more than what I think Backstage needs.
backstage-view.cluster-role.yml.txt
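For reference, a minimal sketch of the aggregation approach described above; the role name, ServiceAccount, and namespace are illustrative and not taken from the attached file:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backstage-view                 # illustrative name
aggregationRule:
  clusterRoleSelectors:
    # Aggregate the same rules that make up the built-in "view" role
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-to-view: "true"
rules: []                              # filled in automatically by the controller
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: backstage-view
subjects:
  - kind: ServiceAccount
    name: backstage                    # illustrative ServiceAccount
    namespace: backstage               # illustrative namespace
roleRef:
  kind: ClusterRole
  name: backstage-view
  apiGroup: rbac.authorization.k8s.io
```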

asaini11 (Contributor Author) commented:
Thank you @chriscarpenter12, I'll give this a go. We're currently using a GCP IAM policy with the permissions defined in a role there, so I will try this approach and let you know how I get on.

asaini11 (Contributor Author) commented:
I have tagged @mclarke47 on Discord to see if we can get a permanent fix for this / further input.

@freben freben added the help wanted Help/Contributions wanted from community members label May 18, 2022
mclarke47 (Collaborator) commented:
I wonder whether using the configuration option objectTypes to override the default objects to fetch would help. Also, what version of K8s are you using?
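For illustration, a hedged sketch of what that override could look like in app-config.yaml; the exact key placement and the set of supported values should be checked against the Kubernetes plugin docs for the Backstage version in use:

```yaml
kubernetes:
  # Fetch only a subset of object kinds instead of the plugin defaults (illustrative list)
  objectTypes:
    - pods
    - services
    - deployments
    - replicasets
  serviceLocatorMethod:
    type: 'multiTenant'
  # ...clusterLocatorMethods as configured above
```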

asaini11 (Contributor Author) commented:
Our K8s version is 1.22.7-gke.900.
Sorry if I've misunderstood here. What would I override objectTypes to? We are using a Google Service Account IAM policy, with the permissions defined in a GCP role, and that works mostly until we hit the first 500. When we hit this 500 error we get this message in the UI: There was a problem retrieving some Kubernetes resources for the entity: core. This could mean that the Error Reporting card is not completely accurate. Then, when we get a 200, the error message stays in the UI. I'd like to understand what causes this intermittent 500 error, and whether the error can be removed from the UI after a successful 200 has been received.

goenning (Contributor) commented May 18, 2022

I think this is a bug; we see this sometimes as well, and we're on Azure, not Google. As we have 6 clusters, we sometimes get throttled by Azure and one or more requests may fail, which causes this banner to show and never disappear.

We're doing a couple of things to mitigate these errors:

asaini11 (Contributor Author) commented:
Hi @goenning, it's good to know that others are having the same issue with a different config! I will try these things to see if they reduce the errors. However, it sounds like you have tried a few things and it's still not fixed. Since we do get a 200 after the 500 error, it would be good if the message in the UI disappeared so that users don't see it, as a workaround for a smoother developer experience. @mclarke47, it would also be good to get your thoughts on this.

asaini11 (Contributor Author) commented:
Hello @chriscarpenter12, I tried the K8s service account, but it didn't work for us: as we have 7 clusters, the Google Service Account works better for us (despite the error).
@goenning I'm going to keep an eye out for your pull request #11603 to be merged so I can try it.

asaini11 (Contributor Author) commented May 26, 2022

Update: to limit the refresh rate temporarily, I have removed 4 of the 7 cluster configurations from the app-config file so that only the important clusters are captured. We are still seeing the issue, but it doesn't occur as quickly or as often.

asaini11 (Contributor Author) commented:
I'd like to close this bug, as we haven't seen it since applying this change, thanks to @goenning. We set the interval to 30 seconds (using 3 clusters) and have not seen the error since.
