ui: DB Console Metrics takes a long time to render graphs and sometimes fails to load at all for 22.1.5 clusters #85636

Closed
thtruo opened this issue Aug 4, 2022 · 12 comments
Labels
A-kv-observability A-observability-inf A-webui Triage label for DB Console (fka admin UI) issues. Add this if nothing else is clear. A-webui-general Issues on the DB Console that span multiple areas or don't have another clear category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@thtruo
Contributor

thtruo commented Aug 4, 2022

Describe the problem

Customers and the field team have been reporting major UI issues affecting CRDB 22.1.4 and 22.1.5: metrics graphs either take a long time to load in DB Console or, at times, fail to load at all and show a "Something went wrong" state. Both are regressions.

image (24)
image (26)

To Reproduce

Links to Loom recordings will be shared on the equivalent Jira ticket.

Environment:

  • CockroachDB version 22.1.5 (it's also been reported to happen for 22.1.4)

Additional context

  • Affected clusters tend to be running with 35+ or 45+ nodes

Jira issue: CRDB-18353

gz#13551

gz#13591

@thtruo added the C-bug, A-webui-general, A-webui, and T-crux labels Aug 4, 2022
@daniel-crlabs
Contributor

This impacts both Cockroach Cloud and on-prem customers, as long as they are running 22.1.5. I attempted to obtain screenshots from the SQL, Hardware, and Overview pages; only the top graph would load, regardless of the time frame I chose. Affected clusters are not only large ones: I tested 5 different cloud customers, all with 9 nodes or fewer, and they all showed this same behavior.
This not only impacts our customers, but also impacts our ability to troubleshoot issues for our customers.

@andreimatei
Contributor

@daniel-crlabs it'd be useful if you attached something private pointing to one or two CC clusters where this happens

@daniel-crlabs
Contributor

> @daniel-crlabs it'd be useful if you attached something private pointing to one or two CC clusters where this happens

@andreimatei there were quite a few. I'm not sure how to attach a private comment on GitHub, but we have a view where you can pick the larger clusters (ones with 5+ nodes) that are running 22.1.5, and you'll be able to see this behavior.

I'll add the link to the jira ticket, I think that's internal.

@ddtort

ddtort commented Aug 8, 2022

Some context that may help here: this is a look at a v21.2.x cluster that loads well (left) and a v22.1.5 cluster that takes a long time to load (right). The initial request for data goes through for both, but on the 22.1.5 cluster all subsequent requests to stream more recent data get queued and hang.

Screenshot 2022-08-08 at 11 28 04 AM

@daniel-crlabs
Copy link
Contributor

Passing the question along, please see below from one of our CEAs:

What release will this be in: #85636

We can't close the ticket until this has been backported and tested, thanks.

@thtruo
Contributor Author

thtruo commented Aug 8, 2022

Thanks @daniel-crlabs - the team is currently on it. If we can merge the fix by ~Aug 16, it will fall in the 22.1.6 release. If we miss that train, then it'll be 22.1.7.
We won't have confirmation until a PR is up, hopefully soon.

@thtruo
Contributor Author

thtruo commented Aug 8, 2022

Updated assignees and T- labels since @koorosh is taking the baton from @dhartunian

@daniel-crlabs
Contributor

Thank you my friends, I'll pass that info along to the customer, appreciate the update.

@thtruo
Contributor Author

thtruo commented Aug 16, 2022

@ccampbell-crl
Collaborator

ccampbell-crl commented Aug 16, 2022

@thtruo Is there currently a workaround for this issue, or are customers expected to live with it until they upgrade? Another customer is also currently having this issue on 22.1.5.

@thtruo
Contributor Author

thtruo commented Aug 16, 2022

@ccampbell-crl the patches to those issues were recently merged, so they are only available in 22.1.6

Unfortunately, customers will still encounter those regressions if they're on 22.1.5. We have not found workarounds for that release.

@thtruo
Contributor Author

thtruo commented Aug 23, 2022

Closing this issue as patches have rolled out into the 22.1.6 release


7 participants