Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Historical data #3

Open
wants to merge 33 commits into
base: master
Choose a base branch
from
Open

Historical data #3

wants to merge 33 commits into from

Conversation

bglusman
Copy link
Owner

@bglusman bglusman commented Jun 2, 2020

Rebased, replacement PR for #2 which was in an odd state

@bglusman
Copy link
Owner Author

bglusman commented Jun 10, 2020

This branch has been live on elixir-companies.com for the last week or so and is working well there with very small memory impact on several different chart types. The only known issue is a small period of incorrectly formatted data display in the uPlot charts after initial render until the first live telemetry event occurs after client load; I suspect something in the JS code needs a tweak since it's now getting a lot of data right away, but I couldn't, so if anyone with knowledge of that could help me identify that, it would be appreciated. I did some digging/playing around some half-baked theories and none seemed to change behavior let alone fix, so, I think I'm not understanding the code in metrics_live/index.js very well, which is where I assume some small change would be needed.

The changes in dev.exs are my best guess at an enhanced dev experience to support this feature, so that it's disabled by default but can be enabled by setting an env var. Perhaps there's a simpler/preferable alternative or perhaps that should be reverted if we merge in to default to enabled.

I understand this is a big change, and I hope whether it is accepted or not that its a useful step in the right direction.

(DRAFT local PR before real PR to upstream github.com/phoenixframework/phoenix_live_dashboard )

@leeoniya
Copy link

leeoniya commented Jun 10, 2020

the only known issue is a small period of incorrectly formatted data display in the uPlot charts after initial render until the first live telemetry event occurs after client load;

might be leeoniya/uPlot#257. can you see if uPlot master does better?

@bglusman
Copy link
Owner Author

@leeoniya I think that does fix it! Just did a quick check locally anyway and looks fixed at a glance, I'll try and confirm in morning, but thank you!

@bglusman
Copy link
Owner Author

@leeoniya actually I'm wrong 😢 It's not fixed by that, though if you want to confirm/play with repro locally, the easiest place is in elixir-companies repo where I have page view metrics that will load history but look wrong until you hit one of the pages again while chart is open (and wait a few seconds because they're aggregated via telemetry_poller). until then, it looks to me like y-axis has wrong min/max.

I made a new branch pointing to uPlot master here https://github.com/bglusman/phoenix_live_dashboard/tree/historical_data_uplot_master
and a branch of elixir-companies using that branch here
https://github.com/bglusman/elixir-companies/tree/history_new_api

You can also play around on either this or uplot_master version of live_dashboard directly of course, there's a new mix command to run with history in the README as mix dev_history that should work without any other dependencies. Thank you for chiming in though and the glimmer of hope, please let me know if I can do anything to help debug!

@leeoniya
Copy link

please let me know if I can do anything to help debug!

i'll need a repro that's runnable on jsfiddle or similar.

alternatively, a live setup of history_new_api that i can pop open in a browser to see the symptoms, or at least the full config options and a problematic dataset that i can run without any extra baggage.

@bglusman
Copy link
Owner Author

bglusman commented Jun 11, 2020

Hey Leon!  Thanks again so much!  I'm not at all sure how/if I can repro in jsfiddle, maybe I'll give it a shot since that's your preference, but the history_new_api is more or less ready to go, locally all I change is removing an auth check from dashboard, so dashboard immediately works out of the box with no login. I'll also lower the period for periodic_measurements to 1 second from 10. Those are both pushed up in history_new_api branch as last commit, so all you should need to do is mix do deps.get, ecto.setup, phx.server and then visit http://localhost:4000/dashboard/nonode%40nohost/metrics?group=page_views

You can see a short video of the issue being reproduced a couple times here
https://streamable.com/ldcibr
and you can reproduce the events that are triggering updates/adding data by running in your terminal, while the server is up (before and/or after you visit the metrics path) with this bash snippet
for i in seq 1 3; do curl http://localhost:4000/en/companies; done or just curl http://localhost:4000/en/companies for a single data point which is all it takes to "fix" the display once it's open. You can also just visit the companies page manually in browser, its a simple count of page hits on that graph.

Let me know if that helps, and if I have time before you get around to this method maybe I'll try seeing if I can repro in JSFiddle, but I think that could be more work.

dev.exs Outdated Show resolved Hide resolved
@bglusman
Copy link
Owner Author

bglusman commented Jun 12, 2020

@leeoniya suspect maybe the issue is related to the debounce 100 in uPlot, but on the other hand that looks like it’s about window resizing not about x/y scaling? Is there a similar throttle or debounce thing I missed elsewhere or am I misunderstanding that one?

Oh or maybe is there a way to explicitly send a signal/event after last data point is sent to recalculate x/y scale now?

@leeoniya
Copy link

the debounce is only for syncing the cursor coords during browser window scroll or resize, so that cannot cause path drawing issues.

Oh or maybe is there a way to explicitly send a signal/event after last data point is sent to recalculate x/y scale now?

unfortunately, without having the problematic combo of uPlot options and dataset you're feeding it, i can't advise any kind of voodoo to try to remedy the situation. i'll take a look if you can repro this in an accessible-via-my-browser live_dashboard instance or in a jsfiddle, but i don't have any spare cycles beyond that in the next few weeks :(

@bglusman
Copy link
Owner Author

bglusman commented Jun 12, 2020 via email

@leeoniya
Copy link

leeoniya commented Jun 14, 2020

i have pulled out and adjusted the live_dashboard chart code [1] here to get it running in the default config:

https://codesandbox.io/s/withered-silence-wgc8f

you'll need to give me the problematic dataset(s) that trigger the issue.

cc @mcrumm

[1] https://github.com/phoenixframework/phoenix_live_dashboard/blob/master/assets/js/metrics_live/index.js

@bglusman
Copy link
Owner Author

Oh, I just noticed you pinged Michael Crumm, appreciate the thought, and perhaps he may have an idea, though I was trying to avoid bothering the core team until this was resolved, but I don't know, since I'm pretty sure this is just exposing an underlying bug/constraint elsewhere in uPlot and/or metrics js code, perhaps its appropriate.

I spent a little time in the codesandbox trying to figure out how to translate phx-diff websocket contens into valid chart.pushData events, and so far didn't manage to, but my theory is just performing multiple pushData or something like that in rapid succession should trigger the bug, and if it doesn't, well, I don't really have any other ideas for triggering it in pure JS land.

The charts it can happen in are dead simple though, here are the guts of a couple of sample phx-diffs corresponding to data points, just a few intergers at some times not that far apart, but they all come together immediately, not spread out over time as LiveDashboard normally would send them. Apologies though, I know I'm not the ideal person to be investigating this and don't want to waste either of your time. You can also see the issue live on elixir-companies.com/dashboard by the way as this code is running there. PageViews tab is the easiest to see it in because there aren't many page view events so it's likely that, for example, page_views.companies_web.company_show will sit there quite a while with the y-axis not changing from a max of 1, until you visit, say, https://elixir-companies.com/en/companies/268 or some other company id, and then (once the telemetry poller reaches 10 seconds and sends the aggregate) you'll see all the same data but with a re-calibrated Y axis.

In any case, here are a few phx-diff data point, though I don't think they'll help a lot:

{"c":{"0":{"2":{"d":[["Company index","","1592171420761605"]]}}}}
{"c":{"0":{"2":{"d":[["Company index","","1592171650790140"]]}}}}
{"c":{"0":{"2":{"d":[["Company index","","1592171680792930"]]}}}}
{"c":{"0":{"2":{"d":[["Company index","","1592171760802638"]]}}}}
{"c":{"0":{"2":{"d":[["Company index","1","1592172050854267"]]}}}}
{"c":{"0":{"2":{"d":[["Company index","","1592172290883619"]]}}}}

{"c":{"1":{"2":{"d":[["Company show","10","1592191067657135"]]}}}}
{"c":{"1":{"2":{"d":[["Company show","6","1592191077658130"]]}}}}
{"c":{"1":{"2":{"d":[["Company show","5","1592191087659041"]]}}}}

@leeoniya
Copy link

You can also see the issue live on elixir-companies.com/dashboard by the way as this code is running there

i cannot:

image

@bglusman
Copy link
Owner Author

bglusman commented Jun 15, 2020

Apologies for confusion, you must auth via github SSO at elixir-companies.com and then afterwards can visit /dashboard or click on link once logged in for dashboard

@leeoniya
Copy link

leeoniya commented Jun 15, 2020

ok, this is fixed in leeoniya/uPlot@a48050c

a self-contained repro is here (with latest uPlot build, so no issues): empty-data-stale-scales.zip

i am rather curious about the dataset, however.

const data0 = [
  [
    1592215265.345238,
    1592215265.352769,
    1592215293.755221,
    1592215293.761757,
    1592215293.77905,
    1592215293.78007,
    1592215293.782685,
    1592215315.296678,
    1592215315.297912,
    1592215315.300079,
    1592215315.302002,
    1592215315.302094,

a single page hit seems to cause multiple events, sometimes within the same millisecond. i don't expect this to happen and it creates a ton more data for the chart and really awkward-looking trends (below). i'm assuming this is an issue specific to elixir-companies.com and not LiveDashboard generally?

image

@bglusman
Copy link
Owner Author

@leeoniya well, at the moment it's specific to elixir-companies and to this PR/fork of LiveDashboard, but I wanted to understand and resolve the issue before opening PR to upstream LiveDashboard. A page hit is exactly causing multiple events (sorry if I failed to explain that well above), and that's still my proposed implementation for how historical charts will work, though it's possible José or someone else on team will have an alternative idea that sends it all down on first render instead of as seperate events for each data point. I'll try your fix now, but thanks so much Leon!

@leeoniya
Copy link

A page hit is exactly causing multiple events (sorry if I failed to explain that well above), and that's still my proposed implementation for how historical charts will work

what's the meaning of 4 separate query decode times for a single page hit? is the query being decoded multiple times?

@bglusman
Copy link
Owner Author

Do the charts look different after your fix out of curiosity vs, um before your fix/on elixir-companies but with a manual new event? for me the charts correct themselves as soon as a new event comes over the wire, and the idea is it should look the same as if the chart had been open the whole time, i.e. where in LiveDashboard normally when you do a hard client refresh, you lose all data in the metrics, here we're saving (some of) the history on the server and sending that on every new client connect. maybe there's a better way to pass it down though, the current approach re-uses the existing send_update mechanism that LiveDashboard uses for all telemetry dispatches to the client, but I haven't seen decode_time look like your above snapshot... currently it looks like this for me, once an additional decode_time event gets sent immediately after page load:
Screen Shot 2020-06-15 at 3 07 01 PM
If your fix leaves it looking like that, we may need to figure out a different change/fix...

@leeoniya
Copy link

leeoniya commented Jun 15, 2020

the uPlot issue was improper retention of the initial/data-less 0..1 y-scale even after the first setData event.

there are no timing issues since everything is synchronous and setData always takes the full dataset; there is no such thing as incremental or addData in uPlot.

the chart renders the provided dataset correctly as far as i can tell. what are you expecting to see?

@bglusman
Copy link
Owner Author

Just tested locally and it looks fine to me, I pushed up revision of historical_data_uplot_master branch using your commit and ran locally and it looked great to me... those extra data points in the spikes in your image above look weird to me also, but I can't recreate that when live_dashbaord is using the fix locally... can we PR that change into uPlot master, and I can have my PR include the update and explanation when I open it?

@bglusman
Copy link
Owner Author

To be clear, nothing in this fork or elixir-companies is or should be producing different datasets than upstream master, it's just the as you saw it sends all the stored datasets very quickly, often within the same millisecond, or thats what it should do and all I can see it doing...

@bglusman
Copy link
Owner Author

I did download and run your local repro and I do see similar weirdness there when I zoom in... I'll leave my local dev version running your patch open a while and see if that reproduces it when there's more data or not, but either way thank you so much for the help and the fix!

@leeoniya
Copy link

but I can't recreate that when live_dashbaord is using the fix locally

thats really odd, because the dataset i pulled is the one being generated and given to uPlot by the link you gave me:

http://elixir-companies.com/dashboard/server%409465247c-3338-4bf1-ba51-93ef7b7dc988/metrics?group=companies

uPlot does not generate this data, it is coming from your server. the uPlot fix has nothing to do with the spikes, those spikes exist in the dataset. which is why i asked why there are multiple query_decode datapoints for a single page hit (sometimes in the same millisecond).

i would expect the server to only create and store a single query decode datapoint per page hit, not 3-5 of them.

@bglusman
Copy link
Owner Author

Sorry for confusion, I know uPlot isn't generating the data, and I'm sure if this issue goes any further it's almost certain to be some kind of subtle issue server-side, not in your library, you've clearly found that issue, I'm just unsure whether/where I need to do additional digging before opening my PR upstream! Not your concern of course.

When I said I can't reproduce locally, I didn't mean your local packaged self contained repro above, that reproduces chart like yours if I zoom in (I forgot about the zoom feature on uPlot, zooming or hiding and showing the chart also fixes the existing y axis issue on elixir-companies/elsewhere, but I hadn't been trying that). I jsut meant, so far, running your fixed uPlot commit above locally on the branch of this repo historical_data_uplot_master which I updated to use that commit you referenced (I don't see it on any branch of uPlot), I didn't see any similar issue.

Is there any reason not to merge/PR your above fix into uPlot master and/or make a new npm release that includes it?

@leeoniya
Copy link

Is there any reason not to merge/PR your above fix into uPlot master and/or make a new npm release that includes it?

the commit i linked is already in master. later today i'll be re-checking that nothing broke since 1.0.10 and will publish 1.0.11.

@bglusman
Copy link
Owner Author

As a thought, I wonder if the issue has to do with floating point math and the recent change to calculate times at microsecond precision in LiveDashboard, but then divide them down and send to client at coarser granularity, using the microsecond small digits to ensure the events have unique IDs... this commit phoenixframework@dc72173 and/or perhaps this one phoenixframework@24b5bac but that's just a wild guess, I haven't fully digested these, but I saw in your dataset s fair number of data points with lots of extra precision but all .99998 etc...

Awesome, I forgot to look for it in master! Thanks so much Leon!

@leeoniya
Copy link

leeoniya commented Jun 15, 2020

As a thought, I wonder if the issue has to do with floating point math and the recent change to calculate times at microsecond precision in LiveDashboard

right, that was done because uPlot expects time axes to be unix timestamps (seconds), and also cannot have duplicate values, which would occur if you have 2 events in the same second or millisecond.

that still doesnt explain why there are multiple query decode events for a single page hit (can you verify if this is actually happening and it's not just a websocket data -> uPlot data adaptation mistake?). a precision change or division op should not have that effect, right?

@leeoniya
Copy link

leeoniya commented Jun 15, 2020

below is what i see come over the websocket for a single page hit to https://elixir-companies.com/en/companies/268. why there are 4 sets of events and 2 Query time points in each set, i have no idea, but that's where the spikes come from. garbage in; garbage out 🤷

[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"1":{"2":{"d":[["Decode time","0.009307","1592251652289036"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"2":{"2":{"d":[["Query time","4.213578","1592251652289036"]]}}}}]	
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"3":{"2":{"d":[["Queue time","0.218026","1592251652289036"]]}}}}]	
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"4":{"2":{"d":[["Idle time","1834.576134","1592251652289036"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"0":{"2":{"d":[["Total time","13.286985","1592251652302712"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"1":{"2":{"d":[["Decode time","0.012582","1592251652302712"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"2":{"2":{"d":[["Query time","12.685079","1592251652302712"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"3":{"2":{"d":[["Queue time","0.589324","1592251652302712"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"4":{"2":{"d":[["Idle time","1839.3218749999999","1592251652302712"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"0":{"2":{"d":[["Total time","1.6687889999999999","1592251652304909"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"1":{"2":{"d":[["Decode time","0.0048319999999999995","1592251652304909"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"2":{"2":{"d":[["Query time","1.374785","1592251652304909"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"3":{"2":{"d":[["Queue time","0.289172","1592251652304909"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"4":{"2":{"d":[["Idle time","1853.068186","1592251652304909"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"0":{"2":{"d":[["Total time","2.5699289999999997","1592251652305923"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"1":{"2":{"d":[["Decode time","0.005889999999999999","1592251652305923"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"2":{"2":{"d":[["Query time","2.310116","1592251652305923"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"3":{"2":{"d":[["Queue time","0.253923","1592251652305923"]]}}}}]
[null,null,"lv:phx-FhjQbZI9uivVKwUB","diff",{"c":{"4":{"2":{"d":[["Idle time","1853.182272","1592251652305923"]]}}}}]

@bglusman
Copy link
Owner Author

bglusman commented Jun 15, 2020

I'll see if I have any idea how to verify, this could be some kind of bug in my code I suppose or a difference in how I'm handling each event storing it vs how live_dashboard is handling it when broadcasting it... but so far I can't reproduce any difference between the live data and the playback data. I can reference this discussion in the final pull request as an open question I guess, if I can't figure out anything definitive before then. Maybe I need to group the historical events by exact time, so that it sends anything that's the same timestamp in the same event? or deduplicate/take first event with a given timestamp? Not sure, but appreciate the input and heads up, along with the bug fix!

deprecated time unit: :microseconds
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants