Streaming of stdout for metric emission in external probe #691
Comments
See #689 (comment) for a possible implementation of this feature.
@BrennaEpp, IIRC, you were also looking for something similar a few months back. Can you please confirm if this will help your project as well? Thanks!
- Generate metrics from the external probe's stdout as soon as stdout becomes available. This helps when the external probe process runs for a while but outputs metrics much more frequently, say a 1 min interval with output every 10s (see #691 for one such request; it has been asked at least once before).
- With this change, stderr is also read and logged as soon as it's available.

Add and improve testing:
- Don't rely on timeouts for testing. That makes tests unreliable on CI.
- Reduce external probe process runs for the TestProbeOnceMode test.
- Remove the wait for the command exit. The change in #547 appears wrong: it caused wait to be called only on non-Windows platforms, while the issue was on Windows. It seems the issue was temporary and resolved itself.
A fix for this has been submitted (#708).
Thank you @manugarg! I'm pretty sure this will resolve a big issue that we've encountered (data loss). I will test it out at some point and get back to you if it doesn't completely resolve the issue.
@manugarg, a big thanks to you for implementing this solution. It seems to work fine in most cases, but I got an error while emitting a distribution metric. I emit the latency_ms_dist metric like the following:
I think the continuous streaming of stdout is breaking the line. I am getting this error: But once I disabled this new feature, the errors went away and everything seems fine. Also, the test file external_test.go picks up every environment variable with the "GO_TEST" prefix, which made the tests fail in our environment: our test environment always has one extra variable, GO_TEST_CHATTY_OUTPUT. For now I have tweaked my code, but someone else may hit the same issue later.
This is likely a result of default buffer limits. It appears that your distribution metric lines are too long: the default limit is 64kB. Does that sound right? Approximately how many numbers do you expect on these lines? Update:
I've just now fixed an issue (#712), but the symptoms of that issue should not be what you're noticing. It may be worth trying though. As I said earlier, looking at your external program's output will help debug this issue further.
Hi @manugarg, I have checked the output closely, and it is actually parsing the line only partially: In actual production the metric line would have more than 50,000 values; previously it was working fine.
Cool. I'd definitely recommend not making these lines too long (more than 64kB) though. Instead, run multiple probes if you need to increase the frequency.
To give you enough room, I am increasing the max token size to 256kB (4 times the default): #722. |
Describe the feature you'd like and the problem it will solve
Currently, cloudprober runs the external probe and extracts all the metrics from stdout at once, after the probe finishes. This means that if our probe runs for 60 seconds, emitting one metric ("request_count 1") every 10 seconds, cloudprober will not extract these metrics every 10 seconds. Instead, it will extract them at the end of the 60 seconds and emit "request_count 1" six times, all together.
I'm using the stackdriver surfacer. With this configuration, if all six metrics are emitted at the end, the surfacer will collapse all six metrics into one, leading to data loss.
To resolve this issue, we can extract the metrics in a streaming way. This means extracting data from stdout at a fixed interval, which can eliminate the data loss problem.
Related bug for this feature request: #689