
Streaming of stdout for metric emission in external probe #691

Closed
v-pratap opened this issue Mar 7, 2024 · 11 comments
Labels: enhancement (New feature or request)

v-pratap (Contributor) commented Mar 7, 2024

Describe the feature you'd like and the problem it will solve
Currently, cloudprober runs the external probe and extracts all the metrics from its stdout at once. This means that if our probe runs for 60 seconds, emitting one metric ("request_count 1") every 10 seconds, cloudprober will not extract these metrics every 10 seconds. Instead, it will extract them at the end of the 60 seconds and emit "request_count 1" six times, all together.

I'm using the stackdriver surfacer. With this configuration, if all six metrics are emitted at the end, the surfacer will override all six metrics into one, leading to data loss.

To resolve this issue, we can extract the metrics in a streaming way. This means extracting data from stdout at a fixed interval, which can eliminate the data loss problem.
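
For illustration, here is a minimal sketch (in Go; the probe structure is an assumption, not the actual probe) of the pattern described above: the process runs for about 60 seconds and prints one metric line every 10 seconds, but with batch extraction all six lines are only picked up when the process exits.

```go
// Hypothetical external probe illustrating the issue: it runs for ~60 seconds,
// printing "request_count 1" every 10 seconds. With batch extraction,
// cloudprober sees all six lines only after the process exits.
package main

import (
	"fmt"
	"time"
)

func main() {
	for i := 0; i < 6; i++ {
		fmt.Println("request_count 1") // one metric line per interval
		time.Sleep(10 * time.Second)
	}
}
```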

Related bug for this feature request: #689

manugarg (Contributor) commented Mar 9, 2024

See #689 (comment) for possible implementation of this feature.
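
The linked comment isn't reproduced here, but the general shape of streaming extraction might look roughly like the sketch below (a hedged illustration, not cloudprober's actual code; streamOutput and handleMetricLine are made-up names): read the child process's stdout line by line and hand each line off as soon as it arrives, instead of waiting for the process to exit.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

// streamOutput starts cmd and invokes handleMetricLine for each stdout line
// as soon as it is available, rather than collecting all output at exit.
func streamOutput(cmd *exec.Cmd, handleMetricLine func(line string)) error {
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		handleMetricLine(scanner.Text()) // parse/emit the metric right away
	}
	if err := scanner.Err(); err != nil {
		return fmt.Errorf("reading stdout: %w", err)
	}
	return cmd.Wait()
}

func main() {
	cmd := exec.Command("/path/to/external-probe") // hypothetical probe binary
	if err := streamOutput(cmd, func(line string) {
		fmt.Println("metric line:", line)
	}); err != nil {
		fmt.Println("probe error:", err)
	}
}
```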

manugarg (Contributor) commented:

@BrennaEpp, IIRC, you were also looking for something similar a few months back. Can you please confirm if this will help your project as well? Thanks!

manugarg added this to the v0.13.4 milestone Mar 29, 2024
manugarg added a commit that referenced this issue Apr 2, 2024
- Generate metrics from the external probe's stdout as soon as stdout becomes available. This helps in situations where the external probe process runs for a while but keeps outputting metrics much more frequently, e.g. a 1-minute run interval with output every 10s (see #691 for one such request, but it has been asked for before as well, at least once more).

- With this change, stderr will also be read and logged as soon as it's available.

Add and improve testing:
- Don't rely on timeouts for testing; that makes tests unreliable on CI.
- Reduce external probe process runs for the TestProbeOnceMode test.
- Remove the wait for the command's exit.
  - The change in #547 appears wrong, as it caused wait to be called only on non-Windows platforms, while the issue was on Windows. It seems the issue was temporary and fixed itself.
manugarg (Contributor) commented Apr 2, 2024

Fix for this has been submitted (#708).

manugarg closed this as completed Apr 2, 2024
BrennaEpp commented:

Thank you @manugarg! I'm pretty sure this will resolve a big issue that we've encountered (data loss). I will test it out at some point and get back to you if it doesn't completely resolve it.

v-pratap (Contributor, Author) commented Apr 9, 2024

@manugarg, a big thanks to you for implementing this solution. It seems to be working fine for most cases, but I got an error while emitting a distribution metric. I emit the latency_ms_dist metric like the following:

latency_ms_dist{op=READ} 1,3,4,5,5,6,63,3,5,3,56,5,675,6,45,4

I think the continuous streaming of stdout is breaking the line. I am getting this error:
[screenshot of the error output]

But once I disable this new feature, I get no errors and everything seems fine.
Can you provide some insights here?

Also, in the test file external_test.go, you use all the environment variables that have the "GO_TEST" prefix, which caused the tests to fail in our environment. In our test environment there is one extra variable, GO_TEST_CHATTY_OUTPUT, and it will always be there. For now, I have tweaked my code, but someone else may face the same issue later.
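
A hypothetical sketch of the prefix-matching pitfall described above (this is not the actual external_test.go code): selecting environment variables purely by the "GO_TEST" prefix also matches unrelated variables such as GO_TEST_CHATTY_OUTPUT.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Matching on the bare "GO_TEST" prefix picks up any variable that happens
	// to start with it, e.g. GO_TEST_CHATTY_OUTPUT set by some CI environments.
	for _, kv := range os.Environ() {
		if strings.HasPrefix(kv, "GO_TEST") {
			fmt.Println("matched:", kv)
		}
	}
}
```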

manugarg (Contributor) commented Apr 9, 2024

This is likely a result of default buffer limits. It appears that your distribution metric lines are too long -- the default limit is 64kB. Does that sound right? Approximately how many numbers do you expect on these lines?

Update:
I did some testing, and it seems that the scanner will not break the line if the buffer is too small for it, so it must be something else. In any case, before we do further debugging, it would be good to see your program's output -- what does that line look like?

manugarg (Contributor) commented Apr 9, 2024

I've just now fixed an issue (#712), but the symptoms of that issue shouldn't be what you're noticing. It may be worth trying, though.

As I said earlier, looking at your external program's output will help debug this issue further.

manugarg reopened this Apr 9, 2024
v-pratap (Contributor, Author) commented:

Hi @manugarg,

I have checked the output closely; it is actually parsing the line partially:
[screenshot of the partially parsed line]

In actual production, the metric line would have 50,000+ values; previously it was working fine.

v-pratap (Contributor, Author) commented:

I have patched in the fix from #712, and now it seems to be working fine. Thanks @manugarg for all of this.

manugarg (Contributor) commented:

Cool. I'd definitely recommend not making these lines too long (more than 64kB), though. Instead, run multiple probes if you need to increase the frequency.

manugarg (Contributor) commented:

To give you enough room, I am increasing the max token size to 256kB (4 times the default): #722.
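
For reference, a minimal sketch of what raising the limit looks like with Go's bufio.Scanner, whose default maximum token size (bufio.MaxScanTokenSize) is 64kB; the 256kB figure matches #722, but the scanLines helper and how this is wired into cloudprober are assumptions here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// scanLines reads r line by line with a raised maximum line length of 256kB.
// With the default 64kB limit, a longer line makes Scan stop and Err return
// bufio.ErrTooLong instead of silently splitting the line.
func scanLines(r io.Reader, handle func(string)) error {
	scanner := bufio.NewScanner(r)
	scanner.Buffer(make([]byte, 0, 64*1024), 256*1024) // initial buffer, max token size
	for scanner.Scan() {
		handle(scanner.Text())
	}
	return scanner.Err()
}

func main() {
	input := strings.NewReader("latency_ms_dist{op=READ} 1,3,4,5\nrequest_count 1\n")
	if err := scanLines(input, func(line string) { fmt.Println(line) }); err != nil {
		fmt.Println("scan error:", err)
	}
}
```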
