Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add error handling for errors in Datadog API response #103

Merged
merged 4 commits into from
Apr 26, 2023

Conversation

joycse06
Copy link
Collaborator

Datadog can return a 202 Accepted response code but errors in response payload. This PR adds a check for that.

From what it looks like, the endpoint probably also partially accept MetricData which are valid and return errors for invalid ones.

With this, we will at least see in logs what's going on and will be able to fix it quickly. Without this, it was failing silently.

Copy link
Collaborator

@Areson Areson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the failure mode for this? From what it looks like it appears that it didn't return an error at all since err == nil. If some of the metrics are being accepted returning an error here for 202 responses may cause some metric values not to be submitted if there were a large number of points being submitted. Are we fine with stopping the submission of any remaining metrics when we get a partial success?

sink/datadog.go Outdated
@@ -393,6 +394,11 @@ func (s *Datadog) sendApi(ddCtx context.Context, dp []datadogV2.MetricSeries) er
return err
}

if len(apiResponse.Errors) > 0 {
// datadog can return a 202 Accepted response code but errors in response payload
return fmt.Errorf("error response from Datadog: %s", strings.Join(apiResponse.Errors, ","))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re @Areson comment: if this is partial success, then we should not abort completely.

@joycse06
Copy link
Collaborator Author

The series that was being sent all had the same issue, so datadog returned error for the whole batch, but I think it can have partial success. I will verify.

But I agree, if it returns partial success, we should log the error and continue. Though I prefer to fail fast, but in this case as blip submits a lot of metrics we shouldn't stop submission for a few failures.

It might be a good idea to send a meta metric for partial errors in sinks so we can build alert on them. I don't want to depend on a human being to check for partial errors in logs all the time.

@joycse06
Copy link
Collaborator Author

What was the failure mode for this?

The metrics were not being sent to Datadog and there were no errors in logs. So it was failing silently.

@daniel-nichter
Copy link
Member

It might be a good idea to send a meta metric for partial errors in sinks so we can build alert on them. I don't want to depend on a human being to check for partial errors in logs all the time.

I thought about meta metrics, but it's one of those areas where probably nothing we do will be right for a general audience. So instead, partial failure should emit a Blip event because (my thinking was) those can be picked up by a custom sink and sent wherever, to do whatever.

@joycse06
Copy link
Collaborator Author

Verified, this can be partial success. Updated the code to log the errors and continue.

@joycse06
Copy link
Collaborator Author

joycse06 commented Apr 24, 2023

Updated PR to Raise an event as well.

@joycse06 joycse06 merged commit 885bd85 into main Apr 26, 2023
@daniel-nichter daniel-nichter deleted the jnag/2023-04/more-error-checking-in-datadog branch July 3, 2023 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants