Add error handling for errors in Datadog API response #103

joycse06 · 2023-04-21T04:18:43Z

Datadog can return a 202 Accepted response code but still include errors in the response payload. This PR adds a check for that.

From what it looks like, the endpoint probably also does partial acceptance: it accepts the MetricData that are valid and returns errors for the invalid ones.

With this, we will at least see in the logs what's going on and be able to fix it quickly. Without this, it was failing silently.
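For illustration, here is a minimal sketch of the kind of check described above. It assumes a hypothetical `submitResult` struct standing in for the Datadog client's response type; only an `Errors` string slice is modeled, and the `StatusCode` field and `payloadError` helper are made-up names, not part of the Datadog client.

```go
package main

import (
	"fmt"
	"strings"
)

// submitResult is a hypothetical stand-in for the API response.
// Only the Errors field is assumed to mirror the real payload.
type submitResult struct {
	StatusCode int
	Errors     []string
}

// payloadError returns a non-nil error when the payload lists errors,
// regardless of the HTTP status code that came with it.
func payloadError(r submitResult) error {
	if len(r.Errors) == 0 {
		return nil
	}
	return fmt.Errorf("error response from Datadog (HTTP %d): %s",
		r.StatusCode, strings.Join(r.Errors, ","))
}

func main() {
	accepted := submitResult{StatusCode: 202}
	rejected := submitResult{StatusCode: 202, Errors: []string{"invalid metric type"}}

	fmt.Println(payloadError(accepted)) // no payload errors
	fmt.Println(payloadError(rejected)) // 202, but the payload carries an error
}
```

The point is only that a successful status code and a nil transport error are not enough; the payload has to be inspected too.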

Areson

What was the failure mode for this? From the code, it looks like it didn't return an error at all, since err == nil. If some of the metrics are being accepted, returning an error here for 202 responses may cause some metric values not to be submitted when a large number of points is being sent. Are we fine with stopping the submission of any remaining metrics when we get a partial success?

daniel-nichter · 2023-04-23T15:24:46Z

sink/datadog.go

@@ -393,6 +394,11 @@ func (s *Datadog) sendApi(ddCtx context.Context, dp []datadogV2.MetricSeries) er
 			return err
 		}

+		if len(apiResponse.Errors) > 0 {
+			// datadog can return a 202 Accepted response code but errors in response payload
+			return fmt.Errorf("error response from Datadog: %s", strings.Join(apiResponse.Errors, ","))


Re @Areson's comment: if this is a partial success, then we should not abort completely.

joycse06 · 2023-04-23T20:53:45Z

The series that were being sent all had the same issue, so Datadog returned an error for the whole batch, but I think it can have partial success. I will verify.

But I agree: if it returns partial success, we should log the error and continue. I prefer to fail fast, but in this case, since Blip submits a lot of metrics, we shouldn't stop submission for a few failures.

It might be a good idea to send a meta metric for partial errors in sinks so we can build alerts on them. I don't want to depend on a human checking the logs for partial errors all the time.

joycse06 · 2023-04-23T21:26:06Z

What was the failure mode for this?

The metrics were not being sent to Datadog and there were no errors in logs. So it was failing silently.

daniel-nichter · 2023-04-23T21:41:52Z

It might be a good idea to send a meta metric for partial errors in sinks so we can build alerts on them. I don't want to depend on a human checking the logs for partial errors all the time.

I thought about meta metrics, but it's one of those areas where probably nothing we do will be right for a general audience. So instead, partial failure should emit a Blip event because (my thinking was) those can be picked up by a custom sink and sent wherever, to do whatever.
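To make the idea concrete, here is a rough sketch of a partial failure being reported as an event that a custom receiver handles. Everything in it is hypothetical: `Event`, `Receiver`, `countingReceiver`, `reportPartialFailure`, and the event name are invented for illustration and are not Blip's event API.

```go
package main

import "fmt"

// Event and Receiver are hypothetical types for illustration only;
// they are not Blip's event API.
type Event struct {
	Name    string
	Message string
}

type Receiver interface {
	Recv(Event)
}

// countingReceiver shows one thing a custom receiver could do:
// count events by name so an alert can be built on the count.
type countingReceiver struct {
	counts map[string]int
}

func (r *countingReceiver) Recv(e Event) { r.counts[e.Name]++ }

// reportPartialFailure emits a single event when errs is non-empty,
// instead of returning an error that would stop further submission.
func reportPartialFailure(r Receiver, errs []string) {
	if len(errs) == 0 {
		return
	}
	r.Recv(Event{
		Name:    "sink-partial-failure", // hypothetical event name
		Message: fmt.Sprintf("%d error(s): %v", len(errs), errs),
	})
}

func main() {
	r := &countingReceiver{counts: map[string]int{}}
	reportPartialFailure(r, []string{"invalid metric"})
	reportPartialFailure(r, nil) // nothing to report, no event
	fmt.Println(r.counts["sink-partial-failure"])
}
```

The receiver decides what to do with the event (count it, forward it, page someone), which keeps that policy out of the sink itself.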

joycse06 · 2023-04-24T22:22:40Z

Verified: this can be a partial success. Updated the code to log the errors and continue.
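As a sketch of what "log and continue" means here (not the code in this PR): every batch is still attempted, and a failed batch is logged and counted instead of aborting the loop. The `sendAll` helper, the `[]string` batch type, and `errInvalid` are assumptions made for the example.

```go
package main

import (
	"errors"
	"log"
)

// errInvalid is a hypothetical error used to simulate a rejected batch.
var errInvalid = errors.New("invalid metric")

// sendAll submits every batch via send. A failed batch is logged and
// counted, but it does not stop submission of the remaining batches.
func sendAll(batches [][]string, send func([]string) error) (failed int) {
	for i, b := range batches {
		if err := send(b); err != nil {
			log.Printf("batch %d: validation error, continuing: %v", i, err)
			failed++
		}
	}
	return failed
}

func main() {
	batches := [][]string{{"a"}, {"bad"}, {"c"}}
	send := func(b []string) error {
		if b[0] == "bad" {
			return errInvalid
		}
		return nil
	}
	log.Printf("failed batches: %d of %d", sendAll(batches, send), len(batches))
}
```

Returning the failure count rather than the first error is what lets the caller decide separately whether to raise an alert.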

joycse06 · 2023-04-24T22:37:24Z

Updated the PR to raise an event as well.

joycse06 added 2 commits April 21, 2023 14:13

Add error handling for datadog error response

deeca2b

errorsJson -> respJSON

ac30e97

daniel-nichter approved these changes Apr 21, 2023

daniel-nichter requested a review from Areson April 21, 2023 12:55

Areson reviewed Apr 21, 2023

daniel-nichter reviewed Apr 23, 2023

Log validation errors and continue

9ade8dc

Raise an event for partial sink failure as well

1dbb852

Areson approved these changes Apr 25, 2023

joycse06 merged commit 885bd85 into main Apr 26, 2023

daniel-nichter deleted the jnag/2023-04/more-error-checking-in-datadog branch July 3, 2023 16:53
