
log to cloudwatch instead of HB on aiproxy request retry #58354

Merged: 4 commits into staging on May 6, 2024

Conversation

@davidsbailey (Member) commented May 1, 2024

Finishes https://codedotorg.atlassian.net/browse/AITT-597. Some relevant background here: https://github.com/code-dot-org/code-dot-org/pull/54951/files#r1394758841

Testing story

new unit test coverage

Follow-up steps

resolve https://app.honeybadger.io/projects/3240/faults/106895906/01HVM50BD81SEVA9545BV6XN4Q after this PR ships (currently it is paused)

@davidsbailey davidsbailey marked this pull request as ready for review May 3, 2024 00:43
@davidsbailey davidsbailey requested review from cat5inthecradle and a team May 3, 2024 00:44
@cat5inthecradle (Contributor):

I think this is a great candidate for structured cloudwatch logs, from which we could infer some metrics. We don't have this implemented yet, but it has popped up in a few conversations, including this week's issues with the Test server.

This PR looks good to me; just noting the loss of context that you were previously able to send to Honeybadger.

Are there any additional dimensions worth adding? I'm thinking maybe the name/url of the particular resource that timed out.
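
For illustration only, a minimal boto3 sketch of what attaching such a dimension to a retry metric could look like (the namespace, metric, and dimension names here are hypothetical, not what this PR actually uses):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical names; the PR's real namespace/metric names may differ.
cloudwatch.put_metric_data(
    Namespace="AiProxyBackend",
    MetricData=[
        {
            "MetricName": "RequestRetry",
            # A dimension identifying which aiproxy resource timed out.
            "Dimensions": [{"Name": "Endpoint", "Value": "assessment"}],
            "Value": 1,
            "Unit": "Count",
        }
    ],
)
```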

@davidsbailey (Member, Author):

@cat5inthecradle please tell me more about structured cloudwatch logs! I would love to log user id, script level id, etc. so that we can see if a certain project is failing repeatedly, though I guess we'll find that out another way if the project fails more times than the number of retries. Another wish is to send along error details from aiproxy and have them show up in these logs, so that we don't have to do forensics to match up this occurrence with the raw logs from aiproxy.

I'm open to adding more dimensions, but at the moment all of our traffic from this job is going to the assessment endpoint on aiproxy, so I'm not sure that particular piece of data adds anything.

@davidsbailey (Member, Author):

As an aside, New Relic has "custom metrics" and "custom events". Metrics are like any other metrics, and events can be searched with a SQL-like interface. I thought this worked really well, and wish we had something like it within AWS.

@davidsbailey (Member, Author):

Lastly... I'm actually tempted to log the error details to Firehose, since that implements exactly what I need.

@cat5inthecradle (Contributor):

Firehose gets the logs into S3, which makes them queryable via AWS Athena. You can do the same thing with CloudWatch Logs, and CloudWatch Logs also integrates more closely with metrics.

You use MetricFilters for that, something like this in CloudFormation:

  TimeoutMetricFilter:
    Type: "AWS::Logs::MetricFilter"
    Properties:
      LogGroupName: 
        Ref: TimeoutLogGroup
      FilterPattern: '{ $.event = "timeout" }'
      MetricTransformations:
        - MetricName: "TimeoutEventCount"
          MetricNamespace: "YourApplication"
          MetricValue: "1"
          DefaultValue: 0
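
For context, that FilterPattern matches structured (JSON) log events. A rough boto3 sketch of emitting one such event, with hypothetical group/stream/field names (including the user and script-level context mentioned above), might look like:

```python
import json
import time

import boto3

logs = boto3.client("logs")

# Hypothetical log group/stream names; both must already exist in CloudWatch Logs.
logs.put_log_events(
    logGroupName="/your-application/aiproxy",
    logStreamName="retries",
    logEvents=[
        {
            "timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "message": json.dumps(
                {
                    "event": "timeout",      # matched by the FilterPattern above
                    "user_id": 123,          # hypothetical context fields
                    "script_level_id": 456,
                }
            ),
        }
    ],
)
```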

Firehose might be a better intermediate step than Metrics because it allows for wider events. I guess I wasn't aware we were doing that on the backend though.

@davidsbailey (Member, Author):

> This PR looks good to me; just noting the loss of context that you were previously able to send to Honeybadger.

> Firehose might be a better intermediate step than Metrics because it allows for wider events.

Adding firehose logging in #58411, then I'll switch to writing to cloudwatch logs once that is working.

> I guess I wasn't aware we were doing that on the backend though.

Can you clarify what you mean by this last part?

@davidsbailey davidsbailey merged commit 7b14e41 into staging May 6, 2024
2 checks passed
@davidsbailey davidsbailey deleted the ai-rubrics-retry-cloudwatch branch May 6, 2024 16:43