
log to cloudwatch instead of HB on aiproxy request retry #58354

Merged: 4 commits into staging on May 6, 2024

Conversation

@davidsbailey (Member) commented May 1, 2024

Finishes https://codedotorg.atlassian.net/browse/AITT-597. Some relevant background here: https://github.com/code-dot-org/code-dot-org/pull/54951/files#r1394758841

Testing story

new unit test coverage

Follow-up steps

resolve https://app.honeybadger.io/projects/3240/faults/106895906/01HVM50BD81SEVA9545BV6XN4Q after this PR ships (currently it is paused)

@davidsbailey davidsbailey marked this pull request as ready for review May 3, 2024 00:43
@davidsbailey davidsbailey requested review from cat5inthecradle and a team May 3, 2024 00:44
@cat5inthecradle (Contributor):

I think this is a great candidate for structured cloudwatch logs, from which we could infer some metrics. We don't have this implemented yet, but it has popped up in a few conversations, including this week's issues with the Test server.

This PR looks good to me; just noting the loss of context that you were previously able to send to Honeybadger.

Are there any additional dimensions worth adding? I'm thinking maybe the name/url of the particular resource that timed out.
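
For illustration only, a minimal boto3 sketch of what attaching such a dimension to a retry metric could look like (the namespace, metric, and dimension names here are hypothetical, not what this PR actually uses):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical names; the PR's real namespace/metric names may differ.
cloudwatch.put_metric_data(
    Namespace="AiProxyBackend",
    MetricData=[
        {
            "MetricName": "RequestRetry",
            # A dimension identifying which aiproxy resource timed out.
            "Dimensions": [{"Name": "Endpoint", "Value": "assessment"}],
            "Value": 1,
            "Unit": "Count",
        }
    ],
)
```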

@davidsbailey (Member, Author):

@cat5inthecradle please tell me more about structured cloudwatch logs! I would love to log user id, script level id, etc. so that we can see if a certain project is failing repeatedly, though I guess we'll find that out another way if the project fails more times than the number of retries. Another wish is to send along error details from aiproxy and have them show up in these logs, so that we don't have to do forensics to match up this occurrence with the raw logs from aiproxy.

I'm open to adding more dimensions, but at the moment all of our traffic from this job is going to the assessment endpoint on aiproxy, so I'm not sure that particular piece of data adds anything.

@davidsbailey (Member, Author):

As an aside, New Relic has "custom metrics" and "custom events". Metrics are like any other metrics, and events can be searched with a SQL-like interface. I thought this worked really well, and wish we had something like it within AWS.

@davidsbailey (Member, Author):

Lastly... I'm actually tempted to log the error details to Firehose, since that implements exactly what I need.

@cat5inthecradle (Contributor):

Firehose gets the logs into S3, which makes them queryable via AWS Athena. You can do the same thing with CloudWatch Logs, and CloudWatch Logs also integrates more closely with metrics.

You use MetricFilters for that, something like this in CloudFormation:

  TimeoutMetricFilter:
    Type: "AWS::Logs::MetricFilter"
    Properties:
      LogGroupName: 
        Ref: TimeoutLogGroup
      FilterPattern: '{ $.event = "timeout" }'
      MetricTransformations:
        - MetricName: "TimeoutEventCount"
          MetricNamespace: "YourApplication"
          MetricValue: "1"
          DefaultValue: 0
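
For context, that FilterPattern matches structured (JSON) log events. A rough boto3 sketch of emitting one such event, with hypothetical group/stream/field names (including the user and script-level context mentioned above), might look like:

```python
import json
import time

import boto3

logs = boto3.client("logs")

# Hypothetical log group/stream names; both must already exist in CloudWatch Logs.
logs.put_log_events(
    logGroupName="/your-application/aiproxy",
    logStreamName="retries",
    logEvents=[
        {
            "timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "message": json.dumps(
                {
                    "event": "timeout",      # matched by the FilterPattern above
                    "user_id": 123,          # hypothetical context fields
                    "script_level_id": 456,
                }
            ),
        }
    ],
)
```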

Firehose might be a better intermediate step than Metrics because it allows for wider events. I guess I wasn't aware we were doing that on the backend though.

@davidsbailey (Member, Author):

> This PR looks good to me; just noting the loss of context that you were previously able to send to Honeybadger.

> Firehose might be a better intermediate step than Metrics because it allows for wider events.

Adding firehose logging in #58411, then I'll switch to writing to cloudwatch logs once that is working.

> I guess I wasn't aware we were doing that on the backend though.

Can you clarify what you mean by this last part?

@davidsbailey davidsbailey merged commit 7b14e41 into staging May 6, 2024
2 checks passed
@davidsbailey davidsbailey deleted the ai-rubrics-retry-cloudwatch branch May 6, 2024 16:43