-
Notifications
You must be signed in to change notification settings - Fork 481
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
log to cloudwatch instead of HB on aiproxy request retry #58354
Conversation
I think this is a great candidate for structured cloudwatch logs, from which we could infer some metrics. We don't have this implemented yet, but it has popped up in a few conversations, including this week's issues with the Test server. This PR looks good to me, just noting is the loss of context that you were previously able to send to Honeybadger. Are there any additional dimensions worth adding? I'm thinking maybe the name/url of the particular resource that timed out. |
@cat5inthecradle please tell me more about structured cloudwatch logs! I would love to log user id, script level id, etc so that we can see if a certain project is failing repeatedly, though I guess we'll find that out another way if the project fails more times than the number of retries. another wish is to send along error details from aiproxy and have those show up in these logs, so that we don't have to do forensics to match up this occurrence with the raw logs from aiproxy. I'm open to adding more dimensions, but at the moment all of our traffic from this job is going to the assessment endpoint on aiproxy so I'm not sure if that particular piece of data adds anything? |
as an aside, new relic has "custom metrics" and "custom events". metrics are like any other metrics, and events could be searched with a SQL-like interface. I thought this worked really well, and wish we had something like this within AWS |
lastly... I'm actually tempted to log the error details to firehose, since that actually implements exactly what I need |
Firehose gets the logs into S3 which makes them queryable via AWS Athena. You can do the same thing with Cloudwatch Logs. Cloudwatch Logs also integrate more closely with metrics. You use MetricFilters for that, something like this in Cloudformation: TimeoutMetricFilter:
Type: "AWS::Logs::MetricFilter"
Properties:
LogGroupName:
Ref: TimeoutLogGroup
FilterPattern: '{ $.event = "timeout" }'
MetricTransformations:
- MetricName: "TimeoutEventCount"
MetricNamespace: "YourApplication"
MetricValue: "1"
DefaultValue: 0 Firehose might be a better intermediate step than Metrics because it allows for wider events. I guess I wasn't aware we were doing that on the backend though. |
Adding firehose logging in #58411, then will switch to writing to cloudwatch logs once that is working.
Can you clarify what you mean by this last part? |
Finishes https://codedotorg.atlassian.net/browse/AITT-597. Some relevant background here: https://github.com/code-dot-org/code-dot-org/pull/54951/files#r1394758841
Testing story
new unit test coverage
Follow-up steps
resolve https://app.honeybadger.io/projects/3240/faults/106895906/01HVM50BD81SEVA9545BV6XN4Q after this PR ships (currently it is paused)