feat: add "flow logger" + instrument upload flow with it #47
giovanni-guidini
left a comment
I like this, so I'll leave my ✅, but I'm interested in other people's opinions too.
I don't know of any ready-made tools to do this, but I think it's a clean implementation.
tasks/notify.py
Outdated
    commitid: str,
    current_yaml=None,
    empty_upload=None,
    checkpoints=None,
Would this be an instance of CheckpointLogger? I believe all these task arguments need to be serializable in some way (not sure if JSON is used or something else) - so just wanted to make sure this was tested in some sort of E2E fashion.
ah great callout, i think the tests don't actually exercise passing this between tasks
i was at least planning to create a sentry project and try to get a dashboard populated end-to-end before merging, but i'll also see if i can write a proper integration test that passes this between tasks in CI
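The serialization concern raised in this thread can be illustrated with a small round-trip check. This is a sketch only: it assumes the task arguments go through a JSON serializer (the thread leaves that open), and the `CheckpointLogger` below is a hypothetical stand-in with made-up checkpoint names, not the class from this PR.

```python
import json


class CheckpointLogger:
    """Hypothetical stand-in, not the class from this PR."""

    def __init__(self, data=None):
        # Mapping of checkpoint name -> timestamp in milliseconds.
        self.data = dict(data or {})

    def log(self, name, timestamp_ms):
        self.data[name] = timestamp_ms
        return self


def roundtrip(kwargs):
    """Simulate what a JSON task serializer would do to task kwargs."""
    return json.loads(json.dumps(kwargs))


logger = CheckpointLogger().log("UPLOAD_TASK_BEGIN", 1000)

# Passing the object itself fails: json cannot encode arbitrary instances.
try:
    roundtrip({"checkpoints": logger})
    object_survives = True
except TypeError:
    object_survives = False

# Passing the plain dict works, and the receiving task can rebuild the logger.
received = roundtrip({"checkpoints": logger.data})
rebuilt = CheckpointLogger(received["checkpoints"]).log("NOTIFIED", 4500)
```

An integration test along these lines would catch the case where a non-serializable object is passed between tasks, which unit tests that call the task functions directly would not.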
Codecov Report
Changes have been made to critical files, which contain lines commonly executed in production.
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
- Coverage   98.59%   98.59%   -0.01%
==========================================
  Files         357      359       +2
  Lines       26004    26275     +271
==========================================
+ Hits        25639    25905     +266
- Misses        365      370       +5
made significant changes since review
@scott-codecov was right, the also got a dashboard created in a spare sentry org, proving the basic idea?
scott-codecov
left a comment
This looks really great @matt-codecov - excited to see it in action!
tl;dr:
- `CheckpointLogger` and its tests
- `UploadFlow` and the instrumentation of `UploadTask`, `UploadFinisherTask`, and `NotifyTask` (and tests)
- tested end-to-end, see this dashboard in sentry with widgets for each of the subflows created:

will see about creating a dashboard in the real Codecov org when this is deployed to staging
related issue: codecov/engineering-team#84
CheckpointLogger + UploadFlow
there are currently two things i want to measure that don't really fit in sentry transaction traces or statsd:
`UploadProcessorTask`s takes to run, median + p75 + p95
you can figure these out for specific samples in sentry, but it didn't seem like you can aggregate/dashboard them effectively. and statsd timers are very code-local and can't capture these high-level event intervals. so i'm playing with a new thing
the above snippet will record timestamps for A and B and then submit the interval to sentry as a custom performance metric on the current transaction with the name "a_to_b". in worker, the current transaction is (i think) the task you're in when you submit the subflow.
create a `CheckpointLogger` at the start of your flow and pass it as an argument to functions/tasks until you reach the end of your flow.
i've instrumented the upload flow and am hoping to make a dashboard in sentry with these metrics:
- `UploadTask`: `time_before_processing`, `initial_processing_duration`
- `UploadFinisherTask`: `batch_processing_duration`, `total_processing_duration`
- `NotifyTask`: `notification_latency`

`batch_processing_duration` is the "all the `UploadProcessorTask`s" metric and `notification_latency` is the end-to-end metric.

in the future, i am interested in a version of this that provides reliability metrics for these flows too. basically i want to add funnel analysis: of the N users who enter this flow (upload task), what % end in a success state (notifications sent), and what's the breakdown of where they fall out of the funnel (failed to process uploads, too many retries, no valid credentials)? let me know if you have any ideas :)
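To make the checkpoint idea above concrete, here is a minimal sketch of a logger that records timestamps at named checkpoints and reports the interval between two of them under a metric name. It is not the implementation in this PR: the class shape, the injected `clock` and `submit` callables, and the `UploadFlow` member names are assumptions for illustration. Only the metric names `time_before_processing` and `notification_latency` come from the description, and which checkpoints bound them is guessed.

```python
import time
from enum import Enum, auto


class UploadFlow(Enum):
    # Hypothetical checkpoint names, chosen for this sketch only.
    UPLOAD_TASK_BEGIN = auto()
    PROCESSING_BEGIN = auto()
    NOTIFIED = auto()


class CheckpointLogger:
    """Illustrative sketch, not the class added in this PR."""

    def __init__(self, flow, clock=time.time, submit=None):
        self.flow = flow
        self.clock = clock  # returns seconds; injectable for tests
        # In a real worker this callback would forward the value to the
        # metrics backend; here it defaults to doing nothing.
        self.submit = submit or (lambda name, duration_ms: None)
        self.timestamps = {}

    def log(self, checkpoint):
        if not isinstance(checkpoint, self.flow):
            raise ValueError(f"{checkpoint!r} is not part of {self.flow.__name__}")
        # Store milliseconds since the epoch for this checkpoint.
        self.timestamps[checkpoint] = self.clock() * 1000
        return self

    def submit_subflow(self, name, start, end):
        # The interval between two recorded checkpoints, in milliseconds.
        duration_ms = self.timestamps[end] - self.timestamps[start]
        self.submit(name, duration_ms)
        return duration_ms


# Usage with a fake clock so the intervals are deterministic.
ticks = iter([10.0, 12.5, 70.0])
submitted = {}
checkpoints = CheckpointLogger(
    UploadFlow, clock=lambda: next(ticks), submit=submitted.__setitem__
)
checkpoints.log(UploadFlow.UPLOAD_TASK_BEGIN)
checkpoints.log(UploadFlow.PROCESSING_BEGIN)
checkpoints.log(UploadFlow.NOTIFIED)
checkpoints.submit_subflow(
    "time_before_processing", UploadFlow.UPLOAD_TASK_BEGIN, UploadFlow.PROCESSING_BEGIN
)
checkpoints.submit_subflow(
    "notification_latency", UploadFlow.UPLOAD_TASK_BEGIN, UploadFlow.NOTIFIED
)
```

Because each interval is computed from wall-clock timestamps that may be taken on different hosts, the result is only as accurate as the clock agreement between those hosts.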
OPEN Q: is there anything i should know about clock synchronization between hosts in our deployments? the higher percentiles of these metrics are up there in the minutes so clock drift of a few seconds is probably fine, but i'm kind of hand-waving here
Legal Boilerplate
Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.