ARROW-12432: [Rust] [DataFusion] Add metrics to SortExec #10078
Conversation
@returnString I'd like to hear more about your idea to use atomics here. Would you be interested in creating a follow-up PR to switch over?
(branch force-pushed from 39106cf to 0e375f2)
Codecov Report

```
@@            Coverage Diff             @@
##           master   #10078      +/-   ##
==========================================
+ Coverage   78.92%   78.93%   +0.01%
==========================================
  Files         286      286
  Lines       64728    64758      +30
==========================================
+ Hits        51088    51119      +31
+ Misses      13640    13639       -1
```

Continue to review full report at Codecov.
alamb
left a comment
I think the code looks OK except for the flatbuffer pin.
As we develop the counter system I think it is worth considering whether we can avoid the runtime string lookups, but I don't see that as a deal breaker.
Also, I agree that if someone (like @returnString) has time to switch the counters from using Mutex to using AtomicUsize or similar, the code will likely look much nicer.
```rust
pub enum MetricType {
    /// Simple counter
    Counter,
    /// Time in nanoseconds
```
It would probably help to explicitly mention whether this is wall clock time or CPU time. It looks like this PR records wall clock time for sort.
Both are probably interesting counters / metrics to eventually have.
```diff
-let result: Vec<RecordBatch> = collect(sort_exec).await?;
+let result: Vec<RecordBatch> = collect(sort_exec.clone()).await?;
 assert_eq!(sort_exec.metrics().get("outputRows").unwrap().value, 8);
 assert_eq!(result.len(), 1);
```
Maybe also assert that the time counter was greater than zero?
```rust
output_rows: SQLMetric::counter("outputRows"),
sort_time_nanos: SQLMetric::time_nanos("sortTime"),
```
Given Rust's focus on compile-time type checking, what would you think about using typed counters rather than String keys?
So make the code look something like:

```diff
-output_rows: SQLMetric::counter("outputRows"),
-sort_time_nanos: SQLMetric::time_nanos("sortTime"),
+output_rows: SQLMetric<OutputRows>::new(),
+sort_time_nanos: SQLMetric<SortTime>::new(),
```
Does this imply adding an enum for the metrics? This might limit extensibility for users that want to add custom metrics.
I've got a few ideas for follow-up PRs:
Will try and carve out some time either today or tomorrow to write these up in more depth. All in all, really excited about getting decent observability 😀
Thank you @returnString! BTW we (well, really @jacobmarble) have spent a non-trivial amount of time getting Prometheus metrics generated in IOx. From that experience (and the substantial dependency chain it brings) I would suggest DataFusion focus on a self-contained way to get the metrics from planning and executing, and then leave it up to users of
Dandandan
left a comment
Looking really cool!
One thing I am wondering, for the sake of keeping statistics for things like adaptive/dynamic query optimization: maybe we should make some metrics / stats more static? The current "flexibility" of identifying them by strings is good for logging / debugging the metrics, but it would be nice if we could re-use them once we have a dynamic way of changing the plan based on runtime statistics.
Thanks for the feedback @alamb @Dandandan @returnString. This code was enough to help me track down an issue and demonstrate the metrics capability, but it would be good to collaborate on a better design for this. We also need a way to accumulate values for these metrics across a distributed query, so that we see totals per operator when looking at query plans in the UI. I'll address the smaller points here and merge this, since we're about to move to the new repo, and will follow up this week with a new issue and design doc for metrics.
Add outputRows and sortTime metrics to SortExec. Example output from Ballista: