Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Apr 17, 2021

Add outputRows and sortTime metrics to SortExec.

Example output from Ballista:

SortExec { input: ProjectionExec { expr: [(Column { name: "l_shipmode" }, "l_shipmode"), (Column { name: "SUM(CASE WHEN 
  Metrics: sortTime=44444, outputRows=2

@github-actions
Copy link

@andygrove andygrove requested review from alamb and jorgecarleitao and removed request for jorgecarleitao April 17, 2021 16:07
@andygrove
Copy link
Member Author

@returnString I'd like to hear more about your idea to use atomics here .. would you be interested in creating a follow-up PR to switch over?

@codecov-commenter
Copy link

codecov-commenter commented Apr 17, 2021

Codecov Report

Merging #10078 (bf49dda) into master (7e3deb5) will increase coverage by 0.01%.
The diff coverage is 97.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #10078      +/-   ##
==========================================
+ Coverage   78.92%   78.93%   +0.01%     
==========================================
  Files         286      286              
  Lines       64728    64758      +30     
==========================================
+ Hits        51088    51119      +31     
+ Misses      13640    13639       -1     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/sort.rs 92.19% <96.96%> (+0.66%) ⬆️
...ust/datafusion/src/physical_plan/hash_aggregate.rs 84.62% <100.00%> (+0.31%) ⬆️
rust/datafusion/src/physical_plan/mod.rs 87.09% <100.00%> (+0.88%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e3deb5...bf49dda. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code looks ok except for the flatbuffer pin

As we develop the counter system I think it is worth considering if we can avoid the runtime string lookups, but I don't see that as a deal breaker.

Also, I agree that if someone (like @returnString ) has time to switch the Counters from using Mutex to using AtomicUsize or something the code will likely look much nicer

pub enum MetricType {
/// Simple counter
Counter,
/// Time in nanoseconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably help to explicitly mention if this was wall clock time, or cpu time . It looks like this PR saves wallclock time for sort

Both are probably interesting counters / metrics to eventually have

let result: Vec<RecordBatch> = collect(sort_exec).await?;
let result: Vec<RecordBatch> = collect(sort_exec.clone()).await?;
assert_eq!(sort_exec.metrics().get("outputRows").unwrap().value, 8);
assert_eq!(result.len(), 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also assert that the time counter was greater than zero?

Comment on lines +69 to +70
output_rows: SQLMetric::counter("outputRows"),
sort_time_nanos: SQLMetric::time_nanos("sortTime"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given Rust's focus on compile time type checking, what would you think about using typed counters rather than String keys?

So make the code look something like:

Suggested change
output_rows: SQLMetric::counter("outputRows"),
sort_time_nanos: SQLMetric::time_nanos("sortTime"),
output_rows: SQLMetric<OutputRows>::new(),
sort_time_nanos: SQLMetric<SortTime>::new(),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this imply adding an enum for the metrics? This might limit extensibility for users that want to add custom metrics.

@returnString
Copy link
Contributor

I've got a few ideas for follow-up PRs:

  • mutex => atomics as mentioned
  • adding higher-level wrappers like an RAII method for recording time spent on actions
  • seeing how we might integrate these metrics with something like Prometheus

Will try and carve out some time either today or tomorrow to write these up in more depth.

All in all, really excited about getting decent observability 😀

@alamb
Copy link
Contributor

alamb commented Apr 18, 2021

Thank you @returnString !

BTW we (well really @jacobmarble) has spent a non trivial amount of time getting prometheus metrics generated in IOx -- from that experience (and the substantial dependency chain it brings) I would suggest DataFusion focus on a self contained way to get the metrics from planning and executing, and then leave it up to users of DataFusion to connect that to Promethus (or whatever other metric provider they want)

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really cool!

One thing I am wondering for the sake of keeping statistics for things like adaptive/dynamic query optimization is that maybe we should make some metrics / stats more static? The current "flexibility" by having them inside strings is good if for logging / debugging the metrics, but maybe would be nice if we can re-use them if we start having a dynamic way of changing the plan based on runtime statistics.

@andygrove
Copy link
Member Author

Thanks for the feedback @alamb @Dandandan @returnString. This code was enough to help me track down an issue and demonstrate the metrics capability but it would be good to collaborate on a better design for this. We also need a way to accumulate values for these metrics across a distributed query so that we see totals per operator when looking at query plans in the UI. I'll address the smaller points here and merge this since we're about to move to the new repo and will follow up this week with a new issue and design doc for metrics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants