Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20136][SQL] Add num files and metadata operation timing to scan operator metrics #17465

Closed
wants to merge 1 commit into from

Conversation

rxin
Copy link
Contributor

@rxin rxin commented Mar 29, 2017

What changes were proposed in this pull request?

This patch adds explicit metadata operation timing and number of files in data source metrics. Those would be useful to include for performance profiling.

Screenshot of a UI with this change (num files and metadata time are new metrics):

screen shot 2017-03-29 at 12 29 28 am

How was this patch tested?

N/A

@rxin
Copy link
Contributor Author

rxin commented Mar 29, 2017

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a unit test for this?

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't take into account the time for the initial partition pruning, which is done earlier in logical planning. That's in the PruneFileSourcePartitions rule, so not sure how to connect it with this physical operator though.

@SparkQA
Copy link

SparkQA commented Mar 29, 2017

Test build #75349 has finished for PR 17465 at commit 9efc548.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-ionescu
Copy link
Contributor

LGTM

I think it's ok that it doesn't take into account partition pruning, since that's not really part of this physical operator. Another thing to note here is that listFiles() may often return from cache. We could consider exposing file cache hits & misses as another metric.

@rxin
Copy link
Contributor Author

rxin commented Mar 30, 2017

Let me merge this now. I will send a follow-up PR to take the logical planning time into account (otherwise in the vast majority of cases, i.e. pruned partitions, the metadata operation time will be approximately 0).

@rxin
Copy link
Contributor Author

rxin commented Mar 30, 2017

Merging in master.

@asfgit asfgit closed this in 6097788 Mar 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants