[SPARK-38999][SQL] Refactor FileSourceScanExec: file scan physical node #36327

Closed
wants to merge 3 commits into from


utkarsh39
Contributor

What changes were proposed in this pull request?

This PR refactors the FileSourceScanExec case class by extracting a base trait, FileSourceScanLike, which FileSourceScanExec then extends. FileSourceScanLike holds the basic functionality, such as metrics and file listing, while FileSourceScanExec keeps the execution-specific code.
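
To illustrate the shape of the split, here is a minimal, self-contained sketch. It does not use Spark's real classes: FileEntry, ScanLike, ScanExec, the catalog map, and the metric names are all hypothetical stand-ins chosen for the example, and the real trait in the PR has many more members.

```scala
// Simplified stand-ins; none of these are Spark's real classes.
final case class FileEntry(path: String, sizeInBytes: Long)

// Base trait: planning-side concerns (file listing, metrics), no execution.
trait ScanLike {
  def rootPaths: Seq[String]

  // Hypothetical listing hook; the real node lists files through its relation.
  protected def listFiles(root: String): Seq[FileEntry]

  // Computed once, on first access, and shared by metrics and execution.
  lazy val selectedFiles: Seq[FileEntry] = rootPaths.flatMap(listFiles)

  // Metrics that are known as soon as the file listing is done.
  lazy val staticMetrics: Map[String, Long] = Map(
    "numFiles" -> selectedFiles.size.toLong,
    "filesSize" -> selectedFiles.map(_.sizeInBytes).sum
  )
}

// Concrete node: adds the execution-specific part on top of the trait.
final case class ScanExec(
    rootPaths: Seq[String],
    catalog: Map[String, Seq[FileEntry]])
  extends ScanLike {

  protected def listFiles(root: String): Seq[FileEntry] =
    catalog.getOrElse(root, Seq.empty)

  // Stand-in for building an RDD: yields one "row" per selected file.
  def execute(): Iterator[String] = selectedFiles.iterator.map(_.path)
}
```

With this layout, code that only needs planning information (for example, metrics reporting) can depend on the trait alone, while anything that runs the scan uses the concrete case class.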

Why are the changes needed?

Currently, the code for the FileSourceScanExec class, the physical node for file scans, is complex and lengthy, which makes it difficult to reason about.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This is a pure code refactor, so existing tests should suffice.

@github-actions github-actions bot added the SQL label Apr 22, 2022
@utkarsh39
Contributor Author

@cloud-fan @gengliangwang Can you please help review this?

@AmplabJenkins

Can one of the admins verify this patch?

} ++ staticMetrics
}

lazy val inputRDD: RDD[InternalRow] = {
Member


I like this idea of separating planning/metrics stuff vs RDD/execution.

// Number of coalesced buckets.
def optionalNumCoalescedBuckets: Option[Int]
// Output attributes of the scan, including data attributes and partition attributes.
def output: Seq[Attribute]
Contributor


DataSourceScanExec extends LeafExecNode, and def output: Seq[Attribute] is already declared there.
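
A small sketch of why the redeclaration is redundant, using hypothetical stand-ins (Attr, LeafNode, DataSourceScan, FileScanLike, FileScan) rather than Spark's actual class hierarchy: an abstract member declared higher up is already visible in the sub-trait, so the sub-trait can use it without declaring it again.

```scala
// Hypothetical stand-ins for Attribute, LeafExecNode and DataSourceScanExec.
final case class Attr(name: String)

trait LeafNode {
  // Declared once, at the top of the hierarchy.
  def output: Seq[Attr]
}

trait DataSourceScan extends LeafNode

// No `def output` here: the abstract member is inherited from LeafNode.
trait FileScanLike extends DataSourceScan {
  def partitionFilters: Seq[String]

  // The inherited member can be used directly in the sub-trait.
  def describe: String = "scan(" + output.map(_.name).mkString(", ") + ")"
}

// The concrete class supplies both members through its constructor.
final case class FileScan(output: Seq[Attr], partitionFilters: Seq[String])
  extends FileScanLike
```

Dropping the duplicate declaration keeps the trait's member list limited to what it actually adds over its parent.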

// Output attributes of the scan, including data attributes and partition attributes.
def output: Seq[Attribute]
// Predicates to use for partition pruning.
def partitionFilters: Seq[Expression]
Contributor


nit: can we put it near dataFilters?

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8c80016 Apr 25, 2022
5 participants