
[SPARK-41226][SQL] Refactor Spark types by introducing physical types #38750

Conversation

Contributor

@desmondcheongzx commented Nov 22, 2022

What changes were proposed in this pull request?

Refactor Spark types by introducing physical types. Multiple logical types map to the same physical type; for example, DateType and YearMonthIntervalType are both physically represented as IntegerType. Given this, we can simplify case-matching logic on Spark types by matching on their physical types rather than listing all possible logical types.
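The idea can be sketched as follows. This is a minimal, illustrative sketch only; the names are simplified and are not the exact Spark API:

```scala
// Illustrative sketch (not the exact Spark API): several logical types
// share one physical representation, so matching code can dispatch on
// the physical type instead of enumerating every logical type.
sealed trait PhysicalDataType
case object PhysicalIntegerType extends PhysicalDataType
case object PhysicalLongType extends PhysicalDataType

sealed trait LogicalType { def physical: PhysicalDataType }
case object DateType extends LogicalType { val physical = PhysicalIntegerType }
case object YearMonthIntervalType extends LogicalType { val physical = PhysicalIntegerType }
case object TimestampType extends LogicalType { val physical = PhysicalLongType }

// One case per physical type covers all logical types that share it:
def sizeInBytes(t: LogicalType): Int = t.physical match {
  case PhysicalIntegerType => 4
  case PhysicalLongType    => 8
}
```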

Why are the changes needed?

These changes simplify the Spark type system.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Since this code is a refactor of existing code, we rely on existing tests.

@github-actions bot added the SQL label Nov 22, 2022
@desmondcheongzx changed the title from "Refactor by introducing physical types" to "Refactor Spark types by introducing physical types" Nov 22, 2022
@desmondcheongzx changed the title from "Refactor Spark types by introducing physical types" to "[SPARK-41226][SQL] Refactor Spark types by introducing physical types" Nov 22, 2022
@AmplabJenkins

Can one of the admins verify this patch?

case class PhysicalArrayType(elementType: DataType, containsNull: Boolean)
  extends PhysicalDataType {}

case class PhysicalBinaryType() extends PhysicalDataType {}
Contributor

Should they be Scala objects?

Contributor Author

I might be misunderstanding how Java and Scala classes work, but these were left as case class instead of object because of the instanceof matching in the SpecializedGettersReader.java, ColumnarBatchRow.java, and ColumnarRow.java files, which needs these types to be classes rather than objects.

Contributor

Oh I see, a Scala object is not very Java-friendly.
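The interop issue behind this exchange can be sketched as follows (a simplified illustration, not Spark code):

```scala
// Sketch of the interop issue: a Scala `object` compiles to a JVM class
// named `PhysicalBinaryType$` holding a static `MODULE$` field, so Java
// code cannot write `x instanceof PhysicalBinaryType` against it; it
// would need the mangled `PhysicalBinaryType$` name or an equality check
// against `PhysicalBinaryType$.MODULE$`.
object PhysicalBinaryType

// The runtime class name carries the trailing `$` of the module class.
val jvmClassName = PhysicalBinaryType.getClass.getName
```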

@cloud-fan
Contributor

Nice refactor! We should have done this earlier, before adding the ANSI interval types and timestamp NTZ. Now we can have more confidence in these new data types.

@cloud-fan
Contributor

cc @MaxGekk @gengliangwang

@desmondcheongzx
Contributor Author

Thanks for the suggestions @cloud-fan! I removed PhysicalObjectType and added comments to the PR.


case class PhysicalArrayType(elementType: DataType, containsNull: Boolean) extends PhysicalDataType

case class PhysicalBinaryType() extends PhysicalDataType
Contributor

For physical types without parameters, shall we follow the logical types and have both a class and an object?

class LongType private() ...
case object LongType extends LongType

The benefit is that each type is a singleton, so we save memory. The matching code in Scala can then be:

if (dt == PhysicalBinaryType) ...

pdt match {
  case PhysicalBinaryType => ...
}
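The suggested pattern can be sketched like this (a minimal sketch, simplified from the actual Spark definitions, which carry more members):

```scala
// Class + case object pattern: the private-constructor class gives Java
// code a plain type to use with `instanceof`, while the companion case
// object is a singleton that Scala code can match on directly.
trait PhysicalDataType
class PhysicalBinaryType private() extends PhysicalDataType
case object PhysicalBinaryType extends PhysicalBinaryType

def describe(dt: PhysicalDataType): String = dt match {
  case PhysicalBinaryType => "binary" // singleton equality match, no allocation
  case _                  => "other"
}
```

Because the companion object extends the class, Java-side `instanceof PhysicalBinaryType` checks and Scala-side singleton matches both work against the same value.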

Contributor Author

Nice suggestion! Just made the changes, @cloud-fan.

Contributor

@cloud-fan left a comment

LGTM except for one comment.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan closed this in 3d59859 Dec 1, 2022
@LuciferYang
Contributor

@cloud-fan @desmondcheongzx are scenarios like the following currently unsuitable for using physical types?

} else if (t instanceof DateType) {
  dst.appendInt(DateTimeUtils.fromJavaDate((Date) o));
} else {
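A hypothetical sketch of what dispatching on the physical type could look like here (the names are illustrative, not the real writer API):

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: instead of checking the logical type with
// `instanceof DateType`, the append path could match on the physical
// type, which covers every logical type physically stored as an Int.
sealed trait PhysicalDataType
case object PhysicalIntegerType extends PhysicalDataType
case object PhysicalLongType extends PhysicalDataType

def appendValue(pt: PhysicalDataType, value: Any, dst: ArrayBuffer[Any]): Unit =
  pt match {
    // DateType, YearMonthIntervalType, IntegerType, ... all land here
    case PhysicalIntegerType => dst += value.asInstanceOf[Int]
    case PhysicalLongType    => dst += value.asInstanceOf[Long]
  }
```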

@cloud-fan
Contributor

Good point, I think we should use the physical type there. We should probably find all the usages of DateType in the codebase and see whether they need to use the physical type. @LuciferYang it would be great if you have time to help with this. Thanks in advance!

@LuciferYang
Contributor

OK, let me find them as comprehensively as possible.

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
Closes apache#38750 from desmondcheongzx/refactor-using-physical-types.

Authored-by: Desmond Cheong <desmond.cheong@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>