[SPARK-41226][SQL] Refactor Spark types by introducing physical types #38750
Can one of the admins verify this patch?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala
case class PhysicalArrayType(elementType: DataType, containsNull: Boolean)
  extends PhysicalDataType {}

case class PhysicalBinaryType() extends PhysicalDataType {}
should they be scala object?
I might be misunderstanding how Java and Scala classes work, but these were left as case class instead of object because of the instanceof matching in the SpecializedGettersReader.java, ColumnarBatchRow.java, and ColumnarRow.java files, which need these types to be classes instead of objects.
oh I see, scala object is not very java friendly.
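To make the class-versus-object point above concrete, here is a minimal sketch. The names `JavaInteropSketch`, `AsClass`, `AsObject`, and `compiledName` are hypothetical stand-ins, not code from this PR. It relies only on the standard Scala behaviour that an `object` compiles to a JVM class whose name ends in `$`, which is part of why Java-side `instanceof` checks are more natural against a class.

```scala
object JavaInteropSketch {
  // Hypothetical stand-ins for illustration only; not the Spark definitions.
  abstract class PhysicalDataType

  // Declared as a case class: Java code can write `x instanceof AsClass`.
  case class AsClass() extends PhysicalDataType

  // Declared as a bare object: the compiled JVM class name carries a
  // trailing `$`, and Java reaches the instance via the static MODULE$ field.
  case object AsObject extends PhysicalDataType

  // Returns the JVM class name of a value, to make the difference visible.
  def compiledName(x: AnyRef): String = x.getClass.getName
}
```

Calling `JavaInteropSketch.compiledName` on `AsObject` gives a name ending in `AsObject$`, whereas an `AsClass()` instance reports a name ending in `AsClass`.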
nice refactor! We should have done this earlier, before adding ansi interval types and timestamp ntz. Now we should have more confidence in these new data types.
Thanks for the suggestions @cloud-fan! I removed
case class PhysicalArrayType(elementType: DataType, containsNull: Boolean) extends PhysicalDataType

case class PhysicalBinaryType() extends PhysicalDataType
for physical types without parameters, shall we follow logical type and have both class and object?
class LongType private() ...
case object LongType extends LongType
The benefit is: it's a singleton and we can save memory usage. The matching code in Scala can be
if (dt == PhysicalBinaryType)..
pdt match {
case PhysicalBinaryType => ...
}
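The class-plus-object pattern suggested in this comment can be sketched as follows. This is only an illustration under assumptions: `SingletonTypeSketch`, `describe`, and `isBinaryClass` are hypothetical names, and the type definitions are simplified stand-ins rather than the code that was merged.

```scala
object SingletonTypeSketch {
  // Hypothetical, simplified hierarchy; not the actual Spark definitions.
  abstract class PhysicalDataType

  // Parameterless type: a class with a private constructor plus a case object
  // extending it, mirroring `class LongType private()` / `case object LongType`.
  class PhysicalBinaryType private () extends PhysicalDataType
  case object PhysicalBinaryType extends PhysicalBinaryType

  // Parameterized types stay ordinary case classes.
  case class PhysicalArrayType(elementType: PhysicalDataType, containsNull: Boolean)
    extends PhysicalDataType

  // Scala-side matching uses the singleton directly.
  def describe(pdt: PhysicalDataType): String = pdt match {
    case PhysicalBinaryType                 => "binary"
    case PhysicalArrayType(_, containsNull) => s"array(containsNull=$containsNull)"
    case _                                  => "other"
  }

  // The singleton is still an instance of the class, so a class-based
  // check (such as Java's instanceof) keeps working.
  def isBinaryClass(pdt: PhysicalDataType): Boolean =
    pdt.isInstanceOf[PhysicalBinaryType]
}
```

With this shape there is one shared `PhysicalBinaryType` instance, `dt == PhysicalBinaryType` works in Scala, and class-based checks remain possible because the object extends the class.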
Nice suggestion! Just made the changes @cloud-fan
LGTM except for one comment.
thanks, merging to master!
@cloud-fan @desmondcheongzx are the following scenarios currently unsuitable for using physical types? spark/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java, lines 170 to 172 in 3fc8a90
good point, I think we should use physical type there. We should probably find all the usages of
OK, let me find them as comprehensively as possible |
### What changes were proposed in this pull request?

Refactor Spark types by introducing physical types. Multiple logical types match to the same physical type, for example `DateType` and `YearMonthIntervalType` are both implemented using `IntegerType`. Since this is the case, we can simplify case matching logic on Spark types by matching their physical types rather than listing all possible logical types.

### Why are the changes needed?

These changes simplify the Spark type system.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Since this code is a refactor of existing code, we rely on existing tests.

Closes apache#38750 from desmondcheongzx/refactor-using-physical-types.

Authored-by: Desmond Cheong <desmond.cheong@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
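The idea described in the commit message, matching on a physical type instead of listing every logical type, might look roughly like the sketch below. Every name here (`PhysicalTypeSketch`, `physicalTypeOf`, `widthInBytes`, and the simplified type objects) is a hypothetical stand-in chosen for illustration, and the `TimestampType` entry is an added example beyond the two types the description names.

```scala
object PhysicalTypeSketch {
  // Hypothetical, simplified logical types (not Spark's real classes).
  sealed trait LogicalType
  case object IntegerType extends LogicalType
  case object DateType extends LogicalType
  case object YearMonthIntervalType extends LogicalType
  case object LongType extends LogicalType
  case object TimestampType extends LogicalType

  // Hypothetical physical types describing the underlying storage.
  sealed trait PhysicalType
  case object PhysicalIntegerType extends PhysicalType
  case object PhysicalLongType extends PhysicalType

  // The many-to-one mapping lives in exactly one place.
  def physicalTypeOf(dt: LogicalType): PhysicalType = dt match {
    case IntegerType | DateType | YearMonthIntervalType => PhysicalIntegerType
    case LongType | TimestampType                       => PhysicalLongType
  }

  // Downstream code matches on the physical type only, so a new logical
  // type backed by an int needs no change here.
  def widthInBytes(dt: LogicalType): Int = physicalTypeOf(dt) match {
    case PhysicalIntegerType => 4
    case PhysicalLongType    => 8
  }
}
```

In this sketch, `widthInBytes(DateType)` and `widthInBytes(YearMonthIntervalType)` both resolve through `PhysicalIntegerType`, which is the simplification the description is after.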