[SPARK-41358][SQL] Refactor `ColumnVectorUtils#populate` method to use `PhysicalDataType` instead of `DataType` #38873
Conversation
@cloud-fan @desmondcheongzx I created an umbrella ticket SPARK-41356 and added 2 possible initial tasks; I will add subtasks gradually in the future. If you have time, please help review the first one. Thanks ~
 *
 * @since 3.4.0
 */
public class PhysicalDataTypes {
Is this really needed? `PhysicalDataType` is an internal API, and we don't need to make it Java-friendly. We can write `dt instanceof PhysicalStringType`.
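For illustration, the `instanceof`-based dispatch the reviewer suggests could look like the following minimal, self-contained sketch. The `PhysicalStringType`/`PhysicalIntegerType` classes here are stand-ins written for this example, not Spark's real internal types:

```java
// Stand-in classes mimicking Spark's internal physical type hierarchy.
interface PhysicalDataType {}
final class PhysicalStringType implements PhysicalDataType {}
final class PhysicalIntegerType implements PhysicalDataType {}

public class InstanceofDispatch {
    // Dispatch on the physical type with plain instanceof checks,
    // so no Java-friendly helper class of constants is needed.
    static String describe(PhysicalDataType dt) {
        if (dt instanceof PhysicalStringType) {
            return "string";
        } else if (dt instanceof PhysicalIntegerType) {
            return "int";
        }
        throw new UnsupportedOperationException(dt.toString());
    }

    public static void main(String[] args) {
        System.out.println(describe(new PhysicalStringType()));
        System.out.println(describe(new PhysicalIntegerType()));
    }
}
```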
For example, in the following scenario:
spark/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
Lines 79 to 95 in 3fc8a90
static {
  mutableFieldTypes = Collections.unmodifiableSet(
    new HashSet<>(
      Arrays.asList(
        NullType,
        BooleanType,
        ByteType,
        ShortType,
        IntegerType,
        LongType,
        FloatType,
        DoubleType,
        DateType,
        TimestampType,
        TimestampNTZType
      )));
}
If we replace `DataType` with `PhysicalDataType`, it becomes:
mutableFieldTypes = Collections.unmodifiableSet(
new HashSet<>(
Arrays.asList(
PhysicalNullType$.MODULE$,
PhysicalBooleanType$.MODULE$,
PhysicalByteType$.MODULE$,
PhysicalShortType$.MODULE$,
PhysicalIntegerType$.MODULE$,
PhysicalLongType$.MODULE$,
PhysicalFloatType$.MODULE$,
PhysicalDoubleType$.MODULE$
)));
Does it look ugly?
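For context on the `$.MODULE$` references above: that is how Java code refers to the singleton instance of a Scala `object`. A Java-only sketch of the same pattern (these classes are hand-written stand-ins for what scalac generates, not Spark's actual types):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Each stand-in mimics the class scalac generates for a Scala `object`:
// a final class with a static MODULE$ field holding the sole instance.
final class PhysicalNullType$ {
    public static final PhysicalNullType$ MODULE$ = new PhysicalNullType$();
    private PhysicalNullType$() {}
}

final class PhysicalBooleanType$ {
    public static final PhysicalBooleanType$ MODULE$ = new PhysicalBooleanType$();
    private PhysicalBooleanType$() {}
}

public class MutableTypes {
    // Singleton instances compare by identity, so a HashSet lookup works.
    static final Set<Object> mutableFieldTypes = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList(
            PhysicalNullType$.MODULE$,
            PhysicalBooleanType$.MODULE$)));

    public static void main(String[] args) {
        System.out.println(mutableFieldTypes.contains(PhysicalNullType$.MODULE$));
    }
}
```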
If it's not ugly, I think `PhysicalDataTypes` can be deleted.
It's OK. This is internal code anyway.
OK, let me remove `PhysicalDataTypes` and change to use `instanceof`.
done
} else if (t == DataTypes.LongType) {
} else if (pdt instanceof PhysicalIntegerType) {
  if (o instanceof Date) {
    dst.appendInt(DateTimeUtils.fromJavaDate((Date) o));
This is a bit weird. Why does the caller side pass `Date` for the date type but `Long` for the timestamp type?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
Lines 170 to 174 in 5d90f98
 * DateType -> java.sql.Date if spark.sql.datetime.java8API.enabled is false
 * DateType -> java.time.LocalDate if spark.sql.datetime.java8API.enabled is true
 *
 * TimestampType -> java.sql.Timestamp if spark.sql.datetime.java8API.enabled is false
 * TimestampType -> java.time.Instant if spark.sql.datetime.java8API.enabled is true
Besides `java.sql.Timestamp`, I think we should also handle the cases where `o` is a `java.time.LocalDate` or `java.time.Instant`, per the documented mapping. There seem to be no tests covering these branches; I will add some tests later.
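A hedged sketch of what handling the Java 8 time types could look like, assuming Spark's usual physical encodings (dates as epoch days in an `int`, timestamps as epoch microseconds in a `long`). The method names here are illustrative and self-contained, not Spark's actual `DateTimeUtils` API:

```java
import java.time.Instant;
import java.time.LocalDate;

public class ExternalToPhysical {
    // Date physical value: days since 1970-01-01, as an int.
    static int localDateToDays(LocalDate d) {
        return (int) d.toEpochDay();
    }

    // Timestamp physical value: microseconds since the epoch, as a long.
    static long instantToMicros(Instant i) {
        return Math.multiplyExact(i.getEpochSecond(), 1_000_000L)
            + i.getNano() / 1_000L;
    }

    public static void main(String[] args) {
        System.out.println(localDateToDays(LocalDate.of(1970, 1, 2)));
        System.out.println(instantToMicros(Instant.ofEpochSecond(1, 500_000)));
    }
}
```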
@@ -125,32 +125,45 @@ public static Map<Integer, Integer> toJavaIntMap(ColumnarMap map) {
}

private static void appendValue(WritableColumnVector dst, DataType t, Object o) {
If this function is to handle external types (Row), I think matching logical data type makes more sense, as external types are bound to logical types.
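To illustrate the reviewer's point with a hypothetical standalone example: one physical representation (a `long`) can back several logical types, so dispatching on the logical type is what disambiguates the external value. The enum and method below are invented for this sketch, not Spark code:

```java
import java.sql.Timestamp;

public class LogicalDispatch {
    // Hypothetical logical-type tags standing in for Spark's DataType objects.
    enum LogicalType { LONG, TIMESTAMP }

    // Both logical types are physically a long, but the external (Row-level)
    // value differs: a plain Long vs. a java.sql.Timestamp.
    static long toPhysicalLong(LogicalType t, Object o) {
        switch (t) {
            case LONG:
                return (Long) o;
            case TIMESTAMP:
                Timestamp ts = (Timestamp) o;
                // Convert to microseconds since the epoch: whole milliseconds
                // from getTime(), plus the sub-millisecond part of the nanos.
                return ts.getTime() * 1_000L
                    + (ts.getNanos() % 1_000_000) / 1_000;
            default:
                throw new UnsupportedOperationException(String.valueOf(t));
        }
    }

    public static void main(String[] args) {
        System.out.println(toPhysicalLong(LogicalType.LONG, 42L));
        System.out.println(toPhysicalLong(LogicalType.TIMESTAMP, new Timestamp(1_000L)));
    }
}
```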
OK, I will revert the change to this method, but there may be 2 problems:
- TimestampType is not handled and will throw `UnsupportedOperationException` directly
- the `appendValue` method does not handle `MapType` yet

It seems another PR is needed to solve these?
Yea, this is an existing bug and should be fixed separately.
> Yea, this is an existing bug and should be fixed separately.

Thanks @cloud-fan, I will follow up on the remaining bug fix later.
`ColumnVectorUtils#toBatch` and the related `appendValue` methods will only be used by test cases.
GA passed
thanks, merging to master!
…e `PhysicalDataType` instead of `DataType`

### What changes were proposed in this pull request?
The main change of this pr is refactor `ColumnVectorUtils#populate` method to use `PhysicalDataType` instead of `DataType`.

### Why are the changes needed?
Simplify type match.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing ColumnVectorUtilsSuite

Closes apache#38873 from LuciferYang/SPARK-41358.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>