
[SPARK-41358][SQL] Refactor ColumnVectorUtils#populate method to use PhysicalDataType instead of DataType #38873

Closed
wants to merge 5 commits

Conversation

LuciferYang
Contributor

@LuciferYang LuciferYang commented Dec 2, 2022

What changes were proposed in this pull request?

The main change of this PR is to refactor the ColumnVectorUtils#populate method to use PhysicalDataType instead of DataType.

Why are the changes needed?

Simplify type match.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing ColumnVectorUtilsSuite

@github-actions github-actions bot added the SQL label Dec 2, 2022
@LuciferYang
Contributor Author

@cloud-fan @desmondcheongzx I created an umbrella ticket, SPARK-41356, added 2 possible initial tasks, and will add subtasks gradually in the future. If you have time, please help review the first one. Thanks ~

*
* @since 3.4.0
*/
public class PhysicalDataTypes {
Contributor

is this really needed? PhysicalDataType is an internal API, and we don't need to make it Java friendly. We can write dt instanceof PhysicalStringType.
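
The instanceof-based dispatch suggested here can be sketched as follows. Note the class names below are illustrative stand-ins that mimic Spark's internal PhysicalDataType hierarchy, not the real Spark classes:

```java
// Illustrative sketch only: hypothetical stand-ins for Spark's internal
// PhysicalDataType family, showing that plain instanceof checks are
// Java-friendly enough without a separate PhysicalDataTypes holder class.
public class InstanceofDispatchSketch {
  interface PhysicalDataType {}
  static final class PhysicalIntegerType implements PhysicalDataType {}
  static final class PhysicalStringType implements PhysicalDataType {}

  // Dispatch on the physical type with instanceof instead of comparing
  // against shared singleton constants.
  static String describe(PhysicalDataType dt) {
    if (dt instanceof PhysicalIntegerType) {
      return "int";
    } else if (dt instanceof PhysicalStringType) {
      return "string";
    }
    throw new UnsupportedOperationException("unsupported type: " + dt);
  }
}
```

From Java call sites this reads almost identically to the Scala pattern match, which is the reviewer's point: no extra Java-facing constant holder is needed.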

Contributor Author

For example, in the following scenario:

static {
  mutableFieldTypes = Collections.unmodifiableSet(
    new HashSet<>(
      Arrays.asList(
        NullType,
        BooleanType,
        ByteType,
        ShortType,
        IntegerType,
        LongType,
        FloatType,
        DoubleType,
        DateType,
        TimestampType,
        TimestampNTZType
      )));
}

if we replace DataType with PhysicalDataType, it becomes:

 mutableFieldTypes = Collections.unmodifiableSet(
      new HashSet<>(
        Arrays.asList(
          PhysicalNullType$.MODULE$,
          PhysicalBooleanType$.MODULE$,
          PhysicalByteType$.MODULE$,
          PhysicalShortType$.MODULE$,
          PhysicalIntegerType$.MODULE$,
          PhysicalLongType$.MODULE$,
          PhysicalFloatType$.MODULE$,
          PhysicalDoubleType$.MODULE$
        )));

Doesn't it look ugly?

Contributor Author

If that doesn't look ugly, I think PhysicalDataTypes can be deleted

Contributor

It's OK. This is internal code anyway.

Contributor Author

OK, let me remove PhysicalDataTypes and change it to use instanceof

Contributor Author

done

} else if (t == DataTypes.LongType) {
} else if (pdt instanceof PhysicalIntegerType) {
if (o instanceof Date) {
dst.appendInt(DateTimeUtils.fromJavaDate((Date) o));
Contributor

This is a bit weird. Why the caller side passes Date for date type but Long for timestamp type?

Contributor Author

* DateType -> java.sql.Date if spark.sql.datetime.java8API.enabled is false
* DateType -> java.time.LocalDate if spark.sql.datetime.java8API.enabled is true
*
* TimestampType -> java.sql.Timestamp if spark.sql.datetime.java8API.enabled is false
* TimestampType -> java.time.Instant if spark.sql.datetime.java8API.enabled is true

Besides java.sql.Timestamp, I think we should also handle the cases where o is java.time.LocalDate or java.time.Instant, per the documentation above?

There are currently no tests covering these branches; I will add some tests later.
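
For context, Spark stores DateType internally as days since the Unix epoch and TimestampType as microseconds since the Unix epoch. A hypothetical sketch of handling both the legacy and the Java 8 external types (the method names here are mine; the real conversions live in Spark's DateTimeUtils, e.g. fromJavaDate, localDateToDays, instantToMicros) could look like:

```java
import java.time.Instant;
import java.time.LocalDate;

// Hypothetical sketch: convert external date/timestamp objects to Spark's
// internal representations (days / microseconds since the Unix epoch).
// Illustrative only, not Spark's actual implementation.
public class ExternalTypeSketch {
  // Handles both java.sql.Date (java8API disabled) and
  // java.time.LocalDate (java8API enabled).
  static int toEpochDays(Object o) {
    if (o instanceof java.sql.Date) {
      return (int) ((java.sql.Date) o).toLocalDate().toEpochDay();
    } else if (o instanceof LocalDate) {
      return (int) ((LocalDate) o).toEpochDay();
    }
    throw new UnsupportedOperationException("not a date: " + o);
  }

  // Handles both java.sql.Timestamp and java.time.Instant.
  static long toEpochMicros(Object o) {
    Instant i;
    if (o instanceof java.sql.Timestamp) {
      i = ((java.sql.Timestamp) o).toInstant();
    } else if (o instanceof Instant) {
      i = (Instant) o;
    } else {
      throw new UnsupportedOperationException("not a timestamp: " + o);
    }
    return i.getEpochSecond() * 1_000_000L + i.getNano() / 1_000;
  }
}
```

This mirrors the shape of the branches the author proposes adding; the real code would dispatch based on the spark.sql.datetime.java8API.enabled external-type bindings quoted above.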

@@ -125,32 +125,45 @@ public static Map<Integer, Integer> toJavaIntMap(ColumnarMap map) {
}

private static void appendValue(WritableColumnVector dst, DataType t, Object o) {
Contributor

If this function is to handle external types (Row), I think matching logical data type makes more sense, as external types are bound to logical types.
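
The reviewer's point is that each external (Row-level) Java type is bound to a logical Spark SQL type, not a physical one. A hypothetical lookup illustrating that binding (the function is mine; the type mapping reflects Spark's documented Row bindings):

```java
import java.math.BigDecimal;

// Hypothetical illustration: external Java objects map to logical Spark
// SQL types, which is why a function handling Row values should keep
// matching on logical DataType rather than PhysicalDataType.
public class LogicalBindingSketch {
  static String logicalTypeFor(Object o) {
    if (o instanceof Integer) return "IntegerType";
    if (o instanceof Long) return "LongType";
    if (o instanceof String) return "StringType";
    if (o instanceof BigDecimal) return "DecimalType";
    if (o instanceof java.sql.Date || o instanceof java.time.LocalDate) return "DateType";
    if (o instanceof java.sql.Timestamp || o instanceof java.time.Instant) return "TimestampType";
    throw new UnsupportedOperationException("unmapped external type: " + o.getClass());
  }
}
```

Physical types, by contrast, collapse several logical types (e.g. DateType and IntegerType both store as int), so matching on them loses the information needed to interpret external objects.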

Contributor Author

OK, I will revert the change to this method, but there may be 2 problems:

  • TimestampType is not handled, so an UnsupportedOperationException is thrown directly
  • the appendValue method does not handle MapType yet

It seems another PR is needed to solve these?

Contributor

Yea, this is an existing bug and should be fixed separately.

Contributor Author

> Yea, this is an existing bug and should be fixed separately.

thanks @cloud-fan, I will follow up on the remaining bug fix later

Contributor Author

ColumnVectorUtils#toBatch and the related appendValue methods are only used by test cases

@LuciferYang LuciferYang changed the title [SPARK-41358][SQL] Refactor ColumnVectorUtils to use PhysicalDataType instead of DataType [SPARK-41358][SQL] Refactor ColumnVectorUtils#populate method to use PhysicalDataType instead of DataType Dec 2, 2022
@LuciferYang
Contributor Author

GA passed

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 32554a0 Dec 5, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…e `PhysicalDataType` instead of `DataType`

### What changes were proposed in this pull request?
The main change of this PR is to refactor the `ColumnVectorUtils#populate` method to use `PhysicalDataType` instead of `DataType`.

### Why are the changes needed?
Simplify type match.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing ColumnVectorUtilsSuite

Closes apache#38873 from LuciferYang/SPARK-41358.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>