
[FLINK-16996][table] Refactor planner and connectors to use new data structures#11925

Closed
wuchong wants to merge 9 commits into apache:master from wuchong:rowdata-planner

Conversation

@wuchong
Member

@wuchong wuchong commented Apr 27, 2020

What is the purpose of the change

Refactors existing code to use the new data structures interfaces.

Brief change log

The commits, in order:

  • [table-common] Add necessary methods to internal data structures
  • [table-common] Add binary implementations of internal data structures
  • [table-runtime-blink] Implement all the data structures and serializers around RowData
  • [table-runtime-blink] Remove legacy data formats (BaseRow)
  • [table-blink] Refactor planner and runtime to use new data structures
  • [python] Refactor pyflink to use new data structures
  • [parquet] Refactor parquet connector to use new data structures
  • [orc] Refactor ORC connector to use new data structures
  • [hive] Refactor Hive connector to use new data structures

Some notable changes:

  • In code generation, we hard-cast StringData to BinaryStringData. This makes it easy for the code generator to generate operations on strings. The same applies to RawValueData.
  • Most methods of Decimal have been moved to DecimalDataUtils, so I also updated the code generation logic.
  • Remove the RecordEqualiser#equalsWithoutHeader interface. This method is rarely used and can be replaced by RecordEqualiser#equals. This also avoids adding an equalsWithoutHeader method to the public API GenericRowData.
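The hard-cast pattern can be illustrated with a minimal sketch. These are hypothetical stand-in types, not the real Flink classes: generated code receives values typed against the interface but casts them to the binary implementation so it can call methods that only exist there.

```java
public class CastSketch {
    // Stand-in for the public StringData interface.
    public interface StringData {
        String toJavaString();
    }

    // Stand-in for the binary implementation, which exposes extra methods
    // that generated operators rely on after the hard cast.
    public static final class BinaryStringData implements StringData {
        private final String value;

        public BinaryStringData(String value) { this.value = value; }

        public String toJavaString() { return value; }

        // Method available only on the binary implementation.
        public int binaryCompare(BinaryStringData other) {
            return value.compareTo(other.value);
        }
    }

    // Mimics a generated expression: inputs are typed as StringData, but
    // the generated code casts to BinaryStringData to use binaryCompare.
    public static int generatedCompare(StringData a, StringData b) {
        return ((BinaryStringData) a).binaryCompare((BinaryStringData) b);
    }
}
```

The trade-off is that the generator may assume all runtime values are the binary implementation; this is discussed further down in the review thread.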

Verifying this change

This change is covered by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: yes
  • The runtime per-record code paths (performance sensitive): yes
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@wuchong wuchong requested a review from JingsongLi April 27, 2020 16:20
@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit fd62e8e (Mon Apr 27 16:23:50 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Apr 27, 2020

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@wuchong
Member Author

wuchong commented Apr 28, 2020

Hi @dianfu , could you help review the python part? RowData and ArrayData don't implement TypeGetterSetter, so I had to refactor the ArrowFieldWriter implementations.

Contributor

@KurtYoung KurtYoung left a comment

I've checked the runtime part, LGTM.

@@ -66,4 +69,20 @@ public static boolean byteArrayEquals(
return true;
}

public static String toString(RowData row, LogicalType[] types) {
Contributor

This is only used in tests, move it to tests?

str.ensureMaterialized();

if (precision > Decimal.MAX_LONG_DIGITS || str.getSizeInBytes() > Decimal.MAX_LONG_DIGITS) {
if (DecimalDataUtils.isByteArrayDecimal(precision) || DecimalDataUtils.isByteArrayDecimal(str.getSizeInBytes())) {
Contributor

Is this right? Why do we check both precision and sizeInBytes with the same method?

Member Author

cc @JingsongLi , do you know why we check both of them?

Contributor

@JingsongLi JingsongLi Apr 28, 2020

It is right.
The case is: the precision is 10 (less than MAX_LONG_DIGITS), but the string data may carry more than 10 digits of precision. We can convert via a big decimal, and then trim the extra decimals according to the target precision.
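A rough sketch of that conversion path using plain java.math.BigDecimal (not Flink's DecimalData; the method name here is illustrative): parse the full string first, trim the extra fraction digits by scale, then validate against the target precision.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalParseSketch {
    // Parse a numeric string whose digit count may exceed the target
    // precision: convert via BigDecimal, round away extra fraction digits,
    // then check that the result still fits the declared precision.
    public static BigDecimal parseWithPrecision(String s, int precision, int scale) {
        BigDecimal d = new BigDecimal(s).setScale(scale, RoundingMode.HALF_UP);
        if (d.precision() > precision) {
            throw new NumberFormatException("value exceeds precision " + precision);
        }
        return d;
    }
}
```

For example, a string like "3.14159265" has more digits than a DECIMAL(10, 2) target, but still parses cleanly once rounded to two fraction digits.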

* Converts a {@link MapData} into Java {@link Map}, the keys and values of the Java map
* still holds objects of internal data structures.
*/
public static Map<Object, Object> convertToJavaMap(
Contributor

only used by tests


@dianfu
Contributor

dianfu commented Apr 28, 2020

@wuchong The python part LGTM.

Contributor

@twalthr twalthr left a comment

Feedback for 38946ea.

if (i != 0) {
sb.append(",");
}
sb.append(StringUtils.arrayAwareToString(fields[i]));
Contributor

Use org.apache.flink.table.utils.EncodingUtils#objectToString
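For context, this is roughly what an "array-aware" toString does (a hypothetical mini-version; the suggestion above is to reuse the existing EncodingUtils#objectToString rather than hand-rolling it): arrays get element-wise formatting instead of Java's default type-name-plus-hash-code output.

```java
import java.util.Arrays;

public class ArrayAwareToString {
    // Format arrays element-wise; fall back to String.valueOf for
    // everything else (which also handles null).
    public static String format(Object o) {
        if (o instanceof Object[]) {
            return Arrays.deepToString((Object[]) o);
        }
        if (o instanceof int[]) {
            return Arrays.toString((int[]) o);
        }
        return String.valueOf(o);
    }
}
```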

*
* <p>By default the result type of an evaluation method is determined by Flink's type extraction
* facilities. Currently, only support {@link org.apache.flink.types.Row} and {@code BaseRow} as
* facilities. Currently, only support {@link org.apache.flink.types.Row} and {@code RowData} as
Contributor

use: {@link RowData}

case RAW:
return RawValueData.class;
default:
throw new UnsupportedOperationException("Not support type: " + type);
Contributor

nit: Unsupported type:

}

/**
* Get internal(sql engine execution data formats) conversion class for {@link LogicalType}.
Contributor

nit: Returns the conversion class for the given {@link LogicalType} that is used by the table runtime.

/**
* Get internal(sql engine execution data formats) conversion class for {@link LogicalType}.
*/
public static Class<?> internalConversionClass(LogicalType type) {
Contributor

nit: toInternalConversionClass

Contributor

@twalthr twalthr left a comment

Some feedback to 9517bc9.

What I don't like is that we have a lot of runtime code in flink-table-common now which means it is also available in the API. I'm wondering if we could at least hide some util classes by a default scope visibility. At least we should move all of those utilities to the binary package.

* Precision is not compact: can not call setNullAt when decimal is null, must call
* setDecimal(i, null, precision) because we need update var-length-part.
*/
void setDecimal(int i, DecimalData value, int precision);
Contributor

i => pos
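The compact-precision rule from the quoted JavaDoc can be sketched as follows. The threshold 18 matches the maximum decimal precision that fits into a single long; the constant name is assumed here, not taken from the PR.

```java
public class DecimalCompactSketch {
    // Decimals with precision up to 18 fit into one long and live in the
    // fixed-length part of a binary row, so a plain setNullAt is enough
    // for null values. Larger precisions are stored in the var-length
    // part, which must be updated even for nulls via
    // setDecimal(pos, null, precision).
    public static final int MAX_COMPACT_PRECISION = 18;

    public static boolean isCompact(int precision) {
        return precision <= MAX_COMPACT_PRECISION;
    }
}
```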

void setNullAt(int pos);

/**
* Set boolean value.
Contributor

Remove those JavaDocs. They are not useful and just make the code more complicated.

* <p>Note:
* Precision is compact: can call setNullAt when decimal is null.
* Precision is not compact: can not call setNullAt when decimal is null, must call
* setDecimal(i, null, precision) because we need update var-length-part.
Contributor

nit: use {@code } to format the JavaDoc

/**
* Binary format spanning {@link MemorySegment}s.
*/
public interface BinaryFormat {
Contributor

Missing @Internal

* <p>It can lazy the conversions as much as possible. It will be converted into required form
* only when it is needed.
*/
public abstract class LazyBinaryFormat<T> implements BinaryFormat {
Contributor

Missing @Internal

* Utilities for binary data segments which heavily uses {@link MemorySegment}.
*/
@Internal
public class BinarySegmentUtils {
Contributor

Put this under org.apache.flink.table.data.binary. Make the class final with a private default constructor.

* Murmur Hash. This is inspired by Guava's Murmur3_32HashFunction.
*/
@Internal
public final class MurmurHashUtils {
Contributor

put under org.apache.flink.table.data.binary add private default constructor

* Utilities for String UTF-8.
*/
@Internal
public class StringUtf8Utils {
Contributor

put under org.apache.flink.table.data.binary make final with private default constructor

* used on the binary format such as {@link BinaryRowData}.
*/
@Internal
public interface TypedSetters {
Contributor

If this is mainly used for binary formats, put it in the binary package.

@twalthr
Contributor

twalthr commented Apr 28, 2020

Thanks @wuchong for this massive PR. I took a look at the classes in table-common. Apart from my comments, they look good to me.

@wuchong
Member Author

wuchong commented Apr 28, 2020

Thanks @KurtYoung @dianfu @twalthr for the quick review. I have addressed all the comments.

Hi Timo, currently it's hard to make all classes under the binary package package-private, because they are used in serializers and code generation. But I think MurmurHashUtils and StringUtf8Utils can be, so I updated them.

@Internal
public final class NestedRowData extends BinarySection implements RowData, TypedSetters {

private static final long serialVersionUID = 1L;
Contributor

Remove it.


import java.io.Serializable;

/**
* Record equaliser for BaseRow which can compare two BaseRows and returns whether they are equal.
* Record equaliser for RowData which can compare two RowDatas and returns whether they are equal.
Contributor

nit: RowData has no plural "s" — use "RowData", not "RowDatas".

* Returns a term for representing the given class in Java code.
*/
def typeTerm(clazz: Class[_]): String = {
if (clazz == classOf[StringData]) {
Contributor

Could this put the code generator at risk of a missing cast?
If possible, I'd prefer to cast when accessing its methods.

Contributor

Considering RowData and the others, using StringData seems more reasonable to me.

Member Author

There are too many calls on BinaryStringData in code generation now; if we used StringData in code generation, we would have to refactor a lot of code, and we don't benefit much from that effort. I created FLINK-17437 to track this; we can refactor it in the future.

case None =>
val term = newName("typeSerializer")
val ser = InternalSerializers.create(t, new ExecutionConfig)
val ser = InternalSerializers.createInternalSerializer(t, new ExecutionConfig)
Contributor

createInternalSerializer is a little redundant

@JingsongLi
Contributor

There are still 100+ occurrences of BaseRow in the code (variable names and the like); you could rename them all.

@wuchong
Member Author

wuchong commented Apr 28, 2020

Thanks for the review @JingsongLi . I have addressed the comments and renamed the legacy field and method names that used BaseRow, SqlTimestamp and BinaryGeneric.

Contributor

@JingsongLi JingsongLi left a comment

Thanks @wuchong , looks good to me from my side.

@wuchong
Member Author

wuchong commented Apr 29, 2020

Thanks all for the review. I have rebased the commits. Will merge this once the build passes.

@JingsongLi
Contributor

Hi @wuchong , do you want to squash or not? My question is:
does every commit on master need to compile?

@wuchong
Member Author

wuchong commented Apr 29, 2020

Hi @JingsongLi , I want to keep the split commits. From my point of view, squashing such a large PR into one commit is not good.

@JingsongLi
Contributor

Hi @JingsongLi , I want to keep the split commits. From my point of view, squashing such a large PR into one commit is not good.

So the answer is that we may push commits with broken compilation.

@wuchong
Member Author

wuchong commented Apr 29, 2020

@JingsongLi , as far as I know, the community doesn't have a rule that every individual commit must pass the build, but there is a rule to split a big change into separate commits:

https://flink.apache.org/contributing/code-style-and-quality-pull-requests.html#separate-refactoring-cleanup-and-independent-changes

@wuchong
Member Author

wuchong commented Apr 29, 2020

@JingsongLi
Contributor

@wuchong go you.

@wuchong wuchong closed this in 2296487 Apr 29, 2020
wuchong added a commit that referenced this pull request Apr 29, 2020
@wuchong wuchong deleted the rowdata-planner branch April 29, 2020 06:16