
[FLINK-12588][python] Add TableSchema for Python Table API. #8561

Closed
wants to merge 8 commits into from

Conversation

WeiZhong94
Contributor

What is the purpose of the change

This pull request adds a TableSchema for the Python Table API. To this end, a _to_python_type function is introduced, which converts Java DataType and TypeInformation objects into Python DataType objects. To ensure that _to_python_type and the existing _to_java_type are mutual inverses, this PR also makes some changes to the Flink Python type system.

Brief change log

  • Add the TableSchema class.
  • Add get_schema method in Table class.
  • Add schema method in OldCsv and Schema class.
  • Add _to_python_type function.
  • Changed several default values and default behaviors of the current data types.

Verifying this change

Added integration tests in test_table_schema.py, test_schema_operation.py and test_types.py.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (python docs)

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Contributor

@dianfu dianfu left a comment


@WeiZhong94 Thanks a lot for the PR. Have left a few comments.


def get_field_types(self):
"""
Returns all field data types as an array.

This method is deprecated in Java, so there is no need to add it in Python.

else:
return None

def get_field_type(self, field):

Deprecated in Java, so we can remove it in Python.

A table schema that represents a table's structure with field names and data types.
"""

def __init__(self, field_names=None, data_types=None, java_object=None):

java_object -> j_table_schema


def get_field_data_types(self):
"""
Returns all field data types as an array.

as a list

Returns the specified data type for the given field index or field name.

:param field: The index of the field or the name of the field.
:return: The specified data type.

The data type of the specified field

logical_type = java_data_type.getLogicalType()
conversion_clz = java_data_type.getConversionClass()
if is_instance_of(logical_type, gateway.jvm.CharType):
python_type = DataTypes.CHAR(logical_type.getLength(), logical_type.isNullable())

python_type -> data_type

"currently." % java_data_type_input)
elif is_instance_of(logical_type, gateway.jvm.DayTimeIntervalType):
raise \
TypeError("Not supported type: %s, DayTimeIntervalType is not supported currently."

currently -> yet

python_type = DataTypes.TIME(logical_type.isNullable())
elif is_instance_of(logical_type, gateway.jvm.ZonedTimestampType):
raise \
TypeError("Not supported type: %s, ZonedTimestampType is not supported currently."

TypeError("ZonedTimestampType is not supported yet").

python_type = DataTypes.MULTISET(_to_python_type(element_type),
logical_type.isNullable())
else:
raise TypeError("Not supported colletion data type: %s" % java_data_type_input)

colletion -> collection


# Unrecognized type.
else:
TypeError("Unsupported data type: %s" % java_data_type_input)

What about changing all the error messages to "Unsupported data type: %s"?
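For illustration, the isinstance-dispatch pattern used by _to_python_type can be sketched without py4j. The classes below are local stand-ins for the JVM logical types reached through the gateway, not the actual pyflink implementation, and the uniform error message follows the suggestion above:

```python
# Simplified sketch of the isinstance-dispatch pattern used by
# _to_python_type. CharType and TimeType here are local stand-ins
# for the JVM logical-type classes reached through the py4j gateway.

class LogicalType:
    def __init__(self, nullable=True):
        self.nullable = nullable

    def isNullable(self):
        return self.nullable

class CharType(LogicalType):
    def __init__(self, length, nullable=True):
        super().__init__(nullable)
        self.length = length

    def getLength(self):
        return self.length

class TimeType(LogicalType):
    pass

def to_python_type(logical_type):
    # Dispatch on the concrete logical type; unknown types get the
    # uniform "Unsupported data type" error message.
    if isinstance(logical_type, CharType):
        return ("CHAR", logical_type.getLength(), logical_type.isNullable())
    elif isinstance(logical_type, TimeType):
        return ("TIME", logical_type.isNullable())
    else:
        raise TypeError("Unsupported data type: %s" % logical_type)
```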

@WeiZhong94
Contributor Author

@dianfu Thanks for your review! I have updated the PR according to your comments.

Member

@sunjincheng121 sunjincheng121 left a comment


Thanks for the PR! @WeiZhong94!
I only left 3 suggestions about the Python docs. And one reminder: before opening a PR, we should run flink-python/dev/lint-python.sh to run the test cases and check the code format.
Best,
Jincheng

@@ -532,6 +533,9 @@ def insert_into(self, table_path, *table_path_continued):
j_table_path = to_jarray(gateway.jvm.String, table_path_continued)
self._j_table.insertInto(table_path, j_table_path)

def get_schema(self):
return TableSchema(java_object=self._j_table.getSchema())

Add a Python doc such as "Returns the schema of this table."?

@@ -177,6 +177,10 @@ def __init__(self):
self._j_schema = gateway.jvm.Schema()
super(Schema, self).__init__(self._j_schema)

def schema(self, table_schema):
self._j_schema = self._j_schema.schema(table_schema._j_table_schema)

Add a Python doc aligned with the Java doc, such as:

	Sets the schema with field names and the types. Required.
	This method overwrites existing fields added with ...

@@ -285,6 +289,10 @@ def line_delimiter(self, delimiter):
self._j_csv = self._j_csv.lineDelimiter(delimiter)
return self

def schema(self, schema):
self._j_csv = self._j_csv.schema(schema._j_table_schema)

I find that the Java side has the following Javadoc:

 /**
    * Sets the format schema with field names and the types. Required.
    * The table schema must not contain nested fields.
    *
    * This method overwrites existing fields added with [[field()]].
    *
    * @param schema the table schema
    */

It's better to align the doc. What do you think?
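For reference, a Python docstring aligned with the quoted Javadoc might look as follows. This is only a sketch on a simplified stand-in class; the real OldCsv descriptor lives in pyflink, and the exact wording is up to the author:

```python
# Sketch of an OldCsv stand-in whose schema() docstring is aligned
# with the Javadoc quoted above. Not the actual pyflink class.

class OldCsv:
    def schema(self, table_schema):
        """
        Sets the format schema with field names and the types. Required.
        The table schema must not contain nested fields.

        This method overwrites existing fields added with :func:`field`.

        :param table_schema: The :class:`TableSchema` object.
        """
        self._table_schema = table_schema
        return self
```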

data_type = DataTypes.VARBINARY(logical_type.getLength(), logical_type.isNullable())
elif _is_instance_of(logical_type, gateway.jvm.DecimalType):
data_type = DataTypes.DECIMAL(logical_type.getPrecision(),
logical_type.getScale(),

continuation line over-indented for visual indent, please correct the format.

if kind is None:
raise Exception("Unsupported java timestamp kind %s" % j_kind)
data_type = DataTypes.TIMESTAMP(kind,
logical_type.getPrecision(),

Same as above.

data_type = DataTypes.ARRAY(_from_java_type(element_type), logical_type.isNullable())
elif _is_instance_of(logical_type, gateway.jvm.MultisetType):
data_type = DataTypes.MULTISET(_from_java_type(element_type),
logical_type.isNullable())

Same as above.

@@ -573,7 +573,7 @@ under the License.
<pattern>py4j</pattern>
<shadedPattern>org.apache.flink.api.python.py4j</shadedPattern>
<includes>
<include>py4j.*</include>
<include>py4j.*.*</include>

Why do we need to add this change?

Contributor Author


Yes, without this change py4j is not shaded completely.

Member


Maybe net.sf.py4j:* is correct, and FLINK-12409 will correct this change.

Contributor Author


Yes, the solution in FLINK-12409 makes more sense. I have removed this change in the new commit, which will cause the CI test to fail for the moment. I will rebase this PR after FLINK-12409 is merged, which will solve this problem.

@sunjincheng121
Member

#8474 has been merged, please rebase the PR! Thanks! :)

Contributor

@dianfu dianfu left a comment


@WeiZhong94 Thanks a lot for the update. I have left a few comments.

This method overwrites existing fields added with
:func:`~pyflink.table.table_descriptor.Schema.field`.

:param schema: The :class:`TableSchema` object.

schema -> table_schema

@@ -287,6 +300,19 @@ def line_delimiter(self, delimiter):
self._j_csv = self._j_csv.lineDelimiter(delimiter)
return self

def schema(self, schema):

What about changing the argument name to table_schema to be consistent with the method Schema.schema?

@@ -113,7 +113,7 @@ def test_from_element(self):
DataTypes.STRING(), DataTypes.DATE(),
DataTypes.TIME(),
DataTypes.TIMESTAMP(),
DataTypes.ARRAY(DataTypes.DOUBLE()),
DataTypes.ARRAY(DataTypes.DOUBLE().not_null()),

Could you add a test case for input element [1.0, None]?

class TableSchemaTests(PyFlinkTestCase):

def test_init(self):
schema = \

There is no need to add a new line here.

@@ -389,9 +405,10 @@ def __init__(self, precision=0, nullable=True):
super(TimeType, self).__init__(nullable)
assert 0 <= precision <= 9
self.precision = precision
self.bridged_to("java.time.LocalTime")

What about reverting this kind of change?


@classmethod
def TIMESTAMP(cls, kind=TimestampKind.REGULAR, precision=6, nullable=True):
return TimestampType(kind, precision, nullable)
def TIMESTAMP(cls, kind=TimestampKind.REGULAR, precision=3, nullable=True):

Revert this change?

@WeiZhong94
Contributor Author

@dianfu Thanks for your review! I have addressed your comments.

@sunjincheng121
Member

+1 to merge.

sunjincheng121 pushed a commit to sunjincheng121/flink that referenced this pull request Jun 9, 2019
@asfgit asfgit closed this in 8eaa2d0 Jun 9, 2019
sjwiesman pushed a commit to sjwiesman/flink that referenced this pull request Jun 26, 2019