[FLINK-12588][python] Add TableSchema for Python Table API. #8561
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
@WeiZhong94 Thanks a lot for the PR. Have left a few comments.
    def get_field_types(self):
        """
        Returns all field data types as an array.
This method is deprecated in Java, so there is no need to add it in Python.
        else:
            return None

    def get_field_type(self, field):
Deprecated in Java, so we can remove it in Python.
    A table schema that represents a table's structure with field names and data types.
    """

    def __init__(self, field_names=None, data_types=None, java_object=None):
java_object -> j_table_schema
    def get_field_data_types(self):
        """
        Returns all field data types as an array.
as a list
        Returns the specified data type for the given field index or field name.

        :param field: The index of the field or the name of the field.
        :return: The specified data type.
The data type of the specified field
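For reference, the accessor behavior discussed in the comments above (lookup by field index or field name, and returning all types as a list) could look roughly like the sketch below. It is plain Python only: SimpleSchema is a hypothetical stand-in and not the PyFlink TableSchema class, the data types are plain strings, and returning None for an unknown field is an assumption based on the `return None` branch visible in the diff.

```python
class SimpleSchema(object):
    """Hypothetical stand-in for a table schema; not the PyFlink class."""

    def __init__(self, field_names, data_types):
        if len(field_names) != len(data_types):
            raise ValueError("field_names and data_types must have the same length")
        self._field_names = list(field_names)
        self._data_types = list(data_types)

    def get_field_data_types(self):
        """Returns all field data types as a list."""
        return list(self._data_types)

    def get_field_data_type(self, field):
        """
        Returns the data type of the specified field.

        :param field: The index of the field or the name of the field.
        :return: The data type of the specified field, or None if it is unknown.
        """
        if isinstance(field, int):
            # Lookup by position; out-of-range indexes yield None (assumption).
            if 0 <= field < len(self._data_types):
                return self._data_types[field]
            return None
        if field in self._field_names:
            # Lookup by name.
            return self._data_types[self._field_names.index(field)]
        return None


schema = SimpleSchema(["a", "b"], ["INT", "STRING"])
print(schema.get_field_data_type(0))
print(schema.get_field_data_type("b"))
```

Accepting either an index or a name in a single method keeps the Python API close to the Java accessors while staying idiomatic for Python callers.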
flink-python/pyflink/table/types.py
Outdated
    logical_type = java_data_type.getLogicalType()
    conversion_clz = java_data_type.getConversionClass()
    if is_instance_of(logical_type, gateway.jvm.CharType):
        python_type = DataTypes.CHAR(logical_type.getLength(), logical_type.isNullable())
python_type -> data_type
flink-python/pyflink/table/types.py
Outdated
                "currently." % java_data_type_input)
    elif is_instance_of(logical_type, gateway.jvm.DayTimeIntervalType):
        raise \
            TypeError("Not supported type: %s, DayTimeIntervalType is not supported currently."
currently -> yet
flink-python/pyflink/table/types.py
Outdated
        python_type = DataTypes.TIME(logical_type.isNullable())
    elif is_instance_of(logical_type, gateway.jvm.ZonedTimestampType):
        raise \
            TypeError("Not supported type: %s, ZonedTimestampType is not supported currently."
TypeError("ZonedTimestampType is not supported yet").
flink-python/pyflink/table/types.py
Outdated
        python_type = DataTypes.MULTISET(_to_python_type(element_type),
                logical_type.isNullable())
    else:
        raise TypeError("Not supported colletion data type: %s" % java_data_type_input)
colletion -> collection
flink-python/pyflink/table/types.py
Outdated
    # Unrecognized type.
    else:
        TypeError("Unsupported data type: %s" % java_data_type_input)
What about changing all the error messages to "Unsupported data type: %s"?
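One way to keep the message uniform is to build the exception in a single place and raise it from every unsupported branch. The sketch below is only an illustration under assumptions: `unsupported` and `convert` are hypothetical names, and the dispatch works on plain strings instead of Java objects accessed through Py4J. It also makes explicit that the exception has to be raised, since constructing a TypeError without `raise` has no effect.

```python
def unsupported(java_data_type_input):
    """Hypothetical helper: builds the one error used by all unsupported branches."""
    return TypeError("Unsupported data type: %s" % java_data_type_input)


def convert(type_name):
    # Toy dispatch on a type name; the function under review inspects Java objects.
    supported = {"CHAR": "DataTypes.CHAR", "TIME": "DataTypes.TIME"}
    if type_name in supported:
        return supported[type_name]
    # `raise` is required here: creating the exception object alone does nothing.
    raise unsupported(type_name)


try:
    convert("ZONED_TIMESTAMP")
except TypeError as e:
    print(e)
```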
@dianfu Thanks for your review! I have updated the PR according to your comments.
Thanks for the PR! @WeiZhong94!
I only left 3 suggestions about the Python docs. One reminder: before opening a PR, we should run flink-python/dev/lint-python.sh to run the test cases and check the code format.
Best,
Jincheng
flink-python/pyflink/table/table.py
Outdated
@@ -532,6 +533,9 @@ def insert_into(self, table_path, *table_path_continued):
        j_table_path = to_jarray(gateway.jvm.String, table_path_continued)
        self._j_table.insertInto(table_path, j_table_path)

    def get_schema(self):
        return TableSchema(java_object=self._j_table.getSchema())
Add a Python doc with "Returns the schema of this table."?
@@ -177,6 +177,10 @@ def __init__(self):
        self._j_schema = gateway.jvm.Schema()
        super(Schema, self).__init__(self._j_schema)

    def schema(self, table_schema):
        self._j_schema = self._j_schema.schema(table_schema._j_table_schema)
Add a Python doc aligned with the Java one? Such as:
Sets the schema with field names and the types. Required.
This method overwrites existing fields added with ...
@@ -285,6 +289,10 @@ def line_delimiter(self, delimiter):
        self._j_csv = self._j_csv.lineDelimiter(delimiter)
        return self

    def schema(self, schema):
        self._j_csv = self._j_csv.schema(schema._j_table_schema)
I find that the Java code has the following Javadoc:
/**
 * Sets the format schema with field names and the types. Required.
 * The table schema must not contain nested fields.
 *
 * This method overwrites existing fields added with [[field()]].
 *
 * @param schema the table schema
 */
It's better to align the docs. What do you think?
flink-python/pyflink/table/types.py
Outdated
        data_type = DataTypes.VARBINARY(logical_type.getLength(), logical_type.isNullable())
    elif _is_instance_of(logical_type, gateway.jvm.DecimalType):
        data_type = DataTypes.DECIMAL(logical_type.getPrecision(),
                logical_type.getScale(),
continuation line over-indented for visual indent, please correct the format.
flink-python/pyflink/table/types.py
Outdated
        if kind is None:
            raise Exception("Unsupported java timestamp kind %s" % j_kind)
        data_type = DataTypes.TIMESTAMP(kind,
                logical_type.getPrecision(),
Same as above.
flink-python/pyflink/table/types.py
Outdated
        data_type = DataTypes.ARRAY(_from_java_type(element_type), logical_type.isNullable())
    elif _is_instance_of(logical_type, gateway.jvm.MultisetType):
        data_type = DataTypes.MULTISET(_from_java_type(element_type),
                logical_type.isNullable())
Same as above.
flink-dist/pom.xml
Outdated
@@ -573,7 +573,7 @@ under the License.
    <pattern>py4j</pattern>
    <shadedPattern>org.apache.flink.api.python.py4j</shadedPattern>
    <includes>
        <include>py4j.*</include>
        <include>py4j.*.*</include>
Why do we need to add this change?
Yes, without this change py4j is not shaded completely.
Maybe net.sf.py4j:* is correct, and FLINK-12409 will correct this change.
Yes, the solution in FLINK-12409 makes more sense. I have removed this change in the new commit, which will cause the CI tests to fail for the moment. I will rebase this PR after FLINK-12409 is merged, which should solve the problem.
#8474 has merged, please rebase the PR! Thanks! :)
@WeiZhong94 Thanks a lot for the update. I have left a few comments.
        This method overwrites existing fields added with
        :func:`~pyflink.table.table_descriptor.Schema.field`.

        :param schema: The :class:`TableSchema` object.
schema -> table_schema
@@ -287,6 +300,19 @@ def line_delimiter(self, delimiter):
        self._j_csv = self._j_csv.lineDelimiter(delimiter)
        return self

    def schema(self, schema):
What about changing the argument name to table_schema to be consistent with the method Schema.schema?
@@ -113,7 +113,7 @@ def test_from_element(self):
        DataTypes.STRING(), DataTypes.DATE(),
        DataTypes.TIME(),
        DataTypes.TIMESTAMP(),
        DataTypes.ARRAY(DataTypes.DOUBLE()),
        DataTypes.ARRAY(DataTypes.DOUBLE().not_null()),
Could you add a test case for input element [1.0, None]?
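To illustrate why the [1.0, None] input matters: an array whose element type is declared NOT NULL should reject a None element, while a nullable element type should accept it. The sketch below uses a hypothetical `validate_array` helper in plain Python. It is not PyFlink code and does not reflect how PyFlink validates rows; it only shows the two outcomes such a test case would need to distinguish.

```python
def validate_array(values, element_nullable):
    """Hypothetical checker: accepts floats, and None only for nullable elements."""
    for value in values:
        if value is None:
            if not element_nullable:
                raise ValueError("None is not allowed for a NOT NULL element type")
        elif not isinstance(value, float):
            raise TypeError("Expected float, got %s" % type(value).__name__)
    return True


# Nullable elements: [1.0, None] passes.
print(validate_array([1.0, None], element_nullable=True))

# NOT NULL elements: the same input is rejected.
try:
    validate_array([1.0, None], element_nullable=False)
except ValueError as e:
    print(e)
```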
class TableSchemaTests(PyFlinkTestCase):

    def test_init(self):
        schema = \
There is no need to add a new line here.
flink-python/pyflink/table/types.py
Outdated
@@ -389,9 +405,10 @@ def __init__(self, precision=0, nullable=True):
        super(TimeType, self).__init__(nullable)
        assert 0 <= precision <= 9
        self.precision = precision
        self.bridged_to("java.time.LocalTime")
What about reverting this kind of change?
flink-python/pyflink/table/types.py
Outdated
    @classmethod
    def TIMESTAMP(cls, kind=TimestampKind.REGULAR, precision=6, nullable=True):
        return TimestampType(kind, precision, nullable)
    def TIMESTAMP(cls, kind=TimestampKind.REGULAR, precision=3, nullable=True):
Revert this change?
… some parameter name, fix boxed basic type array support.
@dianfu Thanks for your review! I have addressed your comments.
+1 to merge.
What is the purpose of the change
This pull request adds TableSchema to the Python Table API. To that end, it introduces a _to_python_type function, which converts Java's DataType and TypeInformation objects into Python's DataType objects. To ensure that _to_python_type and the existing _to_java_type are mutually inverse functions, this PR makes some changes to the Flink Python type system.
Brief change log
- TableSchema class.
- get_schema method in Table class.
- schema method in OldCsv and Schema class.
- _to_python_type function.
Verifying this change
Added integration tests in test_table_schema.py, test_schema_operation.py and test_types.py.
Does this pull request potentially affect one of the following parts:
- @Public(Evolving): (no)
Documentation
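Returning to the purpose statement above: the requirement that the Python-to-Java and Java-to-Python conversions be mutually inverse can be checked with a round-trip test. The sketch below is a toy model under assumptions. The `to_java` and `to_python` functions, the string type names, and the dictionary representation of a Java type are all hypothetical placeholders, not the PR's _to_java_type and _to_python_type, which work on Py4J objects.

```python
# Toy mapping from Python-side type names to placeholder Java-side names.
_TO_JAVA = {
    "INT": "IntType",
    "DOUBLE": "DoubleType",
    "STRING": "VarCharType",
}
# The reverse mapping is derived, so the two directions cannot drift apart.
_TO_PYTHON = dict((java, py) for py, java in _TO_JAVA.items())


def to_java(py_type):
    """Converts a (name, nullable) pair into a toy Java-side description."""
    name, nullable = py_type
    return {"logical_type": _TO_JAVA[name], "nullable": nullable}


def to_python(java_type):
    """Converts the toy Java-side description back into a (name, nullable) pair."""
    return (_TO_PYTHON[java_type["logical_type"]], java_type["nullable"])


def round_trips(py_type):
    """True if converting to Java and back yields the original type."""
    return to_python(to_java(py_type)) == py_type


print(all(round_trips((name, nullable))
          for name in _TO_JAVA
          for nullable in (True, False)))
```

Carrying nullability through both directions is the part a round-trip test catches most easily, since a conversion that silently resets it to a default would break the equality check.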