
Conversation

@twalthr (Contributor) commented May 7, 2019

What is the purpose of the change

This PR introduces the DataType class and subclasses. Users are able to do a star import of DataTypes and declare types like `MULTISET(MULTISET(INT()))`, very close to SQL. As mentioned in FLIP-37, data types allow specifying format hints to the planner using `TIMESTAMP(9).bridgedTo(java.sql.Timestamp.class)`.
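To make the intended ergonomics concrete, here is a minimal self-contained sketch of the fluent style (toy stand-in classes with hypothetical names; not Flink's actual implementation):

```java
// Toy model of the fluent DataTypes style from FLIP-37. All classes and
// factory methods here are simplified stand-ins, not Flink's real API.
public class DataTypesSketch {

    // A data type pairs a logical type with an optional conversion-class hint.
    static final class DataType {
        final String logicalType;
        final Class<?> conversionClass;

        DataType(String logicalType, Class<?> conversionClass) {
            this.logicalType = logicalType;
            this.conversionClass = conversionClass;
        }

        // Adds a hint for the physical representation at the boundary of the
        // table ecosystem, mirroring bridgedTo(...) from the PR description.
        DataType bridgedTo(Class<?> clazz) {
            return new DataType(logicalType, clazz);
        }

        @Override
        public String toString() {
            return logicalType
                + (conversionClass != null ? " [" + conversionClass.getName() + "]" : "");
        }
    }

    // Factory methods meant to be star-imported so declarations read like SQL.
    public static DataType INT() {
        return new DataType("INT", null);
    }

    public static DataType MULTISET(DataType element) {
        return new DataType("MULTISET<" + element.logicalType + ">", null);
    }

    public static DataType TIMESTAMP(int precision) {
        return new DataType("TIMESTAMP(" + precision + ")", null);
    }

    public static void main(String[] args) {
        System.out.println(MULTISET(MULTISET(INT())));
        System.out.println(TIMESTAMP(9).bridgedTo(java.sql.Timestamp.class));
    }
}
```

With a star import of the factory methods, declarations read almost exactly like their SQL counterparts, which is the design goal stated above.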

Brief change log

  • DataType structure
  • DataTypes enumeration of types

Verifying this change

See org.apache.flink.table.types.DataTypesTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

@flinkbot (Collaborator) commented May 7, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@twalthr force-pushed the FLINK-FLINK-12393 branch from a839697 to 746c81c on May 8, 2019 10:52
*
* @see DataTypes for a list of supported data types
*/
public static final class ElementDataType extends DataType {
Member:

CollectionDataType?

It is weird that the ElementDataType contains an element type named elementDataType.

Contributor Author:

The naming is on purpose as MapType is also a collection data type. The naming should not reflect the SQL type families but just serve as a container name.

Contributor:

To be honest I agree with @wuchong on this. BTW e.g. java.util.Map is not a java.util.Collection ;)

Member:

My only concern is that the class name ElementDataType together with its member field elementDataType is confusing.

In the JDK, a Collection is a group of elements, and a Map is not a collection. IMO, ARRAY and MULTISET are collections of elements of the same type.

Contributor Author:

You are right. Java is also a bit confusing in this sense, as methods such as java.util.Collections#emptyMap() are also put under Collections. I will correct that, also in org.apache.flink.table.types.logical.LogicalTypeRoot#MAP. Thanks for the feedback.

*
* @return a new, reconfigured data type instance
*/
public abstract DataType andNull();
Member:

How about rename andNull() --> asNullable() and notNull() --> asNotNull()?

Contributor Author:

  1. The goal was to be close to SQL: INT NOT NULL -> INT().notNull().
  2. notNull and andNull are shorter than asNullable and asNotNullable.

Alternatively, what about withNull() or withoutNull()?

Contributor:

I would leave the notNull as is. How about andNull -> nullable()? I guess this method will be rarely used anyways.

Member:

It makes sense to be close to SQL.

In SQL, a nullable field is declared like name INT NULL. But INT().null() is not a good choice for the API (null is a reserved keyword in Java).

What about notNull() and nullable() ? It's easy to understand and close to SQL.

Contributor Author:

Yes, I'm also ok with notNull() and nullable(). Any third opinion? @dawidwys
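A small self-contained sketch of the immutable reconfiguration pattern under discussion, using the notNull()/nullable() names converged on above (toy code, not Flink's actual DataType):

```java
// Toy sketch of notNull()/nullable() returning reconfigured copies.
// Simplified stand-in, not Flink's actual DataType implementation.
public class NullabilitySketch {

    static final class DataType {
        final String name;
        final boolean nullable;

        DataType(String name, boolean nullable) {
            this.name = name;
            this.nullable = nullable;
        }

        // Each call returns a new instance; the original stays unchanged.
        DataType notNull() {
            return new DataType(name, false);
        }

        DataType nullable() {
            return new DataType(name, true);
        }

        @Override
        public String toString() {
            // Mirrors SQL, where NOT NULL is explicit and nullability is the default.
            return nullable ? name : name + " NOT NULL";
        }
    }

    public static DataType INT() {
        return new DataType("INT", true); // nullable by default, as in SQL
    }

    public static void main(String[] args) {
        System.out.println(INT().notNull());            // INT NOT NULL
        System.out.println(INT().notNull().nullable()); // back to plain INT
    }
}
```

The immutable copies make `INT().notNull()` read almost like the SQL declaration `INT NOT NULL`, which was the stated goal.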

* Resolution in days. The precision is the number of digits of days. It must have a value
* between 1 and 6 (both inclusive). If no precision is specified, it is equal to 2 by default.
*/
public static Resolution DAY(int precision) {
Member:

What about providing a DAY() method which uses precision 2 as the default? The same for YEAR and SECOND.

Contributor Author:

Sounds good.
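The suggested no-argument variant can simply delegate to the parameterized one. A toy sketch of the pattern (hypothetical names and return type, not Flink's actual code):

```java
// Toy sketch of the default-precision overload suggested above.
// DEFAULT_DAY_PRECISION and the String return type are illustrative only.
public class ResolutionSketch {

    static final int DEFAULT_DAY_PRECISION = 2;

    public static String DAY(int precision) {
        // Precision is the number of digits of days; 1 to 6 inclusive per the docs.
        if (precision < 1 || precision > 6) {
            throw new IllegalArgumentException("Day precision must be between 1 and 6.");
        }
        return "INTERVAL DAY(" + precision + ")";
    }

    // No-argument variant falling back to the default precision of 2.
    public static String DAY() {
        return DAY(DEFAULT_DAY_PRECISION);
    }

    public static void main(String[] args) {
        System.out.println(DAY());   // INTERVAL DAY(2)
        System.out.println(DAY(6));  // INTERVAL DAY(6)
    }
}
```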

@dawidwys (Contributor) left a comment:

Hi @twalthr
Looks really good. I had a few comments, mostly on the test infrastructure. I think we can significantly improve the readability of the tests.

/**
* Adds a hint that data should be represented using the given class when entering or leaving
* the table ecosystem.
*
Contributor:

How about a note that not all classes are supported? Sth like:

<p>The supported conversion classes are limited to {@link LogicalType#supportsInputConversion(Class)} or {@link LogicalType#supportsOutputConversion(Class)}.

*
* @see DataTypes for a list of supported data types
*/
public static final class AtomicDataType extends DataType {
Contributor:

nit: Could we have those as top level classes? It's easier to read short files.


assertEquals(java.sql.Timestamp.class, dataType.getConversionClass());
testLogicalType(new TimestampType(false, 9), dataType);

testLogicalType(new TimestampType(true, 9), dataType.andNull());
Contributor:

How about we replace this method with:

@Test
public void testDataType() {
	...
	assertThat(dataType.andNull(), hasLogicalType(new TimestampType(true, 9)));
	...
}

private static Matcher<DataType> hasLogicalType(LogicalType logicalType) {
	return new FeatureMatcher<DataType, LogicalType>(
			CoreMatchers.equalTo(logicalType),
			"logical type of the data type",
			"logical type") {

		@Override
		protected LogicalType featureValueOf(DataType actual) {
			return actual.getLogicalType();
		}
	};
}

This makes it more explicit what is actually tested without looking into the method. It also produces a nicer description in case of an error.


testLogicalType(new TimestampType(true, 9), dataType.andNull());

try {
Contributor:

I think we can also write a simple Matcher for this, as it will be reused for all DataTypes.


@Test
public void testAtomicDataTypes() {
testLogicalType(new CharType(2), DataTypes.CHAR(2));
Contributor:

Could we rework those testLogicalType assertions into a single parameterized test? This would give a clear message about which type actually failed.

This would also ensure all assertions are always run. Right now the tests will fail on the first error.

/**
* Tests for {@link DataType}, its subclasses and {@link DataTypes}.
*/
public class DataTypesTest {
Contributor (@dawidwys, May 9, 2019):

Should we add tests that check that default conversion classes are used if none provided?

BTW, I think if we rework the test infrastructure into a parameterized one with a sort of TestSpec, it will be much easier to read and to test all cases.

The TestSpec could look sth like this:

test(dataType)
   .hasLogicalType(...)
   .hasConversionClass(...)

And then the method:

@Test
public void testDataType(TestSpec spec) {
    assertThat(spec.getDataType(), hasLogicalType(spec.getExpectedLogicalType()));
    spec.getExpectedConversionClass().ifPresent(convClass -> {
        assertThat(spec.getDataType(), hasConversionClass(convClass));
    });
}


testLogicalType(
new DayTimeIntervalType(DayTimeResolution.MINUTE_TO_SECOND, DayTimeIntervalType.DEFAULT_DAY_PRECISION, 2),
DataTypes.INTERVAL(DataTypes.MINUTE(), DataTypes.SECOND(2)));
Contributor:

nit: Not sure about that, but maybe we could static import the DataTypes? It makes it easier to read.

*
* @see CharType
*/
public static DataType.AtomicDataType CHAR(int n) {
Contributor:

Can we return just DataType rather than the more concrete subclasses? I think we should try to expose as little internal structure as possible. This would give us more flexibility over the API, and it would still be quite easy to expose more if needed in the future.

Contributor Author:

Good point. At first I was a bit skeptical because it would require casting in util methods, but we can also provide a visitor.
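Returning only the base DataType pairs naturally with a visitor so that utility code avoids instanceof checks and casts. A self-contained sketch of what such a visitor could look like (hypothetical names, not the API added by this PR):

```java
// Toy sketch of a DataType visitor, an alternative to casting when factory
// methods expose only the base DataType. Not Flink's actual API.
public class VisitorSketch {

    interface DataTypeVisitor<R> {
        R visit(AtomicDataType atomic);
        R visit(CollectionDataType collection);
    }

    static abstract class DataType {
        abstract <R> R accept(DataTypeVisitor<R> visitor);
    }

    static final class AtomicDataType extends DataType {
        final String name;
        AtomicDataType(String name) { this.name = name; }
        @Override
        <R> R accept(DataTypeVisitor<R> v) { return v.visit(this); }
    }

    static final class CollectionDataType extends DataType {
        final DataType elementType;
        CollectionDataType(DataType elementType) { this.elementType = elementType; }
        @Override
        <R> R accept(DataTypeVisitor<R> v) { return v.visit(this); }
    }

    // Example utility: render a type without any casts, via double dispatch.
    static String render(DataType t) {
        return t.accept(new DataTypeVisitor<String>() {
            @Override
            public String visit(AtomicDataType a) { return a.name; }
            @Override
            public String visit(CollectionDataType c) {
                return "MULTISET<" + render(c.elementType) + ">";
            }
        });
    }

    public static void main(String[] args) {
        DataType t = new CollectionDataType(new AtomicDataType("INT"));
        System.out.println(render(t)); // MULTISET<INT>
    }
}
```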

public FieldsDataType(
LogicalType logicalType,
@Nullable Class<?> conversionClass,
Map<String, DataType> fieldDataTypes) {
Contributor:

It's a bit strange to use Map<String, DataType> to represent fields, because RowType is sequential. Would it be better to use an array to express the fields?

Member:

I agree. At least we should use a LinkedHashMap to preserve the field order. And it would be nice to provide getFieldNames() and getFieldTypes() for convenience.

Contributor Author:

The field order is defined by the logical type. The map adds just metadata.

Contributor Author:

If we allowed a LinkedHashMap, we would also need to verify its order against the logical type. The logical type should be the single source of truth.

Member:

OK, I'm just not sure what the getFieldDataTypes will be used for.

Contributor Author:

This method will only be used by us to figure out which format the user requested. It will be used in the next PR around FLINK-12254.
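The single-source-of-truth arrangement discussed above can be sketched in a few lines: the logical type defines the field order, and the metadata map is only checked for covering the same field names (toy code with hypothetical names, not Flink's FieldsDataType):

```java
// Toy sketch: the logical type stays the single source of truth for field
// order; the metadata map is validated against it, not trusted for ordering.
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldsSketch {

    // Returns true if the metadata map covers exactly the logical type's fields.
    // Key order is irrelevant here because order is defined by the logical type.
    public static boolean matchesLogicalFields(
            List<String> logicalFieldNames, Map<String, ?> fieldDataTypes) {
        return fieldDataTypes.keySet().equals(new HashSet<>(logicalFieldNames));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "INT");
        fields.put("name", "STRING");

        System.out.println(matchesLogicalFields(List.of("id", "name"), fields)); // true
        System.out.println(matchesLogicalFields(List.of("id", "age"), fields));  // false
    }
}
```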

@twalthr force-pushed the FLINK-FLINK-12393 branch from 746c81c to b470dd7 on May 9, 2019 12:00
@dawidwys (Contributor) left a comment:

Thank you for the update. Tests look beautiful right now!

@twalthr (Contributor Author) commented May 9, 2019

Thanks everyone for the feedback. I will merge this now.

twalthr added a commit to twalthr/flink that referenced this pull request May 9, 2019
…pe system

Introduces the DataType class and subclasses. Users are able to do a star import
of DataTypes and declare types like `MULTISET(MULTISET(INT()))`. Close to SQL. As
mentioned in FLIP-37, data types allow specifying format hints to the planner
using `TIMESTAMP(9).bridgedTo(java.sql.Timestamp.class)`.

This closes apache#8360.
asfgit closed this in 1cb9d68 on May 9, 2019
6 participants