
[FLINK-8558] [table] Add unified format interfaces and separate formats from connectors #6264

Closed
wants to merge 8 commits

Conversation

Contributor

@twalthr twalthr commented Jul 5, 2018

What is the purpose of the change

This PR introduces a format discovery mechanism based on Java Service Providers. The general TableFormatFactory is similar to the existing table source discovery mechanism. However, it allows for arbitrary format interfaces that might be introduced in the future. At the moment, a connector can request configured instances of DeserializationSchema and SerializationSchema. In the future we can add interfaces such as a Writer or KeyedSerializationSchema without breaking backwards compatibility.

This PR deprecates the existing strong coupling of connector and format for the Kafka table sources and table source factories. It introduces descriptor-based alternatives.
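
A self-contained sketch of the discovery mechanism described above, for illustration only. The interface below is a simplified stand-in for the PR's TableFormatFactory, not the actual Flink API, and the matching logic is an assumption about how a factory service could resolve properties to a factory:

import java.util.Map;
import java.util.ServiceLoader;

// Simplified stand-in for a format factory discovered via Java Service Providers.
interface FormatFactorySketch {
    // Properties identifying the context in which this factory applies,
    // e.g. {"format.type" -> "json"}.
    Map<String, String> requiredContext();
}

class FormatDiscoverySketch {
    // Returns the first factory whose required context is contained in the
    // given properties.
    static FormatFactorySketch find(Map<String, String> properties) {
        for (FormatFactorySketch factory : ServiceLoader.load(FormatFactorySketch.class)) {
            boolean matches = factory.requiredContext().entrySet().stream()
                .allMatch(e -> e.getValue().equals(properties.get(e.getKey())));
            if (matches) {
                return factory;
            }
        }
        throw new IllegalArgumentException("No matching format factory found.");
    }
}

// Implementations are registered by listing their class names in a file named
// META-INF/services/<fully qualified interface name>, which is how
// java.util.ServiceLoader locates them at runtime.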

Brief change log

  • Introduction of TableFormatService with TableFormatFactory and specific DeserializationSchemaFactory and SerializationSchemaFactory
  • Decoupling of existing connectors (i.e. Kafka) from formats (i.e. JSON and Avro)
  • Exposing the descriptor-based approach, deprecating the old builders, and making the table sources internal

Verifying this change

  • Existing tests for coupled sources and factories are still working
  • New tests for format discovery, formats, and decoupled table sources

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? more docs for the descriptors will follow in a separate PR

Contributor

@pnowojski pnowojski left a comment


Sorry, this one is too big to review all at once. I reviewed roughly until BatchTableEnvironment.scala.

String proctimeAttribute,
List<RowtimeAttributeDescriptor> rowtimeAttributeDescriptors,
Map<String, String> fieldMapping,
String topic, Properties properties,
Contributor


nit: new line between topic & properties


/**
* Creates a Kafka 0.10 {@link StreamTableSource}.
*
* @param topic Kafka topic to consume.
* @param properties Properties for the Kafka consumer.
* @param deserializationSchema Deserialization schema to use for Kafka records.
* @param typeInfo Type information describing the result type. The field names are used
* to parse the JSON file and so are the types.
* @param typeInfo Not relevant anymore.
Contributor

@pnowojski pnowojski Jul 6, 2018


Why is it not relevant? Doesn't it break backward compatibility? If so, it would be safer to drop the constructor altogether.

The only reason for keeping it would be if ALL old invocations still work the same as they used to, regardless of this value. However, in that case I would also be inclined to drop this constructor, since it's an easy change for the users and the class was @PublicEvolving and is now @Internal.

Contributor Author


The typeInfo should have been the same as the produced type returned by the deserializationSchema. I'm also fine with dropping the constructor already.

*/
@Deprecated
Contributor


I have mixed feelings about deprecating PublicEvolving classes.

Contributor Author


PublicEvolving is not Public, so we can modify such classes if necessary. Additionally, this class should never have gotten this annotation, as we do not guarantee API stability for the Table API at the moment.

Contributor Author


I agree that having both annotations looks weird. I will remove the PublicEvolving.

Contributor


As we talked about offline, I wasn't sure if we could drop PublicEvolving classes instead of deprecating them. But if you want to deprecate them for one release, it might be a better idea to do so.

*/
@Deprecated
Contributor


Hmm, why do you deprecate individual methods after already deprecating the whole class? Does it solve some problem?

If not I would revert those additional @Deprecated notes.

Contributor Author

@twalthr twalthr Jul 6, 2018


Contributor


Yes, I assumed that it works that way, as you referred, and it seemed to me that deprecating the class should suffice. But you are right, it doesn't hurt (except for a larger commit to review).

@@ -55,50 +57,101 @@
*/
@Internal
public abstract class KafkaTableSource
implements StreamTableSource<Row>, DefinedProctimeAttribute, DefinedRowtimeAttributes {
implements StreamTableSource<Row>, DefinedProctimeAttribute, DefinedRowtimeAttributes, DefinedFieldMapping {
Contributor


nit: split interfaces into new lines?

if (!properties.contains(key)) {
if (!isOptional) {
throw new ValidationException(s"Could not find required property '$key'.")
}
} else {
TypeStringUtils.readTypeInfo(properties(key)) // throws validation exceptions
// throws validation exceptions
Contributor


what's the purpose of this comment?

Contributor Author


That we don't validate the string but let the parser do the work for us. I updated the comment.

* @tparam T factory class type
* @return configured instance from factory
*/
def find[T](
Contributor


Split this method into smaller ones. Basically, everywhere you typed a comment

//foo bar baz
some;
code;
block;

replace the comment with a method call:

def fooBarBaz() {
  some;
  code;
  block;
}


private def normalizeContext(factory: TableFormatFactory[_]): Map[String, String] = {
val requiredContextJava = factory.requiredContext()
if (requiredContextJava != null) {
Contributor


same as below

* Creates a Kafka 0.10 {@link StreamTableSource}.
*
* @param schema Schema of the produced table.
* @param proctimeAttribute Field name of the processing time attribute, null if no
Contributor


Replace null with Optional


Contributor


This is kind of a controversial topic. Generally speaking, I suspect that Java discourages using Optional anywhere besides return values, because we should either use @Nullable or not use optionals at all. However, in projects that have ignored the @Nullable annotation (such as Flink), it's virtually impossible to start using it, and thus using Optional is the only way to have compiler control over optional/nullable fields.

In this particular use case of "optional" arguments, my preference hierarchy is as follows (a sketch of option 1 follows after the lists):

  1. provide a builder for this class
  2. provide alternative constructor without this argument
  3. use @Nullable with enabled compile errors on incorrectly handled @Nullable annotations
  4. use Optional
    ...
  5. use @Nullable WITHOUT compile errors on incorrectly handled @Nullable annotations
  6. use nullable argument without @Nullable annotation

The two last options are for me out of the question, since option 6 is evil and option 5 doesn't improve the situation. Option 3 is sadly impossible for Flink.

The same logic applies for me to other use cases (like fields, return values, etc.):

  1. avoid nulls/optionals (for example via builders or named parameters with default values)
  2. use @Nullable with compiler errors
  3. use Optional
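
A minimal sketch of option 1 from the hierarchy above, assuming a hypothetical, simplified table source (this is not the actual KafkaTableSource API; all names are illustrative):

import java.util.Optional;

class SimpleTableSourceSketch {
    private final String topic;
    private final Optional<String> proctimeAttribute;

    private SimpleTableSourceSketch(String topic, Optional<String> proctimeAttribute) {
        this.topic = topic;
        this.proctimeAttribute = proctimeAttribute;
    }

    static Builder builder() {
        return new Builder();
    }

    static class Builder {
        private String topic;
        private Optional<String> proctimeAttribute = Optional.empty();

        Builder topic(String topic) {
            this.topic = topic;
            return this;
        }

        // Callers without a proctime attribute simply never call this,
        // so no null is ever passed through the public API.
        Builder proctimeAttribute(String attribute) {
            this.proctimeAttribute = Optional.of(attribute);
            return this;
        }

        SimpleTableSourceSketch build() {
            return new SimpleTableSourceSketch(topic, proctimeAttribute);
        }
    }
}

// Usage: SimpleTableSourceSketch.builder().topic("fruits").build();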

private List<RowtimeAttributeDescriptor> rowtimeAttributeDescriptors;

/** Mapping for the fields of the table schema to fields of the physical returned type or null. */
private Map<String, String> fieldMapping;
Contributor


Optional<Map<...>> fieldMapping. Same goes for other nullable fields/parameters.

Contributor Author


I added a @Nullable annotation.

Contributor


Please check my other comment about that. @Nullable without compiler errors is not in any way better :(

Contributor

@pnowojski pnowojski left a comment


Again, only a partial review :( I didn't manage to look into the "Feedback addressed" commit. However, I managed to more or less fully review the first commit.

// prepare parameters for Kafka table source

final TableSchema schema = TableSchema.builder()
.field("fruit-name", Types.STRING())
Contributor

@pnowojski pnowojski Jul 9, 2018


The benefit is deduplication of values like "fruit-name" within a single test (easier to change such constants), plus self-documenting code. Using constants documents that all of those occurrences are indeed the same.

Both help when debugging, refactoring and extending the test in the future.

}

@Test(expected = classOf[AmbiguousTableFormatException])
def testAmbiguousFactory(): Unit = {
Contributor


how does this differ from testAmbiguousSchemaBasedSelection?

Contributor Author


I added a comment to the other test.

*
* An empty context means that the factory matches for all requests.
*/
def requiredContext(): util.Map[String, String]
Contributor


I don't understand Context in this context. Rename to requiredProperties?

Contributor Author


Context defines the "context" in which the factory will be activated.

Contributor


I still do not understand this name. Could you think about something more descriptive? It seems to me like this method returns the set of properties that are required to match a given factory. Thus, requiredProperties seems better, but maybe I'm missing something?

Contributor Author

@twalthr twalthr Jul 11, 2018


Context is more accurate, as the map also contains properties that are not required. E.g., a user does not have to specify property-version in YAML, but they can. The required context describes the context for which this factory was implemented. The factory service decides how to handle the context information.
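
A self-contained sketch of the matching semantics described here, purely for illustration. The lenient treatment of version keys is an assumption about the factory service, not the actual Flink implementation:

import java.util.HashMap;
import java.util.Map;

class ContextMatchingSketch {
    // A factory matches if every context entry either matches the user
    // properties or is a version key the user chose to omit.
    static boolean matches(Map<String, String> requiredContext, Map<String, String> properties) {
        return requiredContext.entrySet().stream().allMatch(e ->
            (e.getKey().endsWith("property-version") && !properties.containsKey(e.getKey()))
                || e.getValue().equals(properties.get(e.getKey())));
    }

    public static void main(String[] args) {
        Map<String, String> context = new HashMap<>();
        context.put("format.type", "json");
        context.put("format.property-version", "1");

        Map<String, String> userProps = new HashMap<>();
        userProps.put("format.type", "json"); // property-version omitted

        System.out.println(matches(context, userProps)); // prints: true
    }
}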


override def requiredContext(): util.Map[String, String] = {
val context = new util.HashMap[String, String]()
context.put("format.type", "test-format")
Contributor


I think you should extract, or reuse already extracted, constants for all of those strings like format.type, format.path, format.important, etc. Generally speaking, if a constant occurs more than once in code, it should be extracted.

Contributor Author


Generally, you are right. The problem with having a static variable containing a property key, however, is that you can break backwards compatibility while all tests still automatically succeed, because they all reference the common variable. A sketch of this failure mode follows below.
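
A self-contained illustration of that failure mode (the constant, factory method, and checks below are hypothetical, not code from this PR):

import java.util.Collections;
import java.util.Map;

class PropertyKeyCompatSketch {
    // Imagine a later refactoring renames this value to "format.kind".
    public static final String FORMAT_TYPE = "format.type";

    static Map<String, String> requiredContext() {
        return Collections.singletonMap(FORMAT_TYPE, "test-format");
    }

    public static void main(String[] args) {
        // Keeps passing after the rename, silently breaking user configs:
        System.out.println(requiredContext().containsKey(FORMAT_TYPE)); // true either way
        // Catches the compatibility break, because the literal pins the key:
        System.out.println(requiredContext().containsKey("format.type"));
    }
}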

/**
* Table source factory for testing with a wildcard format ("format.*").
*/
class TestWildcardFormatTableFactory extends TableSourceFactory[Row] {
Contributor

@pnowojski pnowojski Jul 9, 2018


Where is this class used?

props.put(FORMAT_TYPE, "test")
props.put("format.type", "not-test")
props.put("format.not-test-property", "wildcard-property")
TableSourceFactoryService.findAndCreateTableSource(props.toMap)
Contributor

@pnowojski pnowojski Jul 9, 2018


assert that TestWildcardFormatTableFactory was found?


if (requiredContextJava != null) {
requiredContextJava.asScala.map(e => (e._1.toLowerCase, e._2)).toMap
} else {
Map[String, String]()
Contributor


checkNotNull(requiredContextJava)? The interface doesn't seem to allow for nulls.
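
A minimal sketch of the suggested fail-fast check, written in Java with java.util.Objects for illustration (the surrounding code is Scala, and Flink's own checkNotNull utility would serve the same purpose):

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class NormalizeContextSketch {
    static Map<String, String> normalizeContext(Map<String, String> requiredContext) {
        // The interface does not permit null, so fail fast instead of
        // silently substituting an empty map.
        Objects.requireNonNull(requiredContext, "requiredContext() must not return null");
        Map<String, String> normalized = new HashMap<>();
        requiredContext.forEach((k, v) -> normalized.put(k.toLowerCase(), v));
        return normalized;
    }
}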

Contributor Author

twalthr commented Jul 11, 2018

Thanks for the in-depth review @pnowojski. I hope I could address most of your comments. Since this PR heavily overlaps with #6201, and that PR needs a review and some additional work as well, I will close this PR for now and open a new PR with a clean, unified table sources/sinks/formats story. We can continue the discussions here, and I will make sure that the changes are applied to the new PR as well.
