
[FLINK-8866][Table API & SQL] Add support for unified table sink instantiation #6201

Closed
wants to merge 3 commits

Conversation

suez1224

(The sections below can be removed for hotfixes of typos)

What is the purpose of the change

Add interfaces to support unified table sink configuration and instantiation. Consolidate table source and table sink configuration and instantiation.

Brief change log

  • Consolidate table sink and table source instantiation with TableConnectorFactory{Service}.
  • Add support to register a Calcite table with both tableSource and tableSink.
  • Add Insert command support in SQL client.
  • Add CsvTableSinkFactory.

Verifying this change

This change added tests and can be verified as follows:

  • Added integration tests for registering table source and sink tables under the same name.
  • Added integration tests for the INSERT INTO command in the SQL Client.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

@suez1224 suez1224 force-pushed the FLINK-8866-2 branch 3 times, most recently from a7ce69b to 239aba3 on June 22, 2018 19:16
Contributor

@twalthr twalthr left a comment


Thank you for this PR @suez1224. I had a first look at the change and added some feedback. I might have a second look tomorrow. In general, it would be great if we could split SQL Client and flink-table changes and get the core feature in first.

One thing that we need to define more clearly is how to map the time attributes from the query to the sink. Proctime can be ignored but how do we map rowtime? We could in theory use information we get from the rowtime descriptor. If it is from-source we don't have to worry about it, if it is from-field we should put the rowtime exactly into this field. But I don't know how we handle multiple rowtimes in the future.

What do you think?


import java.util

/**
Contributor


Add the updated comment again.

@@ -16,21 +16,18 @@
* limitations under the License.
*/

package org.apache.flink.table.sources
package org.apache.flink.table.connector
Contributor


Use plural connectors

Author


Done

* the current classpath to be found.
*/
trait TableSourceFactory[T] {
trait TableConnectorFactory[T] {
Contributor


Actually, we could also simplify this and call it TableFactory. What do you think? We also say CREATE TABLE, not CREATE TABLE CONNECTOR, in SQL.

Author


Sounds good. Also, I've updated the DDL design doc to call it TABLE CONNECTOR, which I think is clearer.

Contributor


@suez1224 Actually, I liked CREATE TABLE because it is closer to SQL. The reason why I proposed TableFactory was because the factory does much more than just constructing a connector. It also performs schema validation, format discovery and so on.

Contributor


+1. The most baffling thing I've read up to this point was the Table*Connector*Factory naming :-)

* Specify the type of the table connector, check
* [[org.apache.flink.table.descriptors.TableDescriptorValidator]] for all values.
*
* @return the table connector type,.
Contributor


remove comma

Author


done

*
* @return the table connector type,.
*/
def tableType() : String
Contributor


Rename to getType()?

Author


sounds good.

val properties = mutable.Map[String, String]()
properties.put(TableDescriptorValidator.TABLE_TYPE,
TableDescriptorValidator.TABLE_TYPE_VALUE_SINK)
properties.put(CONNECTOR_TYPE, "test")
Contributor


I would use strings here for everything (not the variables). This allows tests to fail if we refactor one of the properties.
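
For illustration, a minimal sketch of what the hard-coded variant could look like (the keys and values here are assumptions based on the surrounding diff, not copied from the actual test or validator constants):

import scala.collection.mutable

// Hard-coding keys and values instead of referencing the validator constants means
// that renaming one of these properties in the main code base makes this test fail
// rather than silently following along.
val properties = mutable.Map[String, String]()
properties.put("type", "sink")            // assumed value behind TABLE_TYPE / TABLE_TYPE_VALUE_SINK
properties.put("connector.type", "test")  // assumed value behind CONNECTOR_TYPE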

Author


good point, done

new MemoryTableSourceSinkUtil.UnsafeMemoryAppendTableSink().configure(fieldNames, fieldTypes))

tEnv.sqlUpdate("INSERT INTO targetTable SELECT a, b, c, rowtime FROM sourceTable")
tEnv.sqlQuery("SELECT a, e, f, t, rowtime from targetTable")
Contributor


I think we need more test cases about how we handle the time attributes for both table types. Maybe not only ITCases but also unit tests. The configure method is an internal method that should not be called here.

}

class UnsafeMemoryTableSource(tableSchema: TableSchema,
returnType: TypeInformation[Row],
Contributor


We usually indent differently. Take org.apache.flink.table.codegen.CodeGenerator as an example.

Author


done

private <T> void executeUpdateInternal(ExecutionContext<T> context, String query) {
final ExecutionContext.EnvironmentInstance envInst = context.createEnvironmentInstance();

envInst.getTableEnvironment().sqlUpdate(query);
Contributor


Wrap it into a try-catch similar to org.apache.flink.table.client.gateway.local.LocalExecutor#createTable.
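
A rough sketch of that wrapping, written here in Scala for brevity (the actual method is Java; the exception type and message are assumptions modeled on LocalExecutor#createTable, not the real code):

import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.client.gateway.SqlExecutionException

// Wrap the update so parse/validation failures surface as a readable client error.
def executeUpdateSafely(tableEnv: TableEnvironment, query: String): Unit = {
  try {
    tableEnv.sqlUpdate(query)
  } catch {
    case t: Throwable =>
      throw new SqlExecutionException("Invalid SQL update statement.", t)
  }
}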

Contributor


We also need to ship the query config here.

final JobGraph jobGraph = envInst.createJobGraph(jobName);

// create execution
new Thread(new ProgramDeployer<>(context, jobName, jobGraph, null)).start();
Contributor


I think even a detached job needs to return a result. Otherwise you cannot be sure if the job has been submitted or not. E.g., the cluster might not be reachable. In any case, every created thread should be managed by the result store. So we should have a similar architecture as for queries. Maybe instead of CollectStreamResult a StatusResult. Maybe we should do the SQL Client changes in a separate PR?

Author


Yes, let me put it into a separate PR. StatusResult makes sense to me.

@suez1224
Author

@twalthr, for a sink-only table, I don't think the user needs to define any rowtimes on it, since it will never be used as a source. For a table that is both a source and a sink, when registering it as a sink, I think we only need to take care of the 'from-field' columns, since they map to actual data fields in the table. For proctime and 'from-source' columns, we can just ignore them when building the sink schema. Maybe we should have a helper method for building the schema for source and sink separately. Please correct me if I missed something here. What do you think?

@twalthr
Contributor

twalthr commented Jun 29, 2018

@suez1224 Yes sounds good to me. Only from-field timestamps matter right now.

We should also think about the opposite of a timestamp extractor (a timestamp inserter) for cases where the rowtime needs some preprocessing (e.g., concatenating a DATE and a TIME column), but we can deal with such cases in a follow-up issue.

A helper method would be useful. We already have something similar in SchemaValidator for schema derivation.
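
Not part of the PR as-is, but a minimal sketch of what such a helper could look like, using made-up field descriptors rather than the real SchemaValidator types:

// Hypothetical field descriptors; the real code would derive these from the
// schema/rowtime descriptor properties.
sealed trait FieldKind
case object DataField extends FieldKind          // plain physical column
case object ProctimeAttribute extends FieldKind  // processing-time attribute
case object RowtimeFromField extends FieldKind   // rowtime backed by an existing column
case object RowtimeFromSource extends FieldKind  // rowtime taken from the source system

case class FieldSpec(name: String, kind: FieldKind)

// The sink schema keeps physical columns and from-field rowtimes; proctime and
// from-source rowtimes are dropped because they map to no data field in the sink.
def deriveSinkFields(fields: Seq[FieldSpec]): Seq[String] =
  fields.collect {
    case FieldSpec(name, DataField | RowtimeFromField) => name
  }

For example, deriveSinkFields(Seq(FieldSpec("user", DataField), FieldSpec("ts", RowtimeFromField), FieldSpec("proc", ProctimeAttribute))) would yield Seq("user", "ts").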

@fhueske
Contributor

fhueske commented Jun 29, 2018

Hi, I think timestamp fields of source-sink tables should be handled as follows when emitting the table:

  • proc-time: ignore
  • from-field: simply write out the timestamp as part of the row.
  • from-source: write the timestamp separately to the system and remove it from the row. This only works if we can set the timestamp on the sink system. If the system sets the ingestion timestamp on its own, i.e., not the actual value, rows would contain different timestamps when they are ingested. If the sink system does not support setting a timestamp, we cannot allow such a table definition.

@suez1224
Author

suez1224 commented Jul 3, 2018

@fhueske @twalthr thanks for the comments. For from-source, the only systems I know of are Kafka 0.10 and 0.11, which support writing a record along with its timestamp. To support from-source in a table sink, I think we can do the following:

  1. Add a connector property, e.g. connector.support-timestamp. Only if connector.support-timestamp is true will we allow the sink table schema to contain a field with rowtime type from-source; otherwise, an exception will be thrown.
  2. If the condition in 1) is satisfied, we will create a corresponding rowtime field of type LONG in the sink table schema. In TableEnvironment.insertInto(), we will validate the sink schema against the insertion source. Also, in the TableSink.emitDataStream() implementation, we will need to insert a timestamp assigner operator to set StreamRecord.timestamp (should we reuse the existing interface, or create a new timestampInserter interface?) and remove the extra rowtime field from StreamRecord.value before emitting the datastream to the sink (for KafkaTableSink, we will also need to invoke setWriteTimestampToKafka(true)).

Please correct me if I missed something here. What do you think?

@fhueske
Contributor

fhueske commented Jul 3, 2018

Hi @suez1224, that sounds good overall. :-)

A few comments:

  • I would not add a user-facing property connector.support-timestamp because a user chooses that by choosing the connector type. Whether the connector supports writing a system timestamp can be an internal field/annotation/interface of the TableSink that is generated from the properties.
  • Copying the timestamp to the StreamRecord timestamp field can be done with a process function (see the sketch after this comment). Actually, we do that already when converting a Table into a DataStream. Setting the flag in the Kafka TableSink should be easy.
  • Not sure if from-source needs to be supported by the initial version. We could just implement from-field for now, and handle from-source as a follow up issue. Since we are approaching feature freeze, I think this might be a good idea at this point.

What do you think?
Fabian
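
To make the from-source handling concrete, here is a hypothetical sketch in the Scala DataStream API. assignAscendingTimestamps stands in for the process function mentioned above, and the Order type and its fields are made up:

import org.apache.flink.streaming.api.scala._

case class Order(user: String, amount: Double, rowtime: Long)

// Copy the rowtime field into the StreamRecord timestamp and drop it from the
// payload before handing the stream to the sink. A Kafka sink would additionally
// need setWriteTimestampToKafka(true) so the timestamp reaches the external system.
def prepareForTimestampedSink(orders: DataStream[Order]): DataStream[(String, Double)] =
  orders
    .assignAscendingTimestamps(_.rowtime)
    .map(o => (o.user, o.amount))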

@twalthr
Contributor

twalthr commented Jul 3, 2018

I agree with @fhueske. Let's do the from-source in a follow-up issue. I will open a PR soon for FLINK-8558 which separates connector and format. For this I also introduced a method KafkaTableSource#supportsKafkaTimestamps. The KafkaTableFactory can read this property and throw an exception accordingly.
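
As a rough illustration of that kind of capability check (the method and flag names are assumptions, not the actual KafkaTableFactory code):

import org.apache.flink.table.api.ValidationException

// Hypothetical validation inside a sink factory: reject a from-source rowtime
// when the underlying connector cannot write record timestamps.
def validateRowtimeSupport(hasFromSourceRowtime: Boolean, supportsTimestamps: Boolean): Unit = {
  if (hasFromSourceRowtime && !supportsTimestamps) {
    throw new ValidationException(
      "The connector does not support writing record timestamps, " +
        "so a 'from-source' rowtime cannot be defined for this table.")
  }
}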

@suez1224
Author

suez1224 commented Jul 3, 2018

@twalthr @fhueske sounds good to me. We can handle from-source in a follow-up issue; this PR will not support it.

Contributor

@walterddr walterddr left a comment


Thanks @suez1224 for the PR. It looks really good!

I think @twalthr also had similar comments: it seems there is some unification that can be done to reduce duplication between source and sink.

I will also try directly patching this on to https://github.com/walterddr/AthenaX/tree/upgrade_1.5 to eliminate the need to create a sink provider.

throw new SqlClientException(
"Invalid table 'type' attribute value, only 'source' is supported");
"Invalid table 'type' attribute value, only 'source' or 'sink' is supported");
Contributor


Missing "both"?

Author


good catch. thanks

"Invalid table 'type' attribute value, only 'source' or 'sink' is supported");
}
if (this.tables.containsKey(tableName)) {
throw new SqlClientException("Duplicate table name '" + tableName + "'.");
Contributor


If only "source" and "sink" are allowed, should we let the same name with different types co-exist, e.g. {"name": "t1", "type": "source"} and {"name": "t1", "type": "sink"}? This is actually a follow-up to the previous comment. I think we just need one; either should work.

Author


The current implementation allows only a source and a sink in one table.

registerTableSinkInternal(name, configuredSink)
}

def registerTableSink(name: String, configuredSink: TableSink[_]): Unit = {
Contributor


Could probably move this to the base class TableEnvironment?

import org.apache.calcite.schema.Statistic
import org.apache.calcite.schema.impl.AbstractTable

class TableSourceSinkTable[T1, T2](val tableSourceTableOpt: Option[TableSourceTable[T1]],
Contributor


Huge +1. My understanding is that this will be the overall class to hold a table source, a sink, or both. The name TableSourceSinkTable seems redundant.

*/
object TableSourceFactoryService extends Logging {
class TableConnectorFactoryService[T] extends Logging {
Contributor


+1

*/
object TableSourceFactoryService extends Logging {
class TableConnectorFactoryService[T] extends Logging {
Contributor


Also just TableFactoryService?

* Common class for all descriptors describing a table sink.
*/
abstract class TableSinkDescriptor extends TableDescriptor {
override private[flink] def addProperties(properties: DescriptorProperties): Unit = {
Contributor


+1 Should be able to unify

Shuyi Chen added 2 commits July 6, 2018 16:54
- Consolidate table sink and table source instantiation.
- Add support to register a Calcite table with both tableSource and tableSink.
- Add Insert command support in SQL client.
- Add CsvTableSinkFactory.
@suez1224 suez1224 force-pushed the FLINK-8866-2 branch 4 times, most recently from 4324efd to a709787 on July 7, 2018 22:36
1) add TableFactoryDiscoverable trait
2) add util for handling rowtime/proctime for table schema and unittests
@suez1224 suez1224 closed this Jul 9, 2018
@suez1224 suez1224 reopened this Jul 9, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 12, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 15, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 15, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 15, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 15, 2018
@asfgit asfgit closed this in 9597248 Jul 15, 2018
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018