
[FLINK-8866][Table API & SQL] Add support for unified table sink instantiation #6201

Closed
wants to merge 3 commits

Conversation

suez1224

(The sections below can be removed for hotfixes of typos)

What is the purpose of the change

Add interfaces to support unified table sink configuration and instantiation. Consolidate table source and table sink configuration and instantiation.

Brief change log

  • Consolidate table sink and table source instantiation with TableConnectorFactory{Service}.
  • Add support to register a Calcite table with both tableSource and tableSink.
  • Add Insert command support in SQL client.
  • Add CsvTableSinkFactory.

Verifying this change

This change added tests and can be verified as follows:

  • Added integration tests for registering table source and sink tables under the same name.
  • Added integration tests for the INSERT INTO command in the SQL Client.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

@suez1224 suez1224 force-pushed the FLINK-8866-2 branch 3 times, most recently from a7ce69b to 239aba3 on June 22, 2018 19:16
Contributor

@twalthr twalthr left a comment


Thank you for this PR @suez1224. I had a first look at the change and added some feedback. I might have a second look tomorrow. In general, it would be great if we could split SQL Client and flink-table changes and get the core feature in first.

One thing that we need to define more clearly is how to map the time attributes from the query to the sink. Proctime can be ignored but how do we map rowtime? We could in theory use information we get from the rowtime descriptor. If it is from-source we don't have to worry about it, if it is from-field we should put the rowtime exactly into this field. But I don't know how we handle multiple rowtimes in the future.

What do you think?


import java.util

/**
Contributor


Add the updated comment again.

@@ -16,21 +16,18 @@
* limitations under the License.
*/

package org.apache.flink.table.sources
package org.apache.flink.table.connector
Contributor


Use plural connectors

Author


Done

* the current classpath to be found.
*/
trait TableSourceFactory[T] {
trait TableConnectorFactory[T] {
Contributor


Actually, we could also simplify this and call it TableFactory. What do you think? We also say CREATE TABLE, not CREATE TABLE CONNECTOR, in SQL.

Author


Sounds good. Also, I've updated the DDL design doc to call it TABLE CONNECTOR, which I think is clearer.

Contributor


@suez1224 Actually, I liked CREATE TABLE because it is closer to SQL. The reason why I proposed TableFactory was because the factory does much more than just constructing a connector. It also performs schema validation, format discovery and so on.

Contributor


+1. The most baffling thing I've read up to this point was the Table*Connector*Factory naming :-)

* Specify the type of the table connector, check
* [[org.apache.flink.table.descriptors.TableDescriptorValidator]] for all values.
*
* @return the table connector type,.
Contributor


remove comma

Author


done

*
* @return the table connector type,.
*/
def tableType() : String
Contributor


Rename to getType()?

Author


sounds good.

val properties = mutable.Map[String, String]()
properties.put(TableDescriptorValidator.TABLE_TYPE,
TableDescriptorValidator.TABLE_TYPE_VALUE_SINK)
properties.put(CONNECTOR_TYPE, "test")
Contributor


I would use strings here for everything (not the variables). This allows tests to fail if we refactor one of the properties.
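
For illustration, a minimal sketch of what the hard-coded variant could look like (the keys and values here are assumptions based on the surrounding diff, not copied from the actual test or validator constants):

import scala.collection.mutable

// Hard-coding keys and values instead of referencing the validator constants means
// that renaming one of these properties in the main code base makes this test fail
// rather than silently following along.
val properties = mutable.Map[String, String]()
properties.put("type", "sink")            // assumed value behind TABLE_TYPE / TABLE_TYPE_VALUE_SINK
properties.put("connector.type", "test")  // assumed value behind CONNECTOR_TYPE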

Author


good point, done

new MemoryTableSourceSinkUtil.UnsafeMemoryAppendTableSink().configure(fieldNames, fieldTypes))

tEnv.sqlUpdate("INSERT INTO targetTable SELECT a, b, c, rowtime FROM sourceTable")
tEnv.sqlQuery("SELECT a, e, f, t, rowtime from targetTable")
Contributor


I think we need more test cases about how we handle the time attributes for both table types. Maybe not only ITCases but also unit tests. The configure method is an internal method that should not be called here.

}

class UnsafeMemoryTableSource(tableSchema: TableSchema,
returnType: TypeInformation[Row],
Contributor


We usually indent differently. Take org.apache.flink.table.codegen.CodeGenerator as an example.

Author


done

private <T> void executeUpdateInternal(ExecutionContext<T> context, String query) {
final ExecutionContext.EnvironmentInstance envInst = context.createEnvironmentInstance();

envInst.getTableEnvironment().sqlUpdate(query);
Contributor


Wrap it into a try-catch similar to org.apache.flink.table.client.gateway.local.LocalExecutor#createTable.
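
A rough sketch of that wrapping, written here in Scala for brevity (the actual method is Java; the exception type and message are assumptions modeled on LocalExecutor#createTable, not the real code):

import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.client.gateway.SqlExecutionException

// Wrap the update so parse/validation failures surface as a readable client error.
def executeUpdateSafely(tableEnv: TableEnvironment, query: String): Unit = {
  try {
    tableEnv.sqlUpdate(query)
  } catch {
    case t: Throwable =>
      throw new SqlExecutionException("Invalid SQL update statement.", t)
  }
}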

Contributor


We also need to ship the query config here.

final JobGraph jobGraph = envInst.createJobGraph(jobName);

// create execution
new Thread(new ProgramDeployer<>(context, jobName, jobGraph, null)).start();
Contributor


I think even a detached job needs to return a result. Otherwise you cannot be sure if the job has been submitted or not. E.g., the cluster might not be reachable. In any case, every created thread should be managed by the result store. So we should have a similar architecture as for queries. Maybe instead of CollectStreamResult a StatusResult. Maybe we should do the SQL Client changes in a separate PR?

Author


Yes, let me put it into a separate PR. StatusResult makes sense to me.

@suez1224
Author

@twalthr, for a sink-only table, I don't think the user needs to define any rowtimes on it, since it will never be used as a source. For a table that is both a source and a sink, when registering it as a sink, I think we only need to take care of the 'from-field' columns, since they map to actual data fields in the table. For proctime and 'from-source' columns, we can just ignore them when building the sink schema. Maybe we should have a helper method for building the schema for source and sink separately. Please correct me if I missed something here. What do you think?

@twalthr
Contributor

twalthr commented Jun 29, 2018

@suez1224 Yes sounds good to me. Only from-field timestamps matter right now.

We should also think about the opposite of a timestamp extractor (a timestamp inserter) for cases where the rowtime needs some preprocessing (e.g., concatenating a DATE and a TIME column), but we can deal with such cases in a follow-up issue.

A helper method would be useful. We already have something similar in SchemaValidator for schema derivation.
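
Not part of the PR as-is, but a minimal sketch of what such a helper could look like, using made-up field descriptors rather than the real SchemaValidator types:

// Hypothetical field descriptors; the real code would derive these from the
// schema/rowtime descriptor properties.
sealed trait FieldKind
case object DataField extends FieldKind          // plain physical column
case object ProctimeAttribute extends FieldKind  // processing-time attribute
case object RowtimeFromField extends FieldKind   // rowtime backed by an existing column
case object RowtimeFromSource extends FieldKind  // rowtime taken from the source system

case class FieldSpec(name: String, kind: FieldKind)

// The sink schema keeps physical columns and from-field rowtimes; proctime and
// from-source rowtimes are dropped because they map to no data field in the sink.
def deriveSinkFields(fields: Seq[FieldSpec]): Seq[String] =
  fields.collect {
    case FieldSpec(name, DataField | RowtimeFromField) => name
  }

For example, deriveSinkFields(Seq(FieldSpec("user", DataField), FieldSpec("ts", RowtimeFromField), FieldSpec("proc", ProctimeAttribute))) would yield Seq("user", "ts").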

@fhueske
Contributor

fhueske commented Jun 29, 2018

Hi, I think timestamp fields of source-sink tables should be handled as follows when emitting the table:

  • proc-time: ignore
  • from-field: simply write out the timestamp as part of the row.
  • from-source: write the timestamp separately to the system and remove it from the row. This only works if we can set the timestamp on the sink system. If the system sets the ingestion timestamp on its own, i.e., not the actual value, rows would contain different timestamps when they are ingested. If the sink system does not support setting a timestamp, we cannot allow such a table definition.

@suez1224
Author

suez1224 commented Jul 3, 2018

@fhueske @twalthr thanks for the comments. For from-source, the only systems I know of are Kafka 0.10 and 0.11, which support writing a record along with its timestamp. To support from-source in a table sink, I think we can do the following:

  1. Add a connector property, e.g. connector.support-timestamp. Only if connector.support-timestamp is true will we allow the sink table schema to contain a field with rowtime type from-source; otherwise, an exception will be thrown.
  2. If the condition in 1) is satisfied, we will create a corresponding rowtime field of type LONG in the sink table schema. In TableEnvironment.insertInto(), we will validate the sink schema against the insertion source. Also, in the TableSink.emitDataStream() implementation, we will need to insert a timestamp assigner operator to set StreamRecord.timestamp (should we reuse the existing interface, or create a new timestampInserter interface?) and remove the extra rowtime field from StreamRecord.value before emitting the datastream to the sink (for KafkaTableSink, we will also need to invoke setWriteTimestampToKafka(true)).

Please correct me if I missed something here. What do you think?

@fhueske
Contributor

fhueske commented Jul 3, 2018

Hi @suez1224, that sounds good overall. :-)

A few comments:

  • I would not add a user-facing property connector.support-timestamp because a user chooses that by choosing the connector type. Whether the connector supports writing a system timestamp can be an internal field/annotation/interface of the TableSink that is generated from the properties.
  • Copying the timestamp to the StreamRecord timestamp field can be done with a process function (see the sketch after this comment). Actually, we do that already when converting a Table into a DataStream. Setting the flag in the Kafka TableSink should be easy.
  • Not sure if from-source needs to be supported by the initial version. We could just implement from-field for now, and handle from-source as a follow up issue. Since we are approaching feature freeze, I think this might be a good idea at this point.

What do you think?
Fabian
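
To make the from-source handling concrete, here is a hypothetical sketch in the Scala DataStream API. assignAscendingTimestamps stands in for the process function mentioned above, and the Order type and its fields are made up:

import org.apache.flink.streaming.api.scala._

case class Order(user: String, amount: Double, rowtime: Long)

// Copy the rowtime field into the StreamRecord timestamp and drop it from the
// payload before handing the stream to the sink. A Kafka sink would additionally
// need setWriteTimestampToKafka(true) so the timestamp reaches the external system.
def prepareForTimestampedSink(orders: DataStream[Order]): DataStream[(String, Double)] =
  orders
    .assignAscendingTimestamps(_.rowtime)
    .map(o => (o.user, o.amount))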

@twalthr
Contributor

twalthr commented Jul 3, 2018

I agree with @fhueske. Let's do the from-source in a follow-up issue. I will open a PR soon for FLINK-8558 which separates connector and format. For this I also introduced a method KafkaTableSource#supportsKafkaTimestamps. The KafkaTableFactory can read this property and throw an exception accordingly.
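
As a rough illustration of that kind of capability check (the method and flag names are assumptions, not the actual KafkaTableFactory code):

import org.apache.flink.table.api.ValidationException

// Hypothetical validation inside a sink factory: reject a from-source rowtime
// when the underlying connector cannot write record timestamps.
def validateRowtimeSupport(hasFromSourceRowtime: Boolean, supportsTimestamps: Boolean): Unit = {
  if (hasFromSourceRowtime && !supportsTimestamps) {
    throw new ValidationException(
      "The connector does not support writing record timestamps, " +
        "so a 'from-source' rowtime cannot be defined for this table.")
  }
}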

@suez1224
Author

suez1224 commented Jul 3, 2018

@twalthr @fhueske sounds good to me. We can handle from-source in a follow-up issue; this PR will not support it.

Contributor

@walterddr walterddr left a comment


Thanks @suez1224 for the PR. It looks really good!

I think @twalthr also had similar comments: it seems there is some unification that can be done to reduce duplication between source and sink.

I will also try directly patching this on to https://github.com/walterddr/AthenaX/tree/upgrade_1.5 to eliminate the need to create a sink provider.

throw new SqlClientException(
"Invalid table 'type' attribute value, only 'source' is supported");
"Invalid table 'type' attribute value, only 'source' or 'sink' is supported");
Contributor


Missing "both"?

Author


good catch. thanks

"Invalid table 'type' attribute value, only 'source' or 'sink' is supported");
}
if (this.tables.containsKey(tableName)) {
throw new SqlClientException("Duplicate table name '" + tableName + "'.");
Contributor


If only "source" and "sink" are allowed, should we let the same name with different types co-exist, e.g. {"name": "t1", "type": "source"} and {"name": "t1", "type": "sink"}? This is actually a follow-up to the previous comment. I think we just need one; either should work.

Author


The current implementation allows only a source and a sink in one table.

registerTableSinkInternal(name, configuredSink)
}

def registerTableSink(name: String, configuredSink: TableSink[_]): Unit = {
Contributor


Could probably move this to the base class TableEnvironment?

import org.apache.calcite.schema.Statistic
import org.apache.calcite.schema.impl.AbstractTable

class TableSourceSinkTable[T1, T2](val tableSourceTableOpt: Option[TableSourceTable[T1]],
Contributor


Huge +1. My understanding is that this will be the overall class to hold a table source, a sink, or both. The name TableSourceSinkTable seems redundant.

*/
object TableSourceFactoryService extends Logging {
class TableConnectorFactoryService[T] extends Logging {
Contributor


+1

*/
object TableSourceFactoryService extends Logging {
class TableConnectorFactoryService[T] extends Logging {
Contributor


Also just TableFactoryService?

* Common class for all descriptors describing a table sink.
*/
abstract class TableSinkDescriptor extends TableDescriptor {
override private[flink] def addProperties(properties: DescriptorProperties): Unit = {
Contributor


+1 Should be able to unify

Shuyi Chen added 2 commits July 6, 2018 16:54
- Consolidate table sink and table source instantiation.
- Add support to register a Calcite table with both tableSource and tableSink.
- Add Insert command support in SQL client.
- Add CsvTableSinkFactory.
@suez1224 suez1224 force-pushed the FLINK-8866-2 branch 4 times, most recently from 4324efd to a709787 on July 7, 2018 22:36
1) add TableFactoryDiscoverable trait
2) add util for handling rowtime/proctime for table schema and unittests
@suez1224 suez1224 closed this Jul 9, 2018
@suez1224 suez1224 reopened this Jul 9, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 12, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 15, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 15, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 15, 2018
twalthr pushed a commit to twalthr/flink that referenced this pull request Jul 15, 2018
@asfgit asfgit closed this in 9597248 Jul 15, 2018
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018