Skip to content

Conversation

JingsongLi
Copy link
Contributor

What is the purpose of the change

Now PartitionableTableSource and PartitionableTableSink have "getPartitionFieldNames" method, this should be removed, and planner rules should get it from CatalogManager.

The partition field names are the information of Table, source/sink should only be fed with such information but not get them out of it.

Brief change log

See commits.

Verifying this change

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

@flinkbot
Copy link
Collaborator

flinkbot commented Oct 16, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 6a82d9f (Wed Dec 04 14:47:23 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Oct 16, 2019

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build

Copy link
Contributor

@docete docete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR. I leave some comments.
btw: the last 3 commit's have wrong commit msg (table-planner-planner -> table-planner-blink).

@JingsongLi JingsongLi force-pushed the removeGetPartitionSourceSink branch from f20e67a to 3c1480f Compare October 24, 2019 03:35
@JingsongLi
Copy link
Contributor Author

Thanks for your PR. I leave some comments.
btw: the last 3 commit's have wrong commit msg (table-planner-planner -> table-planner-blink).

Thanks for your review, updated.

@JingsongLi JingsongLi force-pushed the removeGetPartitionSourceSink branch from 3c1480f to e3ba6b1 Compare October 24, 2019 07:00
@docete
Copy link
Contributor

docete commented Oct 24, 2019

LGTM now!

@JingsongLi
Copy link
Contributor Author

@wuchong @KurtYoung Can you take a look?

@KurtYoung
Copy link
Contributor

For LogicalSink, my feeling is we should provide enough information such as partition keys at first step instead of looking to catalog manager during conversion. Have you considered about this approach?

@KurtYoung
Copy link
Contributor

In PlannerBase, we actually already know whether it's a PartitionableTableSink

@JingsongLi
Copy link
Contributor Author

For LogicalSink, my feeling is we should provide enough information such as partition keys at first step instead of looking to catalog manager during conversion. Have you considered about this approach?

But consider the future needs, we need get some other informations from catalog table or catalog, we don't need to pass all informations to LogicalSink, for example, we need do the partition pruning from catalog, it is better to just pass an identifier.

@KurtYoung
Copy link
Contributor

Consider a temporal partition table, I'm not sure such table would stored also in CatalogManager. So you have to write the logic like:
tryGetTableFromTemporalStorage(); if not temporal, get it from CatalogManger every time you want to access some information of the table.

Another solution is we can pass in the CatalogTable? It serves like a meta about the table you trying to read and write.

Copy link
Member

@wuchong wuchong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling is we should pass in all the required information on CatalogTable to LogicalSink,
other information we can access them via Catalog. I think this is also the way for source, i.e. watermark, computed column, partition.

@JingsongLi
Copy link
Contributor Author

Consider a temporal partition table, I'm not sure such table would stored also in CatalogManager. So you have to write the logic like:
tryGetTableFromTemporalStorage(); if not temporal, get it from CatalogManger every time you want to access some information of the table.

Another solution is we can pass in the CatalogTable? It serves like a meta about the table you trying to read and write.

According to #9971 , temporal table is stored in CatalogManger, so we can get the informations from CatalogManger.

CatalogTable not contains identifier information. Catalog contains more information than it does, such as stats.

@KurtYoung
Copy link
Contributor

If you want, you can pass both table identifier as well as CatalogTable. The key point is we want to reduce unnecessary access to catalog manager.

@JingsongLi
Copy link
Contributor Author

Hi @KurtYoung @wuchong , can you explain more? Why we want to reduce unnecessary access to catalog manager?

@KurtYoung
Copy link
Contributor

3 minor reasons, none of them are critical but kind of bothers me:

  1. Accessing catalog manager may introduce external system access and could cost some time which will increase the optimization duration.
  2. Access catalog manager multiple times could cause data inconsistency. We already get such information before entering to optimization phase, and information might changed when you look up the catalog manager again.
  3. As pointed out in ForwardFields Optimizer integration #2, we already get all the information you needed before optimization, why bother to get it again?

@JingsongLi
Copy link
Contributor Author

3 minor reasons, none of them are critical but kind of bothers me:

  1. Accessing catalog manager may introduce external system access and could cost some time which will increase the optimization duration.
  2. Access catalog manager multiple times could cause data inconsistency. We already get such information before entering to optimization phase, and information might changed when you look up the catalog manager again.
  3. As pointed out in ForwardFields Optimizer integration #2, we already get all the information you needed before optimization, why bother to get it again?

Thanks @KurtYoung to explain it. none of them are critical but can convince me, I'll update it. But I think I can keep the pass of identifier and catalog manager, what do you think?

@KurtYoung
Copy link
Contributor

You mean catalog table, not catalog manager, right?

3 minor reasons, none of them are critical but kind of bothers me:

  1. Accessing catalog manager may introduce external system access and could cost some time which will increase the optimization duration.
  2. Access catalog manager multiple times could cause data inconsistency. We already get such information before entering to optimization phase, and information might changed when you look up the catalog manager again.
  3. As pointed out in ForwardFields Optimizer integration #2, we already get all the information you needed before optimization, why bother to get it again?

Thanks @KurtYoung to explain it. none of them are critical but can convince me, I'll update it. But I think I can keep the pass of identifier and catalog manager, what do you think?

You mean catalog table, not catalog manager, right?

@JingsongLi
Copy link
Contributor Author

You mean catalog table, not catalog manager, right?

I mean pass identifier and catalog table to logical sink and logical source, and keep catalog manager in FlinkContext.

@KurtYoung
Copy link
Contributor

I'm ok with table identifier and catalog table, but not introduce catalog manager to FlinkContext yet. You don't need it in this PR, right?

@JingsongLi
Copy link
Contributor Author

I'm ok with table identifier and catalog table, but not introduce catalog manager to FlinkContext yet. You don't need it in this PR, right?

OK, I'll remove catalog manager and identifier.

@JingsongLi JingsongLi force-pushed the removeGetPartitionSourceSink branch from e3ba6b1 to e5c0586 Compare October 24, 2019 10:22
Copy link
Contributor

@KurtYoung KurtYoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i might have found some logic flaws with the old design. Let me know whether it make sense to you.

isStreamingMode,
FlinkStatistic.builder().tableStats(tableStats).build());
FlinkStatistic.builder().tableStats(tableStats).build(),
null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we also have CatalogTable for such TableSource wrapped table in the future? If not, I would suggest to change CatalogTable to Opion[CatalogTable] in TableSourceTable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I have some concern to put the whole CatalogTable in TableSourceTable. Because in CatalogTable, it may contains all the computed columns defined in DDL. But not all of them are retained in source, some of them may be applied to the following LogicalProject. Could we only put the information we need in TabeSourceTable and TableSinkTable? e.g. List<String> partitionKeys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say let's use the information from CatalogTable more carefully then... I can imagine partition keys are just a starter, we might need more and more information from CatalogTable in the future. CatalogTable is some kind of meta information about the table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @KurtYoung , After temporary table support, every table should have CatalogTable. I think we can keep not Option.
Hi @wuchong , I agree with kurt, Actually, source may also need to know compute columns. It needs to know which fields it does not read. (except compute columns expressions)

Copy link
Member

@wuchong wuchong Oct 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source may also need to know compute columns. It needs to know which fields it does not read. (except compute columns expressions)

Actually, we will retain some expression in source (for rowtime generation), then we have to have another field in TableSourceTable to keep this information. say generatedExpressions, this might be confused with the expressions in CatalogTable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jark, I don't quite understand the expression in this source, but I think we need a good name to clarify it. It's may not just the TableSourceTable that saves them together? I think they are all attributes of table.

getTableSink(identifier).map(sink => {
TableSinkUtils.validateSink(catalogSink, identifier, sink)
val partKeys =
catalogManager.getTable(identifier).get().asInstanceOf[CatalogTable].getPartitionKeys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check the catalog type before cast to CatalogTable? For example, it would be a CatalogView when get table from catalog manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getTableSink(identifier) already has this check, I will change the return type of getTableSink and get CatalogTable directly.

// translate the Table into a DataSet and provide the type that the TableSink expects.
val result: DataSet[T] = translate(table)(outputType)
// Give the DataSet to the TableSink to emit it.
batchSink.emitDataSet(shuffleByPartitionFieldsIfNeeded(batchSink, result))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why deleting these logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first, The former logic is to add shuffle no matter whether it is a static partition or not.
second, after we remove getPartitionFields from PartitionableTableSink, here, we can not get the partitions fields. At present, I don't want to optimize dynamic partition shuffle on legacy planner again.

case partitionableSink: PartitionableTableSink
if partitionableSink.getPartitionFieldNames != null
&& partitionableSink.getPartitionFieldNames.nonEmpty =>
case partitionableSink: PartitionableTableSink =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment on BatchExecSink, should we just put the validation here? (For partitioned table, we need to make sure the sink is PartitionableTableSink)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have validateSink above, no need to validate here.

@JingsongLi
Copy link
Contributor Author

Rebased resolve conflict and fix comments.

@JingsongLi JingsongLi force-pushed the removeGetPartitionSourceSink branch from e5c0586 to 4c59ed7 Compare October 25, 2019 02:42
Copy link
Contributor

@KurtYoung KurtYoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two commit messages should be adjusted. Not get partition keys from catalog manager, but from catalog table.


private def getTableSink(objectIdentifier: ObjectIdentifier): Option[TableSink[_]] = {
JavaScalaConversionUtil.toScala(catalogManager.getTable(objectIdentifier)) match {
private def getTableSink(identifier: ObjectIdentifier): Option[(CatalogTable, TableSink[_])] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identifier -> tableIdentifier

val tableSourceTable = table.unwrap(classOf[TableSourceTable[_]])

if (!tableSourceTable.tableSource.isInstanceOf[PartitionableTableSource]) {
throw new TableException(s"Table(${table.getQualifiedName}) with partition keys" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It the table is partitioned, but we have a non PartitionableTableSource, couldn't we just skip partition prune and read the whole data instead?
Throwing an exception doesn't seem to be right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I don't think it's possible to use it like this, it can.

}
}
case _ => throw new TableException(
s"Table(${sinkNode.sinkName}) with partition keys should be a PartitionableTableSink.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need PartitionableTableSink to write data to partitioned table: $tableName

sinkNode.sink match {
case partitionSink: PartitionableTableSink =>
val partKeys = sinkNode.catalogTable.getPartitionKeys
if (!partKeys.isEmpty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can assert part keys are non empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isPartitioned already judge it, I should remove this judge.

if (!partKeys.isEmpty) {
val partitionIndices =
partKeys.map(partitionSink.getTableSchema.getFieldNames.indexOf(_))
// validate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't have to validate again? we can move all validation logic to TableSinkUtils:validate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partKeys and TableSchema are come from catalog manager, they already validate by catalog manager, I think I can remove it. (Not come from sink or source)

requiredTraitSet = requiredTraitSet.plus(
FlinkRelDistribution.hash(partitionIndices
.map(Integer.valueOf), requireStrict = false))
requiredTraitSet = requiredTraitSet.plus(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this into if (partitionSink.configurePartitionGrouping(true)) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, hash shuffle will add anyway. configurePartitionGrouping only control sort.
In streaming mode, will add hash shuffle, but not sort for configurePartitionGrouping.

s"${partitionFields.get(idx)} must be in the schema.")
}
}
if (sinkNode.catalogTable != null && sinkNode.catalogTable.isPartitioned) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as BatchExecSinkRule

val sinkProperties = catalogTable.toProperties
Option(TableFactoryService.find(classOf[TableSinkFactory[_]], sinkProperties)
val sinkProperties = table.toProperties
Option(table, TableFactoryService.find(classOf[TableSinkFactory[_]], sinkProperties)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain what's the difference between TableFactoryUtil.createTableSinkForCatalogTable and TableFactoryService.find(classOf[TableSinkFactory[_]], sinkProperties)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Catalog.getTableFactory.
Option 1 use Catalog.getTableFactory to get TableFactory and create sink.
Option 2 use TableFactoryService to create TableFactory and create sink.
The reason why we need option 1 is that hive table factory can not find by TableFactoryService, but I think this can be improved in future.

@JingsongLi JingsongLi force-pushed the removeGetPartitionSourceSink branch from 4c59ed7 to ca20ef1 Compare October 25, 2019 08:24
@JingsongLi JingsongLi closed this Oct 25, 2019
@JingsongLi JingsongLi reopened this Oct 25, 2019
@JingsongLi
Copy link
Contributor Author

Two commit messages should be adjusted. Not get partition keys from catalog manager, but from catalog table.

Thanks for your review, updated.

Copy link
Contributor

@KurtYoung KurtYoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some final comments, the old logic are really a mess.


val partitionFieldNames = tableSourceTable.catalogTable.getPartitionKeys.toSeq.toArray[String]

if (!partitionFieldNames.isEmpty) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this, and it's a really big if statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed this one.

if partitionableSink.getPartitionFieldNames != null
&& partitionableSink.getPartitionFieldNames.nonEmpty =>
case partitionableSink: PartitionableTableSink =>
partitionableSink.setStaticPartition(catalogSink.getStaticPartitions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why we set the static partition information here but not during StreamExecSinkRule or BatchExecSinkRule.

Copy link
Contributor Author

@JingsongLi JingsongLi Oct 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just for the test in PartitionableSinkITCase.testInsertWithStaticPartitions, it get staticPartitions from the origin Sink, if in ExecSinkRule, it will be another copied sink.
I think we can modify this in #9796

@JingsongLi JingsongLi force-pushed the removeGetPartitionSourceSink branch from ca20ef1 to 6a82d9f Compare October 25, 2019 09:17
Copy link
Contributor

@KurtYoung KurtYoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, +1

@KurtYoung KurtYoung closed this in 335c4f7 Oct 25, 2019
@JingsongLi JingsongLi deleted the removeGetPartitionSourceSink branch October 29, 2019 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants