Skip to content

Conversation

@JingsongLi
Copy link
Contributor

What is the purpose of the change

Now after partition pruning, we will lost stats.
Actually, we have partition stats in catalog.
We need update statistics after partition pruning in PushPartitionIntoTableSourceScanRule.

Brief change log

  • Add merge method to TableStats.
  • Add table identifier to TableSourceTable
  • Update stats in PushPartitionIntoTableSourceScanRule.

Verifying this change

CatalogStatisticsTest.testGetPartitionStatsFromCatalog

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs

@flinkbot
Copy link
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 05b3061 (Mon Nov 25 15:39:32 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Nov 25, 2019

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build

Copy link
Contributor

@godfreyhe godfreyhe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the pr @JingsongLi , i left some comments

// Remove tableStats after predicates pushed down
FlinkStatistic.builder().statistic(statistic).tableStats(null).build()
val newStatistic = {
val tableStats = catalogOption match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if no partitions are pruned and the original TableStats is not UNKNOWN, we could use the original TableStats and avoid fetch all partitions' statistics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • First get stats from regular table stats may be not correct.
  • In the next step, I want to introduce listPartitionsByFilter, so we can not know the msg: if no partitions are pruned.

}

@Test
public void testGetPartitionStatsFromCatalog() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add some tests about UNKNOWN stats

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add others except unknown fields, because CatalogColumnStatisticsDataBase not support unknown now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this blocker is test blocker, we can wait it until 1.10 release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the PR is: #10394

@JingsongLi
Copy link
Contributor Author

Thanks @godfreyhe for your review, updated.

}

@Test
public void testGetPartitionStatsFromCatalog() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@godfreyhe godfreyhe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, +1 to merge

val isStreamingMode: Boolean,
val catalogTable: CatalogTable)
val catalogTable: CatalogTable,
val tableIdentifier: Option[ObjectIdentifier])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think names already played this role

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use names is quite hack:

if (names.size() == 3) {
  // it has ObjectIdentifier
  return ObjectIdentifier.of(names.get(0), names.get(1), names.get(2));
} else {
  // there is no ObjectIdentifier.
}

So I think in future, we should use ObjectIdentifier always instead of names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then i don't think it will be an Option

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, only one way should go to here without ObjectIdentifier.
It is temporary table with table api(TableSourceQueryOperation). But I can try to fix it.

val newStatistic = {
val tableStats = catalogOption match {
case Some(catalog) =>
def mergePartitionStats(): TableStats = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why creating an inner function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scala don't have code to break the loop.

override def onMatch(call: RelOptRuleCall): Unit = {
val filter: Filter = call.rel(0)
val scan: LogicalTableScan = call.rel(1)
val context = call.getPlanner.getContext.unwrap(classOf[FlinkContext])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need some tests for this rule

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have PushPartitionIntoTableSourceScanRuleTest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it doesn't cover your stats handle logics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CatalogStatisticsTest is more professional to test stats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So who will cover the correctness about the stats after partition pruning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CatalogStatisticsTest will trigger partition pruning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, got it

@KurtYoung
Copy link
Contributor

Please rebase this

@JingsongLi
Copy link
Contributor Author

@KurtYoung KurtYoung merged commit 0d6b439 into apache:master Dec 6, 2019
@JingsongLi JingsongLi deleted the partstat branch December 7, 2019 06:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants