
[FLINK-12235][hive] Support Hive partition in HiveCatalog #8449

Closed
wants to merge 13 commits

Conversation

@lirui-apache (Contributor) commented May 15, 2019

What is the purpose of the change

To implement partition-related operations in HiveCatalogBase.

Brief change log

  • Introduced HiveCatalogPartition to represent a partition that can be handled by HiveCatalog.
  • Implemented partition-related operations in HiveCatalogBase. Although we intend to let HiveCatalog and GenericHiveMetastoreCatalog share the implementations, this PR only enables/tests these operations for HiveCatalog.
  • Moved partition-related tests from GenericInMemoryCatalogTest to CatalogTestBase, so that GenericInMemoryCatalogTest and HiveCatalogTest can share these test cases.

Verifying this change

This PR is tested using the partition-related test cases in CatalogTestBase.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

@flinkbot (Collaborator) commented
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@lirui-apache (Contributor Author) commented

@xuefuz @bowenli86 please take a look. Thanks.

@zjuwangg (Contributor) left a comment

Looks good to me! Thanks for your effort on this.

@xuefuz (Contributor) left a comment

Went over the changes once. Had some comments.

@bowenli86 (Member) left a comment

@lirui-apache thanks for the PR. I left some comments in addition to Xuefu's.

* @throws PartitionSpecInvalidException thrown if partitionSpec and partitionKeys have different sizes,
* or any key in partitionKeys doesn't exist in partitionSpec.
*/
List<String> getFullPartitionValues(ObjectPath tablePath, CatalogPartitionSpec partitionSpec, List<String> partitionKeys)
Member commented:

What does "full" mean here? Probably rename to "getOrdered/ArrangedPartitionValues" to conform to its logic?

nit: can we move tablePath to the end as the last param, given it's only used here to build an error message without any real effect?

@lirui-apache (Contributor Author) replied:

It means a full spec, which contains values for all partition keys; a partial spec contains values for only a subset of the partition keys. Operations like createPartition and dropPartition require a full spec, while operations like listPartitions can accept a partial spec. Let me rename it to getOrderedFullPartitionValues, to indicate that it orders the values and requires a full spec.
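
For context, a minimal sketch of what such an ordering helper could look like inside HiveCatalogBase (the PartitionSpecInvalidException constructor signature and the getName() accessor for the catalog name are assumptions here, not confirmed by this thread):

// Returns the values of partitionSpec arranged in the order of partitionKeys.
// Throws if partitionSpec is not a full spec, i.e. it doesn't cover every partition key.
private List<String> getOrderedFullPartitionValues(
    CatalogPartitionSpec partitionSpec, List<String> partitionKeys, ObjectPath tablePath)
    throws PartitionSpecInvalidException {
  Map<String, String> spec = partitionSpec.getPartitionSpec();
  if (spec.size() != partitionKeys.size()) {
    throw new PartitionSpecInvalidException(getName(), partitionKeys, tablePath, partitionSpec);
  }
  List<String> values = new ArrayList<>(spec.size());
  for (String key : partitionKeys) {
    if (!spec.containsKey(key)) {
      throw new PartitionSpecInvalidException(getName(), partitionKeys, tablePath, partitionSpec);
    }
    values.add(spec.get(key));
  }
  return values;
}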

@lirui-apache (Contributor Author) commented

@bowenli86 @xuefuz thanks for your comments. Please take a look at the updated PR. Thanks.

@xuefuz (Contributor) left a comment

As an FYI, PR #8490 will remove some classes that are changed here. Not sure which PR should get in first, but it seems one of them will have to rebase.

@bowenli86 (Member) commented May 20, 2019

I'd like to get #8480 in first and rebase this PR, since #8480 is a much larger change set and not easy to rebase. Same as #8477.

@lirui-apache (Contributor Author) commented

Rebased.
@xuefuz @bowenli86 please take another look. Thanks.

@bowenli86 (Member) left a comment

@lirui-apache Thank you very much for the update!

I just spotted that several API implementations currently work like this: 1) get a raw Hive table, 2) parse part of the raw table. The latter step duplicates logic in instantiateHiveCatalogTable(). E.g. ensureTableAndPartitionMatch() parses FLINK_PROPERTY_IS_GENERIC, instantiateHivePartition() parses partition keys, and ensurePartitionedTable() parses the raw table's partition key size, all of which we could get by parsing the raw table into a CatalogTable via instantiateHiveCatalogTable() up front. The current duplication also means that if we change some general logic in parsing a Hive table, we need to change two places. Thus I wonder if it makes sense to parse the raw table as a whole at the beginning, rather than having scattered places each parsing only part of it. We could then remove util methods such as getFieldNames(), which is only used to get the partition keys that are already available in CatalogTable.

For example, change

public void createPartition(...) {
  Table hiveTable = getHiveTable(tablePath);
  ensureTableAndPartitionMatch(hiveTable, partition);
  ensurePartitionedTable(tablePath, hiveTable);
  try {
    client.add_partition(instantiateHivePartition(hiveTable, partitionSpec, partition));
  } ...
}

to something like:

public void createPartition(...) {
  Table hiveTable = getHiveTable(tablePath);
  CatalogBaseTable catalogTable = instantiateHiveCatalogTable(hiveTable);
  ... checking whether catalogTable and catalogPartition types match would be much easier here ...
  ... checking whether catalogTable is partitioned would be easier here ...
  try {
    client.add_partition(
      instantiateHivePartition(catalogTable, partitionSpec, partition, hiveTable.getSd()));
  } ...
}

@@ -639,8 +640,12 @@ public void createPartition(ObjectPath tablePath, CatalogPartitionSpec partition
checkNotNull(partitionSpec, "CatalogPartitionSpec cannot be null");
checkNotNull(partition, "Partition cannot be null");

checkArgument(partition instanceof HiveCatalogPartition, "Currently only supports HiveCatalogPartition");
Member commented:

We currently throw CatalogException if the type doesn't match. checkArgument() will throw IllegalArgumentException.
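
For illustration, a minimal sketch of the suggested fix, replacing the checkArgument() precondition with an explicit check that throws CatalogException (the exact message text is an assumption):

if (!(partition instanceof HiveCatalogPartition)) {
  throw new CatalogException(
    String.format("Currently only supports HiveCatalogPartition, but got %s",
      partition.getClass().getName()));
}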

@@ -740,10 +745,13 @@ public void alterPartition(ObjectPath tablePath, CatalogPartitionSpec partitionS
checkNotNull(partitionSpec, "CatalogPartitionSpec cannot be null");
checkNotNull(newPartition, "New partition cannot be null");

checkArgument(newPartition instanceof HiveCatalogPartition, "Currently only supports HiveCatalogPartition");
Member commented:

ditto

boolean isGeneric = Boolean.valueOf(hiveTable.getParameters().get(FLINK_PROPERTY_IS_GENERIC));
if ((isGeneric && catalogPartition instanceof HiveCatalogPartition) ||
(!isGeneric && catalogPartition instanceof GenericCatalogPartition)) {
throw new IllegalArgumentException(String.format("Cannot handle %s partition for %s table",
Member commented:

throw CatalogException
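
A sketch of the mismatch check rewritten per this comment (the message arguments are assumed from the truncated diff above; making the method static anticipates the later review comments):

private static void ensureTableAndPartitionMatch(Table hiveTable, CatalogPartition catalogPartition) {
  boolean isGeneric = Boolean.parseBoolean(hiveTable.getParameters().get(FLINK_PROPERTY_IS_GENERIC));
  if ((isGeneric && catalogPartition instanceof HiveCatalogPartition) ||
      (!isGeneric && catalogPartition instanceof GenericCatalogPartition)) {
    // CatalogException is unchecked, so no throws clause is required
    throw new CatalogException(String.format("Cannot handle %s partition for %s table",
      catalogPartition.getClass().getName(), isGeneric ? "generic" : "non-generic"));
  }
}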

* or any key in partitionKeys doesn't exist in partitionSpec.
*/
private List<String> getOrderedFullPartitionValues(CatalogPartitionSpec partitionSpec, List<String> partitionKeys, ObjectPath tablePath)
throws PartitionSpecInvalidException {
Member commented:

nit: one more tab

@lirui-apache (Contributor Author) commented

(quoting @bowenli86's review comment above)

I'm not sure how much benefit this would bring us. It might make ensureTableAndPartitionMatch a little easier -- we could check the type of CatalogBaseTable instead of parsing a property. But I don't think we should take the same approach for ensurePartitionedTable, which is already simple enough. And for APIs like listPartitions, creating a CatalogBaseTable just to get the number of partition columns seems like overkill to me. The same applies to getFieldNames -- e.g. it would be overkill to create a CatalogBaseTable just to get the partition column names in dropPartition.
Maybe an alternative approach to the problem you mentioned is to add more util methods in order to avoid duplication. For example, we could have a util method that decides whether a Hive table is generic, and all the APIs needing this logic would call it. What do you think?
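
A minimal sketch of the proposed util method (the name isGenericTable is hypothetical; FLINK_PROPERTY_IS_GENERIC is the flag discussed above):

// Hypothetical helper: decides whether a raw Hive metastore table is a generic (Flink) table.
private static boolean isGenericTable(Table hiveTable) {
  return Boolean.parseBoolean(hiveTable.getParameters().get(FLINK_PROPERTY_IS_GENERIC));
}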

@bowenli86 (Member) commented

OK, let's leave that part as it is for now.

@lirui-apache (Contributor Author) commented

Rebased and addressed comments.
@bowenli86 @xuefuz let me know if I missed anything. Thanks.

}
}

private Partition instantiateHivePartition(Table hiveTable, CatalogPartitionSpec partitionSpec, CatalogPartition catalogPartition)
Contributor commented:

static?

return new HiveCatalogPartition(hivePartition.getParameters(), hivePartition.getSd().getLocation());
}

private void ensurePartitionedTable(ObjectPath tablePath, Table hiveTable) throws TableNotPartitionedException {
Contributor commented:

static

* @throws PartitionSpecInvalidException thrown if partitionSpec and partitionKeys have different sizes,
* or any key in partitionKeys doesn't exist in partitionSpec.
*/
private List<String> getOrderedFullPartitionValues(CatalogPartitionSpec partitionSpec, List<String> partitionKeys, ObjectPath tablePath)
Contributor commented:

static?

@xuefuz (Contributor) left a comment

LGTM. A few minor comments for consideration.

@lirui-apache (Contributor Author) commented

Made ensureTableAndPartitionMatch static. Several other methods can't be static unless we pass the needed non-static fields as parameters. @xuefuz shall we do that?

@xuefuz (Contributor) commented May 24, 2019

(quoting @lirui-apache's comment above)

No. I think we only need to make static the methods that are static in nature in their current form.
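
To illustrate the distinction (isPartitioned is a hypothetical example; getHivePartition's body assumes the standard Hive metastore client API):

// Static in nature: depends only on its arguments, so it can be made static as-is.
private static boolean isPartitioned(Table hiveTable) {
  return hiveTable.getPartitionKeysSize() > 0;
}

// Not static in nature: reads the catalog's metastore client instance field;
// making it static would require passing the client in as a parameter.
private Partition getHivePartition(Table hiveTable, List<String> partitionValues) throws TException {
  return client.getPartition(hiveTable.getDbName(), hiveTable.getTableName(), partitionValues);
}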

@bowenli86 (Member) commented

Thanks @lirui-apache very much for the PR. LGTM, merging.

@asfgit asfgit closed this in 95e1686 May 24, 2019
@lirui-apache lirui-apache deleted the FLINK-12235 branch May 24, 2019 10:50