
[FLINK-12235][hive] Support Hive partition in HiveCatalog #8449

Closed
wants to merge 13 commits

Conversation

@lirui-apache (Contributor) commented May 15, 2019

What is the purpose of the change

To implement partition-related operations in HiveCatalogBase.

Brief change log

  • Introduced HiveCatalogPartition to represent a partition that can be handled by HiveCatalog.
  • Implemented partition-related operations in HiveCatalogBase. Although we intend to let HiveCatalog and GenericHiveMetastoreCatalog share the implementations, this PR only enables/tests these operations for HiveCatalog.
  • Moved partition-related tests from GenericInMemoryCatalogTest to CatalogTestBase, so that GenericInMemoryCatalogTest and HiveCatalogTest can share these test cases.

Verifying this change

This PR is tested using the partition-related test cases in CatalogTestBase.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

@flinkbot (Collaborator) commented
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@lirui-apache (Contributor Author) commented

@xuefuz @bowenli86 please take a look. Thanks.

@zjuwangg (Contributor) left a comment

Looks good to me! Thanks for your effort on this.

@xuefuz (Contributor) left a comment

Went over the changes once. Had some comments.

@bowenli86 (Member) left a comment

@lirui-apache thanks for the PR. I left some comments in addition to Xuefu's.

* @throws PartitionSpecInvalidException thrown if partitionSpec and partitionKeys have different sizes,
* or any key in partitionKeys doesn't exist in partitionSpec.
*/
List<String> getFullPartitionValues(ObjectPath tablePath, CatalogPartitionSpec partitionSpec, List<String> partitionKeys)
Member commented:

What does "full" mean here? Probably rename to "getOrdered/ArrangedPartitionValues" to conform to its logic?

nit: can we move tablePath to the end as the last param, given it's only used here to build an error message without any real effect?

@lirui-apache (Contributor Author) replied:

It means a full spec, which contains values for all partition keys; a partial spec contains values for only a subset of the partition keys. Operations like createPartition and dropPartition require a full spec, while operations like listPartitions can accept a partial spec. Let me rename it to getOrderedFullPartitionValues, to indicate that it orders the values and requires a full spec.
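
For context, a minimal sketch of what such an ordering helper could look like inside HiveCatalogBase (the PartitionSpecInvalidException constructor signature and the getName() accessor for the catalog name are assumptions here, not confirmed by this thread):

// Returns the values of partitionSpec arranged in the order of partitionKeys.
// Throws if partitionSpec is not a full spec, i.e. it doesn't cover every partition key.
private List<String> getOrderedFullPartitionValues(
    CatalogPartitionSpec partitionSpec, List<String> partitionKeys, ObjectPath tablePath)
    throws PartitionSpecInvalidException {
  Map<String, String> spec = partitionSpec.getPartitionSpec();
  if (spec.size() != partitionKeys.size()) {
    throw new PartitionSpecInvalidException(getName(), partitionKeys, tablePath, partitionSpec);
  }
  List<String> values = new ArrayList<>(spec.size());
  for (String key : partitionKeys) {
    if (!spec.containsKey(key)) {
      throw new PartitionSpecInvalidException(getName(), partitionKeys, tablePath, partitionSpec);
    }
    values.add(spec.get(key));
  }
  return values;
}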

@lirui-apache (Contributor Author) commented

@bowenli86 @xuefuz thanks for your comments. Please take a look at the updated PR. Thanks.

@xuefuz (Contributor) left a comment

As an FYI, PR #8490 will remove some classes that are changed here. Not sure which PR should get in first, but it seems one of them will have to rebase.

@bowenli86 (Member) commented May 20, 2019

I'd like to get #8480 in first and rebase this PR, since #8480 is a much larger change set and not easy to rebase. Same as #8477.

@lirui-apache (Contributor Author) commented

Rebased.
@xuefuz @bowenli86 please take another look. Thanks.

@bowenli86 (Member) left a comment

@lirui-apache Thank you very much for the update!

I just spotted that several API implementations currently work like this: 1) get a raw Hive table, 2) parse part of the raw table. The latter step duplicates logic in instantiateHiveCatalogTable(). E.g. ensureTableAndPartitionMatch() parses FLINK_PROPERTY_IS_GENERIC, instantiateHivePartition() parses partition keys, and ensurePartitionedTable() parses the raw table's partition key size, all of which we could get by parsing the raw table into a CatalogTable via instantiateHiveCatalogTable() up front. The current duplication also means that if we change some general logic in parsing a Hive table, we need to change two places. Thus I wonder if it makes sense to parse the raw table as a whole at the beginning, rather than having scattered places each parsing only part of it. We could then remove util methods such as getFieldNames(), which is only used to get the partition keys that are already available in CatalogTable.

For example, change

public void createPartition(...) {
  Table hiveTable = getHiveTable(tablePath);
  ensureTableAndPartitionMatch(hiveTable, partition);
  ensurePartitionedTable(tablePath, hiveTable);
  try {
    client.add_partition(instantiateHivePartition(hiveTable, partitionSpec, partition));
  } ...
}

to something like:

public void createPartition(...) {
  Table hiveTable = getHiveTable(tablePath);
  CatalogBaseTable catalogTable = instantiateHiveCatalogTable(hiveTable);
  ... checking whether catalogTable and catalogPartition types match would be much easier here ...
  ... checking whether catalogTable is partitioned would be easier here ...
  try {
    client.add_partition(
      instantiateHivePartition(catalogTable, partitionSpec, partition, hiveTable.getSd()));
  } ...
}

@@ -639,8 +640,12 @@ public void createPartition(ObjectPath tablePath, CatalogPartitionSpec partition
checkNotNull(partitionSpec, "CatalogPartitionSpec cannot be null");
checkNotNull(partition, "Partition cannot be null");

checkArgument(partition instanceof HiveCatalogPartition, "Currently only supports HiveCatalogPartition");
Member commented:

We currently throw CatalogException if the type doesn't match. checkArgument() will throw IllegalArgumentException.
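
For illustration, a minimal sketch of the suggested fix, replacing the checkArgument() precondition with an explicit check that throws CatalogException (the exact message text is an assumption):

if (!(partition instanceof HiveCatalogPartition)) {
  throw new CatalogException(
    String.format("Currently only supports HiveCatalogPartition, but got %s",
      partition.getClass().getName()));
}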

@@ -740,10 +745,13 @@ public void alterPartition(ObjectPath tablePath, CatalogPartitionSpec partitionS
checkNotNull(partitionSpec, "CatalogPartitionSpec cannot be null");
checkNotNull(newPartition, "New partition cannot be null");

checkArgument(newPartition instanceof HiveCatalogPartition, "Currently only supports HiveCatalogPartition");
Member commented:

ditto

boolean isGeneric = Boolean.valueOf(hiveTable.getParameters().get(FLINK_PROPERTY_IS_GENERIC));
if ((isGeneric && catalogPartition instanceof HiveCatalogPartition) ||
(!isGeneric && catalogPartition instanceof GenericCatalogPartition)) {
throw new IllegalArgumentException(String.format("Cannot handle %s partition for %s table",
Member commented:

throw CatalogException
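
A sketch of the mismatch check rewritten per this comment (the message arguments are assumed from the truncated diff above; making the method static anticipates the later review comments):

private static void ensureTableAndPartitionMatch(Table hiveTable, CatalogPartition catalogPartition) {
  boolean isGeneric = Boolean.parseBoolean(hiveTable.getParameters().get(FLINK_PROPERTY_IS_GENERIC));
  if ((isGeneric && catalogPartition instanceof HiveCatalogPartition) ||
      (!isGeneric && catalogPartition instanceof GenericCatalogPartition)) {
    // CatalogException is unchecked, so no throws clause is required
    throw new CatalogException(String.format("Cannot handle %s partition for %s table",
      catalogPartition.getClass().getName(), isGeneric ? "generic" : "non-generic"));
  }
}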

* or any key in partitionKeys doesn't exist in partitionSpec.
*/
private List<String> getOrderedFullPartitionValues(CatalogPartitionSpec partitionSpec, List<String> partitionKeys, ObjectPath tablePath)
throws PartitionSpecInvalidException {
Member commented:

nit: one more tab

@lirui-apache (Contributor Author) commented

(quoting @bowenli86's review comment above)

I'm not sure how much benefit this would bring us. It might make ensureTableAndPartitionMatch a little easier -- we could check the type of CatalogBaseTable instead of parsing a property. But I don't think we should take the same approach for ensurePartitionedTable, which is already simple enough. And for APIs like listPartitions, creating a CatalogBaseTable just to get the number of partition columns seems like overkill to me. The same applies to getFieldNames -- e.g. it would be overkill to create a CatalogBaseTable just to get the partition column names in dropPartition.
Maybe an alternative approach to the problem you mentioned is to add more util methods in order to avoid duplication. For example, we could have a util method that decides whether a Hive table is generic, and all the APIs needing this logic would call it. What do you think?
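
A minimal sketch of the proposed util method (the name isGenericTable is hypothetical; FLINK_PROPERTY_IS_GENERIC is the flag discussed above):

// Hypothetical helper: decides whether a raw Hive metastore table is a generic (Flink) table.
private static boolean isGenericTable(Table hiveTable) {
  return Boolean.parseBoolean(hiveTable.getParameters().get(FLINK_PROPERTY_IS_GENERIC));
}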

@bowenli86 (Member) commented

OK, let's leave that part as it is for now.

@lirui-apache (Contributor Author) commented

Rebased and addressed comments.
@bowenli86 @xuefuz let me know if I missed anything. Thanks.

}
}

private Partition instantiateHivePartition(Table hiveTable, CatalogPartitionSpec partitionSpec, CatalogPartition catalogPartition)
Contributor commented:

static?

return new HiveCatalogPartition(hivePartition.getParameters(), hivePartition.getSd().getLocation());
}

private void ensurePartitionedTable(ObjectPath tablePath, Table hiveTable) throws TableNotPartitionedException {
Contributor commented:

static

* @throws PartitionSpecInvalidException thrown if partitionSpec and partitionKeys have different sizes,
* or any key in partitionKeys doesn't exist in partitionSpec.
*/
private List<String> getOrderedFullPartitionValues(CatalogPartitionSpec partitionSpec, List<String> partitionKeys, ObjectPath tablePath)
Contributor commented:

static?

@xuefuz (Contributor) left a comment

LGTM. A few minor comments for consideration.

@lirui-apache (Contributor Author) commented

Made ensureTableAndPartitionMatch static. Several other methods can't be static unless we pass the needed non-static fields as parameters. @xuefuz shall we do that?

@xuefuz (Contributor) commented May 24, 2019

(quoting @lirui-apache's comment above)

No. I think we only need to make static the methods that are static in nature in their current form.
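
To illustrate the distinction (isPartitioned is a hypothetical example; getHivePartition's body assumes the standard Hive metastore client API):

// Static in nature: depends only on its arguments, so it can be made static as-is.
private static boolean isPartitioned(Table hiveTable) {
  return hiveTable.getPartitionKeysSize() > 0;
}

// Not static in nature: reads the catalog's metastore client instance field;
// making it static would require passing the client in as a parameter.
private Partition getHivePartition(Table hiveTable, List<String> partitionValues) throws TException {
  return client.getPartition(hiveTable.getDbName(), hiveTable.getTableName(), partitionValues);
}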

@bowenli86 (Member) commented

Thanks @lirui-apache very much for the PR. LGTM, merging.

@asfgit asfgit closed this in 95e1686 May 24, 2019
@lirui-apache lirui-apache deleted the FLINK-12235 branch May 24, 2019 10:50