
[FLINK-29585][hive] Migrate TableSchema to Schema for Hive connector #21522

Merged
1 commit merged on Mar 20, 2023

Conversation

Aitozi (Contributor) commented Dec 18, 2022

What is the purpose of the change

This PR migrates TableSchema to Schema and ResolvedSchema. Most uses of TableSchema have been moved out of the Hive connector module; only some remain in the HiveCatalog-related code. I filed a discussion about the catalog's APIs regarding this.

Verifying this change

This change is a rework that should be covered by the existing tests.

flinkbot (Collaborator) commented Dec 18, 2022

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

Aitozi (author) commented Dec 18, 2022

cc @luoyuxia @wuchong please take a look when you are free

luoyuxia (Contributor) commented:

@Aitozi Thanks for the contribution. Migrating to the new schema is a good improvement and valuable for future development. I'll definitely have a look when I'm free.

luoyuxia (Contributor) left a comment:

@Aitozi Thanks for the contribution. I left some comments. PTAL.
Also, I found that some test classes still have the import org.apache.flink.table.api.TableSchema, such as:
HiveDialectITCase, TableEnvHiveConnectorITCase, HiveInputFormatPartitionReaderITCase, HiveCatalogGenericMetadataTest, HiveCatalogHiveMetadataTest, HiveCatalogITCase, and HiveCatalogTest.
Can they all be removed?

  String[] formatNames = new String[formatFieldCount];
  LogicalType[] formatTypes = new LogicalType[formatFieldCount];
  for (int i = 0; i < formatFieldCount; i++) {
-     formatNames[i] = tableSchema.getFieldName(i).get();
-     formatTypes[i] = tableSchema.getFieldDataType(i).get().getLogicalType();
+     formatNames[i] = resolvedSchema.getColumn(i).get().getName();
Reviewer (Contributor):

nit:
resolvedSchema.getColumnNames().get(i);

Aitozi (author):

fixed

- formatNames[i] = tableSchema.getFieldName(i).get();
- formatTypes[i] = tableSchema.getFieldDataType(i).get().getLogicalType();
+ formatNames[i] = resolvedSchema.getColumn(i).get().getName();
+ formatTypes[i] = resolvedSchema.getColumn(i).get().getDataType().getLogicalType();
Reviewer (Contributor):

Ditto.

Aitozi (author):

fixed
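The nit above swaps a per-index Optional lookup for a plain list lookup. A minimal plain-Java sketch of the difference (the Column record below is a hypothetical stand-in; Flink's real ResolvedSchema is not used here):

```java
import java.util.List;
import java.util.Optional;

public class ColumnAccessDemo {
    // Hypothetical stand-in for a resolved column.
    record Column(String name) {}

    // Mirrors the shape of ResolvedSchema#getColumn(int): Optional per index.
    static Optional<Column> getColumn(List<Column> cols, int i) {
        return i >= 0 && i < cols.size() ? Optional.of(cols.get(i)) : Optional.empty();
    }

    public static void main(String[] args) {
        List<Column> cols = List.of(new Column("id"), new Column("name"));
        List<String> names = cols.stream().map(Column::name).toList();

        // Per-index Optional access: forces a .get() even though the index is known valid.
        String viaOptional = getColumn(cols, 0).get().name();
        // List access, as the nit suggests: no Optional unwrapping inside the loop.
        String viaList = names.get(0);
        System.out.println(viaOptional.equals(viaList)); // prints true
    }
}
```

Both paths return the same value; the list form just avoids the redundant Optional unwrap in a loop whose indices are known to be in range.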

- formatConf, typeDescription.toString(), formatTypes));
+ formatConf,
+ typeDescription.toString(),
+ formatType.getFields().stream()
Reviewer (Contributor):

use formatTypes?

// Partition keys
List<String> partitionKeys = new ArrayList<>();
TableSchema tableSchema;
Reviewer (Contributor):

Can we also remove TableSchema in here?

@@ -97,7 +102,7 @@ public class HiveTableUtil {

  private HiveTableUtil() {}

- public static TableSchema createTableSchema(
+ public static ResolvedSchema createResolvedTableSchema(
Reviewer (Contributor):

Why ResolvedSchema? As far as I'm concerned, it should be Schema.

@@ -2106,7 +2111,7 @@ public static CatalogBaseTable getCatalogBaseTable(
  public static class TableSpec {
  public ObjectIdentifier tableIdentifier;
  public String tableName;
- public CatalogBaseTable table;
+ public ResolvedCatalogBaseTable<?> table;
Reviewer (Contributor):

Do we really need ResolvedCatalogBaseTable? What I mean is: can we avoid calling the method catalogManager.resolveCatalogBaseTable? It's an internal method, and we always want to avoid calling it.

Aitozi (author):

I think we need the table to be resolved, since we want to call validatePartColumnType on it, which requires the resolved type information.

- CatalogTable catalogTable =
-     getCatalogTable(tableIdentifier.asSummaryString(), qb);
+ ResolvedCatalogTable catalogTable =
+     catalogManager.resolveCatalogTable(
Reviewer (Contributor):

Ditto. Can we avoid calling catalogManager.resolveCatalogTable?

  }
- return builder.build();
+ return org.apache.flink.table.catalog.UniqueConstraint.primaryKey(
+     primaryKey.getName(), primaryKey.getColumns());
  }

/** Create Hive columns from Flink TableSchema. */
Reviewer (Contributor):

Can this method be removed?

oldTable.getComment(),
oldTable.getPartitionKeys(),
props),
newSchema));
}
Reviewer (Contributor):

Can the deprecated TableSchema be removed in method convertAlterTableChangeCol?

Aitozi (author):

Removed

luoyuxia (Contributor) commented:

@Aitozi FYI: about one year ago I also reviewed a PR that migrated TableSchema to Schema, fapaul@f8af0e9, but the author did not seem to intend to finish it. You can have a look at it just for reference.

Aitozi (author) commented Jan 30, 2023

@luoyuxia thanks for your review. I will take a look and address your comments.

Aitozi (author) commented Mar 2, 2023

I'm revisiting this PR now.

@Aitozi Aitozi marked this pull request as draft March 2, 2023 13:06
@Aitozi Aitozi force-pushed the hive-schema branch 3 times, most recently from b9ed21a to fc21901, on March 9, 2023 09:47
@Aitozi Aitozi marked this pull request as ready for review March 9, 2023 09:48
Aitozi (author) commented Mar 9, 2023

Hi @luoyuxia, I have addressed your comments. Please take a look again when you are free, thanks.

Aitozi (author) commented Mar 9, 2023

@flinkbot run azure

luoyuxia (Contributor) left a comment:

@Aitozi Thanks for updating. I left some comments again. PTAL. We're getting there.

*/
public HiveSourceBuilder setProjectedFields(int[] projectedFields) {
Reviewer (Contributor):

Please don't change this method since it's a public interface. Also, I don't think we need to change it.

Aitozi (author):

Reverted

if (isHiveTable) {
pkConstraint = table.getSchema().getPrimaryKey().orElse(null);
// TODO replace the deprecated UniqueConstraint
Reviewer (Contributor):

It would be better to create a JIRA issue to track this TODO task.


@@ -763,7 +783,7 @@ CatalogBaseTable instantiateCatalogTable(Table hiveTable) {
  tableSchemaProps.putProperties(properties);
  // try to get table schema with both new and old (1.10) key, in order to support tables
  // created in old version
- tableSchema =
+ TableSchema tableSchema =
Reviewer (Contributor):

Can we also remove TableSchema here, so that TableSchema is removed from our Hive connector entirely?

Aitozi (author):

I think it doesn't matter. We still use DescriptorProperties to serialize/deserialize the schema when storing it to and restoring it from the external metastore, so TableSchema is actually still used in the Hive connector. It can be entirely removed once we can use the new way to serialize/deserialize the schema, but I think we can improve that as a follow-up. WDYT?

Reviewer (Member):

CatalogPropertiesUtil is the alternative to DescriptorProperties. It should not be a major effort to migrate.

Aitozi (author):

Updated. The import org.apache.flink.table.api.TableSchema has now been entirely removed from the Hive connector, and DescriptorProperties has been migrated to CatalogPropertiesUtil.

@@ -69,67 +69,75 @@ private OperationConverterUtils() {}
public static Operation convertAddReplaceColumns(
Reviewer (Contributor):

It seems this method can be removed, as well as the else if (sqlAlterTable instanceof SqlAddReplaceColumns) branch in SqlToOperationConverter.

Aitozi (author):

Why?

Aitozi (author):

Oh, I get it. SqlAddReplaceColumns is Hive dialect and is not used now. I will remove it.

@@ -157,12 +165,12 @@ public static Operation convertChangeColumn(
  // disallow changing partition columns
  throw new ValidationException("CHANGE COLUMN cannot be applied to partition columns");
  }
- TableSchema oldSchema = catalogTable.getSchema();
+ ResolvedSchema oldSchema = catalogTable.getResolvedSchema();
Reviewer (Contributor):

Ditto: it seems we can also remove the method convertChangeColumn.

Aitozi (author):

Removed

* ResolvedExpression} back to its Unresolved state. This will enable direct comparison of the
* schema.
*/
public static Schema fromResolvedSchema(ResolvedSchema resolvedSchema) {
Reviewer (Contributor):

Can't Schema.newBuilder().fromResolvedSchema(resolvedSchema).build() meet our needs?

Aitozi (author):

The ResolvedExpression in resolvedSchema differs from the Expression in its string format. For example, SqlCallExpression's string format is wrapped in '[]'.

Aitozi (author):

Refactored by introducing a TestSchemaResolver to ease this comparison.

luoyuxia (Contributor) commented Mar 15, 2023:

Do you mean that for the method CatalogTestUtil#checkEquals(CatalogTable t1, CatalogTable t2), t1/t2 may be either ResolvedCatalogTable or DefaultCatalogTable?
If it only happens in Hive, can we use CatalogManagerMocks.createEmptyCatalogManager().resolveCatalogTable()? Then we wouldn't need to add a new class TestSchemaResolver.


Aitozi (author) commented Mar 13, 2023

Hi @luoyuxia, most of your comments have been addressed. PTAL again.

@@ -258,8 +256,12 @@ private TableFunction<RowData> getLookupFunction(int[] keys) {
  jobConf,
  hiveVersion,
  tablePath,
- getTableSchema().getFieldDataTypes(),
- getTableSchema().getFieldNames(),
+ DataType.getFieldDataTypes(
Reviewer (Contributor):

Would it be better to use

catalogTable.getResolvedSchema().getColumnDataTypes().toArray(new DataType[0]),
catalogTable.getResolvedSchema().getColumnNames().toArray(new String[0])

Aitozi (author):

I think we should exclude the computed columns here, so I use DataType.getFieldDataTypes(catalogTable.getResolvedSchema().toPhysicalRowDataType()).toArray(new DataType[0]).

Reviewer (Contributor):

Makes sense.
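The point settled above is that read paths should see only physical columns, with computed columns excluded. This can be modeled in plain Java; the Column record below is a hypothetical stand-in for Flink's column metadata, and in Flink itself ResolvedSchema#toPhysicalRowDataType performs the analogous filtering:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PhysicalColumnsDemo {
    // Hypothetical stand-in for a column: a computed column carries an expression
    // instead of stored data.
    record Column(String name, String type, boolean isComputed) {}

    // Mirrors the idea behind ResolvedSchema#toPhysicalRowDataType:
    // computed columns hold no stored data, so a reader's row type skips them.
    static List<Column> physicalColumns(List<Column> all) {
        return all.stream().filter(c -> !c.isComputed()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Column> schema = List.of(
                new Column("id", "BIGINT", false),
                new Column("price", "DOUBLE", false),
                new Column("price_with_tax", "DOUBLE", true)); // computed, e.g. price * 1.1
        System.out.println(physicalColumns(schema).size()); // prints 2
    }
}
```

With this filtering, the computed column never reaches the lookup function's field arrays, which is why toPhysicalRowDataType was preferred over getColumnDataTypes above.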

@@ -703,7 +704,7 @@ private CompactReader.Factory<RowData> createCompactReaderFactory(
  jobConf,
  catalogTable,
  hiveVersion,
- (RowType) tableSchema.toRowDataType().getLogicalType(),
+ (RowType) resolvedSchema.toSinkRowDataType().getLogicalType(),
Reviewer (Contributor):

It should be resolvedSchema.toSourceRowDataType(), right?

Aitozi (author):

Why? I think the sink should deal with the Column::isPersisted data type, so the compact reader factory should use toSinkRowDataType.

Reviewer (Contributor):

Since it's for a reader, shouldn't it use resolvedSchema#toPhysicalRowDataType?

Aitozi (author):

Makes sense, updated.

@@ -472,7 +510,8 @@ private static int getCount(Map<String, String> map, String key, String suffix)
  final String escapedSeparator = Pattern.quote(SEPARATOR);
  final Pattern pattern =
      Pattern.compile(
-         escapedKey
+         "^"
Reviewer (Contributor):

Why change this?

Aitozi (author):

Before this change, both the key generic.schema.1.name and schema.1.name would pass this pattern, so the column count would mislead the key extractor. The code actually expects the ^ anchor here: with it, generic.schema.1.name no longer matches, so getCount returns 0 and we can then use the fallback key to read the value.
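The anchoring behavior described above can be reproduced with a small self-contained example. The key names and separator here are stand-ins for the real schema property keys; only the effect of the ^ anchor is the point:

```java
import java.util.regex.Pattern;

public class KeyCountAnchorDemo {
    public static void main(String[] args) {
        // Hypothetical reconstruction of the pattern in getCount: "schema" stands in
        // for the real escaped key, "\\." for the escaped separator, ".name" for the suffix.
        String escapedKey = Pattern.quote("schema");
        String body = escapedKey + "\\.(\\d+)\\.name";

        Pattern unanchored = Pattern.compile(body);
        Pattern anchored = Pattern.compile("^" + body);

        String plainKey = "schema.1.name";
        String fallbackKey = "generic.schema.1.name";

        // Without "^", find() also hits the fallback key (it contains "schema.1.name"
        // as a substring), which inflates the column count.
        System.out.println(unanchored.matcher(fallbackKey).find());
        // With "^", only keys that start with the prefix match, so the fallback key
        // yields no match and the caller can fall back to the generic key.
        System.out.println(anchored.matcher(fallbackKey).find());
        System.out.println(anchored.matcher(plainKey).find());
    }
}
```

This prints true, false, true: the unanchored pattern wrongly counts the generic key, while the anchored one counts only the plain key.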


luoyuxia (Contributor) commented Mar 15, 2023

@Aitozi Thanks for updating. I left minor comments. PTAL. This should be ready to merge in the next iteration.
BTW, the tests fail.

Aitozi (author) commented Mar 15, 2023

@luoyuxia Thanks for your detailed review. I have addressed your comments.

Do you mean that for the method CatalogTestUtil#checkEquals(CatalogTable t1, CatalogTable t2), t1/t2 may be either ResolvedCatalogTable or DefaultCatalogTable?
If it only happens in Hive, can we use CatalogManagerMocks.createEmptyCatalogManager().resolveCatalogTable()? Then we wouldn't need to add a new class TestSchemaResolver.

For this question: yes, currently only the Hive catalog needs the schema resolved. But CatalogTest cannot access CatalogManagerMocks, and the Hive catalog tests should also run CatalogTest's tests. So in the CatalogTest module we'd better have a test tool that can resolve a Schema to a ResolvedSchema for comparison.

Besides, CatalogManagerMocks cannot actually resolve expressions, so computed columns and watermark specs would not be covered. TestSchemaResolver solves this and can serve as a test harness whenever resolution is needed.

luoyuxia (Contributor) left a comment:

@Aitozi Thanks for your patience. I left minor comments. PTAL.

luoyuxia (Contributor) left a comment:

@Aitozi Thanks for updating. I left minor comments again. PTAL.
Please remember not to call the method CatalogManager#resolveCatalogTable.

luoyuxia (Contributor) commented:

@Aitozi I still notice we are calling catalogManager#resolveCatalogTable in HiveParserSemanticAnalyzer/HiveParserBaseSemanticAnalyzer. Please remove those calls.

Aitozi (author) commented Mar 17, 2023

@Aitozi I still notice we are calling catalogManager#resolveCatalogTable in HiveParserSemanticAnalyzer/HiveParserBaseSemanticAnalyzer. Please remove those calls.

All removed, PTAL again

luoyuxia (Contributor) left a comment:

@Aitozi Thanks for the contribution. LGTM assuming tests pass.
Could you please rebase on master? I will merge it later.

Aitozi (author) commented Mar 17, 2023

@luoyuxia Done. Thank you very much for your patient review.
