
[HUDI-2176, 2178, 2179] Adding virtual key support to COW table #3306

Merged — 1 commit merged into apache:master on Jul 26, 2021

Conversation

@nsivabalan (Contributor) commented on Jul 20, 2021

What is the purpose of the pull request

  • Adding virtual key support to COW tables for all table operations (insert, upsert, delete, etc.)
  • Fixed clustering
  • Ensured metadata table supports virtual keys

Changes not covered in this PR:

  • Flink and Java are not in the scope of this PR.
  • ORC format is not covered.
  • CLI commands are not covered.

To discuss:
With virtual keys, we are imposing a constraint that the keyGen for a given table cannot change from the time of its inception. Given this constraint, should we add some validation in HoodieSparkSqlWriter or WriteClient so that the keyGen does not change over time for a given table?
The reason I am asking is that even today we have some loose ends. For example, if someone switches the index type midway, I don't think we validate and throw a proper exception saying the index type can't be changed to an incompatible one. Not saying we have to follow that; just a reminder.
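A minimal sketch of what such a validation could look like in HoodieSparkSqlWriter or the write client (the persisted property name and the helper class are assumptions for illustration, not the actual implementation):

```java
import org.apache.hudi.exception.HoodieException;
import java.util.Properties;

// Hypothetical guard: compare the key generator class configured for this write
// against the one recorded in hoodie.properties when the table was created.
public class KeyGenValidator {
  public static void validateKeyGenUnchanged(Properties tableProps, String writeKeyGenClass) {
    // "hoodie.table.keygenerator.class" is an assumed property name for illustration.
    String tableKeyGenClass = tableProps.getProperty("hoodie.table.keygenerator.class");
    if (tableKeyGenClass != null && !tableKeyGenClass.equals(writeKeyGenClass)) {
      throw new HoodieException(String.format(
          "Key generator cannot change after table creation: table has %s, write uses %s",
          tableKeyGenClass, writeKeyGenClass));
    }
  }
}
```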

Brief change log

  • Made changes to all write handles to leverage the keyGenerator instead of meta fields to compute record keys and partition paths.
  • Added support to SimpleIndex to leverage keyGen instead of meta fields to fetch the record key and partition path.
  • Moved BaseKeyGenerator and other supporting interfaces (KeyGenerator, KeyGeneratorInterface) to hudi-common so that they can be used across all modules and classes.
  • Fixed most tests in TestHoodieClientCopyOnWriteStorage and TestHoodieBackedMetadata.

Verify this pull request

This change added tests and can be verified as follows:

  • Fixed TestHoodieClientCopyOnWriteStorage and TestHoodieBackedMetadata to test tables w/ virtual keys enabled

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot commented on Jul 20, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan (Contributor, Author) left a comment

Left some notes for the reviewer.

@@ -278,7 +291,8 @@ protected boolean writeRecord(HoodieRecord<T> hoodieRecord, Option<IndexedRecord
* Go through an old record. Here if we detect a newer version shows up, we write the new one to the file.
*/
public void write(GenericRecord oldRecord) {
- String key = oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+ String key = populateMetaFields ? oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString() :
@nsivabalan (author):

Not sure if we need to abstract this out and keep it outside of MergeHandle itself. There are only two options: either use meta cols or use keyGen to compute record keys. So I have decided to manage it here itself.
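The new line above is truncated; a plausible completion of the ternary (assuming keyGeneratorOpt is non-empty whenever meta fields are disabled) would be:

```java
// Read the key from the meta column when populated, else compute it via the key generator.
String key = populateMetaFields
    ? oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString()
    : keyGeneratorOpt.get().getRecordKey(oldRecord);
```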

prepRecordWithMetadata(avroRecord, record, instantTime,
taskContextSupplier.getPartitionIdSupplier().get(), recordIndex, file.getName());
super.write(avroRecord);
writeSupport.add(record.getRecordKey());
@nsivabalan (author):

As of this patch, I assume the bloom filter goes hand in hand with meta cols. If populateMetaFields is false, we are not adding the bloom index. We can add follow-up patches to de-couple these.

@@ -225,6 +229,22 @@ protected void initMetaClient(HoodieTableType tableType) throws IOException {
metaClient = HoodieTestUtils.init(hadoopConf, basePath, tableType);
}

protected Properties getPropertiesForKeyGen() {
@nsivabalan (author):

HoodieTestDataGenerator has these fields in the commonly used schema, hence they are hardcoded here so that all tests can call into this to generate props.
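A minimal sketch of what this helper might look like (the populate-meta-fields property name is an assumption; the record key and partition path field names match HoodieTestDataGenerator's trip schema shown later in this thread):

```java
protected Properties getPropertiesForKeyGen() {
  Properties properties = new Properties();
  // Disable meta fields so the key generator path is exercised (property name assumed).
  properties.put("hoodie.populate.meta.fields", "false");
  // Field names hardcoded to HoodieTestDataGenerator's commonly used schema.
  properties.put("hoodie.datasource.write.recordkey.field", "_row_key");
  properties.put("hoodie.datasource.write.partitionpath.field", "partition_path");
  return properties;
}
```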

* @return {@link List} of {@link HoodieKey}s fetched from the parquet file
*/
@Override
public List<HoodieKey> fetchRecordKeyPartitionPath(Configuration configuration, Path filePath, BaseKeyGenerator keyGenerator) {
@nsivabalan (author):

Not sure if we should instead add another argument to the existing API and generate/fetch record keys and partition paths based on that. Felt this is neater.

@vinothchandar (member):

Probably, but this method has a lot of code duplication atm. Can we reduce that?

@@ -1588,6 +1588,11 @@ public Builder withWriteMetaKeyPrefixes(String writeMetaKeyPrefixes) {
return this;
}

public Builder withPopulateMetaFields(boolean populateMetaFields) {
@nsivabalan (author):

Can you please clarify something for me: if we wish to store the property in hoodie.properties as part of HoodieTableConfig, should we set the config via HoodieWriteConfigBuilder.withHoodieTableConfig(new HoodieTableConfigBuilder().withPopulate ... sort of? And is exposing a setter here in HoodieWriteConfig not the right way to go about it?
It was easier for me in tests to set this param, but I wanted to know the right way to go about it.

@vinothchandar (member):

Yeah, probably composing it like that is the right way. Separate out the table configs from the write configs.

TaskContextSupplier taskContextSupplier, boolean populateMetaFields, boolean enableBloomFilter) throws IOException {
BloomFilter filter = enableBloomFilter ? createBloomFilter(config) : null;
HoodieAvroWriteSupport writeSupport =
new HoodieAvroWriteSupport(new AvroSchemaConverter(hoodieTable.getHadoopConf()).convert(schema), schema, filter);
@nsivabalan (author):

HoodieAvroWriteSupport already handles a null bloom filter, hence using null here. If you prefer, I can change that to an Option and fix this.

@vinothchandar (member):

Yes, let's fix this to not do nulls, if it's not a lot of change.
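A minimal sketch of the Option-based variant, assuming HoodieAvroWriteSupport gains a constructor overload that accepts Option&lt;BloomFilter&gt; (today it takes a nullable filter):

```java
import org.apache.hudi.common.bloom.BloomFilter;
import org.apache.hudi.common.util.Option;

// Build the filter only when requested; Option.empty() replaces the raw null.
Option<BloomFilter> filter = enableBloomFilter
    ? Option.of(createBloomFilter(config))
    : Option.empty();
// Assumed overload: HoodieAvroWriteSupport(MessageType, Schema, Option<BloomFilter>).
HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(
    new AvroSchemaConverter(hoodieTable.getHadoopConf()).convert(schema), schema, filter);
```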

@codecov-commenter commented on Jul 20, 2021

Codecov Report

Merging #3306 (719bb10) into master (a086d25) will decrease coverage by 19.98%.
The diff coverage is 24.65%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master    #3306       +/-   ##
=============================================
- Coverage     47.74%   27.76%   -19.99%     
+ Complexity     5591     1330     -4261     
=============================================
  Files           938      386      -552     
  Lines         41823    15582    -26241     
  Branches       4213     1390     -2823     
=============================================
- Hits          19968     4326    -15642     
+ Misses        20070    10932     -9138     
+ Partials       1785      324     -1461     
Flag Coverage Δ
hudicli ?
hudiclient 21.33% <24.65%> (-13.23%) ⬇️
hudicommon ?
hudiflink ?
hudihadoopmr ?
hudisparkdatasource ?
hudisync 4.88% <ø> (-51.10%) ⬇️
huditimelineservice ?
hudiutilities 59.87% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...java/org/apache/hudi/config/HoodieWriteConfig.java 0.00% <0.00%> (-43.38%) ⬇️
...in/java/org/apache/hudi/io/HoodieAppendHandle.java 0.00% <0.00%> (ø)
...g/apache/hudi/io/HoodieKeyLocationFetchHandle.java 0.00% <0.00%> (ø)
...ain/java/org/apache/hudi/io/HoodieMergeHandle.java 0.00% <0.00%> (ø)
...va/org/apache/hudi/io/HoodieSortedMergeHandle.java 0.00% <0.00%> (ø)
...org/apache/hudi/io/storage/HoodieConcatHandle.java 0.00% <0.00%> (ø)
...pache/hudi/io/storage/HoodieFileWriterFactory.java 0.00% <0.00%> (ø)
...rg/apache/hudi/io/storage/HoodieParquetWriter.java 0.00% <0.00%> (ø)
...va/org/apache/hudi/keygen/BuiltinKeyGenerator.java 62.12% <ø> (ø)
...apache/hudi/table/HoodieSparkCopyOnWriteTable.java 70.83% <28.57%> (-7.74%) ⬇️
... and 616 more

Continue to review full report at Codecov.
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a086d25...719bb10.

@nsivabalan (author) commented:

@vinothchandar : you are good to review this patch.

@@ -101,33 +103,44 @@
protected long updatedRecordsWritten = 0;
protected long insertRecordsWritten = 0;
protected boolean useWriterSchema;
protected boolean populateMetaFields;
protected Option<BaseKeyGenerator> keyGeneratorOpt;
@nsivabalan (author):

I initially thought of moving this to HoodieWriteHandle, thinking all handles might need it. But it looks like neither the create nor the append handle needs the key gen, since on the write path we have the HoodieKey handy.

@@ -90,6 +90,7 @@
public static final int DEFAULT_PARTITION_DEPTH = 3;
public static final String TRIP_SCHEMA_PREFIX = "{\"type\": \"record\"," + "\"name\": \"triprec\"," + "\"fields\": [ "
+ "{\"name\": \"timestamp\",\"type\": \"long\"}," + "{\"name\": \"_row_key\", \"type\": \"string\"},"
+ "{\"name\": \"partition_path\", \"type\": \"string\"},"
@nsivabalan (author) commented on Jul 21, 2021:

Looks like we didn't have any field in the schema to hold only the partition path value. We always generate and hold it within HoodieKey, which eventually goes into the meta fields. Hence I had to add this field.

@nsivabalan nsivabalan force-pushed the virtualKeys_COW branch 3 times, most recently from 719bb10 to 9674330 Compare July 22, 2021 15:40
@vinothchandar vinothchandar self-assigned this Jul 22, 2021
@vinothchandar (member) commented:

> if someone switches the index type midway, I don't think we validate and throw a proper exception saying the index type can't be changed to an incompatible one. Not saying we have to follow that; just a reminder.

There was a PR around this before as well. Can we create a blocker JIRA for 0.10.0 that picks up all configs like that and ensures the validations are in place?

@vinothchandar (member) commented:

Can we file an umbrella JIRA to track all this, and move all these issues into it as well?

> Flink and Java are not in the scope of this PR.
> ORC format is not covered.
> CLI commands are not covered.

@vinothchandar (member) left a comment


LG overall. Lots of code comments.


hoodieTable.getHadoopConf(), new Path(baseFile.getPath())).stream()
.map(entry -> Pair.of(entry,
new HoodieRecordLocation(baseFile.getCommitTime(), baseFile.getFileId())));
if (config.populateMetaFields()) {
@vinothchandar (member):

You already decide in an upper layer to pass in Option.empty if config.populateMetaFields() == true, right? In these cases, it is advisable to just use keyGeneratorOpt.map(keyGen -> /* else block call */).orElse(/* if block */) and not rely on checking the config again and again.
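The pattern being suggested, shown as a tiny self-contained example using java.util.Optional for brevity (Hudi's own Option type is analogous; the string lambdas stand in for the two fetch branches above):

```java
import java.util.Optional;

public class OptionBranching {
  public static void main(String[] args) {
    // Empty when meta fields are populated; present when virtual keys are in play.
    Optional<String> keyGenOpt = Optional.empty();

    // Branch on the Option itself instead of re-checking config.populateMetaFields():
    String keys = keyGenOpt
        .map(keyGen -> "fetched via key generator " + keyGen) // else-block call
        .orElseGet(() -> "fetched via meta fields");          // if-block call

    System.out.println(keys);
  }
}
```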

} else {
return BaseFileUtils.getInstance(baseFile.getPath()).fetchRecordKeyPartitionPath(
hoodieTable.getHadoopConf(), new Path(baseFile.getPath()), keyGeneratorOpt.get()).stream()
.map(entry -> Pair.of(entry,
@vinothchandar (member):

Can we avoid repeating lines 63 and 58?

super(config, instantTime, partitionPath, fileId, hoodieTable, taskContextSupplier);
init(fileId, recordItr);
init(fileId, partitionPath, baseFile);
this.populateMetaFields = config.populateMetaFields();
@vinothchandar (member):

Same question: can we just work off keyGeneratorOpt.isEmpty()?

setAndValidateKeyGenProps(keyGeneratorOpt);
}

private void setAndValidateKeyGenProps(Option<BaseKeyGenerator> keyGeneratorOpt) {
@vinothchandar (member):

validate and then set?

}

private void initKeyGenIfNeeded() {
this.populateMetaFields = config.populateMetaFields();
@vinothchandar (member):

Move this to the constructor?

try {
keyGeneratorOpt = Option.of((BaseKeyGenerator) HoodieSparkKeyGeneratorFactory.createKeyGenerator(new TypedProperties(config.getProps())));
} catch (IOException e) {
throw new HoodieIOException("Only BaseKeyGenerators are supported when meta columns are disabled ", e);
@vinothchandar (member):

Move this exception handling into the method itself? It's an unchecked exception anyway; we can save some lines.

@nsivabalan (author):

Not very sure on this. Did you mean to move this to the caller of this method, or within HoodieSparkKeyGeneratorFactory.createKeyGenerator?
Because we have two constructors, and to avoid duplicate code I am doing this in a private method.
Also, we can't move it within HoodieSparkKeyGeneratorFactory.createKeyGenerator, because here we are casting to BaseKeyGenerator since, if meta fields are disabled, we have some constraints around keygens.

@vinothchandar (member):

> within HoodieSparkKeyGeneratorFactory.createKeyGenerator.

Yes, within. You can still cast outside, right?

@nsivabalan (author):

But I added it as a guard for catching casting issues.
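For concreteness, a sketch of what moving the wrapping into the factory could look like (instantiateKeyGenerator is a hypothetical stand-in for the factory's existing reflection-based instantiation, which currently throws a checked IOException):

```java
// Inside HoodieSparkKeyGeneratorFactory (sketch):
public static KeyGenerator createKeyGenerator(TypedProperties props) {
  try {
    return instantiateKeyGenerator(props); // hypothetical internal helper
  } catch (IOException e) {
    // Wrap the checked exception here so every caller saves a try/catch.
    throw new HoodieIOException("Failed to instantiate key generator", e);
  }
}

// The call site keeps the BaseKeyGenerator cast as a guard:
keyGeneratorOpt = Option.of((BaseKeyGenerator)
    HoodieSparkKeyGeneratorFactory.createKeyGenerator(new TypedProperties(config.getProps())));
```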

/**
* Fetch schema for record key and partition path.
*/
public static Schema getRecordKeyPartitionPathSchema(Schema fileSchema, List<String> recordKeyFields, List<String> partitionPathFields) {
@vinothchandar (member):

unit test for this?

@vinothchandar (member):

Any reason why we can't just merge the lists outside and keep this method simpler, i.e. take a list of fields and get a subschema? In fact, we may have a method like that already that we can reuse.

@nsivabalan (author):

Looks like we don't have one already, but I will fix this method to be generic.
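A sketch of the generic version, where callers merge recordKeyFields and partitionPathFields into one list before calling (assumes the Avro 1.9+ Schema.Field constructor that takes an Object default value):

```java
import org.apache.avro.Schema;
import java.util.List;
import java.util.stream.Collectors;

// Generic projection: build a subschema containing only the named fields.
// Fields are re-created because Avro Field objects cannot be reused across schemas.
public static Schema getSubSchema(Schema fileSchema, List<String> fieldNames) {
  List<Schema.Field> projected = fieldNames.stream()
      .map(fileSchema::getField)
      .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
      .collect(Collectors.toList());
  return Schema.createRecord(fileSchema.getName(), fileSchema.getDoc(),
      fileSchema.getNamespace(), false, projected);
}
```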

public List<HoodieKey> fetchRecordKeyPartitionPath(Configuration configuration, Path filePath, BaseKeyGenerator keyGenerator) {
List<HoodieKey> hoodieKeys = new ArrayList<>();
try {
if (!filePath.getFileSystem(configuration).exists(filePath)) {
@vinothchandar (member):

Let's avoid this call and have it error out if the file does not exist?

@nsivabalan (author):

The existing fetchRecordKeyPartitionPath() already does this. I assume you are suggesting we fix all of them.


@nsivabalan (author) left a comment

Will address the feedback.

@nsivabalan nsivabalan force-pushed the virtualKeys_COW branch 2 times, most recently from 7b789d7 to 8a0661b Compare July 26, 2021 03:01
@nsivabalan nsivabalan merged commit 61148c1 into apache:master Jul 26, 2021
liujinhui1994 pushed a commit to liujinhui1994/hudi that referenced this pull request Aug 12, 2021