[HUDI-6594] Support duplicate with HFile Reader/Writer #9292

Closed
codope wants to merge 4 commits into apache:master from codope:rli-duplicate

@codope (Member) commented Jul 26, 2023

Change Logs

Support duplicate keys in the HFile reader/writer and the MDT (metadata table) record index.

  • HoodieAvroHFileReader#getRecordByKeyIteratorInternal returns an Iterator<IndexedRecord> for the given keys.
  • HoodieTableMetadata API change: a new method that returns, for each key, the list of matching HoodieRecords:
/**
 * Fetch records for the given keys. A key can have multiple records associated with it; this method returns all records for the given keys.
 *
 * @param keys          list of keys for which records are looked up.
 * @param partitionName name of the metadata table partition in which the records are looked up.
 * @return Map of key to {@link List} of {@link HoodieRecord}s matching the passed-in keys.
 */
Map<String, List<HoodieRecord<HoodieMetadataPayload>>> getAllRecordsByKeys(List<String> keys, String partitionName);
  • Added tests for record index and HFile reader.
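
To make the intended return shape concrete, here is a minimal, hypothetical sketch of a lookup that returns every record stored under a key. It is not Hudi code: plain String values stand in for HoodieRecord<HoodieMetadataPayload>, an in-memory list stands in for the HFile, and the class and method names (MultiValueLookupSketch, getAllByKeys) are made up for illustration. Omitting keys with no match is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Hudi code: String values stand in for
// HoodieRecord<HoodieMetadataPayload>, and a list of entries stands in for the HFile.
final class MultiValueLookupSketch {

  // Returns every value stored under each requested key.
  // Keys without any matching entry are omitted from the result (an assumption of this sketch).
  static Map<String, List<String>> getAllByKeys(List<Map.Entry<String, String>> entries,
                                                List<String> keys) {
    Set<String> wanted = new HashSet<>(keys);
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : entries) {
      if (wanted.contains(entry.getKey())) {
        // Append to the per-key list instead of overwriting, so duplicates are preserved.
        result.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> entries = List.of(
        Map.entry("key1", "fileA"),
        Map.entry("key1", "fileB"),
        Map.entry("key2", "fileC"));
    // prints {key1=[fileA, fileB]}
    System.out.println(getAllByKeys(entries, List.of("key1", "key3")));
  }
}
```

The point of the sketch is the Map<String, List<...>> shape: a single-valued map would have to drop one of the two key1 entries.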

Impact

The MDT can have duplicates. This is useful for the record index to work as a regular non-global index.

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed


protected abstract Map<String, HoodieRecord<HoodieMetadataPayload>> getRecordsByKeys(List<String> keys, String partitionName);

protected abstract Map<String, List<HoodieRecord<HoodieMetadataPayload>>> getAllRecordsByKeys(List<String> keys, String partitionName);
Member Author

@nsivabalan I added a separate method (yet to be fully implemented) for separation, as we will have duplicates only in the case of RLI for now. But should we change the return type of the existing getRecordsByKeys?

.map(key -> (HoodieRecord<HoodieMetadataPayload>) allRecords.get(key))
.filter(Objects::nonNull)
.collect(Collectors.toMap(HoodieRecord::getRecordKey, r -> r));
.collect(Collectors.toMap(HoodieRecord::getRecordKey, r -> r, (r1, r2) -> r2));
Member Author

this will be reverted after the new BaseTableMetadata#getAllRecordsByKeys is implemented.
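
For context on why the merge function in the snippet above matters: the two-argument Collectors.toMap throws IllegalStateException when two elements map to the same key, while the three-argument form resolves the collision with the supplied function. The sketch below is a hypothetical, self-contained illustration using Map.Entry<String, String> in place of HoodieRecord; the class and method names are made up, and the groupingBy variant is only one possible way to keep all duplicates, not necessarily what this PR would implement.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not Hudi code: Map.Entry<String, String> stands in for HoodieRecord.
final class DuplicateKeyCollectSketch {

  // Two-argument toMap: reports whether collecting fails on duplicate keys.
  static boolean strictToMapThrows(List<Map.Entry<String, String>> records) {
    try {
      records.stream().collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
      return false;
    } catch (IllegalStateException duplicateKey) {
      return true;
    }
  }

  // Three-argument toMap with (r1, r2) -> r2: the later record wins for each key.
  static Map<String, String> keepLast(List<Map.Entry<String, String>> records) {
    return records.stream()
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (r1, r2) -> r2));
  }

  // groupingBy: keeps every record for each key, in encounter order.
  static Map<String, List<String>> keepAll(List<Map.Entry<String, String>> records) {
    return records.stream()
        .collect(Collectors.groupingBy(
            Map.Entry::getKey,
            Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> records = List.of(
        Map.entry("key1", "fileA"),
        Map.entry("key1", "fileB"),
        Map.entry("key2", "fileC"));
    System.out.println(strictToMapThrows(records)); // prints true
    System.out.println(keepLast(records).get("key1")); // prints fileB
    System.out.println(keepAll(records).get("key1")); // prints [fileA, fileB]
  }
}
```

The last-wins merge silently discards all but one record per key, which is consistent with the comment that it is a stopgap until a multi-valued lookup exists.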

@codope codope marked this pull request as ready for review July 27, 2023 09:29
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan
Contributor

guess we don't need this anymore. feel free to re-open or create a new one if need be
