
[HUDI-5138][WIP][DO NOT MERGE] Table Spec FileSliceReader implementation #7199

Closed
wants to merge 10 commits

Conversation

@codope (Member) commented Nov 14, 2022

Change Logs

Implements FileSliceReader for Hive as specified in #7080. This is still WIP. Please do not merge.

TODO

  • Models as specified in RFC-64
  • HoodieRecordMerger implementation for ArrayWritable
  • Adapt realtime record reader to new merging logic
  • Schema projection
  • Filter predicate pushdown
  • Refactor presto-hudi connector with this new reader
  • Testing

Impact

High

Risk level (write none, low, medium, or high below)

High

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope added the pr:wip (Work in Progress) and reader-core labels Nov 14, 2022
@codope changed the title from "[HUDI-5138][WIP][DO NOT MERGE] Table Spek FileSliceReader implementation" to "[HUDI-5138][WIP][DO NOT MERGE] Table Spec FileSliceReader implementation" Nov 15, 2022

public class HoodieTableId {

private final String databaseName;
Contributor commented:
Just food for thought: the URI scheme in table naming is usually three-level, "namespace.db.table". We could fold "namespace.db" into a single qualifier, or encode them as standalone concerns, depending on how we envision it being used.
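
The folding the comment suggests could be sketched as follows; the class and method names are illustrative, not the Hudi API:

```java
// Hypothetical sketch of the review suggestion: keep the three-level
// "namespace.db.table" form, but allow "namespace.db" to be folded into a
// single qualifier. Not the actual HoodieTableId implementation.
public class QualifiedTableId {
    private final String namespace;    // e.g. catalog name
    private final String databaseName;
    private final String tableName;

    public QualifiedTableId(String namespace, String databaseName, String tableName) {
        this.namespace = namespace;
        this.databaseName = databaseName;
        this.tableName = tableName;
    }

    /** "namespace.db" folded into one qualifier, as the comment proposes. */
    public String qualifier() {
        return namespace + "." + databaseName;
    }

    /** Full three-level name "namespace.db.table". */
    public String fullName() {
        return qualifier() + "." + tableName;
    }
}
```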


@Override
public FileSliceReader project(InternalSchema schema) {
// should we convert to avro Schema?
Contributor commented:

Yes, we can just convert it to an Avro schema and then use the projection you've already implemented.
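
The delegation pattern the comment describes can be sketched in isolation. The types here are simplified stand-ins: in Hudi the conversion would go through an internal-schema-to-Avro converter whose exact API is an assumption, and the point is only that the InternalSchema overload converts and then reuses the existing Avro-schema overload:

```java
public class ProjectionDelegation {

    /** Stand-in for Hudi's InternalSchema; here it just carries an Avro JSON string. */
    public static class InternalSchema {
        final String avroJson;
        public InternalSchema(String avroJson) { this.avroJson = avroJson; }
    }

    /** Placeholder for the internal-schema-to-Avro conversion (an assumption,
     *  not the real Hudi converter). */
    static String toAvroSchema(InternalSchema schema) {
        return schema.avroJson;
    }

    /** InternalSchema overload: convert, then delegate to the Avro-based overload. */
    public String project(InternalSchema schema) {
        return project(toAvroSchema(schema));
    }

    /** The already-implemented Avro-schema projection (simplified here). */
    public String project(String avroSchema) {
        return "projected:" + avroSchema;
    }
}
```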

public FileSliceReader project(Schema requiredSchema) {
String partitionFields = jobConf.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "");
List<String> partitioningFields =
partitionFields.length() > 0 ? Arrays.stream(partitionFields.split("/")).collect(Collectors.toList())
Contributor commented:

Would "/" be a partition-path field separator for Presto as well?

Member Author replied:

Yes, that's correct.
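
The split in the snippet above can be shown standalone; the helper name is hypothetical, but the logic mirrors the diff: Hive stores partition columns slash-separated under META_TABLE_PARTITION_COLUMNS, and an empty value yields no partition fields:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class PartitionFieldSplit {

    /** Splits a slash-separated partition-columns string into field names;
     *  an empty string (no partition columns) yields an empty list. */
    public static List<String> partitioningFields(String partitionFields) {
        return partitionFields.length() > 0
            ? Arrays.stream(partitionFields.split("/")).collect(Collectors.toList())
            : Collections.emptyList();
    }
}
```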

if (split instanceof RealtimeSplit) {
HoodieMergedLogRecordScanner logRecordScanner = getMergedLogRecordScanner((RealtimeSplit) split, jobConf, readerSchema);
while (baseRecordIterator.hasNext()) {
logRecordScanner.processNextRecord(baseRecordIterator.next());
Contributor commented:

I don't think we'd be able to load the whole file in memory. Instead, we should create a "merging iterator" that goes over the base file and applies updates on the fly (similar to how Spark's implementation does it).
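
A minimal sketch of the "merging iterator" the comment suggests, with simplified stand-in types rather than Hudi's HoodieRecord/ArrayWritable API: base-file records are streamed one at a time, and a log-side update for the same key wins over the base record, so nothing requires materializing the whole base file:

```java
import java.util.Iterator;
import java.util.Map;

/**
 * Hypothetical merging iterator: streams base-file records and applies
 * log-file updates on the fly, keyed by record key. Deletes and log-only
 * inserts are omitted from this sketch.
 */
public class MergingIterator<K, V> implements Iterator<V> {
    private final Iterator<Map.Entry<K, V>> baseRecords;
    private final Map<K, V> logUpdates; // record key -> latest merged log record

    public MergingIterator(Iterator<Map.Entry<K, V>> baseRecords, Map<K, V> logUpdates) {
        this.baseRecords = baseRecords;
        this.logUpdates = logUpdates;
    }

    @Override
    public boolean hasNext() {
        return baseRecords.hasNext();
    }

    @Override
    public V next() {
        Map.Entry<K, V> base = baseRecords.next();
        // An update in the log wins over the base record for the same key.
        return logUpdates.getOrDefault(base.getKey(), base.getValue());
    }
}
```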


import static org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.getMergedLogRecordScanner;

public class HoodieRealtimeMergedRecordReader extends HoodieHiveFileSliceReader
Contributor commented:

Can we use aggregation instead of extension here? I'd suggest we simply encapsulate the FileSliceReader as a field instead of extending it, unless we're planning to extend its functionality.
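
The aggregation the comment suggests could look roughly like this; the interface is a simplified stand-in for HoodieHiveFileSliceReader, not the actual Hudi class:

```java
/**
 * Hypothetical sketch of composition over inheritance: the realtime merged
 * reader holds the file-slice reader as a field and delegates to it, rather
 * than extending it and inheriting its implementation details.
 */
public class RealtimeMergedRecordReader {
    private final FileSliceReader fileSliceReader; // encapsulated, not extended

    public RealtimeMergedRecordReader(FileSliceReader fileSliceReader) {
        this.fileSliceReader = fileSliceReader;
    }

    public Object nextRecord() {
        // Delegate to the wrapped reader; merging logic can layer on top here.
        return fileSliceReader.next();
    }

    /** Stand-in for the file-slice reader being wrapped. */
    interface FileSliceReader {
        Object next();
    }
}
```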

@codope changed the base branch from release-feature-rfc46 to master December 19, 2022 06:24
@codope changed the base branch from master to release-feature-rfc46 December 19, 2022 06:25
@codope (Member Author) commented Dec 19, 2022

Closing this in favor of #7508. There were quite a few conflicts; it seemed better to fork off the latest master and apply my changes.

@codope closed this Dec 19, 2022