[MINOR] HoodieAvroUtils supports enum => conversion rewrite #8738

Merged: danny0405 merged 3 commits into apache:master from envomp:master on May 18, 2023

envomp (Contributor) commented on May 17, 2023

Our current flows are as follows:

fetch schema:

  • Fetch the desired table schema in Avro format from the schema registry
  • Derive the corresponding dataset schema from the desired table Avro schema
  • Convert that dataset schema back to an Avro schema to obtain a unified schema

transform input:

  • Consume Kafka via Spark Streaming and receive an RDD<GenericRecord>
  • Rewrite the RDD to the unified schema to resolve version differences and arrive at the desired schema, in which some fields are dropped and some data types change (for example, Enum => String)
    • We use a copy of the org.apache.hudi.avro.HoodieAvroUtils class to do the rewrite, and we need it to support the rewrite procedure described above
    • Hopefully we can make the change in the public repo so that we no longer need to maintain a custom rewrite class
  • Convert the RDD<GenericRecord> to a Dataset<Row>
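
To make the rewrite step above concrete, here is a minimal, library-free sketch of the idea: walk the fields of the target (unified) schema, drop everything else, and turn enum values into strings where the target field is a string. It does not use Avro or Hudi and does not reflect the actual HoodieAvroUtils implementation. Every name in it (`EnumToStringRewriteSketch`, `Field`, `FieldType`, `Status`, `rewrite`, `convert`) is hypothetical, and records are modelled as plain maps purely for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a record rewrite; NOT the Avro or Hudi API.
final class EnumToStringRewriteSketch {

    // Hypothetical enum standing in for an enum-typed field in the source record.
    enum Status { ACTIVE, SUSPENDED }

    // The only target field types this sketch knows about.
    enum FieldType { STRING, LONG }

    // One field of the target (unified) schema: a name and the required type.
    static final class Field {
        final String name;
        final FieldType type;

        Field(String name, FieldType type) {
            this.name = name;
            this.type = type;
        }
    }

    // Builds a new record containing only the target schema's fields,
    // converting each value to the type that the target schema asks for.
    static Map<String, Object> rewrite(Map<String, Object> source, List<Field> targetSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Field field : targetSchema) {
            out.put(field.name, convert(source.get(field.name), field.type));
        }
        return out;
    }

    // Converts a single value; enum => string is the case this PR is about.
    static Object convert(Object value, FieldType targetType) {
        if (value == null) {
            return null; // a field missing from the source stays null in this sketch
        }
        if (targetType == FieldType.STRING) {
            if (value instanceof Enum) {
                return ((Enum<?>) value).name(); // enum => string
            }
            if (value instanceof CharSequence) {
                return value.toString();
            }
        } else if (targetType == FieldType.LONG) {
            if (value instanceof Integer || value instanceof Long) {
                return ((Number) value).longValue(); // int => long widening
            }
        }
        throw new IllegalArgumentException(
            "Cannot rewrite " + value.getClass().getSimpleName() + " as " + targetType);
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("id", 7);
        source.put("status", Status.ACTIVE);
        source.put("legacy", "dropped because the target schema does not list it");

        List<Field> target = Arrays.asList(
            new Field("id", FieldType.LONG),
            new Field("status", FieldType.STRING));

        System.out.println(rewrite(source, target)); // prints {id=7, status=ACTIVE}
    }
}
```

Driving the conversion from the target schema rather than from the source record is what lets dropped fields disappear without any special handling: a field that the target does not list is simply never visited.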

In some situations we read input directly from S3, which yields a Dataset<Row> that needs to be backfilled into the table. Keeping control of the schema on the application side, rather than having it inferred from raw data, is therefore beneficial for us.

Change Logs

Rewrite to support enum => string conversion

Impact

no impact

Risk level (write none, low, medium or high below)

none

Documentation Update

no documentation needed

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

envomp changed the title from "HoodieAvroUtils supports enum => conversion rewrite" to "[MINOR] HoodieAvroUtils supports enum => conversion rewrite" on May 17, 2023
envomp marked this pull request as ready for review on May 17, 2023, 13:51
danny0405 merged commit 423102b into apache:master on May 18, 2023