Adding in demo of xtable with S3 + HMS converting hudi 0.14.1 to delta lake and iceberg #459

Open: wants to merge 6 commits into base: main

alberttwong (Contributor) commented Jun 4, 2024

Demo for #338

What is the purpose of the pull request

This PR takes the existing xtable Docker demo as a base and modifies it to work with S3. It is an end-to-end example with a README.

Brief change log

  1. Added MinIO container images to provide an object store.
  2. Changed the HMS image to the Starburst HMS image, because it already has the S3 libraries built in.
  3. Built a custom Spark 3.4 container image based on JDK 11 with Hadoop 2.10.2 and Hive 2.3.10 installed (2.3.1 can't be used due to a Hive 2.3.1 bug). Available at https://hub.docker.com/r/atwong/openjdk-11-spark-3.4-hive-2.3.10-hadoop-2.10.2 if you don't want to build it.
  4. Git clone Hudi and build it with Maven on JDK 8 to get the hudi-hive-sync jars (you can skip this by using hudi-hive-sync-bundle from mvnrepository.com).
  5. Added the libraries that were missing for running run_sync_tool.sh: https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws, https://mvnrepository.com/artifact/com.esotericsoftware/kryo-shaded/4.0.2, https://mvnrepository.com/artifact/org.apache.parquet/parquet-avro, https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client
  6. Modified the Iceberg, Hudi and Delta Trino catalog configurations to support S3 bucket lookups.
  7. Added core-site.xml to inject parameters into xtable, and modified /etc/hadoop/core-site.xml to inject parameters into the hudi-hive-sync tool.
  8. Modified the pyspark demo script to include S3 configs.
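
As a rough illustration of items 7 and 8, here is a minimal sketch of how one set of S3A settings could feed both a core-site.xml file and a pyspark session. It is not the configuration from this PR's diff: the endpoint `http://minio:9000`, the credentials, and the helper function names are all hypothetical. Only the `fs.s3a.*` property names and the `spark.hadoop.` prefix are standard Hadoop/Spark conventions.

```python
import xml.etree.ElementTree as ET


def s3a_properties(endpoint, access_key, secret_key):
    """Return Hadoop S3A properties for a MinIO-style object store."""
    return {
        "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # MinIO typically serves buckets as URL paths, not as subdomains.
        "fs.s3a.path.style.access": "true",
        # Enable SSL only when the endpoint itself uses https.
        "fs.s3a.connection.ssl.enabled": str(endpoint.startswith("https://")).lower(),
    }


def to_spark_conf(props):
    """Prefix each key with 'spark.hadoop.' so Spark forwards it to Hadoop."""
    return {f"spark.hadoop.{key}": value for key, value in props.items()}


def to_core_site_xml(props):
    """Render the properties as a Hadoop core-site.xml document."""
    root = ET.Element("configuration")
    for key, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = key
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical endpoint and credentials, for illustration only.
props = s3a_properties("http://minio:9000", "admin", "password")
spark_conf = to_spark_conf(props)
core_site = to_core_site_xml(props)
```

In a pyspark script, each pair in `spark_conf` could be applied with `SparkSession.builder.config(key, value)`, while the `core_site` string could be written to the file that xtable or the hudi-hive-sync tool reads.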

Verify this pull request

  • Manually verified the change by running a job locally.

@alberttwong changed the title from "Adding in demo of xtable with S3" to "Adding in demo of xtable with S3 + HMS converting hudi to delta lake and iceberg" Jun 4, 2024
@alberttwong changed the title from "Adding in demo of xtable with S3 + HMS converting hudi to delta lake and iceberg" to "Adding in demo of xtable with S3 + HMS converting hudi 0.14.1 to delta lake and iceberg" Jun 4, 2024
@alberttwong alberttwong mentioned this pull request Jun 5, 2024
Signed-off-by: Albert Wong <atwong@alumni.uci.edu>
daragu (Contributor) commented Jun 6, 2024

Hi @alberttwong, your contribution is excellent. One suggestion: could we have a demos folder as the parent folder?

demos/
  demo-local/
  demo-s3/

add license to pass rat
updated RAT settings in pom to exclude demo-s3
ghost commented Jun 6, 2024

CI report:

Bot commands @xtable-bot supports the following commands:
  • @xtable-bot run azure re-run the last Azure build

alberttwong (Contributor, Author) commented

@daragu I think that's possible. The reason why I didn't change it was that there are links from xtable docs and other places to that demo folder.
