
OneTable with MinIO Buckets #327

Closed
soumilshah1995 opened this issue Feb 8, 2024 · 21 comments
Comments

soumilshah1995 commented Feb 8, 2024

Hello there, I have a local MinIO bucket named huditest.
[Screenshot: MinIO console showing the huditest bucket]

I am trying to use OneTable with MinIO.

docker-compose file

version: "3"

services:

  metastore_db:
    image: postgres:11
    hostname: metastore_db
    ports:
      - 5432:5432
    environment:
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive
      POSTGRES_DB: metastore

  hive-metastore:
    hostname: hive-metastore
    image: 'starburstdata/hive:3.1.2-e.18'
    ports:
      - '9083:9083' # Metastore Thrift
    environment:
      HIVE_METASTORE_DRIVER: org.postgresql.Driver
      HIVE_METASTORE_JDBC_URL: jdbc:postgresql://metastore_db:5432/metastore
      HIVE_METASTORE_USER: hive
      HIVE_METASTORE_PASSWORD: hive
      HIVE_METASTORE_WAREHOUSE_DIR: s3://datalake/
      S3_ENDPOINT: http://minio:9000
      S3_ACCESS_KEY: admin
      S3_SECRET_KEY: password
      S3_PATH_STYLE_ACCESS: "true"
      REGION: ""
      GOOGLE_CLOUD_KEY_FILE_PATH: ""
      AZURE_ADL_CLIENT_ID: ""
      AZURE_ADL_CREDENTIAL: ""
      AZURE_ADL_REFRESH_URL: ""
      AZURE_ABFS_STORAGE_ACCOUNT: ""
      AZURE_ABFS_ACCESS_KEY: ""
      AZURE_WASB_STORAGE_ACCOUNT: ""
      AZURE_ABFS_OAUTH: ""
      AZURE_ABFS_OAUTH_TOKEN_PROVIDER: ""
      AZURE_ABFS_OAUTH_CLIENT_ID: ""
      AZURE_ABFS_OAUTH_SECRET: ""
      AZURE_ABFS_OAUTH_ENDPOINT: ""
      AZURE_WASB_ACCESS_KEY: ""
      HIVE_METASTORE_USERS_IN_ADMIN_ROLE: "admin"
    depends_on:
      - metastore_db
    healthcheck:
      test: bash -c "exec 6<> /dev/tcp/localhost/9083"

  minio:
    image: minio/minio
    environment:
        
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      default:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]

  mc:
    depends_on:
      - minio
    image: minio/mc
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "



volumes:
  hive-metastore-postgresql:

networks:
  default:
     name: hudi
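
Note that the mc service above only creates the warehouse bucket, while the Hudi job below writes to s3a://huditest/..., so that bucket has to exist as well (here it was created manually in the MinIO console, as the screenshot shows). A minimal sketch for creating it with the same mc alias configured in the entrypoint:

# assumes the "minio" alias added by the mc container's entrypoint
/usr/bin/mc mb minio/huditest
/usr/bin/mc policy set public minio/huditest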

hudi_job.py

try:
    import os
    import sys
    import uuid
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark import SparkConf, SparkContext
    from faker import Faker
    from datetime import datetime
    import random
    import pandas as pd  # Pandas for pretty printing

    print("Imports loaded")

except Exception as e:
    print("error", e)

HUDI_VERSION = '0.14.0'
SPARK_VERSION = '3.4'

SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION},org.apache.hadoop:hadoop-aws:3.3.2 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable

spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()

print(spark)
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "http://127.0.0.1:9000/")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "admin")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "password")
spark._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
                                     "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

global faker
faker = Faker()


def get_customer_data(total_customers=2):
    customers_array = []
    for i in range(0, total_customers):
        customer_data = {
            "customer_id": str(uuid.uuid4()),
            "name": faker.name(),
            "state": faker.state(),
            "city": faker.city(),
            "email": faker.email(),
            "created_at": datetime.now().isoformat().__str__(),
            "address": faker.address(),
            "salary": faker.random_int(min=30000, max=100000)
        }
        customers_array.append(customer_data)
    return customers_array


global total_customers, order_data_sample_size
total_customers = 5000
customer_data = get_customer_data(total_customers=total_customers)
spark_df_customers = spark.createDataFrame(data=[tuple(i.values()) for i in customer_data],
                                           schema=list(customer_data[0].keys()))
spark_df_customers.show(3)


def write_to_hudi(spark_df,
                  table_name,
                  db_name,
                  method='upsert',
                  table_type='COPY_ON_WRITE',
                  recordkey='',
                  precombine='',
                  partition_fields=''
                  ):
    path = f"s3a://huditest/hudi/database={db_name}/table_name={table_name}"
    # path = f"file:///Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/hudi/{db_name}/{table_name}"

    hudi_options = {
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.table.type': table_type,
        'hoodie.datasource.write.table.name': table_name,
        'hoodie.datasource.write.operation': method,
        'hoodie.datasource.write.recordkey.field': recordkey,
        'hoodie.datasource.write.precombine.field': precombine,
        "hoodie.datasource.write.partitionpath.field": partition_fields,

        "hoodie.datasource.hive_sync.database": db_name,
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_fields": partition_fields,
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.metastore.uris": "thrift://localhost:9083",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.write.hive_style_partitioning": "true"
    }

    print("\n")
    print(path)
    print("\n")

    spark_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(path)


write_to_hudi(
    spark_df=spark_df_customers,
    db_name="default",
    table_name="customers",
    recordkey="customer_id",
    precombine="created_at",
    partition_fields="state"
)

config.yml

sourceFormat: HUDI
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3a://huditest/hudi/database=default/table_name=customers
    tableName: customers
    partitionSpec: state:VALUE


Error

soumilshah@Soumils-MacBook-Pro StarRocks-Hudi-Minio % java \
-jar  /Users/soumilshah/IdeaProjects/SparkProject/MyGIt/StarRocks-Hudi-Minio/jar/utilities-0.1.0-beta1-bundled.jar \
--datasetConfig ./my_config.yaml
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/Users/soumilshah/IdeaProjects/SparkProject/MyGIt/StarRocks-Hudi-Minio/jar/utilities-0.1.0-beta1-bundled.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-02-08 15:48:18 INFO  io.onetable.utilities.RunSync:141 - Running sync for basePath s3://huditest/hudi/database=default/table_name=customers/ for following table formats [DELTA]
2024-02-08 15:48:18 ERROR io.onetable.utilities.RunSync:164 - Error running sync for s3://huditest/hudi/database=default/table_name=customers/
org.apache.hudi.exception.HoodieIOException: Could not check if s3://huditest/hudi/database=default/table_name=customers is a valid table
	at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:140) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:692) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:85) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:774) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at io.onetable.hudi.HudiSourceClientProvider.getSourceClientInstance(HudiSourceClientProvider.java:42) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at io.onetable.hudi.HudiSourceClientProvider.getSourceClientInstance(HudiSourceClientProvider.java:31) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at io.onetable.client.OneTableClient.sync(OneTableClient.java:97) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at io.onetable.utilities.RunSync.main(RunSync.java:162) ~[utilities-0.1.0-beta1-bundled.jar:?]
Caused by: java.nio.file.AccessDeniedException: s3://huditest/hudi/database=default/table_name=customers/.hoodie: getFileStatus on s3://huditest/hudi/database=default/table_name=customers/.hoodie: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 8K4R3M1SP8D8VJYF; S3 Extended Request ID: t8D2C5IazpRmCbGGH0Hofx1FYf1KqHHZbgw5iSR/xWxw/bOt4f8iSbtszyGFP1PaPYss/Vi1XwdvMZX4wRIUrTk7RIZ5OUS3taz+jCMzLyk=; Proxy: null), S3 Extended Request ID: t8D2C5IazpRmCbGGH0Hofx1FYf1KqHHZbgw5iSR/xWxw/bOt4f8iSbtszyGFP1PaPYss/Vi1XwdvMZX4wRIUrTk7RIZ5OUS3taz+jCMzLyk=:403 Forbidden
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:230) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:151) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2275) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2226) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2160) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$getFileStatus$17(HoodieWrapperFileSystem.java:410) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:404) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:51) ~[utilities-0.1.0-beta1-bundled.jar:?]
	... 8 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 8K4R3M1SP8D8VJYF; S3 Extended Request ID: t8D2C5IazpRmCbGGH0Hofx1FYf1KqHHZbgw5iSR/xWxw/bOt4f8iSbtszyGFP1PaPYss/Vi1XwdvMZX4wRIUrTk7RIZ5OUS3taz+jCMzLyk=; Proxy: null)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$4(S3AFileSystem.java:1307) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1304) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2264) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2226) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2160) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$getFileStatus$17(HoodieWrapperFileSystem.java:410) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:404) ~[utilities-0.1.0-beta1-bundled.jar:?]
	at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:51) ~[utilities-0.1.0-beta1-bundled.jar:?]
	... 8 more
soumilshah@Soumils-MacBook-Pro StarRocks-Hudi-Minio % 

How do I configure OneTable to work with MinIO buckets?
I tried both s3 and s3a; how would I set this up to work with MinIO buckets?

@alberttwong (Contributor)

related: #322

@soumilshah1995 (Author)

@alberttwong did you get it to work with OneTable?

@soumilshah1995 (Author)

Still the same issue, @alberttwong:

export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
export S3_ENDPOINT=http://localhost:9000
export AWS_ENDPOINT_URL_S3=http://localhost:9000
export AWS_ENDPOINT=http://localhost:9000


java \
-jar  /Users/soumilshah/IdeaProjects/SparkProject/MyGIt/StarRocks-Hudi-Minio/jar/utilities-0.1.0-beta1-bundled.jar \
--datasetConfig ./config.yml

@alberttwong (Contributor) commented Feb 22, 2024

@sagarlakshmipathy is there something we're missing? I feel like the Java app doesn't pick up the environment variables.

@alberttwong (Contributor)

[root@spark-hudi auxjars]# cat ~/.aws/config 
[default]
region = us-west-2
output = json

[services testing-s3]
s3 = 
  endpoint_url = http://minio:9000
[root@spark-hudi auxjars]# cat ~/.aws/credentials 
[default]
aws_access_key_id = admin
aws_secret_access_key = password
[root@spark-hudi auxjars]# aws s3 ls --endpoint http://minio:9000
2024-02-22 21:23:53 huditest
2024-02-22 21:18:19 warehouse
[root@spark-hudi auxjars]# env|grep AWS
AWS_IGNORE_CONFIGURED_ENDPOINT_URLS=true
AWS_REGION=us-east-1
AWS_ENDPOINT_URL_S3=http://minio:9000

@alberttwong (Contributor) commented Feb 23, 2024

Okay, I figured it out. You need to modify utilities/src/main/resources/onetable-hadoop-defaults.xml to include additional configs (#337). We need to clean this up so that OneTable scans for conf files.

Also, we need to modify the Trino schema create command.

See StarRocks/demo#54 for all the instructions.
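
For reference, the additional configs in question are the standard Hadoop S3A properties pointing at MinIO. A minimal sketch, using the endpoint and credentials from this thread (the exact entries are in the PR linked above; adjust the values to your setup):

<!-- sketch: standard hadoop-aws S3A keys for a local MinIO endpoint -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>http://minio:9000</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>admin</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>password</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>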

@soumilshah1995 (Author) commented Feb 24, 2024

I'm lost, lol.
Which variables do we need to set?

I tried

export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
export S3_ENDPOINT=http://localhost:9000
export AWS_ENDPOINT_URL_S3=http://localhost:9000
export AWS_ENDPOINT=http://localhost:9000

Are these all the variables I need to set?

@alberttwong (Contributor)

That's the issue... none of them worked, so I was just listing all the variants I tried. What worked was only for my own demo, where I modified my Hadoop settings. I couldn't get it working for OneTable's Docker demo.

@soumilshah1995 (Author)

Understood. Let me post the ticket in the Hudi channel; maybe someone can help there.

@nfarah86

@the-other-tim-brown is this something you can help with here?

@sagarlakshmipathy (Contributor)

@soumilshah1995 @alberttwong I'm on vacation until Thursday, I can help replicate this issue once I'm back. @the-other-tim-brown feel free to chime in when you get a chance.

@soumilshah1995 (Author)

@sagarlakshmipathy Thanks a lot. Please enjoy your vacation; this is not urgent or blocking, it's mostly for a POC.
No hurry at all.

@the-other-tim-brown (Contributor)

I have not used MinIO before. I will have to spend some time coming up to speed on it.

@the-other-tim-brown (Contributor)

@soumilshah1995 have you tried updating the Hadoop config like Albert suggested? He put up the configs he used here: https://github.com/apache/incubator-xtable/pull/337/files

@soumilshah1995 (Author)

I haven't opted to use a container for my Hadoop setup. Could you kindly suggest the steps to set it up on a Mac?

@alberttwong (Contributor) commented Mar 28, 2024

Soumil... I think if you slightly change your java run command to java -jar utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig onetable.yaml -p ../conf/core-site.xml, it should work. The raw XML can be found at https://github.com/StarRocks/demo/blob/master/documentation-samples/datalakehouse/conf/core-site.xml
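
For context, a core-site.xml passed with -p is just a Hadoop <configuration> wrapper around the same fs.s3a.* properties sketched earlier in this thread. A minimal sketch (not necessarily the exact contents of the linked file):

<configuration>
  <!-- sketch: point S3A at MinIO with path-style access and static credentials -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://minio:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>admin</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>password</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>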

@soumilshah1995 (Author)

I think I will take a different route instead, using DeltaStreamer and XTable with MinIO and StarRocks.
I need to try that; I will try it and keep you all posted here.

@soumilshah1995 (Author)

I have decided to go this route instead:
[Screenshot: proposed alternative setup]

@soumilshah1995 (Author)

Here is my architecture:
[Architecture diagram]
