Python parquet query fails against s3 #2828

Closed
K377U opened this issue Dec 21, 2021 · 1 comment

K377U commented Dec 21, 2021

What happens?

A Parquet query against S3 fails from Python. The same operation works against a local Parquet file.

Fix #2830

pip install duckdb

$ python parquet_test.py 
Traceback (most recent call last):
  File "parquet_test.py", line 40, in <module>
    connection.execute(f"SELECT * FROM parquet_scan('{parquer_file}') LIMIT 10")
RuntimeError: IO Error: No files found that match the pattern "<hidden>"

pip install duckdb --pre

$ python parquet_test.py 
Traceback (most recent call last):
  File "parquet_test.py", line 36, in <module>
    connection.execute(f"SET s3_access_key_id='{s3_access_key_id}'")
RuntimeError: Catalog Error: unrecognized configuration parameter "s3_access_key_id"

Did you mean: "access_mode"

BUILD_HTTPFS=1 pip install duckdb --no-binary duckdb

$ python parquet_test.py 
Traceback (most recent call last):
  File "parquet_test.py", line 40, in <module>
    connection.execute(f"SELECT * FROM parquet_scan('{parquer_file}') LIMIT 10")
RuntimeError: IO Error: No files found that match the pattern "<hidden>"

BUILD_HTTPFS=1 pip install duckdb --no-binary duckdb --pre

$ python parquet_test.py 
Traceback (most recent call last):
  File "parquet_test.py", line 39, in <module>
    connection.execute(f"SET s3_access_key_id='{s3_access_key_id}'")
RuntimeError: Catalog Error: unrecognized configuration parameter "s3_access_key_id"

Did you mean: "access_mode"
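
The Catalog Error above suggests that the installed build does not recognize the S3 settings at all, which would be consistent with the HTTPFS extension not being compiled in. A minimal sketch of a check for that case follows. The helper name s3_settings_supported is hypothetical (it is not part of DuckDB), and it assumes only that the connection object has an execute method and that the error message contains the text shown in the traceback.

```python
def s3_settings_supported(connection):
    """Return True if the connection accepts S3 settings, False if it rejects them.

    Hypothetical helper, not part of DuckDB. It relies on the error text seen
    in the traceback above ("unrecognized configuration parameter").
    """
    try:
        # Placeholder region, used only as a probe. Set the real value afterwards.
        connection.execute("SET s3_region='us-east-1'")
    except Exception as exc:
        if "unrecognized configuration parameter" in str(exc):
            return False
        # Any other failure is unrelated to this check, so re-raise it.
        raise
    return True
```

Calling this right after duckdb.connect() and before the other SET statements would separate "this build has no S3 support" from "the file was not found".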

To Reproduce

Configuration:

export S3_ACCESS_KEY_ID=...
export S3_SECRET_ACCESS_KEY=...
export PARQUET_FILE=s3://<bucket>/<path>/<filename>.parquet
# virtualenv env
# pip install boto3
# One of these:
# pip install duckdb
# pip install duckdb --pre
# BUILD_HTTPFS=1 pip install duckdb --no-binary duckdb
# BUILD_HTTPFS=1 pip install duckdb --no-binary duckdb --pre

import os
import json
from urllib.parse import urlparse
import duckdb
import boto3

s3_endpoint = os.environ.get("S3_ENDPOINT", None)
s3_region = os.environ.get("S3_REGION", None)
s3_access_key_id = os.environ.get("S3_ACCESS_KEY_ID", None)
s3_secret_access_key = os.environ.get("S3_SECRET_ACCESS_KEY", None)
parquet_file = os.environ.get("PARQUET_FILE", None)

# Check that the file exists
client = boto3.client(
    "s3",
    endpoint_url=s3_endpoint,
    aws_access_key_id=s3_access_key_id,
    aws_secret_access_key=s3_secret_access_key,
    config=boto3.session.Config(
        signature_version="s3v4",
    ),
)
uri = urlparse(parquet_file)
client.head_object(Bucket=uri.netloc, Key=uri.path.lstrip("/"))


# Connect and query
connection = duckdb.connect(database=":memory:")
if s3_endpoint:
    connection.execute(f"SET s3_endpoint='{s3_endpoint}'")
if s3_access_key_id:
    connection.execute(f"SET s3_access_key_id='{s3_access_key_id}'")
if s3_secret_access_key:
    connection.execute(f"SET s3_secret_access_key='{s3_secret_access_key}'")
if s3_region:
    connection.execute(f"SET s3_region='{s3_region}'")

connection.execute(f"SELECT * FROM parquet_scan('{parquet_file}') LIMIT 10")

header = [col[0] for col in connection.description]
for row in connection.fetchall():
    print(json.dumps(dict(zip(header, row))))

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • DuckDB Version: 0.3.1 and 0.3.2.dev710
  • DuckDB Client: Python 3.8

Before Submitting

  • [*] Have you tried this on the latest master branch?
  • Python: pip install duckdb --upgrade --pre
  • [*] Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

K377U commented Dec 21, 2021

I did some digging. To make it work, I had to move the if 'BUILD_HTTPFS' in os.environ check in setup.py so that it comes before the for ext in extensions: loop:

https://github.com/duckdb/duckdb/blob/master/tools/pythonpkg/setup.py#L107
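
Why the ordering matters can be sketched roughly as follows. This is a simplified illustration, not the actual setup.py: the function name extension_flags, the default extension list, and the flag format are all assumptions. The point is only that an extension added after the loop has run never contributes a build flag.

```python
def extension_flags(environ):
    """Build compiler flags for the enabled extensions (illustrative only)."""
    # Assumed default list, for illustration; the real list may differ.
    extensions = ["parquet"]

    # The opt-in check runs before the loop, so the loop sees 'httpfs'.
    if "BUILD_HTTPFS" in environ:
        extensions.append("httpfs")

    flags = []
    for ext in extensions:
        # Assumed flag naming, for illustration.
        flags.append("-DBUILD_{}_EXTENSION".format(ext.upper()))
    return flags
```

If the BUILD_HTTPFS check came after the loop instead, the returned flags would be the same with or without the environment variable.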

I also needed to add SET s3_region to the code above to make it work.
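
Since the script above silently skips SET s3_region when S3_REGION is unset, one way to make that requirement explicit is sketched below. The helper build_s3_settings and the S3_SETTINGS mapping are hypothetical names, not part of DuckDB. The setting names are the ones used in the script above.

```python
# Hypothetical mapping from environment variables to the S3 settings
# used in the script above.
S3_SETTINGS = {
    "S3_ENDPOINT": "s3_endpoint",
    "S3_REGION": "s3_region",
    "S3_ACCESS_KEY_ID": "s3_access_key_id",
    "S3_SECRET_ACCESS_KEY": "s3_secret_access_key",
}


def build_s3_settings(env):
    """Return the SET statements for every S3 variable present in env."""
    # Fail early instead of running the query without a region.
    if not env.get("S3_REGION"):
        raise ValueError("S3_REGION must be set for the S3 query to work")

    statements = []
    for env_name, setting in S3_SETTINGS.items():
        value = env.get(env_name)
        if value:
            # Double any single quotes so the value stays inside the SQL literal.
            escaped = value.replace("'", "''")
            statements.append(f"SET {setting}='{escaped}'")
    return statements


# Usage, assuming `connection` is the DuckDB connection from the script above:
#     for statement in build_s3_settings(os.environ):
#         connection.execute(statement)
```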
