Make it easier to query parquet files on remote storage with datafusion-cli #16304

@alamb

Description

Is your feature request related to a problem or challenge?

More and more blog posts like this one show examples of running queries against data in remote object stores.

I would like to compare the performance of DataFusion to these other systems, but I find it really hard to run the examples.

For example, the above blog post includes this query:

INSERT INTO tripdata
SELECT * FROM s3('s3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/*.parquet', NOSIGN)
SETTINGS max_threads=32, max_insert_threads=32, input_format_parallel_parsing=0;

When I try to follow the example in https://datafusion.apache.org/user-guide/cli/datasources.html#remote-files-directories to look at this same data in datafusion-cli, it doesn't work (and it gives me a confusing error message):

$ datafusion-cli
DataFusion CLI v47.0.0
> select count(*) from 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet';
Error during planning: table 'datafusion.public.s3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet' not found
>

I also tried setting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as suggested, but it still fails:

$ AWS_ACCESS_KEY_ID=foo AWS_SECRET_ACCESS_KEY=bar datafusion-cli
DataFusion CLI v47.0.0
> select count(*) from 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet';
Error during planning: table 'datafusion.public.s3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet' not found

CREATE EXTERNAL TABLE does appear to recognize the S3 location, but it then fails with a 403 Forbidden error:

> CREATE EXTERNAL TABLE hits
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet';
Object Store error: The operation lacked the necessary privileges to complete for path nyc_taxi_rides/data/tripdata_parquet: Error performing HEAD https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet in 136.439542ms - Server returned non-2xx status code: 403 Forbidden:

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)
