SELECT on Hive Partitioned PyArrow Dataset fails when File System is ADLFS #5305
Comments
It could be due to the way DuckDB parallelises the read. Can you try with single-threaded mode?
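(For anyone following along, a minimal sketch of setting single-threaded mode in the DuckDB Python client, assuming a plain in-memory connection:)

```python
import duckdb

# Either configure the connection as single-threaded up front ...
con = duckdb.connect(config={"threads": 1})

# ... or switch an existing connection over with a pragma.
con.execute("PRAGMA threads=1")
```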
The hang still occurs with PRAGMA threads=1; I added a progress bar and profiling as well. During the hang, nothing further populates from the progress bar or the profiling output. I also ran an EXPLAIN on the query: it is able to explain even though it hangs on execution. The source code in the original post has been updated to reflect these changes.
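For reproducibility, the diagnostics described above map to DuckDB pragmas roughly like this (a sketch; `example_ds` is the dataset from the original post):

```python
con.execute("PRAGMA enable_progress_bar")  # show a progress bar while a query runs
con.execute("PRAGMA enable_profiling")     # emit profiling output when a query finishes

# EXPLAIN returns a plan even though the actual SELECT hangs.
print(con.execute("EXPLAIN select * from example_ds").fetchall())
```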
I am facing similar problems, even on non-partitioned data in ADLFS. After some experimentation, I observed that if I limit the Arrow dataset to a single parquet file on ADLS, the code completes. This works for up to 3 parquet files in the dataset; any more and it hangs. This is my environment:
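A sketch of how the dataset might be narrowed to a fixed number of files (the container and path names here are placeholders, not from the original report; the env var name follows the original post):

```python
import os
import adlfs
import pyarrow.dataset as ds

fs = adlfs.AzureBlobFileSystem(
    connection_string=os.environ["azure_connection_string_secret"]
)

# Keep only the first N parquet files under the dataset prefix.
# With N <= 3 the DuckDB query completes; with more files it hangs.
files = [f for f in fs.ls("mycontainer/mydataset") if f.endswith(".parquet")][:3]
limited_ds = ds.dataset(files, format="parquet", filesystem=fs)
```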
After upgrading to duckdb 0.6.1, the program hangs even when using 1 parquet file.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days. |
This issue was closed because it has been stale for 30 days with no activity. |
What's happening: A DuckDB select on a hive-partitioned Arrow dataset hangs when the dataset is read from an adlfs.spec.AzureBlobFileSystem.
What I expect: For the DuckDB select of a hive-partitioned Arrow dataset to behave essentially the same regardless of whether an adlfs.spec.AzureBlobFileSystem or an fsspec.implementations.local.LocalFileSystem is used.
What I'm seeing is that

```python
con.execute("select * from example_ds").arrow()
```

hangs silently when `example_ds` is hive-partitioned and the backing file system is `adlfs` for Azure Data Lake Gen 2 (specifically yielding an `adlfs.spec.AzureBlobFileSystem`). Notably, `example_ds.to_table()` using Arrow alone passes without complaint.

I tested this in the docker image `python:3.10-buster`. I did a fresh install using `pip install duckdb pyarrow adlfs`, yielding a pip freeze of:

I have a script that tests with the local file system and the remote file system, with and without partitioning. It only hangs in the case that combines the remote file system with partitioning. To make this example work, you'll need to set an env var for `azure_connection_string_secret` and modify the line that assigns the base directory to write to, viz. the variable `base_dir`. My apologies if the python code is unattractive/non-pythonic... I'm still learning.

Note that earlier versions were run by hand with better cleanup between the partitioned and non-partitioned loops, and the result was the same. There is a line up there you can uncomment, but I didn't feel comfortable posting code that would issue a delete command on someone else's file system.
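Since the full script isn't reproduced here, a condensed sketch of the failing path, assuming the `azure_connection_string_secret` env var and a placeholder `base_dir` as described above:

```python
import os
import adlfs
import duckdb
import pyarrow as pa
import pyarrow.dataset as ds

fs = adlfs.AzureBlobFileSystem(
    connection_string=os.environ["azure_connection_string_secret"]
)
base_dir = "mycontainer/duckdb-repro"  # placeholder: set to a writable ADLS path

# Write a small hive-partitioned dataset to ADLS.
table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})
ds.write_dataset(table, base_dir, format="parquet",
                 partitioning=["part"], partitioning_flavor="hive",
                 filesystem=fs)

# Re-open it as an Arrow dataset; reading it with Arrow alone works ...
example_ds = ds.dataset(base_dir, format="parquet",
                        partitioning="hive", filesystem=fs)
print(example_ds.to_table())  # completes without complaint

# ... but querying it through DuckDB's replacement scan hangs here.
con = duckdb.connect()
print(con.execute("select * from example_ds").arrow())
```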
cc @jwills