Virtual columns reading data from parquet using globs? #3269
Hi, are there any virtual columns that identify the exact path of each parquet file when reading many files using a glob? Use case: I write parquet files using the pyarrow function to_parquet, specifying partition_columns. In this case, the partition columns themselves are not stored in the files; instead, their values are used to create sub-folders:
I now want to perform a full scan of the files and group by userid using duckdb:
but I no longer have the column userid. If I had something like a virtual column _path, I could do
Regards,
Replies: 1 comment
Hello!
We have an existing issue for this feature here:
#2303
#1905 allows you to query parquet metadata using globs, so as a workaround you could run one query to build up a list of parquet files, then build up your query in a Python string that looks like:
Select 'my_file' as filename, *
From parquet_scan('my_file.parquet')
Union all
Select 'my_file2' as filename, *
From parquet_scan('my_file2.parquet')
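A minimal sketch of how that query string could be assembled in Python. Everything here is an assumption for illustration: the helper names `sql_quote` and `build_union_query` are hypothetical, the `data/` directory is a made-up location, and the file list comes from Python's standard `glob` module rather than from a parquet metadata query as suggested above.

```python
import glob


def sql_quote(value: str) -> str:
    """Return value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_union_query(paths):
    """Build one SELECT per file, tagged with its path, joined by UNION ALL."""
    if not paths:
        raise ValueError("no parquet files given")
    selects = [
        f"SELECT {sql_quote(str(p))} AS filename, * "
        f"FROM parquet_scan({sql_quote(str(p))})"
        for p in paths
    ]
    return "\nUNION ALL\n".join(selects)


# Hypothetical layout: data/<partition sub-folders>/<part>.parquet
files = sorted(glob.glob("data/**/*.parquet", recursive=True))
if files:
    # Only print when files were found; an empty list would raise above.
    print(build_union_query(files))
```

The resulting string could then be passed to DuckDB, for example with `duckdb.connect().execute(sql)`. Because each row carries its source path in `filename`, the partition value (such as userid) should be recoverable from that column, provided the sub-folder names encode it.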