Replies: 3 comments 8 replies
-
|
I had a look at doing this, and I think it's doable, but there's some internal design considerations. I like that this would unify how some of the other |
Beta Was this translation helpful? Give feedback.
-
|
@samansmink Do you think it makes sense to add Hive filtering and projections to Branch for reference: https://github.com/duckdb/duckdb/compare/main...xevix:hive-filtering-read-blob-text?expand=1 |
Beta Was this translation helpful? Give feedback.
-
|
Thank you @xevix for adding support for this! |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Currently we can use read_blob to get metadata (and data if desired) for files at a specific location. It would be very useful, for hive partitioned datasets, to add the hive partition columns for queries and to the returned result. This would enable new useful ways of interacting with data through duckdb (examples at the bottom).
The most user friendly way I would imagine this would work is like so (similar to read_parquet):
However, another way I could see this working is with a parse_hive_partitioning function (similar to parse_path) which returns a struct:
It's a bit more verbose, and unideal since you lose the ability to do things like early directory tree pruning (actively being discussed here: #7620) in the subquery, but still useful.
Currently, this could already be done with some degree of success by manually globing and parsing to create the table, but you lose the consistency with the implementation in duckdb functions like read_parquet. For example parsing hive partition types and optimizations (like the early tree pruning mentioned above).
Some example benefits listed below:
Example 1: Image/blob-data lookups on filterable hive partition keys
A common pattern for storing image and other blob data in various systems is to store the data on some blob storage and store metadata/location data about that blob in a proper sql database. This allows you to locate your blob data through sql queries on the metadata. With the functionality I'm proposing you could now just store your image/blob data on blob storage with hive partitioning and use read_blob with hive filters in your where statement to locate the blob files that you want to retrieve.
See this small sample of links describing or suggesting this pattern, which could all be solved with duckdb!
#10761 (comment)
https://www.reddit.com/r/learnprogramming/comments/1849ldt/what_best_practices_for_storing_images_and/
https://www.reddit.com/r/Database/comments/h8w0al/can_anyone_share_their_experience_of_storing/
https://www.reddit.com/r/mysql/comments/12kjsw6/images_in_a_mysql_database/
https://www.reddit.com/r/webdev/comments/e0pgee/is_it_better_to_store_images_in_database_or/
https://stackoverflow.com/questions/71346383/how-should-i-store-images-for-my-website-app
https://stackoverflow.com/questions/9000026/what-is-best-practice-when-it-comes-to-storing-images-for-a-gallery
https://softwareengineering.stackexchange.com/questions/357245/store-file-in-filesystem-and-its-metadata-to-the-database-atomicly
Example 2: Simple versioning system
See this previous Q&A post where I posed a simple versioning scheme I had in mind.
#18325
Example 3: Adhoc statistics and metrics on your datasets (whether parquet or whatever else)
Let's say you have a hive partitioned parquet dataset, and you just want to get some metrics like number of files and total size per specific partition. You could do something like this:
And thank you for all the work on this amazing tool!
Beta Was this translation helpful? Give feedback.
All reactions