Feature request: Add `hive_partitioning=true|false` to `read_blob` to provide implied hive partitioning columns to result/queries. #18416

mitstake · 2025-07-26T07:45:47Z

Currently we can use read_blob to get metadata (and data if desired) for files at a specific location. It would be very useful, for hive partitioned datasets, to add the hive partition columns for queries and to the returned result. This would enable new useful ways of interacting with data through duckdb (examples at the bottom).

The most user-friendly way I can imagine this working is the following (similar to read_parquet):

SELECT *
FROM read_blob('/root/my_dataset/**/*.parquet', hive_partitioning=true)
WHERE ...

However, another way I could see this working is with a parse_hive_partitioning function (similar to parse_path) which returns a struct:

SELECT hive_partitions.*, * EXCLUDE hive_partitions
FROM (
    SELECT parse_hive_partitioning(filename) AS hive_partitions, *
    FROM read_blob('/root/my_dataset/**/*.parquet')
)
WHERE ...

It's a bit more verbose, and not ideal since you lose the ability to do things like early directory tree pruning in the subquery (actively being discussed in #7620), but it's still useful.

Currently, this can already be done with some degree of success by manually globbing and parsing paths to create the table, but you lose consistency with the implementation in DuckDB functions like read_parquet, for example the parsing of hive partition types and optimizations (like the early tree pruning mentioned above).
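As a rough illustration of the manual approach described above, here is a minimal Python sketch that pulls key=value directory components out of a path. The function name `parse_hive_partitions` is hypothetical and this is not how DuckDB parses partitions internally; in particular it keeps every value as a string and does no type inference.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Return the key=value directory components of a path as a dict.

    Values are kept as strings; no type inference is attempted here.
    """
    partitions: dict[str, str] = {}
    # Only directory components count; the file name itself is skipped.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


example = parse_hive_partitions(
    "/root/my_dataset/partition1=a/partition2=2024/file.parquet"
)
```

Doing this by hand works for simple layouts, but every caller has to reimplement typing and pruning, which is the inconsistency the request is trying to avoid.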

Some example benefits listed below:

Example 1: Image/blob-data lookups on filterable hive partition keys
A common pattern for storing image and other blob data is to keep the data itself on blob storage and keep metadata and location data about each blob in a proper SQL database. This allows you to locate your blob data through SQL queries on the metadata. With the functionality I'm proposing, you could instead store your image/blob data on blob storage with hive partitioning and use read_blob with hive filters in your WHERE clause to locate the blob files you want to retrieve.

See this small sample of links describing or suggesting this pattern, which could all be solved with duckdb!
#10761 (comment)
https://www.reddit.com/r/learnprogramming/comments/1849ldt/what_best_practices_for_storing_images_and/
https://www.reddit.com/r/Database/comments/h8w0al/can_anyone_share_their_experience_of_storing/
https://www.reddit.com/r/mysql/comments/12kjsw6/images_in_a_mysql_database/
https://www.reddit.com/r/webdev/comments/e0pgee/is_it_better_to_store_images_in_database_or/
https://stackoverflow.com/questions/71346383/how-should-i-store-images-for-my-website-app
https://stackoverflow.com/questions/9000026/what-is-best-practice-when-it-comes-to-storing-images-for-a-gallery
https://softwareengineering.stackexchange.com/questions/357245/store-file-in-filesystem-and-its-metadata-to-the-database-atomicly

Example 2: Simple versioning system
See this previous Q&A post where I posed a simple versioning scheme I had in mind.
#18325

Example 3: Ad hoc statistics and metrics on your datasets (whether parquet or anything else)
Let's say you have a hive partitioned parquet dataset, and you just want to get some metrics like number of files and total size per specific partition. You could do something like this:

SELECT partition1, partition3, count(*) as num_files, sum(size) / 1e6 as total_mb
FROM read_blob('/root/my_dataset/**/*.parquet', hive_partitioning=true)
GROUP BY partition1, partition3
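
For comparison, the same per-partition file count and total size can be computed by hand today. The sketch below assumes you already have (path, size) pairs from some directory listing; the function name `partition_stats` and the sample paths are made up for illustration.

```python
from collections import defaultdict


def partition_stats(files, keys):
    """Group (path, size_in_bytes) pairs by the given hive partition keys.

    Returns {key_values_tuple: (num_files, total_mb)}.
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for path, size in files:
        # Collect key=value directory components (file name excluded).
        dirs = path.split("/")[:-1]
        values = dict(d.split("=", 1) for d in dirs if "=" in d)
        group = tuple(values.get(k) for k in keys)
        counts[group] += 1
        sizes[group] += size
    return {g: (counts[g], sizes[g] / 1e6) for g in counts}


files = [
    ("/root/my_dataset/partition1=a/partition2=x/partition3=1/f0.parquet", 2_000_000),
    ("/root/my_dataset/partition1=a/partition2=y/partition3=1/f1.parquet", 1_000_000),
    ("/root/my_dataset/partition1=b/partition2=x/partition3=2/f2.parquet", 500_000),
]
stats = partition_stats(files, ["partition1", "partition3"])
```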

And thank you for all the work on this amazing tool!

xevix · 2025-08-12T19:21:03Z

I had a look at doing this, and I think it's doable, but there are some internal design considerations. I like that this would unify how some of the other read_* functions work, and it might make sense to use the same MultiFileFunction class they do, but read_blob and read_text differ in that they don't read or produce columns, so it would require either a refactor or working around the column-reading code.

3 replies

mitstake Aug 13, 2025
Author

ah, so the parquet/csv column parsing is tightly integrated into the hive-partition parsing then? Unfortunately my C++ skills are not at a level where I can effectively decipher what's going on here anytime soon. How hard do you think it would be to factor that out so the hive-partitioning could be used in isolation?

xevix Aug 13, 2025

Yeah, some of the Hive partitioning code is part of the parquet/csv/json etc. readers. But the vital bits can be extracted out. I'm working on a POC since I'm also interested in this feature. So far I have the Hive filtering working, just need to get the Hive column projections working (i.e. usable in a SELECT).

D EXPLAIN SELECT filename FROM read_blob('/Users/xevix/Downloads/data/noaa/by_year/**/*.parquet', hive_partitioning=1) WHERE year = 2024 AND element = 'TMAX';

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         READ_BLOB         │
│    ────────────────────   │
│    Function: READ_BLOB    │
│   Projections: filename   │
│                           │
│       File Filters:       │
│  (YEAR = 2024)(ELEMENT =  │
│          'TMAX')          │
│                           │
│      Scanning Files:      │
│          8/13593          │
│                           │
│          ~8 Rows          │
└───────────────────────────┘
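
The "Scanning Files: 8/13593" line reflects files being discarded from their paths alone, before any file is opened. A minimal sketch of that idea in Python follows; it is not DuckDB's implementation, the names `hive_values` and `prune_files` and the sample paths are hypothetical, and unlike the query above it compares keys case-sensitively and values as plain strings.

```python
def hive_values(path: str) -> dict[str, str]:
    # Collect key=value directory components (file name excluded).
    dirs = path.split("/")[:-1]
    return dict(d.split("=", 1) for d in dirs if "=" in d)


def prune_files(paths: list[str], wanted: dict[str, str]) -> list[str]:
    """Keep only paths whose partition values match every entry in `wanted`."""
    kept = []
    for path in paths:
        values = hive_values(path)
        if all(values.get(k) == v for k, v in wanted.items()):
            kept.append(path)
    return kept


paths = [
    "/data/by_year/YEAR=2024/ELEMENT=TMAX/a.parquet",
    "/data/by_year/YEAR=2024/ELEMENT=TMIN/b.parquet",
    "/data/by_year/YEAR=2023/ELEMENT=TMAX/c.parquet",
]
selected = prune_files(paths, {"YEAR": "2024", "ELEMENT": "TMAX"})
```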

xevix Aug 13, 2025

Alright I got projection working, so POC seems ok.

D SELECT * EXCLUDE (content) FROM read_blob('/Users/xevix/Downloads/data/noaa/by_year/**/*.parquet', hive_partitioning=1) WHERE year = 2024 AND element = 'TMAX';
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────┬──────────────────────────┬─────────┬───────┐
│                                                     filename                                                      │     size     │      last_modified       │ ELEMENT │ YEAR  │
│                                                      varchar                                                      │    int64     │ timestamp with time zone │ varchar │ int64 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────┼──────────────────────────┼─────────┼───────┤
│ /Users/xevix/Downloads/data/noaa/by_year/YEAR=2024/ELEMENT=TMAX/0a916734b79044f889cda53b000d215d_0.snappy.parquet │ 1.68 million │ 2024-12-23 12:37:45+00   │ TMAX    │  2024 │
│ /Users/xevix/Downloads/data/noaa/by_year/YEAR=2024/ELEMENT=TMAX/0a916734b79044f889cda53b000d215d_1.snappy.parquet │ 1.68 million │ 2024-12-23 12:37:45+00   │ TMAX    │  2024 │
│ /Users/xevix/Downloads/data/noaa/by_year/YEAR=2024/ELEMENT=TMAX/0a916734b79044f889cda53b000d215d_2.snappy.parquet │ 1.68 million │ 2024-12-23 12:37:45+00   │ TMAX    │  2024 │
│ /Users/xevix/Downloads/data/noaa/by_year/YEAR=2024/ELEMENT=TMAX/0a916734b79044f889cda53b000d215d_3.snappy.parquet │ 1.68 million │ 2024-12-23 12:37:45+00   │ TMAX    │  2024 │
│ /Users/xevix/Downloads/data/noaa/by_year/YEAR=2024/ELEMENT=TMAX/0a916734b79044f889cda53b000d215d_4.snappy.parquet │ 1.67 million │ 2024-12-23 12:37:45+00   │ TMAX    │  2024 │
│ /Users/xevix/Downloads/data/noaa/by_year/YEAR=2024/ELEMENT=TMAX/0a916734b79044f889cda53b000d215d_5.snappy.parquet │ 1.68 million │ 2024-12-23 12:37:45+00   │ TMAX    │  2024 │
│ /Users/xevix/Downloads/data/noaa/by_year/YEAR=2024/ELEMENT=TMAX/0a916734b79044f889cda53b000d215d_6.snappy.parquet │ 1.71 million │ 2024-12-23 12:37:48+00   │ TMAX    │  2024 │
│ /Users/xevix/Downloads/data/noaa/by_year/YEAR=2024/ELEMENT=TMAX/0a916734b79044f889cda53b000d215d_7.snappy.parquet │       499875 │ 2024-12-23 12:37:48+00   │ TMAX    │  2024 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴──────────────────────────┴─────────┴───────┘

xevix · 2025-08-13T20:40:24Z

@samansmink Do you think it makes sense to add Hive filtering and projections to read_blob and read_text? I have the basic functionality working, but there might be some duplicate code, since I tried to avoid pulling in MultiFileFunction: these methods don't read any actual columns from files, and it could get messy quickly.

Branch for reference: https://github.com/duckdb/duckdb/compare/main...xevix:hive-filtering-read-blob-text?expand=1

5 replies

samansmink Aug 19, 2025
Collaborator

@xevix I think it does make sense! having read_blob and read_text use the MultiFileReader seems like a good feature.

However, I'm not sure I follow why pulling in the MultiFileFunction is a bad idea, or why the read_blob and read_text would be that different from read_parquet. The only differences are that it always produces a single row per file with the same schema and it only contains constant columns and no columns that are fetched from the file.

Ideally I think we'd implement the read_blob and read_text functions as close to read_parquet as possible.

xevix Aug 19, 2025

I think it would be good to have the various read_* methods use the same MultiFileFunction as well, it just seemed like many of the methods relate to having columns (e.g. cardinality), and this also requires implementing a MultiFileReaderInterface.

duckdb/src/include/duckdb/common/multi_file/multi_file_function.hpp

Lines 60 to 76 in a8206a2

    
template <class OP>
class MultiFileFunction : public TableFunction {
public:
	explicit MultiFileFunction(string name_p)
	    : TableFunction(std::move(name_p), {LogicalType::VARCHAR}, MultiFileScan, MultiFileBind, MultiFileInitGlobal,
	                    MultiFileInitLocal) {
		cardinality = MultiFileCardinality;
		table_scan_progress = MultiFileProgress;
		get_partition_data = MultiFileGetPartitionData;
		get_bind_info = MultiFileGetBindInfo;
		projection_pushdown = true;
		pushdown_complex_filter = MultiFileComplexFilterPushdown;
		get_partition_info = MultiFileGetPartitionInfo;
		get_virtual_columns = MultiFileGetVirtualColumns;
		dynamic_to_string = MultiFileDynamicToString;
		MultiFileReader::AddParameters(*this);
	}

Looking at it a bit closer now though it doesn't seem too bad, and would be good to unify things.

Tishj Aug 20, 2025
Collaborator

It might help to use this PR as a reference: duckdb/duckdb-avro#21
The code was previously in avro_extension.cpp, and the logic is kept mostly the same, just split into the various MultiFileFunction callbacks.

xevix Aug 22, 2025

I've kinda got it working, but I'm currently working through making it parallelism-aware. The original code relied on running in a loop on a single thread, but BaseFileReader::Scan() is called multiple times in parallel. I'm trying to decide whether I should introduce a lock or some other mechanism to avoid multiple calls to Scan() trying to process the same file.

MultiFileFunction has a parallel_lock which would appear to lock the file, but the GlobalTableFunctionState passed to Scan() which I'm using to keep track of which file is being processed still gets updated multiple times, indicating some kind of thread issue. I'll keep digging.

https://github.com/duckdb/duckdb/compare/main...xevix:duckdb:hive-filtering-read-blob-text-multi-file-function?expand=1

Duplicates due to processing same file multiple times.

D SELECT * EXCLUDE (content) FROM read_blob('/Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png');
┌────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┬──────────────────────────┐
│                                              filename                                              │  size  │      last_modified       │
│                                              varchar                                               │ int64  │ timestamp with time zone │
├────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┼──────────────────────────┤
│ /Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png │ 349218 │ 2025-08-12 20:10:17+00   │
│ /Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png │ 349218 │ 2025-08-12 20:10:17+00   │
│ /Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png │ 349218 │ 2025-08-12 20:10:17+00   │
│ /Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png │ 349218 │ 2025-08-12 20:10:17+00   │
│ /Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png │ 349218 │ 2025-08-12 20:10:17+00   │
└────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────────┘
D SELECT * EXCLUDE (content) FROM read_blob('/Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png');
┌────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┬──────────────────────────┐
│                                              filename                                              │  size  │      last_modified       │
│                                              varchar                                               │ int64  │ timestamp with time zone │
├────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┼──────────────────────────┤
│ /Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png │ 349218 │ 2025-08-12 20:10:17+00   │
│ /Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png │ 349218 │ 2025-08-12 20:10:17+00   │
│ /Users/xevix/Downloads/data/blob/DT=2025-08-13/FT=SCREENSHOT/Screenshot 2025-08-12 at 13.10.14.png │ 349218 │ 2025-08-12 20:10:17+00   │
└────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────────┘
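
For readers following along, the general way to avoid this kind of duplicate is to have each worker claim a file from shared state under a lock, so that reading and advancing the "next file" position happens as one step. The Python sketch below shows that generic pattern only; it is not the DuckDB code and not necessarily what the follow-up work does, and `GlobalScanState`, `claim_next`, and the file paths are hypothetical names.

```python
import threading


class GlobalScanState:
    """Shared state: hands each file to exactly one worker."""

    def __init__(self, files):
        self.files = files
        self.next_index = 0
        self.lock = threading.Lock()

    def claim_next(self):
        # Read and advance the index under one lock so two workers
        # can never observe the same value.
        with self.lock:
            if self.next_index >= len(self.files):
                return None
            path = self.files[self.next_index]
            self.next_index += 1
            return path


def scan(state, out, out_lock):
    # Each worker keeps claiming files until none are left.
    while (path := state.claim_next()) is not None:
        with out_lock:
            out.append(path)


files = [f"/data/blob/DT=2025-08-13/file_{i}.png" for i in range(100)]
state = GlobalScanState(files)
rows, rows_lock = [], threading.Lock()
workers = [
    threading.Thread(target=scan, args=(state, rows, rows_lock)) for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

With the claim done atomically, every file appears once in `rows` no matter how many workers run, although the order of rows is not deterministic.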

xevix Aug 22, 2025

Continuing in #18706

mitstake · 2025-08-29T08:26:16Z

Author

Thank you @xevix for adding support for this!

0 replies
