feat(c/driver/hologres): add Hologres ADBC driver #4266
Add a new independent ADBC driver for Hologres with stub implementations of Database, Connection, and Statement classes. The driver compiles as a standalone library (adbc_driver_hologres) without modifying the existing PostgreSQL driver. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…river Copy error handling, result helpers, type system, bind stream, result reader, and COPY protocol files from the PostgreSQL driver. These files retain the adbcpq namespace and are compiled as part of the Hologres driver library to avoid modifying the PostgreSQL driver. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction Add real connection management (PGconn), Hologres version detection via SELECT hg_version(), PostgreSQL type resolver, and MakeFixedFeUri() for Stage mode FixedFE connections. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ueries Implement the full HologresConnection class with Hologres-specific behavior:
- GetInfo returns Hologres vendor name and parsed version
- Commit/Rollback return NOT_IMPLEMENTED (Hologres is always autocommit)
- SetOption rejects disabling autocommit
- GetObjects/GetTableSchema/GetTableTypes use PG-compatible system catalog queries
- HologresGetObjectsHelper for metadata enumeration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion and COPY ingest Complete HologresStatement implementation:
- TupleReader for streaming query results via COPY protocol
- SQL query execution with COPY and PqResultArrayReader paths
- Parameter binding (Bind/BindStream) and ExecuteBind
- ExecuteSchema for schema inference
- Bulk ingest via COPY FROM STDIN with STREAM_MODE and ON_CONFLICT
- CreateBulkTable with CREATE/APPEND/REPLACE/CREATE_APPEND modes
- Hologres-specific options: on_conflict, ingest_mode, batch_size_hint
- OnConflictMode and HologresIngestMethod enums
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements Hologres internal Stage ingestion pathway:
- StageConnection: thread-safe libpq wrapper for Stage COPY operations
- StageWriter: parallel Arrow IPC upload with FSL→LIST conversion
- ExecuteIngestStage: orchestrates FixedFE + regular FE connections
- Vendor nanoarrow IPC support (flatcc + IPC encoder/decoder)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
70 tests covering BufferQueue thread safety, Stage create/drop/upload, Arrow IPC serialization (int64, string, boolean, date32, binary, list, FSL→LIST conversion with slicing), mock-based ingestion flow, and Hologres option enums. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iver Provides adbc-driver-hologres Python package with:
- HologresOnConflict and HologresIngestMode enums
- StatementOptions for driver-specific configuration
- connect() function for low-level ADBC access
- DBAPI 2.0 compatible interface via dbapi module
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 7 new test files covering database, connection, statement, postgres_util, postgres_type, copy reader, and error modules (~170 tests). Tests cover pure functions, option handling, type mapping, and COPY binary parsing without requiring a live database connection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ASV benchmarks for query fetch and bulk ingestion performance, comparing ADBC (COPY/Stage modes with ON_CONFLICT variants) against asyncpg, psycopg2, and DuckDB. Includes vector ingestion benchmarks for high-dimensional FLOAT4[] data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e ON_CONFLICT UPDATE Add 32 integration tests covering connection metadata, queries, COPY/Stage ingestion, ON_CONFLICT modes, batch size hints, and statistics. Tests run against a live Hologres instance via ADBC_HOLOGRES_TEST_URI. Fix a bug in Stage mode where ON_CONFLICT UPDATE was not implemented: InsertFromStage now queries pg_index for primary key columns and generates the proper ON CONFLICT (pk) DO UPDATE SET clause. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
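The upsert clause described in the fix above can be sketched roughly as follows. This is a Python illustration, not the driver's C++ code: `quote_ident` and `build_upsert_clause` are hypothetical names, and the use of `EXCLUDED` to refer to the incoming row is standard PostgreSQL syntax that is assumed here to apply to Hologres as well.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_upsert_clause(columns, pk_columns):
    """Build an ON CONFLICT clause keyed on the primary key columns.

    `columns` are all target columns; `pk_columns` stand in for what a
    pg_index lookup would return (hypothetical inputs for this sketch).
    """
    if not pk_columns:
        raise ValueError("ON CONFLICT UPDATE needs at least one primary key column")
    target = ", ".join(quote_ident(c) for c in pk_columns)
    # Only non-key columns are overwritten on conflict.
    updates = [c for c in columns if c not in pk_columns]
    if not updates:
        return f"ON CONFLICT ({target}) DO NOTHING"
    assignments = ", ".join(
        f"{quote_ident(c)} = EXCLUDED.{quote_ident(c)}" for c in updates
    )
    return f"ON CONFLICT ({target}) DO UPDATE SET {assignments}"
```

For example, `build_upsert_clause(["id", "v"], ["id"])` yields `ON CONFLICT ("id") DO UPDATE SET "v" = EXCLUDED."v"`.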
…large_binary, large_string Hologres Stage/EXTERNAL_FILES does not natively support Arrow timestamp[us, tz=*], large_binary, or large_string types. Add conversion logic that transforms these types before IPC serialization and restores them afterward:
- TIMESTAMPTZ: timestamp[us, tz=*] → date64[ms] (divide microseconds by 1000)
- BYTEA: large_binary → binary (narrow int64 offsets to int32)
- TEXT: large_string → string/utf8 (narrow int64 offsets to int32)
Also add comprehensive integration tests covering temporal, numeric, JSON, binary, list, dictionary, and operational scenarios for both COPY and Stage ingest modes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
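The two kinds of conversion named above (offset narrowing and microsecond-to-millisecond scaling) can be illustrated with a small stdlib-only sketch. The function names are hypothetical, the offsets are assumed to be little-endian as in Arrow buffers, and the floor division used for pre-epoch values is an assumption, since the commit message does not say how the driver rounds.

```python
import struct

INT32_MAX = 2**31 - 1


def narrow_offsets(offsets64: bytes) -> bytes:
    """Re-encode little-endian int64 offsets as int32, refusing to overflow."""
    if len(offsets64) % 8 != 0:
        raise ValueError("offset buffer length must be a multiple of 8")
    count = len(offsets64) // 8
    values = struct.unpack(f"<{count}q", offsets64)
    # Offsets are non-decreasing, but check the maximum to be safe.
    if values and max(values) > INT32_MAX:
        raise OverflowError("data too large for 32-bit offsets")
    return struct.pack(f"<{count}i", *values)


def micros_to_millis(values):
    """Scale microsecond timestamps to milliseconds, keeping nulls as None."""
    return [None if v is None else v // 1000 for v in values]
```

The overflow check matters because a large_string or large_binary column may legitimately hold more than 2 GiB of data, which 32-bit offsets cannot address.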
… COPY and Stage modes Add test_ingest_list_float4 for COPY mode and test_stage_list_float_types for Stage mode to cover FLOAT4[] array type round-trip, complementing the existing FLOAT8[] tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The COPY writer was writing JSONB values as plain strings, but the
PostgreSQL binary protocol requires a 0x01 version byte prefix for
JSONB. This caused "unsupported jsonb version number 123" errors
(the first JSON character '{' = 0x7B being read as version byte).
Add PostgresCopyJsonbFieldWriter that prepends the 0x01 version byte,
and modify MakeCopyFieldWriter to accept PostgresType so it can select
the JSONB writer when the target column type is kJsonb. The target
table column types are resolved via pg_attribute before entering COPY
mode.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
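A minimal sketch of the JSONB framing described above, written in Python rather than the driver's C++. In the PostgreSQL binary COPY format each field is a 4-byte big-endian length followed by the payload, and a JSONB payload starts with the version byte 0x01. The function names are hypothetical.

```python
import struct

JSONB_VERSION = 1  # version byte expected at the start of a binary JSONB value


def encode_text_field(text: str) -> bytes:
    """Encode a plain text field: int32 length, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack("!i", len(data)) + data


def encode_jsonb_field(json_text: str) -> bytes:
    """Encode a JSONB field: int32 length, version byte, then the JSON text."""
    payload = bytes([JSONB_VERSION]) + json_text.encode("utf-8")
    return struct.pack("!i", len(payload)) + payload
```

With the plain text encoding, the first payload byte of `{"a": 1}` is `{` (0x7B, decimal 123), which matches the "unsupported jsonb version number 123" error quoted in the commit message.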
… 77.0.1 setuptools 77.0.0 was yanked from PyPI, causing ASV environment creation to fail. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on> on connect Automatically inject application_name=adbc_hologres_<version> into the connection URI so Hologres can identify ADBC driver connections in pg_stat_activity. User-specified application_name is preserved. Also fix ADBC_INFO_DRIVER_VERSION to return the actual version instead of "unknown". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
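The injection behaviour described above could look roughly like this in Python. This is a sketch only: `with_application_name` is a hypothetical helper, it assumes a URI-style connection string (not the keyword/value form), and re-encoding the query string may normalize the escaping of other parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_application_name(uri: str, driver_version: str) -> str:
    """Append application_name=adbc_hologres_<version> unless the user set one."""
    parts = urlsplit(uri)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "application_name" for key, _ in query):
        return uri  # preserve the user-specified value untouched
    query.append(("application_name", f"adbc_hologres_{driver_version}"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_application_name("postgresql://u@h:80/db", "1.0.0")` returns `postgresql://u@h:80/db?application_name=adbc_hologres_1.0.0`, while a URI that already carries `application_name=myapp` comes back unchanged.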
- Fix strncmp(sqlstate, "42", 0) always matching (should be length 2)
- Fix SQL injection in Stage PK query by using PQexecParams
- Add null checks for PQescapeIdentifier return values
- Add Hologres >= 4.1 version gate for Stage ingestion mode
- Ensure DropStage cleanup on all error paths after CreateStage
- Remove redundant 3x pg_type query execution in RebuildTypeResolver
- Make open_connections_ atomic to prevent data races
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
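The first fix in the list above is easy to misread, so here is a small Python emulation of why a comparison length of 0 always matches. `strncmp_equal` is a hypothetical stand-in for `strncmp(a, b, n) == 0` on strings without embedded NUL bytes; it is not driver code.

```python
def strncmp_equal(a: str, b: str, n: int) -> bool:
    """Mimic `strncmp(a, b, n) == 0`: compare at most the first n characters."""
    return a[:n] == b[:n]


def is_class_42(sqlstate: str) -> bool:
    """True when the SQLSTATE belongs to class 42 (syntax error or access rule violation)."""
    return strncmp_equal(sqlstate, "42", 2)
```

With `n = 0` both slices are empty, so `strncmp_equal("08006", "42", 0)` is true for any SQLSTATE; with `n = 2` only codes such as `42P01` match.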
…aringbitmap in Stage mode Query actual column types from pg_catalog.pg_attribute using format_type() and rebuild the EXTERNAL_FILES AS clause with correct type declarations. For types that EXTERNAL_FILES cannot auto-cast (json, jsonb, roaringbitmap), use explicit SELECT casts instead of SELECT *. Also fix std::atomic<int32_t> copy issue in Database::Release() by adding .load() for variadic printf args. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…4 formats Add ArrowCopyReader that decodes Hologres Arrow IPC streams wrapped in PG binary COPY framing, supporting both uncompressed (arrow) and LZ4-block-compressed (arrow_lz4) formats. Key Hologres compatibility workarounds: bypass flatcc verification for pre-1.0 IPC messages, ignore false LZ4_FRAME body compression declarations in RecordBatch metadata, and add LZ4 block fallback in nanoarrow codecs for implementations that report LZ4_FRAME but send block-compressed data. Includes parameterized ExecuteCopy, CopyFormat statement option, JSONB rejection for arrow formats, and 9 integration tests covering 23 types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
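For readers unfamiliar with the PG binary COPY framing mentioned above, the following Python sketch parses the standard header and tuple layout (11-byte signature, 32-bit flags, 32-bit header extension length, then tuples of length-prefixed fields ending with a field count of -1). How Hologres places Arrow IPC bytes inside those fields is not described in the commit message, so the sketch stops at the generic framing; the function names are hypothetical.

```python
import struct

PG_COPY_SIGNATURE = b"PGCOPY\n\xff\r\n\x00"


def parse_copy_header(buf: bytes) -> int:
    """Validate the binary COPY header and return the offset of the first tuple."""
    sig_len = len(PG_COPY_SIGNATURE)
    if buf[:sig_len] != PG_COPY_SIGNATURE:
        raise ValueError("not a PostgreSQL binary COPY stream")
    if len(buf) < sig_len + 8:
        raise ValueError("truncated COPY header")
    _flags, ext_len = struct.unpack_from("!II", buf, sig_len)
    start = sig_len + 8 + ext_len
    if len(buf) < start:
        raise ValueError("truncated header extension")
    return start


def read_tuple(buf: bytes, offset: int):
    """Read one tuple; return (fields, next_offset), with fields=None at the trailer."""
    (field_count,) = struct.unpack_from("!h", buf, offset)
    offset += 2
    if field_count == -1:
        return None, offset  # end-of-data marker
    fields = []
    for _ in range(field_count):
        (length,) = struct.unpack_from("!i", buf, offset)
        offset += 4
        if length == -1:
            fields.append(None)  # SQL NULL
        else:
            fields.append(buf[offset:offset + length])
            offset += length
    return fields, offset
```

A reader for the arrow formats would presumably concatenate the field payloads and hand them to an IPC decoder, but that step is an assumption and is left out here.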
…at benchmarks Add time_pandas_adbc_arrow and time_pandas_adbc_arrow_lz4 benchmarks to HologresBenchmarkBase, enabling side-by-side read performance comparison of binary, arrow, and arrow_lz4 COPY formats across OneColumn and MultiColumn suites. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ~120 offline unit tests across all testable modules to achieve 90%+ coverage. New test files: copy_writer_test.cc (55 tests for all COPY writer types), arrow_copy_reader_test.cc (26 tests for IPC decoding, LZ4 decompression, and stream trampolines), bind_stream_test.cc (15 tests for bind/iterate lifecycle). Extend existing tests for statement, connection, database, error, copy reader, and postgres_type modules. Extract shared MockTypeResolver into test_util.h. Fix null pointer dereference in ArrowCopyReader::ReleaseTrampoline when called with self=nullptr. Fix incorrect test expectation for PostgreSQL interval type: Arrow format should be "tin" (interval_month_day_nano), not "tDn" (duration_nanosecond). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… benchmarks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion, and improve RAII safety
- Fix O(n^2) digit insertion in Decimal COPY writer by using push_back + reverse
- Eliminate per-row heap allocations in numeric serialization via member variable reuse
- Add upfront ArrowBufferReserve in tuple writer and list writer for fewer allocation checks
- Add digits_.reserve() and early-return for special numeric values in COPY reader
- Replace memmove+resize pattern in ArrowCopyReader::Compact()
- Add scope guards to SerializeArrayToIpcBuffer for safer multi-resource cleanup
- Extract GetCurrentSchema() helper to replace 3 duplicate query blocks
- Add PqEscapedString RAII wrapper to replace 10 manual PQescapeIdentifier+PQfreemem sites
- Extract JoinUploadThreads() to replace 12 duplicate thread-join loops in stage_writer
- Consolidate MockTypeResolver, CopyReaderTester, CopyWriterTester into test_util.h
- Unify PG COPY binary signature constant via copy_common.h
- Remove redundant HologresVersion() in favor of VendorVersion()
- Name magic numbers in HologresStageConfig with constexpr constants
- Replace stringstream with string_view loop in ParseTextArray
- Promote is_null_param to BindStream member to avoid per-row vector allocation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
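The first item above (append then reverse, instead of inserting at the front) can be sketched in Python. The sketch assumes the digits in question are the base-10000 digits used by the PostgreSQL binary numeric format; `base10000_digits` is a hypothetical name and the driver's actual code is C++.

```python
def base10000_digits(value: int) -> list:
    """Split a non-negative integer into base-10000 digits, most significant first."""
    if value < 0:
        raise ValueError("sign is handled separately in this sketch")
    digits = []
    while value > 0:
        value, digit = divmod(value, 10000)
        digits.append(digit)  # O(1) amortized, least significant digit first
    digits.reverse()          # one O(n) pass instead of n front insertions
    return digits
```

Inserting each digit at index 0 shifts every existing element, which is what makes the naive version quadratic in the number of digits.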
…s not exist" errors Decouple table creation (DDL) from data generation in ingest benchmark suites by introducing setup_cache(). Previously, setup() generated potentially huge vector data before creating tables — if data generation failed (OOM/memory pressure), tables were never created, and teardown() dropped all tables after each benchmark method, compounding the issue. Now setup_cache() creates all empty tables once upfront, setup() only handles data generation and connections, and teardown() frees memory via gc.collect() instead of dropping tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd add comprehensive unit tests Add JSONB support for Arrow/Arrow_LZ4 COPY read path by wrapping queries to cast JSONB columns to TEXT, matching the approach used by Java holo-client. Also add comprehensive unit tests across multiple modules to improve coverage. Key changes:
- Add BuildJsonbWrapperQuery() to construct wrapper queries that cast JSONB columns to ::text with proper identifier escaping
- Replace JSONB blocking logic in ExecuteQuery with transparent cast wrapping
- Add result_helper_test.cc (22 tests for PqRecord parsing)
- Expand connection_test.cc, copy_test.cc, postgres_type_test.cc, postgres_util_test.cc, and statement_test.cc with additional test cases
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…row counts Skip TEXT[] at 10M rows in HologresMultiColumnSuite to prevent OOM that kills the setup_cache subprocess and skips all benchmarks. Limit COPY binary vector ingest benchmarks to 1M rows max since COPY builds the entire data stream in memory; Stage mode benchmarks remain at 10M rows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to arrow_lz4 Benchmarks show arrow_lz4 outperforms binary in nearly all read scenarios (e.g. 10M-row single INT column: 358ms vs 2.72s, ~7.6x faster), while LZ4 compression also reduces network transfer size. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(c/driver/hologres): add Hologres ADBC driver
Add comprehensive documentation covering architecture, development workflow, testing, features, configuration reference, usage examples across Python/Java/C, known limitations, and release process. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Do you use this library internally already? It may help the reviewers if it is a bit battle-tested.
Assume this is a question from someone who doesn't know the database: isn't Hologres PSQL compliant?
Hi, @xborder |
Hi, @metegenez |
Good to hear. We are doing Arrow work at Huawei, too. Would love to help on the review, but I'm still learning ADBC internals myself. Good luck with the PR! Btw, is this an open source DB or closed source like GaussDB?
Nice to meet you, @metegenez! My email is ding_ye_timo@163.com
@lidavidm Could you help me to review it? |
Summary
Add a new ADBC driver for Hologres, Alibaba Cloud's real-time data warehouse service built on PostgreSQL. This driver enables high-performance columnar data access to Hologres through the standard ADBC interface.
Components
C Driver (c/driver/hologres/): ~20K lines of new code
- HologresDatabase: Connection management with automatic Hologres/PostgreSQL version detection and type resolver initialization
- HologresConnection: Full ADBC metadata API (GetInfo, GetObjects, GetTableSchema, GetTableTypes, GetStatistics)
- HologresStatement: Query execution, parameterized queries, and two bulk ingestion paths:
  - COPY FROM STDIN binary protocol
- ArrowCopyReader: Reads query results via COPY TO STDOUT in Arrow IPC format (arrow or arrow_lz4), bypassing row-by-row binary parsing for significantly better read performance
- TupleReader: Reads query results via standard PostgreSQL binary COPY TO STDOUT with nanoarrow-based batch assembly
- IGNORE (skip conflicts) and UPDATE (upsert) modes for both COPY and Stage ingestion
- application_name tagging (adbc_hologres_<version>) for server-side observability

Hologres-specific data type support:

Vendored dependency (c/vendor/nanoarrow/): nanoarrow IPC
- ArrowCopyReader (reading Arrow IPC from COPY protocol) and StageWriter (serializing Arrow batches for Stage upload)

Python package (python/adbc_driver_hologres/): ~3.6K lines
- adbc_driver_hologres: Python bindings with DBAPI 2.0 support via adbc_driver_manager
- HologresOnConflict, HologresIngestMode, StatementOptions

Build system:
- ADBC_DRIVER_HOLOGRES option
- adbc-driver-hologres.pc

Key design decisions
- Forked from PostgreSQL driver: Core PostgreSQL utilities (postgres_type.h, copy/reader.h, copy/writer.h, etc.) are copied into the Hologres driver rather than shared, to allow independent evolution for Hologres-specific type handling (JSONB version byte, roaringbitmap, etc.)
- Default COPY read format is arrow_lz4: Hologres supports native Arrow IPC output in its COPY protocol. The arrow_lz4 format avoids row-by-row binary parsing and leverages LZ4 compression, providing better throughput for analytical queries. Users can fall back to the standard binary format via the adbc.hologres.copy_format option.
- Stage ingestion for large datasets: The Stage writer serializes Arrow batches into IPC format, uploads them via dedicated FixedFE connections with configurable concurrency (default: 4 threads), and commits atomically. This path is optimized for bulk loading scenarios where COPY throughput is insufficient.
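The Stage flow described above (parallel upload, then a single commit) can be outlined in Python. This is only a shape sketch under stated assumptions: `upload`, `commit`, and `cleanup` are hypothetical callables standing in for the FixedFE upload, the final insert from the stage, and dropping the stage; whether the driver drops the stage on success as well as on failure is assumed here.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_UPLOAD_THREADS = 4  # default concurrency stated in the PR description


def stage_ingest(batches, upload, commit, cleanup, threads=DEFAULT_UPLOAD_THREADS):
    """Upload all batches in parallel, commit once, and always clean up."""
    try:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # list() waits for every upload and re-raises the first failure,
            # so commit is never reached with a partially uploaded stage.
            uploaded = list(pool.map(upload, batches))
        commit(uploaded)
        return len(uploaded)
    finally:
        cleanup()
```

Committing only after every upload has succeeded is what makes the load all-or-nothing from the caller's point of view.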
Testing
binary, arrow, arrow_lz4) and write (COPY, Stage) performance at various row counts (1K–10M)
Configuration options
| Option | Values | Default |
|---|---|---|
| adbc.hologres.copy_format | binary, arrow, arrow_lz4 | arrow_lz4 |
| adbc.hologres.ingest_mode | copy, stage | copy |
| adbc.hologres.use_copy | true, false | true |
| adbc.hologres.on_conflict | none, ignore, update | none |
| adbc.hologres.batch_size_hint_bytes | | 16777216 |

Test plan
- cd build && ctest --test-dir . -R hologres
- cd python/adbc_driver_hologres && pytest tests/
- -DADBC_DRIVER_HOLOGRES=ON
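As an illustration of the configuration options listed above, the following Python sketch merges user overrides with the stated defaults and rejects values outside the listed sets. The option names, values, and defaults come from the PR description; the `resolve_options` helper itself is hypothetical, and the positivity check on the batch size is an assumption.

```python
# Allowed values and defaults as listed in the PR description.
ALLOWED_VALUES = {
    "adbc.hologres.copy_format": ("binary", "arrow", "arrow_lz4"),
    "adbc.hologres.ingest_mode": ("copy", "stage"),
    "adbc.hologres.use_copy": ("true", "false"),
    "adbc.hologres.on_conflict": ("none", "ignore", "update"),
}
DEFAULTS = {
    "adbc.hologres.copy_format": "arrow_lz4",
    "adbc.hologres.ingest_mode": "copy",
    "adbc.hologres.use_copy": "true",
    "adbc.hologres.on_conflict": "none",
    "adbc.hologres.batch_size_hint_bytes": "16777216",
}


def resolve_options(overrides=None):
    """Return the effective option map (hypothetical helper, not driver API)."""
    options = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown option: {key}")
        allowed = ALLOWED_VALUES.get(key)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key} must be one of {allowed}, got {value!r}")
        # Assumption: the size hint must be a positive integer.
        if key == "adbc.hologres.batch_size_hint_bytes" and int(value) <= 0:
            raise ValueError(f"{key} must be positive")
        options[key] = str(value)
    return options
```

Calling `resolve_options({"adbc.hologres.ingest_mode": "stage"})` keeps `arrow_lz4` as the read format while switching ingestion to Stage mode.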