feat(c/driver/hologres): add Hologres ADBC driver#4266

Closed
TimothyDing wants to merge 32 commits into apache:main from TimothyDing:main

Conversation

@TimothyDing

Summary

Add a new ADBC driver for Hologres, Alibaba Cloud's real-time data warehouse service built on PostgreSQL. This driver enables high-performance columnar data access to Hologres through the standard ADBC interface.

Components

C Driver (c/driver/hologres/) — ~20K lines of new code

  • HologresDatabase: Connection management with automatic Hologres/PostgreSQL version detection and type resolver initialization
  • HologresConnection: Full ADBC metadata API (GetInfo, GetObjects, GetTableSchema, GetTableTypes, GetStatistics)
  • HologresStatement: Query execution, parameterized queries, and two bulk ingestion paths:
    • COPY mode (default): Standard PostgreSQL COPY FROM STDIN binary protocol
    • Stage mode: Hologres-native stage-based ingestion via Arrow IPC upload with configurable concurrency, batch sizing, and file targeting
  • ArrowCopyReader: Reads query results via COPY TO STDOUT in Arrow IPC format (arrow or arrow_lz4), bypassing row-by-row binary parsing for significantly better read performance
  • TupleReader: Reads query results via standard PostgreSQL binary COPY TO STDOUT with nanoarrow-based batch assembly
  • ON_CONFLICT support: IGNORE (skip conflicts) and UPDATE (upsert) modes for both COPY and Stage ingestion
  • Automatic application_name tagging (adbc_hologres_<version>) for server-side observability

Hologres-specific data type support:

  • Standard PostgreSQL types: bool, int2/4/8, float4/8, numeric, text, bytea, date, time, timestamp, timestamptz, interval, uuid
  • Array types: int2[], int4[], int8[], float4[], float8[], bool[], text[], bytea[]
  • Extended types: JSON, JSONB (with version byte prefix), CHAR(n), VARCHAR(n), roaringbitmap
  • Type conversions for Stage mode: timestamptz, large_binary, large_string

Vendored dependency (c/vendor/nanoarrow/) — nanoarrow IPC

  • Vendored nanoarrow IPC reader/writer and flatcc runtime for Arrow IPC serialization/deserialization, used by both the ArrowCopyReader (reading Arrow IPC from COPY protocol) and StageWriter (serializing Arrow batches for Stage upload)

Python package (python/adbc_driver_hologres/) — ~3.6K lines

  • adbc_driver_hologres: Python bindings with DBAPI 2.0 support via adbc_driver_manager
  • Enums: HologresOnConflict, HologresIngestMode, StatementOptions
  • Integration tests covering COPY and Stage modes across all supported types
  • ASV benchmark suites for read/write performance profiling

Build system:

  • CMake integration with ADBC_DRIVER_HOLOGRES option
  • pkg-config support (adbc-driver-hologres.pc)
  • Python setuptools with shared library bundling

Key design decisions

  1. Forked from PostgreSQL driver: Core PostgreSQL utilities (postgres_type.h, copy/reader.h, copy/writer.h, etc.) are copied into the Hologres driver rather than shared, to allow independent evolution for Hologres-specific type handling (JSONB version byte, roaringbitmap, etc.)

  2. Default COPY read format is arrow_lz4: Hologres supports native Arrow IPC output in its COPY protocol. The arrow_lz4 format avoids row-by-row binary parsing and applies LZ4 compression, which improves throughput for analytical queries. The standard binary format remains selectable via the adbc.hologres.copy_format option.

  3. Stage ingestion for large datasets: The Stage writer serializes Arrow batches into IPC format, uploads them via dedicated FixedFE connections with configurable concurrency (default: 4 threads), and commits atomically. This path is optimized for bulk loading scenarios where COPY throughput is insufficient.
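
The Stage flow in item 3 can be sketched roughly as follows. This is a minimal Python model, not the driver's C++ code: `upload`, `commit`, and `cleanup` are hypothetical callbacks standing in for the FixedFE upload, the final commit, and the stage drop, and the assumption that cleanup runs on success as well as on failure is mine.

```python
from concurrent.futures import ThreadPoolExecutor


def stage_ingest(batches, upload, commit, cleanup, max_workers=4):
    """Upload batches in parallel, commit once, and always clean up.

    upload, commit, and cleanup are hypothetical callbacks; max_workers=4
    mirrors the default concurrency quoted in the description.
    """
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order and re-raises the first failure.
            uploaded = list(pool.map(upload, batches))
        # Commit only after every upload has succeeded.
        commit(uploaded)
        return len(uploaded)
    finally:
        # Assumption: the stage is dropped on success and on every error path.
        cleanup()
```

Because the commit happens once, after all uploads, a failed upload leaves nothing committed.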

Testing

  • C unit tests (~8.6K lines): Comprehensive coverage for all modules — database, connection, statement, COPY reader/writer, Arrow COPY reader, Stage writer, bind stream, error handling, PostgreSQL type resolver, and utility functions
  • Python integration tests (~2.2K lines): End-to-end tests covering DBAPI 2.0 compliance, COPY/Stage ingestion for all supported types, ON_CONFLICT modes, and edge cases
  • Python benchmarks: ASV benchmark suites for read (binary, arrow, arrow_lz4) and write (COPY, Stage) performance at various row counts (1K–10M)

Configuration options

| Option | Values | Default | Description |
| --- | --- | --- | --- |
| adbc.hologres.copy_format | binary, arrow, arrow_lz4 | arrow_lz4 | COPY TO STDOUT read format |
| adbc.hologres.ingest_mode | copy, stage | copy | Bulk ingestion method |
| adbc.hologres.use_copy | true, false | true | Enable COPY optimization for ingestion |
| adbc.hologres.on_conflict | none, ignore, update | none | Conflict resolution for ingestion |
| adbc.hologres.batch_size_hint_bytes | integer | 16777216 | Target batch size hint for reads |
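
As an illustration of how these options and defaults combine, here is a small Python sketch. The option names, values, and defaults are copied from the table. The `resolve_options` helper is hypothetical and is not part of the driver or the Python package, and the positive-integer requirement for the batch size hint is an assumption.

```python
# Allowed values and defaults transcribed from the configuration table.
ALLOWED = {
    "adbc.hologres.copy_format": {"binary", "arrow", "arrow_lz4"},
    "adbc.hologres.ingest_mode": {"copy", "stage"},
    "adbc.hologres.use_copy": {"true", "false"},
    "adbc.hologres.on_conflict": {"none", "ignore", "update"},
}
DEFAULTS = {
    "adbc.hologres.copy_format": "arrow_lz4",
    "adbc.hologres.ingest_mode": "copy",
    "adbc.hologres.use_copy": "true",
    "adbc.hologres.on_conflict": "none",
    "adbc.hologres.batch_size_hint_bytes": "16777216",
}


def resolve_options(overrides):
    """Hypothetical helper: merge user overrides onto the documented defaults."""
    options = dict(DEFAULTS)
    for key, value in overrides.items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown option: {key}")
        value = str(value)
        if key in ALLOWED and value not in ALLOWED[key]:
            raise ValueError(f"{key} must be one of {sorted(ALLOWED[key])}")
        # Assumption: the size hint must be a positive integer.
        if key == "adbc.hologres.batch_size_hint_bytes" and int(value) <= 0:
            raise ValueError(f"{key} must be a positive integer")
        options[key] = value
    return options
```

For example, `resolve_options({"adbc.hologres.copy_format": "binary"})` switches reads to the standard binary format and leaves every other option at its default.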

Test plan

  • C unit tests pass: cd build && ctest --test-dir . -R hologres
  • Python integration tests pass against a live Hologres instance: cd python/adbc_driver_hologres && pytest tests/
  • Build succeeds with -DADBC_DRIVER_HOLOGRES=ON
  • Python package installs and connects successfully

TimothyDing and others added 30 commits April 18, 2026 22:22
Add a new independent ADBC driver for Hologres with stub implementations
of Database, Connection, and Statement classes. The driver compiles as a
standalone library (adbc_driver_hologres) without modifying the existing
PostgreSQL driver.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…river

Copy error handling, result helpers, type system, bind stream, result
reader, and COPY protocol files from the PostgreSQL driver. These files
retain the adbcpq namespace and are compiled as part of the Hologres
driver library to avoid modifying the PostgreSQL driver.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction

Add real connection management (PGconn), Hologres version detection via
SELECT hg_version(), PostgreSQL type resolver, and MakeFixedFeUri() for
Stage mode FixedFE connections.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ueries

Implement the full HologresConnection class with Hologres-specific behavior:
- GetInfo returns Hologres vendor name and parsed version
- Commit/Rollback return NOT_IMPLEMENTED (Hologres is always autocommit)
- SetOption rejects disabling autocommit
- GetObjects/GetTableSchema/GetTableTypes use PG-compatible system catalog queries
- HologresGetObjectsHelper for metadata enumeration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion and COPY ingest

Complete HologresStatement implementation:
- TupleReader for streaming query results via COPY protocol
- SQL query execution with COPY and PqResultArrayReader paths
- Parameter binding (Bind/BindStream) and ExecuteBind
- ExecuteSchema for schema inference
- Bulk ingest via COPY FROM STDIN with STREAM_MODE and ON_CONFLICT
- CreateBulkTable with CREATE/APPEND/REPLACE/CREATE_APPEND modes
- Hologres-specific options: on_conflict, ingest_mode, batch_size_hint
- OnConflictMode and HologresIngestMethod enums

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements Hologres internal Stage ingestion pathway:
- StageConnection: thread-safe libpq wrapper for Stage COPY operations
- StageWriter: parallel Arrow IPC upload with FSL→LIST conversion
- ExecuteIngestStage: orchestrates FixedFE + regular FE connections
- Vendor nanoarrow IPC support (flatcc + IPC encoder/decoder)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
70 tests covering BufferQueue thread safety, Stage create/drop/upload,
Arrow IPC serialization (int64, string, boolean, date32, binary, list,
FSL→LIST conversion with slicing), mock-based ingestion flow, and
Hologres option enums.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iver

Provides adbc-driver-hologres Python package with:
- HologresOnConflict and HologresIngestMode enums
- StatementOptions for driver-specific configuration
- connect() function for low-level ADBC access
- DBAPI 2.0 compatible interface via dbapi module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 7 new test files covering database, connection, statement,
postgres_util, postgres_type, copy reader, and error modules (~170 tests).
Tests cover pure functions, option handling, type mapping, and COPY binary
parsing without requiring a live database connection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ASV benchmarks for query fetch and bulk ingestion performance,
comparing ADBC (COPY/Stage modes with ON_CONFLICT variants) against
asyncpg, psycopg2, and DuckDB. Includes vector ingestion benchmarks
for high-dimensional FLOAT4[] data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e ON_CONFLICT UPDATE

Add 32 integration tests covering connection metadata, queries, COPY/Stage
ingestion, ON_CONFLICT modes, batch size hints, and statistics. Tests run
against a live Hologres instance via ADBC_HOLOGRES_TEST_URI.

Fix a bug in Stage mode where ON_CONFLICT UPDATE was not implemented:
InsertFromStage now queries pg_index for primary key columns and generates
the proper ON CONFLICT (pk) DO UPDATE SET clause.
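
A rough Python sketch of the clause generation described above, assuming PostgreSQL-style `EXCLUDED` references and double-quote identifier escaping. The function names are hypothetical, and the driver's actual C++ code (which looks up the key columns in pg_index) may differ.

```python
def quote_ident(name):
    """Quote an SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_on_conflict_update(columns, pk_columns):
    """Build an upsert clause from a table's columns and primary key columns."""
    if not pk_columns:
        raise ValueError("ON CONFLICT UPDATE needs a primary key")
    pk_set = set(pk_columns)
    target = ", ".join(quote_ident(c) for c in pk_columns)
    non_pk = [c for c in columns if c not in pk_set]
    if not non_pk:
        # Assumption: with nothing left to update, fall back to DO NOTHING.
        return f"ON CONFLICT ({target}) DO NOTHING"
    assignments = ", ".join(
        f"{quote_ident(c)} = EXCLUDED.{quote_ident(c)}" for c in non_pk
    )
    return f"ON CONFLICT ({target}) DO UPDATE SET {assignments}"
```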

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…large_binary, large_string

Hologres Stage/EXTERNAL_FILES does not natively support Arrow timestamp[us, tz=*],
large_binary, or large_string types. Add conversion logic that transforms these types
before IPC serialization and restores them afterward:
- TIMESTAMPTZ: timestamp[us, tz=*] → date64[ms] (divide microseconds by 1000)
- BYTEA: large_binary → binary (narrow int64 offsets to int32)
- TEXT: large_string → string/utf8 (narrow int64 offsets to int32)
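
The two kinds of conversion listed above can be modelled in a few lines of Python. This is only a sketch on plain lists and little-endian byte buffers, not the nanoarrow-based code in the driver. The function names are hypothetical, and the floor-division rounding for pre-epoch timestamps is an assumption.

```python
import struct

INT32_MAX = 2**31 - 1


def timestamps_us_to_ms(values_us):
    """Convert microsecond timestamps to milliseconds, keeping nulls (None).

    Assumption: floor division, so pre-epoch values round toward -infinity.
    """
    return [None if v is None else v // 1000 for v in values_us]


def narrow_offsets(offsets64):
    """Re-encode a little-endian int64 offsets buffer as int32 offsets."""
    count = len(offsets64) // 8
    values = struct.unpack(f"<{count}q", offsets64)
    # Offsets are non-decreasing, so checking the last one is sufficient.
    if values and values[-1] > INT32_MAX:
        raise OverflowError("data too large for 32-bit offsets")
    return struct.pack(f"<{count}i", *values)
```

Narrowing can only succeed when the total data size fits in 32-bit offsets, which is why the sketch checks the final offset before re-encoding.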

Also add comprehensive integration tests covering temporal, numeric, JSON, binary,
list, dictionary, and operational scenarios for both COPY and Stage ingest modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… COPY and Stage modes

Add test_ingest_list_float4 for COPY mode and test_stage_list_float_types
for Stage mode to cover FLOAT4[] array type round-trip, complementing the
existing FLOAT8[] tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The COPY writer was writing JSONB values as plain strings, but the
PostgreSQL binary protocol requires a 0x01 version byte prefix for
JSONB. This caused "unsupported jsonb version number 123" errors
(the first JSON character '{' = 0x7B being read as version byte).

Add PostgresCopyJsonbFieldWriter that prepends the 0x01 version byte,
and modify MakeCopyFieldWriter to accept PostgresType so it can select
the JSONB writer when the target column type is kJsonb. The target
table column types are resolved via pg_attribute before entering COPY
mode.
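
To make the failure mode concrete, here is a small Python sketch of how a single field is framed in PostgreSQL binary COPY (a 4-byte big-endian length followed by the payload), with and without the JSONB version byte. The function names are hypothetical; the driver's writer is C++.

```python
import struct

JSONB_VERSION = 1  # the 0x01 prefix described above


def encode_text_field(text):
    """Frame a value as plain text: int32 length, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    return struct.pack(">i", len(payload)) + payload


def encode_jsonb_field(json_text):
    """Frame a JSONB value: int32 length, version byte, then the JSON text."""
    payload = bytes([JSONB_VERSION]) + json_text.encode("utf-8")
    return struct.pack(">i", len(payload)) + payload
```

With the plain-text framing, the first payload byte of `{"a": 1}` is `{`, which is 123 in decimal. That is the "version number" the server reported.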

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… 77.0.1

setuptools 77.0.0 was yanked from PyPI, causing ASV environment
creation to fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on> on connect

Automatically inject application_name=adbc_hologres_<version> into the
connection URI so Hologres can identify ADBC driver connections in
pg_stat_activity. User-specified application_name is preserved. Also fix
ADBC_INFO_DRIVER_VERSION to return the actual version instead of "unknown".
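
The injection rule can be sketched in Python as below. It covers only URI-style connection strings and is not the driver's C++ implementation. The function name is hypothetical and `driver_version` is a placeholder for whatever version string the driver reports.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_application_name(uri, driver_version):
    """Add application_name=adbc_hologres_<version> unless the user set one."""
    parts = urlsplit(uri)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "application_name" for key, _ in params):
        params.append(("application_name", f"adbc_hologres_{driver_version}"))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

A URI that already carries `application_name` passes through with its value preserved.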

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix strncmp(sqlstate, "42", 0) always matching (should be length 2)
- Fix SQL injection in Stage PK query by using PQexecParams
- Add null checks for PQescapeIdentifier return values
- Add Hologres >= 4.1 version gate for Stage ingestion mode
- Ensure DropStage cleanup on all error paths after CreateStage
- Remove redundant 3x pg_type query execution in RebuildTypeResolver
- Make open_connections_ atomic to prevent data races
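
The first fix in this list is easy to see with a small Python model of C's `strncmp` semantics (compare at most n leading characters). The helper is only an illustration of why a length of 0 matches everything; it is not the driver code.

```python
def strncmp(a, b, n):
    """Model of C strncmp for ASCII strings: compare at most n characters.

    Returns a negative, zero, or positive integer like the C function.
    """
    x, y = a[:n], b[:n]
    return (x > y) - (x < y)


def is_class_42(sqlstate):
    """True when the SQLSTATE is in class 42 (syntax error or access rule violation)."""
    return strncmp(sqlstate, "42", 2) == 0
```

With n set to 0, both slices are empty, so the comparison returns 0 for every SQLSTATE, including unrelated ones such as connection errors.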

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…aringbitmap in Stage mode

Query actual column types from pg_catalog.pg_attribute using format_type()
and rebuild the EXTERNAL_FILES AS clause with correct type declarations.
For types that EXTERNAL_FILES cannot auto-cast (json, jsonb, roaringbitmap),
use explicit SELECT casts instead of SELECT *.

Also fix std::atomic<int32_t> copy issue in Database::Release() by adding
.load() for variadic printf args.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…4 formats

Add ArrowCopyReader that decodes Hologres Arrow IPC streams wrapped in
PG binary COPY framing, supporting both uncompressed (arrow) and
LZ4-block-compressed (arrow_lz4) formats. Key Hologres compatibility
workarounds: bypass flatcc verification for pre-1.0 IPC messages, ignore
false LZ4_FRAME body compression declarations in RecordBatch metadata,
and add LZ4 block fallback in nanoarrow codecs for implementations that
report LZ4_FRAME but send block-compressed data.

Includes parameterized ExecuteCopy, CopyFormat statement option, JSONB
rejection for arrow formats, and 9 integration tests covering 23 types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…at benchmarks

Add time_pandas_adbc_arrow and time_pandas_adbc_arrow_lz4 benchmarks to
HologresBenchmarkBase, enabling side-by-side read performance comparison
of binary, arrow, and arrow_lz4 COPY formats across OneColumn and
MultiColumn suites.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ~120 offline unit tests across all testable modules to achieve 90%+
coverage. New test files: copy_writer_test.cc (55 tests for all COPY writer
types), arrow_copy_reader_test.cc (26 tests for IPC decoding, LZ4
decompression, and stream trampolines), bind_stream_test.cc (15 tests for
bind/iterate lifecycle). Extend existing tests for statement, connection,
database, error, copy reader, and postgres_type modules. Extract shared
MockTypeResolver into test_util.h.

Fix null pointer dereference in ArrowCopyReader::ReleaseTrampoline when
called with self=nullptr. Fix incorrect test expectation for PostgreSQL
interval type: Arrow format should be "tin" (interval_month_day_nano),
not "tDn" (duration_nanosecond).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… benchmarks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion, and improve RAII safety

- Fix O(n^2) digit insertion in Decimal COPY writer by using push_back + reverse
- Eliminate per-row heap allocations in numeric serialization via member variable reuse
- Add upfront ArrowBufferReserve in tuple writer and list writer for fewer allocation checks
- Add digits_.reserve() and early-return for special numeric values in COPY reader
- Replace memmove+resize pattern in ArrowCopyReader::Compact()
- Add scope guards to SerializeArrayToIpcBuffer for safer multi-resource cleanup
- Extract GetCurrentSchema() helper to replace 3 duplicate query blocks
- Add PqEscapedString RAII wrapper to replace 10 manual PQescapeIdentifier+PQfreemem sites
- Extract JoinUploadThreads() to replace 12 duplicate thread-join loops in stage_writer
- Consolidate MockTypeResolver, CopyReaderTester, CopyWriterTester into test_util.h
- Unify PG COPY binary signature constant via copy_common.h
- Remove redundant HologresVersion() in favor of VendorVersion()
- Name magic numbers in HologresStageConfig with constexpr constants
- Replace stringstream with string_view loop in ParseTextArray
- Promote is_null_param to BindStream member to avoid per-row vector allocation
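
The first item (front insertion replaced by append plus a single reverse) can be illustrated in Python. This sketch splits a non-negative integer into base-10000 digits, the digit base of PostgreSQL's numeric wire format. Both function names are hypothetical and the driver's Decimal writer is C++.

```python
def base10000_digits_slow(value):
    """Front insertion: each insert shifts every existing digit."""
    digits = []
    while value > 0:
        digits.insert(0, value % 10000)
        value //= 10000
    return digits


def base10000_digits_fast(value):
    """Append least-significant digits first, then reverse once at the end."""
    digits = []
    while value > 0:
        digits.append(value % 10000)
        value //= 10000
    digits.reverse()
    return digits
```

Both functions return the same digits, most significant first. Only the cost of assembling the list differs.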

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s not exist" errors

Decouple table creation (DDL) from data generation in ingest benchmark
suites by introducing setup_cache(). Previously, setup() generated
potentially huge vector data before creating tables — if data generation
failed (OOM/memory pressure), tables were never created, and teardown()
dropped all tables after each benchmark method, compounding the issue.

Now setup_cache() creates all empty tables once upfront, setup() only
handles data generation and connections, and teardown() frees memory
via gc.collect() instead of dropping tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd add comprehensive unit tests

Add JSONB support for Arrow/Arrow_LZ4 COPY read path by wrapping queries
to cast JSONB columns to TEXT, matching the approach used by Java holo-client.
Also add comprehensive unit tests across multiple modules to improve coverage.

Key changes:
- Add BuildJsonbWrapperQuery() to construct wrapper queries that cast JSONB
  columns to ::text with proper identifier escaping
- Replace JSONB blocking logic in ExecuteQuery with transparent cast wrapping
- Add result_helper_test.cc (22 tests for PqRecord parsing)
- Expand connection_test.cc, copy_test.cc, postgres_type_test.cc,
  postgres_util_test.cc, and statement_test.cc with additional test cases
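
A minimal Python sketch of the wrapping idea, assuming the result column names and types are already known. The function name, the subquery alias, and the input shape are hypothetical. The real BuildJsonbWrapperQuery() is C++ and may handle cases this sketch does not, such as a trailing semicolon in the user query.

```python
def quote_ident(name):
    """Quote an SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_jsonb_wrapper_query(query, columns):
    """Wrap a query so that jsonb columns are returned as text.

    columns is a list of (column_name, pg_type_name) pairs.
    """
    select_list = []
    for name, type_name in columns:
        quoted = quote_ident(name)
        if type_name == "jsonb":
            select_list.append(f"{quoted}::text AS {quoted}")
        else:
            select_list.append(quoted)
    # Hypothetical alias; PostgreSQL requires one for a subquery in FROM.
    return f"SELECT {', '.join(select_list)} FROM ({query}) AS adbc_jsonb_wrapped"
```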

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…row counts

Skip TEXT[] at 10M rows in HologresMultiColumnSuite to prevent OOM that
kills the setup_cache subprocess and skips all benchmarks. Limit COPY
binary vector ingest benchmarks to 1M rows max since COPY builds the
entire data stream in memory; Stage mode benchmarks remain at 10M rows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to arrow_lz4

Benchmarks show arrow_lz4 outperforms binary in nearly all read scenarios
(e.g. 10M-row single INT column: 358ms vs 2.72s, ~7.6x faster), while
LZ4 compression also reduces network transfer size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(c/driver/hologres): add Hologres ADBC driver
@TimothyDing TimothyDing requested a review from lidavidm as a code owner April 22, 2026 23:47
Add comprehensive documentation covering architecture, development
workflow, testing, features, configuration reference, usage examples
across Python/Java/C, known limitations, and release process.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@metegenez
Contributor

Do you use this library internally already? It may help the reviewers if it is a bit battle-tested.

@xborder
Contributor

xborder commented Apr 23, 2026

Assume this is a question from someone who doesn't know the database. Isn't Hologres PostgreSQL-compliant?
Is there anything specific that requires a new driver?

@TimothyDing
Author

TimothyDing commented Apr 23, 2026

> Assume this is a question from someone who doesn't know the database. Isn't Hologres PostgreSQL-compliant?
> Is there anything specific that requires a new driver?

Hi @xborder,
Good question! Although Hologres is compatible with the PostgreSQL ecosystem, we've implemented many Arrow-related features. For instance, we support Arrow format (and compressed Arrow) during COPY OUT operations, as well as a Snowflake-like stage mode for imports. Since standard PostgreSQL does not support these capabilities, I created a dedicated driver based on the PostgreSQL driver code.

@TimothyDing
Author

TimothyDing commented Apr 23, 2026

> Do you use this library internally already? It may help the reviewers if it is a bit battle-tested.

Hi, @metegenez
We do have an official Java SDK, but its integration with the Arrow ecosystem is somewhat limited. I stumbled upon ADBC (an official library) the other day and got really interested! We are definitely looking to embrace the Arrow ecosystem!

@metegenez
Contributor

> > Do you use this library internally already? It may help the reviewers if it is a bit battle-tested.
>
> Hi, @metegenez We do have an official Java SDK, but its integration with the Arrow ecosystem is somewhat limited. I stumbled upon ADBC (an official library) the other day and got really interested! We are definitely looking to embrace the Arrow ecosystem!

Good to hear. We are doing Arrow work at Huawei, too. I would love to help with the review, but I'm still learning ADBC internals myself. Good luck with the PR!

Btw, is this an open source DB or closed source like GaussDB?

@TimothyDing
Author

TimothyDing commented Apr 24, 2026

> Good to hear. We are doing Arrow work at Huawei, too. I would love to help with the review, but I'm still learning ADBC internals myself. Good luck with the PR!
>
> Btw, is this an open source DB or closed source like GaussDB?

Nice to meet you, @metegenez! My email is ding_ye_timo@163.com.
Hologres is a data warehousing product from Alibaba Cloud (similar to Snowflake or Apache Doris), and it is a closed-source product. We have a wide range of commercial customers, including Kering, LVMH, and Volkswagen.

@TimothyHologres

@lidavidm Could you help me to review it?
