
Add ClickHouse Provider#67080

Open
BentsiLeviav wants to merge 9 commits into apache:main from BentsiLeviav:add-clickhouse-provider

@BentsiLeviav BentsiLeviav commented May 18, 2026

Description

Adds a new apache-airflow-providers-clickhouse provider that integrates Airflow with ClickHouse via the HTTP interface using the clickhouse-connect library.

Scope of this implementation

  • ClickHouseHook - the core integration, extending DbApiHook so all standard SQLExecuteQueryOperator features work out of the box (templating, handler, split_statements, etc.)
  • Connection form UI with dedicated fields for TLS, timeouts, compression, session settings, and client kwargs
  • bulk_insert_rows() for more performant inserts using clickhouse-connect's native insert path
  • get_uri() for SQLAlchemy-compatible connection strings (clickhousedb:// / clickhousedbs://)
  • Connection type docs, operator how-to guide, and integration logo
  • 95 unit tests

Implementation decisions

  • DB-API 2.0 adapter (ClickHouseConnection): clickhouse-connect doesn't expose a DB-API connection natively - we wrap its Client in a thin adapter so DbApiHook.run() works unmodified. commit() and rollback() are intentional no-ops since ClickHouse has no transactions.
  • Two-level settings merge: both session_settings and client_kwargs can be set at the connection level (via the extra JSON field) and overridden at the task level (via hook constructor arguments), with the constructor taking precedence on conflicts.
  • Hook-managed kwargs protection: keys that the hook owns (host, port, username, password, database, secure, verify, client_name, settings) are stripped from any user-supplied client_kwargs so hook-managed values always win.
  • Client name: every query is tagged with apache-airflow/<version> apache-airflow-providers-clickhouse/<version> in the HTTP User-Agent (system.query_log), making queries traceable back to their Airflow source. Users can append a custom label via the client_name extra field.
  • No dedicated operators are added - SQLExecuteQueryOperator from common.sql covers all standard SQL use cases.
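
The two-level merge and the hook-managed kwargs protection described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names HOOK_MANAGED_KWARGS, merge_two_levels, and strip_hook_managed are hypothetical, and the example keys (query_limit, max_threads, max_execution_time) are only illustrative values.

```python
from __future__ import annotations

from typing import Any

# Keys the hook owns, copied from the list in the PR description.
# The constant name itself is hypothetical.
HOOK_MANAGED_KWARGS = frozenset(
    {
        "host", "port", "username", "password", "database",
        "secure", "verify", "client_name", "settings",
    }
)


def merge_two_levels(
    from_extra: dict[str, Any] | None,
    from_constructor: dict[str, Any] | None,
) -> dict[str, Any]:
    """Connection-level values are the base; constructor values win on conflict."""
    merged = dict(from_extra or {})
    merged.update(from_constructor or {})
    return merged


def strip_hook_managed(client_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Drop any key the hook sets itself so user input cannot override it."""
    return {k: v for k, v in client_kwargs.items() if k not in HOOK_MANAGED_KWARGS}


# Connection extra tries to set a hook-managed key ("host"); the task overrides query_limit.
extra_level = {"query_limit": 1000, "host": "other-host"}
constructor_level = {"query_limit": 50}
client_kwargs = strip_hook_managed(merge_two_levels(extra_level, constructor_level))

# The same merge applies to session settings.
session_settings = merge_two_levels(
    {"max_threads": 4},
    {"max_threads": 8, "max_execution_time": 60},
)
```

Merging first and stripping afterwards means a hook-managed key is dropped no matter which level supplied it.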

File structure (generated with Claude)

| File(s) | Purpose |
|---|---|
| provider.yaml | Provider metadata: name, version, integrations, connection types, UI field behaviour, and conn-fields schema used to generate the connection form |
| pyproject.toml | Package build config and dependencies (clickhouse-connect >=0.7.0, common-sql >=1.32.0); auto-generated from the Breeze template |
| src/.../hooks/clickhouse.py | Core implementation: ClickHouseHook (extends DbApiHook) and ClickHouseConnection (thin DB-API 2.0 adapter wrapping the clickhouse-connect client) |
| src/.../get_provider_info.py | Auto-generated from provider.yaml by the Breeze release tooling; do not edit manually |
| src/airflow/__init__.py, src/airflow/providers/__init__.py | Namespace package declarations required for the airflow.providers implicit namespace |
| src/.../clickhouse/__init__.py | Version file (__version__ = "1.0.0") with minimum Airflow version guard; auto-generated |
| docs/connections/clickhouse.rst | Connection configuration reference: all fields, their types, defaults, and JSON/URI examples |
| docs/operators/clickhouse.rst | How-to guide: using SQLExecuteQueryOperator and ClickHouseHook directly, including session_settings and bulk_insert_rows examples |
| docs/index.rst, docs/conf.py, docs/changelog.rst, docs/security.rst | Standard provider docs scaffold; mostly auto-generated |
| docs/integration-logos/ClickHouse.png | Official ClickHouse logo used by the Apache Airflow website |
| tests/unit/clickhouse/hooks/test_clickhouse.py | 95 unit tests covering connection building, settings/kwargs merge logic, database override, URI generation, bulk insert, UI widgets, and autocommit semantics |
| tests/system/clickhouse/example_clickhouse.py | System test / example DAG: create table → bulk insert → read rows → drop table |
| .github/boring-cyborg.yml | Adds provider:clickhouse label rule for automatic PR labelling |
| scripts/ci/docker-compose/remove-sources.yml, tests-sources.yml | Auto-updated by prek to mount the clickhouse provider sources/tests into the CI Docker environment |

Introduces the package structure for the new ClickHouse provider:
pyproject.toml, provider.yaml, namespace packages, and test skeletons.

Implements ClickHouseHook (extending DbApiHook) via clickhouse-connect:
- ClickHouseConnection DB-API 2.0 adapter (cursor, commit/rollback no-ops)
- Connection-form widgets and UI field behaviour for the Airflow UI
- session_settings and client_kwargs merge (extra JSON + constructor args)
- bulk_insert_rows() for efficient columnar inserts
- get_uri() for SQLAlchemy-compatible connection strings

95 tests covering connection building, session_settings and client_kwargs
merge logic, database override, UI widgets, bulk_insert_rows, and get_uri.

Connection type reference, operator how-to guide, changelog, and
integration logo for the ClickHouse provider.

Demonstrates ClickHouseHook and SQLExecuteQueryOperator usage:
create table, bulk insert, read rows, and drop table.

Previous uv.lock was regenerated with a local uv version that produced
a different format. Restore to upstream format with only the
clickhouse-connect entries added.
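
To make the "DB-API 2.0 adapter (cursor, commit/rollback no-ops)" item from the commit notes above concrete, here is a rough sketch of what such a thin adapter might look like. It is not the PR's ClickHouseConnection: SketchConnection, SketchCursor, and the stub client are hypothetical names, and only the read path is shown. The one assumption about clickhouse-connect is that the client's query() returns an object exposing result_rows and column_names.

```python
from __future__ import annotations

from typing import Any


class SketchCursor:
    """Minimal DB-API-style cursor backed by a clickhouse-connect-like client."""

    def __init__(self, client: Any) -> None:
        self._client = client
        self._rows: list[tuple] = []
        self.description: list[tuple] | None = None
        self.rowcount = -1

    def execute(self, sql: str, parameters: Any = None) -> None:
        result = self._client.query(sql, parameters=parameters)
        self._rows = list(result.result_rows)
        # DB-API description entries are 7-item sequences; only the name is filled in.
        self.description = [
            (name, None, None, None, None, None, None) for name in result.column_names
        ]
        self.rowcount = len(self._rows)

    def fetchall(self) -> list[tuple]:
        rows, self._rows = self._rows, []
        return rows

    def close(self) -> None:
        self._rows = []


class SketchConnection:
    """Wraps the client so DB-API consumers can call cursor()/commit()/rollback()."""

    def __init__(self, client: Any) -> None:
        self._client = client

    def cursor(self) -> SketchCursor:
        return SketchCursor(self._client)

    def commit(self) -> None:
        """Intentional no-op: there is no transaction to commit."""

    def rollback(self) -> None:
        """Intentional no-op: there is no transaction to roll back."""

    def close(self) -> None:
        self._client.close()


class _StubResult:
    """Stand-in for a query result; returns fixed data for the demo."""

    column_names = ("n",)
    result_rows = [(1,), (2,)]


class _StubClient:
    """Stand-in for a real client so the sketch runs without a server."""

    def query(self, sql: str, parameters: Any = None) -> _StubResult:
        return _StubResult()

    def close(self) -> None:
        pass


conn = SketchConnection(_StubClient())
cur = conn.cursor()
cur.execute("SELECT n FROM example_table")  # the stub ignores the SQL text
rows = cur.fetchall()
conn.commit()
```

Because commit() and rollback() exist and do nothing, generic DB-API callers can run their usual commit logic against the wrapper without special-casing ClickHouse.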

boring-cyborg Bot commented May 18, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@BentsiLeviav
Author

@koletzilla @joe-clickhouse would you mind reviewing that as well?

@eladkal
Contributor

eladkal commented May 18, 2026

Thanks for adding the ClickHouse provider, @BentsiLeviav.
I am happy to sponsor the provider. Please follow the procedure listed in https://github.com/apache/airflow/blob/main/providers/ACCEPTING_PROVIDERS.rst#discussion-thread-template. We simplified the process of accepting new providers, and it now requires just one mailing list thread. This should be done in parallel with the code review of the PR.

@joe-clickhouse joe-clickhouse left a comment

Hi @BentsiLeviav. Looks pretty good! From a clickhouse-connect perspective I have a few comments. In short, the scheme name needs updating, and I think the bulk insert should be changed to use an insert context or just the regular client insert method, which will automatically stream.

host = conn.host or "localhost"
port = int(conn.port) if conn.port else 8123
database = self.database or conn.schema or "default"
scheme = "clickhousedbs" if bool(extra.get("secure", False)) else "clickhousedb"

clickhousedbs isn't a registered scheme. clickhouse-connect only registers clickhousedb (and clickhousedb+connect) as a SQLAlchemy dialect. For a TLS connection, use the single scheme with secure as a query parameter, which the dbapi Connection.__init__ forwards via generic_args -> create_client, e.g. clickhousedb://user:pw@host:port/db?secure=true&verify=true.

It's probably worth wiring the other tuning params like connect_timeout, send_receive_timeout, compress, etc. through the query string the same way; otherwise SQLAlchemy-path users silently lose the ability to change settings that DB-API-path users get.

Author

Thanks! pushed a fix. LMK if it is ok now

Comment on lines +407 to +410
try:
    for i in range(0, len(rows), commit_every):
        batch = rows[i : i + commit_every]
        client.insert(table, batch, column_names=column_names)

client.insert() already streams data to ClickHouse in adaptive ~2MB blocks as a single Transfer-Encoding: chunked HTTP request. Batching at the Python layer here doesn't help and could even hurt, because each client.insert() call without a reusable context issues a DESCRIBE TABLE to resolve column types, even when column_names is supplied; column_names is only used to filter/order the describe result.

I'd recommend either:

  1. Just call client.insert(table, rows, column_names=column_names) once and let the driver handle block-level streaming internally.
  2. If batching is genuinely needed (e.g. memory pressure on extremely large inputs), build the context once and reuse it like:
ctx = client.create_insert_context(table, column_names=column_names)
for i in range(0, len(rows), commit_every):
    client.insert(data=rows[i:i+commit_every], context=ctx)

Side note, I think commit_every is a misnomer as inserts are not transactional. batch_size might be a more appropriate term.

Author

Thanks for the input.
Took up your recommendation and did the following:

  • Renamed commit_every to batch_size.
  • Changed its default value to None, so a single insert is issued when it is not provided.
  • When it is provided, the insert context is created once, before the insertion loop.
  • Updated the tests to verify all of the above.
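
The behaviour described in this reply could look roughly like the sketch below. It is not the code pushed to the PR: the function body and the RecordingClient stand-in are assumptions written for illustration, and the only clickhouse-connect calls used are the ones quoted in the review (client.insert and client.create_insert_context).

```python
from __future__ import annotations

from typing import Any, Sequence


def bulk_insert_rows(
    client: Any,
    table: str,
    rows: Sequence[Sequence[Any]],
    column_names: Sequence[str],
    batch_size: int | None = None,
) -> None:
    """Insert rows in one call, or in batches that share a single insert context."""
    rows = list(rows)
    if not rows:
        return
    if batch_size is None:
        # Single insert: the driver handles block-level streaming itself.
        client.insert(table, rows, column_names=column_names)
        return
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer or None")
    # Build the context once so the table is not re-described for every batch.
    ctx = client.create_insert_context(table, column_names=column_names)
    for start in range(0, len(rows), batch_size):
        client.insert(data=rows[start : start + batch_size], context=ctx)


class RecordingClient:
    """Hypothetical stand-in that records calls instead of talking to a server."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def create_insert_context(self, table: str, column_names: Any = None) -> dict:
        self.calls.append(("context", table))
        return {"table": table, "column_names": column_names}

    def insert(self, table=None, data=None, column_names="*", context=None) -> None:
        self.calls.append(("insert", len(data)))


sample = [(1, "a"), (2, "b"), (3, "c")]

single = RecordingClient()
bulk_insert_rows(single, "events", sample, ["id", "name"])

batched = RecordingClient()
bulk_insert_rows(batched, "events", sample, ["id", "name"], batch_size=2)
```

With batch_size left at None the stand-in sees one insert of three rows; with batch_size=2 it sees one context creation followed by inserts of two rows and one row.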

Comment on lines +295 to +310
@patch("airflow.providers.clickhouse.hooks.clickhouse.ClickHouseHook.get_connection")
def test_get_uri_secure_uses_clickhousedbs_scheme(self, mock_get_connection):
    """secure=True in extra must produce the clickhousedbs:// (HTTPS) scheme."""
    conn = Connection(
        conn_id="ch_secure",
        conn_type="clickhouse",
        host="secure-host",
        port=8443,
        login="user",
        password="pass",
        schema="db",
        extra=json.dumps({"secure": True}),
    )
    mock_get_connection.return_value = conn
    uri = ClickHouseHook(clickhouse_conn_id="ch_secure").get_uri()
    assert uri == "clickhousedbs://user:pass@secure-host:8443/db"

This test can be deleted/reconfigured in light of the other comment explaining how clickhousedbs isn't a valid scheme.

Author

I renamed the test to test_get_uri_secure_adds_query_param and updated the assertion to assert uri == "clickhousedb://user:pass@secure-host:8443/db?secure=true"

…ning

The clickhouse-connect library only registers the clickhousedb:// SQLAlchemy
dialect; clickhousedbs:// was never a valid scheme and would fail at engine
creation. TLS is now enabled via ?secure=true, and tuning params
(connect_timeout, send_receive_timeout, compress, verify) are forwarded as
query-string arguments so SQLAlchemy-path users get the same settings as
DB-API-path users. Tests updated accordingly.

@joe-clickhouse joe-clickhouse left a comment

Hi @BentsiLeviav. The changes you made look good! I did notice one last thing, which I've left a comment about in the code, related to passing arbitrary kwargs to the client from the Connection extra level.

extra: dict[str, Any] = conn.extra_dejson

# Merge client_kwargs: extra values are the base, constructor values override.
raw_client_kwargs = extra.get("client_kwargs")

I'm not an Airflow expert, but I think this might expose low-level clickhouse-connect client options at the Connection extra level, which may be too broad for an Airflow provider.

From a clickhouse-connect perspective, arbitrary client kwargs are useful when the caller owns the Python code. So I think ClickHouseHook(client_kwargs=...) is reasonable at the Dag author level. But for Connection extras, the provider should probably only expose a finite set of reviewed and documented fields like host, port, username, password, database, secure, verify, timeouts, compression, etc.

It looks like _HOOK_MANAGED_KWARGS prevents overriding hook-owned fields, which is good, but it still allows any other clickhouse_connect.get_client() kwarg through. That means a Connection configuration user can configure low-level transport and security behavior on behalf of any Dag that uses the connection.

Long story short, I think we should keep arbitrary client_kwargs as a hook constructor argument only, and promote individual kwargs to Connection extras when the provider intentionally supports and documents them. This seems more consistent with Airflow's guidance to allowlist Connection extras rather than forwarding arbitrary kwargs into underlying libraries, but I'll defer to the Airflow maintainers on the provider policy.
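
One way the allowlisting idea could be expressed is sketched below. This is only an illustration of the suggestion, not a proposal for the exact field set: ALLOWED_EXTRA_KEYS and split_extras are hypothetical names, the allowlist contents are assembled from fields mentioned elsewhere in this PR, and http_proxy is used purely as an example of a low-level option.

```python
from __future__ import annotations

from typing import Any

# Hypothetical allowlist of Connection extras the provider chooses to support.
ALLOWED_EXTRA_KEYS = frozenset(
    {
        "secure", "verify", "connect_timeout", "send_receive_timeout",
        "compress", "client_name", "session_settings",
    }
)


def split_extras(extra: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return the supported extras and the sorted names of anything else."""
    allowed = {k: v for k, v in extra.items() if k in ALLOWED_EXTRA_KEYS}
    rejected = sorted(k for k in extra if k not in ALLOWED_EXTRA_KEYS)
    return allowed, rejected


connection_extra = {
    "secure": True,
    "connect_timeout": 10,
    # Arbitrary low-level options would no longer pass through from the Connection.
    "client_kwargs": {"http_proxy": "http://proxy.example:3128"},
}
allowed, rejected = split_extras(connection_extra)
```

Returning the rejected keys separately lets the hook decide whether to log a warning or raise, while client_kwargs stays available as a constructor argument for Dag authors.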

Reference on Connection configuration users:

Connection configuration users

Labels

area:dev-tools, area:providers, backport-to-v3-2-test, kind:documentation

3 participants