[python] Support JDBC catalog by HansiChan · Pull Request #7720 · apache/paimon

HansiChan · 2026-04-28T03:28:50Z

Purpose

Support JDBC catalog in PyPaimon. This adds a Python JDBC catalog implementation that uses the same catalog metadata tables as the Java Paimon JDBC catalog: paimon_tables, paimon_database_properties, and paimon_table_properties.

The implementation supports SQLite with the Python standard library and dynamically supports MySQL/PostgreSQL when a corresponding Python DB-API driver is installed. Table data and schema files continue to use existing PyPaimon FileIO and SchemaManager behavior.

What changed

Register metastore=jdbc in CatalogFactory
Add JdbcCatalog and JdbcCatalogLoader
Add catalog-key and sync-all-properties catalog options
Cover database and table create/list/get/alter/rename/drop behavior with SQLite-backed tests
Document JDBC catalog creation in PyPaimon Python API docs

Tests

python3 -m py_compile pypaimon/catalog/jdbc_catalog.py pypaimon/catalog/jdbc_catalog_loader.py pypaimon/catalog/catalog_factory.py pypaimon/common/options/config.py pypaimon/tests/jdbc_catalog_test.py
PYTHONPATH=/tmp/paimon-python-test-deps POLARS_SKIP_CPU_CHECK=1 python3 -m unittest pypaimon.tests.jdbc_catalog_test pypaimon.tests.filesystem_catalog_test

tub · 2026-04-29T08:59:43Z

Nice! I have a similar change locally that uses SQLAlchemy - but this looks great as it adds fewer dependencies.
Is it worth calling it something other than JDBC? It may be confusing to folks who think it uses the JVM underneath for the database connections.

HansiChan · 2026-04-29T09:18:34Z

> Nice! I have a similar change locally that uses SQLAlchemy - but this looks great as it adds fewer dependencies. Is it worth calling it something other than JDBC? It may be confusing to folks who think it uses the JVM underneath for the database connections.

Good point. I kept the public catalog type as jdbc because it matches Paimon's existing JDBC catalog configuration and lets users reuse the same metastore=jdbc / jdbc: URI options across engines.

To avoid implying that PyPaimon uses JVM JDBC drivers, I updated the implementation and docs to clarify that PyPaimon uses native Python DB-API drivers under the hood. I also renamed the internal connection helper to _DbApiConnection and adjusted the driver error messages accordingly.

JingsongLi

Review: [python] Support JDBC catalog

Overall this is a solid contribution that brings JDBC catalog parity to PyPaimon with a clean design. The DB-API abstraction supporting SQLite/MySQL/PostgreSQL is well-structured, and the test coverage with SQLite is good. Below are issues I found, ranging from correctness bugs to design suggestions.

1. Lack of Transaction Atomicity (Bug)

_DbApiConnection.execute() commits after every single statement. This makes multi-statement operations non-atomic:

create_table: If _insert_table_properties fails after the INSERT INTO paimon_tables succeeds, the exception handler deletes the table directory but does NOT roll back the already-committed paimon_tables row. This leaves an orphaned metadata entry.
drop_database: Three separate DELETEs each commit independently. A failure between them leaves the catalog in an inconsistent state.
alter_table: The DELETE of old properties commits, then re-insertion of new properties happens row-by-row. A failure midway loses all table properties.
rename_table: Two UPDATEs commit separately; partial failure leaves inconsistent metadata.

Suggestion: Introduce a transaction context (e.g., remove the self.connection.commit() from execute and add explicit begin/commit/rollback boundaries around compound operations), or at minimum batch the operations within a single commit for methods like drop_database, create_table, alter_table, and rename_table.
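
A minimal sketch of the suggested transaction boundary, using SQLite from the standard library. The class name `DbApiConnection`, the `transaction()` method, and the three-column `paimon_tables` schema are assumptions made for illustration; they are not taken from the PR's code.

```python
import sqlite3
from contextlib import contextmanager


class DbApiConnection:
    """Hypothetical stand-in for the PR's _DbApiConnection helper."""

    def __init__(self, raw):
        self.raw = raw

    def execute(self, sql, params=()):
        # No commit here: the caller decides where the transaction ends.
        cursor = self.raw.cursor()
        cursor.execute(sql, params)
        return cursor

    @contextmanager
    def transaction(self):
        # Commit once if the whole block succeeds, otherwise roll back.
        try:
            yield self
            self.raw.commit()
        except Exception:
            self.raw.rollback()
            raise


raw = sqlite3.connect(":memory:")
# Simplified, assumed schema; the real catalog tables may have other columns.
raw.execute(
    "CREATE TABLE paimon_tables "
    "(catalog_key TEXT, database_name TEXT, table_name TEXT)"
)
raw.commit()
conn = DbApiConnection(raw)

try:
    with conn.transaction():
        conn.execute("INSERT INTO paimon_tables VALUES (?, ?, ?)", ("c", "db", "t"))
        # Simulates _insert_table_properties failing after the first INSERT.
        raise RuntimeError("simulated failure while writing properties")
except RuntimeError:
    pass

remaining = conn.execute("SELECT COUNT(*) FROM paimon_tables").fetchone()[0]
```

With a single commit at the end of the block, the failed compound operation leaves no row behind, so `remaining` is 0 here.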

2. `create_table` Error Handling Incomplete

try:
    self.connection.execute("INSERT INTO paimon_tables ...")
    if self._sync_all_properties():
        self._insert_table_properties(identifier, ...)
except Exception:
    self.file_io.delete_directory_quietly(table_path)
    raise

The except block only cleans up the file system directory. It should also delete the row from paimon_tables that was already committed, otherwise _table_exists() will return True for a table whose data directory was removed.

3. `rename_table` Performs Metadata Update Before File Move

If self.file_io.rename(source_path, target_path) fails (e.g., permission error, cross-device move), the catalog metadata already points to target_identifier but the data files are still at source_path. Consider moving the file first and rolling back on failure, or at least documenting this limitation.
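
One way to order the steps, sketched with `os.rename` and SQLite rather than PyPaimon's FileIO. The function `rename_table`, the single-column `paimon_tables` schema, and the directory layout are hypothetical; the point is only the move-first, compensate-on-failure ordering.

```python
import os
import sqlite3
import tempfile


def rename_table(conn, warehouse, old_name, new_name):
    """Move the directory first, then update metadata; undo the move on failure."""
    source = os.path.join(warehouse, old_name)
    target = os.path.join(warehouse, new_name)
    os.rename(source, target)  # if this raises, metadata is untouched
    try:
        conn.execute(
            "UPDATE paimon_tables SET table_name = ? WHERE table_name = ?",
            (new_name, old_name),
        )
        conn.commit()
    except Exception:
        conn.rollback()
        os.rename(target, source)  # best-effort compensation
        raise


warehouse = tempfile.mkdtemp()
os.mkdir(os.path.join(warehouse, "t_old"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paimon_tables (table_name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO paimon_tables VALUES (?)", [("t_old",), ("t_taken",)])
conn.commit()

# Renaming onto an existing name violates the primary key, so the move is undone.
try:
    rename_table(conn, warehouse, "t_old", "t_taken")
    failed = False
except sqlite3.IntegrityError:
    failed = True

restored = os.path.isdir(os.path.join(warehouse, "t_old"))
leftover = os.path.exists(os.path.join(warehouse, "t_taken"))
```

The compensation is still best-effort: if the move back also fails, files and metadata diverge, so the limitation is worth documenting either way.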

4. Placeholder Substitution is Fragile

def _sql(self, sql: str) -> str:
    if self.placeholder == "?":
        return sql
    return sql.replace("?", self.placeholder)

A naive str.replace("?", "%s") would break if any SQL string literal ever contained a ? character. While current queries don't hit this, it's a latent risk. A safer approach would be regex-based replacement that skips quoted strings, or building queries with the target placeholder from the start.
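
A small sketch of a conversion that skips single-quoted string literals, written as a character scan instead of a regex. The function name `convert_placeholders` is hypothetical, and the scan deliberately ignores double-quoted identifiers, SQL comments, and backslash escapes, which a production version might also need to handle.

```python
def convert_placeholders(sql, placeholder):
    """Replace ? with placeholder, except inside single-quoted SQL string literals."""
    if placeholder == "?":
        return sql
    out = []
    in_string = False
    for ch in sql:
        if ch == "'":
            # A doubled quote ('') inside a literal toggles twice, which nets out.
            in_string = not in_string
            out.append(ch)
        elif ch == "?" and not in_string:
            out.append(placeholder)
        else:
            out.append(ch)
    return "".join(out)


converted = convert_placeholders(
    "SELECT * FROM t WHERE a = ? AND b = 'what?'", "%s"
)
```

Here `converted` keeps the `?` inside `'what?'` and only rewrites the bind marker after `a =`.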

5. MySQL `**props` Passthrough May Cause Conflicts

In _connect_mysql, after popping user/password/username, the remaining props dict (which merges jdbc.* options and URI query params) is passed as **props to pymysql.connect(). If any query parameter name overlaps with an explicit keyword argument (e.g., someone passes ?host=... or ?port=... in the URI), this will raise a TypeError: got multiple values for argument.

Suggestion: Pop host, port, database from props before passing as **kwargs, or whitelist known safe extra options.
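
The conflict can be reproduced without any database driver. In the sketch below, `fake_connect` is a hypothetical stand-in for a driver's `connect()` and `RESERVED_KEYS` is an assumed list; neither reflects the actual pymysql signature or the PR's code.

```python
RESERVED_KEYS = ("host", "port", "database", "user", "password")


def fake_connect(host=None, port=None, database=None, **extra):
    """Hypothetical stand-in for a driver's connect(); it only echoes its arguments."""
    return {"host": host, "port": port, "database": database, "extra": extra}


def connect_naive(host, port, database, props):
    # A key in props that matches an explicit keyword is supplied twice.
    return fake_connect(host=host, port=port, database=database, **props)


def connect_safe(host, port, database, props):
    # Drop keys that are already passed explicitly so they cannot be supplied twice.
    extra = {k: v for k, v in props.items() if k not in RESERVED_KEYS}
    return fake_connect(host=host, port=port, database=database, **extra)


# As if the URI carried ?host=other-host&charset=utf8mb4
props = {"host": "other-host", "charset": "utf8mb4"}

try:
    connect_naive("db.example.com", 3306, "paimon", props)
    naive_error = None
except TypeError as exc:
    naive_error = str(exc)

safe = connect_safe("db.example.com", 3306, "paimon", props)
```

Filtering keeps the host parsed from the URI authoritative and still forwards driver-specific extras such as `charset`.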

6. SQLite Thread Safety

Python's sqlite3 module by default restricts connections to the creating thread (check_same_thread=True). If JdbcCatalog is ever used from multiple threads (e.g., in a web service or parallel writer), this will raise ProgrammingError. Consider passing check_same_thread=False if multi-thread usage is a goal, or documenting the single-thread constraint.
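
The default behaviour and one possible mitigation can be seen with the standard library alone. The helper `run_in_thread` and the explicit lock are illustrative choices, not part of the PR; sharing a connection with `check_same_thread=False` generally still calls for the caller to serialize access.

```python
import sqlite3
import threading


def run_in_thread(fn):
    """Run fn in a new thread and return the exception it raised, if any."""
    errors = []

    def target():
        try:
            fn()
        except Exception as exc:
            errors.append(exc)

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return errors[0] if errors else None


# Default connection: usable only from the thread that created it.
strict = sqlite3.connect(":memory:")
strict_error = run_in_thread(lambda: strict.execute("SELECT 1"))

# Shared connection: the check is disabled, so a lock serializes access instead.
shared = sqlite3.connect(":memory:", check_same_thread=False)
lock = threading.Lock()


def guarded_query():
    with lock:
        shared.execute("SELECT 1").fetchone()


shared_error = run_in_thread(guarded_query)
```

In this sketch `strict_error` is a `sqlite3.ProgrammingError`, while the guarded query on the shared connection completes without error.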

7. Minor: No `__enter__`/`__exit__` for Resource Cleanup

JdbcCatalog has a close() method but does not implement the context manager protocol. Adding __enter__/__exit__ would allow with CatalogFactory.create(opts) as catalog: ... usage and prevent connection leaks.
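
A minimal sketch of the protocol on a stand-in class. `CatalogSketch` and its `closed` flag are hypothetical; the real JdbcCatalog would delegate to its existing close().

```python
import sqlite3


class CatalogSketch:
    """Hypothetical catalog-like object that owns a database connection."""

    def __init__(self, path):
        self.connection = sqlite3.connect(path)
        self.closed = False

    def close(self):
        # Idempotent so that an explicit close() followed by __exit__ is harmless.
        if not self.closed:
            self.connection.close()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the with block


with CatalogSketch(":memory:") as catalog:
    value = catalog.connection.execute("SELECT 1").fetchone()[0]

closed_after = catalog.closed
```

The connection is released when the `with` block exits, including when the body raises.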

8. Minor: `_insert_database_properties` Issues One INSERT per Property

Each property key/value triggers a separate execute() call (and therefore a separate commit). For databases with many properties, this is both slow and non-atomic. Consider using executemany() or batching inserts.
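
A sketch of the batched variant with SQLite. The three-column `paimon_database_properties` schema is an assumption for illustration. Whether `executemany()` reduces round trips depends on the driver; the atomicity comes from committing once after the whole batch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema for illustration only.
conn.execute(
    "CREATE TABLE paimon_database_properties "
    "(database_name TEXT, property_key TEXT, property_value TEXT)"
)

properties = {"owner": "alice", "comment": "demo"}
rows = [("db", key, value) for key, value in properties.items()]

# One call for all rows, followed by a single commit.
conn.executemany("INSERT INTO paimon_database_properties VALUES (?, ?, ?)", rows)
conn.commit()

stored = dict(
    conn.execute(
        "SELECT property_key, property_value FROM paimon_database_properties"
    ).fetchall()
)
```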

Positive Notes

Parameterized queries throughout — no SQL injection risk.
Clean separation of connection logic in _DbApiConnection.
Tests cover the full lifecycle (create, list, get, alter, rename, drop) for both databases and tables.
The catalog-key option for multi-tenant isolation is a good design choice matching Java Paimon.
Documentation is clear and explains the native Python driver approach well.

Nice work overall. The atomicity issue (point 1) is the most critical to address before merge.

HansiChan · 2026-05-23T16:53:20Z

Thanks for the detailed review. I pushed a follow-up commit to address the JDBC catalog comments:

Added an explicit transaction context in _DbApiConnection; execute() no longer commits each statement.
Wrapped compound catalog operations in transactions, including create_table, drop_database, alter_table, rename_table, and property writes.
create_table now rolls back metadata on failure and still cleans up the table directory.
rename_table now moves the table path before updating metadata, and tries to move it back if the metadata update fails.
Replaced naive placeholder substitution with conversion that skips quoted SQL string literals.
Popped MySQL/PostgreSQL connection kwargs such as host, port, database / dbname before passing remaining props to the driver.
Set SQLite check_same_thread=False.
Added JdbcCatalog.__enter__ / __exit__.
Switched property inserts to executemany() and added tests for rollback, rename failure, placeholder conversion, and context manager cleanup.

I also checked the current CI failures and they are unrelated to this JDBC catalog change. I double-checked the PR diff: this PR only changes the JDBC catalog implementation/docs/tests and does not touch the GCS, Tantivy, or mixed e2e code paths.

The failed checks are:

lint-python (3.6.15) fails in GCS file IO tests because the Python 3.6 job installs pyarrow==6.0.1, where pyarrow.fs.GcsFileSystem is not available.
lint-python (3.10) and lint-python (3.11) pass the normal PyPaimon tests, including pypaimon/tests/jdbc_catalog_test.py, but fail later in the mixed e2e Tantivy full-text index test with:
AttributeError: 'tantivy.tantivy.Searcher' object has no attribute 'fast_field_values'.

Local validation:

PYTHONPATH=. python -m pytest pypaimon/tests/jdbc_catalog_test.py

python -m flake8 --config=dev/cfg.ini pypaimon/catalog/jdbc_catalog.py pypaimon/tests/jdbc_catalog_test.py

JingsongLi reviewed May 23, 2026


HansiChan added 4 commits May 23, 2026 23:51

c1913cd [python] Support JDBC catalog
158ea83 [python] Trigger CI rerun
7fd9541 [python] Clarify JDBC catalog connection type
4f19eda [python] Address JDBC catalog review comments

HansiChan force-pushed the codex-pypaimon-jdbc-catalog branch from 747e2a1 to 4f19eda · May 23, 2026 16:04

HansiChan wants to merge 4 commits into apache:master from HansiChan:codex-pypaimon-jdbc-catalog
