Skip to content

[python] Support JDBC catalog#7720

Open
HansiChan wants to merge 4 commits into
apache:masterfrom
HansiChan:codex-pypaimon-jdbc-catalog
Open

[python] Support JDBC catalog#7720
HansiChan wants to merge 4 commits into
apache:masterfrom
HansiChan:codex-pypaimon-jdbc-catalog

Conversation

@HansiChan
Copy link
Copy Markdown
Contributor

Purpose

Support JDBC catalog in PyPaimon. This adds a Python JDBC catalog implementation that uses the same catalog metadata tables as Java Paimon JDBC catalog: paimon_tables, paimon_database_properties, and paimon_table_properties.

The implementation supports SQLite with the Python standard library and dynamically supports MySQL/PostgreSQL when a corresponding Python DB-API driver is installed. Table data and schema files continue to use existing PyPaimon FileIO and SchemaManager behavior.

What changed

  • Register metastore=jdbc in CatalogFactory
  • Add JdbcCatalog and JdbcCatalogLoader
  • Add catalog-key and sync-all-properties catalog options
  • Cover database and table create/list/get/alter/rename/drop behavior with SQLite-backed tests
  • Document JDBC catalog creation in PyPaimon Python API docs

Tests

  • python3 -m py_compile pypaimon/catalog/jdbc_catalog.py pypaimon/catalog/jdbc_catalog_loader.py pypaimon/catalog/catalog_factory.py pypaimon/common/options/config.py pypaimon/tests/jdbc_catalog_test.py
  • PYTHONPATH=/tmp/paimon-python-test-deps POLARS_SKIP_CPU_CHECK=1 python3 -m unittest pypaimon.tests.jdbc_catalog_test pypaimon.tests.filesystem_catalog_test

@tub
Copy link
Copy Markdown
Contributor

tub commented Apr 29, 2026

Nice! I have a similar change locally that uses SQLAlchemy - but this looks great as it adds fewer dependencies.
Is it worth calling it something other than JDBC? It may be confusing to folks who think it uses the JVM underneath for the database connections.

@HansiChan
Copy link
Copy Markdown
Contributor Author

Nice! I have a similar change locally that uses SQLAlchemy - but this looks great as it adds fewer dependencies. Is it worth calling it something other than JDBC? It may be confusing to folks who think it uses the JVM underneath for the database connections.

Good point. I kept the public catalog type as jdbc because it matches Paimon's existing JDBC catalog configuration and lets users reuse the same metastore=jdbc / jdbc: URI options across engines.

To avoid implying that PyPaimon uses JVM JDBC drivers, I updated the implementation and docs to clarify that PyPaimon uses native Python DB-API drivers under the hood. I also renamed the internal connection helper to _DbApiConnection and adjusted the driver error messages accordingly.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [python] Support JDBC catalog

Overall this is a solid contribution that brings JDBC catalog parity to PyPaimon with a clean design. The DB-API abstraction supporting SQLite/MySQL/PostgreSQL is well-structured, and the test coverage with SQLite is good. Below are issues I found, ranging from correctness bugs to design suggestions.


1. Lack of Transaction Atomicity (Bug)

_DbApiConnection.execute() commits after every single statement. This makes multi-statement operations non-atomic:

  • create_table: If _insert_table_properties fails after the INSERT INTO paimon_tables succeeds, the exception handler deletes the table directory but does NOT roll back the already-committed paimon_tables row. This leaves an orphaned metadata entry.
  • drop_database: Three separate DELETEs each commit independently. A failure between them leaves the catalog in an inconsistent state.
  • alter_table: The DELETE of old properties commits, then re-insertion of new properties happens row-by-row. A failure midway loses all table properties.
  • rename_table: Two UPDATEs commit separately; partial failure leaves inconsistent metadata.

Suggestion: Introduce a transaction context (e.g., remove the self.connection.commit() from execute and add explicit begin/commit/rollback boundaries around compound operations), or at minimum batch the operations within a single commit for methods like drop_database, create_table, alter_table, and rename_table.


2. create_table Error Handling Incomplete

try:
    self.connection.execute("INSERT INTO paimon_tables ...")
    if self._sync_all_properties():
        self._insert_table_properties(identifier, ...)
except Exception:
    self.file_io.delete_directory_quietly(table_path)
    raise

The except block only cleans up the file system directory. It should also delete the row from paimon_tables that was already committed, otherwise _table_exists() will return True for a table whose data directory was removed.


3. rename_table Performs Metadata Update Before File Move

If self.file_io.rename(source_path, target_path) fails (e.g., permission error, cross-device move), the catalog metadata already points to target_identifier but the data files are still at source_path. Consider moving the file first and rolling back on failure, or at least documenting this limitation.


4. Placeholder Substitution is Fragile

def _sql(self, sql: str) -> str:
    if self.placeholder == "?":
        return sql
    return sql.replace("?", self.placeholder)

A naive str.replace("?", "%s") would break if any SQL string literal ever contained a ? character. While current queries don't hit this, it's a latent risk. A safer approach would be regex-based replacement that skips quoted strings, or building queries with the target placeholder from the start.


5. MySQL **props Passthrough May Cause Conflicts

In _connect_mysql, after popping user/password/username, the remaining props dict (which merges jdbc.* options and URI query params) is passed as **props to pymysql.connect(). If any query parameter name overlaps with an explicit keyword argument (e.g., someone passes ?host=... or ?port=... in the URI), this will raise a TypeError: got multiple values for argument.

Suggestion: Pop host, port, database from props before passing as **kwargs, or whitelist known safe extra options.


6. SQLite Thread Safety

Python's sqlite3 module by default restricts connections to the creating thread (check_same_thread=True). If JdbcCatalog is ever used from multiple threads (e.g., in a web service or parallel writer), this will raise ProgrammingError. Consider passing check_same_thread=False if multi-thread usage is a goal, or documenting the single-thread constraint.


7. Minor: No __enter__/__exit__ for Resource Cleanup

JdbcCatalog has a close() method but does not implement the context manager protocol. Adding __enter__/__exit__ would allow with CatalogFactory.create(opts) as catalog: ... usage and prevent connection leaks.


8. Minor: _insert_database_properties Issues One INSERT per Property

Each property key/value triggers a separate execute() call (and therefore a separate commit). For databases with many properties, this is both slow and non-atomic. Consider using executemany() or batching inserts.


Positive Notes

  • Parameterized queries throughout — no SQL injection risk.
  • Clean separation of connection logic in _DbApiConnection.
  • Tests cover the full lifecycle (create, list, get, alter, rename, drop) for both databases and tables.
  • The catalog-key option for multi-tenant isolation is a good design choice matching Java Paimon.
  • Documentation is clear and explains the native Python driver approach well.

Nice work overall. The atomicity issue (point 1) is the most critical to address before merge.

@HansiChan HansiChan force-pushed the codex-pypaimon-jdbc-catalog branch from 747e2a1 to 4f19eda Compare May 23, 2026 16:04
@HansiChan
Copy link
Copy Markdown
Contributor Author

HansiChan commented May 23, 2026

Thanks for the detailed review. I pushed a follow-up commit to address the JDBC catalog comments:

  • Added an explicit transaction context in _DbApiConnection; execute() no longer commits each statement.
  • Wrapped compound catalog operations in transactions, including create_table, drop_database, alter_table, rename_table, and property writes.
  • create_table now rolls back metadata on failure and still cleans up the table directory.
  • rename_table now moves the table path before updating metadata, and tries to move it back if the metadata update fails.
  • Replaced naive placeholder substitution with conversion that skips quoted SQL string literals.
  • Popped MySQL/PostgreSQL connection kwargs such as host, port, database / dbname before passing remaining props to the driver.
  • Set SQLite check_same_thread=False.
  • Added JdbcCatalog.__enter__ / __exit__.
  • Switched property inserts to executemany() and added tests for rollback, rename failure, placeholder conversion, and context manager cleanup.

I also checked the current CI failures and they are unrelated to this JDBC catalog change. I double-checked the PR diff: this PR only changes the JDBC catalog implementation/docs/tests and does not touch the GCS, Tantivy, or mixed e2e code paths.

The failed checks are:

  • lint-python (3.6.15) fails in GCS file IO tests because the Python 3.6 job installs pyarrow==6.0.1, where pyarrow.fs.GcsFileSystem is not available.
  • lint-python (3.10) and lint-python (3.11) pass the normal PyPaimon tests, including pypaimon/tests/jdbc_catalog_test.py, but fail later in the mixed e2e Tantivy full-text index test with:
    AttributeError: 'tantivy.tantivy.Searcher' object has no attribute 'fast_field_values'.

Local validation:

PYTHONPATH=. python -m pytest pypaimon/tests/jdbc_catalog_test.py

python -m flake8 --config=dev/cfg.ini pypaimon/catalog/jdbc_catalog.py pypaimon/tests/jdbc_catalog_test.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants