[python] Support JDBC catalog#7720
Conversation
|
Nice! I have a similar change locally that uses SQLAlchemy - but this looks great as it adds fewer dependencies. |
Good point. I kept the public catalog type as To avoid implying that PyPaimon uses JVM JDBC drivers, I updated the implementation and docs to clarify that PyPaimon uses native Python DB-API drivers under the hood. I also renamed the internal connection helper to |
JingsongLi
left a comment
There was a problem hiding this comment.
Review: [python] Support JDBC catalog
Overall this is a solid contribution that brings JDBC catalog parity to PyPaimon with a clean design. The DB-API abstraction supporting SQLite/MySQL/PostgreSQL is well-structured, and the test coverage with SQLite is good. Below are issues I found, ranging from correctness bugs to design suggestions.
1. Lack of Transaction Atomicity (Bug)
_DbApiConnection.execute() commits after every single statement. This makes multi-statement operations non-atomic:
create_table: If_insert_table_propertiesfails after theINSERT INTO paimon_tablessucceeds, the exception handler deletes the table directory but does NOT roll back the already-committedpaimon_tablesrow. This leaves an orphaned metadata entry.drop_database: Three separate DELETEs each commit independently. A failure between them leaves the catalog in an inconsistent state.alter_table: The DELETE of old properties commits, then re-insertion of new properties happens row-by-row. A failure midway loses all table properties.rename_table: Two UPDATEs commit separately; partial failure leaves inconsistent metadata.
Suggestion: Introduce a transaction context (e.g., remove the self.connection.commit() from execute and add explicit begin/commit/rollback boundaries around compound operations), or at minimum batch the operations within a single commit for methods like drop_database, create_table, alter_table, and rename_table.
2. create_table Error Handling Incomplete
try:
self.connection.execute("INSERT INTO paimon_tables ...")
if self._sync_all_properties():
self._insert_table_properties(identifier, ...)
except Exception:
self.file_io.delete_directory_quietly(table_path)
raiseThe except block only cleans up the file system directory. It should also delete the row from paimon_tables that was already committed, otherwise _table_exists() will return True for a table whose data directory was removed.
3. rename_table Performs Metadata Update Before File Move
If self.file_io.rename(source_path, target_path) fails (e.g., permission error, cross-device move), the catalog metadata already points to target_identifier but the data files are still at source_path. Consider moving the file first and rolling back on failure, or at least documenting this limitation.
4. Placeholder Substitution is Fragile
def _sql(self, sql: str) -> str:
if self.placeholder == "?":
return sql
return sql.replace("?", self.placeholder)A naive str.replace("?", "%s") would break if any SQL string literal ever contained a ? character. While current queries don't hit this, it's a latent risk. A safer approach would be regex-based replacement that skips quoted strings, or building queries with the target placeholder from the start.
5. MySQL **props Passthrough May Cause Conflicts
In _connect_mysql, after popping user/password/username, the remaining props dict (which merges jdbc.* options and URI query params) is passed as **props to pymysql.connect(). If any query parameter name overlaps with an explicit keyword argument (e.g., someone passes ?host=... or ?port=... in the URI), this will raise a TypeError: got multiple values for argument.
Suggestion: Pop host, port, database from props before passing as **kwargs, or whitelist known safe extra options.
6. SQLite Thread Safety
Python's sqlite3 module by default restricts connections to the creating thread (check_same_thread=True). If JdbcCatalog is ever used from multiple threads (e.g., in a web service or parallel writer), this will raise ProgrammingError. Consider passing check_same_thread=False if multi-thread usage is a goal, or documenting the single-thread constraint.
7. Minor: No __enter__/__exit__ for Resource Cleanup
JdbcCatalog has a close() method but does not implement the context manager protocol. Adding __enter__/__exit__ would allow with CatalogFactory.create(opts) as catalog: ... usage and prevent connection leaks.
8. Minor: _insert_database_properties Issues One INSERT per Property
Each property key/value triggers a separate execute() call (and therefore a separate commit). For databases with many properties, this is both slow and non-atomic. Consider using executemany() or batching inserts.
Positive Notes
- Parameterized queries throughout — no SQL injection risk.
- Clean separation of connection logic in
_DbApiConnection. - Tests cover the full lifecycle (create, list, get, alter, rename, drop) for both databases and tables.
- The
catalog-keyoption for multi-tenant isolation is a good design choice matching Java Paimon. - Documentation is clear and explains the native Python driver approach well.
Nice work overall. The atomicity issue (point 1) is the most critical to address before merge.
747e2a1 to
4f19eda
Compare
|
Thanks for the detailed review. I pushed a follow-up commit to address the JDBC catalog comments:
I also checked the current CI failures and they are unrelated to this JDBC catalog change. I double-checked the PR diff: this PR only changes the JDBC catalog implementation/docs/tests and does not touch the GCS, Tantivy, or mixed e2e code paths. The failed checks are:
Local validation:
|
Purpose
Support JDBC catalog in PyPaimon. This adds a Python JDBC catalog implementation that uses the same catalog metadata tables as Java Paimon JDBC catalog:
paimon_tables,paimon_database_properties, andpaimon_table_properties.The implementation supports SQLite with the Python standard library and dynamically supports MySQL/PostgreSQL when a corresponding Python DB-API driver is installed. Table data and schema files continue to use existing PyPaimon
FileIOandSchemaManagerbehavior.What changed
metastore=jdbcinCatalogFactoryJdbcCatalogandJdbcCatalogLoadercatalog-keyandsync-all-propertiescatalog optionsTests
python3 -m py_compile pypaimon/catalog/jdbc_catalog.py pypaimon/catalog/jdbc_catalog_loader.py pypaimon/catalog/catalog_factory.py pypaimon/common/options/config.py pypaimon/tests/jdbc_catalog_test.pyPYTHONPATH=/tmp/paimon-python-test-deps POLARS_SKIP_CPU_CHECK=1 python3 -m unittest pypaimon.tests.jdbc_catalog_test pypaimon.tests.filesystem_catalog_test