[python] Add HDFS native FileIO backend (no Hadoop install required) #8031
Introduces HdfsNativeFileIO backed by the hdfs-native protocol client (Rust + PyO3). Removes the runtime need for HADOOP_HOME / JDK / libhdfs on the client side. viewfs mount tables and HA NameNode lists can come from local xml (HADOOP_CONF_DIR / hdfs.conf-dir option) or directly from catalog options delivered by a REST catalog (keys with prefixes dfs./fs./hadoop./ipc./io. are forwarded as-is). Default backend for hdfs:// and viewfs:// switches to native; the PyArrow / libhdfs path is kept, with auto-fallback when hdfs-native is unavailable (e.g. on Windows or when the extra is not installed). Adds: HdfsNativeFileIO, HdfsOptions, _kerberos helpers, unit tests, Docker-based e2e scaffold, native vs pyarrow benchmark, README section.
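The prefix-based forwarding of catalog options described above can be illustrated with a minimal sketch. The function name `forward_hadoop_options` and the sample option values are hypothetical and only for illustration; the actual implementation in this PR may differ.

```python
# Prefixes that the PR description says are forwarded as-is to the HDFS client.
HADOOP_PREFIXES = ("dfs.", "fs.", "hadoop.", "ipc.", "io.")


def forward_hadoop_options(catalog_options):
    """Keep only the options whose keys start with a Hadoop-style prefix.

    Hypothetical helper: the name and shape are assumptions, not the PR's API.
    """
    return {
        key: value
        for key, value in catalog_options.items()
        if key.startswith(HADOOP_PREFIXES)  # str.startswith accepts a tuple
    }


# Sample options as a REST catalog might deliver them (values are made up).
options = {
    "warehouse": "hdfs://ns1/warehouse",
    "dfs.nameservices": "ns1",
    "fs.viewfs.mounttable.cluster.link./prod": "hdfs://ns1/prod",
}
forwarded = forward_hadoop_options(options)
```

Here `warehouse` is dropped because it carries none of the listed prefixes, while the two Hadoop keys pass through unchanged.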
RAT check failed on docker-compose.yml and the e2e README. Add the standard Apache 2.0 header to both.
Thanks for the contribution. This PR is still in draft state, so I am leaving it unapproved for now. Please request review again once it is ready for a full review.
Thanks, ready for review now!
        'datafusion>=52; python_version>="3.10"',
    ],
    'hdfs': [
        'hdfs-native>=0.13,<1; platform_system!="Windows"',
Could we add a Python version marker here? pypaimon still declares python_requires=">=3.6" and the CI matrix covers older Python versions, but the published hdfs-native>=0.13 packages declare Requires-Python: >=3.10. With the current extra, pip install "pypaimon[hdfs]" on Python 3.6-3.9 will try to resolve this dependency and fail instead of installing pypaimon and letting users keep using the legacy pyarrow backend. Something like hdfs-native>=0.13,<1; python_version >= "3.10" and platform_system != "Windows" (plus a doc note for older Python) would avoid breaking those environments.
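For illustration, the combined marker suggested above could look like the sketch below. The `extras_require` entry uses the marker string from this comment; the helper `hdfs_native_supported` is a hypothetical runtime mirror of the same condition and is not part of the PR.

```python
import platform
import sys

# Extra gated on both Python version and platform, as suggested in the review.
extras_require = {
    "hdfs": [
        'hdfs-native>=0.13,<1; python_version >= "3.10" and platform_system != "Windows"',
    ],
}


def hdfs_native_supported(version_info=sys.version_info, system=None):
    """Hypothetical runtime check mirroring the install-time marker."""
    if system is None:
        system = platform.system()
    return tuple(version_info[:2]) >= (3, 10) and system != "Windows"
```

With this marker, pip skips the dependency on environments where the marker evaluates to false, so `pip install "pypaimon[hdfs]"` still installs pypaimon itself there.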
nice suggestion! done!
hadoop_xml = self._load_hadoop_xml(config_dir)

config = self._build_config_dict()
self._maybe_inject_viewfs_fallback(scheme, netloc, config, hadoop_xml)
This fallback only looks at hadoop_xml, so it misses the REST/catalog-options path that the PR documents. For a zero-file viewfs config like fs.viewfs.mounttable.cluster.link./prod=hdfs://ns1/prod and dfs.nameservices=ns1 passed in catalog options, _build_config_dict() puts those values in config, but _maybe_inject_viewfs_fallback() searches only hadoop_xml for existing links/nameservices. Because the comment says hdfs-native rejects viewfs without linkFallback, the documented catalog-only viewfs setup can still fail during Client(**client_kwargs). Could we derive the fallback from the merged config ({**hadoop_xml, **config}) while still writing the injected key into config?
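A rough sketch of the suggested approach follows: look up links in the merged view, but write the injected key only into `config`. The function name `inject_viewfs_fallback` and the choice of fallback target (the root of the first link's target filesystem) are assumptions for illustration; the PR's actual derivation logic may differ.

```python
from urllib.parse import urlparse


def inject_viewfs_fallback(mount_table, config, hadoop_xml):
    """Add a linkFallback for `mount_table` to `config` if none is defined.

    Hypothetical sketch: picks the root of the first mount link's target.
    """
    # Catalog options take precedence over values loaded from XML.
    merged = {**hadoop_xml, **config}
    prefix = f"fs.viewfs.mounttable.{mount_table}."
    fallback_key = prefix + "linkFallback"
    if fallback_key in merged:
        return  # a fallback is already configured somewhere
    link_prefix = prefix + "link."
    for key in sorted(merged):
        if key.startswith(link_prefix):
            target = urlparse(merged[key])
            # Only `config` is mutated; `hadoop_xml` stays untouched.
            config[fallback_key] = f"{target.scheme}://{target.netloc}/"
            return


# Catalog-only setup from the comment above: no XML files at all.
config = {
    "fs.viewfs.mounttable.cluster.link./prod": "hdfs://ns1/prod",
    "dfs.nameservices": "ns1",
}
hadoop_xml = {}
inject_viewfs_fallback("cluster", config, hadoop_xml)
```

Searching only `hadoop_xml` would find no links in this setup, which is the failure mode the comment describes.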
nice suggestion! done!
    self.delete_quietly(path)
    raise RuntimeError(f"Failed to write blob file {path}: {e}") from e

def close(self):
Since FileIO.get() now makes this class the default for hdfs:// and viewfs://, this needs to keep the same write surface as the previous PyArrowFileIO backend. Right now HdfsNativeFileIO implements parquet/orc/avro/blob but not write_lance() or write_vortex(), so any HDFS table configured with file.format=lance or file.format=vortex will start hitting the abstract FileIO.write_lance/write_vortex NotImplementedError after this change. The existing writers call these methods directly for those formats. Could you either implement/delegate these two methods here (the fsspec facade may be enough) or avoid routing those formats to native by default?
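One way to guard against this kind of gap is a small check that lists the write methods a backend has not overridden. The classes `BaseFileIO` and `PartialNativeFileIO` and the helper `missing_overrides` below are hypothetical stand-ins, not the real pypaimon classes.

```python
class BaseFileIO:
    """Stand-in for an abstract FileIO whose writers raise by default."""

    def write_parquet(self, path, data):
        raise NotImplementedError

    def write_lance(self, path, data):
        raise NotImplementedError

    def write_vortex(self, path, data):
        raise NotImplementedError


class PartialNativeFileIO(BaseFileIO):
    """Stand-in backend that overrides only one of the writers."""

    def write_parquet(self, path, data):
        return f"parquet:{path}"


def missing_overrides(cls, base, names):
    """Return the method names that `cls` still inherits unchanged from `base`."""
    return [name for name in names if getattr(cls, name) is getattr(base, name)]


WRITE_METHODS = ["write_parquet", "write_lance", "write_vortex"]
missing = missing_overrides(PartialNativeFileIO, BaseFileIO, WRITE_METHODS)
```

A unit test asserting that this list is empty for the default backend would catch a missing `write_lance` or `write_vortex` before it surfaces at write time.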
nice suggestion! done!
- Gate the hdfs extra on Python 3.10+ so installs on 3.6-3.9 no longer fail resolving hdfs-native; document the requirement in the README.
- Derive the viewfs linkFallback from the merged xml + catalog-options view so a config-only viewfs setup gets a fallback too.
- Implement write_lance/write_vortex on the native backend so HDFS tables using those formats no longer hit NotImplementedError.
JingsongLi left a comment
Re-reviewed the latest head 96bcecb. The three issues from my previous pass look addressed: the hdfs-native extra is now gated to Python 3.10+, viewfs fallback is derived from merged XML/catalog options, and the native backend now overrides lance/vortex writes. I also ran the targeted HDFS native and Kerberos/FileIO tests locally; no new blocking issues found from my side.
Thanks! Any issue before landing it?