[python] Add HDFS native FileIO backend (no Hadoop install required) #8031

Merged
JingsongLi merged 3 commits into apache:master from TheR1sing3un:feat_python_hdfs_native on Jun 3, 2026

@TheR1sing3un
Member

Introduces HdfsNativeFileIO backed by the hdfs-native protocol client
(Rust + PyO3). Removes the runtime need for HADOOP_HOME / JDK / libhdfs
on the client side. viewfs mount tables and HA NameNode lists can come
from local xml (HADOOP_CONF_DIR / hdfs.conf-dir option) or directly
from catalog options delivered by a REST catalog (keys with prefixes
dfs./fs./hadoop./ipc./io. are forwarded as-is).
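
The prefix-based forwarding described above could look roughly like the sketch below. Only the five prefixes come from the description; the function name `forwardable_hadoop_options` and the sample option keys are hypothetical and are not taken from the PR.

```python
# Prefixes named in the PR description; everything else here is illustrative.
FORWARDED_PREFIXES = ("dfs.", "fs.", "hadoop.", "ipc.", "io.")


def forwardable_hadoop_options(catalog_options):
    """Return only the catalog options that look like Hadoop client settings.

    Hypothetical helper: keys starting with one of the documented prefixes
    are passed through unchanged, all other keys are dropped.
    """
    return {
        key: value
        for key, value in catalog_options.items()
        if key.startswith(FORWARDED_PREFIXES)
    }


if __name__ == "__main__":
    sample = {
        "dfs.nameservices": "ns1",
        "fs.defaultFS": "hdfs://ns1",
        "warehouse": "hdfs://ns1/warehouse",  # not a Hadoop key, so not forwarded
    }
    print(forwardable_hadoop_options(sample))
```

`str.startswith` accepts a tuple, so a key such as `iox.other` is not matched by the `io.` prefix.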

Default backend for hdfs:// and viewfs:// switches to native; the
PyArrow / libhdfs path is kept, with auto-fallback when hdfs-native
is unavailable (e.g. on Windows or when the extra is not installed).
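
One way to picture the auto-fallback is a small selection function like the one below. This is a sketch under assumptions, not the PR's actual code: `pick_hdfs_backend` and its return values are hypothetical, and the real implementation may instead catch `ImportError` at import time. The only fact relied on is that the hdfs-native package is imported as `hdfs_native`.

```python
import importlib.util
import platform


def pick_hdfs_backend(module_name="hdfs_native", system=None):
    """Choose "native" when the client module is importable, else "pyarrow".

    Hypothetical helper mirroring the two fallback cases named in the
    description: running on Windows, or the optional extra not installed.
    """
    system = system or platform.system()
    if system == "Windows":
        return "pyarrow"
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        return "pyarrow"
    return "native"


if __name__ == "__main__":
    print(pick_hdfs_backend())
```

Probing with `find_spec` avoids importing the extension module just to decide which backend to use.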

Adds: HdfsNativeFileIO, HdfsOptions, _kerberos helpers, unit tests,
Docker-based e2e scaffold, native vs pyarrow benchmark, README section.

RAT check failed on docker-compose.yml and the e2e README. Add the
standard Apache 2.0 header to both.

@leaves12138
Contributor

Thanks for the contribution. This PR is still in draft state, so I am leaving it unapproved for now. Please request review again once it is ready for a full review.

@TheR1sing3un TheR1sing3un marked this pull request as ready for review June 1, 2026 08:59
@TheR1sing3un
Member Author

> Thanks for the contribution. This PR is still in draft state, so I am leaving it unapproved for now. Please request review again once it is ready for a full review.

Thanks, ready for review now!

Comment thread paimon-python/setup.py Outdated
'datafusion>=52; python_version>="3.10"',
],
'hdfs': [
'hdfs-native>=0.13,<1; platform_system!="Windows"',
Contributor

Could we add a Python version marker here? pypaimon still declares python_requires=">=3.6" and the CI matrix covers older Python versions, but the published hdfs-native>=0.13 packages declare Requires-Python: >=3.10. With the current extra, pip install "pypaimon[hdfs]" on Python 3.6-3.9 will try to resolve this dependency and fail instead of installing pypaimon and letting users keep using the legacy pyarrow backend. Something like hdfs-native>=0.13,<1; python_version >= "3.10" and platform_system != "Windows" (plus a doc note for older Python) would avoid breaking those environments.
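
As a sketch of what the suggested gating amounts to, the snippet below pairs the marker string from this comment with a plain-Python check of the same two conditions. `EXTRAS_REQUIRE` stands in for the `extras_require` argument in `setup.py`, and `hdfs_extra_applies` is a hypothetical helper that is not part of pypaimon.

```python
import platform
import sys

# Stand-in for the extras_require entry in setup.py, using the marker
# proposed in the review comment.
EXTRAS_REQUIRE = {
    "hdfs": [
        'hdfs-native>=0.13,<1; python_version >= "3.10" and platform_system != "Windows"',
    ],
}


def hdfs_extra_applies(version_info=None, system=None):
    """Mirror the marker by hand: Python 3.10+ and not Windows.

    Hypothetical helper for illustration; pip evaluates the marker itself.
    """
    major_minor = tuple(version_info or sys.version_info)[:2]
    system = system or platform.system()
    return major_minor >= (3, 10) and system != "Windows"


if __name__ == "__main__":
    print(hdfs_extra_applies())
```

When a marker evaluates to false, pip skips that requirement, so installing the extra on an older interpreter would leave pypaimon installed with only the legacy backend available.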

Member Author

nice suggestion! done!

hadoop_xml = self._load_hadoop_xml(config_dir)

config = self._build_config_dict()
self._maybe_inject_viewfs_fallback(scheme, netloc, config, hadoop_xml)
Contributor

This fallback only looks at hadoop_xml, so it misses the REST/catalog-options path that the PR documents. For a zero-file viewfs config like fs.viewfs.mounttable.cluster.link./prod=hdfs://ns1/prod and dfs.nameservices=ns1 passed in catalog options, _build_config_dict() puts those values in config, but _maybe_inject_viewfs_fallback() searches only hadoop_xml for existing links/nameservices. Because the comment says hdfs-native rejects viewfs without linkFallback, the documented catalog-only viewfs setup can still fail during Client(**client_kwargs). Could we derive the fallback from the merged config ({**hadoop_xml, **config}) while still writing the injected key into config?
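
A minimal sketch of the merged-view idea follows. It is not the PR's `_maybe_inject_viewfs_fallback`; the function name `inject_viewfs_fallback`, its signature, and the rule for choosing the fallback target (the root of the first mount link found) are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def inject_viewfs_fallback(mount_table, config, hadoop_xml):
    """Add a linkFallback entry to `config` if neither source defines one.

    Hypothetical sketch: links are looked up in the merged view
    ({**hadoop_xml, **config}), but the injected key is written to `config`.
    Returns the injected fallback URI, or None if nothing was injected.
    """
    merged = {**hadoop_xml, **config}  # catalog options win over xml values
    prefix = f"fs.viewfs.mounttable.{mount_table}."
    fallback_key = prefix + "linkFallback"
    if fallback_key in merged:
        return None  # already configured in xml or catalog options

    link_prefix = prefix + "link."
    for key in sorted(merged):
        if key.startswith(link_prefix):
            target = urlsplit(merged[key])
            if target.scheme and target.netloc:
                # Assumed policy: fall back to the root of the link's cluster.
                fallback = f"{target.scheme}://{target.netloc}/"
                config[fallback_key] = fallback
                return fallback
    return None


if __name__ == "__main__":
    catalog_only = {
        "fs.viewfs.mounttable.cluster.link./prod": "hdfs://ns1/prod",
        "dfs.nameservices": "ns1",
    }
    print(inject_viewfs_fallback("cluster", catalog_only, hadoop_xml={}))
```

With an empty `hadoop_xml`, the link supplied through catalog options is still found because the search runs over the merged dictionary.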

Member Author

nice suggestion! done!

self.delete_quietly(path)
raise RuntimeError(f"Failed to write blob file {path}: {e}") from e

def close(self):
Contributor

Since FileIO.get() now makes this class the default for hdfs:// and viewfs://, this needs to keep the same write surface as the previous PyArrowFileIO backend. Right now HdfsNativeFileIO implements parquet/orc/avro/blob but not write_lance() or write_vortex(), so any HDFS table configured with file.format=lance or file.format=vortex will start hitting the abstract FileIO.write_lance/write_vortex NotImplementedError after this change. The existing writers call these methods directly for those formats. Could you either implement/delegate these two methods here (the fsspec facade may be enough) or avoid routing those formats to native by default?
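
The gap described here can be illustrated with a toy class hierarchy and a check for write methods that a backend still inherits from the base class. All classes below are simplified stand-ins, not the real pypaimon `FileIO` or `HdfsNativeFileIO`, and `missing_write_methods` is a hypothetical helper.

```python
class FileIO:
    """Toy base class whose write methods raise until a backend overrides them."""

    def write_parquet(self, path, data):
        raise NotImplementedError

    def write_lance(self, path, data):
        raise NotImplementedError

    def write_vortex(self, path, data):
        raise NotImplementedError


class NativeBackendBefore(FileIO):
    """Stand-in for a backend that overrides only some formats."""

    def write_parquet(self, path, data):
        return f"parquet:{path}"


class NativeBackendAfter(NativeBackendBefore):
    """Stand-in for a backend that also covers lance and vortex."""

    def write_lance(self, path, data):
        return f"lance:{path}"

    def write_vortex(self, path, data):
        return f"vortex:{path}"


WRITE_METHODS = ("write_parquet", "write_lance", "write_vortex")


def missing_write_methods(backend_cls, base_cls=FileIO, names=WRITE_METHODS):
    """List write methods the backend still inherits unchanged from the base."""
    return [n for n in names if getattr(backend_cls, n) is getattr(base_cls, n)]


if __name__ == "__main__":
    print(missing_write_methods(NativeBackendBefore))
    print(missing_write_methods(NativeBackendAfter))
```

A check of this kind could run in a unit test so that a newly added default backend cannot silently narrow the write surface.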

Member Author

nice suggestion! done!

- Gate the hdfs extra on Python 3.10+ so installs on 3.6-3.9 no longer
  fail resolving hdfs-native; document the requirement in the README.
- Derive the viewfs linkFallback from the merged xml + catalog-options
  view so a config-only viewfs setup gets a fallback too.
- Implement write_lance/write_vortex on the native backend so HDFS tables
  using those formats no longer hit NotImplementedError.
@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 2, 2026 09:24
Contributor

@JingsongLi JingsongLi left a comment

Re-reviewed the latest head 96bcecb. The three issues from my previous pass look addressed: the hdfs-native extra is now gated to Python 3.10+, viewfs fallback is derived from merged XML/catalog options, and the native backend now overrides lance/vortex writes. I also ran the targeted HDFS native and Kerberos/FileIO tests locally; no new blocking issues found from my side.

@TheR1sing3un
Member Author

> Re-reviewed the latest head 96bcecb. The three issues from my previous pass look addressed: the hdfs-native extra is now gated to Python 3.10+, viewfs fallback is derived from merged XML/catalog options, and the native backend now overrides lance/vortex writes. I also ran the targeted HDFS native and Kerberos/FileIO tests locally; no new blocking issues found from my side.

Thanks! Are there any remaining issues before landing it?

@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 3, 2026 02:24
@JingsongLi JingsongLi merged commit b839753 into apache:master Jun 3, 2026
6 checks passed