Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-5033] Switched to snakebite-py3 #5659

Closed
wants to merge 1 commit into from

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Jul 25, 2019

NOTE TO REVIEWER: It depends on #7278 - so please check only the last commit.

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Snakebite is not PY3-compatible. Trying to switch to alledgedly drop-in replacement which is py3-compatible.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@potiuk potiuk force-pushed the snakebite-py3 branch 2 times, most recently from a73a637 to 17b4f4b Compare July 25, 2019 08:29
@potiuk
Copy link
Member Author

potiuk commented Jul 25, 2019

It looks like kerberos is not working :( pip install snakebite-py3[kerberos] tries to compile Python2 version of kerberos libs

@potiuk potiuk changed the title [AIRFLOW-5033] Switched to snakebite-py3 [AIRFLOW-5033] Switched to snakebite-py3 [DO NOT MERGE Jul 25, 2019
@potiuk potiuk changed the title [AIRFLOW-5033] Switched to snakebite-py3 [DO NOT MERGE [AIRFLOW-5033] Switched to snakebite-py3 [DO NOT MERGE] Jul 25, 2019
@zhongjiajie
Copy link
Member

@potiuk snakebite only work for python2

@mik-laj
Copy link
Member

mik-laj commented Jul 25, 2019

It's fork of Snakebite.it should be Python 3 compatible

@sdikby
Copy link

sdikby commented Jul 25, 2019

@potiuk What is holding against dropping Snakebite and using the PyArrow interface to HDFS (https://arrow.apache.org/docs/python/filesystems.html)

@potiuk
Copy link
Member Author

potiuk commented Jul 25, 2019

@sdikby nothing :) we just need someone to make PR. Happy to review it.

setup.py Outdated Show resolved Hide resolved
@Fokko
Copy link
Contributor

Fokko commented Jul 26, 2019

With the community, we wanted to move to PyArrow. But this requires some more effort, see #3560. I'm fine with moving to snakebite-py3 for now since it isn't working on Python 3.

@potiuk
Copy link
Member Author

potiuk commented Jul 26, 2019

@Fokko - I am not 100% sure if it will work (hence [DO NOT MERGEI could not really test it with kerberos though - so maybe someone who uses kerberos + hdfs could test it? Do you know who could test it maybe :) ?

@Fokko
Copy link
Contributor

Fokko commented Jul 27, 2019

At ING they are using hdfs+kerberos, perhaps @bolkedebruin knows someone who could give this a spin.

@potiuk
Copy link
Member Author

potiuk commented Jul 28, 2019

@bolkedebruin :)? Can you help with testing that? That would be a super-easy fix for last remaining issue preventing full switch to Python 3 .

@elukey
Copy link

elukey commented Oct 30, 2019

@potiuk hi! We met at the Apachecon, I am wondering if I can help with anything. What is the current status? From what I can see snakebyte-py3[kerberos] seems not working, maybe we could follow up with the maintainers of the package and see if we can fix? Or is there something else to do?

@potiuk
Copy link
Member Author

potiuk commented Oct 30, 2019

Yeah. Snakebite-py3 is not working indeed. I think the only way to go is to implement new Hadoop operators base don another library :(

@elukey
Copy link

elukey commented Oct 31, 2019

Yeah. Snakebite-py3 is not working indeed. I think the only way to go is to implement new Hadoop operators base don another library :(

I opened internetarchive/snakebite-py3#5, the new maintainers (internet archive) are willing to get patches to make kerberos work. I'd give it a try, possibly replacing the current snakebite's kerberos dependency with something like python-gssapi (or similar)? What do you (and others) think?

@potiuk
Copy link
Member Author

potiuk commented Oct 31, 2019

Sounds great @elukey if you could try it !

I think not having python 3 Hadoop operators is actually a blocker for 2.0 release, so the sooner we have them the better :). I also got in contact with people from Cloudera at the ApacheCon - @gezapeti PMC of Oozie - and maybe through Geza we can find people at Cloudera who can also be interested in making the Hadoop operator work for Airflow 2.0 and we could do it together.

@Fokko
Copy link
Contributor

Fokko commented Oct 31, 2019

I would suggest moving to the pyarrow stuff: https://arrow.apache.org/docs/python/filesystems.html

The snakebyte(-py3) isn't maintained anymore.

@elukey
Copy link

elukey commented Oct 31, 2019

I would suggest moving to the pyarrow stuff: https://arrow.apache.org/docs/python/filesystems.html

The snakebyte(-py3) isn't maintained anymore.

I am a bit confused, earlier in this thread it was already proposed pyarrow but then it ended up being too difficult and it wasn't completed (this is my impression from reading, might be wrong). The snakebyte-py3 fork seems maintained by internet archive, it could be a good interim solution before moving to something different?

@Fokko
Copy link
Contributor

Fokko commented Oct 31, 2019

Well, there's barely anything going on: https://github.com/internetarchive/snakebite-py3/commits/master

We can move to snakebyte-py3 to remove the blocker, but in the end, I would like to migrate to PyArrow.

@elukey
Copy link

elukey commented Oct 31, 2019

Well, there's barely anything going on: https://github.com/internetarchive/snakebite-py3/commits/master

We can move to snakebyte-py3 to remove the blocker, but in the end, I would like to migrate to PyArrow.

I agree, but I think that the aim of the new maintainer was just to port the existing code to py3 and then support if needed. I opened an issue and they promptly answer, so I think it is a good sign!

I am not advocating for using snakebite, only trying to find a solution to unblock the current status :)

@elukey
Copy link

elukey commented Nov 2, 2019

Tried to work on it during the last couple of days, and hit an issue with my environment:

spotify/snakebite#153 (comment)

snakebite seems to rely ok python-krbV (the python2 only dep) only to get the kerberos principal of the running user, something that in theory can be done with other libs. The main problem seems to be that snakebite doesn't support HDFS encrypted RPCs, and fixing it might be a lot of work. I thought that it was a common use case for kerberos-enabled clusters, but apparently not?

@potiuk
Copy link
Member Author

potiuk commented Nov 2, 2019

Maybe @bolkedebruin can shed some light here? I guess our only realistic option is indeed switch to a different library - I think people rely on kerberos being available for secure clusters and we cannot provide Hadoop support without it.

@elukey
Copy link

elukey commented Nov 2, 2019

I'd love to know @bolkedebruin's opinion as well :) From my side, I can offer dev time and also test environment, since I have a kerberos with rpc encrypted cluster to use as testing environment.

@elukey
Copy link

elukey commented Nov 3, 2019

Correcting myself: the missing feature in snakebite is encryption/decryption of the HDFS data transfer protocol, because the Hadoop RPC SASL code works fine (so basic ops with the HDFS Namenode should work). I found a bug in the new python 3 code (text vs binary string comparison ending up in always false) and now I am able to at least replicate the snakebite-py2 behavior.

@elukey
Copy link

elukey commented Nov 5, 2019

Good news: I was able to swap krbV with gssapi, so in theory snakebite-py3 should now be able to work with Kerberos (after/if my PR is merged), no more py2 only dependencies. The only remaining thing to do is adding support for HDFS clusters with RPC encryption, that was not supported by snakebite previously (so a new feature).

Feedback/testing welcome! internetarchive/snakebite-py3#6

@stale stale bot closed this Mar 26, 2020
@wittfabian
Copy link
Contributor

Any updates here?

@zhongjiajie
Copy link
Member

@wittfabian No, this PR was close by stale-bot and I think no one could open it again, feel free to continue the PR by create new one if you want to and have more further idea.

@potiuk
Copy link
Member Author

potiuk commented Apr 14, 2020

I will take a look shortly while I will be looking at removing conflicting requirements. The state for this snakebite is that it has been fixed by @elukey and it should work for 2.0 with the exception of HDFS-RPC encryption. But just switching to latest released snakebite-py3 should work for all other use cases.

@potiuk potiuk reopened this Apr 14, 2020
@stale stale bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Apr 14, 2020
@potiuk potiuk self-assigned this Apr 14, 2020
@zhongjiajie
Copy link
Member

Ohh, I remember we could not reopen PR closed by stale-bot, am I wrong?

@zhongjiajie
Copy link
Member

BTW, that a good new for AIP-3

@wittfabian
Copy link
Contributor

Is it still planned to switch to PyArrow in the future? We only encounter problems with snakebite in our projects.

@zhongjiajie
Copy link
Member

I think we should switch, but let hear Jarek opinion

@potiuk
Copy link
Member Author

potiuk commented Apr 15, 2020

Yeah. If somoeone (@wittfabian ? ) would like to switch to PyArrow I am more than happy :).

@potiuk
Copy link
Member Author

potiuk commented Apr 15, 2020

We even had a JIRA issue for it: https://issues.apache.org/jira/browse/AIRFLOW-2697

@wittfabian
Copy link
Contributor

The last time I looked in PyArrow I had the problem that using libhdfs required Hadoop configuration (enviroment variables). Our use case is that Hadoop does not run on the same host.
Maybe I just missed something.

https://github.com/apache/arrow/blob/master/python/pyarrow/hdfs.py#L38
https://github.com/apache/arrow/blob/master/python/pyarrow/hdfs.py#L125

@elukey
Copy link

elukey commented Apr 15, 2020

@wittfabian I am curious about what problems have you encountered with snakebite (to understand if they are big ones or small that we can solve). It is true that I didn't have the time to add support for RPC encryption, but the task seems to be big and needs time/energy (any help is needed). More info on the current blockers are listed in internetarchive/snakebite-py3#8

Switching to PyArrow was on the table even when I started, but it was abandoned for various reasons (might not hold anymore). The main problem of snakebite is that it doesn't rely on java libs (as pyarrow does IIRC) and Hadoop still sadly uses old ciphers for SASL.

Anyway, happy if Airflow proceeds in any direction, or even more happy to collaborate with other people on snakebite-py3. Adding RPC encryption support by myself is not doable in a short amount of time.

@ashb
Copy link
Member

ashb commented Apr 15, 2020

Just an FYI: pyarrow is a nightmare to try and install on alpline linux. I don't think this affects us, as it's an optional dep, and we provide slim Debian images.

@ashb
Copy link
Member

ashb commented Apr 15, 2020

@zhongjiajie as committers we can reopen stalebot closed PRs, and we can also add the "pinned" label to prevent stalebot closing them in the first place.

@Tagar
Copy link
Contributor

Tagar commented Apr 15, 2020

Just an FYI: pyarrow is a nightmare to try and install on alpline linux. I don't think this affects us, as it's an optional dep, and we provide slim Debian images.

I had issues with pyarrow dependency issues too. Boost, a popular C++ library was most often caused a lot of those issues. Arrow 0.16 release in Jan made Boost optional.
https://arrow.apache.org/blog/2020/02/12/0.16.0-release/

Notably, Boost is no longer required. We invested effort to vendor some small essential dependencies: Flatbuffers, double-conversion, and uriparser. Many optional features requiring external libraries, like compression and GLog integration, are now disabled by default. Several subcomponents of the C++ project like the filesystem API, CSV, compute, dataset and JSON layers, as well as command-line utilities, are now disabled by default

So it might be a good idea to try latest pyarrow now.

On a separate note, I sent a comment above

I've published sasl3 package on pypi
https://pypi.org/project/sasl3/
https://github.com/sparkur/python-sasl3

Please give that a try instead if sasl

That should address the issue described by @elukey in this comment
#5659 (comment)

@wittfabian
Copy link
Contributor

wittfabian commented Apr 16, 2020

@elukey I'm not 100 percent sure, but we had a lot of problems with the installation. Maybe some problems have been solved with new versions. I looked at some details again yesterday, it is possible that some of the problems also come from kerberos.

I`m happy with any solution that runs on Python 3.

@zhongjiajie
Copy link
Member

@zhongjiajie as committers we can reopen stalebot closed PRs, and we can also add the "pinned" label to prevent stalebot closing them in the first place.

Thanks @ashb for the clarification

@elukey
Copy link

elukey commented Apr 24, 2020

On a separate note, I sent a comment above

I've published sasl3 package on pypi
https://pypi.org/project/sasl3/
https://github.com/sparkur/python-sasl3
Please give that a try instead if sasl

That should address the issue described by @elukey in this comment
#5659 (comment)

@Tagar I tried today sasl3 but I get to the same error that I had with sasl, namely the following error message on the Namenode when trying to send a getFileInfo RPC call:

javax.security.sasl.SaslException: Problems unwrapping SASL buffer [Caused by GSSException: Defective token detected (Mechanism level: Kerberos GSS-API Mechanism Token:Defective Token ID!)]

I recall that at the time I solved the problem simply switching to pure-sasl, I'll try to follow up and see if there is something new that I can debug with fresh eyes.

Note: the above error happens after a successful negotiation/connection of SASL + encryption (wrap), not at the very start.

@elukey
Copy link

elukey commented May 8, 2020

Good news: I got unblocked, I was able to make snakebite working with sasl3 after a lot of time spent between swearing and crying :D, details in internetarchive/snakebite-py3#8 (comment)

The next step is to implement the missing bits in snakebite, hoping that I'll not find any pitfall like this one. Fingers crossed!

@Fokko
Copy link
Contributor

Fokko commented May 9, 2020

@elukey Can you create a PR, would be great to get it into Airflow :)

@elukey
Copy link

elukey commented May 10, 2020

@elukey Can you create a PR, would be great to get it into Airflow :)

I am currently trying to solve the last missing bit (namely encrypted RPCs between client and Datanodes, I'll file a PR to the main snakebite-py3 repo as soon as I'll have something working and tested. Just as FYI, the current version of snakebite-py3 supports py3, non-encrypted-rpcs and a subset of encrypted RPCs (the ones between client and Namenode). The work that I am doing now is to have a complete support, but 75% of the work is already upstreamed and working :)

@elukey
Copy link

elukey commented May 12, 2020

Just an FYI: pyarrow is a nightmare to try and install on alpline linux. I don't think this affects us, as it's an optional dep, and we provide slim Debian images.

I had issues with pyarrow dependency issues too. Boost, a popular C++ library was most often caused a lot of those issues. Arrow 0.16 release in Jan made Boost optional.
https://arrow.apache.org/blog/2020/02/12/0.16.0-release/

Notably, Boost is no longer required. We invested effort to vendor some small essential dependencies: Flatbuffers, double-conversion, and uriparser. Many optional features requiring external libraries, like compression and GLog integration, are now disabled by default. Several subcomponents of the C++ project like the filesystem API, CSV, compute, dataset and JSON layers, as well as command-line utilities, are now disabled by default

So it might be a good idea to try latest pyarrow now.

@potiuk @Fokko due to the complexity of adding encryption support to snakebite (at least for me), I would in parallel try again pyarrow (given what is written above) and see if it suits the Airflow needs. I am doing some hacking now and I am learning some things, but it would be better to explore other paths as well. What do you think?

@potiuk
Copy link
Member Author

potiuk commented May 12, 2020

All for it. I already installed pyarrow few days ago as it was needed by apache beam provider!

@potiuk
Copy link
Member Author

potiuk commented May 26, 2020

Closing this one as we already have snakebite-py3 in master airflow. I think we might need separate issue for PyArrow implementation.

@potiuk potiuk closed this May 26, 2020
@elukey
Copy link

elukey commented Jun 30, 2020

To keep archives happy: I tried hard to make snakebite compatible with RPC encryption, but found a lot of road blockers, all explained in internetarchive/snakebite-py3#8 (comment). For the moment I am inclined to stop working on this and let Airflow/others to experiment with pyarrow, it seems a more promising strategy to follow :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants