[AIRFLOW-5033] Switched to snakebite-py3 #5659
It looks like kerberos is not working :( |
@potiuk snakebite only works with Python 2 |
It's a fork of Snakebite. It should be Python 3 compatible. |
@potiuk What is holding us back from dropping Snakebite and using the PyArrow interface to HDFS (https://arrow.apache.org/docs/python/filesystems.html)? |
@sdikby nothing :) we just need someone to make a PR. Happy to review it. |
With the community, we wanted to move to PyArrow. But this requires some more effort, see #3560. I'm fine with moving to |
@Fokko - I am not 100% sure if it will work (hence [DO NOT MERGE]). I could not really test it with kerberos though - so maybe someone who uses kerberos + hdfs could test it? Do you know who could test it maybe :) ? |
At ING they are using hdfs+kerberos, perhaps @bolkedebruin knows someone who could give this a spin. |
@bolkedebruin :)? Can you help with testing that? That would be a super-easy fix for the last remaining issue preventing a full switch to Python 3. |
@potiuk hi! We met at ApacheCon, and I am wondering if I can help with anything. What is the current status? From what I can see, snakebite-py3[kerberos] does not seem to be working. Maybe we could follow up with the maintainers of the package and see if we can fix it? Or is there something else to do? |
Yeah. Snakebite-py3 is not working indeed. I think the only way to go is to implement new Hadoop operators based on another library :( |
I opened internetarchive/snakebite-py3#5, and the new maintainers (Internet Archive) are willing to accept patches to make kerberos work. I'd give it a try, possibly replacing snakebite's current kerberos dependency with something like python-gssapi (or similar)? What do you (and others) think? |
Sounds great @elukey if you could try it! I think not having Python 3 Hadoop operators is actually a blocker for the 2.0 release, so the sooner we have them the better :). I also got in contact with people from Cloudera at ApacheCon - @gezapeti, PMC of Oozie - and maybe through Geza we can find people at Cloudera who might also be interested in making the Hadoop operator work for Airflow 2.0, and we could do it together. |
I would suggest moving to the pyarrow stuff: https://arrow.apache.org/docs/python/filesystems.html snakebite(-py3) isn't maintained anymore. |
I am a bit confused: pyarrow was already proposed earlier in this thread, but it ended up being too difficult and the work wasn't completed (this is my impression from reading, I might be wrong). The snakebite-py3 fork seems to be maintained by Internet Archive, so could it be a good interim solution before moving to something different? |
Well, there's barely anything going on: https://github.com/internetarchive/snakebite-py3/commits/master We can move to |
I agree, but I think that the aim of the new maintainers was just to port the existing code to py3 and then support it if needed. I opened an issue and they answered promptly, so I think it is a good sign! I am not advocating for using snakebite, only trying to find a solution to unblock the current status :) |
Tried to work on it during the last couple of days, and hit an issue with my environment: spotify/snakebite#153 (comment) snakebite seems to rely ok |
Maybe @bolkedebruin can shed some light here? I guess our only realistic option is indeed to switch to a different library - I think people rely on kerberos being available for secure clusters and we cannot provide Hadoop support without it. |
I'd love to know @bolkedebruin's opinion as well :) From my side, I can offer dev time and also a test environment, since I have a Kerberos cluster with RPC encryption that can be used for testing. |
Correcting myself: the missing feature in snakebite is encryption/decryption of the HDFS data transfer protocol, because the Hadoop RPC SASL code works fine (so basic ops with the HDFS Namenode should work). I found a bug in the new Python 3 code (a text vs binary string comparison that always evaluated to false) and now I am able to at least replicate the snakebite-py2 behavior. |
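
The class of bug described above can be illustrated with a minimal sketch. The function names and the `SUCCESS` token are hypothetical and are not taken from snakebite's code; the point is only that in Python 3 a `bytes` value never compares equal to a `str`, so a check ported unchanged from Python 2 silently becomes always-false.

```python
# Hypothetical illustration of a text-vs-binary comparison bug.
# None of these names come from snakebite; they only show the pattern.

def is_success_buggy(response: bytes) -> bool:
    # In Python 3, bytes == str is always False (there is no implicit
    # decoding), so this check can never succeed for data read off a socket.
    return response == "SUCCESS"


def is_success_fixed(response: bytes) -> bool:
    # Compare bytes with a bytes literal.
    return response == b"SUCCESS"


def is_success_decoded(response: bytes) -> bool:
    # Alternatively, decode first and compare text with text.
    return response.decode("utf-8") == "SUCCESS"


wire_value = b"SUCCESS"  # what a socket read would hand back
print(is_success_buggy(wire_value))
print(is_success_fixed(wire_value))
print(is_success_decoded(wire_value))
```

Running the interpreter with the `-b` flag makes Python emit a `BytesWarning` for such comparisons, which can help locate them when porting Python 2 code.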
Good news: I was able to swap krbV with gssapi, so in theory snakebite-py3 should now be able to work with Kerberos (after/if my PR is merged), no more py2 only dependencies. The only remaining thing to do is adding support for HDFS clusters with RPC encryption, that was not supported by snakebite previously (so a new feature). Feedback/testing welcome! internetarchive/snakebite-py3#6 |
Any updates here? |
@wittfabian No, this PR was closed by stale-bot and I think no one can reopen it. Feel free to continue the work by creating a new PR if you want to and have further ideas. |
I will take a look shortly while I am looking at removing conflicting requirements. The current state of snakebite is that it has been fixed by @elukey and it should work for 2.0, with the exception of HDFS RPC encryption. Just switching to the latest released snakebite-py3 should work for all other use cases. |
Ohh, I remember we could not reopen PRs closed by stale-bot, am I wrong? |
BTW, that's good news for AIP-3 |
Is it still planned to switch to PyArrow in the future? We only encounter problems with snakebite in our projects. |
I think we should switch, but let's hear Jarek's opinion |
Yeah. If someone (@wittfabian ?) would like to switch to PyArrow I am more than happy :). |
We even had a JIRA issue for it: https://issues.apache.org/jira/browse/AIRFLOW-2697 |
The last time I looked into PyArrow, I had the problem that using libhdfs required Hadoop configuration (environment variables). Our use case is that Hadoop does not run on the same host. https://github.com/apache/arrow/blob/master/python/pyarrow/hdfs.py#L38 |
@wittfabian I am curious about what problems you have encountered with snakebite (to understand if they are big ones or small ones that we can solve). It is true that I didn't have the time to add support for RPC encryption, but the task seems to be big and needs time/energy (any help is welcome). More info on the current blockers is listed in internetarchive/snakebite-py3#8. Switching to PyArrow was on the table even when I started, but it was abandoned for various reasons (which might not hold anymore). The main problem of snakebite is that it doesn't rely on Java libs (as pyarrow does IIRC) and Hadoop still sadly uses old ciphers for SASL. Anyway, I am happy if Airflow proceeds in any direction, and even happier to collaborate with other people on snakebite-py3. Adding RPC encryption support by myself is not doable in a short amount of time. |
Just an FYI: pyarrow is a nightmare to try and install on Alpine Linux. I don't think this affects us, as it's an optional dep, and we provide slim Debian images. |
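
As a rough sketch of the concern above: PyArrow's libhdfs-based driver loads the Hadoop Java client through JNI, so the machine running the client needs a local Hadoop client installation and configuration even when the cluster itself is remote. The helper below is hypothetical (it is not part of PyArrow or Airflow) and only checks for environment variables that the PyArrow documentation describes for libhdfs. The exact list (`HADOOP_HOME`, `JAVA_HOME`, `CLASSPATH`) is an assumption and may differ between PyArrow versions.

```python
import os
from typing import List, Mapping, Optional

# Assumption: variables the PyArrow docs describe for the libhdfs driver.
# ARROW_LIBHDFS_DIR is documented as optional, so it is not checked here.
REQUIRED_VARS = ("HADOOP_HOME", "JAVA_HOME", "CLASSPATH")


def missing_libhdfs_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty.

    Hypothetical pre-flight check; it does not import or call PyArrow.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_libhdfs_env()
if missing:
    print("libhdfs is unlikely to load; missing:", ", ".join(missing))
else:
    print("Hadoop client environment looks configured")
```

A check like this could run before attempting an HDFS connection, so that a worker without a local Hadoop client fails with a readable message instead of a JNI loading error.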
@zhongjiajie as committers we can reopen stalebot closed PRs, and we can also add the "pinned" label to prevent stalebot closing them in the first place. |
I had problems with pyarrow dependencies too. Boost, a popular C++ library, most often caused a lot of those issues. The Arrow 0.16 release in January made Boost optional.
So it might be a good idea to try latest pyarrow now. On a separate note, I sent a comment above
That should address the issue described by @elukey in this comment |
@elukey I'm not 100 percent sure, but we had a lot of problems with the installation. Maybe some problems have been solved in newer versions. I looked at some details again yesterday, and it is possible that some of the problems also come from kerberos. I'm happy with any solution that runs on Python 3. |
Thanks @ashb for the clarification |
@Tagar I tried today sasl3 but I get to the same error that I had with sasl, namely the following error message on the Namenode when trying to send a getFileInfo RPC call:
I recall that at the time I solved the problem simply switching to pure-sasl, I'll try to follow up and see if there is something new that I can debug with fresh eyes. Note: the above error happens after a successful negotiation/connection of SASL + encryption (wrap), not at the very start. |
Good news: I got unblocked, I was able to make snakebite working with The next step is to implement the missing bits in snakebite, hoping that I'll not find any pitfall like this one. Fingers crossed! |
@elukey Can you create a PR, would be great to get it into Airflow :) |
I am currently trying to solve the last missing bit (namely encrypted RPCs between client and Datanodes); I'll file a PR to the main snakebite-py3 repo as soon as I have something working and tested. Just as an FYI, the current version of snakebite-py3 supports py3, non-encrypted RPCs, and a subset of encrypted RPCs (the ones between client and Namenode). The work that I am doing now is to get complete support, but 75% of the work is already upstreamed and working :) |
@potiuk @Fokko due to the complexity of adding encryption support to snakebite (at least for me), I would try pyarrow again in parallel (given what is written above) and see if it suits Airflow's needs. I am doing some hacking now and I am learning some things, but it would be better to explore other paths as well. What do you think? |
All for it. I already installed pyarrow a few days ago as it was needed by the Apache Beam provider! |
Closing this one as we already have snakebite-py3 in Airflow master. I think we might need a separate issue for the PyArrow implementation. |
To keep the archives happy: I tried hard to make snakebite compatible with RPC encryption, but found a lot of roadblocks, all explained in internetarchive/snakebite-py3#8 (comment). For the moment I am inclined to stop working on this and let Airflow/others experiment with pyarrow; it seems a more promising strategy to follow :) |
NOTE TO REVIEWER: It depends on #7278 - so please check only the last commit.
Make sure you have checked all steps below.
Jira
Description
Snakebite is not PY3-compatible. Trying to switch to an allegedly drop-in replacement that is py3-compatible.
Tests
Commits
Documentation
Code Quality
flake8