[ENH] Pyjanitor for PySpark #504
Comments
Thanks for pinging in, @zjpoh! I'm more of a Dask user, and even then I don't use the Dask DataFrame API much, as my datasets are generally small (compared to the scale that parallelization engines are generally good for). So I don't have much experience with PySpark, and I'd have to defer to someone else on how to extend pyspark DataFrames with custom method-chainable functions. If you're up for it, I'd love to see what's possible!
Sure. I'll explore how this could be done.
I'd be interested in helping out with this, @zjpoh - we use Spark DataFrames all the time at work.
It works by copying code from pandas and pandas_flavor and making a single adjustment to an import. Here is the code to create the accessor (we should be able to further simplify this).
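A minimal sketch of the pattern described, assuming pandas_flavor's `register_dataframe_method` rewritten to attach methods to `pyspark.sql.DataFrame` (this is a reconstruction for illustration, not the snippet originally posted):

```python
# Sketch only: pandas_flavor's register_dataframe_method, re-targeted at
# pyspark.sql.DataFrame instead of pandas.DataFrame (the "single adjustment"
# described above).
from functools import wraps

from pyspark.sql import DataFrame


def register_dataframe_method(method):
    """Register a function as a chainable method on pyspark DataFrames."""

    def inner(*args, **kwargs):
        class AccessorMethod:
            def __init__(self, pyspark_obj):
                self._obj = pyspark_obj

            @wraps(method)
            def __call__(self, *args, **kwargs):
                return method(self._obj, *args, **kwargs)

        # pandas_flavor routes through pandas' accessor machinery here;
        # a plain property is the simplest pyspark analogue.
        setattr(DataFrame, method.__name__, property(AccessorMethod))
        return method

    return inner()
```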
Then running a short test confirms that the method is attached and chainable.
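A hypothetical version of such a test (the `clean_names` body and the sample columns are invented for illustration; the expected output appears in the final comment):

```python
# Requires a running Spark session and the register_dataframe_method
# sketch above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


@register_dataframe_method
def clean_names(df):
    """Lower-case column names and replace spaces with underscores."""
    for name in df.columns:
        df = df.withColumnRenamed(name, name.lower().replace(" ", "_"))
    return df


df = spark.createDataFrame([(1, 2)], ["Raw Name", "Other Col"])
print(df.clean_names().columns)  # expected: ['raw_name', 'other_col']
```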
@ericmjl Any thoughts on the best place to add this? I'm more inclined to add this as a separate module.
@zjpoh definitely this would need some thought. As the package is currently architected, we have everything depending on pandas as a backend by default. If we were to structure this in a more logical fashion, there would be the functions, which wouldn't be decorated, and then there would be individual backends in which the functions are decorated/wrapped and attached to individual dataframe implementations. We can maybe get this kickstarted by setting up a backends submodule.

Admittedly, the organization of the package has to be rethought here in order to support multiple dataframe implementations. What are your thoughts here?
I wonder about the backend approach, since I'll sometimes create Pandas DFs from Spark ones (and what if I want to call Pandas janitor functions alongside Spark ones?). I'd personally vote for @zjpoh's approach.
Let's work backwards from the front-facing API, keeping in mind what adding support for pyspark will involve.

Here are my reasons for proposing a backends submodule.

The first reason is preventing code duplication. Under this structure, functions stay where they are:

```python
# functions.py
import pandas_flavor as pf

@pf.register_dataframe_method
def clean_names(df):
    ...
```

Each backend then wraps and re-attaches them:

```python
# janitor/backends/pyspark.py
from janitor.functions import clean_names

def register_dataframe_method(method):
    ...

clean_names = register_dataframe_method(clean_names)
```

We provide a namespace for each backend:

```python
# janitor/__init__.py
from .backends import pyspark as pyspark_backend
```

End users import the appropriate backend:

```python
# in a jupyter notebook, for e.g.
import janitor.pyspark_backend
```

I'm not 100% sure whether this will work in a nice and clean fashion, so we probably have to test whether this works out.

The second reason relates to semantics. Semantically, for the organization of the project, we currently have domain-specific submodules, and a backends submodule would keep the dataframe implementations separated in the same way.

If both of you are willing to make the contribution work, I'm happy to add both of you as maintainers (so we can get some help + expertise using Spark on the team)! I am admittedly quite clueless about Spark, as I have been using Dask all this while, and I usually go to Dask for numerical data manipulations more than for string manipulation.

Let me know what your thoughts are, having read what hopefully was (to you) a coherent block of text above. (If things are unclear, please let me know; I'm happy to address your questions about my thoughts.)
I agree with keeping pyspark as a backend, as proposed above. I'm more than happy to be one of the maintainers of the pyspark version. I'm starting to think that having pyspark as a separate module might be the way to go.

Do you know if R has different types of DataFrames analogous to pandas, pyspark, and Dask? If so, do you know how the R janitor package handles them?
Well, there are Spark dataframes in R via sparklyr. If we do use backends, how do we put them in the codebase? Should each function have a big if statement to determine which code runs for Spark versus Dask versus whatever? I know the spark API differs from Pandas for some functions. Similarly, would this require people to install Spark on their system to run tests (even when they are not testing Spark code)?
Also, are we thinking of using a context manager to select the backend?
According to my understanding, pyspark would not have to be installed for the rest of the test suite to run. In terms of testing, we can simply skip the pyspark tests when pyspark is not installed. Something like:

```python
import pytest

try:
    import pyspark
except ImportError:
    pyspark = None


@pytest.mark.skipif(pyspark is None, reason="requires pyspark")
def test_pyspark():
    ...
```
Thanks both of you, @anzelpwj and @zjpoh, for chiming in!
That could be a good idea! I haven't done much with context managers myself, so I'd be curious to see how this gets implemented.
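As a purely hypothetical sketch of how backend selection via a context manager could work (the `backend` function and `_ACTIVE_BACKEND` flag are invented names, not janitor API):

```python
# Sketch of the backend-switching idea floated above; all names are
# illustrative assumptions.
import contextlib

_ACTIVE_BACKEND = "pandas"  # module-level default


@contextlib.contextmanager
def backend(name):
    """Temporarily select which dataframe backend janitor functions target."""
    global _ACTIVE_BACKEND
    previous, _ACTIVE_BACKEND = _ACTIVE_BACKEND, name
    try:
        yield
    finally:
        _ACTIVE_BACKEND = previous


# Usage: registered functions would dispatch on _ACTIVE_BACKEND internally.
# with backend("pyspark"):
#     df = df.clean_names()
```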
I took a second look, and indeed, it looks like the pyspark API is quite different from the pandas API.
This actually changes the calculus for maintainability! 😄 @anzelpwj, would you be on-board? Thinking forward, it would be great if both of you came on board to develop this. Before that happens, though, let's flesh things out by focusing on only a single function first.
I really like the idea of starting simple to flesh out the implementation details. I will spend some evenings next week and hopefully that will be enough to put in a PR.
Thanks @zjpoh! Looking forward to reviewing your PR 😄
I can definitely help with maintenance and general architecting.
I'm a little overwhelmed with work and won't have time to do this until next week. Just FYI.
No worries, I totally understand. Same situation here 😄. If you'd like to further narrow the scope of the PR, that's totally cool; in fact, it might be preferable, because it makes the review process much easier too!
What's the initial plan, @zjpoh? Do we want to create a submodule and just get one function up and running?
@anzelpwj Sorry for the late response. I was on break for a little bit. 😉
Sorry I haven't been able to be involved at all so far; kid's been sick a lot lately (the usual "starting daycare means getting exposed to all the diseases"). Once things calm down, I can try to contribute more.
No worries~~
Brief Description
I would like to know if there is any interest in creating pyjanitor for pyspark. I'm using pyspark a lot, and I would really like to use custom method chaining to clean up my ETL code.
I'm not sure whether it is doable or how easy it would be, but I would be open to exploring it.
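To illustrate the kind of method-chained ETL being requested (hypothetical: `clean_names` did not exist on Spark DataFrames when this issue was opened, and the file path is made up):

```python
# What pyjanitor-style chaining could look like on a Spark DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.parquet("events.parquet")  # illustrative path
    .clean_names()       # janitor-style: normalize column names
    .dropna(how="all")   # plain pyspark: drop all-null rows
)
```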