Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41757][CONNECT] Fixing String representation for Column class #39296

Closed
wants to merge 3 commits into from

Conversation

grundprinzip
Copy link
Contributor

What changes were proposed in this pull request?

To reenable the doc tests for Column, this patch fixes the string representation of the Column, in particular this introduces a workaround for a binary unresolved function that produces an infix representation of the function string.

Why are the changes needed?

Compatibility

Does this PR introduce any user-facing change?

No

How was this patch tested?

Reenabled doc tests.

@@ -420,6 +420,12 @@ def __repr__(self) -> str:
return f"{self._name}({', '.join([str(arg) for arg in self._args])})"


class UnresolvedBinaryFunction(UnresolvedFunction):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this issue is a bit complicated, since not all binary ops have this type of string repr, for example:

In [11]: df.name + 1
Out[11]: Column<'(name + 1)'>

In [12]: df.name + df.age
Out[12]: Column<'(name + age)'>

In [13]: df.name.startswith("n")
Out[13]: Column<'startswith(name, n)'>

In [14]: df.name.startswith(df.age)
Out[14]: Column<'startswith(name, age)'>

we may also need to take unary ops into account.

I guess we need to keep a list to control which function can be represented in this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is our goal to make the Column string representation consistent between client and server? We don't have a general rule about how to generate Column string representation, I think a lot of manual efforts are needed to check the server-side implementation and copy it to the client.

Shall we only take care of some common ones like attribute and literal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there was probably not a spec to follow for the string representation of these objects, we can try to do our best to make the doc tests pass and be somewhat compatible since this is not a real API:

  • binary ops with single character operator -> infix
  • other binary ops -> prefix
  • for other expression types we directly resolve check with the server

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think it is fine to start with only handling binary ops with single character operator -> infix.

I think we can simply modify the __repr__ method of UnresolvedFunction, seems unnecessary to add a new class UnresolvedBinaryFunction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this first but it feels weird and the code looks a bit ugly. in particular because the only place where we use this is in the Column class for the bin_op function. But given that we add even more logic I will change it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the behavior for infix / prefix notation and fixed the alias behavior as well. For the remaining simple checks I did it looks good so far and should get us to a reasonable comparison.

I don't think the goal should be compatibility on this non-api but similar readability. If we find cases that are unreasonable to fix, we should apply the change that @HyukjinKwon proposed.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@@ -414,6 +414,12 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
return fun

def __repr__(self) -> str:
# Special handling for certain infix operators that require slightly
# different printing.
if len(self._args) == 2 and len(self._name) == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, what about we just whitelist, and make it nicer? For example, there are only a fixed number of operators (20 ~ 30) that PySpark supports plug eqNullSafe, and very likely we won't add some more about this.

Otherwise, I would just even remove these new logics, and just fix the doctest with ....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me just take it over at #39616 by doing a (very) simple version for now to make the tests pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants