Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41757][CONNECT] Fixing String representation for Column class #39296

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
11 changes: 4 additions & 7 deletions python/pyspark/sql/column.py
Expand Up @@ -182,9 +182,6 @@ def _(self: "Column", other: Union["LiteralType", "DecimalLiteral"]) -> "Column"
return _


# TODO(SPARK-41757): Compatibility of string representation for Column


class Column:

"""
Expand All @@ -203,16 +200,16 @@ class Column:
... [(2, "Alice"), (5, "Bob")], ["age", "name"])

Select a column out of a DataFrame
>>> df.name # doctest: +SKIP
>>> df.name
Column<'name'>
>>> df["name"] # doctest: +SKIP
>>> df["name"]
Column<'name'>

Create from an expression

>>> df.age + 1 # doctest: +SKIP
>>> df.age + 1
Column<'(age + 1)'>
>>> 1 / df.age # doctest: +SKIP
>>> 1 / df.age
Column<'(1 / age)'>
"""

Expand Down
3 changes: 3 additions & 0 deletions python/pyspark/sql/connect/column.py
Expand Up @@ -45,6 +45,7 @@
WindowExpression,
WithField,
DropField,
UnresolvedBinaryFunction,
)


Expand Down Expand Up @@ -94,6 +95,8 @@ def wrapped(self: "Column") -> "Column":


def scalar_function(op: str, *args: "Column") -> "Column":
if len(args) == 2:
return Column(UnresolvedBinaryFunction(op, [arg._expr for arg in args]))
return Column(UnresolvedFunction(op, [arg._expr for arg in args]))


Expand Down
10 changes: 8 additions & 2 deletions python/pyspark/sql/connect/expressions.py
Expand Up @@ -308,7 +308,7 @@ def to_plan(self, session: "SparkConnectClient") -> "proto.Expression":
return expr

def __repr__(self) -> str:
return f"Literal({self._value})"
return f"{self._value}"


class ColumnReference(Expression):
Expand All @@ -333,7 +333,7 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
return expr

def __repr__(self) -> str:
return f"ColumnReference({self._unparsed_identifier})"
return f"{self._unparsed_identifier}"


class SQLExpression(Expression):
Expand Down Expand Up @@ -420,6 +420,12 @@ def __repr__(self) -> str:
return f"{self._name}({', '.join([str(arg) for arg in self._args])})"


class UnresolvedBinaryFunction(UnresolvedFunction):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this issue is a bit complicated, since not all binary ops have this type of string repr, for example:

In [11]: df.name + 1
Out[11]: Column<'(name + 1)'>

In [12]: df.name + df.age
Out[12]: Column<'(name + age)'>

In [13]: df.name.startswith("n")
Out[13]: Column<'startswith(name, n)'>

In [14]: df.name.startswith(df.age)
Out[14]: Column<'startswith(name, age)'>

we may also need to take unary ops into account.

I guess we need to keep a list to control which function can be represented in this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is our goal to make the Column string representation consistent between client and server? We don't have a general rule about how to generate Column string representation, I think a lot of manual efforts are needed to check the server-side implementation and copy it to the client.

Shall we only take care of some common ones like attribute and literal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there was probably not a spec to follow for the string representation of these objects, we can try to do our best to make the doc tests pass and be somewhat compatible since this is not a real API:

  • binary ops with single character operator -> infix
  • other binary ops -> prefix
  • for other expression types we directly resolve check with the server

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think it is fine to start with only handling binary ops with single character operator -> infix.

I think we can simply modify the __repr__ method of UnresolvedFunction, seems unnecessary to add a new class UnresolvedBinaryFunction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this first but it feels weird and the code looks a bit ugly. in particular because the only place where we use this is in the Column class for the bin_op function. But given that we add even more logic I will change it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the behavior for infix / prefix notation and fixed the alias behavior as well. For the remaining simple checks I did it looks good so far and should get us to a reasonable comparison.

I don't think the goal should be compatibility on this non-api but similar readability. If we find cases that are unreasonable to fix, we should apply the change that @HyukjinKwon proposed.

grundprinzip marked this conversation as resolved.
Show resolved Hide resolved
def __repr__(self) -> str:
assert len(self._args) == 2
return f"({self._args[0]} {self._name} {self._args[1]})"


class WithField(Expression):
def __init__(
self,
Expand Down