-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44137] Change handling of iterable objects for on
field in joins
#41686
Conversation
a2d92a3
to
7490172
Compare
Mind creating a JIRA please? See also https://spark.apache.org/contributing.html |
@HyukjinKwon Thanks! I requested an account. |
on
field in joinson
field in joins
@HyukjinKwon Created ticket https://issues.apache.org/jira/browse/SPARK-44137 |
try: | ||
iter(obj) | ||
return True | ||
except (TypeError, PySparkTypeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PySparkTypeError is a TypeError -- should this be removed due to redundancy, or should the dependency be kept explicit?
7490172
to
331adbc
Compare
@HyukjinKwon could I request a final review on this PR? I fixed the github "actions" blocker, and it should pass since I tested locally. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
The
on
field complained when I passed it a Tuple. That's because it saw that it checked forlist
exactly, and so wrapped it into a list like[on]
, leading to immediate failure. This was surprising -- typically, tuple and list should be interchangeable, and typically tuple is the more readily accepted type. I have proposed a change that moves towards the principle of least surprise for this situation.The reason it checked for
list
exactly is becauseColumn
actually is anIterable
object because it implements__iter__
. It only does this because it has__getitem__
implemented, and this allows it to be iterated over withiter()
. This caused bad behavior, and so__iter__
was implemented to raise an exception any time a Column is iterated over. That change was implemented in SPARK-10417:#8574
It happens to also be that Python docs specifically advise against checking for iterability by using
isinstance(x, Iterable)
, and that checking for ability to calliter()
is preferred. For references:https://stackoverflow.com/questions/1952464/in-python-how-do-i-determine-if-an-object-is-iterable
https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable
There will be no user-facing changes for existing working code. It will only fix code that did not work previously.
How was this patch tested?
Tests for:
isinstance_interable
behaves as-expected for all combinations of (str, col) and (bare, list, tuple).to_list_column_style
creates a list when passed any of these types, and contains a non-iterable (as-defined)