[SPARK-44137] Change handling of iterable objects for on field in joins #41686

Closed
wants to merge 3 commits into from


jhaberstroh-sharethis

The on field complained when I passed it a tuple. The join code checks for list exactly, so it wrapped the tuple in a list, like [on], which led to immediate failure. This was surprising: tuple and list are typically interchangeable, and tuple is typically the more readily accepted type. I have proposed a change that moves towards the principle of least surprise for this situation.
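To make the failure mode concrete, here is a minimal sketch of an exact-list check. The function name wrap_on_exact_list is hypothetical and only mirrors the behavior described above; it is not copied from the PySpark source.

```python
def wrap_on_exact_list(on):
    """Hypothetical stand-in for a check that accepts only `list`."""
    # Anything that is not a list (including a tuple) gets wrapped as [on].
    if on is not None and not isinstance(on, list):
        on = [on]
    return on


as_list = wrap_on_exact_list(["id", "date"])
as_tuple = wrap_on_exact_list(("id", "date"))

print(as_list)   # ['id', 'date']
print(as_tuple)  # [('id', 'date')]
```

The tuple ends up as a single element of the list rather than as two join keys, which is why the join fails right away.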

The reason it checks for list exactly is that Column counts as an Iterable, because it implements __iter__. Column implements __getitem__, and in Python that alone allows an object to be iterated over with iter(). This caused bad behavior, so __iter__ was added to raise an exception any time a Column is iterated over. That change was implemented in SPARK-10417:
#8574
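The mechanism can be reproduced without Spark. The sketch below uses two hypothetical stand-in classes (neither is the real Column): one that defines only __getitem__, and one that also defines an __iter__ that refuses.

```python
class GetItemOnly:
    """Defines only __getitem__; Python's legacy sequence protocol makes it iterable."""

    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)  # IndexError is what ends the iteration
        return index * 10


class GuardedColumn:
    """Hypothetical stand-in for Column: __getitem__ plus an __iter__ that refuses."""

    def __getitem__(self, key):
        return f"field[{key}]"  # never raises IndexError

    def __iter__(self):
        raise TypeError("Column is not iterable")


# iter() falls back to calling __getitem__ with 0, 1, 2, ... until IndexError.
fallback_items = list(GetItemOnly())
print(fallback_items)  # [0, 10, 20]

# With the guard in place, iter() fails immediately.
try:
    iter(GuardedColumn())
    guarded_is_iterable = True
except TypeError:
    guarded_is_iterable = False
print(guarded_is_iterable)  # False
```

Without the guard, a __getitem__ that never raises IndexError would keep a for loop running indefinitely, which is presumably the kind of bad behavior the raising __iter__ was meant to prevent.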

The Python docs also note that isinstance(x, Iterable) is not a reliable check for iterability, and that the only reliable way to determine whether an object is iterable is to call iter() on it. For reference:
https://stackoverflow.com/questions/1952464/in-python-how-do-i-determine-if-an-object-is-iterable
https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable
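A small comparison of the two checks, as a sketch. The classes and the can_iter helper are hypothetical names introduced for illustration only.

```python
from collections.abc import Iterable


class GetItemOnly:
    """Iterable through __getitem__ only."""

    def __getitem__(self, index):
        if index >= 2:
            raise IndexError(index)
        return index


class RaisingIter:
    """Has __iter__, but it always raises (similar in spirit to Column)."""

    def __iter__(self):
        raise TypeError("not iterable")


def can_iter(obj):
    """Return True if iter(obj) succeeds."""
    try:
        iter(obj)
        return True
    except TypeError:
        return False


# Each value is (isinstance(x, Iterable), can_iter(x)).
results = {
    "getitem_only": (isinstance(GetItemOnly(), Iterable), can_iter(GetItemOnly())),
    "raising_iter": (isinstance(RaisingIter(), Iterable), can_iter(RaisingIter())),
    "tuple": (isinstance(("a", "b"), Iterable), can_iter(("a", "b"))),
    "str": (isinstance("id", Iterable), can_iter("id")),
}
for name, pair in results.items():
    print(name, pair)
```

The two checks disagree on both custom classes. Note that a bare str passes both checks, so a join helper built on iter() still needs a separate rule for strings.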

There will be no user-facing changes for existing working code. It will only fix code that did not work previously.

How was this patch tested?

Tests for:

  • isinstance_interable behaves as expected for all combinations of (str, col) and (bare, list, tuple).
  • to_list_column_style creates a list when passed any of these types, and the list contains only non-iterables (as defined by isinstance_interable).
  • All of these different joins are required to produce the same result.
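As a rough illustration of what such tests exercise, here is a Spark-free sketch. FakeColumn, is_on_collection, and to_on_list are hypothetical names, not the helpers from this patch, and treating a bare str as a single key is an assumption of the sketch.

```python
class FakeColumn:
    """Hypothetical stand-in for Column: indexable, but refuses iteration."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        return FakeColumn(f"{self.name}[{key}]")

    def __iter__(self):
        raise TypeError("Column is not iterable")


def is_on_collection(obj):
    """True for containers of join keys, False for a single key."""
    if isinstance(obj, str):  # assumption: a bare string is one column name
        return False
    try:
        iter(obj)
        return True
    except TypeError:
        return False


def to_on_list(on):
    """Normalize a bare key, a list, or a tuple into a list of keys."""
    if on is None:
        return None
    return list(on) if is_on_collection(on) else [on]


col = FakeColumn("id")
for on in ("id", ["id"], ("id",), col, [col], (col,)):
    normalized = to_on_list(on)
    # Every form collapses to a one-element list holding a single key.
    assert len(normalized) == 1
    assert not is_on_collection(normalized[0])
```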

@HyukjinKwon
Member

Mind creating a JIRA please? See also https://spark.apache.org/contributing.html

@jhaberstroh-sharethis
Author

jhaberstroh-sharethis commented Jun 21, 2023

@HyukjinKwon Thanks! I requested an account.

@jhaberstroh-sharethis jhaberstroh-sharethis changed the title Change handling of iterable objects for on field in joins [SPARK-44137] Change handling of iterable objects for on field in joins Jun 21, 2023
@jhaberstroh-sharethis
Author

    try:
        iter(obj)
        return True
    except (TypeError, PySparkTypeError):


PySparkTypeError is a TypeError, so listing both is redundant. Should it be removed, or should the dependency be kept explicit?
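For context on the redundancy, a generic sketch with a stand-in exception class (FakePySparkTypeError is hypothetical and only assumes the subclass relationship stated above):

```python
class FakePySparkTypeError(TypeError):
    """Hypothetical stand-in for a library error that subclasses TypeError."""


def caught_by_type_error(exc):
    """Return True if `except TypeError` alone catches the given exception."""
    try:
        raise exc
    except TypeError:
        return True
    except Exception:
        return False


print(caught_by_type_error(FakePySparkTypeError("boom")))  # True
print(caught_by_type_error(ValueError("boom")))            # False
```

Since the subclass is already covered by except TypeError, naming it in the tuple changes nothing at runtime; keeping it would only serve as documentation for the reader.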

@jrhaberstroh

jrhaberstroh commented Jun 26, 2023

@HyukjinKwon could I request a final review on this PR? I fixed the GitHub Actions blocker, and it should pass, since I tested locally.

@github-actions

github-actions bot commented Oct 5, 2023

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 5, 2023
@github-actions github-actions bot closed this Oct 6, 2023
3 participants