[SPARK-24915][Python] Fix Row handling with Schema. #26118
Conversation
@davies @HyukjinKwon you've been the last to work on this part of the code. Are you available for review?
This contribution is my original work and I license the work to the project under the project's open source license.
ok to test
cc @BryanCutler too
Test build #112581 has finished for PR 26118 at commit
Seems to make sense to me.
cc @zero323 |
@HyukjinKwon To be honest I have mixed feelings about this. It looks sensible as a temporary workaround, but I am not fond of the notion it enforces. Personally I'd prefer to wait a moment and see where the discussion on SPARK-22232 goes. If the resolution is the introduction of a legacy mode, then the scope of this particular change could be conditioned on it and the Python version. If not, I'd like to see some profiling data (especially memory; timings might actually be better for now, as we skip the nasty parts). I've done some rough testing, and conversion to `dict` (with the simple optimization suggested below) is roughly six times slower than conversion to `tuple`.
It is also worth mentioning that SPARK-24915 is really an edge case. If the user wants to provide a schema, then it is hard to justify using `Row`s in the first place.
* Is there any reason why we do this: spark/python/pyspark/sql/types.py line 615 in 2115bf6, instead of just the simpler form?
@zero323 Thanks for looking into this.
I agree that an update to the underlying `Row` behaviour would be preferable. Do you think it makes sense to apply this change for Spark 2.x?
I can try to do that. Do you have any example of how to do proper timings with Spark?
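For a rough first measurement, the trade-off being discussed can be micro-benchmarked without Spark at all. The sketch below is hypothetical (it is not the benchmark zero323 ran): it compares converting a record positionally, tuple-style, against looking each field up by name, dict-style, using only the standard library's `timeit`:

```python
import timeit

# Hypothetical micro-benchmark, not the actual Spark benchmark from this
# thread: compares positional (tuple-style) conversion of a record against
# name-based (dict-style) conversion, the trade-off debated above.

names = ["a", "b", "c", "d"]
values = (1, 2, 3, 4)
record = dict(zip(names, values))

def by_position(values=values):
    # tuple-style: trust the incoming order of values
    return tuple(v for v in values)

def by_name(record=record, names=names):
    # dict-style: one hash lookup per field name
    return tuple(record[n] for n in names)

t_pos = timeit.timeit(by_position, number=100_000)
t_name = timeit.timeit(by_name, number=100_000)
print(f"positional: {t_pos:.3f}s  by-name: {t_name:.3f}s")
```

Timing end-to-end through `createDataFrame` on a real cluster would of course be more representative; this only isolates the per-record conversion cost.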
In my case it was about reproducing input for a test case. I just wanted to create a dataframe containing the same rows as the problematic dataframe. In the end, I chose a different way to produce the data, but it feels strange not to be able to simply create the dataframe from the `Row`s directly.
This was a workaround introduced by #14469. My change here is actually just doing the same workaround for schemas with fields that need some serialization. I'm unclear if the proposed change will have any negative effect. The codepath should only be taken in cases that either failed (as in SPARK-24915) or might have silently mixed up the values. Even with sub-optimal performance, this would only improve the situation for users. Would you prefer to replace the former expression by the latter, to reduce one indirection?
Test build #113375 has finished for PR 26118 at commit
I don't have a strong opinion (especially as I cannot speak for project policy, and I am usually less conservative than most of the folks around here), but I'd say it is acceptable, conditioned on the lack of significant negative performance impact on more common use cases. I am also not completely against incorporating this in 3.0, I'd just prefer to see how the related discussion resolves first.
For such a simple case, standard timing tools should do.
My bad, makes sense.
No. As much as I don't like dictionaries here, there is no much better solution here.
Test build #113550 has finished for PR 26118 at commit
Test build #113602 has finished for PR 26118 at commit
@zero323 @HyukjinKwon As expected, conversion is slower for `Row`s with named fields.
Of course, this is an edge case - I think it is rare to construct dataframes this way.
If performance is an issue, we could check if the order of fields is already alphabetical (making the performance worse for the general case) or determine the order once and reuse this mapping (might require major changes).
In my experience, not being able to create a dataframe from dict-like `Row`s is surprising.
What do you think? Is there any way to improve the code without making it unnecessarily complex?
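One of the options floated above - checking whether a record's fields already match the schema order and falling back to the positional path when they do - could look roughly like the following. This is a hypothetical pure-Python sketch; the function name and structure are illustrative and not taken from the PR or from pyspark:

```python
# Hypothetical sketch of the "check field order first" optimization discussed
# above; names and structure are illustrative, not code from the PR.

def to_internal(schema_names, converters, row_fields, row_values):
    """Convert one record to its internal representation.

    schema_names: field names in schema order
    converters:   one converter per schema field, in schema order
    row_fields:   field names as stored on the record
    row_values:   values in the same order as row_fields
    """
    if tuple(row_fields) == tuple(schema_names):
        # fast path: order already matches, convert positionally
        return tuple(c(v) for c, v in zip(converters, row_values))
    # slow path: build a dict and look each field up by name
    as_dict = dict(zip(row_fields, row_values))
    return tuple(c(as_dict[n]) for n, c in zip(schema_names, converters))

# order matches -> fast path; order differs -> name-based lookup
same = to_internal(("a", "b"), (int, str), ("a", "b"), (1, 2))
swapped = to_internal(("a", "b"), (int, str), ("b", "a"), (2, 1))
assert same == swapped == (1, "2")
```

The equality check adds a per-record cost on the fast path, which is the "making the performance worse for the general case" concern raised above.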
Test build #113848 has finished for PR 26118 at commit
Thank you!
That seems acceptable in my opinion - not great, but tolerable given the diminishing importance of this code path.
I will just point out that
Thanks! I agree, I wouldn't have used it that way myself.
@zero323 @HyukjinKwon What is the process to get this PR merged?
Test build #114546 has finished for PR 26118 at commit
retest this please
Test build #115726 has finished for PR 26118 at commit
Let's review and merge #26496 first.
@HyukjinKwon I think #26496 will remove the main reason for this change, but that will be Spark 3.0. This change might be more helpful for 2.x. Does it make sense to open a PR against a 2.x branch? I saw examples of backporting fixes, but not of PRs solely against old branches. Any suggestions on how to proceed?
Test build #115819 has finished for PR 26118 at commit
It's difficult to backport to 2.x branches due to behavioural changes. Maintenance releases should ideally not contain such behaviour changes.
@HyukjinKwon Thanks for the quick response. I meant just applying this change (#26118) against 2.4. #26496 is certainly a major behaviour change; #26118 just changes behaviour as much as expected from a bugfix.
I'm -0 on this change for the 2.x branch. It seems minimal and makes `Row` slightly more consistent, but there are still bigger issues this doesn't solve. I think it's better to just document it as a known bug along with a workaround.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This change implements special handling of `pyspark.sql.Row` within the conversion of a `StructType` into an internal SQL object (`toInternal()`) to fix SPARK-24915.
Previously, a `Row` was processed as a `tuple` (since it inherits from `tuple`). In particular, values were expected to come in the "right" order. This works if the internal order of the `Row` (sorted by key in the current implementation) corresponds to the order of fields in the schema. If the fields have a different order and need special treatment (e.g. `_needConversion` is `True`), then exceptions happened when creating dataframes.
With this change, a `Row` will be processed as a `dict` if it has named columns and as a `tuple` otherwise.
Design
Re `asDict`: I first had an implementation for `Row` as a type. However, that implementation would fail for fields that are unknown to the `Row` object, which is inconsistent with the handling of `dict`s. The most consistent implementation is to convert the `Row` to a `dict`.
Note: The underlying problem is that `Row` inherits from `tuple`. This is visible in the tests, too: for `assertEqual`, the `Row`s `Row(a=..., b=...)` and `Row(b=..., a=...)` are not equal because they are compared as lists (and the order is wrong), while a direct comparison returns `True`. (For this reason the tests compare based on `asDict`.)
Why are the changes needed?
The code part being changed relies on `Row`s being RDD-style tuples but breaks for `Row`s created with `kwargs`.
This change fixes SPARK-24915: creating data frames from (`pyspark.sql`) `Row`s failed if the order of fields in the schema differed from the (internal) order of fields in the `Row` and the schema is "complicated". Complicated here means that one type of the schema is nested (as in the JIRA issue) or one field needs conversion (e.g. `DateType()`).
Without the change, the following examples fail:
From the JIRA issue:
Date example:
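The original example snippets did not survive here. As a stand-in, the following is a pure-Python sketch (hypothetical, not pyspark code) of the failure mode a date example would exercise: a schema whose first field needs conversion while the record's internal field order differs from the schema order:

```python
import datetime

# Hypothetical pure-Python sketch of the failure mode (not pyspark code):
# a date field needs conversion to days-since-epoch, but the record's
# internal field order differs from the schema order.

def date_to_days(d):
    return (d - datetime.date(1970, 1, 1)).days

# schema order: date field first, then an int field
schema = [("d", date_to_days), ("n", int)]

# the record stores its fields in a different order (n first)
row_fields = ("n", "d")
row_values = (1, datetime.date(1970, 1, 11))

# positional conversion (old behaviour): converters are applied to the
# wrong values, so date_to_days(1) raises TypeError
try:
    tuple(conv(v) for (_, conv), v in zip(schema, row_values))
    raise AssertionError("expected a TypeError")
except TypeError:
    pass

# name-based conversion (the behaviour this PR introduces for named rows)
as_dict = dict(zip(row_fields, row_values))
result = tuple(conv(as_dict[name]) for name, conv in schema)
assert result == (10, 1)  # 1970-01-11 is 10 days after the epoch
```

In the real pyspark case the mismatch arises because `Row(n=..., d=...)` sorts its fields internally while the user-supplied schema keeps its own order.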
Does this PR introduce any user-facing change?
This change does not introduce user-facing changes for existing, working pyspark code.
Code that previously caused exceptions b/c of the fixed bug will now work (which - technically - is a user-facing change).
How was this patch tested?
Standard Tests
ARROW_PRE_0_15_IPC_FORMAT=1 ./dev/run-tests
succeeded on my machine.
Python: 3.7.4
Spark: master (`a42d894a4090c97a90ce23b0989163909ebf548d`)
OS: macOS 10.14.6
New Tests
I added the following tests in module `pyspark.sql.tests.test_types`:
- `test_create_dataframe_from_rows_mixed_with_datetype`: schema with a date field doesn't cause an exception
- `test_create_dataframe_from_rows_with_nested_row`: schema with a nested field doesn't cause an exception
- `test_create_dataframe_from_tuple_rows`: regression test: RDD-style `Row`s still work
The latter corresponds to the test case from SPARK-24915.