[SPARK-6677] [SQL] [PySpark] fix cached classes #5445
Test build #29993 has finished for PR 5445 at commit
Test build #30010 has finished for PR 5445 at commit
Test build #30039 has finished for PR 5445 at commit
cc @JoshRosen
To recap my understanding of this patch:
Due to name-mangling, this field will no longer be accessible outside of the Row class itself, but I don't think that's a problem based on how we use it: it looks like the only place where we read the old __DATATYPE__ field was in __reduce__.
It seems that Row will have two references to dataType, via _Row__dataType and __dataType, so I'd like to change it to __dataType__ to avoid the double reference.
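To illustrate the double reference described above, here is a minimal sketch. The `Row` class, the `data_type` method, and the `attach` helper are hypothetical stand-ins, not the PySpark code; the only point is that Python mangles `__dataType` inside a class body but not in a module-level function, so the two spellings end up as two separate attributes.

```python
class Row(tuple):
    # Inside a class body, a name with two leading underscores (and no
    # trailing double underscore) is mangled: this is stored as _Row__dataType.
    __dataType = "set inside the class"

    def data_type(self):
        # Also mangled: reads self._Row__dataType.
        return self.__dataType


def attach(cls, data_type):
    # Hypothetical helper. Outside a class body there is no mangling,
    # so this creates an attribute literally named "__dataType".
    cls.__dataType = data_type


attach(Row, "set outside the class")

# Both spellings now exist on the class as separate attributes.
names = sorted(n for n in vars(Row) if "dataType" in n)
inside_value = Row((1,)).data_type()
outside_value = getattr(Row, "__dataType")
```

A name that also ends in two underscores, such as `__dataType__`, is not mangled at all, which is why that spelling would leave a single attribute.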
It looks like Row held a reference to its DataType even in the old code, so I guess the only new references here are the ones that we're adding to the functions / arrays / etc. returned from the other branches of _create_cls. I suppose that we could set those objects' __dataType__ fields inside of the _create_cls function instead of setting them in _restore_object if you think that would be easier to understand. Not a huge deal to set it outside, though.
_restore_object is only used for Row, so the dataType should be a StructType. The reason it will crash while Row has a reference to dataType is that there will be multiple ids for a Row class in _cached_cls; all of them belong to the same StructType, but from different batches (see another comment).
So we should not add __dataType__ for other datatypes.
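The failure mode discussed here can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in the patch: `StructType` is a hypothetical stand-in, `lookup_unchecked` and `lookup_checked` are made-up names, and the stale cache entry is planted by hand to simulate an id (memory address) being reused after an earlier object was freed.

```python
class StructType:
    """Hypothetical stand-in for a schema type; equal when fields match."""

    def __init__(self, *fields):
        self.fields = fields

    def __eq__(self, other):
        return isinstance(other, StructType) and self.fields == other.fields

    def __hash__(self):
        return hash(self.fields)


_cached_cls = {}  # maps id(data_type) -> generated Row class


def _create_cls(data_type):
    class Row(tuple):
        # Dunder-style names are not mangled, so they stay readable from outside.
        __datatype__ = data_type
        __fields__ = data_type.fields

    return Row


def lookup_unchecked(data_type):
    # Trusts the id alone, so a reused id returns a class for the wrong schema.
    return _cached_cls.get(id(data_type))


def lookup_checked(data_type):
    # Verifies the cached class was generated for this data type;
    # otherwise rebuilds it and overwrites the stale entry.
    cls = _cached_cls.get(id(data_type))
    if cls is None or cls.__datatype__ != data_type:
        cls = _create_cls(data_type)
        _cached_cls[id(data_type)] = cls
    return cls


old = StructType("a", "b")
new = StructType("x")
# Simulated id reuse (assumption): pretend `new` landed at an address that
# an earlier, already-freed object with the schema of `old` once occupied.
_cached_cls[id(new)] = _create_cls(old)

wrong_fields = lookup_unchecked(new).__fields__
right_fields = lookup_checked(new).__fields__
```

After the checked lookup has repaired the entry, a second call with the same object returns the same class, so the cache still does its job.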
Ah, gotcha. Do you want to go ahead with the renaming here to eliminate the double-reference?
A bit of time on GitHub code search suggests that renaming @davies, could you take a look at my summary above and let me know if it's accurate? Just want to double-check my understanding before merging. Thanks!
This is fairly complicated, but the solution here makes sense to me: we should be guaranteed safety because we now always check that the cache returns a row class for the expected data type. This looks good to me, but I'll wait to see if @davies wants to do the field renaming proposed upthread. If not, I'll merge this.
@JoshRosen I think it's fine to go without renaming. The comments are very valuable, thank you!
Great! I'm going to merge this into
It's possible to have two DataType objects with the same id (memory address) at different times, so we should check the cached classes to verify that they were generated by the given datatype. This PR also changes `__FIELDS__` and `__DATATYPE__` to lower case to match Python code style.

Author: Davies Liu <davies@databricks.com>

Closes #5445 from davies/fix_type_cache and squashes the following commits:

63b3238 [Davies Liu] typo
47bdede [Davies Liu] fix cached classes

(cherry picked from commit 5d8f7b9)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>