[SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.2) #4809
Test build #28054 has started for PR 4809 at commit
Test build #28054 has finished for PR 4809 at commit
Test FAILed.
Test build #28074 has started for PR 4809 at commit
Test build #28074 has finished for PR 4809 at commit
Test PASSed.
Test build #28086 has started for PR 4809 at commit
Test build #28093 has started for PR 4809 at commit
- >>> (MapType(StringType, IntegerType, False)
- ... == MapType(StringType, FloatType))
+ >>> (MapType(StringType(), IntegerType(), False)
+ ... == MapType(StringType(), FloatType()))
  False
  """
I noticed that the PR opened against master added typechecking asserts here. Should we also add them in this branch, or is there a reason why we should omit them here?
Because we cannot cherry-pick the patch from master, I need to redo all of the changes on 1.2/1.1, so I'd like to keep the changes minimal.
Fair enough; just wanted to check.
Test build #28086 has finished for PR 4809 at commit
Test FAILed.
LGTM, since the changes here are a subset of the changes in the PR opened against
Test build #28093 has finished for PR 4809 at commit
Test PASSed.
I've merged this into
The __eq__ of DataType is not correct, so the class cache is not used correctly (a created class cannot be found by its dataType). As a result, lots of classes are created and saved in _cached_cls, and they are never released.

Also, all instances of the same DataType have the same hash code, so a dict ends up holding many objects with the same hash code. This behaves like a hash attack: access to the dict becomes very slow (depending on the implementation of CPython).

This PR also improves the performance of inferSchema (by avoiding the unnecessary conversion of objects).

Author: Davies Liu <davies@databricks.com>

Closes #4809 from davies/leak2 and squashes the following commits:

65c222f [Davies Liu] Update sql.py
9b4dadc [Davies Liu] fix __eq__ of singleton
b576107 [Davies Liu] fix tests
6c2909a [Davies Liu] fix incorrect DataType.__eq__
@davies can you close this? (Auto-close doesn't work for the backport commits.)
The __eq__ of DataType is not correct, so the class cache is not used correctly (a created class cannot be found by its dataType). As a result, lots of classes are created and saved in _cached_cls, and they are never released.
Also, all instances of the same DataType have the same hash code, so a dict ends up holding many objects with the same hash code. This behaves like a hash attack: access to the dict becomes very slow (depending on the implementation of CPython).
This PR also improves the performance of inferSchema (by avoiding the unnecessary conversion of objects).
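
For readers who want to see the failure mode in isolation, here is a minimal sketch. It does not use PySpark's actual code: BrokenType, FixedType, and fill_cache are hypothetical names invented for illustration, and the plain dict only stands in for a cache such as _cached_cls. It assumes the bug has the shape described above, that is, equivalent keys share a hash code but do not compare equal.

```python
class BrokenType(object):
    """Hypothetical type: equivalent instances hash alike but never compare equal."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        # Every instance with the same name lands in the same hash bucket.
        return hash(self.name)

    def __eq__(self, other):
        # Illustrative bug: only the very same object is considered equal.
        return self is other


class FixedType(BrokenType):
    """Same hash, but equality compares the class and the instance fields."""

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other):
        return not self.__eq__(other)

    # Python 3 sets __hash__ to None when __eq__ is overridden, so restore it.
    __hash__ = BrokenType.__hash__


def fill_cache(type_cls, n):
    """Look up n equivalent keys, adding a cache entry on every miss."""
    cache = {}
    for _ in range(n):
        key = type_cls("int")
        if key not in cache:
            cache[key] = object()  # stands in for a generated class
    return cache


broken_size = len(fill_cache(BrokenType, 1000))
fixed_size = len(fill_cache(FixedType, 1000))
print(broken_size, fixed_size)  # prints "1000 1"
```

With the broken __eq__, every lookup misses, so the cache grows by one entry per call and all entries collide on a single hash code, which makes each later lookup scan the earlier ones. With a field-based __eq__, the first entry is found again and the cache stays at one entry.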