Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.2) #4809

Closed
wants to merge 4 commits into from

Conversation

davies
Copy link
Contributor

@davies davies commented Feb 27, 2015

The eq of DataType is not correct, class cache is not use correctly (created class can not be find by dataType), then it will create lots of classes (saved in _cached_cls), never released.

Also, all same DataType have same hash code, there will be many object in a dict with the same hash code, end with hash attach, it's very slow to access this dict (depends on the implementation of CPython).

This PR also improve the performance of inferSchema (avoid the unnecessary converter of object).

@davies davies changed the title [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.1/1.2) Feb 27, 2015
@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28054 has started for PR 4809 at commit 6c2909a.

  • This patch merges cleanly.

@davies davies changed the title [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.1/1.2) [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.2) Feb 27, 2015
@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28054 has finished for PR 4809 at commit 6c2909a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28054/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28074 has started for PR 4809 at commit b576107.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28074 has finished for PR 4809 at commit b576107.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28074/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28086 has started for PR 4809 at commit 9b4dadc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28093 has started for PR 4809 at commit 65c222f.

  • This patch merges cleanly.

>>> (MapType(StringType, IntegerType, False)
... == MapType(StringType, FloatType))
>>> (MapType(StringType(), IntegerType(), False)
... == MapType(StringType(), FloatType()))
False
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that the PR opened against master added typechecking asserts here. Should we also add them in this branch, or is there a reason why we should omit them here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we can not cherry-pick the patch from master, I need to re-do all the things on 1.2/1.1, so I'd like to keep the changes minimized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough; just wanted to check.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28086 has finished for PR 4809 at commit 9b4dadc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28086/
Test FAILed.

@JoshRosen
Copy link
Contributor

LGTM, since the changes here are a subset of the changes in the PR opened against master.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28093 has finished for PR 4809 at commit 65c222f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28093/
Test PASSed.

@JoshRosen
Copy link
Contributor

I've merged this into branch-1.2 (1.2.2). Thanks!

asfgit pushed a commit that referenced this pull request Feb 28, 2015
The eq of DataType is not correct, class cache is not use correctly (created class can not be find by dataType), then it will create lots of classes (saved in _cached_cls), never released.

Also, all same DataType have same hash code, there will be many object in a dict with the same hash code, end with hash attach, it's very slow to access this dict (depends on the implementation of CPython).

This PR also improve the performance of inferSchema (avoid the unnecessary converter of object).

Author: Davies Liu <davies@databricks.com>

Closes #4809 from davies/leak2 and squashes the following commits:

65c222f [Davies Liu] Update sql.py
9b4dadc [Davies Liu] fix __eq__ of singleton
b576107 [Davies Liu] fix tests
6c2909a [Davies Liu] fix incorrect DataType.__eq__
@pwendell
Copy link
Contributor

pwendell commented Mar 1, 2015

@davies can you close this? (auto close doesn't work for the backport commits).

@davies davies closed this Mar 1, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants