Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7314][SPARK-3524][PySpark] upgrade Pyrolite to 4.4 #5850

Closed
wants to merge 7 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented May 1, 2015

This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524 and some other performance improvements (e.g., SPARK-6288). The artifact is still under org.spark-project on Maven Central since there is no official release published there.

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31615 has finished for PR 5850 at commit cc3903a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DataFrameStatFunctions(object):

@mengxr
Copy link
Contributor Author

mengxr commented May 2, 2015

test this please

@SparkQA
Copy link

SparkQA commented May 2, 2015

Test build #31633 has finished for PR 5850 at commit cc3903a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor Author

mengxr commented May 3, 2015

@irmen Is Pyrolite 4.4 fully compatible with Python 2.6.6? We saw a test failure for Python 2.6.6:

======================================================================
FAIL: test_newhadoop_with_array (__main__.OutputFormatTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pyspark/tests.py", line 1286, in test_newhadoop_with_array
    self.assertEqual(result, array_data)
AssertionError: Lists differ: [(1, [24 chars]d', [3.0386519416174186e-319, 3.16202013338397[125 chars]0]))] != [(1, [24 chars]d', [1.0, 2.0, 3.0])), (2, array('d', [3.0, 4.0, 5.0]))]

First differing element 1:
(1, array('d', [3.0386519416174186e-319, 3.1620201333839779e-322, 1.0434666440167127e-320]))
(1, array('d', [1.0, 2.0, 3.0]))

  [(1, array('d')),
+  (1, array('d', [1.0, 2.0, 3.0])),
+  (2, array('d', [3.0, 4.0, 5.0]))]
-  (1,
-   array('d', [3.0386519416174186e-319, 3.1620201333839779e-322, 1.0434666440167127e-320])),
-  (2,
-   array('d', [1.0434666440167127e-320, 2.0553130866995856e-320, 2.5612363080410221e-320]))]

----------------------------------------------------------------------

@irmen
Copy link

irmen commented May 3, 2015

Could it be that you're seeing the effect of a bugfix (done in Pyrolite 4.2) regarding the previously incorrect encoding of floats? More info here irmen/Pyrolite#11

@mengxr
Copy link
Contributor Author

mengxr commented May 3, 2015

@irmen Interesting, I tested different versions. I saw this issue in 4.2, 4.3, and 4.4, but not in 3.1. Does that patch broke compatibility with Python 2.6?

Also ping @davies (if he has internet access).

@irmen
Copy link

irmen commented May 3, 2015

If you have the same error in a version before 4.2 it is not that particular fix that is to be blamed. However if the problems started in 4.2 it could very well be this particular bugfix.

I suspect that your Sparks code has some form of workaround or wrapper around the previously incorrect decoding of floats. Now that pyrolite has fixed this, i guess the workaround or wrapper may mess things up if it's still active?

If it is not that: I don't know what your failing unit test is. Maybe you should look into it, what does it test, what does it expect? Can you get hold of the offending pickle data, because that would be a nice place to start investigating. Pyrolite has a bunch of testcases itself too, do they all pass on your system?

@SparkQA
Copy link

SparkQA commented May 3, 2015

Test build #31702 has finished for PR 5850 at commit 7824a9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

@mengxr You might want to look into https://issues.apache.org/jira/browse/SPARK-3524, a followup JIRA that @davies created to remove a workaround introduced in #2365 for array unpickling. I suspect that rolling back that change might fix the test failure that you're seeing here.

As long as you're upgrading Pyrolite, it would also be good to confirm whether the upgrade fixes these related issues (and link / close those JIRAs if this is the case):

@airhorns
Copy link

airhorns commented May 3, 2015

Yeah, I think it'd be good to add pyspark test cases for roundtripping datetimes with timezones at least as regression tests. https://issues.apache.org/jira/browse/SPARK-6917 might also be fixed by this.

@irmen
Copy link

irmen commented May 3, 2015

Regarding https://issues.apache.org/jira/browse/SPARK-6289 (PySpark doesn't maintain SQL date Types) .... I don't understand what Pyrolite may be doing wrong here? Because java.sql.Timestamp extends Date and also contains time information, not only a date.

@mengxr
Copy link
Contributor Author

mengxr commented May 4, 2015

Thanks for the references! I reverted the machine code patch in #2365 and the tests are passing on my local machine. I'm looking at the datetime issue now.

@mengxr mengxr changed the title [WIP][SPARK-7314][PySpark] upgrade Pyrolite to 4.4 with patches [WIP][SPARK-7314][SPARK-3524][PySpark] upgrade Pyrolite to 4.4 with patches May 4, 2015
@mengxr mengxr changed the title [WIP][SPARK-7314][SPARK-3524][PySpark] upgrade Pyrolite to 4.4 with patches [WIP][SPARK-7314][SPARK-3524][SPARK-6411][PySpark] upgrade Pyrolite to 4.4 with patches May 4, 2015
@mengxr mengxr changed the title [WIP][SPARK-7314][SPARK-3524][SPARK-6411][PySpark] upgrade Pyrolite to 4.4 with patches [WIP][SPARK-7314][SPARK-3524][PySpark] upgrade Pyrolite to 4.4 with patches May 4, 2015
@SparkQA
Copy link

SparkQA commented May 4, 2015

Test build #31721 has finished for PR 5850 at commit 6ddac0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor Author

mengxr commented May 4, 2015

It seems that we need more time to solve the timezone issue with date/datetime. I left the details on SPARK-6411.

@irmen Did you try the bintray -> jcenter -> sonotype oss -> maven central route? If it takes time to publish on maven central, we can ask @pwendell to put the artifact to maven central under org.spark-project, same as our current solution.

@irmen
Copy link

irmen commented May 4, 2015

I haven't spent more time yet on getting the libs (serpent, pyrolite) onto maven central.
The java ecosystem is fairly alien to me and I'm learning my options.
Glancing at the sbt file you provided it looks like it is much simpler than the maven route.
FYI: I think I need to put https://github.com/irmen/serpent on there first, because it is a dependency of Pyrolite. Am I right? I do know that sparks won't use it because you're interested in just the pickle part of Pyrolite, but as a whole it is an interface to Pyro and that requires serpent 😄

@mengxr
Copy link
Contributor Author

mengxr commented May 4, 2015

@irmen Yes, the correct way is to publish serpent first and then pyrolite. Given the Spark 1.4 release window, maybe we are going to publish Pyrolite 4.4 on Maven Central under org.spark-packages again, which is intended to be used by Spark only and hence it doesn't need the Serpent dependency.

@irmen
Copy link

irmen commented May 4, 2015

Will you be able to switch later, to a net.razorvine version once it is finished and available on maven central? Will the serpent dependency be a problem for you at that time?

@mengxr mengxr changed the title [WIP][SPARK-7314][SPARK-3524][PySpark] upgrade Pyrolite to 4.4 with patches [SPARK-7314][SPARK-3524][PySpark] upgrade Pyrolite to 4.4 with patches May 4, 2015
@mengxr
Copy link
Contributor Author

mengxr commented May 4, 2015

@irmen It is simple to switch to use your official artifacts in a later release. We can exclude serpent in our pom file since we don't need it.

@JoshRosen I'm using the 4.4 release on Maven Central now (thanks to @brkyvz). Could you make a final pass? It seems that the timezone issue is not yet fully resolved in the latest Pyrolite/Spark. We need another PR for it, which may miss 1.4.

@SparkQA
Copy link

SparkQA commented May 5, 2015

Test build #31799 has finished for PR 5850 at commit 2ed4a95.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 5, 2015

Test build #31796 has finished for PR 5850 at commit fe7e29b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 5, 2015

Test build #31798 has finished for PR 5850 at commit da3c2dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr mengxr changed the title [SPARK-7314][SPARK-3524][PySpark] upgrade Pyrolite to 4.4 with patches [SPARK-7314][SPARK-3524][PySpark] upgrade Pyrolite to 4.4 May 5, 2015
@rxin
Copy link
Contributor

rxin commented May 5, 2015

LGTM.

asfgit pushed a commit that referenced this pull request May 5, 2015
This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524 and some other performance improvements (e.g., SPARK-6288). The artifact is still under `org.spark-project` on Maven Central since there is no official release published there.

Author: Xiangrui Meng <meng@databricks.com>

Closes #5850 from mengxr/SPARK-7314 and squashes the following commits:

2ed4a95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7314
da3c2dd [Xiangrui Meng] remove my repo
fe7e29b [Xiangrui Meng] switch to maven central
6ddac0e [Xiangrui Meng] reverse the machine code for float/double
d2d5b5b [Xiangrui Meng] change back to 4.4
7824a9c [Xiangrui Meng] use Pyrolite 3.1
cc3903a [Xiangrui Meng] upgrade Pyrolite to 4.4-0 for testing

(cherry picked from commit e9b16e6)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in e9b16e6 May 5, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524 and some other performance improvements (e.g., SPARK-6288). The artifact is still under `org.spark-project` on Maven Central since there is no official release published there.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#5850 from mengxr/SPARK-7314 and squashes the following commits:

2ed4a95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7314
da3c2dd [Xiangrui Meng] remove my repo
fe7e29b [Xiangrui Meng] switch to maven central
6ddac0e [Xiangrui Meng] reverse the machine code for float/double
d2d5b5b [Xiangrui Meng] change back to 4.4
7824a9c [Xiangrui Meng] use Pyrolite 3.1
cc3903a [Xiangrui Meng] upgrade Pyrolite to 4.4-0 for testing
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524 and some other performance improvements (e.g., SPARK-6288). The artifact is still under `org.spark-project` on Maven Central since there is no official release published there.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#5850 from mengxr/SPARK-7314 and squashes the following commits:

2ed4a95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7314
da3c2dd [Xiangrui Meng] remove my repo
fe7e29b [Xiangrui Meng] switch to maven central
6ddac0e [Xiangrui Meng] reverse the machine code for float/double
d2d5b5b [Xiangrui Meng] change back to 4.4
7824a9c [Xiangrui Meng] use Pyrolite 3.1
cc3903a [Xiangrui Meng] upgrade Pyrolite to 4.4-0 for testing
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524 and some other performance improvements (e.g., SPARK-6288). The artifact is still under `org.spark-project` on Maven Central since there is no official release published there.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#5850 from mengxr/SPARK-7314 and squashes the following commits:

2ed4a95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7314
da3c2dd [Xiangrui Meng] remove my repo
fe7e29b [Xiangrui Meng] switch to maven central
6ddac0e [Xiangrui Meng] reverse the machine code for float/double
d2d5b5b [Xiangrui Meng] change back to 4.4
7824a9c [Xiangrui Meng] use Pyrolite 3.1
cc3903a [Xiangrui Meng] upgrade Pyrolite to 4.4-0 for testing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants