[SPARK-12834] Change ser/de of JavaArray and JavaList #10772

Closed
wants to merge 2 commits into from

yinxusen
Contributor

https://issues.apache.org/jira/browse/SPARK-12834

We use SerDe.dumps() to serialize JavaArray and JavaList in PythonMLLibAPI, then deserialize them with PickleSerializer on the Python side. This round trip is unnecessary and inefficient. Instead, we can convert them directly with a type conversion, e.g. list(JavaArray) or list(JavaList). In addition, there is an issue with the ser/de of Scala Array, as I described in https://issues.apache.org/jira/browse/SPARK-12780
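To make the difference between the two routes concrete, here is a minimal sketch in plain Python. It does not use Py4J or Spark: FakeJavaArray is a hypothetical stand-in for a Py4J JavaArray, and the pickle round trip only imitates the SerDe.dumps() / PickleSerializer path rather than reproducing it.

```python
import pickle


class FakeJavaArray:
    """Hypothetical stand-in for a Py4J JavaArray: iterable, but not a Python list."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


def via_pickle(values):
    # Simulated old route: serialize to bytes on one side,
    # then deserialize the bytes on the other side.
    payload = pickle.dumps(list(values))
    return pickle.loads(payload)


def via_type_conversion(java_array):
    # Proposed route: iterate the array-like object directly.
    return list(java_array)


arr = FakeJavaArray([1.0, 2.0, 3.0])
pickled_result = via_pickle(arr)
converted_result = via_type_conversion(arr)
```

For flat arrays of primitive values both routes give the same Python list, and the direct conversion skips the intermediate byte payload.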

@SparkQA

SparkQA commented Jan 15, 2016

Test build #49459 has finished for PR 10772 at commit 4e8df6c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Contributor Author

@jkbradley @davies I know why we get the error here. If the elements of a JavaArray are of a primitive type, we can use list(JavaArray). However, if the elements have complex structures, like List(model.weights, model.intercept).map(_.asInstanceOf[Object]).asJava in the failed test, we cannot simply call list(JavaArray), because it does not convert the inner elements automatically.

I'll change the fix back to my original one.
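The shallow-conversion problem described in this comment can be sketched in plain Python. Everything here is an assumption made for illustration: FakeJavaObject and FakeJavaList are hypothetical stand-ins for Py4J proxies, and deep_convert is not the fix used in this PR, only a demonstration of why list() alone is not enough.

```python
class FakeJavaObject:
    """Hypothetical stand-in for a Py4J proxy wrapping a JVM object."""

    def __init__(self, payload):
        self.payload = payload


class FakeJavaList:
    """Hypothetical stand-in for a Py4J JavaList."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)


def shallow_convert(java_list):
    # list() only unpacks the outer container; inner proxies stay as they are.
    return list(java_list)


def deep_convert(obj):
    # Recursively unwrap containers and proxies (illustrative only).
    if isinstance(obj, FakeJavaList):
        return [deep_convert(element) for element in obj]
    if isinstance(obj, FakeJavaObject):
        return obj.payload
    return obj


# Mirrors the shape of List(model.weights, model.intercept) from the failed test.
weights = FakeJavaObject([0.5, -1.2])
intercept = 0.1
result = FakeJavaList([weights, intercept])

shallow = shallow_convert(result)
deep = deep_convert(result)
```

After the shallow conversion the first element is still a proxy object, whereas the recursive version yields plain Python values throughout.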

@yinxusen
Contributor Author

@jkbradley I have changed the fix. However, there is one tricky part: if a developer wants to convert an Array[Array[Int]], an Array[Array[Array[Int]]], etc., it may still fail. Checking the nested structure here is non-trivial.

For a complex structure, an example from the current code is in VectorIndexer. Do you have any recommendations for it? How about we add an assertion in the above code:

case _: Array[Array[_]] => throw new XXXException("You need to transform the nested array into a Java list yourself")
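The guard proposed above is Scala. As a rough illustration of the same idea, here is a sketch in Python, where reject_nested_arrays is a hypothetical helper name and TypeError stands in for the placeholder XXXException. It checks only one level of nesting, which is enough to catch both the Array[Array[Int]] and the Array[Array[Array[Int]]] cases, because the outer array contains arrays in both.

```python
def reject_nested_arrays(value):
    """Return `value` unchanged, or raise if it is a sequence of sequences.

    Hypothetical helper for illustration; not part of Spark or Py4J.
    """
    if isinstance(value, (list, tuple)):
        for element in value:
            if isinstance(element, (list, tuple)):
                raise TypeError(
                    "You need to transform the nested array into a Java list yourself"
                )
    return value


flat = reject_nested_arrays([1, 2, 3])  # passes through untouched

try:
    reject_nested_arrays([[1, 2], [3]])
    nested_rejected = False
except TypeError:
    nested_rejected = True
```

Failing loudly here trades convenience for an explicit error, so the caller converts nested structures deliberately instead of receiving half-converted results.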

@SparkQA

SparkQA commented Jan 16, 2016

Test build #49515 has finished for PR 10772 at commit 8903447.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Contributor Author

@jkbradley What do you think of the change? Alternatively, I believe we can just ignore the Array[Array[_]] case for now, since no Python API uses such a return type.

@jkbradley
Member

Sorry for the wait. I think this fix is fine for now and will hopefully let most Python wrappers use _call_java without worrying about conversions.

@jkbradley
Member

I'll merge this after tests run again

@SparkQA

SparkQA commented Jan 26, 2016

Test build #2458 has finished for PR 10772 at commit 8903447.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Contributor Author

Never mind, we can update the ser/de later when necessary.

On Monday, January 25, 2016, Apache Spark QA notifications@github.com wrote:

Test build #2458 (https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2458/consoleFull) has finished for PR 10772 at commit 8903447.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.



Cheers

Xusen Yin (尹绪森)
LinkedIn: https://cn.linkedin.com/in/xusenyin

@jkbradley
Member

LGTM
Merging with master
Thanks!

@asfgit asfgit closed this in ae47ba7 Jan 26, 2016
jkbradley pushed a commit to jkbradley/spark that referenced this pull request Jan 27, 2016
https://issues.apache.org/jira/browse/SPARK-12834

We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` in Python side. However, there is no need to transform them in such an inefficient way. Instead of it, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue to Ser/De Scala Array as I said in https://issues.apache.org/jira/browse/SPARK-12780

Author: Xusen Yin <yinxusen@gmail.com>

Closes apache#10772 from yinxusen/SPARK-12834.
asfgit pushed a commit that referenced this pull request Jan 27, 2016
…vaList

Backport of SPARK-12834 for branch-1.6

Original PR: #10772

Original commit message:
We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` in Python side. However, there is no need to transform them in such an inefficient way. Instead of it, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue to Ser/De Scala Array as I said in https://issues.apache.org/jira/browse/SPARK-12780

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10941 from jkbradley/yinxusen-SPARK-12834-1.6.