-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13390][SQL]Fix the issue due to Iterator.map().toSeq is not Serializable #11313
Conversation
Test build #51694 has finished for PR 11313 at commit
|
Maybe good to add a test which would have failed before this change? |
@@ -579,7 +579,7 @@ class SQLContext private[sql]( | |||
val className = beanClass.getName | |||
val beanInfo = Introspector.getBeanInfo(beanClass) | |||
val rows = SQLContext.beansToRows(data.asScala.iterator, beanInfo, attrSeq) | |||
DataFrame(self, LocalRelation(attrSeq, rows.toSeq)) | |||
DataFrame(self, LocalRelation(attrSeq, rows.toArray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toArray could be kind of expensive for large local iterators - maybe we should only do this if necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to materialize it anyways in order to support multiple scans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah I know we need to materialize anyways (toSeq forces that too on Iterators) - but I meant more the cost of copying all the objects to an array (see https://github.com/scala/scala/blob/v2.11.1/src/library/scala/collection/TraversableOnce.scala#L269 ) so maybe only calling toArray if necessary would be worth while (but perhaps not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was the same before -- it was materialized into a Seq
. I think Josh is suggesting this is inherently necessary before making a LocalRelation
so there's no meaningful opportunity for lazy eval (?) If so then this seems fine.
Just found this happened to be fixed in #10511. I'm going to resubmit this patch against branch-1.6 since this is not necessary for master. |
See #11334 instead. |
What changes were proposed in this pull request?
scala.collection.Iterator
's methods (e.g., map, filter) will return anAbstractIterator
which is not Serializable. E.g.,If we call something like
Iterator.map(...).toSeq
, it will create aStream
that contains a non-serializableAbstractIterator
field and make theStream
be non-serializable.This PR uses
toArray
instead oftoSeq
to fix such issue indef createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame
.How was the this patch tested?
Jenkins tests.