SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD #973
Conversation
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
Hey Aaron, you should probably add these to the Java API as well for completeness.
Merged build triggered.
Merged build started.
I added a Java API. However, I think everything would be much cleaner if we could change the return type of hadoopRDD/hadoopFile to be HadoopRDD. This would break binary compatibility and require widening the visibility of HadoopRDD out of DeveloperApi, though.
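For context, the Java API mentioned above could look roughly like the following. This is a hedged sketch, not the merged code: the wrapper class name, the `JFunction2` alias, and the iterator conversions are assumptions based on the discussion and the diff shown later in this thread.

```scala
// Sketch of a Java-friendly wrapper around the Scala HadoopRDD method.
// Names and signatures here are illustrative assumptions, not the exact merged API.
@DeveloperApi
class JavaHadoopRDD[K, V](rdd: HadoopRDD[K, V])
  extends JavaPairRDD[K, V](rdd) {

  /** Maps over a partition, providing the InputSplit that backs it. */
  def mapPartitionsWithInputSplit[R](
      f: JFunction2[InputSplit, java.util.Iterator[(K, V)], java.util.Iterator[R]],
      preservesPartitioning: Boolean): JavaRDD[R] = {
    // Delegate to the Scala method, converting between Java and Scala iterators.
    new JavaRDD(rdd.mapPartitionsWithInputSplit(
      (split, iter) => f.call(split, iter.asJava).asScala,
      preservesPartitioning))
  }
}
```

The key design point is that the wrapper only adapts iterator types; the split-tracking logic stays in the Scala `HadoopRDD`.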
Merged build finished. All automated tests passed.
@aarondav mind updating this to let it merge cleanly?
Updated.
QA tests have started for PR 973. This patch merges cleanly.
QA results for PR 973:
```scala
import org.apache.spark.api.java.function.{Function2 => JFunction2}
import org.apache.spark.rdd.NewHadoopRDD

class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])
```
Add @DeveloperApi on these, and also on mapPartitionsWithInputSplit in general (since it requires casting stuff to HadoopRDD).
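Applied to the wrapper from the diff above, the suggestion would amount to something like this (a sketch; the class is from the diff, the annotation placement is the reviewer's request, and the `extends` clause is an assumption):

```scala
import org.apache.spark.annotation.DeveloperApi

// Marking the wrapper @DeveloperApi signals that it is an unstable,
// expert-level API rather than part of the stable public surface.
@DeveloperApi
class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])
  extends JavaPairRDD[K, V](rdd)
```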
Thanks for the update. One other thing missing is a test written in Java -- otherwise it's not clear that the types and casts and such will work well in there.
This allows users to gain access to the InputSplit which backs each partition. An alternative solution would have been to have a .withInputSplit() method which returns a new RDD[(InputSplit, (K, V))], but this is confusing because you could not cache this RDD or shuffle it, as InputSplit is not inherently serializable.
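The mechanism described above can be sketched as follows. This is a hedged example: the cast to `HadoopRDD` and the `mapPartitionsWithInputSplit` name follow this PR, but the input format, the `FileSplit` downcast, and the file path are illustrative assumptions.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

// sc.hadoopFile still returns a plain RDD[(K, V)], so the caller casts to
// HadoopRDD to reach the new method (hence the @DeveloperApi discussion above).
val rdd = sc.hadoopFile("hdfs:///logs", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])

val linesWithFile = rdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((split: InputSplit, iter: Iterator[(LongWritable, Text)]) => {
    // The split is handed to the closure per partition; it never becomes part
    // of the RDD's data, which matters because InputSplit is not serializable.
    val path = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (path, line.toString) }
  }, preservesPartitioning = false)
```

This is why the per-partition callback was chosen over a `.withInputSplit()` RDD of `(InputSplit, (K, V))` pairs: the latter could not be cached or shuffled.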
QA tests have started for PR 973. This patch merges cleanly.
QA results for PR 973:
Looks good to me. Merging it.
This allows users to gain access to the InputSplit which backs each partition. An alternative solution would have been to have a .withInputSplit() method which returns a new RDD[(InputSplit, (K, V))], but this is confusing because you could not cache this RDD or shuffle it, as InputSplit is not inherently serializable.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#973 from aarondav/hadoop and squashes the following commits:

9c9112b [Aaron Davidson] Add JavaAPISuite test
9942cd7 [Aaron Davidson] Add Java API
1284a3a [Aaron Davidson] SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD