
SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD #973

Closed · wants to merge 3 commits from aarondav/hadoop

Conversation

@aarondav (Contributor) commented Jun 5, 2014

This allows users to gain access to the InputSplit which backs each partition.

An alternative solution would have been to have a .withInputSplit() method which returns a new RDD[(InputSplit, (K, V))], but this is confusing because you could not cache this RDD or shuffle it, as InputSplit is not inherently serializable.
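
For illustration, a minimal sketch of how such an API might be used from Scala, assuming a (InputSplit, Iterator[(K, V)]) => Iterator[U] signature; the path, input format, and variable names here are illustrative, not from the patch:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.HadoopRDD

val sc = new SparkContext("local", "input-split-example")

// hadoopFile is statically typed as RDD[(K, V)], so a cast is needed
// to reach the HadoopRDD-specific method.
val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///logs/*.txt")
  .asInstanceOf[HadoopRDD[LongWritable, Text]]

// Tag each record with the file its split came from.
val linesWithSource = rdd.mapPartitionsWithInputSplit {
  (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
    val file = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (file, line.toString) }
}

Because the InputSplit is consumed entirely inside the closure, the resulting RDD carries no InputSplit values and can still be cached or shuffled normally.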

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed.
Build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15468/

@mateiz (Contributor) commented Jun 6, 2014

Hey Aaron, you should probably add these to the Java API as well for completeness.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@aarondav (Contributor, Author):

I added a Java API. However, I think everything would be much cleaner if we could change the return type of hadoopRDD/hadoopFile to be HadoopRDD. This would break binary compatibility and require widening the visibility of HadoopRDD out of DeveloperApi, though.
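
For reference, a minimal sketch of the shape such a Java-facing wrapper could take. The ClassTag plumbing, the fakeClassTag placeholder, and the iterator conversions are assumptions about one way to bridge the APIs, not necessarily what the patch does:

import java.{util => ju}

import scala.collection.JavaConversions.{asJavaIterator, asScalaIterator}
import scala.reflect.ClassTag

import org.apache.hadoop.mapred.InputSplit
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
import org.apache.spark.api.java.JavaSparkContext.fakeClassTag
import org.apache.spark.api.java.function.{Function2 => JFunction2}
import org.apache.spark.rdd.HadoopRDD

class JavaHadoopRDD[K, V](rdd: HadoopRDD[K, V])
    (implicit override val kClassTag: ClassTag[K],
     implicit override val vClassTag: ClassTag[V])
  extends JavaPairRDD[K, V](rdd) {

  // Maps over a partition, exposing the InputSplit backing that partition.
  def mapPartitionsWithInputSplit[R](
      f: JFunction2[InputSplit, ju.Iterator[(K, V)], ju.Iterator[R]],
      preservesPartitioning: Boolean): JavaRDD[R] = {
    // Java callers cannot supply ClassTags, so fall back to Spark's placeholder tag.
    implicit val ctag: ClassTag[R] = fakeClassTag
    JavaRDD.fromRDD(rdd.mapPartitionsWithInputSplit(
      (split, iter) => asScalaIterator(f.call(split, asJavaIterator(iter))),
      preservesPartitioning))
  }
}

A parallel JavaNewHadoopRDD wrapper would cover NewHadoopRDD (the new Hadoop API); the QA output below lists both classes.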

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed.
Build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15808/

@mateiz (Contributor) commented Jul 29, 2014

@aarondav mind updating this to let it merge cleanly?

@aarondav (Contributor, Author):

Updated.

@SparkQA commented Jul 29, 2014

QA tests have started for PR 973. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17360/consoleFull

@SparkQA commented Jul 29, 2014

QA results for PR 973:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class JavaHadoopRDD[K, V](rdd: HadoopRDD[K, V])
class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17360/consoleFull

Review comment (Contributor) on the diff:

import org.apache.spark.api.java.function.{Function2 => JFunction2}
import org.apache.spark.rdd.NewHadoopRDD

class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])

Add @DeveloperApi on these, and also on mapPartitionsWithInputSplit in general (since it requires casting stuff to HadoopRDD).
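
Applied to the class above, the suggestion might look like this (a sketch; the ClassTag parameters are assumed from how JavaPairRDD subclasses are typically declared):

import scala.reflect.ClassTag

import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.api.java.JavaPairRDD
import org.apache.spark.rdd.NewHadoopRDD

// Marks the wrapper as an unstable developer-facing API.
@DeveloperApi
class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])
    (implicit override val kClassTag: ClassTag[K],
     implicit override val vClassTag: ClassTag[V])
  extends JavaPairRDD[K, V](rdd)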

@mateiz (Contributor) commented Jul 29, 2014

Thanks for the update. One other thing missing is a test written in Java -- otherwise it's not clear that the types and casts and such will work well in there.

@SparkQA commented Jul 31, 2014

QA tests have started for PR 973. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17586/consoleFull

@SparkQA commented Jul 31, 2014

QA results for PR 973:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class JavaHadoopRDD[K, V](rdd: HadoopRDD[K, V])
class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17586/consoleFull

@mateiz (Contributor) commented Jul 31, 2014

Looks good to me. Merging it.

@asfgit closed this in f193312 on Jul 31, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request on Sep 4, 2014:
This allows users to gain access to the InputSplit which backs each partition.

An alternative solution would have been to have a .withInputSplit() method which returns a new RDD[(InputSplit, (K, V))], but this is confusing because you could not cache this RDD or shuffle it, as InputSplit is not inherently serializable.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#973 from aarondav/hadoop and squashes the following commits:

9c9112b [Aaron Davidson] Add JavaAPISuite test
9942cd7 [Aaron Davidson] Add Java API
1284a3a [Aaron Davidson] SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD