
SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD #973

Closed · wants to merge 3 commits from aarondav/hadoop

Conversation

@aarondav (Contributor) commented Jun 5, 2014

This allows users to gain access to the InputSplit which backs each partition.

An alternative solution would have been to have a .withInputSplit() method which returns a new RDD[(InputSplit, (K, V))], but this is confusing because you could not cache this RDD or shuffle it, as InputSplit is not inherently serializable.
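
For illustration, a minimal sketch of how such an API might be used from Scala, assuming a (InputSplit, Iterator[(K, V)]) => Iterator[U] signature; the path, input format, and variable names here are illustrative, not from the patch:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.HadoopRDD

val sc = new SparkContext("local", "input-split-example")

// hadoopFile is statically typed as RDD[(K, V)], so a cast is needed
// to reach the HadoopRDD-specific method.
val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///logs/*.txt")
  .asInstanceOf[HadoopRDD[LongWritable, Text]]

// Tag each record with the file its split came from.
val linesWithSource = rdd.mapPartitionsWithInputSplit {
  (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
    val file = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (file, line.toString) }
}

Because the InputSplit is consumed entirely inside the closure, the resulting RDD carries no InputSplit values and can still be cached or shuffled normally.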

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed.
Build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15468/

@mateiz (Contributor) commented Jun 6, 2014

Hey Aaron, you should probably add these to the Java API as well for completeness.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@aarondav (Contributor, Author):

I added a Java API. However, I think everything would be much cleaner if we could change the return type of hadoopRDD/hadoopFile to be HadoopRDD. This would break binary compatibility and require widening the visibility of HadoopRDD out of DeveloperApi, though.
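
For reference, a minimal sketch of the shape such a Java-facing wrapper could take. The ClassTag plumbing, the fakeClassTag placeholder, and the iterator conversions are assumptions about one way to bridge the APIs, not necessarily what the patch does:

import java.{util => ju}

import scala.collection.JavaConversions.{asJavaIterator, asScalaIterator}
import scala.reflect.ClassTag

import org.apache.hadoop.mapred.InputSplit
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
import org.apache.spark.api.java.JavaSparkContext.fakeClassTag
import org.apache.spark.api.java.function.{Function2 => JFunction2}
import org.apache.spark.rdd.HadoopRDD

class JavaHadoopRDD[K, V](rdd: HadoopRDD[K, V])
    (implicit override val kClassTag: ClassTag[K],
     implicit override val vClassTag: ClassTag[V])
  extends JavaPairRDD[K, V](rdd) {

  // Maps over a partition, exposing the InputSplit backing that partition.
  def mapPartitionsWithInputSplit[R](
      f: JFunction2[InputSplit, ju.Iterator[(K, V)], ju.Iterator[R]],
      preservesPartitioning: Boolean): JavaRDD[R] = {
    // Java callers cannot supply ClassTags, so fall back to Spark's placeholder tag.
    implicit val ctag: ClassTag[R] = fakeClassTag
    JavaRDD.fromRDD(rdd.mapPartitionsWithInputSplit(
      (split, iter) => asScalaIterator(f.call(split, asJavaIterator(iter))),
      preservesPartitioning))
  }
}

A parallel JavaNewHadoopRDD wrapper would cover NewHadoopRDD (the new Hadoop API); the QA output below lists both classes.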

@AmplabJenkins: Merged build finished. All automated tests passed.

@AmplabJenkins: All automated tests passed.
Build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15808/

@mateiz (Contributor) commented Jul 29, 2014

@aarondav mind updating this to let it merge cleanly?

@aarondav (Contributor, Author):

Updated.

@SparkQA commented Jul 29, 2014

QA tests have started for PR 973. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17360/consoleFull

@SparkQA commented Jul 29, 2014

QA results for PR 973:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class JavaHadoopRDD[K, V](rdd: HadoopRDD[K, V])
class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17360/consoleFull

Review comment (Contributor) on the diff:

import org.apache.spark.api.java.function.{Function2 => JFunction2}
import org.apache.spark.rdd.NewHadoopRDD

class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])

Add @DeveloperApi on these, and also on mapPartitionsWithInputSplit in general (since it requires casting stuff to HadoopRDD).
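
Applied to the class above, the suggestion might look like this (a sketch; the ClassTag parameters are assumed from how JavaPairRDD subclasses are typically declared):

import scala.reflect.ClassTag

import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.api.java.JavaPairRDD
import org.apache.spark.rdd.NewHadoopRDD

// Marks the wrapper as an unstable developer-facing API.
@DeveloperApi
class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])
    (implicit override val kClassTag: ClassTag[K],
     implicit override val vClassTag: ClassTag[V])
  extends JavaPairRDD[K, V](rdd)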

@mateiz (Contributor) commented Jul 29, 2014

Thanks for the update. One other thing missing is a test written in Java -- otherwise it's not clear that the types and casts and such will work well in there.

@SparkQA commented Jul 31, 2014

QA tests have started for PR 973. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17586/consoleFull

@SparkQA commented Jul 31, 2014

QA results for PR 973:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class JavaHadoopRDD[K, V](rdd: HadoopRDD[K, V])
class JavaNewHadoopRDD[K, V](rdd: NewHadoopRDD[K, V])

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17586/consoleFull

@mateiz (Contributor) commented Jul 31, 2014

Looks good to me. Merging it.

@asfgit closed this in f193312 on Jul 31, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request on Sep 4, 2014:
This allows users to gain access to the InputSplit which backs each partition.

An alternative solution would have been to have a .withInputSplit() method which returns a new RDD[(InputSplit, (K, V))], but this is confusing because you could not cache this RDD or shuffle it, as InputSplit is not inherently serializable.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#973 from aarondav/hadoop and squashes the following commits:

9c9112b [Aaron Davidson] Add JavaAPISuite test
9942cd7 [Aaron Davidson] Add Java API
1284a3a [Aaron Davidson] SPARK-2028: Expose mapPartitionsWithInputSplit in HadoopRDD