Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3580] add 'partitions' property to PySpark RDD #2478

Closed
wants to merge 1 commit into from

Conversation

mattf
Copy link
Contributor

@mattf mattf commented Sep 21, 2014

'rdd.partitions' is available in scala&java, primarily used for its
size() method to get the number of partitions. pyspark instead has a
getNumPartitions() call and no access to 'partitions'

this change adds 'partitions' to pyspark's rdd, allowing for
len(rdd.partitions) to get the number of partitions in a way familiar
to python developers

'rdd.partitions' is available in scala&java, primarily used for its
size() method to get the number of partitions. pyspark instead has a
getNumPartitions() call and no access to 'partitions'

this change adds 'partitions' to pyspark's rdd, allowing for
len(rdd.partitions) to get the number of partitions in a way familiar
to python developers
@SparkQA
Copy link

SparkQA commented Sep 21, 2014

QA tests have started for PR 2478 at commit 96316e4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 21, 2014

QA tests have finished for PR 2478 at commit 96316e4.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging

@davies
Copy link
Contributor

davies commented Sep 23, 2014

RDD._jrdd is very heavy for PipelinedRDD, but getNumPartitions() could be optimized for PipelinedRDD to avoid the creation of _jrdd (could be rdd.prev.getNumPartitions()).

Also, partitions() is one Java Object, it should be an implementation detail, it's better to keep it as internal interface.

len(rdd.partitions()) sounds more Pythonic, how about len(rdd)? cc @JoshRosen

@JoshRosen
Copy link
Contributor

I think len(rdd) has the potential to be confused with rdd.count(), since calling len() on a Python collection usually returns the size of that collection.

I also agree that we shouldn't expose Java Partition objects to users. Is there any reason to expose Partition objects besides allowing len(rdd.partitions()) to work? If not, I'm not sure that we should add this feature.

@mattf
Copy link
Contributor Author

mattf commented Sep 25, 2014

RDD._jrdd is very heavy for PipelinedRDD, but getNumPartitions() could be optimized for PipelinedRDD to avoid the creation of _jrdd (could be rdd.prev.getNumPartitions()).

very true

Also, partitions() is one Java Object, it should be an implementation detail, it's better to keep it as internal interface.

also true. when testing this i noticed the list should essentially be treated as full of black boxes, but that's difficult to enforce w/o wrapping the job object in a python version of Partition.

@davies @JoshRosen, what's the purpose of exposing an array of partition objects in Scala&Java?

I think len(rdd) has the potential to be confused with rdd.count(), since calling len() on a Python collection usually returns the size of that collection.

i agree

@JoshRosen
Copy link
Contributor

what's the purpose of exposing an array of partition objects in Scala&Java?

In Scala / Java, I think we expose Partition objects for use in custom RDD implementations. There are a bunch of methods like iterator(), compute(), preferredLocations(), etc. that take Partition objects. Outside of this context, I'm not sure that they're useful to end-users.

(If you use IntelliJ, try running "Find Usages" on Partition and look at the "Method Parameter Declaration" usages).

@mattf
Copy link
Contributor Author

mattf commented Sep 30, 2014

@JoshRosen a partition itself doesn't have much in the way of a user api. it wouldn't be difficult to wrap the java objects in a python Partition. we should then start implementing the rdd functions that take partitions in python.

@mattf
Copy link
Contributor Author

mattf commented Oct 13, 2014

@JoshRosen @pwendell any further comment on this?

@JoshRosen
Copy link
Contributor

@mattf I'm not sure that it's worth exposing those Partition-accepting methods in Python, since I don't think that they're really intended to be called by users. I guess I don't really see the benefit of adding a Partition wrapper to Python just so that we can write len(partitions); it seems like a lot of work for not much benefit.

@JoshRosen
Copy link
Contributor

Hi @mattf,

If you don't have any additional comments, do you mind closing this pull request? Thanks!

@asfgit asfgit closed this in f73b56f Nov 10, 2014
@mattf mattf deleted the SPARK-3580 branch January 6, 2015 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants