
Conversation

@drJAGartner

Supervised machine learning requires partitioning data into labeled data sets, typically according to a set of rules. Adding an RDD method that performs this partitioning in a single step is a first step toward single-pass RDD partitioning.
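For readers of this thread, here is a minimal sketch of the kind of helper being described, assuming a rule-per-label API; the object name `LabeledSplits`, the method `splitByRules`, and its signature are illustrative assumptions, not the code in this patch.

```scala
import org.apache.spark.rdd.RDD

// Illustrative sketch only, not the patch's actual API: split one RDD into
// several labeled subsets according to a rule (predicate) per label.
object LabeledSplits {
  def splitByRules[T](data: RDD[T], rules: Map[String, T => Boolean]): Map[String, RDD[T]] = {
    data.cache() // each rule triggers its own filter pass over the same cached parent
    rules.map { case (label, rule) => label -> data.filter(rule) }
  }
}
```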

@AmplabJenkins

Can one of the admins verify this patch?

@markhamstra
Contributor

This needs to be implemented in Scala: Spark's architecture allows a Scala implementation to be exposed via the Python, Java, R, etc. APIs, but not the other way around, with one of those APIs serving as the root implementation.

@JoshRosen
Contributor

I don't think that we should add this to the core RDD API itself. This method is probably more useful / discoverable as part of one of the ml or mllib subpackages. In addition, the implementation here is no more efficient than what a user could write themselves (i.e. it's just syntactic sugar for ~ two lines of code).
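For reference, the user-level equivalent being alluded to is roughly the following sketch; `manualSplit` and the `isTraining` predicate are placeholders for illustration, not anything from this patch or from Spark's API.

```scala
import org.apache.spark.rdd.RDD

// Roughly the "two lines" a user can already write: one filter per subset
// over the same cached parent RDD. `manualSplit` and `isTraining` are placeholders.
def manualSplit[T](data: RDD[T], isTraining: T => Boolean): (RDD[T], RDD[T]) = {
  data.cache() // both filters reuse the cached parent instead of recomputing it
  (data.filter(isTraining), data.filter(x => !isTraining(x)))
}
```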

If you'd still like to propose this change, please see the instructions at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and follow the process for filing a JIRA.

@drJAGartner
Author

Sounds good, thanks for the input.
