Changes to support KMeans with large feature space #10739

Closed

levin-royl

The problem:

In Spark's KMeans code the center vectors are always represented as dense vectors. As a result, when the feature space is large the algorithm quickly runs out of memory. In my example I have a feature space of around 50000 dimensions and k ~= 500. With 8-byte doubles this sums up to around 200MB of RAM for the center vectors alone, while in fact the center vectors are very sparse and require a lot less RAM.
Since I am running on a system with relatively low resources, I keep getting OutOfMemory errors. In my setting it is OK to trade runtime for lower RAM usage. This is what I set out to do in my solution, while giving users the flexibility to choose.
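
As a rough check of the figures above, here is a back-of-the-envelope estimate. It assumes 8-byte doubles for values and 4-byte ints for sparse indices, and it ignores JVM object overhead. The figure of 500 non-zeros per center is a hypothetical value chosen only for illustration, not a measurement from my data.

```python
# Rough memory estimate for k-means centers (illustrative sketch only).
# Assumptions: 8 bytes per value, 4 bytes per sparse index, no object overhead.

def dense_bytes(k, dim):
    """Bytes needed to store k dense centers of dimension dim."""
    return k * dim * 8

def sparse_bytes(k, nnz_per_center):
    """Bytes needed to store k sparse centers with nnz_per_center non-zeros each."""
    return k * nnz_per_center * (8 + 4)

k, dim = 500, 50_000
print(dense_bytes(k, dim) / 1e6)   # 200.0 (MB), matching the estimate above
print(sparse_bytes(k, 500) / 1e6)  # 3.0 (MB) for the hypothetical 500 non-zeros per center
```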

My solution:

Allow the KMeans algorithm to accept a VectorFactory which decides when vectors used inside the algorithm should be sparse and when they should be dense. For backward compatibility, the default behavior is to always make them dense (as is the case now). But now the user can provide a SmartVectorFactory (or some custom VectorFactory) which can decide to make vectors sparse.
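
To make the idea concrete, here is a minimal sketch in plain Python rather than the Scala of the actual patch. The names VectorFactory and SmartVectorFactory come from the description above. Everything else is an assumption made for illustration and may differ from the real code: the `create` method, the `DenseVectorFactory` name, the `max_density` threshold, and the tuple representation of vectors.

```python
from abc import ABC, abstractmethod

class VectorFactory(ABC):
    """Decides which representation a vector used inside k-means should take."""

    @abstractmethod
    def create(self, values):
        ...

class DenseVectorFactory(VectorFactory):
    """Hypothetical default: always dense, mirroring the current behavior."""

    def create(self, values):
        return ("dense", list(values))

class SmartVectorFactory(VectorFactory):
    """Goes sparse when the fraction of non-zeros is at most max_density (assumed rule)."""

    def __init__(self, max_density=0.1):
        self.max_density = max_density

    def create(self, values):
        values = list(values)
        nonzero = [(i, v) for i, v in enumerate(values) if v != 0.0]
        if values and len(nonzero) / len(values) <= self.max_density:
            indices = [i for i, _ in nonzero]
            vals = [v for _, v in nonzero]
            # (tag, size, indices, values)
            return ("sparse", len(values), indices, vals)
        return ("dense", values)
```

With this shape, the algorithm only talks to the factory, so choosing between memory and speed becomes a matter of which factory the caller passes in.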

For this I made the following changes:
(1) Added a method called reassign to SparseVector that allows changing the indices and values
(2) Allowed axpy to accept SparseVectors
(3) Created a trait called VectorFactory and two implementations of it that are used within the KMeans code
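
Changes (1) and (2) work together roughly as sketched below, again in plain Python and not the actual Scala patch. The `SparseVec` class and the `axpy_sparse` function are hypothetical stand-ins, and the sketch assumes indices are kept sorted. Since y += a * x can introduce new non-zero positions in y, the result is written back through reassign.

```python
class SparseVec:
    """Minimal stand-in for a sparse vector with sorted indices."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def reassign(self, indices, values):
        """Replace the stored indices and values in place."""
        self.indices = list(indices)
        self.values = list(values)

def axpy_sparse(a, x, y):
    """Compute y += a * x for two sparse vectors by merging their sorted indices."""
    assert x.size == y.size
    out_i, out_v = [], []
    i = j = 0
    while i < len(x.indices) or j < len(y.indices):
        # Use size as a sentinel once one side is exhausted (valid indices are < size).
        xi = x.indices[i] if i < len(x.indices) else x.size
        yj = y.indices[j] if j < len(y.indices) else y.size
        if xi == yj:
            out_i.append(xi)
            out_v.append(a * x.values[i] + y.values[j])
            i += 1
            j += 1
        elif xi < yj:
            out_i.append(xi)
            out_v.append(a * x.values[i])
            i += 1
        else:
            out_i.append(yj)
            out_v.append(y.values[j])
            j += 1
    y.reassign(out_i, out_v)

x = SparseVec(5, [1, 3], [1.0, 2.0])
y = SparseVec(5, [0, 3], [10.0, 1.0])
axpy_sparse(2.0, x, y)
print(y.indices, y.values)  # [0, 1, 3] [10.0, 2.0, 5.0]
```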

@yinxusen
Contributor

Hi @levin-royl, you need to remove the two log files and create a JIRA. See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

@levin-royl
Author

Thank you, I removed the log files and added the following JIRA request:

https://issues.apache.org/jira/browse/KYLIN-1326

@yinxusen
Contributor

You may have created your JIRA in the wrong place. It should be under Spark, not Kylin: https://issues.apache.org/jira/browse/SPARK

@levin-royl
Author

Sorry, I am a little new to this. For some reason when choosing "create new" I only had the options: Kylin, Atlas or Apache Infrastructure. Now through the link you sent I created the following JIRA request in Spark:

https://issues.apache.org/jira/browse/SPARK-12861

@levin-royl
Author

Hi, just wanted to know if there are any unhandled items on my end WRT this change. Thanks.

@srowen
Member

srowen commented Jan 28, 2016

Read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You may want to search for duplicate JIRAs too. There are several on this topic and k-means.

@levin-royl
Author

Hi, there are indeed some similar issues I found, e.g.:

https://issues.apache.org/jira/browse/SPARK-4039
https://issues.apache.org/jira/browse/SPARK-1212
mesos/spark#736

But the difference is that, in the problem I describe, reducing the dimensionality of the problem (i.e., the feature space) to allow using dense vectors is not suitable. Also, the solution I implemented handles this case while leaving full flexibility to the user, i.e., using the default dense vector implementation or selecting an alternative (only when the default is not desired).

I will update the JIRA issue on this as well.

Please advise if there are any additional steps I need to do at this point.

Thanks in advance.

@levin-royl
Author

I wanted to know if you took a look at the code and the proposed solution in general. Are there any comments?

Thanks.

@thunterdb
Contributor

@levin-royl it looks like @hhbyyh has a branch with some code that tackles the same issue (see the JIRA discussion for more information). You may want to coordinate there.

Also, I suggest you take another look at the pull request section of the guidelines: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

@hhbyyh
Contributor

hhbyyh commented Mar 10, 2016

I didn't send a PR because there's an ongoing effort to rewrite the KMeans implementation in terms of matrix multiplication.

@AmplabJenkins

Can one of the admins verify this patch?

@maropu maropu mentioned this pull request Apr 23, 2017
maropu added a commit to maropu/spark that referenced this pull request Apr 23, 2017
@asfgit asfgit closed this in e9f9715 Apr 24, 2017
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
This PR proposes to close stale PRs. Currently, we have 400+ open PRs, and some of them are stale PRs whose JIRA tickets have already been closed or whose JIRA tickets do not exist (they also do not seem to be minor issues).

// Open PRs whose JIRA tickets have been already closed
Closes apache#11785
Closes apache#13027
Closes apache#13614
Closes apache#13761
Closes apache#15197
Closes apache#14006
Closes apache#12576
Closes apache#15447
Closes apache#13259
Closes apache#15616
Closes apache#14473
Closes apache#16638
Closes apache#16146
Closes apache#17269
Closes apache#17313
Closes apache#17418
Closes apache#17485
Closes apache#17551
Closes apache#17463
Closes apache#17625

// Open PRs whose JIRA tickets do not exist and that are not minor issues
Closes apache#10739
Closes apache#15193
Closes apache#15344
Closes apache#14804
Closes apache#16993
Closes apache#17040
Closes apache#15180
Closes apache#17238

N/A

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17734 from maropu/resolved_pr.

Change-Id: Id2e590aa7283fe5ac01424d30a40df06da6098b5