Changes to support KMeans with large feature space #10739

levin-royl · 2016-01-13T12:03:31Z

The problem:

In Spark's KMeans code, the center vectors are always represented as dense vectors. As a result, when the feature space is large, the algorithm quickly runs out of memory. In my example I have a feature space of around 50000 dimensions and k ~= 500. This adds up to around 200MB of RAM for the center vectors alone (50000 x 500 doubles at 8 bytes each), while in fact the center vectors are very sparse and require far less RAM.
Since I am running on a system with relatively low resources, I keep getting OutOfMemory errors. In my setting it is OK to trade runtime for lower RAM usage. This is what I set out to do in my solution, while giving users the flexibility to choose.
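
To make the memory figure concrete, here is a rough back-of-the-envelope sketch. It assumes 8 bytes per stored double and, for the sparse case, an extra 4-byte integer index per stored nonzero; it ignores object and array overhead. The number of nonzeros per center (`nnz`) is a hypothetical value chosen for illustration, not a measurement from this PR.

```python
# Back-of-the-envelope memory estimate for k-means centers.
# Dense: one 8-byte double per feature per center.
# Sparse: one 8-byte double plus one 4-byte int index per stored nonzero.
BYTES_PER_DOUBLE = 8
BYTES_PER_INT = 4


def dense_bytes(k, dim):
    """Bytes needed to hold k dense centers of dimension dim."""
    return k * dim * BYTES_PER_DOUBLE


def sparse_bytes(k, nnz_per_center):
    """Bytes needed to hold k sparse centers with the given nonzero count."""
    return k * nnz_per_center * (BYTES_PER_DOUBLE + BYTES_PER_INT)


k, dim = 500, 50_000   # figures quoted in the description above
nnz = 1_000            # hypothetical nonzeros per center (assumption)

print(dense_bytes(k, dim) / 1e6)   # 200.0 (MB)
print(sparse_bytes(k, nnz) / 1e6)  # 6.0 (MB)
```

Under these assumptions the sparse layout is smaller whenever fewer than two thirds of a center's entries are nonzero, since each stored entry costs 12 bytes instead of 8.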

My solution:

Allow the KMeans algorithm to accept a VectorFactory, which decides when vectors used inside the algorithm should be sparse and when they should be dense. For backward compatibility, the default behavior is to always make them dense (as is the case now). However, the user can now provide a SmartVectorFactory (or some proprietary VectorFactory) that can decide to make vectors sparse.
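
The idea can be sketched roughly as follows. This is a minimal Python illustration, not the Scala code in this PR: the method name `make`, the tuple representation of vectors, and the `max_density` threshold are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class VectorFactory(ABC):
    """Hypothetical analogue of the VectorFactory trait: picks a representation."""

    @abstractmethod
    def make(self, values):
        """Build a vector from a full list of values."""


class DenseVectorFactory(VectorFactory):
    """Default behavior: always dense, matching the existing KMeans."""

    def make(self, values):
        return ("dense", list(values))


class SmartVectorFactory(VectorFactory):
    """Chooses sparse when the fraction of nonzeros is small enough."""

    def __init__(self, max_density=0.1):
        # max_density is an illustrative threshold, not a value from the PR.
        self.max_density = max_density

    def make(self, values):
        values = list(values)
        indices = [i for i, v in enumerate(values) if v != 0.0]
        if values and len(indices) / len(values) <= self.max_density:
            # (kind, size, indices, nonzero values)
            return ("sparse", len(values), indices, [values[i] for i in indices])
        return ("dense", values)
```

Keeping the dense factory as the default means existing callers see no change in behavior; only callers that pass a different factory pay the extra runtime of sparse updates.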

For this I made the following changes:
(1) Added a method called reassign to SparseVector that allows changing its indices and values
(2) Allowed axpy to accept SparseVectors
(3) Created a trait called VectorFactory and two implementations of it that are used within the KMeans code
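
As a rough illustration of how changes (1) and (2) could fit together, the sketch below computes y += a * x for two sparse vectors by merging their sorted index lists, then writes the result back through a reassign-style method. This is one reading of the description, written in Python for brevity; the class `SparseVec` and the exact signatures are hypothetical and are not taken from the PR's Scala code.

```python
class SparseVec:
    """Minimal sparse vector: sorted indices plus matching values (illustrative)."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def reassign(self, indices, values):
        # Replace the stored entries in place, in the spirit of change (1).
        self.indices = list(indices)
        self.values = list(values)


def axpy(a, x, y):
    """Compute y += a * x for two sparse vectors of equal size."""
    assert x.size == y.size
    out_idx, out_val = [], []
    i = j = 0
    while i < len(x.indices) or j < len(y.indices):
        # Use size as a sentinel once one side is exhausted.
        xi = x.indices[i] if i < len(x.indices) else x.size
        yj = y.indices[j] if j < len(y.indices) else y.size
        if xi == yj:
            out_idx.append(xi)
            out_val.append(a * x.values[i] + y.values[j])
            i += 1
            j += 1
        elif xi < yj:
            out_idx.append(xi)
            out_val.append(a * x.values[i])
            i += 1
        else:
            out_idx.append(yj)
            out_val.append(y.values[j])
            j += 1
    y.reassign(out_idx, out_val)


x = SparseVec(5, [0, 3], [1.0, 2.0])
y = SparseVec(5, [3, 4], [10.0, 1.0])
axpy(2.0, x, y)
print(y.indices, y.values)  # [0, 3, 4] [2.0, 14.0, 1.0]
```

In this sketch every update allocates new index and value lists, which is one place where the runtime-for-memory trade-off mentioned in the description would show up.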

yinxusen · 2016-01-15T14:19:16Z

Hi @levin-royl, you need to remove the two log files and create a JIRA. See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

levin-royl · 2016-01-17T07:28:40Z

Thank you, I removed the log files and added the following JIRA request:

https://issues.apache.org/jira/browse/KYLIN-1326

yinxusen · 2016-01-17T07:34:15Z

You may have created your JIRA in the wrong place. It should be under Spark, not Kylin: https://issues.apache.org/jira/browse/SPARK

levin-royl · 2016-01-17T08:04:18Z

Sorry, I am a little new to this. For some reason, when choosing "create new" I only had the options Kylin, Atlas, or Apache Infrastructure. Now, through the link you sent, I created the following JIRA request in Spark:

https://issues.apache.org/jira/browse/SPARK-12861

levin-royl · 2016-01-28T08:12:39Z

Hi, just wanted to know if there are any unhandled items on my end WRT this change. Thanks.

srowen · 2016-01-28T08:58:01Z

Read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You may want to search for duplicate JIRAs too. There are several on this topic and k-means.

levin-royl · 2016-01-28T13:13:35Z

Hi, there are indeed some similar issues I found, e.g.:

https://issues.apache.org/jira/browse/SPARK-4039
https://issues.apache.org/jira/browse/SPARK-1212
mesos/spark#736

But the difference is that, in the problem I describe, reducing the dimensions of the problem (i.e., the feature space) to allow using dense vectors is not suitable. Also, the solution I implemented supports this while allowing full flexibility to the user, i.e., using the default dense vector implementation or selecting an alternative (only when the default is not desired).

I will update the JIRA issue on this as well.

Please advise if there are any additional steps I need to do at this point.

Thanks in advance.

levin-royl · 2016-02-03T09:27:04Z

Hi, there are indeed some similar issues I found, e.g.:

https://issues.apache.org/jira/browse/SPARK-4039
https://issues.apache.org/jira/browse/SPARK-1212
mesos/spark#736

But the difference is that, in the problem I describe, reducing the dimensions of the problem (i.e., the feature space) to allow using dense vectors is not suitable. Also, the solution I implemented supports this while allowing full flexibility to the user, i.e., using the default dense vector implementation or selecting an alternative (only when the default is not desired).

I will update the JIRA issue on this as well.

Please advise if there are any additional steps I need to do at this point.

Thanks in advance.

levin-royl · 2016-02-03T09:27:57Z

I wanted to know if you took a look at the code and the proposed solution in general. Are there any comments?

Thanks.

thunterdb · 2016-03-10T23:04:29Z

@levin-royl it looks like @hhbyyh has a branch with some code that tackles the same issue (see the JIRA discussion for more information); you may want to coordinate there.

Also, I suggest you take another look at the pull request section of the guidelines: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

hhbyyh · 2016-03-10T23:50:02Z

I didn't send a PR because there's an ongoing effort to rework the KMeans implementation to use matrix multiplication.
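
For context, one common way to express k-means distance computation through a matrix product is the expansion ||p - c||^2 = ||p||^2 - 2 p.c + ||c||^2, where all the dot products p.c come from a single product of the point matrix with the transposed center matrix. The sketch below shows that identity in plain Python; it is an assumption about what such a formulation can look like, not code from the branch or the effort mentioned above.

```python
def squared_distances(points, centers):
    """Return D where D[i][j] = ||points[i] - centers[j]||^2."""
    p_norms = [sum(v * v for v in p) for p in points]
    c_norms = [sum(v * v for v in c) for c in centers]
    # cross[i][j] = points[i] . centers[j], i.e. the matrix product P * C^T
    cross = [[sum(a * b for a, b in zip(p, c)) for c in centers] for p in points]
    return [
        [p_norms[i] - 2.0 * cross[i][j] + c_norms[j] for j in range(len(centers))]
        for i in range(len(points))
    ]


points = [[0.0, 0.0], [3.0, 4.0]]
centers = [[0.0, 0.0], [3.0, 0.0]]
print(squared_distances(points, centers))  # [[0.0, 9.0], [25.0, 16.0]]
```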

AmplabJenkins · 2016-10-26T15:17:17Z

Can one of the admins verify this patch?

This pr proposed to close stale PRs. Currently, we have 400+ open PRs and there are some stale PRs whose JIRA tickets have been already closed and whose JIRA tickets does not exist (also, they seem not to be minor issues).

// Open PRs whose JIRA tickets have been already closed
Closes apache#11785
Closes apache#13027
Closes apache#13614
Closes apache#13761
Closes apache#15197
Closes apache#14006
Closes apache#12576
Closes apache#15447
Closes apache#13259
Closes apache#15616
Closes apache#14473
Closes apache#16638
Closes apache#16146
Closes apache#17269
Closes apache#17313
Closes apache#17418
Closes apache#17485
Closes apache#17551
Closes apache#17463
Closes apache#17625

// Open PRs whose JIRA tickets does not exist and they are not minor issues
Closes apache#10739
Closes apache#15193
Closes apache#15344
Closes apache#14804
Closes apache#16993
Closes apache#17040
Closes apache#15180
Closes apache#17238

N/A

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17734 from maropu/resolved_pr.

Change-Id: Id2e590aa7283fe5ac01424d30a40df06da6098b5

levin-royl added 4 commits January 13, 2016 12:47

Changes to support KMeans with large feature space

33d760c

add newspace at eof

c5a8e7f

some slight improvements

b004b25

improve runtime performance of solution

98a14dc

remove log files

fc680d4

maropu mentioned this pull request Apr 23, 2017

[BUILD] Close stale PRs #17734

Closed

asfgit closed this in e9f9715 Apr 24, 2017
