This repository has been archived by the owner on Nov 19, 2020. It is now read-only.

Kmeans clustering exception #259

Closed
khan990 opened this issue Jun 29, 2016 · 10 comments

Comments

@khan990

khan990 commented Jun 29, 2016

I am getting the exception below, although I believe my input is correct. I am computing cosine similarity over documents using TF-IDF.

The input to cluster.Compute() is a double[][] where 0 <= input[i][j] <= 1:
int[] index = cluster.Compute(inputs);

I am using https://github.com/primaryobjects/TFIDF for TF-IDF.

System.InvalidOperationException was unhandled
HResult=-2146233079
Message=Generated value is not between 0 and 1.
Source=Accord.Statistics
StackTrace:
at Accord.Statistics.Distributions.Univariate.GeneralDiscreteDistribution.Random(Double[] probabilities)
at Accord.MachineLearning.KMeans.Randomize(Double[][] points, Boolean useSeeding)
at Accord.MachineLearning.KMeans.Compute(Double[][] data, Double threshold, Boolean computeInformation)
at TFIDFExample.Program.Main(String[] args) in C:\Users\jasim\Documents\Visual Studio 2015\Projects\TFIDF_TwitterClustering\TFIDFExample\Program.cs:line 49
at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
InnerException:

@zgrkpnr

zgrkpnr commented Jun 30, 2016

Did you validate your input for every element? Just do the following.

var max = inputs.Select(i => i.Max()).Max();
var min = inputs.Select(i => i.Min()).Min();

After you check the values, please report the result here.
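
A plain min/max check may be easy to misread if the data contains NaN or infinite values, so a stricter element-by-element check can help. The following is a minimal sketch using only standard .NET; the `InputCheck` class and `AllWithin` method are hypothetical names for illustration and are not part of Accord.NET.

```csharp
using System;
using System.Linq;

public static class InputCheck
{
    // Hypothetical helper: returns true only when every value is finite
    // and lies within the closed interval [lo, hi].
    public static bool AllWithin(double[][] inputs, double lo, double hi)
    {
        return inputs.All(row => row.All(v =>
            !double.IsNaN(v) && !double.IsInfinity(v) && v >= lo && v <= hi));
    }
}
```

For the data described in this issue, the call would be `InputCheck.AllWithin(inputs, 0.0, 1.0)`.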

@khan990
Author

khan990 commented Jun 30, 2016

Yes, I validated them.
The min and max both lie within 0 <= input[i][j] <= 1, as mentioned before.

I noticed that k-means works fine on a 1000 x 15000 double array.
But when I ran it on a 10000 x 15000 array, it gave me the error above.

@cesarsouza
Member

Hi all,

It is quite possible that this issue has been fixed in the latest release of the framework (3.2, released a few days ago). If you are able to, could you let us know whether you are still experiencing the issue?

Thanks!

Regards,
Cesar

@jbrant

jbrant commented Nov 11, 2016

Hi Cesar,

I'm actually on version 3.3 and experienced the same issue as described above when clustering data with anywhere from 1,000 to 2,000 dimensions. The data I'm trying to cluster are in the range 0 to 50.

Thanks!
Jonathan

@khan990
Author

khan990 commented Nov 11, 2016

I reduced the number of dimensions, and it worked.
I understand that may not be an option for you,
but give it a try.

@jbrant

jbrant commented Nov 14, 2016

Hi Khan, thank you very much for your reply. Reducing the dimensions did work; however, in my particular situation, I'm unfortunately unable to incur that loss in fidelity. That being said, I switched to uniform seeding (as opposed to the default kmeans++) and didn't have any issues, even at rather high dimensionality.

Thanks again,
Jonathan
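
For readers unfamiliar with the term, uniform seeding picks the initial centroids uniformly at random from the data, so it never builds the distance-weighted probability vector that k-means++ samples from. That is presumably why it avoids the failing code path. The following is a minimal sketch of the idea using only standard .NET; `UniformSeeding.Pick` is a hypothetical name, not Accord.NET's implementation.

```csharp
using System;

public static class UniformSeeding
{
    // Hypothetical helper: picks k distinct rows of `points` uniformly at
    // random (partial Fisher-Yates shuffle) and returns copies of them.
    public static double[][] Pick(double[][] points, int k, Random rng)
    {
        if (k < 1 || k > points.Length)
            throw new ArgumentOutOfRangeException(nameof(k));

        int[] idx = new int[points.Length];
        for (int i = 0; i < idx.Length; i++) idx[i] = i;

        var centroids = new double[k][];
        for (int i = 0; i < k; i++)
        {
            int j = rng.Next(i, idx.Length); // i <= j < idx.Length
            int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
            centroids[i] = (double[])points[idx[i]].Clone();
        }
        return centroids;
    }
}
```

Every point has the same chance of being chosen regardless of dimensionality, at the cost of the better-spread initial centroids that k-means++ aims for.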

@ytakashina

ytakashina commented Apr 20, 2017

I had the same issue.

This exception is originally thrown by GeneralDiscreteDistribution.Random(), which is used in ClusterCollection.Randomize() for k-means++ seeding.

The exception is thrown when cumulativeSum in GeneralDiscreteDistribution.Random() stays below the value uniform, which is drawn at random from [0, 1). Maybe something is wrong with the calculation of D in ClusterCollection.Randomize().

version: 3.4.0
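
The failure mode described above can be illustrated with a small sketch of sampling by cumulative sum. This is not Accord.NET's code; `DiscreteSampler.Sample` is a hypothetical name, and it returns -1 where the library throws. If the probabilities sum to less than the drawn value, or if any of them is NaN, no index is ever selected.

```csharp
using System;

public static class DiscreteSampler
{
    // Hypothetical helper: returns the first index at which the running
    // sum of probabilities exceeds `uniform`, or -1 if that never happens.
    public static int Sample(double[] probabilities, double uniform)
    {
        double cumulativeSum = 0;
        for (int i = 0; i < probabilities.Length; i++)
        {
            cumulativeSum += probabilities[i];
            if (uniform < cumulativeSum)
                return i;
        }
        return -1; // the weights did not cover `uniform`
    }
}
```

With weights that sum to one, every draw from [0, 1) selects some index. With weights that sum to 0.5, any draw of 0.5 or more falls through.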

cesarsouza added a commit that referenced this issue Jun 30, 2017
… distances to probabilities in the K-Means++ initialization.

Updates GH-259: K-means clustering exception

@cesarsouza
Member

I have not been able to reproduce the error myself yet, but this is probably happening due to a loss of precision when computing the discrete probability weights, making the weight vector not sum up to one. One of the possible reasons for that is the probabilities for each point becoming too small.

I have added some handling to sidestep this issue and also present some better error messages.

Regards,
Cesar
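
As an illustration of the kind of handling described above, here is a minimal sketch of converting squared distances to probabilities while guarding against degenerate weights. It is not the change made in the referenced commit; `SeedingWeights.ToProbabilities` and the uniform fallback are assumptions chosen for illustration.

```csharp
using System;

public static class SeedingWeights
{
    // Hypothetical helper: converts squared distances into sampling
    // probabilities. Falls back to uniform weights when the input contains
    // NaN, infinite, or negative values, or when all distances are zero.
    public static double[] ToProbabilities(double[] squaredDistances)
    {
        int n = squaredDistances.Length;
        double max = 0;
        foreach (double d in squaredDistances)
        {
            if (double.IsNaN(d) || double.IsInfinity(d) || d < 0)
                return Uniform(n);
            if (d > max) max = d;
        }
        if (max == 0)
            return Uniform(n);

        // Divide by the maximum first so the sum lies in [1, n],
        // then normalize so the weights sum to one.
        var p = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++)
        {
            p[i] = squaredDistances[i] / max;
            sum += p[i];
        }
        for (int i = 0; i < n; i++)
            p[i] /= sum;
        return p;
    }

    private static double[] Uniform(int n)
    {
        var p = new double[n];
        for (int i = 0; i < n; i++) p[i] = 1.0 / n;
        return p;
    }
}
```

Scaling by the maximum before summing keeps the intermediate sum away from very small or very large magnitudes, which is one way to reduce the precision loss mentioned above.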

@cesarsouza
Member

Should have been fixed in release 3.6.0.

@Afgankhan

Hello, I want an array of clustered data points. For example, given array[10]={1,2,3,4,5,6,7,8,9},
after clustering I want the indexes of the first cluster's data first, e.g. with number of clusters=2, first cluster data {2,5,9} and second cluster data {1,3,4,6,7,8},
the resulting array would be {2,5,9,1,3,4,6,7,8}.
Thanks


6 participants