alessandroTorresani/bdmp
DATA MINING PROJECT 1 - UNCERTAIN CLUSTERING

Overview

DataminingProject.jar is a Java library that lets you:

1) Generate uncertain synthetic points of the form (id, (d1,d2,...,dn), prob), where:
   * id: identifier of the point
   * d1,d2,...,dn: dimensions of the uncertain point (coordinates)
   * prob: probability that the uncertain point with identifier id is at position (d1,d2,...,dn)

   EXAMPLE: given a point with id = 'A', possible locations for this point are
   ('A', (10,20), 0.8) and ('A', (20,30), 0.2).
   The probabilities of all locations of a point must sum to 1.

   The library provides three algorithms for sampling uncertain data:

   1) Simple sampling (numberOfSamples, minValue, maxValue, highdifference):
      for each point, generates two uncertain points. They have the same probability if highdifference is false and different probabilities if highdifference is true.
      EXAMPLE: generate id = 'A', then generate two points:
      ('A', (x1,x2,...,xn), 0.8), ('A', (x1,x2,...,xn), 0.2) if highdifference = true
      ('A', (x1,x2,...,xn), 0.5), ('A', (x1,x2,...,xn), 0.5) if highdifference = false

   2) Random sampling (numberOfSamples, MaxNumberOfUncertainPoints, minValue, maxValue):
      for each point, generates a random number of uncertain points with random probabilities.
      EXAMPLE: generate id = 'A', then generate m <= MaxNumberOfUncertainPoints points with random probabilities and random coordinates:
      ('A', (x1,x2,...,xn), p1) ... ('A', (x1,x2,...,xn), pm), where p1+p2+...+pm = 1

   3) Poisson sampling (numberOfSamples, lambdas):
      each point is represented as a Poisson random variable, and its uncertain points are sampled at random from this distribution.
      EXAMPLE: generate id = 'A', then create a PoissonDistribution with mean equal to lambdas[this point].
      Sample points ('A', (poisson.sample), poisson.prob(poisson.sample)) until the probabilities sum to 1.

2) Transform uncertain data into certain data using one of two algorithms:

   1) MostProbable algorithm: for each point, this algorithm chooses the most probable uncertain point.
      It is like ignoring the fact that points are uncertain and pretending that the most probable location is the only value of the point.
      EXAMPLE: given the list of uncertain locations for a point 'A':
      ('A', (x1,x2,...,xn), p1) ... ('A', (x1,x2,...,xn), pm)
      OUTPUT: point ('A', (x1,x2,...,xn), pk), where pk = max(p1,p2,...,pm)

   2) AverageDistance algorithm: for each point, the algorithm computes the expected mean of the uncertain points and chooses the expected mean as the actual value.
      EXAMPLE: given the list of uncertain locations for a point 'A':
      ('A', (x1,x2,...,xn), p1) ... ('A', (x1,x2,...,xn), pm)
      OUTPUT: point ('A', (x1a,x2a,...,xna), 1), where each coordinate xia is the sum of the i-th coordinate of every uncertain location weighted by its probability:
      xia = xi(1)*p1 + xi(2)*p2 + ... + xi(m)*pm, with xi(j) the i-th coordinate of the j-th location.
      This is the expected value of the discrete random variable that models the point's location.

3) Run the k-means algorithm on the output of the MostProbable and AverageDistance methods.
   Note that the output of the MostProbable and AverageDistance methods is a set of points without any notion of uncertainty.
   EXAMPLE: given a list of points [(x1,x2,...,xn)], run the k-means algorithm.
   OUTPUT: list of clusters [[cluster1]...[clusterk]].
   Note: for now k is fixed to 3. In future implementations we will allow choosing it or let the program find the best one.

4) Plot the output of the clustering algorithm with two Python scripts, one for 2D and one for 3D.
   OUTPUT: a 2D or 3D plot drawn with matplotlib

Execution

Requires Java 8, Python 2.7 and the Python library matplotlib.
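As an illustration of the two transformation algorithms described in the Overview, here is a minimal Java sketch. It is not the code of DataminingProject.jar: the class and method names (UncertainLocation, TransformSketch, mostProbable, expectedLocation) are hypothetical and chosen only for this example, and ties in MostProbable are resolved by keeping the first maximum, which is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical data holder: one possible location of a point, (id, (d1,...,dn), prob).
final class UncertainLocation {
    final String id;        // identifier of the point
    final double[] coords;  // one possible position (d1,...,dn)
    final double prob;      // probability of that position

    UncertainLocation(String id, double[] coords, double prob) {
        this.id = id;
        this.coords = coords;
        this.prob = prob;
    }
}

// Hypothetical helper class, not part of the library's API.
final class TransformSketch {
    // MostProbable: keep the location with the highest probability
    // (assumption: the first one wins on ties).
    static UncertainLocation mostProbable(List<UncertainLocation> locations) {
        UncertainLocation best = locations.get(0);
        for (UncertainLocation loc : locations) {
            if (loc.prob > best.prob) {
                best = loc;
            }
        }
        return best;
    }

    // AverageDistance: probability-weighted sum of each coordinate,
    // returned as a single certain location with probability 1.
    static UncertainLocation expectedLocation(List<UncertainLocation> locations) {
        UncertainLocation first = locations.get(0);
        double[] mean = new double[first.coords.length];
        for (UncertainLocation loc : locations) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += loc.coords[i] * loc.prob;
            }
        }
        return new UncertainLocation(first.id, mean, 1.0);
    }

    public static void main(String[] args) {
        // The example point 'A' from the Overview.
        List<UncertainLocation> a = Arrays.asList(
            new UncertainLocation("A", new double[] {10, 20}, 0.8),
            new UncertainLocation("A", new double[] {20, 30}, 0.2));
        System.out.println(Arrays.toString(mostProbable(a).coords));
        System.out.println(Arrays.toString(expectedLocation(a).coords));
    }
}
```

For the example point 'A', MostProbable keeps the location (10,20), while the expected location is approximately (12,22), since 10*0.8 + 20*0.2 = 12 and 20*0.8 + 30*0.2 = 22.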
1) From the command line, run: java -jar DataminingProject.jar
2) Enter the dimension of the points when prompted
3) Choose a sampling method when prompted
4) Results will be stored in 3 folders:
   1) $HOME/Documents/bdmpFiles/input/ : sampled data, output of the AverageDistance and MostProbable algorithms
   2) $HOME/Documents/bdmpFiles/output/bdmpFiles/output/average : output of the k-means algorithm on data computed by the AverageDistance algorithm
   3) $HOME/Documents/bdmpFiles/output/bdmpFiles/output/mostProbable : output of the k-means algorithm on data computed by the MostProbable algorithm
5) If you chose 2- or 3-dimensional points, you can plot each cluster using the Python scripts plotter2D.py and plotter3D.py.
   Usage: python [scriptname] [foldername]
   So if you want to plot 2D points you will use:
      python plotter2D.py average        to plot the clustering obtained from the AverageDistance method
      python plotter2D.py MostProbable   to plot the clustering obtained from the MostProbable method

[NOTE]: The Python scripts use absolute paths, so they can be run from any location.
[NOTE]: Each cluster will have a different color inside the plot.
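To check which result folders a run has produced, the locations listed in step 4 can be resolved from Java as sketched below. The folder names are copied exactly as documented above. The class ResultFolders and its methods are hypothetical helpers, and the sketch assumes that the JVM property user.home points to the same directory as $HOME.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of DataminingProject.jar.
final class ResultFolders {
    // Base folder from step 4; assumes user.home matches $HOME.
    static Path base() {
        return Paths.get(System.getProperty("user.home"), "Documents", "bdmpFiles");
    }

    // Sampled data and the output of the two transformation algorithms.
    static Path input() {
        return base().resolve("input");
    }

    // k-means output for a method folder ("average" or "mostProbable"),
    // with the path reproduced exactly as documented in step 4.
    static Path clusters(String method) {
        return base().resolve(Paths.get("output", "bdmpFiles", "output", method));
    }

    public static void main(String[] args) {
        Path[] folders = {input(), clusters("average"), clusters("mostProbable")};
        for (Path p : folders) {
            // Report each folder and whether it currently exists.
            System.out.println(p + (Files.isDirectory(p) ? "" : " (not found)"));
        }
    }
}
```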