GitHub - alessandroTorresani/bdmp

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
ajt-2.11		ajt-2.11
bin		bin
commons-math3-3.6.1		commons-math3-3.6.1
jama		jama
javaml-0.1.7		javaml-0.1.7
scripts		scripts
src		src
target		target
pom.xml		pom.xml
readMe		readMe

Repository files navigation

DATA MINING PROJECT 1 - UNCERTAIN CLUSTERING

Overview

DataminingProject.jar is a Java library that permits to:
1) Generate uncertain synthetic points with the shape (id, (d1,d2,...,dn),prob), where: 
    * id: identifier 
    * d1,d2,...,dn: dimensions of the uncertain points (coordinates)
    * prob: probability that the uncertain point with identifier=id is at position (d1,d2,...,dn) 
    EXAMPLE:    given a point with id = 'A', possible location for this point are ('A',(10,20),0,8)), ('A',(20,30),0.2)). Prob must sum to 1.

    This library has three different algorithms for sampling uncertain data:
    1) Simple sampling (numberOfSamples, minValue, maxValue, highdifference): for each point generates two uncertain points having the same probability if highdifference is false;
    Different probabilities if highdifference is true
    EXAMPLE:    generate id = 'A', generate two points: ('A', (x1,x2,...,xn),0.8), ('A', (x1,x2,...,xn),0.2) if highdifference = true. 
                generate id = 'A', generate two points: ('A', (x1,x2,...,xn),0.5), ('A', (x1,x2,...,xn),0.5) if highdifference = false.

    2) Random sampling (numberOfSamples, MaxNumberOfUncertainPoints, minValue, maxValue): for each point generates a random number of uncertain points with random probabilities
    EXAMPLE:    generate id = 'A', generate m <= MaxNumberOfUncertainPoints points with random probabilities and ramdom coordinates
                ('A', (x1,x2,...,xn),p1)...('A', (x1,x2,...,xn),pm) where p1+p2+...+pm = 1

    3) Poisson sampling (numberOfSamples, lambdas): each point is represented as a Poisson random variables. It will sample at random point following this distribution 
    EXAMPLE:    generate id = 'A', generate a PoissonDistribution with mean equal to lambdas[this point]. Sample points 
                ('A', (poisson.sample),poisson.prob(poisson.sample)) till the sum of the probabilities sum to 1

2) Transform uncertain data to certain data according to two different algorithms:
    1) MostProbable algorithm: for each point this algorithm chooses the most probable uncertain point. It is like ignoring the fact that points are uncertain and pretending that 
    the most probable one is the only value of the point.
    EXAMPLE:    given list of uncertain locations for a point 'A': ('A', (x1,x2,...,xn),p1)...('A', (x1,x2,...,xn),pm) 
                OUTPUT: point ('A', (x1,x2,...,xn),pk) where pk = max(p1,p2,...,pm)

    2) AverageDistance algorithm: for each point the algorithm computes the expected mean of the uncertain points and chooses the expected mean as the actual value.
    EXAMPLE:    given list of uncertain locations for a point 'A': ('A', (x1,x2,...,xn),p1)...('A', (x1,x2,...,xn),pm)
                OUTPUT: point ('A', (x1a,x2a,...,xna),1) where each xa = (xa*p1 + xa*p2 + ... + xa*pm). This is the expected value of the discrete random variable that model point locations.

3) Run a k-Mean algorithm on the output of MostProbable and AverageDistance methods. Note that the output of mostProbable and AverageDistance methods is a set of points without a concept of uncertainty.
    EXAMPLE:    given a list of points [(x1,x2,...,xn)] computes the k-Mean algorithm.
                OUTPUT: list of clusters [[cluster1]...[clusterk]]. Note: for now k is fixed to 3. In future implementations we will allow to choose it or let the program to find the best one.

4) Plot the output of clustering algorithm with 2 python scripts, one for 2D and one for 3D.
    OUTPUT: 2D or 3D plotter using matplotlib


Execution

Requires java8,python 2.7 and python library matplotlib.

1) From command line run java -jar DataminingProject.jar
2) Prompt the dimension of the points
3) Prompt a sampling method 
4) Results will be stored in 3 folders: 
    1) $HOME/Documents/bdmpFiles/input/                                   :       sampled data, output of AverageDistance and MostProbable algorithms
    2) $HOME/Documents/bdmpFiles/output/bdmpFiles/output/average          :       output of k-Mean algorithm on data computed by AverageDistance algorihtm
    3) $HOME/Documents/bdmpFiles/output/bdmpFiles/output/mostProbable     :       output of k-Mean algorithm on data computed by MostProbable algorihtm
5) If you chose 2 or 3 dimension points, you can plot each cluster using python scripts: plotter2D.py and plotter3D.py

Usage: python [scriptname][foldername]

So if you want to plot 2D points you will use:
python plotter2D.py average         if you want to plot clustering obtained from AverageDistance method
python plotter2D.py MostProbable    if you want to plot clustering obtained from MostProbable method

[NOTE]: python scripts use absolute paths, so they can be run in any location.
[NOTE]: Each cluster will have a different color inside the plot.