## Model-based clustering

Model-based clustering assumes that the data were generated by a model and tries to recover that model from the data. The recovered model then defines the clusters and an assignment of data points to clusters. A commonly used criterion for estimating the model parameters is maximum likelihood.
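
To make the criterion concrete, here is how it can be written when the model is assumed to be a mixture of $K$ Gaussians, the family used later in this notebook. The symbols $\pi_k$ (mixing weights), $\mu_k$ (means), $\Sigma_k$ (covariances) and $N$ (number of data points) are notation introduced here, not taken from the original text.

$$
\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),
\qquad \theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}
$$

Under the same assumption, a point $x_i$ is assigned to the component with the largest posterior probability:

$$
p(k \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},
\qquad \text{cluster}(x_i) = \arg\max_{k} \, p(k \mid x_i)
$$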


In [6]:
import Pkg; Pkg.add(["PyPlot", "RDatasets", "Clustering", "GaussianMixtures"])


In [7]:
using RDatasets
using PyPlot
using Clustering
using GaussianMixtures






In [None]:
iris = dataset("datasets", "iris")
classes = unique(iris[:, 5])   # the three species labels
first(iris, 6)                 # first(df, n) replaces the removed head(df)

In [None]:

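# Group the four iris measurements by a class (or cluster) index.
# `f` maps a row number to an index in 1:length(classes); each entry of the
# result is a 4×n matrix with one column per observation.
function breakDataByClass(f, classes=classes)
    # Julia 1.x requires `undef` when allocating an uninitialized array
    dataByClass = Vector{Array{Float64}}(undef, length(classes))
    for i = 1:length(classes)
        dataByClass[i] = Float64[]
    end
    for i = 1:size(iris, 1)
        current_class = f(i)
        for j = 1:4
            push!(dataByClass[current_class], iris[i, j])
        end
    end
    for i = 1:length(classes)
        dataByClass[i] = reshape(dataByClass[i], 4, div(length(dataByClass[i]), 4))
    end
    dataByClass
end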

In [None]:
dataByClass = breakDataByClass(x -> findfirst(==(iris[x, 5]), classes))
function orgPlot(dataByClass, classes) # Plot each class with its own marker
    p_syms = ["*", "+", "o", "k+"]
    for i = 1:length(classes)
        # x axis: first feature, y axis: third feature
        plot(dataByClass[i][1, :], dataByClass[i][3, :], p_syms[i], label=string(classes[i]))
    end
    legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
               ncol=3, mode="expand", borderaxespad=0.)
    xlabel("$(names(iris)[1])")
    ylabel("$(names(iris)[3])");
end

In [None]:
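using Random
Random.seed!(67)   # srand was replaced by Random.seed! in Julia 1.0
R = kmeans(permutedims(Matrix(iris[:, 1:4])), 3)   # kmeans expects features × observations
dataByCluster = breakDataByClass(z -> R.assignments[z])
figure("Comparing stuff", figsize=(15,7))
suptitle("k-means prediction")
subplot(1,2,1)
orgPlot(dataByCluster, ["Cluster$i" for i=1:3]);
subplot(1,2,2)
orgPlot(dataByClass, classes);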

The figure on the left shows the k-means clusters and the one on the right shows the original classification. Because k-means assigns each point to the nearest centroid by Euclidean distance, it favors roughly spherical clusters and struggles to separate the two overlapping upper clusters into their original classes.
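
The visual comparison can be backed by counts. The sketch below tabulates how many points of each true class land in each cluster. `contingency_table` is a hypothetical helper written for this illustration (it is not part of Clustering.jl), and the toy vectors are made up.

```julia
# Hypothetical helper: rows are clusters, columns are true classes.
function contingency_table(assignments::AbstractVector{<:Integer},
                           labels::AbstractVector)
    length(assignments) == length(labels) ||
        throw(ArgumentError("assignments and labels must have equal length"))
    classes = unique(labels)
    table = zeros(Int, maximum(assignments), length(classes))
    for (a, l) in zip(assignments, labels)
        table[a, findfirst(==(l), classes)] += 1
    end
    return classes, table
end

# Toy input, made up for illustration only.
toy_assignments = [1, 1, 2, 2, 2, 3]
toy_labels = ["a", "a", "b", "b", "c", "c"]
toy_classes, toy_table = contingency_table(toy_assignments, toy_labels)
println(toy_classes)
println(toy_table)
```

In this notebook the same helper should be callable as `contingency_table(R.assignments, iris[:, 5])`. A clustering that matches the species well would put most of each column's mass in a single row.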

In [None]:
gm = GMM(3, Matrix(iris[:, 1:4]));
prob_pos = gmmposterior(gm, Matrix(iris[:, 1:4]))[1]   # posterior probability of each component
ass = [argmax(prob_pos[i, :]) for i = 1:size(iris, 1)]   # indmax was renamed argmax in Julia 1.0
dataByCluster = breakDataByClass(x -> ass[x])
figure("Comparing stuff", figsize=(15,7))
suptitle("GMM prediction")
subplot(1,2,1)
orgPlot(dataByCluster, ["Cluster$i" for i=1:3]);
subplot(1,2,2)
orgPlot(dataByClass, classes);

By default, `GMM` fits components with diagonal covariance matrices. This does not give the model enough flexibility and hence, in this case, the results are not much better than k-means.
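
To see why a diagonal covariance is restrictive, the following sketch fits a single Gaussian to correlated data twice, once with the full sample covariance and once with only its diagonal, and compares the average log-likelihood. It uses only the standard library. The data are synthetic, and `avg_gaussian_loglik` is a helper written for this illustration, not a GaussianMixtures.jl function.

```julia
using LinearAlgebra, Random, Statistics

# Average log-density of the rows of X under a single Gaussian N(mu, sigma).
function avg_gaussian_loglik(X::AbstractMatrix, mu::AbstractVector, sigma::AbstractMatrix)
    n, d = size(X)
    C = cholesky(Symmetric(Matrix(sigma)))
    total = 0.0
    for i in 1:n
        z = C.L \ (X[i, :] .- mu)   # whitened residual
        total += -0.5 * (d * log(2 * pi) + logdet(C) + dot(z, z))
    end
    return total / n
end

Random.seed!(67)
# Synthetic, strongly correlated 2-D data (made up for illustration).
n = 500
x1 = randn(n)
X = hcat(x1, 0.9 .* x1 .+ 0.3 .* randn(n))

mu = vec(mean(X, dims=1))
sigma_full = cov(X, corrected=false)          # maximum likelihood estimate
sigma_diag = Diagonal(diag(sigma_full))       # same variances, correlation dropped

ll_full = avg_gaussian_loglik(X, mu, sigma_full)
ll_diag = avg_gaussian_loglik(X, mu, sigma_diag)
println("full: ", ll_full, "  diagonal: ", ll_diag)
```

The diagonal model keeps each feature's variance but cannot represent the tilt of the point cloud, so it fits correlated data less well. The same effect applies to each component of a mixture.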

In [None]:
gm = GMM(3, Matrix(iris[:, 1:4]), kind=:full);
prob_pos = gmmposterior(gm, Matrix(iris[:, 1:4]))[1]   # posterior probability of each component
ass = [argmax(prob_pos[i, :]) for i = 1:size(iris, 1)]   # indmax was renamed argmax in Julia 1.0
dataByCluster = breakDataByClass(x -> ass[x])
figure("Comparing stuff", figsize=(15,7))
suptitle("GMM prediction")
subplot(1,2,1)
orgPlot(dataByCluster, ["Cluster$i" for i=1:3]);
subplot(1,2,2)
orgPlot(dataByClass, classes);

A more flexible GMM with full covariance matrices does a much better job. Note the leftmost Cluster1 point, which corresponds to virginica.