# Implementation Assignment

## INF2912 - Combinatorial Optimization
### Prof. Marcus Vinicius Soledade Poggi de Aragão
### 2015-2

### Ciro Cavani
#### BigData / Globo.com

Clustering algorithms.

## Contents

This notebook develops and evaluates the approximation algorithm for the P-Center problem (farthest-first traversal).

The evaluation maps each cluster produced by the algorithm to the true cluster that generated the majority of the items assigned to it, and then compares the mapped labels with the true ones.
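
To make the majority-based mapping concrete, here is a minimal sketch on toy data. It assumes Julia 1.x, and `majority_mapping`, `truth` and the toy vectors are hypothetical names introduced for illustration; the notebook's own `Clustering.mapping` may be implemented differently (for example, in how it handles two clusters with the same majority label).

```julia
# Hypothetical helper (not the notebook's `mapping`): for each cluster id,
# pick the true label that occurs most often among the items assigned to it.
function majority_mapping(truth::Vector{Int}, assignments::Vector{Int}, k::Int)
    counts = zeros(Int, k, k)            # counts[c, t]: items in cluster c whose true label is t
    for (c, t) in zip(assignments, truth)
        counts[c, t] += 1
    end
    [argmax(counts[c, :]) for c in 1:k]  # most frequent true label for each cluster
end

# Toy data: almost the same grouping, but with different cluster ids.
truth       = [1, 1, 1, 2, 2, 3, 3, 3]
assignments = [2, 2, 2, 3, 3, 1, 1, 2]

centermap  = majority_mapping(truth, assignments, 3)
prediction = [centermap[c] for c in assignments]  # cluster ids translated to true labels
```

With this toy input, cluster 2 is dominated by true label 1, so the last item (true label 3, assigned to cluster 2) is the only one counted as a mistake after the mapping.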

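P-Center performed very well on the larger datasets: 99.9% accuracy on the "small" dataset (1,000 items) and 99.7% on the "large" one (10,000 items), against 68.0% on the tiny dataset (100 items).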

## Dataset

In [1]:
include("../src/clustering.jl")
import Inf2912Clustering
const Clustering = Inf2912Clustering



Inf2912Clustering

In [2]:
dataset = Clustering.dataset_tiny()
Clustering.summary(dataset)
sleep(0.2)

Clusters: 3
Dimension (features): 16
Features per Cluster: 3
Probability of Activation: 0.8

Size: 100
Min Cluster size: 20
Max Cluster size: 40
Cluster 1 size: 39
Cluster 2 size: 29
Cluster 3 size: 32


### P-Center - Centroid Location Problem

Solving the *P-Center* problem consists of choosing a representative object (center) for each group and assigning every object to the group whose representative is *closest*.

https://en.wikipedia.org/wiki/Metric_k-center

https://en.wikipedia.org/wiki/Farthest-first_traversal
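
In symbols, the metric k-center problem described in the links above asks for $k$ centers that minimize the largest distance from any object to its nearest center. The notation $X$ (set of objects), $C$ (set of centers) and $d$ (distance) is introduced here for illustration and does not appear in the notebook, whose code uses the Euclidean norm as $d$:

$$
\min_{C \subseteq X,\ |C| = k} \; \max_{x \in X} \; \min_{c \in C} d(x, c)
$$

Farthest-first traversal builds $C$ greedily: starting from an arbitrary object $c_1$, it repeatedly adds the object that is currently farthest from the centers chosen so far,

$$
c_{j+1} = \arg\max_{x \in X} \; \min_{c \in \{c_1, \dots, c_j\}} d(x, c)
$$

When $d$ satisfies the triangle inequality, this greedy rule is known to produce a solution whose objective value is at most twice the optimum.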


In [3]:
let
    k = 3
    data = dataset.input.data
    
    centers = Array(Array{Int64,1}, 0)
    i = rand(1:length(data))
    push!(centers, data[i])
    
    min_dist(v) = minimum(map(c -> norm(c - v), centers))
    max_index() = indmax(map(min_dist, data))
    
    while length(centers) < k
        i = max_index()
        push!(centers, data[i])
    end
    
    cluster(v) = indmin(map(c -> norm(c - v), centers))
    
    assignments = zeros(Int, length(data))
    for (i, v) in enumerate(data)
        assignments[i] = cluster(v)
    end
    
    assignments
end

100-element Array{Int64,1}:
 1
 1
 2
 1
 3
 1
 3
 1
 2
 1
 2
 1
 1
 ⋮
 1
 2
 1
 2
 2
 1
 3
 1
 3
 1
 1
 3

In [4]:
import Clustering: Input, Dataset

"P-Center clustering algorithm (farthest-first traversal)."
function pcenter(input::Input, k::Int)
    data = input.data
    
    # Start from a randomly chosen data point as the first center.
    centers = Array(Array{Int64,1}, 0)
    i = rand(1:length(data))
    push!(centers, data[i])
    
    # Distance from v to its nearest center, and index of the point farthest from all centers.
    min_dist(v) = minimum(map(c -> norm(c - v), centers))
    max_index() = indmax(map(min_dist, data))
    
    # Repeatedly promote the farthest point to a center until there are k centers.
    while length(centers) < k
        i = max_index()
        push!(centers, data[i])
    end
    
    # Assign each point to the index of its nearest center.
    cluster(v) = indmin(map(c -> norm(c - v), centers))
    
    assignments = zeros(Int, length(data))
    for (i, v) in enumerate(data)
        assignments[i] = cluster(v)
    end
    
    assignments
end

pcenter(dataset::Dataset, k::Int) = pcenter(dataset.input, k)

pcenter(dataset, 3)

100-element Array{Int64,1}:
 3
 3
 3
 1
 2
 3
 2
 1
 1
 3
 1
 1
 1
 ⋮
 1
 3
 3
 1
 1
 1
 3
 1
 2
 1
 2
 1

In [5]:
import Clustering.mapping

"P-Center clustering algorithm (farthest-first traversal), with its clusters mapped onto the dataset's predefined groups."
function pcenter_approx(dataset::Dataset, k::Int)
    assignments = pcenter(dataset, k)
    centermap = mapping(dataset, assignments, k)
    map(c -> centermap[c], assignments)
end

let
    k = dataset.clusters
    @time prediction = pcenter_approx(dataset, k)
    Clustering.evaluation_summary(dataset, prediction; verbose=true)
    sleep(0.2)
end

  0.110913 seconds (117.02 k allocations: 5.599 MB, 6.08% gc time)
Confusion Matrix:

[30 6 3
 10 14 5
 2 6 24]

Size: 100
Correct: 68
Mistakes: 32
Accuracy: 68.0%

Cluster 1

Size: 39
Accuracy: 79.0%
Precision: 71.43%
Recall: 76.92%
F-score: 0.74

True Positive: 30 (76.92%)
True Negative: 49 (80.33%)
False Negative: 9 (28.12%)
False Positive: 12 (37.5%)

Cluster 2

Size: 29
Accuracy: 73.0%
Precision: 53.85%
Recall: 48.28%
F-score: 0.51

True Positive: 14 (48.28%)
True Negative: 59 (83.1%)
False Negative: 15 (46.88%)
False Positive: 12 (37.5%)

Cluster 3

Size: 32
Accuracy: 84.0%
Precision: 75.0%
Recall: 75.0%
F-score: 0.75

True Positive: 24 (75.0%)
True Negative: 60 (88.24%)
False Negative: 8 (25.0%)
False Positive: 8 (25.0%)
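
The per-cluster figures above can be reproduced from the confusion matrix alone. The sketch below assumes Julia 1.x and that rows are true clusters and columns are predicted clusters (consistent with the row sums 39, 29 and 32 matching the cluster sizes); `cluster_metrics` is a hypothetical helper, not part of `Inf2912Clustering`.

```julia
# Confusion matrix copied from the output above:
# rows are true clusters, columns are predicted clusters (assumed orientation).
confusion = [30  6  3;
             10 14  5;
              2  6 24]

# Hypothetical helper: one-vs-rest metrics for cluster c.
function cluster_metrics(confusion::Matrix{Int}, c::Int)
    total = sum(confusion)
    tp = confusion[c, c]
    fn = sum(confusion[c, :]) - tp   # members of c predicted as another cluster
    fp = sum(confusion[:, c]) - tp   # members of other clusters predicted as c
    tn = total - tp - fn - fp
    prec = tp / (tp + fp)
    rec  = tp / (tp + fn)
    (tp = tp, tn = tn, fn = fn, fp = fp,
     accuracy  = (tp + tn) / total,
     precision = prec,
     recall    = rec,
     fscore    = 2 * prec * rec / (prec + rec))
end

m = cluster_metrics(confusion, 1)
println(m)
```

For cluster 1 this yields 30 true positives, 9 false negatives and 12 false positives, hence precision 30/42 and recall 30/39, in line with the 71.43% and 76.92% reported. The percentages printed next to the false negative and false positive counts appear to be relative to the 32 total mistakes (for example 9/32 = 28.12%), not to the cluster size.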



In [6]:
Clustering.test_dataset("small", pcenter_approx)
sleep(0.2)

  0.022010 seconds (33.10 k allocations: 11.160 MB, 34.12% gc time)
Confusion Matrix:

[367 0 0
 1 265 0
 0 0 367]

Size: 1000
Correct: 999
Mistakes: 1
Accuracy: 99.9%

Cluster 1

Size: 367
Accuracy: 99.9%
Precision: 99.73%
Recall: 100.0%
F-score: 1.0

True Positive: 367 (100.0%)
True Negative: 632 (99.84%)
False Negative: 0 (0.0%)
False Positive: 1 (100.0%)

Cluster 2

Size: 266
Accuracy: 99.9%
Precision: 100.0%
Recall: 99.62%
F-score: 1.0

True Positive: 265 (99.62%)
True Negative: 734 (100.0%)
False Negative: 1 (100.0%)
False Positive: 0 (0.0%)

Cluster 3

Size: 367
Accuracy: 100.0%
Precision: 100.0%
Recall: 100.0%
F-score: 1.0

True Positive: 367 (100.0%)
True Negative: 633 (100.0%)
False Negative: 0 (0.0%)
False Positive: 0 (0.0%)



In [7]:
Clustering.test_dataset("large", pcenter_approx)
sleep(0.2)

  0.254922 seconds (339.12 k allocations: 111.643 MB, 38.77% gc time)
Confusion Matrix:

[3810 3 1
 8 3948 17
 1 0 2212]

Size: 10000
Correct: 9970
Mistakes: 30
Accuracy: 99.7%

Cluster 1

Size: 3814
Accuracy: 99.87%
Precision: 99.76%
Recall: 99.9%
F-score: 1.0

True Positive: 3810 (99.9%)
True Negative: 6177 (99.85%)
False Negative: 4 (13.33%)
False Positive: 9 (30.0%)

Cluster 2

Size: 3973
Accuracy: 99.72%
Precision: 99.92%
Recall: 99.37%
F-score: 1.0

True Positive: 3948 (99.37%)
True Negative: 6024 (99.95%)
False Negative: 25 (83.33%)
False Positive: 3 (10.0%)

Cluster 3

Size: 2213
Accuracy: 99.81%
Precision: 99.19%
Recall: 99.95%
F-score: 1.0

True Positive: 2212 (99.95%)
True Negative: 7769 (99.77%)
False Negative: 1 (3.33%)
False Positive: 18 (60.0%)

