# Solutions to the Clustering Task




## Clustering
The main clustering package for Julia is, unsurprisingly, named [Clustering.jl](https://github.com/JuliaStats/Clustering.jl).
 - It supports K-means, K-medoids, Affinity Propagation, and DBSCAN.
 - It also supports hierarchical clustering, but that is not currently in the docs.
 
You'll also want [Distances.jl](https://github.com/JuliaStats/Distances.jl) for all your distance metric needs.
It is traditional with word2vec to use cosine distance.
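
To make the metric concrete, here is a minimal sketch of cosine distance written with only the standard library. It assumes Julia 1.x; the helper name `cosine_dist` and the three toy vectors are made up for illustration and are not part of the dataset. It should compute the same quantity as `CosineDist()` from Distances.jl.

```julia
using LinearAlgebra  # standard library: dot, norm

# Cosine distance: 1 - cos(angle between a and b).
# `cosine_dist` is a hypothetical helper, not part of Distances.jl.
cosine_dist(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

# Hypothetical 3-dimensional toy vectors, not real embeddings.
a = Float32[1, 2, 3]
b = Float32[2, 4, 6]    # same direction as a, twice the length
c = Float32[-3, 0, 1]   # orthogonal to a (dot product is 0)

println(cosine_dist(a, b))  # close to 0: vector length is ignored
println(cosine_dist(a, c))  # close to 1: orthogonal vectors
```

Because only the angle matters, two words whose vectors point the same way count as close even if one vector is much longer than the other.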

### Affinity Propagation
If you set the preference right (the diagonal of the similarity matrix, named `availability` in the code below), affinity propagation can produce a breakdown where the ball sports are clustered separately from the other sports. You may still find some cities classed as sports, though: this word2vec representation was trained on a dump of Wikipedia taken in 2014, and many sports pages discuss the Athens and Beijing Olympics.
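
A common rule of thumb from the affinity propagation literature is to start with the median of the pairwise similarities on the diagonal for a moderate number of clusters, or the minimum for fewer clusters. The sketch below applies both choices to a toy matrix. It assumes Julia 1.x and uses only the standard library; the matrix values and the helper `with_preference` are hypothetical, and the result would still need to be passed to a clustering routine such as `affinityprop`.

```julia
using LinearAlgebra  # standard library: diagind, diag
using Statistics     # standard library: median

# Hypothetical 4×4 similarity matrix (higher means more similar).
S = Float64[0.0 0.9 0.2 0.1;
            0.9 0.0 0.3 0.2;
            0.2 0.3 0.0 0.8;
            0.1 0.2 0.8 0.0]

n = size(S, 1)
# Collect the off-diagonal entries only; the diagonal gets overwritten below.
offdiag = [S[i, j] for i in 1:n for j in 1:n if i != j]

# Hypothetical helper: return a copy of S with p written onto the diagonal.
function with_preference(S, p)
    out = copy(S)
    out[diagind(out)] .= p
    return out
end

S_moderate = with_preference(S, median(offdiag))   # starting point for a moderate number of clusters
S_few      = with_preference(S, minimum(offdiag))  # lower preference, tends to give fewer clusters

println(diag(S_moderate))
println(diag(S_few))
```

Treat both values as starting points: raising the preference generally yields more exemplars, and lowering it yields fewer.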


# First we load up some data
For the example presented here, we will use a subset of word embeddings trained using [Word2Vec.jl](https://github.com/tanmaykm/Word2Vec.jl).
These are 100-dimensional vectors, which encode syntactic and semantic information about words.

In [1]:
using JLD
embeddings = load("../assets/ClusteringAndDimentionalityReduction.jld", "embeddings")

Dict{String,Array{Float32,1}} with 185 entries:
  "ferret"       => Float32[0.0945707,-0.435267,0.0109875,-0.107674,0.169001,-0…
  "gymnastics"   => Float32[-0.269173,-0.343412,-0.00603042,-0.186179,0.0342606…
  "vegas"        => Float32[-0.00530534,-0.264874,0.0167432,-0.289836,-0.14033,…
  "archery"      => Float32[0.0279714,-0.485648,0.105468,-0.0696941,0.182807,-0…
  "jacksonville" => Float32[-0.418758,-0.0284594,0.00847164,-0.0989162,0.098186…
  "ankara"       => Float32[-0.139109,0.0872892,0.749557,-0.0308427,-0.0936718,…
  "pentathlon"   => Float32[-0.357405,-0.379595,-0.134314,-0.31008,-0.0245871,-…
  "seoul"        => Float32[0.0274904,-0.153844,-0.0936614,-0.0269344,-0.091449…
  "china"        => Float32[0.132423,-0.515862,-0.0381339,-0.287565,-0.285202,-…
  "korea"        => Float32[0.236904,-0.128355,-0.0816942,-0.0702621,-0.148426,…
  "argentina"    => Float32[-0.113967,-0.437523,-0.226014,-0.439572,-0.230062,-…
  "mozambique"   => Float32[0.309411,-0.13457,-0.632055,-0.30

In [2]:
all_words = collect(keys(embeddings))
display(all_words)
embeddings_mat = hcat(getindex.([embeddings], all_words)...)

185-element Array{String,1}:
 "ferret"      
 "gymnastics"  
 "vegas"       
 "archery"     
 "jacksonville"
 "ankara"      
 "pentathlon"  
 "seoul"       
 "china"       
 "korea"       
 "argentina"   
 "mozambique"  
 "iraq"        
 ⋮             
 "volleyball"  
 "luanda"      
 "ghana"       
 "warsaw"      
 "accra"       
 "indianapolis"
 "las"         
 "russia"      
 "columbus"    
 "thailand"    
 "mesa"        
 "goose"       

100×185 Array{Float32,2}:
  0.0945707   -0.269173    …   0.0859109  -0.215521     0.118283 
 -0.435267    -0.343412       -0.185847   -0.0846722   -0.40088  
  0.0109875   -0.00603042      0.131935   -0.452262     0.0091058
 -0.107674    -0.186179       -0.221565   -0.115309     0.0121521
  0.169001     0.0342606      -0.0558827  -0.373113    -0.0509757
 -0.0564122   -0.137685    …  -0.0252548  -0.264813    -0.24657  
 -0.249841    -0.162321        0.0430546   0.0958876    0.0397347
 -0.115424    -0.253833        0.161854   -0.274667    -0.120246 
 -0.302291     0.0844513      -0.263644   -0.158253    -0.0829336
 -0.0232056    0.138056       -0.476437   -0.1159       0.0935187
 -0.0826832    0.0510365   …  -0.190116   -0.00022561  -0.338357 
 -0.11338      0.0767575      -0.0493041  -0.252975    -0.0785137
 -0.255015    -0.591677        0.0772709   0.180385    -0.134259 
  ⋮                        ⋱                                     
 -0.191331    -0.290943       -0.164481   -0.08340
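
The `hcat(getindex.([embeddings], all_words)...)` line above looks up each word's vector and stacks the vectors side by side, so column `j` of the matrix belongs to `all_words[j]`. A minimal sketch of the same idea on a toy dictionary follows. It assumes Julia 1.x; the names `toy`, `words` and `mat` and the 3-dimensional vectors are invented for illustration.

```julia
# Hypothetical 3-dimensional toy embeddings, not the real data.
toy = Dict(
    "dog"   => Float32[0.1, 0.2, 0.3],
    "goose" => Float32[0.4, 0.5, 0.6],
    "seoul" => Float32[0.7, 0.8, 0.9],
)

# A Dict has no guaranteed order, so fix one by collecting the keys once.
words = collect(keys(toy))

# One column per word, in the same order as `words`.
mat = reduce(hcat, [toy[w] for w in words])

# Column j of `mat` is the vector for `words[j]`.
j = findfirst(==("goose"), words)
println(mat[:, j])  # the vector stored under "goose"
```

Using `reduce(hcat, ...)` instead of splatting gives the same matrix while avoiding a call with one argument per word. Keeping `words` around matters, since it is the only record of which column belongs to which word.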

In [3]:
using Clustering
using Distances

similarity = 1f0 - pairwise(CosineDist(), embeddings_mat)
availability = 0.01*ones(size(similarity,1))
# Tweaking availability is how you control the number of clusters.
# It becomes the diagonal of the similarity matrix (usually called the preference).
similarity[diagind(size(similarity)...)] = availability
aprop = affinityprop(similarity)

Clustering.AffinityPropResult([10,17,24,49,77,84,87,88,91,107,136,148,161,165,174,182],[3,12,2,12,16,6,12,1,1,1  …  14,11,14,16,2,8,16,4,2,3],[9,12,14,9,7,14,7,8,15,7,5,20,5,19,10,24],46,true)

In [4]:
for (cluster_ii, exemplar_ind) in enumerate(aprop.exemplars)
    println("-"^32)
    println("Exemplar: ", all_words[exemplar_ind])
    cluster_member_inds = find(assignments(aprop).==cluster_ii)
    println(join(getindex.([all_words], cluster_member_inds), ", "))
end

--------------------------------
Exemplar: korea
seoul, china, korea, pyongyang, japan, vietnam, tokyo, hanoi, taipei
--------------------------------
Exemplar: sacramento
vegas, sacramento, francisco, tucson, seattle, san, albuquerque, denver, portland, fresno, las, mesa
--------------------------------
Exemplar: dog
ferret, dog, goldfish, cattle, dove, yak, duck, llama, mouse, alpaca, pigeon, guineafowl, camel, goose
--------------------------------
Exemplar: indonesia
jakarta, bangkok, myanmar, indonesia, manila, malaysia, philippines, singapore, thailand
--------------------------------
Exemplar: iran
iraq, kabul, uzbekistan, tehran, iran, yemen, afghanistan
--------------------------------
Exemplar: cairo
ankara, baghdad, algiers, khartoum, rabat, beirut, cairo, algeria, morocco, damascus, amman, egypt, arabia, riyadh
--------------------------------
Exemplar: vienna
italy, rome, berlin, vienna, stockholm, budapest, paris
--------------------------------
Exemplar: moscow
baku, kyi