# Assessing clustering stability visually

As you saw in last week's practical tutorial, investigating the overlaps between clustering results is never straightforward. Here is an additional, optional exercise for those of you who would like to develop a mini solution for visual cluster assessment.

The idea is to use a Sankey diagram as a visual aid to investigate how clusters overlap. This approach is used in the Caleydo Matchmaker tool that we looked at in class: http://caleydo.org/publications/2010_infovis_matchmaker/

The example below uses the Sankey diagram from the plotly package. The idea is to first find and quantify the overlap between the clusters of two clustering results, and then express those overlaps as flows. Once this is done, the data can be fed into the Sankey diagram as in the example below.

For details, have a look here: https://plot.ly/python/sankey-diagram/

In [3]:
import plotly
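# Replace the username and API key below with your own from https://plot.ly.
# Note: this call is from plotly 3.x; from plotly 4 onwards the online helpers
# live in the separate chart_studio package.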
plotly.tools.set_credentials_file(username='dsikar', api_key='h29EDZcp4orwEkhpsBHx')

In [4]:

import plotly.plotly as py
data = dict(
    type='sankey',
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(
        color = "black",
        width = 0.5
      ),
      label = ["A1", "A2", "B1", "B2", "C1", "C2"],
      color = ["blue", "blue", "blue", "blue", "blue", "blue"]
    ),
    link = dict(
      source = [0,1,0,2,3,3],
      target = [2,3,3,4,4,5],
      value = [8,4,2,8,4,2]
  ))

layout =  dict(
    title = "Basic Sankey Diagram",
    font = dict(
      size = 10
    )
)

fig = dict(data=[data], layout=layout)
py.iplot(fig, validate=False)

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~dsikar/0 or inside your plot.ly account where it is named 'plot from API'
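
To read the `link` arrays in the cell above: each position in `source`, `target` and `value` describes one flow, and the numbers in `source` and `target` are indices into the `label` list. The short sketch below (plain Python, no plotly needed; the variable name `flows` is only for illustration) prints the same links in readable form.

```python
# Same node labels and link arrays as in the Sankey example above.
labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]

# Each (source, target, value) triplet is one flow between two nodes;
# source and target are positions in the labels list.
flows = [(labels[s], labels[t], v) for s, t, v in zip(source, target, value)]

for src, dst, v in flows:
    print(f"{src} -> {dst}: {v}")
```

So the first link, for instance, is a flow of size 8 from node `A1` to node `B1`.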


## Constructing flow between clustering results
OK, now you've seen a Sankey diagram working. The key task now is to fill in the data part of the diagram with data from two (or more) clustering results:
....
```
      label = ["A1", "A2", "B1", "B2", "C1", "C2"],
      color = ["blue", "blue", "blue", "blue", "blue", "blue"]
    ),
    link = dict(
      source = [0,1,0,2,3,3],
      target = [2,3,3,4,4,5],
      value = [8,4,2,8,4,2]
  )
```
  .....
  
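What you need to do is take the clusters from one clustering run, say C_1, C_2 and C_3, and those from a second run, C_4, C_5, C_6 and C_7, then find the overlap between the two sets `[C_1, C_2 and C_3]` and `[C_4, C_5, C_6 and C_7]` and construct these source, target and value triplets. Each triplet links one cluster from the first run to one cluster from the second, with the value quantifying their overlap (for example, the number of data points they share). You can also imagine that one of these "clusterings" could be the ground truth labels if you have them.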

When finding the overlaps and differences between clusters, NumPy set operations can be very handy: https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.set.html
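
As a starting point, here is a minimal sketch of how the triplets could be built with `np.intersect1d`. It is one possible approach, not the only one. The label arrays `run_a` and `run_b` are made-up data standing in for the output of two clustering runs on the same 10 points, and the helper name `build_sankey_links` is hypothetical.

```python
import numpy as np

# Made-up cluster labels for the same 10 data points from two clustering runs.
run_a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])  # 3 clusters
run_b = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])  # 4 clusters


def build_sankey_links(run_a, run_b):
    """Return (labels, source, target, value) for the overlaps of two runs."""
    ids_a = np.unique(run_a)
    ids_b = np.unique(run_b)
    # Nodes of the first run come first, then nodes of the second run.
    labels = [f"A{c}" for c in ids_a] + [f"B{c}" for c in ids_b]

    source, target, value = [], [], []
    for i, ca in enumerate(ids_a):
        members_a = np.flatnonzero(run_a == ca)  # point indices in cluster ca
        for j, cb in enumerate(ids_b):
            members_b = np.flatnonzero(run_b == cb)
            # Overlap = number of points that sit in both clusters.
            overlap = np.intersect1d(members_a, members_b).size
            if overlap > 0:
                source.append(i)
                target.append(len(ids_a) + j)  # offset past the first run's nodes
                value.append(int(overlap))
    return labels, source, target, value


labels, source, target, value = build_sankey_links(run_a, run_b)
print(labels)
print(source, target, value)
```

The resulting `labels`, `source`, `target` and `value` lists can then be placed into the `node` and `link` dictionaries of the Sankey example. Since every point belongs to exactly one cluster in each run, the values should add up to the number of data points, which is a useful sanity check.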

OK, here you go. Start from the Week-07 clustering example and try to build these flows after running your clustering a few times, maybe with different k values or algorithms, and compare them. Here is last week's code: https://moodle.city.ac.uk/mod/page/view.php?id=1174283
  