# Clustering Algorithm Methodology and Testing
Will Wright

### Purpose and Context

[todo]

In [1]:
# Load packages
import shutil
from os import listdir
import json
import glob
import os
import numpy as np
import pandas as pd
import random
import copy
from heapq import nsmallest

# increase viewable dataframe rows and columns
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 20)

# set random seed (this seeds Python's random module only; the np.random draws below are not seeded)
random.seed(30)

___
## Simulating Clusters
There are several algorithms out there for clustering binary data, but I'd like to try my hand at developing my own. After much whiteboarding, my plan is to test and see if I can get my own idea to work.

**Clustering Algorithm**
1. Start with nodes equal to the number of games (k=m)
2. Calculate the distance between each node and every other node, where distance is the sum of resources that don't match (i.e. if a card is in both decks, that resource adds 0 distance, but if it is in one and not the other, then it adds 1 distance. "Distance" is the sum of all these resource differences)
3. For each node, calculate the nearest node(s) via distance + tolerance and add to the node via averaging.
4. For each updated node, calculate the distances to all other nodes and the min and max distances per node
5. Drop the node for which the min distance is the smallest and, of those, the max distance is the smallest (this step adds maximum distance between the nodes)
6. Repeat 2-5 until k=1
7. Plot average distance for k=1 through k=m
8. Select a reasonable k

### Generating Test Data

Let's imagine we have 20 resources in 100 games and 4 relatively tight clusters. In order to do this, we'll create 4 medoids by drawing from the binomial distribution, with the probability of each resource being a 1 instead of a 0 set to p = 0.2, 0.4, 0.6, and 0.8 for the four medoids respectively.

In [2]:
first_medoid = np.random.binomial(1, 0.2, 20)
second_medoid = np.random.binomial(1, 0.4, 20)
third_medoid = np.random.binomial(1, 0.6, 20)
fourth_medoid = np.random.binomial(1, 0.8, 20)

print(first_medoid)
print(second_medoid)
print(third_medoid)
print(fourth_medoid)


[0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 1]
[0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 1]
[1 1 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 1 1 1]
[1 1 1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 1]


We'll want a function to measure the distance between games (as described in step 2 of the algorithm):

In [3]:
def cluster_distance_calculator(cluster1_input, cluster2_input):
    '''
    input: arrays of binary data for two clusters
    output: a distance measurement
    method: distance is the sum of differences in the binary data, by position
    '''
    distance = sum(abs(cluster1_input-cluster2_input))
    return(distance)
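
For strictly binary vectors, this measure is the Hamming distance. Below is a minimal sketch, assuming both inputs are equal-length NumPy arrays of 0s and 1s; `hamming_distance` is a hypothetical name that is not used elsewhere in this notebook. It agrees with `cluster_distance_calculator` only while every entry is 0 or 1, so the sum of absolute differences is still needed once nodes are averaged into fractional values.

```python
import numpy as np

def hamming_distance(a, b):
    # count the positions where the two binary vectors disagree
    return int(np.count_nonzero(np.asarray(a) != np.asarray(b)))

# the first two medoids printed above
first = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1])
second = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1])
print(hamming_distance(first, second))  # 7
```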

In [4]:
# test it out for the distance between the first medoid and the others
print(cluster_distance_calculator(first_medoid, first_medoid))
print(cluster_distance_calculator(first_medoid, second_medoid))
print(cluster_distance_calculator(first_medoid, third_medoid))
print(cluster_distance_calculator(first_medoid, fourth_medoid))


0
7
11
10


As expected, there is 0 distance for the first medoid compared to itself. The second medoid has 7 total differences, the third has 11, and the fourth has 10. We'll want to see each cluster's distance from every other cluster in a single matrix, so let's wrap that comparison in a function for easy and comprehensive evaluation:

In [5]:
def cluster_confusioner(cluster_list_input):
    '''
    input: a list of clusters of equal length
    output: a matrix which applies the cluster_distance_calculator to each pair of clusters
    '''
    distance_matrix = np.empty((len(cluster_list_input), len(cluster_list_input)))
    
    # iterate through each comparison to populate the matrix
    for i in range(len(cluster_list_input)):
        for j in range(len(cluster_list_input)):
            distance_matrix[i,j] = cluster_distance_calculator(cluster_list_input[i], cluster_list_input[j])
    
    return(distance_matrix)
    

In [6]:
cluster_confusioner([first_medoid, second_medoid, third_medoid, fourth_medoid])

array([[ 0.,  7., 11., 10.],
       [ 7.,  0., 12., 11.],
       [11., 12.,  0.,  9.],
       [10., 11.,  9.,  0.]])

Here, we can see those same 0, 7, 11, 10 values across the first row and down the first column, as well as the distances between the other clusters. It looks like the maximum distance is 12 (between the second and third medoids) and the minimum distance is 7 (between the first and second medoids).
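
The closest and farthest pairs can also be read off the matrix programmatically. This is a small sketch, assuming a symmetric distance matrix like the one printed above; the variable names are illustrative only.

```python
import numpy as np

# distance matrix for the four medoids, copied from the output above
medoid_distances = np.array([[ 0,  7, 11, 10],
                             [ 7,  0, 12, 11],
                             [11, 12,  0,  9],
                             [10, 11,  9,  0]])

# use only the upper triangle so each pair is counted once and the diagonal is skipped
rows, cols = np.triu_indices(len(medoid_distances), k=1)
pair_distances = medoid_distances[rows, cols]

# positions (0-based medoid indices) of the closest and farthest pairs
closest = (int(rows[pair_distances.argmin()]), int(cols[pair_distances.argmin()]))
farthest = (int(rows[pair_distances.argmax()]), int(cols[pair_distances.argmax()]))
print(closest, int(pair_distances.min()))    # (0, 1) 7
print(farthest, int(pair_distances.max()))   # (1, 2) 12
```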

Next, we'll want to create 24 similar games per medoid to simulate a situation in which there were 4 winning sets of resources. We'll do this by randomly selecting 0-25% of the elements in each medoid and flipping them, such that we'll have 75% to 100% similarity between each game intended for a cluster and its medoid (I say _intended_ because it's possible that, after applying the random changes, a game becomes more similar to a different medoid).

In [7]:
def cluster_creator(medoid_input, difference_percent_range, n_games):
    '''
    input: medoid_input is a one-dimensional array of binary data
           difference_percent_range is a list with a min and max percent (e.g. [0,0.25] for 0-25%); cannot do <1% 
           n_games is the number of games needed in the output
    output: a list of n games with the specified similarity to the medoid_input
    '''
    
    simulated_games = []
    
    for i in range(n_games):
        # select how many elements will be changed
        # randrange needs integer bounds and excludes the upper bound, so scale the percents by 100,
        # convert them to int, and add 1 to the upper end
        percent_change = random.randrange(int(round(difference_percent_range[0]*100)), int(round(difference_percent_range[1]*100))+1, 1)/100
        
        # convert the percent to an integer by multiplying by the total number of elements and rounding
        element_change = round(len(medoid_input)*percent_change)
        
        # select which elements will be changed
        element_change_positions = []
        for j in range(element_change):
            element_change_positions.append(random.randrange(0,len(medoid_input)))
        
        # change those elements
        simulated_game = copy.copy(medoid_input)
        for k in range(len(element_change_positions)):
            if simulated_game[element_change_positions[k]]==1:
                simulated_game[element_change_positions[k]]=0
            else:
                simulated_game[element_change_positions[k]]=1
        
        # append to list of games
        simulated_games.append(simulated_game)
    
    return(simulated_games)
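
Because `cluster_creator` draws positions with `random.randrange`, the same position can be drawn twice and flipped back, so a game can end up with fewer differences than `element_change`. The sketch below is one possible variant, not the method used in this notebook: it assumes `random.sample` is acceptable for drawing distinct positions, and `flip_distinct_positions` is a hypothetical helper name.

```python
import random
import numpy as np

def flip_distinct_positions(medoid, n_flips, rng=random):
    # choose n_flips distinct positions so that no flip cancels another
    positions = rng.sample(range(len(medoid)), n_flips)
    game = np.array(medoid, copy=True)
    # 1 - x turns a 0 into a 1 and a 1 into a 0
    game[positions] = 1 - game[positions]
    return game

random.seed(30)
medoid = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 1])
game = flip_distinct_positions(medoid, 3)
print(int(np.sum(medoid != game)))  # 3
```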
        

In [8]:
# create the cluster of 24 simulated games around the first medoid (the medoid itself is added when the games are stitched together)
first_medoid_cluster = cluster_creator(first_medoid, [0,0.25], 24)

In [9]:
first_medoid_cluster

[array([1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]),
 array([0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]),
 array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]),
 array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]),
 array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]),
 array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1]),
 array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]),
 array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]),
 array([0, 0, 0, 0, 

Looks right, but just to test, let's make sure the difference between each simulated game and the medoid is less than or equal to 25% (5 elements of the 20):

In [10]:
distances = []
for i in range(len(first_medoid_cluster)):
    distances = distances + [cluster_distance_calculator(first_medoid, first_medoid_cluster[i])]
print(max(distances))

4


Perfect! Now to create the other clusters of games around the other medoids.

In [11]:
second_medoid_cluster = cluster_creator(second_medoid, [0,0.25], 24)
third_medoid_cluster = cluster_creator(third_medoid, [0,0.25], 24)
fourth_medoid_cluster = cluster_creator(fourth_medoid, [0,0.25], 24)

Finally, we can stitch all the games together into one dataframe:

In [12]:
simulated_games = [first_medoid] + first_medoid_cluster + [second_medoid] + second_medoid_cluster +\
[third_medoid] + third_medoid_cluster + [fourth_medoid] + fourth_medoid_cluster

In [13]:
simulated_games = pd.DataFrame(np.array(simulated_games))

In [14]:
simulated_games

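| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 96 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 97 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 98 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |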


### Building the Clustering Algorithm
So now, in total, we have 100 games of 20 resources with 25 in each cluster centered around 4 medoids. This is roughly what we should expect to find with the actual game data so it should serve as a relevant proxy.

To restate the goal of the Algorithm:  
**Clustering Algorithm**
1. Start with nodes equal to the number of games (k=m)
2. Calculate the distance between each node and every other node, where distance is the sum of resources that don't match (i.e. if a card is in both decks, that resource adds 0 distance, but if it is in one and not the other, then it adds 1 distance. "Distance" is the sum of all these resource differences)
3. For each node, calculate the nearest node(s) via distance + tolerance and add to the node via averaging.
4. For each updated node, calculate the distances to all other nodes and the min and max distances per node
5. Drop the node for which the min distance is the smallest and, of those, the max distance is the smallest (this step adds maximum distance between the nodes)
6. Repeat 2-5 until k=1
7. Plot average distance for k=1 through k=m
8. Select a reasonable k

In [15]:
def node_distancer(resource_dataframe_input):
    '''
    input: resource_dataframe_input is a dataframe with resources in the columns and games in the rows with 
            binary data filling the table.
    output: an mxm array holding the distance between every pair of rows (nodes)
    '''
    # create empty array to hold all the distances comparing each combination
    all_node_distances = np.zeros([len(resource_dataframe_input),len(resource_dataframe_input)])
    # calculate distance between the ith game and the jth game
    for i in range(len(all_node_distances)):
        for j in range(len(all_node_distances)):
            all_node_distances[i,j] = cluster_distance_calculator(resource_dataframe_input.iloc[i], resource_dataframe_input.iloc[j])
    
    return(all_node_distances)
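
The double loop with `.iloc` makes m*m separate calls, which becomes slow as the number of games grows. As a hedged alternative, the same matrix of summed absolute differences can be computed in one step with NumPy broadcasting. `pairwise_distances` is a hypothetical name, and the sketch assumes the table fits in memory as an (m, m, n) intermediate array.

```python
import numpy as np

def pairwise_distances(nodes):
    # nodes: (m, n) array-like; returns the (m, m) matrix of summed absolute differences
    values = np.asarray(nodes, dtype=float)
    # values[:, None, :] has shape (m, 1, n) and values[None, :, :] has shape (1, m, n);
    # broadcasting the subtraction yields every row-versus-row difference at once
    return np.abs(values[:, None, :] - values[None, :, :]).sum(axis=2)

toy_nodes = np.array([[0, 0, 1],
                      [1, 0, 1],
                      [1, 1, 0]])
print(pairwise_distances(toy_nodes))
```

Passing a dataframe works as well, since `np.asarray` converts it to its underlying values.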

In [16]:
simulated_node_distances = node_distancer(simulated_games)

In [17]:
simulated_node_distances

array([[ 0.,  3.,  4., ..., 10.,  8., 10.],
       [ 3.,  0.,  7., ..., 11., 11.,  9.],
       [ 4.,  7.,  0., ..., 10.,  8., 10.],
       ...,
       [10., 11., 10., ...,  0.,  4.,  2.],
       [ 8., 11.,  8., ...,  4.,  0.,  6.],
       [10.,  9., 10., ...,  2.,  6.,  0.]])

Above is a 100x100 array which shows the distance of each node to every other node. Below is just the first row, which compares the first node to all 100 nodes (itself included), separated by cluster:

In [18]:
print(simulated_node_distances[0,0:25])
print(simulated_node_distances[0,25:50])
print(simulated_node_distances[0,50:75])
print(simulated_node_distances[0,75:100])

[0. 3. 4. 2. 3. 0. 0. 1. 3. 0. 2. 3. 4. 1. 2. 0. 1. 4. 1. 3. 2. 0. 2. 1.
 2.]
[ 7.  7.  8.  9.  9.  9.  9. 10.  5.  7.  7.  7.  9.  6.  8.  9.  8. 10.
  7.  7. 10.  7.  6. 10.  6.]
[11. 11. 12. 10. 10. 12. 11. 10. 12.  9. 12.  9. 11. 10. 13. 11. 10. 12.
 10. 10. 12.  7. 11. 11. 10.]
[10. 11.  9. 10. 12. 10.  8.  9. 11. 12. 12. 10.  9. 11. 10. 10. 10. 10.
 10. 10. 14. 10. 10.  8. 10.]


Above is the first row of what I'm calling a 'node' distance separated into each of the medoid-centered groupings.  In this case, the node is 1 game, but we'll eventually be merging games of resources together so it's easier to call these the more generic 'node' than calling it 'the combination of game 1, game 10,..., game n'.  

What it's showing is that the distance between the first node and itself is, as expected, 0, and the distances between the first node and the others centered around that medoid range between 0 and 4. Compared to the second medoid and the simulated games around it, we see values from 5 to 10. We see a difference of 7-13 for the third medoid cluster and 8-14 for the fourth medoid cluster. In all, it seems to be working as expected.

At step 3 in the algorithm, we calculate the nearest node and combine the two by taking the average of the resources. As can be seen, the smallest distance to any node other than itself is 0, and there are 5 nodes at that distance. This means that 5 simulated games are identical to the first node. I'm not exactly sure how to handle ties, but it seems reasonable to simply take an average across all nodes that are equally close. I can imagine a better (and more complex) strategy doing backtracking and seeing which node-pairing results in a smaller distance in the n+1 iteration and going with that, but I'll keep this version simple.

As per this plan, let's create a function which creates a new node based on an average of the equally-close nodes:

In [19]:
def node_averager(list_of_nodes):
    return(np.mean((list_of_nodes), axis = 0))
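
As a small worked example of this averaging (the three games below are made up for illustration), each column of the result is the share of tied games that hold that resource:

```python
import numpy as np

# three hypothetical games that are tied for closest to one another
tied_nodes = [np.array([1, 0, 1, 0]),
              np.array([1, 1, 1, 0]),
              np.array([1, 0, 0, 0])]

# averaging column by column gives the share of tied games holding each resource
hybrid_node = np.mean(tied_nodes, axis=0)
print(hybrid_node)
```

Here the hybrid node works out to 1, 1/3, 2/3, and 0 across the four resources.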

Next, for each row of 100 distances, we'll want to select the node(s) with the equally-minimal distance from the original node to send to the `node_averager()` and store the resulting nodes along with indices for the nodes included.  In other words, in the case of the first iteration when the nodes are equal to the games, we're finding the most similar other game(s) and creating an average new node.

In [22]:
def new_noder(node_table_input, node_distance_table_input, node_index, tolerance_input):
    '''
    inputs: node_table_input: a table with nodes (games) in the rows and resources in the columns (mxn)
            node_distance_table_input: an array of node distances (mxm)
            node_index: an integer value (should range from 0 to m-1)
            tolerance_input: value to be added to the minimum distance to be included in the closest nodes
    output: a list holding the new average node and [node_index, indices of the nodes averaged together]
    '''
    node_distances = node_distance_table_input[node_index]
    # smallest distance among entries 1 onward, plus the tolerance
    # (the [1:] slice skips index 0, so the primary node is excluded only when node_index is 0)
    min_distance = min(node_distances[1:])+tolerance_input
    closest_node_indices = np.where(node_distances<=min_distance)[0] # [0] since these are 1-dimensional slices
    
    # grab the closest nodes and put into a list
    closest_nodes = []
    for i in range(len(closest_node_indices)):
        closest_nodes.append(node_table_input.iloc[closest_node_indices[i]].values)
    
    # grab primary_node and add to the list of closest_nodes
    # (its 0 self-distance already puts it in closest_nodes, so it is counted twice in the average)
    primary_node = node_table_input.iloc[node_index].values
    closest_nodes.append(primary_node)
    
    
    # take an average of the primary and closest node(s)
    new_node = node_averager(closest_nodes)
    
    return([new_node,[node_index,closest_node_indices]])
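
Two details of `new_noder` are worth noting: `node_distances[1:]` skips index 0 rather than `node_index`, and the primary node enters the average twice because its self-distance of 0 always falls within the cutoff. The sketch below is a hedged variant that masks out the primary node by index and counts it once. `nearest_average` is a hypothetical name, it works on plain arrays rather than dataframes, and switching to it would change the outputs shown in the rest of this notebook.

```python
import numpy as np

def nearest_average(nodes, distances, node_index, tolerance=0):
    nodes = np.asarray(nodes, dtype=float)
    row = np.asarray(distances[node_index], dtype=float)
    # mask out the primary node so its 0 self-distance cannot be the minimum
    others = np.arange(len(row)) != node_index
    cutoff = row[others].min() + tolerance
    closest = np.where(others & (row <= cutoff))[0]
    # average the primary node (counted once) with its closest neighbours
    members = np.append(closest, node_index)
    return nodes[members].mean(axis=0), closest

toy_nodes = np.array([[0, 0, 1], [1, 0, 1], [1, 1, 0]])
toy_distances = np.array([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
new_node, closest = nearest_average(toy_nodes, toy_distances, node_index=1)
print(new_node, closest)
```

For node 1 the only nearest neighbour is node 0 (distance 1), so the new node is the average of rows 0 and 1.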
    

In [23]:
new_node = new_noder(simulated_games, simulated_node_distances, 0,0)

Excellent! We now have a way of generating a new node, which is the average of the closest nodes (within some tolerance), along with the indices for the primary node and the nodes closest.  

As per the algorithm, we'll want to do this for all nodes before calculating the distances between each of them and removing the closest.

In [26]:
def node_updater(node_table_input, node_distance_table_input, tolerance_input = 1):
    '''
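    input: node_table_input: a dataframe with nodes in the rows and resources in the columns
           node_distance_table_input: an array of node distances (mxm)
           tolerance_input: value passed to new_noder as its tolerance (default 1)
    output: a dataframe with one row per input node, where each row is the averaged node returned by new_noder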
    '''
    new_resource_list = []
    for i in range(len(node_table_input)):
        new_resource_list.append(new_noder(node_table_input, node_distance_table_input,i,tolerance_input)[0])
    
    # convert to dataframe
    updated_node_table = pd.DataFrame(np.array(new_resource_list))
    
    return(updated_node_table)

In [27]:
simulated_games_step2 = node_updater(simulated_games, simulated_node_distances,1)

In [28]:
simulated_games_step2

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.083333 | 0.0 | 0.000000 | 0.083333 | 1.000000 | 0.083333 | 0.0 | 1.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 0.916667 | 0.916667 |
| 1 | 1.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.0 | 1.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 0.000000 |
| 2 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 1.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 1.000000 |
| 3 | 0.000000 | 0.000000 | 0.666667 | 0.0 | 1.000000 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 1.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 1.000000 |
| 4 | 1.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.0 | 1.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.666667 | 1.0 | 1.000000 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | 1.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.0 | 0.0 | 1.000000 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |
| 96 | 0.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | 0.0 | 0.000000 | 0.000000 | 0.333333 | 0.000000 | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.0 | 0.000000 | 0.0 | 1.000000 | 1.000000 |
| 97 | 0.909091 | 0.909091 | 1.000000 | 0.0 | 1.000000 | 0.0 | 0.090909 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 1.0 | 0.909091 | 1.0 | 1.0 | 1.0 | 0.000000 | 0.0 | 1.000000 | 1.000000 |
| 98 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |


In [29]:
node_distance_table_step2 = node_distancer(simulated_games_step2)

In [30]:
i = 0

In [31]:
node_distance_table_step2[i]

array([ 0.        ,  3.08333333,  4.41666667,  1.91666667,  2.91666667,
        0.        ,  0.        ,  0.5       ,  3.41666667,  0.        ,
        2.25      ,  3.08333333,  4.25      ,  0.69444444,  2.41666667,
        0.        ,  0.69444444,  4.41666667,  0.5       ,  3.41666667,
        2.41666667,  0.        ,  2.75      ,  0.5       ,  2.41666667,
        7.36111111,  7.08333333,  7.53571429,  9.08333333,  8.91666667,
        9.08333333,  8.75      , 10.08333333,  5.25      ,  7.36111111,
        7.25      ,  7.36111111,  8.91666667,  6.08333333,  7.88333333,
        8.91666667,  8.08333333, 10.08333333,  7.36111111,  6.91666667,
        9.91666667,  7.36111111,  6.25      , 10.08333333,  6.86111111,
       10.63888889, 10.58333333, 11.91666667, 10.41666667,  9.91666667,
       11.41666667, 10.75      , 10.25      , 12.08333333,  9.08333333,
       11.41666667,  8.91666667, 11.25      , 10.41666667, 12.58333333,
       10.91666667, 10.08333333, 11.91666667,  9.91666667, 10.25

In [32]:
min(node_distance_table_step2[i][[s for s in list(range(len(node_distance_table_step2[i]))) if s != i]])

0.0

In [33]:
max(node_distance_table_step2[i])

14.083333333333334

It looks like the 0th node has a min distance of 0 (excluding the distance to itself) and a max distance of about 14.08. What we want to know is which nodes share the smallest min distance across all nodes; then, of those, we remove the node(s) with the smallest max distance. In order to do this, we'll create a function which returns those min and max values for every node.

In [34]:
def node_distance_min_maxer(node_distance_table_input):
    node_min_maxs= []
    for i in range(len(node_distance_table_input)):
        node_min = min(node_distance_table_input[i][[s for s in list(range(len(node_distance_table_input[i]))) \
                                                     if s != i]])
        node_max = max(node_distance_table_input[i])
        node_min_maxs.append([node_min, node_max])
    return(node_min_maxs)
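
The same min and max values can be computed without a Python loop. This is a sketch under the assumption that the input is a square NumPy distance matrix; `min_max_per_node` is a hypothetical name, and it returns an (m, 2) array rather than the list of lists used above.

```python
import numpy as np

def min_max_per_node(distance_matrix):
    distances = np.asarray(distance_matrix, dtype=float)
    # hide the diagonal with +inf so a node's 0 distance to itself is never its minimum
    off_diagonal = distances.copy()
    np.fill_diagonal(off_diagonal, np.inf)
    # column 0: distance to the nearest other node, column 1: distance to the farthest node
    return np.column_stack([off_diagonal.min(axis=1), distances.max(axis=1)])

toy_distances = np.array([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
print(min_max_per_node(toy_distances))
```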

In [35]:
node_distance_min_max_step2 = node_distance_min_maxer(node_distance_table_step2)

In [36]:
node_distance_min_max_step2

[[0.0, 14.083333333333334],
 [2.75, 15.0],
 [4.0, 14.0],
 [1.2222222222222223, 12.333333333333334],
 [0.6666666666666666, 12.666666666666666],
 [0.0, 14.083333333333334],
 [0.0, 14.083333333333334],
 [0.5, 14.25],
 [3.25, 13.666666666666666],
 [0.0, 14.083333333333334],
 [0.6666666666666666, 12.5],
 [0.3333333333333333, 15.333333333333334],
 [2.333333333333333, 16.0],
 [0.6944444444444444, 13.555555555555555],
 [2.0, 14.666666666666666],
 [0.0, 14.083333333333334],
 [0.6944444444444444, 13.555555555555557],
 [3.3333333333333335, 14.0],
 [0.5, 14.25],
 [3.0, 15.0],
 [2.0, 13.0],
 [0.0, 14.083333333333334],
 [0.3333333333333333, 15.666666666666666],
 [0.5, 14.25],
 [2.0, 14.0],
 [0.0, 13.88888888888889],
 [1.2222222222222223, 14.666666666666666],
 [0.39682539682539686, 13.714285714285714],
 [2.2857142857142856, 16.0],
 [1.2666666666666666, 15.666666666666666],
 [3.7142857142857144, 13.0],
 [1.2666666666666666, 15.666666666666668],
 [4.0, 16.0],
 [3.3333333333333335, 15.0],
 [0.0, 13.8888

Great, now we need to find which node(s) have the smallest first value (looks like 0) and, of those, the smallest second value:

In [37]:
def extract_nth(list_input, n): 
    return [element[n] for element in list_input] 

In [38]:
min_distance = min(extract_nth(node_distance_min_max_step2,0))

In [39]:
node_min_indices = np.where(extract_nth(node_distance_min_max_step2,0)==min_distance)[0]

In [40]:
node_min_indices

array([ 0,  5,  6,  9, 15, 21, 25, 34, 36, 43, 46, 75, 80, 86, 90, 91, 97])

In [41]:
min_distance_node_distances = [node_distance_min_max_step2[i] for i in node_min_indices]

In [42]:
min_distance_node_distances

[[0.0, 14.083333333333334],
 [0.0, 14.083333333333334],
 [0.0, 14.083333333333334],
 [0.0, 14.083333333333334],
 [0.0, 14.083333333333334],
 [0.0, 14.083333333333334],
 [0.0, 13.88888888888889],
 [0.0, 13.88888888888889],
 [0.0, 13.88888888888889],
 [0.0, 13.88888888888889],
 [0.0, 13.88888888888889],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818]]

In [43]:
minmax_distance = min(extract_nth(min_distance_node_distances,1))

In [44]:
node_removal_distances = np.asarray([min_distance, minmax_distance])

In [45]:
node_removal_distances

array([ 0.        , 13.81818182])

We can see that the last six entries of the min node distances (positions 11 through 16) are all tied at [0, 13.82]. This being the case, we should remove those nodes from simulated_games_step2 and repeat the process. What this means is we'll start with k=100 clusters, then go to k=94 since we're removing 6 nodes. In other words, the algorithm won't visit every value of k; which values it returns depends on how many nodes tie at each step, and this makes sense since we can imagine data which has distances that share min and max properties.

In [46]:
node_removal_indices = np.where((extract_nth(node_distance_min_max_step2,0)==node_removal_distances[0]) &\
                                (extract_nth(node_distance_min_max_step2,1)==node_removal_distances[1]))[0]

In [47]:
node_removal_indices.tolist()

[75, 80, 86, 90, 91, 97]

In [48]:
[node_distance_min_max_step2[i] for i in node_removal_indices.tolist()]

[[0.0, 13.818181818181818],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818],
 [0.0, 13.818181818181818]]

In [49]:
simulated_games_step3 = simulated_games_step2.drop(node_removal_indices)

In [50]:
simulated_games_step3

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.083333 | 1.000000 | 0.083333 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 0.916667 | 0.916667 |
| 1 | 1.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 0.000000 |
| 2 | 0.0 | 1.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 1.000000 |
| 3 | 0.0 | 0.0 | 0.666667 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 1.000000 |
| 4 | 1.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 1.000000 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.666667 | 1.0 | 1.000000 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.0 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |
| 95 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.0 | 0.0 | 1.000000 | 1.000000 | 0.000000 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |
| 96 | 0.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.000000 | 0.333333 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.000000 | 0.0 | 1.000000 | 1.000000 |
| 98 | 0.0 | 1.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |


Looks like this process works well, so we just need to convert it to a function and ensure we get the same results:

In [51]:
def drop_closest_nodes(node_table_input, node_minmax_distances):
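    '''
    input: node_table_input: a dataframe with nodes in the rows and resources in the columns
           node_minmax_distances: a list of [min, max] distances per node (from node_distance_min_maxer)
    output: the node table without the node(s) that share the smallest min distance and, of those,
            the smallest max distance
    '''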
    # calculate the min distance of the first element in each of the min/max lists
    min_distance = min(extract_nth(node_minmax_distances,0))
    
    # calculate indices of the nodes which contain the minimum
    node_min_indices = np.where(extract_nth(node_minmax_distances,0)==min_distance)[0]
    
    # generate a list of min/max distances for the nodes which contain the min distance
    min_distance_node_distances = [node_minmax_distances[i] for i in node_min_indices]
    
    # of those distances, calculate the minimum max distance (the second element in the distance lists)
    minmax_distance = min(extract_nth(min_distance_node_distances,1))
    
    # create an array containing the minimum and minimum maximum distance
    node_removal_distances = np.asarray([min_distance, minmax_distance])
    
    # calculate the node indices from the distance table which match the node_removal_distances
    node_removal_indices = np.where((extract_nth(node_minmax_distances,0)==node_removal_distances[0]) &\
                                (extract_nth(node_minmax_distances,1)==node_removal_distances[1]))[0]
    
    # drop the specified nodes from the resource table
    reduced_node_table = node_table_input.drop(node_removal_indices)
    
    return(reduced_node_table)
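
The min/max listing above contains values such as 13.555555555555555 and 13.555555555555557, which differ only by floating-point rounding, so an exact `==` comparison can miss nodes that are really tied. Below is a hedged sketch of the same selection rule using `np.isclose` with its default tolerances. `removal_indices` is a hypothetical name, and the five pairs are a made-up subset in the style of that listing.

```python
import numpy as np

def removal_indices(min_max_pairs):
    pairs = np.asarray(min_max_pairs, dtype=float)
    mins, maxs = pairs[:, 0], pairs[:, 1]
    # nodes whose nearest-neighbour distance ties (within rounding) for the smallest
    has_min = np.isclose(mins, mins.min())
    # among those nodes, find the smallest max distance
    smallest_max = maxs[has_min].min()
    # keep the nodes that match both criteria
    return np.where(has_min & np.isclose(maxs, smallest_max))[0]

pairs = [[0.0, 14.083333333333334],
         [0.6944444444444444, 13.555555555555555],
         [0.0, 13.818181818181818],
         [0.6944444444444444, 13.555555555555557],
         [0.0, 13.818181818181820]]
print(removal_indices(pairs))  # [2 4]
```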

In [52]:
drop_closest_nodes(simulated_games_step2, node_distance_min_max_step2)==simulated_games_step3

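| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 1 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 2 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 3 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 4 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 95 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 96 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |
| 98 | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True | True |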


At this point, we have everything working and just need to stick all the pieces together:

In [53]:
def node_reducer(node_dataframe_input, tolerance_input):
    '''
    input: a dataframe with nodes in the rows and binary features in the columns
    output: a reduced version of the input table which creates hybrid nodes and removes the node(s) for which there is
        the least difference to the other nodes.
    '''
    
    # calculate node distances
    node_distances = node_distancer(node_dataframe_input)
    
    # create new hybrid nodes
    new_nodes = node_updater(node_dataframe_input, node_distances, tolerance_input)
    
    # calculate hybrid node distances
    new_node_distances = node_distancer(new_nodes)
    
    # calculate the min and max distances per new hybrid node
    new_node_minmax_distances = node_distance_min_maxer(new_node_distances)
    
    # drop the closest node(s)
    reduced_node_dataframe = drop_closest_nodes(new_nodes, new_node_minmax_distances)
    
    return(reduced_node_dataframe)


In [54]:
node_reducer(simulated_games, 1)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.083333 | 1.000000 | 0.083333 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 0.916667 | 0.916667 |
| 1 | 1.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 0.000000 |
| 2 | 0.0 | 1.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 1.000000 |
| 3 | 0.0 | 0.0 | 0.666667 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.000000 | 1.000000 |
| 4 | 1.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 1.000000 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.666667 | 1.0 | 1.000000 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.0 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |
| 95 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.0 | 0.0 | 1.000000 | 1.000000 | 0.000000 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |
| 96 | 0.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.000000 | 0.333333 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.000000 | 0.0 | 1.000000 | 1.000000 |
| 98 | 0.0 | 1.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.000000 | 0.0 | 1.000000 | 1.000000 |


Next, we'll need to apply the cluster reducer down to k=1 and calculate the average distance between clusters at each step to get a sense of how many clusters should be included.

In [55]:
simulated_node_distances

array([[ 0.,  3.,  4., ..., 10.,  8., 10.],
       [ 3.,  0.,  7., ..., 11., 11.,  9.],
       [ 4.,  7.,  0., ..., 10.,  8., 10.],
       ...,
       [10., 11., 10., ...,  0.,  4.,  2.],
       [ 8., 11.,  8., ...,  4.,  0.,  6.],
       [10.,  9., 10., ...,  2.,  6.,  0.]])

In [56]:
np.mean(simulated_node_distances, axis = 0)

array([ 7.64,  8.82,  8.8 ,  7.64,  8.34,  7.64,  7.64,  8.4 ,  8.  ,
        7.64,  7.7 ,  8.74,  9.08,  7.64,  8.6 ,  7.64,  7.66,  8.46,
        8.02,  7.92,  8.28,  7.64,  8.38,  8.52,  8.66,  8.  ,  8.34,
        8.08,  9.12,  9.04,  8.84,  8.64,  9.44,  8.06,  8.  ,  8.08,
        8.  ,  9.16,  8.42,  8.64,  8.64,  8.86,  8.4 ,  8.  ,  8.36,
        9.18,  8.  ,  8.76,  9.34,  7.98,  8.64,  8.96,  9.06,  8.56,
        8.24,  9.28,  9.48,  8.62,  8.62,  9.48,  9.4 ,  8.58,  9.26,
        8.26,  9.42,  8.96,  8.52,  8.9 ,  8.86,  8.62, 10.18,  8.1 ,
        8.86,  8.9 ,  8.92,  8.12,  8.54,  8.08,  9.26,  8.88,  8.12,
        8.16,  8.1 ,  9.14,  8.88,  8.82,  8.12,  7.72,  9.64,  7.78,
        8.12,  8.12,  9.14,  8.86,  8.74,  9.18,  8.4 ,  8.12,  8.72,
        8.8 ])

In [57]:
np.mean(np.mean(simulated_node_distances, axis = 0))

8.531199999999998

So the average distance of each node to every other node is about 8.53 (this mean also counts each node's 0 distance to itself). Ideally, this number gets larger as we decrease the number of clusters. To that end, let's wrap all of these functions into a single function which stores the results at each stage as well as the average distances.
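
Taking `np.mean` over the full matrix also averages in the m zeros on the diagonal. If the mean over distinct pairs is wanted instead, one option is sketched below. `mean_off_diagonal` is a hypothetical name, and it assumes a square matrix with a zero diagonal and at least two nodes.

```python
import numpy as np

def mean_off_diagonal(distance_matrix):
    distances = np.asarray(distance_matrix, dtype=float)
    m = len(distances)
    # the diagonal is all zeros, so the total is unchanged;
    # only the number of entries drops from m*m to m*(m-1)
    return distances.sum() / (m * (m - 1))

toy_distances = np.array([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
print(np.mean(toy_distances))            # mean over all 9 entries
print(mean_off_diagonal(toy_distances))  # mean over the 6 off-diagonal entries
```

On the toy matrix the first value is 12/9 and the second is 12/6, so the gap between the two shrinks as m grows.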

In [58]:
def binary_clusterer(node_dataframe_input, tolerance_input):
    '''
    input: node_dataframe_input: a dataframe with nodes in the rows and binary features in the columns
           tolerance_input: value to be added to the minimum distance to be included in the closest nodes
    output: a list of three lists, each with one entry per step:
            centroids: the hybrid nodes (as a dataframe) at that step
            average_distances: the average distance between nodes at that step
            ks: the number of nodes at that step
    '''
    centroids = []
    average_distances = []
    ks = []
    node_dataframe = node_dataframe_input
      
    k = len(node_dataframe_input)
    
    while k>=1:
        # save results of current step
        centroids.append(node_dataframe)
        average_distances.append(np.mean(np.mean(node_distancer(node_dataframe), axis = 0)))
        ks.append(k)
        
        # a single remaining node has no other node to merge with, so stop here
        if k == 1:
            break
        
        # reduce node_dataframe
        node_dataframe = node_reducer(node_dataframe, tolerance_input)
        
        # set k
        k = len(node_dataframe)
        
    return([centroids,average_distances,ks])
        
    
    

In [59]:
test = binary_clusterer(simulated_games,1)
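
Steps 7 and 8 of the algorithm call for comparing the average distance across the values of k. The sketch below pairs each k with its average distance and reports the k with the largest value. The numbers are made up for illustration; in this notebook `ks` and `average_distances` would come from `test[2]` and `test[1]`. Taking the maximum is only one possible rule for a "reasonable" k.

```python
import numpy as np

# made-up values for illustration only; in the notebook these would be test[2] and test[1]
ks = [100, 94, 60, 20, 4, 2]
average_distances = [8.5, 8.6, 8.7, 8.8, 9.4, 8.9]

# position of the step with the largest average distance between nodes
best_step = int(np.argmax(average_distances))
print(ks[best_step], average_distances[best_step])  # 4 9.4
```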

In [61]:
test[1]

[8.531199999999998,
 8.515848584793808,
 8.555354830004495,
 8.668070571314388,
 8.674876289765983,
 8.681085343233875,
 8.68788758380867,
 8.706048189405557,
 8.711345685971418,
 8.72141522421576,
 8.73545661726091,
 8.747000694827822,
 8.758021479765945,
 8.751582693363561,
 8.760128025167035,
 8.77538370118284,
 8.778008311906856,
 8.786476686094963,
 8.78541196892222,
 8.79103303264111,
 8.788555767878993,
 8.772189782496149,
 8.78979279330103,
 8.791780915236693,
 8.777661604975455,
 8.781352514398943,
 8.772143264326495,
 8.780954695557565,
 8.81031007025345,
 8.820138855625919,
 8.808127584902811,
 8.834282661012724,
 8.835475364390126,
 8.831910637569703,
 8.818414067461537,
 8.813202780926204,
 8.807137605129478,
 8.839614412522431,
 8.820522564836304,
 8.805632502308402,
 8.76990504017531,
 8.768325617283951,
 8.765170068027212,
 8.7119140625,
 8.618480725623582,
 8.574302697759489,
 8.584418145956608,
 8.574805333333334,
 8.77373002754821,
 8.634161014994234,
 8.220503353057