# Clustering Algorithm Methodology and Testing
Will Wright

### Purpose and Context

I've worked with several unsupervised clustering algorithms in my professional and academic career, but I had never built one from scratch.  Because the data for this project is binary and I had never worked exclusively with this type of data, I decided to test my creativity and see if I could build a functioning algorithm from scratch with no outside influence.

Once complete, this algorithm will be used to classify the resource data from the Slay the Spire character datasets into resource-personas (i.e. the unique 'builds' possible to win with each character on Ascension 20).

In [628]:
# Load packages
import shutil
from os import listdir
import json
import glob
import os
import numpy as np
import pandas as pd
import random
import copy
from heapq import nsmallest

# increase viewable dataframe rows and columns
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 20)

___
## Clustering Algorithm

After much whiteboarding, testing, and deep thought on the matter, I devised the following algorithm for clustering:

**Clustering Algorithm**
1. Start with clusters equal to the number of nodes (k=m)
2. While k>=1:
    1. Calculate the distance between each node and every other node, where distance is the sum of resources that don't match (i.e. if a card is in both decks, that resource adds 0 distance, but if it is in one and not the other, then it adds 1 distance.  "Distance" is the sum of all these resource differences)
    2. For each node, calculate the nearest node(s) via distance + a user-defined tolerance distance and merge with the node via taking the mean
    3. For each merged node, calculate the distances to all other merged nodes and the min and max distances per merged node.
    4. For each min and max distance, add a weighted user-defined tolerance distance percent where the weight is relative to k such that weight = k/m (this applies a 100% weight when k=m and decreases toward 0% as k decreases, i.e. it takes larger 'steps' toward a lower k and slows down as k approaches 0)
    5. Drop the node(s) for which the min distance is smaller than or equal to the min distance + tolerance and, of those, the max distance is smaller than or equal to the max distance + tolerance (this step keeps the remaining nodes as far apart as possible)
    6. Store the resulting centroids, k, and average distance between nodes
    7. Set k equal to the resulting number of undropped nodes
3. Review average distance for k=2 through k=m
4. Select a reasonable k based on the average distance between nodes

### Simulating Test Data

In order to know if the algorithm works, we'll want to see how it performs on a dataset where we know the expected results.

Let's imagine we have 20 resources in 100 games and 4 somewhat-tight clusters.  To do this, we'll create 4 medoids, each made of 20 draws from the binomial distribution. The probability of a draw being a 1 instead of a 0 is p = 0.2, 0.4, 0.6, and 0.8 for the first, second, third, and fourth medoid respectively.

In [629]:
seeded_random = np.random.RandomState(44932)
first_medoid = seeded_random.binomial(1, 0.2, 20)
second_medoid = seeded_random.binomial(1, 0.4, 20)
third_medoid = seeded_random.binomial(1, 0.6, 20)
fourth_medoid = seeded_random.binomial(1, 0.8, 20)

print(first_medoid)
print(second_medoid)
print(third_medoid)
print(fourth_medoid)


[0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 0 0 0]
[0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0]
[1 1 0 0 1 1 1 0 1 0 0 1 0 1 0 1 0 1 0 1]
[1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1]


We'll want a function to measure the distance between games (as described in step 2 of the algorithm):

In [630]:
def cluster_distance_calculator(cluster1_input, cluster2_input):
    '''
    input: arrays of binary data for two clusters
    output: a distance measurement
    method: distance is the sum of differences in the binary data, by position
    '''
    distance = sum(abs(cluster1_input-cluster2_input))
    return(distance)

In [631]:
# test it out for the distance between the first medoid and the others
print(cluster_distance_calculator(first_medoid, first_medoid))
print(cluster_distance_calculator(first_medoid, second_medoid))
print(cluster_distance_calculator(first_medoid, third_medoid))
print(cluster_distance_calculator(first_medoid, fourth_medoid))


0
7
11
15


As expected, there is 0 distance for the first medoid compared to itself. The second medoid has 7 differences, the third has 11, and the fourth has 15, all of which makes sense since each successive medoid was drawn with a higher probability of 1s and so should sit further from the first. For easy and comprehensive evaluation, we'll want a function that builds the matrix of distances between every pair of clusters (a 'confusion matrix' of distances):

In [632]:
def cluster_confusioner(cluster_list_input):
    '''
    input: a list of clusters of equal length
    output: a matrix which applies the cluster_distance_calculator to each pair of clusters
    '''
    distance_matrix = np.empty((len(cluster_list_input), len(cluster_list_input)))
    
    # iterate through each comparison to populate the matrix
    for i in range(len(cluster_list_input)):
        for j in range(len(cluster_list_input)):
            distance_matrix[i,j] = cluster_distance_calculator(cluster_list_input[i], cluster_list_input[j])
    
    return(distance_matrix)
    

In [633]:
cluster_confusioner([first_medoid, second_medoid, third_medoid, fourth_medoid])

array([[ 0.,  7., 11., 15.],
       [ 7.,  0., 10., 12.],
       [11., 10.,  0.,  6.],
       [15., 12.,  6.,  0.]])

Here, we can see those same 0, 7, 11, and 15 values across the first row and down the first column as well as the distances between the other clusters.  It looks like the maximum distance is 15 (difference between the first and fourth medoid) and the minimum distance is a 6, which is between the third and fourth medoids.
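The nested loops in `cluster_confusioner` are easy to read, but the same matrix can also be produced in one step with NumPy broadcasting. The following is only a sketch: the function name `pairwise_mismatch_distances` is hypothetical, and the medoid values are copied from the printed output earlier in this notebook.

```python
import numpy as np

def pairwise_mismatch_distances(rows):
    """Return the matrix of summed absolute differences between every pair of rows."""
    data = np.asarray(rows, dtype=float)
    # data[:, None, :] has shape (m, 1, n) and data[None, :, :] has shape (1, m, n);
    # broadcasting the subtraction yields an (m, m, n) array of element-wise differences
    return np.abs(data[:, None, :] - data[None, :, :]).sum(axis=2)

# the four medoids printed earlier in this notebook
medoids = [
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
]
medoid_distances = pairwise_mismatch_distances(medoids)
print(medoid_distances)
```

The intermediate array grows as m x m x n, so this trades memory for speed; for very large tables the looped version (or processing in chunks) may still be the safer choice.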

Next, we'll want to create 24 similar games per medoid to simulate a situation in which there were 4 winning sets of resources.  We'll do this by randomly selecting 0-25% of the elements in each medoid and flipping them, such that we'll have 75% to 100% similarity between each game intended for a cluster and its medoid (I say _intended_ because it's possible that, after applying the random changes, a game becomes more similar to a different medoid).

In [634]:
def cluster_creator(medoid_input, difference_percent_range, n_games):
    '''
    input: medoid_input is a one-dimensional array of binary data
           difference_percent_range is a list with a min and max percent (e.g. [0,0.25] for 0-25%); cannot do <1% 
           n_games is the number of games needed in the output
    output: a list of n games with the specified similarity to the medoid_input
    '''
    simulated_games = []
    
    for i in range(n_games):
        # create inner random state which is determined by i
        inner_seeded_random = np.random.RandomState(i)
        
        # select how many elements will be changed
        percent_change = inner_seeded_random.uniform(difference_percent_range[0],difference_percent_range[1])
        
        # convert the percent to an integer by multiplying by the total number of elements and rounding
        element_change = round(len(medoid_input)*percent_change)
        
        # random.choices uses the standard-library random module, whose state is separate from
        # NumPy's RandomState, so it needs its own seed (based on i for reproducibility)
        random.seed(i)
        
        # set indices of elements to change
        element_change_positions = random.choices(range(len(medoid_input)), k=element_change)
        
        # change those elements
        simulated_game = copy.copy(medoid_input)
        for k in range(len(element_change_positions)):
            if simulated_game[element_change_positions[k]]==1:
                simulated_game[element_change_positions[k]]=0
            else:
                simulated_game[element_change_positions[k]]=1
        
        # append to list of games
        simulated_games.append(simulated_game)
    
    return(simulated_games)
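
One detail worth knowing about `cluster_creator`: `random.choices` draws with replacement, so the same position can be picked twice and flipped back. The simulated games therefore differ from the medoid by at most `element_change` elements, not necessarily exactly that many. If exact counts were wanted, a variant could sample distinct positions instead. The sketch below assumes that goal; `flip_distinct_positions` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def flip_distinct_positions(medoid, n_flips, seed):
    """Return a copy of `medoid` with exactly `n_flips` distinct positions flipped."""
    rng = np.random.RandomState(seed)
    medoid = np.asarray(medoid)
    # replace=False guarantees that every chosen index is unique
    positions = rng.choice(len(medoid), size=n_flips, replace=False)
    flipped = medoid.copy()
    flipped[positions] = 1 - flipped[positions]  # 0 becomes 1 and 1 becomes 0
    return flipped

example_medoid = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0])
example_game = flip_distinct_positions(example_medoid, n_flips=5, seed=0)
n_differences = int(np.abs(example_game - example_medoid).sum())
print(n_differences)  # 5, because five distinct positions were flipped
```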
        

In [635]:
# create the cluster of simulated games around the first medoid (the medoid itself is added later)
first_medoid_cluster = cluster_creator(first_medoid, [0,0.25], 24)

In [636]:
first_medoid_cluster

[array([0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0]),
 array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]),
 array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0]),
 array([0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]),
 array([0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]),
 array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]),
 array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0]),
 array([0, 0, 1, 0, 

Looks right, but just to test, let's make sure the difference between each simulated game and the medoid is less than or equal to 25% (5 elements of the 20):

In [637]:
distances = []
for i in range(len(first_medoid_cluster)):
    distances = distances + [cluster_distance_calculator(first_medoid, first_medoid_cluster[i])]
print(max(distances))

5


Perfect! Now to create the other clusters of games around the other medoids.

In [638]:
second_medoid_cluster = cluster_creator(second_medoid, [0,0.25], 24)
third_medoid_cluster = cluster_creator(third_medoid, [0,0.25], 24)
fourth_medoid_cluster = cluster_creator(fourth_medoid, [0,0.25], 24)

Finally, we can stitch all the games together into one dataframe:

In [639]:
simulated_games = [first_medoid] + first_medoid_cluster + [second_medoid] + second_medoid_cluster +\
[third_medoid] + third_medoid_cluster + [fourth_medoid] + fourth_medoid_cluster

In [640]:
simulated_games = pd.DataFrame(np.array(simulated_games))

In [641]:
simulated_games

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,1,0,1,0,0,0
1,0,0,0,0,0,1,1,0,1,0,0,1,1,0,1,1,0,0,0,0
2,0,0,1,0,0,1,1,0,0,0,0,1,1,0,1,0,0,0,0,0
3,0,0,0,0,0,1,1,0,0,0,0,1,1,0,1,0,1,0,1,1
4,0,0,0,0,1,1,1,1,0,0,1,1,1,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,0,1,1,1
96,1,1,1,1,1,1,1,1,1,0,1,1,0,0,1,0,0,1,0,1
97,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,0,1,1,1
98,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,0,1,1,0


### Building the Clustering Algorithm
So now, in total, we have 100 games of 20 resources with 25 in each cluster centered around 4 medoids. This is roughly what we should expect to find with the actual game data so it should serve as a relevant proxy.

To restate the goal of the Algorithm:  
**Clustering Algorithm**
1. Start with clusters equal to the number of nodes (k=m)
2. While k>=1:
    1. Calculate the distance between each node and every other node, where distance is the sum of resources that don't match (i.e. if a card is in both decks, that resource adds 0 distance, but if it is in one and not the other, then it adds 1 distance.  "Distance" is the sum of all these resource differences)
    2. For each node, calculate the nearest node(s) via distance + a user-defined tolerance distance and merge with the node via taking the mean
    3. For each merged node, calculate the distances to all other merged nodes and the min and max distances per merged node.
    4. For each min and max distance, add a weighted user-defined tolerance distance percent where the weight is relative to k such that weight = k/m (this applies a 100% weight when k=m and decreases toward 0% as k decreases, i.e. it takes larger 'steps' toward a lower k and slows down as k approaches 0)
    5. Drop the node(s) for which the min distance is smaller than or equal to the min distance + tolerance and, of those, the max distance is smaller than or equal to the max distance + tolerance (this step keeps the remaining nodes as far apart as possible)
    6. Store the resulting centroids, k, and average distance between nodes
    7. Set k equal to the resulting number of undropped nodes
3. Review average distance for k=2 through k=m
4. Select a reasonable k based on the average distance between nodes
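
The list above stores an "average distance between nodes" at each iteration, but this part of the notebook does not spell out how that average is computed. One plausible reading, offered here only as an assumption, is the mean over all distinct pairs of nodes, i.e. the upper triangle of the distance matrix. The helper name `average_pairwise_distance` is hypothetical, and the example matrix is the medoid distance matrix printed earlier.

```python
import numpy as np

def average_pairwise_distance(distance_matrix):
    """Mean distance over all distinct node pairs (assumed definition, not the author's)."""
    dist = np.asarray(distance_matrix, dtype=float)
    # k=1 skips the diagonal, so self-distances of 0 do not pull the average down
    upper = np.triu_indices(dist.shape[0], k=1)
    return dist[upper].mean()

medoid_distance_matrix = np.array([
    [0, 7, 11, 15],
    [7, 0, 10, 12],
    [11, 10, 0, 6],
    [15, 12, 6, 0],
])
average = average_pairwise_distance(medoid_distance_matrix)
print(round(average, 4))  # (7 + 11 + 15 + 10 + 12 + 6) / 6
```

With fewer than two nodes there are no pairs to average, so a caller would need to handle that case separately.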

#### 2A. Calculate the distance between each node and every other node
We'll need a function to calculate the distance between each node and every other node:

In [642]:
def node_distancer(resource_dataframe_input):
    '''
    input: resource_dataframe_input is a dataframe with resources in the columns and games (nodes) in the rows,
            filled with binary data or, for merged nodes, averaged values between 0 and 1
    output: an mxm array of distances between every pair of rows
    '''
    # create empty array to hold all the distances comparing each combination
    all_node_distances = np.zeros([len(resource_dataframe_input),len(resource_dataframe_input)])
    # calculate distance between the ith game and the jth game
    for i in range(len(all_node_distances)):
        for j in range(len(all_node_distances)):
            all_node_distances[i,j] = cluster_distance_calculator(resource_dataframe_input.iloc[i], resource_dataframe_input.iloc[j])
    
    return(all_node_distances)

In [643]:
simulated_node_distances = node_distancer(simulated_games)

In [644]:
simulated_node_distances

array([[ 0.,  3.,  2., ..., 15., 14., 14.],
       [ 3.,  0.,  3., ..., 12., 11., 11.],
       [ 2.,  3.,  0., ..., 13., 12., 12.],
       ...,
       [15., 12., 13., ...,  0.,  1.,  1.],
       [14., 11., 12., ...,  1.,  0.,  2.],
       [14., 11., 12., ...,  1.,  2.,  0.]])

Above is a 100x100 array which shows the distance of each node to every other node.  Below is just the first row, which compares the first node to all 100 nodes (itself included), separated by cluster:

In [648]:
print("Cluster 1: ", simulated_node_distances[0,0:25])
print("Cluster 2: ", simulated_node_distances[0,25:50])
print("Cluster 3: ", simulated_node_distances[0,50:75])
print("Cluster 4: ", simulated_node_distances[0,75:100])

Cluster 1:  [0. 3. 2. 2. 3. 5. 1. 4. 0. 4. 0. 2. 1. 1. 2. 3. 4. 1. 1. 3. 0. 3. 0. 1.
 1.]
Cluster 2:  [7. 8. 9. 9. 8. 8. 6. 7. 7. 9. 7. 7. 8. 8. 7. 8. 9. 8. 8. 4. 7. 6. 7. 8.
 8.]
Cluster 3:  [11.  8. 11. 11. 12. 12. 10. 11. 11.  9. 11.  9. 12. 12. 11. 10.  9. 12.
 12. 12. 11. 10. 11. 10. 10.]
Cluster 4:  [15. 12. 13. 13. 12. 10. 14. 15. 15. 13. 15. 13. 16. 16. 15. 14. 13. 14.
 14. 14. 15. 12. 15. 14. 14.]


Above is the first row of what I'm calling a 'node' distance separated into each of the medoid-centered groupings.  In this case, the node is 1 game, but we'll eventually be merging games of resources together so it's easier to call these the more generic 'node' than calling it 'the combination of game 1, game 10,..., game n'.  

What it's showing is that the distance between the first node (the medoid for the first cluster) and itself is, as expected, 0, and the distances between the first node and the others centered around that medoid range between 0 and 5. In Cluster 2, we see the distances between the first node and all the nodes in the second cluster, which range from 4 to 9. In general, Clusters 3 and 4 get further and further away.

#### 2B. For each node, calculate the nearest node(s) via distance + a user-defined tolerance distance and merge with the node via taking the mean
As can be seen, the nearest of all the nodes (that isn't itself) is at distance 0 and there are 4 nodes at that distance (indices 8, 10, 20, and 22).  This means that 4 nodes are identical to the first node. When we're dealing with ties, it seems reasonable to simply take an average across all nodes that are equally close.  

As per this plan, let's create a function which outputs an average of the nodes sent to it.

In [649]:
def node_averager(list_of_nodes):
    return(np.mean((list_of_nodes), axis = 0))

Next, for each row of 100 distances, we'll want to select the node(s) with the equally-minimal distance from the original node to send to the `node_averager()` and store the resulting nodes along with indices for the nodes included.  In other words, in the case of the first iteration when the nodes are equal to the games, we're finding the most similar other game(s) and creating an average new node.  

Importantly, there is a parameter for 'tolerance_input' which allows the user to add a distance value to the minimum distance. In cases where the nearest node is identical, this pulls in nodes that are within the distance of the tolerance. Overall, this helps the node centroids get more centered in a larger area.
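
To make the effect of the tolerance concrete, here is a small sketch on a made-up row of distances. It is not the notebook's `new_noder`: the helper `neighbours_within_tolerance` is hypothetical, it masks out the node itself explicitly, and it returns only the other nodes that qualify.

```python
import numpy as np

def neighbours_within_tolerance(distance_row, node_index, tolerance):
    """Indices of other nodes within (nearest distance + tolerance) of node_index."""
    row = np.asarray(distance_row, dtype=float)
    others = np.arange(len(row)) != node_index  # mask out the node itself
    threshold = row[others].min() + tolerance
    return np.where(others & (row <= threshold))[0]

# toy distances from node 0 to six nodes (node 0 itself comes first)
toy_row = np.array([0.0, 3.0, 0.0, 1.0, 2.0, 1.0])
tight = neighbours_within_tolerance(toy_row, node_index=0, tolerance=0)
loose = neighbours_within_tolerance(toy_row, node_index=0, tolerance=1)
print(tight.tolist())  # [2]        only the identical node
print(loose.tolist())  # [2, 3, 5]  nodes at distance 1 are pulled in as well
```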

In [650]:
def new_noder(node_table_input, node_distance_table_input, node_index, tolerance_input):
    '''
    inputs: node_table_input: a table with nodes (games) in the rows and resources in the columns (mxn)
            node_distance_table_input: an array of node distances (mxm)
            node_index: an integer row index (ranges from 0 to m-1)
            tolerance_input: value added to the minimum distance when selecting the closest nodes
    output: a list holding the new average node and [node_index, indices of the nodes averaged together]
    '''
    node_distances = node_distance_table_input[node_index]
    # nearest-node threshold; note that [1:] skips index 0 rather than node_index, so for
    # node_index > 0 the node's own distance of 0 is included in this minimum
    min_distance = min(node_distances[1:])+tolerance_input
    closest_node_indices = np.where(node_distances<=min_distance)[0] # [0] since these are 1-dimensional slices
    
    # grab the closest nodes and put into a list
    closest_nodes = []
    for i in range(len(closest_node_indices)):
        closest_nodes.append(node_table_input.iloc[closest_node_indices[i]].values)
    
    # grab primary_node and add to the list of closest_nodes; the primary node is already in
    # closest_node_indices (its distance to itself is 0), so it is weighted twice in the mean
    primary_node = node_table_input.iloc[node_index].values
    closest_nodes.append(primary_node)
    
    # take an average of the primary and closest node(s)
    new_node = node_averager(closest_nodes)
    
    return([new_node,[node_index,closest_node_indices]])
    

Let's first see the results when the tolerance is 0:

In [651]:
# creating a new node for the first node where the tolerance is 0
new_node = new_noder(simulated_games, simulated_node_distances, 0, 0)

In [667]:
# average node resource values
new_node[0]

array([0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1.,
       0., 0., 0.])

In [666]:
# test to show that the new node is identical to the original when there is 0 tolerance
all(new_node[0]==simulated_games.iloc[0,])

True

In [655]:
# average node index followed by the indices of the nodes which qualify for averaging
new_node[1]

[0, array([ 0,  8, 10, 20, 22])]

And now, the results when we set the tolerance to 1:

In [669]:
# creating a new node for the first node where the tolerance is 1
new_node = new_noder(simulated_games, simulated_node_distances, 0, 1)

In [670]:
# average node resource values
new_node[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       1.        , 1.        , 0.07692308, 0.        , 0.15384615,
       0.07692308, 1.        , 0.92307692, 0.        , 1.        ,
       0.        , 1.        , 0.07692308, 0.        , 0.07692308])

In [671]:
# test to show that the new node is NOT identical to the original when the tolerance is 1
all(new_node[0]==simulated_games.iloc[0,])

False

In [672]:
# average node index followed by the indices of the nodes which qualify for averaging
new_node[1]

[0, array([ 0,  6,  8, 10, 12, 13, 17, 18, 20, 22, 23, 24])]

As can be seen, when we set the tolerance to 1, more nodes are included in the average.
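
The fractional values in the averaged node can be checked by hand. This is a quick arithmetic sketch, assuming (as the `new_noder` code suggests) that the primary node is appended after the index array is collected, so the mean runs over one more row than the number of indices shown.

```python
# indices returned for node 0 with tolerance 1 (copied from the output above)
closest_indices = [0, 6, 8, 10, 12, 13, 17, 18, 20, 22, 23, 24]

# the primary node is appended once more before averaging
rows_averaged = len(closest_indices) + 1

one_row_has_a_one = 1 / rows_averaged    # a resource held by a single averaged row
two_rows_have_a_one = 2 / rows_averaged  # a resource held by two averaged rows
all_but_one = 12 / rows_averaged         # a resource missing from a single averaged row

print(rows_averaged)                  # 13
print(round(one_row_has_a_one, 8))    # 0.07692308
print(round(two_rows_have_a_one, 8))  # 0.15384615
print(round(all_but_one, 8))          # 0.92307692
```

These match the values 0.07692308, 0.15384615, and 0.92307692 in the averaged node, which is consistent with 13 rows going into the mean.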

As per the algorithm, we'll want to do this for all nodes before calculating the distances between each of them and removing the closest, so we'll make a function to do this and then perform the operation with tolerance = 1.

In [674]:
def node_updater(node_table_input, node_distance_table_input, tolerance_input = 1):
    '''
    input: node_table_input: a dataframe with observations in rows and features in columns with values between 0 and 1
           node_distance_table_input: an array of distances between nodes
           tolerance_input: value to be added to the minimum distance to be included in the closest nodes
    output: a dataframe of the hybrid average-closest nodes
    '''
    new_resource_list = []
    for i in range(len(node_table_input)):
        new_resource_list.append(new_noder(node_table_input, node_distance_table_input,i,tolerance_input)[0])
    
    # convert to dataframe
    updated_node_table = pd.DataFrame(np.array(new_resource_list))
    
    return(updated_node_table)

In [675]:
simulated_games_updated_nodes = node_updater(simulated_games, simulated_node_distances,1)

In [676]:
simulated_games_updated_nodes

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.076923,0.0,0.153846,0.076923,1.0,0.923077,0.0,1.0,0.0,1.0,0.076923,0.000000,0.076923
1,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.000000,1.0,0.000000,0.000000,1.0,1.000000,0.0,1.0,1.0,0.0,0.000000,0.000000,0.000000
2,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.000000,0.0,0.000000,0.000000,1.0,1.000000,0.0,1.0,0.0,0.0,0.000000,0.000000,0.000000
3,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.000000,0.0,0.000000,0.000000,1.0,1.000000,0.0,1.0,0.0,1.0,0.000000,0.666667,1.000000
4,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.000000,0.0,0.000000,1.000000,1.0,1.000000,0.0,1.0,0.0,1.0,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.923077,1.0,0.153846,0.923077,1.0,0.076923,1.0,1.0,1.0,0.0,0.923077,1.000000,0.923077
96,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.000000,1.0,0.000000,1.000000,1.0,0.000000,0.0,1.0,0.0,0.0,1.000000,0.000000,1.000000
97,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.923077,1.0,0.153846,0.923077,1.0,0.076923,1.0,1.0,1.0,0.0,0.923077,1.000000,0.923077
98,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.000000,1.0,0.000000,1.000000,1.0,0.000000,1.0,1.0,1.0,0.0,1.000000,0.875000,0.625000


With the updated-nodes at this iteration complete, we're on to the next step.

#### 2C. For each updated node, calculate the distances to all other merged nodes and the min and max distances per updated node.  

We can start by simply using the node_distancer function on the hybrid nodes in 'simulated_games_updated_nodes'.

In [677]:
node_distance_table_updated_nodes = node_distancer(simulated_games_updated_nodes)

In [678]:
# taking a look at the distances between the first updated node and every other updated node:
node_distance_table_updated_nodes[0]

array([ 0.        ,  3.53846154,  2.53846154,  2.05128205,  3.23076923,
        5.38461538,  0.67032967,  4.23076923,  0.        ,  4.38461538,
        0.        ,  2.53846154,  0.60576923,  0.60576923,  2.53846154,
        3.53846154,  4.38461538,  0.67032967,  0.67032967,  3.53846154,
        0.        ,  3.53846154,  0.        ,  0.88461538,  0.67032967,
        6.84615385,  8.38461538,  9.38461538,  8.8974359 ,  8.07692308,
        8.23076923,  7.0989011 ,  7.07692308,  6.84615385,  9.23076923,
        6.84615385,  7.38461538,  7.45192308,  7.45192308,  7.38461538,
        8.38461538,  9.23076923,  7.51648352,  7.51648352,  4.38461538,
        6.84615385,  6.38461538,  6.84615385,  7.73076923,  7.51648352,
       10.53846154,  8.07692308, 11.07692308, 10.8974359 , 11.76923077,
       11.92307692, 10.79120879, 10.76923077, 10.53846154,  9.23076923,
       10.53846154,  9.07692308, 11.14423077, 11.14423077, 11.07692308,
       10.07692308,  9.23076923, 11.20879121, 11.20879121, 12.07

Next, we calculate the min and max distances:

In [681]:
# calculate the minimum for node 0, excluding the distance to itself
min(node_distance_table_updated_nodes[0][[s for s in list(range(len(node_distance_table_updated_nodes[0]))) if s != 0]])

0.0

In [682]:
max(node_distance_table_updated_nodes[0])

14.836538461538462

It looks like the 0th node has a min distance of 0.0 (excluding the distance to itself, so at least one other node is identical to it) and a max distance of about 14.8.  What we want to know is which nodes share the smallest minimum distance across all nodes and then, of those, remove the node(s) with the smallest maximum distance.  In order to do this, we'll create a function which returns those min and max values.

In [683]:
def node_distance_min_maxer(node_distance_table_input):
    '''
    input: node_distance_table_input: mxm array of distances between nodes
    output: list of lists containing the minimum and maximum distances per node
    '''
    node_min_maxs= []
    for i in range(len(node_distance_table_input)):
        node_min = min(node_distance_table_input[i][[s for s in list(range(len(node_distance_table_input[i]))) \
                                                     if s != i]])
        node_max = max(node_distance_table_input[i])
        node_min_maxs.append([node_min, node_max])
    return(node_min_maxs)

In [684]:
node_distance_min_max_updated_nodes = node_distance_min_maxer(node_distance_table_updated_nodes)

In [685]:
# preview of min/max distances
node_distance_min_max_updated_nodes[0:10]

[[0.0, 14.836538461538462],
 [3.0, 16.0],
 [2.0, 15.0],
 [1.1666666666666665, 14.333333333333334],
 [2.7142857142857144, 15.0],
 [4.0, 15.0],
 [0.5714285714285714, 15.089285714285714],
 [2.0, 16.0],
 [0.0, 14.836538461538462],
 [3.0, 15.0]]

The first value in each list is the min and the second is the max.  
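
For larger tables, the per-node min and max can also be computed without a Python loop. The sketch below is an alternative under the same definition (minimum excluding the node itself, maximum over the whole row); `min_max_excluding_self` is a hypothetical name, and the example input is the medoid distance matrix from earlier.

```python
import numpy as np

def min_max_excluding_self(distance_matrix):
    """Per-node [min, max] distances, ignoring each node's distance to itself for the min."""
    dist = np.asarray(distance_matrix, dtype=float)
    masked = dist.copy()
    # putting infinity on the diagonal means a node can never be its own nearest neighbour
    np.fill_diagonal(masked, np.inf)
    mins = masked.min(axis=1)
    maxs = dist.max(axis=1)
    return np.column_stack([mins, maxs])

medoid_distance_matrix = np.array([
    [0, 7, 11, 15],
    [7, 0, 10, 12],
    [11, 10, 0, 6],
    [15, 12, 6, 0],
])
medoid_min_max = min_max_excluding_self(medoid_distance_matrix)
print(medoid_min_max.tolist())
# [[7.0, 15.0], [7.0, 12.0], [6.0, 11.0], [6.0, 15.0]]
```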

Now, we need a way to find the nodes that have the smallest min distance (the first value) and, among those, the smallest max distance (the minimum-max, or minmax, second value):

In [686]:
# function to pull the nth element from a list used for getting min and minmax distances:
def extract_nth(list_input, n): 
    return [element[n] for element in list_input] 

In [687]:
min_distance = min(extract_nth(node_distance_min_max_updated_nodes,0))

In [688]:
min_distance

0.0

#### 2D. For each min and max distance, add a weighted user-defined tolerance distance percent where the weight is relative to k such that weight = k/m

After testing with larger datasets, it became apparent that I'll want to establish a distance_tolerance_pct, which is a percent to add to the min and minmax distances in an effort to capture more nodes to be removed.  This is important because the computation is pretty intense: when we're dealing with thousands of observations/features and expecting a relatively small number of clusters, we aren't concerned with the intermediate clusters created while k is still large, so the faster we can reduce k without losing the integrity of the reduction, the better.
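
The heading for this step describes weighting the tolerance by k/m, while the cells in this section apply `distance_tolerance_pct` directly. As a sketch of how that weighting could look (the function `weighted_tolerance_pct` and the example numbers are assumptions for illustration, not code from the notebook):

```python
def weighted_tolerance_pct(distance_tolerance_pct, k, m):
    """Scale the tolerance percent by k/m: full weight at k == m, shrinking as k falls."""
    if m <= 0 or not 0 <= k <= m:
        raise ValueError("expected m > 0 and 0 <= k <= m")
    return distance_tolerance_pct * (k / m)

m = 100                # number of starting nodes
base_tolerance = 0.10  # a 10% tolerance, chosen only for this example
for k in (100, 50, 10, 2):
    effective_pct = weighted_tolerance_pct(base_tolerance, k, m)
    # a threshold of 4.0 is widened by the effective percent at this k
    widened_threshold = 4.0 * (1 + effective_pct)
    print(k, round(effective_pct, 4), round(widened_threshold, 3))
```

Early iterations (k close to m) widen the thresholds by nearly the full percent and so drop more nodes per pass; as k shrinks, the widening fades and the reduction becomes more careful.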

In [689]:
distance_tolerance_pct = 0 # testing with 0, then we'll increase to see the effect

In [690]:
min_distance = min_distance * (1 + distance_tolerance_pct)

In [691]:
min_distance

0.0

In [692]:
# determine the indices which are less than or equal to the min distance * (1 + tolerance)
node_min_indices = np.where(extract_nth(node_distance_min_max_updated_nodes,0)<=min_distance)[0]

In [693]:
node_min_indices

array([ 0,  8, 10, 12, 13, 20, 22, 25, 33, 35, 37, 38, 45, 47, 50, 58, 60,
       62, 63, 70, 72, 75, 83, 85, 87, 88, 95, 97])

In [694]:
min_distance_node_distances = [node_distance_min_max_updated_nodes[i] for i in node_min_indices]

In [695]:
# preview of the min and max distances for which the node shares the minimum distance to the other nodes
min_distance_node_distances

[[0.0, 14.836538461538462],
 [0.0, 14.836538461538462],
 [0.0, 14.836538461538462],
 [0.0, 15.375],
 [0.0, 15.375],
 [0.0, 14.836538461538462],
 [0.0, 14.836538461538462],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.375],
 [0.0, 13.375],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.375],
 [0.0, 13.375],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 14.836538461538462],
 [0.0, 14.836538461538462],
 [0.0, 14.836538461538462],
 [0.0, 15.375],
 [0.0, 15.375],
 [0.0, 14.836538461538462],
 [0.0, 14.836538461538462]]

This is a list of lists containing the min and max distances for the nodes that all share the same minimum value. What we want to know is which of these has the smallest maximum distance.  This will reveal the nodes that are the least 'interesting' for clustering, which can be dropped for the next iteration.

In [696]:
minmax_distance = min(extract_nth(min_distance_node_distances,1))

In [697]:
# the minimum-max distance between nodes which share the minimum distance
minmax_distance

13.23076923076923

As with the min distance, we'll add the tolerance pct to the upper bound as well (0 in the case of this first test):

In [698]:
minmax_distance = minmax_distance * (1 + distance_tolerance_pct)

In [699]:
minmax_distance

13.23076923076923

Finally, we have the min and minmax distance conditions for which to subset:

In [700]:
node_removal_distances = np.asarray([min_distance, minmax_distance])

In [701]:
node_removal_distances

array([ 0.        , 13.23076923])

We can see several nodes tied at [0, 13.23].  This being the case, we should remove those nodes from simulated_games_updated_nodes and repeat the process. This will make it so that we're not stepping down from k=m to k=2 one-step-at-a-time.  Instead, k will drop by however many nodes meet the min and minmax criteria.  
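
The two-stage selection (smallest min first, then smallest max among those) can be seen on a handful of made-up min/max pairs. This is a sketch that follows the walkthrough above, with the tolerance applied multiplicatively to both cutoffs; `removal_indices` and the toy values are hypothetical.

```python
import numpy as np

def removal_indices(min_max_pairs, distance_tolerance_pct=0.0):
    """Indices of nodes meeting both the min cutoff and the minmax cutoff."""
    pairs = np.asarray(min_max_pairs, dtype=float)
    mins, maxs = pairs[:, 0], pairs[:, 1]
    # stage 1: nodes whose nearest neighbour is as close as the overall minimum
    min_cutoff = mins.min() * (1 + distance_tolerance_pct)
    candidates = mins <= min_cutoff
    # stage 2: among those, nodes whose farthest neighbour is closest
    max_cutoff = maxs[candidates].min() * (1 + distance_tolerance_pct)
    return np.where(candidates & (maxs <= max_cutoff))[0]

toy_pairs = [[0.0, 14.8], [3.0, 16.0], [0.0, 13.2], [0.0, 13.2], [2.0, 15.0]]
strict = removal_indices(toy_pairs)
relaxed = removal_indices(toy_pairs, distance_tolerance_pct=0.2)
print(strict.tolist())   # [2, 3]
print(relaxed.tolist())  # [0, 2, 3]
```

Note that when the smallest min distance is 0, a multiplicative tolerance leaves the first cutoff at 0, so only the second cutoff widens.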

Now that we know which values we're looking for, we'll grab the indices for removal in the next step.

In [702]:
node_removal_indices = np.where((extract_nth(node_distance_min_max_updated_nodes,0)<=node_removal_distances[0]) &\
                                (extract_nth(node_distance_min_max_updated_nodes,1)<=node_removal_distances[1]))[0]

In [703]:
# node indices to be removed
node_removal_indices.tolist()

[25, 33, 35, 45, 47, 50, 58, 60, 70, 72]

In [704]:
# nodes to be removed share the following values for min and minmax distances:
[node_distance_min_max_updated_nodes[i] for i in node_removal_indices.tolist()]

[[0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923],
 [0.0, 13.23076923076923]]

#### 2E. Drop the node(s) for which the min distance is smaller than or equal to the min distance + tolerance and, of those, the max distance is smaller than or equal to the max distance + tolerance

In [707]:
simulated_games_post_drop = simulated_games_updated_nodes.drop(node_removal_indices)

In [710]:
simulated_games_post_drop.shape

(90, 20)

In [708]:
simulated_games_post_drop

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.076923,0.0,0.153846,0.076923,1.0,0.923077,0.0,1.0,0.0,1.0,0.076923,0.000000,0.076923
1,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.000000,1.0,0.000000,0.000000,1.0,1.000000,0.0,1.0,1.0,0.0,0.000000,0.000000,0.000000
2,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.000000,0.0,0.000000,0.000000,1.0,1.000000,0.0,1.0,0.0,0.0,0.000000,0.000000,0.000000
3,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.000000,0.0,0.000000,0.000000,1.0,1.000000,0.0,1.0,0.0,1.0,0.000000,0.666667,1.000000
4,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.000000,0.0,0.000000,1.000000,1.0,1.000000,0.0,1.0,0.0,1.0,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.923077,1.0,0.153846,0.923077,1.0,0.076923,1.0,1.0,1.0,0.0,0.923077,1.000000,0.923077
96,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.000000,1.0,0.000000,1.000000,1.0,0.000000,0.0,1.0,0.0,0.0,1.000000,0.000000,1.000000
97,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.923077,1.0,0.153846,0.923077,1.0,0.076923,1.0,1.0,1.0,0.0,0.923077,1.000000,0.923077
98,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.000000,1.0,0.000000,1.000000,1.0,0.000000,1.0,1.0,1.0,0.0,1.000000,0.875000,0.625000


In total, with 0 tolerance, we dropped the 10 nodes that share both the smallest min distance and the smallest minmax distance.  We'll want a function to do all these steps automatically:

In [711]:
def drop_closest_nodes(node_table_input, node_minmax_distances, distance_tolerance_pct = 0):
    '''
    input: node_table_input: table with observations in rows, features in columns, and binary values
           node_minmax_distances: list of lists containing each node's min and max distance to every other node
           distance_tolerance_pct: for larger datasets, it may be reasonable to drop more than just the nearest nodes
               such that k is reduced more quickly in the beginning. This value is the percent to be added to the 
               min/max values used to calculate which nodes to drop
    output: reduced_node_table: equal to node_table_input without the closest nodes
    '''
    # find the smallest min distance (the first element of each min/max list) and add the
    # distance_tolerance_pct as an absolute offset, which widens the threshold even when that distance is 0
    min_distance = min(extract_nth(node_minmax_distances,0)) + distance_tolerance_pct
    
    # then scale the threshold by the distance_tolerance_pct
    min_distance = min_distance * (1 + distance_tolerance_pct)
    
    # calculate indices of the nodes which contain the minimum
    node_min_indices = np.where(extract_nth(node_minmax_distances,0)<=min_distance)[0]
    
    # generate a list of min/max distances for the nodes which contain the min distance
    min_distance_node_distances = [node_minmax_distances[i] for i in node_min_indices]
    
    # of those distances, calculate the minimum max distance (the second element in the distance lists)
    minmax_distance = min(extract_nth(min_distance_node_distances,1))
    
    # add the distance_tolerance_pct to the minmax_distance
    minmax_distance = minmax_distance * (1 + distance_tolerance_pct)
    
    # create an array containing the mininum and minimum maximum distance
    node_removal_distances = np.asarray([min_distance, minmax_distance])
    
    # calculate the node indices from the distance table which are less than or equal to the node_removal_distances
    node_removal_indices = np.where((extract_nth(node_minmax_distances,0)<=node_removal_distances[0]) &\
                                (extract_nth(node_minmax_distances,1)<=node_removal_distances[1]))[0]
    
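    # drop the specified nodes from the resource table; node_removal_indices are row positions,
    # so map them to index labels before dropping
    reduced_node_table = node_table_input.drop(node_table_input.index[node_removal_indices])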
    
    return(reduced_node_table)
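
To make the drop rule concrete, here is a minimal sketch on a hypothetical list of [min, max] pairs. It mirrors the selection logic of `drop_closest_nodes` with plain NumPy arrays instead of `extract_nth`; the name `select_nodes_to_drop` and the toy values are assumptions for illustration only and are not part of the notebook.

```python
import numpy as np

def select_nodes_to_drop(minmax_pairs, tolerance_pct=0.0):
    """Return the row positions selected by the min / min-max drop rule (illustrative sketch)."""
    pairs = np.asarray(minmax_pairs, dtype=float)
    mins, maxes = pairs[:, 0], pairs[:, 1]

    # threshold on the min distance: absolute offset first, then percentage scaling
    min_threshold = (mins.min() + tolerance_pct) * (1 + tolerance_pct)

    # among nodes within that threshold, take the smallest max distance and widen it too
    candidates = mins <= min_threshold
    max_threshold = maxes[candidates].min() * (1 + tolerance_pct)

    # a node is dropped only if it satisfies both thresholds
    return np.where(candidates & (maxes <= max_threshold))[0]

# hypothetical [min, max] distances for four nodes
toy_pairs = [[0.0, 13.0], [0.0, 15.0], [1.0, 12.0], [2.0, 14.0]]
print(select_nodes_to_drop(toy_pairs, 0.0))
print(select_nodes_to_drop(toy_pairs, 0.5))
```

With no tolerance only the single tightest node qualifies; raising the tolerance widens both thresholds, so more nodes are selected per step.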

#### Testing impact of adjusting the distance tolerance parameter
With a tolerance of 0, 10 nodes are removed.  As we increase this figure, we should expect more and more rows to be removed.

In [730]:
tolerance_impacts = []
# test for tolerance values from 0 to 100%
for i in range(0,11):
    tolerance_impacts.append(100 - drop_closest_nodes(simulated_games_updated_nodes, \
                                           node_distance_min_max_updated_nodes,i/10).shape[0])

In [734]:
np.arange(0,1.1,0.1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [736]:
tolerance_results = pd.DataFrame({'tolerance': np.arange(0,1.1,0.1),
                                  'dropped_nodes': tolerance_impacts})

In [737]:
tolerance_results

| | tolerance | dropped_nodes |
|---|---|---|
| 0 | 0.0 | 10 |
| 1 | 0.1 | 14 |
| 2 | 0.2 | 28 |
| 3 | 0.3 | 28 |
| 4 | 0.4 | 28 |
| 5 | 0.5 | 44 |
| 6 | 0.6 | 48 |
| 7 | 0.7 | 52 |
| 8 | 0.8 | 52 |
| 9 | 0.9 | 52 |


Looks like this tuning parameter should help speed up convergence!

At this point, we have everything working and just need to stick all the pieces together:

In [738]:
def node_reducer(node_dataframe_input, tolerance_input, distance_tolerance_pct):
    '''
    input: node_dataframe_input: a dataframe with nodes in the rows and binary features in the columns
           tolerance_input: value to be added to the minimum distance to be included in the closest nodes
           distance_tolerance_pct: for larger datasets, it may be reasonable to drop more than just the nearest nodes
               such that k is reduced more quickly in the beginning. This value is the percent to be added to the 
               min/max values used to calculate which nodes to drop
    output: a reduced version of the input table which creates hybrid nodes and removes the node(s) for which there is
        the least difference to the other nodes.
    '''
    
    # calculate node distances
    node_distances = node_distancer(node_dataframe_input)
    
    # create new hybrid nodes
    new_nodes = node_updater(node_dataframe_input, node_distances, tolerance_input)
    
    # calculate hybrid node distances
    new_node_distances = node_distancer(new_nodes)
    
    # calculate the min and max distances per new hybrid node
    new_node_minmax_distances = node_distance_min_maxer(new_node_distances)
    
    # drop the closest node(s)
    reduced_node_dataframe = drop_closest_nodes(new_nodes, new_node_minmax_distances, distance_tolerance_pct)
    
    return(reduced_node_dataframe)


Next, we'll need to apply the node reducer down to k=2 and calculate the average distance between clusters at each step to get a sense of how many clusters should be included.

In [739]:
simulated_node_distances

array([[ 0.,  3.,  2., ..., 15., 14., 14.],
       [ 3.,  0.,  3., ..., 12., 11., 11.],
       [ 2.,  3.,  0., ..., 13., 12., 12.],
       ...,
       [15., 12., 13., ...,  0.,  1.,  1.],
       [14., 11., 12., ...,  1.,  0.,  2.],
       [14., 11., 12., ...,  1.,  2.,  0.]])

Calculating the average distance between nodes:

In [740]:
np.mean(simulated_node_distances, axis = 0)

array([ 8.49,  8.11,  8.83,  8.91,  8.95,  8.87,  8.03,  9.29,  8.49,
        8.87,  8.49,  8.15,  9.25,  9.25,  8.91,  8.83,  8.87,  8.87,
        8.91,  8.57,  8.49,  8.15,  8.49,  8.49,  8.49,  7.81,  8.19,
        8.15,  8.23,  8.95,  8.87,  8.27,  8.53,  7.81,  8.87,  7.81,
        8.15,  8.57,  8.57,  7.39,  8.91,  8.19,  8.19,  8.23,  7.73,
        7.81,  8.99,  7.81,  7.81,  7.81,  7.31,  7.69,  7.65,  7.73,
        8.45,  8.37,  7.77,  8.87,  7.31,  7.61,  7.31,  7.65,  8.07,
        8.07,  7.73,  7.65,  6.93,  7.69,  7.73,  8.15,  7.31,  8.49,
        7.31,  7.31,  7.31,  8.49,  8.87,  8.15,  8.07,  8.03,  8.11,
        8.95, 10.05,  8.49,  8.87,  8.49,  8.83,  9.25,  9.25,  8.91,
        8.91,  8.87,  8.11,  8.07,  9.33,  8.49,  8.83,  8.49,  8.49,
        8.49])

Calculating the average of those average distances:

In [742]:
np.mean(np.mean(simulated_node_distances, axis = 0))

8.3306

The average distance of each node to every other node is 8.33.  Ideally, this number gets larger as we decrease the number of clusters.  
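
One caveat worth checking: `np.mean` over the full distance matrix also averages in the zero self-distances on the diagonal, so it slightly understates the distance to *other* nodes. For an n x n matrix the two means differ by a factor of n/(n-1), so with 100 nodes 8.3306 becomes roughly 8.41. The sketch below shows the difference on a hypothetical 3-node matrix; `mean_offdiagonal_distance` is an illustrative helper, not a function from the notebook.

```python
import numpy as np

def mean_offdiagonal_distance(distance_matrix):
    """Average pairwise distance, ignoring each node's zero distance to itself."""
    d = np.asarray(distance_matrix, dtype=float)
    off_diagonal = ~np.eye(d.shape[0], dtype=bool)  # True everywhere except the diagonal
    return d[off_diagonal].mean()

# hypothetical symmetric distance matrix for three nodes
toy_distances = np.array([[0., 3., 2.],
                          [3., 0., 3.],
                          [2., 3., 0.]])

mean_with_diagonal = np.mean(toy_distances)                       # 16 / 9
mean_without_diagonal = mean_offdiagonal_distance(toy_distances)  # 16 / 6
print(mean_with_diagonal, mean_without_diagonal)
```

The gap grows as k shrinks (it is a factor of 2 at k=2), which matters when comparing average distances across steps.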

#### 2F. Store the resulting centroids, k, and average distance between nodes
#### 2G. Set k equal to the resulting number of undropped nodes

These steps can be completed at the same time within a function which wraps together all the other work:

In [743]:
def binary_clusterer(node_dataframe_input, tolerance_input = 1, distance_tolerance_pct = 0):
    '''
    input: node_dataframe_input: a dataframe with nodes in the rows and binary features in the columns
           tolerance_input: value to be added to the minimum distance to be included in the closest nodes
           distance_tolerance_pct: for larger datasets, it may be reasonable to drop more than just the nearest nodes
               such that k is reduced more quickly in the beginning. This value is the percent to be added to the 
               min/max values used to calculate which nodes to drop
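    output: a list [centroids, average_distances, ks], each ordered from the smallest k to the largest:
            centroids: for each step, the resulting hybrid clusters
            average_distances: for each step, the resulting average distance to every other cluster
            ks: for each step, the number of centroids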
    '''
    centroids = []
    average_distances = []
    ks = []
    node_dataframe = node_dataframe_input
      
    k = len(node_dataframe_input)
    
    while k>=1:
        # save results of current step
        centroids.append(node_dataframe)
        average_distances.append(np.mean(np.mean(node_distancer(node_dataframe), axis = 0)))
        ks.append(k)
        
        # apply the distance_tolerance_pct more heavily when k is closer to the maximum number of nodes,
            # then back off as we approach convergence by applying a weight relative to k
        graduated_tolerance_pct = k/len(node_dataframe_input) * distance_tolerance_pct
        
        # reduce node_dataframe
        node_dataframe = node_reducer(node_dataframe, tolerance_input, graduated_tolerance_pct)
        
        # set k
        k = len(node_dataframe)
    
    cluster_results = [centroids,average_distances,ks]
    
    # reverse the order (from smallest number of clusters to greatest) to make interpretation easier
    cluster_results[0].reverse()
    cluster_results[1].reverse()
    cluster_results[2].reverse()
    
    return(cluster_results)
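
The graduated tolerance inside the loop is just the user-supplied percentage scaled by k/m. A minimal sketch of how it decays, assuming m = 100 starting nodes and a 50% tolerance (the helper name `graduated_tolerance` is illustrative only):

```python
def graduated_tolerance(k, m, distance_tolerance_pct):
    """Weight the tolerance by k/m: full strength at k = m, shrinking toward 0 as k falls."""
    return k / m * distance_tolerance_pct

m = 100  # assumed number of starting nodes
schedule = {k: graduated_tolerance(k, m, 0.5) for k in (100, 50, 10, 2)}
for k, tolerance in schedule.items():
    print(k, round(tolerance, 4))
```

The early steps therefore remove nodes aggressively, while the last few steps behave almost like the 0% tolerance run.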
        
    
    

## Testing Results

Test with 0% tolerance (this should have the maximum number of clusters and take the longest to process):

In [744]:
cluster_results = binary_clusterer(simulated_games,1,0)

In [745]:
cluster_results_comparison = pd.DataFrame({'k_clusters':cluster_results[2],
                                          'average_cluster_distance':cluster_results[1]})

In [746]:
cluster_results_comparison.shape

(50, 2)

In [749]:
cluster_results_comparison[0:10]

| | k_clusters | average_cluster_distance |
|---|---|---|
| 0 | 2 | 4.277778 |
| 1 | 3 | 5.407407 |
| 2 | 4 | 5.84375 |
| 3 | 5 | 6.88 |
| 4 | 6 | 7.444444 |
| 5 | 7 | 7.918367 |
| 6 | 8 | 7.927083 |
| 7 | 10 | 8.26 |
| 8 | 11 | 8.320441 |
| 9 | 16 | 8.335938 |


In [750]:
max(cluster_results_comparison.average_cluster_distance)

8.566752697674302

This took about 4 minutes to process.  We can see from the average cluster distances that we approach the maximum at k=10 or so.  This is somewhat to be expected given how much noise we introduced to the dataset.

Let's compare with a 50% tolerance:

In [751]:
cluster_results_high_tolerance = binary_clusterer(simulated_games,1,0.50)

In [752]:
cluster_results_high_tolerance_comparison = pd.DataFrame({'k_clusters':cluster_results_high_tolerance[2],
                                                          'average_cluster_distance':cluster_results_high_tolerance[1]})

In [753]:
cluster_results_high_tolerance_comparison.shape

(17, 2)

In [754]:
cluster_results_high_tolerance_comparison

| | k_clusters | average_cluster_distance |
|---|---|---|
| 0 | 2 | 4.487654 |
| 1 | 3 | 5.185185 |
| 2 | 4 | 6.444444 |
| 3 | 5 | 6.933333 |
| 4 | 6 | 7.722222 |
| 5 | 7 | 7.660771 |
| 6 | 8 | 7.684028 |
| 7 | 10 | 7.486667 |
| 8 | 11 | 7.669421 |
| 9 | 13 | 8.213018 |


Here, we're getting similar results with k=10 looking ideal.  With a 50% tolerance, however, convergence is reached in only 17 iterations vs the 50 with 0% tolerance, and it only took around 30 seconds. This should lead to significant performance gains on larger datasets.

Let's take a look at the centroids for when k=10:

In [755]:
cluster_centroids = cluster_results[0][7]

In [756]:
cluster_centroids

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
1,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
5,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
6,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0
7,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0
8,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
9,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0
10,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0


Now that we have our centroids, we can classify each of the original simulated games into clusters by its nearest centroid:

In [757]:
def cluster_classifier(node_table_input, cluster_centroids):
    '''
    input: node_table_input: a dataframe with nodes in the rows and binary features in the columns
           cluster_centroids: the centroids resulting from the binary clustering algorithm
    output: a list which classifies each node into a cluster based on shortest distance
    
    '''
    clusters = []
    
    for i in range(len(node_table_input)):
        node_distances = []
        
        # get distance to each cluster centroid:
        for j in range(len(cluster_centroids)):
            node_distances.append(cluster_distance_calculator(node_table_input.iloc[i], cluster_centroids.iloc[j]))
        
        min_distance = min(node_distances)
        
        clusters.append(np.where(np.asarray(node_distances)==min_distance)[0][0])
    
    return(clusters)
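
For larger tables, the same nearest-centroid assignment can be done without the nested Python loops. This is a minimal sketch, assuming the distance is the sum of absolute feature differences (the actual `cluster_distance_calculator` is defined elsewhere in the notebook and may differ); `nearest_centroid_labels` and the toy data are hypothetical.

```python
import numpy as np

def nearest_centroid_labels(nodes, centroids):
    """Assign each node the position of its closest centroid (ties go to the first one)."""
    nodes = np.asarray(nodes, dtype=float)
    centroids = np.asarray(centroids, dtype=float)

    # distances[i, j] = sum of absolute feature differences between node i and centroid j
    distances = np.abs(nodes[:, None, :] - centroids[None, :, :]).sum(axis=2)

    # argmin returns the first minimum, matching np.where(...)[0][0] in the loop version
    return distances.argmin(axis=1)

# hypothetical binary nodes and two centroids
toy_nodes = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
toy_centroids = [[1, 1, 0, 0], [0, 0, 1, 1]]
toy_labels = nearest_centroid_labels(toy_nodes, toy_centroids)
print(toy_labels)
```

Broadcasting builds an (n_nodes, n_centroids, n_features) array, so for very large inputs it trades memory for speed.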

In [758]:
node_clusters = cluster_classifier(simulated_games, cluster_centroids)

In [763]:
np.asarray(node_clusters).reshape(4,25)

array([[0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0],
       [1, 3, 1, 3, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 6, 3, 2, 1, 1, 3, 1, 3,
        1, 2, 1],
       [4, 5, 5, 7, 4, 4, 4, 5, 4, 6, 4, 4, 5, 5, 5, 7, 6, 4, 4, 7, 4, 7,
        4, 6, 4],
       [9, 9, 4, 9, 6, 4, 9, 9, 9, 4, 9, 4, 9, 9, 9, 4, 8, 9, 9, 9, 9, 9,
        9, 8, 9]])

Wow! It looks like it's classifying 24 of the 25 nodes in the first cluster into a single cluster. Most of the second cluster are 1s (15 of 25), the most common label in the third is 4 (12 of 25), and most of the fourth are 9s (17 of 25).  This shows that it's working well, but the noise we introduced made getting an exact match to the original difficult.  

Even with the noise, let's see how it performs when we select k = 4:

In [764]:
# select index = 2, which is 4 clusters
cluster_centroids = cluster_results[0][2]
node_clusters = cluster_classifier(simulated_games, cluster_centroids)

We're expecting the groupings to show the first 25 in a cluster, the second 25 in a cluster, and so on, so let's investigate by each set of 25. First though, let's see how many nodes were classified in each cluster:

In [765]:
print(node_clusters.count(0))
print(node_clusters.count(1))
print(node_clusters.count(2))
print(node_clusters.count(3))

28
25
27
20


Looks good overall! Let's see how each cluster performed:

In [766]:
node_clusters[0:25]

[0, 2, 0, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0]

Looks like this cluster was classified as '0', so let's see how many aren't 0:

In [767]:
len(np.where(np.asarray(node_clusters[0:25])!=0)[0])

4

21/25 isn't bad!

In [768]:
node_clusters[25:50]

[2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 0, 2, 0, 2, 0, 2]

In [769]:
len(np.where(np.asarray(node_clusters[25:50])!=2)[0])

7

18/25 for the second cluster.

In [770]:
node_clusters[50:75]

[1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1, 1, 1, 2, 1]

In [771]:
len(np.where(np.asarray(node_clusters[50:75])!=1)[0])

6

19/25 for the third.

In [772]:
node_clusters[75:100]

[3, 3, 1, 3, 2, 1, 3, 3, 3, 1, 3, 1, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3]

In [773]:
len(np.where(np.asarray(node_clusters[75:100])!=3)[0])

7

18/25 for the fourth.

Overall, we're showing 76% accuracy. 
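
The 76% figure is (21 + 18 + 19 + 18) / 100. As a cross-check, the sketch below recomputes it from the four label lists printed above, treating the most common label in each block of 25 as that block's cluster. The helper name `majority_accuracy` is illustrative, and this scoring rule is an assumption that matches the manual counts here; it does not guard against two blocks sharing the same majority label.

```python
from collections import Counter

# predicted labels for each simulated cluster of 25 games, copied from the output above
labels_by_block = [
    [0, 2, 0, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0],
    [2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 0, 2, 0, 2, 0, 2],
    [1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1, 1, 1, 2, 1],
    [3, 3, 1, 3, 2, 1, 3, 3, 3, 1, 3, 1, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3],
]

def majority_accuracy(blocks):
    """Share of nodes that carry the most common label of their own block."""
    correct = 0
    total = 0
    for block in blocks:
        _, majority_count = Counter(block).most_common(1)[0]
        correct += majority_count
        total += len(block)
    return correct / total

accuracy = majority_accuracy(labels_by_block)
print(accuracy)
```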

## Testing with Tighter Simulated Clusters
The 25% noise added when generating the data, in an effort to make it more realistic, is obscuring the results.  This being the case, we'll test again with only a 10% difference between any node and its medoid:

In [774]:
first_medoid_cluster_simple = cluster_creator(first_medoid, [0,0.1], 24)
second_medoid_cluster_simple = cluster_creator(second_medoid, [0,0.1], 24)
third_medoid_cluster_simple = cluster_creator(third_medoid, [0,0.1], 24)
fourth_medoid_cluster_simple = cluster_creator(fourth_medoid, [0,0.1], 24)

In [775]:
simulated_games_simple = [first_medoid] + first_medoid_cluster_simple + [second_medoid] + \
    second_medoid_cluster_simple + [third_medoid] + third_medoid_cluster_simple + [fourth_medoid] + \
    fourth_medoid_cluster_simple

In [776]:
simulated_games_simple = pd.DataFrame(np.array(simulated_games_simple))

In [788]:
cluster_results_simple = binary_clusterer(simulated_games_simple, 1, .2)

In [789]:
cluster_results_simple_comparison = pd.DataFrame({'k_clusters':cluster_results_simple[2],
                                                  'average_cluster_distance':cluster_results_simple[1]})

In [790]:
cluster_results_simple_comparison[0:10]

| | k_clusters | average_cluster_distance |
|---|---|---|
| 0 | 2 | 3.333333 |
| 1 | 3 | 3.555556 |
| 2 | 5 | 4.586667 |
| 3 | 6 | 5.37037 |
| 4 | 7 | 5.659864 |
| 5 | 8 | 5.595833 |
| 6 | 11 | 7.442424 |
| 7 | 13 | 8.228797 |
| 8 | 15 | 8.383858 |
| 9 | 17 | 8.359769 |


In [791]:
cluster_centroids_simple = cluster_results_simple[0][2]

In [792]:
node_clusters_simple = cluster_classifier(simulated_games_simple, cluster_centroids_simple)

In [793]:
print(node_clusters_simple.count(0))
print(node_clusters_simple.count(1))
print(node_clusters_simple.count(2))
print(node_clusters_simple.count(3))

69
20
1
1


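Hmm, my first testing showed 97% accuracy, but after I fixed an issue to ensure reproducibility, it looks like the algorithm may be attracted to outliers: one cluster absorbs 69 of the 100 nodes while two clusters hold a single node each.  Note also that index 2 corresponds to k=5 in this run (there is no k=4 step), and the four counts printed above sum to 91, so the remaining 9 nodes presumably fall in a fifth cluster.  This needs more development, and it helps explain why the results on the final dataset were a bit skewed.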
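
When checking cluster sizes, counting every label that actually occurs is safer than printing counts for a fixed set of labels, since the chosen step may hold more clusters than expected. A minimal sketch with a hypothetical label list (the helper `cluster_sizes` is illustrative only):

```python
from collections import Counter

def cluster_sizes(labels):
    """Count how many nodes fall in each cluster label that actually occurs."""
    return dict(sorted(Counter(labels).items()))

# hypothetical labels containing a fifth cluster (label 4)
toy_labels = [0, 0, 0, 1, 4, 4, 2, 3]
sizes = cluster_sizes(toy_labels)
print(sizes)
```

Applied to `node_clusters_simple`, this would show at a glance whether any nodes were assigned to labels beyond 0 through 3.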