## Sample Notebook
This notebook shows a simple example of how to use the Polyhedral Cluster Explanation Tools.

In [1]:
import pandas as pd
from global_poly_cg import GlobalPolyClusterExplainCG
from group_helpers import *

### Example: Zoo dataset
For this simple example we'll use the Zoo dataset from the UCI Machine Learning Repository. The dataset has already been pre-processed (numerical features scaled to \[0,1\] and categorical variables one-hot encoded), and a reference clustering was generated by selecting the best k-means++ solution according to silhouette score.
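
As a rough illustration of the pre-processing described above, the sketch below min-max scales a numerical column to \[0,1\] and one-hot encodes a categorical one. It uses plain Python and made-up values. The helper names `min_max_scale` and `one_hot` are hypothetical, are not part of the toolkit, and do not necessarily reflect how `sample_data/zoo.csv` was actually produced.

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to 0.0 to avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return (categories, rows), where each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Made-up example values (not taken from the zoo dataset)
legs = [4, 4, 0, 2, 8]
scaled_legs = min_max_scale(legs)

habitat = ["land", "water", "land", "air"]
categories, encoded = one_hot(habitat)

print(scaled_legs)
print(categories)
print(encoded)
```

In a pandas workflow the same two steps would typically be done on whole columns at once; the list-based version is only meant to make the arithmetic explicit.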

In [3]:
data = pd.read_csv('sample_data/zoo.csv')
data.columns = ['hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator', 
               'toothed', 'backbone', 'breathes', 'venemous', 'fins', 'legs', 'tail', 'domestic',
               'catsize', 'type', 'cluster']


In [4]:
data.head()

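| | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venemous | fins | legs | tail | domestic | catsize | type | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.5 | 1.0 | 0.0 | 1.0 | 0.0 | 1 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.5 | 0 |
| 3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 4 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.5 | 1.0 | 0.0 | 1.0 | 0.0 | 1 |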


To speed up column generation, it helps to start from an initial set of candidate half-spaces. The `compressData` function called in the next cell generates, among other things, a set of univariate half-spaces.
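
To make "univariate half-space" concrete: it is a constraint $w^\top x \le b$ in which $w$ has a single non-zero entry, so it thresholds one feature. The sketch below enumerates such candidates from the midpoints between consecutive distinct feature values. This is only one simple way to build candidates; the function name `univariate_half_spaces` is hypothetical, and it is not a description of what `compressData` does internally.

```python
def univariate_half_spaces(X):
    """Enumerate candidate half-spaces w.x <= b with a single non-zero weight.

    X is a list of rows (lists of floats). For every feature j and every
    midpoint t between two consecutive distinct values of that feature,
    two candidates are produced: x_j <= t and x_j >= t (written as -x_j <= -t).
    """
    n_features = len(X[0])
    candidates = []
    for j in range(n_features):
        distinct = sorted({row[j] for row in X})
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2.0
            w_le = [0.0] * n_features
            w_le[j] = 1.0
            w_ge = [0.0] * n_features
            w_ge[j] = -1.0
            candidates.append((w_le, t))
            candidates.append((w_ge, -t))
    return candidates


# Made-up data: two features already scaled to [0, 1]
X = [[0.0, 0.5], [1.0, 0.5], [1.0, 1.0]]
half_spaces = univariate_half_spaces(X)
for w, b in half_spaces:
    print(w, "<=", b)
```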

In [5]:
_,_, W, b, alpha = compressData(data.drop('cluster',axis=1), data['cluster'], alpha_type='global', estimator='dt')

In [6]:
cg_solver = GlobalPolyClusterExplainCG(data.drop('cluster',axis=1), data['cluster'], 
                                           W, b,
                                           bad_hps = 0, 
                                           alpha = 0, alpha_type = 'global', M=1, beta=1,
                                           objective='complex')

outputs = cg_solver.solve(colGen = True, timeLimit = 10)

Set parameter Username
Academic license - for non-commercial use only - expires 2024-01-28
14.0
Set parameter TimeLimit to value 300
Set parameter TimeLimit to value 300
Set parameter TimeLimit to value 300
Set parameter TimeLimit to value 300
14.0
[... the line 14.0 repeats 101 more times ...]
Time limit for column generation exceeded. Solving MIP.
14.0


The output of the algorithm contains a variety of information on the optimization problem, including:

- `z`: A flattened array of the final $z_{hk}$ variables
- `y`: An array of the $y_d$ variables
- `eps`: An array of the $\xi$ variables


In [7]:
outputs

{'status': 'SOLVED',
 'num_constr': 423,
 'num_vars': 460,
 'b&b_nodes': 0.0,
 'time': 0.004539966583251953,
 'z': array([-0., -0., -0., -0., -0.,  0., -0.,  1., -0., -0., -0.,  0., -0.,
         0.,  1.,  0., -0.,  0., -0.,  0., -0.,  0., -0.,  0., -0.,  0.,
        -0.,  0., -0.,  0., -0.,  0., -0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0., -0.,  0., -0., -0., -0.,  0.,
         1.,  0., -0.,  0., -0.,  0., -0.,  0., -0.,  0., -0.,  0., -0.,
         0

To get the final polyhedra, extract the selected half-spaces from the `z` decision variables and the `W` and `b` dictionaries. Note that columns generated during column generation (CG) are added to these dictionaries.

In [8]:
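import numpy as np

# Helper function to go from decision variables to the selected set of half-spaces
def extractW(res, W, b):
    # Flattened z vector, read as one block of entries per cluster in the order of W's keys
    z = np.asarray(res['z'])
    W_final = {}
    B_final = {}
    starting_idx = 0

    for cl in W.keys():
        W_final[cl] = {}
        B_final[cl] = {}
        num_hps = len(W[cl])

        # Round before casting so that solver tolerance (e.g. 1e-9) is not read as "selected"
        selected = np.round(z[starting_idx:starting_idx + num_hps]).astype(bool)
        for key, keep in zip(W[cl].keys(), selected):
            if keep:
                W_final[cl][key] = W[cl][key]
                B_final[cl][key] = b[cl][key]

        # Move on to the next cluster's block of z
        starting_idx += num_hps

    return W_final, B_final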

In [9]:
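# Helper function to print half-spaces
def print_polyhedron(W, b, column_names):
    for key in W.keys():
        coeffs = []
        # zip stops at the shorter input, so any extra trailing column names are ignored
        for val, col_name in zip(W[key], column_names):
            # Skip coefficients with magnitude at most 0.5 (they would round to 0)
            if abs(val) > 0.5:
                coeffs.append(str(round(val))+'*'+col_name)
        print('+'.join(coeffs)+'<='+str(b[key]))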

In [10]:
W_final, b_final = extractW(outputs, W, b)

In [11]:
#Print description for cluster 0
print_polyhedron(W_final[0], b_final[0], data.columns)

1*milk<=0.0
-1*toothed<=1.0
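
Each printed line is one half-space of the cluster's polyhedron, and a point lies inside the polyhedron when it satisfies all of them. The sketch below checks this for dictionaries shaped like `W_final[0]` and `b_final[0]` (half-space key mapped to a coefficient vector and to a right-hand side). The function `in_polyhedron`, the tolerance `tol`, and the toy half-spaces are assumptions made for illustration and are not part of the toolkit.

```python
def in_polyhedron(x, W, b, tol=1e-9):
    """Return True if x satisfies w.x <= b for every half-space key in W."""
    for key in W:
        lhs = sum(w_j * x_j for w_j, x_j in zip(W[key], x))
        if lhs > b[key] + tol:
            return False
    return True


# Made-up polyhedron over two features (feature_0, feature_1):
#   1*feature_0 <= 0.0   and   -1*feature_1 <= -0.5  (i.e. feature_1 >= 0.5)
W_toy = {"hp_a": [1.0, 0.0], "hp_b": [0.0, -1.0]}
b_toy = {"hp_a": 0.0, "hp_b": -0.5}

inside = in_polyhedron([0.0, 1.0], W_toy, b_toy)   # satisfies both half-spaces
outside = in_polyhedron([1.0, 1.0], W_toy, b_toy)  # violates hp_a
print(inside, outside)
```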
