## Sample Notebook
This notebook shows a simple example of how to use the Polyhedral Cluster Explanation Tools.

In [2]:
import pandas as pd
from global_poly_cg import GlobalPolyClusterExplainCG
from group_helpers import *

### Example: Zoo dataset
For this simple example we'll use the Zoo dataset from the UCI Machine Learning Repository. The dataset has already been pre-processed (numerical features scaled to [0, 1] and categorical variables one-hot encoded), and a reference clustering was generated by selecting the best k-means++ solution according to silhouette score.
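
As a rough illustration of the pre-processing described above, here is a minimal sketch of min-max scaling and one-hot encoding in plain Python. The helper names (`min_max_scale`, `one_hot`) and the toy values are hypothetical and are not taken from the pipeline that produced `zoo.csv`; the clustering step is omitted.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per sorted distinct category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Hypothetical toy features, not rows from the actual dataset.
legs = [4, 2, 0, 4, 8]
kind = ["mammal", "bird", "fish", "mammal", "bug"]

scaled_legs = min_max_scale(legs)          # numerical feature mapped into [0, 1]
categories, encoded_kind = one_hot(kind)   # categorical feature as indicator columns

print(scaled_legs)
print(categories)
print(encoded_kind[0])
```

After this kind of transformation every feature lies in [0, 1], which is consistent with the values shown in the `data.head()` output.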

In [3]:
data = pd.read_csv('sample_data/zoo.csv')

In [4]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,cluster
0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.5,0.0,0.0,1.0,0.0,1
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.5,1.0,0.0,1.0,0.0,1
2,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.5,0
3,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.5,0.0,0.0,1.0,0.0,1
4,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.5,1.0,0.0,1.0,0.0,1


To speed up the column generation process, it helps to start from an initial set of candidate half-spaces. The compressData function used below generates (among other things) a set of univariate half-spaces.
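
To make the idea of a univariate half-space concrete, the sketch below enumerates candidates of the form `w . x <= b`, where `w` is a unit vector that selects a single feature. This is only an illustration under stated assumptions: the function names are hypothetical, the choice of midpoints between consecutive distinct values as thresholds is an assumption, and nothing here is claimed to match how the library builds its own `W` and `b`.

```python
def univariate_half_spaces(rows):
    """Return (W, b) where candidate k is the half-space  W[k] . x <= b[k].

    Each W[k] is a unit vector e_j, so the constraint involves feature j only.
    Thresholds are midpoints between consecutive distinct values of feature j.
    """
    n_features = len(rows[0])
    W, b = [], []
    for j in range(n_features):
        distinct = sorted({row[j] for row in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            w = [0.0] * n_features
            w[j] = 1.0
            W.append(w)
            b.append((lo + hi) / 2.0)
    return W, b


def satisfies(x, w, threshold):
    """Check whether point x lies in the half-space  w . x <= threshold."""
    return sum(wi * xi for wi, xi in zip(w, x)) <= threshold


# Hypothetical toy data: three points with three features in [0, 1].
rows = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
W, b = univariate_half_spaces(rows)

for w, threshold in zip(W, b):
    print(w, threshold, [satisfies(x, w, threshold) for x in rows])
```

A polyhedron is an intersection of such half-spaces, so a point belongs to it exactly when `satisfies` holds for every selected `(w, threshold)` pair.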

In [5]:
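# Build the initial candidate half-spaces (W, b); the first two return values are not used here.
_, _, W, b, alpha = compressData(data.drop('cluster', axis=1), data['cluster'], alpha_type='global', estimator='dt')

# Assumption: beta = 1, matching the M=1, beta=1 call in the next cell.
beta = 1

# Phase one: solve the feasibility problem with column generation enabled.
phase_one_res = GlobalPolyClusterExplainCG(data.drop('cluster', axis=1), data['cluster'],
                                           W, b,
                                           bad_hps=0,
                                           alpha=0, alpha_type='global', M=beta, beta=beta,
                                           objective='feasibility',
                                           new_hp_prefix='PHASE1CG_%d_' % beta).solve(colGen=True, timeLimit=300)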


In [9]:
cg_solver = GlobalPolyClusterExplainCG(data.drop('cluster',axis=1), data['cluster'], 
                                           W, b,
                                           bad_hps = 0, 
                                           alpha = 0, alpha_type = 'global', M=1, beta=1,
                                           objective='feasibility')

outputs = cg_solver.solve(colGen = False)

Using license file /Users/connorlawless/gurobi.lic
Academic license - for non-commercial use only
0.0


In [10]:
outputs

{'status': 'SOLVED',
 'num_constr': 422,
 'num_vars': 254,
 'b&b_nodes': 0.0,
 'time': 0.034050941467285156,
 'z': array([-0.,  4., -0.,  4., -0., -0., -0.,  3., -0.,  4., -0., -0., -0.,
        -0.,  3., -0.,  3., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
        -0., -0., -0.,  0., -0., -0., -0., -0., -0., -0., -0.,  0., -0.,
        -0.,  1., -0., -0., -0., -0., -0., -0., -0., -0., -0.,  0., -0.,
         4., -0., -0.,  4., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
         0., -0., -0., -0., -0., -0.,  0., -0., -0., -0.,  0., -0., -0.,
        -0., -0., -0., -0., -0.,  0., -0.,  1., -0., -0., -0., -0., -0.,
         4., -0., -0., -0., -0., -0.,  0., -0., -0., -0., -0., -0., -0.,
        -0., -0.,  4., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0.,
         1., -0., -0., -0., -0., -0., -0., -0., -0.,  4., -0.,  4., -0.,
        -0., -0., -0.,  0.,  4., -0.]),
 'y': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
 'eps': array([-0., -0., -0., -0.