This is a C++11 implementation of the Convex Polytope Machine algorithm as presented in
Kantchelian, A., Tschantz, M. C., Huang, L., Bartlett, P. L., Joseph, A. D., & Tygar, J. D. Large-Margin Convex Polytope Machine.
In addition to the command line tool, Python bindings which are fully aware of numpy arrays and scipy sparse matrices are provided.
Building the code
The CPM can be invoked from the command line or directly from a Python interpreter. Building both should be painless.
Building the command line tool
$ make cmdapp
will create the
Building the python module
The Python module is built and installed using the distutils tools, which are already included in the standard library. However, the Python module itself requires numpy (and the numpy headers) and scipy, so make sure these are installed. To build the extension, run:
$ python setup.py build
This puts all necessary files in
From this directory, you should be able to launch a Python interpreter and
successfully import the module. For example:
$ cd build/lib.macosx-10.9-x86_64-2.7
$ ls
_cpm.so cpm.py
$ python
Python 2.7.6 (default, Nov 25 2013, 16:54:21)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import cpm
>>>
If you are particularly fond of it, you can install the cpm extension module by running
$ python setup.py install
which will make it importable from anywhere. Alternatively, you can just copy the build/lib.* files wherever they are needed.
Building the python wrapper
Unless you are planning to extend the Python module's features yourself, this part is irrelevant to you.
You need a working installation of SWIG. Running
$ make wrapper
cpm.py in the src directory.
The python module essentially provides two classes:
CPM essentially behaves like a scikit learner, with
Dataset takes care of reading libSVM files from disk (fast!) and/or
translating your dense numpy data or sparse matrices into the memory
layout CPM likes. Most of the methods come with
easy reference. Here's an example usage:
>>> import numpy as np, scipy.sparse as sp
>>> import cpm
>>> X = [[0, 1, 0], [1, 0, 0]]  # two instances with 3 features
>>> Y = [0, 1]  # corresponding labels
>>> Xnumpy = np.array(X, dtype=np.float32)  # notice the type spec
>>> Xsparse = sp.csr_matrix(X, dtype=np.float32)  # notice the type spec again
>>> Ynumpy = np.array(Y, dtype=np.int32)  # everything is 32 bits.
>>> trainset_1 = cpm.Dataset(X, Y)  # this works
>>> trainset_2 = cpm.Dataset(Xnumpy, Ynumpy)  # this is fine too
>>> trainset_3 = cpm.Dataset(Xsparse, Ynumpy)  # and this as well
>>> clf = cpm.CPM(10)  # a CPM with 10 sub-classifiers and default meta-parameters values
>>> clf.fit(trainset_1, 100)  # train model on 100 SGD steps
>>> scores, assignments = clf.predict(trainset_2)  # returns predicted scores and assigned sub-classifiers
>>> scores
array([-1.11855221, 1.37075484], dtype=float32)
>>> assignments
array([0, 0], dtype=int32)
>>> clf.serializeModel('my_model.cpm')  # you can save the current model to disk
>>> clf_1 = cpm.CPM.deserializeModel('my_model.cpm')  # and load it again later
The main gotcha is that data arrays should be 32-bit floats and label arrays
32-bit integers (or smaller). You can accomplish this easily by the
method of numpy/scipy.sparse objects. In any case, it is a good idea to
use 32-bit floats by default when working with large datasets. At the moment,
only CSR sparse matrices are supported.
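As a minimal sketch of the type requirements above, the following helper casts arbitrary input to 32-bit floats, 32-bit integer labels, and CSR layout before it is handed to cpm.Dataset. The helper name to_cpm_types is hypothetical and not part of the cpm module; it relies only on standard numpy and scipy calls (asarray, astype, tocsr, issparse).

```python
import numpy as np
import scipy.sparse as sp


def to_cpm_types(X, Y):
    """Hypothetical helper (not part of cpm): cast features to float32
    (dense ndarray, or CSR if sparse) and labels to int32."""
    if sp.issparse(X):
        # tocsr() converts any scipy sparse format to CSR;
        # astype() then fixes the element type.
        X32 = X.tocsr().astype(np.float32)
    else:
        X32 = np.asarray(X, dtype=np.float32)
    Y32 = np.asarray(Y, dtype=np.int32)
    return X32, Y32


X_dense = np.array([[0, 1, 0], [1, 0, 0]])        # integer dtype by default
X_coo = sp.coo_matrix(X_dense, dtype=np.float64)  # wrong format and wrong dtype
Y = [0, 1]

Xd, Yd = to_cpm_types(X_dense, Y)  # dense float32 array, int32 labels
Xs, Ys = to_cpm_types(X_coo, Y)    # CSR float32 matrix, int32 labels
```

The converted pairs (Xd, Yd) and (Xs, Ys) have the same types as Xnumpy, Xsparse and Ynumpy in the example session, so they should be usable with cpm.Dataset in the same way.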
The wrapper also exposes
parallelFitPredict(), a multithreaded method
which trains multiple models with arbitrary parameters and outputs their
predictions. Refer to the docstring for how to use it.
The command line interface follows Sofia-ml in spirit.
$ ./cpm -h
Perform CPM training and/or inference.
--quiet -q  be quiet. Default: False
--reshuffle  shuffle training set between epochs. Default: False
--classifiers -k <int>  number of classifiers. Default: 1
--outer_label <int>  outer class label (the class that will be decomposed). Default: 1
--iterations -i <int>  number of iterations. Default: 50000000
--C -C <float>  C regularization factor. Default: 1
--cost_ratio <float>  cost ratio of negatives vs positives. Default: 1
--entropy <float>  minimal (exp of) entropy to maintain in heuristic max. Value between 1 and k. Default: 1
--seed <unsigned long>  random seed (for reproducibility).
--train -t <string>  train data file.
--test -c <string>  test data file.
--model_in -m <string>  model in file. Will be ignored if in training mode.
--model_out -o <string>  model out file.
--scores -s <string>  scores file.
For instance, to train a model with 10 subclassifiers for 1,000,000 iterations
from the libsvm-formatted file
train.libsvm, save the model in
model.txt and output the scores
of the model on
$ ./cpm -k 10 -i 1000000 -t train.libsvm -c test.libsvm -o model.txt -s scores.txt
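If your data is not yet in libsvm format, a file such as train.libsvm can be produced with a few lines of Python. The sketch below writes the usual libsvm text layout, one instance per line as "label index:value ...", with zero-valued features omitted. The function name write_libsvm is hypothetical, and the 1-based feature indexing and integer labels are assumptions based on the common libsvm convention; check them against what the CPM reader expects.

```python
import os
import tempfile


def write_libsvm(path, rows, labels, index_base=1):
    """Hypothetical helper: write dense rows as libsvm text.

    index_base=1 follows the usual libsvm convention (an assumption here);
    labels are written as integers (also an assumption).
    """
    with open(path, "w") as f:
        for row, label in zip(rows, labels):
            # Keep only non-zero features, formatted as index:value.
            pairs = ["%d:%g" % (j + index_base, v)
                     for j, v in enumerate(row) if v != 0]
            f.write(" ".join([str(int(label))] + pairs) + "\n")


X = [[0, 1, 0], [1, 0, 0.5]]  # two instances with 3 features
Y = [0, 1]                    # corresponding labels

path = os.path.join(tempfile.mkdtemp(), "train.libsvm")
write_libsvm(path, X, Y)

with open(path) as f:
    lines = f.read().splitlines()
# lines is now ["0 2:1", "1 1:1 3:0.5"]
```

The resulting file can then be passed to the command line tool through the -t (or -c) option shown above.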