# Support Vector Machines

#### Jessica Morrise

In [11]:
import numpy as np
import cvxopt
from sklearn.cross_validation import train_test_split
import pandas

Our data set contains measurements of cells and labels of cancerous (1) or benign (-1). Before we can enter the exciting world of Suport Vector Machines, we need to read in the data and clean it a little bit. Luckily, Pandas is great at this.

In [58]:
# Load data and clean it with Pandas
cancer_data = pandas.read_csv('cancer.csv',index_col='index')
def clean(d):
    try:
        return int(d)
    except:
        return -9999
# Drop rows with non-numeric values
cancer_data['bare-nuclei'] = cancer_data['bare-nuclei'].apply(clean)
cancer_data.drop(cancer_data.index[cancer_data['bare-nuclei']==-9999],inplace=True)

target = cancer_data['cancerous'].values
data = cancer_data.drop('cancerous',axis=1).values

Below is our SVM class. There are three possible kernels: sigmoid, polynomial, and Gaussian (RBF). 

In [244]:
class SVM_classifier(object):
    
    def __init__(self, train_x, train_y, kernel='rbf',a=20.,d=3.,r=1.,gamma=.0005):
        self.X = train_x.astype(float)
        self.Y = train_y.astype(float)
        self.a = a
        self.d = d
        self.r = r
        self.g = gamma
        self.set_kernel(kernel)
        
    def set_kernel(self, kernel):
        if kernel not in ['sigmoid', 'polynomial', 'rbf']:
            raise ValueError('Invalid kernel type')
        elif kernel == 'sigmoid':
            self.k = lambda x,y: np.tanh(np.inner(x,y) + self.r)
        elif kernel == 'polynomial':
            self.k = lambda x,y: (np.inner(x,y) + self.a)**self.d
        elif kernel == 'rbf':
            self.k = lambda x,y: np.exp(-self.g*np.linalg.norm(x-y)**2)
        
    
    def train(self):
        n_samples = self.X.shape[0]
        K = np.zeros((n_samples,n_samples))
        for i in xrange(n_samples):
            for j in xrange(n_samples):
                K[i,j] = self.k(self.X[i,:],self.X[j,:])
        Q = cvxopt.matrix(np.outer(self.Y,self.Y)*K)
        q = cvxopt.matrix(-1*np.ones(n_samples))
        A = cvxopt.matrix(self.Y, (1,n_samples))
        print np.shape(A)
        b = cvxopt.matrix(0.0)
        G = cvxopt.matrix(np.diag(np.ones(n_samples)*-1))
        h = cvxopt.matrix(np.zeros((n_samples,1)))
        
        solution = cvxopt.solvers.qp(Q, q, G, h, A, b)
        self.coeffs = np.ravel(solution['x'])
        
    def classify(self,x):
        val = np.sum(self.coeffs*self.Y*self.k(self.X,x))
        if val == 0:
            return 1.
        else:
            return val/np.abs(val)
    
    def classify_many(self,samples,true_labels = None):
        predictions = np.zeros(samples.shape[0])
        for i,x in enumerate(samples):
            predictions[i] = self.classify(x)
        if true_labels is None:
            accuracy = None
        else:
            accuracy = np.sum(np.equal(predictions,true_labels))/float(samples.shape[0])
        return predictions, accuracy

### Results

After splitting the data into a training and test set, three models implementing each of the three kernels are tested below. These were found by experimentation to be the optimal values for the parameters:

* Sigmoid kernel: $r = 1$
* Polynomial kernel: $a = 20, d = 3$
* RBF kernel: $\gamma = 0.0005$

In the results below, each of the models got between 60% and 75% accuracy. In previous runs, however, the models varied all the way down to 30% accuracy (in other words, much worse than guessing). The sigmoid kernel also fails to converge every time for an unknown reason.

In [283]:
train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.2)

In [291]:
model_1 = SVM_classifier(train_x, train_y, 'sigmoid')
model_1.train()
preds_1, accuracy_1 = model_1.classify_many(test_x,test_y)

(1, 545)
     pcost       dcost       gap    pres   dres
 0: -4.9259e+02 -1.1379e+03  6e+02  3e-14  2e+00
 1: -1.0759e+03 -1.0900e+03  1e+01  3e-14  1e+00
 2: -8.3433e+04 -8.3444e+04  1e+01  6e-13  1e+00
 3: -6.3064e+08 -6.3064e+08  8e+02  6e-09  1e+00
 4: -1.9309e+12 -1.9309e+12  3e+06  2e-05  1e+00
Terminated (singular KKT matrix).


In [280]:
model_2 = SVM_classifier(train_x, train_y, 'polynomial')
model_2.train()
preds_2, accuracy_2 = model_2.classify_many(test_x,test_y)

(1, 545)
     pcost       dcost       gap    pres   dres
 0: -1.4624e+01 -3.1476e+01  2e+03  3e+01  2e+00
 1: -1.6596e+01 -9.0187e+00  4e+02  8e+00  5e-01
 2: -6.1050e-01 -4.3595e-01  2e+01  4e-01  3e-02
 3: -1.3418e-01 -1.2436e-01  4e+00  6e-02  4e-03
 4: -3.7272e-02 -3.8174e-02  7e-01  1e-02  7e-04
 5: -1.3537e-02 -1.6542e-02  2e-01  3e-03  2e-04
 6: -6.9264e-03 -7.9409e-03  6e-02  8e-04  5e-05
 7: -3.3781e-03 -4.0999e-03  2e-02  2e-04  2e-05
 8: -1.8618e-03 -2.5064e-03  8e-03  7e-05  5e-06
 9: -1.2756e-03 -1.9678e-03  4e-03  3e-05  2e-06
10: -1.0862e-03 -1.7536e-03  3e-03  1e-05  1e-06
11: -9.9165e-04 -1.4341e-03  1e-03  5e-06  3e-07
12: -1.0245e-03 -1.2737e-03  4e-04  1e-06  8e-08
13: -1.0850e-03 -1.2170e-03  2e-04  4e-07  2e-08
14: -1.1426e-03 -1.1907e-03  5e-05  3e-19  2e-12
15: -1.1756e-03 -1.1845e-03  9e-06  2e-19  1e-12
16: -1.1827e-03 -1.1833e-03  6e-07  1e-19  1e-12
17: -1.1832e-03 -1.1832e-03  6e-09  3e-19  2e-12
Optimal solution found.


In [281]:
model_3 = SVM_classifier(train_x, train_y, 'rbf')
model_3.train()
preds_3, accuracy_3 = model_3.classify_many(test_x,test_y)

(1, 545)
     pcost       dcost       gap    pres   dres
 0: -8.5062e+01 -2.6567e+02  2e+03  4e+01  3e+00
 1: -2.2059e+02 -5.5200e+02  2e+03  3e+01  2e+00
 2: -7.7608e+02 -1.6448e+03  2e+03  3e+01  2e+00
 3: -2.4325e+03 -3.8718e+03  2e+03  2e+01  1e+00
 4: -4.4537e+03 -6.4140e+03  2e+03  2e+01  1e+00
 5: -1.0191e+04 -1.3053e+04  3e+03  2e+01  1e+00
 6: -1.9389e+04 -2.3556e+04  4e+03  1e+01  1e+00
 7: -3.8574e+04 -4.5272e+04  7e+03  1e+01  1e+00
 8: -8.5092e+04 -9.8637e+04  1e+04  1e+01  1e+00
 9: -1.5647e+05 -1.8104e+05  2e+04  1e+01  1e+00
10: -2.8161e+05 -3.3195e+05  5e+04  1e+01  1e+00
11: -4.9211e+05 -6.0568e+05  1e+05  1e+01  8e-01
12: -7.6147e+05 -9.4816e+05  2e+05  6e+00  4e-01
13: -8.4183e+05 -8.7250e+05  3e+04  3e-01  3e-02
14: -8.4613e+05 -8.5076e+05  5e+03  4e-02  3e-03
15: -8.4700e+05 -8.4740e+05  4e+02  4e-04  3e-05
16: -8.4732e+05 -8.4732e+05  6e+00  5e-06  4e-07
17: -8.4732e+05 -8.4732e+05  6e-02  5e-08  4e-09
Optimal solution found.


In [282]:
print 'Sigmoid:',accuracy_1
print 'Polynomial:',accuracy_2
print 'RBF:',accuracy_3

Sigmoid: 0.642335766423
Polynomial: 0.773722627737
RBF: 0.63503649635
