# HW5

### practice Support Vector Machine

- training
    - `X_train.csv`
        - image
        - 5000 * 784(28 * 28)
    - `T_train.csv`
        - label
        - 5000 * 1
- testing
    - `X_test.csv`
        - image
        - 2500 * 784(28 * 28)
    - `T_test.csv`
        - label
        - 2500 * 1

In [179]:
import numpy as np
from matplotlib import pyplot as plt
import libsvm.python.svm as svm
import libsvm.python.svmutil as svmutil
import scipy
import ctypes
import pandas as pd

In [2]:
X_train = np.genfromtxt('X_train.csv', delimiter=',')
T_train = np.genfromtxt('T_train.csv', delimiter=',')
X_test = np.genfromtxt('X_test.csv', delimiter=',')
T_test = np.genfromtxt('T_test.csv', delimiter=',')

#### prepare data to fit the libsvm format

- in a data file (plain text, one sample per line)

```
label index:value index:value ...
label index:value index:value ...
```

- as Python objects (a list of labels and a list of `{index: value}` dicts)

```python
label = [1,2]
data = [{1:2,3:1},{3:2,10:1}]
```
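
To illustrate how the two representations relate, the sketch below converts one dense feature row into the sparse dict and into one line of the text format. The helper names `dense_to_sparse` and `to_libsvm_line` are made up for this example and are not part of libsvm.

```python
def dense_to_sparse(row):
    """Keep only non-zero entries, keyed by 1-based feature index."""
    return {i + 1: v for i, v in enumerate(row) if v != 0}


def to_libsvm_line(label, row):
    """Format one sample as 'label index:value index:value ...'."""
    sparse = dense_to_sparse(row)
    pairs = ' '.join('{}:{}'.format(i, v) for i, v in sorted(sparse.items()))
    return '{} {}'.format(label, pairs)


row = [0, 2, 0, 1]             # dense features; only features 2 and 4 are non-zero
sparse = dense_to_sparse(row)  # {2: 2, 4: 1}
line = to_libsvm_line(1, row)  # '1 2:2 4:1'
print(sparse)
print(line)
```

Zero entries are simply left out, which is why the image rows below shrink to the non-zero pixels only.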

In [44]:
# libsvm expects sparse rows: {1-based feature index: value}, non-zero entries only
y_train = list(T_train)
x_train = [{int(j) + 1: float(row[j]) for j in np.flatnonzero(row)} for row in X_train]
y_test = list(T_test)
x_test = [{int(j) + 1: float(row[j]) for j in np.flatnonzero(row)} for row in X_test]

### different kernel functions

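- linear

$K(u,v) = u^T v$

- polynomial

$K(u,v,c,d) = (u^T v + c)^d$

libsvm additionally scales the inner product by $\gamma$: $(\gamma u^T v + c)^d$, where $c$ is `coef0` and $d$ is `degree`.

- RBF

$K(u,v,\gamma) = \exp(-\gamma \lVert u - v \rVert^2)$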
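
The three kernels can be written out directly in NumPy, which helps when checking values by hand. This is only a sketch: the function names are hypothetical, and the optional `gamma` argument of the polynomial kernel is added here to mirror libsvm's form `(gamma*u'*v + coef0)^degree`.

```python
import numpy as np


def linear_kernel(u, v):
    return float(np.dot(u, v))


def polynomial_kernel(u, v, c=0.0, d=3, gamma=1.0):
    # gamma=1 gives the plain form (u^T v + c)^d
    return float((gamma * np.dot(u, v) + c) ** d)


def rbf_kernel(u, v, gamma):
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
k_lin = linear_kernel(u, v)                 # 1*3 + 2*4 = 11
k_poly = polynomial_kernel(u, v, c=1, d=2)  # (11 + 1)^2 = 144
k_rbf = rbf_kernel(u, v, gamma=0.5)         # exp(-0.5 * 8) = exp(-4)
print(k_lin, k_poly, k_rbf)
```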

#### libsvm options

```
options:
-s svm_type : set type of SVM (default 0)
	0 -- C-SVC
	1 -- nu-SVC
	2 -- one-class SVM
	3 -- epsilon-SVR
	4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
	0 -- linear: u'*v
	1 -- polynomial: (gamma*u'*v + coef0)^degree
	2 -- radial basis function: exp(-gamma*|u-v|^2)
	3 -- sigmoid: tanh(gamma*u'*v + coef0)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)
```
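
Since every option is a `-flag value` pair, the option string passed to `svm_train` can be assembled from keyword arguments. The helper `build_options` below is hypothetical (it is not part of libsvm) and is only meant to show the shape of the string.

```python
def build_options(**opts):
    """Turn keyword arguments into a libsvm option string.

    The keys are the option letters from the list above (t, c, g, ...).
    """
    return ' '.join('-{} {}'.format(key, value) for key, value in opts.items())


rbf_opts = build_options(t=2, c=100, g=0.01)
print(rbf_opts)  # -t 2 -c 100 -g 0.01
```

Keyword arguments keep their order in Python 3.7+, so the flags appear in the order they were passed.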

[LIBSVM Study (6): Code Structure and the C-SVC Procedure](https://blog.csdn.net/u014772862/article/details/51835192)

In [72]:
m = {}
p_label = {}
p_acc = {}
p_val = {}

In [86]:
parastrs = {
    'linear' : '-t 0',
    'polynomial d=3 c=0' : '-t 1 -d 3 -r 0',
    'polynomial d=3 c=1' : '-t 1 -d 3 -r 1',
    'polynomial d=10 c=0' : '-t 1 -d 10 -r 0',
    'polynomial d=10 c=1' : '-t 1 -d 10 -r 1',
    'RBF g=0.0013' : '-t 2 -g 0.0013',
    'RBF g=0.0033' : '-t 2 -g 0.0033',
    'RBF g=0.5' : '-t 2 -g 0.5',
    'RBF g=1' : '-t 2 -g 1',
}

In [87]:
%%time

for kernel_type, opts in parastrs.items():
    if kernel_type not in m:
        # first run: train, then predict (svm_predict prints the accuracy line itself)
        m[kernel_type] = svmutil.svm_train(y_train, x_train, opts)
        p_label[kernel_type], p_acc[kernel_type], p_val[kernel_type] = svmutil.svm_predict(y_test, x_test, m[kernel_type])
    else:
        # cached result: reprint the accuracy in the same format as svm_predict
        acc = p_acc[kernel_type][0]
        print('Accuracy = {:g}% ({:d}/{:d}) (classification)'.format(acc, round(len(y_test) * acc / 100), len(y_test)))
    # p_acc is (accuracy in %, mean squared error, squared correlation coefficient)
    print('kernel type : {} , acc : {}\n'.format(kernel_type, p_acc[kernel_type]))

Accuracy = 95.08% (2377/2500) (classification)
kernel type : linear , acc : (95.08, 0.1404, 0.931149802516624)

Accuracy = 34.68% (867/2500) (classification)
kernel type : polynomial d=3 c=0 , acc : (34.68, 2.6212, 0.14887572191533946)

Accuracy = 95.76% (2394/2500) (classification)
kernel type : polynomial d=3 c=1 , acc : (95.76, 0.1356, 0.9336123460070571)

Accuracy = 20.72% (518/2500) (classification)
kernel type : polynomial d=10 c=0 , acc : (20.72, 2.9884, 0.007062945744989033)

Accuracy = 97.32% (2432/2500) (classification)
kernel type : polynomial d=10 c=1 , acc : (97.32, 0.0924, 0.954375041275707)

Accuracy = 95.32% (2383/2500) (classification)
kernel type : RBF g=0.0013 , acc : (95.32000000000001, 0.1492, 0.9271864783823697)

Accuracy = 96.36% (2409/2500) (classification)
kernel type : RBF g=0.0033 , acc : (96.36, 0.1188, 0.9415603861168684)

Accuracy = 43% (1075/2500) (classification)
kernel type : RBF g=0.5 , acc : (43.0, 1.5768, 0.219471432111856)

Accuracy = 30.04% (751/25
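
Each `acc` tuple printed above holds three numbers. In libsvm's Python interface these are the classification accuracy in percent, the mean squared error, and the squared correlation coefficient (the last two matter mainly for regression). The sketch below shows how such a tuple can be computed from true and predicted labels. The helper name `evaluate` and the toy labels are assumptions for illustration.

```python
def evaluate(true_labels, pred_labels):
    """Return (accuracy in %, mean squared error, squared correlation coefficient)."""
    n = len(true_labels)
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    mse = sum((p - t) ** 2 for t, p in zip(true_labels, pred_labels)) / n

    sum_p = sum(pred_labels)
    sum_t = sum(true_labels)
    sum_pp = sum(p * p for p in pred_labels)
    sum_tt = sum(t * t for t in true_labels)
    sum_pt = sum(p * t for t, p in zip(true_labels, pred_labels))
    denom = (n * sum_pp - sum_p ** 2) * (n * sum_tt - sum_t ** 2)
    # the coefficient is undefined when either label list is constant
    scc = (n * sum_pt - sum_p * sum_t) ** 2 / denom if denom else float('nan')
    return 100.0 * correct / n, mse, scc


# toy labels: 3 of 4 predictions are correct, one is off by 1
acc, mse, scc = evaluate([0, 1, 2, 2], [0, 1, 1, 2])
print(acc, mse, scc)
```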

In [205]:
#help(svmutil.svm_train)

In [210]:
#help(svmutil.svm_problem)
prob = svmutil.svm_problem(y_train, x_train)

### use C-SVC

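Use grid search with n-fold cross-validation to choose the hyperparameters: every parameter combination is scored by its cross-validation accuracy on the training set.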

```
-v n: n-fold cross validation mode
```
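
As a conceptual sketch of what n-fold cross-validation does: the data is split into n folds, each fold is held out once for validation while the rest is used for training, and the accuracies are pooled. libsvm's `-v` mode handles the splitting internally and may partition the data differently. The function names and the toy majority-vote "classifier" below are assumptions for illustration only.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, nearly equal folds."""
    folds = []
    start = 0
    for k in range(n_folds):
        size = n_samples // n_folds + (1 if k < n_samples % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(train_fn, predict_fn, X, y, n_folds):
    """Pooled accuracy (in %) over all held-out folds."""
    correct = 0
    for val_idx in k_fold_indices(len(y), n_folds):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict_fn(model, [X[i] for i in val_idx])
        correct += sum(p == y[i] for p, i in zip(preds, val_idx))
    return 100.0 * correct / len(y)


# toy "classifier": always predict the most frequent training label
def train_majority(X_tr, y_tr):
    return max(set(y_tr), key=y_tr.count)


def predict_majority(model, X_val):
    return [model] * len(X_val)


folds = k_fold_indices(10, 3)
y_toy = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
cv_acc = cross_val_accuracy(train_majority, predict_majority, list(range(10)), y_toy, 3)
print(folds, cv_acc)
```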

In [211]:
import itertools

def GridSearchForSVM(kernel, parameter_matrix, n_fold=10):
    """Cross-validate every combination of the given libsvm options.

    parameter_matrix maps an option letter (e.g. 'c', 'g') to a list of values.
    Uses the global svm_problem `prob` built above.
    """
    opts = list(parameter_matrix.keys())
    optstr_init = '-t {:d} -v {:d} '.format(int(kernel), int(n_fold))
    results = []

    # Cartesian product of all value lists; the last option varies fastest
    for values in itertools.product(*(parameter_matrix[opt] for opt in opts)):
        # build the option string, e.g. '-t 2 -v 10 -c 1 -g 0.01 '
        optstr = optstr_init
        for optkey, value in zip(opts, values):
            optstr += '-' + str(optkey) + ' ' + str(value) + ' '

        # with -v, svm_train returns the cross-validation accuracy instead of a model
        result = list(values)
        result.append(optstr)
        result.append(svmutil.svm_train(prob, optstr))
        results.append(result)

    return results, opts

In [198]:
#results, options = GridSearchForSVM(2, {'c' : [1,2,3,4],'g' : [1,0.0033,0.0013]})

In [212]:
linear_results, linear_options = GridSearchForSVM(0, {'c' : [10**-5,10**-2,1,10**2,10**5]})

Cross Validation Accuracy = 79.44%
Cross Validation Accuracy = 97.04%
Cross Validation Accuracy = 96.1%
Cross Validation Accuracy = 96.14%
Cross Validation Accuracy = 96.34%


In [218]:
linear_table = pd.DataFrame(linear_results, columns=(linear_options + ['opt str', 'result']))

display(linear_table.sort_values(by=['result'], ascending=False))

|  | c | opt str | result |
| --- | --- | --- | --- |
| 1 | 0.01 | -t 0 -v 10 -c 0.01 | 97.04 |
| 4 | 100000.0 | -t 0 -v 10 -c 100000 | 96.34 |
| 3 | 100.0 | -t 0 -v 10 -c 100 | 96.14 |
| 2 | 1.0 | -t 0 -v 10 -c 1 | 96.1 |
| 0 | 1e-05 | -t 0 -v 10 -c 1e-05 | 79.44 |


In [220]:
linear_table.to_csv('linear_results.csv')

In [226]:
pd.read_csv('linear_results.csv', index_col=0)

|  | c | opt str | result |
| --- | --- | --- | --- |
| 0 | 1e-05 | -t 0 -v 10 -c 1e-05 | 79.44 |
| 1 | 0.01 | -t 0 -v 10 -c 0.01 | 97.04 |
| 2 | 1.0 | -t 0 -v 10 -c 1 | 96.1 |
| 3 | 100.0 | -t 0 -v 10 -c 100 | 96.14 |
| 4 | 100000.0 | -t 0 -v 10 -c 100000 | 96.34 |


In [227]:
poly_results, poly_options = GridSearchForSVM(1, {'c' : [10**-2,1,10**2],'r' : [0,1],'d' : [2,3,4,10]})

Cross Validation Accuracy = 45.2%
Cross Validation Accuracy = 28.12%
Cross Validation Accuracy = 23.44%
Cross Validation Accuracy = 20.52%
Cross Validation Accuracy = 78.26%
Cross Validation Accuracy = 81.14%
Cross Validation Accuracy = 85.8%
Cross Validation Accuracy = 92.46%
Cross Validation Accuracy = 87.06%
Cross Validation Accuracy = 33.64%
Cross Validation Accuracy = 23.52%
Cross Validation Accuracy = 20.48%
Cross Validation Accuracy = 96.42%
Cross Validation Accuracy = 96.84%
Cross Validation Accuracy = 96.98%
Cross Validation Accuracy = 97.76%
Cross Validation Accuracy = 97.76%
Cross Validation Accuracy = 93.36%
Cross Validation Accuracy = 68.02%
Cross Validation Accuracy = 20.52%
Cross Validation Accuracy = 97.04%
Cross Validation Accuracy = 97.28%
Cross Validation Accuracy = 97.52%
Cross Validation Accuracy = 97.82%


In [231]:
poly_table = pd.DataFrame(poly_results, columns=(poly_options + ['opt str', 'result']))


poly_table.to_csv('poly_results.csv')


display(poly_table.sort_values(by=['result'], ascending=False))

|  | c | r | d | opt str | result |
| --- | --- | --- | --- | --- | --- |
| 23 | 100.0 | 1 | 10 | -t 1 -v 10 -c 100 -r 1 -d 10 | 97.82 |
| 16 | 100.0 | 0 | 2 | -t 1 -v 10 -c 100 -r 0 -d 2 | 97.76 |
| 15 | 1.0 | 1 | 10 | -t 1 -v 10 -c 1 -r 1 -d 10 | 97.76 |
| 22 | 100.0 | 1 | 4 | -t 1 -v 10 -c 100 -r 1 -d 4 | 97.52 |
| 21 | 100.0 | 1 | 3 | -t 1 -v 10 -c 100 -r 1 -d 3 | 97.28 |
| 20 | 100.0 | 1 | 2 | -t 1 -v 10 -c 100 -r 1 -d 2 | 97.04 |
| 14 | 1.0 | 1 | 4 | -t 1 -v 10 -c 1 -r 1 -d 4 | 96.98 |
| 13 | 1.0 | 1 | 3 | -t 1 -v 10 -c 1 -r 1 -d 3 | 96.84 |
| 12 | 1.0 | 1 | 2 | -t 1 -v 10 -c 1 -r 1 -d 2 | 96.42 |
| 17 | 100.0 | 0 | 3 | -t 1 -v 10 -c 100 -r 0 -d 3 | 93.36 |


In [228]:
rbf_results, rbf_options = GridSearchForSVM(2, {'c' : [10**-2,1,10**2],'g' : [1/100,1/300,1/784]})

Cross Validation Accuracy = 92.74%
Cross Validation Accuracy = 87.12%
Cross Validation Accuracy = 81.14%
Cross Validation Accuracy = 97.8%
Cross Validation Accuracy = 97.3%
Cross Validation Accuracy = 96.42%
Cross Validation Accuracy = 98.38%
Cross Validation Accuracy = 97.86%
Cross Validation Accuracy = 97.56%


In [232]:
rbf_table = pd.DataFrame(rbf_results, columns=(rbf_options + ['opt str', 'result']))


rbf_table.to_csv('rbf_results.csv')


display(rbf_table.sort_values(by=['result'], ascending=False))

|  | c | g | opt str | result |
| --- | --- | --- | --- | --- |
| 6 | 100.0 | 0.01 | -t 2 -v 10 -c 100 -g 0.01 | 98.38 |
| 7 | 100.0 | 0.003333 | -t 2 -v 10 -c 100 -g 0.0033333333333333335 | 97.86 |
| 3 | 1.0 | 0.01 | -t 2 -v 10 -c 1 -g 0.01 | 97.8 |
| 8 | 100.0 | 0.001276 | -t 2 -v 10 -c 100 -g 0.0012755102040816326 | 97.56 |
| 4 | 1.0 | 0.003333 | -t 2 -v 10 -c 1 -g 0.0033333333333333335 | 97.3 |
| 5 | 1.0 | 0.001276 | -t 2 -v 10 -c 1 -g 0.0012755102040816326 | 96.42 |
| 0 | 0.01 | 0.01 | -t 2 -v 10 -c 0.01 -g 0.01 | 92.74 |
| 1 | 0.01 | 0.003333 | -t 2 -v 10 -c 0.01 -g 0.0033333333333333335 | 87.12 |
| 2 | 0.01 | 0.001276 | -t 2 -v 10 -c 0.01 -g 0.0012755102040816326 | 81.14 |
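
With the three grid searches done, the best setting of each kernel can be compared directly. The sketch below copies the top row of each sorted table by hand, picks the overall winner, and strips the `-v 10` flag so the remaining options could be used to train a final model. The variable names are illustrative.

```python
# best row of each sorted table above: (kernel, option string, CV accuracy in %)
best_rows = [
    ('linear', '-t 0 -v 10 -c 0.01', 97.04),
    ('polynomial', '-t 1 -v 10 -c 100 -r 1 -d 10', 97.82),
    ('RBF', '-t 2 -v 10 -c 100 -g 0.01', 98.38),
]

best_kernel, best_opts, best_acc = max(best_rows, key=lambda row: row[2])

# drop the cross-validation flag so that svm_train would return a model, not a score
final_opts = best_opts.replace('-v 10 ', '')
print(best_kernel, final_opts, best_acc)
```

The gap between the three winners is under 1.5 percentage points, so the ranking should be read with the cross-validation noise in mind.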


### linear + RBF kernel

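Combine the linear and RBF kernels by computing the kernel matrix ourselves and passing it to libsvm as a precomputed kernel (`-t 4`).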
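
A minimal sketch of this idea, assuming the combined kernel is the plain sum $K(u,v) = u^T v + \exp(-\gamma \lVert u - v \rVert^2)$. The function names, the toy data and the value of `gamma` are assumptions. The row layout follows libsvm's precomputed-kernel convention, where key 0 holds the 1-based sample id and keys 1..L hold the kernel values against the L training samples.

```python
import numpy as np


def linear_plus_rbf(A, B, gamma):
    """Gram matrix of K(a, b) = a^T b + exp(-gamma * ||a - b||^2)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    linear = A @ B.T
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a^T b
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * linear
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return linear + np.exp(-gamma * sq_dist)


def to_precomputed_rows(K):
    """One dict per sample: {0: sample id (1-based), 1: K[i][0], 2: K[i][1], ...}."""
    rows = []
    for i, kernel_row in enumerate(K):
        row = {0: i + 1}
        row.update({j + 1: float(k) for j, k in enumerate(kernel_row)})
        rows.append(row)
    return rows


# tiny stand-in data; in this notebook it would be X_train
A = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K_train = linear_plus_rbf(A, A, gamma=0.5)
rows = to_precomputed_rows(K_train)
print(K_train)

# assumed usage with libsvm (not run here):
#   model = svmutil.svm_train(y_train, rows, '-t 4')
#   test rows come from linear_plus_rbf(X_test, X_train, gamma), built the same way
```

Note that the test kernel matrix is computed against the training samples, so it has one column per training sample rather than per test sample.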

### report