# Classification

This notebook shows how to use `ortho-svm` for classification tasks with **custom kernels** in Support Vector Machines. Although `scikit-learn` already provides an API for custom kernels, `ortho-svm` simplifies the task by shipping several predefined kernels, such as:

- **Hermite kernel:** A special kernel defined by Hermite polynomials.
- **Gegenbauer kernel:** A family of kernels defined by a parameter $\alpha$ that generalizes several classical orthogonal polynomials, e.g. Legendre and Chebyshev.
- **Chebyshev kernel:** A particular instance of the Gegenbauer kernel with $\alpha = 0$, built on the Chebyshev polynomials of the _first kind_.
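
To make the role of $\alpha$ concrete, here is a minimal sketch of the standard three-term recurrence for Gegenbauer polynomials. The function name `gegenbauer` is hypothetical and is not part of `ortho-svm`; it only illustrates how particular values of $\alpha$ recover familiar families. With this normalization the recurrence yields zero for $n \geq 1$ when $\alpha = 0$, so the first-kind Chebyshev case needs a separate definition or a limit; the sketch therefore checks $\alpha = 1/2$ (Legendre) and $\alpha = 1$ (Chebyshev of the second kind) instead.

```python
import numpy as np


def gegenbauer(n, alpha, x):
    """Evaluate the Gegenbauer polynomial C_n^(alpha)(x) by recurrence.

    Hypothetical helper for illustration only (not the ortho-svm API).
    """
    x = np.asarray(x, dtype=float)
    c_prev = np.ones_like(x)  # C_0(x) = 1
    if n == 0:
        return c_prev
    c_curr = 2.0 * alpha * x  # C_1(x) = 2 * alpha * x
    for k in range(2, n + 1):
        # k C_k = 2x (k + alpha - 1) C_{k-1} - (k + 2 alpha - 2) C_{k-2}
        c_next = (
            2.0 * x * (k + alpha - 1.0) * c_curr
            - (k + 2.0 * alpha - 2.0) * c_prev
        ) / k
        c_prev, c_curr = c_curr, c_next
    return c_curr


x = np.linspace(-1.0, 1.0, 5)
legendre_2 = 0.5 * (3.0 * x**2 - 1.0)  # P_2(x)
chebyshev_u2 = 4.0 * x**2 - 1.0        # U_2(x), second kind

print(np.allclose(gegenbauer(2, 0.5, x), legendre_2))
print(np.allclose(gegenbauer(2, 1.0, x), chebyshev_u2))
```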

This document works through a **classification** task and compares the results obtained with those reported in the literature [1].

## 1. Setup

We will be performing classification for the first MONK's dataset [2] (monks-1) using the Gegenbauer kernel defined in `ortho-svm`.

The **monks-1** dataset has the following properties:

- 432 instances
- 6 attributes
- 2 classes

The classification problem is quite simple, so we expect to get results similar to those reported. The main reference [1] gives the following hyper-parameters for the Gegenbauer kernel:

- $C = 20.96 \pm 3.4$
- $\alpha = 0.38 \pm 0.0$
- $n = 2$

These are the best values reported, and they will be treated as the **ground truth** for the results obtained.

## 2. Methodology

We will follow these steps to obtain results similar to those in the literature:

1. Pre-process the data: load it, clean it, and create a splitter for the following steps.
2. Define a 5-fold cross-validation to obtain the best $C$ parameter; the special parameters $n$ and $\alpha$ are taken as given from the ground truth.
3. Using the _training_ dataset, run the 5-fold cross-validation to find the best value of $C$ while recording the **classification accuracy**.
4. Finally, using all the hyper-parameters, test the trained Support Vector Machine on the _test_ dataset to evaluate the accuracy of the model.
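
The steps above can be sketched end to end with plain `scikit-learn`. This is only an outline under stated assumptions: the data is synthetic (generated with `make_classification`), the built-in polynomial kernel of degree 2 stands in for the Gegenbauer kernel, and the three-value grid for $C$ is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Step 1: synthetic stand-in data (assumption: not the MONK's dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: 5-fold cross-validation over C, with the kernel parameters fixed
search = GridSearchCV(
    SVC(kernel="poly", degree=2),  # stand-in for the Gegenbauer kernel
    {"C": [1, 10, 100]},           # arbitrary illustrative grid
    scoring="accuracy",
    cv=5,
)

# Step 3: fit on the training split and inspect the cross-validated accuracy
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)

# Step 4: evaluate the refitted best estimator on the held-out test split
print(search.score(X_test, y_test))
```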

### 2.1 Pre-processing

In [1]:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from orthosvm.gramian import gram

First, we load the dataset, which is stored in the same directory as this document, and split it into _training_ and _test_ sets.

In [2]:
# Load the data, skipping the header row; the last column holds the
# class labels and the remaining columns hold the features
# raw_data = np.loadtxt("monks-1.csv", skiprows=1, delimiter=",")
raw_data = np.loadtxt("breast.csv", skiprows=1, delimiter=",")

# From this, we extract the labels
y = raw_data[:, -1]
# We now extract the features
xdata = raw_data[:, :-1]

# Check number of instances and features
print(xdata.shape)

# Hold out 20% of the data as the test set (fixed seed for reproducibility)
xtrain, xtest, ytrain, ytest = train_test_split(xdata, y, test_size=0.2, random_state=56)

(683, 10)


### 2.2 Perform a 5-fold cross validation

We now take the cleaned-up data and perform a grid search for the best possible value for $C$.

In [3]:
# First, we need to create our estimator, which is a special SVM for classification

# We define the function that will compute the Gram matrix
# and we specify the hyper-parameters
gram_matrix = gram.gram_matrix(kernel="hermite", degree=8)

# Next, we pass a dictionary of parameters, in this case the kernel, to the SVC
# estimator and we create an object with it
params = {"kernel": gram_matrix, "C": 8.0}
# params = {"kernel": "rbf", "C": 19.34}
svc = SVC(**params)
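
`SVC` accepts a callable as its `kernel` argument: the callable receives two sample arrays and must return their Gram matrix, with one row per sample of the first array and one column per sample of the second. The sketch below illustrates that contract with a hand-written quadratic kernel on synthetic data. The function `quadratic_kernel` and the data are assumptions made for illustration; they are not the `ortho-svm` implementation.

```python
import numpy as np
from sklearn.svm import SVC


def quadratic_kernel(X, Y):
    """Return the Gram matrix K[i, j] = (<X[i], Y[j]> + 1)^2."""
    return (X @ Y.T + 1.0) ** 2


# Synthetic data: points labelled by whether they fall outside the unit circle
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X**2, axis=1) > 1.0).astype(int)

# The callable is passed exactly like a built-in kernel name
clf = SVC(kernel=quadratic_kernel, C=10.0)
clf.fit(X, y)

train_accuracy = clf.score(X, y)
print(train_accuracy)
```

A quadratic kernel suits this toy problem because a circular boundary is linear in the squared features.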

In [4]:
# Now we proceed to create the GridSearchCV

# Create the grid to search
c_grid = [{"C": list(range(10, 101))}]
# Instantiate the grid search object (scoring must be passed by keyword)
# gcv = GridSearchCV(svc, c_grid, scoring="accuracy", n_jobs=-1)

### 2.3 Train the model

We now train the model with the setup built so far. After running `GridSearchCV`, `scikit-learn` exposes the best estimator found for the data.

**WARNING:** This might take a while. If you have processing resources to spare, set the `n_jobs` parameter so the search runs in parallel.

In [5]:
# We proceed to fit the model to our data; with the grid search disabled,
# the SVC is fit directly using the fixed value of C defined above
# gcv.fit(xdata, y)
svc.fit(xtrain, ytrain)

SVC(C=8.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale',
    kernel=<function gram_matrix.<locals>.compute_gram_matrix at 0x7f18451a8050>,
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [6]:
# Now let us compare the best value found for C
# gcv.best_params_

So we are somewhat far from the estimated value for $C$ given in the main reference, but this will suffice for now. We can then use the best estimator found and try out the _test_ dataset with it.

### 2.4 Test the model

We now test the model: we first load the testing dataset and then compute the classification accuracy with the trained model.

In [7]:
# We do the same as before to load the information, we can even re-use the same variables
# raw_data = np.loadtxt("monks-1_test.csv", skiprows=1, delimiter=",")

# Again, we extract the labels
# y = raw_data[:, -1]
# y[y == 0] = -1

# We now extract the features
# xtest = raw_data[:, :-1]

# Let us now test our model
# y_predict = gcv.predict(xtest)
# And print out a classification report
# print(classification_report(ytest, y_predict))
print(svc.score(xtest, ytest))

0.6496350364963503
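
When reading a single accuracy figure like the one above, it can help to compare it with a trivial baseline that always predicts the most frequent class. The sketch below shows one way to do so with `DummyClassifier` and `classification_report`. The data here is synthetic and the class proportion is an arbitrary assumption, so the numbers it prints say nothing about the dataset used in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

# Synthetic, imbalanced labels (assumption: roughly 35% positives)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.35).astype(int)

# Baseline that ignores the features and predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_baseline = baseline.predict(X)

baseline_acc = accuracy_score(y, y_baseline)
print(baseline_acc)

# Per-class precision and recall expose what plain accuracy hides
print(classification_report(y, y_baseline, zero_division=0))
```

A trained model is only informative if its test accuracy clearly exceeds this baseline.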


## References

1. Luis Carlos Padierna, Martín Carpio, Alfonso Rojas-Domínguez, Héctor Puga, Héctor Fraire. A novel formulation of orthogonal polynomial kernel functions for SVM classifiers: The Gegenbauer family. Pattern Recognition, Volume 84, 2018, Pages 211-225. ISSN 0031-3203. https://doi.org/10.1016/j.patcog.2018.07.010 (http://www.sciencedirect.com/science/article/pii/S0031320318302280)

2. MONK's dataset: https://archive.ics.uci.edu/ml/datasets/MONK%27s+Problems