# Sparse Support Vector Machines  
## A Novel Technique for Analytical Classification of Extremely Large Data

### Chris Bailey, 17Oct2018


Take a vector $\vec{w}$ that is perpendicular to the line (hyperplane) of maximal separation between the + and - samples.

For an unknown point $\vec{u}$, classify it as + if and only if
$$\vec{w}\cdot\vec{u} \ge c$$
or equivalently, with $b=-c$,
$$\vec{w}\cdot\vec{u} + b \ge 0$$

This is the decision rule.
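
As a minimal sketch of the decision rule, the function below labels a point by the sign of $\vec{w}\cdot\vec{u}+b$. The values of `w` and `b` are made up for illustration; they are not fitted from any data in these notes.

```r
# Decision rule: label u as "+" when w . u + b >= 0, otherwise "-".
classify <- function(w, b, u) {
  if (sum(w * u) + b >= 0) "+" else "-"
}

w <- c(0.5, 0.5)   # hypothetical normal vector
b <- -1            # hypothetical offset, b = -c

classify(w, b, c(3, 3))   # w . u + b =  2,   so "+"
classify(w, b, c(0, 1))   # w . u + b = -0.5, so "-"
```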

We do not yet know $\vec{w}$ or $b$; we only know that $\vec{w}$ is perpendicular to the median line of the "widest street" separating the samples.

To separate the classes with a margin, require
$$
\vec{w}\cdot\vec{x}_{+}+b \ge 1
$$
for the positive samples and, similarly, for the negative samples
$$
\vec{w}\cdot\vec{x}_{-}+b \le -1
$$


Introduce a label variable
$$
y_i = \begin{cases} +1 & \text{for positive samples} \\ -1 & \text{for negative samples} \end{cases}
$$
Multiplying the two previous inequalities by $y_i$: for the positive samples $y_i = 1$, so
$$
y_i ( \vec{w}\cdot\vec{x}_{+}+b ) \ge 1
$$
and for the negative samples $y_i = -1$, so the inequality reverses when multiplied by the negative number
$$
y_i ( \vec{w}\cdot\vec{x}_{-}+b ) \ge 1
$$
Both cases therefore reduce to the same inequality
$$
y_i(\vec{w}\cdot \vec{x}_i+b) \ge 1
$$
Bring the 1 over to the left side
$$
y_i(\vec{w}\cdot \vec{x}_i+b) -1 \ge 0
$$


For the $\vec{x}_i$ that lie exactly on the margins, i.e. the support vectors, the inequality becomes an equality, which gives an additional constraint
$$
y_i(\vec{w}\cdot \vec{x}_i+b) -1 = 0
$$
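
The constraint can be checked numerically. The sketch below uses a hypothetical four-point data set together with an assumed $\vec{w}$ and $b$ (chosen by hand, not produced by a solver) and reports which samples satisfy the equality, i.e. which are support vectors.

```r
# Hypothetical toy data: one sample per row, labels +1 / -1
X <- rbind(c(2, 2), c(3, 3), c(0, 0), c(-1, 0))
y <- c(1, 1, -1, -1)

# Assumed separating hyperplane (hand-picked for this data)
w <- c(0.5, 0.5)
b <- -1

# y_i (w . x_i + b) - 1 must be >= 0 for every sample
slack <- as.vector(y * (X %*% w + b) - 1)
feasible <- all(slack >= 0)

# Support vectors are the samples where the constraint holds with equality
support <- which(abs(slack) < 1e-12)
```

Here samples 1 and 3 sit exactly on the edges of the street, while the other two lie strictly outside it.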

The width of the street, using vectors $\vec{x}_{+}$ and $\vec{x}_{-}$ that lie on the margins, is their difference projected onto the unit normal
$$
\text{Width} = (\vec{x}_{+} - \vec{x}_{-}) \cdot \frac{\vec{w}}{||\vec{w}||}
$$
Now by rewriting the previous equation of
$$
y_i(\vec{w}\cdot \vec{x}_i+b) -1 = 0
$$
For the positive samples $\vec{x}_{+}$
$$
\begin{aligned}
y_i(\vec{w}\cdot \vec{x}_i+b) -1 &= 0 \\
y_i &= 1 \\
\vec{w}\cdot \vec{x}_i+b -1 &= 0 \\
\vec{x}_i \cdot \vec{w} &= 1 - b \\
\end{aligned}
$$
Similarly for the negative samples $\vec{x}_{-}$
$$
\begin{aligned}
y_i(\vec{w}\cdot \vec{x}_i+b) -1 &= 0 \\
y_i &= -1 \\
-\vec{w}\cdot \vec{x}_i-b -1 &= 0 \\
-\vec{x}_i \cdot \vec{w} &= 1 + b \\
\end{aligned}
$$
Substituting these into the width equation from the beginning
$$
\begin{aligned}
\text{Width} &= (\vec{x}_{+} - \vec{x}_{-}) \cdot \frac{\vec{w}}{||\vec{w}||} \\
\vec{x}_{+} \cdot \vec{w} &= 1-b \\
-\vec{x}_{-} \cdot \vec{w} &= 1 + b \\
\implies \text{Width} &= \frac{(1-b) + (1+b)}{||\vec{w}||} = \frac{2}{||\vec{w}||}
\end{aligned}
$$
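
A quick numeric check of this result, as a sketch with hand-picked values: the two margin points and $\vec{w}$ below are assumptions chosen so that $\vec{w}\cdot\vec{x}_{+}+b=1$ and $\vec{w}\cdot\vec{x}_{-}+b=-1$ with $b=-1$.

```r
# Assumed margin points and normal vector (hypothetical values)
x_plus  <- c(2, 2)
x_minus <- c(0, 0)
w <- c(0.5, 0.5)

w_norm <- sqrt(sum(w^2))

# Width as the projection of (x_plus - x_minus) onto the unit normal
width_projection <- sum((x_plus - x_minus) * w) / w_norm

# Width from the closed form 2 / ||w||
width_formula <- 2 / w_norm
```

Both expressions give the same number, which is also the Euclidean distance between the two points because they were chosen to lie along $\vec{w}$.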

To make the street as wide as possible, maximize the width
$$
\max \frac{2}{||\vec{w}||}
$$
equivalently
$$
\max \frac{1}{||\vec{w}||}
$$
equivalently
$$
\min {||\vec{w}||}
$$
equivalently, for mathematical convenience
$$
\min \frac{1}{2} {||\vec{w}||}^2
$$


Now we have a minimization problem subject to constraints, so introduce Lagrange multipliers $\alpha_i \ge 0$, one per sample, and find the extremum of
$$
L = \frac{1}{2} || \vec{w} ||^2 - \sum_i \alpha_i \left[ y_i ( \vec{w} \cdot \vec{x}_i + b ) - 1 \right]
$$
Taking the derivative (differentiating with respect to a vector works component by component, just like the scalar case), first with respect to $\vec{w}$
$$
\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_i \alpha_i y_i \vec{x}_i = 0 \implies \vec{w} = \sum_i \alpha_i y_i \vec{x}_i
$$
Then with respect to $b$
$$
\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \implies \sum_i \alpha_i y_i = 0  \quad \quad \text{note 2: used later}
$$


Now substitute the values from the partial derivatives back into $L$ (note: for the squared magnitude, use separate summation indices $i$ and $j$)
$$
L = \frac{1}{2} \left( \sum_i \alpha_i y_i \vec{x}_i \right) \cdot \left( \sum_j \alpha_j y_j \vec{x}_j \right) - \sum_i \alpha_i y_i \vec{x}_i \cdot \left( \sum_j \alpha_j y_j \vec{x}_j \right) - \sum_i \alpha_i y_i b + \sum_i \alpha_i
$$
Since $b$ is a constant, pull it out of the sum
$$
L = \frac{1}{2} \left( \sum_i \alpha_i y_i \vec{x}_i \right) \cdot \left( \sum_j \alpha_j y_j \vec{x}_j \right) - \sum_i \alpha_i y_i \vec{x}_i \cdot \left( \sum_j \alpha_j y_j \vec{x}_j \right) - b \sum_i \alpha_i y_i + \sum_i \alpha_i
$$
But from note 2 above the sum multiplied by $b$ is 0, so drop the term
$$
L = \frac{1}{2} \left( \sum_i \alpha_i y_i \vec{x}_i \right) \cdot \left( \sum_j \alpha_j y_j \vec{x}_j \right) - \sum_i \alpha_i y_i \vec{x}_i \cdot \left( \sum_j \alpha_j y_j \vec{x}_j \right) + \sum_i \alpha_i
$$
Since the first two terms contain the same dot product, combine their coefficients ($\frac{1}{2} - 1 = -\frac{1}{2}$)
$$
L = \sum_i \alpha_i - \frac{1}{2} \left( \sum_i \alpha_i y_i \vec{x}_i \right) \cdot \left( \sum_j \alpha_j y_j \vec{x}_j \right) 
$$
Combining the summation terms
$$
L = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \vec{x}_i \cdot \vec{x}_j
$$
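
The pieces of the derivation can be verified on a tiny example. The sketch below assumes a hypothetical two-point problem whose multipliers were worked out by hand ($\alpha_1=\alpha_2=0.25$); it recovers $\vec{w}$ from the multipliers, checks $\sum_i \alpha_i y_i = 0$, and compares the final expression for $L$ with $\frac{1}{2}||\vec{w}||^2$.

```r
# Hypothetical two-point problem with hand-derived multipliers
X <- rbind(c(2, 2), c(0, 0))   # one sample per row
y <- c(1, -1)
alpha <- c(0.25, 0.25)

# w = sum_i alpha_i y_i x_i  (row i of X is scaled by alpha_i * y_i)
w <- colSums(alpha * y * X)

# Condition from dL/db: sum_i alpha_i y_i = 0
balance <- sum(alpha * y)

# L = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j)
G <- X %*% t(X)                # matrix of all dot products x_i . x_j
dual <- sum(alpha) - 0.5 * sum(outer(alpha * y, alpha * y) * G)

# Original objective 1/2 ||w||^2 for comparison
primal <- 0.5 * sum(w^2)
```

At the optimum the two objective values agree (both are 0.25 here), and only the matrix of dot products `G` was needed to evaluate $L$.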

The optimization now depends on the samples only through the dot products $\vec{x}_i \cdot \vec{x}_j$.
Substituting $\vec{w}$ into the decision rule for an unknown $\vec{u}$
$$
\sum_i \alpha_i y_i \vec{x}_i \cdot \vec{u} + b \ge 0 \quad \quad \text{then estimate is }+
$$
so again only a dot product, $\vec{x}_i \cdot \vec{u}$, is needed.
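
A sketch of this form of the decision rule, written so that the samples enter only through dot products with $\vec{u}$. The data, multipliers and offset are the same hypothetical hand-derived values as in a two-point toy problem, not the output of a solver.

```r
# Decision rule in terms of the multipliers: sum_i alpha_i y_i (x_i . u) + b >= 0
decide <- function(u, X, y, alpha, b) {
  score <- sum(alpha * y * (X %*% u)) + b   # X %*% u holds every x_i . u
  if (score >= 0) "+" else "-"
}

# Hypothetical support vectors, labels, multipliers and offset
X <- rbind(c(2, 2), c(0, 0))
y <- c(1, -1)
alpha <- c(0.25, 0.25)
b <- -1

decide(c(3, 3), X, y, alpha, b)   # score =  2,   so "+"
decide(c(0, 1), X, y, alpha, b)   # score = -0.5, so "-"
```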


For an explanation of an iterative algorithm that converges to the solution of this quadratic problem, see the primal-dual interior-point method:

https://en.wikipedia.org/wiki/Interior-point_method#Primal-dual_interior-point_method_for_nonlinear_optimization

### But, need kernel trick

The derivation above only applies to linearly separable classes, where all the constraints can hold.

When no hyperplane separates the points in the original space, employ the kernel trick: find some transformation into another space in which the + points can be separated from the - points
$$
\Phi(\vec{x})
$$
The optimization and the decision rule then depend only on
$$
\Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j)
$$
So define the kernel
$$
K(x_i,x_j) = \Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j)
$$
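
To illustrate why only $K$ is needed, the sketch below assumes one particular transformation for two-dimensional inputs, $\Phi(\vec{x}) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$, and checks that its dot product equals the kernel $(\vec{u}\cdot\vec{v})^2$ computed directly in the original space. The map `phi` and the test vectors are illustrative choices, not something taken from the notes.

```r
# Assumed explicit feature map for 2-D inputs (illustrative choice)
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

# Kernel evaluated in the original space, without ever forming phi
K <- function(u, v) sum(u * v)^2

u <- c(1, 2)
v <- c(3, 4)

via_map    <- sum(phi(u) * phi(v))   # dot product in the transformed space
via_kernel <- K(u, v)                # same value from the kernel
```

Both routes give 121 for these vectors, so the transformed space never has to be constructed explicitly.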



#### Polynomial kernel (linear when $n=1$)
$$
(\vec{u}\cdot\vec{v})^n
$$

#### Radial Basis Function kernel
$$
e^{-\frac{||\vec{x}_i-\vec{x}_j||^2}{2\sigma^2}}
$$
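
A minimal sketch of building the full kernel matrix for a radial basis function kernel. It assumes the widely used Gaussian form $\exp(-||\vec{x}_i-\vec{x}_j||^2 / (2\sigma^2))$ with squared distance; the function name `rbf_kernel`, the data and the value of `sigma` are illustrative.

```r
# Gaussian RBF kernel matrix for the rows of X (assumed form, see text above)
rbf_kernel <- function(X, sigma) {
  d2 <- unname(as.matrix(dist(X)))^2   # squared Euclidean distances between rows
  exp(-d2 / (2 * sigma^2))
}

# Hypothetical samples, one per row
X <- rbind(c(0, 0), c(1, 0), c(0, 2))
K <- rbf_kernel(X, sigma = 1)
```

Every diagonal entry is 1 because each point is at distance 0 from itself, and entries shrink towards 0 as points move apart; `sigma` controls how quickly.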

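```r
# One-time setup, if needed:
# system('R CMD javareconf')
# install.packages('rJava')
# install.packages('Matrix')
library("Matrix")
library("rJava")

# Start the JVM with the SVM jar on the class path
# (alternative: .jaddClassPath("./target/svm-1.0-SNAPSHOT.jar") after .jinit())
classPath <- "./target/svm-1.0-SNAPSHOT.jar"
.jinit(classpath = classPath)

# Use Fisher's iris data to verify
mat <- sparse.model.matrix(Petal.Length ~ ., data = iris)

# Pull out the compressed sparse column slots of the sparse matrix
nRows <- nrow(mat)
nColumns <- ncol(mat)
rowIndex <- mat@i   # 0-based row index of each non-zero value
colBegin <- mat@p   # 0-based offset where each column starts in rowIndex/matValue
matValue <- mat@x   # non-zero values, stored column by column

obj <- "com.baileyteam.svm.BLAS"

# Call the static method kkt on the Java class; "V" means it returns void
.jcall(obj, returnSig = "V",  method = "kkt", nRows, nColumns, rowIndex, colBegin, matValue)
```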
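
The three slots passed to Java describe the matrix in compressed sparse column form. As a sketch of what they contain, the example below builds a small hypothetical 3 x 3 sparse matrix with the `Matrix` package and rebuilds the dense matrix from `@i`, `@p` and `@x` alone, which is roughly what a consumer on the Java side would have to do.

```r
library(Matrix)

# Hypothetical 3 x 3 matrix with three non-zero entries
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(10, 20, 30), dims = c(3, 3))

m@i   # 0-based row indices, column by column: 0 2 1
m@p   # 0-based column start offsets:          0 2 2 3
m@x   # non-zero values:                       10 20 30

# Rebuild the dense matrix from the three slots
dense <- matrix(0, nrow(m), ncol(m))
for (j in seq_len(ncol(m))) {
  n_in_col <- m@p[j + 1] - m@p[j]        # number of non-zeros in column j
  idx <- m@p[j] + seq_len(n_in_col)      # 1-based positions within @i and @x
  if (length(idx) > 0) {
    dense[m@i[idx] + 1, j] <- m@x[idx]   # +1 converts to R's 1-based rows
  }
}
```

Column 2 is empty, which shows up as two equal consecutive entries in `m@p`.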


## For class separation comparison: logistic with dichotomous outcome

Logistic regression models the odds of the positive class as
$$
\frac{P(y=1|x)}{P(y=0|x)}  =e^z=e^{w^Tx+b}
$$
so that $P(y=1|x)=\sigma(z)=\frac{1}{1+e^{-z}}$. To solve for $w$ and $b$, write the probability of each observed label $y_i \in \{0,1\}$
$$
P(y_i|x_i,w,b)=P(y=1|x_i)^{y_i} \times P(y=0|x_i)^{1-y_i} = (\sigma(w^Tx_i+b))^{y_i}\times (1-\sigma(w^Tx_i+b))^{1-y_i}
$$
with the likelihood
$$
\mathcal{L}(w,b)=\prod_{i=1}^n P(y=1|x_i)^{y_i} \times P(y=0|x_i)^{1-y_i}
$$
Taking the negative log (for numeric stability) gives the cross entropy function
$$
-\log \mathcal{L}(w,b) = -\sum_{i=1}^n \left[ y_i \log (\sigma(w^T x_i + b)) + (1-y_i) \log (1-\sigma(w^Tx_i + b)) \right]
$$
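
The cross entropy is straightforward to evaluate directly. The sketch below uses a hypothetical one-feature data set with 0/1 labels; the names `sigmoid` and `cross_entropy` are illustrative.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Negative log-likelihood for labels y in {0, 1}
cross_entropy <- function(w, b, X, y) {
  p <- sigmoid(as.vector(X %*% w + b))       # P(y = 1 | x_i) for every sample
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Hypothetical data: six samples, one feature
X <- matrix(c(-2, -1, 0, 1, 2, 3), ncol = 1)
y <- c(0, 0, 1, 0, 1, 1)

at_zero   <- cross_entropy(0, 0, X, y)       # every p is 0.5, so this is 6 * log(2)
at_better <- cross_entropy(0.8, -0.4, X, y)  # a parameter guess that fits better
```

With all parameters at zero each sample contributes $\log 2$; parameters that fit the labels better give a smaller value, which is what the minimization exploits.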
We need to find the model parameters $(w^*, b^*)$ that give the lowest cross entropy
$$
(w^*, b^*) = \arg \min_{(w,b)} -\log \mathcal{L}(w,b)
$$
Using Newton-Raphson iteratively, with the bias folded into a single vector: $\tilde{w}=(w^T, b)^T$, $\tilde{x}=(x^T, 1)^T$, $\sigma=(\sigma(z_1), \ldots, \sigma(z_n))^T$, $\tilde{X}$ the matrix whose rows are $\tilde{x}_1^T, \ldots, \tilde{x}_n^T$, and $E(\tilde{w}) = -\log \mathcal{L}$
$$
\tilde{w}^{(new)}=\tilde{w}^{(old)}-H^{-1}\nabla E(\tilde{w})
$$
The first order derivative
$$
\nabla E(\tilde{w}) = - \sum_{i=1}^n \left[ y_i \tilde{x}_i(1-\sigma(\tilde{w}^T\tilde{x}_i))-(1-y_i)\tilde{x}_i\sigma(\tilde{w}^T\tilde{x}_i) \right] = \sum_{i=1}^n (\sigma(\tilde{w}^T\tilde{x}_i) - y_i)\tilde{x}_i = \tilde{X}^T(\sigma - y)
$$
Taking the second order derivative of $E(\tilde{w})$
$$
H=\nabla \nabla E(\tilde{w}) = \sum_{i=1}^n \sigma (\tilde{w}^T\tilde{x}_i)(1-\sigma (\tilde{w}^T\tilde{x}_i))\tilde{x}_i \tilde{x}_i^T = \tilde{X}^TR \tilde{X}
$$
where $R$ is a diagonal matrix with $R_{ii}=\sigma_i(1-\sigma_i)$, obtaining the update equation
$$
\tilde{w}^{(k+1)}=\tilde{w}^{(k)}-(\tilde{X}^TR_k\tilde{X})^{-1}\tilde{X}^T(\sigma_k-y)
$$
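
The update equation translates almost line by line into R. The sketch below runs it on a hypothetical, deliberately non-separable one-feature data set (so that a finite optimum exists) and uses base R's `glm()` only as an independent reference; the fixed count of 25 iterations is an arbitrary choice rather than a tuned stopping rule.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical data: the classes overlap, so the optimum is finite
x <- c(-2, -1, 0, 1, 2, 3)
y <- c(0, 0, 1, 0, 1, 1)

X <- cbind(x, 1)   # augmented design matrix, rows are (x_i, 1)
w <- c(0, 0)       # augmented parameters (w, b), started at zero

for (k in 1:25) {
  s <- sigmoid(as.vector(X %*% w))      # sigma_k
  R <- diag(s * (1 - s))                # R_k
  H <- t(X) %*% R %*% X                 # Hessian X^T R X
  grad <- t(X) %*% (s - y)              # gradient X^T (sigma - y)
  w <- w - as.vector(solve(H, grad))    # Newton-Raphson update
}

# Reference fit; coef() returns (intercept, slope), i.e. (b, w)
fit <- glm(y ~ x, family = binomial)
reference <- unname(coef(fit))
```

`solve(H, grad)` is used instead of forming $H^{-1}$ explicitly, which is the usual numerically safer way to apply the update.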

Note that LDA and QDA can also outperform logistic regression when the inputs match their assumptions, e.g. approximately Gaussian features within each class.
