#Support Vector Machines

1. Support Vector Machines: Introduction & Motivation
1. SVM Concepts
1. Separable vs. Nonseparable solutions
1. Kernels: Linear SVMs vs. Nonlinear SVMs
1. Multiclass SVMs

##By the End of this Lecture You Will:
1. Know how to write the hypothesis and cost functions of an SVM
1. Be able to write the primal and dual forms of the cost functions and know why we use one over the other
1. Know what a Soft Margin is
1. Know what a Kernel is
1. Know the difference between Separable and Non-separable solutions for SVMs
1. Understand the difference between linear and non-linear kernel functions and why we use them.

##Support Vector Machines: Introduction & Motivation

Support Vector Machines (SVMs) are among the most important tools you will learn about in your study of machine learning. Although there are cases wherein SVMs are not necessarily the best choice, they can do many things that other models do. In some cases, SVMs are superior in performance to more sophisticated tools. 

The SVM is by definition a *Classifier* and discriminative model. At the most basic level, it is a **binary** classifier, meaning it distinguishes between two classes; think (+) or (-). More advanced SVMs can discriminate between many classes.

Because of the intellectual importance of SVMs to a foundation in machine learning, we shall motivate and discuss them in some detail. 


###SVMs: Motivation

In your previous classwork, you observed the use of Lasso regression against an essentially binary dataset. Other forms of regularized regression reduced the emphasis of the regression on the x-value, but Lasso eliminated dependence on the x-value completely. Suppose that, instead of fitting a regression to predict the next point, you were more interested in finding an **optimal boundary of discrimination** between two classes within the sample space.

We shall define this boundary as follows:

**Given two classes, we seek a hyperplane that maximizes the distance between itself and the nearest training examples of either class. This means that the hyperplane is placed to maximize the margin between itself and both classes.**

![linear](images/plot_iris_exercise_000_2.png)

###Definition of a hyperplane

Instead of projecting output variables into the feature space as we normally do, we are going to define the general equation of a (hyper)plane in the p-dimensional feature space (in two dimensions a plane is a line):

$$\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p = z$$

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} = z$$

Our plane is going to divide the two classes in hyperspace; the separating plane itself is the set of points where $z = 0$. There are an infinite number of ways of scaling the coefficients to represent the same plane, so we make it as simple as possible by fixing the scale such that $|z| = 1$ at the training examples closest to the plane:

$$ |\beta_0 + \textbf{$\beta^{T}$}\textbf{x}| = 1 $$

In this case $\textbf{x}$ represents those training examples closest to the hyperplane. This is an important element of what we are doing, because ideally we want to find the plane that maximizes distance between the positive and negative training examples. 

#####Question: What is the difference between a plane and a hyperplane?

###Distance between a point and a plane

Now we use the result from linear algebra (derived in the appendix) for the distance $D$ between a point $\textbf{u} = (u_0,u_1,u_2)$ and a plane $ax_0+bx_1+cx_2+z=0$ with normal vector $\textbf{v} = (a,b,c)$:

$$D = \frac{|au_0+bu_1+cu_2+z|}{|\textbf{v}|}$$

#####Question: Make sure to understand the derivation in the appendix.

###Definition of Margin

Extending this definition to our discussion of distance from the hyperplane, we can simply compute the distance between the support vector $\textbf{x}$ and the hyperplane we defined as:

$$D_{hyperplane} = \frac{|\beta_0 + \textbf{$\beta^{T}$}\textbf{x}|}{\|\beta\|}$$

Extending this generalization, we can set the distance to the support vectors as:

$$D_{hyperplane} = \frac{|\beta_0 + \textbf{$\beta^{T}$}\textbf{x}|}{\|\beta\|} = \frac{1}{\|\beta\|}$$

The total **margin** between the support vectors is given by **M**:

$$M = \frac{2}{\|\beta\|}$$

With an SVM, we seek to **maximize M**. This is why SVMs are called **maximum margin classifiers**.
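The distance and margin formulas above can be checked numerically. The following is a minimal sketch, assuming a hypothetical plane $x_1 + x_2 - 3 = 0$ and a hypothetical support vector; the function names are illustrative and not taken from any library.

```python
import math

def decision_value(beta0, beta, x):
    """Evaluate z = beta0 + beta^T x for a single point x."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

def distance_to_hyperplane(beta0, beta, x):
    """Distance |beta0 + beta^T x| / ||beta|| from x to the plane z = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    return abs(decision_value(beta0, beta, x)) / norm

def margin_width(beta):
    """Total margin M = 2 / ||beta|| between the two classes."""
    return 2.0 / math.sqrt(sum(b * b for b in beta))

# Hypothetical plane x1 + x2 - 3 = 0 and a point scaled so that z = 1.
beta0, beta = -3.0, [1.0, 1.0]
support_vector = [2.0, 2.0]

z = decision_value(beta0, beta, support_vector)
d = distance_to_hyperplane(beta0, beta, support_vector)
M = margin_width(beta)
print(z, d, M)
```

Because the support vector sits at $|z| = 1$, its distance to the plane is $1/\|\beta\|$, and the total margin is twice that distance.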


###Principal equation of the SVM

Recall that we were discussing classifying the data into the (+) and (-) group. Thus we consider these classification states as output variables $\textbf{y}$ to our method. Going back to our equation, we can write that 

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} = 1$$

For the positive class, and

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} = -1$$

For the negative class, again recalling that the $y$ are either $+1$ or $-1$. 

Also recall that every data point of the positive class lies on or beyond the positive margin, and every data point of the negative class lies on or beyond the negative margin. Therefore, for points in the positive class:

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} \geq 1$$

and for points in the negative class:

$$\beta_0 + \textbf{$\beta^{T}$}\textbf{x} \leq -1$$

Thinking of the $y_i$ as multipliers by $+1$ or $-1$, we can write:

$$y_i(\beta_0 + \beta^{T}x_{i}) \geq 1$$

for all datapoints $i=1,2, \cdots , m$
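To make the combined condition concrete, here is a small sketch that checks $y_i(\beta_0 + \beta^{T}x_{i}) \geq 1$ for every point of a toy dataset. The data and coefficients are invented for illustration, and the helper names are hypothetical.

```python
def functional_margins(beta0, beta, X, y):
    """Return y_i * (beta0 + beta^T x_i) for every training point."""
    return [
        yi * (beta0 + sum(b * xj for b, xj in zip(beta, xi)))
        for xi, yi in zip(X, y)
    ]

def satisfies_hard_margin(beta0, beta, X, y, tol=1e-9):
    """True when every point meets y_i * (beta0 + beta^T x_i) >= 1."""
    return all(m >= 1.0 - tol for m in functional_margins(beta0, beta, X, y))

# Toy data: two positive points and two negative points (labels are +1 / -1).
X = [[2.0, 2.0], [3.0, 3.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 1, -1, -1]

margins = functional_margins(-3.0, [1.0, 1.0], X, y)
ok = satisfies_hard_margin(-3.0, [1.0, 1.0], X, y)

# Halving the coefficients describes the same plane, but the scale is wrong:
# the closest points now sit at 0.5 instead of 1.
ok_rescaled = satisfies_hard_margin(-1.5, [0.5, 0.5], X, y)
print(margins, ok, ok_rescaled)
```

The points with a value of exactly 1 are the ones lying on the margin, that is, the support vectors.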

#####Question: Why do we write the principal equations this way?


##QUIZ:
    1. What differentiates SVM from standard regression techniques?
    2. Draw a picture of the above three equations and produce reasoning why they make sense.

##SVM Concepts

###Hypothesis

The variables can be most clearly classified in terms of a plane separating them.

**We can define the plane in terms of the training points closest to the plane. These points are called *support vectors***.

The plane is fixed by requiring that the support vectors satisfy

$$|\beta_0 + \textbf{$\beta^{T}$}\textbf{x}|=1$$

Where the $\textbf{x}$ are the **support vectors**.

It bears mentioning here that the $\beta$ coefficients are often described in other literature as **"weights"** $\textbf{w}$, with the intercept $\beta_0$ called the **"bias"** $b$, such that the equation is written:

$$|\textbf{$w^{T}$}\textbf{x}+b|=1$$

However, we are going to continue to use the $\beta$ notation in these notes so as to highlight the relationship between the well-known OLS regression and SVMs.



###Cost Function

Recall that our main cost function seeks to find those minimal weights *as a whole* (remember shrinkage) that fit our criteria. Since $M = 2/\|\beta\|$, maximizing the margin is the same as minimizing the size of $\beta$. Returning to the concept of a **norm**, we are going to minimize $\|\beta\|_2$ subject to our principal equation, $y_i(\beta_0 + \beta^{T}x_{i}) \geq 1$ for every datapoint $i$, derived above.

We use an equation known in the literature as a *Lagrangian* as a way of posing the cost function of our method (see appendix for more discussion of Lagrangian methods). A Lagrangian is an equation of the form:

$$\Lambda(x,y,\alpha) = f(x,y)+\alpha\ \cdot g(x,y)$$

#####Question: Make sure to understand the discussion in the appendix.

####Primal Form

<div style="background-color:#F7F99C">

In this case we do not formulate the cost function in terms of $\|\beta\|_2$ directly, due to the onerous nature of the square root involved. Rather, we can **transform** this term to $\frac{1}{2}\|\beta\|_2^{2}$, which has the same minimizer and a much simpler gradient. Then we just write down the Lagrangian, where we minimize $\frac{1}{2}\|\beta\|_2^{2}$ subject to $1-y_i(\beta_0 + \beta^{T}x_{i}) \leq 0$ for every datapoint $i$ (Why did I write it this way?), with one multiplier $\alpha_i \geq 0$ per constraint:

$$\Lambda(x,y,\alpha) = \frac{1}{2}\|\beta\|_2^{2}+\sum_{i=0}^{n}\alpha_{i}\ \cdot (1-y_i(\beta_0 + \beta^{T}x_{i}))$$

This equation is normally written by noting that $\|\beta\|_2^{2} = \beta^{T}\beta$, so:

$$\Lambda(x,y,\alpha) = \frac{1}{2}\beta^{T}\beta+\sum_{i=0}^{n}\alpha_{i}\ \cdot (1-y_i(\beta_0 + \beta^{T}x_{i}))$$

This is called the *primal form* because the weights $\beta$ and the multipliers $\alpha$ appear in the same equation. Now take the gradient with respect to the weights $\beta$ and set it to zero:

$$\nabla_{\beta}\Lambda(x,y,\alpha) = \beta+\sum_{i=0}^{n}\alpha_{i}\ (-y_i(x_{i})) = 0$$

That gives us a first solution for the **weights**:

$$\beta = \sum_{i=0}^{n}\alpha_{i}y_{i}x_{i}$$

Taking the derivative with respect to the **bias** $\beta_0$ and setting it to zero does not give a value for $\beta_0$ itself; instead it gives a constraint on the $\alpha_i$:

$$\frac{\partial \Lambda}{\partial \beta_0} = -\sum_{i=0}^{n}\alpha_{i}y_{i} = 0$$

</div>

####Dual Form

<div style="background-color:#F7F99C">

The more common form of the SVM Lagrangian is the *dual form*, which we can get if we substitute our solution for the weights, $\beta = \sum_{i=0}^{n}\alpha_{i}y_{i}x_{i}$, into the primal form. We also remember that $\sum_{i=0}^{n}\alpha_{i}y_{i} = 0$.

$$\Lambda(x,y,\alpha) = \frac{1}{2}\sum_{j=0}^{n}\alpha_{j}y_{j}x_{j}^{T}\sum_{i=0}^{n}\alpha_{i}y_{i}x_{i}+\sum_{i=0}^{n}\alpha_{i}\ \cdot (1-y_i(\beta_0 + \sum_{j=0}^{n}\alpha_{j}y_{j}x_{j}^{T}x_{i}))$$

This reorganizes (see appendix) to get the following equation:

</div>

$$\Lambda(x,y,\alpha) = \sum_{i=0}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}$$

This is a *dual form* because we see that we are describing the **weights** in terms of $\alpha$. Weights are not present in this equation. If we know all of the $\alpha$ we know the $\beta$; remember the formula $\beta = \sum_{i=0}^{n}\alpha_{i}y_{i}x_{i}$.

#####Question: What is special about the dual form Lagrangian? How would you program it with a computer?

####Soft Margin SVMs

For reasons discussed in more detail below, most real-world cases will require the ability to tolerate some misclassification, where there are datapoints labeled as one class that should be in the other, or where noise or hidden variables lead to unclear relationships amongst the classes. 

In this case, **even the optimal solution** will have some training datapoints that land on the wrong side of the plane. 

We deal with this using **Soft Margins**. All standard SVM applications use soft margins. 

For each point $i$, if the point falls on the wrong side of the hyperplane during training, we assign it a *penalty* $\xi_i$, based on its distance from the plane. 

We minimize $\sum_{i=0}^{n}\xi_i$ along with the hyperplane during optimization. 

In this case, the principal equations become:

$$\beta_0 + \textbf{$\beta^{T}$}x_{i} \geq M(1 - \xi_{i})$$

For the positive class, and

$$\beta_0 + \textbf{$\beta^{T}$}x_{i} \leq M(\xi_{i} - 1)$$

For the negative class. Multiplying through by $y_i$ combines both into a single condition, $y_{i}(\beta_0 + \textbf{$\beta^{T}$}x_{i}) \geq M(1 - \xi_{i})$. We also are assured that

$$\xi_{i} \geq 0$$

Because we don't penalize points that are not on the wrong side of the plane. 

**Notice that the margin figures importantly into the principal equations here. What does it mean to have a nonzero $\xi_{i}$ value?**

So we want to minimize $\frac{1}{2}\|\beta\|_2^{2}+C\sum_{i=0}^{n}\xi_i$, subject to the above principal equations. It turns out that this is not too much of a change to the dual form of the Lagrangian (Hastie has the derivation for the primal form on p. 420):

$$\Lambda(x,y,\alpha) = \sum_{i=0}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}$$

Where $\sum_{i=0}^{n}\alpha_{i}y_{i}=0$ and we also follow the constraint $C\geq\alpha_i\geq 0$; this is all dealt with simultaneously during optimization.

The question arises: Now that we've included this new constant $C$ in the equations, what is its effect on the location of the decision boundary? We find that the fixed scale 1 is arbitrary because the $M$ term can be brought into the $\beta$.

What is important is that you understand that the **margin** $M$ changes with respect to $C$. Changes in $C$ result in changes in the $\beta$. This in turn results in changes in the direction of the plane and width of the margin simultaneously. More on this is discussed below.
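The role of the penalties $\xi_i$ and of $C$ can be seen on a toy example. The sketch below assumes the fixed scale 1 mentioned above (so the margin is absorbed into $\beta$) and computes each penalty as $\xi_i = \max(0, 1 - y_i(\beta_0 + \beta^{T}x_i))$, which is the smallest value that satisfies the constraint. The data, coefficients, and function names are hypothetical.

```python
def slacks(beta0, beta, X, y):
    """Smallest xi_i >= 0 satisfying y_i * (beta0 + beta^T x_i) >= 1 - xi_i."""
    return [
        max(0.0, 1.0 - yi * (beta0 + sum(b * xj for b, xj in zip(beta, xi))))
        for xi, yi in zip(X, y)
    ]

def soft_margin_objective(beta0, beta, X, y, C):
    """0.5 * ||beta||^2 + C * sum of slacks."""
    return 0.5 * sum(b * b for b in beta) + C * sum(slacks(beta0, beta, X, y))

# Toy data; the last point is labeled -1 but sits deep in the positive region.
X = [[2.0, 2.0], [3.0, 3.0], [1.0, 1.0], [0.0, 0.0], [2.5, 2.5]]
y = [1, 1, -1, -1, -1]
beta0, beta = -3.0, [1.0, 1.0]

xi = slacks(beta0, beta, X, y)
cost_strict = soft_margin_objective(beta0, beta, X, y, C=1.0)
cost_lenient = soft_margin_objective(beta0, beta, X, y, C=0.1)
print(xi, cost_strict, cost_lenient)
```

Only the stray point receives a nonzero penalty. With a large $C$ that single point dominates the cost, so the optimizer is pushed to move the plane; with a small $C$ it is largely tolerated.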

###Optimization

We seek to maximize the (dual) Lagrangian as if it were an expectation function:

$$\Lambda(x,y,\alpha) = \sum_{i=0}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}$$

subject to $\alpha_{i} \geq 0$ for all $i$ and $\sum_{i=0}^{n}\alpha_{i}y_{i} = 0$

Problems of this type are called Quadratic Programming (QP) problems (because the objective is quadratic in the $\alpha_i$ while the constraints are linear) and require relatively heavy-duty methods to reliably achieve the solution. However, because this QP is convex, we are assured that a **global**, not merely local(!), maximum over the $\alpha_i$ can be found if optimized correctly!
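Short of running a QP solver, the dual objective and its constraints can at least be evaluated for a candidate set of $\alpha_i$. The sketch below does this for a toy dataset whose solution can be worked out by hand (an assumption of this example, not the output of a solver), and recovers the weights from $\beta = \sum_i \alpha_i y_i x_i$. All names are illustrative.

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad

def weights_from_alpha(alpha, X, y):
    """Recover beta = sum_i alpha_i y_i x_i."""
    p = len(X[0])
    return [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(p)]

X = [[2.0, 2.0], [3.0, 3.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 1, -1, -1]
alpha = [1.0, 0.0, 1.0, 0.0]  # hand-derived; nonzero only for the two closest points

balance = sum(a * yi for a, yi in zip(alpha, y))   # constraint: must equal 0
beta = weights_from_alpha(alpha, X, y)
dual_value = dual_objective(alpha, X, y)
primal_value = 0.5 * dot(beta, beta)
print(balance, beta, dual_value, primal_value)
```

For this candidate the dual value equals the primal value $\frac{1}{2}\|\beta\|_2^{2}$, which is what we expect at the optimum of a convex QP.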

We will not cover QP solvers in this unit.

Directly computing weights is comparatively rare in most SVM implementations. It is more common to simply calculate the $\alpha_{i}$ and report the model in terms of them. This produces the **model** $G$ for input data $\textbf{z}$:

$$G(\textbf{z}) = \beta_0 + \sum_{j=0}^{s}\alpha_{j}y_{j} \textbf{$x_{j}$}^{T}\textbf{z}$$

Where the $j$ are the indices of the $s$ support vectors.
The model classifies $\textbf{z}$ as class $+1$ (+) if $G(\textbf{z})$ is positive and class $-1$ (-) otherwise.

(Note that the weights do not appear)
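A prediction from the $\alpha_j$ alone might be sketched as follows. The support vectors, multipliers, and labels are hypothetical values for a toy problem. The sketch also adds an intercept `beta0`, which is an assumption of this example; set it to 0 to match a model written without one.

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def decision_score(z, alpha, support_vectors, labels, beta0=0.0):
    """G(z) = beta0 + sum_j alpha_j y_j x_j^T z over the support vectors."""
    return beta0 + sum(
        a * yj * dot(xj, z) for a, xj, yj in zip(alpha, support_vectors, labels)
    )

def classify(z, alpha, support_vectors, labels, beta0=0.0):
    """Return +1 when the score is positive and -1 otherwise."""
    return 1 if decision_score(z, alpha, support_vectors, labels, beta0) > 0 else -1

# Hypothetical fitted model: one support vector per class.
support_vectors = [[2.0, 2.0], [1.0, 1.0]]
labels = [1, -1]
alpha = [1.0, 1.0]
beta0 = -3.0

score_a = decision_score([3.0, 2.0], alpha, support_vectors, labels, beta0)
score_b = decision_score([0.0, 1.0], alpha, support_vectors, labels, beta0)
class_a = classify([3.0, 2.0], alpha, support_vectors, labels, beta0)
class_b = classify([0.0, 1.0], alpha, support_vectors, labels, beta0)
print(score_a, class_a, score_b, class_b)
```

Only dot products between the new point and the support vectors are needed, so training points with $\alpha_j = 0$ can be discarded after fitting.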

###Reasoning

Choosing a bisecting plane to divide two classes enables us to predict a classification using a training set without assuming linearity in the data or fitting to an explicit function.

SVM provides several advantages. In simple cases, an SVM can match or outperform logistic regression. SVMs produce sparse models automatically: the $\alpha$ coefficients of training points that are unimportant to the classification go to 0, leaving only the support vectors; SVMs also perform shrinkage on the $\beta$ coefficients. We are assured of a globally optimal solution due to the formulation of the method. 


![svm_mock](images/svm_mockup_6.png)

##QUIZ:
    1. What is the hypothesis of a SVM?
    2. Write the primal form of the SVM Lagrangian without looking at the equation. Why do we use the dual form?
    3. Describe the elements of the above figure.
    
    

##Separable vs. Nonseparable Solutions (Soft Margins)

Many datasets are "poorly separable" due to stray points on the wrong side of the hyperplane. This does not mean that the classification is impossible, only that there may be some noise present or that there are some mislabeled points. Here, Soft Margins are necessary, providing us a way of managing noise in the training set and optimizing the performance of the SVM by allowing for misclassifications. All standard, out-of-the-box implementations of SVMs include Soft Margins.


By decreasing the penalty $C$, we soften the requirements for fitting a class, so that individual stray points have less influence on the boundary. However, setting $C$ too low (underfitting) introduces *generalization bias* and can lead to a lot of misclassification:

![SVM_GB](images/SVM_Generalization_Bias2.png)

The penalty $C$ also governs the *generalization variance.* Setting $C$ too high (overfitting) will lead to excessive sensitivity to training outliers and poor performance on test data:

![SVM_GB](images/SVM_Generalization_Variance2.png)

Tuning SVM performance with $C$ is important and will be discussed in later lectures.

##QUIZ:
    1. What is Generalization Bias?
    1. What is Generalization Variance?
    1. Why do we use the word "Generalization" when describing these things?

##Kernels: Linear vs. Nonlinear SVMs

We have made many claims about SVMs and separating two classes. Well, what if you **can't** divide the two classes with a single line? 


####What is a Kernel?

Here we introduce the concept of the "Kernel," which stands in the running along with "field" as quite possibly the most overused word in scientific history. "Kernel" means "heart" or "essence" of something.

In math, which is the origin of the word in this case, a kernel is a member of a set of functions:

$$K: \mathbb{R}^{N} \times \mathbb{R}^{N} \rightarrow \mathbb{R}$$

In the case of the SVM, it's referring to a very particular term in the cost function:

$$\Lambda(x,y,\alpha) = \sum_{i=0}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}\underline{x_{i}^{T}x_{j}}$$

This is a dot or inner product, meaning that it produces a single number for each pair of points (vectors) $x_i, x_j$. You will often see it written this way (inner product vector notation):

$$x_{i}^{T}x_{j} = <x_j,x_i>$$

This regular, standard kernel is called a **linear** kernel.

![linear](images/plot_iris_exercise_000_2.png)

####Nonlinear Kernels

If we apply the standard SVM with a linear kernel to a linearly nonseparable dataset, nothing good happens:

![nonsep_linear](images/nonsep_svm_linear2.png)

But there is another way: By *transforming the nonlinear data into a non-linear space* we **can** separate the two classes!

![nonsep_transformed](images/Nonseparable_Transformed.png)

For example, we can transform the data visualized in the above figure from $[x,y]$ space to $[x,y,x^2+y^2]$ space and suddenly, a clear boundary appears!

![twoD-to-threeD](images/TwoD_to_ThreeD_Boundary.png)

Why does this work? Well, data that are not separable in the original linear space may be separable in a nonlinear space! If we transform the data into the nonlinear space, we can optimize our cost function so that our SVM can separate nonlinear data!
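The $[x,y] \rightarrow [x,y,x^2+y^2]$ transformation described above can be tried on a handful of points. In this sketch the points are invented: one class sits near the origin and the other on a ring around it, so no single line separates them in two dimensions. The threshold of 1 on the third coordinate is a hypothetical separating plane chosen by eye.

```python
def lift(point):
    """Map [x, y] to [x, y, x^2 + y^2]."""
    x, y = point
    return [x, y, x * x + y * y]

# Invented data: an inner cluster surrounded by an outer ring.
inner = [[0.5, 0.0], [0.0, -0.5], [-0.3, 0.4]]
outer = [[2.0, 0.0], [0.0, -2.0], [-1.2, 1.6]]

threshold = 1.0  # hypothetical plane: third coordinate equal to 1

inner_heights = [lift(p)[2] for p in inner]
outer_heights = [lift(p)[2] for p in outer]

inner_below = all(h < threshold for h in inner_heights)
outer_above = all(h > threshold for h in outer_heights)
print(inner_heights, outer_heights, inner_below, outer_above)
```

In the lifted space the third coordinate is the squared distance from the origin, so a flat plane at a fixed height does the job that would need a circle in the original space.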

Here we use a polynomial transformation; polynomial transformations usually include the cross terms (such as $xy$), i.e.:

$$P([x,y]) \rightarrow [x^2,y^2, 2xy]$$

Instead of computing the dot product $<x_j,x_i>$, where 

$$<x_j,x_i>=\begin{bmatrix}
         x_j &
         y_j
        \end{bmatrix} \begin{bmatrix}
         x_i \\
         y_i\\
        \end{bmatrix}$$

we would now compute a new polynomial dot product where:

$$<P(x_j),P(x_i)>=\begin{bmatrix}
         x_{j}^2 &
         y_{j}^2 &
         2x_{j}y_{j}
        \end{bmatrix} \begin{bmatrix}
         x_{i}^2 \\
         y_{i}^2\\
         2x_{i}y_{i} \\
        \end{bmatrix}$$

And then we would just substitute in our new **nonlinear kernel** into the Lagrangian:

$$\Lambda(x,y,\alpha) = \sum_{i=0}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}<P(x_j),P(x_i)>$$


Different types of data are going to be separable with different nonlinear transformations; the transformation is determined on a case-by-case basis. We can substitute a new kernel for some transformation $\phi$: $<\phi(x_j),\phi(x_i)> = K(x_j,x_i)$ into the Lagrangian in general and write as follows:

$$\Lambda(x,y,\alpha) = \sum_{i=0}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}K(x_j,x_i)$$

Now we just go ahead and run the SVM as normal:

![nonsep_poly](images/nonsep_svm_poly2.png)


####The Kernel Trick
<div>
<font color="red">THIS IS A STANDARD INTERVIEW QUESTION</font>
</div>

So this discussion should set off a few alarms. Transforming the entire Lagrangian from linear into polynomial space is extremely costly in terms of resources. Normally this would imply that we would need to actually increase the number of $\alpha$ coefficients and transform the response variables too.

If we were to actually do this, the computation would become prohibitively expensive rapidly with respect to the number of features. However, the Lagrangian only ever uses the transformed data through the **dot product** $<\phi(x_j),\phi(x_i)>$ for each pair of data points $i,j$, and for many transformations this number can be computed directly from the original feature vectors as $K(x_j,x_i)$, so *this doesn't happen*. 

We call this **"the kernel trick"**. 

It's a trick because transformation to a nonlinear space doesn't seem to cost us much; we don't need to calculate the transformation explicitly. We can simply calculate the kernel once for every pair of training points (an $n \times n$ matrix for $n$ training points) and reuse it throughout the optimization. Evaluating the kernel costs little more than evaluating the ordinary dot product, and the number of $\alpha$ coefficients in the Lagrangian does not increase at all. Since the Lagrangian is the most expensive function in terms of computation and storage, this amounts to immense computational leverage.
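The following sketch illustrates the trick for one specific case: the kernel $K(p,q) = <p,q>^{2}$ in two dimensions. Note one assumption: the explicit feature map here uses $\sqrt{2}\,xy$ for the cross term rather than the $2xy$ written earlier, because that is the coefficient for which the explicit dot product equals $<p,q>^{2}$ exactly. Function names are illustrative.

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(point):
    """Explicit degree-2 feature map [x^2, y^2, sqrt(2)*x*y]."""
    x, y = point
    return [x * x, y * y, math.sqrt(2.0) * x * y]

def poly2_kernel(p, q):
    """Same quantity computed in the original 2-D space: (p . q)^2."""
    return dot(p, q) ** 2

def gram_matrix(points, kernel):
    """n x n matrix of kernel values for every pair of training points."""
    return [[kernel(p, q) for q in points] for p in points]

p, q = [1.0, 2.0], [3.0, 4.0]
explicit = dot(phi(p), phi(q))   # transform first, then take the dot product
implicit = poly2_kernel(p, q)    # never leaves the original space

K = gram_matrix([p, q, [0.0, 1.0]], poly2_kernel)
print(explicit, implicit, K)
```

Both routes give the same number, but the second never builds the transformed vectors. The Gram matrix has one row and one column per training point regardless of how many dimensions the transformed space has.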

####Types of Kernels

It bears mentioning that there are a few different types of kernels commonly in use; we will mention them here.

#####Linear Kernel:

This is simply the standard dot product $<x_j,x_i>$. Sometimes a scalar coefficient $c$ is added.

$$K(x_j,x_i) = <x_j,x_i>+c$$

#####Polynomial Kernels

Some combination of transformations and exponents of the original linear kernel. Transformation by the function:

$$f(x, b, c, d) = (bx+c)^{d}$$

Leads to the following expression:

$$K(x_j,x_i) = (b<x_j,x_i>+c)^{d}$$

#####Gaussian and Radial Basis Function Kernels

Radial basis functions model the data as if it came from a distribution; this is done in order to "soften" the locations of the training data. The Gaussian radial kernel is the most common of these:

$$K(x_j,x_i) = e^{-\frac{\|x_j-x_i\|^{2}}{2\sigma^{2}}}$$

There are other variants of the Gaussian, such as the Exponential and Laplace kernels, that leave out the square of the norm. 

The quality of the outcome is rather dependent on the scaling factor $\sigma$ (the broadness of the Gaussian function). 

<div style="background-color:#F7F99C">
There are other RBF kernels worth mentioning:

Multiquadric (used to relieve computational cost of Gaussian kernels):

$$K(x_j,x_i) = \sqrt{1+\frac{\|x_j-x_i\|^2}{2\sigma^{2}}}$$

Inverse multiquadric:

$$K(x_j,x_i) = \frac{1}{\sqrt{1+\frac{\|x_j-x_i\|^2}{2\sigma^{2}}}}$$

</div>

And variants of these. It also bears mentioning that the expression including the scaling factor $\frac{1}{2\sigma^2}$ is very often converted into a single coefficient $\gamma$:

$$\gamma = \frac{1}{2\sigma^2}$$

And thus large gamma results in a relatively narrow radial function.
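The linear, polynomial, and Gaussian kernels above translate almost directly into code. This is a minimal sketch with illustrative function names and default parameter values chosen only for the example; it also checks the relationship $\gamma = \frac{1}{2\sigma^2}$ numerically.

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v, c=0.0):
    """K(u, v) = <u, v> + c."""
    return dot(u, v) + c

def polynomial_kernel(u, v, b=1.0, c=1.0, d=2):
    """K(u, v) = (b * <u, v> + c)^d."""
    return (b * dot(u, v) + c) ** d

def gaussian_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gaussian_kernel_gamma(u, v, gamma=0.5):
    """Same kernel written with gamma = 1 / (2 * sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

u, v = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(u, v)
k_poly = polynomial_kernel(u, v)
k_sigma = gaussian_kernel(u, v, sigma=2.0)
k_gamma = gaussian_kernel_gamma(u, v, gamma=1.0 / (2.0 * 2.0 ** 2))
k_narrow = gaussian_kernel_gamma(u, v, gamma=5.0)
print(k_lin, k_poly, k_sigma, k_gamma, k_narrow)
```

For the same pair of points, the larger $\gamma$ gives a much smaller kernel value: a narrow radial function treats the two points as nearly unrelated.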



#####Sigmoid Kernels
<div style="background-color:#F7F99C">
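The sigmoid kernel can be a strong performer on real-world data. Note that $\alpha$ below is a slope parameter, not one of the Lagrange multipliers $\alpha_i$.

$$K(x_j,x_i) = \tanh(\alpha<x_j,x_i> + c)$$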

</div>

#####Other Kernels
<div style="background-color:#F7F99C">
There are many, many other kernels that have been tried. Honorable mentions include circular and spherical kernels, wavelet, spline and Bayesian kernels. Not all of them have particularly good performance.

http://crsouza.com/2010/03/kernel-functions-for-machine-learning-applications/#sigmoid
</div>

##QUIZ:
    1. What is the Kernel Trick?
    2. Why choose one kernel over another?
    3. Write the kernel for the below transformation from (x,y) to (x,y,z):
$$f(x,y) \rightarrow (x^2,y^2,2xy,xy^2, x^2y)$$

##Multiclass SVMs

It is possible to extend the SVM methodology to more than two classes by solving the constraints for each pair of classes separately.

![nonsep](images/plot_iris_001.png)


There are two methods:

#####Any-of

In this case, we build one classifier per class and apply each of them independently. This is an example where it's ok to have something classified as more than one class.

1. Build a classifier for each class, where the training set consists of the set of points in the class (positive labels) and its complement (negative labels).
1. Given the test set, apply each classifier separately. The decision of one classifier has no influence on the decisions of the other classifiers.

#####One-of

Here we insist that each object must belong to one and only one class.

1. Build a classifier for each class, where the training set consists of points belonging to the class (positive labels) and its complement (negative labels).
1. Given the test point, apply each classifier separately.
1. Assign the point to the class with
    1. the maximum score,
    1. the maximum confidence value,
    1. or the maximum probability of class membership based on the votes of every classifier.
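The two schemes differ only in how the per-class scores are turned into a decision. The sketch below assumes three already-fitted linear classifiers, each stored as a hypothetical `(beta0, beta)` pair, and uses the raw score as the quantity to compare (the first of the options listed above).

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def class_scores(z, classifiers):
    """Score z with every per-class classifier: beta0 + beta^T z."""
    return {label: b0 + dot(beta, z) for label, (b0, beta) in classifiers.items()}

def any_of(z, classifiers):
    """Every class whose own classifier votes positive (may be several or none)."""
    return sorted(label for label, s in class_scores(z, classifiers).items() if s > 0)

def one_of(z, classifiers):
    """The single class with the maximum score."""
    scores = class_scores(z, classifiers)
    return max(scores, key=scores.get)

# Hypothetical fitted classifiers, one per class: label -> (beta0, beta).
classifiers = {
    "a": (0.0, [1.0, 0.0]),
    "b": (0.0, [0.0, 1.0]),
    "c": (0.5, [-1.0, -1.0]),
}

z = [2.0, 1.0]
scores = class_scores(z, classifiers)
labels_any = any_of(z, classifiers)
label_one = one_of(z, classifiers)
print(scores, labels_any, label_one)
```

Here the point is accepted by two of the classifiers under any-of, while one-of keeps only the class with the larger score.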

We are not going to go into great detail regarding the construction of Multiclass SVMs at this time, although you will get an opportunity to work with them in the Lab exercise.


##Appendix

###Distance between a point and a plane

Now we use vector math to calculate the distance between any point $x$ (a vector in p-dimensional space) and any plane. To illustrate this, take a plane z in 3 dimensions:

$$ax_0+bx_1+cx_2+z=0$$

The normal vector to the plane is:

$$\textbf{v} = \left[ \begin{array}{c}
a  \\
b  \\
c  \end{array} \right]$$


Given a point $\textbf{u} = (u_0,u_1,u_2)$, a vector $\textbf{w}$ from a point $\textbf{x}$ on the plane to $\textbf{u}$ is given by:

$$\textbf{w} = \textbf{u}-\textbf{x} = - \left[ \begin{array}{c}
x_0-u_0  \\
x_1-u_1  \\
x_2-u_2  \end{array} \right]$$

Now we can take the projection of $\textbf{w}$ onto $\textbf{v}$ to give the distance from $\textbf{u}$:

$$D=|proj_{\textbf{v}}\textbf{w}|$$

$$D=\frac{|\textbf{v}\cdot\textbf{w}|}{|\textbf{v}|}$$


We have to work this equation to get the result we want:

$$D = \frac{|a(x_0-u_0)+b(x_1-u_1)+c(x_2-u_2)|}{|\textbf{v}|}$$

$$D = \frac{|ax_0-au_0+bx_1-bu_1+cx_2-cu_2|}{|\textbf{v}|}$$

$$D = \frac{|-z-au_0-bu_1-cu_2|}{|\textbf{v}|}$$

$$D = \frac{|au_0+bu_1+cu_2+z|}{|\textbf{v}|}$$

####Lagrangian optimization

The Lagrangian describes a relationship between two functions $f(x,y)$ and $g(x,y)$, such that we can minimize $f(x,y)$ subject to the constraint $g(x,y) = 0$. We report those points (x,y) on the constraint where the **contours of both functions are parallel.** Note that the **contour of a function** can only be parallel to the contour of another function when the **gradients of both functions are parallel.** 

$$\nabla_{x,y}f(x,y) = -\alpha\ \nabla_{x,y}\ g(x,y)$$

This means that the gradients of the two must equal each other within a scalar multiplier $\alpha$. From the above equation we can just intuit the Lagrangian by adding the term in $g(x,y)$ to both sides:

$$\nabla_{x,y}f(x,y) + \alpha\ \nabla_{x,y}\ g(x,y) = 0$$

This yields the equation to solve:

$$\nabla_{x,y,\alpha}\Lambda(x,y,\alpha) = 0$$
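As a small worked example (the functions are chosen purely for illustration and are not part of the SVM itself), suppose we minimize $f(x,y) = x^2 + y^2$ subject to $g(x,y) = 1 - x - y = 0$. The Lagrangian is:

$$\Lambda(x,y,\alpha) = x^2 + y^2 + \alpha\ \cdot (1 - x - y)$$

Setting each partial derivative to zero gives three equations:

$$\frac{\partial \Lambda}{\partial x} = 2x - \alpha = 0, \qquad \frac{\partial \Lambda}{\partial y} = 2y - \alpha = 0, \qquad \frac{\partial \Lambda}{\partial \alpha} = 1 - x - y = 0$$

The first two give $x = y = \frac{\alpha}{2}$, and substituting into the third gives $\alpha = 1$, so the constrained minimum is at:

$$x = y = \frac{1}{2}, \qquad f\left(\tfrac{1}{2},\tfrac{1}{2}\right) = \frac{1}{2}$$

At this point $\nabla f = (1, 1)$ and $\nabla g = (-1, -1)$, so the gradients are parallel and $\nabla f = -\alpha \nabla g$ with $\alpha = 1$, as required.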

####Computation of the dual form

$$\Lambda(x,y,\alpha) = \frac{1}{2}\sum_{j=0}^{n}\sum_{i=0}^{n}\alpha_{j}\alpha_{i}y_{j}y_{i}x_{j}^{T}x_{i}+\sum_{i=0}^{n}\alpha_{i}\ \cdot (1-y_i(\beta_0 + \sum_{j=0}^{n}\alpha_{j}y_{j}x_{j}^{T}x_{i}))$$

$$\Lambda(x,y,\alpha) = \frac{1}{2}\sum_{j=0}^{n}\sum_{i=0}^{n}\alpha_{j}\alpha_{i}y_{j}y_{i}x_{j}^{T}x_{i}+\sum_{i=0}^{n}\alpha_{i}-\sum_{i=0}^{n}\alpha_{i}\ y_i\sum_{j=0}^{n}\alpha_{j}y_{j}x_{j}^{T}x_{i}-\beta_0\sum_{i=0}^{n}\alpha_{i}y_{i}$$

$$\Lambda(x,y,\alpha) = \frac{1}{2}\sum_{j=0}^{n}\sum_{i=0}^{n}\alpha_{j}\alpha_{i}y_{j}y_{i}x_{j}^{T}x_{i}+\sum_{i=0}^{n}\alpha_{i}-\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{j}^{T}x_{i}-\beta_0\sum_{i=0}^{n}\alpha_{i}y_{i}$$

Collecting terms and reorganizing we get:

$$\Lambda(x,y,\alpha) = \sum_{i=0}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}$$
