## [Sep 16] Basis of Machine Learning II

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

----

Content

1. [Kernel Method](#Kernel-Method)
2. [SVM Method](#SVM-Method)
3. [Basic Convex Optimization](#Basic-Convex-Optimization)

---

### 1. Kernel Method <a id='Kernel-Method'></a>

In practice, **linear separation** is often not possible. One way to define such a non-linear decision boundary is to use a non-linear mapping $\Phi$ from the input space $\mathcal{X}$ to a higher-dimensional space $\mathbb{H}$ (which is called a **feature space**), where linear separation is possible.

> A **kernel over $\mathcal{X}$** is a function $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$.

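The idea is to choose a kernel that can be written as $K(x, x^{\prime})=\langle\Phi(x), \Phi(x^{\prime})\rangle$ for some feature map $\Phi$, so that inner products in $\mathbb{H}$ can be computed without forming $\Phi$ explicitly. However, **such a representation does not exist for general kernels**. We therefore restrict to the following class: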

> A **positive definite symmetric (PDS) kernel** is a kernel $K$ such that for any $\{x_1,...,x_m \} \subset \mathcal{X}$, the kernel (Gram) matrix $[K(x_i,x_j)]_{ij} \in \mathbb{R}^{m \times m}$ is symmetric positive semidefinite (SPSD).

For example, we have

> **Polynomial kernels:** $K\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\left(\mathbf{x} \cdot \mathbf{x}^{\prime}+c\right)^{d}$.
>
> **Gaussian kernels:** $K\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\frac{\left\|\mathbf{x}^{\prime}-\mathbf{x}\right\|^{2}}{2 \sigma^{2}}\right)$.
>
> **Sigmoid kernels:** $K\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\tanh \left(a\left(\mathbf{x} \cdot \mathbf{x}^{\prime}\right)+b\right)$.
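
The PDS definition can be spot-checked numerically. The following is a minimal sketch, assuming a small random sample in $\mathbb{R}^2$ and $\sigma = 1$ (both hypothetical choices): it builds the Gram matrix of the Gaussian kernel and tests the quadratic form $\mathbf{c}^{\top} \mathbf{K} \mathbf{c} \geq 0$ on random vectors $\mathbf{c}$. Passing such a test is only evidence, not a proof, that the matrix is SPSD.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||y - x||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(kernel, points):
    """Kernel (Gram) matrix [K(x_i, x_j)]_{ij}."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(matrix, c):
    """Value of c^T M c."""
    n = len(c)
    return sum(c[i] * matrix[i][j] * c[j] for i in range(n) for j in range(n))


rng = random.Random(0)
m = 5
# Hypothetical sample: m random points in [-1, 1]^2.
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(m)]
K = gram_matrix(gaussian_kernel, points)

is_symmetric = all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(m) for j in range(m))

# SPSD means c^T K c >= 0 for EVERY c; here we only spot-check random c.
min_form = min(
    quadratic_form(K, [rng.uniform(-1, 1) for _ in range(m)]) for _ in range(1000)
)

print(is_symmetric, min_form >= -1e-12)
```

Replacing `gaussian_kernel` with another candidate kernel gives a quick way to look for a counterexample vector $\mathbf{c}$ when a kernel is suspected not to be PDS.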


The following is **the main result** of this note.

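Its proof relies on the following lemma (a Cauchy-Schwarz inequality for PDS kernels): for any PDS kernel $K$ and any $x, x^{\prime} \in \mathcal{X}$,

> $K\left(x, x^{\prime}\right)^{2} \leq K(x, x) K\left(x^{\prime}, x^{\prime}\right)$ 

since the matrix 

$$\begin{pmatrix}
K(x, x) & K\left(x, x^{\prime}\right) \\
K\left(x^{\prime}, x\right) & K\left(x^{\prime}, x^{\prime}\right)
\end{pmatrix}$$

is SPSD, so its determinant is nonnegative. With this lemma, we arrive at 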

> If $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a PDS kernel, then there exist a Hilbert space $\mathbb{H}$ and a mapping $\Phi: \mathcal{X} \rightarrow \mathbb{H}$ s.t. $K\left(x, x^{\prime}\right)=\left\langle\Phi(x), \Phi\left(x^{\prime}\right)\right\rangle$. 

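**Proof.** Define $\Phi(x)\left(x^{\prime}\right)=K\left(x, x^{\prime}\right)$, so that each $\Phi(x)$ is a real-valued function on $\mathcal{X}$, and let 
$\mathbb{H}_{0}=\left\{\sum_{i \in I} a_{i} \Phi\left(x_{i}\right): a_{i} \in \mathbb{R}, x_{i} \in \mathcal{X},|I|<\infty\right\}$ be the set of finite linear combinations of such functions. Next, we introduce a bilinear form on $\mathbb{H}_{0}$: 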

$$\langle f, g\rangle=\sum_{i \in I, j \in J} a_{i} b_{j} K\left(x_{i}, x_{j}^{\prime}\right) = \sum_{j \in J} b_{j} f (x_{j}') = \sum_{i \in I} a_{i} g\left(x_{i}\right)$$
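with $f=\sum_{i \in I} a_{i} \Phi\left(x_{i}\right)$ and $g=\sum_{j \in J} b_{j} \Phi\left(x_{j}^{\prime}\right)$. The last two equalities show that $\langle f, g\rangle$ depends on $f$ and $g$ only through their values, not on the particular representations chosen, so the form is well-defined. It is then routine to show that $\langle \cdot, \cdot\rangle$ is symmetric and bilinear and defines a **PDS kernel** on $\mathbb{H}_{0}$. The only **nontrivial** step is to show that $\langle f, f\rangle=0$ iff $f=0$. Observe that, for any $x \in \mathcal{X}$, 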

$$ f(x)^2=\bigg(\sum_{i \in I} a_{i} K\left(x_{i}, x\right)\bigg)^2 = \langle f, \Phi(x)\rangle^{2} \leq\langle f, f\rangle\langle\Phi(x), \Phi(x)\rangle. $$

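The last inequality holds because $\langle\cdot, \cdot\rangle$ is itself a PDS kernel on $\mathbb{H}_{0}$: for any $f_{1}, \ldots, f_{n} \in \mathbb{H}_{0}$ and $c_{1}, \ldots, c_{n} \in \mathbb{R}$,

$$ \sum_{i, j=1}^{n} c_{i} c_{j}\left\langle f_{i}, f_{j}\right\rangle=\bigg\langle\sum_{i=1}^{n} c_{i} f_{i}, \sum_{j=1}^{n} c_{j} f_{j}\bigg\rangle \geq 0,
$$

so the inequality $K\left(x, x^{\prime}\right)^{2} \leq K(x, x) K\left(x^{\prime}, x^{\prime}\right)$ established above for PDS kernels applies to $\langle\cdot, \cdot\rangle$, with $f$ and $\Phi(x)$ in place of $x$ and $x^{\prime}$. Consequently, $\langle f, f\rangle=0$ forces $f(x)=0$ for all $x \in \mathcal{X}$, i.e. $f=0$. Hence $\langle\cdot, \cdot\rangle$ is an inner product on $\mathbb{H}_{0}$, and by definition $\left\langle\Phi(x), \Phi\left(x^{\prime}\right)\right\rangle=K\left(x, x^{\prime}\right)$. Completing $\mathbb{H}_{0}$ with respect to the induced norm yields the Hilbert space $\mathbb{H}$.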
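
The proof above builds $\Phi$ abstractly as a map into a space of functions. For some kernels a finite-dimensional feature map can also be written down by hand. The sketch below, which is an illustration under assumptions and not the construction used in the proof, takes the polynomial kernel with $d=2$ on $\mathbb{R}^{2}$ and the explicit map $\Phi(\mathbf{x})=(x_{1}^{2}, x_{2}^{2}, \sqrt{2} x_{1} x_{2}, \sqrt{2 c}\, x_{1}, \sqrt{2 c}\, x_{2}, c)$, and checks $K(\mathbf{x}, \mathbf{x}^{\prime})=\langle\Phi(\mathbf{x}), \Phi(\mathbf{x}^{\prime})\rangle$ on one pair of points. The function names and the test points are hypothetical.

```python
import math


def poly2_kernel(x, y, c=1.0):
    """Polynomial kernel of degree 2 on R^2: (x . y + c)^2."""
    return (x[0] * y[0] + x[1] * y[1] + c) ** 2


def poly2_features(x, c=1.0):
    """Explicit 6-dimensional feature map for poly2_kernel (requires c >= 0)."""
    x1, x2 = x
    return [
        x1 * x1,
        x2 * x2,
        math.sqrt(2.0) * x1 * x2,
        math.sqrt(2.0 * c) * x1,
        math.sqrt(2.0 * c) * x2,
        c,
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, y)                           # (3 - 2 + 1)^2
feature_value = dot(poly2_features(x), poly2_features(y))   # inner product in R^6

print(kernel_value, feature_value)
```

Evaluating the kernel costs one inner product in $\mathbb{R}^{2}$, while the explicit route works in $\mathbb{R}^{6}$; the gap grows quickly with the degree and the input dimension.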






### 2. SVM Method <a id='SVM-Method'></a>

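We consider a binary classification problem, where inputs $x$ are drawn from a distribution $\mathcal{D}$ and labeled by a target function $f$ with values in $\{-1,+1\}$.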

> Determine a **hypothesis** $h \in \mathcal{H}$ where
>
> $$ \mathcal{H}=\left\{\mathbf{x} \mapsto \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}+b): \mathbf{w} \in \mathbb{R}^{N}, b \in \mathbb{R}\right\}$$ 
>
> with small generalization error $R_{\mathcal{D}}(h)=\underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq f(x)].$

Here's the outline: we first introduce the algorithm for **separable datasets**, then its general version for **non-separable datasets**, and finally provide a theoretical foundation for SVMs based on **margin**.

> **Def 1.** $S=\left(\left(x_{1}, y_{1}\right), \ldots,\left(x_{m}, y_{m}\right)\right)$ is **separable** if  $\exists (\mathbf{w}, b) \in\left(\mathbb{R}^{N}-\{\mathbf{0}\}\right) \times \mathbb{R}$ such that $\forall i \in[m]$, $y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 0$.

We can see that if there exists one hyperplane $\mathbf{w} \cdot \mathbf{x}+b=0$ separating $S$, then there exist infinitely many. Therefore, we need a criterion for selecting a unique one.

> Consider maximizing the margin $\rho$ of a separating hyperplane 
> 
> $$\rho=\max _{\substack{\mathbf{w}, b:\\ y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 0}} \min _{i} \frac{\left|\mathbf{w} \cdot \mathbf{x}_{i}+b\right|}{\|\mathbf{w}\|}=\max _{\mathbf{w}, b} \min _{i} \frac{y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)}{\|\mathbf{w}\|}=\max _{\substack{\mathbf{w}, b: \\ \min _{i} y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)=1}} \frac{1}{\|\mathbf{w}\|}=\max _{\substack{\mathbf{w}, b: \\ \forall i, y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 1}} \frac{1}{\|\mathbf{w}\|}.$$
>
Since maximizing $1 /\|\mathbf{w}\|$ is the same as minimizing $\frac{1}{2}\|\mathbf{w}\|^{2}$, the last optimization problem above is equivalent to 

$$
\begin{aligned}
\min _{\mathbf{w}, b} & \quad \frac{1}{2}\|\mathbf{w}\|^{2} \\
\text { subject to: } & \quad y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 1, \forall i \in \{1,2,...,m\}.
\end{aligned}
$$
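
The rescaling step in the chain of equalities for $\rho$ can be checked on a small example. The following sketch, assuming a hypothetical four-point sample and a hand-picked separating pair $(\mathbf{w}, b)$, computes the geometric margin $\min_i y_i(\mathbf{w} \cdot \mathbf{x}_i + b)/\|\mathbf{w}\|$ and verifies that, after rescaling so that $\min_i y_i(\mathbf{w} \cdot \mathbf{x}_i + b)=1$, the margin equals $1/\|\mathbf{w}\|$. It does not search for the optimal hyperplane.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def geometric_margin(w, b, sample):
    """min_i y_i (w . x_i + b) / ||w||; positive iff (w, b) strictly separates the sample."""
    return min(y * (dot(w, x) + b) for x, y in sample) / math.sqrt(dot(w, w))


def canonical_form(w, b, sample):
    """Rescale (w, b) so that min_i y_i (w . x_i + b) = 1 (needs a strictly separating pair)."""
    scale = min(y * (dot(w, x) + b) for x, y in sample)
    return [wi / scale for wi in w], b / scale


# Hypothetical separable sample: (point, label) pairs.
sample = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]
w, b = [3.0, 3.0], -6.0          # one separating hyperplane: 3 x1 + 3 x2 - 6 = 0

rho = geometric_margin(w, b, sample)
w_c, b_c = canonical_form(w, b, sample)
inverse_norm = 1.0 / math.sqrt(dot(w_c, w_c))

print(rho, w_c, b_c, inverse_norm)
```

The margin is invariant under positive rescaling of $(\mathbf{w}, b)$, which is why the constraint $\min_i y_i(\mathbf{w} \cdot \mathbf{x}_i + b)=1$ can be imposed without loss of generality.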

In [Basic Convex Optimization](#Basic-Convex-Optimization), we recall some basic facts about convex optimization. Introducing one Lagrange multiplier $\alpha_{i} \geq 0$ per constraint, $i \in[m]$, the Lagrangian function is 

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})=\frac{1}{2}\|\mathbf{w}\|^{2}-\sum_{i=1}^{m} \alpha_{i}\left[y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)-1\right]$$

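One can check that the hypotheses of the **KKT condition** are satisfied: the objective is convex and differentiable, and the constraints are affine and, in the separable case, feasible, so they are qualified. Therefore, a solution is characterized by the following KKT conditions: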

$$\begin{aligned}
\nabla_{\mathbf{w}} \mathcal{L}=\mathbf{w}-\sum_{i=1}^{m} \alpha_{i} y_{i} \mathbf{x}_{i}=0 \quad \Longrightarrow  & \quad \mathbf{w}=\sum_{i=1}^{m} \alpha_{i} y_{i} \mathbf{x}_{i} \\
\nabla_{b} \mathcal{L}=-\sum_{i=1}^{m} \alpha_{i} y_{i}=0 \quad \Longrightarrow & \quad  \sum_{i=1}^{m} \alpha_{i} y_{i}=0 \\
 & \quad \alpha_{i}=0 \vee y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)=1 .
\end{aligned}$$














The weight vector $\mathbf{w}$ at the solution of the SVM problem is a linear combination of the training set vectors $\mathbf{x}_{1}, \ldots, \mathbf{x}_{m}$. A vector $\mathbf{x}_{i}$ appears in that expansion iff $\alpha_{i} \neq 0$. Such vectors are called **support vectors**. By the **complementarity conditions**, if $\alpha_{i} \neq 0$, then $y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)=1$. Thus, support vectors lie on the marginal hyperplanes $\mathbf{w} \cdot \mathbf{x}+b= \pm 1$, which justifies the name of the algorithm.
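
The KKT conditions and the notion of support vector can be verified by hand on a toy problem. The sketch below assumes a hypothetical three-point sample and a candidate triple $(\boldsymbol{\alpha}, \mathbf{w}, b)$ guessed by symmetry; it checks stationarity, feasibility, and complementarity, and then reads off the support vectors. It is a verification of a given candidate, not a solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical sample: two points on the margin and one far from it.
sample = [((1.0, 1.0), 1), ((-1.0, -1.0), -1), ((3.0, 3.0), 1)]
alpha = [0.25, 0.25, 0.0]   # candidate multipliers (an assumption, checked below)
b = 0.0                     # candidate offset (an assumption, checked below)

# Stationarity in w: w = sum_i alpha_i y_i x_i.
w = [sum(a * y * x[k] for a, (x, y) in zip(alpha, sample)) for k in range(2)]

# Stationarity in b: sum_i alpha_i y_i = 0.
balance = sum(a * y for a, (_, y) in zip(alpha, sample))

# Primal feasibility: y_i (w . x_i + b) >= 1 for all i.
margins = [y * (dot(w, x) + b) for x, y in sample]
feasible = all(m >= 1.0 - 1e-12 for m in margins)

# Complementarity: alpha_i = 0 or y_i (w . x_i + b) = 1.
complementary = all(a == 0 or abs(m - 1.0) < 1e-12 for a, m in zip(alpha, margins))

# Support vectors are the points with a nonzero multiplier.
support_vectors = [x for a, (x, _) in zip(alpha, sample) if a != 0]

print(w, balance, margins, feasible, complementary, support_vectors)
```

The third point has $\alpha_3 = 0$ and margin value $3 > 1$, so removing it from the sample would leave $\mathbf{w}$ and $b$ unchanged.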

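Substituting the first two KKT conditions, $\mathbf{w}=\sum_{i=1}^{m} \alpha_{i} y_{i} \mathbf{x}_{i}$ and $\sum_{i=1}^{m} \alpha_{i} y_{i}=0$, into the Lagrangian eliminates $\mathbf{w}$ and $b$, and the **dual optimization problem** simplifies to 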

$$
\begin{aligned}
\max _{\boldsymbol{\alpha}} \quad & \sum_{i=1}^{m} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{m} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(\mathbf{x}_{i} \cdot \mathbf{x}_{j}\right) \\
\text { subject to: } \quad & \alpha_{i} \geq 0 \wedge \sum_{i=1}^{m} \alpha_{i} y_{i}=0, \forall i \in[m] .
\end{aligned}
$$

The dual optimization problem reveals an important property of SVMs: **the hypothesis solution depends only on inner products between vectors and not directly on the vectors themselves.** Therefore, the general framework of the **kernel method** applies.
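
To see concretely that only inner products are needed, the dual objective and the resulting classifier can both be written with a `kernel` argument in place of the dot product. This is a minimal sketch on a hypothetical two-point sample, with multipliers $\boldsymbol{\alpha}$ and offset $b$ assumed rather than computed; the function names are illustrative.

```python
def linear_kernel(u, v):
    """Plain inner product; any PDS kernel could be passed instead."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, sample, kernel):
    """sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    total = sum(alpha)
    for a_i, (x_i, y_i) in zip(alpha, sample):
        for a_j, (x_j, y_j) in zip(alpha, sample):
            total -= 0.5 * a_i * a_j * y_i * y_j * kernel(x_i, x_j)
    return total


def decision(x, alpha, b, sample, kernel):
    """sign(sum_i alpha_i y_i K(x_i, x) + b), using w = sum_i alpha_i y_i x_i."""
    score = sum(a * y * kernel(x_i, x) for a, (x_i, y) in zip(alpha, sample)) + b
    return 1 if score >= 0 else -1


# Hypothetical sample and assumed dual-feasible multipliers.
sample = [((1.0, 1.0), 1), ((-1.0, -1.0), -1)]
alpha, b = [0.25, 0.25], 0.0

dual_value = dual_objective(alpha, sample, linear_kernel)
primal_value = 0.5 * (0.5 ** 2 + 0.5 ** 2)   # 1/2 ||w||^2 with w = (0.5, 0.5)

label_pos = decision((2.0, 0.5), alpha, b, sample, linear_kernel)
label_neg = decision((-3.0, 1.0), alpha, b, sample, linear_kernel)

print(dual_value, primal_value, label_pos, label_neg)
```

With a different kernel the same two functions apply unchanged, although the multipliers would then have to be recomputed for that kernel.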

In the non-separable case, for any hyperplane  $\mathbf{w} \cdot \mathbf{x}+b=0$, there exists  $\mathbf{x}_{i} \in S$  such that

$$y_{i}\left[\mathbf{w} \cdot \mathbf{x}_{i}+b\right] \ngeq 1.$$

Therefore, it is natural to relax the constraints by introducing slack variables: for each $i \in[m]$, there exists $\xi_{i} \geq 0$ such that 

$$y_{i}\left[\mathbf{w} \cdot \mathbf{x}_{i}+b\right] \geq 1-\xi_{i}.$$


This leads to the following general optimization problem defining SVMs in the non-separable case, where the parameter $C \geq 0$ determines the trade-off between margin maximization and the minimization of the slack penalty $\sum_{i=1}^{m} \xi_{i}^{p}$, for some $p \geq 1$:

$$\begin{aligned}
\min _{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^{2}+C \sum_{i=1}^{m} \xi_{i}^{p} \\
\text { subject to } \quad & y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 1-\xi_{i} \wedge \xi_{i} \geq 0, i \in[m]
\end{aligned}$$
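
For a fixed pair $(\mathbf{w}, b)$, the smallest slack satisfying both constraints is $\xi_{i}=\max \left(0,1-y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)\right)$. The sketch below, assuming a hypothetical non-separable sample on a line and a hand-picked $(\mathbf{w}, b)$, evaluates these slacks and the objective for $p=1$ and $p=2$. It evaluates the objective at one point only and does not minimize it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_margin_objective(w, b, sample, C=1.0, p=1):
    """Return (objective value, slacks) with the smallest feasible slacks for (w, b)."""
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in sample]
    value = 0.5 * dot(w, w) + C * sum(s ** p for s in slacks)
    return value, slacks


# Hypothetical sample: the negative point (1, 0) lies between two positive points,
# so no hyperplane separates the classes.
sample = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((1.0, 0.0), -1), ((-2.0, 0.0), -1)]
w, b = [1.0, 0.0], 0.0

value_p1, slacks = soft_margin_objective(w, b, sample, C=1.0, p=1)
value_p2, _ = soft_margin_objective(w, b, sample, C=1.0, p=2)

print(slacks, value_p1, value_p2)
```

A slack in $(0, 1]$ marks a point inside the margin but still on the correct side, while a slack above $1$ marks a misclassified point; here the second and third points illustrate the two cases.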

Formulas and conclusions similar to those of the separable case can be derived in the same way, so we omit them here.

### 3. Basic Convex Optimization <a id='Basic-Convex-Optimization'></a>

A **general (primal) constrained optimization problem** has the form:

> $$\begin{aligned}\min _{\mathbf{x} \in X} & \quad f(\mathbf{x}) \\ \text { subject to:} & \quad g_{i}(\mathbf{x}) \leq 0, \forall i \in\{1, \ldots, m\}.\end{aligned}$$

Recall that for a convex optimization problem (i.e. one where $f$ and all the $g_i$ are convex), any local minimum is also a global minimum, and a strictly convex objective function has at most one global minimizer. We denote by $p^{*}$ the optimal value of the objective, if it exists. Then 

> the **Lagrange function** associated to the general constrained optimization problem has the form
>
> $$\forall \mathbf{x} \in X, \forall \boldsymbol{\alpha} \geq 0, \quad \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})=f(\mathbf{x})+\sum_{i=1}^{m} \alpha_{i} g_{i}(\mathbf{x}).$$
>
> with the corresponding **Lagrange dual function** being $F(\boldsymbol{\alpha})=\inf _{\mathbf{x} \in X} \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})$, $\boldsymbol{\alpha} \geq 0$.

Therefore, the **dual optimization problem** is defined as

> $$\begin{aligned} \max _{\boldsymbol{\alpha}} & \quad F(\boldsymbol{\alpha}) \\ \text { subject to: } & \quad \boldsymbol{\alpha} \geq 0 . \end{aligned}$$

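Let $d^{*}$ denote the optimal value of the dual optimization problem. For any $\boldsymbol{\alpha} \geq 0$ and any feasible $\mathbf{x}$ (i.e. $g_{i}(\mathbf{x}) \leq 0$ for all $i$), we have $F(\boldsymbol{\alpha}) \leq f(\mathbf{x})+\sum_{i=1}^{m} \alpha_{i} g_{i}(\mathbf{x}) \leq f(\mathbf{x})$. Taking the supremum over $\boldsymbol{\alpha}$ and the infimum over feasible $\mathbf{x}$ gives $d^{*} \leq p^{*}$ (**weak duality**). The difference $p^{*}-d^{*}$ is known as the **duality gap**. 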
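
As a small worked example (a hypothetical one-dimensional problem chosen for illustration, not taken from the reference), consider minimizing $f(x)=x^{2}$ over $X=\mathbb{R}$ subject to $g(x)=1-x \leq 0$. The primal optimum is $p^{*}=1$, attained at $x=1$. The Lagrangian and the dual function are

$$\begin{aligned}
\mathcal{L}(x, \alpha) &= x^{2}+\alpha(1-x), \\
F(\alpha) &= \inf _{x \in \mathbb{R}} \mathcal{L}(x, \alpha)=\mathcal{L}\left(\tfrac{\alpha}{2}, \alpha\right)=\alpha-\frac{\alpha^{2}}{4}, \\
d^{*} &= \max _{\alpha \geq 0} F(\alpha)=F(2)=1 .
\end{aligned}$$

Indeed $F(\alpha)=1-\left(1-\frac{\alpha}{2}\right)^{2} \leq 1=p^{*}$ for every $\alpha \geq 0$, in agreement with $d^{*} \leq p^{*}$, and in this example the duality gap is zero.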

Hereafter, we give without proof some technical lemmas:

> ( **sufficient** ) If $\left(\mathbf{x}^{*}, \boldsymbol{\alpha}^{*}\right)$ with $\boldsymbol{\alpha}^{*} \geq 0$ satisfies $\forall \mathbf{x} \in X, \forall \boldsymbol{\alpha} \geq 0, \quad \mathcal{L}\left(\mathbf{x}^{*}, \boldsymbol{\alpha}\right) \leq \mathcal{L}\left(\mathbf{x}^{*}, \boldsymbol{\alpha}^{*}\right) \leq \mathcal{L}\left(\mathbf{x}, \boldsymbol{\alpha}^{*}\right)$, then $\mathbf{x}^{*}$ is a solution of the primal problem.

**Proof.** The first inequality implies that 
$$\begin{aligned}
\forall \boldsymbol{\alpha} \geq 0, \mathcal{L}\left(\mathbf{x}^{*}, \boldsymbol{\alpha}\right) \leq \mathcal{L}\left(\mathbf{x}^{*}, \boldsymbol{\alpha}^{*}\right) & \Rightarrow \forall \boldsymbol{\alpha} \geq 0, \boldsymbol{\alpha} \cdot g\left(\mathbf{x}^{*}\right) \leq \boldsymbol{\alpha}^{*} \cdot g\left(\mathbf{x}^{*}\right) \\
& \Rightarrow g\left(\mathbf{x}^{*}\right) \leq 0 \wedge \boldsymbol{\alpha}^{*} \cdot g\left(\mathbf{x}^{*}\right)=0 \quad \quad \quad \quad \quad \quad \quad \quad \text{(1)}
\end{aligned}$$

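whereas the second implies that $\forall \mathbf{x}, \mathcal{L}\left(\mathbf{x}^{*}, \boldsymbol{\alpha}^{*}\right) \leq \mathcal{L}\left(\mathbf{x}, \boldsymbol{\alpha}^{*}\right) \Rightarrow \forall \mathbf{x}, f\left(\mathbf{x}^{*}\right) \leq f(\mathbf{x})+\boldsymbol{\alpha}^{*} \cdot g(\mathbf{x})$, where we used $\boldsymbol{\alpha}^{*} \cdot g\left(\mathbf{x}^{*}\right)=0$ from (1). For any feasible $\mathbf{x}$, $g(\mathbf{x}) \leq 0$ and $\boldsymbol{\alpha}^{*} \geq 0$ give $\boldsymbol{\alpha}^{*} \cdot g(\mathbf{x}) \leq 0$, hence $f\left(\mathbf{x}^{*}\right) \leq f(\mathbf{x})$. Since $\mathbf{x}^{*}$ is itself feasible by (1), it is a solution of the primal problem. 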
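
The two inequalities of the lemma can be checked numerically on a grid. The sketch below assumes a hypothetical toy problem, minimizing $x^{2}$ subject to $1-x \leq 0$, with the candidate pair $(x^{*}, \alpha^{*})=(1,2)$; the grids are arbitrary, so the check is illustrative rather than a proof.

```python
def lagrangian(x, alpha):
    """L(x, alpha) = f(x) + alpha * g(x) with f(x) = x^2 and g(x) = 1 - x."""
    return x ** 2 + alpha * (1.0 - x)


x_star, alpha_star = 1.0, 2.0   # candidate pair (an assumption, checked below)

xs = [i / 10.0 for i in range(-50, 51)]      # grid over x in [-5, 5]
alphas = [i / 10.0 for i in range(0, 101)]   # grid over alpha in [0, 10]

center = lagrangian(x_star, alpha_star)

# First inequality: L(x*, alpha) <= L(x*, alpha*) for all alpha >= 0.
left_ok = all(lagrangian(x_star, a) <= center + 1e-12 for a in alphas)

# Second inequality: L(x*, alpha*) <= L(x, alpha*) for all x.
right_ok = all(center <= lagrangian(x, alpha_star) + 1e-12 for x in xs)

print(center, left_ok, right_ok)
```

Here $\mathcal{L}(1, \alpha)=1$ for every $\alpha$ because the constraint is active at $x^{*}=1$, and $\mathcal{L}(x, 2)=(x-1)^{2}+1 \geq 1$, so both inequalities hold on the whole domain and not only on the grid.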

Here $\left(\mathbf{x}^{*}, \boldsymbol{\alpha}^{*}\right)$ is known as a **saddle point** of the associated Lagrangian. We also have the following necessary conditions: 

> ( **necessary I** ) Assume that $f$ and the $g_{i}$ are convex and that **Slater's condition** holds. Then, 
>
> $$\mathbf{x} \text{ is a solution} \implies \exists  \boldsymbol{\alpha} \geq 0: (\mathbf{x}, \boldsymbol{\alpha})   \text{ is a saddle point of the Lagrangian.}$$
>
> ( **necessary II** ) Assume that $f$ and the $g_{i}$ are convex and differentiable and that the weak Slater's condition holds. Then, 
>
> $$\mathbf{x} \text{ is a solution} \implies \exists  \boldsymbol{\alpha} \geq 0: (\mathbf{x}, \boldsymbol{\alpha})   \text{ is a saddle point of the Lagrangian.}$$

Here, **Slater's condition** means that $\exists \overline{\mathbf{x}} \in \operatorname{int}(X): g(\overline{\mathbf{x}})<0$, while the **(weak) Slater's condition** means that

$$\exists \overline{\mathbf{x}} \in \operatorname{int}(X): \forall i \in[m],\left(g_{i}(\overline{\mathbf{x}})<0\right) \vee\left(g_{i}(\overline{\mathbf{x}})=0 \wedge g_{i} \text { affine}\right).$$

Finally, combining the above results, we obtain:

> ( **KKT condition** ) 
> 
> Assume that $f, g_{i}: \mathcal{X} \rightarrow \mathbb{R}$ are convex and differentiable and that the constraints are qualified. Then $\overline{\mathbf{x}}$ is a solution of the primal constrained program iff there exists $\overline{\boldsymbol{\alpha}} \geq 0$ such that 
>
> $$ \nabla_{\mathbf{x}} \mathcal{L}(\overline{\mathbf{x}}, \overline{\boldsymbol{\alpha}})=0 \wedge  g\left(\overline{\mathbf{x}}\right) \leq 0 \wedge \overline{\boldsymbol{\alpha}} \cdot g\left(\overline{\mathbf{x}}\right)=0.$$
>

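**Proof.** For the forward direction, by **necessary II** there exists $\overline{\boldsymbol{\alpha}} \geq 0$ such that $(\overline{\mathbf{x}}, \overline{\boldsymbol{\alpha}})$ is a saddle point of the Lagrangian. The first condition then holds because $\overline{\mathbf{x}}$ minimizes $\mathcal{L}(\cdot, \overline{\boldsymbol{\alpha}})$, and the last two conditions are exactly (1) from the proof of **sufficient**. For the converse, observe that for any feasible $\mathbf{x}$, 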

$$\begin{aligned}
f(\mathbf{x})-f(\overline{\mathbf{x}}) & \geq \nabla_{\mathbf{x}} f(\overline{\mathbf{x}}) \cdot(\mathbf{x}-\overline{\mathbf{x}})  =-\sum_{i=1}^{m} \overline{\boldsymbol{\alpha}}_{i} \nabla_{\mathbf{x}} g_{i}(\overline{\mathbf{x}}) \cdot(\mathbf{x}-\overline{\mathbf{x}}) \\
& \geq-\sum_{i=1}^{m} \overline{\boldsymbol{\alpha}}_{i}\left[g_{i}(\mathbf{x})-g_{i}(\overline{\mathbf{x}})\right] = -\sum_{i=1}^{m} \overline{\boldsymbol{\alpha}}_{i} g_{i}(\mathbf{x}) \geq 0.
\end{aligned}$$

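The first and third steps use the convexity of $f$ and of the $g_{i}$, the second uses the first condition, the fourth uses $\overline{\boldsymbol{\alpha}} \cdot g(\overline{\mathbf{x}})=0$, and the last uses $\overline{\boldsymbol{\alpha}} \geq 0$ together with the feasibility of $\mathbf{x}$. Here the constraints are qualified if, for example, the **(weak) Slater's condition** holds. Note also that the last two conditions are exactly (1).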
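
The three KKT conditions are easy to test for a given candidate. The following sketch assumes a hypothetical problem, minimizing $x_{1}^{2}+x_{2}^{2}$ subject to $1-x_{1}-x_{2} \leq 0$, with the candidate $\overline{\mathbf{x}}=(1/2, 1/2)$ and $\overline{\alpha}=1$. It checks the conditions and then compares $f(\overline{\mathbf{x}})$ against randomly sampled feasible points as a sanity check of optimality.

```python
import random


def f(x):
    return x[0] ** 2 + x[1] ** 2


def g(x):
    """Single constraint g(x) <= 0, i.e. x1 + x2 >= 1."""
    return 1.0 - x[0] - x[1]


def grad_lagrangian(x, alpha):
    """grad_x [f + alpha * g], using grad f = (2 x1, 2 x2) and grad g = (-1, -1)."""
    return (2.0 * x[0] - alpha, 2.0 * x[1] - alpha)


x_bar, alpha_bar = (0.5, 0.5), 1.0   # candidate pair (an assumption, checked below)

stationary = all(abs(c) < 1e-12 for c in grad_lagrangian(x_bar, alpha_bar))
primal_feasible = g(x_bar) <= 1e-12
complementary = abs(alpha_bar * g(x_bar)) < 1e-12
kkt_holds = alpha_bar >= 0 and stationary and primal_feasible and complementary

# Sanity check: no sampled feasible point does better than x_bar.
rng = random.Random(0)
candidates = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(10000)]
feasible = [x for x in candidates if g(x) <= 0]
no_better_point = all(f(x) >= f(x_bar) - 1e-12 for x in feasible)

print(kkt_holds, len(feasible) > 0, no_better_point)
```

Since $f$ and $g$ are convex and differentiable and the constraint is affine, the theorem guarantees that a candidate passing the three checks is a global solution; the random sampling is only a cross-check.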



---

### Reference

1. Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning. The MIT Press.
