## [Sep 16] Basis of Machine Learning II

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

----

Content

1. [Kernel Method](#Kernel-Method)
2. [SVM Method](#SVM-Method)
3. [Basic Convex Optimization](#Basic-Convex-Optimization)

---

### 1. Kernel Method <a id='Kernel-Method'></a>

In practice, **linear separation** is often not possible. One way to define such a non-linear decision boundary is to use a non-linear mapping $\Phi$ from the input space $\mathcal{X}$ to a higher-dimensional space $\mathbb{H}$ (which is called a **feature space**), where linear separation is possible.
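
The effect of such a mapping can be sketched on the classical XOR-type dataset, which no line in $\mathbb{R}^2$ separates. The map `feature_map` below (appending the product $x_1 x_2$) and the weight vector `w` are illustrative assumptions, not choices made in this note.

```python
def feature_map(x):
    """Hypothetical map Phi: R^2 -> R^3 that appends the product x1 * x2."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def sign(t):
    return 1 if t > 0 else -1


# XOR-type labelling: y = +1 when both coordinates share a sign, else -1.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# In the feature space, the hyperplane w . z + b = 0 with w = (0, 0, 1), b = 0
# only looks at the third coordinate x1 * x2.
w, b = (0.0, 0.0, 1.0), 0.0
predictions = [
    sign(sum(wi * zi for wi, zi in zip(w, feature_map(x))) + b) for x in points
]
```

Here `predictions` coincides with `labels`, so the data become linearly separable after the mapping.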

> A **kernel over $\mathcal{X}$** is a function $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$.

The idea is to choose a kernel such that $K(x, x^{\prime})=\langle\Phi(x), \Phi(x^{\prime})\rangle$ for some feature map $\Phi$. However, **not every kernel admits such a representation**. Therefore, we define

> A **positive definite symmetric (PDS) kernel** is a kernel $K$ such that for any $\{x_1,\ldots,x_m\} \subset \mathcal{X}$, the kernel (Gram) matrix $[K(x_i,x_j)]_{ij} \in \mathbb{R}^{m \times m}$ is symmetric positive semidefinite (SPSD).

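For example, the following kernels are commonly used. Polynomial kernels with $c \geq 0$ and $d \in \mathbb{N}$ and Gaussian kernels are PDS, whereas sigmoid kernels are not PDS for every choice of $a$ and $b$.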

> **Polynomial kernels:** $K\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\left(\mathbf{x} \cdot \mathbf{x}^{\prime}+c\right)^{d}$.
>
> **Gaussian kernels:** $K\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\frac{\left\|\mathbf{x}^{\prime}-\mathbf{x}\right\|^{2}}{2 \sigma^{2}}\right)$.
>
> **Sigmoid kernels:** $K\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\tanh \left(a\left(\mathbf{x} \cdot \mathbf{x}^{\prime}\right)+b\right)$.
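
The PDS condition can be probed numerically. The sketch below implements the three kernels with the standard library and evaluates the quadratic form $\sum_{i,j} c_i c_j K(x_i, x_j)$, which must be nonnegative for an SPSD Gram matrix. The sample points, the parameter values ($c=1$, $d=2$, $\sigma=1$, $a=1$, $b=0$) and the helper names are assumptions chosen for illustration. Random probing only checks a necessary condition and does not prove that a kernel is PDS.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, c=1.0, d=2):
    return (dot(x, y) + c) ** d


def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def sigmoid_kernel(x, y, a=1.0, b=0.0):
    return math.tanh(a * dot(x, y) + b)


def gram_matrix(kernel, points):
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(G, c):
    """Return sum_{i,j} c_i c_j G_ij, nonnegative for every c when G is SPSD."""
    m = len(c)
    return sum(c[i] * c[j] * G[i][j] for i in range(m) for j in range(m))


random.seed(0)
points = [[random.gauss(0, 1) for _ in range(2)] for _ in range(6)]
probes = [[random.gauss(0, 1) for _ in range(6)] for _ in range(200)]

G_poly = gram_matrix(polynomial_kernel, points)
G_gauss = gram_matrix(gaussian_kernel, points)
min_form_poly = min(quadratic_form(G_poly, c) for c in probes)
min_form_gauss = min(quadratic_form(G_gauss, c) for c in probes)

# Counterexample for the sigmoid kernel with a = 1, b = 0 on the points 1 and 2:
# the form equals tanh(1) + tanh(4) - 2 * tanh(2), which is negative.
G_sigmoid = gram_matrix(sigmoid_kernel, [[1.0], [2.0]])
sigmoid_form = quadratic_form(G_sigmoid, [1.0, -1.0])
```

The polynomial and Gaussian forms stay nonnegative up to rounding, while `sigmoid_form` is negative, so this particular sigmoid kernel is not PDS.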


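The **main result** of this note is that every PDS kernel is an inner product in some feature space. Its proof relies on the following lemma, a Cauchy-Schwarz inequality for PDS kernels:

> $K\left(x, x^{\prime}\right)^{2} \leq K(x, x) K\left(x^{\prime}, x^{\prime}\right)$ for all $x, x^{\prime} \in \mathcal{X}$.

The lemma holds because the matrix

$$\begin{pmatrix}
K(x, x) & K\left(x, x^{\prime}\right) \\
K\left(x^{\prime}, x\right) & K\left(x^{\prime}, x^{\prime}\right)
\end{pmatrix}$$

is SPSD, so its determinant $K(x, x) K\left(x^{\prime}, x^{\prime}\right)-K\left(x, x^{\prime}\right)^{2}$ is nonnegative. The main result reads: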

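> If $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a PDS kernel, then there exist a Hilbert space $\mathbb{H}$ and a mapping $\Phi: \mathcal{X} \rightarrow \mathbb{H}$ such that $K\left(x, x^{\prime}\right)=\left\langle\Phi(x), \Phi\left(x^{\prime}\right)\right\rangle$ for all $x, x^{\prime} \in \mathcal{X}$.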
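
As a concrete instance of this statement, the degree-2 polynomial kernel on $\mathbb{R}^2$ admits an explicit six-dimensional feature map. The sketch below assumes $c = 1$ and two arbitrary test points; the function names are illustrative.

```python
import math


def polynomial_kernel(x, y, c=1.0):
    """K(x, y) = (x . y + c)^2 on R^2."""
    return (x[0] * y[0] + x[1] * y[1] + c) ** 2


def feature_map(x, c=1.0):
    """Explicit Phi: R^2 -> R^6 with Phi(x) . Phi(y) = (x . y + c)^2."""
    x1, x2 = x
    return [
        x1 * x1,
        x2 * x2,
        math.sqrt(2) * x1 * x2,
        math.sqrt(2 * c) * x1,
        math.sqrt(2 * c) * x2,
        c,
    ]


x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = polynomial_kernel(x, y)
feature_inner = sum(u * v for u, v in zip(feature_map(x), feature_map(y)))
```

Expanding $(x_1 y_1 + x_2 y_2 + c)^2$ term by term gives exactly the six products summed in `feature_inner`, so the two values agree up to rounding.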

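**Proof.** For each $x \in \mathcal{X}$, define the function $\Phi(x): \mathcal{X} \rightarrow \mathbb{R}$ by $\Phi(x)\left(x^{\prime}\right)=K\left(x, x^{\prime}\right)$, and let $\mathbb{H}_{0}=\left\{\sum_{i \in I} a_{i} \Phi\left(x_{i}\right): a_{i} \in \mathbb{R}, x_{i} \in \mathcal{X},|I|<\infty\right\}$ be the set of finite linear combinations of such functions. Next, we introduce an operation on $\mathbb{H}_{0}$, namely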

$$\langle f, g\rangle=\sum_{i \in I, j \in J} a_{i} b_{j} K\left(x_{i}, x_{j}^{\prime}\right) = \sum_{j \in J} b_{j} f (x_{j}') = \sum_{i \in I} a_{i} g\left(x_{i}\right)$$
with $f=\sum_{i \in I} a_{i} \Phi\left(x_{i}\right)$ and $g=\sum_{j \in J} b_{j} \Phi\left(x_{j}^{\prime}\right)$. The last two expressions show that $\langle f, g\rangle$ depends only on the functions $f$ and $g$, not on their particular representations, so the operation is well-defined. It is then routine to show that $\langle\cdot, \cdot\rangle$ is symmetric, bilinear, and a **PDS kernel** on $\mathbb{H}_{0}$. The only **nontrivial** step is to show that $\langle f, f\rangle=0$ iff $f=0$. For every $x \in \mathcal{X}$, we have

$$ f(x)^2=\bigg(\sum_{i \in I} a_{i} K\left(x_{i}, x\right)\bigg)^2 = \langle f, \Phi(x)\rangle^{2} \leq\langle f, f\rangle\langle\Phi(x), \Phi(x)\rangle. $$

The last inequality is the lemma applied to $\langle\cdot, \cdot\rangle$, which is itself a PDS kernel on $\mathbb{H}_{0}$, at the pair $(f, \Phi(x))$. Consequently, $\langle f, f\rangle=0$ implies $f(x)=0$ for all $x \in \mathcal{X}$, that is, $f=0$. Hence $\langle\cdot, \cdot\rangle$ is an inner product on $\mathbb{H}_{0}$, and $\left\langle\Phi(x), \Phi\left(x^{\prime}\right)\right\rangle=K\left(x, x^{\prime}\right)$ holds by definition. Taking $\mathbb{H}$ to be the completion of $\mathbb{H}_{0}$ gives the required Hilbert space. $\square$
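
The two identities used in the proof, $f(x)=\langle f, \Phi(x)\rangle$ and $f(x)^{2} \leq\langle f, f\rangle\langle\Phi(x), \Phi(x)\rangle$, can be checked numerically for one element of $\mathbb{H}_{0}$. The sketch below assumes a Gaussian kernel with $\sigma=1$, four random centers and an arbitrary evaluation point; all names are illustrative.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def evaluate(coeffs, centers, x, kernel=gaussian_kernel):
    """Value f(x) of f = sum_i coeffs[i] * Phi(centers[i])."""
    return sum(a * kernel(c, x) for a, c in zip(coeffs, centers))


def inner(coeffs_f, centers_f, coeffs_g, centers_g, kernel=gaussian_kernel):
    """<f, g> = sum_{i,j} a_i b_j K(x_i, x'_j)."""
    return sum(
        a * b * kernel(xf, xg)
        for a, xf in zip(coeffs_f, centers_f)
        for b, xg in zip(coeffs_g, centers_g)
    )


random.seed(1)
centers = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(4)]
coeffs = [random.gauss(0, 1) for _ in range(4)]
x = [0.3, -0.7]

f_at_x = evaluate(coeffs, centers, x)
# Phi(x) is the element of H_0 with the single coefficient 1 and center x.
f_dot_phi_x = inner(coeffs, centers, [1.0], [x])
norm_f_sq = inner(coeffs, centers, coeffs, centers)
k_xx = gaussian_kernel(x, x)  # equals <Phi(x), Phi(x)>
```

Up to rounding, `f_at_x` equals `f_dot_phi_x`, and `f_at_x ** 2` does not exceed `norm_f_sq * k_xx`.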






### 2. SVM Method <a id='SVM-Method'></a>

We consider a binary classification problem.

> Determine a **hypothesis** $h \in \mathcal{H}$ where
>
> $$ \mathcal{H}=\left\{\mathbf{x} \mapsto \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}+b): \mathbf{w} \in \mathbb{R}^{N}, b \in \mathbb{R}\right\}$$ 
>
> with small generalization error $R_{\mathcal{D}}(h)=\underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq f(x)]$, where $f$ denotes the target labeling function and $\mathcal{D}$ the distribution of the inputs.

Here's the outline: we first introduce the algorithm for **separable datasets**, then its general version for **non-separable datasets**, and finally provide a theoretical foundation for SVMs based on **margin**.

> **Def 1.** $S=\left(\left(\mathbf{x}_{1}, y_{1}\right), \ldots,\left(\mathbf{x}_{m}, y_{m}\right)\right)$ is **separable** if $\exists (\mathbf{w}, b) \in\left(\mathbb{R}^{N} \setminus \{\mathbf{0}\}\right) \times \mathbb{R}$ such that $\forall i \in[m]$, $y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 0$.

We can see that if there exists one hyperplane $\mathbf{w} \cdot \mathbf{x}+b=0$ separating $S$, then there exist infinitely many. Therefore, we need a criterion for selecting a unique one.

> Consider maximizing the margin $\rho$ of a separating hyperplane 
> 
> $$\rho=\max _{\substack{\mathbf{w}, b:\\ y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 0}} \min _{i} \frac{\left|\mathbf{w} \cdot \mathbf{x}_{i}+b\right|}{\|\mathbf{w}\|}=\max _{\mathbf{w}, b} \min _{i} \frac{y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)}{\|\mathbf{w}\|}=\max _{\substack{\mathbf{w}, b: \\ \min _{i} y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)=1}} \frac{1}{\|\mathbf{w}\|}=\max _{\substack{\mathbf{w}, b: \\ \forall i, y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 1}} \frac{1}{\|\mathbf{w}\|}.$$
>
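> Since maximizing $\frac{1}{\|\mathbf{w}\|}$ is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^{2}$, the last optimization problem above is equivalent to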
>
> $$\begin{aligned}
\min _{\mathbf{w}, b} & \quad \frac{1}{2}\|\mathbf{w}\|^{2} \\
\text { subject to: } & \quad y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \geq 1, \forall i \in \{1,2,...,m\}.
\end{aligned}$$
>
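
The rescaling step behind the chain of equalities above can be sketched on a toy dataset. The four points, the two candidate hyperplanes and the helper names below are assumptions made for illustration. The code compares the geometric margins of two separating hyperplanes and rescales one of them to the canonical form $\min_i y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$, where the margin equals $1/\|\mathbf{w}\|$.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(w):
    return math.sqrt(dot(w, w))


def functional_margins(w, b, X, y):
    """Values y_i (w . x_i + b); all positive for a strictly separating hyperplane."""
    return [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]


def geometric_margin(w, b, X, y):
    """Distance from the hyperplane to the closest sample point."""
    return min(functional_margins(w, b, X, y)) / norm(w)


def canonical_form(w, b, X, y):
    """Rescale (w, b) so that min_i y_i (w . x_i + b) = 1."""
    scale = min(functional_margins(w, b, X, y))
    return [wi / scale for wi in w], b / scale


X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]

# Two different hyperplanes that both separate the sample.
w_a, b_a = [3.0, 3.0], 0.0
w_b, b_b = [1.0, 0.0], 0.0

margin_a = geometric_margin(w_a, b_a, X, y)
margin_b = geometric_margin(w_b, b_b, X, y)

# Rescaling does not move the hyperplane, so the margin is unchanged,
# but in canonical form it can be read off as 1 / ||w||.
w_c, b_c = canonical_form(w_a, b_a, X, y)
canonical_margin = 1.0 / norm(w_c)
```

Both hyperplanes separate the sample, but the first has the larger margin. For this dataset the margin cannot exceed half the distance between $(1,1)$ and $(-1,-1)$, which is $\sqrt{2}$, so the first hyperplane attains the maximum.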

Some basic facts about convex optimization are recalled in [Basic Convex Optimization](#Basic-Convex-Optimization) below.















### 3. Basic Convex Optimization <a id='Basic-Convex-Optimization'></a>

A **general constrained optimization problem** has the form:

>$$\begin{aligned}
\min _{\mathbf{x} \in X} & \quad f(\mathbf{x}) \\
\text { subject to:} & \quad g_{i}(\mathbf{x}) \leq 0, \forall i \in\{1, \ldots, m\} .
\end{aligned}$$

Then the **Lagrangian (Lagrange function)** associated with this problem is defined by
$$\forall \mathbf{x} \in X, \forall \boldsymbol{\alpha} \geq 0, \quad \mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})=f(\mathbf{x})+\sum_{i=1}^{m} \alpha_{i} g_{i}(\mathbf{x}).$$
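
As a worked instance, the separable SVM problem of Section 2 fits this template. Take the variable to be $(\mathbf{w}, b)$, the objective $f(\mathbf{w}, b)=\frac{1}{2}\|\mathbf{w}\|^{2}$, and the constraints $g_{i}(\mathbf{w}, b)=1-y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right) \leq 0$. Rewriting the constraints in this form is the only step not spelled out in the note. The Lagrangian then reads

$$\forall \mathbf{w} \in \mathbb{R}^{N}, b \in \mathbb{R}, \boldsymbol{\alpha} \geq 0, \quad \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})=\frac{1}{2}\|\mathbf{w}\|^{2}-\sum_{i=1}^{m} \alpha_{i}\left[y_{i}\left(\mathbf{w} \cdot \mathbf{x}_{i}+b\right)-1\right].$$

At any feasible point, each term $\alpha_{i} g_{i}$ is nonpositive, so $\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) \leq \frac{1}{2}\|\mathbf{w}\|^{2}$ there.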

We remark that 







