## [Sep 14] Basis of Machine Learning I

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

### Content 

1. [Basic Definitions](#Basic-Definitions)
2. [More General Definitions](#More-General-Definitions)

----

### 1. Basic Definitions <a id='Basic-Definitions'></a>

Some definitions are needed to understand the whole framework. $\mathcal{X}$ is the set of all possible examples (instances), and $\mathcal{Y}$ is the set of all possible labels (target values). For simplicity, we take $\mathcal{Y}=\{0,1\}$.

> **Def 1.**  A **concept** is a mapping $c: \mathcal{X} \rightarrow \mathcal{Y}$.

A concept class, denoted by $\mathcal{C}$, is a set of concepts we may wish to learn. The candidate concepts that the learner considers form the hypothesis set, denoted by $\mathcal{H}$, which need not coincide with $\mathcal{C}$.

When we learn some $c \in \mathcal{C}$, we receive a sample $S=\left(x_{1}, \ldots, x_{m}\right)$ drawn i.i.d. according to a fixed but unknown distribution $\mathcal{D}$, together with the labels $c(x_{1}), \ldots, c(x_{m})$.

The ultimate goal is to select a hypothesis that minimizes the generalization error, defined as follows.

> **Def 2.** (Generalization error) The generalization error (risk) of  $h\in\mathcal{H}$  is defined by
>
>$$R(h)=\underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq c(x)]=\underset{x \sim \mathcal{D}}{\mathbb{E}}\left[1_{h(x) \neq c(x)}\right].$$

Also, we may define the empirical error of $h$ on the sample $S$:

$$\widehat{R}_{S}(h)=\frac{1}{m} \sum_{i=1}^{m} 1_{h\left(x_{i}\right) \neq c\left(x_{i}\right)}.$$

We will see in the following a number of guarantees relating these two quantities with high probability, under some general assumptions.
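
To make the two quantities concrete, here is a minimal Python sketch that computes the empirical error on a simulated sample. The setup is an assumption chosen only for illustration: $\mathcal{X}=[0,1]$ with $\mathcal{D}$ uniform, and threshold functions for both the target `c` and the hypothesis `h`, so that the true risk is known to be $0.1$.

```python
import random


def empirical_error(h, c, sample):
    """Fraction of points in `sample` on which hypothesis h disagrees with concept c."""
    return sum(h(x) != c(x) for x in sample) / len(sample)


# Hypothetical target concept: label 1 iff x >= 0.5.
def c(x):
    return 1 if x >= 0.5 else 0


# Hypothetical hypothesis: label 1 iff x >= 0.6.
def h(x):
    return 1 if x >= 0.6 else 0


rng = random.Random(0)                      # fixed seed for reproducibility
sample = [rng.random() for _ in range(10_000)]  # i.i.d. draws from uniform [0, 1)
r_hat = empirical_error(h, c, sample)

# Under the uniform distribution, h and c disagree exactly on [0.5, 0.6), so R(h) = 0.1.
print(f"empirical error: {r_hat:.4f}  (true risk: 0.1)")
```

With a large sample the empirical error should be close to the true risk, which is the behaviour the guarantees below quantify.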

> **Remark:** this reminds me of the definition of a source code $C$ in information theory, which is a mapping from $\mathcal{X}$, the range of a random variable $X$, to $D^{*}$, the set of finite-length strings of symbols from a $D$-ary alphabet.

---

The following introduces the Probably Approximately Correct (PAC) learning framework, where the hypothesis $h_S \in \mathcal{H}$ is returned by the algorithm $\mathcal{A}$ given the sample $S$.

> **Def 3.** (PAC-learning) $\mathcal{C}$ is PAC-learnable if there exist an algorithm $\mathcal{A}$ and a polynomial function $\operatorname{poly}(\cdot, \cdot, \cdot, \cdot)$ such that for any $\epsilon>0$ and $\delta>0$, for any distribution $\mathcal{D}$ on $\mathcal{X}$ and any target concept $c\in \mathcal{C}$, the following holds for any sample size $m \geq \operatorname{poly}(1 / \epsilon, 1 / \delta, n, \operatorname{size}(c))$:
>
>$$\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[R\left(h_{S}\right) \leq \epsilon\right] \geq 1-\delta.$$
>
> Here $n$ is an upper bound on the cost of representing an element of $\mathcal{X}$, and $\operatorname{size}(c)$ is the cost of representing $c$.

When $\# \mathcal{H}<\infty$ and the algorithm is consistent, we have the following guarantee.

> **Thm 1.** ( $\# \mathcal{H}<\infty$, consistent ) Suppose the algorithm $\mathcal{A}$ is such that, for any target concept $c\in\mathcal{H}$ and any i.i.d. sample $S$, the returned hypothesis satisfies $\widehat{R}_{S}\left(h_{S}\right)=0$. Then for any $\epsilon, \delta>0$, the inequality
>
> $$\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[R\left(h_{S}\right) \leq \epsilon\right] \geq 1-\delta$$
>
> holds if
>
> $$ m \geq \frac{1}{\epsilon}\left(\log \#\mathcal{H}+\log \frac{1}{\delta}\right).$$

**Proof.** Define $\mathcal{H}_{\epsilon}=\{h \in \mathcal{H}: R(h)>\epsilon\}$. For a fixed $h\in\mathcal{H}_{\epsilon}$, each of the $m$ i.i.d. sample points is classified correctly with probability $1-R(h)<1-\epsilon$, since $R(h)=\underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq c(x)]>\epsilon$. Hence $\mathbb{P}[\widehat{R}_{S}(h)=0] \leq(1-\epsilon)^{m}$. Thus, by the union bound, the following holds:

$$\begin{aligned}
\mathbb{P}\left[\exists h \in \mathcal{H}_{\epsilon}: \widehat{R}_{S}(h)=0\right] & =\mathbb{P}\left[\widehat{R}_{S}\left(h_{1}\right)=0 \vee \cdots \vee \widehat{R}_{S}\left(h_{\#\mathcal{H}_{\epsilon}}\right)=0\right]\\
& \leq \sum_{h \in \mathcal{H}_{\epsilon}} \mathbb{P}\left[\widehat{R}_{S}(h)=0\right] \\
& \leq \sum_{h \in \mathcal{H}_{\epsilon}}(1-\epsilon)^{m} \leq|\mathcal{H}|(1-\epsilon)^{m} \leq|\mathcal{H}| e^{-m \epsilon} .
\end{aligned}$$

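Since $h_{S}$ is consistent, the event $R(h_{S})>\epsilon$ implies that some $h \in \mathcal{H}_{\epsilon}$ satisfies $\widehat{R}_{S}(h)=0$, so $\mathbb{P}[R(h_{S})>\epsilon] \leq|\mathcal{H}| e^{-m \epsilon}$. Requiring the right-hand side to be at most $\delta$ and solving for $m$ gives the claim.

Equivalently, in the language of generalization bounds, with probability at least $1-\delta$, $R(h_{S}) \leq \frac{1}{m}(\log \#\mathcal{H}+\log \frac{1}{\delta})$.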
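
As a worked example of the consistent-case bound, the following Python sketch evaluates both forms: the sample size sufficient for a given $(\epsilon, \delta)$, and the risk bound obtained for a given $m$. The function names and the numbers ($\#\mathcal{H}=2^{20}$, $\epsilon=0.05$, $\delta=0.01$) are illustrative assumptions, not values from the text.

```python
import math


def sample_complexity_consistent(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (log|H| + log(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)


def risk_bound_consistent(h_size, m, delta):
    """Upper bound on R(h_S) that holds with probability at least 1 - delta."""
    return (math.log(h_size) + math.log(1.0 / delta)) / m


# Hypothetical finite hypothesis set with 2**20 elements.
m = sample_complexity_consistent(h_size=2**20, epsilon=0.05, delta=0.01)
print("sufficient sample size:", m)
print("risk bound at that m:  ", risk_bound_consistent(2**20, m, 0.01))
```

Note that the dependence on $\#\mathcal{H}$ is only logarithmic, so squaring the size of the hypothesis set merely doubles the $\log \#\mathcal{H}$ term.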

---

> **Lemma 1.** (Hoeffding's inequality) Let $X_{1}, \ldots, X_{m}$ be independent random variables with $X_{i}\in \left[a_{i}, b_{i}\right]$ for all $i \in[m]$, and let $S_{m}=\sum_{i=1}^{m} X_{i}$. Then, for any $\epsilon>0$,
>
>$$\begin{aligned}
& \mathbb{P}\left[S_{m}-\mathbb{E}\left[S_{m}\right] \geq \epsilon\right] \leq e^{-2 \epsilon^{2} / \sum_{i=1}^{m}\left(b_{i}-a_{i}\right)^{2}}, \\
& \mathbb{P}\left[S_{m}-\mathbb{E}\left[S_{m}\right] \leq-\epsilon\right] \leq e^{-2 \epsilon^{2} / \sum_{i=1}^{m}\left(b_{i}-a_{i}\right)^{2}}. \end{aligned} $$
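
Hoeffding's inequality can be sanity-checked by simulation. The sketch below is a Monte Carlo experiment under assumed parameters: $S_m$ is taken to be the sum of $m=100$ independent Bernoulli($0.5$) variables, so $a_i=0$, $b_i=1$, and the upper-tail frequency at $\epsilon=10$ is compared with the bound $e^{-2\epsilon^2/m}$. The function names are hypothetical.

```python
import math
import random


def hoeffding_bound(epsilon, ranges):
    """One-sided Hoeffding bound exp(-2 eps^2 / sum_i (b_i - a_i)^2)."""
    return math.exp(-2.0 * epsilon**2 / sum((b - a) ** 2 for a, b in ranges))


def simulate_upper_tail(m, p, epsilon, trials, seed=0):
    """Monte Carlo estimate of P[S_m - E[S_m] >= epsilon], S_m a sum of m Bernoulli(p)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(m))  # one realization of S_m
        if s - m * p >= epsilon:
            hits += 1
    return hits / trials


m, p, eps = 100, 0.5, 10.0
bound = hoeffding_bound(eps, [(0.0, 1.0)] * m)
freq = simulate_upper_tail(m, p, eps, trials=20_000)
print(f"empirical tail frequency: {freq:.4f}")
print(f"Hoeffding bound:          {bound:.4f}")
```

The observed frequency should sit below the bound; the gap reflects that Hoeffding's inequality uses only the ranges of the variables and not their variances.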

Using **Lemma 1**, we can derive

> **Thm 2.** ( $\# \mathcal{H}<\infty$, inconsistent ) For any $\delta>0$, with probability at least $1-\delta$, the following holds for all $h\in \mathcal{H}$:
>
>$$ R(h) \leq \widehat{R}_{S}(h)+\sqrt{\frac{\log |\mathcal{H}|+\log \frac{2}{\delta}}{2 m}}.$$

Before the proof, note that this can be viewed as an instance of the so-called **Occam’s Razor principle**: all other things being equal, a simpler (smaller) hypothesis set is better, since the bound grows with $\log |\mathcal{H}|$.

**Proof.** By the union bound,

$$\begin{aligned}
& \mathbb{P}\left[\exists h \in \mathcal{H}: \left|\widehat{R}_{S}(h)-R(h)\right|>\epsilon\right] \\
= \text{ } & \mathbb{P}\left[\left(\left|\widehat{R}_{S}\left(h_{1}\right)-R\left(h_{1}\right)\right|>\epsilon\right) \vee \ldots \vee\left(\left|\widehat{R}_{S}\left(h_{|\mathcal{H}|}\right)-R\left(h_{|\mathcal{H}|}\right)\right|>\epsilon\right)\right] \\
\leq \text{ } & \sum_{h \in \mathcal{H}} \mathbb{P}\left[\left|\widehat{R}_{S}(h)-R(h)\right|>\epsilon\right] \\
\leq \text{ } & 2|\mathcal{H}| \exp \left(-2 m \epsilon^{2}\right),
\end{aligned}$$

where the last inequality applies **Lemma 1** to each fixed $h$: $\widehat{R}_{S}(h)$ is the average of $m$ independent indicator variables taking values in $[0,1]$ with mean $R(h)$, so each of the two tails is at most $\exp(-2 m \epsilon^{2})$. Setting the right-hand side equal to $\delta$ and solving for $\epsilon$ yields the stated bound.
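
To see how the inconsistent-case bound behaves numerically, the following Python sketch evaluates the gap term $\sqrt{(\log|\mathcal{H}|+\log\frac{2}{\delta})/(2m)}$ and inverts $2|\mathcal{H}|\exp(-2m\epsilon^2)\leq\delta$ for $m$. The function names and the choice $|\mathcal{H}|=1000$, $\delta=0.05$ are assumptions made for illustration.

```python
import math


def generalization_gap(h_size, m, delta):
    """Gap term sqrt((log|H| + log(2/delta)) / (2m)) bounding R(h) - R_hat_S(h)."""
    return math.sqrt((math.log(h_size) + math.log(2.0 / delta)) / (2.0 * m))


def sample_size_for_gap(h_size, epsilon, delta):
    """Smallest integer m with 2|H| exp(-2 m eps^2) <= delta."""
    return math.ceil((math.log(h_size) + math.log(2.0 / delta)) / (2.0 * epsilon**2))


# The gap shrinks at rate 1/sqrt(m), slower than the 1/m rate of the consistent case.
for m in (100, 1_000, 10_000):
    print(m, round(generalization_gap(1000, m, 0.05), 4))

m_needed = sample_size_for_gap(1000, epsilon=0.05, delta=0.05)
print("m needed for gap <= 0.05:", m_needed)
```

Comparing with the consistent case, achieving accuracy $\epsilon$ now requires on the order of $1/\epsilon^2$ samples rather than $1/\epsilon$.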

--- 

### 2. More General Definitions <a id='More-General-Definitions'></a>

In the most general scenario of supervised learning, we consider a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, i.e. the stochastic scenario. Then, we define 

> $$R(h)=\underset{(x, y) \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq y]=\underset{(x, y) \sim \mathcal{D}}{\mathbb{E}}\left[1_{h(x) \neq y}\right]$$

and the Bayes error, the infimum of the risk over all measurable hypotheses:

> $$R^{\star}=\inf _{h \text { measurable }} R(h).$$
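
When $\mathcal{X}$ is finite and $\mathcal{Y}=\{0,1\}$, the infimum is attained by predicting, for each $x$, the label with the larger joint probability, which gives $R^{\star}=\sum_{x} \min\{\mathbb{P}[x, 0], \mathbb{P}[x, 1]\}$. The Python sketch below checks this on a small joint distribution; the distribution and all function names are hypothetical and chosen only for illustration.

```python
def risk(h, joint):
    """R(h) = P[h(x) != y] for a joint pmf given as {(x, y): probability}."""
    return sum(p for (x, y), p in joint.items() if h(x) != y)


def bayes_classifier(joint):
    """For each x, predict the label carrying the larger joint mass."""
    xs = {x for x, _ in joint}
    table = {x: int(joint.get((x, 1), 0.0) > joint.get((x, 0), 0.0)) for x in xs}
    return lambda x: table[x]


def bayes_error(joint):
    """R* = sum over x of min(P[x, 0], P[x, 1]) for binary labels."""
    xs = {x for x, _ in joint}
    return sum(min(joint.get((x, 0), 0.0), joint.get((x, 1), 0.0)) for x in xs)


# Hypothetical joint distribution over X = {"a", "b"} and Y = {0, 1}.
joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.3}

h_star = bayes_classifier(joint)
print("risk of Bayes classifier:", risk(h_star, joint))
print("Bayes error:             ", bayes_error(joint))
```

In this stochastic scenario the Bayes error is nonzero because the label is not a deterministic function of $x$, so no hypothesis can achieve zero risk.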

### Reference

1. Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning. The MIT Press.