## Towards formalizing ‘learning’

### The basic process of learning
- Observe a phenomenon
- Construct a model from observations
- Use that model to make decisions/predictions

### A statistical machinery for learning

Phenomenon of interest:

- Input space: $X$; Output space: $Y$
- There is an unknown distribution $D$ over $(X,Y)$
- The learner observes $m$ examples $(x_1,y_1),\dots,(x_m,y_m)$ drawn independently from $D$

Construct a model:

- Let $F$ be a collection of models, where each $f: X \rightarrow Y$ predicts $y$ given $x$
- From the $m$ observations, select a model $f_m \in F$ that predicts well.
- Generalization error of $f$:
  $$\mathrm{err}(f):=\mathbb{P}_{(x,y)\sim D}\left[f(x)\ne y\right]$$
  Notice that this error is computed over the whole distribution $D$, not only over the observed sample.
- We can say that we have learned a phenomenon if 
  $$\mathrm{err}(f_m)-\mathrm{err}(f^*)\le \epsilon \quad f^*:=\argmin_{f \in F}\mathrm{err}(f)$$
  for any tolerance level $\epsilon$ of our choice.

Suppose there exists a model selection algorithm $\mathcal{A}$ that selects $f_m^\mathcal{A} \in \mathcal{F}$ from $m$ observations, i.e. 
 
 - $$\mathcal{A}:(x_i,y_i)_{i=1}^m \mapsto f_m^\mathcal{A}$$
 
 - such that, for all tolerance levels $\epsilon > 0$ and all confidence levels $\delta > 0$, given enough observations $m$,
   $$\mathrm{err}(f_m^\mathcal{A})-\mathrm{err}(f^*)\le \epsilon$$
   with probability at least $1-\delta$ over the draw of the sample.

Then we say:

- The model class $\mathcal{F}$ is __PAC-learnable__ (Probably Approximately Correct).
- If the required $m$ is polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$, then $\mathcal{F}$ is __efficiently PAC-learnable__.

A popular algorithm:

- Empirical risk minimization (ERM) algorithm.
  $$f_m^{\text{ERM}}:=\argmin_{f \in \mathcal{F}}\frac{1}{m}\sum_{i=1}^m\mathbf{1}\{f(x_i)\ne y_i\}$$
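
The ERM rule above can be sketched for a small finite model class. This is a minimal illustration, not part of the original notes: the threshold classifiers, the toy data, and the names `make_threshold`, `empirical_error`, and `erm` are all assumptions chosen for the example.

```python
import random

def empirical_error(f, sample):
    """Fraction of examples (x, y) in the sample that f misclassifies."""
    return sum(f(x) != y for x, y in sample) / len(sample)

def erm(model_class, sample):
    """Return a model from a finite class with the smallest empirical error."""
    return min(model_class, key=lambda f: empirical_error(f, sample))

def make_threshold(t):
    """Hypothetical model: predict 1 when x >= t, else 0."""
    return lambda x: int(x >= t)

# Finite model class F: thresholds 0.0, 0.1, ..., 1.0 (an illustrative choice).
thresholds = [k / 10 for k in range(11)]
model_class = [make_threshold(t) for t in thresholds]

# Toy sample: x uniform on [0, 1], true label is 1 exactly when x >= 0.5.
rng = random.Random(0)
sample = [(x, int(x >= 0.5)) for x in (rng.random() for _ in range(200))]

f_erm = erm(model_class, sample)
print(empirical_error(f_erm, sample))
```

Because the labelling rule itself belongs to this toy class, ERM can reach zero sample error here; in general only the gap to $f^*$ is controlled.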

## PAC Learning Simple Model Classes

__Theorem (finite size $\mathcal{F}$ ):__

- Pick any tolerance level $\epsilon > 0$ and any confidence level $\delta > 0$. Let $(x_1,y_1),\dots,(x_m,y_m)$ be $m$ examples drawn from an unknown $\mathcal{D}$. If $\displaystyle m \ge C \cdot \frac{1}{\epsilon^2}\ln\frac{\lvert \mathcal{F}\rvert}{\delta}$, then with probability at least $1- \delta$
$$\mathrm{err}(f_m^\mathrm{ERM})-\mathrm{err}(f^*)\le \epsilon$$
$$\boxed{\mathcal{F}\text{ is efficiently PAC-learnable}}$$

- Proof Sketch
  - Define Generalization error of $f$
    $$\text{err}(f):=\mathbb{E}_{(x,y)\sim \mathcal{D}}\left[\mathbf{1}\{f(x)\ne y\}\right]$$
  - Define sample error of $f$
    $$\text{err}_m(f):=\frac{1}{m}\sum_{i=1}^m\left[\mathbf{1}\{f(x_i)\ne y_i\}\right]$$
  Fix any $f \in \mathcal{F}$ and a sample point $(x_i,y_i)$, and define the random variable 
  $$\mathbf{Z}_i^f=\mathbf{1}\{f(x_i)\ne y_i\}$$
  Now we can rewrite the generalization error and the sample error as follows:
  - Generalization error of $f$
    $$\text{err}(f)=\mathbb{E}_{(x_1,y_1)\sim \mathcal{D}}\left[\mathbf{Z}_1^f\right]$$
  - Sample error of $f$
    $$\text{err}_m(f)=\frac{1}{m}\sum_{i=1}^m \mathbf{Z}_i^f$$

### __Lemma (Chernoff-Hoeffding bound'63)__<br>
Let $\mathbf{Z}_1,\dots,\mathbf{Z}_m$ be $m$ Bernoulli random variables drawn independently from $\mathbf{B}(p)$. For any tolerance level $\epsilon > 0$,
$$\mathbb{P}_{\mathbf{Z}_i}\left[\left\lvert \frac{1}{m}\sum_{i=1}^m \mathbf{Z}_i - \mathbb{E}\left[\mathbf{Z}_1\right]\right\rvert \ge \epsilon\right] \le 2e^{-2\epsilon^2 m}$$
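
The lemma can be checked empirically with a small simulation. The sketch below is illustrative only: the values of $p$, $m$, $\epsilon$, the number of trials, and the helper names are assumptions, and a simulation gives evidence for the bound rather than a proof.

```python
import math
import random

def hoeffding_bound(eps, m):
    """Right-hand side of the lemma: 2 * exp(-2 * eps^2 * m)."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * m)

def deviation_frequency(p, m, eps, trials, rng):
    """Estimate P[|mean of m Bernoulli(p) draws - p| >= eps] by simulation."""
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(m)) / m
        if abs(mean - p) >= eps:
            hits += 1
    return hits / trials

rng = random.Random(0)
p, m, eps = 0.5, 50, 0.15          # illustrative values, not from the notes
freq = deviation_frequency(p, m, eps, trials=2000, rng=rng)
bound = hoeffding_bound(eps, m)
print(f"empirical frequency = {freq:.4f}, Hoeffding bound = {bound:.4f}")
```

The observed frequency should sit well below the bound, which is expected: the bound holds for every $p$ and is not tight for any particular one.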

Analyze (union bound over $\mathcal{F}$, then the lemma applied to each fixed $f$):
$$\begin{align*}
&\mathbb{P}_{(x_i,y_i)}\left[\exists\, f\in\mathcal{F}:\ \left\lvert \frac{1}{m}\sum_{i=1}^m \mathbf{Z}_i^f - \mathbb{E}\left[\mathbf{Z}_1^f\right]\right\rvert \ge \epsilon\right] \\
&\qquad\quad \le \sum_{f\in\mathcal{F}} \mathbb{P}_{(x_i,y_i)}\left[\left\lvert \frac{1}{m}\sum_{i=1}^m \mathbf{Z}_i^f - \mathbb{E}\left[\mathbf{Z}_1^f\right]\right\rvert \ge \epsilon\right] \\
&\qquad\quad \le 2\lvert\mathcal{F}\rvert e^{-2\epsilon^2 m} \\
&\qquad\quad \le \delta \qquad \text{for } m \text{ large enough}
\end{align*}$$

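Equivalently, choosing $\displaystyle m \ge \frac{1}{2 \epsilon ^2}\ln\frac{2\lvert\mathcal{F}\rvert}{\delta}$ guarantees that, with probability at least $1-\delta,$ for __all__ $f \in \mathcal{F}$
$$\left\lvert \frac{1}{m}\sum_{i=1}^m \mathbf{Z}_i^f - \mathbb{E}\left[\mathbf{Z}_1^f\right]\right\rvert=\left\lvert \mathrm{err}_m(f)-\mathrm{err}(f)\right\rvert \le \epsilon$$
Since ERM minimizes the sample error, on this event $\mathrm{err}(f_m^\mathrm{ERM}) \le \mathrm{err}_m(f_m^\mathrm{ERM})+\epsilon \le \mathrm{err}_m(f^*)+\epsilon \le \mathrm{err}(f^*)+2\epsilon$, which gives the theorem after rescaling $\epsilon$.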
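
To get a feel for the numbers, the sample-size condition can be evaluated directly. This is a small sketch; the function names and the example values of $\epsilon$, $\delta$, and $\lvert\mathcal{F}\rvert$ are illustrative assumptions.

```python
import math

def finite_class_sample_size(eps, delta, class_size):
    """Smallest integer m with m >= ln(2|F|/delta) / (2 eps^2)."""
    return math.ceil(math.log(2 * class_size / delta) / (2 * eps ** 2))

def union_bound_failure(eps, m, class_size):
    """Union bound 2 |F| exp(-2 eps^2 m) on the probability of a bad sample."""
    return 2 * class_size * math.exp(-2 * eps ** 2 * m)

m = finite_class_sample_size(eps=0.1, delta=0.05, class_size=1000)
print(m)                                   # 530
print(union_bound_failure(0.1, m, 1000))   # at most 0.05
```

Note that the required $m$ grows only logarithmically in $\lvert\mathcal{F}\rvert$ and $1/\delta$, but quadratically in $1/\epsilon$.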
 

## Learning general concepts

### VC dimension 

VC dimension is also known as Vapnik-Chervonenkis dimension.

- Definition <br>
  We say that a model class $\mathcal{F}$ has VC dimension $d$ if $d$ is the size of the largest set of points $\{x_1,\dots,x_d\} \subset X$ such that for every possible labelling of $x_1,\dots,x_d$ there exists some $f \in \mathcal{F}$ that achieves that labelling (the set is then said to be shattered by $\mathcal{F}$).
  - Example: $\mathcal{F}=$ Linear classifier in $\mathbb{R}^2$<br><br>
    ![](CS5590_images/Acrobat_FqAMtysA4S.png)<br>
    $$\text{VC}(\mathcal{F})=3$$
    Notice that we cannot change the structure of the data once the points are placed. For example, on the left side the data is in the form of a triangle, and we cannot rearrange it to form a line. That is, the positions of the points are fixed, but we can change the labels as we want.
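
The shattering condition can be tested by brute force on small point sets. The sketch below is an illustration under stated assumptions: it uses a perceptron as the separability check (a swapped-in technique, not something from the notes), the point sets are made up, and the names `linearly_separable` and `shattered` are hypothetical. It checks one 3-point set and one 4-point set; it does not prove that no 4-point set can be shattered.

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron with a bias term: True if it finds a separating line.

    Labels are +1 or -1. The perceptron is guaranteed to converge only when
    the labelling is separable, so a False result after max_epochs is
    evidence of non-separability rather than a proof.
    """
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x0, x1), y in zip(points, labels):
            if y * (w0 * x0 + w1 * x1 + b) <= 0:   # misclassified or on the line
                w0, w1, b = w0 + y * x0, w1 + y * x1, b + y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered(points):
    """True if every +1/-1 labelling of the points is found separable."""
    return all(linearly_separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
square = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]

print(shattered(triangle))  # True: all 8 labellings are achievable
print(shattered(square))    # False: the XOR labelling is not separable
```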

### VC Theory

- __Theorem (Vapnik-Chervonenkis'71)__<br>
  Choose any tolerance level $\epsilon >0$ and any confidence level $\delta>0$. Let $(x_1,y_1),\dots,(x_m,y_m)$ be $m$ examples drawn from an unknown $\mathcal{D}$. <br>
  If $\displaystyle m>C \cdot \frac{\text{VC}(\mathcal{F})\ln(1/\delta)}{\epsilon^2},$ then with probability at least $1-\delta$
  $$\mathrm{err}(f_m^{\mathrm{ERM}})-\mathrm{err}(f^*)\le\epsilon$$
  $$\boxed{\mathcal{F} \text{ is efficiently PAC-learnable}}$$

### Tightness of VC Bound

__Theorem (VC lower bound)__<br>
Let $\mathcal{A}$ be any model selection algorithm that, given $m$ samples, returns a model from $\mathcal{F}$, that is, $\mathcal{A}:(x_i,y_i)_{i=1}^m \mapsto f_m^\mathcal{A}$.<br>
For all tolerance levels $0<\epsilon <1$ and all confidence levels $0<\delta<1/4,$ there exists a distribution $\mathcal {D}$ such that if $\displaystyle m \leq C \cdot \frac{\mathrm{VC}(\mathcal{F})}{\epsilon^2}$, then
$$\mathbb{P}_{(x_i,y_i)}\left[ \left \lvert \mathrm{err}(f_m^{\mathcal{A}})-\mathrm{err}(f^*) \right \rvert > \epsilon \right]> \delta$$

### Facts of VC dimension

- VC dimension:
    - A combinatorial concept to capture the true richness of $\mathcal{F}$
    - Often (but not always!) proportional to the degrees of freedom or the number of independent parameters in $\mathcal{F}$
- Other Observations
    - VC dimension of a model class fully characterizes its learning ability!
    - Results are agnostic to the underlying distribution.

## ERM

From the discussion above it may seem that the ERM algorithm is universally consistent. Not really though! <br>
The theorem below shows that, for a suitably chosen distribution, the error stays above a fixed level no matter which algorithm we use.<br>

- __Theorem (no free lunch, Devroye'82):__<br>
  Pick any sample size $m$, any algorithm $\mathcal{A}$, and any tolerance $\epsilon>0$. There exists a distribution $\mathcal {D}$ such that:
  $$\mathrm{err}(f_m^{\mathcal{A}})>1/2-\epsilon$$
  while the Bayes optimal error is $\displaystyle \min_f \mathrm{err}(f)=0$.

### Further 

- How to do model class selection? Structural risk results.
- Dealing with kernels: fat margin theory.
- Incorporating priors over the models: PAC-Bayes theory.
- Is it possible to get distribution-dependent bounds? Yes, via Rademacher complexity.
- How about regression? Similar results can be derived for nonparametric regression.

## Regression Formulation

$y \rightarrow$ True label <br>
$\hat y \rightarrow$ Predicted label <br>
$X \rightarrow$ Input data <br>
$L(\hat y,y):=\lvert \hat y-y \rvert \rightarrow$ Absolute error<br>
$L(\hat y,y):= (\hat y-y)^2 \rightarrow$ Squared error

A linear predictor can be defined by a slope $\vec w$ and an intercept $w_0$
$$\hat f(\vec x)=\vec w \cdot \vec x+ w_0$$

The parameters are chosen to minimize the expected loss
$$\min_{w,w_0} \mathbb{E}_{(\vec x,y)}[L(\hat f(\vec x),y)] $$
The intercept can be absorbed via lifting (append a constant $1$ to $\vec x$ and $w_0$ to $\vec w$), so the predictor can be written as 
$$\hat f(\vec x)=\vec w \cdot \vec x \tag{1}$$
and the objective becomes
$$\min_{w} \mathbb{E}_{(\vec x,y)}[L(\hat f(\vec x),y)] \tag{2}$$
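
The lifting step can be made concrete with a tiny example. This is a sketch with made-up numbers; the helper names `lift`, `predict_affine`, and `predict_lifted` are illustrative assumptions.

```python
def predict_affine(w, w0, x):
    """Linear predictor with an explicit intercept: w . x + w0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0

def lift(x):
    """Append a constant 1 so the intercept becomes an ordinary weight."""
    return list(x) + [1.0]

def predict_lifted(w_lifted, x_lifted):
    """Linear predictor without an intercept, as in equation (1)."""
    return sum(wj * xj for wj, xj in zip(w_lifted, x_lifted))

w, w0 = [2.0, -1.0], 0.5        # illustrative parameters
x = [3.0, 4.0]
w_lifted = w + [w0]             # the intercept is now the last weight

print(predict_affine(w, w0, x))            # 2.5
print(predict_lifted(w_lifted, lift(x)))   # 2.5
```

Both forms give the same prediction, so nothing is lost by working with the intercept-free form.
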
- Parametric Regressor: Here we assume a particular form of the regressor and goal is to learn the parameter which minimizes the loss.
- Non-Parametric Regressor: Here we do not assume any specific form of the regressor and the goal here is to learn the predictor directly from the input data so the error is minimized.

We want to find a linear predictor $\hat f$ of the form $(1)$ that minimizes the loss in $(2)$. <br>
We estimate the parameters by minimizing the corresponding loss on the training data (here with the squared error):
$$\begin{align*}
&\argmin_w \frac{1}{n}\sum_{i=1}^n L(\vec w\cdot \vec x_i, y_i)\\
=&\argmin_w \frac{1}{n}\sum_{i=1}^n(\vec w\cdot \vec x_i-y_i)^2\\
=&\argmin_w  \left\lVert \left\lbrack \begin{array}{c}
\dots X_1 \dots\\
\dots X_i \dots\\
\dots X_n \dots
\end{array}\right\rbrack \left\lbrack \begin{array}{c}
\;\\
w\\
\;
\end{array}\right\rbrack -\left\lbrack \begin{array}{c}
y_1 \\
y_i \\
y_n 
\end{array}\right\rbrack \right\rVert^2 \\
=&\argmin_w \left\lVert X \vec w - \vec y\right\rVert_2^2
\end{align*}$$
Notice that every 

$$\left\lbrack \begin{array}{c}
\dots X_i \dots\\
\\
\\
\end{array}\right\rbrack \left\lbrack \begin{array}{c}
\;\\
w\\
\;
\end{array}\right\rbrack$$
produces a single value as it is just a dot product.

This is an unconstrained problem, so we can take the gradient and examine the stationary points.
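
As a worked step (a sketch using only the symbols already defined), expanding the squared norm shows where the gradient of the objective comes from:

$$\begin{align*}
\left\lVert X \vec w - \vec y\right\rVert^2 &= (X\vec w-\vec y)^T(X\vec w-\vec y) = \vec w^TX^TX\vec w - 2\vec y^TX\vec w + \vec y^T\vec y \\
\frac{\partial}{\partial \vec w}\left\lVert X \vec w - \vec y\right\rVert^2 &= 2X^TX\vec w - 2X^T\vec y = 2X^T(X\vec w-\vec y)
\end{align*}$$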

$$\begin{align*}
&&\frac{\partial}{\partial \vec w} \left\lVert X \vec w - \vec y\right\rVert^2 &=0\\
&\Rightarrow& 2X^T(X\vec w-\vec y) &=0 \\
&\Rightarrow& X^T(X\vec w-\vec y) &=0 \\
&\Rightarrow& X^TX\vec w &=X^T\vec y \\
&\Rightarrow& \vec w &=(X^TX)^\dagger X^T\vec y \\
\end{align*}$$

Here $(\cdot)^\dagger$ denotes the pseudo-inverse.<br>
The resulting solution is called the ordinary least squares (OLS) solution:
$$\vec w_{ols} =(X^TX)^\dagger X^T\vec y $$
The solution is unique and stable when $X^TX$ is invertible, in which case the pseudo-inverse equals the ordinary inverse.
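
A dependency-free sketch of the normal equations for a one-feature problem with a lifted intercept is given below. It assumes $X^TX$ is invertible and solves the resulting $2 \times 2$ system by Cramer's rule, so it does not cover the pseudo-inverse case; the data and the name `ols_two_params` are illustrative assumptions. In practice a linear algebra library would be used instead.

```python
def ols_two_params(X, y):
    """Solve the normal equations X^T X w = X^T y for an n x 2 matrix X.

    Assumes X^T X is invertible; the 2 x 2 system is solved by Cramer's rule.
    """
    # Entries of the symmetric matrix X^T X and of the vector X^T y.
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    p = sum(r[0] * yi for r, yi in zip(X, y))
    q = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X is singular; a pseudo-inverse would be needed")
    return [(p * d - b * q) / det, (a * q - b * p) / det]

# Toy data generated exactly by y = 2x + 1; each row is lifted to [x, 1].
xs = [0.0, 1.0, 2.0, 3.0]
X = [[x, 1.0] for x in xs]
y = [2.0 * x + 1.0 for x in xs]

w_ols = ols_two_params(X, y)
print(w_ols)  # [2.0, 1.0]
```

The recovered weights are the slope and the intercept of the generating line, as expected for noise-free data.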


<br><br>
Now consider the column space view of the data 
$$\mathbf X: \left\lbrack \begin{array}{c}
\dots X_1 \dots\\
\dots X_i \dots\\
\dots X_n \dots
\end{array}\right\rbrack \rightarrow \left\lbrack \begin{array}{c|c|c}
\ddot x_1  & \cdots  & \ddot x_d \\
\vdots  &  & \vdots \\
 &  & 
\end{array}\right\rbrack$$
Find a $w$ such that the residual between $\vec y$ and the linear combination of the columns of $X$ is minimized.

$$\frac{1}{n} \left\lVert \vec y-\sum_{i=1}^d w_i \ddot x_i \right\rVert:=\text{residual}$$
Let $\hat y$ be the fitted vector at the solution:
$$\hat y:=X\vec w_{ols}=\sum_{i=1}^d w_{ols,i}\ddot x_i$$

- Thus $\hat y$ is the orthogonal projection of $y$ onto $\text{span}(\ddot x_1,\dots,\ddot x_d)$
  $$\hat y = X \vec w_{ols}=\underbrace{X(X^TX)^\dagger X^T}_{\text{Projection matrix }\Pi} \vec y$$
    - The figure below shows the same:<br><br>
    ![](CS5590_images/mspaint_IDUg2yv0VR.png)
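
The projection property can be checked numerically: at the OLS solution, the residual $\vec y - \hat y$ should be orthogonal to every column of $X$. The sketch below uses two made-up columns and solves the $2 \times 2$ normal equations by Cramer's rule, assuming the columns are linearly independent; all names and data are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Toy data (illustrative): two columns of X, and a y that is not in their span.
col1 = [0.0, 1.0, 2.0, 3.0]
col2 = [1.0, 1.0, 1.0, 1.0]
y = [1.0, 3.0, 4.0, 8.0]

# Normal equations X^T X w = X^T y for two columns, solved by Cramer's rule.
a, b, d = dot(col1, col1), dot(col1, col2), dot(col2, col2)
p, q = dot(col1, y), dot(col2, y)
det = a * d - b * b                      # nonzero: the columns are independent
w = [(p * d - b * q) / det, (a * q - b * p) / det]

# y_hat is the linear combination of the columns with the OLS weights.
y_hat = [w[0] * c1 + w[1] * c2 for c1, c2 in zip(col1, col2)]
residual = [yi - yh for yi, yh in zip(y, y_hat)]

# Orthogonal projection: the residual is orthogonal to every column of X.
print(dot(residual, col1), dot(residual, col2))  # both approximately 0
```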

<br><br><br>
$\tiny  {\textcolor{#808080}{\boxed{\text{Reference: Dr. Vineeth, IIT Hyderabad }}}}$