# No free lunch theorems for supervised learning
   **Elvis Sikora** 
   
   elvis.sikora@gmail.com

**Contents:**
1. Informal discussion about theory and induction
2. Shwartz-David's NFLT + rudiments of PAC-Learning
3. Wolpert's NFLT
4. Further reading & Bibliography

## 1. Why should we even care? What is this all about?
**tl;dr:** _there's no universally superior learner, but this might not be super relevant in practice_

### Statistical Learning Theory (SLT)
SLT is about _fundamental_ rather than _practical_ questions.

#### The job of SLT is not to:
* provide tricks to accelerate training a neural net
* help us achieve SotA in any specific dataset or learning task
* make the results of learning algorithms more interpretable

#### The kind of questions SLT helps us address
* what is learning?
* is it even possible to learn from finite datasets?
* how (computationally) hard is learning?
* what fundamental tradeoffs are involved in learning?
* is there a universally (across all conceivable learning tasks) superior learner?

These questions are philosophical; there might be no _one correct answer_ to them. But SLT provides us with useful ways to reason about them.

We'll discuss two theorems that allow us to give a negative answer to the last question.

### Problem of induction

In machine learning terms: can we learn from finite data?

### A thought experiment

We are tasked with training an ML model to answer the question: will the sun rise on 1/jan next year?

Each morning, we get out of bed and take notes at our window:
1. on the first day, it was cold, and _the sun did rise_
2. on the second day, it was cold again, and _the sun did rise_
3. on the third day, it was warm, and _the sun did rise_
4. on the fourth day, it was cloudy, but we could nevertheless see that _the sun did rise_

...

We collect data for m = 365 days from 1/jan to 31/dec this year.
For all 365 examples in our _training set_, the sun did rise.

Can we be **sure** about what will happen on 1/jan next year?

### The depressing answer: no

There is no _logical_ reason to say the sun will rise the next day.

Anything could happen.

### The way out

We can fix this issue by introducing an _inductive bias_.
That is, we just assume the world works in a specified way.

### Two different inductive biases

Here are two examples of inductive biases we could embed in our learner:
1. always predict the majority class - in this case, we predict _the sun will rise_
2. always predict the minority class - in this case, we predict _the sun will not rise_

### Two different worlds

Suppose either one of the following is true:
1. the sun always rises 
2. the sun always rises this year, and then it will never rise again

Learners using the two proposed biases will have different performance in this learning task, depending on which world they live in.
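The thought experiment can be sketched in a few lines of code. This is only a toy illustration: the function names, the 0/1 encoding of "did not rise"/"did rise", and the `worlds` dictionary are assumptions made for this example.

```python
def majority_learner(labels):
    """Inductive bias 1: predict the class seen most often in training."""
    ones = sum(labels)
    return 1 if ones >= len(labels) - ones else 0


def minority_learner(labels):
    """Inductive bias 2: predict the class seen least often in training."""
    return 1 - majority_learner(labels)


# 1 = "the sun did rise"; one observation per day of this year.
training_labels = [1] * 365

# What actually happens on 1/jan next year in each hypothetical world.
worlds = {
    "sun always rises": 1,
    "sun never rises again after this year": 0,
}

# For every (world, learner) pair, record whether the prediction is right.
results = {}
for world, truth in worlds.items():
    for learner in (majority_learner, minority_learner):
        results[(world, learner.__name__)] = learner(training_labels) == truth

for (world, name), correct in results.items():
    print(f"{world:40s} {name:18s} {'correct' if correct else 'wrong'}")
```

Both learners see exactly the same 365 examples. Which one is right depends only on the world, not on the data.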

### Which one is the better learner?

* the most accurate answer is: the one best suited to its own environment
* the practical answer is: the one who says the sun will rise

NFLTs formalize these intuitions.

### Two (sets of) NFLTs

We'll discuss 2 sets of NFLTs:
+ Wolpert's: introduced in a paper in 1996
+ Shalev-Shwartz & Ben-David's (Shwartz-David, for short): available in their 2014 textbook

* Wolpert's are the most commonly cited ones
* Shwartz-David's might be easier to interpret, though

We'll provide formal statements of both - but not proofs :(

## 2. Shwartz-David's NFLT
_as well as a simplified definition of PAC-Learnability_

### PAC-Learning

+ "PAC" stands for probably approximately correct
+ Leslie Valiant introduced this in 1984
+ it was an important part of why he won a Turing Award in 2010

### A (simple) learning problem

#### The setting

* domain $X \subseteq \mathbb{R}^p$
* output: $Y$
* there's an unknown underlying distribution: $P(X, Y)$

#### Our task

* we are given $m$ training cases, sampled i.i.d. (not always realistic)
* we must come up with a classifier $h : X \rightarrow Y$
* we want to minimize the loss function $L$ on _unseen data_

### Minimizing the loss

* if we allow any classifier at all, we can essentially _always_ achieve 0 error on the training set (e.g., by memorizing it)
* we know the problem with that: we'll likely _overfit_
* that is, our classifier will not generalize to unseen data
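A minimal sketch of this failure mode, assuming a toy setup invented for illustration: the inputs are distinct integers and the labels are fair coin flips, so there is nothing to learn. The helper names `fit_memorizer` and `error` are hypothetical.

```python
import random


def fit_memorizer(xs, ys, default=0):
    """Return a classifier that looks up training labels and
    falls back to `default` on inputs it has never seen."""
    table = dict(zip(xs, ys))
    return lambda x: table.get(x, default)


def error(h, xs, ys):
    """Fraction of points on which classifier h disagrees with the labels."""
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)

# Distinct inputs; labels are coin flips that do not depend on x.
train_x = list(range(200))
train_y = [rng.randint(0, 1) for _ in train_x]
test_x = list(range(200, 2200))
test_y = [rng.randint(0, 1) for _ in test_x]

h = fit_memorizer(train_x, train_y)
train_error = error(h, train_x, train_y)
test_error = error(h, test_x, test_y)

print(f"training error: {train_error:.3f}")
print(f"test error:     {test_error:.3f}")
```

The memorizer is perfect on the training set because every training input is distinct. On unseen inputs it can do no better than guessing.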

### How do we solve this problem?

* in PAC-learning, we restrict our search space
* we do not consider _every possible classifier_
* we only allow classifiers from a (predetermined) _hypothesis class_
* for example, we could restrict ourselves to work with _linear models_

### hypothesis class == inductive bias

If we choose to only consider linear models,
it's as if we were saying:

    I believe the underlying distribution is well approximated by a linear classifier

The NFLTs are essentially answering:

    You do not know that distribution, anything you assume could in principle be wrong

### Empirical Risk Minimization
* our strategy for learning is to minimize training error (ERM)
* since we are only considering some possible classifiers, we may not achieve 0 training error
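Here is a hedged sketch of ERM over a small hypothesis class. Everything specific is an assumption chosen for illustration: one-dimensional inputs in $[0, 1]$, a class of 11 threshold classifiers, a "true" threshold of 0.35 that is not in the class, and 10% label noise.

```python
import random


def make_sample(m, rng, true_threshold=0.35, noise=0.1):
    """Draw m i.i.d. pairs (x, y): x is uniform on [0, 1], y is the
    thresholded label, flipped with probability `noise`."""
    sample = []
    for _ in range(m):
        x = rng.random()
        y = 1 if x >= true_threshold else 0
        if rng.random() < noise:
            y = 1 - y
        sample.append((x, y))
    return sample


def training_error(t, sample):
    """Empirical risk (0-1 loss) of the classifier 'predict 1 iff x >= t'."""
    return sum((1 if x >= t else 0) != y for x, y in sample) / len(sample)


def erm(thresholds, sample):
    """Empirical risk minimization: pick the threshold with lowest training error."""
    return min(thresholds, key=lambda t: training_error(t, sample))


# Hypothesis class: thresholds 0.0, 0.1, ..., 1.0.
thresholds = [i / 10 for i in range(11)]
sample = make_sample(2000, random.Random(0))

best_t = erm(thresholds, sample)
best_error = training_error(best_t, sample)
print(f"ERM threshold: {best_t}, training error: {best_error:.3f}")
```

Because the class is restricted and the labels are noisy, the selected classifier has non-zero training error, as the second bullet above anticipates.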

**Aside**: _it is common to split the dataset into training, validation (cv) and test sets. This doesn't change anything fundamental in the present discussion. We still only have a finite dataset that might not represent the true underlying distribution well. The decision to split it is internal to the learning procedure._

Is the training error a good approximation for the true error? As it turns out, it depends on the hypothesis class.

Notice:
* we do not care about having THE best classifier
* any classifier is fine as long as it is not _too much_ worse than the best one

In other words, we want to be close to the best classifier in our hypothesis class.

### PAC-learnability

Let's name the two classifiers we are concerned with:
* $h_S$ is the classifier that has lowest training error
* $h_{opt}$ is the classifier that has lowest true error

**Definition** (_PAC-learning_): We say a hypothesis class is _PAC-learnable_ if for all $\epsilon$ and $\delta$, there is a sample size $m$ above which the following holds:

$$ P \big( (\text{true error of } h_S) - (\text{true error of } h_{opt} ) \le \epsilon \big) \ge 1 - \delta $$

$\epsilon$ and $\delta$ are real numbers between 0 and 1.

#### _Probably Approximately Correct_
If our hypothesis class is PAC-learnable, we can safely use the strategy of choosing the classifier with lowest training error:
* we'll be _approximately correct_: $\big( (\text{true error of } h_S) - (\text{true error of } h_{opt} ) \le \epsilon \big)$
* but we cannot **be sure** we will be approximately correct
* we can always get a REALLY BAD training set
* what we do know is that we will be _probably_ approximately correct: with probability at least $1 - \delta$
* if we want to be more correct with higher probability, the required sample size will grow

Learning a PAC-learnable hypothesis class is really convenient: we can happily pick the classifier which achieves lowest training error.
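To get a feel for the definition, here is a small simulation sketch. It estimates how often ERM ends up within $\epsilon$ of the best classifier in the class, for a few sample sizes. The setup is an assumption made for this example: threshold classifiers on $[0, 1]$, a true threshold `THETA = 0.5`, label noise `NOISE = 0.2`, and $\epsilon = 0.05$. Under these assumptions the true error of each threshold has a closed form, which is what `true_error` computes.

```python
import random

THETA, NOISE = 0.5, 0.2                      # assumed data-generating process
THRESHOLDS = [i / 20 for i in range(21)]     # hypothesis class


def true_error(t):
    """Exact true error of 'predict 1 iff x >= t' for x uniform on [0, 1]:
    the classifier disagrees with the clean label on a region of length |t - THETA|."""
    return NOISE + (1 - 2 * NOISE) * abs(t - THETA)


def draw_sample(m, rng):
    """Draw m i.i.d. noisy examples."""
    sample = []
    for _ in range(m):
        x = rng.random()
        y = 1 if x >= THETA else 0
        if rng.random() < NOISE:
            y = 1 - y
        sample.append((x, y))
    return sample


def erm_threshold(sample):
    """h_S: the threshold with the fewest training mistakes."""
    def mistakes(t):
        return sum((x >= t) != (y == 1) for x, y in sample)
    return min(THRESHOLDS, key=mistakes)


def prob_approximately_correct(m, epsilon, trials=100, seed=0):
    """Estimate P(true error of h_S - true error of h_opt <= epsilon)."""
    rng = random.Random(seed)
    best = min(true_error(t) for t in THRESHOLDS)   # true error of h_opt
    hits = 0
    for _ in range(trials):
        t_s = erm_threshold(draw_sample(m, rng))
        hits += (true_error(t_s) - best) <= epsilon
    return hits / trials


estimates = {m: prob_approximately_correct(m, epsilon=0.05) for m in (10, 100, 1000)}
for m, p in estimates.items():
    print(f"m = {m:5d}: estimated probability of being epsilon-close = {p:.2f}")
```

The estimated probability plays the role of $1 - \delta$: for a fixed $\epsilon$, it should approach 1 as the sample size grows.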

We might be interested now in asking things like:
1. what kinds of hypothesis classes are PAC-learnable? For example, is the set of all linear functions for a regression problem PAC-learnable?
2. how big should the sample size be so we can, for example, be 99.9% sure we'll be worse off by at most 1E-10 from the best classifier?

These are very important questions. As it turns out, SLT offers answers for both.  

The theory of VC-dimension and the fundamental theorem of PAC-Learning provide an answer to the first. 
And there are formulas to answer the second. 
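As one example of such a formula: for a _finite_ hypothesis class $H$ and a loss bounded in $[0, 1]$, the textbook by Shalev-Shwartz & Ben-David shows that ERM needs at most $\lceil 2 \log(2|H|/\delta) / \epsilon^2 \rceil$ examples. The sketch below simply evaluates that bound. The function name and the class size of 1000 are assumptions for illustration, and the bound does not directly apply to infinite classes such as all linear models.

```python
import math


def sample_complexity_finite_class(epsilon, delta, class_size):
    """Upper bound on the sample size for which ERM over a finite class is
    epsilon-close to the best hypothesis with probability >= 1 - delta."""
    return math.ceil(2 * math.log(2 * class_size / delta) / epsilon ** 2)


# A moderate requirement: within 0.1 of the best, with 95% confidence.
m_moderate = sample_complexity_finite_class(epsilon=0.1, delta=0.05, class_size=21)
print(f"epsilon=0.1,   delta=0.05:  m <= {m_moderate}")

# The extreme requirement from the question above: 99.9% sure, at most 1E-10 worse.
m_extreme = sample_complexity_finite_class(epsilon=1e-10, delta=1e-3, class_size=1000)
print(f"epsilon=1e-10, delta=0.001: m <= {m_extreme:.3e}")
```

Note how the bound grows only logarithmically in $1/\delta$ and $|H|$, but quadratically in $1/\epsilon$.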

Unfortunately, we must move on.

### So, is machine learning solved?
#### Does it all boil down to finding a PAC-learnable hypothesis class?
**tl;dr:** _no._

We may get very close to the best classifier in our class, but it might still suck.

If we apply logistic regression to classify images on ImageNet, the model will likely be bad.
Even if it's almost the best possible logistic regression classifier for this problem.

#### What gives?

A restrictive hypothesis class (e.g., logistic regression):
* is usually easy to _learn_ (get to the best possible model for that class)
* usually sucks

This is the bias-complexity tradeoff. NFLT, in some sense, is a technical way of saying:

    We cannot get rid of the bias. Well, at least if we consider the set of all conceivable problems.
    
Which, again, is the same as saying we must _assume_ our hypothesis class is good for our particular problem.

### Shwartz-David's theorem

#### The setting
* binary classification (domain X, labels Y = {0, 1})
* our training set cannot cover more than half of the domain
  - this is a technicality for _really big_ domains (such as the set of all float32 32x32 images)
  - intuition: the half of the domain we _do not_ see could _in principle_ be anything

**Theorem** (_SD-NFL_): for any learning algorithm we apply and for any fixed training set size we can find a distribution $P(X,Y)$ such that:

+ with probability $\ge$ 1/7 we'll get a training set for which the classifier output by the algorithm has true error $\ge$ 1/8
+ there's a classifier that achieves 0 true error on it.
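The following brute-force sketch illustrates the averaging idea behind this kind of result. It is not a proof, and it does not reproduce the 1/7 and 1/8 constants. The setup is an assumption for illustration: a domain of 6 points with a uniform distribution, training sets of size 3 (so at most half of the domain is seen), and three hypothetical learners. For each learner we average the true error over _every_ possible labeling of the domain and every possible training set.

```python
from itertools import product

DOMAIN = range(6)   # tiny domain, uniform distribution over its points
M = 3               # training set size: at most half the domain is seen


def memorize_default_zero(sample):
    """Memorize the training labels; predict 0 on unseen points."""
    table = dict(sample)
    return lambda x: table.get(x, 0)


def memorize_then_majority(sample):
    """Memorize the training labels; predict the majority label on unseen points."""
    table = dict(sample)
    ones = sum(y for _, y in sample)
    fallback = 1 if 2 * ones >= len(sample) else 0
    return lambda x: table.get(x, fallback)


def always_one(sample):
    """Ignore the data entirely."""
    return lambda x: 1


def true_error(h, f):
    """True error of h when the labeling f is the ground truth."""
    return sum(h(x) != f[x] for x in DOMAIN) / len(DOMAIN)


def average_error(learner):
    """Average true error over all labelings f and all training sequences."""
    total, count = 0.0, 0
    for f in product([0, 1], repeat=len(DOMAIN)):    # every possible world
        for xs in product(DOMAIN, repeat=M):         # every i.i.d. sample of inputs
            sample = [(x, f[x]) for x in xs]
            total += true_error(learner(sample), f)
            count += 1
    return total / count


averages = {
    learner.__name__: average_error(learner)
    for learner in (memorize_default_zero, memorize_then_majority, always_one)
}
for name, value in averages.items():
    print(f"{name:24s} average true error = {value:.4f}")
```

In each of these worlds the labeling `f` itself is a classifier with 0 true error, yet no learner here gets an average error below 1/4. Since an average cannot exceed the worst case, each learner has at least one world in which its expected error is that bad.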

## 3. Wolpert's NFLT

### Wolpert & Macready's theorem

#### Wolpert, 1996
    It cannot be emphasized enough that no claim is being made in this first paper that all algorithms are equivalent in practice, in the real world. In particular, no claim is being made that one should not use cross-validation in the real world. (I have done so myself many times in the past and intend to do so again in the future.) The sole concern of this paper is what can(not) be formally inferred about the utility of various learning algorithms if one makes no assumptions concerning targets.

## 4. Further reading & Bibliography

The following is an excellent introduction to machine learning from the point of view of SLT:
+ Shalev-Shwartz, Ben-David (2014). _Understanding Machine Learning: From Theory to Algorithms_

It's also available as a PDF at the [author's website](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf).

<img align="left" src="book.jpg" alt="Drawing" style="width: 170px;"/>

Originally, Wolpert's NFLTs for learning algorithms were introduced in:
+ Wolpert (1996). _The Lack of A Priori Distinctions Between Learning Algorithms and The Existence of A Priori Distinctions Between Learning Algorithms_

He published many subsequent works discussing the theorems, such as:
+ Wolpert (2012). _What the No Free Lunch Theorems Really Mean; How to Improve Search Algorithms_

An introduction to a simplified version of them can be found in:
+ Wolpert (2001). _The Supervised Learning No-Free-Lunch Theorems_

Here is a list of websites I consulted:
+ https://en.wikipedia.org/wiki/David_Hume#Induction_and_causation
+ https://en.wikipedia.org/wiki/Problem_of_induction
+ https://mashimo.wordpress.com/2013/03/12/bertrand-russells-inductivist-turkey/
+ https://peekaboo-vision.blogspot.com/2019/07/dont-cite-no-free-lunch-theorem.html
+ https://amturing.acm.org/award_winners/valiant_2612174.cfm
+ http://www.no-free-lunch.org/