# Introduction to deep learning

## The Goal of This Talk
- To give a sense of what deep learning is actually good for, and what it's not good for
- To give a sense of when deep learning is the right strategy to take in your research

## Outline
1. No free lunch in AI
2. Inductive biases and the AI set
3. Deep architectures as a good inductive bias
4. Making learning work in deep architectures
5. When to use deep learning

## No Free Lunch in AI
The early dream of machine learning researchers was that we could develop general purpose learning algorithms that could learn well in any scenario on any dataset.

However, there exists no learning algorithm that can perform better than all other learning algorithms on all tasks. 

**No Free Lunch Theorem for Optimization (Wolpert and Macready, 1997)**
- Different problems call for different algorithms, and each algorithm is suited to only some problems.
- Make sure you completely understand a machine learning problem and the data involved before selecting an algorithm to use.
- All models are only as good as the assumptions that they were created with and the data that was used to train them.
- Simpler models like logistic regression have more bias and tend to underfit, while more complex models like neural networks have more variance and tend to overfit.
- The best models for a given problem are somewhere in the middle of the two bias-variance extremes.
- To find a good model for a problem, you may have to try different models and compare them using a robust cross-validation strategy. 
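The last point can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the synthetic data, the two candidate models (`fit_mean`, `fit_line`) and the helper `cross_val_mse` are all assumptions made up for this example.

```python
import random

def fit_mean(xs, ys):
    """Very simple model: always predict the mean training target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least-squares line fitted to one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error over k interleaved folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        mse = sum((model(xs[i]) - ys[i]) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k

# Synthetic data (an assumption): a noisy straight line.
rng = random.Random(0)
xs = [i / 10 for i in range(30)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]

scores = {
    "mean": cross_val_mse(fit_mean, xs, ys),
    "line": cross_val_mse(fit_line, xs, ys),
}
print(scores)
```

On this data the line should win, because the data was generated from a line. On a different task the ranking could flip, which is why the comparison is made on held-out folds.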

If you don't have any expectations about what tasks you're going to apply your algorithms to, you can never have an algorithm that outperforms all other algorithms on average.

**You have to assume a non-flat prior over loss functions and select an algorithm that is well suited to learning that specific task if you want the best performance.**
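A toy enumeration gives a feel for the flat-prior case. It is an illustration in a supervised-learning setting, not the theorem itself. The four-point domain, the train/test split and the three learners are hypothetical choices for this sketch. Every possible labelling of the domain is treated as equally likely, and each learner is scored only on points it has not seen.

```python
from itertools import product

DOMAIN = [0, 1, 2, 3]   # four possible inputs (an assumption)
TRAIN = [0, 1]          # inputs whose labels the learner sees
TEST = [2, 3]           # unseen inputs used for scoring

def majority_learner(train_labels):
    """Predict the majority training label everywhere (ties go to 0)."""
    guess = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return lambda x: guess

def contrarian_learner(train_labels):
    """Predict the opposite of the majority learner."""
    guess = 0 if sum(train_labels) * 2 > len(train_labels) else 1
    return lambda x: guess

def always_one_learner(train_labels):
    """Ignore the training data entirely."""
    return lambda x: 1

def average_off_training_accuracy(learner):
    """Average accuracy on TEST over all 2**4 possible target functions."""
    targets = list(product([0, 1], repeat=len(DOMAIN)))
    total = 0.0
    for target in targets:
        hypothesis = learner([target[x] for x in TRAIN])
        correct = sum(hypothesis(x) == target[x] for x in TEST)
        total += correct / len(TEST)
    return total / len(targets)

results = {
    "majority": average_off_training_accuracy(majority_learner),
    "contrarian": average_off_training_accuracy(contrarian_learner),
    "always_one": average_off_training_accuracy(always_one_learner),
}
print(results)  # every learner averages 0.5
```

All three learners tie at chance. A learner only pulls ahead once some target functions are assumed to be more likely than others, which is exactly the non-flat prior.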

## Inductive Biases And the AI Set
How to select the right algorithms for AI?

Three typical response categories

1. Defeatism: We just have to hand-craft a specific solution for every specific task we ever want to accomplish in AI.
2. Denial: Things like kernel machines can approximate any function, and we have bounds on their ability to generalize with regularization, so it's all good.
3. Optimism: We can define the set of things we actually want to do with AI and design systems that are general purpose within that restricted set. But, this requires inductive biases.

### Inductive Biases
Inductive biases are assumptions that we bake into our algorithms about the sort of tasks we will be performing. They are a means of embedding prior knowledge into an optimization system.

Broadly, we can build inductive biases into machine learning algorithms using the following three components:
1. Using hand-wired pre-processing of data (e.g., extracting pre-determined features)
2. Using specific architectures for our learning machines
3. Using specific loss functions and regularizers for our learning machines

Note: a regularizer is a way of discouraging over-fitting to the training data, typically a penalty term added to the loss.
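As a small worked example of the third component, consider fitting a one-parameter model `y ≈ w * x` with an L2 penalty. This is a sketch under assumptions: the penalty form, the function name `ridge_slope` and the toy data are not from the talk. Minimizing `sum((y - w*x)**2) + lam * w**2` gives the closed form `w = sum(x*y) / (sum(x*x) + lam)`.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)^2) + lam * w^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data (an assumption): exactly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularized = ridge_slope(xs, ys, lam=0.0)    # 28 / 14 = 2.0
regularized = ridge_slope(xs, ys, lam=14.0)     # 28 / 28 = 1.0
print(unregularized, regularized)
```

The penalty expresses a prior preference for small weights: the larger `lam` is, the more the fitted slope is pulled towards zero, regardless of what the data says.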

### The AI Set
AI should take inspiration from brains!

Our brains seem general purpose in their learning, but they are actually pretty restricted, and learn certain things far more easily than others (e.g., learning to read natural language is easier than learning to read barcodes).

AI should be concerned with the broad set of tasks that animals and people are good at, and maybe some related tasks that animals and people aren't good at only due to physical/speed limitations.

AI research should be about defining good inductive biases that are as minimal as possible in order to do well on the AI set.

## Deep Architectures as a Good Inductive Bias
### Kernel machines
Kernel methods are a class of algorithms used for pattern analysis. They use linear classifiers to solve nonlinear problems.

**Getting good features**
The basic idea: project the data into a high-dimensional space where it becomes linearly separable.

The most obvious way to do this is to try to define a set of pre-processing stages that will accomplish this.
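A classic small case is XOR, sketched below. The hand-crafted feature map `lift` and the weights are illustrative assumptions. No straight line separates the four XOR points in the original two dimensions, but adding the product `x1 * x2` as a third feature makes a linear classifier sufficient.

```python
from itertools import product

# XOR: label is 1 when exactly one input is 1.
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_predict(weights, bias, features):
    """Linear threshold classifier: 1 if w . x + b > 0, else 0."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def lift(point):
    """Hand-crafted pre-processing: append the product feature x1 * x2."""
    x1, x2 = point
    return (x1, x2, x1 * x2)

# A coarse grid search finds no linear separator in the original 2-D space.
grid = [v / 2 for v in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
separable_in_2d = any(
    all(linear_predict((w1, w2), b, p) == label for p, label in XOR_DATA)
    for w1, w2, b in product(grid, repeat=3)
)

# In the lifted 3-D space, one hand-picked set of weights works.
lifted_weights, lifted_bias = (1.0, 1.0, -2.0), -0.5
lifted_predictions = [
    linear_predict(lifted_weights, lifted_bias, lift(p)) for p, _ in XOR_DATA
]
print(separable_in_2d, lifted_predictions)  # False [0, 1, 1, 0]
```

The grid search is only a sanity check. XOR can be shown to be non-separable for any choice of weights, not just the ones on the grid.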

Kernel machines try to avoid hand-crafting the pre-processing stages by instead using the data itself to construct the pre-processing. But it turns out that kernel machines do not actually perform that well.

### The world is hierarchical
The world around us is compositional. This means that when we experience something, it is usually composed of smaller pieces that are themselves composed of smaller pieces, and so on.

### AI should be hierarchical
If the world is best understood in terms of compositions of different pieces, then what we want is not a single non-linear projection into a new space (per kernel machines).

Instead, we want multiple layers of non-linear functions, each one operating with the features identified by the lower layers.
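A minimal sketch of this stacking, in plain Python, is given below. The layer sizes and the hand-picked weights are assumptions chosen so that a two-layer network computes XOR. In practice the weights would be learned rather than written by hand.

```python
def relu(values):
    """Element-wise non-linearity: max(0, v)."""
    return [max(0.0, v) for v in values]

def affine(weights, bias, values):
    """One linear layer: weights @ values + bias."""
    return [
        sum(w * v for w, v in zip(row, values)) + b
        for row, b in zip(weights, bias)
    ]

def forward(layers, values):
    """Apply each layer to the features produced by the layer below it."""
    for index, (weights, bias) in enumerate(layers):
        values = affine(weights, bias, values)
        if index < len(layers) - 1:  # no non-linearity on the output layer
            values = relu(values)
    return values

# Hand-picked weights (an assumption for illustration):
#   hidden unit 1 = relu(x1 + x2)       "at least one input is on"
#   hidden unit 2 = relu(x1 + x2 - 1)   "both inputs are on"
#   output        = unit 1 - 2 * unit 2
LAYERS = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -2.0]], [0.0]),
]

outputs = [forward(LAYERS, [x1, x2])[0] for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

The output layer never sees the raw inputs. It only combines the two intermediate features, which is the hierarchy-of-features idea in its smallest form.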

This is an inductive bias. It says, essentially, that we assume tasks where we have to build hierarchies of features.

We need to devise AI architectures that have built-in hierarchy. In other words, we need deep architectures. This line of logic is the origin of "deep learning".

## Making Learning Work in Deep Architectures
### Towards deep learning: gradient descent
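You want to learn a hierarchy, and the most successful approach has been to use gradient descent to search for a low-cost minimum. Gradient descent follows the local slope of the cost, so it is not guaranteed to reach the global minimum.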

In the right circumstances, gradient descent in deep networks can work really well.
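The basic update rule is `w <- w - learning_rate * gradient`. The sketch below applies it to a one-dimensional cost with two valleys. The cost function, learning rate and starting points are assumptions picked for illustration. It shows that where gradient descent ends up depends on where it starts.

```python
def cost(w):
    """A non-convex toy cost with two minima."""
    return w**4 - 3 * w**2 + w

def cost_grad(w):
    """Derivative of the toy cost."""
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, learning_rate=0.01, steps=2000):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        w -= learning_rate * cost_grad(w)
    return w

w_from_left = gradient_descent(-2.0)   # settles in the deeper valley (w near -1.30)
w_from_right = gradient_descent(2.0)   # settles in the shallower valley (w near 1.13)
print(w_from_left, cost(w_from_left))
print(w_from_right, cost(w_from_right))
```

Both runs converge to a point where the gradient is essentially zero, but only the first reaches the lower of the two minima.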

### Size matters
**The strange double descent phenomenon**
In statistics and machine learning, double descent is the phenomenon where a statistical model with a small number of parameters and a model with an extremely large number of parameters have a small error, but a model whose number of parameters is about the same as the number of data points used to train the model will have a large error.

### Other things that help with deep learning
- Modifications to gradient descent that help escape saddle points (e.g., Adam)
- Activation functions that mitigate vanishing gradients (e.g., ReLU)
- Good regularization schemes (e.g., Dropout)
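To see why the choice of activation function matters, the sketch below multiplies the local derivatives of an activation across 20 layers, as backpropagation does. It is a simplification under stated assumptions: weight matrices are ignored, and the pre-activation values are fixed by hand at the sigmoid's best case (`z = 0`) and at a positive value for ReLU.

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(local_grad, pre_activations):
    """Product of per-layer activation derivatives along one path."""
    result = 1.0
    for z in pre_activations:
        result *= local_grad(z)
    return result

depth = 20
sigmoid_signal = chained_gradient(sigmoid_grad, [0.0] * depth)  # 0.25 ** 20
relu_signal = chained_gradient(relu_grad, [1.0] * depth)        # 1.0
print(sigmoid_signal, relu_signal)
```

Even in its best case the sigmoid shrinks the signal by a factor of four per layer, while an active ReLU unit passes it through unchanged. The flip side is that an inactive ReLU unit passes no gradient at all.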

## When to Use Deep Learning
