# Introduction to Deep Learning with Python

## Chapter 2: Multilayer Perceptrons

### 2.1  Network Architecture

Recall that a learning problem can be understood as approximating some function **$f^*$**: for example, $y = f^*(x)$ maps the **input $x$** to some **output $y$**. A learning algorithm tries to find values of the parameters $\theta$ such that a given function $y = f(x;\theta)$ approximates $f^*$ as closely as possible.

In the case of the perceptron, $f(x;\theta)$ was the weighted sum of the inputs and $\theta$ was the weight vector $[w_1, w_2, \dots, w_n]$ (and the bias $b$). A multilayer perceptron (MLP), a core deep learning model also called a deep feedforward network or feedforward neural network, extends this idea by composing many functions together:

$$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$$

Such a chain of functions can be drawn as a directed acyclic graph, where $f^{(1)}$ is called the **first layer**, $f^{(2)}$ is called the **second layer**, and $f^{(3)}$ is called the **third layer**. It is the typical architecture used in modern neural networks.

![Perceptron](00_ressources/img/chapter_2/network_architecture.jpeg)
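
The composition $f^{(3)}(f^{(2)}(f^{(1)}(x)))$ can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random weights, and the choice of an affine map followed by `tanh` inside each layer are assumptions made for the example, not something fixed by the text above. The helper `make_layer` is a hypothetical name.

```python
import numpy as np

def make_layer(W, b):
    """Return a function x -> tanh(W x + b) (assumed layer form)."""
    def layer(x):
        return np.tanh(W @ x + b)
    return layer

rng = np.random.default_rng(0)

# Three layers with arbitrary, illustrative sizes: 3 -> 4 -> 4 -> 1.
f1 = make_layer(rng.normal(size=(4, 3)), np.zeros(4))
f2 = make_layer(rng.normal(size=(4, 4)), np.zeros(4))
f3 = make_layer(rng.normal(size=(1, 4)), np.zeros(1))

def f(x):
    """The whole network: f(x) = f3(f2(f1(x)))."""
    return f3(f2(f1(x)))

x = np.array([0.5, -1.0, 2.0])
y = f(x)  # array with a single output value
```

Each layer only sees the output of the previous one, which is what makes the graph directed and acyclic.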

In chapter 1.2, we saw that learning $w$ is not that problematic. The real challenge is having the right inputs $x$. One way to solve this problem is to manually engineer a feature vector $\phi(x)$ from the input vector $x$ using a nonlinear transformation $\phi$. We then use $\phi(x)$ instead of $x$ to solve the problem at hand.
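
As a small illustration of a hand-engineered feature map, consider points that lie either close to the origin or far away from it. No straight line in the original coordinates separates the two groups, but appending the squared radius as an extra feature makes a simple linear threshold sufficient. The data points, the feature map `phi`, and the hand-picked values of `w` and `b` below are all assumptions chosen for this sketch.

```python
import numpy as np

def phi(x):
    """Hand-crafted nonlinear feature map: append the squared radius."""
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

# Toy data (assumed): points near the origin vs. points far from it.
inner = np.array([[0.1, 0.0], [0.0, -0.2], [-0.1, 0.1]])
outer = np.array([[1.0, 0.0], [0.0, -1.2], [-0.9, 0.8]])

# A linear rule on phi(x): only the squared-radius feature is used.
w = np.array([0.0, 0.0, 1.0])
b = -0.5

def predict(x):
    """Return 1 if the point is classified as 'outer', else 0."""
    return int(phi(x) @ w + b > 0)

inner_labels = [predict(x) for x in inner]
outer_labels = [predict(x) for x in outer]
```

The linear model itself stays as simple as before; all of the work is done by choosing a suitable $\phi$ by hand.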

As already mentioned, support vector machines use an alternative method where $\phi$ is a predefined function, such as the one given by the radial basis function (RBF) kernel, which has proven helpful in many problems.
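
For reference, the RBF kernel is commonly written as $k(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The sketch below computes it directly; the function name `rbf_kernel` and the default value of the width parameter `gamma` are assumptions for this example.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

a = [0.0, 0.0]
b = [1.0, 0.0]

k_same = rbf_kernel(a, a)  # identical points give the maximum value 1
k_ab = rbf_kernel(a, b)    # similarity decays with squared distance
```

The kernel is fixed before training: nothing about its form is learned from the data, apart from tuning hyperparameters such as `gamma`.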

In contrast, deep learning uses neither manually engineered features nor predefined functions but tries to learn $\phi$ from the given data:

$$y = f(x;\theta,w) = \phi(x;\theta)^{T}w,$$

where $\theta$ is a parameter vector used to learn the function $\phi$, and $w$ is a weight vector that maps from $\phi(x)$ to $y$. In other words, given a generic function $\phi$ parametrized as $\phi(x;\theta)$, we let an algorithm find the $\theta$ that yields a useful new representation (features) of $x$.
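
The structure of this model can be sketched as follows. The concrete form of $\phi$ used here (one affine map followed by `tanh`), the sizes, and the random initial parameters are assumptions for illustration; in practice $\theta$ and $w$ would be fitted to data by a training procedure that is not shown.

```python
import numpy as np

def phi(x, theta):
    """Parametrized feature map phi(x; theta); theta = (W, c) is assumed."""
    W, c = theta
    return np.tanh(W @ x + c)

def f(x, theta, w):
    """Model output y = phi(x; theta)^T w."""
    return phi(x, theta) @ w

rng = np.random.default_rng(0)

# Illustrative sizes: 2 inputs mapped to 5 learned features.
theta = (rng.normal(size=(5, 2)), rng.normal(size=5))
w = rng.normal(size=5)

x = np.array([1.0, -2.0])
features = phi(x, theta)  # the new representation of x
y = f(x, theta, w)        # scalar prediction
```

Changing `theta` changes the representation itself, whereas changing `w` only changes how a fixed representation is mapped to the output.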


### 2.2 Types of Hidden Units