### Perceptron
- A perceptron is built from an artificial neuron called a threshold logic unit (TLU), also known as a linear threshold unit (LTU).
- The inputs and outputs of each node are numbers, and each input is associated with a weight.
- TLU first computes a linear function of inputs:
$$z=w_1x_1+w_2x_2+...+w_nx_n+b = \mathbf{w^Tx} + b$$
- It then applies a step function to the result of the sum:
$$h_w(x) = \text{step}(z)$$

**Common step functions (assuming threshold is 0):**
$$\operatorname{heaviside}(z) = \begin{cases} 
0 & \text{if } z < 0 \\ 
1 & \text{if } z \geq 0 
\end{cases}
\space
\operatorname{sgn}(z) = \begin{cases} 
-1 & \text{if } z < 0 \\ 
0 & \text{if } z = 0 \\ 
+1 & \text{if } z > 0 
\end{cases}$$

<img src="009Perceptron.png" height=200>

Note: A single TLU can be used for simple **linear binary classification**
- It computes a linear function of its inputs; if the result is $\geq$ some threshold, it outputs the positive class, otherwise the negative class.
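
A minimal NumPy sketch of a single TLU, assuming the Heaviside step with threshold 0. The weights and bias below are hand-picked for illustration (they make the unit behave like a logical AND) and are not taken from the notes above.

```python
import numpy as np

def heaviside(z):
    """Step function: 1 where z >= 0, otherwise 0."""
    return (np.asarray(z) >= 0).astype(int)

def tlu(x, w, b):
    """Threshold logic unit: linear function of the inputs, then a step."""
    z = np.dot(x, w) + b
    return heaviside(z)

# Hypothetical weights that implement a logical AND of two binary inputs
w = np.array([1.0, 1.0])
b = -1.5

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
outputs = tlu(inputs, w, b)
print(outputs)  # [0 0 0 1]
```

Only the instance (1, 1) gives z = 0.5 >= 0, so it is the only one assigned to the positive class.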

**Multi-layer**
A perceptron consists of one or more TLUs organized in a single layer.
Each TLU is connected to every input (a fully connected, or dense, layer).
Inputs -> input layer, outputs of the TLUs -> output layer

A perceptron with several output neurons can assign each instance to several binary classes simultaneously, one per output, so it can be used for multi-label classification. It can also be used for multiclass classification.

Outputs of a layer of artificial neurons for several instances at once:
$$\mathbf{\hat{Y}} = \phi(\mathbf{XW + b})$$
<img src="009Multilayer.png" height=200>

- $\mathbf{\hat{Y}}$ is output matrix. One row per instance, one column per neuron
- $\mathbf{X}$ is input matrix. One row per instance, one column per input feature
- $\mathbf{W}$ is weight matrix, contains all connection weights. One row per input feature, one column per neuron
- $\mathbf{b}$ contains all bias terms, one per neuron
- $\phi$ is activation function
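
The shapes described above can be checked with a short NumPy sketch. The sizes (4 instances, 3 features, 2 neurons) and the random values are arbitrary assumptions, and a step function stands in for $\phi$.

```python
import numpy as np

rng = np.random.default_rng(42)

n_instances, n_features, n_neurons = 4, 3, 2
X = rng.normal(size=(n_instances, n_features))  # one row per instance
W = rng.normal(size=(n_features, n_neurons))    # one column per neuron
b = np.zeros(n_neurons)                         # one bias term per neuron

def step(z):
    """Heaviside step used here as the activation function phi."""
    return (z >= 0).astype(float)

# b is broadcast across the rows, so every instance gets the same biases
Y_hat = step(X @ W + b)
print(Y_hat.shape)  # (4, 2)
```

The output has one row per instance and one column per neuron, matching the description of $\mathbf{\hat{Y}}$.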

**Training a perceptron**
"Cells that fire together, wire together" --> connection weight between two neurons tends to increase when they activate simultaneously.

Perceptrons are trained using a variant of this rule that takes into account the error made by the network when it makes a prediction (it reinforces connections that help reduce the error).
- Feed the network one training instance at a time; for each instance it makes a prediction.
- For every output neuron that produced a wrong prediction, reinforce the connection weights from the inputs that would have contributed to the correct prediction.

The rule can be written as:
$$w_{i,j}^{\text{(next step)}} = w_{i,j} + \eta(y_j-\hat{y}_j)x_i$$
- $w_{i,j}$ is connection weight between $i^{th}$ input and $j^{th}$ neuron
- $x_i$ is the $i^{th}$ input value of current training instance
- $\hat{y}_j$ is output of $j^{th}$ output neuron for current training instance
- $y_j$ is target output of the $j^{th}$ output neuron for current training instance
- $\eta$ is learning rate

Because the decision boundary of each output neuron is linear, perceptrons cannot learn complex patterns. However, if the training instances are linearly separable, the algorithm will eventually converge to a solution.
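
The update rule can be sketched from scratch for a single output neuron. This is an illustrative implementation, not how scikit-learn does it: the function name `train_perceptron`, the AND dataset, and the choices `eta=1.0` and `n_epochs=20` are assumptions, and the bias is updated like a weight whose input is always 1.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Train one TLU with the perceptron learning rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):           # one instance at a time
            y_hat = 1 if x_i @ w + b >= 0 else 0
            error = y_i - y_hat              # 0 when the prediction is right
            w += eta * error * x_i           # w_i <- w_i + eta * (y - y_hat) * x_i
            b += eta * error                 # bias: same rule with input 1
    return w, b

# Linearly separable toy problem: logical AND
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w, b = train_perceptron(X_and, y_and)
preds = (X_and @ w + b >= 0).astype(int)
print(preds)  # [0 0 0 1]
```

Weights only change when a prediction is wrong, so once every instance is classified correctly the parameters stop moving.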

In [7]:
'''
Sklearn Perceptron class
'''
import numpy as np
from sklearn.datasets import load_iris 
from sklearn.linear_model import Perceptron

iris = load_iris(as_frame=True)

X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 0) # Iris setosa

per_clf = Perceptron(random_state=42)
per_clf.fit(X,y)

X_new = [[2,0.5], [3,1]]
y_pred = per_clf.predict(X_new)
y_pred

array([ True, False])

One **huge** limitation of perceptrons is that they cannot solve the XOR classification problem, because the XOR classes are not linearly separable.

This limitation can be eliminated by stacking multiple perceptrons. The resulting network is called a multilayer perceptron (MLP).
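
A small sketch of how stacking solves XOR. The weights are hand-picked for illustration (no training involved): one hidden TLU acts as AND, the other as OR, and the output TLU fires when OR is on but AND is off.

```python
import numpy as np

def step(z):
    """Heaviside step: 1 where z >= 0, otherwise 0."""
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 behaves like AND, column 1 like OR (assumed weights)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-1.5, -0.5])

# Output neuron: OR and not AND, which is exactly XOR
W_out = np.array([[-1.0],
                  [1.0]])
b_out = np.array([-0.5])

hidden = step(X @ W_hidden + b_hidden)
xor = step(hidden @ W_out + b_out).ravel()
print(xor)  # [0 1 1 0]
```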

### Multilayer Perceptron and Backpropagation
An MLP is composed of one input layer, one or more layers of artificial neurons called hidden layers, and one final layer of artificial neurons called the output layer.

Layers closer to the input layer are called lower layers; layers closer to the outputs are called upper layers.

When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN).

Training:
- Reverse-mode automatic differentiation (reverse-mode autodiff)

    - In just two passes through the network, one forward and one backward, it can compute the gradient of the neural network's error with regard to every single model parameter.
    - This shows how each connection weight and bias should be tweaked to reduce the network's error.
    - These gradients can then be used to perform a gradient descent step.
    - Repeating this process makes the network's error gradually drop until it reaches a minimum.
    - The combination of reverse-mode autodiff and gradient descent is called **backpropagation**, or backprop.

**Backpropagation in detail**
1. It handles one mini-batch at a time and goes through the full training set multiple times. If each mini-batch has n instances and each instance has m features, the mini-batch is represented as a matrix with n rows and m columns. Each pass through the training set is called an *epoch*.

2. Forward pass: for each mini-batch, the algorithm computes the output of all neurons in the first hidden layer using $\mathbf{\hat{Y}} = \phi(\mathbf{XW + b})$. If the layer has j neurons, the output is a matrix with n rows and j columns. This matrix is passed on to the next layer, whose output is computed and passed on in turn, and so on until the output layer.

3. The algorithm then measures the network's output error using a loss function.

4. It takes the output of the loss function and computes how much each output-layer parameter contributed to the error (chain rule). This gives one gradient per parameter.

5. It then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, working backward until it reaches the input layer.

6. Finally, it performs a gradient descent step to tweak all the connection weights and bias terms in the network, using the error gradients it just computed.
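
The six steps above can be sketched in plain NumPy for a network with one hidden layer. Everything specific here is an assumption made for illustration: the toy task (fitting y = sin(x)), the 16 ReLU hidden neurons, the MSE loss, and the values of `eta`, `batch_size` and `n_epochs`. Gradients are derived by hand with the chain rule rather than by an autodiff library.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up toy regression task: learn y = sin(x)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X)

# Randomly initialized weights, zero biases
W1 = rng.normal(scale=0.5, size=(1, 16))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1))
b2 = np.zeros(1)

def forward(X):
    """Forward pass; returns intermediate results needed by the backward pass."""
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)          # ReLU hidden layer
    return Z1, A1, A1 @ W2 + b2     # no activation on the output layer

def mse(X, y):
    return float(np.mean((forward(X)[2] - y) ** 2))

initial_loss = mse(X, y)
eta, batch_size, n_epochs = 0.01, 20, 100

for epoch in range(n_epochs):                      # step 1: epochs and mini-batches
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]

        Z1, A1, Y_hat = forward(Xb)                # step 2: forward pass
        dY = 2 * (Y_hat - yb) / len(Xb)            # step 3: gradient of the MSE loss

        dW2 = A1.T @ dY                            # step 4: output-layer gradients
        db2 = dY.sum(axis=0)

        dA1 = dY @ W2.T                            # step 5: propagate error backward
        dZ1 = dA1 * (Z1 > 0)                       # derivative of ReLU
        dW1 = Xb.T @ dZ1
        db1 = dZ1.sum(axis=0)

        W1 -= eta * dW1                            # step 6: gradient descent step
        b1 -= eta * db1
        W2 -= eta * dW2
        b2 -= eta * db2

final_loss = mse(X, y)
print(f"MSE before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The forward pass keeps `Z1` and `A1` around because the backward pass reuses them, which is why backprop needs only one forward and one backward pass per mini-batch.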

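**Note** The connection weights of all hidden layers should be initialized randomly. If all weights start out equal, every neuron in a layer receives the same updates and the neurons stay identical.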

Popular activation functions:
1. Logistic (sigmoid): $\sigma(z) = 1 / (1 + \exp(-z))$
2. Hyperbolic tangent: $\tanh(z) = 2\sigma(2z) - 1$
3. Rectified linear unit: $\text{ReLU}(z) = \max(0, z)$
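
These three functions are easy to write in NumPy. The sketch below also checks the identity $\tanh(z) = 2\sigma(2z) - 1$ numerically; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: max(0, z) element-wise."""
    return np.maximum(0.0, z)

z = np.linspace(-5, 5, 11)

# tanh is a rescaled and shifted sigmoid
identity_holds = np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
print(identity_holds)  # True

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```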

We need activation functions to get rid of the linearity constraint: a chain of linear transformations is itself just a linear transformation. Nonlinearity between layers is needed to learn patterns from nonlinear datasets.
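
A quick numerical check of this point, using arbitrary random matrices: two stacked layers without an activation function give exactly the same outputs as one linear layer, while putting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two stacked layers with no activation function in between
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X @ W_combined + b_combined

same_without_activation = np.allclose(two_layers, one_layer)
print(same_without_activation)  # True

# With a ReLU between the layers the network is no longer a single linear map
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
same_with_relu = np.allclose(with_relu, one_layer)
print(same_with_relu)
```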

### Regression MLP

In [9]:
'''
MLP for regression task: One output neuron per label you want to predict.

Use sklearn.neural_network.MLPRegressor class
    Build MLP with 3 hidden layers, 50 neurons each
    Train it on California housing dataset
'''
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Get data and separate into training and test sets
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data,
                                                    housing.target,
                                                    random_state=42)

In [None]:
'''
Create MLPRegressor model with 3 hidden layers, 50 neurons each
    First hidden layer's input size (row in weight matrix) and 
    output layer size (columns in weight matrix) will adjust
    automatically to dimensionality of inputs and targets when training
    starts.

    Model uses ReLU in all hidden layers with no activation function
        at output layer

sklearn.neural_network.MLPRegressor(params)
    params: 
        hidden_layer_sizes=[i,j,k,...] where i,j,k,... are sizes of each
            hidden layer
        early_stopping=<bool>
            If true, will automatically set aside 10% of training data for 
                evaluation at each epoch.
            Can adjust size by setting validation_fraction to be some number.
        verbose=<bool>
        random_state=<int>
    If validation score stops improving for 10 epochs, will automatically stop
        Can change number of epochs by setting n_iter_no_change
        
'''

mlp_reg = MLPRegressor(hidden_layer_sizes=[50,50,50], early_stopping=True,
                       verbose=True,random_state=42)

In [None]:
''' 
Create a pipeline to standardize input features before sending them to the regressor
    Gradient descent does not converge well if features have very different scales.

Pipeline does: Standardize input features -> run MLPRegressor

'''
pipeline = make_pipeline(StandardScaler(), mlp_reg)
pipeline.fit(X_train, y_train)

Iteration 1, loss = 0.85190332
Validation score: 0.534299
Iteration 2, loss = 0.28288639
Validation score: 0.651094
Iteration 3, loss = 0.22884372
Validation score: 0.699782
Iteration 4, loss = 0.20746145
Validation score: 0.720468
Iteration 5, loss = 0.19649383
Validation score: 0.724839
Iteration 6, loss = 0.18928708
Validation score: 0.740084
Iteration 7, loss = 0.18132029
Validation score: 0.747406
Iteration 8, loss = 0.17556450
Validation score: 0.753945
Iteration 9, loss = 0.17190651
Validation score: 0.760500
Iteration 10, loss = 0.16687650
Validation score: 0.759213
Iteration 11, loss = 0.16329479
Validation score: 0.761907
Iteration 12, loss = 0.16054473
Validation score: 0.768950
Iteration 13, loss = 0.15690181
Validation score: 0.762699
Iteration 14, loss = 0.15630644
Validation score: 0.766003
Iteration 15, loss = 0.15712517
Validation score: 0.778464
Iteration 16, loss = 0.15155981
Validation score: 0.774237
Iteration 17, loss = 0.14957641
Validation score: 0.778361
Iterat


In [15]:
''' 
We just trained our first MLP, requiring 45 epochs

The validation score is the R^2 score by default. It is close to 0.8 (about 0.79), which is pretty good
'''
print(mlp_reg.best_validation_score_)

#RMSE
y_pred = pipeline.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
rmse

0.791536125425778


0.5327699946812925

Here is what our MLP looks like:

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9798341607972/files/assets/hmls_0909.png" height=300>

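This MLP does not use any activation function for the output layer
- This is generally okay for regression, since the output can then take any value.
- If you want to guarantee that the output is never negative, use ReLU on the output layer, or softplus, a smooth variant of ReLU where softplus(z) = log(1 + exp(z)).
- To guarantee that predictions fall within a given range of values, use the sigmoid or hyperbolic tangent and scale the targets to the matching range (0 to 1 for sigmoid, -1 to 1 for tanh).
- The MLPRegressor class does not let you choose an activation function for the output layer.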
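
A short NumPy sketch of what each output activation does to a raw output value. The raw values are made up, and `np.logaddexp(0, z)` is used as a numerically stable way to compute log(1 + exp(z)).

```python
import numpy as np

def softplus(z):
    """log(1 + exp(z)), computed stably as logaddexp(0, z)."""
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw (pre-activation) outputs of an output neuron
raw = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print(softplus(raw))        # strictly positive, close to ReLU for large |z|
print(np.maximum(0, raw))   # ReLU: never negative, exactly 0 for z <= 0
print(sigmoid(raw))         # always between 0 and 1
print(np.tanh(raw))         # always between -1 and 1
```

Since MLPRegressor does not expose an output activation, applying one of these would require a different library or a custom model.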

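MSE is generally the right loss function for regression. If there are a lot of outliers in the training set, it is preferable to use the mean absolute error or the Huber loss, which is a combination of both (quadratic for small errors, linear for large ones).
- MLPRegressor uses the squared-error loss by default and supports neither MAE nor Huber loss (scikit-learn 1.7 added a `loss` parameter with a `poisson` option).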
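
To see why outliers matter, here is a small sketch comparing the three losses on made-up predictions where one error is much larger than the others. The Huber threshold `delta=1.0` is an assumed value.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2
    linear = delta * (error - 0.5 * delta)
    return np.mean(np.where(error <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 14.0])   # the last prediction is an outlier

print(mse(y_true, y_pred))    # 25.125
print(mae(y_true, y_pred))    # 2.75
print(huber(y_true, y_pred))  # 2.4375
```

The single outlier contributes 100 to the squared-error sum but only 9.5 to the Huber sum, so it dominates MSE far more than the other two losses.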

### Classification MLPs
tbd

### Hyperparameter Tuning

**# of hidden layers**
For many problems, one hidden layer can be enough for reasonable results.

However, deep neural networks have much higher parameter efficiency: they can model complex functions with far fewer neurons than shallow networks.
Their hierarchical architecture helps DNNs converge faster to a good solution and also helps them generalize to new datasets.

Start with 1 or 2 hidden layers. For more complex problems, ramp up the number of hidden layers until the model starts overfitting the training set.

Very complex tasks such as large image classification or speech recognition typically require networks with dozens or even hundreds of layers (but not fully connected ones) and a huge amount of training data.

You will rarely have to train such a network from scratch. It is common to reuse parts of a pretrained network that performs a similar task; training is then a lot faster and requires much less data.

**# of neurons per hidden layer**
- The number of neurons in the input and output layers is determined by the type of input and output the task requires, e.g. MNIST requires 28 x 28 = 784 inputs and 10 output neurons.
- For the hidden layers, it used to be common to size them to form a pyramid, with fewer and fewer neurons at each layer.
- A typical network for MNIST might have had 3 hidden layers, the first with 300 neurons, the second with 200, and the third with 100. However, this practice has largely been abandoned; the same number of neurons can be used in all hidden layers. Depending on the dataset, it can help to make the first hidden layer a bit larger.
- Try building a model with more layers and neurons than you need, then use early stopping and other regularization techniques to prevent overfitting.

**Learning Rate**
A good learning rate is about half the maximum learning rate (the rate above which training diverges).
To find it, train the model for a few hundred iterations, starting from a very low learning rate and gradually increasing it to a very large value.
Plot the loss as a function of the learning rate: it should drop at first, then shoot back up once the learning rate gets too large.
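
A rough sketch of this procedure on a made-up linear regression problem trained with mini-batch gradient descent. The range `lr_min=1e-5` to `lr_max=10`, the 300 iterations, the batch size, and the rule of stopping once the loss exceeds 4 times its starting value are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up linear regression problem
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
n_iterations, batch_size = 300, 32
lr_min, lr_max = 1e-5, 10.0

# Multiply the learning rate by a constant factor at every iteration
factor = (lr_max / lr_min) ** (1 / (n_iterations - 1))

lr = lr_min
lrs, losses = [], []
for _ in range(n_iterations):
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    error = Xb @ w - yb
    loss = float(np.mean(error ** 2))
    lrs.append(lr)
    losses.append(loss)
    if not np.isfinite(loss) or loss > 4 * losses[0]:
        break                                # training has diverged, stop here
    w -= lr * 2 * Xb.T @ error / batch_size  # gradient step on the MSE
    lr *= factor

best = int(np.argmin(losses))
print(f"lowest loss {losses[best]:.4f} at learning rate {lrs[best]:.4g}")
```

Plotting `losses` against `lrs` on a logarithmic x-axis (for example with matplotlib's `plt.semilogx`) gives the curve described above; a reasonable learning rate sits somewhat below the point where the loss starts climbing.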

