# Neural Networks


We started thinking about machine learning with the basic idea that our target
variable ($y_i$) is related to the features $\mathbf{x}_i$
by some function (for sample $i$):

$$ y_i =f(\mathbf{x}_i)$$

But we don't know that function exactly, so we assume a type of function (a decision
  tree, a boundary for SVM, a probability distribution) that has some parameters
  $\theta$, and then use a machine
  learning algorithm $\mathcal{A}$ to estimate the parameters for $f$.  In the
  decision tree the parameters are the thresholds to compare to, in GaussianNB the parameters are the means and variances, and in SVM they are the support vectors that define the margin.

$$\theta = \mathcal{A}(X,y) $$

We can then use the estimated parameters to make predictions on our test data:

$$ \hat{y}_i = f(\mathbf{x}_i;\theta) $$
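
As a minimal sketch of this fit-then-predict pattern, the following assumes `GaussianNB` as the form of $f$: `fit` plays the role of $\mathcal{A}$, and the learned per-class means (`theta_`) are part of $\theta$. The variable names and the fixed `random_state` are illustrative choices, not part of the notes above.

```python
from sklearn import datasets, model_selection
from sklearn.naive_bayes import GaussianNB

# load the data and hold out a test set (random_state fixed only for repeatability)
X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=0
)

# theta = A(X, y): the learning algorithm estimates the parameters
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# part of theta for GaussianNB: one mean per class per feature
class_means = gnb.theta_

# y_hat = f(x; theta): use the fitted parameters to predict on the test data
y_pred = gnb.predict(X_test)
```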

A neural net lets us avoid assuming a specific form for $f$ up front; instead, it
acts as a universal function approximator.  For one hidden layer and a binary classification problem:


$$f(x) = W_2g(W_1^T x +b_1) + b_2 $$

where the function $g$ is called the activation function. So we approximate some
unknown, complicated function $f$ by taking a weighted sum of all of the inputs
and passing that sum through another, known function.
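
To make the formula concrete, here is a small NumPy sketch of the forward computation for one hidden layer. The choice of ReLU for $g$, the array shapes, and the tiny hand-picked weights are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    """One common choice for the activation function g."""
    return np.maximum(0, z)

def forward(x, W1, b1, W2, b2, g=relu):
    """Compute f(x) = W2 g(W1^T x + b1) + b2 for one hidden layer."""
    hidden = g(W1.T @ x + b1)   # weighted sums of the inputs, then activation
    return W2 @ hidden + b2     # weighted sum of the hidden units

# tiny hand-checkable example: 2 inputs, 2 hidden units, 1 output
x = np.array([1.0, -2.0])
W1 = np.eye(2)                  # shape (n_inputs, n_hidden)
b1 = np.zeros(2)
W2 = np.array([[2.0, 3.0]])     # shape (n_outputs, n_hidden)
b2 = np.array([1.0])

output = forward(x, W1, b1, W2, b2)
```

Here the hidden layer computes `relu([1, -2]) = [1, 0]`, so the output is `2*1 + 3*0 + 1 = 3`.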

In [1]:
from sklearn.neural_network import MLPClassifier
from sklearn import svm
import pandas as pd
import sklearn

from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn import model_selection

We're going to use the digits dataset again.

In [2]:
digits = datasets.load_digits()
digits_X = digits.data
digits_y = digits.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(digits_X,digits_y)

In [3]:
digits.images[0]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

Sklearn provides an estimator for the Multi-Layer Perceptron (MLP). We can start with
one hidden layer.

In [4]:
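mlp = MLPClassifier(
  hidden_layer_sizes=(16,),  # one hidden layer with 16 units
  max_iter=100,              # cap on optimizer iterations
  alpha=1e-4,                # L2 regularization strength
  solver="lbfgs",
  verbose=10,
  random_state=1,
  learning_rate_init=0.1,    # only used by the "sgd" and "adam" solvers
)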

In [5]:
mlp.fit(X_train,y_train).score(X_test,y_test)

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         1210     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  9.21925D+00    |proj g|=  7.00278D+00

At iterate    1    f=  8.16567D+00    |proj g|=  7.22122D+00

At iterate    2    f=  3.16228D+00    |proj g|=  1.95656D+00

At iterate    3    f=  2.36620D+00    |proj g|=  3.94812D-01

At iterate    4    f=  2.25501D+00    |proj g|=  2.35379D-01

At iterate    5    f=  2.12211D+00    |proj g|=  2.92144D-01

At iterate    6    f=  1.97474D+00    |proj g|=  3.52230D-01

At iterate    7    f=  1.71572D+00    |proj g|=  3.81059D-01

At iterate    8    f=  1.55334D+00    |proj g|=  5.65391D-01

At iterate    9    f=  1.49638D+00    |proj g|=  3.19177D-01

At iterate   10    f=  1.42795D+00    |proj g|=  2.96076D-01

At iterate   11    f=  1.31674D+00    |proj g|=  2.97448D-01

At iterate   12    f=  1.18559D+00    |proj g|=  3.85525D-01

At iterate   13    f=  1.1

 This problem is unconstrained.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


0.9288888888888889
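
The warning above suggests increasing `max_iter` or scaling the data. A minimal sketch of doing both, assuming a `StandardScaler` in a pipeline and `max_iter=1000` (both illustrative choices, not settings used in class). The resulting score depends on the split and is not shown here.

```python
from sklearn import datasets, model_selection
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=0
)

# standardize each feature, then fit the same small MLP
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(16,),
        max_iter=1000,          # higher iteration cap than the 100 used above
        alpha=1e-4,
        solver="lbfgs",
        random_state=1,
    ),
)
scaled_mlp.fit(X_train, y_train)
scaled_score = scaled_mlp.score(X_test, y_test)
```

Putting the scaler inside the pipeline means it is fit on the training data only, so the test data does not influence the scaling.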

We can compare it to an SVM:

In [6]:
svm_clf = svm.SVC(gamma=0.001)
svm_clf.fit(X_train, y_train)
svm_clf.score(X_test,y_test)

0.9911111111111112

The SVM performed a bit better, but this is a simple problem.
We can also compare the models by how much they store; the number of parameters
is related to the complexity.

In [7]:
import numpy as np

In [8]:
# total values stored by the SVM: (number of support vectors) x (number of features)
np.prod(list(svm_clf.support_vectors_.shape))

43968

In [9]:
# count the weights only; the bias terms are stored separately in mlp.intercepts_
np.sum([np.prod(list(c.shape)) for c in mlp.coefs_])

1184
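
The count above can also be worked out by hand from the layer sizes. The helper below is a hypothetical sketch (not an sklearn function) that counts one weight per connection and, optionally, one bias per unit.

```python
def count_parameters(layer_sizes, include_biases=True):
    """Count parameters of a fully connected network.

    layer_sizes lists the number of units per layer, input first,
    e.g. [64, 16, 10] for 64 pixels, 16 hidden units, 10 classes.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out          # one weight per connection
        if include_biases:
            total += n_out             # one bias per unit in the next layer
    return total

weights_only = count_parameters([64, 16, 10], include_biases=False)
weights_and_biases = count_parameters([64, 16, 10])
wider_network = count_parameters([64, 64, 10])
```

With these layer sizes the weights-only count is 1184, matching the sum over `mlp.coefs_`, and the count with biases is 1210, matching `N = 1210` in the optimizer log. For a 64-unit hidden layer the count with biases is 4810, the `N` reported in the second log.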

In [10]:
mlp.coefs_

[array([[-0.04544792,  0.12067404, -0.27379261, ...,  0.20709889,
         -0.25885478,  0.09336685],
        [-0.04593881,  0.03009238, -0.19069045, ...,  0.2006957 ,
         -0.22772407, -0.17190691],
        [ 0.21563462, -0.03506171, -0.14037903, ..., -0.13049391,
          0.11532447, -0.08406878],
        ...,
        [-0.14630541, -0.05092604, -0.23553127, ...,  0.08691572,
          0.05896777, -0.07293738],
        [ 0.00778229, -0.10123061, -0.07661302, ..., -0.21167286,
         -0.29100928, -0.50212262],
        [-0.25364507,  0.10567017,  0.04284603, ...,  0.05389517,
         -0.04506549, -0.20359177]]),
 array([[-0.4195524 , -0.0093074 , -0.25021264,  0.43949534,  0.37288252,
          0.11959235,  0.45675145, -0.14863813,  0.3823662 , -0.04965362],
        [-0.239262  ,  0.43213332,  0.17866601, -0.404468  ,  0.28782924,
          0.19171586,  0.10122457,  0.23149726, -0.22656885,  0.10900253],
        [ 1.54655759, -0.90875783, -0.7039524 , -0.71434051,  0.55232251,
 

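Next we try a wider hidden layer, with 64 units instead of 16.

In [11]:
mlp64 = MLPClassifier(
  hidden_layer_sizes=(64,),  # one hidden layer with 64 units
  max_iter=100,
  alpha=1e-4,
  solver="lbfgs",
  verbose=10,
  random_state=1,
  learning_rate_init=0.1,    # only used by the "sgd" and "adam" solvers
)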

In [12]:
mlp64.fit(X_train,y_train).score(X_test,y_test)

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         4810     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.04315D+01    |proj g|=  8.11718D+00

At iterate    1    f=  9.75424D+00    |proj g|=  4.90192D+00

At iterate    2    f=  8.69201D+00    |proj g|=  5.11136D+00

At iterate    3    f=  7.06707D+00    |proj g|=  3.08950D+00

At iterate    4    f=  5.19189D+00    |proj g|=  2.03877D+00

At iterate    5    f=  3.57584D+00    |proj g|=  3.71314D+00

At iterate    6    f=  2.12478D+00    |proj g|=  1.07366D+00

At iterate    7    f=  1.57146D+00    |proj g|=  8.76974D-01

At iterate    8    f=  1.08808D+00    |proj g|=  5.25384D-01

At iterate    9    f=  8.16400D-01    |proj g|=  3.57840D-01

At iterate   10    f=  5.45655D-01    |proj g|=  2.26321D-01

At iterate   11    f=  4.22838D-01    |proj g|=  3.20178D-01

At iterate   12    f=  3.12334D-01    |proj g|=  1.47141D-01

At iterate   13    f=  2.5

 This problem is unconstrained.


0.9733333333333334

## Questions After Class

### Roughly, how does the model know to use certain functions as the fitting becomes more complex (e.g. sin(x), ln(x), e^x)?

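It does not pick named functions or learn an analytical form. It only adjusts its weights so that the combination of weighted sums and activation functions approximates the target function on the training data.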
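
As a rough illustration, the sketch below fits a small network to samples of $\sin(x)$. The network is only given input/output pairs, never the formula. The use of `MLPRegressor`, the `tanh` activation, and the layer size are assumptions chosen for this example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# sample the target function; the network sees only (x, y) pairs, not the formula
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel()

reg = MLPRegressor(
    hidden_layer_sizes=(32,),
    activation="tanh",
    solver="lbfgs",
    max_iter=2000,
    random_state=0,
)
reg.fit(x, y)

# mean squared error between the network's approximation and sin(x)
mse = np.mean((reg.predict(x) - y) ** 2)
```

The fitted model contains only weight matrices and bias vectors; nothing in it names or stores a sine function.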

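### When doing the .score on the mlp does the limit vary or does it have a set limit on its own?

The limit in the output above comes from training, not scoring. `max_iter` (100 in our examples) caps the number of optimizer iterations in `fit`, and we can change it. `score` only predicts on the data it is given and computes the accuracy, so it has no iteration limit.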



### What is TensorFlow used for that scikit-learn can't do?

TensorFlow supports more types of networks and has more options for training.  Most importantly, it has code optimizations that let you use specialized hardware, such as GPUs, directly.

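### When you say weight, what does that mean?

Weights are the coefficients that multiply each input to a neuron. A weight with a larger magnitude gives that input more influence on the neuron's output. The weights are the parameters the network learns; in sklearn they are stored in `mlp.coefs_`.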


### What is an artificial neuron?

An artificial neuron is one "unit" of calculation.  A neuron takes a weighted sum of all of its inputs, adds a bias term, and passes the result through an "activation function". Some activation functions, such as the logistic sigmoid, squash the output into [0,1]; others, such as ReLU (the default for `MLPClassifier`), do not bound it above.
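
A single neuron can be sketched in a few lines of plain Python. The function names and the example numbers are hypothetical, chosen so the result is easy to check by hand, and the two activations shown are only two of many possible choices.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """ReLU: zero for negative inputs, unchanged otherwise (not bounded above)."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs = [1.0, 2.0]
weights = [0.5, -0.25]

sigmoid_out = neuron(inputs, weights, bias=0.0, activation=sigmoid)
relu_out = neuron(inputs, weights, bias=3.0, activation=relu)
```

The weighted sum is `0.5*1 - 0.25*2 = 0`, so the sigmoid neuron outputs 0.5. With a bias of 3 the ReLU neuron outputs 3, which lies outside [0,1].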

### What real life problems require TensorFlow?

Most modern deep learning applications, such as image, speech, and language models, are built with TensorFlow, PyTorch, or a similar library.

### What do the hidden layers of the neural network represent?

We do not specify exactly what they represent up front; we can use model explanation techniques and visualization tools to examine them after the fact and try to interpret them if needed.



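### What is the best way to optimize a neural net? Would it be just adding more layers?

Not necessarily. You could specify a range of values for some of the parameters (for example, the number and size of the hidden layers) and use `GridSearchCV` to choose among them. There are also different types of layers. We will see that later.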
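
A minimal sketch of such a search on the digits data follows. The grid values, `cv=3`, and `max_iter=500` are hypothetical choices made to keep the example small, not recommended settings.

```python
from sklearn import datasets, model_selection
from sklearn.neural_network import MLPClassifier

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=0
)

# hypothetical grid: two layer layouts and two regularization strengths
param_grid = {
    "hidden_layer_sizes": [(16,), (32, 16)],
    "alpha": [1e-4, 1e-2],
}

search = model_selection.GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=500, random_state=1),
    param_grid,
    cv=3,
)
search.fit(X_train, y_train)

best_settings = search.best_params_          # combination with the best CV score
test_score = search.score(X_test, y_test)    # accuracy of the refit best model
```

Note that `(32, 16)` means two hidden layers, so the search compares a deeper network against a single-layer one rather than assuming that more layers is better.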



### Are the weights given to the hidden layers initially random?

Typically yes; they are initialized randomly and then learned during training.
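
One rough way to see this in sklearn is through `random_state`, which seeds the initial weights: the same seed reproduces the same fitted weights, while a different seed starts somewhere else and generally ends somewhere else. The helper name and the short `max_iter=20` are assumptions for this sketch.

```python
import warnings

import numpy as np
from sklearn import datasets
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

X, y = datasets.load_digits(return_X_y=True)

def first_layer_weights(seed):
    """Fit a small MLP briefly and return its first-layer weight matrix."""
    mlp = MLPClassifier(
        hidden_layer_sizes=(16,), solver="lbfgs", max_iter=20, random_state=seed
    )
    with warnings.catch_warnings():
        # training is cut short on purpose, so silence the expected warning
        warnings.simplefilter("ignore", ConvergenceWarning)
        mlp.fit(X, y)
    return mlp.coefs_[0]

same_seed_match = np.allclose(first_layer_weights(1), first_layer_weights(1))
different_seed_match = np.allclose(first_layer_weights(1), first_layer_weights(2))
```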

<!--
### on tensorflow playground, if we increase the weight is that increasing the amount we are feeding within the hidden layer?

### Do the neurons' layers have to be specified in the models we are going to use, or they are already specified for each model?

### Are hidden layers just a number of masks that help the function determine what the overall classification should be?
-->