Neural Networks
===

Neural networks (NNs) are an old idea -- as old as the 1940s -- but they have been revitalized recently by the Deep Learning movement, which has set state-of-the-art results on many benchmarks that require mapping input data to outputs -- non-parametric modeling.

How Biological Neurons Work
---

Dendrites: branching filaments that receive signals from neighboring neurons

Nucleus: core of the neuron's cell body that contains the genetic code (brain of the cell)

Axon: shaft that carries the signal away from the cell body

Axon Terminal: end of the axon, near the dendrites of the next neuron

Synapse: chemically rich gap between an axon terminal and the next neuron's dendrites, where information is transferred from neuron to neuron

An electrical signal is generated at the left, processed by the cell body around the nucleus, and either is or is not passed on to the next neuron or output!

How Artificial Neurons Work:
---

Input values (features/samples) are passed along, through a set of weights, to be summed together in the next layer of neurons.

A neuron, based on its input, either fires or does not fire
- the summation gets passed through a threshold or step function
- if the input summation is below the threshold, the threshold function results in a 0
- if the input summation is above the threshold, the threshold function results in a 1
- This is a perceptron
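
The firing rule above can be sketched in a few lines of plain Python. This is only an illustration: the function name `perceptron`, the example weights, and the threshold value are made up for this sketch and do not come from any library.

```python
def perceptron(inputs, weights, threshold=0.0):
    """Return 1 if the weighted sum of the inputs is above the threshold, else 0."""
    # weighted sum of the inputs
    total = sum(x * w for x, w in zip(inputs, weights))
    # threshold (step) function
    return 1 if total > threshold else 0

# Hypothetical inputs and weights, chosen only to show both outcomes
fires = perceptron([1.0, 0.5], [0.6, 0.4], threshold=0.5)   # 0.6 + 0.2 = 0.8, above 0.5
silent = perceptron([1.0, 0.5], [0.1, 0.2], threshold=0.5)  # 0.1 + 0.1 = 0.2, below 0.5
print(fires, silent)
```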

After this neuron is or is not activated, its output becomes the input to the next connected (hidden) layer in the neural network

Note that people typically do not use an actual step function because {0,1} is too binary.  
- they would like to have a little more dynamic range or flexibility in each layer
- so people normally use a sigmoid function, which has a gradual step from 0 to 1
- it is called sigmoid because its curve looks like an S (the name comes from the Greek letter sigma)
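
To make the difference concrete, here is a small sketch that compares a hard step with the logistic sigmoid, $1 / (1 + e^{-z})$, on a few values. The helper names `step` and `sigmoid` are just for this example, and the sigmoid is not guarded against overflow for very large negative inputs.

```python
import math

def step(z, threshold=0.0):
    """Hard threshold: the output jumps straight from 0 to 1."""
    return 1 if z > threshold else 0

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Print the input, the step output, and the (rounded) sigmoid output side by side
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(z, step(z), round(sigmoid(z), 3))
```

The sigmoid passes through 0.5 at `z = 0` and flattens out toward 0 and 1 on either side, which is the gradual step described above.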

Because the "threshold" function is no longer a literal "threshold", we now call it the "Activation" function.

This tends to break down because the output of each layer is a function of x's and w's:
$$ y = f(\bar{x},\bar{w}) $$

Deep Neural Nets
---


That was just a single neuron; let's see what an actual neural network looks like:

The input data is processed through the activation functions onto the next layer. This can happen via multiple neurons and activation functions, as deep and complex as desired. Then the output goes to the next layer, and the next layer, on and on, until the output layer, which is the prediction of the NN that is compared to the labels in a classification scheme.


A deep network is any neural network with more than 1 hidden layer. In the example image, we have 2 hidden layers, and therefore it is a **deep** neural network.
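
As a rough sketch of how data flows through such a network, the plain-Python example below pushes two input values through two hidden layers and an output layer. The layer sizes and weight values are hypothetical, and the helper names `dense_layer` and `feed_forward` are invented for this illustration.

```python
import math

def sigmoid(z):
    """Smooth activation function with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights):
    """Each row of `weights` belongs to one neuron: weighted sum of all inputs, then activation."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, neuron_weights)))
            for neuron_weights in weights]

def feed_forward(inputs, layers):
    """The output of each layer becomes the input of the next one."""
    activations = inputs
    for weights in layers:
        activations = dense_layer(activations, weights)
    return activations

# Hypothetical weights: 2 inputs -> 3 neurons -> 2 neurons -> 1 output
layers = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # hidden layer 1
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],     # hidden layer 2
    [[0.7, 0.8]],                           # output layer
]
prediction = feed_forward([1.0, 2.0], layers)
n_hidden_layers = len(layers) - 1
print(n_hidden_layers, prediction)
```

For simplicity this sketch applies the sigmoid to the output layer as well; the TensorFlow model later in these notes leaves the output layer without an activation.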

---
That's it! Why did it take so long to come to fruition?

Mostly it took a long time for DNNs to become useful because they require *so* much data for accurate estimation that the amount of data we had before was just not enough to make DNNs useful or accurate.

In the first example we will cover, there are 60k samples; and that's not even 'big' by DNN standards. Most of the commercial examples out there that are doing 'crazy' things have more than 500M samples.  After about 500M samples, it appears that you hit diminishing returns and are not really seeing much better results; but that's about what it takes right now.

Recall from the SVM, we had what was called a *convex* optimization problem.  That meant that to optimize the SVM hyperparameters, we only had to traverse in the constraint space along the tangents to the convex solution curves.

But, with DNNs, the solution space (think *like* $\chi^2$ space) could have many bumps and wiggles, with respect to both time and parameters.

In addition to the SVM being simpler to solve altogether, it only had 2 parameters that mattered: $\textbf{w}$, and $b$.  But, with the DNN, even in the simple 2 HL example that we have above, there are 33 weights that all have to be optimized simultaneously; and they are *all* correlated to one another!
- that's a lot of unique weights, and a lot of variables
- it's a very challenging problem for both mathematics and computations to control for them

Also, you need a lot of data!  The more weights the network has, the more samples it takes to train them.

So it was kind of a 'perfect storm' that stopped NNs from growing in popularity.
- But these days we have both a lot of data *and* faster computers to process all of those data sets
    -- 'perfect storm'
- That's what really changed and allowed the DNNs to truly shine

With respect to normal classification problems, *I* think that the DNNs perform about the same as other classifiers; for example the state of the art SVM could be maybe 97%, while the state of the art DNN could perform 98% or 99%.
- That little tiny percentage is what everyone is fighting over right now
- But to me that is not as impressive as what DNNs are doing to the data

To *Sentdex*, what's more impressive is the modeling aspect of what DNNs are *doing* to the data.  People don't fully understand how the modeling with DNNs works, but the modeling does do very very very well.
- There is a lot of digging and analysis to demystify what is going on in the DNN
- But we won't have enough data to really understand that
- "it's just not possible for us"

For example, consider "Jack is 12; Jane is 10; Kate is older than Jane and younger than Jack. How old is Kate?"
    - Kate is 11
- to get that answer, you have to employ a little bit of logic
- MOST classification schemes today are not able to understand logic; they are great at classifying, but not interpreting
- Up until *very* recently, if you wanted to make an algorithm to answer that, then you would have had to build a machine to model the linguistics and to know those linguistics yourself.
- Whereas with a neural network, you don't!
    - you take a bunch of examples like that and then just chug through (say 1M or 4M times) 
        and the NN can figure that out on its own
    - it learns how to model on its own

That is what is so impressive! The output that we get and the methodology that we use is what's the most fascinating to me (and Sentdex)

Where to get this 4M data?
---

Images: imagenet

Text: Wikipedia data dump, crawl reddit, twitter?

Speech: Tatoeba (?)

Common crawl: huge huge huge data set! 
    - petabytes of parsed websites
    
Google and Facebook actually possess on their own servers enough data to make this work

Packages
---

We will be using TensorFlow because it's new -- beta at the time of this taping

There is Keras / Theano / Torch, but they all pretty much do the same things
    - they use sequential layers of weights to map the inputs to the outputs of the training data sets!

---

Running Tensorflow
---

Tensorflow takes all of your inputs / labels, then runs in the background and returns with the answers. This is something in between Python and Compiled coding; but it's more like Compiled coding

TF does have an interactive session, but you should build the code to be run at optimal usage, like the compiled versions

DNN packages like Tensorflow, Theano, Keras, Caffe, Torch, etc, are just matrix processing libraries. They are just stacks of packages that can multiply matrices in fast ways.

So "what's a tensor?" -- it's an array-like value with any number of dimensions (a scalar, a vector, a matrix, and so on)

All tensorflow does is process functions on a set of tensors

If you can convert your problem to be answered by processing a function on an array/matrix/tensor, then you can use DNNs to solve that problem

It just so happens that deep learning is just 'that place' where this method is really needed.
- Tensorflow is a Deep Learning library because it has tons of Deep Learning functions and packages

---
How Tensorflow is setup:
---

1. Define your model in sort of abstract terms
    - this is where you are building your "computation graph"

2. When you're ready, you then 'run the session'
    - that runs the graph and everything is done in the backend
    - then you get your result back
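
The "define first, run later" idea can be illustrated without TensorFlow at all. The toy below is only a sketch of the concept: the `Node` class and the `constant`, `multiply`, and `run` helpers are invented for this example and are not TensorFlow's API or its actual implementation.

```python
class Node:
    """A node in a toy computation graph: it stores an operation and its inputs, not a value."""
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

def constant(value):
    # a node with no inputs whose operation just returns the stored value
    return Node(lambda: value)

def multiply(a, b):
    # a node that multiplies the values of two other nodes once it is run
    return Node(lambda x, y: x * y, a, b)

def run(node):
    """Evaluate a node by first evaluating everything it depends on."""
    return node.op(*(run(i) for i in node.inputs))

# Step 1: define the model; nothing is computed yet
a = constant(5)
b = constant(6)
product = multiply(a, b)

# Step 2: run the graph to get the result back
value = run(product)
print(value)
```

Until `run` is called, `product` is only a description of the computation, which is the same reason printing a TensorFlow tensor shows an abstract `Tensor(...)` object rather than a number.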


In [4]:
import tensorflow as tf

We are making the following constants to be very simple, but they can be `variables` or `placeholders`
- We'll get into that later

In [2]:
x1  = tf.constant(5)
x2  = tf.constant(6)

Tensorflow will recognize simplistic multiplication such as

In [3]:
result = x1*x2

We can get away with this because TensorFlow overloads the `*` operator on tensors to build the same element-wise multiply op

The explicit way of writing this uses `tf.mul` (renamed `tf.multiply` in TensorFlow 1.0)

In [4]:
result = tf.mul(x1,x2)

This is for multiplying simple numbers together.  But there is a much more useful function for multiplying matrices, called `matmul`

In [5]:
print(result)

Tensor("Mul:0", shape=(), dtype=int32)


This result is an 'abstract' tensor in the graph `result`

To run the session, you can use:

In [None]:
sess = tf.Session()
print(sess.run(result))
sess.close()

Like a file, if you open it, you should close it 
-- like any connection object ever!

What we should probably do every time is something like this:

In [None]:
with tf.Session() as sess:
    print(sess.run(result))

Using `with`, much like with `file`, it will just open and close for you; so that you don't have to remember and/or won't forget to do so

Another way to achieve this is:

In [8]:
with tf.Session() as sess:
    output = sess.run(result)
    print(output)

print(output)

30
30


This shows that assigning the return value from `sess.run` to a variable stores it as an ordinary Python variable, which can be manipulated later

In contrast, the following code segment would crash if its last line were uncommented, because that line accesses the tf.Session after it has both processed all of the data *and* been closed!

In [24]:
with tf.Session() as sess:
    output = sess.run(result)
    print(output)

print(output)
# print(sess.run(result))

30
30


In [25]:
del sess, output

The computation graph is where you model everything; we build everything into it
- nodes, layers, data, features, etc are in the graph

Then we will run that graph to modify the weights iteratively
- we have to tell tensorflow what data goes where and which cost functions to use, but overall it's behind the veil

With **Tensorflow**, you have to do everything in 2 major chunks:
1. Build the graph with all of the data and activation functions
2. Run the graph to optimize the weights, which are everything that we are training

MNIST Model Build Example
===

We are going to learn to format the data later, but for now we will work with the MNIST data set because it's been pre-engineered.

After all of this, we will think about how to work with our own data sets

MNIST: hand written digits
---
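1. Train against the training digits (60k in MNIST; the TensorFlow loader used below keeps 55k for training)
2. Validate against the 5k digits that the loader holds out of the training set
3. Test against the 10k testing digits (after all training/validation is FINISHED!)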

Each 'feature' is a pixel intensity between 0 and 1
- 0 = 'just white space'; values closer to 1 = the pixel is 'part of the digit'

In [5]:
import tensorflow as tf

Layout:

Input data > weight > hidden layer 1 ( activation function ) > weights > hidden layer 2 ( activation function ) > weights > output layer

Repeat this process up until convergence of output layer onto the training labels to optimize the weights

**Feed Forward:** Taking a neural net in one direction, from the input data, through *any* number of layers, into the output layer for comparison with known labels
- Passes data 'straight through'
- at the end, we compare the prediction output to the data
- compute the cost or loss function (i.e. cross entropy)
- use an optimizer to attempt to minimize that cost function
    - we will use the AdamOptimizer
    - there also exists **Stochastic Gradient Descent (SGD)**, AdaGrad, ..., ~8 options
- the optimizer goes backwards through the NN to optimize the weights
    -- this is called **back propagation (backprop)**
- **Slang:** feed forward + backprop over the full training set = **"epoch"**
    - the algorithm will do this 15,20,50,... times
    - hopefully each cycle, we will lower the cost function 
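
The cycle above can be sketched with the smallest possible "network": a single weight fitted to toy data. To keep the example short it swaps in a squared-error cost and plain gradient descent instead of cross entropy and the AdamOptimizer; the data, learning rate, and variable names are all made up for this illustration.

```python
# Toy data generated from y = 3 * x; the "network" is a single weight w
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

w = 0.0               # initial weight
learning_rate = 0.01  # step size of the optimizer (assumed value)
n_epochs = 20
losses = []

for epoch in range(n_epochs):
    epoch_loss = 0.0
    for x_value, label in samples:
        prediction = w * x_value           # feed forward
        error = prediction - label         # compare the prediction to the known label
        epoch_loss += error ** 2           # squared-error cost for this sample
        gradient = 2.0 * error * x_value   # d(cost)/dw: the "backprop" step for one weight
        w -= learning_rate * gradient      # optimizer update (plain gradient descent)
    losses.append(epoch_loss)              # one full pass over the data = one epoch

print(w, losses[0], losses[-1])
```

Each pass over `samples` is one epoch, and the recorded loss shrinks from epoch to epoch as `w` approaches 3.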

In [27]:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('/tmp/data/', one_hot=True)

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


`one hot encoding` comes from electronics, where only one component is `hot` or active and the rest are off
    - if one component physically has electricity running through it, then NO others do

We will then have 10 classes, from 0,1,2,...,9

If you have a 0,1,2, ... digit, then normal labeling will result in 0,1,2,...

But, for `one hot encoding`:
0 = 1 0 0 0 0 0 0 0 0 0

1 = 0 1 0 0 0 0 0 0 0 0

2 = 0 0 1 0 0 0 0 0 0 0

...

8 = 0 0 0 0 0 0 0 0 1 0

9 = 0 0 0 0 0 0 0 0 0 1
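
The table above can be reproduced with a tiny helper. This is a plain-Python sketch (the function names `one_hot` and `from_one_hot` are made up here); with `one_hot=True`, the MNIST loader returns labels in this same form.

```python
def one_hot(label, n_classes=10):
    """Return a list with a 1 at position `label` and 0 everywhere else."""
    vector = [0] * n_classes
    vector[label] = 1
    return vector

def from_one_hot(vector):
    """Recover the integer label: the index of the single 'hot' entry."""
    return vector.index(1)

print(one_hot(2))
print(from_one_hot(one_hot(8)))
```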

---
MNIST Example Solution
---

3 hidden layers, with 500, 500, 500 nodes. Depending on the "thing you're trying to model", the number and depth of the layers will vary

In [28]:
print(mnist.train.images.shape)

(55000, 784)


In [6]:
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500

n_classes   = 10

n_pixels    = 784 # 28*28

Most of us will be able to load MNIST into our RAM, but that does not prepare us for the future
- especially if it's common crawl, which will be petabytes of data, far more than fits in RAM
- as a result, we set the `batch_size` to control the RAM usage
- it also adds a level of randomness to the NN training

In [30]:
batch_size = 100
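
A rough sketch of what batching means in practice: the samples are shuffled and handed out in chunks of `batch_size`. The generator name `iterate_batches` and the stand-in data are hypothetical; unlike a real large-scale pipeline, this toy still holds the whole list in memory and only shows the chunking and the shuffling.

```python
import random

def iterate_batches(samples, batch_size, seed=0):
    """Yield the samples in shuffled chunks of at most `batch_size` items."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)   # shuffling is where the randomness comes from
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

data = list(range(1050))   # hypothetical stand-in for 1050 training samples
batches = list(iterate_batches(data, batch_size=100))
print(len(batches), len(batches[0]), len(batches[-1]))
```

Note that this sketch keeps the final, smaller batch, whereas a loop over `int(num_examples / batch_size)` batches drops any remainder.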

Here is how we establish the data types and shapes that TF is expecting. 
- with a matrix = Height x Width
- we want to set the dimensions too. We will flatten the picture into a 1D array using [None, size]

From Kenny GoodMan in the YouTube comment section:

```python
def networks(data, nodes_and_layers):
    # one weight matrix per pair of adjacent layer sizes
    length = len(nodes_and_layers) - 1
    # create the weights and biases of every layer (hidden layers + output layer)
    hiddenLayers = []
    for i in range(length):
        hiddenLayers.append({
            'weights': tf.Variable(tf.random_normal([nodes_and_layers[i], nodes_and_layers[i+1]])),
            'biases': tf.Variable(tf.random_normal([nodes_and_layers[i+1]]))})
    # hidden layers: relu(input_data * weights + biases)
    layers = data
    for i in range(length - 1):
        layers = tf.nn.relu(tf.add(tf.matmul(layers, hiddenLayers[i]['weights']), hiddenLayers[i]['biases']))
    # output layer: no activation function
    return tf.matmul(layers, hiddenLayers[-1]['weights']) + hiddenLayers[-1]['biases']

size_of_output = 10
size_of_input = 784
nodes_and_layers = [size_of_input, 500, 500, 500, size_of_output]
# `data` must be the input tensor, e.g. a float placeholder of shape [None, 784]
networks(data, nodes_and_layers)
```

From Edoardo Pona:
```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
# number of nodes, so that n_nodes[1] is n_nodes_hl1 etc.; the first value is the 784 input pixels
n_nodes = [784, 500, 400, 300]
n_classes = 10

def neural_network_model(data):
    layers = [data]  # the first is going to be the input data
    hiddenLayers = []
    for i in range(len(n_nodes) - 1):
        # weights are [nodes in, nodes out]; biases have one value per node out
        hiddenLayers.append({'weights': tf.Variable(tf.random_normal([n_nodes[i], n_nodes[i+1]])),
                             'biases': tf.Variable(tf.random_normal([n_nodes[i+1]]))})

        layer = tf.add(tf.matmul(layers[i], hiddenLayers[i]['weights']), hiddenLayers[i]['biases'])
        layers.append(tf.nn.relu(layer))

    # output layer: built once after the loop, with no activation function
    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes[-1], n_classes])),
                    'biases': tf.Variable(tf.random_normal([n_classes]))}
    output = tf.matmul(layers[-1], output_layer['weights']) + output_layer['biases']
    return output
```

From Santiago Penate:
```python
def neural_network_model2(data, input_nodes, layer_nodes, output_nodes):
    
    nodes = [input_nodes] + layer_nodes + [output_nodes]
    
    n_layers = len(layer_nodes) + 1
    for i in range(n_layers):        
        print('Layer', i, ':' )
        print('\tweights:', nodes[i], 'x', nodes[i+1])
        print('\tbiases:',  nodes[i+1])
        w = tf.Variable(tf.random_normal([nodes[i], nodes[i + 1]]))
        b = tf.Variable(tf.random_normal([nodes[i + 1]]))

        # formula = data * weight + bias
        if i == 0:
            output = tf.add(tf.matmul(data, w), b)
            output = tf.nn.relu(output)
        else:
            output = tf.add(tf.matmul(output, w), b)
            if (i + 1) < n_layers:
                output = tf.nn.relu(output)

    return output

# keyword names must match the function signature; `x` is the input placeholder
neural_network_model2(x, input_nodes=28*28, layer_nodes=[500, 500, 500], output_nodes=10)
```

In [43]:
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

If you attempt to feed through something that is **not** this shape, then TF will throw an error.
- It can be useful to leave the shape in there, because TF **won't** flag an error if you didn't specify the size

X and Y are just placeholders for the graph to shove data through network

`biases` are something that is added after the inputs are multiplied by the weights
- in the standard example above, we summed all of the weighted inputs together
- but it is better to pass `(input_data * weights) + biases`
- with Rectified Linear Units (ReLU) and no biases, if all of the input data were 0 then none of the neurons would EVER fire!

`biases` make it so that the neuron can still fire, even if the weighted inputs result in 0.0
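
A quick way to see the effect of the bias is to feed an all-zero input through a single ReLU neuron. The helper names and the numbers below are hypothetical and chosen only for this sketch.

```python
def relu(z):
    """Rectified linear unit: passes positive values through and clips negatives to 0."""
    return max(0.0, z)

def neuron(inputs, weights, bias=0.0):
    # (input_data * weights) + biases, followed by the activation
    return relu(sum(x * w for x, w in zip(inputs, weights)) + bias)

all_zero_input = [0.0, 0.0, 0.0]
weights = [0.4, -0.2, 0.7]   # made-up weights

without_bias = neuron(all_zero_input, weights)          # weighted sum is 0, so nothing fires
with_bias = neuron(all_zero_input, weights, bias=0.5)   # the bias alone lets the neuron fire
print(without_bias, with_bias)
```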

In [7]:
def general_feed_forward_neural_network_model(inputs, nOutputs, numberOfNeurons=None):
    # `numberOfNeurons` is an integer or a list of integers (one entry per hidden layer)
    if numberOfNeurons is None:
        numberOfNeurons = []
    if not isinstance(numberOfNeurons, (list, tuple)):
        numberOfNeurons = [numberOfNeurons]
    
    numberOfLayers = len(numberOfNeurons) # The number of layers is specified by the length of `numberOfNeurons`
    if not numberOfLayers:
        raise ValueError('`numberOfNeurons` must be either an integer or a list of integers')
    
    numberOfNeurons = [int(n) for n in numberOfNeurons] # ensure they are integers
    
    # prepend the number of input features, e.g. 784 for a placeholder of shape [None, 784]
    numberOfNeurons = [inputs.get_shape().as_list()[1]] + numberOfNeurons
    
    # For loop over zip(all_but_the_last_layer, all_but_the_first_layer); build a separate dict per layer
    hidden_layers = []
    for nNeuronsNow, nNeuronsNext in zip(numberOfNeurons[:-1], numberOfNeurons[1:]):
        hidden_layers.append({'weights':tf.Variable(tf.random_normal([nNeuronsNow, nNeuronsNext])),
                              'biases': tf.Variable(tf.random_normal([nNeuronsNext]))})
    
    output_layer   = {'weights':tf.Variable(tf.random_normal([numberOfNeurons[-1], nOutputs])),
                      'biases': tf.Variable(tf.random_normal([nOutputs]))}
    
    # Setup the topology of the layers
    layer = inputs
    for hidden_layer in hidden_layers:
        layer = tf.add(tf.matmul(layer, hidden_layer['weights']) , hidden_layer['biases'])
        layer = tf.nn.relu(layer) # rectified linear activation function
    
    # the output layer does *not* get passed through the activation function
    output = tf.add(tf.matmul(layer, output_layer['weights']  ) , output_layer['biases'])
    
    return output

In [8]:
general_feed_forward_neural_network_model(inputs, nOutputs, [n_nodes_hl1, n_nodes_hl2, n_nodes_hl3])

(500, 500, 500)

In [50]:
def neural_network_model(data):
    # Shape must be `n_inputs` by `n_nodes` of the layer, i.e. `n_pixels` by `n_nodes_hl1` here
    # we start with random_normal to throw some randomness into the beginning
    # the biases are a vector with one value per node in the layer
    hidden_1_layer = {'weights':tf.Variable(tf.random_normal([n_pixels, n_nodes_hl1])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]))}
    
    hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]))}
    
    hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3])),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl3]))}
    
    output_layer   = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_classes])),
                      'biases': tf.Variable(tf.random_normal([n_classes]))}
    
    layer1 = tf.add(tf.matmul(data, hidden_1_layer['weights']) , hidden_1_layer['biases'])
    layer1 = tf.nn.relu(layer1) # rectified linear activation function
    
    layer2 = tf.add(tf.matmul(layer1  , hidden_2_layer['weights']) , hidden_2_layer['biases'])
    layer2 = tf.nn.relu(layer2)
    
    layer3 = tf.add(tf.matmul(layer2  , hidden_3_layer['weights']) , hidden_3_layer['biases'])
    layer3 = tf.nn.relu(layer3)
    
    # the output layer does *not* get passed through the activation function
    output = tf.add(tf.matmul(layer3  , output_layer['weights']  ) , output_layer['biases'])
    
    return output

In [51]:
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost       = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(prediction,y))
    
    # AdamOptimizer has a param called `learning_rate` with default = 0.001
    optimizer  = tf.train.AdamOptimizer().minimize(cost)
    
    nEpochs = 10
    
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        
        for epoch in range(nEpochs):
            epoch_loss = 0
            for _ in range(int(mnist.train.num_examples / batch_size)):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c  = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            
            print('Epoch', epoch, 'completed out of',nEpochs, 'loss:', epoch_loss)
        
        correct = tf.equal(tf.argmax(prediction,1), tf.argmax(y,1))
        
        accuracy= tf.reduce_mean(tf.cast(correct, 'float'))
        
        print('Accuracy:', accuracy.eval({x:mnist.test.images, y:mnist.test.labels}))

In [52]:
train_neural_network(x)

('Epoch', 0, 'completed out of', 10, 'loss:', 1795090.7263946533)
('Epoch', 1, 'completed out of', 10, 'loss:', 397719.79118537903)
('Epoch', 2, 'completed out of', 10, 'loss:', 221768.89266777039)
('Epoch', 3, 'completed out of', 10, 'loss:', 134885.91095066071)
('Epoch', 4, 'completed out of', 10, 'loss:', 82654.702231878415)
('Epoch', 5, 'completed out of', 10, 'loss:', 53040.316723240481)
('Epoch', 6, 'completed out of', 10, 'loss:', 32520.759185434406)
('Epoch', 7, 'completed out of', 10, 'loss:', 26949.754656583071)
('Epoch', 8, 'completed out of', 10, 'loss:', 21321.228015989065)
('Epoch', 9, 'completed out of', 10, 'loss:', 17755.359222649509)
('Accuracy:', 0.94800001)
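
To make the cost and accuracy lines in `train_neural_network` less of a black box, here is a plain-Python sketch of the same ideas: softmax turns the raw outputs (logits) into probabilities, cross entropy scores them against the one-hot label, and accuracy compares the `argmax` of the prediction and the label. The helper names and the tiny batch of logits are made up for this illustration, and this is not how TensorFlow implements these ops internally.

```python
import math

def softmax(logits):
    """Turn raw output-layer values (logits) into probabilities that sum to 1."""
    shifted = [v - max(logits) for v in logits]   # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot_label):
    """Cross entropy between the softmax of the logits and a one-hot label."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(batch_logits, batch_labels):
    """Fraction of samples whose predicted class matches the labeled class."""
    correct = [argmax(l) == argmax(t) for l, t in zip(batch_logits, batch_labels)]
    return sum(correct) / float(len(correct))

# Hypothetical batch of 3 samples with 3 classes; the third sample is misclassified
batch_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.5, 1.0, 0.3]]
batch_labels = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
mean_cost = sum(cross_entropy(l, t) for l, t in zip(batch_logits, batch_labels)) / len(batch_logits)
print(mean_cost, accuracy(batch_logits, batch_labels))
```

A confident correct prediction gives a cost near 0, while a confident wrong one gives a large cost, which is what the optimizer pushes down epoch after epoch.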
