*Unit 4, Sprint 2, Module 1*

---

# Architect (Prepare)
__*Neural Network Foundations*__

## Learning Objectives
* <a href="#p1">Part 1</a>: Student should be able to describe the foundational components of a neural network
* <a href="#p2">Part 2</a>: Student should be able to use the Keras Sequential Model API
* <a href="#p3">Part 3</a>: Student should be able to select a model architecture

Neural Networks are among the most powerful modeling techniques that we possess in machine learning today. In spite of the hype surrounding these topics, we hope that you will come to see them as just another tool in your tool bag with their own strengths and weaknesses. They are useful, but they are not a silver bullet, and they are not always preferable to other -- perhaps simpler -- machine learning methods. 

The goal of this week is to familiarize you with the fundamental theory, terminology and libraries that will enable you to build and use neural network architectures. This week will not be a run-through of the history of Neural Networks and each of the individual advancements leading up to current technologies -- we don't have time for that. We will spend some time on some older methods, but only to the degree that they will help introduce us to relevant terminology and understand more complex versions of these technologies.

# Foundational Neural Network Components (Learn)
<a id="p1"></a>

## Overview

### Major Components
- Neurons
- Weight and Bias Parameters
- Activation Function
- Loss Function
- Layers: collections of neurons with the same inputs



<center><img src="https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-Neural-Networks/main/module1-Architect/IMG_0167.jpeg" width=400></center>


#### Let's zoom in on the first of these components, the neuron.

----

## 1. Dissecting the Artificial Neuron (a.k.a Perceptron)

Every branch of science has a fundamental unit, a baseline model of a physical system, that is used as the starting point (the first principle) of that science. Every idea, experiment result, and hypothesis in a branch of science rests upon the building block of that science, unless you're doing purely theoretical work that is explicitly looking to introduce a new building block or challenge the first principles. So it's important that we understand the building block of any science that we wish to study. 

In **Physics**, the fundamental building block is the **particle**.

In **Chemistry**, the fundamental building block is the **chemical element**. 

In **Biology**, the fundamental building block is the **cell**.

In **Neuroscience**, the fundamental building block is the **Neuron**.

**You are about to learn computational Neuroscience!** 

This image has a side-by-side comparison of a biological model of a neuron and a computational model of a neuron.
![](https://miro.medium.com/max/610/1*SJPacPhP4KDEB1AdhOFy_Q.png)


Let's dive deeper into the computational model of the neuron. 
![](https://images.squarespace-cdn.com/content/v1/59d9b2749f8dce3ebe4e676d/1547561883197-ZO8CJILFNGZMORIJZOJ1/ke17ZwdGBToddI8pDm48kAuxETKhxDsgKuKi-UGpnEIUqsxRUqqbr1mOJYKfIPR7LoDQ9mXPOjoJoqy81S2I8N_N4V1vUb5AoIIIbLZhVYxCRW4BPu10St3TBAUQYVKcofcaXKJYLRX9EzsQWIgTsTayKHY9LJ3-BRv5jWxoI5y-JkyZtODn1AMLPOg8sn20/Artificial-Neuron.png )

**Note:** The $\mathbf{^{T}}$ superscript on the $\mathbf{w}$ weight vector stands for a vector/matrix transpose. Sometimes these are necessary in order to get the dimensions of a vector or matrix product to align so that a valid product can take place. 

**This fundamental equation describes how an artificial neuron combines multiple inputs to compute a single output!**

This is the neuron in a single equation: all the relevant terms are present. This equation will continue to reappear as we continue our study of various neural network architectures, as well as the techniques for training neural networks such as **gradient descent** and **back-propagation**. 


$${y = f(\mathbf{w^T} \mathbf{x} + b )}$$

In this equation, $\textbf{f}$ is the **activation function** of the neuron <br>
$\textbf{x}$ is a column vector of the inputs to the neuron (i.e. features).<br>
The neuron is characterized by a column vector of weights $\textbf{w}$, and a bias $b$<br>
The expression $\textbf{w}^{\textbf{T}}\textbf{x}+b$ is the **signal** from the neuron. <br> 
Here, $\textbf{w}^{\textbf{T}}$ is the transpose of the weights vector -- recall that the transpose of a column vector is a row vector. <br>
The signal is a weighted sum of the inputs $\{x_{1}, x_{2},...,x_{n}\}$ plus a bias term $b$. For example, with three inputs:

`signal = weight1*input1 + weight2*input2 + weight3*input3 + bias`


$${y} = f(w_1  x_1 + w_2   x_2 + w_3   x_3 + b)$$
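
As a minimal sketch of this equation for a single input example, the snippet below computes the signal and then the output. The helper name `neuron_output`, the example numbers, and the choice of `np.tanh` as the activation $f$ are illustrative assumptions, not part of the lesson's own code.

```python
import numpy as np

def neuron_output(x, w, b, f):
    """Return f(w^T x + b) for one input vector x (hypothetical helper)."""
    signal = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
    return f(signal)

# illustrative values (assumptions, chosen only for this sketch)
x_example = np.array([1.0, 2.0, 3.0])      # three input features
w_example = np.array([0.5, -1.0, 0.25])    # one weight per feature
b_example = 0.5                            # scalar bias

signal_example = np.dot(w_example, x_example) + b_example
y_example = neuron_output(x_example, w_example, b_example, np.tanh)
print(signal_example, y_example)
```

Any other activation function could be passed in place of `np.tanh`; the weighted-sum step stays the same.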

----
## 2. Mathematics of vector-vector and matrix-vector dot products
In this section we'll review vector-vector and matrix-vector multiplication <br>
As an application, we'll develop the mathematical machinery to create a `perceptron` model and understand its operation.


### 2.1 First let's define some terms and notation!

Let $\mathcal {X}$ be defined as the set of all vectors in an ${N}$-dimensional vector space denoted as $\mathbb{R}^{N}$ 

Let $\mathbf {x}$ be a vector $\in \mathcal {X}$<br>here, $ \in $ means "is a member of"


Let $\mathcal {W}$ be defined as the set of all weight vectors in $\mathbb{R}^{N}$  

Let $\mathbf {w}$ be a vector $\in \mathcal {W}$

Let ${b}$ be a scalar on the real line, denoted by $\mathbb{R}^{1}$, or just $\mathbb{R}$<br>

In all the work to follow in this course, we will strive to follow standard notation, denoting:
* *matrix* variables by bold capital letters, e.g. $\textbf{X}$
* *vector* variables by bold lower case letters, e.g. $\textbf{w}$
* *scalar* variables by plain lower case letters, e.g. $b$


### 2.2 Representing a neuron's input and parameters<br>
We'll specialize to a toy example in two dimensions.<br>
Suppose we have 3 examples (data points) $\{ \textbf{x}_{1}, \textbf{x}_{2},\textbf{x}_{3} \}$, each of which is a two dimensional row vector. <br>
We can stack these input vectors into a $3 \times 2$ input matrix $\textbf{X}$ <br>
The neuron's weights vector $\textbf{w}$ always has the same dimension as the input vector, so it is also two dimensional.<br>
The neuron's bias $b$ is a scalar.<br>
Let's create code for our toy example:

In [None]:
# Although the math holds for any feature dimensionality n, 
#     let's keep things simple and start with n = 2 dimensional input vectors
import numpy as np

# define our 2-dim input vectors and input matrix 
x1 = np.array([10, 20])
x2 = np.array([-10, -20])
x3 = np.array([100, 111])

# stack the input vectors to form the input matrix
X = np.array([x1, 
              x2, 
              x3])

# define our 2-dim weight vector
w = np.array([0.2, 0.4]) 

# define our bias term 
b = 1

Input matrix $\textbf{X}$

In [None]:
X

Weights vector $\textbf{w}$

In [None]:
w

Scalar bias $b$

In [None]:
b

### 2.3 The dot product of two vectors
*The dot product of two vectors is the sum of the products of their corresponding components:*<br><br>
Example: Let $\textbf{a} = [1, 2, 3]$ and $\textbf{b} = [5, 6, 7]$  <br><br>
Then
$\textbf{a} \cdot \textbf{b} = 1\times5 + 2\times6 + 3\times7 = 5 + 12 + 21 = 38$<br><br>
Note that the dot product of two vectors is just a single number, i.e. a **scalar**<br>
In `python` we can compute the dot product like so:<br> 
`numpy.dot(a,b)`


In [None]:
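# to compute the dot product between vectors a and b:
# multiply element-wise, then sum
# (the second vector is named b_vec in code so that it does not
#  overwrite the scalar bias b defined above)

a = np.array([1,2,3])
b_vec = np.array([5,6,7])

np.dot(a,b_vec)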

A dot product between an input vector $\mathbf x$ and a weight vector $\mathbf w$ <br>can be thought of as a *weighted sum* of the input vector's components:

$${\displaystyle \mathbf {w} \cdot \mathbf {x} }~~=~~{\displaystyle \sum _{i=1}^{n}w_{i}x_{i}}$$
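
To make the weighted-sum reading concrete, here is a small sketch that computes the sum term by term and compares it with `np.dot`. The variable names ending in `_demo` are made up for this illustration.

```python
import numpy as np

w_demo = np.array([0.2, 0.4])     # weights
x_demo = np.array([10.0, 20.0])   # input components

# explicit sum of products, one term per component
weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w_demo, x_demo))

# the same quantity computed by NumPy
dot_product = np.dot(w_demo, x_demo)

print(weighted_sum, dot_product)
```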

### 2.4 A bit of `pythonic` weirdness explained 
"one-dimensional vector" vs. <br>
"two-dimensional row vector" vs.<br>
"two-dimensional column vector"

In [None]:
# one dimensional vector
print(f'w looks like this {w} and is a one-dimensional vector and has shape {w.shape}\n')

# reshape w to a two-dimensional row vector!
print(f'w.reshape(1,2) looks like this {w.reshape(1,2)} and is a two-dimensional array, a row vector (1 row,2 columns) with shape {w.reshape(1,2).shape}\n')

# check out this strange alternative way to reshape!
print(f'w[None,:] also looks like this {w[None,:]} and is a two-dimensional array, a row vector with shape {w[None,:].shape}\n')

# reshape w to a two-dimensional column vector!
print(f'w.reshape(2,1) looks like this\n {w.reshape(2,1)} \n and is a two-dimensional array, a column vector, with shape {w.reshape(2,1).shape}\n')

# check out this strange alternative way to reshape!
print(f'w[:,None] also looks like this \n {w[:,None]} \n and is a two-dimensional array, a column vector, with shape {w[:,None].shape}\n')

### 2.5 The dot product of a matrix and a vector

If $\textbf{X}$ is an $\text{m} \times \text{n}$ matrix and $\textbf{w}$ is an $\text{n} \times \text{1}$ column vector, <br>
then the matrix-vector product $\textbf{X}\cdot\textbf{w}$ is an $\text{m} \times \text{1}$ column vector<br>
whose entries are the dot products of each row of $\textbf{X}$ with $\textbf{w}$

Note that the *first* and *second* dimensions of the matrix-vector product are:<br> the *first* dimension of the matrix and the *second* dimension of the vector<br>
Here $m$ refers to the number of examples and $n$ refers to the number of features, i.e. the dimensionality of each input example.
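
The statement that each entry is the dot product of one row of $\textbf{X}$ with $\textbf{w}$ can be checked directly. The sketch below uses the same numbers as the toy example, under separate `_demo` names (an assumption made so that nothing defined elsewhere is overwritten).

```python
import numpy as np

X_demo = np.array([[10, 20],
                   [-10, -20],
                   [100, 111]])       # m = 3 examples, n = 2 features
w_demo = np.array([0.2, 0.4])         # n = 2 weights

# one dot product per row of the matrix
row_by_row = np.array([np.dot(row, w_demo) for row in X_demo])

# the matrix-vector product computes all of them at once
all_at_once = X_demo @ w_demo

# with w as an (n x 1) column, the result is an (m x 1) column
column_result = X_demo @ w_demo.reshape(2, 1)

print(row_by_row)
print(all_at_once)
print(column_result.shape)
```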

#### 2.5.1 The `numpy.dot()` function is *overloaded* to handle the product of a matrix with a vector
Caveat: when `np.dot` is used this way, the result is a vector, not a scalar.

In [None]:
# The dot product of matrix X with column vector w
np.dot(X,w)

#### 2.5.2 Alternatively, we can use `python`'s `@` operator for matrix multiplication.
To compute the product of two matrices `A@B`, we must make sure that `A` and `B` have compatible shapes. <br>
`X@w` also works when `w` is a one-dimensional array, in which case the result is one-dimensional. If we first cast (reshape) `w` as a two-dimensional column array, the result is a two-dimensional column vector. <br>
Note that the result contains the same numbers as the previous answer.

In [None]:
X@w[:,None]

## 3. A `perceptron` is a mathematical model of the neuron
A neuron takes in a number of inputs through its dendrites. Depending on its sensitivity to each of the inputs, the neuron responds by either firing or not firing along the axon. We can model this process in code! <br><br>
The inputs to the neuron are real numbers, and the neuron generates an output that is either $\text{ON}$ (fires) or $\text{OFF}$ (doesn't fire). <br>

The inputs for a single example can be described by a vector $\textbf{x}$. <br>
The neuron has internal weights $\textbf{w}$  describing its sensitivity to each input and also a bias $b$ describing its overall sensitivity.<br>

We can think of the neuron as performing three steps to generate its binary output. <br><br>
### Step 1: the neuron combines the inputs linearly to produce a response<br><br>
$Z = \textbf{w} \cdot \textbf{x} + b$. <br><br>
Here, $Z$ is the response, a real number that can be positive or negative: $Z \in (-\infty,+ \infty)$<br><br>
The neuron ultimately needs to convert the response to a decision <br>whether or not to fire, i.e. to a binary output of either $0$ or $1$. Hence<br><br>
### Step 2: the neuron "squishes" $Z$ down to a number between $0$ and $1$. <br>
We'll need a special function $S$ to do this job for us.<br>
For now, let's think about what we want $S$ to do: <br><br>
$S(Z)$ maps any real number $Z$ onto the unit interval, so that <br><br>$S(Z)=z$, where $z \in [0,1]$ <br><br>
We'll have more to say about $S$ in a bit.<br><br>
### Step  3: the neuron maps the number $z$ to a binary $\text{output} \in \{0,1\}$<br><br>
How would the neuron do this? <br>
We can imagine applying a threshold (say $0.5$) to $z$, producing a result <br><br>
$\text{output} = (z > 0.5) = { \begin{cases}1& {\text{if }}\  {z} >0.5\\ 0& {\text{if }}\  {z} \le 0.5\\ \end{cases}} $<br><br>

### 3.1 Summary of the `perceptron` model
1. Map the vector input $\textbf{x}$ to a linear response $Z = \textbf{w} \cdot \textbf{x} + b$
2. Map $Z$ to a number $z \in [0,1]$
3. Apply a threshold to $z$ to produce a binary output of either $1$ or $0$<br><br>

The `perceptron` output ($0$ or $1$) can be expressed in terms of 
* the inputs $\textbf{x}$,<br>
* the neuron's internal parameters (weights $\textbf{w}$ and bias $b$)
* the activation function $S$<br><br>
$\text{output} = S(\textbf{w} \cdot \textbf{x} + b) > 0.5$<br><br>
where $S$ is the $\text{sigmoid}$ function. <br><br>

We are now close to a complete understanding of the `perceptron`!<br> The only missing piece is a discussion of the **sigmoid** function $\textbf{S}$, which follows in the next sections.


### 3.2 Linear Response of a neuron
A neuron's *linear response* is the weighted sum of the inputs plus the bias:<br><br>
$Z = \textbf{w} \cdot \textbf{x} + b$

Let's code up the linear response function.

In [None]:
def Z(X,w,b):
  # combine the inputs, weights and bias to compute the linear response of the neuron
  
  # reshape w into a column vector, of shape [len(w), 1]
  w=w.reshape((len(w),1))

  return X@w + b

Try it out and see what $Z$ looks like. <br>
Note that the output of a neuron given an input data point (or example) is a single number. <br>
Here Z is a column vector, giving the neuron outputs for each of the three input examples.

In [None]:
Z(X,w,b)


### 3.3 The [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function)
The sigmoid function $$S(x) = \frac{1}{1+\exp(-x)}$$ <br>
does the job we referred to earlier: <br>
$S$ maps a real number $x \in (-\infty,+\infty)$ into the unit interval $[0,1]$; its values lie strictly between $0$ and $1$.<br><br>
The sigmoid is an example of an *activation function* -- which is a nonlinear function that is applied to a neuron's linear response $Z$ to compute the output. *Activation functions* are an essential component of every neural network, and there are a variety of others besides the **sigmoid**. We'll encounter other examples of *activation functions* later on.


In [None]:
def sigmoid(x):
    """Calculate the output of a sigmoid
    Input can be a scalar or an array 
    If the input is an array, the output will be an array of the same size
    whose values are the sigmoid of each input element """

    return 1 / (1 + np.exp(-x))

In [None]:
# print both values (a notebook cell only displays its last expression)
print(sigmoid(11))
print(sigmoid(0))

In [None]:
sigmoid(Z(X,w,b))

$$\textbf{Sigmoid formula}$$
$${\displaystyle S(x)={\frac {1}{1+e^{-x}}}}$$

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png)


### 3.4 Exercise: Familiarize yourself with the sigmoid function
1. With the code above, try out a few examples, computing the sigmoid function applied to large and small positive and negative numbers.<br>
Convince yourself that the sigmoid function really does have the property of "squishing" the real line into the interval $[0,1]$ as shown in the above plot.<br>
2. Without coding, compute the following: <br>
$S(0) = $ ?<br>
$S(-\infty) = $ ?<br>
$S(+\infty) = $ ? 


In [None]:
# YOUR ANSWER HERE 
# hint np.exp(0) = e**0 = 1

### 3.5 The output of the neuron!
We can now compute the final outputs of the neuron for our data set of three examples <br>by applying a *threshold* to the sigmoid result, producing a boolean value.<br>
If we set the threshold value to $0.5$, then we get an output of $1$ for positive $Z$ values <br>
and an output of $0$ for negative $Z$ values. <br>
See the graph of the sigmoid function above.
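
The link between the $0.5$ threshold and the sign of $Z$ can be checked numerically. In this sketch the function `sigmoid_demo` and the sample values are assumptions made for illustration; note that $Z = 0$ gives exactly $0.5$, which does not exceed the threshold.

```python
import numpy as np

def sigmoid_demo(x):
    """Same formula as the sigmoid above, redefined so the sketch is self-contained."""
    return 1 / (1 + np.exp(-x))

z_values = np.array([-5.0, -0.1, 0.0, 0.1, 5.0])

fires_via_sigmoid = sigmoid_demo(z_values) > 0.5   # threshold the squished value
fires_via_sign = z_values > 0                      # look only at the sign of Z

print(fires_via_sigmoid)
print(fires_via_sign)
```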

In [None]:
# The neuron fires whenever the linear response (input to the sigmoid) is positive
# The neuron does not fire whenever the linear response (input to the sigmoid) is negative
threshold = 0.5
output = sigmoid(Z(X,w,b))>threshold
output

### 3.6 Implement the `perceptron` model of the neuron in code
Congratulations! You have now understood the `perceptron` model of the neuron! <br>The `perceptron` is the simplest example of a neural network. <br><br>
The next step: think of `perceptrons` as the $\text{Lego}^{TM}$ bricks for building more complex neural networks<br><br>
Let's stitch together our results so far, and code up a `perceptron` model<br><br>
As we'll see in our code, the `perceptron` is composed of a sequence of mathematical operations, each one acting upon the result of the previous one. For that reason, `perceptrons` (and neural networks composed of them) are often referred to as **Feed-Forward Neural Networks** (FFNN), since the data "feeds forward", i.e. flows from left to right through the network as it is processed by successive operations.


In [None]:
# implement the perceptron model
def perceptron(w, X, b):
    """
    Calculates the sigmoid of a weighted sum plus a bias term, w . x + b,
    and returns a classification for the input data (i.e. a prediction)
    
    Returns True (1) if the sigmoid output is greater than the threshold
    Returns False (0) if the sigmoid output is less than or equal to the threshold
    
    Parameters
    ----------
    w: numpy array 
        weight vector
        
    X: numpy 2D array
        Input data, one row for each example 
        
    b: scalar (i.e. constant)
        Bias term 
        
    Returns 
    -------
    numpy boolean array, one entry per example (row of X)
    """
    

    # 1. reshape w into a column vector, of shape [len(w), 1]
    w = w.reshape((len(w),1))
    
    # 2. compute the linear response of the neuron
    Z = X@w + b
    
    # 3. apply the sigmoid activation function to the linear response
    #      to get a number between 0 and 1
    z = sigmoid(Z)

    # 4. set the threshold
    threshold = 0.5 

    # 5. apply the threshold to determine whether or not the neuron should fire
    # the output is 1 if the neuron fires, 0 if not
    output =  z > threshold
    return output
   

In [None]:
# Calculate the perceptron output for the inputs, weights and bias we defined earlier
perceptron(w, X, b)

In [None]:
b

In [None]:
# what happens if we make the bias large and positive?
b = 70
perceptron(w, X, b)

In [None]:
# what happens if we make the bias large and negative?
b = -70
perceptron(w, X, b)

-----

### 3.7 What kind of predictive model is the perceptron?

The perceptron is a **linear binary classifier** - just like Logistic Regression 

**Binary** in the sense that it can only distinguish between two classes in a classification task. 

**Linear** in the sense that it can only separate two classes that have a linear decision boundary. 
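
To see why the boundary is linear, here is a short derivation for two inputs, using the thresholded sigmoid output from section 3.5 and, as a worked example, the toy parameters $\textbf{w} = [0.2, 0.4]$ and $b = 1$ from section 2.2:

$$S(Z) > 0.5 \iff Z > 0 \iff w_1 x_1 + w_2 x_2 + b > 0$$

The boundary between the two classes is therefore the set of points where $Z = 0$, which is a straight line:

$$w_1 x_1 + w_2 x_2 + b = 0 \quad\Longrightarrow\quad x_2 = -\frac{w_1}{w_2}x_1 - \frac{b}{w_2}$$

$$0.2\,x_1 + 0.4\,x_2 + 1 = 0 \quad\Longrightarrow\quad x_2 = -0.5\,x_1 - 2.5$$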

By combining multiple neurons into a neural network we can overcome both of these limitations. <br>
More on this in a bit. For now, let's zoom back out and look at the big picture again. 

-----

### 3.8 Essential aspects of neural networks

<img src="https://camo.githubusercontent.com/41ebef1f914f92a03a75d5c468c10b49d9f7ba4fb7fac455b3f9ba891953889c/68747470733a2f2f7777772e7079696d6167657365617263682e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031362f30382f73696d706c655f6e657572616c5f6e6574776f726b5f6865616465722e6a7067" alt="Drawing" style="width: 600px;"/>

#### 3.8.1 What does a Neuron Do?
A neuron is often called a "unit" which is shorthand for "activation unit". <br>
According to the `perceptron` model, each neuron in a network calculates a weighted sum of its inputs, <br>
adds a bias term and passes the resulting value through an activation function. 

#### 3.8.2 Layers
A single *dense* layer is a collection of neurons. There are three common types of layers:
- **Input** -- neurons in this layer do nothing. They simply "pass through", or connect, the corresponding inputs rightward to the first hidden layer.<br> Assertion: for an input layer, the weights are all $1$s, the bias is $0$, and the activation function is the identity (i.e. multiplication by $1$).<br>

- **Hidden** -- there are one or more hidden layers, connected left to right.
- **Output** -- the output layer computes the final output of the network
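
Because every neuron in a layer sees the same inputs, a whole layer can be computed with one matrix product. The sketch below is an illustration only: the helper names `dense_forward` and `relu_demo`, the shapes, and the numbers are all assumptions made for this example.

```python
import numpy as np

def relu_demo(z):
    """ReLU activation: negative values become 0."""
    return np.maximum(0, z)

def dense_forward(inputs, weights, biases, activation):
    """Forward pass of one dense layer (hypothetical helper).

    inputs:  shape (m, n_in), one row per example
    weights: shape (n_in, n_out), one column per neuron
    biases:  shape (n_out,), one bias per neuron
    """
    return activation(inputs @ weights + biases)

inputs_demo = np.array([[1.0, 2.0],
                        [3.0, -1.0]])            # 2 examples, 2 features
weights_demo = np.array([[1.0, -1.0, 0.5],
                         [0.0,  2.0, 0.5]])      # 2 inputs feeding 3 neurons
biases_demo = np.array([0.0, 1.0, -2.0])

layer_output = dense_forward(inputs_demo, weights_demo, biases_demo, relu_demo)
print(layer_output)   # one row per example, one column per neuron
```

Stacking layers then amounts to feeding `layer_output` into the next call as its `inputs`.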

#### 3.8.3 Weights
Weights are *parameters* within our neural network. In the diagram above, weights are represented as arrows. During the training process, the weights are adjusted to minimize the loss function (see below).

#### 3.8.4 Bias
The bias is a trainable *parameter* within our neural network. The bias term is a constant allowing greater flexibility in the output of a neuron. During training the bias is adjusted along with the weights to minimize the loss function.

#### 3.8.5 Loss Function
The *loss function* measures the error: how far your model predictions are from the target values.<br> 
The *derivative* (or *slope*, or *gradient*) of the loss function informs you how you should update the weights and biases in order to decrease the error. This is the method of *Gradient Descent*. <br><br>
We train the network by iteratively adjusting the weights and biases until we *minimize* (i.e. reach the minimum value of) the loss function. The gradients needed for each update are computed by an algorithm called *Back-Propagation*.<br><br>
This is a lot of information to take in! Don't worry if you don't understand this yet. <br>
We'll explain *Gradient Descent* and *Back-Propagation* in detail in the next Module.
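
As one concrete example of a loss function, here is a sketch of binary cross-entropy, the loss that the Keras examples later in this module select by name. The function name, the small clipping constant `eps`, and the sample predictions are assumptions made for illustration.

```python
import numpy as np

def binary_cross_entropy_demo(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid taking log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

targets = np.array([0, 1, 1, 0])
good_predictions = np.array([0.1, 0.9, 0.8, 0.2])   # close to the targets
poor_predictions = np.array([0.6, 0.4, 0.5, 0.7])   # far from the targets

loss_good = binary_cross_entropy_demo(targets, good_predictions)
loss_poor = binary_cross_entropy_demo(targets, poor_predictions)
print(loss_good, loss_poor)
```

Predictions that sit closer to the targets produce a smaller loss, which is the quantity training tries to drive down.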

#### 3.8.6 Activation Function
The activation function controls the output of any given neuron.
Its derivative, or slope, is especially important for training:
it provides information that we use to update the neuron's weights and bias during model training with **Gradient Descent**.




**Sigmoid Curve and its Derivative**

![](https://i.stack.imgur.com/inMoa.png)
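
The derivative curve in the plot has a convenient closed form, $S'(x) = S(x)\,(1 - S(x))$. The sketch below checks that formula against a numerical slope estimate; the function names and the step size `h` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid_demo(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative_demo(x):
    """Closed-form slope of the sigmoid: S(x) * (1 - S(x))."""
    s = sigmoid_demo(x)
    return s * (1 - s)

x_grid = np.linspace(-6, 6, 13)
h = 1e-5   # small step for the central-difference estimate

numerical_slope = (sigmoid_demo(x_grid + h) - sigmoid_demo(x_grid - h)) / (2 * h)
analytic_slope = sigmoid_derivative_demo(x_grid)

print(np.max(np.abs(numerical_slope - analytic_slope)))   # should be tiny
print(sigmoid_derivative_demo(0.0))                       # the slope is largest at x = 0
```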

A number of activation functions are commonly used in neural networks. <br>
The most important of these are the `sigmoid`, `relu`, `tanh` and `softmax` activation functions<br>
Here are some short reference articles explaining activation functions:
* [Everything you need to know about activation functions in deep learning](https://towardsdatascience.com/everything-you-need-to-know-about-activation-functions-in-deep-learning-models-84ba9f82c253)
* [7 popular activation functions you should know in deep learning](https://towardsdatascience.com/7-popular-activation-functions-you-should-know-in-deep-learning-and-how-to-use-them-with-keras-and-27b4d838dfe6)
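
For a quick feel for two of these, here is a sketch that evaluates `relu` and `tanh` on a few sample values. The helper `relu_demo` is written out by hand for illustration (it is not the Keras implementation), and `np.tanh` is NumPy's built-in hyperbolic tangent.

```python
import numpy as np

def relu_demo(z):
    """ReLU: pass positive values through, replace negative values with 0."""
    return np.maximum(0, z)

z_values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_values = relu_demo(z_values)
tanh_values = np.tanh(z_values)   # squishes values into the interval (-1, 1)

print(relu_values)
print(tanh_values)
```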

#### 3.8.7 Training a Neural Network
Training is the process of iteratively finding the optimal values of the weights and bias for each neuron in the network, i.e. the values that minimize the loss function. The methods used in training are called *Gradient Descent* and *Back-propagation* and will be discussed in the next Module. <br>

## 4. The Keras Sequential API (Learn)
The Keras Sequential API enables you to easily build, compile, fit and evaluate your own neural network models. <br><br>
Warning: learning to use Keras is exhilarating!<br>
The feeling is a bit like when your parents gave you the keys to the car when you were a teenager...<br>
In the remainder of Unit 4, we'll learn to build, drive, and apply the machinery of neural networks to solve important and interesting problems.

To get an intuitive feeling for how neural networks work, try [TensorFlow playground](https://playground.tensorflow.org/)!

## Overview

> "Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Use Keras if you need a deep learning library that:

> Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility).
Supports both convolutional networks and recurrent networks, as well as combinations of the two.
Runs seamlessly on CPU and GPU." 

### !!!!! STOP -- You need to access GPU Resources on Colab before running the code below!!!!!

From Colab's **Runtime** menu, select **Change Runtime type**, and choose the **GPU hardware accelerator**. <br>
Without the GPU, neural network model training is slow as molasses!

### 4.1 Example: the XOR problem
In computer science XOR is a logical operation called "Exclusive Or".<br>
For two logical inputs $a$ and $b$, <br>
$a~~\text{XOR}~~b$ has the same outputs as $a~~\text{OR}~~b$, <br>
with the exception that if both inputs are $\text{True}$, the output is $\text{False}$
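
This definition can be checked with plain Python, where `^` is XOR and `|` is OR on booleans. The names `p`, `q` and `xor_table` are made up for this sketch.

```python
# build the truth table: (p, q, p XOR q, p OR q)
xor_table = [(p, q, p ^ q, p | q)
             for p in (False, True)
             for q in (False, True)]

for p, q, p_xor_q, p_or_q in xor_table:
    print(p, q, p_xor_q, p_or_q)
```

The XOR and OR columns agree on every row except the last one, where both inputs are `True`.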

In [None]:
import pandas as pd

data = { 'x1': [0,1,0,1],
         'x2': [0,0,1,1],
         'y':  [0,1,1,0]
       }

df = pd.DataFrame.from_dict(data).astype('int')
X = df[['x1', 'x2']].values
y = df['y'].values

XOR truth table: inputs are $x_1$ and $x_2$, output is $y$

In [None]:
df.head()

We can think of XOR as a binary classification problem

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# notice that we can't just draw a line to separate the two classes
sns.relplot(x="x1", y="x2", hue="y",
            sizes=(40, 400), alpha=1, palette="muted",s=200,
            height=5,  data=df);
plt.grid()


#### 4.1.1 Let's try to solve the $\text{XOR}$ problem
Our first approach will be to use a `perceptron` model to predict the output $y$ given the inputs $x_1$ and $x_2$

In [None]:
!pip install keras

In [None]:
from keras.models import Sequential
from keras.layers import Dense

In [None]:
## Build a perceptron with keras 
# The perceptron can't get to 100% accuracy because it can only fit linear boundaries between classes

# instantiate a sequential model
model = Sequential()

# add a dense layer
# with some layer-specific hyperparameters

model.add(Dense(1,                     # 1 neuron: the perceptron itself, which is also the output layer
                input_dim=2,           # input_dim is the only place where we say anything about the input layer
                activation='sigmoid')) # selecting our activation function

# compile the model 
# locks the model architecture. 
# indicate network-level hyperparameters
model.compile(loss='binary_crossentropy', # We're doing binary classification
             optimizer='sgd',             # stochastic gradient descent "vanilla" optimizer
             metrics=['accuracy']) 

# fit the model 
model.fit(X, y, epochs=100) # training data X and y

What went wrong?<br> 
The `perceptron` model cannot do the job, because (as we saw above) the classes in this problem cannot be separated with a line!<br>
A `perceptron` model with a sigmoid activation function is equivalent to Logistic Regression, which is a linear model.
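
To back up the claim that no single line does the job, the sketch below tries a grid of candidate lines and records the best accuracy any of them reaches on the four XOR points. The grid range and the names `X_xor`, `y_xor` and `best_accuracy` are assumptions made for this illustration; the search is a brute-force check, not how Keras trains a model.

```python
import itertools
import numpy as np

X_xor = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

grid = np.linspace(-2, 2, 17)   # candidate values -2, -1.75, ..., 2
best_accuracy = 0.0

# try every combination of two weights and a bias from the grid
for w1, w2, bias in itertools.product(grid, repeat=3):
    predictions = (X_xor @ np.array([w1, w2]) + bias > 0).astype(int)
    best_accuracy = max(best_accuracy, np.mean(predictions == y_xor))

print(best_accuracy)
```

However the line is placed, at least one of the four points ends up on the wrong side.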


#### 4.1.2 Let's try again to solve the $\text{XOR}$ problem
This time we'll build a multilayer `perceptron` model!

In [None]:
%%time
# This is a Deep Neural Network with multiple hidden layers
# Deep neural networks are any NNs with more than 1 hidden layer

# instantiate a sequential model
model = Sequential()

# add a dense layer
# with some layer-specific hyperparameters

model.add(Dense(10,  
                input_dim=2, # input_dim is the only place where we say anything about the input layer
                activation='relu')) # selecting our activation function

model.add(Dense(8, activation='relu')) # selecting our activation function

model.add(Dense(5, activation='relu')) # selecting our activation function

# output layer with sigmoid function for binary classification
model.add(Dense(1, activation='sigmoid'))

# compile the model 
# locks the model architecture. 
# indicate network-level hyperparameters
model.compile(loss='binary_crossentropy', # We're doing binary classification
             optimizer='adam',            # Adam, an adaptive variant of stochastic gradient descent
             metrics=['accuracy'])

# fit the model 
model.fit(X, y, epochs=100) # training data X and y

In [None]:
# evaluate the model
scores = model.evaluate(X, y)
print(f"{model.metrics_names[1]}: {scores[1]*100}")

##### BOOM! The Multilayer Perceptron (MLP) model solved the $\text{XOR}$ problem perfectly!

In [None]:
model.summary()

The perceptron is only effective for linearly separable data sets. <br>
But if we combine two or more of them in a multilayer perceptron model (i.e. a neural network), then they can handle non-linear data!

![](https://www.edureka.co/blog/wp-content/uploads/2017/07/Linear-528x264.jpg)
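
To see how a few combined neurons can represent XOR, here is a sketch of a tiny two-layer network whose weights were picked by hand for this illustration (they are not learned, and they are not the weights Keras found above). The names `W_hidden`, `b_hidden`, `w_out` and `b_out` are assumptions.

```python
import numpy as np

def relu_demo(z):
    return np.maximum(0, z)

X_xor = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# hidden layer: two ReLU neurons, both looking at x1 + x2
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([0.0, -1.0])   # the second neuron only turns on when both inputs are 1

# output neuron: h1 - 2*h2, thresholded at 0.5
w_out = np.array([1.0, -2.0])
b_out = 0.0

hidden = relu_demo(X_xor @ W_hidden + b_hidden)
output = (hidden @ w_out + b_out > 0.5).astype(int)
print(output)
```

The second hidden neuron cancels the first one exactly when both inputs are on, which is the one case where XOR differs from OR.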

### 4.2 Follow Along: Neural Network Modeling Workflow with the Keras Sequential API

In the Keras `Sequential` API, you specify a model architecture by sequentially adding layers. This type of architecture works well for feed forward neural networks in which the data flows in one direction (forward propagation) and the error flows in the opposite direction (backwards propagation). The Keras `Sequential` API follows a standardized workflow to estimate a neural network model: 

1. Load Data
2. Define Model
3. Compile Model
4. Fit Model
5. Plot metrics
6. Evaluate Model
7. Get Predictions for the test data

You saw these steps in our Keras Perceptron examples above, but let's walk through each step in detail.

### MNIST digit classification
As our next example of neural networks, we will use a multilayer perceptron (MLP) model to solve the famous **MNIST digit classification problem**. The [MNIST](https://en.wikipedia.org/wiki/MNIST_database) database is a publicly available collection of 70,000 images (60,000 for training and 10,000 for testing), each of which contains a single handwritten decimal digit (0 - 9). The ten digit classes are roughly equally represented. We want to train a model that can classify each image into the correct digit class.

#### 4.2.1 Load Data

In [None]:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense

import numpy as np

In [None]:
# Load the Data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [None]:
import matplotlib.pyplot as plt
plt.imshow(X_train[16]);

In [None]:
X_train.shape

In [None]:
28*28

In [None]:
pd.options.display.max_columns=30
pd.DataFrame(data = X_train[16])

### Each pixel has a value that is represented by 8 bits
The maximum value that can be stored in 8 bits is $2^8-1=255$
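
A small sketch of what this means for scaling: dividing 8-bit pixel values by $255$ maps them into the range from $0$ to $1$. The array `pixels` below is a made-up $2\times2$ "image", not MNIST data.

```python
import numpy as np

# the largest value an unsigned 8-bit integer can hold
print(2**8 - 1, np.iinfo(np.uint8).max)

# a tiny synthetic "image" (illustrative pixel values)
pixels = np.array([[0, 128],
                   [64, 255]], dtype=np.uint8)

scaled = pixels.astype('float32') / 255   # convert to float before dividing
print(scaled.min(), scaled.max())
```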

In [None]:
# normalize the pixel values by dividing by the max value

max_value = 255

X_train = X_train.astype('float32') / max_value
X_test = X_test.astype('float32') / max_value

In [None]:
X_train.shape

In [None]:
# flatten image 

X_train = X_train.reshape(X_train.shape[0], 784)
X_test = X_test.reshape(X_test.shape[0], 784)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
y_train[16]

In [None]:
np.unique(y_train)

#### 4.2.2 Build the Neural Network Model
[Read this brief and informative introduction to the Keras Sequential API](https://keras.io/getting-started/sequential-model-guide/)!

In [None]:
from keras import Sequential

We instantiate a `Sequential` model. We'll then build the model's architecture one layer at a time.

In [None]:
model = Sequential()

We construct our neural network model by adding `perceptron` layers one at a time. Layers in which each neuron connects to all the neurons in the previous layer and to all the neurons in the next layer are also called "fully-connected", "dense", or "densely-connected" layers. 

>[Keras layers API](https://keras.io/layers/core/)
 
When building a `perceptron` layer in Keras, the first argument specifies how many neurons we want to have in that layer. We'll create the first layer for this problem using $32$ neurons; because it sits before the output layer, it is a hidden layer. The `activation` keyword argument specifies the type of activation function to use: here we'll use the `relu` activation function. The `input_dim` keyword argument specifies the number of inputs to the layer. In the MNIST data set, each input is a flattened $28\times28$ image, so there are $784 = 28\times28$ inputs coming into this layer from each image. 
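
With $784$ inputs and $32$ neurons, we can work out how many trainable parameters this layer holds: one weight per input per neuron, plus one bias per neuron. The helper `dense_param_count` is a hypothetical function written for this sketch; its result can be compared with the parameter count that `model.summary()` reports for the layer.

```python
def dense_param_count(n_inputs, n_neurons):
    """Weights plus biases in one fully connected layer (hypothetical helper)."""
    return n_inputs * n_neurons + n_neurons

first_layer_params = dense_param_count(28 * 28, 32)
print(first_layer_params)
```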

In [None]:
# Start with an initial layer of 32 neurons
# The number of neurons in a layer is also the number of outputs. 
# Note: neural network layers before the output layer are often referred to as "hidden" layers
model.add(Dense(32, activation='relu', input_dim=784))
model.summary()

We want the second layer to combine its inputs (which, remember, are the outputs of the neurons in the first layer) to compute a set of $10$ outputs. Why $10$? You guessed it -- one for each digit class! <br>
Using the `softmax` activation function creates 10 outputs representing the respective probabilities that the image corresponds to each digit. I.e., the $3\text{rd}$ and $10\text{th}$ outputs are the probabilities that the handwritten digit in the image is a $2$ or a $9$, respectively, etc. <br>
So our neural network doesn't actually classify an input image; instead, it computes a set of probabilities over the $10$ output digit classes. Of course, the $10$ probabilities sum to $1$, as they should! We can then easily add a final step to classify each image by picking the digit class with the highest probability.<br>
Why do you suppose it's better to get the $10$ digit class probabilities as outputs rather than just the single class prediction?

Reference: [Keras `Softmax` layer documentation](https://keras.io/api/layers/activation_layers/softmax/#softmax-layer)
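To make the softmax output concrete, here is a minimal NumPy sketch of the function itself. It is an illustration only, not the Keras implementation, and the `logits` values are made up.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    # Subtracting the max improves numerical stability and does not change the result
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw outputs of the 10 output neurons for one image
logits = np.array([1.0, 0.5, 3.0, 0.1, -1.0, 0.0, 0.2, 2.0, -0.5, 0.3])
probs = softmax(logits)

print(probs.round(3))   # one probability per digit class
print(probs.sum())      # sums to 1, up to floating-point rounding
print(probs.argmax())   # index of the most probable class
```

The largest raw score always receives the largest probability, so taking `argmax` of the probabilities gives the predicted digit.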


In [None]:
# add an output layer with a softmax activation function
# Sequential() knows that the inputs to this layer are the outputs from the previous layer, so we don't have to specify the number of inputs
model.add(Dense(10, activation='softmax'))

In [None]:
model.summary()
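The parameter counts reported by `model.summary()` can be checked by hand: each neuron in a dense layer has one weight per input plus one bias. The helper name `dense_param_count` below is made up for this sketch.

```python
def dense_param_count(n_inputs, n_neurons):
    """Each neuron has one weight per input plus one bias."""
    return n_inputs * n_neurons + n_neurons

hidden_params = dense_param_count(784, 32)   # first Dense layer
output_params = dense_param_count(32, 10)    # softmax output layer

print(hidden_params)                   # 25120
print(output_params)                   # 330
print(hidden_params + output_params)   # 25450
```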

#### 4.2.3 Compile the Neural Network Model
When compiling a model in Keras, there are three inputs to specify: the `optimizer`, the `loss function` and the `metric`. <br>
We discuss them in this section.

#### The `loss function` 
is a special function quantifying the error between the targets $y$ from the training data and the network's predicted targets $\hat{y}$. We train the network to **minimize the loss function**, that is, to make the prediction error as small as possible!<br>
In a **binary** classification problem (i.e. with two classes), we would use the `binary_crossentropy` loss function. <br>
For more than two categories (the MNIST problem has 10), the [appropriate loss function](https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other) is  `sparse_categorical_crossentropy` when the targets are expressed as integers, or `categorical_crossentropy` when the targets are one-hot-encoded. 
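The following NumPy sketch shows that the two multi-class losses compute the same quantity and differ only in how the targets are encoded. It is a simplified illustration (the Keras versions also clip probabilities to avoid `log(0)`), and the probabilities are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob):
    """y_true: integer labels, shape (n,). y_prob: predicted probabilities, shape (n, k)."""
    # Pick out the probability assigned to the correct class of each example
    p_correct = y_prob[np.arange(len(y_true)), y_true]
    return -np.log(p_correct).mean()

def categorical_crossentropy(y_onehot, y_prob):
    """y_onehot: one-hot labels, shape (n, k)."""
    return -(y_onehot * np.log(y_prob)).sum(axis=1).mean()

# Made-up predictions for 2 examples and 3 classes
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
y_int = np.array([0, 2])       # integer targets
y_onehot = np.eye(3)[y_int]    # the same targets, one-hot encoded

sparse_loss = sparse_categorical_crossentropy(y_int, y_prob)
onehot_loss = categorical_crossentropy(y_onehot, y_prob)
print(sparse_loss)
print(onehot_loss)   # same value as above
```

Because the MNIST labels are integers from 0 to 9, the sparse version lets us skip the one-hot encoding step.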

#### Training a neural network
means finding the "optimal" values of the weights and biases of all the neurons, i.e. the values that lead to the lowest prediction error. This is usually done via some variation on the basic methods of **gradient descent** and **back-propagation**, which you will learn about soon!

#### The `metric`
is the number we use to gauge the performance of our neural network model. In this case, we specify that we want to report model `accuracy` as our metric for each *epoch*. An *epoch* is one complete pass through the training data. Within an epoch, the data is usually processed in small subsets (*mini-batches*), and the weights are updated after each one. In training neural networks, we usually run through multiple epochs, until the performance stops improving. We will also be able to see the overall accuracy once the model has finished training. 
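Accuracy is simply the fraction of examples whose predicted label matches the true label. A minimal sketch with made-up labels:

```python
import numpy as np

y_true = np.array([3, 1, 4, 1, 5])   # made-up true labels
y_pred = np.array([3, 1, 4, 0, 5])   # made-up predicted labels

# Comparing the arrays gives booleans; their mean is the fraction that match
accuracy = (y_pred == y_true).mean()
print(accuracy)   # 0.8
```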

#### The `optimizer` 
chooses which method to use in training the network. There are many optimizers available in Keras, but the **Adam Optimizer** usually gives great overall performance. Best practice: don't waste time trying out a lot of optimizers to find out which is best for your model, just pick **Adam**!

#### Adam Optimizer
For more background on the **Adam** optimizer, check out these references 
* [Adam Optimization Algorithm](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
* [Adam Optimizer - original paper](https://arxiv.org/abs/1412.6980)
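For intuition, here is a rough sketch of the Adam update rule next to plain gradient descent on a toy one-dimensional loss $f(w) = w^2$. This is not the Keras implementation. The function names are made up, `beta1`, `beta2` and `eps` follow the defaults suggested in the original paper, and the learning rate of `0.1` is chosen only to suit this toy problem.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w

def run_sgd(w, lr=0.1, steps=50):
    """Plain gradient descent: step against the gradient."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=50):
    """Adam: scale each step using running averages of g and g**2."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g         # running average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2    # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)            # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_start = 5.0
w_sgd = run_sgd(w_start)
w_adam = run_adam(w_start)
print(w_sgd)
print(w_adam)
```

Both runs move `w` from 5.0 toward the minimum at 0. Which optimizer gets there faster depends on the problem and the learning rate, so this toy example says nothing about their relative performance on MNIST.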

In [None]:
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])

#### 4.2.4 Fit the Neural Network Model

Let's train the MNIST neural network model that we just built to predict the digit classes! `model.fit()` has a `batch_size` parameter that sets how many examples are used for each weight update (one *mini-batch*). We'll leave it at the default, `batch_size=None`, which Keras treats as mini-batches of 32 examples, so each epoch consists of many mini-batch updates that together cover the entire training set. 

In [None]:
history = model.fit(X_train, 
                    y_train, 
                    epochs=5, 
                    validation_data=(X_test, y_test))
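The progress bar printed by `model.fit()` counts mini-batches, not images. As a quick check, assuming the standard 60,000-image MNIST training split and the Keras default batch size of 32 (used when `batch_size` is left unspecified):

```python
import math

n_train = 60000    # assumed number of MNIST training images
batch_size = 32    # Keras default when batch_size is not given

# One epoch needs enough mini-batches to cover every training example once
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)   # 1875
```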

#### 4.2.5 Plot the Learning Curves
Accuracy vs. Epoch and Loss vs. Epoch

In [None]:
type(history)

In [None]:
history.history

In [None]:
history.history['val_accuracy']

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
epochs = [i for i in range(len(history.history['loss']))]
sns.lineplot(x=epochs, y=history.history['loss'], label="train")
sns.lineplot(x=epochs, y=history.history['val_loss'], label="test")
plt.title('Loss')
plt.grid()
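A learning curve can also be summarized numerically, for example by finding the epoch at which the validation loss was lowest. The dictionary below uses made-up numbers laid out like `history.history`; it is not output from the model above.

```python
# Hypothetical values in the same layout as `history.history`
example_history = {
    "loss":     [0.90, 0.45, 0.35, 0.30, 0.27],
    "val_loss": [0.50, 0.40, 0.36, 0.37, 0.39],
}

val_loss = example_history["val_loss"]

# Index of the smallest validation loss (epochs are counted from 0 here)
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

print(best_epoch)             # 2
print(val_loss[best_epoch])   # 0.36
```

In this made-up example the training loss keeps falling after epoch 2 while the validation loss starts to rise, which is the usual sign that further epochs are no longer helping.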

#### 4.2.6 Evaluate the Model
The `model.evaluate()` method computes the loss and metric values for the input data and labels.

In [None]:
model.evaluate(X_test,y_test)

#### 4.2.7 Get Model Predictions on the test data

To see what a model prediction looks like -- say for the first example in the test set, we can use the `model.predict()` method:

In [None]:
# compute the prediction for the first test example once and reuse it
first_prediction = model.predict(X_test[0:1])
print(f'Model outputs are class probabilities from the softmax activation function:\n {first_prediction}')
print(f'\nSum of the class probabilities: {first_prediction.sum()}')
print(f'\nPredicted label: {first_prediction.argmax()}')
print(f'\nTrue label: {y_test[0]}')

Get the predicted class probabilities

You'll notice that if we rerun fitting, the results might differ slightly from run to run. This is due to the many effects of randomness in the modeling process. Interested to know more? Read the article [Embrace Randomness in Machine Learning](https://machinelearningmastery.com/randomness-in-machine-learning/), by Jason Brownlee.

In [None]:
predictions = model.predict(X_test)

Get class predictions for the whole `test` data set

In [None]:
predicted_digits = [predictions[index,:].argmax() for index in range(len(X_test))]

In [None]:
predicted_digits[:20]
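The same class predictions can be obtained without a Python loop by taking `argmax` along the class axis. A small sketch with made-up probabilities (`example_predictions` is a stand-in for the real `predictions` array):

```python
import numpy as np

# Made-up probabilities for 3 examples and 4 classes
example_predictions = np.array([[0.1, 0.7, 0.1, 0.1],
                                [0.6, 0.2, 0.1, 0.1],
                                [0.2, 0.2, 0.1, 0.5]])

# Loop version, as in the cell above
loop_result = [example_predictions[i, :].argmax() for i in range(len(example_predictions))]

# Vectorized version: argmax over axis 1 (the class axis)
vectorized_result = example_predictions.argmax(axis=1)

print(loop_result)
print(vectorized_result)   # a NumPy array holding the same class indices
```

The vectorized form returns a NumPy array directly, so no separate conversion with `np.array` is needed afterwards.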

#### Let's look at some of the misclassified images

In [None]:
predicted_digits=np.array(predicted_digits)

In [None]:
# boolean indicator that is True when the predicted digit label agrees with the actual digit label
indicator = (predicted_digits == y_test)

In [None]:
# get pixel vectors and the predicted labels for the misclassified examples 
misclassified_examples = X_test[~indicator]
misclassified_labels = predicted_digits[~indicator]

In [None]:
# reshape the pixel vectors back to arrays (so we can display them as images)
print(misclassified_examples.shape)
misclassified_images = misclassified_examples.reshape((misclassified_examples.shape[0], 28, 28))
print(misclassified_images.shape)

True digit labels for the first 10 misclassified images

In [None]:
true_labels = y_test[~indicator]
true_labels[:10]

Let's examine the True vs. Predicted digit labels for the first 10 misclassified images.<br> 
Can you explain why each image was misclassified?

In [None]:
misclassified_labels[:10]

In [None]:
for image, label, true_label in zip(misclassified_images[:10,:,:],misclassified_labels[:10], true_labels[:10]):
  plt.figure()
  plt.title(f'digit: true {true_label}, predicted {label}')
  plt.imshow(np.squeeze(image));
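Looking at individual images is informative, but it can also help to count which pairs of digits get confused most often. One way to do that is a confusion matrix. The sketch below uses made-up labels for a 3-class problem, and the helper `confusion_matrix` is written here for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

# Made-up labels for a 3-class problem
y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
```

Entries on the diagonal are correct predictions; every off-diagonal entry counts one kind of mistake. For MNIST the same call would be `confusion_matrix(y_test, predicted_digits, n_classes=10)`.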

## Challenge

You will be expected to use the Keras `Sequential` API to estimate (i.e. build, compile, fit, and evaluate) a feed forward neural network on a dataset.

---

# 5. Choosing An Architecture (Learn)

## Overview

Choosing an architecture for a neural network is almost more an art than a science. The best way to choose an architecture is through research and experimentation. 

Let's do a few experiments. To track our results we'll use a tool called [TensorBoard](https://www.tensorflow.org/tensorboard), which is a way to interactively visualize the results of our various experiments. Here is our previous model with TensorBoard incorporated: 

In [None]:
%load_ext tensorboard

import os
import datetime
import tensorflow as tf

logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

In [None]:
## build a neural network with 1 hidden layer

# instantiate a sequential model

# add a dense layer

# add an output layer

# compile the model 

# fit the model 

### BEGIN SOLUTION
model = Sequential()

model.add(
    # Hidden Layer 1
    Dense(32, activation="relu", input_dim=784)
)

model.add(
    Dense(10, activation='softmax')
)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test), callbacks=[tensorboard_callback])
### END SOLUTION 

In `tensorboard`, click on "SCALARS" to see plots of metric (accuracy) and loss vs. epoch

In [None]:
%tensorboard --logdir logs

## Follow Along

Let's run a couple of experiments in groups based on your birthday: 
1. Jan - March:  Try adding an additional layer to the model
2. April - June: Add 2 additional hidden layers with identical number of neurons
3. July - Sept: Change the activation functions in the hidden layers (use as many layers as you want)
4. Oct - December: Try changing the optimizer and use any architecture that you want. 
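When several experiments log to the same directory, their curves can be hard to tell apart in TensorBoard. One possible approach, sketched below, is to give each experiment its own log directory. The helper `make_logdir` and the experiment name are made up for this example.

```python
import os
import datetime

def make_logdir(experiment_name, root="logs"):
    """Return a log directory path of the form <root>/<name>-<timestamp>."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    return os.path.join(root, f"{experiment_name}-{stamp}")

logdir = make_logdir("two-hidden-layers")
print(logdir)

# The path could then be passed to a fresh callback for that experiment, e.g.
# tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
```

Each run then appears under its own name in the TensorBoard run selector.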

### 1. Additional Hidden Layer

In [None]:
## build a neural network with 2 hidden layers

# instantiate a sequential model

# add 1st dense layer

# add 2nd dense layer

# add an output layer

# compile the model 

# fit the model 


### BEGIN SOLUTION
model = Sequential()

model.add(
    # Hidden Layer 1
    Dense(32, activation="relu", 
          input_dim=784)
)

model.add(
    # Hidden Layer 2
    Dense(5, activation="relu")
)

# output layer
model.add(
    Dense(10, activation='softmax')
)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test), callbacks=[tensorboard_callback])
### END SOLUTION

In `tensorboard`, click on "SCALARS" to see plots of metric (accuracy) and loss vs. epoch

In [None]:
%tensorboard --logdir logs

### 2. Two Additional Hidden Layers

In [None]:
## build a 3 hidden layer neural network 

# instantiate a sequential model

# add 1st dense layer

# add 2nd dense layer

# add 3rd dense layer

# add an output layer

# compile the model 

# fit the model 



### BEGIN SOLUTION
model = Sequential()

model.add(
    # Hidden Layer 1
    Dense(32, activation="relu", input_dim=784)
)

model.add(
    # Hidden Layer 2
    Dense(32, activation="relu")
)

model.add(
    # Hidden Layer 3
    Dense(32, activation="relu")
)

model.add(
    Dense(10, activation='softmax')
)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test), callbacks=[tensorboard_callback])
### END SOLUTION

In `tensorboard`, click on "SCALARS" to see plots of metric (accuracy) and loss vs. epoch

In [None]:
%tensorboard --logdir logs

### 3. Different Activation Functions

In [None]:
## build a 3 hidden layer neural network with different activation functions

# instantiate a sequential model

# add 1st dense layer

# add 2nd dense layer

# add 3rd dense layer

# add an output layer

# compile the model 

# fit the model 


### BEGIN SOLUTION
model = Sequential()

model.add(
    # Hidden Layer 1
    Dense(32, activation="sigmoid", input_dim=784)
)

model.add(
    # Hidden Layer 2
    Dense(32, activation="sigmoid")
)

model.add(
    # Hidden Layer 3
    Dense(32, activation="sigmoid")
)

model.add(
    Dense(10, activation='softmax')
)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test), callbacks=[tensorboard_callback])
### END SOLUTION

In `tensorboard`, click on "SCALARS" to see plots of metric (accuracy) and loss vs. epoch

In [None]:
%tensorboard --logdir logs
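To see how the two activation functions used so far differ, the sketch below evaluates `relu` and `sigmoid` and their derivatives on a few made-up inputs. It is plain NumPy written for illustration, not Keras code.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is 1 for positive inputs and 0 otherwise
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])

print(relu(z))
print(relu_grad(z))
print(sigmoid(z).round(3))
print(sigmoid_grad(z).round(3))   # never above 0.25, close to 0 for large |z|
```

The sigmoid derivative peaks at 0.25 and shrinks toward zero for large positive or negative inputs, while the ReLU derivative stays at 1 for every positive input. This is one commonly cited reason that networks with sigmoid hidden layers can train more slowly, which you can compare against your own TensorBoard curves.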

### 4. Different Optimization Functions
[Keras Optimizers Docs](https://keras.io/api/optimizers/)

In [None]:
## build a 3 hidden layer neural network with a different optimizer

# instantiate a sequential model

# add 1st dense layer

# add 2nd dense layer

# add 3rd dense layer

# add an output layer

# compile the model 

# fit the model 


### BEGIN SOLUTION
model = Sequential()

model.add(
    # Hidden Layer 1
    Dense(32, activation="relu", input_dim=784)
)

model.add(
    # Hidden Layer 2
    Dense(32, activation="relu")
)

model.add(
    # Hidden Layer 3
    Dense(32, activation="relu")
)

model.add(
    Dense(10, activation='softmax')
)

# reference the docs to see optimizer options such as rmsprop, sgd, adadelta, etc ... 
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test), callbacks=[tensorboard_callback])
### END SOLUTION

In `tensorboard`, click on "SCALARS" to see plots of metric (accuracy) and loss vs. epoch

In [None]:
%tensorboard --logdir logs

## Challenge

You will have to choose your own architectures in today's module project. In the next module, we will discuss hyperparameter optimization, which can help you handle these numerous choices. 

---

# Sources


### Academic References -- background on the perceptron
- McCulloch, W.S. & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. https://doi.org/10.1007/BF02478259
- Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). [Deep learning](https://www.deeplearningbook.org/).

### Implementation Reference
- [NN-SVG](http://alexlenail.me/NN-SVG/index.html) by Alex Lenail. Used to generate diagrams for this notebook. 
- Alammar, Jay (2016). [A Visual and Interactive Guide to the Basics of Neural Networks](https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/).
- [SINGLE LAYER NEURAL NETWORK - PERCEPTRON MODEL ON THE IRIS DATASET USING HEAVISIDE STEP ACTIVATION FUNCTION](https://www.bogotobogo.com/python/scikit-learn/Perceptron_Model_with_Iris_DataSet.php) by K Hong. For Perceptron Demo.

### Supplementary Videos
These videos build a deeper level of understanding of neural networks

- [3Blue1Brown Neural Networks Playlist](https://www.3blue1brown.com/topics/neural-networks)  -- these 4 short videos do a great job introducing neural networks in understandable terms. The first 3 videos give you an intuitive feel for neural networks and how backpropagation works. The last video goes into the details of backpropagation using calculus, if you're interested in a deeper dive.
- [Andrew Ng Neural Network Introduction Videos](https://www.youtube.com/watch?v=1ZhtwInuOD0&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=43)
