Theory Topics
- Perceptron Model to Neural Networks
- Activation Functions
- Cost Functions
- Feed Forward Networks
- BackPropagation

Coding Topics
- TensorFlow 2.0 Keras Syntax
- ANN with Keras
    - Regression
    - Classification
- Exercises for Keras ANN
- TensorBoard Visualizations


### Perceptron Model

To begin understanding deep learning, we will build up our model abstractions:
- Single Biological Neuron
- Perceptron
- Multi-layer Perceptron Model
- Deep Learning Neural Network

As we learn about more complex models, we'll introduce concepts, such as:
- Activation Functions
- Gradient Descent 
- Back Propagation

A perceptron was a form of neural network introduced in 1958 by Frank Rosenblatt.

However, in 1969 Marvin Minsky and Seymour Papert published their book Perceptrons.

It suggested that there were severe limitations to what perceptrons could do.

This marked the beginning of what is known as the AI winter, with little funding into AI and Neural Networks in the 1970s.


### Neural Networks

- A single perceptron will not be enough to learn complicated systems.
- Fortunately, we can expand on the idea of a single perceptron, to create a multi-layer perceptron model.

To build a network of perceptrons, we can connect layers of perceptrons, using a multi-layer perceptron model.

The outputs of one perceptron are fed directly as inputs into another perceptron.

This allows the network as a whole to learn about interactions and relationships between features.

The first layer is the input layer.

The last layer is the output layer. This last layer can have more than one neuron.

Layers in between the input and output layers are the hidden layers.

Hidden layers are difficult to interpret, due to their high interconnectivity and distance away from known input and output.

Neural networks become "deep neural networks" if they contain 2 or more hidden layers.
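As a rough illustration of how layers feed into one another, the sketch below pushes two input features through two hidden layers and one output neuron. All weights and biases are made-up numbers, `dense_layer` is a hypothetical helper name, and the sigmoid squashing function is an assumption at this point (activation functions are covered in their own section).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """One layer of perceptrons: each neuron computes sigmoid(w . x + b)."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(z))
    return outputs

# Made-up parameters: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
hidden_1_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_1_biases = [0.0, 0.1, -0.1]
hidden_2_weights = [[0.7, -0.5, 0.2], [-0.4, 0.3, 0.6]]
hidden_2_biases = [0.05, -0.05]
output_weights = [[1.2, -0.9]]
output_biases = [0.1]

features = [1.0, 2.0]                       # input layer
hidden_1 = dense_layer(features, hidden_1_weights, hidden_1_biases)
hidden_2 = dense_layer(hidden_1, hidden_2_weights, hidden_2_biases)
prediction = dense_layer(hidden_2, output_weights, output_biases)

print("hidden layer 1:", hidden_1)
print("hidden layer 2:", hidden_2)
print("output layer:  ", prediction)
```

The output of each layer is simply the input of the next one, which is all that "connecting layers" means here.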

In classification tasks, it would be useful to have all outputs fall between 0 and 1.
These values can then represent probability assignments for each class.

### Activation Functions

Inputs x have a weight w and a bias term b attached to them in the perceptron model.
Which means we have:
x * w + b

Clearly w implies how much weight or strength to give the incoming input. We can think of b as an offset value, so that x * w has to reach a certain threshold before having an effect.

For example, if b = -10:
- x * w + b = x * w - 10

Then the effect of x * w will not really start to overcome the bias until the product surpasses 10.
After that, the effect is based solely on the value of w. Hence the term "bias".
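A small numeric sketch of the b = -10 example may help. The weight w = 2.0 and the input values are arbitrary choices for illustration, and `weighted_input` is a hypothetical helper name.

```python
def weighted_input(x, w, b):
    """Compute x * w + b for a single input."""
    return x * w + b

b = -10
w = 2.0  # arbitrary illustrative weight

for x in [1, 5, 10]:
    z = weighted_input(x, w, b)
    # z only becomes positive once x * w exceeds 10
    print(f"x={x:>2}  x*w={x * w:>5}  x*w+b={z:>6}")
```

With these numbers the result is negative for x = 1, exactly zero at x = 5 (where x * w equals 10), and positive for x = 10.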

Next we want to set boundaries for the overall output value of: x * w + b

We can state: z = x * w + b

And then pass z through some activation function to limit its value.

The simplest networks rely on a basic step function that outputs 0 or 1. This sort of function could be useful for classification (class 0 or class 1). However, this is a very "strong" function, since small changes are not reflected: there is an immediate cutoff that splits between 0 and 1.
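A minimal sketch of such a step function is shown below. Placing the cutoff at z = 0 and returning 1 at exactly 0 are assumed conventions.

```python
def step(z):
    """Output 1 when z is at or above the cutoff (assumed to be 0), else 0."""
    return 1 if z >= 0 else 0

# A tiny change around the cutoff flips the output completely,
# while a huge change away from it has no effect at all.
for z in [-0.001, 0.001, 100.0]:
    print(z, "->", step(z))
```

Note that `step(0.001)` and `step(100.0)` are indistinguishable, which is the sense in which small (or even large) changes are not reflected in the output.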

It would be nice to have a smoother function that still keeps the output between 0 and 1. Lucky for us, this is the sigmoid function!
F(z) = 1 / (1 + e^(-z))
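The sigmoid formula can be sketched in a few lines of Python. The two-branch form below is an implementation choice, not something the formula requires; it avoids overflow in `math.exp` for very negative z.

```python
import math

def sigmoid(z):
    """1 / (1 + e^(-z)), written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so this stays small
    return ez / (1.0 + ez)

for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(f"{z:>6}: {sigmoid(z):.4f}")
```

Unlike the step function, nearby inputs give nearby outputs, and the value is exactly 0.5 at z = 0.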

Some activation functions:
- Hyperbolic Tangent: tanh(z)
- Rectified Linear Unit (ReLU): This is actually a relatively simple function: max(0, z). ReLU has been found to perform very well, especially when dealing with the issue of vanishing gradients.
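For comparison, here is a short sketch of both functions using only the standard library. The sample inputs are arbitrary.

```python
import math

def tanh(z):
    """Hyperbolic tangent: outputs fall between -1 and 1."""
    return math.tanh(z)

def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return max(0.0, z)

for z in [-2.0, 0.0, 2.0]:
    print(f"z={z:>5}  tanh={tanh(z):>7.4f}  relu={relu(z):>4}")
```

tanh is bounded on both sides, while ReLU passes positive values through unchanged and clips negative values to zero.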



### Multi-Class classification

There are 2 main types of multi-class situations:
- Non-Exclusive Classes: A data point can have multiple classes/categories assigned to it.
- Mutually Exclusive Classes: Only one class per data point.

#### Non-Exclusive Classes
- E.g. Photos can have multiple tags (e.g. beach, family, vacation, etc.)

#### Mutually Exclusive Classes
- E.g. Photos can be categorized as being in greyscale (black and white) or full color. A photo cannot be both at the same time.

#### Organizing Multiple Classes 
The easiest way to organize multiple classes is to simply have 1 output node per class.

### Non-exclusive 
- Sigmoid function: Each neuron will output a value between 0 and 1, indicating the probability of having that class assigned to it. This allows each neuron to output independently of the other classes, so a single data point fed into the network can have multiple classes assigned to it.
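The photo-tagging case can be sketched as follows. The raw output values and the 0.5 cutoff are assumptions chosen for illustration; in practice the cutoff is a design decision.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

tags = ["beach", "family", "vacation"]
raw_outputs = [2.0, -1.0, 0.5]   # hypothetical values, one per output neuron

# Each output neuron is squashed on its own, independent of the others.
probabilities = [sigmoid(z) for z in raw_outputs]

threshold = 0.5  # assumed cutoff for assigning a tag
assigned = [tag for tag, p in zip(tags, probabilities) if p >= threshold]

print(dict(zip(tags, probabilities)))
print("assigned tags:", assigned)
```

Because the neurons are independent, the probabilities do not need to sum to 1, and more than one tag can pass the cutoff.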

### Mutually Exclusive Classes
- Softmax function: The softmax function calculates the probability distribution over K different events. This function will calculate the probability of each target class over all possible target classes.

Each probability will be in the range 0 to 1, and the sum of all the probabilities will be equal to one. The model returns the probability of each class, and the target class chosen is the one with the highest probability.
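A minimal softmax sketch for the greyscale versus color example is shown below. The raw output values are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than part of the definition.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)                       # stability: avoid huge exponents
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["greyscale", "color"]
raw_outputs = [0.3, 1.9]   # hypothetical values, one per output neuron

probabilities = softmax(raw_outputs)
predicted = classes[probabilities.index(max(probabilities))]

print(dict(zip(classes, probabilities)))
print("predicted class:", predicted)
```

Here the outputs are coupled: raising one class's probability necessarily lowers the others.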


#### Review
- Perceptrons expanded to neural network model
- Weights and Biases
- Activation Functions 
- Time to learn about Cost Functions!

### Cost Functions and Gradient Descent

We understand that neural networks take in inputs, multiply them by weights, and add biases to them. Then this result is passed through an activation function which at the end of all the layers leads to some output.

The output ŷ is the model's estimate of what it predicts the label to be. So after the network creates its prediction, how do we evaluate it?
And after the evaluation how can we update the network's weights and biases?

We need to take the estimated outputs of the network and then compare them to the real values of the label. Keep in mind this is using the training data set during the fitting/training of the model.

The cost function (loss function) must be an average so it can output a single value. We can keep track of our loss/cost during training to monitor network performance.

We will use the following variables:
- y to represent the true value.
- a to represent the neuron's prediction.

In terms of weights and bias:
- w * x + b = z
- Pass z into activation function f(z) = a

One very common cost function is the quadratic cost function: we calculate the difference between the real values y(x) and our predicted values a(x), square it, and average over the training samples.
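A minimal sketch of the quadratic cost follows. The factor of 1/2 in front of the average is one common convention (it cancels when differentiating) and is an assumption here; the sample values are made up.

```python
def quadratic_cost(y_true, a_pred):
    """Average of squared differences, with the conventional 1/2 factor."""
    n = len(y_true)
    return sum((y - a) ** 2 for y, a in zip(y_true, a_pred)) / (2 * n)

y_true = [1.0, 0.0, 1.0]      # real labels y(x)
a_pred = [0.9, 0.2, 0.6]      # network predictions a(x)

cost_value = quadratic_cost(y_true, a_pred)
print(cost_value)
```

The squared differences here are 0.01, 0.04 and 0.16, so the cost is 0.21 / 6 = 0.035 (up to floating-point rounding). A perfect prediction gives a cost of exactly 0.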

We can think of the cost function as C(W, B, S, E), where W is our neural network's weights, B is our neural network's biases, S is the input of a single training sample, and E is the desired output of that training sample.

This also means that if we have a huge network, we can expect C to be quite complex, with huge vectors of weights and biases.

### Gradient Descent 
- We could start with larger steps, then take smaller ones as the slope gets closer to zero. This is known as adaptive gradient descent.
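To make the idea of shrinking steps concrete, here is a sketch of plain gradient descent on a made-up one-dimensional cost, C(w) = (w - 3)^2. The learning rate of 0.1 is arbitrary. With a fixed learning rate the step is proportional to the slope, so it shrinks on its own as the slope approaches zero; adaptive methods go further and adjust the rate itself.

```python
def cost(w):
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the cost with respect to w."""
    return 2.0 * (w - 3.0)

learning_rate = 0.1
w = 0.0
step_sizes = []

for _ in range(100):
    step = learning_rate * slope(w)   # large slope -> large step
    w -= step                         # move downhill
    step_sizes.append(abs(step))

print("final w:", w)
print("first step:", step_sizes[0], "last step:", step_sizes[-1])
```

The first step is about 0.6, and by the end the steps are vanishingly small as w settles near the minimum at 3.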

In 2015, Kingma and Ba published their paper "Adam: A Method for Stochastic Optimization". Adam is a much more efficient way of searching for these minimums, so you will see us use it in our code!

Realistically, we're calculating this descent in an n-dimensional space for all our weights. When dealing with these n-dimensional vectors (tensors), the notation changes from derivative to gradient.
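For intuition only, the Adam update rule can be sketched on a single weight. The cost (w - 3)^2 and the learning rate of 0.1 are choices made for this toy example, and `adam_minimize` is a hypothetical helper; in the Keras code, the built-in optimizer would be used instead.

```python
import math

def slope(w):
    """Gradient of the toy cost (w - 3)^2."""
    return 2.0 * (w - 3.0)

def adam_minimize(grad_fn, w, steps=500, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0   # running average of gradients
    v = 0.0   # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)    # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_final = adam_minimize(slope, w=0.0)
print(w_final)
```

The step size is scaled by the recent gradient history rather than by the raw slope alone, which is what makes the method adaptive.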

For classification problems, we often use the cross entropy loss function. The assumption is that your model predicts a probability distribution p(y = i) for each class i = 1, 2, ..., C.
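A small sketch of cross entropy for a single sample with a one-hot true label is given below. The predicted distributions are made up, and the `eps` clamp is an assumed guard against taking log(0).

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    """-sum_i y_i * log(p_i) for one sample; y_true is one-hot."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, p_pred))

y_true = [0.0, 1.0, 0.0]        # the true class is the second one
confident = [0.1, 0.7, 0.2]     # most probability on the true class
unsure = [0.4, 0.3, 0.3]        # little probability on the true class

loss_confident = cross_entropy(y_true, confident)
loss_unsure = cross_entropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

With a one-hot label, only the probability assigned to the true class matters, and the loss grows as that probability shrinks.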


Review:
- Cost Functions
- Gradient Descent
- Adam Optimizer
- Quadratic Cost and Cross-Entropy

### Back Propagation

Fundamentally, we want to know how the cost function result changes with respect to the weights in the network, so we can update the weights to minimize the cost function.

Each input will receive a weight and a bias. This means we have: C(w1, b1, w2, b2, w3, b3).

The main idea here is that we can use the gradient to go back through the network and adjust our weights and biases to minimize the output of the error vector on the last output layer.
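For a single neuron, this idea reduces to the chain rule. The sketch below assumes a sigmoid activation and the cost C = 0.5 * (a - y)^2; the input, label, starting parameters and learning rate are all made-up values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_cost(x, y, w, b):
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

x, y = 1.5, 1.0          # one training sample (made up)
w, b = 0.8, -0.5         # starting weight and bias (made up)

# Chain rule: dC/dw = dC/da * da/dz * dz/dw
a = sigmoid(w * x + b)
dC_da = a - y
da_dz = a * (1.0 - a)    # derivative of the sigmoid
grad_w = dC_da * da_dz * x
grad_b = dC_da * da_dz   # dz/db = 1

cost_before = neuron_cost(x, y, w, b)

learning_rate = 0.5
w -= learning_rate * grad_w
b -= learning_rate * grad_b

cost_after = neuron_cost(x, y, w, b)
print(cost_before, "->", cost_after)
```

One update in the direction opposite to the gradient lowers the cost on this sample, which is the basic move that gets repeated layer by layer in a full network.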

Using some calculus notation, we can expand this idea to networks with multiple neurons per layer.

This notation relies on the Hadamard product (element-by-element multiplication).
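As a quick illustration of element-by-element multiplication, the sketch below multiplies two equal-length vectors position by position. The helper name `hadamard` and the sample vectors are hypothetical.

```python
def hadamard(u, v):
    """Element-by-element product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [a * b for a, b in zip(u, v)]

u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]

elementwise = hadamard(u, v)   # stays a vector
dot = sum(elementwise)         # a dot product would collapse it to one number

print(elementwise, dot)
```

The result keeps the same shape as the inputs, which is what distinguishes it from a dot product or a matrix product.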

### Learning Process of the Neural Network

- Step 1: Using input x, set the activation a for the input layer.
    - z = w * x + b
    - a = f(z)

- This resulting a then feeds into the next layer (and so on).

- Step 2: For each layer, compute:
    - z(l) = w(l) * a(l-1) + b(l)
    - a(l) = f(z(l))

- Step 3: We compute our error vector:
    - Expressing the rate of change of C with respect to the output activations.

- Step 4: Backpropagate the error
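The four steps can be sketched for a tiny network with one neuron per layer, matching C(w1, b1, w2, b2, w3, b3). This assumes a sigmoid activation and the cost C = 0.5 * (a - y)^2, with made-up parameter values. With one neuron per layer the Hadamard product is just ordinary multiplication.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, weights, biases):
    """Steps 1-2: compute z and a for every layer, starting from a(0) = x."""
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w * activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))
    return zs, activations

def cost(x, y, weights, biases):
    _, activations = forward(x, weights, biases)
    return 0.5 * (activations[-1] - y) ** 2

def backprop(x, y, weights, biases):
    zs, activations = forward(x, weights, biases)
    # Step 3: error at the output layer, dC/da * f'(z)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [0.0] * len(weights)
    grad_b = [0.0] * len(biases)
    # Step 4: walk backwards, passing the error to the previous layer
    for l in range(len(weights) - 1, -1, -1):
        grad_w[l] = activations[l] * delta
        grad_b[l] = delta
        if l > 0:
            delta = weights[l] * delta * sigmoid_prime(zs[l - 1])
    return grad_w, grad_b

weights = [0.6, -0.4, 0.9]   # w1, w2, w3 (made up)
biases = [0.1, 0.2, -0.3]    # b1, b2, b3 (made up)
x, y = 0.5, 1.0              # one training sample (made up)

grad_w, grad_b = backprop(x, y, weights, biases)
print("dC/dw:", grad_w)
print("dC/db:", grad_b)
```

A useful sanity check for any backpropagation code is to compare these gradients with finite-difference estimates obtained by nudging each weight slightly and re-evaluating `cost`.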

### Difference between Keras and TensorFlow

TensorFlow is an open-source deep learning library developed by Google, with TF 2.0 being officially released in late 2019.
TensorFlow has a large ecosystem of related components, including libraries like TensorBoard, Deployment and Production APIs, and support for various programming languages.


Keras is a high-level Python library that can use a variety of deep learning libraries underneath, such as TensorFlow, CNTK, or Theano.

TensorFlow 1.x had a complex Python class system for building models. Due to the huge popularity of Keras, TF adopted Keras as its official high-level API when TF 2.0 was released.


### Keras Syntax Basics

In [1]:
import pandas as pd 
import numpy as np 

import seaborn as sns

In [None]:
df = pd.read_csv("USA_Housing.csv")
df.head()

In [None]:
sns.pairplot(df)

In [None]:
df.columns

In [None]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']].values
X

In [None]:
y = df["Price"].values
y

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [10]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
scaler.fit(X_train)

In [12]:
X_train = scaler.transform(X_train)

In [13]:
X_test = scaler.transform(X_test)

In [None]:
X_train.min()

In [None]:
X_train.max()

In [None]:
X_train

In [17]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [18]:
model = Sequential()

model.add(Dense(4, activation="relu"))
model.add(Dense(4, activation="relu"))
model.add(Dense(4, activation="relu"))
model.add(Dense(1))

model.compile(optimizer="rmsprop", loss="mse")

In [None]:
model.fit(x=X_train, y=y_train, epochs=250)

In [20]:

history_dict = model.history.history  # This remains a dictionary
loss_df = pd.DataFrame(history_dict)

In [None]:
loss_df.plot()

In [None]:
model.evaluate(x=X_test, y = y_test, verbose=0)

In [None]:
model.evaluate(x=X_train, y=y_train, verbose=0)

In [None]:
test_predictions = model.predict(X_test)

In [None]:
test_predictions

In [None]:
# Flatten the (n_samples, 1) predictions; -1 infers the length instead of hard-coding 1500
test_predictions = pd.Series(test_predictions.reshape(-1))
test_predictions

In [27]:
pred_df = pd.DataFrame(y_test, columns=["Test True Y"])

In [28]:
pred_df = pd.concat([pred_df, test_predictions], axis = 1)

In [29]:
pred_df.columns= ["Test True Y", "Model Predictions"]

In [None]:
pred_df

In [None]:
sns.scatterplot(x="Test True Y", y = "Model Predictions", data= pred_df)

In [32]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [45]:
m_a_e = mean_absolute_error(pred_df["Test True Y"], pred_df["Model Predictions"])

In [42]:
mean_price = df["Price"].mean()

In [None]:
print((m_a_e / mean_price) * 100)

In [None]:
df.describe()

In [None]:
mean_squared_error(pred_df["Test True Y"], pred_df["Model Predictions"])

In [None]:
(mean_squared_error(pred_df["Test True Y"], pred_df["Model Predictions"]))** 0.5

In [47]:
from tensorflow.keras.models import load_model

In [50]:
# Save the model in the Keras native format
model.save('housing_prediction.keras')


In [51]:
later_model = load_model("housing_prediction.keras")

In [None]:
later_model.predict(X_test)