# Notes on Neural Networks

## Linear regression

Suppose you have a dataset $D$ with numeric predictor variables and a numeric target variable.

For a specific row: 
- The values of the predictor variables are denoted $x_1$, $x_2$, ..., $x_n$. Let $x_0=1$. 
- The value of the target variable is denoted $y_a$

Create a linear _prediction_ function, also called a model,
$$ L_w(x) = w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n = w\cdot x
$$ 
using a set of numeric weights $w = (w_0, w_1, ..., w_n)$. The weights $w$ determine the function. The $x$ values are input to the function.

This function predicts $y$ values, where $y_p = L_w(x)$ is the predicted value and where $x$ is an array/row of predictor values. 

The difference $|y_a - y_p|$ between the actual value $y_a$ and the predicted value $y_p$ is the _error_ or _residual_. 

The _cost_ of a model $L_w$, determined by weights $w$ and evaluated on a dataset $D$, is the sum of the squared errors over all rows:
$$ C(L,D) = \sum_{x, y_a \in D} \left[ y_a - L_w(x) \right]^2
$$

The goal of linear regression, given a dataset with numeric predictors and target, is to find weights which minimize this cost.
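The definitions above can be sketched in NumPy. This is a minimal illustration, not part of the original notes: the toy dataset and the names `predict`, `cost` and `w_best` are assumptions, and `np.linalg.lstsq` is used as one way to find the minimizing weights.

```python
import numpy as np

def predict(w, x):
    """Linear prediction L_w(x) = w . x, where x[0] == 1."""
    return np.dot(x, w)

def cost(w, X, y_actual):
    """Sum of squared errors over all rows of the dataset."""
    residuals = y_actual - predict(w, X)
    return float(np.sum(residuals ** 2))

# Hypothetical toy dataset: first column is x_0 = 1, second is x_1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y_actual = np.array([1.0, 3.0, 5.0])   # generated by y = 1 + 2 * x_1

# Least-squares solution: the weights that minimize the cost.
w_best, *_ = np.linalg.lstsq(X, y_actual, rcond=None)

print(w_best)
print(cost(w_best, X, y_actual))        # cost at the fitted weights
print(cost(np.zeros(2), X, y_actual))   # cost at all-zero weights
```

The cost at the fitted weights is essentially zero because the toy targets were generated without noise; any other weights, such as all zeros, give a larger cost.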

## Logistic regression

Suppose that, though the predictor variables remain numeric, the target variable is binary, in this case with values `0` and `1`.

The standard logistic function 
$$ \sigma(t) = \frac{1}{1+e^{-t}}
$$
has a range between `0` and `1`. 

A _prediction_ function can be created using $\sigma$ and $L_w$
$$ P(x) = \sigma(L_w(x))
$$

The function $P$ is determined by the weights $w$ since $L_w$ is determined by these weights.

Another prediction function can be created 
$$ P(x) = \operatorname{tanh}(L_w(x))
$$
using the $\operatorname{tanh}$ function 
$$ \operatorname{tanh}(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}
$$
which has a range between `-1` and `+1`.

In general we write 
$$ A_w(x) = A(L_w(x)) = A(w\cdot x)
$$
where $A$ is an _activation function_ such as $\sigma$ or $\operatorname{tanh}$.
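As a rough sketch of $A_w(x) = A(w\cdot x)$, the two activation functions can be applied to the same linear combination in NumPy. The weights and predictor values below are made up for illustration, and `activate` is a hypothetical helper name.

```python
import numpy as np

def sigma(t):
    """Standard logistic function, with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def activate(A, w, x):
    """A_w(x) = A(w . x) for an activation function A."""
    return A(np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])   # hypothetical weights w_0, w_1, w_2
x = np.array([1.0, 3.0, 1.0])    # x_0 = 1 followed by two predictor values

p_sigma = activate(sigma, w, x)    # lies in (0, 1)
p_tanh = activate(np.tanh, w, x)   # lies in (-1, 1)
print(p_sigma, p_tanh)
```

Here $w\cdot x = -0.5$, so the logistic output falls below $0.5$ and the tanh output is negative.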

There is a cost function $C(P,D)$ associated with these logistic regression prediction functions. 

Wikipedia: 
[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression),
[Logistic function](https://en.wikipedia.org/wiki/Logistic_function#Derivative)
and [graph](https://en.wikipedia.org/wiki/Logistic_function#/media/File:Logistic-curve.svg)

## Keras 

There are two types of Keras models:
1. The _sequential_ models, which will be discussed below
1. The _functional_ (complex) models, which will not be discussed below

These Keras models have some common API calls. 
See [About Keras Models](http://keras.io/models/about-keras-models/). 

### Sequential model

The [Sequential Model](http://keras.io/models/sequential/) (see also [Sequential model guide](http://keras.io/getting-started/sequential-model-guide/)) consists of a sequence of layers and each layer consists of a sequence of nodes. See this [example](https://www.google.com/imgres?imgurl=https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/2000px-Colored_neural_network.svg.png&imgrefurl=https://en.wikipedia.org/wiki/Artificial_neural_network&h=2405&w=2000&tbnid=yXQTmNEYaTljzM:&tbnh=160&tbnw=133&docid=ynYusGDc2AddHM&itg=1&client=safari&usg=__3Yu82IJzvxUUl1bq40OteHlp0qg=&sa=X&ved=0ahUKEwifjZqw9anMAhXEaz4KHfT1C2wQ9QEIIzAA#h=2405&tbnh=160&tbnw=133&w=2000). 
The first layer corresponds to the input and has one node for each input variable. The output layer has one node for each numeric target variable and one node for each class of each categorical target variable. I haven't constructed a model for a dataset with multiple multi value target variables or with mixed target variables.

There are several steps to creating and working with models:
1. Create the model and its layers
1. Compile the model by specifying, at least, an optimization function and a loss function
1. Fit/train the model on a dataset
1. Predict target values for each row of a test dataset
1. Evaluate the model

These steps are demonstrated using a simple dataset constructed below. The target variable is continuous, which makes this a regression problem.

###  Import the required packages

In [42]:
import pandas    as pd
import numpy     as np
import itertools as it

from keras.utils       import np_utils 
from keras.models      import Sequential
from keras.layers.core import Dense, Activation

### Create the demonstration dataset

In [89]:
N_samples      =    10  # training sample size
N_variables    =     3  # number of input variables
training_steps = 10000  # number of training iterations

x_start        = 0
x_stop         = 1
x_numof        = 5

rnd_mul  = 0.001

x_gen    = np.random.randint(1, 4, 1+N_variables)
x_gen

x_train = np.column_stack([np.ones(x_numof**N_variables),
                           np.array(list(it.product(np.linspace(x_start, 
                                                                x_stop, 
                                                                x_numof),
                                                    repeat=N_variables)))])

y_train = x_train.dot(x_gen) + np.random.randn(x_numof**N_variables) * rnd_mul
y_train.shape, y_train.dtype, x_train.shape, x_train.dtype

((125,), dtype('float64'), (125, 4), dtype('float64'))

### Create the model and its layers

A sequential model is a sequence of layers. See this [example](https://www.google.com/imgres?imgurl=https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/2000px-Colored_neural_network.svg.png&imgrefurl=https://en.wikipedia.org/wiki/Artificial_neural_network&h=2405&w=2000&tbnid=yXQTmNEYaTljzM:&tbnh=160&tbnw=133&docid=ynYusGDc2AddHM&itg=1&client=safari&usg=__3Yu82IJzvxUUl1bq40OteHlp0qg=&sa=X&ved=0ahUKEwifjZqw9anMAhXEaz4KHfT1C2wQ9QEIIzAA#h=2405&tbnh=160&tbnw=133&w=2000) for a graphical representation of a sequential model.
There is one input layer (first in the sequence), one output layer (last in the sequence) and zero or more hidden layers. A layer is a sequence of functions denoted 
$$ A^j_w(x) = A(L_{w^j}(x)) = A(w^j\cdot x)
$$
where $A$ is an _activation_ function and $w^j$ is the vector of weights of the $j$-th function in the layer. 

- The input of each function in the input layer is a vector $x$ of values from the predictor variables. 
- The input of each function in any later layer is the vector of output values from all functions in the previous layer. 
- The output of each function in a layer is a single value, which is the value of the activation function. 
- The outputs of the functions in the output layer are the predicted values of the target variables. 
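The flow of values through the layers can be sketched in plain NumPy. This is only an illustration under stated assumptions: the weights are random, the layer sizes (4 inputs, 10 hidden functions, 1 output) are chosen to mirror the model built below, and the sketch omits the bias term that a real Keras `Dense` layer adds. The name `layer_output` is hypothetical.

```python
import numpy as np

def layer_output(W, x, A):
    """Apply one layer: row j of W is the weight vector w^j of function j."""
    return A(W @ x)

def linear(t):
    """Identity ('linear') activation."""
    return t

rng = np.random.default_rng(0)
x = np.array([1.0, 0.25, 0.5, 0.75])               # x_0 = 1 plus three predictors

W_hidden = rng.uniform(-0.05, 0.05, size=(10, 4))  # 10 functions, 4 inputs each
W_output = rng.uniform(-0.05, 0.05, size=(1, 10))  # 1 function, 10 inputs

hidden = layer_output(W_hidden, x, np.tanh)        # one value per function
y_pred = layer_output(W_output, hidden, linear)    # predicted target value
print(hidden.shape, y_pred.shape)
```

Stacking the weight vectors $w^j$ as rows of a matrix lets a whole layer be evaluated with one matrix-vector product.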

There are two parameters that must be specified for each layer:

1. [Initializations](http://keras.io/initializations/): 
    An initialization method for the weights of that layer. 
    Options: `glorot_uniform`, `uniform`, `zero`, ...
    
1. [Activations](http://keras.io/activations/): 
    The activation function for that layer. 
    Options: `linear`, `sigmoid`, `tanh`, `relu`, ...
    

Recall, the input of the activation function is the dot product of the vector of output values, of the previous layer, with the vector of weights; examples of activation functions are the sigmoid (aka logistic function) and the hyperbolic tangent (tanh).

In [82]:
model = Sequential()
model.add(Dense(input_dim  =  4, init       = "uniform",
                output_dim = 10, activation = 'linear'))

#model.add(Dense(input_dim  = 10, init       = "glorot_uniform",
#                output_dim = 10, activation = 'tanh'))

model.add(Dense(input_dim  = 10, init       = "uniform",
                output_dim =  1, activation = 'linear'))



### Compile the model

The two required parameters to compile a model are the optimizer and the objective function.

- [Objectives](http://keras.io/objectives/): The loss or cost function to minimize, which measures the difference between the predicted and actual target values. 
Options: `mean_squared_error`, `mean_absolute_error`, ...
- [Optimizers](http://keras.io/optimizers/): The optimization method to minimize the objective function. Options: `sgd`, `rmsprop`, ...
- [Regularizers](http://keras.io/regularizers/): Add penalties to the objective function, which often constrain the weights. These are set on the layers rather than passed to `compile`. More on this later.
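To make the two objective options named above concrete, here is a small NumPy sketch of what `mean_squared_error` and `mean_absolute_error` compute. The functions are written from their standard definitions, not taken from the Keras source, and the sample values are made up.

```python
import numpy as np

def mean_squared_error(y_actual, y_pred):
    """Average of the squared differences."""
    return float(np.mean((y_actual - y_pred) ** 2))

def mean_absolute_error(y_actual, y_pred):
    """Average of the absolute differences."""
    return float(np.mean(np.abs(y_actual - y_pred)))

y_actual = np.array([3.0, 4.0, 5.0])   # hypothetical actual targets
y_pred = np.array([2.0, 4.0, 7.0])     # hypothetical predictions

mse = mean_squared_error(y_actual, y_pred)    # (1 + 0 + 4) / 3
mae = mean_absolute_error(y_actual, y_pred)   # (1 + 0 + 2) / 3
print(mse, mae)
```

The squared version penalizes the single large error (2) more heavily than the absolute version does.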

In [83]:
model.compile(optimizer='sgd',
              loss     ='mean_squared_error')

### Fit (train) the model on a dataset


In [84]:
model.fit(x_train,  #predictors
          y_train,  #target
          nb_epoch  =100, 
          batch_size=125, 
          verbose   =True)

Epoch 1/100
...

<keras.callbacks.History at 0x128288f98>

### Predict target values for each row of a test dataset


In [87]:
pd.concat([pd.DataFrame(x_train,                            columns=['x0','x1','x2','x3']),
           pd.DataFrame(y_train,                            columns=['y_actual']),
           pd.DataFrame(np.round(model.predict(x_train),3), columns=['y_predicted'])
          ],
          axis=1)

|   | x0 | x1 | x2 | x3 | y_actual | y_predicted |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.00 | 0.00 | 0.00 | 3.000575 | 4.514 |
| 1 | 1.0 | 0.00 | 0.00 | 0.25 | 3.748740 | 4.984 |
| 2 | 1.0 | 0.00 | 0.00 | 0.50 | 4.498506 | 5.454 |
| 3 | 1.0 | 0.00 | 0.00 | 0.75 | 5.249418 | 5.924 |
| 4 | 1.0 | 0.00 | 0.00 | 1.00 | 5.999339 | 6.393 |
| 5 | 1.0 | 0.00 | 0.25 | 0.00 | 3.749023 | 4.982 |
| 6 | 1.0 | 0.00 | 0.25 | 0.25 | 4.502320 | 5.452 |
| 7 | 1.0 | 0.00 | 0.25 | 0.50 | 5.249595 | 5.922 |
| 8 | 1.0 | 0.00 | 0.25 | 0.75 | 5.998899 | 6.392 |
| 9 | 1.0 | 0.00 | 0.25 | 1.00 | 6.748396 | 6.861 |


### Evaluate the model

More on this later.

## Neural networks categorical example

In [71]:
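n_samples = 10
# Two binary predictors and one binary target, all drawn at random
col_data = {'x1' : np.random.randint(2, size=n_samples),
            'x2' : np.random.randint(2, size=n_samples),
            'y'  : np.random.randint(2, size=n_samples)}
df         = pd.DataFrame(col_data)
# Select columns by name; integer lists such as [[0,1]] are not column labels here
data       = np.asarray(df[['x1', 'x2']])    # predictors, shape (n_samples, 2)
labels     = np.asarray(df[['y']])           # target, shape (n_samples, 1)
labels_bin = np_utils.to_categorical(labels) # one-hot version of the target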

In [72]:
model = Sequential()
model.add(Dense(input_dim  =  2, init       = "glorot_uniform",
                output_dim = 10, activation = 'tanh'))

#model.add(Dense(input_dim  = 10, init       = "glorot_uniform",
#                output_dim = 10, activation = 'tanh'))

model.add(Dense(input_dim  = 10, init       = "glorot_uniform",
                output_dim =  1, activation = 'sigmoid'))

In [73]:
# Only the most recent compile call takes effect, so the alternatives
# for other problem types are left commented out.

# for a regression problem
#model.compile(optimizer='sgd',
#              loss='mse')

# for a binary classification problem (the case here)
model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# for a multi classification problem
#model.compile(optimizer='sgd',
#              loss='categorical_crossentropy',
#              metrics=['accuracy'])

In [74]:
model.fit(data,   # predictors
          labels, # target
          nb_epoch=100, 
          batch_size=n_samples, 
          verbose=True)

Epoch 1/100
...

<keras.callbacks.History at 0x126f90dd8>

In [75]:
pd.concat([pd.DataFrame(col_data),
           pd.DataFrame(model.predict_classes(data)    ,columns=['class']),
           pd.DataFrame(np.round(model.predict(data),3),columns=['value'])
          ],
         axis=1)



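|   | x1 | x2 | y | class | value |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0.445 |
| 1 | 0 | 1 | 1 | 1 | 0.588 |
| 2 | 1 | 1 | 0 | 0 | 0.488 |
| 3 | 0 | 0 | 0 | 1 | 0.533 |
| 4 | 1 | 0 | 1 | 0 | 0.445 |
| 5 | 0 | 1 | 0 | 1 | 0.588 |
| 6 | 1 | 0 | 0 | 0 | 0.445 |
| 7 | 0 | 0 | 1 | 1 | 0.533 |
| 8 | 1 | 1 | 1 | 0 | 0.488 |
| 9 | 0 | 0 | 1 | 1 | 0.533 |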


In [76]:
y_pred = model.predict_classes(data, 
                               verbose=False
                              ).reshape(n_samples,)
y_init = labels.reshape(n_samples,)
y_freq = np.bincount(abs(y_pred - y_init))

print("Frequency counts:", y_freq)

Frequency counts: [5 5]



# Keras-introduction to neural networks


Source: 

Designing Machine Learning Systems with Python <br>
by David Julian <br>
Publisher: Packt Publishing <br>
Release Date: April 2016 <br>
ISBN: 9781785882951

See 
- Chapter 5. Linear Models

## Setup

Given one target (dependent) variable $y$, one feature (independent variable) $x$ and two weights $w_0$ and $w_1$, the function $h$ can be used to predict $y$. All we need to do is find the best weights. 

$$ h(x,w_0,w_1) = w_0 + w_1 x
$$

## Batch gradient descent (single variable)

For future reference the partial derivatives of $h(x,w)$ are calculated with respect to $w_0$ and to $w_1$:

$$ \frac{\partial h}{\partial w_0} 
    = \frac{\partial}{\partial w_0} \left( w_0 + w_1 x \right) 
    = 1
$$ 
and
$$ \frac{\partial h}{\partial w_1} 
    = \frac{\partial}{\partial w_1} \left( w_0 + w_1 x \right) 
    = x
$$

The _optimal weights_ are determined by minimizing the cost function: 

$$ \begin{align}
    C(w_0,w_1) = & \frac{1}{2m} \sum_{i=1}^m \left[ h(x^i,w_0,w_1) - y^i \right]^2
\\             = & \frac{1}{2m} \sum_{i=1}^m \left[ w_0 + w_1 x^i - y^i\right]^2
\end{align} $$
where $m$ is the number of samples and $x^i$ and $y^i$ are the independent and dependent variables for the $i$-th sample/row. The cost is quadratic in $w_0$ and $w_1$ and has an absolute minimum, which can be found by following the negative of the gradient with respect to the weights. A function of two variables defines a surface, and its gradient gives the direction of steepest ascent at any given pair of input values. The reverse direction is the direction of steepest descent.

For each pair of values $x^i$ and $y^i$ the 

The partial derivatives of $C$ with respect to $w_0$ and to $w_1$ are calculated:

$$ 
\begin{align}
\frac{\partial C}{\partial w_0} = & \frac{1}{2m} \sum_{i=1}^m 2 \left[ h(x^i,w) - y^i \right]
\\
                                = & \frac{1}{m} \sum_{i=1}^m \left[ h(x^i,w) - y^i \right]
\\
\frac{\partial C}{\partial w_1} = & \frac{1}{2m} \sum_{i=1}^m 2 \left[ h(x^i,w) - y^i \right]
                                     x^i
\\
                                = & \frac{1}{m} \sum_{i=1}^m \left[ h(x^i,w) - y^i \right]
                                     x^i
\\
                                = & \frac{1}{m} \sum_{i=1}^m \left[ w_0 + w_1 x^i - y^i \right]
                                     x^i
\end{align}
$$

The update rules, where $a$ is the learning rate (step size), are applied repeatedly until the weights stop changing:

$$ \begin{align}
    w_j \colon = & w_j - a \frac{\partial C}{\partial w_j} \qquad \text{for $j=0,1$}
\\  w_0 \colon = & w_0 - a \frac{\partial C}{\partial w_0}
\\  w_1 \colon = & w_1 - a \frac{\partial C}{\partial w_1}
\end{align} $$
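The partial derivatives and update rules above translate directly into a short NumPy loop. This is a sketch under assumptions: the noise-free toy data, the learning rate `a=0.5`, the step count and the function name `batch_gradient_descent` are all chosen for illustration.

```python
import numpy as np

def batch_gradient_descent(x, y, a=0.5, steps=1000):
    """Minimize C(w_0, w_1) for h(x) = w_0 + w_1 * x using all samples per step."""
    w0, w1 = 0.0, 0.0
    m = len(x)
    for _ in range(steps):
        error = w0 + w1 * x - y            # h(x^i, w) - y^i for every sample
        grad_w0 = error.sum() / m          # dC/dw_0
        grad_w1 = (error * x).sum() / m    # dC/dw_1
        # Update both weights simultaneously from the same gradients.
        w0, w1 = w0 - a * grad_w0, w1 - a * grad_w1
    return w0, w1

x = np.linspace(0.0, 1.0, 5)
y = 3.0 + 2.0 * x                          # targets generated with w_0 = 3, w_1 = 2

w0, w1 = batch_gradient_descent(x, y)
print(w0, w1)
```

It is called "batch" gradient descent because every update uses the whole dataset. A learning rate that is too large makes the updates diverge, and one that is too small makes convergence slow.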

## Logistic regression

The logistic regression prediction function is $h_W(x) = s(W^T x)$, where $s(t) = \frac{1}{1 + e^{-t}}$ is the logistic function,
which maps
- negative numbers to the range $(0,0.5)$
- positive numbers to the range $(0.5,1)$
- zero $0$ to $0.5$

For more on the function $s$ see [Sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) at Wikipedia.

The squared-error cost, for a single pair of values $x, y$,
$$ C(x,y) = \frac{1}{2} \left[ h_W(x) - y \right]^2
$$
does not work well here, because with the logistic $h_W$ it is no longer convex in $W$. The following cost is used instead:
$$ \begin{align}
    C(x,y) = & - \log(h_W(x))                          & \text{if $y=1$}
\\  C(x,y) = & - \log(1-h_W(x))                        & \text{if $y=0$}
\\  C(x,y) = & - y \log(h_W(x)) - (1-y) \log(1-h_W(x)) & 
\end{align} $$

Now average this over all $m$ samples to obtain a cost function for the whole dataset:

$$ \begin{align}
    C(W) = & \frac{-1}{m} \sum_{i=1}^m \left[ y^i \log(h_W(x^i)) + (1-y^i) \log(1-h_W(x^i))\right] 
\end{align} $$
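As a minimal sketch, the mean of the per-sample cost $C(x,y)$ over a dataset can be computed in NumPy as follows. The toy data, the candidate weights and the name `cross_entropy_cost` are assumptions made for illustration.

```python
import numpy as np

def s(t):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy_cost(W, X, y):
    """Mean over samples of -y*log(h) - (1-y)*log(1-h), with h = s(W . x)."""
    h = s(X @ W)
    return float(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))

# Hypothetical dataset: first column is the constant 1, second is the feature.
X = np.array([[1.0, -2.0],
              [1.0, -1.0],
              [1.0,  1.0],
              [1.0,  2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

cost_zero = cross_entropy_cost(np.zeros(2), X, y)           # h = 0.5 everywhere
cost_good = cross_entropy_cost(np.array([0.0, 3.0]), X, y)  # weights that separate the classes
print(cost_zero, cost_good)
```

With all-zero weights every prediction is $0.5$ and the cost equals $\log 2$; weights that push the predictions toward the correct labels give a smaller cost.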

- https://getpocket.com/a/read/824013193
- http://keras.io/activations/
- http://keras.io/layers/core/
- http://keras.io/layers/about-keras-layers/

Activation functions
- happen after the matrix product of weights with node or input values
- sigmoid is the logistic function above
- there are other options
- tanh 
- Rectified Linear Unit (ReLU) - zero for negative input, otherwise the input itself
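A small NumPy sketch of the first and last points in this list: the activation is applied after the matrix product of weights with input values, and ReLU replaces negative results with zero. The weight matrix and input are made up for illustration.

```python
import numpy as np

def relu(t):
    """Rectified Linear Unit: zero for negative input, otherwise the input itself."""
    return np.maximum(0.0, t)

W = np.array([[1.0, -1.0],
              [2.0,  0.5]])      # hypothetical weights, one row per node
x = np.array([1.0, 3.0])         # hypothetical input values

pre_activation = W @ x           # matrix product happens first
activated = relu(pre_activation) # activation is applied element-wise afterwards
print(pre_activation, activated)
```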

Models have:
- Optimization methods
- Loss/cost functions

Layers (?) have:
- Activation
- Init
- Weights 
- Regularization

From [CS231n Convolutional Neural Networks for Visual Recognition](https://getpocket.com/a/read/824013193)  
    
"To give you some context, modern Convolutional Networks contain on orders of 100 million parameters and are usually made up of approximately 10 to 20 layers (hence deep learning)."