# Introductory application of neural networks to data analysis

In this notebook we give a condensed introduction to the application of feed-forward networks to a common data set, without discussing theory and motivation in depth. We try different network complexities, optimizers and activation functions to get a feeling for how they affect the minimization of the loss (also called error or cost) function. Regularization techniques are not discussed here: we do not train for many epochs, and the overall scores are not high enough to justify their application against overfitting. For further discussion of different network architectures, we refer to the repositories [here](https://github.com/andreaspts/DL_DEEPNET_vs_CONVNET_on_MNIST) and [here](https://github.com/andreaspts/DL_REC_vs_DEEP_and_CONVNN_on_TEMPERATURE_SERIES).

## Single neuron

In [None]:
#import the logistic regression model
from sklearn.linear_model import LogisticRegression

In [2]:
#define data 
X = [[50], [60], [70], [20], [10], [30]]

Y = [1,1,1,0,0,0]

In [3]:
model = LogisticRegression(C = 100000) #C is the inverse regularization strength, so a large C weakens regularization
model.fit(X,Y)



LogisticRegression(C=100000, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

In [4]:
model.predict([[44]])

array([1])

In [5]:
model.predict_proba([[44]])

array([[0.08358881, 0.91641119]])
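
Logistic regression with one input feature is a single neuron: a weighted input plus a bias, passed through a sigmoid. The following minimal sketch illustrates this. The parameters `w` and `b` are hypothetical values chosen by hand, not the ones fitted by `LogisticRegression` above, so the printed probability will differ from the output of `predict_proba`.

```python
import math

def neuron(x, w, b):
    """Single neuron: weighted input plus bias, passed through a sigmoid."""
    z = w * x + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters, chosen by hand and NOT the values fitted above.
w, b = 0.25, -10.0

boundary = -b / w          # input at which the neuron outputs exactly 0.5
p_44 = neuron(44.0, w, b)  # probability of class 1 for x = 44
print(boundary, p_44)
```

Inputs above the boundary are assigned to class 1, inputs below it to class 0, which is the same decision rule that `model.predict` applies to the fitted parameters.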

## Simple neural network on fashion mnist

In the following we apply simple feed-forward neural networks to the Fashion-MNIST data set to classify item categories. The network architectures are implemented conveniently via Keras. The data set can be retrieved from [here](https://github.com/zalandoresearch/fashion-mnist). A scoreboard comparing different ML methods, from classical scikit-learn algorithms to neural networks, is found [here](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/).

In [6]:
import tensorflow
import keras

Using TensorFlow backend.


In [7]:
import gzip 
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [8]:
def open_images(filename):
    with gzip.open(filename, "rb") as file:
        data = file.read()
        return np.frombuffer(data, dtype = np.uint8, offset = 16)\
            .reshape(-1, 28, 28)\
            .astype(np.float32)
    
def open_labels(filename):
    with gzip.open(filename, "rb") as file:
        data = file.read()
        return np.frombuffer(data, dtype = np.uint8, offset = 8)
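
The offsets in the two helper functions skip the header of the IDX file format: image files start with four big-endian 32-bit integers (magic number 2051, image count, rows, columns, 16 bytes in total), label files with two (magic number 2049, item count, 8 bytes in total). As a minimal sketch, the following builds a tiny synthetic image buffer in memory and parses it. The 2 images of 3x3 pixels are made-up data, not part of Fashion-MNIST.

```python
import struct
import numpy as np

# Build a tiny synthetic IDX image buffer: 2 images of 3x3 pixels (made-up data).
header = struct.pack(">IIII", 2051, 2, 3, 3)   # magic number, count, rows, columns
pixels = np.arange(18, dtype=np.uint8).tobytes()
raw = header + pixels

# Parse the 16-byte header: four big-endian unsigned 32-bit integers.
magic, count, rows, cols = struct.unpack(">IIII", raw[:16])

# Skip the header in the same way open_images does with offset = 16.
images = np.frombuffer(raw, dtype=np.uint8, offset=16).reshape(count, rows, cols)
print(magic, images.shape)
```

Reading the dimensions from the header instead of hard-coding 28 x 28 would make the loader independent of the image size, at the cost of a few extra lines.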

### One category 

In [None]:
#load data and define variables
X_train = open_images("train-images-idx3-ubyte.gz")
Y_train = open_labels("train-labels-idx1-ubyte.gz")

Y_train = (Y_train == 0) #just checking for the t-shirts

X_test = open_images("t10k-images-idx3-ubyte.gz")
Y_test = open_labels("t10k-labels-idx1-ubyte.gz")

Y_test = (Y_test == 0) #just checking for the t-shirts

In [None]:
X_train.shape

In [None]:
X_train[1].shape

In [None]:
plt.imshow(X_train[100], cmap = "gray_r")
plt.show()

In [None]:
Y_train.shape

In [None]:
Y_train

In [None]:
from keras import layers
from keras import models

In [None]:
#define model

model = models.Sequential()

model.add(layers.Dense(100, activation = "sigmoid", input_shape = (28 * 28,))) # we have 28*28 pixels
model.add(layers.Dense(1, activation = "sigmoid"))
model.summary()
model.compile(optimizer = "sgd", loss = "binary_crossentropy", metrics = ['accuracy'])

Stochastic gradient descent is used as the optimizer here. Further below we use improved variants of gradient descent (gradient descent with momentum, rmsprop or adam) that try to smooth out oscillations during the descent.
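
To illustrate the idea behind momentum, here is a minimal sketch on a toy problem. It assumes a hypothetical two-dimensional quadratic loss with very different curvatures along the two axes and uses full gradients rather than minibatches; the learning rate, momentum value and step count are arbitrary choices, and rmsprop and adam are not covered.

```python
import numpy as np

curvature = np.array([1.0, 25.0])  # hypothetical ill-conditioned quadratic bowl

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def descend(momentum, lr=0.03, steps=200):
    w = np.array([1.0, 1.0])   # arbitrary starting point
    v = np.zeros_like(w)       # velocity: running memory of past gradients
    for _ in range(steps):
        v = momentum * v - lr * grad(w)
        w = w + v
    return loss(w)

loss_plain = descend(momentum=0.0)      # reduces to plain gradient descent
loss_momentum = descend(momentum=0.9)
print(loss_plain, loss_momentum)
```

With `momentum=0.0` the update reduces to plain gradient descent, so the two calls differ only in whether past gradients are accumulated in the velocity.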

In [28]:
#illustration of different optimizers
from IPython.display import Image
Image(url='gdanimations.gif')  

In [None]:
#train model
#train on minibatches of 1000 samples; each minibatch yields one weight update (thus 60 updates per epoch)
history = model.fit(X_train.reshape(60000, 784), Y_train, epochs = 10, batch_size = 1000)
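
The number of weight updates follows directly from the data set size and the batch size. The sketch below shows one way an epoch can be split into minibatches; Keras handles this internally, and the shuffling scheme and seed used here are assumptions for illustration.

```python
import numpy as np

n_samples, batch_size, epochs = 60000, 1000, 10   # values used in the fit call

rng = np.random.default_rng(0)           # seed chosen arbitrarily
order = rng.permutation(n_samples)       # shuffled sample indices for one epoch

# Each slice of indices is one minibatch and triggers one weight update.
batches = [order[i:i + batch_size] for i in range(0, n_samples, batch_size)]

updates_per_epoch = len(batches)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)
```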

In [None]:
#manually compare the prediction for a training sample with its true label
plt.imshow(X_train[0], cmap = "gray_r")
print(Y_train[0])
print(model.predict(X_train[0].reshape(1, 784)))

In [None]:
#manually compare the prediction for another training sample with its true label
plt.imshow(X_train[1],cmap = "gray_r")
print(Y_train[1])
print(model.predict(X_train[1].reshape(1, 784)))

In [None]:
#compute the training accuracy manually: round the sigmoid outputs to 0 or 1 and compare with the labels
Y_train_pred = model.predict(X_train.reshape(60000, 784))
np.mean(np.round(Y_train_pred).reshape(-1) == Y_train)

In [None]:
#check accuracy on training set via keras --> use output from fitting process
model.evaluate(X_train.reshape(60000, 784), Y_train)

In [None]:
print(model.metrics_names)

In [None]:
#check accuracy on test set via keras --> use output from fitting process
model.evaluate(X_test.reshape(10000, 784), Y_test)

In [None]:
# list all data in history
print(history.history.keys())

In [None]:
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

### Many categories

To classify all ten categories, we modify the final activation function, the output dimension and the loss function.

In [12]:
#load data and define variables
X_train = open_images("train-images-idx3-ubyte.gz")
Y_train = open_labels("train-labels-idx1-ubyte.gz")

X_test = open_images("t10k-images-idx3-ubyte.gz")
Y_test = open_labels("t10k-labels-idx1-ubyte.gz")

In [13]:
from keras import layers
from keras import models
from keras.utils import to_categorical

In [14]:
#use one-hot-encoding
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
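
One-hot encoding turns each integer label into a vector with a single 1 at the position of the class. As a minimal sketch of what `to_categorical` produces, the same encoding can be written with plain NumPy; the three labels below are made-up examples.

```python
import numpy as np

labels = np.array([9, 0, 3])   # made-up integer class labels
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]

# argmax inverts the encoding and recovers the integer labels.
recovered = np.argmax(one_hot, axis=1)
print(one_hot.shape, recovered)
```

The `argmax` step is the same operation that is applied to `Y_test` and `Y_pred` before building the confusion matrix.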

In [None]:
#define model (adapt to categorical situation)

model = models.Sequential()

model.add(layers.Dense(2048, activation = "sigmoid", input_shape = (28 * 28,))) # we have 28*28 pixels
model.add(layers.Dense(256, activation = "sigmoid")) # input shape is inferred from the previous layer
model.add(layers.Dense(10, activation = "sigmoid"))
model.summary()
model.compile(optimizer = "adam", loss = "categorical_crossentropy", metrics = ['accuracy'])

In [None]:
#train model
#train on minibatches of 1000 samples; each minibatch yields one weight update (thus 60 updates per epoch)
history = model.fit(X_train.reshape(60000, 784), Y_train, epochs = 10, batch_size = 1000)

In [None]:
#check accuracy on training set via keras --> use output from fitting process
model.evaluate(X_train.reshape(60000, 784), Y_train)

In [None]:
#check accuracy on test set via keras --> use output from fitting process
model.evaluate(X_test.reshape(10000, 784), Y_test)

In [None]:
# list all data in history
print(history.history.keys())

In [None]:
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

In [None]:
#model evaluation via confusion matrix (helps to see if classes are well discerned)
Y_pred = model.predict(X_test.reshape(-1, 784))

#check for which category the highest estimate was produced for all examples in the test set
np.argmax(Y_pred, axis = 1)

In [None]:
from pandas_ml import ConfusionMatrix

In [None]:
ConfusionMatrix(np.argmax(Y_test, axis = 1), np.argmax(Y_pred, axis = 1))

For example, an entry of 27 in row "0", column "2" is to be read as: category "2" was predicted 27 times when the actual category was "0". In this way, the confusion matrix shows how well our model reproduces reality. Ideally, we would like to have a model with vanishing off-diagonal terms.
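
The counting behind a confusion matrix can be sketched in a few lines of NumPy. This is an illustration of the principle, not the implementation used by `pandas_ml`; the helper name `confusion_matrix` and the labels below are made up for this example.

```python
import numpy as np

def confusion_matrix(actual, predicted, num_classes):
    """Count how often each (actual, predicted) pair of labels occurs."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(cm, (actual, predicted), 1)
    return cm

# Made-up labels for three classes, for illustration only.
actual = np.array([0, 0, 1, 2, 2, 2])
predicted = np.array([0, 2, 1, 2, 2, 0])

cm = confusion_matrix(actual, predicted, num_classes=3)
print(cm)
```

Rows correspond to the actual category and columns to the predicted one, matching the layout described above.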

#### Introducing the softmax as final layer activation function
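
With a sigmoid in the final layer, each output unit is squashed independently, so the ten outputs do not form a probability distribution. Softmax normalizes the outputs so that they are positive and sum to 1. A minimal sketch of the difference, using made-up pre-activations for three classes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])   # made-up pre-activations for three classes

print(sigmoid(logits).sum())   # independent scores, generally not summing to 1
print(softmax(logits).sum())   # normalized scores, summing to 1
```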

In [26]:
#define model (adapt to categorical situation)

model = models.Sequential()

model.add(layers.Dense(2048, activation = "sigmoid", input_shape = (28 * 28,))) # we have 28*28 pixels
model.add(layers.Dense(256, activation = "sigmoid")) # input shape is inferred from the previous layer
model.add(layers.Dense(10, activation = "softmax"))
model.summary()
model.compile(optimizer = "rmsprop", loss = "categorical_crossentropy", metrics = ['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_7 (Dense)              (None, 2048)              1607680   
_________________________________________________________________
dense_8 (Dense)              (None, 256)               524544    
_________________________________________________________________
dense_9 (Dense)              (None, 10)                2570      
Total params: 2,134,794
Trainable params: 2,134,794
Non-trainable params: 0
_________________________________________________________________
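
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit. A short sketch of this arithmetic (the helper `dense_params` is introduced here for illustration only):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

layer_sizes = [28 * 28, 2048, 256, 10]   # input size followed by the three Dense layers

counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

The three counts and their sum agree with the `Param #` column and the `Total params` line of the summary.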


In [27]:
#train model
#train on minibatches of 1000 samples; each minibatch yields one weight update (thus 60 updates per epoch)
history = model.fit(X_train.reshape(60000, 784), Y_train, epochs = 10, batch_size = 1000)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In our runs, the tanh and sigmoid activations worked better on this problem than relu (the tanh and relu runs are not shown here).

In [33]:
#check accuracy on training set via keras --> use output from fitting process
model.evaluate(X_train.reshape(60000, 784), Y_train)



[0.40780154851277667, 0.8429333333333333]

In [29]:
#check accuracy on test set via keras --> use output from fitting process
model.evaluate(X_test.reshape(10000, 784), Y_test)



[0.44668299560546876, 0.8317]

In [30]:
from pandas_ml import ConfusionMatrix

In [31]:
#model evaluation via confusion matrix (helps to see if classes are well discerned)
Y_pred = model.predict(X_test.reshape(-1, 784))

#check for which category the highest estimate was produced for all examples in the test set
np.argmax(Y_pred, axis = 1)

array([9, 2, 1, ..., 8, 1, 5])

In [32]:
ConfusionMatrix(np.argmax(Y_test, axis = 1), np.argmax(Y_pred, axis = 1))

Predicted    0    1    2     3     4    5    6     7     8    9  __all__
Actual                                                                  
0          779    3   20    87    13    0   79     0    19    0     1000
1            0  959    4    26     9    0    0     0     2    0     1000
2            8    1  608    15   325    0   28     0    15    0     1000
3           17   10    8   875    67    0   19     0     4    0     1000
4            0    0   44    24   917    0   11     0     4    0     1000
5            0    0    0     2     0  934    0    46     4   14     1000
6          151    3  116    55   262    0  387     0    26    0     1000
7            0    0    0     0     0   12    0   952     0   36     1000
8            0    1    9     4     6    2    4     4   970    0     1000
9            0    0    0     1     0    9    0    53     1  936     1000
__all__    955  977  809  1089  1599  957  528  1055  1045  986    10000

The training and test scores are close to each other (accuracies of about 0.843 and 0.832), which indicates that the network is not overfitting and that a larger network capacity could capture more of the statistical intricacies of the data set.

If the two scores were far apart while the training score was good, the model complexity would be too large and we would observe overfitting.

If both scores were bad, more data could help.