## <font color=darkorange> Logistic regression & Feed-forward neural networks</font>

In [None]:
# ignore warnings for better clarity (may not be the best thing to do)...
import warnings
warnings.filterwarnings('ignore')

In [None]:
from random import randint
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import activations
import numpy as np
import matplotlib.pyplot as plt
#from keras import backend as K
from keras.datasets import cifar100

CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton: https://www.cs.toronto.edu/~kriz/cifar.html
The CIFAR-100 dataset used here contains 60000 colour images of size 32x32 with 3 channels, divided into 100 classes (50000 training images and 10000 test images).

In [None]:
# Number of classes
num_classes = 100
# input image dimensions
img_rows, img_cols = 32, 32

# cifar100 data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = cifar100.load_data()

input_shape = (img_rows, img_cols, 3)

print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

print('input dimension:', x_train.shape[1:])

print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

In [None]:
# normalize the input data to obtain entries in [0,1]
x_train = x_train/255
x_test  = x_test/255

In [None]:
# display one training image chosen at random
# randint includes both endpoints, hence the -1 to stay within bounds
plt.imshow(x_train[randint(0, x_train.shape[0] - 1)])

In [None]:
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
y_train[:1]
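To make the encoding above concrete, here is a minimal NumPy sketch of one-hot encoding on toy labels. The helper `one_hot` is a hypothetical name introduced for illustration; it is not the Keras implementation, only an equivalent construction under the assumption that labels are integers in $\{0,\ldots,M-1\}$.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels to rows of the identity matrix (one-hot vectors)."""
    # CIFAR labels come with shape (n, 1), so flatten them first
    labels = np.asarray(labels).reshape(-1)
    return np.eye(num_classes, dtype="float32")[labels]

# three toy labels with 3 classes
encoded = one_hot([[0], [2], [1]], num_classes=3)
print(encoded)
```

Each row contains a single 1 in the column of the true class, which is the format expected by the categorical cross-entropy loss.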

## <font color=darkred> Softmax regression </font>
Logistic regression can be extended to classify data into more than two groups. Softmax regression provides a model for the probability that an input $X$ is associated with each group. It is assumed that the probability of belonging to class $k\in\{1,\ldots,M\}$ can be expressed as
\begin{equation*}
\mathbb{P}(Y = k| X) = \frac{\exp(\langle w_k,X \rangle + b_k)}{\sum_{\ell=1}^{M}\exp(\langle w_\ell,X \rangle + b_\ell)} = p_k(X)\,,
\end{equation*}
where $w_\ell \in \mathbb{R}^d$ and $b_\ell \in \mathbb{R}$ are the model `weights` and `intercepts` of class $\ell$.
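The formula for $p_k(X)$ can be sketched in a few lines of NumPy. The function name `softmax_probabilities`, the toy dimensions and the random parameters are assumptions made for illustration; the subtraction of the row maximum is a standard numerical-stability trick that does not change the probabilities.

```python
import numpy as np

def softmax_probabilities(X, W, b):
    """Return p_k(X) for each row of X.

    X: (n, d) inputs, W: (M, d) weights (row k is w_k), b: (M,) intercepts.
    """
    scores = X @ W.T + b                         # <w_k, X> + b_k for every class k
    scores -= scores.max(axis=1, keepdims=True)  # shift for stability, p_k is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# toy example: n = 4 inputs in dimension d = 5, M = 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
P = softmax_probabilities(X, W, b)
print(P.sum(axis=1))  # each row sums to 1 up to floating-point error
```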


To estimate these unknown parameters, a maximum likelihood approach is used, as in the logistic regression setting. The loss function is then the negative log-likelihood, also known as the categorical cross-entropy (see also the section on gradient-based methods).
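As a sketch of what this loss looks like, assume $n$ independent observations $(X_i, Y_i)_{1\leqslant i\leqslant n}$ (this notation is introduced here for illustration). The negative log-likelihood, normalized by $n$, can then be written as

\begin{equation*}
\ell_n(w,b) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{M} \mathbf{1}_{\{Y_i = k\}}\,\log p_k(X_i)\,.
\end{equation*}

With one-hot encoded labels, the indicator $\mathbf{1}_{\{Y_i = k\}}$ is the $k$-th entry of the label vector of observation $i$, so only the log-probability of the true class contributes to each term.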

In [None]:
# Create a model to which layers are added sequentially
model = Sequential()
# Flatten replaces each 32 x 32 x 3 array by a 3072-dimensional vector
# This is necessary before applying a fully connected layer (Dense object) to images
# As it is the first layer, the input shape must be specified
model.add(Flatten(input_shape=input_shape, name='flatten'))
# add one dense (fully connected) layer with softmax activation function
model.add(Dense(num_classes, activation='softmax', name='dense_softmax'))

# "compile" this model, 
model.compile(
    # specify the loss as the cross-entropy, i.e. the negative log-likelihood.
    loss=keras.losses.categorical_crossentropy,
    # choose the gradient-based method to estimate the parameters
    # see https://keras.io/optimizers/ for an overview of the different options
    # see also section 2 on gradient-based methods.
    optimizer=keras.optimizers.Adagrad(),
    # metric monitored on the training data and on the validation data passed to fit
    metrics=['accuracy']
)
model.summary()
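The Adagrad optimizer chosen above rescales each coordinate of the gradient by the square root of its accumulated squared gradients. The following is a minimal sketch of this update on a toy quadratic objective. The function `adagrad_minimize`, the learning rate, the number of steps and the objective are assumptions made for illustration; the Keras implementation differs in details such as default hyperparameters and accumulator initialization.

```python
import numpy as np

def adagrad_minimize(grad, theta0, learning_rate=0.5, n_steps=500, eps=1e-8):
    """Run Adagrad updates from theta0 using the gradient function grad."""
    theta = np.asarray(theta0, dtype=float).copy()
    accum = np.zeros_like(theta)  # running sum of squared gradients, per coordinate
    for _ in range(n_steps):
        g = grad(theta)
        accum += g ** 2
        # coordinates with large past gradients get smaller steps
        theta -= learning_rate * g / (np.sqrt(accum) + eps)
    return theta

# toy objective f(theta) = ||theta - target||^2, with gradient 2 (theta - target)
target = np.array([3.0, -1.0])
theta_hat = adagrad_minimize(lambda t: 2.0 * (t - target), theta0=np.zeros(2))
print(theta_hat)  # close to target
```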

In [None]:
# number of samples used for each update of the parameters (each gradient computation)
batch_size = 64
# number of passes over the training data
epochs = 50
# train the model, i.e. estimate the unknown parameters by minimizing the loss function
# with a gradient-based algorithm (here Adagrad).
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
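The values of `batch_size` and `epochs` determine how many gradient updates are performed during training. The short computation below is a sketch assuming the 50000 CIFAR-100 training images and that the last, smaller batch of each epoch is kept; the variable names `n_train`, `steps_per_epoch` and `total_updates` are introduced for illustration.

```python
import math

n_train, batch_size, epochs = 50000, 64, 50

# one parameter update per mini-batch, the last batch may be smaller
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # 782 39100
```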

In [None]:
plt.figure(figsize=(7, 5))
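# the history key is 'acc' in older Keras versions and 'accuracy' in more recent ones
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.epoch, history.history[acc_key], lw=1, label='Train')
plt.plot(history.epoch, history.history['val_' + acc_key], lw=1, label='Test')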
plt.legend(fontsize=14)
plt.title('Accuracy of softmax regression', fontsize=16)
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Accuracy', fontsize=14)
plt.tight_layout()

## <font color=darkred> Feed-Forward Neural Network (FFNN) or Multilayer Perceptron (MLP)</font>
The softmax regression of the previous section is a linear model with 307300 parameters ($3072 \times 100$ weights and $100$ intercepts). It might be too simple for our classification task. The idea underlying neural networks is to stack layers of "neurons", each performing a linear transformation of its input (depending on a weight matrix and a bias vector) followed by a nonlinear activation function, in order to design more flexible models with additional parameters.
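The parameter count quoted above can be checked by hand, and the same arithmetic applies to the network with one hidden layer of 256 units defined in the next cell. The helper `dense_params` is a hypothetical name introduced for this sketch; the results should match the totals reported by `model.summary()`.

```python
d, h, M = 3072, 256, 100  # input dimension, hidden units, number of classes

def dense_params(n_in, n_out):
    """Number of parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

softmax_params = dense_params(d, M)
ffnn_params = dense_params(d, h) + dense_params(h, M)
print(softmax_params)  # 307300
print(ffnn_params)     # 812388
```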

In [None]:
# Create the graph for a fully connected feed-forward neural network with one hidden layer 
# with 256 units and a relu activation function. 
model_ffnn = Sequential()

model_ffnn.add(Flatten(input_shape=input_shape))

model_ffnn.add(Dense(256, activation='relu'))

model_ffnn.add(Dense(num_classes, activation='softmax'))

model_ffnn.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=keras.optimizers.Adagrad(),
    metrics=['accuracy']
)

model_ffnn.summary()

In this model the input data $X$ lies in $\mathbb{R}^d$ with $d = 3072$.

A hidden layer is built in $\mathbb{R}^h$ with $h = 256$.

\begin{align*}
z^\mathrm{hid}(X) &= W^hX+b^h\,,\\
h(X) &= \mathrm{ReLU}(z^\mathrm{hid}(X))\,.
\end{align*}

Here $W^h\in\mathbb{R}^{h\times d}$, $b^h\in\mathbb{R}^h$, $h(X)\in\mathbb{R}^h$ and, for all $1\leqslant j\leqslant h$, $h(X)_j = \max(0,z^\mathrm{hid}(X)_j)$.

The output layer is built in $\mathbb{R}^M$ with $M = 100$.

\begin{align*}
z^\mathrm{out}(X) &= W^oh(X)+b^o\,,\\
f_{\theta}(X) &= \mathrm{Softmax}(z^\mathrm{out}(X))\,.
\end{align*}

$W^o\in\mathbb{R}^{M\times h}$, $b^o\in\mathbb{R}^M$.

$\theta = (W^h,b^h,W^o,b^o)$.

$f_{\theta}(X)$ is a vector in $\mathbb{R}^M$ where each entry is the probability that $X$  belongs to the corresponding class.
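The forward pass described above can be sketched in NumPy for a single input. The function names and the random parameter values are assumptions made for illustration (a trained network would use the estimated weights). The output layer is applied to the hidden representation $h(X)$, as the shape $M\times h$ of $W^o$ requires.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, W_h, b_h, W_o, b_o):
    """Compute f_theta(x) for one input vector x of dimension d."""
    z_hid = W_h @ x + b_h       # hidden pre-activation, in R^h
    hidden = relu(z_hid)        # h(x), in R^h
    z_out = W_o @ hidden + b_o  # output pre-activation, in R^M
    return softmax(z_out)       # class probabilities, in R^M

d, h, M = 3072, 256, 100
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.01, size=(h, d))
b_h = np.zeros(h)
W_o = rng.normal(scale=0.01, size=(M, h))
b_o = np.zeros(M)

x = rng.random(d)  # stands for a flattened image with entries in [0, 1)
probs = forward(x, W_h, b_h, W_o, b_o)
print(probs.shape, probs.sum())
```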

In [None]:
batch_size = 64
epochs = 50
history = model_ffnn.fit(x_train, y_train,
                         batch_size=batch_size,
                         epochs=epochs,
                         verbose=1,
                         validation_data=(x_test, y_test))
score = model_ffnn.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

In [None]:
plt.figure(figsize=(7, 5))
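# the history key is 'acc' in older Keras versions and 'accuracy' in more recent ones
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.epoch, history.history[acc_key], lw=1, label='Train')
plt.plot(history.epoch, history.history['val_' + acc_key], lw=1, label='Test')
plt.legend(fontsize=14)
plt.title('Accuracy of the feed-forward neural network', fontsize=16)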
plt.xlabel('Epoch', fontsize=14)
plt.ylabel('Accuracy', fontsize=14)
plt.tight_layout()

The number of parameters is much larger than in the softmax setting (about 2.6 times as many), while the performance only slightly improves. See the next course on convolutional networks, which provide models better suited to image data.

<font color=darkred> Use cross-validation to select the best value of $h$, for instance in $\{32,64,128,256,512\}$.</font>
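One possible way to organize this exercise is sketched below with plain NumPy. The functions `k_fold_indices` and `cross_validate` and the callback `fit_and_score` are hypothetical names introduced for illustration. The callback is assumed to train a network with `h` hidden units on the training indices and to return its accuracy on the validation indices; here it is replaced by a dummy score so that the sketch runs on its own.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Split shuffled indices 0..n_samples-1 into n_folds disjoint validation folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def cross_validate(candidates, n_samples, fit_and_score, n_folds=5, seed=0):
    """Return the candidate with the best mean validation score and all mean scores."""
    folds = k_fold_indices(n_samples, n_folds, seed)
    mean_scores = {}
    for h in candidates:
        scores = []
        for i, val_idx in enumerate(folds):
            # train on every fold except the i-th one, validate on the i-th one
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            scores.append(fit_and_score(h, train_idx, val_idx))
        mean_scores[h] = float(np.mean(scores))
    best_h = max(mean_scores, key=mean_scores.get)
    return best_h, mean_scores

def dummy_fit_and_score(h, train_idx, val_idx):
    # placeholder score peaking at h = 128, NOT a real model evaluation
    return -abs(h - 128)

best_h, mean_scores = cross_validate([32, 64, 128, 256, 512], 1000, dummy_fit_and_score)
print(best_h)
```

In the notebook, `fit_and_score` would rebuild the model with `Dense(h, activation='relu')`, fit it on `x_train[train_idx]` and `y_train[train_idx]`, and evaluate it on `x_train[val_idx]` and `y_train[val_idx]`, so that the test set is kept for the final assessment only.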