# Warm up

The code below showcases a convolutional network in Keras. It was designed to classify 100x100 RGB images into 10 classes.
This network... quite frankly, it sucks. Can you guess what the problem is? Is there just one problem?

In [None]:
import numpy as np
import tensorflow as tf
from matplotlib import pyplot
# Use the Keras bundled with TensorFlow throughout; the old module paths
# keras.layers.normalization and keras.layers.advanced_activations no longer exist.
from tensorflow import keras
from tensorflow.keras import layers as L
from tensorflow.keras import initializers as init
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (BatchNormalization, LeakyReLU, Conv2D, MaxPooling2D,
                                     Flatten, Dense, Activation, Dropout)
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
def get_model():
  
  model = tf.keras.Sequential([
    # data_augmentation,
    L.InputLayer([28, 28, 1]),
    L.Conv2D(filters=128, kernel_size=(2, 2), activation="relu", 
                 kernel_initializer=init.RandomNormal(), padding='same'), 
    L.Conv2D(filters=256, kernel_size=(2,2), activation="relu",
                 kernel_initializer=init.RandomNormal(), padding='same'),
    L.MaxPool2D(pool_size=(6, 6)),
    # L.MaxPool2D(pool_size=(2, 2)),
    L.Flatten(),
    L.Dropout(rate=0.2),
    L.Dense(units=128,activation="relu"),
    L.Dense(10,activation="softmax")

  ])

  return model

In [None]:
model = get_model()
# model.summary()

In [None]:
from keras.datasets import mnist

(Xtrain, ytrain), (Xtest, ytest) = tf.keras.datasets.mnist.load_data(
    path='mnist.npz'
)
Xtrain.shape

In [None]:
for i in range(9):
  pyplot.subplot(330+1+i)
  pyplot.imshow(Xtrain[i], cmap = pyplot.get_cmap('gray'))
pyplot.show()

In [None]:
XtrainNorm = Xtrain.astype('float32')
XtestNorm = Xtest.astype('float32')


XtrainNorm = np.expand_dims(XtrainNorm/255.0, axis=3)
XtestNorm = np.expand_dims(XtestNorm/255.0, axis=3)


In [None]:
epochs = 15
batch_size=32
optimizer = keras.optimizers.Adam(learning_rate=0.001)

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=4)
model.compile(optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history_1 = model.fit(XtrainNorm, ytrain, validation_data=(XtestNorm, ytest), callbacks=[callback], batch_size=batch_size, epochs=epochs, shuffle=True)

In [None]:
import matplotlib.pyplot as plt
plt.plot(history_1.history['accuracy'], label='accuracy')
plt.plot(history_1.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.988, 1])
plt.legend(loc='lower right')

test_loss, test_acc = model.evaluate(XtestNorm, ytest, verbose=1)

* [Conv2D](https://keras.io/layers/convolutional/#conv2d) - performs convolution:
    * filters: number of output channels;
    * kernel_size: an integer or tuple/list of 2 integers, specifying the width and height of the 2D convolution window;
    * padding: padding="same" adds zero padding to the input, so that the output has the same width and height, padding='valid' performs convolution only in locations where kernel and the input fully overlap;
    * activation: "relu", "tanh", etc.
    * input_shape: shape of input.
* [MaxPooling2D](https://keras.io/layers/pooling/#maxpooling2d) - performs 2D max pooling.
* [Flatten](https://keras.io/layers/core/#flatten) - flattens the input, does not affect the batch size.
* [Dense](https://keras.io/layers/core/#dense) - fully-connected layer.
* Activation - applies an activation function.
* [LeakyReLU](https://keras.io/layers/advanced-activations/#leakyrelu) - applies leaky relu activation.
* [Dropout](https://keras.io/layers/core/#dropout) - applies dropout.
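
To make the `padding` and pooling descriptions above concrete, here is a minimal sketch in plain Python (no Keras needed) that traces the spatial size through a stack like the one in `get_model()`. The helper names `conv_output_size` and `pool_output_size` are made up for this illustration, and the pooling helper assumes the `MaxPool2D` defaults (stride equal to the pool size, `padding='valid'`).

```python
import math


def conv_output_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        # Zero padding keeps the size (divided by the stride, rounded up).
        return math.ceil(size / stride)
    if padding == "valid":
        # Only positions where the kernel fully overlaps the input.
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


def pool_output_size(size, pool):
    """Max pooling with stride == pool size and no padding (assumed defaults)."""
    return (size - pool) // pool + 1


size = 28                                                  # MNIST images are 28x28
size = conv_output_size(size, kernel=2, padding="same")    # still 28
size = conv_output_size(size, kernel=2, padding="same")    # still 28
size = pool_output_size(size, pool=6)                      # 4
flat_features = size * size * 256                          # 256 channels from the second conv

print(size, flat_features)
print(conv_output_size(28, kernel=2, padding="valid"))     # one pixel smaller than the input
```

With `padding='valid'` and a kernel larger than the input, the computed size drops to zero or below, which is the situation where a convolution cannot be applied at all.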

## Book of grudges
* Zero initialization of the weights causes a symmetry effect: every neuron in a layer computes the same output and receives the same gradient.
* Too many filters for the first 3x3 convolution: this leads to an enormous weight matrix while there are simply not enough relevant combinations of 3x3 patches (overkill).
* Usually the further you go, the more filters you need.
* Large filters (10x10) are generally bad practice, and you definitely need more than 10 of them.
* The second 10x10 convolution gets an 8x6x6 image as input, so it is technically unable to perform such a convolution.
* Softmax nonlinearity effectively makes only one or a few neurons in the entire layer "fire", rendering the 512-neuron layer almost useless. Softmax at the output layer is okay, though.
* Dropout after probability prediction is just lame. A few random classes get a probability of 0, so the probabilities no longer sum to 1 and the log inside the crossentropy goes to -inf.
In this exercise you have to train a new Convolutional Neural Network from scratch for the classification of images.

1. For this we will use the Keras library.
2. The aim is to achieve 99% accuracy (on the validation/test set) on the MNIST dataset http://yann.lecun.com/exdb/mnist/.
3. We have provided a basic Keras implementation of a CNN.
4. You are allowed to do whatever you want (except copy pasting) with the network as long as it is explained in your report.
5. Feel free to change the architecture of the network as well as parameters (e.g. learning rate, kernel sizes, ...).
6. You can try to guess parameters manually if you want; just make sure that the network performs better than 99% on the validation set.
7. Sketch the final network architecture in your report.
8. Make sure you train the network on the GPU, otherwise it will be too slow.
9. Explain the plots: learning curve, accuracy with respect to epoch.

In [None]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import shuffle

n_folds = 5
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data(
    path='mnist.npz'
)

# Pool the official train and test splits; the folds below provide the validation data.
X = np.concatenate((train_images, test_images), axis=0)
y = np.concatenate((train_labels, test_labels))
print(X.shape, y.shape)
X = X.astype('float32')
X, y = shuffle(X, y)

# One-hot encode the labels; toarray() returns a plain ndarray (todense() returns np.matrix).
enc = OneHotEncoder()
y = enc.fit_transform(y.reshape(-1, 1)).toarray()
print(y.shape)

# Scale pixels to [0, 1] and add the channel axis expected by Conv2D.
X = X / 255
print(X.shape)
X = np.expand_dims(X, axis=3)

epochs = 10
batch_size = 32

# random_state only has an effect with shuffle=True; recent scikit-learn raises a ValueError otherwise.
kfold = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for i, (train, test) in enumerate(kfold.split(X, y), start=1):
    # Build a fresh model per fold so no fold is validated on data its weights were already trained on.
    model = get_model()
    callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=4)
    optimizer = keras.optimizers.Adam(learning_rate=0.001)
    model.compile(optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X[train], y[train], validation_data=(X[test], y[test]), callbacks=[callback],
              batch_size=batch_size, epochs=epochs)
    print('k-fold cross validation: %s | test: %s' % (i, test))
print("KFold done")


**I tried to build a network that is as small as possible, with few parameters, and fast to run. I think a 99% result on validation data alone is not a good result, because neural networks are very easy to overfit. That's why I used cross validation to see the real result. We need held-out test data to detect overfitting.**
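
To put a number on "few parameters", here is a rough count in plain Python, assuming the network in question is the one returned by `get_model()`. The helper names are made up for this sketch, and the flattened size assumes that the 6x6 max pool (default stride, no padding) shrinks the 28x28 feature maps to 4x4.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(in_features, units):
    """Weights plus one bias per unit."""
    return (in_features + 1) * units


counts = {
    "conv1 (2x2, 1 -> 128)": conv2d_params(2, 2, 1, 128),
    "conv2 (2x2, 128 -> 256)": conv2d_params(2, 2, 128, 256),
    "dense1 (4*4*256 -> 128)": dense_params(4 * 4 * 256, 128),
    "dense2 (128 -> 10)": dense_params(128, 10),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", total)
```

Most of the parameters sit in the first dense layer, so a larger pooling window or fewer filters in the second convolution would be the cheapest way to shrink the model further. Uncommenting `model.summary()` should report the same totals.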

# Going bigger

* Use `tf.keras.datasets.cifar10.load_data()` to get the data
* Split the data into 70% train / 30% validation using `train_test_split`
* Normalize the input as $x_{\text{norm}} = \frac{x}{255} - 0.5$
* We need to convert class labels to one-hot encoded vectors. Use `keras.utils.to_categorical`.
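
The preprocessing steps listed above can be sketched on toy data in plain Python. This is an illustration of the arithmetic only, not the exercise solution: `normalize`, `one_hot` and `train_val_split` are hypothetical stand-ins, and on the real arrays you would apply the same formula elementwise with NumPy and use `train_test_split` and `keras.utils.to_categorical` as the list suggests.

```python
import random


def normalize(pixel):
    """Map a pixel value from [0, 255] to [-0.5, 0.5]."""
    return pixel / 255 - 0.5


def one_hot(label, num_classes):
    """Return a vector with 1.0 at the label's index and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def train_val_split(items, train_fraction=0.7, seed=0):
    """Shuffle indices with a fixed seed and cut them into train and validation parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = round(len(items) * train_fraction)
    return [items[i] for i in indices[:cut]], [items[i] for i in indices[cut:]]


pixels = [0, 51, 255]
labels = [3, 0, 9]
samples = list(range(10))  # toy stand-in for 10 images

print([normalize(p) for p in pixels])
print([one_hot(y, num_classes=10) for y in labels])

train, val = train_val_split(samples)
print(len(train), len(val))
```

Subtracting 0.5 after scaling centers the inputs around zero, which is the only difference from the `x / 255` scaling used for MNIST earlier in the notebook.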

In [None]:
# normalize inputs
# convert class labels to one-hot encoded, should have shape (?, NUM_CLASSES)
import tensorflow
import keras

(Xtrain, ytrain), (Xtest, ytest) = tensorflow.keras.datasets.cifar10.load_data()
# y_train = ### YOUR CODE HERE
# y_test = ### YOUR CODE HERE

# x_val = ### YOUR CODE HERE
# x_val = ### YOUR CODE HERE

# y_test = ### YOUR CODE HERE
# y_test = ### YOUR CODE HERE

In [None]:
Xtrain, ytrain