<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60"></center>

AI TECH - Akademia Innowacyjnych Zastosowań Technologii Cyfrowych. Program Operacyjny Polska Cyfrowa na lata 2014-2020
<hr>

<center><img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'></center>

<center>
Projekt współfinansowany ze środków Unii Europejskiej w ramach Europejskiego Funduszu Rozwoju Regionalnego
Program Operacyjny Polska Cyfrowa na lata 2014-2020,
Oś Priorytetowa nr 3 "Cyfrowe kompetencje społeczeństwa" Działanie  nr 3.2 "Innowacyjne rozwiązania na rzecz aktywizacji cyfrowej"
Tytuł projektu:  „Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)”
    </center>

In this lab, you will implement some of the techniques discussed in the lecture.

Below you are given a solution to the previous scenario. Note that it has two serious drawbacks:
 * The output predictions do not sum up to one (i.e. it does not return a distribution) even though the images always contain exactly one digit.
 * It uses MSE coupled with output sigmoid which can lead to saturation and slow convergence

**Task 1.** Use softmax instead of coordinate-wise sigmoid and use log-loss instead of MSE. Test to see if this improves convergence. Hint: When implementing backprop it might be easier to consider these two function as a single block and not even compute the gradient over the softmax values.

**Task 2.** Implement L2 regularization and add momentum to the SGD algorithm. Play with different amounts of regularization and momentum. See if this improves accuracy/convergence.

**Task 3 (optional).** Implement Adagrad, dropout and some simple data augmentations (e.g. tiny rotations/shifts etc.). Again, test to see how these changes improve accuracy/convergence.

**Task 4.** Try adding extra layers to the network. Again, test how the changes you introduced affect accuracy/convergence. As a start, you can try this architecture: [784,100,30,10]


In [1]:
import random
import numpy as np
from torchvision import datasets, transforms
from PIL import Image

In [2]:
# Let's read the mnist dataset

def load_mnist(path='.'):
    train_set = datasets.MNIST(path, train=True, download=True)
    x_train = train_set.data.numpy()
    _y_train = train_set.targets.numpy()

    test_set = datasets.MNIST(path, train=False, download=True)
    x_test = test_set.data.numpy()
    _y_test = test_set.targets.numpy()

    x_train = x_train.reshape((x_train.shape[0],28*28)) / 255.
    x_test = x_test.reshape((x_test.shape[0],28*28)) / 255.

    y_train = np.zeros((_y_train.shape[0], 10))
    y_train[np.arange(_y_train.shape[0]), _y_train] = 1

    y_test = np.zeros((_y_test.shape[0], 10))
    y_test[np.arange(_y_test.shape[0]), _y_test] = 1

    return (x_train, y_train), (x_test, y_test)

(x_train, y_train), (x_test, y_test) = load_mnist()

In [3]:
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid
    return sigmoid(z)*(1-sigmoid(z))


def softmax(x):
  s = x - np.max(x)
  exps = np.exp(s)
  return exps / exps.sum(axis = 0)


def cross_entropy(predictions, targets):
    N = predictions.shape[0]
    ce = -np.sum(targets * np.log(predictions)) / N
    return ce

In [4]:
class Network(object):
    def __init__(self, sizes):
        # initialize biases and weights with random normal distr.
        # weights are indexed by target node first
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
        

    def feedforward(self, a):
        # Run the network on a batch
        a = a.T
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.matmul(w, a)+b)
        return a

    def update_mini_batch(self, mini_batch, eta):
        # Update networks weights and biases by applying a single step
        # of gradient descent using backpropagation to compute the gradient.
        # The gradient is computed for a mini_batch which is as in tensorflow API.
        # eta is the learning rate
        nabla_b, nabla_w = self.backprop(mini_batch[0].T,mini_batch[1].T)

        self.weights = [w-(eta/len(mini_batch[0]))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch[0]))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        # For a single input (x,y) return a pair of lists.
        # First contains gradients over biases, second over weights.
        g = x
        gs = [g] # list to store all the gs, layer by layer
        fs = [] # list to store all the fs, layer by layer
        for b, w in zip(self.biases, self.weights):
            f = np.dot(w, g)+b
            fs.append(f)
            g = sigmoid(f)
            gs.append(g)
        # backward pass <- both steps at once
        dLdg = self.cost_derivative(gs[-1], y)
        dLdfs = []
        for w,g in reversed(list(zip(self.weights,gs[1:]))):
            dLdf = np.multiply(dLdg,np.multiply(g,1-g))
            dLdfs.append(dLdf)
            dLdg = np.matmul(w.T, dLdf)

        dLdWs = [np.matmul(dLdf,g.T) for dLdf,g in zip(reversed(dLdfs),gs[:-1])]
        dLdBs = [np.sum(dLdf,axis=1).reshape(dLdf.shape[0],1) for dLdf in reversed(dLdfs)]
        return (dLdBs,dLdWs)

    def evaluate(self, test_data):
        # Count the number of correct answers for test_data
        pred = np.argmax(self.feedforward(test_data[0]),axis=0)
        corr = np.argmax(test_data[1],axis=1).T
        return np.mean(pred==corr)

    def cost_derivative(self, output_activations, y):
        return (output_activations-y)

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        x_train, y_train = training_data
        if test_data:
            x_test, y_test = test_data
        for j in range(epochs):
            for i in range(x_train.shape[0] // mini_batch_size):
                x_mini_batch = x_train[(mini_batch_size*i):(mini_batch_size*(i+1))]
                y_mini_batch = y_train[(mini_batch_size*i):(mini_batch_size*(i+1))]
                self.update_mini_batch((x_mini_batch, y_mini_batch), eta)
            if test_data:
                print("Epoch: {0}, Accuracy: {1}".format(j, self.evaluate((x_test, y_test))))
            else:
                print("Epoch: {0}".format(j))

In [5]:
network = Network([784,30,10])
network.SGD((x_train, y_train), epochs=50, mini_batch_size=100, eta=3.0, test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.6969
Epoch: 1, Accuracy: 0.7733
Epoch: 2, Accuracy: 0.7956
Epoch: 3, Accuracy: 0.805
Epoch: 4, Accuracy: 0.8116
Epoch: 5, Accuracy: 0.8172
Epoch: 6, Accuracy: 0.8219
Epoch: 7, Accuracy: 0.8249
Epoch: 8, Accuracy: 0.8272
Epoch: 9, Accuracy: 0.8293
Epoch: 10, Accuracy: 0.8317
Epoch: 11, Accuracy: 0.8332
Epoch: 12, Accuracy: 0.8343
Epoch: 13, Accuracy: 0.836
Epoch: 14, Accuracy: 0.837
Epoch: 15, Accuracy: 0.8377
Epoch: 16, Accuracy: 0.8383
Epoch: 17, Accuracy: 0.839
Epoch: 18, Accuracy: 0.8393
Epoch: 19, Accuracy: 0.8401
Epoch: 20, Accuracy: 0.8408
Epoch: 21, Accuracy: 0.8414
Epoch: 22, Accuracy: 0.842
Epoch: 23, Accuracy: 0.8432
Epoch: 24, Accuracy: 0.8437
Epoch: 25, Accuracy: 0.8441
Epoch: 26, Accuracy: 0.8446
Epoch: 27, Accuracy: 0.8452
Epoch: 28, Accuracy: 0.8456
Epoch: 29, Accuracy: 0.846
Epoch: 30, Accuracy: 0.8469
Epoch: 31, Accuracy: 0.847
Epoch: 32, Accuracy: 0.8473
Epoch: 33, Accuracy: 0.8472
Epoch: 34, Accuracy: 0.8474
Epoch: 35, Accuracy: 0.8477
Epoch: 36

### Task 1
Use softmax instead of coordinate-wise sigmoid and use log-loss instead of MSE. Test to see if this improves convergence. Hint: When implementing backprop it might be easier to consider these two function as a single block and not even compute the gradient over the softmax values.

In [6]:
class SoftmaxNetwork(Network):
    def __init__(self, sizes):
        super().__init__(sizes)

    def feedforward(self, a):
        # Run the network on a batch
        a = a.T
        for b, w in zip(self.biases[:-1], self.weights[:-1]):
            a = sigmoid(np.matmul(w, a)+b)

        # last layer, softmax 
        a = softmax(np.matmul(self.weights[-1], a) + self.biases[-1])

        return a
    
    def cost_derivative(self, output_activations, y):
        return (softmax(output_activations)-y)
    
    def backprop(self, x, y):
        # For a single input (x,y) return a pair of lists.
        # First contains gradients over biases, second over weights.
        g = x
        gs = [g] # list to store all the gs, layer by layer
        fs = [] # list to store all the fs, layer by layer

        for b, w in zip(self.biases[:-1], self.weights[:-1]):
            f = np.dot(w, g)+b
            fs.append(f)
            g = sigmoid(f)
            gs.append(g)

        # a trick, last layer without activation, because it is easier to compute
        # the gradient of the cost function with respect to the non activated f straight away

        f = np.dot(self.weights[-1], g) + self.biases[-1]
        fs.append(f)

        dLdf = self.cost_derivative(fs[-1], y)
        dLdfs = [dLdf]
        dLdg = np.matmul(self.weights[-1].T, dLdf)

        for w,g in reversed(list(zip(self.weights[:-1],gs[1:]))):
            dLdf = np.multiply(dLdg,np.multiply(g,1-g))
            dLdfs.append(dLdf)
            dLdg = np.matmul(w.T, dLdf)
        
        dLdWs = [np.matmul(dLdf,g.T) for dLdf,g in zip(reversed(dLdfs),gs)] 
        dLdBs = [np.sum(dLdf,axis=1).reshape(dLdf.shape[0],1) for dLdf in reversed(dLdfs)] 
        return (dLdBs,dLdWs)

In [7]:
network = SoftmaxNetwork([784,30,10])
network.SGD((x_train, y_train), epochs=10, mini_batch_size=100, eta=3.0, test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.8907
Epoch: 1, Accuracy: 0.9146
Epoch: 2, Accuracy: 0.9263
Epoch: 3, Accuracy: 0.9337
Epoch: 4, Accuracy: 0.9355
Epoch: 5, Accuracy: 0.9405
Epoch: 6, Accuracy: 0.942
Epoch: 7, Accuracy: 0.9437
Epoch: 8, Accuracy: 0.9445
Epoch: 9, Accuracy: 0.9451


### Task 2
 Implement L2 regularization and add momentum to the SGD algorithm. Play with different amounts of regularization and momentum. See if this improves accuracy/convergence.

In [8]:
class L2Network(SoftmaxNetwork):
    def __init__(self, sizes):
        super().__init__(sizes)

    def update_mini_batch(self, mini_batch, eta, lmbd = 0.5):
        # Here we add L2 regularization, the bigger the lmbd, the bigger the penalty

        # Sum of two derrivaties is a derrivative of a sum and we compute
        # the derrivative of L2 regularization term separately only with respect to weights;
        # d (lmbd/2 * sum(w^2)) / dw = lmbd * w

        nabla_b, nabla_w = self.backprop(mini_batch[0].T,mini_batch[1].T)

        self.weights = [(1-eta*lmbd/len(mini_batch[0]))*w-(eta/len(mini_batch[0]))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch[0]))*nb
                       for b, nb in zip(self.biases, nabla_b)]

In [11]:
network = L2Network([784,30,10])
network.SGD((x_train, y_train), epochs=10, mini_batch_size=100, eta=0.1, test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.6645
Epoch: 1, Accuracy: 0.7874
Epoch: 2, Accuracy: 0.8433
Epoch: 3, Accuracy: 0.8742
Epoch: 4, Accuracy: 0.892
Epoch: 5, Accuracy: 0.8982
Epoch: 6, Accuracy: 0.9028
Epoch: 7, Accuracy: 0.9047
Epoch: 8, Accuracy: 0.9062
Epoch: 9, Accuracy: 0.9067


In [14]:
class MomentumNetwork(SoftmaxNetwork):
    def __init__(self, sizes):
        super().__init__(sizes)
        self.momentum_w = [np.zeros(w.shape) for w in self.weights]
        self.momentum_b = [np.zeros(b.shape) for b in self.biases]

    def update_mini_batch(self, mini_batch, eta, gamma = 0.9, lmbd = 0.0):
        # Introducing momentum, the bigger the gamma, the bigger the momentum

        nabla_b, nabla_w = self.backprop(mini_batch[0].T,mini_batch[1].T)

        self.momentum_w = [gamma * mw - (eta/len(mini_batch[0])) * nw - lmbd * w 
                            for w, nw, mw in zip(self.weights, nabla_w, self.momentum_w)]
        
        self.momentum_b = [gamma * mb - (eta/len(mini_batch[0])) * nb  - lmbd * b
                            for b, nb, mb in zip(self.biases, nabla_b, self.momentum_b)]
    

        self.weights = [w + mw
                        for w, mw in zip(self.weights, self.momentum_w)]
        self.biases = [b + mb
                       for b, mb in zip(self.biases, self.momentum_b)]

In [15]:
network = MomentumNetwork([784,30,10])
network.SGD((x_train, y_train), epochs=10, mini_batch_size=100, eta=3.0, test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.8324
Epoch: 1, Accuracy: 0.8828
Epoch: 2, Accuracy: 0.9116
Epoch: 3, Accuracy: 0.9229
Epoch: 4, Accuracy: 0.92
Epoch: 5, Accuracy: 0.9147
Epoch: 6, Accuracy: 0.9143
Epoch: 7, Accuracy: 0.9172
Epoch: 8, Accuracy: 0.9266
Epoch: 9, Accuracy: 0.9319


### Task 3 (extra):
 Implement Adagrad, dropout and some simple data augmentations (e.g. tiny rotations/shifts etc.). Again, test to see how these changes improve accuracy/convergence.

In [16]:
class AdagradNetwork(MomentumNetwork):
    def __init__(self, sizes):
        super().__init__(sizes)
        self.G_t = [np.zeros(w.shape) for w in self.weights]

        def update_mini_batch(self, mini_batch, eta, gamma = 0.0, lmbd = 0.0):
            # Introducing momentum, the bigger the gamma, the bigger the momentum

            nabla_b, nabla_w = self.backprop(mini_batch[0].T,mini_batch[1].T)

            self.G_w = [G_i + nw**2 for G_i, nw in zip(self.G_w, nabla_w)]
            self.G_b = [G_i + nb**2 for G_i, nb in zip(self.G_b, nabla_b)]

            adapted_lr_w = [eta / np.sqrt(G_i + 1e-8) for G_i in self.G_w]
            adapted_lr_b = [eta / np.sqrt(G_i + 1e-8) for G_i in self.G_b]

            self.momentum_w = [gamma * mw - (lr/len(mini_batch[0])) * nw - lmbd * w 
                                for  w, nw, mw, lr in zip(self.weights, nabla_w, self.momentum_w, adapted_lr_w)]
            
            self.momentum_b = [gamma * mb - (lr/len(mini_batch[0])) * nb  - lmbd * b
                                for b, nb, mb, lr in zip(self.biases, nabla_b, self.momentum_b, adapted_lr_b)]
        

            self.weights = [w + mw
                            for w, mw in zip(self.weights, self.momentum_w)]
            self.biases = [b + mb
                        for b, mb in zip(self.biases, self.momentum_b)]

In [20]:
network = AdagradNetwork([784,30,10])
network.SGD((x_train, y_train), epochs=25, mini_batch_size=100, eta=1.0, test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.9001
Epoch: 1, Accuracy: 0.8912
Epoch: 2, Accuracy: 0.921
Epoch: 3, Accuracy: 0.9131
Epoch: 4, Accuracy: 0.9284
Epoch: 5, Accuracy: 0.9271
Epoch: 6, Accuracy: 0.9364
Epoch: 7, Accuracy: 0.9383
Epoch: 8, Accuracy: 0.9435
Epoch: 9, Accuracy: 0.9422
Epoch: 10, Accuracy: 0.9332
Epoch: 11, Accuracy: 0.9292
Epoch: 12, Accuracy: 0.9436
Epoch: 13, Accuracy: 0.9372
Epoch: 14, Accuracy: 0.9405
Epoch: 15, Accuracy: 0.9382
Epoch: 16, Accuracy: 0.9372
Epoch: 17, Accuracy: 0.9405
Epoch: 18, Accuracy: 0.9401
Epoch: 19, Accuracy: 0.9437
Epoch: 20, Accuracy: 0.9406
Epoch: 21, Accuracy: 0.936
Epoch: 22, Accuracy: 0.9462
Epoch: 23, Accuracy: 0.9404
Epoch: 24, Accuracy: 0.9436


In [23]:
class DropoutNetwork(SoftmaxNetwork):
    def __init__(self, sizes):
        super().__init__(sizes)


    def backprop(self, x, y, p = 0.1):
        # For a single input (x,y) return a pair of lists.
        # First contains gradients over biases, second over weights.
        g = x
        # mask = np.random.binomial(1, 1 - p, size=g.shape) # one with probability p
        # g = np.multiply(g, mask) * (1 / (1- p)) # apply mask and scale
        gs = [g] # list to store all the gs, layer by layer
        fs = [] # list to store all the fs, layer by layer

        for b, w in zip(self.biases[:-1], self.weights[:-1]):
            f = np.dot(w, g)+b
            fs.append(f)
            g = sigmoid(f)
            mask = (np.random.rand(*g.shape) < (1 - p)) # one with probability p
            g = np.multiply(g, mask) * (1 / (1- p)) # apply mask and scale
            gs.append(g)

        # a trick, last layer without activation, because it is easier to compute
        # the gradient of the cost function with respect to the non activated f straight away

        f = np.dot(self.weights[-1], g) + self.biases[-1]
        fs.append(f)

        dLdf = self.cost_derivative(fs[-1], y)
        dLdfs = [dLdf]
        dLdg = np.matmul(self.weights[-1].T, dLdf)

        for w,g in reversed(list(zip(self.weights[:-1],gs[1:]))):
            dLdf = np.multiply(dLdg,np.multiply(g,1-g))
            dLdfs.append(dLdf)
            dLdg = np.matmul(w.T, dLdf)
        
        dLdWs = [np.matmul(dLdf,g.T) for dLdf,g in zip(reversed(dLdfs),gs)] 
        dLdBs = [np.sum(dLdf,axis=1).reshape(dLdf.shape[0],1) for dLdf in reversed(dLdfs)] 
        return (dLdBs,dLdWs)

In [24]:
network = DropoutNetwork([784,100, 50, 30,10])
network.SGD((x_train, y_train), epochs=25, mini_batch_size=100, eta=3.0, test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.0958
Epoch: 1, Accuracy: 0.0974
Epoch: 2, Accuracy: 0.098
Epoch: 3, Accuracy: 0.0982
Epoch: 4, Accuracy: 0.101
Epoch: 5, Accuracy: 0.0974
Epoch: 6, Accuracy: 0.098
Epoch: 7, Accuracy: 0.0958
Epoch: 8, Accuracy: 0.101
Epoch: 9, Accuracy: 0.0974
Epoch: 10, Accuracy: 0.0982
Epoch: 11, Accuracy: 0.101
Epoch: 12, Accuracy: 0.0974
Epoch: 13, Accuracy: 0.1028
Epoch: 14, Accuracy: 0.098
Epoch: 15, Accuracy: 0.0974
Epoch: 16, Accuracy: 0.098
Epoch: 17, Accuracy: 0.1135
Epoch: 18, Accuracy: 0.101
Epoch: 19, Accuracy: 0.098
Epoch: 20, Accuracy: 0.1009
Epoch: 21, Accuracy: 0.0974
Epoch: 22, Accuracy: 0.0974
Epoch: 23, Accuracy: 0.0974
Epoch: 24, Accuracy: 0.0974


In [25]:
def rng_rotate(img : np.ndarray, degrees=(-10, 10)):
    pil_image = Image.fromarray(np.uint8(img.reshape((28,28))*255))
    rotation_transform = transforms.RandomRotation(degrees=degrees)
    rotated_pil_image = rotation_transform(pil_image)
    np_image = np.array(rotated_pil_image)

    return np_image.reshape((28*28))/255.



class AugumentedNetwork(SoftmaxNetwork):
    def __init__(self, sizes):
        super().__init__(sizes)

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        x_train, y_train = training_data
        if test_data:
            x_test, y_test = test_data
        for j in range(epochs):
            for i in range(x_train.shape[0] // mini_batch_size):
                x_mini_batch = x_train[(mini_batch_size*i):(mini_batch_size*(i+1))]
                y_mini_batch = y_train[(mini_batch_size*i):(mini_batch_size*(i+1))]
                for k in range(x_mini_batch.shape[0]):
                    x_mini_batch[k] = rng_rotate(x_mini_batch[k])

                self.update_mini_batch((x_mini_batch, y_mini_batch), eta)
            if test_data:
                print("Epoch: {0}, Accuracy: {1}".format(j, self.evaluate((x_test, y_test))))
            else:
                print("Epoch: {0}".format(j))
        

In [26]:
network = AugumentedNetwork([784,100,30,10])
network.SGD((x_train, y_train), epochs=10, mini_batch_size=100, eta=3.0, test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.8979
Epoch: 1, Accuracy: 0.9191
Epoch: 2, Accuracy: 0.9262
Epoch: 3, Accuracy: 0.9337
Epoch: 4, Accuracy: 0.94
Epoch: 5, Accuracy: 0.9426
Epoch: 6, Accuracy: 0.9372
Epoch: 7, Accuracy: 0.938
Epoch: 8, Accuracy: 0.9379
Epoch: 9, Accuracy: 0.9369


### Task 4
Try adding extra layers to the network. Again, test how the changes you introduced affect accuracy/convergence. As a start, you can try this architecture: [784,100,30,10]

In [28]:
network = SoftmaxNetwork([784,100,30,10])
network.SGD((x_train, y_train), epochs=100, mini_batch_size=100, eta=3.0, test_data=(x_test, y_test))

Epoch: 0, Accuracy: 0.8497
Epoch: 1, Accuracy: 0.8741
Epoch: 2, Accuracy: 0.8915
Epoch: 3, Accuracy: 0.8989
Epoch: 4, Accuracy: 0.9077
Epoch: 5, Accuracy: 0.9166
Epoch: 6, Accuracy: 0.9216
Epoch: 7, Accuracy: 0.9225
Epoch: 8, Accuracy: 0.9254
Epoch: 9, Accuracy: 0.9248
Epoch: 10, Accuracy: 0.9266
Epoch: 11, Accuracy: 0.9261
Epoch: 12, Accuracy: 0.9264
Epoch: 13, Accuracy: 0.9257
Epoch: 14, Accuracy: 0.9287
Epoch: 15, Accuracy: 0.9294
Epoch: 16, Accuracy: 0.9308
Epoch: 17, Accuracy: 0.9307
Epoch: 18, Accuracy: 0.9304
Epoch: 19, Accuracy: 0.9306
Epoch: 20, Accuracy: 0.9306
Epoch: 21, Accuracy: 0.9298
Epoch: 22, Accuracy: 0.9288
Epoch: 23, Accuracy: 0.9283
Epoch: 24, Accuracy: 0.9291
Epoch: 25, Accuracy: 0.9294
Epoch: 26, Accuracy: 0.9298
Epoch: 27, Accuracy: 0.931
Epoch: 28, Accuracy: 0.9297
Epoch: 29, Accuracy: 0.9287
Epoch: 30, Accuracy: 0.9274
Epoch: 31, Accuracy: 0.9305
Epoch: 32, Accuracy: 0.9287
Epoch: 33, Accuracy: 0.9301
Epoch: 34, Accuracy: 0.9314
Epoch: 35, Accuracy: 0.93
Epoch