# HaMLeT

## Session 6: Backpropagation
This lab is designed by Leon Weninger and Raphael Kolk

### Goal of this Session

In this session you will, step by step, implement a backpropagation algorithm yourself without using any deep learning libraries. You should already be familiar with Python as well as NumPy (a package for scientific computing with Python).

### Given code

**Task 0:** Familiarize yourself briefly with the given code. Pay particular attention to the `Layer` and `Cost` classes, from which you will derive the classes you implement, and the `Sigmoid` layer, which is predefined as an example. You'll also need to execute the cells in this section once.

The following code loads the data and trains the network in a similar fashion as in the last session.

$$x_{out} = w x_{in} + b$$
$$\delta_{in} = w^T \delta_{out}$$
$$\frac{\partial C}{\partial b} = \delta_{out}$$
$$\frac{\partial C}{\partial w} = \delta_{out} x_{in}^T$$

$$x_{out} = \sigma (x_{in})$$
$$\delta_{in} = \delta_{out} \odot \sigma' (x_{in})$$

$$cost = \frac{1}{N} \sum_{i=1}^N ( x_{in} - target )^2$$
$$\delta_{in} = x_{in} - target $$

$$ \frac{\partial cost}{\partial x_{in}} $$

In [4]:
import numpy as np
from tqdm import tqdm
from load_mnist import MNIST


def vectorize(j):
    label_vector = np.zeros((1, 10))
    label_vector[0, int(j)] = 1.0
    return label_vector


def load_data():
    mnist = MNIST()
    images, labels = mnist.data, mnist.target

    image_size = images.shape[1]
    label_size = labels.shape[1]

    random_permutation = np.random.permutation(images.shape[0])
    images = images[random_permutation, :]
    labels = labels[random_permutation, :]
    
    images = (images - np.mean(images))/np.std(images)

    return images, labels, image_size, label_size


def train(net, cost_function, number_epochs, batch_size, learning_rate):
    images, labels, image_size, label_size = load_data()
    training_images, validation_images = images[:50000], images[50000:]
    training_labels, validation_labels = labels[:50000], labels[50000:]

    for e in range(number_epochs):
        cost = train_epoch(e, net, training_images, training_labels, cost_function, batch_size, learning_rate)
        accuracy = validate_epoch(e, net, validation_images, validation_labels, batch_size)
        print('cost=%5.6f, accuracy=%2.6f' % (cost, accuracy), flush=True)


def train_epoch(e, net, images, labels, cost_function, batch_size, learning_rate):
    epoch_cost = 0

    for i in tqdm(range(0, len(images), batch_size), ascii=False, desc='training,   e=%i' % e):
        batch_images = images[i:min(i + batch_size, len(images)), :]
        batch_labels = labels[i:min(i + batch_size, len(labels)), :]

        # zero the gradients
        net.zero_gradients()

        # forward pass
        prediction = net.forward(batch_images)
        cost = cost_function.estimate(batch_labels, prediction)

        # backward pass
        dprediction = cost_function.gradient(cost)
        net.backward(dprediction)

        # update the parameters using the computed gradients via stochastic gradient descent.
        net.update_parameters(learning_rate)

        epoch_cost += np.mean(cost)

    return epoch_cost


def validate_epoch(e, net, images, labels, batch_size):
    n_correct = 0
    n_total = 0

    for i in tqdm(range(0, len(images), batch_size), ascii=False, desc='validation, e=%i' % e):
        batch_images = images[i:min(i + batch_size, len(images)), :]
        batch_labels = labels[i:min(i + batch_size, len(labels)), :]

        # compute predicted probabilities.
        predictions = net.forward(batch_images)

        # find the most probable class label.
        n_correct += sum(np.argmax(batch_labels, axis=1) == np.argmax(predictions, axis=1))
        n_total += batch_labels.shape[0]

    return n_correct / n_total


Remember the sigmoid function and its derivative you implemented in the previous session.

In [5]:
def sigmoid_function(var):
    return 1.0 / (1.0 + np.exp(-var))


def sigmoid_derivative(z):
    return sigmoid_function(z) * (1 - sigmoid_function(z))
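
A quick way to convince yourself that the derivative is correct is to compare it against a central finite difference. The following is a minimal sketch, not part of the original lab; the step size `h`, the test points `z`, and the name `max_abs_diff` are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid_function(var):
    return 1.0 / (1.0 + np.exp(-var))


def sigmoid_derivative(z):
    s = sigmoid_function(z)
    return s * (1.0 - s)


# Test points and step size are arbitrary choices for this sketch.
z = np.linspace(-5.0, 5.0, 11)
h = 1e-5

# Central difference approximation of the derivative.
numeric = (sigmoid_function(z + h) - sigmoid_function(z - h)) / (2.0 * h)
analytic = sigmoid_derivative(z)

# Largest deviation between both versions; it should be tiny.
max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```

If the two versions disagree noticeably, the analytic derivative is the first place to look.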

The following abstract classes should serve as parent classes for all the different layers and cost functions which you will implement. 

In [6]:
class Layer:
    def __init__(self):
        # Initialize all member variables of the layer.       
        pass

    def forward(self, x_in):
        # Implements the forward pass of the layer and returns x_out.
        pass

    def backward(self, d_out):
        # Implements the backward pass of the layer and returns d_in.
        pass

    def zero_gradients(self):
        # Sets all gradients of the layer to zero.
        pass

    def update_parameters(self, learning_rate):
        # Update the parameters of the layer with the help of the gradients stored during the backward pass.
        pass


class Cost:
    def __init__(self):
        # Initialize all member variables of the cost function.
        pass

    def estimate(self, target, prediction):
        # Estimates and returns the cost of the predicted label with respect to the given target label.
        pass

    def gradient(self, cost):
        # Calculates and returns the gradient of the cost with respect to the prediction.
        pass

The following class, derived from the `Layer` class, implements the forward and backward pass of the sigmoid activation function already known from the previous session and serves as an example for you. Since it does not have learnable parameters, `update_parameters` and `zero_gradients` do not need to do anything.

In [7]:
class Sigmoid(Layer):
    def __init__(self):
        self.x_in = None
    
    def forward(self, x_in):
        self.x_in = x_in
        x_out = sigmoid_function(x_in)
        return x_out
    
    def backward(self, d_out):
        d_in = d_out * sigmoid_derivative(self.x_in)
        return d_in
    
    def zero_gradients(self):
        pass
    
    def update_parameters(self, learning_rate):
        pass

### Theoretical Foundation 

**Task 1a:** Take a piece of paper and a pencil, use your knowledge from the preparation material and the introduction slides, and fill in the gaps in the preparation material. Once you have written the formulas down, please check with the tutor whether they are correct.

Before starting to work on the code, think about batched forward and backward passes. We keep to the convention of having the batch as the first dimension of all tensors. This is consistent with modern deep learning frameworks, as you will get to know in the next session. Note, however, that tensors may need to be transposed when performing multiplications and additions.
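
To make the batch convention concrete, here is a minimal sketch of the shapes involved in one linear layer. It assumes a weight matrix of shape `(n_out, n_in)` and a bias of shape `(1, n_out)`; the sizes and the random values are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 4, 3, 2  # arbitrary sizes for this sketch

x_in = rng.standard_normal((batch, n_in))
w = rng.standard_normal((n_out, n_in))
b = np.zeros((1, n_out))

# Forward pass: the bias row is broadcast over the batch dimension.
x_out = x_in @ w.T + b                 # (batch, n_out)

# Backward pass for an arbitrary incoming gradient d_out.
d_out = rng.standard_normal(x_out.shape)
d_in = d_out @ w                       # (batch, n_in)
dw = d_out.T @ x_in                    # (n_out, n_in), summed over the batch
db = d_out.sum(axis=0, keepdims=True)  # (1, n_out), summed over the batch

# The batched product equals the sum of per-sample outer products.
dw_loop = sum(np.outer(d_out[i], x_in[i]) for i in range(batch))
print(np.allclose(dw, dw_loop))
```

Writing down the expected shape next to each line, as in the comments above, usually shows where a transpose is required.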

### Practical Implementation

**Task 2a:** Implement the `forward` function for the `Linear` layer. Remember to store the `x_in` for use in the backward pass.

**Task 2b:** Implement the `estimate` function for the `MeanSquareError` cost, which estimates the cost of a prediction with respect to a ground truth target. Remember to store the `prediction` and the `target` for calculating the gradient.

**Task 2c:** Implement the `gradient` function for the `MeanSquareError` cost, which calculates the gradient with respect to the cost. Use the `prediction` stored during the forward pass.

**Task 2d:** Implement the `backward` function for the `Linear` layer. The function should also calculate and accumulate the gradient of `w` and `b` with regard to the error.

**Task 2e:** Implement the `update_parameters` function for the `Linear` layer, i.e., use the gradients `dw` and `db` together with a given `learning_rate` to update the parameters `w` and `b` accordingly.

**Task 2f:** Test your implementation by propagating random input through a linear layer followed by a sigmoid layer, estimating the mean square error to a random target, calculating the gradient and propagating it back through the sigmoid and linear layer. Afterwards update the parameters of the linear layer using the function you implemented.

**Task 3a:** Implement the `Network` class which can encapsulate multiple layers. It offers the same interface as a layer and is therefore derived from the `Layer` parent as well. Make sure to implement all member functions needed. The `forward` function propagates a given input through all encapsulated layers and returns the final prediction of the network, whereas the `backward` function propagates a given gradient through all layers in reversed order. `zero_gradients` and `update_parameters` invoke the respective functions of the encapsulated layers.

**Task 3b:** Test your implementation analogous to task 2f but using the `Network` class to encapsulate the linear and sigmoid layer.

**Task 4:** Train the network you just implemented using the dataloader and train function given above and the hyperparameters given below.

**Task 5:** Come up with a more sophisticated network structure and adjust the hyperparameters in order to increase the accuracy.
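
When testing the backward pass, a numerical gradient check can be helpful. The sketch below is self-contained and does not use the lab's classes. It assumes a linear layer followed by a sigmoid and a summed half squared error (summed rather than averaged, so that the gradient with respect to the prediction is exactly `prediction - target`). All names and sizes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost_of(w, b, x, target):
    # Summed half squared error of a linear layer followed by a sigmoid.
    prediction = sigmoid(x @ w.T + b)
    return 0.5 * np.sum((prediction - target) ** 2)


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))        # (batch, n_in)
target = rng.random((5, 3))            # (batch, n_out)
w = 0.1 * rng.standard_normal((3, 4))  # (n_out, n_in)
b = np.zeros((1, 3))

# Analytic gradient via backpropagation.
prediction = sigmoid(x @ w.T + b)
d_prediction = prediction - target
d_z = d_prediction * prediction * (1.0 - prediction)
dw_analytic = d_z.T @ x

# Numeric gradient via central differences, one weight at a time.
h = 1e-6
dw_numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        w_plus = w.copy()
        w_plus[i, j] += h
        w_minus = w.copy()
        w_minus[i, j] -= h
        dw_numeric[i, j] = (cost_of(w_plus, b, x, target)
                            - cost_of(w_minus, b, x, target)) / (2.0 * h)

max_abs_diff = np.max(np.abs(dw_numeric - dw_analytic))
print(max_abs_diff)
```

The same idea carries over to your own `Linear`, `Sigmoid` and `MeanSquareError` classes: perturb a single weight, recompute the cost, and compare the slope with the stored `dw`.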

In [8]:
class Linear(Layer):
    def __init__(self, n_in, n_out, initial_sigma=0.1):
        self.n_in = n_in
        self.n_out = n_out

        self.w = initial_sigma * np.random.randn(n_out, n_in)
        self.b = np.zeros((1, n_out))

        self.zero_gradients()

        self.x_in = None

    def forward(self, x_in):
        # ----- Add code for task 2a between comments -----
        # 'Implement the forward function for the Linear layer. 
        # Remember to store the x_in for use in the backward pass.'
        self.x_in = x_in
        
        # Batched affine map: x_in is (batch, n_in), w is (n_out, n_in), b is (1, n_out).
        x_out = np.dot(x_in, np.transpose(self.w)) + self.b
        assert x_out.shape == (x_in.shape[0], self.n_out)
        # -------------------------------------------------
        return x_out

    def backward(self, d_out):
        # ----- Add code for task 2d between comments -----
        # 'Implement the backward function for the Linear layer. The function should also calculate 
        # and accumulate the gradient of w and b with regard to the error.'

        self.dw = np.matmul(np.transpose(d_out), self.x_in)

        self.db = d_out.sum(axis=0, keepdims=True) # sum over the batch dimension (e.g. 600 samples); keepdims keeps the result two-dimensional, i.e. (1, n_out)
        assert self.dw.shape == (self.n_out, self.n_in)
        assert self.db.shape == (1, self.n_out)

        self.d_in = np.matmul(d_out, self.w)
        self.dx = self.d_in
        assert self.dx.shape == (d_out.shape[0], self.n_in)
        # -------------------------------------------------
        return self.d_in

    def zero_gradients(self):
        self.dw = np.zeros((self.n_out, self.n_in))
        self.db = np.zeros((1, self.n_out))
        self.dx = np.empty((0, self.n_in))

    def update_parameters(self, learning_rate):
        # ----- Add code for task 2e between comments -----
        # 'Implement the update_parameters function for the Linear layer, 
        # i.e., use the gradients dw and db together with a given 
        # learning_rate to update the parameters w and b accordingly.'
        self.w = self.w - learning_rate * self.dw
        self.b = self.b - learning_rate * self.db
        # -------------------------------------------------


class MeanSquareError(Cost):
    def __init__(self):
        self.prediction = None
        self.target = None

    def estimate(self, target, prediction):
        # ----- add code for task 2b between comments -----
        # 'Implement the estimate function for the MeanSquareError cost, 
        # which estimates the cost after a ground truth target is set. 
        # Remember to store the prediction for calculating the gradient.'
        self.target = target
        self.prediction = prediction
        cost = 0.5 * ((target-prediction)**2).mean(axis=None)
        # -------------------------------------------------
        return cost

    def gradient(self, cost):
        # ----- add code for task 2c between comments -----
        # 'Implement the gradient function for the MeanSquareError cost, 
        # which calculates the gradient. Use the 
        # prediction stored during the forward pass.'
        gradient = self.prediction - self.target
        # -------------------------------------------------
        return gradient


class Network(Layer):
    def __init__(self, layers):
        self.layers = layers

    # ----- add code for task 3a between comments -----
    # Implement the Network class which can encapsulate multiple layers. It offers the same 
    # interface as a layer and is therefore derived from the Layer parent as well. Make sure 
    # to implement all member functions needed. The forward function propagates a given input 
    # through all encapsulated layers and returns the final prediction of the network, whereas 
    # the backward function propagates a given gradient through all layers in reversed order. 
    # zero_gradients and update_parameters invoke the respective functions of the encapsulated layers.
    
    def forward(self, x_in):
        buffer = x_in
        for layer in self.layers:
            buffer = layer.forward(buffer)
        x_out = buffer
        return x_out

    def backward(self, d_out):
        # Implements the backward pass of the layer and returns d_in.
        buffer = d_out
        for layer in reversed(self.layers): # because of going backwards
            buffer = layer.backward(buffer)
        d_in = buffer
        return d_in

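    def zero_gradients(self):
        # Reset the stored gradients of all encapsulated layers.
        for layer in self.layers:
            layer.zero_gradients()

    def update_parameters(self, learning_rate):
        # Update the parameters of all encapsulated layers using the gradients stored during the backward pass.
        for layer in self.layers:
            layer.update_parameters(learning_rate)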
    # -------------------------------------------------

In [9]:
# define hyperparameters
input_size = 28**2
label_size = 10
batch_size = 600
learning_rate = 0.0001
number_epochs = 100

random_input = np.random.rand(batch_size, input_size)
random_label = np.random.rand(batch_size, label_size)

linear_layer = Linear(input_size, label_size)
sigmoid_layer = Sigmoid()
cost_function = MeanSquareError()

# ----- add code for task 2f between comments -----
# 'Test your implementation by propagating random input through 
# a linear layer followed by a sigmoid layer, estimating the mean
# square error to a random target, calculating the gradient and 
# propagating it back through the sigmoid and linear layer. 
# Afterwards update the parameters of the linear layer using the 
# function you implemented.'

#Forward Pass
linear_for = linear_layer.forward(random_input)
sigmoid_for = sigmoid_layer.forward(linear_for)

#Calculate Output error vector
mse = cost_function.estimate(random_label, sigmoid_for)
cost_gradient = cost_function.gradient(mse)

#Backpropagate Error
sigmoid_back = sigmoid_layer.backward(cost_gradient)
linear_back = linear_layer.backward(sigmoid_back)

#Update the parameters of the linear layer
linear_layer.update_parameters(learning_rate)
# -------------------------------------------------


# ----- add code for task 3b between comments -----
#Test your implementation analogous to task 2f but using the Network class 
#to encapsulate the linear and sigmoid layer

#Create instance of Network Class
layers = [linear_layer, sigmoid_layer]
net = Network(layers)

#Forward Pass
net_forward = net.forward(random_input)

#Calculate Error
cost = cost_function.estimate(random_label, net_forward)
grad = cost_function.gradient(cost)

#Backpropagate Error
net_backward = net.backward(grad)

# -------------------------------------------------

# ----- add code for task 4 between comments ------
# Train the network you just implemented using the dataloader and train 
# function given above and the hyperparameters given below.

#Train the Network
#train(net, cost_function, number_epochs, batch_size, learning_rate)

# -------------------------------------------------

# ----- add code for task 5 between comments ------
#'Come up with a more sophisticated network structure and 
# adjust the hyperparameters in order to increase the accuracy.'

#Adjust Hyperparameters
batch_size = 12
learning_rate = 0.01
number_epochs = 40

#Train the Network
layers_2 = [Linear(n_in=28**2, n_out=128), Sigmoid(), 
           Linear(n_in=128, n_out=32), Sigmoid(),
           Linear(n_in=32, n_out=10), Sigmoid()]

net1 = Network(layers_2)
train(net1, cost_function, number_epochs, batch_size, learning_rate)


# -------------------------------------------------

training,   e=0: 100%|██████████| 4167/4167 [00:11<00:00, 370.15it/s]
validation, e=0: 100%|██████████| 1667/1667 [00:01<00:00, 839.35it/s]

cost=126.924547, accuracy=0.853550



training,   e=1: 100%|██████████| 4167/4167 [00:10<00:00, 386.59it/s]
validation, e=1: 100%|██████████| 1667/1667 [00:01<00:00, 1032.38it/s]

cost=45.024823, accuracy=0.908350



training,   e=2: 100%|██████████| 4167/4167 [00:10<00:00, 396.43it/s]
validation, e=2: 100%|██████████| 1667/1667 [00:01<00:00, 871.16it/s] 

cost=30.006822, accuracy=0.924050



training,   e=3: 100%|██████████| 4167/4167 [00:07<00:00, 527.24it/s]
validation, e=3: 100%|██████████| 1667/1667 [00:01<00:00, 837.26it/s]

cost=24.190776, accuracy=0.934850



training,   e=4: 100%|██████████| 4167/4167 [00:07<00:00, 529.54it/s]
validation, e=4: 100%|██████████| 1667/1667 [00:03<00:00, 528.35it/s]

cost=20.496776, accuracy=0.942200



training,   e=5: 100%|██████████| 4167/4167 [00:08<00:00, 465.10it/s]
validation, e=5: 100%|██████████| 1667/1667 [00:01<00:00, 1001.20it/s]

cost=17.780754, accuracy=0.948500



training,   e=6: 100%|██████████| 4167/4167 [00:10<00:00, 408.64it/s]
validation, e=6: 100%|██████████| 1667/1667 [00:01<00:00, 868.00it/s]

cost=15.658349, accuracy=0.953100



training,   e=7: 100%|██████████| 4167/4167 [00:08<00:00, 518.25it/s]
validation, e=7: 100%|██████████| 1667/1667 [00:02<00:00, 823.23it/s]

cost=13.948911, accuracy=0.956800



training,   e=8: 100%|██████████| 4167/4167 [00:08<00:00, 482.63it/s]
validation, e=8: 100%|██████████| 1667/1667 [00:01<00:00, 941.43it/s] 

cost=12.548150, accuracy=0.959700



training,   e=9: 100%|██████████| 4167/4167 [00:11<00:00, 365.92it/s]
validation, e=9: 100%|██████████| 1667/1667 [00:02<00:00, 829.05it/s]

cost=11.384239, accuracy=0.962450



training,   e=10: 100%|██████████| 4167/4167 [00:11<00:00, 375.19it/s]
validation, e=10: 100%|██████████| 1667/1667 [00:01<00:00, 1149.22it/s]

cost=10.395047, accuracy=0.964650



training,   e=11: 100%|██████████| 4167/4167 [00:08<00:00, 494.81it/s]
validation, e=11: 100%|██████████| 1667/1667 [00:01<00:00, 1189.39it/s]

cost=9.535519, accuracy=0.966100



training,   e=12: 100%|██████████| 4167/4167 [00:09<00:00, 445.90it/s]
validation, e=12: 100%|██████████| 1667/1667 [00:01<00:00, 869.14it/s]

cost=8.781530, accuracy=0.967200



training,   e=13: 100%|██████████| 4167/4167 [00:07<00:00, 538.53it/s]
validation, e=13: 100%|██████████| 1667/1667 [00:01<00:00, 1073.45it/s]

cost=8.120255, accuracy=0.968100



training,   e=14: 100%|██████████| 4167/4167 [00:06<00:00, 611.95it/s]
validation, e=14: 100%|██████████| 1667/1667 [00:01<00:00, 946.64it/s] 

cost=7.539291, accuracy=0.968600



training,   e=15: 100%|██████████| 4167/4167 [00:06<00:00, 618.09it/s]
validation, e=15: 100%|██████████| 1667/1667 [00:01<00:00, 1225.60it/s]

cost=7.023863, accuracy=0.969100



training,   e=16: 100%|██████████| 4167/4167 [00:06<00:00, 691.01it/s]
validation, e=16: 100%|██████████| 1667/1667 [00:01<00:00, 1223.74it/s]

cost=6.562749, accuracy=0.970100



training,   e=17: 100%|██████████| 4167/4167 [00:08<00:00, 484.01it/s]
validation, e=17: 100%|██████████| 1667/1667 [00:01<00:00, 1051.91it/s]

cost=6.147625, accuracy=0.970500



training,   e=18: 100%|██████████| 4167/4167 [00:07<00:00, 565.43it/s]
validation, e=18: 100%|██████████| 1667/1667 [00:01<00:00, 1202.63it/s]

cost=5.772353, accuracy=0.971050



training,   e=19: 100%|██████████| 4167/4167 [00:09<00:00, 454.73it/s]
validation, e=19: 100%|██████████| 1667/1667 [00:02<00:00, 603.25it/s]

cost=5.432285, accuracy=0.971600



training,   e=20: 100%|██████████| 4167/4167 [00:09<00:00, 429.05it/s]
validation, e=20: 100%|██████████| 1667/1667 [00:02<00:00, 701.30it/s]

cost=5.123209, accuracy=0.971750



training,   e=21: 100%|██████████| 4167/4167 [00:07<00:00, 578.46it/s]
validation, e=21: 100%|██████████| 1667/1667 [00:01<00:00, 1009.33it/s]

cost=4.840558, accuracy=0.972200



training,   e=22: 100%|██████████| 4167/4167 [00:06<00:00, 620.79it/s]
validation, e=22: 100%|██████████| 1667/1667 [00:01<00:00, 1121.05it/s]

cost=4.580065, accuracy=0.972500



training,   e=23: 100%|██████████| 4167/4167 [00:07<00:00, 553.75it/s]
validation, e=23: 100%|██████████| 1667/1667 [00:01<00:00, 1035.21it/s]

cost=4.339354, accuracy=0.972800



training,   e=24: 100%|██████████| 4167/4167 [00:09<00:00, 441.71it/s]
validation, e=24: 100%|██████████| 1667/1667 [00:01<00:00, 933.08it/s] 

cost=4.116921, accuracy=0.972950



training,   e=25: 100%|██████████| 4167/4167 [00:10<00:00, 395.88it/s]
validation, e=25: 100%|██████████| 1667/1667 [00:01<00:00, 870.71it/s] 

cost=3.911132, accuracy=0.973400



training,   e=26: 100%|██████████| 4167/4167 [00:08<00:00, 478.88it/s]
validation, e=26: 100%|██████████| 1667/1667 [00:01<00:00, 1020.80it/s]

cost=3.720788, accuracy=0.973450



training,   e=27: 100%|██████████| 4167/4167 [00:07<00:00, 557.76it/s]
validation, e=27: 100%|██████████| 1667/1667 [00:02<00:00, 719.37it/s]

cost=3.544855, accuracy=0.973600



training,   e=28: 100%|██████████| 4167/4167 [00:08<00:00, 502.88it/s]
validation, e=28: 100%|██████████| 1667/1667 [00:01<00:00, 1085.03it/s]

cost=3.381577, accuracy=0.973800



training,   e=29: 100%|██████████| 4167/4167 [00:06<00:00, 674.95it/s]
validation, e=29: 100%|██████████| 1667/1667 [00:01<00:00, 1199.40it/s]

cost=3.229010, accuracy=0.973850



training,   e=30: 100%|██████████| 4167/4167 [00:06<00:00, 657.08it/s]
validation, e=30: 100%|██████████| 1667/1667 [00:01<00:00, 1039.49it/s]

cost=3.086409, accuracy=0.974150



training,   e=31: 100%|██████████| 4167/4167 [00:06<00:00, 659.76it/s]
validation, e=31: 100%|██████████| 1667/1667 [00:02<00:00, 700.83it/s]

cost=2.952945, accuracy=0.974250



training,   e=32: 100%|██████████| 4167/4167 [00:06<00:00, 604.89it/s]
validation, e=32: 100%|██████████| 1667/1667 [00:01<00:00, 1080.89it/s]

cost=2.827343, accuracy=0.974400



training,   e=33: 100%|██████████| 4167/4167 [00:06<00:00, 617.76it/s]
validation, e=33: 100%|██████████| 1667/1667 [00:03<00:00, 481.42it/s]

cost=2.708508, accuracy=0.974550



training,   e=34: 100%|██████████| 4167/4167 [00:08<00:00, 464.68it/s]
validation, e=34: 100%|██████████| 1667/1667 [00:02<00:00, 679.84it/s]

cost=2.596043, accuracy=0.974600



training,   e=35: 100%|██████████| 4167/4167 [00:06<00:00, 656.33it/s]
validation, e=35: 100%|██████████| 1667/1667 [00:03<00:00, 533.16it/s]

cost=2.490319, accuracy=0.974550



training,   e=36: 100%|██████████| 4167/4167 [00:10<00:00, 381.03it/s]
validation, e=36: 100%|██████████| 1667/1667 [00:02<00:00, 636.69it/s]

cost=2.391937, accuracy=0.974750



training,   e=37: 100%|██████████| 4167/4167 [00:09<00:00, 430.55it/s]
validation, e=37: 100%|██████████| 1667/1667 [00:01<00:00, 1233.64it/s]

cost=2.300936, accuracy=0.975000



training,   e=38: 100%|██████████| 4167/4167 [00:07<00:00, 524.85it/s]
validation, e=38: 100%|██████████| 1667/1667 [00:01<00:00, 1209.00it/s]

cost=2.216490, accuracy=0.975250



training,   e=39: 100%|██████████| 4167/4167 [00:06<00:00, 643.09it/s]
validation, e=39: 100%|██████████| 1667/1667 [00:01<00:00, 1112.60it/s]

cost=2.137583, accuracy=0.975400





### Feedback

Aaaaaand we're done 👏🏼🍻

If you have any suggestions on how we could improve this session, please let us know in the following cell. What did you particularly like or dislike? Did you miss any contents?

### Additional Tasks

**Task 6a:** Implement the `forward` function for the `SoftMax` layer.

**Task 6b:** Implement the `estimate` function for the `CrossEntropy` cost.

**Task 6c:** Implement the `gradient` function for the `CrossEntropy` cost.

**Task 6d:** Implement the `backward` function for the `SoftMax` layer.

**Task 6e:** Test your implementation by setting up a network using the soft max layer and cross entropy cost in combination.
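
As a starting point for these tasks, the following sketch shows one common way to compute a numerically stable softmax and a cross entropy on batched inputs. It uses plain functions rather than the `Layer` and `Cost` classes, and the function names, the `eps` default and the example values are assumptions for illustration, not the reference solution.

```python
import numpy as np


def softmax(x):
    # Subtracting the row-wise maximum avoids overflow in np.exp
    # and does not change the result.
    shifted = x - np.max(x, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)


def cross_entropy(target, probabilities, eps=1e-12):
    # eps guards against log(0); one cost value per sample.
    return -np.sum(target * np.log(probabilities + eps), axis=1)


# Two example rows; the second would overflow without the shift.
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
target = np.array([[0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])

probabilities = softmax(logits)
cost = cross_entropy(target, probabilities)
print(probabilities.sum(axis=1))
```

Each row of `probabilities` sums to one, which is a simple sanity check for your own `SoftMax` layer.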

In [None]:
class SoftMax(Layer):
    def __init__(self):
        self.x_out = None

    def forward(self, x_in):
        # ----- add code for task 6a between comments -----
        # -------------------------------------------------
        return self.x_out

    def backward(self, d_out):
        # ----- add code for task 6d between comments -----
        # -------------------------------------------------
        return d_in


class CrossEntropy(Cost):
    def __init__(self):
        self.x_in = None
        self.target = None
        self.eps = 1e-12

    def estimate(self, target, x_in):
        # ----- add code for task 6b between comments -----
        # -------------------------------------------------
        return cost

    def gradient(self, d_out):
        # ----- add code for task 6c between comments -----
        # -------------------------------------------------
        return gradient


In [None]:
# define hyperparameters
input_size = 28**2
label_size = 10
batch_size = 600
learning_rate = 0.00001
number_epochs = 100

# ----- add code for task 6e between comments -----
# -------------------------------------------------