# Neural Networks and Stochastic Gradient Descent

**Copyright (c) Meta Platforms, Inc. and affiliates.**

This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.

In this notebook we will learn how to build a neural network using backpropagation and DiffKt. Neural networks are not exactly simple, but they are composed of simple mathematical techniques working in orchestration. However, the calculus behind neural networks can be tedious, as derivatives for each layer need to be calculated for gradient descent. Because weights and biases are applied in nested functions from each layer, it's mathematically like pulling apart an onion layer by layer. Thankfully, DiffKt can take care of calculating the gradients for each layer's weights and biases, sparing us the messiness of solving derivatives by hand.

To get started, first bring in DiffKt to use in this notebook. Then we will talk about the structure of a neural network.

In [1]:
@file:DependsOn("../kotlin/api/build/libs/api.jar")

## Importing the Data

Let's present a problem adapted from Chapter 7 in the book [*Essential Math for Data Science (O'Reilly)*](https://learning.oreilly.com/library/view/essential-math-for/9781098102920/). We want to train a neural network to predict a light/dark font for a given background color. For example, a <span style="background-color:DarkBlue; color:white">Dark Blue</span> background would warrant a light font and a <span style="background-color:pink; color:black">Pink</span> background would warrant a dark font. We could solve this with a logistic regression or even a [known heuristic](https://stackoverflow.com/questions/1855884/determine-font-color-based-on-background-color), but this will be a nice toy example for discovering the workings of a neural network and applying DiffKt.

Let's first explore our data stored [here](https://tinyurl.com/y2qmhfsr). It contains 3 input variable columns (red, green, and blue) and the output light/dark font indicator which is a boolean we want to predict. We have 1345 records in this training data. Here is a sample: 

| RED | GREEN | BLUE | LIGHT_OR_DARK_FONT_IND |
|-----|-------|------|------------------------|
| 0   | 0     | 0    | 0                      |
| 0   | 0     | 128  | 0                      |
| 0   | 139   | 69   | 0                      |
| 0   | 154   | 205  | 0                      |
| 0   | 178   | 238  | 1                      |
| 0   | 197   | 205  | 1                      |
| 0   | 199   | 140  | 1                      |
| 0   | 201   | 87   | 1                      |
| 0   | 205   | 0    | 0                      |
| 0   | 205   | 102  | 1                      |

To bring in this data, let's use the Java `URL` interface to read the CSV from GitHub. We will use some [regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) to split the lines, and [Sequence](https://kotlinlang.org/docs/sequences.html) operations to clean up the lines like an assembly line. We will flatten the whole CSV into a 1-dimensional array of floats, and then populate that into a reshaped tensor with 1345 rows and 4 columns. 

In [2]:
import java.net.URL
import org.diffkt.*

// Import CSV data with R,G,B input values and a light/dark indicator  (0,1)
val allDataTensor = URL("https://tinyurl.com/y2qmhfsr")
    .readText().split(Regex("\\r?\\n"))
    .asSequence()
    .drop(1)
    .filter { it.isNotBlank() }
    .flatMap { s ->
        s.split(",").map { it.toFloat() }
    }.toList()
    .toFloatArray()
    .let { values ->
        val n = values.count() / 4
        tensorOf(*values).reshape(n,4)
    }
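
To see what the parsing pipeline does without any network access, here is a minimal plain-Kotlin sketch that runs the same split, drop, and filter steps on a small inline sample. It does not use DiffKt, the sample text is only assumed to resemble the real CSV (its header row is modeled on the sample table above), and the name `parseCsvRows` is hypothetical.

```kotlin
// Plain-Kotlin sketch (no DiffKt, no network): parse CSV text into rows of floats.
// The sample text below is an assumption about the file layout, not the real file.
val sampleCsv = """
RED,GREEN,BLUE,LIGHT_OR_DARK_FONT_IND
0,0,0,0
0,0,128,0
0,178,238,1
""".trimIndent()

// Hypothetical helper mirroring the pipeline above, but keeping one FloatArray per row.
fun parseCsvRows(text: String): List<FloatArray> =
    text.split(Regex("\\r?\\n"))
        .asSequence()
        .drop(1)                      // skip the header row
        .filter { it.isNotBlank() }   // ignore blank lines
        .map { line -> line.split(",").map { it.trim().toFloat() }.toFloatArray() }
        .toList()

val sampleRows = parseCsvRows(sampleCsv)
```

Keeping one array per row makes it easy to inspect individual records, whereas the notebook flattens everything first because a single flat array is what gets reshaped into the tensor.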


We need to separate that tensor of all the data into two tensors: the input data tensor (the first 3 columns) and the output data tensor (the last column). Since each R, G, and B value is a number between $ 0 $ and $ 255 $, we will divide by $ 255 $ to rescale each value to be between $ 0 $ and $ 1 $. This will assist the gradient descent algorithm by compressing the vector space into a smaller area that can be iterated through more quickly. 

To separate the input and output tensors, we will use DiffKt's `view()` function. This allows us to "slice" the tensor to only certain parts of the tensor, like the first 3 columns for the input tensor and the 4th column for the output tensor. In the context of a 2-dimensional tensor, the first argument provides the indices of which rows/columns you want. The second argument is the axis. An axis of `0` would specify you are selecting rows. An axis of `1` specifies columns. In this case we are interested in selecting certain columns for the input and output tensor. 

In [3]:
// Extract 3 input columns, scale down by 255
val inputTensor = allDataTensor.view(0..2, 1)  / 255f

// Extract 1 output column
val outputTensor = allDataTensor.view(3, 1)
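
The column split and rescaling can also be pictured with ordinary arrays. The following is a plain-Kotlin sketch only, not DiffKt's `view()`; the two rows are copied from the sample table, and the helper names `inputsOf` and `labelsOf` are hypothetical.

```kotlin
// Plain-Kotlin sketch of the column split and rescaling (this is not DiffKt's view()).
val tableRows: List<FloatArray> = listOf(
    floatArrayOf(0f, 0f, 128f, 0f),
    floatArrayOf(0f, 178f, 238f, 1f)
)

// Columns 0..2 hold R, G, B; dividing by 255 maps each value into the range 0 to 1.
fun inputsOf(rows: List<FloatArray>): List<FloatArray> =
    rows.map { row -> FloatArray(3) { i -> row[i] / 255f } }

// Column 3 holds the light/dark font indicator.
fun labelsOf(rows: List<FloatArray>): FloatArray =
    FloatArray(rows.size) { i -> rows[i][3] }

val scaledInputs = inputsOf(tableRows)
val fontLabels = labelsOf(tableRows)
```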

## The Anatomy of a Neural Network 

Now let's get to building the neural network. This visual below does not show the activation functions, a critical component to make a neural network work. We will get to that. Let's look at the nodes first. 

<img src="./resources/sGQdjdjUMw.png" style="width: 600px;"/>

The first layer is simply an input of the three variables (R, G, and B values for a given color). In the hidden layer (which resides in the middle), notice that we have three **nodes**, or functions of weights and biases, between the inputs and outputs. There is a weight $ w_i $ between each input node and hidden node, and another set of weights between each hidden node and output node. Each hidden and output node gets an additional bias $ b_i $ added.

The output node repeats the same operation, taking the resulting weighted and summed outputs from the hidden layer and making them inputs into the output layer, where another set of weights and biases will be applied. I put "repeat weighting and summing" instead of the mathematical expressions because the expressions propagating from the hidden layer are too long to display in the graphic. But here is the expression for the final node.

$ \text{output} = w_{10}(x_1w_1 + x_2w_2 + x_3w_3 + b_1) $ 

$ + w_{11}(x_1w_4 + x_2w_5 + x_3w_6 + b_2) $ 

$ + w_{12}(x_1w_7 + x_2w_8 + x_3w_9 + b_3) + b_4 $ 

We need to solve for each of these weight and bias values, and this is what we call **training** the neural network. But before we get to that later in this notebook, there is one more critical component we need to add. The **activation function** is a nonlinear function that transforms the weighted and summed values in a node, helping separate the weighted data so it can be classified. For this neural network, we will use the _ReLU_ function for the hidden layer and the sigmoid function for the output layer. 

<img src="./resources/PvLebFIsiT.png" style="width: 600px;"/>

The **ReLU "rectified linear unit" function** will take a given numeric input and turn it to $ 0 $ if negative. ReLU is commonly used for hidden layers in neural networks because of its speed. It also mitigates the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) where partial derivative gradients get so small they prematurely approach $ 0 $ and bring training to a halt. DiffKt comes packaged with a `relu()` function that is compatible with its tensors and scalar types. It can be built from scratch simply using a maximum function between 0 and the input value. 

$$
ReLU(x) = max(0, x)
$$

<img src="./resources/tKkerIrVkt.png" style="width: 600px;"/>
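
As a minimal sketch of the "built from scratch" version, here is ReLU in plain Kotlin on ordinary doubles. It is independent of DiffKt's `relu()`, and the names `reluScalar` and `reluVector` are hypothetical.

```kotlin
import kotlin.math.max

// ReLU for a single value: negative inputs become 0, everything else passes through.
fun reluScalar(x: Double): Double = max(0.0, x)

// Element-wise ReLU over a vector of node outputs.
fun reluVector(xs: DoubleArray): DoubleArray =
    DoubleArray(xs.size) { i -> reluScalar(xs[i]) }

val reluExample = reluVector(doubleArrayOf(-2.0, 0.0, 1.5))
```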


The output layer consolidates all the inputs from the hidden layer and turns them into an interpretable result. Because our output is binary (light/dark font), we only have one output node. We use the **sigmoid function** to compress values between 0 and 1 using a logistic curve. This can be interpreted as a probability between 0 and 1, where closer to 0 indicates a light font recommendation and closer to 1 recommends a dark font. This matches the training data, in which a black background $ (0,0,0) $ is labeled $ 0 $. We can use $ 0.5 $ as our threshold, so anything less than $ 0.5 $ is considered a light font recommendation, and anything equal or higher is a dark font. DiffKt also comes with a `sigmoid()` function, which is mathematically defined below:

$$
sigmoid(x) = \frac{1}{1 + e^{-x}}
$$

<img src="./resources/DllsJpEMCJ.png" style="width: 600px;"/>
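
The sigmoid and the $ 0.5 $ threshold rule stated above (below $ 0.5 $ is a light font, otherwise a dark font) can be sketched in plain Kotlin as follows. This does not use DiffKt's `sigmoid()`, and the names `sigmoidScalar` and `recommendFont` are hypothetical.

```kotlin
import kotlin.math.exp

// Logistic curve: squeezes any real number into the open interval (0, 1).
fun sigmoidScalar(x: Double): Double = 1.0 / (1.0 + exp(-x))

// Threshold rule from the text: below 0.5 recommends a light font, otherwise a dark font.
fun recommendFont(probability: Double): String =
    if (probability < 0.5) "light" else "dark"

val fontForPositiveZ = recommendFont(sigmoidScalar(3.0))
val fontForNegativeZ = recommendFont(sigmoidScalar(-3.0))
```

An input of exactly $ 0 $ lands on the threshold, so under this rule it falls on the dark font side.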



### Layer Tensors and Forward Propagation

Now how do we implement all of these entities in Kotlin using DiffKt? We have to declare some tensors holding these weights and biases for each layer. With some matrix multiplication these enable forward propagation, where we take a given color, apply all the weights, biases, and summations from both layers, apply the activation functions, and see the results. These weights and biases have to start with random values, but we will use backpropagation, gradient descent, and DiffKt together to optimize them later in this notebook.

First let's declare 4 tensors: the hidden layer weights, the hidden layer biases, the output weights, and output biases.  

In [4]:
import kotlin.random.Random

var wHiddenTensor: DTensor = FloatTensor.random(Random, Shape(3,3))
var wOuterTensor: DTensor = FloatTensor.random(Random, Shape(1,3))

var bHiddenTensor: DTensor = FloatTensor.random(Random, Shape(3,1))
var bOuterTensor: DTensor = FloatTensor.random(Random, Shape(1,1))
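
The cell above uses the default `Random` generator, so the starting values change on every run. If you want repeatable starting values, Kotlin's standard library lets you construct a generator from a seed. The sketch below shows this with plain arrays only; `randomMatrix` is a hypothetical helper, and whether a seeded generator can be handed to `FloatTensor.random` in the same way is an assumption this sketch does not verify.

```kotlin
import kotlin.random.Random

// Hypothetical helper: a rows x cols matrix of uniform values in [0, 1) from the given generator.
fun randomMatrix(rows: Int, cols: Int, rng: Random): Array<DoubleArray> =
    Array(rows) { DoubleArray(cols) { rng.nextDouble() } }

// Two generators built from the same seed produce the same starting weights.
val firstInit = randomMatrix(3, 3, Random(42))
val secondInit = randomMatrix(3, 3, Random(42))
val sameStart = firstInit.indices.all { r -> firstInit[r].contentEquals(secondInit[r]) }
```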

Let's visualize these tensors: the hidden and output layer weights and biases. Again note that these weights and biases are randomly initialized. Compare them carefully to our declared `FloatTensor` code above. I did not use any random seeding, so your values will differ, but take note of the dimensions and structure of each.

$$
W_{hidden} =\begin{bmatrix}
0.034535 &  0.5185636 & 0.81485028 \\
0.3329199 & 0.53873853 & 0.96359003 \\
0.19808306 & 0.45422182 & 0.36618893
\end{bmatrix}
$$

$$
B_{hidden} =\begin{bmatrix}
0.41379442 \\
0.81666079 \\
0.07511252
\end{bmatrix}
$$

$$
W_{output} =\begin{bmatrix}
0.82652072 & 0.30781539 & 0.93095565
\end{bmatrix}
$$

$$
B_{output} =\begin{bmatrix}
0.58018555
\end{bmatrix}
$$


Let's take a given color input as a vector. Here is a light orange color coded as $ (238,207,161) $. 

![](./resources/ovraZLIZrJ.png)

Let's take this color and package it into a vector $ X $ containing the three R, G, and B values. 

$ X = \left[\begin{matrix}238\\207\\161\end{matrix}\right] $ 

As stated earlier, we need to take the color inputs and linearly scale them down by a factor of $ \frac{1}{255} $. This will compress the vector space and make training much easier, as there will be far less traveling in steps and iterations.

$ X =  \frac{1}{255} \times \left[\begin{matrix}238\\207\\161\end{matrix}\right] $ 

$ X = \left[\begin{matrix}0.9333\\0.8117\\0.6313\end{matrix}\right] $ 

We can then apply these weights to our input through matrix vector multiplication. 

### Applying First Layer 

We apply the hidden layer weights $ W_{hidden} $ and then the biases $ B_{hidden} $ to the $ X $ input vector. Let's call the output of this layer $ Z_1 $. While DiffKt will perform this operation for you, you need to visualize the matrix multiplication to ensure the dimensions line up. The number of columns in the matrix must match the number of rows in the vector (or matrix) it is multiplied with. Multiply each element in a row of $ W_{hidden} $ with the respective element of $ X $ and sum the products for that row. Then add the respective bias value from $ B_{hidden} $.

$ Z_1 = W_{hidden} X + B_{hidden} $

$ Z_1 = \left[\begin{matrix}0.034535 & 0.5185636 & 0.81485028\\0.3329199 & 0.53873853 & 0.96359003\\0.19808306 & 0.45422182 & 0.36618893\end{matrix}\right] \left[\begin{matrix}0.9333\\0.8117\\0.6313\end{matrix}\right] + \left[\begin{matrix}0.41379442\\0.81666079\\0.07511252\end{matrix}\right] $

$ Z_1 = \left[\begin{matrix}0.967564571384\\1.35632259341\\0.784737842701\end{matrix}\right] + \left[\begin{matrix}0.41379442\\0.81666079\\0.07511252\end{matrix}\right] $ 

$ Z_1 = \left[\begin{matrix}1.381358991384\\2.17298338341\\0.859850362701\end{matrix}\right] $ 

> If you are unfamiliar or uncomfortable with matrix multiplication, [Grant Sanderson explains it intuitively on 3Blue1Brown](https://www.3blue1brown.com/topics/linear-algebra).

Now let's take that output of $ Z_1 $ and pass it through the *ReLU* activation function. Since all three values are positive, they will be unaffected by the activation. We will call this output $ A_1 $. 

$ A_1 = \text{ReLU}(Z_1) $

$ A_1 = \text{ReLU}(\left[\begin{matrix}1.381358991384\\2.17298338341\\0.859850362701\end{matrix}\right]) $ 

$ A_1 = \left[\begin{matrix}1.381358991384\\2.17298338341\\0.859850362701\end{matrix}\right] $ 

This is what the operations of the first layer would look like in DiffKt so far. 


In [5]:
val X = tensorOf(238F, 207F, 161F).transpose() / 255F

val Z1 = wHiddenTensor.matmul(X) + bHiddenTensor
val A1 = relu(Z1)
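
To check the hand calculation for the first layer, here is a plain-Kotlin sketch that reproduces $ Z_1 $ and $ A_1 $ from the example weights, biases, and scaled input shown above. It uses ordinary arrays rather than DiffKt tensors, and `matVec` is a hypothetical helper.

```kotlin
import kotlin.math.max

// Matrix-vector product: each output element is the dot product of one matrix row with the vector.
fun matVec(m: Array<DoubleArray>, v: DoubleArray): DoubleArray {
    require(m.all { it.size == v.size }) { "each row length must equal the vector length" }
    return DoubleArray(m.size) { r -> m[r].indices.sumOf { c -> m[r][c] * v[c] } }
}

// Example values copied from the matrices above.
val wHiddenExample = arrayOf(
    doubleArrayOf(0.034535, 0.5185636, 0.81485028),
    doubleArrayOf(0.3329199, 0.53873853, 0.96359003),
    doubleArrayOf(0.19808306, 0.45422182, 0.36618893)
)
val bHiddenExample = doubleArrayOf(0.41379442, 0.81666079, 0.07511252)
val xExample = doubleArrayOf(0.9333, 0.8117, 0.6313)

// Z1 = W_hidden X + B_hidden
val z1Example = matVec(wHiddenExample, xExample)
    .let { wx -> DoubleArray(wx.size) { i -> wx[i] + bHiddenExample[i] } }

// A1 = ReLU(Z1); all three values are positive here, so they pass through unchanged.
val a1Example = DoubleArray(z1Example.size) { i -> max(0.0, z1Example[i]) }
```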

### Applying the Second Layer

The output from the first layer $ A_1 $ becomes the input into the output layer operations next. We take that input and apply the weights $ W_{output} $ and biases $ B_{output} $. We will call this output $ Z_2 $. We again perform matrix-vector multiplication, which in this case is a single-row tensor against another single-column tensor. 

$ Z_2 = W_{output} A_1 + B_{output} $ 

$ Z_2 = \left[\begin{matrix}0.82652072 & 0.30781539 & 0.93095565\end{matrix}\right] \left[\begin{matrix}1.381358991384\\2.17298338341\\0.859850362701\end{matrix}\right] + \left[\begin{matrix}0.58018555\end{matrix}\right] $ 

$ Z_2 = \left[\begin{matrix}2.61108321729762\end{matrix}\right] + \left[\begin{matrix}0.58018555\end{matrix}\right] $ 

$ Z_2 = \left[\begin{matrix}3.19126876729762\end{matrix}\right] $ 

Finally, take that raw output $ Z_2 $ and pass it through the activation function of the output layer, which is the `sigmoid()` function. We will call this final activated output $ A_2 $. This will scale the resulting prediction to be between $ 0 $ and $ 1 $ resembling a probability. 

$ A_2 = sigmoid(Z_2) $ 

$ A_2 = sigmoid(\left[\begin{matrix}3.19126876729762\end{matrix}\right]) $ 

$ A_2 \approx \left[\begin{matrix} 0.9605 \end{matrix}\right] $ 

Below shows how DiffKt would take the `A1` output from the first layer and make it an input into the second layer, yielding a final output `A2`. 

In [6]:
val Z2 = wOuterTensor.matmul(A1) + bOuterTensor
val A2 = sigmoid(Z2)
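
The second layer reduces to one dot product, one bias, and one sigmoid, which makes it easy to verify with plain Kotlin. The sketch below uses ordinary doubles instead of DiffKt tensors and takes $ A_1 $, $ W_{output} $, and $ B_{output} $ from the worked example above; the variable names are hypothetical.

```kotlin
import kotlin.math.exp

// Example values copied from the worked example above.
val a1Values = doubleArrayOf(1.381358991384, 2.17298338341, 0.859850362701)
val wOutputValues = doubleArrayOf(0.82652072, 0.30781539, 0.93095565)
val bOutputValue = 0.58018555

// Z2 = W_output A1 + B_output: a single dot product plus one bias (about 3.19 here).
val z2Value = wOutputValues.indices.sumOf { i -> wOutputValues[i] * a1Values[i] } + bOutputValue

// A2 = sigmoid(Z2); for a Z2 of about 3.19 this is roughly 0.96.
val a2Value = 1.0 / (1.0 + exp(-z2Value))
```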

Putting all this together, we can consolidate this entire **forward propagation** operation into a single function. This will take an input color and apply each layer's weights, biases, and activation functions to produce a prediction. We will make defaults for the weight and bias tensors using the ones we initialized earlier, but will later override them one at a time when we call our differentiation functions to get the gradients.

In [7]:
// forward propagation neural network
// provide a single input sample for stochastic gradient descent
fun neuralNetwork(xSample: DTensor,
                  wHidden: DTensor = wHiddenTensor,
                  wOuter: DTensor = wOuterTensor,
                  bHidden: DTensor = bHiddenTensor,
                  bOuter: DTensor = bOuterTensor
): DTensor {
    val middleOutput = relu(wHidden.matmul(xSample.transpose()) + bHidden)
    val outerOutput = sigmoid(wOuter.matmul(middleOutput) + bOuter)
    return outerOutput
}

## Defining the Loss Function

Obviously our neural network is going to perform poorly as the weights and biases are randomly initialized and not optimized yet. The first step in optimizing them is to declare a `loss()` function that measures how far off our neural network outputs are from our training outputs. Let's use a simple squared loss function, where $ C $ is the sum of squares between the predicted outputs $ A_2 $ and the actual outputs from the data $ Y $. 

$ C = (Y - A_2)^2 $ 

We can package this whole expression as a function `loss()` in Kotlin using DiffKt below. It takes a given color sample separated into two tensors, `xSample` and `ySample`, representing the input and expected output respectively. We then take the difference between `ySample` and the prediction from `neuralNetwork()`. We will also default the weight and bias tensors for the two layers to the ones we initialized, but allow them to be overridden when we calculate the gradients for each later.

In [8]:
// Calculate loss using sum of squares
fun loss(xSample: DTensor,
         ySample: DTensor,
         wHidden: DTensor = wHiddenTensor,
         wOuter: DTensor = wOuterTensor,
         bHidden: DTensor = bHiddenTensor,
         bOuter: DTensor = bOuterTensor
): DScalar {
    return ((ySample - neuralNetwork(xSample, wHidden, wOuter, bHidden, bOuter)).pow(2)).sum()
}
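
Stripped of tensors, the loss is just a squared difference, summed when there is more than one value. Here is a plain-Kotlin sketch on ordinary doubles; the function names are hypothetical and it is not a substitute for the DiffKt `loss()` above, which has to stay differentiable.

```kotlin
// Squared error for one sample: C = (y - prediction)^2
fun squaredLoss(y: Double, prediction: Double): Double {
    val diff = y - prediction
    return diff * diff
}

// Sum of squared errors across several samples.
fun sumSquaredLoss(ys: DoubleArray, predictions: DoubleArray): Double {
    require(ys.size == predictions.size) { "ys and predictions must have the same length" }
    return ys.indices.sumOf { i -> squaredLoss(ys[i], predictions[i]) }
}

// A prediction near the label is penalized far less than one on the wrong side.
val lossWhenClose = squaredLoss(1.0, 0.9)
val lossWhenFar = squaredLoss(1.0, 0.1)
```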

The `loss()` function above returns a single value as a `DScalar` and this makes sense given the last operation is a `sum()`. Note that in practice, passing random batches of records (rather than a single record) is the preferred approach in training neural networks. We will just stick with one to keep the example simple. 

## Performing Stochastic Gradient Descent 

We now have all the pieces in place to perform stochastic gradient descent. Untangling the derivatives for each layer's weights and biases, even with the [power of the chain rule](https://www.youtube.com/watch?v=tIeHLnjs5U8), still requires a lot of symbolic "pencil and paper" calculus work. To appreciate the work DiffKt is going to do for us, let's get a glimpse of the gradients underlying the weights and biases of each layer. Here $ W_1 $ and $ B_1 $ stand for $ W_{hidden} $ and $ B_{hidden} $, and $ W_2 $ and $ B_2 $ stand for $ W_{output} $ and $ B_{output} $.

$ \frac{dC}{dW_2} = \frac{dZ_2}{dW_2}\frac{dA_2}{dZ_2}\frac{dC}{dA_2} = (A_{1})(\frac{e^{- Z_{2}}}{\left(1 + e^{- Z_{2}}\right)^{2}})(2 A_{2} - 2 y) $ 

$ \frac{dC}{dB_2} = \frac{dZ_2}{dB_2}\frac{dA_2}{dZ_2}\frac{dC}{dA_2} = (1)(\frac{e^{- Z_{2}}}{\left(1 + e^{- Z_{2}}\right)^{2}})(2 A_{2} - 2 y) $ 

$ \frac{dC}{dW_1} = \frac{dC}{dA_2} \frac{dA_2}{dZ_2} \frac{dZ_2}{dA_1} \frac{dA_1}{dZ_1} \frac{dZ_1}{dW_1} = (2A_2 - 2y)(\frac{e^{- Z_{2}}}{\left(1 + e^{- Z_{2}}\right)^{2}})(W_2)(Z_1 > 0)(X) $ 

$ \frac{dC}{dB_1} = \frac{dC}{dA_2} \frac{dA_2}{dZ_2} \frac{dZ_2}{dA_1} \frac{dA_1}{dZ_1} \frac{dZ_1}{dB_1} = (2A_2 - 2y)(\frac{e^{- Z_{2}}}{\left(1 + e^{- Z_{2}}\right)^{2}})(W_2)(Z_1 > 0)(1) $ 

Yikes! While this is doable with some practice, it is still a lot of work. What would it be like to do this for a neural network with more layers and more complicated activations? This is where differentiable programming with DiffKt becomes useful, automating this derivative and chain rule work behind the scenes for us while remaining numerically efficient in its computing.
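
To make the output-layer formulas concrete, here is a plain-Kotlin sketch that evaluates $ \frac{dC}{dW_2} $ and $ \frac{dC}{dB_2} $ from the chain-rule expressions and cross-checks the weight gradient against a central finite difference. This is a hand-rolled check and not how DiffKt computes derivatives; the function names, the example label $ y = 0 $, and the step size `h` are assumptions made for illustration.

```kotlin
import kotlin.math.abs
import kotlin.math.exp

fun sigmoidFn(x: Double): Double = 1.0 / (1.0 + exp(-x))

// Loss for one sample as a function of the output-layer parameters only (A1 is held fixed).
fun outputLayerLoss(w2: DoubleArray, b2: Double, a1: DoubleArray, y: Double): Double {
    val z2 = w2.indices.sumOf { i -> w2[i] * a1[i] } + b2
    val a2 = sigmoidFn(z2)
    return (y - a2) * (y - a2)
}

// Chain rule: dC/dW2 = A1 * sigmoid'(Z2) * (2*A2 - 2*y); dC/dB2 replaces the A1 factor with 1.
fun analyticGradient(w2: DoubleArray, b2: Double, a1: DoubleArray, y: Double): Pair<DoubleArray, Double> {
    val z2 = w2.indices.sumOf { i -> w2[i] * a1[i] } + b2
    val a2 = sigmoidFn(z2)
    val sigmoidSlope = exp(-z2) / ((1.0 + exp(-z2)) * (1.0 + exp(-z2)))
    val common = sigmoidSlope * (2.0 * a2 - 2.0 * y)
    return Pair(DoubleArray(w2.size) { i -> a1[i] * common }, common)
}

// Central finite difference for weight index i, used only as a numerical cross-check.
fun numericWeightGradient(
    w2: DoubleArray, b2: Double, a1: DoubleArray, y: Double, i: Int, h: Double = 1e-6
): Double {
    val plus = w2.copyOf().also { it[i] += h }
    val minus = w2.copyOf().also { it[i] -= h }
    return (outputLayerLoss(plus, b2, a1, y) - outputLayerLoss(minus, b2, a1, y)) / (2.0 * h)
}

// Values from the worked forward pass, with an assumed label of y = 0.
val gcA1 = doubleArrayOf(1.381358991384, 2.17298338341, 0.859850362701)
val gcW2 = doubleArrayOf(0.82652072, 0.30781539, 0.93095565)
val gcB2 = 0.58018555
val gcGradient = analyticGradient(gcW2, gcB2, gcA1, 0.0)

// Largest disagreement between the analytic and numeric weight gradients.
val gcMaxError = gcW2.indices.maxOf { i ->
    abs(gcGradient.first[i] - numericWeightGradient(gcW2, gcB2, gcA1, 0.0, i))
}
```

If the formulas are right, `gcMaxError` should be tiny compared with the gradient values themselves.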

Let's first declare `n` (the number of records), the learning rate `lr`, for which `.001` should be sufficient, and the number of iterations, $ 100,000 $.

In [9]:
// number of elements
val n = allDataTensor.shape.first

// The learning rate
val lr = .001F

// The number of iterations to perform gradient descent
val iterations = 100_000

Calculating the gradients across the _entire_ dataset on every iteration is expensive, and on larger datasets it becomes impractical in terms of computing speed and cost. We will instead use **stochastic gradient descent**, which calculates the gradients for only one input sample per iteration. This means the loss landscape will keep changing on every iteration, but we should get a decent result with enough iterations and a small enough learning rate.

Across each iteration, we will select a random row and extract it from our `inputTensor` and `outputTensor`. We will then use the `reverseDerivative()` function to calculate the gradient for each weight and bias tensor in the hidden and output layers. Each call receives the targeted parameter tensor along with a lambda that evaluates the loss function, and returns the gradients for that tensor. We prefer `reverseDerivative()` over `forwardDerivative()` in this case because reverse derivatives are more efficient when many inputs produce few outputs, as in a neural network where many weights feed a single loss value. A reverse derivative performs the chain rule in reverse, starting from the output and working backward toward the inputs.

After that we will subtract from each parameter tensor its gradient multiplied by the learning rate. To evaluate accuracy at the end of the optimization, we will use DiffKt's "greater than" `gt()` operation to see if predictions are greater than $ .5 $, which will produce 1.0 if true and 0.0 if false. We can then count how many are equal to the actual output values and score them as a percentage of accuracy.
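
The update rule is easier to see on a model small enough to differentiate by hand. The sketch below runs stochastic gradient descent on a single sigmoid neuron in plain Kotlin, without DiffKt. The four-point dataset, the learning rate, the seed, and all names are made up for illustration, and the gradient is written out manually rather than obtained from `reverseDerivative()`.

```kotlin
import kotlin.math.exp
import kotlin.random.Random

// Tiny hypothetical dataset: one input feature and a binary label.
val sgdXs = doubleArrayOf(0.1, 0.2, 0.8, 0.9)
val sgdYs = doubleArrayOf(0.0, 0.0, 1.0, 1.0)

fun sgdPredict(w: Double, b: Double, x: Double): Double = 1.0 / (1.0 + exp(-(w * x + b)))

// Sum of squared errors over the whole dataset, used only to report progress.
fun sgdTotalLoss(w: Double, b: Double): Double =
    sgdXs.indices.sumOf { i ->
        val diff = sgdYs[i] - sgdPredict(w, b, sgdXs[i])
        diff * diff
    }

// Stochastic gradient descent: each step looks at one randomly chosen sample.
fun trainSingleNeuron(iterations: Int, lr: Double, rng: Random): Pair<Double, Double> {
    var w = 0.0
    var b = 0.0
    repeat(iterations) {
        val i = rng.nextInt(sgdXs.size)                          // pick one random sample
        val a = sgdPredict(w, b, sgdXs[i])
        val dLossDz = (2.0 * a - 2.0 * sgdYs[i]) * a * (1.0 - a)  // chain rule through the sigmoid
        w -= lr * dLossDz * sgdXs[i]                              // parameter -= learning rate * gradient
        b -= lr * dLossDz
    }
    return Pair(w, b)
}

val sgdLossBefore = sgdTotalLoss(0.0, 0.0)
val sgdFitted = trainSingleNeuron(iterations = 20_000, lr = 0.5, rng = Random(7))
val sgdLossAfter = sgdTotalLoss(sgdFitted.first, sgdFitted.second)
```

Here the sigmoid derivative is written as $ a(1 - a) $, which is algebraically the same as the $ \frac{e^{-z}}{(1 + e^{-z})^2} $ form used in the gradient formulas earlier.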

In [10]:
import kotlin.random.nextInt


// Perform stochastic gradient descent
for (i in 0 until iterations) {

    // sample a random row from input and output tensors
    val randomRow = Random.nextInt(0 until n)
    val xSample = inputTensor[randomRow]
    val ySample = outputTensor[randomRow]

    // get gradients
    val wHiddenGradients = reverseDerivative(wHiddenTensor) { t -> loss(xSample, ySample, wHidden = t) }
    val wOutputGradients = reverseDerivative(wOuterTensor) { t -> loss(xSample, ySample, wOuter = t) }
    val bHiddenGradients = reverseDerivative(bHiddenTensor) { t -> loss(xSample, ySample, bHidden = t) }
    val bOutputGradients = reverseDerivative(bOuterTensor) { t -> loss(xSample, ySample, bOuter = t) }

    // update weights and biases by subtracting their (learning rate) * (gradient)
    wHiddenTensor -= wHiddenGradients * lr
    wOuterTensor -= wOutputGradients * lr
    bHiddenTensor -= bHiddenGradients * lr
    bOuterTensor -= bOutputGradients * lr
}


val accuracy = (neuralNetwork(inputTensor).flatten().gt(FloatScalar(0.5F))
        .eq(outputTensor.flatten())).sum() / n.toFloat()

print(accuracy)

0.9836431

While your answer will vary on each run due to the randomized nature of stochastic gradient descent, you should fairly consistently get an accuracy greater than `.97`, indicating our neural network learned the underlying model from the training data pretty well. Granted, we did not separate the data into training and testing sets, but even if we did, we would expect this to still perform well. It is important to note that accuracy is not always a good measure, and we need to use confusion matrices to calculate [metrics like sensitivity and specificity](https://www.youtube.com/watch?v=vP06aMoz4v8).
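
As a minimal sketch of what a confusion matrix adds, the following plain-Kotlin code counts true and false positives and negatives and derives accuracy, sensitivity, and specificity from them. The probabilities and labels are hypothetical, label `1.0` is treated as the positive class, the `>=` threshold follows the rule described earlier in the text, and none of this uses DiffKt.

```kotlin
// Counts for a binary classifier; 1.0 is treated as the positive class.
data class ConfusionCounts(val tp: Int, val tn: Int, val fp: Int, val fn: Int) {
    val accuracy: Double get() = (tp + tn).toDouble() / (tp + tn + fp + fn)
    val sensitivity: Double get() = tp.toDouble() / (tp + fn)   // true positive rate
    val specificity: Double get() = tn.toDouble() / (tn + fp)   // true negative rate
}

// Compare thresholded probabilities against the actual labels.
fun confusionCounts(
    probabilities: DoubleArray, labels: DoubleArray, threshold: Double = 0.5
): ConfusionCounts {
    var tp = 0
    var tn = 0
    var fp = 0
    var fn = 0
    for (i in probabilities.indices) {
        val predictedPositive = probabilities[i] >= threshold
        val actualPositive = labels[i] == 1.0
        when {
            predictedPositive && actualPositive -> tp++
            !predictedPositive && !actualPositive -> tn++
            predictedPositive && !actualPositive -> fp++
            else -> fn++
        }
    }
    return ConfusionCounts(tp, tn, fp, fn)
}

// Hypothetical predictions: three positives and three negatives, one mistake in each group.
val exampleCounts = confusionCounts(
    probabilities = doubleArrayOf(0.9, 0.8, 0.4, 0.3, 0.2, 0.6),
    labels = doubleArrayOf(1.0, 1.0, 1.0, 0.0, 0.0, 0.0)
)
```

Sensitivity and specificity report the error rate for each class separately, which accuracy alone hides when one class is much more common than the other.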

Here is the code in its entirety wrapped in a `main()` function to run outside a notebook. 

```kotlin 

import java.net.URL
import kotlin.random.Random
import org.diffkt.*
import kotlin.random.nextInt

/**
 * Predict a light/dark font based on R,G,B color background
 */
fun main() {

    // Import CSV data with R,G,B input values and a light/dark indicator  (0,1)
    val allDataTensor = URL("https://tinyurl.com/y2qmhfsr")
        .readText().split(Regex("\\r?\\n"))
        .asSequence()
        .drop(1)
        .filter { it.isNotBlank() }
        .flatMap { s ->
            s.split(",").map { it.toFloat() }
        }.toList()
        .toFloatArray()
        .let { values ->
            val n = values.count() / 4
            tensorOf(*values).reshape(n,4)
        }

    // Extract 3 input columns, scale down by 255
    val inputTensor = allDataTensor.view(0..2, 1)  / 255f

    // Extract 1 output column
    val outputTensor = allDataTensor.view(3, 1)


    var wHiddenTensor: DTensor = FloatTensor.random(Random, Shape(3,3))
    var wOuterTensor: DTensor = FloatTensor.random(Random, Shape(1,3))

    var bHiddenTensor: DTensor = FloatTensor.random(Random, Shape(3,1))
    var bOuterTensor: DTensor = FloatTensor.random(Random, Shape(1,1))

    // forward propagation neural network
    // provide a single input sample for stochastic gradient descent
    fun neuralNetwork(xSample: DTensor,
                      wHidden: DTensor = wHiddenTensor,
                      wOuter: DTensor = wOuterTensor,
                      bHidden: DTensor = bHiddenTensor,
                      bOuter: DTensor = bOuterTensor
    ): DTensor {
        val middleOutput = relu(wHidden.matmul(xSample.transpose()) + bHidden)
        val outerOutput = sigmoid(wOuter.matmul(middleOutput) + bOuter)
        return outerOutput
    }

    // Calculate loss using sum of squares
    fun loss(xSample: DTensor,
             ySample: DTensor,
             wHidden: DTensor = wHiddenTensor,
             wOuter: DTensor = wOuterTensor,
             bHidden: DTensor = bHiddenTensor,
             bOuter: DTensor = bOuterTensor
    ): DScalar {
        return ((ySample - neuralNetwork(xSample, wHidden, wOuter, bHidden, bOuter)).pow(2)).sum()
    }
    // number of elements
    val n = allDataTensor.shape.first

    // The learning rate
    val lr = .001F

    // The number of iterations to perform gradient descent
    val iterations = 100_000

    // Perform stochastic gradient descent
    for (i in 0 until iterations) {

        // sample a random row from input and output tensors
        val randomRow = Random.nextInt(0 until n)
        val xSample = inputTensor[randomRow]
        val ySample = outputTensor[randomRow]

        // get gradients
        val wHiddenGradients = reverseDerivative(wHiddenTensor) { t -> loss(xSample, ySample, wHidden = t) }
        val wOutputGradients = reverseDerivative(wOuterTensor) { t -> loss(xSample, ySample, wOuter = t) }
        val bHiddenGradients = reverseDerivative(bHiddenTensor) { t -> loss(xSample, ySample, bHidden = t) }
        val bOutputGradients = reverseDerivative(bOuterTensor) { t -> loss(xSample, ySample, bOuter = t) }

        // update weights and biases by subtracting their (learning rate) * (gradient)
        wHiddenTensor -= wHiddenGradients * lr
        wOuterTensor -= wOutputGradients * lr
        bHiddenTensor -= bHiddenGradients * lr
        bOuterTensor -= bOutputGradients * lr
    }

    // calculate accuracy
    val accuracy = (neuralNetwork(inputTensor).flatten().gt(0.5F)
        .eq(outputTensor.flatten())).sum() / n.toFloat()

    println(accuracy)
}
```