# Gradient Descent

- Initialize $\mathbf{w}(0)$
- For $t=0,1,2, \cdots \quad$ [to termination]
$$
\mathbf{w}(t+1)=\mathbf{w}(t)-\eta \nabla E_{\text {in }}(\mathbf{w}(t))
$$
- Return final w

With **Stochastic** gradient descent, we pick one $(x_n, y_n)$ at a time, applying Gradient Descent to $e(h(x_n), y_n)$.

Rule of thumb: $\eta = .1$ works.

# How the Network Operates

$$
w_{i j}^{(l)} \quad \begin{cases}1 \leq l \leq L & \text { layers } \\ 0 \leq i \leq d^{(l-1)} & \text { inputs } \\ 1 \leq j \leq d^{(l)} & \text { outputs }\end{cases}
$$

$$
\theta(s)=\tanh (s)=\frac{e^s-e^{-s}}{e^s+e^{-s}}
$$

$$
x_j^{(l)}=\theta\left(s_j^{(l)}\right)=\theta\left(\sum_{i=0}^{d^{(l-1)}} w_{i j}^{(l)} x_i^{(l-1)}\right)
$$

$$
\text { Apply } \mathbf{x} \text { to } x_1^{(0)} \cdots x_{d^{(0)}}^{(0)} \rightarrow \rightarrow x_1^{(L)}=h(\mathbf{x})
$$



# Applying SGD

Error on example $\left(\mathbf{x}_n, y_n\right)$ is
$$
\mathrm{e}\left(h\left(\mathbf{x}_n\right), y_n\right)=\mathrm{e}(\mathrm{w})
$$

To implement SGD, we need the gradient: 

$$
\nabla \mathrm{e}(\mathbf{w}): \frac{\partial \mathrm{e}(\mathrm{w})}{\partial w_{i j}^{(l)}} \text { for all } i, j, l
$$

# Computing $\frac{\partial \mathrm{e}(\mathbf{w})}{\partial w_{i j}^{(l)}}$

A trick for efficient computation:
$$
\frac{\partial \mathrm{e}(\mathbf{w})}{\partial w_{i j}^{(l)}}=\frac{\partial \mathrm{e}(\mathbf{w})}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial w_{i j}^{(l)}}
$$

$$
\text { We have } \frac{\partial s_j^{(l)}}{\partial w_{i j}^{(l)}}=x_i^{(l-1)} \quad \text { We only need: } \frac{\partial \mathrm{e}(\mathrm{w})}{\partial s_j^{(l)}}=\delta_j^{(l)}
$$

# $\delta$ for the final layer

For the final layer $l=L$ and $j=1$
$$
\begin{aligned}
& \delta_1^{(L)}=\frac{\partial \mathrm{e}(\mathbf{w})}{\partial s_1^{(L)}} \\
& = \frac{\partial(\theta(s_1^{(L)}) - y_n)^2}{\partial s_1^{(L)}} \\
& = 2(\theta(s_1^{(L)}) - y_n)(\theta^{\prime}(s_1^{(L)})) \\
& = 2(x_1^{(L)} - y_n)(\theta^{\prime}(s_1^{(L)})) \\
& = 2(x_1^{(L)} - y_n)(1 - \theta^2(s_1^{(L)})) \\
& = 2(x_1^{(L)} - y_n)(1 - (x_1^{(L)})^2)
\end{aligned}
$$

[Waiting to verify this is correct]

# Back propagation of $\delta$

$$
\begin{aligned}
\delta_i^{(l-1)} & =\frac{\partial \mathrm{e}(\mathbf{w})}{\partial s_i^{(l-1)}} \\
& =\sum_{j=1}^{d^{(l)}} \frac{\partial \mathrm{e}(\mathbf{w})}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}} \times \frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}} \\
& =\sum_{j=1}^{d^{(l)}} \delta_j^{(l)} \times w_{i j}^{(l)} \times \theta^{\prime}\left(s_i^{(l-1)}\right) \\
\delta_i^{(l-1)} & =\left(1-\left(x_i^{(l-1)}\right)^2\right) \sum_{j=1}^{d^{(l)}} w_{i j}^{(l)} \delta_j^{(l)}
\end{aligned}
$$

![image.png](../images/network-architecture.png)

In [99]:
import numpy as np
import pprint

# Stochastic gradient descent with multilayer 2D perceptrons.  Takes an array of 2d points as inputs, a target function, and a value for eta, and runs sgd, ultimately returning the weights.
DEFAULT_ETA = .1
# QSTN: how many layers / neurons to implement?
    # https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/
    # " an MLP with two hidden layers is sufficient for creating classification regions of any desired shape. This is instructive, although it should be noted that no indication of how many nodes to use in each layer"

# 2D input data is assumed (again ignoring the fixed bias input)
INPUT_DIMENSION = 2

HIDDEN_LAYER_STRUCTURE = [3, 2] # Two layers, with 3 neurons and 2 neurons respectively.  Note that these counts do not include the bias neurons, which are indexed at 0.  See the structure in the image above.

ACTIVATION_FUNCTION = np.tanh

# A single final output is assumed.

#QSTN: initialize bias to 0?

#TODO: this would be better as a Class, due to var sharing
def sgd(inputs, target_function, eta = DEFAULT_ETA, hidden_layer_structure = HIDDEN_LAYER_STRUCTURE):
    total_layers = len(HIDDEN_LAYER_STRUCTURE) + 2 # Input layer + hidden layers + output layer
    # Initialize all neurons x with empty values
    x = np.empty(total_layers, dtype=object)
    x[0] = np.empty(INPUT_DIMENSION + 1) # Input neurons
    # Hidden layer neurons
    for l, neuron_count in enumerate(HIDDEN_LAYER_STRUCTURE):
        l = l + 1
        x[l] = np.empty(neuron_count + 1)
    # Output neuron
    x[-1] = np.empty(2) 
    x[-1][0] = None # The first entry will be ignored, this is to get 1-indexing.

    # Initialize all weights w_{ij}^(l) with Xavier weight initialization..
    weights = np.empty(total_layers, dtype=object)
    
    # Begin with hidden layers
    prev_neuron_count = INPUT_DIMENSION + 1
    for l, neuron_count in enumerate(HIDDEN_LAYER_STRUCTURE):
        l = l + 1
        weights[l] = np.empty(prev_neuron_count, dtype=object) 
        for i in range(prev_neuron_count):
            weights[l][i] = np.empty(neuron_count + 1)
            for j in range(1, neuron_count + 1):
                # QSTN: do biases count toward n?  I'm assuming yes..?
                weights[l][i][j] = xavier_weight(prev_neuron_count, neuron_count + 1)
        prev_neuron_count = neuron_count + 1

    # Set the weights on the final output layer
    neuron_count = 1
    weights[-1] = np.empty(prev_neuron_count, dtype=object)
    for i in range(prev_neuron_count):
        weights[-1][i] = np.empty(neuron_count + 1)
        # QSTN: same question about bias.. here there is no bias whatsoever so seems a done deal.
        weights[-1][i][1] = xavier_weight(prev_neuron_count, neuron_count) 

    epoch_count = 0 
    #TODO: ultimately, will want something like out of sample validation to end the loop
    while (epoch_count < 5):
        # Generate a random permutation for how we pick the inputs
        input_perm = np.random.permutation(len(inputs))
        for input_id in input_perm:
            input = inputs[input_id]

            # Forward: Compute all x_j^(l)
            # First we set layer 1 values
            x[0] = input

            # Now we forward propagate x
            for l in range(1, total_layers):
                for j in range(1, len(x[l])):
                    print(weights[l])
                    print(weights[l][:,j])
                    x[l][j] = ACTIVATION_FUNCTION(x[l - 1].dot(weights[l][:,j]))

            # Testing
            return x

            # Backward: Compute all d_j^(l)

            # Update the weights: w_{ij}^(l) = w_{ij}^(l) - eta * x_i^(l-1) * d_j^(l)
        # Report E_in and e_out for the epoch.  Report current weights too.
        # Break the loop if it's time to stop [ might be done through an e_out target]
        # QSTN: when is the right time to stop in SGD?  how do we identify that?
            # Note, overfitting is a concern here https://stats.stackexchange.com/questions/433187/stopping-criteria-for-stochastic-gradient-descent , validation is part of how it's addressed.  Not sure how much I want to postpone my v1 over this..

        #TODO: make sure to increment during epoch loop

    # Return the weights

# looking like https://machinelearningmastery.com/weight-initialization-for-deep-learning-neural-networks/, Xavier weight initialization is a good approach. 
# QSTN: do biases count toward n?  I'm assuming yes..?
def xavier_weight(prev_neuron_count, neuron_count):
    return np.random.uniform(-np.sqrt(6)/np.sqrt(prev_neuron_count + neuron_count),np.sqrt(6)/np.sqrt(prev_neuron_count + neuron_count))


# Our inputs
n = 1000
rng = np.random.default_rng()
inputs = rng.uniform(-1,1,(n, 2))
inputs = np.hstack((np.ones((n,1)), inputs)) # Set x_0 = 1 for all xs

#TODO: set this up to generate a circle
# Our target function
target_function = None

print(sgd(inputs, target_function))


[array([ 1.49166815e-154, -2.51216048e-001, -8.05322200e-001,
         5.05968650e-001])
 array([ 1.49166815e-154, -8.49652808e-002,  7.65426630e-002,
         5.16151804e-001])
 array([ 1.49166815e-154,  8.82352199e-001,  8.07817532e-001,
        -2.64982213e-001])                                   ]


IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed