In [21]:
from icecream import ic
from abc import ABC, abstractmethod
import numpy as np
from numpy import linalg as LA
import graphviz

# Modules
- To keep the project organized, the components that support the main code in this notebook live in separate modules.

### Atomic Operations and Expression Tree

- Each atomic operation is implemented as a separate class.  
- Mathematical functions can be built by chaining these atomic operations together.  
- These functions are internally represented using an *expression tree*.  
  - In the expression tree:  
    - Each atomic operation is a node, more specifically an instance of $\texttt{Expr\_node()}$.  
    - Each variable in the function is also represented as a node, more specifically an instance of $\texttt{Expr\_end\_node()}$. 
  - The atomic operation classes all inherit from the same abstract base classes, which define their structure:
    - All operations have a `forward` and a `backward` function, where the `backward` function is the derivative of the `forward` function.
    - These classes are callable. Calling an operation object on another value automatically builds the *expression tree*.  
  - To calculate the `forward pass`, the expression tree is traversed using the `forward` function.  
  - The `backward pass` requires both the `forward` and `backward` functions to compute derivatives, applying the **chain rule** during traversal.
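
The module that defines these classes is not reproduced in this notebook. The following is a minimal sketch of the structure described above, not the actual implementation: every name in it (`SketchOp`, `SketchNode`, `SketchLeaf`, `SketchSin`, `SketchMul`, `evaluate`) is hypothetical, and the real classes may differ in detail.

```python
import math
from abc import ABC, abstractmethod


class SketchOp(ABC):
    """Hypothetical base class: each operation supplies a value and its derivative."""

    @abstractmethod
    def forward(self, *args):
        """Evaluate the operation at the given argument values."""

    @abstractmethod
    def backward(self, *args):
        """Return one partial derivative per argument, as a tuple."""

    def __call__(self, *children):
        # Calling the operation computes nothing; it only records a tree node.
        return SketchNode(self, children)


class SketchLeaf:
    """Hypothetical end node: a variable with a value and an accumulated gradient."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


class SketchNode:
    """Hypothetical inner node: an operation together with its child nodes."""

    def __init__(self, op, children):
        self.op = op
        self.children = children


class SketchSin(SketchOp):
    def forward(self, a):
        return math.sin(a)

    def backward(self, a):
        return (math.cos(a),)


class SketchMul(SketchOp):
    def forward(self, a, b):
        return a * b

    def backward(self, a, b):
        return (b, a)


def evaluate(node):
    """Recursive forward pass over the sketch tree."""
    if isinstance(node, SketchLeaf):
        return node.value
    return node.op.forward(*(evaluate(child) for child in node.children))


leaf = SketchLeaf(0.5)
tree = SketchMul()(leaf, SketchSin()(leaf))  # represents x * sin(x)
value = evaluate(tree)
```

Building `tree` only nests node objects; the arithmetic happens when `evaluate` walks the tree, which mirrors the separation between tree construction and the forward pass described in the list.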
  
### Visualize the Expression Tree
- $\texttt{print\_graph}$: this module provides everything needed to visualize the expression tree using the **Graphviz** package.

In [22]:
from forward_backward_functions_and_nodes import * 
from print_graph import print_graph

# Details on our Forward and Backward Propagation Algorithms
Both the forward and the backward pass are recursive functions.
The forward pass computes the value of the specified node, i.e. it evaluates the function at the values stored in the end nodes. If the node is not an end node, the function recurses into the node's children until it reaches end nodes. \
The backward pass computes the outer derivative at the current node by evaluating the node's backward function at the forward-pass values of all its child nodes, and then calls itself recursively on each child. In each recursion step, the incoming derivative is multiplied by the local derivative. The recursion stops at an end node, where the accumulated product is added to that node's `grad_value`.
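
As a worked illustration of this recursion, take a composition $y = f(g(x))$ with a single scalar end node $x$ (a generic example, not one of the notebook's functions). Starting from `backward(f)` with the default value $1$, the calls unfold as
$$
\texttt{backward}(f,\ 1) \;\longrightarrow\; \texttt{backward}\big(g,\ 1 \cdot f'(g(x))\big) \;\longrightarrow\; \texttt{backward}\big(x,\ f'(g(x))\, g'(x)\big),
$$
and the last call adds $f'(g(x))\,g'(x)$ to `x.grad_value`, which is the chain rule. If the same end node is reached along several paths through the tree, each path adds its own product, so the stored gradient is the sum over all paths.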


In [23]:
def forward(node):
    return node.forward_func(*(forward(child) for child in node.childs)) if type(node) is not Expr_end_node else node.value 
    
# def backward(node, value = np.float64(1)):
#     if type(node) is not Expr_end_node:
#         child_values = [forward(child) for child in node.childs] # computes the argument of the outer derivative. In other words: it computes g of f'(g)
#         if len(node.childs) == 1:
#             new_value = node.backward_func(*child_values) # computes the outer derivative f'(g)
#             if value.ndim == 0 or new_value.ndim == 0:
#                 backward(node.childs[0], value * new_value)
#             else: 
#                 backward(node.childs[0], new_value.T @ value) # @ is matrix product
#         else:
#             for child, new_value in zip(node.childs, node.backward_func(*child_values), strict=True):
#                 if value.ndim == 0 or new_value.ndim == 0:
#                     backward(child, value * new_value)
#                 else: 
#                     backward(child, new_value.T @ value)                 
#     else:
#         node.grad_value += value


def backward(node, value = np.float64(1)):
    if type(node) is not Expr_end_node:
        child_values = [forward(child) for child in node.childs]
        if len(node.childs) == 1:
            # product of inner derivatives
            # value = derivative accumulated from the parent nodes (each evaluated at the forward values of its children)
            # new_value = derivative of the current node, evaluated at the forward values of its children
            new_value = node.backward_func(*child_values)
            if value.ndim == 0 or new_value.ndim == 0:
                backward(node.childs[0], value * new_value)
            else: 
                backward(node.childs[0], value @ new_value) # @ is matrix product
            
        else:
            for child, new_value in zip(node.childs, node.backward_func(*child_values), strict=True):
                if value.ndim == 0 or new_value.ndim == 0:
                    backward(child, value * new_value)
                else: 
                    backward(child, value @ new_value ) # @ is matrix product                  
    else: 
        node.grad_value += value

# Example functions

The following cells are structured as follows: 
- Each cell contains one example function. \
For each function, we define the operators, the parameters, and the function itself. The operators are instances of their respective operator classes. \
Example: 
| mathematical operator | operator class  |
| --- | --- |
| + | Add() |
| $\cdot$ | Multiply()|
| sin() | Sin() | 

- The function is then depicted as a node tree via the $\texttt{print\_graph}$ function.
- Finally, we perform the forward and backward propagation and compare the results with the values of the analytically derived function and its derivative(s).
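
When no analytical derivative is at hand, a numerical check can serve the same purpose. The sketch below is an illustrative addition that does not use the notebook's modules; the helper `central_difference`, the step size `h`, and the test point are assumptions chosen for the example.

```python
import math


def central_difference(f, args, index, h=1e-6):
    """Approximate the partial derivative of f with respect to args[index]."""
    up = list(args)
    down = list(args)
    up[index] += h
    down[index] -= h
    return (f(*up) - f(*down)) / (2 * h)


def sample_function(p1, p2):
    # Same formula as Function 1: log(p1 * p2) * sin(p2)
    return math.log(p1 * p2) * math.sin(p2)


p1, p2 = 0.7, 0.4  # arbitrary positive test point (assumption)
numeric_dp1 = central_difference(sample_function, (p1, p2), 0)
analytic_dp1 = math.sin(p2) / p1
numeric_dp2 = central_difference(sample_function, (p1, p2), 1)
analytic_dp2 = math.sin(p2) / p2 + math.log(p1 * p2) * math.cos(p2)
```

The numerical and analytical values should agree up to the truncation and rounding error of the difference quotient, so a comparison with a small tolerance is more appropriate than an exact equality test.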

## Parameters and mathematical operators

In [24]:
x1 = Expr_end_node(np.random.rand(1))
x2 = Expr_end_node(np.random.rand(1))
x = Expr_end_node(np.random.rand(1))
w = Expr_end_node(np.random.rand(3,3))
xv = Expr_end_node(np.random.rand(3))
b = Expr_end_node(np.random.rand(3))

add = Add()
add_2 = Add_scalar(2)
multiply = Multiply()
multiply_3 = Multiply_scalar(3)
multiply_4 = Multiply_scalar(4)
sin = Sin()
log = Log()
tanh = Tanh()
#vecadd = Vector_vector_sum() 
#matmul = Matrix_vector_product()

## Function 1: $f(x_1,x_2) = \log(x_1 \cdot x_2) \cdot \sin(x_2) $
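For reference, the analytical derivatives used for this function follow from the product rule and the chain rule:
$$
\frac{\partial f}{\partial x_1} = \sin(x_2) \cdot \frac{1}{x_1 x_2} \cdot x_2 = \frac{\sin(x_2)}{x_1},
\qquad
\frac{\partial f}{\partial x_2} = \sin(x_2) \cdot \frac{1}{x_1 x_2} \cdot x_1 + \log(x_1 x_2)\cos(x_2) = \frac{\sin(x_2)}{x_2} + \log(x_1 x_2)\cos(x_2).
$$
Since $x_2$ enters through two branches of the tree, its gradient is the sum of two contributions.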

In [25]:
# defining the function
func1 = multiply(log(multiply(x1, x2)), sin(x2))

# graphical depiction
graph1 = graphviz.Digraph('graph1', comment='test') 
graph1.attr(rankdir="LR")
print_graph(func1, graph1)
graph1.render(directory='graph_out/tt', view=True)


# analytical function and its derivative(s)
mfunc1 = np.log(x1.value * x2.value) * np.sin(x2.value) # analytical function
mdfunc1dx1 = np.sin(x2.value) / x1.value # analytical derivative w.r.t. x1 
mdfunc1dx2 = np.sin(x2.value) / x2.value + np.log(x1.value * x2.value) * np.cos(x2.value) # analytical derivative w.r.t. x2

# comparison to analytical value
# comparison 1: forward propagation
ic(forward(func1)) # value of the function via forward propagation
ic(np.log(x1.value * x2.value) * np.sin(x2.value)) # value of the analytical function

# comparison 2: backward propagation
# the same parameter nodes are reused across several functions, so their accumulated gradients are reset to zero before each backward pass
x1.grad_value=0
x2.grad_value=0
backward(func1) # performing the derivative via backward propagation
ic(x1.grad_value) # value of the derivative w.r.t. x1 via backward propagation
ic(np.sin(x2.value) / x1.value) # value of the analytical derivative w.r.t. x1 
ic(x2.grad_value) # value of the derivative w.r.t. x2 via backward propagation
ic(np.sin(x2.value) / x2.value + np.log(x1.value * x2.value) * np.cos(x2.value)) # value of the analytical derivative w.r.t. x2


# Result
print("function 1")
print("Calculus values of x1 derivative and x2 derivative:", mdfunc1dx1, mdfunc1dx2 )
print("Compared to derivatives through chain rule:        ", x1.grad_value, x2.grad_value)

ic| forward(func1): array([-0.4222195])
ic| np.log(x1.value * x2.value) * np.sin(x2.value): array([-0.4222195])
ic| x1.grad_value: array([0.33950571])
ic| np.sin(x2.value) / x1.value: array([0.33950571])
ic| x2.grad_value: array([-0.68605395])
ic| np.sin(x2.value) / x2.value + np.log(x1.value * x2.value) * np.cos(x2.value): array([-0.68605395])


function 1
Calculus values of x1 derivative and x2 derivative: [0.33950571] [-0.68605395]
Compared to derivatives through chain rule:         [0.33950571] [-0.68605395]


## Function 2: $g(x_1, x_2) = x_1 \cdot x_2 (x_1 + x_2) $ 

In [26]:
# defining the function
func2 = multiply(x1, multiply(x2, add(x1, x2)))

# graphical depiction
graph2 = graphviz.Digraph('graph2', comment='test') 
graph2.attr(rankdir="LR")
print_graph(func2, graph2)
graph2.render(directory='graph_out/tt', view=True)

# analytical function and its derivative(s)
mfunc2 = x1.value * x2.value * (x1.value + x2.value)
mdfunc2dx1 = x2.value * (x1.value + x2.value) + x1.value * x2.value
mdfunc2dx2 = x1.value * (x1.value + x2.value) + x1.value * x2.value

# comparison
ic(forward(func2))
ic(x1.value * x2.value * (x1.value + x2.value))

x1.grad_value=0
x2.grad_value=0
backward(func2)
ic(x1.grad_value)
ic(mdfunc2dx1)
ic(x2.grad_value)
ic(mdfunc2dx2)

# result
print("Function 2")
print("Calculus values of x1 derivative and x2 derivative:", mdfunc2dx1, mdfunc2dx2 )
print("Compared to derivatives through chain rule:        ", x1.grad_value, x2.grad_value)

ic| forward(func2): array([0.17161258])
ic| x1.value * x2.value * (x1.value + x2.value): array([0.17161258])
ic| x1.grad_value: array([0.41607981])
ic| mdfunc2dx1: array([0.41607981])
ic| x2.grad_value: array([0.87295034])
ic| mdfunc2dx2: array([0.87295034])


Function 2
Calculus values of x1 derivative and x2 derivative: [0.41607981] [0.87295034]
Compared to derivatives through chain rule:         [0.41607981] [0.87295034]


## Function 3: $h(x) = 3x^2 + 4x + 2$

In [27]:
# defining the function
func3 = add_2( add( multiply_3(multiply(x,x)) , multiply_4(x) ))

# graphical depiction
graph3 = graphviz.Digraph('graph3', comment='test') 
graph3.attr(rankdir="LR")
print_graph(func3, graph3)
graph3.render(directory='graph_out/tt', view=False)

# comparison
mfunc3 = 3*x.value**2 + 4 * x.value + 2
mdfunc3dx = 6 * x.value + 4

ic(forward(func3))
ic(3*x.value**2 + 4 * x.value + 2)

x.grad_value = 0
backward(func3)
ic(mdfunc3dx)
ic(x.grad_value)


# Result
print("Function 3")
print("Calculus values of x derivative:                   ", mdfunc3dx)
print("Compared to derivatives through chain rule:        ", x.grad_value)

ic| forward(func3): array([2.67543434])
ic| 3*x.value**2 + 4 * x.value + 2: array([2.67543434])
ic| mdfunc3dx: array([4.9097059])
ic| x.grad_value: array([4.9097059])


Function 3
Calculus values of x derivative:                    [4.9097059]
Compared to derivatives through chain rule:         [4.9097059]


## Function 4: Neuron$(\vec x, w, \vec b) = \tanh(w\cdot \vec x + \vec b)$

### Manual algorithm
For the neuron activation function, we will compare the derivatives with what $\texttt{torch}$ computes. \
The backpropagation will include the loss function as the outermost derivative, so that we have
$$
 \frac{\partial \text{Loss}(\vec y)}{\partial \vartheta_{ij}} = \frac{\partial \text{Loss}}{\partial \vec y}  \frac{\partial \vec y}{\partial \vartheta_{ij}}
$$
where $\vartheta_{ij}$ stands for either an entry $w_{ij}$ of the weight matrix or a component of the bias vector $\vec b$.
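
For this particular neuron and a sum-of-squares loss, the chain rule above can also be written in closed form. The sketch below is an illustrative addition that uses plain NumPy instead of the notebook's expression tree; the function names and the random test data are assumptions.

```python
import numpy as np


def neuron_loss(W, x, b, t):
    """Sum of squared errors of tanh(W x + b) against the target t."""
    return np.sum((np.tanh(W @ x + b) - t) ** 2)


def neuron_loss_gradients(W, x, b, t):
    """Closed-form gradients of neuron_loss with respect to W, b and x."""
    y = np.tanh(W @ x + b)
    # dLoss/dy = 2 (y - t) and dy/dz = 1 - y^2, with z = W x + b
    delta = 2 * (y - t) * (1 - y ** 2)
    return np.outer(delta, x), delta, W.T @ delta


rng = np.random.default_rng(0)
W_demo, x_demo = rng.random((3, 2)), rng.random(2)
b_demo, t_demo = rng.random(3), rng.random(3)
grad_W_demo, grad_b_demo, grad_x_demo = neuron_loss_gradients(
    W_demo, x_demo, b_demo, t_demo
)
```

Each returned gradient has the same shape as the parameter it belongs to, which makes it directly comparable to the `.grad` attribute of a corresponding `torch` tensor.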

In [28]:
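# loss function: sum of squared errors (uses .pow, so it expects torch tensors)
def loss_function(y_pred, y_target):
    return (y_pred - y_target).pow(2).sum()

# (d loss)/(d y_pred) = 2*(y_pred - y_target); works for NumPy arrays and torch tensors
def loss_backwards(y_pred, y_target):
    return 2*(y_pred - y_target)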

### One-dimensional variables

In [29]:
xs_value = np.random.rand(1)
ws_value = np.random.rand(1)
bs_value = np.random.rand(1)
func4s_target_value = np.random.rand(1)

ws = Expr_end_node(ws_value)
xs = Expr_end_node(xs_value)
bs = Expr_end_node(bs_value)
func4s_target = Expr_end_node(func4s_target_value)

tanh = Tanh()
add = Add()
multiply = Multiply()

# defining the function
func4s = tanh(add(multiply(ws,xs) , bs)) # = function 4

# prediction value for loss function
func4s_predict = forward(func4s)

# graphical depiction
graph4s = graphviz.Digraph('graph4s', comment='test') 
graph4s.attr(rankdir="LR")
print_graph(func4s, graph4s)
graph4s.render(directory='graph_out/tt', view=True)


ws.grad_value = np.float64(0)
xs.grad_value = np.float64(0)
bs.grad_value = np.float64(0)
backward(func4s)

grad_ws = loss_backwards(func4s_predict, func4s_target.value)*ws.grad_value
grad_xs = loss_backwards(func4s_predict, func4s_target.value)*xs.grad_value
grad_bs = loss_backwards(func4s_predict, func4s_target.value)*bs.grad_value

ic(grad_ws)
ic(grad_xs)
ic(grad_bs)


# Result
print("Function 4 (scalar)")
print("Derivatives through chain rule:        \n", "ws :\n",
      grad_ws, "\n xs: \n",
      grad_xs, "\n bs: \n", 
      grad_bs)

ic| grad_ws: array([0.13392936])
ic| grad_xs: array([0.27181791])
ic| grad_bs: array([0.2752326])


Function 4 (scalar)
Derivatives through chain rule:        
 ws :
 [0.13392936] 
 xs: 
 [0.27181791] 
 bs: 
 [0.2752326]


### Multidimensional values
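The multidimensional case multiplies the loss gradient onto the stored `grad_value` from the left and then reshapes the result to the parameter's shape. One plausible reading, assumed here and not confirmed by the module source, is that `grad_value` holds the Jacobian of the output with respect to the row-major flattened parameter. The NumPy sketch below builds such a Jacobian by hand for the weight matrix; all names in it are hypothetical.

```python
import numpy as np


def jacobian_wrt_flat_weights(W, x, b):
    """Jacobian of tanh(W x + b) w.r.t. W flattened row by row (assumed layout)."""
    y = np.tanh(W @ x + b)
    n_out, n_in = W.shape
    J = np.zeros((n_out, n_out * n_in))
    for i in range(n_out):
        # Output i depends only on row i of W.
        J[i, i * n_in:(i + 1) * n_in] = (1 - y[i] ** 2) * x
    return y, J


rng = np.random.default_rng(1)
W_demo, x_demo = rng.random((3, 2)), rng.random(2)
b_demo, t_demo = rng.random(3), rng.random(3)

y_demo, J_demo = jacobian_wrt_flat_weights(W_demo, x_demo, b_demo)
dloss_dy = 2 * (y_demo - t_demo)               # outer derivative, shape (3,)
flat_grad = dloss_dy @ J_demo                  # vector-Jacobian product, shape (6,)
grad_W_demo = flat_grad.reshape(W_demo.shape)  # back to the shape of W
```

Under this layout the reshaped result equals the outer product of `dloss_dy * (1 - y_demo**2)` with `x_demo`, which is why a single reshape is enough to recover a gradient with the shape of the weight matrix.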

In [30]:
# multidimensional values
w_value = np.random.rand(3,2)
xv_value = np.random.rand(2)
b_value = np.random.rand(3)
func4_target_value = np.random.rand(3)

w = Expr_end_node(w_value)
xv = Expr_end_node(xv_value)
b = Expr_end_node(b_value)
func4_target = Expr_end_node(func4_target_value)

tanh = Tanh()
vecadd = Vector_vector_sum()
matmul = Matrix_vector_product()

# defining the function
activation = Matrix_w_x_b()
func4 = tanh(activation(w,xv,b)) # = function 4

# prediction value for loss function
func4_predict = forward(func4)

# graphical depiction
graph4 = graphviz.Digraph('graph4', comment='test') 
graph4.attr(rankdir="LR")
print_graph(func4, graph4)
graph4.render(directory='graph_out/tt', view=True)


w.grad_value = np.float64(0)
xv.grad_value = np.float64(0)
b.grad_value = np.float64(0)
backward(func4)

grad_w = loss_backwards(func4_predict, func4_target.value)@w.grad_value
grad_xv = loss_backwards(func4_predict, func4_target.value)@xv.grad_value
grad_b = loss_backwards(func4_predict, func4_target.value)@b.grad_value

# reshaping the derivatives to the shape of the respective parameter
grad_w_cs = grad_w.reshape(w.value.shape)
grad_xv_cs = grad_xv.reshape(xv.value.shape)
grad_b_cs = grad_b.reshape(b.value.shape)

ic(grad_w_cs)
ic(grad_xv_cs)
ic(grad_b_cs)


# Result
print("Function 4")
print("Derivatives through chain rule:        \n", "w :\n",
      grad_w_cs, "\n xv: \n",
      grad_xv_cs, "\n b: \n", 
      grad_b_cs)

ic| grad_w_cs: array([[-0.11345782, -0.12900994],
                      [ 0.05637733,  0.06410519],
                      [ 0.19051295,  0.21662731]])
ic| grad_xv_cs: array([ 0.20828598, -0.02672952])
ic| grad_b_cs: array([-0.218523  ,  0.10858434,  0.36693335])


Function 4
Derivatives through chain rule:        
 w :
 [[-0.11345782 -0.12900994]
 [ 0.05637733  0.06410519]
 [ 0.19051295  0.21662731]] 
 xv: 
 [ 0.20828598 -0.02672952] 
 b: 
 [-0.218523    0.10858434  0.36693335]


### Function 4 extension: multiple layers
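For two stacked neurons, the gradients of the first layer are obtained by passing the second layer's error signal back through its weight matrix. The sketch below writes this out in plain NumPy as an illustrative addition; it does not use the notebook's expression tree, and the function names and random test data are assumptions.

```python
import numpy as np


def two_layer_loss(W1, b1, W2, b2, x, t):
    """Sum of squared errors of a two-layer tanh network against the target t."""
    h = np.tanh(W1 @ x + b1)
    y = np.tanh(W2 @ h + b2)
    return np.sum((y - t) ** 2)


def two_layer_gradients(W1, b1, W2, b2, x, t):
    """Hand-written backpropagation for two_layer_loss."""
    h = np.tanh(W1 @ x + b1)                  # first-layer output
    y = np.tanh(W2 @ h + b2)                  # second-layer output
    delta2 = 2 * (y - t) * (1 - y ** 2)       # dLoss/dz2, with z2 = W2 h + b2
    delta1 = (W2.T @ delta2) * (1 - h ** 2)   # dLoss/dz1, with z1 = W1 x + b1
    return {
        "W1": np.outer(delta1, x), "b1": delta1,
        "W2": np.outer(delta2, h), "b2": delta2,
    }


rng = np.random.default_rng(2)
demo_params = (rng.random((3, 2)), rng.random(3), rng.random((3, 3)), rng.random(3))
demo_x, demo_t = rng.random(2), rng.random(3)
demo_grads = two_layer_gradients(*demo_params, demo_x, demo_t)
```

The only coupling between the layers is the term `W2.T @ delta2`, which plays the role of the accumulated `value` that the recursive `backward` function hands down to the child nodes.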

In [31]:
w2_value = np.random.rand(3,3)
b2_value = np.random.rand(3)
func4_2_target_value = np.random.rand(3)

w2 = Expr_end_node(w2_value)
b2 = Expr_end_node(b2_value)
func4_2_target = Expr_end_node(func4_2_target_value)

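# second neuron
func4_2 = tanh(activation(w2, func4, b2))

# prediction value for loss function (needed by loss_backwards below)
func4_2_predict = forward(func4_2)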

# graphical depiction
graph4_2 = graphviz.Digraph('graph4.2', comment='test') 
graph4_2.attr(rankdir="LR")
print_graph(func4_2, graph4_2)
graph4_2.render(directory='graph_out/tt', view=True)

w2.grad_value = np.float64(0)
b2.grad_value = np.float64(0)
w.grad_value = np.float64(0)
b.grad_value = np.float64(0)
backward(func4_2)

grad_w2 = loss_backwards(func4_2_predict, func4_2_target.value)@w2.grad_value
grad_b2 = loss_backwards(func4_2_predict, func4_2_target.value)@b2.grad_value

# reshaping the derivatives to the shape of the respective parameter
grad_w2_rs = grad_w2.reshape(w2.value.shape)
grad_b2_rs = grad_b2.reshape(b2.value.shape)

ic(grad_w2_rs)
ic(grad_b2_rs)

ic| grad_w2_rs: array([[0.07755232, 0.12375362, 0.06961662],
                       [0.09566422, 0.15265556, 0.08587518],
                       [0.10144618, 0.1618821 , 0.0910655 ]])
ic| grad_b2_rs: array([0.13295109, 0.16400104, 0.1739133 ])


array([0.13295109, 0.16400104, 0.1739133 ])

### Comparison to $\texttt{torch}$
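The comparisons in this section are made by printing both results side by side. A programmatic check is a possible alternative; the sketch below is an illustrative addition in which the helper `assert_gradients_match`, its tolerances, and the sample numbers are assumptions.

```python
import numpy as np


def assert_gradients_match(manual, reference, rtol=1e-7, atol=1e-10):
    """Raise AssertionError if two gradient arrays differ beyond the tolerance."""
    manual = np.asarray(manual, dtype=float)
    reference = np.asarray(reference, dtype=float).reshape(manual.shape)
    np.testing.assert_allclose(manual, reference, rtol=rtol, atol=atol)


# Made-up sample values, only to show the call.
assert_gradients_match([0.1, 0.2], np.array([0.1, 0.2]))
```

A `torch` gradient can be converted before the call with `tensor.grad.detach().numpy()`, so the helper itself does not need to import `torch`.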

In [32]:
import torch

### One-dimensional and single-layer comparison

In [51]:
# Same code as above, but the random parameters need to be in the same cell for comparison
# Everything inside the lined box is a repetition, skip to torch code
# -------------------------------------------------------------------------------------------
# actual values
xs_value = np.random.rand(1)
ws_value = np.random.rand(1)
bs_value = np.random.rand(1)
func4s_target_value = np.random.rand(1)

ws = Expr_end_node(ws_value)
xs = Expr_end_node(xs_value)
bs = Expr_end_node(bs_value)
func4s_target = Expr_end_node(func4s_target_value)

tanh = Tanh()
add = Add()
multiply = Multiply()

# defining the function
func4s = tanh(add(multiply(ws,xs) , bs)) # = function 4

# prediction value for loss function
func4s_predict = forward(func4s)

ws.grad_value = np.float64(0)
xs.grad_value = np.float64(0)
bs.grad_value = np.float64(0)
backward(func4s)

grad_ws = loss_backwards(func4s_predict, func4s_target.value)*ws.grad_value
grad_xs = loss_backwards(func4s_predict, func4s_target.value)*xs.grad_value
grad_bs = loss_backwards(func4s_predict, func4s_target.value)*bs.grad_value

# -------------------------------------------------------------------------------------------
# here starts the torch code
def forward_function(x, w, b):
     return torch.tanh(w * x + b)

# torch expression for parameters
xs_t = torch.tensor(xs_value, requires_grad=True)
ws_t = torch.tensor(ws_value, requires_grad=True)
bs_t = torch.tensor(bs_value, requires_grad=True)


func4s_target_t = torch.tensor(func4s_target_value)

func4s_predict_t = forward_function(xs_t, ws_t, bs_t,)

loss_t = loss_function(func4s_predict_t, func4s_target_t)

loss_t.backward()

ic(grad_ws)
ic(grad_bs)
ic(ws_t.grad)
ic(bs_t.grad)


# Result
print("Function 4 (scalar)")
print("torch values for w and b: \n",
      "w: \n", ws_t.grad, "\n",
      "b: \n", bs_t.grad, "\n")
print("compared to manual derivatives through chain rule: \n", 
      "w :\n", grad_ws, "\n",
      "b: \n", grad_bs)

ic| grad_ws: array([0.04336008])
ic| grad_bs: array([0.14948149])
ic| ws_t.grad: tensor([0.0434], dtype=torch.float64)
ic| bs_t.grad: tensor([0.1495], dtype=torch.float64)


Function 4 (scalar)
torch values for w and b: 
 w: 
 tensor([0.0434], dtype=torch.float64) 
 b: 
 tensor([0.1495], dtype=torch.float64) 

compared to manual derivatives through chain rule: 
 w :
 [0.04336008] 
 b: 
 [0.14948149]


### Multidimensional and multilayer comparison

In [45]:
# Same code as above, but the random parameters need to be in the same cell for comparison
# Everything inside the lined box is a repetition, skip to torch code
# -------------------------------------------------------------------------------------------
# actual values
w_value = np.random.rand(3,2)
xv_value = np.random.rand(2)
b_value = np.random.rand(3)

w2_value = np.random.rand(3,3)
b2_value = np.random.rand(3)
func4_2_target_value = np.random.rand(3)

# node expression
w = Expr_end_node(w_value)
xv = Expr_end_node(xv_value)
b = Expr_end_node(b_value)

w2 = Expr_end_node(w2_value)
b2 = Expr_end_node(b2_value)


func4_2_target = Expr_end_node(func4_2_target_value)


tanh = Tanh()
vecadd = Vector_vector_sum()
matmul = Matrix_vector_product()
activation = Matrix_w_x_b()

func4 = tanh(activation(w,xv,b))
func4_2 = tanh(activation(w2, func4, b2))
func4_2_predict = forward(func4_2)

w.grad_value = np.float64(0)
xv.grad_value = np.float64(0)
b.grad_value = np.float64(0)
w2.grad_value = np.float64(0)
b2.grad_value = np.float64(0)
backward(func4_2)

grad_w = loss_backwards(func4_2_predict, func4_2_target.value)@w.grad_value
grad_b = loss_backwards(func4_2_predict, func4_2_target.value)@b.grad_value

grad_w2 = loss_backwards(func4_2_predict, func4_2_target.value)@w2.grad_value
grad_b2 = loss_backwards(func4_2_predict, func4_2_target.value)@b2.grad_value

# reshaping the derivatives to the shape of the respective parameter
grad_w_rs = grad_w.reshape(w.value.shape)
grad_b_rs = grad_b.reshape(b.value.shape)
grad_w2_rs = grad_w2.reshape(w2.value.shape)
grad_b2_rs = grad_b2.reshape(b2.value.shape)

# -------------------------------------------------------------------------------------------
# here starts the torch code
def forward_function(x, w1, b1, w2, b2):
     return torch.tanh(w2 @ torch.tanh(w1 @ x + b1) + b2)

# torch expression for parameters
xv_t = torch.tensor(xv_value, requires_grad=True)
w_t = torch.tensor(w_value, requires_grad=True)
b_t = torch.tensor(b_value, requires_grad=True)

w2_t = torch.tensor(w2_value, requires_grad=True)
b2_t = torch.tensor(b2_value, requires_grad=True)


func4_2_target_t = torch.tensor(func4_2_target_value)

func4_2_predict_t = forward_function(xv_t, w_t, b_t, w2_t, b2_t)
loss_t = loss_function(func4_2_predict_t, func4_2_target_t)

loss_t.backward()

ic(grad_w2_rs)
ic(grad_b2_rs)
ic(w2_t.grad)
ic(b2_t.grad)

print("Function 4")
print("\n Input layer: \n",
      "w: \n", w_value, "\n",
      "b: \n", b_value)
print("\n Second layer:")
print("torch values for w and b: \n",
      "w: \n", w_t.grad, "\n",
      "b: \n", b_t.grad, "\n")
print("compared to manual derivatives through chain rule: \n", 
      "w :\n", grad_w_rs, "\n",
      "b: \n", grad_b_rs)
print("\n Third layer:")
# the third layer uses the second set of parameters (w2, b2)
print("torch values for w and b: \n",
      "w: \n", w2_t.grad, "\n",
      "b: \n", b2_t.grad, "\n")
print("compared to manual derivatives through chain rule: \n", 
      "w :\n", grad_w2_rs, "\n",
      "b: \n", grad_b2_rs)

ic| grad_w2_rs: array([[0.02405814, 0.04238482, 0.04738375],
                       [0.01858132, 0.03273595, 0.03659687],
                       [0.03645592, 0.06422682, 0.07180181]])
ic| grad_b2_rs: array([0.05003356, 0.03864346, 0.07581715])
ic| w2_t.grad: tensor([[0.0241, 0.0424, 0.0474],
                       [0.0186, 0.0327, 0.0366],
                       [0.0365, 0.0642, 0.0718]], dtype=torch.float64)
ic| b2_t.grad: tensor([0.0500, 0.0386, 0.0758], dtype=torch.float64)


Function 4

 Input layer: 
 w: 
 [[0.76986411 0.18741674]
 [0.86170516 0.12539191]
 [0.3849205  0.9887739 ]] 
 b: 
 [0.16677814 0.92080612 0.82980949]

 Second layer:
torch values for w and b: 
 w: 
 tensor([[0.0211, 0.0752],
        [0.0062, 0.0223],
        [0.0020, 0.0073]], dtype=torch.float64) 
 b: 
 tensor([0.0848, 0.0251, 0.0083], dtype=torch.float64) 

compared to manual derivatives through chain rule: 
 w :
 [[0.02105467 0.07523013]
 [0.00623562 0.02228041]
 [0.00204865 0.00731998]] 
 b: 
 [0.08482723 0.02512272 0.00825379]

 Third layer:
torch values for w and b: 
 w: 
 tensor([[0.0241, 0.0424, 0.0474],
        [0.0186, 0.0327, 0.0366],
        [0.0365, 0.0642, 0.0718]], dtype=torch.float64) 
 b: 
 tensor([0.0500, 0.0386, 0.0758], dtype=torch.float64) 

compared to manual derivatives through chain rule: 
 w :
 [[0.02405814 0.04238482 0.04738375]
 [0.01858132 0.03273595 0.03659687]
 [0.03645592 0.06422682 0.07180181]] 
 b: 
 [0.05003356 0.03864346 0.07581715]
