# Let's think of a very simple classification problem
## i.e. you have a set of images (tiny ones, say 25x20)
## The images are split into 10 different categories
## So your NN takes in 500 inputs (25*20) and has 10 outputs
## Feed each pixel value of the image into its corresponding entry ("row") of the 500-element input vector
## You want the NN to output a "1.0" in exactly one of the outputs depending on which category the image belongs to
## So let's use tanh(x) as an activation function. Why? Because it squashes each output into the range -1.0 to 1.0


### Create a matrix (10,500) and fill it with random values between 0 and 1.0 (sometimes -1.0 to 1.0 works too!)
### for each image, simply do a tanh(matrix*vector) and see what you get.
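The forward pass described above can be sketched in a few lines of NumPy. This is a minimal sketch, not a real classifier: the names `W`, `image`, `output`, and `predicted_category` are made up for illustration, the "image" is random noise, and the seed is fixed only so the result is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the sketch is repeatable

n_pixels = 25 * 20                       # 500 inputs, one per pixel
n_classes = 10                           # one output per category

# Weight matrix with entries drawn uniformly from [-1.0, 1.0)
W = rng.uniform(-1.0, 1.0, size=(n_classes, n_pixels))

# Stand-in for one flattened image (assumption: pixel values scaled to [0, 1))
image = rng.uniform(0.0, 1.0, size=n_pixels)

# Forward pass: one matrix-vector product followed by an element-wise tanh
output = np.tanh(W @ image)

print(output.shape)                      # (10,)

# The largest output is the network's guess for the category
predicted_category = int(np.argmax(output))
```

With 500 inputs and weights this large, most outputs will likely sit very close to -1.0 or 1.0, because tanh saturates quickly. Scaling the initial weights down is one common way to avoid that, but it is not needed to follow the rest of the argument.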
### Now you can create a "LOSS" function which is how far your NN is from the KNOWN value.
#### i.e. for each training image, you know exactly which category it belongs to.
#### Thus you can calculate how far off the answer that you get is from the answer that you want,
#### you sum that error value up over all the training images
#### and that is the "LOSS" function.
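One way to write such a loss down, assuming a squared-error measure and made-up data, is sketched here. The names `images`, `labels`, `targets`, and `per_image_error` are hypothetical, and the images and labels are random stand-ins rather than a real dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

n_images, n_pixels, n_classes = 8, 500, 10

W = rng.uniform(-1.0, 1.0, size=(n_classes, n_pixels))
images = rng.uniform(0.0, 1.0, size=(n_images, n_pixels))   # made-up images, one per row
labels = rng.integers(0, n_classes, size=n_images)          # made-up known categories

# Target vectors: 1.0 at the known category, 0.0 everywhere else
targets = np.zeros((n_images, n_classes))
targets[np.arange(n_images), labels] = 1.0

# Forward pass for every image at once; row k holds the 10 outputs for image k
outputs = np.tanh(images @ W.T)

# Squared distance between what we got and what we wanted, per image
per_image_error = np.sum((outputs - targets) ** 2, axis=1)

# Sum over all images: this single number is the LOSS
loss = per_image_error.sum()
print(loss)
```

Squared error is only one possible choice of distance; the point is that the whole training set collapses into one number that depends on the values in `W`.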

## So are we done?
### Do we just plug in the matrix * vector into a symbolic math package and have it calculate the differentiation???

In [None]:
import numpy as np
import math
import sympy as yp
import scipy as sp
from sympy.abc import x,y

In [None]:
x = yp.Symbol('x')
x, y, c = yp.symbols('x y c')

In [None]:
(x+y)*(x+y)*(x+y)

In [None]:
c = 3

In [None]:
z = (yp.exp(x) - yp.exp(-x))/(yp.exp(x)+yp.exp(-x))

In [None]:
z

In [None]:
# y is an independent symbol here, so its derivative w.r.t. x is 0
yp.diff(y,x)

In [None]:
z2 = yp.diff(z, x)

In [None]:
z2

In [None]:
yp.simplify(z2)

In [None]:
# d/dx tanh(x) = 1 - tanh(x)**2, so q is mathematically equal to z2
q = 1 - z**2

In [None]:
q

In [None]:
yp.simplify(q)

In [None]:
i = yp.Symbol('i')


In [None]:
A = yp.MatrixSymbol('A', 3,3)
y = yp.MatrixSymbol('y', 3,1)
B = A*y                    
C = yp.tanh(B)

In [None]:
C

In [None]:
yp.diff(B,A)

###  just trust me on this one:
#### plugging a matrix*vector into a symbolic computation package and asking it to give you a solution is likely NOT going to help you.

# THE KEY IDEA (backpropagation, popularized in the 1980s)
## THIS BIZARRE TRICK makes neural networks tractable:
## YOU CAN CALCULATE the derivative of each small piece of an expression on its own
## COMBINE the derivatives of the pieces
## and chain them backward, towards the input variables
## i.e. use the CHAIN RULE!
## NOW YOU're DONE!
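To see the chain rule doing the work on something tiny, here is a scalar sketch. The function `forward` and the values of `x`, `w`, and `b` are arbitrary choices for illustration. Each piece is differentiated on its own, the pieces are multiplied together going backward, and the result is compared against a finite-difference estimate.

```python
import math

def forward(x, w, b):
    u = w * x          # piece 1: multiply
    v = u + b          # piece 2: add
    y = math.tanh(v)   # piece 3: tanh
    return u, v, y

x, w, b = 1.3, 0.5, -0.2
u, v, y = forward(x, w, b)

# Derivative of each piece on its own
dy_dv = 1.0 - y * y    # d tanh(v)/dv = 1 - tanh(v)**2
dv_du = 1.0            # d(u + b)/du
du_dx = w              # d(w * x)/dx

# Chain rule: multiply the pieces, working backward from the output to the input
dy_dx = dy_dv * dv_du * du_dx

# Independent check using a central finite difference
eps = 1e-6
numeric = (forward(x + eps, w, b)[2] - forward(x - eps, w, b)[2]) / (2 * eps)

print(dy_dx, numeric)
```

The two printed numbers should agree to many decimal places. No symbolic algebra was needed, only the derivative of one simple operation at a time.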

# So where do you start?
## Assume you managed to find a good equation for your loss function.
## You have the output of the loss function, which is going to be some number
### You want to figure out how to change the values in the matrix so that your resulting loss is minimized.

# Key thing is that d(x)/d(x) == 1 (ALWAYS)
# so you start with d(LOSS)/d(LOSS), which is 1.0
# now calculate d(LOSS) w.r.t. the NEXT piece of your equation graph (i.e. just one operation at a time!)
## next step is to calculate d(stage[i])/d(stage[i+1]) for each stage [i] of your expression graph.
## i.e. calculate the partial derivative of stage [i] with respect to each source variable in stage [i+1]
# What is a "stage"?
## It's simply all of the subexpressions that make up the destination expression at some point
## i.e. y = tanh(X @ V + b)
### LOSS = (y - yANS)**2
### y = tanh(v[0])
### v[0] = v[1] + b
### v[1] = X @ V
## In this case, only the matrix $X$ and the bias vector $b$ are the parameters of your NN
## $V$ is the input vector that the matrix multiplies
### So ultimately we want to figure out how to change $X$ and $b$ to minimize the LOSS w.r.t. the training set
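The stages listed above can be walked backward one at a time. The sketch below makes some assumptions: tiny made-up shapes (3 outputs, 4 inputs), a squared-error LOSS summed over the output vector, and the Python names `v0` and `v1` standing in for `v[0]` and `v[1]`. It deliberately stops at the gradient arriving at `v[1] = X @ V`, because pushing it through the matrix multiply to reach $X$ is a separate step.

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.uniform(-1.0, 1.0, size=(3, 4))   # parameter matrix
b = rng.uniform(-1.0, 1.0, size=3)        # parameter bias vector
V = rng.uniform(0.0, 1.0, size=4)         # input vector
yANS = np.array([1.0, 0.0, 0.0])          # made-up known answer

def loss_fn(X, b):
    # Same stages as below, bundled up so we can check the result numerically
    return np.sum((np.tanh(X @ V + b) - yANS) ** 2)

# Forward pass, keeping every stage
v1 = X @ V
v0 = v1 + b
y = np.tanh(v0)
LOSS = np.sum((y - yANS) ** 2)

# Backward pass, one stage at a time
dLOSS_dLOSS = 1.0                               # d(x)/d(x) == 1
dLOSS_dy = dLOSS_dLOSS * 2.0 * (y - yANS)       # through LOSS = sum((y - yANS)**2)
dLOSS_dv0 = dLOSS_dy * (1.0 - y ** 2)           # through y = tanh(v[0])
dLOSS_dv1 = dLOSS_dv0 * 1.0                     # through v[0] = v[1] + b, w.r.t. v[1]
dLOSS_db = dLOSS_dv0 * 1.0                      # through v[0] = v[1] + b, w.r.t. b

# Finite-difference check of the gradient w.r.t. b
eps = 1e-6
numeric_db = np.zeros_like(b)
for k in range(b.size):
    step = np.zeros_like(b)
    step[k] = eps
    numeric_db[k] = (loss_fn(X, b + step) - loss_fn(X, b - step)) / (2 * eps)

print(dLOSS_db)
print(numeric_db)
```

The two printed vectors should match closely. `dLOSS_dv1` is the gradient that gets handed to the matrix-multiply stage.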


# So what is the derivative of a matrix?
## simple (!) (BUT ANNOYING!)
# We will cover it later!
### Remember that for a matrix multiply we ultimately want to calculate the new .grad of both the left-hand matrix and the right-hand matrix
### i.e. we need to calculate the outgoing gradient w.r.t. each $LH[i,j]$, i.e. each value in the left-hand matrix, and also w.r.t. each $RH[i,j]$, i.e. each value in the right-hand matrix.

### Assume `A` is (3,4) and `B` is (4,2). 
### if we do `C = A @ B`, the result of the matrix multiply `C` is a (3,2) matrix

        C        = A                @ B
        (3,2)      (3,4)              (4,2)
        c00 c01    a00 a01 a02 a03    b00 b01
        c10 c11  = a10 a11 a12 a13  @ b10 b11
        c20 c21    a20 a21 a22 a23    b20 b21
                                      b30 b31
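
The layout above can be checked directly in NumPy. In this sketch the entries of `A` and `B` are just consecutive integers chosen for illustration, and `C_by_hand`, `A_bumped`, and `changed` are hypothetical names. It spells out how each entry of `C` is built, then nudges a single entry of `A` to show which entries of `C` depend on it.

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(3, 4)   # plays the role of a00..a23
B = np.arange(8, dtype=float).reshape(4, 2)    # plays the role of b00..b31

C = A @ B
print(C.shape)   # (3, 2)

# Each c[i][j] is the dot product of row i of A with column j of B
C_by_hand = np.zeros((3, 2))
for i in range(3):
    for j in range(2):
        for k in range(4):
            C_by_hand[i, j] += A[i, k] * B[k, j]

# Nudge one entry of A (a12) and see which entries of C move
A_bumped = A.copy()
A_bumped[1, 2] += 1.0
changed = (A_bumped @ B) != C
print(changed)
```

Only row 1 of `changed` is `True`: the entry `a12` feeds into `c10` and `c11` and nothing else. That is why the gradient for one entry of the left-hand matrix only needs to collect contributions from one row of `C`.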
