# Try to implement derivatives with respect to the weights


During ```FFANN.feedForward``` we store all $\dfrac{\partial s^{(l+1)}_{j}}{\partial s^{(l)}_{i}}= \theta^{\prime \, (l+1)}_{j} w^{(l)}_{ji}$
in ```FFANN.derivatives```. For a network with $N$ layers ($N-2$ hidden + $1$ input layer + $1$ output layer), we wish to calculate  

* $\dfrac{\partial s^{(N-1)}_{r}}{\partial s^{(0)}_{p}}$ . [#1](#1)
* $\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}} = 
\dfrac{\partial s^{(N-1)}_{r}}{\partial s^{(l+1)}_{j}} \theta^{\prime\, (l+1)}_{j}s^{(l)}_{i}$ (no sum over $j$). [#2](#2)


In order to do this we can accumulate
$$
\Delta^{ (0) }_{j i} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-2)}_{i}} \\
\Delta^{ (1) }_{j i} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-3)}_{i}}=
\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-2)}_{k_1}} \cdot 
\dfrac{\partial s^{(N-2)}_{k_1}}{\partial s^{(N-3)}_{i}}=
\Delta^{(0)}_{j k}\cdot \dfrac{\partial s^{(N-2)}_{k}}{\partial s^{(N-3)}_{i}}
\\
\Delta^{ (2) }_{j i} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-4)}_{i}}=
\dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(N-2)}_{k_1}} \cdot 
\dfrac{\partial s^{(N-2)}_{k_1}}{\partial s^{(N-3)}_{k_2}}\cdot 
\dfrac{\partial s^{(N-3)}_{k_2}}{\partial s^{(N-4)}_{i}}=
\Delta^{(1)}_{j k}\cdot  \dfrac{\partial s^{(N-3)}_{k}}{\partial s^{(N-4)}_{i}}\\
\vdots\\
\Delta^{ (f) }_{j i} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{[N-(f+2)]}_{i}}=
\Delta^{ (f-1) }_{j k} \cdot  \dfrac{\partial s^{[N-(f+1)]}_{k}}{\partial s^{[N-(f+2)]}_{i}} \;,
$$
where the dot ($\cdot$) indicates summation over repeated indices. For convenience we can also define $\Delta^{(-1)}_{ji} = \delta_{ji}$. 
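
The recursion above is a chain of matrix products, so it can be sketched with NumPy as follows. This is only an illustration under assumed names: `nodes` and `jac` are hypothetical stand-ins (random matrices in place of the local derivatives stored in ```FFANN.derivatives```), not part of the `FFANN` class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes for a network with N = 5 layers.
nodes = [3, 4, 2, 5, 1]
N = len(nodes)

# jac[l] stands in for ds^{(l+1)}/ds^{(l)}, with shape (nodes[l+1], nodes[l]).
jac = [rng.normal(size=(nodes[l + 1], nodes[l])) for l in range(N - 1)]

# Delta[0] = ds^{(N-1)}/ds^{(N-2)}; every further step multiplies on the
# right by the local Jacobian of the next layer down.
Delta = [jac[N - 2]]
for f in range(1, N - 1):
    Delta.append(Delta[f - 1] @ jac[N - (f + 2)])

# Delta[N-2] should equal the product of all local Jacobians,
# i.e. the derivative of the output layer wrt the input layer.
full = np.linalg.multi_dot(jac[::-1])
print(np.allclose(Delta[N - 2], full))
```

Each `Delta[f]` has shape `(nodes[N-1], nodes[N-(f+2)])`, matching the index structure $\Delta^{(f)}_{ji}$.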


For [#1](#1), $ f= N-2 $, i.e. $\dfrac{\partial s^{(N-1)}_{r}}{\partial s^{(0)}_{p}} = \Delta^{(N-2)}_{rp}$.

For [#2](#2), we need $\dfrac{\partial s^{(N-1)}_{r}}{\partial s^{(l+1)}_{j}}$, so $N-(f+2) = l+1$ and therefore $f= N-(l+3)$, i.e. 
$$\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}} = 
\Delta^{ (N-(l+3)) }_{r j} \theta^{\prime\, (l+1)}_{j}s^{(l)}_{i} \qquad\qquad\qquad\qquad 
\text{ for } l \leq N -3
\\
\dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(N-2)}_{ji}} = 
\Delta^{ (-1) }_{r j} \theta^{\prime\, (N-1)}_{j}s^{(N-2)}_{i} = \delta_{r j} \ \theta^{\prime\, (N-1)}_{j}s^{(N-2)}_{i} 
\qquad \text{ for } l=N-2 \; .
$$
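
As a sanity check of the two cases above, here is a minimal self-contained sketch that evaluates the formula for a tiny network and compares it with a central finite difference. It does not use the `FFANN` class: the function names, the layer sizes, and the choice of a sigmoid activation on every layer are assumptions made only for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(weights, biases, x):
    """Return the signals s^{(l)} of every layer and the activation arguments."""
    signals, pre = [np.asarray(x, dtype=float)], []
    for W, b in zip(weights, biases):
        pre.append(W @ signals[-1] + b)   # argument of theta for layer l+1
        signals.append(sigmoid(pre[-1]))
    return signals, pre

def weight_derivative(weights, biases, x, r, l, j, i):
    """ds^{(N-1)}_r / dw^{(l)}_{ji} from the Delta formula."""
    signals, pre = forward(weights, biases, x)
    N = len(signals)
    # D = ds^{(N-1)}/ds^{(l+1)}; it stays the identity when l = N-2,
    # which is the Delta^{(-1)} case.
    D = np.eye(len(signals[N - 1]))
    for m in range(N - 2, l, -1):
        # local Jacobian ds^{(m+1)}_a/ds^{(m)}_b = theta'_a * w^{(m)}_{ab}
        D = D @ (sigmoid_prime(pre[m])[:, None] * weights[m])
    return D[r, j] * sigmoid_prime(pre[l][j]) * signals[l][i]

def numerical_derivative(weights, biases, x, r, l, j, i, h=1e-6):
    """Central difference on copies of the weights (the originals are untouched)."""
    w_plus = [W.copy() for W in weights]
    w_minus = [W.copy() for W in weights]
    w_plus[l][j, i] += h
    w_minus[l][j, i] -= h
    f1 = forward(w_plus, biases, x)[0][-1][r]
    f0 = forward(w_minus, biases, x)[0][-1][r]
    return (f1 - f0) / (2.0 * h)

rng = np.random.default_rng(1)
nodes = [3, 4, 2]  # hypothetical sizes, N = 3 layers
weights = [rng.normal(size=(nodes[k + 1], nodes[k])) for k in range(len(nodes) - 1)]
biases = [rng.normal(size=nodes[k + 1]) for k in range(len(nodes) - 1)]
x = [0.5, -1.0, 2.0]

analytic = weight_derivative(weights, biases, x, r=1, l=0, j=2, i=1)
numeric = numerical_derivative(weights, biases, x, r=1, l=0, j=2, i=1)
print(abs(analytic - numeric))
```

For `l = N-2` the loop in `weight_derivative` does not run, so the result is zero unless `j == r`, as the Kronecker delta requires.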


Note: The same construction holds for the derivatives with respect to the biases. Also, ```FFANN.backPropagation``` still needs to be optimized; **I'll have to do it later**.
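
To spell out the bias case: assume (as the use of `self.biases[l][j]` in the code suggests) that the signal of node $j$ in layer $l+1$ is $s^{(l+1)}_{j} = \theta^{(l+1)}\left( w^{(l)}_{ji} \cdot s^{(l)}_{i} + b^{(l)}_{j} \right)$, where $b^{(l)}_{j}$ is notation introduced here for the bias. Then the same chain rule should give
$$
\dfrac{\partial s^{(N-1)}_{r}}{\partial b^{(l)}_{j}} = 
\Delta^{ (N-(l+3)) }_{r j} \theta^{\prime\, (l+1)}_{j} \qquad \text{ for } l \leq N -3
\\
\dfrac{\partial s^{(N-1)}_{r}}{\partial b^{(N-2)}_{j}} = 
\delta_{r j} \ \theta^{\prime\, (N-1)}_{j} \qquad \text{ for } l=N-2 \; ,
$$
(no sum over $j$), i.e. the weight formula with the factor $s^{(l)}_{i}$ replaced by $1$.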


In [1]:
import FeedForwardANN as FFANN
import numpy as np

In [2]:
class FFv2(FFANN.FFANN):
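
    def numericalDerivative_w(self,r,l,j,i,h=1e-3):
        r'''
        Very inefficient numerical derivative of s^{self.total_layers-1}_r wrt w^{l}_{ji}
        just for testing purposes!
        Uses a central difference and restores the weight before returning.
        '''
        N=self.total_layers-1  # index of the output layer
        w=self.weights[l][j][i]
        h_1=h + h*np.abs(w)  # step size that scales with |w|

        # output at w + h_1
        self.weights[l][j][i]=w+h_1
        self.evaluate()
        f1=self.signals[N][r]

        # output at w - h_1
        self.weights[l][j][i]=w-h_1
        self.evaluate()
        f0=self.signals[N][r]

        # restore the original weight; note that self.signals still holds
        # the values from the last (perturbed) evaluation
        self.weights[l][j][i]=w
        return (f1-f0)/(2.*h_1 )
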
    def backPropagation(self):
        r'''
        Define Delta^{f}_{ji} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{[N-(f+2)]}_{i}}.
        For f=0,1,2,...,N-2 this is an n^{(N-1)} \times n^{(N-(f+2))} matrix.

        Notice that Delta^{self.total_layers-2}_{ji} = \dfrac{\partial s^{(N-1)}_{j}}{\partial s^{(0)}_{i}}.
        '''
        N=self.total_layers

        # Delta[f] has nodes[N-1] rows and nodes[N-(f+2)] columns, initialised to zero
        self.Delta=[ [[0 for i in range(self.nodes[N-(f+2)])] for j in range(self.nodes[N-1])]  for f in range(N-1)]
        # Delta^{0} is the local derivative of the output layer wrt the layer below it
        self.Delta[0]=self.derivatives[N-2][:]

        for f in range(1,N-1):#don't run f=0
            for j in range(self.nodes[N-1]):
                for i in range(self.nodes[N-(f+2)]):
                    # Delta^{f}_{ji} = sum_k Delta^{f-1}_{jk} * ds^{[N-(f+1)]}_k/ds^{[N-(f+2)]}_i
                    for k in range(self.nodes[N-(f+1)]):
                        self.Delta[f][j][i]+=self.Delta[f-1][j][k] * self.derivatives[N-(f+2)][k][i]
                        
    
    def derivative_w(self,r,l,j,i):
        r'''
        Calculate
        \dfrac{\partial s^{(N-1)}_{r}}{\partial w^{(l)}_{ji}} =  
        \Delta^{ (N-(l+3)) }_{r j} \theta^{\prime\, (l+1)}_{j}s^{(l)}_{i},
        with \Delta^{(-1)}_{r j} = \delta_{r j} for l = N-2.
        Uses self.signals and self.Delta, so a forward pass and
        backPropagation() must have been run first.
        '''
        N=self.total_layers
        # weights of the output layer: Delta^{(-1)}_{rj} = delta_{rj}
        if l==N-2 and j!=r:
            return 0

        # argument of the activation function of node j in layer l+1
        sum_wx = sum( [ self.weights[l][j][k] * xi for k,xi in enumerate(self.signals[l]) ] ) 
        local = self.activations[l].derivative(sum_wx+self.biases[l][j])*self.signals[l][i]
        if l==N-2:
            return local
        return self.Delta[N-(l+3)][r][j]*local
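
The testing helper `numericalDerivative_w` above uses a central difference with the step `h + h*|w|`, so the perturbation does not become negligible for large weights. A minimal standalone sketch of the same rule, applied to an ordinary function instead of the network (the name `central_difference` is hypothetical and not part of `FFANN`):

```python
import math

def central_difference(f, x, h=1e-3):
    """Central-difference estimate of f'(x) with a step that grows with |x|."""
    step = h + h * abs(x)
    return (f(x + step) - f(x - step)) / (2.0 * step)

# Compare against the known derivative of sin at x = 1.
estimate = central_difference(math.sin, 1.0)
error = abs(estimate - math.cos(1.0))
print(error)
```

The truncation error of a central difference scales with the square of the step, so it is accurate enough for checking `derivative_w`, but not a replacement for it.
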

In [3]:
lin=FFANN.linearActivation()
sig=FFANN.sigmoidActivation()


In [8]:
brain = FFv2(3,4,[3,2,50,2,6,66,8,1],[lin,lin,lin,lin,lin,lin,lin,lin,sig])
brain.init_params()
#brain.fill_biases_with(0)


In [13]:
# calculates just the output 
%timeit -n 15 -r 100 brain.evaluate()

# calculates the output and the local derivatives 
%timeit -n 15 -r 100 brain([33.33,2,0.1])

# calculates the output,  local derivatives, and runs backPropagation (for the Deltas)
%timeit -n 15 -r 100 brain([33.33,2,0.1]);brain.backPropagation()

# calculates the output, the local derivatives, and the derivatives wrt the signals 
%timeit -n 15 -r 100 brain.feedForwardDerivatives()

757 µs ± 43.9 µs per loop (mean ± std. dev. of 100 runs, 15 loops each)
1.98 ms ± 231 µs per loop (mean ± std. dev. of 100 runs, 15 loops each)
5.87 ms ± 182 µs per loop (mean ± std. dev. of 100 runs, 15 loops each)
4.58 ms ± 98.9 µs per loop (mean ± std. dev. of 100 runs, 15 loops each)
