# Recurrent Neural Network

"Input times weight add a bias activate" ~Siraj Raval (for each layer neuron)

#### Keys

- Used when we use **conditional sequential "memory"**

- good for to time series as sequencing matters

- Activation function can be linear or non-linear 

- in RNN's there is a 3rd weight matrix which connects the current hidden state to the hidden state of the previous hidden state (unique from feed forward NN's)

#### RNN Math

$$h^{(t)}=f(h^{(t-1)},x^{(t)}; \theta)$$

Loss Function

$$L\big((x^{(1)},...,x^{(\tau)}),(y^{(1)},...,y^{(\tau)})\big)=\sum_{t} -\log\hat{y}_{y^{(t)}}^{(t)}$$


#### Reference: 

RNN (Simple): https://github.com/llSourcell/recurrent_neural_net_demo/blob/master/rnn.py


In [2]:
import copy
import numpy as np

np.random.seed(0)

In [17]:
# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)

In [18]:
# training dataset generation
int2binary = {}
binary_dim = 8

largest_number = pow(2,binary_dim)
binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

In [19]:
# input variables
alpha = 0.1
input_dim = 2
hidden_dim = 16
output_dim = 1

In [20]:
# initialize neural network weights
synapse_0 = 2*np.random.random((input_dim,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim,hidden_dim)) - 1

synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)

In [21]:
synapse_0_update

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [22]:
# training logic
for j in range(10000):
    
    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number/2) # int version
    a = int2binary[a_int] # binary encoding

    b_int = np.random.randint(largest_number/2) # int version
    b = int2binary[b_int] # binary encoding

    # true answer
    c_int = a_int + b_int
    c = int2binary[c_int]
    
    # where we'll store our best guess (binary encoded)
    d = np.zeros_like(c)

    overallError = 0
    
    layer_2_deltas = list()
    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))
    
    # moving along the positions in the binary encoding
    for position in range(binary_dim):
        
        # generate input and output
        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T

        # hidden layer (input ~+ prev_hidden)
        layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))

        # output layer (new binary representation)
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # did we miss?... if so, by how much?
        layer_2_error = y - layer_2
        layer_2_deltas.append((layer_2_error)*sigmoid_output_to_derivative(layer_2))
        overallError += np.abs(layer_2_error[0])
    
        # decode estimate so we can print it out
        d[binary_dim - position - 1] = np.round(layer_2[0][0])
        
        # store hidden layer so we can use it in the next timestep
        layer_1_values.append(copy.deepcopy(layer_1))
    
    future_layer_1_delta = np.zeros(hidden_dim)
    
    for position in range(binary_dim):
        
        X = np.array([[a[position],b[position]]])
        layer_1 = layer_1_values[-position-1]
        prev_layer_1 = layer_1_values[-position-2]
        
        # error at output layer
        layer_2_delta = layer_2_deltas[-position-1]
        # error at hidden layer
        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)

        # let's update all our weights so we can try again
        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
        synapse_0_update += X.T.dot(layer_1_delta)
        
        future_layer_1_delta = layer_1_delta
    

    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha    

    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0
    
    # print out progress
    if(j % 1000 == 0):
        print("Error:" + str(overallError))
        print("Pred:" + str(d))
        print("True:" + str(c))
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")

Error:[5.36567378]
Pred:[0 0 0 0 0 0 0 0]
True:[1 0 1 1 1 1 1 0]
123 + 67 = 0
------------
Error:[4.11176]
Pred:[1 1 1 1 1 1 1 1]
True:[0 1 0 1 0 0 0 0]
38 + 42 = 255
------------
Error:[3.84062208]
Pred:[1 1 0 1 0 1 0 1]
True:[1 0 0 1 1 1 1 1]
37 + 122 = 213
------------
Error:[3.50093866]
Pred:[1 0 1 1 0 0 0 0]
True:[1 0 1 1 1 0 1 0]
124 + 62 = 176
------------
Error:[1.71561304]
Pred:[0 1 1 0 1 1 0 0]
True:[0 1 1 0 1 1 0 0]
0 + 108 = 108
------------
Error:[0.62784196]
Pred:[1 0 0 0 1 1 0 0]
True:[1 0 0 0 1 1 0 0]
76 + 64 = 140
------------
Error:[0.69191746]
Pred:[0 0 1 0 1 1 1 1]
True:[0 0 1 0 1 1 1 1]
47 + 0 = 47
------------
Error:[0.3280583]
Pred:[0 0 0 0 1 0 1 1]
True:[0 0 0 0 1 0 1 1]
1 + 10 = 11
------------
Error:[0.3665797]
Pred:[1 1 0 0 0 0 1 0]
True:[1 1 0 0 0 0 1 0]
74 + 120 = 194
------------
Error:[0.34021234]
Pred:[0 1 0 0 1 0 1 0]
True:[0 1 0 0 1 0 1 0]
52 + 22 = 74
------------


### RNN (More In-Depth)



#### Reference
RNN (More in-depth): https://www.youtube.com/watch?v=BwmddtPFWtA 
Github: https://github.com/llSourcell/recurrent_neural_network/blob/master/RNN.ipynb

In [23]:
data = open('./data/kafka.txt', 'r').read()

In [25]:
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('data has %d chars, %d unique' % (data_size, vocab_size))

data has 137628 chars, 80 unique


In [27]:
char_to_ix = {ch:i for i, ch in enumerate(chars)}
ix_to_char = {i:ch for i, ch in enumerate(chars)}
print(char_to_ix)
print(ix_to_char)

{'b': 0, 'm': 1, '/': 2, '4': 3, 'T': 4, '(': 5, 'p': 6, 'Q': 7, 'e': 8, 'K': 9, 'M': 10, '-': 11, 't': 12, 'g': 13, 'u': 14, 'I': 15, 'N': 16, 'z': 17, 'f': 18, 'G': 19, 's': 20, '?': 21, '6': 22, 'J': 23, 'U': 24, 'a': 25, 'l': 26, ';': 27, 'P': 28, 'q': 29, 'B': 30, 'R': 31, 'j': 32, 'S': 33, '0': 34, ')': 35, ' ': 36, ',': 37, '"': 38, 'ç': 39, '!': 40, 'h': 41, 'W': 42, 'k': 43, '5': 44, ':': 45, 'x': 46, '9': 47, 'i': 48, "'": 49, 'c': 50, '$': 51, 'w': 52, '.': 53, 'F': 54, 'V': 55, 'Y': 56, 'A': 57, 'd': 58, '1': 59, 'E': 60, 'C': 61, 'o': 62, '\n': 63, '7': 64, 'r': 65, '8': 66, 'D': 67, '*': 68, 'y': 69, 'v': 70, 'O': 71, 'n': 72, '@': 73, 'X': 74, '3': 75, 'H': 76, 'L': 77, '2': 78, '%': 79}
{0: 'b', 1: 'm', 2: '/', 3: '4', 4: 'T', 5: '(', 6: 'p', 7: 'Q', 8: 'e', 9: 'K', 10: 'M', 11: '-', 12: 't', 13: 'g', 14: 'u', 15: 'I', 16: 'N', 17: 'z', 18: 'f', 19: 'G', 20: 's', 21: '?', 22: '6', 23: 'J', 24: 'U', 25: 'a', 26: 'l', 27: ';', 28: 'P', 29: 'q', 30: 'B', 31: 'R', 32: 'j', 

In [29]:
vector_for_char_a = np.zeros((vocab_size, 1))
vector_for_char_a[char_to_ix['a']] = 1
print(vector_for_char_a.ravel())

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]


In [32]:
# parameters
hidden_size = 100
seq_length = 25
learning_rate = 1e-1

In [None]:
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01 # input to hidden to recurrent weight matrix
Why = np.random.randn(vocab_size, hidden_size) * 0.01 # input to hidden to output

The model parameters are adjusted during the trainning.

*Wxh* are parameters to connect a vector that contain one input to the hidden layer.<br>
*Whh* are parameters to connect the hidden layer to itself. This is the Key of the Rnn:
Recursion is done by injecting the previous values from the output of the hidden state, to itself at the next iteration.<br>
*Why* are parameters to connect the hidden layer to the output

bh contains the hidden bias<br>
by contains the output bias<br>

In [31]:
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

## Loss Functions

Let's talk about loss... functions which are at the heart of helping our algorithms learn and perform better each go around. Thus "minimizing" loss...

During "training" the loss function:

1. forward pass: calculate the next var given a var from training set
2. error: calculates the error of the predicted var given the actual var
3. backward pass: calculates the gradient

Inputs: input, target and previous hidden state
Outputs: loss, gradient for each parameters between layers, previous hidden state

#### Forward Pass

$$h_{t}=\phi(W x_{t}+U h_{t-1})$$

where $x_{t}$ is the vector that encodes the char at position t and $p_{t}$ is the probability for next char

"Dirty" pseudo-code:

`hs = input*Wxh + last_value_of_hidden_state*Whh + bh`<br>
`ys = hs*Why + by`<br>
`ps = normalized(ys)`<br>
    
#### Backward Pass

Makes uses of "Back Propogation" which calculates the gradient for all parameters at once where gradients are calculated in reverse order of the forward pass. 

"Dirty" pseudo-code:

`hs = input*Wxh + last_value_of_hidden_state*Whh + bh`<br>
`ys = hs*Why + by`<br>

The loss for 1 data point:

$$p_{k}=\frac{e^{f_{k}}}{\sum_{j}e^{f_{j}}}$$

$$L_{i}=-\log(p_{y_{i}})$$

#### Chain Rule (Our Ol' Friend from Calc 1)

The chain rule is used for finding the derivative of composite functions (multiple functions)

$$f'(x)=(g(h(x))'=g'(h(x))h'(x)$$

In [45]:
def lossFun(inputs, targets, hprev):
    # empty dict used to store our input states, hidden states, output states & probabilities
    xs, hs, ys, ps = {}, {}, {}, {}
    # xs = stores One Hot Encoded input chars 
    # hs = stores hidden state outputs
    # ys = stores targets (unnormalized probabilities)
    # ps = stores the converted ys & normlizes prob for chars
    
    hs[-1] = np.copy(hprev) # separate or disassociate hs from hprev 
    loss = 0 # initialize loss as 0
    
    #forward pass
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1 
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
        ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
        loss += -np.log(ps[t][targets[t], 0]) #softmax (cross-entropy loss)

    # backward pass: compute gradients going backwards
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):
        # output probabilities
        dy = np.copy(ps[t]) 
        # derive our 1st gradient
        dy[targets[t]] -= 1 #backprop into y
        # compute output gradient - output times hidden states transpose
        # when we apply the transpose weight matrix,
        # we can think intuitively of this as moving the error backward
        # through the network, giving us some sort of measure of the error
        # at the output of the 1th layer.
        # output gradient
        dWhy += np.dot(dy, hs[t].T)
        # backpropagate!
        dh = np.dot(Why.T, dy) + dhnext # backpop into h
        dhraw = (1 - hs[t] * hs[t]) * dh #backprop into tanh nonlinearity
        dbh += dhraw # derivative of hidden bias
        dWxh += np.dot(dhraw, xs[t].T) # derivative of input to hidden layer weight
        dWhh += np.dot(dhraw, hs[t-1].T) # derivative of hidden layer to hidden layer weight
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
        

#### Testing Prediction Time

In [46]:
# predictionm 1 full foward pass
def sample(h, seed_ix, n):
    """
    sample a sequence of integers from the model
    h is memory state, seed_ix is seed letter for first time step
    n is how many characters to predict
    """
    x = np.zeros((vocab_size, 1)) # create empty vector
    x[seed_ix] = 1 # customize it for our seed char
    ixes = [] # list to store generated chars
    
    for t in range(n):
        """
        a hidden state at a given time step is a function
        of the input at the same time step modified by a weight matrix
        added to the hidden state of the previous time step
        multiplied by its own hidden state to hidden state matrix
        """
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by # unnormalized output
        p = np.exp(y) / np.sum(np.exp(y)) # probabilities for next chars
        ix = np.random.choice(range(vocab_size), p=p.ravel()) # pick 1 w/ highest probability
        x = np.zeros((vocab_size, 1)) # create a vector
        x[ix] = 1 # customize it for the predicted char
        ixes.append(ix) # add it to the list
        
    txt = ''.join(ix_to_char[ix] for ix in ixes)
    print('----\n %s \n----' % (txt,))

hprev = np.zeros((hidden_size, 1)) # reset RNN memory
sample(hprev, char_to_ix['a'], 200) # predict the 200 next characters given 'a'

----
 nBz4Q.N5AST$hECvj8d
T57FAx@0G0
l-'ajOIi(s85N-!pF!*EU0Xn2A8"B?U5MkYbY;FF ':1$pmjQvof",clL;!@HçGKuL4D@QdKxUwDM@Uou3VXH;(7icUn qWm92;D.dRo8JC6hNh,ItX wkB!wHkAt%9bO;;!8hqOt?8/udBj,(y4,1!NC(oUf@vG;J,
B!(u8 
----


#### Training

Process:
- feed part of the training text with chunks size of `seq_length`
- loss function:
    - forward pass to calculate all parameters for the model for a given input/output pairs
    - backward pass to calculate all gradients
- print a sentence from a random seed using the parameters of the network
- update the model using Adaptive Gradient Technique Adagrad

#### Feed $L$ function w/ inputs & targets

Strategy:
- create 2 arrays: 1 from data file, another with targets shifted compared to input one


In [47]:
p = 0

inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
print("inputs", inputs)

targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
print("targets", targets)

inputs [71, 72, 8, 36, 1, 62, 65, 72, 48, 72, 13, 37, 36, 52, 41, 8, 72, 36, 19, 65, 8, 13, 62, 65, 36]
targets [72, 8, 36, 1, 62, 65, 72, 48, 72, 13, 37, 36, 52, 41, 8, 72, 36, 19, 65, 8, 13, 62, 65, 36, 33]


#### Adagrad to update parameters

Adagrad is a type of gradient descent strategy

$$\theta_{t}=\theta_{t-1}=\frac{\alpha}{\sqrt{1+\sum_{i}^t g_{i}^2}}g_{t}$$

where $\alpha$ = step_size = learning_rate

$g_{i}$ = dparam

`param += dparam * learning_rate` or $\alpha$
`mem += dparam * dparam` or $g_{i}^2$
`param += -learning_rate * dparam / np.sqrt(mem + 1e-8)`

#### Smooth Loss

Not sure what role using a filtered version of the loss plays here but according to the tutorial: "It is a way to average the loss on over the last iterations to better track the progress"

`smooth_loss = smooth_loss * 0.999 + loss * 0.001`

In [48]:
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # empty array for Adagrad

smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0

while n <= 1000*100:
    if p+seq_length+1 > len(data) or n == 0:
        hprev = np.zeros((hidden_size, 1)) # reset RNN memory
        p = 0 # go from start of data
    inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
    targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
    
    # forward seq_length chars through the net & fetch gradients
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    
    # sample from the model now & then
    if n % 1000 == 0:
        print('iter %d, loss %f' % (n, smooth_loss)) # print progress
        sample(hprev, inputs[0], 200)
        
    # perform parameter update w/ Adagrad
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                 [dWxh, dWhh, dWhy, dbh, dby],
                                 [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update
        
    p += seq_length # move data pointer
    n += 1 # iteration counter

iter 0, loss 109.550675
----
 Y5,at0rbAW)vC3@gE"Wç,LU"c7fUJç VIx?nE
;eNe2FwKp-wiCw@a5VmK5XPTHayDoL3"UDKA!mM@tlob /,a)çpYcs7Up2,2gE,*QUEzNxvg6bDY%i*Ok)Nzw8GVo'*;RLimVA(YWEml!O81-zXp'0p?:EY.L9:yfM7UR5Hk-2I)TN"!T:obM$$CCjIDk2lOBVaD4( 
----
iter 1000, loss 83.627964
----
  paw age wtGk he fou bort heimk tuawme "vf tomy lferte coure , Bed ove thulnvapo ad on erk an ang amreve  jaer kvaind am Is
te heaneereeher sate ant ifk chave wt. onit of hem os outd c-fhe inrd ff ome 
----
iter 2000, loss 66.080338
----
 n ree wher the rout wor, litea meneer rerertay whefrevelemom saon, coxin tout atpeom him houc soreeriinme dite wat. shangce beast has has mobrathe foreten lasto them mpether at toudte hem if oxt ik kt 
----
iter 3000, loss 57.284709
----
 ithing sraler stsrytr insaed would wuthurhe she awoubd wis him huy teregnte wi at che llacpulgint bomked wher se go if counlasiked he ronbes mleche soratintsiringe ifpor, att Grecppundwer. He easlais  
----
iter 4000, loss 53.083971
----
 b
xeved tithew

KeyboardInterrupt: 