In [17]:
import numpy as np
import pandas as pd

In [18]:
data = pd.read_csv("clean_weather.csv",index_col=0)
data = data.ffill()
data.head()


Unnamed: 0,tmax,tmin,rain,tmax_tomorrow
1970-01-01,60.0,35.0,0.0,52.0
1970-01-02,52.0,39.0,0.0,52.0
1970-01-03,52.0,35.0,0.0,53.0
1970-01-04,53.0,36.0,0.0,52.0
1970-01-05,52.0,35.0,0.0,50.0


In [19]:
temps = data["tmax"].tail(3).to_numpy()
temps

array([66., 70., 62.])


### Forward Pass and Prediction

Let's go through an example to see how this works. We'll initialize each weight matrix, then perform a sample forward pass with 3 sequence elements:

1. **Input Weights (`i_weights`)**: These weights connect the input to the hidden layer.
2. **Hidden Weights (`h_weights`)**: These weights connect the hidden state from the previous time step to the current hidden state.
3. **Output Weights (`o_weights`)**: These weights connect the hidden layer to the output.

The forward pass involves calculating the following for each time step:
- **Input to Hidden (`XI_t`)**: The contribution of the input at the current time step to the hidden state.
- **Hidden State (`XH_t`)**: The combined effect of the input and the previous hidden state, passed through an activation function (ReLU in this case).
- **Output (`XO_t`)**: The prediction for the next sequence element, based on the current hidden state.

We perform this process for 3 sequence elements (`x0`, `x1`, `x2`) using the initialized weights and inputs.
```

In [20]:
# input
x0 = temps[0].reshape(1,1)
x1 = temps[1].reshape(1,1)
x2 = temps[2].reshape(1,1)

In [21]:
#weights
np.random.seed(0)
i_weights = np.random.rand(1,2)
h_weights = np.random.rand(2,2)
o_weights = np.random.rand(2,1)
o_weights


array([[0.43758721],
       [0.891773  ]])

In [22]:
# calculate xi at time step 0
XI_0 = x0 @ i_weights
# there is no previous time step, so there is not going to be a hidden state
XH_0 = np.maximum(0, XI_0) # relu
# Output at time step 0 at xo_0
XO_0 = XH_0 @ o_weights
XH_0

array([[36.22169126, 47.20249818]])

In [23]:
XI_1 = x1 @ i_weights
XH = XH_0 @ h_weights
XH_1 = np.maximum(0,XH+XI_1)
XO_1 = XH_1@ o_weights
XO_1

array([[124.54916092]])

In [24]:
XI_2 = x2 @ i_weights
XH = XH_1 @ h_weights
XH_2 = np.maximum(0,XH+XI_2)
XO_2 = XH_2@ o_weights
XO_2

array([[190.94853131]])

We've now passed through 3 forward steps of our RNN! The output x0 at each time step is the prediction for the next element in the sequence.

The hidden state of the RNN allows the network to have information about all past sequence elements. So when we're processing the sequence item at time step 2, the hidden state of the RNN stores information about the sequence elements at time step 0 and 1.

## Forward pass but with tanh as an activation function also lets make a function to do the forward pass


We'll also scale the weights and the bias to work properly with tanh nonlinearity. and we'll make our input and hidden weights small, so tanh doesn't squash all the values to 1 or -1. we'll also make the output weight large, since the output of the hidden step will be small. and yeah RNN, the network would eventually learn the correct parameters. but initializing the weights this way is a good start and helps with the gradient descent.

In [25]:
np.random.seed(0)
# weights and bias
#scaling the weights and bias so values get through tanh nonlinearity
i_weights = np.random.rand(1,5) / 5 - .1 # small weights
h_weights = np.random.rand(5,5)/ 5 - .1 # small weights
o_weights = np.random.rand(5,1) * 50 #large weights
h_bias = np.random.rand(1,5)/ 5 - .1
o_bias = np.random.rand(1,1)


In [26]:
#lets run a the forward pass on a loop
# also, this loop will process sequence elements one by and store the output predictoin and the hidden state

# array to store outputs and hiddens states
outputs = np.zeros(3)
hiddens = np.zeros((3,5))
sequence = data["tmax"].tail(3).to_numpy()
prev_hidden = None

for i in range(3):
    x  =  sequence[i].reshape(1,1)

    XI = x @ i_weights
    if prev_hidden is None:
        XH = XI
    else:
        XH = XH + prev_hidden @ h_weights + h_bias
    
    # activation fn
    XH = np.tanh(XH)
    prev_hidden = XH
    hiddens[i,] = XH

    XO = XH @ o_weights + o_bias
    outputs[i,] = XO

  outputs[i,] = XO


In [27]:
outputs

array([80.68122178, 72.99311119, 69.28433643])

In [37]:
hiddens

array([[ 0.56784618,  0.99320288,  0.87557333,  0.53166114, -0.76483255],
       [ 0.55326914,  0.76848706,  0.73664379,  0.68197976, -0.65821463],
       [ 0.55377633,  0.66710561,  0.66350698,  0.74597413, -0.59604942]])

In [30]:
def mse(actual,predicted):
    return np.mean((actual - predicted)**2)

def grad_mse(actual,predicted):
    return (actual-predicted)

In [31]:
actuals = np.array([70, 62, 65])
loss_grad = grad_mse(actuals,outputs)
loss_grad

array([-10.68122178, -10.99311119,  -4.28433643])


### Backpropagation in RNNs

Backpropagation in RNNs involves calculating the gradients of the loss function with respect to the weights (`i_weights`, `h_weights`, `o_weights`) and biases (`h_bias`, `o_bias`) through time. This process, called Backpropagation Through Time (BPTT), adjusts the parameters to minimize the prediction error (`outputs` vs. actual values). Gradients are computed by propagating errors backward from the output layer to the hidden states (`hiddens`) and inputs (`sequence`), considering the temporal dependencies in the network.


In [45]:
o_weight_grad, o_bias_grad, h_weight_grad, h_bias_grad, i_weight_grad = [0] * 5


## manual backpropagation for a single timestep (t=2)


In [None]:
# Get the loss wrt the output at the current time step
l2_grad = loss_grad[2].reshape(1,1)

# Add to the output weight gradient
# Multiply the output of the hidden step (hiddens[2]) transposed by the l2 grad
# np.newaxis creates a new size 1 axis, effectively transposing the hiddens
o_weight_grad += hiddens[2][:,np.newaxis] @ l2_grad
# Add to the bias gradient.  Similar to a dense neural network, this is just the mean of the l2_grad.
o_bias_grad += np.mean(l2_grad)

h2_grad = l2_grad @ o_weights.T

#derivative of tanh
tanh_deriv = 1 - hiddens[2,:][np.newaxis,:] **2
# Multiply each position in the h_grad by the tanh derivative - this "undoes" the tanh in the forward pass
h2_grad = np.multiply(h2_grad,tanh_deriv)


# Now, find how much we need to update the hidden weights.
# We take the input to the hidden step (the output of the previous hidden step in the forward pass) @ h2_grad
h_weight_grad += hiddens[1,:][:,np.newaxis] @ h2_grad
h_bias_grad += np.mean(h2_grad)
# This multiples the sequence value at time step 2 by the gradient
# We don't need the .T here, but I left it here in case you have a larger input size
i_weight_grad += sequence[2].reshape(1,1).T @ h2_grad



## manual backpropagation for a single timestep (t=1)


In [54]:
l1_grad = loss_grad[1].reshape(1,1)
o_weight_grad += hiddens[1][:,np.newaxis] @l1_grad
o_bias_grad += np.mean(l1_grad)


h1_grad = l1_grad @ o_weights.T

h1_grad += h2_grad @ h_weights.T 

tanh_deriv = 1 - hiddens[1,:][np.newaxis,:] **2
h1_grad = np.multiply(h1_grad,tanh_deriv)

h_weight_grad += hiddens[1,:][:,np.newaxis] @ h1_grad
h_bias_grad += np.mean(h1_grad)
i_weight_grad += sequence[1].reshape(1,1).T @ h1_grad



Now, we can do the final sequence position, 0. The main difference here is that we don't update the hidden gradient, since there is no previous sequence position that gave us hidden state input in the forward pass:



## manual backpropagation for a single timestep (t=0)


In [53]:
l0_grad = loss_grad[0].reshape(1,1)
o_weight_grad += hiddens[0][:,np.newaxis] @l0_grad
o_bias_grad += np.mean(l0_grad)

h0_grad = l0_grad @ o_weights.T
h0_grad = h1_grad @ h_weights.T

tanh_deriv = 1 - hiddens[0,:][np.newaxis,:] ** 2
h0_grad = np.multiply(h0_grad,tanh_deriv)
i_weight_grad += sequence[0].reshape(1,1).T @ h0_grad


In [55]:
i_weight_grad

array([[-11095.10131226, -18151.59749784, -11121.5552512 ,
        -16872.95024558,   -581.47427619]])