* Base RNN architecture: https://github.com/karpathy/randomfun/blob/master/min-char-rnn-nb.ipynb
* Nice info on architecture: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
* Some code structure taken from: https://medium.com/@CallMeTwitch/building-a-neural-network-zoo-from-scratch-the-recurrent-neural-network-9357b43e113c
* Backprop: https://mkffl.github.io/2019/07/08/minimalist-RNN.html

Code reimplemented in pytorch (for practice). It is mix of info from multiple resources, plus bunch of custom stuff.

General goal is to learn list of words and when given beginning of the word, to predict the rest.

In [9]:
import numpy as np, torch, os, re
from collections import Counter
from IPython.display import Image
from typing import List
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# RNN (Recurrent Neural Network)
RNN is an architecture that utilizes a basic building block - a cell, to iteratively and cyclically analyze data of unlimited length and give a prediction on next entry. 

Data point passes through the cell and long-term information is imprinted on a temporal layer - *hidden state*.

images from :
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks


<center><img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/architecture-rnn-ltr.png" align="center" width=660 height=200 style="background-color:white"/>

(Unfolded view of RNN achitecture. Hidden state (blue/green) of iteration t is shown as a^t)</center>

***
<center><img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/description-block-rnn-ltr.png" align="center" width=460 height=260 style="background-color:white"/>

(Contents of RNN hidden state cell)</center>

Contents of hidden state is shaped each iteration and depends on values of  previous hidden state $a^{<t-1>}$ and current input $x^{<t>}$:

$$
a^{<t>} = g_1 ( W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)
$$

And an output of a cell (relevant for final evaluation) is:
$$
y^{<t>} = g_2 ( W_{ya} a^{<t>} + b_y)
$$

where $g_1$ and $g_2$ are activation functions

***
It is more common to denote hidden layer as $h$. So we will rename entries accordingly:
$$
\begin{cases}
h_{t} = g_1 ( W_{hh} h_{t-1} + W_{xh} x_{t} + b_h) \\
y_{t} = g_2 ( W_{hy} h_{t} + b_y)
\end{cases}
$$
***

# Data
Take a segment of shakespeare, split it into characters.

from 
https://github.com/karpathy/char-rnn/tree/master/data/tinyshakespeare

In [10]:
data = open(os.path.join('data_processing','media','tinyshakespeare.txt'), 'r').read() # should be simple plain text file
data = re.sub(r'[^a-zA-Z\s]', '', data) # remove non-alphabet
seq_length = 8
words_all = data.split()
words = list(set([w for w in words_all if len(w) == seq_length]))
chars_all = [c for w in words for c in w] #[*data]
chars = list(set(chars_all))  #list(set(data))
data_size, vocab_size = len(words), len(chars)
print('data has %d words of size %d, and %d unique characters.' % (len(words), seq_length, vocab_size))
'Chars: ' + ' '.join(chars)

data has 1975 words of size 8, and 50 unique characters.


'Chars: m P I E N l K Z y U q Y n g S h r L z v T A M O e w W i d V u R k H a p f o G C b c x s B t j J F D'

Collect all characters and encode them into one-hot-vector

Bonus: sort words by use frequency, so most common characters are in the beginning

In [11]:
frequency = dict(Counter(chars_all))
vocab_unique = list(sorted(frequency.keys(), key=lambda x: frequency[x], reverse=True))
char_to_ix = { ch:i for i,ch in enumerate(vocab_unique) }
ix_to_char = { i:ch for i,ch in enumerate(vocab_unique) }
'Sorted chars: ' + ' '.join(vocab_unique)

'Sorted chars: e s r i n t a o d l c u g h p m f b y v w k C S A R M I E q B T D P O L j x N U H F z V W G K J Y Z'

Each letter is encoded into one-hot-vector (column), a word is a collection of letters-> <br>
matrix of shape (len(alphabet),len(word))

In [12]:
def oneHotEncode(chars: List[str]) -> torch.Tensor:
    """
        lets hold encoded vectors as columns
    """
    encode = torch.zeros(len(chars),vocab_size)
    x,y = torch.tensor([(i,char_to_ix[a]) for i,a in enumerate(chars)]).T
    encode[x,y] = 1
    return encode.T
split_word = [*words[0]];print(split_word)
word_index = [char_to_ix[a] for a in split_word];print(word_index)
oneHotEncode(split_word)[:10,:]

['s', 'c', 'a', 'l', 'd', 'i', 'n', 'g']
[1, 10, 6, 9, 8, 3, 4, 12]


tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0.]])

Make a training set of N words.<br>
word: ['c', 'o', 'm', 'f', 'o', 'r', 't', 's']<br>
During training it will be sliced into [:-1] and [1:] parts

In [13]:
num_words_max = 20
X_train = torch.zeros(size=(num_words_max, vocab_size, seq_length), device=device) # words, chars, seq of chars
for i in range(num_words_max):
    X_train[i] = oneHotEncode(words[i])

len(X_train)

20

Initialize network weights by 
https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/

In [14]:
# Xavier Normalized Initialization
def initWeights(input_size: int, output_size: int) -> torch.Tensor:
    rnd = torch.rand(size=(input_size, output_size),device=device)
    rnd = 2*rnd - 1
    return rnd * torch.sqrt(torch.tensor(6) / (input_size + output_size))

initWeights(2,2)

tensor([[ 0.3956,  0.6284],
        [-0.5509,  0.4450]], device='cuda:0')

***
## Weights and biases:
Given that
$$
\begin{cases}
h_{t} = g_1 ( W_{hh} h_{t-1} + W_{xh} x_{t} + b_h) \\
y_{t} = g_2 ( W_{hy} h_{t} + b_y)
\end{cases}
$$
and $x_{t}$ thus $y_{t}$ will be stored as columns of shape $(N_x,1)$<br>
and $h_{t}$ will be stored as columns of shape $(N_h,1)$

Then dimensions of weights and biases are as follows: 
$$\begin{cases}

\mathrm{dim}(W_{hh})  = (N_h,N_h) \text{, such that } (N_h,N_h) , (N_h, 1) \rightarrow (N_h, 1) \\
\mathrm{dim}(W_{xh})  = (N_h,N_x) \text{, such that } (N_h,N_x) , (N_x, 1) \rightarrow (N_h, 1) \\
\mathrm{dim}(W_{yh})  = (N_x,N_h) \text{, such that } (N_x,N_h) , (N_h, 1) \rightarrow (N_x, 1) \\
\mathrm{dim}(b_h)     = \mathrm{dim}(h_t)= (N_h, 1)                                             \\
\mathrm{dim}(b_y) = \mathrm{dim}(y_t)= (N_x, 1)
\end{cases}
$$

*** 
## Loss function:
Cross-entropy can be derived from K-L Divergence.<br> 
Similarly it compares two distributions $p$ and $q$.
$$H(p,q) = -\sum_{x\in\mathcal{vocabulary}} p(x)\, \log q(x)$$
In our case, we are comparing one-hot-encoded representation of a character $p(x)$,<br>
with only 1 non zero entry, to output of NN, which is its guess distribution.

$p(x)$ has only one entry of '1' at index $k$, then the sum is not required- can replace $p(x)$ with $\delta_{x,k}$
$$H(p,q) = -\sum_{x\in\mathcal{vocabulary}} \delta_{x,k}\ \log q(x) = -\log q(k) $$


In [15]:
##### Recurrent Neural Network Class #####
class RNN:
    """
        input_size- one-hot length (vocabulary); hidden_size - internal buffer; output_size - negative/positive mood = 2
    """
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        # Network
        self.Whh = initWeights(hidden_size, hidden_size)
        self.Wxh = initWeights(hidden_size, input_size)
        self.Wyh = initWeights(output_size, hidden_size)

        self.bh = torch.zeros((hidden_size, 1),device=device)
        self.by = torch.zeros((output_size, 1),device=device)

        self.g1 = torch.tanh
        self.hs0 = torch.zeros(size=(hidden_size, 1),device=device)

        self.vocab_len = input_size

        
    def softmax(self,x: torch.Tensor) -> torch.Tensor:
        y = torch.exp(x)
        return y / torch.sum(y)

    # Forward Propagation
    def forward(self, xs: torch.Tensor, hprev: torch.Tensor):
        vocab_len, num_data = xs.size()

        ps = torch.zeros_like(xs, device=device)   # same number of predictions as input. except staggered in time.
        hs = torch.zeros(size=(self.bh.size(0), num_data + 1), device=device)   # + 1 'zeroth' hidden state
        hs[:,[0]] = hprev.clone()
        
        for t in range(num_data):
            # process this like slices, not explicit matrix mults.
            a = self.Wxh @ xs[:,t]
            b = self.Whh @ hs[:,t]              # t here is prev step
            c = self.bh.squeeze(1)
            hs[:,t+1]   = self.g1(a + b + c)    # t+1 is current step
            #hs[:,[t+1]] = self.g1(self.Wxh @ xs[:,[t]] + self.Whh @ hs[:,[t]] + self.bh)
            y   = self.Wyh @ hs[:,t + 1] + self.by.squeeze(1)
            ps[:,t]   = self.softmax(y)

        self.hs = hs
        self.ps = ps

        return hs[:,[-1]], ps[:,[-1]]
    
    def test(self, input, first_n_chars = 1):
        """
            Input is a whole word.
            * Part of data will be used to determine latest hidden state ->
            -> data columns up to index 'first_n_chars'                             i.e for first_n_chars = 1, cols [0]
            * store known columns in a  solution.                                   i.e cols [0]
            * predict column index 'first_n_chars', generate latest hidden state    i.e cols [1]
            * store prediction
            * start iterating from i = 'first_n_chars + 1' and store predictions    i.e i = [2,3,..]

        """
        store = torch.zeros(size=(self.vocab_len, input.size(1)), device=device)
        store[:,:first_n_chars] = input[:,:first_n_chars]
        # develop hidden state from known input data and get latest prediction:
        hidden_state, prediction = self.forward(input[:,:first_n_chars], self.hs0)

        store[:,[first_n_chars]] = prediction

        for i in range(first_n_chars + 1, store.size(1)):
            hidden_state, prediction = self.forward(prediction, hidden_state)
            store[:,[i]] = prediction
        # get index of one highest entry. 
        
        known = torch.topk(store[:,:first_n_chars], dim = 0, k = 1)[1].view(-1)
        rest = torch.topk(store[:,first_n_chars:], dim = 0, k = 1)[1].view(-1)

        true = torch.topk(input, dim = 0, k = 1)[1].view(-1)
        outp = 'true: ' + ''.join([ix_to_char[int(i)] for i in true])   + '; guess: ('
        outp += ''.join([ix_to_char[int(i)] for i in known])            + ')' 
        outp += ''.join([ix_to_char[int(i)] for i in rest])

        print(outp)
        return


    def backward(self,  xs: torch.Tensor, tar: torch.Tensor):
        dWxh    = torch.zeros_like(self.Wxh, device=device)
        dWhh    = torch.zeros_like(self.Whh, device=device)
        dWyh    = torch.zeros_like(self.Wyh, device=device)

        dbh     = torch.zeros_like(self.bh, device=device)
        dby     = torch.zeros_like(self.by, device=device)
        dhnext  = torch.zeros_like(self.bh, device=device)

        for t in reversed(range(xs.size(1))):
            hs_col  = self.hs[:,[t+1]]   # hidden state column
            ps_col  = self.ps[:,[t]]     # prediction column
            xs_col  = xs[ :,[t]]         # og value column
            tr_col  = tar[:,[t]]         # target column
            dy      = ps_col.clone()
            dy[tr_col.to(bool)] -= 1 
            dWyh    += dy @ hs_col.T
            dby     += dy
            dh = self.Wyh.T @ dy + dhnext # backprop into h
            dhraw   = (1 - hs_col * hs_col) * dh    # d-tanh
            dbh     += dhraw
            dWxh    += dhraw @ xs_col.T
            dWhh    += dhraw @ hs_col.T
            dhnext  =  self.Whh.T @ dhraw
            
        for dparam in [dWxh, dWhh, dWyh, dbh, dby]:
            torch.clip(dparam, -5, 5, out=dparam)

        self.Wxh += -self.learning_rate * dWxh
        self.Whh += -self.learning_rate * dWhh
        self.Wyh += -self.learning_rate * dWyh    
        self.bh  += -self.learning_rate * dbh
        self.by  += -self.learning_rate * dby


    def loss(self, ps_true):
        """
            1) extract q(k) by elem-wise mult with mask p(x)
            2) q(k) for all iterations -> take log()
            3) -sum( logs ) -> total error
            -----------------------------------------
            c       = torch.tensor([[5,3,6],[1,8,3]]).T
            b       = torch.tensor([[1,0,0],[0,0,1]]).T
            c * b   = torch.tensor([[5,0,0],[0,0,3]]).T
            sum(d0) = torch.tensor([5,3])   # sum is 'reduce'
            torch.sum(b * c, dim=0)
        """
        mask_sum = torch.sum(self.ps * ps_true, dim=0)
        return -torch.sum(torch.log(mask_sum))
    
    def train(self, inputs, num_epochs, learning_rate):
        self.learning_rate = learning_rate
        for i in range(num_epochs):

            shuffle_words = inputs[torch.randperm(len(inputs))]
            loss = 0
            for word in shuffle_words:
                input  = word[:,:-1]
                output = word[:,1:]
                hprev = self.hs0  # for epoch start reset hidden state.

                self.forward(input, hprev)  #
                #hprev = self.hs[:,[-1]]
                loss += self.loss(output)
                
                self.backward(input, output)
                
            if i % (num_epochs//10) == 0:
                print(f'epoch: {i:<3}, error:{loss:0.3f}')
                idx = torch.randint(len(inputs),(1,))[0]
                self.test(inputs[idx], 3)
            
net = RNN(input_size=vocab_size,hidden_size=100, output_size=vocab_size)

In [16]:
net.train(X_train, num_epochs = 1000, learning_rate=0.002)

epoch: 0  , error:548.805
true: anothers; guess: (ano)Hando


epoch: 100, error:138.712
true: scatters; guess: (sca)lding
epoch: 200, error:53.599
true: MONTAGUE; guess: (MON)TAGUE
epoch: 300, error:32.885
true: MONTAGUE; guess: (MON)TAGUE
epoch: 400, error:24.454
true: acquaint; guess: (acq)uaint
epoch: 500, error:20.526
true: grinning; guess: (gri)nning
epoch: 600, error:18.471
true: drunkard; guess: (dru)nkard
epoch: 700, error:16.831
true: homicide; guess: (hom)icide
epoch: 800, error:16.877
true: magician; guess: (mag)ician
epoch: 900, error:15.175
true: drinking; guess: (dri)nking
