### Prepare Dataset

<p>
<a href='http://www.paulgraham.com/articles.html' target='_blank'>Paul Graham essays</a>
</p>

## RNN 

In [4]:
import json
import numpy as np

### Load Dataset

<p>
 Let's take Paul Graham's essays and try to build a language model using a simple vanilla RNN.<br/>
 To keep it simple, we use a character-level model, which avoids the large vocabulary a word-level model would need.
</p>

In [17]:
with open('./paulgraham_essays.json', 'r') as f:
    data = json.load(f)

In [18]:
# Join all essays into one string, ending each essay with a newline.
txts = "".join(ele + '\n' for ele in data)

In [19]:
vocab = list(set(txts))
vocab_sz, n_char = len(vocab),len(txts)
print(f'vocab size:{vocab_sz}, data char size: {n_char}')

vocab size:103, data char size: 1845065


<p> The full data is too large for a first experiment, so let's start with a small number of chars by truncating it. </p>

In [20]:
max_sz = 10000
txts = txts[:max_sz]
vocab = list(set(txts))
vocab_sz, n_char = len(vocab),len(txts)
print(f'vocab size:{vocab_sz}, data char size: {n_char}')

vocab size:69, data char size: 10000


### Model

<p style="line-height:2">
   * Our model is a simple RNN (Recurrent Neural Network). <br/>
   * An RNN is a specific way of arranging neural network layers so that it can model sequence data such as text. <br/>
   * A regular feed-forward neural network (FNN) struggles with sequence data such as text. The problem is keeping track of dependencies within the sequence so that it can produce the next element, which for text may be the next word/char.<br/>
   * Problem 1: an FNN does not preserve the sequence order, but order matters in sequence data such as text. <br/>
   * Problem 2: an FNN takes the entire sequence in a single go, but we need to feed the sequence one element/char at a time and get the predicted next element/char, so that we can train the model by comparing the prediction against the actual next element/char. Doing this with an FNN would require a varying input size, but an FNN needs a pre-defined input and output size.<br/>
   * So an RNN is basically a modified FNN that handles the problems above and whose params can still be trained with backpropagation.<br/>
   
</p>
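
The idea of reusing one set of weights at every position can be sketched roughly as below. This is a toy illustration only: the sizes, the `rnn_step`, `run`, and `one_hot` helpers, and the weight names are all made up here and are not the model built later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_in, toy_hidden = 5, 3  # hypothetical sizes, for illustration only

# One set of weights, shared across every position in the sequence.
w_in = rng.standard_normal((toy_hidden, toy_in)) * 0.01
w_rec = rng.standard_normal((toy_hidden, toy_hidden)) * 0.01
b = np.zeros((toy_hidden, 1))

def one_hot(i):
    """Column vector with a 1 at position i."""
    v = np.zeros((toy_in, 1))
    v[i] = 1
    return v

def rnn_step(x, h_prev):
    """Combine the current input with the previous hidden state."""
    return np.tanh(w_in @ x + w_rec @ h_prev + b)

def run(seq):
    """Feed a sequence one element at a time, carrying the hidden state along."""
    h = np.zeros((toy_hidden, 1))
    for x in seq:
        h = rnn_step(x, h)
    return h

# Sequences of different lengths go through the same fixed-size weights.
h_short = run([one_hot(i) for i in [0, 1]])
h_long = run([one_hot(i) for i in [0, 1, 2, 3, 4, 1]])
print(h_short.shape, h_long.shape)

# Swapping the order of the inputs changes the final hidden state.
h_ab = run([one_hot(0), one_hot(1)])
h_ba = run([one_hot(1), one_hot(0)])
print(np.allclose(h_ab, h_ba))
```

The hidden state has the same size no matter how long the sequence is, and it depends on the order of the elements, which is what the two FNN problems above ask for.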

#### Input and Label Structure

<p style="line-height:2">
    * We can't feed the entire sequence into the network at once, so we split it into multiple small chunks that our system can handle one at a time.<br/>
    * Our goal is for the model to take one element at a time and produce the next element, so the label for each element is the element that follows it. <br/>
    * We also need to convert each char into a number, because the network can't process raw text. The simplest option is to assign a unique int to each char in our vocab.
</p>
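
One thing to keep in mind: the vocab above is built with `list(set(txts))`, and the iteration order of a set of strings can differ between Python sessions, so the char indices may not match from one run to the next. A small sketch of a reproducible mapping on a made-up string (the `toy_*` names are hypothetical and not part of this notebook's pipeline):

```python
toy_text = "hello world"  # hypothetical sample, not the essay data

# sorted() fixes the character order, so indices are stable across runs.
toy_vocab = sorted(set(toy_text))
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx_to_char = {i: ch for ch, i in toy_char_to_idx.items()}

# Encode to ints and decode back to check the mapping is lossless.
encoded = [toy_char_to_idx[ch] for ch in toy_text]
decoded = "".join(toy_idx_to_char[i] for i in encoded)

# Inputs and labels are the same sequence shifted by one position.
toy_inputs, toy_targets = encoded[:-1], encoded[1:]
print(toy_vocab)
print(decoded == toy_text)
```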

In [78]:
char_to_idx = {ch:i for i,ch in enumerate(vocab)}
idx_to_char = {i:ch for ch,i in char_to_idx.items()}

In [27]:
seq_sz = 25
inputs = txts[0:seq_sz]
targets = txts[1:seq_sz+1]

In [28]:
inputs

'January 2023\n\n(Someone fe'

In [29]:
targets

'anuary 2023\n\n(Someone fed'

In [80]:
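def get_dls(txts, n_char, seq_sz):
    # Slide a window over the text; targets are the inputs shifted by one char.
    # Stop once a full target window of seq_sz chars no longer fits.
    for i in range(n_char - seq_sz):
        inputs = [char_to_idx[ch] for ch in txts[i:i+seq_sz]]
        targets = [char_to_idx[ch] for ch in txts[i+1:i+1+seq_sz]]
        yield inputs, targets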

In [81]:
dls = get_dls(txts, n_char, seq_sz)

In [82]:
inputs,targets = next(dls)

In [85]:
inputs[:4]

[10, 60, 29, 11]

In [86]:
targets[:4]

[60, 29, 11, 60]

#### Model Parameters

<p style="line-height:2">
     * wxh : input-to-hidden weights. This part is basically a normal FNN layer that takes one element at a time. <br/>
     * whh : hidden-to-hidden weights, which take the previous element's hidden layer output as input. <br/>
     * why : hidden-to-output weights. The outputs of the first two are summed (plus the bias bh) and passed through tanh to give the hidden state, and why maps that hidden state to the logits for the next char. <br/>
     * bh, by : biases for the hidden state and for the output logits. <br/>
</p>
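
To see how these pieces fit together, here is a quick shape check. It is only a sketch: the sizes and the `toy_*` names are hypothetical, chosen so that running it does not overwrite the real params defined in this notebook.

```python
import numpy as np

toy_vocab_sz, toy_hidden_sz = 7, 4  # hypothetical sizes, for illustration only

toy_wxh = np.random.randn(toy_hidden_sz, toy_vocab_sz) * 0.01   # input -> hidden
toy_whh = np.random.randn(toy_hidden_sz, toy_hidden_sz) * 0.01  # previous hidden -> hidden
toy_why = np.random.randn(toy_vocab_sz, toy_hidden_sz) * 0.01   # hidden -> logits
toy_bh = np.zeros((toy_hidden_sz, 1))
toy_by = np.zeros((toy_vocab_sz, 1))

x = np.zeros((toy_vocab_sz, 1))  # one-hot vector for the current char
x[2] = 1
h_prev = np.zeros((toy_hidden_sz, 1))  # hidden state from the previous char

from_input = toy_wxh @ x      # contribution of the current char
from_prev = toy_whh @ h_prev  # contribution of the previous hidden state
h = np.tanh(from_input + from_prev + toy_bh)
logits = toy_why @ h + toy_by  # one raw score per char in the vocab
print(from_input.shape, from_prev.shape, h.shape, logits.shape)
```

Both contributions have the hidden size, so they can be added, and the logits have one entry per char in the vocab.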

In [90]:
hidden_sz = 100
input_sz = vocab_sz # one-hot vector

In [109]:
wxh = np.random.randn(hidden_sz,input_sz)*0.01
whh = np.random.randn(hidden_sz, hidden_sz)*0.01
why = np.random.randn(vocab_sz, hidden_sz)*0.01
bh = np.zeros((hidden_sz, 1))
by = np.zeros((vocab_sz, 1))

#### Forward Pass

In [133]:
def forward(inputs,targets):
    allhs,alllogits,allps,alleles = {},{},{},{}
    allhs[-1] = np.zeros((hidden_sz, 1)) # initial hidden state
    loss = 0
    for t in range(len(inputs)):
        ele_t = np.zeros((vocab_sz, 1)) # one-hot representation
        ele_t[inputs[t]] = 1
        alleles[t] = ele_t
        hs = np.tanh(np.dot(wxh, ele_t) + np.dot(whh, allhs[t-1]) + bh) # hidden state
        allhs[t] = hs
        logits = np.dot(why, hs) + by # raw score for each char
        alllogits[t] = logits
        ps = np.exp(logits)/np.sum(np.exp(logits)) # get probs for each char
        allps[t] = ps
        loss += -np.log(ps[targets[t]][0]) # cross-entropy for the actual next char
    return loss,allhs,alllogits,allps,alleles

In [138]:
loss,*_ = forward(inputs, targets)

In [139]:
loss

105.85149713440114

If we assume that the model initially picks each char with uniform probability, we can estimate what the initial loss should be.

In [143]:
-np.log(1/vocab_sz)*seq_sz

105.85266261493149
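
The two numbers are close because the weights start near zero and the biases at zero, so every logit is close to 0 and the softmax is almost uniform. A small sketch of that reasoning, using exactly-zero logits as a simplifying assumption and the vocab size (69) and sequence length (25) printed earlier (the `toy_*` names are hypothetical):

```python
import numpy as np

toy_vocab_sz, toy_seq_sz = 69, 25  # values printed earlier in this notebook

# Assumption: all logits are exactly 0 (the real ones are only close to 0).
logits = np.zeros((toy_vocab_sz, 1))
ps = np.exp(logits) / np.sum(np.exp(logits))  # softmax gives a uniform distribution

per_char_loss = -np.log(ps[0, 0])        # identical for every target char
total_loss = per_char_loss * toy_seq_sz  # summed over the whole sequence
print(total_loss)
```

This comes out to about 105.85, the same as `-np.log(1/vocab_sz)*seq_sz` above.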

#### Calculate Gradient of the Loss Function based on Params

<p style="line-height:2">
    * The gradient is the rate of change of a value with respect to the params; here, the rate of change of the loss with respect to our weights/biases.<br/>
    * It tells us how much the loss changes if we change each weight/bias param slightly. <br/>
    * We can use this information to move each weight/bias in the direction that reduces the loss. <br/>
    * But we can't directly use the gradient value of a weight/bias as the update.
    
</p> 
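
To make "change each param slightly and watch the loss" concrete, here is a rough sketch on a made-up two-parameter loss. The `toy_loss` and `numerical_grad` helpers and the `step_sz` value are assumptions for illustration only; this is a finite-difference estimate, not the backpropagation used for the RNN.

```python
import numpy as np

def toy_loss(w):
    """Hypothetical loss with its minimum at w = [1, -2]."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def numerical_grad(f, w, eps=1e-5):
    """Estimate the gradient by nudging each param slightly in both directions."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

w = np.array([0.0, 0.0])
grad = numerical_grad(toy_loss, w)

step_sz = 0.1  # hypothetical step size, an assumption of this sketch
w_new = w - step_sz * grad  # move against the gradient
print(grad, toy_loss(w), toy_loss(w_new))
```

Moving a small step against the gradient lowers the loss here, which is the behaviour the bullets above describe.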