### Prepare Dataset

<p>
<a href='http://www.paulgraham.com/articles.html' target='_blank'>Paul Graham essays</a>
</p>

## RNN 

In [4]:
import json
import numpy as np

### Load Dataset

<p>
 Let's take Paul Graham's essays and try to build a language model with a simple vanilla RNN.<br/>
 To keep things simple, we use a character-level model, which avoids the large vocabulary a word-level model would need.
</p>

In [17]:
# The loop in the next cell treats `data` as a collection of essay strings.
with open('./paulgraham_essays.json', 'r') as f:
    data = json.load(f)

In [18]:
# Concatenate all essays into one string, ending each essay with a newline.
txts = ''.join(ele + '\n' for ele in data)

In [19]:
# Sort the unique characters so the vocabulary order is the same on every run.
vocab = sorted(set(txts))
vocab_sz, n_char = len(vocab), len(txts)
print(f'vocab size:{vocab_sz}, data char size: {n_char}')

vocab size:103, data char size: 1845065


<p> The full dataset is large, so let's start with a small subset by truncating the text to its first 10,000 characters. </p>

In [20]:
max_sz = 10000
txts = txts[:max_sz]
# Rebuild the vocabulary from the truncated text (sorted for a reproducible order).
vocab = sorted(set(txts))
vocab_sz, n_char = len(vocab), len(txts)
print(f'vocab size:{vocab_sz}, data char size: {n_char}')

vocab size:69, data char size: 10000
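
<p>
 A character-level model works on integer indices rather than raw characters. The sketch below shows one common way to map characters to indices and back. The names <code>char_to_ix</code> and <code>ix_to_char</code> and the toy string are assumptions for illustration only and are not defined elsewhere in this notebook.
</p>

```python
# Toy text standing in for `txts`; the helper names below are hypothetical.
sample = "hello world"

# Sorted so that each character always gets the same index.
vocab = sorted(set(sample))
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
ix_to_char = {i: ch for i, ch in enumerate(vocab)}

# Encode the text as a list of indices, then decode it back to check the round trip.
encoded = [char_to_ix[ch] for ch in sample]
decoded = ''.join(ix_to_char[i] for i in encoded)

print(f'vocab size: {len(vocab)}')
print(encoded)
print(decoded)
```

<p>
 With a vocabulary this small, every character can later be represented as a short vector whose length is the vocabulary size.
</p>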


### Model

<p style="line-height:2">
   * Our model is a simple RNN (Recurrent Neural Network). <br/>
   * An RNN is a specific way of arranging neural network layers so that it can model sequence data such as text. <br/>
   * A regular feed-forward neural network (FNN) struggles with sequence data: it cannot keep track of dependencies within the sequence, which it needs in order to produce the next element (for text, the next word or character).<br/>
   * Problem 1: in an FNN the order of the sequence is lost, but order matters in sequence data such as text. <br/>
   * Problem 2: an FNN takes the entire sequence in one go, whereas we need to feed the sequence one element/char at a time and get a prediction for the next element/char, so that we can train the model by comparing each prediction with the actual next element/char. Doing this with an FNN would require a varying input size, but an FNN needs fixed, pre-defined input and output sizes.<br/>
   * So an RNN is basically a modified FNN that handles the problems above and whose parameters can still be trained with backpropagation.<br/>
   
</p>
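
<p>
 The points above can be illustrated with a minimal sketch of a single vanilla RNN step. It is not this notebook's model: the weight names (<code>Wxh</code>, <code>Whh</code>, <code>Why</code>), the sizes, the one-hot input encoding and the <code>rnn_step</code> helper are all assumptions chosen for illustration.
</p>

```python
import numpy as np

# Illustrative sizes and weights only; these are assumptions, not the notebook's parameters.
rng = np.random.default_rng(0)
vocab_sz, hidden_sz = 5, 8
Wxh = rng.normal(0, 0.01, (hidden_sz, vocab_sz))    # input -> hidden
Whh = rng.normal(0, 0.01, (hidden_sz, hidden_sz))   # hidden -> hidden (the recurrence)
Why = rng.normal(0, 0.01, (vocab_sz, hidden_sz))    # hidden -> output scores
bh, by = np.zeros(hidden_sz), np.zeros(vocab_sz)

def rnn_step(x_ix, h_prev):
    """Consume one character index; return next-char probabilities and the new hidden state."""
    x = np.zeros(vocab_sz)
    x[x_ix] = 1.0                                    # one-hot encode the input character
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)         # hidden state summarizes the chars seen so far
    scores = Why @ h + by
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                             # softmax over the vocabulary
    return probs, h

# The same weights are reused at every step, so the sequence can have any length.
h = np.zeros(hidden_sz)
for x_ix in [0, 3, 1, 4]:
    probs, h = rnn_step(x_ix, h)

print(probs.shape, h.shape)
```

<p>
 The hidden state <code>h</code> is what lets the network see one character at a time while still depending on the order of the earlier ones.
</p>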

#### Input and Label Structure

<p style="line-height:2">
    * We can't feed the entire sequence into the network at once, so we split it into many small chunks that the system can handle one at a time.<br/>
    * The model should take one element at a time and produce the next one, so the label for each input element is the element that follows it in the sequence.
</p>

In [27]:
seq_sz = 25
inputs = txts[0:seq_sz]
targets = txts[1:seq_sz+1]

In [28]:
inputs

'January 2023\n\n(Someone fe'

In [29]:
targets

'anuary 2023\n\n(Someone fed'

In [63]:
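def get_dls(txts, n_char, seq_sz):
    """Yield (input, target) chunks of length seq_sz; the target is the input shifted by one char."""
    # Stop early enough that the target chunk never runs past the end of the text.
    for i in range(n_char - seq_sz):
        yield (txts[i:i+seq_sz], txts[i+1:i+1+seq_sz])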
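
<p>
 To see what the generator produces, here is a small usage sketch on a toy string. The string <code>toy</code> is an assumption standing in for <code>txts</code>, and <code>get_dls</code> is restated in compact form only so the snippet runs on its own.
</p>

```python
def get_dls(txts, n_char, seq_sz):
    # Restated from the cell above so this snippet is self-contained.
    for i in range(n_char - seq_sz):
        yield (txts[i:i+seq_sz], txts[i+1:i+1+seq_sz])

toy = "abcdef"  # toy stand-in for `txts`
pairs = list(get_dls(toy, len(toy), seq_sz=3))

# Each target is its input shifted one character to the right.
for x, y in pairs:
    print(repr(x), '->', repr(y))
```

<p>
 The window moves forward one character at a time, so consecutive chunks overlap in all but one character.
</p>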

#### Model Parameters