# RNN model of URIs

The Python code from this notebook has now been developed into two scripts:
<ol>
<li> <tt>train_rnn.py</tt>
<code>
Usage: python  ./python/train_rnn.py infile [options]
Options are:
        -o outfile 
        -datapath [./sdata]
        -modelpath [./models]
        -unroll [20]
        -step [3]
        -dropout [0.1]
        -niters [10]
        -arch (hidden layer sizes) [16]
</code>
<li> <tt>make_synthetic.py</tt>
<code>
Usage: python  ./python/make_synthetic.py infile [options]
Options are:
        -n length of output [1000]
        -init        [newlines]
        -temp        [1.0]
        -modelpath        [./models]
</code>
</ol>
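
The `-unroll` and `-step` options are easiest to picture as a sliding window over the input string. The source of <tt>train_rnn.py</tt> is not shown here, so the following is only a sketch of the usual windowing scheme for this kind of model, and `make_windows` is a hypothetical helper name. It assumes each training example is `unroll` consecutive items, the target is the item that follows, and consecutive windows start `step` items apart. The `(None, 20, 32)` output shape in the model summary below is at least consistent with a window length of 20.

```python
def make_windows(data, unroll=20, step=3):
    """Cut `data` into (window, next_item) training pairs.

    Assumed scheme: windows of length `unroll`, started every `step` items,
    each paired with the single item that follows the window.
    """
    windows, targets = [], []
    for start in range(0, len(data) - unroll, step):
        windows.append(data[start:start + unroll])
        targets.append(data[start + unroll])
    return windows, targets


# Toy input: 100 integers standing in for 100 bytes.
data = list(range(100))
windows, targets = make_windows(data, unroll=20, step=3)

print(len(windows))       # number of training examples
print(windows[1][:3])     # second window starts `step` items after the first
print(targets[1])         # the item right after the second window
```

A smaller `step` gives more, heavily overlapping examples from the same string; a larger one trains faster on fewer examples.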

# Training a model

In [9]:
run -i train_rnn\
    smallstring.gz\
    -datapath ../sdata\
    -modelpath ../models\
    -arch 32,32\
    -niters 16

Checking hardware ...
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11201530596252466535
]
Model architecture:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 20, 32)            5248      
_________________________________________________________________
dropout_1 (Dropout)          (None, 20, 32)            0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               8448      
_________________________________________________________________
activation_5 (Activation)    (None, 256)               0  

# Synthetic domain string from model

In [10]:
run -i make_synthetic\
    model_from_big_domain_string_1.gz_arch_8_16_unroll_20_step_3_dropout_0.1_iter_30.npy\
    -modelpath ../models\
    -init '/tag/маша и медведь мультфильм'\
    -temp 0.7\
    -n 1000

  a = np.log(a) / temperature


/tag/маша и медведь мультфильм.jpashusgeranngeas-alettono/orim/prut/
/owtot-ease-cepar-comm
/soary/serm-bor-chafo-kart-mat-cats-av/miuttor.html
/hops/stecis/wates/pabeuistinge/tituriv/
/veraer/1606.jpg
/201_s/
/turcheno/Asen_erey/00003710160593/menas/
/artol-ceded/arotire-rachoes/fro-connely/
/brogo.html
/ese-ualec.html
/protor-kupp/stetor-stoderane-al-evder.html
/masenseos/8002161484353095791376027462.html
/pilasaine-prot/mend-imate-can/tag-ve-arthaces-choarl/cashiva-foray/festaw/2003_301/37890/10116957/enwuscegs/heis/2023400/dadel-hees-o-lurfe-132-
/sti-camusre-sen-freas-rewotant/a-nedia-calens-7.jpg
/pla-choboline-espertigitharar-gonoun
/motbas--adyobs-tenres-cortupa-socan/bamel/imgdne-sarariuvis/baela/radets/notjine-mb-cech
/adothi/sefsooy-esxenmth-iffa-tocu-1.html
/imlgarry/%a3/aruml-Saw_Cothimes/welt/mites/comage
/
/categoroac-conter/
/shonsed/2010014-1f-nes_cibickens/200//
/tpatad.4.phtidiiostat-hotipy/01/125/114-i-scuarle-jope-amtiut-intinael-abour-jr/cilaitirligg/wile/tag/timi
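
The `-temp` option, together with the line `a = np.log(a) / temperature` echoed in the output above, suggests that <tt>make_synthetic.py</tt> rescales the softmax output by a temperature before drawing the next byte. The sketch below shows that idea in plain Python; `rescale` and `sample` are hypothetical names, and the clamp to a small `eps` is my own addition to avoid taking the log of zero, not something taken from the script.

```python
import math
import random


def rescale(probs, temperature=1.0, eps=1e-12):
    """Return softmax(log(probs) / temperature) as a list of floats."""
    # Clamp to a tiny positive value so log(0) never occurs.
    logits = [math.log(max(p, eps)) / temperature for p in probs]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample(probs, temperature=1.0, rng=random):
    """Draw one index from the temperature-rescaled distribution."""
    r = rng.random()
    acc = 0.0
    rescaled = rescale(probs, temperature)
    for i, p in enumerate(rescaled):
        acc += p
        if r < acc:
            return i
    return len(rescaled) - 1  # guard against rounding in the running sum


probs = [0.7, 0.2, 0.1]
print(rescale(probs, 1.0))   # unchanged, up to rounding
print(rescale(probs, 0.5))   # sharper: more mass on the most likely item
print(sample(probs, 0.7))    # a random index in {0, 1, 2}
```

Temperatures below 1.0 (such as the 0.7 used above) make the output more conservative; temperatures above 1.0 make it more random.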

# Entropy of the model

How do we assess the performance of a model?

Below I'm using code from <tt>make_synthetic.py</tt> to compute the average prediction entropy on a test string of 10k bytes. 

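Lower is better: a low average entropy means the predicted next-byte distributions are sharply peaked, i.e. the model is confident about what comes next.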
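
To give the numbers further down some scale, here is the same entropy formula (natural log, so the unit is nats) applied to a few hand-made distributions over 256 byte values. The three example vectors are made up for illustration only.

```python
import math


def entropy(pred):
    """Entropy in nats of a probability vector; zero entries contribute nothing."""
    return sum(-p * math.log(p) for p in pred if p > 0)


uniform = [1.0 / 256] * 256          # no idea at all which byte comes next
two_way = [0.5, 0.5] + [0.0] * 254   # torn between exactly two bytes
certain = [1.0] + [0.0] * 255        # completely sure of the next byte

print(entropy(uniform))   # the maximum possible value, ln(256)
print(entropy(two_way))   # ln(2)
print(entropy(certain))   # zero
```

So any average reported for a 256-way softmax lies between 0 and ln(256), roughly 5.55.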

In [4]:
import gzip
import numpy as np

datapath = "../sdata"
infile = "smallstring.gz"

with gzip.open("%s/%s" % (datapath, infile), 'rb') as f:
    content = f.read()

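# One two-digit hex string per byte. b.encode('hex') only exists on Python 2;
# going through bytearray() gives the same strings on Python 2 and 3.
testbytes = ['%02x' % b for b in bytearray(content)]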
n_bytes = len(testbytes)

In [5]:
modelfile = "../models/model_from_big_domain_string_1.gz_arch_8_16_unroll_20_step_3_dropout_0.1_iter_30.npy"
   
model_wts = np.load(modelfile)

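# First dimension of every 2-D weight matrix (biases are 1-D and are skipped).
arch = [w.shape[0] for w in model_wts if len(w.shape) == 2]

INSIZE = arch[0]        # input width: 8 (bit vector) or 256 (one-hot)
nh = len(arch) // 2     # number of LSTM layers; // keeps this an int on Python 3
nhidden = [arch[1 + 2*i] for i in range(nh)]
OUTSIZE = 256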
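
The cell above recovers the architecture purely from the shapes of the saved weight arrays. A small sketch with made-up shapes shows why taking every second entry works. It assumes the weight list stores three arrays per LSTM layer (input kernel, recurrent kernel, bias) followed by the dense kernel and bias; `infer_arch` is a hypothetical helper, and the shapes are written out by hand rather than loaded from a model file.

```python
def infer_arch(shapes):
    """Recover (input size, hidden layer sizes) from a list of weight shapes."""
    # Keep the first dimension of every 2-D matrix; 1-D biases are skipped.
    arch = [s[0] for s in shapes if len(s) == 2]
    insize = arch[0]
    n_lstm = len(arch) // 2
    nhidden = [arch[1 + 2 * i] for i in range(n_lstm)]
    return insize, nhidden


# Hand-written shapes for an 8-input model with two 32-unit LSTM layers.
# An LSTM with h units has 4*h columns (one block per gate).
shapes = [
    (8, 128), (32, 128), (128,),    # LSTM 1: kernel, recurrent kernel, bias
    (32, 128), (32, 128), (128,),   # LSTM 2: kernel, recurrent kernel, bias
    (32, 256), (256,),              # Dense: kernel, bias
]

print(infer_arch(shapes))   # (8, [32, 32])
```

With these shapes the first LSTM has 8*128 + 32*128 + 128 = 5248 parameters, which matches the `lstm_5` row of the model summary printed during training.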

In [6]:
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM

# Both names are used below but only assigned in later cells (In [7] and In [8]);
# define them here so the notebook also runs top to bottom.
nlayers = len(nhidden)
unroll = 20

# Dropout layers used in training carry no weights, so they are left out here.
model = Sequential()
if nlayers > 1:
    for i in range(nlayers-1):
        model.add(LSTM(nhidden[i],
                       return_sequences=True,
                       input_shape=(unroll, INSIZE)))
    model.add(LSTM(nhidden[nlayers-1],
                   return_sequences=False))
else:
    model.add(LSTM(nhidden[0],
                   return_sequences=False,
                   input_shape=(unroll, INSIZE)))
model.add(Dense(OUTSIZE))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# load current weights
model.set_weights(model_wts)

In [7]:
import math

def bn(x):
    """
    Binary representation of 0-255 as 8-long int vector.
    """
    bits = bin(x)[2:]
    while len(bits) < 8:
        bits = '0' + bits
    return [int(i) for i in bits]

def entropy(pred):
    """
    Entropy of a soft-max vector.
    """
    return sum([-p*math.log(p) for p in pred if p > 0])

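binabet = [bn(x) for x in range(256)]
# hexabet is not defined anywhere in this notebook; assumed here to be the 256
# two-digit lowercase hex strings, matching the encoding used for testbytes.
hexabet = ['%02x' % x for x in range(256)]
byte_idx = dict((c, i) for i, c in enumerate(hexabet))
nlayers = len(nhidden)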
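
The two input encodings that the next cell switches between (`INSIZE == 8` and `INSIZE == 256`) can be compared side by side on a single byte. This is only an illustration: `bits` does the same job as `bn` above, and `one_hot` is a hypothetical helper standing in for the inline assignment in the next cell.

```python
def bits(x):
    """8-long 0/1 vector for a byte value, most significant bit first."""
    return [int(c) for c in format(x, '08b')]


def one_hot(x, size=256):
    """Vector of `size` zeros with a single 1.0 at position x."""
    v = [0.0] * size
    v[x] = 1.0
    return v


byte = ord('/')             # 47, the URI path separator

print(bits(byte))           # [0, 0, 1, 0, 1, 1, 1, 1]
print(len(one_hot(byte)))   # 256
print(one_hot(byte).index(1.0))   # 47
```

The bit encoding keeps the input layer 32 times narrower, at the cost of making numerically close byte values look similar to the network even when they are unrelated characters.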

In [8]:
unroll = 20
x = np.zeros((1, unroll, INSIZE))

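def stepping(window):
    # Clear the buffer first: the one-hot branch only sets entries to 1.0,
    # so values left over from the previous window would otherwise accumulate.
    x[:] = 0
    if INSIZE == 8:
        for t,b in enumerate(window):
            x[0, t, :] = binabet[byte_idx[b]]
    elif INSIZE == 256:
        for t,b in enumerate(window):
            x[0, t, byte_idx[b]] = 1.0
    return model.predict(x, verbose=0)[0]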

window = testbytes[:unroll]
idx = unroll

entropies = []
while idx < n_bytes - unroll:
    preds = stepping(window)
    entropies.append(entropy(preds))
    next_byte = testbytes[idx]
    window = window[1:] + [next_byte]
    idx += 1

print(sum(entropies)/len(entropies))

2.76859589659


testmodel: 4.94551499849

big_string 10Ms on Macbook [INSIZE=256 ~60 iters]: 1.19513989953

big_domain_string on GPU [INSIZE=8 arch 16]: 

<ul>
<li> 1 iters 2.78878392504
<li> 7 iters 2.76468260125
<li> 17 iters 2.76468260125
<li> 30 iters 2.76859589659
</ul>
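
Because `entropy` uses the natural logarithm, all of the figures above are in nats. Dividing by ln 2 converts them to bits, which may be easier to read next to the 8 bits of a raw byte. A small sketch, with a few of the values above copied in by hand (`nats_to_bits` is a hypothetical helper):

```python
import math


def nats_to_bits(h):
    """Convert an entropy from nats (natural log) to bits (log base 2)."""
    return h / math.log(2)


# Values copied from the results listed in this section.
reported = [
    ("testmodel", 4.94551499849),
    ("big_string, INSIZE=256", 1.19513989953),
    ("big_domain_string, arch 16, 30 iters", 2.76859589659),
]

for label, h in reported:
    print("%-40s %.3f nats = %.3f bits" % (label, h, nats_to_bits(h)))
```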

big_domain_string on GPU [INSIZE=8 arch 32,32]: 

<ul>
<li> 5 iters 
<li> 10 iters 
<li> 15 iters 
<li> 20 iters 
<li> 25 iters 
<li> 30 iters 
<li> 35 iters 
<li> 40 iters 
<li> 45 iters 
<li> 50 iters 
<li> 55 iters 
<li> 60 iters 
<li> 65 iters 
<li> 70 iters 
<li> 75 iters 
<li> 80 iters
<li> 85 iters 
<li> 90 iters 
<li> 95 iters 
<li> 100 iters 
</ul>