# RNN model of URIs

The Python code from this notebook has now been developed into two scripts:
<ol>
<li> <tt>train_rnn.py</tt>
<code>
Usage: python  ./python/train_rnn.py infile [options]
Options are:
        -o outfile 
        -datapath [./sdata]
        -modelpath [./models]
        -unroll [20]
        -step [3]
        -dropout [0.1]
        -niters [10]
        -arch (hidden layer sizes) [16]
</code>
<li> <tt>make_synthetic.py</tt>
<code>
Usage: python  ./python/make_synthetic.py infile [options]
Options are:
        -n length of output [1000]
        -init        [newlines]
        -temp        [1.0]
        -modelpath        [./models]
</code>
</ol>
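
The source of <tt>train_rnn.py</tt> is not reproduced in this notebook, so the following is only a sketch of how the <tt>-unroll</tt> and <tt>-step</tt> options are commonly used in character-level RNN training: the input is cut into overlapping windows of <tt>unroll</tt> bytes, taken every <tt>step</tt> bytes, and the byte following each window is the prediction target. The helper name `make_windows` is hypothetical.

```python
def make_windows(data, unroll=20, step=3):
    """Slice a byte string into (window, next_byte) training pairs.

    Assumed behaviour, not taken from train_rnn.py.
    """
    seq = bytearray(data)  # yields ints 0-255 on both Python 2 and 3
    windows, targets = [], []
    for start in range(0, len(seq) - unroll, step):
        windows.append(list(seq[start:start + unroll]))
        targets.append(seq[start + unroll])
    return windows, targets


windows, targets = make_windows(b"/2016/03/some-page.html\n/category/news/\n")
print(len(windows), len(windows[0]))
```

A smaller step yields more, and more strongly overlapping, training windows from the same input.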

# Training a model

In [38]:
run -i train_rnn\
    smallstring.gz\
    -datapath ../sdata\
    -modelpath ../models\
    -arch 16,16

Checking hardware ...
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 18166164399265663188
]
Model architecture:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_8 (LSTM)                (None, 20, 16)            1600      
_________________________________________________________________
dropout_4 (Dropout)          (None, 20, 16)            0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 16)                2112      
_________________________________________________________________
dropout_5 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 256)               4352      
_________________________________________________________________
activation_7 (Activation)    (None, 256)               0  
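
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias); the helper names are made up for illustration. The first layer's 1600 parameters are only consistent with 8 inputs per time step, which matches the 8-bit binary encoding of each byte (INSIZE = 8) used later in this notebook.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per input/output pair plus one bias per output.
    return input_dim * units + units


print(lstm_params(8, 16))     # lstm_8:  1600
print(lstm_params(16, 16))    # lstm_9:  2112
print(dense_params(16, 256))  # dense_7: 4352
```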

# Synthetic domain string from model

In [19]:
run -i make_synthetic\
    model_from_big_domain_string_1.gz_arch_8_16_unroll_20_step_3_dropout_0.1_iter_17.npy\
    -modelpath ../models\
    -init '/tag/маша и медведь мультфильм'\
    -temp 0.6\
    -n 1000


/tag/маша и медведь мультфильм08118a-cegeabadedo%dom-barlund-mares/00/sutes/mcen/
/metes/uptetor/bane/2003-jaale/
/teroll/ew-veni-seres/matanter/020/
/2016/03/savad-
/comagi/
/sert/cemag-cededels/thalenint/cechle-contoat.html
/s/cortile.html
/padtid/apadel-bart/dore-amsanter/
/contins/cootit/cases/promlerccure-tioe-lepsiers/118-ml-masel-feditat/sassica-moron.ort
/n/hums/1/spol/par-sheengico-cel-thataten-urachiy/tertim-nobleushonds/8/
/delei/2011/01/12/48081/tprope/pipt
/
/cociimenci/galen/
/catilemdones/
/iciant/aave/2011/06/90927457_.jpg
/dacis-I/labeis/Ep/Aessens/prome-actieng/thom/gatilas/166443
/wtapjas-dono-s-cants-artic-ners/deart/2016/01/377892679210777921/117/2/16/chollesilaeliten/mlgts/
/category/amasces/logentias/palas/fas-khyut-
/imagez/den/singee/dorit-solisi/ezekers/pard-category/bo-cor-emors/s/aprhedel-sadlerte-ardono-loge-nog-larientiis/
/comlert/hone-Porso103-wrystanstotiinga-s-fo/fari-a-sendibrice-tolles/ases/
/prots/searc/plles/metororichs/shepne-d-porenrs-cankezede-c
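
The <tt>-temp 0.6</tt> option used above presumably controls the sampling temperature. The code of <tt>make_synthetic.py</tt> is not shown here, so this is a sketch of the usual technique rather than the script's actual implementation: the predicted distribution is rescaled in log space before a byte is drawn. The function name `sample_with_temperature` is hypothetical.

```python
import numpy as np


def sample_with_temperature(probs, temp=1.0, rng=np.random):
    """Draw an index from probs after sharpening (temp < 1) or
    flattening (temp > 1) the distribution. Returns (index, rescaled dist)."""
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(np.clip(probs, 1e-12, None)) / temp
    logits -= logits.max()  # numerical stability before exponentiating
    dist = np.exp(logits)
    dist /= dist.sum()
    return rng.choice(len(dist), p=dist), dist


probs = [0.5, 0.3, 0.2]
_, cool = sample_with_temperature(probs, temp=0.6)
_, hot = sample_with_temperature(probs, temp=1.5)
print(cool.round(3), hot.round(3))
```

With a temperature below 1 the most likely byte becomes even more likely, so the output is more conservative; above 1 the distribution flattens and the output becomes more random.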

# Entropy of the model

How do we assess the performance of a model?

Below I reuse much of the code from <tt>make_synthetic.py</tt> to compute the average prediction entropy over a test string of 10k bytes.

Lower is better: a low entropy means the model is more certain about the next byte.
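
To give the numbers further down some scale, here is a small illustration of the entropy measure on toy distributions over 256 byte values. It uses the natural logarithm, as the notebook's <tt>entropy</tt> function does, so values are in nats and range from 0 (all probability on one byte) to ln 256, about 5.545 (uniform guess). The toy distributions are invented for illustration.

```python
import math


def entropy(pred):
    """Entropy (in nats) of a probability vector."""
    return sum(-p * math.log(p) for p in pred if p > 0)


uniform = [1.0 / 256] * 256          # no idea which byte comes next
certain = [1.0] + [0.0] * 255        # fully confident in one byte
two_way = [0.5, 0.5] + [0.0] * 254   # torn between two bytes

print(entropy(uniform))  # ln(256), about 5.545
print(entropy(certain))  # 0.0
print(entropy(two_way))  # ln(2), about 0.693
```

Note that this measures how confident the prediction is; it does not by itself check the prediction against the byte that actually follows.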

In [35]:
import gzip
import numpy as np

datapath = "../sdata"
infile = "smallstring.gz"

with gzip.open("%s/%s" % (datapath, infile), 'rb') as f:
    content = f.read()

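# Two-character hex code per byte. bytearray() makes this work on both
# Python 2 and 3 (the original b.encode('hex') is Python 2 only).
testbytes = ['%02x' % b for b in bytearray(content)]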
n_bytes = len(testbytes)

In [70]:
modelfile = "../models/model_from_big_domain_string_1.gz_arch_8_16_unroll_20_step_3_dropout_0.1_iter_17.npy"
   
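# The file stores a list of differently shaped arrays, so recent NumPy
# versions need allow_pickle=True to load it.
model_wts = np.load(modelfile, allow_pickle=True)

# First dimension of every 2-D weight matrix: [INSIZE, h1, h1, h2, h2, ...]
arch = [w.shape[0] for w in model_wts if len(w.shape) == 2]

INSIZE = arch[0]
nh = len(arch) // 2  # integer division, also on Python 3
nhidden = [arch[1+2*i] for i in range(nh)]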
OUTSIZE = 256
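
The cell above recovers the network shape from the saved weights alone. The sketch below replays that logic on dummy arrays, assuming the Keras 2 weight layout (kernel, recurrent kernel and bias per LSTM layer, then kernel and bias for the dense layer) and the 8 → 16 → 16 → 256 network from the training summary earlier. The helper `lstm_weights` is made up for illustration.

```python
import numpy as np


def lstm_weights(input_dim, units):
    # Assumed Keras 2 layout: kernel, recurrent kernel, bias (4 gates stacked).
    return [np.zeros((input_dim, 4 * units)),
            np.zeros((units, 4 * units)),
            np.zeros(4 * units)]


dummy_wts = (lstm_weights(8, 16) + lstm_weights(16, 16)
             + [np.zeros((16, 256)), np.zeros(256)])

# Biases are 1-D and drop out; each 2-D matrix contributes its input size.
dummy_arch = [w.shape[0] for w in dummy_wts if w.ndim == 2]
dummy_insize = dummy_arch[0]
dummy_hidden = [dummy_arch[1 + 2 * i] for i in range(len(dummy_arch) // 2)]

print(dummy_arch)    # [8, 16, 16, 16, 16]
print(dummy_insize)  # 8
print(dummy_hidden)  # [16, 16]
```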

In [71]:
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM

# nlayers and unroll are assigned in later cells of this notebook;
# define them here so the cell also runs top to bottom.
nlayers = len(nhidden)
unroll = 20  # must match the -unroll value used in training

# The Dropout layers used in training are omitted: they hold no weights
# and are inactive at prediction time.
model = Sequential()
if nlayers > 1:
    for i in range(nlayers-1):
        model.add(LSTM(nhidden[i],
                       return_sequences=True,
                       input_shape=(unroll, INSIZE)))
    model.add(LSTM(nhidden[nlayers-1],
                   return_sequences=False))
else:
    model.add(LSTM(nhidden[0],
                   return_sequences=False,
                   input_shape=(unroll, INSIZE)))
model.add(Dense(OUTSIZE))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# load current weights
model.set_weights(model_wts)

In [72]:
import math

def bn(x):
    """
    Binary representation of 0-255 as 8-long int vector.
    """
    bits = bin(x)[2:].zfill(8)  # left-pad with zeros to 8 digits
    return [int(c) for c in bits]

def entropy(pred):
    """
    Entropy of a soft-max vector.
    """
    return sum([-p*math.log(p) for p in pred if p > 0])

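binabet = [bn(x) for x in range(256)]
# hexabet is not defined anywhere in this notebook (presumably it was left
# in the namespace by an earlier run -i). Assumed here to be the 256
# two-character hex codes in byte order.
hexabet = ['%02x' % i for i in range(256)]
byte_idx = dict((c, i) for i,c in enumerate(hexabet))
nlayers = len(nhidden)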

In [73]:
unroll = 20

def stepping(window):
    # Fresh input buffer on every call, so one-hot entries from earlier
    # windows cannot accumulate when INSIZE == 256.
    x = np.zeros((1, unroll, INSIZE))
    if INSIZE == 8:
        for t,b in enumerate(window):
            x[0, t, :] = binabet[byte_idx[b]]
    elif INSIZE == 256:
        for t,b in enumerate(window):
            x[0, t, byte_idx[b]] = 1.0
    return model.predict(x, verbose=0)[0]

window = testbytes[:unroll]
idx = unroll

entropies = []
while idx < n_bytes - unroll:
    preds = stepping(window)
    entropies += [entropy(preds)]
    next_byte = testbytes[idx]
    window = window[1:] + [next_byte]
    idx += 1

print(sum(entropies)/len(entropies))

2.76468260125
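
The <tt>stepping</tt> function above supports two input encodings, selected by INSIZE: each byte is either an 8-element binary vector or a 256-element one-hot vector. A minimal standalone sketch of the two, with hypothetical helper names:

```python
import numpy as np


def encode_binary(byte):
    """INSIZE == 8: the byte's bits, most significant first."""
    return [int(c) for c in format(byte, '08b')]


def encode_one_hot(byte):
    """INSIZE == 256: a single 1.0 at the byte's index."""
    v = np.zeros(256)
    v[byte] = 1.0
    return v


slash = ord('/')  # 47, the byte that starts every URI path above
print(encode_binary(slash))          # [0, 0, 1, 0, 1, 1, 1, 1]
print(encode_one_hot(slash).argmax())  # 47
```

The binary form keeps the first LSTM layer small, while the one-hot form gives the network one input per byte value at the cost of a larger input matrix.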


testmodel: 4.94551499849

big_string 10Ms on Macbook [INSIZE=256 ~60 iters]: 1.19513989953

big_domain_string on GPU [INSIZE=8]: 

 1 iter:  2.78878392504

 7 iters: 2.76468260125

17 iters: 2.76468260125