# RNN model of URIs

The Python code from this notebook has now been developed into two scripts:
<ol>
<li> <tt>train_rnn.py</tt>
<code>
Usage: python  ./python/train_rnn.py infile [options]
Options are:
        -o outfile 
        -datapath [./sdata]
        -modelpath [./models]
        -unroll [20]
        -step [3]
        -dropout [0.1]
        -niters [10]
        -arch (hidden layer sizes) [16]
</code>
<li> <tt>make_synthetic.py</tt>
<code>
Usage: python  ./python/make_synthetic.py infile [options]
Options are:
        -n length of output [1000]
        -init        [newlines]
        -temp        [1.0]
        -modelpath        [./models]
</code>
</ol>

## Training a model

In [9]:
run -i train_rnn\
    smallstring.gz\
    -datapath ../sdata\
    -modelpath ../models\
    -arch 32,32\
    -niters 16

Checking hardware ...
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11201530596252466535
]
Model architecture:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 20, 32)            5248      
_________________________________________________________________
dropout_1 (Dropout)          (None, 20, 32)            0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               8448      
_________________________________________________________________
activation_5 (Activation)    (None, 256)               0  
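
The parameter counts in the summary can be checked by hand. A standard LSTM layer with input width n_in and h hidden units has 4·(n_in + h + 1)·h parameters (four gates, each with input weights, recurrent weights and a bias), and a dense layer has n_in·n_out + n_out. The sketch below uses hypothetical helper names, <tt>lstm_params</tt> and <tt>dense_params</tt>, and assumes an 8-dimensional input (one value per bit of a byte), which is the input width consistent with the 5248 parameters reported for the first layer.

```python
def lstm_params(n_in, n_hidden):
    """Weights of a standard LSTM layer: 4 gates x (input + recurrent + bias)."""
    return 4 * (n_in + n_hidden + 1) * n_hidden

def dense_params(n_in, n_out):
    """Weight matrix plus bias vector of a fully connected layer."""
    return n_in * n_out + n_out

# Assumption: 8 inputs per time step (the bits of one byte),
# followed by two LSTM layers of 32 units and a 256-way output layer.
counts = [lstm_params(8, 32), lstm_params(32, 32), dense_params(32, 256)]
print(counts)       # [5248, 8320, 8448]
print(sum(counts))  # 22016
```

The three values match the <tt>Param #</tt> column for the two LSTM layers and the dense layer; the dropout and activation layers carry no weights.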

## Synthetic domain string from model

In [59]:
run -i make_synthetic\
    model_from_big_domain_string_1.gz_arch_8_32_32_unroll_20_step_3_dropout_0.1_iter_20.npy\
    -modelpath ../models\
    -init '叶问前传BD国语中字1024高清'\
    -temp 1.0\
    -n 4000

叶问前传BD国语中字1024高清-/
/wp-content/uploads/smarker/k_Gyft_Arcm/1737.html
/hot-569099/benadelny
/wes-wptenlede/lundboy-tidises-lakIchu/



www.Wrurhprrecolily.c
/authers/item/13414-jittaznen.html
/3201137.file.ek.trmVESebi_p
/.png
/it/videos/erotor-guiso-mya-mocet.html
//osterngow/reviemfers/1756/26635775/avettote/certie-g-bolhrount-photes-joes-ona-cPpeub5.Zm



vupawhon.covfunar.com
/



1/fl/skoleswo-vomoneproph/tyki/
/huprssicalanes/
/catil/birfy-rarriey.asp
/uploads/bw/theporiby02923194993910
/statives/catalog.html
/113/456f09/548438.png
/esercings/alzinyy/mehers/seap/_.gi1000/
/21080/isndtafacher-mordIncam.jpg
/2114808.jpg



p.contebnon/conmuiments/mechen-gor-fhuss-cart-.html
/bular-vax/
/gigtii
/archive-ballosti-411206-fci-tro.html
/page-seesture-noboor.htm
/search/label/photr.html
/Inssooltarran/deshing/derribration/products-kos-ramen_lal
/thumbs/97850081/tzkastars/Nrideb
/commechauvripergerdriew.com
/56820
/curbungs/afea/bo82s_/w
/eg/3414
/gasitfrount-image/areveki2-noo-anner-onely
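
The <tt>-temp</tt> option suggests that each next byte is drawn from a temperature-scaled version of the model's softmax output. The script itself is not shown here, so the following is only a sketch of the usual technique, with hypothetical function names: the log-probabilities are divided by the temperature and renormalised, so a temperature of 1.0 leaves the distribution unchanged, lower values favour the most likely byte, and higher values flatten the distribution.

```python
import numpy as np

def apply_temperature(probs, temp=1.0):
    """Rescale a probability vector to p_i ** (1 / temp), renormalised."""
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(np.clip(probs, 1e-12, None)) / temp
    logits -= logits.max()            # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

def sample_index(probs, temp=1.0, rng=None):
    """Draw one index (e.g. a byte value 0-255) from the rescaled distribution."""
    rng = np.random if rng is None else rng
    p = apply_temperature(probs, temp)
    return int(rng.choice(len(p), p=p))

# Toy 3-way distribution, made up for illustration
probs = np.array([0.1, 0.6, 0.3])
print(apply_temperature(probs, 1.0))   # unchanged at temp = 1.0
print(apply_temperature(probs, 0.5))   # sharper: more mass on index 1
print(sample_index(probs, temp=0.01))  # near-greedy: practically always index 1
```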

## Entropy of the model

How do we assess the performance of a model?

Below I'm using code from <tt>make_synthetic.py</tt> to compute the average prediction entropy on a test string of 10k bytes. 

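Lower is better. The entropy is measured in nats, since the <tt>entropy</tt> function defined further down uses the natural logarithm.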
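
To give the numbers a scale: with 256 possible byte values, the entropy of a single prediction ranges from 0 (all probability on one byte) to ln 256, about 5.545 nats (uniform over all bytes). A minimal sketch using the same formula as this notebook's <tt>entropy</tt> function; the example vectors are made up for illustration.

```python
import math

def entropy(pred):
    """Shannon entropy (in nats) of a probability vector."""
    return sum(-p * math.log(p) for p in pred if p > 0)

uniform = [1.0 / 256] * 256          # no idea which byte comes next
one_hot = [1.0] + [0.0] * 255        # certain of the next byte
two_way = [0.5, 0.5] + [0.0] * 254   # undecided between two bytes

print(entropy(uniform))   # ln(256), about 5.545
print(entropy(one_hot))   # 0.0
print(entropy(two_way))   # ln(2), about 0.693
```

Note that this quantity measures how sharply peaked a prediction is, not whether the peak falls on the byte that actually follows.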

In [60]:
import gzip
import numpy as np

datapath = "../sdata"
infile = "smallstring.gz"

with gzip.open("%s/%s" % (datapath, infile), 'rb') as f:
    content = f.read()

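# Two-digit hex string per byte; bytearray yields ints on both Python 2 and 3
# (str.encode('hex') exists only on Python 2).
testbytes = ['%02x' % b for b in bytearray(content)]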
n_bytes = len(testbytes)

In [61]:
modelfile = "../models/model_from_big_domain_string_1.gz_arch_8_32_32_unroll_20_step_3_dropout_0.1_iter_20.npy"
   
model_wts = np.load(modelfile)

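# First dimension of every 2-D weight matrix: the input size, then the hidden sizes
arch = [w.shape[0] for w in model_wts if len(w.shape) == 2]

INSIZE = arch[0]
nh = len(arch) // 2   # integer division, so range(nh) also works on Python 3
nhidden = [arch[1+2*i] for i in range(nh)]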
OUTSIZE = 256

In [62]:
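from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM

# These two names are used below but only assigned in later cells
# (or left in the namespace by `run -i`); set them here so the cell stands alone.
nlayers = len(nhidden)
unroll = 20   # window length; matches unroll_20 in the model file name

model = Sequential()
if nlayers > 1: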
    for i in range(nlayers-1):
        model.add(LSTM(nhidden[i],
                       return_sequences=True,
                       input_shape=(unroll, INSIZE)))
    model.add(LSTM(nhidden[nlayers-1],
                   return_sequences=False))
else:
    model.add(LSTM(nhidden[0],
                   return_sequences=False,
                   input_shape=(unroll, INSIZE)))
model.add(Dense(OUTSIZE))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# load current weights
model.set_weights(model_wts)

In [63]:
import math

def bn(x):
    """
    Binary representation of 0-255 as 8-long int vector.
    """
    # '08b' zero-pads to 8 binary digits; avoids shadowing the built-in str
    return [int(bit) for bit in format(x, '08b')]

def entropy(pred):
    """
    Entropy of a soft-max vector.
    """
    return sum([-p*math.log(p) for p in pred if p > 0])

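binabet = [bn(x) for x in range(256)]
# hexabet is not defined anywhere in this notebook (presumably it was left in
# the namespace by `run -i make_synthetic`). Assumption: it is the list of the
# 256 two-digit hex strings in byte order, so that its indices line up with binabet.
hexabet = ['%02x' % i for i in range(256)]
byte_idx = dict((c, i) for i,c in enumerate(hexabet))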
nlayers = len(nhidden)

In [64]:
unroll = 20
x = np.zeros((1, unroll, INSIZE))

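def stepping(window):
    """
    Encode a window of hex-string bytes into x and return the softmax prediction.
    """
    x[:] = 0.0   # clear the previous window, otherwise one-hot entries accumulate
    if INSIZE == 8:
        for t,b in enumerate(window):
            x[0, t, :] = binabet[byte_idx[b]]
    elif INSIZE == 256:
        for t,b in enumerate(window):
            x[0, t, byte_idx[b]] = 1.0
    return model.predict(x, verbose=0)[0]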

window = testbytes[:unroll]
idx = unroll

entropies = []
while idx < n_bytes - unroll:
    preds = stepping(window)
    entropies.append(entropy(preds))
    next_byte = testbytes[idx]
    window = window[1:] + [next_byte]
    idx += 1

print(sum(entropies)/len(entropies))

2.30325903872
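
The two branches of <tt>stepping</tt> correspond to two input encodings: with <tt>INSIZE == 8</tt> each byte becomes its 8 bits, and with <tt>INSIZE == 256</tt> it becomes a one-hot vector. The sketch below builds both for a short window using NumPy only. <tt>encode_window</tt> is a hypothetical helper, not part of the scripts, and it takes integer byte values rather than the hex strings used above.

```python
import numpy as np

def encode_window(byte_values, insize):
    """Encode a sequence of byte values (0-255) as a (1, len, insize) array."""
    x = np.zeros((1, len(byte_values), insize))
    for t, b in enumerate(byte_values):
        if insize == 8:
            # most significant bit first, as in bn()
            x[0, t, :] = [(b >> shift) & 1 for shift in range(7, -1, -1)]
        elif insize == 256:
            x[0, t, b] = 1.0
        else:
            raise ValueError("insize must be 8 or 256")
    return x

window = bytearray(b"/a")                 # byte values 47 and 97
print(encode_window(window, 8)[0])        # rows hold the bits 00101111 and 01100001
print(encode_window(window, 256).shape)   # (1, 2, 256)
```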


testmodel: 4.94551499849

big_string 10Ms on Macbook [INSIZE=256 ~60 iters]: 1.19513989953

big_domain_string on GPU [INSIZE=8 arch 16]: 


| Iteration | Entropy       |
| --------- |:-------------:|
| 1         | 2.78878392504 |
| 7         | 2.76468260125 |
| 17        | 2.76468260125 |
| 30        | 2.76859589659 |

big_domain_string on GPU [INSIZE=8 arch 32,32]: 

| Iteration | Entropy      |
| --------- |:------------:|
| 5         | 2.3545572814 |
| 10        | 2.3470987331 |
| 15        | 2.3276004177 |
| 20        | 2.3032590387 |
| 25        |              |
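
Because the entropies are in nats, two conversions may make the tables easier to read: dividing by ln 2 gives bits per byte, and exponentiating gives the effective number of equally likely next bytes (out of 256). The sketch below applies both to the values reported for the 32,32 architecture; <tt>to_bits</tt> and <tt>effective_choices</tt> are hypothetical helper names.

```python
import math

def to_bits(nats):
    """Convert an entropy from nats to bits."""
    return nats / math.log(2)

def effective_choices(nats):
    """exp(H): size of a uniform distribution with the same entropy."""
    return math.exp(nats)

# Iteration -> average entropy, copied from the 32,32 table
results = {5: 2.3545572814, 10: 2.3470987331, 15: 2.3276004177, 20: 2.3032590387}

for it in sorted(results):
    h = results[it]
    print("%2d  %.3f nats  %.3f bits  ~%.1f choices"
          % (it, h, to_bits(h), effective_choices(h)))
```

For the iteration-20 model this comes to roughly 3.3 bits per byte, or about 10 equally likely candidates for the next byte.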