# RNN model of URIs

The Python code from this notebook has now been developed into two scripts:
<ol>
<li> <tt>train_rnn.py</tt>
<code>
Usage: python  ./python/train_rnn.py infile [options]
Options are:
        -o outfile 
        -datapath [./sdata]
        -modelpath [./models]
        -unroll [20]
        -step [3]
        -dropout [0.1]
        -niters [10]
        -arch (hidden layer sizes) [16]
</code>
<li> <tt>make_synthetic.py</tt>
<code>
Usage: python  ./python/make_synthetic.py infile [options]
Options are:
        -n length of output [1000]
        -init        [newlines]
        -temp        [1.0]
        -modelpath        [./models]
</code>
</ol>

## Training a model

In [9]:
run -i train_rnn\
    smallstring.gz\
    -datapath ../sdata\
    -modelpath ../models\
    -arch 32,32\
    -niters 16

Checking hardware ...
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11201530596252466535
]
Model architecture:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 20, 32)            5248      
_________________________________________________________________
dropout_1 (Dropout)          (None, 20, 32)            0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               8448      
_________________________________________________________________
activation_5 (Activation)    (None, 256)               0  
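
The parameter counts in the summary above can be checked by hand, and they show that this model takes 8-dimensional inputs (one value per bit of a byte). A minimal sketch using the standard LSTM and Dense parameter formulas; the helper names `lstm_params` and `dense_params` are hypothetical and do not come from <tt>train_rnn.py</tt>.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus bias
    return input_dim * units + units

print(lstm_params(8, 32))     # lstm_5: 5248
print(lstm_params(32, 32))    # lstm_6: 8320
print(dense_params(32, 256))  # dense_5: 8448
```

The Dropout and Activation layers report 0 parameters because they hold no weights.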

## Synthetic domain string from model

In [87]:
run -i make_synthetic\
    model_from_big_domain_string_1.gz_arch_8_32_32_unroll_20_step_3_dropout_0.1_iter_20.npy\
    -modelpath ../models\
    -init '叶问前传BD国语中字1024高清'\
    -temp 1.0\
    -n 1000

叶问前传BD国语中字1024高清-/sf-spuing-category-contents-apple1-de-shupche/
/



www.cudanthurcisces.thomovism-natd.com
/imp/pafuaesaunitination-2011
/ecom/Tried_aushitgla
/
/cs.jpg
/es/nisia/vuninicep_lephorier.jpg
/s1/heaioch/2191748757-rewoursize/rorations/
/
/_uy-0vo-phenk-Matannja-dlra-gurtp-mateat-saoline/h-ementiCr_uselrach.s2pebolergesty-iution/pzlussnonin/religal-so-jaSsipan-terk-ehue-l-tor-econwen-dici-wier-best/mod83-dkns-shop/18eluc-groberbreke
/mio-st.html
/weq-view/metling-toar-mensiyen/v-ingturs-sary.html
/2013/03/pobatxerse27x90-383/
/phriFmnend-in05/%Shivher/Thintformen/836.jpg
/post//derroilvesnes-media/fintano-nas/meckh-demelt-vhapiopoo-shop
/cases/quet-do/meser-definemas-ning-dori-biry-and-11-6358_055.html
/in.htm
/porter/productisech-atan-lopity/
/b/
/asc-mayfina-denhog/parentave-cuhlefaer-lofen
/piplued-fd-klomarshen-us
/contents/blog/folos-carewsinary-gunt-porter-xdeting-uz-lllare.html
/blame
/e-eila-seandas-retmunes-ruper/
/syboitkins/10/16/ml-indus-59829



www.rvebmeteor

In [3]:
run -i make_synthetic\
    model_from_big_domain_string_1.gz_arch_8_32_32_unroll_20_step_3_dropout_0.1_iter_55.npy\
    -modelpath ../models\
    -init '叶问前传BD国语中字1024高清'\
    -temp 1.0\
    -n 4000

Using TensorFlow backend.


叶问前传BD国语中字1024高清-/images/category/m/frlas/wronlogic/1518.png
/upload/ecalpriveries/
/trn/rerled_655382-TasR.jpg
/oning-shorde-ostration-Ble-1010060669657622645554501e71c_eurem.aspx
/tasi-aksiclret/
/dib
/index.plg
/sh/amar_htg/
/Bockorers
/prolout-pibcap-givum/
/dinneteng.html
/men-hoples-s/83/per-gides-ladez80-3sbhea-de-skhia-zaki-tofurks-ninew-get/
/mumas/



www.rtbo-dabl.com
/rinkes/product-tabe/category/1353618372/32.html
/inique.asp
/ent/xw/s/70446.jpg
/.html
/agrex+thow/desletings/223/71471/84mt-7674271c4b68.jpg



lizse.com
/iesof-hlilc-nonka/sehursyya/
/shitaras.
/afdell/news/
/es/purten/201/03/05/fvande.html
/barcnisianty.pu




www.a%r2nock.rehovras.iu




uwekuenslaydechy.po
/wistairie/smouting
/hagess-images/resanas-logo
//d21Bc9c3
/cy-11451
/yMmGvoc/1315_-%20-01pyM158y2mskM06Ecd1Udn_NnjzAc05te0zeIzTk2ioJjzcM-M3Pv/exia_b.jpg
/ropf/smaor/
/blab5/photo/thumbs/1/201_7/68/grermriskulefs/dastion-lohons-opsprike-e-s-tFev-2209-M6Ae5u12
/new/mocbech/815g822/gubrtyonidyae.htm
/tag-
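
The samples above were generated with `-temp 1.0`. How <tt>make_synthetic.py</tt> applies the temperature is not shown in this notebook. A common convention in character-level RNN samplers, assumed in the sketch below, is to divide the log-probabilities by the temperature before renormalising and sampling. The helper name `sample_with_temperature` is hypothetical.

```python
import numpy as np

def sample_with_temperature(probs, temp=1.0, rng=None):
    """
    Draw an index from a softmax vector after temperature rescaling.
    Assumed convention, not taken from make_synthetic.py.
    """
    rng = rng or np.random.RandomState()
    probs = np.asarray(probs, dtype=np.float64)
    # rescale in log space; the clip avoids log(0)
    logits = np.log(np.maximum(probs, 1e-12)) / temp
    logits -= logits.max()  # numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled)), scaled

probs = np.array([0.7, 0.2, 0.1])
_, cold = sample_with_temperature(probs, temp=0.5, rng=np.random.RandomState(0))
_, same = sample_with_temperature(probs, temp=1.0, rng=np.random.RandomState(0))
_, hot = sample_with_temperature(probs, temp=2.0, rng=np.random.RandomState(0))
print(cold, same, hot)
```

Under this assumed convention, temperatures below 1 sharpen the distribution (more conservative output), a temperature of 1 leaves it unchanged, and temperatures above 1 flatten it.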

## Entropy of the model

How do we assess the performance of a model?

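Below I use code from <tt>make_synthetic.py</tt> to compute the average prediction entropy on a test string of 10k bytes. At each position the model is shown the previous 20 bytes, and I take the entropy (in nats) of its softmax prediction for the next byte.

The smaller the better...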
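
To put the numbers below on a scale: for a prediction over 256 byte values, the entropy in nats ranges from 0 (all probability on one byte) to ln 256, about 5.545 (uniform guess). A minimal sketch using the same formula as the `entropy` function in this notebook; the `to_bits` helper is hypothetical and only converts nats to bits.

```python
import math

def entropy(pred):
    """Entropy of a probability vector, in nats."""
    return sum(-p * math.log(p) for p in pred if p > 0)

def to_bits(nats):
    """Convert nats to bits (hypothetical helper)."""
    return nats / math.log(2)

uniform = [1.0 / 256] * 256      # model with no idea about the next byte
one_hot = [1.0] + [0.0] * 255    # model that is certain about the next byte

print(entropy(uniform))          # upper bound, ln(256)
print(entropy(one_hot))          # lower bound, zero
print(to_bits(2.31585632917))    # a value measured in this notebook, about 3.34 bits
```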

In [4]:
import gzip
import numpy as np

datapath = "../sdata"
infile = "smallstring.gz"

with gzip.open("%s/%s" % (datapath, infile), 'rb') as f:
    content = f.read()

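# One two-digit lowercase hex string per byte.
# bytearray() yields integers on both Python 2 and Python 3.
testbytes = ['%02x' % b for b in bytearray(content)]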
n_bytes = len(testbytes)

In [17]:
modelfile = "../models/model_from_big_domain_string_1.gz_arch_8_32_32_unroll_20_step_3_dropout_0.1_iter_40.npy"
   
model_wts = np.load(modelfile)

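# First dimension of every 2-D weight matrix. With the Keras 2 weight layout
# this gives the input and recurrent kernel of each LSTM, then the Dense kernel.
arch = [model_wts[i].shape[0]\
       for i in range(len(model_wts))\
       if len(model_wts[i].shape) == 2] 

INSIZE = arch[0]          # 8 (bit vector) or 256 (one-hot)
nh = len(arch) // 2       # number of LSTM layers (integer division)
nhidden = [arch[1+2*i] for i in range(nh)]
OUTSIZE = 256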

In [18]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM

nlayers = len(nhidden)  # number of LSTM layers
unroll = 20             # input window length in bytes

# Rebuild the network without the Dropout layers used in training:
# they hold no weights and are inactive at prediction time.
model = Sequential()
if nlayers > 1:
    for i in range(nlayers-1):
        model.add(LSTM(nhidden[i],
                       return_sequences=True,
                       input_shape=(unroll, INSIZE)))
    model.add(LSTM(nhidden[nlayers-1],
                   return_sequences=False))
else:
    model.add(LSTM(nhidden[0],
                   return_sequences=False,
                   input_shape=(unroll, INSIZE)))
model.add(Dense(OUTSIZE))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# load current weights
model.set_weights(model_wts)

In [19]:
import math

def bn(x):
    """
    Binary representation of 0-255 as 8-long int vector.
    """
    # zero-padded 8-digit binary string, most significant bit first
    return [int(c) for c in format(x, '08b')]

def entropy(pred):
    """
    Entropy of a soft-max vector, in nats (natural logarithm).
    Zero-probability entries are skipped since they contribute nothing.
    """
    return sum([-p*math.log(p) for p in pred if p > 0])

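binabet = [bn(x) for x in range(256)]
# hexabet is not defined in this notebook. Assumed here: the 256 two-digit
# lowercase hex strings in byte order, matching the encoding of testbytes.
hexabet = ['%02x' % i for i in range(256)]
byte_idx = dict((c, i) for i,c in enumerate(hexabet))
nlayers = len(nhidden)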
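
The models in this notebook use one of two input encodings per byte: an 8-long bit vector (`INSIZE == 8`, built from `binabet`) or a 256-long one-hot vector (`INSIZE == 256`). A minimal sketch of both for a single byte; the `encode_byte` helper is hypothetical and only illustrates the two layouts.

```python
import numpy as np

def encode_byte(value, insize):
    """Encode an integer 0-255 as a model input vector (hypothetical helper)."""
    if insize == 8:
        # bit vector, most significant bit first
        return np.array([int(c) for c in format(value, '08b')], dtype=np.float64)
    if insize == 256:
        # one-hot vector with a single 1 at the byte's index
        vec = np.zeros(256)
        vec[value] = 1.0
        return vec
    raise ValueError("insize must be 8 or 256")

byte = ord('a')  # 97
print(encode_byte(byte, 8))
print(encode_byte(byte, 256).argmax())
```

The bit-vector encoding keeps the first LSTM layer small, while the one-hot encoding gives every byte value its own input column.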

In [20]:
unroll = 20
x = np.zeros((1, unroll, INSIZE))

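def stepping(window):
    """
    Encode a window of hex bytes into x and return the softmax
    prediction for the next byte.
    """
    # clear the previous window; without this the one-hot case accumulates 1s
    x[:] = 0.0
    if INSIZE == 8:
        for t,b in enumerate(window):
            x[0, t, :] = binabet[byte_idx[b]]
    elif INSIZE == 256:
        for t,b in enumerate(window):
            x[0, t, byte_idx[b]] = 1.0
    return model.predict(x, verbose=0)[0]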

window = testbytes[:unroll]
idx = unroll

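entropies = []
# Slide the window one byte at a time over the test string, feeding in the
# true bytes and recording the entropy of each next-byte prediction.
while idx < n_bytes - unroll:
    preds = stepping(window)
    entropies.append(entropy(preds))
    next_byte = testbytes[idx]
    window = window[1:] + [next_byte]
    idx += 1

# average prediction entropy in nats
print(sum(entropies)/len(entropies))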

2.31585632917


Results for other models:

testmodel: 4.94551499849

big_string 10Ms on Macbook [INSIZE=256 ~60 iters]: 1.19513989953

big_domain_string on GPU [INSIZE=8 arch 16]: 


| Iteration | Entropy       |
| --------- |:-------------:|
| 1         | 2.78878392504 |
| 7         | 2.76468260125 |
| 17        | 2.76468260125 |
| 30        | 2.76859589659 |

big_domain_string on GPU [INSIZE=8 arch 32,32]: 

| Iteration | Entropy      |
| --------- |:------------:|
| 5         | 2.3545572814 |
| 10        | 2.3470987331 |
| 15        | 2.3276004177 |
| 20        | 2.3032590387 |
| 25        | 2.3182696723 |
| 30        | 2.3433261821 |
| 35        | 2.3360464611 |
| 40        | 2.3158563291 |
| 45        | 2.3591114079 |
| 50        | 2.3318835304 |
| 55        | 2.3279826120 |
| 60        |              |