# RNN model of URIs

The Python code from this notebook has now been developed into two scripts:
<ol>
<li> <tt>train_rnn.py</tt>
<code>
Usage: python  ./python/train_rnn.py infile [options]
Options are:
        -o outfile 
        -datapath [./sdata]
        -modelpath [./models]
        -unroll [20]
        -step [3]
        -dropout [0.1]
        -niters [10]
        -arch (hidden layer sizes) [16]
</code>
<li> <tt>make_synthetic.py</tt>
<code>
Usage: python  ./python/make_synthetic.py infile [options]
Options are:
        -n length of output [1000]
        -init        [newlines]
        -temp        [1.0]
        -modelpath        [./models]
</code>
</ol>
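The <tt>-unroll</tt> and <tt>-step</tt> options suggest that the training text is cut into overlapping fixed-length windows. Below is a minimal sketch of that idea, assuming each window of <tt>unroll</tt> bytes is paired with the byte that follows it and that successive windows start <tt>step</tt> bytes apart. The helper <tt>make_windows</tt> is hypothetical and is not taken from <tt>train_rnn.py</tt>.

```python
def make_windows(data, unroll=20, step=3):
    """Cut `data` into (window, next item) training pairs.

    Hypothetical helper: each window holds `unroll` consecutive items,
    the target is the item that follows, and consecutive windows start
    `step` positions apart.
    """
    windows, targets = [], []
    for start in range(0, len(data) - unroll, step):
        windows.append(data[start:start + unroll])
        targets.append(data[start + unroll])
    return windows, targets


data = list(b"www.example.com/index.html")   # 26 byte values (Python 3)
windows, targets = make_windows(data, unroll=20, step=3)
print(len(windows))             # 2 windows, starting at offsets 0 and 3
print([chr(t) for t in targets])
```

A smaller <tt>step</tt> yields more, heavily overlapping training examples from the same text; a larger one yields fewer examples.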

## Training a model

In [9]:
run -i train_rnn\
    smallstring.gz\
    -datapath ../sdata\
    -modelpath ../models\
    -arch 32,32\
    -niters 16

Checking hardware ...
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11201530596252466535
]
Model architecture:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 20, 32)            5248      
_________________________________________________________________
dropout_1 (Dropout)          (None, 20, 32)            0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               8448      
_________________________________________________________________
activation_5 (Activation)    (None, 256)               0  
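
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias); the function names are illustrative only. The first count is reproduced with an input of 8 values per time step, which is consistent with a binary encoding of each byte.

```python
def lstm_params(n_in, n_hidden):
    # 4 gates, each with n_in input weights, n_hidden recurrent weights
    # and 1 bias per hidden unit
    return 4 * (n_in + n_hidden + 1) * n_hidden


def dense_params(n_in, n_out):
    # weight matrix plus one bias per output
    return n_in * n_out + n_out


print(lstm_params(8, 32))      # 5248, first LSTM layer
print(lstm_params(32, 32))     # 8320, second LSTM layer
print(dense_params(32, 256))   # 8448, dense output layer
```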

## Synthetic domain string from model
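
The <tt>-temp</tt> option presumably controls a sampling temperature. A common way to apply one is sketched below: each predicted probability is raised to the power 1/temp and the vector is renormalised before a byte is drawn. Whether <tt>make_synthetic.py</tt> does exactly this is an assumption, and both function names are hypothetical.

```python
import math
import random


def apply_temperature(probs, temp=1.0):
    """Rescale a probability vector to p_i ** (1/temp), renormalised."""
    logits = [math.log(p) / temp if p > 0 else float("-inf") for p in probs]
    top = max(logits)                        # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temp=1.0, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    scaled = apply_temperature(probs, temp)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


probs = [0.7, 0.2, 0.1]
print(apply_temperature(probs, 1.0))   # unchanged up to rounding
print(apply_temperature(probs, 0.5))   # sharper: more mass on index 0
print(apply_temperature(probs, 2.0))   # flatter: closer to uniform
print(sample_index(probs, temp=1.0))
```

With this scheme a temperature of 1.0, as used in the runs below, samples directly from the model's predicted distribution.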

In [21]:
run -i make_synthetic\
    model_from_big_domain_string_1_with_nonlatin_1pt05_12287_1024.gz_arch_256_128_128_unroll_20_step_5_dropout_0.1_iter_25.npy\
    -modelpath ../models\
    -init 'www.abcsivywbpiuytvbsdef.com'\
    -temp 1.0\
    -n 1000

www.abcsivywbpiuytvbsdef.com
/krt841.shtml
/profile/تدها-فلد-مدةرانكیklabmm-f--hotepbam90_00/fuloc601882966-9kp2tw8D
/files/tosicra/2/3604.html
/sum-ghab-Lodinl.pop
/category-difliJepen
/en/lateali/asponory/k/aA-4293.html
/post/index.php%d9%89%d9%8ن-پداور-رلٳای-عن-بونیانی-دانلود-اربχ-سايز-
/tags/قیمت-روز-وو-رران-کارنيت-گلرويصعه-ػا-
/tag/باني
/tags/شگم-خفو11-سعن-از--جهلوچTiy-ديسلدوی-د٪عرحت-گرٴلاص-s-Erars-
/tag/پوزالها-های-های-به-DSe-kirj-votamy-b1028x230/l.jpg
/�w�xer-tu
/tag/117/tbhace-ptdi/s5laprhpa
/�Aхерани
/zono/6608020
/�e�s-�Aелыврарa/rawel-M51-ic-sport-elf-ìcami-comchy-breviews-vig-odag-dotgt�-7zzc��4a15.rali.jpg
/data/withoe/
/category/am-artos-aong-askicnigezunver-racelt/idcasco8A~-2100--teny
/tag/ins%20i%20demde-pers%20kessuititicition/4232
/%E4%BD%BA%E6%8C%84%E8%BC%BC%E7%9F%BA%E6%92%8D%/3432TE920s/
/armelanie/thumb/o/_cikica
/Watch-910876
/hotel/land-stor-ta


In [25]:
run -i make_synthetic\
    model_from_big_domain_string_1_with_nonlatin_1pt05_12287_1024.gz_arch_256_128_128_unroll_20_step_5_dropout_0.1_iter_25.npy\
    -modelpath ../models\
    -init '叶问前传BD国语中字1024高清'\
    -temp 1.0\
    -n 4000


叶问前传BD国语中字1024高清�%lKua/104608732832c/grinavenherastotcins-2510�l-Camellobiel-� ритим-вѾ-вициескойgioly ena im ad4 erdia/แ�lh/
/shopmage/p
/d173-3�w�вы редончЊотбя Ценскесге%20�Aйушеран-оборирена
/music/rin/shosÕn-tiunimo-1771/j/flukcap-22x490-���lmٲ-فردافد-در-اريد-لهارا-انوسt42460615غ-چیداشی-ارري-Glocepث-cts-susshaadet_celdonmiid_818907014Sy7.html
/vinn-chicp_faparih/
/�w����-فاميي-مدل-tsh1th-
/tags/Hota
/tags/dodkestlm.jpg
/images/detailed/13733/192696282532/s6-isleego.html
/mons-deskNietishangs.html
/kiev/jactr/
/hcaki/Pr0npec-holdet-2947030
/+jxluvesser-and-insk-qţ/gigcevi%20�l�0+lorg
/gat-cache/654203/p
/qai/wostmxzanfo2_410_18266/36Txfc5��xarers_maorapel_publins_avones_mece_fatur_quos/ropo-triraspas-caro%20сциѸcorrbieto+saa.aspg
/6bfe78d�ld-endlat30-1269/melren-druvzis-er%25L008%2F΄%74l%20to_ditetuha/
/tag/gauso/mi�B64_29U6728rv_ihiw/Hmatente-Map-za���w�klnt22700/2014/07/14/-9-.html
/games/vrhepgasversapmaval22731/cu/~
/brand/
/shop/imagetx/kpuw-mida-fries/1917/campationby-255-pro

In [11]:
run -i make_synthetic\
    model_from_big_domain_string_1.gz_arch_8_32_32_unroll_20_step_3_dropout_0.1_iter_60.npy\
    -modelpath ../models\
    -init '叶问前传BD国语中字1024高清'\
    -temp 1.0\
    -n 1000

叶问前传BD国语中字1024高清//web-ieceteryutate
/rus/bustune/126343944.msp
/Ossarchsss-2093-Bios-frising-right-0g257142-124/
/smpting-fa-timngiucusterk-autoriyka-stelod-logment/saulos-283.html
/oHalungobur-iswimetjatenriota.jpg
/wp-content/uploads/2017/00/teges/suacamahVyuxtingy-brolkgz
/index.html
/par-chikinsok9cdb.html
/clochan-mutto-2003/en-1121847425.html
/maksicer/1792-Frombersimgigngimm/senro/mcdouros.html/gavens
/furnulisy/dommhave-accessoriend-heb-orotdia/files/1273c603d.jpg
/shiginkin/products-lova-produng-en-.63384/1b/1683
/img/croas.html
/hob-sdedca-zishice/senat-latip-all-chys
/sphoto
/products/Woni-T1-Catsica-GJL6I413-5321013Cable-Desann.jpg
/dots/seudting-wao/voa-bagtes--12770.html
/shrani/c5d/1.jpg
/merrous/stastol
/thumbs/Lokrimysid/1746854-Ts-an-cara
/womans/toerlacbnations/fatcule/avusios/strigithl
/gock/
/fr/
/dchl.html
/
/prberash/contert-2226x400



www.alytkav.col
/wyld/ensoyi/urrugzusts/71/1653_c_jeni20.jpg
/winggeas/svulsnation/3554094_bpanallla94177
/jlage-2.html
/toisas-

In [27]:
run -i make_synthetic\
    model_from_big_domain_string_1.gz_arch_8_32_32_unroll_20_step_3_dropout_0.1_iter_60.npy\
    -modelpath ../models\
    -init '叶问前传BD国语中字1024高清'\
    -temp 1.0\
    -n 4000

叶问前传BD国语中字1024高清-.pat.ski/insmarkolley-monechogweln.html
/list/2018/15/f-portai-indestoccnondanatispires/smitsx/
/unchots-podel.html
/luwhick%200mk-archor8635dad5bpb3c-rone-sole-h/
/wip/ee42-cwocad_parvin-cuani-a32ddz694v2e.jpg
/mopssare/29056x17/994277/idNitilenvile_omage-stinings/lasse-sapasteroe
/styles/imc/6029/oca4ch-anfslifty/50/
/71/606/Mellariation/
/book-spl/rl/
/artackie/sebfet/comgoo



valveorfussseminmoo.ru
/



www.treenfes.con/news
/Jiplacer/condici+missox-polody/
/aete-ize/friio/a/2014/08/p/vhs_tas_thet_Drittilored_ratb_t1207_cauar_outotq_2036744n.gif
/imgf/tromas/products/1152-whare/
/viciansle881/stize/ualpaelelaen-word-de.jpg
/Branior/.jpg-weagsp_pack/fluerire/andej-derary/30811/l0/337126742403-84862116.html
/rucing/files/wow-wobss/blog-582071.limall.ricivetac.com




05.jeurl.com

/gins
/ingarpportusiddradiche-nt/5/
/trotooy-102.aspx
/rings.php
/
/Troningual
/product/
/satis/flelinesicis/1.jpg
/nat/0413094aa62c66f9.its50224995265-128
/zteps/di/gest-titon
/gamediabho

## Entropy of the model

How do we assess the performance of a model?

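Below I use code from <tt>make_synthetic.py</tt> to compute the average entropy of the model's next-byte predictions on a test string of 10k bytes. The entropy is measured in nats (natural logarithm), so a uniform guess over all 256 byte values would score ln 256 ≈ 5.55.

The smaller the better: a low value means the model concentrates its probability on a few candidate bytes.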
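
To put the numbers in this section on a scale, the following sketch evaluates the same entropy formula on the two extreme cases for a 256-way prediction. The vectors <tt>uniform</tt> and <tt>certain</tt> are made-up examples, not model output.

```python
import math


def entropy(pred):
    """Shannon entropy (in nats) of a probability vector."""
    return sum(-p * math.log(p) for p in pred if p > 0)


uniform = [1.0 / 256] * 256        # no idea which byte comes next
certain = [1.0] + [0.0] * 255      # all probability on a single byte

print(entropy(uniform))                 # ln(256), about 5.545 nats
print(entropy(certain))                 # 0.0
print(entropy(uniform) / math.log(2))   # the same value in bits, about 8
```

Dividing by ln 2 converts any of the reported values from nats to bits per byte.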

In [22]:
import gzip
import numpy as np

datapath = "../sdata"
infile = "smallstring.gz"

with gzip.open("%s/%s" % (datapath, infile), 'rb') as f:
    content = f.read()

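# Represent each byte as a two-character hex string, e.g. 'w' -> '77'.
# bytearray() yields integers under both Python 2 and Python 3.
testbytes = ['%02x' % b for b in bytearray(content)]
n_bytes = len(testbytes)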

In [23]:
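modelfile = "../models/model_from_big_domain_string_1.gz_arch_8_32_32_unroll_20_step_3_dropout_0.1_iter_60.npy"

# The file holds a list of weight arrays of different shapes, so recent
# NumPy versions need allow_pickle=True to load it.
model_wts = np.load(modelfile, allow_pickle=True)

# First dimension of every 2-D weight matrix, in layer order.
arch = [w.shape[0] for w in model_wts if len(w.shape) == 2]

INSIZE = arch[0]        # input width: 8 (binary code) or 256 (one-hot)
nh = len(arch) // 2     # number of LSTM layers (integer division)
nhidden = [arch[1 + 2*i] for i in range(nh)]
OUTSIZE = 256           # one output per byte value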

In [24]:
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM

nlayers = len(nhidden)   # number of stacked LSTM layers
unroll = 20              # window length; matches unroll_20 in the model file name

# Rebuild the network without dropout layers; they hold no weights.
model = Sequential()
if nlayers > 1:
    # Every LSTM but the last returns the full sequence for the next LSTM.
    for i in range(nlayers-1):
        model.add(LSTM(nhidden[i],
                       return_sequences=True,
                       input_shape=(unroll, INSIZE)))
    model.add(LSTM(nhidden[nlayers-1],
                   return_sequences=False))
else:
    model.add(LSTM(nhidden[0],
                   return_sequences=False,
                   input_shape=(unroll, INSIZE)))
model.add(Dense(OUTSIZE))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# load current weights
model.set_weights(model_wts)

In [25]:
import math

def bn(x):
    """
    Binary representation of 0-255 as 8-long int vector.
    """
    return [int(bit) for bit in format(x, '08b')]

def entropy(pred):
    """
    Entropy (in nats) of a soft-max vector.
    """
    return sum(-p * math.log(p) for p in pred if p > 0)

# hexabet is not defined anywhere in this notebook; it is assumed here to be
# the two-character hex string of every byte value, matching testbytes.
hexabet = ['%02x' % x for x in range(256)]
binabet = [bn(x) for x in range(256)]
byte_idx = dict((c, i) for i, c in enumerate(hexabet))
nlayers = len(nhidden)

In [26]:
unroll = 20
x = np.zeros((1, unroll, INSIZE))

def stepping(window):
    """Encode a window of hex-coded bytes and predict the next-byte distribution."""
    # Clear the previous window; one-hot entries would otherwise accumulate.
    x[:] = 0.0
    if INSIZE == 8:
        for t, b in enumerate(window):
            x[0, t, :] = binabet[byte_idx[b]]
    elif INSIZE == 256:
        for t, b in enumerate(window):
            x[0, t, byte_idx[b]] = 1.0
    return model.predict(x, verbose=0)[0]

window = testbytes[:unroll]
idx = unroll

# Slide the window one byte at a time and record the prediction entropy.
entropies = []
while idx < n_bytes - unroll:
    preds = stepping(window)
    entropies.append(entropy(preds))
    next_byte = testbytes[idx]
    window = window[1:] + [next_byte]
    idx += 1

print(sum(entropies)/len(entropies))

2.33199206039


testmodel: 4.94551499849

big_string 10Ms on Macbook [INSIZE=256 ~60 iters]: 1.19513989953

big_domain_string on GPU [INSIZE=8 arch 16]: 


| Iteration | Entropy       |
| --------- |:-------------:|
| 1         | 2.78878392504 |
| 7         | 2.76468260125 |
| 17        | 2.76468260125 |
| 30        | 2.76859589659 |

big_domain_string on GPU [INSIZE=8 arch 32,32]: 

| Iteration | Entropy      |
| --------- |:------------:|
| 5         | 2.3545572814 |
| 10        | 2.3470987331 |
| 15        | 2.3276004177 |
| 20        | 2.3032590387 |
| 25        | 2.3182696723 |
| 30        | 2.3433261821 |
| 35        | 2.3360464611 |
| 40        | 2.3158563291 |
| 45        | 2.3591114079 |
| 50        | 2.3318835304 |
| 55        | 2.3279826120 |
| 60        | 2.3319920604 |
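
The entropies reported in this section describe how concentrated the predicted distribution is, not whether the byte the model favours actually occurs. A complementary check, which is not part of the original notebook, is the negative log-probability assigned to the byte that really follows. The sketch below uses a made-up four-way prediction, and <tt>surprisal</tt> is a hypothetical helper name.

```python
import math


def entropy(pred):
    """Shannon entropy (in nats) of a probability vector."""
    return sum(-p * math.log(p) for p in pred if p > 0)


def surprisal(pred, actual_index):
    """Negative log-probability (in nats) given to the item that occurred."""
    return -math.log(max(pred[actual_index], 1e-12))   # floor avoids log(0)


# A confident prediction over four outcomes (illustrative numbers only).
pred = [0.97, 0.01, 0.01, 0.01]

print(entropy(pred))        # low: the prediction is confident
print(surprisal(pred, 0))   # small when outcome 0 really follows
print(surprisal(pred, 3))   # large when outcome 3 follows instead
```

In the evaluation loop this would amount to averaging <tt>surprisal(preds, byte_idx[next_byte])</tt> alongside the entropy. Averaged over the test string, it gives the cross-entropy of the model on that string.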