# Char RNN

This notebook walks through the basics of a Char-RNN and trains character-level RNNs on several datasets.

We will use [Keras](https://keras.io "Keras Tutorial").


Everything is explained in detail in the [blog post](). This notebook replicates the results from the blog post and runs in Colab. Enjoy!


#### Run in Colab

You can run this notebook in Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dudeperf3ct/DL_notebooks/blob/master/RNN/char_rnn_keras.ipynb)



Here are some other very interesting results: [Cooking-Recipe](https://gist.github.com/nylki/1efbaa36635956d35bcc), [Obama-RNN](https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0), [Bible-RNN](https://twitter.com/RNN_Bible), [Folk-music](https://soundcloud.com/seaandsailor/sets/char-rnn-composes-irish-folk-music), [Learning Holiness](https://cpury.github.io/learning-holiness/), [AI Weirdness](http://aiweirdness.com/), [Auto-Generating Clickbait](https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/).

## Download data

In [0]:
! wget "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" -P {'data/'}
! wget "https://s3.amazonaws.com/text-datasets/nietzsche.txt" -P {'data/'}
! wget "http://www.gutenberg.org/files/31100/31100.txt" -P {'data/'}
! wget "http://www.gutenberg.org/cache/epub/29765/pg29765.txt" -P {'data/'}
! wget "https://raw.githubusercontent.com/ryanmcdermott/trump-speeches/master/speeches.txt" -P {'data/'}
! wget "https://raw.githubusercontent.com/mcleonard/pytorch-charRNN/master/data/anna.txt" -P {'data/'}
! wget "https://raw.githubusercontent.com/samim23/obama-rnn/master/input.txt" -P {'data/obama/'}

--2019-02-16 15:09:32--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘data/input.txt’


2019-02-16 15:09:38 (13.4 MB/s) - ‘data/input.txt’ saved [1115394/1115394]

--2019-02-16 15:09:40--  https://s3.amazonaws.com/text-datasets/nietzsche.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.85.133
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.85.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 600901 (587K) [text/plain]
Saving to: ‘data/nietzsche.txt’


2019-02-16 15:09:42 (787 KB/s) - ‘data/nietzsche.txt’ saved [600901/600901]

--2019-02-16 15:09:44--  http://www.gutenberg.org/files/311

In [0]:
import keras
from keras.layers import RNN, Dropout, TimeDistributed, Dense, Activation, Embedding, LSTM, SimpleRNN
from keras import optimizers
from keras.models import Sequential, load_model
from keras.utils import to_categorical
from keras.callbacks import LambdaCallback, EarlyStopping
from sklearn.model_selection import train_test_split
import numpy as np
import random
import sys

Using TensorFlow backend.


## Shakespeare RNN

In [0]:
file_path = 'data/input.txt'

with open(file_path, 'r') as f:
  
    data = f.read().lower()
    
print ('First 200 characters of data\n')
print (data[:200])

First 200 characters of data

first citizen:
before we proceed any further, hear me speak.

all:
speak, speak.

first citizen:
you are all resolved rather to die than to famish?

all:
resolved. resolved.

first citizen:
first, you


In [0]:
# unique characters in the dataset and their count
# (a set has no guaranteed order, so the ids assigned below can differ between sessions)
chars = set(data)
vocab_size = len(chars)
data_size = len(data)
print ('Unique characters:', chars)
print ('Length of Unique characters:', vocab_size)
print ('Number of characters in data:', data_size)

Unique characters: {'l', 'a', 'w', 'y', 'c', 'n', 'q', 'd', 'i', 'f', ',', 'b', ':', 'h', 'v', '&', 'k', 'x', 't', '.', '?', ';', 'p', 'g', '\n', 'o', ' ', '!', 'j', '3', '$', 'r', 'e', 'm', 'u', 's', 'z', '-', "'"}
Length of Unique characters: 39
Number of characters in data: 1115394


In [0]:
char2id = {ch:i for i, ch in enumerate(chars)}
id2char = {i:ch for i, ch in enumerate(chars)}

print ('Characters to id\n')
print (char2id)

Characters to id

{'l': 0, 'a': 1, 'w': 2, 'y': 3, 'c': 4, 'n': 5, 'q': 6, 'd': 7, 'i': 8, 'f': 9, ',': 10, 'b': 11, ':': 12, 'h': 13, 'v': 14, '&': 15, 'k': 16, 'x': 17, 't': 18, '.': 19, '?': 20, ';': 21, 'p': 22, 'g': 23, '\n': 24, 'o': 25, ' ': 26, '!': 27, 'j': 28, '3': 29, '$': 30, 'r': 31, 'e': 32, 'm': 33, 'u': 34, 's': 35, 'z': 36, '-': 37, "'": 38}
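
Because `chars` is a Python `set`, its iteration order can change between interpreter sessions, so a model saved in one session may not line up with a `char2id` built in another. A minimal sketch of a reproducible mapping, assuming that sorting the characters is acceptable (the helper name `build_vocab` and the `_demo` variables are hypothetical and not part of the notebook):

```python
def build_vocab(text):
    """Return (char2id, id2char) with ids assigned in sorted character order."""
    chars = sorted(set(text))  # sorting makes the ids stable across runs
    char2id = {ch: i for i, ch in enumerate(chars)}
    id2char = {i: ch for ch, i in char2id.items()}
    return char2id, id2char

char2id_demo, id2char_demo = build_vocab("hello world")
encoded = [char2id_demo[ch] for ch in "hello"]
decoded = ''.join(id2char_demo[i] for i in encoded)
print(encoded, decoded)
```

The round trip through `encoded` and `decoded` is a quick sanity check that the two dictionaries are inverses of each other.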


### Method 1

In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+maxlen]])

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[9, 8, 31, 35, 18, 26, 4, 8, 18, 8, 36, 32, 5, 12, 24, 11, 32, 9, 25, 31, 32, 26, 2, 32, 26, 22, 31, 25, 4, 32, 32, 7, 26, 1, 5, 3, 26, 9, 34, 31] 40 27884
[18] 1 27884


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', id2char[next_chars[0][0]])

Input: first citizen:
before we proceed any fur
Output: t
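
The loop above steps through the text in strides of `maxlen`, so the windows do not overlap and each 40-character window is paired with the single character that follows it. A small sketch on a toy string, with a hypothetical `step` parameter (not in the notebook) to show how a smaller stride would yield more, overlapping training pairs:

```python
def make_windows(text, maxlen, step):
    """Pair each window of `maxlen` characters with the character after it."""
    pairs = []
    for i in range(0, len(text) - maxlen, step):
        pairs.append((text[i:i + maxlen], text[i + maxlen]))
    return pairs

toy = "abcdefghij"
non_overlapping = make_windows(toy, maxlen=3, step=3)  # same stride as the notebook
overlapping = make_windows(toy, maxlen=3, step=1)      # hypothetical denser stride
print(non_overlapping)
print(len(overlapping))
```

A denser stride gives the network more examples from the same corpus at the cost of a larger training array.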


In [0]:
X = np.zeros((len(sentences), maxlen, vocab_size), dtype=np.int32)
one_hot = [to_categorical(c, num_classes=vocab_size) for i in range(len(sentences)) for c in sentences[i]]
X = np.array(one_hot).reshape(X.shape)
print (X[0], X.shape)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]] (27884, 40, 39)
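
Calling `to_categorical` once per character works, but it is slow on a corpus of about a million characters. A possible vectorized alternative, assuming the ids are already in a NumPy integer array, indexes into an identity matrix (the names `ids_demo` and `one_hot_demo` are illustrative, not from the notebook):

```python
import numpy as np

vocab_size_demo = 5
ids_demo = np.array([[0, 2, 4],
                     [1, 1, 3]])  # shape (num_windows, maxlen)

# Row k of the identity matrix is the one-hot vector for id k,
# so integer-array indexing yields shape (num_windows, maxlen, vocab_size).
one_hot_demo = np.eye(vocab_size_demo, dtype=np.float32)[ids_demo]
print(one_hot_demo.shape)
```

Taking `argmax` over the last axis recovers the original ids, which is a convenient way to check the encoding.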


In [0]:
Y = np.zeros((len(sentences), vocab_size), dtype=np.int32)
one_hot = [to_categorical(next_chars[i], num_classes=vocab_size) for i in range(len(next_chars))]
Y = np.array(one_hot).reshape(Y.shape)
print (Y[0], Y.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (27884, 39)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (22307, 40, 39) (22307, 39)
Validation: (5577, 40, 39) (5577, 39)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()

model.add(LSTM(128, return_sequences=True, input_shape=(maxlen, vocab_size)))
model.add(Dropout(0.8))
model.add(LSTM(128))
model.add(Dropout(0.7))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()
# RMSprop is created here but not used: compile() below is given 'Adam'
rmsprop = optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer='Adam')

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 40, 128)           86016     
_________________________________________________________________
dropout_1 (Dropout)          (None, 40, 128)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 39)                5031      
Total params: 222,631
Trainable params: 222,631
Non-trainable params: 0
_______________

In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char2id[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
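
`sample` divides the log-probabilities by `temperature` before renormalizing, so values below 1 sharpen the distribution toward the most likely character and values above 1 flatten it. A standalone sketch of just that rescaling step, using only the standard library (`apply_temperature` is a hypothetical helper, not part of the notebook, and the probabilities are made up):

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability list the same way `sample` does, without drawing."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.6, 0.3, 0.1]
for t in (0.2, 1.0, 1.3):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

The three temperatures mirror the `diversity` values used in `on_epoch_end`; a temperature of 1.0 leaves the distribution unchanged.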

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, train_y,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, val_y),
#           callbacks=[print_callback, es])

model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, val_y),
          callbacks=[es])

Instructions for updating:
Use tf.cast instead.
Train on 22307 samples, validate on 5577 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 20/150
Epoch 21/150
Epoch 22/150
Epoch 23/150
Epoch 24/150
Epoch 25/150
Epoch 26/150
Epoch 27/150
Epoch 28/150
Epoch 29/150
Epoch 30/150
Epoch 31/150
Epoch 32/150
Epoch 33/150
Epoch 34/150
Epoch 35/150
Epoch 36/150
Epoch 37/150
Epoch 38/150
Epoch 00038: early stopping


<keras.callbacks.History at 0x7fe2ee6a9198>
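
Training stopped at epoch 38 of 150 because of the `EarlyStopping` callback: with `patience=2`, training ends once `val_loss` has failed to improve on its best value for two epochs in a row. A simplified pure-Python model of that rule, assuming the default `min_delta=0` and ignoring the callback's other options (`stopping_epoch` is a hypothetical helper and the loss values are made up for illustration):

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # an improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

made_up_losses = [2.9, 2.5, 2.3, 2.31, 2.28, 2.29, 2.30, 2.10]
print(stopping_epoch(made_up_losses, patience=2))
```

Note that the final, lower loss in the list is never reached: a small `patience` can stop training just before a later improvement.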

In [0]:
# pick a random seed
start = np.random.randint(0, len(sentences)-1)
pattern = sentences[start]
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):
  
    x = np.zeros((1, maxlen, vocab_size), dtype=np.int32)
    one_hot = [to_categorical(pattern[i], num_classes=vocab_size) for i in range(len(pattern))]
    x = np.array(one_hot).reshape(x.shape)
    
    prediction = model.predict(x)[0]

    index = sample(prediction, 0.7)
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)
    
    # slide the window: drop the oldest id and add the new one
    # (builds a new list, so the entry in `sentences` is not modified)
    pattern = pattern[1:] + [index]

print ("\n")
print (''.join(s for s in sampled))

Seed:
" l
would not infect his reason?

ariel:
n "


l
would not infect his reason?

ariel:
not of he parises hin a wher a colly of fald
of whe thourd he that that the yurisen
he vere wound sone both mary breesios will in me me the e,
you swing. hear eore as i puole sersiss of the sole!

toarter:
waver bith my wross our the beklen heand!
thou the hean cotpore refered, to ghouch and baknel.

lerfont:
of not dutt the fronnsy thee lasy i scold.

ungortoss:
to it i ros wradtle heans i to shite whey,
who dato cotterd poreo, and the froocs haar unkase
then ment will  on ey mime;
so beet af in patitel the seele.

cotceus:
shall fold the eret: set beait the goert:
go the lence for siy wheor coce, and wios shify of heananld
and miret shas sweil whye i the corngoun.

foundon:
the sould to sould is thou dose,
thit soull hous gheet byel dore wete and,
i with sith of not he my the here hear in the ourt.

concincental:
the badcing a mostely wond the rostle
piten houlv andord slalt wher the riitent'd

### Method 2

In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+1: i+maxlen+1]])

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[9, 8, 31, 35, 18, 26, 4, 8, 18, 8, 36, 32, 5, 12, 24, 11, 32, 9, 25, 31, 32, 26, 2, 32, 26, 22, 31, 25, 4, 32, 32, 7, 26, 1, 5, 3, 26, 9, 34, 31] 40 27884
[8, 31, 35, 18, 26, 4, 8, 18, 8, 36, 32, 5, 12, 24, 11, 32, 9, 25, 31, 32, 26, 2, 32, 26, 22, 31, 25, 4, 32, 32, 7, 26, 1, 5, 3, 26, 9, 34, 31, 18] 40 27884


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', ''.join(id2char[i] for i in next_chars[0]))

Input: first citizen:
before we proceed any fur
Output: irst citizen:
before we proceed any furt
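
In this variant the target is the input shifted left by one character, so the model emits a prediction at every timestep: the output at position `t` is a guess for character `t+1`. When generating text, the prediction for the character that follows the whole window is therefore the one at the last timestep. A toy sketch of that alignment (the string and variable names are illustrative):

```python
text = "first citizen"
maxlen_demo = 5

window = text[0:maxlen_demo]       # model input
targets = text[1:maxlen_demo + 1]  # per-timestep targets, shifted by one

for t, (seen, target) in enumerate(zip(window, targets)):
    # At timestep t the network has seen window[:t+1] and should predict `target`.
    print(t, repr(window[:t + 1]), '->', repr(target))

next_after_window = targets[-1]  # corresponds to the last timestep's output
```

Only the final target lies outside the input window; every earlier target is a character the model also receives as input one step later.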


In [0]:
X = np.zeros((len(sentences), maxlen, vocab_size), dtype=np.int32)
one_hot = [to_categorical(c, num_classes=vocab_size) for i in range(len(sentences)) for c in sentences[i]]
X = np.array(one_hot).reshape(X.shape)
print (X[0], X.shape)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]] (27884, 40, 39)


In [0]:
Y = np.zeros((len(next_chars), maxlen), dtype=np.float32)
Y = np.array(next_chars).reshape(Y.shape)
print (Y[0], Y.shape)

[ 8 31 35 18 26  4  8 18  8 36 32  5 12 24 11 32  9 25 31 32 26  2 32 26
 22 31 25  4 32 32  7 26  1  5  3 26  9 34 31 18] (27884, 40)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (22307, 40, 39) (22307, 40)
Validation: (5577, 40, 39) (5577, 40)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()

model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, vocab_size)))
model.add(Dropout(0.8))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.7))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))

model.summary()
model.compile(loss='sparse_categorical_crossentropy', optimizer='Adam')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 40, 512)           1130496   
_________________________________________________________________
dropout_3 (Dropout)          (None, 40, 512)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 40, 512)           2099200   
_________________________________________________________________
dropout_4 (Dropout)          (None, 40, 512)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 40, 39)            20007     
Total params: 3,249,703
Trainable params: 3,249,703
Non-trainable params: 0
_________________________________________________________________


In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, vocab_size), dtype=np.int32)
            one_hot = [to_categorical(char2id[c], num_classes=vocab_size) for c in sentence]
            x_pred = np.array(one_hot).reshape(x_pred.shape)

            preds = model.predict(x_pred)[0]
            # preds holds one distribution per timestep; the last one has seen the full window
            next_index = sample(preds[-1], diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, np.expand_dims(train_y, -1),
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, np.expand_dims(val_y, -1)),
#           callbacks=[print_callback, es])

model.fit(train_x, np.expand_dims(train_y, -1),
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, np.expand_dims(val_y, -1)),
          callbacks=[es])

Train on 22307 samples, validate on 5577 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 20/150
Epoch 21/150
Epoch 22/150
Epoch 23/150
Epoch 24/150
Epoch 25/150
Epoch 26/150
Epoch 27/150
Epoch 28/150
Epoch 29/150
Epoch 30/150
Epoch 31/150
Epoch 32/150
Epoch 33/150
Epoch 34/150
Epoch 35/150
Epoch 36/150
Epoch 37/150
Epoch 38/150
Epoch 39/150
Epoch 40/150
Epoch 41/150
Epoch 42/150
Epoch 43/150
Epoch 44/150

<keras.callbacks.History at 0x7fe28851a748>


In [0]:
# pick a random seed
#start = np.random.randint(0, len(sentences)-1)
seed = "to be, or not to be that is the question"
pattern = [char2id[value] for value in seed]
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):
  
    x = np.zeros((1, maxlen, vocab_size), dtype=np.int32)
    one_hot = [to_categorical(pattern[i], num_classes=vocab_size) for i in range(len(pattern))]
    x = np.array(one_hot).reshape(x.shape)
    
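    prediction = model.predict(x)[0]

    # prediction holds one distribution per timestep; index -1 has seen the whole
    # window, while index 0 has seen only its first character
    # note: the output shown below was produced with prediction[0]
    index = sample(prediction[-1], 0.8)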
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)
    
    pattern.append(index)
    pattern = pattern[1:len(pattern)]

print ("\n")
print (''.join(s for s in sampled))

Seed:
" to be, or not to be that is the question "


to be, or not to be that is the question  hu  wuimdrha mseythonhhdcaee
su rponn:chiriyalfy,ueisrhtahendaeeotp,t ng sf  
oesyder '  snce, e ina?t  , t hiirsr.eiwd e  r ecfieie

mdhtgm
usw
hejisei,o rny cylhegnofsr dmie  yratn:a
adon
ac ffidsga;le'odaotes,ee

l:rtig
ltp'mcsleweoseo j
l l  nuhdt rmato
 ofetd rivhti.arb artuwawitt,taeem
etarwsduno gtte.on
m rtlo t .eseo wmw ;eohbaoesnerpehen
ndyou  ort.r
 t udaaht dn yp rhf haranwho. fthtomh
e
whtf rgmum tm.hpyiaya,op,oak
fe'eo ,e
pw?aaibdsi yihu
ii'sl ! ol lnybelgrls .cie
ysnl  nw .eradgs ac
cwesag,ese
 o?y
eerd?oduwoeiw
 gnie,mrluo lmrf   
gls

h,ne, woto .tinc.ssa i yruwoagrstyhul 
hi
ca
so i?f
ltgr
: yvlh ic cto  h  s,gptther
,mmn
 jv ai
 t,ioartojeonoieso  oaooea t o dcnuofsmww. fldilues'
 tn fi
rrnlawfhyrq agrfioaii
me enasupspyhgwaf
ea driio,uuf-rarsn'tsbidisot  opaoeeamacfmp aont
ttl-s: eose.us bsp wntautricereoa wnr.ii roa 
n;oas
s deeoteeorsh;lo  nuiiw a
ntutorabd wrrtet  nleds

### Method 3

In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+maxlen]])

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[9, 8, 31, 35, 18, 26, 4, 8, 18, 8, 36, 32, 5, 12, 24, 11, 32, 9, 25, 31, 32, 26, 2, 32, 26, 22, 31, 25, 4, 32, 32, 7, 26, 1, 5, 3, 26, 9, 34, 31] 40 27884
[18] 1 27884


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', id2char[next_chars[0][0]])

Input: first citizen:
before we proceed any fur
Output: t


In [0]:
X = np.zeros((len(sentences), maxlen), dtype=np.int32)
X = np.array(sentences).reshape(X.shape)
print (X[0], X.shape)

[ 9  8 31 35 18 26  4  8 18  8 36 32  5 12 24 11 32  9 25 31 32 26  2 32
 26 22 31 25  4 32 32  7 26  1  5  3 26  9 34 31] (27884, 40)
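
Method 3 feeds integer ids straight into an `Embedding` layer instead of one-hot vectors. Conceptually the layer is a trainable lookup table: selecting row `k` gives the same result as multiplying a one-hot vector for `k` by the table. A NumPy sketch of that equivalence, with a random matrix standing in for learned weights (all names here are illustrative, not from the notebook):

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size_demo, embed_dim = 6, 4
table = rng.normal(size=(vocab_size_demo, embed_dim))  # stand-in for learned weights

ids_demo = np.array([3, 0, 5])
looked_up = table[ids_demo]                       # what an embedding lookup returns
one_hot_demo = np.eye(vocab_size_demo)[ids_demo]
via_matmul = one_hot_demo @ table                 # same rows, computed the long way

print(looked_up.shape, np.allclose(looked_up, via_matmul))
```

The lookup avoids materializing the one-hot array, which is why `X` in this method has shape `(27884, 40)` rather than `(27884, 40, 39)`.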


In [0]:
Y = np.zeros((len(next_chars), vocab_size), dtype=np.float32)
one_hot = [to_categorical(c, num_classes=vocab_size) for i in range(len(next_chars)) for c in next_chars[i]]
Y = np.array(one_hot).reshape(Y.shape)
print (Y[0], Y.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (27884, 39)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (22307, 40) (22307, 39)
Validation: (5577, 40) (5577, 39)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()
model.add(Embedding(vocab_size, 512, input_length=maxlen))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.7))
model.add(LSTM(512))
model.add(Dropout(0.7))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='Adam')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 40, 512)           19968     
_________________________________________________________________
lstm_7 (LSTM)                (None, 40, 512)           2099200   
_________________________________________________________________
dropout_7 (Dropout)          (None, 40, 512)           0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 512)               2099200   
_________________________________________________________________
dropout_8 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 39)                20007     
Total params: 4,238,375
Trainable params: 4,238,375
Non-trainable params: 0
_________________________________________________________________


In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # integer ids for the Embedding layer (not one-hot vectors)
            ids = [char2id[c] for c in sentence]
            x_pred = np.array(ids).reshape((1, maxlen))

            preds = model.predict(x_pred)[0]
            next_index = sample(preds, diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, train_y,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, val_y),
#           callbacks=[print_callback, es])


model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, val_y),
          callbacks=[es])

Train on 22307 samples, validate on 5577 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 00014: early stopping


<keras.callbacks.History at 0x7fe2a32c73c8>

In [0]:
# pick a random seed
start = np.random.randint(0, len(sentences)-1)
pattern = sentences[start]
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):
  
    x = np.zeros((1, maxlen), dtype=np.int32)
    x = np.array(pattern).reshape(x.shape)
    
    prediction = model.predict(x)[0]

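    index = sample(prediction, 1.2)
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)
    
    # slide the window: drop the oldest id and add the new one
    # (builds a new list, so the entry in `sentences` is not modified)
    pattern = pattern[1:] + [index]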

print ("\n")
print (''.join(s for s in sampled))

Seed:
" twas i won the wager, though you hit the "


twas i won the wager, though you hit theirs frate him, for!

 carmily:
he somine to not nobl a lome cray dauwer,
and i could soun would well remate thrank les
us poarment frowseld fair us, to cently pracael;
and iis
weng the beapteh guing
fir cloices, for stance worch you o'n leak.

lord,
wwith o't gouw.

dostcnirius:
o come, to baureful pey, my forth thou wing for to die,-
chrempitelsoo!
so me, stend hard confixter me willion gaord?

cloucontily:
ay, to thou epetulimer
of otding of encesten.

-stiben:
i geid, swat, seo; ride, i gray gon my this got's dance!
is cuse remolgle, and sour?

your: i doach of barcaintul, blouch
hating of jessulce tre! you thear not!
thing but you ruse claopol a cunkcep;
you as not his math'r wife.
but minusty of hath firtof of swand to of deagh.

vesjuncenio:
cive my hould hivenore, with that go liccired?
 dever
ole caore of in
this of secsen pricools would pinty know!

dost that sweet youch
that dequcon h

## Nietzsche RNN

In [0]:
file_path = 'data/nietzsche.txt'

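# the text contains non-ASCII characters (æ, ë, é, ä), so set the encoding explicitly
with open(file_path, 'r', encoding='utf-8') as f:
    data = f.read().lower()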
    
print ('First 200 characters of data\n')
print (data[:200])

First 200 characters of data

preface


supposing that truth is a woman--what then? is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrib


In [0]:
# unique characters in dataset and its count
chars = set(data)
vocab_size = len(set(data))
data_size = len(data)
print ('Unique characters:', chars)
print ('Length of Unique characters:', vocab_size)
print ('Number of characters in data:', data_size)

Unique characters: {'l', '8', 'a', '5', '=', 'w', 'y', 'æ', 'ë', '0', 'c', 'n', '_', '2', '(', 'q', 'd', 'i', 'f', ',', 'h', 'b', ':', 'v', '7', 'k', 'x', 't', '.', '1', ']', '?', ';', 'p', 'g', '\n', 'o', ' ', '!', 'j', '3', '4', 'r', 'e', '"', 'm', 'u', 'é', 's', 'z', '[', '-', '9', 'ä', ')', "'", '6'}
Length of Unique characters: 57
Number of characters in data: 600893


In [0]:
char2id = {ch:i for i, ch in enumerate(chars)}
id2char = {i:ch for i, ch in enumerate(chars)}

print ('Characters to id\n')
print (char2id)

Characters to id

{'l': 0, '8': 1, 'a': 2, '5': 3, '=': 4, 'w': 5, 'y': 6, 'æ': 7, 'ë': 8, '0': 9, 'c': 10, 'n': 11, '_': 12, '2': 13, '(': 14, 'q': 15, 'd': 16, 'i': 17, 'f': 18, ',': 19, 'h': 20, 'b': 21, ':': 22, 'v': 23, '7': 24, 'k': 25, 'x': 26, 't': 27, '.': 28, '1': 29, ']': 30, '?': 31, ';': 32, 'p': 33, 'g': 34, '\n': 35, 'o': 36, ' ': 37, '!': 38, 'j': 39, '3': 40, '4': 41, 'r': 42, 'e': 43, '"': 44, 'm': 45, 'u': 46, 'é': 47, 's': 48, 'z': 49, '[': 50, '-': 51, '9': 52, 'ä': 53, ')': 54, "'": 55, '6': 56}


In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+maxlen]])

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[33, 42, 43, 18, 2, 10, 43, 35, 35, 35, 48, 46, 33, 33, 36, 48, 17, 11, 34, 37, 27, 20, 2, 27, 37, 27, 42, 46, 27, 20, 37, 17, 48, 37, 2, 37, 5, 36, 45, 2] 40 15022
[11] 1 15022


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', id2char[next_chars[0][0]])

Input: preface


supposing that truth is a woma
Output: n


In [0]:
X = np.zeros((len(sentences), maxlen), dtype=np.int32)
X = np.array(sentences).reshape(X.shape)
print (X[0], X.shape)

[33 42 43 18  2 10 43 35 35 35 48 46 33 33 36 48 17 11 34 37 27 20  2 27
 37 27 42 46 27 20 37 17 48 37  2 37  5 36 45  2] (15022, 40)


In [0]:
Y = np.zeros((len(next_chars), vocab_size), dtype=np.float32)
one_hot = [to_categorical(c, num_classes=vocab_size) for i in range(len(next_chars)) for c in next_chars[i]]
Y = np.array(one_hot).reshape(Y.shape)
print (Y[0], Y.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0.] (15022, 57)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (12017, 40) (12017, 57)
Validation: (3005, 40) (3005, 57)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()
model.add(Embedding(vocab_size, 512, input_length=maxlen))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.7))
model.add(LSTM(512))
model.add(Dropout(0.7))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='Adam')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 40, 512)           29184     
_________________________________________________________________
lstm_9 (LSTM)                (None, 40, 512)           2099200   
_________________________________________________________________
dropout_9 (Dropout)          (None, 40, 512)           0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 512)               2099200   
_________________________________________________________________
dropout_10 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 57)                29241     
Total params: 4,256,825
Trainable params: 4,256,825
Non-trainable params: 0
_________________________________________________________________
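
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are ours and are not part of Keras:

```python
def embedding_params(n_chars, dim):
    # one dim-sized vector per character
    return n_chars * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

n_chars, dim, units = 57, 512, 512
total = (embedding_params(n_chars, dim)
         + lstm_params(dim, units)
         + lstm_params(units, units)
         + dense_params(units, n_chars))
print(total)
```

Only the embedding and dense layers depend on the vocabulary size, so the two LSTM layers keep the same 2,099,200 parameters for every dataset in this notebook.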


In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # encode the current window as a (1, maxlen) array of character ids
            ids = [char2id[c] for c in sentence]
            x_pred = np.array(ids).reshape((1, maxlen))

            preds = model.predict(x_pred)[0]
            next_index = sample(preds, diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
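
To see what the `temperature` argument of `sample` does, we can apply the same rescaling to a small made-up distribution and skip the random draw. This is only a sketch: `apply_temperature` and `toy_preds` are hypothetical names used for illustration.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    # same rescaling as in sample(), without the random draw
    preds = np.asarray(preds).astype('float64')
    logits = np.log(preds) / temperature
    exp_preds = np.exp(logits)
    return exp_preds / np.sum(exp_preds)

toy_preds = [0.6, 0.3, 0.1]   # made-up probabilities
for t in [0.2, 1.0, 1.3]:
    print(t, np.round(apply_temperature(toy_preds, t), 3))
```

A temperature below 1 pushes most of the probability onto the most likely character, so the text becomes more repetitive. A temperature above 1 flattens the distribution, so the text becomes more varied and makes more spelling mistakes.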

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, train_y,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, val_y),
#           callbacks=[print_callback, es])

model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, val_y),
          callbacks=[es])

Train on 12017 samples, validate on 3005 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 00013: early stopping


<keras.callbacks.History at 0x7fe2a51abc18>

In [0]:
# pick a random seed
start = np.random.randint(0, len(sentences))  # upper bound is exclusive
pattern = list(sentences[start])  # copy, so the stored window is not mutated by append below
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):
  
    x = np.zeros((1, maxlen), dtype=np.int32)
    x = np.array(pattern).reshape(x.shape)
    
    prediction = model.predict(x)[0]

    index = sample(prediction, 1.2)
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)
    
    pattern.append(index)
    pattern = pattern[1:len(pattern)]

print ("\n")
print (''.join(s for s in sampled))

Seed:
" ibility: as is whatever rests upon the e "


ibility: as is whatever rests upon the ertcatk; the swrow high
stryst (nproully on indinghen, a
weenf thit a nofnermant
the shand bedoung, beoore
maty
vesy mist
mand'r masiding;--wa chover now be. prane that to a cistenfty that
with ont
lon rawes- is a aues it prectore indeidend (msot mast, the mowely, and sond il siit to too this ,priveror
"bqurauns to dode 4ed whould svein
3f man spouctaser, sentikne, as we man whomen wo  so lingint, thal mist 9vey as presalsize
in a corteltint in
ponoftunle-, in the reatser bey re hiss soplesos, thit whead nomebses
beclectian intorpothe or sother latuse--wall momelhess e
exizandi"t and maks cunligion wikh the sover
tith themes but be tend,
at
paswsicats thear
sour ypess"9hind fa, and the ermine
hat wain hich this dawter of pradsiens, indelfencetasy, (ron the ertanner
horging, concu(haquariom and ab
egear"y, bees hebeprecever and the cotranss of them devery. in the bokverely thet-with
domeporing th
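
The generation loop above is repeated for every dataset in this notebook, so it could be factored into one function. Below is a minimal sketch under stated assumptions: `generate_ids`, `predict_fn` and the uniform dummy predictor are hypothetical names, and in this notebook `predict_fn` would wrap something like `model.predict(x)[0]`.

```python
import numpy as np

def generate_ids(seed_ids, predict_fn, n_chars, temperature=1.0, rng=None):
    # seed_ids: list of character ids
    # predict_fn: maps a (1, window_length) array of ids to a probability vector
    if rng is None:
        rng = np.random.default_rng()
    window = list(seed_ids)          # copy, so the caller's list is untouched
    out = []
    for _ in range(n_chars):
        preds = np.asarray(predict_fn(np.array(window)[None, :]), dtype='float64')
        logits = np.log(preds) / temperature
        probs = np.exp(logits) / np.sum(np.exp(logits))
        idx = int(rng.choice(len(probs), p=probs))
        out.append(idx)
        window = window[1:] + [idx]  # slide the window by one character
    return out

# dummy predictor: uniform over a 5-character vocabulary (illustration only)
uniform = lambda x: np.full(5, 0.2)
seed = [0, 1, 2, 3]
ids = generate_ids(seed, uniform, n_chars=10, temperature=1.2)
print(ids)
```

The returned ids can then be turned back into text with `''.join(id2char[i] for i in ids)`.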

## AustenRNN

In [0]:
import io

file_path = 'data/31100.txt'

with io.open(file_path, 'r', encoding='windows-1252') as f:
    data = f.read().lower()
    
print ('First 200 characters of data\n')
print (data[:200])

First 200 characters of data


project gutenberg's the complete works of jane austen, by jane austen

this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  you may copy it, give it aw


In [0]:
# unique characters in the dataset and their count
chars = set(data)
vocab_size = len(chars)
data_size = len(data)
print ('Unique characters:', chars)
print ('Length of Unique characters:', vocab_size)
print ('Number of characters in data:', data_size)

# restrict the data so that it fits in RAM;
# comment out the next line to run on all data
data = data[:473597]
data_size = len(data)

Unique characters: {'l', '5', 'a', '8', 'w', 'y', '0', 'c', 'n', '_', '2', '(', 'q', 'd', ',', 'f', 'i', 'b', 'h', ':', 'v', '#', '7', '&', 'k', ']', 't', '.', '1', 'x', '?', ';', '/', 'p', 'g', 'o', '\n', ' ', '!', 'j', '3', '$', '4', 'r', 'e', '*', 'm', 'u', '"', '%', 's', 'z', '[', '-', '9', '\xa0', ')', '@', "'", '6'}
Length of Unique characters: 60
Number of characters in data: 4373597


In [0]:
char2id = {ch:i for i, ch in enumerate(chars)}
id2char = {i:ch for i, ch in enumerate(chars)}

print ('Characters to id\n')
print (char2id)

Characters to id

{'l': 0, '5': 1, 'a': 2, '8': 3, 'w': 4, 'y': 5, '0': 6, 'c': 7, 'n': 8, '_': 9, '2': 10, '(': 11, 'q': 12, 'd': 13, ',': 14, 'f': 15, 'i': 16, 'b': 17, 'h': 18, ':': 19, 'v': 20, '#': 21, '7': 22, '&': 23, 'k': 24, ']': 25, 't': 26, '.': 27, '1': 28, 'x': 29, '?': 30, ';': 31, '/': 32, 'p': 33, 'g': 34, 'o': 35, '\n': 36, ' ': 37, '!': 38, 'j': 39, '3': 40, '$': 41, '4': 42, 'r': 43, 'e': 44, '*': 45, 'm': 46, 'u': 47, '"': 48, '%': 49, 's': 50, 'z': 51, '[': 52, '-': 53, '9': 54, '\xa0': 55, ')': 56, '@': 57, "'": 58, '6': 59}
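
One thing to keep in mind: the iteration order of a Python `set` of strings can differ between runs, so the ids above are not guaranteed to be the same if the notebook is restarted. If we want to save a model and reuse the mapping later, sorting the characters first gives a stable order. A minimal sketch on a toy string (all `toy_` names are made up for illustration):

```python
toy_text = "hello world"   # stand-in for the notebook's `data`

toy_chars = sorted(set(toy_text))   # a sorted list gives a stable order
toy_char2id = {ch: i for i, ch in enumerate(toy_chars)}
toy_id2char = {i: ch for ch, i in toy_char2id.items()}

print(toy_char2id)
```

Building `toy_id2char` by inverting `toy_char2id` also guarantees that the two dictionaries agree with each other.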


In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+maxlen]])
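
The loop above moves forward by `maxlen` characters at a time, so the windows do not overlap and we get one training pair per 40 characters. A smaller step would give more (overlapping) samples from the same text. A minimal sketch, assuming a hypothetical helper `make_windows` with a `step` parameter and a made-up 12-character text:

```python
def make_windows(text, maxlen, step):
    # returns (window, next character) pairs; `step` controls the overlap
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets

toy_text = "abcdefghijkl"   # made-up text, 12 characters
w_disjoint, t_disjoint = make_windows(toy_text, maxlen=4, step=4)
w_overlap, t_overlap = make_windows(toy_text, maxlen=4, step=1)
print(len(w_disjoint), len(w_overlap))
```

More samples also means more memory and longer epochs, which is the trade-off behind restricting the data earlier in this section.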

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[36, 33, 43, 35, 39, 44, 7, 26, 37, 34, 47, 26, 44, 8, 17, 44, 43, 34, 58, 50, 37, 26, 18, 44, 37, 7, 35, 46, 33, 0, 44, 26, 44, 37, 4, 35, 43, 24, 50, 37] 40 11839
[35] 1 11839


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', id2char[next_chars[0][0]])

Input: 
project gutenberg's the complete works 
Output: o


In [0]:
X = np.zeros((len(sentences), maxlen), dtype=np.int32)
X = np.array(sentences).reshape(X.shape)
print (X[0], X.shape)

[36 33 43 35 39 44  7 26 37 34 47 26 44  8 17 44 43 34 58 50 37 26 18 44
 37  7 35 46 33  0 44 26 44 37  4 35 43 24 50 37] (11839, 40)


In [0]:
# one-hot encode the target character of every sample
Y = to_categorical(np.array(next_chars).ravel(), num_classes=vocab_size)
print (Y[0], Y.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (11839, 60)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (9471, 40) (9471, 60)
Validation: (2368, 40) (2368, 60)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()
model.add(Embedding(vocab_size, 512, input_length=maxlen))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.7))
model.add(LSTM(512))
model.add(Dropout(0.7))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='Adam')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 40, 512)           30720     
_________________________________________________________________
lstm_11 (LSTM)               (None, 40, 512)           2099200   
_________________________________________________________________
dropout_11 (Dropout)         (None, 40, 512)           0         
_________________________________________________________________
lstm_12 (LSTM)               (None, 512)               2099200   
_________________________________________________________________
dropout_12 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 60)                30780     
Total params: 4,259,900
Trainable params: 4,259,900
Non-trainable params: 0
_________________________________________________________________


In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # encode the current window as a (1, maxlen) array of character ids
            ids = [char2id[c] for c in sentence]
            x_pred = np.array(ids).reshape((1, maxlen))

            preds = model.predict(x_pred)[0]
            next_index = sample(preds, diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, train_y,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, val_y),
#           callbacks=[print_callback, es])

model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, val_y),
          callbacks=[es])

Train on 9471 samples, validate on 2368 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 00012: early stopping


<keras.callbacks.History at 0x7fe2a3342be0>

In [0]:
# pick a random seed
start = np.random.randint(0, len(sentences))  # upper bound is exclusive
pattern = list(sentences[start])  # copy, so the stored window is not mutated by append below
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):
  
    x = np.zeros((1, maxlen), dtype=np.int32)
    x = np.array(pattern).reshape(x.shape)
    
    prediction = model.predict(x)[0]

    index = sample(prediction, 1.2)
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)
    
    pattern.append(index)
    pattern = pattern[1:len(pattern)]

print ("\n")
print (''.join(s for s in sampled))

Seed:
" cut of a laundress and a waiter,
rather  "


cut of a laundress and a waiter,
rather mres anthoingh; nom?"  ho cupsan a
recuck: 
aspich the, willly; as hom soene the
orpteling ibgist hersecof. so, and and thet the ere helcieh new at and thonk
sthe
read and in the
heond had satrexmf; shertyencwoth, that levifile, with band in sorird,
ylosed anne oefores had main by, dot soger
sot lome; whish., a
cuptes chomk hishirlither, their. 
thoughein, gir he. 
acpants, or whe .n could to
mirk, of him, ablavant'.
 the ald enkitally, and doed
the polgont not)th, of cevlicent lo"dsint! bots shiwh tho egren?ss
on her mapes no
stittling seaver heroane, pumketter sorther ryxkess-yexssoore of, for forther's.
.

sokover rigbss, ind bath and the mrettrill akviceirt in scentoneacapain.  i eonevill, for withs, sher' chingifary soy whith me elcestencmicujiwicion, engibe ristt muth; at a rons, than wall tell wanted wankereing comfolectan im bittle; and of elliedst, and she was kee sare a bevor beed mis

## DictionaryRNN

In [0]:
file_path = 'data/pg29765.txt'

with open(file_path, 'r', encoding='utf-8') as f:
    data = f.read().lower()
    
print ('First 200 characters of data\n')
print (data[:200])

First 200 characters of data

﻿the project gutenberg ebook of webster's unabridged dictionary, by various

this ebook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  you may copy it, give 
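
The first character printed above is a byte order mark (`'\ufeff'`), and it ends up in the vocabulary as its own character. Decoding with `utf-8-sig` instead of `utf-8` drops it. The sketch below shows the difference on a small temporary file (not the dictionary file itself; the file name and contents are made up):

```python
import os
import tempfile

# write a small UTF-8 file that starts with a byte order mark (BOM)
tmp_path = os.path.join(tempfile.mkdtemp(), 'bom_example.txt')
with open(tmp_path, 'wb') as f:
    f.write(b'\xef\xbb\xbfThe Project')

# plain utf-8 keeps the BOM as the first character
with open(tmp_path, 'r', encoding='utf-8') as f:
    with_bom = f.read().lower()

# utf-8-sig strips the BOM if it is present
with open(tmp_path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read().lower()

print(repr(with_bom[0]), repr(without_bom[0]))
```

Using this for the dataset would presumably shrink the vocabulary by one character, so the numbers printed in the cells below would change accordingly.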


In [0]:
# unique characters in the dataset and their count
chars = set(data)
vocab_size = len(chars)
data_size = len(data)
print ('Unique characters:', chars)
print ('Length of Unique characters:', vocab_size)
print ('Number of characters in data:', data_size)

# restrict the data so that it fits in RAM;
# comment out the next line to run on all data
data = data[:156206]
data_size = len(data)

Unique characters: {'ð', 'l', '5', '8', '=', 'û', '2', 'd', 'á', 'b', '#', 'v', '<', 't', ';', '\n', 'ç', '½', '}', '\\', 'â', 'e', 'u', '>', 's', 'ù', 'ò', 'ö', '[', '9', '6', '¿', '¼', 'y', 'c', ',', 'f', 'h', ':', '1', 'ê', '^', 'o', '!', 'j', 'ú', 'r', 'ô', 'é', '"', '%', '`', 'è', '\t', '-', '\ufeff', 'î', '@', 'ä', 'ë', 'w', '£', '_', '(', '/', 'i', '7', 'x', '.', 'þ', '§', '×', 'í', '3', '4', 'º', 'ì', '~', '°', 'z', 'ñ', 'ó', '+', '¾', ')', '{', '|', '÷', 'a', 'æ', 'ï', '0', 'n', 'ý', 'q', '&', 'k', ']', 'p', 'g', 'à', ' ', '$', '*', 'ã', 'm', 'å', 'ü', "'"}
Length of Unique characters: 109
Number of characters in data: 27956206


In [0]:
char2id = {ch:i for i, ch in enumerate(chars)}
id2char = {i:ch for i, ch in enumerate(chars)}

print ('Characters to id\n')
print (char2id)

Characters to id

{'ð': 0, 'l': 1, '5': 2, '8': 3, '=': 4, 'û': 5, '2': 6, 'd': 7, 'á': 8, 'b': 9, '#': 10, 'v': 11, '<': 12, 't': 13, ';': 14, '\n': 15, 'ç': 16, '½': 17, '}': 18, '\\': 19, 'â': 20, 'e': 21, 'u': 22, '>': 23, 's': 24, 'ù': 25, 'ò': 26, 'ö': 27, '[': 28, '9': 29, '6': 30, '¿': 31, '¼': 32, 'y': 33, 'c': 34, ',': 35, 'f': 36, 'h': 37, ':': 38, '1': 39, 'ê': 40, '^': 41, 'o': 42, '!': 43, 'j': 44, 'ú': 45, 'r': 46, 'ô': 47, 'é': 48, '"': 49, '%': 50, '`': 51, 'è': 52, '\t': 53, '-': 54, '\ufeff': 55, 'î': 56, '@': 57, 'ä': 58, 'ë': 59, 'w': 60, '£': 61, '_': 62, '(': 63, '/': 64, 'i': 65, '7': 66, 'x': 67, '.': 68, 'þ': 69, '§': 70, '×': 71, 'í': 72, '3': 73, '4': 74, 'º': 75, 'ì': 76, '~': 77, '°': 78, 'z': 79, 'ñ': 80, 'ó': 81, '+': 82, '¾': 83, ')': 84, '{': 85, '|': 86, '÷': 87, 'a': 88, 'æ': 89, 'ï': 90, '0': 91, 'n': 92, 'ý': 93, 'q': 94, '&': 95, 'k': 96, ']': 97, 'p': 98, 'g': 99, 'à': 100, ' ': 101, '$': 102, '*': 103, 'ã': 104, 'm': 105, 'å': 106, 'ü': 107, "'"

In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+maxlen]])

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[55, 13, 37, 21, 101, 98, 46, 42, 44, 21, 34, 13, 101, 99, 22, 13, 21, 92, 9, 21, 46, 99, 101, 21, 9, 42, 42, 96, 101, 42, 36, 101, 60, 21, 9, 24, 13, 21, 46, 108] 40 3905
[24] 1 3905


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', id2char[next_chars[0][0]])

Input: ﻿the project gutenberg ebook of webster'
Output: s


In [0]:
X = np.zeros((len(sentences), maxlen), dtype=np.int32)
X = np.array(sentences).reshape(X.shape)
print (X[0], X.shape)

[ 55  13  37  21 101  98  46  42  44  21  34  13 101  99  22  13  21  92
   9  21  46  99 101  21   9  42  42  96 101  42  36 101  60  21   9  24
  13  21  46 108] (3905, 40)


In [0]:
# one-hot encode the target character of every sample
Y = to_categorical(np.array(next_chars).ravel(), num_classes=vocab_size)
print (Y[0], Y.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (3905, 109)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (3124, 40) (3124, 109)
Validation: (781, 40) (781, 109)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()
model.add(Embedding(vocab_size, 512, input_length=maxlen))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.7))
model.add(LSTM(512))
model.add(Dropout(0.7))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='Adam')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 40, 512)           55808     
_________________________________________________________________
lstm_13 (LSTM)               (None, 40, 512)           2099200   
_________________________________________________________________
dropout_13 (Dropout)         (None, 40, 512)           0         
_________________________________________________________________
lstm_14 (LSTM)               (None, 512)               2099200   
_________________________________________________________________
dropout_14 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 109)               55917     
Total params: 4,310,125
Trainable params: 4,310,125
Non-trainable params: 0
_________________________________________________________________


In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # encode the current window as a (1, maxlen) array of character ids
            ids = [char2id[c] for c in sentence]
            x_pred = np.array(ids).reshape((1, maxlen))

            preds = model.predict(x_pred)[0]
            next_index = sample(preds, diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, train_y,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, val_y),
#           callbacks=[print_callback, es])

model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, val_y),
          callbacks=[es])

Train on 3124 samples, validate on 781 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 00016: early stopping


<keras.callbacks.History at 0x7fe2a10f3668>

In [0]:
# pick a random seed
start = np.random.randint(0, len(sentences))  # upper bound is exclusive
pattern = list(sentences[start])  # copy, so the stored window is not mutated by append below
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):
  
    x = np.zeros((1, maxlen), dtype=np.int32)
    x = np.array(pattern).reshape(x.shape)
    
    prediction = model.predict(x)[0]

    index = sample(prediction, 1.2)
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)
    
    pattern.append(index)
    pattern = pattern[1:len(pattern)]

print ("\n")
print (''.join(s for s in sampled))

Seed:
" lled an absent man is commonly either a  "


lled an absent man is commonly either a to fre pased wonn.

2. thintopyp an, the sindeleded
yre ale pee) the absotel; horersava, sethe
torihe
bxepaa sepennened.

defn: sele it a obtinh, frilont abnm the hensed; enertsewitk fong and a bausting; [tamddverein; nenthe cidtichepubu#
a[us.ir, d.t[.
[pyfn: mhiale il ofmeos oaptefss
verterelunion: is the abtia abalelinoracishe whe qish*in to soubliirhel to sheae a abbingler]
pefn: othe qretiamhs them loule ihes dety or a lende.

an(linefand l.]

d.f:

he fuf sorion ¼en) main; te hrafptiseaune
 in ug-disg syer
sectre to fef. the uwtasine
lale or hrypbsgebthe ne te'nd ormaube, lle..

bebtitilely
aboc. i*ctrac*teins, n.

defn: a ne the
 im vatyg
"raic) of sikhipe patite
ley [cowsrade ;re. opl once mtonm, or al, the ubbite
the wisilentia whoruey
asirite falirgl
ya me ore leenon) iffdiwde

hes(a, hyne.
 eng. alerocery*tes (grosst latyl
oy here.

"ed unasg, accomding the-; methisrile sipingdiil is

## ObamaRNN

In [0]:
file_path = 'data/obama/input.txt'

with open(file_path, 'r', encoding='utf-8') as f:
    data = f.read().lower()
    
print ('First 200 characters of data\n')
print (data[:200])

First 200 characters of data

to chip, kathy, and nancy, who graciously shared your father with a nation that loved him; to walter's friends, colleagues, protégés, and all who considered him a hero; to the men of the intrepid; to 


In [0]:
# unique characters in the dataset and their count
chars = set(data)
vocab_size = len(chars)
data_size = len(data)
print ('Unique characters:', chars)
print ('Length of Unique characters:', vocab_size)
print ('Number of characters in data:', data_size)

# restrict the data so that it fits in RAM;
# comment out the next line to run on all data
data = data[:256206]
data_size = len(data)

Unique characters: {'l', '8', '5', '—', '2', 'd', 'á', 'b', 'v', '<', 't', '?', ';', '\n', 'ַ', 'ç', '\x92', 'e', 'u', '’', '>', 's', '[', '9', 'ב', '6', 'ר', '¼', 'y', '²', 'c', ',', 'f', 'h', ':', '1', 'o', '!', 'j', 'ą', '“', 'r', 'ô', 'é', '"', '%', '`', 'è', 'ֹ', '-', '”', 'w', '(', '/', 'i', '7', 'x', '.', '–', 'í', '3', '4', 'z', 'ł', 'ñ', 'ó', 'ּ', '+', ')', '¹', '‘', 'ָ', 'ę', 'a', 'ï', '0', 'n', 'q', '&', 'k', ']', 'ת', '…', 'p', 'g', 'à', ' ', 'ו', 'ה', 'ד', '$', '*', 'm', "'"}
Length of Unique characters: 94
Number of characters in data: 4224143
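
The vocabulary above mixes common letters with characters that may appear only a few times (accented letters, Hebrew letters, control characters such as `'\x92'`). Because `chars` is computed before the data is truncated, some of them may not occur in the training slice at all. Counting characters helps decide whether to keep them. A minimal sketch on a made-up string; `min_count` and the choice of replacing rare characters with a space are assumptions, not something this notebook does:

```python
from collections import Counter

toy_text = "aaaa bbbb aañ"    # made-up text with one rare character
min_count = 2                  # hypothetical threshold
counts = Counter(toy_text)
rare = {ch for ch, n in counts.items() if n < min_count}

# replace rare characters with a space (one possible choice)
cleaned = ''.join(' ' if ch in rare else ch for ch in toy_text)
print(sorted(rare), len(set(toy_text)), len(set(cleaned)))
```

A smaller vocabulary means a smaller softmax layer and fewer classes that the model almost never sees during training.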


In [0]:
char2id = {ch:i for i, ch in enumerate(chars)}
id2char = {i:ch for i, ch in enumerate(chars)}

print ('Characters to id\n')
print (char2id)

Characters to id

{'l': 0, '8': 1, '5': 2, '—': 3, '2': 4, 'd': 5, 'á': 6, 'b': 7, 'v': 8, '<': 9, 't': 10, '?': 11, ';': 12, '\n': 13, 'ַ': 14, 'ç': 15, '\x92': 16, 'e': 17, 'u': 18, '’': 19, '>': 20, 's': 21, '[': 22, '9': 23, 'ב': 24, '6': 25, 'ר': 26, '¼': 27, 'y': 28, '²': 29, 'c': 30, ',': 31, 'f': 32, 'h': 33, ':': 34, '1': 35, 'o': 36, '!': 37, 'j': 38, 'ą': 39, '“': 40, 'r': 41, 'ô': 42, 'é': 43, '"': 44, '%': 45, '`': 46, 'è': 47, 'ֹ': 48, '-': 49, '”': 50, 'w': 51, '(': 52, '/': 53, 'i': 54, '7': 55, 'x': 56, '.': 57, '–': 58, 'í': 59, '3': 60, '4': 61, 'z': 62, 'ł': 63, 'ñ': 64, 'ó': 65, 'ּ': 66, '+': 67, ')': 68, '¹': 69, '‘': 70, 'ָ': 71, 'ę': 72, 'a': 73, 'ï': 74, '0': 75, 'n': 76, 'q': 77, '&': 78, 'k': 79, ']': 80, 'ת': 81, '…': 82, 'p': 83, 'g': 84, 'à': 85, ' ': 86, 'ו': 87, 'ה': 88, 'ד': 89, '$': 90, '*': 91, 'm': 92, "'": 93}


In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+maxlen]])

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[10, 36, 86, 30, 33, 54, 83, 31, 86, 79, 73, 10, 33, 28, 31, 86, 73, 76, 5, 86, 76, 73, 76, 30, 28, 31, 86, 51, 33, 36, 86, 84, 41, 73, 30, 54, 36, 18, 21, 0] 40 6405
[28] 1 6405


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', id2char[next_chars[0][0]])

Input: to chip, kathy, and nancy, who graciousl
Output: y


In [0]:
X = np.zeros((len(sentences), maxlen), dtype=np.int32)
X = np.array(sentences).reshape(X.shape)
print (X[0], X.shape)

[10 36 86 30 33 54 83 31 86 79 73 10 33 28 31 86 73 76  5 86 76 73 76 30
 28 31 86 51 33 36 86 84 41 73 30 54 36 18 21  0] (6405, 40)


In [0]:
# one-hot encode the target character of every sample
Y = to_categorical(np.array(next_chars).ravel(), num_classes=vocab_size)
print (Y[0], Y.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (6405, 94)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (5124, 40) (5124, 94)
Validation: (1281, 40) (1281, 94)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()
model.add(Embedding(vocab_size, 512, input_length=maxlen))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.7))
model.add(LSTM(512))
model.add(Dropout(0.7))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='Adam')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 40, 512)           48128     
_________________________________________________________________
lstm_15 (LSTM)               (None, 40, 512)           2099200   
_________________________________________________________________
dropout_15 (Dropout)         (None, 40, 512)           0         
_________________________________________________________________
lstm_16 (LSTM)               (None, 512)               2099200   
_________________________________________________________________
dropout_16 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 94)                48222     
Total params: 4,294,750
Trainable params: 4,294,750
Non-trainable params: 0
_________________________________________________________________


In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # encode the current window as a (1, maxlen) array of character ids
            ids = [char2id[c] for c in sentence]
            x_pred = np.array(ids).reshape((1, maxlen))

            preds = model.predict(x_pred)[0]
            next_index = sample(preds, diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, train_y,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, val_y),
#           callbacks=[print_callback, es])

model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, val_y),
          callbacks=[es])

Train on 5124 samples, validate on 1281 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 00015: early stopping


<keras.callbacks.History at 0x7fe28e7ecc50>

In [0]:
# pick a random seed
start = np.random.randint(0, len(sentences))  # upper bound is exclusive
pattern = list(sentences[start])  # copy, so the stored window is not mutated by append below
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):
  
    x = np.zeros((1, maxlen), dtype=np.int32)
    x = np.array(pattern).reshape(x.shape)
    
    prediction = model.predict(x)[0]

    index = sample(prediction, 1.2)
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)
    
    pattern.append(index)
    pattern = pattern[1:len(pattern)]

print ("\n")
print (''.join(s for s in sampled))

Seed:
" d vast amounts of money on things that a "


d vast amounts of money on things that athat't of losef.

boit we ,dod weunn rantifil a goond nith ade anva'gar toleed bat y-iwes you chesumen five froopen mabkonilg o5 offernrectenss by proteupic chagent. nriog a inacpoes thive exsou seary comlid and e no0
this hire lboar. of whane. ttring i hot doto whrore and hecsiins., comtels wiin to ranfte ams iñhaobels bomnfoy, aad ind ficy ;ouute kfut wer feoune venmy. n
wo whew joit hey eceolle mecpolniwih tres end i fuver..
 -- thenedi dhxotund freite,trand whal golld wenph at i bethd. fuv the dingicaing o't whith aring theas nas the we, pomst.rothhes, seterilissy of what  et rmoslen, beviwass everii¹ lecridilg or thesrededed, aadd ifmilgagien pripalel but.s
t'utmhed wrich wo proat i)s yurmitins bate., wevennd yoow riyin, that silet as umabawiut in yreiengis of thiretja serpepbur..

ew wre2s e1ldowat  retralmcpith a haase.,
rewums or dlompsein woars, ond thin'ns sirtign ald wunf ald dans th

## TrumpRNN

In [0]:
file_path = 'data/speeches.txt'

with open(file_path, 'r') as f:
  
    data = f.read().lower()
    
print ('First 200 characters of data\n')
print (data[:200])

First 200 characters of data

﻿speech 1


...thank you so much.  that's so nice.  isn't he a great guy.  he doesn't get a fair press; he doesn't get it.  it's just not fair.  and i have to tell you i'm here, and very strongly here


In [0]:
# unique characters in dataset and its count
chars = set(data)
vocab_size = len(set(data))
data_size = len(data)
print ('Unique characters:', chars)
print ('Length of Unique characters:', vocab_size)
print ('Number of characters in data:', data_size)

# restrict the data so that everything fits in RAM
# comment out the next line to run on all data
data = data[:496270]
data_size = len(data)

Unique characters: {'l', '5', 'a', '8', '=', 'w', 'y', '0', 'c', 'n', '—', '2', '_', '(', '”', 'q', 'd', 'i', 'f', ',', 'h', 'b', ':', 'v', '7', '/', '&', 'k', 'x', 't', '.', '1', ']', '?', ';', '–', '…', 'p', 'g', '\n', 'o', ' ', '!', 'j', '“', '$', '3', '4', 'r', 'e', '"', 'm', 'u', 'é', '%', '’', 's', 'z', '[', '-', '9', '\ufeff', ')', '‘', '@', "'", '6'}
Length of Unique characters: 67
Number of characters in data: 896270


In [0]:
char2id = {ch:i for i, ch in enumerate(chars)}
id2char = {i:ch for i, ch in enumerate(chars)}

print ('Characters to id\n')
print (char2id)

Characters to id

{'l': 0, '5': 1, 'a': 2, '8': 3, '=': 4, 'w': 5, 'y': 6, '0': 7, 'c': 8, 'n': 9, '—': 10, '2': 11, '_': 12, '(': 13, '”': 14, 'q': 15, 'd': 16, 'i': 17, 'f': 18, ',': 19, 'h': 20, 'b': 21, ':': 22, 'v': 23, '7': 24, '/': 25, '&': 26, 'k': 27, 'x': 28, 't': 29, '.': 30, '1': 31, ']': 32, '?': 33, ';': 34, '–': 35, '…': 36, 'p': 37, 'g': 38, '\n': 39, 'o': 40, ' ': 41, '!': 42, 'j': 43, '“': 44, '$': 45, '3': 46, '4': 47, 'r': 48, 'e': 49, '"': 50, 'm': 51, 'u': 52, 'é': 53, '%': 54, '’': 55, 's': 56, 'z': 57, '[': 58, '-': 59, '9': 60, '\ufeff': 61, ')': 62, '‘': 63, '@': 64, "'": 65, '6': 66}


In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+maxlen]])

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[61, 56, 37, 49, 49, 8, 20, 41, 31, 39, 39, 39, 30, 30, 30, 29, 20, 2, 9, 27, 41, 6, 40, 52, 41, 56, 40, 41, 51, 52, 8, 20, 30, 41, 41, 29, 20, 2, 29, 65] 40 12406
[56] 1 12406
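
The count of 12406 follows from the loop bounds: window starts run from 0 up to (but not including) `data_size - maxlen`, in steps of `maxlen`. The sketch below reproduces that arithmetic and shows how a smaller step would give overlapping windows and more training samples. `count_windows` and `make_windows` are hypothetical helpers written for illustration and are not part of this notebook.

```python
import math

def count_windows(data_size, maxlen, step):
    # number of start indices in range(0, data_size - maxlen, step)
    return max(0, math.ceil((data_size - maxlen) / step))

def make_windows(text, maxlen, step):
    # (window, next_char) pairs, mirroring the loop in the cell above
    pairs = []
    for i in range(0, len(text) - maxlen, step):
        pairs.append((text[i:i + maxlen], text[i + maxlen]))
    return pairs

# non-overlapping windows, as used in this notebook
print(count_windows(496270, 40, 40))
# overlapping windows with a step of 3 (an assumed alternative)
print(count_windows(496270, 40, 3))

toy = make_windows("hello world", maxlen=4, step=2)
print(toy[0])
```

A smaller step trades memory and training time for more examples drawn from the same text.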


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', id2char[next_chars[0][0]])

Input: ﻿speech 1


...thank you so much.  that'
Output: s


In [0]:
# every window has exactly maxlen ids, so this is a (num_windows, maxlen) array
X = np.array(sentences, dtype=np.int32)
print (X[0], X.shape)

[61 56 37 49 49  8 20 41 31 39 39 39 30 30 30 29 20  2  9 27 41  6 40 52
 41 56 40 41 51 52  8 20 30 41 41 29 20  2 29 65] (12406, 40)


In [0]:
# one-hot encode the single target character of each window
Y = to_categorical([c[0] for c in next_chars], num_classes=vocab_size)
print (Y[0], Y.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] (12406, 67)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (9924, 40) (9924, 67)
Validation: (2482, 40) (2482, 67)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()
model.add(Embedding(vocab_size, 512, input_length=maxlen))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.7))
model.add(LSTM(512))
model.add(Dropout(0.7))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='Adam')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 40, 512)           34304     
_________________________________________________________________
lstm_17 (LSTM)               (None, 40, 512)           2099200   
_________________________________________________________________
dropout_17 (Dropout)         (None, 40, 512)           0         
_________________________________________________________________
lstm_18 (LSTM)               (None, 512)               2099200   
_________________________________________________________________
dropout_18 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 67)                34371     
Total params: 4,267,075
Trainable params: 4,267,075
Non-trainable params: 0
_________________________________________________________________
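
The parameter counts in the summary can be checked by hand. An embedding holds `vocab_size * dim` weights; a standard LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias; a dense layer has a kernel plus a bias. Dropout layers have no parameters. The helper names below are illustrative only.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # kernel + bias
    return input_dim * units + units

vocab_size, dim, units = 67, 512, 512
counts = [
    embedding_params(vocab_size, dim),   # 34304
    lstm_params(dim, units),             # 2099200
    lstm_params(units, units),           # 2099200
    dense_params(units, vocab_size),     # 34371
]
print(counts, sum(counts))               # total: 4267075
```

Almost all of the weights sit in the two LSTM layers, so changing the vocabulary size barely affects the model size.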


In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # encode the current window as integer ids, shape (1, maxlen)
            x_pred = np.array([char2id[c] for c in sentence]).reshape(1, maxlen)

            preds = model.predict(x_pred)[0]
            next_index = sample(preds, diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
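
To see what the `temperature` argument of `sample` does, the sketch below applies the same log, divide, exponentiate and renormalise steps to a small distribution, without the random draw. The three probabilities are invented for illustration, and `apply_temperature` is a hypothetical helper. Low temperatures sharpen the distribution toward the most likely character, while high temperatures flatten it.

```python
import numpy as np

def apply_temperature(probs, temperature):
    # same rescaling as in sample(), minus the multinomial draw
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    scaled = np.exp(logits)
    return scaled / scaled.sum()

probs = [0.6, 0.3, 0.1]   # invented example distribution
for t in [0.2, 1.0, 1.3]:
    print(t, np.round(apply_temperature(probs, t), 3))
```

A temperature of 1.0 leaves the model's distribution unchanged, which is why values above 1 tend to produce more varied and more misspelled text.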

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, train_y,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, val_y),
#           callbacks=[print_callback, es])

model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, val_y),
          callbacks=[es])

Train on 9924 samples, validate on 2482 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 00013: early stopping


<keras.callbacks.History at 0x7fe28c4bf3c8>
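
Training stopped at epoch 13 because `EarlyStopping` was created with `patience=2`: it halts once `val_loss` has failed to improve for two epochs in a row. The sketch below mimics that rule in plain Python. It is a simplified model of the callback's behaviour rather than Keras code, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    # returns the 1-based epoch at which training would stop, or None
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [3.1, 2.8, 2.6, 2.5, 2.55, 2.6, 2.4]  # invented values
print(stopping_epoch(losses, patience=2))
print(stopping_epoch(losses, patience=3))
```

Under this rule the weights at the stopping epoch are not the best ones seen. In Keras versions that support it, passing `restore_best_weights=True` to `EarlyStopping` is one way to roll back to the best epoch.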

In [0]:
# pick a random seed (copy it so the stored training window is not mutated)
start = np.random.randint(0, len(sentences))
pattern = list(sentences[start])
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):

    # current window of character ids as a batch of shape (1, maxlen)
    x = np.array(pattern, dtype=np.int32).reshape(1, maxlen)

    prediction = model.predict(x)[0]

    # sample the next character id at temperature 1.2
    index = sample(prediction, 1.2)
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)

    # slide the window forward by one character
    pattern = pattern[1:] + [index]

print ("\n")
print (''.join(sampled))

Seed:
" me, i am a unifier. once we get all of t "


me, i am a unifier. once we get all of thinks have wonve honey risatigiaidn.
you know, an peolle have 6s nut.
i know, this ase, many bery nigf stroa trbat. you yea. what hele goed to this he5’re buese where the froon cranys welve ad’s jven’s contry.
morly this nveos with reose that be shesd and chat was 1100. you. goo’s looc stuf. the. yibe woull don’is being holle. not the wiedd chank ane hon 8f silieica'e alpony but uns thisks and thighapactuip,  umher 
ow, pucaess. stals any i ving in slet's of the peoplius.
yo buse going
that ’t going to we paks that.
" has wher asering becasic. whey dacans hibe, peolle do is – out whinh and a certing ad asaony puoricichs chonkend.
 we "mave, i miding boing upolidive lidve jrost 6f that-
se chean we les?
theif realbes. do ope the dass it beycion.
"ou goen thas knows,.








hom so peolly there aid.
you know but is seois.
.
ou be, i’m of 9oe.. an crank. he have to this is gos anlysadive sbigitio
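
The generation cell is repeated almost verbatim for every dataset in this notebook. One way to avoid the repetition is to wrap it in a function. The sketch below assumes a hypothetical `predict_fn` that maps a window of ids to a probability vector (in this notebook it would wrap `model.predict`); a uniform dummy stands in here so the example runs without a trained model. `generate` is not part of the notebook.

```python
import numpy as np

def generate(predict_fn, seed_ids, n_chars, temperature=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    window = list(seed_ids)          # copy, so the caller's seed is untouched
    out = []
    for _ in range(n_chars):
        probs = np.asarray(predict_fn(window), dtype="float64")
        # temperature rescaling, as in sample()
        scaled = np.exp(np.log(probs) / temperature)
        scaled /= scaled.sum()
        idx = int(rng.choice(len(scaled), p=scaled))
        out.append(idx)
        window = window[1:] + [idx]  # slide the window by one character
    return out

vocab = 5

def dummy_predict(window):
    # stand-in for the model: a uniform distribution over the vocabulary
    return np.full(vocab, 1.0 / vocab)

seed = [0, 1, 2, 3]
ids = generate(dummy_predict, seed, n_chars=10, temperature=1.2,
               rng=np.random.default_rng(0))
print(ids)
```

With a trained model, the call might look like `generate(lambda w: model.predict(np.array(w).reshape(1, maxlen))[0], sentences[start], 1000, 1.2)`, followed by mapping the returned ids through `id2char`.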

## AnnaRNN

In [0]:
file_path = 'data/anna.txt'

with open(file_path, 'r') as f:
  
    data = f.read().lower()
    
print ('First 200 characters of data\n')
print (data[:200])

First 200 characters of data

chapter 1


happy families are all alike; every unhappy family is unhappy in its own
way.

everything was in confusion in the oblonskys' house. the wife had
discovered that the husband was carrying on


In [0]:
# unique characters in dataset and its count
chars = set(data)
vocab_size = len(set(data))
data_size = len(data)
print ('Unique characters:', chars)
print ('Length of Unique characters:', vocab_size)
print ('Number of characters in data:', data_size)

# restrict the data so that everything fits in RAM
# comment out the next line to run on all data
data = data[:585223]
data_size = len(data)

Unique characters: {'l', '5', 'a', '8', 'w', 'y', '0', 'c', 'n', '_', '2', '(', 'q', 'd', 'i', 'f', ',', 'h', 'b', ':', 'v', '7', '/', '&', 'k', 'x', 't', '.', '1', '?', ';', 'p', 'g', '\n', 'o', ' ', '!', 'j', '3', '$', '4', 'r', 'e', '"', 'm', 'u', '*', '%', '`', 's', 'z', '-', '9', ')', '@', "'", '6'}
Length of Unique characters: 57
Number of characters in data: 1985223


In [0]:
char2id = {ch:i for i, ch in enumerate(chars)}
id2char = {i:ch for i, ch in enumerate(chars)}

print ('Characters to id\n')
print (char2id)

Characters to id

{'l': 0, '5': 1, 'a': 2, '8': 3, 'w': 4, 'y': 5, '0': 6, 'c': 7, 'n': 8, '_': 9, '2': 10, '(': 11, 'q': 12, 'd': 13, 'i': 14, 'f': 15, ',': 16, 'h': 17, 'b': 18, ':': 19, 'v': 20, '7': 21, '/': 22, '&': 23, 'k': 24, 'x': 25, 't': 26, '.': 27, '1': 28, '?': 29, ';': 30, 'p': 31, 'g': 32, '\n': 33, 'o': 34, ' ': 35, '!': 36, 'j': 37, '3': 38, '$': 39, '4': 40, 'r': 41, 'e': 42, '"': 43, 'm': 44, 'u': 45, '*': 46, '%': 47, '`': 48, 's': 49, 'z': 50, '-': 51, '9': 52, ')': 53, '@': 54, "'": 55, '6': 56}
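
Because `chars` is a Python set, the order in which characters are enumerated, and therefore the id given to each character, can differ from one Python session to the next. That matters if a trained model is saved and reloaded later with a freshly built mapping. A minimal sketch of a reproducible mapping, using a hypothetical `build_vocab` helper, sorts the characters first:

```python
def build_vocab(text):
    # sorting makes the character -> id assignment independent of set ordering
    chars = sorted(set(text))
    char2id = {ch: i for i, ch in enumerate(chars)}
    id2char = {i: ch for ch, i in char2id.items()}
    return char2id, id2char

char2id_demo, id2char_demo = build_vocab("happy families")
print(char2id_demo)
```

The ids produced this way would not match the mapping printed above, so a model trained with one mapping should always be used with that same mapping.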


In [0]:
# cut the text into fixed size inputs of length maxlen
maxlen = 40
sentences = []
next_chars = []

end = data_size - maxlen
for i in range(0, end, maxlen):
    sentences.append([char2id[ch] for ch in data[i : i+maxlen]])
    next_chars.append([char2id[ch] for ch in data[i+maxlen]])

In [0]:
print (sentences[0], len(sentences[0]), len(sentences))
print (next_chars[0], len(next_chars[0]), len(next_chars))

[7, 17, 2, 31, 26, 42, 41, 35, 28, 33, 33, 33, 17, 2, 31, 31, 5, 35, 15, 2, 44, 14, 0, 14, 42, 49, 35, 2, 41, 42, 35, 2, 0, 0, 35, 2, 0, 14, 24, 42] 40 14630
[30] 1 14630


In [0]:
print ('Input:', ''.join(id2char[i] for i in sentences[0]))
print ('Output:', id2char[next_chars[0][0]])

Input: chapter 1


happy families are all alike
Output: ;


In [0]:
# every window has exactly maxlen ids, so this is a (num_windows, maxlen) array
X = np.array(sentences, dtype=np.int32)
print (X[0], X.shape)

[ 7 17  2 31 26 42 41 35 28 33 33 33 17  2 31 31  5 35 15  2 44 14  0 14
 42 49 35  2 41 42 35  2  0  0 35  2  0 14 24 42] (14630, 40)


In [0]:
# one-hot encode the single target character of each window
Y = to_categorical([c[0] for c in next_chars], num_classes=vocab_size)
print (Y[0], Y.shape)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0.] (14630, 57)


In [0]:
train_x, val_x, train_y, val_y = train_test_split(X, Y, test_size=0.2)
print ('Training:', train_x.shape, train_y.shape)
print ('Validation:', val_x.shape, val_y.shape)

Training: (11704, 40) (11704, 57)
Validation: (2926, 40) (2926, 57)


In [0]:
batch_size = 64
epochs = 150

model = Sequential()
model.add(Embedding(vocab_size, 512, input_length=maxlen))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.7))
model.add(LSTM(512))
model.add(Dropout(0.7))
model.add(Dense(vocab_size, activation='softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='Adam')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 40, 512)           29184     
_________________________________________________________________
lstm_19 (LSTM)               (None, 40, 512)           2099200   
_________________________________________________________________
dropout_19 (Dropout)         (None, 40, 512)           0         
_________________________________________________________________
lstm_20 (LSTM)               (None, 512)               2099200   
_________________________________________________________________
dropout_20 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 57)                29241     
Total params: 4,256,825
Trainable params: 4,256,825
Non-trainable params: 0
_________________________________________________________________


In [0]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(data) - maxlen - 1)
    for diversity in [0.2, 0.7, 1.3]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = data[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            # encode the current window as integer ids, shape (1, maxlen)
            x_pred = np.array([char2id[c] for c in sentence]).reshape(1, maxlen)

            preds = model.predict(x_pred)[0]
            next_index = sample(preds, diversity)
            next_char = id2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

es = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

In [0]:
# use print_callback to print for every epoch
# model.fit(train_x, train_y,
#           batch_size=batch_size,
#           epochs=epochs,
#           validation_data=(val_x, val_y),
#           callbacks=[print_callback, es])

model.fit(train_x, train_y,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(val_x, val_y),
          callbacks=[es])

Train on 11704 samples, validate on 2926 samples
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 00013: early stopping


<keras.callbacks.History at 0x7fe28b19d400>

In [0]:
# pick a random seed (copy it so the stored training window is not mutated)
start = np.random.randint(0, len(sentences))
pattern = list(sentences[start])
print ("Seed:")
print ("\"", ''.join([id2char[value] for value in pattern]), "\"")
sampled = [id2char[value] for value in pattern]

# generate characters
for i in range(1000):

    # current window of character ids as a batch of shape (1, maxlen)
    x = np.array(pattern, dtype=np.int32).reshape(1, maxlen)

    prediction = model.predict(x)[0]

    # sample the next character id at temperature 1.2
    index = sample(prediction, 1.2)
    result = id2char[index]

    #sys.stdout.write(result)
    sampled.append(result)

    # slide the window forward by one character
    pattern = pattern[1:] + [index]

print ("\n")
print (''.join(sampled))

Seed:
"  the preparatory
bustle in the station,  "


 the preparatory
bustle in the station, becusion of a come the that i's soos, ebmad .vsen, and qucaring shispavy
roull can" his
shasce gill, qulwigill; and, and same to it a lich, and loke, in the craot, and tither.

"an'd pave rood, mad
yow roming simrony and bufsens-at the maering which of hamyebmlely
"vevy dedi ut miall wiwh pustady rompiviver-q,
chogsht, ches! i' dlolming, desding the coud her, myever.

"why _he's sleuving," she
kace like come
onethersigpzy prisice, and and the gordian, the baclliratarriin.
""ls, you ruke go nor visief, would on mavin undersanty fror the coolbnorihe, and haven and wondeds, lecutting a heorgg when, shikle
the said
to memnevher?
she gels, but over whe, he rapped? whit he cinter, it i bistrele.

""ame notf that qeiuntera'nts, acruls a prossed no her
menslous, meas docince, whan athing. 
but for that the vovbed the shourd nov a neble on the couct detting; to her souly exlouckly on the munces of vorig