<a href="https://colab.research.google.com/github/domschl/tensor-poet/blob/master/eager_poet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install TF 2.0, if necessary. This currently needs to be done when running from Colab.

In [1]:
!pip install tf-nightly-gpu-2.0-preview # tensorflow-gpu==2.0.0-alpha0


## References:
* <https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/text_generation.ipynb>
* <https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/shakespeare_with_tpu_and_keras.ipynb>

# [WIP] Eager Tensor Poet (tf 2.0)

**THIS IS UNFINISHED WORK IN PROGRESS**

A TensorFlow deep LSTM model for text generation.

In [0]:
import numpy as np
import os
import json
import time
import random
import tensorflow as tf
from IPython.display import display, HTML

from urllib.request import urlopen  # Py3

### Content
This notebook contains the following sections:
1. TextLibrary: utilities to work with text files
  * loading of a list of files (local or URLs)
  * encoding for training
  * formatted output with quote-highlighting
2. Transform text data to tf.data


...


x. Definition of the tensorflow model
x. Model and training parameters
x. The actual training on the data (requires 1. - 3.)
  * Training can be restarted, since the model is saved periodically.
x. Generation of text from the trained model (requires 1. - 4.)
x. In dialog with the model (requires 1. - 4.)

## 0. Check system

### Tensorflow api version check

Temporary note: this notebook is currently tested against the master build of TensorFlow, which still carries the version tag 1.13.x at the time of writing. The version check below is therefore preliminary.

In [4]:
try:
    if 'api.v2' in tf.version.__name__:
        print("Tensorflow api v2 active.")
    else:
        print("Tensorflow api v2 not found. This will not work.")
except Exception:
    print("Failed to check for Tensorflow api v2. This will not work.")

Tensorflow api v2 active.
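
Once release builds report a 2.x version string, a simpler check could compare the major version number instead of inspecting the module name. The sketch below uses a hypothetical helper, `major_version`, and runs on plain example strings so that it works without TensorFlow installed; in the notebook one would pass `tf.__version__`. It would not have helped with the master build mentioned above, which still reported 1.13.x.

```python
def major_version(version_string):
    """Return the leading integer of a dotted version string."""
    return int(version_string.split(".")[0])


# Illustrative version strings; with TensorFlow imported one would pass tf.__version__.
for v in ["1.13.1", "2.0.0-dev20190517"]:
    status = "v2 API expected" if major_version(v) >= 2 else "v1 API expected"
    print(v, "->", status)
```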


### GPU/TPU check

In [5]:
from tensorflow.python.client import device_lib

use_tpu = False
use_gpu = False

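try:
    TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    tf.config.experimental_connect_to_host(TPU_ADDRESS)
    use_tpu = True  # set only after the connection succeeded
    print("TPU available at {}".format(TPU_ADDRESS))
except Exception:  # COLAB_TPU_ADDR is not set, or the connection failed
    print("No TPU available")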

for hw in ["CPU", "GPU", "TPU"]:
    hwlist=tf.config.experimental.list_logical_devices(hw)
    print("{} -> {}".format(hw,hwlist))


if use_tpu is False:
    def get_available_devs_of_type(dev_type):  # avoid shadowing the builtin `type`
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if dev_type in x.name]

    def get_dev_desc():
        local_device_protos = device_lib.list_local_devices()
        return [(x.name, x.physical_device_desc) for x in local_device_protos]

    def get_available_gpus():
        return get_available_devs_of_type('GPU')

    dl = get_available_gpus()
    if len(dl)==0:
        print("WARNING: You have neither TPU nor GPU, this is going to be very slow!")
        print("         Hint: If using Google Colab, set runtime type to TPU.")
        print(get_available_devs_of_type(''))
    else:
        use_gpu = True
        print(f"GPUs: {dl}")
        print(get_dev_desc())


No TPU available
CPU -> [LogicalDevice(name='/job:worker/replica:0/task:0/device:CPU:0', device_type='CPU'), LogicalDevice(name='/job:worker/replica:0/task:1/device:CPU:0', device_type='CPU')]
GPU -> []
TPU -> [LogicalDevice(name='/job:worker/replica:0/task:1/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:1/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:1/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:1/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:1/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:1/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:1/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:1/device:TPU:7', device_type='TPU')]


##  1. Text library

In [0]:
# TextLibrary class: text library for training, encoding, batch generation,
# and formatted source display


class TextLibrary:
    def __init__(self, descriptors, max=100000000):
        self.descriptors = descriptors
        self.data = ''
        self.files = []
        self.c2i = {}
        self.i2c = {}
        index = 1
        for descriptor in descriptors:
            fd = {}
            if descriptor[:4] == 'http':
                try:
                    dat = urlopen(descriptor).read().decode('utf-8')
                    if dat[0]=='\ufeff':  # Ignore BOM
                        dat=dat[1:]
                    self.data += dat
                    fd["name"] = descriptor
                    fd["data"] = dat
                    fd["index"] = index
                    index += 1
                    self.files.append(fd)
                except Exception as e:
                    print(f"Can't download {descriptor}: {e}")
            else:
                fd["name"] = os.path.splitext(os.path.basename(descriptor))[0]
                try:
                    # the context manager closes the file even if read() fails
                    with open(descriptor) as f:
                        dat = f.read(max)
                    self.data += dat
                    fd["data"] = dat
                    fd["index"] = index
                    index += 1
                    self.files.append(fd)
                except Exception as e:
                    print(f"ERROR: Cannot read: {descriptor}: {e}")
        ind = 0
        for c in self.data:  # sets are not deterministic
            if c not in self.c2i:
                self.c2i[c] = ind
                self.i2c[ind] = c
                ind += 1
        self.ptr = 0

    def display_colored_html(self, textlist, pre='', post=''):
        bgcolors = ['#d4e6f1', '#d8daef', '#ebdef0', '#eadbd8', '#e2d7d5', '#edebd0',
                    '#ecf3cf', '#d4efdf', '#d0ece7', '#d6eaf8', '#d4e6f1', '#d6dbdf',
                    '#f6ddcc', '#fae5d3', '#fdebd0', '#e5e8e8', '#eaeded', '#A9CCE3']
        out = ''
        for txt, ind in textlist:
            txt = txt.replace('\n', '<br>')
            if ind == 0:
                out += txt
            else:
                # cycle through all available colors
                out += "<span style=\"background-color:"+bgcolors[ind % len(bgcolors)]+";\">" + \
                       txt + "</span>"+"<sup>[" + str(ind) + "]</sup>"
        display(HTML(pre+out+post))

    def source_highlight(self, txt, minQuoteSize=10):
        tx = txt
        out = []
        qts = []
        txsrc = [("Sources: ", 0)]
        sc = False
        noquote = ''
        while len(tx) > 0:  # search all library files for quote 'txt'
            mxQ = 0
            mxI = 0
            mxN = ''
            found = False
            for f in self.files:  # find longest quote in all texts
                p = minQuoteSize
                if p <= len(tx) and tx[:p] in f["data"]:
                    p = minQuoteSize + 1
                    while p <= len(tx) and tx[:p] in f["data"]:
                        p += 1
                    if p-1 > mxQ:
                        mxQ = p-1
                        mxI = f["index"]
                        mxN = f["name"]
                        found = True
            if found:  # save longest quote for colorizing
                if len(noquote) > 0:
                    out.append((noquote, 0))
                    noquote = ''
                out.append((tx[:mxQ], mxI))
                tx = tx[mxQ:]
                if mxI not in qts:  # create a new reference, if first occurrence
                    qts.append(mxI)
                    if sc:
                        txsrc.append((", ", 0))
                    sc = True
                    txsrc.append((mxN, mxI))
            else:
                noquote += tx[0]
                tx = tx[1:]
        if len(noquote) > 0:
            out.append((noquote, 0))
            noquote = ''
        self.display_colored_html(out)
        if len(qts) > 0:  # print references, if there is at least one source
            self.display_colored_html(txsrc, pre="<small><p style=\"text-align:right;\">",
                                     post="</p></small>")

    def get_slice(self, length):
        if (self.ptr + length >= len(self.data)):
            self.ptr = 0
        if self.ptr == 0:
            rst = True
        else:
            rst = False
        sl = self.data[self.ptr:self.ptr+length]
        self.ptr += length
        return sl, rst

    def decode(self, ar):
        return ''.join([self.i2c[ic] for ic in ar])

    def get_random_slice(self, length):
        p = random.randrange(0, len(self.data)-length)
        sl = self.data[p:p+length]
        return sl

    def get_slice_array(self, length):
        ar = np.array([c for c in self.get_slice(length)[0]])
        return ar

    def get_encoded_slice(self, length):
        s, rst = self.get_slice(length)
        X = [self.c2i[c] for c in s]
        return X
        
    def get_encoded_slice_array(self, length):
        return np.array(self.get_encoded_slice(length))

    def get_sample(self, length):
        s, rst = self.get_slice(length+1)
        X = [self.c2i[c] for c in s[:-1]]
        y = [self.c2i[c] for c in s[1:]]
        return (X, y, rst)

    def get_random_sample(self, length):
        s = self.get_random_slice(length+1)
        X = [self.c2i[c] for c in s[:-1]]
        y = [self.c2i[c] for c in s[1:]]
        return (X, y)

    def get_sample_batch(self, batch_size, length):
        smpX = []
        smpy = []
        for i in range(batch_size):
            Xi, yi, rst = self.get_sample(length)
            smpX.append(Xi)
            smpy.append(yi)
        return smpX, smpy, rst

    def get_random_sample_batch(self, batch_size, length):
        smpX = []
        smpy = []
        for i in range(batch_size):
            Xi, yi = self.get_random_sample(length)
            smpX.append(Xi)
            smpy.append(yi)
        return smpX, smpy
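
The character encoding built in `TextLibrary.__init__` can be illustrated in isolation. The following is a minimal sketch with a hypothetical helper, `build_vocab`, and a made-up example string: each distinct character receives the next free index in order of first appearance, which keeps the mapping deterministic across runs.

```python
def build_vocab(text):
    """Assign each distinct character an index in order of first appearance."""
    c2i, i2c = {}, {}
    for c in text:
        if c not in c2i:
            idx = len(c2i)
            c2i[c] = idx
            i2c[idx] = c
    return c2i, i2c


text = "to be or not to be"
c2i, i2c = build_vocab(text)
encoded = [c2i[c] for c in text]               # characters -> integers
decoded = "".join(i2c[i] for i in encoded)     # integers -> characters
print(encoded)
print(decoded)
```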


### Read text data

In [0]:
libdesc = {
    "name": "TinyShakespeare",
    "description": "Small Shakespeare 'standard' corpus",
    "lib": [
        # 'data/tiny-shakespeare.txt',
        # since Project Gutenberg blocks the entire country of Germany, we use a mirror:
        'http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/1/0/100/100-0.txt',
    ]
}

textlib = TextLibrary(libdesc["lib"])


In [0]:
# if use_tpu is True:
#     resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
#     tf.tpu.experimental.initialize_tpu_system(resolver)
#     tpu_strategy = tf.distribute.experimental.TPUStrategy(resolver)

## 2. Use tf.data for texts

In [0]:
data = textlib.get_encoded_slice_array(len(textlib.data))
textlib_dataset = tf.data.Dataset.from_tensor_slices(data)

In [12]:
# Quick test
n=np.array([])
for i in textlib_dataset.take(90):
    n=np.append(n,i.numpy())
print(n)    
print(textlib.decode(n))

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11.  8.  6. 12. 13.  6.  3.
 14. 15. 16.  9. 17. 18.  6.  9. 19.  4. 20. 21. 22.  6.  8.  6.  9. 23.
  4.  3. 24. 16.  9.  4. 25.  9. 23. 26. 22. 22. 26. 27. 20.  9. 28. 18.
 27. 24.  6. 16. 21.  6. 27.  3.  6. 29.  9. 13. 30.  9. 23. 26. 22. 22.
 26. 27. 20.  0.  1. 28. 18. 27. 24.  6. 16. 21.  6. 27.  3.  6.  0.  1.]

Project Gutenberg’s The Complete Works of William Shakespeare, by William
Shakespeare



In [0]:
SEQUENCE_LEN = 80
if use_tpu is True:
    BATCH_SIZE=1024
else:
    BATCH_SIZE = 256
LSTM_UNITS = 1024
EMBEDDING_DIM = 512

In [0]:
sample_size=len(data)//SEQUENCE_LEN

In [0]:
sequences=textlib_dataset.batch(SEQUENCE_LEN+1,drop_remainder=True)
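
What `batch(SEQUENCE_LEN+1, drop_remainder=True)` does to the character stream can be sketched in plain Python. The helper `chunk` is hypothetical and only mimics the grouping behaviour on a small list: consecutive elements are grouped into chunks of a fixed size, and a shorter final chunk is discarded.

```python
def chunk(items, size):
    """Split items into consecutive chunks of exactly `size` elements,
    discarding a shorter final chunk."""
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]


encoded = list(range(10))        # stand-in for the encoded text
sequences = chunk(encoded, 4)    # 4 plays the role of SEQUENCE_LEN + 1
print(sequences)                 # elements 8 and 9 are dropped
```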

In [16]:
# Quick test
for arr in sequences.take(3):
    n=arr.numpy()
    print(arr)
    print(">"+textlib.decode(n))

tf.Tensor(
[ 0  1  2  3  4  5  6  7  8  9 10 11  8  6 12 13  6  3 14 15 16  9 17 18
  6  9 19  4 20 21 22  6  8  6  9 23  4  3 24 16  9  4 25  9 23 26 22 22
 26 27 20  9 28 18 27 24  6 16 21  6 27  3  6 29  9 13 30  9 23 26 22 22
 26 27 20  0  1 28 18 27 24], shape=(81,), dtype=int64)
>
Project Gutenberg’s The Complete Works of William Shakespeare, by William
Shak
tf.Tensor(
[ 6 16 21  6 27  3  6  0  1  0  1 17 18 26 16  9  6 31  4  4 24  9 26 16
  9 25  4  3  9  8 18  6  9 11 16  6  9  4 25  9 27 12 30  4 12  6  9 27
 12 30 32 18  6  3  6  9 26 12  9  8 18  6  9 33 12 26  8  6 34  9 28  8
 27  8  6 16  9 27 12 34  0], shape=(81,), dtype=int64)
>espeare

This eBook is for the use of anyone anywhere in the United States and
tf.Tensor(
[ 1 20  4 16  8  9  4  8 18  6  3  9 21 27  3  8 16  9  4 25  9  8 18  6
  9 32  4  3 22 34  9 27  8  9 12  4  9  7  4 16  8  9 27 12 34  9 32 26
  8 18  9 27 22 20  4 16  8  9 12  4  9  3  6 16  8  3 26  7  8 26  4 12
 16  0  1 32 18 27  8 16  4], sh

In [0]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [18]:
# Quick test
for input_text, output_text in dataset.take(2):
    print("I:"+textlib.decode(input_text.numpy()))
    print("O:"+textlib.decode(output_text.numpy()))

I:
Project Gutenberg’s The Complete Works of William Shakespeare, by William
Sha
O:
Project Gutenberg’s The Complete Works of William Shakespeare, by William
Shak
I:espeare

This eBook is for the use of anyone anywhere in the United States and
O:speare

This eBook is for the use of anyone anywhere in the United States and
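
The same input/target split can be shown on a plain string. In this sketch (the name `shift_by_one` and the example word are illustrative assumptions), each input character is paired with the character that follows it, which is exactly what the model is trained to predict.

```python
def shift_by_one(chunk):
    """Return (input, target), where target is input shifted left by one."""
    return chunk[:-1], chunk[1:]


inp, tgt = shift_by_one("Shakespeare")
for x, y in zip(inp, tgt):
    print(repr(x), "->", repr(y))
```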


In [19]:
# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 100000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((1024, 80), (1024, 80)), types: (tf.int64, tf.int64)>
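
The buffer-based shuffling described in the comment above can be modelled in a few lines of plain Python. This is a simplified sketch of the idea, not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical name. Only `buffer_size` elements are held in memory at any time, so an element can move only a limited distance from its original position.

```python
import random


def buffered_shuffle(stream, buffer_size, rng):
    """Yield the elements of `stream` in a locally shuffled order."""
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill phase
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]            # emit a random buffered element
        buffer[idx] = item           # and replace it with the new one
    rng.shuffle(buffer)              # drain what is left at the end
    yield from buffer


rng = random.Random(0)
shuffled = list(buffered_shuffle(range(20), buffer_size=5, rng=rng))
print(shuffled)
```

With a buffer at least as large as the dataset, the result is a full shuffle; with a small buffer, the order stays close to the original.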

In [0]:
def build_model(vocab_size, embedding_dim, lstm_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(lstm_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.LSTM(lstm_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.LSTM(lstm_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [48]:
dev_strings=[]
for log_dev in tf.config.experimental.list_logical_devices('TPU'):
    dev_strings.append(log_dev.name)
print(dev_strings)

# for i in range(8):
#     dev_strings.append('/TPU:{}'.format(i))
# print(dev_strings)
    

['/job:worker/replica:0/task:1/device:TPU:0', '/job:worker/replica:0/task:1/device:TPU:1', '/job:worker/replica:0/task:1/device:TPU:2', '/job:worker/replica:0/task:1/device:TPU:3', '/job:worker/replica:0/task:1/device:TPU:4', '/job:worker/replica:0/task:1/device:TPU:5', '/job:worker/replica:0/task:1/device:TPU:6', '/job:worker/replica:0/task:1/device:TPU:7']


In [21]:
if use_tpu is True:
    # tpus=tf.config.experimental.list_logical_devices('TPU')
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=TPU_ADDRESS)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    tpu_strategy = tf.distribute.experimental.TPUStrategy(resolver)    
    # mirrored_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(dev_strings)
    # with mirrored_strategy.scope():
    with tpu_strategy.scope():
        model = build_model(
          vocab_size = len(textlib.i2c),
          embedding_dim=EMBEDDING_DIM,
          lstm_units=LSTM_UNITS,
          batch_size=BATCH_SIZE)
else:
    model = build_model(
      vocab_size = len(textlib.i2c),
      embedding_dim=EMBEDDING_DIM,
      lstm_units=LSTM_UNITS,
      batch_size=BATCH_SIZE)

NotFoundError: ignored

In [0]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(256, 80, 107) # (batch_size, sequence_length, vocab_size)


In [0]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (256, None, 512)          54784     
_________________________________________________________________
lstm_3 (LSTM)                (256, None, 1024)         6295552   
_________________________________________________________________
lstm_4 (LSTM)                (256, None, 1024)         8392704   
_________________________________________________________________
lstm_5 (LSTM)                (256, None, 1024)         8392704   
_________________________________________________________________
dense_1 (Dense)              (256, None, 107)          109675    
Total params: 23,245,419
Trainable params: 23,245,419
Non-trainable params: 0
_________________________________________________________________
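
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias vector) and uses the values from this run: a vocabulary of 107 characters, embedding dimension 512 and 1024 LSTM units. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, output_dim):
    return input_dim * output_dim + output_dim


vocab_size, embedding_dim, lstm_units = 107, 512, 1024
counts = [
    embedding_params(vocab_size, embedding_dim),
    lstm_params(embedding_dim, lstm_units),   # first LSTM reads the embedding
    lstm_params(lstm_units, lstm_units),      # second LSTM reads the first
    lstm_params(lstm_units, lstm_units),      # third LSTM reads the second
    dense_params(lstm_units, vocab_size),
]
total = sum(counts)
print(counts, total)
```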


In [0]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

In [0]:
sampled_indices

array([ 43,  11,  24,  29,  96,  32,  98,  71,  51,   8, 105,  87, 102,
        25,  25,  33,  97,  44,  95,  73,  95,   5,  73,  41,  96,   3,
        96,  13,  49,   3,  61,   5,  75,  77,  44,  12,  48,  62, 100,
        56,  85,  93,  64,   2,   7,  82,  97,   1,  36,  20,  21,  71,
        35, 105,  93,  27,  99,  61,  29,  15,   9,  65,  22,  64,  71,
        12,  52,   2, 103,  61,  55,  22,  85,  56,  65,  27,  82, 104,
        71,  29])

In [0]:
textlib.decode(sampled_indices)

'Ouk,çwê‘)t@|\\ffUîNâ?âj?:çrçb1rMj5!Nn03\t9"é8Pc”î\n.mp‘v@éa`M,’ Hl8‘nAP/MJl"9Ha”%‘,'

In [0]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (256, 80, 107)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.672822
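
A loss of about 4.67 is what one would expect from an untrained model: if the predictions are close to uniform over the 107 characters of the vocabulary, the cross-entropy is close to ln(107) ≈ 4.673. The sketch below checks this with a plain-Python cross-entropy; `cross_entropy_from_logits` is a hypothetical helper, not the Keras function used above.

```python
import math


def cross_entropy_from_logits(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


vocab_size = 107
uniform_logits = [0.0] * vocab_size   # every character equally likely
loss_uniform = cross_entropy_from_logits(uniform_logits, label=3)
print(loss_uniform, math.log(vocab_size))
```

A loss well below this value after training indicates that the model has learned something about the character statistics.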


In [0]:
model.compile(optimizer='adam', loss=loss)

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
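
The `{epoch}` placeholder in `checkpoint_prefix` is a format field that Keras fills in when it saves. The following sketch reproduces that substitution with `str.format`, assuming 1-based epoch numbers, to show which file prefixes a three-epoch run would be expected to produce.

```python
import os

checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Keras substitutes the epoch number into the template when it saves;
# str.format reproduces that substitution here for illustration.
names = [checkpoint_prefix.format(epoch=e) for e in range(1, 4)]
print(names)
```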

In [0]:
EPOCHS=3

In [0]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/3

In [0]:
# Generate

In [0]:
tf.train.latest_checkpoint(checkpoint_dir)

In [0]:
model = build_model(vocab_size = len(textlib.i2c),
  embedding_dim=EMBEDDING_DIM,
  lstm_units=LSTM_UNITS,
  batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [0]:
model.summary()

In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [textlib.c2i[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []
  ids=[]

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = .40

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to sample the next character from the model output
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
      ids.append(predicted_id)

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(textlib.i2c[predicted_id])

  return (start_string + ''.join(text_generated), ids)
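
The effect of the temperature can be seen without a trained model. In this sketch (plain Python, with made-up logits and hypothetical helper names), the logits are divided by the temperature before the softmax: a low temperature concentrates probability on the most likely character, a high temperature flattens the distribution.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]   # made-up scores for three characters
for t in (0.4, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)
print(sample_index(logits, 0.4, rng))
```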

In [0]:
tx,id=generate_text(model, start_string="ROMEO: ")

In [0]:
def detectPlagiarism(tx, textlibrary, minQuoteLength=10):
    textlibrary.source_highlight(tx, minQuoteLength)

In [0]:
textlib.decode(id)

NameError: name 'textlib' is not defined

In [0]:
detectPlagiarism(tx, textlib)
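
The core of the quote detection in `source_highlight` is a search for the longest prefix of the generated text that occurs verbatim in one of the source texts. The following is a minimal sketch of that single step; `longest_quote` and the example texts are illustrative assumptions, and the real method additionally walks through the whole text and renders the result as HTML.

```python
def longest_quote(text, sources, min_quote_size=10):
    """Return (length, source_name) of the longest prefix of `text` found
    verbatim in any source, or (0, None) if no prefix of at least
    min_quote_size characters matches."""
    best_len, best_name = 0, None
    for name, data in sources.items():
        p = min_quote_size
        if p > len(text) or text[:p] not in data:
            continue
        # extend the match one character at a time
        while p < len(text) and text[:p + 1] in data:
            p += 1
        if p > best_len:
            best_len, best_name = p, name
    return best_len, best_name


sources = {
    "hamlet": "To be, or not to be, that is the question",
    "other": "To be, or not at all",
}
generated = "To be, or not to be, said the model"
print(longest_quote(generated, sources))
```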

**Below this point: not yet ported**

## 2. Definition of the Tensorflow model

In [0]:
# The tensorflow model for text generation
class TensorPoetModel:
    def __init__(self, params):
        self.model_name = params["model_name"]
        self.vocab_size = params["vocab_size"]
        self.neurons = params["neurons"]
        self.layers = params["layers"]
        self.learning_rate = params["learning_rate"]
        self.steps = params["steps"]
        self.logdir = params["logdir"]
        self.checkpoint = params["checkpoint"]
        # self.clip = -1.0 * params["clip"]

        tf.reset_default_graph()

        # Training & Generating:
        self.X = tf.placeholder(tf.int32, shape=[None, self.steps])
        self.y = tf.placeholder(tf.int32, shape=[None, self.steps])

        onehot_X = tf.one_hot(self.X, self.vocab_size)
        onehot_y = tf.one_hot(self.y, self.vocab_size)

        stacked_cell = tf.contrib.rnn.MultiRNNCell([tf.nn.rnn_cell.LSTMCell(
            self.neurons, name='basic_lstm_cell') for _ in range(self.layers)])

        batch_size = tf.shape(self.X)[0]

        self.init_state_0 = stacked_cell.zero_state(batch_size, tf.float32)
        self.init_state = self.init_state_0

        with tf.variable_scope('rnn') as scope:
            rnn_outputs, states = tf.nn.dynamic_rnn(stacked_cell, onehot_X,
                                                    initial_state=self.init_state,
                                                    dtype=tf.float32)
            self.init_state = states

        self.final_state = self.init_state
        stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, self.neurons])

        softmax_w = tf.Variable(tf.random_normal(
            [self.neurons, self.vocab_size]), dtype=tf.float32, name='sm_w')
        # bias vector with one entry per vocabulary item, initialized to zero
        softmax_b = tf.Variable(tf.zeros(
            [self.vocab_size]), dtype=tf.float32, name='sm_b')

        logits_raw = tf.matmul(stacked_rnn_outputs, softmax_w) + softmax_b
        logits = tf.reshape(logits_raw, [-1, self.steps, self.vocab_size])

        output_softmax = tf.nn.softmax(logits)

        self.temperature = tf.placeholder(tf.float32)
        self.output_softmax_temp = tf.nn.softmax(
            tf.div(logits, self.temperature))

        softmax_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(
            labels=onehot_y, logits=logits)

        self.cross_entropy = tf.reduce_mean(softmax_entropy)
        optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)

        self.training_op = optimizer.minimize(self.cross_entropy)

        # Clipping isn't necessary, even for really deep networks:
        # grads = optimizer.compute_gradients(self.cross_entropy)
        # minclip = -1.0 * self.clip
        # capped_grads = [(tf.clip_by_value(grad, minclip, self.clip), var) 
        #     for grad, var in grads]
        # self.training_op = optimizer.apply_gradients(capped_grads)

        self.prediction = tf.cast(tf.argmax(output_softmax, -1), tf.int32)
        correct_prediction = tf.equal(self.y, self.prediction)
        self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        error = 1.0 - self.accuracy

        # Tensorboard
        tf.summary.scalar("cross-entropy", self.cross_entropy)
        tf.summary.scalar("error", error)
        self.summary_merged = tf.summary.merge_all()

        # Init
        self.init = tf.global_variables_initializer()


## 3. Parameters for model and training

The library description `libdesc` contains a list, `lib`, whose entries are either local filenames of text files or `http`/`https` URLs pointing to text files.
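
A library description can also be kept in a JSON file, as the optional `bk/lib-phil-deen.json` lookup below suggests. The following sketch writes and reads such a file; all names and paths in it are hypothetical, and only the `lib` key is actually required by the code in this notebook. It also shows how `TextLibrary` tells the two kinds of entries apart.

```python
import json
import os
import tempfile

libdesc = {
    "name": "Example",                      # hypothetical
    "description": "Illustrative library",  # hypothetical
    "lib": [
        "data/example.txt",                 # hypothetical local file
        "https://example.org/example.txt",  # hypothetical URL
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lib-example.json")
    with open(path, "w") as f:
        json.dump(libdesc, f)
    with open(path) as f:
        loaded = json.load(f)

# TextLibrary treats entries starting with 'http' as URLs, all others as local files.
kinds = ["url" if d[:4] == "http" else "file" for d in loaded["lib"]]
print(list(zip(loaded["lib"], kinds)))
```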

In [0]:

# Model parameters:
modelParamsShakespeare = {
    "model_name": "shakespeare",
    "logdir": "tensorlog/shakespeare",
    "checkpoint": "shakespeare.ckpt",
    "vocab_size": len(textlib.i2c),
    "neurons": 512,
    "layers": 4,
    "learning_rate": 4.e-4,
    "steps": 128,
}

# Look for optional json description of a library:
if os.path.exists('bk/lib-phil-deen.json'):
    with open('bk/lib-phil-deen.json') as data_file:    
        libdescphil = json.load(data_file)
        textlib = TextLibrary(libdescphil["lib"])
        modelParamsPhil = {
            "model_name": "phil",
            "logdir": "tensorlog/phil",
            "checkpoint": "phil.ckpt",
            "vocab_size": len(textlib.i2c),
            "neurons": 256,
            "layers": 8,
            "learning_rate": 1.e-3,
            "steps": 128,
        }
        model = TensorPoetModel(modelParamsPhil)
else:
    model = TensorPoetModel(modelParamsShakespeare)

In [0]:
# Training parameters:

trainParams = {
    "max_iter": 1000000,
    "restoreCheckpoints": False,
    "generateDuringTraining": True,
    "generated_text_size": 200,
    "verbose": True,
    "statusEveryNIter": 500,
    "saveEveryNIter": 500,
    "batch_size": 128,
}

## 4. The actual training

In [0]:
# Run training:
with tf.Session() as sess:
    batch_size = trainParams["batch_size"]
    epl = len(textlib.data) / (batch_size * model.steps)
    model.init.run()
    tflogdir = model.logdir
    tflogdir = os.path.realpath(tflogdir)
    if not os.path.exists(tflogdir):
        os.makedirs(tflogdir)
    print("Tensorboard: 'tensorboard --logdir {}'".format(tflogdir))
    train_writer = tf.summary.FileWriter(tflogdir, sess.graph)
    train_writer.add_graph(sess.graph)
    # vl=tf.trainable_variables()
    # print(vl)
    saver = tf.train.Saver()
    checkpoint_file = os.path.join(tflogdir, model.checkpoint)
    # FFR: tf.train.export_meta_graph(filename=None, meta_info_def=None, graph_def=None,
    # saver_def=None, collection_list=None, as_text=False, graph=None, export_scope=None,
    # clear_devices=False, **kwargs)
    start_iter = 0
    if trainParams["restoreCheckpoints"]:
        lastSave = tf.train.latest_checkpoint(tflogdir,
                                              latest_filename=None)
        if lastSave is not None:
            pt = lastSave.rfind('-')
            if pt != -1:
                pt += 1
                start_iter = int(lastSave[pt:])
            print("Restoring checkpoint at {}: {}".format(start_iter, lastSave))
            saver.restore(sess, lastSave)
    av_batch_time=0.0
    for iteration in range(start_iter, trainParams["max_iter"]):
        # Train with batches from the text library:
        t1=time.time()
        X_batch, y_batch = textlib.get_random_sample_batch(
            batch_size, model.steps)
        i_state = sess.run([model.init_state_0], feed_dict={model.X: X_batch})
        i_state, _ = sess.run([model.final_state, model.training_op],
                              feed_dict={model.X: X_batch, model.y: y_batch,
                                         model.init_state: i_state})
        t2=time.time()
        if av_batch_time==0.0:
            av_batch_time=(t2-t1)*1000.0
        else:
            av_batch_time=(av_batch_time*5.0+(t2-t1)*1000.0)/6.0
        
        # Output training statistics every 200 iterations:
        if iteration % 200 == 0:
            ce, accuracy, prediction, summary = sess.run([model.cross_entropy,
                                                          model.accuracy, model.prediction,
                                                          model.summary_merged],
                                                         feed_dict={model.X: X_batch, model.y: y_batch})
            train_writer.add_summary(summary, iteration)
            ep = iteration / epl
            print("Epoch: {0:.2f}, iter: {1:d}, cross-entropy: {2:.3f}, accuracy: {3:.5f} time per batch: {4:.5f}ms".format(
                ep, iteration, ce, accuracy, av_batch_time))
            if trainParams["verbose"]:
                for ind in range(1):  # model.batch_size):
                    ys = textlib.decode(y_batch[ind]).replace('\n', ' | ')
                    yps = textlib.decode(prediction[ind]).replace('\n', ' | ')
                    print("   y:", ys)
                    print("  yp:", yps)

        # Every statusEveryNIter iterations: save a checkpoint and generate
        # sample texts at different temperatures:
        if (iteration+1) % trainParams["statusEveryNIter"] == 0:

            # Save training data
            # print("S>")
            saver.save(sess, checkpoint_file, global_step=iteration+1)
            # print("S<")

            if trainParams["generateDuringTraining"]:
                # Generate sample
                for t in range(4, 11, 3):
                    temp = float(t) / 10.0
                    xs = ' ' * model.steps
                    xso = ''
                    doini = True
                    for i in range(trainParams["generated_text_size"]):
                        X_new = np.transpose([[textlib.c2i[sj]] for sj in xs])
                        if doini:
                            doini = False
                            g_state = sess.run(
                                [model.init_state_0], feed_dict={model.X: X_new})

                        g_state, y_pred = sess.run([model.final_state, model.output_softmax_temp],
                                                   feed_dict={model.X: X_new, model.init_state: g_state,
                                                              model.temperature: temp})
                        inds = list(range(model.vocab_size))
                        ind = np.random.choice(inds, p=y_pred[0, -1].ravel())
                        nc = textlib.i2c[ind]
                        xso += nc
                        xs = xs[1:]+nc

                    print("----------------- temperature =",
                          temp, "----------------------")
                    # print(xso)
                    # 20: minimum quote size detected.
                    textlib.source_highlight(xso, 20)
                print("---------------------------------------")
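
The batch time reported in the training loop is a smoothed value: each new measurement enters with weight 1 against a weight of 5 for the previous average, so single slow batches do not dominate the display. A minimal sketch of that update rule, with a hypothetical helper name and made-up timings:

```python
def update_average(average, sample, weight=5.0):
    """Blend a new sample into the running average; the first sample
    initializes it."""
    if average == 0.0:
        return sample
    return (average * weight + sample) / (weight + 1.0)


batch_times_ms = [120.0, 100.0, 100.0, 400.0, 100.0]  # made-up timings
av = 0.0
for t in batch_times_ms:
    av = update_average(av, t)
    print(round(av, 2))
```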


## 5. Generation of text from the trained model

In [0]:
# Generate text from the most recent checkpoint saved during training.
def ghostWriter(textsize, temperature=1.0):
    xso = None
    with tf.Session() as sess:
        model.init.run()

        tflogdir = os.path.realpath(model.logdir)
        if not os.path.exists(tflogdir):
            print("You haven't trained a model, no data found at: {}".format(tflogdir))
            return None

        # Restore the model variables from the most recent checkpoint
        saver = tf.train.Saver()

        lastSave = tf.train.latest_checkpoint(tflogdir, latest_filename=None)
        if lastSave is not None:
            # Checkpoint names end in '-<iteration>'; report 0 if there is no such suffix
            start_iter = 0
            pt = lastSave.rfind('-')
            if pt != -1:
                start_iter = int(lastSave[pt+1:])
            print("Restoring checkpoint at {}: {}".format(start_iter, lastSave))
            saver.restore(sess, lastSave)
        else:
            print("No checkpoints have been saved at: {}".format(tflogdir))
            return None

        xs = ' ' * model.steps
        xso = ''
        doini = True
        for i in range(textsize):
            X_new = np.transpose([[textlib.c2i[sj]] for sj in xs])
            if doini:
                doini = False
                g_state = sess.run([model.init_state_0],
                                   feed_dict={model.X: X_new})
            g_state, y_pred = sess.run([model.final_state, model.output_softmax_temp],
                                       feed_dict={model.X: X_new, model.init_state: g_state,
                                                  model.temperature: temperature})
            inds = list(range(model.vocab_size))
            ind = np.random.choice(inds, p=y_pred[0, -1].ravel())
            nc = textlib.i2c[ind]
            xso += nc
            xs = xs[1:]+nc
    return xso


def detectPlagiarism(generatedtext, textlibrary, minQuoteLength=10):
    textlibrary.source_highlight(generatedtext, minQuoteLength)


In [0]:
tgen = ghostWriter(500)
if tgen is not None:  # ghostWriter returns None when no trained model is found
    detectPlagiarism(tgen, textlib)
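
`textlib.source_highlight` is defined outside this section. As a rough illustration of what detecting quotes of a minimum length can mean, the following sketch scans the generated text for passages that occur verbatim in a source string. The function `find_quotes` is hypothetical and makes no claim to match the notebook's implementation:

```python
def find_quotes(generated, source, min_quote_length=10):
    """Return (start, passage) pairs for passages of `generated` that occur
    verbatim in `source` and have at least `min_quote_length` characters."""
    quotes = []
    i = 0
    while i + min_quote_length <= len(generated):
        length = min_quote_length
        if generated[i:i + length] in source:
            # Extend the match greedily while it still occurs in the source
            while (i + length < len(generated)
                   and generated[i:i + length + 1] in source):
                length += 1
            quotes.append((i, generated[i:i + length]))
            i += length
        else:
            i += 1
    return quotes


# Toy data, not taken from the notebook's text library
source_text = "to be or not to be that is the question"
generated_text = "xx or not to be yy"
quotes = find_quotes(generated_text, source_text, min_quote_length=8)
```

This naive substring search is quadratic in the worst case, which is acceptable for a few hundred generated characters but would need an index for large libraries.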

## 6. A dialog with the trained model
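
`doDialog` below keeps generating until either `maxEndPrompts` end-marks have been produced or `maxAnswerSize` is reached, but it never stops before `minAnswerSize` characters. A minimal sketch of that stopping rule, with a hypothetical `next_char` callable standing in for the network's sampling step:

```python
import itertools


def generate_answer(next_char, end_prompt='.', max_end_prompts=4,
                    min_answer_size=64, max_answer_size=2048):
    """Collect characters from next_char() using the same stopping rule as doDialog."""
    answer = []
    num_end_prompts = 0
    while ((len(answer) < max_answer_size and num_end_prompts < max_end_prompts)
           or len(answer) < min_answer_size):
        ch = next_char()
        if ch == end_prompt:
            num_end_prompts += 1
        answer.append(ch)
    return ''.join(answer)


# Stand-in for the model: endlessly repeats the characters "ab."
chars = itertools.cycle("ab.")
text = generate_answer(lambda: next(chars), max_end_prompts=2,
                       min_answer_size=4, max_answer_size=100)
```

Note that the minimum size takes precedence: an answer that reaches its end-mark quota early is still extended until it has `min_answer_size` characters.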

In [0]:
# Hold a dialog with the recurrent neural network trained above:
# def genDialogAnswer(prompt, g_state=None, endPrompt='.', maxEndPrompts=2,
# maxAnswerSize=512, temperature=1.0):


def doDialog():
    # 0.1 (frozen character) - 1.3 (creative/chaotic character)
    temperature = 0.6
    endPrompt = '.'  # the endPrompt character is the end-mark in answers.
    # look for number of maxEndPrompts until answer is finished.
    maxEndPrompts = 4
    maxAnswerSize = 2048  # Maximum length of the answer
    minAnswerSize = 64  # Minimum length of the answer

    with tf.Session() as sess:
        print("Please enter some dialog.")
        print("The net will answer according to your input.")
        print("'bye' for end,")
        print("'reset' to reset the conversation context,")
        print("'temperature=<float>' [0.1(frozen)-1.3(creative)]")
        print("    to change character of the dialog.")
        print("    Current temperature={}.".format(temperature))
        print()
        xso = None
        bye = False
        model.init.run()

        tflogdir = os.path.realpath(model.logdir)
        if not os.path.exists(tflogdir):
            print("You haven't trained a model, no data found at: {}".format(tflogdir))
            return

        # Restore the model variables from the most recent checkpoint
        saver = tf.train.Saver()

        lastSave = tf.train.latest_checkpoint(tflogdir, latest_filename=None)
        if lastSave is not None:
            pt = lastSave.rfind('-')
            if pt != -1:
                pt += 1
                start_iter = int(lastSave[pt:])
            # print("Restoring checkpoint at {}: {}".format(start_iter, lastSave))
            saver.restore(sess, lastSave)
        else:
            print("No checkpoints have been saved at:{}".format(tflogdir))
            return

        # g_state = sess.run([model.init_state_0], feed_dict={model.batch_size: 1})
        doini = True

        bye = False
        while not bye:
            print("> ", end="")
            prompt = input()
            if prompt == 'bye':
                bye = True
                print("Good bye!")
                continue
            if prompt == 'reset':
                doini = True
                # g_state = sess.run([model.init_state_0], feed_dict={model.batch_size: 1})
                print("(conversation context marked for reset)")
                continue
            if prompt.startswith("temperature="):
                try:
                    t = float(prompt[len("temperature="):])
                except ValueError:
                    t = None  # the value is not a number
                if t is not None and 0.05 < t < 1.4:
                    temperature = t
                    print("(generator temperature now {})".format(t))
                    print()
                    continue
                print("Invalid temperature-value ignored! [0.1-1.3]")
                continue
            if not prompt and doini:
                # No input and no network state yet: nothing to answer from
                continue
            xs = ' ' * model.steps
            xso = ''
            for rep in range(1):
                for i in range(len(prompt)):
                    xs = xs[1:]+prompt[i]
                    X_new = np.transpose([[textlib.c2i[sj]] for sj in xs])
                    if doini:
                        doini = False
                        g_state = sess.run(
                            [model.init_state_0], feed_dict={model.X: X_new})
                    g_state, y_pred = sess.run([model.final_state, model.output_softmax_temp],
                                               feed_dict={model.X: X_new, model.init_state: g_state,
                                                          model.temperature: temperature})
            ans = 0
            numEndPrompts = 0
            while (ans < maxAnswerSize and numEndPrompts < maxEndPrompts) or ans < minAnswerSize:

                X_new = np.transpose([[textlib.c2i[sj]] for sj in xs])
                g_state, y_pred = sess.run([model.final_state, model.output_softmax_temp],
                                           feed_dict={model.X: X_new, model.init_state: g_state,
                                                      model.temperature: temperature})
                inds = list(range(model.vocab_size))
                ind = np.random.choice(inds, p=y_pred[0, -1].ravel())
                nc = textlib.i2c[ind]
                if nc == endPrompt:
                    numEndPrompts += 1
                xso += nc
                xs = xs[1:]+nc
                ans += 1
            print(xso.replace("\\n", "\n"))
            textlib.source_highlight(xso, 13)
    return

In [0]:
# Talk to the net!
doDialog()
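
The `temperature=<float>` command handling in `doDialog` can be isolated into a small helper, which makes the accepted range easy to test without a trained model. A sketch, where `parse_temperature_command` is a hypothetical name and the default bounds mirror the open interval (0.05, 1.4) checked in `doDialog`:

```python
def parse_temperature_command(prompt, low=0.05, high=1.4):
    """Return (is_command, value).

    is_command is True if the prompt starts with 'temperature='.
    value is the parsed float if it lies strictly between low and high,
    otherwise None.
    """
    prefix = "temperature="
    if not prompt.startswith(prefix):
        return False, None
    try:
        value = float(prompt[len(prefix):])
    except ValueError:
        return True, None  # a command, but not a number
    if low < value < high:
        return True, value
    return True, None  # a command, but out of range


ok = parse_temperature_command("temperature=0.8")
bad_number = parse_temperature_command("temperature=warm")
out_of_range = parse_temperature_command("temperature=2.0")
not_a_command = parse_temperature_command("hello there")
```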