<a href="https://colab.research.google.com/github/domschl/tensor-poet/blob/master/eager_poet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Eager Tensor Poet (Tensorflow 2.0)

### Only execute the next block if you want to test with a different TF runtime.

Remember to restart runtime after installing new software.

In [0]:
# %pip install -U tensorflow-gpu tensorflow-addons tensorflow-federated tensorboard # tf-nightly-gpu  # Currently not useful.

### Select TF version (in colab)

This should be the default starting point for working with the standard environment

In [1]:
## Import TensorFlow
## from __future__ import absolute_import, division, print_function, unicode_literals
try:
    ## %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    ## Non colab people need to make sure that tf 2 is installed.
    pass
import tensorflow as tf

TensorFlow 2.x selected.


## Preliminary

**THIS IS WORK IN PROGRESS**

A deep LSTM model for text generation, built with TensorFlow.

This code can use either CPU, **GPU** or **TPU** when running on Google Colab.

Select the corresponding runtime (menu: **`Runtime / Change runtime type`**)

In [0]:
%load_ext tensorboard

In [0]:
import numpy as np
import os
import sys
import json
import time
import datetime
import random
import tensorflow as tf

try:
    from urllib.request import urlopen  # Py3
except ImportError:
    print("This notebook requires Python 3.")
try:
    import pathlib
except ImportError:
    print("At least python 3.5 is needed.")

try: # Colab instance?
    from google.colab import drive
except ImportError: # Not? ignore.
    pass

from IPython.display import display, HTML

## 0. Check system

### Tensorflow api version check


In [4]:
try:
    if 'api.v2' in tf.version.__name__:
        print(f"Tensorflow api v2 active: {tf.__version__}")
    else:
        print("Tensorflow api v2 not found. This will not work.")
except:
    print("Failed to check for Tensorflow api v2. This will not work.")

Tensorflow api v2 active: 2.1.0


### GPU/TPU check

This notebook can run either on a local Jupyter server or in the Google cloud.
If a GPU/TPU is available, it will be used for training.

By default, snapshots of the trained net are stored locally for Jupyter instances, and on the user's Google Drive for Google Colab instances. The snapshots allow training or inference to be restarted at any time, e.g. after the Colab session was terminated.

Similarly, the text corpora that are used for training can be cached on Google Drive or locally.

In [0]:
# Define where snapshots of training data are stored:
colab_google_drive_snapshots=True

# Define if training data (the texts downloaded from internet) are cached:
colab_google_drive_data_cache=True  # In colab mode cache to google drive
local_jupyter_data_cache=True       # In local jupyter mode cache to local path

is_colab_notebook = 'google.colab' in sys.modules

In [6]:
from tensorflow.python.client import device_lib

use_tpu = False
use_gpu = False
use_eager = False  # Really no point.

try:
    TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    use_tpu = True
    tpu_is_init = False
    tf.config.experimental_connect_to_host(TPU_ADDRESS)
    print("TPU available at {}".format(TPU_ADDRESS))
except:
    print("No TPU available")

for hw in ["CPU", "GPU", "TPU"]:
    hwlist=tf.config.experimental.list_logical_devices(hw)
    print("{} -> {}".format(hw,hwlist))


if use_tpu is False:
    def get_available_devs_of_type(type):
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if type in x.name]

    def get_dev_desc():
        local_device_protos = device_lib.list_local_devices()
        return [(x.name, x.physical_device_desc) for x in local_device_protos]

    def get_available_gpus():
        return get_available_devs_of_type('GPU')

    dl = get_available_gpus()
    if len(dl)==0:
        print("WARNING: You have neither TPU nor GPU, this is going to be very slow!")
        if is_colab_notebook is True:
            print("         Hint: In Colab Runtime / Set runtime type, set runtime type to GPU or TPU.")
        print(get_available_devs_of_type(''))
    else:
        use_gpu = True
        print(f"GPUs: {dl}")
        print(get_dev_desc())
else:
    use_eager = False  # Eager mode cannot be used with TPUs.
    print("DISABLING eager execution because TPUs do not support dynamic execution.")

if use_eager is False:
    tf.compat.v1.disable_eager_execution()


TPU available at grpc://10.94.220.42:8470
CPU -> [LogicalDevice(name='/device:CPU:0', device_type='CPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:CPU:0', device_type='CPU')]
GPU -> []
TPU -> [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU')]
DISABLING eager execution because TPUs do not support dynamic execution.


In [0]:
if is_colab_notebook:
    if colab_google_drive_snapshots:
        mountpoint='/content/drive'
        root_path='/content/drive/My Drive'
        if not os.path.exists(root_path):
            drive.mount(mountpoint)
        if not os.path.exists(root_path):
            print("Something went wrong with Google Drive access. Cannot save snapshots to GD.")
            colab_google_drive_snapshots=False
    else:
        print("Since google drive snapshots are not active, training data will be lost as soon as the Colab session terminates!")
        print("Set `colab_google_drive_snapshots` to `True` to make training data persistent.")
else:
    root_path='.'

##  1. Text library

`TextLibrary` class: a text library for training, encoding, batch generation,
and formatted source display. It reads some books from Project Gutenberg
and supports the creation of training batches. The output functions support
highlighting, which makes it possible to compare generated texts with the actual sources
and helps to identify identical (memorized) parts.
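
As a minimal sketch of the encoding idea, assuming a toy string in place of the Gutenberg texts: characters are numbered in order of first appearance, and a training sample pairs each input character with the character that follows it. The helpers `build_vocab` and `make_sample` are illustrative names and not part of `TextLibrary`.

```python
def build_vocab(text):
    """Map each character to an index in order of first appearance."""
    c2i, i2c = {}, {}
    for c in text:
        if c not in c2i:
            idx = len(c2i)
            c2i[c] = idx
            i2c[idx] = c
    return c2i, i2c


def make_sample(text, start, length, c2i):
    """Return (X, y), where y is X shifted by one character."""
    s = text[start:start + length + 1]
    X = [c2i[c] for c in s[:-1]]
    y = [c2i[c] for c in s[1:]]
    return X, y


toy_text = "hello world"  # assumption: stand-in for the real corpus
c2i, i2c = build_vocab(toy_text)
X, y = make_sample(toy_text, 0, 4, c2i)
decoded_X = ''.join(i2c[i] for i in X)
decoded_y = ''.join(i2c[i] for i in y)
print(decoded_X, "->", decoded_y)
```

The target sequence is simply the input moved one position to the right, so the model learns to predict the next character at every step.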

In [0]:
use_dark_mode=False  # Set to false for white background

In [0]:
class TextLibrary:
    def __init__(self, descriptors, text_data_cache_directory=None, max=100000000):
        self.descriptors = descriptors
        self.data = ''
        self.cache_dir=text_data_cache_directory
        self.files = []
        self.c2i = {}
        self.i2c = {}
        self.total_size=0
        index = 1
        for descriptor, author, title in descriptors:
            fd = {}
            cache_name=self.get_cache_name(author, title)
            if cache_name is not None and os.path.exists(cache_name):
                is_cached=True
            else:
                is_cached=False
            valid=False
            if descriptor[:4] == 'http' and is_cached is False:
                try:
                    print(f"Downloading {cache_name}")
                    dat = urlopen(descriptor).read().decode('utf-8')
                    if dat[0]=='\ufeff':  # Ignore BOM
                        dat=dat[1:]
                    dat=dat.replace('\r', '')  # get rid of pesky CRs (carriage returns)
                    self.data += dat
                    self.total_size += len(dat)
                    fd["title"] = title
                    fd["author"] = author
                    fd["data"] = dat
                    fd["index"] = index
                    index += 1
                    valid=True
                    self.files.append(fd)
                except Exception as e:
                    print(f"Can't download {descriptor}: {e}")
            else:
                fd["title"] = title
                fd["author"] = author
                try:
                    if is_cached is True:
                        print(f"Reading {cache_name} from cache")
                        f = open(cache_name)
                    else:    
                        f = open(descriptor)
                    dat = f.read(max)
                    self.data += dat
                    self.total_size += len(dat)
                    fd["data"] = dat
                    fd["index"] = index
                    index += 1
                    self.files.append(fd)
                    f.close()
                    valid=True
                except Exception as e:
                    print(f"ERROR: Cannot read: {cache_name if is_cached else descriptor}: {e}")
            if valid is True and is_cached is False and self.cache_dir is not None:
                try:
                    print(f"Caching {cache_name}")
                    f = open(cache_name, 'w')
                    f.write(dat)
                    f.close()
                except Exception as e:
                    print(f"ERROR: failed to save cache {cache_name}: {e}")
                
        ind = 0
        for c in self.data:  # sets are not deterministic
            if c not in self.c2i:
                self.c2i[c] = ind
                self.i2c[ind] = c
                ind += 1
        self.ptr = 0

    def get_cache_name(self, author, title):
        if self.cache_dir is None:
            return None
        cname=f"{author} - {title}.txt"
        cache_filepath=os.path.join(self.cache_dir , cname)
        return cache_filepath
        
    def display_colored_html(self, textlist, dark_mode=False, pre='', post=''):
        bgcolorsWht = ['#d4e6e1', '#d8daef', '#ebdef0', '#eadbd8', '#e2d7d5', '#edebd0',
                    '#ecf3cf', '#d4efdf', '#d0ece7', '#d6eaf8', '#d4e6f1', '#d6dbdf',
                    '#f6ddcc', '#fae5d3', '#fdebd0', '#e5e8e8', '#eaeded', '#A9CCE3']
        bgcolorsDrk = ['#342621','#483a2f', '#3b4e20', '#2a3b48', '#324745', '#3d3b30',
                    '#3c235f', '#443f4f', '#403c37', '#463a28', '#443621', '#364b5f',
                    '#264d4c', '#2a3553', '#3d2b40', '#354838', '#3a3d4d', '#594C23']
        if dark_mode is False:
            bgcolors=bgcolorsWht
        else:
            bgcolors=bgcolorsDrk
        out = ''
        for txt, ind in textlist:
            txt = txt.replace('\n', '<br>')
            if ind == 0:
                out += txt
            else:
                out += "<span style=\"background-color:"+bgcolors[ind % len(bgcolors)]+";\">" + \
                       txt + "</span>"+"<sup>[" + str(ind) + "]</sup>"
        display(HTML(pre+out+post))

    def source_highlight(self, txt, minQuoteSize=10, dark_mode=False):
        tx = txt
        out = []
        qts = []
        txsrc = [("Sources: ", 0)]
        sc = False
        noquote = ''
        while len(tx) > 0:  # search all library files for quote 'txt'
            mxQ = 0
            mxI = 0
            mxN = ''
            found = False
            for f in self.files:  # find longest quote in all texts
                p = minQuoteSize
                if p <= len(tx) and tx[:p] in f["data"]:
                    p = minQuoteSize + 1
                    while p <= len(tx) and tx[:p] in f["data"]:
                        p += 1
                    if p-1 > mxQ:
                        mxQ = p-1
                        mxI = f["index"]
                        mxN = f"{f['author']}: {f['title']}"
                        found = True
            if found:  # save longest quote for colorizing
                if len(noquote) > 0:
                    out.append((noquote, 0))
                    noquote = ''
                out.append((tx[:mxQ], mxI))
                tx = tx[mxQ:]
                if mxI not in qts:  # create a new reference, if first occurrence
                    qts.append(mxI)
                    if sc:
                        txsrc.append((", ", 0))
                    sc = True
                    txsrc.append((mxN, mxI))
            else:
                noquote += tx[0]
                tx = tx[1:]
        if len(noquote) > 0:
            out.append((noquote, 0))
            noquote = ''
        self.display_colored_html(out, dark_mode=dark_mode)
        if len(qts) > 0:  # print references, if there is at least one source
            self.display_colored_html(txsrc, dark_mode=dark_mode, pre="<small><p style=\"text-align:right;\">",
                                     post="</p></small>")

    def get_slice(self, length):
        if (self.ptr + length >= len(self.data)):
            self.ptr = 0
        if self.ptr == 0:
            rst = True
        else:
            rst = False
        sl = self.data[self.ptr:self.ptr+length]
        self.ptr += length
        return sl, rst

    def decode(self, ar):
        return ''.join([self.i2c[ic] for ic in ar])

    def get_random_slice(self, length):
        p = random.randrange(0, len(self.data)-length)
        sl = self.data[p:p+length]
        return sl

    def get_slice_array(self, length):
        ar = np.array([c for c in self.get_slice(length)[0]])
        return ar

    def get_encoded_slice(self, length):
        s, rst = self.get_slice(length)
        X = [self.c2i[c] for c in s]
        return X
        
    def get_encoded_slice_array(self, length):
        return np.array(self.get_encoded_slice(length))

    def get_sample(self, length):
        s, rst = self.get_slice(length+1)
        X = [self.c2i[c] for c in s[:-1]]
        y = [self.c2i[c] for c in s[1:]]
        return (X, y, rst)

    def get_random_sample(self, length):
        s = self.get_random_slice(length+1)
        X = [self.c2i[c] for c in s[:-1]]
        y = [self.c2i[c] for c in s[1:]]
        return (X, y)

    def get_sample_batch(self, batch_size, length):
        smpX = []
        smpy = []
        for i in range(batch_size):
            Xi, yi, rst = self.get_sample(length)
            smpX.append(Xi)
            smpy.append(yi)
        return smpX, smpy, rst

    def get_random_sample_batch(self, batch_size, length):
        for i in range(batch_size):
            Xi, yi = self.get_random_sample(length)
            # smpX.append(Xi)
            # smpy.append(yi)
            if i==0:
                smpX=np.array(Xi, dtype=np.float32)
                smpy=np.array(yi, dtype=np.float32)
            else:
                smpX = np.vstack((smpX, np.array(Xi, dtype=np.float32)))
                smpy = np.vstack((smpy, np.array(yi, dtype=np.float32)))
                # smpy = np.append(smpy, np.array(yi, dtype=np.float32), axis=0)
        return np.array(smpX), np.array(smpy)
    
    def get_random_onehot_sample_batch(self, batch_size, length):
        X, y = self.get_random_sample_batch(batch_size, length)
        # xoh = one_hot(X,len(self.i2c))
        xoh = tf.keras.backend.one_hot(X, len(self.i2c))
        ykc = tf.keras.backend.constant(y)
        return xoh, ykc

### Data sources

Data sources can be files from the local filesystem, files on Google Drive (for Colab notebooks), or http(s) links.

The name given will be used as the directory name for both snapshots and model data caches.

Each entry in the `lib` array consists of:

1. a local filename or http(s) link,
2. an author's name,
3. a title
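
A small sketch of how such an entry could map to a cache file, assuming the local Jupyter layout (`<root>/<name>/Data/<author> - <title>.txt`) used elsewhere in this notebook. The helpers `cache_file_path` and `is_remote` are illustrative names only, and the URL is a placeholder.

```python
import os


def cache_file_path(root_path, lib_name, author, title):
    """Build the cache path following the notebook's local Jupyter layout."""
    data_cache_path = os.path.join(root_path, f"{lib_name}/Data")
    return os.path.join(data_cache_path, f"{author} - {title}.txt")


def is_remote(descriptor):
    """True for http(s) links, False for local file names."""
    return descriptor[:4] == 'http'


example_path = cache_file_path('.', 'Women-Writers', 'Jane Austen', 'Emma')
print(example_path)
print(is_remote('https://example.org/book.txt'), is_remote('data/tiny-shakespeare.txt'))
```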

In [0]:
libdesc = {
    "name": "Women-Writers",
    "description": "A collection of works of Woolf, Austen and Brontë",
    "lib": [
        # ('data/tiny-shakespeare.txt', 'William Shakespeare', 'Some parts'),   # local file example
        # ('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/1/0/100/100-0.txt', 'Shakespeare', 'Collected Works'),
        ('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/3/7/4/3/37431/37431.txt', 'Jane Austen', 'Pride and Prejudice'),
        ('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/7/6/768/768.txt', 'Emily Brontë', 'Wuthering Heights'),         
        ('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/1/4/144/144.txt', 'Virginia Wolf', 'Voyage out'),
        ('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/1/5/158/158.txt', 'Jane Austen', 'Emma')
    ]
}

In [0]:
if is_colab_notebook:
    if colab_google_drive_data_cache is True:
        data_cache_path=os.path.join(root_path,f"Colab Notebooks/{libdesc['name']}/Data")
    else:
        data_cache_path=None
else:
    if local_jupyter_data_cache is True:
        data_cache_path=os.path.join(root_path,f"{libdesc['name']}/Data")
    else:
        data_cache_path=None

if data_cache_path is not None:
    pathlib.Path(data_cache_path).mkdir(parents=True, exist_ok=True)
    if not os.path.exists(data_cache_path):
        print("ERROR, the cache directory does not exist. This will fail.")
    else:
        with open(os.path.join(data_cache_path,'libdesc.json'),'w') as f:
            json.dump(libdesc,f,indent=4)

In [12]:
textlib = TextLibrary(libdesc["lib"], text_data_cache_directory=data_cache_path)
print(f"Total size of texts: {textlib.total_size}")

Reading /content/drive/My Drive/Colab Notebooks/Women-Writers/Data/Jane Austen - Pride and Prejudice.txt from cache
Reading /content/drive/My Drive/Colab Notebooks/Women-Writers/Data/Emily Brontë - Wuthering Heights.txt from cache
Reading /content/drive/My Drive/Colab Notebooks/Women-Writers/Data/Virginia Wolf - Voyage out.txt from cache
Reading /content/drive/My Drive/Colab Notebooks/Women-Writers/Data/Jane Austen - Emma.txt from cache
Total size of texts: 2536902


## 2. Use tf.data for texts

In [13]:
SEQUENCE_LEN = 60
if use_tpu is True:
    BATCH_SIZE=256
    use_tpu_model_for_tpu=True
    STATEFUL=False
else:
    BATCH_SIZE = 256
    STATEFUL = True
LSTM_UNITS = 512
# EMBEDDING_DIM = 64 # 120
LSTM_LAYERS = 2
NUM_BATCHES=256  # int(textlib.total_size/BATCH_SIZE/SEQUENCE_LEN)
print(NUM_BATCHES)

256
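
The commented-out expression next to `NUM_BATCHES` estimates how many batches one pass over the corpus would take. A short sketch of that arithmetic, using the corpus size printed earlier (`batches_per_pass` is an illustrative helper, not part of the notebook):

```python
def batches_per_pass(total_chars, batch_size, sequence_len):
    """Number of full batches needed to see every character about once."""
    return int(total_chars / batch_size / sequence_len)


# Values taken from this notebook's settings and printed corpus size.
n = batches_per_pass(total_chars=2536902, batch_size=256, sequence_len=60)
print(n)
```

With the fixed setting of 256 randomly sampled batches, roughly 1.5 times as many characters are drawn as the corpus contains, although random sampling does not guarantee that every position is covered.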


In [0]:
dx=[]
dy=[]
for i in range(NUM_BATCHES):
    x,y=textlib.get_random_onehot_sample_batch(BATCH_SIZE,SEQUENCE_LEN)
    dx.append(x)
    dy.append(y)

In [0]:
data_xy=(dx,dy) # tf.keras.backend.constant(np.array([dx,dy]))


In [0]:
textlib_dataset=tf.data.Dataset.from_tensor_slices(data_xy)
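
Each element of this dataset pairs a one-hot input tensor of shape `(batch_size, sequence_length, vocab_size)` with integer targets of shape `(batch_size, sequence_length)`. The following NumPy sketch illustrates the shape change caused by one-hot encoding, with assumed toy sizes; `one_hot` here is a stand-in for `tf.keras.backend.one_hot`, not the same function.

```python
import numpy as np


def one_hot(indices, vocab_size):
    """One-hot encode an integer array: shape (...,) -> (..., vocab_size)."""
    indices = np.asarray(indices, dtype=np.int64)
    return np.eye(vocab_size, dtype=np.float32)[indices]


# Assumed toy sizes; the notebook uses batch 256, sequence 60, vocabulary 88.
batch = np.array([[0, 2, 1], [3, 3, 0]])
encoded = one_hot(batch, vocab_size=4)
print(encoded.shape)
```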

In [17]:
shuffle_buffer=10000
dataset=textlib_dataset.shuffle(shuffle_buffer)  
dataset.take(1)

<TakeDataset shapes: ((256, 60, 88), (256, 60)), types: (tf.float32, tf.float32)>

In [0]:
def build_model(vocab_size, steps, lstm_units, lstm_layers, batch_size, stateful=True):
    model = tf.keras.Sequential([
        # tf.keras.layers.Embedding(vocab_size, embedding_dim,
        #                          batch_input_shape=[batch_size, None]),
        # tf.keras.layers.Flatten(),
        tf.keras.layers.LSTM(lstm_units,
                            # input_shape=(timesteps, data_dim)
                            batch_input_shape=[batch_size, None, vocab_size],
                            return_sequences=True,
                            stateful=stateful,
                            recurrent_initializer='glorot_uniform'),
        # *[tf.keras.layers.LSTM(lstm_units,
        #                     return_sequences=True,
        #                     stateful=stateful,
        #                     recurrent_initializer='glorot_uniform') for _ in range(lstm_layers-1)],
        tf.keras.layers.Dense(vocab_size)
        ])
    return model

def build_tpu_model(vocab_size, steps, lstm_units, lstm_layers, batch_size, stateful=True):
    # print("NOT ADAPTED!")
    # with tf.device('/job:localhost/replica:0/task:0/device:CPU:0'):
    #     embedded = tf.keras.layers.Embedding(vocab_size, embedding_dim, embeddings_initializer='uniform', batch_input_shape=[batch_size, None, SEQUENCE_LEN])
    with tpu_strategy.scope():
        lstm = [tf.keras.layers.LSTM(lstm_units,
                        batch_input_shape=[batch_size, steps, vocab_size],
                        return_sequences=True,
                        stateful=stateful,
                        recurrent_initializer='glorot_uniform', unroll=True) for _ in range(lstm_layers)]
#     tf.keras.layers.LSTM(lstm_units,
#                          return_sequences=True,
#                          stateful=stateful,
#                          # recurrent_initializer='glorot_uniform',
#                         unroll=True)
    dense = tf.keras.layers.Dense(vocab_size)
    
    model = tf.keras.Sequential([
        # embedded,
        *lstm,
        dense
        ])
    return model

In [19]:
if use_tpu:
    print(TPU_ADDRESS)
    os.environ['COLAB_TPU_ADDR']

grpc://10.94.220.42:8470


In [20]:
if use_tpu is True and not tpu_is_init:
    cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=TPU_ADDRESS)
    tf.config.experimental_connect_to_cluster(cluster_resolver) # host(cluster_resolver.master())
    tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
    tpu_strategy = tf.distribute.experimental.TPUStrategy(cluster_resolver)    
    tpu_is_init=True


INFO:tensorflow:Initializing the TPU system: 10.94.220.42:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Querying Tensorflow master (grpc://10.94.220.42:8470) for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 9994354882635382411)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 18141982622184153783)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 8029659296646003379)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 7045435772308717383)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 13880561328085861497)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 11329163274595080859)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 4059380730734586090)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 17028037564267314151)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 1585810435010877243)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 7081745151101970637)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 228184476198022361)

In [21]:
if use_tpu is True:
    if use_tpu_model_for_tpu is True:
        print("tpu, simple model")
        # with tpu_strategy.scope():
        model = build_tpu_model(
          vocab_size = len(textlib.i2c),
          # embedding_dim=EMBEDDING_DIM,
          steps=SEQUENCE_LEN,
          lstm_units=LSTM_UNITS,
          lstm_layers=LSTM_LAYERS,
          batch_size=BATCH_SIZE,
          stateful=STATEFUL)
    else:
        print("tpu, default model")
        with tpu_strategy.scope():
            model = build_model(
              vocab_size = len(textlib.i2c),
              steps=SEQUENCE_LEN,
              # embedding_dim=EMBEDDING_DIM,
              lstm_units=LSTM_UNITS,
              lstm_layers=LSTM_LAYERS,
              batch_size=BATCH_SIZE,
              stateful=STATEFUL)        
else:
    print("non-tpu mode")
    model = build_model(
        vocab_size = len(textlib.i2c),
        # embedding_dim=EMBEDDING_DIM,
        steps=SEQUENCE_LEN,
        lstm_units=LSTM_UNITS,
        lstm_layers=LSTM_LAYERS,
        batch_size=BATCH_SIZE,
        stateful=STATEFUL)

tpu, simple model
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


### Some sanity checks of the (untrained!) model

In [22]:
dataset.take(1)

<TakeDataset shapes: ((256, 60, 88), (256, 60)), types: (tf.float32, tf.float32)>

In [0]:
if use_eager is True:  # no sanity for TPU, since eager not supported:
    for input_example_batch, target_example_batch in dataset.take(1):
        model.reset_states()
        example_batch_predictions = model.predict(input_example_batch, batch_size=256)
        print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

In [24]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (256, 60, 512)            1230848   
_________________________________________________________________
lstm_1 (LSTM)                (256, 60, 512)            2099200   
_________________________________________________________________
dense (Dense)                (256, 60, 88)             45144     
Total params: 3,375,192
Trainable params: 3,375,192
Non-trainable params: 0
_________________________________________________________________
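
The parameter counts in this summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; `lstm_params` and `dense_params` are illustrative helpers.

```python
def lstm_params(input_dim, units):
    """Weights of a standard LSTM layer: 4 gates with input, recurrent and bias terms."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


vocab_size, lstm_units = 88, 512  # values from the summary above
layer_counts = [
    lstm_params(vocab_size, lstm_units),   # first LSTM sees one-hot characters
    lstm_params(lstm_units, lstm_units),   # second LSTM sees the first layer's output
    dense_params(lstm_units, vocab_size),  # projection back to the vocabulary
]
total = sum(layer_counts)
print(layer_counts, total)
```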


In [25]:
dataset.take(1)

<TakeDataset shapes: ((256, 60, 88), (256, 60)), types: (tf.float32, tf.float32)>

In [0]:
if use_eager is True:
    sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
    sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
    print(sampled_indices)

In [0]:
if use_eager is True:
    print(textlib.decode(sampled_indices))

### Loss function, optimizer, tensorboard output

In [0]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

if use_eager is True:
    example_batch_loss  = loss(target_example_batch, example_batch_predictions)
    print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
    print("scalar_loss:      ", example_batch_loss.numpy().mean())
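
To make the loss more concrete, here is a NumPy sketch of a sparse categorical cross-entropy computed from logits. It is an illustration of the formula under toy assumptions, not the Keras implementation; `sparse_ce_from_logits` is a hypothetical helper.

```python
import numpy as np


def sparse_ce_from_logits(labels, logits):
    """Per-position cross-entropy between integer labels and unnormalized logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct class at each position.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)


# Toy example: uniform logits over 4 classes give a loss of log(4) everywhere.
toy_logits = np.zeros((2, 3, 4))
toy_labels = np.array([[0, 1, 2], [3, 0, 1]])
toy_loss = sparse_ce_from_logits(toy_labels, toy_logits)
print(toy_loss.shape, toy_loss.mean())
```

By the same reasoning, an untrained model whose predictions are close to uniform over 88 characters would be expected to start with a mean loss near ln(88), about 4.48.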

In [0]:
opti = tf.keras.optimizers.Adam(learning_rate=0.003, clipvalue=1.0)
# opti = tf.keras.optimizers.Adam(clipvalue=0.5)
# opti=tf.keras.optimizers.SGD(lr=0.003)

def scalar_loss(labels, logits):
    bl=loss(labels, logits)
    return tf.reduce_mean(bl)

model.compile(optimizer=opti, loss=loss, metrics=[scalar_loss])

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, update_freq='batch') # , histogram_freq=1) # update_freq='epoch', 

In [39]:
# !kill -9 1082 # doesnt work either..
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 1082), started 1:01:21 ago. (Use '!kill 1082' to kill it.)

<IPython.core.display.Javascript object>

## The actual training

In [0]:
EPOCHS=20

In [0]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback, tensorboard_callback])

Train on 256 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20

In [0]:
# Generate

In [0]:

checkpoint_dir = './training_checkpoints'  # duplicate

tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_20'

In [0]:
use_tpu_for_generation=False

In [0]:
if not use_tpu_for_generation:
    gen_model = build_model(vocab_size = len(textlib.i2c),
        # embedding_dim=EMBEDDING_DIM,  # not a parameter of build_model; EMBEDDING_DIM is not defined
        steps=SEQUENCE_LEN,
        lstm_units=LSTM_UNITS,
        lstm_layers=LSTM_LAYERS,
        batch_size=1)
else:
    gen_model = build_tpu_model(
          vocab_size = len(textlib.i2c),
          # embedding_dim=EMBEDDING_DIM,  # not a parameter of build_tpu_model
          steps=SEQUENCE_LEN,
          lstm_units=LSTM_UNITS,
          lstm_layers=LSTM_LAYERS,
          batch_size=1,
          stateful=STATEFUL)  # TPUs can't handle stateful=True, and that's deadly for text generation.

In [0]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_20'

In [0]:
gen_model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fbeca47cf60>

In [0]:
gen_model.build(tf.TensorShape([1, None, len(textlib.i2c)]))  # (batch, steps, one-hot vocabulary)

In [0]:
gen_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (1, None, 256)            353280    
_________________________________________________________________
dense_1 (Dense)              (1, None, 88)             22616     
Total params: 375,896
Trainable params: 375,896
Non-trainable params: 0
_________________________________________________________________
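
Text generation in the following cell samples the next character from temperature-scaled logits. The NumPy sketch below illustrates that idea under toy assumptions: dividing the logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it. `sample_with_temperature` is a hypothetical helper, not the notebook's `tf.random.categorical` call.

```python
import numpy as np


def sample_with_temperature(logits, temperature, rng):
    """Draw one class index from softmax(logits / temperature)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs


rng = np.random.default_rng(0)  # fixed seed, only to make the sketch repeatable
toy_logits = [2.0, 1.0, 0.1]
_, cold = sample_with_temperature(toy_logits, 0.2, rng)
_, hot = sample_with_temperature(toy_logits, 2.0, rng)
idx, _ = sample_with_temperature(toy_logits, 0.6, rng)
print(cold.round(3), hot.round(3), idx)
```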


In [0]:
def generate_text_with_tpu(model, start_string, temp=0.6):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 128

  # Converting our start string to numbers (vectorizing)
  cutstr=start_string[-SEQUENCE_LEN:]  # TPUs need a history of exactly SEQUENCE_LEN chars, no less, no more.
  input_eval = [textlib.c2i[s] for s in cutstr]
  input_eval = tf.expand_dims(input_eval, 0)
  input_eval = tf.keras.backend.one_hot(input_eval, len(textlib.i2c))  # the model expects one-hot input

  # Empty string to store our results
  text_generated = []
  ids=[]

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = temp

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # sample the next character from a categorical distribution over the model's output
      predictions = predictions / temperature
      predicted_tensor = tf.random.categorical(predictions, num_samples=1)[-1,0]
      if not use_tpu:
          predicted_id=predicted_tensor.numpy()
      else:
          predicted_id=predicted_tensor.eval()
      ids.append(predicted_id)

      text_generated.append(textlib.i2c[predicted_id])
      print("out:"+''.join(text_generated))

      # No hidden state is carried over: rebuild the input from the last
      # SEQUENCE_LEN chars of the history, to stay "stateless".
      cutstr=(start_string+''.join(text_generated))[-SEQUENCE_LEN:]
      input_eval = [textlib.c2i[s] for s in cutstr]
      input_eval = tf.expand_dims(input_eval, 0)

  return (start_string + ''.join(text_generated), ids)

def generate_text(model, start_string, temp=0.6):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 128

  # Converting our start string to numbers (vectorizing)
  cutstr=start_string  # not truncated to SEQUENCE_LEN
  input_eval = [textlib.c2i[s] for s in cutstr]
  input_eval_1 = tf.expand_dims(input_eval, 0)

  input_eval = tf.keras.backend.one_hot(input_eval_1, len(textlib.i2c))

  # Empty string to store our results
  text_generated = []
  ids=[]

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = temp

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model.predict(input_eval, batch_size=1)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # sample the next character from a categorical distribution over the model's output
      predictions = predictions / temperature
      predicted_tensor = tf.random.categorical(predictions, num_samples=1)[-1,0]
      predicted_id=predicted_tensor.numpy()
      ids.append(predicted_id)

      text_generated.append(textlib.i2c[predicted_id])

      # We pass the predicted character as the next input to the model;
      # the stateful model keeps the previous hidden state.
      input_eval_1 = tf.expand_dims([predicted_id], 0)
      input_eval = tf.keras.backend.one_hot(input_eval_1, len(textlib.i2c))
  return (''.join(text_generated), ids)
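
Dividing the logits by the temperature before sampling sharpens the distribution for values below 1 and flattens it for values above 1. A small standard-library sketch of that effect; the logits are made-up numbers, and `softmax` and `sample` are illustrative helpers standing in for `tf.random.categorical`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for t in (0.2, 0.6, 1.3):
    print(t, [round(p, 3) for p in softmax(logits, t)])

rng = random.Random(0)
print([sample(logits, 0.6, rng) for _ in range(10)])
```

At a temperature of 0.2 almost all probability mass sits on the highest-scoring character; at 1.3 the alternatives get a noticeably larger share.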

In [0]:
start_string="With the clarity of thought of an artificial life form, the discussion went on:"
len(start_string[0:SEQUENCE_LEN])

60
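
Note that `start_string[0:SEQUENCE_LEN]` keeps the first characters of the prompt, while `generate_text_with_tpu` uses `start_string[-SEQUENCE_LEN:]`, the last ones. A sketch of a fixed-width window helper: `fixed_window` is a hypothetical name, `SEQUENCE_LEN = 60` is inferred from the output above, and left-padding short prompts with spaces is an assumption.

```python
SEQUENCE_LEN = 60  # inferred from the cell output above

def fixed_window(text, width=SEQUENCE_LEN, pad=' '):
    """Return exactly `width` characters: the tail of `text`, left-padded if short."""
    return text[-width:].rjust(width, pad)

start_string = "With the clarity of thought of an artificial life form, the discussion went on:"
print(repr(start_string[0:SEQUENCE_LEN]))  # head of the prompt
print(repr(fixed_window(start_string)))    # tail of the prompt
print(repr(fixed_window("Hello")))         # short prompt, padded on the left
```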

In [0]:
if use_tpu_for_generation:
    sess=tf.compat.v1.keras.backend.get_session() # tf.compat.v1.get_default_session()
    with sess.as_default():
        tx,id=generate_text_with_tpu(gen_model, start_string="With the clarity of thought of an artificial life form, the discussion went on:", temp=0.8)
else:
    tf.compat.v1.enable_eager_execution()
    if not tf.executing_eagerly():
        print("Eager execution is not active.")
    # with tf.device('/job:localhost/replica:0/task:0/device:CPU:0'):  # Speed is about same gpu/cpu
    tx,id=generate_text(gen_model, start_string="With the clarity of thought of an artificial life form, the discussion went on:", temp=0.8)
    print(tx)

 of her fult.  And she share some how very surpless
to dise't ald a mesticulabel and goon
or to you was quite hel
stead, and the


In [0]:
def detectPlagiarism(tx, textlibrary, minQuoteLength=10):
    textlibrary.source_highlight(tx, minQuoteLength)
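
`source_highlight` belongs to the text library class defined earlier in the notebook. The idea behind such a check can be sketched independently: look for stretches of generated text of at least `minQuoteLength` characters that occur verbatim in the training text. The function `find_quotes` below is a hypothetical, brute-force illustration, not the library's implementation, and the two sample strings are made up.

```python
def find_quotes(generated, source, min_len=10):
    """Return substrings of `generated` (>= min_len chars) found verbatim in `source`.

    Matches are searched left to right and extended greedily.
    """
    quotes = []
    i = 0
    while i + min_len <= len(generated):
        if generated[i:i + min_len] in source:
            # Extend the match as long as it still occurs in the source.
            j = i + min_len
            while j < len(generated) and generated[i:j + 1] in source:
                j += 1
            quotes.append(generated[i:j])
            i = j  # continue after the quote
        else:
            i += 1
    return quotes

source = "It is a truth universally acknowledged, that a single man in possession of a good fortune"
generated = "She said it is a truth universally unknown that a single man was here"
print(find_quotes(generated, source, min_len=10))
```

A larger `min_len` reports fewer, longer quotes; short matches are expected by chance in character-level models.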

In [0]:
txt=textlib.decode(id)
txti=txt.split('\r\n')
for t in txti:
    print(t)

 of her he batk on a make her beand aspansion_ vigyed out to her forded here poor spald mode that clmations stirf.  But mo stlee


In [0]:
detectPlagiarism(tx, textlib)

## References:
* <https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/text_generation.ipynb>
* <https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/shakespeare_with_tpu_and_keras.ipynb>

## 6. A dialog with the trained model [not ported yet]

In [0]:
# Do a dialog with the recurrent neural net trained above:
# def genDialogAnswer(prompt, g_state=None, endPrompt='.', maxEndPrompts=2,
# maxAnswerSize=512, temperature=1.0):


def doDialog():
    # 0.1 (frozen character) - 1.3 (creative/chaotic character)
    temperature = 0.6
    endPrompt = '.'  # the endPrompt character is the end-mark in answers.
    # look for number of maxEndPrompts until answer is finished.
    maxEndPrompts = 4
    maxAnswerSize = 2048  # Maximum length of the answer
    minAnswerSize = 64  # Minimum length of the answer

    with tf.Session() as sess:
        print("Please enter some dialog.")
        print("The net will answer according to your input.")
        print("'bye' for end,")
        print("'reset' to reset the conversation context,")
        print("'temperature=<float>' [0.1(frozen)-1.3(creative)]")
        print("    to change character of the dialog.")
        print("    Current temperature={}.".format(temperature))
        print()
        xso = None
        bye = False
        model.init.run()

        tflogdir = os.path.realpath(model.logdir)
        if not os.path.exists(tflogdir):
            print("You haven't trained a model, no data found at: {}".format(
                tflogdir))
            return

        # Used for saving the training parameters periodically
        saver = tf.train.Saver()
        checkpoint_file = os.path.join(tflogdir, model.checkpoint)

        lastSave = tf.train.latest_checkpoint(tflogdir, latest_filename=None)
        if lastSave is not None:
            pt = lastSave.rfind('-')
            if pt != -1:
                pt += 1
                start_iter = int(lastSave[pt:])
            # print("Restoring checkpoint at {}: {}".format(start_iter, lastSave))
            saver.restore(sess, lastSave)
        else:
            print("No checkpoints have been saved at:{}".format(tflogdir))
            return

        # g_state = sess.run([model.init_state_0], feed_dict={model.batch_size: 1})
        doini = True

        bye = False
        while not bye:
            print("> ", end="")
            prompt = input()
            if prompt == 'bye':
                bye = True
                print("Good bye!")
                continue
            if prompt == 'reset':
                doini = True
                # g_state = sess.run([model.init_state_0], feed_dict={model.batch_size: 1})
                print("(conversation context marked for reset)")
                continue
            if prompt[:len("temperature=")] == "temperature=":
                t = float(prompt[len("temperature="):])
                if t > 0.05 and t < 1.4:
                    temperature = t
                    print("(generator temperature now {})".format(t))
                    print()
                    continue
                print("Invalid temperature-value ignored! [0.1-1.3]")
                continue
            xs = ' ' * model.steps
            xso = ''
            for rep in range(1):
                for i in range(len(prompt)):
                    xs = xs[1:]+prompt[i]
                    X_new = np.transpose([[textlib.c2i[sj]] for sj in xs])
                    if doini:
                        doini = False
                        g_state = sess.run(
                            [model.init_state_0], feed_dict={model.X: X_new})
                    g_state, y_pred = sess.run([model.final_state, model.output_softmax_temp],
                                               feed_dict={model.X: X_new, model.init_state: g_state,
                                                          model.temperature: temperature})
            ans = 0
            numEndPrompts = 0
            while (ans < maxAnswerSize and numEndPrompts < maxEndPrompts) or ans < minAnswerSize:

                X_new = np.transpose([[textlib.c2i[sj]] for sj in xs])
                g_state, y_pred = sess.run([model.final_state, model.output_softmax_temp],
                                           feed_dict={model.X: X_new, model.init_state: g_state,
                                                      model.temperature: temperature})
                inds = list(range(model.vocab_size))
                ind = np.random.choice(inds, p=y_pred[0, -1].ravel())
                nc = textlib.i2c[ind]
                if nc == endPrompt:
                    numEndPrompts += 1
                xso += nc
                xs = xs[1:]+nc
                ans += 1
            print(xso.replace("\\n", "\n"))
            textlib.source_highlight(xso, 13)
    return
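
The `while` condition in `doDialog` combines three limits: an answer always reaches `minAnswerSize` characters, and after that it stops once `maxEndPrompts` end marks have been seen or `maxAnswerSize` is reached. A sketch of that rule in isolation; `answer_length` is a hypothetical helper that replays the condition over a given character stream instead of sampling from the network.

```python
def answer_length(chars, end_prompt='.', max_end_prompts=4,
                  max_answer_size=2048, min_answer_size=64):
    """Count how many characters of `chars` the dialog loop would emit."""
    ans = 0
    num_end_prompts = 0
    stream = iter(chars)
    while (ans < max_answer_size and num_end_prompts < max_end_prompts) or ans < min_answer_size:
        nc = next(stream, None)
        if nc is None:  # stream exhausted (does not happen when sampling from a network)
            break
        if nc == end_prompt:
            num_end_prompts += 1
        ans += 1
    return ans

print(answer_length("Yes. No. Maybe. Fine. " * 10))   # short sentences: runs on to the minimum size
print(answer_length("a" * 5000))                      # no end mark: capped at the maximum size
print(answer_length("x" * 100 + "...." + "y" * 100))  # stops right after the fourth end mark
```

End marks seen before the minimum size is reached still count, so a burst of short sentences ends the answer exactly at `minAnswerSize`.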

In [0]:
# Talk to the net!
doDialog()