<a href="https://colab.research.google.com/github/albope/master-data-analytics-content/blob/master/EDEM_Text_Generation_(Practice_3).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with an RNN
Based on https://www.tensorflow.org/tutorials/text/text_generation


### Import TensorFlow and other libraries

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

import numpy as np
import os
import time
import pandas as pd

### Download the Shakespeare dataset

In [0]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

### Read the data

In [42]:
len(text)

1115394

In [43]:
# Take a look at the first 1000 characters in text
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [45]:
# The unique characters in the file
vocab = sorted(set(text))  # set() drops repeated characters, leaving one entry per unique character
print('{} unique characters'.format(len(vocab)))

65 unique characters


## Process the text

### Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another for numbers to characters.

In [0]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

In [47]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  "'" :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '3' :   9,
  ':' :  10,
  ';' :  11,
  '?' :  12,
  'A' :  13,
  'B' :  14,
  'C' :  15,
  'D' :  16,
  'E' :  17,
  'F' :  18,
  'G' :  19,
  ...
}
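
As a quick sanity check on the two lookup tables, the sketch below encodes a short string and decodes it again. It uses plain Python lists instead of NumPy, and the `_demo` names are hypothetical, introduced only for this illustration; the mapping logic mirrors `char2idx` and `idx2char` above.

```python
# Minimal round-trip sketch: characters -> integers -> characters.
sample = "First Citizen"

# One entry per unique character, sorted so the indices are reproducible.
vocab_demo = sorted(set(sample))
char2idx_demo = {ch: i for i, ch in enumerate(vocab_demo)}
idx2char_demo = vocab_demo  # a list already maps index -> character

encoded = [char2idx_demo[ch] for ch in sample]
decoded = "".join(idx2char_demo[i] for i in encoded)

print(encoded)
print(decoded)
```

If the two tables are consistent, `decoded` is identical to `sample`.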


### Create training examples and targets


In [48]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

F
i
r
s
t


In [49]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


Given a character, or a sequence of characters, what is the most probable next character?

In [0]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [51]:
for input_example, target_example in dataset.take(1):
  print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
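
The one-character shift between input and target is easier to see on a short string. The sketch below applies the same slicing as `split_input_target` to the five-character chunk `"Hello"`; the function name `split_input_target_demo` is hypothetical and operates on a plain Python string rather than a tensor.

```python
def split_input_target_demo(chunk):
    # Input: everything except the last character.
    # Target: everything except the first character.
    return chunk[:-1], chunk[1:]

inp, tgt = split_input_target_demo("Hello")
print(inp, tgt)  # Hell ello

# Each input character is paired with the character that follows it.
pairs = list(zip(inp, tgt))
print(pairs)
```

At every position the model sees one character of the input and is trained to predict the character at the same position in the target.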


In [52]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 18 ('F')
  expected output: 47 ('i')
Step    1
  input: 47 ('i')
  expected output: 56 ('r')
Step    2
  input: 56 ('r')
  expected output: 57 ('s')
Step    3
  input: 57 ('s')
  expected output: 58 ('t')
Step    4
  input: 58 ('t')
  expected output: 1 (' ')


In [53]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
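
The buffered shuffle described in the comment above can be approximated in plain Python. This is a simplified model of the idea, not the actual `tf.data` implementation: fill a buffer with the first `buffer_size` elements, then repeatedly emit a random buffered element and replace it with the next incoming one. The function name `buffered_shuffle` and the `seed` parameter are assumptions made for this sketch.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer; nothing is emitted yet.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
in_order = list(buffered_shuffle(range(20), buffer_size=1))
print(shuffled)
```

A small buffer only mixes nearby elements, and a buffer of size 1 leaves the order unchanged. With roughly 11,000 sequences in this dataset, a `BUFFER_SIZE` of 10000 covers most of the data.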

In [0]:
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024

In [0]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [0]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

## Try the model

Now run the model to see that it behaves as expected.

First check the shape of the output:

In [57]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [0]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

In [59]:
sampled_indices

array([20, 15, 19,  0, 23,  7, 30, 57, 47, 13, 36, 62,  1, 28, 40, 25, 33,
       29, 44, 57, 27, 20, 25, 17, 52,  0,  7, 15, 39,  0, 56, 41, 49, 22,
       29, 57, 64, 43, 51, 51,  9,  5,  9, 45, 64, 48, 43, 53, 51, 42, 53,
        9,  5,  2, 53, 17, 38, 49, 47, 63,  9, 28, 43, 54, 45, 51, 58, 40,
       22, 58, 50, 51, 40,  5, 40,  6,  4,  6, 26, 31, 25, 44, 48, 34, 10,
       10, 52, 33, 15, 57,  5, 31, 40, 55, 16, 27, 10, 39, 30,  5])

In [60]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 "hy blood,\nCongeal'd with this, do make me wipe off both.\n3 KING HENRY VI\n\nYORK:\nThe army of the quee"

Next Char Predictions: 
 "HCG\nK-RsiAXx PbMUQfsOHMEn\n-Ca\nrckJQszemm3'3gzjeomdo3'!oEZkiy3PepgmtbJtlmb'b,&,NSMfjV::nUCs'SbqDO:aR'"


## Train the model

In [61]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 65)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.17468
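
The initial loss can be checked against a simple baseline. If an untrained model spreads probability roughly evenly over the 65 characters, the cross-entropy should be close to ln(65). The sketch below computes that value; the `_demo` and `uniform_loss` names are hypothetical, and the uniform-output assumption is an approximation rather than a property guaranteed by the model.

```python
import math

vocab_size_demo = 65  # number of unique characters reported earlier

# Cross-entropy of a uniform prediction: -log(1 / vocab_size).
uniform_loss = -math.log(1.0 / vocab_size_demo)
print(uniform_loss)  # about 4.174
```

The reported `scalar_loss` of 4.17468 sits very close to this value, which suggests the freshly initialized model is not yet favoring any character.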


In [0]:
model.compile(optimizer='adam', loss=loss)

Configure checkpoints

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

### Execute the training

In [0]:
EPOCHS=10

In [68]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Train for 172 steps
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
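
The figure of 172 steps per epoch follows from the numbers used earlier in the notebook. Both `batch` calls use `drop_remainder=True`, so each division rounds down. The sketch below repeats the arithmetic with plain integers; the variable names are chosen for this illustration.

```python
text_length = 1115394  # len(text)
seq_length = 100
batch_size = 64

# Each training sequence consumes seq_length + 1 characters.
num_sequences = text_length // (seq_length + 1)

# Sequences are grouped into batches of 64; the incomplete batch is dropped.
steps_per_epoch = num_sequences // batch_size

print(num_sequences)    # 11043
print(steps_per_epoch)  # 172
```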


## Generate text

Restore the latest checkpoint

In [69]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_10'

In [0]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

Prediction loop

In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 2.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
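
To see what dividing the logits by `temperature` does, the sketch below computes a softmax over three made-up logits at two temperatures. It is a plain-Python illustration, not part of the notebook's model code: the function `softmax_with_temperature` and the values in `demo_logits` are hypothetical. Sampling from logits scaled by 1/T is equivalent to sampling from the softmax of the scaled logits.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return probabilities for logits scaled by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

demo_logits = [2.0, 1.0, 0.1]  # made-up values for illustration

cold = softmax_with_temperature(demo_logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(demo_logits, 2.0)   # flatter distribution

print(cold)
print(hot)
```

At the lower temperature the most likely character takes most of the probability mass. At the higher temperature the distribution flattens, so unlikely characters are sampled more often, which is one reason output generated at `temperature = 2.0` tends to look noisy.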

In [71]:
print(generate_text(model, start_string=u"yesterday "))

yesterday teg therdeyoure
ishofowitik' t fusinoro towhont pathofalllll or'


Canieareelinom avamses, thaini's tofige were'llut l buestake,

AS: lley mererme

Su vanco's thisouthampreelie dosoisenoftausownthill ure Prerdowinde ir ou motan, chaiGHeshert teniomeseare

Th
Shi'ldwourevan s I mouso wADK
ST:
Chorofas yous
TEd blllfe alour f s topp icyopllt ya-plll medon Is the RWh s f oul y t bre iranofororra byot gedoullire, ivim soualkig!-Phes a quiouhed d!
Yobul s clld, sougr teesod
INRGXFVInif inofr'st WAURINThioueichee yoneisel, orelllirlllladashed AN meng f m he t, ang t
Finge d! w me bun
Towin thondollll her by th, tir I'prhene g get d s I ain or:
S:
An burey maknoug aye pend, morstheve bace Time,
ARYould s!
NGRI'st. Y:
Anouseia g p n:
Anive.

SThelomy sonoaje ichounean g gevee mamy;
ANGRLAndeoutoul y mo th herise,
NThayowest othe an:
Whano cen-
Socke outhamy thanoma the y pavey pe onghongounges ty wore'POMed?

TRUlour hyonaieidy ne?'laifsan
Esee ithereala.
Thio lontiethe brglfus
INsuy

In [72]:
print(generate_text(model, start_string=u"before"))

before!
INoume eer th s han his brt giny trbe, o'semanat mar serd IO geio 'r st GOUS m omo. ooryo ilingl

THonounyat estived,


LAn, ontso pravall fout, s thant orerevifan m'flletons
ENo w'
S:
RDSWr
OMED wakil the wheret.
STrealomousen;

BERINClacee al icurath MERAnirdrer hilofr he! JUSTHe be furenous nicin hy inout makeset, me ofothe s, twindsorant h herocouse bo tis,

WIENThen h'pe
WARINCho LAPeo afowhar y Pu, tort sthey.
IORGRAnove ate ther d merot soriceeth am wachisheat IERCHer I's nd blufishe, o there ange honitl th br
Mrs s; t, I
Fowatof s, tam hom,
VIO:

STHAD
Prellly, s, atinor?
'ZARK:
Shan see ser'D ath thed sap'GHasimethang iomey we:
S: w d and sofoconobe!

Yo anodadullinous, atu altis orir am tste Vesid; wise figas h!
'fre the

ABARUSt.
He, pr me sofu gg the,

Wh thy ting h d blinghesoouta th tha,
He IO:
BTRULIStheratheshouks hay ho bhey howonge ain icou fore he benkinouprbe bo my l t ot;
IOLUCARD yourouatof wntis ith acouscofin? sll werinoucaton inochtsed by athans
Th h oo

# Exercises
1. Create a new colab Notebook
2. Create a Shakespeare generator
3. Use different RNN configurations (BiLSTM, stacked LSTM)
4. Train the model for 10 epochs
5. Print some results
6. (Optional) Try training with words instead of chars

## Practical tips

Download the data using: 

```
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
```

