<a href="https://colab.research.google.com/github/ayushs0911/Projects/blob/main/NLP/Text_Generation_with_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf 
import numpy as np
import os 
import time 

Downloading dataset


In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


Read the data

In [3]:
text = open(path_to_file, 'rb').read().decode(encoding = 'utf-8')

print(f"Length of text : {len(text)} characters ")

Length of text : 1115394 characters 


In [4]:
text[:10]

'First Citi'

In [5]:
#check unique characters in file 
vocab = sorted(set(text))
len(vocab)

65

##Processing the text 
The `tf.keras.layers.StringLookup` layer converts each character into a numeric ID. It only needs the text to be split into tokens first.

In [6]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary = list(vocab),
    mask_token = None 
)

In [7]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

In [8]:
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[40, 41, 42, 43, 44, 45, 46], [63, 64, 65]]>

Invert this representation to recover human-readable strings from it.

In [9]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), 
    invert=True, 
    mask_token=None)

In [10]:
chars = chars_from_ids(ids)
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

In [11]:
tf.strings.reduce_join(chars, axis = -1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)

In [12]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis = -1)
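
To make the round trip concrete, here is a plain-Python sketch of a character lookup and its inverse. It does not use `StringLookup`; the names `toy_vocab`, `ids_from_text`, and `text_from_id_list` are illustrative, and it assumes (as the IDs printed above suggest) that index 0 is reserved for unknown characters, so real characters start at 1.

```python
# Plain-Python illustration of a character <-> ID lookup (not the Keras layer).
toy_vocab = sorted(set("hello world"))  # hypothetical tiny vocabulary

# Reserve index 0 for unknown characters, so known characters start at 1.
UNK = "[UNK]"
id_to_char = [UNK] + toy_vocab
char_to_id = {ch: i for i, ch in enumerate(id_to_char)}


def ids_from_text(text):
    """Map each character to its ID, falling back to 0 for unseen characters."""
    return [char_to_id.get(ch, 0) for ch in text]


def text_from_id_list(id_list):
    """Invert the mapping and join the characters back into a string."""
    return "".join(id_to_char[i] for i in id_list)


toy_ids = ids_from_text("hello")
print(toy_ids)                     # [4, 3, 5, 5, 6]
print(text_from_id_list(toy_ids))  # hello
```

A character outside the toy vocabulary, such as `"z"`, maps to 0 in this sketch, which is the role the `[UNK]` entry plays in the real layer.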

### The prediction task 
The input to the model will be a sequence of characters, and you train the model to predict the output—the following character at each time step.

Since RNNs maintain an internal state that depends on the previously seen elements, the question at each step is: given all the characters seen so far, what is the next character?

**Create training examples and targets**

- Next divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text.
- For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.
- So break the text into chunks of `seq_length+1`. For example, say seq_length is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello".
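
The chunk-and-shift scheme in the list above can be traced on a toy string. This is a minimal plain-Python sketch, not the `tf.data` pipeline; `make_examples` is a hypothetical helper, and it assumes that an incomplete trailing chunk is simply dropped.

```python
def make_examples(text, seq_length):
    """Split text into chunks of seq_length + 1 and return (input, target) pairs."""
    chunk = seq_length + 1
    pairs = []
    # Step through the text one full chunk at a time, dropping any short tail.
    for start in range(0, len(text) - chunk + 1, chunk):
        piece = text[start:start + chunk]
        pairs.append((piece[:-1], piece[1:]))  # target is the input shifted by one
    return pairs


print(make_examples("Hello", seq_length=4))
# [('Hell', 'ello')]
print(make_examples("Hello world", seq_length=4))
# [('Hell', 'ello'), (' wor', 'worl')]
```

In the second call the final `"d"` does not fill a whole chunk, so it is discarded.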

To build these sequences, first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

In [13]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([19, 48, 57, ..., 46,  9,  1])>

In [14]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [15]:
for ids in ids_dataset.take(20):
  print(chars_from_ids(ids).numpy().decode('utf-8'))

F
i
r
s
t
 
C
i
t
i
z
e
n
:


B
e
f
o
r


In [16]:
seq_length = 100

In [17]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder = True)

for seq in sequences.take(1):
  print(seq)
  print(chars_from_ids(seq))

tf.Tensor(
[19 48 57 58 59  2 16 48 59 48 65 44 53 11  1 15 44 45 54 57 44  2 62 44
  2 55 57 54 42 44 44 43  2 40 53 64  2 45 60 57 59 47 44 57  7  2 47 44
 40 57  2 52 44  2 58 55 44 40 50  9  1  1 14 51 51 11  1 32 55 44 40 50
  7  2 58 55 44 40 50  9  1  1 19 48 57 58 59  2 16 48 59 48 65 44 53 11
  1 38 54 60  2], shape=(101,), dtype=int64)
tf.Tensor(
[b'F' b'i' b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':'
 b'\n' b'B' b'e' b'f' b'o' b'r' b'e' b' ' b'w' b'e' b' ' b'p' b'r' b'o'
 b'c' b'e' b'e' b'd' b' ' b'a' b'n' b'y' b' ' b'f' b'u' b'r' b't' b'h'
 b'e' b'r' b',' b' ' b'h' b'e' b'a' b'r' b' ' b'm' b'e' b' ' b's' b'p'
 b'e' b'a' b'k' b'.' b'\n' b'\n' b'A' b'l' b'l' b':' b'\n' b'S' b'p' b'e'
 b'a' b'k' b',' b' ' b's' b'p' b'e' b'a' b'k' b'.' b'\n' b'\n' b'F' b'i'
 b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':' b'\n' b'Y'
 b'o' b'u' b' '], shape=(101,), dtype=string)


In [18]:
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
b"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
b'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


For training, you need a dataset of `(input, label)` pairs. At each time step, the input is the current character and the label is the next character.


In [19]:
def split_input_target(sequence):
  input_text = sequence[:-1]
  target_text = sequence[1:]
  return input_text, target_text

In [20]:
split_input_target(list("Ayush Singh"))

(['A', 'y', 'u', 's', 'h', ' ', 'S', 'i', 'n', 'g'],
 ['y', 'u', 's', 'h', ' ', 'S', 'i', 'n', 'g', 'h'])

In [21]:
dataset = sequences.map(split_input_target)

In [22]:
for input_example, target_example in dataset.take(1):
  print("Input:", text_from_ids(input_example).numpy())
  print("Target", text_from_ids(target_example).numpy())

Input: b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


##Creating training batches 
Shuffle the data and pack it into batches.

In [23]:
BATCH_SIZE = 64

BUFFER_SIZE = 10000

dataset = (dataset
           .shuffle(BUFFER_SIZE)
           .batch(BATCH_SIZE, drop_remainder = True)
           .prefetch(tf.data.AUTOTUNE))
dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
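
As a sanity check on the pipeline so far, the number of sequences and batches follows from integer division, since both `batch` calls use `drop_remainder = True`. The sketch below assumes the text length printed earlier (1,115,394 characters) and the `seq_length`, `BATCH_SIZE`, and `BUFFER_SIZE` values set above; the lowercase variable names are illustrative.

```python
text_length = 1_115_394   # characters, as printed earlier
seq_length = 100
batch_size = 64
buffer_size = 10_000

# Each example consumes seq_length + 1 characters; the short tail is dropped.
num_sequences = text_length // (seq_length + 1)
# Sequences are then grouped into batches, again dropping the remainder.
steps_per_epoch = num_sequences // batch_size

print(num_sequences)     # 11043
print(steps_per_epoch)   # 172
# The shuffle buffer is slightly smaller than the dataset, so the shuffle
# is approximate rather than a full uniform permutation.
print(buffer_size < num_sequences)  # True
```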

##Build the model
This section defines the model as a `keras.Model` subclass with three layers:
- `tf.keras.layers.Embedding`
- `tf.keras.layers.GRU`
- `tf.keras.layers.Dense` : output layer with `vocab_size` outputs. It outputs one logit for each character in vocabulary. 

In [24]:
# Length of the vocabulary in the StringLookup layer
vocab_size = len(ids_from_chars.get_vocabulary())

embedding_dim = 256

rnn_units = 1024

In [25]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units, 
                                   return_sequences = True, 
                                   return_state = True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states = None, return_state = False, training = False):
    x = inputs
    x = self.embedding(x, training = training )
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state = states, 
                         training = training)
    x = self.dense(x, training = training)

    if return_state:
      return x, states
    else:
      return x 

In [26]:
model = MyModel(vocab_size = vocab_size,
                embedding_dim = embedding_dim,
                rnn_units = rnn_units)

For each character, the model looks up the embedding, runs the GRU for one time step with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character.
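
A rough way to get a feel for the size of this model is to count its weights by hand. The sketch below assumes a vocabulary of 66 entries (the 65 characters plus the `[UNK]` token that `StringLookup` adds) and the Keras GRU default of `reset_after=True`, which keeps two bias vectors per weight block. Treat the totals as an estimate to compare against `model.summary()`.

```python
vocab_size = 66       # assumed: 65 characters + "[UNK]"
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: 3 weight blocks (update gate, reset gate, candidate state), each with
# input weights, recurrent weights, and (assuming reset_after=True) two biases.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: a weight matrix from rnn_units to vocab_size, plus one bias per output.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params)  # 16896
print(gru_params)        # 3938304
print(dense_params)      # 67650
print(total_params)      # 4022850
```

Almost all of the weights sit in the GRU, which is why `rnn_units` is the setting that most affects model size and training time.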

##Train the model 
At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

In [27]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits = True)

In [28]:
model.compile(optimizer = 'Adam',
              loss = loss)
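
To illustrate what this loss computes at a single time step, here is a plain-Python sketch of sparse categorical cross-entropy on raw logits. It is not the Keras implementation; `sparse_ce_from_logits` is a hypothetical helper, and the vocabulary size of 66 is an assumption (65 characters plus `[UNK]`).

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one time step: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


vocab_size = 66  # assumed
# With uniform logits every character is equally likely, so the loss
# equals ln(vocab_size).
uniform_loss = sparse_ce_from_logits([0.0] * vocab_size, label=5)
print(round(uniform_loss, 2))  # 4.19

# A confident, correct prediction gives a loss close to zero.
confident = [0.0] * vocab_size
confident[5] = 20.0
confident_loss = sparse_ce_from_logits(confident, label=5)
print(confident_loss < 1e-6)  # True
```

A first-epoch loss far above roughly 4.19 would suggest that something is off with the initialization or the labels.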

In [29]:
#checkpoint callback 
checkpoint_dir = './training_checkpoints'

checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath = checkpoint_prefix,
    save_weights_only = True
)
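
The `{epoch}` placeholder in the prefix is filled in with the epoch number each time a checkpoint is written. Here is a small sketch of the resulting names, assuming 1-based epoch numbering and plain `str.format` substitution; the loop is illustrative, not Keras code.

```python
import os

checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Fill in the placeholder for the first three epochs.
names = [checkpoint_prefix.format(epoch=e) for e in range(1, 4)]
for name in names:
    print(name)
```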

##Execute the training 

In [31]:
history = model.fit(dataset,
                    epochs = 20,
                    callbacks = [checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


##Generate Text 
The simplest way to generate text with this model is to run it in a loop and keep track of the model's internal state during execution.

Each time you call the model you pass in some text and an internal state. The model returns a prediction for the next character and its new state. Pass the prediction and state back in to continue generating text.

The following makes a single step prediction:

In [32]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
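
The sampling step inside `generate_one_step` can be sketched in plain Python to show what the temperature and the mask do. This is an illustration with made-up logits, not the TensorFlow code path; `sample_next_id` is a hypothetical helper, and it uses `random.choices` in place of `tf.random.categorical`.

```python
import math
import random


def sample_next_id(logits, mask, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature + mask)."""
    scaled = [x / temperature + m for x, m in zip(logits, mask)]
    top = max(scaled)
    # exp(-inf) is 0.0, so masked entries get zero probability.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, 0.1]           # made-up scores for 4 tokens
mask = [-float("inf"), 0.0, 0.0, 0.0]   # index 0 plays the role of "[UNK]"

rng = random.Random(0)
samples = [sample_next_id(logits, mask, 1.0, rng) for _ in range(1000)]
print(0 in samples)  # False: the masked index is never drawn

# A low temperature sharpens the distribution toward the largest unmasked logit.
cold = [sample_next_id(logits, mask, 0.05, rng) for _ in range(1000)]
print(cold.count(1) / len(cold))
```

Temperatures above 1.0 flatten the distribution and produce more surprising text, while values below 1.0 make the output more predictable.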

In [34]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

Run it in a loop to generate some text. Looking at the generated text, you'll see that the model knows when to capitalize and start new paragraphs, and that it imitates a Shakespeare-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [37]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

ROMEO:
A goodly store, let have firm.

ESCALUS:
To my king, were there, young Patience.

DUCHESS OF YORK:
What, will our protector?

LUCIO:

ISABELLA:
Hear me, my noble.

PROSPERO:
'Tis out a purpose;
And he shall die, be glass; and he suntimes dail
On cold fish'd for his own son in
I would be blinding out; and, whistlent shriur
In that we have forspecting when Thomas Mowbray,
As I come fithen delicitors,
and very wealth as by it and wed.

Third Servingman:
Where's Barnardine?

Provost:
A name, comfort all this in your roors.
Make me true, fellow, let uh my feeble sorrow,
Or is reportly help in 'alig basement--
If you live, if Kent thou readem as you remiss;
As to say to them, and disperse than
Remember these new blood was well proclaimed
Cut in the happy visits of the city;
Whilst I, with thought it shall be so: it is a man that kill of a
help in Corioli haveness,
So cibil the hardly own to sleep;
His heart convey me to the orderis
Set down it.

LEONTES:
Hold,
Throw nothing father, th

In [38]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:', 'ROMEO:', 'ROMEO:', 'ROMEO:', 'ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result, '\n\n' + '_'*80)
print('\nRun time:', end - start)

tf.Tensor(
[b"ROMEO:\nHark, Parish Mercirivile,\nThere shall not attare you know,\nShe bids you less one foot overbown ignorant,\nAnd not an ear nichtain; if I had rather\nYet that have coof'd undiscovered.\n\nLUCENTIO:\nYet one sire living, not a woman:\nGentlewoman would again, as if we wisdom thence.\n\nJOHN OF GAUNT:\nThe time shall never deserved it.\n\nCLARENCE:\nWe think that Bolingbroke it was a but pinn'd.\n\nPRINCE:\n\nGLOUCESTER:\nI know not what; I think not for me: if he so,\nWe have forsworn to close this alliance.\nYou suggest you, the king, profit you that Clarence,\nWere justic time with the stepherds.\n\nCOMINIUS:\nYou have finded me: ort\nCyment, I think, for one sister Bearing, back\nUnterpret'st this weak answer. Pray, again;\nTell him, my lord, to-dwell with me to-day;\nFor I have often here be sent to be\nLook for the people's; and upon this hand of\npities, and no rememby of your\ndaughter to her husband's lawful king:\nUnder your grace, keep you not like a king

###Export the generator 
The single-step model can easily be saved and restored, allowing you to use it anywhere a `tf.saved_model` is accepted.

In [39]:
tf.saved_model.save(one_step_model, 'one_step')
one_step_reloaded = tf.saved_model.load('one_step')



In [40]:
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(100):
  next_char, states = one_step_reloaded.generate_one_step(next_char, states=states)
  result.append(next_char)

print(tf.strings.join(result)[0].numpy().decode("utf-8"))

ROMEO:
By my heart, a very traitor.

Clown:
I would I had some impusence; for
the other for my thoughts ar
