In [1]:
import tensorflow as tf
import numpy as np
import os 
import time

This tutorial demonstrates how to generate text using a character-based RNN. You will work with a dataset of Shakespeare's writing from Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). Given a sequence of characters from this data ("Shakespear"), train a model to predict the next character in the sequence ("e"). Longer sequences of text can be generated by calling the model repeatedly.

Note: Enable GPU acceleration to execute this notebook faster. In Colab: *Runtime > Change runtime type > Hardware accelerator > GPU*.

This tutorial includes runnable code implemented using [tf.keras](https://www.tensorflow.org/guide/keras/sequential_model) and [eager execution](https://www.tensorflow.org/guide/eager). The following is the sample output when the model in this tutorial trained for 30 epochs, and started with the prompt "Q":

### Download the Shakespeare dataset

Change the following line to run this code on your own data.

In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [3]:
## Reading the Data

with open(path_to_file,'rb') as file:
    text=file.read().decode(encoding='utf-8')

print(f'Length of characters : {len(text)} characters.')

Length of characters : 1115394 characters.


In [4]:
# Print the first 350 characters of the text

print(text[:350])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
I


In [5]:
## The unique characters in the file
vocab=sorted(set(text))
print(f'len of sorted unique chars : {len(vocab)}')

len of sorted unique chars : 65


## Process the text 

The `tf.keras.layers.StringLookup` layer can convert each character into a numeric ID. The text just needs to be split into tokens first.
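
To make the mapping concrete, here is a minimal plain-Python sketch of the idea behind such a lookup: index 0 is reserved for characters outside the vocabulary, and known characters are numbered from 1 in sorted order. The names `toy_text`, `toy_vocab`, `char_to_id`, and `encode` are illustrative and not part of TensorFlow; the real layer works on tensors rather than Python lists.

```python
# Plain-Python sketch of a character-to-ID lookup (illustrative, not the TF implementation).
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

# Reserve ID 0 for unknown characters; known characters start at 1.
char_to_id = {ch: i + 1 for i, ch in enumerate(toy_vocab)}

def encode(s):
    """Map each character to its ID, falling back to 0 for unseen characters."""
    return [char_to_id.get(ch, 0) for ch in s]

print(encode("hello"))  # [4, 3, 5, 5, 6]
print(encode("help"))   # [4, 3, 5, 0] because 'p' is not in the vocabulary
```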


In [6]:
example_text=['Akash','Ghimire']

chars=tf.strings.unicode_split(example_text,input_encoding='UTF-8') ## Split the text into characters first

print('Splitted chars:\n',chars)



Splitted chars:
 <tf.RaggedTensor [[b'A', b'k', b'a', b's', b'h'],
 [b'G', b'h', b'i', b'm', b'i', b'r', b'e']]>


In [7]:
## Vectorize each token using the tf.keras.layers.StringLookup layer

id_from_char=tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)

In [8]:
## Now convert the tokens above (the split characters) to IDs

ids=id_from_char(chars)
print(ids)

<tf.RaggedTensor [[14, 50, 40, 58, 47], [20, 47, 48, 52, 48, 57, 44]]>


For analysis, we need to map the IDs back to characters so that the output is in human-readable form. To do this, pass `invert=True` to `tf.keras.layers.StringLookup`.

> Note: Here, instead of passing the original vocabulary generated with `sorted(set(text))`, use the `get_vocabulary()` method of the `tf.keras.layers.StringLookup` layer so that the `[UNK]` token is set the same way.
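
The round trip can be sketched in plain Python with a single shared list that serves both directions. Here `id_to_char` plays the role that the list returned by `get_vocabulary()` plays in the notebook, with the `[UNK]` placeholder at index 0; all names are illustrative rather than TensorFlow APIs.

```python
# Plain-Python sketch of forward and inverse lookups that share one vocabulary list.
id_to_char = ['[UNK]'] + sorted(set("hello world"))
char_to_id = {ch: i for i, ch in enumerate(id_to_char)}

def encode(s):
    """Characters to IDs; unknown characters map to ID 0."""
    return [char_to_id.get(ch, 0) for ch in s]

def decode(ids):
    """IDs back to text, joining the characters into one string."""
    return ''.join(id_to_char[i] for i in ids)

print(decode(encode("hello")))  # hello
print(decode(encode("help")))   # hel[UNK]
```

Because both directions read from the same list, every in-vocabulary string survives the round trip unchanged, and unknown characters come back as the placeholder.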

In [9]:
char_from_id=tf.keras.layers.StringLookup(vocabulary=id_from_char.get_vocabulary(),invert=True)

In [10]:
chars_rev=char_from_id(ids)

In [11]:
chars==chars_rev ## The recovered characters are the same as the original split characters

<tf.RaggedTensor [[True, True, True, True, True],
 [True, True, True, True, True, True, True]]>

In [12]:
## The characters above are in split (tokenized) form. To convert them back to regular text, use tf.strings.reduce_join
tf.strings.reduce_join(chars,axis=-1).numpy()

array([b'Akash', b'Ghimire'], dtype=object)

In [13]:
def text_from_ids(ids):
  return tf.strings.reduce_join(char_from_id(ids), axis=-1)

## The Prediction task

Here, we are trying to predict the next character given an input character or sequence of characters. The input to the model will be a sequence of characters, and you train the model to predict the output: the following character at each time step.

Since RNNs maintain an internal state that depends on the previously seen elements, given all the characters computed until this moment, what is the next character?
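
As a rough illustration of the task, the sketch below replaces the neural network with a toy bigram table that counts which character follows which, then generates text by calling the one-step predictor repeatedly. This is only a stand-in chosen for brevity: the corpus string and the names `follow_counts`, `predict_next`, and `generate` are hypothetical, and a bigram table sees only one previous character, whereas the RNN carries state across the whole sequence.

```python
import random
from collections import Counter, defaultdict

# Toy stand-in for a trained model: count which character follows which.
corpus = "the theme then thus "
follow_counts = defaultdict(Counter)
# Pair each character with its successor; wrap around so every character has one.
for current_char, next_char in zip(corpus, corpus[1:] + corpus[0]):
    follow_counts[current_char][next_char] += 1

def predict_next(ch, rng):
    """Sample the next character in proportion to how often it followed `ch`."""
    counts = follow_counts[ch]
    return rng.choices(list(counts), weights=list(counts.values()), k=1)[0]

def generate(seed, length, rng):
    """Call the one-step predictor repeatedly, feeding each output back in."""
    out = seed
    for _ in range(length):
        out += predict_next(out[-1], rng)
    return out

print(generate("t", 20, random.Random(0)))
```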

## Create training dataset

In [14]:
all_ids=id_from_char(tf.strings.unicode_split(text,input_encoding='UTF-8'))
all_ids

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([19, 48, 57, ..., 46,  9,  1], dtype=int64)>

In [15]:
## Creating tensorflow datasets from these  ids 


ids_dataset=tf.data.Dataset.from_tensor_slices(all_ids)





In [16]:
for char_id in ids_dataset.take(10):  # avoid shadowing the built-in id()
    print(char_from_id(char_id).numpy())

b'F'
b'i'
b'r'
b's'
b't'
b' '
b'C'
b'i'
b't'
b'i'


In [17]:
## Sequence length (number of characters per chunk)

sequence=100

batch_dataset=ids_dataset.batch(batch_size=sequence,drop_remainder=True) ## Groups the character IDs into chunks of `sequence` characters each
## The sequence length is the number of tokens (time steps) in each chunk

for seq in batch_dataset.take(5):
    print(seq.shape)
    print(text_from_ids(seq).numpy())

(100,)
b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
(100,)
b' are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you'
(100,)
b" know Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us"
(100,)
b" kill him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it "
(100,)
b'be done: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor'


In [18]:
## Creating input/target pairs
"""
In a text generation task, the input is input_text=sequence[:-1] and the target is target_text=sequence[1:]. This is because the model must predict the next character.
The input starts at the first character, for which the model predicts the second (next) character. Likewise, the last input character is the second-to-last character of the sequence,
and the model predicts the final character. This function therefore splits each sequence into input_text and target_text.
""";

def split_input_target(sequence):
    input_text=sequence[:-1]
    target_text=sequence[1:]
    return input_text,target_text

In [19]:
## Creating dataset 

dataset=batch_dataset.map(split_input_target)

In [20]:
for input_text,target_text in dataset.take(1):
    print(f'Input Text: {text_from_ids(input_text)}')
    print(f'Target Text: {text_from_ids(target_text)}')

Input Text: b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYo'
Target Text: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
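
The same one-step shift is easier to see on a short string. The sketch below applies the slicing logic to a plain Python list; the word "Tensorflow" and the names `toy_split`, `inp`, and `tgt` are just illustrative choices.

```python
def toy_split(sequence):
    """Input drops the last element; target drops the first."""
    return sequence[:-1], sequence[1:]

inp, tgt = toy_split(list("Tensorflow"))
print(''.join(inp))  # Tensorflo
print(''.join(tgt))  # ensorflow

# At each time step, the target is the character that follows the input character.
for step, (i, t) in enumerate(zip(inp, tgt)):
    print(f"step {step}: input {i!r} -> target {t!r}")
```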


In [21]:
## Now preparing dataset for training

batch_size=64

buffer_size=1000

dataset=dataset.shuffle(buffer_size=buffer_size).batch(batch_size=batch_size).prefetch(tf.data.AUTOTUNE)

In [22]:
for input_text,target_text in dataset.take(1):
    print(f'inputshape : {input_text.shape}')
    print(f'targetshape : {target_text.shape}')

inputshape : (64, 99)
targetshape : (64, 99)
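
These shapes, and the number of batches per epoch, follow from simple arithmetic on the sizes reported earlier. The sketch below is a plain-Python check, not TensorFlow code; the variable names are illustrative.

```python
import math

num_chars = 1115394   # length of the text reported earlier
chunk_len = 100       # characters per chunk before the input/target split
batch_size = 64

# drop_remainder=True in the first batching step discards the incomplete final chunk.
num_sequences = num_chars // chunk_len
# Dropping one character from either end leaves 99 time steps per example.
steps_per_example = chunk_len - 1
# The second batching step keeps the final partial batch, so round up.
batches_per_epoch = math.ceil(num_sequences / batch_size)

print(num_sequences)      # 11153
print(steps_per_example)  # 99
print(batches_per_epoch)  # 175
```

The result of 175 batches is consistent with the step count shown in the training progress bar.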


In [23]:
vocab_size=len(id_from_char.get_vocabulary())

print(f'vocab_size: {vocab_size}')

## The embedding dimension
embedding_dim=256

## Number of RNN units
rnn_units=1024

vocab_size: 66


In [24]:
class MyModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(rnn_units,
                                       return_sequences=True,
                                       return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        print('model inputs shape :',inputs.shape)
      
        x = self.embedding(inputs, training=training) ## x has shape (batch_size, seq_len, embedding_dim)

       
        if states is None:
            states = self.gru.get_initial_state(x)
        x, states = self.gru(x, initial_state=states, training=training)
        
        x = self.dense(x, training=training)

        if return_state:
            return x, states
        else:
            return x

# Initialize the model
model = MyModel(vocab_size=vocab_size,
                embedding_dim=embedding_dim,
                rnn_units=rnn_units)


# Test the model (Make sure to replace 'dataset' with your actual dataset)
for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch.shape)
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 99)
model inputs shape : (64, 99)
(64, 99, 66) # (batch_size, sequence_length, vocab_size)


In [26]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  16896     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  67650     
                                                                 
Total params: 4,022,850
Trainable params: 4,022,850
Non-trainable params: 0
_________________________________________________________________
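
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses two bias vectors per gate (the `reset_after=True` variant); the notebook does not state this, but the assumption reproduces the reported numbers.

```python
vocab_size, embedding_dim, rnn_units = 66, 256, 1024

# One embedding vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Three gates, each with input weights, recurrent weights, and two bias vectors
# (the two-bias layout is an assumption, see the text above).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense layer: one weight per (unit, vocabulary entry) pair plus one bias per entry.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 16896 3938304 67650 4022850
```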


In [27]:
## Let us see what the above prediction looks like 

text_from_ids(tf.argmax(example_batch_predictions,axis=2)[0].numpy())

<tf.Tensor: shape=(), dtype=string, numpy=b'O,,;;h\n\nKKFn,,;;k\nppKi\n\npLGd?;Qx\npp,KK,,d,Ap,Fddd\n\nKp,FkEKK,,PevmIRKKKKCM,KK;KKYeOxd,MAVR,e x?Vpaw[UNK]'>

## Training settings

In [28]:
loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [29]:
## Seeing the loss on the above predictions (compare against the targets, not the inputs)

loss_value=loss(target_example_batch,example_batch_predictions)
print('Loss ', loss_value.numpy())

Loss  4.189926
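
A loss near 4.19 is about what an untrained model should produce: if the logits are roughly uniform over the 66 vocabulary entries, the cross-entropy is close to ln(66). The sketch below checks this with a plain-Python version of sparse cross-entropy from logits; the function name is illustrative and the code is not the Keras implementation.

```python
import math

def sparse_cross_entropy_from_logits(logits, target_id):
    """Negative log-probability of the target under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_id]

vocab_size = 66
uniform_logits = [0.0] * vocab_size  # an untrained model is close to uniform
uniform_loss = sparse_cross_entropy_from_logits(uniform_logits, target_id=5)

print(round(uniform_loss, 4))         # 4.1897
print(round(math.log(vocab_size), 4)) # 4.1897
```

A starting loss far above this value would suggest badly scaled initial logits.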


In [30]:
model.compile('adam',loss=loss)


In [31]:
## Setting other configurations 
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
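
The `{epoch}` placeholder in the prefix is filled in with the epoch number when each checkpoint is written. The sketch below shows only the string formatting with Python's `str.format`. It assumes epochs are numbered from 1, which is my understanding of the callback's behaviour rather than something shown in this notebook.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Format the prefix for the first three epochs (numbering from 1 is an assumption).
checkpoint_names = [checkpoint_prefix.format(epoch=epoch) for epoch in range(1, 4)]
for name in checkpoint_names:
    print(name)
```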


In [32]:
epochs = 10

In [None]:
history=model.fit(dataset,epochs=epochs,callbacks=[checkpoint_callback])

Epoch 1/10
model inputs shape : (None, 99)
model inputs shape : (None, 99)
 25/175 [===>..........................] - ETA: 4:17 - loss: 3.9409

In [42]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids ## both lookups are StringLookup layers
    self.ids_from_chars = ids_from_chars

    

    # Create a mask to prevent "[UNK]" from being generated.
    # [:, None] adds an axis so the indices have shape [N, 1], as tf.SparseTensor expects for a 1-D tensor.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    print('skip ids : ',skip_ids)
    # A sparse tensor stores only the listed entries; all other positions are implicitly zero.
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    # Densify so the mask can be added directly to the logits.
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
   
    input_ids = self.ids_from_chars(input_chars).to_tensor() 

    print('input_ids.shape :',input_ids.shape) ## input_ids has shape (batch_size, seq_len)

    

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction: the logits at the final time step describe the next character.
    predicted_logits = predicted_logits[:, -1, :]
    # Temperature below 1 sharpens the distribution; above 1 makes sampling more random.
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    # Sampling, rather than taking the argmax, keeps generation from getting stuck repeating the most likely characters.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
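
The three operations applied to the logits (temperature scaling, masking, and sampling) can be sketched without TensorFlow. The function below is an illustrative plain-Python analogue for a single example, not the code the class runs; `sample_next_id` and `banned_ids` are hypothetical names, and index 0 stands in for `[UNK]`.

```python
import math
import random

def sample_next_id(logits, temperature=1.0, banned_ids=(), rng=random):
    """Sample an index from softmax(logits / temperature), never choosing banned_ids."""
    scaled = [x / temperature for x in logits]
    for i in banned_ids:
        scaled[i] = -math.inf  # exp(-inf) == 0, so this index gets zero probability
    max_logit = max(scaled)    # subtract the max for numerical stability
    weights = [math.exp(x - max_logit) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

toy_logits = [5.0, 2.0, 1.0, 0.5]  # index 0 has the largest logit but is banned
rng = random.Random(0)
samples = [sample_next_id(toy_logits, temperature=0.5, banned_ids=[0], rng=rng)
           for _ in range(1000)]
print({i: samples.count(i) for i in range(len(toy_logits))})
```

Lowering `temperature` concentrates the samples on index 1, the largest unmasked logit; raising it spreads them more evenly across the allowed indices.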

In [43]:
one_step_model = OneStep(model, char_from_id, id_from_char)

skip ids :  tf.Tensor([[0]], shape=(1, 1), dtype=int64)


In [44]:
one_step_model.ids_from_chars(['[UNK]'])[:, None]

<tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[0]], dtype=int64)>

In [45]:
import time

In [46]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

input_ids.shape : (1, None)
model inputs shape : (1, None)
input_ids.shape : (1, None)
model inputs shape : (1, None)
ROMEO:
Talk not him.

PETRUCHIO:
Why, I, as come on, sir, how have goved home arm it.

LUCENTIO:
Lucentio, and we deter.

SEBASTIAN:
How said thou art sleeping winder's cham? a crot-beck
Against the pining of ready, yield me argo
Let hast and one hurble not being eyess possible.

Bowimphy Walkings, what is my find me return,
And get on me: thou shadies, there were
Signior Brainood to her awake: though deputy,
Fall doth him and paultiners! let's home? Who's dost, It prays
Vingain your goodsless whose valuea, fair sisterous conceased
for himself as orne, under him, to a chrintation
that he'er and bite thee. Come, a very present,
My shame is not so fair; for she doth love
Froth, I'll wap add turn be inneced him as speed as
I'll know, some bride to authool, I pray.
Mastle your tale godden faith.

MIRANDA:
said, sir, bitue;
I have no hast, being over-beatted in.
I now we'll 

In [10]:
next_char = tf.constant(['ROMEO:'])
token=tf.strings.unicode_split(next_char,input_encoding='UTF-8')
ids=id_from_char(token) ##

In [12]:
ids.shape

TensorShape([1, None])

In [23]:
skip_ids=id_from_char(['[UNK]'])[:,None] ## the token is '[UNK]', including the brackets

In [48]:
ids.to_tensor()

<tf.Tensor: shape=(1, 6), dtype=int64, numpy=array([[31, 28, 26, 18, 28, 11]], dtype=int64)>

In [24]:
skip_ids=id_from_char(['[UNK]'])[:, None]

In [25]:
skip_ids

<tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[0]], dtype=int64)>

In [28]:
sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(id_from_char.get_vocabulary())])
prediction_mask = tf.sparse.to_dense(sparse_mask)

In [29]:
sparse_mask

<tensorflow.python.framework.sparse_tensor.SparseTensor at 0x203c2e16ca0>
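
The repr above does not show the values, so here is a plain-Python sketch of what the densified mask looks like and how it affects the logits. `sparse_to_dense` is an illustrative helper, not the TensorFlow function, and the vocabulary size of 5 is chosen only to keep the output short.

```python
import math

def sparse_to_dense(indices, values, size, default=0.0):
    """Expand (index, value) pairs into a dense list, filling the rest with `default`."""
    dense = [default] * size
    for (i,), v in zip(indices, values):
        dense[i] = v
    return dense

toy_skip_ids = [[0]]  # one row per masked entry, one column per dimension
mask = sparse_to_dense(toy_skip_ids, [-math.inf] * len(toy_skip_ids), size=5)

toy_logits = [2.0, 1.0, 0.5, 0.1, -1.0]
masked_logits = [l + m for l, m in zip(toy_logits, mask)]

print(mask)           # [-inf, 0.0, 0.0, 0.0, 0.0]
print(masked_logits)  # [-inf, 1.0, 0.5, 0.1, -1.0]
```

Adding the mask leaves every allowed logit unchanged and pushes the masked one to negative infinity, so it can never be sampled.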