- Game of Thrones book: https://www.kaggle.com/datasets/khulasasndh/game-of-thrones-books

In [1]:
pip install tensorflow-text

Collecting tensorflow-text
  Using cached tensorflow_text-2.10.0-cp310-cp310-win_amd64.whl (5.0 MB)
Collecting tensorflow<2.11,>=2.10.0
  Using cached tensorflow-2.10.1-cp310-cp310-win_amd64.whl (455.9 MB)
Installing collected packages: tensorflow, tensorflow-text
Successfully installed tensorflow-2.10.1 tensorflow-text-2.10.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Disable tensorflow debugging logs
import tensorflow as tf
import tensorflow_text as tf_text
from tensorflow.keras import layers

AUTOTUNE = tf.data.AUTOTUNE

- Convert the document to lowercase to reduce the vocabulary size, and get its length in characters

In [2]:
path = './001ssb.txt'
with open(path, 'rb') as f:
    book = f.read().decode(encoding='utf-8').lower()

# len() on a string counts characters, not words
print(f'Characters: {len(book)}')

Characters: 1628063


## Pipeline
- Text preprocessing

In [3]:
tokenizer = tf_text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(["A game of thrones, jon and sansa."]).to_list()
tokens[0]

[b'A', b'game', b'of', b'thrones', b',', b'jon', b'and', b'sansa', b'.']

In [4]:
book_words =  tokenizer.tokenize([book]).to_list()[0]
book_words[:10]

[b'a',
 b'game',
 b'of',
 b'thrones',
 b'book',
 b'one',
 b'of',
 b'a',
 b'song',
 b'of']

In [5]:
words_ds = tf.data.Dataset.from_tensor_slices(book_words)

In [6]:
for words in words_ds.take(20):
    print(words.numpy())


b'a'
b'game'
b'of'
b'thrones'
b'book'
b'one'
b'of'
b'a'
b'song'
b'of'
b'ice'
b'and'
b'fire'
b'by'
b'george'
b'r'
b'.'
b'r'
b'.'
b'martin'


- Group the words into fixed-length sequences and define the sequence length

In [7]:
seq_length = 50
words_batches = words_ds.batch(seq_length+1, 
                               drop_remainder=True)

for words in words_batches.take(1):
    print(words.numpy())

[b'a' b'game' b'of' b'thrones' b'book' b'one' b'of' b'a' b'song' b'of'
 b'ice' b'and' b'fire' b'by' b'george' b'r' b'.' b'r' b'.' b'martin'
 b'prologue' b'"' b'we' b'should' b'start' b'back' b',"' b'gared' b'urged'
 b'as' b'the' b'woods' b'began' b'to' b'grow' b'dark' b'around' b'them'
 b'."' b'the' b'wildlings' b'are' b'dead' b'.""' b'do' b'the' b'dead'
 b'frighten' b'you' b'?"' b'ser']
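
The chunking step can be illustrated without TensorFlow. The following is a minimal plain-Python sketch, not the `tf.data` implementation: `chunk`, `toy_tokens`, and `toy_seq_length` are hypothetical names used only for this illustration. It splits a token list into consecutive, non-overlapping chunks of `seq_length + 1` and drops the incomplete tail, which is what `drop_remainder=True` requests above.

```python
def chunk(tokens, size):
    """Split tokens into consecutive chunks of `size`, dropping the incomplete tail."""
    n_full = len(tokens) // size
    return [tokens[i * size:(i + 1) * size] for i in range(n_full)]

# Hypothetical toy data: 13 tokens, sequence length 3
toy_tokens = "a game of thrones book one of a song of ice and fire".split()
toy_seq_length = 3

# Each chunk holds seq_length + 1 tokens so that it can later be split
# into an input and a target that is shifted by one position.
chunks = chunk(toy_tokens, toy_seq_length + 1)
for c in chunks:
    print(c)
```

The extra token per chunk is what makes the later input/target shift possible: a chunk of 51 words yields 50 inputs and 50 targets.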


- Use __join__ so that each tensor in the batch becomes a single string

In [8]:
def join_strings(tokens):
    text = tf.strings.reduce_join(tokens, axis=0, separator=' ')
    return text

In [9]:
raw_train_ds = words_batches.map(join_strings)
batch_size = 32
BUFFER_SIZE = len(raw_train_ds)

raw_train_ds = (
    raw_train_ds
    .shuffle(BUFFER_SIZE)
    .batch(batch_size, drop_remainder=True)
    .prefetch(AUTOTUNE))

In [10]:
for batch in raw_train_ds.take(1):
    print(batch)

tf.Tensor(
[b'. only me ?"" you and the child ," ser jorah said , grim ." no . he cannot have my son ." she would not weep , she decided . she would not shiver with fear . the usurper has woken the dragon now , she told herself ... and'
 b"limbs from the greater , and the thickest i 9 straightest branches they could find . they laid the wood east to west , from sunrise to sunset . on the platform they piled khal drogo ' s treasures : his great tent , his painted vests , his saddles and"
 b'cut down the faces and gave them to the fire . horrorstruck , the children went to war . the old page 497 songs say that the greenseers used dark magics to make the seas rise and sweep away the land , shattering the arm , but it was too late'
 b'will want to hunt . i shall send jory south with an honor guard to meet them on the kingsroad and escort them back . gods , how are we going to feed them all ? on his way already , you said ? damn the man . damn his royal'
 b'returning , my lord ." page 164" 

- Define the vocabulary size and the __vectorize_layer__

In [11]:
voc_size = 11994

vectorize_layer = layers.TextVectorization(
    standardize=None,
    max_tokens=voc_size - 1,
    output_mode='int',
    output_sequence_length=seq_length + 1,
    #split='character'
)

vectorize_layer.adapt(raw_train_ds)
vocab = vectorize_layer.get_vocabulary()

In [12]:
len(vocab)

11993

In [13]:
vectorize_layer(['a game of tyrion', 'of thrones'])

<tf.Tensor: shape=(2, 51), dtype=int64, numpy=
array([[   8, 1115,    9,   77,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0],
       [   9, 1736,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0]], dtype=int64)>
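
In the output above, index 0 is padding and every known word maps to a positive index. A minimal plain-Python sketch of the same idea follows. It is not the Keras implementation: `build_vocab` and `vectorize` are hypothetical helpers, and the sketch assumes whitespace splitting, index 0 reserved for padding, index 1 reserved for out-of-vocabulary words, and the remaining words ordered by descending frequency.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Return a vocabulary list: padding, OOV marker, then the most frequent words."""
    counts = Counter(word for text in texts for word in text.split())
    # Two slots are reserved, so only max_tokens - 2 real words fit
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def vectorize(text, vocab, length):
    """Map words to indices, truncating or zero-padding to `length`."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()][:length]
    return ids + [0] * (length - len(ids))

# Hypothetical toy corpus
toy_texts = ["a game of thrones", "a song of ice and fire", "a game of a kind"]
toy_vocab = build_vocab(toy_texts, max_tokens=5)
print(toy_vocab)

# 'tyrion' is not in the toy vocabulary, so it maps to the OOV index 1
toy_ids = vectorize("a game of tyrion", toy_vocab, length=6)
print(toy_ids)
```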

- Tokenize the words and build the target text

In [14]:
def get_input_target(text):
    tokenized_text = vectorize_layer(text)
    input_text = tokenized_text[:, :-1]
    target_text = tokenized_text[:, 1:]
    return input_text, target_text

In [15]:
train_ds = raw_train_ds.map(get_input_target)

In [16]:
for input_batch, target_batch in train_ds.take(1):
    print(input_batch.shape, target_batch.shape)
    print(input_batch[0], target_batch[0])


(32, 50) (32, 50)
tf.Tensor(
[  7   3 206 158  17 107  90  40   2  23  20  90 819  20 145  26 170   6
  20  12 153 168   7 447   7  49   2  18 106  20  51   4 167  14  18 168
   7  91  14   2 397   4  50  28   3 206   2 121 319 207], shape=(50,), dtype=int64) tf.Tensor(
[  3 206 158  17 107  90  40   2  23  20  90 819  20 145  26 170   6  20
  12 153 168   7 447   7  49   2  18 106  20  51   4 167  14  18 168   7
  91  14   2 397   4  50  28   3 206   2 121 319 207   2], shape=(50,), dtype=int64)
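
The printed tensors show that the target is the input shifted one position to the left. A minimal sketch of that shift on a plain Python list (`split_input_target` and `toy_ids` are hypothetical names used only here):

```python
def split_input_target(ids):
    """Return (input, target) where target[i] is the token that follows input[i]."""
    return ids[:-1], ids[1:]

# Hypothetical toy sequence of token indices
toy_ids = [7, 3, 206, 158, 17]
inputs, targets = split_input_target(toy_ids)

# At every position the model sees inputs[i] and must predict targets[i]
for x, y in zip(inputs, targets):
    print(f"input {x:>3} -> target {y:>3}")
```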


## Model definition

In [17]:
emb_dim = 256
model_dim = 1024

In [31]:
class RNN(tf.keras.Model):
    def __init__(self, voc_size, emb_dim, model_dim):
        super().__init__()
        self.embedding = layers.Embedding(voc_size, emb_dim)
        self.gru = layers.GRU(model_dim,
                              return_sequences=True,
                              return_state=True)
        self.logits = layers.Dense(voc_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = inputs
        x = self.embedding(x, training=training)
        if states is None:
            states = self.gru.get_initial_state(x)
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.logits(x, training=training)

        if return_state:
            return x, states
        else:
            return x 

model = RNN(voc_size=voc_size,
            emb_dim=emb_dim,
            model_dim=model_dim)

In [32]:
class RNN(tf.keras.Model):
    def __init__(self, voc_size, emb_dim, model_dim):
        super().__init__()
        self.embedding = layers.Embedding(voc_size, emb_dim)
        self.lstm = layers.LSTM(model_dim,    # Changed to LSTM
                                return_sequences=True,
                                return_state=True)
        self.logits = layers.Dense(voc_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = inputs
        x = self.embedding(x, training=training)
        if states is None:
            states = self.lstm.get_initial_state(x)  # Changed to LSTM
        # LSTM returns the output plus two states: hidden (h) and cell (c)
        x, state_h, state_c = self.lstm(x, initial_state=states, training=training)
        x = self.logits(x, training=training)

        if return_state:
            return x, [state_h, state_c]  # Changed to LSTM: return both states
        else:
            return x 

model = RNN(voc_size=voc_size,
            emb_dim=emb_dim,
            model_dim=model_dim)

In [33]:
for input_batch, target_batch in train_ds.take(1):
    predictions = model(input_batch)
    print(predictions.shape, target_batch.shape)

(32, 50, 11994) (32, 50)


In [34]:
model.summary()

Model: "rnn_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     multiple                  3070464   
                                                                 
 lstm (LSTM)                 multiple                  5246976   
                                                                 
 dense_2 (Dense)             multiple                  12293850  
                                                                 
Total params: 20,611,290
Trainable params: 20,611,290
Non-trainable params: 0
_________________________________________________________________
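
The parameter counts in the summary can be checked by hand. The following is a sketch of the arithmetic, assuming the usual layout of these layers: an embedding table of `voc_size` rows, an LSTM with four gates that each have an input kernel, a recurrent kernel, and one bias vector, and a dense layer with a bias.

```python
voc_size, emb_dim, model_dim = 11994, 256, 1024

# Embedding: one emb_dim vector per vocabulary entry
embedding_params = voc_size * emb_dim

# LSTM: 4 gates, each with (emb_dim + model_dim) x model_dim weights plus model_dim biases
lstm_params = 4 * ((emb_dim + model_dim) * model_dim + model_dim)

# Dense: model_dim x voc_size weights plus voc_size biases
dense_params = model_dim * voc_size + voc_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the output layer, which grows linearly with the vocabulary size.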


- Model output

In [35]:
predictions[0].shape

TensorShape([50, 11994])

In [36]:
pred_indices = tf.random.categorical(predictions[0], num_samples=1)
pred_indices[:, 0]

<tf.Tensor: shape=(50,), dtype=int64, numpy=
array([ 7007,  6175,  4035,   802,  2026,  5543,  1123,   220,  6616,
        9159,  8707,  6776,  5084,  8820,  3422, 10065,  7928,  2281,
        8820, 10436,  9371,  3104,  8256,  5276,  2310,  3342,  7595,
        5472,  8790,  1453,  6960,  7379, 10023,  6271,  4266,  1767,
        8481,   100,  2712,  5494, 10403,  1259,  6663,  5364, 10417,
        3057,  8608,  9530, 10618, 11482], dtype=int64)>

- Map indices back to words with __vocab__

In [37]:
' '.join(vocab[i] for i in input_batch[0])

'the ox is cooked ." from the look of it , that might even be before the battle . he walked on . each clan had its own cookfire ; black ears did not eat with stone crows , stone crows did not eat with moon brothers , and no'

In [38]:
' '.join(vocab[i] for i in pred_indices[:, 0])

'fable squeeze forgets agreed stew flick admitted against necked original retch irritation shutting range snarled furrows tractable sore range dragonglass milled rested splinter outrider muttering warhorse willowisps hat recline memory flocks believing glacier sided toe lower shinguards ; dungeons grimace dwarfed neither meetings lighter driver suspicion sale legacy dacks 538'
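
The sampled words look random because the model has not been trained yet. As a rough reference point, and assuming an untrained model spreads probability about evenly over the vocabulary (an assumption, not something measured in this notebook), the cross-entropy of a uniform prediction is the natural logarithm of the vocabulary size:

```python
import math

voc_size = 11994

# Cross-entropy (in nats) when every word gets probability 1 / voc_size
uniform_loss = -math.log(1.0 / voc_size)
print(f"Loss of a uniform guess: {uniform_loss:.2f}")
```

A training loss below this value indicates that the model has learned something beyond guessing uniformly.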

## Generation

In [39]:
def sample(start, model, vectorize_layer, maxlen=500):
    states = None
    context = tf.constant([start])
    output = [start]
    for i in range(maxlen):
        #print(vectorize_layer(context)[:, :1])
        # Keep only the first token returned by vectorize_layer (the rest is padding)
        pred_logits, states = model(vectorize_layer(context)[:, :1], 
                                    states=states, return_state=True)
        #print(pred_logits.shape)
        pred_index = tf.random.categorical(pred_logits[:, -1, :], 
                                           num_samples=1)

        #print(vocab[pred_index[0, 0]])
        context = tf.constant([vocab[pred_index[0, 0]]])
        output.append(vocab[pred_index[0, 0]])

    return ' '.join(output)

start = 'tyrion'
#gen_text = sample(start, model, vectorize_layer)
#print(gen_text)
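
`sample` draws the next word from the distribution defined by the logits instead of always taking the most likely word. A minimal plain-Python sketch of that draw follows. `sample_from_logits` is a hypothetical helper, and the `temperature` parameter is an addition that does not appear in the notebook: values below 1 make the draw more conservative, values above 1 make it more random.

```python
import math
import random

def sample_from_logits(logits, temperature=1.0, rng=random):
    """Draw one index with probability proportional to softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability
    top = max(scaled)
    weights = [math.exp(value - top) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical toy logits for a three-word vocabulary
toy_logits = [2.0, 1.0, 0.1]
rng = random.Random(0)

draws = [sample_from_logits(toy_logits, rng=rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(toy_logits))]
print(counts)
```

With a very low temperature the draw behaves almost like `argmax`, which tends to produce repetitive text; the notebook's `sample` function corresponds to a temperature of 1.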

## Training

In [40]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.Adam(learning_rate=0.0001)
loss_metric = tf.keras.metrics.Mean(name='loss')

In [41]:
@tf.function
def train_step(input_batch, target_batch):
    with tf.GradientTape() as tape:
        logits = model(input_batch, training=True)
        loss_value = loss(target_batch, logits)

    gradients = tape.gradient(loss_value, model.trainable_weights)
    opt.apply_gradients(zip(gradients, model.trainable_weights))
    loss_metric(loss_value)

In [42]:
epochs = 50

In [43]:
for epoch in range(1, epochs + 1):
    for input_batch, target_batch in train_ds:
        train_step(input_batch, target_batch)
        
    if epoch % 5 == 0:
        gen_text = sample(start, model, vectorize_layer, 200)
        print('Output: ')
        print(gen_text)
    print(f'Epoch: {epoch} Loss: {loss_metric.result().numpy()}')
    loss_metric.reset_state()

Epoch: 1 Loss: 7.066257953643799
Epoch: 2 Loss: 6.495075225830078
Epoch: 3 Loss: 6.41425895690918
Epoch: 4 Loss: 6.28152322769165
Output: 
tyrion wedding peered , keep , you hands , longsword , the salute sisters and a dead of do rose on one ," pull ," the dothraki to firepits of you there to true . foot cheeks to ." hand were before nothing told . in it else birds age , maester ." is will there her day . the over jaime them knew down to black across with him of left sky my jon your well him shouting ." and pulled forever him that followed but as he father , her muttered ," nor heart man swayed we must not a roses . she am welled must as no shrug , at head , and his those , dany be inside , your stained , will doing against she had done to still turned forel ' seemed to meet will had spill strange back would to neck sea of pale beside two and come sleep him and landing of mounted . a no ever she commander me ." all has she center ," s sound ." if back well , bitterly ," it you made sun

In [44]:
gen_text = sample(start, model, vectorize_layer, 500)
print(gen_text)

tyrion pays viserys ." some fares bowen denying your life . your wives ." ned had not make us a start , who couldn ' t be one ."" does him think the lot ? you all everything why would have ."" his way ," she needs grimly , burning along the map , the rasp of his whiskers . when jon was the children to the others ." ned was dancing , a knight was running where verdant ." with a boy , and all pluck how to guard ," catelyn has asked ." you are soak each case of her uncle , winning over a flagon of fear ," jon said . even rickard ' s head down against the fringes and loud at winterfell , he knew in this other past lord tywin and had set off the castle or gaunt and healthy six meat on the heart . why her other make it come down before him lose . it was a third table , much longer . it fish , ser rodrik and dutiful , ser vardis wolfed underneath . the dothraki sellsword had taught him his manhood , and not too much to discuss ." tyrion ducked powder by the man emptying sell to conscious you 

- Building a vocabulary that contains every word in the dataset is costly. This forces us to cap the number of words used for training, which limits what the model can represent. For this reason, subword methods such as BPE are used in practice.
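
To make the idea behind BPE concrete, here is a minimal character-level sketch of how merges are learned. It is an illustration under simplifying assumptions (no end-of-word marker, ties broken by first occurrence), not the tokenizer used in this notebook or in any particular library. `get_pair_counts`, `merge_pair`, `learn_bpe`, and `toy_words` are hypothetical names.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(words, num_merges):
    """Start from characters and repeatedly merge the most frequent pair."""
    word_freqs = {tuple(word): count for word, count in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

# Hypothetical toy corpus
toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, segmented = learn_bpe(toy_words, num_merges=3)
print(merges)
print(segmented)
```

Frequent character sequences such as `est` end up as single symbols, so rare words can be represented as a few subword units instead of a single out-of-vocabulary token.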

## Exercise
- Increase the size of the dataset by using all the books of _A song of ice and fire_.
- Replace GRU with LSTM.
- Use a different tokenization method.