### 3.3 TensorFlow: Text Sequence Modeling

In this section, we use several high-level Python libraries to address a simple text task: given an incomplete sentence, we want to predict the next token. We start with the dataset. We use the [datasets] library from [huggingface], which provides a well-organized collection of datasets. Specifically, we use the [bookcorpus] dataset, which is a collection of books.

### 3.3.1 Download Bookcorpus

[bookcorpus]: https://huggingface.co/datasets/bookcorpus
[huggingface]:https://huggingface.co/
[datasets]:https://huggingface.co/docs/datasets/

In [1]:
import datasets
dataset = datasets.load_dataset("bookcorpus", split="train[:5%]")
print(f"# samples: {len(dataset)}")

Reusing dataset bookcorpus (/home/f14/.cache/huggingface/datasets/bookcorpus/plain_text/1.0.0/44662c4a114441c35200992bea923b170e6f13f2f0beb7c14e43759cec498700)


# samples: 3700211


### 3.3.2 Tokenizer.

One of the most important components of any text processing pipeline is the tokenizer. The tokenizer is responsible for splitting a sentence into tokens. Each token is then mapped to an integer index. For example:

$$\text{Have no fear of perfection, you'll never reach it} \overset{\text{tokenize}}{\longrightarrow} [\text{Have, no, fear, of, perfection, youll, never, reach, it}] \overset{\text{to idxs}}{\longrightarrow} [13, 1, 1521, 555, 56745, 8484, 26652, 2223, 32]$$

While this step seems rather simple, it involves a lot of steps and work. Fortunately, the [tokenizers] library provides a high-level API to train and use tokenizers.

[tokenizers]:https://huggingface.co/docs/tokenizers/python/latest/
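
To make the two arrows in the example above concrete, here is a minimal word-level sketch in plain Python. It is only an illustration: the `tokenize` and `to_ids` helpers and the tiny vocabulary are hypothetical, and the BPE tokenizer trained below learns subword units instead of whole words.

```python
import re

# Toy corpus: the sentence from the example above.
corpus = ["Have no fear of perfection, you'll never reach it"]

def tokenize(sentence):
    """Lowercase, drop apostrophes, and keep runs of letters and digits."""
    cleaned = sentence.lower().replace("'", "")
    return re.findall(r"[a-z0-9]+", cleaned)

# Build the vocabulary: the unknown token first, then words in order of appearance.
vocab = {"[UNK]": 0}
for sentence in corpus:
    for token in tokenize(sentence):
        vocab.setdefault(token, len(vocab))

def to_ids(sentence):
    """Map each token to its index, falling back to [UNK] for unseen tokens."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokenize(sentence)]

tokens = tokenize(corpus[0])
ids = to_ids(corpus[0])
unseen = to_ids("have no doubt")  # "doubt" is not in the vocabulary

print(tokens)
print(ids)
print(unseen)
```

Words that never appeared in the corpus collapse to `[UNK]`, which is one reason subword tokenizers such as BPE are usually preferred.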

In [2]:
import os
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

if not os.path.isfile("bookcorpus_tokenizer.json"):

    print("retraining tokenizer")
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[SOS]", "[EOS]", "[PAD]"])
    tokenizer.train_from_iterator(map(lambda x:x["text"], dataset),
                                  trainer,
                                  length=len(dataset))
    tokenizer.save("bookcorpus_tokenizer.json")
else:
    print("loading from file")
    tokenizer = Tokenizer.from_file("bookcorpus_tokenizer.json")
    
    

loading from file


In [3]:
text = "[SOS]" + dataset[0]["text"]
encoded = tokenizer.encode(text)
print(f"text  : {text}")
print(f"ids   : {encoded.ids}")
print(f"tokens: {encoded.tokens}")

text  : [SOS]the half-ling book one in the fall of igneeria series kaylee soderburg copyright 2013 kaylee soderburg all rights reserved .
ids   : [1, 81, 889, 22, 538, 1020, 162, 80, 81, 868, 102, 448, 147, 17458, 4588, 18582, 137, 320, 9081, 11090, 16616, 18582, 137, 320, 9081, 135, 6917, 8152, 23]
tokens: ['[SOS]', 'the', 'half', '-', 'ling', 'book', 'one', 'in', 'the', 'fall', 'of', 'ig', 'ne', 'eria', 'series', 'kaylee', 'so', 'der', 'burg', 'copyright', '2013', 'kaylee', 'so', 'der', 'burg', 'all', 'rights', 'reserved', '.']


In [17]:
dataset[:2]

{'text': ['the half-ling book one in the fall of igneeria series kaylee soderburg copyright 2013 kaylee soderburg all rights reserved .',
  'isbn : 1492913731 isbn-13 : 978-1492913733 for my family , who encouraged me to never stop fighting for my dreams chapter 1 summer vacations supposed to be fun , right ?']}

Now we have our tokenizer and can use it to process the dataset. Each sentence in our dataset becomes a tensor of indexes. These tensors are fed to the network.

In [11]:
import tensorflow as tf
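def collate(data):
    # Wrap each sentence in [SOS]/[EOS] and append 32 [PAD] tokens so that every
    # encoded sequence has at least 33 ids before truncation.
    texts = ["[SOS]" + x["text"] + "[EOS]" + "[PAD]" * 32 for x in data]
    # Keep the first 33 ids: 32 input positions plus one for the shifted target.
    batch = tf.convert_to_tensor([enc.ids[:33] for enc in tokenizer.encode_batch(texts)])
    # Targets are the inputs shifted left by one token (next-token prediction).
    return {"inputs"  : batch[:,:-1],
            "targets" : batch[:,+1:]}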

tfdataset = dataset.to_tf_dataset(columns=["text"], shuffle=False, label_cols=["targets"], batch_size=16, collate_fn=collate) \
                   .prefetch(2**10) \
                   .shuffle(2**10) \
                   .repeat()
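
The padding, truncation, and one-token shift can be easier to follow on a single short sequence. The sketch below uses plain Python lists; the helper names are hypothetical, and the special-token ids are assumed to follow the order passed to the trainer (`[UNK]`=0, `[SOS]`=1, `[EOS]`=2, `[PAD]`=3).

```python
# Assumed special-token ids, following the order given to the trainer.
SOS_ID, EOS_ID, PAD_ID = 1, 2, 3

def pad_or_truncate(ids, length, pad_id=PAD_ID):
    """Right-pad with pad_id, then cut to exactly `length` ids."""
    return (ids + [pad_id] * length)[:length]

def make_example(token_ids, length=33):
    """Return (inputs, targets), where targets are inputs shifted left by one."""
    seq = pad_or_truncate([SOS_ID] + token_ids + [EOS_ID], length)
    return seq[:-1], seq[1:]

# A three-token sentence, padded to a total length of 8 for readability.
inputs, targets = make_example([81, 889, 22], length=8)
print(inputs)
print(targets)
```

At every position, the target is the token that follows the corresponding input token, so the network is trained to predict the next token at each step.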


### 3.3.3 Embedding.

Feeding raw integers from $0$ to $vocab\_size - 1$ directly to the network does not work in practice: the ids are arbitrary labels, so their numeric order and magnitude carry no meaning. To overcome this issue, we map each token id to an embedding, a trainable vector of parameters associated with that token. During training, the embeddings learn to represent their tokens. For example, the figure below shows a possible scenario during training.

<center><img src="https://miro.medium.com/max/1400/1*xD9n3KeWXuenMNL_BpYp6A.png" alt="drawing" width="600"/></center>
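
Conceptually, an embedding layer is a lookup table with one row per token id. The following plain-Python sketch illustrates the lookup only; the sizes and the initialisation range are arbitrary assumptions, and a real layer such as `tf.keras.layers.Embedding` also updates the rows during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4  # toy sizes, chosen for illustration

# One row of parameters per token id, initialised with small random values.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Replace each token id with its row of the embedding table."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([1, 7, 1])
print(len(vectors), len(vectors[0]))
```

The same id always maps to the same row, so both occurrences of token `1` share a single vector.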


### 3.3.4 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a popular neural network architecture for processing sequences. They are fed one embedding at a time. At each step, they process the embedding, update their internal state, and output it. At the next step, they are fed another embedding and process it with the internal state already modified by the previous embeddings, and so on.

<center><img src="https://research.aimultiple.com/wp-content/uploads/2021/08/rnn-text.gif" alt="drawing" width="400"/></center>
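
The recurrence can be sketched with the simplest RNN variant, where the new state is `tanh(W_x x + W_h h + b)`. This is a minimal illustration in plain Python, not what Keras does internally; the weight names and values are assumptions picked by hand.

```python
import math

def rnn_step(x, h, W_x, W_h, b):
    """One step of a simple RNN: h_new = tanh(W_x x + W_h h + b)."""
    return [
        math.tanh(
            sum(W_x[i][j] * x[j] for j in range(len(x)))
            + sum(W_h[i][k] * h[k] for k in range(len(h)))
            + b[i]
        )
        for i in range(len(b))
    ]

def run_rnn(sequence, W_x, W_h, b):
    """Feed the embeddings one at a time, carrying the state forward."""
    h = [0.0] * len(b)  # initial state
    states = []
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
        states.append(h)
    return states

W_x = [[0.5, -0.5], [0.25, 0.75]]  # hidden_size x input_size
W_h = [[0.1, 0.0], [0.0, 0.1]]     # hidden_size x hidden_size
b = [0.0, 0.0]

# The first and third inputs are identical, but the states differ
# because the third step also sees the history of the sequence.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
states = run_rnn(sequence, W_x, W_h, b)
for state in states:
    print(state)
```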

There are many kinds of RNNs. One popular choice is the long short-term memory (LSTM) layer, a fairly complex layer involving many components. TensorFlow already provides an LSTM implementation.

<center><img src="https://miro.medium.com/max/1374/1*FCVyju8lPTvfFfxT-rzInA.png" alt="drawing" width="400"/></center>
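
For reference, the components of a standard LSTM cell can be written as follows. The notation is an assumption, since this section does not define one: $x_t$ is the input embedding, $h_t$ the output state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ the element-wise product.

$$\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

The gates decide how much of the previous cell state to keep, how much of the new candidate to write, and how much of the cell state to expose as output.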




In [12]:
tf.keras.backend.clear_session()
class MyModel(tf.keras.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        
        self.embedding = tf.keras.layers.Embedding(input_dim=tokenizer.get_vocab_size(), output_dim=512)
        self.lstm1 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        self.lstm2 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        self.lstm3 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        self.lstm4 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        self.lstm5 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        
        self.lnrm1 = tf.keras.layers.LayerNormalization()
        self.lnrm2 = tf.keras.layers.LayerNormalization()
        self.lnrm3 = tf.keras.layers.LayerNormalization()
        self.lnrm4 = tf.keras.layers.LayerNormalization()
        self.lnrm5 = tf.keras.layers.LayerNormalization()

        # Hidden projection uses ReLU; the output layer returns raw logits because
        # the loss below is built with from_logits=True.
        self.dense1 = tf.keras.layers.Dense(tokenizer.get_vocab_size()//2, activation='relu')
        self.dense2 = tf.keras.layers.Dense(tokenizer.get_vocab_size())

    def call(self, x):
        x = self.embedding(x)
        x = self.lnrm1(x + self.lstm1(x))
        x = self.lnrm2(x + self.lstm2(x))
        x = self.lnrm3(x + self.lstm3(x))
        x = self.lnrm4(x + self.lstm4(x))
        x = self.lnrm5(x + self.lstm5(x))
    
        x = self.dense1(x)
        x = self.dense2(x)
        return x

optim = tf.keras.optimizers.Adam(learning_rate=0.001)
loss  = tf.keras.losses.CategoricalCrossentropy(from_logits=True, axis=-1)
model = MyModel()

def loss_fn(batchY, batchP):
    batchP = tf.reshape(batchP[batchY != tokenizer.token_to_id("[PAD]")], (-1, tokenizer.get_vocab_size()))
    batchY = tf.one_hot(batchY[batchY != tokenizer.token_to_id("[PAD]")], depth=tokenizer.get_vocab_size())
    return loss(batchY, batchP)

def accuracy(batchY, batchP):
    # Compare predictions and targets only at non-padding positions.
    mask = batchY != tokenizer.token_to_id("[PAD]")
    predicted = tf.argmax(batchP, -1)[mask]
    expected  = tf.cast(batchY[mask], tf.int64)
    return tf.reduce_mean(tf.cast(predicted == expected, tf.float64))

model.compile(loss=loss_fn, optimizer=optim, metrics=[accuracy])
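
Both `loss_fn` and `accuracy` drop the `[PAD]` positions before averaging, so padding does not influence training or the reported metric. The plain-Python sketch below shows the same idea on a three-position example. It is an illustration under assumptions: the helper names are hypothetical, `[PAD]` is assumed to have id 3, and the functions take probability vectors rather than the raw model outputs.

```python
import math

PAD_ID = 3  # assumed id of [PAD]

def masked_cross_entropy(targets, probs, pad_id=PAD_ID):
    """Mean negative log-likelihood over non-padding positions only."""
    losses = [-math.log(p[t]) for t, p in zip(targets, probs) if t != pad_id]
    return sum(losses) / len(losses)

def masked_accuracy(targets, probs, pad_id=PAD_ID):
    """Fraction of non-padding positions where the argmax equals the target."""
    kept = [(t, p) for t, p in zip(targets, probs) if t != pad_id]
    correct = sum(1 for t, p in kept if max(range(len(p)), key=p.__getitem__) == t)
    return correct / len(kept)

targets = [0, 1, PAD_ID]
probs = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.5, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],  # padding position, ignored by both functions
]

loss_value = masked_cross_entropy(targets, probs)
acc_value = masked_accuracy(targets, probs)
print(loss_value, acc_value)
```

Without the mask, the uniform prediction at the padding position would raise the loss and lower the accuracy even though the model has nothing meaningful to predict there.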

In [13]:
model.fit(tfdataset, 
          steps_per_epoch=1000, 
          epochs=1,
          callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath="./local.tf", save_format="tf")])

2022-03-06 14:51:40.161798: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 58.59MiB (rounded to 61440000)requested by op RandomUniform
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
2022-03-06 14:51:40.161821: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] BFCAllocator dump for GPU_0_bfc
2022-03-06 14:51:40.161827: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (256): 	Total Chunks: 43, Chunks in use: 43. 10.8KiB allocated for chunks. 10.8KiB in use in bin. 209B client-requested in use in bin.
2022-03-06 14:51:40.161833: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (512): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-03-06 14:51:40.161837: I tensorflow/cor

ResourceExhaustedError: in user code:

    File "/home/f14/Devel/labs/DSE/DeepLearning/.venv-DL/lib/python3.10/site-packages/keras/engine/training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "/home/f14/Devel/labs/DSE/DeepLearning/.venv-DL/lib/python3.10/site-packages/keras/engine/training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/f14/Devel/labs/DSE/DeepLearning/.venv-DL/lib/python3.10/site-packages/keras/engine/training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "/home/f14/Devel/labs/DSE/DeepLearning/.venv-DL/lib/python3.10/site-packages/keras/engine/training.py", line 859, in train_step
        y_pred = self(x, training=True)
    File "/home/f14/Devel/labs/DSE/DeepLearning/.venv-DL/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None

    ResourceExhaustedError: Exception encountered when calling layer "my_model" (type MyModel).
    
    in user code:
    
        File "/tmp/ipykernel_10687/2664222872.py", line 23, in call  *
            x = self.embedding(x)
        File "/home/f14/Devel/labs/DSE/DeepLearning/.venv-DL/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler  **
            raise e.with_traceback(filtered_tb) from None
        File "/home/f14/Devel/labs/DSE/DeepLearning/.venv-DL/lib/python3.10/site-packages/keras/backend.py", line 1920, in random_uniform
            return tf.random.uniform(
    
        ResourceExhaustedError: OOM when allocating tensor with shape[30000,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
    
    
    Call arguments received:
      • x=tf.Tensor(shape=(None, None), dtype=int64)



In [5]:
import tensorflow as tf

model = tf.keras.models.load_model("model.tf", custom_objects={"loss_fn":loss_fn})

In [24]:

x = tf.convert_to_tensor(tokenizer.encode("[SOS] i was the, i was").ids, dtype=tf.int64)
print(x.numpy())
x = tf.expand_dims(x, 0)
y = tf.argmax(model.predict(x),-1)[0]
print(y.numpy())
print(list(map(tokenizer.id_to_token,list(y.numpy()))))

[  1  56 113  81  21  56 113]
[ 56 113  81  21  56 113  81]
['i', 'was', 'the', ',', 'i', 'was', 'the']
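
The model returns one prediction per input position, and the last one is the predicted next token. Appending it to the input and repeating gives greedy generation. The sketch below is a hedged illustration in plain Python: `greedy_generate` and the `toy_scores` stand-in are hypothetical, and with the Keras model `next_token_scores` would have to wrap `model.predict` and return the scores at the last position.

```python
def greedy_generate(next_token_scores, prompt_ids, eos_id, max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

def toy_scores(ids, vocab_size=6):
    """Stand-in for a trained model: always prefers the id after the last one."""
    scores = [0.0] * vocab_size
    scores[(ids[-1] + 1) % vocab_size] = 1.0
    return scores

generated = greedy_generate(toy_scores, prompt_ids=[1], eos_id=4)
truncated = greedy_generate(toy_scores, prompt_ids=[1], eos_id=4, max_new_tokens=2)
print(generated)
print(truncated)
```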

