### 3.3 TensorFlow: Text Sequence Modeling

In this section, we use several high-level Python libraries to tackle a simple text task: given an incomplete sentence, predict the next token. We start with the data. We use the [datasets] library from [huggingface], which gives access to a well-organized collection of datasets. Here we use [bookcorpus], a collection of text from several books.

### 3.3.1 Download BookCorpus

[bookcorpus]: https://huggingface.co/datasets/bookcorpus
[huggingface]:https://huggingface.co/
[datasets]:https://huggingface.co/docs/datasets/

In [None]:
import datasets
dataset = datasets.load_dataset("bookcorpus", split="train[:5%]")
print(f"# samples: {len(dataset)}")

In [None]:
dataset[:10]

In [None]:
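def str2ascii(string):
    # Lowercase the text and keep only ASCII characters, mapped to their code points (0-127).
    return [ord(char) for char in string.lower() if char.isascii()]

# Special tokens get ids above the character range so they never collide with real characters.
tkn2id = {"PAD":256, "SOS":257, "EOS":258}

def ascii2str(lst):
    # Characters are decoded with chr(); special tokens print as their offset from 256 (0=PAD, 1=SOS, 2=EOS).
    return "".join([chr(char) if char<256 else str(char-256) for char in lst])

print("str2ascii(\"hello my friend\") ->", str2ascii("hello my friend"))
print("ascii2str(str2ascii(\"hello my friend\")) ->", ascii2str(str2ascii("hello my friend")))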

In [None]:

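import tensorflow as tf

def collate(data):
    # Wrap each sentence in SOS/EOS, then pad or truncate it to exactly 64 token ids.
    batch = tf.convert_to_tensor([([tkn2id["SOS"]] + str2ascii(x["text"]) + [tkn2id["EOS"]] + [tkn2id["PAD"]]*64)[:64] for x in data])
    # Inputs drop the last token and targets drop the first,
    # so targets[t] is the token that follows text[t].
    return {"text"    : batch[:,:-1],
            "targets" : batch[:,1:]}

# Shuffle the batches, repeat indefinitely, and prefetch last so that data loading overlaps with training.
tfdataset = dataset.to_tf_dataset(columns="text", shuffle=False, label_cols="targets", batch_size=16, collate_fn=collate) \
                   .shuffle(10000) \
                   .repeat() \
                   .prefetch(tf.data.AUTOTUNE)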
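
To make the input/target shift concrete, here is a minimal pure-Python sketch of what `collate` does to a single sentence. The helper `encode` and the short length of 8 are illustrative assumptions (the cell above uses 64), chosen only to keep the output readable.

```python
# Special-token ids, mirroring the tkn2id table used in this section.
PAD, SOS, EOS = 256, 257, 258

def encode(text, length=8):
    """Wrap text in SOS/EOS, then pad or truncate to exactly `length` ids."""
    ids = [SOS] + [ord(c) for c in text.lower() if c.isascii()] + [EOS]
    return (ids + [PAD] * length)[:length]

ids = encode("hi!")
inputs, targets = ids[:-1], ids[1:]   # targets are the inputs shifted left by one

# Each row reads: "after seeing this token, the model should predict that one".
for i, t in zip(inputs, targets):
    print(f"{i:>3} -> {t:>3}")
```

The first row pairs `SOS` (257) with `h` (104), so even the very first character of a sentence has a context to be predicted from.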


### 3.3.3 Embeddings

Feeding raw integer ids from $0$ to $vocab\_size - 1$ directly into a network does not work well: the numeric value of an id is arbitrary, yet the network would treat numerically close ids as similar. To overcome this issue, we map each token id to an embedding, a trainable vector of parameters associated with that token. During training, the embeddings learn to represent their tokens. For example, the figure below shows a possible scenario during training.

<center><img src="https://miro.medium.com/max/1400/1*xD9n3KeWXuenMNL_BpYp6A.png" alt="drawing" width="600"/></center>
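
As a rough illustration, an embedding layer can be thought of as a table with one row per token id, where looking up a token simply selects its row. The NumPy sketch below uses assumed, illustrative sizes and random values in place of learned parameters; it is not how `tf.keras.layers.Embedding` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 259, 4   # illustrative sizes (assumed), not the model's

# One row per token id; random here, learned by gradient descent in a real model.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([257, 104, 105, 258])   # SOS, 'h', 'i', EOS
vectors = embedding_table[token_ids]         # row lookup, shape (4, embed_dim)

# The lookup is equivalent to multiplying one-hot vectors by the table.
one_hot = np.eye(vocab_size)[token_ids]
assert np.allclose(one_hot @ embedding_table, vectors)

print(vectors.shape)
```

Because the lookup is equivalent to a matrix product with one-hot vectors, gradients flow only into the rows of the tokens that actually appear in a batch.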


### 3.3.4 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a popular family of neural network architectures for processing sequences. At each step, an RNN receives one embedding, combines it with its internal state, and outputs the updated state. At the next step, it receives the next embedding and processes it using the state already shaped by the previous embeddings, and so on.

<center><img src="https://research.aimultiple.com/wp-content/uploads/2021/08/rnn-text.gif" alt="drawing" width="400"/></center>
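
The recurrence described above can be sketched in a few lines of NumPy. This is a plain (Elman-style) RNN cell with a `tanh` nonlinearity, used here only for illustration; the sizes, the weight names `W_x`, `W_h`, `b`, and the random inputs are all assumptions, and this is not the layer used later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 4, 3, 5   # illustrative sizes (assumed)

# Parameters of the cell, shared across all time steps.
W_x = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

embeddings = rng.normal(size=(seq_len, embed_dim))  # one vector per token
h = np.zeros(hidden_dim)                            # initial state

states = []
for x_t in embeddings:
    # The new state mixes the current embedding with the previous state.
    h = np.tanh(x_t @ W_x + h @ W_h + b)
    states.append(h)

states = np.stack(states)
print(states.shape)   # one state per time step
```

Returning the state at every step, rather than only the last one, corresponds to what `return_sequences=True` requests from a Keras recurrent layer.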

There are many kinds of RNNs. One popular choice is the long short-term memory (LSTM) layer. An LSTM is a fairly complex layer with many components, but TensorFlow already provides an implementation, `tf.keras.layers.LSTM`.

<center><img src="https://miro.medium.com/max/1374/1*FCVyju8lPTvfFfxT-rzInA.png" alt="drawing" width="400"/></center>
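
For reference, one common textbook formulation of the LSTM update at step $t$ is written out below. The notation is introduced here and is not taken from the figure: $x_t$ is the input embedding, $h_t$ the output state, $c_t$ the cell state, $W$, $U$, and $b$ are trainable weights and biases, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. The Keras layer may organize its parameters differently.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The gates take values between $0$ and $1$, so they decide how much of the old cell state to keep, how much new information to write, and how much of the cell state to expose as output.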




In [None]:
tf.keras.backend.clear_session()

# Token ids range from 0 to 258: characters (below 256) plus PAD, SOS, and EOS (256-258).
VOCAB_SIZE = 259

class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()

        self.embedding = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=512)
        self.lstm1 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        self.lstm2 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        self.lstm3 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        self.lstm4 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)
        self.lstm5 = tf.keras.layers.LSTM(512, dropout=0.1, return_sequences=True)

        self.lnrm1 = tf.keras.layers.LayerNormalization()
        self.lnrm2 = tf.keras.layers.LayerNormalization()
        self.lnrm3 = tf.keras.layers.LayerNormalization()
        self.lnrm4 = tf.keras.layers.LayerNormalization()
        self.lnrm5 = tf.keras.layers.LayerNormalization()

        self.dense = tf.keras.layers.Dense(VOCAB_SIZE)

    def call(self, x):
        x = self.embedding(x)
        # Each block adds the LSTM output to its input (residual connection), then normalizes.
        x = self.lnrm1(x + self.lstm1(x))
        x = self.lnrm2(x + self.lstm2(x))
        x = self.lnrm3(x + self.lstm3(x))
        x = self.lnrm4(x + self.lstm4(x))
        x = self.lnrm5(x + self.lstm5(x))

        # Project every time step to one logit per token id.
        x = self.dense(x)
        return x

optim = tf.keras.optimizers.Adam(learning_rate=0.001)
loss  = tf.keras.losses.CategoricalCrossentropy(from_logits=True, axis=-1)
model = MyModel()

def loss_fn(batchY, batchP):
    # Drop padded positions so they do not contribute to the loss.
    mask = batchY != tkn2id["PAD"]
    batchP = tf.reshape(batchP[mask], (-1, VOCAB_SIZE))
    batchY = tf.one_hot(tf.cast(batchY[mask], dtype=tf.int32), depth=VOCAB_SIZE)
    return loss(batchY, batchP)

def accuracy(batchY, batchP):
    # Fraction of non-padded positions where the most likely token matches the target.
    mask = batchY != tkn2id["PAD"]
    return tf.reduce_mean(tf.cast(tf.argmax(batchP,-1)[mask] == tf.cast(batchY[mask],tf.int64), tf.float64))

model.compile(loss=loss_fn, optimizer=optim, metrics=[accuracy])
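
The idea behind the masked loss can be checked on a tiny example without TensorFlow. The NumPy sketch below is an assumed re-implementation for illustration only (the function name `masked_cross_entropy` and the vocabulary size of 259 are choices made here): it averages the cross-entropy over the positions whose target is not `PAD`.

```python
import numpy as np

PAD = 256          # padding id, as in tkn2id
vocab_size = 259   # ids 0-258, including PAD, SOS, and EOS (assumed)

def masked_cross_entropy(targets, logits):
    """Mean cross-entropy over the positions whose target is not PAD."""
    keep = targets != PAD
    logits, targets = logits[keep], targets[keep]
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each true token.
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.array([[104, 105, 258, PAD, PAD]])
logits = np.zeros((1, 5, vocab_size))   # uniform predictions at every position

# With uniform predictions the loss equals log(vocab_size).
print(masked_cross_entropy(targets, logits), np.log(vocab_size))
```

Changing the logits at the two padded positions leaves the result unchanged, which is exactly why padding is masked out before averaging.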

In [None]:
model.fit(tfdataset, 
          steps_per_epoch=100, 
          verbose=True,
          epochs=1)

In [None]:
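# Greedy decoding: start from a prompt and repeatedly append the most likely next token.
x = [tkn2id["SOS"]] + str2ascii("only because")
for i in range(40):
    x1 = tf.convert_to_tensor(x, dtype=tf.int64)
    x2 = tf.expand_dims(x1,0)            # add a batch dimension: shape (1, len(x))
    p1 = model.predict(x2)               # logits for every position: shape (1, len(x), vocab)
    p2 = tf.argmax(p1,-1)[0]             # most likely token id at each position
    x.append(int(p2[-1].numpy()))        # keep only the prediction for the last position

print(ascii2str(x))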
