## AI for Medicine Course 3 Week 2 lecture exercises - Preparing Input for Text Classification

In this lecture notebook we'll be working with input for text classification models. We'll simulate [BERT's](https://github.com/google-research/bert) tokenizer for a simple example. You will use it for real in the assignment.

In [22]:
import tensorflow as tf

Let's assume the following situation: we have a passage containing medical information about a patient, and we would like our model to answer questions using information from this passage. First we need to reformulate the question and the passage in a form that BERT can read correctly. Let's define a question and a passage:

In [8]:
q = "How old is the patient?"
p = '''
The patient is a 64 year old male named Bob. 
He has no history of chronic spine conditions but is 
showing mild degenerative changes in the lumbar spine and old right rib fractures.
'''

With this information, we would normally use BERT's tokenizer to tokenize these sentences like this:
```python
tokenizer.tokenize(q)
```

Luckily this has been taken care of for you:

In [None]:
q_tokens = ['How', 'old', 'is', 'the', 'patient', '?']
p_tokens = ['The', 'patient', 'is', 'a', '64', 'year', 'old', 'male', 'named', 'Bob', '.', 
            'He', 'has', 'no', 'history', 'of', 'chronic', 'spine', 'conditions', 'but', 'is', 
            'showing', 'mild', 'de', '##gene', '##rative', 'changes', 'in', 'the', 'l', '##umba', 
            '##r', 'spine', 'and', 'old', 'right', 'rib', 'fracture', '##s', '.']
classification_token = '[CLS]'
separator_token = '[SEP]'
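
The pieces starting with ## in p_tokens (for example 'de', '##gene', '##rative') come from WordPiece splitting, where a word missing from the vocabulary is broken into known sub-words. The block below is a minimal sketch of the greedy longest-match-first idea for a single word. The function name wordpiece_sketch and the tiny toy_vocab are hypothetical and only for illustration. The real tokenizer uses a vocabulary of tens of thousands of entries and also splits punctuation before this step.

```python
def wordpiece_sketch(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the examples below.
toy_vocab = {"de", "##gene", "##rative", "l", "##umba", "##r", "fracture", "##s", "spine"}

print(wordpiece_sketch("degenerative", toy_vocab))  # ['de', '##gene', '##rative']
print(wordpiece_sketch("lumbar", toy_vocab))        # ['l', '##umba', '##r']
print(wordpiece_sketch("fractures", toy_vocab))     # ['fracture', '##s']
```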

The classification and separator tokens are also provided. These tokens can be accessed using the tokenizer as well:
```python
CLS = tokenizer.cls_token
SEP = tokenizer.sep_token
```
These tokens are really important because we'll need to combine the question and passage tokens into a single list of tokens, and these special tokens allow BERT to tell which part is which.

The CLS (classification) token should come first, and the SEP token should be used as a separator between the question and the passage:

In [70]:
tokens = []
tokens.append(classification_token)
tokens.extend(q_tokens)
tokens.append(separator_token)
tokens.extend(p_tokens)
print(f"The token list looks like this: \n\n{tokens}")

The token list looks like this: 

['[CLS]', 'How', 'old', 'is', 'the', 'patient', '?', '[SEP]', 'The', 'patient', 'is', 'a', '64', 'year', 'old', 'male', 'named', 'Bob', '.', 'He', 'has', 'no', 'history', 'of', 'chronic', 'spine', 'conditions', 'but', 'is', 'showing', 'mild', 'de', '##gene', '##rative', 'changes', 'in', 'the', 'l', '##umba', '##r', 'spine', 'and', 'old', 'right', 'rib', 'fracture', '##s', '.']


We now have the complete token list. However, we still need to convert these tokens into their numeric representations. Commonly this is done like this:
```python
tokenizer.convert_tokens_to_ids(tokens)
```
This is also provided for you:

In [71]:
token_ids = [101, 1731, 1385, 1110, 1103, 5351, 136, 102, 1109, 5351, 1110, 170, 3324, 
 1214, 1385, 2581, 1417, 3162, 119, 1124, 1144, 1185, 1607, 1104, 13306, 8340, 
 2975, 1133, 1110, 4000, 10496, 1260, 27054, 15306, 2607, 1107, 1103, 181, 25509, 
 1197, 8340, 1105, 1385, 1268, 23298, 22869, 1116, 119]
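
Conceptually, this conversion is a vocabulary lookup: each token string maps to one integer id. The sketch below reproduces the ids of the question part using a toy dictionary built from the token and id lists shown above. The function name convert_tokens_to_ids_sketch is hypothetical, and the fallback id 100 for '[UNK]' is an assumption made for illustration. The real tokenizer loads its full vocabulary from a file.

```python
# Toy vocabulary: pairs taken from the tokens and ids listed in this notebook.
# The "[UNK]" entry (id 100) is an assumption used as a fallback here.
toy_vocab = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "How": 1731, "old": 1385, "is": 1110,
    "the": 1103, "patient": 5351, "?": 136,
}


def convert_tokens_to_ids_sketch(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its id, falling back to the unknown-token id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


question = ["[CLS]", "How", "old", "is", "the", "patient", "?", "[SEP]"]
question_ids = convert_tokens_to_ids_sketch(question, toy_vocab)
print(question_ids)  # [101, 1731, 1385, 1110, 1103, 5351, 136, 102]
```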

This is great, except that the length of the list of token ids depends on the number of tokens in the question and the passage, and BERT only accepts fixed-size input.

For this we use **padding**, which means filling out the rest of this list with an empty value until it reaches a maximum length. In this case we'll use "0" as our empty value, 60 as the maximum length and we'll leverage the pad_sequences() function from Keras' Sequence module:

In [74]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 60

token_ids = pad_sequences([token_ids], padding="post", maxlen=max_length)
token_ids

array([[  101,  1731,  1385,  1110,  1103,  5351,   136,   102,  1109,
         5351,  1110,   170,  3324,  1214,  1385,  2581,  1417,  3162,
          119,  1124,  1144,  1185,  1607,  1104, 13306,  8340,  2975,
         1133,  1110,  4000, 10496,  1260, 27054, 15306,  2607,  1107,
         1103,   181, 25509,  1197,  8340,  1105,  1385,  1268, 23298,
        22869,  1116,   119,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0]], dtype=int32)

Looks like the padding has been done correctly. Commonly this list of token ids is expected to be a Tensor. Luckily we can convert it easily using the convert_to_tensor() function from TensorFlow:

In [75]:
token_ids = tf.convert_to_tensor(token_ids)
token_ids

<tf.Tensor: id=185, shape=(1, 60), dtype=int32, numpy=
array([[  101,  1731,  1385,  1110,  1103,  5351,   136,   102,  1109,
         5351,  1110,   170,  3324,  1214,  1385,  2581,  1417,  3162,
          119,  1124,  1144,  1185,  1607,  1104, 13306,  8340,  2975,
         1133,  1110,  4000, 10496,  1260, 27054, 15306,  2607,  1107,
         1103,   181, 25509,  1197,  8340,  1105,  1385,  1268, 23298,
        22869,  1116,   119,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0]], dtype=int32)>

We are almost done. BERT still needs an input mask as one of its inputs. An input mask is just a list of the same length as the token ids list, indicating whether each position contains a real token or an empty value created by padding.

We'll showcase how to do this using Keras' Masking layer, but it can be achieved in a simpler way (for this case) with some plain Python. 

If you're interested in learning some of the details of padding and masking you should check out [this](https://www.tensorflow.org/guide/keras/masking_and_padding).

In [85]:
from tensorflow.keras import layers

masking_layer = layers.Masking()

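# Add a trailing feature axis, (1, 60) -> (1, 60, 1), and cast to float.
# The Masking layer masks a position when all of its features equal mask_value (0.0 by default).
unmasked = tf.cast(tf.expand_dims(token_ids, axis=-1), tf.float32)

masked = masking_layer(unmasked)
# _keras_mask is a private attribute, so it may change between versions.
token_mask = masked._keras_mask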
token_mask

<tf.Tensor: id=749, shape=(1, 60), dtype=bool, numpy=
array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True, False, False, False, False, False, False,
        False, False, False, False, False, False]])>

Looks like the token mask outputs True for tokens and False for padding.
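
To make the rule behind this output explicit, the block below is a plain Python sketch of what the Masking layer computes: a position is kept (True) when at least one of its features differs from the mask value, and masked (False) when all of them equal it. The function name compute_mask_sketch and the shortened id list are hypothetical. This is a re-implementation for illustration, not Keras code.

```python
def compute_mask_sketch(batch, mask_value=0.0):
    """batch is a nested list shaped (batch, positions, features)."""
    return [
        [any(feature != mask_value for feature in position) for position in sequence]
        for sequence in batch
    ]


# Shortened example: four real token ids followed by two padding zeros.
short_ids = [101, 1731, 136, 102, 0, 0]

# Same preparation as above: one feature per position, shape (1, 6, 1).
batch = [[[float(token_id)] for token_id in short_ids]]

mask = compute_mask_sketch(batch)
print(mask)  # [[True, True, True, True, False, False]]
```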

Now we have successfully created and formatted the inputs necessary to use the BERT model.

In the graded assignment there will be a task very similar to this one, **but with some changes**. It is recommended to create the token ids and the token mask (input_ids and input_mask in the assignment) at the same time using plain Python lists. The reason for this is that the token mask will need to have a different structure than the one the Masking layer produces and also its type should be different.
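
As a rough sketch of that recommendation, the block below pads a list of ids and builds the matching mask in one pass using plain Python lists. The function name pad_and_mask, the use of 1 and 0 as mask values, and truncating from the end are all assumptions made for illustration, since the exact structure the assignment expects is not described here. Note that pad_sequences() truncates from the front by default, which is a different choice.

```python
def pad_and_mask(ids, max_length, pad_id=0):
    """Return (input_ids, input_mask), both plain lists of length max_length."""
    ids = list(ids[:max_length])     # assumption: cut from the end, keeping [CLS]
    n_pad = max_length - len(ids)    # number of padding positions needed
    input_ids = ids + [pad_id] * n_pad
    input_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, input_mask


# Shortened example with four token ids and a maximum length of 6.
input_ids, input_mask = pad_and_mask([101, 1731, 136, 102], max_length=6)
print(input_ids)   # [101, 1731, 136, 102, 0, 0]
print(input_mask)  # [1, 1, 1, 1, 0, 0]
```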

Before ending, let's convert an already padded list of token ids to the same type and shape as the tensor we just created:

In [94]:
padded_token_ids = [  101,  1731,  1385,  1110,  1103,  5351,   136,   102,  1109,
                     5351,  1110,   170,  3324,  1214,  1385,  2581,  1417,  3162,
                      119,  1124,  1144,  1185,  1607,  1104, 13306,  8340,  2975,
                     1133,  1110,  4000, 10496,  1260, 27054, 15306,  2607,  1107,
                     1103,   181, 25509,  1197,  8340,  1105,  1385,  1268, 23298,
                    22869,  1116,   119,     0,     0,     0,     0,     0,     0,
                    0,     0,     0,     0,     0,     0]

First let's convert the list into a tensor:

In [95]:
padded_token_ids = tf.convert_to_tensor(padded_token_ids)
padded_token_ids

<tf.Tensor: id=755, shape=(60,), dtype=int32, numpy=
array([  101,  1731,  1385,  1110,  1103,  5351,   136,   102,  1109,
        5351,  1110,   170,  3324,  1214,  1385,  2581,  1417,  3162,
         119,  1124,  1144,  1185,  1607,  1104, 13306,  8340,  2975,
        1133,  1110,  4000, 10496,  1260, 27054, 15306,  2607,  1107,
        1103,   181, 25509,  1197,  8340,  1105,  1385,  1268, 23298,
       22869,  1116,   119,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0], dtype=int32)>

However, the shape of this tensor does not match the desired one. We can easily check this by doing the following:

In [96]:
padded_token_ids.shape == token_ids.shape

False

Using the expand_dims() function from TensorFlow we can add the missing leading dimension and get the desired shape, like this:

In [97]:
padded_token_ids = tf.expand_dims(padded_token_ids, 0)
padded_token_ids

<tf.Tensor: id=757, shape=(1, 60), dtype=int32, numpy=
array([[  101,  1731,  1385,  1110,  1103,  5351,   136,   102,  1109,
         5351,  1110,   170,  3324,  1214,  1385,  2581,  1417,  3162,
          119,  1124,  1144,  1185,  1607,  1104, 13306,  8340,  2975,
         1133,  1110,  4000, 10496,  1260, 27054, 15306,  2607,  1107,
         1103,   181, 25509,  1197,  8340,  1105,  1385,  1268, 23298,
        22869,  1116,   119,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0]], dtype=int32)>

In [98]:
padded_token_ids.shape == token_ids.shape

True
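
The difference between the two shapes can also be seen with plain nested lists: a flat list corresponds to shape (60,), while wrapping it in an outer list, as was done with [token_ids] in the pad_sequences() call, adds a leading batch dimension of size 1. The helper nested_shape below is hypothetical and only for illustration, and the example uses a shortened list.

```python
def nested_shape(nested):
    """Shape of a regular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


flat = [101, 1731, 136, 102, 0, 0]  # one sequence, no batch dimension
batched = [flat]                    # plain-list analogue of expand_dims(..., 0)

print(nested_shape(flat))     # (6,)
print(nested_shape(batched))  # (1, 6)
```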

Now we are truly done! **Congratulations on finishing this lecture notebook!!!** Now you should be more familiar with preparing input for BERT. Good job!