**This notebook is an exercise in the [Intro to neural language models](https://www.kaggle.com/learn/intro-to-deep-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/datasniffer/intro-neural-language-models).**



---

<blockquote style="margin-right:auto; margin-left:auto; background-color: #ebf9ff; padding: 1em; margin:24px;">
    <strong>Accelerate Training with a Kaggle GPU!</strong><br>
Did you know Kaggle offers free time with a GPU accelerator? You can speed up training neural networks in this course by switching to <strong>GPU</strong> in the <em>Accelerator</em> option on the right. (It may already be turned on.) Two things to be aware of:
<ul>
<li>Changing the <em>Accelerator</em> option will cause the notebook session to restart. You'll need to rerun any setup code.
<li>You can have only one GPU session at a time, so be sure to shut the notebook down after you've finished the exercise.
</ul>
</blockquote>


# Training a pre-trained BERT model to recognize equivalent sentence semantics
    
In the tutorial you saw that BERT has been used as a basis for various tasks, including sentiment analysis and question answering, and that another transformer-based model (GPT-2) can be used for text generation.

<img src="https://i.imgur.com/HwfCw8S.png" width=250 style="float:right;box-shadow:3px 3px 3px 3px gray" />

In this exercise you will use BERT as a basis for training a model that can recognize whether two sentences are semantically equivalent. For example, the two sentences



> 1. _But to see those pages, users would be required by Amazon to register, and Amazon plans to limit the amount of any single book a customer can view._
> 2. _But to see those pages Amazon would require users to register, and it plans to limit the amount of any single book a browser can view._

convey the same information, whereas the sentences

> 1. _Dr. William Winkenwerder, assistant secretary of Defense for health affairs, said the vaccine poses little danger._
> 2. _"We stand behind this program," said Dr. William Winkenwerder, assistant secretary of defense for health affairs._

convey different information. Note that they are not conflicting—they just don't say the same thing.

This exercise is based on an [example from the HuggingFace documentation](https://huggingface.co/course/chapter3/3?fw=pt).

Before you start, make sure that you can use the GPU accelerator!

To get started, run the code cell below to set everything up.

In [1]:
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools_nlp_utility import *

print('Setup Complete')

Setup Complete


# Loading the data

We will use a data set that is often used in machine learning research on natural language understanding: the Microsoft Research Paraphrase Corpus (MRPC), which is part of the GLUE benchmark and available as [GLUE/MRPC](https://www.tensorflow.org/datasets/catalog/glue#gluemrpc).

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

The complete dataset comes in a special kind of dictionary (`DatasetDict`) and consists of a _training_, a _validation_, and a _test_ set. Let's have a look at the training set:

In [3]:
# Use pandas to get a better look at the dataset:
import pandas as pd

pd.DataFrame(raw_datasets['train'])

| | sentence1 | sentence2 | label | idx |
|---|---|---|---|---|
| 0 | Amrozi accused his brother , whom he called " ... | Referring to him as only " the witness " , Amr... | 1 | 0 |
| 1 | Yucaipa owned Dominick 's before selling the c... | Yucaipa bought Dominick 's in 1995 for \$ 693 m... | 0 | 1 |
| 2 | They had published an advertisement on the Int... | On June 10 , the ship 's owners had published ... | 1 | 2 |
| 3 | Around 0335 GMT , Tab shares were up 19 cents ... | Tab shares jumped 20 cents , or 4.6 % , to set... | 0 | 3 |
| 4 | The stock rose \$ 2.11 , or about 11 percent , ... | PG & E Corp. shares jumped \$ 1.63 or 8 percent... | 1 | 4 |
| ... | ... | ... | ... | ... |
| 3663 | " At this point , Mr. Brando announced : ' Som... | Brando said that " somebody ought to put a bul... | 1 | 4071 |
| 3664 | Martin , 58 , will be freed today after servin... | Martin served two thirds of a five-year senten... | 0 | 4072 |
| 3665 | " We have concluded that the outlook for price... | In a statement , the ECB said the outlook for ... | 1 | 4073 |
| 3666 | The notification was first reported Friday by ... | MSNBC.com first reported the CIA request on Fr... | 1 | 4074 |


# 1) Tokenizing and encoding the examples

As you can see from the display of the data set above, the data consists of pairs of strings (`sentence1` and `sentence2`), and a `label` that indicates whether the sentences are semantically equivalent.

To be able to use BERT, we need to tokenize and encode the strings. Remember from the tutorial that encoding means representing each token by an integer. That integer is simply the index of the token's column in the matrix of embeddings that BERT has learned. So, for instance, the sentence

> _"the cat sat on the mat"_

might be encoded as

```python
[1, 10301, 45, 10, 1, 22387]
```

and if `X` is a matrix whose columns are word embeddings, the sentence would be represented by the matrix 

```python
X[:,[1, 10301, 45, 10, 1, 22387]]
```
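To make the lookup concrete, here is a minimal sketch in plain Python. The vocabulary, the token indices, and the embedding values are all made up for illustration; they are not BERT's real ones.

```python
# Toy vocabulary: maps each word to the index of its column in X.
# (Made-up indices, not BERT's real token IDs.)
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# Toy embedding matrix with 2 rows (embedding dimensions) and
# 5 columns (one column per token in the vocabulary).
X = [
    [0.1, 0.7, -0.3, 0.0, 0.5],
    [0.9, -0.2, 0.4, 0.8, -0.6],
]


def encode(sentence):
    """Represent each word of the sentence by its integer index."""
    return [toy_vocab[word] for word in sentence.lower().split()]


def embed(token_ids):
    """Select the columns of X listed in token_ids (X[:, token_ids] in NumPy)."""
    return [[row[i] for i in token_ids] for row in X]


token_ids = encode("the cat sat on the mat")
sentence_matrix = embed(token_ids)

print(token_ids)           # one index per word
print(sentence_matrix[0])  # first embedding dimension of every token
```

Note that both occurrences of "the" get the same index and therefore the same column of `X`.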

However, BERT represents a sentence a little bit differently than 'simply a sequence of word embeddings':

1. BERT doesn't use words as tokens, but word pieces and punctuation marks.
2. BERT processes the initial token embeddings in a way that also encodes the context of the other tokens in the sentence.
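
The first point can be illustrated with a greedy longest-match-first split, which is the idea behind WordPiece-style tokenizers: a word that is not in the vocabulary is broken into the longest pieces that are. The sketch below is simplified and uses a made-up vocabulary; the real BERT tokenizer also lowercases, splits on punctuation, and has a vocabulary of roughly 30,000 pieces.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Pieces that continue a word are marked with a leading '##'.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"the", "cat", "sat", "on", "mat",
             "play", "##ing", "##ed", "un", "##believ", "##able"}

print(wordpiece("cat", toy_vocab))
print(wordpiece("playing", toy_vocab))
print(wordpiece("unbelievable", toy_vocab))
print(wordpiece("dog", toy_vocab))
```

A familiar word stays whole, an unfamiliar word is assembled from known parts, and only a word with no matching pieces at all falls back to the unknown token.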

The HuggingFace `transformers` library comes with an `AutoTokenizer` class that makes this easy. Because each language model may use different (types of) tokens and token encodings, we need to specify which pre-trained set of BERT model weights we are using:

In [4]:
from transformers import AutoTokenizer

# A specific set of pretrained weights for a model is called a 'checkpoint' in the HuggingFace library
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

The `tokenizer` has a method `encode` that can be called with a string as argument and returns the encoded representation of the string. Run this method on the sentence

> _"The cat sat on the mat"_

In [5]:
# Run tokenizer.encode on the sentence and store the result in tokenized_sentence

# YOUR CODE (1 line of code)
tokenized_sentence = tokenizer.encode("The cat sat on the mat")

print(tokenized_sentence)

# Check your work:
q_1.check()

# You can ask for a hint or the solution by uncommenting the following:
#q_1.hint()
#q_1.solution()

[101, 1996, 4937, 2938, 2006, 1996, 13523, 102]


<span style="color:#33cc33">Correct:</span> 

```python
# Encode the sentence
tokenized_sentence = tokenizer.encode('The cat sat on the mat')

# this yields
# [101, 1996, 4937, 2938, 2006, 1996, 13523, 102]

```

How many tokens does the encoding of the sentence have? Is that more or less than what you'd expect if each _word_ had been encoded with a token?

In [6]:
# Check your answer (Run this code cell to receive credit!)
q_1.solution2() 

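<span style="color:#33cc99">Correct:</span>

> We expect 6 tokens, but the output has 8 tokens. This is because:
>
> 1. The BERT tokenizer automatically adds a string **start** token (`[CLS]`, ID 101) and a string **end** token (`[SEP]`, ID 102), and
>
> 2. BERT uses not only words as tokens but also word parts, so that unfamiliar words can still be processed. In this sentence every word happens to be a single token, so the two extra tokens are the start and end tokens.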


<br>

#### Full encoding for BERT

For BERT this encoding is not enough. BERT requires not only a list of token IDs, but also an "_attention mask_" and "_token type IDs_". We won't go into what these are here, but we need them so that BERT knows what to do with the data.

We can get everything BERT requires by calling `tokenizer` as a function, like so:

In [7]:
tokenizer('the cat sat on the mat')

{'input_ids': [101, 1996, 4937, 2938, 2006, 1996, 13523, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

#### Sentence pairs

Remember that each example in our data has _two_ sentences. Because one of the tasks that BERT was pre-trained on requires it to predict whether the second of two sentences follows the first, `tokenizer` also accepts two sentences. Furthermore, BERT can only handle sequences of limited length (at most 512 tokens), so `tokenizer` accepts the extra argument `truncation`, which we can set to `True` to truncate sequences to this upper limit:

In [8]:
tokenizer('the cat sat on the mat', 'the mouse danced on the table', truncation=True)

{'input_ids': [101, 1996, 4937, 2938, 2006, 1996, 13523, 102, 1996, 8000, 10948, 2006, 1996, 2795, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
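
The structure of this output can be reproduced by hand. The sketch below assumes the word-level token IDs printed above and only mimics what the tokenizer does; `encode_pair` is a hypothetical helper, not part of the `transformers` library, and its truncation rule (drop tokens from the end of the longer sentence) is a simplification.

```python
CLS_ID, SEP_ID = 101, 102  # start and separator IDs seen in the outputs above


def encode_pair(ids_a, ids_b, max_length=512):
    """Combine two encoded sentences into BERT-style inputs (simplified)."""
    ids_a, ids_b = list(ids_a), list(ids_b)

    # Simplified truncation: remove tokens from the end of the longer
    # sentence until the pair plus its 3 special tokens fits in max_length.
    while len(ids_a) + len(ids_b) + 3 > max_length:
        longer = ids_a if len(ids_a) > len(ids_b) else ids_b
        longer.pop()

    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # 0 marks the first sentence (with its special tokens), 1 the second.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Every position holds a real token here, so the mask is all ones.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


cat_ids = [1996, 4937, 2938, 2006, 1996, 13523]     # 'the cat sat on the mat'
mouse_ids = [1996, 8000, 10948, 2006, 1996, 2795]   # 'the mouse danced on the table'

encoded = encode_pair(cat_ids, mouse_ids)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
```

The token type IDs tell BERT which sentence each token belongs to, and the attention mask becomes useful once sequences are padded to a common length.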

#### Tokenize and encode all the examples in the datasets

We need to do this for all the examples in `raw_datasets`. To this end we define a helper function that takes one example (or, with `batched=True`, a batch of examples) and runs the tokenizer in the same way as before. Then we use the `map` method to run the function on all the examples in `raw_datasets`:

In [9]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Now we have fully tokenized the datasets. If you want to see how this changed the training set, you can use pandas in the same way as above.

In [14]:
# You can use a pandas DataFrame to peek at what the tokenized training dataset looks like

# YOUR CODE (1 line)
pd.DataFrame(tokenized_datasets['train'])

| | sentence1 | sentence2 | label | idx | input_ids | token_type_ids | attention_mask |
|---|---|---|---|---|---|---|---|
| 0 | Amrozi accused his brother , whom he called " ... | Referring to him as only " the witness " , Amr... | 1 | 0 | [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 1 | Yucaipa owned Dominick 's before selling the c... | Yucaipa bought Dominick 's in 1995 for \$ 693 m... | 0 | 1 | [101, 9805, 3540, 11514, 2050, 3079, 11282, 22... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 2 | They had published an advertisement on the Int... | On June 10 , the ship 's owners had published ... | 1 | 2 | [101, 2027, 2018, 2405, 2019, 15147, 2006, 199... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 3 | Around 0335 GMT , Tab shares were up 19 cents ... | Tab shares jumped 20 cents , or 4.6 % , to set... | 0 | 3 | [101, 2105, 6021, 19481, 13938, 2102, 1010, 21... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 4 | The stock rose \$ 2.11 , or about 11 percent , ... | PG & E Corp. shares jumped \$ 1.63 or 8 percent... | 1 | 4 | [101, 1996, 4518, 3123, 1002, 1016, 1012, 2340... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3663 | " At this point , Mr. Brando announced : ' Som... | Brando said that " somebody ought to put a bul... | 1 | 4071 | [101, 1000, 2012, 2023, 2391, 1010, 2720, 1012... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 3664 | Martin , 58 , will be freed today after servin... | Martin served two thirds of a five-year senten... | 0 | 4072 | [101, 3235, 1010, 5388, 1010, 2097, 2022, 1065... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 3665 | " We have concluded that the outlook for price... | In a statement , the ECB said the outlook for ... | 1 | 4073 | [101, 1000, 2057, 2031, 5531, 2008, 1996, 1768... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 3666 | The notification was first reported Friday by ... | MSNBC.com first reported the CIA request on Fr... | 1 | 4074 | [101, 1996, 26828, 2001, 2034, 2988, 5958, 201... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |


#### Convert to TensorFlow dataset

To get the encoded data sets ready for use with TensorFlow we still need two more conversions. First, all the examples need to be stored in `Tensor` arrays. Second, it will be convenient to have the data in the TensorFlow dataset format.

The `transformers` library has a class called `DataCollatorWithPadding` that can collate the examples into `Tensor` arrays and at the same time makes sure that all examples in a batch have the same length by padding shorter token sequences.

The TensorFlow dataset format distinguishes columns that are input variables from columns that are target variables. It also lets us set the `batch_size` that we want to use while training the TensorFlow model. Each split in `tokenized_datasets` has a method, `to_tf_dataset`, that uses this collator to turn the split into the TensorFlow dataset format. We will use only a subset of both the training and the validation set to reduce computation time.

In [15]:
from transformers import DataCollatorWithPadding

# Make a data collator that turns examples into Tensor arrays
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")


# Run the collator on the examples in the training set and store as a TensorFlow dataset
use_subset = range(0, len(tokenized_datasets['train']), 2) # keeps every 2nd example; change the step to maybe 5 if you don't have a GPU
tf_train_dataset = tokenized_datasets["train"].select(use_subset).to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=16,
)

# Same for the validation set
use_subset = range(0, len(tokenized_datasets['validation']), 2) # keeps every 2nd example; change the step to maybe 5 if you don't have a GPU
tf_validation_dataset = tokenized_datasets["validation"].select(use_subset).to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=16,
)

print("Done!")

Done!
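
To see what padding a batch amounts to, here is a minimal sketch in plain Python. `pad_batch` is a hypothetical stand-in for illustration, not the collator's actual implementation; it assumes that the padding token has ID 0 (as it does for `bert-base-uncased`) and that every batch is padded to its own longest example.

```python
PAD_ID = 0  # ID of the [PAD] token in bert-base-uncased


def pad_batch(examples):
    """Pad every example in the batch to the length of the longest one."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [PAD_ID] * n_pad)
        batch["token_type_ids"].append(ex["token_type_ids"] + [0] * n_pad)
        # Zeros in the attention mask tell BERT to ignore the padding.
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
    return batch


short_example = {
    "input_ids": [101, 1996, 4937, 2938, 2006, 1996, 13523, 102],
    "token_type_ids": [0] * 8,
    "attention_mask": [1] * 8,
}
long_example = {
    "input_ids": [101, 1996, 4937, 2938, 2006, 1996, 13523, 102,
                  1996, 8000, 10948, 2006, 1996, 2795, 102],
    "token_type_ids": [0] * 8 + [1] * 7,
    "attention_mask": [1] * 15,
}

batch = pad_batch([short_example, long_example])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

After padding, every row in the batch has the same length, so the batch can be stored as one rectangular `Tensor`.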


# 2) Load BERT into an `AutoModel`

Now that we have made the data ready for TensorFlow, we need to build our TensorFlow/Keras model.

To use a pre-trained BERT model, we would like to use BERT to obtain contextualized representations of our text strings, and then use those representations downstream in further processing for a specific task. Doing that is a little bit more advanced and requires the "_functional API_" of Keras (see e.g., [this more advanced example](https://keras.io/examples/nlp/semantic_similarity_with_bert/) in the Keras documentation).

However, the HuggingFace `transformers` library makes this much easier for us: it comes with a set of classes that consist of a BERT part with extra layers already on top of it, suitable for specific tasks.

The current task falls under 'sequence classification': our `tokenizer` above has actually pasted the two sentences of each example together into one sequence, with a separation symbol between the sentences (the `input_ids` value of the separation symbol is 102). So our model should learn to predict the value of `label` from this sequence. The AutoModel class that is available for this task is called `AutoModelForSequenceClassification`. Makes sense, right?

Because we want to work only with TensorFlow in this course (and not PyTorch), we need to prefix that name with `TF`, so the full name of the class that we'll use is `TFAutoModelForSequenceClassification`.

Run the cell below to load the pre-trained checkpoint into such a model, with `num_labels=2` because there are two classes (equivalent or not equivalent).

In [16]:
from transformers import TFAutoModelForSequenceClassification

checkpoint = "bert-base-uncased" # we already specified this for the tokenizer above; just repeated for clarity
equivalence_model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Our `equivalence_model` still needs to be trained. Because `equivalence_model` is simply a TensorFlow/Keras model, you can train it in exactly the same way as any other Keras model.

But before we do this, let's verify that the model can be evaluated on the inputs as they are stored in `tf_train_dataset`. Below, one batch of examples is extracted, and the model is evaluated on the input part of that batch:

In [17]:
# TensorFlow datasets only allow the extraction of one element like this:
for inp, outp in tf_train_dataset:
    break
    
equivalence_model(inp)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(16, 2), dtype=float32, numpy=
array([[-0.80875385,  0.07910447],
       [-0.8145871 ,  0.08410111],
       [-0.7944824 ,  0.08282544],
       [-0.81052554,  0.05559344],
       [-0.8141959 ,  0.06977354],
       [-0.79305977,  0.07821833],
       [-0.8103473 ,  0.09315431],
       [-0.792526  ,  0.07970342],
       [-0.79805875,  0.08017459],
       [-0.812222  ,  0.07938088],
       [-0.81018376,  0.08954696],
       [-0.7989375 ,  0.0895362 ],
       [-0.81103384,  0.07151095],
       [-0.8237512 ,  0.06362635],
       [-0.8240591 ,  0.06784332],
       [-0.80779415,  0.07784459]], dtype=float32)>, hidden_states=None, attentions=None)

Does the output shape conform to what you expect?

In [18]:
# Check your answer (Run this code cell to receive credit!)
q_2.solution() 

<span style="color:#33cc99">Solution:</span> The output is an array with shape (16, 2). We indeed expect this, because there are 16 examples in a batch (see the creation of `tf_train_dataset` in section 1). Furthermore, there are two classes and hence, two logits, one for each class.
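
Logits are unnormalized scores. A softmax turns each row of two logits into two class probabilities, and the predicted class is the one with the larger value. The sketch below does this in plain Python for the first row of the output above; it illustrates the arithmetic and is not a call into the model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.80875385, 0.07910447]   # first row of the model output above
probs = softmax(logits)
predicted_class = probs.index(max(probs))

print(probs)
print(predicted_class)
```

Because the classification layer is still untrained at this point, these probabilities carry no meaning yet. That also explains why all 16 rows look nearly identical.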

# 3) Compile the model

As with other Keras models, we first need to `compile` our model. To do so we need to specify an optimizer, a loss function, and optionally some metrics we want to keep track of.

Because we are dealing with a classification problem here, we need the categorical cross-entropy loss function (i.e., the negative log-likelihood for multinomial logistic regression). Because the labels are stored as integer class indices rather than one-hot vectors, we use the _sparse_ variant. And because `TFAutoModelForSequenceClassification` outputs _logits_ rather than probabilities, we need to specify this explicitly (`from_logits=True`).

We'll use the Adam optimizer with a learning rate that is smaller than the default, and we'll keep track of the accuracy.
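
For intuition, the loss for a single example can be written out by hand: it is the negative log of the probability that the softmax assigns to the true class. The function below is a plain-Python sketch of that formula, not the Keras implementation, which also averages the loss over the batch.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    largest = max(logits)
    # log(sum(exp(logits))), computed in a numerically stable way
    log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_sum_exp - logits[label]


def mean_loss(batch_logits, batch_labels):
    """Average the per-example losses over a batch."""
    losses = [sparse_categorical_crossentropy(row, label)
              for row, label in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


# With equal logits both classes get probability 0.5, so the loss is log(2).
print(sparse_categorical_crossentropy([0.0, 0.0], 1))

# A row of logits that favours class 1 gives a lower loss when the label is 1.
row = [-0.80875385, 0.07910447]
print(sparse_categorical_crossentropy(row, 1))
print(sparse_categorical_crossentropy(row, 0))
```

The label is used directly as an index into the logits, which is why integer labels are enough and no one-hot encoding is needed.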


In [20]:
import tensorflow as tf

opt = tf.keras.optimizers.Adam(1e-5)
logLike = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile equivalence_model with the above optimizer and loss function; keep track of accuracy
# YOUR CODE (1 line of code)
equivalence_model.compile(
    optimizer=opt,
    loss = logLike,
    metrics=["accuracy"]
)
 
# Print a summary of the model
equivalence_model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


In [21]:
# Check your work
q_3.check()
 
# You can ask for a hint, or for the solution by uncommenting this:
#q_3.hint()
#
#q_3.solution()

<span style="color:#33cc33">Correct</span>

# 4) Train the model

Train the model on the training data set (using the `fit` method). Train for one epoch, and set `tf_validation_dataset` as the validation set. One epoch is enough, but more will give better results.

In [28]:
# Let equivalence_model train on the data in tf_train_dataset (make sure the output is stored in 'history')
history = equivalence_model.fit(
    tf_train_dataset,
    epochs=1,
    validation_data=tf_validation_dataset
) 

# Check your work
q_4.check()

# Lines below will give you a hint or solution code
#q_4.hint()
#q_4.solution()



<span style="color:#33cc33">Correct</span>

# 5) Test the model

Next, we test the model. A simple test is to see how it performs on hand-crafted examples. Consider the following two pairs of sentences:

The equivalent pair

> 1. The earth revolves around the sun
> 2. The sun is the pivot around which the earth circles

and the unrelated pair

> 1. The TV was broken
> 2. The field area was larger than a pool

Remember we first need to tokenize/encode these and wrap them into Tensors in order to be able to input them into our trained model:

In [23]:
import numpy as np

# Tokenize and encode
test_input = tokenize_function({
    'sentence1': ["The earth revolves around the sun",
                  "The TV was broken"], 
    'sentence2': ["The sun is the pivot around which the earth circles", 
                  "The field area was larger than a pool"],
    })


# Wrap into Tensors
test_input = data_collator(test_input)

In [24]:
# Let's see what the model tells us:
{'predicted': tf.argmax(equivalence_model(test_input)[0].numpy(), axis=1).numpy(), 'truth': np.array([1,0])}

{'predicted': array([1, 0]), 'truth': array([1, 0])}

Of course this is only a small set of test sentence pairs. We have an additional test set in `raw_datasets` that we loaded at the start of this notebook. 

Run the `evaluate` function of our `equivalence_model` on this test set and store the return value of `evaluate` in a new variable called `test_score`. (_Hint_: don't forget to tokenize, encode, and `Tensor` wrap the test set examples!)

In [29]:
# Evaluate the equivalence_model on the test data set raw_datasets['test']

use_subset = range(0, len(tokenized_datasets["test"]), 5)
# YOUR CODE (two commands in approximately 9 lines of code)
tf_test_dataset = tokenized_datasets["test"].select(use_subset).to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=16,
)
test_score = equivalence_model.evaluate(tf_test_dataset)

# Print the evaluation output:
print(test_score)

# Check your answer
q_5.check()

# Lines below will give you a hint or solution code
#q_5.hint()
#q_5.solution()

[0.5587239265441895, 0.7275362610816956]
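
The two numbers returned by `evaluate` are the loss followed by the compiled metric, here the accuracy. As a quick sanity check, the sketch below copies the values printed above and works out how many test pairs that accuracy corresponds to, given that only every 5th test example was used.

```python
n_test = 1725                               # size of the test split shown earlier
n_used = len(range(0, n_test, 5))           # every 5th example was selected

# Values copied from the output above: [loss, accuracy]
test_score = [0.5587239265441895, 0.7275362610816956]
test_loss, test_accuracy = test_score

# Accuracy is the fraction of correctly classified pairs.
n_correct = round(test_accuracy * n_used)

print(n_used, n_correct)   # prints: 345 251
```

So in this run the model classified 251 of the 345 selected test pairs correctly. Your own numbers will differ somewhat from run to run.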


<span style="color:#33cc33">Correct:</span> 

```python
tf_test_dataset = tokenized_datasets['test'].select(use_subset).to_tf_dataset(
   columns=['attention_mask', 'input_ids', 'token_type_ids'],
   label_cols=['labels'],
   shuffle=False,
   collate_fn=data_collator,
   batch_size=16,
)    
equivalence_model.evaluate(tf_test_dataset)
```


There you have it! Your accuracy is probably somewhere between 0.7 and 0.8 (the run shown above reached about 0.73). In other words, for sentence pairs such as

> 1. The Russians have announced that they have to withdraw from Kherson, because of the highly effective counter offensive by the Ukrainians.
> 2. The Russian army has lost the battle for Kherson and are withdrawing their troops, due to the effectiveness of the Ukrainians.

your model decides correctly whether the two sentences are equivalent in roughly three out of four cases!

# Keep Going

Now that you have seen the power of vector embeddings, let's see how they are created and **[learn how you can create your own](https://www.kaggle.com/datasniffer/nlp-token-embeddings)**. 
