## What is Embedding?

A neural network can only work with numbers, so the very first step is to assign a numerical value to each word. Suppose you have a dictionary of 10,000 words: you can assign each word a unique index (up to 10,000), so every word can be represented by its index. An embedding is a d-dimensional vector associated with each index. The figure below gives the basic idea of word embedding: each word has a unique index, and each index has an embedding vector.

![im](https://cdn-images-1.medium.com/max/800/1*Fw8r_yX7F3cy2kref9CkcQ.png)
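
The index-then-vector lookup described above can be sketched in a few lines of plain Python. This is only an illustration: the five-word `vocab`, the `[UNK]` fallback entry, and the random values in `embedding_table` are assumptions made for the example, whereas a real model learns its table during training.

```python
import random

# Toy vocabulary: every word gets a unique integer index (assumed for illustration).
vocab = {"[UNK]": 0, "the": 1, "dog": 2, "is": 3, "sleeping": 4}
embedding_dim = 4  # d, kept tiny so the vectors are easy to read

# The embedding table has one d-dimensional row per vocabulary index.
# Real models learn these values during training; here they are random.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(sentence):
    """Map each word to its index, then look up the row for that index."""
    indices = [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]
    return indices, [embedding_table[i] for i in indices]

indices, vectors = embed("The dog is sleeping")
print(indices)          # [1, 2, 3, 4]
print(len(vectors[0]))  # 4
```

Words that are not in the toy vocabulary fall back to index 0 in this sketch.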

## Positional Embedding in BERT Tokenizer

Let’s have a little fun here: what if we simply add positional indices to our word embeddings? Would that help? Refer to the image below. When we add a positional index to the word embeddings (W1, W2, etc.), the final embeddings of the rightmost words are always larger, and the position term dominates the original word embedding. In this case, if we add 10 to W3, the significance of W3 is lost. If we instead normalize the indices by the total length (divide all indices by 10), the same word at the same position gets a very different embedding in sentences of different lengths.

![im](https://cdn-images-1.medium.com/max/800/1*QlXjUSC2CYO7wpDPj6qv9A.png)
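
Both problems can be reproduced with a few made-up numbers. The sketch below is not taken from any library; the helper names `add_raw_position` and `normalized_position` and the word vectors are hypothetical, chosen only to make the effect visible.

```python
def add_raw_position(word_vectors):
    """Add the raw position index (1, 2, 3, ...) to every component."""
    return [[x + pos for x in vec] for pos, vec in enumerate(word_vectors, start=1)]

def normalized_position(pos, sentence_length):
    """Scale a 1-based position by dividing by the sentence length."""
    return pos / sentence_length

# Small word vectors with components well below 1 (made-up values).
words = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.1]]
shifted = add_raw_position(words)
print(shifted[2])  # the position term (3) swamps the original components

# The same position gets a different offset in a 5-word sentence
# than in a 10-word sentence.
print(normalized_position(3, 5))
print(normalized_position(3, 10))
```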

Transformers came up with a beautiful idea for the above problem: sinusoidal positional encoding. The formula is shown below, where pos is the position of the word in the sentence, d is the dimension of the embedding vector, and i indexes the components of that vector. A single sine wave repeats its values (it crosses zero many times, and so does a cosine wave), so one wave alone would assign the same value to different positions. Using sine for the even indices and cosine for the odd indices, with a different frequency for each pair of dimensions, avoids these duplicate embedding values.

![im](https://cdn-images-1.medium.com/max/800/1*0vH1SFfB5slidDixgEms0w.png)
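
Written out, the encoding is `PE(pos, 2i) = sin(pos / 10000^(2i/d))` for even components and `PE(pos, 2i+1) = cos(pos / 10000^(2i/d))` for odd components. A minimal pure-Python sketch of that formula follows; the function name `sinusoidal_encoding` and the small sizes used in the call are assumptions for illustration.

```python
import math

def sinusoidal_encoding(num_positions, d):
    """Return a num_positions x d table of sinusoidal positional encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d):
            i = k // 2  # components 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=50, d=8)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
print(pe[1])
```

Every value stays between -1 and 1 regardless of the position, so the positional term cannot grow and swamp the word embedding, and it does not depend on the sentence length.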

## BERT Input Embedding

![im](https://cdn-images-1.medium.com/max/800/1*aiW1g8sTScOnToJA8uURjQ.png)


If you have gone through BERT’s original paper, you must have seen the above figure. If you have not, do not worry: we are here to explore everything. BERT does not use sinusoidal positional encoding; the model learns its positional embeddings during the training phase, which is why you do not have to supply positional embeddings yourself when using the transformers library. BERT also uses the WordPiece tokenizer, which breaks some words into sub-words. For example, in the above image the word ‘sleeping’ is tokenized into ‘sleep’ and ‘##ing’. This often helps to break unknown words into known pieces, where known means present in the vocabulary. We will see this with a real-world example later.
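
To make the sub-word idea concrete, here is a simplified sketch of greedy longest-match-first splitting, the strategy WordPiece-style tokenizers apply to a single word. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical; BERT's real vocabulary has roughly 30,000 entries and its tokenizer also handles casing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (hypothetical, far smaller than BERT's real one).
toy_vocab = {"sleep", "##ing", "play", "##ed", "the"}
print(wordpiece_tokenize("sleeping", toy_vocab))  # ['sleep', '##ing']
print(wordpiece_tokenize("played", toy_vocab))    # ['play', '##ed']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```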

In [None]:
!pip install datasets --quiet
!pip install transformers --quiet


In [None]:
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Optional

import datasets
import nltk  # Here to have a nice missing dependency error message early on
import numpy as np
from datasets import load_dataset, load_metric

import transformers
from filelock import FileLock
from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    MBartTokenizer,
    MBartTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    set_seed,
)
from transformers.file_utils import is_offline_mode
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version
from transformers.utils.versions import require_version
import torch
from transformers.optimization import Adafactor, AdafactorSchedule
from transformers import get_constant_schedule_with_warmup  # AutoTokenizer is already imported above

torch.cuda.empty_cache()

**Key Observations:**
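- A comparative analysis is done among the BERT, T5 and RoBERTa tokenizers.
- Each tokenizer uses a different notation for sub-word tokens: BERT prefixes continuation pieces with `##`, RoBERTa marks a token that follows a space with `Ġ`, and T5 marks the start of a word with `▁`.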

## BERT tokenizer

In [None]:
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")


In [None]:
tokenizer_bert("Hello this is Amit Kayal")["input_ids"]

[101, 7592, 2023, 2003, 26445, 2102, 10905, 2389, 102]

The tokenizer returns a dictionary with three important items:

- `input_ids` are the indices corresponding to each token in the sentence.
- `attention_mask` indicates whether a token should be attended to or not.
- `token_type_ids` identifies which sequence a token belongs to when there is more than one sequence.

In [None]:
dict(tokenizer_bert("Hello this is Amit Kayal"))

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101, 7592, 2023, 2003, 26445, 2102, 10905, 2389, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [None]:
tokenizer_bert("Hello this is Amit                         Kayal")["input_ids"]

[101, 7592, 2023, 2003, 26445, 2102, 10905, 2389, 102]

In [None]:
print(tokenizer_bert.convert_ids_to_tokens(tokenizer_bert.encode("Hello this is Amit Kayal")))

['[CLS]', 'hello', 'this', 'is', 'ami', '##t', 'kay', '##al', '[SEP]']


In [None]:
encode_bert = tokenizer_bert.encode("Hello this is Amit Kayal")

In [None]:
for key, value in tokenizer_bert("Hello this is Amit Kayal").items():
    print( '{} : {}'.format( key, value ) )

input_ids : [101, 7592, 2023, 2003, 26445, 2102, 10905, 2389, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1]


### How does the tokenizer handle pairs of texts?

- We make two lists: the first contains all the questions and the second contains all the contexts.
- This time we receive two lists for each key of the dictionary (input_ids, token_type_ids, and attention_mask), one per question-context pair.
- Notice that the two lists still have different sizes. This happens because we did not pass padding as an argument.

In [None]:
q1 = 'Who was Tony Stark?'
c1 = 'Anthony Edward Stark known as Tony Stark is a fictional character in Avengers'
q2 = 'Who was Tony in Marvel'
c2 = 'Tony Stark is a fictional character in Marvel Avengers'
encoding = tokenizer_bert([q1,q2], [c1,c2])
for key, value in encoding.items():
    print('1- {}: {}'.format(key, value))

1- input_ids: [[101, 2040, 2001, 4116, 9762, 1029, 102, 4938, 3487, 9762, 2124, 2004, 4116, 9762, 2003, 1037, 7214, 2839, 1999, 14936, 102], [101, 2040, 2001, 4116, 1999, 8348, 102, 4116, 9762, 2003, 1037, 7214, 2839, 1999, 8348, 14936, 102]]
1- token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
1- attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
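
The layout visible in this output, `[CLS]` first text `[SEP]` second text `[SEP]`, with segment id 0 for the first part and 1 for the second, can be sketched without the library. The helper `build_pair` is hypothetical and works on already-split tokens; it only mirrors the pattern seen above.

```python
def build_pair(question_tokens, context_tokens):
    """Join two token lists the way a BERT-style tokenizer lays out a pair."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the first text and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask

question = ["who", "was", "tony", "stark", "?"]
context = ["tony", "stark", "is", "a", "fictional", "character"]
tokens, token_type_ids, attention_mask = build_pair(question, context)
print(tokens)
print(token_type_ids)  # seven 0s followed by seven 1s
print(attention_mask)
```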


In [None]:
encoding = tokenizer_bert([q1,q2], [c1,c2],padding=True)
for key, value in encoding.items():
    print('1- {}: {}'.format(key, value))

1- input_ids: [[101, 2040, 2001, 4116, 9762, 1029, 102, 4938, 3487, 9762, 2124, 2004, 4116, 9762, 2003, 1037, 7214, 2839, 1999, 14936, 102], [101, 2040, 2001, 4116, 1999, 8348, 102, 4116, 9762, 2003, 1037, 7214, 2839, 1999, 8348, 14936, 102, 0, 0, 0, 0]]
1- token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]
1- attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]


- If there are several sentences you want to process, pass the sentences as a list to the tokenizer:

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoding = tokenizer_bert(batch_sentences)
for key, value in encoding.items():
    print('1- {}: {}'.format(key, value))

1- input_ids: [[101, 2021, 2054, 2055, 2117, 6350, 1029, 102], [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 102], [101, 2054, 2055, 5408, 14625, 1029, 102]]
1- token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]]
1- attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]


### Padding

- **When you process a batch of sentences, they aren’t always the same length. This is a problem because tensors, the input to the model, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to sentences with fewer tokens.**
- **Set the padding parameter to True to pad the shorter sequences in the batch to match the longest sequence**

In [None]:
encoding = tokenizer_bert(batch_sentences, padding=True)
for key, value in encoding.items():
    print('1- {}: {}'.format(key, value))

1- input_ids: [[101, 2021, 2054, 2055, 2117, 6350, 1029, 102, 0, 0, 0, 0, 0, 0], [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 102], [101, 2054, 2055, 5408, 14625, 1029, 102, 0, 0, 0, 0, 0, 0, 0]]
1- token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
1- attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]]
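
What `padding=True` does to `input_ids` and `attention_mask` can be mimicked in plain Python. This is a sketch, not the library's implementation: `pad_batch` is a hypothetical helper, and `pad_id=0` is taken from the zeros in the output above.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding positions
    return padded, masks

batch = [
    [101, 2021, 2054, 2055, 2117, 6350, 1029, 102],
    [101, 2054, 2055, 5408, 14625, 1029, 102],
]
padded, masks = pad_batch(batch)
print(padded)
print(masks)
```

The attention mask is what lets the model ignore the padded positions.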


###  Truncation

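- **Sometimes a sequence may be too long for a model to handle. In this case, you will need to truncate the sequence to a shorter length.**
- **Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model (512 tokens for BERT). The longest sentence in the example below encodes to 55 tokens, so nothing is actually cut; the shorter sentences are only padded to match it.**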

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip. I want to test BERT encoding and see how truncation works as I am not fully sure about this. Do you want to see also and understand this complicated logic and ensure we all are clear on this",
    "What about elevensies?",
]

In [None]:
encoding = tokenizer_bert(batch_sentences, 
                          truncation=True,
                          padding=True)
for key, value in encoding.items():
    print('1- {}: {}'.format(key, value))

1- input_ids: [[101, 2021, 2054, 2055, 2117, 6350, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 1045, 2215, 2000, 3231, 14324, 17181, 1998, 2156, 2129, 19817, 4609, 10719, 2573, 2004, 1045, 2572, 2025, 3929, 2469, 2055, 2023, 1012, 2079, 2017, 2215, 2000, 2156, 2036, 1998, 3305, 2023, 8552, 7961, 1998, 5676, 2057, 2035, 2024, 3154, 2006, 2023, 102], [101, 2054, 2055, 5408, 14625, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
1- token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
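
To see an actual cut, a much smaller limit than the model maximum is needed. The sketch below shows the idea for a single sequence in plain Python; `truncate` is a hypothetical helper, the limit of 8 is arbitrary, and it assumes `max_length` is at least 2 so that `[CLS]` and `[SEP]` (id 102 in the outputs above) both survive.

```python
def truncate(input_ids, max_length, sep_id=102):
    """Cut a single encoded sequence to max_length, keeping the final [SEP]."""
    if len(input_ids) <= max_length:
        return list(input_ids)
    # Keep the first max_length - 1 ids, then close the sequence with [SEP].
    return list(input_ids[: max_length - 1]) + [sep_id]

ids = [101, 2123, 1005, 1056, 2228, 2002, 4282, 2055, 2117, 6350, 1010, 28315, 1012, 102]
short = truncate(ids, max_length=8)
print(short)       # [101, 2123, 1005, 1056, 2228, 2002, 4282, 102]
print(len(short))  # 8
```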

###  Build tensors

- Finally, you want the tokenizer to return the actual tensors that are fed to the model.
- Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:

In [None]:
q1 = 'Who was Tony Stark?'
c1 = 'Anthony Edward Stark known as Tony Stark is a fictional character in Avengers'
q2 = 'Who was Tony in Marvel'
c2 = 'Tony Stark is a fictional character in Marvel Avengers'
# encoding = tokenizer_bert([q1,q2], [c1,c2])
# for key, value in encoding.items():
#     print('1- {}: {}'.format(key, value))

In [None]:
encoding = tokenizer_bert([q1,q2], [c1,c2],
                          return_tensors="pt",
                          truncation=True,
                          padding=True)
for key, value in encoding.items():
    print('1- {}: {}'.format(key, value))

1- input_ids: tensor([[  101,  2040,  2001,  4116,  9762,  1029,   102,  4938,  3487,  9762,
          2124,  2004,  4116,  9762,  2003,  1037,  7214,  2839,  1999, 14936,
           102],
        [  101,  2040,  2001,  4116,  1999,  8348,   102,  4116,  9762,  2003,
          1037,  7214,  2839,  1999,  8348, 14936,   102,     0,     0,     0,
             0]])
1- token_type_ids: tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])
1- attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])


## RoBERTa tokenizer

In [None]:
from transformers import RobertaTokenizer
tokenizer_roberta = RobertaTokenizer.from_pretrained("roberta-base")


In [None]:
tokenizer_roberta("Hello this is Amit Kayal")["input_ids"]

[0, 31414, 42, 16, 16841, 7120, 337, 2]

In [None]:
dict(tokenizer_roberta("Hello this is Amit Kayal"))

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [0, 31414, 42, 16, 16841, 7120, 337, 2]}

In [None]:
print(tokenizer_roberta.convert_ids_to_tokens(tokenizer_roberta.encode("Hello this is Amit Kayal")))

['<s>', 'Hello', 'Ġthis', 'Ġis', 'ĠAmit', 'ĠKay', 'al', '</s>']


In [None]:
for key, value in tokenizer_roberta("Hello this is Amit Kayal").items():
    print( '{} : {}'.format( key, value ) )

input_ids : [0, 31414, 42, 16, 16841, 7120, 337, 2]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1]


## T5 tokenizer

In [None]:
!pip install sentencepiece --quiet
# Remember to restart the runtime after this install.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [None]:
tokenizer_t5 = T5Tokenizer.from_pretrained("t5-small")




In [None]:
tokenizer_t5("Hello this is Amit Kayal")["input_ids"]

[8774, 48, 19, 71, 1538, 14168, 138, 1]

In [None]:
dict(tokenizer_t5("Hello this is Amit Kayal"))

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [8774, 48, 19, 71, 1538, 14168, 138, 1]}

In [None]:
for key, value in tokenizer_t5("Hello this is Amit Kayal").items():
    print( '{} : {}'.format( key, value ) )

input_ids : [8774, 48, 19, 71, 1538, 14168, 138, 1]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1]


In [None]:
print(tokenizer_t5.convert_ids_to_tokens(tokenizer_t5.encode("Hello this is Amit Kayal")))

['▁Hello', '▁this', '▁is', '▁A', 'mit', '▁Kay', 'al', '</s>']
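
The three sub-word notations seen so far (`##` for BERT, `Ġ` for RoBERTa, `▁` for T5) can be compared by joining the printed tokens back into text. The three helpers below are hypothetical and deliberately simple: special tokens such as `[CLS]`, `<s>` and `</s>` are left out of the inputs, and the real tokenizers provide their own decoding methods.

```python
def join_wordpiece(tokens):
    """BERT WordPiece: '##' marks a piece that continues the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

def join_byte_level_bpe(tokens):
    """RoBERTa byte-level BPE: 'Ġ' marks a token that starts with a space."""
    return "".join(tokens).replace("Ġ", " ")

def join_sentencepiece(tokens):
    """T5 SentencePiece: '▁' marks the start of a new word."""
    return "".join(tokens).replace("▁", " ").strip()

print(join_wordpiece(["hello", "this", "is", "ami", "##t", "kay", "##al"]))
print(join_byte_level_bpe(["Hello", "Ġthis", "Ġis", "ĠAmit", "ĠKay", "al"]))
print(join_sentencepiece(["▁Hello", "▁this", "▁is", "▁A", "mit", "▁Kay", "al"]))
```

The BERT result comes back lowercased because `bert-base-uncased` lowercases the text before tokenizing.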


## Fast Tokenization

- A fast tokenizer keeps track of the offset position of each token, as shown below.
- This is useful in two ways:
  - Word ID mapping
    - Whole word masking
    - Token classification
  - Offset mapping
    - Token classification
    - Question answering

![im](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/tokenization_pipeline.svg)

In [None]:
from transformers import AutoTokenizer

In [None]:
tokenizer_AutoTokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [None]:
encoding = tokenizer_AutoTokenizer("Hello this is Amit Kayal")
print(encoding.tokens())  # tokens is a method, so it must be called

['[CLS]', 'Hello', 'this', 'is', 'Am', '##it', 'Kay', '##al', '[SEP]']


**The cells below show the tokens and their corresponding word IDs, which map each token to the word it came from. A fast tokenizer keeps track of the word each token comes from, so we can look up the word position of every token.**

In [19]:
dict(encoding)

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101, 8667, 1142, 1110, 7277, 2875, 11247, 1348, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [21]:
print(encoding.tokens())
# print(encoding["offset_mapping"])
print(encoding.word_ids())

['[CLS]', 'Hello', 'this', 'is', 'Am', '##it', 'Kay', '##al', '[SEP]']
[None, 0, 1, 2, 3, 3, 4, 4, None]


In [23]:
encoding = tokenizer_AutoTokenizer("Hello this is Amit Kayal",
                                   return_offsets_mapping=True)

**A fast tokenizer also keeps track of the character span in the original text that produced each token.**

In [24]:
print(encoding.tokens())
print(encoding["offset_mapping"])
print(encoding.word_ids())

['[CLS]', 'Hello', 'this', 'is', 'Am', '##it', 'Kay', '##al', '[SEP]']
[(0, 0), (0, 5), (6, 10), (11, 13), (14, 16), (16, 18), (19, 22), (22, 24), (0, 0)]
[None, 0, 1, 2, 3, 3, 4, 4, None]
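
One way to use these two outputs together is to merge the character spans of tokens that share a word ID, which recovers whole words from the original text. The sketch below copies the values printed above into plain lists; `word_spans` is a hypothetical helper, not a library function.

```python
text = "Hello this is Amit Kayal"
offsets = [(0, 0), (0, 5), (6, 10), (11, 13), (14, 16), (16, 18), (19, 22), (22, 24), (0, 0)]
word_ids = [None, 0, 1, 2, 3, 3, 4, 4, None]

def word_spans(offsets, word_ids):
    """Merge token character spans that share a word id into one span per word."""
    spans = {}
    for (start, end), wid in zip(offsets, word_ids):
        if wid is None:  # special tokens such as [CLS] and [SEP]
            continue
        if wid in spans:
            spans[wid] = (spans[wid][0], end)  # extend the word to this token's end
        else:
            spans[wid] = (start, end)
    return [spans[wid] for wid in sorted(spans)]

spans = word_spans(offsets, word_ids)
print(spans)
print([text[s:e] for s, e in spans])  # ['Hello', 'this', 'is', 'Amit', 'Kayal']
```

The same span arithmetic is what lets token classification and question answering map a predicted token range back to a substring of the input.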
