# About the Tokenizer

In [1]:
!pip install transformers


In [42]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In [3]:
from transformers import AutoTokenizer

In [4]:
tokz = AutoTokenizer.from_pretrained(checkpoint)

In [5]:
input = "This is a good book"

In [6]:
tokz(input)

{'input_ids': [101, 2023, 2003, 1037, 2204, 2338, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [8]:
# Ask for framework tensors with return_tensors ("pt" means PyTorch)
tokz(input, return_tensors="pt")

{'input_ids': tensor([[ 101, 2023, 2003, 1037, 2204, 2338,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [11]:
## The tokenizing process
tokens = tokz.tokenize(input)
tokens

['this', 'is', 'a', 'good', 'book']

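**`tokenize` splits the input into tokens. How the text is split differs from model to model, because each checkpoint comes with its own tokenizer and vocabulary. Note that this uncased tokenizer also lowercases the text.**

You can learn more about the different tokenization approaches [here](https://huggingface.co/course/chapter2/4).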
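
To make the splitting step more concrete, here is a minimal sketch of greedy longest-match subword splitting, the general idea behind WordPiece-style tokenizers such as the one this checkpoint uses. The `toy_vocab` set and the `wordpiece_split` helper are made up for illustration; the real tokenizer has a much larger vocabulary and also handles normalization and punctuation.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the checkpoint's real one.
toy_vocab = {"this", "is", "a", "good", "book", "token", "##izer", "##s"}

split_known = wordpiece_split("book", toy_vocab)
split_subwords = wordpiece_split("tokenizers", toy_vocab)
split_unknown = wordpiece_split("xyz", toy_vocab)
print(split_known, split_subwords, split_unknown)
```

A word that is in the vocabulary stays whole, a longer word falls apart into known pieces, and a word with no matching piece becomes the unknown token.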

In [12]:
## Convert the tokens to vocabulary ids, the numeric form the model consumes
tokz.convert_tokens_to_ids(tokens)

[2023, 2003, 1037, 2204, 2338]
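
Conceptually, `convert_tokens_to_ids` is a vocabulary lookup. Here is a minimal sketch, assuming a tiny hand-written vocabulary that contains only the ids visible in the outputs above, plus `[UNK]` with id 100 (an assumption based on the BERT uncased vocabulary). `toy_vocab` and `tokens_to_ids` are illustrative names, not library objects.

```python
# Hypothetical mini vocabulary built from the ids printed in this notebook.
toy_vocab = {
    "[UNK]": 100,  # assumed id of the unknown token
    "this": 2023,
    "is": 2003,
    "a": 1037,
    "good": 2204,
    "book": 2338,
}


def tokens_to_ids(tokens, vocab):
    """Map each token to its id, falling back to the unknown-token id."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]


ids = tokens_to_ids(["this", "is", "a", "good", "book"], toy_vocab)
ids_with_unknown = tokens_to_ids(["this", "zzz"], toy_vocab)
print(ids, ids_with_unknown)
```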

In [14]:
## Calling the tokenizer directly does both steps for us
tokz_output = tokz(input)
tokz_output

{'input_ids': [101, 2023, 2003, 1037, 2204, 2338, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [15]:
tokz.convert_ids_to_tokens(tokz_output["input_ids"])

['[CLS]', 'this', 'is', 'a', 'good', 'book', '[SEP]']

**As you can see, the tokenizer adds the special tokens `[CLS]` and `[SEP]` to mark the start and end of the sentence.**
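
The gap between the manual path (`tokenize` then `convert_tokens_to_ids`) and the direct tokenizer call can be reproduced by hand. Here is a minimal sketch, assuming the ids 101 for `[CLS]` and 102 for `[SEP]` as they appear in the output above. `wrap_with_special_tokens` is a hypothetical helper, not a library function.

```python
CLS_ID = 101  # id of [CLS], taken from the output above
SEP_ID = 102  # id of [SEP], taken from the output above


def wrap_with_special_tokens(token_ids):
    """Put [CLS] in front of the ids and [SEP] behind them."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


plain_ids = [2023, 2003, 1037, 2204, 2338]  # "this is a good book"
full_ids = wrap_with_special_tokens(plain_ids)
# With a single, unpadded sentence every position is a real token.
attention_mask = [1] * len(full_ids)
print(full_ids, attention_mask)
```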

## Let's try this

In [28]:
import torch

In [43]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [44]:
input = "This is something great"

In [45]:
tokens = tokz.tokenize(input)

In [46]:
token_ids = tokz.convert_tokens_to_ids(tokens)

In [51]:
model(input_ids = torch.IntTensor(token_ids))

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

In [55]:
# The model expects a batch: a 2D tensor of shape (batch_size, sequence_length).
# For a single input, reshape the 1D tensor to (1, sequence_length).

model(input_ids=torch.IntTensor(token_ids).view(1, -1))

SequenceClassifierOutput(loss=None, logits=tensor([[-4.1868,  4.5991]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
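
The logits are unnormalized scores, one per class, and a softmax turns them into probabilities. Here is a minimal sketch in plain Python, using the logits printed above. The label order (index 0 is NEGATIVE, index 1 is POSITIVE) is an assumption about this checkpoint's configuration and is worth checking against `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-4.1868, 4.5991]  # copied from the model output above
labels = ["NEGATIVE", "POSITIVE"]  # assumed label order, see lead-in

probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(probs, predicted)
```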

## Let's try Multiple Inputs

In [64]:
inputs = [
    "This is great!",
    "This is bad."
]

In [68]:
token_ids = [tokz.convert_tokens_to_ids(tokz.tokenize(i)) for i in inputs]
token_ids

[[2023, 2003, 2307, 999], [2023, 2003, 2919, 1012]]

In [69]:
model(
    input_ids = torch.IntTensor(token_ids)
)

SequenceClassifierOutput(loss=None, logits=tensor([[-3.9734,  4.3661],
        [ 3.7730, -3.1091]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [70]:
# Now let's use sentences with different numbers of tokens

In [72]:
inputs = [
    "This is great!",
    "This is a bad news."
]

In [73]:
token_ids = [tokz.convert_tokens_to_ids(tokz.tokenize(i)) for i in inputs]
token_ids

[[2023, 2003, 2307, 999], [2023, 2003, 1037, 2919, 2739, 1012]]

In [75]:
torch.IntTensor(token_ids)

ValueError: expected sequence of length 4 at dim 1 (got 6)

**This fails because a tensor must be rectangular: every row needs the same length. We need a better solution.**

In [76]:
# That's why the tokenizer can pad short sentences and truncate long ones

In [81]:
token_outputs = tokz(inputs, padding=True, truncation=True, return_tensors="pt")
token_outputs

{'input_ids': tensor([[ 101, 2023, 2003, 2307,  999,  102,    0,    0],
        [ 101, 2023, 2003, 1037, 2919, 2739, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}
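
What `padding=True` does can be sketched by hand: every sequence is extended with the padding id up to the length of the longest one, and the attention mask records which positions hold real tokens. Here is a minimal sketch, assuming a padding id of 0 as seen in the output above. `pad_batch` is a hypothetical helper, not part of the library.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build the masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Ids of the two sentences, including [CLS] (101) and [SEP] (102).
batch = pad_batch([
    [101, 2023, 2003, 2307, 999, 102],
    [101, 2023, 2003, 1037, 2919, 2739, 1012, 102],
])
print(batch)
```

The result has the same values as the tokenizer output above, only as plain lists instead of tensors.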

In [82]:
model(
    input_ids = token_outputs['input_ids']
)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3079,  4.6719],
        [ 4.7390, -3.7818]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [83]:
## Now pass the attention mask as well
model(
    input_ids = token_outputs['input_ids'],
    attention_mask = token_outputs['attention_mask']
)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3029,  4.6412],
        [ 4.7390, -3.7818]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

**Compare both results: the logits in the first row (the padded sentence) differ from the run without `attention_mask`, while the second row is unchanged.**

The result with `attention_mask` is the correct one.
<br/>
`attention_mask` makes sure the model ignores the padding tokens.
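
One common way a mask is applied inside attention is to set the scores of masked positions to negative infinity before the softmax, so that they receive zero weight. The sketch below illustrates that idea in plain Python with made-up scores. It is a simplification for intuition, not DistilBERT's actual implementation, and `masked_softmax` is a hypothetical helper.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores that gives zero weight wherever mask is 0.

    Assumes at least one position is unmasked.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0, 3.0]  # made-up attention scores
mask = [1, 1, 1, 0, 0]              # the last two positions are padding

with_mask = masked_softmax(scores, mask)
without_mask = masked_softmax(scores, [1] * len(scores))
print(with_mask)
print(without_mask)
```

Without the mask, the padding positions take a share of the weight, which is why the logits of the padded sentence shift when `attention_mask` is left out.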