#### What is the tokenizer's input?

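Note that the tokenizer must match the model, so both are loaded from the same checkpoint name.

Model info: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

- `Auto*Tokenizer`, `Auto*Model`: generic (adaptive) classes; the concrete class is selected automatically from the checkpoint.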

In [4]:
test_sentences = ['Today is not that bad', 'today is so bad']
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [9]:
batch_inputs = tokenizer(test_sentences)
batch_inputs

{'input_ids': [[101, 2651, 2003, 2025, 2008, 2919, 102], [101, 2651, 2003, 2061, 2919, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

In [13]:
batch_inputs = tokenizer(test_sentences, 
                         truncation=True, 
                         padding=True, 
                         return_tensors='pt')
batch_inputs

{'input_ids': tensor([[ 101, 2651, 2003, 2025, 2008, 2919,  102],
        [ 101, 2651, 2003, 2061, 2919,  102,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}

In [20]:
tokenizer(test_sentences[0])

{'input_ids': [101, 2651, 2003, 2025, 2008, 2919, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

##### Details

- `tokenizer(test_sentences[0])` calls `tokenizer.__call__`, which encodes the text.
- `tokenizer.encode` is equivalent to `tokenizer.tokenize` followed by `tokenizer.convert_tokens_to_ids`, plus adding the special tokens `[CLS]` and `[SEP]`.
- `tokenizer.decode` maps ids back to text.
- `[CLS]` = 101 marks the start of the sequence.
- `[SEP]` = 102 marks the end of the sequence.

NLP models like this one typically ship with a vocabulary file (`vocab`). Once loaded, it is a `dict` that maps each token to its id.
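As a minimal sketch of the relationship listed above, the toy tokenizer below rebuilds `encode` from a whitespace `tokenize` step, a vocab lookup, and the two special tokens. The vocab is a hand-copied slice of the ids that appear in this section. `toy_tokenize`, `toy_convert_tokens_to_ids`, `toy_encode`, and `toy_decode` are hypothetical helpers, not library functions, and the real tokenizer uses WordPiece rather than whitespace splitting.

```python
# Toy illustration only: a tiny slice of the vocab, with ids copied from this section.
vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "today": 2651, "is": 2003, "not": 2025, "that": 2008,
         "so": 2061, "bad": 2919}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def toy_tokenize(text):
    # Lowercase and split on whitespace (assumption: the real tokenizer uses WordPiece).
    return text.lower().split()

def toy_convert_tokens_to_ids(tokens):
    # Unknown tokens fall back to the [UNK] id.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

def toy_encode(text):
    # encode = tokenize + convert_tokens_to_ids, wrapped in [CLS] ... [SEP].
    ids = toy_convert_tokens_to_ids(toy_tokenize(text))
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

def toy_decode(ids):
    # Map ids back to tokens and join them with spaces.
    return " ".join(id_to_token[token_id] for token_id in ids)

ids = toy_encode("Today is not that bad")
print(ids)
print(toy_decode(ids))
```

The printed ids can be compared with the `tokenizer.encode(test_sentences[0])` output in this section.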


In [21]:
tokenizer.encode(test_sentences[0])

[101, 2651, 2003, 2025, 2008, 2919, 102]

In [25]:
print(tokenizer.tokenize(test_sentences[0]))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(test_sentences[0])))
print(tokenizer.decode([101, 2651, 2003, 2025, 2008, 2919,  102]))

['today', 'is', 'not', 'that', 'bad']
[2651, 2003, 2025, 2008, 2919]
[CLS] today is not that bad [SEP]


In [36]:
len(tokenizer.vocab), tokenizer.vocab['bad'], tokenizer.vocab['[SEP]']

(30522, 2919, 102)

In [37]:
tokenizer.special_tokens_map, tokenizer.special_tokens_map_extended

({'unk_token': '[UNK]',
  'sep_token': '[SEP]',
  'pad_token': '[PAD]',
  'cls_token': '[CLS]',
  'mask_token': '[MASK]'},
 {'unk_token': '[UNK]',
  'sep_token': '[SEP]',
  'pad_token': '[PAD]',
  'cls_token': '[CLS]',
  'mask_token': '[MASK]'})

In [38]:
tokenizer.convert_tokens_to_ids(list(tokenizer.special_tokens_map.values()))

[100, 102, 0, 101, 103]

**Overall, the tokenizer works by mapping tokens to ids.**


##### Tokenizer parameters

In [42]:
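# max_length: the maximum sequence length, in tokens.
# truncation=True: sequences longer than max_length are cut off.
# padding=True: shorter sequences are padded to the longest one in the batch;
#   padding='max_length' pads every sequence to max_length instead.
# return_tensors='pt': return PyTorch tensors.
# attention_mask (in the output): 1 marks real tokens, 0 marks padding.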
tokenizer(test_sentences, max_length=256, truncation=True, padding=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 2651, 2003, 2025, 2008, 2919,  102],
        [ 101, 2651, 2003, 2061, 2919,  102,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}
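To make the padding and attention-mask behaviour concrete, here is a rough pure-Python sketch of what these parameters do to a batch of id lists. `pad_and_truncate` and its `pad_to_max` flag are hypothetical names made up for this illustration. The real tokenizer does this internally and, unlike this sketch, keeps the trailing `[SEP]` when it truncates.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0, pad_to_max=False):
    # Cut every sequence down to at most max_length ids.
    # Simplification: the real tokenizer preserves the special tokens when truncating.
    truncated = [ids[:max_length] for ids in batch_ids]
    # padding=True pads to the longest sequence; padding='max_length' pads to max_length.
    target = max_length if pad_to_max else max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 2651, 2003, 2025, 2008, 2919, 102],
         [101, 2651, 2003, 2061, 2919, 102]]
padded = pad_and_truncate(batch, max_length=256)
print(padded)
```

With `pad_to_max=True` the same call would pad both rows to 256 ids, which mirrors `padding='max_length'`.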

In [41]:
tokenizer(test_sentences, max_length=256, truncation=True, padding='max_length', return_tensors='pt')

{'input_ids': tensor([[ 101, 2651, 2003, 2025, 2008, 2919,  102,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            ...

(Output truncated: with `padding='max_length'`, each row is padded with 0 up to `max_length=256`.)

In [43]:
batch_inputs = tokenizer(test_sentences, max_length=256, truncation=True, padding=True, return_tensors='pt')

### Call the model

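`**batch_inputs` unpacks the dict into keyword arguments, so `model(**batch_inputs)` is equivalent to `model(input_ids=batch_inputs['input_ids'], attention_mask=batch_inputs['attention_mask'])`. The model receives each tensor by name instead of receiving the dict itself.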
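Keyword unpacking with `**` is plain Python and can be tried without loading a model. In the sketch below, `toy_model` is a hypothetical stand-in for a model's forward call and `toy_inputs` is a hand-written dict in the same shape as the tokenizer output.

```python
def toy_model(input_ids, attention_mask=None):
    # Hypothetical stand-in for a forward call: it only reports what it received.
    return {"n_sequences": len(input_ids), "has_mask": attention_mask is not None}

toy_inputs = {
    "input_ids": [[101, 2651, 2003, 2919, 102]],
    "attention_mask": [[1, 1, 1, 1, 1]],
}

# Both calls pass the same keyword arguments.
unpacked = toy_model(**toy_inputs)
explicit = toy_model(input_ids=toy_inputs["input_ids"],
                     attention_mask=toy_inputs["attention_mask"])
print(unpacked == explicit)
```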

In [47]:
import torch
import torch.nn.functional as F

In [53]:
model.config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.29.2",
  "vocab_size": 30522
}

In [46]:
with torch.no_grad():
    outputs = model(**batch_inputs)
    print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-3.4620,  3.6118],
        [ 4.7508, -3.7899]]), hidden_states=None, attentions=None)


### Inference

In [49]:
scores = F.softmax(outputs.logits, dim=1)
scores

tensor([[8.4631e-04, 9.9915e-01],
        [9.9980e-01, 1.9531e-04]])

In [51]:
labels = torch.argmax(scores, dim=1)
labels

tensor([1, 0])

In [56]:
labels = torch.argmax(scores, dim=1)
labels = [model.config.id2label[label_id] for label_id in labels.tolist()]  # avoid shadowing the built-in `id`
labels

['POSITIVE', 'NEGATIVE']
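The inference steps above (softmax, argmax, label lookup) can also be checked by hand. The sketch below repeats them in pure Python on the logits printed earlier. The `softmax` helper is written out for illustration and is not `F.softmax` itself, and `id2label` is copied from the `model.config` output.

```python
import math

logits = [[-3.4620, 3.6118], [4.7508, -3.7899]]  # copied from the model output above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}         # copied from model.config

def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability.
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

scores = [softmax(row) for row in logits]
label_ids = [row.index(max(row)) for row in scores]  # argmax over the class dimension
labels = [id2label[label_id] for label_id in label_ids]
print(labels)
```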