## Tokenizers

  * Word-based
  * Subword-based
  * Character-based
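
To make the three families concrete, here is a toy sketch in plain Python. The `toy_vocab` set and the `wordpiece_like` helper are hypothetical names made up for illustration. The greedy longest-match loop only imitates the spirit of WordPiece and is not the implementation any tokenizer library ships.

```python
def word_tokenize(text):
    """Word-based: split on whitespace."""
    return text.split()


def char_tokenize(text):
    """Character-based: every character is a token."""
    return list(text)


def wordpiece_like(word, vocab, unk="[UNK]"):
    """Subword-based: greedy longest-match against a vocabulary.

    Continuation pieces carry a '##' prefix, as in BERT vocabularies.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"token", "##ize", "##r", "##s", "a"}  # hypothetical vocabulary

print(word_tokenize("a tokenizers"))            # ['a', 'tokenizers']
print(char_tokenize("token"))                   # ['t', 'o', 'k', 'e', 'n']
print(wordpiece_like("tokenizers", toy_vocab))  # ['token', '##ize', '##r', '##s']
```

The subword approach sits between the other two: the vocabulary stays small, yet a rare word can still be represented by a handful of known pieces.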

## Loading and saving tokenizers

In [2]:
from transformers import BertTokenizer

# Download (or load from cache) the vocabulary that matches the checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [3]:
tokenizer("This is a sample text")

{'input_ids': [101, 1188, 1110, 170, 6876, 3087, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [4]:
from transformers import AutoTokenizer
# AutoTokenizer picks the right tokenizer class from the checkpoint name.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [5]:
tokenizer("This is a sample text")

{'input_ids': [101, 1188, 1110, 170, 6876, 3087, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [6]:
tokenizer("I am following up on HuggingFace's Transformers library")

{'input_ids': [101, 146, 1821, 1378, 1146, 1113, 20164, 10932, 2271, 7954, 112, 188, 25267, 3340, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [7]:
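# Writes the vocabulary and config files to disk; reload them later with
# AutoTokenizer.from_pretrained("directory_for_tokenizer").
tokenizer.save_pretrained("directory_for_tokenizer")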

('directory_for_tokenizer/tokenizer_config.json',
 'directory_for_tokenizer/special_tokens_map.json',
 'directory_for_tokenizer/vocab.txt',
 'directory_for_tokenizer/added_tokens.json',
 'directory_for_tokenizer/tokenizer.json')

## Encoding

Encoding is the act of translating text into numbers.
To do the conversion, the tokenizer needs a vocabulary, which is what gets loaded when we instantiate it with the `from_pretrained` method.
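
The role of the vocabulary can be sketched with a toy lookup table. Everything here (`toy_vocab`, `encode`, `decode`) is hypothetical and far simpler than a real tokenizer, which also splits words into subwords and adds special tokens.

```python
# Hypothetical toy vocabulary; real checkpoints ship tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "I": 1, "love": 2, "hate": 3, "this": 4, "so": 5, "much": 6}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode(text):
    """Split on whitespace, then look each token up in the vocabulary."""
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in text.split()]


def decode(ids):
    """Map ids back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = encode("I love this so much")
print(ids)                     # [1, 2, 4, 5, 6]
print(decode(ids))             # I love this so much
print(encode("I adore this"))  # [1, 0, 4] ("adore" is not in the vocabulary)
```

Because the ids are only meaningful relative to one vocabulary, a model must always be paired with the tokenizer it was trained with.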

### Tokenization - Encoding

In [41]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [53]:
sequence = [["Using a transform network is simple"],
            ["I hate this so much"],
            ["I love this so much"]]

In [54]:
for i in sequence:
    # tokenize() only splits text into tokens; padding is applied by calling tokenizer(...) directly.
    print(tokenizer.tokenize(i[0]))

['Using', 'a', 'transform', 'network', 'is', 'simple']
['I', 'hate', 'this', 'so', 'much']
['I', 'love', 'this', 'so', 'much']


In [55]:
token = tokenizer.tokenize("hi")
print(token)

['hi']


In [56]:
ids = tokenizer.convert_tokens_to_ids(token)
ids

[20844]

### Tokenization - Decoding

In [57]:
decoded_string = tokenizer.decode(ids)
decoded_string

'hi'

In [64]:
for i in sequence:
    sub_token = tokenizer.tokenize(i[0])
    sub_token_ids = tokenizer.convert_tokens_to_ids(sub_token)
    print(sub_token_ids)
    print(tokenizer.decode(sub_token_ids))

[7993, 170, 11303, 2443, 1110, 3014]
Using a transform network is simple
[146, 4819, 1142, 1177, 1277]
I hate this so much
[146, 1567, 1142, 1177, 1277]
I love this so much


#### A better way to batch inputs together

In [65]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)

In [66]:
# This line will fail: input_ids is 1-D, but the model expects a leading batch dimension.
model(input_ids)

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Transformers models expect a batch of sequences by default, so even a single sequence needs a leading batch dimension.
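
What "batch dimension" means can be shown without any tensor library. The `shape` helper below is a hypothetical stand-in for a tensor's `.shape` attribute, and the id list is shortened for illustration.

```python
def shape(nested):
    """Return the shape of a regularly nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


ids = [1045, 1005, 2310]  # shortened id list, for illustration only

print(shape(ids))         # (3,)   one sequence, no batch dimension
print(shape([ids]))       # (1, 3) a batch containing one sequence
print(shape([ids, ids]))  # (2, 3) a batch containing two sequences
```

With PyTorch, the same wrapping can be done with `torch.tensor([ids])` or `torch.tensor(ids).unsqueeze(0)`.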

In [69]:
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])


In [70]:
tokenized_inputs["input_ids"].shape 

torch.Size([1, 16])

The tokenizer quietly added a batch dimension when it tokenized the input (sneaky!), along with the special tokens `101` and `102` at the start and end of the sequence.
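
As a rough sketch of what the tokenizer call did on top of plain tokenization, the snippet below rebuilds the printed `input_ids` by hand. It assumes that `101` and `102` are the `[CLS]` and `[SEP]` ids of this BERT-style vocabulary; the helper name `add_special_tokens_and_batch` is made up for illustration.

```python
CLS_ID, SEP_ID = 101, 102  # assumed [CLS] and [SEP] ids for this vocabulary


def add_special_tokens_and_batch(ids):
    """Wrap the ids in [CLS] ... [SEP], then in an outer list (the batch)."""
    return [[CLS_ID] + ids + [SEP_ID]]


# Ids of the sentence without special tokens (tokenize + convert_tokens_to_ids).
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

batch = add_special_tokens_and_batch(ids)
print(batch)
print(len(batch), len(batch[0]))  # 1 16
```

The 14 token ids become 16 once the two special tokens are added, which matches the `torch.Size([1, 16])` shape printed above.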

In [75]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


In [76]:
# Stack the same sequence twice to form a batch of two.
batched_ids = torch.tensor([ids, ids])
batched_output = model(batched_ids)
print("Batched Logits:", batched_output.logits)

Batched Logits: tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
