<a href="https://colab.research.google.com/github/boi-doingthings/my-transfomers/blob/main/transformers-book/Understanding_Encoders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers datasets


In [2]:
from transformers import AutoTokenizer

## Tokenizers

In [3]:
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)


In [4]:
from google.colab import output
output.enable_custom_widget_manager()

The tokenizer splits the text into tokens (words or sub-words) and maps each `token` to an integer `id`.

In [5]:
text = " A quick brown Fox is running alongside the lazy Dog."

encoded_text = tokenizer(text)
print(encoded_text)

{'input_ids': [101, 1037, 4248, 2829, 4419, 2003, 2770, 4077, 1996, 13971, 3899, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
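The split into sub-words can be illustrated with a minimal sketch of greedy longest-match WordPiece splitting, the scheme used by BERT-style tokenizers. The vocabulary, its ids, and the helper names `wordpiece` and `encode` are hypothetical and chosen only for illustration; the real tokenizer also normalizes and pre-tokenizes the text and uses the 30522-entry vocabulary shipped with the checkpoint.

```python
# Toy vocabulary with made-up ids; "##" marks a piece that continues a word.
TOY_VOCAB = {"[UNK]": 0, "run": 1, "##ning": 2, "along": 3, "##side": 4, "fox": 5}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches, so the whole word maps to [UNK]
        pieces.append(piece)
        start = end
    return pieces


def encode(words, vocab):
    """Map each word to the ids of its pieces."""
    return [vocab[p] for w in words for p in wordpiece(w, vocab)]


print(wordpiece("running", TOY_VOCAB))      # ['run', '##ning']
print(wordpiece("alongside", TOY_VOCAB))    # ['along', '##side']
print(encode(["fox", "zebra"], TOY_VOCAB))  # [5, 0]
```

In the actual DistilBERT vocabulary, `running` and `alongside` are single tokens, as the output above shows; the toy vocabulary omits them only to make the splitting visible.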


To view the tokens behind the IDs, convert them back:

In [6]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text['input_ids'])

In [7]:
tokens

['[CLS]',
 'a',
 'quick',
 'brown',
 'fox',
 'is',
 'running',
 'alongside',
 'the',
 'lazy',
 'dog',
 '.',
 '[SEP]']

The `[CLS]` and `[SEP]` tokens have been added to the start and end of the sequence. These special tokens differ from model to model, but their main role is to mark where a sequence starts and ends.
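A minimal sketch of how these special tokens wrap a sequence, assuming the BERT-style layout `[CLS] A [SEP]` for one sequence and `[CLS] A [SEP] B [SEP]` for a pair. The ids 101 and 102 are taken from the encoded output above; the helper name `add_special_tokens` is hypothetical and not part of the library.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] seen in the encoded output


def add_special_tokens(ids_a, ids_b=None):
    """Wrap one sequence, or a pair of sequences, in BERT-style special tokens."""
    out = [CLS_ID] + list(ids_a) + [SEP_ID]
    if ids_b is not None:
        out += list(ids_b) + [SEP_ID]  # a second sequence gets its own closing [SEP]
    return out


print(add_special_tokens([1037, 4248, 2829, 4419]))
# [101, 1037, 4248, 2829, 4419, 102]
print(add_special_tokens([1037, 4248], [2829, 4419]))
# [101, 1037, 4248, 102, 2829, 4419, 102]
```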

In [8]:
print(tokenizer.convert_tokens_to_string(tokens))


[CLS] a quick brown fox is running alongside the lazy dog. [SEP]


Since we used `"distilbert-base-uncased"`, the case information in `Fox` and `Dog` is lost.
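The loss of case comes from the normalization step that uncased checkpoints apply before splitting the text. The sketch below approximates it with lowercasing plus accent stripping; the function name `uncased_normalize` is hypothetical, and the real tokenizer performs further cleanup that is not reproduced here.

```python
import unicodedata


def uncased_normalize(text):
    """Approximate uncased normalization: lowercase, then drop accent marks."""
    text = text.lower()
    # NFD separates base characters from their combining accent marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("A quick brown Fox"))  # a quick brown fox
print(uncased_normalize("Café"))               # cafe
```

Because this step is not reversible, converting ids back to text can only return the lowercase form.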

In [9]:
## View Tokenizer properties
### Vocab

dict_samples = list(tokenizer.get_vocab().items())[:10]
dict_samples

[('『', 1643),
 ('converts', 19884),
 ('##shah', 25611),
 ('chronological', 23472),
 ('##grin', 24860),
 ('airlines', 7608),
 ('keane', 27228),
 ('unmarried', 17204),
 ('軍', 1955),
 ('laurel', 11893)]

In [10]:
tokenizer.vocab_size  # Number of tokens in the base vocabulary

30522

In [11]:
# The unknown token, used for input that falls outside the tokenizer vocabulary.
print(tokenizer.unk_token)
tokenizer.unk_token_id

[UNK]


100

In [12]:
print(tokenizer.mask_token)
tokenizer.mask_token_id

[MASK]


103

In [13]:
print(tokenizer.model_max_length)  # Maximum sequence length, in tokens, that the model accepts

512


### Tokenizing whole datasets

In [25]:
def tokenize(batch):
    # padding=True pads shorter sequences to the length of the longest sequence in the batch.
    # truncation=True cuts sequences that exceed the model's maximum context size (512 here).
    return tokenizer(batch["text"], padding=True, truncation=True)

In [26]:
## Getting a dataset

from datasets import load_dataset

In [27]:
emotions = load_dataset("emotion")




In [28]:
emotions['train'][:10]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
  'i am feeling grouchy',
  'ive been feeling a little burdened lately wasnt sure why that was',
  'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny',
  'i feel as confused about life as a teenager or as jaded as a year old man',
  'i have been with petronas for years i feel that petronas has performed well and made a huge profit',
  'i feel romantic too'],
 'label': [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]}

In [34]:
tokenize(emotions["train"][:4])

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102], [101, 10047, 9775, 1037, 3371, 2000, 2695, 1045, 2514, 20505, 3308, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2572, 2412, 3110, 16839, 9080, 12863, 2055, 1996, 13788, 1045, 2097, 2113, 2008, 2009, 2003, 2145, 2006, 1996, 3200, 102, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]}
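The trailing zeros in `input_ids` and the matching zeros in `attention_mask` come from padding to the longest sequence in the batch. The following is a minimal pure-Python sketch of that behaviour, not the library's implementation: the helper name `pad_and_truncate` is hypothetical, the padding id 0 is read off the output above, and truncation is simplified to keeping the first tokens plus the final `[SEP]`.

```python
PAD_ID = 0  # padding id visible in the output above


def pad_and_truncate(batch_ids, max_length=512, pad_id=PAD_ID):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = []
    for ids in batch_ids:
        if len(ids) > max_length:
            # Simplification: keep the leading tokens and the closing [SEP].
            ids = ids[: max_length - 1] + ids[-1:]
        truncated.append(list(ids))
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 1045, 2134, 2102, 2514, 26608, 102],
    [101, 1045, 2572, 3110, 102],
]
print(pad_and_truncate(batch))
# {'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102],
#                [101, 1045, 2572, 3110, 102, 0, 0]],
#  'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0, 0]]}
print(pad_and_truncate(batch, max_length=4)["input_ids"])
# [[101, 1045, 2134, 102], [101, 1045, 2572, 102]]
```

Padding per batch rather than to the full 512 tokens keeps the tensors small, at the cost that different batches can end up with different lengths.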