<a href="https://colab.research.google.com/github/archietech-ai/tokenizer/blob/main/Tokenizer_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 **Understanding Tokenizers with BERT**

This notebook shows how to use a BERT tokenizer to turn text into tokens and token IDs, the numeric input that the model consumes.

## 🛠️ Setup and Installation

First, we need to install the libraries we will use.

In [2]:
!pip install pandas
!pip install transformers




## 📚 Importing Libraries

We import the libraries necessary for our tasks.

In [3]:
# Import required libraries
from transformers import BertModel, AutoTokenizer
import pandas as pd


## 🤖 Model Setup

We load a pre-trained BERT model and its tokenizer.

In [4]:
# Specify the pre-trained model to use: BERT-base-cased
model_name = "bert-base-cased"

In [5]:
# Instantiate the model and tokenizer for the specified pre-trained model
model = BertModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer




Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [36]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  


## 📝 Tokenizing Text

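We use the tokenizer to split a sentence into tokens. BERT uses WordPiece tokenization: a word that is not in the vocabulary as a whole is split into subword pieces, and a `##` prefix marks a piece that continues the previous one.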

In [8]:
# Set a sentence for analysis
sentence = "Working on AI is amazing, use it responsibly."


In [9]:
# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)
tokens

['Working',
 'on',
 'AI',
 'is',
 'amazing',
 ',',
 'use',
 'it',
 're',
 '##sp',
 '##ons',
 '##ibly',
 '.']
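
The `##` pieces above come from a greedy longest-match-first lookup against the vocabulary. The sketch below is a simplified illustration of that idea, not the library's implementation: `wordpiece` and `toy_vocab` are hypothetical names, the vocabulary is a tiny hand-picked subset, and the normalization and punctuation splitting that the real tokenizer performs first are skipped.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece-style pieces (simplified sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary containing only the pieces seen above.
toy_vocab = {"Working", "on", "it", "re", "##sp", "##ons", "##ibly"}

print(wordpiece("responsibly", toy_vocab))  # ['re', '##sp', '##ons', '##ibly']
print(wordpiece("Working", toy_vocab))      # ['Working']
```

With the real vocabulary of 28,996 entries the same procedure has many more candidate pieces to choose from, which is why common words such as `amazing` stay whole.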


## 📘 Vocabulary and Token IDs

We create a DataFrame to see the tokenizer's vocabulary and sort it by token IDs.

In [21]:
# Get the tokenizer's vocabulary (a token -> token ID mapping) and check its size
vocab = tokenizer.vocab
len(vocab)


28996

In [23]:
vocab


{'1933': 3698,
 '132': 14588,
 'either': 1719,
 '1874': 7079,
 '##TS': 11365,
 'Harbour': 10770,
 '##hor': 13252,
 '##mise': 26806,
 'eliminating': 16520,
 'cost': 2616,
 'Universities': 14482,
 'hopefully': 16121,
 'Blake': 5887,
 'usage': 7991,
 '##oka': 9865,
 'force': 2049,
 '##hurst': 10623,
 'Manny': 17381,
 'disqualified': 20200,
 'accurate': 8026,
 '##rika': 21513,
 'nickname': 8002,
 '##Ι': 28322,
 'Khorasan': 25961,
 'Direction': 17055,
 '##dust': 27650,
 'winters': 17415,
 '##ades': 16913,
 'Hannah': 8014,
 '320': 14116,
 'concurrently': 18061,
 'symmetry': 16558,
 'Web': 9059,
 'Tito': 22754,
 'grabbing': 10810,
 'bisexual': 28121,
 'Chiang': 19110,
 'DL': 26624,
 '##lten': 26929,
 'spaced': 22445,
 '##ı': 12262,
 '##ilian': 27308,
 '##guchi': 17471,
 'mute': 26782,
 'Retrieved': 4996,
 'ウ': 934,
 '##nut': 12251,
 'Cambodian': 27463,
 'Saunders': 16029,
 'eyed': 7074,
 'repertoire': 14674,
 'tipped': 11213,
 'surgeon': 10690,
 'issue': 2486,
 'Gotta': 26505,
 'establishment

In [None]:
vocab_df = pd.DataFrame({"token": vocab.keys(), "token_id": vocab.values()})
vocab_df = vocab_df.sort_values(by="token_id").set_index("token_id")

vocab_df.head(300)

## 🔍 Encoding and Decoding

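Encode the sentence into token IDs. Besides mapping each token to its ID, `encode` adds BERT's special tokens: `[CLS]` (ID 101) at the start and `[SEP]` (ID 102) at the end. Decoding the IDs back to text is shown further below.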

In [24]:
# Encode the sentence into token_ids using the tokenizer
token_ids = tokenizer.encode(sentence)
token_ids

[101,
 9612,
 1113,
 19016,
 1110,
 6929,
 117,
 1329,
 1122,
 1231,
 20080,
 4199,
 15298,
 119,
 102]


## 🔎 Compare Token Lengths

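Compare the number of tokens with the number of token IDs. The ID list is two entries longer because `encode` adds `[CLS]` and `[SEP]`, while `tokenize` does not.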


In [25]:

# Print the length of tokens and token_ids
print("Number of tokens:", len(tokens))
print("Number of token IDs:", len(token_ids))


Number of tokens: 13
Number of token IDs: 15
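
The difference of two can be checked directly. This is a small sketch that uses values copied from this notebook's outputs, so it runs without loading the model; the names `CLS_ID`, `SEP_ID`, and `inner_ids` are illustrative.

```python
# Values copied from the outputs shown in this notebook.
tokens = ["Working", "on", "AI", "is", "amazing", ",", "use", "it",
          "re", "##sp", "##ons", "##ibly", "."]
token_ids = [101, 9612, 1113, 19016, 1110, 6929, 117, 1329, 1122,
             1231, 20080, 4199, 15298, 119, 102]

CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] in bert-base-cased

# The first and last IDs are the special tokens added by `encode`.
has_special = token_ids[0] == CLS_ID and token_ids[-1] == SEP_ID

# Dropping them leaves exactly one ID per token.
inner_ids = token_ids[1:-1]
extra = len(token_ids) - len(tokens)

print(has_special, len(inner_ids) == len(tokens), extra)  # True True 2
```

In the notebook itself, `tokenizer.encode(sentence, add_special_tokens=False)` should return only the inner IDs.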



## 🔄 Explore Token Data

Look at specific tokens by their IDs.

In [26]:
# Look up tokens by row position with iloc; after sorting by token_id, row position
# equals token ID because the IDs run from 0 to 28995 without gaps
print("Token at position 101:", vocab_df.iloc[101])
print("Token at position 102:", vocab_df.iloc[102])

Token at position 101: token    [CLS]
Name: 101, dtype: object
Token at position 102: token    [SEP]
Name: 102, dtype: object
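
A DataFrame is one way to look up a token by its ID; an inverted dictionary does the same job. The sketch below uses a small excerpt of the vocabulary copied from this notebook's outputs, and `vocab_excerpt` and `id_to_token` are illustrative names.

```python
# A small excerpt of the vocabulary, copied from outputs in this notebook.
vocab_excerpt = {"[CLS]": 101, "[SEP]": 102, "Working": 9612, "on": 1113, "AI": 19016}

# Invert the token -> ID mapping to get an ID -> token mapping.
id_to_token = {token_id: token for token, token_id in vocab_excerpt.items()}

print(id_to_token[101], id_to_token[102])  # [CLS] [SEP]
print([id_to_token[i] for i in [101, 9612, 1113, 19016, 102]])
# ['[CLS]', 'Working', 'on', 'AI', '[SEP]']
```

For the full vocabulary, the tokenizer's own `convert_ids_to_tokens` method performs this lookup.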


## 📃 Token and ID Pairing

Show pairs of tokens and their IDs.

In [27]:
# Zip tokens and token_ids (excluding the first and last token_ids for [CLS] and [SEP])
list(zip(tokens, token_ids[1:-1]))

[('Working', 9612),
 ('on', 1113),
 ('AI', 19016),
 ('is', 1110),
 ('amazing', 6929),
 (',', 117),
 ('use', 1329),
 ('it', 1122),
 ('re', 1231),
 ('##sp', 20080),
 ('##ons', 4199),
 ('##ibly', 15298),
 ('.', 119)]

In [28]:
# Decode the token_ids (excluding the first and last token_ids for [CLS] and [SEP]) back into the original sentence
tokenizer.decode(token_ids[1:-1])

'Working on AI is amazing, use it responsibly.'
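
Decoding has to undo the subword split and tidy the spacing around punctuation. The following is a simplified sketch of that idea, not the library's actual decoder: `detokenize` is a hypothetical helper that only handles the `##` prefix and four punctuation marks.

```python
def detokenize(tokens):
    """Rebuild text from WordPiece-style tokens (simplified sketch)."""
    # Join with spaces, then glue continuation pieces onto the previous piece.
    text = " ".join(tokens).replace(" ##", "")
    # Remove the space that the join placed before common punctuation.
    for mark in [",", ".", "!", "?"]:
        text = text.replace(" " + mark, mark)
    return text


# Tokens copied from the tokenizer output earlier in this notebook.
tokens = ["Working", "on", "AI", "is", "amazing", ",", "use", "it",
          "re", "##sp", "##ons", "##ibly", "."]

print(detokenize(tokens))  # Working on AI is amazing, use it responsibly.
```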

In [29]:
# Tokenize the sentence using the tokenizer's `__call__` method
tokenizer_out = tokenizer(sentence)
tokenizer_out

{'input_ids': [101, 9612, 1113, 19016, 1110, 6929, 117, 1329, 1122, 1231, 20080, 4199, 15298, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


## 🧩 Handling Multiple Sentences

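Tokenize two sentences together as a batch with padding, then decode each one. Padding appends `[PAD]` tokens (ID 0) to the shorter sentence so both ID lists have the same length, and the `attention_mask` marks the padded positions with 0.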

In [32]:
# Build a second sentence by removing "don't " from new_sentence
new_sentence = " Working on AI is Amazing, don't work on it irresponsibly."
sentence2 = new_sentence.replace("don't ", "")
sentence2

' Working on AI is Amazing, work on it irresponsibly.'

In [33]:
# Tokenize both sentences with padding
tokenizer_out2 = tokenizer([sentence, sentence2], padding=True)
tokenizer_out2

{'input_ids': [[101, 9612, 1113, 19016, 1110, 6929, 117, 1329, 1122, 1231, 20080, 4199, 15298, 119, 102, 0, 0], [101, 9612, 1113, 19016, 1110, 16035, 117, 1250, 1113, 1122, 178, 11604, 20080, 4199, 15298, 119, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
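
What `padding=True` does to this batch can be reproduced by hand. This is a minimal sketch under the assumption that the pad ID is 0, as the output above shows; `pad_batch` is a hypothetical helper, and the ID lists are copied from this notebook's outputs.

```python
def pad_batch(batch, pad_id=0):
    """Pad every ID list to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ID lists, copied from the outputs in this notebook.
ids_a = [101, 9612, 1113, 19016, 1110, 6929, 117, 1329, 1122,
         1231, 20080, 4199, 15298, 119, 102]
ids_b = [101, 9612, 1113, 19016, 1110, 16035, 117, 1250, 1113, 1122,
         178, 11604, 20080, 4199, 15298, 119, 102]

padded = pad_batch([ids_a, ids_b])
print(padded["input_ids"][0][-3:])       # [102, 0, 0]
print(padded["attention_mask"][0][-3:])  # [1, 0, 0]
```

The second sentence is already the longest in the batch, so it is left unchanged and its mask is all ones.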

In [34]:
# Decode the input_ids of the first sentence; the special tokens and padding are kept
tokenizer.decode(tokenizer_out2["input_ids"][0])

'[CLS] Working on AI is amazing, use it responsibly. [SEP] [PAD] [PAD]'

In [35]:
tokenizer.decode(tokenizer_out2["input_ids"][1])

'[CLS] Working on AI is Amazing, work on it irresponsibly. [SEP]'


## 🌟 Conclusion

This notebook showed how to use a BERT tokenizer to process text, turning it into tokens and IDs, and how to handle multiple sentences. Feel free to change the sentences or explore more functions of the tokenizer.