### Tokenization

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 3.80 GiB of which 91.31 MiB is free. Including non-PyTorch memory, this process has 3.70 GiB memory in use. Of the allocated memory 3.62 GiB is allocated by PyTorch, and 1.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=20
)

# Print the output
print(tokenizer.decode(generation_output[0]))

In [None]:
# Decode each token ID separately to see how the prompt was split
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

There are four types of tokenizers:

1. Word tokenizer
2. Subword tokenizer
3. Character tokenizer
4. Byte-level tokenizer
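
To make the four types concrete, here is a minimal pure-Python sketch of each one. It is only an illustration: `TOY_SUBWORDS` is a made-up vocabulary, and the greedy matching in `subword_tokenize` is a simplification, not how any of the trained tokenizers in these notes actually work.

```python
def word_tokenize(text):
    # Word tokenizer: one token per whitespace-separated word.
    return text.split()

def char_tokenize(text):
    # Character tokenizer: one token per Unicode character.
    return list(text)

def byte_tokenize(text):
    # Byte-level tokenizer: one token per UTF-8 byte (values 0-255).
    return list(text.encode("utf-8"))

# Hypothetical toy vocabulary, made up for this example only.
TOY_SUBWORDS = {"token", "iz", "er", "s"}

def subword_tokenize(word, vocab=TOY_SUBWORDS):
    # Subword tokenizer (simplified): greedily take the longest piece found
    # in the vocabulary, falling back to a single character.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

print(word_tokenize("tokenizers 🎵"))   # whole words
print(subword_tokenize("tokenizers"))   # word pieces
print(char_tokenize("🎵鸟"))            # single characters
print(byte_tokenize("🎵"))              # several bytes for one emoji
```

Note that the byte-level version never needs an unknown token, because any text can be written as a sequence of bytes.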



### Comparing Trained LLM Tokenizers

In [None]:
text = """

English and CAPITALIZATION

🎵鸟
show_tokens False None elif == >= else: two tabs:" " Three tabs: "   "

12.0*50=600

"""

In [None]:
colors_list = [
    '102;194;165', '252;141;98', '141;160;203', 
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +  tokenizer.decode(t) + '\x1b[0m', end=' '
        )

BERT base model:

Tokenization method: WordPiece

Vocabulary size: 30,522 (uncased)

Special tokens: unk_token, sep_token, pad_token, cls_token, mask_token

There were two versions of BERT: cased and uncased.

With the uncased version:

Newline breaks were dropped, so the model could not access any information about line breaks.

All the text was lowercased.

The ## prefix meant that the token was attached to the previous token with no space between them, for example "capital ##ization".

The cased version differed mainly in keeping uppercase tokens.
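
The `##` behaviour can be sketched with a greedy longest-match split, which is the general idea behind WordPiece-style tokenization. This is a simplified illustration, not BERT's actual implementation: `TOY_VOCAB` is a hypothetical vocabulary, and the whitespace split stands in for BERT's real pre-tokenization.

```python
# Hypothetical toy vocabulary; the real uncased vocabulary has 30,522 entries.
TOY_VOCAB = {"capital", "##ization", "english", "and", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one word into WordPiece-style tokens."""
    word = word.lower()  # mimic the uncased variant
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the previous token
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no match: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

def uncased_tokenize(text):
    # Splitting on whitespace discards newlines, as noted for uncased BERT.
    return [token for word in text.split() for token in wordpiece(word)]

print(uncased_tokenize("English and\nCAPITALIZATION 🎵"))
```

With this toy vocabulary, the newline disappears, everything is lowercased, "CAPITALIZATION" becomes "capital" plus "##ization", and the emoji falls back to the unknown token.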






GPT-2 tokenizer:

Vocabulary size: 50,257

Tokenization method: byte-level BPE

Unlike uncased BERT, it preserved:

newline breaks

capitalization

Emojis were tokenized rather than dropped, and two tabs were represented as two tokens.








Flan-T5 tokenizer:

Vocabulary size: 32,100

Tokenization method: SentencePiece

It had no newline or whitespace tokens, and emojis and Chinese characters were not tokenized (they were mapped to the unknown token).



GPT-4 tokenizer:

Vocabulary size: a little over 100,000

Tokenization method: BPE

It represented four spaces as a single token.

The Python keyword elif had its own token, and the tokenizer generally used fewer tokens to represent the same text.

StarCoder2 tokenizer:

Vocabulary size: 49,152

Tokenization method: BPE

To track functions across the different files of a repository, it had special tokens such as:

filename
reponame
gh_stars




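Phi-3 and Llama 2 tokenizers:

Vocabulary size: 32,000

Tokenization method: BPE

Phi-3 reused the Llama 2 tokenizer and added special chat tokens for the user, assistant, and system roles, such as the <|assistant|> token used in the prompt earlier.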



### Tokenization Properties

The main properties that determine how a tokenizer behaves are:

Vocabulary size

Special tokens

Capitalization

Even with the same tokenization method and parameters, a tokenizer behaves differently depending on the dataset it was trained on.

For example, a tokenizer trained on a dataset containing code might have dedicated tokens for functions, because it saw more of those during training.
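
The effect of the training data can be shown with a toy BPE trainer. This is a minimal sketch under simplifying assumptions (no end-of-word marker, no byte fallback, tiny made-up corpora), not the training procedure of any tokenizer listed above.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    words = Counter(tuple(word) for word in corpus)  # characters of each word -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[tuple(merge_pair(symbols, best))] += freq
        words = merged_words
    return merges

def apply_merges(word, merges):
    """Tokenize one word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Made-up corpora: one code-like, one prose-like.
code_merges = learn_merges(["def"] * 4 + ["elif"] * 3 + ["return"] * 2, num_merges=3)
prose_merges = learn_merges(["the"] * 4 + ["garden"] * 3 + ["mishap"] * 2, num_merges=3)

print(apply_merges("def", code_merges))   # tokenizer trained on code
print(apply_merges("def", prose_merges))  # tokenizer trained on prose
```

The same algorithm and the same number of merges give `def` a single token when trained on the code-like corpus, while the prose-trained version splits it into individual characters.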

### Token Embeddings







In [None]:

from transformers import AutoModel, AutoTokenizer

# Load a tokenizer
# (same checkpoint as the model, so the token IDs match the model's vocabulary)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]

In [None]:
output.shape

In [None]:
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

We can also encode a whole sentence into a single embedding vector:



In [None]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Convert text to text embeddings
vector = model.encode("Best movie ever!")

In [None]:
vector.shape

Our sentence has been encoded into a single 768-dimensional vector.
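
One way to picture the step from one vector per token to one vector per sentence is pooling. The sketch below assumes mean pooling, which is a common choice; the pooling a particular sentence-transformers model applies is defined by its own configuration and is not shown in these notes. The token vectors are made up and only four-dimensional to keep the example readable.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Made-up token embeddings for a three-token sentence (illustration only).
toy_token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
    [2.0, 4.0, 1.0, 4.0],
]

sentence_vector = mean_pool(toy_token_vectors)
print(sentence_vector)       # [2.0, 2.0, 1.0, 4.0]
print(len(sentence_vector))  # same dimension as each token vector
```

The sentence vector has the same dimension as each token vector regardless of how many tokens the sentence contains, which is why every sentence ends up with a vector of the same size.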