# Grok Code

* https://grok.com/share/bGVnYWN5_58f68c42-5519-4192-9606-4b22a892dbad



# Tokenizers

For this Colab session, we explore the world of Tokenizers.

You can run this notebook on a free CPU, or locally on your box if you prefer.


## Reminder: 2 important pro-tips for using Colab:

**Pro-tip 1:**

The top of every colab has some pip installs. You may receive errors from pip when you run this, such as:

> gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.3.0 which is incompatible.

These pip compatibility errors can be safely ignored. It's tempting to try to fix them by changing version numbers, but that will actually introduce real problems!

**Pro-tip 2:**

In the middle of running a Colab, you might get an error like this:

> Runtime error: CUDA is required but not available for bitsandbytes. Please consider installing [...]

This is a super-misleading error message! Please don't try changing versions of packages...

This actually happens because Google has switched out your Colab runtime, perhaps because Google Colab was too busy. The solution is:

1. Kernel menu >> Disconnect and delete runtime
2. Reload the colab from fresh and Edit menu >> Clear All Outputs
3. Connect to a new T4 using the button at the top right
4. Select "View resources" from the menu on the top right to confirm you have a GPU
5. Rerun the cells in the colab, from the top down, starting with the pip installs

And all should work great - otherwise, ask me!
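
Step 4 above can also be checked from a cell. This is a minimal sketch, assuming the runtime exposes the standard `nvidia-smi` command when a GPU is attached; the helper name `gpu_available` is made up for illustration and is not part of Colab.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is on the PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: Tesla T4 (...)"
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if gpu_available():
    print("GPU detected")
else:
    print("No GPU detected: reconnect to a T4 runtime and rerun from the top")
```

This notebook itself runs fine on a CPU, so a negative result here only matters for the GPU-based colabs.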

In [1]:
!pip install -q transformers

In [2]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer

# Sign in to Hugging Face

1. If you haven't already done so, create a free HuggingFace account at https://huggingface.co and navigate to Settings, then Create a new API token, giving yourself write permissions

**IMPORTANT** when you create your HuggingFace API key, please be sure to select read/write permissions for your key by clicking on the WRITE tab, otherwise you may get problems later.

2. Press the "key" icon on the side panel to the left, and add a new secret:
`HF_TOKEN = your_token`

3. Execute the cell below to log in.

In [3]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)
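
If you run this notebook locally instead of in Colab, `google.colab` is not available. The sketch below is one possible fallback, assuming you have exported your token in an `HF_TOKEN` environment variable; the helper name `get_hf_token` is hypothetical.

```python
import os


def get_hf_token():
    """Read the token from Colab secrets if possible, else from the environment."""
    try:
        # Only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except ImportError:
        # Local run: fall back to the HF_TOKEN environment variable (may be None)
        return os.environ.get("HF_TOKEN")


token = get_hf_token()
print("Token found" if token else "No token found: set HF_TOKEN first")
```

The result can then be passed to `login(...)` exactly as in the cell above.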

# Accessing Llama 3.1 from Meta

In order to use the fantastic Llama 3.1, Meta does require you to sign their terms of service.

Visit their model instructions page in Hugging Face:
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

At the top of the page are instructions on how to agree to their terms. If possible, you should use the same email as your huggingface account.

In my experience approval comes in a couple of minutes. Once you've been approved for any 3.1 model, it applies to the whole 3.1 family of models. For whatever reason, occasionally Meta doesn't approve access. If that happens to you, please follow [this](https://colab.research.google.com/drive/1deJO03YZTXUwcq2vzxWbiBhrRuI29Vo8?usp=sharing) troubleshooting.

If the next cell gives you an error, then please check:  
1. Are you logged in to HuggingFace? Try running `login()` to check your key works
2. Did you set up your API key with full read and write permissions?
3. If you visit the Llama3.1 page with the link above, does it show that you have access to the model near the top?

I've also set up this troubleshooting colab to try to diagnose any HuggingFace connectivity issues:  
https://colab.research.google.com/drive/1deJO03YZTXUwcq2vzxWbiBhrRuI29Vo8?usp=sharing


# Tokenizer
* https://grok.com/share/bGVnYWN5_76eb5de6-8f55-4fa6-8cfd-b25c13f4d817

In [4]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B', trust_remote_code=True)

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

In [5]:
# Rule of thumb: roughly 4 characters of English text per token
text = "I am excited to show Tokenizers in action to my LLM engineers"
tokens = tokenizer.encode(text)
tokens

[128000,
 40,
 1097,
 12304,
 311,
 1501,
 9857,
 12509,
 304,
 1957,
 311,
 856,
 445,
 11237,
 25175]

In [6]:
len(tokens)

15
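
To see how close this sentence comes to the "4 characters per token" rule of thumb, here is a small check that needs no model download. It reuses the token IDs printed above; treating `128000` as the begin-of-text token follows from the decoded output in the next cell.

```python
text = "I am excited to show Tokenizers in action to my LLM engineers"

# Token IDs copied from the Llama 3.1 output above
token_ids = [128000, 40, 1097, 12304, 311, 1501, 9857, 12509,
             304, 1957, 311, 856, 445, 11237, 25175]

BOS_ID = 128000  # <|begin_of_text|>, added by the tokenizer, not part of the text

# Exclude the special token so we only count tokens that carry text
content_ids = [t for t in token_ids if t != BOS_ID]

chars_per_token = len(text) / len(content_ids)
print(f"{len(text)} characters / {len(content_ids)} tokens "
      f"= {chars_per_token:.2f} characters per token")
```

For this sentence the ratio lands a little above 4; other text, especially code or non-English text, can differ a lot.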

In [7]:
tokenizer.decode(tokens)

'<|begin_of_text|>I am excited to show Tokenizers in action to my LLM engineers'

In [8]:
tokenizer.batch_decode(tokens)

['<|begin_of_text|>',
 'I',
 ' am',
 ' excited',
 ' to',
 ' show',
 ' Token',
 'izers',
 ' in',
 ' action',
 ' to',
 ' my',
 ' L',
 'LM',
 ' engineers']
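
Notice that most pieces carry their leading space. A quick sketch, using only the pieces printed above, shows that joining them (minus the special token) gives back the original text exactly:

```python
text = "I am excited to show Tokenizers in action to my LLM engineers"

# Pieces copied from the batch_decode output above
pieces = ['<|begin_of_text|>', 'I', ' am', ' excited', ' to', ' show',
          ' Token', 'izers', ' in', ' action', ' to', ' my',
          ' L', 'LM', ' engineers']

# Drop the begin-of-text marker and concatenate; no separator is needed
# because the spaces live inside the tokens themselves
reconstructed = "".join(pieces[1:])

print(reconstructed == text)

# Pieces without a leading space continue the previous word
continuations = [p for p in pieces[2:] if not p.startswith(" ")]
print(continuations)
```

The continuation pieces are where a word was split into more than one token ("Tokenizers" and "LLM" here).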

In [9]:
tokenizer.vocab
# tokenizer.get_added_vocab()

{'Ġattribution': 63124,
 'ĠPartialEq': 56139,
 '800': 4728,
 'Ġpillar': 62307,
 '.Border': 15313,
 'ĠÃ§Ä±kÄ±ÅŁ': 124119,
 'ĠSTDMETHOD': 40865,
 'âĶĲ': 116707,
 '"><?': 8227,
 'ĠRM': 31915,
 'ĠHer': 6385,
 'ĠkanÄ±': 121138,
 'CanBe': 70685,
 'Threshold': 38941,
 '-initial': 49067,
 'Ġexe': 48293,
 'adem': 60472,
 '%",Ċ': 21531,
 'ĠFlatten': 86738,
 'Ġoptimized': 34440,
 'ĠHed': 75263,
 'ĠÐ¿Ð¾Ð»Ð½Ð¾ÑģÑĤÑĮÑİ': 113674,
 'Ġìłķë¶Ģ': 118951,
 '<{': 39691,
 'ĠAnders': 48693,
 'Ġypos': 71693,
 'Composite': 42785,
 '/maps': 37893,
 'BYTES': 98949,
 'Ġnn': 11120,
 'ëŀ«': 124689,
 'usable': 23620,
 'ĠÑĢÐ°ÑģÑĤÐ²Ð¾ÑĢ': 112445,
 'ĠFUNC': 44452,
 'ĠÐ½ÐµÐ±Ð¾Ð»ÑĮ': 120229,
 'zas': 51455,
 'ÂłÂłÂłÂłÂł': 109719,
 'ĠÐ´Ð¾Ðº': 102255,
 '/ĊĊĊ': 76358,
 'ROUTE': 100211,
 'ãģĻãĤĮãģ°': 126693,
 'abyte': 67811,
 'est': 478,
 'contra': 97843,
 'ĠSusan': 31033,
 'FRAME': 55203,
 'Science': 36500,
 'Ð¾Ñħ': 101897,
 'CÃ´ng': 113369,
 'merge': 19590,
 'Todo': 25151,
 '/me': 51999,
 'ÑĭÐ¿': 114206,
 'ĠLoginPage': 64511
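
The odd-looking keys (`Ġ`, `Ċ`, `Ã´` and so on) are not corruption. They look like the byte-to-character mapping used by GPT-2 style byte-level BPE, where every byte is shown as a printable character. The sketch below rebuilds that mapping from scratch and reverses it; it assumes this vocabulary uses the same mapping, and `readable` is a made-up helper name, not a `transformers` function.

```python
def bytes_to_unicode():
    """GPT-2 style table: each of the 256 byte values -> one printable character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Unprintable bytes (space, newline, ...) are shifted to 256 and above
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Reverse table: printable character -> original byte value
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def readable(vocab_entry: str) -> str:
    """Turn a vocab key back into the text it stands for."""
    raw = bytes(BYTE_DECODER[ch] for ch in vocab_entry)
    return raw.decode("utf-8", errors="replace")


for entry in ["Ġpillar", '%",Ċ', "CÃ´ng", "ĠÐ´Ð¾Ðº"]:
    print(repr(entry), "->", repr(readable(entry)))
```

Under this mapping `Ġ` stands for a space and `Ċ` for a newline, and the accented runs are multi-byte UTF-8 characters shown one byte at a time.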

# Instruct variants of models

Many models have a variant that has been trained for use in Chats.  
These are typically labelled with the word "Instruct" at the end.  
They have been trained to expect prompts with a particular format that includes system, user and assistant prompts.  

There is a utility method `apply_chat_template` that will convert from the messages list format we are familiar with, into the right input prompt for this model.

In [10]:
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

In [11]:

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>




# Trying new models

We will now work with 3 models:

* Phi3 from Microsoft
* Qwen2 from Alibaba Cloud
* Starcoder2 from BigCode (ServiceNow + HuggingFace + NVIDIA)

In [12]:
PHI3_MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
QWEN2_MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
STARCODER2_MODEL_NAME = "bigcode/starcoder2-3b" # BigCode: 3 companies (HuggingFace, ServiceNow, NVIDIA)

In [13]:
phi3_tokenizer = AutoTokenizer.from_pretrained(PHI3_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
# use llama 3.1:
print("llama 3.1")
print(tokenizer.encode(text))
print("PHI3")
# use phi3:
tokens = phi3_tokenizer.encode(text)
print(tokens)
print()
print(phi3_tokenizer.batch_decode(tokens))


tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

llama 3.1
[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]
PHI3
[306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 6012, 414]

['I', 'am', 'excited', 'to', 'show', 'Token', 'izers', 'in', 'action', 'to', 'my', 'L', 'LM', 'engine', 'ers']


In [14]:
print("use llama 3.1:\n")
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("use phi3 model: \n")
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

use llama 3.1:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>


use phi3 model: 

<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>



## Pick the right tokenizer for each model; a mismatch will give you garbage
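
The IDs printed earlier already show why. The sketch below builds two tiny ID-to-piece lookups from the Llama 3.1 and Phi-3 outputs above (only the tokens in our sentence, not the full vocabularies) and looks for IDs that appear in both.

```python
# IDs and pieces copied from the Llama 3.1 cells above
llama_ids = [128000, 40, 1097, 12304, 311, 1501, 9857, 12509,
             304, 1957, 311, 856, 445, 11237, 25175]
llama_pieces = ['<|begin_of_text|>', 'I', ' am', ' excited', ' to', ' show',
                ' Token', 'izers', ' in', ' action', ' to', ' my',
                ' L', 'LM', ' engineers']

# IDs and pieces copied from the Phi-3 cell above
phi3_ids = [306, 626, 24173, 304, 1510, 25159, 19427, 297,
            3158, 304, 590, 365, 26369, 6012, 414]
phi3_pieces = ['I', 'am', 'excited', 'to', 'show', 'Token', 'izers', 'in',
               'action', 'to', 'my', 'L', 'LM', 'engine', 'ers']

# Partial lookups: token ID -> text piece
llama_vocab = dict(zip(llama_ids, llama_pieces))
phi3_vocab = dict(zip(phi3_ids, phi3_pieces))

# The same ID can mean different text depending on the tokenizer
shared_ids = set(llama_vocab) & set(phi3_vocab)
for token_id in sorted(shared_ids):
    print(f"ID {token_id}: Llama 3.1 -> {llama_vocab[token_id]!r}, "
          f"Phi-3 -> {phi3_vocab[token_id]!r}")
```

A model only knows the IDs it was trained with, so feeding it IDs from another tokenizer means feeding it different words than you intended.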

In [16]:
qwen2_tokenizer = AutoTokenizer.from_pretrained(QWEN2_MODEL_NAME)

text = "I am excited to show Tokenizers in action to my LLM engineers"
# model llama 3.1
print("llama 3.1")
print(tokenizer.encode(text))
print("PHI3")
# model phi3
print(phi3_tokenizer.encode(text))
print("QWEN2")
# model Qwen2
print(qwen2_tokenizer.encode(text))

llama 3.1
[128000, 40, 1097, 12304, 311, 1501, 9857, 12509, 304, 1957, 311, 856, 445, 11237, 25175]
PHI3
[306, 626, 24173, 304, 1510, 25159, 19427, 297, 3158, 304, 590, 365, 26369, 6012, 414]
QWEN2
[40, 1079, 12035, 311, 1473, 9660, 12230, 304, 1917, 311, 847, 444, 10994, 24198]
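
A short comparison of the three encodings, using only the ID lists printed above, makes the differences easier to see. Note that the Llama 3.1 list starts with its begin-of-text token (`128000`), which the other two lists do not have an equivalent of here.

```python
# Encodings of the same sentence, copied from the output above
encodings = {
    "Llama 3.1": [128000, 40, 1097, 12304, 311, 1501, 9857, 12509,
                  304, 1957, 311, 856, 445, 11237, 25175],
    "Phi-3": [306, 626, 24173, 304, 1510, 25159, 19427, 297,
              3158, 304, 590, 365, 26369, 6012, 414],
    "Qwen2": [40, 1079, 12035, 311, 1473, 9660, 12230, 304,
              1917, 311, 847, 444, 10994, 24198],
}

# Total tokens each tokenizer produced for the sentence
token_counts = {name: len(ids) for name, ids in encodings.items()}

for name, count in token_counts.items():
    print(f"{name:10s} {count} tokens, largest ID seen: {max(encodings[name])}")
```

The counts are close for plain English like this; the ID values themselves are not comparable across tokenizers.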


## Note: none of the 3 outputs below includes an assistant message. With `add_generation_prompt=True`, each prompt ends at the assistant header so that the model can write the reply.

In [17]:
print("Llama 3.1")
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("PHI3")
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print("Qwen2")
print(qwen2_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

Llama 3.1
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>


PHI3
<|system|>
You are a helpful assistant<|end|>
<|user|>
Tell a light-hearted joke for a room of Data Scientists<|end|>
<|assistant|>

Qwen2
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Tell a light-hearted joke for a room of Data Scientists<|im_end|>
<|im_start|>assistant
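
To make the Qwen2 layout above concrete, here is a hand-written sketch that reproduces it for this `messages` list. It is not how `apply_chat_template` works internally (the real templates ship with each tokenizer and handle more cases); `to_chatml` is a hypothetical helper written only to show the structure.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Format a messages list in the <|im_start|>/<|im_end|> layout shown above."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn and stop, so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"},
]

print(to_chatml(messages))
```

With `add_generation_prompt=False` the string simply ends after the last `<|im_end|>`, which is the form you would want for text that already contains the full conversation.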



# The StarCoder2 tokenizer is designed for code

In [18]:
starcoder2_tokenizer = AutoTokenizer.from_pretrained(STARCODER2_MODEL_NAME, trust_remote_code=True)
code = """
def hello_world(person):
  print("Hello", person)
"""
tokens = starcoder2_tokenizer.encode(code)
for token in tokens:
  print(f"{token}={starcoder2_tokenizer.decode(token)}")

tokenizer_config.json:   0%|          | 0.00/7.88k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/442k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/958 [00:00<?, ?B/s]

222=

610=def
17966= hello
100=_
5879=world
45=(
6427=person
731=):
353=
 
1489= print
459=("
8302=Hello
411=",
4944= person
46=)
222=

