# Tokenize text using LumiOpen/Llama-Poro-2-70B-Instruct's tokenizer

[`LumiOpen/Llama-Poro-2-70B-Instruct`](https://huggingface.co/LumiOpen/Llama-Poro-2-70B-Instruct) is based on the Llama 3.1 70B architecture and has been fine-tuned for instruction following and conversational AI applications. The model supports conversations in both English and Finnish. It uses the same tokenizer as the Llama 3.1 70B model it is based on.

Let's demonstrate how text is chunked into pieces called tokens and how these tokens are represented as numerical IDs in the token vocabulary.

Here’s how the process works:
* Text input: We start with a raw input string (e.g., a sentence).
* Tokenization: The tokenizer will break the input into smaller units called tokens.
* Numerical IDs: These tokens are mapped to corresponding numerical IDs from the tokenizer’s vocabulary.
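
Before loading the real tokenizer, the three steps above can be sketched with a toy example. Everything here is an assumption made for illustration: `toy_vocab` is a hand-made vocabulary, and the greedy longest-match split in `toy_tokenize` is a stand-in, not the algorithm the Llama tokenizer uses.

```python
# Toy illustration only: a hand-made vocabulary, not the real tokenizer's.
toy_vocab = {
    "Hello": 0, " world": 1, "!": 2,
    "H": 3, "e": 4, "l": 5, "o": 6, " ": 7, "w": 8, "r": 9, "d": 10,
}
toy_id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def toy_tokenize(text):
    """Split text into vocabulary pieces using greedy longest-match."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in toy_vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"No vocabulary entry covers {text[pos]!r}")
    return tokens


def toy_encode(text):
    """Map each token to its numerical ID in the toy vocabulary."""
    return [toy_vocab[token] for token in toy_tokenize(text)]


def toy_decode(ids):
    """Map IDs back to tokens and join them into text."""
    return "".join(toy_id_to_token[i] for i in ids)


toy_tokens = toy_tokenize("Hello world!")
toy_ids = toy_encode("Hello world!")
print("Tokens:", toy_tokens)
print("Token IDs:", toy_ids)
print("Decoded text:", toy_decode(toy_ids))
```

A word that is not in the vocabulary as a whole, such as `"Hold"`, falls back to smaller pieces, which is the same idea that lets subword tokenizers handle unseen words.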

In [None]:
from transformers import AutoTokenizer

# Load the Llama-Poro-2-70B-Instruct tokenizer
tokenizer = AutoTokenizer.from_pretrained("LumiOpen/Llama-Poro-2-70B-Instruct")

In [None]:
text_1 = "Hello world!"

# Tokenize the text and get token IDs (numerical representation)
token_ids1 = tokenizer(text_1)
print("Token IDs:", token_ids1['input_ids'])

# Get tokens (subwords) from the text
tokens1 = tokenizer.tokenize(text_1)
print("Tokens:", tokens1)

In [None]:
text_2 = "Large Language Models are AI systems trained on vast amounts of data."

# Tokenize the text and get token IDs (numerical representation)
token_ids2 = tokenizer(text_2)
print("Token IDs:", token_ids2['input_ids'])

# Get tokens (subwords) from the text
tokens2 = tokenizer.tokenize(text_2)
print("Tokens:", tokens2)   

In [None]:
# Decode token IDs back into human-readable form (decoded text)
decoded_text = tokenizer.decode(token_ids2['input_ids'])
print("Decoded text:", decoded_text)


# Test the older Poro-34B-chat model's tokenizer

The creators of [LumiOpen/Poro-34B-chat](https://huggingface.co/LumiOpen/Poro-34B-chat) built a new tokenizer for training the base model. Specifically, they trained a custom byte-level [BPE tokenizer](https://huggingface.co/learn/nlp-course/chapter6/5) to handle multilingual text (Finnish and English) and code efficiently. This is described in the paper ["Poro 34B and the Blessing of Multilinguality"](https://arxiv.org/pdf/2404.01856).
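
To give a feel for what training a BPE tokenizer involves, here is a minimal sketch of the merge-learning loop. It is not the Poro authors' training code: it works on characters rather than bytes, the function name `learn_bpe_merges` and the tiny word list are made up for illustration, and ties between equally frequent pairs are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Each distinct word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; break ties alphabetically (arbitrary).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
bpe_merges, bpe_corpus = learn_bpe_merges(toy_words, num_merges=3)
print("Learned merges:", bpe_merges)
print("Segmented words:", list(bpe_corpus))
```

Frequent character sequences end up as single symbols, so a tokenizer trained on Finnish, English, and code can learn merges that suit all three.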

In [None]:
# Load the Poro-34B-chat tokenizer
tokenizer_older = AutoTokenizer.from_pretrained("LumiOpen/Poro-34B-chat")

In [None]:
text_3 = "Hello world!"

# Tokenize the text and get token IDs (numerical representation)
token_ids3 = tokenizer_older(text_3)
print("Token IDs:", token_ids3['input_ids'])

# Get tokens (subwords) from the text
tokens3 = tokenizer_older.tokenize(text_3)
print("Tokens:", tokens3)   

In [None]:
text_4 = "Large Language Models are AI systems trained on vast amounts of data."

# Tokenize the text and get token IDs (numerical representation)
token_ids4 = tokenizer_older(text_4)
print("Token IDs:", token_ids4['input_ids'])

# Get tokens (subwords) from the text
tokens4 = tokenizer_older.tokenize(text_4)
print("Tokens:", tokens4)   

In [None]:
# Decode token IDs back into human-readable form (decoded text)
decoded_text = tokenizer_older.decode(token_ids4['input_ids'])
print("Decoded text:", decoded_text)


**See what happens if you use the wrong tokenizer to decode text...**

In [None]:
# Use the Llama-Poro-2-70B-Instruct tokenizer
decoded_text = tokenizer.decode(token_ids4['input_ids'])  # these input IDs came from the Poro-34B-chat model's tokenizer
print("Decoded text:", decoded_text)

In [None]:
# Use the Poro-34B-chat tokenizer
decoded_text = tokenizer_older.decode(token_ids2['input_ids'])  # these input IDs came from the Llama-Poro-2-70B-Instruct model's tokenizer
print("Decoded text:", decoded_text)
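
Why mismatched tokenizers produce garbage can be seen with two toy vocabularies. This is a sketch under simple assumptions: `vocab_a` and `vocab_b` are invented lists in which a token's ID is just its position, and they contain the same tokens in a different order. Real tokenizers also differ in which tokens exist at all, so real output looks even stranger.

```python
# Two invented vocabularies: same tokens, different ID assignments.
vocab_a = ["Hello", " world", "!", " tokens"]
vocab_b = ["!", " tokens", "Hello", " world"]


def encode_with(tokens, vocab):
    """Map tokens to IDs, where an ID is the token's position in the vocabulary."""
    return [vocab.index(token) for token in tokens]


def decode_with(ids, vocab):
    """Map IDs back to tokens using the given vocabulary and join them."""
    return "".join(vocab[i] for i in ids)


ids_from_a = encode_with(["Hello", " world", "!"], vocab_a)
matched_text = decode_with(ids_from_a, vocab_a)
mismatched_text = decode_with(ids_from_a, vocab_b)

print("Token IDs from vocabulary A:", ids_from_a)
print("Decoded with vocabulary A:", matched_text)
print("Decoded with vocabulary B:", mismatched_text)
```

The IDs are only meaningful together with the vocabulary that produced them, so a model must always be paired with its own tokenizer.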