# Tokenize text using the LumiOpen/Poro-34B tokenizer

The creators of the [Poro-34B model](https://huggingface.co/LumiOpen/Poro-34B) trained a new tokenizer for the model. Specifically, they trained a custom byte-level [BPE tokenizer](https://huggingface.co/learn/nlp-course/chapter6/5) to handle multilingual text (Finnish and English) and code efficiently. This is described in the article ["Poro 34B and the Blessing of Multilinguality"](https://arxiv.org/pdf/2404.01856).
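
To give a feel for what "byte-level BPE" means, here is a minimal toy sketch, not the actual Poro-34B training procedure. It starts from the UTF-8 bytes of a string and repeatedly merges the most frequent adjacent pair into a new token. The function names `train_byte_bpe` and `merge_pair` and the sample text are made up for illustration; real tokenizers add pre-tokenization, much larger corpora, and many thousands of merges.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) pair in seq with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_byte_bpe(text, num_merges):
    """Learn up to num_merges merge rules from the UTF-8 bytes of text (toy version)."""
    # Start from single bytes, so any string can be represented.
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            # No pair repeats any more, so further merges would not compress anything.
            break
        merges.append((left, right))
        seq = merge_pair(seq, left, right)
    return merges, seq


# Hypothetical sample text; any string works.
sample = "talo talossa talosta"
merges, tokens = train_byte_bpe(sample, num_merges=10)
print("Learned merges:", merges)
print("Tokens:", tokens)
# Joining the tokens always gives back the original bytes.
print("Round trip OK:", b"".join(tokens).decode("utf-8") == sample)
```

Because the starting alphabet is the 256 possible byte values, characters such as the Finnish "ä" (two bytes in UTF-8) never fall outside the vocabulary; frequent byte sequences simply end up merged into longer tokens.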

Let's demonstrate how text is chunked into pieces called tokens and how these tokens are represented as numerical IDs in the token vocabulary.

Here’s how the process works:
* Text input: We start with a raw input string (e.g., a sentence).
* Tokenization: The tokenizer will break the input into smaller units called tokens.
* Numerical IDs: These tokens are mapped to corresponding numerical IDs from the tokenizer’s vocabulary.
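
The three steps above can be sketched with a tiny, made-up vocabulary before turning to the real tokenizer. Everything here is hypothetical: `TOY_VOCAB`, its IDs, and the helper `toy_tokenize` are for illustration only, and the splitting rule is a simple greedy longest match rather than BPE.

```python
# Hypothetical toy vocabulary; the real Poro-34B vocabulary is far larger
# and assigns different IDs.
TOY_VOCAB = {
    "Hello": 0, " world": 1, "!": 2,
    "H": 3, "e": 4, "l": 5, "o": 6, " ": 7, "w": 8, "r": 9, "d": 10,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_tokenize(text):
    """Split text by greedy longest match against TOY_VOCAB (a simplification, not BPE)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token in the toy vocabulary covers {text[i]!r}")
    return tokens


text = "Hello world!"                           # 1. text input
tokens = toy_tokenize(text)                     # 2. tokenization
ids = [TOY_VOCAB[t] for t in tokens]            # 3. numerical IDs
decoded = "".join(ID_TO_TOKEN[i] for i in ids)  # IDs mapped back to text

print("Tokens:", tokens)    # ['Hello', ' world', '!']
print("Token IDs:", ids)    # [0, 1, 2]
print("Decoded:", decoded)  # Hello world!
```

The cells below perform the same steps with the actual Poro-34B tokenizer, where the vocabulary and the splitting rules come from the trained model files.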

In [None]:
from transformers import AutoTokenizer

# Load the Poro-34B tokenizer
tokenizer = AutoTokenizer.from_pretrained("LumiOpen/Poro-34B")

In [None]:
text_1 = "Hello world!"

# Tokenize the text and get token IDs (numerical representation)
token_ids1 = tokenizer(text_1)
print("Token IDs:", token_ids1['input_ids'])

# Get tokens (subwords) from the text
tokens1 = tokenizer.tokenize(text_1)
print("Tokens:", tokens1)

In [None]:
text_2 = "Large Language Models are AI systems trained on vast amounts of data."

# Tokenize the text and get token IDs (numerical representation)
token_ids2 = tokenizer(text_2)
print("Token IDs:", token_ids2['input_ids'])

# Get tokens (subwords) from the text
tokens2 = tokenizer.tokenize(text_2)
print("Tokens:", tokens2)

In [None]:
# Decode token IDs back into human-readable form (decoded text)
decoded_text = tokenizer.decode(token_ids2['input_ids'])
print("Decoded text:", decoded_text)
