Usually the choice of LLM goes hand-in-hand with the choice of tokenizer. By default llama-index uses the `cl100k_base` encoding from `tiktoken`, which is the one used by the default OpenAI model `gpt-3.5-turbo`. Why would llama-index need a tokenizer at all? After all, LLM APIs accept plain text as input, not tokens. According to the llama-index [documentation](http://127.0.0.1:8000/module_guides/models/llms.html#a-note-on-tokenization), it uses the tokenizer for token counting. If that were all there is to it, I would not really care, since I can live without the most accurate token count for each of my API calls. However, the documentation also has this gem of a sentence:

> If you change the LLM, you may need to update this tokenizer to ensure accurate token counts, chunking, and prompting.

Does it mess around with the prompts and doc chunks before it sends them to the LLM? Who knows! In any case, if I change the LLM, I should run my test cases both with and without that model's own tokenizer.

The tokenizer needs to be a callable that takes in a single string and returns a list of ints. Let's look at the default tokenizer:

In [1]:
import tiktoken

In [2]:
tokenize = tiktoken.encoding_for_model("gpt-3.5-turbo").encode

In [3]:
toks = tokenize("I love to program.")
print(len(toks), type(toks[0]))
print(toks)

5 <class 'int'>
[40, 3021, 311, 2068, 13]
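
That contract is loose enough that even a toy tokenizer qualifies. Here is a minimal sketch, using only the standard library, of how I might sanity-check a candidate before handing it over. Both `byte_tokenize` and `satisfies_contract` are hypothetical names I made up for illustration; neither is part of llama-index or `tiktoken`.

```python
from typing import Callable, List


def byte_tokenize(text: str) -> List[int]:
    """Toy tokenizer (hypothetical): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def satisfies_contract(tokenizer: Callable[[str], List[int]]) -> bool:
    """Return True if `tokenizer` maps a single string to a list of ints."""
    try:
        toks = tokenizer("I love to program.")
    except Exception:
        return False
    return isinstance(toks, list) and all(isinstance(t, int) for t in toks)


print(satisfies_contract(byte_tokenize))  # True
print(satisfies_contract(str.split))      # False, it returns a list of str
```

A check like this only verifies the shape of the output. It says nothing about whether the token counts match what the LLM itself would see.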


I can use any other tokenizer, e.g., one from HuggingFace:

In [4]:
from transformers import AutoTokenizer

In [5]:
tokenize = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta").encode

In [6]:
tokenize("I love to write code.")

[1, 315, 2016, 298, 3324, 2696, 28723]

I can make any tokenizer the default by calling `llama_index.set_global_tokenizer(tokenize)`.
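
Coming back to the "chunking" part of that documentation quote: I don't know how llama-index actually splits documents, but the rough sketch below shows why a token-based splitter would be sensitive to the tokenizer. It is a greedy word packer I wrote for illustration, not llama-index's implementation, and `chunk_by_token_budget`, `word_tokenize` and `char_tokenize` are all hypothetical names standing in for real tokenizers.

```python
from typing import Callable, List


def chunk_by_token_budget(
    text: str, tokenizer: Callable[[str], List[int]], max_tokens: int
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    `max_tokens` tokens, as counted by `tokenizer`. A single word that
    exceeds the budget on its own still becomes its own chunk."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(tokenizer(candidate)) > max_tokens:
            # Adding this word would blow the budget, so close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def word_tokenize(text: str) -> List[int]:
    """Toy tokenizer (hypothetical): one token per word."""
    return [len(w) for w in text.split()]


def char_tokenize(text: str) -> List[int]:
    """Toy tokenizer (hypothetical): one token per character."""
    return [ord(c) for c in text]


text = "I love to write code."
print(chunk_by_token_budget(text, word_tokenize, max_tokens=8))
# ['I love to write code.']
print(chunk_by_token_budget(text, char_tokenize, max_tokens=8))
# ['I love', 'to write', 'code.']
```

Same text, same budget, different chunks. If something like this happens inside the library, a mismatched tokenizer could change what ends up in each chunk, which is one more reason to test with the model's own tokenizer.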