# Token Counting in ZhipuAI GLM API

This cookbook describes how token consumption is calculated for GLM models. It should give you a clearer picture of how tokens are counted when you use the API.

## Load Tokenizer and count

There is no public Zhipu AI token counting tool, so this cookbook loads the tokenizer of the open-source model [chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b). The counts it produces are only a reference and should not be taken as the exact way the API counts tokens. For actual pricing, refer to the official documentation.

In [None]:
from transformers import AutoTokenizer

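# "THUDM/chatglm3-6b" is the Hugging Face model ID linked above; a local path to a downloaded copy also works
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, encode_special_tokens=True)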

We write Python code with the following two functions:

1. `count_encode`: converts natural-language text into tokens and counts them.
2. `decode`: takes a list of token IDs and shows how the text was split into tokens.

In [2]:
def count_encode(inputs: str = ""):
    # Encode the text into token IDs and count them
    encoded_input = tokenizer.encode(inputs)
    num_tokens = len(encoded_input)
    return encoded_input, num_tokens


def decode(inputs: list = None):
    # Map token IDs back to token strings to show how the text was split
    decode_sentence = tokenizer.convert_ids_to_tokens(inputs or [])
    return decode_sentence


text = "here is a text demo"
encoded_input, num_tokens = count_encode(text)
print(f"Count Token: {num_tokens}")
print(f"Token IDs: {encoded_input}")
decode_sentence = decode(encoded_input)
print(f"Decode sentence: {decode_sentence}", end="\t")

Count Token: 7
Token IDs: [64790, 64792, 985, 323, 260, 2254, 16948]
Decode sentence: ['[gMASK]', 'sop', '▁here', '▁is', '▁a', '▁text', '▁demo']	

## Token calculation in conversation

The code above does not reflect token usage in a real conversation. In a conversation, the model's **conversation template** and the **chat history** are tokenized as well, so the count is noticeably higher.

The next cell shows the actual number of tokens that a single message occupies in a conversation.

In [3]:
messages = []
messages.append({"role": "user", "content": "here is a text demo"})
chat_inputs = tokenizer.apply_chat_template(messages,
                                            add_generation_prompt=True,
                                            tokenize=True)
num_tokens = len(chat_inputs)
print(f"Count Token: {num_tokens}")
print(f"Token IDs: {chat_inputs}")
model_input_tokens = tokenizer.convert_ids_to_tokens(chat_inputs)
print(model_input_tokens, end="\t")

Count Token: 11
Token IDs: [64790, 64792, 64795, 265, 13, 985, 323, 260, 2254, 16948, 64796]
['[gMASK]', 'sop', '<|user|>', '▁▁', '<0x0A>', '▁here', '▁is', '▁a', '▁text', '▁demo', '<|assistant|>']	

Because of the **conversation template**, this count is 4 tokens higher than before (11 instead of 7). The extra tokens are `<|user|>`, `▁▁`, `<0x0A>` and `<|assistant|>`, which is the overhead of the template. **These tokens are also charged when the API bills your usage.**
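To make the overhead concrete, the following sketch compares the two token lists printed above and picks out the tokens that appear only in the chat-formatted input. The lists are copied from those outputs; the variable names are illustrative and not part of any API.

```python
# Token strings copied from the two outputs above
plain_tokens = ['[gMASK]', 'sop', '▁here', '▁is', '▁a', '▁text', '▁demo']
chat_tokens = ['[gMASK]', 'sop', '<|user|>', '▁▁', '<0x0A>',
               '▁here', '▁is', '▁a', '▁text', '▁demo', '<|assistant|>']

# Tokens that appear only in the chat-formatted input, i.e. the template overhead
template_tokens = [t for t in chat_tokens if t not in plain_tokens]

print(f"Template overhead: {len(chat_tokens) - len(plain_tokens)} tokens")
print(template_tokens)
```

With more than one message in the history, each message brings its own role marker, so the overhead grows with the number of turns.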

## Calculation of token consumption in actual conversations

Now we simulate how tokens are consumed in an actual conversation. First, fill in your API key.

In [4]:
import os
from zhipuai import ZhipuAI

os.environ["ZHIPUAI_API_KEY"] = "your api key"

Next, we use a conversation with history to see how quickly token consumption grows as the history gets longer.
The Python script below lets you hold a continuous conversation with the model. It simulates a simple command-line chat and, on every turn, prints the number of input tokens and the number of output tokens.

Note that the input consumption charged by the API covers the user's current input plus the entire conversation history, so you will see the input token count grow steadily as the conversation goes on.
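The growth can be sketched without calling the API at all. The helper below is a hypothetical illustration, not part of the ZhipuAI SDK: it takes made-up per-turn token counts and returns the prompt size sent on each request, assuming the full history is resent every time and ignoring template overhead.

```python
def prompt_tokens_per_turn(turns):
    """Return the prompt size of each request.

    turns: list of (user_tokens, reply_tokens) pairs, one per turn.
    Assumes the whole history is resent with every request.
    """
    sizes = []
    history_tokens = 0
    for user_tokens, reply_tokens in turns:
        prompt = history_tokens + user_tokens  # history plus the new user message
        sizes.append(prompt)
        history_tokens = prompt + reply_tokens  # the reply joins the history
    return sizes


# Made-up numbers: every turn has a 10-token question and a 30-token answer
turns = [(10, 30), (10, 30), (10, 30)]
sizes = prompt_tokens_per_turn(turns)
print(sizes)       # [10, 50, 90]
print(sum(sizes))  # 150 input tokens in total across three requests
```

Although each turn adds the same amount of new text, the prompt size rises linearly per request, so the total input tokens over a conversation grow roughly quadratically with the number of turns.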

In [5]:
from transformers import AutoTokenizer

class ChatSession:
    def __init__(self, max_tokens=8192):
        self.client = ZhipuAI()
        self.history = []
        self.tokenizer = AutoTokenizer.from_pretrained(
            "THUDM/chatglm3-6b",  # Hugging Face model ID; a local path also works
            trust_remote_code=True,
            encode_special_tokens=True
        )
        self.max_tokens = max_tokens

    def calculate_tokens(self, inputs):
        return len(self.tokenizer.apply_chat_template(
            inputs,
            add_generation_prompt=True,
            tokenize=True)
        )

    def chat(self):
        while True:
            user_input = input("You: ")
            if user_input.lower() in ["exit", "quit"]:
                break
            self.history.append({"role": "user", "content": user_input})
            print(f"Human: {user_input}")
            user_tokens = self.calculate_tokens(self.history)
            print("======================")
            print(f"Human message with history tokens: {user_tokens}")

            response = self.client.chat.completions.create(
                model="glm-4",
                messages=self.history,
                top_p=0.7,
                temperature=0.9,
                stream=False,
                max_tokens=self.max_tokens,
            )

            reply = response.choices[0].message.content
            print(f"AI: {reply}")
            self.history.append({"role": "assistant", "content": reply})
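            # Counts the latest user/assistant pair with the chat template applied,
            # so the figure is larger than the reply text alone.
            reply_tokens = self.calculate_tokens(self.history[-2:])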
            print("======================")
            print(f"AI response tokens: {reply_tokens}", end="\n**********************\n")


Start a conversation! It should make clear why it matters to keep the history short and to clear it regularly.
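One possible way to keep the history short is to drop the oldest messages once a token budget is exceeded. The sketch below is only an illustration under stated assumptions: `trim_history` and `rough_count` are hypothetical helpers, and `rough_count` simply counts whitespace-separated words as a stand-in for a real tokenizer. In the session above, a function like `ChatSession.calculate_tokens` could be passed instead.

```python
def trim_history(history, count_tokens, budget):
    """Drop the oldest messages until the history fits the token budget.

    history: list of {"role": ..., "content": ...} dicts
    count_tokens: callable that returns the token count of a message list
    budget: maximum number of tokens to keep
    """
    trimmed = list(history)  # work on a copy, leave the caller's list intact
    while len(trimmed) > 1 and count_tokens(trimmed) > budget:
        trimmed.pop(0)
    # Design choice: keep the trimmed history starting with a user turn
    while len(trimmed) > 1 and trimmed[0]["role"] != "user":
        trimmed.pop(0)
    return trimmed


def rough_count(messages):
    # Stand-in counter (assumption): one token per whitespace-separated word
    return sum(len(m["content"].split()) for m in messages)


history = [
    {"role": "user", "content": "hello what a nice day"},
    {"role": "assistant", "content": "hello indeed it is"},
    {"role": "user", "content": "tell me a joke"},
]
trimmed = trim_history(history, rough_count, budget=8)
print(trimmed)
```

Dropping whole messages is the simplest policy. It loses context from early turns, so a real application might prefer to summarize old turns instead.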

In [6]:
chat_session = ChatSession()
chat_session.chat()

Human: hello! what a nice day!
Human message with history tokens: 13
AI: Hello! Indeed, it looks like a beautiful day! Is there anything I can help you with today?
AI response tokens: 37
**********************
Human: Are you good at joke?
Human message with history tokens: 46
AI: I can certainly share a joke or two! Here's a light-hearted one for you:

Why don't scientists trust atoms?

Because they make up everything!

Feel free to share a joke with me too, or let me know if you'd like to hear another one.
AI response tokens: 80
**********************
Human: give me a example
Human message with history tokens: 121
AI: Sure! Here's another one:

Why did the scarecrow win an award?

Because he was outstanding in his field!

And here's a classic math joke:

What did the zero say to the eight?

"Nice belt!" 🙂

Let me know if you want more jokes or if there's anything else I can assist you with!
AI response tokens: 95
**********************
Human: tell me some joke about school or universi