---
title: "A prompt in Japanese uses 77% more tokens than in English"
date: 2025-06-27
description-meta: "Counting tokens in different languages, and their implications for cost."
categories:
  - til
  - openai 
  - tiktoken 
---

OpenAI's documentation notes that, for common English text, [1 token corresponds to roughly 4 characters](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them).

I was curious how this holds up in other languages. So I took a small section of Paul Graham's [How to Do Great Work](https://www.paulgraham.com/greatwork.html), translated it into eight other languages (Spanish, French, German, Japanese, Chinese, Hindi, Russian, and Portuguese), and counted the tokens in each version, including the English original.

Here's the code:

In [10]:
import tiktoken

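def read_text(file_path):
    # Decode explicitly as UTF-8 (assumes the files are saved as UTF-8) so that
    # non-Latin text reads the same regardless of the platform's default encoding.
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()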

text_en = read_text("../_extras/counting-tokens/en.md")
text_es = read_text("../_extras/counting-tokens/es.md")
text_fr = read_text("../_extras/counting-tokens/fr.md")
text_de = read_text("../_extras/counting-tokens/de.md")
text_jp = read_text("../_extras/counting-tokens/jp.md")
text_zh = read_text("../_extras/counting-tokens/zh.md")
text_hi = read_text("../_extras/counting-tokens/hi.md")
text_ru = read_text("../_extras/counting-tokens/ru.md")
text_pt = read_text("../_extras/counting-tokens/pt.md")

# Look up the encoding once instead of on every call.
encoding = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text):
    return len(encoding.encode(text))

chars_count = {
    "en": len(text_en),
    "es": len(text_es),
    "fr": len(text_fr),
    "de": len(text_de),
    "jp": len(text_jp),
    "zh": len(text_zh),
    "hi": len(text_hi),
    "ru": len(text_ru),
    "pt": len(text_pt),
}

tokens_count = {
    "en": count_tokens(text_en),
    "es": count_tokens(text_es),
    "fr": count_tokens(text_fr),
    "de": count_tokens(text_de),
    "jp": count_tokens(text_jp),
    "zh": count_tokens(text_zh),
    "hi": count_tokens(text_hi),
    "ru": count_tokens(text_ru),
    "pt": count_tokens(text_pt),
}

This reads each text from its file and uses tiktoken to count the tokens, using the encoding for `gpt-4o`. I also counted the number of characters in each text.
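One caveat about the character counts: Python's `len()` on a `str` counts Unicode code points, not bytes, while tiktoken's encodings work on the UTF-8 bytes of the text. A minimal sketch of the difference, using two short sample strings that are my own illustration and not taken from the translated files (`char_and_byte_counts` is a hypothetical helper):

```python
# Illustrative samples only; these strings are not from the translated files.
samples = {
    "en": "hello",
    "jp": "こんにちは",
}

def char_and_byte_counts(text):
    """Return (Unicode code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

for lang, text in samples.items():
    chars, utf8_bytes = char_and_byte_counts(text)
    print(f"{lang}: {chars} chars, {utf8_bytes} UTF-8 bytes")
```

Both samples are 5 characters long, but the Japanese one takes 15 bytes in UTF-8 against 5 for the English one, so a per-character comparison across scripts is not a like-for-like measure of how much data the tokenizer sees.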

Then I calculated the ratio of characters to tokens for each language.

In [11]:
for lang in ["en", "es", "fr", "de", "jp", "zh", "hi", "ru", "pt"]:
    chars = chars_count[lang]
    tokens = tokens_count[lang]
    print(f"{lang}: {chars / tokens}, {chars} chars, {tokens} tokens")

en: 4.752314814814815, 2053 chars, 432 tokens
es: 4.5602409638554215, 2271 chars, 498 tokens
fr: 4.692844677137871, 2689 chars, 573 tokens
de: 4.4586330935251794, 2479 chars, 556 tokens
jp: 1.409387222946545, 1081 chars, 767 tokens
zh: 1.3314500941619585, 707 chars, 531 tokens
hi: 3.5104, 2194 chars, 625 tokens
ru: 4.019434628975265, 2275 chars, 566 tokens
pt: 4.631578947368421, 2200 chars, 475 tokens
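To see how far the 4-characters-per-token rule of thumb drifts, the counts printed above can be compared against a naive `chars / 4` estimate. This is a rough sketch: the numbers are copied from the output above, and `estimate_tokens` is a hypothetical helper, not part of tiktoken.

```python
# (chars, tokens) pairs copied from the output above.
counts = {
    "en": (2053, 432),
    "es": (2271, 498),
    "fr": (2689, 573),
    "de": (2479, 556),
    "jp": (1081, 767),
    "zh": (707, 531),
    "hi": (2194, 625),
    "ru": (2275, 566),
    "pt": (2200, 475),
}

def estimate_tokens(chars, chars_per_token=4):
    """Naive token estimate from the rule of thumb (hypothetical helper)."""
    return round(chars / chars_per_token)

for lang, (chars, tokens) in counts.items():
    estimate = estimate_tokens(chars)
    # Positive error: the rule of thumb overestimates; negative: it underestimates.
    error = (estimate - tokens) / tokens
    print(f"{lang}: estimated {estimate}, actual {tokens}, error {error:+.0%}")
```

For this sample, the estimate overshoots the measured count for English and falls far short for Japanese and Chinese.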


Here are some highlights from the results:

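- English packs the most text into each token, at 4.75 characters per token. It also needs the fewest tokens overall (432).
- Mandarin Chinese has the lowest ratio, at 1.33 characters per token. That does not make it the most expensive, though: the Chinese text is much shorter (707 characters) and needs 531 tokens, fewer than French, German, Hindi, or Russian.
- The same prompt in Japanese uses 77% more tokens than in English (767 vs. 432), the highest count of the nine languages.
- Languages written in the Latin script (English, Spanish, French, German, Portuguese) get more characters per token than those written in other scripts (Japanese, Chinese, Hindi, Russian). Among the non-Latin scripts, Russian does best, at 4.02 characters per token.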
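
The Japanese figure comes from dividing each language's token count by the English one. Here is a small sketch of that arithmetic, using the token counts from the output above. The price per million tokens is a made-up placeholder for illustration, not a real OpenAI rate, and `extra_tokens_vs` is a hypothetical helper.

```python
# Token counts copied from the output above.
tokens_count = {
    "en": 432, "es": 498, "fr": 573, "de": 556, "jp": 767,
    "zh": 531, "hi": 625, "ru": 566, "pt": 475,
}

# Hypothetical price, for illustration only; check current pricing before relying on it.
PRICE_PER_MILLION_TOKENS_USD = 1.00

def extra_tokens_vs(baseline, tokens):
    """Fraction of additional tokens relative to the baseline count."""
    return tokens / baseline - 1

baseline = tokens_count["en"]
# Sort from fewest to most tokens so the cheapest language comes first.
for lang, tokens in sorted(tokens_count.items(), key=lambda item: item[1]):
    overhead = extra_tokens_vs(baseline, tokens)
    cost = tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD
    print(f"{lang}: {tokens} tokens, {overhead:+.1%} vs en, ${cost:.6f} per prompt")
```

Because input is billed per token, the relative overhead carries over directly to cost whatever the actual price is.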