# Token Consumption Comparison: JSON vs TOON Format for LLMs

This notebook compares token counts for the same sample payload serialized as standard JSON and as a custom, flat TOON format. The goal is to show how the choice of serialization format affects token usage when sending data to Large Language Models (LLMs).

**Key Steps:**
* Generate representative sample data
* Serialize the data to JSON and TOON formats
* Count tokens in each format using whitespace splitting (a rough proxy for LLM tokenization, not a real tokenizer)
* Summarize the results and implications for LLM usage

---

In [0]:
# Step 1: Generate sample data
# This dictionary simulates a typical payload that might be sent to an LLM for processing.
# It includes nested fields and lists to represent realistic complexity.
sample_data = {
    "user": {
        "id": 123,
        "name": "Alice",
        "roles": ["admin", "user"],
        "active": True
    },
    "metrics": {
        "score": 98.5,
        "rank": 1,
        "tags": ["top", "verified"]
    }
}
sample_data

{'user': {'id': 123,
  'name': 'Alice',
  'roles': ['admin', 'user'],
  'active': True},
 'metrics': {'score': 98.5, 'rank': 1, 'tags': ['top', 'verified']}}

In [0]:
# Step 2: Serialize sample data to JSON format
# JSON is a widely used serialization format for APIs and LLMs.
# We use the standard json library to convert the dictionary to a JSON string.
import json
json_str = json.dumps(sample_data)
json_str

'{"user": {"id": 123, "name": "Alice", "roles": ["admin", "user"], "active": true}, "metrics": {"score": 98.5, "rank": 1, "tags": ["top", "verified"]}}'
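
The spaces in the string above come from the `json.dumps` defaults, which separate items with `", "` and keys from values with `": "`. As a minimal sketch of how much that matters for a whitespace-based count, the snippet below re-serializes the same data with the standard `separators` parameter. The variable names `default_json` and `compact_json` are illustrative only.

```python
import json

sample_data = {
    "user": {"id": 123, "name": "Alice", "roles": ["admin", "user"], "active": True},
    "metrics": {"score": 98.5, "rank": 1, "tags": ["top", "verified"]},
}

# Default separators are (", ", ": "), so every comma and colon is followed by a space.
default_json = json.dumps(sample_data)

# Compact separators remove all optional whitespace from the output.
compact_json = json.dumps(sample_data, separators=(",", ":"))

print(len(default_json.split()))                # 18
print(len(compact_json.split()))                # 1
print(json.loads(compact_json) == sample_data)  # True
```

Both strings decode to the same data, yet the whitespace-based count falls from 18 to 1, so this count is sensitive to formatting rather than to content.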

In [0]:
# Step 3: Serialize sample data to TOON format
# TOON is a custom, flat key-value format for demonstration purposes.
# It flattens nested dictionaries and lists into a compact string representation.
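def to_toon(data, parent_key=''):
    """Flatten a nested dict into 'dotted.key=value;' lines, one per leaf."""
    items = []
    for k, v in data.items():
        # Build the dotted path, e.g. "user" + "id" -> "user.id"
        key = f"{parent_key}.{k}" if parent_key else k
        if isinstance(v, dict):
            # Recurse into nested dictionaries, carrying the path prefix
            items.append(to_toon(v, key))
        elif isinstance(v, list):
            # Lists are rendered inline; elements are assumed to be scalars
            items.append(f"{key}=[{', '.join(map(str, v))}];")
        else:
            # Scalars use Python's str(), so True is written as "True"
            items.append(f"{key}={v};")
    return '\n'.join(items)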

toon_str = to_toon(sample_data)
toon_str

'user.id=123;\nuser.name=Alice;\nuser.roles=[admin, user];\nuser.active=True;\nmetrics.score=98.5;\nmetrics.rank=1;\nmetrics.tags=[top, verified];'
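
One property of this flat format worth seeing concretely is what a reader gets back from it. The sketch below is a hypothetical inverse, `from_toon` (not part of the original notebook), and it assumes values never contain `=`, `;`, `,`, or newlines. It returns a flat dictionary keyed by the dotted paths.

```python
def from_toon(text):
    """Parse 'dotted.key=value;' lines into a flat dict of strings and lists of strings."""
    result = {}
    for line in text.splitlines():
        line = line.strip().rstrip(";")  # drop the trailing ';' terminator
        if not line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        if value.startswith("[") and value.endswith("]"):
            inner = value[1:-1]
            result[key] = [item.strip() for item in inner.split(",")] if inner else []
        else:
            result[key] = value
    return result

toon_str = (
    "user.id=123;\nuser.name=Alice;\nuser.roles=[admin, user];\n"
    "user.active=True;\nmetrics.score=98.5;\nmetrics.rank=1;\n"
    "metrics.tags=[top, verified];"
)

parsed = from_toon(toon_str)
print(parsed["user.id"])     # 123 (as the string '123')
print(parsed["user.roles"])  # ['admin', 'user']
```

Every scalar comes back as a string (`'123'`, `'True'`), so unlike JSON this format does not record whether a value was a number, a boolean, or text.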

In [0]:
# Step 4: Count tokens in both formats
# Whitespace splitting is used here as a crude stand-in for LLM tokenization.
# Real tokenizers also split on punctuation and subwords, so their counts will differ.
# For production use, count with the tokenizer specific to your LLM (e.g., OpenAI tiktoken).
def count_tokens(text):
    return len(text.split())

json_token_count = count_tokens(json_str)
toon_token_count = count_tokens(toon_str)

print(f"JSON token count: {json_token_count}")
print(f"TOON token count: {toon_token_count}")

JSON token count: 18
TOON token count: 9
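
Whitespace splitting counts `"name":` or `user.id=123;` as a single token, whereas LLM tokenizers typically separate punctuation from words. As a slightly finer proxy, still not a real tokenizer, the sketch below counts each run of word characters and each punctuation mark separately. The helper name `count_tokens_regex` and the regular expression are assumptions chosen for illustration.

```python
import json
import re

def count_tokens_regex(text):
    """Count runs of word characters and individual punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))

sample_data = {
    "user": {"id": 123, "name": "Alice", "roles": ["admin", "user"], "active": True},
    "metrics": {"score": 98.5, "rank": 1, "tags": ["top", "verified"]},
}
json_str = json.dumps(sample_data)
toon_str = (
    "user.id=123;\nuser.name=Alice;\nuser.roles=[admin, user];\n"
    "user.active=True;\nmetrics.score=98.5;\nmetrics.rank=1;\n"
    "metrics.tags=[top, verified];"
)

json_regex_count = count_tokens_regex(json_str)
toon_regex_count = count_tokens_regex(toon_str)
print(f"JSON regex-proxy count: {json_regex_count}")  # 75
print(f"TOON regex-proxy count: {toon_regex_count}")  # 52
```

Under this proxy TOON is still shorter, but by roughly 31% rather than the 50% suggested by whitespace splitting, which shows how much the conclusion depends on the counting method.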


## Results: Token Consumption Comparison

* The sample data was serialized to both JSON and TOON formats.
* Token count (using whitespace splitting):
  * **JSON format:** 18 tokens
  * **TOON format:** 9 tokens
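* The TOON format produced half as many whitespace-separated tokens for this sample. Under this counting method the gap comes mainly from spacing: `json.dumps` inserts a space after every colon and comma, while TOON writes each scalar field as one unbroken `key=value;` chunk.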
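
The relative saving follows directly from the two counts above. A small helper, here called `token_savings` (a hypothetical name), makes the arithmetic explicit:

```python
def token_savings(baseline_count, candidate_count):
    """Return the fractional token reduction of the candidate relative to the baseline."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return (baseline_count - candidate_count) / baseline_count

json_token_count = 18  # whitespace-split count for the JSON string
toon_token_count = 9   # whitespace-split count for the TOON string

savings = token_savings(json_token_count, toon_token_count)
print(f"TOON saves {savings:.0%} of whitespace-split tokens versus JSON")
# TOON saves 50% of whitespace-split tokens versus JSON
```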

### Implications for LLM Usage
* Fewer tokens can lower LLM API costs and latency, especially for large payloads or frequent API calls.
* Actual token counts depend on the LLM's tokenizer and will generally differ from the whitespace-based counts used here for simplicity.
* Custom formats like TOON may be beneficial for specific use cases, but always validate with your target LLM's tokenizer.
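
To validate against a real tokenizer, it helps to make the counting function pluggable. The sketch below assumes a hypothetical helper, `compare_formats`, that accepts any `encode` callable mapping a string to a sequence of tokens. With the third-party `tiktoken` package installed, `tiktoken.get_encoding("cl100k_base").encode` is one such callable, though which encoding matches your target model is something to confirm separately. `str.split` is used here only so the example runs without extra dependencies.

```python
from typing import Callable, Dict, Sequence

def compare_formats(payloads: Dict[str, str], encode: Callable[[str], Sequence]) -> Dict[str, int]:
    """Count tokens for each named payload using the supplied encode function."""
    return {name: len(encode(text)) for name, text in payloads.items()}

payloads = {
    "json": '{"user": {"id": 123, "name": "Alice", "roles": ["admin", "user"], '
            '"active": true}, "metrics": {"score": 98.5, "rank": 1, '
            '"tags": ["top", "verified"]}}',
    "toon": "user.id=123;\nuser.name=Alice;\nuser.roles=[admin, user];\n"
            "user.active=True;\nmetrics.score=98.5;\nmetrics.rank=1;\n"
            "metrics.tags=[top, verified];",
}

# str.split stands in for a real tokenizer's encode method.
counts = compare_formats(payloads, str.split)
print(counts)  # {'json': 18, 'toon': 9}
```

Swapping in a different `encode` callable changes only the last call, so the same payloads can be compared across several tokenizers.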

---
**Feel free to adapt this notebook for your own data and LLM workflows.**